Indian AI startup Sarvam AI has released the first open-source Hindi language model, called OpenHathi-Hi-0.1. The model is the first in a series that, according to the company, will “make contributions to the ecosystem with open models and datasets to encourage innovation in Indian language AI.”
The model is built on Meta AI’s Llama 2-7B, and a blog post by the company stated that it performs on par with GPT-3.5 for Indic languages.
The blog explained that tokenisation, a crucial part of processing text in large language models, is much more costly for Hindi than for English because very little Hindi text is available for training. The team behind the model trained it in two phases and worked to make this process cheaper.
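To make the cost gap concrete, the sketch below compares how many UTF-8 bytes each character takes in an English sentence and in a Hindi one. It is only a rough proxy, not the method described in the blog: it assumes a tokenizer that has seen little Devanagari text falls back to something close to byte-level pieces, so more bytes per character roughly means more tokens per sentence. The sample sentences and the helper name `utf8_bytes_per_char` are illustrative choices, not taken from Sarvam AI.

```python
# Illustrative sentences chosen for this sketch (not from the Sarvam AI blog).
ENGLISH = "India is a large country."
HINDI = "भारत एक विशाल देश है।"


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


if __name__ == "__main__":
    for label, sentence in (("English", ENGLISH), ("Hindi", HINDI)):
        ratio = utf8_bytes_per_char(sentence)
        print(f"{label}: {len(sentence)} characters, {ratio:.2f} bytes per character")
```

ASCII letters take one byte each, while Devanagari code points take three, so a tokenizer with few Hindi entries in its vocabulary has far more raw material to split per sentence. Actual token counts depend on the specific tokenizer and its vocabulary.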
The model was then tested on a variety of benchmarks, including standard ones such as translation and several new ones such as toxicity classification and text classification.
The base model has been made available on the Hugging Face platform so that developers can fine-tune it for specific use cases.
Co-founders Pratyush Kumar and Vivek Raghavan had previously worked with another homegrown AI venture, AI4Bharat. Sarvam AI has partnered with AI4Bharat to use its language resources and benchmarks to train OpenHathi.
Currently employing around 18 people, Sarvam AI wants to build large language models that use voice as the common interface, making them more accessible and better suited to the demands of the Indian market.
Last week, the five-month-old startup raised $41 million in Series A funding led by Lightspeed Ventures with participation from Peak XV and Khosla Ventures. The startup is also working on a range of enterprise-grade models on its full-stack generative AI platform, which it also plans to release soon.