16 new datasets in Indian languages for Artificial Intelligence and Machine Learning research

Developed by the Linguistic Data Consortium for Indian Languages at the Central Institute of Indian Languages in Mysuru, the datasets are expected to help develop new technologies, including Automatic Speech Recognition and Live Voice Translation

Published - January 09, 2024 09:45 pm IST - MYSURU

LDC-IL at the Central Institute of Indian Languages in Mysuru has developed 16 new datasets for Indian languages. The datasets cover 12 languages including Kannada, Tamil, Hindi and Malayalam; the institute also released two datasets for Chhattisgarhi. | Photo Credit: FILE PHOTO

The Linguistic Data Consortium for Indian Languages (LDC-IL) is a scheme of the Ministry of Education that develops digital corpora in Indian languages. Housed in the Central Institute of Indian Languages (CIIL), Mysuru, the LDC-IL organised its 8th Project Advisory Committee meeting here on Monday.

Chaired by Shailendra Mohan, director, CIIL, the meeting was attended by various domain experts and industry specialists. As an important outcome, LDC-IL launched 16 new datasets in Indian languages to help bolster quality research in Artificial Intelligence and Machine Learning.

The first of their kind, these datasets will help develop new technologies in Indian languages, including Automatic Speech Recognition and Live Voice Translation, and improve the quality of the results such tools produce in Indian languages, a press release from the CIIL said.

The datasets cover 12 scheduled languages: Hindi, Bengali, Tamil, Marathi, Kannada, Malayalam, Odia, Assamese, Konkani, Maithili, Urdu, and Nepali. They also include two variants of Indian English, namely the Bengali variant and the Kannada variant.

Indian English is internationally recognised as a language in its own right, the release added. It also has its own variants within India, where different mother tongues influence English and give each variant a flavour of its own, with some distinct linguistic and phonetic features.

In a first, the institute also released two datasets for Chhattisgarhi, a mother tongue usually clubbed together with Hindi. “This shows the seriousness of the government to ensure that education and technology will be bolstered for all mother tongues of India as has been recommended in the NEP-2020,” the CIIL said.

These datasets will bolster research and development in all Indian languages, and both academia and industry will benefit from them. The applications developed using these datasets will ultimately help promote these languages, according to the CIIL.

All of these datasets are available on the Data Distribution Portal of LDC-IL at https://data.ldcil.org

The Linguistic Data Consortium for Indian Languages is the largest repository of Curated Text and Speech resources in Indian languages meant for linguistic research and for research and development in Artificial Intelligence and Machine Learning. With these 16 new datasets, the portal now has a total of 57 datasets covering 21 Indian languages.

The datasets produced by the LDC-IL are the first real-world data collected from the field. They are unique in that they are not crowdsourced: they have been collected from verified sources and checked by experts in each language. Apart from training, the datasets can also act as a benchmark for testing AI and Generative AI-based technology, the release said.
