IIT Madras team develops easy OCR system for nine Indian languages

The Bharati script unifies nine Indian languages

April 27, 2019 06:03 pm | Updated April 28, 2019 01:07 am IST

Transliteration: Tamil verse from Sangam literature written in Bharati script (centre) and English.


Taking a cue from European languages, several of which share the same Roman-letter-based script, Srinivasa Chakravarthy’s team at IIT Madras has, over the last decade, developed a unified script for nine Indian languages, named the Bharati script. The team has now gone a step further: it has developed a method for reading documents in Bharati script using a multilingual optical character recognition (OCR) scheme. The team has also created a finger-spelling method that can be used to generate a sign language for hearing-impaired persons. In collaboration with TCS Mumbai, the researchers have found a way for persons with hearing disability to generate signatures using this finger-spelling technique.

The scripts that have been integrated include Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Telugu, Kannada, Malayalam and Tamil. English and Urdu have not been integrated so far. Dr Chakravarthy says, “Urdu and English alphabet systems have a very different phonetic organisation. But that does not mean a mapping is not possible. It is quite possible and can be done.”

In general, optical character recognition schemes first separate (or segment) the document into text and non-text. The text is then segmented into paragraphs, sentences, words and letters. Each letter has to be recognised and mapped to a character in a standard encoding such as ASCII or Unicode. A letter has various components, such as the basic consonant, consonant modifiers and vowels.
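The segmentation step described above can be illustrated with a minimal sketch. It assumes a tiny binary page (1 for ink, 0 for background) and uses simple projection profiles, a common textbook technique, to cut the page into lines and each line into glyphs. The names `ink_runs`, `segment_lines` and `segment_glyphs` are hypothetical, and this is not the IIT Madras team's implementation.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed binary page: 1 = ink, 0 = background


def ink_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges where the profile is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ends at a blank gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(img: Image) -> List[Image]:
    """Cut the page into text lines wherever a fully blank row occurs."""
    row_profile = [sum(row) for row in img]
    return [img[a:b] for a, b in ink_runs(row_profile)]


def segment_glyphs(line: Image) -> List[Image]:
    """Cut one text line into glyphs wherever a fully blank column occurs."""
    col_profile = [sum(col) for col in zip(*line)]
    return [[row[a:b] for row in line] for a, b in ink_runs(col_profile)]


# Toy page: two glyphs on the first line, a blank row, one glyph on the second.
page = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
lines = segment_lines(page)
glyphs = [segment_glyphs(line) for line in lines]
print(len(lines), [len(g) for g in glyphs])  # prints: 2 [2, 1]
```

Each glyph found this way would then be passed to a recogniser that maps it to a character code. Real documents need more robust methods, since this sketch fails as soon as neighbouring components touch.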

The team led by Srinivasa Chakravarthy (centre, sitting) that developed the Bharati script.


Easy to read

The scripts of Indian languages pose a problem for such character recognition because the vowel and consonant-modifier components are attached to the main consonant part. This difficulty is removed in the Bharati script, which can be easily read. “In Bharati characters, these different components are segmentable by design. So OCR works quite accurately. Our OCR engines gives almost 100% accuracy even with mild noise added,” says Dr Chakravarthy.

Three-tiered structure

The script is easy to segment because Bharati characters are made up of three tiers stacked vertically. The consonant at the root of the letter is placed in the centre tier, and the modifiers are in the top and bottom tiers.
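To see why a three-tier layout helps a recogniser, consider the sketch below. It assumes, purely for illustration, that the three tiers occupy equal thirds of a glyph's height; the article does not state the actual proportions. The names `Tiers`, `split_tiers` and `has_ink` are hypothetical.

```python
from typing import List, NamedTuple

Image = List[List[int]]  # assumed binary glyph: 1 = ink, 0 = background


class Tiers(NamedTuple):
    top: Image      # modifier tier
    centre: Image   # base consonant tier
    bottom: Image   # modifier tier


def split_tiers(glyph: Image) -> Tiers:
    """Cut a glyph into three horizontal bands (assumed equal thirds)."""
    height = len(glyph)
    a, b = height // 3, 2 * height // 3
    return Tiers(glyph[:a], glyph[a:b], glyph[b:])


def has_ink(band: Image) -> bool:
    """True if any pixel in the band is set."""
    return any(any(row) for row in band)


# Toy glyph, six rows tall: a mark on top, a consonant in the centre,
# and nothing in the bottom tier.
glyph = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]
tiers = split_tiers(glyph)
print([has_ink(band) for band in tiers])  # prints: [True, True, False]
```

Because each band can be handed to its own small classifier, the consonant and its modifiers never have to be untangled from a single fused shape, which is the difficulty the article attributes to existing Indian scripts.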

In collaboration with Sunil Kopparappu of Innovation Labs, TCS, Mumbai, the team has developed a universal finger-spelling language for the nine Indian languages. They are working on a system that can help people sign documents using a finger-spelling method, and future plans include developing a new Braille system with the Bharati script.
