Basic version translates four Indian language pairs
It is generally understood that translating from one language to another requires an adaptive human brain, not the rule-based rigidity of a machine. Even for a human being it poses the trickiest of problems, and the most a successful translator can claim is to have brought the product closer to the original.
However, in a ‘Robotesque’ effort to infuse ‘thought’ into a gizmo, a consortium of 11 academic and research institutions across the country came together to design the ‘Sampark Machine Translation Systems for Indian Languages,’ which was launched here on Wednesday by the former President, A.P.J. Abdul Kalam.
It was the most successful among the three machine translation systems released at the World Wide Web International (W3I) Conference, the others being AnglaMT and Anuvadaksh, which translate from English to Indian languages.
Conceived to deliver translation in 18 Indian language pairs, Sampark is ready in its basic version for four among them — Punjabi to Hindi, Hindi to Punjabi, Urdu to Hindi and Telugu to Tamil.
Within a year, translation capabilities for 14 more bi-directional language pairs will also be launched. These include Tamil-Hindi, Telugu-Hindi, Hindi-Urdu, Kannada-Hindi, Punjabi-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu and Malayalam-Tamil, said Rajeev Sangal, Director of the International Institute of Information Technology, Hyderabad, which was part of the consortium. The project was executed under the Technology Development for Indian Languages (TDIL) Programme of the Department of Information Technology.
The programme is aimed at multiplying web content in Indian languages and improving Internet usage among speakers of these languages. In short, Sampark is a web application that translates content available in one Indian language into another. The developers say translation quality is better when the input text conforms to standard language. To handle the grammatical differences between the languages, Computational Paninian Grammar is used as the unifying logical framework, Professor Sangal said.
“To begin with, large chunks of data are taken, and each word is tagged with the respective part of speech to enable the machine to learn. Then, the machine is fed with data to allow it to tag the words on its own. The work is then analysed to discover conflict areas and address them,” Rahmat Yousufzai, the IIIT-H professor who spearheaded the Urdu-Hindi team, said, explaining the ‘machine learning’ process.
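The three steps Professor Yousufzai describes (learn from hand-tagged text, tag new text unaided, then review the conflicts) can be sketched in Python. This is a minimal most-frequent-tag baseline, chosen here for illustration; the article does not say which learning method Sampark actually uses. The romanised sample sentences, the tag names and the functions `train`, `tag` and `conflicts` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data (romanised, toy tag set).
# "sona" appears both as a noun (gold) and as a verb (to sleep).
TAGGED = [
    [("ram", "NOUN"), ("ghar", "NOUN"), ("jata", "VERB"), ("hai", "AUX")],
    [("sita", "NOUN"), ("sona", "NOUN"), ("kharidti", "VERB"), ("hai", "AUX")],
    [("bachcha", "NOUN"), ("sona", "VERB"), ("chahta", "VERB"), ("hai", "AUX")],
    [("ram", "NOUN"), ("sona", "VERB"), ("chahta", "VERB"), ("hai", "AUX")],
]


def train(tagged_sentences):
    """Step 1: count how often each word carries each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos in sentence:
            counts[word][pos] += 1
    return counts


def tag(counts, words, default="UNK"):
    """Step 2: tag unseen text with each word's most frequent tag."""
    result = []
    for word in words:
        if word in counts:
            result.append((word, counts[word].most_common(1)[0][0]))
        else:
            result.append((word, default))  # word never seen in training
    return result


def conflicts(counts):
    """Step 3: list words seen with more than one tag, for human review."""
    return {word: dict(c) for word, c in counts.items() if len(c) > 1}


model = train(TAGGED)
print(tag(model, ["sita", "sona", "chahta", "hai"]))
print(conflicts(model))
```

A baseline like this ignores context entirely, which is exactly why the ambiguous word surfaces in `conflicts`; a production tagger would resolve such cases using the neighbouring words.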
The machine translation has three components: understanding the meaning of the source text, looking words up in a dictionary, and transferring the sentence structure. Together they generate the output in the target language.
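A toy Python sketch shows how dictionary look-up and structure transfer can be chained before the output is generated. It assumes the analysis stage has already produced lemmas with part-of-speech tags. The lexicon entries are placeholder strings rather than real words, and the single reordering rule is invented for illustration; none of it is taken from Sampark.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma, part-of-speech tag)

# Hypothetical bilingual dictionary: source lemma -> target lemma.
# Placeholder strings stand in for real words in either language.
LEXICON = {
    "src_big": "tgt_big",
    "src_house": "tgt_house",
    "src_see": "tgt_see",
}

# Hypothetical transfer rule: the source puts ADJ before NOUN,
# while the made-up target language puts NOUN before ADJ.
SWAP_RULES = {("ADJ", "NOUN")}


def lookup(tokens: List[Token]) -> List[Token]:
    """Replace each lemma with its dictionary entry; keep unknown lemmas."""
    return [(LEXICON.get(lemma, lemma), pos) for lemma, pos in tokens]


def transfer(tokens: List[Token]) -> List[Token]:
    """Swap adjacent tokens whose tag pair matches a transfer rule."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if (out[i][1], out[i + 1][1]) in SWAP_RULES:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair just reordered
        else:
            i += 1
    return out


def generate(tokens: List[Token]) -> str:
    """Join the target lemmas into an output string."""
    return " ".join(lemma for lemma, _ in tokens)


def translate(tokens: List[Token]) -> str:
    return generate(transfer(lookup(tokens)))


analysed = [("src_see", "VERB"), ("src_big", "ADJ"), ("src_house", "NOUN")]
print(translate(analysed))
```

Keeping the stages as separate functions mirrors the component view in the text: each stage can be improved or replaced without touching the others.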
As soon as the text is fed in, the in-built Morphological Analyser begins identifying the verb in each sentence, and the Parser uses Paninian grammar rules to zero in on the kind of nouns the verb can support and arrive at the apt one. Long names, such as those of institutions (e.g. University of Hyderabad), are recognised as proper nouns through repeated collocations. All unidentified words are treated as proper nouns and transliterated. However, literature is a big no-no for translation on this system, as it cannot identify metaphors.
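Two of these ideas lend themselves to a short Python sketch: spotting a long name because the same word sequence keeps recurring, and transliterating any word the dictionary does not know. Both functions are simplified assumptions rather than Sampark's method. The collocation finder only counts repeated three-word sequences, and the transliterator relies on the parallel layout of the Devanagari and Gurmukhi Unicode blocks, which is crude because some Devanagari code points have no Gurmukhi counterpart. The sample corpus and the one-entry lexicon are hypothetical.

```python
from collections import Counter

# Hypothetical sample corpus, already split into sentences.
CORPUS = [
    "she joined University of Hyderabad in 2006",
    "the University of Hyderabad hosted the launch",
    "he teaches at University of Hyderabad",
]


def repeated_collocations(sentences, n=3, min_count=2):
    """Return n-word sequences that occur at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {gram for gram, c in counts.items() if c >= min_count}


DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
GURMUKHI_OFFSET = 0x0A00 - 0x0900  # the two blocks share one layout


def transliterate(word):
    """Crude Devanagari -> Gurmukhi mapping by code-point offset."""
    return "".join(
        chr(ord(ch) + GURMUKHI_OFFSET)
        if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END
        else ch
        for ch in word
    )


# Hypothetical one-entry Hindi -> Punjabi lexicon ("ghar", house).
LEXICON = {"घर": "ਘਰ"}


def translate_word(word, lexicon):
    """Use the dictionary when possible; otherwise transliterate."""
    return lexicon[word] if word in lexicon else transliterate(word)


print(repeated_collocations(CORPUS))
print(translate_word("घर", LEXICON), translate_word("राम", LEXICON))
```

Raw frequency is the simplest possible test for a collocation; on a larger corpus it would also pick up common phrases, so some statistical filter would be needed before treating a sequence as a name.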
“We are at present focussing only on comprehensibility and not fluency. So, there may be errors of grammar at times. We hope to bring in future improvements based on user feedback,” Professor Sangal said.
The AnglaMT system translates from English to Bengali, Malayalam, Punjabi and Urdu, while Anuvadaksh translates from English to Hindi, Bengali, Marathi, Oriya, Urdu and Tamil. The other institutions involved in the development of Sampark include IIT-Bombay, IIT-Kharagpur, C-DAC Noida, C-DAC Pune, the University of Hyderabad, Jadavpur University, Anna University-KBC Research Centre, Tamil University, IIIT-Allahabad, and IISc-Bangalore.
In all, 200 researchers worked on the project, which began in 2006. The three systems are available on www.tdil-dc.in.