Digitising Tamil books is still a big challenge

Image recognition technology has reached near-perfection when it comes to scanning text in English and other European languages.

But optical character recognition (OCR) software capable of digitising printed Tamil text with high levels of accuracy is still elusive.

According to P.R. Nakkeeran, director of Tamil Virtual Academy, there are a few OCR readers for Tamil, but their accuracy varies dramatically depending on the text provided.

“In some of the old books, the printing typography used is such that OCR software constantly misreads the Tamil letters and words. Often, the digitised Tamil output is of such poor quality that it makes it economically more viable to get people to manually enter the text. This is, of course, more time-consuming,” said Mr. Nakkeeran.

One of the more successful OCR readers for Tamil is the one developed at the Medical Intelligence and Language Engineering (MILE) Lab by A.G. Ramakrishnan of the department of electrical engineering and his team. The software, however, is not available for commercial use and is provided to non-profit projects on a case-by-case basis.

Explaining the challenges in developing OCR software for Tamil, Prof. Ramakrishnan said that, unlike in English and a few other languages, every root verb in Tamil has thousands of permutations and combinations that OCR readers need to recognise.

He recommends combining OCR with a spell-check pass over the digitised text, using word-processing software such as Men Tamizh, developed by NDS Linksoft Solutions, to achieve more than 90 per cent accuracy.
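
As a rough illustration of the OCR-plus-spell-check idea described above, the sketch below corrects garbled OCR words against a word list and measures word-level accuracy before and after. It is not how Men Tamizh or the MILE Lab software works: the four-word lexicon, the function names and the use of Python's `difflib` for fuzzy matching are all assumptions made for illustration. A real Tamil spell-checker would also need morphological rules to handle the many inflected forms of each root verb.

```python
import difflib

# Hypothetical mini-lexicon for illustration only; a real spell-checker
# would use a full Tamil dictionary plus morphological analysis.
LEXICON = ["தமிழ்", "புத்தகம்", "கணினி", "மொழி"]


def correct_word(word, lexicon=LEXICON, cutoff=0.6):
    """Return the closest lexicon entry, or the word unchanged if none is close."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, lexicon=LEXICON):
    """Apply word-by-word correction to whitespace-separated OCR output."""
    return " ".join(correct_word(w, lexicon) for w in text.split())


def word_accuracy(candidate, reference):
    """Fraction of reference words reproduced exactly, compared position by position."""
    cand, ref = candidate.split(), reference.split()
    hits = sum(c == r for c, r in zip(cand, ref))
    return hits / len(ref) if ref else 1.0


if __name__ == "__main__":
    reference = " ".join(LEXICON[:2])
    # Simulate an OCR error by dropping the final character of each word.
    ocr_output = " ".join(w[:-1] for w in LEXICON[:2])
    print("before:", word_accuracy(ocr_output, reference))
    print("after: ", word_accuracy(correct_text(ocr_output), reference))
```

A dictionary lookup of this kind can only repair words it already knows, which is one reason a human review of the final text is still needed.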

“Any digitised text will have to be peer-reviewed before finalising. It holds good even for English text using OCR,” he said.

N. Deivasundaram's NDS Linksoft Solutions has released the Tamil spell-checker of its Men Tamizh software as a plug-in for Microsoft Word at the ongoing Chennai Book Fair at YMCA, Nandanam.

One person who has used the MILE Lab's OCR reader and achieved good results is N. Venkataraman, who digitised more than 300 books before converting them to Braille text for the non-governmental organisation Worth Trust.

Mr. Nakkeeran said the developer community would do well to focus on OCR software and other digitisation tools that will improve the quality of digitised Tamil output and help create more ‘searchable’ documents in local languages.

“Most of the digitisation happening now is in the form of PDFs and JPEG images. This situation has to change,” he said.
