Automated conversion of written text to spoken form is very useful, especially in this time of online classes. Having lectures originally presented in English made available in all Indian languages has obvious uses. A group from the computer science department of the Indian Institute of Technology Madras is working on this. The researchers are developing the technology to enable text-to-speech conversion for 13 Indian languages: Assamese, Bodo, Bengali, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odia, Rajasthani, Tamil, Telugu and their corresponding Indian English flavours. A study on this was published in the journal IEEE/ACM Transactions on Audio, Speech, and Language Processing.
In order that the synthesised speech sounds as natural as possible, and close to a sentence read out by a human being, there is a need to convert punctuation marks into pauses of suitable lengths. This is the approach when converting English-language text into synthesised speech. When applying this to Indian languages, the first difficulty one encounters is that there are no punctuation marks, save the period. There are many such differences. “The longest English sentence could be about 6 seconds long, while in Indian languages sentences can last as long as 30 seconds,” says Hema A. Murthy from the Computer Science and Engineering department of IIT Madras who led the study. Such long sentences are essentially phrase-based, the researchers found, and each phrase is almost a complete unit.
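The punctuation-to-pause idea can be illustrated with a small sketch. The pause durations and the set of punctuation marks below are invented for illustration, not values from the study:

```python
import re

# Illustrative mapping from punctuation marks to pause lengths in
# milliseconds. These durations are assumed examples only.
PAUSE_MS = {",": 250, ";": 400, ":": 400, ".": 600, "?": 600, "!": 600}

def text_to_segments(text):
    """Split text at punctuation, pairing each chunk of words with
    the pause (in milliseconds) that should follow it."""
    segments = []
    for chunk in re.finditer(r"[^,;:.?!]+[,;:.?!]?", text):
        piece = chunk.group().strip()
        pause = PAUSE_MS.get(piece[-1], 0) if piece else 0
        segments.append((piece.rstrip(",;:.?!").strip(), pause))
    return segments

print(text_to_segments("Hello, world. How are you?"))
# → [('Hello', 250), ('world', 600), ('How are you', 600)]
```

A synthesiser could then insert the indicated amount of silence after each segment; the difficulty the article notes is that for most Indian-language text only the period is available as a cue.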
In the study, voice professionals – news readers and radio jockeys – were made to read out text carefully selected to be representative of various fields. “The audio signal and the text were aligned including pauses. Text was syllabified using rules, and syllables and pauses were identified in the audio using acoustic properties,” explains Prof. Murthy in an email to The Hindu. “Since the text and audio are aligned at the syllable-level, computing syllable-rate, number of syllables between pauses was straightforward,” she adds.
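The measurement Prof. Murthy describes can be sketched as follows, assuming a hypothetical alignment format of (start, end, label) triples in which a special label marks pauses; the format and the labels are assumptions for illustration:

```python
# Sketch: given a syllable-level alignment of audio and text,
# compute the syllable rate and the number of syllables between
# pauses. "PAU" is an assumed label marking a pause.

def syllable_stats(alignment):
    """alignment: list of (start_sec, end_sec, label) triples."""
    syllables = [u for u in alignment if u[2] != "PAU"]
    speech_time = sum(end - start for start, end, _ in syllables)
    rate = len(syllables) / speech_time if speech_time else 0.0
    # Count syllables in each stretch between pauses.
    runs, current = [], 0
    for _, _, label in alignment:
        if label == "PAU":
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return rate, runs

aln = [(0.0, 0.2, "ka"), (0.2, 0.4, "ma"), (0.4, 0.7, "PAU"),
       (0.7, 0.9, "la")]
rate, runs = syllable_stats(aln)
print(round(rate, 2), runs)  # → 5.0 [2, 1]
```

Because the alignment already pairs each syllable with its time span, both statistics fall out of a single pass over the data, which is what makes the computation "straightforward" once alignment is done.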
An hour of speech contains about 350-400 sentences. The researchers collected 10 hours of data for every language. “Five hours of data was used for hypothesising, and a set of held out sentences from the database was used for testing the hypothesis,” says Prof. Murthy. The text sentences were chosen to ensure maximum domain coverage. “This includes news, sports, fiction, etc, as we work on open domain text-to-speech synthesis systems,” adds Jeena J Prakash from Uniphore Software Systems, IITM Research Park, Chennai, who is the first author of the paper.
Using these findings, the text is split into phrases. “A phrase location–based speech synthesis system was built [which delineates the first phrase, last phrase and middle phrases]. The phrases of the text were synthesised using the appropriate phrase-based synthesis systems. The synthesised waveforms were concatenated,” explains Dr Prakash.
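The final step Dr Prakash describes can be sketched in outline. The synthesiser below is a silent stand-in stub, and the sample rate, pause length and word-to-duration rule are all invented; the study used trained systems specific to each phrase position:

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate for this sketch

def synthesise_phrase(phrase, position):
    """Stub standing in for a phrase-position-specific synthesiser
    ('first', 'middle' or 'last'). Returns a silent placeholder
    waveform whose length grows with the number of words."""
    return np.zeros(int(0.1 * SAMPLE_RATE) * len(phrase.split()))

def synthesise_sentence(phrases, pause_sec=0.2):
    """Synthesise each phrase with the system matching its position
    in the sentence, then concatenate the waveforms with pauses."""
    pause = np.zeros(int(pause_sec * SAMPLE_RATE))
    parts = []
    for i, phrase in enumerate(phrases):
        position = ("first" if i == 0 else
                    "last" if i == len(phrases) - 1 else "middle")
        parts.append(synthesise_phrase(phrase, position))
        if i < len(phrases) - 1:
            parts.append(pause)  # insert a pause between phrases
    return np.concatenate(parts)

wav = synthesise_sentence(["first phrase here", "a middle phrase",
                           "the last one"])
print(len(wav) / SAMPLE_RATE, "seconds")
```

Treating each phrase as an almost complete unit is what lets very long Indian-language sentences be handled piecewise rather than as a single 30-second stretch.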
The results were evaluated subjectively by listeners. The original spoken sentences and the synthesised sentences were played out in random order. The researchers found a uniform improvement across all Indian languages. “Currently we are part of a consortium on building speech-to-speech systems, where the objective is to replace the audio in the NPTEL/Swayam Lectures (in English) to vernacular,” says Prof. Murthy.