Personal digital assistants may soon ‘converse’ with you in Telugu, and with a greater level of accuracy. The International Institute of Information Technology (IIIT) Hyderabad has embarked on an ambitious project to crowdsource at least 2,000 hours of speech in the language, which could be used in such devices.
According to Prakash Yalla from IIIT’s Technology Transfer Office, the move comes after talks between various organisations and the Centre.
Around ₹1 crore has been earmarked for the project. The initiative seeks to give the large number of speakers of regional languages, such as Telugu, access to gadgets that use Artificial Intelligence-enabled speech recognition, such as Siri or Alexa.
The intention is to crowdsource speech from different regions of both Telangana and Andhra Pradesh in order to capture the diversity of the language across the region. For instance, the Telugu spoken in the Telangana, Rayalaseema and Andhra regions differs, and the way the young speak is different from the speech of senior citizens. The challenge lies in the quality of the dataset, for which the IIIT team is devising data collection protocols.
“Good quality data is collected when we are speaking naturally, with emotions and in an environment which is natural to us, in conversational language. This would be different when we, say, speak on stage or in a studio, which is a controlled environment,” Mr. Yalla says.
To capture the data, the IIIT team, comprising Mr. Yalla and Prof. Anil Vuppala from the Speech Processing Centre, decided on a voice telephony platform and teamed up with Ozonetel. The volunteer is given a link along with a topic in Telugu (for instance, ‘Games or television, which is better?’).
The user has to enter his or her mobile phone number, after which a screen seeks the volunteer’s age group: Junior (0-18), Young (18-30), Adult (31-59) and Senior (60 and above). It also seeks the volunteer’s native accent and gender. The volunteer can then speak into the microphone on the device, and the speech is recorded.
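The sign-up flow described above could be modelled roughly as follows. This is a minimal sketch based only on the details in the article; the field names, the `Volunteer` class and the bucketing function are illustrative assumptions, not Ozonetel’s or IIIT’s actual code (note that the Junior and Young buckets overlap at 18 exactly as stated in the source):

```python
from dataclasses import dataclass

# Age buckets as described in the article; the overlap at 18 follows the source,
# and the first matching bucket wins.
AGE_GROUPS = [
    ("Junior", 0, 18),
    ("Young", 18, 30),
    ("Adult", 31, 59),
    ("Senior", 60, 150),
]

def age_group(age: int) -> str:
    """Map a volunteer's age to the bucket shown on the sign-up screen."""
    for name, lo, hi in AGE_GROUPS:
        if lo <= age <= hi:
            return name
    raise ValueError(f"age out of range: {age}")

@dataclass
class Volunteer:
    """Metadata collected before recording; fields are inferred from the article."""
    mobile: str          # mobile phone number entered first
    age: int
    native_accent: str   # e.g. "Telangana", "Rayalaseema", "Andhra"
    gender: str

    @property
    def group(self) -> str:
        return age_group(self.age)

v = Volunteer(mobile="9000000000", age=24, native_accent="Telangana", gender="F")
print(v.group)  # Young
```

Recording metadata like accent and age group alongside each clip is what lets the team check later that the dataset actually spans the regional and age diversity they are aiming for.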
“The platform captures the data in a 16 kHz, 16-bit format, so that the entire voice spectrum is captured. To ensure data protection, the speech goes into different channels, after which it is broken up using an algorithm. The fragments are anonymised and mixed into a master dataset before being given to transcribers,” Mr. Yalla says.
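The capture format and the fragmentation step can be sketched with Python’s standard-library `wave` module. This is an illustration only: the tone generator stands in for a real recording, and shuffling short fragments is a crude stand-in for the anonymisation algorithm Mr. Yalla describes, whose details are not public. A 16 kHz sampling rate captures frequencies up to 8 kHz (the Nyquist limit), which covers the useful speech spectrum:

```python
import math
import random
import struct
import wave

SAMPLE_RATE = 16_000   # 16 kHz: speech energy up to the 8 kHz Nyquist limit
SAMPLE_WIDTH = 2       # 16-bit samples, as described in the article

def write_tone(path: str, seconds: float = 2.0, freq: float = 440.0) -> None:
    """Write a synthetic 16 kHz / 16-bit mono WAV (a stand-in for a recording)."""
    n = int(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n)
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)

def fragment(samples: bytes, seconds_per_chunk: float = 0.5) -> list[bytes]:
    """Break a recording into short fragments and shuffle them, so that no
    single transcriber sees a speaker's full utterance in order (a crude
    stand-in for the anonymisation step described in the article)."""
    step = int(SAMPLE_RATE * seconds_per_chunk) * SAMPLE_WIDTH
    chunks = [samples[i:i + step] for i in range(0, len(samples), step)]
    random.shuffle(chunks)
    return chunks
```

A two-second clip at these settings occupies 64,000 bytes (16,000 samples/s × 2 bytes × 2 s) and splits into four half-second fragments.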
The IIIT team plans to have the 2,000-hour dataset ready in about a year, without compromising on quality. If required, the quantum of data could be increased to 5,000 hours.
“A lot of transcribers and speech contributors have come forward, including those from colleges. This will cover the youth spectrum. We are launching campaigns to see that systems are robust and adults are also covered, in rural areas as well,” Mr. Yalla says.