Keep those lips sealed! Tech companies can now read them

The Hindu’s imaging of how lipreading technology may work   | Photo Credit: Arivarasu M

“Start navigation, please,” said the driver in the car with noisy passengers. Within seconds, a speech recognition system identified the command and activated the navigation system simply by reading the driver’s lips.

In another instance, a patient in a hospital with breathing tubes placed below their vocal cords finds it difficult to speak. A helper uses SRAVI – a mobile app built on Liopa’s lip-reading technology – to scan the patient’s face while they silently mouth a sentence. The Artificial Intelligence (AI)-assisted system then displays three probable statements of what the patient may be trying to say.


The uses of AI have grown tremendously over the years, from providing content recommendations on Netflix to self-driving vehicles. Audio recognition remains a vital use case, powering Amazon’s Alexa, Samsung’s Bixby, Apple’s Siri and Microsoft’s now-retired Cortana to perform simple tasks from voice commands.

But reconstructing speech from silent video is a different ball game.

How does it work?

Audio is built of a sequence of ‘phonemes’, while lip movements are made of ‘visemes’, the visual-speech counterparts of phonemes.

A typical AI system for speech reconstruction works on the encoder-decoder principle, Rajiv Ratn Shah, Assistant Professor at the Indraprastha Institute of Information Technology (IIIT) in Delhi, tells The Hindu. The AI learns the mapping between lip movements and audio from an elaborate dataset of lip movements paired with their corresponding audio, and encodes that mapping.

The encoded information helps the AI model interpret what the corresponding audio would be for a given lip movement by decoding it, he added. In effect, the AI learns to map the visemes of a given word back to its phonemes.
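The encoder-decoder idea described above can be sketched in miniature. This is a toy illustration only: the viseme labels, the two-pair “dataset” and the lookup-table model are all hypothetical stand-ins for what a real system would learn with deep neural networks over video frames.

```python
# Toy sketch of the encoder-decoder idea: learn a mapping from
# lip-movement (viseme) sequences to phoneme sequences, then use it
# to "decode" unseen lip movements. All labels are hypothetical.

# Training data: pairs of (viseme sequence, phoneme sequence).
TRAINING_PAIRS = [
    (("lips_closed", "lips_open_wide"), ("p", "a")),
    (("lips_rounded", "lips_spread"), ("o", "i")),
]

def encode(visemes):
    """'Encode' a viseme sequence into a single hashable representation."""
    return tuple(visemes)

def train(pairs):
    """Learn a lookup table from encoded visemes to phonemes
    (a real model would learn continuous representations instead)."""
    return {encode(visemes): phonemes for visemes, phonemes in pairs}

def decode(model, visemes):
    """'Decode' observed lip movements into a phoneme sequence."""
    return model.get(encode(visemes))

model = train(TRAINING_PAIRS)
print(decode(model, ("lips_closed", "lips_open_wide")))  # ('p', 'a')
```

The lookup table stands in for the learned encoder-decoder weights; the important point is the direction of inference at decode time, from visible mouth shapes back to the sounds that produced them.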


For example, two or more phrases may belong to similar viseme classes, like ‘elephant juice’ and ‘I love you.’ This means that the speaker’s jaw movement looks similar while saying both phrases, making them hard for the human eye to tell apart. But a smart AI-powered system complemented by a large dataset of possible combinations of lip movements and words can accurately decode the speaker’s words, Prof Shah described in a research paper titled ‘Harnessing AI for Speech Reconstruction using Multi-View Silent Video Feed’.
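The ambiguity above can be made concrete with a small sketch. The viseme classes and the frequency prior here are invented for illustration; the point is that lip shapes alone leave several candidate phrases, and extra context is needed to pick one.

```python
# Hypothetical sketch: phrases in the same viseme class look identical
# on the lips, so the system must use other evidence (here, a toy
# frequency prior) to choose between them.

VISEME_CLASS = {
    "elephant juice": "class_A",
    "I love you": "class_A",       # same lip shapes, different words
    "start navigation": "class_B",
}

# Toy prior: how often each phrase occurs in a (hypothetical) corpus.
PRIOR = {"elephant juice": 0.01, "I love you": 0.90, "start navigation": 0.09}

def candidates(observed_class):
    """All phrases whose lip movements match the observed viseme class."""
    return [p for p, c in VISEME_CLASS.items() if c == observed_class]

def best_guess(observed_class):
    """Resolve the ambiguity by preferring the more frequent phrase."""
    return max(candidates(observed_class), key=PRIOR.get)

print(candidates("class_A"))   # both phrases match the lip movements
print(best_guess("class_A"))   # 'I love you'
```

Real systems play the same trick at scale, scoring visually identical candidates with a language model rather than a hand-written frequency table.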

Additionally, speechreading involves looking at, perceiving and interpreting spoken symbols. Cameras in cars, mobile phones and homes capture the user’s face and decipher the movements of the teeth, tongue and mouth. The process largely relies on computer vision and neural networks, according to a 2016 research paper titled ‘Lip Reading Sentences in the Wild’ co-authored by Google and the University of Oxford.

How useful is it?

Using AI to lipread silent speech can be useful especially since the task is not trivial for humans, according to Prof Shah. To do it manually, expert lip readers and multimedia experts may be required, which doesn’t always guarantee real-time speech reconstruction, he added.

“Moreover, AI solutions work better than any normal person,” explains Prof Shah.

The technology has applications in several areas like security (capturing silent videos using CCTV) and crime investigations. United Kingdom-based startup Liopa’s LipSecure system can be integrated into biometric systems to prevent spoofing attacks, according to the company. The technology generates a random sequence and scans the user’s face and mouth while they say it in front of the camera.
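The anti-spoofing scheme described above is a challenge-response liveness check, sketched below. The function names and the digit-based challenge are assumptions for illustration, not Liopa’s actual API; the real system decodes the sequence from the camera feed rather than receiving it as a string.

```python
# Hedged sketch of a challenge-response liveness check: the system
# issues a random sequence, and accepts the user only if the
# lip-read sequence matches it. A replayed or pre-recorded video
# cannot predict a freshly generated challenge.
import secrets

def make_challenge(length=4):
    """Generate a random digit sequence for the user to mouth on camera."""
    return "".join(secrets.choice("0123456789") for _ in range(length))

def verify(challenge, decoded_from_lips):
    """Accept only if the sequence decoded from the lips matches
    the challenge issued for this session."""
    return decoded_from_lips == challenge

challenge = make_challenge()
assert verify(challenge, challenge)          # live user mouthing the digits
assert not verify(challenge, "old footage")  # replayed video fails
```

The security of such a check rests on the challenge being unpredictable and single-use, which is why a cryptographically strong generator like `secrets` is the natural choice over `random`.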

In healthcare, apps such as SRAVI could assist patients with cerebral palsy and dysarthria to communicate with others.

London-based startup TrueSync uses AI to create lip-synced visualisations in multiple languages, a move that could one day replace the process of dubbing movies. Such a technology could work wonders on over-the-top (OTT) streaming platforms like Netflix, Hulu and Amazon Prime Video, which are increasingly hosting content in regional and international languages. Imagine watching the Spanish-language series La Casa de Papel, or Money Heist, in your preferred language with real-time translation!

Prof Shah’s team is said to have built the world’s first intelligible speech reading system in 2018, which uses multiple camera views to decode silent speech, a step up from single-view systems that may not account for distractions in a car or home setting.

Tech giants have entered the space. Is there a threat?

In 2016, Google’s AI division DeepMind and the University of Oxford created lip-reading software said to have about 50% accuracy. At the 2021 Consumer Electronics Show, Sony unveiled the Visual Speech Enablement system, which uses camera sensors to augment lip-reading in any environment.

AI-powered tools are built on huge amounts of data that only big companies like Google, Microsoft and Amazon can access, making it easier for them to devise automatic speech recognition systems, Prof Shah comments. However, data privacy could be a significant challenge in an industry dominated by a few tech giants. “Companies must inform consumers about how their private data is being stored,” he states.

Customers mostly do not realise or understand the extent of intrusion they themselves are permitting or inadvertently allowing, Supreme Court advocate N. S. Nappinai, founder of online safety awareness firm Cyber Saathi, tells The Hindu. “If, for instance, Sony’s product working on Intelligent Vision Image Sensor and AI uses lip reading to act on voice commands, as is reported, the cameras are always on and capture not just the command but all else. There is no telling how this data will be used or for that matter how such a product can be misused by criminals,” she comments.

Visual speech recognition technologies also face challenges common to AI-powered systems — the emergence of deepfakes and excessive surveillance.


Morphed imagery and doctored videos can be created using the simplest of applications, making content look more realistic than manually manipulated media. This could cause large-scale damage since miscreants could use it to manipulate elections, spread hate and create a zero-trust society.

Presently, consumers are subjected to rampant corporate surveillance, including in EU countries, Nappinai explains. She elaborates, “This is despite the stringent General Data Protection Regulation (GDPR) levels of personal data protection and the innumerable fines imposed. Government use of such tech or of data gathered through such tech, again is a certainty.”

Nappinai gives the example of a double murder in January 2017, explaining, “In this case of 2018, it required a court order, which was granted, for the Amazon Echo recordings to be released.”

From the tech developer’s perspective, responsible tech is key, she adds, concluding, “From a user perspective, it is important that they understand the extent to which a product compromises privacy before purchasing or using the same. Informed and knowing consent is key.”



Sep 18, 2021
