A concept that has long captured the imagination of sci-fi writers, the human-machine interface is today inching closer to bridging the gap between the worlds of machines and humans.
Though these machines may not be in the form of gentle humanoids as Hollywood has taught us to imagine — like Robin Williams in Bicentennial Man, or giant robots like Bumblebee in Transformers — they are still rather powerful, and increasingly ubiquitous: they reside in our pockets in the form of smartphones.
While touch and gesture control have become commonplace, with several original equipment manufacturers (OEMs) experimenting with them to enhance the consumer experience, the next big thing in this segment appears to be voice input. Small, compact devices such as smartphones are now capable of processing complex voice input, and with the Siri application on the iPhone and the voice input support in recent versions of the Android operating system, tech majors appear to be working hard to outdo each other in this growing field.
As a user, you might type, touch, gesture or speak to your phone; at the end of the day, however, all machines understand are digital commands. Converting these human inputs into a machine-understandable format grows more complex with every attempt to emulate the 'human element' in these devices.
Speech recognition of complex phrases is the latest offering in smartphones, and is, as of today, the most complex mode of commanding them, because processing speech is inherently an arduous task. Gestures and touch can be made uniform across all users; operations such as auto-rotate or 'slide to unlock' do not depend on a user's behaviour, and even where they do, these dependencies are easily eliminated and, hence, easy to normalise.
Speech, however, inherently has multiple traits to be worked upon: tone (frequency), volume (intensity) and speed of utterance. Each of these traits varies not only between individuals, but in most cases also when the same person speaks at different times. The challenge in speech recognition is to normalise these traits into fundamental templates, which can be matched across different speakers to make consistent comparisons and logical decisions, explained Sneha Das, a signal-processing student and intern at the Indian Institute of Science. Once normalised, these templates are compared with new inputs and logical commands are handed to the devices.
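The normalisation Ms. Das describes can be sketched in a few lines of code. This is only an illustrative toy, not how Siri or Android actually work: it assumes an utterance is a plain list of signal samples (real engines operate on spectral features, not raw samples), and the `normalise` function and `template_len` parameter are invented for the example. It normalises volume by scaling to unit peak, and speed by resampling to a fixed template length, so that a quiet, fast rendition and a loud, slow rendition of the same word become directly comparable.

```python
# A minimal sketch of trait normalisation. Hypothetical representation:
# an utterance is a plain list of signal samples; real systems work on
# spectral features such as MFCCs, not raw waveforms.

def normalise(utterance, template_len=50):
    """Map an utterance to a fixed-length, volume-normalised template."""
    # Volume (intensity): scale so the loudest sample has magnitude 1.
    peak = max(abs(x) for x in utterance) or 1.0
    scaled = [x / peak for x in utterance]
    # Speed: linearly resample to a fixed number of points so fast and
    # slow utterances of the same word line up point-for-point.
    n = len(scaled)
    template = []
    for i in range(template_len):
        pos = i * (n - 1) / (template_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        template.append(scaled[lo] * (1 - frac) + scaled[hi] * frac)
    return template

# Two renditions of the same "word": the second is louder and slower
# (every sample doubled in amplitude and repeated in time).
fast_quiet = [0.0, 0.2, 0.4, 0.2, 0.0, -0.2, -0.4, -0.2, 0.0]
slow_loud = [2 * x for x in fast_quiet for _ in (0, 1)]

a, b = normalise(fast_quiet), normalise(slow_loud)
# After normalisation the two templates are nearly identical, so a
# simple mean-squared difference between them is small.
mismatch = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

Tone (pitch) normalisation is deliberately omitted here; handling it properly requires frequency-domain processing well beyond this sketch.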
“Speech recognition is complex because to attain normalisation, the speech recognition engine must train itself by running multiple iterations of the same content. This involves heavy digital signal processing computation first, and a tedious look-up algorithm to make the comparisons,” added Ms. Das. With smartphone processors now carrying digital signal processing (DSP) engines, sensible speech recognition has entered the smartphone segment.
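One classic example of the kind of look-up comparison Ms. Das mentions is dynamic time warping (DTW), which measures how similar two sequences are while letting one stretch or compress in time relative to the other. The article does not name the algorithms Siri or Google use, so the following is purely an illustrative sketch, with tiny made-up feature sequences standing in for real speech templates:

```python
# An illustrative look-up using dynamic time warping (DTW): a standard
# way to compare two feature sequences that differ in speed. The
# templates below are invented toy data, not real speech features.

def dtw_distance(a, b):
    """Minimum cumulative distance between sequences a and b, allowing
    one to stretch or compress in time relative to the other."""
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances
                                 cost[i][j - 1],      # b advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]

def recognise(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda lbl: dtw_distance(utterance, templates[lbl]))

templates = {
    "yes": [0.1, 0.9, 0.9, 0.1],
    "no":  [0.9, 0.1, 0.1, 0.9],
}
# A slower rendition of "yes" still matches, despite the extra samples.
print(recognise([0.1, 0.1, 0.9, 0.9, 0.9, 0.1], templates))  # yes
```

The quadratic cost of filling this table for every stored template hints at why Ms. Das calls the look-up "tedious" and why dedicated DSP hardware matters on a phone.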
Siri and Google's voice input capture spoken commands, convert them into text using complex algorithms and try to match them against a database of known terms, in some cases consulting their backend servers to verify or derive more accurate decisions. Over time, these applications personalise results by better understanding the speaker's dialect and by mapping the user's interests to make more sensible suggestions.
Siri, the intelligent personal assistant with voice input, has been popularised by Apple, which acquired Siri Inc. in 2010. Since its integration into iOS and its debut on the iPhone 4S in October 2011, Siri has been enticing users with its abilities and has set the trend for voice input in natural language. The knowledge-navigation aspect of Siri, which understands speech and performs tasks based on voice comprehension, operates at two levels: locally, using the processing on board, and remotely, by communicating with Apple's servers. Users with good connectivity get results almost in real time.
Google, almost simultaneously, albeit with less hype, introduced voice input starting from Android 2.2 ('Froyo'), and in many ways it has been faring better than Siri with every updated release.
This advantage Google seems to hold over Siri can be traced back to GOOG-411, a Google initiative in the U.S. in 2007 that provided free voice-based search of phone numbers. The project was discontinued in 2010, after running for over three years. While its purpose was not entirely clear at the time, Google had in the process begun analysing voice samples from its users, which must have come in handy in its attempts at voice input.
This ability to understand user dialects, coupled with Google's massive understanding of user search patterns, gives it an evident advantage over Siri. With every release, both applications can be expected to get better, and in this tussle it is users who stand to benefit, through better interaction with their smartphones.