For anyone curious about what the next frontier of AI models will look like, all the signs point towards multimodal systems, where users can engage with AI through several modes such as text, images and speech. People absorb ideas and form context by drawing meaning from the images, sounds, videos and text around them. A chatbot, even though it can write competent poetry and pass the U.S. bar exam, hardly matches up to this fullness of cognition. If AI systems are to be as close a likeness of the human mind as possible, the natural course would have to be multimodal.
A new race opens up
As another good old tech race shapes up, leading AI companies are already playing catch-up. On September 25, ChatGPT-maker OpenAI announced that it had enabled its GPT-3.5 and GPT-4 models to study images and analyse them in words, and that its mobile apps would get speech synthesis so that people can have full-fledged conversations with the chatbot. The Microsoft-backed company had promised multimodality in March, during the release of GPT-4, but kept the addition on the back burner. It rushed the release, however, after a report by The Information revealed that Gemini, Google’s yet-to-be-released multimodal large language model, was already being tested at a number of companies.
The report also stated that Google had an easy advantage over competitors in the multimodal world because of its readily available bank of images and videos via its search engine and YouTube. But OpenAI is moving fast to make inroads. The company is actively hiring multimodal experts, with pay packages of up to a hefty $370,000 per year. It is also reportedly working on a new project called Gobi, which is expected to be built from scratch as a multimodal AI system, unlike the GPT models.
How does multimodality work?
Multimodality itself isn’t novel. The past couple of years have seen a stream of multimodal AI systems being released. OpenAI’s text-to-image model DALL-E, upon which ChatGPT’s vision capabilities are based, is one such multimodal model and was released in 2021. DALL-E is built on another multimodal model called CLIP, which OpenAI released the same year and which learns to match images with text descriptions.
DALL-E is in fact the model that kickstarted the generative AI boom, and it is underpinned by the same concept that runs other popular AI image generators like Stable Diffusion and Midjourney: linking together text and images in the training stage. The system looks for patterns in the visual data that can be connected with the text of the image descriptions. This enables these systems to generate images according to the text prompts that users enter.
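As a rough illustration of what linking text and images can mean in practice, the toy Python sketch below places a few captions and one image in a shared vector space and picks the caption that sits closest to the image. The vectors are hand-made stand-ins and the function names `cosine_similarity` and `best_caption` are hypothetical. Real systems learn such vectors from very large sets of image-text pairs, and this sketch is not a description of how any particular product is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_vec, caption_vecs):
    """Return the caption whose vector is most similar to the image vector."""
    return max(
        caption_vecs,
        key=lambda caption: cosine_similarity(image_vec, caption_vecs[caption]),
    )


# Assumption: hand-made 3-dimensional "embeddings", purely illustrative.
# A trained model would produce vectors with hundreds of dimensions.
caption_vecs = {
    "a dog on a beach": [0.9, 0.1, 0.2],
    "a city at night": [0.1, 0.9, 0.3],
    "a bowl of fruit": [0.2, 0.2, 0.9],
}

# Stand-in for the vector an image encoder might produce for a dog photo.
image_vec = [0.8, 0.2, 0.1]

print(best_caption(image_vec, caption_vecs))
```

Because matching text and images end up near each other in the shared space, the same comparison can in principle run in either direction: from an image to its likeliest description, or from a text prompt to the images that fit it.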
For multimodal audio systems, the training works in the same way. GPT’s voice processing capabilities are based on OpenAI’s own open-source speech-to-text model, Whisper, which was released in September last year. Whisper can recognise speech in audio and convert it into plain text.
Applications of multimodal AI
Some of the earlier multimodal systems combined computer vision with natural language processing, or audio with text, to perform simpler but important functions such as automatic image captioning. And even if these systems weren’t all-powerful models like GPT-4, gunning for the ultimate dream of artificial general intelligence (AGI), they carried enough value to address very real problems.
In 2020, Meta was working on a multimodal system to automatically detect hateful memes on Facebook. Meanwhile, Google researchers published a paper in 2021 about a multimodal system they had built to predict the next lines of dialogue in a video.
But there are other, more complex systems still in the works. In May this year, Meta announced a new open-source multimodal AI system called ImageBind that spans several modes: text, visual data, audio, temperature and movement readings. In the blog post, Meta speculated that future multimodal models could add other sensory data, like “touch, speech, smell, and brain fMRI signals.”
The idea is to have future AI systems cross-reference this data in much the same way that current AI systems do for text inputs. For instance, a virtual reality device in the future might be able to generate not just the visuals and the sounds of an environment but also other physical elements. A simulation of a beach could have not just the waves crashing on the shore, but also the wind blowing and the temperature there.
If that sounds too futuristic, there are other uses that can be found closer to the world we live in now, like in autonomous driving and robotics.
Other industries like medicine are “inherently multimodal,” according to a post by Google Research. Processing CT scans and identifying rare genetic variations both need AI systems that can analyse complex datasets of images and then respond in plain words. Google Research’s Health AI section has been working on this for some time, having released papers on the ideal way to integrate multimodal AI systems in this field.
AI models that perform speech translation are another obvious segment for multimodality. Google Translate uses multiple models, as do others such as Meta’s SeamlessM4T, which was released last month. The model can perform text-to-speech, speech-to-text, speech-to-speech and text-to-text translations for around 100 languages, the company said.