Microsoft on Wednesday unveiled an artificial intelligence (AI) system for image captioning that, it says, can describe images as accurately as humans do.
The Redmond-based technology company said the new system is twice as good as the image captioning model that has been used in Microsoft products and services since 2015.
The captioning model is being rolled out through the company's cloud platform, letting developers use it in their own apps.
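As a rough illustration of what that developer access looks like, the sketch below calls an image-description endpoint of the kind Azure Cognitive Services exposes. The exact path (`/vision/v3.1/describe`), the `Ocp-Apim-Subscription-Key` header, and the response shape are assumptions based on the public Computer Vision REST API; check the current Azure documentation before relying on them.

```python
# Hypothetical sketch of calling an Azure-style image-description API.
# Endpoint path, header names, and response fields are assumptions.
import json
import urllib.request

def describe_image(endpoint: str, key: str, image_url: str) -> str:
    """Send an image URL to the service and return its best caption."""
    req = urllib.request.Request(
        f"{endpoint}/vision/v3.1/describe",
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,  # your Azure resource key
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return top_caption(json.load(resp))

def top_caption(body: dict) -> str:
    """Pick the highest-confidence caption from a describe response."""
    captions = body.get("description", {}).get("captions", [])
    return captions[0]["text"] if captions else ""
```

The parsing is split into `top_caption` so the response-handling logic can be exercised without network access.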
“Image captioning is one of the core computer vision capabilities that can enable a broad range of services,” Xuedong Huang, Microsoft’s CTO of Azure AI Cognitive Services, said in a statement.
It is already available in Seeing AI, a Microsoft app for blind and visually impaired users, and will start rolling out later this year in Microsoft Word and Outlook for Windows and Mac, and in PowerPoint for Windows, Mac and the web. There, the feature will be used to generate alt text, the photo description embedded in a web page or document for people who are blind or have limited eyesight.
With the upgrade, Microsoft aims to improve how the Seeing AI talking-camera app describes photos for people with visual impairments, including photos shared through social media apps.
Microsoft pre-trained a large AI model on images paired with word tags, each tag tied to a specific object in the image. Using word tags instead of full captions let researchers feed far more data into the model, since tagged images are much easier to obtain than fully captioned ones.
The pre-trained model was then fine-tuned for captioning on a smaller dataset of captioned images. When presented with an image containing novel objects, the system drew on this visual vocabulary (the object-word pairings learned during pre-training) to generate an accurate caption.
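The two-stage recipe described above can be sketched in miniature. This is a deliberate simplification, not the actual VIVO system, which uses large neural networks; all data and function names here are invented for illustration.

```python
# Toy illustration of the two-stage idea: build a "visual vocabulary"
# from cheap image-tag pairs first, then compose captions using it.
# Invented data and names; not the real VIVO model.

def pretrain_visual_vocabulary(tagged_images):
    """Stage 1: collect every word tag seen attached to any image."""
    vocab = set()
    for _image_id, tags in tagged_images:
        vocab.update(tags)
    return vocab

def caption(detected_objects, vocab):
    """Stage 2 (after fine-tuning on captioned images): compose a
    caption, naming objects known only from tag-level pre-training."""
    known = [obj for obj in detected_objects if obj in vocab]
    return "a photo of " + " and ".join(known) if known else "a photo"

# "accordion" never appears in a full caption, only as a tag, yet the
# system can still name it in a generated caption:
vocab = pretrain_visual_vocabulary([
    ("img_001", ["accordion", "person"]),
    ("img_002", ["umbrella", "dog"]),
])
print(caption(["accordion", "person"], vocab))
```

The point of the toy is the asymmetry: tag-level data is plentiful, so the vocabulary covers objects that the scarce captioned data never mentions.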
The model was evaluated on nocaps, a benchmark that tests how well AI systems generate captions for images containing objects not found in their caption training data. The results showed the AI system created captions that were more descriptive and accurate than captions written by people for the same images, according to a research paper titled “VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training.”