OpenAI is in the spotlight again, this time for building an artificial intelligence (AI) model that can create near-flawless one-minute videos from text prompts. The video creation AI model, called Sora, is trained on videos and images of various durations, resolutions, and aspect ratios to generate crisp, clear, and photorealistic output.
OpenAI CEO Sam Altman announced the AI model on Thursday through X (formerly Twitter), asking people to send prompt suggestions so he could make the tool churn out videos. Shortly thereafter, he showcased the tool’s prowess with videos of two dogs podcasting atop a mountain, woolly mammoths roaming snow-capped mountains, and a bustling night in Tokyo. The videos were high-definition and looked cinematic at first glance. While generating videos from text isn’t new, Sora’s achievement dwarfs Meta’s Make-a-Video and Google’s recently announced Lumiere text-to-video tools. Unlike the output of Meta’s, Google’s, or other earlier AI video tools, Sora delivers a studio-grade final product.
What is Sora and what can it do?
Sora means sky in Japanese, imagery that evokes ‘limitless creative potential,’ per the company’s engineering team. The new diffusion-based AI model is built on a transformer architecture, similar to large language models like ChatGPT. It can create images and videos that closely match a given subject, construct a video from a still image, and fill gaps in existing video clips.
Diffusion models are used to generate high-quality images and videos. They are named after the physical diffusion process, in which molecules move from high-concentration to low-concentration zones. In machine learning, these models generate new data by reversing a diffusion process: noise is gradually added to training data, and the model learns to remove it. To generate a new image or video, the model starts from pure noise and denoises it step by step into a clean output.
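The forward half of that process can be sketched in a few lines. The snippet below is a toy illustration of how Gaussian noise is blended into clean data under a noise schedule; the schedule values and the 8x8 "image" are made up for the example and have nothing to do with Sora's actual implementation, which operates on video latents at far larger scale.

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Add Gaussian noise to clean data x0 up to timestep t (forward process)."""
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[: t + 1])  # cumulative signal-retention factor
    noise = np.random.randn(*x0.shape)
    # Noisy sample: scaled-down signal plus scaled-up noise
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise

# Toy "image": as t grows, the sample drifts toward pure noise.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
betas = np.linspace(1e-4, 0.5, 50)  # illustrative noise schedule
x_early, _ = forward_diffuse(x0, 1, betas)   # barely corrupted
x_late, _ = forward_diffuse(x0, 49, betas)   # almost pure noise
# A trained model learns to predict `noise` from `xt`; generation then runs
# this in reverse, starting from pure noise and denoising step by step.
```

The key point is that the network is only ever trained on the easy direction (adding noise); the hard direction (generation) falls out of learning to undo it one small step at a time.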
According to OpenAI, Sora works by “turning videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches.” Patches are to Sora what tokens are to ChatGPT. Tokens unify diverse modalities of text, such as code, data, and natural language. In a similar way, patches unify videos by compressing them, a form of tokenisation for visual data. When a user sends a prompt to Sora, it creates a video by stitching together compressed patches of visual data.
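To make the token analogy concrete, here is a minimal sketch of cutting a video tensor into spacetime patches. The patch sizes and the raw-pixel input are assumptions for illustration; OpenAI's report says Sora patchifies a learned latent representation, not raw pixels.

```python
import numpy as np

def to_spacetime_patches(video, pt=2, ph=4, pw=4):
    """Split a video of shape (T, H, W, C) into flat spacetime patches.

    Each patch covers `pt` consecutive frames and a `ph` x `pw` spatial
    window: the visual analogue of a text token. (Sizes are illustrative.)
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Carve each axis into (number of patches, patch size)
    patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the patch-grid axes to the front, patch contents to the back
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    return patches.reshape(-1, pt * ph * pw * C)  # one row per patch

video = np.zeros((8, 16, 16, 3))  # 8 frames of 16x16 RGB
tokens = to_spacetime_patches(video)
print(tokens.shape)  # (64, 96): 4*4*4 patches of 2*4*4*3 values each
```

Because any duration, resolution, or aspect ratio reduces to the same kind of flat patch sequence, a single transformer can train on wildly heterogeneous video data, which is the point of the design.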
How good is Sora’s output?
The video clips generated by Sora are so photorealistic that they will stun anyone who looks at them for the first time. It is a top-class AI-based image generator. But a closer look reveals there is work to be done in object tracking.
While OpenAI claims Sora can handle occlusion, the computer-vision term for one object blocking another from view, the text-to-video model does suffer from this limitation to an extent. For instance, in one of the clips shared by the Microsoft-backed company, people in the background disappear as the focus moves past a couple walking. In another video, the contents of a half-filled glass spill onto the table before the glass breaks. In one clip showing four men walking near a chair, one person goes missing when the camera pans out. It was also easy to spot a person with a missing arm in one clip, and a cat with two left paws in another. Such mishaps show that the AI model needs a better grasp of space and time.
It must be noted that Sora is not available to the public yet. The videos were handpicked by OpenAI, so they may not be indicative of the tool’s average output. OpenAI plans to start sharing the model with third-party testers to receive feedback and improve it.
Some experts are of the view that more systemic glitches will surface as more people gain access to the tool.
Can occlusion be remedied?
While AI researchers are working to solve the object-tracking problem, some AI experts predict it will be hard to set right. They note that the fault stems not from the data, but from how the system constructs reality.
“One of the most fascinating things [about] Sora’s weird physics glitches is most of these are not things that appear in the data. Rather, these glitches are in some ways akin to LLM “hallucinations”, artefacts from decompression and lossy compression,” AI expert Gary Marcus said in a Substack post. If that is true, more data will not solve the problem, and generative AI models will not learn to function according to the physical laws of nature.
What about the training data?
OpenAI’s achievement with Sora is monumental, and it will disrupt the video creation and gaming industries. But the critical question on most people’s minds is what visual data Sora was trained on. Speculation is rife that the video generation tool was trained on data from game engines, movies, documentaries, YouTube videos, and possibly videos scraped from every corner of the web.
The company’s monk-like silence on the training data, even as it trumpets its training method, suggests that data underpins Sora’s success in making near-perfect videos. That data could well include copyrighted work, but unless OpenAI shares the information, it is hard to know.
Whether companies like OpenAI violate copyright law by training AI on unauthorised material scraped from the web is a question the courts are yet to address. Tech firms claim they are protected by copyright’s fair use doctrine and that lawsuits against them will stifle a growing AI industry.
What about misinformation?
Photorealistic video generation is worrying given the burgeoning misuse of generative AI tools to spread misinformation. This is possibly why OpenAI took the red-teaming route ahead of a public launch. Sora already has a filter that blocks prompts containing violent, sexual, or hateful language, as well as requests for images of well-known personalities. A second filter checks the frames of generated videos and blocks content that violates the company’s safety guidelines. OpenAI has also said Sora uses a fake-image detector developed for DALL·E 3, but given the industriousness of bad actors, none of these steps is watertight.
Published - February 19, 2024 10:30 am IST