On February 16, the American artificial intelligence research company OpenAI launched a video generation model called Sora. According to its official website, the model can generate videos up to one minute long from a text prompt, featuring complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.
01 | Sora can generate multiple complex scenes. According to the OpenAI official website, Sora can generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. Sora not only understands the requests users make in prompts, but also understands how those things exist in the physical world; in some scenes its output is realistic enough to be mistaken for real footage. The OpenAI website has so far posted 48 demo videos generated by Sora, with vivid colors and lifelike results. (Image: a still from a Sora-generated video of woolly mammoths walking through snow, via OpenAI.)

It should be noted that Sora is still a work in progress. OpenAI acknowledges that the model may struggle to accurately simulate the physics of complex scenes and may not understand specific instances of cause and effect.

According to the OpenAI website, Sora is a diffusion model: it starts from a video that looks like static noise and gradually transforms it by removing the noise over many steps. Sora can generate an entire video at once, or extend an already generated video to make it longer. By giving the model foresight over many frames at a time, OpenAI solved the challenging problem of keeping a subject consistent even when it temporarily leaves the frame. Like the GPT models, Sora uses a Transformer architecture, and it builds on OpenAI's earlier research on the DALL·E and GPT models.
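The "start from noise and remove it step by step" process described above can be illustrated with a minimal, purely conceptual sketch. Everything below — the ToyVideoDenoiser class, the generate_video function, the tensor shapes, and the way the prompt embedding is used — is a hypothetical stand-in rather than OpenAI's code or API; it only shows the basic shape of a diffusion sampling loop conditioned on a text prompt.

```python
import torch

# Hypothetical stand-in for a learned denoiser conditioned on a text prompt.
# A real video diffusion model would predict the noise to remove at each step;
# here the prompt embedding merely scales the prediction so the loop is runnable.
class ToyVideoDenoiser(torch.nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.proj = torch.nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, noisy_video: torch.Tensor, prompt_embedding: torch.Tensor) -> torch.Tensor:
        # "Predict" the noise present in the current sample.
        return self.proj(noisy_video) * torch.tanh(prompt_embedding.mean())


def generate_video(model, prompt_embedding, steps: int = 50,
                   shape=(1, 3, 16, 64, 64)) -> torch.Tensor:
    """Start from pure noise and iteratively remove predicted noise."""
    sample = torch.randn(shape)  # looks like static noise
    with torch.no_grad():
        for _ in range(steps):
            predicted_noise = model(sample, prompt_embedding)
            # Remove a small fraction of the predicted noise at each step.
            sample = sample - predicted_noise / steps
    return sample  # denoised frames (16 frames of 64x64 RGB in this toy setup)


if __name__ == "__main__":
    model = ToyVideoDenoiser()
    prompt = torch.randn(1, 512)            # stand-in for a text embedding
    video = generate_video(model, prompt)
    print(video.shape)                      # torch.Size([1, 3, 16, 64, 64])
```

The loop mirrors the description on OpenAI's page only at the level of intuition: generation begins from noise and the sample is refined over many denoising steps, with the text prompt steering each step.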
Sora also uses the recaptioning technique from DALL·E 3, which generates highly descriptive captions for the visual training data; as a result, the model can follow users' text instructions more faithfully. OpenAI states that Sora serves as a foundation for models that can understand and simulate the real world, a capability it believes will be an important milestone toward AGI (artificial general intelligence).

02 | Sora's videos shocked the industry as soon as they were released. Sora is not the first AI video model, and other companies have comparable text-to-video systems: Google is testing a model called Lumiere, Meta is testing a model called Emu, and the AI startup Runway is also developing products to help make videos. However, foreign media have reported that, according to AI experts and analysts, the length and quality of Sora's videos exceed anything seen so far. Ted Underwood, a professor of information sciences at the University of Illinois Urbana-Champaign, said he had not expected this level of coherent video generation within the next two to three years, and noted that OpenAI's demos may represent the model's best-case performance.
Several AI practitioners have said that, judging from the preview videos of Sora, the results are simply "crazy". On Reddit, one user asked whether the Sora model OpenAI released that day would become a milestone in the economic impact of automation; the question drew nearly 100 replies. Some users commented that the release of ChatGPT first showed people that anything was possible, and that artificial intelligence has kept improving since then, demonstrating ever more powerful capabilities.
03 | The whole AI industry is racing to compete. Meta recently announced V-JEPA, a video joint-embedding predictive architecture: a method for teaching machines to understand and model the physical world by watching videos. V-JEPA learns by watching videos on its own, without human supervision, without labeled video datasets, and without even having to generate a dynamic video from a still image. Compared with other models, V-JEPA's flexibility yields a 1.5x to 6x improvement in training and sample efficiency. In addition, in image classification it can identify the main objects or scenes in an image; in action classification it identifies specific actions or activities in video clips; and in spatiotemporal action detection it can identify the types of actions in a video along with when and where they occur. In benchmark scores, V-JEPA achieved 82.0% accuracy on Kinetics-400, 72.2% on Something-Something v2, and 77.9% on the ImageNet-1K image classification task.

Meta claims this is another important step toward artificial intelligence models that can plan, reason, and complete complex tasks using what they learn and understand about the world. V-JEPA also showcases Meta's progress in advancing machine intelligence through video understanding, laying the groundwork for higher levels of machine intelligence and artificial general intelligence (AGI).

In short, at the start of 2024 the progress of large AI models has accelerated across the board, and the ability to generate video, images, and text has improved greatly compared with a year earlier. If 2023 was the first year of AI image and text generation, OpenAI may well drive the industry into the first year of AI video generation this year.
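Meta's description — learning by predicting the representations of hidden parts of a video rather than reconstructing pixels — can be summarized with a small, hypothetical sketch. The encoder, predictor, masking scheme, and all dimensions below are toy stand-ins rather than Meta's released V-JEPA code; in the real model the prediction targets come from a separate target encoder, which is simplified away here by detaching the targets.

```python
import torch

# Toy stand-ins for JEPA-style components: an encoder that maps video patches
# to embeddings and a predictor that guesses the embeddings of masked patches.
embed = torch.nn.Linear(768, 256)    # patch features -> embedding
predict = torch.nn.Linear(256, 256)  # pooled visible context -> predicted embedding

def jepa_loss(patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Predict embeddings of masked patches from visible ones (no pixel targets)."""
    targets = embed(patches).detach()                   # target embeddings, no gradient
    visible = embed(patches) * (~mask).unsqueeze(-1)    # zero out masked patches
    context = visible.mean(dim=1, keepdim=True)         # crude pooled "context"
    predictions = predict(context).expand_as(targets)
    # Only the masked positions contribute to the loss.
    return ((predictions - targets) ** 2)[mask].mean()

if __name__ == "__main__":
    # 1 video, 64 space-time patches, 768-dim features per patch (all made up).
    patches = torch.randn(1, 64, 768)
    mask = torch.rand(1, 64) < 0.5       # hide roughly half the patches
    print(jepa_loss(patches, mask).item())
```

The point of the sketch is only the training signal: the loss is computed in embedding space on the hidden patches, which is what distinguishes this family of models from pixel-level generative approaches.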
04 Jingtai Viewpoint | In 2024, the industry may see greater development. In recent years, breakthroughs in visual algorithms in generalization, prompt controllability, generation quality, and stability have been driving a technological turning point and the emergence of hit applications. 3D asset generation and video generation both benefit from the maturing of diffusion algorithms, but their data and algorithmic challenges are greater than those of image generation. Considering how LLMs are accelerating every area of AI and that good open-source models are emerging, the industry may see greater development in 2024. From the end of 2023 to the beginning of 2024, AI video generation applications such as Pika and HeyGen have gradually gained popularity, confirming the continued progress and maturation of multimodal technology. The newly released Sora model will undoubtedly intensify the already fierce competition on this track.