On April 27, at the Future Artificial Intelligence Pioneer Forum of the Zhongguancun Forum, Vidu, China's first long-duration, high-consistency, high-dynamics video model, was officially released. Vidu was jointly developed by Tsinghua University and Shengshu Technology. Its release marks significant progress for China in video-model technology, putting its capabilities on par with leading international models such as Sora.
Zhu Jun, deputy dean of the Institute for Artificial Intelligence at Tsinghua University and chief scientist of Shengshu Technology, said at the forum that Vidu represents "full-stack independent innovation" and a "multi-dimensional comprehensive breakthrough," with six defining characteristics: simulation of the real physical world, rich imagination, multi-shot cinematic language, long video duration, high spatiotemporal consistency, and understanding of Chinese elements.
Vidu is powerful
According to reports, the Vidu model adopts the team's original U-ViT architecture, which fuses the Diffusion and Transformer approaches, and supports one-click generation of high-definition video up to 16 seconds long at resolutions up to 1080p.
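To make the architecture description concrete, here is a minimal, hypothetical sketch of the core U-ViT idea in NumPy: the diffusion timestep, the text condition, and the noisy image patches are all treated as tokens in one sequence, and long skip connections link the shallow half of the transformer to the deep half. This is a toy illustration of the published U-ViT design, not Vidu's actual implementation; all names, dimensions, and the tanh stand-in for the MLP are simplifications.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization (no learned scale/shift, for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x):
    # Single-head attention with identity Q/K/V projections (toy version).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def transformer_block(x):
    # Pre-norm block: attention plus a tanh stand-in for the MLP, with residuals.
    x = x + self_attention(layer_norm(x))
    x = x + np.tanh(layer_norm(x))
    return x

def u_vit_denoise(noisy_patches, time_token, cond_tokens, depth=4):
    """Toy U-ViT forward pass: everything becomes a token, and long skip
    connections feed the shallow half's activations into the deep half."""
    n_ctx = time_token.shape[0] + cond_tokens.shape[0]
    x = np.concatenate([time_token, cond_tokens, noisy_patches], axis=0)
    skips = []
    for _ in range(depth // 2):          # "shallow" half: record activations
        x = transformer_block(x)
        skips.append(x)
    for _ in range(depth // 2):          # "deep" half: add long skips back in
        x = transformer_block(x + skips.pop())
    return x[n_ctx:]                     # predicted noise for the patch tokens

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 8))   # 16 noisy image-patch tokens, dim 8
t_tok = rng.standard_normal((1, 8))      # diffusion timestep as a single token
c_toks = rng.standard_normal((4, 8))     # text-condition tokens
out = u_vit_denoise(patches, t_tok, c_toks)
print(out.shape)  # (16, 8)
```

The key design point is that, unlike a U-Net, there is no down/up-sampling: the "U" shape comes purely from the long skip connections between shallow and deep transformer blocks.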
In terms of performance, Vidu can not only simulate the real physical world, generating scenes with complex details that conform to real physical laws, such as plausible light and shadow and nuanced character expressions, but also draw on a rich imagination to create surreal content with depth and complexity.
In addition, Vidu can generate complex dynamic shots and switch among shot types such as long shots, medium shots, and close-ups, rather than being limited to a single fixed camera. Notably, Vidu can understand and generate distinctively Chinese elements such as pandas and dragons, demonstrating a deep grasp of traditional Chinese culture.
At the same time, Vidu generates video in a single step: the text-to-video conversion is direct and continuous, produced end-to-end by a single model without intermediate frame interpolation or other multi-stage processing, an important technical innovation.
The company behind it, Shengshu Technology, has attracted attention
Although it is one of Vidu's developers, Shengshu Technology may be relatively unfamiliar to the outside world.
Founded in March 2023, Shengshu Technology draws its core team from the Institute for Artificial Intelligence at Tsinghua University, along with technical talent from Peking University and technology companies such as Alibaba, Tencent, and ByteDance.
Last year, Shengshu Technology completed several financing rounds, with investors including Ant Group and Jinqiu Fund. In March this year, it closed a new round worth hundreds of millions of yuan, led by Qiming Venture Partners and followed by Datai Capital, Hongfu Houde, Zhipu AI, and existing shareholders BV Baidu Ventures and Zhuoyuan Asia.
At present, the team has published nearly 30 papers at top AI conferences such as ICML, NeurIPS, and ICLR. In diffusion models, its work spans full-stack technical directions including backbone networks, high-speed inference algorithms, and large-scale training.
In addition, despite its short history, Shengshu Technology has begun commercializing its large models: it provides model capabilities directly to B-end institutions via APIs, and it builds vertical application products monetized through subscriptions.
To date, Shengshu Technology has partnered with a number of B-end institutions, including game companies, personal device manufacturers, and internet platforms. It also launched two tool products last year: the visual creative design platform PixWeaver and the 3D asset creation tool VoxCraft.
Text-to-video models are accelerating their application penetration
Adobe, the global multimedia giant, announced on its official website that it will integrate Sora, Pika, Runway, and other models into its video editing software Premiere Pro. Adobe is also developing a video model for Firefly to power video and audio editing workflows in Premiere Pro, and its AI-powered audio features are now generally available, making audio editing faster, easier, and more intuitive. Adobe's existing user base reportedly reaches 33 million, which could become a huge market for large models in the future.
Text-to-video models are expected to drive a productivity revolution for video creators, greatly lowering production costs and creation barriers, with short video and animation likely to be the first fields transformed. They have broad applications across industries, including but not limited to marketing and advertising, R&D and training, e-commerce and retail, and entertainment and games.
Globally, the AIGC market is expected to grow from $67 billion in 2023 to $897 billion by 2030, implying a compound annual growth rate of roughly 45%. For the Chinese market, iResearch expects the industry to grow from RMB 14.3 billion in 2023 to RMB 1,144.1 billion in 2030, a compound annual growth rate of 87%.
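As a quick sanity check, the compound annual growth rates cited above can be recomputed from the endpoint figures (2023 to 2030 is a seven-year span):

```python
def cagr(start, end, years):
    # Compound annual growth rate: (end / start) ** (1 / years) - 1
    return (end / start) ** (1 / years) - 1

# Global AIGC market: $67B (2023) -> $897B (2030)
print(round(cagr(67, 897, 7) * 100))       # 45 (%)
# Chinese market: RMB 14.3B (2023) -> RMB 1,144.1B (2030)
print(round(cagr(14.3, 1144.1, 7) * 100))  # 87 (%)
```

Both results match the growth rates quoted in the forecasts.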