On December 6th, Google officially announced its latest large model, Gemini 1.0. Gemini 1.0 focuses on multimodal capabilities, and Google describes it as a "natively multimodal" model. Gemini Ultra exceeds current state-of-the-art results on 30 of 32 academic benchmarks widely used in large language model (LLM) research. On the MMLU (Massive Multitask Language Understanding) benchmark, Gemini Ultra scores as high as 90.0%, even surpassing human experts. Google CEO Sundar Pichai has called it "Google's largest and most capable AI model to date." So just how powerful is a model that surpasses human experts on this test set?
01 | Native multimodality with significantly improved performance
Gemini is a multimodal model built on the Transformer decoder architecture, able to process content in different forms, including video, audio, and text. Compared with previous technology, the latest Gemini models can perform more complex reasoning and understand subtler information. They can extract key points from hundreds of thousands of documents by reading, filtering, and understanding the information, which should help enable new breakthroughs in many fields, from science to finance. Gemini 1.0 launches in three sizes:
Small: Gemini Nano - the most efficient model for on-device tasks;
Medium: Gemini Pro - the best model for scaling across a wide range of tasks;
Large: Gemini Ultra - the largest and most capable model, for highly complex tasks.
Setting the parameter details aside for now, a few examples give a sense of Gemini's capabilities. When you casually sketch a duck, Gemini can accurately recognize it from the curve of the outline. Draw a wavy line under the duck and it understands the implication, pointing out that the duck is swimming in water. It can even imitate a duck's call, and it can respond fluently in Mandarin. If you are bored, you can play a game with Gemini: point your finger at a region on a map and it can name the country and its representative items. As the first multimodal model released by Google, Gemini supports deployment both in the cloud and on devices. According to the published test data, Gemini Ultra outperforms human experts on MMLU (Massive Multitask Language Understanding), and in head-to-head comparisons it surpasses GPT-4 on multiple tasks. Gemini Ultra is the first model to exceed human-expert performance on this test, with 90.0% accuracy versus 86.4% for GPT-4.
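To make the headline numbers concrete: MMLU-style benchmarks reduce to exact-match accuracy over multiple-choice answers. The sketch below is a minimal, hypothetical illustration of that scoring (the toy answer strings are invented for the example, not real MMLU data):

```python
def mmlu_accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly.

    MMLU-style scoring is exact-match accuracy over the chosen
    options (A/B/C/D) across all test questions.
    """
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical toy run: 9 of 10 options match -> 90.0%.
preds = list("ABCDABCDAB")
gold  = list("ABCDABCDAC")
print(f"{mmlu_accuracy(preds, gold):.1%}")  # -> 90.0%
```

The reported 90.0% for Gemini Ultra and 86.4% for GPT-4 are this kind of accuracy, computed over MMLU's roughly 14,000 questions spanning 57 subjects.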
02 | Innovations in infrastructure, algorithms, and datasets during training
In terms of infrastructure, Gemini was trained on Google's TPU v5e and TPU v4, with engineering innovations throughout the training process. For example, by connecting 4,096 TPU v4 chips through dedicated optical switches, 4x4x4 chip cubes can be dynamically reconfigured into a superpod with an arbitrary 3D torus topology in roughly 10 seconds. For the Ultra version, Google also deployed targeted features such as hot maintenance. To meet the high inter-chip interconnect speeds that Ultra requires, Google applied several patented technologies, including OCS optical switching, although the final speed figure is not given in the paper. In terms of algorithms, techniques such as single-controller programming and the XLA compiler were used to optimize training, and stable training was achieved by guarding against silent data corruption (SDC) and similar issues. In terms of datasets, tokenization techniques improved Gemini's training and inference speed, and a series of filtering methods ensured the high quality of the training data.
Google's newest compute chip, TPU v5p, was released at the same time. TPU v5p is an upgrade of the previous TPU v4: compared with v4, its floating-point performance has doubled, and it trains large language models 2.8 times faster than TPU v4. CITIC Securities believes that the official release of the multimodal Gemini model can expand application scenarios and drive continued growth in computing power demand. Minsheng Securities remains optimistic about the AI industry's prospects and believes that the release of models such as GPT-5 will bring further catalysts.
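The "3D torus" mentioned above simply means a chip grid whose links wrap around in every dimension, so no chip sits on an edge. A minimal sketch, using the 4x4x4 cube size from the text (the function name and layout are illustrative, not Google's actual scheduler API):

```python
from itertools import product

def torus_neighbors(coord, dims=(4, 4, 4)):
    """Return the 6 neighbors of a chip at `coord` in a 3D torus.

    Wrap-around links mean every chip has exactly two neighbors
    per dimension (x, y, z -> 6 total), with no edge effects.
    """
    x, y, z = coord
    nx, ny, nz = dims
    return {
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    }

# A 4x4x4 cube holds 64 chips; a 4,096-chip superpod is 64 such cubes.
chips = list(product(range(4), repeat=3))
assert len(chips) == 64 and 4096 // 64 == 64

# Even a "corner" chip like (0, 0, 0) has 6 neighbors, because the
# torus links wrap: its x-neighbors are x=1 and x=3.
assert all(len(torus_neighbors(c)) == 6 for c in chips)
```

The optical switches let Google rewire which cubes are joined into which torus shape on the fly, which is what makes the ~10-second reconfiguration possible without physically recabling.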
03 | The ability to run offline
According to DeepMind, Gemini Nano can run entirely offline on-device. Google has already adapted Gemini for the Pixel's built-in Recorder app, which can automatically generate AI summaries of recorded conversations, interviews, presentations, and other content even without a network connection. Beyond the system's built-in apps, Gemini Nano's capabilities have also been integrated into Android itself, and third-party developers can call the phone's built-in Gemini model through application adaptation. For example, the phone's built-in keyboard can use Gemini to automatically generate appropriate quick replies based on the messages the other party sends you in a chat app. Google's developers also mentioned plans to bring Gemini to other Android smartphones in the future, but this adaptation work involves matching the model to each phone's computing hardware, so for now the Pixel 8 Pro is the only supported device. So can Gemini completely surpass GPT-4? Although Google did not answer this question directly, it reiterated that Gemini Ultra scored higher than GPT-4 on MMLU and is currently the only AI model to exceed human-expert results on that test.
04 | Jingtai viewpoint: focus on leading vendors in computing power, algorithms, and data
For the industry as a whole, Google's push toward productization and commercialization will bring broad changes. At the same time, with the launch of models such as GPT-5, we expect to see: 1) growing computing power demand driven by multimodal models, and 2) the emergence of more and more AI scenarios and products. Gemini's release will further raise expectations for multimodal models. For the industry, multimodal models will drive up computing power demand in the near term; over the medium to long term, upgraded multimodal models should enrich the usage scenarios of related products, and together with cost optimization from hardware upgrades and algorithm improvements, progress on consumer (2C) products is worth looking forward to. We remain optimistic about the long-term impact of this wave of generative AI on the technology industry, and continue to watch leading vendors in computing power, algorithms, data, and applications.