DeepSeek, Tsinghua, and Peking University release a hard-core paper!
Time: 2026-03-08


On the eve of DeepSeek V4's official release, DeepSeek quietly put out a hard-core paper with teams from Peking University and Tsinghua University: "DualPath: An Efficient Large Model Inference System for the Agent Era".


This paper doesn't tout parameter counts or talk up multimodality; it goes straight at the most painful bottleneck in today's AI industry. When large models become capable agents, GPUs are no longer the bottleneck; hard disks and networks become the "culprits" holding them back.


DeepSeek's solution nearly doubles inference throughput, and thousand-GPU clusters can scale linearly. This may be an "invisible moat" more valuable than the model itself.


Jingtai believes the future AI race will not be about whose model is bigger, but about who runs "faster, cheaper, and more stably".


01


Large models can't run fast, and it turns out I/O is dragging them down

To understand why DeepSeek's new system "DualPath" is so powerful, we first have to figure out where current large-model inference gets "stuck".


The root of the problem: I/O (input/output) has become the bottleneck.


Today's agents, like humans, carry out multi-round conversations and task operations: planning a trip step by step, writing code, looking up information. Each step adds only a little new content (tokens), but the model must remember the entire prior conversation history. The context therefore grows longer and longer, often to hundreds of thousands or even millions of tokens.


But a GPU's high-bandwidth memory (HBM) and the host's DRAM have nowhere near the capacity to hold that much history. As a result, the system can only park most of the historical cache (the KV-Cache) on solid-state drives (SSDs), which are cheaper but far slower.
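To get a feel for the scale, here is a back-of-the-envelope estimate of how big the KV-Cache gets. This is a Python sketch; the model dimensions are our own illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope KV-Cache size for a long agent context.
# All model dimensions below are illustrative assumptions, not from the paper.

n_layers = 60        # transformer layers (assumed)
n_kv_heads = 8       # KV heads after grouped-query attention (assumed)
head_dim = 128       # dimension per head (assumed)
bytes_per_elem = 2   # fp16/bf16

# Each token stores one K and one V vector per layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

for context_len in (100_000, 1_000_000):
    total_gb = context_len * bytes_per_token / 1e9
    print(f"{context_len:>9,} tokens -> {total_gb:,.1f} GB of KV-Cache")
```

At roughly a quarter megabyte per token under these assumptions, a million-token context needs on the order of 250 GB, far beyond any single GPU's HBM. That is why the spill to DRAM and SSD is unavoidable.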


When the model needs to generate the next step, that history must be moved from the SSD back to the compute nodes. And here it runs into a second problem: today's mainstream inference systems use a disaggregated two-stage "prefill + decode" architecture.

Prefill nodes read the entire prompt and load the required KV-Cache from SSD; decode nodes then generate the answer token by token.


But this design has a big flaw: all KV-Cache reads from SSD must go through the prefill nodes. The result is that the prefill nodes' storage-network bandwidth is saturated to the point of paralysis, while the decode nodes, which have perfectly good network interfaces of their own, sit almost idle!
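To make the funnel concrete, here is a minimal sketch of that single-path flow. The bandwidth figures and chunk sizes are hypothetical, chosen only for illustration:

```python
# Sketch of the traditional single-path design: every KV-Cache chunk
# travels SSD -> prefill node -> decode node. All numbers are
# hypothetical, for illustration only.

PREFILL_NET_GBPS = 50   # prefill node's storage-network bandwidth (assumed)
DECODE_NET_GBPS = 50    # decode node's link, unused in this design

def load_kv_cache_single_path(chunks_gb: list[float]) -> float:
    """Seconds spent loading when only prefill nodes read from SSD."""
    total_gb = sum(chunks_gb)
    # All traffic squeezes through the prefill node's link; the decode
    # node's equally fast link contributes nothing.
    return total_gb * 8 / PREFILL_NET_GBPS

chunks = [4.0] * 32  # 128 GB of KV-Cache split into 4 GB chunks
print(f"single path: {load_kv_cache_single_path(chunks):.1f} s "
      f"(decode link idle at {DECODE_NET_GBPS} Gbps)")
```

Every byte squeezes through the prefill nodes' storage link while an equally fast link on the decode side does nothing: exactly the one-open-toll-booth picture in the next paragraph.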


It's like a highway with only one toll booth open: hundreds of cars pile up behind it while the dozen empty toll gates next to it go unused. To make matters worse, hardware trends have amplified the problem.


Over the past few years, GPU compute has grown rapidly, while network speed and memory capacity have lagged far behind. The result: GPUs keep getting faster, yet "hungrier", stalled waiting for data to arrive.


That is why, no matter how powerful a model is, its real-world efficiency is dragged down by I/O.


DualPath's breakthrough starts exactly here: let the idle decode nodes help move data too, doubling up the entire data channel.


02


DualPath: Let the idle "helpers" move data together

Since the decode nodes' storage-network bandwidth is idle most of the time, why not use it? That is the core idea of DualPath: instead of relying on a single node to read the data, multiple nodes pitch in together.


The traditional approach works like this: all the historical cache (KV-Cache) is read from disk to the prefill nodes and then passed on for GPU computation. It's like one person hauling a sack of rice upstairs alone, panting with exhaustion, while everyone else stands by and watches.


DualPath does something clever: it opens up two data channels.


Main path (prefill path): data is read from disk to the prefill node, sent to GPU memory for computation, and the complete cache is finally shipped over to the decode node.


New path (decode path): part of the data is read from disk directly into the decode nodes' memory; when computation is needed, that data streams over a high-speed network (using RDMA) to the prefill nodes and joins the computation.


These two paths are not a fixed division of labor; allocation is intelligent: whichever side has spare capacity takes more of the load. In this way, the I/O pressure that used to fall on a single node is spread across the cluster, and every node's storage bandwidth is put to full use.
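The sketch below contrasts this with the single-path version above. The greedy "whichever path is freer" rule is our own simplification of the paper's intelligent allocation, and all bandwidth figures remain assumed:

```python
# Dual-path sketch: decode nodes read part of the KV-Cache from SSD and
# stream it to the prefill nodes over RDMA, so both storage links work
# in parallel. The proportional split is a simplification of the paper's
# intelligent allocation; all figures are assumed.

PREFILL_NET_GBPS = 50
DECODE_NET_GBPS = 50

def load_kv_cache_dual_path(chunks_gb: list[float]) -> float:
    """Seconds spent loading when both node types read from SSD."""
    prefill_gb, decode_gb = 0.0, 0.0
    for chunk in sorted(chunks_gb, reverse=True):
        # Greedy: send each chunk down whichever path is currently freer.
        if prefill_gb / PREFILL_NET_GBPS <= decode_gb / DECODE_NET_GBPS:
            prefill_gb += chunk
        else:
            decode_gb += chunk
    # Both paths drain concurrently; the slower one sets the finish time.
    return max(prefill_gb * 8 / PREFILL_NET_GBPS,
               decode_gb * 8 / DECODE_NET_GBPS)

chunks = [4.0] * 32  # the same 128 GB of KV-Cache as before
print(f"dual path: {load_kv_cache_dual_path(chunks):.1f} s")
```

With two equally fast links the load splits roughly in half, cutting the load time accordingly.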


As a result, overall system throughput rises sharply, and the powerful GPUs truly no longer go hungry waiting for data.


03


Jingtai View | Watch the invisible champions of "compute infrastructure"

Impact on the valuation logic of AI companies:

The value of a future large-model company = model capability × inference efficiency × cost control. Companies that build their own inference systems (such as DeepSeek, Alibaba, and Meta) will open a significant gap over players who merely call APIs.


What it means for hardware and cloud vendors:

Demand for high-speed networking (InfiniBand/RDMA) surges: good for Mellanox (Nvidia), Huawei, and Zhongke Shuguang.
Distributed storage and SSD performance become key: watch domestic memory-chip and NVMe solution providers.
Thousand-GPU cluster scheduling becomes a new moat for cloud vendors: Alibaba Cloud, Tencent Cloud, and AWS may widen their lead.

