DeepSeek collaborates with Tsinghua University and Peking University on a new paper: targeting foundational infrastructure for intelligent agents and breaking through the I/O bottleneck in agent inference


On the eve of DeepSeek V4 release, a groundbreaking paper is now available.

Large models are rapidly evolving from single-turn chatbots to agents capable of autonomous planning, tool invocation, and solving real-world problems. However, this transformation has triggered a major upheaval in underlying computational architecture.

When large models interact with environments over long contexts—dozens or even hundreds of rounds—the bottleneck shifts from GPU compute power to storage I/O bandwidth. Because only a small number of tokens are appended each time, KV-Cache hit rates are extremely high (usually over 95%), causing GPUs to spend much of their time waiting for massive amounts of historical KV-Cache data to be read from external storage.
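A back-of-envelope sketch makes the point concrete. The numbers below are illustrative assumptions, not measurements from the paper, but they show why the cache-hit fraction approaches 100% in the multi-turn, short-append regime:

```python
# Illustrative model (numbers are assumptions, not paper measurements):
# with a long cached context and a short append, almost the entire
# prompt's KV entries already exist and must be re-read from storage.

def kv_hit_rate(context_tokens: int, appended_tokens: int) -> float:
    """Fraction of the prompt whose KV entries are already cached."""
    total = context_tokens + appended_tokens
    return context_tokens / total

# A 200k-token context with a 500-token append: over 99% of the KV
# is reused, so the prefill step is dominated by reading old KV-Cache
# from storage rather than by GPU compute on the new tokens.
rate = kv_hit_rate(200_000, 500)
print(f"KV-Cache hit rate: {rate:.2%}")
```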

To break this deadlock, DeepSeek, in collaboration with research teams from Peking University and Tsinghua University, has proposed a new large model inference system—DualPath.

This system introduces a novel “dual-path KV-Cache loading” mechanism that exploits idle network bandwidth within the cluster, boosting offline inference throughput for agentic large models by up to 1.87× and raising online serving throughput by an average of 1.96×.

Currently, this research has been validated at scale on clusters with up to 1,152 GPUs, supporting top-tier large models like DeepSeek-V3.2 660B.

Why does a severe I/O bottleneck occur?

To understand the innovation behind DualPath, it helps to first examine the pain points of existing architectures.

In typical agent trajectories, the model receives a prompt containing previous context and newly appended tokens, then generates the next action.

This multi-turn, short-append pattern causes the context length to grow rapidly, even reaching millions of tokens. Because GPU memory (HBM) and system memory (DRAM) capacity are limited, the massive KV-Cache must be stored on cheaper but slower external storage such as SSDs.
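To see why the KV-Cache overflows HBM, one can compute its size from the standard formula (the model dimensions below are hypothetical, chosen only for illustration):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Standard KV-Cache footprint: keys + values (the factor of 2),
    per layer, per KV head, per token, at the given dtype width."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Hypothetical 32-layer model with 8 KV heads of dim 128 in fp16:
# a one-million-token context needs roughly 131 GB of KV-Cache,
# far beyond a single GPU's HBM, hence the spill to SSDs.
gb = kv_cache_bytes(1_000_000, 32, 8, 128) / 1e9
print(f"{gb:.0f} GB")
```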

Modern large model inference systems generally adopt a prefill-decode disaggregation architecture: the prefill node processes the prompt and loads the cache-hit portion of the KV-Cache, while the decoding node generates tokens step by step.

The problem lies precisely here.

As shown on the left side of Figure 1, in existing systems, all KV-Cache data is loaded directly from external storage into the prefill node. This creates an extreme imbalance: the storage NIC (SNIC) bandwidth of the prefill node is fully saturated, becoming the system’s absolute bottleneck; meanwhile, the decoding node’s storage NIC remains largely idle.

Furthermore, hardware development trends exacerbate this contradiction. As shown on the left of Figure 3, NVIDIA’s hardware evolution indicates that GPU compute power (FLOPS) has grown much faster than network bandwidth and VRAM capacity, leading to a severe imbalance between compute and I/O.

DualPath: Breaking the Bandwidth Ceiling with Dual Paths

Since the decoding node’s storage bandwidth is idle, why not utilize it? That’s the core idea of DualPath.

The research team restructured the KV-Cache loading architecture, establishing a new “storage -> decoding -> prefill” loading channel alongside the traditional “storage -> prefill” route, yielding two paths:

  1. Prefill read path: KV-Cache is read from persistent storage into the prefill node’s memory buffer, transferred to GPU memory for computation, and finally the complete KV-Cache is passed to the decoding node.

  2. Decoding read path: KV-Cache is first read from persistent storage into the decoding node’s memory buffer. During prefill, this data is transmitted via high-speed inter-node network (using RDMA technology) in a layered streaming manner to the prefill node for computation.

By dynamically allocating data flow between these two paths, DualPath turns what was single-node I/O pressure into load sharing across a pooled, cluster-wide resource, successfully aggregating the storage bandwidth of all nodes.
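One simple way to picture the split is to divide KV blocks across the two paths in proportion to the storage bandwidth available on each side. This proportional policy is an assumption for illustration; the paper's actual allocation is handled by its adaptive scheduler:

```python
def split_kv_load(total_blocks: int, prefill_bw: float, decode_bw: float):
    """Split KV blocks between the two read paths in proportion to the
    aggregate storage (SNIC) bandwidth available on each side.
    Illustrative policy only; DualPath's real allocation is adaptive."""
    total_bw = prefill_bw + decode_bw
    via_decode = round(total_blocks * decode_bw / total_bw)
    via_prefill = total_blocks - via_decode
    return via_prefill, via_decode

# One prefill node vs. two decode nodes with equal per-node SNIC
# bandwidth: two thirds of the blocks flow through the decode side.
print(split_kv_load(900, prefill_bw=1.0, decode_bw=2.0))  # (300, 600)
```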

Overcoming implementation challenges: traffic isolation and dynamic scheduling

The idea is straightforward, but deploying it in large model inference systems with sub-millisecond latency sensitivity involves significant engineering challenges.

The first challenge is network traffic interference.

Introducing additional KV-Cache transfers can easily conflict with critical collective communications during inference (such as AllToAll operations in MoE architectures), slowing down overall inference.

To address this, DualPath employs a traffic management mechanism centered on the compute NIC (CNIC). All traffic to and from GPUs, including host-to-device copies, is forced through the compute NIC, and the underlying network (e.g., InfiniBand's virtual channels) enforces strict QoS controls. Model inference communication is assigned to a high-priority channel with 99% of the bandwidth, while KV-Cache transfers go to a low-priority channel and proceed only when the network is otherwise idle, cleanly isolating the two traffic classes.
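The priority discipline can be sketched as a two-class queue. This is only a software analogy (class names and FIFO tie-breaking are made up here); in DualPath the isolation is enforced in hardware by the NIC's virtual channels:

```python
import heapq
from itertools import count

HIGH, LOW = 0, 1   # smaller number = higher-priority "virtual channel"
_tie = count()     # FIFO ordering within a priority class

def submit(queue, priority, name):
    heapq.heappush(queue, (priority, next(_tie), name))

def drain(queue):
    """Serve transfers in strict priority order: inference collectives
    (HIGH) always go before background KV-Cache streaming (LOW)."""
    return [heapq.heappop(queue)[2] for _ in range(len(queue))]

q = []
submit(q, LOW,  "kv_stream_layer_0")
submit(q, HIGH, "moe_all_to_all")
submit(q, LOW,  "kv_stream_layer_1")
print(drain(q))  # ['moe_all_to_all', 'kv_stream_layer_0', 'kv_stream_layer_1']
```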

The second challenge is dynamic load balancing.

Given complex and variable requests, the system must decide in real-time which read path to use for each request, considering NIC queue lengths and GPU compute loads.

DualPath introduces an adaptive request scheduler (see Figure 5). The scheduler monitors each node's disk read queue and uses the number of queued tokens as its core load indicator, categorizing nodes into overloaded, low-read-queue, and high-read-queue groups and preferentially assigning new tasks to non-overloaded nodes with shorter queues.
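The grouping logic can be sketched as follows. The threshold value and the node-record fields are hypothetical; only the overall policy (skip overloaded nodes, prefer short read queues measured in tokens) comes from the description above:

```python
def pick_node(nodes, queue_threshold_tokens: int = 50_000):
    """Pick a target node for a new request: skip overloaded nodes,
    prefer the low-read-queue group, then the shortest queue.
    The threshold and field names are illustrative assumptions."""
    candidates = [n for n in nodes if not n["overloaded"]]
    if not candidates:
        return None  # every node is overloaded; caller must wait
    low = [n for n in candidates if n["queued_tokens"] < queue_threshold_tokens]
    pool = low or candidates  # fall back to high-read-queue group
    return min(pool, key=lambda n: n["queued_tokens"])["id"]

nodes = [
    {"id": "p0", "queued_tokens": 80_000, "overloaded": False},
    {"id": "p1", "queued_tokens": 12_000, "overloaded": False},
    {"id": "p2", "queued_tokens":  3_000, "overloaded": True},
]
print(pick_node(nodes))  # p1: shortest non-overloaded read queue
```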

Within nodes, the scheduler also estimates execution time to batch requests with similar durations, minimizing GPU idle time caused by synchronization delays.
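A minimal sketch of duration-aware batching: requests are sorted by estimated execution time and greedily grouped so no batch mixes fast and slow requests. The relative tolerance is an assumed knob, not a value from the paper:

```python
def batch_by_duration(requests, tolerance: float = 0.25):
    """Greedily group requests whose estimated execution times are
    within `tolerance` (relative) of the batch's fastest member, so a
    slow request never stalls fast ones at the synchronization barrier.
    Illustrative policy; the real scheduler's rule may differ."""
    batches, current = [], []
    for r in sorted(requests, key=lambda r: r["est_seconds"]):
        if current and r["est_seconds"] > current[0]["est_seconds"] * (1 + tolerance):
            batches.append(current)
            current = []
        current.append(r)
    if current:
        batches.append(current)
    return batches

reqs = [{"id": i, "est_seconds": t}
        for i, t in enumerate([1.0, 1.1, 1.2, 3.0, 3.2])]
batches = batch_by_duration(reqs)
print([[r["id"] for r in b] for b in batches])  # [[0, 1, 2], [3, 4]]
```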

Throughput nearly doubles, supporting scale-up to thousands of GPUs

The team conducted comprehensive evaluations of DualPath on an NVIDIA Hopper GPU cluster with InfiniBand network and 3FS distributed storage. Tests included models like DeepSeek-V3.2 660B, DS 27B, and Qwen2.5-32B, using real agent reinforcement learning trajectory datasets.

Offline batch inference performance (e.g., RL rollout phase):

Under various agent concurrency levels and maximum context lengths, DualPath significantly outperforms baseline systems. For DeepSeek-V3.2 660B, task completion times drop sharply, with throughput improving by up to 1.87×.

As the appended token length per round or generation length increases, DualPath maintains stable performance gains, demonstrating successful elimination of storage network bottlenecks.

Online service performance:

Under strict latency service-level agreements (e.g., time to first token under 4 seconds), the system's ability to absorb bursts of requests improves greatly: DualPath sustains request arrival rates (APS) up to 2.25× higher than the baseline while keeping end-to-end generation latency low. Ablation studies confirm that the dual-path loading mechanism and adaptive scheduling are the key contributors to these gains.

Large-scale scalability:

The system performs excellently not only on small clusters but also at massive scale. Tests on a large cluster with 1,152 GPUs (48 prefill nodes and 96 decoding nodes) show near-linear performance scaling.

By reshaping the underlying data flow, DualPath paves the way for the infrastructure needed in the upcoming era of Agentic large models, enabling ultra-fast inference.

Source: AI Cambrian
