A paper knocked down storage stocks.

Author: 深潮 TechFlow (Deep Tide TechFlow)

On March 25, U.S. tech stocks surged across the board, with the Nasdaq 100 index closing higher, but one category of stocks was bleeding against the trend:

SanDisk fell 3.50%, Micron dropped 3.4%, Seagate declined 2.59%, Western Digital decreased 1.63%. The entire storage sector suddenly felt like the power was cut off at a party.

The culprit is a paper—or more precisely, Google Research’s official promotion of a paper.

What exactly does this paper do?

To understand this, first clarify a rarely discussed concept in AI infrastructure: KV Cache.

When you interact with a large language model, the model doesn’t start from zero each time to understand your question. It stores the entire conversation context in memory in a format called “Key-Value Pairs” (KV), which is the KV Cache—short-term working memory for the model.

The problem is, the size of the KV Cache grows proportionally with the length of the context window. When the context reaches hundreds of thousands of tokens, the GPU memory consumed by the KV Cache can even surpass the model’s own weights. For inference clusters serving many users simultaneously, this is a real, daily money-burning infrastructure bottleneck.
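To make the bottleneck concrete, here is a back-of-envelope sizing sketch. The model shape (a hypothetical 8B-class model: 32 layers, 8 KV heads, head dimension 128) and the token count are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope KV Cache size for a hypothetical Llama-like model.
# All parameters here are illustrative assumptions, not figures from the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """2 tensors per layer (K and V), each of shape [n_kv_heads, n_tokens, head_dim]."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value / 8

# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 1M-token context.
fp16 = kv_cache_bytes(32, 8, 128, 1_000_000, 16)  # standard fp16 cache
q3   = kv_cache_bytes(32, 8, 128, 1_000_000, 3)   # 3-bit quantized cache

print(f"fp16 KV Cache @ 1M tokens:  {fp16 / 2**30:.1f} GiB")
print(f"3-bit KV Cache @ 1M tokens: {q3 / 2**30:.1f} GiB")
```

Under these assumed numbers the fp16 cache alone exceeds the capacity of a single 80 GB H100, which is why long-context serving forces multi-GPU deployments even before the model weights are counted.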

The original version of this paper first appeared on arXiv in April 2025 and will be officially published at ICLR 2026. Google Research named it TurboQuant: a near-lossless quantization algorithm that compresses the KV Cache to 3 bits per value, cutting memory usage by at least 6x, with no training or fine-tuning required, ready to use out of the box.

The technical approach involves two steps:

First, PolarQuant. It doesn’t represent vectors using standard Cartesian coordinates but converts them into polar coordinates—comprising “radius” and a set of “angles”—fundamentally simplifying the geometry of high-dimensional space, enabling subsequent quantization with lower distortion.
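The polar-coordinate idea can be sketched with a generic hyperspherical transform: a vector becomes one radius plus a set of bounded angles, and back. This is a textbook construction chosen for illustration; TurboQuant's actual scheme is assumed to differ in its details:

```python
import math

def to_polar(x):
    """Vector -> (radius, angles): a generic hyperspherical transform,
    used here only to illustrate the idea; not the paper's exact construction."""
    r = math.sqrt(sum(v * v for v in x))
    angles = []
    for i in range(len(x) - 2):
        tail = math.sqrt(sum(v * v for v in x[i + 1:]))
        angles.append(math.atan2(tail, x[i]))      # in [0, pi]
    angles.append(math.atan2(x[-1], x[-2]))        # in (-pi, pi]
    return r, angles

def from_polar(r, angles):
    """Inverse transform: rebuild the vector from radius and angles."""
    x, sin_prod = [], 1.0
    for a in angles:
        x.append(r * sin_prod * math.cos(a))
        sin_prod *= math.sin(a)
    x.append(r * sin_prod)
    return x

v = [1.0, 2.0, 2.0, -1.0]
r, angles = to_polar(v)
print(r, from_polar(r, angles))  # round-trips back to v (up to float error)
```

The payoff for quantization: every angle lives in a fixed, bounded range, so angles can be quantized on a uniform grid without per-block scale constants; only the single radius needs wide dynamic range.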

Second, QJL (Quantized Johnson-Lindenstrauss). After PolarQuant’s main compression, TurboQuant applies a 1-bit QJL transform for residual error unbiased correction, ensuring the accuracy of inner product estimates—crucial for the proper functioning of Transformer attention mechanisms.
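The 1-bit sign-sketch idea behind QJL can be illustrated with a classic identity: for a Gaussian projection, the inner product of a query with the *sign* of a key's projection is, in expectation, proportional to the true inner product. The code below is a sketch of that idea under assumed details, not the paper's implementation:

```python
import math
import random

def qjl_quantize(k, S):
    """1-bit sketch of key k: keep only the sign of each random projection,
    plus k's norm. Illustrative; details are assumed, not from the paper."""
    signs = [1.0 if sum(s * v for s, v in zip(row, k)) >= 0 else -1.0
             for row in S]
    norm = math.sqrt(sum(v * v for v in k))
    return signs, norm

def qjl_inner_product(q, signs, norm, S):
    """Unbiased estimate of <q, k> from the 1-bit sketch, using
    E[<S q, sign(S k)>] = m * sqrt(2/pi) * <q, k> / ||k||."""
    m = len(S)
    acc = sum(sign * sum(s * v for s, v in zip(row, q))
              for row, sign in zip(S, signs))
    return math.sqrt(math.pi / 2) * norm * acc / m

random.seed(0)
q, k = [1.0, 2.0, 3.0], [2.0, -1.0, 0.5]
S = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50_000)]
signs, norm = qjl_quantize(k, S)
print(f"exact <q,k> = {sum(a * b for a, b in zip(q, k)):.3f}, "
      f"1-bit estimate = {qjl_inner_product(q, signs, norm, S):.3f}")
```

The estimate is noisy per projection but unbiased, which is why it works as a *correction* on top of PolarQuant's main compression rather than as the primary quantizer.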

Results: on the LongBench benchmark, which covers Q&A, code generation, and summarization tasks, TurboQuant matches or even surpasses the current best baseline, KIVI; it achieves perfect recall on "needle-in-a-haystack" retrieval tasks; and on an NVIDIA H100, 4-bit TurboQuant speeds up the attention computation by 8x.

Traditional quantization methods have a fundamental flaw: each compressed data block requires storing “quantization constants” to decompress, adding metadata overhead—often 1 to 2 bits per value. While seemingly small, in a context of millions of tokens, these bits accumulate at an alarming rate. TurboQuant eliminates this overhead through PolarQuant’s geometric rotation and QJL’s 1-bit residual correction.
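The metadata overhead is simple arithmetic. The sketch below uses a common 4-bit group-quantization layout (fp16 scale plus fp16 zero-point per 32-value group) as an assumed example; the group size, metadata widths, and per-token dimension are all illustrative, not taken from the paper:

```python
# Illustrative cost of per-block quantization metadata (all numbers assumed).

def bits_per_value(payload_bits, group_size, scale_bits, zero_bits):
    """Effective bits per value when each group of `group_size` values also
    stores a quantization scale and zero-point alongside the payload."""
    return payload_bits + (scale_bits + zero_bits) / group_size

# A typical 4-bit scheme: fp16 scale + fp16 zero-point per 32-value group
# costs one full extra bit per value.
print(bits_per_value(4, 32, 16, 16))  # 5.0

# Over a 1M-token context with an assumed 4096 cached values per token,
# that single extra bit adds up to pure metadata:
extra_gib = 1_000_000 * 4096 * 1 / 8 / 2**30
print(f"{extra_gib:.2f} GiB of metadata alone")
```

Eliminating this term, as the article describes, is exactly what makes a true 3-bit-per-value cache (rather than an effective 4-to-5-bit one) possible.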

Why is the market panicking?

The conclusion is straightforward: a model that requires 8 H100s to handle a million-token context could theoretically need only 2. This means inference providers could handle over 6 times more concurrent long-context requests with the same hardware.

This strikes at the core narrative of the storage sector.

Over the past two years, Seagate, Western Digital, and Micron have been elevated by the AI capital frenzy, with the underlying logic being: larger models can "remember" more, and longer context windows demand ever more memory, leading to explosive storage needs. Seagate's stock surged over 210% in 2025, and the company's capacity is already sold out through 2026.

The emergence of TurboQuant directly challenges this narrative’s premise.

Wells Fargo tech analyst Andrew Rocha commented directly: “As context windows grow larger, data storage in KV Cache explodes, and memory demands rise accordingly. TurboQuant is directly attacking this cost curve… If widely adopted, it will fundamentally question how much memory capacity is truly needed.”

But Rocha also included a key condition: IF.

What truly warrants debate

Is the market overreacting? Most likely: yes, a bit.

First, the headline-grabbing claim of “8x acceleration.” Several analysts pointed out that this 8x boost compares new technology to the old 32-bit non-quantized systems, not to the currently widely optimized deployment systems. The actual improvement exists, but it’s not as dramatic as the headlines suggest.

Second, the paper only tested small models. All evaluations used models with at most around 8 billion parameters. The real concern for storage suppliers is with massive models of 70 billion or even 400 billion parameters, where KV Cache is truly astronomical. How TurboQuant performs at those scales remains unknown.

Third, Google has not released any official code yet. As of now, TurboQuant isn’t integrated into vLLM, llama.cpp, Ollama, or any mainstream inference frameworks. Community developers have independently reproduced early implementations based on the paper’s math, with one early reproducer noting that if the QJL error correction module isn’t implemented properly, the output can turn into gibberish.

But that doesn’t mean market concerns are unfounded.

This is the collective muscle memory from the 2025 DeepSeek moment. That event taught the market a harsh lesson: algorithmic efficiency breakthroughs can overnight transform the narrative of expensive hardware. Since then, any efficiency leap from top AI labs triggers a reflex in the hardware sector.

Moreover, this signal comes from Google Research, not an obscure university lab. Google has the engineering capacity to turn papers into production tools, and it is one of the world’s largest consumers of AI inference. Once TurboQuant is adopted internally, the procurement logic for servers at Waymo, Gemini, and Google Search will quietly shift.

The recurring script

There’s a classic debate worth taking seriously: Jevons’ Paradox.

19th-century economist William Stanley Jevons discovered that improvements in steam engine efficiency didn’t reduce coal consumption in Britain—in fact, consumption increased significantly—because efficiency lowered usage costs, stimulating larger-scale applications.

Supporters argue: if Google enables a model to run on 16GB of VRAM, developers won’t stop there. They will use the saved compute to run models six times larger, process more multimodal data, and support longer contexts. Ultimately, software efficiency unlocks demands previously inaccessible due to cost.

But this rebuttal relies on a premise: the market needs time to digest and re-expand. During the period from paper to production tool to industry standard for TurboQuant, can hardware demand grow fast enough to fill the efficiency “gap”?

No one knows. The market is pricing this uncertainty.

The deeper significance for the AI industry

More important than the stock movements of storage companies is the deeper trend TurboQuant reveals.

The main battleground of AI arms races is shifting from “massive compute” to “extreme efficiency.”

If TurboQuant proves its performance promise on large-scale models, it will bring a fundamental shift: long-context inference will no longer be a luxury only top labs can afford but will become the industry standard.

And this efficiency race aligns with Google’s strengths—near-optimal mathematical compression algorithms rooted in Shannon information theory, not brute-force engineering. TurboQuant’s theoretical distortion rate is only about 2.7 times the information-theoretic lower bound.

This means similar breakthroughs are unlikely to be singular. They represent a research path approaching maturity.

For the storage industry, a more realistic question isn’t “Will this impact demand?” but “As AI inference costs continue to be driven down by software, how wide can the moat in hardware remain?”

Currently, the answer is: still wide, but not wide enough to ignore signals like this.
