Google's TurboQuant will ease bottlenecks, not cut memory demand: Analysts
Published: 01 Apr. 2026, 16:25
Updated: 01 Apr. 2026, 18:09
By LEE JAE-LIM [[email protected]]
A render of an AI chip with a Google logo [KOREA JOONGANG DAILY]
[NEWS ANALYSIS]
TurboQuant, Google’s latest AI efficiency breakthrough, has rattled memory semiconductor markets, dragging down shares of Samsung Electronics, SK hynix and Micron amid concerns that its compression technology could dampen memory demand.
Those concerns have intensified on the belief that easing memory bottlenecks in data processing could reduce the need for additional capacity.
Samsung Electronics slipped 4.7 percent and SK hynix fell 6.2 percent on March 26 from the previous day, after Google Research published a blog post detailing the breakthrough. The shares rebounded sharply on Wednesday amid signs of a potential end to the Iran war. Shares of U.S. memory suppliers Micron and SanDisk also dropped 6.9 percent and 11 percent, respectively, over the same period.
Analysts and academics, however, say the reaction is overblown, arguing that the technology should be better understood as a more efficient way to process data rather than a factor that would significantly curb long-term memory demand or the ongoing supply shortage.
TurboQuant compresses an AI model’s short-term memory, known as the Key-Value (KV) cache, reducing the amount of data that must be stored and transferred. The technology cuts KV cache usage to one-sixth while maintaining near-original accuracy, according to Google, resulting in up to an eightfold boost in inference speed on Nvidia H100 GPUs. This allows AI systems to run faster, handle longer inputs and serve more users simultaneously without needing more hardware.
The KV cache has long been a major bottleneck in AI inference, driving up memory latency and compute costs as models process larger volumes of information over longer interactions with users. Because models must retain prior exchanges to generate contextually relevant responses, memory demands grow with the length of the conversation.
A screen capture of the TurboQuant announcement on the Google Research blog [GOOGLE RESEARCH]
Will TurboQuant reduce memory demand?
The market consensus maintains that the memory upcycle will persist, supported by long-term supply agreements — often three years or longer — with major tech companies such as Google and Microsoft, which are already being finalized. Such commitments would be unlikely if a near-term price decline were expected.
However, some investors point to the possibility that a scale-back in price hikes could dampen the appeal of memory stocks. Even so, with supply still tight and higher memory prices constraining consumer electronics production, prices are likely to remain elevated. Moreover, some argue that relieving key bottlenecks in AI infrastructure will drive memory demand higher, as improved efficiency allows for a broader range of applications, from agents to more advanced AI models, to be scaled up.
A woman walks by a giant screen with a logo at an event at the Paris Google Lab on the sidelines of the AI Action Summit in Paris, Feb. 9, 2025. [AP/YONHAP]
“By reducing memory usage during inference, TurboQuant lowers the cost of running AI models, which in turn reduces the overall cost of AI services,” said KB Securities analyst Kim Il-hyuk. “At a time when AI demand is outpacing the construction of new data centers, this kind of software-level innovation could significantly boost infrastructure efficiency. For hyperscalers, it effectively allows existing data centers to process more workloads, delivering benefits comparable to building entirely new facilities.”
Experts say memory demand will continue to rise with AI even as KV cache technology advances. Kim Jung-ho, a professor of electrical engineering at KAIST, said these technologies may slow growth, but won’t reduce overall demand.
“Memory demand in AI will keep rising,” the professor said. “Technologies like this may moderate the pace, but they won’t change the direction. KV cache usage is structurally tied to AI evolution. As models handle longer contexts — whether in physical AI or agent-based systems — memory requirements will inevitably scale with them.”
Academics also point out that the KV cache has long been a major bottleneck, with ongoing research focused on reducing its footprint since the start of the AI boom. Google’s TurboQuant blog post does not introduce an entirely new concept, but rather revisits a paper first released in April of last year. The research has regained attention ahead of its presentation at the International Conference on Learning Representations (ICLR) 2026. At the same event, Nvidia is presenting a related method called KV Cache Transform Coding, which can compress unused short-term memory data as much as 20-fold.
Can TurboQuant be applied immediately?
A key debate is whether TurboQuant can be readily applied to large-scale AI models such as Gemini, ChatGPT and Claude. The original paper tested the method on smaller open-source models with shorter context lengths, leaving uncertainty about its effectiveness at larger scales.
More detail on its technological readiness is expected at the upcoming ICLR 2026 conference and with the code release slated for the second quarter of this year, likely around June.
Han In-soo, an assistant professor at KAIST and a key figure behind the TurboQuant algorithm, believes it can be deployed immediately. Han has been a visiting researcher at Google Research since July of last year, where he led the development of the technology's key techniques, including PolarQuant, a preprocessing step that rotates and reshapes data so it can be compressed more efficiently without losing the key information the AI needs.
“TurboQuant can be applied directly to pretrained large language models without additional training or fine-tuning,” he said. “Its effectiveness will become clear once it is integrated into real-world systems.”
The technology looks especially useful for on-device AI, where models run directly on smartphones, cars, robots or wearables instead of in the cloud. Because these environments have strict memory limitations, efficiency is crucial. By reducing the memory required to retain context, TurboQuant could enable more powerful models to run locally.
It could also make a difference in search, recommendation and retrieval-based AI systems. These systems depend on storing and comparing large volumes of data, making memory a critical constraint. If that data can be compressed without losing accuracy, systems could run faster and scale more easily. This would be particularly helpful for retrieval-augmented generation, in which models need to quickly find and assess relevant information before producing an answer.
Still, some experts remain cautious. Kim Jung-ho argues that it may take two to three years before the technology is fully validated for large-scale deployment.
“An accuracy rate of around 99.7 percent, as stated in the paper, may seem strong, but as context length increases 10-fold or even 100-fold, error rates are likely to rise,” he said. “This could lead to more frequent issues such as hallucinations, which may limit practical usability in real-world applications.”
BY LEE JAE-LIM [[email protected]]