News from March 26: the Google Research team has officially released TurboQuant, a new vector quantization compression algorithm. By combining the PolarQuant and QJL techniques in a novel way, it reduces the memory footprint of the key-value cache (KV cache) in large language model (LLM) inference by at least six times, speeds up attention computation by up to eight times on Nvidia H100 GPUs, and shows no accuracy loss on multiple long-context benchmarks. This breakthrough is expected to significantly reduce AI deployment costs and accelerate the adoption of long-context applications.
KV Cache Pain Point: The Memory Overhead of High-Dimensional Vectors
When processing long sequences, LLMs maintain a cache of key and value vectors so that the attention mechanism can be computed quickly without recomputing past states. However, this KV cache grows linearly with context length, and for long contexts its memory consumption becomes a major bottleneck that limits inference efficiency and deployment scale.
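To make the scale concrete, here is a back-of-the-envelope estimate for a hypothetical dense transformer. The layer count, head count, head dimension, and fp16 storage are illustrative assumptions, not figures from the article:

```python
# Hypothetical model shape (illustrative assumptions, not from the article):
# 32 layers, 32 attention heads, head dimension 128, fp16 (2 bytes) storage.
layers, heads, head_dim, bytes_per_value = 32, 32, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

gib = kv_cache_bytes(128_000) / 2**30
print(f"KV cache at 128k tokens: {gib:.1f} GiB")
print(f"After the 6x compression the article reports: {gib / 6:.1f} GiB")
```

At this hypothetical shape the cache alone exceeds a single GPU's memory long before 128k tokens, which is why compressing it matters.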

Although traditional vector quantization methods can compress the data, they must additionally store quantization constants (such as scaling factors and zero points). These constants are usually kept in full precision, adding an extra 1-2 bits of overhead per value and partially offsetting the compression gains.
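This overhead is easy to see in a standard asymmetric quantizer, sketched below. This is a generic textbook scheme, not TurboQuant itself; the block size and bit width are illustrative:

```python
import numpy as np

def asymmetric_quantize(block: np.ndarray, bits: int = 4):
    """Map a block of floats onto 2**bits integer levels.

    The scale and zero point must be stored next to the codes; for a
    32-value block, two float32 constants are 64 extra bits, i.e. an
    additional 2 bits per value on top of the 4-bit codes."""
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / (2 ** bits - 1)
    codes = np.round((block - lo) / scale).astype(np.uint8)
    return codes, scale, lo

block = np.random.default_rng(0).standard_normal(32).astype(np.float32)
codes, scale, zero_point = asymmetric_quantize(block)
recon = codes * scale + zero_point  # dequantize
print("max reconstruction error:", np.abs(recon - block).max())
```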
TurboQuant Core Innovation: Dual-phase Compression with PolarQuant + QJL
TurboQuant adopts a two-phase training-free compression framework that cleverly solves the overhead issue of traditional quantization:
PolarQuant (Polar Angle Compression):
The vectors are first randomly rotated, and their Cartesian coordinates are then converted into polar form (angle plus radius). Because the angles fall within a fixed, predictable range, this eliminates the per-vector normalization constants that traditional quantization must store, enabling more efficient compression.
QJL (1-bit Error Correction, Quantized Johnson-Lindenstrauss):
Residual errors remain after PolarQuant compression. QJL applies a Johnson-Lindenstrauss transform for dimensionality reduction and then quantizes with a single bit (a +1/-1 sign). A specially constructed unbiased estimator corrects the error during attention-score computation without any additional memory overhead, ensuring the overall process has no systematic bias.
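The sign-quantization and debiasing step can be illustrated as follows. This is a sketch of the general 1-bit Johnson-Lindenstrauss idea: for Gaussian projections, rescaling by sqrt(pi/2) makes the sign-based inner-product estimate unbiased. It is not necessarily the exact estimator used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def qjl_encode(k, S):
    """Store only the signs of the JL projection (1 bit each) plus ||k||."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, signs, k_norm, S):
    """Unbiased estimate of <q, k> from the 1-bit code.

    For Gaussian rows s of S, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q,k>/||k||,
    so rescaling by sqrt(pi/2) * ||k|| removes the bias."""
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2) * ((S @ q) @ signs) / m

d, m = 64, 100_000
S = rng.standard_normal((m, d))  # Gaussian JL projection matrix
q = rng.standard_normal(d)
k = rng.standard_normal(d)

signs, k_norm = qjl_encode(k, S)
est = qjl_inner_product(q, signs, k_norm, S)
print("estimated:", est, " true:", q @ k)
```

The large `m` here is only to make the toy estimate visibly converge; the point is that no extra constants beyond the sign bits and one norm are stored.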
Combined, TurboQuant compresses the KV cache to roughly 3 bits per value while keeping inner-product estimates unbiased and accurate.
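Putting the two phases together, an end-to-end toy pipeline might look like this. It is a self-contained sketch: pairwise polar quantization supplies the coarse code, and a Gaussian 1-bit JL code corrects the residual. All parameters and estimator details are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, bits = 8, 50_000, 4
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # shared random rotation
S = rng.standard_normal((m, d))                   # Gaussian JL projection

def encode(k):
    """Phase 1: coarse polar code; phase 2: 1-bit sign code of the residual."""
    x = (R @ k).reshape(-1, 2)
    r = np.linalg.norm(x, axis=1)
    theta = np.arctan2(x[:, 1], x[:, 0])
    levels = 2 ** bits
    code = np.round((theta + np.pi) / (2 * np.pi) * levels) % levels
    theta_hat = code / levels * 2 * np.pi - np.pi
    x_hat = np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1)
    k_hat = R.T @ x_hat.reshape(-1)               # coarse reconstruction
    residual = k - k_hat
    signs = np.sign(S @ residual)                 # 1 bit per projection
    return k_hat, signs, np.linalg.norm(residual)

def score(q, k_hat, signs, res_norm):
    """Attention-score estimate: exact coarse part plus a debiased
    1-bit correction for the residual (unbiased for Gaussian S)."""
    correction = res_norm * np.sqrt(np.pi / 2) * ((S @ q) @ signs) / m
    return q @ k_hat + correction

q, k = rng.standard_normal(d), rng.standard_normal(d)
k_hat, signs, res_norm = encode(k)
print("estimated:", score(q, k_hat, signs, res_norm), " true:", q @ k)
```

In a real KV-cache setting only the integer angle codes, radii, sign bits, and residual norm would be stored; `encode` returns the dense `k_hat` here just to keep the demo short.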
Benchmark Results: Leading Across the Board, Well Suited to Long Contexts
The Google team conducted extensive validation on open-source models such as Gemma and Mistral:
- LongBench (covering tasks such as long-text question answering, code generation, and summarization): TurboQuant matches or exceeds existing baselines such as KIVI across the board.
- Needle In A Haystack and other retrieval tasks: Achieves perfect downstream scores while compressing KV memory by at least six times.
- Nvidia H100 test results: in a 4-bit configuration, attention-logit computation speeds up by as much as eight times.
In addition, on vector datasets such as GloVe, TurboQuant's recall also outperforms traditional methods such as PQ and RaBitQ.
AIbase Comment: TurboQuant requires no model retraining or fine-tuning and can be applied directly to existing LLMs; it suits any scenario that relies on vector quantization, including database retrieval, recommendation systems, and vector search engines. This not only lets a single consumer-grade GPU support longer contexts (tens of thousands of tokens) but also significantly lowers the hardware bar for enterprise AI services.
Industry Significance: New Benchmark for AI Inference Efficiency
With the explosion of long-context and multimodal applications, KV cache memory has become a core constraint in AI infrastructure. TurboQuant's "near-optimal, data-independent" quantization framework opens a new path toward efficient inference. Google Research stated that the technique is detailed in papers presented at ICLR 2026 and other venues, and that the code and implementation details are expected to be open-sourced gradually.
In the future, TurboQuant is expected to be integrated into mainstream inference frameworks such as vLLM and TensorRT, further promoting the democratization and scalability of AI deployment.
