Chinese large-model company Moonshot AI today released the technical report "Kimi Linear Tech Report" (report link) on Hugging Face, announcing a new architecture, Kimi Linear — a hybrid linear-attention architecture that can serve as a drop-in replacement for the full attention mechanism. It combines efficiency with strong performance and is positioned as a new starting point for attention mechanisms in the "agent era" of artificial intelligence.


The report shows that Kimi Linear makes significant gains on three fronts: speed, memory efficiency, and long-context processing. The model reduces KV cache usage by up to 75% and delivers up to a 6x increase in decoding throughput at a context length of 1 million (1M) tokens, substantially improving long-text reasoning and multi-turn dialogue performance.
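To see where savings of that magnitude can come from, here is a back-of-envelope sketch. All model dimensions below (layer count, head count, head size) are illustrative assumptions, not Kimi Linear's actual configuration; the point is only that if roughly 3 of every 4 attention layers are replaced by linear attention (whose state is constant-size, negligible next to a 1M-token cache), the KV cache shrinks by about 75%, matching the reported figure.

```python
# Back-of-envelope KV-cache estimate for full vs. hybrid attention.
# Dimensions below are hypothetical, not Kimi Linear's real config.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Full-attention KV cache: two tensors (K and V) per layer,
    each of shape seq_len x n_kv_heads x head_dim (fp16 by default)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

seq_len = 1_000_000                            # 1M-token context, as in the report
n_layers, n_kv_heads, head_dim = 48, 8, 128    # hypothetical model sizes

full = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim)

# Hybrid: assume only 1 in 4 layers keeps a growing KV cache; the linear
# layers maintain a fixed-size state we treat as ~0 at this scale.
hybrid = kv_cache_bytes(seq_len, n_layers // 4, n_kv_heads, head_dim)

print(f"full:   {full / 2**30:.1f} GiB")
print(f"hybrid: {hybrid / 2**30:.1f} GiB  ({1 - hybrid / full:.0%} saved)")
```

Under these toy numbers the full-attention cache is ~183 GiB versus ~46 GiB for the hybrid, i.e. exactly the 75% reduction the report cites as the upper bound.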

The core innovation of Kimi Linear lies in three key technologies:

  • Kimi Delta Attention (KDA): a hardware-efficient linear attention mechanism built on a gated delta rule, striking a balance between model quality and compute cost;

  • Hybrid linear architecture: the first hybrid linear architecture to comprehensively outperform traditional full attention across multiple metrics, balancing speed with model expressiveness;

  • Open ecosystem and empirical validation: Moonshot provides an open-source KDA kernel, vLLM integration, and model checkpoints, and has run large-scale, controlled comparison experiments to verify Kimi Linear's stability and scalability.

Moonshot AI stated that Kimi Linear is not only an architectural innovation but also a foundational mechanism designed for the "AI agent" era. As linear attention technology matures, it is expected to become the next standard for applications such as long-context reasoning, intelligent assistants, and multimodal generation.

Model address: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct