Chinese large-model company Moonshot AI today released the technical report "Kimi Linear Tech Report" (report link) on Hugging Face, announcing a new architecture, Kimi Linear — a hybrid linear-attention architecture that can serve as a drop-in replacement for the full attention mechanism. It combines efficiency with strong performance and is positioned as a new starting point for attention mechanisms in the "agent era" of artificial intelligence.


The report shows that Kimi Linear makes significant gains on three fronts: speed, memory efficiency, and long-context processing. The model reduces KV cache usage by up to 75% and delivers up to a 6x increase in decoding throughput at a context length of 1 million (1M) tokens, substantially improving long-text reasoning and multi-turn dialogue performance.
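To see where savings of that magnitude can come from, here is a back-of-envelope sketch. All model dimensions below (layer count, head count, head size) are illustrative assumptions, not Kimi Linear's actual configuration; the point is only that if roughly 3 of every 4 attention layers are replaced by linear attention (whose state is constant-size, negligible next to a 1M-token cache), the KV cache shrinks by about 75%, matching the reported figure.

```python
# Back-of-envelope KV-cache estimate for full vs. hybrid attention.
# Dimensions below are hypothetical, not Kimi Linear's real config.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Full-attention KV cache: two tensors (K and V) per layer,
    each of shape seq_len x n_kv_heads x head_dim (fp16 by default)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

seq_len = 1_000_000                            # 1M-token context, as in the report
n_layers, n_kv_heads, head_dim = 48, 8, 128    # hypothetical model sizes

full = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim)

# Hybrid: assume only 1 in 4 layers keeps a growing KV cache; the linear
# layers maintain a fixed-size state we treat as ~0 at this scale.
hybrid = kv_cache_bytes(seq_len, n_layers // 4, n_kv_heads, head_dim)

print(f"full:   {full / 2**30:.1f} GiB")
print(f"hybrid: {hybrid / 2**30:.1f} GiB  ({1 - hybrid / full:.0%} saved)")
```

Under these toy numbers the full-attention cache is ~183 GiB versus ~46 GiB for the hybrid, i.e. exactly the 75% reduction the report cites as the upper bound.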

The core innovation of Kimi Linear lies in three key technologies:

  • Kimi Delta Attention (KDA): a hardware-efficient linear attention mechanism built on a gated delta rule, striking a balance between model quality and compute cost;

  • Hybrid linear architecture: the first hybrid linear architecture to comprehensively outperform traditional full attention across multiple metrics, balancing speed with model expressiveness;

  • Open ecosystem and empirical validation: Moonshot provides an open-source KDA kernel, vLLM integration, and model checkpoints, and has run large-scale, controlled comparison experiments to verify Kimi Linear's stability and scalability.

Moonshot AI stated that Kimi Linear is not only an architectural innovation but also a foundational mechanism designed for the "AI agent" era. As linear attention technology matures, it is expected to become the next standard for applications such as long-context reasoning, intelligent assistants, and multimodal generation.

Model address: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct