With the same computing power and the same data, why do some models perform better?
On March 16, the Kimi team unveiled its redesign of the residual connection, one of deep learning's most time-honored building blocks.

This breakthrough quickly caused a stir in Silicon Valley's AI community, with researchers on social media openly praising it as "Impressive work from Kimi."
Jerry Tworek (a main inventor of OpenAI's o1) called it the beginning of "Deep Learning 2.0."
Andrej Karpathy (former founding member of OpenAI) remarked that the industry still has room to deepen its understanding of "Attention is All You Need."
Why modify the “time-honored foundation”?
Traditional residual connections solved the problem of training deep networks, but their "equal-weight addition" is crude: as the network deepens, each new layer's contribution tends to be drowned out by the accumulated information, leaving many intermediate layers as "ineffective workers."
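The "equal addition" described above can be sketched in a few lines. The toy numbers below (dimension 64, 48 layers, a contribution scale of 0.1) are illustrative assumptions, not figures from the paper; they only demonstrate why late layers' contributions get diluted:

```python
import math
import random

random.seed(0)

def residual_layer(x, layer_out):
    # Standard residual connection: the input and the layer's output
    # are summed with equal weight -- the "equal addition".
    return [a + b for a, b in zip(x, layer_out)]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

x = [random.gauss(0, 1) for _ in range(64)]
for _ in range(48):
    # One layer's new signal, deliberately small relative to the stream.
    contribution = [0.1 * random.gauss(0, 1) for _ in range(64)]
    x = residual_layer(x, contribution)

# After many layers, a single layer's addition is tiny compared with
# the accumulated stream it is added to -- the dilution described above.
ratio = norm(contribution) / norm(x)
print(f"relative size of the last layer's contribution: {ratio:.3f}")
```

The ratio comes out well under 10% in this toy setting: the deeper the stack, the smaller the relative voice of each new layer.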

Kimi's “Elegant Rotation”:
With this design, each layer no longer passively receives the accumulated stream; instead, through a small "query vector," it actively and selectively decides how much information to extract from previous layers. To address the memory overhead in large-scale training, the team also proposed the Block AttnRes variant, which divides the network into blocks, preserving performance while keeping the increase in inference latency within 2%.
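The article does not spell out the exact formulation, but the mechanism it describes, a small query vector scoring previous layers' outputs and mixing them selectively, can be sketched as follows. All function names, shapes, and numbers here are illustrative assumptions, not the paper's actual design:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn_residual(history, query):
    # Hypothetical attention-style residual: the query vector scores
    # each previous layer's output, and the residual stream becomes a
    # softmax-weighted sum of those outputs instead of a plain sum.
    scores = [sum(q * h for q, h in zip(query, hidden)) for hidden in history]
    weights = softmax(scores)
    dim = len(history[0])
    return [sum(w * hidden[i] for w, hidden in zip(weights, history))
            for i in range(dim)]

# Toy usage: three previous layer outputs, one learned query vector.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # favors layers aligned with dimension 0
mixed = attn_residual(history, query)
print(mixed)
```

The block-wise Block AttnRes variant would, under the same assumption, restrict `history` to the layers within the current block, which is what bounds the extra memory and keeps the latency overhead small.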

In experiments, the architecture demonstrated strong generalization: a 7.5% improvement on the GPQA-Diamond science reasoning task, and significant gains of 3.6% and 3.1% on math and code-generation tasks, respectively.

As the founder stated in his speech at GTC 2026, the industry is gradually hitting the limits of scaling and must rebuild foundational components such as optimizers and residual connections. While most are still busy with "renovating the upper floors," Kimi chose to rework the foundation itself.
