With the same computing power and the same data, why do some models perform better?
On March 16, the Kimi team unveiled its redesign of the residual connection, one of deep learning's most time-honored building blocks.

This breakthrough quickly caused a stir in Silicon Valley's AI community, with researchers on social media openly praising it as "Impressive work from Kimi."
Jerry Tworek (a main inventor of OpenAI's o1) called it the beginning of "Deep Learning 2.0."
Andrej Karpathy (former founding member of OpenAI) remarked that the industry still has room to deepen its understanding of "Attention is All You Need."
Why modify the “time-honored foundation”?
Traditional residual connections solved the problem of training deep networks, but their "equal-weight addition" is crude: as the network deepens, each new layer's contribution tends to be drowned out by the accumulated information, leaving many intermediate layers as "ineffective workers."
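The "equal addition" described above can be sketched in a few lines. The toy numbers below (dimension 64, 48 layers, a contribution scale of 0.1) are illustrative assumptions, not figures from the paper; they only demonstrate why late layers' contributions get diluted:

```python
import math
import random

random.seed(0)

def residual_layer(x, layer_out):
    # Standard residual connection: the input and the layer's output
    # are summed with equal weight -- the "equal addition".
    return [a + b for a, b in zip(x, layer_out)]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

x = [random.gauss(0, 1) for _ in range(64)]
for _ in range(48):
    # One layer's new signal, deliberately small relative to the stream.
    contribution = [0.1 * random.gauss(0, 1) for _ in range(64)]
    x = residual_layer(x, contribution)

# After many layers, a single layer's addition is tiny compared with
# the accumulated stream it is added to -- the dilution described above.
ratio = norm(contribution) / norm(x)
print(f"relative size of the last layer's contribution: {ratio:.3f}")
```

The ratio comes out well under 10% in this toy setting: the deeper the stack, the smaller the relative voice of each new layer.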

Kimi's “Elegant Rotation”:
With this design, each layer no longer passively receives the accumulated stream; instead, through a small "query vector," it actively and selectively decides how much information to extract from previous layers. To address the memory overhead in large-scale training, the team also proposed the Block AttnRes variant, which divides the network into blocks, preserving performance while keeping the increase in inference latency within 2%.
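The article does not spell out the exact formulation, but the mechanism it describes, a small query vector scoring previous layers' outputs and mixing them selectively, can be sketched as follows. All function names, shapes, and numbers here are illustrative assumptions, not the paper's actual design:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn_residual(history, query):
    # Hypothetical attention-style residual: the query vector scores
    # each previous layer's output, and the residual stream becomes a
    # softmax-weighted sum of those outputs instead of a plain sum.
    scores = [sum(q * h for q, h in zip(query, hidden)) for hidden in history]
    weights = softmax(scores)
    dim = len(history[0])
    return [sum(w * hidden[i] for w, hidden in zip(weights, history))
            for i in range(dim)]

# Toy usage: three previous layer outputs, one learned query vector.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # favors layers aligned with dimension 0
mixed = attn_residual(history, query)
print(mixed)
```

The block-wise Block AttnRes variant would, under the same assumption, restrict `history` to the layers within the current block, which is what bounds the extra memory and keeps the latency overhead small.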

In experiments, the architecture demonstrated strong generalization: a 7.5% improvement on the GPQA-Diamond science reasoning task, and significant gains of 3.6% and 3.1% on math and code-generation tasks, respectively.

As the founder stated in his speech at GTC 2026, the industry is gradually hitting the limits of scaling and must rebuild foundational components such as optimizers and residual connections. While most are still busy with "renovating the upper floors," Kimi chose to rework the foundation itself.
