Getting generative AI models to process long texts quickly and efficiently has long been a challenge for engineers. Recently, the technical team at Xiaohongshu open-sourced RedKnot, an inference engine developed in-house, offering a new "cost-effective and efficient" solution for long-context tasks.

The core innovation of RedKnot lies in breaking away from the traditional approach to handling the KV Cache (key-value cache). Conventionally, large models organize the cache along the token dimension, so memory consumption grows linearly with text length, which significantly reduces inference speed and the number of requests that can be served concurrently. RedKnot instead splits the KV Cache along the attention head dimension and introduces three mechanisms, "head-agnostic sparsity," "sparse FFN," and "SegPagedAttention," so that the granularity of storage matches the granularity at which the algorithm operates.
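To make the memory argument concrete, here is a rough Python sketch of how a dense KV cache grows with sequence length, and how a per-head layout could in principle let each head store a different number of tokens. The function names, the model dimensions, and the per-head token budgets are all hypothetical. None of them come from RedKnot's code, and the per-head budget idea is an illustrative assumption rather than a description of how RedKnot's mechanisms actually work.

```python
def dense_kv_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for a dense KV cache: every head stores K and V for every token."""
    # Leading factor 2 accounts for the separate key and value tensors.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


def per_head_kv_bytes(tokens_per_head, num_layers, head_dim, bytes_per_elem=2):
    """Bytes when each head keeps its own token count (hypothetical layout)."""
    return sum(2 * t * num_layers * head_dim * bytes_per_elem for t in tokens_per_head)


if __name__ == "__main__":
    # Assumed dimensions for illustration only; not the config of any real model.
    cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

    # Dense cache: size scales linearly with the number of tokens.
    for n in (8_192, 32_768, 131_072):
        gib = dense_kv_bytes(n, **cfg) / 2**30
        print(f"{n:>7} tokens -> {gib:.2f} GiB (dense)")

    # Hypothetical per-head budgets: two heads keep the full 128K context,
    # the other six keep only 16K tokens each.
    budgets = [131_072] * 2 + [16_384] * 6
    dense = dense_kv_bytes(131_072, **cfg)
    sparse = per_head_kv_bytes(budgets, cfg["num_layers"], cfg["head_dim"])
    print(f"per-head layout uses {sparse / dense:.1%} of the dense cache")
```

The point of the sketch is only that a token-indexed cache forces every head to pay for every token, whereas a head-indexed cache makes per-head savings expressible in storage at all.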

This architectural change brings significant performance improvements. Test data shows that on a server with 8 H800 GPUs, RedKnot shortens the time to first token (TTFT) by a factor of 1.6 to 3.54 and raises single-GPU concurrency by a factor of 4.7 to 7.8. During the prefill phase, the amount of computation (FLOPs) is reduced by 67% to 79.5%. For the DeepSeek-V4-Flash model on a 128K long-context task, the reported TTFT improvement is 5.16 times and KV data transfer becomes 6.3 times more efficient, while inference accuracy stays above 95% of the dense model's performance.
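As a quick sanity check on how the percentage and multiplier figures relate, the short Python snippet below converts a fractional FLOPs reduction into the equivalent "times less compute" factor. The helper name is hypothetical, and the conversion assumes nothing beyond simple arithmetic; it does not model how TTFT actually responds to reduced FLOPs, which also depends on memory bandwidth and scheduling.

```python
def reduction_to_factor(reduction):
    """Convert a fractional reduction (e.g. 0.67 for 67%) to a 'times less' factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Prefill FLOPs reductions quoted in the article.
    for r in (0.67, 0.795):
        print(f"{r:.1%} fewer FLOPs -> about {reduction_to_factor(r):.2f}x less compute")
```

A 67% to 79.5% cut in prefill FLOPs therefore corresponds to roughly 3x to 4.9x less computation.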

Industry experts believe that the open-sourcing of RedKnot provides an important reference for the engineering optimization of inference engines. With computing resources increasingly scarce, easing the burden of long-text inference through fine-grained decomposition at the architecture level opens up a new technical path toward lighter and more efficient AI inference systems. The code has now been officially open-sourced, with the aim of promoting the adoption and deployment of long-text AI applications.