In modern industrial recommendation systems, Generative Retrieval (GR) based on large language models (LLMs) is gradually replacing traditional embedding-based retrieval. In practice, however, the approach faces a thorny problem: the model tends to hallucinate, generating product IDs that do not exist or that violate inventory logic.

To address this pain point, research teams from Google DeepMind and YouTube recently released a new framework called STATIC (Sparse Transition Matrix Accelerated Trie Index for Constrained Decoding). Through an innovative mathematical reformulation, the technique speeds up constrained decoding for LLMs by as much as 948 times.


Key Technological Breakthroughs:

  • Turning the "tree" into a "matrix": Traditional constraint checking walks a prefix tree (trie), whose pointer-chasing, branch-heavy access pattern runs poorly on hardware like GPUs/TPUs. STATIC flattens the tree into a static compressed sparse row (CSR) matrix, turning each verification step into the kind of vectorized lookup this hardware excels at.

  • Exceptional response speed: In tests with a 3-billion-parameter model, STATIC's single-step latency is as low as 0.033 milliseconds: nearly a thousand times faster than traditional CPU-based retrieval, and more than 40 times faster than existing hardware-accelerated approaches.

  • YouTube's successful trial: The technology has been deployed in YouTube video recommendations, where it enforces business constraints such as "uploaded within the past 7 days." In live testing, plays of fresh videos rose by 5.1%, and click-through rate (CTR) also grew significantly.
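The tree-to-matrix idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the STATIC implementation: the token IDs, vocabulary size, and helper names are invented for the example. It builds an ordinary trie over valid ID token sequences, flattens it into CSR-style arrays, and then answers "which tokens are valid next?" with a single slice plus scatter, the kind of operation that vectorizes well.

```python
# Sketch: flatten a trie of valid token-ID sequences into CSR arrays,
# so next-token validity becomes one array slice + scatter.
# Toy vocabulary and IDs; hypothetical, not the STATIC codebase.
import numpy as np

VOCAB = 10  # toy vocabulary of token ids 0..9

# Valid catalog IDs expressed as token sequences (toy data).
valid_ids = [(1, 4, 7), (1, 4, 8), (1, 5, 7), (2, 4, 9)]

# 1) Build an ordinary trie: node -> {token: child_node}.
trie = [{}]  # node 0 is the root
for seq in valid_ids:
    node = 0
    for tok in seq:
        if tok not in trie[node]:
            trie.append({})
            trie[node][tok] = len(trie) - 1
        node = trie[node][tok]

# 2) Flatten into CSR form: indptr[s]:indptr[s+1] slices the edges of state s.
indptr = np.zeros(len(trie) + 1, dtype=np.int64)
tokens, children = [], []
for s, edges in enumerate(trie):
    for tok in sorted(edges):
        tokens.append(tok)
        children.append(edges[tok])
    indptr[s + 1] = len(tokens)
tokens = np.array(tokens, dtype=np.int64)
children = np.array(children, dtype=np.int64)

def allowed_mask(state: int) -> np.ndarray:
    """Boolean vocab mask of tokens valid from `state` (slice + scatter)."""
    mask = np.zeros(VOCAB, dtype=bool)
    mask[tokens[indptr[state]:indptr[state + 1]]] = True
    return mask

def step(state: int, tok: int) -> int:
    """Advance the decoder state after accepting token `tok`."""
    sl = slice(indptr[state], indptr[state + 1])
    return int(children[sl][np.searchsorted(tokens[sl], tok)])

# From the root, only tokens 1 and 2 can start a valid ID.
print(np.flatnonzero(allowed_mask(0)))   # → [1 2]
s = step(0, 1)                           # consume token 1
print(np.flatnonzero(allowed_mask(s)))   # → [4 5]
```

The key property is that decoding no longer touches a pointer-based tree: the per-step work is an index slice into flat arrays, which batches and ports to accelerators far more naturally than trie traversal.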

In addition, STATIC shores up a known weakness of generative retrieval in the "cold start" phase. With exact decoding constraints, the model achieves a breakthrough in accuracy when recommending entirely new products it has never seen before.
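The general mechanism behind such decoding constraints can be shown in miniature. In this hedged sketch (all names and values are invented for illustration), a boolean mask of currently valid tokens, derived from whatever constraint index is in use, is applied to the model's logits before the next token is chosen, so the decoder can only ever emit tokens that keep the output inside the valid ID set.

```python
# Sketch of one constrained-decoding step: mask the logits with the set of
# currently valid tokens, then pick among the survivors. Toy values throughout.
import numpy as np

def constrained_argmax(logits: np.ndarray, allowed: np.ndarray) -> int:
    """Greedy pick restricted to allowed tokens: disallowed ones get -inf."""
    masked = np.where(allowed, logits, -np.inf)
    return int(np.argmax(masked))

rng = np.random.default_rng(0)
logits = rng.normal(size=10)        # stand-in for model output scores
allowed = np.zeros(10, dtype=bool)  # suppose only tokens 3 and 6 are valid here
allowed[[3, 6]] = True

tok = constrained_argmax(logits, allowed)
print(tok)  # always one of the allowed tokens, regardless of raw scores
```

Because invalid tokens are forced to negative infinity before selection, hallucinated IDs are structurally impossible, which is what makes the approach reliable even for brand-new items the model has never observed.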