Zhipu today officially launched the GLM-5.1 Highspeed API (GLM-5.1-highspeed) for selected enterprise customers. The model reaches an output speed of 400 tokens/s, a new high for output speed among large-model vendors' APIs worldwide.

This overturns the industry assumption that "high-performance models inevitably mean high latency" and that "high-speed models can only be lightweight models." GLM-5.1 Highspeed is the first large model in China to bring flagship-level capability and ultra-low latency into production at the same time, so users no longer have to trade model quality for response speed.


Disrupting the Traditional Experience, Tackling Speed-Sensitive Scenarios

In long-horizon tasks and complex production environments, the speed increase brings a qualitative change in product form:

  • AI Programming (Coding Agent): Building on the capabilities of GLM-5.1, the new model makes near-instant answers possible. It can understand engineering context while continuously generating code and revising solutions. In refactoring projects that require dozens of calls, it removes the several minutes of accumulated waiting time.

  • Real-Time Dynamic Modeling: In field tests on a 3D map, players move a character and enter text, and the model completes the modeling and updates the scene in real time.

  • Agent Swarm Parallel Scheduling: In long-horizon tasks, the model can process complex web pages within 30 seconds and immediately schedule 50 agents with different personas to answer in parallel, showing the early form of a new kind of operating system.
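
To make the "dozens of calls" point from the coding-agent scenario concrete, the arithmetic can be sketched as follows. Only the 400 tokens/s figure comes from the announcement; the call count, the tokens per call, and the 50 tokens/s baseline are hypothetical values chosen for illustration, and the calculation ignores network and prompt-processing time.

```python
def generation_seconds(calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Total time spent waiting on output tokens across sequential calls."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return calls * tokens_per_call / tokens_per_second


# Hypothetical workload: 30 sequential calls, 800 output tokens each.
CALLS = 30
TOKENS_PER_CALL = 800

baseline = generation_seconds(CALLS, TOKENS_PER_CALL, 50.0)    # assumed baseline speed
highspeed = generation_seconds(CALLS, TOKENS_PER_CALL, 400.0)  # speed quoted in the announcement

print(f"baseline:  {baseline / 60:.1f} min")
print(f"highspeed: {highspeed / 60:.1f} min")
print(f"saved:     {(baseline - highspeed) / 60:.1f} min")
```

Under these assumed numbers the output wait drops from 8 minutes to 1 minute. The saving grows linearly with the number of calls, which is why multi-step agent workflows benefit most.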

Core Technology Revealed: TileRT High-Performance Inference Engine

The stable, production-grade 400 TPS (tokens per second) comes from system-level optimization carried out jointly by the Zhipu GLM team and the TileRT team:

  1. Inference Engine Layer (TileRT compile-time AOT static scheduling):

    Mainstream frameworks traditionally use the operator (kernel) as the basic scheduling unit. In single-token and small-batch scenarios, this amplifies scheduling, memory-access, and synchronization overhead. TileRT abandons dynamic scheduling at the runtime layer entirely and, at compile time (AOT), statically schedules the whole computation graph into a single persistent GPU engine kernel. Within a single card, compute, asynchronous I/O, and communication are decomposed into tile-level micro-tasks. The entire inference pass launches only one kernel, and intermediate results are passed directly through registers, shared memory, and L2 cache without being written back to global memory.

  2. Scheduling System Layer:

    Dynamic batching, request merging, and optimized KV cache scheduling significantly reduce tail latency in high-concurrency scenarios.

  3. Infrastructure Layer:

    At multi-card scale, TileRT extends the idea of warp specialization within an SM to the entire 8-card NVL topology. Different GPU ranks are specialized into different workers according to computational density and data dependencies, and this is co-optimized with network links and load balancing to keep operation fast and stable.
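
The contrast drawn in the inference engine layer, between per-operator dispatch and a single statically scheduled launch, can be illustrated with a small CPU-side toy. This is a conceptual sketch only and not TileRT's implementation: the graph, the tile count, and all function names are hypothetical, and a Python loop stands in for a persistent GPU kernel.

```python
from graphlib import TopologicalSorter

# Toy computation graph: each op maps to the ops it depends on (hypothetical).
GRAPH = {
    "embed": [],
    "attn": ["embed"],
    "mlp": ["attn"],
    "logits": ["mlp"],
}
TILES_PER_OP = 4  # assumed tile count, purely illustrative


def compile_schedule(graph, tiles_per_op):
    """Ahead of time: flatten the graph into a fixed list of (op, tile) micro-tasks."""
    order = TopologicalSorter(graph).static_order()
    return [(op, tile) for op in order for tile in range(tiles_per_op)]


def run_persistent(schedule, run_tile):
    """Run time: one 'launch' walks the precomputed schedule, making no ordering decisions."""
    launches = 1
    for op, tile in schedule:
        run_tile(op, tile)
    return launches


def run_per_operator(graph, tiles_per_op, run_tile):
    """Contrast: one launch per operator, with the ordering resolved at run time."""
    launches = 0
    for op in TopologicalSorter(graph).static_order():
        launches += 1
        for tile in range(tiles_per_op):
            run_tile(op, tile)
    return launches


schedule = compile_schedule(GRAPH, TILES_PER_OP)
trace = []
persistent_launches = run_persistent(schedule, lambda op, t: trace.append((op, t)))
per_op_launches = run_per_operator(GRAPH, TILES_PER_OP, lambda op, t: None)
print(persistent_launches, per_op_launches, len(schedule))
```

Both paths execute the same 16 micro-tasks in the same order. The difference is that the static path pays for ordering once, before any request arrives, and issues a single launch, whereas the per-operator path issues one launch per operator on every pass.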

Availability

GLM-5.1 Highspeed is suited to AI programming, real-time interaction, business decision-making, and real-time voice scenarios that require extremely low response latency. The service is now officially available on the Zhipu MaaS Platform and is open to selected enterprise customers.
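
Teams evaluating the service may want to check output speed in their own environment. The helper below is a minimal sketch of how time to first chunk and output tokens per second could be measured from any streaming response. It is not part of any Zhipu SDK: the function name is hypothetical, and it assumes the caller can supply a token count for each streamed chunk.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[int],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (seconds_to_first_chunk, output_tokens_per_second).

    `chunks` yields the number of output tokens in each streamed chunk.
    """
    start = clock()
    first = None
    total = 0
    for token_count in chunks:
        now = clock()
        if first is None:
            first = now  # timestamp of the first streamed chunk
        total += token_count
    end = clock()
    if first is None:
        raise ValueError("stream produced no chunks")
    elapsed = end - start
    rate = total / elapsed if elapsed > 0 else float("inf")
    return first - start, rate


# Deterministic demo with a fake clock instead of a real network stream.
ticks = iter([0.0, 0.5, 1.0, 1.5, 2.0])
ttft, rate = measure_stream([200, 200, 400], clock=lambda: next(ticks))
print(ttft, rate)
```

In real use, `chunks` would wrap the streaming response of whichever client library is used, and the default `time.perf_counter` clock would replace the fake one.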