Artificial intelligence startup Inception Labs has announced Mercury2, a high-performance inference model whose underlying architecture represents a bold "paradigm shift."

The model breaks with the autoregressive decoding scheme used by today's mainstream Transformer-based large models, instead generating text with a diffusion process, aiming to break through the performance bottlenecks of traditional large models.

Unlike traditional models, which emit tokens one at a time, Mercury2 works more like an experienced editor: it drafts whole passages and then refines multiple text blocks in parallel, optimizing and rewriting them globally. This parallel decoding gives Mercury2 a remarkable speed advantage on complex logical reasoning tasks.
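The contrast between the two decoding styles can be sketched as a toy simulation. This is a minimal illustration, not Mercury2's actual algorithm: the vocabulary, masking scheme, and step counts below are invented for demonstration. The key point is that autoregressive decoding needs one model call per token, while diffusion-style decoding refines every position of a draft in a small, fixed number of passes.

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary for illustration

def autoregressive_generate(n_tokens):
    """Traditional decoding: one token per step, n_tokens sequential steps."""
    seq = []
    for _ in range(n_tokens):           # each step waits on the previous one
        seq.append(random.choice(VOCAB))
    return seq, n_tokens                # n_tokens sequential model calls

def diffusion_generate(n_tokens, n_steps=3):
    """Diffusion-style decoding: start from an all-masked draft and refine
    every position in parallel for a fixed, small number of steps."""
    seq = ["<mask>"] * n_tokens
    for _ in range(n_steps):            # each step rewrites the whole draft at once
        seq = [random.choice(VOCAB)
               if tok == "<mask>" or random.random() < 0.5
               else tok
               for tok in seq]          # "editor" pass: revise many positions together
    return seq, n_steps                 # n_steps passes, independent of length

_, ar_calls = autoregressive_generate(64)
_, diff_calls = diffusion_generate(64)
print(ar_calls, diff_calls)             # 64 sequential calls vs. 3 parallel passes
```

Because the refinement passes touch all positions at once, their cost maps naturally onto GPU parallelism, which is the intuition behind the speed numbers reported below.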

According to test data obtained by AIbase, running on NVIDIA Blackwell GPUs, Mercury2 reaches an astonishing generation speed of 1,009 tokens per second. In end-to-end latency tests, the model responds in just 1.7 seconds, more than eight times faster than Google's Gemini 3 Flash and far ahead of Anthropic's Claude Haiku 4.5. Despite this extreme speed, its quality remains competitive with today's top lightweight reasoning models on authoritative reasoning benchmarks such as GPQA Diamond and AIME.
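A quick back-of-envelope check shows how the reported throughput and latency relate. This is illustrative arithmetic only, assuming the quoted tokens-per-second rate applies uniformly to a streamed response:

```python
# Illustrative arithmetic only, not a benchmark: at the reported throughput,
# how long does a response of a given length take to stream?
THROUGHPUT_TPS = 1009        # tokens/second reported for Mercury2 on Blackwell

def streaming_time(n_tokens, tps=THROUGHPUT_TPS):
    """Seconds to stream n_tokens at a constant tokens-per-second rate."""
    return n_tokens / tps

# A roughly 1,700-token answer would stream in about the 1.7 s end-to-end
# latency the article reports.
print(round(streaming_time(1700), 2))   # → 1.68
```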

On the commercial side, Inception Labs has adopted an aggressive pricing plan, with input and output costs only a quarter of those of comparable competitors. Mercury2's API is now officially open, with support for an ultra-long 128,000-token context and tool calling. For voice assistants, search systems, and programming tools that demand extreme response speed, this "unconventional" diffusion reasoning model offers an attractive new option.
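The article does not document Mercury2's API schema, so the sketch below only builds a request body in the OpenAI-style chat-completions format that many model providers adopt. The model identifier, field names, and tool definition are all illustrative assumptions, not documented values:

```python
import json

# Hypothetical request body illustrating tool calling, assuming an
# OpenAI-style chat-completions schema. The model name and every field
# below are illustrative assumptions, not documented Mercury2 values.
request = {
    "model": "mercury-2",                        # assumed model identifier
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin?"}
    ],
    "tools": [                                   # tool-calling support
        {
            "type": "function",
            "function": {
                "name": "get_weather",           # hypothetical tool
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "max_tokens": 1024,   # well under the 128,000-token context window
}

body = json.dumps(request)   # this JSON string would be POSTed to the API
print(len(body) > 0)
```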

Summary:

  • 🌀 A revolution in the underlying architecture: it abandons traditional token-by-token generation in favor of diffusion-model technology, optimizing multiple text blocks globally and in parallel, a qualitative change in reasoning logic.

  • ⚡ Outstanding performance: responding in under two seconds on the latest hardware and generating over 1,000 tokens per second, with latency significantly better than Gemini 3 Flash and Claude Haiku 4.5.

  • 💰 High commercial cost-effectiveness: challenging the existing market landscape with extremely low costs, supporting long contexts and API access, and targeting latency-sensitive enterprise AI applications.