The global artificial intelligence community is seeing a wave of work centered on the concept of an "AI native language." Most current large models remain "language-centric," with visual or audio modules attached externally. In response, a large model development team has officially released and open-sourced a new natively multimodal large model, LongCat-Next, along with its core discrete tokenizer. The goal is to break down the barriers between modalities so that AI can perceive and understand the physical world as natively as it handles text.

The breakthrough lies in redefining the underlying architecture of AI. During their research, the team found that a semantically complete discrete representation can be constructed under a unified modeling framework and optimization objective. To that end, LongCat-Next introduces the DiNA (Discrete Native Autoregressive) architecture, which addresses the earlier limitation that multimodal information could only be "projected" into a language model rather than "internalized" by it. The architecture converts images, audio, and text into discrete tokens of the same kind, so that all modalities share the same parameters, attention mechanism, and loss function in the base model. Visual perception and creation, as well as auditory perception and speech generation, all reduce to a single objective, "Next Token Prediction (NTP)," which keeps the architecture simple and deployment lightweight.
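To make the idea of a shared token space concrete, here is a minimal sketch, not LongCat-Next's actual implementation. It assumes each modality has its own local codebook and that local ids are mapped into one shared vocabulary by fixed offsets. The vocabulary sizes, the offset scheme, and the function names `to_shared_ids` and `next_token_loss` are all hypothetical. Random logits stand in for the model output.

```python
import numpy as np

# Hypothetical vocabulary sizes; the article does not state the real ones.
SIZES = {"text": 1000, "image": 512, "audio": 256}
OFFSETS = {"text": 0, "image": 1000, "audio": 1000 + 512}
VOCAB = sum(SIZES.values())


def to_shared_ids(segments):
    """Map (modality, local_ids) segments to one sequence of shared-vocabulary ids."""
    out = []
    for modality, ids in segments:
        ids = np.asarray(ids, dtype=np.int64)
        if ids.min() < 0 or ids.max() >= SIZES[modality]:
            raise ValueError(f"id out of range for modality {modality!r}")
        out.append(ids + OFFSETS[modality])
    return np.concatenate(out)


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    The same loss applies to every position, whatever modality the token
    came from, which is the point of a single NTP objective.
    """
    logits = logits[:-1]            # the last position has no target
    targets = token_ids[1:]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


# An interleaved sequence: two text tokens, two image tokens, one audio token.
seq = to_shared_ids([("text", [5, 17]), ("image", [0, 511]), ("audio", [3])])
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(len(seq), VOCAB))  # stand-in for model output
loss = next_token_loss(fake_logits, seq)
```

In this sketch, nothing downstream of `to_shared_ids` knows which modality a token belongs to. That is one simple way to read the article's claim that all modalities share the same parameters and loss.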


In constructing "visual words," the team pioneered dNaViT (Discrete Native Resolution Visual Tokenizer). It supports arbitrary native resolutions and performs especially well on detail-sensitive tasks such as document parsing and complex chart reasoning. dNaViT uses 8-layer residual vector quantization (RVQ), compresses pixel space by up to 28 times, and employs a decoupled dual-track generative decoder to ensure high-fidelity reconstruction of images and text. This design closes the loop of "Image → Token → Image," allowing the model to learn and generate its own visual language internally, within the textual domain.
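For readers unfamiliar with residual vector quantization, the following toy sketch shows the general technique: each level quantizes whatever the previous levels left unexplained. It is not dNaViT's tokenizer. The codebooks are random rather than learned, and the codebook size, vector dimension, and helper names (`make_toy_codebooks`, `rvq_encode`, `rvq_decode`) are assumptions for illustration. Only the number of levels (8) comes from the article.

```python
import numpy as np


def make_toy_codebooks(levels=8, size=16, dim=4, seed=0):
    """Random codebooks with shrinking scale (hypothetical, not learned)."""
    rng = np.random.default_rng(seed)
    books = []
    for level in range(levels):
        cb = rng.normal(scale=0.5 ** level, size=(size, dim))
        cb[0] = 0.0  # zero entry: a level can always choose to add nothing
        books.append(cb)
    return books


def rvq_encode(x, codebooks):
    """Greedy RVQ: quantize x of shape (n, dim) into (n, levels) code indices."""
    residual = x.copy()
    codes = []
    for cb in codebooks:
        # Squared distance from each residual vector to each codebook entry.
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]  # pass what is left to the next level
    return np.stack(codes, axis=1)


def rvq_decode(codes, codebooks):
    """Reconstruct vectors by summing the selected entry from each level."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))


codebooks = make_toy_codebooks()
x = np.random.default_rng(1).normal(size=(32, 4))
codes = rvq_encode(x, codebooks)

# Reconstruction error using only the first level versus all eight levels.
err_1 = np.mean((x - rvq_decode(codes[:, :1], codebooks[:1])) ** 2)
err_8 = np.mean((x - rvq_decode(codes, codebooks)) ** 2)
```

Because every toy codebook contains a zero vector, adding a level can never increase the error here, so `err_8` is at most `err_1`. In a trained tokenizer, the codebooks are learned so that later levels capture progressively finer detail.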

On the widely recognized concern that "discretization inevitably leads to information loss," the team constructed an SAE (Semantic Alignment Encoder) for hierarchical fitting and used it to approximate high-dimensional continuous representations within a limited discrete space. The result indicates that discrete representations can also serve as a complete carrier for unified understanding and generation. In benchmark tests using LongCat-Flash-Lite MoE (68.5B total parameters, 3B activated parameters) as the base model, LongCat-Next demonstrated significant industrial potential in cross-modal collaboration. On OmniDocBench, it not only surpassed Qwen3-Omni but also outperformed the dedicated vision model Qwen3-VL, challenging the stereotype that discrete models are weak at fine-grained perception.

Moreover, the unified framework achieves cross-modal collaboration without compromising core language capabilities. Reported results show that LongCat-Next remains ahead on pure-text benchmarks such as MMLU-Pro and C-Eval; in tool calling and code writing, its SWE-Bench score significantly exceeds that of comparable models. In audio, the model achieves extremely low word error rates in Chinese and English speech synthesis on SeedTTS, and supports low-latency parallel text-to-speech generation and personalized voice cloning. As the model becomes fully open-sourced on GitHub and HuggingFace