Breaking the "English-centric" barrier in semantic representation has become the new battleground for large model evolution.

On March 26, the CodeFuse team at Ant Group and Shanghai Jiao Tong University officially launched the F2LLM-v2 series of embedding models. The series not only delivers dominant results on authoritative benchmarks but also offers a high-performance, highly efficient semantic representation solution for developers worldwide through a fully open-source release.


Outstanding Performance: 11 SOTA Results on MTEB

On MTEB, the most authoritative benchmark for evaluating embedding models, F2LLM-v2 demonstrated comprehensive advantages:

11 Champions: It ranked first on 11 language and domain leaderboards, including German, French, Japanese, and code retrieval.

Competitive Challenge: Even the lightweight members of the family repeatedly outperformed well-known industry models of the same size.

Comprehensive Coverage: The evaluation spans 430 sub-scenarios, from medical Q&A to code retrieval.


All-Round Understanding: Mastering 282 Natural Languages and Over 40 Programming Languages

The strength of F2LLM-v2 comes from its exceptionally broad training foundation:

Multi-language Enhancement: It particularly strengthens support for mid- and low-resource languages (such as the Nordic and Southeast Asian language families), achieving true global coverage.

Programming Expertise: It deeply understands over 40 programming languages, such as Python, Java, and Go, making it an ideal choice for RAG (Retrieval-Augmented Generation) and code assistant developers.

High-Quality Samples: It is trained on 60 million rigorously cleaned, publicly available samples, ensuring both the quality and breadth of the model's knowledge.
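For RAG and code-assistant scenarios like those above, an embedding model is typically used to rank candidate snippets by cosine similarity to a query. The sketch below shows that retrieval loop; the `embed` function here is a deterministic toy stand-in (hashed bag-of-words), not the actual F2LLM-v2 API, which would replace it in practice.

```python
import hashlib
import re

import numpy as np

def embed(text, dim=64):
    """Toy stand-in for a real embedding model such as F2LLM-v2.
    Hashes word tokens into a fixed-size L2-normalized vector so the
    retrieval logic below is runnable without model weights."""
    vec = np.zeros(dim)
    for token in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, corpus, top_k=1):
    """Rank corpus entries by cosine similarity to the query embedding.
    Vectors are unit-length, so a dot product is the cosine similarity."""
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in corpus]
    ranked = sorted(zip(corpus, scores), key=lambda pair: -pair[1])
    return ranked[:top_k]

corpus = [
    "def quicksort(arr): ...",
    "SELECT * FROM users WHERE id = 1",
    "def binary_search(arr, target): ...",
]
print(retrieve("binary_search arr target", corpus))
```

With a real model, only `embed` changes; the ranking step stays the same.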


Extreme Efficiency: A Full-Scale Model Family from 80M to 14B

To meet the needs of scenarios from mobile devices to cloud computing, the CodeFuse team developed a complete model matrix:

Mobile-Friendly: Small models ranging from 80M to 330M parameters are built with model pruning and knowledge distillation, allowing them to run smoothly on mobile devices.

"Nested" Black Technology: It supports dynamic dimension adjustment, allowing users to freely switch between 8 dimensions and full dimensions, finding the perfect balance between inference speed and storage cost.

Completely Open Source: Transparency Defines Community Standards

Unlike many "black box" models, F2LLM-v2 commits to a fully open-source approach:

Full Release: All model weights of every size are available for download.

Transparency in Details: A complete technical report is published, detailing the entire training process.

Reproducibility: All code and checkpoints are released, encouraging researchers worldwide to build on them.

Conclusion: Breaking Barriers, Exploring the Infinite Possibilities of AI

As another major achievement of the CodeFuse Open Source Series, the release of F2LLM-v2 breaks the English-centric barrier in semantic representation and opens new possibilities for developers worldwide.