Tencent Youtu Lab has recently open-sourced a text representation model called Youtu-Embedding, aimed at improving efficiency in enterprise-grade intelligent customer service and knowledge-base management. The model extracts information effectively, mitigating the hallucination problem that large models exhibit in specialized domains. This issue is common in enterprise applications: when users ask domain-specific questions, a model grounded only in general corpora may generate irrelevant answers.


Youtu-Embedding addresses the performance degradation models suffer when moving across domains: even a model well-trained on general corpora may perform markedly worse in specialized fields such as law and medicine. To tackle this pain point, Tencent trained the model from scratch on up to 3 trillion tokens of Chinese and English text, laying a solid foundation for its language understanding. In addition, Tencent supplied rich manually annotated data to ensure the model's applicability to real business scenarios.

To better capture users' true intent, Tencent introduced large-scale weakly supervised training. Through this training, Youtu-Embedding learns to recognize sentences that are phrased differently but share the same intent, mapping them close together in the semantic space. For example, "How long is the warranty for this product?" and "Can it be repaired for free if it breaks?" are worded differently, yet both ask about the warranty policy.
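The idea of "close in semantic space" can be made concrete with cosine similarity over embedding vectors. The sketch below uses tiny hand-picked 4-dimensional vectors invented purely for illustration; a real model such as Youtu-Embedding would produce high-dimensional embeddings from the text itself.

```python
import math

# Hypothetical, hand-picked embeddings for illustration only; a real
# embedding model outputs high-dimensional vectors computed from the text.
DEMO_EMBEDDINGS = {
    "How long is the warranty for this product?": [0.9, 0.1, 0.4, 0.1],
    "Can it be repaired for free if it breaks?":  [0.8, 0.2, 0.5, 0.1],
    "What colors does this product come in?":     [0.1, 0.9, 0.1, 0.4],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

q_warranty = DEMO_EMBEDDINGS["How long is the warranty for this product?"]
q_repair   = DEMO_EMBEDDINGS["Can it be repaired for free if it breaks?"]
q_colors   = DEMO_EMBEDDINGS["What colors does this product come in?"]

# Same intent lands nearby in the space; a different intent lands farther away.
print(cosine_similarity(q_warranty, q_repair) > cosine_similarity(q_warranty, q_colors))
```

With well-trained embeddings, the two warranty questions score much higher against each other than against the unrelated color question, which is exactly the mapping behavior the weakly supervised training aims for.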

For multi-task training, Tencent designed an innovative fine-tuning framework so the model can adapt to different task requirements. It uses a unified data format with differentiated loss functions per task, effectively strengthening capabilities such as text similarity, retrieval, and classification. Meanwhile, a dynamic sampling mechanism lets the model allocate training resources sensibly across tasks, achieving balanced development on all of them.
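One common way to balance tasks of very different sizes is temperature-smoothed sampling. The sketch below uses made-up dataset sizes and is not Tencent's actual implementation; it only illustrates how a dynamic sampler can up-weight small tasks so they are not drowned out during training.

```python
import random

# Invented dataset sizes, for illustration only.
TASK_SIZES = {"retrieval": 1_000_000, "similarity": 50_000, "classification": 200_000}

def sampling_weights(sizes, alpha=0.5):
    """Raise each dataset size to a power alpha < 1 so small tasks are
    up-weighted relative to their raw share, then normalize to probabilities."""
    smoothed = {task: n ** alpha for task, n in sizes.items()}
    total = sum(smoothed.values())
    return {task: w / total for task, w in smoothed.items()}

weights = sampling_weights(TASK_SIZES)

# Draw a reproducible schedule of which task each training batch comes from.
rng = random.Random(0)
tasks, probs = zip(*weights.items())
batch_schedule = rng.choices(tasks, weights=probs, k=10)
print(weights)
print(batch_schedule)
```

With alpha = 0.5, the "similarity" task's share rises well above its raw 4% of the data, which is the balancing effect a dynamic sampling mechanism is after; each sampled batch would then be trained with the loss function matching its task type.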

Youtu-Embedding has already achieved a high score of 77.46 on CMTEB (the Chinese Massive Text Embedding Benchmark), making it one of the top-performing Chinese embedding models. The model suits a variety of scenarios, including intelligent Q&A, content recommendation, and knowledge management, and shows great potential for building Retrieval-Augmented Generation (RAG) systems.
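In a RAG system, the embedding model's job is the retrieval step: rank stored documents by similarity to the query embedding and hand the top hits to a generator. The toy 3-dimensional vectors below are invented for the demo; a real system would obtain them from an embedding model such as Youtu-Embedding.

```python
import math

# Toy document store: text plus a hypothetical embedding vector.
DOCS = [
    ("The product carries a 12-month warranty.", [0.9, 0.1, 0.2]),
    ("Orders ship within 3 business days.",      [0.1, 0.9, 0.1]),
    ("Returns are accepted within 30 days.",     [0.7, 0.2, 0.6]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, k=2):
    """Rank documents by cosine similarity to the query and return the top-k
    texts, which a RAG pipeline would insert into the generator's prompt."""
    ranked = sorted(DOCS, key=lambda doc: cosine(query_vec, doc[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical embedding of the query "How long is the warranty?"
query_vec = [0.85, 0.15, 0.3]
print(retrieve(query_vec))
```

The quality of this ranking is exactly what benchmarks like CMTEB measure: better embeddings put the relevant passages at the top, so the generator answers from the right context instead of hallucinating.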

Tencent Youtu Lab continues to focus on the development of open-source technologies. In addition to Youtu-Embedding, the lab has also launched projects such as Youtu-Agent and Youtu-GraphRAG, providing developers with more tools and resources to promote the rapid development of AI applications.

Project: https://github.com/TencentCloudADP/youtu-embedding

Key Points:  

🌟 Youtu-Embedding is an open-source text representation model developed by Tencent, aimed at improving the efficiency of enterprise intelligent customer service and knowledge base management.  

🔍 The model enhances its understanding of user intent through large-scale weakly supervised training and a collaborative multi-task fine-tuning framework.  

📈 On CMTEB, the Chinese Massive Text Embedding Benchmark, Youtu-Embedding achieved a high score of 77.46, demonstrating its strong performance and application potential.