BAAI Introduces BGE M3-Embedding: Revolutionary Advancements in Text Embedding!

BAAI, with researchers from the University of Science and Technology of China, has released BGE M3-Embedding. The new method addresses shortcomings of existing text embedding models in language support, retrieval functionality, and input granularity.

Features of BGE M3-Embedding

Existing models, like Contriever and GTR, have made significant advances in the field but struggle to support multiple languages, varied retrieval needs, and long input texts. BGE M3-Embedding overcomes these limitations by supporting more than 100 languages, multiple retrieval functionalities, and inputs of up to 8,192 tokens.

Implementation of M3-Embedding

M3-Embedding uses an innovative self-knowledge distillation approach and optimizes batching strategies for large input lengths. It supports three main retrieval functionalities: dense, lexical, and multi-vector retrieval. The distillation process combines relevance scores from these functionalities to help the model efficiently perform multiple retrieval tasks.
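The combination of the three retrieval signals can be illustrated with a small sketch. This is a toy illustration, not the model's actual implementation: the vectors and term weights below are made-up values (in the real model they come from a shared encoder), and the unweighted average used to integrate the three scores into the distillation teacher signal is an assumption for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_score(q_vec, d_vec):
    # Dense retrieval: similarity between single sentence-level embeddings.
    return dot(q_vec, d_vec)

def lexical_score(q_weights, d_weights):
    # Lexical retrieval: sum the products of term weights for tokens
    # that appear in both the query and the document.
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def multi_vector_score(q_vecs, d_vecs):
    # Multi-vector retrieval: for each query token vector, take its best
    # match among the document's token vectors, then average (ColBERT-style).
    return sum(max(dot(q, d) for d in d_vecs) for q in q_vecs) / len(q_vecs)

def combined_score(q, d):
    # Integrated relevance score combining the three functionalities.
    # NOTE: a plain average is an assumption here; the actual combination
    # weights are a training detail of the method.
    s_dense = dense_score(q["vec"], d["vec"])
    s_lex = lexical_score(q["weights"], d["weights"])
    s_mul = multi_vector_score(q["vecs"], d["vecs"])
    return (s_dense + s_lex + s_mul) / 3.0

# Toy example: one query and one document with hypothetical representations.
query = {"vec": [1.0, 0.0], "weights": {"cat": 0.5}, "vecs": [[1.0, 0.0]]}
doc = {"vec": [0.8, 0.2],
       "weights": {"cat": 0.4, "dog": 0.1},
       "vecs": [[0.6, 0.0], [0.9, 0.1]]}

score = combined_score(query, doc)
```

In the self-knowledge distillation setup described above, an integrated score of this kind serves as the teacher signal that each individual retrieval head learns from, which is what lets a single model serve all three retrieval modes.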

Evaluation and Conclusion

In evaluations, the model outperformed existing models in more than 10 languages, achieved comparable results in English, and performed better on longer texts. The M3 model is a versatile solution that addresses crucial limitations of existing methods, making it a substantial step forward in information retrieval.

For more details, check out the Paper and GitHub.
