BAAI, with assistance from researchers at the University of Science and Technology of China, has released BGE M3-Embedding. The new model addresses shortcomings of existing text embedding models in language support, retrieval functionality, and input granularity.
Features of BGE M3-Embedding
Existing models, like Contriever and GTR, have made significant advances in the field but struggle to support multiple languages, varied retrieval needs, and long input texts. BGE M3-Embedding overcomes these limitations by supporting more than 100 languages, multiple retrieval functionalities, and inputs of up to 8192 tokens.
Implementation of M3-Embedding
M3-Embedding uses an innovative self-knowledge distillation approach and optimizes its batching strategy to handle long inputs efficiently. It supports three retrieval functionalities: dense, lexical (sparse), and multi-vector retrieval. During self-knowledge distillation, the relevance scores from these three functionalities are integrated into a single teacher signal, which in turn supervises each individual functionality, helping the model perform multiple retrieval tasks well at once.
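To make the three functionalities concrete, here is a minimal, self-contained sketch of how each score can be computed and then integrated into one relevance score. This is an illustration of the general idea, not BGE's actual implementation: the vectors, token weights, and equal weighting in `combined_score` are toy assumptions for clarity.

```python
# Toy sketch of M3-Embedding's three retrieval scores and their combination.
# All inputs (embeddings, term weights, combination weights) are illustrative.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def dense_score(q_cls, p_cls):
    # Dense retrieval: similarity between single query/passage embeddings.
    return dot(q_cls, p_cls)

def lexical_score(q_weights, p_weights):
    # Lexical (sparse) retrieval: sum of term-weight products over the
    # tokens shared by query and passage.
    return sum(w * p_weights[t] for t, w in q_weights.items() if t in p_weights)

def multivector_score(q_vecs, p_vecs):
    # Multi-vector retrieval (ColBERT-style late interaction): for each
    # query token embedding, take its max similarity over all passage
    # token embeddings, then average over query tokens.
    return sum(max(dot(q, p) for p in p_vecs) for q in q_vecs) / len(q_vecs)

def combined_score(s_dense, s_lex, s_mul, w=(1 / 3, 1 / 3, 1 / 3)):
    # Integrated score: a weighted sum of the three scores. In the paper's
    # self-knowledge distillation, a score like this serves as the teacher
    # signal for each individual retrieval head. Equal weights are a toy choice.
    return w[0] * s_dense + w[1] * s_lex + w[2] * s_mul
```

For example, with a dense score of 0.6, a lexical score of 0.3, and a multi-vector score of 0.9, the equally weighted combined score is 0.6.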
Evaluation and Conclusion
In evaluations, the model outperformed existing models in more than 10 languages, achieved comparable results in English, and performed better on longer texts. M3-Embedding is a versatile solution that addresses crucial limitations of existing methods, making it a substantial step forward for information retrieval.