Nomic AI has released a new text-embedding model called Nomic Embed. The model is open-source and high-performing, with an extended context length that supports tasks such as retrieval-augmented generation (RAG) and semantic search. Nomic Embed is especially significant because it tackles a long-standing challenge: building a fully open text embedding model that outperforms the leading closed-source alternatives.
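To make the retrieval use case concrete, here is a minimal sketch of embedding-based semantic search: documents and a query are represented as vectors, and the closest documents under cosine similarity are returned. The 4-dimensional vectors below are toy stand-ins, not actual Nomic Embed output (real embeddings have hundreds of dimensions).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `a` and each row of `b`."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def retrieve(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 2) -> list:
    """Return the indices of the k documents most similar to the query."""
    scores = cosine_similarity(query_emb[None, :], doc_embs)[0]
    return np.argsort(-scores)[:k].tolist()

# Toy 4-dimensional "embeddings" standing in for real model output.
docs = np.array([
    [0.90, 0.10, 0.0, 0.0],
    [0.00, 0.80, 0.2, 0.0],
    [0.85, 0.15, 0.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])
print(retrieve(query, docs))  # → [0, 2]
```

In a RAG system, the retrieved documents would then be passed to a language model as context; the embedding model's quality determines how relevant that context is.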
One distinctive feature of Nomic Embed is its 8192-token context length, far longer than the short windows of most existing open embedding models. The model's reproducibility and transparency also make it stand out in the field of AI.
Nomic Embed was trained through a multi-stage contrastive learning pipeline. It starts with training a BERT model with a 2048-token context length, named nomic-bert-2048. That model is then contrastively trained on a large corpus of text pairs, followed by training on high-quality labeled datasets with hard-example mining. As a result, Nomic Embed outperforms existing models on various benchmarks, showcasing its competitive edge.
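At the heart of such a pipeline is a contrastive objective: each query embedding is pulled toward its paired positive while the other examples in the batch serve as negatives. Below is a minimal NumPy sketch of an InfoNCE-style loss with in-batch negatives, a common formulation for contrastive embedding training; it is an illustration of the general technique, not Nomic's actual training code, and the temperature value is an assumption.

```python
import numpy as np

def info_nce_loss(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: q[i] should match d[i],
    while every other d[j] in the batch acts as a negative."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature             # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # -log p(correct pair)

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
positives = queries + 0.01 * rng.normal(size=(4, 8))  # near-duplicates as positives
print(info_nce_loss(queries, positives))  # low loss: each pair already aligns
```

Hard-example mining strengthens this objective by deliberately selecting negatives that score highly against the query, forcing the model to learn finer distinctions than random in-batch negatives provide.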
One of the main highlights of Nomic Embed is its emphasis on transparency and reproducibility: Nomic AI has released the model weights, the training code, and the curated training data, demonstrating a commitment to openness in AI development.
With its strong performance on long-context tasks, Nomic Embed is positioned to advance the field of text embeddings and to push evaluation practices toward longer-context benchmarks. This makes it a promising addition to the landscape of AI models.