Nomic Embed: Training a Reproducible Long Context Text Embedder
Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data English text embedding model with an 8192-token context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short- and long-context tasks. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we also release a training data loader with 235 million curated text pairs that allows for full replication of nomic-embed-text-v1. Code and data to replicate the model are available at https://github.com/nomic-ai/contrastors
arXiv.org Artificial Intelligence
February 2, 2024
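For readers who want to try the released model, below is a minimal sketch of computing embeddings with it. It assumes the weights are published on the Hugging Face Hub under the id nomic-ai/nomic-embed-text-v1 and that inputs carry a task prefix such as search_document: (neither detail appears in the abstract above); consult the contrastors repository for authoritative usage.

```python
# Minimal sketch: embedding documents with the released model.
# Assumptions (not stated in the abstract): the Hugging Face model id
# "nomic-ai/nomic-embed-text-v1" and the "search_document:" task prefix.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = [
    "search_document: Nomic Embed is an 8192-token-context text embedding model.",
    "search_document: It was trained contrastively on 235 million curated text pairs.",
]

embeddings = model.encode(docs)  # numpy array of shape (2, embedding_dim)
print(embeddings.shape)
```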