Choosing the Best Embedding Models for RAG and Document Understanding

Nathanael Chiang
April 7, 2025
Tutorials

Understanding Embedding Models

Embedding models are machine learning models that convert data into numerical vectors. These vectors, known as embeddings, can represent various kinds of objects, such as text, images, and audio. Text embedding models, for example, map text to vectors that capture its semantic meaning and context: texts with similar meanings are mapped to nearby points, while unrelated texts end up far apart. This lets computers compare text based on meaning rather than exact word matches.

Embedding models enable a variety of natural language processing tasks by providing these vector representations. Because semantically similar texts end up with vectors that are close together, one can use embeddings for tasks such as semantic text retrieval, text classification, document clustering, and measuring semantic textual similarity. These models bridge the gap between human language and machine processing by translating unstructured text into structured numerical form. Modern embedding models are typically based on transformer neural networks that produce context-aware embeddings, so the same word may have different vector representations depending on where it appears. For instance, the word “bank” will have different embeddings in “river bank” and “bank account” because its meaning differs between the two contexts. This context sensitivity is what lets these models approach a more human-like understanding of text and pick out its themes and intended meanings.
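To make "close together" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are invented for illustration; real models produce hundreds or thousands of dimensions, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", made up for this example (not model output).
river_bank = [0.9, 0.1, 0.2]
lake_shore = [0.8, 0.2, 0.3]
bank_account = [0.1, 0.9, 0.4]

sim_related = cosine_similarity(river_bank, lake_shore)
sim_unrelated = cosine_similarity(river_bank, bank_account)
print(f"river bank vs. lake shore:   {sim_related:.3f}")
print(f"river bank vs. bank account: {sim_unrelated:.3f}")
```

With these made-up vectors, the two water-related phrases score much higher than the unrelated pair, which is the behavior a good embedding model aims for.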

Types of Embedding Models

Embedding models come in two broad categories: proprietary (closed-source, provided via APIs or licensed software) and open-source models. Each category has its pros and cons in terms of cost, privacy, dependency, and flexibility.

Proprietary Embedding Models

Proprietary embedding models are offered as a service (often via an API) by companies and include models like OpenAI’s text embedding APIs or Cohere’s embedding models. For example, OpenAI recently introduced two new models, a highly efficient text-embedding-3-small and a more powerful text-embedding-3-large, as successors to their earlier text-embedding-ada-002 model.

Proprietary models typically have usage costs: you pay per token or per API call. OpenAI’s pricing, for instance, after recent cost reductions, is around $0.13 per million tokens for its large model. This means that if you embed one million text segments of about 1,000 tokens each (one billion tokens in total), you will spend around $130 in API fees. While these API models are convenient and often perform well out of the box, the costs can add up for large-scale use.
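The arithmetic behind that estimate can be sketched as a small helper. The function name is hypothetical, and the price is the approximate figure quoted above, so treat it as an assumption and check current pricing before relying on it.

```python
def embedding_cost_usd(num_segments, tokens_per_segment, usd_per_million_tokens):
    """Estimate API cost for embedding a corpus, ignoring retries and overhead."""
    total_tokens = num_segments * tokens_per_segment
    return total_tokens / 1_000_000 * usd_per_million_tokens

# One million segments of about 1,000 tokens each at roughly $0.13 per million tokens.
cost = embedding_cost_usd(
    num_segments=1_000_000,
    tokens_per_segment=1_000,
    usd_per_million_tokens=0.13,
)
print(f"Estimated cost: ${cost:,.2f}")
```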

On the upside, you do not need to host the model or manage infrastructure, as the provider handles all computation and updates. However, you are dependent on the vendor, so service availability, latency, and model changes are out of your control. Every request also involves a network round trip, so users may occasionally see slow responses from a hosted service such as OpenAI’s embedding API.

Open-Source Embedding Models

Open-source embedding models have publicly available architectures and weights, so they can be run on your own hardware or cloud. This can be highly cost-effective at scale or for long-running applications. Because you have complete control over the model and your data, you also gain privacy and data security: no data needs to be sent to an external API.

Open-source models also typically allow full customization, so users can fine-tune the model or even modify the code, which is not possible with closed APIs. This flexibility extends to deployment: you can optimize inference by quantizing the model or deploying it on specialized hardware as needed.
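As a rough illustration of what quantization trades away, the sketch below applies symmetric int8 quantization to a short list of floats. Real toolchains quantize whole weight tensors with more refined schemes; the function names and sample values here are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]

# Hypothetical weights, made up for this example.
weights = [0.12, -0.5, 0.33, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print("int8 values:", quantized)
print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each value now fits in one byte instead of four (for fp32), at the cost of a small rounding error bounded by half the scale factor.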

Embeddings in RAG Systems

Retrieval-augmented generation (RAG) is a technique that combines information retrieval with generative AI. It pairs an LLM (Large Language Model) with a document retriever that supplies relevant context, so that responses are grounded in facts. This helps keep LLMs from generating text that sounds plausible but is not factual, a failure known as hallucination. Embedding models are a core component of RAG systems: they convert both user queries and documents into vectors for semantic search.

In a typical RAG pipeline, documents are first encoded into embedding vectors using an embedding model and stored in a vector database. When a user sends a query, the same embedding model encodes the query into a vector as well. The system then performs a vector similarity search. It compares the query’s embedding to the stored document embeddings using techniques such as cosine similarity to retrieve the most semantically relevant documents. Those top-ranked documents are then forwarded to the generative model, which will incorporate them into the answer.
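The pipeline above can be sketched in a few lines. Note the assumptions: `toy_embed` is a hypothetical word-count stand-in for a real embedding model (so it only matches shared words, not meaning), and a plain Python list stands in for the vector database. Only the flow of index, embed query, rank, and build prompt is meant to carry over.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_embed(text, vocabulary):
    """Stand-in for a real embedding model: word counts over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index(documents, embed):
    """Encode every document once and keep (document, vector) pairs."""
    return [(doc, embed(doc)) for doc in documents]

def retrieve(query, index, embed, top_k=2):
    """Encode the query with the same embedder and return the top_k closest documents."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, vec), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

documents = [
    "Embedding models map text to numerical vectors.",
    "A vector database stores document embeddings for similarity search.",
    "Bananas are rich in potassium.",
]
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})

def embed(text):
    return toy_embed(text, vocabulary)

index = build_index(documents, embed)
context = retrieve("Where are document embeddings stored?", index, embed, top_k=1)

# The retrieved text would be passed to the generative model as context.
prompt = "Answer using this context:\n" + "\n".join(context)
print(prompt)
```

In a real system, `embed` would call the chosen embedding model and `index` would live in a vector database, but the query and the documents must still go through the same model.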

Using dense retrieval with embeddings provides significant improvements over traditional keyword search. The system can find information with similar meanings, even if the query and documents don’t share the exact same keywords.

Choosing the Best Embedding Model

With many options available, how do you choose the best embedding model for your RAG use case? The “best” model depends on how several factors balance against your specific use case and requirements.

Semantic Accuracy

Semantic accuracy is arguably the most important factor: how well does the model capture meaning and surface useful information? It can be estimated with benchmarks such as MTEB (Massive Text Embedding Benchmark), which reports an average score across tasks as well as retrieval-specific scores that indicate how well the model finds relevant texts. While benchmarks are a good starting point, keep in mind that no single model dominates all tasks or domains, and a model that is best on average may not be the best for your specific domain. It is therefore crucial to prioritize models that do well on tasks similar to yours.
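One way to check a model on tasks similar to yours is to label a handful of your own queries with their relevant documents and compute a retrieval metric such as recall@k. The sketch below is not how MTEB scores models; the function names, document IDs, and results are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(results, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in results]
    return sum(scores) / len(scores)

# Hypothetical rankings from two candidate models on the same two labeled queries.
model_a_results = [(["d1", "d7", "d3"], ["d1", "d3"]), (["d4", "d2", "d9"], ["d2"])]
model_b_results = [(["d7", "d8", "d1"], ["d1", "d3"]), (["d9", "d4", "d2"], ["d2"])]

recall_a = mean_recall_at_k(model_a_results, k=2)
recall_b = mean_recall_at_k(model_b_results, k=2)
print("model A recall@2:", recall_a)
print("model B recall@2:", recall_b)
```

Even a few dozen labeled queries from your own domain can reveal differences that a leaderboard average hides.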

Model Size and Memory

The memory (RAM/VRAM) needed to load an embedding model and perform inference is dictated by the size of the model. Smaller models with a few hundred million parameters or fewer can often run on a CPU, but larger models, such as a 7B model, may require a high-end GPU to run efficiently. For most use cases, a medium-sized model of around 500M to 1B parameters may be sufficient once it is fine-tuned, so jumping straight to a multi-billion-parameter model might be overkill.
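A common rule of thumb is that the weights alone need roughly the parameter count times the bytes per parameter. The sketch below applies that estimate; it ignores activations, batch size, and framework overhead, so actual usage will be higher. The helper name and the precision table are assumptions for illustration.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="fp16"):
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("110M", 110e6), ("1B", 1e9), ("7B", 7e9)]:
    print(f"{name} parameters at fp16: ~{weight_memory_gb(params, 'fp16'):.2f} GB")
```

Under this estimate a 7B model needs about 14 GB for fp16 weights alone, which is why it tends to call for a high-end GPU, while a 110M model fits comfortably in ordinary RAM.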

The size of the model can also affect speed. Smaller models may be preferable when your application needs to embed text in real-time or embed large volumes of data quickly. However, larger models can benefit from GPU batching and provide higher accuracy. It is essential to test different embedding models to find the right balance between efficiency and accuracy since a slightly less accurate model that is five times faster may be a better choice for production.
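A simple timing harness is often enough to compare candidates on speed. In this sketch, `embed_batch` is a hypothetical callable that takes a list of strings and returns their embeddings; the dummy implementation below exists only so the example runs on its own, and you would swap in each real model you want to measure.

```python
import time

def measure_throughput(embed_batch, texts, batch_size=32, repeats=3):
    """Return the best observed texts-per-second over several timed runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(0, len(texts), batch_size):
            embed_batch(texts[i:i + batch_size])
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(texts) / elapsed)
    return best

def dummy_embed_batch(batch):
    """Placeholder embedder: returns a one-dimensional vector per text."""
    return [[float(len(text))] for text in batch]

sample_texts = [f"example document number {i}" for i in range(1000)]
throughput = measure_throughput(dummy_embed_batch, sample_texts)
print(f"{throughput:,.0f} texts/second")
```

Measure on the hardware and batch sizes you plan to use in production, since relative speed can change between CPU and GPU.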

Dimensionality

Different models output vectors of different lengths (384, 768, 1024, etc.). This dimensionality is usually fixed by the model architecture. Higher-dimensional embeddings can encode more information, which may improve accuracy, especially on diverse semantic content. However, they also use more storage and can slow down similarity computations, and beyond a certain point the returns diminish. It is worth considering whether the slight gain from very high-dimensional output justifies the cost in resources.
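The storage side of this trade-off is easy to estimate. The sketch below assumes vectors are stored as uncompressed 32-bit floats and ignores index overhead and metadata; the function name and the corpus size of one million vectors are hypothetical.

```python
def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, in gigabytes (10^9 bytes)."""
    return num_vectors * dimensions * bytes_per_value / 1e9

num_vectors = 1_000_000  # hypothetical corpus size
for dims in (384, 768, 1024, 4096):
    print(f"{dims:>4} dimensions: {index_size_gb(num_vectors, dims):.3f} GB")
```

Going from 384 to 4096 dimensions multiplies raw storage by more than ten for the same number of documents, before any index overhead.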

Popular Open-Source Embedding Models

There are many types of open-source embedding models. Some popular models include:

  • E5-large-v2 (intfloat/e5-large-v2)
  • SFR-Embedding (Salesforce/SFR-Embedding-2_R)
  • GTE models (Alibaba-NLP/gte-Qwen2-7B-instruct)
  • BGE models (BAAI/bge-base-en-v1.5)
  • Jina Embeddings v2 (jinaai/jina-embeddings-v2-base)

E5-large-v2 (intfloat/e5-large-v2)

E5-large-v2 is an embedding model developed by Microsoft with 24 layers (~350M parameters) and an embedding size of 1024. It performs well on semantic search and textual similarity and often outperforms older baseline models, but it is no longer the top model: its MTEB average score is lower than that of newer, larger models, although it remains competitive for most applications. The multilingual-e5-large model is an extension of the E5 model that supports over 100 languages and provides cross-lingual retrieval. Since the model is MIT-licensed, it is available for commercial use.

SFR-Embedding (Salesforce/SFR-Embedding-2_R)

Salesforce AI Research released the SFR-Embedding-2_R model, and it quickly became one of the top performers. It produces 4096-dimensional embeddings and had a reported MTEB average score of about 67.6 across 56 datasets in 2024, which was one of the highest at the time. SFR-Embedding is a multitask model trained on diverse tasks such as information retrieval, clustering, and classification. The downside is that it is a 7B model with 4096-dimensional vectors, so it is very resource intensive: inference is slower and memory use is heavier than with other models. Note also that this model is released for research purposes only.

GTE models (Alibaba-NLP/gte-Qwen2-7B-instruct)

gte-Qwen2-7B-instruct is fine-tuned from the Qwen family of foundation models specifically for embedding tasks, with an emphasis on multilingual use. It is a 7-billion-parameter model with an embedding dimension of 3584 and a maximum input length of 32,000 tokens. It achieved state-of-the-art performance at the time of its release, ranking #1 on the MTEB leaderboard for both English and Chinese. The smaller gte-Qwen2-1.5B-instruct has about 1.5 billion parameters and offers quality close to that of the 7B model at a fraction of the memory and computational cost, which might be ideal for those with limited GPU resources.

BGE models (BAAI/bge-base-en-v1.5)

The BGE models are developed by the Beijing Academy of Artificial Intelligence (BAAI) for a variety of embedding tasks. bge-base-en-v1.5 is an English embedding model with roughly 110M parameters that transforms text into a 768-dimensional vector. This makes it lightweight to run: it provides fast, memory-efficient retrieval and semantic similarity performance that is significantly better than older baseline models. It offers a good balance for English document understanding tasks when resources are limited, and these models are released under permissive open licenses.

Jina Embeddings v2 (jinaai/jina-embeddings-v2-base)

The jina-embeddings-v2-base model was released by Jina AI and is targeted at long-document understanding. The base-en model is an English encoder based on a custom JinaBERT architecture with a context window of 8192 tokens, much larger than the 512 tokens typical of BERT-style models. It is relatively small, with about 137M parameters, and it produces embeddings optimized for both accuracy and efficiency.

There are also domain-specific versions, such as the jina-embeddings-v2-base-code model, which covers English and multiple programming languages. Overall, the Jina v2 models are good choices when you need to embed very long texts without chunking and when you need a lightweight model. The trade-off is that they may not reach the raw accuracy of huge models such as gte-Qwen2 or SFR on short-text benchmarks, but for most real-world use cases, their performance is likely sufficient.

Conclusion

Choosing the right embedding model depends on many factors: accuracy, speed, size, domain fit, and flexibility all matter. The best approach is usually iterative: use benchmarks to build a shortlist, then run targeted experiments to narrow down your options. By evaluating models against the criteria above and testing them on your specific use case, you can confidently select an embedding model that meets your needs and delivers high-quality document understanding for your RAG application.
