Embedding models comparison

Kamil Rzechowski

11 Jun 2024 · 6 minutes read


A Retrieval Augmented Generation (RAG) chatbot uses a knowledge base to answer questions about information the model has no clue about. The knowledge base is usually an organization's internal documentation, which the model has never seen during training. The documentation is given to the model as context inside the prompt, which lets the model answer questions correctly. Since the model's prompt length is limited, the documentation has to go through a selection process before being appended to the prompt as context. The quality of the generated responses depends heavily on this selection process. In this article, I will focus on embedding models, which are the core part of the document retrieval system.

For the purpose of this article, I ran experiments on the Tapir documentation. I created a test dataset of query (question) and documentation paragraph pairs. Each pair relates to a different part of Tapir’s documentation. I evaluated the models based on top-3 accuracy.
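The top-k accuracy metric used throughout the article can be sketched in a few lines. The retrieved document ids and gold labels below are made up for illustration, not taken from the Tapir dataset:

```python
# Top-k accuracy: a query counts as a hit when its gold paragraph
# appears among the first k retrieved documents.
def top_k_accuracy(retrieved_ids: list[list[int]], gold_ids: list[int], k: int = 3) -> float:
    hits = sum(gold in ids[:k] for ids, gold in zip(retrieved_ids, gold_ids))
    return hits / len(gold_ids)

retrieved = [[4, 2, 9], [1, 7, 3], [5, 8, 0]]  # per-query ranked doc ids
gold = [2, 8, 0]                               # correct paragraph per query
print(top_k_accuracy(retrieved, gold, k=3))    # 2 of 3 queries hit
```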


"question": "What happens if decoding a path input fails?"
## Decoding failures

If decoding a path input fails, a `400 Bad Request` will be returned to the user. When using the default decode
failure handler, this can be customised to instead attempt decoding the next endpoint, by adding an attribute to the 
path input with `.onDecodeFailureNextEndpoint`.

Alternatively, another strategy can be implemented by using a completely custom decode failure handler. Both
topics are covered in more detail in the documentation of [error handling](

Retriever architecture

The retriever architecture consists of a bi-encoder model and, optionally, a cross-encoder. The bi-encoder encodes the query as a feature vector and compares the vector's distance to the other documents’ vectors inside the vector database. Taking the above example, we take the query "What happens if decoding a path input fails?", pass it through the embedding model, and get the feature vector [0.0023, -0.3412, -0.0826, 0.7933, ...] in response. The same operation is performed on all documents, the feature vectors are compared with each other, and the k closest documents are returned.
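The nearest-neighbour step can be sketched with plain NumPy. The 4-dimensional vectors below are toy values; in a real system they would come from the embedding model (e.g. an OpenAI or HuggingFace bi-encoder) and the search would run inside a vector store:

```python
import numpy as np

# Toy "embeddings" standing in for real model outputs.
doc_vecs = np.array([
    [0.00, -0.34, -0.08, 0.79],   # "Decoding failures" paragraph
    [0.91,  0.12,  0.40, -0.05],  # unrelated paragraph
    [0.10, -0.30, -0.10, 0.75],   # "Error handling" paragraph
])
query_vec = np.array([0.00, -0.33, -0.09, 0.80])

# L2 distance between the query and every document; return the k closest.
dists = np.linalg.norm(doc_vecs - query_vec, axis=1)
top_k = np.argsort(dists)[:2]
print(top_k.tolist())  # indices of the two most relevant documents
```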


Image 1. Visualization of text embedding vectors in 2D space. Embeddings that are semantically similar lie close to each other. Therefore, retrieving documents with the smallest distance to the query embedding yields the documents most relevant to the user query. Embedding distance is usually measured using L2 distance or cosine similarity.

The cross-encoder is used for re-ranking. It takes both the query and the document text as input and outputs a document similarity score. It is much slower, but at the same time more accurate. Therefore, the bi-encoder is used for first-stage document selection and the cross-encoder for fine-grained document selection. Combining the power of the two, we can get both high accuracy and speed.
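The two-stage pipeline looks roughly like this. In a real system the scoring function would be a cross-encoder model (e.g. BAAI/bge-reranker-large, evaluated later in the article); the `overlap_score` below is a crude word-overlap stand-in so the sketch stays self-contained:

```python
# Stand-in for a cross-encoder: scores a (query, document) pair.
def overlap_score(query: str, doc: str) -> float:
    q = set(query.lower().replace(",", "").replace("?", "").split())
    d = set(doc.lower().replace(",", "").replace("?", "").split())
    return len(q & d) / len(q)

# Second stage: re-rank the first-stage candidates by pair score.
def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    return sorted(candidates, key=lambda d: overlap_score(query, d), reverse=True)[:top_k]

candidates = [  # e.g. the top documents returned by the vector-store search
    "Server options and interpreters",
    "If decoding a path input fails, a 400 Bad Request is returned",
    "JSON support via circe",
]
print(rerank("What happens if decoding a path input fails?", candidates, top_k=1))
```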


Image 2. Retriever module architecture.

Custom vs. Cloud API embedder

When choosing the embedder model, one can go for a paid cloud API solution like the OpenAI embeddings model, or use a custom, self-hosted model. Custom models can be chosen and implemented using, for example, the HuggingFace (HF) leaderboard and the HF API.

Public cloud models provide an easy-to-use API, community support, scalability, and great performance, but cost a lot. On the other hand, implementing and deploying your own embedding model is more work at the beginning, but can save you a lot of money in the long run and is much more secure. Below I will compare the performance of both solutions.


| Model | Price |
| --- | --- |
| OpenAI text-embedding-3-small | $0.02 / 1M tokens |
| OpenAI text-embedding-3-large | $0.13 / 1M tokens |
| Custom model | Infrastructure cost |

Table 1. Price comparison cloud API vs self-hosted model.

Bi-encoder evaluation results

For evaluation purposes, I decided to use L2 distance.
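The choice matters less than it may seem: for unit-normalized embeddings, L2 distance and cosine similarity produce the same ranking, since ||a − b||² = 2 − 2·cos(a, b). A quick check on two unit vectors:

```python
import numpy as np

a = np.array([0.6, 0.8])  # both vectors already have unit norm
b = np.array([1.0, 0.0])

l2_sq = np.sum((a - b) ** 2)  # squared L2 distance
cos = np.dot(a, b)            # cosine similarity of unit vectors

# The identity ||a - b||^2 == 2 - 2 * cos(a, b) holds.
print(l2_sq, 2 - 2 * cos)
```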


Table 2. Evaluation results for different embedding models on document retrieval tasks.

As we can see, GPT embedding models perform the best. However, the difference becomes small at the top-5 accuracy. Therefore, it might be worth comparing results with the additional re-ranking step.


Re-ranking evaluation results

Re-ranking is an additional step that boosts the retrieval engine by re-ranking the top-k results from the vector store search. For re-ranking, the cross-encoder architecture is used, which allows for a better measurement of query-document similarity. I tested two different re-ranking models with the top-performing embedder in each category: {custom embedder, public API embedder}. I used the top-15 documents from the similarity search step and evaluated re-ranking on the top-3 document match.


Table 3. Evaluation results for custom and cloud API embedding models + two different reranking models. Top-15 documents were returned from a similarity search and top-3 accuracy was computed on the output of the re-ranking model.

The re-ranking module significantly improved the custom embedding model and only slightly changed the performance of the GPT model. The custom model almost reached the GPT model's performance. The smaller re-ranking gain for the GPT embedding model seems to indicate that the cross-encoder has reached its performance limit, and further improvement could be achieved only by fine-tuning the cross-encoder on a custom dataset or by using a better cross-encoder model. Taking into account that BAAI/bge-reranker-large is considered one of the best in its class, fine-tuning seems like the most promising option.

Time and VRAM consumption

The re-ranking step allowed us to reach the OpenAI embedder's performance. Let’s check the speed and resources needed, to validate whether implementing a custom model is still worth it versus using the OpenAI API.


Table 4. The evaluation was performed on RTX 3090 for custom models and with cloud API for the OpenAI embedding model.

Looking at the gray fields in the table, we can see that the custom model + re-ranking takes almost the same time as the API request to the OpenAI embedder model without re-ranking. It is not an entirely fair comparison, because the custom model's performance was measured locally while the API call carries the additional overhead of internet communication, but the custom model + re-ranking still looks like a promising approach. By combining a custom model with re-ranking, we can achieve comparable results with a solution that is cheaper in the long term and guarantees data privacy.

Prompt impact

The prompt has proven to be a powerful tool for LLM-based chatbots. Let’s verify how well it can do for document retrieval engines.

A single dataset record, created based on the Tapir documentation, consists of Header1, Header2, ..., HeaderN, and the paragraph content.


A few experiments were carried out to find the optimal prompt for the bi-encoder retrieval system.


Table 5. Retrieval module accuracy depending on the embedding prompt.

As can be seen, the prompt highly influences retrieval performance. Encoding the page content alone results in only 77% of documents being matched, while adding headers to the content boosts the accuracy to 84%. Furthermore, highlighting the headers with markdown syntax gives the best results.
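The best-performing variant, headers rendered as markdown followed by the paragraph content, can be built like this (the helper name and example record are illustrative, not taken from the article's code):

```python
# Build the text that gets embedded: nested headers as markdown
# headings ("#", "##", ...), followed by the paragraph content.
def embedding_text(headers: list[str], content: str) -> str:
    header_lines = [f"{'#' * (i + 1)} {h}" for i, h in enumerate(headers)]
    return "\n".join(header_lines + [content])

print(embedding_text(
    ["Server", "Decoding failures"],
    "If decoding a path input fails, a 400 Bad Request is returned.",
))
```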

More about prompt engineering for retrieval systems can be found in the article How to improve document matching when designing a chatbot?


Summary

The retrieval module is the most important component of a RAG chatbot. Good-quality document matches allow for meaningful responses from the instruction-tuned LLM. Building a robust and accurate retrieval engine consists of designing a prompt, choosing and deploying the embedding model together with the vector store, and finally adding the re-ranking module. Each step should be evaluated, and the best-performing setup should be chosen.

In the article, we also compared the public cloud API vs. a custom model implementation. Even though the cloud API achieves the best accuracy, with a few tricks we can nearly match its performance with a custom model, at a fraction of the cost and with fast inference.

At SoftwareMill, we create solutions tailored to customer needs. Whether you need a quick integration with a public API, or you want to get the most out of your data and implement a custom model with fine-tuning, we can implement it. We have established partnerships with the main cloud providers and can provide discounts and free credits for starting your project. If you have any questions, please feel free to contact us. We are more than happy to help you out. We also encourage you to check out our portfolio and related articles.

Reviewed by: Rafał Pytel
