What kind of AI Chatbot do you need?
An AI Chatbot is an instruction-tuned Large Language Model (LLM). Such a chatbot can provide fast, accurate, and relevant answers to the questions it receives. However, it is limited to the information, tone, format, and style it was trained on.
The Retrieval Augmented Generation (RAG) technique connects the AI Chatbot to your organization's internal, private documentation. The resulting RAG-Chatbot can answer questions based on the provided knowledge base: users ask questions and get answers grounded in the documentation, even though the model never saw it during training (your internal documentation stays private and safe!). For each user query, relevant documents are retrieved from the knowledge base and appended to the query as context before it is fed to the AI Chatbot.
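To make the mechanics concrete, here is a minimal sketch of the retrieve-then-generate loop in Python. It assumes the sentence-transformers library for embeddings; the document snippets are placeholders, and the final chat-completion call is left out because it depends on which deployment option you pick below.

```python
# Minimal RAG sketch: embed documents, retrieve the closest ones,
# and prepend them to the user query as context.
import numpy as np
from sentence_transformers import SentenceTransformer  # example embedder

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Employees get 26 paid vacation days per year.",
    "VPN access requires a hardware token from the IT desk.",
    "Expense reports are submitted through the internal portal.",
]
doc_vectors = embedder.encode(documents)  # shape: (n_docs, dim)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embedder.encode([query])[0]
    scores = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    )
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Assemble the augmented prompt sent to the chat model."""
    context = "\n".join(retrieve(query))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# The assembled prompt is then sent to any chat-completion endpoint.
print(build_prompt("How many vacation days do I get?"))
```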
Building a RAG-Chatbot requires parsing all of the organization's documentation into the correct format and making a few design choices. Figure 1 presents the possible solutions for RAG-Chatbot design and deployment.
Figure 1. Ways to host your RAG-Chatbot.
The most suitable solution should be chosen depending on the use case. Below, I will review each approach and discuss its pros and cons.
RAG-Chatbot solutions
Public API (LLM SaaS)
Cloud API-based solutions are great in terms of development speed, Proof-of-Concept validation, and low-traffic usage. They also offer regular model updates free of charge. However, they come with several limitations:
- Limited customization options.
- Not cost-efficient at very high traffic.
- Possibly not safe enough for highly confidential data.
Foundation model
This is the best approach in most cases. It is quick to implement and works reasonably well. You pay as you go (per request) plus a small fixed fee. It is the cheapest option if traffic is uneven (high usage peaks with long idle periods) or fairly small (fewer than 10,000 requests per day). It also guarantees automatic updates to newer model versions, improving answer quality over time. Using Vertex AI from Google Cloud or Bedrock from AWS guarantees your data privacy. This is the best approach for building a RAG-Chatbot Proof-of-Concept for your organization and validating whether it is the right solution for you, or simply for running a conversational chatbot with moderate traffic that does not require customization.
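As an illustration, a pay-per-request call to a managed foundation model looks roughly like this, using AWS Bedrock and the Titan model as one example (Vertex AI offers an equivalent flow on GCP). Request and response shapes differ per model, so verify them against the provider's documentation.

```python
# Pay-per-request call to a managed foundation model via AWS Bedrock.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId="amazon.titan-text-express-v1",  # a Bedrock-hosted foundation model
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        "inputText": "Summarize our vacation policy: ...",  # query + RAG context
        "textGenerationConfig": {"maxTokenCount": 512, "temperature": 0.2},
    }),
)
result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
```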
Custom model
This solution is a natural extension of the previous case but can also be built as the target solution from the beginning. It allows for fine-tuning the model on the customer's data and, therefore, achieving even better RAG-Chatbot performance or functionalities that otherwise wouldn't be possible. Model fine-tuning is the right choice if:
- You want the model to act in a certain way.
- You need it to copy a certain tone of voice.
- You need output data in a specific format.
- You need to optimize model performance (e.g., implement RAG without a knowledge base).
- You need to force the model to do something it was not trained for.
Most importantly, model fine-tuning should be performed only when prompting and the knowledge-base option fail.
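If fine-tuning is the right call, the work starts with a supervised dataset. Below is a minimal sketch of prompt/completion training examples written as JSONL; exact field names and schemas differ between providers, so treat these as illustrative.

```python
# Illustrative fine-tuning examples as prompt/completion pairs in JSONL.
# Field names differ between providers (Bedrock and Vertex AI each document
# their own schema), so check the target service's docs before formatting.
import json

examples = [
    {"prompt": "Customer: Where is my invoice?\nAgent:",
     "completion": " You can download invoices from the billing page."},
    {"prompt": "Customer: How do I reset my password?\nAgent:",
     "completion": " Use the 'Forgot password' link on the login screen."},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```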
The custom model comes with greater expenses. We need to prepare the dataset in the correct format for training, validation, and testing. Training costs vary depending on the model type, dataset size, and infrastructure provider, but can be roughly estimated at $0.0008 per 1,000 tokens or ~$20/h. You can find details about LLM fine-tuning costs here for AWS and here for GCP.
Additional costs arise for model serving. We can no longer use on-demand pricing: the custom model has an hourly base charge of ~$7-20/h. For this price, we get a private model that is fully available to us.
Fine-tuning a model using a cloud provider's API speeds up development but comes with optimization limitations. Only selected models are available for fine-tuning, and engineers cannot get under the hood to tweak the model any further. Detailed information about model fine-tuning using a cloud API can be found here, based on the example of AWS Bedrock and the Titan model.
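For orientation, a fine-tuning (model customization) job on Bedrock can be launched with a call along these lines. The bucket paths, IAM role ARN, and hyperparameter values are placeholders, and the available hyperparameters depend on the base model.

```python
# Sketch of launching a Bedrock model customization (fine-tuning) job.
# All resource names below are placeholders for your own AWS setup.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.create_model_customization_job(
    jobName="rag-chatbot-finetune",
    customModelName="titan-support-tone",
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",  # placeholder
    baseModelIdentifier="amazon.titan-text-express-v1",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},  # placeholder
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},        # placeholder
    hyperParameters={"epochCount": "2", "learningRate": "0.00001"},
)
```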
Self-hosted
Self-hosted RAG-Chatbots give us full control over the model and the inference pipeline. They offer high customization capabilities and data privacy, allowing for easy fine-tuning of both the LLM and the embedder models and for the creation of advanced retrieval algorithms. After creating a custom RAG-Chatbot, the question is where to host it. We have two options: in the cloud or on-premise. The pros and cons of those approaches are presented below.
Cloud self-hosted
This solution is easy to maintain, scalable, and reliable, as the cloud provider takes care of infrastructure maintenance. However, it comes with greater expenses than an on-premise solution. A basic setup costs about $1.2/h and can go up to $12/h per instance, and one instance can serve up to 50k requests a day.
On-premise
The on-premise, self-hosted model provides the highest data privacy. There are no calls to external APIs, and all data stays within the organization's network. The self-hosted model gives full control over the RAG-Chatbot and allows for any kind of customization. Running costs are low, as we only pay for electricity and hardware upkeep. Maintenance effort and scalability are the largest drawbacks of this solution.
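As a rough sketch of what self-hosting involves, an open-weights model can be served with the Hugging Face transformers library. The model name below is just an example; production setups typically add a dedicated inference server, batching, and quantization on top of this.

```python
# Minimal self-hosted generation with an open-weights model; everything
# runs on your own hardware, so no data leaves the network.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example open-weights model
    device_map="auto",  # spread layers across available GPUs/CPU
)

prompt = "Answer using the context below.\nContext: ...\nQuestion: ..."
output = generator(prompt, max_new_tokens=256, do_sample=False)
print(output[0]["generated_text"])
```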
Figure 2. Decision diagram for RAG-Chatbot implementation.
Do I need a custom model?
Fine-tuning can significantly enhance the performance of the model on specific tasks. By exposing the model to task-specific data during fine-tuning, it learns to make more accurate predictions for that task. For example, fine-tuning a pre-trained model on medical texts yields a model specialized in medical language understanding. Fine-tuning on domain-specific data helps the model incorporate domain knowledge, improving its understanding and generation capabilities within that domain. Overall, fine-tuning lets you leverage the strengths of pre-trained models while tailoring them to your specific needs, leading to better performance and efficiency on your particular tasks.
Figure 3. Model optimization diagram.
Model fine-tuning should be a deliberate decision. Usually, the RAG-Chatbot development process follows the flow presented in Figure 3. We start with prompt engineering, the lowest-hanging fruit, and take as much from it as possible. We then move to Retrieval Augmented Generation using a connected knowledge base; at this point, we get good-quality answers grounded in our internal organizational knowledge. Finally, we may decide to improve performance further by fine-tuning the model on our data, ending up with a best-in-class solution.
Limitations of prompt engineering
Prompt engineering is always the first choice for model customization because it is cheap and quick to implement. Often it is also sufficient, and nothing more is needed. The downside is that it increases the context length, and you may hit the model's context limit. If, after trying prompt engineering, model performance is still not good enough, you should consider fine-tuning; it is particularly good at copying style and tone. Fine-tuning is a lot more work, though, because you have to prepare a dataset in the appropriate format and train the model on hardware that is not cheap either.
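One practical guardrail is to count tokens before sending the assembled prompt. The sketch below uses tiktoken's cl100k_base encoding as an approximation; the actual tokenizer and context limit depend on the model you call, so the limit here is an assumed example.

```python
# Check how close an assembled prompt is to the model's context window.
import tiktoken

CONTEXT_LIMIT = 8192  # assumed limit for the target model

enc = tiktoken.get_encoding("cl100k_base")  # approximation of the tokenizer
system_prompt = "You are a helpful support agent. Answer politely..."
retrieved_context = "...long excerpts from the knowledge base..."
question = "How do I configure VPN access?"

prompt = f"{system_prompt}\n{retrieved_context}\n{question}"
n_tokens = len(enc.encode(prompt))
print(f"{n_tokens} tokens ({n_tokens / CONTEXT_LIMIT:.0%} of the window)")
```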
Running costs comparison
Table 1 presents estimated costs and performance for each deployment mode.
Table 1. Running costs and performance comparison among 4 different deployment options.
AWS pricing: https://aws.amazon.com/bedrock/pricing/
GCP pricing: https://cloud.google.com/vertex-ai/pricing#custom-trained_models
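To see where a flat hourly charge starts to pay off against per-request API pricing, a quick break-even estimate helps. The numbers below are illustrative ballpark figures in the spirit of Table 1, not current price quotes.

```python
# Rough break-even estimate: pay-per-request API vs. flat-rate instance.
# All numbers are illustrative placeholders, not current provider prices.
PRICE_PER_1K_TOKENS = 0.0008      # $ per 1,000 tokens (API, pay-as-you-go)
TOKENS_PER_REQUEST = 2_000        # prompt + completion, rough average
INSTANCE_COST_PER_DAY = 1.2 * 24  # $1.2/h basic self-hosted instance

cost_per_request = PRICE_PER_1K_TOKENS * TOKENS_PER_REQUEST / 1_000
break_even = INSTANCE_COST_PER_DAY / cost_per_request
print(f"API cost per request: ${cost_per_request:.4f}")
print(f"Break-even: ~{break_even:,.0f} requests/day")  # ~18,000/day here
```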
Data privacy
Data privacy depends on the services you use and the cloud provider. Therefore, it is extremely important to carefully read the terms of the agreement before using any cloud provider's API. Below, we present the current state of the three leading cloud solution providers.
Google Cloud
- Data passed in Gemini API calls outside of Vertex AI may be used for training and may be read by human reviewers. Therefore, do not submit sensitive, confidential, or personal information to the service. The Gemini API terms of service can be found here.
- Inside Vertex AI, designed for enterprise-grade support, you access the Vertex AI Gemini API. Its terms of service prohibit using your data to train new foundation models.
AWS
- According to Bedrock's terms of service, input and output to the Titan model are not used for foundation model training.
- “By default, AWS Chatbot stores and processes user data such as AWS Chatbot configurations, notifications, user inputs, AWS Chatbot-generated responses, and interaction data. This data helps AWS Chatbot continuously improve and develop both itself and Artificial Intelligence (AI) technologies.” (source).
Azure
- Azure OpenAI does not store or use your data. (source)
- OpenAI can use your data for training new models. (source)
In summary, enterprise-grade solutions don't use input/output data, while public APIs may use input data for model training and AI improvement. It is worth mentioning that all the privacy policies mentioned above remain subject to local law: companies are obligated to share data if required by law or a binding order of a governmental body.
Summary
Table 2. Pros and cons of Cloud API vs self-hosted RAG-Chatbot.
At SoftwareMill, we create solutions tailored to customer needs. Whether you need your RAG-Chatbot implemented quickly, expect low traffic, or want a custom solution that gets the best out of your data, we can implement it. We have established partnerships with leading cloud providers and can provide discounts and free credits to start your project. If you have any questions, please feel free to contact us; we are more than happy to help. We also encourage you to check out our portfolio and related articles.