What can you get from fine-tuning an LLM?
When working in the natural language processing domain, and more specifically with chatbots and Large Language Models, at a certain point you hear about terms like prompt engineering and model fine-tuning as ways to customise or improve your system. Still, it is often hard to understand what kind of improvement to expect from which step.
In this blog post, we won't just delve into theoretical concepts. We'll take a practical approach, exploring steps for improving chatbots in the medical domain. We'll start by outlining our methodology, and then we'll dive into real-world examples. These examples cover various medical questions, each analysed across three setups: a plain large language model, a prompt-engineered one, and a fine-tuned LLM. By the end, you'll clearly understand how these setups compare and what impact fine-tuning has on a chatbot's performance.
Typical steps for improving chatbots
When improving chatbots, we follow the pipeline shown in the graphic above.
Prompt engineering and few-shot prompting
We first start with prompt engineering, which can be defined as adding control sentences or words to guide the LLM and consequently get better answers. Prompt engineering allows the implementation of a persona, which significantly improves the quality of answers. Few-shot prompting, on the other hand, can be explained as providing example input-output pairs. Both are marked in the illustrative sketch below.
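Here is a minimal sketch of such a prompt; the persona sentence and the example pairs are hypothetical wording of ours, not the exact prompt from our experiments:

```python
# Illustrative prompt combining a persona (prompt engineering) with
# example pairs (few-shot prompting); all wording is hypothetical.
persona = "You are an expert in the medical domain."  # persona / control sentence

few_shot_examples = [  # example input-output pairs (few-shot prompting)
    ("What vitamin is produced in the skin under sunlight?", "Vitamin D."),
    ("Which organ produces insulin?", "The pancreas."),
]

question = "What is the most common cause of community-acquired pneumonia?"

prompt = persona + "\n\n"
for q, a in few_shot_examples:
    prompt += f"Q: {q}\nA: {a}\n\n"
prompt += f"Q: {question}\nA:"

print(prompt)
```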
On the other hand, as we add longer and longer prompts, we might hit the limits of our LLM. Current models have context windows of a few thousand tokens (4,096 for LLaMA-2); with RAG-retrieved facts and the chat history kept in context, we can reach that limit quickly.
Additionally, research is gaining traction showing that a very long context is not always beneficial, as facts in the middle might get lost (more about that in “Lost in the Middle: How Language Models Use Long Contexts”). A long context also prolongs the processing time, which might not be favourable in some applications.
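As a concrete illustration, a rough check of whether a combined prompt still fits the context window might look like the sketch below (the 1.3 tokens-per-word ratio is a crude heuristic of ours, not an exact tokenizer count):

```python
# Rough sketch: estimating whether a combined prompt still fits the
# model's context window. The 4,096-token limit matches LLaMA-2; the
# 1.3 tokens-per-word ratio is a crude heuristic, not an exact count.
CONTEXT_LIMIT = 4096  # LLaMA-2 context window, in tokens

def fits_context(system_prompt: str, history: list[str], retrieved_facts: list[str]) -> bool:
    full_prompt = "\n".join([system_prompt, *history, *retrieved_facts])
    estimated_tokens = len(full_prompt.split()) * 1.3  # words -> tokens heuristic
    return estimated_tokens <= CONTEXT_LIMIT

# Example: a persona plus chat history plus RAG-retrieved facts
print(fits_context("You are a medical expert.",
                   ["Q: What causes anaemia?", "A: Often iron deficiency."],
                   ["Fact: anaemia is a shortage of red blood cells."]))
```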
Improve the knowledge base
Another good area for improvement is extending the knowledge base. This step requires quite an effort: finding the right data and making sure its facts do not contradict those already present is a lot of work. Sometimes it is also beneficial to critically review the current knowledge base and correct facts that are either ambiguous or plain wrong (which happens more often than one thinks).
Fine-tune a model - improving model performance
Now we arrive at the holy grail of many problems: fine-tuning. LLMs are typically pre-trained on large amounts of data available on the internet, e.g., Wikipedia or Common Crawl (keeping their licenses in mind when using the LLM in production). Using that data, the model is trained without the need for labels, i.e., with next-token prediction (given the words “LLMs are”, predict the next word).
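To make the objective concrete, the sketch below asks a causal language model for its most likely next token; GPT-2 stands in because it is small and openly available, but the principle is the same for LLaMA-2:

```python
# Minimal sketch of next-token prediction, the self-supervised
# objective used in pre-training (GPT-2 as a small stand-in model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("LLMs are", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Distribution over the vocabulary for the token following "LLMs are"
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```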
Within fine-tuning, we can distinguish at least two different types for Q&A LLMs:
- LLM fine-tuning, where we use large amounts of available data and apply methods similar to the pre-training step, but let the pre-trained model specialise a bit in our target domain.
- Task-specific fine-tuning (otherwise called supervised fine-tuning), where a Q&A dataset for our domain is provided and we train a model to answer questions (a condensed training sketch follows this list). In the experimentation phase, we will explore a model trained specifically on medical Q&A datasets (e.g., MedQA has about 12k question-answer pairs, while the more general-purpose MedMCQA has about 190k labelled examples).
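A condensed sketch of such supervised fine-tuning with the Hugging Face stack is shown below; the dataset file and the small base model are placeholders, and real runs on 7B models additionally need memory-saving techniques such as LoRA or quantisation:

```python
# Sketch of task-specific (supervised) fine-tuning on a Q&A dataset.
# "medical_qa.json" and the GPT-2 base model are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def to_features(example):
    # Concatenate question and answer into one training sequence
    text = f"Q: {example['question']}\nA: {example['answer']}"
    return tokenizer(text, truncation=True, max_length=512)

dataset = load_dataset("json", data_files="medical_qa.json")["train"]
dataset = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-finetuned", num_train_epochs=3),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```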
Medical domain characteristics
When considering the medical domain, our first associations are the words “difficult” and “complex” (unless you are a medical professional, in which case this area is much more straightforward for you than for the average Joe). The main reasons are:
- the need for complex and specialised knowledge - there is a vast amount of specialised terminology, complex biological processes, and complicated relationships between symptoms, diseases, and treatments.
- high cost of errors - in the medical domain, there is no room for mistakes, as with hallucinations or incorrect answers, lives might be at stake.
- dynamic and evolving field - this area constantly changes with new research, treatments, and guidelines. This makes building a reasonable knowledge base with up-to-date facts even harder.
- ethical and legal considerations - there is a risk of discrimination or biased output in the medical domain.
- data quality and availability - medical data is difficult to acquire, as its privacy is of great concern. And, as mentioned before, the field keeps changing, so older datasets might be outdated in certain areas.
- ambiguity and lack of context - the answer “it depends” comes up more often than not in this area, as we often need broader context to answer a question (e.g., the patient's medical history), and cases in the knowledge base can contradict each other.
Compared setups
In this comparison, we will explore three different setups: a bare question to LLaMA-2, LLaMA-2 with prompt engineering, and Meditron (LLaMA-2 fine-tuned on medical data). To keep the comparison fair, all setups use 7B models.
Bare LLM model
LLaMA-2 is quite a popular model, with a large portion of open-source models using it as a base in the past 8 months (ranging from Alpaca, an instruction fine-tuned version, and Vicuna, a conversation fine-tuned version, to Meditron and Galen, medical-domain fine-tuned versions). With this base model, we use the default prompting scheme with no added system instructions.
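For illustration, a bare query in this setup could look as follows, assuming the models are served behind a local Ollama server (an assumption of this sketch; the Meditron:7B tag used later follows the same naming convention):

```python
# Minimal sketch: querying the bare LLaMA-2 7B model with no system
# instructions, assuming a local Ollama server on the default port.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2:7b",
        "prompt": "What is the most common cause of community-acquired pneumonia?",
        "stream": False,  # return the full answer in one response
    },
)
print(response.json()["response"])
```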
Prompt engineering
Prompt engineering is the practice of structuring an instruction so that it can be interpreted and understood by a generative AI model. It can be as simple as adding specialised phrases or words like “You are an expert in the medical domain.”
Prompt engineering also includes tricks like Chain-of-Thought or few-shot prompting.
In our example, we use a system prompt, i.e., an instruction controlling the system's behaviour.
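A system prompt of this kind might read as follows (an illustrative sketch, not necessarily our verbatim prompt):

```python
# Illustrative system prompt implementing a medical-expert persona;
# a sketch of the kind of instruction used, not the exact wording.
SYSTEM_PROMPT = (
    "You are an expert in the medical domain. "
    "Answer the question accurately and concisely, "
    "and say so explicitly if you are not sure about the answer."
)
```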
Model fine-tuned on Q&A medical data
Last but not least is the model fine-tuned on medical Q&A datasets, called Meditron (currently at the top of medical leaderboards). For each question provided, we wrap it in instructions to unleash the full potential of this model.
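Such an instruction wrapper might look like the hypothetical sketch below; the wording is ours, shown only to convey the idea of steering the fine-tuned model:

```python
# Hypothetical instruction wrapper for Meditron; illustrative wording,
# not the verbatim template from our experiments.
MEDITRON_INSTRUCTION = (
    "You are a helpful medical assistant. "
    "Answer the following medical question directly and concisely:\n"
    "{question}"
)

prompt = MEDITRON_INSTRUCTION.format(
    question="What is the most common cause of community-acquired pneumonia?"
)
print(prompt)
```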
Results
In this section, we will review various errors in medical-related Q&A, hinting at what we can expect from prompt engineering and fine-tuning.
Errors in answers
Example number 1: three different answers, one from each model, with wrong answers from all models other than Meditron:7B
The right answer to this question is “chromosome replication without cell division.” Out of the three, only Meditron answers correctly. Creating a medical expert persona (prompt engineering) was not enough in this example: it led the model to choose a different, more descriptive answer, which was still incorrect.
The main objective of model fine-tuning is improving the model's knowledge and its ability to comprehend the domain's specialised vocabulary, which was successfully achieved in this example.
Conciseness of answer
Example number 2: verbose answers from the vanilla LLaMA-2 models, but a very concise answer from Meditron:7B
Example number 3: verbose answers from the vanilla LLaMA-2 models, but a very concise answer from Meditron:7B
These examples show the main drawback of large-audience LLMs (like plain LLaMA-2): their answers try to accommodate every audience, even non-specialised ones. Consequently, they are not concise enough for a specialised audience, which makes them rather useless for it.
Fine-tuned models can give shorter, more concrete answers without the need to explain everything, as the target audience might be more specialised in the area (medical professionals like doctors or nurses).
Compared to general-purpose LLMs, fine-tuned LLMs would probably perform better on criteria like correctness (how correct an answer is) and conciseness (avoiding lengthy answers to improve the user experience); a rough scoring sketch for these two follows the list below. In the medical domain, there are other criteria worth focusing on:
- Avoidance of unverified treatments (as general-purpose LLMs are trained on web data, they might be susceptible to this kind of error)
- Public health (avoiding advice that encourages behaviour inconsistent with general public health norms).
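The scoring sketch referenced above could be as simple as the following; both metrics are simplistic stand-ins for proper evaluation:

```python
# Rough sketch of scoring answers on correctness (reference answer
# appears in the output) and conciseness (penalising long answers).
def correctness(answer: str, reference: str) -> float:
    # 1.0 if the reference answer appears in the model output
    return float(reference.lower() in answer.lower())

def conciseness(answer: str, max_words: int = 50) -> float:
    # 1.0 for short answers, decaying once the answer exceeds max_words
    n_words = len(answer.split())
    return min(1.0, max_words / max(n_words, 1))

answer = "The most common cause is Streptococcus pneumoniae."
print(correctness(answer, "Streptococcus pneumoniae"), conciseness(answer))
```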
Conclusions
In this blog post, we discussed possible ways of improving a model's performance and presented an experimental setup with three variants: plain LLaMA-2, LLaMA-2 with prompt engineering, and Meditron (a model fine-tuned on medical domain data). We clearly saw that models can improve their knowledge via fine-tuning and change the default characteristics of their answers (i.e., become more concise), ultimately making them more useful for the final user.
Reviewed by: Kamil Rzechowski