LLM Fine-Tuning and Model Selection Using Neptune and Transformers

Imagine you’re facing the following challenge: you want to develop a Large Language Model (LLM) that can proficiently respond to inquiries in Portuguese. You have a valuable dataset and can choose from various base models. But here’s the catch — you’re working with limited computational resources and can’t rely on expensive, high-power machines for fine-tuning. How do you decide on the right model to use in this scenario?

This post explores these questions, offering insights and strategies for selecting the best model and conducting efficient fine-tuning, even when resources are constrained. We’ll look at ways to reduce a model’s memory footprint, speed up training, and follow best practices for monitoring.

LLM fine-tuning and model selection, implemented workflow
The workflow we’ll implement. We will fine-tune different foundation LLM models on a dataset, evaluate them, and select the best model.

Large language models

Large Language Models (LLMs) are huge deep-learning models pre-trained on vast data. These models are usually based on an architecture called transformers. Unlike the earlier recurrent neural networks (RNN) that sequentially process inputs, transformers process entire sequences in parallel. Initially, the transformer architecture was designed for translation tasks. But nowadays, it is used for various tasks, ranging from language modeling to computer vision and generative AI.

Below, you can see a basic transformer architecture consisting of an encoder (left) and a decoder (right). The encoder receives the inputs and generates a contextualized interpretation of the inputs, called embeddings. The decoder uses the information in the embeddings to generate the model’s output, one token at a time.

Transformers architecture. On the left side, we can see the encoder part, which is composed of a stack of multi-head attention and fully connected layers. On the right side, we can see the decoder, which is also composed of a stack of multi-head attention, cross-attention to leverage the information from the encoder, and fully connected layers.

Hands-on: fine-tuning and selecting an LLM for Brazilian Portuguese

In this project, we’re taking on the challenge of fine-tuning four LLMs: GPT-2, GPT2-medium, GPT2-large, and OPT 125M. The models have 137 million, 380 million, 812 million, and 125 million parameters, respectively. The largest one, GPT2-large, takes up over 3 GB when stored on disk. All these models were trained to generate English-language text.

Our goal is to optimize these models for enhanced performance in Portuguese question answering, addressing the growing demand for AI capabilities in diverse languages. To accomplish this, we’ll need a dataset with inputs and labels and use it to “teach” the LLM. Taking a pre-trained model and specializing it to solve new tasks is called fine-tuning. The main advantage of this technique is that you can leverage the knowledge the model already has as a starting point.

Setting up

I have designed this project to be accessible and reproducible, with a setup that can be replicated on a Colab environment using T4 GPUs. I encourage you to follow along and experiment with the fine-tuning process yourself.

Note that I used a V100 GPU to produce the examples below, which is available if you have a Colab Pro subscription. You can see that I’ve already made a first trade-off between time and money spent here. Colab does not reveal detailed prices, but a T4 costs $0.35/hour on the underlying Google Cloud Platform, while a V100 costs $2.48/hour. According to this benchmark, a V100 is three times faster than a T4. Thus, by spending seven times more, we save two-thirds of our time.

You can find all the code in two Colab notebooks:

We will use Python 3.10 in our code. Before we begin, we’ll install all the libraries we will need. Don’t worry if you’re not familiar with them yet. We’ll go into their purpose in detail when we first use them:
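The original notebooks aren’t reproduced here, so here is a minimal sketch of the installation in a Colab cell; the exact package list and versions are assumptions based on the libraries used throughout this post:

```python
# Install the libraries used in this project (run in a Colab cell).
# Versions are not pinned here; pin them if you need exact reproducibility.
!pip install -q transformers datasets peft bitsandbytes accelerate neptune
```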

Loading and pre-processing the dataset

We’ll use the FaQuAD dataset to fine-tune our models. It’s a Portuguese question-answering dataset available in the Hugging Face dataset collection.

First, we’ll look at the dataset card to understand how the dataset is structured. We have about 1,000 samples, each consisting of a context, a question, and an answer. Our model’s task is to answer the question based on the context. (The dataset also contains a title and an ID column, but we won’t use them to fine-tune our model.)

Fine-tuning the models using the FaQuAD dataset
Each sample in the FaQuAD dataset consists of a context, a question, and the corresponding answer. | Source

We can conveniently load the dataset using the Hugging Face `datasets` library:
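A minimal sketch of the loading step; the dataset identifier on the Hub is assumed to be `eraldoluis/faquad`, so double-check it against the dataset card:

```python
from datasets import load_dataset

# Load the FaQuAD question-answering dataset from the Hugging Face Hub.
dataset = load_dataset("eraldoluis/faquad")
```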

Our next step is to convert the dataset into a format our models can process. For our question-answering task, that’s a sequence-to-sequence format: The model receives a sequence of tokens as the input and produces a sequence of tokens as the output. The input contains the context and the question, and the output contains the answer.

For training, we’ll create a so-called prompt that contains not only the question and the context but also the answer. Using a small helper function, we concatenate the context, question, and answer, divided by section headings (Later, we’ll leave out the answer and ask the model to fill in the “Resposta” section on its own).

We’ll also prepare a helper function that wraps the tokenizer. The tokenizer is what turns the text into a sequence of integer tokens. It is specific to each model, so we’ll have to load and use a different tokenizer for each. The helper function makes that process more manageable, allowing us to process the entire dataset at once using map. Last, we’ll shuffle the dataset to ensure the model sees it in randomized order.

Here’s the complete code:
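Since the notebooks aren’t embedded in this post, here is a condensed sketch of what these helpers might look like. The function names, the Portuguese section headings, and the SQuAD-style answer structure are assumptions:

```python
def generate_prompt(sample, with_answer=True):
    # Concatenate context, question, and (for training) the answer,
    # separated by section headings the model learns to follow.
    prompt = (
        f"Contexto: {sample['context']}\n"
        f"Pergunta: {sample['question']}\n"
        f"Resposta:"
    )
    if with_answer:
        # FaQuAD stores answers as a list; we use the first one.
        prompt += f" {sample['answers']['text'][0]}"
    return prompt


def tokenize(sample, tokenizer, max_length=512):
    # Wrap the model-specific tokenizer so the whole dataset
    # can be processed at once with `map`.
    return tokenizer(
        generate_prompt(sample),
        truncation=True,
        max_length=max_length,
    )


# Shuffle so the model sees the samples in randomized order.
dataset = dataset.shuffle(seed=42)
```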

Loading and preparing the models

Next, we load and prepare the models that we’ll fine-tune. LLMs are huge models. Without any kind of optimization, for the GPT2-large model in full precision (float32), we have around 800 million parameters, and we need 2.9 GB of memory to load the model and 11.5 GB during the training to handle the gradients. That just about fits in the 16 GB of memory that the T4 in the free tier offers. But we would only be able to compute tiny batches, making training painfully slow.

Faced with these memory and compute resource constraints, we’ll not use the models as-is but use quantization and a method called LoRA to reduce their number of trainable parameters and memory footprint.

Quantization

Quantization is a technique used to reduce a model’s size in memory by using fewer bits to represent its parameters. For example, instead of using 32 bits to represent a floating point number, we’ll use only 16 or even as little as 4 bits.

This approach can significantly decrease the memory footprint of a model, which is especially important when deploying large models on devices with limited memory or processing power. By reducing the precision of the parameters, quantization can lead to a faster inference time and lower power consumption. However, it’s essential to balance the level of quantization with the potential loss in the model’s task performance, as excessive quantization can degrade accuracy or effectiveness.

The Hugging Face `transformers` library has built-in support for quantization through the `bitsandbytes` library. You can pass `load_in_8bit=True` or `load_in_4bit=True` to the model loading methods to load a model with 8-bit or 4-bit precision, respectively.

After loading the model, we call the wrapper function `prepare_model_for_kbit_training` from the `peft` library. It prepares the model for training in a way that saves memory. It does this by freezing the model parameters, making sure all parts use the same type of data format, and using a special technique called gradient checkpointing if the model can handle it. This helps in training large AI models, even on computers with little memory.
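Putting the two previous paragraphs together, a sketch of the loading step could look as follows. Note that the direct `load_in_8bit` argument has been superseded by `BitsAndBytesConfig` in newer `transformers` versions, and the device mapping is an assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import prepare_model_for_kbit_training

model_name = "gpt2-large"  # or "gpt2", "gpt2-medium", "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize the weights to 8 bits while loading
    device_map="auto",   # place layers on the available GPU automatically
)

# Freeze the base weights, keep norm layers in full precision, and enable
# gradient checkpointing where supported, so fine-tuning fits in limited memory.
model = prepare_model_for_kbit_training(model)
```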

After quantizing the model to 8 bits, loading and training it take only about a fourth of the memory. For GPT2-large, instead of needing 2.9 GB to load, it now takes only 734 MB.

LoRA

As we know, Large Language Models have a lot of parameters. When we want to fine-tune one of these models, we usually update all of the model’s weights. That means we need to keep all the gradient states in memory during fine-tuning, which requires almost twice the model’s size in memory. Additionally, updating all parameters can interfere with what the model has already learned, leading to worse generalization.

Given this context, a team of researchers proposed a new technique called Low-Rank Adaptation (LoRA). This reparametrization method aims to reduce the number of trainable parameters through low-rank decomposition.

Low-rank decomposition approximates a large matrix into a product of two smaller matrices, such that multiplying a vector by the two smaller matrices yields approximately the same results as multiplying a vector by the original matrix. For example, we could decompose a 3×3 matrix into the product of a 3×1 and a 1×3 matrix so that instead of having nine parameters, we have only six.

Low-Rank Adaptation (LoRA)
Low-rank decomposition is a method to split a large matrix M into a product of two smaller matrices, L and R, that approximates it.

When fine-tuning a model, we want to slightly change its weights to adapt it to the new task. More formally, we’re looking for new weights derived from the original weights: W_new = W_old + ΔW. Looking at this equation, you can see that we keep the original weights in their original shape and just learn ΔW as LoRA matrices.

In other words, you can freeze your original weights and train just the two LoRA matrices with substantially fewer parameters in total. Or, even more simply, you create a set of new weights in parallel with the original weights and only train the new ones. During the inference, you pass your input to both sets of weights and sum them at the end.

Fine-tuning using low-rank decomposition
Fine-tuning using low-rank decomposition. In blue, we can see the original set of weights of the pre-trained model. Those will be frozen during the fine-tuning. In orange, we can see the low-rank matrices A and B, which will have their weights updated during the fine-tuning.

With our base model loaded, we now want to add the LoRA layers in parallel with the original model weights for fine-tuning. To do this, we need to define a `LoraConfig`.

Inside the `LoraConfig`, we can define the rank of the LoRA matrices (parameter `r`), the dimension of the vector space generated by the matrix columns. We can also look at the rank as a measure of how much compression we are applying to our matrices, i.e., how small the bottleneck between A and B in the figure above will be.

When choosing the rank, it is essential to keep in mind the trade-off between the rank of your LoRA matrices and the learning process. Smaller ranks mean less room to learn, i.e., as you have fewer parameters to update, it can be harder to achieve significant improvements. On the other hand, higher ranks provide more parameters, allowing for greater flexibility and adaptability during training. However, this increased capacity comes at the cost of additional computational resources and potentially longer training times. Thus, finding an optimal rank that balances these factors is crucial, and the best way to find it is by experimenting! A good approach is to start with lower ranks (8 or 16), as you will have fewer parameters to update, so training will be faster, and increase the rank if you see the model is not learning as much as you want.

You also need to define which modules inside the model you want to apply the LoRA technique to. You can think of a module as a set of layers (or a building block) inside the model. If you want to know more, I’ve prepared a deep dive, but feel free to skip it.

Within the `LoraConfig`, you need to specify which modules to apply LoRA to. You can apply LoRA to most of a model’s modules, but you need to specify the module names that the original developers assigned at model creation. Which modules exist, and what they are called, differs from model to model.

The LoRA paper reports that adding LoRA layers only to the query and value linear projections is a good tradeoff compared to adding LoRA layers to all linear projections in attention blocks. In our case, for the GPT2 models, we will apply LoRA to the `c_attn` layers, as the query, key, and value weights are not split into separate layers, and for the OPT model, we will apply LoRA to `q_proj` and `v_proj`.

If you use other models, you can print the modules’ names and choose the ones you want:
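A minimal way to do this, assuming the model is already loaded:

```python
# Print the names of all modules so you can decide which ones to wrap with LoRA.
for name, _ in model.named_modules():
    print(name)
```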

In addition to specifying the rank and modules, you must also set up a hyperparameter called `alpha`, which scales the LoRA matrix. As a rule of thumb (as discussed in this article by Sebastian Raschka), you can start by setting it to two times the rank `r`. If your results are not good, you can try lower values.

Here’s the complete LoRA configuration for our experiments:
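A sketch of what this configuration might look like; the exact rank, alpha, and dropout values below are illustrative rather than the precise settings from my runs:

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,                        # rank of the low-rank matrices
    lora_alpha=32,               # scaling factor; rule of thumb: 2 * r
    lora_dropout=0.05,           # dropout applied inside the LoRA layers
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["c_attn"],   # GPT-2 family; use ["q_proj", "v_proj"] for OPT
)
```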

We can apply this configuration to our model by calling `get_peft_model` from the `peft` library:
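```python
from peft import get_peft_model

# Wrap the quantized base model with the LoRA adapters defined above.
model = get_peft_model(model, lora_config)
```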

Now, just to show how many parameters we are saving, let’s print the trainable parameters of GPT2-large:
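```python
# PEFT models expose a helper that reports trainable vs. total parameters.
model.print_trainable_parameters()
```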

We can see that we are updating less than 1% of the parameters! What an efficiency gain!

Fine-tuning the models

With the dataset and models prepared, it’s time to move on to fine-tuning. Before we start our experiments, let’s take a step back and consider our approach. We’ll be training four different models with different modifications and using different training parameters. We’re not only interested in the model’s performance but also have to work with constrained resources.

Thus, it will be crucial that we keep track of what we’re doing and progress as systematically as possible. At any point in time, we want to ensure that we’re moving in the right direction and spending our time and money wisely.

What is essential to log and monitor during the fine-tuning process?

Aside from monitoring standard metrics like training and validation loss and training parameters such as the learning rate, in our case, we also want to be able to log and monitor other aspects of the fine-tuning:

  1. Resource Utilization: Since you’re operating with limited computational resources, it’s vital to keep a close eye on GPU and CPU usage, memory consumption, and disk usage. This ensures you’re not overtaxing your system and can help troubleshoot performance issues.
  2. Model Parameters and Hyperparameters: To ensure that others can replicate your experiment, storing all the details about the model setup and the training script is crucial. This includes the architecture of the model, such as the sizes of the layers and the dropout rates, as well as the hyperparameters, like the batch size and the number of epochs. Keeping a record of these elements is key to understanding how they affect the model’s performance and allowing others to recreate your experiment accurately.
  3. Epoch Duration and Training Time: Record the duration of each training epoch and the total training time. This data helps assess the time efficiency of your training process and plan future resource allocation.

Set up logging with neptune.ai

neptune.ai is a machine learning experiment tracker and model registry. It offers a single place to log, compare, store, and collaborate on experiments and models. Neptune is integrated with the `transformers` library’s `Trainer` module, allowing you to log and monitor your model training seamlessly. This integration was contributed by Neptune’s developers, who maintain it to this day.

To use Neptune, you’ll have to sign up for an account first (don’t worry, it’s free for personal use) and create a project in your workspace. Have a look at the Quickstart guide in Neptune’s documentation. There, you’ll also find up-to-date instructions for obtaining the project and token IDs you’ll need to connect your Colab environment to Neptune.

We’ll set these as environment variables:
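A sketch of that setup in a Colab cell; the project name is a placeholder you’d replace with your own:

```python
import os
from getpass import getpass

# NEPTUNE_PROJECT has the form "workspace-name/project-name".
os.environ["NEPTUNE_PROJECT"] = "your-workspace/llm-finetuning"  # placeholder
os.environ["NEPTUNE_API_TOKEN"] = getpass("Enter your Neptune API token: ")
```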

There are two options for logging information from `transformers` training to Neptune: You can either set `report_to="neptune"` in the `TrainingArguments` or pass an instance of `NeptuneCallback` to the `Trainer`’s `callbacks` parameter. I prefer the second option because it gives me more control over what I log. Note that if you pass a logging callback, you should set `report_to="none"` in the `TrainingArguments` to avoid duplicate data being reported.

Below, you can see how I typically instantiate the `NeptuneCallback`. I specified a name for my experiment run and asked Neptune to log all parameters used and the hardware metrics. Setting `log_checkpoints=”last”` ensures that the last model checkpoint will also be saved on Neptune.
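Here’s a sketch of that instantiation; the run name is a placeholder, and `capture_hardware_metrics` is one of the keyword arguments forwarded to `neptune.init_run` under the hood:

```python
from transformers.integrations import NeptuneCallback

neptune_callback = NeptuneCallback(
    name="fine-tuning-gpt2-large",   # run name shown in the Neptune UI (placeholder)
    log_parameters=True,             # log TrainingArguments and model configuration
    log_checkpoints="last",          # upload the final model checkpoint to Neptune
    capture_hardware_metrics=True,   # GPU/CPU/memory monitoring (forwarded to neptune.init_run)
)
```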

Training a model

As the last step before configuring the `Trainer`, it’s time to tokenize the dataset with the model’s tokenizer. Since we’ve loaded the tokenizer together with the model, we can now put the helper function we prepared earlier into action:
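Assuming the `tokenize` helper and the `dataset` from the pre-processing step, this could look like:

```python
# Tokenize every training sample with this model's tokenizer.
tokenized_dataset = dataset["train"].map(
    lambda sample: tokenize(sample, tokenizer),
    remove_columns=dataset["train"].column_names,  # keep only input_ids / attention_mask
)
```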

The training is managed by a `Trainer` object. The `Trainer` uses a `DataCollatorForLanguageModeling`, which prepares the data in a way suitable for language model training.

Here’s the full setup of the `Trainer`:
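The notebooks carry the exact configuration; the sketch below reconstructs it from the constants discussed in the list that follows, so treat arguments such as `output_dir`, `fp16`, and `logging_steps` as assumptions:

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

EPOCHS = 20
MICRO_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 8
LEARNING_RATE = 2e-3
WARMUP_STEPS = 100

model.config.use_cache = False             # disable the KV cache during training
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token by default

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    fp16=True,                 # mixed precision to save memory and time
    logging_steps=10,
    report_to="none",          # Neptune logging happens through the callback instead
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[neptune_callback],
)
```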

That’s a lot of code, so let’s go through it in detail:

  • The training process is defined to run for 20 epochs (EPOCHS = 20). You’ll likely find that training for even more epochs will lead to better results.
  • We’re using a technique called gradient accumulation, set here to 8 steps (GRADIENT_ACCUMULATION_STEPS = 8), which helps handle larger batch sizes effectively, especially when memory resources are limited. In simple terms, gradient accumulation is a technique to handle large batches. Instead of having a batch of 64 samples and updating the weights for every step, we can have a batch size of 8 samples and perform eight steps, just updating the weights in the last step. It generates the same result as a batch of 64 but saves memory.
  • The MICRO_BATCH_SIZE is set to 8, indicating the number of samples processed in each step. It is extremely important to choose a number of samples that fits in your GPU memory during training to avoid out-of-memory issues (have a look at the `transformers` documentation to learn more about this).
  • The learning rate, a crucial hyperparameter in training neural networks, is set to 0.002 (LEARNING_RATE = 2e-3), determining the step size at each iteration when moving toward a minimum of the loss function. To facilitate a smoother and more effective training process, the model will gradually increase its learning rate for the first 100 steps (WARMUP_STEPS = 100), helping to stabilize early training phases.
  • The trainer is set not to use the model’s cache (model.config.use_cache = False) to manage memory more efficiently.

With all of that in place, we can launch the training:
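```python
# Launch fine-tuning; the returned TrainOutput holds the final loss and runtime metrics.
trainer_output = trainer.train()
```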

While training is running, head over to Neptune, navigate to your project, and click on the experiment that is running. There, click on `Charts` to see how your training progresses (loss and learning rate). To see resource utilization, click the `Monitoring` tab and follow how GPU and CPU usage and memory utilization change over time. When the training finishes, you can see other information like training samples per second, training steps per second, and more.

At the end of the training, we capture the output of this process in `trainer_output`, which typically includes details about the training performance and metrics that we will later use to save the model on the model registry.

But first, we’ll have to check whether our training was successful.

Evaluating the fine-tuned LLMs

Model evaluation in AI, particularly for language models, is a complex and multifaceted task. It involves navigating a series of trade-offs among cost, data applicability, and alignment with human preferences. This process is critical in ensuring that the developed models are not only technically proficient but also practical and user-centric.

LLM evaluation approaches

Diagram of different evaluation strategies organized by evaluation metrics and data | Modified based on source

The chart above shows that the least expensive (and most commonly used) approach is to use public benchmarks. On the one hand, this approach is highly cost-effective and easy to test. However, on the other hand, it is less likely to resemble production data. Another option, slightly more costly than benchmarks, is AutoEval, where other language models are used to evaluate the target model. For those with a higher budget, user testing, where the model is made accessible to users, or human evaluation, which involves a dedicated team of humans focused on assessing the model, is an option.

Evaluating question-answering models with F1 scores and the exact match metric

In our project, considering the need to balance cost-effectiveness with maintaining evaluation standards for the dataset, we will employ two specific metrics: exact match and F1 score. We’ll use the `validation` set provided along with the FaQuAD dataset. Hence, our evaluation strategy falls into the `Public Benchmarks` category, as it relies on a well-known dataset to evaluate PTBR models.

The exact match metric determines if the response given by the model precisely aligns with the target answer. This is a straightforward and effective way to assess the model’s accuracy in replicating expected responses. We’ll also calculate the F1 score, which combines precision and recall, of the returned tokens. This will give us a more nuanced evaluation of the model’s performance. By adopting these metrics, we aim to assess our model’s capabilities reliably without incurring significant expenses.

As we said previously, there are various ways to evaluate an LLM, and we chose this one, using standard metrics, because it is fast and cheap. However, there is a trade-off: “hard” metrics can score a response poorly even when it is actually correct.

For example, imagine the target answer for some question is “The rat found the cheese and ate it.” and the model’s prediction is “The mouse discovered the cheese and consumed it.” Both sentences have almost the same meaning, but the words chosen differ. For metrics like exact match and F1, the scores will be really low. A better – but more costly – evaluation approach would be to have humans annotate the responses or to use another LLM to verify whether both sentences have the same meaning.

Implementing the evaluation functions

Let’s return to our code. I’ve decided to create my own evaluation functions instead of using the `Trainer`’s built-in capabilities to perform the evaluation. On the one hand, this gives us more control. On the other hand, I frequently encountered out-of-memory (OOM) errors while doing evaluations directly with the `Trainer`.

For our evaluation, we’ll need two functions:

  • `get_logits_and_labels`: Processes a sample, generates a prompt from it, passes this prompt through a model, and returns the model’s logits (scores) along with the token IDs of the target answer.
  • `compute_metrics`: Evaluates a model on a dataset, calculating exact match (EM) and F1 scores. It iterates through the dataset, using the `get_logits_and_labels` function to generate model predictions and corresponding labels. Predictions are determined by selecting the most likely token indices from the logits. For the EM score, it decodes these predictions and labels into text and computes the EM score. For the F1 score, it maintains the original token IDs and calculates the score for each sample, averaging them at the end.

Here’s the complete code:
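The notebook holds the full implementation; below is a condensed sketch of how these two functions could be written. The prompt helper, the token-level F1 computation, and details like the `max_new_tokens` handling are assumptions:

```python
import torch
from collections import Counter


def get_logits_and_labels(sample, model, tokenizer, max_new_tokens=20):
    # Build a prompt without the answer and let the model generate a continuation.
    prompt = generate_prompt(sample, with_answer=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            output_scores=True,
            return_dict_in_generate=True,
        )
    # One score tensor per generated token -> shape (num_new_tokens, vocab_size)
    logits = torch.stack(outputs.scores, dim=1).squeeze(0)
    # Token IDs of the target answer
    labels = tokenizer(
        sample["answers"]["text"][0], return_tensors="pt"
    ).input_ids.squeeze(0)
    return logits, labels


def _token_f1(pred_ids, label_ids):
    # Token-level F1 based on the overlap between predicted and target token IDs.
    common = Counter(pred_ids.tolist()) & Counter(label_ids.tolist())
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_ids)
    recall = num_common / len(label_ids)
    return 2 * precision * recall / (precision + recall)


def compute_metrics(eval_dataset, model, tokenizer, max_new_tokens=20):
    em_scores, f1_scores = [], []
    for sample in eval_dataset:
        logits, labels = get_logits_and_labels(sample, model, tokenizer, max_new_tokens)
        pred_ids = logits.argmax(dim=-1)  # most likely token at each generated position
        prediction = tokenizer.decode(pred_ids, skip_special_tokens=True).strip()
        target = sample["answers"]["text"][0].strip()
        em_scores.append(float(prediction == target))  # exact match on decoded text
        f1_scores.append(_token_f1(pred_ids, labels))  # F1 on the raw token IDs
    return {
        "exact_match": sum(em_scores) / len(em_scores),
        "f1": sum(f1_scores) / len(f1_scores),
    }
```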

Before assessing our model, we must switch it to evaluation mode, which deactivates dropout. Additionally, we should re-enable the model’s cache to conserve memory during prediction.

Following this setup, simply execute the `compute_metrics` function on the evaluation dataset and specify the desired number of generated tokens to use (Note that using more tokens will increase processing time).
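Putting it together (the number of generated tokens here is just an example):

```python
model.eval()                   # switch to evaluation mode (deactivates dropout)
model.config.use_cache = True  # re-enable the cache to speed up generation

metrics = compute_metrics(
    dataset["validation"], model, tokenizer, max_new_tokens=20
)
print(metrics)
```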

Storing the models and evaluation results

Now that we’ve finished fine-tuning and evaluating a model, we should save it and move on to the next model. To this end, we’ll create a `model_version` to store in Neptune’s model registry.

In detail, we’ll save the latest model checkpoint along with the loss, the F1 score, and the exact match metric. These metrics will later allow us to select the optimal model. To create a model and a model version, you first define the model key, which is the model identifier and must be uppercase and unique within the project. To then create a model version, you concatenate this key with the project identifier, which you can find on Neptune under “All projects” – “Edit project information” – “Project key”.
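A sketch of that registration step, assuming a project key of `LLMFIN`, a model key of `FAQUAD`, and the metric namespaces shown below; adjust these to your own project:

```python
import neptune

# Register the model once per project (the key must be uppercase and unique).
model_meta = neptune.init_model(key="FAQUAD", name="FaQuAD question answering")
model_meta.stop()

# Create a new version of that model for the run we just finished.
model_version = neptune.init_model_version(model="LLMFIN-FAQUAD")
model_version["checkpoint"].upload_files("./checkpoints/checkpoint-*")
model_version["metrics/loss"] = trainer_output.training_loss
model_version["metrics/exact_match"] = metrics["exact_match"]
model_version["metrics/f1"] = metrics["f1"]
model_version["model_name"] = "gpt2-large"
model_version.stop()
```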

Model selection

Once we’re done with all our model training and experiments, it’s time to jointly evaluate them. This is possible because we monitored the training and stored all the information on Neptune. Now, we’ll use the platform to compare different runs and models to choose the best one for our use case.

After completing all your runs, you can click `Compare runs` at the top of the project’s page and enable the “small eye” for the runs you want to compare. Then, you can go to the `Charts` tab, and you will find a joint plot of the losses for all the experiments. Here’s how it looks in my project. In purple, we can see the loss for the gpt2-large model. As we trained for fewer epochs, we can see that we have a shorter curve, which nevertheless achieved a better loss.


Comparison of the loss across different experiments. Purple: gpt2-large. Yellow: opt-125m. Red: gpt-medium. Gray: gpt2.

The loss function is not yet saturated, indicating that our models still have room for growth and could likely achieve higher levels of performance with additional training time.

Go to the `Models` page and click on the model you created. You will see an overview of all the versions you trained and uploaded. You can also see the metrics reported and the model name.


Model versions saved on Neptune’s model registry. Listed are the model version’s ID, the time of creation, the owner, and the metrics stored with the model version.

You’ll notice that none of the model versions have been assigned to a “Stage” yet. Neptune allows you to assign models to different stages, namely “Staging,” “Production,” and “Archived.”

While we can promote a model through the UI, we’ll return to our code and automatically identify the best model. For this, we first fetch all model versions’ metadata, sort by the exact match and f1 scores, and promote the best model according to these metrics to production:
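A sketch of that selection logic, under the same naming assumptions as before (a `LLMFIN-FAQUAD` model ID and metrics logged under `metrics/...`):

```python
import neptune

# Fetch the metadata of all model versions as a pandas DataFrame.
model_meta = neptune.init_model(with_id="LLMFIN-FAQUAD", mode="read-only")
versions_df = model_meta.fetch_model_versions_table().to_pandas()
model_meta.stop()

# Sort by exact match, then F1, and take the best version.
best = versions_df.sort_values(
    by=["metrics/exact_match", "metrics/f1"], ascending=False
).iloc[0]

# Promote the winning version to the "Production" stage.
best_version = neptune.init_model_version(with_id=best["sys/id"])
best_version.change_stage("production")
best_version.stop()
```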

After executing this, we can see, as expected, that gpt2-large (our largest model) was the best model and was chosen to go to production:


The gpt2-large model achieved the best metric scores and was promoted to the “Production” stage.

Once more, we’ll return to our code and finally use our best model to answer questions in Brazilian Portuguese:
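Here’s a sketch of that inference step, assuming the fine-tuned model and tokenizer are loaded in `model` and `tokenizer`. The context and question are placeholders standing in for the course-rules example shown below:

```python
import torch

context = "..."  # the paragraph describing the rules to pass the course
question = "De que depende a aprovação na disciplina?"  # hypothetical phrasing

prompt = (
    f"Contexto: {context}\n"
    f"Pergunta: {question}\n"
    f"Resposta:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30)

# Decode only the newly generated tokens, i.e., the model's answer.
answer = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```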

LLM inference before and after fine-tuning
Model inference before and after fine-tuning. The text shows a small piece of information about the rules to pass a course and asks: “What does passing the course depend on?” Before fine-tuning, the model only repeats the question. After fine-tuning, the model can answer the question correctly.

Let’s compare the prediction without fine-tuning and the prediction after fine-tuning. As demonstrated, before fine-tuning, the model didn’t know how to handle Brazilian Portuguese at all and answered by repeating some part of the input or returning special characters like “##########.” However, after fine-tuning, it becomes evident that the model handles the input much better, answering the question correctly (it only added a “?” at the end, but the rest is exactly the answer we’d expect).

We can also look at the metrics before and after fine-tuning and verify how much it improved:

Given the metrics and the prediction example, we can conclude that the fine-tuning was in the right direction, even though we have room for improvement.


How to improve the solution?

In this article, we’ve detailed a simple and efficient technique for fine-tuning LLMs.

Of course, we still have some way to go to achieve good performance and consistency. There are various additional, more advanced strategies you can employ, such as:

  • More Data: Add more high-quality, diverse, and relevant data to the training set to improve the model’s learning and generalization.
  • Tokenizer Merging: Combine tokenizers for better input processing, especially for multilingual models.
  • Model-Weight Tuning: Directly adjust the pre-trained model weights to fit the new data better, which can be more effective than tuning adapter weights.
  • Reinforcement Learning with Human Feedback: Employ human raters to provide feedback on the model’s outputs, which is used to fine-tune the model through reinforcement learning, aligning it more closely with complex objectives.
  • More Training Steps: Increasing the number of training steps can further enhance the model’s understanding and adaptation to the data.

Conclusion

We engaged in four distinct trials throughout our experiments, each employing a different model. We’ve used quantization and LoRA to reduce the memory and compute resource requirements. Throughout the training and evaluation, we’ve used Neptune to log metrics and store and manage the different model versions.

I hope this article inspired you to explore the possibilities of LLMs further. In particular, if you’re a native speaker of a language that’s not English, I’d like to encourage you to explore fine-tuning LLMs in your native tongue.
