LLMs Fine-tuning
Foundation large language models (LLMs) are trained on vast corpora of data. While they achieve strong overall performance, they often struggle with tasks where the data distribution differs significantly from their training set. Fine-tuning is a crucial technique to address this limitation [1].
Fine-tuning is the process of updating the parameters of a pre-trained model
by training the model on a dataset specific to the task. Figure 1 shows the overall
workflow of fine-tuning an LLM.
Implementation
In this section, we fine-tune the GPT-2 model. The complete implementation script is available on Fine-tuning.
The fundamental components for fine-tuning a model with the Hugging Face application programming interface (API) [2] are the tokenizer, the foundation model, the training arguments (hyperparameters), the data collator, and the Trainer API.

The tokenizer splits the input text into tokens and encodes them as the corresponding token IDs. Each foundation model has its own tokenizer, built on the model's pre-defined vocabulary (or dictionary). To load both the tokenizer and the model, we define a checkpoint that identifies the foundation model to fine-tune. Here, we have model_name = "gpt2".

Truncation and padding must be configured for the tokenizer. A dedicated function (e.g., tokenization_fn) is typically defined to set these parameters. Within this function, the max_length parameter plays a key role: it determines the sequence length used for both truncation and padding.
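To make the effect of max_length concrete, here is a minimal pure-Python sketch of what truncation and padding do to a list of token IDs. The helper name `pad_or_truncate`, the example IDs, and right-side padding are illustrative assumptions. `PAD_ID` is also an assumption: GPT-2 has no padding token by default, so its end-of-text token (ID 50256) is commonly reused.

```python
# Illustrative sketch, not the project's tokenization_fn.
# Assumption: the end-of-text token ID (50256 for GPT-2) is reused for padding.
PAD_ID = 50256


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return fixed-length input IDs and the matching attention mask."""
    kept = token_ids[:max_length]        # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)       # padding: fill up to max_length
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 0 marks padding
    }


# Arbitrary example IDs: one sequence shorter and one longer than max_length.
short = pad_or_truncate([101, 102], max_length=4)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short)
print(long)
```

With a Hugging Face tokenizer, the same behavior is typically requested with `tokenizer(texts, truncation=True, padding="max_length", max_length=max_length)` after setting `tokenizer.pad_token = tokenizer.eos_token`.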

Next, we need to load the model from the pre-defined checkpoint.
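Loading the tokenizer and the model from the same checkpoint might look like the sketch below. It assumes the `transformers` package (with PyTorch) is installed and that the checkpoint can be downloaded or is already cached. The function name `load_tokenizer_and_model` is hypothetical, and reusing the end-of-text token for padding is a common choice rather than necessarily what the original script does.

```python
MODEL_NAME = "gpt2"  # checkpoint name used in this section


def load_tokenizer_and_model(model_name=MODEL_NAME):
    """Load the tokenizer and the causal language model for one checkpoint."""
    # Imported lazily so this file can be read and imported without the dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # GPT-2 ships without a padding token; reuse end-of-text (an assumption).
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model
```

Calling `load_tokenizer_and_model()` returns the pair that the later steps (tokenization, collation, training) would share.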

For fine-tuning the loaded model, the training arguments must be defined. Here, we use:
- output_dir: the directory where checkpoints are saved.
- eval_strategy: when evaluation is run during training (e.g., every fixed number of steps or at the end of each epoch).
- per_device_train/eval_batch_size: the batch size per device for training and evaluation; small values keep video random access memory (VRAM) usage low.
- gradient_accumulation_steps: the number of batches over which gradients are accumulated before each parameter update; this simulates a larger effective batch size without increasing VRAM usage.
- num_train_epochs: the number of passes over the dataset during fine-tuning.
- learning_rate: the initial learning rate of the optimizer.
- logging_steps: the number of steps between loss logs.
- save_steps: the number of steps between checkpoint saves.
- label_names: the names of the dataset fields that hold the labels.
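The interaction between the per-device batch size and gradient_accumulation_steps can be checked with a toy example. The sketch below is an illustration under stated assumptions, not part of the original script: it uses a one-parameter model y = w * x with a mean-squared-error loss, a single device, and made-up data, and shows that accumulating scaled micro-batch gradients reproduces the gradient of one larger batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Illustrative values only (not the settings behind the reported results).
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# Gradient from one large batch of 8 samples.
full_grad = grad_mse(w, xs, ys)

# Gradient from 4 micro-batches of 2 samples, each scaled by 1/steps and summed.
accumulated = 0.0
for i in range(0, len(xs), per_device_train_batch_size):
    micro_x = xs[i:i + per_device_train_batch_size]
    micro_y = ys[i:i + per_device_train_batch_size]
    accumulated += grad_mse(w, micro_x, micro_y) / gradient_accumulation_steps

print(effective_batch_size, full_grad, accumulated)
```

Only one micro-batch is held in memory at a time, which is why the effective batch size grows while peak VRAM usage stays at the micro-batch level.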

The data collator handles padding and batching. It takes a list of individual data samples and organizes them into a single, consistent batch by padding the sequences, creating attention masks, and handling special tokens.
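A simplified pure-Python sketch of this collation step is shown below. The function `collate` and the example IDs are hypothetical. The label value -100 is the default index ignored by PyTorch's cross-entropy loss, and copying the inputs into the labels follows the usual causal language modeling setup, in the spirit of `DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def collate(samples, pad_id):
    """Pad token-ID lists to the longest sample and assemble one batch."""
    longest = max(len(ids) for ids in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in samples:
        n_pad = longest - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Causal LM labels copy the inputs; padded positions are excluded
        # from the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch


batch = collate([[11, 12, 13], [21]], pad_id=0)
print(batch)
```

Unlike padding every sample to max_length, padding to the longest sample in the batch avoids spending compute on padding tokens when the batch happens to contain only short sequences.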

Finally, we define the trainer and perform the fine-tuning. We also record the training time.
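Recording the training time can be as simple as wrapping the training call in a timer. The helper `timed` below is a hypothetical utility, and the workload is a stand-in; with a Hugging Face `Trainer` instance the call would be `timed(trainer.train)`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload used here in place of an actual training run.
result, seconds = timed(sum, range(1000))
print(f"result={result}, elapsed={seconds:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring elapsed intervals.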

Evaluation
To evaluate the performance of the fine-tuned model, we use perplexity, bilingual evaluation understudy (BLEU) [3], and recall-oriented understudy for gisting evaluation (ROUGE) [4].
- Perplexity: measures the model's uncertainty. Lower perplexity means that the model assigns higher probability to the actual next token in the sequence, indicating a more confident and accurate model.
- BLEU: evaluates the quality of machine-translated text by comparing it to human-created reference translations. To this end, it computes the overlap of n-grams between the machine-translated text and the reference translation.
- ROUGE: calculates precision, recall, and F1 score to quantify the n-gram overlap (words, phrases, and sequences) between the generated text and the reference text.
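The three metrics can be sketched in a few lines of pure Python. This is a minimal illustration, not the evaluation code behind Table I: perplexity is computed as the exponential of the mean negative log-likelihood, `bleu_precision` is only the clipped n-gram precision that BLEU builds on (full BLEU also combines several n-gram orders and applies a brevity penalty), and `rouge_n` covers ROUGE-N only. All function names and the example sentences are assumptions.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the reference tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    """Clipped n-gram matches plus the n-gram totals of both texts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matches = sum((cand & ref).values())  # '&' keeps the minimum count
    return matches, sum(cand.values()), sum(ref.values())


def bleu_precision(candidate, reference, n):
    """Modified (clipped) n-gram precision, the building block of BLEU."""
    matches, cand_total, _ = overlap(candidate, reference, n)
    return matches / cand_total if cand_total else 0.0


def rouge_n(candidate, reference, n):
    """ROUGE-N precision, recall, and F1."""
    matches, cand_total, ref_total = overlap(candidate, reference, n)
    p = matches / cand_total if cand_total else 0.0
    r = matches / ref_total if ref_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

print(perplexity([0.25, 0.25, 0.25, 0.25]))   # uniform over 4 choices
print(bleu_precision(candidate, reference, 1))
print(bleu_precision(candidate, reference, 2))
print(rouge_n(candidate, reference, 1))
```

A model that spreads its probability uniformly over four candidate tokens has a perplexity of about 4, which matches the reading of perplexity as an effective number of equally likely choices.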
Table I reports the corresponding results. Note that obtaining a well-performing model requires fine-tuning on an appropriately selected dataset with a sufficient number of samples. However, the objective of this project is limited to reviewing the fine-tuning mechanisms in LLMs. Consequently, the resulting model performance may not be fully optimized.
| PPL | BLEU (bleu) | BLEU (unigrams) | BLEU (bigrams) | BLEU (trigrams) | BLEU (quadgrams) | ROUGE (rouge1) | ROUGE (rouge2) | ROUGE (rougeL) | ROUGE (rougeLSum) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 50.1325 | 0.7069 | 0.7103 | 0.7079 | 0.7056 | 0.7036 | 0.6751 | 0.6361 | 0.6742 | 0.6746 | 92.3040 |
References
[1] M. R. J, K. VM, H. Warrier, and Y. Gupta, “Fine tuning llm for enterprise: Practical guidelines and recommendations,” 2024, https://arxiv.org/abs/2404.10779.
[2] Hugging Face, http://huggingface.co/
[3] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. USA: Association for Computational Linguistics, 2002, pp. 311–318, https://doi.org/10.3115/1073083.1073135.
[4] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Annual Meeting of the Association for Computational Linguistics, 2004, https://api.semanticscholar.org/CorpusID:964287.