In the previous post, I discussed how training a state-of-the-art model from scratch is unaffordable for most companies. Meanwhile, Apple announced their AI strategy, showcasing how small models, such as 7-billion-parameter models, can be personalised and adapted on our smartphones and smartwatches. This is where fine-tuning becomes essential. Fine-tuning means adjusting the parameters, or weights, of a pre-trained language model to tailor it to a new, specific task or dataset.

Notably, fine-tuning was initially introduced to address “catastrophic forgetting,” the tendency of an artificial neural network to forget previously learned information when learning new information, as well as to manage the computational and memory effort of training. This effort arises from the learning strategy of backpropagation, which requires the model to be “differentiable” and involves calculating “gradients” that indicate how a small change to a parameter will impact the model’s error. The process is: generate a prediction, measure the prediction’s error, calculate the gradients of that error with respect to the parameters, and then use those gradients to adjust the parameters. Fine-tuning aims to keep training resources minimal and tweak a model without losing its original capabilities.
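To make that backpropagation loop concrete, here is a minimal sketch of a single training step. It assumes PyTorch (the post itself does not prescribe a framework), and the input and target values are purely illustrative:

```python
import torch

# One linear "parameter" layer: prediction = m * x + b, with m and b learnable.
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.tensor([[2.0]])       # a single input example
target = torch.tensor([[7.0]])  # the value we would like the model to predict

prediction = model(x)               # 1. generate a prediction
loss = loss_fn(prediction, target)  # 2. measure the prediction's error
loss.backward()                     # 3. calculate gradients of the error w.r.t. m and b
optimizer.step()                    # 4. use the gradients to adjust the parameters
optimizer.zero_grad()               # reset the gradients for the next step
```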

Consider a deep neural network (DNN). Essentially, it is a stack of numerous linear components of the form m*x + b, where the coefficients m and b are the parameters (weights and biases); evaluating them boils down to additions and multiplications or, in matrix representation, matrix multiplications, interleaved with non-linear activation functions. A neural network is considered deep if it has hidden layers beyond just the input and output layers. These additional layers are also where we gain room to tweak a model; for example, attention layers are what turn a plain sequence-to-sequence model into a transformer. This is where the family of fine-tuning approaches known as “parameter-efficient fine-tuning” (PEFT) comes into play.
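As a toy illustration of this “stack of linear layers plus activations” view, here is a small sketch, again assuming PyTorch; the layer sizes are arbitrary:

```python
import torch

# "Deep" means there are hidden layers between the input and the output.
# Each Linear layer is a matrix multiplication plus a bias (m*x + b);
# the ReLU in between is the non-linear activation.
dnn = torch.nn.Sequential(
    torch.nn.Linear(16, 32),  # input layer -> hidden layer
    torch.nn.ReLU(),
    torch.nn.Linear(32, 8),   # hidden layer -> output layer
)

x = torch.randn(1, 16)  # one input vector with 16 features
y = dnn(x)              # forward pass: two matrix multiplications
print(y.shape)          # torch.Size([1, 8])
```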
In a nutshell, PEFT (Parameter-Efficient Fine-Tuning) approaches involve adding a prefix to the input layer, a postfix to the output layer, or adapting one or multiple hidden layers (known as adapters). It’s important to note that this kind of fine-tuning can be thought of as learning changes to the parameters rather than adjusting the parameters themselves. The model is kept frozen exactly as it was loaded, for instance from Huggingface https://huggingface.co/, and only the necessary changes to the parameters are learned to improve the model’s performance on the fine-tuned task. Common techniques such as “Low-Rank Adaptation” (LoRA) therefore do not adjust the original parameters directly; they learn a separate, low-rank update matrix that captures how the weights should change. In the context of a smartwatch, the model remains constant, and the modifications are prefixed, postfixed, or introduced in between. Simplistically, the rank of a matrix tells how many of its vectors (rows or columns) are linearly independent; two vectors are linearly independent if neither can be represented as a multiple of the other. And when we talk about adding, we literally mean the “+” operation: the learned update matrix is added to the frozen weight matrix of the foundation model, and because the update is built from only a few rows and columns, we effectively control which directions of the original weights get adapted.
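A back-of-the-envelope sketch of that idea, using PyTorch tensors with toy dimensions (the names and sizes here are illustrative, not taken from any particular model): the frozen weight matrix W is combined with a learned update B @ A whose rank is at most r.

```python
import torch

d, r = 8, 2  # model dimension and the (much smaller) rank of the update

# Frozen foundation-model weight matrix: loaded as-is and never changed.
W = torch.randn(d, d, requires_grad=False)

# LoRA factors: the only trainable parameters. Their product B @ A is the
# learned *change* to W, and its rank is at most r.
A = torch.randn(r, d, requires_grad=True)  # random initialisation
B = torch.zeros(d, r, requires_grad=True)  # zero initialisation, so the update starts at zero

x = torch.randn(d)  # an example input vector

# The adapted layer literally adds ("+") the low-rank update to the frozen weights:
y = (W + B @ A) @ x  # equivalent to W @ x + B @ (A @ x)
```

With d = 8 and r = 2, the update contributes only 2*8 + 8*2 = 32 trainable values instead of the 64 in W; for real models with thousands of dimensions per layer, the savings are far larger.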

Now, hands-on. The playbook for fine-tuning is quite straightforward:
- Go to one of the big model hubs like Huggingface, and select and download one of the more than 400,000 available models.
- Train using your chosen fine-tuning approach, keeping the foundation model frozen while performing backpropagation. Ensure the framework supports this kind of tweaking of the model; for Rust, check out Candle https://github.com/huggingface/candle and Candlelighter https://github.com/BDUG/Lighter (PEFT, coming soon). A minimal Python sketch of this step follows after the list.
- Deploy and run the adapted model.
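For the first two steps, a minimal sketch in Python could look like the following. It assumes the Huggingface transformers and peft libraries and uses gpt2 only as a small stand-in for whichever foundation model you pick; the LoRA hyperparameters are illustrative, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Step 1: download a foundation model from the Huggingface hub
# ("gpt2" is just a small stand-in for the model you actually choose).
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Step 2: attach a LoRA adapter; the foundation model's weights stay frozen
# and only the low-rank update matrices are trained during backpropagation.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # the attention projection layers in gpt2
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trainable

# ... run your training loop (or the transformers Trainer) on your dataset ...

# Step 3: save only the small adapter weights for deployment.
model.save_pretrained("my-adapter")
```

Because only the adapter weights are saved, the artifact you deploy is typically a few megabytes, while the frozen foundation model can be shared across tasks.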
Therefore, fine-tuning is more of an operational question than a fundamental technological one.