LLM from Zero to Hero

Transformers are a thriving family of machine learning models. These models support common tasks in different modalities, such as:

  • Natural language processing,
  • Computer vision,
  • Audio and
  • more.

They support a set of different tasks, such as (a minimal usage sketch follows this list):

  • Classification (e.g., segmentation)
  • Recognition (e.g., detection)
  • Completion (e.g., filling in missing parts, question answering, translation)
  • Filtering (e.g., removal, summarization)
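
As an illustration, the Hugging Face transformers pipeline API wraps many of these tasks behind one interface. The sketch below assumes the transformers package is installed and simply uses the default pretrained model for each task:

    from transformers import pipeline

    # Text classification (e.g. sentiment analysis) with the default pretrained model.
    classifier = pipeline("text-classification")
    print(classifier("Transformers make many NLP tasks surprisingly accessible."))

    # The same API covers other tasks, e.g. question answering:
    qa = pipeline("question-answering")
    print(qa(question="What do transformers support?",
             context="Transformers support text, vision and audio tasks."))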

Like other neural network models, they can be exported to formats such as ONNX or TorchScript for deployment in production environments (a small export sketch follows the list below). A set of common base model architectures has evolved for each modality:

  • Text models, e.g., BERT or GPT-2
  • Vision models, e.g., ViT
  • Audio models, e.g., Wav2Vec2 or Whisper
  • Multimodal models, e.g., CLIP
  • Reinforcement learning models
  • Time series models
  • Graph models
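
As a sketch of such an export, the snippet below traces a pretrained text classifier to ONNX with torch.onnx.export; the checkpoint name and output path are illustrative assumptions, not fixed choices:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    model.config.return_dict = False  # return plain tensors so the graph traces cleanly
    model.eval()

    # Dummy input used to trace the computation graph.
    dummy = tokenizer("An example sentence for tracing.", return_tensors="pt")

    torch.onnx.export(
        model,
        (dummy["input_ids"], dummy["attention_mask"]),
        "model.onnx",  # output file name (assumption)
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                      "attention_mask": {0: "batch", 1: "sequence"}},
    )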

All of these model classes can be initialised in a simple, unified way from pretrained instances. A pretrained instance bundles the hyperparameters (configuration), the tokenizer and the model weights. Such pretrained instances can be found on platforms such as the Hugging Face Hub.
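
A minimal sketch of this unified loading interface, using the transformers Auto classes (the checkpoint name is just an illustrative choice):

    from transformers import AutoConfig, AutoModel, AutoTokenizer

    checkpoint = "bert-base-uncased"  # any checkpoint from the Hugging Face Hub

    config = AutoConfig.from_pretrained(checkpoint)        # hyperparameters / architecture settings
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # vocabulary and preprocessing rules
    model = AutoModel.from_pretrained(checkpoint)          # pretrained weights

    inputs = tokenizer("Hello, Transformers!", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)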

These models can be used for inference, trained on suitable datasets, and fine-tuned with various approaches (prompt tuning among them). Fine-tuning is when a pre-trained model (a “foundation model”) is customised using additional data to learn new information or be trained for a specific task. In comparison, RAG (retrieval-augmented generation) integrates an external knowledge source, typically a vector database, to supply factual input at query time. It is important to remember that RAG and fine-tuning are not mutually exclusive approaches; they have different strengths and weaknesses and can be used together (a minimal retrieval sketch follows the table below).

                                  RAG    Fine-tuning
  Integrate external knowledge    yes    no
  Change model behaviour          no     yes
  Minimise hallucinations         yes    no
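
As a minimal retrieval sketch (not a complete RAG system): embed a small document collection, find the passages closest to a query, and prepend them to the prompt. The embedding model and the in-memory cosine-similarity “vector database” are illustrative assumptions:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Tiny in-memory "vector database" (a real system would use a proper vector store).
    documents = [
        "The Hugging Face Hub hosts pretrained transformer checkpoints.",
        "RAG retrieves external documents to ground the model's answer in facts.",
        "Fine-tuning adapts a foundation model to a specific task with additional data.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    doc_vectors = embedder.encode(documents, normalize_embeddings=True)

    def retrieve(query, k=2):
        """Return the k documents most similar to the query (cosine similarity)."""
        query_vector = embedder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vectors @ query_vector
        top = np.argsort(scores)[::-1][:k]
        return [documents[i] for i in top]

    query = "How does RAG reduce hallucinations?"
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    print(prompt)  # this prompt would then be passed to the language model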

Fine-tuning is also one stage of the typical LLM training pipeline:

  • Base model: trained on low-quality, raw data, e.g. from the internet. A language model that encodes statistical information about language.
  • SFT model: trained on high-quality data, curated e.g. for a specific dialog task; often a specific, smaller model with an adapted linear layer (aka dense layer; LoRA, DoRA, etc.).
  • Reward-/RL model (= RLHF model): trained on human feedback.

In typical cases this is a combination of hyperparameter adaptation and training the model with adapted layers. This is commonly known as PEFT (parameter-efficient fine-tuning).
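
A minimal PEFT sketch using the Hugging Face peft library with a LoRA configuration; the base model, target modules and rank are illustrative assumptions, not recommendations:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

    # LoRA: inject small low-rank adapter matrices into selected linear layers
    # and train only those, keeping the original weights frozen.
    lora_config = LoraConfig(
        r=8,                        # rank of the adapter matrices (assumption)
        lora_alpha=16,              # scaling factor (assumption)
        lora_dropout=0.05,
        target_modules=["c_attn"],  # attention projection layer in GPT-2 (assumption)
        task_type="CAUSAL_LM",
    )

    peft_model = get_peft_model(base_model, lora_config)
    peft_model.print_trainable_parameters()  # only a small fraction of weights is trainable
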
An alternative to the RLHF reward-modelling step is the DPO (Direct Preference Optimisation) approach, which optimises the model directly on human preference pairs without training a separate reward model.
Source: https://magazine.sebastianraschka.com/p/10-ai-research-papers-2023
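
As a rough sketch of the DPO idea (not a full training loop), the loss compares the policy's and a frozen reference model's log-probabilities of a preferred and a rejected answer. The function below assumes those summed per-sequence log-probabilities have already been computed:

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        """Direct Preference Optimisation loss from per-sequence log-probabilities.

        Each argument is the summed log-probability a model assigns to the chosen
        (preferred) or rejected answer; beta controls how strongly the policy may
        deviate from the reference model.
        """
        chosen_ratio = policy_chosen_logp - ref_chosen_logp
        rejected_ratio = policy_rejected_logp - ref_rejected_logp
        # Maximise the margin between preferred and rejected answers.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    # Toy usage with made-up log-probabilities:
    loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                    torch.tensor([-13.0]), torch.tensor([-14.0]))
    print(loss)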