To begin with, let’s clarify the terms. AI, or Artificial Intelligence, is the field concerned with making computers perform tasks that resemble human abilities. The question is how to implement such tasks effectively.
One way to achieve this is to let machines learn, a process commonly referred to as machine learning, or ML for short. Learning, in essence, means building a model, which can be thought of as a function that takes an input and produces an output (often written as y = f(x)). When this model learns complex patterns through multiple layers, we call it deep learning (DL). Traditionally, deep learning was primarily used for tasks like categorizing inputs into predefined labels, a setting known as “supervised learning.”
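To make the y = f(x) view concrete, here is a minimal sketch in Python/NumPy of a model as a parameterized function with two layers. The layer sizes, the example input, and the random weights are purely illustrative assumptions; learning would mean adjusting W1, b1, W2, and b2 so that f(x) matches the desired outputs.

```python
import numpy as np

# A model as a function y = f(x): a tiny two-layer network whose weights
# (the parameters) would be adjusted during learning. Sizes are illustrative.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1 parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # layer 2 parameters

def f(x):
    h = np.maximum(0, W1 @ x + b1)   # hidden layer with a ReLU non-linearity
    return W2 @ h + b2               # output y

x = np.array([0.5, -1.0, 2.0])       # an example input
print(f(x))                          # the model's output for this input
```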
In supervised learning, both the input (x) and the expected output (y) are known, and the model’s weights, or parameters, are adjusted to match them. When such predefined labels are missing, we fall back on a mathematical notion of similarity. The crux of the matter lies in defining this similarity: typically, the data is mapped into a vector space via an embedding, and linear-algebra measures such as Euclidean distance are used to quantify how similar two points are. This approach is commonly referred to as “unsupervised learning” and is employed for tasks such as clustering, for example with the k-means algorithm.
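The following sketch, again in NumPy with made-up two-dimensional data, illustrates this idea: points are treated as embedding vectors, Euclidean distance measures similarity, and a very small k-means loop (with k = 2 chosen for illustration) groups them into clusters.

```python
import numpy as np

# Unsupervised clustering sketch: embeddings as vectors, Euclidean distance
# as the similarity measure, and a tiny k-means loop with k = 2.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, size=(10, 2)),   # one group of points
                    rng.normal(5, 0.5, size=(10, 2))])  # another group

def euclidean(a, b):
    return np.linalg.norm(a - b)

centroids = points[[0, -1]]                 # start with one point from each group
for _ in range(10):
    # assign every point to its nearest centroid
    labels = np.array([np.argmin([euclidean(p, c) for c in centroids])
                       for p in points])
    # move each centroid to the mean of the points assigned to it
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)        # cluster index for every point
```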
In contrast to discrimination tasks, where the model distinguishes between predefined categories, a model can also generate entirely new data based on what it has learned from existing data. This newly generated data resembles, in some way, the data it was trained on. In its simplest form, the generated data is simply the mean value of each learned feature (the “primary component”). The idea is that data carrying similar information tends to cluster around the mean, which implies that, statistically, its variance is low. In other words, the model extracts the underlying information via the mean value, and the variance (i.e., the distance from the mean) represents the variations the model can use to generate new data.
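As a bare-bones illustration of this mean-and-variance view, the sketch below fits a per-feature mean and standard deviation on some made-up training data and samples “new” points from a normal distribution with those statistics. Real generative models capture far richer structure; this only mirrors the simplified picture described above.

```python
import numpy as np

# Generate "new" data from the statistics of the training data:
# the mean carries the underlying information, the variance the variation.
rng = np.random.default_rng(0)
train = rng.normal(loc=[2.0, -1.0], scale=[0.3, 0.8], size=(100, 2))  # made-up data

mu = train.mean(axis=0)      # the underlying information
sigma = train.std(axis=0)    # the spread the model can draw on

new_samples = rng.normal(loc=mu, scale=sigma, size=(5, 2))  # newly generated data
print(mu, sigma)
print(new_samples)
```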
Technically, there are various approaches to building generative models. One common ingredient is “self-supervised learning.” In this approach, the given dataset is augmented, for example by rotating images or converting them to grayscale, and the similarity between these augmented versions is used to define labels: views that originate from the same source, regardless of the augmentation, are assigned a label indicating maximum similarity (a distance of 0), while “contrastive” samples drawn from different sources are assigned a label of 1, indicating minimum similarity, i.e., maximum distance.
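The sketch below shows how such contrastive pairs could be constructed. The “images” are just random arrays, and the two augmentations (a 90-degree rotation and a grayscale conversion) as well as the 0/1 distance labels follow the description above; an actual training loop and loss function are omitted.

```python
import numpy as np

# Building contrastive pairs: two views of the same source get target
# distance 0, views of different sources get target distance 1.
rng = np.random.default_rng(0)
img_a = rng.random((8, 8, 3))    # stand-in for one source image
img_b = rng.random((8, 8, 3))    # stand-in for a different source image

def rotate(img):
    return np.rot90(img)                      # simple 90-degree rotation

def grayscale(img):
    return img.mean(axis=-1, keepdims=True)   # average the color channels

positive_pair = ((rotate(img_a), grayscale(img_a)), 0)  # same source -> distance 0
negative_pair = ((rotate(img_a), rotate(img_b)), 1)     # different sources -> distance 1

for (view1, view2), target in [positive_pair, negative_pair]:
    print(view1.shape, view2.shape, "target distance:", target)
```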

When we remove the prediction head, which is typically the last layer, from a deep learning (DL) neural network and train the network extensively on abundant data, we obtain what is referred to as a “Foundation Model”. Given such a foundation model, “transfer learning” allows it to be post-trained for a specific application.
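A minimal sketch of this transfer-learning step, assuming a PyTorch-style setup, is shown below. The “backbone” here is only a stand-in for a pretrained foundation model with its prediction head removed; its layer sizes, the three target classes, and the random batch are illustrative assumptions. Only the newly attached head is trained.

```python
import torch
import torch.nn as nn

# Transfer learning sketch: freeze a (stand-in) pretrained backbone and
# train only a new, task-specific prediction head on top of it.
backbone = nn.Sequential(              # placeholder for a pretrained network
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
)
for p in backbone.parameters():        # freeze the pretrained weights
    p.requires_grad = False

head = nn.Linear(32, 3)                # new prediction head for the target task
model = nn.Sequential(backbone, head)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
x = torch.randn(16, 128)               # a made-up batch of inputs
y = torch.randint(0, 3, (16,))         # made-up labels for the new task
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()                        # only the head's parameters receive gradients
optimizer.step()
```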
Foundation models are categorized based on the type of output they generate, such as:
- Text
- Images
- Audio
- Decisions
When the output of a foundation model is text, it’s commonly referred to as a “large language model” (LLM). A well-known example of self-supervised similarity learning involving text is CLIP, where text is paired with images to establish similarity relationships between the two modalities.
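The core idea behind CLIP can be sketched without any deep learning framework: texts and images are embedded into the same vector space and compared by cosine similarity. In the snippet below the embeddings are random placeholders; in CLIP they come from a text encoder and an image encoder trained jointly so that matching text-image pairs score highest.

```python
import numpy as np

# CLIP-style similarity: compare image and text embeddings that live in
# the same vector space using cosine similarity.
rng = np.random.default_rng(0)
text_embeddings  = rng.normal(size=(3, 512))   # placeholders for 3 caption embeddings
image_embeddings = rng.normal(size=(3, 512))   # placeholders for 3 image embeddings

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Entry [i, j] compares image i with caption j; after training,
# the diagonal (matching pairs) would carry the largest values.
similarity = normalize(image_embeddings) @ normalize(text_embeddings).T
print(similarity)
```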
Overall picture: (overview diagram)