Let’s start with a simple sentence that is tokenized into individual words, which are then converted into their corresponding word embeddings.
Each word embedding represents a word as a dense vector (e.g., word2vec trained with the skip-gram model). A dense vector is a representation of data in which every element of the vector is a real number. In natural language processing and machine learning, dense vectors are commonly used to represent words and to capture their semantic meaning and contextual information, e.g., word similarity, syntactic relationships, or other linguistic properties.
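To make this concrete, here is a minimal sketch of training skip-gram word embeddings with gensim’s Word2Vec (assuming gensim 4.x); the two-sentence toy corpus and all hyperparameters are illustrative assumptions, not values from a real setup.

```python
# Minimal skip-gram sketch with gensim; corpus and parameters are toy assumptions.
from gensim.models import Word2Vec

sentences = [
    ["the", "man", "swims", "in", "the", "sea"],
    ["the", "sea", "meets", "the", "sand"],
]

# sg=1 selects the skip-gram training algorithm (sg=0 would be CBOW).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["sea"]          # a dense vector of 50 real numbers
print(vec.shape)               # (50,)
print(model.wv.most_similar("sea", topn=2))  # semantically closest words
```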
The result of the embedding step is a vector, matrix, or tensor that represents the input together with, e.g., its linguistic and contextual information. For images, a commonly used embedding method is Convolutional Neural Network (CNN) embeddings. CNNs are widely used in computer vision tasks and can learn powerful feature representations from images.
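As a sketch of how such a CNN embedding could be obtained, the snippet below reuses a pretrained ResNet-18 from torchvision as a feature extractor (assuming torchvision >= 0.13; the random tensor stands in for a real image, and the pretrained weights are downloaded on first use).

```python
# Sketch: image embedding via a pretrained CNN backbone (illustrative only).
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
# Drop the final classification layer, keeping everything up to global pooling.
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

image = torch.randn(1, 3, 224, 224)  # stand-in for one 224x224 RGB image
with torch.no_grad():
    embedding = feature_extractor(image).flatten(1)
print(embedding.shape)  # torch.Size([1, 512]) -- a dense image embedding
```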
The embedding result is passed to an attention pipeline. This pipeline adds context to, e.g., the words of a sentence or to earlier chat turns, by computing contextual scoring vectors. Mathematically, attention scores are computed as dot products of token embeddings followed by softmax normalization. In image processing, the context for an input image of a swimming man could be, e.g., “sea”, “sand”, and “palms”.
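The following minimal sketch implements exactly this computation as scaled dot-product self-attention over a handful of token embeddings (including the common 1/sqrt(d_k) scaling); the random inputs and dimensions are illustrative assumptions.

```python
# Minimal scaled dot-product attention sketch; inputs are random placeholders.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Dot products of query and key embeddings give raw compatibility scores.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax normalization turns the scores into weights that sum to 1 per token.
    weights = softmax(scores, axis=-1)
    # Each output is a context-weighted mixture of the value vectors.
    return weights @ V, weights

tokens = np.random.randn(5, 64)       # 5 token embeddings of dimension 64
context, weights = attention(tokens, tokens, tokens)  # self-attention
print(context.shape, weights.shape)   # (5, 64) (5, 5)
```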
The attended representation is then fed into a neural network that learns the actual task; for translation, this could be a sequence-to-sequence network. So-called prompt engineering helps to give the output variance and a more natural behavior. Depending on the task, additional neural networks can be applied on top.
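Below is a minimal sketch of such an encoder-decoder (sequence-to-sequence) network in PyTorch; the vocabulary sizes, hidden dimension, and random token IDs are illustrative assumptions, not a definitive translation model.

```python
# Minimal sequence-to-sequence (encoder-decoder) sketch; all sizes are toy values.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence into a final hidden state ("context").
        _, context = self.encoder(self.src_emb(src_ids))
        # Decode the target sequence conditioned on that context.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_out)  # logits over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1200, (2, 9))   # corresponding target sentences, length 9
logits = model(src, tgt)
print(logits.shape)  # torch.Size([2, 9, 1200])
```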