Transformer, attention and what it changed

Let’s begin by examining the fundamentals. At its core, machine learning starts by mapping information to a mathematical representation, such as a Euclidean vector. This process is commonly referred to as embedding, and the number of dimensions is fixed in advance.

When we transform one vector into another, it is termed sequence-to-sequence translation. The term “sequence” is used because each dimension (x, y, z, etc.) represents an object in an ordered set. Alternatively, we could describe this as vector-to-vector translation.

Bag of words

The fact that the dimensions are well known leads us to the bag-of-words (BOW) approach, which is often used in language processing. Here, the count of each unique word is stored in one vector dimension, e.g., for “I eat fish” the BOW is (1,1,1), while for “I eat fish fish” it is (1,1,2). Usually no unique ordering exists, because “I eat meat” could also result in (1,1,1).
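A minimal sketch of this counting step (the function name and vocabulary are illustrative, not taken from any particular library):

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["i", "eat", "fish"]
print(bag_of_words("I eat fish", vocab))       # [1, 1, 1]
print(bag_of_words("I eat fish fish", vocab))  # [1, 1, 2]
```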

Problem: the order of the objects in the set (vector) is lost.

N-Gram

An n-gram is a sequence of n words. For example, the 2-grams of “I eat fish” are (I, eat) and (eat, fish). In this model, the probability of an object depends only on the previous n-1 objects. This simply means that “fish” gets a probability that depends on “eat”, but never on “I”. To overcome that, we would have to increase n, which increases the computational complexity exponentially.
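A small sketch, assuming a toy corpus, of how 2-grams are extracted and how a bigram probability only ever looks one word back:

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all consecutive n-word tuples from the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I eat fish".split(), 2))  # [('I', 'eat'), ('eat', 'fish')]

# Bigram model: P(word | previous word), estimated from counts.
corpus = ["I eat fish", "I eat meat", "you eat fish"]
counts = defaultdict(Counter)
for sentence in corpus:
    for prev, word in ngrams(sentence.split(), 2):
        counts[prev][word] += 1

# P(fish | eat) depends only on "eat", never on "I" or "you".
print(counts["eat"]["fish"] / sum(counts["eat"].values()))  # 2/3
```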

Problem: the context is limited to the previous n-1 words, so longer-range dependencies are lost.

Recurrent Neural Networks (RNN)

The RNN resembles the n-gram model, except that the output for the current input depends on the outputs of all previous computations. In our example of “I eat fish” it constructs a kind of nesting, e.g., for “fish”: ((I, eat), fish).
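A rough sketch of this recurrence with toy NumPy weights (all names and sizes are illustrative); the hidden state h acts as the summary of everything seen so far:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings and weights; the sizes are arbitrary for illustration.
embed = {w: rng.normal(size=4) for w in ["I", "eat", "fish"]}
W_h = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights
W_x = rng.normal(size=(4, 4)) * 0.1   # input-to-hidden weights

h = np.zeros(4)  # hidden state carries all previous computations
for word in ["I", "eat", "fish"]:
    # Each step folds the new word into the summary of all previous words.
    h = np.tanh(W_h @ h + W_x @ embed[word])

print(h)  # representation of "fish" conditioned on ("I", "eat")
```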

Problem: vanishing and exploding gradients, and very little opportunity to parallelize the computation. Both problems come with the depth of a neural network, i.e., the number of hidden layers, or in an RNN the number of time steps. Each “neuron” is essentially a linear component, the portion that varies linearly with the input variables, e.g., y = mx + b, followed by a nonlinearity. The “gradient” is the rate of change of the loss with respect to these weights, and during backpropagation it gets multiplied by a similar factor at every layer or time step. If that factor is smaller than one, the gradient shrinks toward zero (it vanishes); if it is larger than one, it grows without bound (it explodes). You usually notice the exploding case when the loss becomes NaN.
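A tiny illustration of why this happens, assuming the gradient is multiplied by roughly the same factor at every step:

```python
# Backpropagating through many steps multiplies the gradient
# by (roughly) the same factor again and again.
grad = 1.0
for step in range(50):
    grad *= 0.5   # factor < 1: gradient vanishes toward 0
print(grad)       # ~8.9e-16

grad = 1.0
for step in range(50):
    grad *= 1.5   # factor > 1: gradient explodes
print(grad)       # ~6.4e8; in practice this soon overflows to inf/NaN
```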

Long Short Term Memory (LSTM)

To ease the RNN problem of gradient vanishing, the LSTM cell is split into gated components. A forget gate decides which parts of the previous state to drop, while input and output gates control what gets added and what gets exposed, so the gradient can flow through the cell state more stably. The computation, however, still runs step by step.
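A minimal, illustrative LSTM step (weight shapes and names are assumptions, not a reference implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step: gates decide what to forget, add and output.
    W is a dict of weight matrices; shapes are illustrative only."""
    z = np.concatenate([h, x])
    f = sigmoid(W["f"] @ z)   # forget gate: drop parts of the old cell state
    i = sigmoid(W["i"] @ z)   # input gate: how much new information to add
    g = np.tanh(W["g"] @ z)   # candidate cell update
    o = sigmoid(W["o"] @ z)   # output gate
    c = f * c + i * g         # cell state: mostly additive, so gradients survive
    h = o * np.tanh(c)        # new hidden state
    return h, c

rng = np.random.default_rng(0)
dim = 4
W = {k: rng.normal(size=(dim, 2 * dim)) * 0.1 for k in "figo"}
h, c = np.zeros(dim), np.zeros(dim)
for word_vec in rng.normal(size=(3, dim)):   # e.g. embeddings of "I eat fish"
    h, c = lstm_step(word_vec, h, c, W)
print(h)
```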

Problem: the input still has a predefined, fixed length.

Transformer

Processing of variable input and output lengths is handled via the encoder-decoder architecture. Both the encoder and the decoder consist of so-called attention layers, described below. After the encoder step, a fixed-length representation (hidden state) is produced.

The ordering (i.e., the positioning) during concurrent processing is handled via positional encoding. There are several reasons why a single number, such as the index value, is not used to represent an object’s position (for one, raw indices grow unbounded with sequence length and are hard for the model to generalize over). Transformers use a smarter positional encoding scheme, where each position/index is mapped to a vector. For our example of “I eat fish” it could look like

  • “I”–>0 –> (..,..,..,..,..)
  • “eat”–>1 –> (..,..,..,..,..)
  • “fish”–>2 –> (..,..,..,..,..)

The resulting vector encodes the position with sine and cosine functions.
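A small sketch of the sinusoidal encoding described in “Attention Is All You Need” (the dimension size is chosen arbitrarily here):

```python
import numpy as np

def positional_encoding(num_positions, dim):
    """Sinusoidal positional encoding: even dimensions use sine,
    odd dimensions use cosine, at geometrically spaced frequencies."""
    positions = np.arange(num_positions)[:, None]    # shape (positions, 1)
    div = 10000 ** (np.arange(0, dim, 2) / dim)      # per-pair frequencies
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(positions / div)
    enc[:, 1::2] = np.cos(positions / div)
    return enc

# One 6-dimensional vector per position for "I", "eat", "fish".
print(positional_encoding(3, 6).round(2))
```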

Attention

Bahdanau et al. (2014) introduced the attention mechanism to tackle the bottleneck of the fixed-length encoding vector. In this mechanism, every query vector is compared against a database of key vectors to calculate a score. The matching is done by computing the dot product of the query in focus with each key vector (usually scaled, e.g., by the square root of the key dimension). The scores are then turned into weights via softmax, and the output is the weighted sum of the value vectors associated with those keys.

There is no magic in these matrix multiplications. The weights of the query, key and value matrices are learned for the specific application context.
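A compact sketch of scaled dot-product attention with randomly initialized stand-ins for the “learned” projection matrices (all sizes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: score = Q·K^T / sqrt(d),
    softmax turns scores into weights, output = weighted sum of values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 3, 4                     # e.g. the three words "I eat fish"
X = rng.normal(size=(seq_len, d))     # token embeddings (+ positional encoding)
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.5 for _ in range(3))  # learned in practice
out = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                      # (3, 4): one context vector per word
```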
