Why machine learning looks so woozy

Let’s start with the math behind it. A matrix is, row-wise, just a listing of objects (the observations); their properties are the columns. These properties are also called features, attributes, or whatever you like to call them.

Consider 6 vehicles, each with different properties like color, length, etc. For mathematical reasons we do not store, e.g., the color as words “red”, “black”, etc.; we encode them as numbers: red = 1, black = 2, etc.

By doing so we simply create a 1-dimensional vector space. Is this not nice? Just by counting we are already in the field of linear algebra.
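
As a minimal sketch of this encoding step in Python (the concrete color-to-number mapping below is just an assumption for illustration):

```python
# Encode the categorical property "color" as numbers.
# The mapping (red = 1, black = 2, ...) is purely illustrative.
color_codes = {"red": 1, "black": 2, "blue": 3, "white": 4}

vehicles = ["black", "red", "black", "white", "blue", "black"]  # 6 vehicles
encoded = [color_codes[c] for c in vehicles]
print(encoded)  # [2, 1, 2, 4, 3, 2] -- a 1-dimensional vector
```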

If we do this property by property, we build a multi-dimensional space.

Let us consider two properties (2D) that describe our vehicles. Every vehicle gets a unique position in this 2-dimensional space.

Having this in place we can, e.g., compare vehicles and see how similar they are. For example, vehicles B and D might both tend to be “black”. Now let us look at a vector the way mathematicians, not physicists, do: in math a vector has a root, e.g., the origin (0, 0) of the Cartesian coordinate system.

By doing so, vehicle B could be at position x = 2.5 and y = 2.5. Now the interesting part of math: the experts play around with the wording, e.g., they call the unit vector along x “i-hat” (î) and the one along y “j-hat” (ĵ), so B is 2.5·î + 2.5·ĵ.
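
A small numpy sketch of this picture, with made-up coordinates: it places vehicle B at 2.5·î + 2.5·ĵ and compares vehicles by their distance.

```python
import numpy as np

# Each vehicle as a 2D vector (color code, length); values are illustrative.
vehicles = {
    "A": np.array([1.0, 4.2]),
    "B": np.array([2.5, 2.5]),   # vehicle B at x = 2.5, y = 2.5
    "D": np.array([2.4, 2.6]),
}

# Unit vectors i-hat and j-hat span the space; B is 2.5*i_hat + 2.5*j_hat.
i_hat, j_hat = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert np.allclose(vehicles["B"], 2.5 * i_hat + 2.5 * j_hat)

# Compare how similar two vehicles are, e.g., via Euclidean distance.
print(np.linalg.norm(vehicles["B"] - vehicles["D"]))  # small  -> similar
print(np.linalg.norm(vehicles["B"] - vehicles["A"]))  # larger -> less similar
```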

So far so good. Now we have vectors with a root, describing all our given objects, e.g., our vehicles. Besides introducing more and more dimensions, we could also (re-)order the objects in this space to describe more sophisticated logic.

Resulting in: [figure: the vehicle vectors after operations such as a shear and a rotation]

Let us take the currently most prominent application: Transformer networks according to the Google paper https://arxiv.org/pdf/1706.03762.pdf

[Figure: the Transformer architecture. Source: https://arxiv.org/pdf/1706.03762.pdf]

Now let us dive in step-wise. First the encoder (left part), afterwards the decoder (right part).

Encoder step no. 1: The first two steps of the encoder are somewhat equal to what we did before for the vehicle properties (color, length). The common name is “embedding”, because we embed an object into a mathematical space.
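
A minimal sketch of such an embedding lookup in numpy; the embedding matrix here is random, whereas in a real Transformer it is learned:

```python
import numpy as np

# Every object id (vehicle, word, ...) gets a row in a matrix of vectors.
vocab_size, d_model = 6, 2              # 6 vehicles, 2 properties each
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))

vehicle_ids = np.array([1, 3, 3, 0])        # a "sentence" of vehicles
embedded = embedding_matrix[vehicle_ids]    # lookup = embedding
print(embedded.shape)                       # (4, 2)
```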

The “positional encoding” part is, for our vehicles, something like: vehicle B is in front of vehicle A. The encoding of this information is an operation like the shear and rotation shown before. In our 2-dimensional space we could, e.g., introduce a third dimension to map the spatial information of the vehicle order.

Sentences and words, as in the original paper, are encoded via

  • Formula no. 1: PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
  • Formula no. 2: PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

with

  • pos: the position of the word in the sequence
  • i: the dimension index of the positional encoding
  • d_model: the number of properties (dimensions) of each input. In case of our vehicles it is 2, in case of the original Transformer paper it is 512.
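
Formulas no. 1 and no. 2 can be sketched in numpy as follows; the call with 6 positions and d_model = 2 is just our vehicle-sized toy setting.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (formulas no. 1 and no. 2)."""
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, np.newaxis]        # pos
    dims = np.arange(0, d_model, 2)                      # 2i
    angle_rates = positions / np.power(10000, dims / d_model)
    pe[:, 0::2] = np.sin(angle_rates)                    # formula no. 1
    pe[:, 1::2] = np.cos(angle_rates)                    # formula no. 2
    return pe

# e.g., 6 vehicles with d_model = 2 properties
print(positional_encoding(6, 2))
```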

Encoder step no. 2: The “multi head attention” is a combination of multiple “self attention” blocks. Self attention reflects how important the properties are to each other.

For our vehicle example this is quite boring. For words, as in the original paper, it is interesting to, e.g., map relationships between single words. A nice example is “the cat is not passing the river because it is too huge”. Here the interesting relationship is which of the two, cat or river, is too huge. In case of our vehicle example, at Henry Ford’s time, the color black would most probably map to a specific vehicle length.

[Figure. Source: https://blogs.sw.siemens.com/automotive-transportation/2016/04/15/a-car-tailor-made-for-you-that-would-still-make-henry-ford-happy/]

The core of the “attention mechanism” is what we know, e.g., from Python dictionaries (aka lookup tables, see https://docs.python.org/3/tutorial/datastructures.html#dictionaries): a query (Q) maps to a key (K), resulting in a value (V). The comparison between Q and K can happen in various ways. To be precise, this mapping is not hand-crafted but learned while training; it is a good field of application for a separate neural network.
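
To contrast the hard dictionary lookup with the soft attention lookup, here is a small sketch; the keys, values, and query below are purely illustrative numbers, not anything from the paper.

```python
import numpy as np

# Hard lookup: a Python dict returns exactly one value for a matching key.
lookup = {"color": 2.0, "length": 4.2}
print(lookup["color"])  # 2.0

# Soft lookup (attention): the query is compared to *all* keys, the
# similarity scores are turned into weights, and the result is a
# weighted mix of all values.
query = np.array([1.0, 0.0])
keys = np.array([[0.9, 0.1],    # key for "color"
                 [0.1, 0.9]])   # key for "length"
values = np.array([2.0, 4.2])

scores = keys @ query                            # compare Q with every K
weights = np.exp(scores) / np.exp(scores).sum()  # softmax
print(weights @ values)                          # mostly the "color" value
```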

For the Ford shown above, the self-attention could look like the following:

                   Length   Color   Number of wheels   Number of seats
Length             0.7      0.1     0.6                0.5
Color              0.1      …       …                  …
Number of wheels   0.6      …       …                  …
Number of seats    0.5      …       …                  …

The “add norm” adds the input back onto the output (residual connection) and normalizes each line; it is comparable to, e.g., a “softmax” function which normalizes each line to sum to 1.

In the original paper itself the whole procedure is expressed as:

Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V

with d_k being the number of columns (dimensions) per key.

The final multiplication by V weights the values with these attention scores and reconstructs the shape (sparsity) of the input matrix.
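
As a sketch, the scaled dot-product attention formula above can be written in numpy like this; the input matrix X is random and stands in for our 4 vehicle properties.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare every query with every key
    weights = softmax(scores)           # each line sums to 1
    return weights @ V                  # multiply by V: weighted values

# 4 vehicle properties, d_model = 2 (illustrative random numbers)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
print(scaled_dot_product_attention(X, X, X))  # self-attention: Q = K = V = X
```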

Multi-head in this context means that multiple types of relationships can be expressed, e.g., one head for vehicles, one for bicycles, etc. In its simplest form the heads are concatenated and combined, e.g., via a multiplication with an output projection matrix.
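
A minimal multi-head sketch along these lines; note that the learned projection matrices of the paper (W_Q, W_K, W_V, W_O) are omitted here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, h):
    """Split the d_model columns into h slices, attend per head, concatenate."""
    heads = np.split(X, h, axis=-1)                  # column-wise slices
    outputs = [attention(part, part, part) for part in heads]
    return np.concatenate(outputs, axis=-1)          # back to d_model columns

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, d_model = 8
print(multi_head_attention(X, h=2).shape)    # (4, 8)
```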

Encoder step no. 3: The last encoder step brings the input back to its original size, e.g., 4 properties for one vehicle.

This happens, in the original paper, via layer normalization, a formula which takes the mean value (aka expected value) μ and the variance σ² into consideration:

LayerNorm(x) = γ · (x − μ) / sqrt(σ² + ε) + β
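
A sketch of this layer normalization in numpy, applied to one vehicle with 4 properties (illustrative numbers):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize each line with its mean and variance."""
    mean = x.mean(axis=-1, keepdims=True)   # expected value per line
    var = x.var(axis=-1, keepdims=True)     # variance per line
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# One vehicle with 4 properties
x = np.array([[0.7, 0.1, 0.6, 0.5]])
print(layer_norm(x))
```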

Note: For 4 vehicle properties the Transformer approach is oversized, but with hundreds of properties and their interrelationships it is a good way to have a one-shot encoding of complex conditions.

The decoder produces the output of the transformation, e.g., in translation it is the target language such as Spanish (output) when translating from English (input). Technically the computation is almost equal to the encoding.

Note: “Shifted right” just notates that all words after the current one are masked. Masked means they are not considered. This usually happens via a mask or a so-called padding word (null word). For our vehicle example let us assume we just set all following properties to zero.

                   Length   Color   Number of wheels   Number of seats
Length             0.7      0       0                  0
Color              0.1      0.1     0                  0
Number of wheels   0.6      0       …                  …
Number of seats    0.5      …       …                  …

This padding approach allows different ways of processing between training and inference. While training, the encoder and decoder inputs are passed completely, but while inferring, the decoder input is given property by property (see the sketch after the list below). Non-given properties are padded with, e.g., 0.

  1. length
  2. length, color
  3. length, color, number of wheels
  4. length, color, number of wheels, number of seats
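
The step-wise padding of the list above can be sketched like this (the property values are made up):

```python
import numpy as np

# At step t only the properties up to t are visible; later ones are padded with 0.
properties = np.array([4.2, 2.0, 4.0, 5.0])   # length, color, wheels, seats

for t in range(1, len(properties) + 1):
    step_input = np.zeros_like(properties)
    step_input[:t] = properties[:t]           # pad everything after t with 0
    print(f"step {t}: {step_input}")
```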

The combination of encoder and decoder also happens via the attention mechanism: the encoder provides the keys (K) and values (V), while the decoder provides the queries (Q).

From there on, the rest is about constructing the expected output shape (sparsity) and the learning capability.

Well done 🙂

Update 2023/12/10

Thank you for your questions.

Feedback no. 1: Sure, everything I explained is simplified and does not necessarily have to follow the approaches exactly as realized in the original paper. The learning happens once in the encoder and once in the decoder.

Feedback no. 2: During training and inference the left part takes the input and the right part the output, for translation e.g., English and Spanish. During training always the full sentences are passed; during inference the right part is fed step-wise (aka time-wise).

Feedback no. 3: The often-read term “greedy search” refers to selecting, at each decoding step, only the maximum-scoring candidate (in our analogy: the best fitting key (K) for a query (Q)). If more than one search path is followed in parallel, it is called, e.g., “beam search”.
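
A tiny sketch of the difference, with made-up candidate scores:

```python
import numpy as np

# Illustrative scores of candidate next "keys" for one query.
scores = np.array([0.1, 0.7, 0.2])

# Greedy search: keep only the single best candidate.
greedy_choice = int(np.argmax(scores))        # -> 1

# Beam search (beam width 2): keep the two best candidates in parallel.
beam = np.argsort(scores)[::-1][:2]           # -> [1, 2]
print(greedy_choice, beam)
```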

Feedback no. 4: What the division by h in the “Attention Is All You Need” paper means for “multi head attention” is quite simple: it just splits the given matrix into column-wise slices of d_model / h columns each.
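
For example, with numpy (the shapes are made up, d_model = 6 and h = 3):

```python
import numpy as np

X = np.arange(24).reshape(4, 6)     # 4 tokens, d_model = 6
heads = np.split(X, 3, axis=-1)     # h = 3 -> three slices of 2 columns each
print([h.shape for h in heads])     # [(4, 2), (4, 2), (4, 2)]
```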
