Time, neural networks and transformers

Let us start with time. Time introduces a sequential order – one thing after the other. A typical word embedding does not assign a single number to a word; it assigns an array of numbers that encodes, among other things, the context of the given word with the help of the surrounding words (compare the exemplary external link). This context allows the word to be described mathematically in its local interdependencies. That matters, for example, for translation, because the same words in a different order carry a different meaning: “Margarita eats Bianca” vs. “Bianca eats Margarita”.
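To make this concrete, here is a minimal sketch of an embedding look-up with a made-up three-word vocabulary and random 4-dimensional vectors (the names and numbers are purely illustrative stand-ins, not a trained model):

```python
import numpy as np

# Minimal sketch: a toy embedding table (random vectors stand in for trained ones).
# Each word maps to an array of numbers, not a single number.
rng = np.random.default_rng(0)
vocab = ["Margarita", "eats", "Bianca"]
embedding_dim = 4
embedding_table = {word: rng.normal(size=embedding_dim) for word in vocab}

# The same words in a different order produce a different sequence of vectors,
# but the embedding alone does not yet tell us which word came first.
sentence_a = ["Margarita", "eats", "Bianca"]
sentence_b = ["Bianca", "eats", "Margarita"]
print(np.stack([embedding_table[w] for w in sentence_a]))
print(np.stack([embedding_table[w] for w in sentence_b]))
```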

The local context encoding alone, however, does not say where in the sequence a word sits, so a technique called “positional encoding” is used on top of it. One of many methods is to take a set of mathematical functions, e.g., sin and cos. For each word position – 1st, 2nd, 3rd, etc. – the sin and cos functions yield values that give a more or less unique point in a latent vector space for the order in time. Borrowing from classical neural-network terminology, such a function can be seen as an activation function.
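A minimal sketch of the sinusoidal variant is shown below; the constant 10000 and the alternation of sin and cos over the dimensions follow the common convention from the original transformer paper and are assumptions here, not something prescribed above:

```python
import numpy as np

def positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Sinusoidal positional encoding sketch: even dimensions use sin, odd
    dimensions use cos, with wavelengths that grow across the dimensions."""
    positions = np.arange(num_positions)[:, None]        # (num_positions, 1)
    dims = np.arange(0, dim, 2)[None, :]                 # (1, dim/2)
    angle_rates = 1.0 / np.power(10000.0, dims / dim)
    angles = positions * angle_rates                     # (num_positions, dim/2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Positions 1st, 2nd, 3rd each get a (nearly) unique vector.
print(positional_encoding(num_positions=3, dim=4))
```

The resulting position vectors are simply added to the word embeddings before attention is applied.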

In the context of transformers, a method called “self-attention” is applied to encode not only the position but also the semantic interdependency. The relationship is often called a similarity, because in a probability space (0..1) it assigns more weight to the more probable relationships, e.g., “While driving my car through a desert it gets hot”. If the car gets hot, more attention should be on the relation between “hot” and “car” than between “hot” and “desert”. The term “self” denotes that it is applied within one vector space, not between two or more vector spaces as in a translation from English to Spanish.
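A toy illustration of this weighting, with hand-picked similarity scores (in a real model they would come from learned projections, not from these made-up numbers), could look like this:

```python
import numpy as np

# Illustrative only: raw similarity scores between the query "it" and other
# words in "While driving my car through a desert it gets hot".
words = ["car", "desert", "hot"]
raw_scores = np.array([4.0, 1.0, 2.5])

# Softmax squashes the scores into the range 0..1 and makes them sum to 1,
# so they can be read as "how much attention" each word receives.
weights = np.exp(raw_scores) / np.sum(np.exp(raw_scores))
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```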

Transformer networks use an encoding technique similar to a look-up table (LUT) as known from databases. In a simple form the database entry is a list of key (K)–value (V) pairs that gets requested by a query (Q). Given the query “it”, the similarity with “car” should be higher than with “desert”. To do this, the given word embedding must be projected into the attention space. This is done by a (W)eight matrix that is multiplied with the numerical representation of the word; these (W)eights are learned during training. K and Q (and V) each have their own transformation (W)eight. A simple dot product then determines the similarity between a given query and a key, normalized by the “SoftMax” operation so that the weights lie between 0 and 1 (= 100%). The value at the end is simply this similarity multiplied with the encoding of the given input sequence.
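Putting Q, K, V and the SoftMax together, a single-head self-attention step can be sketched roughly as follows (the shapes and random weights are assumptions for the example, not trained values):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention sketch.
    X: (sequence_length, embedding_dim) word embeddings (plus positional encoding).
    W_q, W_k, W_v: learned projection weights into the attention space."""
    Q = X @ W_q                      # queries
    K = X @ W_k                      # keys
    V = X @ W_v                      # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot-product similarity, scaled
    # SoftMax turns each row of scores into weights between 0 and 1 that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V               # weighted sum of the values

# Toy shapes: 3 words, 4-dimensional embeddings, 4-dimensional attention space.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (3, 4)
```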

When there are multiple domains (aspects) to consider, each domain gets its own “head” of attention (see the sketch after the list below). Each head has a structure like the following:

  1. word embedding
  2. positional encoding
  3. self attention
  4. residual connections – the original input is added back onto the attention output, so the attention between query and key adjusts the input rather than replaces it.
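Here is a rough multi-head sketch under the same assumptions as before: every head gets its own projection weights, the head outputs are concatenated and mixed by an output projection W_o, and the residual connection adds the original input back on:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One head: its own projections, its own similarity weighting."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, head_weights, W_o):
    """Run each head independently, concatenate, mix with an output projection,
    and add the residual connection (the original input X)."""
    heads = [attention_head(X, *w) for w in head_weights]
    return X + np.concatenate(heads, axis=-1) @ W_o   # residual connection

# Toy example: 3 words, 4-dim embeddings, 2 heads of size 2 each.
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))
head_weights = [tuple(rng.normal(size=(4, 2)) for _ in range(3)) for _ in range(2)]
W_o = rng.normal(size=(4, 4))
print(multi_head_attention(X, head_weights, W_o).shape)   # (3, 4)
```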

Transformation: so far we have only looked at the “encoder” part of the encoder-decoder (aka sequence-to-sequence) architecture. The decoder part usually does the same as the encoder part, only for the other sequence, e.g., the encoder handles English and the decoder Spanish in a translation.

The encoder representation and the decoder representation additionally need to keep track of the positions between each other’s words; this is done with encoder-decoder attention (plain attention, not self-attention).
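A sketch of such encoder-decoder (cross-) attention, assuming the encoder and decoder states are already computed and using random stand-in weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    """Encoder-decoder attention sketch: queries come from the decoder
    (e.g., the Spanish side), keys and values from the encoder (the English
    side), so every output position can attend to every input position."""
    Q = decoder_states @ W_q
    K = encoder_states @ W_k
    V = encoder_states @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

# Toy shapes: 5 English tokens encoded, 3 Spanish tokens generated so far.
rng = np.random.default_rng(3)
encoder_states = rng.normal(size=(5, 4))
decoder_states = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
print(cross_attention(decoder_states, encoder_states, W_q, W_k, W_v).shape)  # (3, 4)
```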
