Transformer and Markov model (work in progress)

I just read the article here, which makes the following statement:

The self-attention mechanism allows Transformer models to consider the entire context of a sequence, rather than just the current state. This is a significant advantage over Markov Models, which are limited by their fixed context length.

Krystian Safjan, Understanding the Differences in Language Models – Transformers vs. Markov Models, 2023

But what about a higher-order hidden Markov model? It is higher-order because it uses not only a few but all prior time steps (e.g., states, words) to predict the next time step. In all cases one important restriction lies in the fundamental stochastic idea of causality, namely that causes precede their effects.

Because of that, a Transformer does not follow a stochastic process, unless … the future has already happened and was a causal effect of the past, i.e., of the prior states (also see the EPR paradox). The Transformer oversees the whole process outcome, which implies that it has already happened. Let’s take a short look at Bayesian probability; it may help.

Let’s consider the following sentence: “I like cherries, because they are sweet.” This sentence can vary in two places:

  1. “I like cherries” vs. “I don’t like cherries” and
  2. “because they are sweet” vs. “because they are bitter”

With a Markov model, a certain probability would lead from “I like cherries” to either “because they are sweet” or “because they are bitter”. Bayes would see it the other way around:

  • P(“because they are sweet” | “I like cherries”), i.e., “they are sweet” given that “I like them”, which, weighted with the priors, gives the same joint probability as
  • P(“I like cherries” | “because they are sweet”), i.e., “I like cherries” given that “they are sweet”.

In a more formal way:

  • A = “I like cherries”
  • not A = “I don’t like cherries”
  • B = “because they are sweet”
  • not B = “because they are bitter”

P(A) * P(B|A) = P(B) * P(A|B)

This would imply that I can infer the future state with a certain likelihood.
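
To see that both sides of the identity are just two ways of writing the same joint probability, here is a minimal numerical sketch in Python, with made-up probabilities for the cherry example:

```python
# A minimal numerical sketch of Bayes' theorem for the cherry sentence.
# All probabilities below are invented purely for illustration.

# Joint probabilities over A ("I like cherries") and B ("because they are sweet").
p_joint = {
    ("A", "B"): 0.45,          # I like cherries and they are sweet
    ("A", "not B"): 0.05,      # I like cherries and they are bitter
    ("not A", "B"): 0.15,      # I don't like cherries and they are sweet
    ("not A", "not B"): 0.35,  # I don't like cherries and they are bitter
}

# Marginals
p_A = p_joint[("A", "B")] + p_joint[("A", "not B")]
p_B = p_joint[("A", "B")] + p_joint[("not A", "B")]

# Conditionals
p_B_given_A = p_joint[("A", "B")] / p_A
p_A_given_B = p_joint[("A", "B")] / p_B

# Both sides of Bayes' theorem equal the same joint probability P(A, B) = 0.45
print(round(p_A * p_B_given_A, 6))  # 0.45
print(round(p_B * p_A_given_B, 6))  # 0.45
```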

Accepting that everything we talk about holds only with a certain probability, a given sentence, too, is just the result of a stochastic process that covers the experience of previous process executions up to a certain likelihood. In a nutshell, this is what we mean when we say that our AI is hallucinating.

Coming back to the Markov assumption. An interesting question is: is there a probability of reaching a certain state after n steps, e.g., going from “I like cherries” to “because they are sweet”, or, for translation tasks, from a given input to a resulting output, e.g., from a given German sentence snippet (attention query) to the next English word of the translation (attention value)?
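
In classical Markov-chain terms this is exactly the n-step transition probability, which the Chapman–Kolmogorov equation gives as the n-th power of the transition matrix. A minimal sketch with a toy chain over the cherry snippets (the transition probabilities are invented for illustration):

```python
import numpy as np

# A toy Markov chain over three "states" (sentence snippets).
states = ["I like cherries", "because they are sweet", "because they are bitter"]

# Invented transition probabilities; each row sums to 1.
P = np.array([
    [0.0, 0.8, 0.2],   # from "I like cherries"
    [0.5, 0.3, 0.2],   # from "because they are sweet"
    [0.6, 0.1, 0.3],   # from "because they are bitter"
])

# Chapman-Kolmogorov: the n-step transition probabilities are the n-th matrix power.
n = 3
P_n = np.linalg.matrix_power(P, n)

start = states.index("I like cherries")
target = states.index("because they are sweet")
print(f"P(reach '{states[target]}' after {n} steps) = {P_n[start, target]:.3f}")
```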

Let us assume a given embedding where each embedded element is already split into three parts (see the sketch after this list), via either a neural network (additive approach, also see here) or any other kernel (multiplicative approach):

  • query
  • key
  • value
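
A minimal sketch of that split, assuming the common formulation with learned linear projections; the random matrices below merely stand in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a sequence of 4 tokens, each embedded in d_model = 8 dimensions.
seq_len, d_model, d_head = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))   # token embeddings

# Placeholder projection matrices standing in for the learned linear layers.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q = X @ W_q   # queries: what each token is looking for
K = X @ W_k   # keys:    what each token offers to be matched against
V = X @ W_v   # values:  the content that is actually passed on

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```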

The attention score gives the relevance between elements in the given sequence (aka time steps).

The attention score (aka attention filter, because it blends out what is of less relevance) is calculated by multiplying each word’s query with its own and the other words’ keys, to determine the probabilistic importance (0..1 after the softmax). It can somehow be seen as a dependent probability: how likely is it to go from a query state to a key state? This describes the relevance between elements. In Markov’s view, how likely is it to go from one word to another. We just have to treat the transition probability as the relevance between words.

Each of these scores is then multiplied with the value of the corresponding word and summed up. The result is a weighted sum of the values, in which the weights act like the probabilities of getting from the current element and time step to the others. Important words get a higher weight than less important ones.
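
Putting both steps together in a small sketch (random queries, keys and values as placeholders): the softmaxed score matrix is row-stochastic, so each row can be read like a Markov transition distribution over the words, and the output is the weighted sum of the values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder queries, keys and values for a sequence of 4 tokens.
seq_len, d_head = 4, 8
Q = rng.normal(size=(seq_len, d_head))
K = rng.normal(size=(seq_len, d_head))
V = rng.normal(size=(seq_len, d_head))

# Attention scores: every query against every key, scaled by sqrt(d_head).
scores = Q @ K.T / np.sqrt(d_head)

# Softmax per row turns the scores into numbers in (0, 1) that sum to 1 per query,
# i.e. each row looks like a transition distribution from one word to all words.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.sum(axis=-1))  # [1. 1. 1. 1.] -- row-stochastic, like a Markov transition matrix

# The output is the attention-weighted sum of the values.
output = weights @ V
print(output.shape)  # (4, 8)
```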
