Practical attention mechanism

Google introduced the “Transformer” neural network architecture to tackle the difficulties of sequence transduction, a challenge that arises when working with sequence-to-sequence models. The core concept of this architecture is a technique known as “attention.” Let’s delve into the practical application of the “attention” mechanism.

To begin with, embeddings play a pivotal role in the process. They translate elements such as sentences into a latent vector space: a compact, lower-dimensional representation in which the data is encoded in condensed form and semantically similar items end up close together.

Let’s play with the relationship between

  • A: eating, drinking vs.
  • B: food and drinks.

(Figure: our playfield, a 2D embedding space)
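To make this concrete, here is a minimal sketch in NumPy. The 2D coordinates are hand-picked purely for illustration; a real embedding model would learn them from data. The first axis loosely encodes the food vs. drink domain, the second encodes action vs. object.

import numpy as np

# Hypothetical 2D embeddings, hand-picked for illustration only.
# Axis 0: food domain (+) vs. drink domain (-)
# Axis 1: action (+) vs. object (-)
embeddings = {
    "eating":   np.array([ 2.0,  1.0]),
    "drinking": np.array([-2.0,  1.0]),
    "food":     np.array([ 2.0, -1.0]),
    "drinks":   np.array([-2.0, -1.0]),
}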

Alternatively, the same relationships can be projected into one dimension:

(Figure: the same playfield in 1D space)

The “attention head” under consideration needs to establish correlations that reflect the relationships between

  • food and eating, and
  • drinks and drinking.

This masks our latent 2D embedding space as follows:

(Figure: semantic masking for the chosen embedding)
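One minimal way to see this effect, reusing the made-up coordinates from the sketch above: the cosine similarity of the related pairs (food/eating, drinks/drinking) comes out positive, while the cross-domain pairs come out negative.

import numpy as np

# Same hypothetical 2D embeddings as above (illustrative values only)
words = ["eating", "drinking", "food", "drinks"]
E = np.array([[2.0, 1.0], [-2.0, 1.0], [2.0, -1.0], [-2.0, -1.0]])

# Pairwise cosine similarities: normalize each row, then take dot products
E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
sim = E_norm @ E_norm.T

for i in range(len(words)):
    for j in range(i + 1, len(words)):
        print(f"{words[i]:>8} vs {words[j]:>8}: {sim[i, j]:+.2f}")

With these toy values, eating/food and drinking/drinks each score +0.60, while the cross-domain pairs score negative, which is exactly the kind of correlation pattern this attention head needs.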

Now let us introduce context information:

  • C1: I eat salmon
  • C2: I eat steak
  • C3: I drink red-wine
  • C4: I drink water
  • C5: I drink a white-wine mix

Mapped to our 2D latent space, these sentences are placed roughly as follows:

(Figure: context sentences C1-C5 in the 2D latent space)

A given query (Q1), “I drink water with wine,” maps into the same latent space:

(Figure: query Q1 placed among the context sentences)

As the figure shows, the shortest distance is between Q1 and C5, indicating a high degree of similarity; computed as cosine similarity (the dot product normalized by the vector magnitudes), the value for Q1 and C5 comes out at 1, meaning the vectors point in the same direction. In NumPy, the computation looks like this:

import numpy as np

# Define two example vectors
vector_a = np.array([2, 3])
vector_b = np.array([4, 1])

# Dot product of the two vectors
dot_product = np.dot(vector_a, vector_b)

# Normalizing by the vector magnitudes turns it into the cosine similarity
magnitude_a = np.linalg.norm(vector_a)
magnitude_b = np.linalg.norm(vector_b)
similarity = dot_product / (magnitude_a * magnitude_b)

print("Dot product:", dot_product)
print("Cosine similarity:", similarity)

The Google paper titled “Attention Is All You Need” introduces a concept called “self-attention”: the semantic connections within a sentence are examined by treating the sentence itself as the query. Take, for instance, the sentence: “The animal was not swimming in the sea because it was running on the street.” Each word’s relationship with all other words in the sentence is analyzed. Consider the word “it”: does it refer to the “animal” or to the “street”?

To achieve this, an embedded version of the sentence is employed, as previously illustrated. The query vector of each word is compared against the key vectors of all the other words to gauge semantic similarity (or relationship). For the word “it,” this process might look as follows:

  • “it” to “The” = 0.0
  • “it” to “animal” = 0.8
  • “it” to “was” = 0.05
  • “it” to “not” = …
  • etc.

The resulting vector of scores is referred to as the “self-attention” vector. Normalized (via softmax) into a probability distribution and multiplied once more with the input, i.e. with each word’s “value” vector, it yields the weighted output of the attention head. A minimal sketch follows below.
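Here is a minimal self-attention sketch in NumPy, following the scaled dot-product formulation from “Attention Is All You Need.” The embeddings and projection matrices are random stand-ins for learned weights, so the printed attention pattern is illustrative only; in a trained model, the row for “it” would concentrate on “animal.”

import numpy as np

rng = np.random.default_rng(0)

tokens = ["The", "animal", "was", "not", "swimming", "in", "the",
          "sea", "because", "it", "was", "running", "on", "the", "street"]
d = 8  # tiny embedding / head dimension, for illustration

# Random stand-ins for learned quantities
X = rng.normal(size=(len(tokens), d))   # token embeddings
W_q = rng.normal(size=(d, d))           # query projection
W_k = rng.normal(size=(d, d))           # key projection
W_v = rng.normal(size=(d, d))           # value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # each row is a probability distribution
output = weights @ V

# Attention distribution of "it" over all tokens
it = tokens.index("it")
for tok, w in zip(tokens, weights[it]):
    print(f'"it" to "{tok}": {w:.2f}')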

Every “attention head” captures a distinct type of relationship. Taken together, the embedding and these “attention heads” make it possible to map a wide range of semantic relationships, as sketched below.
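As a rough sketch of the multi-head idea, under the same random-weight assumptions as above: every head gets its own projection matrices and can therefore specialize in a different relationship, and the head outputs are concatenated back to the model dimension. (A real Transformer additionally applies a learned output projection to the concatenation.)

import numpy as np

rng = np.random.default_rng(1)

def attention_head(X, W_q, W_k, W_v):
    # Scaled dot-product attention for a single head
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n_tokens, d_model, n_heads = 15, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))  # random stand-in embeddings

# Each head has its own projections and can therefore
# specialize in a different type of semantic relationship
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, W_q, W_k, W_v))

# Concatenate the head outputs back to the model dimension
output = np.concatenate(heads, axis=-1)
print(output.shape)  # (15, 8)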
