In 2022, DeepMind unveiled a deep learning transformer architecture named ‘Gato’ (here), which happens to be the Spanish word for ‘cat.’ While detailed information about the model remains limited, we can make some educated guesses based on the research paper. The paper indicates that Gato is a single sequence model designed to handle more than 600 diverse tasks, where ‘solving’ a task means performing better than random chance.
The strength of the Gato architecture lies in its embeddings and the shared vector space they create, which makes it possible to handle tasks such as playing video games, engaging in conversations, and classifying images. According to the paper, this space is constructed using a logical model referred to here as the ‘super-embedding’, which consists of the following (a minimal sketch of this structure follows the list):
- Episodes that organize time-steps.
- Each time-step comprises an observation followed by an action.
- Observations are structured as ordered lists of tokens.
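To make this hierarchy concrete, here is a minimal Python sketch of how episodes, time-steps, and token lists could be organized before being flattened into one sequence. The class names and example token IDs are my own illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimeStep:
    # Observation tokens come first (text, image patches, discrete values, ...),
    # followed by the action tokens chosen at this step.
    observation_tokens: List[int]
    action_tokens: List[int]


@dataclass
class Episode:
    # An episode is simply an ordered list of time-steps.
    steps: List[TimeStep]

    def to_sequence(self) -> List[int]:
        """Flatten the episode into one token sequence: obs_1, act_1, obs_2, act_2, ..."""
        sequence: List[int] = []
        for step in self.steps:
            sequence.extend(step.observation_tokens)
            sequence.extend(step.action_tokens)
        return sequence


# Usage: two time-steps, each with a small observation and a single action token.
episode = Episode(steps=[
    TimeStep(observation_tokens=[12, 7, 93], action_tokens=[4]),
    TimeStep(observation_tokens=[12, 8, 90], action_tokens=[2]),
])
print(episode.to_sequence())  # [12, 7, 93, 4, 12, 8, 90, 2]
```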
The core building blocks of this super-embedding are the tokens, which can take various forms: individual words in a text, image patches such as 16×16 pixel arrays, or discrete values representing things like the game environment and the actions to be taken. Regardless of its source, each token ID is mapped into a single, unified embedding space. Constructing such embeddings typically (here) involves one or more parameterized embedding functions, often implemented as separate machine learning models.
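As a rough illustration of separate, per-modality embedding functions feeding one shared space, here is a hedged PyTorch-style sketch. The module name, vocabulary size, embedding dimension, and the plain linear patch projection are assumptions made for illustration, not Gato’s actual configuration.

```python
import torch
import torch.nn as nn


class SharedTokenEmbedder(nn.Module):
    """Maps tokens from different modalities into one shared embedding space.

    Illustrative sketch only: the paper describes separate parameterized
    functions per modality, all feeding a single space of common dimension.
    """

    def __init__(self, vocab_size: int = 32_000, patch_dim: int = 16 * 16 * 3,
                 embed_dim: int = 768):
        super().__init__()
        # Lookup table for discrete tokens (text, discretized values/actions).
        self.discrete_embed = nn.Embedding(vocab_size, embed_dim)
        # Linear projection for flattened 16x16 RGB image patches (assumed here).
        self.patch_embed = nn.Linear(patch_dim, embed_dim)

    def embed_discrete(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer IDs -> (batch, seq_len, embed_dim)
        return self.discrete_embed(token_ids)

    def embed_patches(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim) -> (batch, num_patches, embed_dim)
        return self.patch_embed(patches)


embedder = SharedTokenEmbedder()
text_tokens = torch.randint(0, 32_000, (1, 10))    # e.g. ten discrete token IDs
image_patches = torch.randn(1, 4, 16 * 16 * 3)     # four flattened image patches
text_emb = embedder.embed_discrete(text_tokens)    # (1, 10, 768)
patch_emb = embedder.embed_patches(image_patches)  # (1, 4, 768)
sequence = torch.cat([patch_emb, text_emb], dim=1) # one shared token sequence
print(sequence.shape)                              # torch.Size([1, 14, 768])
```

The key point is that every modality ends up as vectors of the same dimension, so the transformer can treat them as one homogeneous sequence.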
Once the embedding for each token is established, a common ‘attention’ mechanism is presumably applied to capture the semantic relationships between the tokens. Various implementations of this mechanism (e.g., here), such as multi-head attention, are explored in the paper.
The paper discusses several common types of attention mechanisms:
- Scaled dot-product attention, a self-attention mechanism frequently employed to compute attention weights over input sequence positions. It is a prevalent choice in transformer architectures (see the sketch after this list).
- Multi-head attention, a variation of scaled dot-product attention that employs multiple attention heads in parallel, each focusing on different input representations.
- Additive attention, which resembles scaled dot-product attention but calculates attention weights differently, using a similarity measure between query and key vectors, followed by a feed-forward neural network.
- Location-based attention, a mechanism allowing the model to concentrate on specific input regions by employing a convolutional neural network to learn attention weights.
- Co-attention, useful when dealing with multiple inputs, enabling the model to discern relationships between various inputs. It is commonly applied in tasks like visual question answering, image analysis, and video captioning, where understanding the connections between inputs is crucial for decision-making.
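To make the first item above concrete, here is a minimal, generic sketch of scaled dot-product self-attention in Python. This is the standard textbook formulation, not code from the Gato paper.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v):
    """Generic scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    weights = torch.softmax(scores, dim=-1)            # attention weights per query
    return weights @ v                                  # weighted sum of values


# Toy usage: a sequence of 5 token embeddings of dimension 8 attending to itself.
x = torch.randn(1, 5, 8)
out = scaled_dot_product_attention(x, x, x)  # self-attention without learned projections
print(out.shape)  # torch.Size([1, 5, 8])
```

Multi-head attention runs several such attention computations in parallel on learned projections of the same inputs and concatenates the results.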
And one more thing: Gato does all of this with roughly 1 billion parameters. IMHO, that is a good indicator of how important multi-modality in the embedding is.