Classical sequence-to-sequence (Seq2Seq) models based on Recurrent Neural Networks (RNNs) condense the entire input sequence into a single fixed-length context vector. While effective for short sequences, they struggle with longer or more complex inputs due to their inability to capture fine-grained, long-range dependencies. RNNs also process tokens sequentially, which:
- Slows down training and inference (no parallelization).
- Suffers from vanishing gradients, hindering long-term context retention.
- Leads to performance degradation on longer sequences.
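The fixed-length bottleneck is easy to see in code. The following minimal sketch (NumPy, randomly initialized weights, purely illustrative) runs a vanilla RNN encoder over an input sequence: however long the input, the decoder only ever receives the final hidden state as its context.

```python
import numpy as np

def rnn_encode(token_embeddings, W_x, W_h, b):
    """Vanilla RNN encoder: returns only the final hidden state.

    token_embeddings: (seq_len, d_in) array of input embeddings.
    The entire sequence is compressed into one (d_hidden,) vector,
    which is the fixed-length context the decoder must work from.
    """
    d_hidden = W_h.shape[0]
    h = np.zeros(d_hidden)
    for x_t in token_embeddings:          # strictly sequential: step t depends on step t-1
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h                               # single fixed-length context vector

# Toy example with random weights (illustrative only)
rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 8, 16, 50
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

context = rnn_encode(rng.normal(size=(seq_len, d_in)), W_x, W_h, b)
print(context.shape)  # (16,) -- same size no matter how long the input is
```

Whether the input has 5 tokens or 5,000, the context handed to the decoder stays the same size, which is exactly what limits performance on longer inputs.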
Transformers address many of these issues with self-attention, which lets any token attend to any other token in parallel. The resulting matrix of attention weights, a “cross-table” over all token pairs, quantifies the relationship between tokens and enables global context modeling. However, transformers lack an inherent notion of order, so positional encodings are added to incorporate sequence-ordering information.
| Aspect | Seq2Seq | Transformers |
|---|---|---|
| Context Handling | Single, fixed-length context vector | Full pairwise attention (“cross-table”) |
| Parallelization | Sequential (no parallel computation) | Highly parallelizable (self-attention) |
| Long-Range Dependencies | Susceptible to vanishing gradients and limited context | Captures distant dependencies effectively |
| Positional Encoding | Implicit through recurrence | Requires explicit positional encoding |
| Typical Bottlenecks | Slow training, limited context vector | Large memory usage for attention over very long sequences |
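To make the contrast concrete, here is a minimal single-head self-attention sketch in NumPy (no learned query/key/value projections, masking, or multi-head logic, so an illustration rather than a full Transformer layer), together with the sinusoidal positional encoding used in the original Transformer to inject order:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention with identity projections.

    X: (seq_len, d_model). Returns the attended outputs and the
    (seq_len, seq_len) attention-weight matrix -- the "cross-table"
    relating every token to every other token.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # all pairwise similarities, computed in parallel
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ X, weights

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Toy usage: positional information is injected by addition before attention
seq_len, d_model = 6, 16
X = np.random.default_rng(1).normal(size=(seq_len, d_model))
X = X + sinusoidal_positional_encoding(seq_len, d_model)
out, cross_table = self_attention(X)
print(cross_table.shape)  # (6, 6): one weight for every token pair
```

The `cross_table` matrix is the pairwise attention-weight table described above: each row shows how strongly one token attends to every other token, and all rows are computed in a single matrix multiplication rather than step by step.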
Modern applications involve multiple tasks and conversations, creating fragmented and distributed contexts. While transformers excel within a single conversation, they face new obstacles when context must persist and link across multiple interactions. Each conversation (or task) may be isolated, yet users expect systems to recall or relate information from previous tasks.
Key challenges:
- Context Fragmentation: Each conversation is treated independently, losing continuity and coherence across tasks.
- Memory Bottlenecks: Attention mechanisms typically have fixed context windows, limiting the retention of long-term history.
- Cross-Task Relationships: Transformers are optimized for intra-task focus; linking knowledge between tasks is more complex.
- Global Context Across Tasks: Demands new architectures (memory-augmented, retrieval-based) to store and retrieve shared context.
To handle context across tasks and conversations, researchers are exploring:
- Memory-Augmented Networks: Maintain an external memory component that persists beyond a single conversation (see the sketch after this list).
- Retrieval-Augmented Generation (RAG): Dynamically fetch relevant pieces of information from a knowledge base.
- Hierarchical Attention Mechanisms: Organize multiple attention layers or modules to capture context at different scopes (per conversation, per session, or globally).
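As a rough illustration of the first two ideas, the sketch below keeps an external store of snippets from past conversations and retrieves the most similar ones for a new query. The `embed` function is a stand-in (hashed bag-of-words) for a real embedding model, and the `ConversationMemory` class and its methods are hypothetical names rather than an existing library API.

```python
import numpy as np

def embed(text, dim=64):
    """Stand-in embedding: hashed bag-of-words.
    A real system would use a learned embedding model here."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

class ConversationMemory:
    """External memory that persists across conversations (hypothetical API)."""

    def __init__(self):
        self.entries = []   # list of (conversation_id, text, vector)

    def add(self, conversation_id, text):
        self.entries.append((conversation_id, text, embed(text)))

    def retrieve(self, query, k=2):
        """Return the k stored snippets most similar to the query
        (cosine similarity), regardless of which conversation they came from."""
        q = embed(query)
        scored = sorted(self.entries, key=lambda e: -float(e[2] @ q))
        return [(cid, text) for cid, text, _ in scored[:k]]

# Usage: context written in one conversation is retrievable in another
memory = ConversationMemory()
memory.add("conv-1", "User prefers metric units for all measurements")
memory.add("conv-2", "User is planning a trip to Lisbon in May")

for cid, text in memory.retrieve("what units should I use for the recipe?"):
    print(cid, "->", text)
```

Retrieved snippets would typically be prepended to the model's input, which is the core of the RAG pattern; a hierarchical variant could additionally weight matches from the current conversation or session differently from older, global memory.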
As conversations and tasks become more fragmented and intertwined, ensuring continuity and consistency of context becomes a central requirement, especially for emerging AI agent approaches.