LLMs – we need to decouple facts from logic

Modern AI models are inherently inefficient because they treat every task with the same level of intensity: they engage the same amount of compute for every question instead of simply pulling in static facts.

DeepSeek Engram (https://arxiv.org/abs/2601.07372) helps solve this problem by decoupling Memory (facts) from Reasoning (logic), so that the model can “search” for known facts instead of processing them again and again. Don’t confuse it with RAG approaches: think of Engram as a dictionary and RAG as a library. Engrams are built into the model.

  • Engrams: Hashing + Gating: Hash-based retrieval is checked by a “gate” to see if it fits the current context. Offloads massive parameter tables to RAM instead of expensive GPU VRAM.
  • RAG: Embedding + Prompting: Text is retrieved and literally pasted into the prompt window. Usually stored in vector databases.
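To make the first bullet concrete, here is a minimal sketch of hash-based engram retrieval with a scalar gate. All names and sizes are made up for illustration; the real paper’s table layout, hash function, and gate are learned components, not this hand-coded toy. The key property shown is that lookup is a direct O(1) hash into a table that can sit in ordinary host RAM.

```python
import hashlib

# Toy engram store (hypothetical sizes): a big table that could live in
# CPU RAM rather than GPU VRAM, addressed directly by a hash.
TABLE_SIZE = 10_000
EMBED_DIM = 4
memory_table = [[0.0] * EMBED_DIM for _ in range(TABLE_SIZE)]

def engram_index(token_ids):
    """O(1): hash an n-gram of token IDs to a fixed slot in the table."""
    key = ",".join(map(str, token_ids)).encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % TABLE_SIZE

def retrieve(token_ids, gate_value):
    """A gate value in [0, 1] scales how much of the retrieved memory is used."""
    slot = memory_table[engram_index(token_ids)]
    return [gate_value * v for v in slot]

# The same n-gram always lands in the same slot: no search, no prompt stuffing.
assert engram_index([8675, 42]) == engram_index([8675, 42])
```

Contrast this with the RAG bullet: there, a query is embedded, a vector database is searched for similar chunks, and the winning text is pasted into the prompt, so the model still has to read and re-process it.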

To see this change in concrete form, take an upscale restaurant whose head chef embodies the model’s cognitive power.

In a traditional LLM, it’s as if a Michelin-starred head chef is repeatedly pulled away from composing a delicate, multi-course tasting menu just to pour water or slice bread, an absurd misuse of talent for trivial tasks. Engram alters the kitchen environment. Rather than cut off the chef, it adds a well-organized pantry and service station run by the waitstaff. When a guest requests bread or water, a server retrieves it instantly from the shelf, an O(1) grab, while the chef stays fully focused on complex, high-value dishes.

Under the hood, the system works by tokenizing concepts (the engrams) and compressing them into a dense memory store indexed via a hash map. This enables the model to retrieve specific information immediately instead of rummaging through its entire “brain”. A context-aware gate decides when such fast-access memory is appropriate and suppresses a retrieved item that doesn’t fit the current situation.
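One way the gating step described above could look, sketched with invented names (the actual paper presumably uses a learned gate, not a fixed cosine threshold): compare the retrieved memory vector against the current context vector and zero out the memory when the match is too weak.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def gated_memory(context_vec, memory_vec, threshold=0.5):
    """Toy context-aware gate: pass the retrieved memory through only when
    it agrees with the current context, otherwise suppress it to zeros."""
    if cosine(context_vec, memory_vec) >= threshold:
        return memory_vec
    return [0.0] * len(memory_vec)

# A memory that matches the context passes; a mismatched one is zeroed out.
ctx = [1.0, 0.0, 0.0]
assert gated_memory(ctx, [0.9, 0.1, 0.0]) == [0.9, 0.1, 0.0]
assert gated_memory(ctx, [0.0, 1.0, 0.0]) == [0.0, 0.0, 0.0]
```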

Engram in a nutshell: if you see the word “Apple”, your brain activates an engram associated with the fruit; a language model converts “Apple” into a specific token ID (e.g., 8675). The same goes for the sentence “A quintessential pomaceous fruit, orb-like treasure wrapped in a taut, glossy skin that transitions from deep ruby reds to sun-drenched yellows and vibrant greens, protecting a firm, juicy interior of ivory flesh that delivers a perfect, symphonic snap followed by a complex balance of tart acidity and floral sweetness, all centered around a star-shaped core of dark seeds that has cemented its status as a timeless symbol of both wholesome health and forbidden knowledge” – to the memory system, that whole description collapses to the same ID (e.g., 8675).
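The idea that very different surface forms can activate one and the same memory can be shown with a toy concept registry. This is purely illustrative: a real system learns which forms share a concept, it doesn’t hand-code them, and the ID 8675 is just the example number from above.

```python
# Toy concept registry: different surface forms point to one engram ID.
concept_ids = {}

def register(surface_form, concept_id):
    """Map a surface form (lower-cased) to a shared concept ID."""
    concept_ids[surface_form.lower()] = concept_id

register("Apple", 8675)
register("A quintessential pomaceous fruit", 8675)  # long description, same concept

# Short word and long description both activate the same engram slot.
assert concept_ids["apple"] == concept_ids["a quintessential pomaceous fruit"]
```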

Memories leave durable imprints, and at scale this organizational shift is massive. By eliminating cognitive “prep work”, models become dramatically more efficient without having to grow larger just to map every fact with logic.

This architecture improved “Needle-in-a-Haystack” retrieval accuracy from 84% to 97%. Needle-in-a-Haystack is like asking a restaurant to remember one guest’s specific allergy note – the fact. A weak kitchen guesses based on patterns (generically: “most guests don’t want peanuts”) – the logic.

So in brief: stop asking the chef to fetch the bread, and both the food and the thinking get much better.

DeepSeek’s analysis of “U-shaped scaling laws” shows that allocating roughly 20-25% of a model’s parameter budget to static memory is the sweet spot for building intelligence. Ultimately, Engram suggests that the future of AI isn’t about building ever-larger models, but about building smarter, better-organized and cheaper ones that scale faster and are easier to run.
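As a back-of-the-envelope illustration of that 20-25% figure (the model size below is invented, not from the paper):

```python
def split_budget(total_params, memory_fraction=0.25):
    """Split a parameter budget between static memory and reasoning layers."""
    memory = int(total_params * memory_fraction)
    return memory, total_params - memory

# Hypothetical 7B-parameter model with a quarter of its budget as static memory:
mem, reasoning = split_budget(7_000_000_000, 0.25)
assert mem == 1_750_000_000 and reasoning == 5_250_000_000
```

The memory share can then live in cheap host RAM, while only the reasoning share competes for GPU VRAM.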
