Architecting the Cognitive Operating System of 2026

For the past few years, AI has resembled a gold rush. Organizations hurried to dump everything they owned (documents, tickets, wikis, logs) into vector databases, convinced that semantic retrieval would finally make large language models dependable. Retrieval-Augmented Generation (RAG) became the default pattern: embed everything, search everything, paste the top-k chunks into a prompt, and hope the model behaves.

Source: https://www.researchgate.net/figure/Retrieval-Augmented-Generation-Architecture_fig1_378364457 

It worked, but only in the way a patch works. Vector embeddings reduced hallucinations and extended the models’ knowledge, yet they quietly entrenched a deeper misconception: retrieval is not memory, and search is not reasoning. As systems scaled, it became obvious that a stateless chatbot backed by an enormous filing cabinet is still stateless. The “vector gold rush” is ending not because vector search failed, but because it was never meant to be the foundation of intelligence. It is a storage primitive, not a cognitive architecture.

Source: https://medium.com/@messouaboya17/the-rise-of-the-llm-os-from-aios-to-memgpt-and-beyond-513177680359 

What is replacing it is a more ambitious framing: the LLM as an operating system. Popularized by Andrej Karpathy and now actively engineered across industry and open source, this paradigm reframes the model’s role. The LLM is no longer just an application that emits text; it becomes the CPU of a new computational stack. The context window behaves like RAM, long-term stores resemble disk, agents act as processes, and tools are invoked as system calls. Intelligence stops being a request–response loop and starts behaving like a running system with state.

The restaurant analogy: In the vector era, the setup resembled a chaotic diner: a brilliant chef with no memory. Every order required a sprint to a warehouse next door to grab a handful of loosely relevant pages. Sometimes the dish worked; often it didn’t. In the emerging architecture, the same chef operates inside a Michelin-grade kitchen. A manager maintains customer context. Prep cooks stage ingredients. Sous-chefs handle specialized tasks. The chef never leaves the stove. The intelligence didn’t change; the architecture did.

The pressure driving this transition comes from hardware physics. Modern AI systems are no longer compute-bound; they are memory-bound. GPUs can perform staggering amounts of math, but inference stalls when data cannot reach the cores fast enough. Each generated token requires scanning an ever-growing key–value cache. This widening gap between compute and data movement is the memory wall.

That is why recent GPU generations emphasize memory over raw FLOPS. The transition from platforms like the NVIDIA H100 to B200 prioritizes bandwidth and capacity, delivering massive increases in on-package HBM. In operating-system terms, RAM finally became large and fast enough to hold a meaningful working state instead of thrashing. At multi-terabyte-per-second bandwidths, large context windows become practical rather than pathological.

Even so, HBM is expensive and finite. A new tier is emerging between DRAM and SSDs: high-bandwidth flash. By combining NAND with stacking techniques borrowed from HBM, this “warm memory” tier offers terabyte-scale capacity with latency low enough to remain in the cognitive loop. Hot thoughts live in HBM, warm memories in high-bandwidth flash, and cold archives on network storage.

Source: https://www.researchgate.net/figure/Schematic-illustration-of-the-memory-hierarchy-in-traditional-CMOS-based-computing_fig2_368522350 

Hardware sets the stage, but software turns capacity into capability. Early LLM serving systems treated memory naively, allocating large contiguous blocks per request and wasting vast portions of GPU memory. 

The breakthrough came when serving engines adopted a classic OS idea: virtual memory. 

Systems such as vLLM page the attention cache into small blocks that can be allocated and moved on demand. To the model, memory appears contiguous; underneath, utilization approaches saturation. This is, in effect, what we commonly call a Memory Management Unit (MMU).

Paging unlocks more than efficiency; it enables sharing. System prompts, personas, and common prefixes can exist once in physical memory and be referenced by thousands of concurrent agents. Divergence triggers copy-on-write, mirroring how Unix efficiently forks processes. Entire agent swarms become feasible without linear memory growth.
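A minimal sketch of this idea, in the spirit of vLLM’s PagedAttention (block size, structures, and names are illustrative assumptions, not vLLM’s actual API): physical blocks are reference-counted, shared prefixes are reused, and a write by one sequence triggers a copy.

```python
# Toy paged KV-cache allocator with copy-on-write sharing.
# All structures are invented for illustration; not the real vLLM internals.

BLOCK_SIZE = 16  # tokens per physical block (assumption)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # many sequences reference one physical block (e.g., a system prompt)
        self.refcount[block] += 1

    def copy_on_write(self, block):
        if self.refcount[block] == 1:
            return block          # sole owner: write in place
        self.refcount[block] -= 1
        return self.alloc()       # divergence: copy to a private block

alloc = BlockAllocator(num_blocks=8)
shared_prefix = alloc.alloc()                  # common prefix stored once
alloc.share(shared_prefix)                     # a second agent reuses it
private = alloc.copy_on_write(shared_prefix)   # first divergent write copies
```

The point is that a thousand agents with the same persona cost one copy of its KV blocks, plus a refcount, until they diverge.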

With working memory under control, long-term memory becomes the next constraint. Flat vector databases excel at fuzzy recall but collapse under structured reasoning. They do not encode identity, hierarchy, or causality. As collections grow, noise increases and relevance decays. The answer is not abandoning vectors, but embedding them in a richer memory hierarchy.

Source: https://agentman.ai/blog/reverse-ngineering-latest-ChatGPT-memory-feature-and-building-your-own 

Architectures inspired by systems like MemGPT treat memory explicitly. Core memory (persona, goals, active context) is pinned. Recall memory is summarized, compacted, or discarded. Archival memory remains vast and invisible until explicitly paged in. Crucially, the agent decides when to write, compress, or forget. Memory ceases to be an external script and becomes a cognitive function.
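The tiering can be sketched in a few lines (a toy structure in the spirit of MemGPT, not its actual API): core memory is pinned, recall memory is compacted when it overflows, and archival memory is only consulted on an explicit page-in.

```python
# Toy MemGPT-style tiered memory. Names and eviction policy are assumptions.

class TieredMemory:
    def __init__(self, recall_limit=3):
        self.core = {}        # persona, goals: always in context
        self.recall = []      # recent items, compacted when over the limit
        self.archive = []     # vast, invisible until paged in
        self.recall_limit = recall_limit

    def remember(self, item):
        self.recall.append(item)
        if len(self.recall) > self.recall_limit:
            # compact: evict the oldest item down to the archive
            self.archive.append(self.recall.pop(0))

    def page_in(self, query):
        # explicit archival lookup, analogous to a page fault
        return [item for item in self.archive if query in item]

mem = TieredMemory(recall_limit=2)
mem.core["persona"] = "helpful assistant"
for note in ["met Alice", "project X kickoff", "Alice prefers email"]:
    mem.remember(note)
# the oldest note was evicted to the archive, but can be paged back in
hits = mem.page_in("Alice")
```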

This leads naturally to hybrid storage. Vectors provide similarity. Knowledge graphs encode relationships. Key-value stores track state. Together they form a cognitive file system rather than a flat embedding soup. Reasoning across people, projects, and timelines becomes guided traversal, not blind nearest-neighbor search.

Source: https://agentman.ai/blog/reverse-ngineering-latest-ChatGPT-memory-feature-and-building-your-own 

On top of this substrate live the agents themselves. During the gold rush, an “agent” was often little more than a Python loop around an API call. Mature systems now resemble microkernel architectures: small, fine-grained servers, each offering specific “tools”. The LLM handles planning and reasoning. A separate runtime executes tools, manages I/O, and enforces policy, playing the role of OS daemons and services. Model actions become system calls, mediated much as libc still mediates syscalls in today’s operating systems: validated, sandboxed, and auditable, like processes under systemd (PID 1) or inside Linux containers (e.g., Docker). A hallucination degrades into a failed syscall, not a production incident.

This separation makes security tractable. Capabilities are scoped. Side effects are contained. Multiple agents coexist without corrupting shared state. Over time, agents become networked processes: scheduling agents negotiate with calendar agents, procurement agents coordinate with finance agents, all via defined protocols rather than prompt glue.

Source: https://llm-d.ai/blog/llm-d-announce 

Source: https://www.haibinlaiblog.top/index.php/nsdi26-can-we-use-mlfq-in-llm-serving/ 

At scale, intelligence becomes elastic. Models are sharded across GPUs. Prefill and decode are disaggregated. Specialized expert models handle distinct workloads. The system routes cognition the way a cloud scheduler routes jobs, scaling capacity up and down with demand.

The restaurant analogy, revisited: Seen again through the restaurant lens, the picture is complete. The executive chef, the compute, never leaves the stove. The counter, the context window, is meticulously staged. Hot ingredients sit in HBM, warm supplies in the walk-in cooler, bulk stock in storage. Sous-chefs specialize. Managers coordinate. Multiple kitchens cooperate as a franchise. What was once a frantic diner becomes a disciplined brigade.

This is the real end of the vector gold rush. Not the rejection of vectors, but their demotion from centerpiece to component. The systems that win in 2026 will not merely deploy models; they will boot operating systems: persistent, stateful, secure, and aware of their own memory. 

Architectural Summary:

Flat RAG → Virtualized memory → Hierarchical storage → Agent “microkernel” OS → Elastic distributed intelligence (the Kubernetes layer)

Posted in Uncategorized | Leave a comment

Restaurant experience risk with LLM

LLMs and agents seem to serve experts much more than they benefit everyone else. One common comparison: an LLM is less a professional chef than an incredibly fast sous-chef who has seen every recipe ever written. In the hands of a well-experienced head chef, this is very potent. The chef knows the cuisine, knows the balance of flavor, knows when a recipe is missing acid or carries too much salt, and can tell when a suggestion might clash directly with the rest of the menu.

The sous-chef speeds up prep work, recommends modifications, and assists with idea exploration, but the chef is still responsible for taste, consistency, and quality. The same sous-chef is infinitely more dangerous for someone who has no cooking experience. Recipes look perfect, directions sound authoritative, substitutions seem natural. But without a grasp of the fundamentals (heat control, seasoning, timing), even following the instructions correctly can produce an inedible result, and when something tastes wrong, it isn’t clear why. The issue isn’t the recipe generator; it’s the judgment needed to evaluate the output. This roughly maps to the way LLMs are used in development.

Experienced engineers approach generated code like a recipe draft. They can see right away what’s missing: error handling, clear contracts, separation of concerns, long-term maintainability. They adjust the proportions, swap ingredients, and sometimes throw out the dish. Newer users tend to assume the code must be “good” if it compiles and executes something (just as you might assume a dish is right because the recipe was followed from start to finish).

Hyrum’s Law also shines through in this analogy. In cooking, whatever the undocumented quirk is (a pan that runs hotter than expected, say, or an oven that browns in patches), it will eventually be relied upon. Change the pan, and the dish breaks. In software, the quirks of LLM-generated code become accidental dependencies just as frequently. Experts compensate for them deliberately; novices accidentally bake them into every future meal.

Agent-based systems are the kitchen’s automated stations. In an efficiently run restaurant, automation optimizes speed of service without sacrificing the quality of dishes, because the menu is set, processes are established, and chefs control the flow of orders. In an inexperienced kitchen, the very same automation generates inconsistent, uneven dishes at scale. Errors are not isolated; they are multiplied. The principle of technical debt fits perfectly here.

Fast food is inexpensive to create but costly to sustain over the long haul. LLMs make “fast food software” extremely easy to create: fast, filling, instantly satiating. Experienced teams take the time to transform that into a balanced, sustainable meal. Others end up with indigestion: systems that are difficult to refactor, difficult to reason about, and increasingly fragile. I like this restaurant analogy because it encapsulates a simple truth repeated across developer circles: LLMs don’t eliminate taste, judgment, and responsibility. They amplify them. Just as good tools don’t make a great cook, LLMs don’t make a great engineer. They just make the difference visible sooner, and at scale.


JEPA: The step after generative language models

For a long time, AI research has followed two main paths.

  • Generative Model: One path tries to recreate the world exactly, piece by piece. It paints every pixel, predicts every word, and rebuilds reality in detail.
  • Discriminative/Contrastive Model: The other path tries to label the world, deciding what something is rather than rebuilding it.

These ideas gave us today’s language models and image generators. But they hit limits. They struggle with real understanding, long-term planning, and uncertainty. They copy patterns well, but they don’t truly understand what’s going on.

A different idea, called JEPA, takes another route. Instead of copying the world or labeling it, it tries to understand the world by predicting what is missing, using abstract concepts rather than raw details.

The old kitchen

Think of traditional AI models as line cooks. When a customer orders a dish, the cook must recreate everything from scratch. Every cut, every garnish, every seed must be placed just right.  If the cook isn’t sure where something goes, they guess. From far away, the plate looks fine. Up close, it’s messy. That’s how generative models work. They focus on tiny details. When details are uncertain, they fill in the gaps, even if they’re wrong.

Now imagine food critics. They don’t cook anything. They compare.  They taste two dishes and say whether they are similar. Or they check if a dish matches a description.

The culinary academy

Now imagine a culinary academy with a strange rule: No one is allowed to cook. The goal is not to recreate dishes, but to understand the flavor itself. This is where JEPA lives. A student is shown part of a dish.  Maybe they see lamb and rosemary on one side of the plate, but the sauce on the other side is hidden.

Instead of imagining how the sauce looks, the student forms a mental idea of what they see:

“This is rich, savory, herbal”

Then the student is asked:

“Given this, what is likely on the hidden side?”

They don’t imagine yogurt, mint leaves, or seeds.  They think in concepts:

“Rich meat is usually balanced with something cool and fresh”

So they predict:

“The hidden part probably tastes creamy and acidic”

This prediction lives entirely in the mind. No images, no pixels, no tokens, no words. Just meaning.

Now the main difference is how the learning happens. To check the answer, a master looks at the real sauce and forms their own mental idea:

“Cool, creamy, slightly sweet”

The student isn’t graded on visuals. They’re graded on how close their idea is to the master’s idea. If the student missed the sweetness, they adjust their understanding:

“Ah, pomegranate adds sweetness, not just acidity”

No one ever cared where the seeds were placed. Only the essence mattered.

The most important twist. The master changes slowly, becoming more stable over time.  The student changes quickly, trying to catch up. This prevents cheating. The student can’t just say “everything tastes the same” to get perfect scores, because the master keeps raising the bar. The student is always chasing a slightly better version of their own understanding.

JEPA doesn’t try to copy reality.  It doesn’t obsess over tiny details. It doesn’t hallucinate missing pieces. It builds a world model made of meaning.

That’s not cooking the dish. That’s understanding the flavor.

The technical food laboratory view 

These text excerpts are the result of a Discord discussion. Many thanks to my conversation partners.

The evolution of Machine Learning in four simple steps:

  1. The Encoder takes raw input data (like a sentence, an image, or an audio clip) and condenses it into a compact, numerical summary. Comparing these encodings for similarity is the usual classification/discrimination. 
  2. The Decoder (mapping) takes the context vector provided by the encoder and “unpacks” it into the desired output format. Simply: from A to B. 
  3. Between the encoder and decoder lies a “bottleneck”: this compressed representation is the only information the decoder gets to see. The arrangement is known under the term “autoencoder” and is often used as a filter. 
  4. The so-called “Transformer”, whose attention mechanism allows the decoder to look back at specific parts of the encoder’s input, making the “bottleneck” much more flexible. 

In all cases, the mathematical representation (e.g., a vector, a tensor) is the so-called embedding: information put into an alternative, more easily processable representation. 
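The encoder-decoder-bottleneck idea can be sketched with toy functions (no real network, purely illustrative): the encoder compresses a sequence into a tiny “context vector”, and the decoder must reconstruct an output from that bottleneck alone, losing detail along the way.

```python
# Toy encoder/decoder with a bottleneck. The "embedding" here is just
# (length, mean) of the sequence - an invented stand-in for a latent vector.

def encode(sequence):
    # bottleneck: only the length and the average survive compression
    return (len(sequence), sum(sequence) / len(sequence))

def decode(context):
    length, mean = context
    # reconstruct from the compressed summary; individual detail is gone
    return [mean] * length

context = encode([2, 4, 6])
output = decode(context)
```

Note how the reconstruction keeps the gist but not the specifics: exactly the information loss the bottleneck causes.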

And now JEPA takes the fifth step. In simple words, it gives the same concept an almost identical embedding independently of its initial representation, including for the parts of the input that are missing and must be inferred. The goal is not to generate something across the distribution, but rather to reconstruct identity (facts). 

The steps, via an image example,  are:

  • Split the image into tiles
  • Generate an embedding (e.g., a tensor) for each tile; these are the target encodings 
  • Learn a predictor that can generate a missing tile’s embedding from the other tiles 
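The steps above can be sketched numerically (all numbers and the stand-in encoder/predictor are invented): the key point is that the loss is computed between predicted and target embeddings, never between raw pixels.

```python
# Toy JEPA-style setup: predict a masked tile's embedding, not its pixels.

def embed(tile):
    # stand-in "encoder": mean and spread of the tile's pixel values
    mean = sum(tile) / len(tile)
    spread = max(tile) - min(tile)
    return (mean, spread)

def predict_missing(context_embeddings):
    # stand-in "predictor": average the context embeddings
    n = len(context_embeddings)
    return (sum(e[0] for e in context_embeddings) / n,
            sum(e[1] for e in context_embeddings) / n)

def embedding_loss(pred, target):
    # distance in representation space, not pixel space
    return (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2

tiles = [[10, 12, 11], [9, 13, 10], [11, 11, 12]]   # visible tiles
hidden = [10, 12, 12]                                # masked tile
pred = predict_missing([embed(t) for t in tiles])
loss = embedding_loss(pred, embed(hidden))
```

In a real JEPA, the encoder and predictor are learned networks and the target encoder is a slowly updated copy (the “master”), but the shape of the computation is the same.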

LLM hallucination

AI hallucination occurs when a generative model creates outputs that sound plausible but are in reality false, nonsensical, or unconnected to reality. A modern LLM generates text via a probabilistic next-word prediction mechanism. At base, these models are merely autocompletion engines that look at a given sequence of words and guess the most likely sequence of words that would follow. Next words are selected based on a probability distribution.

This probability-driven process is powerful for producing fluent and diverse text, which is why we call it generative. It generates what seems right statistically, not what has been verified as fact; some call it perception instead of factuality. The distribution over the next token, given the input sequence, is shaped by so-called heads.
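The selection step itself can be sketched in a few lines (toy vocabulary and invented logits, not a real model): scores are turned into a probability distribution via softmax, and the next token is drawn from it rather than verified against facts.

```python
# Toy next-token sampling: softmax over invented logits, then a weighted draw.
import math
import random

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(candidates, scores, rng):
    probs = softmax(scores)
    # draw one token according to the distribution - plausible, not verified
    return rng.choices(candidates, weights=probs, k=1)[0]

rng = random.Random(0)
candidates = ["spicy", "cold", "blue"]
scores = [4.0, 1.5, -2.0]   # invented logits for "The soup is too ..."
next_word = sample_next(candidates, scores, rng)
```

Even the low-probability “blue” can occasionally be drawn: that tail is where fluent nonsense, i.e. hallucination, comes from.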

A head is a learned decision mechanism that produces this distribution. The distribution is a function of the relationships between tokens (e.g., words). The so-called context window defines how many tokens around the currently considered token are taken into account. Think about the sentence: “The chicken is too spicy for the waiter”. 

Heads extract relationships such as “Who finds the chicken too spicy?”. Is it, literally, a piece of spicy chicken that the waiter (tasting it or handling it) finds too spicy? Using common sense and context, a human reader resolves this much as different “heads” would: 1) the waiter finds the chicken dish too spicy to handle, or 2) to eat. Because waiters do not typically eat customers’ food (we know that), the sentence might instead imply a scenario like the waiter sampling a dish 3) or joking 4) with the chef.

These heads are shaped not only by training; the data also decides the distribution, as does the transformation (hence “Transformer”) from the input sequence to the next token, and grounding facts can be injected after training, e.g., via domain-related documents through the RAG approach. 


The mechanics of modern LLM – explained in an easy-to-understand way

In today’s world, modern AI is often perceived as a kind of miracle and is frequently mistaken for magic. In reality, however, this power is based on clear architectural principles. To make these concepts tangible, I translate each of them into a restaurant scenario.

1. Rotary Positional Embeddings (RoPE) – Context Through Position

RoPE anchors words not only in their semantic meaning but also in their positional context. This enables a model to understand relationships across long text spans and to perform meaningful extrapolation in extended contexts. A waiter does not only remember what was ordered. “First the appetizer, then the main course”—the meaning depends on the order. RoPE helps the AI preserve this ordering even when requests become very long.
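The core mechanism can be sketched directly (a single 2-D dimension pair and an arbitrary frequency, purely illustrative of the idea, not any model’s actual configuration): each pair of embedding dimensions is rotated by an angle proportional to the token’s position, so order is encoded while vector length is preserved.

```python
# Toy RoPE: rotate an embedding pair by a position-dependent angle.
import math

def rope_rotate(pair, position, freq=1.0):
    x, y = pair
    theta = position * freq          # angle grows with position
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# the same word vector at positions 1 and 2 is rotated differently,
# so "appetizer then main course" differs from the reverse order
v = (1.0, 0.0)
at_pos1 = rope_rotate(v, 1)
at_pos2 = rope_rotate(v, 2)
```

Because rotation preserves the vector’s norm, the word’s semantic magnitude is untouched; only its positional phase changes, and dot products between tokens end up depending on their relative distance.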

2. Chinchilla Scaling Laws – Quality Beats Pure Size

Training models at scale is important, but size alone is not sufficient. Large and efficient models only perform well when trained with the appropriate amount of data. Choosing the right model size relative to the dataset is critical. If there are recipes for only five dishes, a large kitchen with twenty chefs is useless. Better: fewer chefs, well trained, with complete recipes.
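As a back-of-the-envelope check: the Chinchilla paper’s rule of thumb is roughly 20 training tokens per model parameter (an approximation, not an exact law).

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter (approximate).

def chinchilla_optimal_tokens(params, tokens_per_param=20):
    return params * tokens_per_param

# a 70B-parameter model would want on the order of 1.4T training tokens
tokens = chinchilla_optimal_tokens(70e9)
```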

3. Causal vs. Bidirectional Attention – Time Direction Matters

Causal attention only has access to past tokens, while bidirectional attention can also consider future context. The suitability of each approach depends on the task.

  • Causal: The chef prepares courses one by one without knowing what the guest will order later.
  • Bidirectional: An event catering team knows the full menu in advance and plans accordingly.

4. KV Cache – Memory Instead of Repetition

The KV cache stores previously computed information and avoids unnecessary recomputation, which is especially beneficial when many sequential queries must be handled quickly. The waiter remembers that the guest does not want sugar and does not ask again for every coffee—saving time and effort.
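A toy sketch of the mechanism (structures invented for illustration): keys and values for already-processed tokens are stored once, so each decode step computes only the newest token’s entries instead of recomputing the whole history.

```python
# Toy KV cache: count how often the expensive projection actually runs.

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.compute_calls = 0

    def _compute_kv(self, token):
        self.compute_calls += 1          # stand-in for the expensive projection
        return (f"K({token})", f"V({token})")

    def step(self, token):
        k, v = self._compute_kv(token)   # only the new token is computed
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values    # full history available to attention

cache = KVCache()
for tok in ["no", "sugar", "please"]:
    keys, values = cache.step(tok)
# 3 tokens -> 3 projections, instead of 1 + 2 + 3 = 6 without a cache
```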

5. Stability of Large Transformers – Controlling Complexity

Large models require specialized normalization techniques to ensure numerical stability and reliable operation. In a very busy kitchen, clear workflows, hygiene protocols, and guidelines ensure that operations remain orderly even under heavy load.

6. Mixture-of-Experts (MoE) – Specialization Over Generalists

MoE activates only the parts of a model that are required for a given request, improving efficiency and scalability. For sushi, you call the sushi chef; for desserts, the pastry chef. Not every cook needs to do everything.
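A minimal top-1 routing sketch (experts and the gate are invented for illustration, not a real MoE layer): the gate activates exactly one expert per request, so the others cost nothing.

```python
# Toy top-1 Mixture-of-Experts router in the restaurant spirit.

def sushi_expert(order):
    return f"sushi chef prepares {order}"

def pastry_expert(order):
    return f"pastry chef prepares {order}"

EXPERTS = {"sushi": sushi_expert, "dessert": pastry_expert}

def route(order, kind):
    expert = EXPERTS[kind]    # the gate selects exactly one expert
    return expert(order)      # only that expert does any work

answer = route("tiramisu", "dessert")
```

In a real MoE, the gate is a small learned network scoring each expert per token; here it is a lookup to keep the idea visible.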

7. RLHF – Learning from Human Feedback

AI systems are improved through human judgment, becoming more valid, helpful, and appropriate. Guests rate the food, and the chef adjusts taste, portion size, and presentation accordingly.

8. Preference vs. Reward Modeling – Taste vs. Scores

Preference modeling captures user inclinations, while reward modeling quantifies quality. Both approaches complement each other.

  1. Preference: “I like spicy food.”
  2. Reward: ⭐⭐⭐⭐ for this curry.

9. Hallucinations – When the Kitchen Improvises

Models can generate plausible but incorrect content. Techniques such as RAG help reduce this risk. The waiter invents a dish that does not exist. RAG is like checking the menu before answering.

10. Length vs. Clarity – Less Is Often More

Overly long answers can degrade quality. The goal is clear, targeted communication. The waiter briefly explains what is in the tomato dish, rather than telling the entire history of tomatoes.

11. Learning Phases – From Recipes to Guest Satisfaction

  • Pretraining: foundational knowledge = Culinary school
  • SFT: following instructions = Cooking by recipes
  • RLHF: human-centered fine-tuning = Cooking based on guest feedback

12. Knowledge Distillation – Large Kitchen, Small Bistro

Large models transfer their knowledge to smaller, more efficient models. A Michelin-star restaurant designs recipes; a bistro prepares them faster and more affordably.

13. Small Models in RAG – Precise Assistants

In RAG systems, smaller and more focused models tend to be more reliable. A specialized sommelier understands the wine list better than a general service waiter.

14. Jargon Adaptation – Language Depends on the Guest

AI must switch between technical language and everyday language depending on the audience. You speak technical jargon with the chef and plain language with the guest.

15. Hallucinations Despite RAG – A Source Is Not the Truth

External data can still be incorrect or incomplete, so validation remains essential. The delivery list says “fresh fish,” but no one verifies it.

16. Latency & Throughput – Speed Matters

AI systems are defined by response time, scalability, and resource efficiency. If the best meal takes two hours to arrive, it becomes useless.

17. LoRA & QLoRA – Fine-Tuning Instead of Rebuilding

Targeted adaptation enables effective training without fully updating the base model. Adding a new spice instead of redesigning the entire menu.

18. AI Evaluation – More Than Just Taste

Evaluation must cover quality, safety, robustness, and usability. Not only tasty, but also hygienic, reliable, and compatible.

19. Model Controllability – Style at the Push of a Button

Controllability enables distinct, context-aware responses. The same dish served rustic-style or as fine dining.

Conclusion

Modern AI is not random or magical, but a highly orchestrated system of architectures, training processes, feedback mechanisms, and efficiency considerations. Like a good restaurant, no single factor determines success—the overall experience emerges from the interaction of all components.


The LLM Inference Performance Restaurant

In the constantly transforming landscape of artificial intelligence, Large Language Models (LLMs) mark a remarkable step.

But how do we take user interactions and make good use of scarce GPU resources while controlling costs?

Imagine our GPU as a high-end, modern, and beautiful kitchen that has everything to serve up to you in an absolutely perfect dining experience.

  • Prefill Phase: Knowing the Order. In the first step, the model absorbs the full input prompt in one forward pass to produce Key/Value (KV) tensors – the ‘remembering’ of the prompt. Consider it the “Chef Reading the Order”. This stage consumes considerable computational power and determines the Time To First Token (TTFT) – how long diners wait to see their initial course hit the table.
  • Decode Phase: The Plating Stage. The words arrive step by step; every token depends on what has come before it. Think of it as the “Plating of the Risotto”. The chef sets one delicious grain after the other. Since the location of each grain depends on the position of the last, he has to view the plate, let go of a grain, and gaze again. Much depends on how fast the chef can reach the ingredients (memory bandwidth), which determines the Time Per Output Token (TPOT) or “typing speed” – the chef’s pace in adding each grain as he continues to plate the meal.

Performance Metrics: The Kitchen’s Lifeblood.

Meeting Capacity Challenges: The Counter Space Problem. How far can we really expand our kitchen?

Through the use of Little’s Law, we understand the limitation has less to do with the chef’s speed than the size of the kitchen.

The Math of the Kitchen:

  • Chef’s Speed: Able to produce 1000 “grains” (tokens) per second.
  • Average Order: 100 grains per plate of Risotto.
  • Throughput: 10 Requests Per Second (RPS).

If a complete meal takes just 2.3 seconds to finish, the kitchen has to serve around 23 requests at the same time (10 RPS×2.3s) to be efficient.

Which means, it requires enough GPU VRAM (Memory = Counter Space) for 23 plates at once.
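The kitchen math above is just Little’s Law, concurrency = throughput × latency, and takes one line to compute:

```python
# Little's Law: how many requests must be in flight at once.

def required_concurrency(rps, latency_s):
    # requests per second times seconds per request = concurrent requests
    return rps * latency_s

in_flight = required_concurrency(rps=10, latency_s=2.3)  # about 23 plates
```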

There is light at the end of the tunnel. The following mechanisms resolve these bottlenecks and improve the kitchen experience in three ways.

  • Disaggregated serving shifts GPU inference into production mode by decoupling compute-heavy prefill (“reading”) from memory-heavy decode (“plating”) across multiple GPUs or servers, so each stage can scale separately.
  • Runtime efficiency comes from iteration-level scheduling: newly issued requests join in-flight batches instead of waiting for a full batch to form, and global routing chooses the best node based on load and cached context.
  • Large context windows are handled by distributed KV caches that extend beyond a single GPU into system memory or storage, lifting VRAM limits. Rapid, RDMA-based data transfer then links the prefill and decode stages, allowing near-instant hand-offs.

Combined, this makes it possible to avoid conflicts between resources, boost throughput, and reduce latency, and enables large-scale, data-center-level inference. vLLM or SGLang is usually all you need …


How AI Is Transitioning from Bots to Empowering Partners

In a world where technology advances at breakneck speed, the evolution of artificial intelligence (AI) offers a compelling glimpse into the future of human–machine collaboration. AI is no longer confined to answering questions or automating repetitive tasks. Instead, it is steadily progressing toward becoming a genuine partner in our professional lives.

At the heart of this transformation lies a deceptively simple concept: memory.

While today’s AI systems appear intelligent, their lack of long-term memory prevents them from learning, adapting, and improving over time. Understanding this limitation—and how it is being addressed—is key to understanding the next phase of AI.

Generative AI has captured global attention with its ability to produce human-like text, code, and creative output. These systems feel conversational, helpful, and often impressively insightful.

But there is a hidden constraint. Generative AI excels at momentary intelligence—it responds brilliantly in the present but forgets everything afterward.

Generative AI operates in isolated interactions. You ask an AI to draft a formal letter. It delivers a polished result. Later, you ask for a summary—but the AI has no awareness of the previous task. The context is gone.

Each interaction begins with a blank slate. This makes generative AI highly reactive, but not adaptive. Agent-based systems build on generative models by adding tools, goals, and decision-making loops. Ask a basic AI for a stock price, and it may return outdated data. Ask an agent-based system, and it checks real-time sources before responding.

Agents can:

  • Browse the web
  • Execute tools
  • Plan multi-step tasks

Yet even these systems suffer from a crucial limitation. They still forget. Without memory, even advanced agents repeat the same work, mistakes, and inefficiencies.

Memory enables:

  • Learning from experience
  • Long-term improvement
  • True autonomy
Also read: https://arxiv.org/pdf/2512.13564

Without it, intelligence remains shallow. Memory in AI is often misunderstood as simple data retention. In reality, it is about selective persistence.

  • Context is like a whiteboard—useful for the moment, erased afterward.
  • Memory is like a journal—capturing insights that matter for the future.

This distinction is critical. Modern AI agents rely on multiple forms of memory, each serving a distinct role:

1. Working Memory: Immediate Awareness

Handles the current task and short-term context. If you’re planning a trip to Tokyo, the agent remembers Tokyo as the destination throughout the conversation.

2. Episodic Memory: Learning from Experience

Stores past successes, failures, and outcomes. If a tool fails during execution, the agent remembers this and avoids it next time.

3. Semantic Memory: Stable Knowledge

Retains long-term facts, rules, and preferences. The agent remembers you prefer Celsius over Fahrenheit and adapts automatically.

Capability | Without Memory | With Memory
Repeated Tasks | Starts from zero every time | Builds on past work
Error Handling | Repeats mistakes | Learns and self-corrects
Personalization | Constantly re-asks preferences | Anticipates user needs
Productivity | Linear | Compounding over time
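The three memory types above can be sketched as one small structure (names and fields are assumptions for demonstration, not a specific framework’s API):

```python
# Toy agent memory with working, episodic, and semantic stores.

class AgentMemory:
    def __init__(self):
        self.working = {}    # current task context, discarded after the task
        self.episodic = []   # past outcomes: (action, success) pairs
        self.semantic = {}   # stable facts and preferences

    def record_outcome(self, action, success):
        self.episodic.append((action, success))

    def should_avoid(self, action):
        # episodic memory in action: avoid tools that failed before
        return any(a == action and not ok for a, ok in self.episodic)

mem = AgentMemory()
mem.working["destination"] = "Tokyo"            # working memory
mem.record_outcome("flaky_weather_api", False)  # episodic memory
mem.semantic["unit"] = "Celsius"                # semantic memory
```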

Memory transforms automation into accumulated intelligence. We are entering an era where AI is no longer defined by how well it answers questions—but by how well it remembers, learns, and grows alongside us. The future of AI is not about smarter answers, it’s about shared history.


Compact summary from Generative AI 2 Agent Based Automation

Sequence-to-sequence processing is machine learning in which one chain of information (e.g., words) is translated into another, e.g., for language translation, chatbots, and image generation.

Suppose you give the computer the sentence, “When was Christopher Columbus born?” The model takes this input and provides “1451” as its reply. During training, the model simply learned which output sequence best matches the input sequence.

These architectures are typically referred to as encoder-decoder networks. An input sequence is transformed into an abstract intermediate representation (the latent space, encoding) and then translated back into a target form (decoding).

Prominent examples include:  

  • Autoencoders (AE): These primarily act as filters or compressors and are most often used for denoising, compression, and similar applications.  
  • Variational Autoencoders (VAE): These already belong to Generative AI, since they allow variation in output by modeling distribution functions.  

A recurring challenge with these networks is the "information bottleneck": squeezing input sequences of varying length through a fixed-size intermediate representation can cause the loss of context information, e.g., that "Dirk" is the "name" in the sentence "My name is Dirk".

From this came the development of Transformer models, whose (self-)attention mechanism links the full context: every token is related to all others.

This can be visualized as an evaluation matrix: in the sentence "My name is Dirk," the model knows, thanks to the matrix, that there is a much stronger semantic connection between "name" and "Dirk" than between "name" and "is".
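The evaluation matrix can be sketched with dot-product attention scores. The token embeddings below are hypothetical hand-picked values, not learned weights; the point is only to show how a softmax over pairwise similarities yields a matrix in which "name" attends more strongly to "Dirk" than to "is".

```python
import math

# Toy embeddings for "My name is Dirk" (illustrative values, not learned).
emb = {
    "My":   [1.0, 0.1, 0.0],
    "name": [0.1, 1.0, 0.8],
    "is":   [0.2, 0.1, 0.1],
    "Dirk": [0.0, 0.9, 1.0],
}

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

tokens = list(emb)
# Self-attention scores: every token compared with all others (dot product),
# normalized per row so each row of the matrix sums to 1.
matrix = {t: softmax([sum(a * b for a, b in zip(emb[t], emb[u]))
                      for u in tokens])
          for t in tokens}

row = dict(zip(tokens, matrix["name"]))
print(row["Dirk"] > row["is"])  # True: "name" attends more to "Dirk"
```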

Solving simple tasks such as "1000 + 451" makes the AI sound clever. But with more complex formulas, such as "1000 + 451 / 2", purely language-based models often come up short because they have not memorized every conceivable number combination.

The answer is an “agentic” workflow with

  • tools and
  • memory.

A calculator tool then performs the exact calculation: the system calls the function that matches the intent of the user's request. Via tools and memory, the agent becomes the "brain": it recognizes when a tool is required (sensing), feeds it the right input and takes its output (actuating), and returns the response to the user.
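A minimal sketch of that sense/actuate loop, assuming a single hypothetical calculator tool and a crude regex-based intent check standing in for the model's intent recognition:

```python
import re

def calculator(expression: str) -> str:
    # Hypothetical tool: exact arithmetic with correct operator precedence.
    if not re.fullmatch(r"[\d\s+\-*/().]+", expression):
        raise ValueError("not an arithmetic expression")
    return str(eval(expression))  # safe here: input restricted to digits/operators

def agent(user_request: str) -> str:
    # Sensing: detect that the request needs the calculator tool.
    match = re.search(r"[\d\s+\-*/().]*\d[\d\s+\-*/().]*", user_request)
    if match and any(op in match.group() for op in "+-*/"):
        result = calculator(match.group().strip())  # actuating: invoke the tool
        return f"The result is {result}."           # respond to the user
    return "I can only help with arithmetic in this sketch."

print(agent("What is 1000 + 451 / 2?"))  # The result is 1225.5.
```

Note that the tool, not the language model, produces the number, which is exactly why the precedence in "1000 + 451 / 2" comes out right.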

State Space Models (SSMs) are a contemporary architectural approach. They describe tasks with a smaller number of state variables, shrinking the solution space and making performance easier to measure. They can be regarded as a special kind of working memory that serves either as a sub-module for logic tasks or as the foundation of entire topologies, such as Mamba. The key difference from classical models is that the intermediate state is modeled not as a static matrix but as a mathematical function. This echoes two more familiar architectures:

  • RNNs: Like Recurrent Neural Networks, SSMs are based on past information.  
  • CNNs: The state update can even be expressed mathematically as a convolution; essentially a compression to the "essentials".  

Whether this state is treated as continuous (over the entire time) or discrete (at specified points in time) depends on the implementation. Hardware-aware optimization is another critical factor for SSMs: as in NUMA architectures, performance bottlenecks are eliminated by using fast SRAM for compute-intensive steps (e.g., convolutions) and DRAM for large-scale data. 
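The discrete form of the state update can be sketched in a few lines. This is a scalar toy model with hand-picked parameters, not a trained SSM like Mamba (which uses structured, learned, input-dependent A, B, C), but it shows the recurrence through which the state carries past information forward:

```python
# Discrete linear state space model (scalar state for simplicity):
#   x[t+1] = A * x[t] + B * u[t]   (state update from past information, RNN-like)
#   y[t]   = C * x[t+1]            (readout)
A, B, C = 0.9, 1.0, 0.5  # hypothetical parameters, not learned values

def ssm(inputs):
    x, ys = 0.0, []
    for u in inputs:
        x = A * x + B * u   # the intermediate state is a function of the past
        ys.append(C * x)
    return ys

out = ssm([1.0, 0.0, 0.0])
print([round(y, 3) for y in out])  # [0.5, 0.45, 0.405]: the impulse decays
```

Because the update is linear and time-invariant, the same computation can be unrolled as a convolution over the input sequence, which is the CNN connection mentioned above.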

The harder the task, the more "thinking" (planning) is needed. A prominent concept here is Chain of Thought (CoT): the AI begins by jotting down its intermediate ideas, much like a student taking notes on a math problem.

Such an AI autonomy may be classified in three ways:  

  • Low: The AI simply answers (a single shot).  
  • Medium: The AI follows a fixed sequence (a plan), for instance searching for information in a database (vector search) or consulting a calculator.  
  • High: The AI works out plans by itself. For instance, it understands that it first needs to do some computations and then format the text especially well to fully complete the task.  

At a possible fourth level of autonomy, "memory" comes into play. As with humans, we now distinguish between two functions:

  • Session-based memory gives conversations a personal touch. It remembers who you are and how you prefer to be addressed, but this information is lost once the program shuts down.
  • Long-term memory, the vector database, is on the other hand a vast archive that provides knowledge permanently.

The interaction between the archive and the answer is called RAG (Retrieval-Augmented Generation).

The process is simple:  

  • Search: The AI retrieves the exact chunks of information in the archive that answer a particular question.  
  • Answer: The AI draws only on these findings to produce a well-founded response.  
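The two steps above can be sketched end to end. Real systems embed chunks into a vector database and rank by semantic similarity; here a simple word-overlap score stands in for the embedding search, and the "generation" step just returns the retrieved context:

```python
# Minimal RAG sketch: Search over an archive, then Answer from the findings.
archive = [
    "Christopher Columbus was born in 1451 in Genoa.",
    "The Eiffel Tower was completed in 1889.",
    "Mamba is a state space model architecture.",
]

def retrieve(question: str, k: int = 1):
    # Search: score each chunk by word overlap (a stand-in for vector search).
    q = set(question.lower().replace("?", "").split())
    scored = sorted(archive,
                    key=lambda chunk: len(q & set(chunk.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(question: str) -> str:
    # Answer: a real LLM would generate from this context; we just surface it.
    context = " ".join(retrieve(question))
    return f"Based on the archive: {context}"

print(answer("When was Christopher Columbus born?"))
```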

The personalization and intelligence attributed to an AI arise from the interaction of various data sources and behavioral rules. They can be decomposed into the following levels:  

  • Contextual knowledge: Linking to specialized databases (e.g., for mathematics and foreign languages) to provide expertise.  
  • System prompt: Definition of behavioral pattern and guidelines (e.g., tone of voice, politeness, safety standards).  
  • User profile: Stores user preferences (e.g., culinary preferences) so responses are personalized.  
  • Conversation history: Adding past conversations to create a logical discussion.  
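The four levels above ultimately get assembled into a single prompt. A sketch of that assembly, with illustrative section markers and example content (none of it prescribed by any particular framework):

```python
# Assembling the four personalization levels into one prompt.
def build_prompt(system_prompt, user_profile, history, context_chunks, question):
    parts = [
        f"[System] {system_prompt}",                   # behavioral rules
        f"[Profile] {user_profile}",                   # stored user preferences
        *(f"[History] {turn}" for turn in history),    # past conversation
        *(f"[Context] {c}" for c in context_chunks),   # specialist knowledge
        f"[User] {question}",
    ]
    return "\n".join(parts)

prompt = build_prompt(
    system_prompt="Answer politely and concisely.",
    user_profile="Prefers Celsius; vegetarian.",
    history=["User asked about Tokyo weather."],
    context_chunks=["Tokyo averages 26 °C in July."],
    question="What should I pack?",
)
print(prompt)
```

Ordering matters in practice: rules first, then stable facts, then the freshest material closest to the question.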

The standardization of these interfaces between models, storage, and tools is increasingly handled by the Model Context Protocol (MCP).

This protocol consists of three core roles:  

  • Host: The environment running the program (e.g., an app or IDE plugin), which decides what resources are needed.  
  • Client: The gateway inside the host that connects to a server and relays requests to it.  
  • Server: Actual provider of services or data (e.g., a calculator service, a database connection).  
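On the wire, MCP messages are JSON-RPC 2.0. The sketch below shows the shape of a client-to-server tool invocation; the `tools/call` method follows the MCP specification, while the tool name and arguments are hypothetical:

```python
import json

# Client -> Server: invoke a tool via MCP's JSON-RPC "tools/call" method.
# (Tool name "calculator" and its arguments are illustrative.)
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "calculator",
        "arguments": {"expression": "1000 + 451 / 2"},
    },
}
print(json.dumps(request, indent=2))
```

The server answers with a JSON-RPC response carrying the tool's result, which the client hands back to the host for the model to use.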

Once all of the aforementioned building blocks are in place, intelligent strategies emerge that the AI uses to solve problems: for example, what I have expected for years from my smart speakers at home or my vehicle's voice command interface.

You might describe these as AI’s “thought patterns” such as:  

  • ReAct: The AI thinks, acts (e.g., invokes a tool), evaluates the result, and keeps thinking. It is an ongoing interplay between reasoning and action.  
  • Self-Refine: The AI is not satisfied with its first draft. It critically rereads the answer it came up with and revises it until it is as good as possible.  
  • Constitutional AI: The AI is given ethical guidelines. Instead of humans having to go through each and every reply in a manual review, the "digital constitution" (the rules) becomes an internal guiding light so that offenses or errors do not occur in the first place.  
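The ReAct pattern can be sketched as a bounded think/act/observe loop. Here the language model is faked by a scripted policy and the tool is a plain `eval`, so the structure of the loop, not the intelligence, is what the example shows:

```python
# Minimal ReAct-style loop: think -> act (call a tool) -> observe -> think again.
def fake_llm(scratchpad: str) -> str:
    # A real model would generate the next thought/action from the scratchpad;
    # this scripted stand-in calls the calculator once, then concludes.
    if "Observation:" not in scratchpad:
        return "Action: calculate[1000 + 451 / 2]"
    return "Final Answer: 1225.5"

def run_react(question: str) -> str:
    scratchpad = f"Question: {question}\n"
    for _ in range(5):                      # bounded number of think/act steps
        step = fake_llm(scratchpad)
        scratchpad += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer: ").strip()
        if step.startswith("Action: calculate["):
            expr = step[len("Action: calculate["):-1]
            scratchpad += f"Observation: {eval(expr)}\n"  # tool result fed back
    return "gave up"

print(run_react("What is 1000 + 451 / 2?"))  # 1225.5
```

The scratchpad is the working memory of the loop: every observation is appended so the next "thought" can build on it.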

What we see is a shift from one-off prompt nudges to a full workflow-engineering ecosystem. Just as intent analysis is itself a particular form of pattern recognition, workflows are becoming reusable blueprints for AI-driven processes. This shift gives low-code/no-code environments even greater flexibility and power than ever before.




The SoC Cost Trap 2.0: Beyond Moore’s Law – Strategies for the Age of System Integration


The choice of a manufacturing node, once the undisputed benchmark for innovation, is transforming into just one of many variables in a complex strategic equation. This article provides an in-depth analysis that re-evaluates the traditional comparison between monolithic CPU/GPU designs in 7nm and 5nm FinFET nodes and specialized silicon photonics solutions in 130nm SOI. The expanded multi-criteria matrix now incorporates crucial, industry-disrupting trends: standardization through UCIe (Universal Chiplet Interconnect Express), the revolution of AI-driven chip design, the next-generation transistor GAAFET, and new integration methods like Backside Power Delivery. The examination of low-, mid-, and high-volume scenarios leads to one central conclusion: the era of node dominance is over. The future belongs not to the smallest transistor, but to the most intelligent system integration within a single package.


1. Introduction: The Shifting Strategic Dilemma of Semiconductor Innovation

The relentless progression to finer feature sizes—7nm, 5nm, and soon 3nm—continues to promise enormous gains in performance per watt (PPA: Power, Performance, Area). However, the classic “Moore’s Law” as the sole driver of the industry is eroding. Complexity and costs are escalating to an extent that shakes the very foundations of semiconductor economics:

  • Exponential Cost Explosion: The fixed costs for a tape-out (the creation of photomasks) have reached astronomical levels. While they were in the range of $5-10 million at 28nm, they can now exceed $400 million at 5nm and are approaching the billion-dollar mark at 3nm.
  • Specialization as a Survival Strategy: In parallel, older but highly specialized nodes are gaining importance. Silicon Photonics is a prime example of a cost-effective solution for creating extremely fast optical data links, especially in markets where cutting-edge nodes are unprofitable.

This article analyzes this conflict of objectives and demonstrates that the choice of a node is no longer a purely technical decision but a profound business-strategic one, significantly influenced by new industry standards and design trends.


2. The Evaluation Criteria: A 360° View in a New Era

A future-proof decision requires an evaluation of all relevant dimensions, including the latest disruptive technologies.

  • Fixed Costs (Weight: 20%): One-time NRE (Non-Recurring Engineering) costs for design, verification, and mask sets.
  • Variable Costs (Weight: 25%): Cost per functional chip, driven by wafer price, die size, and production yield.
  • Scalability & Cost Elasticity (Weight: 15%): Potential for unit cost reduction at high volumes.
  • Supply Risk & Geopolitics (Weight: 15%): A dimension of growing importance. The concentration of the most advanced nodes in a few foundries located in geopolitically sensitive regions (e.g., TSMC in Taiwan) represents a massive strategic risk.
  • Time-to-Market & Design Complexity (Weight: 15%): Influenced by design complexity and the ability to reuse IP.
  • Technical Performance & Thermals (Weight: 10%): PPA and the increasingly critical requirement of thermal management (TDP).

3. The Cost-Benefit Analysis: Why the Monolithic Approach is Failing

The optimal choice of a node is a direct function of the target volume. The analysis clearly shows the limits of a “one-size-fits-all” approach, as illustrated by the detailed evaluation matrix below.

Evaluation Matrix (Low-Volume / High-Volume)

Criterion           | Weight | 7nm CPU/GPU (Score / Weighted) | 5nm CPU/GPU (Score / Weighted) | 130nm Photonics (Score / Weighted)
Fixed Costs         | 0.25   | 2 / 3 (0.50 / 0.75)            | 1 / 2 (0.25 / 0.50)            | 4 / 5 (1.00 / 1.25)
Variable Costs      | 0.30   | 2 / 3 (0.60 / 0.90)            | 1 / 3 (0.30 / 0.90)            | 4 / 4 (1.20 / 1.20)
Scalability         | 0.15   | 4 / 5 (0.60 / 0.75)            | 3 / 4 (0.45 / 0.60)            | 2 / 3 (0.30 / 0.45)
Supply Risk         | 0.10   | 2 (0.20)                       | 2 (0.20)                       | 4 (0.40)
Time-to-Market      | 0.10   | 2 (0.20)                       | 1 (0.10)                       | 4 (0.40)
Performance         | 0.10   | 4 (0.40)                       | 5 (0.50)                       | 2 (0.20)
Total Score (LV/HV) | 1.00   | 2.50 / 3.40                    | 1.80 / 3.25                    | 3.50 / 4.00
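The low-volume totals can be reproduced directly as weighted sums. A short sanity-check sketch, with the score vectors read off the matrix using the weights as they appear there:

```python
# Weighted-sum check of the low-volume column of the evaluation matrix.
weights = [0.25, 0.30, 0.15, 0.10, 0.10, 0.10]  # fixed, variable, scal., risk, TTM, perf.

low_volume_scores = {
    "7nm CPU/GPU":     [2, 2, 4, 2, 2, 4],
    "5nm CPU/GPU":     [1, 1, 3, 2, 1, 5],
    "130nm Photonics": [4, 4, 2, 4, 4, 2],
}

def total(scores):
    return round(sum(w * s for w, s in zip(weights, scores)), 2)

for node, scores in low_volume_scores.items():
    print(node, total(scores))
# 7nm CPU/GPU 2.5
# 5nm CPU/GPU 1.8
# 130nm Photonics 3.5
```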

Scenario 1: Low-Volume (10k – 100k units) – The Specialist

Here, fixed costs are king. With a total score of 3.5, the 130nm photonics node is the clear winner for specialized applications. Modern FinFET nodes (7nm: 2.5; 5nm: 1.8) are economically unviable here.

Scenario 2: High-Volume (> 5 million units) – The Mass Market

Here, the rules are rewritten. The fixed costs are amortized. As the narrative conclusion of the original analysis states, the economics shift in favor of advanced nodes, making the 7nm node the “sweet spot” (score: 3.4) of cost and performance, while 5nm (score: 3.25) is more performant but disproportionately more expensive.

The Insight: No single node can optimally meet all requirements. This forces a paradigm shift—away from the monolithic SoC and towards heterogeneous integration.


4. The Node is Dead, Long Live the System: Why Integration Dethrones the Nanometer

For decades, the equation was simple: a smaller node meant a better chip. This era of monolithic dominance, where the manufacturing process was the undisputed king, is over. The statement “the node is no longer decisive” does not mean that advanced nodes are irrelevant. It means they have gone from being the only answer to being one of many questions in a much larger optimization problem.

The Collapse of the Old Order

Three factors have ended the sole reign of the node:

  1. The Economic Wall: Exponential costs make the most advanced nodes inaccessible to 95% of the market.
  2. The Physics Wall: The benefits of scaling are diminishing, and modern digital processes are ill-suited for analog, RF, or high-voltage circuits.
  3. The Thermal Wall (Dark Silicon): The enormous transistor density makes it impossible to power all areas of a huge monolithic chip at full performance simultaneously without overheating it.

The Rise of the New Order: System-in-Package

Intelligence is shifting from the pure transistor level to the architecture and integration level. The star is no longer the individual chip, but the System-in-Package (SiP). The new king is not the node, but advanced packaging technology.

  • The New Enabler is Packaging: Technologies like 2.5D interposers (CoWoS), embedded bridges (EMIB), and 3D stacking (Foveros) are the true innovation drivers.
  • The New Metric is TCO (Total Cost of Ownership): Instead of just optimizing the PPA of a single chip, architects now optimize the total performance per total cost.

The decisive competence is no longer just access to the most advanced fab, but the architectural brilliance to intelligently partition and seamlessly integrate a complex system.


5. Application-Specific Architectures & Integration Technologies

The chiplet strategy allows for tailor-made solutions for different markets. The choice of packaging is crucial for cost and performance.

Relevant Applications

Application            | Suitable Combination          | Rationale
Edge-AI/Inference      | 7/5nm + 130nm Photon-Chiplet  | Compute-heavy, low-volume
Automotive-ADAS        | 7nm SoC + Photonics-Links     | Moderate volumes, deterministic IO
Data Centers (NIC/CXL) | 130nm Photon + 7nm Controller | IO-dominant, low-power
Consumer-Wearables     | 5nm monolithic                | High-volume, PPA-optimized

Integration Technologies

Packaging Type                | Node Combination | Cost             | Maturity
CoWoS/InFO-SoW                | 5nm + 5nm        | Very high, >$100 | High
Silicon-Interposer 2.5D       | 5nm + 130nm      | Medium, $50-80   | Increasing
Fan-Out-RDL (chiplet)         | 7nm + 130nm      | Low, $15-30      | High
Organic Substrate Stand-Alone | 130nm            | Very low, <$5    | Very High

6. Trend 1: The Standardization of the Ecosystem – UCIe

Perhaps the most important trend is Universal Chiplet Interconnect Express (UCIe), an open standard from industry giants like Intel, AMD, ARM, Google, and TSMC. It defines a standardized interface that allows chiplets from different manufacturers and different nodes to be seamlessly combined.

Why UCIe is a Game-Changer:

  • Interoperability: Creates an open marketplace for chiplets (the “Lego principle”).
  • Cost Reduction: Fosters competition and lowers development costs.
  • Risk Mitigation: Enables a more flexible and resilient supply chain.

Reference: Synopsys explains the basics of Die-to-Die Interfaces. Link


7. Trend 2: AI Accelerates Chip Design Itself

While AI drives the demand for more powerful chips, it also revolutionizes their development. AI-powered EDA (Electronic Design Automation) tools are dramatically accelerating the highly complex process of chip layout (place and route).

  • Example: Google’s AlphaChip: Google DeepMind uses AI to create optimized layouts for its TPUs in hours instead of months. This not only shortens the development cycle but often leads to more powerful designs.

Reference: IT Boltwise reports on Google’s breakthrough in chip design through AI. Link


8. Roadmap: The Technological Enablers of the Next Generation

8.1 Transistors of the Future: From FinFET to GAAFET

The successor to the FinFET is the Gate-All-Around FET (GAAFET). Here, the gate completely surrounds the channel of the transistor, allowing for much better electrostatic control and enabling scaling below 3nm. Samsung has already introduced this technology in its 3nm process.

Reference: Fortune Business Insights analyzes the market potential of GAAFET technology. Link

8.2 A Revolution in Power Supply: Backside Power Delivery (BPDN)

Backside Power Delivery moves the power delivery network to the back of the wafer. This significantly reduces voltage drop and creates more space for signal routing on the front side. BPDN is a crucial building block for realizing 2nm nodes and beyond.

Reference: Semiconductor Engineering discusses the advantages of Backside Power Delivery. Link

8.3 Integration at the Limits of Physics: Co-Packaged Optics (CPO)

Co-Packaged Optics integrates optical transceivers directly next to the processor chip. This eliminates losses from electrical traces, reduces latency, and dramatically lowers energy consumption per bit—a key technology for future AI data centers.

Reference: Market analyses by Yole Group predict explosive growth for the CPO market. Link


9. Strategic Conclusion & Recommendations for 2025 and Beyond

  1. Bet on Open Standards: Develop a chiplet strategy based on the UCIe standard to secure access to a broad ecosystem and reduce dependencies.
  2. Integrate AI into the Design Process: Implement AI-powered EDA tools to shorten development cycles and increase design efficiency.
  3. Think in Systems, Not in Silicon: The decisive competitive advantage no longer lies in the silicon alone, but in the masterful integration of various chiplets via advanced packaging.
  4. Plan for the Next Generation: Understand the implications of GAAFETs and Backside Power Delivery for your future product roadmaps.
  5. Treat Geopolitics as a Design Parameter: A diversified foundry strategy that also relies on proven nodes in geopolitically stable regions is essential for your company’s resilience.

The future belongs to those who master the complexity of heterogeneous integration and embrace the new paradigms of open standards and AI-driven design. The chiplet approach is no longer an option—it is the strategic necessity for sustainable innovation.
