Transformers are challenging the CNN

For decades, Convolutional Neural Networks (CNNs) have been the undisputed kings of computer vision. If a machine was “seeing,” it was likely using a CNN. But the landscape is shifting. Vision Transformers (ViTs) are moving from the world of Natural Language Processing into the visual realm, fundamentally changing how AI perceives the world.

The core difference lies in their philosophy of sight. CNNs act like detectives with magnifying glasses, sliding a small window over an image to find local patterns like edges and textures. They build a global view slowly, layer by layer. In contrast, ViTs take a big-picture approach from the start. They break an image into a grid of patches, like puzzle pieces, and treat them as a sequence.

To make this transition, the model performs a process called Linear Projection. Imagine a 16×16 pixel patch in full color; that is 768 individual pixel values. The ViT flattens these pixels into a single vector and multiplies them by a learnable matrix to create a “Token.” Just as a language model treats a word as a token, the ViT treats each patch as a visual word. Because ViTs use Self-Attention, every single patch talks to every other patch simultaneously. From the very first layer, the model can understand that a patch in the top-left corner is related to one in the bottom-right.
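A minimal PyTorch sketch of this patch-to-token pipeline, with a learnable positional embedding added at the end (the 384-dimensional token size and 224×224 input are arbitrary illustrative choices, not from any specific ViT):

```python
import torch
import torch.nn as nn

# Each 16x16 RGB patch has 16*16*3 = 768 pixel values, as described above.
patch_size, channels, embed_dim = 16, 3, 384   # embed_dim is an arbitrary choice
patch_values = patch_size * patch_size * channels  # 768

projection = nn.Linear(patch_values, embed_dim)    # the learnable matrix

image = torch.randn(1, channels, 224, 224)         # one 224x224 RGB image
# Cut the image into a 14x14 grid of patches, then flatten each patch.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, patch_values)

tokens = projection(patches)                       # (1, 196, 384): 196 visual "words"
pos_embedding = nn.Parameter(torch.zeros(1, 196, embed_dim))  # the "GPS coordinates"
tokens = tokens + pos_embedding

print(tokens.shape)
```

From here, the 196 tokens enter self-attention exactly as word tokens would in a language model.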

This global communication solves the Picasso Problem. Because CNNs focus so much on local features, they might see an eye, a nose, and a mouth and conclude “that’s a face,” even if those features are scrambled. ViTs are different. Because they model global relationships immediately, they are much better at understanding spatial structure and focusing on overall shapes rather than just high-frequency textures.

However, since Transformers treat patches like a list, they don’t naturally know which patch goes where. To fix this, they use Positional Encodings, essentially digital GPS coordinates, to tell the model the layout of the image. Without these, the model would view a landscape as a random bag of pixels.

If ViTs are so smart, why haven’t they replaced CNNs entirely? It comes down to data hunger. CNNs come with “pre-installed” knowledge: they know that pixels near each other are usually related. This makes them efficient on smaller datasets. ViTs start with a blank slate and must learn the rules of physics and space from scratch, which usually requires massive datasets, often hundreds of millions of images, to outperform their rivals.

The gap is closing thanks to hybrid models like Swin Transformers and modernized CNNs like ConvNeXt. If you are working with limited data and standard hardware, the CNN remains a reliable specialist. But if you have the data and the compute to spare, the Vision Transformer offers a more robust, holistic way for machines to truly see.

Posted in Uncategorized | Leave a comment

Architectural Parallels and Divergences in Neural Memory

Modern generative AI is hitting a familiar wall: every time we try to make models “smarter” by stuffing in more knowledge, we also make them more expensive to run. In classic dense Transformers, memory and compute are tightly coupled: more parameters mean more FLOPs and more memory, both during training and inference. To break this coupling, researchers have turned to conditional and augmented computation. Two of the most important approaches are Mixture of Experts (MoE) and the newer Mixture of Value Embeddings (MoVE). They attack the same problem from very different angles.

MoE scales models by adding many specialized subnetworks, called experts, while only activating a small subset of them for each token. A routing network decides which experts are used, typically selecting just one or two. This allows models to grow to hundreds of billions of parameters without paying the full compute cost every time. In practice, MoE replaces the feed-forward layers in Transformers and relies on sparse activation to keep inference affordable. The upside is massive total capacity; the downside is complexity. Routing introduces engineering challenges like expert capacity limits, dropped tokens, and heavy all-to-all communication across GPUs. Training is also fragile: if routing becomes imbalanced, a few experts get all the traffic while others “die,” forcing the use of auxiliary load-balancing tricks that can hurt final model quality.
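The routing mechanism described above can be sketched in a few lines of PyTorch. The dimensions, the top-2 selection, and the loop-based dispatch are simplifications for clarity, not a production implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dims: 8 experts, of which only 2 run per token.
d_model, n_experts, top_k = 64, 8, 2

experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                   nn.Linear(4 * d_model, d_model)) for _ in range(n_experts)]
)
router = nn.Linear(d_model, n_experts)  # scores each token against each expert

def moe_layer(x):  # x: (tokens, d_model)
    weights, idx = torch.topk(F.softmax(router(x), dim=-1), top_k)  # pick 2 of 8
    out = torch.zeros_like(x)
    for slot in range(top_k):            # only the selected experts ever run
        for e in range(n_experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot, None] * experts[e](x[mask])
    return out

tokens = torch.randn(5, d_model)
print(moe_layer(tokens).shape)  # (5, 64)
```

Total capacity is 8 expert networks, but each token only pays for 2 of them, which is exactly the sparse-activation bargain described above. The load-balancing auxiliary losses used in practice are omitted here.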

MoVE takes a more surgical approach. Instead of adding subnetworks, it augments the attention mechanism itself. Research into interpretability has shown that the Value stream of attention carries much of a Transformer’s semantic content. MoVE exploits this by introducing a global bank of learnable value embeddings, concept vectors shared across all layers. At each token step, the model softly mixes a subset of these vectors and adds them to the standard value projection. The key idea is that you can scale memory simply by increasing the size of this embedding bank, without deepening the network or increasing active FLOPs. Memory and compute become genuinely decoupled.
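Since the paper's exact formulation isn't reproduced here, the following is only a rough, hypothetical sketch of the idea: a global bank of learnable vectors, a small subset softly mixed per token and added to the standard value projection. All names and dimensions are invented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_slots, top_k = 64, 256, 4    # memory scales with n_slots, not depth

value_bank = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)  # global concept vectors
query_proj = nn.Linear(d_model, d_model)  # addresses tokens against the bank

def augment_values(v, h):
    """Softly mix a few bank slots into the standard value projection v.
    h is the token's hidden state, used to address the bank."""
    scores = query_proj(h) @ value_bank.t()              # (tokens, n_slots)
    top_scores, idx = torch.topk(scores, top_k, dim=-1)  # small active subset
    mix = F.softmax(top_scores, dim=-1)                  # soft, differentiable weights
    retrieved = (mix.unsqueeze(-1) * value_bank[idx]).sum(dim=1)
    return v + retrieved   # grow n_slots -> more memory, same per-token compute

h = torch.randn(5, d_model)
v = torch.randn(5, d_model)
print(augment_values(v, h).shape)  # (5, 64)
```

Note that the mixing weights are a differentiable softmax rather than a hard routing decision, which is why every slot receives gradients, the property contrasted with MoE later in the post.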

These architectural differences lead to very different scaling behaviors. MoE sparsely couples capacity to compute: total parameters grow, but only a fraction are active per token. MoVE fully decouples them: compute stays constant while memory grows along an independent axis. As a result, MoVE can enter “memory-dense” regimes where the model stores far more static knowledge without becoming slower at reasoning.

Optimization dynamics differ just as sharply. MoE’s hard routing makes it vulnerable to expert collapse and requires careful balancing. MoVE uses soft, differentiable mixing, so every memory slot receives gradients and none are starved. The challenge there is not collapse but selectivity: ensuring the memory bank doesn’t blur into a generic average.

Finally, both approaches pay off. In text generation, MoVE consistently improves perplexity as memory slots scale, suggesting that factual knowledge can be offloaded into the embedding bank while attention layers focus on structure and reasoning. In autoregressive image generation, MoVE boosts visual fidelity by acting as a shared visual memory, and it integrates cleanly with low-rank attention schemes that are already optimized for efficiency. MoE, meanwhile, continues to shine when domain specialization matters, with different experts naturally capturing syntax, programming patterns, or language-specific features.

From a systems perspective, the trade-offs are stark. MoE places heavy demands on networking and memory bandwidth because of token routing across devices, making deployment challenging without high-end interconnects. MoVE behaves much more like a standard Transformer, with lower communication overhead and simpler scaling. Both can benefit from offloading or compression, but MoVE’s shared memory has better locality than sparsely accessed experts.

Zooming out, MoE and MoVE sit within a broader movement toward memory-augmented and modular neural architectures. Other ideas, like product-key memories, parameter-efficient expert retrieval, or even more radical geometric memory systems, share the same goal: separating “where knowledge lives” from “how reasoning happens.” In multimodal systems, similar mixture principles already show clear gains by routing inputs to specialized encoders.

The most likely future is not a winner-takes-all scenario but hybrids. MoE layers can handle specialized reasoning and task-specific logic, while MoVE augments attention with a global, editable knowledge store – like a book for lookups. Combined with efficient attention compression, such systems promise models that are not just larger, but smarter per FLOP.

Posted in Uncategorized | Leave a comment

Language Models Adopt Operating System Mechanisms to Sustain Throughput

The rapid shift of language models from research artifacts to production-critical systems has forced a deep re-evaluation of how inference workloads are architected. Early optimization efforts focused almost exclusively on training, where performance is dominated by dense matrix multiplications and raw FLOPs. In contrast, large-scale inference has revealed a different truth: serving language models efficiently is not compute-bound, but memory-bound. The dominant constraint is no longer GPU arithmetic throughput, but how effectively memory bandwidth is utilized.

During inference, transformer-based models generate tokens sequentially. For each new token, the model must attend to all previously generated tokens to preserve context. To avoid recomputing attention inputs at every step, a process that would scale quadratically with sequence length, serving engines cache the intermediate key and value tensors (the KV cache) in GPU memory. This KV cache grows linearly with both sequence length and model depth, quickly becoming the largest consumer of memory. For modern models with long context windows, the KV cache can easily exceed the size of the model weights themselves.
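A quick back-of-envelope calculation shows why. Using the standard KV cache size formula (2 tensors, K and V, per layer, per KV head, per token) with an illustrative 70B-class configuration, not any specific model:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes, per token.
layers, kv_heads, head_dim = 80, 8, 128   # illustrative 70B-class configuration
bytes_per_value = 2                        # fp16/bf16
context_len = 32_768

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
cache_gib = bytes_per_token * context_len / 2**30

print(f"{bytes_per_token / 1024:.0f} KiB per token")             # 320 KiB
print(f"{cache_gib:.0f} GiB for a {context_len}-token context")  # 10 GiB
```

Ten gibibytes for a single 32k-token sequence, before batching, which is why the cache, not the weights, dominates memory planning at long context lengths.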

Traditional inference stacks inherited their memory management strategies from training frameworks, where static shapes and predictable lifetimes dominate. Applied to inference, these assumptions break down. Legacy systems typically pre-allocate KV cache memory for the maximum possible sequence length, regardless of how short a prompt or response actually is. This leads to severe over-reservation, compounded by internal fragmentation from fixed-size slabs and external fragmentation caused by the requirement that each sequence’s KV cache occupy a contiguous block of memory. The result is strikingly poor utilization: in many production setups, only 30–40% of available GPU memory is actually used for active tokens; the remaining 60–70% is waste. From a cost perspective, this means that a significant portion of the investment in accelerators such as those from NVIDIA delivers no effective throughput.

PagedAttention emerged as a response to this systemic inefficiency, applying a principle from operating systems, virtual memory paging, to the KV cache. Instead of allocating one large contiguous buffer per sequence, PagedAttention divides the KV cache into small, fixed-size blocks, each holding the keys and values for a limited number of tokens. Logical token order is decoupled from physical memory layout via a block table that maps sequence positions to arbitrary memory blocks.

This seemingly simple abstraction has far-reaching consequences. Memory is allocated on demand as tokens are generated, eliminating over-reservation entirely. Because blocks are uniform and interchangeable, external fragmentation disappears: any free block can be reused by any sequence. Internal fragmentation is limited to the final, partially filled block of a sequence, reducing waste to a handful of tokens rather than thousands. In practice, systems built on PagedAttention reduce KV cache waste from as high as 80% to well under 5%, fundamentally changing the economics of LLM serving.

The vLLM inference engine builds directly on PagedAttention and extends it into a full production-grade architecture. By treating KV blocks as reference-counted objects, vLLM enables efficient memory sharing across sequences with common prefixes. Parallel sampling and beam search benefit immediately: multiple candidate continuations can share the same prompt cache, with copy-on-write semantics ensuring correctness when paths diverge. At scale, this can cut memory usage by more than half for complex decoding workloads.
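The bookkeeping behind this can be sketched with plain Python dictionaries. Block size, pool size, and the copy-on-write details are simplified for illustration; a real engine does this with GPU tensors and CUDA kernels:

```python
# Toy sketch of PagedAttention-style bookkeeping: fixed-size blocks, a free list,
# a per-sequence block table, and reference counts for prefix sharing.
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

free_blocks = list(range(100))      # pool of physical block IDs
ref_count = {}                      # block id -> number of sequences using it
block_tables = {}                   # sequence id -> list of physical blocks

def append_token(seq_id, pos):
    """Allocate a new physical block only when a sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    if pos % BLOCK_SIZE == 0:       # on-demand allocation: no over-reservation
        block = free_blocks.pop()
        table.append(block)
        ref_count[block] = 1

def fork(parent_id, child_id):
    """Share the parent's blocks; copy-on-write would split them on divergence."""
    block_tables[child_id] = list(block_tables[parent_id])
    for block in block_tables[child_id]:
        ref_count[block] += 1

for pos in range(40):               # a 40-token sequence needs ceil(40/16) = 3 blocks
    append_token("seq0", pos)
fork("seq0", "seq1")                # e.g. a second sample sharing the same prompt
print(len(block_tables["seq0"]), ref_count[block_tables["seq0"][0]])  # 3 2
```

The block table is the indirection layer: logical token positions stay ordered while the physical blocks can live anywhere, which is what makes any free block reusable by any sequence.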

Posted in Uncategorized | Leave a comment

LLM – we need to decouple facts from logic

Modern AI models are inherently inefficient because they work on every task with the same level of intensity: they engage the same amount of compute for every question instead of simply pulling in static facts.

DeepSeek Engram (https://arxiv.org/abs/2601.07372) helps solve this problem by decoupling Memory (facts) from Reasoning (logic), so that the model can “look up” known facts instead of processing them again and again. Do not mix it up with RAG approaches: think of Engram as a dictionary and RAG as a library. Engrams are built into the model.

  • Engrams: Hashing + Gating: Hash-based retrieval is checked by a “gate” to see if it fits the current context. Offloads massive parameter tables to RAM instead of expensive GPU VRAM.
  • RAG: Embedding + Prompting: Text is retrieved and literally pasted into the prompt window. Usually stored in vector databases.

To make this concrete, picture an upscale restaurant whose chef embodies the model’s cognitive power.

In a traditional LLM, it’s as if a Michelin-starred head chef is repeatedly pulled away from composing a delicate, multi-course tasting menu just to pour water or slice bread, an absurd misuse of talent for trivial tasks. Engram alters the kitchen environment. Rather than taking anything away from the chef, it adds a well-organized pantry and service station run by the waitstaff. When a guest requests bread or water, a server retrieves it instantly from the shelf, an O(1) grab, while the chef stays fully focused on complex, high-value dishes.

Under the hood, the system works by tokenizing concepts (the engrams) and compressing them into a dense memory store, indexed via a hash map. This enables the model to retrieve specific information immediately instead of rummaging through its entire “brain”. A context-aware gate decides when such fast-access memory is appropriate and rejects retrieved items that don’t fit the current situation.
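As a purely illustrative sketch (the paper's actual mechanism, names, and dimensions will differ), the hash-plus-gate idea might look like this:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: an O(1) hash lookup into a memory table, guarded by a
# context-aware gate. All names, dims, and the hashing scheme are invented here.
d_model, table_size = 64, 4096

memory_table = nn.Embedding(table_size, d_model)   # could be offloaded to CPU RAM
gate = nn.Linear(2 * d_model, 1)                   # judges fit with the context

def engram_lookup(token_id, context):
    slot = hash(token_id) % table_size             # O(1) addressing, no search
    fact = memory_table(torch.tensor([slot]))      # retrieved "fact" vector
    fit = torch.sigmoid(gate(torch.cat([fact, context], dim=-1)))
    return fit * fact                              # gate can suppress a bad match

context = torch.randn(1, d_model)
print(engram_lookup(8675, context).shape)  # (1, 64)
```

The key property is that the lookup cost is constant regardless of how large the table grows, which is exactly the "pantry" in the restaurant analogy.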

Engram in a nutshell: if you see the word Apple, your brain activates an engram associated with the fruit; a language model converts “Apple” into a specific token ID (e.g., 8675). The same happens for the sentence “A quintessential pomaceous fruit, orb-like treasure wrapped in a taut, glossy skin that transitions from deep ruby reds to sun-drenched yellows and vibrant greens, protecting a firm, juicy interior of ivory flesh that delivers a perfect, symphonic snap followed by a complex balance of tart acidity and floral sweetness, all centered around a star-shaped core of dark seeds that has cemented its status as a timeless symbol of both wholesome health and forbidden knowledge” – it, too, collapses to just a token ID (e.g., 8675).

Memories leave durable imprints, and at scale this organizational shift is massive. By eliminating cognitive “prep work”, models become dramatically more efficient without growing larger just to encode everything in logic, the algorithms.

This architecture produced an improvement in “Needle-in-a-Haystack” retrieval accuracy from 84% to 97%. Needle-in-a-Haystack is like asking a restaurant to remember one guest’s specific allergy note (the fact). A weak kitchen guesses based on patterns (generic: “most guests don’t want peanuts”) – the logic.

So in brief: stop asking the chef to fetch the bread, and both the food and the thinking get much better.

DeepSeek’s analysis of “U-shaped scaling laws” suggests that allocating roughly 20-25% of a model’s parameter budget to static memory is the sweet spot. Finally, Engram shows that the future of AI isn’t about building ever-larger models, but about building smarter, better-organized, cheaper ones that scale faster and are easier to run.

Posted in Uncategorized | Leave a comment

Manifold AI Model-Architecture

A language model has to understand many things at the same time when it reads a sentence: word meanings, grammar, context, and world knowledge. Consider the sentence “The bank is by the river”. The model must simultaneously consider:

  • bank as a financial institution
  • bank as a bench
  • bank as a riverbank
  • the grammatical role of the word
  • the surrounding context from earlier sentences

The model evaluates all of these possibilities in parallel. However, they must ultimately be compressed into a single internal representation. This tension only intensifies with scale: larger models know more and process more signals at the same time, but they still rely on that same single internal representation.

To understand this limitation, it is important to distinguish between embeddings and channels. An embedding is the initial vector representation of a token when it enters the model. A channel, by contrast, is the high-dimensional vector space through which these embeddings flow as the model processes them. As attention, feed-forward layers, and residual connections are applied, embeddings are continuously transformed and merged with context and intermediate reasoning results.

Concretely:

  • each token is represented as a vector with hundreds or thousands of values
  • each Transformer layer transforms this vector
  • residual connections add the previous vector to the new one
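The three bullets above, in a minimal PyTorch sketch (layer structure and dimensions are arbitrary):

```python
import torch
import torch.nn as nn

# One token vector flowing through a layer, with the residual connection
# adding the previous vector to the new one: everything merges into ONE vector.
d_model = 512                      # hundreds-to-thousands of values per token

layer = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                      nn.Linear(d_model, d_model))

token = torch.randn(1, d_model)    # the embedding entering the channel
updated = token + layer(token)     # residual: old vector + transformed vector
print(updated.shape)               # still (1, 512), the single shared channel
```

Every interpretation of "bank", grammar, and context must survive inside that one 512-dimensional vector, which is precisely the bottleneck discussed next.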

The channel becomes a bottleneck: vector sizes can be increased, but not infinitely. Newer architectures, such as the one in the DeepSeek paper https://arxiv.org/pdf/2512.24880v2, attempt to address this by introducing multiple internal paths. Instead of mixing everything immediately, different kinds of information can flow in parallel for longer. However, parallelism alone is not sufficient. Without constraints, models tend to overuse some paths and neglect others, leading to imbalance and instability.

The key improvement is organized parallelism. By enforcing balanced usage of channels, all types of information get a fair chance to contribute before being combined. This allows models to preserve multiple interpretations longer and merge them more deliberately. Organized channels are sometimes confused with Mixture of Experts (MoE) architectures, but they solve different problems.

Mixture of Experts splits the network into multiple expert sub-networks. A router selects which experts process each token, activating only a subset at a time. MoE primarily improves scalability and specialization.

  • MoE answers: “Which part of the network should process this token?”. MoE reduces competition for compute. Some call it Sparse (few active).
  • Channels answer: “How can multiple interpretations coexist inside the model without interfering?”. Channels reduce competition between meanings. Some call it Dense (all active).

The central insight is clear: progress in language models does not come only from more parameters, more data, or more compute. It comes from structuring how information flows internally.

Posted in Uncategorized | Leave a comment

The hidden AI bottleneck

If you’ve ever splurged on a processor or raved about a supercomputer, you’ve probably been talking about “Gigahertz,” “MIPS” or “TeraFLOPS.” We like to imagine these numbers as horsepower in a car, a single reading that tells us how fast the machine will race. But computer speed isn’t one number; it’s a dialogue between the software’s intentions and the hardware’s reality. To understand performance, we have to consider the two main languages computers speak, MIPS and FLOPS, the secret translation layer known as the micro-operation, and the brick wall modern processors are smashing into: memory latency and energy efficiency.


The Integer World: MIPS

Imagine a mail-sorting bureaucrat. They look at an envelope, choose a bin, and stamp it. This is the world of MIPS. It deals with integers, the whole numbers used for logic, decision making, and memory addresses. When your computer runs an operating system, opens a web browser, or decides which line of code to run next (an if-then statement), it operates in a world of integers. MIPS (millions of instructions per second) gauges how quickly a processor can chew through such control-flow instructions. Nonetheless, MIPS has a reputation for being misleading, jokingly expanded as “Meaningless Indicator of Processor Speed”. Why? Because not every instruction is created equal. A Complex Instruction Set (CISC) processor may do a lot of work in one instruction, while a Reduced Instruction Set (RISC) processor may need five instructions to do the same thing. Comparing processors by MIPS is like comparing chefs by how many motions they make per minute, without asking whether they are chopping an onion or plating a soufflé.


The Scientific World: FLOPS



FLOPS measure pure mathematical throughput. Where MIPS is about control, FLOPS (floating-point operations per second) is about simulation. It is no wonder supercomputers and gaming GPUs are obsessed with FLOPS and its sub-units; they spend their lives churning through vast matrices of real-valued numbers. But there is a trap: a processor might boast a massive theoretical FLOPS rating, yet without “vector” instructions (SIMD, single instruction, multiple data) to process many numbers simultaneously, you will never see that speed in practice.


The Secret Layer: Micro-Ops


Here is where things get interesting. Both MIPS and FLOPS assume that an “Instruction” is the fundamental unit of work. On a contemporary CPU, that is a lie. Processors don’t actually execute the instructions you write; they rely on decoupling the ISA from the actual hardware. When your program issues a complex instruction to the CPU, for example “add the number in memory to this register”, a part of the CPU called the Decoder takes that instruction and divides it into smaller, atomic tasks: micro-operations.


Think of a restaurant kitchen. You (the software) order a “Burger” (one command). The kitchen (the CPU) takes it apart: grill patty, toast bun, slice tomato (three micro-operations). This translation layer changes everything. A single complex instruction can explode into hundreds of micro-ops via microcode, while two separate instructions can be fused into one optimized micro-op (macro-fusion). The CPU then executes the micro-ops out of order, scanning ahead for tasks that do not depend on one another and running them in parallel.


The Real Problem: Meeting the Memory Wall


You can have the fastest chef in the world (MIPS) and the largest stove (FLOPS), but if the waiter takes an hour to fetch ingredients from the fridge, the restaurant is slow. This is what Wulf and McKee called the Memory Wall in 1995, and it is today’s single biggest computing bottleneck. Processor speed has grown exponentially, but the time it takes to fetch data from RAM has not kept pace.

  • The Latency Gap: A modern CPU running at 5GHz can execute an instruction in a fraction of a nanosecond. However, fetching data from main RAM (DRAM) can take 100+ nanoseconds. That means every single time the CPU has to request data from RAM, it can sit idle for hundreds of clock cycles, twiddling its thumbs.
  • The Energy Crisis: Time is not the only cost; power matters too. As NVIDIA Chief Scientist Bill Dally has put it, “compute is free, data is priceless.” Performing a 64-bit floating-point calculation takes approximately 20 picojoules of energy, but moving that data across the chip to memory can take more than 1,000 picojoules. We are burning more energy moving numbers around than actually crunching them.
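Both bullet points can be quantified with the figures from the text:

```python
# The latency gap and the energy gap, computed from the numbers cited above.
clock_ghz = 5.0                   # 5 GHz -> 0.2 ns per cycle
dram_latency_ns = 100.0           # typical DRAM access latency

cycles_wasted = dram_latency_ns * clock_ghz       # cycles idle per DRAM miss
print(f"{cycles_wasted:.0f} cycles stalled per memory access")  # 500

flop_pj, move_pj = 20.0, 1000.0   # energy per FP64 op vs. per data movement
print(f"moving data costs {move_pj / flop_pj:.0f}x the energy of the math")  # 50x
```

Five hundred lost cycles per miss and a 50x energy penalty for movement: that is the Memory Wall in two numbers.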


The New King: Operations Per Watt


In the age of AI and colossal data centers, “How fast?” has been superseded by “How efficient?” If you’re running a data center with 100,000 chips, electricity is your largest expense. This has led to the metric of Operations Per Watt (OPS/W).

  • The Brute Force Approach (GPUs): Contemporary AI chips such as NVIDIA’s Blackwell B200 are absolute beasts, offering up to 18 PetaFLOPS of FP4 compute. But they are energy-hungry, with TDPs as high as 1,000 Watts per chip. The industry is fighting back by scaling precision down: instead of 64-bit math, it uses 4-bit (FP4) or 8-bit (FP8) math to squeeze more operations out of every watt.
  • The Specialists (ASICs): Specialized chips such as Groq’s LPU (Language Processing Unit) exist to escape the limitations of the general-purpose GPU. By stripping out the complex hardware needed for graphics and using ultra-fast on-chip memory (SRAM) instead of slower external memory (HBM), they aim to deliver tokens faster and at lower energy cost by minimizing data movement.
  • The Biomimics (Neuromorphic): The ultimate goal is to mimic the human brain, which delivers an estimated exaFLOP-equivalent of processing on roughly 20 Watts. Chips like Intel’s Loihi 2 are inching toward that goal, reaching over 15 TOPS/W (Trillion Operations Per Watt) on specific workloads by consuming power only when “spikes” of data occur, rather than running a clock around the clock.
  • The Jevons Paradox: Ironically, as we make chips more efficient (more OPS/W), we don’t seem to use less energy. We simply build bigger models. This is the Jevons Paradox: efficiency means more demand.


The Bottom Line

What is the difference between MIPS and FLOPS? That question only probes the surface-level workload: logic vs. math. A processor’s true speed and efficiency, however, are defined by three hidden battles: how efficiently the Decoder can translate your code into the secret language of micro-ops, how effectively the system can smash through the Memory Wall to keep those micro-ops fed with data, and how many Operations Per Watt the silicon can deliver before it melts the data center.


In 2024 and beyond, the most important metric might not be how fast you can compute, but how fast you can wait—and how much it costs to keep the lights on while you do.

Posted in Uncategorized | Leave a comment

First Think, Then Talk

NVIDIA TiDAR, which stands for “Think in Diffusion, Talk in Autoregression”, is a hybrid architecture designed to make Large Language Models (LLMs) significantly faster without losing quality. Traditional models like GPT work like a person writing a letter one word at a time (Autoregressive). TiDAR changes this by combining two different “personalities” into one model.

  1. Thinking (Diffusion): The model “thinks” ahead by drafting several potential future words simultaneously in a single step. It’s like a fast brainstormer sketching out the next few sentences all at once.
  2. Talking (Autoregression): The model then instantly “verifies” those drafts using the traditional one-by-one method to ensure they make sense and follow grammar rules.

The Trick: TiDAR does both of these things in a single forward pass on the GPU. By filling up “free slots” in the GPU’s memory during computation, it achieves speeds up to 5.9x faster than standard models while maintaining the high quality of human-like text.
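Mechanically, the draft-then-verify loop resembles speculative decoding. The toy sketch below uses trivial stand-in models just to show how several drafted tokens can be committed in one step; it deliberately ignores TiDAR's actual single-pass GPU trick and real model internals:

```python
# Toy draft-then-verify: a cheap "drafter" proposes several tokens at once,
# and a verifier accepts the longest prefix it agrees with, so one step can
# commit multiple tokens. Both models here are trivial stand-ins.
def draft(prefix, k=4):
    # stand-in for the diffusion drafter: guess the next k tokens in one shot
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def verifier_next(prefix):
    # stand-in for the autoregressive model's single-token prediction
    return (prefix[-1] + 1) % 100

def generate_step(prefix):
    proposed = draft(prefix)
    accepted = []
    for tok in proposed:                      # verify drafts left to right
        if verifier_next(prefix + accepted) == tok:
            accepted.append(tok)              # draft agrees: keep it, keep going
        else:
            break                             # first disagreement ends the step
    return accepted

print(generate_step([7]))  # drafter and verifier agree here: [8, 9, 10, 11]
```

When drafter and verifier agree often, most steps commit several tokens at once; when they disagree, quality falls back to the verifier's one-by-one output, which is why the approach preserves autoregressive quality.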

Posted in Uncategorized | Leave a comment

Rethinking AI Infrastructure

A fundamental transformation is reshaping the hardware landscape, driven not by the familiar cadence of Moore’s Law, but by the physical realities of data movement. For the past decade, the dominant narrative in AI acceleration has been the relentless pursuit of floating-point operations per second (FLOPS). This “compute-centric” paradigm, epitomized by the ascent of massive Graphics Processing Units (GPUs), operated on the assumption that faster arithmetic was the primary bottleneck to machine intelligence. The research, articulated in the technical paper “Challenges and Research Directions for Large Language Model Inference Hardware”, posits that the industry has hit a “Memory Wall”. To understand the urgency of this proposal, one must quantify the divergence between compute and memory evolution. The Google research highlights a startling statistic: between 2012 and 2022, the 64-bit floating-point performance of NVIDIA GPUs increased by approximately 80-fold. In the same period, memory bandwidth grew by only 17-fold. This divergence creates a bottleneck where the performance of Large Language Models is no longer determined by how fast the chip can calculate, but by how fast it can retrieve weights from memory. As models scale from 100 billion to 10 trillion parameters, and context windows expand from thousands to millions of tokens, this “Memory Wall” becomes the definitive constraint on AI progress.  
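The arithmetic behind that divergence is simple but stark:

```python
# Compute grew ~80x while memory bandwidth grew only ~17x (the 2012-2022
# NVIDIA GPU figures cited above).
compute_growth, bandwidth_growth = 80, 17

imbalance = compute_growth / bandwidth_growth
print(f"arithmetic outpaced memory by ~{imbalance:.1f}x over the decade")  # ~4.7x
```

Every such doubling of the imbalance pushes more workloads from compute-bound to memory-bound, which is the "Memory Wall" the paper is responding to.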

The most disruptive proposal, High Bandwidth Flash (HBF), represents a convergence of storage and memory, designed to provide the massive capacity required for next-generation models without the prohibitive cost of DRAM. By stacking multiple 3D NAND dies and connecting them with thousands of Through-Silicon Vias (TSVs), HBF bypasses the serial bottlenecks of traditional SSDs. The specifications are formidable: a single stack of HBF targets read bandwidths approaching 1.6 TB/s with a capacity of 512GB.

This density fundamentally changes the deployment model for AI. Where a 1-Trillion parameter model currently requires a rack of GPUs simply to hold its weights, HBF could allow such a model to reside entirely within a single compute node. But Flash memory has higher latency than DRAM, requiring management capabilities along the lines of NUMA and AIOS: the system knows which weights it needs next and can pipeline the requests. Google calls this Processing-Near-Memory (PNM), embedding lightweight accelerator logic directly into the base layer of the memory stacks. This allows the system to send semantic commands (like “find the top-50 matching tokens”) rather than raw address requests. According to Google, this enables a 21.9x improvement in throughput and a 60x reduction in energy consumption per token. It fits into a broader vision of a disaggregated interconnect architecture (instead of a crossbar), likely utilizing Compute Express Link (CXL) to allow memory and compute to scale independently.

Hardware does not exist in a vacuum; it is being built to support specific algorithmic breakthroughs. A critical aspect of this roadmap is its synergy with Google’s concurrent research into next-generation model architectures like Titans and Hope. Unlike Transformers which have a fixed context window, Titans introduce a “Neural Memory” module that learns and updates in real-time. These models require mutable, persistent memory that is much larger than standard weights—a requirement that HBF is uniquely positioned to satisfy. Similarly, the “Hope” architecture’s hierarchical attention mechanism maps perfectly to the PNM hardware, allowing low-level retrieval tasks to be offloaded to the memory stack while the GPU handles high-level reasoning.  

The implications of this shift extend beyond the datacenter to the edge. The “Memory Wall” is even more formidable on mobile devices, where power budgets are strictly capped. Google’s Coral NPU platform is positioned as a vehicle for these technologies, where HBF could enable “Ambient AI”—always-on, privacy-preserving intelligence that processes voice and video locally without touching the cloud. Because HBF consumes near-zero power when idle, it is ideal for battery-powered devices that need to wake up instantly to perform an inference.  

Ultimately, this research is a manifesto for the “Inference Era” of Artificial Intelligence. It argues that we must stop judging hardware by “FLOPS” and start measuring “Tokens per Dollar” and “Tokens per Watt”. By prioritizing High Bandwidth Flash to solve capacity, Processing-Near-Memory to solve energy, and 3D Stacking to solve density, Google has outlined a viable path to the 100-Trillion parameter models of the future. As David Patterson and his team suggest, the success of the next decade of AI will depend less on the raw speed of our processors and more on the intelligence of our memory architectures.

Posted in Uncategorized | Leave a comment

Breaking the Silicon Ceiling: How Bio-Inspired “Neuro-Channel” Networks Could End the GPU Era

The early history of Artificial Intelligence has, in essence, been written in linear algebra, and above all in the operation of “multiply-accumulate”. Since the perceptron, we have assumed that learning can only happen when inputs are multiplied by weights, a seemingly innocent choice that has tied the advance of AI to an ever-deeper dependence on high-precision, energy-hungry hardware. Today, training and running Large Language Models (LLMs) and deep computer vision systems requires GPUs built for floating-point throughput. This reliance has created a crisis of sustainability and access, confining cutting-edge intelligence to immense data centers. Yet “Neuro-Channel Networks” (NCNs) propose an alternative in which floating-point multiplication is eliminated entirely from the forward pass.

To understand the magnitude of this shift, one must first confront the “multiplication tax” inherent in modern deep learning. A single 32-bit floating-point multiplication consumes approximately 37 times more energy than a 32-bit integer addition. When a neural network is designed around the dot product, it forces the hardware to perform the most expensive arithmetic operation just as frequently as the cheaper accumulation step. This is the primary reason why AI accelerators consume kilowatts of power. NCNs reject this premise entirely, replacing “weights” with “Channel Widths” and moving from a logic of projection to a logic of flow control. The core innovation lies in the “Neuro-Channel Perceptron”, which replaces the standard neuron.

import torch

def ncn_channel_function(x, w):
    """
    Implements the NCN Channel Function: sgn(x) * min(|x|, |w|)
    
    Args:
        x (torch.Tensor): The input tensor.
        w (torch.Tensor): The weight tensor (channel width).
    """
    # Calculate magnitudes
    abs_x = torch.abs(x)
    abs_w = torch.abs(w)
    
    # Apply the clamping logic: min(|x|, |w|)
    clamped_magnitude = torch.min(abs_x, abs_w)
    
    # Restore the original sign of x
    return torch.sgn(x) * clamped_magnitude

# Example Usage
input_x = torch.tensor([-5.0, 2.0, 0.5])
weight_w = torch.tensor([3.0, 3.0, 3.0])

output = ncn_channel_function(input_x, weight_w)
print(f"Output: {output}") 

# Expected: [-3.0, 2.0, 0.5]

This models a pipe: if you try to push 100 gallons of water through a pipe that handles 50, you get 50 out. Crucially, this logic requires only comparators and multiplexers in hardware, operations that are vastly cheaper and smaller than multipliers. To solve the “Dead Gradient” problem, where a closed channel might stop learning entirely, a secondary “Neurotransmitter” parameter is introduced. This acts as a regulator, ensuring that even if the structural channel is closed, gradient information can still flow, allowing the network to recover and learn robustly.

The “Neurotransmitter”:

import torch
import torch.nn as nn

class NCNLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super(NCNLayer, self).__init__()
        # The 'Channel Width' (Structural Weight)
        self.w = nn.Parameter(torch.randn(out_features, in_features))
        # The 'Neurotransmitter' (Gradient Regulator)
        # Typically initialized to a small positive value
        self.n = nn.Parameter(torch.full((out_features, in_features), 0.01))

    def forward(self, x):
        # x shape: [batch, in_features]
        # We broadcast x to match the weight dimensions for the channel op
        # Note: Simplified for a single linear-style pass
        
        # 1. Structural Channel Function: sgn(x) * min(|x|, |w|)
        abs_x = torch.abs(x).unsqueeze(1) # [batch, 1, in_features]
        abs_w = torch.abs(self.w)          # [out, in]
        
        channel_out = torch.sgn(x).unsqueeze(1) * torch.min(abs_x, abs_w)
        
        # 2. Neurotransmitter Bypass: n * x
        regulator_out = self.n * x.unsqueeze(1)
        
        # Total output (summed over input features)
        return torch.sum(channel_out + regulator_out, dim=2)

# Example Usage
model_layer = NCNLayer(in_features=4, out_features=2)
input_data = torch.tensor([[10.0, -0.5, 2.0, -8.0]])
output = model_layer(input_data)

print(f"Output with Neurotransmitter: {output}")

The implications of this architecture extend far beyond theoretical curiosity. By utilizing only addition, subtraction, and bitwise operations, NCNs promise to reduce the energy cost of individual synaptic operations by up to 90% for specific arithmetic paths. Furthermore, because NCNs rely on standard CPU instructions like ADD and CMP (compare), they could theoretically allow advanced pattern recognition to run efficiently on commodity CPUs, ultra-low-power microcontrollers, and battery-harvesting edge devices, decoupling AI from the scarcity of the GPU market.
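To make the instruction-level claim concrete, here is a minimal illustrative sketch (not from the original NCN formulation; the names `int_channel` and `int_neuron` are hypothetical) of the channel function written with only comparisons, `abs`, negation, and addition, with no multiplication anywhere:

```python
def int_channel(x: int, w: int) -> int:
    """Channel function sgn(x) * min(|x|, |w|) computed without
    any multiplication: just abs, a comparison, and negation."""
    magnitude = min(abs(x), abs(w))             # compiles to CMP + conditional move
    return magnitude if x >= 0 else -magnitude  # sign restore via negation only

def int_neuron(inputs, widths):
    """A 'neuron' is then just a sum of channel outputs -- pure ADDs."""
    return sum(int_channel(x, w) for x, w in zip(inputs, widths))

print(int_channel(100, 50))                # 50  (the pipe saturates)
print(int_channel(-100, 50))               # -50
print(int_neuron([10, -3, 7], [5, 5, 5]))  # 5 + (-3) + 5 = 7
```

At the machine level this reduces to CMP, conditional moves, and a chain of ADDs, which is precisely the instruction mix the paragraph above describes.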

Though still at the proof-of-concept stage, Neuro-Channel Networks have already been validated on non-linear problems such as XOR and the Majority Function with 100% accuracy, and they represent a necessary correction to the historical trajectory of Deep Learning.
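As a hedged sanity check of the XOR claim, the sketch below stacks two of the NCNLayer modules defined earlier (repeated here so the snippet runs standalone) and fits the XOR truth table with plain gradient descent. The hidden width, learning rate, and step count are assumptions, and reaching exactly 100% accuracy depends on initialization; what the sketch demonstrates is that gradients flow and the loss falls despite the absence of a conventional activation function:

```python
import torch
import torch.nn as nn

class NCNLayer(nn.Module):
    """Neuro-Channel layer: sgn(x) * min(|x|, |w|) plus an n*x bypass."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w = nn.Parameter(torch.randn(out_features, in_features))
        self.n = nn.Parameter(torch.full((out_features, in_features), 0.01))

    def forward(self, x):
        abs_x = torch.abs(x).unsqueeze(1)  # [batch, 1, in]
        channel = torch.sgn(x).unsqueeze(1) * torch.min(abs_x, torch.abs(self.w))
        bypass = self.n * x.unsqueeze(1)   # Neurotransmitter keeps gradients alive
        return torch.sum(channel + bypass, dim=2)

torch.manual_seed(0)
model = nn.Sequential(NCNLayer(2, 4), NCNLayer(4, 1))  # hidden width 4 is an assumption
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])  # XOR truth table

opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
initial_loss = loss_fn(model(X), y).item()
for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
final_loss = loss_fn(model(X), y).item()
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The clamping in min(|x|, |w|) supplies the non-linearity that XOR requires, so no separate activation layer is needed; the Neurotransmitter bypass is what keeps the loss moving even when channels saturate.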

Posted in Uncategorized | Leave a comment

Do LLMs widen the gap between junior and senior engineers?

Large language models and agentic systems appear to benefit experienced engineers far more than they help less experienced ones. A useful analogy is that an LLM resembles an exceptionally fast sous-chef who has memorized every recipe ever written, rather than a professional head chef. In the hands of a skilled chef, such a sous-chef is extraordinarily powerful. The chef understands the cuisine, the balance of flavors, and can immediately recognize when a dish lacks acidity, has too much salt, or clashes with the rest of the menu. The sous-chef accelerates preparation, suggests variations, and supports creative exploration, but responsibility for taste, consistency, and quality remains firmly with the chef.

For someone without cooking experience, however, that same sous-chef can be actively harmful. Recipes appear polished, instructions sound authoritative, and substitutions seem reasonable. Yet without an understanding of the fundamentals (heat control, seasoning, timing), even correctly following the instructions can result in an inedible dish. Worse, the cook may not understand why it tastes wrong. The problem is not the recipe generator but the lack of judgment needed to evaluate its output.

This maps closely to how LLMs are used in software development. Experienced engineers treat AI-generated code as a draft: they quickly identify missing error handling, unclear contracts, poor separation of concerns, or long-term maintainability risks. They adjust proportions, replace ingredients, and sometimes discard the result entirely. Less experienced developers, by contrast, may assume the code is “good enough” simply because it compiles and produces output—much like assuming a dish must be correct because the recipe was followed step by step.

This is where Hyrum’s Law becomes relevant. In cooking, undocumented quirks (a pan that runs hot, an oven that browns unevenly) inevitably become part of the recipe. Change the pan, and the dish breaks. In software, the quirks introduced by LLM-generated code are just as likely to become accidental dependencies. Experts compensate for such quirks intentionally; novices unknowingly encode them into future systems.

Agent-based systems extend the analogy to automated kitchen stations. In a well-run restaurant, automation improves throughput without sacrificing quality because the menu is stable, processes are well understood, and chefs remain in control. In an inexperienced kitchen, the same automation produces inconsistent dishes at scale. Errors are no longer isolated; they are multiplied. The same principle applies to technical debt.

Fast food is cheap to produce but expensive to live on. LLMs make “fast-food software” remarkably easy to generate: quick, filling, and immediately satisfying. Experienced teams invest effort to turn that output into something balanced and sustainable. Others accumulate indigestion, systems that are brittle, difficult to reason about, and hard to evolve.

The core truth is simple and increasingly visible: LLMs do not eliminate taste, judgment, or responsibility. They amplify them. Just as good tools do not make a great cook, LLMs do not make a great engineer. They merely reveal the difference sooner and at scale.

Posted in Uncategorized | Leave a comment