The hidden AI bottleneck

If you’ve ever splurged on a processor or raved about a supercomputer, you’ve probably talked about “gigahertz,” “MIPS,” or “teraFLOPS.” We tend to imagine these numbers as horsepower figures for a car: a single reading that tells us how fast the machine will go. But computer speed isn’t one number; it’s a dialogue between the software’s intentions and the hardware’s reality. To understand performance, we have to consider the two main languages computers speak (MIPS and FLOPS), the secret translation layer known as the micro-operation, and the brick wall modern processors are smashing into: memory latency and energy efficiency.


The Integer World: MIPS

Imagine a mail-sorting bureaucrat. They look at an envelope, decide which bin it belongs in, and stamp it. This is the world of MIPS (millions of instructions per second). It deals with integers—whole numbers used for logic, decision-making, and memory addresses. When your computer runs an operating system, opens a web browser, or decides which line of code to run next (an if-then statement), it is operating in a world of integers. MIPS serves as a speed gauge for how well a processor handles such control-flow instructions. Yet MIPS has a reputation for misleading—jokingly expanded as “Meaningless Indicator of Processor Speed.” Why? Because not every instruction is created equal. A Complex Instruction Set Computer (CISC) may do a lot of work in one instruction, while a Reduced Instruction Set Computer (RISC) may need five instructions to do the same thing. Comparing processors by MIPS is like comparing chefs by dishes per minute without asking whether they’re chopping an onion or plating a soufflé.
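The CISC-vs-RISC pitfall is easy to see with a toy calculation. The numbers below are made up for illustration, not real benchmarks: two hypothetical chips finish the same task in the same wall-clock time, yet their MIPS ratings differ fivefold simply because one needs more instructions per unit of work.

```python
def mips(instruction_count, seconds):
    """Millions of instructions per second."""
    return instruction_count / seconds / 1e6

task_time = 0.5  # both hypothetical chips finish the same task in half a second

cisc_instructions = 250_000_000    # one complex instruction per unit of work
risc_instructions = 1_250_000_000  # five simple instructions for the same work

print(mips(cisc_instructions, task_time))  # 500.0 MIPS
print(mips(risc_instructions, task_time))  # 2500.0 MIPS -- "5x faster", same real speed
```

Identical real-world performance, wildly different MIPS: the metric rewards the chip that needs more instructions to say the same thing.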


The Scientific World: FLOPS



FLOPS (floating-point operations per second) measure pure mathematical throughput. Where MIPS is about control, FLOPS is about simulation. It is no wonder supercomputers and gaming GPUs are obsessed with FLOPS and its larger units; they are built to churn through the vast matrices of real numbers that describe physics, graphics, and AI. But there is a trap: a processor might boast a massive theoretical FLOPS rating, yet without software that actually uses its “vector” instructions (SIMD: single instruction, multiple data) to process many numbers simultaneously, you will never see that speed in practice.
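A peak-FLOPS rating is usually just multiplication: cores × clock × SIMD lanes × operations per lane per cycle. The sketch below uses a hypothetical 8-core, 3 GHz chip with 8-wide SIMD and fused multiply-add (FMA, which counts as two floating-point operations per lane per cycle); the figures are illustrative, not any real product's spec.

```python
def peak_gflops(cores, ghz, simd_lanes, flops_per_lane_per_cycle=2):
    # Default of 2 ops/lane/cycle assumes FMA: one multiply + one add per cycle.
    return cores * ghz * simd_lanes * flops_per_lane_per_cycle

# Hypothetical 8-core 3 GHz chip with 8-wide SIMD:
with_simd    = peak_gflops(8, 3.0, simd_lanes=8)  # 384.0 GFLOPS on paper
without_simd = peak_gflops(8, 3.0, simd_lanes=1)  # 48.0 GFLOPS -- same chip, scalar code
```

Same silicon, an 8x gap: the headline number silently assumes every lane of every vector unit is busy every cycle, which only vectorized code achieves.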


The Secret Layer: Micro-Ops


Here’s where things get interesting. Both MIPS and FLOPS assume that the “instruction” is the fundamental unit of work. On a contemporary CPU, that is a lie. Processors don’t actually execute the instructions you write; they rely on decoupling the instruction set architecture (ISA) from the actual hardware. When your program issues a complex instruction to the CPU—for example, “add the number in memory to this register”—a part of the CPU called the decoder takes that instruction and splits it into smaller, atomic tasks: micro-operations.


Think of a restaurant kitchen. You (the software) order a “burger” (one command). The kitchen (the CPU) takes it apart: grill patty, toast bun, slice tomato (three micro-operations). This translation layer changes everything. A single complex instruction can explode into hundreds of micro-ops via microcode, while two separate instructions can be fused into a single micro-op for efficiency (macro-fusion). The CPU then executes the micro-ops out of order, scanning ahead for tasks that don’t depend on one another and running them in parallel.
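The decode step can be sketched as a lookup. This toy decoder is purely illustrative—the instruction names and the exact micro-op splits are stand-ins, not any real x86 decoder's output—but it shows both directions of the translation: one instruction fanning out into several micro-ops, and two instructions fusing into one.

```python
def decode(instruction):
    """Toy ISA-to-micro-op translator (illustrative splits, not real hardware)."""
    if instruction == "ADD [mem], reg":          # complex: touches memory
        return ["LOAD tmp <- [mem]",             # fetch the operand
                "ADD  tmp <- tmp + reg",         # do the arithmetic
                "STORE [mem] <- tmp"]            # write the result back
    if instruction == "CMP reg, 0 ; JNE loop":   # two adjacent instructions...
        return ["CMP-AND-BRANCH reg, 0, loop"]   # ...fused into one micro-op
    return [instruction]                         # simple ops pass through 1:1

print(decode("ADD [mem], reg"))         # 1 instruction -> 3 micro-ops
print(decode("CMP reg, 0 ; JNE loop"))  # 2 instructions -> 1 micro-op (macro-fusion)
```

Because the instruction-to-micro-op ratio varies both ways, an instruction count (MIPS) measures the menu, not the work done in the kitchen.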


The Real Problem: Hitting the Memory Wall


You can have the fastest chef in the world (MIPS) and the largest stove (FLOPS), but if the waiter takes an hour to fetch ingredients from the fridge, dinner is slow. This is what Wulf and McKee called the Memory Wall in 1995, and it is the single biggest bottleneck in computing today. Processor speed has grown exponentially, but the time it takes to fetch data from RAM has not kept pace.

  • The Latency Gap: A modern CPU running at 5 GHz can execute multiple instructions in a fraction of a nanosecond. But fetching data from main memory (DRAM) can take 100+ nanoseconds. In other words, every time the CPU has to request data from RAM, it can sit idle for hundreds of clock cycles, twiddling its thumbs.
  • The Energy Crisis: Latency is not the only cost; power matters too. As NVIDIA Chief Scientist Bill Dally has put it, “compute is free, data is priceless.” Performing a 64-bit floating-point calculation costs roughly 20 picojoules of energy, but moving that data across the chip to memory can cost more than 1,000 picojoules. We’re burning more energy moving numbers around than actually crunching them.
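Both gaps above are simple arithmetic on the numbers already quoted (a 5 GHz clock, ~100 ns DRAM latency, ~20 pJ to compute vs. ~1,000 pJ to move data):

```python
# Latency gap: a 5 GHz core completes 5 cycles per nanosecond.
clock_ghz = 5.0
dram_latency_ns = 100.0

stall_cycles = dram_latency_ns * clock_ghz
print(stall_cycles)  # 500.0 cycles wasted per trip to main memory

# Energy gap: figures quoted above, ~20 pJ to compute vs ~1,000 pJ to move.
compute_pj, move_pj = 20, 1000
print(move_pj / compute_pj)  # 50.0 -- moving the data costs 50x the math itself
```

Five hundred idle cycles per cache miss, and a 50x energy premium on data movement: that is the wall in two numbers.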


The New King: Operations Per Watt


In the age of AI and colossal data centers, “How fast?” has been superseded by “How efficient?” If you’re running a data center with 100,000 chips, electricity is your largest expense. This has elevated a new metric: operations per watt (OPS/W).

  • The Brute Force Approach (GPUs): Contemporary AI chips such as NVIDIA’s Blackwell B200 are absolute beasts, offering up to 18 PetaFLOPS of FP4 compute. But they are energy-hungry, with TDPs as high as 1,000 watts per chip. The industry is attacking this by scaling precision down: instead of 64-bit math, they use 4-bit (FP4) or 8-bit (FP8) math to squeeze more operations out of every watt.
  • The Specialists (ASICs): Specialized chips such as Groq’s LPU (Language Processing Unit) exist to sidestep the limitations of the general-purpose GPU. By stripping out the complex hardware needed for graphics and using ultra-fast on-chip memory (SRAM) instead of slower external memory (HBM), they aim to deliver tokens faster and at a lower energy cost by minimizing data movement.
  • The Biomimics (Neuromorphic): The ultimate goal is to mimic the human brain, which delivers an estimated exaFLOP-equivalent of processing on roughly 20 watts. Chips like Intel’s Loihi 2 are inching toward that goal, reaching over 15 TOPS/W (trillion operations per second per watt) on specific workloads by consuming power only when “spikes” of data occur, rather than keeping a clock running constantly.
  • The Jevons Paradox: Ironically, as we make chips more efficient (more OPS/W), we don’t use less energy. We simply build bigger models. This is the Jevons Paradox: efficiency breeds more demand.
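Putting the figures from the bullets above on the same OPS/W scale makes the spread vivid. A caveat on top of the hedges already in the text: these are not apples-to-apples numbers (dense FP4 tensor math vs. sparse spiking workloads vs. a rough brain estimate), so treat the comparison as an order-of-magnitude sketch only.

```python
def tops_per_watt(ops_per_second, watts):
    """Trillions of operations per second, per watt."""
    return ops_per_second / watts / 1e12

b200  = tops_per_watt(18e15, 1000)  # 18 PetaFLOPS FP4 at ~1,000 W -> ~18 TOPS/W
loihi = tops_per_watt(15e12, 1.0)   # quoted as >15 TOPS/W on niche spiking workloads
brain = tops_per_watt(1e18, 20)     # ~exaFLOP-equivalent on ~20 W -> ~50,000 "TOPS/W"
```

Even granting the loose comparison, the brain sits three to four orders of magnitude beyond today's best silicon, which is why neuromorphic research keeps chasing it.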


The Bottom Line

Asking about the difference between MIPS and FLOPS is asking about the surface-level workload: logic versus math. But neither metric begins to measure a machine’s true speed and efficiency. That is defined by three hidden battles: how efficiently the decoder can translate your code into the secret language of micro-ops, how effectively the system can smash through the Memory Wall to keep those micro-ops fed with data, and how many operations per watt the silicon can deliver before it melts the data center.


In 2024 and beyond, the most important metric might not be how fast you can compute, but how fast you can wait—and how much it costs to keep the lights on while you do.
