JEPA: The step after generative language models

For a long time, AI research has followed two main paths.

  • Generative Model: One path tries to recreate the world exactly, piece by piece. It paints every pixel, predicts every word, and rebuilds reality in detail.
  • Discriminative/Contrastive Model: The other path tries to label the world, deciding what something is rather than rebuilding it.

These ideas gave us today’s language models and image generators. But they hit limits. They struggle with real understanding, long-term planning, and uncertainty. They copy patterns well, but they don’t truly understand what’s going on.

A different idea, called JEPA, takes another route. Instead of copying the world or labeling it, it tries to understand the world by predicting what is missing, using abstract concepts rather than raw details.

The old kitchen

Think of traditional AI models as line cooks. When a customer orders a dish, the cook must recreate everything from scratch. Every cut, every garnish, every seed must be placed just right.  If the cook isn’t sure where something goes, they guess. From far away, the plate looks fine. Up close, it’s messy. That’s how generative models work. They focus on tiny details. When details are uncertain, they fill in the gaps, even if they’re wrong.

Now imagine food critics. They don’t cook anything. They compare.  They taste two dishes and say whether they are similar. Or they check if a dish matches a description.

The culinary academy

Now imagine a culinary academy with a strange rule: No one is allowed to cook. The goal is not to recreate dishes, but to understand the flavor itself. This is where JEPA lives. A student is shown part of a dish.  Maybe they see lamb and rosemary on one side of the plate, but the sauce on the other side is hidden.

Instead of imagining how the sauce looks, the student forms a mental idea of what they see:

“This is rich, savory, herbal”

Then the student is asked:

“Given this, what is likely on the hidden side?”

They don’t imagine yogurt, mint leaves, or seeds.  They think in concepts:

“Rich meat is usually balanced with something cool and fresh”

So they predict:

“The hidden part probably tastes creamy and acidic”

This prediction lives entirely in the mind. No images, no pixels, no tokens, no words. Just meaning.

Now the main difference is how the learning happens. To check the answer, a master looks at the real sauce and forms their own mental idea:

“Cool, creamy, slightly sweet”

The student isn’t graded on visuals. They’re graded on how close their idea is to the master’s idea. If the student missed the sweetness, they adjust their understanding:

“Ah, pomegranate adds sweetness, not just acidity”

No one ever cared where the seeds were placed. Only the essence mattered.

Here is the most important twist: the master changes slowly, becoming more stable over time. The student changes quickly, trying to catch up. This prevents cheating. The student can’t just say “everything tastes the same” to get perfect scores, because the master keeps raising the bar. The student is always chasing a slightly better version of their own understanding.
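The slow master is usually implemented as an exponential moving average (EMA) of the student’s weights. A minimal sketch, with made-up toy parameters (the names `teacher`, `student`, and the momentum value are illustrative, not taken from any specific JEPA implementation):

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move the teacher a small step toward the student.

    The teacher ("master") changes slowly and stays stable;
    the student changes quickly via normal gradient updates.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: one weight vector each.
teacher = [np.array([1.0, 0.0])]
student = [np.array([0.0, 1.0])]

teacher = ema_update(teacher, student)
# teacher is now 0.996 * old_teacher + 0.004 * student
print(teacher[0])  # [0.996 0.004]
```

Because gradients only flow through the student while the teacher drifts slowly behind it, the target embeddings never collapse to a single trivial answer.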

JEPA doesn’t try to copy reality.  It doesn’t obsess over tiny details. It doesn’t hallucinate missing pieces. It builds a world model made of meaning.

That’s not cooking the dish. That’s understanding the flavor.

The technical food laboratory view 

This section grew out of a Discord discussion. Many thanks to my conversation partners.

The evolution of Machine Learning in four simple steps:

  1. The Encoder takes raw input data (a sentence, an image, an audio clip) and condenses it into a compact numerical summary. Comparing the similarity of these encodings is the usual classification/discrimination task.
  2. The Decoder takes the context vector provided by the encoder and “unpacks” it into the desired output format. Simply: from A to B.
  3. Between the encoder and decoder lies a “bottleneck.” This compressed summary is the only information the decoder gets to see; the combination is commonly known under the term “Autoencoder” (informally, a “filter”).
  4. The so-called “Transformer,” where an attention mechanism allows the decoder to look back at specific parts of the encoder’s input, making the “bottleneck” much more flexible.
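Steps 1-3 can be sketched in a few lines. This is a toy linear encoder/decoder pair with random weights (no training), just to make the shapes of the bottleneck concrete; all names and dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: a 6-dimensional "raw" data point.
x = rng.normal(size=6)

# Encoder: compress 6 -> 2. The 2-d vector z is the bottleneck,
# i.e. the context vector / embedding.
W_enc = rng.normal(size=(2, 6))
z = W_enc @ x

# Decoder: expand 2 -> 6, "unpacking" the context vector
# back into the original format.
W_dec = rng.normal(size=(6, 2))
x_hat = W_dec @ z

print(z.shape, x_hat.shape)  # (2,) (6,)
```

Training an autoencoder means adjusting `W_enc` and `W_dec` so that `x_hat` gets close to `x`, which forces the bottleneck `z` to keep only the essential information.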

In all cases, the mathematical representation (e.g., a vector or a tensor) is the so-called embedding: information put into an alternative, more easily processable representation.

And now JEPA takes the fifth step. In simple words, it gives the same concept an almost identical embedding regardless of its initial representation, including for the part of the information that is missing from the input and has to be predicted. The goal is not to generate something plausible from the learned distribution, but rather to reconstruct the identity (the facts).

The steps, using an image example, are:

  • Split the image into tiles
  • Generate an embedding (e.g., a tensor) for each tile; these are the target encodings
  • Learn a predictor that can generate the embedding of a missing tile from the other tiles
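The steps above can be sketched end to end. This is a deliberately tiny, untrained version (random linear maps instead of real encoders; names like `W_teacher` and `W_pred` are made up for illustration), but the key point survives: the prediction and the loss live entirely in embedding space, never in pixels:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tile, W):
    """Toy encoder: flatten a tile and project it to an embedding."""
    return W @ tile.ravel()

# 1. Split a toy 4x4 "image" into four 2x2 tiles.
image = rng.normal(size=(4, 4))
tiles = [image[r:r + 2, c:c + 2] for r in (0, 2) for c in (0, 2)]

# 2. Encode each tile; the slow teacher produces the target encodings.
W_teacher = rng.normal(size=(3, 4))   # embedding dim 3, each tile has 4 values
targets = [encode(t, W_teacher) for t in tiles]

# 3. Predict the missing tile's embedding from the visible ones.
visible = np.concatenate(targets[:3])  # context: tiles 0-2
W_pred = rng.normal(size=(3, 9))       # hypothetical predictor weights
prediction = W_pred @ visible

# The loss compares embeddings directly -- no pixels are generated.
loss = np.mean((prediction - targets[3]) ** 2)
```

In a real system the encoders and predictor are deep networks, training minimizes this embedding-space loss, and the teacher is updated as a slow moving average of the student; the structure of the three steps is the same.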