The backbone of modern open source LLMs.
- pranav chellagurki
- 4 days ago
- 4 min read
When GPT-4 came out, and around that same time the “Sparks of Artificial General Intelligence” write-up started circulating, I remember being genuinely hyped. I was reading benchmarks, examples, edge cases, all of it. Then we went through the usual cycle for a couple of years: one provider beats another, then open source drops something controversial, then someone ships a new trick, then the leaderboard reshuffles again.
But cut to 2025 and it feels different. LLMs are starting to look like a commodity. People are less emotionally invested in tiny benchmark deltas and more invested in cost, latency, and how much compute you need to get something useful. Personally, I am way more fascinated by whether a model can run locally on my machine than by whether it is the absolute best model available through an API call.
However, the annoying truth is that scaling laws are very much still a thing. Are there better architectures out there that will reduce the need for massive training runs on huge data centers and GPU clusters? Of course. But until we find those methods and make them practical, scaling still does a decent job, even if it comes with diminishing returns.
I always kinda like coming back to the brain comparison here. The human brain is basically the most efficient compute system we have access to. It runs on something like 20 watts, which works out to roughly a few hundred calories per day. Meanwhile, running foundation models at scale can rack up hundreds of dollars in inference costs pretty quickly. That gap should make us humble. It also hints that we are not at the end of the architecture story.
Now, I want to be careful with the common claim that bigger always wins. But if you hold a bunch of variables roughly constant, more capacity still tends to buy you something. A 1T parameter model has more room to store patterns and skills than a 100B parameter model. That does not mean it is automatically smarter. It means it has more headroom to become smarter.
Think of it this way: in its most inefficient incarnation, an LLM can be a universal matching function, where every possible input string already has a matching output string. At the limit, you could still theoretically have a system that answers anything. So there is some merit to scaling: a 1T parameter system is closer to that reality than a 100B parameter system.
So if scaling is still helping, but compute is becoming the real bottleneck, what do we do? This is where I think Mixture of Experts deserves way more attention, because it is one of the clearest ways we have right now to make big models more compute efficient.
Traditional LLMs are dense. When you run a dense model, every token you generate runs through the same big stack of layers, and you pay the compute cost of the full model every time. Even if only parts of the network are really doing the work for a specific question, the whole system still participates, so the per-token compute stays heavy. And for fast inference, you typically want the model weights sitting in GPU memory, because pulling weights from slower memory over and over (when the model weights are much bigger than GPU memory) makes the whole experience slow.
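To put rough numbers on that, here is a back-of-envelope sketch. The parameter counts and the 2-FLOPs-per-parameter-per-token rule of thumb are assumptions for illustration, not measurements of any real model:

```python
# Back-of-envelope costs for a dense model (illustrative assumptions only).

def dense_cost(params: float, bytes_per_param: int = 2) -> None:
    # Rough rule of thumb: a forward pass costs about 2 FLOPs per parameter per token.
    flops_per_token = 2 * params
    # Weight memory if everything sits in GPU memory at fp16 (2 bytes per parameter).
    weight_gb = params * bytes_per_param / 1e9
    print(f"{params / 1e9:.0f}B params: ~{flops_per_token / 1e9:.0f} GFLOPs/token, "
          f"~{weight_gb:.0f} GB of weights")

dense_cost(100e9)   # dense 100B: every token pays for all 100B parameters
dense_cost(1000e9)  # dense 1T: compute and memory both scale up 10x
```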
MoE changes the rule. Instead of one big network where everything works on everything, you have a bunch of experts inside the model, plus a router. For each token, the router picks a small subset of experts to activate, and only those experts do the heavy compute. So even if the model has a trillion parameters in total, the active compute per token might be closer to something like tens of billions. You get big capacity without paying full compute on every token.
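Here is a minimal sketch of that idea in PyTorch. The sizes, the top-2 routing, and the plain softmax mixing are my own simplifications; real MoE layers add load-balancing losses, capacity limits, and far smarter dispatch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k Mixture of Experts layer (illustrative, not production code)."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert here is just a small feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run; the rest of the experts' weights sit idle.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```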
But, and this is the part that confused me at first too: MoE is mostly a compute win, NOT a memory win. The full set of experts still exists, and those weights have to live somewhere. The benefit is that you are not running ALL of them every time.
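Quick arithmetic makes the split between stored and active parameters obvious. The expert counts and sizes below are made-up round numbers, not any particular model:

```python
# Made-up round numbers to separate "what you store" from "what you compute".
n_experts      = 64      # experts that exist in the model
top_k          = 2       # experts the router activates per token
params_per_exp = 15e9    # parameters per expert (hypothetical)
shared_params  = 40e9    # attention, embeddings, router, etc. (hypothetical)

total_params  = shared_params + n_experts * params_per_exp   # memory you must hold
active_params = shared_params + top_k * params_per_exp       # compute you pay per token

print(f"total:  {total_params / 1e12:.1f}T parameters in memory")
print(f"active: {active_params / 1e9:.0f}B parameters of compute per token")
```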
So why do people talk about MoE making huge models more runnable, if the weights still exist somewhere anyway?
Because in real systems, there are different ways to store and access those weights. On personal hardware, you will most likely not be able to keep all the experts in GPU memory at once, but you can STILL run the model if you keep some weights in system RAM or even on disk and move experts in and out as needed. That can be slower, but it can turn “impossible” into “possible.”
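Here is a very rough sketch of that offloading idea, again in PyTorch. The expert cache, its size, and the experts themselves are hypothetical; real offloading engines do this with prefetching, pinned memory, and quantization, but the principle is the same:

```python
import torch
import torch.nn as nn

class ExpertCache:
    """Keep every expert in CPU RAM; move only the routed ones onto the GPU."""

    def __init__(self, experts: list[nn.Module], device="cuda", max_on_gpu=4):
        self.experts = [e.to("cpu") for e in experts]  # full set lives in system RAM
        self.device = device
        self.max_on_gpu = max_on_gpu
        self.on_gpu: list[int] = []                    # simple FIFO of resident experts

    def fetch(self, expert_id: int) -> nn.Module:
        if expert_id not in self.on_gpu:
            if len(self.on_gpu) >= self.max_on_gpu:    # evict the oldest resident expert
                evicted = self.on_gpu.pop(0)
                self.experts[evicted].to("cpu")
            self.experts[expert_id].to(self.device)    # pay the CPU-to-GPU transfer cost now
            self.on_gpu.append(expert_id)
        return self.experts[expert_id]

# Usage idea: the router decides which experts a token needs, then only those get fetched.
# cache = ExpertCache(experts=[make_expert() for _ in range(64)])  # make_expert is hypothetical
# y = cache.fetch(expert_id=17)(x.to("cuda"))
```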
If you have been reading my posts, and watching my paper reads, you know I love analogies :).
Here is another one: imagine you employ 100 specialists. In a dense setup, every time you ask a question, all 100 specialists have to show up and do work, even if only 3 of them are relevant. In an MoE setup, you still employ the 100 specialists, they still exist in your organization, but you have a dispatcher who says “only you three come answer this one.” The specialists still have to be reachable, but you do not pay the cost of having all of them actively working on every single question.
That is the real MoE advantage for people running large models on personal hardware. It is selective compute.
Anyway, this is probably an incomplete argument. MoE has drawbacks of its own, one of which I talked about in the “determinism - better of a choice” post, so please do read it. And I do realize this can sound like it trivializes the truly amazing open source community by calling any one architecture “the secret backbone.” My main intention is just to show that making systems compute-efficient matters as much as pushing up the numbers on eval metrics.

