
Non-determinism - better off as a choice.

  • Writer: pranav chellagurki
  • 2 days ago
  • 3 min read

In my day job, people ask me all the time why they cannot get deterministic outputs from ChatGPT.


My usual answer is that its stochastic, flexible nature is exactly what makes it useful as a general-purpose QA system. It replaces older patterns like regex scripts and classifier-based bots, which can give you more deterministic outputs but are much more rigid.


That is true, but it is not the whole story.


Instead of only trying to sell people on that worldview, I think it is also useful to talk about the non-determinism in large models themselves, and how, in many setups, it is actually influenced by demand and infrastructure, not just by randomness baked into the model.


Because yes, randomness guided by the temperature setting is a good feature to have. But I have also had the awful experience of being sure the model would behave exactly as in my local tests, and then watching it do something different in front of an audience. This stings most when you are building for legal or compliance teams, where the flows are closer to rule-based systems than to free-flowing creative generation.


Setting the temperature to 0 and freezing the seed helps. It does not fully solve the problem.
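To see why temperature alone is not enough: temperature only controls the sampling step at the very end of decoding. Here is a toy sketch (plain numpy, not any provider's actual API) showing that at temperature 0 sampling collapses to a greedy argmax, which is deterministic for a given set of logits:

```python
import numpy as np

def sample_token(logits, temperature, rng):
    # Toy decoding step: temperature 0 means greedy argmax.
    if temperature == 0:
        return int(np.argmax(logits))
    # Otherwise, sample from the temperature-scaled softmax.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([1.0, 3.5, 0.2, 2.9])
rng = np.random.default_rng(42)

# Greedy decoding always picks the same token for the same logits.
assert all(sample_token(logits, 0, rng) == 1 for _ in range(10))
```

The catch is the "for a given set of logits" part: if the logits themselves shift between runs, greedy decoding happily follows them to a different token.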

After some digging, the simplest and most likely explanation is the interaction of Mixture of Experts architectures with server-side batching.


The analogy that I give all the time:

Think of a hospital. When you arrive, a staff member sends you to Doctor X, who is best equipped to treat your condition. If you are the only patient, that staff member will always send you to Doctor X: you see the same doctor, they follow the same standard procedure, and your consultation is effectively deterministic. Now imagine the normal case, where lots of patients show up at once. The staff member still tries to pick the best doctor for each case, but if Doctor X is already at capacity, you might get sent to Doctor Y instead. At that point, which doctor you see depends not only on your symptoms, but also on who else is in the waiting room at the same time. Your consultation is no longer guaranteed to be identical every visit, even if you show up with the same symptoms.


MoE models and batching work in a similar way.


Many modern LLMs, and the serving stacks around them, use Mixture of Experts architectures. Roughly, instead of one big monolithic model, you have many “expert” sub-models. For each token, a router decides which experts receive it. Each expert has a capacity (a maximum number of tokens it can process per batch), and when it is full, extra tokens get routed elsewhere.


Your request does not run in isolation. It is usually batched together with many other users’ prompts and tokens. All these tokens compete for the same pool of experts and their limited capacity.


So even if your input text is identical, the composition of the batch, basically what other users are doing at that moment, can change which experts your tokens hit. When the “ideal” expert is at capacity, your tokens may be routed to backup experts instead. That small change in routing early in the sequence can cascade into different downstream activations and, eventually, different outputs.
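Here is a toy, self-contained sketch of that capacity-overflow behavior. The router, the scores, and the capacity numbers are all made up for illustration; real MoE routers are learned and operate per token inside the network, but the load-dependent spillover is the same idea:

```python
def route_batch(batch, scores, capacity):
    """Toy top-1 router: each token goes to its highest-scoring
    expert, spilling to the next-best expert once that one is full."""
    load, assignment = {}, {}
    for tok in batch:
        # Experts ranked best-first for this token.
        ranked = sorted(scores[tok], key=scores[tok].get, reverse=True)
        for expert in ranked:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignment[tok] = expert
                break
    return assignment

# Hypothetical router scores: every token below prefers expert "X".
scores = {
    "yours": {"X": 0.9, "Y": 0.4},
    "other1": {"X": 0.8, "Y": 0.3},
    "other2": {"X": 0.7, "Y": 0.6},
}

# Alone in the batch, your token gets its ideal expert.
quiet = route_batch(["yours"], scores, capacity=2)

# In a busy batch, other users' tokens fill expert X first,
# so the exact same token overflows to expert Y.
busy = route_batch(["other1", "other2", "yours"], scores, capacity=2)

print(quiet["yours"], busy["yours"])  # prints: X Y
```

Same token, same scores, different expert, purely because of who else was in the batch. That is the hospital waiting room in code.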


So your output now depends not only on your own prompt and settings, but also on what else the system is serving at the same time.


This is a big reason why “run this prompt 10 times with temperature = 0” can still give slight variations in a real deployment.


I have also heard that there are additional factors, such as GPU parallelism. How distributed computing works in detail is something I want to explore too, but the batching and expert-routing mechanism seems to be the primary cause of non-determinism in practice.


Personally, I think making LLM systems capable of strong determinism for specific patterns, even under heavy load and large batches, is a problem worth solving if we want to automate complex, high-value tasks. When I am building something that behaves more like a rules engine for a legal team than like a creative writing assistant, I do appreciate the rigidity.

 
 
 
