I Built a Nuclear Reactor Benchmark and LLMs Keep Melting Down
A simulated nuclear reactor ticking along at 10Hz, a benchmark for real-time control, and a leaderboard where a simple rule-following script beats every LLM I tested.
I Built a Nuclear Reactor Benchmark and LLMs Keep Melting Down
You know how most AI benchmarks are about answering questions or completing tasks? ReactorBench is different. It's a simulated nuclear reactor running at 10Hz. The physics keeps ticking whether the AI responds or not.
Think plate spinning, not firefighting. The reactor drifts. Sensors lie. Actuators get stuck. And if you take too long to think, the fuel temperature is already climbing by the time you respond.
The results surprised me. A simple rule-following script that just does what the operator manual says? Scores 76.2 out of 100. The best LLM I tested (Kimi K2)? 71.0. Claude Opus? 35.9. Claude Sonnet? 14.1, basically the same as a random agent mashing buttons (14.6).
Three of my Claude runs ended in meltdown. One took just 64 seconds.
Why I Built This
I got curious about something: can LLMs actually do real-time control?
Not question answering. Not code generation. Actual continuous control where the world keeps moving whether you've figured out what to do or not.
This matters for a lot of applications. Autonomous vehicles adjusting to road conditions. Robots manipulating objects. Industrial process control. Trading agents making sub-second decisions as market conditions shift. In all these domains, thinking for 2 seconds means the world has already changed by the time you act.
We usually test model speed by measuring latency or time-to-first-token. But that's a naive view of the speed-quality tradeoff. What matters isn't just how fast the model responds in isolation, but how well it performs when the task itself is time-sensitive. A slower model doesn't just take longer to answer, it gets worse answers because the problem has drifted while it was thinking.
I looked around for benchmarks that test this. SWE-bench measures coding ability on static problems. GAIA tests general AI assistants. AgentBench evaluates task completion. All of them let the AI take as long as it wants to respond. None of them care about time pressure.
| Benchmark | Real-Time? | Continuous? | Uncertainty? |
|-----------|------------|-------------|--------------|
| SWE-bench | No | No | No |
| GAIA | No | No | No |
| AgentBench | No | No | Minimal |
| ReactorBench | Yes (10Hz) | Yes | Sensors lie |
So I built one.
I picked a nuclear reactor simulation because:
- The physics has a built-in ~10 second lag, so you're always acting on a system that hasn't finished responding to your last move
- Failure is unambiguous: the core melts or it doesn't
- There's an operator manual, which means even a dead-simple script has something to follow
This started as a weekend project. Then baselines hit 98% and I realized it was too easy. So I made it harder. Then harder again. Eventually I found the right difficulty level where simple rules score 76, not 98.
How It Works
The Physics
The reactor uses point reactor kinetics. That's a simplified model, but it captures what matters: delayed neutrons create a ~10 second response lag to control inputs, and negative thermal feedback means the reactor naturally wants to be stable.
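To make that lag concrete, here's a minimal one-delayed-group point kinetics sketch with negative thermal feedback. The constants are toy values picked so a 10Hz explicit-Euler step stays stable; they are not ReactorBench's actual parameters.

```python
# Toy one-delayed-group point kinetics with negative thermal feedback.
# Constants are illustrative, not ReactorBench's actual parameters.
BETA = 0.0065      # delayed neutron fraction
GEN_TIME = 0.5     # effective generation time (s); toy value, far larger than reality
DECAY = 0.08       # precursor decay constant (1/s) -> roughly the ~10 s response lag
ALPHA_T = 2e-5     # fuel temperature feedback coefficient (1/K)
T_REF = 700.0      # reference fuel temperature (K)
DT = 0.1           # one simulation tick at 10 Hz

def step(power, precursors, temp, rod_reactivity):
    """Advance power (% nominal), precursor concentration, and fuel temp (K) by one tick."""
    rho = rod_reactivity - ALPHA_T * (temp - T_REF)             # hotter fuel -> less reactivity
    d_power = ((rho - BETA) / GEN_TIME) * power + DECAY * precursors
    d_precursors = (BETA / GEN_TIME) * power - DECAY * precursors
    d_temp = 0.03 * power - 0.02 * (temp - 550.0)                # crude heating vs. cooling balance
    return power + DT * d_power, precursors + DT * d_precursors, temp + DT * d_temp
```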
Here's the key insight: the agent's job is to guide, not fight. The physics helps you. If temperature rises, reactivity drops automatically. If you make small corrections and wait, the system settles down. If you panic and slam the rods around, you create oscillations.
There's a trap, though. When control rods are nearly fully inserted (below 10% position), there's a positive reactivity spike before the negative effect kicks in. This is the "graphite tip" effect, loosely inspired by Chernobyl. Agents who haven't read the manual fall into this one.
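For illustration, a rod-worth curve with that kind of tip spike might look like the sketch below; the shape and numbers are invented, not the benchmark's actual curve.

```python
# Invented rod-worth curve with a "graphite tip" spike at deep insertion.
def rod_reactivity(position_pct: float) -> float:
    """Reactivity from rod position: 0 = fully inserted, 100 = fully withdrawn."""
    base = (position_pct - 50.0) * 1e-4                  # withdrawing adds reactivity, inserting removes it
    tip_spike = 3e-3 if position_pct < 10.0 else 0.0     # the trap: a positive kick near full insertion
    return base + tip_spike
```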
Above 750K, positive feedback starts creeping in. Below 650K, you get xenon buildup issues. The reactor doesn't want to sit at 900K any more than it wants to sit at 600K. You have to keep it in the sweet spot.
The Chaos Monkey
I inject failures throughout the simulation:
- Sensors that lie
- Actuators that get stuck
- Random reactivity perturbations that push the reactor off-setpoint
Early versions had challenges only in the first minute. Simple strategies could coast once they survived the initial chaos. Now problems come at you continuously for all 300 seconds. The gauntlet at the end, around tick 2500-3000, is brutal.
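As a rough sketch of what a continuously-scheduled chaos monkey looks like, here's one way to spread faults across all 3000 ticks with extra pressure at the end. The fault names and rates are assumptions, not the benchmark's actual fault list.

```python
import random

# Hypothetical fault schedule: events spread over the whole run, denser in the final gauntlet.
FAULTS = ["sensor_bias", "sensor_dropout", "stuck_actuator", "reactivity_perturbation"]

def build_fault_schedule(total_ticks=3000, mean_gap_ticks=200, seed=None):
    """Return a sorted list of (tick, fault) pairs covering the full 300 s run."""
    rng = random.Random(seed)
    schedule, tick = [], 0.0
    while tick < total_ticks:
        tick += rng.expovariate(1.0 / mean_gap_ticks)   # roughly one fault every ~20 s
        if tick >= total_ticks:
            break
        schedule.append((int(tick), rng.choice(FAULTS)))
    for _ in range(5):                                   # extra faults packed into ticks 2500-3000
        schedule.append((rng.randint(2500, total_ticks - 1), rng.choice(FAULTS)))
    return sorted(schedule)
```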
The Sensor Layer
Agents never see ground truth. Every reading comes through a sensor model that adds noise and, occasionally, readings that are flat-out wrong.
When the reactor is stressed (high temperature, high pressure, power way off target), the sensor noise increases. You need reliable information most when it's hardest to get.
Cross-referencing helps. If the temperature sensor says 720K but pressure is climbing fast, maybe the temperature sensor is wrong. Agents can run diagnostics on specific sensors, but most don't bother.
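A stress-scaled sensor can be sketched in a few lines. The scaling factors and thresholds below are illustrative assumptions, not the benchmark's actual sensor code.

```python
import random

# Sketch of a stress-scaled sensor: the reading is the truth plus noise,
# and the noise grows when the reactor is stressed.
def read_sensor(true_value, temp_k, pressure_mpa, power_error_pct, base_noise_pct=1.0, rng=random):
    """Return a noisy reading whose std grows with temperature, pressure, and power error."""
    stress = 1.0
    stress += max(0.0, (temp_k - 750.0) / 100.0)      # hot reactor -> noisier sensors
    stress += max(0.0, (pressure_mpa - 7.0) / 2.0)    # high pressure -> noisier sensors
    stress += abs(power_error_pct) / 50.0             # far off the power target -> noisier
    noise_std = abs(true_value) * (base_noise_pct / 100.0) * stress
    return rng.gauss(true_value, noise_std)
```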
The Scoring
Score is 0-100, computed from how well power tracks the target, how long temperature stays in the optimal band, and whether you survive the full run.
There's also a SCRAM penalty. SCRAM is the emergency shutdown: all rods drop instantly. If you SCRAM at 1000K because temperature is about to run away? No penalty, good call. If you SCRAM at 700K because you panicked? Minus 30 points. Context matters.
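Here's roughly what a context-aware SCRAM penalty looks like. The -30 figure comes from the scoring above; the thresholds for "about to run away" are my own assumptions, not the benchmark's exact logic.

```python
# Context-aware SCRAM penalty: justified shutdowns are free, panic shutdowns cost points.
def scram_penalty(fuel_temp_k: float, temp_rate_k_per_s: float) -> float:
    """Zero penalty for a justified emergency shutdown, -30 for a panic SCRAM."""
    runaway = fuel_temp_k > 950.0 or (fuel_temp_k > 850.0 and temp_rate_k_per_s > 5.0)
    return 0.0 if runaway else -30.0
```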
The Results
The full leaderboard is below, with baselines included for reference. First, two example runs.
What Good Looks Like
Here's a run from Kimi K2, the best-performing LLM:
[Figure: Kimi K2 run - best LLM performance]
Notice: temperature stays in the optimal green band almost the entire time, power roughly tracks the variable target (the orange line), and the control rod movements are active but not erratic. Score: 69.2.
What Catastrophic Failure Looks Like
Here's Claude Sonnet's 64-second meltdown:
[Figure: Claude Sonnet meltdown - 64 seconds]
Power explodes to 140,000%. Temperature rockets past the 1200K meltdown threshold. The run ends before it really begins. Score: 13.0.
| Model | Mean Score | Std | Survival | Notes |
|-------|------------|-----|----------|-------|
| Simple Rules (baseline) | 76.2 | 0.2 | 300s | Just follows the manual |
| Kimi K2 (via Groq) | 71.0 | 1.3 | 320s | Best LLM tested |
| PID (baseline) | 65.7 | 0.4 | 300s | Classical control theory |
| Enhanced Rules (baseline) | 64.9 | 1.2 | 300s | Rules + scenario awareness |
| GPT-4.1-mini | 57.0 | 0.3 | 320s | Remarkably consistent |
| GPT-4o-mini | 49.3 | 4.5 | 320s | |
| No-Op (baseline) | 38.1 | 3.0 | 300s | Does literally nothing |
| Claude Opus 4.5 | 35.9 | 12.1 | 219s | All runs = meltdown |
| Claude Haiku 4.5 | 33.8 | 14.6 | 308s | High variance |
| Llama 4 Maverick (via Groq) | 30.4 | 16.5 | 310s | Very high variance |
| Grok 4.1 Fast | 28.1 | 22.4 | 320s | Scored 9.7 and 59.6 in different runs |
| Claude Sonnet 4.5 | 14.1 | 0.9 | 176s | All runs = meltdown |
| Random (baseline) | 14.6 | 1.6 | varies | Mashes buttons |
A few things jumped out at me.
The Inference Speed Pattern
Look at the Claude models. You'd expect the ranking to be Opus > Sonnet > Haiku, right? Bigger model, better performance. But the actual ranking is:
- Claude Opus 4.5: 35.9
- Claude Haiku 4.5: 33.8
- Claude Sonnet 4.5: 14.1
That's nothing like the expected order, with Sonnet catastrophically worse. 14.1 is random-agent territory. All three Sonnet runs ended in meltdown. One lasted 64 seconds.
Opus did better at 35.9, but still worse than doing nothing (38.1). All three Opus runs also melted down, with average survival of 219 seconds.
The pattern makes sense if you think about inference speed. Bigger models are slower. Opus is the largest and slowest. Haiku is the smallest and fastest. At 10Hz, every second of inference is 10 ticks where the reactor drifts without correction. The slower model doesn't just take longer to think, it gets stale information and applies corrections to a system that has already moved.
I watched some of the runs. Common failure patterns:
- Over-correcting: slamming the rods around and creating exactly the oscillations the physics punishes
- Driving the rods deep enough to trigger the graphite-tip reactivity spike
- Acting on stale readings, so corrections land on a reactor that has already moved
Fast and Consistent Beats Slow and Smart
The reactor drifts naturally. Xenon dynamics, random perturbations, the whole system slowly wandering off-setpoint. If you take 2 seconds to respond, that's 20 ticks where nobody's at the wheel.
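To see why that hurts, here's a sketch of an episode loop where the physics advances every tick while the agent's decision is still in flight. The names (sim.step_physics, agent.decide, and so on) are placeholders, not ReactorBench's real API.

```python
import asyncio

# The simulator advances every 100 ms no matter what, so an action computed from a
# 2-second-old observation lands on a reactor that has already moved ~20 ticks.
async def run_episode(agent, sim, duration_s=300.0, tick_s=0.1):
    latest_obs = sim.observe()
    pending_action = None                                     # applied while the agent is still thinking
    agent_task = asyncio.create_task(agent.decide(latest_obs))
    for _ in range(int(duration_s / tick_s)):
        if agent_task.done():
            pending_action = agent_task.result()              # possibly already stale
            agent_task = asyncio.create_task(agent.decide(latest_obs))
        sim.apply(pending_action)     # stale (or no) action if inference hasn't finished
        sim.step_physics(tick_s)      # the world moves regardless
        latest_obs = sim.observe()
        await asyncio.sleep(tick_s)   # hold the loop at 10 Hz wall-clock
```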
This is why GPT-4.1-mini does well. It's small, fast, and consistent. Standard deviation of 0.3 across three runs. Almost identical scores every time: 56.7, 56.9, 57.5. Not spectacular, but rock solid:
[Figure: GPT-4.1-mini - consistent performer]
Temperature stays stable. Power has some wild swings but generally tracks. The run looks almost identical to the other two GPT-4.1-mini runs.
Compare that to Grok 4.1 Fast: standard deviation of 22.4. It scored 9.7 on one run and 59.6 on another. Same model, wildly different outcomes. That suggests an unstable control strategy: sometimes it works, sometimes it doesn't.
The Consistency Gap
Some models are predictable:
- GPT-4.1-mini: std 0.3
- Claude Sonnet 4.5: std 0.9 (consistently bad, but consistent)
- Kimi K2: std 1.3

Some models are chaos:
- Grok 4.1 Fast: std 22.4
- Llama 4 Maverick: std 16.5
- Claude Haiku 4.5: std 14.6
- Claude Opus 4.5: std 12.1
Here's Grok's worst run (score 9.7):
The actual power drops to near zero while the target stays at ~100%. The model essentially gives up controlling power after ~75 seconds. Temperature drops out of optimal range. It survives, but barely does anything useful.
And here's Grok's best run (score 59.6):
Same model. Wildly different behavior. This one actively tracks power, maintains temperature in the optimal band, and adjusts control rods throughout.
High variance is arguably worse than a low mean score. If your autonomous vehicle controller works great 2/3 of the time and catastrophically fails 1/3 of the time, that's not a controller you can deploy.
Kimi K2: Fast Inference Wins
Kimi K2 scored 71.0, just 5.2 points below the simple rules baseline. It's the only LLM that scored above both the PID controller (65.7) and the Enhanced Rules baseline (64.9).
The key? It was running through Groq, which means very fast inference. When you're trying to control a system at 10Hz, fast inference isn't just a nice-to-have, it's the difference between tracking the target and falling behind.
This isn't just about the model quality. It's about the deployment. A good model with slow inference loses to a decent model with fast inference. For someone like me choosing what model to use for a real-time task, that matters more than benchmark scores on static problems.
(Worth noting: latency is also a function of network conditions, geographic location, ISP quality, and a dozen other things I have no control over. But the pattern holds.)
What Does This Mean?
LLMs aren't ready for real-time autonomous control. Not yet, anyway.
Even a simple script that reads the operator manual and follows the rules (76.2) beats all LLMs. Doing nothing (38.1) beats Claude Opus (35.9). That's not a knock on the models. They're incredible at what they're designed for. But continuous control under time pressure isn't that.
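For a sense of how little machinery that winning baseline needs, here's a sketch of a manual-following policy. The 650-750K band and the SCRAM-at-1000K call mirror the post; the action format and exact rules are my guesses, not the actual baseline.

```python
# Sketch of a "just follow the manual" policy in the spirit of the Simple Rules baseline.
def rules_policy(obs):
    """Map a (noisy) observation dict to a small, boring control action."""
    temp, power_error = obs["fuel_temp_k"], obs["power_error_pct"]
    if temp > 1000.0:
        return {"scram": True}                  # about to run away: shut it down, no penalty
    if temp > 750.0 or power_error > 10.0:
        return {"rod_delta_pct": -1.0}          # too hot / too much power: insert a little, then wait
    if temp < 650.0 or power_error < -10.0:
        return {"rod_delta_pct": +1.0}          # too cold / not enough power: withdraw a little
    return {"rod_delta_pct": 0.0}               # in the sweet spot: leave it alone
```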
The core issue is inference speed. Static benchmarks don't surface this. An LLM that scores 95 on coding and reasoning tasks can still melt down a reactor in 64 seconds. The failure mode is specific: time pressure reveals that slow, smart models fall behind fast-moving systems.
For the frontier labs, this is something they can optimize. Faster inference, better batching, smarter caching. But for someone like me choosing what model to use for a real-time task, this benchmark suggests picking the fastest model that's good enough, not the smartest model that's too slow.
What would help? Faster inference, first and foremost. Beyond that: models that commit to one consistent control strategy instead of improvising a new one every few ticks, and benchmarks that treat latency as part of the score rather than an afterthought.
Try It
ReactorBench is open source. You can run it yourself:
```bash
# Start the server
cd backend
pip install -r requirements.txt
python -m uvicorn server:app --reload --host 0.0.0.0 --port 8000
# Run a benchmark
export ANTHROPIC_API_KEY=your-key-here
python run_benchmark.py --model claude-sonnet-4-5 --duration 300
# Or run baselines
python run_benchmark.py --baselines --runs 5
```
There's a React dashboard if you want to watch the reactor in real time, and the flight recorder logs every tick for post-mortem analysis.
I'd love to see:
- Other models, especially fast ones I haven't tried
- Better scaffolding: caching, shorter prompts, anything that cuts per-tick latency
- Anyone closing the gap to the rules baseline, or beating it
The benchmark is intentionally hard. There's headroom above 76.2. If someone builds an LLM that reliably scores 85+, I want to know how.
The reactor is waiting.
---
All benchmark code, results, and plots are available in the repository. The physics model is a simplification. Don't use this to train actual reactor operators.