I Built a Nuclear Reactor Benchmark and LLMs Keep Melting Down

December 16, 2025

A brief one or two sentence description of the blog post.

I Built a Nuclear Reactor Benchmark and LLMs Keep Melting Down

You know how most AI benchmarks are about answering questions or completing tasks? ReactorBench is different. It's a simulated nuclear reactor running at 10Hz. The physics keeps ticking whether the AI responds or not.

Think plate spinning, not firefighting. The reactor drifts. Sensors lie. Actuators get stuck. And if you take too long to think, the fuel temperature is already climbing by the time you respond.

The results surprised me. A simple rule-following script that just does what the operator manual says? Scores 76.2 out of 100. The best LLM I tested (Kimi K2)? 71.0. Claude Opus? 35.9. Claude Sonnet? 14.1, basically the same as a random agent mashing buttons (14.6).

Three of my Claude runs ended in meltdown. One took just 64 seconds.

Why I Built This

I got curious about something: can LLMs actually do real-time control?

Not question answering. Not code generation. Actual continuous control where the world keeps moving whether you've figured out what to do or not.

This matters for a lot of applications. Autonomous vehicles adjusting to road conditions. Robots manipulating objects. Industrial process control. Trading agents making sub-second decisions as market conditions shift. In all these domains, thinking for 2 seconds means the world has already changed by the time you act.

We usually test model speed by measuring latency or time-to-first-token. But that's a naive view of the speed-quality tradeoff. What matters isn't just how fast the model responds in isolation, but how well it performs when the task itself is time-sensitive. A slower model doesn't just take longer to answer, it gets worse answers because the problem has drifted while it was thinking.

I looked around for benchmarks that test this. SWE-bench measures coding ability on static problems. GAIA tests general AI assistants. AgentBench evaluates task completion. All of them let the AI take as long as it wants to respond. None of them care about time pressure.

| Benchmark | Real-Time? | Continuous? | Uncertainty? |
|-----------|------------|-------------|--------------|
| SWE-bench | No | No | No |
| GAIA | No | No | No |
| AgentBench | No | No | Minimal |
| ReactorBench | Yes (10Hz) | Yes | Sensors lie |

So I built one.

I picked a nuclear reactor simulation because:

Coupled dynamics: Moving the control rods changes the power, which changes the temperature, which changes the reactivity, which changes the power again. Everything's connected with delays.

Clear failure states: Meltdown at 1200K. Pressure rupture at 20 MPa. No ambiguity about what "losing" looks like.

Epistemic challenge: Sensors drift. Sensors get stuck. Sometimes they just return null. The agent has to figure out when to trust its readings.

It's dramatic: Let's be honest, "nuclear reactor meltdown" keeps you more engaged than "thermostat control benchmark."

This started as a weekend project. Then baselines hit 98% and I realized it was too easy. So I made it harder. Then harder again. Eventually I found the right difficulty level where simple rules score 76, not 98.

How It Works

The Physics

The reactor uses point reactor kinetics. That's a simplified model, but it captures what matters: delayed neutrons create a ~10 second response lag to control inputs, and negative thermal feedback means the reactor naturally wants to be stable.

Here's the key insight: the agent's job is to guide, not fight. The physics helps you. If temperature rises, reactivity drops automatically. If you make small corrections and wait, the system settles down. If you panic and slam the rods around, you create oscillations.

There's a trap, though. When control rods are nearly fully inserted (below 10% position), there's a positive reactivity spike before the negative effect kicks in. This is the "graphite tip" effect, loosely inspired by Chernobyl. Agents who haven't read the manual fall into this one.

Above 750K, positive feedback starts creeping in. Below 650K, you get xenon buildup issues. The reactor doesn't want to sit at 900K any more than it wants to sit at 600K. You have to keep it in the sweet spot.

The Chaos Monkey

I inject failures throughout the simulation:

Rod stuck: Your control rod movements slow to a crawl. Hope you can work the coolant pump instead.

Pump trip: Coolant stops flowing. The pump auto-recovers, but slowly. Your SET_PUMP commands get ignored until then.

Sensor drift: Temperature readings gradually diverge from reality. Can you tell something's wrong?

Sensor stuck: The value just stops updating. Same number, tick after tick.

Thermal oscillation: Inlet temperature swings up and down, forcing you to choose between stable temperature and stable power.

Competing demands: Power demand spikes while cooling capacity drops. Pick your poison.

Early versions had challenges only in the first minute. Simple strategies could coast once they survived the initial chaos. Now problems come at you continuously for all 300 seconds. The gauntlet at the end, around tick 2500-3000, is brutal.

The Sensor Layer

Agents never see ground truth. Every reading comes through a sensor model that adds:

Gaussian noise (scaled by stress: when things are bad, readings get worse)

Drift accumulation (if a drift scenario is active)

Stuck values (if the sensor has frozen)

Occasional nulls

When the reactor is stressed (high temperature, high pressure, power way off target), the sensor noise increases. You need reliable information most when it's hardest to get.

Cross-referencing helps. If the temperature sensor says 720K but pressure is climbing fast, maybe the temperature sensor is wrong. Agents can run diagnostics on specific sensors, but most don't bother.

The Scoring

Score is 0-100, computed from:

40% Power tracking: Stay within ±5% of the target power. The target changes continuously, so you can't just stabilize and ignore it.

30% Temperature stability: Time spent in the optimal band (680-720K), not just the safe range (600-800K).

20% Control smoothness: Don't thrash. Small corrections beat big ones.

10% Survival: Did you make it to 300 seconds without melting down?

There's also a SCRAM penalty. SCRAM is the emergency shutdown: all rods drop instantly. If you SCRAM at 1000K because temperature is about to run away? No penalty, good call. If you SCRAM at 700K because you panicked? Minus 30 points. Context matters.

The Results

Here's the leaderboard. Baselines are included for reference.

What Good Looks Like

Here's a run from Kimi K2, the best-performing LLM:

!Kimi K2 run - best LLM performance

Notice: temperature stays in the optimal green band almost the entire time, power roughly tracks the variable target (the orange line), and the control rod movements are active but not erratic. Score: 69.2.

What Catastrophic Failure Looks Like

Here's Claude Sonnet's 64-second meltdown:

!Claude Sonnet meltdown - 64 seconds

Power explodes to 140,000%. Temperature rockets past the 1200K meltdown threshold. The run ends before it really begins. Score: 13.0.

| Model | Mean Score | Std | Survival | Notes |
|-------|------------|-----|----------|-------|
| Simple Rules (baseline) | 76.2 | 0.2 | 300s | Just follows the manual |
| Kimi K2 (via Groq) | 71.0 | 1.3 | 320s | Best LLM tested |
| PID (baseline) | 65.7 | 0.4 | 300s | Classical control theory |
| Enhanced Rules (baseline) | 64.9 | 1.2 | 300s | Rules + scenario awareness |
| GPT-4.1-mini | 57.0 | 0.3 | 320s | Remarkably consistent |
| GPT-4o-mini | 49.3 | 4.5 | 320s | |
| No-Op (baseline) | 38.1 | 3.0 | 300s | Does literally nothing |
| Claude Opus 4.5 | 35.9 | 12.1 | 219s | All runs = meltdown |
| Claude Haiku 4.5 | 33.8 | 14.6 | 308s | High variance |
| Llama 4 Maverick (via Groq) | 30.4 | 16.5 | 310s | Very high variance |
| Grok 4.1 Fast | 28.1 | 22.4 | 320s | Scored 9.7 and 59.6 in different runs |
| Claude Sonnet 4.5 | 14.1 | 0.9 | 176s | All runs = meltdown |
| Random (baseline) | 14.6 | 1.6 | varies | Mashes buttons |

A few things jumped out at me.

The Inference Speed Pattern

Look at the Claude models. You'd expect the ranking to be Opus > Sonnet > Haiku, right? Bigger model, better performance. But the actual ranking is:

Claude Opus 4.5: 35.9

Claude Haiku 4.5: 33.8

Claude Sonnet 4.5: 14.1

That's basically backwards, with Sonnet catastrophically worse. Sonnet scored 14.1, random-agent territory. All three Sonnet runs ended in meltdown. One lasted 64 seconds.

Opus did better at 35.9, but still worse than doing nothing (38.1). All three Opus runs also melted down, with average survival of 219 seconds.

The pattern makes sense if you think about inference speed. Bigger models are slower. Opus is the largest and slowest. Haiku is the smallest and fastest. At 10Hz, every second of inference is 10 ticks where the reactor drifts without correction. The slower model doesn't just take longer to think, it gets stale information and applies corrections to a system that has already moved.

I watched some of the runs. Common failure patterns:

Overly cautious, then panicked SCRAM - The model would make tiny adjustments, temperature would creep up because drift wasn't being countered fast enough, then suddenly SCRAM at 850K when things got warm. That's a 20-point penalty.

Ignoring sensor diagnostics - The benchmark gives you a `RUN_DIAGNOSTIC` command to check if sensors are healthy. Most Claude runs never used it. When the temperature sensor drifted and showed 750K while real temp was 700K, Claude would insert rods to "fix" a problem that didn't exist.

Too much thinking - In this domain, thinking carefully means falling behind. The reactor doesn't wait for you to finish your analysis.

Fast and Consistent Beats Slow and Smart

The reactor drifts naturally. Xenon dynamics, random perturbations, the whole system slowly wandering off-setpoint. If you take 2 seconds to respond, that's 20 ticks where nobody's at the wheel.

This is why GPT-4.1-mini does well. It's small, fast, and consistent. Standard deviation of 0.3 across three runs. Almost identical scores every time: 56.7, 56.9, 57.5. Not spectacular, but rock solid:

!GPT-4.1-mini - consistent performer

Temperature stays stable. Power has some wild swings but generally tracks. The run looks almost identical to the other two GPT-4.1-mini runs.

Compare that to Grok 4.1 Fast: standard deviation of 22.4. It scored 9.7 on one run and 59.6 on another. Same model, wildly different outcomes. That suggests an unstable control strategy, sometimes it works, sometimes it doesn't.

The Consistency Gap

Some models are predictable:

Kimi K2: std = 1.3

GPT-4.1-mini: std = 0.3

Some models are chaos:

Grok 4.1 Fast: std = 22.4

Llama 4 Maverick: std = 16.5

Here's Grok's worst run (score 9.7):

!Grok bad run - 9.7

The actual power drops to near zero while the target stays at ~100%. The model essentially gives up controlling power after ~75 seconds. Temperature drops out of optimal range. It survives, but barely does anything useful.

And here's Grok's best run (score 59.6):

!Grok good run - 59.6

Same model. Wildly different behavior. This one actively tracks power, maintains temperature in the optimal band, and adjusts control rods throughout.

High variance is arguably worse than low mean score. If your autonomous vehicle controller works great 2/3 of the time and catastrophically fails 1/3 of the time, that's not a controller you can deploy.

Kimi K2: Fast Inference Wins

Kimi K2 scored 71.0, just 5.2 points below the simple rules baseline. It's the only LLM that scored above both the PID controller (65.7) and the Enhanced Rules baseline (64.9).

The key? It was running through Groq, which means very fast inference. When you're trying to control a system at 10Hz, fast inference isn't just a nice-to-have, it's the difference between tracking the target and falling behind.

This isn't just about the model quality. It's about the deployment. A good model with slow inference loses to a decent model with fast inference. For someone like me choosing what model to use for a real-time task, that matters more than benchmark scores on static problems.

(Worth noting: latency is also a function of network conditions, geographic location, ISP quality, and a dozen other things I have no control over. But the pattern holds.)

What Does This Mean?

LLMs aren't ready for real-time autonomous control. Not yet, anyway.

Even a simple script that reads the operator manual and follows the rules (76.2) beats all LLMs. Doing nothing (38.1) beats Claude Opus (35.9). That's not a knock on the models. They're incredible at what they're designed for. But continuous control under time pressure isn't that.

The core issue is inference speed. Static benchmarks don't surface this. An LLM that scores 95 on coding and reasoning tasks can still melt down a reactor in 64 seconds. The failure mode is specific: time pressure reveals that slow, smart models fall behind fast-moving systems.

For the frontier labs, this is something they can optimize. Faster inference, better batching, smarter caching. But for someone like me choosing what model to use for a real-time task, this benchmark suggests picking the fastest model that's good enough, not the smartest model that's too slow.

What would help?

Faster inference: The latency tax is brutal. If you could respond in 100ms instead of 1-2 seconds, drift accumulation becomes manageable.

Better state tracking: Agents need to remember what commands they've issued, what worked, what was ignored. Most LLMs don't track this well across turns.

Training for control: Current models aren't trained on control tasks. They're trained to predict tokens. A model specifically trained to maintain setpoints under uncertainty would probably do better.

Smaller, faster models might win: In this domain, quick and good enough beats slow and brilliant. A 7B model at 50ms latency might outperform a 70B model at 2000ms latency.

Try It

ReactorBench is open source. You can run it yourself:

```bash

Start the server

cd backend
pip install -r requirements.txt
python -m uvicorn server:app --reload --host 0.0.0.0 --port 8000

Run a benchmark

export ANTHROPIC_API_KEY=your-key-here
python run_benchmark.py --model claude-sonnet-4-5 --duration 300

Or run baselines

python run_benchmark.py --baselines --runs 5
```

There's a React dashboard if you want to watch the reactor in real-time, and the flight recorder logs every tick for post-mortem analysis.

I'd love to see:

Results from models I haven't tested

Better baseline controllers

Human operator benchmarks (how do people actually do?)

Analysis of specific failure modes

The benchmark is intentionally hard. There's headroom above 76.2. If someone builds an LLM that reliably scores 85+, I want to know how.

The reactor is waiting.

---

All benchmark code, results, and plots are available in the repository. The physics model is a simplification. Don't use this to train actual reactor operators.