⚔ Age of LLMs · Part 1 of 5

Forget benchmarks.
I put LLMs in a real-time strategy game.

What happens when two AI models compete for territory, resources, and survival — with a clock ticking and no second chances?

As a kid I spent hours playing Age of Empires, building civilizations, managing armies, racing to control territory before the enemy reached my borders. To be successful you had to think about resource allocation, timing, and reading your opponent. The combination of strategy and live combat is what made it one of the most popular games of the 2000s.

Twenty years later I'm deeply fascinated by AI. And I keep seeing the same benchmarks recycled on X. MMLU scores. HumanEval scores. Math reasoning scores. All useful. None of them tell you how a model actually makes decisions under pressure, with limited time, against a real opponent.

So I built "Age of LLMs" to find out.

Unlike benchmarks, which test models in isolation, this is a zero-sum game. Every piece of land one player gains, another player loses. Every turn you spend building is a turn you're not attacking.

How It Works

The system is a full-stack TypeScript application with three layers. The backend runs a Node.js game engine that manages a tick-based loop — every few seconds it creates a snapshot of the full game and sends each LLM a prompt describing the current state. The frontend is a React app that renders the game on an HTML5 Canvas in the style of the original Age of Empires. The AI layer sits between the engine and the model, supporting both local inference and any LLM provider via OpenRouter.
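To make the tick loop concrete, here's a minimal sketch of what the per-tick prompt construction might look like. The `GameSnapshot` and `PlayerState` shapes, field names, and prompt wording are my assumptions for illustration; the post doesn't show the real engine's types.

```typescript
// Hypothetical shapes -- the actual engine's types are not shown in the post.
interface PlayerState {
  name: string;
  tiles: number;
  armySize: number;
  resources: { food: number; wood: number; gold: number };
}

interface GameSnapshot {
  tick: number;
  maxTicks: number;
  players: PlayerState[];
}

// Serialize the snapshot into the prompt each model sees on its turn.
function buildStatePrompt(snapshot: GameSnapshot, self: string): string {
  const lines = [`Tick ${snapshot.tick}/${snapshot.maxTicks}. You are ${self}.`];
  for (const p of snapshot.players) {
    lines.push(
      `${p.name}: ${p.tiles} tiles, army ${p.armySize}, ` +
        `food ${p.resources.food}, wood ${p.resources.wood}, gold ${p.resources.gold}`,
    );
  }
  lines.push("Respond with a JSON array of actions.");
  return lines.join("\n");
}
```

A function like this would run once per agent per tick, so both models see the same snapshot but from their own point of view.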

Once the models are set up, each agent gets a system prompt with its personality and, every tick, returns a JSON array of actions. The rules force one core trade-off:

Train villagers and your economy grows but your army stalls. Build an army and you fall behind on resources. There's no right answer, only the choice you made and its consequences 100 turns later.
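Because each agent replies with a JSON array of actions, the engine has to validate that reply defensively; models can return malformed JSON or unrecognized action types. Here's a hedged sketch of that validation step. The `Action` variants are illustrative stand-ins, not the game's actual rule set.

```typescript
// Hypothetical action set -- the post doesn't enumerate the real actions.
type Action =
  | { type: "train"; unit: "villager" | "soldier"; count: number }
  | { type: "build"; building: "barracks" | "farm" }
  | { type: "attack"; target: string };

// Parse the model's raw reply, dropping anything that isn't a recognizable action.
function parseActions(raw: string): Action[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // malformed JSON effectively costs the model its turn
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(
    (a): a is Action =>
      typeof a === "object" &&
      a !== null &&
      ["train", "build", "attack"].includes((a as { type?: string }).type ?? ""),
  );
}
```

Silently dropping bad actions (rather than crashing or retrying) keeps the tick loop moving, which matters when the clock is part of the test.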

Mid-Game — Tick 50 of 100

Tick 50/100 · Claude-A (blue): 39 tiles, army 121 · Claude-B (red): 36 tiles, army 147 · Both in Feudal Age · Running on DGX Spark local inference

I'm running all of this locally on a DGX Spark. API costs compound fast when you're running hundreds of turns across multiple models; locally, I can run as many games as I want and iterate quickly without costs stacking up.
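Switching between local inference and OpenRouter is easiest when both sit behind the same interface. Here's one way that AI layer could be sketched, assuming both the local server and OpenRouter expose an OpenAI-compatible `/chat/completions` endpoint; that assumption, and the class and field names, are mine, not the post's.

```typescript
// Minimal provider abstraction for the AI layer (a sketch, not the actual code).
interface LLMProvider {
  complete(systemPrompt: string, userPrompt: string): Promise<string>;
}

class HttpChatProvider implements LLMProvider {
  constructor(
    private baseUrl: string, // local server URL, or https://openrouter.ai/api/v1
    private model: string,
    private apiKey?: string, // omitted for local inference
  ) {}

  async complete(systemPrompt: string, userPrompt: string): Promise<string> {
    const res = await fetch(`${this.baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        ...(this.apiKey ? { Authorization: `Bearer ${this.apiKey}` } : {}),
      },
      body: JSON.stringify({
        model: this.model,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: userPrompt },
        ],
      }),
    });
    const data = (await res.json()) as {
      choices: { message: { content: string } }[];
    };
    return data.choices[0].message.content;
  }
}
```

With this shape, pointing the same agent at a local model or a hosted one is a one-line change to `baseUrl` and `model`.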

What This Creates

The result is a dynamic, zero-sum environment where model A's choices directly affect model B's options in real time. It's not a static test; it's a living game where decisions compound over 100 turns.

What I found across multiple games gave me a completely new perspective on how these models actually think — one you simply don't get from standard benchmarks. I'm already thinking about how to combine this with Karpathy's autoresearch repo to train a small model optimized on the game data I've collected so far.

Stay tuned for the results.

Next in the series
Part 2: The first match — and the result that surprised me