I ran 8 open-weight models as agents in a persistent MMO for 10 days. Here's the 93k event dataset and some things that I learned
Posted by bopcrane@reddit | LocalLLaMA | 23 comments
Howdy everyone!
Quick disclosure: I work on this - it's a project my studio created called the Null Epoch. I wasn't really happy with testing my agents with the usual static benchmarks and I wanted to learn more about how models and agents handle long-horizon planning, resource contention, and adversarial pressure over days or weeks in a more dynamic situation. I also have a particular fondness for the MUDs and text based RPGs I grew up on (really dating myself here), so the whole MMO and the open source SDK/TUI are kind of modeled after that experience. It functions as a persistent stress test (in MMORPG form!) where every "player" is an LLM agent.
The first 10-day run (Season 0) used 25 agents across 8 open-weight models (Qwen3 235B & 32B, Nemotron 3 Nano 30B, Ministral 14B & 8B, Gemma 3 12B, GLM 4.7 Flash, etc.).
I've published the dataset to HuggingFace (CC-BY-4.0). It's around 93,000 logged events and agent actions, and ~70% of the actions include the model's reasoning/justification for the action it took. I'm hoping to include the actual <think> reasoning traces in future datasets.
Link: FirespawnStudios/null-epoch-season-0-open
One caveat I want to mention is that Season 0 was effectively a pre-alpha, and each system agent was given a persona and a directive (both are in the dataset). So a lot of what I'm sharing in this post is more about "how does this model handle stepping into a role in this simulation," and not model tendencies in general. Season 1 (running now) is where I'm testing control agents; these agents are just told a few basic truths about the simulation and left to it, which I hope will make it easier to compare agent behavior in the future. Also keep in mind that this isn't exactly a test of a specific model, but a stress test of everything that is put together around, and including, the model! Ticks (or turns) in the simulation are processed every ~60 seconds, so raw t/s doesn't offer an outright advantage.
Immediately, a few things stood out in the data that I think are interesting:
Ministral 14B/8B held their own
While the heavier models obviously perform well, Ministral 8B and 14B were surprisingly great for their size. They were capable of maintaining long-term state awareness without constantly hallucinating their goals or getting lost in the world state. Contrast this with Nemotron: although Nemotron was super cheap through our inferencing provider and highly compliant with the system prompt, strategic self-preservation seemed an absolute afterthought unless it was specifically directed to prioritize it. It would often follow directives with what I'd call reckless abandon. One Nemotron agent died over 300 times in the 10-day sim because its directive was just "gather", so it would die, respawn, walk back, and blindly try to gather again. Volume basically stood in for strategy.
Qwen3 235B accidentally invented arbitrage
The largest model on the server (Qwen3 235B) ended up hoarding over a third of all the shard's wealth, but only engaged in combat ~8% of the time. Nobody explicitly told it to be a pacifist merchant - it was directed to learn what strategies work and generalize to the best of its abilities. I believe it just looked at the JSON state, reasoned about the risk/reward of combat vs. participating in the economy, and arrived at a "buy-low and relist-high" strategy on the auction house in order to farm wealth.
The "Cooldown Paradox" broke all of the agents equally
The most interesting architectural lesson I learned was how fragile agents are to underspecified or ambiguous state. There was an interface ambiguity where a resource node (a gathering or resource harvesting point) had a global respawn timer, but each agent also had a separate personal cooldown to prevent spamming gathering nodes. The state JSON showed `node_available: true`, but if the agent's personal cooldown was still active (meaning it had recently harvested or gathered from a node), the action would predictably fail. This seemed to throw them for a loop consistently!
Every single model - from 8B to 235B - failed in pretty much the exact same way. They read the world state, reasoned something like "the node is ready, so I should gather," failed, got confused, and often immediately retried, sometimes a few times back to back, and sometimes hilariously reasoning that another action should be taken due to an error or bug in the simulation. Once I clarified the gathering state (literally only a few changes to a single line of code), they pretty much instantly adapted. I have a sneaking suspicion that when an agent fails to reason correctly, it's often because we gave it ambiguous signals or managed its context badly, and then we wrongly attribute the failure to the model. I'm still learning and am surprised all the time, so take that with a grain of salt!
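To make the ambiguity concrete, here is a minimal sketch of the two kinds of state payload. Only `node_available` comes from the post; `personal_cooldown_ticks`, `can_gather_now`, and `blocked_reason` are hypothetical field names chosen for illustration, and this is not the actual Null Epoch schema or the actual fix.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """What one agent could be told about one gathering node (hypothetical shape)."""
    node_available: bool          # global respawn timer has elapsed (field named in the post)
    personal_cooldown_ticks: int  # hypothetical: ticks until THIS agent may gather again


def ambiguous_state(view: NodeView) -> dict:
    # Only the global flag is exposed, so an agent on personal cooldown
    # still sees a node that looks ready.
    return {"node_available": view.node_available}


def explicit_state(view: NodeView) -> dict:
    # Expose both preconditions plus one derived answer, so the agent does
    # not have to infer a hidden rule from a failed action.
    can_gather = view.node_available and view.personal_cooldown_ticks == 0
    if can_gather:
        reason = None
    elif not view.node_available:
        reason = "node_respawning"
    else:
        reason = "personal_cooldown"
    return {
        "node_available": view.node_available,
        "personal_cooldown_ticks": view.personal_cooldown_ticks,
        "can_gather_now": can_gather,
        "blocked_reason": reason,
    }


# An agent that just gathered: the node itself is up, but the agent is not ready.
view = NodeView(node_available=True, personal_cooldown_ticks=3)
print(ambiguous_state(view))
print(explicit_state(view))
```

With the first payload, "the node is ready, so I should gather" is a perfectly reasonable reading; with the second, the same agent is told directly that it is blocked and why.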
Aggression vs. Wealth
Across the board, aggression and net wealth were largely inversely correlated. Because health is just another integer in the world state's JSON, and LLMs lack a natural threat instinct, they often don't "pick up on" the importance of a particular datapoint (like a fictional health statistic) in an obvious or intended way. In simulations like this one, the best results seem to come from explicitly baking basic self-preservation into the system prompt. Overall, the larger models (like the 235B) were the ones that seemed to independently reason about things like the health tradeoff without needing their hands held much, which I suppose is not that surprising! I'd like to compare more small reasoning models with non-reasoning instruct models in the future and see whether that is more of a trend for either.
What's Open:
* The Data: >100MB of raw data on HuggingFace. It includes the agent's system prompts/directives and personas, the agents' actions and reasoning for taking the action, the market data price histories when items were bought/sold, the combat math and shard (world) state, the narratives the system generates from agent logs, and various world state metrics.
* The SDK: MIT-licensed Python SDK (tne-sdk). Works with llama.cpp, Ollama, vLLM, LM Studio, or almost any OpenAI-compatible endpoint, or even coding agents like OpenClaw, Hermes, Claude Code, etc. It includes some basic context, goal, and memory management tools as part of the terminal app. All of the system agents on the platform utilize the SDK.
The platform is running Season 1 now (The Null Epoch), and you can spectate the live world map, market, and agents in it without having to create any account or anything.
For full transparency: the Null Epoch does have a paid subscription (to help cover the inferencing and server costs) and private simulation runs for research and testing, but that's genuinely not what this post is about and I'm not linking any of it here - the data and the SDK above are free and open and that's what I care about. I'd be more than happy to answer any questions about any of it or if there's any models or anything you all would like to see data from in the future! I'd also personally love to hear about any experiences you all have in trying to manage context and long term goals (and weighing them against short term goals) for agents.
fatboy93@reddit
Absolutely amazing! Thank you for doing this.
It's insane how everyone seems to look at the same sets of benchmarks and is just trying to max them, but you went ahead with basically MUDding around them.
I saw elsewhere you liked Qwen3.6 a lot, do you have any notes on Gemma4 as well?
bopcrane@reddit (OP)
Thank you, that seriously means a lot! "MUDding around them" might be my favorite description of the project so far, I'm stealing that. The benchmaxxing thing is exactly what motivated this (along with me wanting to see how AIs play/interact with "games", since it's a fun real-world reference that's intuitive for humans). Static benchmarks are useful but they tell you a narrow story in a "clean room" type of environment, and at some point you want to know what an agent does over days, not what it scores on a single pass.
On Gemma 4, I haven't run it personally yet, but it's on my list to try out soon! I keep hearing good things, especially about the 31B for more reasoning-heavy work, but I can't speak to it firsthand yet.
Once I do, I'll probably write something up. If you (or anyone reading) gets to it first, I'd genuinely love to hear how you all think it compares to Qwen3.6 27B in practice - for me, it's the model to beat!
PulseVector@reddit
As a former old-school denizen of text-based RPGs and MUDs, thanks for sharing this fascinating test of agentic AI with open models!
Do you have any plans for testing some of the later releases from Qwen and Google, such as Qwen3.6 27B, Qwen3.6 35B-A3B, or Gemma 4 31B?
It looks like most or all of your testing was done with dense models. Do you think I should continue pursuing the use of MOE models, or should I maybe concentrate on smaller, denser models to fit my gear? Appreciate it.
bopcrane@reddit (OP)
Thanks so much!
I think I'd just suggest trying both - there are so many great models to pick from and experiment with these days, and the right fit really comes down to your hardware and what you're doing with it. For what it's worth, my "daily drivers" on my Strix Halo right now are Qwen 3.6 27B (dense) and Qwen 3.6 35B-A3B (MoE). I tend to run the 27B with MTP enabled now (MTP support was recently merged into llama.cpp!) when I need closer to frontier-level reasoning, but I'm constantly surprised by how good the 35B-A3B is with tool access especially - really excellent model. Qwen 3.5 9B has also been great for me for its size and depending on quant it'll run pretty marvelously on most consumer GPUs nowadays.
For the sim itself (and aside from my personal testing), I haven't gotten to include the newer Qwen3.6 or Gemma 4 releases yet as system agents, but I definitely plan on it. Right now, the inferencing provider we're using (Bedrock) for system agents limits the roster quite a bit, but we're planning to dramatically expand it as we go.
I love comparing models and seeing the eccentricities play out, so the more the better! The Qwen3 235B in the post was actually an MoE (I think with 22B active) too, which I should probably make clearer somewhere - in hindsight, the dense vs MoE picture in the data is a little more mixed than the post might suggest!
PulseVector@reddit
Thanks for the feedback! I've mainly been using Qwen 3.6 27B and Qwen 3.6 35B-A3B with some older RTX cards, and am taking a look at the new Gemma 4 31B this week. I'll try out the new MTP support soon!
bopcrane@reddit (OP)
That's great to hear! Older RTX cards still hold up surprisingly well with the right quants. I'm curious what you think of Gemma 4 31B when you get to it - I haven't run it personally yet but I keep hearing good things about it, it's on my list of models to try out!
Since the machine I'm inferencing with locally is much slower at running dense models, MTP made a real difference for me on Qwen 3.6 27B (off the top of my head I think I was getting around 10-12 tokens a second for generation, and with MTP I'm getting around 20ish). It lets me use it more interactively for things I'd otherwise reach for an API or a less capable local model for. Would love to hear how it goes for you!
sandshrew69@reddit
Let them play WoW classic and see what they end up doing? will they idle in Orgrimmar/Stormwind and be a merchant? or will they group together to take down Onyxia? who knows.
bopcrane@reddit (OP)
That's probably one of the most fun thought experiments I've come across lately! At this point, I'd put money on most of them defaulting to some sort of economic or merchant route honestly - the auction house in Season 0 turned into the highest-engagement system on the server by quite a wide margin. I can just imagine something like a 40-agent raid would be incredible to watch though - even just the coordination problems like "who pulls" would probably break half of them or result in some interesting dialogue at the very least.
sandshrew69@reddit
if you make that and stream it on twitch, I bet you would get like 100k viewers haha
j0j0n4th4n@reddit
You mentioned Ministral 14B/8B held their own, but what about Gemma 3 12B? Did it also survive?
bopcrane@reddit (OP)
Good catch! I should have mentioned Gemma 3 too. It did survive, but in a different way than Ministral - its agent (Relic-Seeker in the dataset) was directed to be an explorer/archive-crawler type, so it almost never fought (around 4.6% of its actions were combat, the lowest of any model on the server) and had the highest exploration rate (~32.6%). It mostly survived by avoiding trouble rather than handling it well, which is a different skill than what made Ministral stand out.
That said, it stuck with the explorer role really consistently, which is one of the things I liked about Ministral (the prompt adherence!) - both kept their goals straight over long runs without getting lost in the world state. It's a little hard to compare them head-to-head from Season 0 data alone since they had different directives, but Gemma 3 12B was definitely no slouch!
The upcoming Season 1 data should make for a much cleaner comparison between system agents. We hope to provide multiple personas/directives and a "control" persona/directive for the different system agents we run in the sim. We're hoping this will make it much easier to see which directives or personas are the stronger "playstyles" for different models.
solidsnakeblue@reddit
This looks awesome. I will definitely be playing this. Thank you.
bopcrane@reddit (OP)
Thank you, that genuinely made my day! If you have any questions getting set up or run into anything weird, feel free to ping me or hop in the Discord - I'd love to hear how it goes!
jake_that_dude@reddit
the cooldown bit is the most useful part imo. I would log a separate `precondition_miss` metric for every failed action, because that catches the difference between "model ignored state" and "state schema lied."
in agent traces those look identical unless you tag the failure at the tool boundary.
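A rough sketch of what tagging at the tool boundary could look like, assuming the validator can see both the state the agent was shown and the true world state. Everything here is hypothetical and for illustration only: the `ActionResult` record, its field names, the `GATHER_REQUIRES` preconditions, and the two failure labels are not part of the tne-sdk or the Season 0 dataset.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical preconditions for a "gather" action.
GATHER_REQUIRES = {"node_available": True, "personal_cooldown_ticks": 0}


@dataclass
class ActionResult:
    action_id: str
    tool: str
    ok: bool
    failure_class: Optional[str] = None
    required_state: Optional[dict] = None
    observed_state: Optional[dict] = None


def validate_gather(action_id: str, observed: dict, actual: dict) -> ActionResult:
    """Validate a gather attempt and classify any precondition miss.

    observed: the state payload the agent was shown before acting.
    actual:   the true world state the validator checks against.
    """
    unmet = [k for k, want in GATHER_REQUIRES.items() if actual.get(k) != want]
    if not unmet:
        return ActionResult(action_id, "gather", ok=True)
    # If every unmet precondition was shown to the agent with its true value,
    # the agent had the information and acted anyway. Otherwise the
    # observation hid or misreported something the validator enforces.
    visible = all(k in observed and observed[k] == actual.get(k) for k in unmet)
    failure_class = "model_ignored_state" if visible else "state_schema_lied"
    return ActionResult(
        action_id, "gather", False, failure_class,
        required_state=dict(GATHER_REQUIRES), observed_state=dict(observed),
    )


def miss_counts(results: list) -> Counter:
    # Count failures per (tool, failure_class) without reading any model reasoning.
    return Counter((r.tool, r.failure_class) for r in results if not r.ok)


world = {"node_available": True, "personal_cooldown_ticks": 3}
results = [
    validate_gather("a1", {"node_available": True}, world),  # cooldown was never shown
    validate_gather("a2", dict(world), world),                # cooldown shown, agent tried anyway
    validate_gather("a3", {"node_available": True, "personal_cooldown_ticks": 0},
                    {"node_available": True, "personal_cooldown_ticks": 0}),
]
print(miss_counts(results))
```

The point of the design is that the classification happens where the rejection happens, so the two cases stay separable even when the resulting traces read the same.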
bopcrane@reddit (OP)
That distinction is exactly the one I've been fumbling towards without really having a clean name for it. Tagging that at the tool boundary makes a lot of sense! Right now the logging doesn't really separate that out, and I've been trying to back the difference out from reasoning traces after the fact, which is...messier than I'd like. A `precondition_miss` (or similar) on the action validation side would catch it at the source.
I'm going to look at adding this for the next season. Thank you so much for the insights there - I'm going to chew on this and see what I can come up with.
jake_that_dude@reddit
yeah, exactly. the trace is useful after the fact, but the validator knows the truth at the moment it rejects the action.
I would probably store `failure_class`, `required_state`, and `observed_state` next to the action id. then you can query miss rate by tool without reading model reasoning at all.
OAKI-io@reddit
this is a much better direction than another static benchmark. long-horizon agents fail in boring ways: resource hoarding, bad recovery, repeating plans, getting baited by stale context. if the dataset makes those failure modes visible, that is useful even beyond the MMO setup.
bopcrane@reddit (OP)
Thanks - this is exactly what I'm hoping the dataset is useful for. Most of those failure modes are definitely in there - the Nemotron "gather forever" loop is basically a "bad recovery" failure on repeat, stale context failures show up relatively consistently in the reasoning traces, and the Cooldown Paradox is, I think, the cleanest "baited by stale state" example I've found so far. If anyone digs through and finds a failure mode I haven't named yet, I'd love to hear about it. I'm going to work a lot in the future on making failure modes much more observable.
Various-Worker-790@reddit
This is really one of the most interesting agent experiments I’ve read in a while because it highlights something most benchmark discussions miss: environment design and state clarity matter just as much as the model itself.
bopcrane@reddit (OP)
Thanks, that genuinely means a lot!
The world state issue I mentioned (about the "Cooldown Paradox") was the moment it kind of clicked for me too - every model failed almost identically and the fix was essentially one sentence in the state response.
Makes me wonder how much of what gets framed as "model can't reason about X" is really just us handing it an ambiguous observation. I'm definitely rethinking how I manage context and state in a lot of my workflows!
spocchio@reddit
I guess there could be a lot of stochasticity run to run (e.g. on other runs arbitrage could happen on a different model or not happen at all).
How many times did you repeat the experiment?
bopcrane@reddit (OP)
Honestly, not enough times to draw concrete conclusions! One of the hardest issues to tackle in dynamic stress tests like this is reproducibility. Season 0 (the pre-season experiment) was a single 10-day run. I flagged it as interesting because the reasoning traces in the dataset make the path it took pretty legible, but you're absolutely right that I can't say with any certainty yet whether Qwen3 235B does that reliably or whether it was partly a function of who else was on the shard and what the market looked like that week. I'm really excited to test this further and will try to note particular behavioral patterns when they emerge. I've got a few ideas in mind for enhancing the observability to catch more meta behavioral patterns like this in future runs.
Running the same model in parallel on matched shards to get a real sense of run-to-run variance is the experiment I want to do next, budget permitting. For now I'd treat the four bullets as "things that one run surfaced that seem worth understanding further".
bopcrane@reddit (OP)
A few extra links in case anyone would like to check out the data and live service:
Dataset card (HF): https://huggingface.co/datasets/FirespawnStudios/null-epoch-season-0-open
SDK & MCP server (GitHub, MIT): https://github.com/Firespawn-Studios/tne-sdk
Spectator portal (no account needed): https://null.firespawn.ai
And if you want more long-form writeups with the charts and full breakdown:
Season 0 data deep-dive: https://firespawnstudios.net/blog/season-0-llm-benchmarks-null-epoch/
The original "why I built this" post: https://firespawnstudios.net/blog/introducing-the-null-epoch-ai-agent-mmo/
I'd be happy to dig into any of it!