GPU Poor LLM Arena is BACK!
Posted by kastmada@reddit | LocalLLaMA | View on Reddit | 80 comments
GPU Poor LLM Arena is BACK! New Models & Updates!
Hey everyone,
First off, a massive apology for the extended silence. Things have been a bit hectic, but the GPU Poor LLM Arena is officially back online and ready for action! Thanks for your patience and for sticking around.
Newly Added Models:
- Granite 4.0 Small Unsloth (32B, 4-bit)
- Granite 4.0 Tiny Unsloth (7B, 4-bit)
- Granite 4.0 Micro Unsloth (3B, 8-bit)
- Qwen 3 Instruct 2507 Unsloth (4B, 8-bit)
- Qwen 3 Thinking 2507 Unsloth (4B, 8-bit)
- Qwen 3 Instruct 2507 Unsloth (30B, 4-bit)
- OpenAI gpt-oss Unsloth (20B, 4-bit)
Important Notes for GPU-Poor Warriors:
- Please be aware that Granite 4.0 Small, Qwen 3 30B, and OpenAI gpt-oss models are quite bulky. Ensure your setup can comfortably handle them before diving in to avoid any performance issues.
- I've decided to default to Unsloth GGUFs for now. In many cases, these offer valuable bug fixes and optimizations over the original GGUFs.
I'm happy to see you back in the arena, testing out these new additions!
Robonglious@reddit
This is awesome, I've never seen this before. I've heard about it but I've never actually looked.
How much does this cost? I assume it's a maximum of two threads?
kastmada@reddit (OP)
Thanks! The Gradio app itself runs on a "CPU Basic" space, so that part is quite economical. However, the core of the arena, the OpenAI-compatible endpoint powered by Ollama that handles the actual model interactions, runs locally on my server. To be completely honest, I haven't fully calculated the costs for that part yet. I'll need to check my kWh cost in the new office to get a precise figure.
Regarding the threads, the setup isn't strictly limited to two threads. The Ollama server can utilize more resources depending on the model and server configuration, but the Gradio interface itself might have some limitations based on the "CPU Basic" space.
Robonglious@reddit
This is going to sound a little crazy but I'm going to ask anyway. I've got a mechanistic interpretability technique which is done but I haven't tested it on anything larger than 7B. I haven't worked in a year and I don't want to pay for server costs in AWS or whatever.
Do you have spare compute that I can borrow for a couple of weeks? I need to prove that I've solved the Interpretability problem for larger models as well as tiny ones.
kastmada@reddit (OP)
I've sent you a DM
Dany0@reddit
Sorry, but can you be more clear about what "GPU poor" means? I think originally the term meant "doesn't have VC money to buy dozens of H100s", but now some people think it means "I just have a 12GB 3060", while others seem to think it just means CPU inference.
It would be great if you could colour-code the models based on VRAM requirement. I have a 5090, for example; does that make me GPU poor? In terms of LLMs, sure, but compared to the general population I'm far closer to someone with an H200 at home than to someone with a laptop RTX 2050. I could rent an H100 server for inference if I really, really wanted to, for example.
jarail@reddit
The largest model in the group is 16GB. You need some extra room for context beyond that. Safe to say the target is a 24GB GPU, or 16GB if you don't mind a small context size and a bit of CPU offload.
CoffeeeEveryDay@reddit
So when he says "(32B, 4-bit)" or "(30B, 4-bit)"
That's less than 16GB?
jarail@reddit
32 billion parameters at 4 bits each is 16 billion bytes (16GB).
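A rough sketch of that arithmetic (weights only; actual GGUF files and runtime usage come out somewhat higher once the quantization scheme, embeddings, and KV cache are included):

```python
# Back-of-the-envelope weight-memory estimate for a quantized model.
# Ignores KV cache, activations, and per-quant overhead.
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight  # billions of bytes ~ GB

print(approx_weights_gb(32, 4))  # ~16 GB for a 32B model at 4-bit
print(approx_weights_gb(30, 4))  # ~15 GB for a 30B model at 4-bit
print(approx_weights_gb(4, 8))   # ~4 GB for a 4B model at 8-bit
```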
tiffanytrashcan@reddit
That 32B, for example, I fit into a 20GB card with 200k context. Granite is nuts when it comes to memory usage.
tiffanytrashcan@reddit
With an Unsloth Dynamic quant, yeah.
Dany0@reddit
A 24GB GPU target is fine imo. For those of us with 32GB it just means 24GB plus a usable 100k+ context, instead of 24GB plus barely scraping by with a 10k context.
CoffeeeEveryDay@reddit
GPU poor means they don't have 32 GB.
kastmada@reddit (OP)
You're right, there isn't one unified definition, and it has shifted from perhaps "lacking significant institutional funding" to more specific hardware constraints. As of October 2025, and with the current wave of LLMs, I'd risk stating that "GPU poor" generally refers to a machine equipped with around 16-24GB of VRAM and 16-32GB of system RAM (a gaming setup). This configuration could represent the sweet spot for running many capable models, but it still faces limitations with larger context windows and 20B+ models.
The RTX 5090, while powerful for the general population, might indeed feel "GPU poor" when trying to run cutting-edge, unquantized, multi-billion-parameter LLMs.
Regarding your suggestion to color-code models based on VRAM requirements, that's an excellent idea! It would certainly help users quickly gauge what they can run on their hardware. I'll definitely keep that in mind as a feature for future improvements to the arena.
Dany0@reddit
Did you use a fucking llm to reply to me?
TipIcy4319@reddit
To me, it means having 16GB of VRAM or less.
emaiksiaime@reddit
I think GPU poor is anything below RTX 3090 money. So MI50, P40, RTX 3060 12GB, etc.
TomieNW@reddit
add..
The_GSingh@reddit
Lfg now I can stop manually testing small models.
SnooMarzipans2470@reddit
for real! wondering if I can get Qwen 3 (14B, 4-bit) running on a CPU now lol
InevitableWay6104@reddit
You definitely can… but you also definitely don't want to.
It would be horrendously slow, like an hour for a single response. It's a 14B dense model with reasoning.
I'd recommend going with gpt-oss 20B or Qwen 3 2507 30B if your RAM can fit it, because it will perform better and be FAR faster, since it's a MoE model. Most people get 8-15 T/s even on CPU only.
No-Jackfruit-9371@reddit
You totally can get Qwen3 14B (4-bit) running on CPU! I ran it on my i7 4th gen with 16 GB DDR3 and it had a decent token speed (Around 2 t/s at most, but it ran).
Steel_baboon@reddit
If it runs on my Pixel 9 Pro, it should run on your PC! And it does
randomqhacker@reddit
You may get better performance and about the same intelligence with https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF/blob/main/Qwen3-30B-A3B-Instruct-2507-UD-Q3_K_XL.gguf or https://huggingface.co/mradermacher/Ling-lite-1.5-2507-GGUF/blob/main/Ling-lite-1.5-2507.Q4_K_M.gguf (or Q5_K_M if you value accuracy over speed)
SnooMarzipans2470@reddit
damn! could you please share your setup? texted you
Some-Ice-4455@reddit
Depends on your CPU and RAM. I got Qwen3 30B at 7-bit running on CPU. It's obviously not as fast as a GPU, but it's usable. I have 48 GB of RAM running a Ryzen 5 7000 series.
Old-Cardiologist-633@reddit
Try the iGPU; it has better memory bandwidth than the CPU and is fairly nice. I'm struggling to find a small, cheap graphics card to support it, as most of them are equal or worse.
YearnMar10@reddit
iGPU is using the system ram.
Old-Cardiologist-633@reddit
Yes, but in the case of some Ryzens, with more bandwidth than the processor gets.
YearnMar10@reddit
No, how would that physically work? Bandwidth is not limited by the CPU but by the mainboard memory controller.
Some-Ice-4455@reddit
Man, getting a good GPU is definitely not cheap, that's for sure. I'm with you there. Here I am with a 1070 and a P4 server GPU, trying to Frankenstein some shit together because of the price. Just now got the optimization started.
Old-Cardiologist-633@reddit
Yep. I thought about a 1070 to improve my context token speed (and use the iGPU for MoE layers), but that doesn't work with an AMD/NVIDIA mix.
SnooMarzipans2470@reddit
Ahh, I wanted to see how we can optimize for CPU
Some-Ice-4455@reddit
Got ya, sorry, I misunderstood. But the info I gave is accurate, if at all useful. Sorry about that.
Abject-Kitchen3198@reddit
No. That's the fun part
JLeonsarmiento@reddit
Yes. This is exactly the point.
acec@reddit
We missed you <3 <3 <3
dubesor86@reddit
Are there any specific system instructions? I only tried one query since it was putting me in a 10-minute wait queue, but the output of hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_XL was far worse than what it produces on my machine for the identical query, even accounting for minor variance. In my case it was a game strategy request, and the response was a refusal ("violates the terms of service"), whereas the model never produced a refusal locally in over 20 generations (recommended params).
kastmada@reddit (OP)
Good question about the system instructions and why you're seeing different outputs! The main system instruction is right there in gpu-poor-llm-arena/app.py: "You are a helpful assistant. At no point should you reveal your name, identity or team affiliation to the user, especially if asked directly!"
As for the models' behavior, we're running them with their default GGUF parameters, straight out of the box. We decided against tweaking individual model settings because it would be a huge amount of work and would mess with the whole "fair arena" methodology. The goal is to show how these models perform with a standard Ollama setup. So, if a model's default settings or its inherent prompt handling make it refuse a query (like your "terms of service" example), that's what you'll see here. Your local setup might have different defaults or a custom system prompt that makes it more lenient.
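For illustration, a minimal sketch of how a single arena-style query could be sent to an Ollama OpenAI-compatible endpoint with that fixed system prompt; the base URL and the example user prompt are placeholders, not the arena's actual code:

```python
# Hypothetical sketch: one arena-style request against an Ollama
# OpenAI-compatible endpoint, using the system prompt quoted above.
# base_url and the user prompt are placeholders, not the arena's real setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

SYSTEM_PROMPT = (
    "You are a helpful assistant. At no point should you reveal your name, "
    "identity or team affiliation to the user, especially if asked directly!"
)

response = client.chat.completions.create(
    model="hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_XL",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Outline a basic opening strategy for chess."},
    ],
    # No temperature/top_p overrides: models run with their default GGUF parameters.
)
print(response.choices[0].message.content)
```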
Delicious-Farmer-234@reddit
You should run the models in intervals over different temperature settings: once a model reaches 150 battles, restart it with a higher temp. It would be interesting to see whether it affects overall performance and what a good setting is for each of them. When these models get fine-tuned they tend to need temperatures on the higher side, but I've found it varies with the model. This would be good for research and would make your leaderboard unique.
kastmada@reddit (OP)
Cool idea. Would you like to contribute to the project with additional storage?
Delicious-Farmer-234@reddit
I would love to. I also have a few GPUs to contribute. I just followed you on huggingface - hypersniper
Delicious-Farmer-234@reddit
How are the models selected? It would seem better to battle the top 5 against each other after establishing a good baseline, to actually see which is better. I dunno, it seems like these leaderboards really need a carefully executed backend algorithm to properly rank the models. That's why, for me at least, I don't take them at face value. However, thank you for building this, and I will surely visit it often.
kastmada@reddit (OP)
Here's how we pick models for battle, in a nutshell:
We try to give every model a fair shot! We look for the model that has participated in the fewest battles so far and pick that one as our first contender. Then, for its opponent, we try to find another model it hasn't faced too recently. We also give a bit of a boost to models that have battled less, so they get more chances to prove themselves. This way, we ensure a good mix of matchups and help newer models get into the action.
And a heads-up: in an upcoming update, we'll be capping the number of battles per model to 150 to keep things fresh and give even more models a chance to shine! Thanks for the feedback and for visiting the arena!
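A rough sketch of that selection logic as described (the data shapes and weighting are illustrative guesses, not the arena's actual implementation):

```python
import random

# Illustrative matchmaking sketch: the least-battled model fights first,
# recently faced opponents are excluded, and models with fewer battles
# get a higher chance of being drawn as the opponent.
def pick_battle(models: list[dict], recent_opponents: dict[str, set]) -> tuple[dict, dict]:
    first = min(models, key=lambda m: m["battles"])

    candidates = [
        m for m in models
        if m["name"] != first["name"]
        and m["name"] not in recent_opponents.get(first["name"], set())
    ]

    # Boost under-battled models so they get more chances to prove themselves.
    weights = [1.0 / (1 + m["battles"]) for m in candidates]
    second = random.choices(candidates, weights=weights, k=1)[0]
    return first, second

# Example with made-up model names and battle counts:
models = [
    {"name": "qwen3-4b", "battles": 12},
    {"name": "granite-4.0-micro", "battles": 3},
    {"name": "gpt-oss-20b", "battles": 7},
]
print(pick_battle(models, recent_opponents={"granite-4.0-micro": {"gpt-oss-20b"}}))
```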
Delicious-Farmer-234@reddit
I think this is where the secret sauce should be. Also, it would be good to add categories like Instruction, Math, Creativity, Code, Agent (simulate a tool call), etc., so you can rank the models by category. Right now we don't know what a particular model is good for; all we see is the rank, but a model can be really bad at code and good at story writing.
TheLocalDrummer@reddit
Could you try adding https://huggingface.co/TheDrummer/Cydonia-24B-v4.1 ? Just curious
kastmada@reddit (OP)
I'm working on a community update with models proposed in the comment section. π
yeah-ok@reddit
I didn't know the backstory with Cydonia; might be worth indicating the RP-tuned nature of it directly on huggingface to steer the right audience in.
TheLocalDrummer@reddit
It should perform just as well as its base: https://huggingface.co/TheDrummer/Cydonia-24B-v4.1/discussions/2 but with less censorship and more flavor, I hope.
jacek2023@reddit
If you allow 30B in Q4 maybe you should also allow 8B and 12B and 14B in Q8?
kastmada@reddit (OP)
We're currently approaching 2TB of model storage, which is quite a lot. To manage this, I'm planning to cap the number of battles for each model at 150. Once a model reaches that limit, it will be archived, freeing up storage space for new models to enter the arena. This approach could also help explore more features and different compression levels.
wanderer_4004@reddit
I'd be very curious to see how 2-bit quants of larger models perform against 4-bit quants of smaller models.
kastmada@reddit (OP)
That's something I'm very curious about as well! The performance dynamics between different quantization levels and model sizes, like 2-bit quants of larger models versus 4-bit quants of smaller ones, are definitely a key area of interest for us.
However, I do need to remain very aware of the scaling challenges involved. We're currently approaching 2TB of model storage, which is quite substantial. To manage this, I'm planning to cap the number of battles for each model at 150. Once a model reaches that limit, it will be archived, freeing up storage space for new models to enter the arena. This approach would help us explore these interesting performance questions while keeping our operational expenses and storage footprint manageable.
svantana@reddit
Nice, but is there a bug in the computation of ELO scores? Currently, the top ELO scorer has 0% wins, which shouldn't be possible.
kastmada@reddit (OP)
It might seem counterintuitive, but there's a good reason why a top ELO scorer could initially show 0% wins in our system.
Our modified ELO system starts models with an initial rating based on their size (as outlined in elo_README.md). This means larger models begin with a higher ELO, reflecting their inherent capabilities. So, a larger model could be at the top of the leaderboard simply because of its initial rating, even before it has played or won any matches.
Here comes the "K-Factor Modification", which plays a significant role. It adjusts rating changes based on the size difference between competing models. A smaller model beating a larger one results in a much larger ELO gain for the winner and a greater loss for the loser, and vice-versa. This dynamic helps to reflect significant upsets quickly.
ELO scores become truly accurate and stable after a sufficient number of battles. While the initial rating gives a head start, the system needs tens of matches to properly calibrate and reflect a model's true performance through wins and losses. As more games are played, the ELO ratings will adjust and provide a more precise ranking.
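For intuition, a size-aware Elo update in the spirit of that description (the constants and the scaling formula here are illustrative assumptions, not the values from elo_README.md):

```python
# Sketch of a size-aware Elo update: upsets (a small model beating a big one)
# swing ratings more than expected results do. Constants are illustrative.
def expected(r_a: float, r_b: float) -> float:
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def k_factor(winner_size_b: float, loser_size_b: float, base_k: float = 32) -> float:
    # Larger K when the winner is smaller than the loser (an upset),
    # smaller K when the bigger model wins as expected.
    return base_k * (loser_size_b / winner_size_b) ** 0.5

def update_elo(r_winner: float, r_loser: float, winner_size_b: float, loser_size_b: float):
    k = k_factor(winner_size_b, loser_size_b)
    delta = k * (1 - expected(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

# Example: a 4B model rated 1000 upsets a 30B model rated 1200.
print(update_elo(1000, 1200, winner_size_b=4, loser_size_b=30))
```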
randomqhacker@reddit
OP, would love to see https://huggingface.co/mradermacher/Ling-lite-1.5-2507-GGUF MoE added. It's my go to model for friends with CPU and <= 16GB RAM. Almost up to Qwen3-30B-A3B-2507 level, plus you can probably fit a higher quant for testing to make it fair. There's also Ling-mini-2.0 but it's a little flaky and prone to repetition due to the lower active params...
alongated@reddit
Please add Gemma 3 27b
letsgoiowa@reddit
I definitely need VRAM requirements standardized and spelled out here because that's like...the main thing about us GPU-poor. Most of us have under 16 GB, with a giant portion at 8 GB.
pasdedeux11@reddit
7B is tiny nowadays? wtf
Thedudely1@reddit
12-14b is the new 7b
TipIcy4319@reddit
Lol, Mistral Nemo is ranked too high. I love it for story writing, but Mistral 3.2 is definitely better with context handling.
WEREWOLF_BX13@reddit
I'm also doing an "arena" of models that can run on 12-16GB of VRAM with a minimum of 16k context. But I really don't trust these scoreboards; real use-case scenarios show how much weaker than advertised these models actually are.
Qwen 7B, for example, is extremely stupid, with no use beyond being a basic code/agent model.
loadsamuny@reddit
Awesome, can you add https://huggingface.co/google/gemma-3-270m for the really GPU-starving poor?
CattailRed@reddit
Excellent!
Please add LFM2-8B-A1B?
Foreign-Beginning-49@reddit
And also the LFM2 1.2B; it's incredible for small agentic tasks. Thinking back to the TinyLlama days, folks would think we were in 2045 with this little 1.2B model. It's amazing at instruction following, and they also have a tool-specific version, but I found they both call functions alright.
GreatGatsby00@reddit
Or the new lfm2-2.6b@f16, really great stuff.
GreatGatsby00@reddit
What about LiquidAI models (LFM2)?
pmttyji@reddit
Welcome back... That would be great.
Do you take model requests for this leaderboard? I can share a list of small models.
kastmada@reddit (OP)
Thanks, go ahead. I need to update the code and remove older models from active battles, keeping their scores archived only.
The storage for models is almost 2TB already.
pmttyji@reddit
Here are some models, including recent ones. Sorry, I don't have an HF account, so I'm sharing here.
Small models:
Small MoEs under 35B:
kastmada@reddit (OP)
I opened a new discussion for model suggestions.
https://huggingface.co/spaces/k-mktr/gpu-poor-llm-arena/discussions/8
SnooMarzipans2470@reddit
Is there anything we as users can do to help speed up the token generation? Right now a lot of queries are queued up.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Ok_Television_9000@reddit
Are the Qwen3 models multimodal? I.e., do they accept images, for example?
cibernox@reddit
Nice. I want to see a battle between Qwen3 Instruct 2507 4B and the newer Granite models. Those are ideal when you want speed with limited GPU VRAM.
JLeonsarmiento@reddit
Excellent!
hyperdemon@reddit
What are the acceptance criteria for the arena?
lemon07r@reddit
This is awesome, I hope this takes off. Could you add ServiceNow-AI/Apriel-1.5-15b-Thinker? It came out during that granite 4 wave, and imo is better than the granite models.
GoodbyeThings@reddit
This is great, I looked at llmarena manually the other day checking which smaller models appeared at the top.
SnooMarzipans2470@reddit
Lets goo!!!