How do you choose your open-source LLM without having to test them all?
Posted by Holiday-Case-4524@reddit | LocalLLaMA | View on Reddit | 23 comments
Hey everyone,
What approach do you use to figure out which model or version works best without having to try every single one? Do you have any tips or heuristics?
heyitsdannyle@reddit
You can do it in my tool https://www.evaligo.com/prompt-engineering. If you do, please tell me if it was helpful and if anything is missing.
audioen@reddit
In my case, the criteria are pretty simple. It has to be a MoE design to have acceptable performance on Strix Halo. It must fit within the 120 GB of VRAM that I've made available, with good context, at least 64k. It should be large enough to utilize all the VRAM, leaving maybe 20-30 GB or so for the OS and the rest. That gives me a budget of roughly 80 GB for the model and 20 GB for context. Different models hit slightly different sizes. I only use llama.cpp, so having a GGUF is a must, though that is typically not selective at all. I only use 5- or 6-bit post-training quantization because I worry that 4-bit harms the model too much.
After these considerations, the list I know about has become pretty short: GLM 4.5 Air, Qwen3-Next 80B-A3B, and gpt-oss-120b. I've got these three installed, two as the "derestricted" variants, and I typically find myself mildly preferring gpt-oss-120b for its terse thinking process and roughly 2x faster inference compared to the rest. I've been using a gpt-oss-120b model for months at this point, because it does quite well at coding and can achieve many tasks quite reasonably. It also has pretty good basic world knowledge as a kind of encyclopedia -- it knows more than I do in all fields except the few where I might have more expertise -- and it seems to debug issues rapidly and offers pretty solid advice most of the time.
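The budgeting above is simple enough to sketch in a few lines. This is a back-of-the-envelope estimate under assumed numbers (the params-times-bits formula and the 20 GB reserves are my reading of the comment, not llama.cpp output):

```python
# Rough VRAM budgeting: quantized model size in GB is approximately
# params (in billions) * bits per weight / 8.

def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a quantized model in GB."""
    return params_b * bits_per_weight / 8

def fits(params_b: float, bits_per_weight: float,
         vram_gb: float = 120.0, os_reserve_gb: float = 20.0,
         context_gb: float = 20.0) -> bool:
    """Check the model against what is left after the OS and context cache."""
    budget = vram_gb - os_reserve_gb - context_gb
    return model_size_gb(params_b, bits_per_weight) <= budget

# e.g. a 120B model at a 5-bit quant: 120 * 5 / 8 = 75 GB, inside an 80 GB budget
print(model_size_gb(120, 5), fits(120, 5))  # 75.0 True
```

At 6 bits the same 120B model would need 90 GB and no longer fit, which matches the "5 or 6 bit, depending on the model's exact size" trade-off described above.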
noiserr@reddit
My story is quite similar. Though I use the ROCm backend personally. This is because for coding agents prompt processing is the biggest bottleneck and ROCm path provides faster prompt processing than Vulkan.
I would like to find another model that works as well as gpt-oss-120b but I haven't found one yet. I haven't tested the Qwen3 Next 80B yet.
walrusrage1@reddit
Which unrestricted oss are you using?
slolobdill44@reddit
Hey I’m in your same shoes (have 120gb strix halo) and really appreciate your detail here.
Do you have any tips for getting GLM 4.5 Air to run faster? I prefer its responses to gpt-oss-120b, but I usually get 2x the tps using gpt-oss vs. Air.
Holiday-Case-4524@reddit (OP)
Thank you, man, for sharing your approach. I really appreciate it; I will keep all these considerations in mind. 🙏
OldCulprit@reddit
Interesting article.....
Orchestrating a Council to find the best models....
A weekend ‘vibe code’ hack by Andrej Karpathy quietly sketches the missing layer of enterprise AI orchestration | VentureBeat
Far_Statistician1479@reddit
Eventually in software engineering, you have to make a decision.
jacek2023@reddit
You can't test them all because there are new fine-tunes every day, but you can test some to see how they work in general. This is a brutal hobby.
Acrobatic-Show3732@reddit
i put it in a pokeball and tell it i chose it.
Holiday-Case-4524@reddit (OP)
Thank you, it could be an interesting research approach to propose to the AI community. I will share your idea.
Salt_Discussion8043@reddit
Ye he told deepmind to put their models in Pokeballs
ac101m@reddit
Lmao
Theio666@reddit
You have to test. Models have quirks which you won't be able to see just from metrics, and some of these only show up depending on the app you want to use the model in. For cloud it's easier, since there are only about 5 models worth trying; for smaller local ones, good luck :D
Salt_Discussion8043@reddit
Ye and usually you have at least some private data specific to your task
Salt_Discussion8043@reddit
Another vote for test them all
Lissanro@reddit
I noticed the following correlations:
- For larger models, IQ4 is generally good, but there are exceptions: for example, for Kimi K2 Thinking, originally trained using INT4 QAT, Q4_X is the best (as described in the Q4_X recipe by ubergarm: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF )
- For medium-size models, from sizes like GLM 4.5 Air up to GLM-4.6, I find IQ5 works best, but this may also have something to do with the fact that they are originally BF16, while the larger DeepSeek and Kimi models were FP8 (with K2 Thinking being an exception). GPT-OSS-120B is an obvious exception, since it was post-trained in MXFP4 format, so there is no need for a higher quant.
- Small models like Qwen3 30B-A3B tend to degrade noticeably at IQ4, and even at IQ5 it is still possible to notice a higher error rate, especially in more complex tasks, which small models tend to struggle with even at the original precision. For vision-enabled small models, I find that 8-bit quantization preserves quality better.
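These correlations can be restated as a tiny lookup helper. The size thresholds and tier labels below are illustrative guesses on my part, not output from any quantization tool:

```python
# Rough restatement of the quant-selection rules of thumb above.
# Thresholds are illustrative; adjust for the actual model family.

def suggest_quant(params_b: float, has_vision: bool = False,
                  native_4bit: bool = False) -> str:
    """Pick a quantization tier from the observed correlations."""
    if native_4bit:              # e.g. INT4 QAT (K2 Thinking) or MXFP4 (GPT-OSS)
        return "keep the native 4-bit format"
    if params_b <= 32:
        # small models degrade noticeably at IQ4; vision ones want 8-bit
        return "Q8" if has_vision else "IQ5 or higher"
    if params_b <= 360:          # roughly GLM 4.5 Air .. GLM-4.6 territory
        return "IQ5"
    return "IQ4"                 # larger models are generally fine at IQ4

print(suggest_quant(30), "|", suggest_quant(1000))
```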
As for choosing a model: for general use I prefer K2 0905 when no thinking is needed, or for Roo Code; K2 Thinking could be better, but currently it is not very well supported by Roo Code, and there are still open bug reports related to it. K2 Thinking works well for other things, though, including general-purpose chatting in SillyTavern.
Medium-size models can be especially useful if they fully fit in VRAM. For example, I have just 96 GB VRAM but 1 TB RAM; while that is enough VRAM to hold even the full 256K context cache and the common expert tensors of K2 Thinking, it is not enough to fully load GLM-4.6. I can still fully fit something like GLM-4.5 Air or GPT-OSS-120B in VRAM, which is useful for bulk processing work that is not too complex but still too hard for smaller models. Obviously, medium-size models can also be useful on RAM-constrained systems, where it is just not practical to use anything bigger.
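For sizing the context cache itself, the standard KV-cache formula for grouped-query attention gives a rough number. The layer and head counts in the example are hypothetical placeholders, not any specific model's dimensions:

```python
# KV-cache size: 2 tensors (K and V) * layers * kv_heads * head_dim
# * tokens * bytes per element. Example dimensions are hypothetical.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB, fp16 (2 bytes per element) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1024**3

# e.g. 60 layers, 8 KV heads, head_dim 128, 256K tokens at fp16 -> 60.0 GB
print(kv_cache_gb(60, 8, 128, 256 * 1024))
```

Numbers in that range explain why a long context cache alone can occupy most of a 96 GB card.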
And finally, small models: these are what I use when I need the best speed for simple tasks. They are not that great for general chat or Roo Code, but being small, they are relatively easy to fine-tune. For fine-tuning, I often use models in the 0.6B-4B range, since for specialized simple tasks, fine-tuning something like 30B-A3B or larger is not needed in most cases.
Zc5Gwu@reddit
What is your hardware?
Usually you do a calculation like the following:
1. Try to fit everything in VRAM on a single card
2. Try to spread weights across multiple cards
3. Try to use VRAM for active weights and fit the rest in RAM (with llama.cpp)
- For models <32B, use Q8 or greater
- For models >32B, use Q4 or greater
- For creative work you can go lower
- For coding work you need better quants
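The three placement steps and the quant rules can be sketched together. The thresholds are the ones stated above; everything else (function names, return strings) is made up for illustration:

```python
# Placement order: single GPU, then split across GPUs, then
# llama.cpp-style partial offload of the remainder to system RAM.

def placement(model_gb: float, vram_per_gpu: list[float], ram_gb: float) -> str:
    if any(model_gb <= v for v in vram_per_gpu):
        return "single GPU"
    if model_gb <= sum(vram_per_gpu):
        return "split across GPUs"
    if model_gb <= sum(vram_per_gpu) + ram_gb:
        return "partial offload to RAM (llama.cpp)"
    return "does not fit"

def min_quant(params_b: float, workload: str = "general") -> str:
    """<32B -> Q8 or greater, >32B -> Q4 or greater; coding wants better."""
    base = "Q8" if params_b < 32 else "Q4"
    if workload == "creative":
        return f"{base} (can go lower)"
    if workload == "coding":
        return f"{base} (go higher if possible)"
    return base

print(placement(14, [16.0], 64), "|", min_quant(30, "coding"))
```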
Holiday-Case-4524@reddit (OP)
Thank you for your tips, I am working on a T4 (16 GB). What about instruct vs. thinking models?
Zc5Gwu@reddit
Thinking vs instruct is personal preference I suppose. Do you prefer quicker but possibly less “smart” responses?
Thinking benefits logic and long context reasoning.
Holiday-Case-4524@reddit (OP)
thank you again :D
-dysangel-@reddit
I just test them all
xxPoLyGLoTxx@reddit
Gotta test em all!