I found a perfect coder model for my RTX4090+64GB RAM

[-]

Tot_hits@reddit

Lol cute...that's not really coding, that is s scripting.

Reply

[-]

So if i understood well you have RooCode extension in visual code hooked to your local LLM with the model **Qwen3-Yoyo-V3-42B-A3B-Thinking-TOTAL-RECALL-ST-TNG-III-i1-GGUF** is that correct ? i'm a noob in all of this i just build my Ai server With an asus x99-e ws 128 gb of ram and one RTX 3090 and 3 x RTX 3060 i'm planning to replace every RTX 3060 with a RTX 3090 but i want to learn more stuff about LLM rag and finetuning and also build my own local LLM for developing new full stack apps. so if you have open source local models to suggest i can use for my day to day dev i'll be gratefull.

Reply

[-]

smugself@reddit

What cpu you running? i7 or xeon

Reply

[-]

DeerWoodStudios@reddit

A Xeon E5 2697 V4

Reply

[-]

smugself@reddit

Nice. I'm bottle necked at 64gigs of ram because of the i7. Someday I might pickup a xeon chip off eBay to get me 128gb of system RAM.

Reply

[-]

DeerWoodStudios@reddit

I bought it from Aliexpress cost me 30 euros

Reply

[-]

smugself@reddit

Thanks for the suggestion!

Reply

[-]

lemondrops9@reddit

You got 48 GB of Vram then. Should try some 70B models. I've been quite surpised how good GLM 4.5 Air Q2 KL. Normally I stay away from 2 quants but its quite good. I tried some smaller coding tests and was very happy with the results.

Reply

[-]

milkipedia@reddit

I must admit I'm thoroughly confused about why a fine tune on Star Trek TNG makes for a better coder

Reply

[-]

Blizado@reddit

Me too, maybe it is not because of the ST TNG stuff but because of DavidAU's BRAINSTORM process (which improves reasoning). Because this is a DavidAU model and his finetunes are special. The original YOYO finetune model is only a 30B model, DavidAU made a 42B out of it with better reasoning and a ST TNG dataset finetune. So I would guess it is the improved reasoning. Would be interesting if DavidAU had a finetune for coding only with his BRAINSTORM process, sound perfect for this.

Reply

[-]

StateSame5557@reddit

There is one without TNG training, the Qwen3-Yoyo-V3-42B-A3B-Thinking-Total-Recall

Reply

[-]

randomqhacker@reddit

Yeah but then you wouldn't have Lt. Cmdr. Data optimizing your code!

Reply

[-]

lemon07r@reddit

Thinking models do a lot better with tool calls than instruct models I've noticed. Try [https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) I bet it will beat your sci-fi tuned fraken-merge any day..

Reply

[-]

Blizado@reddit

I wouldn't bet on that because of BRAINSTORM.

Reply

[-]

lemon07r@reddit

And I bet it effectively lobotomizes the model than actually helping anything. These models are no better than, no sorry, they're actually worse than the distill models by that one guy who vibe coded a nonfunctional distillation script that functionally did nothing but clone identical weights. Yet everyone ate it up and raved about his new tech and how much better his new models were. Snake oil. Have we learned nothing about confirmation bias from that last debacle? Give us benchmarks. Human one shot anecdotal evidence is meaningless, our experiences without an extremely large example size in blind testing is completely unreliable.

Reply

[-]

Blizado@reddit

Well, I can understand why you think so and can't blame you for that because you are right. In the LLM environment, a lot is always promised, even by the base model creators themselves, and then disappointment comes more often than one would like. So maybe you are right and it is the same here, on the other side DavidAU use BRAINSTORM since many months now in his models. I would think he wouldn't waist that much time with a technology that didn't work at all and he also do a finetune afterwards, what can fix what get broken in the process. But yeah, his models are not made for coding in the first place, but side effects can be sometimes strange on LLMs. On the other hand, however, we also need these new attempts. We are still in the very early stages of LLMs, and there is still a lot of room for improvement. But without experiments and new techniques, it is impossible to make fundamental improvements. And as far as benchmarks are concerned, that is a whole other topic, which should also be viewed critically, given how often people cheat in this area.

Reply

[-]

lemon07r@reddit

I'll be honest, I haven't wanted to straight out say it because I don't mean any disrespect and have seen him around on discord, seems like a nice dude but he doesn't have a good track record. I evaluate models for my personal use and sometimes run my own benchmarks against them; his models in particular.. were usually the bottom of the barrel. I've stopped testing them all together after a while. Wanted to refrain from saying it cause it's totally fine if others like his models more than me, but he hasn't put out anything notable ever, with objective evidence of it.

Reply

[-]

Blizado@reddit

You could be right, possible. I didn't tried DavidAU's model that much yet. Maybe he is one of this user who spend too much time and money on a dead end. Didn't had the time yet to try this model out here enough, only for some minutes in a not very good setup.

Reply

[-]

StateSame5557@reddit

For MLX people, I created the quants for it(nightmedia), the model is the best coder I found

Reply

[-]

Hot_Turnip_3309@reddit

https://huggingface.co/mradermacher/Qwen3-Yoyo-V3-42B-A3B-Thinking-TOTAL-RECALL-ST-TNG-III-i1-GGUF/resolve/main/Qwen3-Yoyo-V3-42B-A3B-Thinking-TOTAL-RECALL-ST-TNG-III.i1-Q4_K_M.gguf?download=true

Reply

[-]

Ummite69@reddit

Can this be integrated into visual studio ?

Reply

[-]

srigi@reddit (OP)

If you mean Copilot, if it allows to configure OpenAI compatible with the base URL model, then it could. I use Roo Code in VS Code. I personally believe it is far superior to integrated Copilot.

Reply

[-]

Ummite69@reddit

I want a local model that work as copilot does, in Visual Studio (not code). Is it possible?

Reply

[-]

social_tech_10@reddit

I presume you mean RooCode is better than Copilot using the same model, and if so, what makes it better? Is it just the system prompt? And can you give an example?

Reply

[-]

Holiday_Purpose_3166@reddit

Depends what coding you're doing. If it's single page edits (no heavy multi file) it's fine. You can do better with a GPT-OSS-20B in that case. Or use the Thinking variant from mradermacher. Pollutes in thinking but gets the job better earlier. Or Magistral Small 1.2 2509 or Devstral Small 1.1 2507.

Reply

[-]

redblood252@reddit

I can’t keep up…. Just read the paper on REAP this morning. Bit what the hell is yoyo and what is total recall st tng first iteration ?? And they are compounded? Sounds too hacky. Is this even gonna remain relevant in the next months?

Reply

[-]

jacek2023@reddit

There are many hidden gems on huggingface to discover, it's a shame most people know just the few most popular models and never try something new

Reply

[-]

Blizado@reddit

Problem is there are so many models, you would spend more time by trying models out than with using them. Since you also need to find out the best parameter setting for each model for best results for your usecase. Wrong parameters and a very good model looks for you like it is a very bad model. That is very time consuming and there are way too many models out. If you try to keep up here you quickly lose the motivation and stick to the best model you found so far, tweak the parameters over time for best results and only look on new hyped models. At least when you have not only fun with trying out LLMs and also want to use them. :D

Reply

[-]

LilPsychoPanda@reddit

And by the time you are done benchmarking, there may be a new and better one released already 😅

Reply

[-]

Blizado@reddit

Yep, and the side effect? You have more and more models on your harddrive you wanted to test...

Reply

[-]

Kyla_3049@reddit

Just stick to the recommended inference settings that Unsloth has.

Reply

[-]

Blizado@reddit

Well, it depends for what you use the model and there is always room for tweaking. But sure, you can use the "default" setting for the model, maybe that is already the best for coding, can be.

Reply

[-]

Blizado@reddit

Same setup, that sounds promising. Will give it a try, thanks. What RAM do you exactly have?

Reply

[-]

srigi@reddit (OP)

Since, I'm on AMD 9800X3D, I have 2x 32GB, G.Skill DDR5@6000 CL26. I know, that latency is a little bit of flex, I wanted that for gaming. However, this very special (and expensive) memory has zero overclocking potential, not even 6200.

Reply

[-]

Blizado@reddit

Thanks. Yeah, I have G.Skill DDR5@6000 with only CL30, but on 6200 with near zero impact on the timing, still CL30.

Reply

[-]

srigi@reddit (OP)

I had these 6000 CL30 before, but only 2x16GB, and was too able overclock them to 6200. I kind of regret going into these CL26.

Reply

[-]

Blizado@reddit

Yeah, for LLMs the RAM speed is more important than latency. I also want to see if I could let them rum on 6400, but so far I always run into RAM errors after some less minutes of testing them. 6200 is rock stable without any manual tweaks beside setting them in UEFI to 6200Mhz. But to be fair, it was only a compromise from my side to go to 6000 CL30. I also tend to look for lower latency's on RAM. I learned about that some weeks ago, before that my RAM also only was on 6000Mhz for over a year now. And also only I want to upgrade to 128GB with another pair of the same 2x32GB, only to learn that is was a very bad idea on DDR5 to go for 4 modules... 4200Mhz was max. Don't do that! :D

Reply

[-]

usernameplshere@reddit

Interesting find, would love to try this on my 3090, but I only have 32GB RAM, rip. Do you know how big roo codes system prompt is? Cline consumes 14k, which would make 32k kinda hard to work with.

Reply

[-]

srigi@reddit (OP)

15-16k. In my setup, I used 100k ctx-size. You could go down to 64k and your RAM need will probably fit. In my case, I have the luxury to run llama-server on a big machine, and code on the notebook (so RAM is not occupied by IDE/VSCode)

Reply

[-]

usernameplshere@reddit

So it's roughly the same as Cline, sad. I will try it out, but I don't think it will fit, even with a smaller context window. I'm at 1,7GB VRAM and \~11GB RAM util before even starting to launch LM Studio.

Reply

[-]

Glittering-Call8746@reddit

Can u try wsl and llama.cpp ? I wanna know 3090 vs 4090 , I'm on the fence to get 3080 x2 or 4090

Reply

[-]

billy_booboo@reddit

This has to be the most intriguing post I've read on here by far

Reply

[-]

tomakorea@reddit

Why didn't you use IQ4\_XS isn't it better than Q4\_K\_M in terms of precision and smaller footprint?

Reply

[-]

AppearanceHeavy6724@reddit

IQ4\_XS were universally ass whenever I've tried. IQ4\_XS of Mistral Small 3.2 for example was producing very strange prose, with considerably more confusion.

Reply

[-]

ScoreUnique@reddit

Yeah I'm surprised, I always sticked to IQ quants because I'm a firm believer of "make the most out of the available hardware" will try a Q4 xl next time.

Reply

[-]

ArtfulGenie69@reddit

Iq quants have had a run through of like a few thousand prompts to tune them a bit so they are kind of modified weights. People claim it makes them better at English but it kinda warps the original model so it may be better to try both and see what is best for you, if you need multilingual don't use IQ for sure.

Reply

[-]

Blizado@reddit

Also a good to know information, since I use LLMs not in english anymore.

Reply

[-]

lemon07r@reddit

better yet, use intel autoround quants if they're available. they probably provide the least amount of loss for their quant size

Reply

[-]

tomakorea@reddit

Oh thanks for the info! it's good to know

Reply

[-]

srigi@reddit (OP)

I'll test IQ4 later. I want to get the impression of the performance of Q4_K_M, before I move to IQ4 to be able to judge any failings in tool calling.

Reply

[-]

NoFudge4700@reddit

Are you having any tool call failures?

Reply

[-]

srigi@reddit (OP)

IQ4 was far more "stupid" than Q4_K_M. It was "overworking" the task from my little demo. I will not use it.

Reply

[-]

dinerburgeryum@reddit

Genuinely, this is why I prefer static quants to I-quants. I-quants looks great on paper, but the dataset is so critical to preserving what you need out of the tool, and I don't trust (no offense to the people doing the hard work) the quantizers to get my exact needs correct in their datasets.

Reply

[-]

JEs4@reddit

That’s a fascinating insight. On a related note, I’ve started falling back to multiple embedding models with 384 dim embedders used for semi-structured data concatenated with full dimensional text embeddings. Above 384 dims, semi structured ranking gets washed out by any other vectors. Smaller models can seemingly be much better in specific use cases.

Reply

[-]

ElectronSpiderwort@reddit

Before writing off the 30B A3B modes, test them at Q8 or the very least Q6, and with KV cache at F16. Q8 cache in particular absolutely tanks quality for me. You will have less context, yes, but you will have actual performance

Reply

[-]

Ok_Top9254@reddit

Same, I'd rather tank the model quality than KV cache, it starts going absolutely nuts if it's not f16.

Reply

[-]

stuckinmotion@reddit

oh wow interesting, I've switched to Q8 KV recently and didn't realize it might be impacting tool calling accuracy so much. I'll switch back to F16 (which I think is default anyway?), I don't know that it helped my prefill that much anyway (which is what I was going for)

Reply

[-]

MrMisterShin@reddit

OP definitely do this. KV cache @ Q8 ruined tool calling and got agentic coding stuck in loops. I reverted to F16 and also have the model at Q8. Granted I used two 3090s and it fits in VRAM, it should still be fast enough if you have to offload to system RAM.

Reply

[-]

MisterBlackStar@reddit

You mean base Qwen3 coder at q8 and without the kv cache params (or params set at fp16)? Or the model suggested by OP?

Reply

[-]

MrMisterShin@reddit

The base at q8 with the KV cache at full precision (FP16).

Reply

[-]

see_spot_ruminate@reddit

It takes about 45gb to offload to vram

Reply

[-]

MrMisterShin@reddit

I know, that’s why I said it should still be fast enough t/s, if you have to offload to system RAM. The model uses 3 billion active parameters, have the GPU hold the bulk of the computation/weights and your fine. Use —n-gpu-layers and —n-cpu-moe in Llama.cpp to your advantage and it will run just fine.

Reply

[-]

see_spot_ruminate@reddit

Oh, I wasn't trying to say you were wrong, lol.

Reply

[-]

GreenGreasyGreasels@reddit

>mradermacher/Qwen3-Yoyo-V3-42B-A3B-Thinking-TOTAL-RECALL-ST-TNG-III-i1-GGUF Hey, Bill, what was that model you told me was good for coding on my system? Yeah, it is the mradermacher's Qwen three, the Yoyo Version 3 forty two billion model with three billion active parameters thinker. Make sure you get the one with the nifty Start Trek The Next Generation release three, and this is important - remember to get the Total Recall's third version in the imatrix ggufs format - got all that? Whelp, never mind!

Reply

[-]

randomqhacker@reddit

I'm not downloading until I see ALF.

Reply

[-]

notlongnot@reddit

don't forget "i1" for iteration 1 ... maybe.

Reply

[-]

nmkd@reddit

It's for Importance Matrix GGUF quants. Not iteration.

Reply

[-]

Miserable-Dare5090@reddit

Yeah the names are getting crazy now 🤣 this is a davidAU TNG/Total Recall trained model merged with a Yoyo finetune, etc etc. it’s such a “early days of this tech” kind of moment. thanks for the laugh.

Reply

[-]

BumbleSlob@reddit

The names have been crazy for ages and then got more refined and now are dipping back to crazy. Shoutout to all the llama-2-wizard-vicuña-dolphin fans out there.

Reply

[-]

noctrex@reddit

I quantized this also, seems nice to me. https://huggingface.co/noctrex/Qwen3-30B-A3B-CoderThinking-YOYO-linear-MXFP4_MOE-GGUF

Reply

[-]

somethingdangerzone@reddit

BUY AN AD

Reply

[-]

perkia@reddit

Why? This works and is completely free.

Reply

[-]

coding_workflow@reddit

What tool you use for coding with Qwen ? Cli? No issues with tools use?

Reply

[-]

srigi@reddit (OP)

VSCode+RooCode extension. As I said, this model doesn't fail on tools (finally)

Reply

[-]

MrMisterShin@reddit

I know, that’s why I said it should still be fast enough t/s, if you have to offload to system RAM. The model uses 3 billion active parameters, have the GPU hold the bulk of the computation/weights and your fine. Use —n-gpu-layers and —n-cpu-moe in Llama.cpp to your advantage and it will run just fine.

Reply

[-]

cleverusernametry@reddit

Shill post?

Reply

[-]

AutomaticDriver5882@reddit

How do you upgrade it to that ram?

Reply

[-]

srigi@reddit (OP)

> --n-cpu-moe 28 Using this arg - it says how many MoE layers are offloaded to the CPU. The lesser the number, the more of them stays on GPU (faster inference), but you need VRAM to store them there.

Reply

[-]

AutomaticDriver5882@reddit

Ah ok thanks

Reply

[-]

lumos675@reddit

Downloading now.. i hope the dataset trained on be newer than qwen coder.

Reply

[-]

k0setes@reddit

You mention a comparison to vanilla, but how does it compare to Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf unsleth I got decent results with it in clein. In this case, does the benefit of the 42B model compensate for the 3-fold drop in speed?

Reply

[-]

InvertedVantage@reddit

I can't load this on my AMD 7900XTX with 24GB VRAM and 128 GB system RAM. I also have an NVIDIA 3060 12GB for a total of 36GB of VRAM. However loading it on these gets me 9 tk/s and I can't load it at all with a context over like 8k. What am I doing wrong here?

Reply

[-]

srigi@reddit (OP)

Sorry, I have no experience with AMD cards. I'm just using llama.cpp with cuda DLLs on Windows and things just works.

Reply

[-]

ikmalsaid@reddit

What about 8GB+64GB?

Reply

[-]

Blizado@reddit

Should work, with 24GB VRAM he only used 30GB RAM, so he didn't used even 50% of his RAM. But of course it will be a lot slower, since 8GB VRAM cards (I assume it's an NVidia) are also not as powerful as a 4090. We shouldn't forget that after a 5090 the 4090 is still the second best consumer card for AI before a 5080, after that three cards it gets noticeable slower alone from the PCI-e bandwidth speed, as long we speak from single GPU setups. So it is not only the lack of VRAM why it gets a lot slower. But it is worth a try.

Reply

[-]

LagOps91@reddit

GLM 4.5 air will likely be the best you can run. there is also a 4.6 air in the works, but not sure yet when exactly it will come out.

Reply

[-]

srigi@reddit (OP)

GLM air(s) are 100/300B, no way I can get 40tk/s on a single RTX 4090.

Reply

[-]

LagOps91@reddit

It will be slower, but 10 t/s is still possible. the model is much better than anything in the 30b range.

Reply

[-]

false79@reddit

I think you are confusing having a model that goes well beyond the available VRAM vs a model smaller and more nimble one to get things done. Given the right context instead of the entire all things universe, one can be very productive coder.

Reply

[-]

Easy_Kitchen7819@reddit

Compare it with agentica deepswe 32b

Reply

[-]

NoFudge4700@reddit

You’ve given me hope. I might upgrade my RAM now lol.

Reply

[-]

Brave-Hold-9389@reddit

Nice, saved this post

Reply

Reply to Post

92 Comments