What's the current best code autocomplete LLM for local deployment (as of April 2026)?
Posted by danielecappuccio@reddit | LocalLLaMA | View on Reddit | 14 comments
I know this question has already been asked a thousand times, probably, but... what's the best or close-to-best model I can use with Continue for local IDE-like code autocomplete? Assume a reasonable amount of VRAM to work with (~16GB, so no GLM or similar trillion-parameter models).
Answers to similar questions still point to Qwen2.5-Coder, which by now is a two- (almost three-) generation-old model.
Also, do I need base models only, or am I also fine with instruct ones?
EffectiveCeilingFan@reddit
Dude, you are going places in life. The number of people who see a Qwen2.5 Coder recommendation, don't bother to check whether that's actually a recent model, and then come on here complaining about how much it sucks is mind-boggling. Hats off to you for actually doing some research. This case, though, is one of the only areas where Qwen2.5 Coder is actually still relevant: it's quite adept at FIM code completion.
Next-edit completion hands-down goes to Zeta 2, and it's not even close. You have to use Zed, but it's better than anything else I've tested. It's only 8B, so not too difficult to run at acceptable speeds. Continue's Instinct next-edit completion model is kinda cheeks. If you need a VSCode extension, there's one for Sweep Next Edit, of which the 1.5B variant is open weights. If you just need something bog-standard, though, I'm pretty sure Qwen2.5 Coder is plug-and-play with Continue. In fact, llama.cpp has some FIM presets with speculative decoding and stuff already configured.
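If you go the llama.cpp route, the presets I mean look roughly like this (flag names from memory, so double-check `llama-server --help` on your build):

```shell
# Serves Qwen2.5-Coder 7B with FIM endpoints plus a small draft model
# for speculative decoding already wired up (weights download on first run)
llama-server --fim-qwen-7b-spec --port 8012

# Plain 1.5B FIM preset for lower-VRAM machines
llama-server --fim-qwen-1.5b-default --port 8012
```

Then you just point the editor extension at http://localhost:8012.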
nodejshipster@reddit
Zeta 2 cannot be run locally; it's not open source. You have to use it through Zed, which runs the model in the cloud, not on your system. So the fact that it's 8B is irrelevant, since you're not running it anyway.
EffectiveCeilingFan@reddit
https://huggingface.co/zed-industries/zeta-2
Literally a Google search away my guy
nodejshipster@reddit
I stand corrected. I did do my due diligence before writing that comment but couldn't find it anywhere.
DistanceAlert5706@reddit
Have you tried Mellum from JetBrains? Their single-line models are good. Sweep is a very interesting model, not FIM though; I'll check out the VSCode extension.
CombustedPillow@reddit
I'm new to this stuff and I've struggled to find something useful, but I just found one that seems to work fast and doesn't spew nonsense:
qwen2.5-coder-7b-ts-fim-autocomplete-defuss-gguf : GGUF
I found it by searching for models with the keyword "autocomplete" in LM Studio, which is what I run models on instead. It has a lot of settings and visuals that help you make informed decisions. I still have a lot to learn in terms of tuning params, though.
In case you want to try LM Studio, this is my setup for it in Continue once I load a model on my local server:
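For reference, a minimal Continue config along these lines (just a sketch, assuming LM Studio's default local server on port 1234; the model id should match whatever LM Studio shows for the loaded model):

```json
{
  "tabAutocompleteModel": {
    "title": "LM Studio autocomplete",
    "provider": "lmstudio",
    "model": "qwen2.5-coder-7b-ts-fim-autocomplete-defuss"
  }
}
```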
RudeboyRudolfo@reddit
https://huggingface.co/Tesslate/OmniCoder-9B-GGUF
I had some good results with this one.
CombustedPillow@reddit
Is something wrong with my setup? I keep getting terrible autocomplete, and I've tried several recommended models, including this one.
BelgianDramaLlama86@reddit
As others have said, Qwen2.5-Coder 3B or 7B will actually still work pretty well for this. Instruct models seem to work fine, although conventional wisdom says the base models are even better; I don't know if it actually matters for this one, though. Qwen3-Coder (30B) can also be used for FIM and is probably even better, but substantially bigger, obviously. You can either get a low quant and fit it in VRAM, or run it offloaded to CPU and probably still get enough speed. I have a 12 GB GPU and tried option 2, and it works well enough for me.
Hubir7@reddit
What quant are you running?
BelgianDramaLlama86@reddit
Q4_K_XL by Unsloth.
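For anyone wanting to reproduce the offload setup, the invocation is roughly this (a sketch: the GGUF filename is hypothetical, use whatever you downloaded, and tune -ngl down until it fits your VRAM):

```shell
# Offload as many layers as fit on the GPU; the rest stay on CPU.
# An MoE with only ~3B active params keeps CPU-side decoding tolerable.
llama-server -m Qwen3-Coder-30B-A3B-Q4_K_XL.gguf \
  -ngl 28 --ctx-size 8192 --port 8012
```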
horeaper@reddit
Granite 4 models are trained for FIM; their 7B MoE model (1B active) works very well even on small-VRAM GPUs.
RedParaglider@reddit
What I'm wondering is what people are using for an end-to-end stack: model > llama.cpp > frontend?
I tried setting up Gemma to do it, but it kept spewing chatty stuff, not FIM output.
EffectiveCeilingFan@reddit
I don't think Gemma 4 has any FIM training. Quickly looking at the tokenizer, there are no FIM tokens, and there isn't any mention of a particular FIM format in the model card either.
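For contrast, a FIM-trained model like Qwen2.5-Coder expects its documented special tokens; a model without that training just sees them as ordinary text and answers conversationally. A quick sketch of what the prompt looks like:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prompt in Qwen2.5-Coder's fill-in-the-middle format.
    The model generates the code that belongs between prefix and suffix."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Ask for a function body given the code before and after the cursor
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(2, 3))\n",
)
```

You send that as a plain (non-chat) completion request, and the model's raw output is the middle chunk your editor inserts at the cursor.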