Coders are getting better and better
Posted by 808phone@reddit | LocalLLaMA | View on Reddit | 91 comments
Just checking, what are people using for their local LLM? I'm currently trying Qwen2.5 Coder 7B and it seems to be really fast and pretty accurate so far. This is on a Mac using LM Studio. Thanks
Some_Endian_FP17@reddit
Supernova something that runs on Qwen 2.5 14B. It honestly is the best coding assistant I've used, online or offline, because it's so focused on coding. ChatGPT rambles on and is shackled by too many safeguards.
tspwd@reddit
Better than Claude 3.5?
aitookmyj0b@reddit
No. In the context of coding, the gap between Claude 3.5 and open source is quite large. Not in the same league.
f2466321@reddit
Probably isn't the case if you can use Mistral Large 2, but it takes 3-4 3090s to run it and it will still be 3x slower than Claude.
aitookmyj0b@reddit
In my experience Claude is leagues ahead of everything, including the huge models.
f2466321@reddit
Probably, but I encourage you to try Mistral Large 2, it's insane, for sure on par with 4o.
KedMcJenna@reddit
I just gave it a try based on your comment, and wow, yes. It solved on the 2nd attempt a tricky problem with a spaghettified React component that neither Claude nor ChatGPT had made much headway with. I grabbed an API key. A free billion tokens a month? Am I reading that right?
aitookmyj0b@reddit
Sure I'll give it a try
OfficialHashPanda@reddit
Maybe for some tasks, Mistral Large 2 can rival Claude 3.5 Sonnet, but for most of my coding use cases, Sonnet unfortunately does much better. I also found DeepSeek Coder V2 to be somewhat better than Mistral Large 2 specifically for coding and a lot faster, though it takes more VRAM to run locally.
Healthy-Nebula-3603@reddit
https://livecodebench.github.io/leaderboard.html
Even Qwen 2.5 32B is crushing Mistral Large 2...
Healthy-Nebula-3603@reddit
https://livecodebench.github.io/leaderboard.html
Qwen 2.5 is better
Orolol@reddit
On LiveBench, Sonnet 3.5 absolutely crushes Mistral Large.
tspwd@reddit
I was hoping this wasn’t the case any more. Thanks for clarifying!
Healthy-Nebula-3603@reddit
https://livecodebench.github.io/leaderboard.html
Qwen 32B seems to be at the level of Sonnet 3.5 (new)... DeepSeek is far worse.
PitchSuch@reddit
If you have the hardware to run the full DeepSeek 2.5 model, it isn't very far from Claude 3.5 Sonnet.
tspwd@reddit
Do you think it could run on an M4 Pro 128GB?
Inspireyd@reddit
There are people who claim that it is actually outperforming Claude.
shaman-warrior@reddit
And do they give a specific example? I would be super curious
nightman@reddit
There will probably be no comprehensive examples as it's simply not true (maybe in some corner case)
808phone@reddit (OP)
Yeah, it's good. I'm testing it now but it answered a number of programming questions a lot better than the stripped down Qwen.
Ystrem@reddit
Hi, can I run it on a GPU with only 8GB VRAM somehow? Thx
TerminatedProccess@reddit
Go look at the huggingface link in the conversation. Then click on Files and you will see a whole list of models that are designed to work under different memory conditions.
llIlIIllIlllIIIlIIll@reddit
Even 4o, Claude, o1?
MusicTait@reddit
wow great.. so how do you run it? Copy and paste, or is there a way to integrate it in, say, VS Code?
Some_Endian_FP17@reddit
Continue.dev and run the model as an OpenAI-compatible endpoint.
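If you haven't wired that up before: LM Studio (and llama.cpp's server, Ollama, etc.) exposes an OpenAI-compatible API, so any OpenAI client can talk to it. A minimal sketch with the openai Python package, assuming LM Studio's default port and a made-up local model id (match both to whatever your server reports):

```python
# Minimal sketch: query a local OpenAI-compatible server (LM Studio,
# llama.cpp server, Ollama, ...). The port and model id are assumptions;
# use whatever your server actually exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen2.5-coder-7b-instruct",  # hypothetical local model id
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```

Continue.dev then just points at the same endpoint from inside VS Code.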
Pineapple_King@reddit
What Supernova? Do you have a link or the name of the manufacturer?
giblesnot@reddit
https://blog.arcee.ai/introducing-arcee-supernova-medius-a-14b-model-that-rivals-a-70b-2/
808phone@reddit (OP)
Loading now!
shaman-warrior@reddit
Whut
Pineapple_King@reddit
ohh! Thank you!
Some_Endian_FP17@reddit
SuperNova Medius, you can get the GGUF files at https://huggingface.co/bartowski/SuperNova-Medius-GGUF
Many thanks to Bartowski.
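If you'd rather script the download than click through the Files tab, here's a sketch with huggingface_hub (the quant filename is a made-up example; pick one from the repo's file list that fits your RAM):

```python
# Sketch: fetch a single quant from the repo with huggingface_hub.
# The filename here is a hypothetical example; check the repo's file list.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/SuperNova-Medius-GGUF",
    filename="SuperNova-Medius-Q4_K_M.gguf",
)
print(path)  # local cache path of the downloaded GGUF
```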
No_Afternoon_4260@reddit
It is Apache 2.0
remghoost7@reddit
Any clue if it supports FIM...?
It doesn't seem like it on the main repo page...
iyzL0Ken0bi@reddit
I appreciate the input here. I'm going to check out this Supernova. I've been working on a convoy defense FPS game in Unreal 5 and I need a hand with some of the scripting. Thanks
808phone@reddit (OP)
I'm going to try the 14B but the 7B was already good for the tasks I gave it.
softwareguy74@reddit
I too am curious about this. I currently exclusively use Claude Sonnet 3.5 and it's amazing. Can I expect a local LLM to match this to some degree?
808phone@reddit (OP)
Yes, it can match it to some degree. It works for a lot of things. I would only use it for private data. Otherwise, if you are paying $20/month for the commercial stuff, just keep using it, but local LLMs are really getting much better.
Yud07@reddit
Qwen2.5 32B with a 4k context window at IQ4_XS is just about right for 16 GB VRAM, with a little spillover of layers into CPU/RAM.
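If you're reproducing that split with llama-cpp-python, it looks roughly like this (the filename and layer count are guesses; raise n_gpu_layers until VRAM is nearly full):

```python
# Sketch of partial GPU offload: most layers on the 16 GB card, the
# remainder spilling into CPU/RAM. Filename and layer count are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-32B-Instruct-IQ4_XS.gguf",  # hypothetical filename
    n_ctx=4096,       # the 4k context window mentioned above
    n_gpu_layers=55,  # tune until VRAM is almost full; the rest runs on CPU
)
out = llm("Explain Python decorators in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```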
BurgerQuester@reddit
What Mac do you run this on?
808phone@reddit (OP)
I'm running an M1 Max, 64GB/32-core.
BurgerQuester@reddit
Ah great! I've got that Mac too.
I haven’t run a model locally yet though, need to look into this.
Thank you
808phone@reddit (OP)
I never used all 64GB, and finally I have a use for it.
BurgerQuester@reddit
What is the performance like?
me1000@reddit
Qwen 2.5 32B is outperforming Claude for me on a lot of tasks I've been throwing at it the last couple weeks. It's a hell of a model, and it's not even their coding specific model.
Healthy-Nebula-3603@reddit
https://livecodebench.github.io/leaderboard.html
Yes, Qwen 2.5 32B and 72B are monsters.
talk_nerdy_to_m3@reddit
I have never tried a local LLM coder, but I have a hard time believing that anything can come close to Claude. They are way ahead of even GPT-4o in my experience. I would be shocked if Qwen is really that good, but I will give it a try! What are you using for a UI to chat with it, system prompt, temp, etc.?
Qual_@reddit
Qwen 32B is okayish, but unusable within an IDE, as it's not capable of fill-in-the-middle. Qwen 7B Coder is capable of fill-in-the-middle, but it's kind of dog shit as soon as you need more than completing truncated functions or auto-completing the arguments in a function call. Nothing comes close to GPT-4o and the new Claude Sonnet. I really don't know what they are coding with Qwen to be satisfied enough.
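For anyone wondering what fill-in-the-middle actually is: the editor sends the code before and after your cursor, and the model generates what goes between them, via special tokens in a raw (non-chat) completion. A rough sketch of the prompt shape for Qwen's coder models; the token names are from their docs, so double check them against your model's tokenizer config:

```python
# Sketch of a fill-in-the-middle prompt for a Qwen coder model.
# The token names are assumptions taken from Qwen's docs; verify them
# against the tokenizer config before relying on this.
prefix = "def fib(n):\n    "
suffix = "\n    return a\n"

fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
# Whatever the model generates after <|fim_middle|> is the code that
# belongs between prefix and suffix, which is exactly what inline
# autocomplete needs.
```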
3-4pm@reddit
There's a lot of pro-qwen propaganda here that doesn't match reality.
Qual_@reddit
Yes, but to be honest, it's good, and even surprisingly good for its 7B size. It's kind of on par with Copilot back in the day with GPT 3.5. The issue is that when you actively use Copilot with GPT-4o or the new Sonnet 3.5 (either with Copilot or Cursor), with the in-file changes etc., it's just nowhere near the capabilities of closed models yet. No matter how you twist the benchmarks or whatever. It's a cool model, and I'm glad to be able to rely on it if I lost my internet connection, but let's be real for a moment.
Anjz@reddit
I think that's the key though. A year ago most smaller models were shitty. With Qwen 32B Coder coming out soon, I think people don't understand the gravity of having an amazing coding model running on a local 3090/4090. With the price of APIs, running multi-agentic, reiterative "create a full-stack app" tooling like bolt.new against paid endpoints makes less sense. Of course, zero-shot, the big LLMs will always win.
Qual_@reddit
Of course, I love local models, but I'm still using the "big" ones for prod/serious stuff.
I don't mind messing with Gemma 2 27B, so it can do something the whole night on hundreds of thousands of lines, but the choice gets harder the cheaper the "big ones" get.
For example, Gemini Flash is almost free, probably cheaper than the electricity cost of running the equivalent models myself.
I'm just skeptical when I read "better than Claude for specific cases". There is no way.
Emotional-Pilot-9898@reddit
I agree here. Nothing comes close to Claude. Wish it weren't the case. For me, Qwen models work well for other tasks. Decent at coding, but not better than Claude.
Python developer here. Claude has better Linux recommendations as well.
808phone@reddit (OP)
Claude has been great for me, but in the end, ChatGPT always seems to get the answer right when Claude or Gemini fails.
sedition666@reddit
You should definitely try some recommended models out. It is a lot closer than you would imagine.
me1000@reddit
LM Studio (tbh, all the local clients are bad, but it works fine for my needs). MLX Q4. Temp is 0.5 and my system prompt is:
Notably, I often give Claude instructions to stop using bullet points and write prose, and it still really likes to use bullet points.
I was also surprised with how well Qwen was performing. Sonnet 3.5 has been my daily model since it came out.
MusicTait@reddit
nice one!
zero_proof_fork@reddit
Have you tried connecting it to Cline? This is where Claude is shining for me. It's not so much the model, it's the model combined with an IDE extension that grabs large amounts of code context over multiple files, no copying and pasting between different windows.
MasterDragon_@reddit
Can you share what hardware you are using to run it locally at reasonable speed?
me1000@reddit
M3 Max MacBook Pro, 128GB of RAM. About 18 tokens per second.
kuroninh0@reddit
Dear god, how much did it cost? I was thinking of buying an M1 Max 32GB.
me1000@reddit
It was $5k. It’s primarily a work machine, but given the option to max out the RAM I did so I could run local models.
My M4 Max is on the way! :D
kuroninh0@reddit
The M4 Max will be released only next year, no?
me1000@reddit
No, they announced it last week. Expected delivery date is Friday.
https://www.apple.com/newsroom/2024/10/new-macbook-pro-features-m4-family-of-chips-and-apple-intelligence/
kuroninh0@reddit
wow man that's huge! congratz!
MasterDragon_@reddit
Thanks.
Pedalnomica@reddit
I'm running the 72B (at 8-bit) and Claude 3.5 Sonnet definitely has a better shot at getting complicated stuff right. I basically just use the 7B coder or Claude depending.
me1000@reddit
I haven’t been using the 72B much because it’s a bit too big for my machine, but I can run it, it’s just slow. And funny enough the 32B was doing a little better at coding than the 72B (both Q4).
Pedalnomica@reddit
Maybe I should try the 32B
me1000@reddit
You should double check me, but IIRC the 32B model actually had a much higher score on the coding benchmarks than the 72B. Which makes me think they trained the smaller model on more coding data.
MaskedDelta@reddit
It could be because the larger model is being run at lower precision on the user's machine, impacting performance negatively. It's amazing what these models can do when their quality is not diluted to run at scale.
me1000@reddit
I'm talking about the published benchmark numbers. I don't run benchmarks on my machine.
cantgetthistowork@reddit
Have you tried comparing it with nemotron?
Weary_Long3409@reddit
Yeah, Qwen 2.5 32B is a GPT-4o-mini killer for me. Hope there's a full-fledged 32B coder.
badgerfish2021@reddit
waiting for that one as well, the blog post said there would be one but nothing yet...
glowcialist@reddit
One of the main developers was asked about Qwen2.5 Coder 32b a few days ago and just responded "Not today", kind of implying soon. I have my fingers crossed for a release like 24 hours from now, but I'm probably wrong.
DeltaSqueezer@reddit
'not today' sounds more like 'f-- off and stop bothering me' ;)
femio@reddit
Like what tasks?
me1000@reddit
It's better at following instructions when I ask it to write paragraphs and not bullet points. But I'm mostly asking it C and C++ coding questions.
ForsookComparison@reddit
Mistral-Nemo 12B is my sweet spot right now between performance and quality. Pretty acceptable speeds using CPU inference on DDR4
nuclear_semicolon@reddit
I have been using this model locally for a while now, and it has been working wonders
visualdata@reddit
For coding I mostly use Claude 3.5, it's really worth the price. But Qwen comes close.
PutMyDickOnYourHead@reddit
I run DeepSeek Coder 33B with Continue. Canceled my GitHub Copilot subscription the second I got it working.
Anjz@reddit
A year ago most smaller models were super shitty in general. With Qwen 32B Coder coming out soon, I think people don't understand the gravity of having an amazing coding model run on a local 3090/4090. They think, "Oh, Claude is so much better at one shot." But with the price of APIs, running multi-agentic, reiterative "create a full-stack app" tooling like bolt.new against paid endpoints makes less sense. Of course, zero-shot, the big LLMs will always win. I just think it's a giant leap for AI: not relying on expensive APIs, and having reiterative "swarm" software that would eventually give better output than one-shot expensive models.
Natural-Sentence-601@reddit
It's not just coding, either. Anthropic's Claude Sonnet 3.5 engages fully in conversations on "how best to proceed" about architecture, library reuse, UML-like design, and frameworks, all while providing demonstration code snippets. Because it doesn't have a plugin for VS Code or GitHub Copilot, you have to copy-paste into VS Code, but it is just awesome.
hashms0a@reddit
Hail Qwen 🫡
ThaisaGuilford@reddit
Stop there chinese spy
hashms0a@reddit
😂😂
Embarrassed-Way-1350@reddit
I use CodeGeeX4, it has great performance in Python, which is what I use it for.
fasti-au@reddit
Qwen and DeepSeek are both great choices.
KingGongzilla@reddit
How do these small local coding models compare to GitHub Copilot in terms of quality?
epigen01@reddit
Same, my go-to coders are Qwen2.5-Coder & Codestral. Qwen2.5 is noticeably faster, albeit sometimes too verbose, while Codestral is concise & clean but with longer runtimes.