Best AI (agent?) for coding locally?
Posted by Open-Impress2060@reddit | LocalLLaMA | View on Reddit | 22 comments
Ryzen 5, 7500F
RX 9070 XT
32 GB DDR5
I want to code a website and an app for something and I was wondering, whats the best AI I can run with my hardware, and should I use a tool like Claude Code or Pi agent to run them?
I tried Gemma4 on Pi Agent and it was really weird for some reason however I think Pi Agent was somewhat to blame. Should I try again locally? It also took like 6-7 minutes to get an output.. with ChatGPT it often takes somewhere near 20 seconds and they are often way better quality. The time is not my concern, but I though that local AI's are almost as good as those from OpenAI and Claude nowadays? Anyways, for now I want to code just a landing page. Should I just do it with Chat or are there good alternatives for my hardware right now?
Thanks in advance!
tonyboi76@reddit
on a 9070 XT (16GB) + 32GB RAM youve got real options, but a few things first:
6-7 min per response is way off, something was running on CPU or the model didnt actually fit on the gpu. ROCm + llama.cpp Vulkan should give you 20-40 t/s on something like qwen2.5-coder-14b at Q4. confirm the gpu is actually doing the work via the AMD equivalent of nvidia-smi (radeontop or rocm-smi).
for harness: aider is the most mature for local-model coding, install with pip and point it at a llama.cpp server. continue or cline as vs code extensions also work fine. id avoid pi for now, theres a reason most people use the others.
honest part: for building a full website + app, local 14b will frustrate you. the quality gap vs chatgpt/claude is real and big. use local for focused tasks (write this function, refactor this file) and frontier models for the actual planning and integration. dont try to do everything locally on consumer hardware right now, the math doesnt work yet.
Open-Impress2060@reddit (OP)
6-7 minutes (sometimes even far longer) for the end result though not just for the answer that was instant. I use arch so idk if the drivers were properly working.
Whats the best you recommend for Linux? VsCode extension?
For the landing page, though, it works out no? Or should i use even there public models
tonyboi76@reddit
ah ok totally different framing then, that wasnt clear from the first post. 6-7 min for a full agent task on a local 14b is actually pretty normal, not a misconfig. multi-step coding (read files, plan, edit, run, fix) takes several turns and each turn is 20-40 sec on consumer hardware vs 2-3 sec on chatgpts datacenter gpus, that compounds fast.
so the bottleneck isnt your setup, its just that local 14b on consumer hardware is genuinely 10-15x slower end-to-end than frontier hosted models. the speed gap is real and not really fixable without bigger hardware. local makes sense for offline / privacy / cost reasons, not for matching frontier speed.
Unlucky-Message8866@reddit
qwen3.6 27b mtp + a properly tuned pi is superb. ditched all cloud subscriptions, running 100% local since qwen3.6 release.
Leflakk@reddit
Hi, mine, easy to install, many features, low context footprint.
https://github.com/leflakk/openclose
Open-Impress2060@reddit (OP)
Shpuld i get it though llama.cpp or ollama
Jorlen@reddit
I have Pi setup with a vscode extension that lets by bypass using Pi CLI and instead use it in Vscode, works really well.
What tok/sec are you getting with just chatting with the LLM you setup? What are you using for your LLM inference? Ollama? llama-cpp?
Open-Impress2060@reddit (OP)
Well i get an answer immediately- but it takes that long for it to finish coding
indiealexh@reddit
You could run a MOE model if you offload to CPU / RAM, the trick is balancing it where as much as possible is in the GPU. (see --n-cpu-moe parameter)
Open-Impress2060@reddit (OP)
Im super new idrk how to do that what would you recommend i use linux so idk if it works with the drivers and all
totosse17@reddit
You can run qwen 3,6 35b a3b. You can put all the expert to the video card. For free local harness you can use opencode or Hermes agent with coding skills
Open-Impress2060@reddit (OP)
What do you mean "the expert to the video card"? You mean running the ai on the video card
blackhawkx12@reddit
its an MoE models or "mixture of experts", different with dense model like 27b where the whole library and knowledge loaded inside GPU, with MoE, its like splitting the library and knowledge and it can live in different place like CPU hence the name A3B or only 3B (knowledge) active at one time. Usually its a chance for smaller GPU to have a chance to run big model but without sacrificing performance too much, but with your good graphic, you can easily load them there. CMIIW
As for me i think 27B dense is better than 35B A3B, but you do you and always test in your use case, cheers mate.
Open-Impress2060@reddit (OP)
Do i have to do anything to do that tho or can i just install it through llama.cpp
totosse17@reddit
Yes indeed
Spirited_Friend_8428@reddit
Your hardware is actually pretty solid for local coding models. A 9070 XT + 32GB DDR5 can comfortably run most 7B–14B coding models, and even some 32B quantized ones if you’re patient.
The main thing though: local AI still isn’t consistently on the level of GPT-4.1 / Claude Sonnet for real-world coding workflows. It’s improved a lot, but Reddit tends to overhype “almost as good.” For landing pages and smaller apps? Sure, local can be great. For architecture, debugging weird issues, or multi-file reasoning, cloud models still win pretty hard.
A few recommendations for your setup: Skip Pi Agent for now. It’s still kinda janky and adds overhead/confusion. Use a simpler stack: LM Studio Ollama Open WebUI + Continue.dev in VS Code For models, try these instead of Gemma4: Qwen2.5-Coder 14B → probably the sweet spot for your hardware DeepSeek-Coder V2 Lite Codestral Qwen2.5-Coder 32B Q4/K_M if VRAM allows and you don’t mind slower speeds Gemma is decent, but a lot of people find it inconsistent for agent-style coding tasks. Also, 6–7 minutes for a response sounds wrong unless: you loaded a huge quant, inference fell back to CPU, or Pi Agent was doing extra tool/agent loops.
With your GPU you should usually see something more like 20–60 tok/s on 7B–14B models.
NigaTroubles@reddit
Qwen2.5 !!
Electronic-Bid-7601@reddit
just wondering whats wrong with qwen2.5? please no hate, im new to this world. whats better than qwen on a 8gig gpu?
NigaTroubles@reddit
You can use qwen3.6 35b a3b model Its smart and way better than qwen2.5 and you can run it
Electronic-Bid-7601@reddit
thanks! got any recommendations for a 16gig gpu? Im looking at one of those data center gpus rn
Gesha24@reddit
If you want something simple like a landing page - just use free Google Gemini. If you want something complicated - pay for Claude or ChatGPT.
If you want to learn more about LLMs and become better at using them - yes, use local ones. Just understand that results won't be as good the moment you move away from regular simple tasks.
I have found that Gemma is very picky about the harness, I had the best luck with opencode. I also was not able to get consistently good results once the context went above 100K. Qwen, on the other hand, seems to work fine with any harness.
I was able to achieve the best results switching between models during the project. But again - the moment you try to do something less common, you will face challenges. My most recent example - neither Gemma nor Qwen were capable of creating a working dashboard in Datadog using Datadog's MCP server. Given the same exact spec file Gemma completely failed to create anything, Qwen created a dashboard that has one working graph out of 30 and Claude Sonnet created a totally working dashboard.
johnnydotexe@reddit
I've been running Qwen2.5-coder-14b-instruct with k/v cache = Q8 and Qwen2.5-coder-1.5b-instruct as the draft model on my 4070 ti super and it seems to run well without spilling over in to my system ram (32gb ddr4). I'm no pro at this and just use it for little python based project ideas for work, and I'm sure I'm probably not running it at its full potential for my gpu...but it's OK as-is.