Anyone running local LLM coding setups on 24GB VRAM laptops? Looking for real-world experiences
Posted by AmazinglyNatural6545@reddit | LocalLLaMA | 31 comments
Hi everyone
I’m wondering if anyone has real day-to-day experience with local LLM coding on 24GB VRAM? And how do you use it? Cline/Continue in VS Code?
Here’s the situation: I’ve been using Claude Code, but it’s getting pretty expensive. The basic plan recently got nerfed — now you only get a few hours of work time before you have to wait for your resources to reset. So I’m looking into local alternatives, even if they’re not as advanced. That’s totally fine — I’m already into local AI stuff, so I am a bit familiar with what to expect.
Right now I’ve got a laptop with an RTX 4080 (12GB VRAM). It’s fine for most AI tasks I run, but not great for coding with LLMs.
For context:
- unfortunately, I can’t use a desktop due to certain circumstances
- I also can’t go with Apple since it’s not ideal for things like Stable Diffusion, OCR, etc., and it’s expensive as hell. More expensive than a non-Apple laptop with the same specs.
- cloud providers can get expensive with constant, everyday use for work
I’m thinking about getting a 5090 laptop, but that thing’s insanely expensive, so I’d love to hear some thoughts or real experiences from people who actually run heavy local AI workloads on laptops.
Thanks! 🙏
Teetota@reddit
Cline + Devstral do a good job at code explanation, test generation, vulnerability analysis, and documentation. Not actual coding, but still quite helpful.
AmazinglyNatural6545@reddit (OP)
Thank you! It's helpful 👍
Simple-Worldliness33@reddit
Hi !
I'm running on 2x 3060 12GB and an X99 motherboard (yes, very old, cheap stuff).
Most of the time with these 2 models:
- unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:IQ4_NL
- unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:IQ4_NL
with Ollama and exactly 57344 context length (rough launch sketch at the end of this comment).
So it fits into 23GB VRAM and I run those at 60-70 t/s up to 40k context.
After that, the speed decreases, down to around 20 t/s.
It covers almost everything I need for a daily use.
I code a bit, brainstorm, and feed it knowledge as memory.
I provide automatic web search to help the instruct model be more accurate.
The Coder model is mostly for reviewing and optimising code.
If I need more context with large knowledge (happens about once a week), I run unsloth/gpt-oss-20b-GGUF:F16 with 128K context.
As UI, I'm using Open WebUI, plus Continue in VS Code with the Coder model, of course.
I'm thinking about upgrading GPUs, but I think using cloud models like Claude for specific cases would be cheaper in the long term. I use Claude about once a month, I think.
My goal is to have a daily use model.
Don't forget that prompt fine-tuning is also key to getting good results out of a model.
Hope this is helpful.
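For anyone wanting to reproduce a setup like this, here is a rough sketch of pinning that 57344-token context in Ollama via a Modelfile. It assumes your Ollama version can pull the unsloth GGUF straight from Hugging Face; the tag and num_ctx just mirror the numbers above, so adjust for your own hardware.

```bash
# Rough sketch, assuming Ollama can pull the unsloth GGUF directly from Hugging Face.
# The tag and num_ctx mirror the numbers mentioned above; adjust for your setup.
cat > Modelfile <<'EOF'
FROM hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:IQ4_NL
PARAMETER num_ctx 57344
EOF

ollama create qwen3-coder-57k -f Modelfile   # bake the context length into a local tag
ollama run qwen3-coder-57k                   # then point Open WebUI / Continue at it
```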
AmazinglyNatural6545@reddit (OP)
It's extremely helpful. Thank you so much for the time you spent writing it.
If you're wondering about Claude, I can share my experience. I used the Claude standard plan (20 USD) with Claude Code. It was OK, but eventually they shrank the limits and it could only run for around 2 hours before I was waiting hours to get my quota back. Now I use their Max plan and run it every day. I hit some limits when I used it too much and it switched automatically from Opus to Sonnet. But now that the latest Sonnet is almost the same as Opus, that's no longer a problem for me personally.
For hard thinking/planning/architecture, Claude Opus, in my personal opinion, is not as good as the latest ChatGPT, which you can get for 20 USD/month.
RobotRobotWhatDoUSee@reddit
I just posted about this in this thread; I use gpt-oss 120B and 20B for local coding (scientific computing) on a laptop with AMD's previous-gen iGPU setup (Radeon 780M). It works great. I get ~12 t/s for 120B and about 18 t/s for 20B. You would probably need to use --n-cpu-moe, and would need to have enough RAM. (I upgraded my RAM to 128GB SODIMM, though I see that's out of stock currently; 96GB is still in stock. Either way, confirm the RAM is compatible with your machine before buying anything!)
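As a concrete illustration of the --n-cpu-moe approach, here's a minimal llama-server launch sketch. The GGUF filename and the number of expert layers kept on the CPU are placeholders you'd tune to your own RAM/VRAM split.

```bash
# Minimal sketch: keep the MoE expert weights of the first N layers in system RAM
# and offload everything else to the (i)GPU. Filename and values are illustrative.
llama-server \
  -m gpt-oss-20b-F16.gguf \
  --n-cpu-moe 12 \
  -ngl 99 \
  -c 32768 \
  --port 8080
```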
AmazinglyNatural6545@reddit (OP)
That's an awesome idea, I highly appreciate your comment, sir. A bit off topic, but have you tried running Stable Diffusion (Automatic1111 or ComfyUI) there? Is it really slow? (I know it's slower than a dedicated GPU, but I'm just wondering by how much.)
false79@reddit
I wouldn't get a mobile GPU laptop. That's just handicapping yourself.
I use a 7900XTX + Ryzen 5600 + 64GB RAM and I'm pretty happy using that as an OpenAI-compatible server hosted through llama.cpp (rough server sketch below).
I have both Mac and Windows computers running VS Code configured to hit that box.
https://www.reddit.com/r/LocalLLaMA/comments/1obqkpe/comment/nkhnbtu/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
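In case it helps, a rough sketch of what serving a coder model from a box like that over the LAN can look like with llama-server; the model file, context size, and API key here are placeholders.

```bash
# Rough sketch of an OpenAI-compatible endpoint on the 7900XTX box.
# Any machine on the LAN can then point VS Code / Cline / Continue at it.
llama-server \
  -m Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf \
  -ngl 99 \
  --ctx-size 65536 \
  --host 0.0.0.0 --port 8080 \
  --api-key localkey
```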
AmazinglyNatural6545@reddit (OP)
Basically the 7900XTX also has 24GB VRAM, although it demands more power and is thus a bit more efficient, thanks to the desktop form factor. Could you please share your experience working with your setup? I'm interested in almost the same RAM/GPU size, but only in a laptop form factor, so I'm highly interested in how it works for you.
false79@reddit
If you want desktop performance in a mobile solution, the ThinkPad P16 with the RTX Pro 5000 Mobile 24GB is pretty much the same as a 7900XTX. It can do CUDA too.
$8000+ USD
AmazinglyNatural6545@reddit (OP)
It costs a fortune for 24GB VRAM. Even though it's much more stable than a plain 5090 mobile, the VRAM amount is the same, and I'm not at the level where I worry about a failure during 60 hours of training, etc. 😅 To be honest, I don't understand what the purpose of that expensive monster is. Basically it's the same small amount of VRAM for much more money.
false79@reddit
Everything about the hardware is in my reply here. Everything about the software is in that link. 170+ t/s.
The key here is that the 7900XTX is the poor man's 4090 desktop GPU, which has 3-4x the memory bandwidth of mobile GPUs. But to deliver quality, you need a well-defined system prompt that captures the universe of what you want to do and nothing else. I write prompts one-shot style, attaching a file from the project as a reference for Cline + the LLM to lean on when reasoning. That leads to quality execution during Act mode.
I would like it to be faster, but really this hardware and software config meets my needs, as the entire model resides in the GPU's memory.
As one of the clients, I use a MacBook Air M2 24GB with VS Code + the Cline extension. The model is hosted on the desktop 7900XTX. This setup doesn't have the IDE and compiler competing with the model for resources.
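A quick way to verify the remote endpoint before wiring it into Cline/Continue is a plain curl from the laptop; the IP, key, and model name below are placeholders for whatever your llama-server reports.

```bash
# Sanity check from the MacBook that the desktop's OpenAI-compatible API is reachable.
# Replace the IP, key, and model name with your own.
curl http://192.168.1.50:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer localkey" \
  -d '{
        "model": "qwen3-coder",
        "messages": [{"role": "user", "content": "Write a hello world in Python."}]
      }'
```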
AmazinglyNatural6545@reddit (OP)
Thank you so much for this info, really useful. So 24GB is good enough, as I thought. Quite inspiring 🙃
megadonkeyx@reddit
just not worth it for agentic coding, nice for games tho.
AmazinglyNatural6545@reddit (OP)
Nah, games aren't interesting for me. Only vram
Icy-Corgi4757@reddit
I have an MSI Raider 5090 laptop and it has been solid for this. It runs a few small models at the same time for some agentic stuff, etc.
It's a niche buy, but if you need Linux and 24GB VRAM minimum in a mobile format, there is no other option. FWIW, Micro Center has some relatively cheap ones (compared to the 5090 mobile launch price): I paid $4100 or so for mine, and it's now regularly listed at $3100 or so, as are most of the other 5090 mobile laptops they have.
For agentic coding tasks, I don't know if the value will really be there. You would have to find a specific model that performs well enough for your use cases to necessitate the 24GB card. I wouldn't go into it looking for a solution (a good coding model for 24GB), but rather figure that out beforehand and then make the purchase if it makes sense, whether that means putting a few bucks into OpenRouter to test potential models, etc.
AmazinglyNatural6545@reddit (OP)
Thank you so much. That's extremely useful info! The suggestion is awesome. Will do it 👍🍻
alexp702@reddit
We ran Qwen Coder 30B on a 4090. It’s very fast, but quite bad. Continue hooked up to it did OK code completion. Cline, however, requires a big (read: huge) context, so a 64GB Mac or Strix Halo laptop is probably your only sensible option.
If you’re a tinkerer you could muck about with offloading to boost context size, but performance falls off rapidly.
None will compare at all to Claude. To get close you need the 480B Qwen Coder, and that’s Mac Studio territory.
AmazinglyNatural6545@reddit (OP)
Could you please share how bad it is? I mean, is it capable of writing some unit tests, etc.? Or just some small code-completion hints?
My 12GB VRAM is only good for running DeepSeek + Cline for autocompletion and some code clarification, and even that isn't great. So I mostly use Claude now 😓
jumpingcross@reddit
Maybe you can still try running it? I think you should be able to fit a quant of the REAP version if you offload to system RAM. I find it works pretty well with aider for basic tasks like implementing functions and refactoring.
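If it helps, a minimal sketch of pointing aider at a local OpenAI-compatible server (llama.cpp or Ollama); the base URL and model name are placeholders for whatever you're serving.

```bash
# Assumes a local llama-server (or Ollama's OpenAI-compatible endpoint) is already running.
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=local                  # any non-empty string if the server has no key set
aider --model openai/qwen3-coder-30b-reap    # "openai/<name>" tells aider to use the custom base URL
```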
AmazinglyNatural6545@reddit (OP)
I tried Cline + a heavily quantized DeepSeek and it works, but not great at all. Mostly autocompletion and some simple solutions for copy-paste code, and it isn't what I'd like to have. I'm looking for at least something that could analyze a few files together and try to refactor them, create unit tests, etc.
alexp702@reddit
For us it gives an answer which is hit and miss. The problem is the answer is often quite basic and filled with problems. The larger models consider the problem better and correct errors better. Often the 30B will simply not arrive at a useful answer. Drop some money on OpenRouter and you can try both for very little.
AmazinglyNatural6545@reddit (OP)
I see, thank you!
No_Afternoon_4260@reddit
You won't find a suitable coding model under GLM's size (~355B).
If you want something that performs even remotely like Claude, this is the bare minimum, and you'd prefer 600B-1T like DeepSeek or K2.
If you want my opinion, forget about it: no laptop in late 2025 will give you the resources for anything meaningful regarding coding.
The best you could hope for is a good quant of a 24B or a small quant of a 30B, and that's not coding territory at all.
AmazinglyNatural6545@reddit (OP)
I don't have any illusion of "having a laptop that will do the same coding work as Claude Code". I understand it will be much worse, but I'm trying to figure out by how much. Will it have some practical use for coding, or none at all?
I do a lot of other AI stuff like Stable Diffusion, training, OCR, RAG, etc. on my RTX 4080 with 12GB VRAM, but local AI coding has been a 'never-reachable hill' for me all that time.
brianlmerritt@reddit
I've got an RTX 3090 with that 24GB of memory. It can run gpt-oss:20B and Qwen3:30B (coder or not coder) plus Devstral.
I haven't even tried it on anything agentic like Cline or Claude Code (local) for real development - it just isn't good enough, mostly because I can't get the context up high enough.
If I ask a coding or medical question, I get a good response, but it's not quick. Probably 5-10 seconds thinking, and then 20-30 tps.
I find Sonnet 4.5 so much better even than earlier versions and of course better than these 20B or 30B models.
TBH, before you buy a Strix Halo or a Mac with 128GB+ RAM, try the models you want on OpenRouter or Novita pay-per-token (quick example after this comment). You can also try the much larger models, and maybe the new flavour-of-the-month ones are really good. Try MiniMax M2, GLM 4.6, and the "old" stalwarts Qwen3 and DeepSeek.
There is also https://github.com/katanemo/archgw/tree/main/demos/use_cases/claude_code_router but I haven't tried it yet with Claude Code. It uses the same LLM router as Hugging Face, and you can tell it that easy tasks go to the local LLM while anything harder or agentic goes to OpenRouter or Claude. It might save you money, or it might waste time but give you some learning experience of "what else doesn't work as well as Claude Code".
Sorry for taking so long. The other thing I found with Claude Code is that in the early days I used up all my in-tier tokens asking it to fix and redo stuff over and over. If you hit that, have VS Code on the cheap plan and ask GPT-5 or Gemini 2.5 Pro to do the fix. Often what one can't do seems easy for the other. I have the Claude Code plugin in VS Code, so it all works in the same interface, but of course with different subscriptions.
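For the pay-per-token testing route, something like this is enough to compare candidate models before spending on hardware. The model slug is just an example; OpenRouter's catalogue changes, so check the current list.

```bash
# Rough sketch of a smoke test against OpenRouter's OpenAI-compatible API.
# Swap the model slug to compare e.g. MiniMax M2, GLM 4.6, Qwen3 or DeepSeek.
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen/qwen3-coder",
        "messages": [{"role": "user", "content": "Refactor this function to remove duplication: ..."}]
      }'
```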
AmazinglyNatural6545@reddit (OP)
Thank you so much for your energy and all of this really useful info👍
Zc5Gwu@reddit
It’s ok for non-agentic stuff. It will hallucinate occasionally but probably fine for tests and small stuff.
xx_qt314_xx@reddit
just grab some openrouter credits and play around and see which models you can accept, and then if they fit in your VRAM.
Pristine-Woodpecker@reddit
I use Devstral or Qwen Coder for simpler things when I don't want to eat into my Claude quota (opencode, Crush, the frontend doesn't really matter). That's either on an older 24GB card or on a MacBook. Your statement that it's "more expensive than a non-Apple laptop with the same specs" is weird: then just buy whatever supposedly cheaper alternative there is? Ryzen AI laptops can run gpt-oss-120b, as can 24GB GPUs with MoE offload.
AmazinglyNatural6545@reddit (OP)
Unfortunately, unified-memory devices aren't as capable of handling Stable Diffusion / OCR, etc. Besides, their TTFT is not the most enjoyable thing in the world, especially if the input is a decent size (a big PDF, etc.).
noctrex@reddit
For small programming projects and scripts you can get by with local models like Qwen3-Coder or Devstral-Small-2507, but the limiting factor is the small context size.
Even with a Q4 quant of Devstral I'm limited to a max of about 80k context, so that the model doesn't spill over into RAM. So it's for limited use only, but very useful for some quick scripting.
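For reference, a rough sketch of the kind of launch that keeps a Q4 Devstral plus a large context inside 24GB; quantizing the KV cache is one lever for stretching the context budget. Exact flags vary by llama.cpp version, and the filename is a placeholder.

```bash
# Sketch: Q4 Devstral fully offloaded, ~80k context, KV cache quantized to q8_0
# so the whole thing stays in VRAM. Older builds use plain -fa instead of --flash-attn on.
llama-server \
  -m Devstral-Small-2507-Q4_K_M.gguf \
  -ngl 99 \
  --ctx-size 81920 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```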