Is there any local LLM that comes close to GPT-4 in reasoning and capabilities? Hardware suggestion?
Posted by ExtensionAd182@reddit | LocalLLaMA | View on Reddit | 46 comments
Hi everyone,
I'm looking for a local LLM solution that gets as close as possible to GPT-4 in terms of:
- Deep reasoning
- Research assistance (Deep research)
- Document drafting
- Coding (apps, websites, debugging, architecture)
- Image generation and analysis (Can create image but can also understand images i send)
- File analysis
- Summarization
- Strategy ideation
- Web search integration
Essentially, I need a powerful local assistant for daily professional work, capable of helping in a similar way to GPT-4, including creative and technical tasks.
My questions:
- Is there any model (or combination of tools) that realistically approaches GPT-4's quality locally?
- If so, what's the minimum hardware required to run it?
- CPU?
- GPU (amount of VRAM)?
- RAM?
- Or any AIO solutions / off-the-shelf builds?
I’m okay with slower speeds, as long as the capabilities and reasoning are solid.
Thanks in advance for any insights. I really want to break free from the cloud and have a reliable, private assistant locally.
Glittering-Koala-750@reddit
No local model is truly at GPT-4’s level in all respects—especially for nuanced reasoning, up-to-date knowledge, and reliability
Adventurous-Hat-4808@reddit
I guess at some point it will no longer be a new thing and these will come down in price, and better local options will develop. I will wait until then, rather than waste thousands on something that is still a bit crappy...
ExtensionAd182@reddit (OP)
Thanks, and what's the cheapest/minimum hardware I'd need to run Phi-4 or Llama 70B locally?
Glittering-Koala-750@reddit
I run Phi-4 on a 5090ti with 16GB VRAM; as a full PC it cost £1500. The others will require much more VRAM.
ExtensionAd182@reddit (OP)
Thanks! Just to clarify — there’s no such thing as a “5090 Ti” (yet), right? Maybe you meant 4090 Ti, or a 4090 with 24GB VRAM?
Also, you mentioned Phi-4: I’ve heard it’s very fast and compact, but can it really match models like GPT-4 or DeepSeek in reasoning and coding? Curious about your experience!
I'm trying to find the cheapest build possible that still gets close to GPT-4 quality — even if it's slower. That £1500 setup sounds promising though. Could you share full specs?
FullstackSensei@reddit
Nobody can answer whether any model can replace GPT-4o for you and your use cases but you. Cloud providers for those models might give you the wrong impression depending on which quantization they use and what system prompt they have. The only real option IMO is to download and test. Llama.cpp and ik_llama.cpp are your best friends here since they can run those models even if you don't have enough RAM, albeit at significantly reduced speeds. If you have the storage to download a GGUF, you can try it out and see if it fits your needs or not.
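If you'd rather test from Python than the raw llama.cpp CLI, the llama-cpp-python bindings wrap the same engine; a minimal sketch (the GGUF path is a placeholder for whatever model you download):
```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Placeholder path; point it at whatever GGUF you downloaded.
llm = Llama(
    model_path="./models/some-model-Q4_K_M.gguf",
    n_ctx=8192,        # context window to allocate
    n_threads=8,       # CPU threads for inference
    n_gpu_layers=0,    # 0 = pure CPU; raise it if you have VRAM to spare
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the tradeoffs of running LLMs locally."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```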
Glittering-Koala-750@reddit
No, I said no local LLM is close to GPT-4.
Glittering-Koala-750@reddit
Sorry, typo: 5070ti
LazyChampionship5819@reddit
What's the token speed ?
Secure_Reflection409@reddit
60 t/s at Q4 on a 4080S.
LazyChampionship5819@reddit
Thanks. And what's the GPU memory?
Secure_Reflection409@reddit
16GB
Glittering-Koala-750@reddit
I use a quant via llama and never measured it to be honest. Will have a look. I doubt it is fast!
LazyChampionship5819@reddit
Yes, please let me know. I'm trying to buy a GPU like yours. I just want to check the token speed for 30B and 70B vision models.
Healthy-Nebula-3603@reddit
Like GPT-4?
Did you sleep through the whole last year?
In all those aspects, any current 24B+ model will be better.
The best will be Qwen 3 32B or QwQ 32B, of course.
DepthHour1669@reddit
Hi, when you say GPT-4, which one do you mean? OpenAI released:
gpt-4-0314
gpt-4-32k-0314
gpt-4-0613
gpt-4-32k-0613
gpt-4-1106-preview
gpt-4-vision-preview
gpt-4-turbo-0125-preview
gpt-4-turbo-2024-04-09
gpt-4o-2024-05-13
gpt-4o-2024-08-06
gpt-4o-2024-11-20
jacek2023@reddit
I suggest testing DeepSeek online first, and if you're satisfied, switch to the local setup.
DepthHour1669@reddit
Hi, when you say GPT-4, which one do you mean? OpenAI released:
gpt-4-0314
gpt-4-32k-0314
gpt-4-0613
gpt-4-32k-0613
gpt-4-1106-preview
gpt-4-vision-preview
gpt-4-turbo-0125-preview
gpt-4-turbo-2024-04-09
gpt-4o-2024-05-13
gpt-4o-2024-08-06
gpt-4o-2024-11-20
Revolutionalredstone@reddit
deepcogito - very much GPT at home ;D
FullstackSensei@reddit
How much money do you have to throw at this? And how much time are you willing to spend setting this up? The full fat 671B deepseek should do the trick for anything text generation. For images, you'll probably have to try out a few models to see which suits you best. Deep research seems to be the hardest to replicate offline, but that's mostly a software issue.
For DeepSeek, depending on your expectations for speed, you're looking at minimum 2k for 2-3tk/s. For images you'll need 1-2 24GB GPUs just for this (on top of 1-2 for DeepSeek). The rest is suffering through software setup.
ExtensionAd182@reddit (OP)
I don't have a set budget, but I'm wondering what's the cheapest hardware to achieve what I need. I don't need top real-time speed; anything usable works.
FullstackSensei@reddit
Usable is very relative, and so is cheapest. Having targets for both, and knowing how much elbow grease you're willing to put in, helps guide recommendations. Otherwise, you'll get a lot of "the Mac Studio M3 Ultra with 512GB for 10k will do the job".
ExtensionAd182@reddit (OP)
You're right — let me clarify.
I’m looking for the cheapest possible setup that can run a GPT-4-like LLM locally with acceptable performance (even if slow), as long as the reasoning quality is strong.
I don't need real-time speed.
I’m okay with using terminal tools, CLI apps, and config files.
I can invest time and effort into setting things up, scripting, and tweaking — I just don't want to waste money on the wrong build.
What I need it to do:
High-quality text generation (reasoning close to GPT-4)
Code generation + bug fixing
Document writing
Web/app/software ideas
Web browsing integration
Image generation
So I'm not chasing real-time speeds or ultra-fast outputs; I just want a smart assistant I fully own and control, even if it takes a dozen minutes per output.
If I can start cheap and upgrade later, that would be ideal.
FullstackSensei@reddit
Depending on how much money you want to throw at the problem, all single-CPU, in increasing cost:
* Broadwell Xeon: E5-2699v4, E5-2698v4 or E5-2697v4 (I'd just go for the 99; the difference in price is negligible). I'd say this is the oldest platform to consider because older ones don't support FMA or F16C instructions, which are vital for CPU inference. You get four channels of DDR4-2400, which costs like 0.50/GB. There's no shortage of 2011-3 motherboards, but I'd stick with one that has IPMI and preferably ATX form factor to have an easier time fitting everything in a normal PC case. Aim for a board with 8 DIMMs so you can go for 512GB using 64GB DIMMs. This platform gives you 40 PCIe Gen 3 lanes. Most boards don't have an M.2 socket for NVMe, but that's fine. More on this in a bit.
* Cascade Lake Xeon: the refresh for socket LGA3647; a much bigger CPU (physically). You get six channels of DDR4-2933 for almost double the memory bandwidth of Broadwell. Retail CPUs are quite a bit more expensive than Broadwell, but there's a trick: Engineering Samples (ES), namely the QQ89 (search on eBay). It costs about the same as an E5-2699v4 but you get 24 cores. It has the same CPUID as retail models, so BIOS compatibility is not an issue even with the latest BIOS. You also get AVX-512, but there's little evidence as of now that it speeds up inference on this platform. You get 48 Gen 3 PCIe lanes, and a lot of boards expose two x16 slots. You also usually get at least one M.2 NVMe. Six channels give you 384GB RAM with 64GB modules. 2933 memory can sometimes be found at ~0.70/GB, but 2666 can be found for ~0.60/GB without much hassle. It also supports Optane PDIMMs if you're into that (STH has a great guide on Optane PDIMMs).
* Epyc Rome or Milan: this is the fastest DDR4 platform you can get without cost going up significantly. You get 8 channels of DDR4-3200 for 45% more bandwidth vs Cascade Lake (quick napkin math after this list). 3200 memory is at least 0.80/GB, but going for 2933 can lower your cost without much impact on speed; 2666 memory will further lower cost at a further reduction in speed. There are a few nuances with these CPUs: Epyc is for some reason allergic to Hynix LRDIMM memory with some boards (Samsung LRDIMMs are fine, and you can have 2-3 Hynix modules mixed with Samsung without issue, but YMMV). Epyc's real-world memory bandwidth is heavily dependent on the number of CCDs in the CPU; you really need all 8 to be populated to get the maximum performance out of the memory controller. The easiest way to get that is to look for models with 256MB L3 cache. I'd say go for at least 32 cores. I have a bunch of 7642s (48 cores) and they're super nice and can be found cheaper than 32-core models (less known, I guess). For the extra cost you also get 128 lanes of PCIe Gen 4 to play with. ATX board models aren't as plentiful as the Xeons above because the CPU is even bigger than LGA3647 with two more memory channels, but those that are out there usually have at least three x16 slots. First-gen Epyc boards like ASRock's EPYCD8-2T or Supermicro's H11SSL will operate those slots at Gen 3 speed though. Beware of Dell/Lenovo locked CPUs: unlocked CPUs become brand-locked if installed on a Lenovo/Dell board. Also be careful of models that have a letter in the middle; some are heavily neutered, and some don't work without a modded BIOS. These are Epyc-only issues.
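Quick napkin math on those bandwidth figures, theoretical peak only (real-world will be lower, especially on Epyc where CCD count matters):
```python
# Theoretical peak memory bandwidth = channels * transfer rate (MT/s) * 8 bytes per transfer
def peak_gb_s(channels: int, mt_s: int) -> float:
    return channels * mt_s * 8 / 1000  # GB/s

broadwell    = peak_gb_s(4, 2400)  # ~76.8 GB/s
cascade_lake = peak_gb_s(6, 2933)  # ~140.8 GB/s
epyc_rome    = peak_gb_s(8, 3200)  # ~204.8 GB/s

print(broadwell, cascade_lake, epyc_rome)
print(f"Epyc vs Cascade Lake: +{(epyc_rome / cascade_lake - 1) * 100:.0f}%")  # ~+45%
```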
Which one of those to choose will depend on what you consider acceptable performance, even if you don't need real-time. This is a very subjective thing. I have all three in my home lab. I'd suggest looking for a good deal on a motherboard that ticks all the boxes you need, and building a system around that. A couple of months ago I got an ASRock LGA3647 board for 70 because it had some bent pins and a broken VGA output. The VGA is useless to me anyway since I use the IPMI KVM. The pins took 20 mins to fix using my phone's camera and a pair of fine tweezers.
For storage on any of these platforms, I strongly suggest skipping M.2 and going for HHHL PCIe NVMe storage cards (Google it). Those tend to be cheaper per TB, offer higher speeds, and have at least an order of magnitude better endurance. Even one with 50% life left will have something like 10x the endurance of a brand new consumer SSD of the same capacity. I have Samsung's 3.2TB PM1725 and they're great. HHHL cards plug into any PCIe slot and work as boot drives on most Broadwell boards (I'm not aware of any from a major brand that doesn't work).
Ebay has a lot of coolers for LGA3647, but if you want something quiet look for Asetek's LC570 LGA3647 on ebay. Been using them for over two years and really like them. For Epyc, socket SP3 is the same as threadripper TR4 and TRX40, so any cooler for any of those will work fine.
For GPUs, the best performance for the buck in the 24GB category is the 3090. They're down to under 600 a pop if you're willing to search locally in classifieds, and they offer the easiest software setup. The best value for the money IMO is the Arc A770; it's down to around 200 for the 16GB model. Software setup is slightly more involved than the 3090, but not by much, and performance is great for the price. You can fit two 3090s or 3-4 A770s on a single ATX board depending on slot and lane organization. Arc really needs ReBar support to stretch its legs; for that, ReBarUEFI is your friend. The project has a long motherboard compatibility list.
How much VRAM you'll need, and by extension how many GPUs, will depend on how many models you'll need to run in parallel (for ex: deepseek for text and Gemma 3 for image, maybe a 3rd for OCR). You'll also need to keep the KV cache at least at Q8 to have good performance, better yet fp16. That will impact how much VRAM you need. I'd suggest starting with one 3090 or two A770s. You can always add more if you feel you need the room.
For starters, download some models on whatever hardware you have and play with CPU inference using llama.cpp. Adjust the number of threads, throw some real use cases at it, and watch the output speed. Don't pay attention to the response itself; just figure out how slow is too slow for you, even if you don't need real-time. Also consider running multiple models for your different tasks, e.g. Qwen 2.5 Coder or GLM-4 32B for coding, Gemma for image, etc. I think you should spend quite a bit of time reading and understanding the software side of running LLMs locally before spending on hardware, to understand how model quantization, KV cache quantization, and model settings (temperature, samplers, etc.) affect performance. Unsloth's documentation pages are your friend for figuring out the right values for each model, and ChatGPT is your friend for understanding what each of those settings does.
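Something like this rough sketch (again with the llama-cpp-python bindings; model path and prompt are placeholders) is enough to get a tokens/s number at different thread counts:
```python
# pip install llama-cpp-python
import time
from llama_cpp import Llama

MODEL = "./models/some-model-Q4_K_M.gguf"   # placeholder path
PROMPT = "Write a short design doc for a REST API that tracks inventory."

for n_threads in (4, 8, 16):
    # Reloading the model per thread count keeps the sketch simple; fine for a one-off test.
    llm = Llama(model_path=MODEL, n_ctx=4096, n_threads=n_threads, n_gpu_layers=0, verbose=False)
    start = time.time()
    out = llm(PROMPT, max_tokens=256)
    elapsed = time.time() - start
    generated = out["usage"]["completion_tokens"]
    print(f"{n_threads} threads: {generated / elapsed:.1f} tokens/s")
```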
Finally, keep in mind things are still changing very rapidly in this field. Everyone is moving to MoE models, smaller models are still getting increasingly smarter, and what might not be possible/true today could very much become reality in a couple of months.
eaz135@reddit
Very useful info. I don't mean this in a troll way, but going back to your point about claims/suggestions that "the Mac Studio M3 Ultra with 512GB for 10k will do the job": for something that's as easy as adding to cart on the Apple website, how will it compare to a dual-3090 setup with a Cascade Lake Xeon?
FullstackSensei@reddit
In sheer tk/s? It will heavily depend on the size of the models you run. One thing is for sure: prompt processing will be a LOT faster on the 3090s.
You can buy more than four Cascade Lake systems (each with two 3090s) for the price of that Mac Studio. Heck, my watercooled quad-3090 Epyc rig cost significantly less than half the Mac Studio. Even if the Mac were 2x as fast and 5x as power efficient, I don't think 10k is a reasonable price.
Left_Stranger2019@reddit
Heck, relative to generated income, 2x speed + 5x power efficiency + grab n go might be
FullstackSensei@reddit
Except it's like 1/3 the speed. So if you measure generated income by tk/s, unless your income is like 0.20/hour, then you're losing a lot of income by the slowdown of having to wait for the Mac to generate output.
Left_Stranger2019@reddit
They aren’t going to add PC building departments. You will get box and you will use Box.
National_Meeting_749@reddit
Brother. Actually think about and respond to stuff yourself too, don't just have GPT do it.
You didn't answer his questions. GPT just restated your entire post.
What do you consider to be usable in terms of tokens per second? And what's your maximum budget for right now?
caetydid@reddit
I am ordering a workstation with dual RTX 5090s and 768GB of DDR5 RAM. I suppose I could run DeepSeek on it, but I have no idea what the speed would be, since most layers will still reside in RAM, offloaded from the GPUs. It costs approximately 18k.
FullstackSensei@reddit
Which CPU(s) are you getting? Check ik_llama.cpp and ktransformers for what performance to expect depending on CPU.
eaz135@reddit
I'm curious what your main use case is for that beast?
tat_tvam_asshole@reddit
tbh, it's not about a single model anymore; that's an outdated paradigm. It's about having a model that is smart enough to treat the Internet as a RAG database with MCP tooling, and become smart in situ, rather than being totally capable as-is (barring offline use ofc).
Context history plus fine-tuning for tool usage, including leveraging larger models if absolutely necessary or permissible, is the more efficient way to get at/above GPT-4 levels locally.
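Not MCP itself, but a minimal sketch of that idea using plain OpenAI-style tool calling against a local llama.cpp server (assumes llama-server is running on localhost:8080 with a model that handles tool calls; web_search is a hypothetical tool you'd wire up yourself):
```python
# pip install openai
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API; the key is unused locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool you implement yourself (SearxNG, Brave API, etc.)
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local",  # llama-server serves whatever model it was started with
    messages=[{"role": "user", "content": "What changed in the latest llama.cpp release?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    # The model decided it needs fresh data; run the search and feed the results back.
    print("Model requested:", msg.tool_calls[0].function.name, msg.tool_calls[0].function.arguments)
else:
    print(msg.content)
```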
Double_Cause4609@reddit
GPT-4 is... hard to define. Do you mean the actual full-fat GPT-4 model? GPT-4o? o3?
In terms of the original GPT-4 model, I think that actually (hot take) modern 32B models are generally on par with it, if not better, with a special shoutout to Gemma 3 27B QAT for being quite easy to run.
In that case, I'd figure you'd probably want around 24GB of VRAM on the low end, 32GB being preferable if possible.
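Rough napkin math behind those VRAM numbers, counting weights only (KV cache and runtime overhead add a few more GB on top):
```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    # bytes = params * bits / 8; divide by 1e9 for GB
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weights_gb(27, 4.5))  # Gemma 3 27B at ~Q4: ~15 GB
print(weights_gb(32, 4.5))  # a 32B model at ~Q4: ~18 GB -> fits in 24GB with room for context
print(weights_gb(32, 6.0))  # same model at ~Q6: ~24 GB -> where 32GB starts to matter
```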
If you want something like o3-mini, the Qwen 235B MoE is pretty good, but the only cost-effective way to run it for single-user inference is on CPU, which may require navigating the used server market. I have a friend who set up a custom deep research pipeline and it works like a charm.
Image generation is a bit nuanced because local generally lags behind in ease of use but excels at specific workflows. Generally you're going to be doing a lot of configuration and research, essentially becoming something in between a prompter and a Blender 3D artist to get what you want out of it. Some of this can be waived with agentic frameworks that can interpret your feedback for you, but it will still take quite a bit to get going. If you want something like GPT-4o's image generation, I guess Bagel technically works (and there might be one or two others like it), but I actually think building bespoke workflows in ComfyUI for common tasks, plus a dedicated agent to interpret your requests, might work better. You may be able to do some fun things with embedding projection for semantic generation, too.
For image analysis, it's a bit harder. Technically Mistral Small and Gemma 3 have vision adapters, but I'm not sure how good they really are. Personally, my preference is to use bespoke image analysis models and pipelines that I have standard LLMs interpret the output of for me (usually paired with a prior encoded in a knowledge graph) but everyone has their preferences.
For summarization and research it gets a bit complicated. There's a little bit of overhead in building systems that can actually search the open web, and there are a lot of problems you don't expect to need to deal with at first. How do you handle arbitrary images? PDFs? They don't sound that complicated at first, but producing a general-purpose system that can handle all of them is more than the full-time job of a single person; handling variable web-native data is the job of entire companies.
With that said, if you want to produce a local research stack that can handle documents and research that you format into markdown yourself (or use some light cloud services, special shoutout to Mistral for having some great tools), you can get some pretty rich results out of local processing that actually outperform cloud systems if you're willing to put some thought into your local data.
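As a very stripped-down sketch of that "turn the page into plain text yourself, let the local model do the rest" idea, assuming a llama.cpp server (llama-server) running locally with its OpenAI-compatible endpoint:
```python
# pip install requests openai
import re
import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # local llama-server endpoint

def fetch_as_text(url: str) -> str:
    """Crude HTML-to-text; a real pipeline would use a proper extractor and handle PDFs/images separately."""
    html = requests.get(url, timeout=30).text
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text)[:20000]  # truncate to stay within the model's context

def summarize(url: str) -> str:
    doc = fetch_as_text(url)
    resp = client.chat.completions.create(
        model="local",
        messages=[
            {"role": "system", "content": "Summarize the document into concise bullet points with key facts."},
            {"role": "user", "content": doc},
        ],
    )
    return resp.choices[0].message.content

print(summarize("https://example.com"))
```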
VegaKH@reddit
An option that is a little more reasonable to run on consumer hardware than Deepseek is Qwen3-235B-A22B, which can run slowly on a system with a good CPU and 256 GB of DDR5 RAM at FP8. Results should be similar to what you'll get running the same model on OpenRouter. I would suggest depositing $10 there, and using OpenRouter chat to see how you like the model before purchasing the hardware. Benchmarks don't tell the whole story, but according to LLM Arena, this model beats GPT-4 in development, but is behind on general text queries. Artificial Analysis has this model well ahead of GPT-4. It is also well ahead of GPT-4 on Aider's leaderboard.
If you add a high end GPU like a 5090 and offload some layers, it will be much faster, but also a lot more expensive to build.
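If you do the OpenRouter try-before-you-buy test from code rather than the chat UI, it speaks the standard OpenAI API; the model slug below is my best guess for the Qwen3-235B-A22B listing, so verify it on the site:
```python
# pip install openai
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen/qwen3-235b-a22b",  # slug is an assumption; check openrouter.ai for the exact name
    messages=[{"role": "user", "content": "Refactor this function and explain the changes: ..."}],
)
print(resp.choices[0].message.content)
```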
stiflers-m0m@reddit
What a lot of folks find (me included when I went down this route) is that there will not be one model to rule them all. Good RAG models are poor generalists, good vision models are meh at coding, and good coding models are meh at everything else. I have 4 A4000s since I wanted VRAM density: 20 gigs a pop, got them used for under a grand each, and they take up four PCIe slots.
Buy an expandable rig. Start using it and add to it. Once you actually start your workflow and see how it works, you can iterate.
rockybaby2025@reddit
Agree. Every model got their pros and cons
e79683074@reddit
GPT-4 is 2023 tech. A lot has happened since then. Now it's a legacy model that nobody uses.
ExtensionAd182@reddit (OP)
What do you mean no one uses it? GPT-4o is the default model of ChatGPT.
ladz@reddit
For coding in my usage (full stack browser dev), Qwen 3 32b quantized to q5 easily beats GPT-4o and it runs on a single v100 at 25t/s.
llmentry@reddit
GPT-4o != GPT-4
Entirely different models; supposedly 4o has ~1/8th the parameters of GPT-4. And 4o is now effectively superseded by GPT-4.1 (again, entirely different model), at least for us API users.
GPT-4 is a name I've not heard in a long time. A long time.
Ok-Lobster-919@reddit
Compared to claude 4 or gemini 2.5, it's a child.
After using Claude 4 Opus: holy shit we aren't ready for the future. Thing could like... one-shot a client/server MMORPG architecture or something.
madaradess007@reddit
qwen3:8b is better than gpt-4
i run it on an M1 MacBook Air with 8GB
Glittering-Koala-750@reddit
No, it is not. It is 8B and will never be as good as GPT-4. And yes, I have used it.
custodiam99@reddit
LiveBench