TheaterFire

Which model is running on your hardware right now?

Posted by Everlier@reddit | LocalLLaMA | View on Reddit | 145 comments

Reply with just a model name, upvote if somebody already mentioned the model you're running

Reply to Post

145 Comments

getfitdotus@reddit

R1-70b distill FP8
View on Reddit #48546545

No_Force_7468@reddit

What's your hardware like? Which gpu you using?
View on Reddit #49400054

getfitdotus@reddit

4x rtx ada a6000
View on Reddit #49401630

minpeter2@reddit

\*\*QWEN 2.5 72B Instruct\*\*
View on Reddit #48578419

No_Force_7468@reddit

What hardware are using ? Which gpu
View on Reddit #49399863

minpeter2@reddit

a100 80gb x2
View on Reddit #49400105

Everlier@reddit (OP)

Mistral Small q4 via Ollama
View on Reddit #48539826

Fingyfin@reddit

This but non quantisized, regular Mistral Small
View on Reddit #48541808

Vegetable_Sun_9225@reddit

why non quantized?
View on Reddit #48565944

Fingyfin@reddit

I have the VRAM for it and only use it through Ollama/OpenWebUI. Haven't really looked into quantized models yet, awaiting for another GPU before building a dedicated rig for it.
View on Reddit #48728985

Zenobody@reddit

You're running Mistral Small BF16? Might as well run Large Q3 (or higher if possible), no?
View on Reddit #48546726

Evening_Ad6637@reddit

Similar, mistral small, but q6 and via llama.cpp/llama-server
View on Reddit #48548936

ShitPoastSam@reddit

I have 8gb vram. Any idea if this fits?
View on Reddit #48679146

MaxDPS@reddit

Same. This model is really solid. Super quick as well.
View on Reddit #48561455

YordanTU@reddit

Mistral-Small-3 Q4\_K\_M with KoboldCPP
View on Reddit #48548342

Herr_Drosselmeyer@reddit

Q5 via Oobabooga WebUI but still upvoted since it's basically the same.
View on Reddit #48541245

BumbleSlob@reddit

Deepseek-R1-Distilled-Qwen-32b On my M2 Max MacBook Pro I get around 14 tokens per second. I find it super useful to understand why the model is responding any particular way, such that I can refine my prompt if it goes off on a tangent. I’m also very much on team “making a model second guess itself is beneficial” although if the rumors about models being able to do this in latent space in the future are true, that may change. But until a latent space reasoning model actually arrives, that’s me. 
View on Reddit #48702300

overnightmare@reddit

Mistral small 24B q4_K_M
View on Reddit #48546635

Heavy_Ad_4912@reddit

I was thinking of using it too, are you using it for agentic ops as well?
View on Reddit #48550328

overnightmare@reddit

Sorry, I don know I’ve never used agents
View on Reddit #48640531

TitoZola@reddit

Deepsex 14b
View on Reddit #48541200

koalfied-coder@reddit

How is it with coding? I still need to try
View on Reddit #48559656

kaisurniwurer@reddit

Coding? Is this all you guys think about this days?
View on Reddit #48588217

koalfied-coder@reddit

Whatcha mean these days?
View on Reddit #48589583

CV514@reddit

Tomorrow I'll think about Enterprise Resource Planning too. I'll use GodSlayer-12B-ABYSS.i1-Q5_K_M for that. Otherwise, coding is all I think about.
View on Reddit #48605099

Conscious-Tap-4670@reddit

I thought this model name was a joke, but here we are.
View on Reddit #48628551

daaangerz0ne@reddit

I also choose this guy's model
View on Reddit #48545546

narcomo@reddit

Who wouldn’t
View on Reddit #48614901

Selphea@reddit

Chronos Gold 12b
View on Reddit #48625085

ttkciar@reddit

[Phi-4-45B](https://huggingface.co/ehristoforu/phi-4-45b) I'm evaluating it against Phi-4 and Phi-4-25B.
View on Reddit #48623493

OkLynx9131@reddit

Genuine question, is there some sort of leader board which shows the best avg consumer friendly model? For overall tasks? I don't want DeepSeek r1 670B model which i can't run. Possibly models below 25B params?
View on Reddit #48562025

Everlier@reddit (OP)

[Open LLM Leaderboard 2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?params=3%2C7) is meant to be such place, most complete and up-to-date board, but of course in this field it's impossible to create any kind of long-living accurate ranking - it get's obsolete sooner than you deploy it
View on Reddit #48563821

OkLynx9131@reddit

Holy shit. Thank you so much! It's super helpful.
View on Reddit #48623420

rdkilla@reddit

LLama3.3:70b-instruct-Q8\_0
View on Reddit #48549121

pathfinder6709@reddit

What hardware and TTFT and tok/s?
View on Reddit #48604413

rdkilla@reddit

4xP40 12 seconds TTFT, 3.5 tok/s . its really slow i have some work to do
View on Reddit #48612243

martinerous@reddit

Gemma 2 27B Q6. Mistral Small 22B Q6. I know there is the new Mistral Small 24B but it is worse for my use case.
View on Reddit #48608969

__some__guy@reddit

Mistral Small (22B/24B), with a 3 bit quant, is the largest that fits in my RTX 3060. Noticeably better than the next smaller size of models. RTX 3090 soon.
View on Reddit #48607581

Cruel_Tech@reddit

DeepSeek R1 Qwen 2.5 32b the 5 bit medium quant run via llama.cpp server on my 3090. I also run llama 3.2 8b on a 2080 that I use for title and tag generation in Open WebUI
View on Reddit #48603357

harsh_khokhariya@reddit

llama 3.2 3b
View on Reddit #48540846

Heavy_Ad_4912@reddit

I have seen many people saying it performs really well for its size, how is it compared to llama3.18b ?
View on Reddit #48550419

umataro@reddit

Obviously, it contains less knowledge. To the point where it's tragically bad at answering almost anything. But if you let it search the web for you, it's pretty useful for summarising the results. You'll never skim 10 webpages and find what you're looking for as quickly as this does.
View on Reddit #48599249

harsh_khokhariya@reddit

i don't know about 3.1 8b, but for a 3b model, it works very well, and with great speed, using it as a router or a sentence analyzer
View on Reddit #48581412

minpeter2@reddit

Still 8B is a bit lower, at least in my usage environment.
View on Reddit #48552532

getmevodka@reddit

Llama 3.3 70b
View on Reddit #48540826

caetydid@reddit

A quant which uses 34Gb of memory. It is a bit slow but quality is great!
View on Reddit #48543109

umataro@reddit

So close yet so far... If only there was a way to run it on my 32GiB macbook. I mean, qwen2.5-coder:32b is very good but llama 3.3 70b is a great complement for sanity checking the generated code
View on Reddit #48598787

koalfied-coder@reddit

GOAT
View on Reddit #48559667

getmevodka@reddit

it really is 😂
View on Reddit #48560699

jrherita@reddit

Human Brain (but need to get back to playing with LLMs)
View on Reddit #48546298

CattailRed@reddit

...you have a human brain hooked up to hardware in your basement?
View on Reddit #48549788

jrherita@reddit

Bedroom actually
View on Reddit #48596354

Everlier@reddit (OP)

Plot twist: you're not joking, just 100% sure nobody would believe you have a hardware to run a human brain type of network
View on Reddit #48549708

Al-Ei@reddit

DeepSeek-R1-Distill-Qwen-1.5B
View on Reddit #48593378

_supert_@reddit

- Mistral large (123B) 4.5bpw exl2 - Mistral small (24B) exl2
View on Reddit #48579654

Everlier@reddit (OP)

aka "We have Le Chat at Home"
View on Reddit #48580001

_supert_@reddit

I keep coming back to them. I've used R1 over API (deepinfra) and many others locally. But Mistral and Claude just have the most amenable personalities.
View on Reddit #48589578

maddogawl@reddit

Mistral Small 2501 q4 via LMStudio
View on Reddit #48589549

justGuy007@reddit

DeepSeek-Coder-v2:16b
View on Reddit #48587115

zabirauf@reddit

deepseek-r1:70b-llama-distill-q4\_K\_M
View on Reddit #48582660

Glittering_Mouse_883@reddit

Athene-v2 in ollama
View on Reddit #48582274

EsotericTechnique@reddit

Dolphin 3 8B
View on Reddit #48580311

Secure-Item3619@reddit

Mistral Small 24b q8 MLX
View on Reddit #48579508

townofsalemfangay@reddit

Unsloths Deepseek 671b R1.. in all it's glory at 2.5tks 😂
View on Reddit #48563282

cher_e_7@reddit

same- but 2.51 quant on epyc 7713 at 10.7 t/s and Q4\_K\_M at 6.7 t/s
View on Reddit #48579412

klam997@reddit

dolphin3 llama 8b
View on Reddit #48576746

Timely-Ad-2597@reddit

All deepseek Distills!
View on Reddit #48575387

dasnihil@reddit

fuseo1-r1-t1 something something q4 gguf
View on Reddit #48546908

YearnMar10@reddit

FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview ?
View on Reddit #48550425

dasnihil@reddit

yes exactly that, q4.
View on Reddit #48562702

YearnMar10@reddit

Tried the flash model of that already? I am curious if it’s better. I didn’t have time yet
View on Reddit #48572382

dasnihil@reddit

yes, the one i'm using is this: FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview-Q4, i have tested tool use with it, the system prompt is a hit or a miss with this model. it works when it works.
View on Reddit #48574835

celsowm@reddit

Old and good llama 3.1 8b 4bits
View on Reddit #48549010

cleverusernametry@reddit

Why..? I stopped using it months ago
View on Reddit #48574511

celsowm@reddit

For portuguese texts still the best on this size
View on Reddit #48574765

FrederikSchack@reddit

Please help benchmarking your hardware and compare with others, small test: https://www.reddit.com/r/LocalLLaMA/s/8JGE00PItl
View on Reddit #48572531

Rustybot@reddit

Deepseek-r1:14b
View on Reddit #48558500

FrederikSchack@reddit

Would you like to help comparing hardware by making a small test: https://www.reddit.com/r/LocalLLaMA/s/8JGE00PItl
View on Reddit #48572323

Western_Courage_6563@reddit

Gemma2 2b. On my phone.
View on Reddit #48553024

hannibal27@reddit

Which application do you use?
View on Reddit #48569884

Western_Courage_6563@reddit

Pocket pal
View on Reddit #48571163

Everlier@reddit (OP)

I also have one in my pocket, it's crazy that we got such tech in our lifetime
View on Reddit #48558415

RandumbRedditor1000@reddit

Mistral small 24B, and DeepScaler
View on Reddit #48552419

Everlier@reddit (OP)

Regarding the latter - testing or applying it to something?
View on Reddit #48558176

RandumbRedditor1000@reddit

Mostly testing rn, it seems to be very good at math
View on Reddit #48570162

Perturbee@reddit

Llama-3.1-8B-Lexi-Uncensored-V2
View on Reddit #48569563

j_sequeira@reddit

Unsloth's DeepSeek R1 IQ1\_S at 3 t/s
View on Reddit #48565217

Everlier@reddit (OP)

The instance when 3 t/s is blazingly quick, kudos!
View on Reddit #48567395

dreamai87@reddit

Qwen-2.5-coder-14b with cline and continue dev. Running model using llama-cpp with 2 batch inference mode of context memory 13k each on MacBook M2 Pro (32gb)
View on Reddit #48564315

spac420@reddit

gemma 9b. dont @ me!
View on Reddit #48564178

Thrumpwart@reddit

Qwen 2.5 14B 1M context. I can fit 210,000 context and Q8 GGUF into the Chunky Boi :)
View on Reddit #48551525

Everlier@reddit (OP)

That's very nice! Having a 14B that can see entire repo would be pretty awesome. I was wondering how these 1M Qwens perform when roped even further past their original training. There was [a paper](https://arxiv.org/abs/2502.05167) last week saying that basically most of context scaling methods scale quite poorly outside literal matching.
View on Reddit #48558061

Thrumpwart@reddit

I can fit 500,000k context on my M2 Ultra :) it's pretty awesome.
View on Reddit #48558396

Everlier@reddit (OP)

Insane, I'm very curious if it can really deliver on its "codebase-level" completion feature in such mode (although it's from Coder fine-tunes, and I'm not sure if 14B has it). p.s. Can't even imagine how long PP takes though :D
View on Reddit #48558889

Thrumpwart@reddit

Yeah PP takes forever on the M2 Ultra. However, with the right prompt and a good idea of what I want to get out of each inference run I'm willing to wait. Using the 14B model it does seem proficient with code. Obviously not as good as the 32B coder model (which I absolutely LOVE - especially the Unsloth fine tune with 128k context), but I load an entire 170,000 token code base into the context along with some other documents and it does well.
View on Reddit #48560213

Everlier@reddit (OP)

That's an experience I'd like to try some day, awesome setup!
View on Reddit #48560571

Thrumpwart@reddit

Thank you!
View on Reddit #48562252

Anyusername7294@reddit

Deepscaler
View on Reddit #48550973

Everlier@reddit (OP)

Just for funsies or for formulaic reasoning?
View on Reddit #48557718

Anyusername7294@reddit

From what learned, this is the best <2B model there. It's better than most 8B models I tried
View on Reddit #48560830

koalfied-coder@reddit

neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8
View on Reddit #48559615

Everlier@reddit (OP)

Are you rocking vllm or nm-vllm?
View on Reddit #48560447

koalfied-coder@reddit

VLLM python -m vllm.entrypoints.openai.api\_server \\ \--model neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8 \\ \--gpu-memory-utilization 0.95 \\ \--max-model-len 8192 \\ \--tensor-parallel-size 4 \\ \--enable-auto-tool-choice \\ \--tool-call-parser llama3\_json
View on Reddit #48560742

Kooky-Somewhere-2883@reddit

Deepseek-R1 604B (sorry its my dream)
View on Reddit #48543017

koalfied-coder@reddit

lol same
View on Reddit #48559708

synw_@reddit

Qwen 2.5 coder 32b IQ4_XS
View on Reddit #48543407

Adro_95@reddit

Is that usable on a 16gb card?
View on Reddit #48554679

synw_@reddit

Probably not with good context and speed: try the 14b coder with a higher quant
View on Reddit #48556637

Adro_95@reddit

Thanks
View on Reddit #48559050

nathan-portia@reddit

Deepseek 14b, but also trying to play around with multimodal and image enabled models.
View on Reddit #48555831

Everlier@reddit (OP)

Nice, 14b with low/mid context is also one of my go-to's. Can't wait for a more user-friendly way to run Qwen 2.5 VLs
View on Reddit #48558680

hiper2d@reddit

My daily driver on 16Gb VRAM though Ollama: [https://huggingface.co/bartowski/cognitivecomputations\_Dolphin3.0-R1-Mistral-24B-GGUF](https://huggingface.co/bartowski/cognitivecomputations_Dolphin3.0-R1-Mistral-24B-GGUF) (IQ4\_XS) Fast, uncensored, with thinking, and the best of everything I've tested so far.
View on Reddit #48555510

Everlier@reddit (OP)

I tried it a couple of days ago - was OK-ish, but seemed a bit overcooked and hard-to-steer, I did only a few surface-level tests though You might find [Harbor](https://github.com/av/harbor) interesting, same setup as you described + more, all in one CLI + App
View on Reddit #48558562

p4s2wd@reddit

Mistral-Large-Instruct-2411-AWQ with 128k context-length and gets 17 - 19 t/s
View on Reddit #48552989

Everlier@reddit (OP)

O\_O You've got quite a setup, vllm, aphrodite, KTransformers, SGLang or something else?
View on Reddit #48558378

Heavy_Ad_4912@reddit

Guys, which is better sdl3.5 large or flux-1 dev? Has anyone used both of them, what's your opinion?
View on Reddit #48550541

Everlier@reddit (OP)

Flux is better, but it's subjective. SDL has a very specific look to it that lacks details, Flux loosk more defined and "on purpose"
View on Reddit #48557653

Wubbywub@reddit

llama3.1:8b
View on Reddit #48557162

guigouz@reddit

qwen2.5-coder
View on Reddit #48557099

fab_space@reddit

Qwen coder
View on Reddit #48553025

meepbob@reddit

deepseek-r1-distill-qwen-32b@iq3\_m
View on Reddit #48552082

AaronFeng47@reddit

Qwen2.5-32B-Instruct-Q4\_K\_L
View on Reddit #48549025

Guna1260@reddit

Athene using aphrodite-engine
View on Reddit #48548703

AppearanceHeavy6724@reddit

Mistral Nemo-2407 Q4
View on Reddit #48548183

OmarBessa@reddit

A finetune of DeepSeek Distill Qwen 32B
View on Reddit #48548154

terminoid_@reddit

EXAONE 32B, still the best model than can run on 16GB, imo
View on Reddit #48542191

RapidRaid@reddit

Do you use flash attention / lower quants? Or does it fit without it?
View on Reddit #48546759

terminoid_@reddit

I use Q4KM or Q3KM quants.
View on Reddit #48547498

RapidRaid@reddit

phi4:14b on a Mac Mini M4 (base model) as dedicated home server. It works okayish. A bit too slow for my taste, but the results are pretty good for a local model. I'll probably upgrade to the M4 Studio, once it comes out (for bigger models + fast compute).
View on Reddit #48546644

Brandu33@reddit

huihui\_ai/deepseek-r1-abliterated:14b is on at this very moment, not sure of it yet.
View on Reddit #48546510

ee_di_tor@reddit

patricide-12B-Unslop-Mell, Q5\_K\_M
View on Reddit #48545523

ProfitRepulsive2545@reddit

Mistral-Large-Instruct-2411 (IQ3\_XXS)
View on Reddit #48545095

dmter@reddit

R1 671B Q1.5 using llama.cpp works surprisingly well off my 128GB, 3090, with nvme to keep the model. R1 distill llama 3.3 70B surprisingly is not much faster so I'd rather just ask 671B as it makes better think and code.
View on Reddit #48544260

Dgamax@reddit

how long it takes to answer ? that's crazy it can run on this config :o
View on Reddit #48544547

dmter@reddit

Many hours if you want a complex function of about 100 lines. However half of that time it thinks the tests and does side functions so you can get good ideas from think already after 10-20 ninutes.
View on Reddit #48544864

unlikely_ending@reddit

Llama 3.2 1B
View on Reddit #48544023

Rachados22x2@reddit

None !
View on Reddit #48543957

lolzinventor@reddit

Meta-Llama-3.1-8B-5000-LuV1-F16.gguf
View on Reddit #48543155

FullstackSensei@reddit

Llama 3.3 70B, Llama 3.1 70B Nemotrom,, Qwen Coder 2.5, and Qwen 2.5 72GB.
View on Reddit #48542839

Expensive-Paint-9490@reddit

DeepSeek-R1-IQ4\_XS.
View on Reddit #48541826

fizzy1242@reddit

Qwen 72b
View on Reddit #48541149

ashrafazlan@reddit

EVA LLaMA 3.33 70B
View on Reddit #48540755

Yagnikanna_123@reddit

Gemma2:2b via ollama
View on Reddit #48540460

TotalStatement1061@reddit

CoderO1-deepseekr1-coder-14B-preview
View on Reddit #48540335

CattailRed@reddit

DeepSeek-V2-Lite
View on Reddit #48540142