Which model is running on your hardware right now?

No_Force_7468@reddit

What's your hardware like? Which GPU are you using?

No_Force_7468@reddit

What hardware are you using? Which GPU?

Fingyfin@reddit

This, but non-quantized, regular Mistral Small

Fingyfin@reddit

I have the VRAM for it and only use it through Ollama/OpenWebUI. Haven't really looked into quantized models yet; waiting for another GPU before building a dedicated rig for it.

Zenobody@reddit

You're running Mistral Small BF16? Might as well run Large Q3 (or higher if possible), no?

Evening_Ad6637@reddit

Similar, mistral small, but q6 and via llama.cpp/llama-server

ShitPoastSam@reddit

I have 8gb vram. Any idea if this fits?

MaxDPS@reddit

Same. This model is really solid. Super quick as well.

YordanTU@reddit

Mistral-Small-3 Q4\_K\_M with KoboldCPP

Herr_Drosselmeyer@reddit

Q5 via Oobabooga WebUI but still upvoted since it's basically the same.

[-]

Deepseek-R1-Distilled-Qwen-32b. On my M2 Max MacBook Pro I get around 14 tokens per second. I find it super useful to understand why the model is responding in a particular way, so that I can refine my prompt if it goes off on a tangent. I’m also very much on team “making a model second guess itself is beneficial”, although if the rumors about models being able to do this in latent space in the future are true, that may change. But until a latent space reasoning model actually arrives, that’s me.

overnightmare@reddit

Mistral small 24B q4_K_M

Heavy_Ad_4912@reddit

I was thinking of using it too, are you using it for agentic ops as well?

overnightmare@reddit

Sorry, I don't know, I’ve never used agents

TitoZola@reddit

Deepsex 14b

koalfied-coder@reddit

How is it with coding? I still need to try

kaisurniwurer@reddit

Coding? Is this all you guys think about these days?

koalfied-coder@reddit

Whatcha mean these days?

CV514@reddit

Tomorrow I'll think about Enterprise Resource Planning too. I'll use GodSlayer-12B-ABYSS.i1-Q5_K_M for that. Otherwise, coding is all I think about.

Conscious-Tap-4670@reddit

I thought this model name was a joke, but here we are.

daaangerz0ne@reddit

I also choose this guy's model

narcomo@reddit

Who wouldn’t

Selphea@reddit

Chronos Gold 12b

ttkciar@reddit

[Phi-4-45B](https://huggingface.co/ehristoforu/phi-4-45b) I'm evaluating it against Phi-4 and Phi-4-25B.

OkLynx9131@reddit

Genuine question: is there some sort of leaderboard that shows the best average consumer-friendly model for overall tasks? I don't want the DeepSeek R1 671B model, which I can't run. Possibly models below 25B params?

Everlier@reddit (OP)

[Open LLM Leaderboard 2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?params=3%2C7) is meant to be such a place: the most complete and up-to-date board. But of course in this field it's impossible to create any kind of long-lived accurate ranking, it gets obsolete sooner than you can deploy it.

OkLynx9131@reddit

Holy shit. Thank you so much! It's super helpful.

rdkilla@reddit

LLama3.3:70b-instruct-Q8\_0

pathfinder6709@reddit

What hardware and TTFT and tok/s?

rdkilla@reddit

4xP40: 12 seconds TTFT, 3.5 tok/s. It's really slow, I have some work to do.
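
As a rough sketch of what those two numbers mean in wall-clock time, assuming decode speed stays constant after the first token (the helper name and the 500-token reply length are illustrative, not from this thread):

```python
def response_seconds(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Estimate wall-clock time for a reply: time to first token plus steady decoding."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    # The first token arrives at ttft_s; the remaining tokens stream at tokens_per_s.
    return ttft_s + max(n_tokens - 1, 0) / tokens_per_s


# 12 s TTFT and 3.5 tok/s are the figures quoted above;
# the 500-token reply length is a hypothetical example.
print(f"{response_seconds(12.0, 3.5, 500):.1f} s")
```

For a reply of a few hundred tokens, decode speed dominates and the TTFT is a small part of the total.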

martinerous@reddit

Gemma 2 27B Q6. Mistral Small 22B Q6. I know there is the new Mistral Small 24B but it is worse for my use case.

someguy@reddit

Mistral Small (22B/24B), with a 3 bit quant, is the largest that fits in my RTX 3060. Noticeably better than the next smaller size of models. RTX 3090 soon.

Cruel_Tech@reddit

DeepSeek R1 Qwen 2.5 32b, the 5-bit medium quant, run via llama.cpp server on my 3090. I also run llama 3.2 8b on a 2080 that I use for title and tag generation in Open WebUI.

harsh_khokhariya@reddit

llama 3.2 3b

Heavy_Ad_4912@reddit

I have seen many people saying it performs really well for its size. How does it compare to llama 3.1 8b?

umataro@reddit

Obviously, it contains less knowledge. To the point where it's tragically bad at answering almost anything. But if you let it search the web for you, it's pretty useful for summarising the results. You'll never skim 10 webpages and find what you're looking for as quickly as this does.

harsh_khokhariya@reddit

i don't know about 3.1 8b, but for a 3b model, it works very well, and with great speed, using it as a router or a sentence analyzer

minpeter2@reddit

Still 8B is a bit lower, at least in my usage environment.

getmevodka@reddit

Llama 3.3 70b

caetydid@reddit

A quant which uses 34GB of memory. It is a bit slow but quality is great!

umataro@reddit

So close yet so far... If only there was a way to run it on my 32GiB macbook. I mean, qwen2.5-coder:32b is very good but llama 3.3 70b is a great complement for sanity checking the generated code

koalfied-coder@reddit

GOAT

getmevodka@reddit

it really is 😂

jrherita@reddit

Human Brain (but need to get back to playing with LLMs)

CattailRed@reddit

...you have a human brain hooked up to hardware in your basement?

jrherita@reddit

Bedroom actually

Everlier@reddit (OP)

Plot twist: you're not joking, just 100% sure nobody would believe you have the hardware to run a human-brain type of network

Al-Ei@reddit

DeepSeek-R1-Distill-Qwen-1.5B

_supert_@reddit

- Mistral large (123B) 4.5bpw exl2
- Mistral small (24B) exl2

Everlier@reddit (OP)

aka "We have Le Chat at Home"

_supert_@reddit

I keep coming back to them. I've used R1 over API (deepinfra) and many others locally. But Mistral and Claude just have the most amenable personalities.

maddogawl@reddit

Mistral Small 2501 q4 via LMStudio

justGuy007@reddit

DeepSeek-Coder-v2:16b

zabirauf@reddit

deepseek-r1:70b-llama-distill-q4\_K\_M

Glittering_Mouse_883@reddit

Athene-v2 in ollama

EsotericTechnique@reddit

Dolphin 3 8B

Secure-Item3619@reddit

Mistral Small 24b q8 MLX

townofsalemfangay@reddit

Unsloth's Deepseek 671b R1... in all its glory at 2.5 tk/s 😂

cher_e_7@reddit

Same, but the 2.51-bit quant on an EPYC 7713 at 10.7 t/s and Q4\_K\_M at 6.7 t/s

klam997@reddit

dolphin3 llama 8b

Timely-Ad-2597@reddit

All deepseek Distills!

dasnihil@reddit

fuseo1-r1-t1 something something q4 gguf

YearnMar10@reddit

FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview ?

dasnihil@reddit

yes exactly that, q4.

YearnMar10@reddit

Tried the flash model of that already? I am curious if it’s better. I didn’t have time yet

dasnihil@reddit

yes, the one i'm using is this: FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview-Q4, i have tested tool use with it, the system prompt is a hit or a miss with this model. it works when it works.

celsowm@reddit

Old and good llama 3.1 8b 4bits

cleverusernametry@reddit

Why..? I stopped using it months ago

celsowm@reddit

For Portuguese texts it's still the best at this size

FrederikSchack@reddit

Please help benchmark your hardware and compare with others. Small test: https://www.reddit.com/r/LocalLLaMA/s/8JGE00PItl

Rustybot@reddit

Deepseek-r1:14b

FrederikSchack@reddit

Would you like to help compare hardware by running a small test? https://www.reddit.com/r/LocalLLaMA/s/8JGE00PItl

Western_Courage_6563@reddit

Gemma2 2b. On my phone.

hannibal27@reddit

Which application do you use?

Western_Courage_6563@reddit

PocketPal

Everlier@reddit (OP)

I also have one in my pocket, it's crazy that we got such tech in our lifetime

RandumbRedditor1000@reddit

Mistral small 24B, and DeepScaler

Everlier@reddit (OP)

Regarding the latter - testing or applying it to something?

RandumbRedditor1000@reddit

Mostly testing rn, it seems to be very good at math

Perturbee@reddit

Llama-3.1-8B-Lexi-Uncensored-V2

j_sequeira@reddit

Unsloth's DeepSeek R1 IQ1\_S at 3 t/s

Everlier@reddit (OP)

The instance when 3 t/s is blazingly quick, kudos!

dreamai87@reddit

Qwen-2.5-coder-14b with Cline and Continue.dev. Running the model with llama.cpp in batch inference mode with 2 slots of 13k context each, on a MacBook M2 Pro (32GB)

spac420@reddit

gemma 9b. don't @ me!

Thrumpwart@reddit

Qwen 2.5 14B 1M context. I can fit 210,000 context and Q8 GGUF into the Chunky Boi :)
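
For a sense of why long context eats memory, here is a back-of-the-envelope KV cache estimate. The layer count, KV head count and head size below are assumptions for a Qwen 2.5 14B-class model with grouped-query attention (check the model's config.json before trusting them), the cache is assumed to be unquantized 16-bit, and `kv_cache_gib` is just an illustrative helper name:

```python
def kv_cache_gib(n_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB for a transformer with grouped-query attention."""
    # Each token stores one key and one value vector per layer per KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * n_tokens / 2**30


# Assumed architecture values (not from this thread): 48 layers, 8 KV heads, head size 128.
# 210,000 tokens is the context length mentioned above.
print(f"{kv_cache_gib(210_000, 48, 8, 128):.1f} GiB")
```

The weights come on top of that, so the total depends heavily on whether the runtime quantizes the cache.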

Everlier@reddit (OP)

That's very nice! Having a 14B that can see an entire repo would be pretty awesome. I was wondering how these 1M Qwens perform when roped even further past their original training. There was [a paper](https://arxiv.org/abs/2502.05167) last week saying that basically most context scaling methods scale quite poorly outside literal matching.

Thrumpwart@reddit

I can fit 500,000 context on my M2 Ultra :) it's pretty awesome.

Everlier@reddit (OP)

Insane, I'm very curious if it can really deliver on its "codebase-level" completion feature in such mode (although it's from Coder fine-tunes, and I'm not sure if 14B has it). p.s. Can't even imagine how long PP takes though :D

Thrumpwart@reddit

Yeah PP takes forever on the M2 Ultra. However, with the right prompt and a good idea of what I want to get out of each inference run I'm willing to wait. Using the 14B model it does seem proficient with code. Obviously not as good as the 32B coder model (which I absolutely LOVE - especially the Unsloth fine tune with 128k context), but I load an entire 170,000 token code base into the context along with some other documents and it does well.

Everlier@reddit (OP)

That's an experience I'd like to try some day, awesome setup!

Thrumpwart@reddit

Thank you!

Anyusername7294@reddit

Deepscaler

Everlier@reddit (OP)

Just for funsies or for formulaic reasoning?

Anyusername7294@reddit

From what I've learned, this is the best <2B model out there. It's better than most 8B models I tried

koalfied-coder@reddit

neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8

Everlier@reddit (OP)

Are you rocking vllm or nm-vllm?

koalfied-coder@reddit

vLLM:

    python -m vllm.entrypoints.openai.api_server \
      --model neuralmagic-ent/Llama-3.3-70B-Instruct-quantized.w8a8 \
      --gpu-memory-utilization 0.95 \
      --max-model-len 8192 \
      --tensor-parallel-size 4 \
      --enable-auto-tool-choice \
      --tool-call-parser llama3_json

Kooky-Somewhere-2883@reddit

Deepseek-R1 604B (sorry, it's my dream)

koalfied-coder@reddit

lol same

synw_@reddit

Qwen 2.5 coder 32b IQ4_XS

Adro_95@reddit

Is that usable on a 16gb card?

synw_@reddit

Probably not with good context and speed: try the 14b coder with a higher quant
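
A rough way to sanity-check this: weight size is about parameter count times bits per weight, divided by 8. The bits-per-weight figures below are approximate assumptions (they vary with the quant recipe), `weight_gb` is an illustrative helper name, and the estimate ignores KV cache and runtime overhead, which need room on top:

```python
def weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    # Billions of parameters * bits per weight / 8 bits per byte = billions of bytes.
    return n_params_billion * bits_per_weight / 8


VRAM_GB = 16  # the card size asked about above

# Assumed averages: ~4.25 bits/weight for IQ4_XS, ~6.56 bits/weight for Q6_K.
for name, params_b, bpw in [("32b IQ4_XS", 32, 4.25), ("14b Q6_K", 14, 6.56)]:
    size = weight_gb(params_b, bpw)
    verdict = "weights alone exceed VRAM" if size > VRAM_GB else "leaves room for context"
    print(f"{name}: ~{size:.1f} GB ({verdict})")
```

Under these assumptions the 32b weights alone are already over 16 GB, while a 14b at a higher quant leaves a few GB for context.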

Adro_95@reddit

Thanks

nathan-portia@reddit

Deepseek 14b, but also trying to play around with multimodal and image enabled models.

Everlier@reddit (OP)

Nice, 14b with low/mid context is also one of my go-to's. Can't wait for a more user-friendly way to run Qwen 2.5 VLs

hiper2d@reddit

My daily driver on 16GB VRAM through Ollama: [https://huggingface.co/bartowski/cognitivecomputations\_Dolphin3.0-R1-Mistral-24B-GGUF](https://huggingface.co/bartowski/cognitivecomputations_Dolphin3.0-R1-Mistral-24B-GGUF) (IQ4\_XS). Fast, uncensored, with thinking, and the best of everything I've tested so far.

Everlier@reddit (OP)

I tried it a couple of days ago. It was OK-ish, but seemed a bit overcooked and hard to steer. I only did a few surface-level tests, though. You might find [Harbor](https://github.com/av/harbor) interesting: same setup as you described + more, all in one CLI + App

p4s2wd@reddit

Mistral-Large-Instruct-2411-AWQ with 128k context-length and gets 17 - 19 t/s

Everlier@reddit (OP)

O\_O You've got quite a setup, vllm, aphrodite, KTransformers, SGLang or something else?

Heavy_Ad_4912@reddit

Guys, which is better: SD 3.5 Large or FLUX.1 dev? Has anyone used both of them? What's your opinion?

Everlier@reddit (OP)

Flux is better, but it's subjective. SD 3.5 Large has a very specific look to it that lacks details, Flux looks more defined and "on purpose"

Wubbywub@reddit

llama3.1:8b

guigouz@reddit

qwen2.5-coder

fab_space@reddit

Qwen coder

meepbob@reddit

deepseek-r1-distill-qwen-32b@iq3\_m

AaronFeng47@reddit

Qwen2.5-32B-Instruct-Q4\_K\_L

Guna1260@reddit

Athene using aphrodite-engine

AppearanceHeavy6724@reddit

Mistral Nemo-2407 Q4

OmarBessa@reddit

A finetune of DeepSeek Distill Qwen 32B

terminoid_@reddit

EXAONE 32B, still the best model that can run on 16GB, imo

RapidRaid@reddit

Do you use flash attention / lower quants? Or does it fit without it?

terminoid_@reddit

I use Q4\_K\_M or Q3\_K\_M quants.

RapidRaid@reddit

phi4:14b on a Mac Mini M4 (base model) as dedicated home server. It works okayish. A bit too slow for my taste, but the results are pretty good for a local model. I'll probably upgrade to the M4 Studio, once it comes out (for bigger models + fast compute).

Brandu33@reddit

huihui\_ai/deepseek-r1-abliterated:14b is on at this very moment, not sure of it yet.

ee_di_tor@reddit

patricide-12B-Unslop-Mell, Q5\_K\_M

ProfitRepulsive2545@reddit

Mistral-Large-Instruct-2411 (IQ3\_XXS)

dmter@reddit

R1 671B Q1.5 using llama.cpp works surprisingly well on my 128GB + 3090 machine, with NVMe to hold the model. R1 distill llama 3.3 70B is surprisingly not much faster, so I'd rather just ask the 671B, as its thinking and code are better.

Dgamax@reddit

How long does it take to answer? It's crazy that it can run on this config :o

dmter@reddit

Many hours if you want a complex function of about 100 lines. However, half of that time it is thinking through tests and side functions, so you can already get good ideas from the thinking output after 10-20 minutes.
