RTX 5070 Ti (new) vs RTX 3090 / 3090 Ti (used) for LLM inference + clustering
Posted by FeiX7@reddit | LocalLLaMA | 16 comments
I am thinking of getting one of them (or two, to cluster).
I need it purely for LLM inference.
Both cost the same in my country.
The bigger the models I can fit and the faster I can run them, the better.
I am leaning toward a 5070 Ti with a second one added later, but if the 3090 offers better value per dollar, I'd rather pick that.
So please share your opinions.
(Currently I am on AMD; I run Qwen3.5 27B and it is SOOO slow, so I need faster inference.)
According-Hope1221@reddit
The 5070 Ti natively supports FP8 and FP4; the older 3090 does not.
They both support FP32, FP16, BF16, INT8, and INT4.
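If you want to check this on your own hardware, here's a minimal sketch (assuming PyTorch with CUDA is installed; the capability thresholds are rough cutoffs, not an official API):

```python
# Query the CUDA compute capability and infer native low-precision support.
# Roughly: FP8 tensor cores arrived with Ada (8.9) and Hopper (9.0);
# FP4 arrived with Blackwell (10.x datacenter, 12.x consumer like the 5070 Ti).
# The 3090 is Ampere, compute capability 8.6.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("Native FP8 tensor cores:", (major, minor) >= (8, 9))  # Ada/Hopper and newer
print("Native FP4 tensor cores:", major >= 10)               # Blackwell and newer
```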
lemondrops9@reddit
Easy choice for LLMs: it's the 3090, as the extra 8GB of VRAM is a must-have for the models you want to run.
FeiX7@reddit (OP)
What about speed? The 5070 Ti is faster and newer.
zipperlein@reddit
They have about the same memory bandwidth. Batching may be faster on the 5070 Ti, but single-user inference should be similar.
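For intuition: single-user decode speed is roughly bounded by memory bandwidth divided by the bytes of weights read per token. A back-of-the-envelope sketch (bandwidth figures are the published specs; the model size is illustrative, a 27B dense model at ~4.5 bits/weight):

```python
# Rough upper bound on tokens/sec for memory-bound single-user decoding.
GB = 1e9

bandwidth = {"RTX 3090": 936 * GB, "RTX 3090 Ti": 1008 * GB, "RTX 5070 Ti": 896 * GB}
model_bytes = 27e9 * 4.5 / 8  # ~15 GB of weights touched per token (dense model)

for gpu, bw in bandwidth.items():
    print(f"{gpu}: ~{bw / model_bytes:.0f} tok/s upper bound")
```

All three land within about 10% of each other, which is why single-user speeds are so similar in practice.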
StorageHungry8380@reddit
The 5070 Ti will be faster for modern quantization schemes. However, that assumes you can fit the model in memory. I had a 2080 Ti and a 5070 Ti, and as I recall, with a model entirely on the 2080 Ti, the speed was around 50-70% of the 5070 Ti for regular models, and less for modern quantization schemes such as MXFP4. It was particularly noticeable for prompt processing, where compute matters as much as bandwidth; the 2080 Ti took much longer for MXFP4 and the like.
Now, the 3090 (Ti) is newer than the 2080 Ti and has far more memory bandwidth, so you'll have to compare benchmarks.
I also ran both at the same time, and splitting definitely impacted speed, though it was of course still much faster than falling back to CPU for even just the few layers that didn't fit on the 5070 Ti.
A caveat I hadn't considered when getting the 5070 Ti: at least with llama.cpp, the KV cache is not split across GPUs, meaning that if you need 5GB of KV cache for your desired context length, you'll allocate 5GB on each card, so 10GB total. For long contexts like 256k, this meant I couldn't load nearly as big a model as I thought I could. (See the sketch below for how fast the cache grows.)
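A minimal sketch of how big the KV cache gets, assuming an illustrative Llama-style config (32 layers, 8 KV heads via GQA, head dim 128, FP16 cache; real models vary):

```python
# KV cache size: K and V tensors per layer, per KV head, per token.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem  # 2x for K and V

for ctx in (32_768, 131_072, 262_144):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"context {ctx:>7}: ~{gib:.1f} GiB (per GPU, if the cache is duplicated)")
```

At 256k context this is tens of GiB even with GQA, so duplicating it across cards eats the VRAM budget quickly.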
YMMV, just my experience.
VersionNo5110@reddit
I was facing the same dilemma and finally decided on 2x 5070 Ti. More VRAM would probably have benefited me, but I didn't want to take the risk of buying used hardware; you never know. Plus, the 5070 Ti will have a longer lifespan. From my benchmarks, the 5070 Tis are (slightly) faster in both prompt processing and token generation, but mileage may vary (inference engine, version, etc.). Anyway, they are very similar in terms of speed, I think. So far I'm very happy with this modern hardware, which is both fast and "cheap" (~1500€ for both GPUs). I can run most models I want, plus I can always offload to RAM if I want to run bigger MoE models and still keep decent performance; you'd end up offloading at some point even with 48GB of VRAM anyway. That's my take, and not the most common one, as most people go for the 3090 instead, which I also understand.
taking_bullet@reddit
MoE models are underrated. It's easier (and cheaper) to grab another 64GB of RAM than to replace a GPU. That's the future of local LLMs, not dense models.
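A minimal sketch of this setup (assuming llama-cpp-python is installed; the GGUF file name is hypothetical): put as many layers as fit in VRAM, and let the rest run from system RAM. MoE tolerates this well because only the few active experts per token actually get read.

```python
# Partial GPU offload of a MoE model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-35b-a3b-q4_k_m.gguf",  # hypothetical file name
    n_gpu_layers=24,   # layers that fit in VRAM; the remainder stays in RAM
    n_ctx=16384,
)
out = llm("Explain MoE offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```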
VersionNo5110@reddit
As a reference I’m getting:
~37 t/s with unsloth/gemma-4-31B-it-GGUF
~145 t/s with unsloth/Qwen3.6-35B-A3B-GGUF
OutlandishnessIll466@reddit
I don't know; I run vLLM on 2x 3090 and it's very good, getting 30-50 tokens per second with a single agent. The beauty of vLLM is that sometimes Hermes spawns multiple backend agents that start working in parallel, in which case vLLM goes up to 150 tokens per second.
Having said that, I still prefer 35B A3B for speed, because once you get used to >100 tokens per second, 50 feels really slow when the agent needs to look up multiple websites and figure out the best way to avoid getting blocked on each of them. Honestly, the 35B is serving me more than fine for now.
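For reference, a minimal sketch of the two-GPU tensor-parallel setup described above (assuming vLLM is installed; the model name is an illustrative pick for 2x24GB, not the poster's exact config):

```python
# Tensor-parallel serving across two GPUs with vLLM; batched/multi-agent
# requests share the weights, which is where the throughput scaling comes from.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # hypothetical model choice
    tensor_parallel_size=2,                 # split weights across both 3090s
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Hello from two GPUs!"], params)
print(outputs[0].outputs[0].text)
```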
Glittering-Call8746@reddit
Get the 5070 Tis while they're still "cheap". If you just want VRAM, get an AMD R9700 Pro and you'll run 27B no problem. A used 3090 has no warranty; if you're doing real work, you wouldn't want that. If you're just playing around, whatever suits you, tbh.
AdamDhahabi@reddit
I have both a 5070 Ti and a 3090 in my build. The 3090 is slightly faster running GGUFs with llama.cpp, because it has slightly more memory bandwidth. The 5070 Ti makes sense if you go the vLLM route, but that requires fast PCIe bus speeds; on a consumer mainboard, forget about vLLM. And if you compare power consumption, 2x 3090 wouldn't be that much worse than 3x 5070 Ti.
pepedombo@reddit
Mind that all these 55-85 tps figures for 27B are only valid for vLLM on a single 24GB 3090/4090/5090. I spent two days trying to run vLLM properly via Docker, and it ended up a mess when set up on a 5070 + 5060 for a test. In llama.cpp, 27B Q4 split across both cards averages 25-30 tps; dual 5070s might go somewhat faster (depends on the PCIe slots). Normally I go Q5/Q6 with an average of 20-24 tps; sometimes I run 27B Q8 on three GPUs at 14 tps (or 2x 11 tps) just to see how much detail is lost when using lower quants.
It's obvious that 2x 3090 is a solid starter pack for 27B dense; for MoE models you can run cheaper 2x or 4x 5060s, still cheaper than a single 5090. Everything depends on budget and compromise.
_ballzdeep_@reddit
I'm running Qwen3.5 27B INT4 on a single 3090 and getting 55 to 85 TPS depending on the task. You decide :)
I just got myself a 3090 Ti to run TP=2 and go wild.
I think VRAM is everything here. 3090s are still very powerful for inference, IMO, and the extra 8GB of VRAM matters quite a bit, both for speed and for quantization.
Unless you can go up to a 4090 or a 5090, I'd go with the 3090.
FeiX7@reddit (OP)
50+ tokens is wild; on my Strix Halo I get only 10 tps :) But I have 96GB of VRAM and less than 200W power consumption. What about the TDP of the 3090?
_ballzdeep_@reddit
I set the power limit of the card to 275W.
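For anyone who wants to do the same programmatically, a sketch (assuming the nvidia-ml-py package; requires root/admin privileges, and `nvidia-smi -pl 275` does the same thing from the shell):

```python
# Set a 275W power limit on the first GPU via NVML.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)               # first GPU
pynvml.nvmlDeviceSetPowerManagementLimit(handle, 275_000)   # value in milliwatts
print("Limit:", pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000, "W")
pynvml.nvmlShutdown()
```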
I used this repo for the vLLM setup: https://github.com/noonghunna/qwen36-27b-single-3090
And I just noticed this morning that they posted a config where you can run 196k context on a single 3090. Insane.
FeiX7@reddit (OP)
And also, what about getting a Mac Studio for inference?