Anyone using Tesla P40 for local LLMs (30B models)?
Posted by ScarredPinguin@reddit | LocalLLaMA | 15 comments
Hey guys, is anyone here using a Tesla P40 with newer models like Qwen / Mixtral / Llama?
RTX 3090 prices are still very high, while P40 is around $250, so I’m considering it as a budget option.
Trying to understand real-world usability:
- how many tokens/sec are you getting on 30B models?
- is it usable for chat + light coding?
- how bad does it get with longer context?
Thank you!
braydon125@reddit
So. Much. Slower dude.
Running 4x 3090s in a Threadripper node. It's sick
IHaveTeaForDinner@reddit
A P40 is $500 AUD and a 3090 is $1600, bit of a difference
braydon125@reddit
I guess my point is this: there's a formula that lets you project t/s from your VRAM memory bandwidth and model weight size:
tokens per second = memory bandwidth / model weight size
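As a rough sketch of that back-of-envelope estimate (the bandwidth and model-size figures below are ballpark assumptions, not measurements):

```python
# Rough ceiling: tokens/sec ~= memory bandwidth / bytes read per generated token.
# For a dense model, every weight is read once per token, so the quantized model
# size on disk is a reasonable proxy for bytes per token.

def estimate_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical upper bound on generation speed; real throughput is lower."""
    return bandwidth_gb_s / model_size_gb

# Ballpark figures (assumptions):
p40_bandwidth_gb_s = 347.0   # Tesla P40 GDDR5 bandwidth
model_size_gb = 18.0         # roughly a 30B dense model at Q4

print(f"~{estimate_tps(p40_bandwidth_gb_s, model_size_gb):.0f} tok/s ceiling on one P40")
# ~19 tok/s; compute overhead and KV-cache reads at long context pull this down further.
```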
FullstackSensei@reddit
Haven't tried 30B models but love my P40s. Have eight of them in a single rig. They're nowhere near as fast as my 3090s, but I didn't pay much for them either.
ik_llama.cpp is your best friend with the P40s. Being datacenter cards, they support P2P, which ik_llama.cpp enables and uses during inference. This speeds things up considerably vs vanilla llama.cpp; in my experience, 2x faster or even more when it comes to prompt processing.
Cooling them is much easier than most think. First, they're not power hungry. They'll happily idle at 8-9W each. Second, you can power limit them to 170-180W without a noticeable degradation in performance. Third, and my favorite, the PCB is the same as the 1080Ti FE/reference or Titan XP. So, you can slap any waterblocks for these two cards onto the P40. You'll need to cut a 1.5x1.5cm piece out of the acrylic/POM for the EPS connector. Alternatively, you can desolder the EPS connector and solder two 8 pin PCIe power connectors. I went the dremel route because it cuts the cabling in half.
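If you'd rather script the power limit than set it by hand, here's a minimal sketch (assumes `nvidia-smi` is on PATH and you have root; 180 W is just the cap mentioned above, adjust to taste):

```python
import subprocess

# Cap each P40 at 180 W via nvidia-smi's power-limit flag.
# Needs root, and the setting resets on reboot unless persistence mode is enabled.
GPU_INDICES = [0, 1]      # adjust to your P40 indices from `nvidia-smi -L`
POWER_LIMIT_W = 180

for idx in GPU_INDICES:
    subprocess.run(
        ["nvidia-smi", "-i", str(idx), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )
```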
Do yourself a favor and get an older Broadwell Xeon with a DDR4 or even DDR3 motherboard to go with them. Those Xeons come with 40 lanes and are dirt cheap, as are their motherboards. While I don't see more than 6GB/s per card in my rig running minimax at Q4, I'm sure I'd get an additional 1-2t/s if I had an x16 connection to each GPU.
I'm sure many have seen this half a dozen times already, but here's my P40 rig for the umpteenth time:
PairOfRussels@reddit
I get 10 t/s with Qwen3.5 35B on a single P40. Can you share some start parameters for me to test out?
RobotRobotWhatDoUSee@reddit
I actually hadn't seen this before, that's great. This is a tower setup?
FullstackSensei@reddit
X10DRX. It's huuuuuuge. Like EATX looks tiny next to it. The format is SSI-MEB with a couple of screws in a different location.
And yes, it's a tower setup. The case is an old Xigmatek Elysium. That motherboard fits in old HPTX cases. The only modern alternative I'm aware of is the Phanteks Enthoo server edition, but I haven't personally verified if it fits.
SolarDarkMagician@reddit
I had 2 P40s and they were great, ran super fast with llama.cpp.
ThinkExtension2328@reddit
Sir, that “had” is concerning. Why “had” and not “have”?
SolarDarkMagician@reddit
My family got me a 5060TI for Christmas and I couldn't get the P40s and the 5060 to play together in my machine.
ThinkExtension2328@reddit
O ok
WonderRico@reddit
I was running two of them a while back. Custom 3D-printed ducts in the front and back with Noctua fans (2 small and 1 big), and it ran smoothly. At the time, I knew little about LLMs. I bet now, using vLLM and tensor parallel, they would do fine with MoE models like Qwen3.5 A3B (but I'm too lazy to plug them back in and see).
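For reference, a tensor-parallel launch with vLLM's Python API would look roughly like this. The model path below is a placeholder, and whether a current vLLM build still supports Pascal cards like the P40 is something you'd have to verify first:

```python
from vllm import LLM, SamplingParams

# Placeholder model path; substitute whatever MoE checkpoint you actually run.
llm = LLM(
    model="some-org/some-moe-model",  # hypothetical repo id, not a real checkpoint
    tensor_parallel_size=2,           # shard each layer across both P40s
    dtype="float16",
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what tensor parallelism does."], params)
print(outputs[0].outputs[0].text)
```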
Weak_Ad9730@reddit
2x P40 running great. Cheap, reliable motherf.
ZebraMussell@reddit
Life is tough, but it’s tougher if you’re tryin' to run Llama 3 on a card with no active coolin'. If you can't afford the 3090, the P40 will get the job done lol just don't expect it to win any races.
ScarredPinguin@reddit (OP)
Thank you for your comment. I plan to either put it into my 1U server or into a tower, 3D print the adapter, and put a fan on it.