Getting 5.3 t/s with 70B and a P40 @ IQ2S 4k context. Anyone else get the same?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | 11 comments

Just wanna confirm whether what I'm getting is abnormally slow. I'm getting 32 t/s with 8B Q8, which seems to be expected. Can anyone out there with a P40 run Llama 3 70B IQ2S @ 4096 context on one GPU and let me know if you are getting the same speed? It seems like I should be getting something like 8-9 t/s according to some benchmarks I am seeing online.

11 Comments

hapliniste@reddit

10 times the calculations, 10 times slower, no?
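
A rough way to sanity-check that intuition is to scale the OP's own 8B result two ways: by parameter count (the "more calculations" view) and by the bytes of weights read per token (a memory-bandwidth view). This is only a back-of-the-envelope sketch. The nominal bits-per-weight figures (8.5 for Q8_0, 2.5 for IQ2_S), the round 8B and 70B parameter counts, and the function name `scaled_tps` are assumptions for illustration, and the sketch ignores KV cache, context length, and kernel efficiency.

```python
# Nominal bits per weight; treated as assumptions for a rough estimate.
Q8_0_BPW = 8.5
IQ2_S_BPW = 2.5


def scaled_tps(base_tps, base_params_b, base_bpw, params_b, bpw):
    """Scale a measured tokens/s figure to a different model size and quant.

    Returns (by_params, by_bytes):
      by_params assumes speed is inversely proportional to parameter count.
      by_bytes assumes speed is inversely proportional to total weight bytes.
    """
    by_params = base_tps * base_params_b / params_b
    by_bytes = base_tps * (base_params_b * base_bpw) / (params_b * bpw)
    return by_params, by_bytes


# OP's measurement: 32 t/s on an 8B model at Q8.
by_params, by_bytes = scaled_tps(32.0, 8, Q8_0_BPW, 70, IQ2_S_BPW)
print(f"scaled by parameter count: {by_params:.1f} t/s")
print(f"scaled by weight bytes:    {by_bytes:.1f} t/s")
```

Under these assumptions the two estimates come out at roughly 3.7 and 12.4 t/s, so the reported 5.3 t/s sits between them rather than looking obviously broken.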

My_Unbiased_Opinion@reddit (OP)

Makes sense. I'm just seeing that other people with 70B on a P40 are getting like 8-9 t/s with 4k context, but I'm getting like 5. In other words, I'm asking if there are any tweaks or settings I'm missing out on.

MikeLPU@reddit

An old man goes to the doctor: "Doctor, I need your help. I have problems in bed. I can only have sex with my wife three times a week." "How old are you?" the doctor asks. "85." "So you're a lovebird, a sexual giant," the doctor says. "But doctor, my neighbor is also 85, and he tells me he has sex with his wife every day." "Bro, so you tell people that too!"

kryptkpr@reddit

It's 8 tok/sec if you have two P40s with row split. One P40, or layer split, produces the 5.5 you see.
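
For anyone wanting to compare the two split modes, here is a minimal sketch that builds the llama.cpp command line for each. It assumes a recent llama.cpp build where the binary is called `llama-cli` (older builds named it `main`), and the model filename is hypothetical. The flags used (`-m`, `-c`, `-ngl`, `--split-mode`) are llama.cpp's usual ones, but check `--help` on your build. The helper name `llama_cli_args` is made up for this example.

```python
import shlex

VALID_SPLIT_MODES = ("none", "layer", "row")


def llama_cli_args(model_path, split_mode="layer", n_ctx=4096, n_gpu_layers=99):
    """Build an argument list for a llama.cpp run with the given split mode."""
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"split_mode must be one of {VALID_SPLIT_MODES}")
    return [
        "llama-cli",                  # assumed binary name; older builds use "main"
        "-m", model_path,             # path to the GGUF file
        "-c", str(n_ctx),             # context size
        "-ngl", str(n_gpu_layers),    # layers to offload to the GPU(s)
        "--split-mode", split_mode,   # how to split the model across GPUs
    ]


# Hypothetical model file; run once with "row" and once with "layer" to compare.
cmd = llama_cli_args("llama-3-70b.IQ2_S.gguf", split_mode="row")
print(shlex.join(cmd))
```

Row split only makes a difference with more than one GPU, so on a single P40 both commands should behave the same.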

TraditionLost7244@reddit

Are you nuts using an IQ2S?? Try an IQ4 and see if you prefer its output. Also bump the context to 6k.

nero10578@reddit

Sounds about right for P40. Source: I had quite a few but sold them for 3090s.

Healthy-Nebula-3603@reddit

With IQ2S I can get 5 t/s on my CPU, an AMD 7950X3D...

My_Unbiased_Opinion@reddit (OP)

Yeah, that DDR5 and cache really put in work, I bet.

Status_Contest39@reddit

me too

maz_net_au@reddit

I'm getting about 5 t/s on a pair of P40s with a 70B IQ4\_XS quant. My understanding is that IQ quants are slower on old cards (this might help mitigate that soon: [https://github.com/ggerganov/llama.cpp/pull/8215](https://github.com/ggerganov/llama.cpp/pull/8215)). I briefly had a pair of 3090s running the same IQ4 quant and would get 11 t/s from them. If anyone gets 9 t/s on a P40 for a 70B, I'd love to know how. Having said that, I haven't done any optimisations at all.

My_Unbiased_Opinion@reddit (OP)

Yep, you are right. I jumped the gun: the 8-9 t/s figure is for Command-R, not 70B. I edited my post in shame lol.