zelkovamoon

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink

Posted by zelkovamoon@reddit | LocalLLaMA | View on Reddit | 20 comments

zelkovamoon@reddit (OP)

I appreciate the help. I *think* it should be as simple as appending those commands. You probably don't need to change much else about your configuration, but I'm not 100% sure.

zelkovamoon@reddit (OP)

You can do tensor parallelism (TP) in llama.cpp with the '--tensor-split' and '--split-mode' options.
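
For anyone wanting to A/B this, here is a rough Python sketch that only assembles the two command lines (row split vs. the default layer split) so the runs differ in exactly one option. The model path, the even 1,1 split, and the `--n-gpu-layers 99` value are placeholders I made up, not anyone's actual config; check `llama-server --help` on your build before relying on any of it.

```python
import shlex


def llama_server_cmd(model_path, tensor_split=(1, 1), split_mode="row", n_gpu_layers=99):
    """Build a llama-server argument list for a multi-GPU run.

    tensor_split holds relative proportions per GPU (1,1 = even split).
    All default values here are illustrative assumptions.
    """
    if split_mode not in ("none", "layer", "row"):
        raise ValueError(f"unexpected split mode: {split_mode}")
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(n_gpu_layers),
        "--split-mode", split_mode,
        "--tensor-split", ",".join(str(p) for p in tensor_split),
    ]


# Same settings, only the split mode differs, so the tps numbers are comparable.
row_cmd = llama_server_cmd("model.gguf", split_mode="row")
layer_cmd = llama_server_cmd("model.gguf", split_mode="layer")

print(shlex.join(row_cmd))
print(shlex.join(layer_cmd))
```

Run each printed command separately and compare tokens per second on the same prompt.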

Update to the side note in my previous comment:

    --reasoning-budget 1536 \
    --reasoning-budget-message ". Okay, enough thinking. Let's answer now." \

This actually works. Looks like meat's back on the menu, boys.
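
If you launch the server from a script, a minimal sketch of appending those two options looks like the following. The flag names and values are the ones quoted above; the base command and model path are placeholders, and whether your llama.cpp build accepts an arbitrary budget is something to confirm with `llama-server --help`.

```python
import shlex


def with_reasoning_budget(base_cmd, budget_tokens, message):
    """Return a copy of base_cmd with the reasoning-budget options appended."""
    return list(base_cmd) + [
        "--reasoning-budget", str(budget_tokens),
        "--reasoning-budget-message", message,
    ]


# Placeholder base command; substitute your own existing launch arguments.
base = ["llama-server", "--model", "model.gguf"]
cmd = with_reasoning_budget(base, 1536, ". Okay, enough thinking. Let's answer now.")

# shlex.join quotes the message so the apostrophe survives a shell.
print(shlex.join(cmd))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids the quoting problem entirely.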

zelkovamoon@reddit (OP)

I had been running on an Octominer X12, and it was surprisingly good. If I could use NVLink to bridge the cards, it might be a big unlock, and per snapo84's comments it looks like that is possible. The Octominer is going to be retired for a newer platform soon. Anyway, as long as the cards work, these might be the second-best 'budget' option, number one being SXM2 + V100s.

zelkovamoon@reddit (OP)

This is actually very useful, thank you. I took your initial comment to mean that you literally didn't have NVLink, not that you just felt it was unnecessary, so that's on me.

Looking at your setup: have you tried running with '--tensor-split' and '--split-mode row' to see how performance changes? It looks like you're probably still running in pipeline mode, and I'd be curious to know what difference in tps you'd see.

Side note: *apparently* there are new controls for reasoning budget in llama.cpp that I was not aware of. See '--reasoning-budget' at [https://manpages.debian.org/unstable/llama.cpp-tools/llama-server.1.en.html](https://manpages.debian.org/unstable/llama.cpp-tools/llama-server.1.en.html)

I'm literally about to try it. I had reasoning disabled like you do, but if I can limit thinking to a reasonable number of tokens I would be interested in doing that. We'll see if it works!

zelkovamoon@reddit (OP)

Yeah, on one of my servers I tried using a 2070 super and even that handles small model inference like a boss. How long have you had the cards? Do they seem well built, reliable? The nvlink angle is specifically for tensor parallelism, which would be relevant to what I want to do - so I still need to know if it would work, but I'll take your experience under advisement

zelkovamoon@reddit (OP)

Current pricing says I can get these cards for under $500, so for the same money you could ostensibly get 44GB instead of 24GB. At this point, the extra memory is more valuable to me than the extra speed. A single 3090 can run Qwen 3.5 35b heavily quantized, but you're making a lot of concessions that you definitely wouldn't have to make if you had more memory.
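
To put rough numbers on that, here is a back-of-the-envelope sketch of weight size versus VRAM. It assumes a flat 35 billion parameters, counts weights only as params × bits ÷ 8, and uses an arbitrary 4 GB headroom figure for KV cache and runtime overhead. All of those are my assumptions, and real GGUF files and context sizes will shift the results.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb, headroom_gb=4.0):
    """True if weights plus an assumed headroom fit in the given VRAM."""
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


# Illustrative bits-per-weight values, not tied to specific quant formats.
for bpw in (3.0, 4.5, 8.0):
    size = weight_gb(35, bpw)
    print(
        f"{bpw} bpw: ~{size:.1f} GB weights | "
        f"fits 24 GB: {fits(35, bpw, 24)} | fits 44 GB: {fits(35, bpw, 44)}"
    )
```

Under these assumptions the higher-precision quants only become an option with the larger pool, which is the trade-off I'm describing.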

Qwen3.5-27B Q4 Quantization Comparison

Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments
