inevitabledeath3

I don't know about it being cursed. The cards have NVLink capability to work in pairs, so that part isn't cursed. It's also not like the power requirements are much different to server hardware. So I don't really get what's "cursed" about it.

we really all are going to make it, aren't we? 2x3090 setup.

Posted by RedShiftedTime@reddit | LocalLLaMA | View on Reddit | 164 comments

inevitabledeath3@reddit

Is that for the 32GB model or the 16GB one? €600 is actually fairly reasonable for 32GB of RAM. If that's the 16GB model though, then yikes.

we really all are going to make it, aren't we? 2x3090 setup.

Posted by RedShiftedTime@reddit | LocalLLaMA | View on Reddit | 164 comments

inevitabledeath3@reddit

3090s haven't gotten more expensive where I live. Even if they have where you live, you can always look at the V100 and MI50 systems.

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090

Posted by sandropuppo@reddit | LocalLLaMA | View on Reddit | 93 comments

inevitabledeath3@reddit

DFlash works on 3090? I had issues when I tried.

Open Models - April 2026 - One of the best months of all time for Local LLMs?

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 153 comments

inevitabledeath3@reddit

I think you misread. Those prices are by the hour, and for DeepSeek V4 Pro you need 4 of them. Although I guess Flash only needs one.

Open Models - April 2026 - One of the best months of all time for Local LLMs?

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 153 comments

inevitabledeath3@reddit

Which data center has GPUs that cheap? Also I don't think a single GPU would work for DeepSeek in most cases.

The exact KV cache usage of DeepSeek V4

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 57 comments

inevitabledeath3@reddit

The people who can run this model are serving multiple users and so need more than 1M tokens of total context. They might be serving 20 users, all with 1M tokens of context, so they need 20x that amount.
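
A rough sketch of that scaling, assuming every user gets a dedicated, fully populated context and nothing is shared between users. `KV_BYTES_PER_TOKEN` is a made-up placeholder, not a measured DeepSeek V4 figure.

```python
def total_kv_cache_bytes(users: int, context_tokens: int, kv_bytes_per_token: int) -> int:
    """Worst-case KV cache size when every user fills their whole context."""
    return users * context_tokens * kv_bytes_per_token


# Hypothetical per-token cost; substitute the real figure for the model.
KV_BYTES_PER_TOKEN = 10_000

single = total_kv_cache_bytes(1, 1_000_000, KV_BYTES_PER_TOKEN)
fleet = total_kv_cache_bytes(20, 1_000_000, KV_BYTES_PER_TOKEN)

print(f"one user:  {single / 1e9:.1f} GB")
print(f"20 users:  {fleet / 1e9:.1f} GB ({fleet // single}x)")
```

Whatever the per-token cost turns out to be, the total grows linearly with the number of concurrent full-context users.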

DS4-Flash vs Qwen3.6

Posted by flavio_geo@reddit | LocalLLaMA | View on Reddit | 108 comments

inevitabledeath3@reddit

Expert offloading to where? The DGX Spark uses unified memory. The GPU and CPU share the same pool of memory. So CPU offloading wouldn't help at all, it would just slow things down.

Researchers developed a memory device that kept working at 700°C, opening a path to electronics for Venus, drilling, and AI

Posted by sr_local@reddit | hardware | View on Reddit | 74 comments

inevitabledeath3@reddit

That's a r/suicidebywords if I have ever seen one. Are you like okay? Do you need a hug? You shouldn't put yourself down that much. I am sure there are much dumber people on here. This place is a state.

Researchers developed a memory device that kept working at 700°C, opening a path to electronics for Venus, drilling, and AI

Posted by sr_local@reddit | hardware | View on Reddit | 74 comments

inevitabledeath3@reddit

> AI has no place in science because it is not repeatable. Machine Learning has a place in science but only for prediction, not experimentation.

You said this, which implies machine learning is not a type of AI. For the record, machine learning and AI in general are used in many scientific fields for various purposes, including data analysis, which is something you said can't be done.

Researchers developed a memory device that kept working at 700°C, opening a path to electronics for Venus, drilling, and AI

Posted by sr_local@reddit | hardware | View on Reddit | 74 comments

inevitabledeath3@reddit

Machine learning is AI by definition. You should stop telling other people what science is and go and study some computer science and data science yourself lmao.

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

inevitabledeath3@reddit

What would you use a model that size for though? I am having a hard time finding a good use for even a 27B model, never mind 9B.

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

inevitabledeath3@reddit

We are talking about running models at home FFS. If you need AI models, especially capable ones, you can rent them from the cloud for very little money. Renting a 30B parameter model would cost less than the price of a machine to run a 9B model anyway.

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

inevitabledeath3@reddit

Well some people should buy more or better GPUs.

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

inevitabledeath3@reddit

Why do we need 9B models?

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

inevitabledeath3@reddit

Also you should probably be using vLLM and SGLang, as they are higher-performance tools than llama.cpp.

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

inevitabledeath3@reddit

vLLM uses HF transformers and I am guessing SGLang does as well. llama.cpp isn't really a serious inference engine for larger models or large deployments, as it underperforms other tools, especially at scale with concurrency. It's only really used for self-hosting and edge devices like phones and laptops because of its wide compatibility with different hardware. If it took a lot of work to add support to llama.cpp - which it seems like it did - then it makes sense they wouldn't bother targeting it. These were 80B models designed to scale. I don't think it's fair to say they didn't help integrate it just because they didn't target the specific tool you use.

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

inevitabledeath3@reddit

Did Alibaba really do nothing, or did they just not work with llama.cpp specifically? It seemed like people were hosting these models, presumably using vLLM or SGLang, long before llama.cpp got support.

Qwen3.6-Plus

Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 226 comments

inevitabledeath3@reddit

How do you know that we can't run it? I have seen people here running 397B before. Some of us work for organisations putting together their own infrastructure for LLMs. I am part of that process at my University.

Qwen3.6-Plus

Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 226 comments

inevitabledeath3@reddit

Is it difficult to do the 1M context window?

Qwen3.6-Plus

Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 226 comments

inevitabledeath3@reddit

Minimax already did this. It's not new behaviour for them. Qwen always had proprietary max versions. GLM is the one that's unusual.

Tell me about the RTX 8000 - 48GB is cheap right now

Posted by Thrumpwart@reddit | LocalLLaMA | View on Reddit | 133 comments

inevitabledeath3@reddit

Doesn't the 2080 Ti have NVLink or SLI?

Tell me about the RTX 8000 - 48GB is cheap right now

Posted by Thrumpwart@reddit | LocalLLaMA | View on Reddit | 133 comments

inevitabledeath3@reddit

You could try ik_llama.cpp. They have -sm graph, which is essentially tensor parallelism, and they also support NVLink and point-to-point connections.

llama.cpp at 100k stars

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 48 comments

inevitabledeath3@reddit

I think what we really need is something like LM Studio or ollama for other inference tools like vLLM and SGLang. Heck, having one for ik_llama.cpp would be a start.

llama.cpp at 100k stars

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 48 comments

inevitabledeath3@reddit

For local models I thought quantised was the standard. To be honest though, I don't think llama.cpp is all that it's made out to be. I use vLLM and SGLang now and they generally have better performance. The main advantage of llama.cpp is that it's simpler and has easy-to-use wrappers like ollama, LM Studio, and Unsloth Studio.

GPT-OSS-120B vs DGX Spark

Posted by AdamLangePL@reddit | LocalLLaMA | View on Reddit | 18 comments

inevitabledeath3@reddit

A DGX Spark has a GPU dude

Friendly reminder inference is WAY faster on Linux vs windows

Posted by triynizzles1@reddit | LocalLLaMA | View on Reddit | 111 comments

inevitabledeath3@reddit

I mean, if you want real performance try vLLM and SGLang. Heck, try ik_llama.cpp. Even llama.cpp directly is better than ollama.

Unsloth announces Unsloth Studio - a competitor to LMStudio?

Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 270 comments

inevitabledeath3@reddit

Thought I would reply after trying vLLM some more. You are right, it is indeed faster in many cases, though I think there are some model architectures and edge cases where ik_llama.cpp or even llama.cpp is still faster. Talking with you was actually the push I needed to practice using tools like vLLM and SGLang.

It costs you around 2% session usage to say hello to claude!

Posted by Complete-Sea6655@reddit | LocalLLaMA | View on Reddit | 81 comments

inevitabledeath3@reddit

This is a misunderstanding of how prompt caching works. If there are other users with exactly the same prefix - which there obviously will be in this case since it's a system prompt - then the prompt is cached. This is trivial for modern inference software to handle. If they are counting this as not cached just because it's a new session for you then that's kind of ridiculous, since I can basically guarantee that it will be cached on their end thanks to all users using the same system prompt. If it's not being cached on the backend they are incompetent. Just straight up incompetent. Open source inference software can do this, they have literally no excuse.
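
To illustrate the idea, here is a toy sketch of block-level prefix caching. It only tracks which prefixes have been seen, where a real engine would store the KV tensors for each block. The class name, the block size, and the token IDs are all made up for the example, and this is not how any particular provider implements it.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; tiny on purpose for the demo


class PrefixCache:
    """Toy prefix cache: each block is keyed by a hash of all tokens up to its end."""

    def __init__(self):
        self.blocks = set()

    @staticmethod
    def _block_keys(tokens):
        keys = []
        hasher = hashlib.sha256()
        full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
        for start in range(0, full, BLOCK_SIZE):
            block = tokens[start:start + BLOCK_SIZE]
            # The hasher keeps accumulating, so each key depends on the whole prefix.
            hasher.update(",".join(map(str, block)).encode() + b";")
            keys.append(hasher.copy().hexdigest())
        return keys

    def process(self, tokens):
        """Return how many leading tokens were already cached, then cache this prompt."""
        keys = self._block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.blocks:
                break
            hits += 1
        self.blocks.update(keys)
        return hits * BLOCK_SIZE


system_prompt = list(range(8))                  # shared by every user
user_a = system_prompt + [100, 101, 102, 103]   # first session ever
user_b = system_prompt + [200, 201, 202, 203]   # a different user's new session

cache = PrefixCache()
first = cache.process(user_a)
second = cache.process(user_b)
print(first, second)  # 0 8
```

The second user has never talked to the server before, yet the whole system prompt is a cache hit because the key depends only on the tokens, not on who sent them.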

TurboQuant in Llama.cpp benchmarks

Posted by tcarambat@reddit | LocalLLaMA | View on Reddit | 81 comments

inevitabledeath3@reddit

Eh they did more than that. They came up with a way of proving the divergence was within certain bounds.

TurboQuant in Llama.cpp benchmarks

Posted by tcarambat@reddit | LocalLLaMA | View on Reddit | 81 comments

inevitabledeath3@reddit

The whole point of TurboQuant was that it is provably low loss. They talk specifically about proving the distortion or divergence to be within certain bounds.

Liquid AI's LFM2-24B-A2B running at ~50 tokens/second in a web browser on WebGPU

Posted by xenovatech@reddit | LocalLLaMA | View on Reddit | 18 comments

inevitabledeath3@reddit

This is not a state space model

Intel will sell a cheap GPU with 32GB VRAM next week

Posted by happybydefault@reddit | LocalLLaMA | View on Reddit | 354 comments

inevitabledeath3@reddit

Actually vLLM has mainline support now. Intel has been working on this, in fairness to them.

M5 Max Actual Pre-fill performance gains

Posted by M5_Maxxx@reddit | LocalLLaMA | View on Reddit | 38 comments

inevitabledeath3@reddit

You have more than enough to run with full context. I can run Qwen 3.5 27B with full context on a pair of 3090s with 48GB VRAM.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

So Nemotron is an odd case because it's actually pre-trained in NVFP4. Only post-training and mid-training are done at higher precisions. So it makes sense the results are a bit weird. As I thought though, we don't know that NVFP4 is actually significantly worse than FP8 or even FP16/BF16. NVFP4 works in a different way to conventional 4-bit formats, which are mostly integer formats, or even MXFP4. It uses significantly more space as well since it has more scaling values. I think until someone does some proper research we won't really know, and by then someone will have found a better way to do quantization.
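
For a sense of the storage overhead from scaling values, here is a small back-of-the-envelope sketch. The block sizes are assumptions based on my understanding of the published format descriptions: NVFP4 with one 8-bit scale per 16 elements, MXFP4 with one 8-bit scale per 32 elements. The extra per-tensor scale NVFP4 carries is ignored because it is negligible for large tensors.

```python
def effective_bits(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage per weight once the per-block scale is amortised."""
    return element_bits + scale_bits / block_size


# Assumed layouts, see the note above.
nvfp4 = effective_bits(element_bits=4, scale_bits=8, block_size=16)
mxfp4 = effective_bits(element_bits=4, scale_bits=8, block_size=32)

print(f"NVFP4: {nvfp4} bits/weight")
print(f"MXFP4: {mxfp4} bits/weight")
print(f"NVFP4 / MXFP4: {nvfp4 / mxfp4:.3f}")
```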

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

I know of ik_llama.cpp but have never heard of this -sm graph. I couldn't find your comments talking about it.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

Also do you have a source for saying NVFP4 is unreliable?

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

So Sparks have no problems getting 1K prefill speed or more from what I have seen. That's one of the things they are actually good at thanks to being compute-heavy. I am talking about people who do use it for work though. I don't understand why you would go and do this unless you use it regularly. It doesn't really make sense otherwise. I don't think you actually understand just how much people are spending on tokens. Jensen Huang basically said that top-level engineers should be spending 250K in tokens a year. To me that's a bit ridiculous, but the truth is you can spend thousands on tokens. Have a look at https://www.viberank.app/. You could basically buy 2 DGX Sparks or more per year and still come out ahead compared to some of those guys. These are individuals as well, not teams or companies.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

What is -sm graph and what kind of speeds do you get on a rig like this?

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

As I have said before Sparks can run MoE models perfectly fine in NVFP4. You keep assuming dGPUs are the only option because you had issues with Strix Halo, which is a platform known for its software problems and just isn't computationally as strong as Spark. You haven't even run numbers for Sparks it seems even though they are much cheaper in both purchase and running costs. 3090s are fun to play with but aren't really practical in my eyes for LLMs. For image generation yeah they are great. Don't get me wrong, it's probably still more expensive than subscription pricing and maybe even API pricing. The thing is that I don't like the concept of usage limits or the privacy concerns with APIs. LLMs are expensive whether you go local or cloud and the ecosystem is in its infancy.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

Yes OpenClaw needs SOTA models generally. It can actually be dangerous to use it with less advanced models as it's more susceptible to prompt injection that way and more likely to make mistakes. GLM coding plans recently had a price hike. They aren't quite as expensive as Claude but still pretty pricey tbh. They also reduced the usage limits.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

Claude Max 20x is $200 a month and doesn't include API usage for things like OpenClaw. So it's much more expensive than you are thinking. Like $2400 a year.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

See, I had issues getting FP8 working here. I may give it a shot. I should note it isn't really supported in hardware like it is on newer architectures, so it will be slower than on a new card as you are dealing with things like the Marlin kernel at this point.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

I mean, your first issue is that you are using Q4, Q5, and Q6 quants, presumably with llama.cpp, rather than vLLM and NVFP4, which the Spark can do. llama.cpp just isn't as fast or as scalable as professional serving frameworks. That's true regardless of whether you are talking about 3090s or Spark. I also don't think the GGUF models are as good as proper NVFP4 or even AWQ or SmoothQuant. Software-wise, Strix Halo is a bit of a mess to be honest, with Vulkan outperforming ROCm in many cases. That's if you can even get ROCm working. This is a problem because tools like vLLM and SGLang don't work with Vulkan. I've seen benchmarks hit 40-60 TPS with Qwen 3.5 MoE models on Spark. I will be testing it out myself hopefully at some point when we get some at University.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

This is ignoring the existence of things like the DGX Spark and Strix Halo. A single DGX Spark can run Qwen 3.5 122B or Nemotron 3 Super. A pair of them can run Minimax. You would probably need 6 to run GLM 5 decently well though.
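
A rough way to sanity-check claims like these is to estimate the weight footprint from parameter count and bits per weight. The sketch below assumes 128 GB of unified memory per Spark, a hypothetical 80% of it usable for weights, and about 4.5 bits per weight for a 4-bit format with scales. It ignores KV cache, activations, and interconnect overhead, so treat it as a lower bound on what is needed.

```python
import math

SPARK_MEMORY_GB = 128   # assumed unified memory per DGX Spark
USABLE_FRACTION = 0.8   # hypothetical headroom left for KV cache, OS, etc.


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def sparks_needed(params_billion: float, bits_per_weight: float) -> int:
    """Minimum number of Sparks whose usable memory holds the weights."""
    usable = SPARK_MEMORY_GB * USABLE_FRACTION
    return math.ceil(weight_gb(params_billion, bits_per_weight) / usable)


# 122B parameters at an assumed ~4.5 bits per weight.
print(weight_gb(122, 4.5))      # 68.625
print(sparks_needed(122, 4.5))  # 1
```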