inevitabledeath3

PSA

Posted by Signal_Ad657@reddit | LocalLLaMA | View on Reddit | 523 comments

Next year we're getting 0.5T model from Grok

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 200 comments

we really all are going to make it, aren't we? 2x3090 setup.

Posted by RedShiftedTime@reddit | LocalLLaMA | View on Reddit | 164 comments

inevitabledeath3@reddit

I don't know about it being cursed. The cards have NVLink capability to work in pairs, so that part isn't cursed. It's also not like the power requirements are much different to server hardware. So I don't really get what's "cursed" about it.

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090

Posted by sandropuppo@reddit | LocalLLaMA | View on Reddit | 93 comments

Open Models - April 2026 - One of the best months of all time for Local LLMs?

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 153 comments

The exact KV cache usage of DeepSeek V4

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 57 comments

inevitabledeath3@reddit

The people who can run this model are serving multiple users, so they need more than 1M tokens of total context. They might be serving 20 users, each with 1M tokens of context, so they need 20x that amount.
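
To make the scaling concrete, here is a rough sketch. The per-token KV size is a made-up placeholder, not DeepSeek V4's real figure; the only point is that the total grows linearly with the number of concurrent users.

```python
def total_kv_cache_gib(users: int, context_tokens: int, kv_bytes_per_token: int) -> float:
    """KV cache needed if every user fills their full context at the same time."""
    total_bytes = users * context_tokens * kv_bytes_per_token
    return total_bytes / 2**30

# Placeholder: 10 KiB of KV cache per token (hypothetical, NOT a measured value).
KV_BYTES_PER_TOKEN = 10 * 1024

one_user = total_kv_cache_gib(1, 1_000_000, KV_BYTES_PER_TOKEN)
twenty_users = total_kv_cache_gib(20, 1_000_000, KV_BYTES_PER_TOKEN)
print(f"1 user: {one_user:.1f} GiB, 20 users: {twenty_users:.1f} GiB")
```

Swap in the real bytes-per-token figure for whatever model you care about; the 20x factor stays the same.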

DS4-Flash vs Qwen3.6

Posted by flavio_geo@reddit | LocalLLaMA | View on Reddit | 108 comments

inevitabledeath3@reddit

Expert offloading to where? DGX Spark uses unified memory. The GPU and CPU share the same pool of memory. So CPU offloading wouldn't help at all, it would just slow things down.

Researchers developed a memory device that kept working at 700°C, opening a path to electronics for Venus, drilling, and AI

Posted by sr_local@reddit | hardware | View on Reddit | 74 comments

inevitabledeath3@reddit

That's a r/suicidebywords if I have ever seen one. Are you like okay? Do you need a hug? You shouldn't put yourself down that much. I am sure there are much dumber people on here. This place is a state.

Researchers developed a memory device that kept working at 700°C, opening a path to electronics for Venus, drilling, and AI

Posted by sr_local@reddit | hardware | View on Reddit | 74 comments

inevitabledeath3@reddit

> AI has no place in science because it is not repeatable. Machine Learning has a place in science but only for prediction, not experimentation.

You said this, which implies machine learning is not a type of AI. For the record, machine learning and AI in general are used in many scientific fields for various purposes, including data analysis, which is something you said can't be done.

Final voting results for Qwen 3.6

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 285 comments

inevitabledeath3@reddit

We are talking about running models at home FFS. If you need AI models, especially capable ones, you can rent them from the cloud for very little money. Renting a 30B parameter model would cost less than the price of a machine to run a 9B model anyway.

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

inevitabledeath3@reddit

vLLM uses HF transformers and I am guessing SGLang does as well. llama.cpp isn't really a serious inference engine for larger models or large deployments, as it underperforms other tools, especially at scale with concurrency. It's only really used for self-hosting and edge devices like phones and laptops because of its wide compatibility with different hardware. If it took a lot of work to add support to llama.cpp - which it seems like it did - then it makes sense they wouldn't bother targeting it. These were 80B models designed to scale. I don't think it's fair to say they didn't help integrate it just because they didn't target the specific tool you use.

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

inevitabledeath3@reddit

Did Alibaba really do nothing, or did they just not work with llama.cpp specifically? It seemed like people were hosting these models, presumably using vLLM or SGLang, long before llama.cpp got support.

Qwen3.6-Plus

Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 226 comments

inevitabledeath3@reddit

How do you know that we can't run it? I have seen people here running 397B before. Some of us work for organisations putting together their own infrastructure for LLMs. I am part of that process at my University.

Tell me about the RTX 8000 - 48GB is cheap right now

Posted by Thrumpwart@reddit | LocalLLaMA | View on Reddit | 133 comments

llama.cpp at 100k stars

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 48 comments

inevitabledeath3@reddit

I think what we really need is something like LM Studio or ollama for other inference tools like vLLM and SGLang. Heck, having one for ik_llama.cpp would be a start.

llama.cpp at 100k stars

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 48 comments

inevitabledeath3@reddit

For local models I thought quantised was the standard. To be honest though, I don't think llama.cpp is all that it's made out to be. I use vLLM and SGLang now and they generally have better performance. The main advantage of llama.cpp is that it's simpler and has easy-to-use wrappers like ollama, LM Studio, and Unsloth Studio.

GPT-OSS-120B vs DGX Spark

Posted by AdamLangePL@reddit | LocalLLaMA | View on Reddit | 18 comments

Friendly reminder inference is WAY faster on Linux vs windows

Posted by triynizzles1@reddit | LocalLLaMA | View on Reddit | 111 comments

Unsloth announces Unsloth Studio - a competitor to LMStudio?

Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 270 comments

inevitabledeath3@reddit

Thought I would reply after trying vLLM some more. You are right, it is indeed faster in many cases, though I think there are some model architectures and edge cases where ik_llama.cpp or even llama.cpp is still faster. Talking with you was actually the push I needed to practice using tools like vLLM and SGLang.

It costs you around 2% session usage to say hello to claude!

Posted by Complete-Sea6655@reddit | LocalLLaMA | View on Reddit | 81 comments

inevitabledeath3@reddit

This is a misunderstanding of how prompt caching works. If there are other users with exactly the same prefix - which there obviously will be in this case since it's a system prompt - then the prompt is cached. This is trivial for modern inference software to handle. If they are counting this as not cached just because it's a new session for you then that's kind of ridiculous, since I can basically guarantee that it will be cached on their end thanks to all users using the same system prompt. If it's not being cached on the backend they are incompetent. Just straight up incompetent. Open source inference software can do this, they have literally no excuse.
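
Here is a toy sketch of the idea, not any real engine's implementation. Tokens are split into fixed-size blocks and each block is keyed by a hash of the whole prefix up to that point, so a brand new session that starts with the same system prompt hits the cache. `PrefixCache`, `BLOCK_SIZE` and the fake token ids are all made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (toy value)


class PrefixCache:
    """Toy cache: each full block is keyed by a hash of the entire prefix up to it."""

    def __init__(self):
        self.blocks = set()

    def _block_keys(self, tokens):
        keys, parent = [], b""
        full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
        for i in range(0, full, BLOCK_SIZE):
            chunk = ",".join(map(str, tokens[i:i + BLOCK_SIZE])).encode()
            # Chaining the parent hash makes the key depend on everything before it.
            parent = hashlib.sha256(parent + b"|" + chunk).digest()
            keys.append(parent)
        return keys

    def process(self, tokens):
        """Return how many leading tokens were already cached, then cache this request."""
        keys = self._block_keys(tokens)
        hit_blocks = 0
        for key in keys:
            if key not in self.blocks:
                break
            hit_blocks += 1
        self.blocks.update(keys)
        return hit_blocks * BLOCK_SIZE


cache = PrefixCache()
system_prompt = list(range(100, 116))  # 16 fake token ids shared by every user

user_a = system_prompt + [1, 2, 3, 4]
user_b = system_prompt + [9, 8, 7, 6]

first = cache.process(user_a)   # cold cache, nothing to reuse
second = cache.process(user_b)  # different user, new session, same system prompt
print(first, second)
```

The second request reuses every block of the shared system prompt even though it comes from a different user, which is the whole point.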

TurboQuant in Llama.cpp benchmarks

Posted by tcarambat@reddit | LocalLLaMA | View on Reddit | 81 comments

Liquid AI's LFM2-24B-A2B running at ~50 tokens/second in a web browser on WebGPU

Posted by xenovatech@reddit | LocalLLaMA | View on Reddit | 18 comments

Intel will sell a cheap GPU with 32GB VRAM next week

Posted by happybydefault@reddit | LocalLLaMA | View on Reddit | 354 comments

M5 Max Actual Pre-fill performance gains

Posted by M5_Maxxx@reddit | LocalLLaMA | View on Reddit | 38 comments

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

So Nemotron is an odd case because it's actually pre-trained in NVFP4. Only post-training and mid-training are done at higher precisions. So it makes sense the results are a bit weird. As I thought though, we don't know that NVFP4 is actually significantly worse than FP8 or even FP16/BF16. NVFP4 works in a different way to conventional 4-bit formats, which are mostly integer formats, or even MXFP4. It also uses more space since it has more scaling values. I think until someone does some proper research we won't really know, and by then someone will have found a better way to do quantization.
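
A rough way to put numbers on the storage overhead, assuming the block layouts as I understand them from the published format descriptions: NVFP4 stores one 8-bit FP8 scale per 16 elements, MXFP4 stores one 8-bit E8M0 scale per 32 elements. The per-tensor FP32 scale NVFP4 also carries is ignored here because it is negligible.

```python
def bits_per_weight(element_bits: int, block_size: int, scale_bits: int) -> float:
    """Average storage per weight: the element itself plus its share of the block scale."""
    return element_bits + scale_bits / block_size

nvfp4 = bits_per_weight(4, 16, 8)  # assumed: 16-element blocks, 8-bit scale
mxfp4 = bits_per_weight(4, 32, 8)  # assumed: 32-element blocks, 8-bit scale

extra_percent = (nvfp4 / mxfp4 - 1) * 100
print(f"NVFP4: {nvfp4} bits/weight, MXFP4: {mxfp4} bits/weight, {extra_percent:.1f}% more")
```

This only covers storage. It says nothing about accuracy, which is the part that still needs proper research.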

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

So Sparks have no problems getting 1K tokens per second of prefill or more from what I have seen. That's one of the things they are actually good at thanks to being compute-heavy. I am talking about people who do use it for work though. I don't understand why you would go and do this unless you use it regularly. It doesn't really make sense otherwise. I don't think you actually understand just how much people are spending on tokens. Jensen Huang basically said that top level engineers should be spending 250K in tokens a year. To me that's a bit ridiculous, but the truth is you can spend thousands on tokens. Have a look at https://www.viberank.app/. You could basically buy 2 DGX Sparks or more per year and still come out ahead compared to some of those guys. These are individuals as well, not teams or companies.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

As I have said before Sparks can run MoE models perfectly fine in NVFP4. You keep assuming dGPUs are the only option because you had issues with Strix Halo, which is a platform known for its software problems and just isn't computationally as strong as Spark. You haven't even run numbers for Sparks it seems even though they are much cheaper in both purchase and running costs. 3090s are fun to play with but aren't really practical in my eyes for LLMs. For image generation yeah they are great. Don't get me wrong, it's probably still more expensive than subscription pricing and maybe even API pricing. The thing is that I don't like the concept of usage limits or the privacy concerns with APIs. LLMs are expensive whether you go local or cloud and the ecosystem is in its infancy.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

Yes OpenClaw needs SOTA models generally. It can actually be dangerous to use it with less advanced models as it's more susceptible to prompt injection that way and more likely to make mistakes. GLM coding plans recently had a price hike. They aren't quite as expensive as Claude but still pretty pricey tbh. They also reduced the usage limits.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

See I had issues getting FP8 working here. I may give it a shot. I should note it isn't really supported in hardware like it is on newer architectures, so will be slower than on a new card as you are dealing with things like the Marlin kernel at this point.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

I mean your first issue is that you are using Q4, Q5, and Q6 quants, presumably with llama.cpp rather than vLLM and NVFP4, which the Spark can do. llama.cpp just isn't as fast or as scalable as professional serving frameworks. That's true regardless of whether you are talking about 3090s or Spark. I also don't think the GGUF models are as good as proper NVFP4 or even AWQ or SmoothQuant. Software wise, Strix Halo is a bit of a mess to be honest, with Vulkan outperforming ROCm in many cases. That's if you can even get ROCm working. This is a problem because tools like vLLM and SGLang don't work with Vulkan. I've seen benchmarks hit 40-60 TPS with Qwen 3.5 MoE models on Spark. I will be testing it out myself hopefully at some point when we get some at University.

Honest take on running 9× RTX 3090 for AI

Posted by Outside_Dance_2799@reddit | LocalLLaMA | View on Reddit | 255 comments

inevitabledeath3@reddit

This is ignoring the existence of things like the DGX Spark and Strix Halo. A single DGX Spark can run Qwen 3.5 122B or Nemotron 3 Super. A pair of them can run Minimax. You would probably need 6 to run GLM 5 decently well though.
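
For a back-of-the-envelope check on what fits, here is a weights-only sketch. It takes the 122B parameter count from the model name and the 128 GB of unified memory on a DGX Spark; the 0.85 usable fraction is a guess on my part to leave room for the OS, KV cache and runtime overhead, not a measured figure.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (10^9 bytes); ignores KV cache and activations."""
    return params_billion * bits_per_weight / 8

SPARK_MEMORY_GB = 128    # unified memory on one DGX Spark
USABLE_FRACTION = 0.85   # hypothetical headroom, tune for your setup

budget_gb = SPARK_MEMORY_GB * USABLE_FRACTION
for bits in (16, 8, 4.5):  # BF16, FP8, and a 4-bit format with block scales
    size_gb = weight_footprint_gb(122, bits)
    print(f"{bits} bits/weight: {size_gb:.1f} GB, within budget: {size_gb <= budget_gb}")
```

Under these assumptions only the 4-bit variant fits on a single box, which is why the quantization format matters so much on this hardware.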