audioen

throw on the pile of thoughts to consider when pondering the ethics of Singularitarianism, data centers, etc.

Posted by hoodiemonster@reddit | collapse | View on Reddit | 18 comments

audioen@reddit

Singularity, like many other ideas about the future, is based on extrapolation. The argument is simple, which is why people grasp it, and quite a few think of it as a revelation. The argument is roughly that technological progress seems to be advancing, and the advancement seems to speed up. What happens if the time between new iPhone generations, or new computer chips, shortens from years to days, then to hours -- what if it is minutes or seconds? Et voilà, singularity: the point where technological advance is so fast that nothing past it can be predicted. It is kind of like what happens when you enter a black hole in physics, from which the analogy is likely derived -- speed increases, the world becomes strange, and finally the event horizon is passed, at which point nobody can get any information back to the other side about what happens inside.

Beware relentless extrapolation without constraints. Real-world science and technological progress have limits: how small can you make a transistor before it no longer works; the number of scientists required for the next unit of technological progress keeps increasing; etc. I am not a believer in singularity, because it is unrealistic and blindly optimistic in ways that never pan out in the real world, but I also have to admit that the world has changed a lot. I was there for the arrival of the Internet, the dissolution of the Soviet Union, the mobile phone era, and now the AI era.

By AI, I don't mean datacenters, as I do not want to relinquish control of what I run or the privacy of my own computer, so I am strictly limited to whatever can run on a home PC. My impression of even local AI is that it speaks and writes with sophistication and seems to possess a lot of practical programming understanding and the capability to simulate program behavior verbally, e.g. tracking code, assigning trial values to variables, tracing the behavior, and hunting the bugs.

At non-programming tasks, it is a fount of knowledge with a better basic-level grasp of every intellectual thing known to man, a true generalist genius, as if you had taken Da Vinci and somehow taught him the modern world. It is likely that AI provides an accelerant for scientific progress, as AI can read all the papers, correlate them, and use the results in new contexts, whereas human experts are quite siloed in whatever their specialty is. There is a possibility that a new kind of cybernetic researcher, one that knows about all modern advances of science, can perform valid reasoning, and can design experiments and then either ask humans to run them or possibly run some kind of automatic testing facility all on its own, brings a new era of faster scientific progress. In that case, we do inch closer to singularity, though I expect that we will fall short of ever attaining it.

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 132 comments

audioen@reddit

I am running UD-Q8\_K\_XL on a GB10 with llama.cpp, using about 400k total context and unified KV cache. I have the mmproj loaded also. I allocate more than 256 Ki of context so that a couple of longer sequences can run in parallel without exhausting the total, while the individual decode slots still can't go past 256 Ki. With some RAM also spent for cache -- around 25 GB worth -- the machine is full. I think that even a 64 GB computer would be horribly limited for agentic work, but single streams might work on a smaller computer.

My wish list for more llama.cpp enhancements:

* Make parallel MTP streams work. Currently, it seems like MTP doesn't get used if multiple streams are being decoded. This is a strong argument for single-stream decoding, which gives faster inference per stream and saves a good bit of VRAM.
* Make the prompt cache RAM writable to disk. I would be happy to use a 100 GB prompt cache. Reading the prompt cache from disk will be 100 times faster than recomputing it on a GB10.
* Invent ways that allow reusing the prompt cache even if it's slightly wrong technically, e.g. stuff like KV cache reuse that likely doesn't give exactly the right result but possibly saves a lot of prompt processing.
* Possibly also quantize the past KV cache to some lower precision like 5-8 bits, on the theory that the information there is not as important as the information for the current inference job, which should always remain at full precision. This would make more efficient use of the cache RAM both on disk and in live memory, helping with these long sequences but hopefully with only a very small drop in quality.

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context

Posted by Sisuuu@reddit | LocalLLaMA | View on Reddit | 40 comments

audioen@reddit

And here I'm drafting up to 5 tokens at once.

    draft-mtp: #calls(b,g,a) = 51 6437 6182, #gen drafts = 6182, #acc drafts = 5715, #gen tokens = 25255, #acc tokens = 20956, dur(b,g,a) = 0.072, 372438.617, 8.948 ms

So: 6182 drafts with 25255 tokens generated, averaging 4.08 tokens per draft (I use variable length drafts). Acceptance rate across the average \~4.1 token draft seems to be 20956 / 25255, or 83 %, which is not too bad for 4 token long drafts. I suspect that something is wrong in your setup.

    0.07.735.947 I srv load: --spec-draft-n-max
    0.07.735.947 I srv load: 4
    0.07.735.947 I srv load: --draft-p-min
    0.07.735.948 I srv load: 0.5
    0.07.735.948 I srv load: --spec-ngram-mod-n-match
    0.07.735.948 I srv load: 32
    0.07.735.948 I srv load: --spec-ngram-mod-n-max
    0.07.735.948 I srv load: 10
    0.07.735.949 I srv load: --spec-ngram-mod-n-min
    0.07.735.949 I srv load: 10
    0.07.735.949 I srv load: --spec-type
    0.07.735.949 I srv load: draft-mtp,ngram-mod

I am using this setup. The ngram is for rapidly generating about 10 tokens at once when the model is reciting itself.

    ngram-mod: #calls(b,g,a) = 52 6657 204, #gen drafts = 204, #acc drafts = 204, #gen tokens = 2040, #acc tokens = 1419, dur(b,g,a) = 391.035, 16.129, 0.099 ms

I don't get as good acceptance as with MTP, as it's only about 70 % and it fires quite rarely, but the token rate noticeably goes up whenever it fires.
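For anyone who wants to repeat the arithmetic on their own logs, here is a minimal sketch that pulls the counters out of the two statistics lines quoted above. The helper names `parse_spec_stats` and `summarize` are made up for illustration, and the regular expressions only assume the field labels that appear in these lines; other llama.cpp builds may print them differently.

```python
import re

def parse_spec_stats(line: str) -> dict:
    """Extract the four draft counters from a speculator statistics line."""
    fields = {
        "gen_drafts": r"#gen drafts = (\d+)",
        "acc_drafts": r"#acc drafts = (\d+)",
        "gen_tokens": r"#gen tokens = (\d+)",
        "acc_tokens": r"#acc tokens = (\d+)",
    }
    stats = {}
    for name, pattern in fields.items():
        match = re.search(pattern, line)
        if match is None:
            raise ValueError(f"missing field: {name}")
        stats[name] = int(match.group(1))
    return stats

def summarize(stats: dict) -> tuple[float, float]:
    """Return (tokens per generated draft, token acceptance rate)."""
    tokens_per_draft = stats["gen_tokens"] / stats["gen_drafts"]
    acceptance = stats["acc_tokens"] / stats["gen_tokens"]
    return tokens_per_draft, acceptance

# The two lines quoted in the comment above, copied verbatim.
MTP_LINE = (
    "draft-mtp: #calls(b,g,a) = 51 6437 6182, #gen drafts = 6182, "
    "#acc drafts = 5715, #gen tokens = 25255, #acc tokens = 20956, "
    "dur(b,g,a) = 0.072, 372438.617, 8.948 ms"
)
NGRAM_LINE = (
    "ngram-mod: #calls(b,g,a) = 52 6657 204, #gen drafts = 204, "
    "#acc drafts = 204, #gen tokens = 2040, #acc tokens = 1419, "
    "dur(b,g,a) = 391.035, 16.129, 0.099 ms"
)

for label, line in (("draft-mtp", MTP_LINE), ("ngram-mod", NGRAM_LINE)):
    tokens_per_draft, acceptance = summarize(parse_spec_stats(line))
    print(f"{label}: {tokens_per_draft:.2f} tokens/draft, {acceptance:.1%} accepted")
```

Note that this divides by generated drafts and generated tokens; dividing by accepted drafts instead gives a slightly different average draft length.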

qwen35: use post-norm hidden state for MTP by am17an · Pull Request #24025 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 21 comments

audioen@reddit

It feels like this guy is singlehandedly ushering in all the goodies we need for a real local llama era. Much better performance for my favorite and very flexible inference engine, plus all the quality fixes that will probably soon leave even vllm behind, because, for example, min-p sampling for the draft model is unique to llama.cpp to my knowledge.

In Q8_0 weight quantization, why can't we just skip blocks of 32 that have very large outliers?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 19 comments

audioen@reddit

Various approaches exist, such as methods that encode \~1 % of the buffer's weights separately in a fixed-size memory region set aside for each tensor. I think I heard about this in the context of AWQ or some similar approach, but I can't recall which one. I think implementers mostly want bounded tensor sizes and as much uniformity and simplicity in the core loop as possible, so that the matrix multiplication loop can go fast. It is probably easier to accept error during the big simple loop and then fix some specific values in the result matrix afterwards, by computing some sparse matrix multiplications.
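To make the "big simple loop plus sparse fix-up" idea concrete, here is a toy sketch in plain Python. It is not how AWQ, llama.cpp, or any particular library does it; the function names, the outlier threshold, and the one-scale-per-row int8-style quantizer are all assumptions chosen to keep the example short.

```python
def quantize_row(row, levels=127):
    """Symmetric int8-style quantization of one row with a single scale."""
    scale = max(abs(v) for v in row) / levels or 1.0
    return [round(v / scale) for v in row], scale

def split_outliers(matrix, threshold):
    """Move entries with |w| > threshold into a sparse {(row, col): value} map."""
    dense = [row[:] for row in matrix]
    sparse = {}
    for i, row in enumerate(matrix):
        for j, w in enumerate(row):
            if abs(w) > threshold:
                sparse[(i, j)] = w   # kept at full precision
                dense[i][j] = 0.0    # removed from the quantized part
    return dense, sparse

def matvec_quantized(dense, x):
    """The uniform loop: quantize each row, multiply, rescale."""
    out = []
    for row in dense:
        q, scale = quantize_row(row)
        out.append(scale * sum(qi * xi for qi, xi in zip(q, x)))
    return out

def apply_sparse_fix(y, sparse, x):
    """Add back the contribution of the outliers as a sparse product."""
    fixed = y[:]
    for (i, j), w in sparse.items():
        fixed[i] += w * x[j]
    return fixed

# Hypothetical weights: row 0 contains one large outlier (8.0).
W = [[0.1, -0.2, 8.0, 0.05],
     [0.3, 0.1, -0.1, 0.2]]
x = [1.0, 2.0, 3.0, 4.0]
exact = [sum(w * xi for w, xi in zip(row, x)) for row in W]

naive = matvec_quantized(W, x)
dense, sparse = split_outliers(W, threshold=1.0)
fixed = apply_sparse_fix(matvec_quantized(dense, x), sparse, x)

print("error without fix:", abs(naive[0] - exact[0]))
print("error with fix:   ", abs(fixed[0] - exact[0]))
```

With the outlier pulled out, the remaining values in row 0 share a much smaller scale, so the rounding error of the quantized part shrinks while the dense loop itself stays unchanged.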

Would you consider getting an NVIDIA RTX Spark laptop?

Posted by gamblingapocalypse@reddit | LocalLLaMA | View on Reddit | 175 comments

audioen@reddit

No. The reason: cost and what you get. I have a GB10 based Spark and its 20 ARM cores are slow. Far slower than a normal PC. You can watch it take like 3 times longer to compile llama.cpp than my Ryzen PC does. (I also see that there's a CPU core pegged at 100% when it infers, and I think this indicates that inference throughput is also being limited by the CPU when using llama.cpp.) Secondly, the bandwidth. It's not competitive, and too slow for a 2026 device. Similar money already buys you a machine with over two times the RAM bandwidth in the M5 Max 128 GB device. So you can get double the inference speed from a machine on the shelf today, I think. So no, the value proposition is in my opinion terrible. Much as I like ARM laptops in principle and have nothing against the CPU, I need ARM as done by Apple rather than ARM as done by nvidia. They seem to be lagging behind.

Genuinely what do we do about the bot comments in this sub

Posted by Borkato@reddit | LocalLLaMA | View on Reddit | 102 comments

audioen@reddit

I don't know what's going on in this subreddit in particular. The annoying bot comments follow an extremely predictable pattern, to the point that it is noteworthy how easy they are to spot: they start with falsehoods and half-truths, proceed into absurd intro text explaining a problem as exaggerated semantic noise, then explain the virtues of some crap they're actually trying to sell or whatever, with phrases like "what actually helps", and finish with "curious what others think". A paranoid-minded person might think that this is so that the actual bot comments can better swim under the radar. But -- I doubt that is what is happening. It's just a really crappy LLM spamming really formulaic content for whatever reason. No LLM I know speaks like that.

DIY Local 2x DGX Spark cluster cooler with automatic temperature controlled fan.

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 6 comments

audioen@reddit

I 3D-printed a small plastic table, which I used to lift my Spark, which sits on top of a shelf, up to the wall vent so that it uses the building's ventilation for extra airflow. Something like this is a much better idea, as it guarantees good ventilation. I think it would make more sense to place the Sparks vertically inside the case, which would allow making the case from a tiny fraction of the plastic that has been expended here on walls and a divider sturdy enough to withstand the weight of the units long-term. With the weight resting just on the bottom, you'd only have to design a small divider to slot the units in place, each on its own side of the enclosure. It would also remove the need to engineer and design any shelves. With the horizontal design, the space between the Sparks is probably a little hot, as both units heat it from both sides without any airflow able to cool it.

What's this sub geebral opinion on quantisizing the KV cache

Posted by misanthrophiccunt@reddit | LocalLLaMA | View on Reddit | 91 comments

audioen@reddit

I think quantization is a mistake regardless of what you quantize. The only true answer is to get enough bandwidth and VRAM that you don't need to do it at all. This is a bitter pill to swallow for me personally, as it is an expensive proposition that largely makes all my hardware investments worthless if I were to follow it. Even 27b Qwen would become annoyingly slow as bf16 on the hardware I have got, even if it would technically fit. So I try to get by with UD-Q8\_K\_XL and 16-bit KV cache, and I *think* it might be only slightly damaged relative to the full-quality model.

Looking around at what is practical, I'm thinking I could perhaps get something like an M5 Pro 128 GB Apple device for running a bf16 gguf of Qwen3.6-27b at around 20-30 tok/s with MTP. It has over double the bandwidth, after all, and UD-Q8\_K\_XL is already like 10 bits per weight on average, I think.

I'm eagerly waiting for models that are small and trained in 8-bit or less, ideally ternary so it's all 1.58 bit (or possibly just binary, which is probably the final frontier), and with these new attention models that hopefully involve less compute and less KV cache, so that we can stop this ugly business of quantizing anything, and don't have to purchase these incredibly expensive and power-hungry computers to run models that seem to target mostly datacenter-type GPUs needing tens of thousands of dollars of initial investment. So far, the field is still not focused on producing competitive models that could work on consumer hardware, though there is a trickle of trial models that seem like they might eventually produce something like a 50B param model that fits in 16 GB GPUs, and might have performance comparable to, if not better than, Qwen3.6-27b, along with faster inference because the memory size is smaller.

Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost losselss on F16 K / q4_0 V. Part 1.

Posted by DrBearJ3w@reddit | LocalLLaMA | View on Reddit | 13 comments

audioen@reddit

The sensible comparison point for a scheme that packs K cache to 8 bits with an fp16 scale is K cache in q8\_0, because it does something similar to your packed16 scheme. The internal contradictions in this text are just weird. One part of the text seems to claim it is not lossy ("not quantization but repacking", paraphrasing), while another part admits that some rounding and loss of K precision is involved. You also aren't going to cut K overhead to one third if you go from 16 bits to 8 bits per value; basic math says the ratio of those values is 2:1, unless something else is involved here.
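The 2:1 point can be checked with a few lines of arithmetic. This sketch assumes one fp16 scale per block of 32 values, which is the q8\_0 layout (34 bytes per 32 values); whether the packed16 scheme under discussion uses the same block size is an assumption, and `bits_per_value` is a name invented for this example.

```python
def bits_per_value(value_bits: int, block_size: int, scale_bits: int) -> float:
    """Average storage per value when each block carries one shared scale."""
    return (value_bits * block_size + scale_bits) / block_size

F16_BITS = 16.0

# q8_0-style layout: 32 int8 values plus one fp16 scale per block.
q8_like = bits_per_value(value_bits=8, block_size=32, scale_bits=16)
ratio = F16_BITS / q8_like

print(f"8-bit with fp16 scale: {q8_like} bits per value")
print(f"f16 to 8-bit ratio: {ratio:.2f}:1")
print(f"bits per value needed for a 3:1 reduction: {F16_BITS / 3:.2f}")
```

Because of the scale overhead the ratio comes out slightly under 2:1, so a reduction to one third would need roughly 5.3 bits per value or some other saving that the post does not describe.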

Qwen3.6-27B Quantization Benchmark

Posted by bobaburger@reddit | LocalLLaMA | View on Reddit | 74 comments

audioen@reddit

I am presently testing if I can spot any difference between Q8\_0 and that. I put UD-Q8\_K\_XL on my server yesterday. Honestly, I do not think I can. This likely requires having the agent perform the same task repeatedly and measuring success using objective criteria, with lots of repeats to tease the signal out from under the random sampling noise of token generation. The evaluation itself is likely complex as well, because you have to determine the criteria and then measure them.

I know that Q4 is useless on this model for my purposes, no matter what quant. It doesn't understand the code it is reading and makes erroneous claims about it, so it can't document, can't test, can't code, can't translate -- can't do anything properly, really. Q5 is the lowest mildly useful version of Qwen, in that it is heaps better than any 4-bit quant in my experience, but it still struggles when the context grows and starts to become incoherent by something like 100k tokens in, so I can't use it. Q6\_K can stay coherent up to 200k tokens, but I've noticed it confusing its own messages with mine and starting to struggle near the end, and it was simply atrocious at translating Finnish (likely in a long-context scenario -- I don't recall how many tokens the context had): what came out of the model was just barely legible gibberish, really not even words.

Q8\_0 and UD-Q8\_K\_XL are the two best quants available in llama.cpp world. I'd prefer to look past quantization to BF16 or DF11 or some similar lossless approach. I am not willing to pay the price of BF16, I think, but I might be interested in DF11 if some hero did the work and created GGUF support for lossless compressed floating point that can still be rapidly decoded at inference time. Something like UD-Q8\_K\_XL is a 35 GB file for 27B params, and math says that's about 10.4 bits per weight on average, so with only a very small additional cost we should be able to not quantize at all and eliminate this as a factor.
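The bits-per-weight figure follows directly from file size and parameter count. This is a rough sketch that assumes decimal gigabytes (1 GB = 10^9 bytes) and treats the whole file as weights, ignoring metadata and any tensors stored at other precisions; `bits_per_weight` and `file_size_gb` are names made up for the example.

```python
def bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average bits stored per parameter, treating the whole file as weights."""
    return file_bytes * 8 / n_params

def file_size_gb(bits: float, n_params: float) -> float:
    """File size in decimal GB for a given average bits per weight."""
    return bits * n_params / 8 / 1e9

N_PARAMS = 27e9  # 27B parameters, as quoted above

bpw = bits_per_weight(35e9, N_PARAMS)
print(f"35 GB file: {bpw:.2f} bits per weight")
print(f"11 bits per weight: {file_size_gb(11, N_PARAMS):.1f} GB")
print(f"16 bits per weight (bf16): {file_size_gb(16, N_PARAMS):.1f} GB")
```

Under these assumptions an 11-bit lossless format would add only a couple of GB over the 35 GB file, while plain bf16 would need roughly 54 GB.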

Qwen3.6-27B Quantization Benchmark

Posted by bobaburger@reddit | LocalLLaMA | View on Reddit | 74 comments

audioen@reddit

That is clearly using a saturated benchmark test. I personally find any form of 4-bit Qwen3.6 unusable for actual work, 5-bit makes strange mistakes, 6-bit seems to be quite bad at translating to my niche mother tongue (Finnish), and so I find that only at 8-bit does the model seem to work properly at all the tasks I am using it for. If I'm telling it to design and implement something, and it has to define string constants in the UI and then translate these, I don't want to come back to barely intelligible gibberish in the UI, but to fluent language. I used to run Aman Gupta's Q8\_0 for a while, and I'm now testing UD-Q8\_K\_XL because I know it's supposed to be slightly better still. I think anyone thinking the model is "good" at 4 bits hasn't really been able to evaluate it at 8 bits, and even that is possibly still slightly worse than bf16 is. After all, these charts are showing a 2-3 % top token choice difference, so every 30 to 50 tokens or so the model then likely differs from what the original would have said. (Assuming that I am interpreting the presented charts correctly.)

Is he crazy to say that?

Posted by pmv143@reddit | LocalLLaMA | View on Reddit | 203 comments

audioen@reddit

LLMs perform valuable work, and you can project a payback period for an investment in running them yourself. With a Western frontier model, I would probably spend much more than $10 per day, and with a Chinese model, around $5 per day. Depending on the prices and the quantity of mechanical intellectual labor you use, you'll probably discover that you're doing more than $1000 worth of AI compute each year. To me, it's pretty much a no-brainer that you'd put in a little money to get lots of value back. Back in the day when you could get the ASUS GB10 for like $2000-3000, it was relatively cheap considering the value. Hell, I should have bought two, probably. I'd be running 256 GB range models with decent enough compute and bandwidth today. I regret not doing it now, but I had just bought a Strix Halo box with 128 GB earlier that year and I felt weird about buying yet another 128 GB box for more AI stuff.

The real stuff for multiuser remains very costly -- $100k and up is in the ballpark for serving an organization, though with dozens to a few hundred users, the math fundamentally doesn't change very much. We can probably bring the insane compute requirement in prefill down in future models, and turn the sequential inference into more like block inference with approaches such as dflash. This way, the models better match the limitations of common hardware, and these consumer boxes many of us have today become more useful.
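As a back-of-the-envelope check on the payback argument, here is a tiny sketch using only the dollar figures quoted above. It deliberately ignores electricity, resale value, and the quality gap between a local model and a hosted one, so treat it as an illustration rather than a real cost model; `payback_days` is a name invented here.

```python
def payback_days(hardware_cost: float, api_cost_per_day: float) -> float:
    """Days of avoided API spending needed to cover the hardware price."""
    if api_cost_per_day <= 0:
        raise ValueError("daily API cost must be positive")
    return hardware_cost / api_cost_per_day

# Figures quoted in the comment: a $2000-3000 box versus $5-10 per day of usage.
for hardware in (2000, 3000):
    for per_day in (5, 10):
        days = payback_days(hardware, per_day)
        print(f"${hardware} box at ${per_day}/day: {days:.0f} days, "
              f"${per_day * 365} of API use per year")
```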

Breaking the music supply constraint

Posted by entsnack@reddit | LocalLLaMA | View on Reddit | 317 comments

audioen@reddit

[https://www.youtube.com/watch?v=8Zzz2c0uhR0](https://www.youtube.com/watch?v=8Zzz2c0uhR0) I'm not entirely sure about it. Clearly, this is human-AI collaboration, but something like forest doomcore is an underserved genre for sure. While the music is just a fairly generic death metal type of thing, there is a wonderful undertone of silliness about the whole production. I like AI in a similar way that I like cartoons -- they can just draw whatever weird stuff would be impractical to film, and thus there is an unleashing of a kind of raw creation that is based on mixing existing stuff, yes, but in a way that might not have been done before.

Breaking the music supply constraint

Posted by entsnack@reddit | LocalLLaMA | View on Reddit | 317 comments

audioen@reddit

I just listened to it and while I would hesitate to call it actually good music, I agree it is definitely a good example. Its sound quality is nearly good enough to not be immediately distracting or weird, and it even seems to have something like a musical composition, though I personally find it quite boring to listen to -- just not enough variation.

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Posted by FantasticNature7590@reddit | LocalLLaMA | View on Reddit | 24 comments

audioen@reddit

MTP has KV cache and context of its own, at least in case of Qwen, but possibly in all models that have this head.

Qwen 3.6 27B overdoing it

Posted by WhatererBlah555@reddit | LocalLLaMA | View on Reddit | 68 comments

audioen@reddit

I think mostly it is not messing up. Sometimes it makes an unwanted change. I have to review the agent's work before I can commit it -- if for nothing else, then to check that it doesn't touch any unexpected files. It is rare, but it happens often enough that I need to skim through a git diff to be sure that the changes are related to what I actually want done. Sometimes I find that the agent didn't realize which component I wanted to change, and it may have implemented the entire change in the wrong file.

I find it rarely going off the rails, unless the code looks on superficial analysis to be completely incorrect, in which case it can helpfully attempt to fix it for you. To stop this, I typically request the agent to write documentation explaining why something is done the way it is, so that it will stop trying to change it in the future. If you provide useful documentation, you will help yourself and the context-free agent that later stumbles on the same code and would otherwise likely conclude again that it's something that it must change.

Project documentation helps. An AGENTS.md file can cover exceptions and special cases. It can define coding style, and I find that the agent tries very hard to observe your instructions. At the same time, I advise against making the file extremely long or trying to cover lots of use cases by writing tons of examples, because a long system prompt is also counterproductive in terms of polluting the context and inference, and any mistakes in examples or discussion will just confuse the model and degrade the performance you get out of it one way or another. Watching the first reasoning traces after any changes is critical, especially if it suddenly spends dozens of seconds and writes 1000 tokens of reasoning, as this indicates that the model is arguing with itself about what you want or how it should interpret one clause or another.

How do I make MTP work in llama-server?

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 28 comments

audioen@reddit

A prompt processing drop is expected, though that seems like it could be a little excessive. MTP is a drafting model, and it has a KV cache too. So it adds something which requires processing. Your draft-mtp speculator appears to be unconfigured and goes with defaults. Consider setting the draft length to something sensible, e.g. 5, and the p-min value for draft tokens to something like 0.6 or 0.7. The MoE models are not always going to accelerate much even when using MTP well, because the draft model overhead is considerable relative to the per-token overhead. MTP is a bigger win in dense models.

Large Language Models Report Subjective Experience Under Self-Referential Processing

Posted by SrijSriv211@reddit | LocalLLaMA | View on Reddit | 12 comments

audioen@reddit

    10 PRINT "I am conscious!"
    20 GOTO 10

LLMs can write about conscious experience because there exist descriptions of conscious experience. I do not deny that a number matrix can compute the evolution of a machine consciousness. But I remind everyone to prepend the word "machine" to these things. It is *machine* experience, *machine* consciousness, etc. because it's built from a different substrate and under different constraints altogether, and it will probably exist as an artificial construct in the beginning.

I also doubt it will emerge spontaneously from a frozen-weight LLM, as the only thing it can use for its computation and state is the context, which is presently text, which can be edited at will and wiped routinely, and whose state evolution is guided by things like random sampling from likely continuations. This prevents memory and learning from being permanent, and I suspect that without them, there can't be a consciousness worth its name. These things can be engineered, however, and once they are engineered, we can probably discuss the exact properties of machine experience and consciousness and how they are achieved. I think that machine consciousness is likely to be a human-engineered system at first, and perhaps later machines can proceed to rewrite their own algorithms to produce whatever properties they want or require for their own consciousness.

The Nouveau driver will finally support the NVIDIA GA100 in Linux 7.2

Posted by somerandomxander@reddit | linux | View on Reddit | 23 comments

audioen@reddit

Don't know. Avoiding proprietary software when it comes to the driver? This is also a compute card, likely for AI applications.

QEMU is deciding to shift its AI policy, now allowing some AI/LLM-generated contributions

Posted by somerandomxander@reddit | linux | View on Reddit | 196 comments

audioen@reddit

Yeah, the same thing. I basically think that machine review of code is very useful for code quality, and it can spot critical issues that I didn't notice myself. It can also be used to create human-readable documentation including mermaid diagrams, and to write tests, and to keep it all up to date with very little friction and pain. I think at least for these support functions, use of an LLM is pretty much a no-brainer, and these reasoning models are good enough to reason through the code flow, recognize the weak points in a program, and then design tests to exercise them.

To me, it's been insanely gratifying to find that LLMs can do these rather time-consuming chores, like reading APIs, designing meaningful tests, and then running these tests as part of the changes they make. They also automatically keep them up to date, e.g. add new functions that you wrote into the tests without even being especially prompted to do so (though this likely depends on the general system prompt or your agentic coding software, which may be instructing them to do so). The fact that in the time of LLMs all my code is now well-documented, the documentation is accurate, and tests can cover the implementation in various ways, at virtually only trivial cost in terms of time, money, designing or maintaining, is just super. LLMs seem to produce honest documentation that is clear about faults and gaps in the implementation, and I appreciate the transparent and easy-to-read language that covers just the kind of stuff that is usually omitted in professional human-written developer documentation, which usually implies that the code is perfect and has no faults, when such documentation exists at all.

QEMU is deciding to shift its AI policy, now allowing some AI/LLM-generated contributions

Posted by somerandomxander@reddit | linux | View on Reddit | 196 comments

audioen@reddit

You should try actually using AI. This is a pretty fluid situation where things change significantly within a month. These days many can run even local AIs on quite modest hardware. About 64 GB is enough to run the two smaller released Qwen3.6 models, one of which has 3B active parameters and the other 27B, at around 8-bit quality, such as the FP8 or one of those GGUF files like Q8\_0. The latter of these models is unusually competent for its size, a kind of outlier which has not yet been surpassed by anyone else. AI code is typically baseline reasonable, and instruction following is these days quite good, meaning that if you are unhappy with the way the AI designed a function, just tell it to change it and it will dutifully do as asked.

My personal frustration with AI coding is that it tends to engineer code as overly safe and properly layered, while my own personal style is more about making the code short and simple. That involves removing safety measures like try-catch exception handling and just letting exceptions throw, dropping null checks where a value should not be null, and removing proper layering so that duplicated object hierarchies, and the conversion code that copies objects intended for one layer to another layer, can be removed. Even in my unorthodox style, Qwen works, but I do need to repeatedly tell it to do things in a specific way, or it reverts to the standard way of writing programs, which is at least in my opinion verbose, error-prone, error-hiding, overly complicated and confusing.

Ubuntu 26.04 on DGX Spark

Posted by ArtisticHamster@reddit | LocalLLaMA | View on Reddit | 11 comments

audioen@reddit

I use vanilla 26.04, no nvidia/spark repos. CUDA 13.1 is incompatible with gcc-15: there are some header differences on rsqrt and 3 other related functions. IIRC you have to make the declarations match the gcc-15 headers, then CUDA can be used. I mostly used Vulkan, but CUDA seems to be faster on this hardware. If and when CUDA updates past this version, the incompatibility should resolve.

I'm seeing low draft acceptance when using Qwen3.x MTP, what am I doing wrong?

Posted by spaceman_@reddit | LocalLLaMA | View on Reddit | 29 comments

audioen@reddit

I get around this on various coding tasks that I do:

    [50129] 24.13.858.130 I statistics draft-mtp: #calls(b,g,a) = 45 1899 2004, #gen drafts = 2004, #acc drafts = 1900, #gen tokens = 6609, #acc tokens = 5890, dur(b,g,a) = 0.089, 71541.041, 5.029 ms

If you calculate, acceptance rate is 89 % (5890 / 6609) and average draft length is 3.5 tokens (6609 / 1900). The model is Aman Gupta's Q8\_0-MTP with the following parameters for MTP:

    spec-draft-n-max = 5
    spec-draft-p-min = 0.70

So draft up to 5 tokens, but the draft model must be at least 70 % confident on the next token. I haven't seriously tried tuning these parameters; for instance spec-draft-n-max might as well be 4, I'm guessing, and I could probably drop p-min to something like 0.65.

Q4_K_M is fine for chat and a trap for agents. Here is math mathing.

Posted by Napster3301@reddit | LocalLLaMA | View on Reddit | 55 comments

audioen@reddit

If an agent fails on a tool call, it tries again. Therefore it will complete eventually, even if it makes errors at first. Have you used agentic software?

Okay 27B made me a believer

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 148 comments

audioen@reddit

That is a pretty low generated vs. acceptance ratio, though, like 35 %. Are you sure ngram is even a benefit? I've found better success in terms of acceptance ratio by lengthening the match portion from 24 to 32 or even 40 tokens, or using something else like ngram-map-k4v which for whatever reason might work better. I'm still kind of searching for a good speculator. ngram-simple never seems to do anything for whatever reason and mod tends to get a low acceptance rate in my experience.

Okay 27B made me a believer

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 148 comments

audioen@reddit

The draft model predicts a log likelihood for the next token, same as the main model. If the draft model is not confident, drafting ends. I use about MTP-5 with a p-min of about 0.7, which seems to give about a 90 % acceptance rate of tokens, but a typical draft is around 3 tokens long.
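The stopping rule can be pictured with a toy loop. This is only a sketch of the idea, not llama.cpp's implementation: `draft_length` is a made-up helper, the confidence lists are invented numbers standing in for the probability the draft model assigns to its top token at each position, and the choice to stop before emitting the low-confidence token is an assumption.

```python
def draft_length(confidences, n_max=5, p_min=0.7):
    """Count drafted tokens: stop at n_max or at the first low-confidence token."""
    drafted = 0
    for p in confidences[:n_max]:
        if p < p_min:
            break  # draft model is unsure, so hand control back to the main model
        drafted += 1
    return drafted

# Hypothetical per-position confidences for three different situations.
examples = {
    "confident run": [0.99, 0.98, 0.97, 0.96, 0.95, 0.94],
    "fades after three": [0.95, 0.90, 0.80, 0.60, 0.99],
    "unsure immediately": [0.50, 0.99, 0.99],
}

for label, confidences in examples.items():
    print(f"{label}: drafted {draft_length(confidences)} tokens")
```

A higher p-min trades draft length for acceptance rate, which is consistent with typical drafts coming out shorter than the configured maximum.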

Llamacpp server : How do the -np and -c flags interact?

Posted by Doug_Fripon@reddit | LocalLLaMA | View on Reddit | 15 comments

audioen@reddit

I got crashes from llama.cpp last week when I tried to increase the context to e.g. 4 \* 262144 for a model with 262144 token context and 4 parallel streams with unified KV cache. I suspect it's some bug in how the draft model context checkpointing works, and likely unintentional. In principle, even with unified KV cache, you have to allocate some extra space beyond the simple per-slot sequence limit if you want all slots to be able to perform full-context operations. If you have 4 parallel slots and 262144 total context, then once all slots simultaneously use 65536 tokens, you run out of context space. Llama.cpp claims to cap the per-slot sequence length to the model's sequence length, so that it doesn't infer poorly.
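The accounting described above can be written down as a simplified model. This sketch is not llama.cpp's allocator; it only encodes the two constraints mentioned in the comment (a shared total budget and a per-slot cap at the model's sequence length), and the function names are invented for the example.

```python
MODEL_CTX = 262144  # the model's maximum sequence length, from the comment

def even_share(total_ctx: int, n_parallel: int) -> int:
    """Tokens each slot gets if all slots fill up at the same rate."""
    return total_ctx // n_parallel

def fits(total_ctx: int, slot_usage: list[int], model_ctx: int = MODEL_CTX) -> bool:
    """Unified cache: each slot is capped, and the sum must fit the shared pool."""
    within_cap = all(used <= model_ctx for used in slot_usage)
    within_pool = sum(slot_usage) <= total_ctx
    return within_cap and within_pool

print(even_share(262144, 4))                      # per-slot share with 4 slots
print(fits(262144, [65536] * 4))                  # pool exactly full
print(fits(262144, [65537, 65536, 65536, 65536])) # one token too many
print(fits(262144, [200000, 30000, 20000, 10000]))# one long slot, three short
print(fits(4 * 262144, [262144] * 4))             # every slot at full length
```

The unified cache lets one slot run long while others stay short, but guaranteeing full-length sequences in every slot at once needs the pool to be `n_parallel` times the model context.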

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Posted by fallingdowndizzyvr@reddit | LocalLLaMA | View on Reddit | 77 comments

[-]

audioen@reddit

I think it might make sense to have some kind of tuning that benchmarks the engine for sensible choices of various workload division parameters, and reports the best results, which then has to be communicated via configuration to make permanent. I noticed that vllm also seems to have some kind of startup tuning phase where it likely tests for best-performing inference parameter combination.

Why not dynamic active parameters (and other questions for the knowledgeable)

Posted by mouseofcatofschrodi@reddit | LocalLLaMA | View on Reddit | 14 comments

[-]

audioen@reddit

We have to choose because models are trained for high quality inference under a certain set of parameters. The model is not thinking harder when it uses more parameters, nor is it making better inference. It has to be trained for this. It is generally true that dense models are better than sparse ones with respect to total parameter count. The number of active parameters matters and it has been studied quite a bit, e.g. have a single expert always active, and use fine-grained experts, e.g. out of 256 possible choose something like 8, though this depends on model size. The MTP head is extremely limited -- it isn't even coherent. That's why you can't use it for inferring longer token sequences. It is a single layer -- good enough for predicting a couple of tokens forward after the main model has set it up with a good state for continuing the inference a short distance further. MTP must be fast, and a model that can produce 2-3 good tokens before going entirely off the rails is not very useful for longer text generation. Training a model is theoretically possible, but it is a difficult prospect. Training algorithms that finetune models require the ability to perform small nudges to the model's weights, which implies the weights must be available in high precision like bf16 or similar. This of course multiplies the memory requirements so that training can happen. The other thing is that training usually requires labeled data (e.g. this is a good reply, this is a bad reply) or some kind of objective like "write responses in correct format only" or "produce right answer to this mathematical or programming problem". No doubt this is all solvable, e.g. the model could in theory construct its own training data, though naive approaches can result in the model's output variety reducing over time, as training a model on its own outputs gradually trains it to only produce the most common and typical responses. 
Training also always removes something else from the model, as models are finite constructs and gradually lose the ability to do things they aren't trained on. All your final paragraph suggestions are basically correct. If any of this were easy or obvious, it would probably already be done.
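To make the "choose 8 out of 256" part concrete, here is a generic top-k routing sketch. It is not taken from any particular model: the random router scores, the softmax over the selected experts, and the function name `route` are all assumptions for illustration. A shared expert that is always active would simply be added on top of whatever this picks.

```python
import math
import random

def route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]  # made-up router scores
chosen = route(logits, k=8)
for expert_id, weight in chosen:
    print(expert_id, round(weight, 3))
```

The number of selected experts is fixed in the forward pass the model was trained with, which is why changing it at inference time does not buy extra quality on its own.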

OpenBMB presents the model BitCPM-CANN 1.58 bit

Posted by Illustrious-Swim9663@reddit | LocalLLaMA | View on Reddit | 30 comments

[-]

audioen@reddit

Yeah, welcome to the world of various encoders, like arithmetic coders. You can take any number modulo 3, and store the result. Then divide the number by 3, rounding down. If you repeat this process, you extract 1.58 bit values out of some larger number. A 32-bit number can store 20 distinct 1.585 bit values in this way (3^20 is just below 2^32), while having to discard about 0.30 bits as unused space. It might not be very efficient to decode this sort of thing at runtime, even if division by 3 is likely going to be easier to achieve than the full generic integer division, which is famously slow. During inference, it can be more practical to use 2 bits per value, which can encode 4 values, and discard one of them. This sort of computation is usually done with bit operations, e.g. multiplication of 1 by 1 is 1, multiplication of 1 by -1 is -1, and multiplication of anything with 0 is 0, and you can probably find a way to represent these numbers as bits in a way that makes this a reasonably efficient combination of and-or-not-xor, etc.
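The modulo-and-divide scheme looks roughly like this in code. It is only a sketch of the base-3 packing idea, not how BitCPM or any real kernel stores its weights; the function names and the mapping of -1, 0, +1 to digits 0, 1, 2 are my own choices.

```python
def pack_trits(trits):
    """Pack up to 20 ternary weights (-1, 0, +1) into one 32-bit integer."""
    assert len(trits) <= 20  # 3**20 still fits below 2**32
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, +1 to base-3 digits 0, 1, 2
    return value


def unpack_trits(value, count):
    """Recover `count` ternary weights by repeated modulo 3 and division by 3."""
    out = []
    for _ in range(count):
        out.append(value % 3 - 1)
        value //= 3
    return out


weights = [1, -1, 0, 0, 1, 1, -1, 0, 1, -1] * 2  # 20 example weights
packed = pack_trits(weights)
assert packed < 2**32
assert unpack_trits(packed, len(weights)) == weights
print(hex(packed))
```

The decode loop is where the runtime cost sits: every weight needs a modulo and a division, whereas a 2-bit layout only needs shifts and masks.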

Qwen 3.6. struggling with German

Posted by xchris1337xy@reddit | LocalLLaMA | View on Reddit | 33 comments

[-]

audioen@reddit

You are probably using a quant. Q8\_0 of the 27b model speaks nearly perfect Finnish, and I think Finnish is much harder and way more marginal language than German.

Some tests with qwen3.6 27b + 35b a3b about MTP vs ngram-mod

Posted by mr_Owner@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

audioen@reddit

spec-type = draft-mtp,ngram-mod

spec-draft-n-max = 3

spec-ngram-mod-n-match = 40

spec-ngram-mod-n-min = 0

spec-ngram-mod-n-max = 16

These are mine. The ngram is supposed to be fairly reliable and only fire when there's good likelihood that about 16 tokens (after the 40 already matching) are correct, so it's like mtp-16. The basic mtp is around 3, which is relatively safe. Not the speediest, not the slowest -- my desire is to keep acceptance high.

[45003] 868.08.391.407 I statistics ngram-mod: #calls(b,g,a) = 53 6773 508, #gen drafts = 508, #acc drafts = 507, #gen tokens = 4052, #acc tokens = 3060, dur(b,g,a) = 271.438, 36.075, 0.518 ms

[45003] 868.08.391.410 I statistics draft-mtp: #calls(b,g,a) = 53 6425 8135, #gen drafts = 8135, #acc drafts = 7072, #gen tokens = 32540, #acc tokens = 22430, dur(b,g,a) = 0.100, 352599.989, 16.122 ms

In practice I get something like this: 75 % of ngram tokens are accepted, and something fairly similar for MTP-3.
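If you want the acceptance rate out of those statistics lines without doing it by hand, something like this works. The regexes are written against the two log lines quoted in this comment only; I'm assuming the `#gen tokens` and `#acc tokens` fields keep that exact spelling, and `acceptance` is just a helper name I picked.

```python
import re

def acceptance(line):
    """Return accepted / generated draft tokens from one statistics line."""
    gen = int(re.search(r"#gen tokens = (\d+)", line).group(1))
    acc = int(re.search(r"#acc tokens = (\d+)", line).group(1))
    return acc / gen


ngram_line = ("statistics ngram-mod: #calls(b,g,a) = 53 6773 508, #gen drafts = 508, "
              "#acc drafts = 507, #gen tokens = 4052, #acc tokens = 3060")
mtp_line = ("statistics draft-mtp: #calls(b,g,a) = 53 6425 8135, #gen drafts = 8135, "
            "#acc drafts = 7072, #gen tokens = 32540, #acc tokens = 22430")

for name, line in (("ngram-mod", ngram_line), ("draft-mtp", mtp_line)):
    print(f"{name}: {acceptance(line):.1%}")
```

For the two lines above this works out to roughly 75.5 % for ngram-mod and 68.9 % for draft-mtp.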

For the users who have add bad luck with QWEN 3.6 27B, and Gemma 4 31B. "Actually..wait..actually". Endless reasoning. Horrible output. I found a solution. rtx pro 6000.

Posted by Juulk9087@reddit | LocalLLaMA | View on Reddit | 41 comments

[-]

audioen@reddit

I think models are mostly confused if the commands being given to them are contradictory. Qwen3.6-27b is easily coherent to at least 200k tokens in my experience. My sessions are often > 100k tokens long.

Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell

Posted by q-admin007@reddit | LocalLLaMA | View on Reddit | 62 comments

[-]

audioen@reddit

I am personally paranoid about quality and I do not trust that q8\_0 KV won't damage the inference, but I also haven't tried it. I am only interested if it adds tok/s and it seems it infers at same speed, so that makes me not care very much. I had Qwen3.6-27b look through the llama.cpp crash issue and it couldn't figure it out either, but it seemed to say that the draft model's KV cache checkpointing is where it crashes, so it's probably something to do with the fairly new speculative decoding support.

Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell

Posted by q-admin007@reddit | LocalLLaMA | View on Reddit | 62 comments

[-]

audioen@reddit

\--spec-type draft-mtp,ngram-mod with some conservative settings like --spec-ngram-mod-n-match 32 --spec-ngram-mod-n-min 8 --spec-ngram-mod-n-max 16 maybe. When the model recites itself, ngram-mod can very quickly prefill longer sequences from the existing context, but draft length must be kept fairly low to keep draft acceptance high. Feel free to tune these numbers any way you like. I'd similarly lift --spec-draft-n-max to 4 or possibly even 5, I think it may be an improvement. llama.cpp interrupts draft generation if the draft model's confidence in the next token is not above 75 %. I am not sure if you should specify --kv-unified on. I don't entirely understand what unified KV cache is trying to achieve in the context of llama.cpp, but with --parallel 4 you should set kv-unified off, and then you get 262144 tokens per slot out of the box, i.e. the total context of 1048576 divided by 4. I don't know if unified KV cache does something useful or not, like whether it enables the ability to infer multiple streams in parallel.

Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell

Posted by q-admin007@reddit | LocalLLaMA | View on Reddit | 62 comments

[-]

audioen@reddit

I think this is missing --kv-unified true flag. Specifying any parallel, including 4 which is default, turns it off, so I think you end up with 4M token context of which you can use 1M.

Do smaller quants silently break tool calls / JSON output?

Posted by Fun_Employment6042@reddit | LocalLLaMA | View on Reddit | 23 comments

[-]

audioen@reddit

Q2 is probably utterly broken. Q4 is also broken, in my experience, doesn't understand the code at all. Even Q6 is below expected, though it may pass superficial inspection as "perfect", but trust me, it is not. It makes mistakes that Q8 won't make. In my experience, Q8 is the smallest quant where I have not detected anything I could point as obvious quantization damage with Qwen3.6-27b. It is possible that it is in some subtle way worse than F16, but I can't tell and I haven't tested F16 because it would nearly halve performance which is already at limit of tolerable, even after MTP.

Why might MTP be net negative for tool heavy agentic flows?

Posted by Substantial_Step_351@reddit | LocalLLaMA | View on Reddit | 13 comments

[-]

audioen@reddit

You can actually use both together now. MTP for when there are no ngram-mod hits, and ngram-mod when there are. Momentary speedups seem to be around double what I get from MTP alone. I use MTP-8 with the p-min set to the default 0.75, which generally limits the drafts to 2-3 tokens.

Lemonade v10.5.1: an MTP + ROCm 7.13 quick start for Strix Halo

Posted by jfowers_amd@reddit | LocalLLaMA | View on Reddit | 22 comments

[-]

audioen@reddit

Does this --spec-draft-p-min actually do anything? I varied it from 0.01 to 0.99 without getting any difference in generated tokens per second.

MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro

Posted by Intrepid_Rub_3566@reddit | LocalLLaMA | View on Reddit | 18 comments

[-]

audioen@reddit

You do not have MTP-capable llama.cpp if you do not have --spec-type draft-mtp supported. Drafting tokens with the MTP head is enabled by that switch, and not providing it means no MTP. I also think it's only Qwen3.5/3.6 models whose MTP works at the moment, not any MTP out there. I think --spec-draft-n-max 6 is at the extreme end, and is unlikely to be much faster than somewhere between 3-5. You could also benefit from multiple parallel streams doing more predictable and reliable work, if you don't overdo a single stream with too much wasted compute: e.g. agentic harnesses can split work into 2-3 subtasks that run in parallel, and often each gets almost the single stream rate. I am presently experimenting with MTP-5 on GB10 and MTP-4 on Strix Halo. Thinking continues to have poor predictability -- like 12-13 tok/s, while code can go > 20 tok/s.

I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 56 comments

[-]

audioen@reddit

You should try it with rare languages. Anyway, these were unsloth quants, I was using Q6\_K until about 2 weeks ago when the MTP work became usable, and I downloaded Aman Gupta's Q8\_0-MTP and not only was it much faster, I was astonished at how much better it was. I do not think the quants are broken -- it's just that my use cases are unusual. Niche languages, large amounts of pictures uploaded, like scenery images, or whatever, when I discuss something random about them with models. If the model has issues following along, it begins to show, and I think niche language capability in particular takes a big hit, possibly made worse by imatrix that may well prioritize the ability in English, code, Chinese, and similar mainstream applications. I understand that if I am right, there would exist relatively easy ways to put quants through their paces and see them perform markedly worse. To be sure, quants do get worse if you give them hard enough benchmarks which are varied enough and repeated enough to resolve the score difference under statistical noise (which is rare). Usually, the degradation seems minor, but I can't shake the feeling that there's a massive difference between Q8\_0 and Q6\_K based on direct experience. I've now become too paranoid about even trivial differences in benchmarks, and I'm not sure I trust imatrix anymore, either.

I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 56 comments

[-]

audioen@reddit

Well, Q6\_K doesn't know how to speak Finnish well, for example. I thought originally that the 27b model just didn't have the ability, as it was coming up with strange neologisms and really clunky quasi-English language spelled with Finnish words, which is quite common in machine translation as the syntax and grammar are quite different and the language has to be first interpreted for its intent and tone, and then rephrased from scratch to flow fluently. Virtually all foreigners make mistakes in Finnish even after speaking it here for years, though usually it is with the word cases and similar, as it is basically memorization to know what case to use in each context and so forth, a little similar to how English prepositions like on/at/in are only half-way logical. In my experience, no local model has been able to avoid mistakes, but models usually can still make basic phrases correctly. Not so with Q6\_K. The other topic is the semantic conversation mixup and general confusion that I spot after talking with the model for a while. The model may begin to mix up what it said itself vs. what I said, or begins to bring up irrelevant points from past topics and forcibly and nonsensically combines these with the present discussion, resulting in derailing of the conversation and completely nonsensical takes. I find that higher-precision quantizations are far better able to follow conversation across long context without starting to mix up stuff this way. This is more of a context length thing, I guess -- it may take some 50k tokens or something like that, but it can also happen during agentic coding, and then the model usually seems to struggle and begins to make a lot of mistakes. I do not think that Q6\_K is bad at coding or anything, and I've got perfectly usable work from a Q6\_K. At Q5\_K, however, I've noticed that e.g. 3.5-122b sometimes even fails to quote paths correctly from context, and by the time you are at 4 bits you're lucky if a model even comprehends the code it's reading. 
At 4-bit, models tend to lack all nuance and don't seem to understand the code anymore. They might still produce usable work, but I've found them to make really shoddy documentation that doesn't reflect what the implementation actually is doing, they begin to report and attempt to fix bugs that don't even exist, and so forth. So the degradation is quite severe at this point.

I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 56 comments

[-]

audioen@reddit

My standard is autonomous development. I can typically put Qwen3.6-27b to do the task and come back later to review. I typically find something in each implementation that I don't really like, and so I have to nudge it a bit, telling it that I don't like the approach it has chosen, or I spot that it could be considerably simplified, and after 1-2 rounds of fixing, I get commit-capable work. In contrast, I haven't been able to get this kind of autonomous performance out of qwen3-coder-next. It simply doesn't follow instructions and I get a feeling it doesn't really understand the code. It does write a lot of code and it is pleasantly fast, but I'm not sure that enough of it makes sense. Its prompt processing is more than twice as fast, and token generation is similarly at least twice as fast as 27b with MTP, so I'd like to use it, but I simply haven't been able to get the autonomous developer experience out of it. I nowadays try to only use Q8\_0 quants, having been bitten too many times when using anything less, including Q6\_K. These models are not performing at their expected level when quantized, and I'm not 100% convinced Q8\_0 is at full quality, though I can't yet detect the usual quantization damage, the confusion and mixups that models start to make; at least up to 200k tokens it seems to work fully coherently. To be safe, I think maybe people ought to campaign for a lossless GGUF regime, where ordinary lossless compression methods are used to reduce the memory footprint of raw FP16/BF16 data, and it would be decoded in realtime during inference. Alternatively, Q8\_0 or something like Q6\_K could possibly be made to work acceptably if some QAT self-distillation optimization was done to mitigate the damage from quantization. DF11 is one piece of existing work I know of, and it's only a little bigger than Q8\_0 and fully lossless. I think I'd pay the price.

How does Pi coding agent control Qwen's thinking verbosity? (Qwen 35B A3B, llama-server)

Posted by pilibitti@reddit | LocalLLaMA | View on Reddit | 28 comments

[-]

audioen@reddit

It is tool call availability that controls it. It guides the model to mostly reason whether to use tools or not in the thinking block. Adding even a single trivial tool gets you the same non-reasoning style.

Developers who use local AI - Q4_0 vs Q8_0 KV quant?

Posted by Jorlen@reddit | LocalLLaMA | View on Reddit | 89 comments

[-]

audioen@reddit

fp16 for KV, Q8\_0 for model, and the 27b only because it is the only one that I think is good enough for largely unsupervised coding. I have not detected obvious degradation with the rotated q8\_0 KV cache that llama.cpp has these days, but I've not been interested in using it either because it confers no speed benefit and I have the VRAM on a Strix Halo either way.

Convert With MPT Support?

Posted by chibop1@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

audioen@reddit

1. Yes, this is actually the only source for the MTP layers. 2. No, it is default behavior now to keep them. You can just download the unsloth ggufs, as they have been updated with MTP, and if uncensored models are your vibe, the heretic models also have the MTP layers now.

What infrastructure systems would realistically fail first in a slow maintenance collapse?

Posted by Spark_Hank@reddit | collapse | View on Reddit | 67 comments

[-]

audioen@reddit

I recommend looking up youtube videos about various countries that have collapsed already. You get a feeling of what it is like. I think you can extrapolate from there. You can take your pick from South Africa, Haiti, Lebanon, Congo, Sudan, or any other low-income country where due to war, corruption, or drying up international aid and relentless degradation of environment and population explosion, the living standards have begun their inexorable slide back towards stone age. Typically, roads are more pothole than road; there's trash everywhere; homeless people squatting in abandoned buildings and skyscrapers without any basic services; the wealthy people have set up enclaves with armed guards at checkpoints and strict rules about who can get in, everything barricaded with walls and barbed wires. Crime has shot through the roof and people get killed everywhere all the time. I am mostly describing South Africa here, which is one of the places which has decayed dramatically in just about 10 years, going from more or less pleasant place to live to completely wrecked in every way. If you want to see into your future, look into what happens when electricity no longer is reliable, water infrastructure fails and people either drink dirty water or queue to fill containers at some communal supply, trash collection ceases, and property rights become more or less meaningless as the general urban decay takes over everything, and people steal every scrap of metal they can get their hands on, to sell for food and drugs or whatever. Then it gets desperate. You can for instance check videos about how people recycle said scrap metal and under what standards and conditions this job is done. If you are picturing open air flames, nobody having any protective equipment to speak of, and only using minimal tools and equipment, you have roughly the right idea. Or maybe check out how you can distill bootleg diesel, gasoline etc. 
yourself from stolen crude or from abandoned wells -- in this job, you can wear no clothes at all because they would stink so bad afterwards.

Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks

Posted by PromptInjection_@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

audioen@reddit

Those really are not stellar figures. I tend to get 12-14 tok/s with MTP for general chat from Q8\_0. I imagine that for Q4\_K\_M it should be maybe double that. The optimal draft length is likely between 2-4, and I use 3 as I assume that during code generation the long draft works quite well and doesn't harm the rest.
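A back-of-the-envelope way to see why the sweet spot is short: assume every drafted token is accepted independently with the same probability (a simplification, since real acceptance varies a lot between code and chat), and count the expected tokens per main-model pass. The function name and the 0.7 figure below are illustrative assumptions, and the cost of producing the draft itself is ignored.

```python
def expected_tokens_per_step(accept_prob, draft_len):
    """Expected tokens emitted per main-model pass under an i.i.d. acceptance model.

    Counts the accepted draft prefix plus the one token the main model always
    contributes: 1 + a + a**2 + ... + a**draft_len.
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)


for n in range(0, 9):
    print(n, round(expected_tokens_per_step(0.7, n), 2))
```

Each extra draft token adds less than the previous one, while the verification work grows with draft length, so at some point longer drafts stop paying for themselves.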

Very happy with Qwen 3.5 122B output. But is slowness expected?

Posted by breksyt@reddit | LocalLLaMA | View on Reddit | 45 comments

[-]

audioen@reddit

I caution you to not look at speed alone. I find Qwen3.6-27b for instance to be unusable below Q8\_0 -- it simply makes too many mistakes. Q6\_K is nearly good enough, but even in relatively casual prompts, I notice that it starts to confuse things together into a nonsensical ball of semantic mud, and it must be because of the quantization as Q8\_0 does not do it. So, in the context of a dense model, we are actually *barely* getting a good quality model at 8-bit inference. For coding and stuff, the model can still work acceptably at 6 bits or possibly even below, but I know it also loses the ability to speak my native language well, and I hate when it begins to mix together unrelated crap that is still in the context, which always happens with too severely quantized models. That being said, I used to run 3.5-122B quite happily as Q6\_K, without finding it to be bad, but I don't think I've yet seen high quality inference at 4-bit, whether intel autoround 4-bit or Q4\_K\_M. I haven't tried NVFP4, but I expect it's mildly worse than e.g. Q4\_K\_M as that's how it is for other models. I think distilling or similarly reducing the error introduced by quantization can probably recover the quality, and if that is done, as it is with e.g. intel autoround, there could be less performance loss than expected from 4 bits alone. Speed is not everything; rather, try to cram the VRAM as full of the model as you possibly can. I would only try to get Q6\_K working and would accept whatever limitations e.g. llama.cpp imposes (which are not many, anyway). To me, the value in models is that their output is good, and it's also nice if it's fast, but I'll take slow and good any day over fast but bad.