Marcuss2

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)

Posted by acluk90@reddit | LocalLLaMA | View on Reddit | 94 comments

Marcuss2@reddit

I am quite skeptical of these quantization schemes. I think most of them "work" because most models are actually quite inefficient at storing information in the KV cache. I would like to see performance on the Qwen3.5 and DeepSeek V4 architectures, where information is stored much more densely.

New DeepSWE benchmark finds Claude Opus cheats

Posted by DeltaSqueezer@reddit | LocalLLaMA | View on Reddit | 92 comments

Marcuss2@reddit

There is no way GPT-5.4 mini beats Kimi K2.6. In my experience that model just plain gets stuck in a loop. Something is off about this benchmark.

ZAYA1-8B: Frontier intelligence density, trained on AMD

Posted by carbocation@reddit | LocalLLaMA | View on Reddit | 108 comments

Marcuss2@reddit

This sounds too good to be true. But I am willing to be proven wrong.

Kimi K2.6 vs DeepSeek V4 Pro

Posted by bigboyparpa@reddit | LocalLLaMA | View on Reddit | 38 comments

Marcuss2@reddit

Myself, I have tested DeepSeek V4 Flash and it is better than Kimi K2.5, as in it could do tasks Kimi K2.5 couldn't do. With Pro, I would wait for the actual release as this is a preview, but I will likely make V4 Flash a workhorse model.

ibm-granite/granite-4.1-8b · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 35 comments

Marcuss2@reddit

Qwen3.5 has been out for months. I get it if someone does not compare against a model released last week. This is not that case.

ibm-granite/granite-4.1-8b · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 35 comments

Marcuss2@reddit

True, my mistake.

ibm-granite/granite-4.1-8b · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 35 comments

Marcuss2@reddit

It doesn't seem to bench very well. Even Qwen3.5 4B beats it handily.

Kimi K2.6 Released (huggingface)

Posted by BiggestBau5@reddit | LocalLLaMA | View on Reddit | 277 comments

Marcuss2@reddit

50% of what he says is true, 50% is total garbage. Problem is, he stands behind that 50% of garbage, even when called out.

Llama4 108b $800 setup

Posted by kylerrr02@reddit | LocalLLaMA | View on Reddit | 13 comments

Marcuss2@reddit

Qwen3.5 122B is likely going to give you a much better experience.

Experiment: Olmo 3 7B Instruct Q1_0

Posted by butlan@reddit | LocalLLaMA | View on Reddit | 47 comments

Marcuss2@reddit

Could you try doing a distillation from the original full weight Olmo 3 7B?

It looks like there are no plans for smaller GLM models

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 128 comments

Marcuss2@reddit

I don't have one. I can say you can make one. I mean, honestly, just use Qwen3.5 or Gemma 4.

It looks like there are no plans for smaller GLM models

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 128 comments

Marcuss2@reddit

If you really want the style of GLM-5.1, you should be able to distill it into Qwen3.5

Announcement: Temporary LLM Content Ban

Posted by ChemicalRascal@reddit | programming | View on Reddit | 326 comments

Marcuss2@reddit

Hey, this is fine, but can you link to subreddits which actually discuss it like /r/LocalLLaMA?

PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs

Posted by brown2green@reddit | LocalLLaMA | View on Reddit | 182 comments

Marcuss2@reddit

Went through the paper. Their methodology for measuring knowledge density is somewhat questionable. For example, we already quantize models to 4 bits, yet they almost always take the full bf16 weights for the other models. They also measure intelligence per GB, but intelligence does not scale linearly with size; it scales logarithmically.
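
To put rough numbers on the bf16 versus 4-bit point, here is a minimal sketch of how much a "score per GB" metric moves depending on the precision you assume for the baseline. The parameter count, the benchmark score, and the function names are all hypothetical and only there for the arithmetic; it also assumes the 4-bit quant loses no score, which real quants do not quite manage.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def score_per_gb(score: float, n_params: float, bits_per_weight: float) -> float:
    """Benchmark points per GB of weights at a given precision."""
    return score / weights_size_gb(n_params, bits_per_weight)


# Hypothetical baseline: 8B parameters, made-up benchmark score of 60.
N_PARAMS = 8e9
SCORE = 60.0

bf16_density = score_per_gb(SCORE, N_PARAMS, 16)  # 16 GB of weights
q4_density = score_per_gb(SCORE, N_PARAMS, 4)     # 4 GB of weights, score assumed unchanged

print(f"bf16:  {bf16_density:.2f} points/GB")
print(f"4-bit: {q4_density:.2f} points/GB")
print(f"ratio: {q4_density / bf16_density:.1f}x")
```

So the same baseline looks about 4x "less dense" when it is counted at bf16 instead of at the 4-bit size people actually run.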

Kimi K2.6 will drop in the next 2 weeks, K3 is WIP and will be huge

Posted by No-Thought-4995@reddit | LocalLLaMA | View on Reddit | 68 comments

Marcuss2@reddit

To be fair, it has been over two months now since Kimi K2.5 came out. K2.6 coming soon is not a shocker.

Introducing ARC-AGI-3

Posted by Complete-Sea6655@reddit | LocalLLaMA | View on Reddit | 100 comments

Marcuss2@reddit

This will get benchmaxxed to shit.

OpenCode source code audit: 7 external domains contacted, no privacy policy, 12 community PRs unmerged for 3+ months

Posted by Spotty_Weldah@reddit | LocalLLaMA | View on Reddit | 48 comments

Marcuss2@reddit

https://github.com/Kilo-Org/kilocode is right now built on top of opencode. I know they strip some of the telemetry stuff. I wonder how it compares.

OpenCode source code audit: 7 external domains contacted, no privacy policy, 12 community PRs unmerged for 3+ months

Posted by Spotty_Weldah@reddit | LocalLLaMA | View on Reddit | 48 comments

Marcuss2@reddit

Kilo code is now basically an opencode fork.

Total beginner here—Why is LM Studio making me do the "heavy lifting" manually?

Posted by Ofer1984@reddit | LocalLLaMA | View on Reddit | 121 comments

Marcuss2@reddit

Honestly, to get started, install `kilo` or `opencode`, run it as a CLI, and tell it what you need using the free models they provide.

Application code has dozens of static analyzers, SQL has almost nothing, here's what exists.

Posted by Anonymedemerde@reddit | programming | View on Reddit | 29 comments

Marcuss2@reddit

Actually, in the Rust world, SQLx exists for SQL server interactions. Its default behavior is to connect to your SQL server and verify queries against it, as well as type-check between SQL and Rust.

Qwen3.5 27B vs Devstral Small 2 - Next.js & Solidity (Hardhat)

Posted by Holiday_Purpose_3166@reddit | LocalLLaMA | View on Reddit | 43 comments

Marcuss2@reddit

You used IQ4_XS for Devstral and Q6_K for Qwen3.5

Qwen3.5 27B vs Devstral Small 2 - Next.js & Solidity (Hardhat)

Posted by Holiday_Purpose_3166@reddit | LocalLLaMA | View on Reddit | 43 comments

Marcuss2@reddit

Why are you running different quantizations? I would understand if you tried to match it size for size, but no, you are using far better quantization on a larger model.

24gb M4 Mac Mini vs 9070XT + 32gb system RAM. What to expect?

Posted by Soft-Distance-6571@reddit | LocalLLaMA | View on Reddit | 17 comments

Marcuss2@reddit

Absolutely you will. One of the main bottlenecks is memory bandwidth, at least when you offload some or all of the weights to system RAM.
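
A back-of-the-envelope sketch of why bandwidth matters so much: during generation each new token has to read all active weights once, so bandwidth divided by the weight size gives a rough ceiling on tokens/s. The bandwidth figures and the model size below are assumed round numbers, not measurements of any specific hardware, and the function name is just for illustration.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough upper bound on generation speed for a memory-bound decoder.

    Assumes every generated token streams all active weights from memory once
    and ignores compute, KV cache reads, and any other overhead.
    """
    return bandwidth_gb_s / active_weights_gb


# Assumed, illustrative values only.
MODEL_GB = 10.0  # quantized weights that must be read per token
MEMORY_TIERS = {
    "system RAM (assumed ~100 GB/s)": 100.0,
    "GPU VRAM (assumed ~600 GB/s)": 600.0,
}

for name, bandwidth in MEMORY_TIERS.items():
    limit = max_decode_tokens_per_s(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{limit:.0f} tokens/s")
```

Real numbers will land below this ceiling, but the ratio between the two memory tiers is what you will feel in practice.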

Why Senior Engineers Let Bad Projects Fail

Posted by Ordinary_Leader_2971@reddit | programming | View on Reddit | 121 comments

Marcuss2@reddit

Bad projects don't just show up out of nowhere. Bad leadership leads to bad projects.

D7VK 1.1 adds experimental Direct3D 6 support for classic PC games on Linux

Posted by RenatsMC@reddit | linux | View on Reddit | 18 comments

Marcuss2@reddit

As said in another comment, Mali and Adreno support OpenGL ES, but not full-fat OpenGL. Android also requires Vulkan support, but not OpenGL support.

D7VK 1.1 adds experimental Direct3D 6 support for classic PC games on Linux

Posted by RenatsMC@reddit | linux | View on Reddit | 18 comments

Marcuss2@reddit

There might be games which work with one and not the other. Also, there are many chips which don't support OpenGL. Vulkan support is far more common.

NVIDIA Nemotron 3 Nano 30B A3B released

Posted by rerri@reddit | LocalLLaMA | View on Reddit | 96 comments

Marcuss2@reddit

I don't see any mention of NVFP4 in the model card or the paper.

Aquif 3.5 Max 1205 (42B-A3B)

Posted by Holiday_Purpose_3166@reddit | LocalLLaMA | View on Reddit | 56 comments

Marcuss2@reddit

This sounds too good to be true.

Micron Announces Exit from Crucial Consumer Business

Posted by FullstackSensei@reddit | LocalLLaMA | View on Reddit | 190 comments

Marcuss2@reddit

I suspect there is more behind it, like OpenAI paying them to do this. They can literally get a lot more profit from it right now.

Qwen3 Next almost ready in llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 36 comments

Marcuss2@reddit

Kimi-Linear next. I do expect that one to be a lot faster as the linear part is very similar and MLA transformer is already implemented.

AMD Ryzen AI Max 395+ 256/512 GB Ram?

Posted by quantier@reddit | LocalLLaMA | View on Reddit | 91 comments

Marcuss2@reddit

That gives you a limit of about 10 tokens/s at generation.

AMD Ryzen AI Max 395+ 256/512 GB Ram?

Posted by quantier@reddit | LocalLLaMA | View on Reddit | 91 comments

Marcuss2@reddit

I think that in the following year we will see a lot more models using linear attention.

New Qwen models are unbearable

Posted by kevin_1994@reddit | LocalLLaMA | View on Reddit | 293 comments

Marcuss2@reddit

This is one of the reasons I hope for smaller Kimi models or a distillation of Kimi-K2: they don't suffer from this.

MiniMax LLM head confirms: new model M2.1 coming soon

Posted by External_Mood4719@reddit | LocalLLaMA | View on Reddit | 8 comments

Marcuss2@reddit

I wasn't too terribly impressed with M2. I had to explicitly tell it how to use `cat` to read a file.

Want to run claude like model on ~$10k budget. Please help me with the machine build. I don't want to spend on cloud.

Posted by LordSteinggard@reddit | LocalLLaMA | View on Reddit | 131 comments

Marcuss2@reddit

Wouldn't GLM 4.6 work better in this case, as it has fewer parameters?

Kimi Linear released

Posted by Badger-Purple@reddit | LocalLLaMA | View on Reddit | 65 comments

Marcuss2@reddit

Welch Labs made a video on MLA, comparing it to other approaches: https://www.youtube.com/watch?v=0VLAoVGf_74

TL;DR: MLA makes the model compress its KV cache into a smaller space, which is actually more efficient and more performant than GQA, which most modern models use. Hence I expect an MLA-based transformer to be better than the "regular" one used today. Of course you can screw it up by making the space parameter too small, but I don't think that is the issue here.
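
For a feel of the cache sizes involved, here is a minimal sketch counting how many values each attention variant caches per token per layer. The MLA count follows my understanding of the DeepSeek-V2 formulation (one shared compressed latent plus a small decoupled RoPE key); the head counts and dimensions are hypothetical, picked only to make the comparison concrete.

```python
def kv_elems_mha(n_heads: int, head_dim: int) -> int:
    """Standard multi-head attention: one K and one V vector per head."""
    return 2 * n_heads * head_dim


def kv_elems_gqa(n_kv_heads: int, head_dim: int) -> int:
    """Grouped-query attention: K and V are shared within each group of query heads."""
    return 2 * n_kv_heads * head_dim


def kv_elems_mla(latent_dim: int, rope_dim: int) -> int:
    """MLA: a single compressed latent shared by all heads, plus a decoupled RoPE key."""
    return latent_dim + rope_dim


# Hypothetical configuration, per token and per layer.
N_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
LATENT_DIM, ROPE_DIM = 512, 64  # LATENT_DIM plays the role of the "space parameter"

mha = kv_elems_mha(N_HEADS, HEAD_DIM)
gqa = kv_elems_gqa(N_KV_HEADS, HEAD_DIM)
mla = kv_elems_mla(LATENT_DIM, ROPE_DIM)

for name, elems in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    print(f"{name}: {elems} cached values per token per layer ({mha / elems:.1f}x smaller than MHA)")
```

Shrinking `LATENT_DIM` makes the cache even smaller, but that is exactly the knob you can set too low.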