tomz17

Llama.ccp

Posted by Pancake502@reddit | LocalLLaMA | View on Reddit | 22 comments

MiniMax M3 - Coding & Agentic Frontier, 1M Context, Multimodal

Posted by dryadofelysium@reddit | LocalLLaMA | View on Reddit | 220 comments

Someone hid a full RAT inside a fake npm package and exfiltrated victim data to HuggingFace

Posted by BattleRemote3157@reddit | programming | View on Reddit | 101 comments

tomz17@reddit

TBF, it's being targeted because it's one of the easiest ways to get a PROLIFIC supply chain attack bootstrapped. It's 100% ROI. Same reason malware writers target Windows instead of more esoteric OSes with smaller userbases. There is nothing fundamentally "safer" about any other language / container / distro package repository. Basically anything where users can contribute items AND simultaneously specify transitive dependencies is going to suffer from the exact same problems if it bubbles to the top of the ROI calculation for attackers.

Qwen3.5 27B running at ~65tps with DFlash speculation on 2x 3090

Posted by Kryesh@reddit | LocalLLaMA | View on Reddit | 21 comments

Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

Posted by cleversmoke@reddit | LocalLLaMA | View on Reddit | 44 comments

tomz17@reddit

Interesting. In that use case, MTP is going to be beneficial (as you've seen). My coding sessions differ by about an order of magnitude: the agent spends a LOT more time reading code than writing things (tool calls, code, etc.).

Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

Posted by cleversmoke@reddit | LocalLLaMA | View on Reddit | 44 comments

tomz17@reddit

>Cannot wait for PP speed to increase

Not sure that's in the cards, as MTP fundamentally requires those extra forward passes (e.g. 1 pass per MTP token). Therefore I keep it turned off, even in vLLM. The big win for MTP is small-context requests with more deterministic outputs (e.g. code).

Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

Posted by cleversmoke@reddit | LocalLLaMA | View on Reddit | 44 comments

tomz17@reddit

What is the ratio of prompt to tg? For most agentic workflows I've used, the prompt dominates over generation. Given that you are (presumably) outputting 85k tokens per task, I'd say your particular use case is very atypical, no?

That's a good news...

Posted by Pjotrs@reddit | LocalLLaMA | View on Reddit | 244 comments

tomz17@reddit

The prefill with MTP is always going to be slower since it requires multiple forward passes. This is doubly true with multiple cards linked over a slow interface (e.g. PCIe). Even with vLLM + NVLink, I still disable MTP for agentic workflows, as the gains from faster generation are almost immediately lost to the prompt processing penalty.
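
The trade-off described above can be sketched as a rough timing model. Everything numeric here is a hypothetical placeholder rather than a measurement: the baseline speeds, the assumed 30% prefill slowdown, and the assumed 50% decode speedup are illustrative only, and `request_time` and `mtp_helps` are made-up helper names.

```python
def request_time(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Seconds to serve one request: prefill time plus decode time."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Hypothetical baseline speeds in tokens/second (not measurements).
BASE_PP, BASE_TG = 2000.0, 40.0
# Hypothetical MTP effect: prefill 30% slower, decode 50% faster.
MTP_PP, MTP_TG = BASE_PP * 0.7, BASE_TG * 1.5


def mtp_helps(prompt_tokens, gen_tokens):
    """True if the request finishes sooner with MTP under the assumptions above."""
    base = request_time(prompt_tokens, gen_tokens, BASE_PP, BASE_TG)
    mtp = request_time(prompt_tokens, gen_tokens, MTP_PP, MTP_TG)
    return mtp < base


# Short prompt, long answer: decode dominates, so the faster decode wins.
chat_like = mtp_helps(prompt_tokens=1_000, gen_tokens=1_000)
# Long prompt, short answer: prefill dominates, so the prefill penalty wins.
agent_like = mtp_helps(prompt_tokens=100_000, gen_tokens=500)
print(chat_like, agent_like)
```

Plugging in your own measured prefill and decode speeds, with and without MTP, would show where the break-even point sits for a given prompt-to-generation ratio.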

how would you set up a local llm server for a business of 7 people?

Posted by snowieslilpikachu69@reddit | LocalLLaMA | View on Reddit | 58 comments

Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant

Posted by gladkos@reddit | LocalLLaMA | View on Reddit | 83 comments

tomz17@reddit

TurboQuant can indeed be faster in situations where you are memory-bandwidth-limited rather than compute-limited. So it 100% depends on the hardware and whether you are interested in prefill vs. decode, and single-user vs. multi-user (e.g. in multi-user setups prefill can tank your decode), etc.
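
The bandwidth-limited vs. compute-limited distinction can be illustrated with a roofline-style estimate. This is only a sketch under loud assumptions: the hardware figures are hypothetical, quantization is modeled as simply halving the bytes moved (ignoring any dequantization overhead), and `step_time` and `speedup_from_halving_bytes` are illustrative names, not part of any library.

```python
def step_time(flops, bytes_moved, peak_flops, bandwidth):
    """Roofline-style estimate: a step takes as long as its slower resource."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / bandwidth
    return max(compute_s, memory_s)


# Hypothetical hardware: 100 TFLOP/s of compute, 1 TB/s of memory bandwidth.
PEAK_FLOPS, BANDWIDTH = 100e12, 1e12


def speedup_from_halving_bytes(flops, bytes_moved):
    """Speedup if quantization halved the bytes moved (an assumption)."""
    before = step_time(flops, bytes_moved, PEAK_FLOPS, BANDWIDTH)
    after = step_time(flops, bytes_moved / 2, PEAK_FLOPS, BANDWIDTH)
    return before / after


# Decode-like step: little compute per byte read, so memory-bound.
decode_like = speedup_from_halving_bytes(flops=1e11, bytes_moved=2e10)
# Prefill-like step: lots of compute per byte read, so compute-bound.
prefill_like = speedup_from_halving_bytes(flops=1e14, bytes_moved=2e10)
print(decode_like, prefill_like)
```

In this toy model the memory-bound step gets faster when fewer bytes are moved, while the compute-bound step does not change at all, which is the sense in which the answer depends on hardware and workload.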

Web-Search is coming to a screeching performance halt as Google shuts down their free search index, and traffic defenders like Cloudflare challenge AI at every gateway. What are our options?

Posted by NetTechMan@reddit | LocalLLaMA | View on Reddit | 238 comments

tomz17@reddit

There are entire websites which no longer work properly under Linux+Firefox due to aggressive bot detection. I have to pull out my MacBook just to browse the Home Depot website.

What do I do if I accidentally put regular gas in a premium car?

Posted by Street_Firefighter_7@reddit | askcarguys | View on Reddit | 303 comments

tomz17@reddit

>As I said, the threshold for a knock sensor to detect knock is VERY low

Right, and as I said, the engine still has to knock, right? And that's an all-or-nothing thing. You either have pre-ignition in a cylinder or you don't. It's not like you can build a sensor which violates causality. This is the reason why every car out there is not simply loaded up with a race-fuel map from the factory, letting the knock sensor retard down from there.

Either way, internet-expert all you want, the people who built the damn engine are telling you pretty explicitly to put a particular octane in. They are doing this even though it costs them sales (e.g. a car that could safely take a lower/cheaper octane would sell better). They even put an f'in sticker on the fuel door itself in many cases. Ignore them at your own peril.

What do I do if I accidentally put regular gas in a premium car?

Posted by Street_Firefighter_7@reddit | askcarguys | View on Reddit | 303 comments

tomz17@reddit

Again, the engine has to knock prior to timing being retarded... And that happens each time the control loop tries to increase the maximum power output. It's why engine tunes have different maps for different octanes. You can't just yolo the highest octane map and let the knock sensor figure it out.

> will be absolutely fine on regular gas with zero ill effects

Well yes, for that very rare situation when you put in the wrong gas by accident or are in an area that doesn't have gas with a sufficient octane level, etc. Don't make it sound like driving around with the wrong type of gas daily is going to have zero ill effects long term.

What do I do if I accidentally put regular gas in a premium car?

Posted by Street_Firefighter_7@reddit | askcarguys | View on Reddit | 303 comments

tomz17@reddit

>You will be fine, car will adjust and reduce performance.

To be fair, it can only do that after it detects a knock. So this is not really meant to be a long-term solution, but more of an "I am stuck in the middle of the desert and there is no premium gas around" kind of solution.

[Paper on Hummingbird+: low-cost FPGAs for LLM inference] Qwen3-30B-A3B Q4 at 18 t/s token-gen, 24GB, expected $150 mass production cost

Posted by ayake_ayake@reddit | LocalLLaMA | View on Reddit | 56 comments

tomz17@reddit

TBF, the real target audience for these isn't the people who want to run inference at home. The target audience is applications where you want a small AI running on some edge device out in the field, where internet connectivity and power are at a premium (or not available).

Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix)

Posted by AmazingDrivers4u@reddit | LocalLLaMA | View on Reddit | 66 comments

Deepseek v4 pricing is genuinely silly, did the math and now i am questioning my entire stack

Posted by Skid_gates_99@reddit | LocalLLaMA | View on Reddit | 77 comments

tomz17@reddit

Ok. But in theory you *could*, given the proper resources (i.e. a few $10k's for a single user and a few $100k for multiple concurrent users, which is peanuts for any company making money on any AI outputs). Whereas you cannot ever run any of the closed frontier models locally regardless of resources. So for instance, if Anthropic were to alter the deal and you were 100% reliant on their product stack, you would be kind of stuck. Given that you should assume all non-local providers train on your data, it's downright stupid, IMHO, to build your house of cards today on something you can't completely control. Whereas with DeepSeek, GLM, Qwen, MiniMax, etc. you can always just pull the ripcord and buy the hardware to make the problem go away for good. Being a few months behind SOTA is not a big deal for the vast majority of applications.

Opus 4.7 Max subscriber. Switching to Kimi 2.6

Posted by meaningego@reddit | LocalLLaMA | View on Reddit | 106 comments

tomz17@reddit

Ish. They are cheaper because they are second (which is an excellent strategy in the current market conditions). If they had to push the envelope while US companies ripped them off, the situation would be flipped.

Those of you running minimax 2.7 locally, how are you feeling about it?

Posted by laterbreh@reddit | LocalLLaMA | View on Reddit | 129 comments

MiniMax-M2.7's MIT-Style License Is a Misleading Restriction That Bans Commercial Use and Fails Free Software Standards

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 34 comments

tomz17@reddit

>does that mean I can’t sell the book?

That's not what the license says. That's also not what copyright law (at least in the USA) says. So yes, you *can* always sell the book (at least in the USA), **even if** you used a completely closed-weight model to produce it. The book just may not be protected by copyright if it is generated by AI instead of a human, but that's not a model-licensing issue.

>so their lawyers push them towards less clear wording. I hope I am wrong with all that said.

Yes, you are grasping at straws.

MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers

Posted by Visual_Synthesizer@reddit | LocalLLaMA | View on Reddit | 19 comments

tomz17@reddit

These look really compelling for small team deployments (as long as the minimax 2.7 license isn't constraining for your legal/regulatory environment). However, the real question will be how much the 4-bit quantization lobotomizes the model performance. IIRC MiniMax 2.5 was a complete dog when it came to degradation w.r.t. quantization.

MiniMax-M2.7 GGUF Quants — Full Set (Q2_K to Q8_0 + BF16)

Posted by Asleep_Training3543@reddit | LocalLLaMA | View on Reddit | 21 comments

MiniMax M2.7 is NOT open source - DOA License :(

Posted by KvAk_AKPlaysYT@reddit | LocalLLaMA | View on Reddit | 229 comments

DFlash: Block Diffusion for Flash Speculative Decoding.

Posted by Total-Resort-3120@reddit | LocalLLaMA | View on Reddit | 127 comments

Qwen3.5 27B running at ~65tps with DFlash speculation on 2x 3090

Posted by Kryesh@reddit | LocalLLaMA | View on Reddit | 21 comments

What it took to launch Google DeepMind's Gemma 4

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 136 comments

Claude Code's source just leaked — I extracted its multi-agent orchestration system into an open-source framework that works with any LLM

Posted by JackChen02@reddit | LocalLLaMA | View on Reddit | 317 comments

tomz17@reddit

Actually, since Anthropic engineers have publicly admitted they are now using Claude to write 100% of their Claude Code, the copyright enforceability (of at least parts of that source code) may really be in question (see Thaler v. Perlmutter). In particular, their choice of claiming 100% (instead of, say, 99.999%) may really bite them in the ass.

local llm inference on M4 Max vs M5 Max

Posted by purealgo@reddit | LocalLLaMA | View on Reddit | 2 comments

Kimi K2.6 will drop in the next 2 weeks, K3 is WIP and will be huge

Posted by No-Thought-4995@reddit | LocalLLaMA | View on Reddit | 68 comments

vLLM First timer 3090 + 3090Ti with Qwen 3.5 27b Q4

Posted by edankwan@reddit | LocalLLaMA | View on Reddit | 9 comments

tomz17@reddit

The --enforce-eager is killing performance. Get rid of that and add --max-num-seqs 16 (or lower) to prevent the OOM during warmup. If you are running a low number of sessions, you would also benefit from speculative decoding (albeit not w.r.t. your TTFT), e.g. --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
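
As a rough illustration of the suggestion above, the following sketch assembles the corresponding `vllm serve` command line. The model identifier is a hypothetical placeholder, and any other flags from the original poster's setup (not shown in this thread) would still need to be carried over; only the flags named in the comment are included.

```python
import json
import shlex

# Hypothetical model identifier; substitute whatever you are actually serving.
MODEL = "your-org/your-model"

# Speculative decoding settings exactly as quoted in the comment.
speculative_config = {"method": "mtp", "num_speculative_tokens": 1}

args = [
    "vllm", "serve", MODEL,
    # Cap concurrent sequences to reduce memory pressure during warmup.
    "--max-num-seqs", "16",
    # vLLM takes the speculative decoding settings as a JSON string.
    "--speculative-config", json.dumps(speculative_config),
]
# Note: --enforce-eager is deliberately left out.

# shlex.join quotes the JSON so the command can be pasted into a shell.
command = shlex.join(args)
print(command)
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids the escaped-underscore and nested-quote problems that tend to creep in when the flag is copied from a rendered web page.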

dGPU gang we're so back

Posted by ForsookComparison@reddit | LocalLLaMA | View on Reddit | 41 comments

I understand the disappointment if minimax 2.7 does not become open weights but we have had a lot..

Posted by LegacyRemaster@reddit | LocalLLaMA | View on Reddit | 28 comments

tomz17@reddit

>But I wouldn't consider it a cheap.

Cheap is relative... when all is said and done, compared to the domestic competitors, the Chinese LLM providers will have under-spent by at least an order of magnitude.

I understand the disappointment if minimax 2.7 does not become open weights but we have had a lot..

Posted by LegacyRemaster@reddit | LocalLLaMA | View on Reddit | 28 comments

tomz17@reddit

>Idk about "cheap" advertising lol.

How else would a Chinese company get their API-only coding product, which started off substantially worse than OpenAI, Anthropic, etc., out there to a foreign audience? NOBODY in the USA would have heard of, much less used, MiniMax, Qwen, GLM, etc. if they had started off as closed-API models.

I understand the disappointment if minimax 2.7 does not become open weights but we have had a lot..

Posted by LegacyRemaster@reddit | LocalLLaMA | View on Reddit | 28 comments

tomz17@reddit

Nobody is releasing open weights as a charity. They are doing it to build brand recognition and get cheap advertising in a space where incumbents like OpenAI, Anthropic, etc. had a major lead over them. If MiniMax had just offered an API when releasing abab or MiniMax 1, nobody would have given them a second thought because the competitors were so far ahead. I expect most if not all of these companies are going to pivot to monetization strategies that try to lock you in as the performance differences tighten up between different models.

This is incredibly tempting

Posted by No_Mango7658@reddit | LocalLLaMA | View on Reddit | 115 comments

tomz17@reddit

That's a lot of money to spend for something that is already effectively e-waste. On top of that, power usage is going to be ridiculous for a system like this. Not sure what the use-case is.

Glm 5.1 👀

Posted by Namra_7@reddit | LocalLLaMA | View on Reddit | 99 comments

CLI coding client - alternative to (not so) OpenCode

Posted by momsi91@reddit | LocalLLaMA | View on Reddit | 22 comments

DeepSeek just called itself Claude mid-convo… what?? 💀

Posted by Annual_Point7199@reddit | LocalLLaMA | View on Reddit | 9 comments

tomz17@reddit

I've seen this before, and it's likely due to the fact they trained off of Claude output (a known tactic for Chinese LLMs).

Nvidia P4000, i need some help

Posted by prxy15@reddit | LocalLLaMA | View on Reddit | 11 comments

How’d I do?

Posted by No_Development5871@reddit | LocalLLaMA | View on Reddit | 5 comments

tomz17@reddit

>They are marked as parts only because they couldn’t test them.

I have a bridge in Brooklyn if you are interested. Either way, ALWAYS treat any listing saying "we couldn't test" as "we definitely tested it and are now selling our e-waste to some gullible idiot"... Hopefully it works out for you, OP.

Dual 3090s (power-limited) - Are 3x PCI-E cables w/daisy-chain "okay?"

Posted by overand@reddit | LocalLLaMA | View on Reddit | 17 comments

tomz17@reddit

Each PCI-E cable is only rated for 150 watts... So if you power-limit to 150 watts you can indeed safely use a single cable and daisy-chain the connectors. Anything above that is running it out of spec, and YMMV.
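
The arithmetic behind that rule of thumb can be written down as a quick check. This is a deliberately pessimistic sketch: it assumes the card's entire power limit is drawn through the cables (ignoring whatever the slot supplies) and that load splits evenly across cables, neither of which is guaranteed in practice. The 280 W figure and the `cable_load_ok` helper are hypothetical, chosen only for illustration.

```python
CABLE_RATING_W = 150  # per-cable rating quoted in the comment above


def cable_load_ok(gpu_power_limit_w, cables_feeding_gpu):
    """Worst case: assume the whole power limit is drawn through the cables,
    split evenly between them (both are simplifying assumptions)."""
    per_cable_w = gpu_power_limit_w / cables_feeding_gpu
    return per_cable_w <= CABLE_RATING_W


# One daisy-chained cable with the card power-limited to 150 W: within rating.
single_cable_150 = cable_load_ok(150, 1)
# The same single cable at a hypothetical 280 W limit: out of spec.
single_cable_280 = cable_load_ok(280, 1)
# Two separate cables at that same 280 W limit: 140 W each, within rating.
two_cables_280 = cable_load_ok(280, 2)
print(single_cable_150, single_cable_280, two_cables_280)
```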

Chonkers and thermals (dual 3090)

Posted by BetStack@reddit | LocalLLaMA | View on Reddit | 16 comments

tomz17@reddit

>The 3090 fans are really best in class.

What does that have to do with baking the GDDR6X dies in an area without fans?

>All Ampere GPUs were clamshell nothing special about the 3090.

AFAIK, only the 3090s have memory on the back of the PCB. Dude, this is literally the first thing the crypto-miners back in the day discovered about the 3090s. It's why so many used mining 3090s had those janky extra homebrew heatsinks added to their backplates, despite their "best in class coolers".

Chonkers and thermals (dual 3090)

Posted by BetStack@reddit | LocalLLaMA | View on Reddit | 16 comments

tomz17@reddit

"I ran every stop sign today and am still alive, therefore running stop signs is super smort!" Again, all 3090s are clamshell memory layouts. OP really should check the memory temps on that bottom card before running any memory-intensive workloads.

Chonkers and thermals (dual 3090)

Posted by BetStack@reddit | LocalLLaMA | View on Reddit | 16 comments

tomz17@reddit

Yeah, the parts that actually matter most in this config are the memory modules on the "top side" of the bottom card. Read those temps and decide accordingly.