jacek2023

MTP is nice and all, but what about PP speeds?

Posted by milpster@reddit | LocalLLaMA | View on Reddit | 30 comments

jacek2023@reddit

It's very important to minimize prompt processing (the number of tokens to process). Make sure you use the latest llama.cpp and that you "preserve thinking"; this way my prompt processing is fast.

Someone out there likely needs this: TP vs PP for 2 identical GPUs

Posted by xspider2000@reddit | LocalLLaMA | View on Reddit | 33 comments

next MiniMax will be released in ~10 Days

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 52 comments

jacek2023@reddit (OP)

Yes, I probably won't be able to run this model, but I wanted to point out that MiniMax is still open, while Qwen’s current status is unknown.

How do I improve my T/S

Posted by KneelB4S8n@reddit | LocalLLaMA | View on Reddit | 9 comments

jacek2023@reddit

Try setting --n-cpu-moe manually (try various values); it can be faster than the automatic split. Also experiment with lower quants: for each quant, tune for max speed, then compare.
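The sweep described above can be sketched in Python. This is only a sketch: the helper names (`server_args`, `best_per_quant`) and the model path are hypothetical, and the throughput numbers have to come from your own measurements. The only llama.cpp options used are `-m` and the `--n-cpu-moe` flag mentioned in the comment.

```python
from typing import Dict, List, Tuple


def server_args(model_path: str, n_cpu_moe: int) -> List[str]:
    """Build a llama-server command line for one --n-cpu-moe value.

    model_path is a placeholder; point it at your own GGUF file.
    """
    return ["llama-server", "-m", model_path, "--n-cpu-moe", str(n_cpu_moe)]


def best_per_quant(
    results: Dict[Tuple[str, int], float]
) -> Dict[str, Tuple[int, float]]:
    """Pick the fastest --n-cpu-moe value for each quant.

    results maps (quant_name, n_cpu_moe) -> measured tokens per second.
    Returns quant_name -> (best_n_cpu_moe, best_tokens_per_second).
    """
    best: Dict[str, Tuple[int, float]] = {}
    for (quant, n_cpu_moe), tps in results.items():
        if quant not in best or tps > best[quant][1]:
            best[quant] = (n_cpu_moe, tps)
    return best


if __name__ == "__main__":
    # Print the commands to try by hand; the path is hypothetical.
    for n in (8, 16, 24, 32):
        print(" ".join(server_args("/path/to/model.gguf", n)))
```

Once each quant has its own best value, the comparison between quants is made on those best-case speeds rather than on a single shared setting.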

Get you some GPUs, it's not worth the hacks around lack of RAM

Posted by MotokoAGI@reddit | LocalLLaMA | View on Reddit | 80 comments

jacek2023@reddit

What is your setup? I remember trying tps 3 (tensor parallel size 3), but it didn't work, so I could only use two GPUs. I will try your command, but I think I tried something similar and wasn't able to use a big "max model len".

NVIDIA announces Nemotron 3 Ultra

Posted by themixtergames@reddit | LocalLLaMA | View on Reddit | 137 comments

My home data center

Posted by alecKarfonta@reddit | LocalLLaMA | View on Reddit | 86 comments

Get you some GPUs, it's not worth the hacks around lack of RAM

Posted by MotokoAGI@reddit | LocalLLaMA | View on Reddit | 80 comments

jacek2023@reddit

Please share a vLLM command for using 100k or 200k context on Qwen 27B. I use 200k context on llama.cpp, but I don't know how to use vLLM correctly, because the context is always tiny.

(YT) PewDiePie released his harness/webui

Posted by Dany0@reddit | LocalLLaMA | View on Reddit | 450 comments

We need some polls on many topics - 2026

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 13 comments

Added an old 2070 Super to my rig and I can't go back...worse, now I need more

Posted by PferdOne@reddit | LocalLLaMA | View on Reddit | 46 comments

jacek2023@reddit

My 2070 was my first GPU for AI. I won a Kaggle gold medal thanks to it. Later, I bought a 3090, and then I tried using both the 3090 and 2070 together with llama.cpp. It worked, so later I bought some 3060s and more 3090s. 😄

Don’t bite me for that question please…

Posted by Thin_Pollution8843@reddit | LocalLLaMA | View on Reddit | 79 comments

Gryphe/Pantheon-Reasoning-27B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 18 comments

StepFun 3.7 Flash

Posted by Everlier@reddit | LocalLLaMA | View on Reddit | 151 comments

Qwen 3.6 27B overdoing it

Posted by WhatererBlah555@reddit | LocalLLaMA | View on Reddit | 68 comments

How do I make MTP work in llama-server?

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 28 comments

jacek2023@reddit

I don't know llama-benchy, but I believe that to "benchmark" MTP you need a real use case: create some task (code something) and run it in both cases to compare t/s.
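The comparison itself is simple arithmetic. Here is a minimal sketch, assuming you note the generated token count and eval time of each run yourself; the function names are hypothetical and nothing here calls llama.cpp.

```python
from typing import List, Tuple

# One run = (generated_tokens, eval_time_ms), e.g. copied from the server log.
Run = Tuple[int, float]


def tokens_per_second(runs: List[Run]) -> float:
    """Aggregate t/s over several runs: total tokens / total seconds."""
    total_tokens = sum(tokens for tokens, _ in runs)
    total_ms = sum(ms for _, ms in runs)
    if total_ms <= 0:
        raise ValueError("total eval time must be positive")
    return total_tokens / (total_ms / 1000.0)


def speedup(baseline_runs: List[Run], mtp_runs: List[Run]) -> float:
    """Ratio of MTP t/s to baseline t/s; above 1.0 means MTP was faster."""
    return tokens_per_second(mtp_runs) / tokens_per_second(baseline_runs)
```

Summing tokens and time before dividing weights long generations more than short ones, which is usually what you want for a real task.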

LiquidAI/LFM2.5-8B-A1B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 83 comments

Need some advice on AI workflow

Posted by Xyklone@reddit | LocalLLaMA | View on Reddit | 19 comments

jacek2023@reddit

First, try to get at least 20 t/s, or it will be too slow and you will be unhappy (just lower the quant or optimize the llama.cpp arguments). Then install pi; it's simple and uses a small number of tokens. Run pi in the directory with the source code and type: "I have a piece of code (about 1300 loc, single file) that I would like to refactor." Then add: "propose a plan how to do it step by step, because I don't know and I need to eat a dinner now". It will then do its thing; you don't need to manually copy or edit any files. You can also tell it how to build your project so that it fixes any compilation errors.

ReAligned-Qwen3.5 Release

Posted by faldore@reddit | LocalLLaMA | View on Reddit | 23 comments

Looks like Miminax-M3 is just around the corner

Posted by OnkelBB@reddit | LocalLLaMA | View on Reddit | 39 comments

Info: Nvidia Cuda 13.3 landed

Posted by parrot42@reddit | LocalLLaMA | View on Reddit | 47 comments

RTX5080 vs RTX 3090 ?

Posted by DarkAndrei@reddit | LocalLLaMA | View on Reddit | 48 comments

Is Granite-4.1-30b Overshadowed by Qwen3.6 & Gemma4 models?

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 69 comments

Intel b60 48gb?

Posted by oldschooldaw@reddit | LocalLLaMA | View on Reddit | 32 comments

jacek2023@reddit

It would be very helpful if someone shared benchmarks from modern models like Qwen and Gemma on the current llama.cpp. Specs on paper are not as important as the actual implementation.

Looking for Suggestions — Single 5090 & 64gb DDR5

Posted by icedgz@reddit | LocalLLaMA | View on Reddit | 33 comments

Okay 27B made me a believer

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 148 comments

jacek2023@reddit

make sure to enable both ngram and mtp:

114.16.742.760 I slot print_timing: id 0 | task 39816 | n_decoded = 100, tg = 61.54 t/s
114.18.691.082 I slot print_timing: id 0 | task 39816 | prompt eval time = 1392.97 ms / 353 tokens ( 3.95 ms per token, 253.42 tokens per second)
114.18.691.087 I slot print_timing: id 0 | task 39816 | eval time = 3573.28 ms / 302 tokens ( 11.83 ms per token, 84.52 tokens per second)
114.18.691.088 I slot print_timing: id 0 | task 39816 | total time = 4966.25 ms / 655 tokens
114.18.691.089 I slot print_timing: id 0 | task 39816 | graphs reused = 34434
114.18.691.091 I slot print_timing: id 0 | task 39816 | draft acceptance = 0.70845 ( 260 accepted / 367 generated)
114.18.691.104 I statistics ngram-mod: #calls(b,g,a) = 296 37226 1361, #gen drafts = 1361, #acc drafts = 1361, #gen tokens = 86611, #acc tokens = 26543, dur(b,g,a) = 2755.062, 483.359, 18.123 ms
114.18.691.107 I statistics draft-mtp: #calls(b,g,a) = 296 35865 35865, #gen drafts = 35865, #acc drafts = 31585, #gen tokens = 107595, #acc tokens = 81082, dur(b,g,a) = 0.734, 407243.871, 102.353 ms
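The timing figures in the log above can be recomputed from the raw counts. The sketch below assumes the line format shown in this comment; the regular expressions and the `parse_timings` name are my own, not part of llama.cpp, and may need adjusting for other versions of the log output.

```python
import re
from typing import Dict

# Matches "prompt eval time = 1392.97 ms / 353 tokens" and
# "eval time = 3573.28 ms / 302 tokens" (format assumed from the log above).
TIME_RE = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens"
)
# Matches "draft acceptance = 0.70845 ( 260 accepted / 367 generated)".
ACCEPT_RE = re.compile(
    r"draft acceptance\s*=\s*[\d.]+\s*\(\s*(\d+) accepted\s*/\s*(\d+) generated\)"
)

SAMPLE = (
    "prompt eval time = 1392.97 ms / 353 tokens "
    "( 3.95 ms per token, 253.42 tokens per second) "
    "eval time = 3573.28 ms / 302 tokens "
    "( 11.83 ms per token, 84.52 tokens per second) "
    "draft acceptance = 0.70845 ( 260 accepted / 367 generated)"
)


def parse_timings(log: str) -> Dict[str, float]:
    """Return tokens/s for each timing phase plus the draft acceptance ratio."""
    out: Dict[str, float] = {}
    for match in TIME_RE.finditer(log):
        phase, ms, tokens = match.group(1), float(match.group(2)), int(match.group(3))
        out[phase] = tokens / (ms / 1000.0)
    accept = ACCEPT_RE.search(log)
    if accept:
        out["draft acceptance"] = int(accept.group(1)) / int(accept.group(2))
    return out


if __name__ == "__main__":
    for key, value in parse_timings(SAMPLE).items():
        print(f"{key}: {value:.5f}")
```

Recomputing from the counts gives the same values the log reports: 253.42 t/s for prompt processing, 84.52 t/s for generation, and a draft acceptance of 0.70845.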

One letter to appease them all

Posted by ivari@reddit | LocalLLaMA | View on Reddit | 70 comments

Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs. NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats

Posted by LLMFan46@reddit | LocalLLaMA | View on Reddit | 83 comments
