jacek2023

MTP is nice and all, but what about PP speeds?

Posted by milpster@reddit | LocalLLaMA | View on Reddit | 30 comments

[-]

jacek2023@reddit

It's very important to minimize prompt processing (number of tokens to process), make sure you use latest llama.cpp and you "preserve thinking", this way my prompt processing is fast

Someone out there likely needs this: TP vs PP for 2 identical GPUs

Posted by xspider2000@reddit | LocalLLaMA | View on Reddit | 33 comments

[-]

Someone out there likely needs this: TP vs PP for 2 identical GPUs

Posted by xspider2000@reddit | LocalLLaMA | View on Reddit | 33 comments

[-]

Someone out there likely needs this: TP vs PP for 2 identical GPUs

Posted by xspider2000@reddit | LocalLLaMA | View on Reddit | 33 comments

[-]

jacek2023@reddit

No, I don't have nvlink. \-sm tensor works faster than -sm layer

next MiniMax will be released in ~10 Days

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 52 comments

[-]

jacek2023@reddit (OP)

Yes, I probably won't be able to run this model, but I wanted to point out that MiniMax is still open, while Qwen’s current status is unknown.

next MiniMax will be released in ~10 Days

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 52 comments

[-]

next MiniMax will be released in ~10 Days

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 52 comments

[-]

jacek2023@reddit (OP)

On this sub we pay for GPUs not for APIs.

How do I improve my T/S

Posted by KneelB4S8n@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

jacek2023@reddit

Try changing manually --n-cpu-moe (various values), it can be faster than automatic split, also experiment with lower quants, for each quant try to build max speed then compare

Get you some GPUs, it's not worth the hacks around lack of RAM

Posted by MotokoAGI@reddit | LocalLLaMA | View on Reddit | 80 comments

[-]

jacek2023@reddit

What is your setup? I remember I tried tps 3 but it didn't work, so I could only use two GPUs. Will try your command but I think I tried similar and wasn't able to use big "max model len"

NVIDIA announces Nemotron 3 Ultra

Posted by themixtergames@reddit | LocalLLaMA | View on Reddit | 137 comments

[-]

jacek2023@reddit

Too big for my local setup but Nemotron Super is perfect. Nano is also nice.

My home data center

Posted by alecKarfonta@reddit | LocalLLaMA | View on Reddit | 86 comments

[-]

jacek2023@reddit

How is that possible? Try testing 5090 alone first, then add one 3090 (with env vars).

Get you some GPUs, it's not worth the hacks around lack of RAM

Posted by MotokoAGI@reddit | LocalLLaMA | View on Reddit | 80 comments

[-]

jacek2023@reddit

Please share vllm command to use 100k or 200k context on qwen 27B, I use 200k context on llama.cpp, I don't know how to use vllm correctly, because context is always tiny

My home data center

Posted by alecKarfonta@reddit | LocalLLaMA | View on Reddit | 86 comments

[-]

jacek2023@reddit

I use 3060, the goal was to test 4 GPUs before adding fourth 3090. But it helps with some big models.

(YT) PewDiePie released his harness/webui

Posted by Dany0@reddit | LocalLLaMA | View on Reddit | 450 comments

[-]

jacek2023@reddit

All I know about this guy is that he had South Path episode

We need some polls on many topics - 2026

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 13 comments

[-]

We need some polls on many topics - 2026

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 13 comments

[-]

jacek2023@reddit

I think you should spend more time on using local LLMs than posting

Added an old 2070 Super to my rig and I can't go back...worse, now I need more

Posted by PferdOne@reddit | LocalLLaMA | View on Reddit | 46 comments

[-]

jacek2023@reddit

I connected them to the motherboard, I needed to recompile llama.cpp (because 2070 is very old) and then llama.cpp detected them.

Added an old 2070 Super to my rig and I can't go back...worse, now I need more

Posted by PferdOne@reddit | LocalLLaMA | View on Reddit | 46 comments

[-]

My 2070 was my first GPU for AI. I won a Kaggle gold medal thanks to it. Later, I bought a 3090, and then I tried using both the 3090 and 2070 together with llama.cpp. It worked, so later I bought some 3060s and more 3090s. 😄

Don’t bite me for that question please…

Posted by Thin_Pollution8843@reddit | LocalLLaMA | View on Reddit | 79 comments

[-]

jacek2023@reddit

Some people have hobbies

My home data center

Posted by alecKarfonta@reddit | LocalLLaMA | View on Reddit | 86 comments

[-]

jacek2023@reddit

nicely optimized space, I have just a single x399 with 4 GPUs in the single openframe, you have the whole lab 😄

Gryphe/Pantheon-Reasoning-27B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 18 comments

[-]

jacek2023@reddit (OP)

I didn’t know you had an account here 😄

StepFun 3.7 Flash

Posted by Everlier@reddit | LocalLLaMA | View on Reddit | 151 comments

[-]

jacek2023@reddit

I am able to run Q3 locally at a good speed, and 3.7 seems censored, while 3.5 looks uncensored

Qwen 3.6 27B overdoing it

Posted by WhatererBlah555@reddit | LocalLLaMA | View on Reddit | 68 comments

[-]

jacek2023@reddit

Agentic coding is the art of creating good rules (AGENTS.md, etc)

How do I make MTP work in llama-server?

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 28 comments

[-]

jacek2023@reddit

Maybe there is a reason why llama-bench doesn't benchmark speculative decoding.

How do I make MTP work in llama-server?

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 28 comments

[-]

jacek2023@reddit

I don't know this tool. Does it send real prompts (with real tasks) or random noise?

How do I make MTP work in llama-server?

Posted by Ok_Warning2146@reddit | LocalLLaMA | View on Reddit | 28 comments

[-]

jacek2023@reddit

I don't know llama-benchy but I believe to "benchmark" mtp you need a real usecase, create some task (code something) and run it in both cases to compare t/s

StepFun 3.7 Flash

Posted by Everlier@reddit | LocalLLaMA | View on Reddit | 151 comments

[-]

jacek2023@reddit

Sounds great. Previous Step Flash was quite usable on my setup. This one is smaller?

LiquidAI/LFM2.5-8B-A1B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 83 comments

[-]

jacek2023@reddit (OP)

Yes you are right

LiquidAI/LFM2.5-8B-A1B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 83 comments

[-]

jacek2023@reddit (OP)

It's always a good idea to switch from ollama to llama.cpp :)

LiquidAI/LFM2.5-8B-A1B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 83 comments

[-]

jacek2023@reddit (OP)

I think all models from the graph are MoE, and Qwen 9B is not

LiquidAI/LFM2.5-8B-A1B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 83 comments

[-]

jacek2023@reddit (OP)

gpt-oss-20b is A2B if I remember correctly, you compare to A9B

LiquidAI/LFM2.5-8B-A1B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 83 comments

[-]

jacek2023@reddit (OP)

you compare A1B to A9B

LiquidAI/LFM2.5-8B-A1B · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 83 comments

[-]

jacek2023@reddit (OP)

https://preview.redd.it/xlmbv1qblw3h1.png?width=2800&format=png&auto=webp&s=eb87395565bcadeb192343ddf6e5bf1dec5c1565

Need some advice on AI workflow

Posted by Xyklone@reddit | LocalLLaMA | View on Reddit | 19 comments

[-]

jacek2023@reddit

First try to setup at least 20 t/s or it will be too slow and you will be unhappy (just lower the quant or optimize llama.cpp arguments). Then install pi, it's simple and uses small number of tokens. You run pi in the directory with source code and you type: "I have a piece of code (about 1300 loc, single file) that I would like to refactor." then add "propose a plan how to do it step by step, because I don't know and I need to eat a dinner now". Then it will be doing its things, you don't need to manually copy or edit any files. You can also tell it how to build your project so it will fix any compilation errors.

ReAligned-Qwen3.5 Release

Posted by faldore@reddit | LocalLLaMA | View on Reddit | 23 comments

[-]

jacek2023@reddit

OK but how this is different from heretics? Heretics are still censored?

Looks like Miminax-M3 is just around the corner

Posted by OnkelBB@reddit | LocalLLaMA | View on Reddit | 39 comments

[-]

jacek2023@reddit

Then it will be useless for me, just like DeepSeek or Kimi

Info: Nvidia Cuda 13.3 landed

Posted by parrot42@reddit | LocalLLaMA | View on Reddit | 47 comments

[-]

jacek2023@reddit

what does it mean?

RTX5080 vs RTX 3090 ?

Posted by DarkAndrei@reddit | LocalLLaMA | View on Reddit | 48 comments

[-]

jacek2023@reddit

Then you have RTX 5000 Pro or 5090 but both are expensive

Intel b60 48gb?

Posted by oldschooldaw@reddit | LocalLLaMA | View on Reddit | 32 comments

[-]

jacek2023@reddit

Could you share some results?

RTX5080 vs RTX 3090 ?

Posted by DarkAndrei@reddit | LocalLLaMA | View on Reddit | 48 comments

[-]

jacek2023@reddit

I think you need two 3090 for good context size. I use 200000 context on Q8 and three 3090s

Is Granite-4.1-30b Overshadowed by Qwen3.6 & Gemma4 models?

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 69 comments

[-]

jacek2023@reddit

It's because of the hype. There are very interesting models published by Mistral and NVIDIA and people don't discuss them.

Looks like Miminax-M3 is just around the corner

Posted by OnkelBB@reddit | LocalLLaMA | View on Reddit | 39 comments

[-]

jacek2023@reddit

I hope it is not BIG because m2 is big enough for my setup

Intel b60 48gb?

Posted by oldschooldaw@reddit | LocalLLaMA | View on Reddit | 32 comments

[-]

jacek2023@reddit

It would be very helpful if someone shared benchmarks from modern models like Qwen and Gemma on the current llama.cpp. Specs on paper are not as important as the actual implementation.

Looking for Suggestions — Single 5090 & 64gb DDR5

Posted by icedgz@reddit | LocalLLaMA | View on Reddit | 33 comments

[-]

jacek2023@reddit

A3B is dumber than 27B

Okay 27B made me a believer

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 148 comments

[-]

jacek2023@reddit

what are your t/s?

Okay 27B made me a believer

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 148 comments

[-]

jacek2023@reddit

do you see better speed without them?

Okay 27B made me a believer

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | View on Reddit | 148 comments

[-]

jacek2023@reddit

make sure to enable both ngram and mtp: 114.16.742.760 I slot print_timing: id 0 | task 39816 | n_decoded = 100, tg = 61.54 t/s 114.18.691.082 I slot print_timing: id 0 | task 39816 | prompt eval time = 1392.97 ms / 353 tokens ( 3.95 ms per token, 253.42 tokens per second) 114.18.691.087 I slot print_timing: id 0 | task 39816 | eval time = 3573.28 ms / 302 tokens ( 11.83 ms per token, 84.52 tokens per second) 114.18.691.088 I slot print_timing: id 0 | task 39816 | total time = 4966.25 ms / 655 tokens 114.18.691.089 I slot print_timing: id 0 | task 39816 | graphs reused = 34434 114.18.691.091 I slot print_timing: id 0 | task 39816 | draft acceptance = 0.70845 ( 260 accepted / 367 generated) 114.18.691.104 I statistics ngram-mod: #calls(b,g,a) = 296 37226 1361, #gen drafts = 1361, #acc drafts = 1361, #gen tokens = 86611, #acc tokens = 26543, dur(b,g,a) = 2755.062, 483.359, 18.123 ms 114.18.691.107 I statistics draft-mtp: #calls(b,g,a) = 296 35865 35865, #gen drafts = 35865, #acc drafts = 31585, #gen tokens = 107595, #acc tokens = 81082, dur(b,g,a) = 0.734, 407243.871, 102.353 ms

One letter to appease them all

Posted by ivari@reddit | LocalLLaMA | View on Reddit | 70 comments

[-]

jacek2023@reddit

Wow I am reading religion news on r/LocalLLaMA and they are even ontopic 😄

Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs. NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats

Posted by LLMFan46@reddit | LocalLLaMA | View on Reddit | 83 comments

[-]

jacek2023@reddit

ah so are maintaining both versions 😄

Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs. NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats

Posted by LLMFan46@reddit | LocalLLaMA | View on Reddit | 83 comments

[-]

jacek2023@reddit

but does it mean you will also create 3.6 or skip it?