Pristine_Income9554

Why don't we still have any games with AI agents used as NPC characters?

Posted by Another__one@reddit | LocalLLaMA | View on Reddit | 110 comments

[-]

Pristine_Income9554@reddit

https://preview.redd.it/w96nwkn4rx4h1.png?width=2560&format=png&auto=webp&s=530204052fddffee4c2e6b6d082fe1c8eeff93e2 4 days ago I started making one. The obvious problems: it's too expensive even if you let the model decide only 1/3 of things, or, if you run it locally, it's too slow. But I have some ideas for how to get around it.

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Posted by Alternative-Cat-1347@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

Pristine_Income9554@reddit

Yeah, but using only --fit will f.. up with a MoE setup, at least for me.

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Posted by Alternative-Cat-1347@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

Pristine_Income9554@reddit

Bigger-than-default -b and -ub values will speed up prompt processing at the cost of cache size. If you don't load the full model into VRAM, you need to set one of these 3 flags (-cmoe, -ncmoe, or -ot). I haven't seen any news about llama.cpp automatically placing weights for MoE models.

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Posted by Alternative-Cat-1347@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

Pristine_Income9554@reddit

-cmoe, -ncmoe, and -ot all do the same thing in different ways. I don't see any MTP flags. On my setup MTP is slower, so I can't recommend anything on this topic. Mine: ./llama-server -m ..../Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --alias "Qwen3.6-35B-A3B" --host 0.0.0.0 -t 8 -tb 12 -cmoe -b 2048 -ub 2048 --ctx-size 65536 --jinja -fa on -ctk q8_0 -ctv q8_0 --fit on --fit-target 248 --no-mmap --no-context-shift --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.00 --chat-template-kwargs '{"preserve_thinking": true}'
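
A minimal sketch of how a launch command like the one above could be assembled in a script, using only flags that appear in this comment. The function name `build_llama_server_args`, its defaults, and the `model.gguf` path are hypothetical. The comment on `-cmoe` reflects my understanding that it keeps the MoE expert weights on the CPU; check `llama-server --help` on your build.

```python
import shlex


def build_llama_server_args(model_path, ctx_size=65536, batch=2048,
                            kv_type="q8_0", cpu_moe=True):
    """Return an argument list for llama-server (hypothetical helper)."""
    args = [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        # Larger batch / micro-batch sizes speed up prompt processing.
        "-b", str(batch),
        "-ub", str(batch),
        "-fa", "on",
        # Quantized KV cache types for keys and values.
        "-ctk", kv_type,
        "-ctv", kv_type,
    ]
    if cpu_moe:
        # Assumption: keeps MoE expert weights on the CPU when the
        # full model does not fit into VRAM.
        args.append("-cmoe")
    return args


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a real file.
    print(shlex.join(build_llama_server_args("model.gguf")))
```

Building the command as a list rather than one long string makes it easier to toggle a single flag while benchmarking.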

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Posted by Alternative-Cat-1347@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

Pristine_Income9554@reddit

And you forgot to mention which fork you are using, as default llama.cpp doesn't have turbo4.

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Posted by Alternative-Cat-1347@reddit | LocalLLaMA | View on Reddit | 40 comments

[-]

Pristine_Income9554@reddit

no -cmoe and -t -tb -b -ub flags? What makes this more efficient?

Qwen3.6-35B - Terrible instruction following when using context files (with vanilla pi-agent). Model issue or am I doing something wrong?

Posted by FusionX@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

Pristine_Income9554@reddit

1. CUDA 13.2 can be a problem. 2. Add --chat-template-kwargs '{"preserve_thinking": true}'

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now)

Posted by dreamai87@reddit | LocalLLaMA | View on Reddit | 68 comments

[-]

Pristine_Income9554@reddit

https://preview.redd.it/964515wba5wg1.png?width=390&format=png&auto=webp&s=437527cb49e8afbf0fa6b20493b6c16e4d9b359d date

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now)

Posted by dreamai87@reddit | LocalLLaMA | View on Reddit | 68 comments

[-]

Pristine_Income9554@reddit

It's OK on my end on Windows and Arch Linux. I compile llama.cpp myself with CUDA 12.8.

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now)

Posted by dreamai87@reddit | LocalLLaMA | View on Reddit | 68 comments

[-]

Pristine_Income9554@reddit

Just the normal latest llama.cpp builds.

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now)

Posted by dreamai87@reddit | LocalLLaMA | View on Reddit | 68 comments

[-]

Pristine_Income9554@reddit

If I remember right, it improves performance for coding when the model repeats the same code.
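
A toy sketch of why n-gram based drafting helps when the model repeats code it has already produced. This is not llama.cpp's implementation, only an illustration of the general idea: look up the last few tokens earlier in the context and propose whatever followed them as a draft. The function name and parameters are hypothetical.

```python
def propose_draft(tokens, ngram_size=3, max_draft=8):
    """Propose draft tokens by matching the last n-gram earlier in the context."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            # Tokens that followed the earlier occurrence become the draft.
            return tokens[start + ngram_size:start + ngram_size + max_draft]
    return []


if __name__ == "__main__":
    context = "for i in range ( n ) : print ( i ) for i in".split()
    print(propose_draft(context, ngram_size=3, max_draft=4))
```

When the draft matches what the model would have generated anyway, several tokens are accepted in one step, which is why repeated code benefits most.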

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now)

Posted by dreamai87@reddit | LocalLLaMA | View on Reddit | 68 comments

[-]

Pristine_Income9554@reddit

40 GB DDR4 2993 MHz, Ryzen 2700X

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now)

Posted by dreamai87@reddit | LocalLLaMA | View on Reddit | 68 comments

[-]

Pristine_Income9554@reddit

You two messed up somewhere. https://preview.redd.it/8km75phykrvg1.png?width=503&format=png&auto=webp&s=ccfe714215e0cf0e52109ecb84ba4b525233f074 Generation 21-17 t/s, prompt processing ~578.82 tokens/s on RTX 2060 6 GB

```
-m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --alias "Qwen3.6-35B-A3B" --host 0.0.0.0 -cmoe -b 2048 -ub 2048 --ctx-size 65536 --jinja -fa on -ctk q8_0 -ctv q8_0 --fit on --fit-target 128 --no-mmap --no-context-shift --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.00 -np 1 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
```

Qwen3.5-35B-A3B-Heretic running surprisingly fast on RTX 3060 Ti 8GB - is Heretic castrated compared to original?

Posted by Temporary-Lack-1408@reddit | LocalLLaMA | View on Reddit | 47 comments

[-]

Pristine_Income9554@reddit

You should have way better numbers. Try `-t 6 -tb 12 -ngl 999 -cmoe -b 2048 -ub 2048 --ctx-size 65536 --jinja -fa on -ctk q8_0 -ctv q8_0 --fit on --fit-target 128 --no-mmap`

Improve Qwen3.5 Performance on Weak GPU

Posted by MarketingGui@reddit | LocalLLaMA | View on Reddit | 22 comments

[-]

Pristine_Income9554@reddit

`llama-server.exe -m D:\ggufModels\Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --alias "Qwen3.5-35B-A3B" -t 6 -tb 12 -cmoe -b 2048 -ub 2048 --ctx-size 65536 --jinja -fa on -ctk q4_0 -ctv q4_0 --fit on --fit-target 64 -np 1 --no-mmap --no-context-shift` gives 12 t/s with an RTX 2060 6 GB VRAM; 40 GB RAM at 2936 MHz; Ryzen 7 2700X

System prompt for Qwen3.5 (27B/35BA3B) to reduce overthinking?

Posted by thigger@reddit | LocalLLaMA | View on Reddit | 27 comments

[-]

Pristine_Income9554@reddit

If you don't need tool-calling capabilities, you can use a grammar to set up any response format you like.
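
A minimal sketch of what constraining the response format with a grammar could look like, assuming the llama.cpp server's `/completion` endpoint, which accepts a GBNF grammar in a `grammar` field. The yes/no grammar, the prompt, and the helper name are illustrative assumptions; the sketch only builds the request body and does not contact a server.

```python
import json

# Illustrative GBNF grammar: the reply must be exactly "yes" or "no".
YES_NO_GRAMMAR = 'root ::= "yes" | "no"'


def build_completion_payload(prompt, grammar=YES_NO_GRAMMAR, n_predict=8):
    """Build a JSON-serializable request body with a grammar constraint."""
    return {
        "prompt": prompt,
        "grammar": grammar,      # restricts sampling to strings the grammar accepts
        "n_predict": n_predict,  # cap on generated tokens
    }


if __name__ == "__main__":
    payload = build_completion_payload("Is the sky blue? Answer: ")
    print(json.dumps(payload, indent=2))
```

Because the grammar leaves no room for free-form text, the model cannot emit a long thinking preamble in that response.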

GLM4.7-Flash REAP @ 25% live on HF + agentic coding evals

Posted by ilzrvch@reddit | LocalLLaMA | View on Reddit | 20 comments

[-]

Pristine_Income9554@reddit

Use -ctk q8_0 -ctv q8_0

Fix for GLM 4.7 Flash has been merged into llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 91 comments

[-]

Pristine_Income9554@reddit

Who cares about a fix that can be done with the flag --override-kv deepseek2.expert_gating_func=int:2? The OP's title is deceptive, as the main problem with GLM 4.7 Flash is broken flash attention.

Fix for GLM 4.7 Flash has been merged into llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 91 comments

[-]

Pristine_Income9554@reddit

Fixed != merged. It still has problems that need to be fixed before it gets merged into the master tree.

glm-4.7-flash has the best thinking process with clear steps, I love it

Posted by uptonking@reddit | LocalLLaMA | View on Reddit | 38 comments

[-]

Pristine_Income9554@reddit

[https://github.com/ggml-org/llama.cpp/issues/18944](https://github.com/ggml-org/llama.cpp/issues/18944)

My gpu poor comrades, GLM 4.7 Flash is your local agent

Posted by __Maximum__@reddit | LocalLLaMA | View on Reddit | 169 comments

[-]

Pristine_Income9554@reddit

[https://github.com/ggml-org/llama.cpp/issues/18944](https://github.com/ggml-org/llama.cpp/issues/18944) explains why it's slow with llama.cpp.

My gpu poor comrades, GLM 4.7 Flash is your local agent

Posted by __Maximum__@reddit | LocalLLaMA | View on Reddit | 169 comments

[-]

Pristine_Income9554@reddit

flash attention is broken, try to turn it off

I fine-tuned a 7B model for reasoning on free Colab with GRPO + TRL

Posted by External-Rub5414@reddit | LocalLLaMA | View on Reddit | 2 comments

[-]

Pristine_Income9554@reddit

Reinventing the wheel? [https://unsloth.ai/docs/get-started/unsloth-notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks)

AI has replaced programmers… totally.

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 297 comments

[-]

Pristine_Income9554@reddit

It's more a problem of open source. Even if AI could implement a quant method for a new model, someone still needs to spend time on it for free.

AI has replaced programmers… totally.

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 297 comments

[-]

Pristine_Income9554@reddit

Come on... any guy or girl can quantize a model. You only need a good enough GPU and reasonably capable hands.

ChatGPT stopped lying to me when I started treating it like a scared kid

Posted by Nan0pixel@reddit | LocalLLaMA | View on Reddit | 13 comments

[-]

Pristine_Income9554@reddit

It's the same thing as the tip meta. Models are trained on data from humans, and we are lazy as fuck. Thanks to laziness our civilization advances (to make hard work easier). And now we expect a thing trained on our data not to have our flaws?

Wan 2.1 1.3B fighting video is not as good as the Qwen 2.5 fighting videos I previously posted. I used the Wan 2.1 1.3B from Huge.com. Qwen 2.5 must be using some other type of super model for videos. Because this Wan has lost its' way.

Posted by Extension-Fee-8480@reddit | LocalLLaMA | View on Reddit | 10 comments

[-]

Pristine_Income9554@reddit

yep

Wan 2.1 1.3B fighting video is not as good as the Qwen 2.5 fighting videos I previously posted. I used the Wan 2.1 1.3B from Huge.com. Qwen 2.5 must be using some other type of super model for videos. Because this Wan has lost its' way.

Posted by Extension-Fee-8480@reddit | LocalLLaMA | View on Reddit | 10 comments

[-]

Pristine_Income9554@reddit

Man, even a 6 GB VRAM GPU can generate video using the 14B Wan model...

Exceeding VRAM limit with QWQ IQ3XXS i1 quant, no OOM? (LM studio)

Posted by No_Expert1801@reddit | LocalLLaMA | View on Reddit | 7 comments

[-]

Pristine_Income9554@reddit

Windows or Linux?

Think Tool Boosts Accuracy by 54%! (+ Ollama integration)

Posted by Straight-Worker-4327@reddit | LocalLLaMA | View on Reddit | 21 comments

[-]

Pristine_Income9554@reddit

You're missing that this tool works with any good model in Ollama without training. If a model is trained to work with function calling, it will work well not only with this *“think” tool*, but with search or RAG as well.

Qwen LIED TO US

Posted by random-tomato@reddit | LocalLLaMA | View on Reddit | 7 comments

[-]

Pristine_Income9554@reddit

you forgot about Qwen 7b

Think Tool Boosts Accuracy by 54%! (+ Ollama integration)

Posted by Straight-Worker-4327@reddit | LocalLLaMA | View on Reddit | 21 comments

[-]

Pristine_Income9554@reddit

It would be more interesting to have a separate model, trained just for reasoning, behind this function call.

Think Tool Boosts Accuracy by 54%! (+ Ollama integration)

Posted by Straight-Worker-4327@reddit | LocalLLaMA | View on Reddit | 21 comments

[-]

Pristine_Income9554@reddit

Even if we assume that full chat context plus a reasoning function call in the same call gives a better result, it's still just a function call, like RAG, internet search, or image generation, that tries to cheaply get a result similar to reasoning models. It's nothing new, just a stripped-down function call that only asks the model a question with a custom prompt.

Think Tool Boosts Accuracy by 54%! (+ Ollama integration)

Posted by Straight-Worker-4327@reddit | LocalLLaMA | View on Reddit | 21 comments

[-]

Pristine_Income9554@reddit

It's just the same reasoning thing wrapped inside function calling, so you don't need to train the model to output thinking and answer in 1 reply; instead you have 2 replies with a similar result.
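
A minimal sketch of that idea: a "think" tool is just a function-call definition whose handler asks a model a question with a custom prompt. The schema follows the common OpenAI-style tool format; the prompt text, the handler name, and `fake_model` are hypothetical stand-ins, not the tool from the linked post.

```python
# Tool definition in the common OpenAI-style function-calling format.
THINK_TOOL = {
    "type": "function",
    "function": {
        "name": "think",
        "description": "Reason step by step about a question before answering.",
        "parameters": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
}

# Hypothetical custom prompt used for the second (reasoning) call.
REASONING_PROMPT = "Think step by step about the following:\n{question}"


def handle_tool_call(name, arguments, ask_model):
    """Dispatch a tool call; `ask_model` is any callable taking a prompt."""
    if name != "think":
        raise ValueError(f"unknown tool: {name}")
    return ask_model(REASONING_PROMPT.format(question=arguments["question"]))


def fake_model(prompt):
    """Stand-in for a real model call, used only for demonstration."""
    return f"[reasoning for: {prompt.splitlines()[-1]}]"


if __name__ == "__main__":
    print(handle_tool_call("think", {"question": "2+2?"}, fake_model))
```

Swapping `ask_model` for a search or RAG lookup gives the other function-call variants mentioned above; the calling side stays the same.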

Is the DeepSeek model poisoned at the data level?

Posted by aospan@reddit | LocalLLaMA | View on Reddit | 10 comments

[-]

Pristine_Income9554@reddit

It would be strange if they had datasets not aligned with CCP policies. The model was not created for personal use by Western people. You can run it, but don't expect it to have the same worldview as the West.

1 Million Token Context Length 🔥

Posted by CelebrationClean7309@reddit | LocalLLaMA | View on Reddit | 39 comments

[-]

Pristine_Income9554@reddit

**Qwen** models don't like q4 kv cache

Opensource 8B parameter test time compute scaling(reasoning) model

Posted by TheLogiqueViper@reddit | LocalLLaMA | View on Reddit | 36 comments

[-]

Pristine_Income9554@reddit

Is it just me, or is it way too repetitive?

It's getting difficult to evaluate models.

Posted by baehyunsol@reddit | LocalLLaMA | View on Reddit | 52 comments

[-]

Pristine_Income9554@reddit

I guess the LLMs were hooked up to RAG with Korean laws inside? If so, you only need a model with good context attention and reasoning.

KoboldcPP is such a gigantic leap in QoL coming from Oobabooga is just ridiculous.

Posted by pumukidelfuturo@reddit | LocalLLaMA | View on Reddit | 58 comments

[-]

Pristine_Income9554@reddit

Oh, this guy hasn't seen tabbyAPI with exl2 models.

6 bit quantization

Posted by Ok-Cicada-5207@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

Pristine_Income9554@reddit

6-bit will be better in quality, not in performance. Use the exl2 format if you have an NVIDIA GPU.

Is LLM Studio good?

Posted by Top_Sonic@reddit | LocalLLaMA | View on Reddit | 91 comments

[-]

Pristine_Income9554@reddit

Read the author's question.

Is LLM Studio good?

Posted by Top_Sonic@reddit | LocalLLaMA | View on Reddit | 91 comments

[-]

Pristine_Income9554@reddit

tabbyAPI, koboldcpp backend, ST frontend

Tumera 0.1.0a2 is here!

Posted by Sad-Fix-7915@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

Pristine_Income9554@reddit

The lifecycle of software and code is so much shorter compared to the time a lot of those patterns were invented, the next update or the next technology or the next best pattern could come out tomorrow, making your effort be in vain. You will not see a new architecture like MVVM in the next 2-3 years, 100%. It works not only on PC but also on phones using MAUI (the new Xamarin), and you don't need to reinvent the wheel; it's like MVC for ASP.NET.

Tumera 0.1.0a2 is here!

Posted by Sad-Fix-7915@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

Pristine_Income9554@reddit

MVVM only looks scary, but with it comes order and convenience

Tumera 0.1.0a2 is here!

Posted by Sad-Fix-7915@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

Pristine_Income9554@reddit

John Gossman, a Microsoft WPF and Silverlight architect, announced MVVM on his blog in **2005**. Model–view–viewmodel is also referred to as model–view–binder, especially in implementations not involving the .NET platform.

Tumera 0.1.0a2 is here!

Posted by Sad-Fix-7915@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]

Pristine_Income9554@reddit

You will not learn how to drive a car by riding a bicycle. Don't waste your time. If you just want to finish the app, then you don't really need MVVM, but if you want to learn, please start straight with MVVM, because without it you will pick up bad habits. For example, having a bunch of app backend code in the frontend ChatPage.xaml.cs.
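
A rough sketch of the separation being recommended, written in Python as a stand-in for XAML/C#. The class names and the injected `send_message` callable are hypothetical; the point is only that chat logic lives in a view model while the view just renders what it is told.

```python
class ChatViewModel:
    """Holds chat state and logic; knows nothing about how it is displayed."""

    def __init__(self, send_message):
        self._send_message = send_message  # backend call injected from outside
        self._listeners = []
        self.messages = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def submit(self, text):
        text = text.strip()
        if not text:
            return  # ignore empty input
        self.messages.append(("user", text))
        self.messages.append(("assistant", self._send_message(text)))
        # Notify every subscribed view with a copy of the state.
        for listener in self._listeners:
            listener(list(self.messages))


class ChatView:
    """Plays the role of the code-behind: it only renders state."""

    def __init__(self, view_model):
        self.rendered = []
        view_model.subscribe(self.render)

    def render(self, messages):
        self.rendered = [f"{role}: {text}" for role, text in messages]


if __name__ == "__main__":
    vm = ChatViewModel(lambda text: text.upper())  # fake backend
    view = ChatView(vm)
    vm.submit(" hi ")
    print(view.rendered)
```

Because the view model takes the backend as a parameter, it can be tested without any UI, which is the kind of habit the pattern is meant to build.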

Tumera 0.1.0a2 is here!

Posted by Sad-Fix-7915@reddit | LocalLLaMA | View on Reddit | 9 comments

[-]