Ban phrases on llama.cpp with this script.


EncampedMars801@reddit

Could you elaborate on how this works/how you implemented it?


Total-Resort-3120@reddit (OP)

Basically, it keeps the last N tokens of the model's output in memory (N = length of the longest banned phrase + 3). If it finds a banned phrase, it rewinds the buffer to just before the match started, applies a heavy logit bias (`-999`) to the triggering token so the model won't pick it again, then resumes generation from that point.
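
A minimal sketch of that loop, not the script's actual code. It assumes word-level "tokens", a hypothetical toy model (`TOY_RANKING`, `toy_pick`), and that the "triggering token" is the first token of the matched phrase. It also scopes the ban to the rewind position only and keeps the whole output rather than an N-token window, both of which are simplifications made here for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set

# Hypothetical toy "model": ranked next-token candidates keyed by the last token.
TOY_RANKING: Dict[str, List[str]] = {
    "<s>": ["her"], "her": ["voice"], "voice": ["was"],
    "was": ["barely", "quiet"], "barely": ["above"],
    "above": ["a"], "a": ["whisper"],
}


def toy_pick(context: List[str], blocked: Set[str]) -> Optional[str]:
    """Return the best-ranked candidate that is not blocked, or None to stop."""
    last = context[-1] if context else "<s>"
    for candidate in TOY_RANKING.get(last, []):
        if candidate not in blocked:
            return candidate
    return None


def find_banned_start(tokens: Sequence[str],
                      banned: Sequence[Sequence[str]]) -> Optional[int]:
    """Return the index where a banned phrase ending at the tail begins, else None."""
    for phrase in banned:
        n = len(phrase)
        if 0 < n <= len(tokens) and list(tokens[-n:]) == list(phrase):
            return len(tokens) - n
    return None


def generate_with_bans(pick: Callable[[List[str], Set[str]], Optional[str]],
                       banned: Sequence[Sequence[str]],
                       max_steps: int = 50) -> List[str]:
    out: List[str] = []
    blocked: Dict[int, Set[str]] = {}  # position -> tokens disallowed there
    for _ in range(max_steps):
        token = pick(out, blocked.get(len(out), set()))
        if token is None:
            break
        out.append(token)
        start = find_banned_start(out, banned)
        if start is not None:
            # Stand-in for the heavy negative bias: block the phrase's first
            # token at the position where the match began, then rewind.
            blocked.setdefault(start, set()).add(out[start])
            del out[start:]
    return out


print(generate_with_bans(toy_pick, [["barely", "above", "a", "whisper"]]))
```

With the toy ranking above, the model first writes the banned phrase, gets rewound to just after "was", and then falls back to its second choice at that position.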


Evening_Ad6637@reddit

And why not simply use GBNF?


aeqri@reddit

Can you explain how you'd make it work? Correct me if I'm wrong, but isn't the goal of GBNF to force a format/grammar, whereas in this case we want the model to not output a specific sequence of tokens? How would you, for example, not allow the model to generate the phrase "barely above a whisper" using GBNF?


n00b001@reddit

I know nothing of GBNF. However, JSON output is similar to this, no? It permits only certain tokens through logit bias, one token at a time, even if the grammar is multi-token. Pydantic structured output can have quite complex constraints.


aeqri@reddit

Phrase banning isn't something you'd use if you wanted the model to generate valid JSON or other structured formats. This is more for preventing certain phrases in natural language. Think AI slop like "You're absolutely right!" or phrases leading to refusals, such as "As an AI model".

Fundamentally, constrained decoding acts as a whitelist: "You're only allowed to pick from these 3 tokens because they're the ones that satisfy the schema." Phrase/string banning is more like a blacklist. The goal is to let the model be free and creative with 99.9% of its vocab, but trigger a restriction only when a specific sequence is generated.

The problem with using a whitelist/grammar approach is that it has no way to look ahead. To prevent "barely above a whisper", you can't just ban the word "barely", because it'd also affect other valid phrases. You also can't just ban "above" if the previous token was "barely", because, once again, what if it was just generating "barely above average"?
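
To make the "barely above average" point concrete, here is a small sketch contrasting a per-word ban with a whole-phrase check. Both helpers (`naive_word_ban`, `phrase_ban`) are hypothetical and work on plain strings rather than model tokens, which is an assumption made to keep the example short.

```python
from typing import Iterable, Set


def naive_word_ban(text: str, banned_words: Set[str]) -> bool:
    """Flag the text if any single banned word appears anywhere in it."""
    words = [w.strip('.,!?"').lower() for w in text.split()]
    return any(w in banned_words for w in words)


def phrase_ban(text: str, banned_phrases: Iterable[str]) -> bool:
    """Flag the text only if a complete banned phrase appears in it."""
    lowered = text.lower()
    return any(p.lower() in lowered for p in banned_phrases)


harmless = "Her score was barely above average."
slop = "She spoke barely above a whisper."

print(naive_word_ban(harmless, {"barely"}))           # word ban also hits the harmless sentence
print(phrase_ban(harmless, ["barely above a whisper"]))
print(phrase_ban(slop, ["barely above a whisper"]))
```

The phrase check can only fire once the full sequence exists, which is why the approach described in this thread has to detect first and rewind afterwards.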


n00b001@reddit

LLMs use a finite token set, right? So I guess a whitelist and a blacklist are really the same thing ;) I.e.: LLM, you're allowed to use one of these 5 tokens (whitelist), or you're allowed one of these 9995 tokens (blacklist), or one of these 10000 tokens (no filter).
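
For a single decoding step, that equivalence is just a set complement over the vocabulary. A tiny sketch, assuming a made-up vocabulary of 10000 token ids and hypothetical helper names:

```python
from typing import Iterable, Set


def allowed_from_blacklist(vocab: Iterable[int], blacklist: Iterable[int]) -> Set[int]:
    """Turn a blacklist into the equivalent whitelist for one decoding step."""
    return set(vocab) - set(blacklist)


def blacklist_from_allowed(vocab: Iterable[int], allowed: Iterable[int]) -> Set[int]:
    """Turn a whitelist into the equivalent blacklist for one decoding step."""
    return set(vocab) - set(allowed)


vocab = range(10000)          # made-up vocabulary size
blacklist = {3, 17, 256, 4096, 9999}

allowed = allowed_from_blacklist(vocab, blacklist)
print(len(allowed))
print(blacklist_from_allowed(vocab, allowed) == blacklist)
```

This only covers one step in isolation. It says nothing about when the mask is decided, which is the part the phrase-banning discussion above is about.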


DeProgrammer99@reddit

I really like how you set it up as a proxy. That seems like a much more usable approach than trying to chase (or straight-up reimplement) all the new llama.cpp features with a wrapper library like LlamaSharp (which doesn't even wrap `common`, i.e., even speculative decoding needs to be reimplemented by anything that uses the wrapper).


iamapizza@reddit

If I've understood it correctly, this repo sets up a proxy URL in front of the llama.cpp web server and watches for the banned words. https://github.com/BigStationW/llama-cpp-phrase-ban/blob/main/ban_phrases.py But I'm not technical enough to say how the rewind works. Does it actually make the LLM go back a few words?


EncampedMars801@reddit

Yeah, I mean that part's obvious. I'm more interested in how they implemented the actual phrase banning.


CommonPurpose1969@reddit

I love the idea! However, there is an issue with the proxy. When a token is banned, it seems to remain banned until it is replaced or the generation is complete. This means that if a banned phrase starts with 'You' and is detected, you won't see that token again.


aeqri@reddit

The proper way to do it would be to only generate a single token with the logit bias first, then continue generating without it afterwards. Wouldn't be that big of a change - just an extra step.
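
A rough sketch of that two-step idea as request payloads. The field names `prompt`, `n_predict`, and `logit_bias` follow my understanding of the llama.cpp server `/completion` endpoint and should be checked against the server docs. The helper names and the token id `1234` are made up for illustration, and nothing here is taken from the script itself.

```python
from typing import Any, Dict


def biased_single_token_request(prompt: str, banned_token_id: int,
                                bias: float = -999.0) -> Dict[str, Any]:
    """Step 1: ask for exactly one token, with the triggering token biased away."""
    return {
        "prompt": prompt,
        "n_predict": 1,
        "logit_bias": [[banned_token_id, bias]],
    }


def unbiased_continuation_request(prompt: str, first_token_text: str,
                                  remaining_tokens: int) -> Dict[str, Any]:
    """Step 2: continue from the new token with no bias attached."""
    return {
        "prompt": prompt + first_token_text,
        "n_predict": remaining_tokens,
    }


step_one = biased_single_token_request("Her voice was", banned_token_id=1234)
# Pretend the server answered step one with " quiet".
step_two = unbiased_continuation_request("Her voice was", " quiet", remaining_tokens=200)

print(step_one)
print(step_two)
```

Because the bias only exists in the first request, the token is free to appear again later in the same generation.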


CommonPurpose1969@reddit

I agree. It should probably be written in Rust, too, to ensure that it can process the incoming tokens quickly enough. This is because the proxy plays catch-up and, by the time it stops, tokens have been generated past the banned phrase and eventually thrown away together with the banned phrase tokens. If anyone is interested: [https://github.com/ggml-org/llama.cpp/discussions/9699](https://github.com/ggml-org/llama.cpp/discussions/9699)


i_am__not_a_robot@reddit

True **constrained decoding** is superior to this, if your goal is a hard guarantee that banned phrases cannot be produced.


CommonPurpose1969@reddit

Constrained decoding as in GBNF? Using GBNF to pull off that kind of functionality will slow down llama-server to the point of being unusable.


droptableadventures@reddit

They're doing the exact same thing. "Rewind" is not entirely accurate here; it's beginning inference again from the previous point.


i_am__not_a_robot@reddit

From my brief look at this, it looks like when a banned phrase appears, the buffered tokens are rewound and generation restarts with an added `logit_bias` against the triggering token. True "constrained decoding" would filter or mask invalid next tokens **before** sampling.


droptableadventures@reddit

> True "constrained decoding" would filter or mask invalid next tokens before sampling.

There's no difference between that and setting the sampler bias for that token to -infinity.
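
A quick numeric check of that claim for a plain softmax over made-up logits. This is a sketch of the arithmetic only and makes no claim about how llama.cpp orders its samplers; the function names are hypothetical.

```python
import math
from typing import List, Set


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_then_softmax(logits: List[float], banned: Set[int]) -> List[float]:
    """Drop banned tokens first, then normalise over the survivors."""
    kept = [i for i in range(len(logits)) if i not in banned]
    kept_probs = softmax([logits[i] for i in kept])
    probs = [0.0] * len(logits)
    for i, p in zip(kept, kept_probs):
        probs[i] = p
    return probs


def bias_then_softmax(logits: List[float], banned: Set[int],
                      bias: float = -math.inf) -> List[float]:
    """Add a bias to banned tokens, then normalise over the full vocabulary."""
    return softmax([x + bias if i in banned else x for i, x in enumerate(logits)])


logits = [2.0, 1.0, 0.5, -1.0]   # made-up values
banned = {1}

masked = mask_then_softmax(logits, banned)
biased = bias_then_softmax(logits, banned)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(masked, biased)))
```

With logits of this size, a bias of `-999` gives the same result as `-math.inf`, because `math.exp` underflows to `0.0` for such a large negative argument.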


Chromix_@reddit

There was already an "anti slop sampler" in 2024 [here](https://www.reddit.com/r/LocalLLaMA/comments/1fqqez5/i_made_a_configurable_antislop_sampler_which/). Support for OpenAI API [was added](https://www.reddit.com/r/LocalLLaMA/comments/1fyr1ch/antislop_sampler_gets_an_openaicompatible_api_try/) a bit later. It still seems to be under semi-active development. The last PR [was merged 2 months ago](https://github.com/sam-paech/antislop-sampler/commits/main/). Just for completeness: There's also the [XTC sampler](https://www.reddit.com/r/LocalLLaMA/comments/1fv5kos/say_goodbye_to_gptisms_and_slop_xtc_sampler_for/). It doesn't ban phrases, but leads to more diverse results in general and could be used together with phrase-banning.


jungle@reddit

Uhm... 4 is 102 in binary??? 0x08 in hexa??? I would add "binary" and "hexadecimal" to the banned words. And any math-related words and symbols for good measure.


willrshansen@reddit

The forbidden '2'!


Due-Function-4877@reddit

Yes. I was led to believe there would be no math. Also ban people that recite their child's age in weeks after 12 weeks or in months after one year.


Total-Resort-3120@reddit (OP)

True true 😂


a_beautiful_rhind@reddit

This is built into ik_llama currently along with regex banning. Haven't tried the latter part yet but I assume it's for all the eye glinting and whatever.


henk717@reddit

KoboldCpp as well. For us, the phrase banning is combined with the token banning: if what you wish to ban is a single token, we just bias against the token; if it's not a single token, we use the phrase-banning approach. Keeps it very simple and efficient.
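
The split described here could look roughly like the sketch below. It is not KoboldCpp's code: `split_bans`, `toy_tokenize`, and the four-word vocabulary are all hypothetical, and a real tokenizer would replace the whitespace split.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical toy vocabulary: one id per whitespace-separated word.
TOY_VOCAB: Dict[str, int] = {"whisper": 7, "barely": 3, "above": 4, "a": 5}


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: split on whitespace and look up each word."""
    return [TOY_VOCAB[word] for word in text.split()]


def split_bans(entries: Iterable[str],
               tokenize: Callable[[str], List[int]]) -> Tuple[List[int], List[str]]:
    """Route single-token entries to a token bias list, the rest to phrase banning."""
    token_bans: List[int] = []
    phrase_bans: List[str] = []
    for entry in entries:
        ids = tokenize(entry)
        if len(ids) == 1:
            token_bans.append(ids[0])    # cheap: a plain logit bias is enough
        else:
            phrase_bans.append(entry)    # needs the detect-and-rewind path
    return token_bans, phrase_bans


print(split_bans(["whisper", "barely above a whisper"], toy_tokenize))
```

Single-token bans never need a rewind, so only the multi-token entries pay for the buffering.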


henk717@reddit

If you'd like something that has native phrase banning, you can also use KoboldCpp; for us it's built in.


HornyGooner4402@reddit

For some reason I read this as "Ban these phrases on llama.cpp" and I was confused about why you'd ban "the result" and the number "4".


xeeff@reddit

remindme! 3d


RemindMeBot@reddit

I will be messaging you in 3 days on [**2026-05-05 21:38:24 UTC**](http://www.wolframalpha.com/input/?i=2026-05-05%2021:38:24%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1t227hk/ban_phrases_on_llamacpp_with_this_script/ojkqxlh/?context=3).


MoneyPowerNexis@reddit

remindme! 5y Hello to future me, I hope the AI hardware bubble has popped. If it hasn't, remember you can always go outside. Maybe visit your friend Brad and pet his dogs or something, and consider whether all the attention you paid to this was worth it.
