TheaterFire

Ban phrases on llama.cpp with this script.

Posted by Total-Resort-3120@reddit | LocalLLaMA | View on Reddit | 31 comments

Check the README for setup instructions: [https://github.com/BigStationW/llama-cpp-phrase-ban](https://github.com/BigStationW/llama-cpp-phrase-ban)


31 Comments

EncampedMars801@reddit

Could you elaborate on how this works/how you implemented it?
View on Reddit #85045694

Total-Resort-3120@reddit (OP)

Basically, it keeps the last N tokens of the model's output in memory (N = length of the longest banned phrase + 3). If it finds a banned phrase, it rewinds the buffer to just before the match started, applies a heavy logit bias (`-999`) to the triggering token so the model won't pick it again, then resumes generation from that point.
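
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repo: `toy_model`, `RANKED`, and `generate_with_bans` are hypothetical names, the "model" is a lookup table, the whole output is kept instead of only the last N tokens, and the `-999` bias is modeled as a per-position set of blocked tokens. It also assumes the "triggering token" is the first token of the matched phrase.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for the LLM: ranked next-token candidates per last token.
RANKED: Dict[str, List[str]] = {
    "<s>": ["her"],
    "her": ["voice"],
    "voice": ["was"],
    "was": ["barely", "quiet"],
    "barely": ["above"],
    "above": ["a"],
    "a": ["whisper"],
    "whisper": ["."],
    "quiet": ["."],
    ".": ["<eos>"],
}


def toy_model(context: List[str], blocked: Set[str]) -> str:
    """Return the best-ranked candidate that is not blocked at this position."""
    last = context[-1] if context else "<s>"
    for candidate in RANKED[last]:
        if candidate not in blocked:
            return candidate
    return "<eos>"  # nothing left to say


def generate_with_bans(
    next_token: Callable[[List[str], Set[str]], str],
    banned_phrases: List[List[str]],
    max_tokens: int,
) -> List[str]:
    output: List[str] = []
    blocked: Dict[int, Set[str]] = {}  # output position -> tokens banned there
    while len(output) < max_tokens:
        token = next_token(output, blocked.get(len(output), set()))
        if token == "<eos>":
            break
        output.append(token)
        for phrase in banned_phrases:
            if phrase and output[-len(phrase):] == phrase:
                start = len(output) - len(phrase)
                del output[start:]  # rewind to just before the match
                # Blocks recorded for later positions no longer apply.
                blocked = {p: t for p, t in blocked.items() if p <= start}
                # Stand-in for the heavy negative logit bias on the first token.
                blocked.setdefault(start, set()).add(phrase[0])
                break
    return output


result = generate_with_bans(
    toy_model, [["barely", "above", "a", "whisper"]], max_tokens=16
)
print(" ".join(result))  # her voice was quiet .
```

The block is scoped to the position where the match started, so the same token stays available everywhere else in the output.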
View on Reddit #85046469

Evening_Ad6637@reddit

And why not simply use GBNF?
View on Reddit #85050863

aeqri@reddit

Can you explain how you'd make it work? Correct me if I'm wrong, but isn't the goal of GBNF to force a format/grammar, whereas in this case we want the model to not output a specific sequence of tokens? How would you, for example, not allow the model to generate the phrase "barely above a whisper" using GBNF?
View on Reddit #85051534

n00b001@reddit

I know nothing of GBNF. However... JSON output is similar to this, no? Permit only certain token generation through logit bias, one token at a time, even if the grammar is multi-token. Pydantic structured output can have quite complex constraints.
View on Reddit #85056198

aeqri@reddit

Phrase banning isn't something you'd use if you wanted the model to generate valid JSON or other structured formats. This is more for preventing certain phrases in natural language. Think AI slop like "You're absolutely right!" or phrases leading to refusals, such as "As an AI model".

Fundamentally, constrained decoding acts as a whitelist: "You're only allowed to pick from these 3 tokens because they're the ones that satisfy the schema." Phrase/string banning is more like a blacklist. The goal is to let the model be free and creative with 99.9% of its vocab, but trigger a restriction only when a specific sequence is generated.

The problem with using a whitelist/grammar approach is that it has no way to look ahead. To prevent "barely above a whisper", you can't just ban the word "barely", because it'd also affect other valid phrases. You also can't just ban "above" if the previous token was "barely", because, once again, what if it was just generating "barely above average"?
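
The over-blocking problem with single-token bans can be shown with a small sketch. It assumes whitespace-split words stand in for real tokens, and `token_ban_blocks` and `phrase_ban_blocks` are hypothetical helper names, not part of any library.

```python
from typing import List

BANNED_PHRASE: List[str] = "barely above a whisper".split()


def token_ban_blocks(text: str, banned_token: str) -> bool:
    """A single-token ban fires whenever the token appears at all."""
    return banned_token in text.split()


def phrase_ban_blocks(text: str, phrase: List[str]) -> bool:
    """A phrase ban fires only when the whole sequence appears in order."""
    words = text.split()
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


harmless = "the score was barely above average"
slop = "her voice was barely above a whisper"

print(token_ban_blocks(harmless, "barely"))        # True: over-blocks
print(phrase_ban_blocks(harmless, BANNED_PHRASE))  # False: left alone
print(phrase_ban_blocks(slop, BANNED_PHRASE))      # True: caught
```

The phrase check can only fire once the last word has been produced, which is why the approach in this thread has to rewind rather than mask ahead of time.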
View on Reddit #85096843

n00b001@reddit

LLMs use a finite token set, right? So I guess a whitelist and a blacklist are the same thing, really ;) I.e.: LLM, you're allowed to use one of these 5 tokens (whitelist), or you're allowed one of these 9995 tokens (blacklist), or one of these 10000 tokens (no filter).
View on Reddit #85106649

DeProgrammer99@reddit

I really like how you set it up as a proxy. That seems like a much more usable approach than trying to chase (or straight-up reimplement) all the new llama.cpp features with a wrapper library like LlamaSharp (which doesn't even wrap `common`, i.e., even speculative decoding needs to be reimplemented by anything that uses the wrapper).
View on Reddit #85057850

iamapizza@reddit

If I've understood it correctly, this repo sets up a proxy URL in front of the llama web server and watches for the banned words. https://github.com/BigStationW/llama-cpp-phrase-ban/blob/main/ban_phrases.py But I'm not technical enough to say how the rewind works. Does it actually make the LLM go back a few words?
View on Reddit #85046167

EncampedMars801@reddit

Yeah, I mean that part's obvious. I'm more interested in how they implemented the actual phrase banning.
View on Reddit #85046291

CommonPurpose1969@reddit

I love the idea! However, there is an issue with the proxy. When a token is banned, it seems to remain banned until it is replaced or the generation is complete. This means that if a banned phrase starts with 'You' and is detected, you won't see that token again.
View on Reddit #85080003

aeqri@reddit

The proper way to do it would be to only generate a single token with the logit bias first, then continue generating without it afterwards. Wouldn't be that big of a change - just an extra step.
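
A rough sketch of that extra step, under stated assumptions: `LOGITS`, `sample`, and `resume_after_rewind` are hypothetical names, the "model" is a table of toy logits keyed by the last token, and sampling is greedy. It is not taken from the proxy script.

```python
from typing import Dict, List

# Hypothetical toy logits: last token -> {candidate: logit}.
LOGITS: Dict[str, Dict[str, float]] = {
    "<s>": {"You": 2.0, "Sure": 1.0},
    "Sure": {",": 1.0},
    ",": {"You": 2.0, "we": 1.0},
    "You": {"can": 1.0},
    "can": {"try": 1.0},
    "try": {"<eos>": 1.0},
}


def sample(context: List[str], bias: Dict[str, float]) -> str:
    """Greedy pick after adding the per-token bias to the raw logits."""
    scores = {
        tok: logit + bias.get(tok, 0.0)
        for tok, logit in LOGITS[context[-1]].items()
    }
    return max(scores, key=scores.get)


def resume_after_rewind(
    context: List[str], banned_token: str, max_new: int
) -> List[str]:
    out = list(context)
    # Step 1: exactly one token is generated with the heavy bias applied.
    out.append(sample(out, {banned_token: -999.0}))
    # Step 2: every later token is generated with no bias at all.
    for _ in range(max_new - 1):
        out.append(sample(out, {}))
    return out


resumed = resume_after_rewind(["<s>"], banned_token="You", max_new=4)
print(resumed)  # ['<s>', 'Sure', ',', 'You', 'can']
```

"You" is suppressed only at the rewind point and shows up again two tokens later, which is the behavior the sticky ban prevents.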
View on Reddit #85098010

CommonPurpose1969@reddit

I agree. It should probably be written in Rust, too, to ensure that it can process the incoming tokens quickly enough. This is because the proxy plays catch-up and, by the time it stops, tokens have been generated past the banned phrase and eventually thrown away together with the banned phrase tokens. If anyone is interested: [https://github.com/ggml-org/llama.cpp/discussions/9699](https://github.com/ggml-org/llama.cpp/discussions/9699)
View on Reddit #85099169

i_am__not_a_robot@reddit

True **constrained decoding** is superior to this, if your goal is a hard guarantee that banned phrases cannot be produced.
View on Reddit #85054646

CommonPurpose1969@reddit

Constrained decoding as in GBNF? Using GBNF to pull off that kind of functionality will slow down llama-server to the point of being unusable.
View on Reddit #85080288

droptableadventures@reddit

They're doing the exact same thing. "rewind" is not entirely accurate here, it's beginning inference again from the previous point.
View on Reddit #85055693

i_am__not_a_robot@reddit

From my brief look at this, it looks like when a banned phrase appears, the buffered tokens are rewound and generation restarts with an added `logit_bias` against the triggering token. True "constrained decoding" would filter or mask invalid next tokens **before** sampling.
View on Reddit #85055966

droptableadventures@reddit

> True "constrained decoding" would filter or mask invalid next tokens before sampling.

There's no difference between that and setting the sampler bias for that token to -infinity.
View on Reddit #85058423
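
The single-step equivalence claimed in the comment above can be checked numerically. This is a small sketch with made-up logits; `softmax` is written out by hand rather than taken from any library, and it only covers one sampling step, not the question of when the ban is detected.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up logits for three candidate tokens

# Masking: drop token 1 from the candidate set before sampling.
masked = softmax([logits[0], logits[2]])

# Biasing: keep token 1 but add -infinity to its logit.
biased = softmax([logits[0], logits[1] + float("-inf"), logits[2]])

print(biased[1])  # 0.0
print(math.isclose(masked[0], biased[0]), math.isclose(masked[1], biased[2]))
```

With these particular values a finite bias of `-999` also underflows to a probability of exactly 0.0 in double precision, so it behaves the same way here.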

Chromix_@reddit

There was already an "anti slop sampler" in 2024 [here](https://www.reddit.com/r/LocalLLaMA/comments/1fqqez5/i_made_a_configurable_antislop_sampler_which/). Support for OpenAI API [was added](https://www.reddit.com/r/LocalLLaMA/comments/1fyr1ch/antislop_sampler_gets_an_openaicompatible_api_try/) a bit later. It still seems to be under semi-active development. The last PR [was merged 2 months ago](https://github.com/sam-paech/antislop-sampler/commits/main/). Just for completeness: There's also the [XTC sampler](https://www.reddit.com/r/LocalLLaMA/comments/1fv5kos/say_goodbye_to_gptisms_and_slop_xtc_sampler_for/). It doesn't ban phrases, but leads to more diverse results in general and could be used together with phrase-banning.
View on Reddit #85071766

jungle@reddit

Uhm... 4 is 102 in binary??? 0x08 in hexa??? I would add "binary" and "hexadecimal" to the banned words. And any math-related words and symbols for good measure.
View on Reddit #85046590

willrshansen@reddit

The forbidden '2'!
View on Reddit #85061887

Due-Function-4877@reddit

Yes. I was led to believe there would be no math. Also ban people that recite their child's age in weeks after 12 weeks or in months after one year.
View on Reddit #85056450

Total-Resort-3120@reddit (OP)

True true 😂
View on Reddit #85046792

a_beautiful_rhind@reddit

This is built into ik_llama currently along with regex banning. Haven't tried the latter part yet but I assume it's for all the eye glinting and whatever.
View on Reddit #85046347

henk717@reddit

KoboldCpp as well. For us, phrase banning is combined with token banning: if what you wish to ban is a single token, we just bias against that token; if it's not a single token, we use the phrase banning approach. Keeps it very simple and efficient.
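
The split described above might look something like this. It is only an illustration, not KoboldCpp's code: `VOCAB`, `tokenize`, and `split_bans` are hypothetical, the tokenizer is a word lookup, and the `-999.0` bias value is borrowed from the script discussed in this thread.

```python
from typing import Dict, List, Tuple

# Hypothetical word-level vocabulary standing in for a real tokenizer.
VOCAB: Dict[str, int] = {"whisper": 101, "barely": 102, "above": 103, "a": 104}


def tokenize(text: str) -> List[int]:
    return [VOCAB[word] for word in text.split()]


def split_bans(bans: List[str]) -> Tuple[Dict[int, float], List[List[int]]]:
    """Route single-token bans to a logit bias, everything else to phrase bans."""
    token_bias: Dict[int, float] = {}
    phrase_bans: List[List[int]] = []
    for ban in bans:
        ids = tokenize(ban)
        if len(ids) == 1:
            token_bias[ids[0]] = -999.0  # cheap: never sampled, no rewind needed
        else:
            phrase_bans.append(ids)  # needs match-and-rewind handling
    return token_bias, phrase_bans


token_bias, phrase_bans = split_bans(["whisper", "barely above a whisper"])
print(token_bias)   # {101: -999.0}
print(phrase_bans)  # [[102, 103, 104, 101]]
```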
View on Reddit #85055814

henk717@reddit

If you'd like something that has native phrase banning, you can also use KoboldCpp; for us it's built in.
View on Reddit #85055784

HornyGooner4402@reddit

For some reason I read this as "Ban these phrases on llama.cpp" and I was confused about why you'd ban "the result" and the number "4".
View on Reddit #85053048

xeeff@reddit

remindme! 3d
View on Reddit #85046217

RemindMeBot@reddit

I will be messaging you in 3 days on [**2026-05-05 21:38:24 UTC**](http://www.wolframalpha.com/input/?i=2026-05-05%2021:38:24%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1t227hk/ban_phrases_on_llamacpp_with_this_script/ojkqxlh/?context=3) [**CLICK THIS LINK**](https://www.reddit.com/message/compose/?to=RemindMeBot&subject=Reminder&message=%5Bhttps%3A%2F%2Fwww.reddit.com%2Fr%2FLocalLLaMA%2Fcomments%2F1t227hk%2Fban_phrases_on_llamacpp_with_this_script%2Fojkqxlh%2F%5D%0A%0ARemindMe%21%202026-05-05%2021%3A38%3A24%20UTC) to send a PM to also be reminded and to reduce spam. ^(Parent commenter can ) [^(delete this message to hide from others.)](https://www.reddit.com/message/compose/?to=RemindMeBot&subject=Delete%20Comment&message=Delete%21%201t227hk) ***** |[^(Info)](https://www.reddit.com/r/RemindMeBot/comments/e1bko7/remindmebot_info_v21/)|[^(Custom)](https://www.reddit.com/message/compose/?to=RemindMeBot&subject=Reminder&message=%5BLink%20or%20message%20inside%20square%20brackets%5D%0A%0ARemindMe%21%20Time%20period%20here)|[^(Your Reminders)](https://www.reddit.com/message/compose/?to=RemindMeBot&subject=List%20Of%20Reminders&message=MyReminders%21)|[^(Feedback)](https://www.reddit.com/message/compose/?to=Watchful1&subject=RemindMeBot%20Feedback)| |-|-|-|-|
View on Reddit #85046261

MoneyPowerNexis@reddit

remindme! 5y Hello to future me, I hope the AI hardware bubble has popped. If it hasn't, remember you can always go outside. Maybe visit your friend Brad and pet his dogs or something, and consider if all the attention you paid to this was worth it.
View on Reddit #85047900

Hialgo@reddit

Nice!
View on Reddit #85045744