Curb Your Inference: AICI for rewriting context in real time, constrained generation, backtracking KV-cache

Posted by tucnak@reddit | LocalLLaMA | View on Reddit | 15 comments

[-]

tucnak@reddit (OP)

You'd been longing for o1-like ecstasy with open source tooling? Grown disillusioned with the benchmark rat-race championed by OpenAI and the Chinese? Maybe you heard about GBNF but never really worked out what it meant, & it didn't apply to you because you never used llama.cpp in the first place? Here's your chance!

AICI (Microsoft) is a very special kind of tool; it allows you to write "controllers"—programs that will allow you to take control of the inference-time by dynamically re-writing the contents of the context window, go back and forth using a backtracking KV-cache. For example, you could use it to parse a function call, execute it, go back in time, and simply rewrite the call with the result, and carry on with the generation in a single pass. You choose to sample continuations however you please. You might even use a reward model to rank individual reasoning steps, and use it as a primitive to build search, or whatever. Your friends will call it reinforcement learning. Don't tell anyone that I'd told you about this; it's a closely-guarded secret, and I would be in big trouble if they knew it was me who gave it away!

AICI is language-agnostic; the controllers are Webassembly programs.

[-]

ResidentPositive4122@reddit

Fast: Wasm modules are compiled to native code and run in parallel with the LLM inference engine, inducing only a minimal overhead to the generation process

How do the modules get to talk with the inference engine and what inference engines are supported? Are logits enough (i.e. compatible w/ vLLM?)

[-]

tucnak@reddit (OP)

They like rLLM, and it's how they support both Huggingface and llama.cpp are supported. The thing is not half-bad, the code is not bad which is surprising because, duh, you would expect shitty code from AI researchers.

GBNF is released, and 15 months since we're barely touching the surface!

[-]

hyperdynesystems@reddit

Is there a good reason to use this over something like LMQL (just happened to be the prompt-constraint method that I could actually get working reliably when I tried out all of them, personally)?

I do wish LMQL was getting regular updates though.

[-]

tucnak@reddit (OP)

Depends on how far you're willing to get the capability you want!

I'm not sure whether LMQL supports "time travel", i.e. backtracking KV cache but I suspect that if you really wanted to, you could rig it with their Python API anyway. The tool you will end up using is not as important as training yourself to think in clauses, and seeing how clauses can be used to build higher-level capabilities, like function-calling, chain-of-thought reasoning, or reward-based search.

[-]

hyperdynesystems@reddit

I'm not sure LMQL supports backtracking by default but I know it supports different decoder schemes where you specify the number of samples and get back different results for each sequence.

[-]

ResidentPositive4122@reddit

Having looked at the repo briefly, this looks like more of a research thing and not an actively developed platform intended for production. Kinda feels like Guidance - started at MS, the maintainer left, took the project with them, but they simply don't have the resources to keep up with such a fast evolving ecosystem.

[-]

tucnak@reddit (OP)

I mean, you're not wrong... although I wouldn't put it in the same bucket as Guidance. AICI is the real deal. I wouldn't be surprised if o1 works similarly under the hood... The key idea, of course, is that you don't actually want to be making arbitrary decisions during inference-time (the basis of most LLM sampling strategies: it's arbitrary.) I like the networking analogy: logits are L1. Let's say that vanilla OpenAI completion API is like L3, then AICI is basically a L2 solution. It deals with logits still, but because you can think of it like a glorified grammar-generator, you get to reason about logits as lexemes which has a few nice properties. Most importantly, it's easier (and better for performance) to implement "L4" features like function-calling, or reward-based search, if you have code at L2 in the first place. (Function-calling is a function of grammar.) To that end, L1 is far too low, and L3 is too high. Food for thought. Skilled coders will be able to replicate some of the ideas from AICI in the environment of their liking!

[-]

Enough-Meringue4745@reddit

Interesting. I’m going to have to look into this further

[-]

next-choken@reddit

Is "rewriting context in real time" actually worth anything in a world where most decent llm inference engines support automatic prefix caching with paged attention?

[-]

tucnak@reddit (OP)

Yes, because this occurs at logrobs with low-level K/V cache operations. You don't want "automatic prefix caching" is a trivial controlller.

[-]

next-choken@reddit

What does that get you in practice though? Seems like with current solutions you can already easily backtrack to any point in the prefix/context if it's automatically cached

[-]

tucnak@reddit (OP)

You get to implement something like o1 basically, and preserve all the information that otherwise were lost. Contingent on the low-level ops. Because you control K/V cache, batching is also crazy good. You can time travel to multiple positions and continuations from there at the same time.

[-]

next-choken@reddit

Can't you do that with paged attention and automatic prefix caching as well though? I guess with paged attention you cache "blocks" and with this "time travel" thing maybe you can surgically dice up the cache at the token level. Is that right? Can you comment on how much more efficiency this actually gets you and in what situations this gain is most noticeable?

[-]

tucnak@reddit (OP)

I would say paged attn in particular is orthogonal to the grammar layer, i.e. even though the industry hasn't congerved on some shared vision of K/V cache yet, I think it's a super-cool project/exercise to try and make a paged attn implementation at token resolution. I can smell copy-on-write when it's cooking! :-) Now, as far as automatic prefix caching is concerned: the implementations like vLLM concern with them at a higher (completion) level. To borrow OSI topology from networking:

Logit probablility distribution is L1 (i.e. Physical layer)
Grammar is L2 (i.e. Data link layer)
(Chat) completion API is L3 (i.e. Network layer)

End-of-text tokens, max_tokens limit setting, the "stop" sequences—you can think of these as implicit grammar. Basically, in the base case you have any token close, and the stop sequences are the stop clause. Then, your grammar is either anything repeated at most max_tokens times, or terminated with stop clause. So the key takeaway is as follows: Layer 2 is always there, even if you don't think about it like that. The grammar itself doesn't have to be static. If you wanted to implement something like o1, your controller would have to do three things: maintain some DAG-like internal representation of possible continuations, every branching point is cached, and use a reward model to assign scores individually. This enables you to search the token space by making multiple passes from the same position, possibly with slightly different grammars, but also search the embedding space (RAG) as well as merging/truncating divergent paths, and to the "user" it would seem like a single completion request. You would want to somehow preserve the internal state on completion so that you could use information from dead-end paths to make better predictions subsequently. For example, by fine-tuning the model to perform the problematic, few-shot function calls in 0.

Therefore, it's not a matter of gaining efficiency, or going through tokens faster (although that would be important, too, and there's totally lots of cool optimisations left on the table) rather it's about bringing completely new capabilities, & ultimately having tighter control of the inference time.

I'm sure o1 uses similar techniques under the hood + scheduling.