Compile English function descriptions into 22MB neural programs that run locally via llama.cpp
Posted by yuntiandeng@reddit | LocalLLaMA | View on Reddit | 20 comments
We built a system where a neural compiler takes a plain-English function description and produces a "neural program" (a combination of a continuous LoRA adapter and a discrete pseudo-program). At inference time, these adapt a fixed interpreter to perform the specified task. This is well suited to "fuzzy functions": functions that are easy to describe in language but painful to implement with rigid rules, such as classifying the urgency of a message, counting the verbs in a sentence, or anything I'd otherwise reach for regular expressions for (always painful for me).
The key idea: the interpreter's weights (Qwen3 0.6B or GPT-2 124M) are never modified. All task-specific behavior comes from the compiled program. The compiler itself is a 4B LM that generates the adapter weights and pseudo-program from the spec, trained end-to-end on a dataset of 10 million (English description, function input, function output) examples synthesized by gpt-5.2.
Inference runs entirely locally through llama-cpp-python. The base model is shared, and the "neural programs" are LoRA adapters that we can easily swap at runtime. The Qwen3 0.6B interpreter is a ~594 MB base model (GGUF Q6_K), and each compiled program (GGUF Q4_0) adds ~22 MB. Runs pretty fast on my Mac Mini.
We also trained a compiler to adapt a GPT-2 124M interpreter that runs in the browser via WebAssembly with wllama (~134 MB Q8_0 base + ~5 MB per Q4_0 program). Interestingly, even a model as old as GPT-2 can deliver decent performance.
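For intuition on where the ~22 MB per program comes from, here's a back-of-envelope LoRA size estimate. The rank, layer count, hidden size, and per-matrix shapes below are illustrative guesses, not numbers from the paper:

```python
# Back-of-envelope LoRA adapter size. All numbers below are
# illustrative assumptions, not values reported by the PAW authors.
n_layers = 28         # transformer blocks (assumed for Qwen3 0.6B)
d_model = 1024        # hidden size (assumed)
n_matrices = 7        # q, k, v, o, gate, up, down -- treated as d x d for simplicity
rank = 64             # assumed LoRA rank

# Each adapted matrix gets two low-rank factors: A (d x r) and B (r x d).
params = n_layers * n_matrices * 2 * rank * d_model

bits_per_param = 4.5  # GGUF Q4_0 stores roughly 4.5 bits per weight
size_mb = params * bits_per_param / 8 / 1024**2
print(f"{params / 1e6:.1f}M params -> {size_mb:.0f} MB")
```

With these guesses you land at roughly 26M adapter parameters and ~14 MB, the same order of magnitude as the reported ~22 MB; the real rank and target matrices would shift the exact figure.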
Results on FuzzyBench show that the adapted 0.6B interpreter is on par with prompting a 32B model (at the cost that each new task requires a new compilation):
- PAW + Qwen3 0.6B interpreter: 73.4%
- Qwen3 0.6B prompting: 9.8%
- Qwen3 32B prompting: 68.7%
You can easily use it:

```
pip install programasweights
```

```python
import programasweights as paw

f = paw.compile_and_load("Classify if this is urgent or not.")
f("Need your signature by EOD")  # "urgent"
```
Craftkorb@reddit
It's not usable fully locally, why is this here?
yuntiandeng@reddit (OP)
Fair pushback. To clarify: the compiled function runs fully locally. Once you've downloaded a .paw, you can run it offline as a normal Python function with llama.cpp, no internet needed. The compiler itself is in the cloud, since it's a 4B model that generates the LoRA + pseudo-program for you. I don't expect most users to have a GPU, and a one-time cloud compile to build these functions is what I believe to be the future (LLMs growing from problem solvers into tool builders). The compiler details (training, architecture) are in a paper currently under review at ICML. We'll release the camera-ready if accepted.
One nice property: even if our compiler service goes down, every program that was already compiled keeps working forever, since it's just a local LoRA + pseudo-program. So users aren't tied to our infra in any long-term way.
DeltaSqueezer@reddit
We have GPUs. Will you share the compiler, or will you keep it secret for commercial reasons?
thrownawaymane@reddit
Data collection
Notice that OP didn't come back after the tide turned
yuntiandeng@reddit (OP)
Just busy with the day job. I'm a CS professor with too many responsibilities (see https://yuntiandeng.com/). Came back to share a follow-up: we built a small game on top of PAW that turned out very fun as a demonstration of what we can do with PAW: https://programasweights.com/alien
Silver-Champion-4846@reddit
What about something more complex and harder than "classify the urgency of the message", such as Arabic diacritization?
yuntiandeng@reddit (OP)
I think it works on much broader tasks than classifying urgency. For example, our website itself uses PAW functions in multiple places: the helper (lower-right corner of the website) and the search in the program hub. I even built a game last night using a single PAW function that works amazingly well: https://programasweights.com/alien
That said, unfortunately it only works for English right now, since only English is included in our training data (paper under review at ICML now)...
Silver-Champion-4846@reddit
Hope someone successfully ports it to multilinguality.
Imaginary-Unit-3267@reddit
How the hell do you "generate adapter weights" with a language model? That's called finetuning, and I'm pretty sure language models don't finetune each other. Unless I'm dumb and just totally missing something.
askchris@reddit
I had the same question, so after digging a bit, I realized they have a private 4B model (on their end) trained to instantly generate the 22MB LoRA weights for a smaller model that you host locally (i.e. Qwen3 0.6B or GPT-2).
If you request a function via their API, your small local model + LoRA can perform that function at about the quality of a 32B model.
So their key mechanism is the 4B model, but it's private.
I guess it's more like a fast LoRA weights generator, skipping the typical LoRA training step.
yuntiandeng@reddit (OP)
exactly!
Confident_Ideal_5385@reddit
Apparently doc-to-lora works this way and is open source. Not sure it would generalise from document parsing to a "compiler" without training a whole new set of weights, but it seems like the same basic approach.
yuntiandeng@reddit (OP)
This isn't per-task finetuning. We finetuned Qwen3 4B once so that, given a spec, it outputs LoRA weights in a single forward pass, in the same way as text-to-LoRA / hypernetworks.
The training objective is end-to-end: the loss pushes the compiler toward LoRAs that, when applied, make the interpreter do the right thing on examples for the spec.
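To make the hypernetwork idea concrete, here's a toy sketch in plain Python. This is not the PAW architecture: the "compiler" here is just a fixed random linear map (in PAW it's a trained 4B LM), and all dimensions are illustrative. It only shows the mechanism: a frozen interpreter matrix, and adapter factors produced in one forward pass from a spec embedding.

```python
import random

random.seed(0)
d, r = 8, 2  # toy hidden size and LoRA rank (illustrative only)

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Frozen "interpreter" weights: never modified per task.
W = [[random.gauss(0, 0.3) for _ in range(d)] for _ in range(d)]

# "Compiler": a fixed linear map from a spec embedding to the flattened
# LoRA factors A (d x r) and B (r x d). Random weights, for illustration.
C = [[random.gauss(0, 0.3) for _ in range(d)] for _ in range(2 * d * r)]

def compile_spec(spec_vec):
    flat = matvec(C, spec_vec)  # one forward pass -> adapter weights
    A = [flat[i * r:(i + 1) * r] for i in range(d)]
    B = [flat[d * r + i * d:d * r + (i + 1) * d] for i in range(r)]
    return A, B

def run(A, B, x):
    # Interpreter output with the adapter applied: (W + A @ B) x
    Bx = matvec(B, x)        # r-vector
    delta = matvec(A, Bx)    # A (B x), a d-vector
    return [w + dlt for w, dlt in zip(matvec(W, x), delta)]

spec1 = [1.0] + [0.0] * (d - 1)          # stand-ins for two different specs
spec2 = [0.0, 1.0] + [0.0] * (d - 2)
x = [1.0] * d
y1 = run(*compile_spec(spec1), x)
y2 = run(*compile_spec(spec2), x)
print(y1 != y2)  # different specs -> different behavior, same frozen W
```

The end-to-end training described above would then backprop a task loss through the frozen interpreter into the compiler's weights (here `C`), so that the generated `A`, `B` make the interpreter produce correct outputs for each spec's examples.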
Details in a paper under ICML review, happy to post the camera ready version here when it's out.
Chromix_@reddit
For new use cases like "classify this as urgent", one seems to be at the mercy of the server building and providing the new LoRA, since the "compiler" isn't included in the code on GitHub. So it's not a fully local solution, even though existing "programs" can be used locally.
yuntiandeng@reddit (OP)
Just to clarify, it is local at inference time but not local at compile time (but compile is a one-time thing per function). The use case I imagine is for most users without GPUs to use cloud LLMs as tool builders to build tools to solve their specialized tasks. For power users I'd say they can either deploy bigger LLMs locally, or even collect and finetune their own LoRA weights.
Compiler and training details are in a paper currently under review at ICML, we'll release a camera ready version if it's accepted.
thrownawaymane@reddit
Link to the code for the compiler?
Cool-Chemical-5629@reddit
Some time ago, I was trying to mod a text-heavy game by adding TTS. The game itself already had TTS, but it was very basic and my idea was to use better neural TTS voices and give each character their own voice.
Ironically, the actual implementation of the TTS with unique voices for individual characters turned out to be the EASY part. Unfortunately, I ran into an unexpected issue: parsing the actual lines the characters should read.
I was able to receive individual in-game texts containing speakers' lines, but these come with many informational texts unrelated to the actual speaker's line. The added difficulty is that when you look at a line as a human, you automatically "feel" what the speaker's line is, but there is no easy and reliable way to parse it programmatically 100% of the time, because there are no cues / hints in the string: sometimes it's just the speaker's line, other times it contains something extra.
That's when I got the idea - what if I could use an LLM to just detect the speaker's line and then pass only that part of the string to the TTS?
Well, first of all, using any kind of LLM, even the smallest one felt like adding extra overhead and generally it felt like trying to kill a fly with a nuclear bomb - simply overkill.
My question in this regard is: Is this something that would let me do this kind of LLM heavy lifting without deploying the heavy weaponry of full-size LLMs?
riffraff98@reddit
small language models like functiongemma and such would be perfect for this, but they need to be fine-tuned and trained, with examples in their context. they're not smart enough to generalize.
Use a big llm to generate a corpus of 1000 cleaned transcripts. Use that as fine-tuning data for functiongemma, falcon, or some other smol boi.
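A minimal sketch of that data-prep step: use a big LLM offline to produce cleaned (raw text, spoken line) pairs, then dump them as JSONL for fine-tuning. The example strings and the `input`/`output` field names are generic assumptions, not any specific trainer's schema:

```python
import json

# Hypothetical examples: raw in-game strings paired with the cleaned
# spoken line you'd want the small model to learn to extract.
pairs = [
    {"input": "[AMBIENT: tavern] GUARD_03: Halt! No entry after dusk. {flag=quest_2}",
     "output": "Halt! No entry after dusk."},
    {"input": "SYSTEM: autosave ok | MIRA (whisper): They can't know I was here.",
     "output": "They can't know I was here."},
]

# One JSON object per line -- a common fine-tuning corpus format.
with open("tts_lines.jsonl", "w", encoding="utf-8") as fh:
    for ex in pairs:
        fh.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Reload to sanity-check that the corpus round-trips.
with open("tts_lines.jsonl", encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh]
print(len(loaded))  # 2
```

Scale the `pairs` list to ~1000 LLM-generated examples as suggested, then point your fine-tuning framework of choice at the JSONL file.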
Usual-Box-8256@reddit
can you use it in Cursor or Claude Code?
yuntiandeng@reddit (OP)
Yes! Just use this prompt, and the agent can figure out the rest itself:
I want to use ProgramAsWeights (PAW) to create fuzzy text functions (search, classification, format repair, etc.) that run locally. Read the instructions at https://programasweights.com/AGENTS.md and help me integrate it into my project. [Describe your use case briefly.]