Replaced an LLM's text generation head with one that emits raw machine opcodes. Here are my findings

Posted by ilbert_luca@reddit | LocalLLaMA

Follow-up to my previous post about why AI agents should not control machines through text.

The idea: every AI agent today generates human text, parses it, then executes it. That's like controlling a robot arm by dictating English. Tesla FSD replaced that pattern. Cameras go in, steering commands come out, no text in between. Can we do the same for software? Skip the text, emit machine instructions directly?

I took a frozen Qwen 1.5B, ripped off the decoder head, and replaced it with a small cross-attention head (38M params) that emits raw CHIP-8 opcodes. The instruction tokens are queries, the machine state (display, registers, previous opcode) is keys/values. The LLM never generates a token. It just encodes the instruction once, then the head reads the machine and emits opcodes directly.
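To make the wiring concrete, here's a minimal sketch of what such a cross-attention opcode head could look like in PyTorch. All names and sizes here are illustrative guesses, not the repo's actual code (I'm assuming Qwen 1.5B's hidden dim of 1536 and a flat 16-bit opcode vocabulary):

```python
import torch
import torch.nn as nn

class OpcodeHead(nn.Module):
    """Hypothetical sketch: instruction hidden states are queries,
    encoded machine state (display, registers, prev opcode) is keys/values.
    Dimensions are assumptions, not the actual 38M-param head."""
    def __init__(self, llm_dim=1536, state_dim=256, n_opcodes=0x10000):
        super().__init__()
        self.q_proj = nn.Linear(llm_dim, state_dim)
        self.attn = nn.MultiheadAttention(state_dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(state_dim, n_opcodes)  # one logit per 16-bit opcode

    def forward(self, instr_hidden, machine_state):
        # instr_hidden:  (B, T_instr, llm_dim), from the frozen LLM, encoded once
        # machine_state: (B, T_state, state_dim), re-encoded every step
        q = self.q_proj(instr_hidden)
        ctx, _ = self.attn(q, machine_state, machine_state)
        return self.out(ctx[:, -1])  # logits for the next opcode

head = OpcodeHead()
logits = head(torch.randn(1, 5, 1536), torch.randn(1, 10, 256))
print(logits.shape)  # one logit per possible opcode
```

The key property: the LLM forward pass happens once per instruction, while the head re-attends to fresh machine state every emitted opcode.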

Demo video attached. Some highlights:

Every opcode executes on a real CHIP-8 emulator. No text, no parsing, no translation layer. The model emits loops, conditionals, subroutines, timer waits, and arithmetic. 1-3ms per opcode.
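For readers unfamiliar with CHIP-8: the emitted instructions are 16-bit opcodes over 16 byte-registers (V0-VF). A minimal sketch of a step function for three opcode families shows what "executes directly" means (a real emulator handles all 35 opcodes plus display, stack, and timers):

```python
def step(opcode, V, pc):
    """Decode and execute one CHIP-8 opcode (subset for illustration)."""
    op = opcode >> 12          # top nibble selects the opcode family
    x = (opcode >> 8) & 0xF    # register index
    nn = opcode & 0xFF         # 8-bit immediate
    if op == 0x6:              # 6XNN: VX = NN
        V[x] = nn
    elif op == 0x7:            # 7XNN: VX = (VX + NN) mod 256
        V[x] = (V[x] + nn) & 0xFF
    elif op == 0x1:            # 1NNN: jump to address NNN
        return V, opcode & 0xFFF
    return V, pc + 2           # default: advance to next instruction

# "Two plus three" as two opcodes: load 2 into V0, then add 3.
V, pc = step(0x6002, [0] * 16, 0x200)
V, pc = step(0x7003, V, pc)
print(V[0], hex(pc))  # V0 holds 5
```

This is also why the failure below is about operands, not structure: the head has to get the `NN` nibbles right, not just pick the right opcode family.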

The interesting failure: "two plus three" breaks. The model produces an arithmetic program but with wrong operands. Turns out the frozen LLM's hidden states for "two" and "2" are nearly orthogonal (cosine sim 0.09) in arithmetic context. The LLM could bridge that gap, but only through token generation, which I removed. Understanding lives in the hidden states. Computation lives in the decoding. Remove the decoder, lose the computation.
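For reference, this is how a cosine similarity like the 0.09 above is measured between two hidden-state vectors. The vectors here are random synthetic stand-ins (not real Qwen activations), which also illustrates that near-zero cosine is the default for unrelated directions in high dimensions:

```python
import numpy as np

def cosine_sim(a, b):
    # cos(a, b) = a.b / (|a| |b|)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
h_two = rng.standard_normal(1536)  # stand-in for the hidden state of "two"
h_2 = rng.standard_normal(1536)    # stand-in for the hidden state of "2"
print(cosine_sim(h_two, h_2))      # near zero: high-dim random vectors are near-orthogonal
```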

It's a CHIP-8 VM and a proof of concept. But the cross-attention pattern (instruction queries machine state) doesn't depend on CHIP-8 specifically.

Experiments repo: https://github.com/ilbertt/reflex