Replaced an LLM's text generation head with one that emits raw machine opcodes. Here are my findings

Posted by ilbert_luca@reddit | LocalLLaMA

Follow-up to my previous post about why AI agents should not control machines through text.

The idea: every AI agent today generates human text, parses it, then executes it. That's like controlling a robot arm by dictating English. Tesla FSD replaced that pattern. Cameras go in, steering commands come out, no text in between. Can we do the same for software? Skip the text, emit machine instructions directly?

I took a frozen Qwen 1.5B, ripped off the decoder head, and replaced it with a small cross-attention head (38M params) that emits raw CHIP-8 opcodes. The instruction tokens are queries, the machine state (display, registers, previous opcode) is keys/values. The LLM never generates a token. It just encodes the instruction once, then the head reads the machine and emits opcodes directly.
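To make the wiring concrete, here's a minimal sketch of what such a cross-attention opcode head could look like in PyTorch. All names and sizes here are illustrative guesses, not the repo's actual code (I'm assuming Qwen 1.5B's hidden dim of 1536 and a flat 16-bit opcode vocabulary):

```python
import torch
import torch.nn as nn

class OpcodeHead(nn.Module):
    """Hypothetical sketch: instruction hidden states are queries,
    encoded machine state (display, registers, prev opcode) is keys/values.
    Dimensions are assumptions, not the actual 38M-param head."""
    def __init__(self, llm_dim=1536, state_dim=256, n_opcodes=0x10000):
        super().__init__()
        self.q_proj = nn.Linear(llm_dim, state_dim)
        self.attn = nn.MultiheadAttention(state_dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(state_dim, n_opcodes)  # one logit per 16-bit opcode

    def forward(self, instr_hidden, machine_state):
        # instr_hidden:  (B, T_instr, llm_dim), from the frozen LLM, encoded once
        # machine_state: (B, T_state, state_dim), re-encoded every step
        q = self.q_proj(instr_hidden)
        ctx, _ = self.attn(q, machine_state, machine_state)
        return self.out(ctx[:, -1])  # logits for the next opcode

head = OpcodeHead()
logits = head(torch.randn(1, 5, 1536), torch.randn(1, 10, 256))
print(logits.shape)  # one logit per possible opcode
```

The key property: the LLM forward pass happens once per instruction, while the head re-attends to fresh machine state every emitted opcode.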

Demo video attached. Some highlights:

Every opcode executes on a real CHIP-8 emulator. No text, no parsing, no translation layer. The model emits loops, conditionals, subroutines, timer waits, and arithmetic. 1-3ms per opcode.
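For readers unfamiliar with CHIP-8: the emitted instructions are 16-bit opcodes over 16 byte-registers (V0-VF). A minimal sketch of a step function for three opcode families shows what "executes directly" means (a real emulator handles all 35 opcodes plus display, stack, and timers):

```python
def step(opcode, V, pc):
    """Decode and execute one CHIP-8 opcode (subset for illustration)."""
    op = opcode >> 12          # top nibble selects the opcode family
    x = (opcode >> 8) & 0xF    # register index
    nn = opcode & 0xFF         # 8-bit immediate
    if op == 0x6:              # 6XNN: VX = NN
        V[x] = nn
    elif op == 0x7:            # 7XNN: VX = (VX + NN) mod 256
        V[x] = (V[x] + nn) & 0xFF
    elif op == 0x1:            # 1NNN: jump to address NNN
        return V, opcode & 0xFFF
    return V, pc + 2           # default: advance to next instruction

# "Two plus three" as two opcodes: load 2 into V0, then add 3.
V, pc = step(0x6002, [0] * 16, 0x200)
V, pc = step(0x7003, V, pc)
print(V[0], hex(pc))  # V0 holds 5
```

This is also why the failure below is about operands, not structure: the head has to get the `NN` nibbles right, not just pick the right opcode family.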

The interesting failure: "two plus three" breaks. The model produces an arithmetic program but with wrong operands. Turns out the frozen LLM's hidden states for "two" and "2" are nearly orthogonal (cosine sim 0.09) in arithmetic context. The LLM could bridge that gap, but only through token generation, which I removed. Understanding lives in the hidden states. Computation lives in the decoding. Remove the decoder, lose the computation.
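For reference, this is how a cosine similarity like the 0.09 above is measured between two hidden-state vectors. The vectors here are random synthetic stand-ins (not real Qwen activations), which also illustrates that near-zero cosine is the default for unrelated directions in high dimensions:

```python
import numpy as np

def cosine_sim(a, b):
    # cos(a, b) = a.b / (|a| |b|)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
h_two = rng.standard_normal(1536)  # stand-in for the hidden state of "two"
h_2 = rng.standard_normal(1536)    # stand-in for the hidden state of "2"
print(cosine_sim(h_two, h_2))      # near zero: high-dim random vectors are near-orthogonal
```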

It's a CHIP-8 VM and a proof of concept. But the cross-attention pattern (instruction queries machine state) doesn't depend on CHIP-8 specifically.

Experiments repo: https://github.com/ilbertt/reflex