I can't stop thinking about this: why are we making AI control machines through human text?

Posted by ilbert_luca@reddit | LocalLLaMA

Every "AI agent" that controls a computer today does this: generate text → parse it → execute → serialize result back to text → repeat. It's like controlling a robot arm by dictating English commands that get translated to motor signals.
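Concretely, the loop looks something like this (a toy sketch; `llm()` is a stand-in for a real model call, and the hardcoded reply is mine):

```python
import subprocess

def llm(prompt: str) -> str:
    # Stand-in for a real model call: returns the next shell command as text.
    return "echo hello"

def agent_step(observation: str) -> str:
    # generate text
    command = llm(f"State: {observation}\nNext command:")
    # parse + execute
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    # serialize the result back to text for the next turn
    return result.stdout

out = agent_step("boot")
```

Every hop in that loop goes through English-shaped strings, even though neither endpoint is human.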

Tesla FSD went through this exact evolution. Pre-v12 FSD had separate modules: detect objects → hand-coded rules → plan path → control. V12 replaced it all with one neural net: cameras → trajectory. No intermediate representations, no hand-coded rules (source). Testers widely agreed the driving felt noticeably smoother and more human-like.

The obvious reason AI agents still use text is that LLMs are trained on text, so they think in text, so they control machines through text. The tool fits the training data, not the problem. But there's no law that says machine control has to go through a human language.

A computer's state is just bytes (/proc, memory, sockets). Its controls are just bytes (syscalls, device writes). Why is there text in between?
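To make that concrete (Linux-only sketch; `/proc/stat` is just one convenient window into kernel state):

```python
# Read the first 576 raw bytes of kernel CPU/interrupt state.
# No parsing: the machine's state arrives as bytes, not English.
with open("/proc/stat", "rb") as f:
    raw = f.read(576)

# Normalize byte values to [0, 1], the way a vision model normalizes pixels.
signal = [b / 255.0 for b in raw]
```

That list of floats is a perfectly good model input; nothing in it needs to be turned into words first.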

I've been playing with a small transformer (2 layers, self-attention over byte embeddings) that reads 576 raw bytes from /proc: no parsing, just byte values normalized to [0,1]. It learns to manage process scheduling and to spot unauthorized access to /etc/passwd, both from the same raw bytes. It's never read a man page. It learned what the bytes mean the same way a vision model learns what pixels mean, by watching them change.
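For flavor, here's roughly what "self-attention over byte embeddings" can look like. This is my illustrative sketch, not the actual model: weights are random and untrained, the dimensions are made up, and I use a learned-style embedding table for byte values where the real thing might feed the normalized floats in directly.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ, DIM = 576, 32  # 576 input bytes, small embedding width

# Embed each possible byte value (0-255); add a position embedding per slot.
byte_embed = rng.normal(scale=0.02, size=(256, DIM))
pos_embed = rng.normal(scale=0.02, size=(SEQ, DIM))

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(DIM)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

Wq, Wk, Wv = (rng.normal(scale=0.02, size=(DIM, DIM)) for _ in range(3))

raw = (bytes(range(256)) * 3)[:SEQ]  # stand-in for 576 bytes read from /proc
x = byte_embed[list(raw)] + pos_embed
for _ in range(2):  # two attention layers, as in the post
    x = x + self_attention(x, Wq, Wk, Wv)  # residual connection
```

A task head on top of `x` (mean-pool, then a linear layer) would give you the scheduling or intrusion-detection outputs.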

It's a toy, obviously. But the question feels real: should the interface between AI and machines be text or raw signals?

When you type, your brain doesn't dictate "press the T key" in English. It fires motor neurons directly. Current AI agents are a brain that dictates to a translator. There should be a more direct path.

Anyone else thinking about this?