I can't stop thinking about this: why are we making AI control machines through human text?
Posted by ilbert_luca@reddit | LocalLLaMA | 24 comments
Every "AI agent" that controls a computer today does this: generate text → parse it → execute → serialize result back to text → repeat. It's like controlling a robot arm by dictating English commands that get translated to motor signals.
Tesla FSD went through this exact evolution. Pre-v12 FSD had separate modules: detect objects → hand-coded rules → plan path → control. V12 replaced it all with one neural net: cameras → trajectory. No intermediate representations, no hand-coded rules (source). Testers widely agreed the driving felt noticeably smoother and more human-like.
The obvious reason AI agents still use text is that LLMs are trained on text, so they think in text, so they control machines through text. The tool fits the training data, not the problem. But there's no law that says machine control has to go through a human language.
A computer's state is just bytes (/proc, memory, sockets). Its controls are just bytes (syscalls, device writes). Why is there text in between?
I've been playing with a small transformer (2 layers, self-attention over byte embeddings) that reads 576 raw bytes from /proc: no parsing, just byte values normalized to [0,1]. It learns to manage process scheduling and to spot unauthorized access to /etc/passwd, both from the same raw bytes. It's never read a man page. It learned what the bytes mean the same way a vision model learns what pixels mean, by watching them change.
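For the input side, here is a minimal sketch of what "no parsing, just byte values normalized to [0,1]" could look like. The /proc path and the zero-padding choice are my assumptions, not code from the post; only the 576-byte window size is taken from it.

```python
def read_proc_bytes(path="/proc/self/stat", n=576):
    """Read up to n raw bytes from a /proc file and normalize each to [0, 1].
    Pads with zeros so the model always sees a fixed-length byte window."""
    try:
        with open(path, "rb") as f:
            raw = f.read(n)
    except OSError:                      # not on Linux: fall back to dummy bytes
        raw = bytes(range(256)) * 3
    window = [b / 255.0 for b in raw[:n]]   # byte value 0..255 -> 0.0..1.0
    return window + [0.0] * (n - len(window))

x = read_proc_bytes()
```

The resulting vector is what the byte embeddings would be computed from; no field names, no parsing, just positions and values.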
It's tiny and toy-scale. But the question feels real: should the interface between AI and machines be text or raw signals?
When you type, your brain doesn't dictate "press the T key" in English. It fires motor neurons directly. Current AI agents are a brain that dictates to a translator. There should be a more direct path.
Anyone else thinking about this?
No-Refrigerator-1672@reddit
I've lost count of how many times I've pointed out that reasoning, too, should be done in latent space, not in output tokens. Quite similar to what you're talking about. The underlying reason why everything is done through text today is training: modern generative transformers require literal trillions of tokens to get good, and getting that much data in a non-human-oriented flavour is extremely hard, so research teams will squeeze everything they can out of the current arrangement until there is literally no option left but to move to latent-space outputs.
jacek2023@reddit
What's the reason you are not training models this way?
ilbert_luca@reddit (OP)
Do you have any suggestion on how I should train a model this way as a proof of concept?
jacek2023@reddit
I was training models before LLMs got popular and I experimented with VAEs a lot, just with tabular data (Kaggle competitions). Karpathy explained how to train GPT-2; I don't understand why you can't try something if you think that's a good way. I am just triggered by "someone else should do this my way".
ilbert_luca@reddit (OP)
FYI I've posted a little demo here: https://www.reddit.com/r/LocalLLaMA/comments/1snbyh8/replaced_an_llms_text_generation_head_with_one/
ilbert_luca@reddit (OP)
Oh, that was your question! Yeah, of course I'm trying to build this using Claude Code. I have a repo that I'll share here soon if it makes sense.
I still haven't figured out what data to collect to train the model on. Syscalls seem interesting but it seems like I need a lot of data to even reach a decent training level to demo something
jacek2023@reddit
BTW the first step should be probably trying with diffusion models (which are already available)
ilbert_luca@reddit (OP)
Interesting. Any suggestion on the model and/or method to start from? I've been experimenting with replacing the LM decoder head with a small cross-attention + MLP that outputs raw opcodes directly. Instruction tokens are Q, machine state is K/V, so the text literally attends to the machine
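A hedged numpy sketch of that head, with random stand-in weights. The dimensions, the single-head layout, and the weight names are my assumptions; in practice Wq/Wk/Wv/W_out would be trained, and the opcode vocabulary would be whatever the target machine executes.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_OPS = 32, 64            # embedding dim, size of the raw-opcode vocabulary

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in weights; a real system would learn these.
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
W_out = rng.standard_normal((D, N_OPS)) * 0.1

def opcode_head(instr_emb, state_emb):
    """Cross-attention: instruction tokens are Q, machine state is K/V.
    instr_emb: (T, D) embeddings of the instruction tokens.
    state_emb: (S, D) embedded raw machine-state bytes.
    Returns (T, N_OPS) logits over raw opcodes -- no text is generated."""
    q, k, v = instr_emb @ Wq, state_emb @ Wk, state_emb @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))   # (T, S): the text attends to the machine
    return (attn @ v) @ W_out              # opcode logits

logits = opcode_head(rng.standard_normal((4, D)), rng.standard_normal((576, D)))
```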
jacek2023@reddit
You basically need AGI or AGI lite.
No-Refrigerator-1672@reddit
So what's the reason you don't read the comments before responding to them?
jacek2023@reddit
Even for a proof of concept?
No-Refrigerator-1672@reddit
A proof of concept still requires on the order of a trillion tokens to train a new arch from scratch, you know? I have neither the data (just like everyone else, which I talked about in my original comment) nor the compute.
ilbert_luca@reddit (OP)
The training data problem seems more solvable than it looks for the machine control side. Text training data took decades of internet to accumulate. But for raw machine state you can just run any Linux box and record /proc, syscalls, memory. Every server, every container, every VM is generating this data right now, it's just not being collected. No human labeling needed either, just (state, action, outcome) triples.
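A minimal sketch of that collection idea, assuming the machine state can be snapshotted by any callable (a /proc read, a memory dump, etc.). The helper names here are hypothetical, not from any existing tool.

```python
import time

def record_triple(read_state, act, log):
    """Wrap any action so each call logs a (state, action, outcome) triple.
    read_state: callable returning a raw byte snapshot (e.g. a /proc read).
    act: callable performing the action and returning its result.
    log: list collecting the triples -- no human labeling involved."""
    state = read_state()                 # raw bytes before the action
    result = act()                       # the action itself
    outcome = read_state()               # raw bytes after: the machine's "answer"
    log.append({"t": time.time(), "state": state,
                "action": act.__name__, "outcome": outcome})
    return result

# Toy example with a fake machine: a mutable byte buffer.
machine = bytearray(8)
def bump(): machine[0] += 1; return machine[0]
log = []
record_triple(lambda: bytes(machine), bump, log)
```

On a real box, `read_state` would snapshot /proc and `act` would be whatever the system was doing anyway, which is the point: the data accumulates as a side effect of normal operation.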
Am I getting this wrong?
No-Refrigerator-1672@reddit
So what's the point in collecting this data? The LLM must translate an intent, some kind of task, into control commands, and that is completely absent from logs; somebody has to go in and manually label all of it.
ilbert_luca@reddit (OP)
You're right. I don't have a clear idea on this yet. I'm trying to build a little demo that shows how an LLM can control a CHIP-8 machine and that may give me some clarity over what data is needed for the training
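For reference, the CHIP-8 target is small enough that a few opcodes capture the idea. This sketch decodes three real opcodes from the CHIP-8 spec (6XNN, 7XNN, ANNN); it is my illustration, not the demo from the linked post.

```python
class Chip8:
    """Minimal slice of a CHIP-8 machine. The real spec has 35 opcodes;
    this sketch executes three of them, enough to show the action space."""
    def __init__(self):
        self.v = [0] * 16      # 8-bit registers V0..VF
        self.i = 0             # index register
        self.pc = 0x200        # CHIP-8 programs load at address 0x200

    def step(self, hi, lo):
        """Execute one 2-byte instruction (big-endian, per the CHIP-8 spec)."""
        op, x, nn = hi >> 4, hi & 0xF, lo
        nnn = ((hi & 0xF) << 8) | lo
        if op == 0x6:                      # 6XNN: VX = NN
            self.v[x] = nn
        elif op == 0x7:                    # 7XNN: VX += NN (wraps at 8 bits)
            self.v[x] = (self.v[x] + nn) & 0xFF
        elif op == 0xA:                    # ANNN: I = NNN
            self.i = nnn
        self.pc += 2

m = Chip8()
m.step(0x63, 0x2A)   # V3 = 0x2A
m.step(0x73, 0x01)   # V3 += 1
m.step(0xA1, 0x23)   # I = 0x123
```

A model controlling this machine would emit the two raw instruction bytes per step, which makes the "low-level tokens" idea concrete: the output vocabulary is just byte pairs.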
ImportancePitiful795@reddit
An LLM "thinks" in the form of tokens/embeddings/tensors/attention weights etc. Not "text".
"Text" (text/voice) is the form of communication humans have between themselves and with computers. It makes sense, since we do not have chips with electrodes wired to all our neurons.
Now, AI agents talk to LLMs with text because that's what LLMs understand. Some agents bypass this and go down to the token/embedding level. But no deeper.
Why? Because going deeper requires things like tensor injection etc., and at that point you need a new type of model, not a Large LANGUAGE Model.
Now, in relation to your example with Tesla FSD etc.: as someone who started working with robotics back in 2008 during MS RoboChamps and Microsoft Robotics Studio, Tesla FSD is no different at its core than the work we did back then for the Kia self-driving car challenge (or even the Mars rover challenge).
The "service" collects all the sensor data asynchronously and concurrently in real time, and makes decisions based on it. Slow down, speed up, turn, etc. Sure, Tesla FSD is much more advanced than using CCS and some C#, but the basic principle is the same.
So if you can find a way to make a new type of model that agents can talk to at the bottom level (or at least at the matrix level), please release the paper publicly and do not keep it private to make a few billion in cash. 😁
ilbert_luca@reddit (OP)
Yeah, the basic principle is the same. What I don't understand is why we don't make the models output "low-level tokens" (syscalls? assembly instructions? I don't know) like Tesla's FSD model does instead of outputting JSON (or any other human-readable format) that then gets parsed and executed by a pre-written program.
I get the reason is that we have loads of text available to train on, but I'm just wondering if anyone is even thinking of building something in this direction
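Concretely, the contrast might look like this. The JSON action is a made-up example of today's agent pattern; 257 is the real x86-64 Linux syscall number for `openat`, but the tuple layout is purely illustrative.

```python
import json

# Today: the agent emits JSON text, and a pre-written wrapper parses and executes it.
text_action = '{"tool": "open", "args": {"path": "/etc/hosts"}}'
parsed = json.loads(text_action)

# The direction discussed here: the model's head emits the low-level action itself,
# e.g. a (syscall_number, args) pair. 257 = openat on x86-64 Linux;
# AT_FDCWD = -100, flags 0 = O_RDONLY.
SYS_OPENAT = 257
raw_action = (SYS_OPENAT, (-100, b"/etc/hosts", 0))
```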
ImportancePitiful795@reddit
Models depend on the usage. An LLM has a certain way of doing the job.
TimeMixer Models, which I am using extensively, have different "communication" method.
ilbert_luca@reddit (OP)
Can you elaborate more on this? I'm curious
ImportancePitiful795@reddit
My main job is writing software for a company that provides time series forecasting.
So I use foundation models that are designed for this job.
KptEmreU@reddit
Yes, you are right. This is happening because we are not using developers for it right now. There are tons of robots in manufacturing that make zero use of LLMs, and now our LLMs are at the stage where they can communicate with us in English. But they're not really that great at coding (they are great, but not that great: an LLM is a huge computer program that can evaluate stuff, but it doesn't normally write 2 billion rows of code for checking emails at your openclaw; a 2-billion-parameter model is the most stupid of LLMs), so they use the brute force of their knowledge to push through tasks.
Checking and evaluating an email could be accomplished by dedicated code of maybe a million rows (made-up number),
and it would be 1000x more effective. Yet where is that specialized code? Why do we need it when we have the nuke (LLMs) with us? :)
Probably when LLMs get better they will write specialized code for the tasks we give them (we are at the start of this phase, btw), and then the LLMs will rest while small, efficient programs work for them.
ilbert_luca@reddit (OP)
LLMs can be overkill for most machine tasks, but I don't think the answer is LLMs writing specialized code. Think about how computers already work: you don't talk to the CPU in English. There's a kernel that speaks raw bytes to the hardware and gives you a clean interface. LLMs don't have that, they go straight from English to tool calls, like using a computer without an OS.
What if there was a "kernel" for LLMs? A small, fast model that sits between the LLM and the machine. The LLM says "check my email." The kernel handles the raw machine control — bytes in, bytes out — without generating a single token. The LLM is the brain, the kernel is the nervous system.
I'm trying to come up with an MVP that shows this concept, but I'm struggling with finding a simple task that works without having to do much training on this small model that sits in front of the LLM.
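A toy sketch of that split: the LLM hands over one high-level intent, and the kernel drives the machine with raw bytes, generating no tokens. All names here are hypothetical, and the intent table is a stand-in for the small trained model.

```python
class ToyKernel:
    """The 'nervous system': translates one high-level intent into a
    byte-level command sequence and runs it against the machine."""
    INTENTS = {"check_email": [b"\x01OPEN", b"\x02READ", b"\x03CLOSE"]}

    def __init__(self, machine):
        self.machine = machine           # anything with write(bytes) -> bytes

    def run(self, intent):
        outputs = []
        for op in self.INTENTS[intent]:  # bytes in, bytes out, no text in between
            outputs.append(self.machine.write(op))
        return outputs

class EchoMachine:
    """Fake device that acknowledges each byte command."""
    def write(self, op):
        return b"\x00" + op[:1]          # status byte + echoed opcode

# The LLM's only job: name the intent once. Everything below it is byte traffic.
out = ToyKernel(EchoMachine()).run("check_email")
```

The interesting research question is what replaces the lookup table; the structure of the interface, though, fits in a few lines.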
KptEmreU@reddit
Yeah, it will happen step by step.
Right now they use MCP and skills.
Soon, their own script libraries' API calls.
Then directly to the machine in assembly. Then "All hail the Omnissiah." Technology will be magic to our puny brains :D
x11iyu@reddit
I think the vision is that it opens the possibility for LLMs to zero-shot generalize across a wider range of applications (even if reality doesn't reflect that yet).
Is your model, directly trained on process-scheduling bytes, more effective for your task? Absolutely, zero doubt. Buuut can this model be directly applied to a slew of other OSes that might use bytes differently to communicate? No.