Why does AI fail to generate simple ASCII images ?

Posted by ConcernedIndInvestor@reddit | LocalLLaMA | View on Reddit | 16 comments

I saw a post earlier about MineBench. I was impressed to see that the latest models can produce such realistic outputs. Their ability to understand the prompt and make spatial modifications were impressive.
But when I asked the models to generate simple ascii images, they failed spectacularly.

Prompt: Draw simple ascii image of a person touching his eyes.

gemma-4-31b-it

O /
  /|/
  / \

(looks like someone hung themselves to me)

grok-4.1-thinking

    (=⌵=)
   ( x x )
    ( ─ )
     ||||
     ||||
    /    \    (=⌵=)
   ( x x )
    ( ─ )
     ||||
     ||||
    /    \

deepseek-v3.2-exp-thinking

( ͡° ͜ʖ ͡°)( ͡° ͜ʖ ͡°)

I also tried Qwen 3.6 Plus gemini-3-flash-preview and free version of ChatGPT. All the models failed and produced absurd outputs. Do the latest local models produce any better results ? I don't understand how AI can solve advance math and fail at such a trivial task!

[-]

Queasy_Dentist3903@reddit

The models don't "see" text like you or me, so they can't really craft images when they don't know what images look like. They are really just pattern matching existing ascii art.

[-]

ConcernedIndInvestor@reddit (OP)

Thanks. But based on my naive understand, they are trained on the vast trove of internet data. So, there must be places where ascii art occurs. Human anatomy must also be part of their training data. So, it surprises me that can't pattern match and instead produce such absurd outputs. Perhaps they can't combine those two sources of knowledge ?
I don't have a Claude subscription, so can't test it. I would be very curious to see its output.
Anthropic talks a lot about alignment training. So, does their model something sane ? If not, does it deceive or hallucinates like the others.

[-]

TotallyToxicToast@reddit

They would likely not train it on raw internet data. There would be some sort of pre-processing. So my guess is that likely a lot of white spaces where removed from the training data (which would make sense for almost all data, ASCII art is an exception). So the models get the white spaces wrong.

For us ASCII art is Trivial, but for a model it is far removed from anything else it does. If ASCII art works at all also depends on the Font, but the model has no idea about the Font.

I am pretty sure you could fine tune a model quite quickly to be good at ascii art.

Another option would be to have an agentic model with vision capabilities and let it see its own Ascii art as an image iteratively until it is happy.

[-]

Queasy_Dentist3903@reddit

"can't pattern match and instead produce such absurd outputs. " there is likely very few of the same exact prompt, also it is harder to generlize images you can draw but can't see then to classify text. So for these models you can't really expect them to draw anything they haven't memorized. Combinations of knowledge don't tend to occur in LLMs in the way you described, it can understand human anatomy, but not how to generate tokens representing it in visual form.

[-]

ConcernedIndInvestor@reddit (OP)

Sorry, but I have to disagree. I don't think they understand human anatomy.
Reason: LLMs can combine two sources of knowledge.
For example: I asked ChatGPT to summarize Mein Kampf in Sanskrit and it produced coherent output. It's unlikely that Sanskrit summaries of Mein Kampf are a big portion of its training data (so, not pattern matching). Instead, I think in this case it was able to put together two different sources of knowledge.
That's why, I think it can't be true that it understands both the concept of ascii art and human faces and still fail to put those concepts together.

[-]

No-Consequence-1779@reddit

Yes, but it’s stupid. Maybe when they run out of data to train, they’ll crack open the ascii ‘art’.

LeRobber@reddit

You intentionally remove ascii art from training data it's confusing as fuck to models.

abitrolly@reddit

Models operate on 1D strings. They need to be trained to be aware of 2D text with rows and columns. Then they can be trained to generate stuff at these rows and columns almost the same way they do for x an y blocks in images.

ambient_temp_xeno@reddit

The older models used to be better. Airoboros 65b.

Sonnet 4.6 can make a castle.

Ok_Gold_9674@reddit

The core issue isn't model capability—it's tokenization.

ASCII art depends on precise spatial control of individual characters (spaces, slashes, pipes). But LLMs tokenize text using BPE or SentencePiece, where a single space might merge with the next character into one token, and newlines get compressed. The model isn't "seeing" the visual grid; it's predicting the next token in a space where whitespace has no geometric meaning.

MineBench works because it's semantic ("place a block here"), not geometric. The model understands the instruction, not the pixel layout.

Some workarounds that actually help:

Ask the model to generate the ASCII as Python code using nested arrays or print statements. Treating it as code gives the model structural guardrails.
Use a multimodal model with vision capabilities and show it a reference image alongside the prompt.
For serious ASCII generation, diffusion-based approaches (some 2024 papers explored this) outperform autoregressive LLMs because diffusion operates in a continuous space that can be discretized to characters.

So it's not that AI "can't" do ASCII—it's that text-only LLMs are the wrong architecture for spatial tasks.