We just Fine-Tuned a Japanese Manga OCR Model with PaddleOCR-VL!
Posted by erinr1122@reddit | LocalLLaMA | 22 comments
Hi all! 👋
Hope you don’t mind a little self-promo, but we just finished fine-tuning PaddleOCR-VL to build a model specialized in Japanese manga text recognition — and it works surprisingly well! 🎉
Model: PaddleOCR-VL-For-Manga
Dataset: Manga109-s + 1.5 million synthetic samples
Accuracy: 70% full-sentence accuracy (vs. 27% from the original model)
It handles manga speech bubbles and stylized fonts really nicely. There are still challenges with full-width vs. half-width characters, but overall it’s a big step forward for domain-specific OCR.
How to use
You can use this model with Transformers, PaddleOCR, or any library that supports PaddleOCR-VL to recognize manga text.
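If you go the Transformers route, usage looks roughly like this (a minimal sketch; the hub id, prompt string, and dtype are assumptions, so check the model card for the exact invocation):

```python
# Minimal sketch: running the manga fine-tune through Transformers.
# Hub id, prompt, and dtype below are assumptions -- see the model card.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "PaddlePaddle/PaddleOCR-VL-For-Manga"  # hypothetical hub id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).eval()

image = Image.open("speech_bubble.png").convert("RGB")
inputs = processor(text="OCR:", images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```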
For structured documents, try pairing it with PP-DocLayoutV2 for layout analysis — though manga layouts are a bit different.
We’d love to hear your thoughts or see your own fine-tuned versions!
Really excited to see how we can push OCR models even further. 🚀

Anu_Rag9704@reddit
OP, can you share the code?
erinr1122@reddit (OP)
Hi! You can find the code here: https://pfcc.blog/posts/paddleocr-vl-for-manga. We’d also love to hear more about your fine-tuning project to explore possible collaboration or support opportunities.
Anu_Rag9704@reddit
Thanks man
erinr1122@reddit (OP)
on the way
chokehazard24@reddit
Great work! I'm also thinking of fine-tuning PaddleOCR-VL for Vietnamese financial reports. I'd appreciate it if you could share the source code for fine-tuning, thanks!
erinr1122@reddit (OP)
Hi! You can find the code here: https://pfcc.blog/posts/paddleocr-vl-for-manga. We’d also love to hear more about your fine-tuning project to explore possible collaboration or support opportunities.
erinr1122@reddit (OP)
On the way!
IJOY94@reddit
Why not decompose/segment things first? E.g. background, foreground graphical text, text bubble, text overlay, THEN do the extraction.
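That kind of two-stage pipeline is quick to prototype, for what it's worth. A sketch, where the detector weights file and the `run_ocr` wrapper are placeholders:

```python
# Sketch of segment-then-extract: detect speech bubbles, OCR each crop.
# "manga-bubbles.pt" and run_ocr() are placeholders for your own
# bubble detector and recognizer.
from PIL import Image
from ultralytics import YOLO

page = Image.open("page.png").convert("RGB")
detector = YOLO("manga-bubbles.pt")  # hypothetical bubble-detector weights

for box in detector(page)[0].boxes.xyxy.tolist():
    x0, y0, x1, y1 = map(int, box)
    crop = page.crop((x0, y0, x1, y1))
    print((x0, y0), run_ocr(crop))  # run_ocr: whatever recognizer you use
```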
msp26@reddit
bitter lesson
The end goal for this sort of system should be a model that can also take previous pages + a summary of the plot so far into context.
I don't think we're that far away from having local models capable of this.
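A rough sketch of that kind of rolling context (everything here is illustrative; `vlm_generate` stands in for whatever local VLM backend you run):

```python
# Illustrative sketch: give the model the plot summary so far plus the
# previous page's text alongside the current page image.
# vlm_generate() is a placeholder for your local VLM call.
def transcribe_page(page_image, plot_summary, prev_page_text):
    prompt = (
        "Plot so far:\n" + plot_summary + "\n\n"
        "Previous page:\n" + prev_page_text + "\n\n"
        "Transcribe the dialogue on this page in reading order."
    )
    return vlm_generate(prompt, image=page_image)

def update_summary(plot_summary, new_page_text):
    prompt = (
        "Summary so far:\n" + plot_summary + "\n\n"
        "New page:\n" + new_page_text + "\n\n"
        "Rewrite the summary to include the new page, in under 200 words."
    )
    return vlm_generate(prompt)
```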
Serprotease@reddit
I did try to do something like that.
My goal was to convert a manga into novel format and then have it read with a TTS model.
Tracking the plot/action is not too hard if you pass a general summary of all previous pages + details of page n-1.
The difficult part is character recognition. I tried CLIP embeddings + a FAISS index (ID a character, then pass some simple information about them to the model).
It's a bit slow, but it works okay-ish after one good weekend of work.
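For anyone wanting to try the same trick, the CLIP + FAISS lookup is only a few lines. A sketch, assuming one reference crop per character (the names and paths are illustrative):

```python
# Sketch of CLIP + FAISS character ID: embed one reference crop per
# character, then match new face crops by cosine similarity.
import faiss
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path):
    inputs = proc(images=Image.open(path).convert("RGB"), return_tensors="pt")
    feats = clip.get_image_features(**inputs).detach().numpy()
    faiss.normalize_L2(feats)  # normalize so inner product = cosine sim
    return feats

names = ["protagonist", "rival"]  # illustrative reference characters
index = faiss.IndexFlatIP(512)  # ViT-B/32 image features are 512-d
for name in names:
    index.add(embed(f"refs/{name}.png"))

score, idx = index.search(embed("unknown_face.png"), 1)
print(names[idx[0][0]], float(score[0][0]))
```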
msp26@reddit
Nice! How did you handle double pages btw?
Karyo_Ten@reddit
Because with enough data you can teach a model that so you don't need to explicitly program it.
AureliusPere@reddit
How does this compare to manga_ocr?
JawGBoi@reddit
Pretty cool that you were able to do this.
I am a bit disappointed though. Maybe my example is too low resolution, but even so, the missing text isn't hard to recognise.
erasels@reddit
Only the furigana is missing, and that has no real impact on the text since it's just there to tell you how to pronounce a word, iirc.
JawGBoi@reddit
Oh, silly me! For some reason my eyes skipped over the 約束したじゃない part, expecting it to be on another line like in the manga.
I fully take this back.
What would be cool to see is a model for visual novel text. Coming up with a pipeline to generate synthetic images for that would be very interesting. Generating a variety of realistic text, in a variety of fonts, inside of a variety of (usually transparent) dialogue boxes, in front of lots of different backgrounds.
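That generation loop is very doable with PIL alone. A toy sketch (the background, font paths, and box styling are all placeholders):

```python
# Toy synthetic-sample generator for visual-novel text: random background,
# semi-transparent dialogue box, random font and size. Paths are placeholders.
import random
from PIL import Image, ImageDraw, ImageFont

def make_sample(text, bg_path, font_path):
    img = Image.open(bg_path).convert("RGBA").resize((1280, 720))
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    # Semi-transparent box along the bottom, like most VN engines.
    draw.rounded_rectangle((40, 500, 1240, 700), radius=16,
                           fill=(0, 0, 30, random.randint(120, 200)))
    font = ImageFont.truetype(font_path, size=random.randint(24, 36))
    draw.text((80, 540), text, font=font, fill=(255, 255, 255, 255))
    return Image.alpha_composite(img, overlay)  # image for the (image, text) pair

sample = make_sample("約束したじゃない", "bg/street.png", "fonts/NotoSansJP-Regular.otf")
sample.convert("RGB").save("sample_000.png")
```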
erasels@reddit
Ah yeah, happens. On a different note, what GUI did you use to test this in your screenshot?
JawGBoi@reddit
I used the huggingface demo.
zxyzyxz@reddit
Now if you could just make a translation and typesetting model too...
aichiusagi@reddit
It would be awesome if you could share the synthetic data that was used to train this as well. Since you’re building on open datasets and models, it’d be great to give back to the community in the same way!
jamaalwakamaal@reddit
amazing!
Barubiri@reddit
Thanks OP, doing god's work there.