We just Fine-Tuned a Japanese Manga OCR Model with PaddleOCR-VL!
Posted by erinr1122@reddit | LocalLLaMA | 22 comments
Hi all! 👋
Hope you don’t mind a little self-promo, but we just finished fine-tuning PaddleOCR-VL to build a model specialized in Japanese manga text recognition — and it works surprisingly well! 🎉
Model: PaddleOCR-VL-For-Manga
Dataset: Manga109-s + 1.5 million synthetic samples
Accuracy: 70% full-sentence accuracy (vs. 27% from the original model)
It handles manga speech bubbles and stylized fonts really nicely. There are still challenges with full-width vs. half-width characters, but overall it’s a big step forward for domain-specific OCR.
How to use
You can use this model with Transformers, PaddleOCR, or any library that supports PaddleOCR-VL to recognize manga text.
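If you go the Transformers route, usage looks roughly like this (a minimal sketch; the hub id, prompt string, and dtype are assumptions, so check the model card for the exact invocation):

```python
# Minimal sketch: running the manga fine-tune through Transformers.
# Hub id, prompt, and dtype below are assumptions -- see the model card.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "PaddlePaddle/PaddleOCR-VL-For-Manga"  # hypothetical hub id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).eval()

image = Image.open("speech_bubble.png").convert("RGB")
inputs = processor(text="OCR:", images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```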
For structured documents, try pairing it with PP-DocLayoutV2 for layout analysis — though manga layouts are a bit different.
We’d love to hear your thoughts or see your own fine-tuned versions!
Really excited to see how we can push OCR models even further. 🚀

Anu_Rag9704@reddit
OP, can you share the code?
erinr1122@reddit (OP)
Hi! You can find the code here: https://pfcc.blog/posts/paddleocr-vl-for-manga. We’d also love to hear more about your fine-tuning project to explore possible collaboration or support opportunities.
Anu_Rag9704@reddit
Thanks man
erinr1122@reddit (OP)
on the way
chokehazard24@reddit
Great work! I'm also thinking of fine-tuning PaddleOCR-VL for Vietnamese financial reports. I'd appreciate it if you could share the source code for fine-tuning, thanks!
erinr1122@reddit (OP)
Hi! You can find the code here: https://pfcc.blog/posts/paddleocr-vl-for-manga. We’d also love to hear more about your fine-tuning project to explore possible collaboration or support opportunities.
erinr1122@reddit (OP)
On the way!
IJOY94@reddit
Why not decompose/segment things first? E.g. background, foreground graphical text, text bubble, text overlay, THEN do the extraction.
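That kind of two-stage pipeline is quick to prototype, for what it's worth. A sketch, where the detector weights file and the `run_ocr` wrapper are placeholders:

```python
# Sketch of segment-then-extract: detect speech bubbles, OCR each crop.
# "manga-bubbles.pt" and run_ocr() are placeholders for your own
# bubble detector and recognizer.
from PIL import Image
from ultralytics import YOLO

page = Image.open("page.png").convert("RGB")
detector = YOLO("manga-bubbles.pt")  # hypothetical bubble-detector weights

for box in detector(page)[0].boxes.xyxy.tolist():
    x0, y0, x1, y1 = map(int, box)
    crop = page.crop((x0, y0, x1, y1))
    print((x0, y0), run_ocr(crop))  # run_ocr: whatever recognizer you use
```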
msp26@reddit
bitter lesson
The end goal for this sort of system should be a model that can also take previous pages + a summary of the plot so far into context.
I don't think we're that far away from having local models capable of this.
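A rough sketch of that kind of rolling context (everything here is illustrative; `vlm_generate` stands in for whatever local VLM backend you run):

```python
# Illustrative sketch: give the model the plot summary so far plus the
# previous page's text alongside the current page image.
# vlm_generate() is a placeholder for your local VLM call.
def transcribe_page(page_image, plot_summary, prev_page_text):
    prompt = (
        "Plot so far:\n" + plot_summary + "\n\n"
        "Previous page:\n" + prev_page_text + "\n\n"
        "Transcribe the dialogue on this page in reading order."
    )
    return vlm_generate(prompt, image=page_image)

def update_summary(plot_summary, new_page_text):
    prompt = (
        "Summary so far:\n" + plot_summary + "\n\n"
        "New page:\n" + new_page_text + "\n\n"
        "Rewrite the summary to include the new page, in under 200 words."
    )
    return vlm_generate(prompt)
```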
Serprotease@reddit
I did try to do something like that.
My goal was to convert a manga into novel format and then have it read with a TTS model.
Tracking the plot/action is not too hard if you pass a general summary of all previous pages + details of page n-1.
The difficult part is character recognition. I tried CLIP embeddings + a FAISS index (ID a character, then pass some simple information about them to the model).
It's a bit slow, but it works okay-ish after one good weekend of work.
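For anyone wanting to try the same trick, the CLIP + FAISS lookup is only a few lines. A sketch, assuming one reference crop per character (the names and paths are illustrative):

```python
# Sketch of CLIP + FAISS character ID: embed one reference crop per
# character, then match new face crops by cosine similarity.
import faiss
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path):
    inputs = proc(images=Image.open(path).convert("RGB"), return_tensors="pt")
    feats = clip.get_image_features(**inputs).detach().numpy()
    faiss.normalize_L2(feats)  # normalize so inner product = cosine sim
    return feats

names = ["protagonist", "rival"]  # illustrative reference characters
index = faiss.IndexFlatIP(512)  # ViT-B/32 image features are 512-d
for name in names:
    index.add(embed(f"refs/{name}.png"))

score, idx = index.search(embed("unknown_face.png"), 1)
print(names[idx[0][0]], float(score[0][0]))
```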
msp26@reddit
Nice! How did you handle double pages btw?
Karyo_Ten@reddit
Because with enough data you can teach a model that so you don't need to explicitly program it.
AureliusPere@reddit
How does this compare to manga_ocr?
JawGBoi@reddit
Pretty cool that you were able to do this.
I am a bit disappointed though. Maybe my example is too low resolution, but even so, the missing text isn't hard to recognise.
erasels@reddit
Only the furigana is missing, and that has no real impact on the text since it's just there to tell you how to pronounce a word, iirc.
JawGBoi@reddit
Oh, silly me! For some reason my eyes skipped over the 約束したじゃない part, expecting it to be on another line like in the manga.
I fully take this back.
What would be cool to see is a model for visual novel text. Coming up with a pipeline to generate synthetic images for that would be very interesting. Generating a variety of realistic text, in a variety of fonts, inside of a variety of (usually transparent) dialogue boxes, in front of lots of different backgrounds.
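That generation loop is very doable with PIL alone. A toy sketch (the background, font paths, and box styling are all placeholders):

```python
# Toy synthetic-sample generator for visual-novel text: random background,
# semi-transparent dialogue box, random font and size. Paths are placeholders.
import random
from PIL import Image, ImageDraw, ImageFont

def make_sample(text, bg_path, font_path):
    img = Image.open(bg_path).convert("RGBA").resize((1280, 720))
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    # Semi-transparent box along the bottom, like most VN engines.
    draw.rounded_rectangle((40, 500, 1240, 700), radius=16,
                           fill=(0, 0, 30, random.randint(120, 200)))
    font = ImageFont.truetype(font_path, size=random.randint(24, 36))
    draw.text((80, 540), text, font=font, fill=(255, 255, 255, 255))
    return Image.alpha_composite(img, overlay)  # image for the (image, text) pair

sample = make_sample("約束したじゃない", "bg/street.png", "fonts/NotoSansJP-Regular.otf")
sample.convert("RGB").save("sample_000.png")
```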
erasels@reddit
Ah yeah, happens. On a different note, what GUI did you use to test this in your screenshot?
JawGBoi@reddit
I used the huggingface demo.
zxyzyxz@reddit
Now if you could just make a translation and typesetting model too...
aichiusagi@reddit
It would be awesome if you could share the synthetic data that was used to train this as well. Since you’re building on open datasets and models, it’d be great to give back to the community in the same way!
jamaalwakamaal@reddit
amazing!
Barubiri@reddit
Thanks OP, doing god's work there.