ERNIE-4.5-VL-28B-A3B is a hidden gem that can decently tackle challenging Chinese/Japanese OCR problems.
Posted by mixivivo@reddit | LocalLLaMA | View on Reddit | 33 comments
图中文本转录如下:
倭王武の上表文
倭・任那・加罗・秦韩・慕韩七国诸军事安东大将军罗・任那・加罗・秦韩・慕韩七国诸军事安东大将军倭国王と称す。顺帝の昇明二年①使遣して上表する。昔して曰く、封国②は偏遗して藩を外に作る。昔より祖祢③躬甲胄揔斡、山川を跋涉して寛处④に进めあず、西は衆夷⑥を服することに六十六国、渡って海北⑦を平くること九十五国。
(宋书 倭国传 原汉文)
①四七八年。②领城、自分の国のこと。③父祖という说とがある。④おちついての最もない。⑤蛭页のこととか。⑦朝鲜半岛のことか。
竖穴式石室の模式図
【日本書紀】【宋書】
倭の五王と天皇
「宋書」倭伝に读・珍(彌)・济・奥・武の五王の名が记されてる。济以下は记纪に伝える尤恭・安康・雄略の各天皇にあてられるが、读には忤神・仁德・履中天皇をあててる诸说がある。珍にも仁德・反正天皇あててる2说がある。
纪にかけてのことである。高句麗の好太王の碑文①には、倭が朝鲜半岛に进出し高句麗と交戦したことが记されている。これは、大和政権が朝鲜半岛の进んだ技术や鉄资源を获得するために加罗(任那)に进出し、そこを拠点として高句麗の势力と对抗したことを物语っている。
「宋书」などには、5世纪初めからほぼ1世纪の间、倭の五王が中国の南朝に朝贡し、高い称号をえようとしたことが记されている。これは中国の皇帝の権威を利用して、朝鲜诸国に対する政治的立场を有利にしようとしたものと考えられる。
朝鲜半岛・中国南朝との交渉をつづじて、大和政権は大陆の进んだ技术と文化をとりいれ、势いを强めた。4世纪末から5世纪にかけての中の古墳は急激に巨大化し、大和政権の最高の首长である大王②の権力が强大化したことを物语っている。
① 好太王(広开土王)一代の事业を记した石碑で、高句麗の都のあった中国吉林省集安県にある。当时の朝鲜半岛の情势を知るための贵重な史料で、そのなかに「百済(百济)」新罗は旧是属民り。由来朝贡す。而るに倭、辛卯の年(391年)よりこのかた、海渡って百済□□□罗を破り、以って臣民とあず、日本の朝鲜半岛への进出を伝えている。
② 熊本県玉名郡菊水町の江田船山古墳出土の大刀铭には「治天下猨□□□罗大王世……」とあり、埼玉県行田市の楢荷山古墳出土の铁劔铭(→p.26図版)にも「倭加多支文大王」ともなる。「大王」は、倭の五王の1人武、记纪(「古事记」「日本书纪」)にワカタケルの名で记録された雄略天皇をさすと考えられる。これらの大刀や铁劔をもつ古墳の被葬者は、大和政権と密接な関系にあったと推测される。
AfterAte@reddit
I'm waiting for the day I can ditch Google Lens (for its Japanese text translation feature) in favor of a self-hosted VL model.
jupiterbjy@reddit
same here, reading untranslated visual novels will be much easier
KageYume@reddit
Are text hookers such as Luna Translator not able to hook text from your VNs? If a game is supported, text hooking is much better than OCR.
I tried the new Sugoi models (they are 14B and 32B, Qwen2-based). They are decent but worse at following instructions than Gemma 3 (which is important because you can use the system prompt to add context to the translation).
jupiterbjy@reddit
Darn, guess I shouldn't have bought this, gotta get the PC ver later hahaha
Thanks for the tip, I do have other untranslated (yet bought regardless) VNs on Steam, so I'll try Luna with those instead!
KageYume@reddit
Yeah, Luna is great. You will usually have to add it to your antivirus's whitelist because of false detections (Luna is open source, so you can even get the source, check it, and compile it yourself).
Here is a little tool to help with creating the system prompt for VN translation. :'>
https://www.reddit.com/r/visualnovels/comments/1kwm5wn/visual_novel_character_name_extractor_extract/
jupiterbjy@reddit
looks solid, too bad my yet-to-be-played ones on Steam are Chinese, but gotta give it a shot!
KingDutchIsBad455@reddit
Maybe try something like Sugoi if you are into VNs; they recently released their new LLM translation models, and they are pretty good. With the SugoiToolkit you should be able to translate VNs.
jupiterbjy@reddit
Darn I didn't even know this existed, thanks!
Lyroxide@reddit
Except they mixed Chinese characters in the Japanese text lmao
Seijinter@reddit
Where? I only see either full Chinese, or Japanese in those images.
Mar2ck@reddit
Every bold character here is Chinese-only
Seijinter@reddit
Oh, Japanese don't have those as kanji?
TheRealMasonMac@reddit
I think some are but some aren't. It's weird, and apparently many of the modifications Japan made to traditional Chinese characters were later adopted by China into simplified Chinese.
mixivivo@reddit (OP)
Gemini 2.5 Pro does the same.
condition_oakland@reddit
Professional Japanese to English technical translator here. The Japanese OCR transcription quality you posted is really poor, totally unusable in fact. I use both Gemini 2.5 Pro and Flash for Japanese OCR transcription in my actual translation workflow. They are better than any traditional OCR software I have used over the years. Just posting this in case anyone comes across this post in the future.
mixivivo@reddit (OP)
Sure, I know. But the key difference is that Gemini 2.5 isn't open-source under an Apache 2.0 license. It's not a local model. You can't run it on your own hardware, and you definitely can't send sensitive or private data to it.
NandaVegg@reddit
Have you run the same OCR test with the largest ERNIE VL model (424B-A47B)? That model's Japanese is slightly off (it subtly inserts translation-like sentences that don't totally make sense here and there) but mostly coherent and much better than the average open-source model, on par with Qwen3 235B-A22B from what I tested.
mixivivo@reddit (OP)
The ERNIE 4.5 VL 424B-A47B model is way too big; using it just for OCR is overkill and a waste of resources. Honestly, it doesn't make much sense. I did try it on a few samples, and it didn't seem any better than the 28B version.
mixivivo@reddit (OP)
My bad, I just realized the model I was testing was actually ERNIE 4.5 Turbo, not the 424B-A47B version. It seems like there isn't a public platform offering inference for ERNIE-4.5-VL-424B-A47B yet.
NandaVegg@reddit
? No. This is Gemini 2.5 Pro with the prompt "Transcribe this image with Japanese".
There are no Chinese mix-ups or text dropouts. It even transcribed the tiny ruby (furigana) perfectly. Compared to that, the ERNIE example in the OP has tons of unreadable dropouts/total gibberish like "珍にも仁德・反正天皇あててる2说がある。" or "おちついての最もない" or "旧是属民り". It basically looks unusable, but it might be a good starting point for mid-training if not fine-tuning.
https://pastebin.com/6gxzFhMY
mixivivo@reddit (OP)
Totally agree with you on Gemini 2.5 Pro being a beast for OCR. But I get the feeling that it's not a single "pure" model. My theory is that it's a system that pipes the image through a dedicated OCR engine first (probably its own Document AI) and then feeds the text to the LLM.
My reasoning is that it makes the exact same mistakes as Document AI when handling Chinese OCR, especially with punctuation. If you don't specifically prompt it, it'll convert full-width punctuation into half-width (like ，→, or ；→;), which is a pretty serious bug for formal Chinese documents. Early versions of Claude had this same issue.
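If you run into that, it's easy to patch after the fact. Here's a rough sketch of the kind of post-processing I mean (just an illustration, not my actual pipeline; the mapping only covers a few common marks, extend it for your own documents):

```python
# Sketch: restore full-width CJK punctuation that an OCR/LLM pass
# silently converted to half-width. Only a handful of common marks
# are mapped here; extend the table as needed.
HALF_TO_FULL = {
    ",": "，",
    ";": "；",
    ":": "：",
    "!": "！",
    "?": "？",
    "(": "（",
    ")": "）",
}

def restore_fullwidth(text: str) -> str:
    """Convert half-width punctuation back to full-width when it sits
    next to CJK characters, leaving Latin-only spans untouched."""
    def is_cjk(ch: str) -> bool:
        # CJK unified ideographs plus hiragana/katakana
        return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"

    out = []
    for i, ch in enumerate(text):
        if ch in HALF_TO_FULL:
            prev_cjk = i > 0 and is_cjk(text[i - 1])
            next_cjk = i + 1 < len(text) and is_cjk(text[i + 1])
            if prev_cjk or next_cjk:
                out.append(HALF_TO_FULL[ch])
                continue
        out.append(ch)
    return "".join(out)

print(restore_fullwidth("中国の南朝に朝贡し,高い称号をえようとした;"))
```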
My main focus was testing Chinese OCR on the ERNIE model, and I just threw in some Japanese samples to see what would happen. To be fair, the test images were tough - blurry, complex layouts, etc. So while the final results weren't perfect (and yeah, I saw the errors you mentioned), it still performed impressively well for an open-source model.
On a side note, I've used Gemini 2.5 Pro to OCR a few books and noticed that when it's reading Traditional Chinese, it sometimes mixes in Simplified Chinese, Japanese kanji, and other character variants (异体字). It doesn't happen often, but the errors are there, so even Gemini isn't flawless.
This is where ERNIE-4.5-VL-28B-A3B shows its potential. Yes, it has its flaws, but because it's open-source, it's a solid starting point for the community to build on.
CommunityTough1@reddit
I think they meant even when giving English responses, occasionally Gemini will throw Chinese characters in for a word or short phrase. In fact, I've seen it happen with just about every model I've ever used.
NandaVegg@reddit
Hmm. The latest DeepSeek R1 is much better than previous iterations in terms of inter-language drift. I think they had a post-training RL pass which penalizes that.
Gemini sometimes does English <> Chinese or Japanese <> Korean drift, but it's not very noticeable (there was some improvement in the latest version in June). GPT-4o and 4.1 rarely do that, though o3 is unstable, likely because its reasoning datasets are monolingual. I believe they all have some RL pass to counteract inter-language drift.
Freonr2@reddit
Not that hidden, it's just new. Waiting for broader support.
Ylsid@reddit
I wouldn't call these challenging for OCR, if these are the examples. The font is clean and there's lots of context to help the LLM. Try an old game, with fonts so tiny the writers had to get creative about tricking you into reading them.
tempetemplar@reddit
Very interesting
koumoua01@reddit
I really like ERNIE-4.5-21B-A3B-PT. It knows and can output my language very well, even better than R1 and the big Qwen3. I'm planning to integrate this model into my department's system for generating reports from our data. I'm going to be busy for the next 1–2 years.
KoreanPeninsula@reddit
This is a really interesting post! But I'm curious, how can I prompt it to get more consistent results? The output is quite variable, and while I sometimes see something close to what I want, it requires a lot of tries.
mixivivo@reddit (OP)
Maybe you should try "transcribe" (转录) instead of "OCR" or "识别" (recognize).
KoreanPeninsula@reddit
Thanks for your response. Unfortunately, it’s not performing as well as I had hoped. Additionally, 424B provides quite a few unusual responses. For now, I’ll continue using Gemini Pro frequently. I eagerly anticipate the day when it can accurately recognize Chinese characters written in cursive script.
mixivivo@reddit (OP)
I've also noticed this model kinda sucks at following prompts.
But based on my limited testing, its Chinese OCR is the best among all open-source models I've tried. It's actually usable. Other models are basically useless to me because they choke on rare characters.
For real accuracy, I usually use a combo of Gemini 2.5 Pro, TextIn OCR, and Baidu High-Precision OCR. If you cross-reference the results from all three, you can get incredibly high accuracy.
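To make "cross-reference" concrete, here's a toy sketch of the idea (not my real pipeline, and the engine outputs are just placeholder strings): run the same page through all three engines and, line by line, keep whatever at least two of them agree on.

```python
from collections import Counter
from difflib import SequenceMatcher

def vote_lines(outputs: list[str]) -> str:
    """Toy line-level cross-check of several OCR transcriptions.
    For each line, keep the variant that at least two engines agree on;
    otherwise fall back to the variant most similar to the others."""
    split = [o.splitlines() for o in outputs]
    n_lines = max(len(s) for s in split)
    merged = []
    for i in range(n_lines):
        variants = [s[i] for s in split if i < len(s)]
        best, freq = Counter(variants).most_common(1)[0]
        if freq >= 2:                       # two or more engines agree
            merged.append(best)
            continue
        # no agreement: pick the variant closest to the other candidates
        def closeness(v: str) -> float:
            return sum(SequenceMatcher(None, v, w).ratio() for w in variants)
        merged.append(max(variants, key=closeness))
    return "\n".join(merged)

# Placeholder transcriptions standing in for the three engines' outputs
gemini = "倭の五王\n宋書 倭国伝"
textin = "倭の五玉\n宋書 倭国伝"
baidu  = "倭の五王\n宋書 倭国伝"
print(vote_lines([gemini, textin, baidu]))
```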
Since ERNIE-4.5-VL-28B-A3B is open-source under the Apache 2.0 license, I'm really hoping the community can build on its strong foundation. With some fine-tuning, we could get a truly usable Chinese/Japanese OCR model out of it.
No_Conversation9561@reddit
Is llama.cpp support done? Where did you try it?
mixivivo@reddit (OP)
You can try it out here. https://aistudio.baidu.com/modelsdetail/30648?modelId=30648 A Baidu account is required.
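If you want to poke at it locally instead, the weights are on Hugging Face. Something along these lines should work, assuming the repo follows the standard transformers vision-language pattern (the repo id and processor calls below are my guess, not the documented API; check the official model card for the exact usage):

```python
# Rough local-inference sketch (untested on this exact repo).
# Repo id and processor calls assume the generic transformers
# vision-language pattern; check the official model card for specifics.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-PT"   # assumed Hugging Face repo name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("page.png")                # your scanned page
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "请转录图中的所有文本。"},  # "transcribe all text in the image"
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=2048)
print(processor.decode(out[0], skip_special_tokens=True))
```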