Local manga translator with built-in LLM, written in Rust with llama.cpp integration
Posted by mayocream39@reddit | LocalLLaMA | View on Reddit | 87 comments
Hi LocalLLaMA,
I posted about this project a few weeks ago; since then, it has become more reliable and easier to use.
This is a manga translator that can also be used to translate any image. It uses a combination of object detection, visual LLM-based OCR, layout analysis, and fine-tuned inpainting models. I believe it is the most performant and easy-to-use pipeline for manga translation.
For the LLM part, I have integrated llama.cpp into this application; it supports the Gemma 4 family and the Qwen3.5 family, and also includes uncensored and fine-tuned models. It also supports OpenAI-compatible APIs, so you can use LM Studio, OpenRouter, etc.
I think the demo video explains the workflow well; basically, you just click a button and it runs the pipeline for you. You can also proofread and edit the result, changing the font, size, color, etc. It's a mini Photoshop editor.
For anyone who may be interested, it's fully open-source: https://github.com/mayocream/koharu
CheatCodesOfLife@reddit
Looks awesome!
I normally avoid Rust projects ever since I tried "mistral.rs" a while back, but I'm going to try to get this one running so I can rm -rf my old Python slop, dodgy Gradio mess, and all those dead OCR models.
I could never get the new text to be placed in the boxes correctly depending on the word length.
mayocream39@reddit (OP)
AMA!
KageYume@reddit
Sorry for the constant questions, but I've encountered a bug.
Bug description:
At the OCR stage, certain symbols (such as ●) can trigger the generation of repetitive character strings. This overwhelms the LLM (slow prompt processing), and if you are using an online LLM, it will burn through credits/tokens.
Input: https://imgur.com/a/lY9TZro
Output:
Below is the sample page that encounters this issue (I'm using the default settings for the models aside from the Translator).
mayocream39@reddit (OP)
I think it's related to the detector model. I encountered the same issue when I used pp-doclayoutv3, but it works fine with "comic text & bubble detector"; you can give it a try.
We have tuned repetition_penalty for paddleocr-vl, but it looks like we need to implement something to prevent it from repeating at the application level 🤔
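An application-level guard like the one mentioned above could be as simple as capping character runs in the OCR output before the text reaches the LLM. A minimal sketch (the function name and threshold are hypothetical, not Koharu's actual code):

```python
def collapse_repeats(text: str, max_run: int = 4) -> str:
    """Collapse runs of the same character beyond max_run.

    A hypothetical post-processing step to stop symbols like ●
    from ballooning into huge repetitive strings; Koharu's real
    pipeline may handle this differently.
    """
    out = []
    run = 0
    prev = None
    for ch in text:
        run = run + 1 if ch == prev else 1
        prev = ch
        if run <= max_run:
            out.append(ch)
    return "".join(out)

print(collapse_repeats("●" * 20 + "だよ"))  # → "●●●●だよ"
```

A guard like this is cheap to run per text region and also bounds worst-case prompt length for online providers.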
IrisColt@reddit
S-source?
KageYume@reddit
Switching the detector model to "comic text & bubble detector" solves the issue for me. Thanks.
kaisurniwurer@reddit
Does it know which way to actually read the text?
When I tried manually, most of the models read left to right and then just forced it to make sense, even when I provided just the speech bubble, and even after I gave instructions on how to read it.
mayocream39@reddit (OP)
We have a hidden prompt that feeds the LLM with the text in the reading order; the order of the text matters.
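For context, "reading order" for Japanese manga means right-to-left, top-to-bottom. A rough sketch of how detected text boxes might be sorted before being fed to the LLM (purely illustrative; Koharu's actual ordering comes from its layout-analysis stage and may differ):

```python
def manga_reading_order(boxes):
    """Sort detected text boxes into Japanese manga reading order:
    rightmost column first, top-to-bottom within a column.

    boxes: list of (x, y, w, h) tuples in page coordinates.
    A hypothetical illustration, not Koharu's real logic.
    """
    # Primary key: descending right edge (rightmost box first);
    # secondary key: ascending y (topmost box first).
    return sorted(boxes, key=lambda b: (-(b[0] + b[2]), b[1]))

boxes = [(10, 10, 50, 80), (200, 10, 50, 80), (200, 120, 50, 80)]
print(manga_reading_order(boxes))  # right column top-to-bottom, then left
```

Getting this order right matters because the LLM translates the page's lines as one sequence, so a left-to-right ordering scrambles the dialogue flow.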
KageYume@reddit
I have some questions.
When I try holding Shift to select multiple pages in the Navigator, it doesn't let me, and there is no button to process multiple images.
I think it would be nice if users could select pages as processing targets. Then, users could start processing via a context menu (right-click) in the Navigator with a "Process selected images" entry.
Also, how can I continue processing unprocessed pages only? If the processing of a folder stops halfway, users might want to resume instead of re-processing already processed pages.
mayocream39@reddit (OP)
Unfortunately, both scenarios are not supported yet. We will improve this! There is also an issue to track it: https://github.com/mayocream/koharu/issues/515
For 1, I think we can easily support it; the backend already partially supports running the pipeline.
For 2, it should also be easy to implement; we only need to check whether a page has met the requirements of every engine, and if so, skip the step.
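The per-page completeness check described above could be sketched like this (stage names and the data shape are hypothetical, not Koharu's actual internals):

```python
# Hypothetical per-page completeness check for resuming a batch run.
# The stage names are illustrative; Koharu's real pipeline stages may differ.
REQUIRED_STAGES = ("detection", "ocr", "translation", "inpainting", "render")

def pages_to_process(pages: dict) -> list:
    """Return the pages missing at least one required stage.

    pages maps a page id to the set of stages already completed,
    so a resumed run only touches unfinished pages.
    """
    return [
        page for page, done in pages.items()
        if not set(REQUIRED_STAGES) <= done
    ]

status = {
    "001.png": {"detection", "ocr", "translation", "inpainting", "render"},
    "002.png": {"detection", "ocr"},
    "003.png": set(),
}
print(pages_to_process(status))  # → ['002.png', '003.png']
```

The nice property is that a crashed or cancelled batch job becomes resumable for free: re-running the batch simply filters down to the unfinished pages.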
KageYume@reddit
Thanks for the answer.
Again, I'm looking forward to the next update. <3
https://i.redd.it/slzj6ezfgrwg1.gif
-p-e-w-@reddit
Does it use (multi-page) image understanding to guide the translation, or does it simply find the speech bubbles and swap out the text?
In many comics, the visual story provides essential information without which an accurate or idiomatic translation is impossible.
mayocream39@reddit (OP)
We only feed the LLM with all the text on one single page to translate. I know this could be improved, and we will figure this out. An issue exists for this: https://github.com/mayocream/koharu/issues/508
Currently, the per-page translation result is acceptable; with Gemma4 31b, the translation is pretty good.
Formal_Scarcity_7861@reddit
Your work just... makes the world a better place.
mayocream39@reddit (OP)
Love & peace!
Stepfunction@reddit
Wow, this is awesome!
mayocream39@reddit (OP)
Thank you!
kentaromiura@reddit
Thanks to your code I was able to make https://github.com/kentaromiura/yonde a few months back. It was mostly to test open-source AI; readers of that kind already existed, but it was a fun experiment nevertheless.
mayocream39@reddit (OP)
I love your UI! Good job!
npquanh30402@reddit
Impressive.
More-Curious816@reddit
We used to pray for times like this
Now our prayers have been answered
trioh281jsnf@reddit
Panel text in sign-heavy scenes is the real stress test here, not the obvious speech bubbles. If it can keep vertical text and tiny sound effects from turning into soup, that's the part I'd actually be impressed by.
HuiMoin@reddit
Koharu was truly the best Blue Archive character to name this after.
mayocream39@reddit (OP)
Koharu the best girl!
havnar-@reddit
My vibecoded weekend project sensors betrayed me.
This is more than a gui wrapper around a prompt.
mayocream39@reddit (OP)
Yeah it’s not a weekend project, I have been working on it for one year. 🥺
ffgg333@reddit
Can it use just local models or can it be used with API models too?
ffgg333@reddit
Amazing!
kengenerals@reddit
Hi! I'm new to all this, but it looks really cool, so I want to give it a shot translating and tweaking a couple of old raws I have. I did a run-through once, but it was pretty slow; I didn't see that I had to install the AMD HIP SDK first.
After installing, do I have to do anything else in the koharu folder?
mayocream39@reddit (OP)
It's very late in my timezone; if you're still stuck with ZLUDA, you can join the Discord server and ask for help. 😉
mayocream39@reddit (OP)
So you are using an AMD GPU: you need to install the AMD HIP SDK, and you can run the koharu exe with the debug flag (--debug) to inspect the logs. The version of the AMD HIP SDK matters; if you are using a newer AMD GPU, use 7.x, otherwise use 6.x.
BlueCoatEngineer@reddit
Can I specify a model for it to use rather than downloading? I have a collection of models I already use with llama.cpp and don't want to have to re-fetch. Specifically, I want to tell it to use Qwen3.6-35B-A3B (the drop-down only lists Qwen3.5).
mayocream39@reddit (OP)
You can use an OpenAI-compatible API instead: open settings and put in the API URL, then you can see your models in the LLM dropdown.
Tenerezza@reddit
Looking very nice, gotta try it out as I haven't seen this one yet. Though in the demo video I've noticed that it seems to lack text detection outside speech bubbles; is this something that's planned for the future or already supported? (That's the main reason why I'm using https://github.com/meangrinch/MangaTranslator right now; granted, it's slow, but the end result is nice.)
Also, on the GitHub page I see that you currently support PaddleOCR, MangaOCR, and Mit. Does it also support vision-enabled LLMs? I ask because I'm having very good success with Gemma4 right now, feeding the text directly to it, as long as you feed one text bubble per image. The end result is many, many times better than MangaOCR, which is tbh quite terrible right now. I have yet to try PaddleOCR, so maybe it's not an issue though.
mayocream39@reddit (OP)
We use https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5 by default! It's a 0.9B vision LLM for OCR, and probably the best OCR model so far!
Tenerezza@reddit
Thanks, gave it a shot and it's quite good. The only issue I seem to have is occasional OCR repetition, but I also noticed that you're working on that.
Just one more question regarding the API / MCP: is it currently a work in progress? Many of the endpoints the manual talks about just don't seem to exist right now, nor did I get MCP to work. I was hoping to set up some batch work, as I usually crunch some translations overnight, but most of the API endpoints just return "not found".
mayocream39@reddit (OP)
MCP currently only supports a few operations. For the full API, you can inspect the OpenAPI spec here: https://github.com/mayocream/koharu/blob/main/ui/openapi.json, and you can generate an API client using https://github.com/openapitools/openapi-generator
The docs for the API and MCP are kinda out of date, but the OpenAPI spec is always up to date since our frontend uses it as well.
IrisColt@reddit
I've got this janky manga translator I made for personal use that ran on Open WebUI and Qwen 3.5 27B, and despite being super crude and just spitting out text, I thought it was the shit. I just saw yours and I'm bowing down to you... my implementation was pure garbage. THANKS!
Mayion@reddit
Can't wait for it to have a browser extension to translate in real-time the "manga" I am reading
IrisColt@reddit
"reading"
mayocream39@reddit (OP)
We have a GitHub issue for this; we want to integrate with the ComicReadScript project and are currently waiting for the author's response. <3
Caffdy@reddit
This is just incredible! I know this is a translation app, but would you consider adding a feature/tool to upscale or replace the raw Japanese text inside the bubbles with a higher-quality version (not even upscaling, just OCR'ing the text and putting it back)? Common image upscalers work well with the art, but they turn Japanese dialogue into mush. I'd love for that feature to be possible; there are many raw scans that honestly have been done very poorly (1200px or sometimes even less!). Text is so difficult to read on those scans, if not impossible.
ProposalOrganic1043@reddit
Why does this need to be done locally?
Caffdy@reddit
why not? or is this another case of r/lostredditors
mayocream39@reddit (OP)
If you pay for Gemini Nano Banana, it can probably translate some pages, but for NSFW, I don't think any provider supports it. Also, I don't want to build a paid service, so I built this fully local software.
ProposalOrganic1043@reddit
I meant translating once and saving it somewhere for people to access.
mayocream39@reddit (OP)
I think that's out of scope for this project; it's just a translation tool. You can export the results and share them anywhere.
Velocita84@reddit
Blue archive mentioned
Massive W for using PaddleOCR VL 1.5; I've also observed it being a serious cut above the rest for manga/Japanese text extraction while being extremely fast.
Altruistic_Heat_9531@reddit
Can I use my own llama.cpp and venv? I don't want to redownload all the libs again.
mayocream39@reddit (OP)
You can use the OpenAI-compatible API!
Altruistic_Heat_9531@reddit
I understand llama.cpp exposes an OpenAI API, but I mean I don't want to redownload llama.cpp and the whole CUDA toolkit again, since they're quite big.
mayocream39@reddit (OP)
It's about 1.47 GB total. You can't avoid redownloading them because CUDA and llama.cpp have different versions, and the version matters a lot. We have Rust bindings for them, and they're tied to a specific version. 😔
Altruistic_Heat_9531@reddit
I see, thanks.
turtleisinnocent@reddit
What an amazing project. Well done!
HugoCortell@reddit
This is one of the coolest tools I've seen.
AnonsAnonAnonagain@reddit
Holy smokes this looks amazing 🤩
KageYume@reddit
Wow, what great progress since the last time I saw your project, and thanks for implementing the feedback (OpenAI-compatible interface)
mayocream39@reddit (OP)
Thank you for the feedback! <3
KageYume@reddit
I have one suggestion:
It would be great if Koharu supported multiple profiles for custom prompts (save a prompt as a preset and load it from a list later).
mayocream39@reddit (OP)
Good point, I've created an issue to track it! https://github.com/mayocream/koharu/issues/539
KageYume@reddit
Thanks! I'm looking forward to it! :'D
Certain-Cod-1404@reddit
Would this be a plug and play tool for manhwa and manhua as well?
mayocream39@reddit (OP)
We have some users who use Koharu for manhwa and manhua. I think it works fine! If not, we definitely need to fix it.
Certain-Cod-1404@reddit
Either way man this is genuinely such a cool project and an actual good use of AI, congrats !
mayocream39@reddit (OP)
Thank you! <3
Altruistic_Voice_661@reddit
Best manga translator I've used so far! How do I pay you bro?
qiuyeforlife@reddit
I remember your project! Guess I'll try it out on the latest issue of xxxx (a naughty name), lol.
mayocream39@reddit (OP)
Big love! <3
gunkanreddit@reddit
Suggeeeeeeeeeeeeeeee! すげ〜 (Awesome!)
mayocream39@reddit (OP)
あざす❣ (Thanks!)
EncampedMars801@reddit
Have there been any updates since your last post? I recall trying it, but there was a real lack of manual options. One of the strengths of Ballons Translator is how much control you get: you can make textboxes wherever you want, then OCR their area, customize the font/size of the text in them, or resize/rotate them. Not to mention you can also inpaint wherever you want. It could theoretically be used just for quicker scanlation if you already knew the language (which I don't lol, I'm just making the point).
The problem with Ballons is that it's annoying to set up and the UI is janky as fuck. Last I tried Koharu, it certainly fixed that: downloading a compiled AppImage is much better than setting up a Python environment, and the UI was really clean. But it was almost too simple, without any of the manual control I describe above. Do you have any plans to implement that sort of thing?
Sorry for the long ass comment. I don't mean to disparage your project. It looks awesome, I just really want it to improve so I can use it instead of the really awful current options.
mayocream39@reddit (OP)
Thank you for actually using it! We are improving the editing experience. We have added more editing features, such as font size, color, text alignment, bold, and italic, and made the textbox move/resize more smoothly. We have also added undo/redo and multi-selection. I think the only missing feature is rotation, but our backend already supports it; we just need to implement it in the UI!
EncampedMars801@reddit
Well then I'll give it another shot some day!
MadPelmewka@reddit
Wow, holy shit. Honestly, I didn’t expect this post to be from the creator himself. I thought it was a project I hadn't heard of, but it turns out I’d already starred it on GitHub. The repo is missing a full video demo, plus the tool still looks pretty raw from what I can see. Good luck if you're going to keep working on it!
mayocream39@reddit (OP)
I will keep working on it! Thank you for the support! <3
MadPelmewka@reddit
If I were you, I’d check this project out: https://github.com/meangrinch/MangaTranslator
Anyway, I saw how your tool works in the demo. To be honest, I haven't run it myself yet because I didn't see any SFX (sound effects) translation on the GitHub page. To do that properly, you'd need to train a separate model on a massive dataset, something like RF-DETR or YOLO, specifically for segmentation. There's a dataset called manga109s that's perfect for this.
If you manage to train it, you should publish the model on Hugging Face for the community, or even submit the dataset and model to the deepghs team so they can host it. They already have a YOLO model for manga109s, but they didn't include SFX labels, and honestly, their model feels a bit outdated now. You could even implement SAHI (Slicing Aided Hyper Inference) for better results.
Moving on, this is less about efficiency and more about quality: personally, my first step would be running a local VLM to describe all the SFX, the text, and the actual action in the images. All that metadata would go into an LLM to analyze the scene. Then, in a new request, you'd feed the system prompt that analysis plus context from previous chapters (or even character bios scraped from the web). The end goal is to generate a coherent translation for the entire manga (or a large chunk of it) in one go.
As for the typesetting (text insertion), it's not that complicated anymore. You could even use Flux models for it, though it might be overkill/too resource-heavy, but the dev over at MangaTranslator has already implemented it.
Everything I’m suggesting is basically about improving translation context. It would significantly boost the quality, though it’ll obviously make the pipeline much slower (even without accounting for the SFX work).
mayocream39@reddit (OP)
There are so many features I didn't mention in this main thread. If you are interested, please take a look at the GitHub README. I spent almost one year polishing this project, and while there may be some bugs, overall, it is acceptable to me.
Worth mentioning: Koharu has full platform and GPU support, including NVIDIA and AMD. It's very hard to support broad hardware, but we did it!
Also, we have gained more contributors this year, and the community is very active!
stopbanni@reddit
So, is it a full inference engine? Can it also just connect to an OpenAI-compatible API?
mayocream39@reddit (OP)
Sure! The translation part is pluggable: you can use an OpenAI-compatible API, and we also have traditional Google Translate and DeepL. You can also tweak the system prompt if you use an LLM.
stopbanni@reddit
What part of it is not pluggable? I guess text replacing?
mayocream39@reddit (OP)
This is the settings page for Koharu. For every engine, it supports at least two different models, but the font detector, bubble segmenter, and renderer can't be replaced yet.
iTzNowbie@reddit
that’s so cool!!! nice job, OP
saito_zt81@reddit
Really love your work <3
mayocream39@reddit (OP)
Thank you! <3
ReXommendation@reddit
I was waiting for someone to make something like this.