OCR: what is the best way to extract data in JSON format from this old French book?

Posted by Wise_Stick9613@reddit | LocalLLaMA | 35 comments

As some of you may have guessed, what we have here is an old Bible.

I would like to extract the following information from the page:

{
    verse: number,
    verse_content: string,
    comment: string[]
}
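
Whichever OCR engine or vision model ends up producing the output, it probably helps to check each record against the shape above before trusting it. Here is a minimal sketch in Python, assuming the tool returns one JSON array of objects per page; the names `VerseRecord`, `parse_records`, and `to_json` are hypothetical helpers for illustration, not part of any OCR library.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class VerseRecord:
    """One extracted verse, mirroring the shape described above."""
    verse: int
    verse_content: str
    comment: List[str] = field(default_factory=list)


def parse_records(raw_json: str) -> List[VerseRecord]:
    """Parse a JSON array of verse objects and check each field's type.

    Assumption: the OCR/LLM step emits a JSON array, one object per verse.
    Raises ValueError on anything that does not match the expected shape.
    """
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of verse objects")

    records = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {i}: expected an object")
        verse = item.get("verse")
        content = item.get("verse_content")
        comment = item.get("comment", [])  # treat a missing comment as empty

        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(verse, int) or isinstance(verse, bool):
            raise ValueError(f"item {i}: 'verse' must be an integer")
        if not isinstance(content, str):
            raise ValueError(f"item {i}: 'verse_content' must be a string")
        if not isinstance(comment, list) or not all(
            isinstance(c, str) for c in comment
        ):
            raise ValueError(f"item {i}: 'comment' must be a list of strings")

        records.append(
            VerseRecord(verse, content.strip(), [c.strip() for c in comment])
        )
    return records


def to_json(records: List[VerseRecord]) -> str:
    """Serialize records back to JSON, keeping accented characters readable."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` keeps characters such as é or ñ as-is in the output instead of `\u` escapes, which matters for French, Spanish, and Italian text.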

I've played around with PaddleOCR a bit; I could probably make it work with that too, but I'm looking for something even simpler and more straightforward.

I don't think GLM-OCR will work because it doesn't support Latin (I need Latin, French, Spanish, Italian, and English).

I have an RTX 4060 with 12 GB of VRAM and 16 GB of RAM.

Thanks in advance.