Is it even possible?
Posted by Big-Manufacturer-808@reddit | Python | 22 comments
Hi. My company wants images of written text converted into invoices. I have used Google Lens for that, and now they want to add multi-language support. It should also translate product names written in other languages into English, and it should recognize the short forms of product names that people may write in local languages.
Python-ModTeam@reddit
Your post was removed for violating Rule #2. All posts must be directly related to the Python programming language. Posts pertaining to programming in general are not permitted. You may want to try posting in /r/programming instead.
fohrloop@reddit
Almost anything that humans can do can be automated if you have (1) the time and (2) the money. There are autonomous delivery robots (e.g. Starship Technologies), for example, which someone has had to program. But anything that is complicated and automated should also be expected to fail at some point, so it needs a mitigation strategy. Translating invoices and product names into different languages does not seem to be a really hard problem, but on the other hand it will not be a short task if you need to start from scratch and cannot pay for ready-made building blocks :)
Big-Manufacturer-808@reddit (OP)
Thanks for the info
lostinfury@reddit
This is an algorithmic problem. It's not going to be easy, but it won't be hard either. We had OCR technology long before the first LLM made waves, so let's not jump to LLMs as the first (and definitely more expensive) choice before using tools specifically designed for this job.
You know the layout of the receipts, so start by developing something with OpenCV or Tesseract that analyzes each section of the receipt and tries to extract the information found there. I'd suggest storing the results in a database like SQLite so that the information can be used to train a tiny LLM, or so that a subset of Tesseract can be trained just for your receipts.
The next steps would be to make it automated (segmentation, recognition, etc) and develop an API around it so that other tools can benefit from it.
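A minimal sketch of the first step, assuming a fixed receipt layout and that `opencv-python`, `pytesseract` and the Tesseract binary are installed. The `REGIONS` boxes, the table schema and the function names are made up for illustration, not taken from any existing project. Each region is cropped, thresholded, passed to Tesseract, and the raw text is stored in SQLite for later review or training.

```python
import sqlite3

# Hypothetical layout: field name -> (x, y, width, height) in pixels.
# These boxes are placeholders; measure them on your own receipts.
REGIONS = {
    "vendor": (0, 0, 600, 120),
    "items": (0, 120, 600, 500),
    "total": (300, 620, 300, 80),
}


def ocr_region(image_path, box, lang="eng"):
    """Crop one region of the receipt and run Tesseract on it."""
    # Imported lazily so the storage code below works without OCR installed.
    import cv2
    import pytesseract

    x, y, w, h = box
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    # Otsu thresholding often helps with uneven lighting on photographed paper.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang).strip()


def open_db(path=":memory:"):
    """Open (or create) the database that holds the raw OCR output."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fields ("
        "receipt TEXT, field TEXT, raw_text TEXT, "
        "PRIMARY KEY (receipt, field))"
    )
    return conn


def save_fields(conn, receipt, fields):
    """Store the raw OCR text per field; re-running a receipt overwrites it."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO fields VALUES (?, ?, ?)",
            [(receipt, name, text) for name, text in fields.items()],
        )


def process(conn, image_path):
    """OCR every configured region of one receipt image and persist the text."""
    fields = {name: ocr_region(image_path, box) for name, box in REGIONS.items()}
    save_fields(conn, image_path, fields)
    return fields
```

Usage would be something like `conn = open_db("receipts.db")` followed by `process(conn, "receipt_001.png")`. For other languages, Tesseract needs the matching language data installed and a different `lang` value.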
Big-Manufacturer-808@reddit (OP)
Oh, thanks for the insights. How long do you think this would take?
kosz85@reddit
You could try Gemini or another LLM, but it won't be very fast or cheap. It should support the main languages, though. Either way it's another PoC, so first gather all the requirements and use cases.
lostinfury@reddit
Geez, try OCR first before pulling out the big guns and going full LLM.
kosz85@reddit
In our case it was not a big gun; it was required to read documents to assume return values. Each document was formatted differently and used different words to describe the same values :) Vague description due to NDA.
I just suggested that this is also an option, especially for handwriting. But it's not cheap, so I would try OCR first and then an LLM as a last resort.
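One way the "OCR first, LLM as a last resort" idea could look, as a rough sketch: use Tesseract's per-word confidence to decide whether a page needs the expensive fallback. The threshold of 60 and the function names are assumptions to tune on real data. The input is assumed to be the dictionary that `pytesseract.image_to_data` returns with `output_type=Output.DICT`.

```python
def mean_confidence(ocr_data):
    """Average Tesseract word confidence, skipping non-word boxes (conf -1)."""
    scores = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings here
        if conf >= 0 and text.strip():
            scores.append(conf)
    return sum(scores) / len(scores) if scores else 0.0


def needs_fallback(ocr_data, threshold=60.0):
    """True when OCR looks too unreliable and a costlier method should be tried."""
    return mean_confidence(ocr_data) < threshold


def read_words(image_path, lang="eng"):
    """Run Tesseract and return word-level data (requires pytesseract, Pillow)."""
    # Imported lazily so the scoring helpers work without OCR installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path), lang=lang, output_type=pytesseract.Output.DICT
    )
```

Only the pages where `needs_fallback(read_words(path))` is true would then go to the LLM or to a human, which keeps the cost down.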
kosz85@reddit
Ah, by the way, suggesting Gemini was not random: we parse unstructured data from PDFs, and it worked better for us, with almost no hallucinations and strict results.
JamzTyson@reddit
Your company could quickly end up in deep and unpleasant stuff if that is not 100% reliable. Even a tiny error such as misreading "$1000.00" as "$100000" might raise an unhappy response from a client.
Another red flag. It may be a good idea to start looking for another job.
akasi2@reddit
Maybe Tesseract on GitHub could help
violentlymickey@reddit
You could probably use OpenCV for that. There are many libraries for translation as well. It depends on how accurate you need it to be.
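For product names and short forms, a plain lookup table with fuzzy matching may go a long way before any translation library is needed. This is a swapped-in technique, not something from the thread, and the alias entries below are hypothetical examples (romanized Hindi words); a real table would come from your own product catalogue. It uses only `difflib` from the standard library.

```python
import difflib

# Hypothetical alias table: local-language names and short forms mapped to a
# canonical English product name. Fill this from your own catalogue.
ALIASES = {
    "chawal": "rice",
    "chwl": "rice",
    "cheeni": "sugar",
    "sgr": "sugar",
    "doodh": "milk",
}


def to_english(name, cutoff=0.75):
    """Map an OCR'd product name to English, or None if nothing is close enough."""
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    if key in ALIASES.values():
        return key  # already a canonical English name
    # Tolerate small OCR or spelling errors by taking the closest known alias.
    match = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[match[0]] if match else None
```

Names that come back as `None` are the ones to send to a translation step or to a person, and each resolved case can be added back to the table.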
LargeSale8354@reddit
OCR and Apache Tika.
Not perfect. Also be careful with currency conversions and tax calculations.
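A small sanity check can catch some misread amounts before they reach a client. The sketch below assumes amounts use a comma as the thousands separator and a dot as the decimal point, which will not hold for every locale, and the function names are illustrative. It parses amounts with `Decimal` to avoid float rounding and checks that line items plus tax add up to the stated total.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse an OCR'd amount such as '$1,000.00'; return None if malformed."""
    # Assumption: ',' is a thousands separator and '.' the decimal point.
    cleaned = re.sub(r"[^\d.,-]", "", text).replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def totals_match(line_amounts, tax, total, tolerance=Decimal("0.01")):
    """Check that line items plus tax equal the invoice total within a tolerance."""
    return abs(sum(line_amounts, Decimal("0")) + tax - total) <= tolerance
```

An invoice that fails `totals_match` should be flagged for manual review rather than corrected automatically, since the check cannot tell which number was misread.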
DeDenker020@reddit
I was searching the same.
Free and/or open source, no way.
You should find a different option than written text.
Perhaps tick boxes? So you can scan the boxes ticked.
Ted_desolation@reddit
Could AWS Textract be used?
Big-Manufacturer-808@reddit (OP)
No, it should all be open source. We can't pay for anything.
KingsmanVince@reddit
Leave
KingsmanVince@reddit
r/AskProgramming
r/learnprogramming
mon_key_house@reddit
Everything is possible if you have the money. If not, well… Your manager should know this.
Big-Manufacturer-808@reddit (OP)
I know it's not even a question, but just asking: can I find anything similar to this on GitHub?
uhynb@reddit
I'd say step 1: get the requirements clear and manage expectations. Going from a Google Lens PoC to a multi-language, handle-everything machine will not be quick. Step 2: see if there is another way to get the text without having to go through an image first. If all else fails, there are some cloud OCR solutions, but they are expensive. There are local/open source alternatives as well, but they will be less reliable, require more maintenance, and be slower. It's a bunch of tradeoffs to balance against the actual business case, which brings us back to step 1: requirements.
Big-Manufacturer-808@reddit (OP)
The manager asked me to search on GitHub. I know we can find anything like this, even closer. Also, I can't spend much time building it myself. So are there any open source projects I can use?