I’m looking for advice from people who have handled very large Excel/CSV imports in production systems.
Posted by just_human_right@reddit | learnprogramming | View on Reddit | 13 comments
Current requirement from my client:
Upload 3 Excel sheets
One sheet contains 150k+ rows
Another contains 40k+ rows
Data needs to be inserted into multiple relational tables based on hierarchical categorization
Each text field may also need translation into multiple languages (up to 15 languages, though realistically 3–5 most of the time)
The biggest challenge is making the import:
Reliable
Scalable
Consistent
Fault tolerant
User friendly with proper progress/error handling
Current approach:
Process/import data in chunks instead of loading the full file into memory
Import only the source language first (English)
Save core data into DB
Push translation processing into background jobs/queues (npm BullMq Package)
This works better than translating during import, but once the dataset becomes huge, even the background translation jobs become extremely heavy and time consuming.
I’d really like to hear how experienced engineers handle cases like this in real-world production systems.
Tech stack is mainly Node.js, Next js and MySQL for databse
Celestial_aki@reddit
Few patterns that worked for me on prod CSV ingest:
csv-parseStream API orpapaparsestep callback. RAM stays flat regardless of file size.COPY FROM STDIN(or MySQLLOAD DATA LOCAL INFILE) is 20â50Ã faster than per-row INSERT. ORM only for reads.(file_hash, row_number)upsert key so retries don't double-write.mottyay@reddit
What is wrong with the way you're doing it?
just_human_right@reddit (OP)
Honestly, the import side itself is working reasonably well.
My main concern is the translation phase afterward, since the number of translation operations grows very quickly with multiple fields and multiple languages.
I was mainly curious whether there are better architectural approaches people use in real-world systems for handling this kind of large-scale async translation workload.
I’ve already received some really useful suggestions here, so at this point I’m mostly trying to validate and fine-tune my current approach with more experienced opinions.
mottyay@reddit
Not sure if it was already suggested, but depending on your data, you may be able to cache some of the results also. If the data repeats often or the options for each field are fixed then you can use some sort of lookup table instead of translating everything
kawaidesuwuu@reddit
is this a one time task?
just_human_right@reddit (OP)
Yes
kawaidesuwuu@reddit
Your current approach is fine if you're really worry about reliability just use a durable workflow like inngest or using effect-ts.
HashDefTrueFalse@reddit
Very large :D I was expecting a billion rows. 200k rows is quite literally nothing. The world is your oyster. Run the worst script you can get an LLM to shit out, if you like. I would just take the files from the user and return a response, then store the file(s) and offload it all to a fault-tolerant job queue. Each job processes a row or chunk of rows, either commits the db transaction(s) or rolls back on failure so that you always know what has/has not completed. I'd try to work with the translation APIs as far as possible, e.g. if there's some batch submission mechanism then I'd use that, etc. How often is this happening and how fresh do you need the data to be to users? Rather, what is the actual problem or constraint here?
Forsaken-File9993@reddit
chunking is good start but you might want to look at streaming the csv parsing instead of loading chunks into memory, and for translations maybe batch them by language rather than by row so you can hit translation apis more efficiently
just_human_right@reddit (OP)
Thanks for the suggestion, that’s actually a really interesting optimization approach and I’d definitely like to experiment with it.
szank@reddit
So, what's the problem? Is this a homework or something?
This is just way too vague.
150000 rows with let's say 2KB of data in each is less than 300MB in ram. Are you running it on raspbery pi zero or something ?
Your current approach is fine. How much time does loading the inputs take ? Just loading, no translations?
How much time each translation take? How are you doign the translation?
I understand this is r/llearnprogramming, but help us help you.
just_human_right@reddit (OP)
Fair point, I probably explained it too broadly.
So to answer your questions:
“How much time does loading the inputs take? Just loading, no translations?” => The import itself usually takes around 8–10 minutes while processing data in chunks.
“How much time each translation take? How are you doing the translation?” => Each translation request usually takes around 200ms–400ms.
Now when this gets multiplied across real production data, the number of translation API calls becomes extremely large very quickly, especially because:
One row can contain multiple translatable fields
Data may need translation into multiple languages
Currently, I’m using the bing-translate-api npm package (free approach for now).
I did research some bulk translation APIs/services, but most of them are paid, so currently I’m sticking with this setup
Eruner_SK@reddit
What about splitting data based on relational tables? E.g. load the most-parent object, and then continue with children.