I’m looking for advice from people who have handled very large Excel/CSV imports in production systems.

Posted by just_human_right@reddit | learnprogramming | View on Reddit | 13 comments

Current requirement from my client:

Upload 3 Excel sheets

One sheet contains 150k+ rows

Another contains 40k+ rows

Data needs to be inserted into multiple relational tables based on hierarchical categorization

Each text field may also need translation into multiple languages (up to 15 languages, though realistically 3–5 most of the time)

The biggest challenge is making the import:

Reliable

Scalable

Consistent

Fault tolerant

User friendly with proper progress/error handling

Current approach:

Process/import data in chunks instead of loading the full file into memory

Import only the source language first (English)

Save core data into DB

Push translation processing into background jobs/queues (npm BullMq Package)

This works better than translating during import, but once the dataset becomes huge, even the background translation jobs become extremely heavy and time consuming.

I’d really like to hear how experienced engineers handle cases like this in real-world production systems.

Tech stack is mainly Node.js, Next js and MySQL for databse

[-]

Celestial_aki@reddit

Few patterns that worked for me on prod CSV ingest:

Stream-parse, don't chunk-load â csv-parse Stream API or papaparse step callback. RAM stays flat regardless of file size.
Bypass the ORM for the write path. Postgres COPY FROM STDIN (or MySQL LOAD DATA LOCAL INFILE) is 20â50Ã faster than per-row INSERT. ORM only for reads.
Dedupe translation jobs by content hash before queueing â large imports have massive string repetition, easily 60â80% reduction in translation cost/time.
Make the pipeline idempotent: (file_hash, row_number) upsert key so retries don't double-write.
UI progress from a separate read-side counter, not from the worker â survives worker restarts and keeps WS payloads tiny.

[-]

mottyay@reddit

What is wrong with the way you're doing it?

[-]

just_human_right@reddit (OP)

Honestly, the import side itself is working reasonably well.

My main concern is the translation phase afterward, since the number of translation operations grows very quickly with multiple fields and multiple languages.

I was mainly curious whether there are better architectural approaches people use in real-world systems for handling this kind of large-scale async translation workload.

I’ve already received some really useful suggestions here, so at this point I’m mostly trying to validate and fine-tune my current approach with more experienced opinions.

[-]

Not sure if it was already suggested, but depending on your data, you may be able to cache some of the results also. If the data repeats often or the options for each field are fixed then you can use some sort of lookup table instead of translating everything

[-]

kawaidesuwuu@reddit

is this a one time task?

[-]

just_human_right@reddit (OP)

Yes

[-]

kawaidesuwuu@reddit

Your current approach is fine if you're really worry about reliability just use a durable workflow like inngest or using effect-ts.

[-]

HashDefTrueFalse@reddit

Very large :D I was expecting a billion rows. 200k rows is quite literally nothing. The world is your oyster. Run the worst script you can get an LLM to shit out, if you like. I would just take the files from the user and return a response, then store the file(s) and offload it all to a fault-tolerant job queue. Each job processes a row or chunk of rows, either commits the db transaction(s) or rolls back on failure so that you always know what has/has not completed. I'd try to work with the translation APIs as far as possible, e.g. if there's some batch submission mechanism then I'd use that, etc. How often is this happening and how fresh do you need the data to be to users? Rather, what is the actual problem or constraint here?

[-]

Forsaken-File9993@reddit

chunking is good start but you might want to look at streaming the csv parsing instead of loading chunks into memory, and for translations maybe batch them by language rather than by row so you can hit translation apis more efficiently

[-]

just_human_right@reddit (OP)

Thanks for the suggestion, that’s actually a really interesting optimization approach and I’d definitely like to experiment with it.

[-]

szank@reddit

So, what's the problem? Is this a homework or something?
This is just way too vague.
150000 rows with let's say 2KB of data in each is less than 300MB in ram. Are you running it on raspbery pi zero or something ?

Your current approach is fine. How much time does loading the inputs take ? Just loading, no translations?
How much time each translation take? How are you doign the translation?

I understand this is r/llearnprogramming, but help us help you.

[-]

just_human_right@reddit (OP)

Fair point, I probably explained it too broadly.

So to answer your questions:

“How much time does loading the inputs take? Just loading, no translations?” => The import itself usually takes around 8–10 minutes while processing data in chunks.

“How much time each translation take? How are you doing the translation?” => Each translation request usually takes around 200ms–400ms.

Now when this gets multiplied across real production data, the number of translation API calls becomes extremely large very quickly, especially because:

One row can contain multiple translatable fields

Data may need translation into multiple languages

Currently, I’m using the bing-translate-api npm package (free approach for now).

I did research some bulk translation APIs/services, but most of them are paid, so currently I’m sticking with this setup

[-]

Eruner_SK@reddit

What about splitting data based on relational tables? E.g. load the most-parent object, and then continue with children.