Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming.

Posted by Marchese_QuantLab@reddit | Python

I’ve been stuck in "data engineering hell" for the last few weeks. I had about 10 years of ES Futures tick data (from 2016 to now) sitting in a mountain of messy CSVs. Total row count: ~2.2 billion.

If you’ve ever tried to run a vectorized backtest on CSVs of that size, you know the pain. My I/O was a disaster and I was basically spending more time waiting for files to load than actually doing research.

I finally moved everything over to Apache Parquet using Polars, and man, I should have done this sooner.
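For anyone curious, the conversion step was basically a lazy CSV scan streamed straight into Parquet. Here's a stripped-down sketch of the idea (the paths, file layout, and options are placeholders, not my exact setup):

```python
import polars as pl
from pathlib import Path

RAW_DIR = Path("data/raw_csv")    # placeholder: wherever the messy CSVs live
OUT_DIR = Path("data/parquet")    # placeholder: destination for the Parquet files
OUT_DIR.mkdir(parents=True, exist_ok=True)

for csv_file in sorted(RAW_DIR.glob("*.csv")):
    (
        pl.scan_csv(csv_file, try_parse_dates=True)  # lazy: nothing loaded into RAM yet
        .sink_parquet(
            OUT_DIR / f"{csv_file.stem}.parquet",
            compression="zstd",                      # decent ratio/speed trade-off for tick data
        )
    )
```

Because `scan_csv` + `sink_parquet` stream the data through, you never need to hold a full CSV in memory, which is what kept killing me before.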

A few things I learned (the hard way):

Now I can query specific contract slices in seconds instead of minutes. It’s a game changer for my workflow.
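The query pattern now is just a lazy scan over the Parquet files plus a filter, and Polars only reads the row groups it actually needs. Rough sketch (the column names and contract code are made up for the example):

```python
import polars as pl
from datetime import datetime

# Assumed schema: symbol, timestamp, price, size -- adjust to your own columns.
lf = pl.scan_parquet("data/parquet/*.parquet")   # lazy scan over the whole dataset

contract_slice = (
    lf.filter(
        (pl.col("symbol") == "ESH2024")          # hypothetical contract code
        & pl.col("timestamp").is_between(
            datetime(2024, 1, 2), datetime(2024, 3, 15)
        )
    )
    .select("timestamp", "price", "size")
    .collect()
)
print(contract_slice.shape)
```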

Curious to hear from others working with high-frequency data: are you guys still using HDF5/SQL for this scale, or has everyone moved to the Parquet/DuckDB stack already?