Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming.

Posted by Marchese_QuantLab@reddit | Python

I’ve been stuck in "data engineering hell" for the last few weeks. I had about 10 years of ES Futures tick data (from 2016 to now) sitting in a mountain of messy CSVs. Total row count: ~2.2 billion.

If you’ve ever tried to run a vectorized backtest on CSVs of that size, you know the pain. My I/O was a disaster and I was basically spending more time waiting for files to load than actually doing research.

I finally moved everything over to Apache Parquet using Polars, and man, I should have done this sooner.
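For anyone curious, the conversion step was basically a lazy CSV scan streamed straight into Parquet. Here's a stripped-down sketch of the idea (the paths, file layout, and options are placeholders, not my exact setup):

```python
import polars as pl
from pathlib import Path

RAW_DIR = Path("data/raw_csv")    # placeholder: wherever the messy CSVs live
OUT_DIR = Path("data/parquet")    # placeholder: destination for the Parquet files
OUT_DIR.mkdir(parents=True, exist_ok=True)

for csv_file in sorted(RAW_DIR.glob("*.csv")):
    (
        pl.scan_csv(csv_file, try_parse_dates=True)  # lazy: nothing loaded into RAM yet
        .sink_parquet(
            OUT_DIR / f"{csv_file.stem}.parquet",
            compression="zstd",                      # decent ratio/speed trade-off for tick data
        )
    )
```

Because `scan_csv` + `sink_parquet` stream the data through, you never need to hold a full CSV in memory, which is what kept killing me before.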

A few things I learned (the hard way):

Now I can query specific contract slices in seconds instead of minutes. It’s a game changer for my workflow.
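The query pattern now is just a lazy scan over the Parquet files plus a filter, and Polars only reads the row groups it actually needs. Rough sketch (the column names and contract code are made up for the example):

```python
import polars as pl
from datetime import datetime

# Assumed schema: symbol, timestamp, price, size -- adjust to your own columns.
lf = pl.scan_parquet("data/parquet/*.parquet")   # lazy scan over the whole dataset

contract_slice = (
    lf.filter(
        (pl.col("symbol") == "ESH2024")          # hypothetical contract code
        & pl.col("timestamp").is_between(
            datetime(2024, 1, 2), datetime(2024, 3, 15)
        )
    )
    .select("timestamp", "price", "size")
    .collect()
)
print(contract_slice.shape)
```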

Curious to hear from others working with high-frequency data: are you guys still using HDF5/SQL for this scale, or has everyone moved to the Parquet/DuckDB stack already?