From 5x slower to 6x faster: what I learned rewriting Python's csv module in Rust

Posted by Kitchen_Dig4979 | r/Python

I've been exploring Rust+PyO3 for Python libraries and decided to build a csv module replacement. Seemed straightforward — csv parsing is a solved problem in Rust (BurntSushi's csv crate is excellent), PyO3 makes bindings easy, should be 10-50x faster, right?

My first version was 5x SLOWER than stdlib.

The problem isn't parsing — it's Python object creation. stdlib csv is written in C, and both C and Rust hit the same wall: every row needs N calls to PyUnicode_New() to create Python strings. Doesn't matter how fast you parse if you're creating 5 million Python objects.

Here's what I tried and what actually worked:

❌ Naive PyO3 wrapper (0.2x) — per-row Python↔Rust boundary crossing killed everything

❌ Batch buffering (0.3x → 1.0x) — helped but still creating all Python strings eagerly

✅ pyo3_disable_reference_pool — free 10-15% by removing PyO3's global reference pool overhead

✅ intern!() for dict keys — DictReader headers are repeated 100K times, intern once
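
For context, `intern!()` is PyO3's macro for cached, reusable Python string objects. The same idea exists in pure Python as `sys.intern`; a tiny sketch of why it pays off for DictReader-style rows (field names here are made up):

```python
import sys

# Intern the header strings once; every row dict then reuses those
# exact key objects. Pointer-equal keys make dict construction and
# lookups cheaper than re-creating the key strings per row.
headers = [sys.intern(h) for h in ["name", "email", "age"]]

def make_row(values):
    # Each row shares the interned header objects as its keys.
    return dict(zip(headers, values))

row1 = make_row(["ada", "a@x.io", "36"])
row2 = make_row(["bob", "b@x.io", "41"])
```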

✅ Raw CPython FFI — bypass PyO3 safety wrappers, call PyUnicode_FromStringAndSize directly

✅ SharedBuffer architecture (→ 3.9x) — parse entire file into one Vec in Rust, Row objects hold just an Arc pointer + row index. Zero allocations per row.
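
The layout is easier to see in Python terms. This is a hypothetical sketch (class and method names are mine, not zcsv's internals), with naive comma splitting and no quoting, just to show the one-buffer-plus-offsets shape:

```python
class SharedBuffer:
    """The whole file lives in one buffer; each row stores only
    (start, end) offsets per field, so rows cost no allocations."""

    def __init__(self, text):
        self.data = text
        self.rows = []  # list of lists of (start, end) spans into self.data
        pos = 0
        for line in text.splitlines(keepends=True):
            stripped = line.rstrip("\r\n")
            spans, start = [], pos
            for field in stripped.split(","):  # naive: no quote handling
                spans.append((start, start + len(field)))
                start += len(field) + 1  # skip the delimiter
            self.rows.append(spans)
            pos += len(line)

    def field(self, row, col):
        # A field string is sliced out of the shared buffer on demand.
        s, e = self.rows[row][col]
        return self.data[s:e]
```

In the Rust version the buffer is behind an `Arc`, so a Row is literally a pointer plus an integer.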

✅ Cursor pattern (→ 4.6x) — instead of creating a new #[pyclass] Row per iteration (Py::new() costs ~0.9µs), reuse one object and increment an index. Check refcount — if someone saved a reference, clone on demand.
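
The cursor trick can be sketched in plain Python using `sys.getrefcount` to stand in for the C-level refcount check. Names are hypothetical and the exact refcount baseline in real code differs (e.g. a `for` loop's variable holds an extra reference at call time), but the clone-on-demand logic is the same:

```python
import sys

class Row:
    """Stand-in for the #[pyclass] Row."""
    __slots__ = ("buf", "index")
    def __init__(self, buf, index):
        self.buf = buf
        self.index = index

class CursorReader:
    def __init__(self, buf, nrows):
        self.buf, self.nrows = buf, nrows
        self.current = Row(buf, -1)
        self.next_index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_index >= self.nrows:
            raise StopIteration
        # Baseline refs to self.current: the attribute itself plus
        # getrefcount's temporary argument = 2. Anything above that
        # means the caller saved the row, so detach before advancing.
        if sys.getrefcount(self.current) > 2:
            self.current = Row(self.buf, self.next_index)  # clone on demand
        else:
            self.current.index = self.next_index  # reuse: zero allocations
        self.next_index += 1
        return self.current
```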

✅ Lazy field access — Row.__getitem__ creates a PyString only when you access that field. Read 2 columns out of 50? Create 2 strings, not 50.
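
Lazy materialization in miniature. This hypothetical `LazyRow` (no caching, plus a counter just to make the effect visible) creates a field string only when that column is indexed:

```python
class LazyRow:
    """Fields exist only as (start, end) spans until accessed;
    __getitem__ slices a string out of the shared buffer on demand."""

    def __init__(self, data, spans):
        self.data = data      # shared file buffer
        self.spans = spans    # (start, end) per column
        self.created = 0      # how many strings we actually materialized

    def __getitem__(self, col):
        s, e = self.spans[col]
        self.created += 1     # the only place a field string is born
        return self.data[s:e]
```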

Final results (100K rows):

- reader(): 4.6x faster than stdlib

- DictReader(): 6.0x faster

- writer(): 1.4x faster

The key insight: you can't beat CPython's object creation cost, but you can avoid it. The cursor pattern + lazy fields means most iterations create zero Python objects until you actually need the data.

Also added features no other Python csv library has:

- CSV injection protection (CWE-1236) — writer escapes fields that start with =, +, -, or @, per OWASP's CSV injection guidance

- RFC 4180 strict validation mode

- Delimiter and encoding auto-detection
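
On the injection protection, here's a minimal sketch of the escaping rule, assuming a single-quote prefix (one common OWASP-recommended option; I don't know zcsv's exact output, so treat the `'` as an assumption):

```python
def escape_formula(field):
    """Neutralize formula injection (CWE-1236): fields starting with
    =, +, -, or @ would be executed as formulas by spreadsheet apps,
    so prefix them with a quote to force text interpretation."""
    if field[:1] in ("=", "+", "-", "@"):
        return "'" + field
    return field
```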

PyPI: pip install zcsv

Would love feedback, especially from anyone who's hit similar PyO3 performance walls.