Is mitigating FastAPI event loop I/O overhead via PyO3 worth the FFI complexity? (Benchmarks inside)

Posted by mordechaihadad@reddit | Python

When you run high-concurrency rate limiting inside FastAPI, you are usually forcing Python's single-threaded event loop to spend precious time on network driver I/O just to verify a token before the request even hits the application logic.
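To make the pattern concrete, here is a rough sketch of an in-loop rate-limit check, using only the standard library. `FakeRedis`, `allow`, and the key format are made up for illustration and are not taken from rustgate or any Redis client; the `asyncio.sleep` call stands in for the network round trip the event loop has to wait on for every request.

```python
import asyncio
import time
from typing import Dict, List, Optional


class FakeRedis:
    """Hypothetical stand-in for a Redis client; sleeps to mimic a network round trip."""

    def __init__(self, rtt_seconds: float = 0.001) -> None:
        self.rtt_seconds = rtt_seconds
        self.counters: Dict[str, int] = {}

    async def incr(self, key: str) -> int:
        await asyncio.sleep(self.rtt_seconds)  # simulated network I/O
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]


async def allow(
    redis: FakeRedis,
    client_id: str,
    limit: int,
    window_s: int = 60,
    now: Optional[float] = None,
) -> bool:
    """Fixed-window check: one counter per client per time window."""
    current = time.time() if now is None else now
    window = int(current // window_s)
    count = await redis.incr(f"rl:{client_id}:{window}")
    return count <= limit


async def demo() -> List[bool]:
    redis = FakeRedis()
    # Pin `now` so all five calls land in the same window.
    return [await allow(redis, "client-a", limit=3, now=0.0) for _ in range(5)]


results = asyncio.run(demo())
print(results)
```

Every request pays that `await` before any handler code runs, which is the cost the post is trying to move out of Python.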

I wanted to see how cleanly I could isolate the Redis network layer outside of Python, so I built rustgate using PyO3 and a multi-threaded Tokio driver.

Disclaimer: This is a proof of concept. It's tied to another experimental crate I am working on (axum-rate-limiter), so it's not very configurable or abstracted as of now. Could you use it in production? Probably, but why?

That being said, the raw numbers under a 100-concurrency flood on a heavy, dynamically rerouted endpoint turned out pretty good:

Pushed 1,128 req/sec without dropping a connection.

Fastest response hit 15.3 ms.

Fails closed, returning immediate 429 rejections to protect downstream application logic.
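For anyone unfamiliar with the term, a minimal sketch of "fail closed" might look like the following. This is an assumption about the general technique, not rustgate's actual code: the post does not say how backend errors are handled, and `gate` and `backend_down` are hypothetical names.

```python
from http import HTTPStatus
from typing import Callable


def gate(check: Callable[[], bool]) -> HTTPStatus:
    """Return 200 only when the limiter explicitly allows the request."""
    try:
        allowed = check()
    except Exception:
        # Fail closed: a broken limiter backend is treated as a rejection.
        return HTTPStatus.TOO_MANY_REQUESTS
    return HTTPStatus.OK if allowed else HTTPStatus.TOO_MANY_REQUESTS


def backend_down() -> bool:
    """Hypothetical check that simulates an unreachable limiter backend."""
    raise ConnectionError("limiter backend unreachable")


print(gate(lambda: True))   # limiter allows the request
print(gate(lambda: False))  # over the limit
print(gate(backend_down))   # backend error, still rejected
```

The alternative, failing open, would let traffic through whenever the limiter is unavailable, which is the opposite of protecting downstream logic.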

The cool part: I benched a naked, no-op /health endpoint (literally just returning {"status": "ok"}) on the same machine, and it maxed out at 1,496 req/sec.

The fact that crossing FFI boundaries, handling memory pinning, and doing a multi-threaded Tokio-to-Redis round trip costs only ~370 req/s suggests that the Rust integration adds almost no overhead.
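As a quick sanity check on the two throughput figures, the gap can be expressed as a percentage and as rough per-request time. The conversion to milliseconds is a back-of-the-envelope assumption: it treats 1/throughput as the time spent per request, which only holds if throughput is bounded by serialized per-request work.

```python
limited_rps = 1128   # rate-limited endpoint, from the post
baseline_rps = 1496  # no-op /health endpoint, from the post

drop_rps = baseline_rps - limited_rps
drop_pct = 100 * drop_rps / baseline_rps

# Assumption: 1/rps approximates the serialized time spent per request.
extra_ms = (1 / limited_rps - 1 / baseline_rps) * 1000

print(f"throughput drop: {drop_rps} req/s ({drop_pct:.1f}% of baseline)")
print(f"extra time per request: {extra_ms:.3f} ms")
```

Keep in mind the comparison is between a no-op endpoint and a heavy, dynamically rerouted one, so not all of the gap is necessarily attributable to the Rust layer.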

I’ve dropped the GitHub link and the core architectural layout in the comments section below to keep this thread focused on the performance discussion.