Why async-native matters in LLM frameworks and why most get it wrong (with benchmarks)

Posted by MammothChildhood9298@reddit | Python | View on Reddit | 8 comments

Been thinking about the async correctness problem in LLM frameworks after profiling several deployments. Wanted to share what I found because I don't see this discussed enough.

The hidden problem: fake async

Most popular frameworks started sync and bolted async on later. The result is run_in_executor hiding blocking calls under the hood. You think you're running async; you're actually dispatching to a thread pool.
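To make the distinction concrete, here's a minimal sketch of the two patterns. The function names and the 10ms sleep are illustrative stand-ins, not any specific framework's code:

```python
import asyncio
import time

# "Fake async": a blocking call wrapped in a coroutine. This is the
# sync-first pattern -- the event loop just hands work to a thread pool,
# and each in-flight call occupies a whole thread.
def blocking_call() -> str:
    time.sleep(0.01)  # stands in for a blocking socket read
    return "done"

async def fake_async_call() -> str:
    loop = asyncio.get_running_loop()
    # Looks async to the caller, but it's a thread-pool dispatch.
    return await loop.run_in_executor(None, blocking_call)

# True async: the coroutine yields to the event loop while waiting,
# so thousands of these can be in flight on a single thread.
async def true_async_call() -> str:
    await asyncio.sleep(0.01)  # stands in for a non-blocking socket read
    return "done"
```

From the caller's side the two are indistinguishable, which is exactly why the problem stays hidden until you profile under load.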

This matters a lot at scale:

True async at 50 concurrent requests: ~96-97% theoretical throughput
Fake async (run_in_executor):         ~60-70% depending on I/O pattern
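You can reproduce the shape of this gap (not the exact percentages, which depend on your I/O pattern) with a toy harness. The 50-request count, 8-thread pool, and 50ms I/O time below are illustrative assumptions:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

REQUESTS = 50
IO_TIME = 0.05  # seconds; stands in for one model/API round trip

def blocking_io() -> None:
    time.sleep(IO_TIME)

async def run_fake_async() -> float:
    # A capped thread pool: requests queue behind the pool instead of
    # overlapping freely, so wall time grows with REQUESTS / max_workers.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=8) as pool:
        start = time.perf_counter()
        await asyncio.gather(
            *(loop.run_in_executor(pool, blocking_io) for _ in range(REQUESTS))
        )
        return time.perf_counter() - start

async def run_true_async() -> float:
    # All 50 waits overlap on one thread; wall time stays near IO_TIME.
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(IO_TIME) for _ in range(REQUESTS)))
    return time.perf_counter() - start
```

With these numbers the thread-pool version takes roughly ceil(50/8) x 50ms of wall time while the true-async version stays near one I/O round trip, which is the mechanism behind the throughput gap above.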

The cold start problem nobody talks about

In serverless LLM deployments, dependency count is a direct tax:

2  dependencies:  ~80ms cold start
43 dependencies:  ~1,100ms cold start
67 dependencies:  ~2,400ms cold start
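If you want to measure this tax yourself, Python ships `python -X importtime` for a per-module breakdown, or you can time a fresh interpreter directly. A rough sketch (the numbers it prints are a proxy for import cost only; real serverless cold starts add runtime and container init on top):

```python
import subprocess
import sys
import time

def cold_import_ms(module: str) -> float:
    """Time a fresh interpreter importing `module` -- a rough proxy
    for the per-dependency cold-start tax."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return (time.perf_counter() - start) * 1000

# Compare a light stdlib import against whatever heavy chain you ship:
# print(cold_import_ms("json"))
```

Running this over your actual dependency tree is a quick way to find which imports dominate the scale-from-zero path.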

Every scale-from-zero event pays this. For latency-sensitive apps, this is the difference between responsive and broken.

The traceback problem

Deep abstraction layers feel clean until 3am in production. An 8-line traceback vs a 47-line one with RunnableSequence.__call__ chains is not a style preference; it's mean time to recovery.
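The mechanism is simple to demonstrate: every wrapper layer adds a frame to every traceback that passes through it. A minimal sketch (the three-layer nesting is an illustrative stand-in for a framework's call chain):

```python
import traceback

# Deep wrapper design: each abstraction layer adds a frame
# to every traceback that crosses it.
def layer3(): raise ValueError("model call failed")
def layer2(): return layer3()
def layer1(): return layer2()
def deep_call(): return layer1()

# Flat design: the same failure surfaces one frame from the caller.
def flat_call(): raise ValueError("model call failed")

def frame_count(fn) -> int:
    """Count traceback frames produced when fn raises."""
    try:
        fn()
    except ValueError as exc:
        return len(traceback.extract_tb(exc.__traceback__))
```

At debugging time those extra frames are pure noise between you and the line that actually failed.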

Curious how others here are handling this, especially those running local models in serverless or edge environments. Are cold starts actually a pain point for your setups or do you mostly run persistent servers?

(For context, these numbers came out of building SynapseKit, an open-source framework tackling exactly this. Happy to share more if useful but mainly wanted to discuss the underlying problem.)