DNS Resolution Delay: The Silent Killer That Blocks Your Threads
Posted by Designer_Bug9592@reddit | programming | 16 comments
The Blocking Problem Everyone Forgets
Here’s the thing about DNS lookups that catches people off guard. When your service needs to connect to another service, it has to resolve the hostname to an IP address. In most programming languages, this happens through a synchronous library call like getaddrinfo(). That means the thread making the request just sits there, doing nothing, waiting for the DNS response.
Normally this takes 2-5 milliseconds and nobody notices. You have a thread pool of 200 threads, each request takes maybe 50ms total, and you’re processing thousands of requests per second without breaking a sweat. The occasional DNS lookup is just noise in the overall request time.
But when DNS gets slow, everything changes. Imagine your DNS resolver is now taking 300ms to respond. Every thread that needs to establish a new connection is now blocked for 300ms just waiting for DNS. During that time, incoming requests pile up in the queue. More threads pick up queued requests, and they also need new connections, so they also get stuck on DNS. Before you know it, your entire thread pool is blocked waiting for DNS responses, and your service is effectively dead even though your CPU is at 15% and you have plenty of memory.
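A rough back-of-the-envelope check of those numbers, as a sketch only. It assumes the worst case described above, where every request pays the full DNS delay (no cache hit, no connection reuse). The function name and the simple model are illustrative and not taken from the article.

```python
def max_throughput(threads: int, service_ms: float, dns_ms: float = 0.0) -> float:
    """Upper bound on requests/second for a pool of blocking threads.

    Assumes each thread handles one request at a time and every request
    pays the full DNS delay (worst case: no cache, no connection reuse).
    """
    return threads * 1000.0 / (service_ms + dns_ms)


healthy = max_throughput(threads=200, service_ms=50)
degraded = max_throughput(threads=200, service_ms=50, dns_ms=300)
print(f"healthy: {healthy:.0f} req/s, degraded: {degraded:.0f} req/s")
# prints "healthy: 4000 req/s, degraded: 571 req/s"
```

Under this model the pool's ceiling drops by about 7x while the CPU stays mostly idle, because the threads are waiting rather than working.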
https://howtech.substack.com/p/dns-resolution-delay-the-silent-killer
mosaic_hops@reddit
Yes, this is why resolvers cache responses. You also shouldn’t tie up an entire thread blocking on an operation like this… that was the pre-C10K programming paradigm from the 1980s. Use async I/O!
trailing_zero_count@reddit
Even some async IO frameworks don't offer async DNS resolution. Why, I don't know. But it's something to carefully check when you are vetting your dependencies.
ecthiender@reddit
Because people use getaddrinfo(), which handles all the edge cases and quirks, but it's a blocking call. As a library author you'd have to spawn a separate thread (or a separate async task) to make it async in your library. It's more work.
Ref: https://valentin.gosu.se/blog/2025/02/getaddrinfo-sucks-everything-else-is-much-worse
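For illustration, a minimal sketch of that "push the blocking call onto another thread" approach in Python's asyncio. The `resolve` helper is a hypothetical name; `socket.getaddrinfo` and `loop.run_in_executor` are standard library calls. asyncio's own `loop.getaddrinfo()` takes the same route by default, so this is mainly to show the mechanics.

```python
import asyncio
import socket


async def resolve(host: str, port: int) -> list[str]:
    """Resolve host without blocking the event loop.

    socket.getaddrinfo() is a blocking call, so it is handed to the
    default thread pool executor; only that worker thread waits.
    """
    loop = asyncio.get_running_loop()
    infos = await loop.run_in_executor(
        None, lambda: socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return [info[4][0] for info in infos]


async def main() -> None:
    # A numeric address is used so the example runs without a network.
    addrs = await asyncio.wait_for(resolve("127.0.0.1", 443), timeout=2.0)
    print(addrs)


if __name__ == "__main__":
    asyncio.run(main())
```

Note that the timeout only stops the caller from waiting. The worker thread stays blocked until getaddrinfo() returns, so a slow resolver can still exhaust the executor's threads.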
nvmnghia@reddit
this is weird and new to me. could u name a few?
KainMassadin@reddit
async io is fun until some random thing performs a blocking call on the main thread and your whole app crawls to a halt :)
buttplugs4life4me@reddit
async io is fun until you need to do something that isn't implemented and suddenly your carefully crafted async workflow just spends its time on a sync method
ericonr@reddit
People use getaddrinfo because it implements all the desired system policy. resolv.conf, hosts file, and whatever other extensions libc implements.
Respecting choices made when configuring a system is relevant, and doing so in an asynchronous manner isn't straightforward, which is why a framework like Tokio sends DNS queries to its blocking IO thread pool, at least making the operation transparent. Using async isn't enough, you need proper integration.
schplat@reddit
No no no no and no. Only one thread every 5 minutes (assuming a 5 minute TTL) is blocked for 300ms waiting on DNS. Pretty much everything, and especially getaddrinfo(), uses DNS cache. A cache based on the TTL of the record.
getaddrinfo() will check if the record is cached, and if so, use that, and absolutely 0 network calls are done. Otherwise, it will queue up a single call and wait for response, and then use the cached record for anything queued up behind it.
Most modern resolvers can also be configured to serve a stale record on a failed cache hit where the resolution is taking too long, but this also assumes that you have cached said record at some point in the recent past.
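To make the TTL and serve-stale behavior concrete, here is a toy in-process sketch. It is not how getaddrinfo() or any particular resolver is implemented. `StaleServingCache` and its `lookup` parameter are hypothetical names, it only serves stale on a failed lookup (not a slow one), and it is neither thread-safe nor does it coalesce concurrent misses.

```python
import time
from typing import Callable


class StaleServingCache:
    """Tiny TTL cache that falls back to an expired entry if a refresh fails."""

    def __init__(self, lookup: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup      # hypothetical resolver function: name -> address
        self._ttl = ttl            # seconds an entry is considered fresh
        self._clock = clock        # injectable so tests can control time
        self._entries: dict[str, tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]                 # fresh hit: no lookup at all
        try:
            address = self._lookup(name)     # miss or expired: one real lookup
        except OSError:                      # socket.gaierror is an OSError
            if cached is not None:
                return cached[0]             # refresh failed: serve stale
            raise                            # nothing cached: propagate
        self._entries[name] = (address, now + self._ttl)
        return address
```

With a 300-second TTL, only the first call after expiry pays for a lookup. If that lookup fails, callers keep getting the last known address instead of an error.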
acroback@reddit
This is the correct answer.
Why do people even write stuff without understanding how DNS actually works?
Now if you are responding with a DNS record with a TTL of 0, à la stupid Consul style, then there is not much you can do.
non3type@reddit
To be frank, a lot of this is bad advice. If lookups are taking 300ms, there’s a reason. Forcing TTLs low and constantly refreshing the cache in the background is likely to make things worse in a degraded situation.
cmd_Mack@reddit
Thank you, reading OP's hot take woke me up faster than the coffee in my hand.
non3type@reddit
Yeah, I’m a software engineer in our network services division. Beyond dev, we also manage a lot of the critical layer 7 services that support the network, like DNS, NTP, RADIUS/TACACS, etc. That’s all to say, as someone who does both dev and DNS, this gave me heart palpitations lol.
KainMassadin@reddit
People are also surprised to find out DNS resolvers have rate limits, which you can hit if you have high outbound traffic or improper DNS cache configs.
mycall@reddit
Then when you include a WAF, its black-magic rules start to destroy your determinism.
sir_bok@reddit
Clearly written by AI.