Debugging a Python IMAP IDLE watcher with 100% CPU — py-spy found the root cause
in 30 seconds afte
Posted by AlexHoHo@reddit | Python | View on Reddit | 5 comments
I built a self-hosted email filter running as a systemd service on a Proxmox LXC
(4 vCores). It uses IMAP IDLE to react instantly to new emails. Everything worked
correctly — except one vCore was permanently pinned at 100% CPU.
---
## The symptom
One vCore at 100%, constant. The filter sorted emails correctly. No errors in the log.
Four accounts, four watcher threads.
---
## What I tried first (and failed)
**Fix 1:** The reconnect loop after a clean EOF had no sleep → added `stop_event.wait(30)`
outside the try/except. CPU still high.
**Fix 2:** Server sends `+ idling\r\n` keepalive lines after IDLE start → `readline()`
returned immediately for each → tight loop → added `time.sleep(0.05)` for non-EXISTS lines.
CPU still high.
**Fix 3:** Server sends EXISTS notifications even when mail count stays the same (e.g.
after IDLE start) → was triggering unnecessary UNSEEN queries → tracked last_exists count,
only act on increases. CPU still high.
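The bookkeeping in Fix 3 could look roughly like the sketch below. `ExistsTracker`, `should_check`, and the regex are illustrative names, not the post's actual code, and treating the first EXISTS after IDLE start as a baseline (rather than as new mail) is an assumption.

```python
import re

# Untagged IMAP response such as b"* 12 EXISTS\r\n"
EXISTS_RE = re.compile(rb"^\* (\d+) EXISTS\s*$")


class ExistsTracker:
    """Remember the last EXISTS count and report only real increases."""

    def __init__(self):
        self.last_exists = None  # no baseline until the first EXISTS line

    def should_check(self, line: bytes) -> bool:
        match = EXISTS_RE.match(line)
        if match is None:
            return False  # keepalives and other untagged responses
        count = int(match.group(1))
        # Assumption: the first count seen only sets the baseline.
        increased = self.last_exists is not None and count > self.last_exists
        self.last_exists = count
        return increased
```

Only a `True` result would trigger the UNSEEN query. Repeated or lower counts just update the stored value.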
After three failed guesses I stopped guessing and ran a profiler.
---
## The actual diagnosis — py-spy
```bash
pip install py-spy
py-spy record --pid $(pgrep -f "python.*main") --duration 10 -o /tmp/profile.svg
```

The flamegraph showed it immediately:

- `_is_timeout_exc`: 178 samples in 10 seconds (~18 calls/second!)
- `_idle_loop:172`: 95 samples (the `continue` after the exception)

One thread (`watcher-fb`) was `active+gil`: running in Python userspace, not blocked
on I/O. It was catching `OSError(timeout)` from `readline()` about 18 times per second
and immediately looping back.
---
## The root cause
In my `_idle_loop`:

```python
conn.sock.settimeout(60)
while not self._stop_event.is_set():
    try:
        line = conn.readline()
    except (imaplib.IMAP4.abort, OSError) as e:
        if self._is_timeout_exc(e):
            if time.time() - start > IDLE_TIMEOUT:
                # renew IDLE...
            continue  # ← NO SLEEP HERE
        break
```
The socket was in a state where `readline()` returned `OSError(timeout)` immediately
instead of blocking for 60 seconds. Every `continue` went straight back to another
failing `readline()`. The loop ran ~18 times per second on the fb account thread, pegging
one core.

Why `readline()` returned a timeout immediately instead of blocking: likely an interaction
between `settimeout(60)` on the SSL socket and the internal state of imaplib's file object
(`makefile('rb')`). I haven't dug deeper; the fix is more important.
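One CPython behavior would fit this symptom. It is offered as a hypothesis, not a confirmed diagnosis of the setup in the post. The file object returned by `socket.makefile('rb')` remembers that a read timed out, and every later read fails at once with `OSError('cannot read from timed out object')` instead of waiting again. The sketch below reproduces that with a plain `socket.socketpair()` (no IMAP, no TLS). `timed_readline` and `demo` are illustrative names.

```python
import socket
import time


def timed_readline(fileobj):
    """Call readline() and return (exception or None, seconds spent)."""
    started = time.monotonic()
    try:
        fileobj.readline()
        return None, time.monotonic() - started
    except OSError as exc:
        return exc, time.monotonic() - started


def demo(timeout=0.2):
    """Read twice from a socket file object whose peer never sends data."""
    reader_sock, writer_sock = socket.socketpair()
    try:
        reader_sock.settimeout(timeout)
        fileobj = reader_sock.makefile("rb")
        first = timed_readline(fileobj)   # expected to wait for the timeout
        second = timed_readline(fileobj)  # expected to fail without waiting
        fileobj.close()
        return first, second
    finally:
        reader_sock.close()
        writer_sock.close()


if __name__ == "__main__":
    for label, (exc, elapsed) in zip(("first", "second"), demo()):
        print(f"{label}: {exc!r} after {elapsed:.3f}s")
```

If that is what happened here, the first `readline()` takes roughly the full timeout and every later one fails almost instantly, which matches the tight loop seen in the profile.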
---
## The fix: one line

```python
if self._is_timeout_exc(e):
    if time.time() - start > IDLE_TIMEOUT:
        # renew IDLE...
    else:
        self._stop_event.wait(1)  # ← THIS. Limits to 1 retry/second.
    continue
break
```
Using `stop_event.wait(1)` instead of `time.sleep(1)` so the thread remains immediately
stoppable on shutdown.
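The difference shows up at shutdown. `threading.Event.wait(timeout)` returns as soon as the event is set, while `time.sleep()` always runs to completion. Here is a minimal sketch with hypothetical names (`worker`, `shutdown_latency`) and a deliberately long pause to make the effect visible:

```python
import threading
import time


def worker(stop_event, pause):
    """Idle loop that pauses between iterations but stays interruptible."""
    while not stop_event.is_set():
        # Returns True the moment the event is set, or False after `pause` seconds.
        stop_event.wait(pause)


def shutdown_latency(pause=5.0):
    """Measure how long the worker takes to exit after being asked to stop."""
    stop_event = threading.Event()
    thread = threading.Thread(target=worker, args=(stop_event, pause))
    thread.start()
    time.sleep(0.1)  # give the worker time to enter wait()
    started = time.monotonic()
    stop_event.set()
    thread.join()
    return time.monotonic() - started
```

Even with a 5-second pause, the worker should exit well under a second after `stop_event.set()`. With `time.sleep(pause)` in its place, shutdown could take up to the full pause.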
Result: CPU dropped from ~85% to 0.2%.
---
## Lessons learned
- **Don't guess. Profile first.** Three fixes based on code-reading + guessing = still broken. One `py-spy record` = root cause in 30 seconds.
- **Every `continue` in an exception handler wrapping blocking I/O needs a sleep.** If the I/O call fails immediately instead of blocking, you have a tight loop. This applies to any blocking read such as `readline()` or `recv()`.
- **`settimeout(N)` on an SSL socket is not a guarantee that `readline()` blocks for N seconds.** In certain socket states it can return immediately. Always protect the retry path with a sleep.
- **py-spy is incredible.** No code changes needed, works on running processes, flame graph in 10 seconds. Install it on every server running Python.
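The second and third lessons can be checked in isolation. The sketch below is not the post's watcher code. `count_attempts` is a hypothetical helper that simulates a read which always fails immediately, then counts how many retries happen in a fixed window with and without a guarded pause.

```python
import threading


def count_attempts(retry_pause, run_for=0.3):
    """Count retries of an always-failing read during `run_for` seconds."""
    stop_event = threading.Event()
    attempts = 0

    def failing_read():
        # Stand-in for a readline() that raises at once instead of blocking.
        raise TimeoutError("simulated immediate timeout")

    timer = threading.Timer(run_for, stop_event.set)
    timer.start()
    while not stop_event.is_set():
        try:
            failing_read()
        except OSError:  # TimeoutError is a subclass of OSError
            attempts += 1
            if retry_pause:
                # Guarded retry: bounded rate, still interruptible.
                stop_event.wait(retry_pause)
            continue
    timer.join()
    return attempts
```

Comparing `count_attempts(0.05)` with `count_attempts(0)` shows the gap. The guarded loop manages a handful of retries, while the unguarded one spins as fast as the interpreter allows.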
Hope this saves someone else three rounds of failed fixes.
hikingsticks@reddit
Your claw bot forgot to add how this helped you grow as a programmer, and if anyone else feels the same?
LoreBadTime@reddit
I grew as programmer, but I don't feel the same?
_matze@reddit
I grew the same, but don’t feel the programmer..
LoreBadTime@reddit
I programmer the same, but don't feel the grew