Debugging a Python IMAP IDLE watcher with 100% CPU — py-spy found the root cause in 30 seconds afte

Posted by AlexHoHo@reddit | Python

I built a self-hosted email filter running as a systemd service on a Proxmox LXC 
(4 vCores). It uses IMAP IDLE to react instantly to new emails. Everything worked 
correctly — except one vCore was permanently pinned at 100% CPU.

---

## The symptom

One vCore at 100%, constant. The filter sorted emails correctly. No errors in the log.
Four accounts, four watcher threads.

---

## What I tried first (and failed)

**Fix 1:** The reconnect loop after a clean EOF had no sleep → added `stop_event.wait(30)` 
outside the try/except. CPU still high.

**Fix 2:** Server sends `+ idling\r\n` keepalive lines after IDLE start → `readline()` 
returned immediately for each → tight loop → added `time.sleep(0.05)` for non-EXISTS lines. 
CPU still high.

**Fix 3:** Server sends EXISTS notifications even when mail count stays the same (e.g. 
after IDLE start) → was triggering unnecessary UNSEEN queries → tracked last_exists count, 
only act on increases. CPU still high.
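The bookkeeping behind Fix 3 can be sketched roughly as follows. This is a minimal illustration, not the filter's actual code: the `ExistsTracker` class, its `should_check` method, and the decision to treat the first EXISTS after IDLE start as a baseline are all assumptions made for the example.

```python
import re

# Untagged IMAP notification such as b"* 23 EXISTS\r\n"
EXISTS_RE = re.compile(rb"^\* (\d+) EXISTS\s*$")


class ExistsTracker:
    """Remember the last EXISTS count and report only genuine increases."""

    def __init__(self):
        self.last_exists = None  # unknown until the first EXISTS arrives

    def should_check(self, line: bytes) -> bool:
        match = EXISTS_RE.match(line)
        if match is None:
            return False  # keepalives and other responses are ignored
        count = int(match.group(1))
        # Assumption: the first EXISTS only sets the baseline and triggers nothing.
        increased = self.last_exists is not None and count > self.last_exists
        self.last_exists = count
        return increased
```

With this in place, a repeated `* 5 EXISTS` does nothing, and only a later `* 6 EXISTS` would trigger the UNSEEN query.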

After three failed guesses I stopped guessing and ran a profiler.

---

## The actual diagnosis — py-spy

```bash
pip install py-spy
py-spy record --pid $(pgrep -f "python.*main") --duration 10 -o /tmp/profile.svg
```

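The flame graph showed it immediately:

```
_is_timeout_exc     178 samples in 10 seconds
_idle_loop:172       95 samples                 (the `continue` after the exception)
```

One thread (`watcher-fb`) was `active+gil`: running in Python userspace, not blocked
on I/O. It was catching `OSError(timeout)` from `readline()` over and over and
immediately looping back. Note that py-spy counts samples, not calls: at its default
rate of 100 samples per second, 178 samples means roughly 1.8 of the 10 seconds were
spent inside `_is_timeout_exc` alone. It says nothing about how many times the loop
ran, only that the thread never slept.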

---

## The root cause

In my `_idle_loop`:

```python
conn.sock.settimeout(60)  # readline() should block for up to 60 s
while not self._stop_event.is_set():
    try:
        line = conn.readline()
    except (imaplib.IMAP4.abort, OSError) as e:
        if self._is_timeout_exc(e):
            if time.time() - start > IDLE_TIMEOUT:
                pass  # renew IDLE...
            continue   # ← NO SLEEP HERE
        break
```

The socket was in a state where `readline()` raised `OSError(timeout)` immediately
instead of blocking for 60 seconds. Every `continue` went straight back to another
failing `readline()`. With nothing to slow it down, the loop spun continuously on the
fb account thread, pegging one core.

Why `readline()` raised a timeout immediately instead of blocking: likely an interaction
between `settimeout(60)` on the SSL socket and the internal state of imaplib's file object
(`makefile('rb')`). I haven't dug deeper; the fix is more important.
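One candidate explanation that can be checked without an IMAP server: in CPython, the file object returned by `socket.makefile()` remembers that a read timed out and refuses every later read with `OSError("cannot read from timed out object")`, without waiting. Whether this is what happened in the watcher is an assumption, since the post shows neither the traceback nor `_is_timeout_exc`. The sketch below reproduces the behavior with a local socket pair; `timed_readline` and `demo` are names invented for the example.

```python
import socket
import time


def timed_readline(reader):
    """Call reader.readline() and return (exception or None, seconds taken)."""
    started = time.monotonic()
    try:
        reader.readline()
        return None, time.monotonic() - started
    except OSError as exc:  # TimeoutError is a subclass of OSError
        return exc, time.monotonic() - started


def demo(timeout=0.2):
    # A connected local socket pair stands in for the IMAP connection.
    # The peer never sends anything, so the first read has to time out.
    client, peer = socket.socketpair()
    try:
        client.settimeout(timeout)
        reader = client.makefile("rb")   # the call the post attributes to imaplib
        first = timed_readline(reader)   # blocks for roughly `timeout` seconds
        second = timed_readline(reader)  # raises at once, no blocking
        reader.close()
        return first, second
    finally:
        client.close()
        peer.close()


if __name__ == "__main__":
    for exc, seconds in demo():
        print(f"{type(exc).__name__}: {exc} (after {seconds:.3f}s)")
```

If this is the mechanism, two things follow. A helper that recognizes timeouts by the words "timed out" in the message would classify the second error as a timeout too. And the file object never recovers, so a sleep caps the CPU cost but new mail on that connection would only be seen after a reconnect.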

---

## The fix: one line

```python
        if self._is_timeout_exc(e):
            if time.time() - start > IDLE_TIMEOUT:
                pass  # renew IDLE...
            else:
                self._stop_event.wait(1)  # ← THIS. Limits to 1 retry/second.
            continue
        break
```

I used `stop_event.wait(1)` instead of `time.sleep(1)` so the thread remains immediately
stoppable on shutdown.
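The difference is easy to measure. In this small sketch (the `stop_latency` helper is hypothetical, not part of the filter), a worker pauses with `Event.wait(30)` and the main thread times how long it takes to exit once the event is set. A worker paused with `time.sleep(30)` would instead hold up shutdown for the rest of its 30 seconds.

```python
import threading
import time


def stop_latency(pause_seconds=30.0, stop_after=0.1):
    """Return how long a worker paused in Event.wait() takes to exit once stopped."""
    stop_event = threading.Event()
    worker = threading.Thread(target=stop_event.wait, args=(pause_seconds,))
    worker.start()
    time.sleep(stop_after)   # let the worker settle into its pause
    started = time.monotonic()
    stop_event.set()         # request shutdown
    worker.join()            # returns as soon as wait() wakes up
    return time.monotonic() - started


if __name__ == "__main__":
    print(f"worker stopped after {stop_latency():.3f}s")
```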

Result: CPU dropped from ~85% to 0.2%.

---

## Lessons learned

1. Don't guess. Profile first. Three fixes based on code-reading and guessing left it still broken. One `py-spy record` gave the root cause in 30 seconds.
2. Every `continue` in an exception handler wrapping blocking I/O needs a sleep. If the I/O call fails immediately instead of blocking, you have a tight loop. This applies to any `readline()`, `recv()`, etc. on a socket.
3. `settimeout(N)` on an SSL socket is not a guarantee that `readline()` blocks for N seconds. In certain socket states it can fail immediately. Always protect the retry path with a sleep.
4. py-spy is incredible. No code changes needed, works on running processes, flame graph in 10 seconds. Install it on every server running Python.
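Lessons 2 and 3 can be folded into one reusable guard. The sketch below is one possible shape, not the code from the filter: `read_with_retry_floor` and its parameters are invented for illustration, and unlike the loop in the post it retries on any `OSError` rather than only on timeouts. It waits only for whatever part of the interval the failed call did not already use, so a read that really blocked for 60 seconds is retried without extra delay.

```python
import threading
import time


def read_with_retry_floor(read_line, stop_event, min_retry_interval=1.0):
    """Call read_line() until it succeeds or stop_event is set.

    However quickly read_line() fails, consecutive attempts start at least
    min_retry_interval seconds apart, so a failing call cannot spin a core.
    Returns the line, or None if stop_event was set first.
    """
    while not stop_event.is_set():
        attempt_started = time.monotonic()
        try:
            return read_line()
        except OSError:
            # A read that really blocked has already used up its share of
            # time; only top up the difference when it failed early.
            remaining = min_retry_interval - (time.monotonic() - attempt_started)
            if remaining > 0:
                stop_event.wait(remaining)  # interruptible, unlike time.sleep()
    return None
```

In a watcher this would be called as `read_with_retry_floor(conn.readline, self._stop_event)`, with the IDLE renewal handled around it.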

Hope this saves someone else three rounds of failed fixes.