Benchmarking a hybrid threat detection system (backend + APIs)

Posted by Emergency-Rough-6372@reddit | Python | View on Reddit | 16 comments

I’ve been spending some time reading through discussions here and I genuinely like how people break things down and share practical perspectives, so I thought I’d put this out as more of a discussion than a direct “help” post.

Lately I’ve been working on a backend system focused on detecting potential threats in API flows and chatbot interactions. It’s not purely rule-based: it combines deterministic security checks (using established open-source libraries) with a probabilistic layer for risk scoring and decision-making.
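To make the shape of that concrete, here's a minimal sketch of such a hybrid pipeline. Everything here is hypothetical (the rule set, the scoring function, the threshold); in a real system the deterministic checks would delegate to established security libraries, and the score would come from an actual model:

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    blocked: bool
    risk: float                      # probabilistic score in [0, 1]
    reasons: list = field(default_factory=list)

# Hypothetical deterministic rules -- stand-ins for real library checks.
RULES = {
    "sql_injection": lambda payload: "' OR 1=1" in payload,
    "path_traversal": lambda payload: "../" in payload,
}

def risk_score(payload: str) -> float:
    # Stand-in for a learned model: payloads dense with unusual
    # characters score higher. Purely illustrative.
    unusual = sum(1 for ch in payload if not ch.isalnum() and ch != " ")
    return min(1.0, unusual / 20)

def evaluate(payload: str, threshold: float = 0.7) -> Verdict:
    reasons = [name for name, rule in RULES.items() if rule(payload)]
    score = risk_score(payload)
    # Deterministic hits block outright; otherwise the probabilistic
    # layer decides against a tunable threshold.
    blocked = bool(reasons) or score >= threshold
    return Verdict(blocked, score, reasons)
```

The interesting consequence for benchmarking: the first branch (`reasons`) is exactly testable, while the second (`score >= threshold`) only makes sense evaluated over a distribution of inputs.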

Because of that mix, evaluation becomes a bit tricky. It’s not a clean input → output system, and correctness isn’t always binary.

What I’ve been thinking about is how people approach benchmarking in cases like this. When part of the system is deterministic and part is probabilistic, what does “good performance” actually look like?

Is it more about exact correctness on known cases, or about statistical behavior — things like precision, recall, and calibration measured over many runs?

Another thing I’ve been running into is edge cases. With deterministic checks, it’s straightforward. But once you add a probabilistic layer, it feels more like you’re evaluating behavior over distributions rather than validating exact outputs.
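One way I've seen "evaluating behavior over distributions" made concrete is asserting on metric bands rather than exact outputs. This is a toy sketch, not my actual detector: the detection and false-alarm rates, base rate, and thresholds are all made-up numbers, and the seeded RNG stands in for real labeled traffic:

```python
import random

def noisy_detector(is_threat: bool, rng: random.Random) -> bool:
    # Hypothetical probabilistic detector: flags 95% of threats,
    # false-alarms on 2% of benign traffic.
    if is_threat:
        return rng.random() < 0.95
    return rng.random() < 0.02

def measure(n: int = 10_000, seed: int = 0) -> dict:
    rng = random.Random(seed)
    tp = fp = fn = tn = 0
    for _ in range(n):
        is_threat = rng.random() < 0.10   # assumed 10% threat base rate
        flagged = noisy_detector(is_threat, rng)
        if is_threat and flagged:
            tp += 1
        elif is_threat:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

metrics = measure()
# The benchmark passes if metrics land in an acceptable band,
# not if any single input produces an exact output.
assert metrics["recall"] > 0.90
assert metrics["precision"] > 0.70
```

The test is over aggregate behavior, which matches the deterministic/probabilistic split: the rule layer gets exact-match tests, the scoring layer gets distributional ones.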

Since I’m relying on well-established libraries for the core detection logic, the challenge isn’t verifying those individually; it’s understanding how the overall system behaves around them, and how to present results in a way that feels trustworthy.
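On the "presenting results trustworthily" point, one common approach is reporting confidence intervals instead of point estimates for end-to-end numbers. A minimal bootstrap sketch, assuming a hypothetical list of 0/1 outcomes from end-to-end test runs:

```python
import random
import statistics

def bootstrap_ci(outcomes, n_resamples=1000, seed=0):
    """95% bootstrap confidence interval for a pass rate.

    `outcomes` is a list of 0/1 results from end-to-end runs.
    Resample with replacement, collect the means, take the
    2.5th and 97.5th percentiles.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(outcomes, k=len(outcomes)))
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

# Hypothetical results: 92 of 100 end-to-end cases handled correctly.
lo, hi = bootstrap_ci([1] * 92 + [0] * 8)
```

Reporting "92% (95% CI: roughly 87–97%)" instead of a bare "92%" makes it visible how much the number could move run-to-run, which I find matters a lot when part of the system is probabilistic.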

Curious how others here think about this.

Not looking for a single answer, just interested in how people approach this in real systems.