15 code health biomarkers, benchmarked against 6 months of real bugs across 3 repos

Posted by Obvious_Gap_5768@reddit | ExperiencedDevs

I'm building an open source codebase intelligence tool. One layer of it scores every file from 1 to 10 using 15 deterministic biomarkers, with no LLM involved. It uses AST parsing via tree-sitter plus git history.

The biomarkers fall into four buckets:

Structural: brain_method, nested_complexity, bumpy_road, complex_method, large_method, complex_conditional, primitive_obsession

Duplication: dry_violation (Rabin-Karp rolling hash over tree-sitter tokens, survives variable renames)
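To make the rename-tolerant part concrete, here is a minimal sketch of the general technique, not the tool's actual implementation. It uses Python's standard `tokenize` module as a stand-in for tree-sitter tokens. The names `normalized_tokens` and `duplicate_windows` and the window size of 8 are assumptions for illustration. Every identifier is collapsed to a placeholder before hashing, which is why renamed variables still collide.

```python
import io
import keyword
import tokenize
from collections import defaultdict

BASE = 257
MOD = (1 << 61) - 1  # large prime modulus for the polynomial hash

# Layout-only tokens that carry no duplication signal.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalized_tokens(source):
    """Return [(text, line)] with every non-keyword name collapsed to 'ID'."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append(("ID", tok.start[0]))
        else:
            out.append((tok.string, tok.start[0]))
    return out


def duplicate_windows(source, window=8):
    """Return (line_a, line_b) pairs whose next `window` tokens are identical."""
    toks = normalized_tokens(source)
    if len(toks) < window:
        return []
    ids = {}
    codes = [ids.setdefault(text, len(ids) + 1) for text, _ in toks]

    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for c in codes[:window]:
        h = (h * BASE + c) % MOD
    seen = defaultdict(list)
    seen[h].append(0)
    for i in range(1, len(codes) - window + 1):
        # Rabin-Karp update: drop codes[i-1], shift, append the new token.
        h = ((h - codes[i - 1] * high) * BASE + codes[i + window - 1]) % MOD
        seen[h].append(i)

    pairs = []
    for starts in seen.values():
        first = starts[0]
        for other in starts[1:]:
            # Skip overlapping windows and verify to rule out hash collisions.
            if other >= first + window and \
                    codes[first:first + window] == codes[other:other + window]:
                pairs.append((toks[first][1], toks[other][1]))
    return pairs
```

Literals are left as-is in this sketch, so `x = 0` and `x = 1` do not match. Whether the real biomarker normalizes literals too is not stated in the post.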

Test coverage: untested_hotspot, coverage_gap

Organizational: developer_congestion, knowledge_loss, hidden_coupling, function_hotspot, code_age_volatility

I ran a time-travel experiment on FastAPI (104 files), Pydantic (216 files), and Django (542 files): score every file at time T, count bug-fix commits over the next 6 months, and check the correlation.
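The labeling half of that experiment could look roughly like the sketch below. It assumes commits have already been extracted from git as `(timestamp, message, files)` tuples. The keyword regex for spotting bug-fix commits and the 182-day window are assumptions for illustration, since the post does not say how bug-fix commits were identified.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical heuristic: a commit is a bug fix if its message mentions one.
BUGFIX_RE = re.compile(r"\b(fix(e[sd])?|bugs?|regression)\b", re.IGNORECASE)


def bugfix_counts(commits, t, window_days=182):
    """Count bug-fix commits per file in the window (t, t + window_days]."""
    end = t + timedelta(days=window_days)
    counts = Counter()
    for when, message, files in commits:
        if t < when <= end and BUGFIX_RE.search(message):
            counts.update(set(files))  # one count per file per commit
    return counts
```

Files are scored from the snapshot at `t` only, so nothing from the future window leaks into the score.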

On Django: Spearman ρ = -0.34, p < 0.0001. Precision@20 = 70%, meaning 14 of the 20 worst-scoring files had real bugs in the following 6 months.
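For anyone who wants to reproduce these two numbers on their own repo, here is a dependency-free sketch of both metrics. It assumes a lower health score means a less healthy file, which is what the negative ρ and "worst-scoring" imply. The function names are mine, not the tool's.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def precision_at_k(scores, bug_counts, k):
    """Share of the k lowest-scoring files with at least one later bug fix."""
    worst = sorted(range(len(scores)), key=lambda i: scores[i])[:k]
    return sum(1 for i in worst if bug_counts[i] > 0) / k
```

With `k=20`, a result of 0.70 corresponds to the 14-of-20 figure above.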

The two strongest single predictors were untested_hotspot (Cliff's delta +0.67) and developer_congestion (+0.78 in Django). Both are process signals. McCabe complexity and nesting depth ranked lower.
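For reference, Cliff's delta compares two groups pairwise: the probability that a value from one group exceeds a value from the other, minus the reverse. A minimal sketch, assuming the two groups are the later bug-fix counts of files that trip a biomarker versus files that don't (the post does not spell out the exact grouping):

```python
def cliffs_delta(flagged, unflagged):
    """Cliff's delta in [-1, 1]; positive means `flagged` tends to be larger."""
    greater = sum(1 for a in flagged for b in unflagged if a > b)
    less = sum(1 for a in flagged for b in unflagged if a < b)
    return (greater - less) / (len(flagged) * len(unflagged))
```

This is the O(n·m) textbook form, which is fine for a few hundred files.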

knowledge_loss went negative. Files where original authors left the project had fewer bugs.

My read is that stable legacy code that nobody touches doesn't break.

One thing I'm being upfront about is that controlling for file size drops the correlation from ~0.3 to ~0.1. Bigger files carry more complexity and more bugs. CodeScene published a similar study claiming 15x more defects in unhealthy code but never reported this confound.
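One common way to control for a confound like this is a first-order partial correlation. The post does not say which method was actually used, so treat the sketch below as an illustration of the idea only. It takes three pairwise rank correlations (score vs. bugs, score vs. size, bugs vs. size) and returns the score-vs-bugs correlation with size held fixed.

```python
from math import sqrt


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing what both share with z."""
    denom = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denom
```

If size is unrelated to both variables, the correlation is unchanged. If the raw correlation equals the product of the two size correlations, it drops to zero, meaning size explains all of it.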

What would you add to this list? And has anyone else seen ownership metrics beat complexity in practice?