15 code health biomarkers, benchmarked against 6 months of real bugs across 3 repos

Posted by Obvious_Gap_5768@reddit | ExperiencedDevs | 21 comments

I'm building an open source codebase intelligence tool. One layer of it scores every file 1-10 using 15 deterministic biomarkers, with no LLM involved. It uses AST parsing via tree-sitter plus git history.

The biomarkers fall into five buckets:

Structural: brain_method, nested_complexity, bumpy_road, complex_method, large_method, complex_conditional, primitive_obsession

Duplication: dry_violation (Rabin-Karp rolling hash over tree-sitter tokens, survives variable renames)

Test coverage: untested_hotspot, coverage_gap

Organizational: developer_congestion, knowledge_loss, hidden_coupling, function_hotspot, code_age_volatility
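To make the dry_violation idea concrete, here is a rough sketch of a Rabin-Karp rolling hash over a normalized token stream. It is not the tool's implementation: it swaps tree-sitter for Python's stdlib `tokenize` to stay dependency-free, and the window size `k`, `BASE`, `MOD`, and all function names are assumptions made up for illustration. Identifiers and literals are replaced with placeholders, which is why a variable rename does not change the hash.

```python
import io
import keyword
import tokenize
import zlib
from collections import defaultdict

# Layout-only tokens that carry no structural meaning for clone detection.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENCODING, tokenize.ENDMARKER}

BASE = 257             # assumed hash base
MOD = (1 << 61) - 1    # assumed large prime modulus


def normalized_tokens(source):
    """Token stream with identifiers and literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)
    return out


def window_hashes(tokens, k):
    """Yield (start_index, hash) for every k-token window using a rolling hash."""
    if len(tokens) < k:
        return
    codes = [zlib.crc32(t.encode()) for t in tokens]
    high = pow(BASE, k - 1, MOD)  # weight of the token leaving the window
    h = 0
    for c in codes[:k]:
        h = (h * BASE + c) % MOD
    yield 0, h
    for i in range(k, len(codes)):
        # Drop the oldest token, shift, and append the newest one.
        h = ((h - codes[i - k] * high) * BASE + codes[i]) % MOD
        yield i - k + 1, h


def duplicate_windows(source_a, source_b, k=8):
    """Return (start_a, start_b) pairs of identical k-token windows."""
    ta, tb = normalized_tokens(source_a), normalized_tokens(source_b)
    seen = defaultdict(list)
    for start, h in window_hashes(ta, k):
        seen[h].append(start)
    matches = []
    for start_b, h in window_hashes(tb, k):
        for start_a in seen.get(h, []):
            # Compare the actual tokens to rule out hash collisions.
            if ta[start_a:start_a + k] == tb[start_b:start_b + k]:
                matches.append((start_a, start_b))
    return matches


# Two functions that differ only in names.
a = "def total(items):\n    s = 0\n    for x in items:\n        s += x\n    return s\n"
b = "def add_up(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
matches = duplicate_windows(a, b)
```

A real detector would likely merge overlapping windows into longer clone regions and tune `k` to avoid flagging trivial boilerplate.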

I ran a time-travel experiment on FastAPI (104 files), Pydantic (216 files), and Django (542 files). The procedure: score every file at time T, count bug-fix commits over the next 6 months, and check the correlation.

On Django: Spearman ρ = -0.34, p < 0.0001. Precision@20 = 70%, meaning 14 of the 20 worst-scoring files had real bugs in the following 6 months.
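For readers who want to reproduce this kind of evaluation, the two numbers above can be computed roughly as follows. This is a minimal stdlib-only sketch, not the benchmark pipeline from the tool; the function names and the toy data are made up, and "bug" here simply means at least one bug-fix commit in the follow-up window.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def precision_at_k(health_scores, bug_counts, k):
    """Share of the k lowest-scoring files with at least one later bug fix."""
    worst = sorted(range(len(health_scores)), key=lambda i: health_scores[i])[:k]
    return sum(1 for i in worst if bug_counts[i] > 0) / k


# Made-up toy data: health score at time T, bug-fix commits in the next window.
scores = [2.0, 3.5, 4.0, 6.0, 7.5, 9.0]
bugs = [5, 3, 0, 1, 0, 0]

rho = spearman(scores, bugs)            # negative: lower score, more bugs
p_at_3 = precision_at_k(scores, bugs, 3)
```

A negative rho is the expected direction when a higher score means a healthier file, which is how the 1-10 scale reads in the post.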

The two strongest single predictors were untested_hotspot (Cliff's delta +0.67) and developer_congestion (+0.78 in Django). Both are process signals. McCabe complexity and nesting depth ranked lower.
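Cliff's delta is a simple effect size: take every pair of one value from each group and compare how often the first is larger versus smaller. A small sketch, assuming the two groups are the bug counts of files a biomarker flagged versus files it did not (the post does not spell out the grouping, and the numbers below are invented):

```python
def cliffs_delta(group_a, group_b):
    """(#pairs with a > b  -  #pairs with a < b) / total pairs, in [-1, 1]."""
    greater = sum(1 for a in group_a for b in group_b if a > b)
    less = sum(1 for a in group_a for b in group_b if a < b)
    return (greater - less) / (len(group_a) * len(group_b))


# Invented bug-fix counts per file.
flagged = [4, 2, 3, 1]        # files where the biomarker fired
unflagged = [0, 1, 0, 2, 0]   # files where it did not

delta = cliffs_delta(flagged, unflagged)
```

A positive delta means flagged files tend to have more bug fixes than unflagged ones; values near 0 mean the groups overlap heavily.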

knowledge_loss went negative. Files where original authors left the project had fewer bugs.

My read is that stable legacy code that nobody touches doesn't break.

One thing I'm being upfront about is that controlling for file size drops the correlation from ~0.3 to ~0.1. Bigger files carry more complexity and more bugs. CodeScene published a similar study claiming 15x more defects in unhealthy code but never reported this confound.

What would you add to this list? And has anyone else seen ownership metrics beat complexity in practice?

kkingsbe@reddit

What makes this better than SonarJS lint rules?

monstereye@reddit

This is super interesting. Thanks for putting this all together, testing it against a few repos, and reporting the findings.

new2bay@reddit

Biomarkers?

Mountain-Dragonfly46@reddit

As I work in in-silico biologics / bioinformatics, I was a bit confused as well :)

Obvious_Gap_5768@reddit (OP)

Same idea as medical biomarkers. Each one measures a specific dimension of file health (complexity, nesting, test coverage, ownership patterns, etc.) and they combine into an overall 1-10 score. Felt more accurate than "code smells" since some of these are process signals, not structural issues

bxk21@reddit

Would KPI be the tech equivalent word we're trying to use?

Obvious_Gap_5768@reddit (OP)

KPI is more of a business/performance tracking thing. The standard term would be "code metrics" or "code health indicators." I went with biomarkers because the medical analogy fits better: each one is a diagnostic signal, and you look at the full panel together to assess health. Same way a doctor wouldn't diagnose you off cholesterol alone

DigmonsDrill@reddit

I count 4 buckets?

sparklikemind@reddit

SonarQube has already been doing this for years, though

Obvious_Gap_5768@reddit (OP)

SonarQube uses git blame for issue assignment, but its metrics are structural: complexity, duplications, coverage, technical debt. It doesn't compute organizational signals from git history like author count per file, knowledge loss, or co-change coupling. Those process metrics are the ones that ended up predicting bugs better than complexity in this benchmark

sparklikemind@reddit

Interesting

dacydergoth@reddit

Lovely to see some actual engineering going on.

Obvious_Gap_5768@reddit (OP)

Thanks, really appreciate that

5olArchitect@reddit

Seems cool but dry is overrated

jambalaya004@reddit

lol

steerpike_is_my_name@reddit

+1 for being able to run this on my own codebase.

Obvious_Gap_5768@reddit (OP)

This is an open source tool I have been working on. Link: https://github.com/repowise-dev/repowise

The benchmark pipeline is in there too if you want to run the same time-travel experiment on your own codebase. PRs welcome if you end up adding something :)

1000Ditto@reddit

Wow, I'm interested in this. Where can I learn more, and are there documents (e.g. articles, papers) you would recommend?

Obvious_Gap_5768@reddit (OP)

Honestly still learning a lot of this myself as I build it. I started with CodeScene's "Code Red" paper and then dug deeper from there.

The tool is open source if you want to poke around or contribute. Plus it has 4 more layers apart from code health. You can find the link in my profile.

Tahazarif90@reddit

Your data perfectly mirrors the Microsoft Research "Don't Touch My Code" study—organizational metrics like author count and ownership percentage consistently crush McCabe complexity for predicting defects. For biomarkers, you should add Churn-to-Complexity Ratio (high churn on complex files) and Blast Radius (how many files change when a specific file is modified based on git history). The size confounder is real; you must normalize your scores per 100 lines of code (LoC) or use partial correlation to isolate the true structural signal from sheer volume.

Obvious_Gap_5768@reddit (OP)

I haven't read the Microsoft study yet, will definitely look into it. Blast radius we already track in the git layer, but it might be worth promoting it to a biomarker.

Churn-to-complexity ratio is interesting. function_hotspot does something close (flags functions that are both complex and frequently modified), but the file-level version is something I still need to test.

The NLOC normalization is in there; the partial Spearman controlling for file size is where the ~0.3 to ~0.1 drop comes from. I wanted to report completely honest scores.
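A partial Spearman correlation can be sketched with the standard first-order partial correlation formula applied to ranks: r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The code below is an illustration under that assumption, not the tool's benchmark code; the helper names and the toy data (score, bug count, NLOC per file) are invented.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def partial_spearman(x, y, z):
    """Rank correlation of x and y with the rank of z partialled out."""
    rx, ry, rz = ranks(x), ranks(y), ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


# Invented toy data: one entry per file.
scores = [9.0, 8.0, 7.0, 6.0, 3.0, 4.0]   # health score at time T
bugs = [0, 1, 3, 2, 6, 5]                 # bug-fix commits afterwards
nloc = [50, 120, 200, 340, 500, 800]      # file size, the confounder

raw = pearson(ranks(scores), ranks(bugs))         # plain Spearman
controlled = partial_spearman(scores, bugs, nloc)  # size partialled out
```

In this toy data the controlled value is closer to zero than the raw one, which is the same qualitative effect as the drop described above: part of the raw correlation is just file size.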