Benchmarked a local-first MCP code-intel server on gin / nestjs / react — full methodology + reproducer
Posted by Parking-Geologist586@reddit | LocalLLaMA
I've been working on a local-first code intelligence MCP server and benchmarked it on three pinned public repos. All numbers are reproducible with one command (`npm run bench` clones the exact versions and runs the profiler).
| Repo | Files | Cold index | Search p95 | Impact p95 | DB size |
|---|---|---|---|---|---|
| gin-gonic/gin v1.10.0 | 99 | 10s | 12ms | 0.75ms | 4 MB |
| nestjs/nest v10.4.0 | 1,709 | 22s | 14ms | 0.88ms | 11 MB |
| facebook/react v18.3.1 | 4,368 | 152s | 26ms | 1.18ms | 67 MB |
Measured on M-series Apple Silicon, no GPU, cold start includes the full index build.
Stack
- Parser: regex-based across 18 languages (TS / JS / Python / Go / Rust / Java / C / C++ / Ruby / PHP + 8 more). Tree-sitter upgrade is on the roadmap but not blocking.
- Embeddings: all-MiniLM-L6-v2, ONNX, int8-quantized, 384 dimensions, ~90 MB. Runs locally via `onnxruntime-node`. No cloud calls, no API keys.
- Search: hybrid — BM25 (via SQLite FTS5) + cosine similarity over the embeddings + PageRank over the dependency graph. Fused via Reciprocal Rank Fusion (k=60). PageRank is stored as a column on the files table, computed once per full index.
- Storage: single SQLite file per project at `~/.sverklo/<project-hash>/index.db`. The full on-disk format is documented in docs/index-format.md.
- Symbol graph: parsed call-site references stored in `symbol_refs`, with lazy resolution against `chunks` at query time. Impact analysis is an indexed SQL join — sub-millisecond because the work was done at index time.
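To make the "PageRank over the dependency graph" idea concrete, here's a minimal power-iteration sketch over a file-import graph. This is my own illustration with assumed names and damping factor, not sverklo's actual code (which is documented in docs/index-format.md):

```typescript
// Hypothetical sketch: PageRank by power iteration over a file-dependency graph.
type Graph = Map<string, string[]>; // file -> files it imports

function pageRank(graph: Graph, damping = 0.85, iterations = 50): Map<string, number> {
  const nodes = [...graph.keys()];
  const n = nodes.length;
  let rank = new Map(nodes.map((f) => [f, 1 / n]));

  for (let i = 0; i < iterations; i++) {
    // Every node starts each round with the teleport share (1 - d) / n.
    const next = new Map(nodes.map((f) => [f, (1 - damping) / n]));
    for (const [file, deps] of graph) {
      // A file passes its importance to the files it depends on.
      const share = (rank.get(file)! * damping) / Math.max(deps.length, 1);
      for (const dep of deps) {
        if (next.has(dep)) next.set(dep, next.get(dep)! + share);
      }
    }
    rank = next;
  }
  return rank;
}

// A utility file imported by everything ends up with the highest rank,
// which is exactly the "which files matter" signal.
const g: Graph = new Map([
  ["util.ts", []],
  ["a.ts", ["util.ts"]],
  ["b.ts", ["util.ts"]],
]);
const ranks = pageRank(g); // ranks.get("util.ts") is the largest
```

Because the graph only changes on re-index, computing this once per full index and storing it as a column is essentially free at query time.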
Why not just a bigger embedding model?
Because the three signals handle different failure modes:
- BM25 catches exact identifier and string-literal matches that embeddings miss or misrank ("find every call to `parseFoo`").
- Vector search catches intent-shaped queries where the user doesn't know the identifier ("find the retry logic in the HTTP client").
- PageRank separates "which files match" from "which files matter." Critical when a query returns 50 hits in tests and 2 hits in production code.
Any one signal on its own has clear failure cases. Fusing them with RRF is scale-invariant (it uses only ranks, never the raw scores) and combines their complementary strengths.
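For anyone who hasn't seen RRF: it's a few lines. This is the textbook formula with k=60 as above (not necessarily sverklo's exact code):

```typescript
// Reciprocal Rank Fusion: fuse ranked lists without comparing raw scores.
// Each list contributes 1 / (k + rank) per document, so the scale of the
// underlying BM25 / cosine / PageRank scores never matters.
function rrfFuse(rankings: string[][], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1)); // rank is 1-based
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// "retry.ts" is only mid-ranked by BM25 but appears in all three lists,
// so it fuses ahead of documents that only one signal liked.
const fused = rrfFuse([
  ["http.ts", "retry.ts", "main.ts"], // BM25
  ["retry.ts", "backoff.ts"],         // vector
  ["http.ts", "retry.ts"],            // PageRank-weighted
]);
// fused[0][0] === "retry.ts"
```

The file names here are made up; the point is that a document endorsed by several signals beats a document ranked first by just one.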
Honest weaknesses
- Exact string lookup: `ripgrep` beats it. I use ripgrep all the time; sverklo is complementary, not a replacement.
- Small repos: under ~50 files the indexing overhead doesn't pay off. Just read everything.
- Framework wiring questions: "how is this bean registered" shapes return poor results because the answer lives in an annotation or a build-generated class, not in code that names the concept. The tool detects this query shape and explicitly recommends grep for the annotation instead.
- Unicode identifiers in Kotlin / Swift: the word-boundary matcher uses `\w`, which is ASCII-only. Non-ASCII identifiers fall back to substring mode.
What I actually built it for
MCP servers for AI coding agents (Claude Code, Cursor, Windsurf, Google Antigravity) mostly either (a) upload your code to a cloud index or (b) hallucinate file paths because they don't have an actual graph. I wanted something that gave Claude Code the same mental model of a repo that a senior engineer has — symbol reachability, blast radius, test coverage, structural importance — without anything leaving my laptop.
Technical deep-dive
- BENCHMARKS.md — reproducer script, methodology, raw results
- docs/index-format.md — on-disk layout, SQLite schema, RRF fusion details, PageRank computation
- DOGFOOD.md — the three-session quality-gate protocol I ran before shipping v0.2.16, including the four bugs I found in my own tool and fixed
Install
    npm install -g sverklo
    cd your-project && sverklo init
`sverklo init` auto-detects your installed AI coding agent and writes the right MCP config. MIT licensed. Opt-in telemetry (off by default, full schema documented, mirrored to a local log before any network call).
Repo: github.com/sverklo/sverklo
If anyone wants to benchmark sverklo against another local-first tool on the same repos, I'll run whatever comparison you propose and post the numbers in a reply. Interested in what shape of query breaks it most.