Benchmarked a local-first MCP code-intel server on gin / nestjs / react — full methodology + reproducer
Posted by Parking-Geologist586@reddit | LocalLLaMA
I've been working on a local-first code intelligence MCP server and benchmarked it on three pinned public repos. All numbers are reproducible with one command (`npm run bench` clones the exact versions and runs the profiler).
| Repo | Files | Cold index | Search p95 | Impact p95 | DB size |
|---|---|---|---|---|---|
| gin-gonic/gin v1.10.0 | 99 | 10s | 12ms | 0.75ms | 4 MB |
| nestjs/nest v10.4.0 | 1,709 | 22s | 14ms | 0.88ms | 11 MB |
| facebook/react v18.3.1 | 4,368 | 152s | 26ms | 1.18ms | 67 MB |
Measured on M-series Apple Silicon, no GPU, cold start includes the full index build.
Stack
- Parser: regex-based across 18 languages (TS / JS / Python / Go / Rust / Java / C / C++ / Ruby / PHP + 8 more). Tree-sitter upgrade is on the roadmap but not blocking.
- Embeddings: all-MiniLM-L6-v2, ONNX, int8-quantized, 384 dimensions, ~90 MB. Runs locally via `onnxruntime-node`. No cloud calls, no API keys.
- Search: hybrid — BM25 (via SQLite FTS5) + cosine similarity over the embeddings + PageRank over the dependency graph. Fused via Reciprocal Rank Fusion (k=60). PageRank is stored as a column on the files table, computed once per full index.
- Storage: single SQLite file per project at `~/.sverklo/<project-hash>/index.db`. The full on-disk format is documented in docs/index-format.md.
- Symbol graph: parsed call-site references stored in `symbol_refs`, with lazy resolution against `chunks` at query time. Impact analysis is an indexed SQL join — sub-millisecond because the work was done at index time.
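To make the "PageRank over the dependency graph" idea concrete, here's a minimal power-iteration sketch over a file-import graph. This is my own illustration with assumed names and damping factor, not sverklo's actual code (which is documented in docs/index-format.md):

```typescript
// Hypothetical sketch: PageRank by power iteration over a file-dependency graph.
type Graph = Map<string, string[]>; // file -> files it imports

function pageRank(graph: Graph, damping = 0.85, iterations = 50): Map<string, number> {
  const nodes = [...graph.keys()];
  const n = nodes.length;
  let rank = new Map(nodes.map((f) => [f, 1 / n]));

  for (let i = 0; i < iterations; i++) {
    // Every node starts each round with the teleport share (1 - d) / n.
    const next = new Map(nodes.map((f) => [f, (1 - damping) / n]));
    for (const [file, deps] of graph) {
      // A file passes its importance to the files it depends on.
      const share = (rank.get(file)! * damping) / Math.max(deps.length, 1);
      for (const dep of deps) {
        if (next.has(dep)) next.set(dep, next.get(dep)! + share);
      }
    }
    rank = next;
  }
  return rank;
}

// A utility file imported by everything ends up with the highest rank,
// which is exactly the "which files matter" signal.
const g: Graph = new Map([
  ["util.ts", []],
  ["a.ts", ["util.ts"]],
  ["b.ts", ["util.ts"]],
]);
const ranks = pageRank(g); // ranks.get("util.ts") is the largest
```

Because the graph only changes on re-index, computing this once per full index and storing it as a column is essentially free at query time.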
Why not just a bigger embedding model?
Because the three signals handle different failure modes:
- BM25 catches exact identifier and string-literal matches that embeddings miss or misrank ("find every call to `parseFoo`").
- Vector search catches intent-shaped queries where the user doesn't know the identifier ("find the retry logic in the HTTP client").
- PageRank separates "which files match" from "which files matter." Critical when a query returns 50 hits in tests and 2 hits in production code.
Any one signal on its own has clear failure cases. Fusing them with RRF is scale-invariant (it uses only ranks, never the raw scores) and combines their complementary strengths.
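For anyone who hasn't seen RRF: it's a few lines. This is the textbook formula with k=60 as above (not necessarily sverklo's exact code):

```typescript
// Reciprocal Rank Fusion: fuse ranked lists without comparing raw scores.
// Each list contributes 1 / (k + rank) per document, so the scale of the
// underlying BM25 / cosine / PageRank scores never matters.
function rrfFuse(rankings: string[][], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1)); // rank is 1-based
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// "retry.ts" is only mid-ranked by BM25 but appears in all three lists,
// so it fuses ahead of documents that only one signal liked.
const fused = rrfFuse([
  ["http.ts", "retry.ts", "main.ts"], // BM25
  ["retry.ts", "backoff.ts"],         // vector
  ["http.ts", "retry.ts"],            // PageRank-weighted
]);
// fused[0][0] === "retry.ts"
```

The file names here are made up; the point is that a document endorsed by several signals beats a document ranked first by just one.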
Honest weaknesses
- Exact string lookup: `ripgrep` beats it. I use ripgrep all the time; sverklo is complementary, not a replacement.
- Small repos: under ~50 files the indexing overhead doesn't pay off. Just read everything.
- Framework wiring questions: "how is this bean registered" shapes return poor results because the answer lives in an annotation or a build-generated class, not in code that names the concept. The tool detects this query shape and explicitly recommends grep for the annotation instead.
- Unicode identifiers in Kotlin / Swift: the word-boundary matcher uses `\w`, which is ASCII-only. Non-ASCII identifiers fall back to substring mode.
What I actually built it for
MCP servers for AI coding agents (Claude Code, Cursor, Windsurf, Google Antigravity) mostly either (a) upload your code to a cloud index or (b) hallucinate file paths because they don't have an actual graph. I wanted something that gave Claude Code the same mental model of a repo that a senior engineer has — symbol reachability, blast radius, test coverage, structural importance — without anything leaving my laptop.
Technical deep-dive
- BENCHMARKS.md — reproducer script, methodology, raw results
- docs/index-format.md — on-disk layout, SQLite schema, RRF fusion details, PageRank computation
- DOGFOOD.md — the three-session quality-gate protocol I ran before shipping v0.2.16, including the four bugs I found in my own tool and fixed
Install
    npm install -g sverklo
    cd your-project && sverklo init
`sverklo init` auto-detects your installed AI coding agent and writes the right MCP config. MIT licensed. Opt-in telemetry (off by default, full schema documented, mirrored to a local log before any network call).
Repo: github.com/sverklo/sverklo
If anyone wants to benchmark sverklo against another local-first tool on the same repos, I'll run whatever comparison you propose and post the numbers in a reply. Interested in what shape of query breaks it most.