We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local
Posted by ComplexIt@reddit | LocalLLaMA | 64 comments
LDR maintainer here. Thanks to the strong support of the r/LocalLLaMA community, LDR has come a long way. I haven't reported in a while because I didn't think it was ready for another prominent post in one of the leading outlets of local LLM research.
But I think LDR is finally there again, and it is finally time to report.
Setup
- RTX 3090, 24GB
- Ollama backend (qwen3.6:27b)
- LDR's `langgraph_agent` strategy — LangChain `create_agent()` with tool-calling, parallel subtopic decomposition, up to 50 iterations (rough sketch below)
- LLM grader: qwen3.6:27b self-graded (I have used Opus to review examples and it generally only underestimates accuracy)
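For readers unfamiliar with that stack, here is a minimal sketch of what a `create_agent()`-based search agent can look like against a local Ollama model. This is illustrative only, not LDR's actual code: the tool body, model tag, and prompt are placeholders.

```python
# Hedged sketch, NOT LDR's actual code: a minimal LangChain create_agent() loop with
# one web-search tool against a local Ollama model. Tool body, model tag and prompt
# are illustrative placeholders.
from langchain.agents import create_agent
from langchain_core.tools import tool
from langchain_ollama import ChatOllama


@tool
def web_search(query: str) -> str:
    """Search the web and return result snippets for the query."""
    # Placeholder body; LDR wires real search engines (e.g. SearXNG, OpenAlex) in here.
    return f"(stub) search results for: {query}"


model = ChatOllama(model="qwen3.6:27b", temperature=0)
agent = create_agent(
    model,
    tools=[web_search],
    system_prompt="Decompose the question into subtopics, search, then answer concisely.",
)

result = agent.invoke({"messages": [{"role": "user", "content": "Who founded OpenAlex?"}]})
print(result["messages"][-1].content)
```

In practice the real strategy runs many such tool-calling iterations and decomposes subtopics in parallel, but the control flow above is the basic shape.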
Benchmarks (fully local LLM with web search)
| Model | SimpleQA | xbench-DeepSearch |
|---|---|---|
| Qwen3.6-27B | 95.7% (287/300) | 77.0% (77/100) |
| Qwen3.5-9B | 91.2% (182/200) | 59.0% (59/100) |
| gpt-oss-20B | 85.4% (295/346) | – |
Sample sizes are small, but the benchmarks were not rerun multiple times (no cherry-picking), and you can see from the other rows that this is unlikely to be just chance. Full leaderboard: https://huggingface.co/datasets/local-deep-research/ldr-benchmarks
Important framing — these are agent + search scores, not closed-book
However, also note that these are benchmark results similar to Perplexity Deep Research (93.9%), Tavily (93.3%), etc. [Tavily forces the LLM to answer only from retrieved docs (a pure retrieval test). Perplexity Deep Research is an end-to-end agent and discloses no grader or sample size.]
Even if our results were only 90%, it would already be a great success.
Also, I can confirm from using it daily that these results feel consistent with the performance I see on the random queries I run for everyday questions.
Caveats:
- SimpleQA contamination risk on newer base models is real
- LLM-judge noise + sampling error
- xbench-DeepSearch is in Chinese, which is an advantage for the Chinese Qwen models
- No BrowseComp / GAIA numbers yet - but I also don't believe we are good at those benchmarks yet. I will have to run them to verify the current state
The thing that surprised me:
Results seem to track tool-calling quality more than raw size for local deep research. The langgraph_agent strategy hammers the model with multi-iteration tool calls, parallel subagent decomposition, and structured output — exactly the axis where the newer Qwen generations have improved most. Hypothesis only; if anyone wants to design an ablation we'd love the data.
Some cool LDR features that I want to additionally highlight:
- Journal Quality System (shipped v1.6.0) - academic source grading using OpenAlex, DOAJ. I haven't seen this anywhere else in the open-source deep-research space.
- Per-user SQLCipher AES-256 DB (PBKDF2-HMAC-SHA512, 256k iterations) — admins can't read your data at rest. No password recovery; we don't hold the keys. (See the sketch after this list.)
- Zero telemetry. Grep the repo. The README states it explicitly: "no telemetry, no analytics, no tracking."
- Cosign-signed Docker images with SLSA provenance + SBOMs.
- MIT licensed. Audit anything.
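To make the per-user encryption bullet above concrete, here is a rough sketch of how a SQLCipher database is typically opened with those KDF settings. It assumes the `sqlcipher3` Python bindings and is not LDR's actual storage layer; the file name and table are made up.

```python
# Hedged sketch (not LDR's storage code): opening a per-user SQLCipher database.
# The passphrase is stretched with PBKDF2-HMAC-SHA512 at 256k iterations (the SQLCipher 4
# defaults); without the passphrase the file is unreadable, so admins can't read data at rest.
from sqlcipher3 import dbapi2 as sqlcipher

conn = sqlcipher.connect("user_research.db")
conn.execute("PRAGMA key = 'per-user passphrase';")                 # must be set first
conn.execute("PRAGMA cipher_kdf_algorithm = PBKDF2_HMAC_SHA512;")   # explicit, matches defaults
conn.execute("PRAGMA kdf_iter = 256000;")
conn.execute("CREATE TABLE IF NOT EXISTS findings (id INTEGER PRIMARY KEY, body TEXT);")
conn.commit()
conn.close()
```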
Repo: https://github.com/LearningCircuit/local-deep-research
Happy to share strategy configs and to help reproduce the Qwen runs.
Thanks to all the academic and other open source foundational work that made this repo possible.
Porespellar@reddit
Sorry for the lukewarm reception from some of the people on this sub. Your Local Deep Research repo is frigging amazing. I’ve been using it since December and was so impressed with it that I forked SearXNG and created SearXNG LDR Academic. In my version, I removed all the NSFW search engines (Pirate Bay, torrents, etc) and added several academic research-focused ones, and tried to make it integrate with LDR as much as possible.
I don't think people here realize how much development effort has been put into LDR by you and your team. Y'all even have your own private subreddit with well over 100 devs and researchers on it, if I remember correctly.
The report generation system you guys built is amazing, btw. Would love to see some Matplotlib and Seaborn integration in the future.
Full-Definition6215@reddit
95.7% SimpleQA on a single 3090 is a milestone. The gap between local and cloud is closing faster than most people expected.
I run Qwen models on an i9-9880H mini PC with 31GB RAM (CPU-only, no GPU) for production tasks — content moderation and article generation. Obviously nowhere near 3090 speeds, but for async batch processing where latency doesn't matter, the quality of Qwen3.6-27B is already good enough to replace API calls for most tasks.
Curious about the agentic search latency — how long does a full search-augmented answer take end-to-end on the 3090? And does it degrade gracefully when the search doesn't find relevant results?
oxygen_addiction@reddit
Lets do something new. What would you recommend for hemorrhoids?
ComplexIt@reddit (OP)
Timing is also included in the benchmark results. A full run really takes a few minutes, and it depends a lot on the question and how much the model wants to search for it.
Qwen seems to go for more agent cycles, which I believe is partly why it is so good, but it is also a bit slower.
Full-Definition6215@reddit
Makes sense — more agent cycles trading speed for accuracy is a good tradeoff for a search-augmented system. Better to take an extra minute and get the right answer than respond fast with hallucinations.
Thanks for sharing the benchmark data. Will dig into the repo.
SKirby00@reddit
I've been using LDR for a few days now, and I've been VERY impressed. Absolute game-changer for the types of questions I can ask of my AI, and the quality of results I can expect.
I've asked it to produce comprehensive reports on some topics where I genuinely needed a high level of detail, and the reports came out at 80-100 pages (when exported as PDF) of dense and accurate content, with every notable claim sourced and cited, great adherence to the original query/prompt, and pretty much the exact level of depth that I was looking for.
My biggest complaint right now is that you can't pause/resume a query in progress, and if it hits a snag (like temporarily losing connection to the inference server), seems you basically just need to restart.
I highly recommend LDR.
ComplexIt@reddit (OP)
Thank you. I will get to the pause/resume button eventually.
User_Deprecated@reddit
Tool-calling quality over size, that tracks. I've been running 27B locally for agentic stuff and the annoying failures aren't wrong answers. It's when the model just skips a search it should've done, or latches onto some wrong subtopic early. By iteration 30 you're already off track and nothing looks obviously broken. Does the parallel decomposition help with that at all?
ComplexIt@reddit (OP)
Thanks for your interest.
From time to time it helps to anchor the original task/question at the end of the prompt for the next iteration. It doesn't cost much context and can help with this.
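A minimal sketch of that re-anchoring trick, illustrative only (the function and field names are made up, not LDR's internals): keep restating the original question at the end of every iteration's prompt so the agent can't quietly drift onto a subtopic.

```python
# Illustrative sketch of re-anchoring the original question each iteration;
# names are invented for the example, not LDR's actual prompt code.
def build_iteration_prompt(original_question: str, findings_so_far: str, next_step: str) -> str:
    return (
        f"Findings so far:\n{findings_so_far}\n\n"
        f"Planned next step:\n{next_step}\n\n"
        # Restate the task last: costs a few tokens, fights subtopic drift.
        f"Original question (answer THIS, not a subtopic): {original_question}"
    )
```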
No_Hunter_7786@reddit
95.7% on SimpleQA fully local is wild. The point about tool-calling quality mattering more than raw size is interesting, makes sense given how much the benchmark relies on multi-step retrieval. Nice work
ComplexIt@reddit (OP)
Thank you for your very friendly words
Pitpeaches@reddit
Why langchain over MCP or any other searching api?
ComplexIt@reddit (OP)
You still need a search API either way (we support multiple, including academic ones like OpenAlex, but for SimpleQA you need a generic search engine).
LangGraph just worked surprisingly well for implementing the agent, although I wouldn't say it is the best or only way to do it. It is just what worked after multiple iterations and tests.
fiery_prometheus@reddit
Did you try Pydantic AI or LlamaIndex?
ComplexIt@reddit (OP)
Not so far. I think they are equally viable, but I did not try, compare, or benchmark libraries in general.
fiery_prometheus@reddit
Pydantic AI can also just help with structured parsing of results etc., which could be helpful. What do agentic research strategies then entail for you? I think composition, aggregation, and agents get a bit muddled these days.
ComplexIt@reddit (OP)
Thank you, I might look into it but I have so many tasks in this repo that I probably will not find the time soon.
Concerning research strategies: currently, I am looking into adding more tools without confusing the models, although some other features in LDR have higher priority at the moment.
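For readers wondering what "structured parsing of results" buys here, a plain-Pydantic sketch of validating an agent's JSON output into a typed object (this is the manual flavor of what Pydantic AI automates around the LLM call). The schema and sample payload are invented for illustration, not LDR's actual format.

```python
# Illustrative only: validate an agent's JSON output into a typed object with Pydantic.
# Schema and payload are made up for the example.
from pydantic import BaseModel


class SubtopicFinding(BaseModel):
    subtopic: str
    answer: str
    sources: list[str]


raw = '{"subtopic": "founding year", "answer": "2004", "sources": ["https://example.org"]}'
finding = SubtopicFinding.model_validate_json(raw)  # raises ValidationError if malformed
print(finding.answer, finding.sources)
```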
dead_dads@reddit
Yo! New to local LLMs/ai stuff in general. I have an old 3090 and 128gb of DDR4 RAM. Was going to sell my old machine for parts but occurred to me this week I could turn it into an ai machine to dip my toes into locally run stuff.
My interest rn is to work on some vibe coding projects. Would like to assess and test models that fit fully into the VRAM of the 3090 but also curious about utilizing my ram (DDR4) to see what larger models can bring into the equation.
What models would be worth my time for testing? I've been working with Claude to ID some stuff of interest, but since this field moves so fast I thought asking people who are actively engaged in it would be better.
ComplexIt@reddit (OP)
Nice setup and you can run larger models than me. Also, thanks for your interest.
You can contribute benchmark results here: https://github.com/LearningCircuit/ldr-benchmarks/
I can support you in setting up everything for benchmarking. You can write me a message here.
It might also help to discuss this in the LDR community: https://www.reddit.com/r/LocalDeepResearch/
icedgz@reddit
Meanwhile I can't even get my Ollama Qwen3.6-27b to do tool calls correctly… sigh (Windows, opencode)
Important_Quote_1180@reddit
https://github.com/noonghunna/club-3090/tree/0df8f743192809dbdcda942887b625b0f48699f2
Hope this helps, it’s vLLM not Ollama but you are probably used to random internet strangers recommending anything besides Ollama
icedgz@reddit
lol. I finally got it working with buun llama fork.
Seems OK, I have a 5070 Ti.
At 65k context it runs well at 50 t/s or so, but 100k context doesn't seem to fit in VRAM and slows to a crawl.
GovernmentTechnical@reddit
It's an issue with qwen3.6. I got it working with a fixed Jinja template but haven't done any longer runs yet, so I can't vouch 100%.
icedgz@reddit
Have a link?
GovernmentTechnical@reddit
The template is in froggeric/Qwen-Fixed-Chat-Templates on HF. I just overwrote the template in my model files with that one.
trialbuterror@reddit
Can it run on a 9060 XT 16GB and 16GB DDR4?
ComplexIt@reddit (OP)
Yes, that's possible. https://lmstudio.ai/download (for AMD on Windows). Use the qwen 3.5 9b model to start.
LienniTa@reddit
t/s? Quant?
ComplexIt@reddit (OP)
This question humbles me. I just use https://ollama.com/library/qwen3.6 "qwen3.6:27b". I hope this is helpful. Thank you for your interest.
Mountain_Patience231@reddit
What's the pp and t/s?
iwanttobeweathy@reddit
My old 3090 on Linux gets ~1000-2000 pp and around 80-100 t/s.
andy2na@reddit
Do you have sustained benchmark results? 80 to 100 is not possible at longer context.
iwanttobeweathy@reddit
Oh, forgot to mention: I run a fork of llama.cpp with a turbo quant implementation for the KV cache too.
moderately-extremist@reddit
It's qwen3.6-27b running on your hardware. If you are doing any local LLM, you should be able to check those numbers yourself, or if you are using a common gpu or Ryzen AI system or Mac system, you can find benchmark numbers from other people with a little searching. But it's going to depend on your hardware, OP's pp and ts numbers aren't going to mean anything.
rootbeer_racinette@reddit
Yes they will, I have a 3090 and I want to know before setting this all up.
moderately-extremist@reddit
Shouldn't be too hard to find qwen3.6-27b speeds on a 3090.
Dry_Cartographer3348@reddit
Hey, sorry for a basic question, but what is pp? I know t/s but have never heard of pp.
MrPanache52@reddit
pp is prompt processing (how fast it reads the prompt), t/s is token generation (how fast it responds).
Far-Low-4705@reddit
* uses ollama
* uses langchain
* uses LLM as judge
-_-
One-Pain6799@reddit
Thanks for sharing. That's a cool repo.
ComplexIt@reddit (OP)
Thank you, it means a lot.
buttplugs4life4me@reddit
Seems very vibe coded. Just as an example, a single file has 5 different ways of getting a parameter...
thx1138inator@reddit
Should we care? As long as the proof is in the pudding, IMHO. Sounds a little gate-keepy.
NNN_Throwaway2@reddit
Oh no... are people saying that the bare minimum standards of code quality is "gate-keeping" now?
buttplugs4life4me@reddit
Well, it's all about quality and reliability. A vibe coded app that can't even stick to one convention for configuring it is inherently harder to maintain, even if that maintenance is likely only done by LLMs as well.
And benchmarks are easy to fake/benchmaxx against.
thx1138inator@reddit
Point taken. And someone else reminded me that vibe-coded projects are not allowed in the sub. It's just... back when that rule was created, there was a lot of AI slop in both digital media and source code. It was annoying. But it seems like we should be open to the possibility of very high-quality generated code. Like, "vibe-coded" as a pejorative will need to be retired (and pretty soon, I would estimate!).
moderately-extremist@reddit
It's against the subreddit rules.
ID-10T_Error@reddit
This... the end result is what matters, not the path taken.
draconic_tongue@reddit
as long as it was vibecoded with the title model
Kong28@reddit
Super cool, definitely need to put my 3060 to use a bit.
ComplexIt@reddit (OP)
Thank you. Maybe try qwen 3.5 9b
bnolsen@reddit
I use a Q4_K_M quant on my server that has a 12GB 3060.
rawcode@reddit
Really cool stuff, had no clue about local deep research. Thanks!
ComplexIt@reddit (OP)
Thank you for the positive feedback, it means a lot.
rafio77@reddit
95.7% on SimpleQA with agentic search is basically a search-quality test more than a reasoning one; the bench is dominated by factoids that resolve in 1-2 retrievals if your retriever doesn't suck. The interesting question is the latency/$/query at that 95.7%, i.e. how many tool calls per query, and whether the 3090 is actually running the model or just hosting the orchestrator while a beefier remote box does inference. Also worth seeing the breakdown vs a non-agentic baseline with the same model; if the delta is only 8-10 points, the SimpleQA ceiling was always retrieval-limited, not model-limited.
ComplexIt@reddit (OP)
SimpleQA is difficult for a small model.
Small mistakes in your pipeline add up much worse than they would with larger or cloud models. Furthermore, you are context-restricted.
To achieve this performance you need (1) a very optimized pipeline and (2) a very good base model that doesn't get confused easily and understands tool calls and search. I have tried many strategies in this repo and tested many models. You can check the git history of the advanced search system subfolder of the repository.
Note that benchmark performance on another benchmark (xbench-DeepSearch) is also reported in the table above.
And yes, only the 3090 is running the model, and only the mentioned Qwen model is used. Inference is fully local.
Dry_Cartographer3348@reddit
Very new to this space, can someone please explain what LDR is in this context, and what SimpleQA is?
ComplexIt@reddit (OP)
LDR = Local Deep Research, the name of the repository I maintain. You can read more about it in this post I made recently: https://www.reddit.com/r/WebAfterAI/comments/1t18wr6/local_deep_research_opensource_ai_research/
SimpleQA is a benchmark by OpenAI.
https://openai.com/index/introducing-simpleqa/
https://llm-stats.com/benchmarks/simpleqa
AngeloKappos@reddit
95.7% on SimpleQA is a solid number, but self-graded with the same model doing inference is going to inflate that score.
Run it through a separate grader, even llama-3.1-8b-instruct, and watch it drop a few points.
ComplexIt@reddit (OP)
I let Opus cross-check some of the results, and it indicated that the self-grading underreports accuracy. Although I agree grading is a problem, as I wrote; in general, agreed.
The reason I use the same model for grading is that it is already loaded in VRAM and I don't have to switch models. The current benchmark implementation grades after each question, which is nice for the user, who can see the current performance in the live benchmark.
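For anyone who wants to try the separate-grader experiment suggested above, here is a hedged sketch of what swapping in a different judge model via Ollama could look like. The judge tag, prompt, and verdict parsing are placeholders, not LDR's actual grader implementation.

```python
# Illustrative sketch of grading with a separate judge model via Ollama; the judge tag,
# prompt and verdict parsing are placeholders, not LDR's actual grader.
from langchain_ollama import ChatOllama

GRADER_PROMPT = (
    "Question: {q}\n"
    "Gold answer: {gold}\n"
    "Model answer: {pred}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def grade(q: str, gold: str, pred: str, judge_tag: str = "llama3.1:8b") -> bool:
    judge = ChatOllama(model=judge_tag, temperature=0)
    verdict = judge.invoke(GRADER_PROMPT.format(q=q, gold=gold, pred=pred)).content
    # "INCORRECT" contains "CORRECT", so check the start of the verdict only.
    return verdict.strip().upper().startswith("CORRECT")
```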
kzoltan@reddit
Nice!
It might be worth comparing to these: https://huggingface.co/spaces/muset-ai/DeepResearch-Bench-Leaderboard
I have no idea about the benchmark details; it might take some compute…
ComplexIt@reddit (OP)
We have the questions from this benchmark ("xbench-DeepSearch") as a benchmark category, but we just take the questions and calculate accuracy. Our performance is decent according to the results you can see in the dataset below. However, I think what the benchmark page you linked does is more sophisticated and not reproducible for me due to limited resources; I can only calculate accuracy on their question-answer pairs. You can look into our UI, run the benchmark questions, and get accuracy. It is Chinese though, so you need to translate or trust your grader on output quality. https://huggingface.co/datasets/local-deep-research/ldr-benchmarks/viewer/xbench-deepsearch
PaceZealousideal6091@reddit
Thanks for sharing this! Is it backend flexible? How does qwen 3.6 35B fare?
ComplexIt@reddit (OP)
All Qwen models I have tried so far seem to work very well with it.
Have a look at the benchmarks I have created so far: https://huggingface.co/datasets/local-deep-research/ldr-benchmarks