We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local
Posted by ComplexIt@reddit | LocalLLaMA | 64 comments
LDR maintainer here. Thanks to the strong support of the r/LocalLLaMA community, LDR has come a long way. I haven't reported in a while because I didn't think it was ready for another prominent post in one of the leading outlets of local LLM research.
But I think LDR is finally there again, and it is finally time to report.
Setup
- RTX 3090, 24GB
- Ollama backend (qwen3.6:27b)
- LDR's `langgraph_agent` strategy — LangChain `create_agent()` with tool-calling, parallel subtopic decomposition, up to 50 iterations (rough sketch below)
- LLM grader: qwen3.6:27b self-graded (I have used Opus to review examples and it generally only underestimates accuracy)
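For readers unfamiliar with that stack, here is a minimal sketch of what a `create_agent()`-based search agent can look like against a local Ollama model. This is illustrative only, not LDR's actual code: the tool body, model tag, and prompt are placeholders.

```python
# Hedged sketch, NOT LDR's actual code: a minimal LangChain create_agent() loop with
# one web-search tool against a local Ollama model. Tool body, model tag and prompt
# are illustrative placeholders.
from langchain.agents import create_agent
from langchain_core.tools import tool
from langchain_ollama import ChatOllama


@tool
def web_search(query: str) -> str:
    """Search the web and return result snippets for the query."""
    # Placeholder body; LDR wires real search engines (e.g. SearXNG, OpenAlex) in here.
    return f"(stub) search results for: {query}"


model = ChatOllama(model="qwen3.6:27b", temperature=0)
agent = create_agent(
    model,
    tools=[web_search],
    system_prompt="Decompose the question into subtopics, search, then answer concisely.",
)

result = agent.invoke({"messages": [{"role": "user", "content": "Who founded OpenAlex?"}]})
print(result["messages"][-1].content)
```

In practice the real strategy runs many such tool-calling iterations and decomposes subtopics in parallel, but the control flow above is the basic shape.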
Benchmarks (fully local LLM with web search)
| Model | SimpleQA | xbench-DeepSearch |
|---|---|---|
| Qwen3.6-27B | 95.7% (287/300) | 77.0% (77/100) |
| Qwen3.5-9B | 91.2% (182/200) | 59.0% (59/100) |
| gpt-oss-20B | 85.4% (295/346) | – |
Sample sizes are small, but the benchmarks were not rerun multiple times (no cherry-picking), and you can see from the other rows that this is unlikely to be just chance. Full leaderboard: https://huggingface.co/datasets/local-deep-research/ldr-benchmarks
Important framing — these are agent + search scores, not closed-book
However, also note that these are benchmark results similar to Perplexity Deep Research (93.9%), Tavily (93.3%), etc. [Tavily forces the LLM to answer only from retrieved docs (a pure retrieval test). Perplexity Deep Research is an end-to-end agent and discloses no grader or sample size.]
Even if our results were only 90%, it would already be a great success.
Also, I can confirm from using it daily that these results feel consistent with the performance I see on the random queries I run for everyday questions.
Caveats:
- SimpleQA contamination risk on newer base models is real
- LLM-judge noise + sampling error
- xbench-DeepSearch is in Chinese, which is an advantage for the Chinese Qwen models
- No BrowseComp / GAIA numbers yet - but I also don't believe we are good at those benchmarks yet. I will have to run them to verify the current state
The thing that surprised me:
Results seem to track tool-calling quality more than raw size for local deep research. The langgraph_agent strategy hammers the model with multi-iteration tool calls, parallel subagent decomposition, and structured output — exactly the axis where the newer Qwen generations have improved most. Hypothesis only; if anyone wants to design an ablation we'd love the data.
Some cool LDR features that I want to additionally highlight:
- Journal Quality System (shipped v1.6.0) - academic source grading using OpenAlex, DOAJ. I haven't seen this anywhere else in the open-source deep-research space.
- Per-user SQLCipher AES-256 DB (PBKDF2-HMAC-SHA512, 256k iterations) — admins can't read your data at rest. No password recovery; we don't hold the keys. (See the sketch after this list.)
- Zero telemetry. Grep the repo. The README states it explicitly: "no telemetry, no analytics, no tracking."
- Cosign-signed Docker images with SLSA provenance + SBOMs.
- MIT licensed. Audit anything.
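To make the per-user encryption bullet above concrete, here is a rough sketch of how a SQLCipher database is typically opened with those KDF settings. It assumes the `sqlcipher3` Python bindings and is not LDR's actual storage layer; the file name and table are made up.

```python
# Hedged sketch (not LDR's storage code): opening a per-user SQLCipher database.
# The passphrase is stretched with PBKDF2-HMAC-SHA512 at 256k iterations (the SQLCipher 4
# defaults); without the passphrase the file is unreadable, so admins can't read data at rest.
from sqlcipher3 import dbapi2 as sqlcipher

conn = sqlcipher.connect("user_research.db")
conn.execute("PRAGMA key = 'per-user passphrase';")                 # must be set first
conn.execute("PRAGMA cipher_kdf_algorithm = PBKDF2_HMAC_SHA512;")   # explicit, matches defaults
conn.execute("PRAGMA kdf_iter = 256000;")
conn.execute("CREATE TABLE IF NOT EXISTS findings (id INTEGER PRIMARY KEY, body TEXT);")
conn.commit()
conn.close()
```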
Repo: https://github.com/LearningCircuit/local-deep-research
Happy to share strategy configs and to help reproduce the Qwen runs.
Thanks to all the academic and other open source foundational work that made this repo possible.
Porespellar@reddit
Sorry for the lukewarm reception from some of the people on this sub. Your Local Deep Research repo is frigging amazing. I’ve been using it since December and was so impressed with it that I forked SearXNG and created SearXNG LDR Academic. In my version, I removed all the NSFW search engines (Pirate Bay, torrents, etc) and added several academic research-focused ones, and tried to make it integrate with LDR as much as possible.
I don't think people here realize how much development effort has been put into LDR by you and your team. Y'all even have your own private subreddit with well over 100 devs and researchers on it, if I remember correctly.
The report generation system you guys built is amazing, btw. Would love to see some Matplotlib and Seaborn integration in the future.
Full-Definition6215@reddit
95.7% SimpleQA on a single 3090 is a milestone. The gap between local and cloud is closing faster than most people expected.
I run Qwen models on an i9-9880H mini PC with 31GB RAM (CPU-only, no GPU) for production tasks — content moderation and article generation. Obviously nowhere near 3090 speeds, but for async batch processing where latency doesn't matter, the quality of Qwen3.6-27B is already good enough to replace API calls for most tasks.
Curious about the agentic search latency — how long does a full search-augmented answer take end-to-end on the 3090? And does it degrade gracefully when the search doesn't find relevant results?
oxygen_addiction@reddit
Lets do something new. What would you recommend for hemorrhoids?
ComplexIt@reddit (OP)
Timing is also included in the benchmark results. A full run really takes a few minutes, and it depends a lot on the question and how much the model wants to search for it.
Qwen seems to go for more agent cycles, which I believe is partly why it is so good, but it is also a bit slower.
Full-Definition6215@reddit
Makes sense — more agent cycles trading speed for accuracy is a good tradeoff for a search-augmented system. Better to take an extra minute and get the right answer than respond fast with hallucinations.
Thanks for sharing the benchmark data. Will dig into the repo.
SKirby00@reddit
I've been using LDR for a few days now, and I've been VERY impressed. Absolute game-changer for the types of questions I can ask of my AI, and the quality of results I can expect.
I've asked it to produce comprehensive reports on some topics where I genuinely needed a high level of detail, and the reports came out at 80-100 pages (when exported as PDF) of dense and accurate content, with every notable claim sourced and cited, great adherence to the original query/prompt, and pretty much the exact level of depth that I was looking for.
My biggest complaint right now is that you can't pause/resume a query in progress, and if it hits a snag (like temporarily losing connection to the inference server), seems you basically just need to restart.
I highly recommend LDR.
ComplexIt@reddit (OP)
Thank you. I will get to the pause/resume button eventually.
User_Deprecated@reddit
Tool-calling quality over size, that tracks. I've been running 27B locally for agentic stuff and the annoying failures aren't wrong answers. It's when the model just skips a search it should've done, or latches onto some wrong subtopic early. By iteration 30 you're already off track and nothing looks obviously broken. Does the parallel decomposition help with that at all?
ComplexIt@reddit (OP)
Thanks for your interest.
From time to time it helps to anchor the original task/question at the end of the prompt for the next iteration. It doesn't cost much context and can help with this.
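A minimal sketch of that re-anchoring trick, illustrative only (the function and field names are made up, not LDR's internals): keep restating the original question at the end of every iteration's prompt so the agent can't quietly drift onto a subtopic.

```python
# Illustrative sketch of re-anchoring the original question each iteration;
# names are invented for the example, not LDR's actual prompt code.
def build_iteration_prompt(original_question: str, findings_so_far: str, next_step: str) -> str:
    return (
        f"Findings so far:\n{findings_so_far}\n\n"
        f"Planned next step:\n{next_step}\n\n"
        # Restate the task last: costs a few tokens, fights subtopic drift.
        f"Original question (answer THIS, not a subtopic): {original_question}"
    )
```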
No_Hunter_7786@reddit
95.7% on SimpleQA fully local is wild. The point about tool-calling quality mattering more than raw size is interesting, makes sense given how much the benchmark relies on multi-step retrieval. Nice work
ComplexIt@reddit (OP)
Thank you for your very friendly words
Pitpeaches@reddit
Why langchain over MCP or any other searching api?
ComplexIt@reddit (OP)
You still need a search API either way (we support multiple, including academic ones like OpenAlex, but for SimpleQA you need a generic search engine).
LangGraph just worked surprisingly well for implementing the agent, although I wouldn't say it is the best or only way to do it. It is just what worked after multiple iterations and tests.
fiery_prometheus@reddit
Did you try Pydantic AI or LlamaIndex?
ComplexIt@reddit (OP)
Not so far. I think they are equally viable, but I did not try, compare, or benchmark libraries in general.
fiery_prometheus@reddit
Pydantic AI can also just help with structured parsing of results etc., which could be helpful. What do agentic research strategies then entail for you? I think composition, aggregation, and agents get a bit muddled these days.
ComplexIt@reddit (OP)
Thank you, I might look into it but I have so many tasks in this repo that I probably will not find the time soon.
Concerning research strategies: currently, I am looking into adding more tools without confusing the models, although some other features in LDR have higher priority at the moment.
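For readers wondering what "structured parsing of results" buys here, a plain-Pydantic sketch of validating an agent's JSON output into a typed object (this is the manual flavor of what Pydantic AI automates around the LLM call). The schema and sample payload are invented for illustration, not LDR's actual format.

```python
# Illustrative only: validate an agent's JSON output into a typed object with Pydantic.
# Schema and payload are made up for the example.
from pydantic import BaseModel


class SubtopicFinding(BaseModel):
    subtopic: str
    answer: str
    sources: list[str]


raw = '{"subtopic": "founding year", "answer": "2004", "sources": ["https://example.org"]}'
finding = SubtopicFinding.model_validate_json(raw)  # raises ValidationError if malformed
print(finding.answer, finding.sources)
```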
dead_dads@reddit
Yo! New to local LLMs/ai stuff in general. I have an old 3090 and 128gb of DDR4 RAM. Was going to sell my old machine for parts but occurred to me this week I could turn it into an ai machine to dip my toes into locally run stuff.
My interest rn is to work on some vibe coding projects. Would like to assess and test models that fit fully into the VRAM of the 3090 but also curious about utilizing my ram (DDR4) to see what larger models can bring into the equation.
What models would be worth my time for testing? I've been working with Claude to ID some stuff of interest, but since this field moves so fast I thought asking people who are actively engaged in it would be better.
ComplexIt@reddit (OP)
Nice setup and you can run larger models than me. Also, thanks for your interest.
You can contribute benchmark results here: https://github.com/LearningCircuit/ldr-benchmarks/
I can support you in setting up everything for benchmarking. You can write me a message here.
It might also help to discuss this in the LDR community: https://www.reddit.com/r/LocalDeepResearch/
icedgz@reddit
Meanwhile I can't even get my Ollama Qwen3.6-27b to do tool calls correctly… sigh (Windows, opencode)
Important_Quote_1180@reddit
https://github.com/noonghunna/club-3090/tree/0df8f743192809dbdcda942887b625b0f48699f2
Hope this helps, it’s vLLM not Ollama but you are probably used to random internet strangers recommending anything besides Ollama
icedgz@reddit
lol. I finally got it working with buun llama fork.
Seems OK, I have a 5070 Ti.
At 65k context it runs well at 50 t/s or so, but 100k context doesn't seem to fit in VRAM and slows to a crawl.
GovernmentTechnical@reddit
It's an issue with qwen3.6. I got it working with a fixed Jinja template but haven't done any longer runs yet, so I can't vouch 100%.
icedgz@reddit
Have a link?
GovernmentTechnical@reddit
The template is in froggeric/Qwen-Fixed-Chat-Templates on HF. I just overwrote the template in my model files with that one.
trialbuterror@reddit
Can it run on a 9060 XT 16GB and 16GB DDR4?
ComplexIt@reddit (OP)
Yes, that's possible. https://lmstudio.ai/download (for AMD on Windows). Use the qwen 3.5 9b model to start.
LienniTa@reddit
t/s? Quant?
ComplexIt@reddit (OP)
This question humbles me. I just use https://ollama.com/library/qwen3.6 "qwen3.6:27b". I hope this is helpful. Thank you for your interest.
Mountain_Patience231@reddit
What's the pp and t/s?
iwanttobeweathy@reddit
My old 3090 on Linux gets ~1000-2000 pp and around 80-100 t/s.
andy2na@reddit
Do you have sustained benchmark results? 80 to 100 is not possible at longer context.
iwanttobeweathy@reddit
Oh, forgot to mention: I run a fork of llama.cpp with a turbo quant implementation for the KV cache too.
moderately-extremist@reddit
It's qwen3.6-27b running on your hardware. If you are doing any local LLM, you should be able to check those numbers yourself, or if you are using a common gpu or Ryzen AI system or Mac system, you can find benchmark numbers from other people with a little searching. But it's going to depend on your hardware, OP's pp and ts numbers aren't going to mean anything.
rootbeer_racinette@reddit
Yes they will, I have a 3090 and I want to know before setting this all up.
moderately-extremist@reddit
Shouldn't be too hard to find qwen3.6-27b speeds on a 3090.
Dry_Cartographer3348@reddit
Hey, sorry for a basic question, but what is pp? I know t/s but have never heard of pp.
MrPanache52@reddit
pp is prompt processing (how fast it reads the prompt), t/s is token generation (how fast it responds).
Far-Low-4705@reddit
* uses ollama
* uses langchain
* uses LLM as judge
-_-
One-Pain6799@reddit
Thanks for sharing. That's a cool repo.
ComplexIt@reddit (OP)
Thank you, it means a lot.
buttplugs4life4me@reddit
Seems very vibe coded. Just as an example, a single file has 5 different ways of getting a parameter...
thx1138inator@reddit
Should we care? As long as the proof is in the pudding, IMHO. Sounds a little gate-keepy.
NNN_Throwaway2@reddit
Oh no... are people saying that the bare minimum standards of code quality is "gate-keeping" now?
buttplugs4life4me@reddit
Well, it's all about quality and reliability. A vibe coded app that can't even stick to one convention for configuring it is inherently harder to maintain, even if that maintenance is likely only done by LLMs as well.
And benchmarks are easy to fake/benchmaxx against.
thx1138inator@reddit
Point taken. And someone else reminded me that vibe-coded projects are not allowed in the sub. It's just... back when that rule was created, there was a lot of AI slop in both digital media and source code. It was annoying. But it seems like we should be open to the possibility of very high-quality generated code. Like, "vibe-coded" as a pejorative will need to be retired (and pretty soon, I would estimate!).
moderately-extremist@reddit
It's against the subreddit rules.
ID-10T_Error@reddit
This... the end result is what matters, not the path taken.
draconic_tongue@reddit
as long as it was vibecoded with the title model
Kong28@reddit
Super cool, definitely need to put my 3060 to use a bit.
ComplexIt@reddit (OP)
Thank you. Maybe try qwen 3.5 9b
bnolsen@reddit
I use a Q4_K_M quant on my server that has a 12GB 3060.
rawcode@reddit
Really cool stuff, had no clue about local deep research. Thanks!
ComplexIt@reddit (OP)
Thank you for the positive feedback, it means a lot.
rafio77@reddit
95.7% on SimpleQA with agentic search is basically a search-quality test more than a reasoning one; the bench is dominated by factoids that resolve in 1-2 retrievals if your retriever doesn't suck. The interesting question is the latency/$/query at that 95.7%, i.e. how many tool calls per query, and whether the 3090 is actually running the model or just hosting the orchestrator while a beefier remote box does inference. Also worth seeing the breakdown vs a non-agentic baseline with the same model; if the delta is only 8-10 points, the SimpleQA ceiling was always retrieval-limited, not model-limited.
ComplexIt@reddit (OP)
SimpleQA is difficult for a small model.
Small mistakes in your pipeline add up much worse than they would with larger or cloud models. Furthermore, you are context-restricted.
To achieve this performance you need (1) a very optimized pipeline and (2) a very good base model that doesn't get confused easily and understands tool calls and search. I have tried many strategies in this repo and tested many models. You can check the git history of the advanced search system subfolder of the repository.
Note that benchmark performance on another benchmark (xbench-DeepSearch) is also reported in the table above.
And yes, only the 3090 is running the model, and only the mentioned Qwen model is used. Inference is fully local.
Dry_Cartographer3348@reddit
Very new to this space, can someone please explain what LDR is in this context, and what SimpleQA is?
ComplexIt@reddit (OP)
LDR = Local Deep Research, the name of the repository I maintain. You can read more about it in this post I made recently: https://www.reddit.com/r/WebAfterAI/comments/1t18wr6/local_deep_research_opensource_ai_research/
SimpleQA is a benchmark by OpenAI.
https://openai.com/index/introducing-simpleqa/
https://llm-stats.com/benchmarks/simpleqa
AngeloKappos@reddit
95.7% on SimpleQA is a solid number, but self-graded with the same model doing inference is going to inflate that score.
Run it through a separate grader, even llama-3.1-8b-instruct, and watch it drop a few points.
ComplexIt@reddit (OP)
I let Opus cross-check some of the results, and it indicated that the self-grading underreports accuracy. Although I agree grading is a problem, as I wrote; in general, agreed.
The reason I use the same model for grading is that it is already loaded in VRAM and I don't have to switch models. The current benchmark implementation grades after each question, which is nice for the user, who can see the current performance in the live benchmark.
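For anyone who wants to try the separate-grader experiment suggested above, here is a hedged sketch of what swapping in a different judge model via Ollama could look like. The judge tag, prompt, and verdict parsing are placeholders, not LDR's actual grader implementation.

```python
# Illustrative sketch of grading with a separate judge model via Ollama; the judge tag,
# prompt and verdict parsing are placeholders, not LDR's actual grader.
from langchain_ollama import ChatOllama

GRADER_PROMPT = (
    "Question: {q}\n"
    "Gold answer: {gold}\n"
    "Model answer: {pred}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def grade(q: str, gold: str, pred: str, judge_tag: str = "llama3.1:8b") -> bool:
    judge = ChatOllama(model=judge_tag, temperature=0)
    verdict = judge.invoke(GRADER_PROMPT.format(q=q, gold=gold, pred=pred)).content
    # "INCORRECT" contains "CORRECT", so check the start of the verdict only.
    return verdict.strip().upper().startswith("CORRECT")
```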
kzoltan@reddit
Nice!
It might be worth comparing to these: https://huggingface.co/spaces/muset-ai/DeepResearch-Bench-Leaderboard
I have no idea about the benchmark details; it might take some compute…
ComplexIt@reddit (OP)
We have the questions from this benchmark ("xbench-DeepSearch") as a benchmark category, but we just take the questions and calculate accuracy. Our performance is decent according to the results you can see in the dataset below. However, I think what the benchmark page you linked does is more sophisticated and not reproducible for me due to limited resources; I can only calculate accuracy on their question-answer pairs. You can look into our UI, run the benchmark questions, and get accuracy. It is Chinese though, so you need to translate or trust your grader on output quality. https://huggingface.co/datasets/local-deep-research/ldr-benchmarks/viewer/xbench-deepsearch
PaceZealousideal6091@reddit
Thanks for sharing this! Is it backend flexible? How does qwen 3.6 35B fare?
ComplexIt@reddit (OP)
All Qwen models I have tried so far seem to work very well with it.
Have a look at the benchmarks I have created so far: https://huggingface.co/datasets/local-deep-research/ldr-benchmarks