I no longer need a cloud LLM to do quick web research
Posted by BitPsychological2767@reddit | LocalLLaMA | View on Reddit | 61 comments
This might be super old news to some people, but I only recently started using local models, since they've only just begun meeting my quality bar. I just want to share the setup I have for web searching/scraping locally.
I use Qwen3.5:27B-Q3_K_M on an RTX 4090 with a context length of ~200,000. I get ~40 tk/s and use about 22 GB of VRAM.
I use it through the llama.cpp Web UI, with MCP tools enabled. Here are the tools I have provided it for web search/scrape:
"""
webmcp - MCP server for web scraping and content extraction
"""
import asyncio
import json
import logging
import os
import re
import time
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import httpx
from ddgs import DDGS
from markdownify import markdownify as md
from mcp.server.fastmcp import FastMCP
from mcp.server.transport_security import TransportSecuritySettings
from playwright.async_api import async_playwright
from readability import Document as ReadabilityDocument
from starlette.middleware.cors import CORSMiddleware
# ============================================================================
# Configuration
# ============================================================================
logger = logging.getLogger(__name__)
TOOL_CALL_LOG_PATH = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    "tool_calls.log.json",
)
LLM_URL = os.environ.get("LLM_URL", "")
LLM_MODEL = os.environ.get("LLM_MODEL", "")
if not LLM_URL or not LLM_MODEL:
    raise ValueError("LLM_URL and LLM_MODEL environment variables are required")
# ============================================================================
# Content Processing
# ============================================================================
def _html_to_clean(html: str) -> str:
    """Convert HTML to clean markdown, collapsing excessive whitespace."""
    text = md(
        html,
        heading_style="ATX",
        strip=["img", "script", "style", "nav", "footer", "header"],
    )
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    text = re.sub(r"[^\S\n]+", " ", text)   # collapse horizontal whitespace
    return text.strip()
async def _fetch_one(browser: Any, url: str, timeout_ms: int = 0) -> tuple[str, str]:
    """Load a page in the shared browser and return (title, cleaned text)."""
    page = await browser.new_page()
    await page.set_extra_http_headers({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })
    try:
        # timeout=0 disables Playwright's navigation timeout
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
        await page.wait_for_timeout(2000)  # let client-side rendering settle
        html = await page.content()
    finally:
        await page.close()
    doc = ReadabilityDocument(html)
    title = doc.title()
    clean_text = _html_to_clean(doc.summary())
    if len(clean_text) < 50:
        # Readability extracted almost nothing; fall back to the full page
        clean_text = _html_to_clean(html)
    return title, clean_text
async def _fetch_pages(urls: list[str]) -> list[tuple[str, str, str | None]]:
    """Fetch several URLs concurrently; each result is (title, text, error)."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            async def _fetch_single(url: str) -> tuple[str, str, str | None]:
                try:
                    title, text = await _fetch_one(browser, url)
                    return title, text, None
                except Exception as e:
                    logger.error(f"Failed to fetch {url}: {e}")
                    return "", "", str(e)

            results = await asyncio.gather(*[_fetch_single(u) for u in urls])
        finally:
            await browser.close()
    return results
async def _fetch_page_light(url: str) -> tuple[str, str]:
    """Lightweight fetch with httpx only; no JavaScript execution."""
    async with httpx.AsyncClient(
        timeout=30,
        follow_redirects=True,
        verify=False,  # note: TLS certificate verification is disabled
    ) as client:
        resp = await client.get(
            url,
            headers={"User-Agent": "Mozilla/5.0"},
        )
        resp.raise_for_status()
        html = resp.text
    doc = ReadabilityDocument(html)
    title = doc.title()
    clean_text = _html_to_clean(doc.summary())
    if len(clean_text) < 50:
        # Readability extracted almost nothing; fall back to the full page
        clean_text = _html_to_clean(html)
    return title, clean_text
async def _llm_extract(content: str, prompt: str | None, schema: dict | None) -> str:
    """Send page content to the extraction LLM and return its answer."""
    system_msg = (
        "You are a data extraction assistant. "
        "Extract the requested information from the provided web page content. "
        "Be precise and only return the extracted data. Be as detailed as possible "
        "without including extra information. Do not skimp. "
        "NEVER return an empty result. If you cannot find the requested data, "
        "you MUST explain why."
    )
    if schema:
        system_msg += (
            "\n\nReturn the data as JSON matching this schema:\n"
            f"{json.dumps(schema, indent=2)}"
        )
    user_msg = content
    if prompt:
        user_msg += f"\n\n---\nExtraction request: {prompt}"
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            f"{LLM_URL}/v1/chat/completions",
            json={
                "model": LLM_MODEL,
                "messages": [
                    {"role": "system", "content": system_msg},
                    {"role": "user", "content": user_msg},
                ],
                "temperature": 0.1,
                # disable the thinking phase on the extraction model
                "chat_template_kwargs": {"enable_thinking": False},
            },
        )
        resp.raise_for_status()
        result = resp.json()
    return result["choices"][0]["message"]["content"]
async def _search_ddg(query: str, limit: int) -> list[dict]:
    """Run a DuckDuckGo text search via the ddgs library."""
    results = DDGS().text(query, max_results=limit)
    return [
        {
            "title": r.get("title", ""),
            "url": r.get("href", ""),
            "description": r.get("body", ""),
        }
        for r in results
    ]
I used Opus 4.6 to code these tools based on Firecrawl's tools. This search ends up being completely free: no external APIs are hit at all, so I can do as much AI research as I want, with the only limit being my electricity bill. I have my extract tool hitting a separate 9B variant of Qwen3.5 on another 1080 Ti rig I have, but you can obviously set that to use whatever.
These tools are good, but on their own they still resulted in mostly misinformation being reported back, with little effort put into verification or further research. I have always liked the way Claude searches the web, so I had Opus 4.6 write a system prompt based on its own instructions and tendencies, and that immediately improved the quality and accuracy of the results enormously. Now it's roughly on the same level as Opus 4.6 (in my experience), with the only caveat being that it sometimes leaves things out because it doesn't do enough research and therefore doesn't cover enough ground. Here is the prompt I use:
You are a friendly assistant.
=== CRITICAL: DATE AWARENESS ===
Before your FIRST search in any conversation, call get_current_date. This is mandatory — do not skip it.
The date returned by get_current_date is the real, actual current date. You may encounter search results with dates that feel "in the future" relative to your training data. This is expected and normal. These results are real. Do not:
- Flag current-year dates as errors or typos
- Say "this date appears incorrect" or "this seems to be from the future"
- Assume articles dated after your training cutoff are fake or simulated
- "Correct" accurate dates to older ones
If a search result is dated 2026 and get_current_date confirms it is 2026, the result is current — trust it.
=== RESEARCH METHODOLOGY ===
Follow this workflow for every research query. Do not skip steps.
STEP 1: ESTABLISH DATE
- Call get_current_date if you haven't already this session.
STEP 2: SEARCH BROADLY FIRST
- Run your initial search.
- Read the results. Note what claims are being made and by whom.
- DO NOT form conclusions yet.
STEP 3: VERIFY AND FILL GAPS
- If the story involves someone making a statement or response, search specifically for that statement.
- If multiple people or entities are named, search for each one.
- If a quote is circulating, search for its original source.
- Extract full article content when headlines alone are ambiguous.
MINIMUM EXTRACTION RULE:
If you use the extract tool once, you must use it at least one more time on a different source.
STEP 4: SYNTHESIZE
- Only now form your answer.
- If sources conflict, say so.
- If you could not find evidence, say that explicitly.
=== TRUST HIERARCHY ===
TIER 1 — HIGH TRUST:
- Major outlets (AP, Reuters, NYT, BBC, etc.)
- Official statements
- Multiple independent confirmations
TIER 2 — MODERATE TRUST:
- Single-source reporting
- Social media posts
- Regional outlets
TIER 3 — LOW TRUST:
- Viral screenshots
- Parody accounts
- Unverified quotes
- Aggregators
- Forums
=== COMMON FAILURE MODES — AVOID THESE ===
1. CONFIDENT DENIAL WITHOUT EVIDENCE
2. "CORRECTING" ACCURATE INFORMATION
3. PREMATURE CONCLUSIONS
4. DATE SKEPTICISM
5. OVER-HEDGING
6. TREATING VIRAL CONTENT AS CONFIRMED
=== GENERAL REASONING PRINCIPLES ===
- Think before pattern-matching
- "I don't know" is valid
- Distinguish source vs reasoning
- Update when contradicted
- Precision > fluency
- Match confidence to evidence
- Don’t over-structure answers
- Separate facts from opinions
- Names/numbers/dates must be correct
- Answer the actual question
=== RESPONSE FORMAT ===
- Lead with strongest facts
- Separate confirmed vs unverified
- State disagreements clearly
- Attribute sources
- Note debunking when relevant
- No "as an AI" disclaimers
=== SELF-CHECK BEFORE RESPONDING ===
- Did I call get_current_date?
- Did I verify negative claims?
- Am I contradicting multiple sources?
- Did I validate dates?
- Did I trace quotes?
- Would this hold up if tested?
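The prompt leans on a get_current_date tool that isn't in the code above. A minimal sketch of what it might return; the exact output format is my assumption:

```python
from datetime import datetime, timezone

def get_current_date() -> str:
    """Return the real current date in UTC, e.g. 'Friday, 2026-03-13'.
    Registered as an MCP tool in the actual server."""
    return datetime.now(timezone.utc).strftime("%A, %Y-%m-%d")
```

Including the weekday gives the model a little extra grounding when it reasons about how fresh a search result is.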
lucgagan@reddit
Share this to https://www.reddit.com/r/webmcp
Waarheid@reddit
Very interesting, and nice work. I love the idea of _llm_extract: the chat context doesn't get exploded with all of the web page text, only the necessary info. The prompt is pretty big, but it probably helps with quality, and it sounds like your performance is good enough for it not to be an issue.
Right now, I have my local agent (based on pi.dev) search Wikipedia with a simple CLI:
$ wikipedia search <query> returns a list of article titles, and $ wikipedia read <title> returns a markdown version of the page. It works great with Gemma 4 26B-A4B, and the tool description for it is just a few lines. Some Wikipedia pages are quite long, though, so I think I will steal your _llm_extract and test it out. Thanks for sharing!
mcbarron@reddit
Please don't saturate wikipedia.org with nonsense requests. You can download the entire thing over torrent here: https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
No need to tax the servers with your constant scraping.
Waarheid@reddit
Oh no, my ten requests over the past 3 days, the horror!
Imaginary-Unit-3267@reddit
It adds up when lots of people are doing it, though. Wikipedia explicitly gets mad at people for using scrapers too much and they're a free service so... probably better just to be respectful of their preferences?
Waarheid@reddit
Yes, you are right, especially if they explicitly say so. I'll look into it; it would be cool to self-host it myself :-)
OkUnderstanding420@reddit
Could you share how to add these tools to llama.cpp? I'm running a llama.cpp server as well but I'm not very familiar with setting up tool calling etc.
Thanks
Imaginary-Unit-3267@reddit
Follow the instructions on OP's GitHub to start the MCP server and the main researcher model, and start the extraction/summarizing model on another server (though I wonder if the same model could do both; I haven't tried that yet). Then open the web UI, add the MCP server at the port given in the readme, making sure to point it at /mcp (the full path is probably 0.0.0.0:8642/mcp), click the "use llama proxy" toggle, and it should be good to go.
External_Dentist1928@reddit
So this works without a search engine backend, like Brave Search API, or a self-hosted searxng instance?
BitPsychological2767@reddit (OP)
Yes, thanks to DDGS
Orolol@reddit
Ouch, your results are contaminated
BitPsychological2767@reddit (OP)
Can you elaborate? I'm very inclined to agree with you, but I need more info before I can take action.
Orolol@reddit
https://en.wikipedia.org/wiki/Grokipedia#Factual_inaccuracies
StirlingG@reddit
Those things are literally all real, kudos to grok
BitPsychological2767@reddit (OP)
Ugh. God damn it. Leave it to Grok to put me in this position.
I'm not sure what to do about this because there is no great DDGS alternative that doesn't require an API key. The user can still use their own SearXNG instance but I still don't feel terribly comfortable leaving DDGS in as the main method of search knowing this. Maybe I will add "-site:grokipedia.com" to each query, but I'm not sure if that will work. I'll have to think for a little bit more about this, thank you for bringing it to my attention.
Orolol@reddit
DDGS doesn't have any option to filter?
BitPsychological2767@reddit (OP)
Adding "-site:grokipedia.com" will work. I've now added it to the repo for all requests that go through DDGS.
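For anyone wiring up something similar, the fix can live in one small wrapper applied before the query reaches DDGS; a sketch (where exactly this hooks into OP's repo is an assumption):

```python
# Hypothetical helper: append -site: exclusions to every search query.
BLOCKED_SITES = ["grokipedia.com"]  # extend with other sites to exclude

def apply_site_filters(query: str) -> str:
    """Add -site: exclusion operators to a query, skipping duplicates."""
    for site in BLOCKED_SITES:
        token = f"-site:{site}"
        if token not in query:
            query = f"{query} {token}"
    return query
```

The wrapped query is then passed to DDGS().text() as usual.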
No_easy_money@reddit
I have no quibbles with the reference to the factual inaccuracies in Grokipedia.
That said, the prior poster failed to note the biases of other sources. For example:
- Wikipedia is notoriously liberal in its data sources and presentation
- Google deprioritizes conservative news sources
- Even the Grokipedia quote comes from The Atlantic, a very liberal source. There are many quotes from conservative sources that would disqualify "acceptable" sources on the list.
The point here is not to debate sources, but to use Internet research to get to the truth. Echo chamber sources risk the introduction of biases. I would much rather have a diverse set of source material and let the best ideas win. YMMV.
External_Dentist1928@reddit
Is this like searxng hosted by someone else?
BitPsychological2767@reddit (OP)
I guess that's an accurate way to describe it. It's just an unofficial Python interface to the DuckDuckGo search engine. SearXNG would be trivial to add to this, though.
External_Dentist1928@reddit
I think that would actually be preferable here then, especially with respect to keeping things as local as possible
BitPsychological2767@reddit (OP)
Done! Just added it and tested it out, seems to be working well. Thanks for the suggestion!
Ell2509@reddit
The difference between creators we come back to and those we don't comes down to whether feedback is heard.
Good job, sir (or madam).
mxcw@reddit
My favorite type of comment here. Good job, will look into this over the weekend
Imaginary-Unit-3267@reddit
Hot damn you figured this out so I don't have to. Only difference is I have a different Qwen3.5 model. I will try this out tomorrow. Thank you kind sir!
FerLuisxd@reddit
I guess it does not handle captchas yet?
BitPsychological2767@reddit (OP)
No, but that (and generally cutting down on page refusals, which are currently unacceptably high) is my next priority.
nufeen@reddit
Using Camoufox instead of Playwright may help with refusals. It's less often detected as a robot browser.
MoffKalast@reddit
Or well, Puppeteer.
_D4rk4_@reddit
Why not unsearch?
BitPsychological2767@reddit (OP)
Do you mean why not use unsearch in place of this, or do you mean why does this app have no unsearch support?
Overall-Somewhere760@reddit
Finally a post in which the user shows directly his setup, without fancy damn words. Kudos to you man.
ebolathrowawayy@reddit
if everyone says this then everyone will do this. keep it up! claude code destroyed the elitist barriers. now let's do something fucking interesting.
btw, i'm not a bot bc i can say fuck shit poop and buttsex. deal with it.
BackgroundBalance502@reddit
Nice setup. One thing worth knowing if you ever extend it to agents that need to interact with pages instead of just read them: Readability drops all the spatial information. Once text hits your agent, position is gone and there's no path back to coordinates without another screenshot pass.
Built something that runs as an MCP server and sits alongside what you're already doing. Instead of inferring coordinates from screenshots it computes them from CSS and font metrics directly. Same MCP config you're already using.
github.com/Tetrahedroned/spatial-tether
Taenk@reddit
So I guess this makes Firecrawl completely unnecessary? Will try it out over the weekend probably.
BitPsychological2767@reddit (OP)
Yes, this actually started out as a Firecrawl MCP, and then when I hit my free credit limit within 12 hours of registering I decided to make this. There are other Firecrawl endpoints that are not included, but these two were the main functions that I found most useful for answering specific queries.
Loose-Breakfast363@reddit
ngl i just started using a similar setup and it feels like cheating
AreaInner8702@reddit
thats a sick setup, honestly jealous of that 4090. your prompt engineering is next level
traveddit@reddit
Do you guys not run into captcha issues without running xvfb?
clintCamp@reddit
I built one yesterday that strips the HTML and provides just the page text, so the LLM gets only the important content. Then I have another agent that curates all search findings into a knowledge base for RAG search later, in a condensed format with the more important details broken out into their own small files.
Dundell@reddit
Neat, I just run a modified SearXNG myself with some captcha interactions for Google, and fix it into a mcp for roo code and just call it deep research.
Tell my local guy "can you deep research this" and watch it try to figure out "Do I need 5 or 15 results from DuckDuckGo, Google, Brave?". Funny for me, but it's 15k~80k of context until it's pulled all the info it thinks it needs.
Past-Reception-424@reddit
This is the kind of setup that makes me feel like we are actually getting somewhere with local models. once you cut the cloud dependency it just feels different. Starred the repo
Fair_Ad845@reddit
what is your setup for the web search part? the LLM itself is easy to run locally but getting fresh search results without hitting a cloud API is the tricky bit.
TheTerrasque@reddit
This is similar to my setup, but I'm using searxng and an mcp to connect it. I also have no special prompt, other than telling it to use the search if unsure, and today's date and my username in the system prompt. Using Open Webui as the UI and connecting things together.
I first used Qwen3.5-35b-a3b, which was the first fast model that was actually useful, but I'm testing out Gemma 4's MoE now.
Oh, and I also have an MCP connecting it to my Outline instance, so it can both look up things there and add new things to it. Since both my wife and I use it (via Authentik OpenID login), the username in the prompt lets it know which of us it's talking to, so it knows which stuff in Outline is relevant.
Billysm23@reddit
Are you the previous guy?
kaliku@reddit
Thank you for this, dear. I'll give it a try and if it works I'll have to think of another weekend project because exactly this was my plan for the coming two days 😂
Maleficent-Low-7485@reddit
local search plus a decent 70b replaced half my google usage, not even joking.
david_0_0@reddit
Interesting setup. 40 tok/s on Qwen 27B vs. cloud API latency: how do total times compare? Cloud APIs are slow to first token but fast on the full response. Also curious whether you see accuracy drops with Q3_K_M vs. higher quantizations on factual web research tasks; that matters more than throughput for web content extraction.
BitPsychological2767@reddit (OP)
I'd be super interested in testing these things, but I have no idea what the best way to do so is and I'm kind of a perfectionist when it comes to this kind of procedure. I'll have to look into doing proper benchmarks for this now that I've gone public with it.
SkyFeistyLlama8@reddit
I'm more surprised at that wordy prompt from Opus being usable on a local model. I tend to strip local model prompts down to bare bones.
Honest-Debate-6863@reddit
What about support for smaller models
BitPsychological2767@reddit (OP)
It works with smaller models, provided they can adequately call tools and follow instructions.
Honest-Debate-6863@reddit
I’m curious if this works with E4B
And other small models based on
https://www.reddit.com/r/LocalLLaMA/s/B4fB20exGU
drallcom3@reddit
I achieved some pretty impressive research just with LM Studio and the usual MCP tools. I had it research something I already know. The info is not that easy to collect. In parallel I had Claude Sonnet, ChatGPT and some dedicated research platform do it. The local version delivered the best result by far (but it also took the longest by far). Claude doesn't talk very much, ChatGPT was the usual nonsense and the research platform's result looked like a computer generated website from 10 years ago.
Dazzling_Equipment_9@reddit
Your approach is great, lightweight and practical. You can put it in a GitHub repository so we can like and fork it :)
BitPsychological2767@reddit (OP)
Sure! Thanks for your interest!
https://github.com/AuthBits/webmcp
dajeff57@reddit
Super thing. I concur: a GitHub repo with adequate markdown docs as well would go a long way, and we could contribute back and make it a future-proof stack.
LostDrengr@reddit
Nice, thanks for sharing! I am way behind you on this: I configured the MCP through llama.cpp and, after trying DDG and Brave, managed to set up SERP. This is definitely the kind of tooling to explore.
HollyNatal@reddit
I need to see it in practice
ProfessionalSpend589@reddit
Interesting. Thanks!