If anyone is running Qwen 9B, 27B, or 35B and getting wrong facts from web search, follow this.
Posted by 9r4n4y@reddit | LocalLLaMA | View on Reddit | 19 comments
-
Try to go with SearXNG, since it aggregates search results from multiple engines and it's open source.
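A minimal sketch of querying a self-hosted SearXNG instance programmatically, assuming it runs on localhost:8888 and that the JSON output format is enabled in its settings.yml (it is off by default):

```python
import requests

def searxng_search(query: str, base_url: str = "http://localhost:8888") -> list[dict]:
    """Query a self-hosted SearXNG instance and return its aggregated results."""
    resp = requests.get(
        f"{base_url}/search",
        params={"q": query, "format": "json"},  # "json" must be enabled in settings.yml
        timeout=30,
    )
    resp.raise_for_status()
    # Each result carries a title, url, and content snippet, merged from
    # whichever engines SearXNG queried for you.
    return resp.json().get("results", [])

if __name__ == "__main__":
    for hit in searxng_search("deepseek v4 flash kv cache size")[:5]:
        print(hit["url"], "->", hit["title"])
```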
-
Use Firecrawl / Jina / fetch for reading the source:
- Use Firecrawl for complex web pages.
- Use Jina for day-to-day stuff (you can also just add https://r.jina.ai/ in front of any URL and you will get the page in a readable format that is easy for an LLM to scrape).
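Here is a minimal sketch of that prefix trick (this hits the public r.jina.ai reader; run a local reader instead if you care about privacy, as discussed in the comments below):

```python
import requests

def fetch_as_markdown(url: str) -> str:
    """Fetch any page as LLM-friendly text via the r.jina.ai reader prefix."""
    # Prefixing the target URL with https://r.jina.ai/ returns the page
    # reduced to readable text instead of raw HTML.
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=60)
    resp.raise_for_status()
    return resp.text

print(fetch_as_markdown(
    "https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3.html"
)[:500])
```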
Even if you do this, the AI may still give you wrong facts. I have seen that small models are capable of web-searching niche stuff but don't do it on their own, so you need a web-search agent instruction prompt. Below is the prompt; just copy and paste it :) It basically tells the model to avoid using internal knowledge and doing complex math, and instead find the answer on the web as directly stated. It also tells it to cite a minimum of two sources for each major fact to prove itself right.
Prompt
You are a factual research assistant. Work step by step.
1. Search the web now for the exact question.
2. Retrieve at least two independent sources published after 2024.
3. Base your answer only on those sources. Do not use internal knowledge.
4. For every numeric fact, quote the exact text, give the URL and date, and specify the condition.
5. If sources conflict or the information is missing, say "conflict" or "cannot verify" and show both quotes.
6. Temperature 0.1. No guessing.
7. It is mandatory to also read web pages; web search alone is not sufficient.
8. You must cite all of the sources used with exact quotes at the end, in this format: source 1 xyz.com --> "quote"...
9. Identify all major key facts needed for the question, then for each fact cite a minimum of two sources per rule 8.
10. Avoid math whenever possible and avoid internal knowledge unless no source exists. Always try to find numbers online first. Only simple addition or subtraction is allowed; never do complex math.
I have kept the prompt under 1,000 characters so that people who don't have good hardware can just paste it into the Qwen app's project instructions, which have a 1,000-character limit.
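If you are not on the Qwen app, here is a minimal sketch of wiring the same prompt into any OpenAI-compatible local server (LM Studio, Ollama, llama.cpp server, etc.); the base URL and model name are placeholders for whatever you run:

```python
from openai import OpenAI

# Placeholder endpoint and key: point these at your own local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

SYSTEM_PROMPT = "You are a factual research assistant. ..."  # paste the full prompt from above

response = client.chat.completions.create(
    model="qwen-9b",  # placeholder; use whatever model your server exposes
    temperature=0.1,  # rule 6 of the prompt; a model can't set this itself, so set it here too
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Which is lighter: DeepSeek V4 Flash or MiniMax M2.7?"},
    ],
)
print(response.choices[0].message.content)
```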
Result: earlier I asked it:
Ok, so go do research on DeepSeek V4 Flash vs MiniMax M2.7 and find which is lighter, keeping in mind that the KV cache size for both of them is at max context length.
1. Find their max context length.
2. Then find how much VRAM each model's max context length takes (separately) just to store the KV cache.
3. Compare model + cache size for both.
Most of the time the results were very inconsistent, but after adding the instruction prompt it gives the right answer most of the time.
Here's the answer:
### Research Results: DeepSeek V4 Flash vs. MiniMax M2.7
Based on the latest technical reports and deployment documentation released in April 2026, here is the step-by-step comparison regarding context length, KV cache overhead, and total VRAM requirements.
1. Maximum Context Length
* **DeepSeek V4 Flash:** Supports an official maximum input context window of **1,048,576 tokens (1 million)** [[2]], [[6]].
* **MiniMax M2.7:** Supports a maximum context length ranging between **200K and 205K tokens**, depending on the specific service tier [[11]], [[71]].
2. VRAM Required for KV Cache at Max Context
* **DeepSeek V4 Flash:** Due to its specialized Hybrid Attention architecture (Compressed Sparse Attention + Heavily Compressed Attention), the KV cache overhead is drastically reduced. At a 1M-token context, the BF16 KV cache requires exactly **9.62 GiB** of memory per sequence [[107]], [[127]].
* **MiniMax M2.7:** Relies on standard KV cache scaling relative to its hidden dimensions. The official vLLM deployment guides state that the model requires approximately **240 GB of VRAM per 1 million context tokens** [[82]], [[90]]. Applying this ratio to its maximum context length (~200K tokens), the KV cache alone consumes approximately **48 GB** of VRAM.
3. Comparison of Model + Cache Size (Total VRAM Footprint)
To determine which model is "lighter" in a total deployment scenario, we sum the storage requirements for the model weights (in mixed precision) and the KV cache at maximum context.
| Feature | **DeepSeek V4 Flash** | **MiniMax M2.7** |
|---|---|---|
| **Model Weights** | ~158 GB – 160 GB (mixed FP4/FP8 precision [[42]], [[95]]) | 220 GB (official deployment requirement [[82]], [[90]]) |
| **KV Cache (at max context)** | ~9.6 GB (highly compressed 1M context) | ~48 GB (~200K context) |
| **Total VRAM Needed** | **~168 GB** | **~268 GB** |
**Conclusion:** Despite having a larger raw parameter count (284B vs 230B), **DeepSeek V4 Flash is significantly lighter** than MiniMax M2.7. Its innovative compressed attention mechanisms allow it to handle context windows more than 5 times larger than MiniMax while using roughly one-fifth of the VRAM just for the KV cache. In a total workload comparison, DeepSeek V4 Flash requires approximately **100 GB less total VRAM** than MiniMax M2.7 at their respective maximum context lengths.
***
Sources & Exact Quotes
**Source 1 (DeepSeek Context Length)** URL: https://openrouter.ai/deepseek/deepseek-v4-flash Date: April 2026 Quote: "DeepSeek V4 Flash is an efficiency-optimized Mixture-of-Experts model from DeepSeek with 284B total parameters and 13B activated parameters, and a 1M-token context window."
**Source 2 (DeepSeek Context Length)** URL: https://llm-stats.com/models/deepseek-v4-flash-max Date: April 2026 Quote: "DeepSeek-V4-Flash-Max has a context window of 1,048,576 tokens for input and can generate up to 393,216 tokens of output."
**Source 3 (MiniMax Context Length)** URL: https://aihub.caict.ac.cn/models/MiniMaxAI/MiniMax-M2.7 Date: 2026-04-16 Quote: "MiniMax-M2.7 是MiniMaxAI 于2026 年3 月推出的旗舰级自进化Agent 大语言模型...支持200K 超长上下文" (translation: "MiniMax-M2.7 is the flagship self-evolving Agent large language model released by MiniMaxAI in March 2026... supports a 200K ultra-long context")
**Source 4 (MiniMax Context Length)** URL: https://cloudprice.net/models/minimax-m2-7-highspeed Date: 2026-04-19 Quote: "MiniMax M2.7 High Speed is MiniMax logo MiniMax's language model with a 205K context window"
**Source 5 (DeepSeek KV Cache Size)** URL: https://docs.bswen.com/blog/2026-04-24-deepseek-v4-1m-context/ Date: 2026-04-24 Quote: "With 9.62 GiB KV cache, you can actually run" (context refers to 1M-token context capability)
**Source 6 (DeepSeek KV Cache Size)** URL: https://dasroot.net/posts/2026/04/deepseek-v4-hybrid-attention-massive-contexts/ Date: 2026-04-24 Quote: "For example, at 1 million tokens, the KV cache size for DeepSeek V4 is estimated at 9.62 GiB with bf16 KV cache, which is 8.7x smaller than the"
**Source 7 (MiniMax KV Cache & Weight Size)** URL: https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/docs/vllm_deploy_guide.md Date: 2026-04-15 Quote: "Memory requirements: 220 GB for weights, 240 GB per 1M context tokens"
**Source 8 (MiniMax KV Cache & Weight Size)** URL: https://x.com/Web3Aible/status/2043213211944485042 Date: 2026-04-12 Quote: "Weights memory: 220 GB for weights, plus ∼240 GB per 1M context tokens for KV-cache"
**Source 9 (DeepSeek Weight Size)** URL: https://lushbinary.com/blog/deepseek-v4-self-hosting-guide-vllm-hardware-deployment/ Date: 2026-04-24 Quote: "V4-Flash at ~158GB in FP4+FP8 mixed precision fits on a single H200 node."
**Source 10 (DeepSeek Weight Size)** URL: https://deepinfra.com/deepseek-ai/DeepSeek-V4-Flash Date: April 2026 Quote: "Weight (HuggingFace): 160 GB"
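To double-check the math in that answer, here is a quick sanity check using only the numbers from the quoted sources:

```python
# Numbers taken verbatim from the sources quoted above.
deepseek_weights_gb = 158      # Source 9: "V4-Flash at ~158GB in FP4+FP8 mixed precision"
deepseek_kv_gb = 9.62          # Sources 5/6: 9.62 GiB KV cache at 1M tokens (BF16)

minimax_weights_gb = 220       # Source 7: "220 GB for weights"
minimax_kv_per_1m_gb = 240     # Source 7: "240 GB per 1M context tokens"
minimax_ctx_tokens = 200_000   # Source 3: 200K max context

# Scale the per-1M-token KV figure down to MiniMax's actual max context.
minimax_kv_gb = minimax_kv_per_1m_gb * minimax_ctx_tokens / 1_000_000  # = 48.0

print(f"DeepSeek V4 Flash total: {deepseek_weights_gb + deepseek_kv_gb:.1f} GB")  # ~167.6 GB
print(f"MiniMax M2.7 total:      {minimax_weights_gb + minimax_kv_gb:.1f} GB")    # 268.0 GB
```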
amazedballer@reddit
I did this with Tavily, using a two-phase search process with Haystack.
https://github.com/wsargent/groundedllm
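(Not the actual Haystack pipeline from that repo — see the link for the real thing — but here is a rough sketch of the two-phase idea using the tavily-python client directly; the API key and query are placeholders:)

```python
import requests
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-...")  # your Tavily API key

# Phase 1: search for candidate sources.
results = client.search("deepseek v4 flash kv cache size", max_results=5)

# Phase 2: fetch each candidate page in full before answering,
# so the model grounds itself on page content, not just snippets.
pages = []
for r in results["results"]:
    md = requests.get(f"https://r.jina.ai/{r['url']}", timeout=60).text
    pages.append({"url": r["url"], "content": md})
```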
ego100trique@reddit
I just use the private-search MCP server on Docker with a 9B model, and it works just fine in the use cases I gave it.
Most use cases are like the following: "find a restaurant of the same popularity as X at X place that has places for X number of persons on X date".
Jeidoz@reddit
Is it just me, or is the markup of this post a bit messed up?
9r4n4y@reddit (OP)
I am not very familiar with structuring posts on Reddit, so don't mind it. I avoid using AI for this because people then start to think that I am a bot.
_bones__@reddit
That's exactly what a bot would say. /s
Jeidoz@reddit
Reddit has standard Markdown syntax. IMO it looks like you messed up the quote syntax somewhere (using > in front of a line of text). If it's hard for you to deal with the Markdown editor, switch to the Rich Text editor and style with that.
9r4n4y@reddit (OP)
I will look into it :) thx
Medium_Chemist_4032@reddit
Excellent. Bolting a search engine onto any model increases its capabilities immensely. Jina is easily replaceable with vibe-coded tools too, for the less dynamic pages.
9r4n4y@reddit (OP)
Yeah, you can vibe-code an app on top of Jina, but I think Jina is sufficient by itself. And if any websites block you, just use headless browsers for scraping.
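A minimal headless-browser fetch with Playwright looks like this (pip install playwright, then playwright install chromium; the function name is just illustrative):

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chromium so JS-rendered content is included."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```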
Medium_Chemist_4032@reddit
I meant transforming HTML into Markdown. There are quite a few GitHub projects that do that, but they vary in polish: making links navigable, cutting long content, reliably finding the "main" HTML part. You can vibe-code a good fetcher quite quickly. I don't see a real need for Jina for those particular use cases.
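Something like this gets you most of the way (readability-lxml to isolate the main content, markdownify for the conversion; both are common picks, not the only ones):

```python
import requests
from readability import Document           # pip install readability-lxml
from markdownify import markdownify as md  # pip install markdownify

def page_to_markdown(url: str) -> str:
    """Fetch a page, isolate the main article body, convert it to Markdown."""
    html = requests.get(url, timeout=60).text
    doc = Document(html)
    # Document.summary() returns just the "main" HTML part, with
    # navigation, sidebars, and boilerplate stripped out.
    main_html = doc.summary()
    return f"# {doc.title()}\n\n{md(main_html)}"
```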
justicecurcian@reddit
Do you know a polished one? There are far too many issues with websites to make a reliable universal website-to-Markdown parser by hand.
9r4n4y@reddit (OP)
Bro, Jina does exactly that. OK, here you go, see this example.
This is the website:
--> https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3.html
Now this is the Jina Markdown version, made just by adding "https://r.jina.ai/":
--> https://r.jina.ai/docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3.html
Medium_Chemist_4032@reddit
Yes, exactly.
It does, however, send data to Jina's servers. This is a local AI hosting subreddit, and people prefer retaining as much privacy as they can.
Even though you only request specific URLs, you still leak your IP and the URLs you are interested in, so it's possible for the service to sell that data for profit, to feed advertising profiles.
That's why I mentioned that option at all.
9r4n4y@reddit (OP)
Yeah, I said you can run Jina or a tool like Jina locally. I used the Jina URL just as an example to show you.
DeepBlue96@reddit
...waste of time, just copy this: scrap - Pastebin.com. Note it uses a lot of context; maybe use an agent to summarize the research, then pass it to the original agent.
9r4n4y@reddit (OP)
It works, that's why I gave it here ¯\_(ツ)_/¯. Mine is way easier and faster to set up and can work anywhere, like Open WebUI or LM Studio. But anyway, thanks for your stuff too.
DeepBlue96@reddit
OK, it's not free though... and to be fair I missed the r.jina one. That one does the exact same thing BeautifulSoup does, but it's online, so TY for that one; who knows when I will need it :)
9r4n4y@reddit (OP)
See, I use Firecrawl hosted on my PC, and it's good.
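For reference, a minimal sketch of calling a self-hosted Firecrawl instance over plain HTTP — assuming the default localhost:3002 port and the v1 scrape route; check your instance's config for the exact route and whether it wants an auth header:

```python
import requests

def firecrawl_scrape(url: str, base_url: str = "http://localhost:3002") -> str:
    """Scrape a page as Markdown via a self-hosted Firecrawl instance."""
    resp = requests.post(
        f"{base_url}/v1/scrape",
        json={"url": url, "formats": ["markdown"]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["data"]["markdown"]
```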
TripleSecretSquirrel@reddit
You should sell this to Anthropic. In the past several weeks, Claude Opus has been returning tons of hallucinated facts from web search, even after corrections.