Web search is coming to a screeching halt as Google shuts down its free search index and traffic defenders like Cloudflare challenge AI at every gateway. What are our options?
Posted by NetTechMan@reddit | LocalLLaMA | View on Reddit | 171 comments
Google is closing its free tier to just 50 domains for site-specific search, effective January 1st, 2027, with no public pricing listed for advanced searches. Cloudflare's new default is to challenge all AI bots attempting to scrape web information across all their customers' sites, which now includes, through a recent partnership, all domains hosted by GoDaddy.
Some of you may have felt it over the last few months: web searches that used to work are now failing with 400 errors from every site your harness attempts to reach. Local models may lose efficacy as their internet-pulling capabilities are crushed.
Make no mistake, Google is reinforcing their moat by pulling up the drawbridge and switching to aggressive pricing. This is a direct attempt to close in on the open, self-hosted sphere by crippling the infrastructure it relies on.
As a community, what options do we have at our disposal? Are there any open projects currently attacking this status quo? Filling this gap will likely be the next big "open" project to hit the market, since solutions to this problem will become dependencies as harnesses keep improving.
Sofakingwetoddead@reddit
I'm not sure how old you are, but after web 2.0 was implemented we lost maybe 50% of the searchable index. After web 3.0 implementation, we were down to maybe 5% or less. Most of the cards in the card catalog for the entire library of web-based information were removed. Google search, now, isn't useful. Advanced search features which allowed us to pinpoint exact strings within hundreds of millions of hits haven't functioned in more than a decade. People like me, who were OSINT gatherers, noticed the change first. The time to get upset was 2015, not today. Today, it's already over. Google is worthless and most other search engines use Google's web crawler and index for their results. Nothing to see here. We lost the internet a long time ago!
Imaginary-Unit-3267@reddit
Do you think this is reversible with peer-to-peer search engines?
Sofakingwetoddead@reddit
I keep hearing things about that. I dunno. I'm not sure what would be required to collect, store and index the massive amount of information that once existed on the internet. If I had to guess, it's totally out of reach. I try not to think about these things. People believe the LLM is so incredibly powerful, and it is, but the power we had when we could find the needles in stacks of needles(and there were a lot of stacks) was so much more. Our information agency was taken away and, honestly, I know of only a few people who even noticed it. That tells me the vast majority of people were too busy watching football games or Dancing with the Stars to even notice. It really hurt to have that taken away, but it told me a LOT about the world I live in...
8lbIceBag@reddit
I've tried complaining about it since 2015, but only in the past few years have people agreed. In 2022 and earlier, I learned not to complain about Google on reddit unless I wanted some downvotes.
Mickenfox@reddit
Can I mention the Kagi search API is in a closed beta?
Sofakingwetoddead@reddit
It's kinda wild that it needed to get THIS bad before people noticed it. Not surprising that I run into someone like myself here in this particular subreddit. \o
Acrobatic_Stress1388@reddit
Right there with you brother.. I noticed.
tukatu0@reddit
There are dozens of us.
Youtube is the best it has been since a bit before covid for me. Realistically it's not going to get better unless the systems are replaced. When any user, no matter how small, can rise and become a bad actor, there will never be a good outcome. My condolences to first-time users.
ResidentPositive4122@reddit
While some of that is true, wait till you actually daily-drive anything but google. They're all worse. The golden age of search is gone, google or no google...
FaceDeer@reddit
It's going to be an interesting tug of war as AI agents become more common. Sites that lock out agents will get more human visitors initially, of course, but at some point the agents become a meaningful part of your actual audience. If your shop doesn't cater to agents then you won't sell as much stuff. If your news site only caters to humans, but most people are getting the news delivered to them by their agents, you're turning yourself into the equivalent of printed newspapers - a relic only read by the old generation that refuses to adopt new technologies.
The model of paying for sites through advertising is already a bit shaky due to ad blockers, but currently most people don't actually use ad blockers as far as I'm aware. The agent is inherently an ad blocker, though.
I'm really not sure where the Internet is going to go from here. Interesting times.
__JockY__@reddit
They have probably seen a wild influx of bot searches and realized it’s an untapped revenue stream that will slowly bleed them if they don’t shut the door.
Anyone else coming into this space is going to face the same problem: how do you monetize searches when there’s no human eyes to land on advertising?
I think we’re headed to the big search paywall soon.
MoneyPowerNexis@reddit
I think the answer is you advertise products to the bots, because in the long run that's going to be the filter for advertising online: when a user wants something they will ask their bot, so you as a search-based advertiser need to get real good at convincing clankers that your client's product is what they should get when asked. Blocking clankers altogether seems like the dumbest move you can make, because they will move off your platform and integrate into someone else's system.
Right now I have DuckDuckGo search integrated in my harness because it was easy to do. So if I ever want to buy something and ask my bot where to get it, it will be DuckDuckGo that routes me to suppliers, and they could cash in on that. Google being hostile to agents just leaves it out of the loop.
The real problem I see is how free content sites will fund themselves as more and more access goes through agents.
SkyFeistyLlama8@reddit
GEO, pretty much. Generative AI bullshit optimization: getting page contents to be more attractive to LLMs so that Google's own LLM scrapers can use that content later on.
In all these cases, the page gets viewed only once by a robot and never again by human eyes because the search results themselves are synthesized from scraped text.
Mickenfox@reddit
Google has already been effectively ranking pages by how well they look to AI, given they basically invented language models to use in their search engine. It's why all the results have so much filler text and look basically the same.
kulchacop@reddit
Google has already been nudging websites in this direction. Google, in partnership with Microsoft, introduced WebMCP. This aims to be a standard for exposing a dedicated MCP server for your website, through the browser. When a shopping website exposes an MCP server for buyers' agents to interact with easily, it helps Google index it easily too. When a news website exposes its MCP server, it helps Google too.
Here, Google is trying to capture both ends of the market, if that is going to exist at all.
https://patents.google.com/patent/US12536233B1/en
Google patented dynamic generation of pages for a website, to match the search query. Google says that when there is no webpage on the site matching the search query, they can generate one on the fly. This implies that they will make websites register all their pages or data with Google, because how else would Google know when to show or not show generated content?
SkyFeistyLlama8@reddit
Yeah, that's the kicker. It's RAG on a gigantic scale, creating embeddings of web pages for vector search and scraping page text to use as LLM context for Gemini. The user never needs to visit the actual page itself because Google is generating answers on-the-fly based on scraped text.
Google still gets ad revenue on the search results screen and inline with Gemini, if it ever chooses to do that. Website and page owners get zip, zilch, nada... the only things visiting their pages now are robots.
TerminalNoop@reddit
"I think we’re headed to the big search paywall soon."
Honestly I don't think that is going to work. Revenue for all websites would immediately drop. Also, URLs are just a bunch of characters; how many can you cram into a 10GB file?
Then all you need is a sync provider that keeps you updated, and so on.
FullstackSensei@reddit
URLs on their own are pretty useless without a full index of what terms each URL/page contains. That's what blows up the size.
A more pertinent question is: how many websites does one really need to index to get most (>98%) of useful content? For text indexes only, I don't think the number is astronomically high, probably a few million. The issue will be information "hubs" like reddit, where you now need to have an agreement before you can index the content.
heliosythic@reddit
You also don't need to full text index, we could just store a few vectors for the page contents. Wouldn't be perfect but a start.
FullstackSensei@reddit
Doubt that would work well at such a scale. Normal inverted indexes are very space efficient and queries are basically intersection operations that reduce the search space exponentially.
An inverted index can sift through on the order of a billion entries per second on a single core of a 10-year-old machine. That's why Google and the other search engines are so fast.
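To make the intersection point concrete, here's a toy sketch (purely illustrative, nothing from a real engine): posting lists are just sorted document-ID lists, and an AND query is a cheap merge over them, starting from the rarest term so the candidate set shrinks fastest.

```python
from collections import defaultdict

# Toy inverted index: term -> sorted list of document IDs (the "posting list").
# Real engines compress these lists heavily, but the query path is the same idea.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(a, b):
    # Merge-style intersection of two sorted posting lists: O(len(a) + len(b)).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, query):
    # Intersect the rarest posting lists first so the candidate set shrinks fastest.
    lists = sorted((index.get(t, []) for t in query.lower().split()), key=len)
    result = lists[0] if lists else []
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

docs = {1: "local llama search index", 2: "inverted index search", 3: "llama recipes"}
print(search(build_index(docs), "search index"))  # -> [1, 2]
```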
TerminalNoop@reddit
Well, I was thinking more of normal websites like news, local government, etc.
Sure, finding threads from a forum might get tougher, but I could see some people rising against this adverse environment and making (il)legal indexes.
Sudden_Vegetable6844@reddit
FWIW, the revenue drop for websites already happened with AI summarization: any website that is not paywalled or based on a constant stream of user content is pretty much dead in the water.
RealPjotr@reddit
And that's the whole issue: first Google search results stealing revenue from content providers, and now AI with its built-in, dubious information gathering from content providers.
As long as the content providers don't get paid, there won't be content.
Foreign_Risk_2031@reddit
It’s going to be aggressive fingerprinting (google is already tracking you anyway)
krileon@reddit
I think we're gonna have to go the route of running headless chrome and processing through the DOM instead of going through an API. Doesn't sound ideal, but it'd at least be free and pretty close to impossible to block.
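A minimal sketch of that approach, assuming Playwright is installed (`pip install playwright` plus `playwright install chromium`); the URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

def fetch_text(url: str) -> str:
    with sync_playwright() as p:
        # headless=False (a visible "headful" window) tends to trip fewer bot detectors.
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Pull the rendered, visible text out of the DOM instead of parsing raw HTML.
        text = page.inner_text("body")
        browser.close()
        return text

if __name__ == "__main__":
    print(fetch_text("https://example.com")[:500])
```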
Orolol@reddit
It's not impossible to block; in fact there are already many companies that sell bot blockers, and I myself developed one for my client.
Plus, running headless is OK when you have to do a few searches, but for deep research, which tends to aim for 100+ sources, it's just not possible unless you want to spend multiple hours.
alberto_467@reddit
Not at all that impossible to block.
I've actually been noticing some weird false-positive captchas lately saying that traffic from my network was suspicious when doing google searches. I had not seen them in a while, so I have a feeling that they're tuning their detection stuff up to fight automated searches more effectively.
Even if you use a headless browser like Camoufox, which in my unrelated scraping experiments is harder to detect, you still risk hitting google captchas that can really be a PITA (especially because google has many signals about your traffic), and you still have to figure out residential proxies with some rotation.
JamesEvoAI@reddit
Simple as not using Google. If you're technically competent enough to figure out running models locally, you can figure out setting up a local instance of SearXNG
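For the query side, a rough sketch of hitting a self-hosted SearXNG instance's JSON API (assumes an instance at localhost:8080 with the json output format enabled in its settings):

```python
import requests

def searx(query: str, instance: str = "http://localhost:8080"):
    r = requests.get(
        f"{instance}/search",
        params={"q": query, "format": "json"},
        timeout=15,
    )
    r.raise_for_status()
    return [
        {"title": hit.get("title"), "url": hit.get("url"), "snippet": hit.get("content")}
        for hit in r.json().get("results", [])
    ]

for hit in searx("local llm web search")[:5]:
    print(hit["title"], "-", hit["url"])
```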
admajic@reddit
And Tavily's 1000 free searches a day as a fallback, since there's some blocking going on for me on searches sometimes, probably rate limits.
No-Refrigerator-1672@reddit
Tavily gives you 1000 free searches a month, not a day. After that it's $0.008 per search, which is more or less reasonable.
McSendo@reddit
recycle tavily, exa, brave, and you have 3000 /mo
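A hedged sketch of that rotation idea. Only the Brave call is fleshed out, and its endpoint and header are best-effort assumptions, so check each provider's docs before relying on this:

```python
import itertools
import os
import requests

def brave_search(q):
    # Endpoint/header per Brave's web search API (assumed here, verify before use).
    r = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        params={"q": q},
        headers={"X-Subscription-Token": os.environ["BRAVE_API_KEY"]},
        timeout=15,
    )
    r.raise_for_status()
    return [(hit["title"], hit["url"]) for hit in r.json()["web"]["results"]]

def tavily_search(q):
    raise NotImplementedError("fill in with Tavily's API")

def exa_search(q):
    raise NotImplementedError("fill in with Exa's API")

PROVIDERS = itertools.cycle([brave_search, tavily_search, exa_search])

def search(query, attempts=3):
    last_err = None
    for _ in range(attempts):
        try:
            return next(PROVIDERS)(query)    # first provider that answers wins
        except Exception as err:             # quota hit, 429, not implemented, ...
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err}")
```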
graypasser@reddit
But then it still becomes more and more impossible over time, as the models themselves evolve.
bespoke_tech_partner@reddit
Headless gets blocked all the time, but headful doesn’t. Most of my workflows are headful. But this is speaking about hosted inference not local llama.
sweatierorc@reddit
That's a good thing, right? Less ads.
ares623@reddit
Hmm so let me get this straight:
That0neSummoner@reddit
I’m fine paying for quality searches. It’s a product I use. Stop selling my data.
a_beautiful_rhind@reddit
I have DuckDuckGo and SearXNG MCPs. I don't even use google myself since it hates VPNs. I'm not doing captchas just to search; they know I'm not a bot, they just hate privacy.
Elbobinas@reddit
Searxng?
re-ghost@reddit
Free quota:
1. DuckDuckGo
2. Firecrawl
3. Composio
4. Tavily
Host local:
5. SearXNG
6. Camoufox
PromptInjection_@reddit
Am using SearXNG, works pretty decent.
Mickenfox@reddit
Mirror the internet. If you're going to use someone's content without compensating them in any way, you should probably just scrape it once, put it in a big .txt file, and distribute that over a P2P network. At least this way they don't have to pay for bandwidth.
Or, someone finally figures out micropayments. And then you pay those sites.
Caffdy@reddit
Common Crawl generates between 300 and 400 TB from about 2 billion web pages each month (not even counting the PBs of data already crawled in years past). I think it's very, very hard to try to mirror the internet.
CONSOLE_LOAD_LETTER@reddit
It wouldn't be that hard if there were an easy way for people to crowdsource their hardware resources. An easy to use client with a rock solid backend would attract a lot of people interested in keeping the internet open.
Projects like Folding@home prove that distributed computing can have a profound effect, and when there are added incentives involved, for example, the Chia network had >30 EB of raw netspace at its peak.
The hardware and the motivation are out there; it just needs someone to write kickass software and get enough people's attention to harness it.
kettal@reddit
use 7zip
valdev@reddit
Im going to need to buy that winrar license... arent I?
Momsbestboy@reddit
Too late, already bought one. Not because I am unable to use a cracked version, but to honor a company which handed me this fantastic tool for years, for free (cracked version,...). So: I bought it, and one year later I switched to Linux.
FaceDeer@reddit
I've got a large language model that can click on the nag popup for me now, so no, keep on using it unlicensed.
valdev@reddit
Okay good. Was worried for a moment.
AlexWIWA@reddit
I am surprised no one has figured out micropayments yet. I don't mind paying for things as long as I don't need to provide any info.
Mickenfox@reddit
Most people do. Some people will go to great lengths to avoid paying $5 for an app that works well and does what they need.
ParaboloidalCrest@reddit
100%. A browser extension that copies the pages you agree to share onto IPFS or something of that sort. And the page remains alive as long as other IPFS nodes are interested in it.
rhythmdev@reddit
Now we scrape the shit outta them
nullc@reddit
Local models need local knowledge, especially now that there are consumer-GPU-friendly models that are very good at "doing stuff" but inherently weak at "knowing stuff".
I've been tinkering with taking an offline copy of Wikipedia (only a few tens of GB without images, or about 130 GB with images) and running each article through an LLM with a prompt to extract a list of questions that the article answers or provides critical information for answering. Then I take these questions, encode them with a sentence embedding, and store the results in a vector database mapping back to the article.
Then at runtime my agent can fork its state, construct some questions and tool call to a lookup tool that will find the most relevant articles for the questions, the agent can then choose and read the articles, find the answers, then rollback the state and suddenly 'know' the relevant material and article names (by concatenating the final output of the pre-rollback state; the article names are useful in case it has to go back to them).
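For anyone who wants to try the same idea, here's a minimal sketch of that pipeline (not the commenter's actual code). It assumes a local OpenAI-compatible endpoint such as a llama.cpp server on localhost:8080, sentence-transformers, and a couple of placeholder articles standing in for the Wikipedia dump:

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def questions_for(article_text: str) -> list[str]:
    # Ask the local model for the questions this article answers, one per line.
    resp = llm.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content":
                   "List, one per line, the questions this article answers:\n\n" + article_text}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip("- ").strip() for q in lines if q.strip()]

# Placeholder articles standing in for a parsed Wikipedia dump.
articles = {"Apollo 11": "...", "Moon landing": "..."}

# One embedding per generated question, each row mapping back to its article title.
rows = [(title, q) for title, text in articles.items() for q in questions_for(text)]
matrix = embedder.encode([q for _, q in rows], normalize_embeddings=True)

def lookup(question: str, k: int = 3) -> list[str]:
    qv = embedder.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(matrix @ qv)[::-1][:k]              # cosine similarity via dot product
    return list(dict.fromkeys(rows[i][0] for i in best))  # de-duplicated article titles

print(lookup("Who was the first person on the moon?"))
```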
Open issues I have are getting the questions the model poses at runtime to be as similar as possible to the generated questions. I'm also wondering if it would be useful to optimize the embedding: I should be able to fine-tune it to favor errors that still land on the right article(s), and to penalize errors not just for landing on wrong articles but based on the link-space distance between the correct and incorrect article.
In any case, the whole approach should work for any cache of knowledge, but Wikipedia is obviously quite useful and it's freely redistributable and self-contained, so I think it's a good starting point. Wikipedia also has lots of external links for citations, and so to some extent it can act as a replacement for search in a first step of research -- at least for the kinds of materials Wikipedia covers.
The cool thing about this approach is that there is no particular need to have the knowledge itself on especially fast storage, and so anyone that can run a 27B-sized dense model could probably accommodate some tens of TB of reference knowledge. Even system RAM for vector database lookups is much less precious than GPU RAM.
I'm a little surprised that this isn't already a thing that people are doing for this purpose, but I couldn't find any evidence of it while looking.
IronColumn@reddit
yeah i built a mcp for my nonfiction ebook library
OnkelBB@reddit
This is actually a great way to save resources. That will def help me with my idea for ttrpgs
Do you have a repo with scripts/prompts I can dig into?
Acrobatic_Stress1388@reddit
This is awesome. Way to get creative.
azukaar@reddit
A local knowledge base only covers a tiny fraction of what online research is useful for, though.
rj_rad@reddit
Self-host SearXNG
BringTea_666@reddit
I mean it is obvious that it can't go on forever. Site owners pay for bandwidth and want real users to click on ads from which they make money.
Bots don't do that. They take something for nothing.
A little bot usage is fine, it costs nothing, but a shitload of bot usage? Obviously site owners don't want that. Now imagine everyone using AI to do searches for them. It means there is zero reason for owners to make sites or allow bots.
Free shit has a cost.
Elrandar@reddit
https://platform.yep.com/
Unlikely_Rich1436@reddit
The move by Cloudflare and Google to restrict AI scrapers is going to fundamentally change how we build local agents. We're moving from an open-web era to a "permissioned-web" era, and that has massive implications for local RAG systems.
ttkciar@reddit
Maybe YaCY's time has finally come? https://yacy.net/ https://en.wikipedia.org/wiki/YaCy
It's a P2P open source decentralized web search system. It's been around for about twenty years now (!!!) but never really had its day in the sun.
hellomistershifty@reddit
Just what it needs, a massive influx of bot traffic
DMmeurHappiestMemory@reddit
Hell yeah we should push this hard.
ttkciar@reddit
Agreed. It would be nice if it had bindings for languages other than Java, but maybe something could be done about that.
Also, I was poking around, and saw they already have a llama.cpp integration project:
https://github.com/yacy/yacy_expert
s101c@reddit
"Llama.cpp please rewrite Yacy in Rust"
waywardspooky@reddit
wouldn't yacy need to effectively imitate a human browsing in order to not get flagged as a bot or crawler by cloudflare and similar bot-detecting countermeasures?
ttkciar@reddit
They've been doing this for twenty years, which seems like plenty of time to figure out good solutions to that problem. Maybe you could ask people who actually work on the project? I'm just a distant fan.
waywardspooky@reddit
i mean you're the one who suggested it, i would have figured you've been using it recently or something. from everything i can tell from looking into it, it does not implement any sort of methods to get around countermeasures, so it seems pretty useless at the moment if everything is shifting in the direction of bot detection and crawler blocking.
ttkciar@reddit
You "looked around" but didn't notice YaCYIndexerGreasemonkey?
That's odd, it's one of the repositories under their main Github account.
waywardspooky@reddit
An 11 year old repo that hasn't been updated since April of 2015.
What this does and doesn't do.
It only indexes pages you actually visit manually. You can't point it at a site and crawl it; you have to browse there yourself. Does this not fundamentally defeat the whole purpose of why YaCY was brought up in the first place? We're talking crawling at scale. Unless I'm missing something, this repo does nothing to solve the issue of crawling sites at scale. It would be awesome if it did.
ttkciar@reddit
Way to move the goalposts. It's a way to index content without running afoul of CDN countermeasures, which is what you claimed the project didn't have.
JamesEvoAI@reddit
I just ran a test search on their demo instance, it turned up completely meaningless results.
SearXNG works today and doesn't rely on peering
bwjxjelsbd@reddit
Projects like this are such an early-internet vibe lmao
We went from a free internet to paywalls everywhere so hard in the last decade, and it needs to be reversed.
How many resources do I need to become a YaCy node?
ghulamalchik@reddit
I love this, thanks for sharing. The internet needs to move on from this archaic system (relying on the big corporations such as Google, Facebook, Microsoft, Amazon) and adopt decentralized and community-run solutions that are much harder to block.
lautan@reddit
I’ve heard bad things. Like it’ll send unlimited requests and it’s poorly optimized.
ttkciar@reddit
By default it tries to index everything in the world, limited only by local storage, but you can configure it to not do that.
Not sure what you mean by "poorly optimized" though.
CallOfBurger@reddit
You can use something other than Google: Qwant, Ecosia, DuckDuckGo. Soon enough, maybe even next week, an internet browser for agents and AI will appear.
EuphoricPenguin22@reddit
SearXNG is an open-source metasearch server with tons of public instances to point at.
This MCP you can run headless with NPM can direct web searches at any of these public instances without any real rate limit. SearXNG instances search basically every conceivable search engine, and even if one gets blocked from an upstream source you need, you can move to a different instance. Since it's also searching from so many different places, it's super resilient to rate limiting and instances getting blocked from any particular upstream search provider.
TL;DR SearXNG is an open-source search engine for search engines that a bunch of people host for free. You can point a local MCP at any of them and essentially get unlimited searches with resilience to upstream blocks and rate limits.
iVXsz@reddit
Good.
Demodude123@reddit
I personally use ollama's web search api, haven't had any issues with it.
danigoncalves@reddit
Maybe people will start using YaCy and grow both the index and the quality it serves for whoever wants to be part of the network.
Acrobatic_Stress1388@reddit
One of the providers you can configure SearXNG with is YaCy.
a__side_of_fries@reddit
A few options that already exist:
Search APIs that aren't Google: SearXNG is fully open source and self-hostable. It meta-searches across multiple engines (Bing, DuckDuckGo, Brave, etc.) so you're not dependent on any single provider. Brave Search also has an API with a reasonable free tier.
Common Crawl: Petabytes of web data already crawled and freely available. Not real-time, but for a lot of use cases you don't need today's data. Several projects index Common Crawl for local search (see the index-query sketch below).
Brave Search API: Independent index (not reskinned Google/Bing), has a free tier of 2000 queries/month and paid plans that are significantly cheaper than Google.
For the Cloudflare bot-blocking problem specifically: This affects scraping, not search. If your workflow is search (get URLs) then scrape (get content), the second step is what's breaking.
Cached/archived versions (Wayback Machine API, Google Cache while it lasts, archive.today) can sometimes bypass this. Reader APIs like Jina Reader (r.jina.ai) are also designed to handle this.
The honest take: free unlimited web search for AI agents was never going to last. The economics don't work. But we're not helpless. The stack is probably going to settle on SearXNG or Brave for search + a mix of reader services and caching for content retrieval. Not as clean as "Google everything for free" but functional.
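Relating to the Common Crawl option above, a sketch of querying its CDX index API. The crawl ID below is just an example (pick a current one from index.commoncrawl.org); each record points at a WARC file plus byte range that you can fetch separately to get the archived page:

```python
import json
import requests

def cc_lookup(url_pattern: str, crawl: str = "CC-MAIN-2024-33"):
    r = requests.get(
        f"https://index.commoncrawl.org/{crawl}-index",
        params={"url": url_pattern, "output": "json"},
        timeout=30,
    )
    r.raise_for_status()
    # The response is newline-delimited JSON, one capture per line.
    return [json.loads(line) for line in r.text.splitlines() if line.strip()]

for rec in cc_lookup("example.com/*")[:5]:
    print(rec["timestamp"], rec["url"], rec["status"])
```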
Acrobatic_Stress1388@reddit
That's great info, but what I really want to know is a good brownie recipe.
Kibbles4Everyone@reddit
“The honest take”
I know what you are
a__side_of_fries@reddit
meaning?
Unlucky-Habit-2299@reddit
Cloudflare's basically turned the whole web into a fortress and called it "security," meanwhile anyone actually trying to build something gets treated like a criminal.
SkyFeistyLlama8@reddit
Can't blame them. Compute and storage aren't free. Advertising once paid for all that but now, who knows?
tavirabon@reddit
Advertising revenue is at an all-time high of $300B. As with practically everything ever, the issue is consolidation of distribution and greed.
Fi3nd7@reddit
The true future is a decentralized P2P conglomerate index IMO, with a "the more you host and serve, the more you can query" model.
IrisColt@reddit
I knew it!
Confident_Ideal_5385@reddit
Worst case, paid API providers like Kagi are gonna set a pretty reasonable price ceiling for this stuff, and you can doubtless still get pretty decent results with SearXNG or a headless browser querying DDG etc.
AcaciaBlue@reddit
Reddit is already ahead of the game; just the other day I was using Claude and it tried to web fetch a reddit page but failed. If Claude is getting screwed over, it's gonna be real tough for a regular guy.
OldEffective9726@reddit
Use the Brave API; mine never exceeded $2 a month, and they have a free version that is enough for personal use 95% of the time.
Ylsid@reddit
Begun, the AI wars have.
bespoke_tech_partner@reddit
Playwright?
antunes145@reddit
They are obviously trying to curb Chinese open source models from easily scraping the web. They are doing everything they can to stomp on open source models.
tukatu0@reddit
It's not just google. I have seen small sites go from 100 visitors to 20000. What's the point if no human ever sees it? What is the point if no user benefits? Cloudflare is also taking on that load.
It is what it is, increasing the cost for everybody.
Delicious-Storm-5243@reddit
running into this on our side too. fallback we've been using: scrape via authenticated browser session (cookies + real user-agent), keep volume low to avoid cloudflare challenge. works but adds friction. the real long-term answer is probably structured-data APIs from sites that want AI traffic vs sites that don't
Leather_Flan5071@reddit
wait so visiting google.com and just searching will be limited?
clintCamp@reddit
I worked with Claude and set up my own RAG search tool MCP that has its own web search tools built in, going through DuckDuckGo, and strips out the HTML garbage to just get the core site text, which works for most sites. Not sure if that is going to hit any blocks, but it gives it access to any non-captcha'd sites that Beautiful Soup can parse to markdown format.
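A rough sketch of that kind of flow, not the commenter's actual tool: search DuckDuckGo via the duckduckgo_search package, fetch each hit, and strip it down to readable text with Beautiful Soup:

```python
import requests
from bs4 import BeautifulSoup
from duckduckgo_search import DDGS

def search_and_read(query: str, max_results: int = 3):
    pages = []
    with DDGS() as ddgs:
        hits = ddgs.text(query, max_results=max_results)
    for hit in hits:
        try:
            html = requests.get(hit["href"], timeout=15,
                                headers={"User-Agent": "Mozilla/5.0"}).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Drop script/style/nav noise, keep the readable text.
        for tag in soup(["script", "style", "nav", "header", "footer"]):
            tag.decompose()
        pages.append({"url": hit["href"], "title": hit["title"],
                      "text": soup.get_text(" ", strip=True)[:4000]})
    return pages

for page in search_and_read("local llama web search"):
    print(page["title"], "-", page["url"])
```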
FullstackSensei@reddit
Just give it some time, and someone will figure out how to scrape Google's regular search API (what runs when you hit the site via a browser).
I've written quite a few website scrapers over the years. From past experience, most of these protections rely on two things: the user agent string and how many concurrent connections you make. Copy-paste whatever your current browser's user agent string is, and make sure to rate-limit your scrapers.
With how good LLMs have become at these things, I think an LLM like Qwen 3.6 could very well build this on its own with a good enough prompt and access to a basic Python interpreter.
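A minimal sketch of those two points; the UA string is an example you'd replace with whatever your own browser currently sends:

```python
import time
import requests

HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

def polite_get(urls, delay_seconds: float = 3.0):
    with requests.Session() as session:      # one connection, reused serially
        session.headers.update(HEADERS)
        for url in urls:
            yield url, session.get(url, timeout=20)
            time.sleep(delay_seconds)        # stay well under rate limits

for url, resp in polite_get(["https://example.com", "https://example.org"]):
    print(url, resp.status_code, len(resp.text))
```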
Tokarak@reddit
I’ve actually been hitting google captchas under normal usage. Unless I’m unusually aggressive at searching (doubt it), or my browser of choice is bugged…
somersetyellow@reddit
I'm hitting captchas everywhere on residential Internet and very human browsing. I can get reddit to 429 me by just scrolling fast and middle clicking 5 links to queue up my reading list.
Power users are getting flagged as bots
tomz17@reddit
There are entire websites which no longer work properly under Linux+Firefox due to aggressive bot detection. I have to pull out my macbook just to browse the Home Depot website.
FullstackSensei@reddit
I middle/ctrl-click all the time on reddit. Often sitting at my home office desk with reddit open on both the desktop and laptop, and I never hit any such limits. But I don't speed scroll.
somersetyellow@reddit
It really depends. Sometimes it happens, other times it's fine. It's pattern recognition at the end of the day, and I guess I sometimes fit the pattern haha
FullstackSensei@reddit
I haven't hit google captchas in years. How many searches are you conducting each day? That "normal" usage seems a bit heavy.
I still have the PSE API until the end of this year. I've yet to reach 10 searches/day using that.
fallingdowndizzyvr@reddit
I get captcha on Google more often than not. So I use DDG. But now DDG slow walks the results somehow for the first result. After that it's fine though. So I use Bing or Yahoo. Which for the first page of search results is fine. But if I click on the next page it captchas me.
So my solution right now is to use DDG. Yes, that first result takes forever. So I just don't close that tab.
DeltaSqueezer@reddit
Maybe I was unlucky, but I hit some TLS fingerprinters and had to spoof that.
mrjackspade@reddit
TLS fingerprinting is getting bigger too. Amazon uses TLS fingerprinting, Gelbooru actually recently added it as well. Not difficult to get around, as long as you know what it is. You can't just spoof a user agent on a lot of sites anymore though because they can just verify the TLS handshake for your user agent and see that it's spoofed.
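One hedged example of working around that: the curl_cffi package can impersonate a real browser's TLS handshake so it matches the User-Agent you send (exact impersonation targets vary by version):

```python
from curl_cffi import requests as creq

resp = creq.get(
    "https://example.com",
    impersonate="chrome",   # mimic Chrome's TLS/JA3 fingerprint, not just its UA string
    timeout=20,
)
print(resp.status_code, len(resp.text))
```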
sn2006gy@reddit
Yeah, there are some tools that will spoof the fingerprints now, but the terrible part is they're run by people who create agents to scalp stuff, so they just create communities that make things more expensive... can't have nice things anymore.
mr_tolkien@reddit
Searxng works pretty well to get around that
Important-Radish-722@reddit
Until there are no more search engines to feed it.
itssethc@reddit
Agents can browse now. If someone just links (if they haven't already) a skill to just go to Google, search, and send back URLs, it's moot. They can't sustain paywalling their customer-facing UI.
NetTechMan@reddit (OP)
Agents losing the ability to leverage web search when browsing is exactly what this post is about.
itssethc@reddit
I’m talking about an actual browser on your machine/VM, not browsing skill. If done right there’s no way it’s traceable compared to a typical user sitting in front of the browser.
Savantskie1@reddit
Except for the fact that humans don't read as fast as an LLM. And google can definitely tell if it's an LLM searching, by the sheer speed and hammering of searches versus people generally reading and clicking. People don't want an AI that replicates human speed of search. They want the information faster than they can search and read.
itssethc@reddit
An LLM is a model, and all Google knows is someone in the browser made a search. Have it in dev mode, all links can be copy/pasted and opened in separate tabs. It doesn’t need to trace how quickly you click on links.
Savantskie1@reddit
Try again. They are clearly watching the speed of searches. That's how they're detecting bots/LLMs/AI. It's not hard to understand.
itssethc@reddit
So you’re telling me, making a search and clicking enter is something an AI can do so fast it triggers a flag? Do you understand how metadata works? How headless browsing works?
This is why reddit sucks to post on lmao
tukatu0@reddit
You have never come across a keylogger? Do you think the one integrated into chrome is for show?
Don't take pride in anti-intellectualism.
itssethc@reddit
I don’t think you or the other guy understands at all what I’m saying lmao a keylogger will show you those keys. This site is so toxic.
Savantskie1@reddit
Actually yes it can if provided the tools to do so.
itssethc@reddit
Glad an expert came across my posts and educated me, thank you so much
Savantskie1@reddit
I know that’s a snide response but it’s still not proving I’m wrong. I’m not an expert per se but I’ve been working with computers since their inception. I know quite a bit how the internet under the hood works
StorageHungry8380@reddit
Right, you could have an agent running on the computer taking screenshots, a multi-modal model interpreting it and outputting instructions for where to click, then just move the cursor there to click or which keys to press (page down for example). You could include a small NN trained on your mouse movements to generate natural-looking cursor movement. Just one option, there are many others.
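A bare-bones sketch of that loop with pyautogui; the vision-model call is a stub you'd wire up to your own multimodal model:

```python
import random
import time
import pyautogui

def ask_vision_model(screenshot) -> dict:
    # Placeholder: send the screenshot to a local multimodal model and get back
    # something like {"action": "click", "x": 640, "y": 400} or {"action": "scroll"}.
    raise NotImplementedError

def step():
    shot = pyautogui.screenshot()                   # what the "agent" sees
    decision = ask_vision_model(shot)
    if decision["action"] == "click":
        # A randomized duration makes the cursor path look less machine-like;
        # a small model trained on your real mouse traces could replace this.
        pyautogui.moveTo(decision["x"], decision["y"],
                         duration=random.uniform(0.4, 1.2))
        pyautogui.click()
    elif decision["action"] == "scroll":
        pyautogui.press("pagedown")
    time.sleep(random.uniform(1.0, 3.0))            # human-ish pacing
```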
itssethc@reddit
Exactly. Sometimes there’s pain in freedom. Still better than nothing though.
kettal@reddit
i think it will hit captcha challenges after heavy usage.
sixx7@reddit
Scrolled too far to find this. Hermes and OpenClaw and probably other harnesses can use a headful browser to literally browse any site, including solving any captcha and bypassing any "verify human" checkbox. Come to think of it, so can Cowork and the Codex desktop app, though the models themselves might refuse to do captchas for you; haven't tried.
daronjay@reddit
Nothing is stopping OpenAI and Anthropic from setting up their own indexes and making them part of the toolchain, and where will that leave Google and Cloudflare?
MaruluVR@reddit
Don't Bing and Yandex sell access to their search results to other search engines?
We could get together and buy the results and then make a small subscription where everyone chips in like 1 dollar per month for access.
kettal@reddit
bing stopped offering search api publicly 2 years ago.
RedditUsr2@reddit
Man I miss gigablast. Think of how much money they could have made in the AI era.
Innomen@reddit
I feel a domino effect under all this. They cut off bots from search, we route bots through us (like copy-pasting content when the robots file says no and the AI honors it), they try to stop that, we respond, etc. AI is speeding things up.
lakySK@reddit
I’ve had good experience with exa.ai so far!
looselyhuman@reddit
Google and Cloudflare are demonstrating that there is nothing public or open about the Internet. It's like CompuServe or AOL, before http. Everything goes through the gateway. Humans get access to the webpage section, for now, because it's profitable to advertise to us.
fmlitscometothis@reddit
Ironically companies that don't rely on advertising want to optimise for agent searches. I'm always surprised when Amazon comes up blocked on Claude. "Compare product X and Y and tell me which is better for my needs" - I'm in the market to buy, don't block me!
Terminator857@reddit
There was a time when search was king; now AI is king. The trend will continue. Anthropic and similar will hold the cards in the future.
yad_aj@reddit
honestly inevitable tbh. the entire “free infinite internet for AI agents” era was probably always temporary. once scraping stopped looking like search traffic and started looking like automated extraction at massive scale, platforms were gonna lock down.
i think the ecosystem splits into:
also wouldn’t be surprised if personal/local RAG becomes way more important than live web search for most workflows.
the ironic part is this might actually improve agents lol. current web-search loops spend half their time digging through SEO sludge and javascript nightmares anyway
Hydroskeletal@reddit
While I've already gone down the route of plowing into headless and even headed browser interfaces to get past this, it's still kinda rough.
The real problem is that it's an arms race: every time we 'crack' the code, Google has an army of competent, full-time engineers who are highly motivated to defeat you.
Caffdy@reddit
What about OpenCrawl? I was curious about how much information about the web that dataset already holds. Can it work as a local web-search engine? I understand that it's "outdated" (but it still has like 25 years or so of internet information).
sonicnerd14@reddit
There are ways to get around it, like third parties such as Brave Search, browser agents, or computer use. I'd assume there will be a cat-and-mouse game for a while, with companies trying to figure out how to capitalize on bot traffic while people like us figure out how to break and circumvent it. Much like ad blockers, but this time they can't keep our bots blocked forever.
torrso@reddit
Just pay.
letsgoiowa@reddit
Brave search is great.
thatoneshadowclone@reddit
need i remind you to not support Brave?
https://slate.com/technology/2014/04/brendan-eich-why-mozilla-s-ceo-had-to-resign-over-gay-marriage-views.html
https://thelibre.news/no-really-dont-use-brave/
https://www.reddit.com/r/browsers/comments/1j1pq7b/list_of_brave_browser_controversies/
use DuckDuckGo.
letsgoiowa@reddit
Google supports, funds, and arms regimes that drone civilians. It seems really weird that people get upset about someone with mean words when the alternatives literally kill people.
thatoneshadowclone@reddit
they're a multi billion dollar megacorporation, that's kind of a given. that is fair, but i never said that i support google anyway, just that brave is also bad and undeserving of your money. not an api but i use duckduckgo.
MaruluVR@reddit
The only good browser nowadays is librewolf IMO
CalligrapherFar7833@reddit
Why should we not use brave ?
MaruluVR@reddit
Because librewolf gives you real privacy and doesn't have bloatware.
thatoneshadowclone@reddit
read the links.
jwpbe@reddit
a lot of people in software / who use LLMs are reactionaries and won't care that the guy has been openly bigoted for 20 years, but he also skims money from crypto users, so it may balance out.
wiltors42@reddit
It's not free anymore unless you're grandfathered in.
julp@reddit
I've been using Tavily (first came across it via SuperClaude). Seems to work well and has a decent free tier.
JLeonsarmiento@reddit
search by yourself?
woadwarrior@reddit
I recently came across tinyfish, although I haven't tried it yet.
jeffwadsworth@reddit
Yeah I noticed this coming to a head a while back and it sucks but I get it. No idea of a solution beyond mirroring the data. Pfft.
relaxusMaximus@reddit
Kagi’s API is working well for me. You have to request access since it’s in beta, and it does require payment, but … free stuff is never really free anyway.
Korphaus@reddit
Just put in SearXNG - I got a Gemini CLI session to set that and Kiwix up, with all the text from Wikipedia downloaded and available in docker containers, within like half an hour.
I just asked it to go do it and get them plumbed into my agents - it's really not that hard (yes it would take longer if I had to implement myself, sue me)
Tuned3f@reddit
Searxng mcp
zakerytclarke@reddit
I'm mostly using Brave Search API.
I don't love the idea of relying on paid search APIs, though, and am hoping to see more community-supported indexes.
I recently shared an LLM SearchIndex, which is a local search index that compresses most of the fine web results from Common Crawl.
Curious to know if there are other community projects.
phein4242@reddit
I do hope you mean socialist, and not communist ;-)
Either way, check out SearXNG.
Public_Umpire_1099@reddit
Yep, came here to say exactly this. Get on SearXNG, route all requests on your network away from Google and towards SearXNG. Use Vane instead of perplexity too.
Spend a day fine-tuning your setup and you won't miss Google. After I set up my index priorities and providers, I even restyled the CSS to make it look more modern.
This is the only long term play. Get your data out of Google's hands.
__JockY__@reddit
The thing I miss, funnily enough, is the AI summary at the top of search results that google et al. provide. Do you know of anyone who's already implemented that for SearXNG?
Dany0@reddit
I'm scared of this happening but I've never faced this before. I think I use SearXNG and some other thing... cannot recall. Mostly it just works. I think it only struggles with fetching reddit and twitter links but if you're trying to visit those domains anyway... are you even doing anything useful let's be honest lmao
Are you sure y'all aren't triggering just regular pre-ai era bot protection?
Ha_Deal_5079@reddit
honestly brave search api and tavily seem like the play rn. searxng was cool for a while but way too fragile for prod use tbh
petburiraja@reddit
There are a lot of SERP API services out there
BrightRestaurant5401@reddit
I don't know? Think longer than 30 seconds about the problem and devise your own solution?