What the industry has been anticipating for at least a year has finally happened: machines now generate more web traffic than people do. On June 3, 2026, Cloudflare published data from its Radar network indicating that automated systems have, for the first time, generated the majority of all HTTP requests to web content: 57.5%, compared with 42.5% from human users. NBC News, citing the same report, gave a nearly identical split of 57.4% to 42.6%. This is not a statistical anomaly or a one-time spike, but a documented break in a long-standing trend.
What is most striking is how quickly this occurred. Just three months before the publication, speaking at the SXSW conference, Cloudflare CEO Matthew Prince said that the crossover point would not arrive before 2027. Commenting on the new figures, he admitted, "Well, this happened faster than I predicted." The milestone was reached more than a year earlier than forecast, and the forecast was Prince's own.
Who Turned the Web into Bot Territory
The main culprit is not classic search spiders or spam bots but agentic AI: semi-autonomous programs that perform tasks for assistants like ChatGPT and Gemini. The logic is simple and, for servers, ruthless: where a human clicks a couple of times, a single AI agent crawls thousands of pages to gather context and produce an answer. Each such "expedition" involves dozens or hundreds of requests, which add up to an avalanche.
The scale of growth is evident from individual crawlers. According to Cloudflare's measurements, traffic from OpenAI's GPTBot grew by 305% over the year. Its share of all AI traffic tells the same story: GPTBot rose from 4.7% (July 2024) to 11.7% (July 2025). In May 2026, specialized AI crawlers accounted for 20.3% of bot requests and AI search bots for another 6.5%, so nearly 27% of all bot traffic directly feeds language models. By purpose, this traffic breaks down as follows: 51.8% for collecting training data, 35.7% for a mixed mode (training plus providing answers), and only about 9% for pure search.
The strain on infrastructure has ceased to be an abstraction. The Wikimedia Foundation reported that since January 2024, bandwidth consumption for multimedia delivery has increased by 50%, and that bots generate 65% of the most resource-intensive traffic while accounting for only 35% of page views. In other words, machines are consuming disproportionately large amounts of expensive traffic without returning anything to the website owner.
Why the Open Web Is Closing Its Doors
The reaction from platforms has been predictable: if bots bring no advertising impressions or clicks, platforms start blocking them. By August 2025, more than 2.5 million websites had completely prohibited the use of their data for AI training. In the five months following July 2025, Cloudflare alone blocked around 416 billion requests from AI bots. GPTBot became the most "banned" crawler in robots.txt files: it appears in 5.52% of all Disallow rules.
The imbalance is clearly visible in the so-called crawl-to-referral ratio: how many pages a bot crawls for each referral click it sends back. For the benchmark Googlebot, this ratio is about 4.9:1. For GPTBot, it stands at 1276:1, while ClaudeBot reached nearly 24,000:1 before improving to around 11,000:1. For a website owner, the meaning is simple: AI takes pages by the thousand and gives back a handful of visits.
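To make the robots.txt mechanism concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `can_crawl` helper are illustrative assumptions, not taken from any real site; the example only shows what a Disallow rule for GPTBot looks like to a crawler that chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out GPTBot entirely, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_allowed = can_crawl("GPTBot", "https://example.com/articles/1")
other_allowed = can_crawl("SomeOtherBot", "https://example.com/articles/1")
print(gptbot_allowed, other_allowed)
```

Note that the check runs on the crawler's side: robots.txt states a request, and nothing in the file itself enforces it.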
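The ratio is easier to feel when inverted. The sketch below is plain arithmetic on the figures quoted above; the function names are hypothetical and chosen only for this illustration.

```python
def crawl_to_referral(pages_crawled: int, referrals: int) -> float:
    """Pages crawled per referral click sent back to the site."""
    if referrals <= 0:
        raise ValueError("referrals must be positive")
    return pages_crawled / referrals


def referrals_per_million(ratio: float) -> float:
    """Expected referral clicks for every 1,000,000 pages crawled."""
    return 1_000_000 / ratio


# Crawl-to-referral ratios as quoted in the text.
RATIOS = {"Googlebot": 4.9, "GPTBot": 1276.0, "ClaudeBot": 11_000.0}

for bot, ratio in RATIOS.items():
    clicks = referrals_per_million(ratio)
    print(f"{bot}: about {clicks:,.0f} referrals per million pages crawled")
```

At these ratios, a million crawled pages yields roughly 204,000 visits from Googlebot, under 800 from GPTBot, and about 90 from ClaudeBot.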
However, simply blocking means losing potential revenue, so Cloudflare proposed a third way. Its Pay-Per-Crawl system utilizes the long-forgotten HTTP status 402 "Payment Required": instead of completely shutting out the bot, the site can bill it for access. The company acts as an intermediary and processes the payments. The mechanics are three-tiered: Block (one click, by default for new domains), Charge (paid access at the owner's rate), and Allow (open access with detailed analytics). According to Cloudflare, clients are already issuing more than one billion 402 codes per day.
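The three tiers can be sketched as a small decision function. This is an illustration under stated assumptions, not Cloudflare's implementation: the `CrawlPolicy` enum, the `respond_to_crawler` function, the price parameters, and the choice of 403 for the Block tier are all hypothetical. Only the use of status 402 for the Charge tier comes from the text.

```python
from enum import Enum
from http import HTTPStatus
from typing import Optional


class CrawlPolicy(Enum):
    BLOCK = "block"    # refuse the crawler outright
    CHARGE = "charge"  # serve content only if the crawler agrees to pay
    ALLOW = "allow"    # serve content freely


def respond_to_crawler(
    policy: CrawlPolicy,
    asking_price_usd: float = 0.0,
    offered_price_usd: Optional[float] = None,
) -> HTTPStatus:
    """Pick an HTTP status for a crawler request under the given policy."""
    if policy is CrawlPolicy.BLOCK:
        return HTTPStatus.FORBIDDEN  # assumption: blocked bots get a 403
    if policy is CrawlPolicy.ALLOW:
        return HTTPStatus.OK
    # CHARGE: serve only when the crawler's offer covers the owner's rate.
    if offered_price_usd is not None and offered_price_usd >= asking_price_usd:
        return HTTPStatus.OK
    return HTTPStatus.PAYMENT_REQUIRED  # HTTP 402


status = respond_to_crawler(CrawlPolicy.CHARGE, asking_price_usd=0.01)
print(status.value, status.phrase)
```

The point of the middle tier is that a 402 is a price quote rather than a refusal: the same request can succeed once the crawler attaches an acceptable offer.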
This trend extends beyond a single company. On April 7, 2026, GoDaddy — one of the largest hosting providers in the world — integrated Cloudflare's AI Crawl Control tool into its platform. Cloudflare's strategy director, Stephanie Cohen, articulated it this way: "By providing website owners with tools like AI Crawl Control and open standards, we are laying the foundation for a new business model of the internet." Considering that approximately 20% of all websites globally operate behind Cloudflare's reverse proxy, this represents a tectonic shift in the rules of the game.
The Mask War: Why Blocks Do Not Affect Everyone Equally
A key nuance often overlooked in sensational headlines: the new barriers are primarily aimed at bots that honestly identify themselves and come from data center IP ranges. A crawler with a clear User-Agent like "GPTBot" and an address in an AWS cloud range is an easy target for WAFs and traffic categorizers. It is precisely these crawlers that account for the billions of blocked requests.
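Why such crawlers are easy targets can be shown with a toy classifier of the kind a site operator might run. Everything here is an assumption for illustration: the network list uses reserved documentation ranges rather than real cloud prefixes, the bot token list is a short hand-picked sample, and `classify_request` is a hypothetical name. Real WAFs use far richer signals.

```python
import ipaddress

# Hypothetical "data center" networks (reserved documentation ranges).
# A real deployment would load the ranges that cloud providers publish.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Sample of User-Agent tokens that self-identifying AI crawlers send.
DECLARED_AI_BOTS = ("gptbot", "claudebot", "perplexitybot")


def classify_request(user_agent: str, client_ip: str) -> str:
    """Label a request using only its User-Agent and source address."""
    ua = user_agent.lower()
    ip = ipaddress.ip_address(client_ip)
    if any(token in ua for token in DECLARED_AI_BOTS):
        return "declared-ai-bot"
    if any(ip in network for network in DATACENTER_NETWORKS):
        return "datacenter"
    return "unclassified"


print(classify_request("Mozilla/5.0 (compatible; GPTBot/1.2)", "203.0.113.7"))
print(classify_request("Mozilla/5.0 (Windows NT 10.0)", "198.51.100.20"))
print(classify_request("Mozilla/5.0 (Windows NT 10.0)", "192.0.2.55"))
```

Two string and range checks are enough to catch a crawler that announces itself, which is exactly why honest bots absorb most of the blocking.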
The problem is that not everyone follows the rules. The AI Agent Index from MIT CSAIL for 2025 and Cloudflare's observations align: about half of AI traffic simply ignores robots.txt. The standard llms.txt, intended to serve as a "polite menu" for models, is not being read in production by any major AI company as of the first quarter of 2026. A notable incident from August 2025: Cloudflare publicly accused Perplexity of covert crawling — rotating User-Agents and masquerading as a regular browser to bypass restrictions in robots.txt. Perplexity denied the allegations, but the case clearly illustrated the direction the industry is heading.
The takeaway for those legally collecting public data that sits outside any login is paradoxical: the more aggressively platforms cut "noisy" data center crawlers, the higher the value of traffic that appears to come from a regular person. To anti-bot systems, a request coming from a residential or mobile IP, with a normal browser fingerprint and a human rhythm, is indistinguishable from a human visitor, and it passes where a cloud bot receives an instant ban.
What This Means for Web Scraping in Practice
If your business relies on data collection — price monitoring, SERP parsing, review aggregation, training models on open sources — the conclusions from Cloudflare's report should be taken as a call to action.
- Data center proxies without masking are a risk zone. If you are sending requests from obvious cloud ranges and not managing your fingerprint, you fall precisely into the category that is under heavy fire. For tasks that are not sensitive to reputation (internal APIs, friendly sources, simple public pages), data center proxies remain fast and cheap, but for protected platforms, their lifecycle is shortening.
- Residential IPs are the new baseline. For serious scraping of protected sites, residential proxies provide the "human" profile that anti-bot systems allow by default. This is no longer a premium option but basic hygiene.
- Mobile proxies are for the toughest targets. Social networks and platforms with behavioral analysis are particularly strict about the source of the connection. Mobile proxies with real carrier IPs and their rotation mechanics provide maximum "stealth" where even residential addresses are under suspicion.
- Prepare for paid access. Pay-Per-Crawl with status 402 is not a temporary experiment: a billion such responses a day indicates that the model has taken root. In the coming years, some data will be available only for payment or only to those whose traffic looks organic.
A separate scenario involves building your own infrastructure. For small volumes and private tasks, it makes sense to set up your own node: we have detailed how to build a home proxy server on a Raspberry Pi in an evening for a couple of thousand rubles. This will not replace a pool of millions of addresses, but it covers basic needs and helps you understand the mechanics from the inside.
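On the collector's side, preparing for paid access mostly means treating 402 as a normal, budgeted outcome rather than an error. The sketch below is one possible shape for that logic. The header name `crawler-price`, its value format, and the function names are assumptions made for this example; check the actual provider documentation before relying on any of them.

```python
from typing import Mapping, Optional

# Assumed name of the response header that carries the quoted price.
PRICE_HEADER = "crawler-price"


def parse_price_usd(headers: Mapping[str, str]) -> Optional[float]:
    """Read a price such as 'USD 0.01' or '0.01'; None if absent or malformed."""
    raw = headers.get(PRICE_HEADER)
    if raw is None:
        return None
    try:
        return float(raw.split()[-1])
    except (ValueError, IndexError):
        return None


def next_action(status: int, headers: Mapping[str, str], budget_usd: float) -> str:
    """Decide what a compliant crawler does with a response."""
    if status == 200:
        return "use-response"
    if status == 402:
        price = parse_price_usd(headers)
        if price is not None and price <= budget_usd:
            return "retry-with-payment"
    # Blocked, too expensive, or no readable price: leave the page alone.
    return "skip"


print(next_action(402, {"crawler-price": "USD 0.01"}, budget_usd=0.05))
print(next_action(402, {"crawler-price": "USD 0.50"}, budget_usd=0.05))
```

Keeping the decision in a pure function like this makes it easy to attach a per-domain or per-job budget later.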
Conclusion
The figure of 57.5% is a symbolic milestone, but behind it lies a real change of era. The internet, which was built over decades for the human reader, is increasingly being restructured for the data-consuming machine, and platforms are responding with barricades: blocks, paid gateways, and cryptographic authentication of bots. The open web is not disappearing; it is stratifying. Free access remains for those who play by the rules or can appear as ordinary users; everything else is moving behind paywalls or into bans. For the data collection industry, this means one thing: the quality and "humanity" of your traffic are becoming not a competitive advantage, but a condition for survival.