OpenAI and Anthropic AI bots wreak havoc and drive up website costs

Edd Coates knew something was wrong. His online database was under attack.

Coates is a game designer and creator of the Game UI Database. It’s a labor of love for which he spent five years cataloging over 56,000 screenshots of video game user interfaces. If you want to know what the health bar looks like in Fallout 3 and compare it to the inventory screen in Breath of the Wild, Coates is waiting for you.

A few weeks ago, he says, the site slowed to a crawl. Pages took three times longer to load, users were getting 502 bad gateway errors, and the home page was being reloaded 200 times per second.

“I assumed it was some kind of small DDoS attack,” Coates told Business Insider.

But when he checked the system logs, he realized that the traffic stream was coming from a single IP address owned by OpenAI.

In the race to build the world’s most advanced artificial intelligence, tech companies have spread across the web, unleashing crawler bots like a plague of digital locusts to scour websites for anything they can use to feed their voracious models.

They’re often after high-quality training data, but also other information that can help AI models understand the world. The race is on to collect as much information as possible before it runs out, or before the rules about what’s acceptable change.

A study estimated that the world’s supply of usable AI training data could be exhausted by 2032. The entire online body of recorded human experience may soon be inadequate to keep ChatGPT up to date.

A resource like the Game UI Database, where a human has already done the painstaking work of cleaning and classifying the images, must have looked like an all-you-can-eat buffet.

Higher cloud bills

For small website owners with limited resources, the costs of playing host to a swarm of hungry bots can be a significant burden.

“In the space of 10 minutes we were transferring about 60 to 70 gigabytes of data,” said Jay Peet, a fellow game designer who manages the servers that host Coates’ database. “Based on Amazon’s on-demand pricing for bandwidth, that would cost $850 per day.”

Coates makes no money from the Game UI Database and actually runs the site at a loss, but he worries that the giant AI companies’ actions could endanger independent creators who rely on their sites to earn a living.

“The fact that OpenAI’s behavior crippled my website to the point where it stopped working is just the icing on the cake,” he said.

An OpenAI spokesperson said the company’s bot was querying Coates’ site about twice a second. The representative also pointed out that OpenAI was crawling the site as part of an effort to understand the structure of the web. It wasn’t there to scrape data.

“It’s easy for web publishers to opt out of our ecosystem and express their preferences about how their sites and content work with our products,” the spokesperson added. “We also built systems to detect and moderate site load to be courteous and considerate web participants.”

Planetary problems

Joshua Gross, founder of digital product studio Planetary, told BI that he ran into a similar problem after redesigning a website for one of his clients. Shortly after launch, traffic increased and the client saw their cloud computing costs double compared with previous months.

“An audit of the traffic logs revealed a significant amount of traffic from scraping bots,” Gross said. “The problem was primarily Anthropic driving an overwhelming amount of crap traffic,” he added, referring to the repeated requests, all of which resulted in 404 errors.

Anthropic spokeswoman Jennifer Martinez said the company strives to ensure its data collection efforts are transparent and not intrusive or disruptive.

Ultimately, Gross said, he was able to stem the flood of traffic by updating the site’s robots.txt file. Robots.txt is a protocol, used since the late 1990s, that tells web crawlers where they can and cannot go. It is widely accepted as one of the unofficial rules of the web.
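As an illustration only, a minimal robots.txt along these lines asks AI crawlers to stay away while leaving ordinary search engines untouched. The user-agent names shown here (GPTBot for OpenAI, ClaudeBot for Anthropic) are the tokens those companies have publicly documented, but site owners should verify the current names in each crawler’s documentation before relying on them:

  User-agent: GPTBot
  Disallow: /

  User-agent: ClaudeBot
  Disallow: /

Because the protocol is purely advisory, a crawler that ignores the file, or that identifies itself under a different name, slips through anyway.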

Blocking AI bots

Robots.txt restrictions targeting AI companies have skyrocketed. A study found that between April 2023 and April 2024, sites accounting for nearly 5% of all online data, and about 25% of the highest-quality data, added robots.txt restrictions aimed at AI crawlers.

The same study found that 25.9% of those restrictions targeted OpenAI, compared with 13.3% for Anthropic and 9.8% for Google. The authors also found that many data owners disallowed crawling in their terms of service but did not apply robots.txt restrictions, which left them exposed to unwanted crawling by bots that check only robots.txt.

OpenAI and Anthropic say their bots respect robots.txt, but BI has previously reported cases in which both companies appeared to circumvent those restrictions.

Polluted key metrics

David Senecal, a principal fraud and abuse product architect at networking giant Akamai, says his firm is tracking AI training bots run by Google, Microsoft, OpenAI, Anthropic, and others. He says the bots are controversial among Akamai’s customers.

“Website owners are generally okay with having their data indexed by web search engines like Googlebot or Bingbot,” Senecal said. “However, some don’t like the idea of their data being used to train a model.”

He says some customers complain about increased cloud costs or stability issues due to increased traffic. Others worry that the bots present intellectual property issues or will “pollute key metrics” such as conversion rates.

When an AI bot swarms your site over and over again, your traffic numbers will likely be out of line with reality. This causes problems for sites that advertise online and have to track how effective this marketing is.

Senecal says robots.txt is still the best way to handle unwanted crawling and scraping, though it’s an imperfect solution. It requires site owners to know the specific names of each bot they want to block and requires bot operators to voluntarily comply. In addition, Senecal says Akamai is tracking various “copycat” bots masquerading as Anthropic or OpenAI web crawlers, which makes the task of analyzing them even more difficult.

In some cases, Senecal says, bots will crawl an entire website every day just to see what has changed, a straightforward approach that results in massive amounts of duplicate data.

“This way of collecting data is very wasteful,” he said, “but until the mindset around data sharing changes and there is a more evolved and mature way of sharing data, scraping will remain the status quo.”

“We are not Google”

Roberto Di Cosmo is the director of Software Heritage, a non-profit database created to “collect, preserve and share all publicly available source code for the benefit of society”.

Di Cosmo says that last summer he saw an unprecedented rise in AI bots scraping the online database, causing the website to stop responding for some users. The nonprofit’s engineers spent hours identifying and blacklisting thousands of IP addresses that were driving the traffic, diverting resources from other important tasks.

“We’re not Google, we have a limited amount of resources to run this operation,” Di Cosmo said.

He is an evangelist for open access and, in theory, is not opposed to AI companies using the database to train models. Software Heritage already has a partnership with Hugging Face, which used the database to help train its AI model StarCoder2.

“Developing machine learning models that embrace these digital commons can democratize software creation, allowing a wider audience to benefit from the digital revolution, a goal that aligns with our values,” said Di Cosmo, “but it must be done in a responsible way.”

Software Heritage has published a set of principles governing how and when it agrees to share its data. All models created using the database must be open-source and not “monopolized for private gain.” And the creators of the underlying code must be able to opt out if they want to.

“Sometimes these people get the data anyway,” Di Cosmo said, referring to bots that scrape hundreds of billions of web pages one by one.

Taken offline

“We’ve been stopped a few times by AI bots,” said Tania Cohen, chief executive of 360Giving, a nonprofit database of grants and charitable giving opportunities.

Cohen says that as a small charity with no in-house technical team, the spikes in traffic were extremely disruptive. What’s even more frustrating, she says, is that much of the information can be easily downloaded in other ways and doesn’t need to be crawled.

But hungry AI bots scrape first and ask questions later.

“Completely sick”

Coates says the Game UI Database is back up and running, and he continues to add to it. There are millions of people out there like Coates, obsessive about a tiny corner of the world, willing to sink thousands of hours into a pursuit that no one else on Earth could make sense of. It’s one of the reasons to love the internet.

And it’s yet another area of society shaken by the ripple effects of the AI revolution. The server costs of a small-fry database operator may not seem worth mentioning. But Coates’ story is emblematic of a larger question: When AI comes to change the world, who bears the cost?

Coates says he maintains the database as a source of reference material for fellow game designers. He worries that generative AI, which depends on the work of human creators, will inevitably replace those same creators.

“To find that my work is not only being stolen by a large organization, but being used to hurt the very people I’m trying to help, makes me feel completely sick,” Coates said.
