The parent of TikTok has released a web scraper that gobbles up the world’s online data 25 times faster than OpenAI

ByteDance seems eager to make up for lost time when it comes to scraping the web for the data needed to train its generative AI models.

The China-based parent company of video app TikTok launched its own web crawler, or scraper, called Bytespider sometime in April, according to research from Kasada, a company that specializes in bot management for online data companies. The existence of the bot was also confirmed by Dark Visitors, which monitors scraper bots.

The ByteDance bot has quickly become one of the most aggressive scrapers on the Internet, if not the most aggressive, research shows. It scrapes data at a rate many multiples that of other major companies such as Google, Meta, Amazon, OpenAI and Anthropic, which use their own scraper bots to help create and improve their large language or multimodal models, known as LLMs or LMMs.

Sam Crowther, CEO of Kasada, said that since Bytespider came out, it has scraped data at about 25 times the rate of GPTBot, which scrapes data for OpenAI’s ChatGPT platform and underlying models. Bytespider scraped 3,000 times more than ClaudeBot, from Anthropic, which operates the Claude platform.

As the months passed, Bytespider became even more aggressive, according to Kasada. The data shows huge increases in scraping activity from Bytespider in each of the past six weeks.

Representatives for TikTok and ByteDance did not respond to emails seeking comment.

ByteDance’s aggressive scraping comes despite the possibility that TikTok will be banned in the US in the coming months. President Joe Biden has signed legislation requiring ByteDance to sell TikTok, on national security grounds, or shut it down.

The Bytespider bot, like those from OpenAI and Anthropic, does not respect robots.txt, research shows. Robots.txt is a plain-text file that publishers can place at the root of a website. While not legally binding in any way, it is meant to signal to scraper bots that they should not retrieve that site’s data.
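To make the mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler could consult robots.txt before fetching a page. The robots.txt content, the `is_allowed` helper and the example.com URL are illustrative assumptions, not taken from Kasada's research or from any real site. The sketch uses the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve at /robots.txt:
# it blocks Bytespider from the whole site and allows every other bot.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    for bot in ("Bytespider", "GPTBot"):
        print(bot, "allowed:", is_allowed(bot, page))
```

The check runs on the crawler's side, so it only has an effect if the bot's operator chooses to run it and obey the result. This is why a scraper can simply ignore the file.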

Web scraping dates back decades and was long used mainly by search engines to gather links to web pages. But the rise of generative AI tools has added a new dimension and made the practice a primary source of lawsuits and controversy. People and organizations whose work has been scraped claim that their copyrights are being infringed in the process. All the models behind generative AI tools have been trained on massive amounts of online data, effectively everything available on the web, especially written information. Tech companies use scraper bots to copy everything for free and put it in their datasets.

“It’s like it’s desperately trying to catch up,” Crowther said of Bytespider’s aggressive scraping. Just last year, ByteDance was apparently so far behind in the generative AI race that it used OpenAI to help build ByteDance’s own LLM, which is against OpenAI’s terms and conditions. Earlier this year, ByteDance released a chat-based LLM called Doubao, but work on that model was reportedly completed before the accumulation of more recent training data scraped by Bytespider.

It is “clear” that ByteDance is working on a new LLM, according to a person familiar with the company. As for what ByteDance plans to do with a new LLM, a person familiar with the company’s ambitions said one goal has to do with the search function for TikTok.

Last week, TikTok released an update to its current keyword-focused search feature for ads, essentially allowing advertisers to search in real-time for words that are trending on TikTok. It allows marketers to create an ad with relevant keywords that would apparently help the ad appear on the screens of more users.

A new artificial intelligence model with data about the latest Internet trends and topics could further expand and improve TikTok’s search environment, according to the person familiar with the company’s ambitions.

“Given the audience and the amount of usage, TikTok with a search environment that’s a completely biddable space with keywords and topics, that would be very interesting for a lot of people who are spending a lot of money with Google right now,” the person said.

Are you a TikTok or ByteDance employee or someone with insight or advice to share? Contact Kali Hays securely via Signal at +1-949-280-0267 or at [email protected].
