The AI start-up Anthropic is accused of aggressively collecting data from websites to train its systems and, according to those affected, may have violated the publishers' terms of use.
AI developers depend on ingesting massive amounts of data from a variety of sources to build large language models. This technology is behind chatbots such as OpenAI's ChatGPT and its competitor Claude from Anthropic.
Anthropic was founded by a group of former OpenAI researchers with the promise of developing “responsible” AI systems.
However, Matt Barrie, CEO of Freelancer.com, accused the San Francisco-based company of being “by far the most aggressive scraper” of his freelance portal, which receives hundreds of thousands of visits a day.
Other web publishers share Barrie's concerns that Anthropic is flooding their sites and ignoring their instructions to stop collecting content to train its models.
Freelancer.com received 3.5 million visits in four hours from a web “crawler” linked to Anthropic, according to data obtained by the Financial Times. That makes Anthropic's traffic “probably about five times the amount of the second-largest” AI crawler, Barrie said.
Visits from the bot continued to increase even after Freelancer.com tried to refuse its access requests using standard web protocols for managing crawlers, he added. After that, Barrie decided to block traffic from Anthropic's internet addresses entirely.
“We had to block them because they don't follow the principles of the web,” Barrie said. “This is egregious scraping that slows down the site for everybody who uses it and ultimately impacts our revenue.”
Anthropic said it was investigating the case and respected publishers' wishes, saying it did not want to be “intrusive or disruptive.”
Scraping publicly available data from across the web is generally legal, but the practice is controversial, can violate websites' terms of service, and can be costly for site hosts.
Kyle Wiens, CEO of iFixit.com, said his electronics repair website received one million hits from Anthropic bots in 24 hours. “We have a lot of alerts (for high traffic), people are waking up at 3 a.m. That set off all our alarms,” he said.
iFixit's terms of service prohibit the use of its data for machine learning, Wiens said. “My first message to Anthropic is: If you use this to train your model, that's illegal. My second is: That's not polite behavior on the web. Crawling is a matter of etiquette.”
Websites use a protocol called robots.txt to keep crawlers and other web robots away from parts of their sites, but compliance with the protocol is voluntary.
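To illustrate why the protocol depends on goodwill, here is a minimal sketch of the check a well-behaved crawler might run before fetching a page, using the `urllib.robotparser` module from Python's standard library. The crawler names `ExampleAIBot` and `OtherBot`, the `example.com` URLs, and the rules themselves are hypothetical and are not taken from any site mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is asked to stay away
# entirely; every other crawler is asked to skip a single directory.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for path in ("/guides/phone", "/private/admin"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(agent, url))
```

Nothing in the file enforces these rules. They take effect only if the crawler chooses to run a check like this and honor the answer, which is why sites that see the rules ignored fall back on blocking IP addresses.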
“We respect robots.txt and our crawler respected that signal when iFixit implemented it,” Anthropic said. The company also said its crawlers respected “anti-evasion technologies” such as CAPTCHAs and that “our crawling should not be intrusive or disruptive. We strive for minimal disruption by carefully considering how quickly we crawl the same domains.”
Data scraping is not a new practice, but it has increased dramatically in the last two years because of the AI arms race, creating new costs for websites.
“AI crawlers have cost us a lot of money in bandwidth fees and a lot of time dealing with abuse,” wrote Eric Holscher, co-founder of documentation hosting website Read the Docs, in a blog post on Thursday. “AI crawlers are not being respectful to the websites they crawl, and that may result in a backlash against AI crawlers in general,” he added.
Anthropic has developed some of the world's most advanced chatbots, rivaling OpenAI's ChatGPT, that can respond to a range of natural language prompts, while positioning itself as a more ethical actor than some competitors. Anthropic's stated goal is “the responsible development and maintenance of advanced AI for the long-term benefit of humanity.”
As leading AI companies compete to develop ever more powerful and capable models, they are reaching deeper into untapped corners of the web, partnering with publishers or creating synthetic training data.
OpenAI has signed a number of deals with publishers and content providers in recent months, including Reddit, The Atlantic and the Financial Times. Anthropic has not publicly announced similar partnerships.
“Search engines have always done a lot of scraping,” Barrie said, “but with the training of generative AI, that has gone to a whole new level.”
iFixit's mission “is to share information,” Wiens said, to encourage people to do their own repairs. “We don't mind them using our content to train models, we just want to be a part of the conversation.”
He added: “I'm not an advocate on this issue, I'm just trying to keep a website online.”