Perplexity AI has found itself at the center of a firestorm over its data collection practices.
Perplexity essentially combines a search engine with generative AI and returns AI-generated content related to the user's search query.
The processes required to do that are likely to involve illegally scraping content from many websites, including those that explicitly prohibit it.
The scandal broke on June 11, when Forbes reported that Perplexity had copied an entire article, including its illustrations, from the Forbes site and repurposed it with only minimal attribution.
Not long after, WIRED conducted an investigation that uncovered evidence that Perplexity reads content from websites that prohibit automated data collection.
A website can ask web crawlers not to crawl its content through a file called robots.txt.
This exclusion protocol communicates with web crawlers and other automated bots. It is a simple text file placed on a website's server that specifies which pages or sections of the site must not be accessed or scraped.
The robots.txt file has been a widely accepted convention since the early days of the internet. It helps website owners control their content and prevent unauthorized data collection.
Although it is not legally binding, following the instructions in a website's robots.txt file has long been considered best practice for web crawlers.
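To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard-library urllib.robotparser. The robots.txt contents, the bot names (ExampleBot, OtherBot), and the example.com URLs are all made up for illustration; they are not taken from any real site or from Perplexity's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is barred from the whole site,
# while every other bot is only barred from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks before every request and skips URLs that are disallowed.
for agent, url in [
    ("ExampleBot", "https://example.com/articles/story.html"),
    ("OtherBot", "https://example.com/articles/story.html"),
    ("OtherBot", "https://example.com/private/report.html"),
]:
    verdict = "allowed" if parser.can_fetch(agent, url) else "disallowed"
    print(f"{agent} -> {url}: {verdict}")
```

Note that the check runs entirely on the crawler's side. Nothing in the file stops a bot that never performs it, which is why compliance rests on convention rather than enforcement.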
Jason Kint, CEO of Digital Content Next, a trade group representing online publishers, was blunt in his assessment of Perplexity's web scraping practices.
“AI companies should default to the assumption that they have no right to take and reuse content from publishers without permission,” he said.
“If Perplexity circumvents the Terms of Service or robots.txt, alarm bells should ring that something improper is going on.”
Amazon investigates
These revelations prompted Amazon Web Services (AWS), which hosts a server involved in Perplexity's alleged illicit scraping, to launch an investigation.
AWS strictly prohibits its customers from engaging in abusive or illegal activities that violate its terms of service.
Perplexity CEO Aravind Srinivas initially dismissed the concerns, claiming they reflected “a deep and fundamental misunderstanding” of the company's business and the web as a whole.
But in a later interview with Fast Company, he acknowledged that Perplexity relied on an unnamed third party to crawl and index the web, and suggested that the third party was responsible for any violations of the robots.txt policy.
Srinivas declined to reveal the company's name, citing a confidentiality agreement.
For now, Perplexity seems determined to weather the storm. A spokesperson downplayed the AWS investigation as “standard procedure” and suggested the company has made no changes to its operations.
However, the startup's defiant stance may prove untenable as concerns about AI's data practices grow.