LAION, the German research organization that created the data used to train Stable Diffusion and other generative AI models, has released a new dataset that it says is “thoroughly cleansed of known links to suspected child sexual abuse material (CSAM).”
The new dataset, Re-LAION-5B, is effectively a re-release of an older dataset, LAION-5B, but with “corrections” based on recommendations from the nonprofit Internet Watch Foundation, Human Rights Watch, the Canadian Center for Child Protection, and the now-defunct Stanford Internet Observatory. It is available for download in two versions, Re-LAION-5B Research and Re-LAION-5B Research-Safe (which also removes additional NSFW content), both of which have been filtered for thousands of links to known, and “likely,” CSAM, LAION says.
“LAION has been committed from the outset to removing illegal content from its datasets and has taken appropriate measures to achieve this,” LAION wrote in a blog post. “LAION strictly adheres to the principle of removing illegal content as quickly as possible after it becomes known.”
It is important to note that LAION’s datasets don’t contain, and have never contained, images. Rather, they are indexes of links to images and image alt text curated by LAION, all of which come from a separate dataset, the Common Crawl, of scraped websites and webpages.
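To make that concrete, each index entry is essentially a link paired with the alt text and basic metadata scraped alongside it. The field names below are only illustrative of the general shape of LAION-style metadata, not an exact schema:

```python
# Hypothetical illustration of a single LAION-style index entry: a link to a
# remotely hosted image plus its alt text and dimensions -- no image bytes
# are stored in the dataset itself.
entry = {
    "URL": "https://example.com/images/cat.jpg",     # link to the hosted image
    "TEXT": "a tabby cat sleeping on a windowsill",  # alt text scraped with it
    "WIDTH": 1024,                                   # reported image width
    "HEIGHT": 768,                                   # reported image height
}
```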
The release of Re-LAION-5B came after a December 2023 investigation by the Stanford Internet Observatory found that LAION-5B, more specifically a subset called LAION-5B 400M, contained at least 1,679 links to illegal images scraped from social media posts and popular adult websites. According to the report, 400M also contained links to “a wide range of inappropriate content, including pornographic images, racial slurs, and harmful social stereotypes.”
While the report’s Stanford co-authors noted that it would be difficult to remove the offending content and that the presence of CSAM doesn’t necessarily affect the output of models trained on the dataset, LAION said it would temporarily take LAION-5B offline.
The Stanford report recommended that models trained on LAION-5B should be “deprecated and their distribution stopped where possible.” Perhaps relatedly, the AI startup Runway recently removed its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face; we’ve reached out to the company for more information. (Runway partnered with Stability AI, the company behind Stable Diffusion, in 2023 to help train the original Stable Diffusion model.)
As for the new Re-LAION-5B dataset, which contains around 5.5 billion text-image pairs and has been released under an Apache 2.0 license, LAION says the metadata can be used by third parties to clean up existing copies of LAION-5B by removing the corresponding illegal content.
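LAION doesn’t prescribe a specific tool for this, but the principle is simple: hash every link in a local copy of the dataset and drop the rows whose hashes appear on the removal list. Below is a minimal sketch of that idea, assuming (purely for illustration) a parquet shard with a “url” column, a removal list of hex-encoded SHA-256 URL hashes, and the file names shown; none of these specifics come from LAION’s announcement:

```python
import hashlib
import pandas as pd

def load_removal_list(path: str) -> set[str]:
    # One hex-encoded hash per line; a set gives O(1) membership checks.
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def clean_shard(shard_path: str, removal_list: set[str], out_path: str) -> None:
    df = pd.read_parquet(shard_path)
    # Hash each link, then keep only rows whose hash is not on the removal list.
    link_hashes = df["url"].map(
        lambda u: hashlib.sha256(u.encode("utf-8")).hexdigest()
    )
    df[~link_hashes.isin(removal_list)].to_parquet(out_path, index=False)

if __name__ == "__main__":
    removal = load_removal_list("removed_link_hashes.txt")       # assumed file
    clean_shard("laion5b_part_00000.parquet", removal,
                "relaion_part_00000.parquet")
```

Because the removal list lives in a set, each lookup is constant-time, so the same pass scales to thousands of flagged links, and the shard is rewritten without the matching rows.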
LAION stresses that its datasets are intended for research purposes, not commercial ones. But if history is any indication, that won’t deter some organizations. In addition to Stability AI, Google once used LAION datasets to train its image-generating models.
“In total, 2,236 links (to suspected CSAM) were removed after being matched against the lists of link and image hashes provided by our partners,” LAION continued in the post. “These links also include 1,008 links found in the Stanford Internet Observatory report in December 2023… We urge all research labs and organizations still using old LAION-5B datasets to transition to Re-LAION-5B datasets as soon as possible.”