This piece is published with support from The Capitol Forum.
The LAION-5B machine learning dataset used by Stable Diffusion and other major AI products has been removed by the organization that created it after a Stanford study found that it contained 3,226 suspected instances of child sexual abuse material, 1,008 of which were externally validated.
LAION told 404 Media on Tuesday that out of “an abundance of caution,” it was temporarily taking down its datasets, including LAION-5B and another called LAION-400M, “to ensure they are safe before republishing them.”
According to a new study by the Stanford Internet Observatory, shared with 404 Media ahead of publication, the researchers found the suspected instances of CSAM through a combination of perceptual and cryptographic hash-based detection and analysis of the images themselves.
“We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images—not including all of the intimate imagery published and gathered non‐consensually, the legality of which is more variable by jurisdiction,” the paper says. “While the amount of CSAM present does not necessarily indicate that the presence of CSAM drastically influences the output of the model above and beyond the model’s ability to combine the concepts of sexual activity and children, it likely does still exert influence. The presence of repeated identical instances of CSAM is also problematic, particularly due to its reinforcement of images of specific victims.”
The finding highlights the danger of largely indiscriminate scraping of the internet for the purposes of generative artificial intelligence.
Large-scale Artificial Intelligence Open Network, or LAION, is a non-profit organization that creates open-source tools for machine learning. LAION-5B is one of its biggest and most popular products. It is made up of more than five billion links to images scraped from the open web, including user-generated social media platforms, and is used to train the most popular AI image generation models currently on the market. Stable Diffusion, for example, uses LAION-5B, and Stability AI funded its development.
“If you have downloaded that full dataset for whatever purpose, for training a model for research purposes, then yes, you absolutely have CSAM, unless you took some extraordinary measures to stop it,” David Thiel, lead author of the study and Chief Technologist at the Stanford Internet Observatory, told 404 Media.
Public chats from LAION leadership in the organization’s official Discord server show that they were aware of the possibility of CSAM being scraped into their datasets as far back as 2021.