
AI Dataset for Detecting Nudity Contained Child Sexual Abuse Images

The Canadian Centre for Child Protection found more than 120 images of identified or known victims of CSAM in the dataset.
Photo by Stefan Heinemann / Unsplash

A large image dataset used to develop AI tools for detecting nudity contains a number of images of child sexual abuse material (CSAM), according to the Canadian Centre for Child Protection (C3P). 

The NudeNet dataset, which contains more than 700,000 images scraped from the internet, was used to train an AI image classifier that could automatically detect nudity in an image. C3P found that more than 250 academic works have either cited or used the NudeNet dataset since it was made available for download on Academic Torrents, a platform for sharing research data, in June 2019.

“A non-exhaustive review of 50 of these academic projects found 13 made use of the NudeNet data set, and 29 relied on the NudeNet classifier or model,” C3P said in its announcement.

C3P found more than 120 images of identified or known victims of CSAM in the dataset, including nearly 70 images focused on the genital or anal area of children who are confirmed or appear to be pre-pubescent. “In some cases, images depicting sexual or abusive acts involving children and teenagers such as fellatio or penile-vaginal penetration,” C3P said. 

People and organizations that downloaded the dataset would have had no way of knowing it contained CSAM unless they went looking for it, and most likely they did not, but having those images on their machines is technically a crime.

“CSAM is illegal and hosting and distributing creates huge liabilities for the creators and researchers. There is also a larger ethical issue here in that the victims in these images have almost certainly not consented to have these images distributed and used in training,” Hany Farid, a professor at UC Berkeley and one of the world’s leading experts on digitally manipulated images, told me in an email. Farid also developed PhotoDNA, a widely used image-identification and content filtering tool. “Even if the ends are noble, they don’t justify the means in this case.”

“Many of the AI models used to support features in applications and research initiatives have been trained on data that has been collected indiscriminately or in ethically questionable ways. This lack of due diligence has led to the appearance of known child sexual abuse and exploitation material in these types of datasets, something that is largely preventable,” Lloyd Richardson, C3P's director of technology, said.

Academic Torrents removed the dataset after C3P issued a removal notice to its administrators. 

"In operating Canada's national tipline for reporting the sexual exploitation of children we receive information or tips from members of the public on a daily basis," Richardson told me in an email. "In the case of the NudeNet image dataset, an individual flagged concerns about the possibility of the dataset containing CSAM, which prompted us to look into it more closely."

C3P’s findings are similar to 2023 research from Stanford University’s Cyber Policy Center, which found that LAION-5B, one of the largest datasets powering AI-generated images, also contained CSAM. The organization that manages LAION-5B removed it from the internet following that report and only shared it again once it had removed the offending images. 

"These image datasets, which have typically not been vetted, are promoted and distributed online for hundreds of researchers, companies, and hobbyists to use, sometimes for commercial pursuits," Richardson told me. "By this point, few are considering the possible harm or exploitation that may underpin their products. We also can’t forget that many of these images are themselves evidence of child sexual abuse crimes. In the rush for innovation, we’re seeing a great deal of collateral damage, but many are simply not acknowledging it — ultimately, I think we have an obligation to develop AI technology in responsible and ethical ways."

Update: This story has been updated with comment from Lloyd Richardson.
