Google Books Indexes AI Trash

Google said it will continue to evaluate its approach “as the world of book publishing evolves.”
Photo by Kimberly Farmer / Unsplash

Google Books is indexing low-quality, AI-generated books that will turn up in search results and could affect Google Ngram viewer, an important tool researchers use to track language use throughout history.

I was able to find the AI-generated books with the same method we’ve previously used to find AI-generated Amazon product reviews, papers published in academic journals, and online articles. Searching Google Books for the term “As of my last knowledge update,” which is associated with ChatGPT-generated answers, returns dozens of books that include that phrase. Some of the books are about ChatGPT, machine learning, AI, and other related subjects and include the phrase because they are discussing ChatGPT and its outputs. These books appear to be written by humans. However, most of the books in the first eight pages of results turned up by the search appear to be AI-generated and are not about AI.
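The search method described above can be approximated in a few lines of code. This is a minimal sketch, not the tooling used for this reporting: the function name `find_telltale_phrases` and the phrase list are illustrative assumptions, with both phrases taken from passages quoted in this article.

```python
import re

# Phrases quoted in this article; the list itself is an illustrative assumption.
TELLTALE_PHRASES = [
    "as of my last knowledge update",
    "as of my last update",
]


def find_telltale_phrases(text, phrases=TELLTALE_PHRASES):
    """Return the phrases that occur in text, ignoring case and line breaks."""
    # Collapse runs of whitespace so a phrase split across lines still matches.
    normalized = re.sub(r"\s+", " ", text).lower()
    return [phrase for phrase in phrases if phrase in normalized]


sample = "As of my last knowledge\nupdate in January 2022, Facebook had evolved."
print(find_telltale_phrases(sample))
```

A match is only a signal, not proof. As noted above, books written by humans about ChatGPT also contain these phrases, so each hit still needs a human read.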

For example, the 2024 book Bears, Bulls, and Wolves: Stock Trading for the Twenty-Year-Old by Tristin McIver bills itself as “a transformative journey into the world of stock trading” and “a comprehensive guide designed for beginners eager to unlock the mysteries of financial markets.” In reality, it reads like ChatGPT-generated text, with shallow, Wikipedia-level analysis of complex financial events like Facebook’s initial public offering or the 2008 financial crisis summed up in a few short paragraphs.

"Despite the initial hiccups, Facebook’s stock eventually found its footing in the market. Over the years following the IPO, the company’s share price experienced fluctuations but also demonstrated resilience, reflecting the dynamic nature of the tech industry. As of my last knowledge update in January 2022, Facebook had evolved into Meta Platforms, Inc., reflecting its expansion beyond social media into virtual reality and the metaverse."

Other books appear to be outdated to the point of being useless at the time they are published because they are generated with a version of ChatGPT with an old “knowledge update.” For example, Maximize Your Twitter Presence: 101 Strategies for Marketing Success by Shu Chen Hou was published in March of 2024, according to a listing for the same book on Amazon. As is the case with many AI-generated books, the same author has published dozens and dozens of books, in this case mostly children’s books with AI-generated art. 

At the end of a multi-page section in Maximize Your Twitter Presence about how to become verified on Twitter (now X), the book says “As of my last update in September 2021, Twitter was in the process of evaluating and updating its verification criteria and process, so the steps and requirements may have changed since then.” Elon Musk, of course, acquired Twitter in 2022 and famously upended the verification process entirely. Verification can now simply be purchased.

“I cannot believe that they [Google] don't know what they're putting into Google Books search,” Gary Price, a librarian, consultant, and editor of the Library Journal's infoDOCKET, told me in a call. “They're just ingesting all this material, however it gets to them, but I have to believe that they know that this stuff is AI generated. And they would be doing themselves and users a big favor by labeling it as such.”

These AI-generated books are very similar to the type of AI-generated books we’ve found on Amazon, and in fact, many of the books appear on both Amazon and Google Play Books. But one unintended outcome of Google Books indexing AI-generated text is its possible future inclusion in Google Ngram viewer. 

Google Ngram viewer is a search tool that charts the frequencies of words or phrases over the years in published books scanned by Google dating back to 1500 and up to 2019, the most recent update to the Google Books corpora. Google said that none of the AI-generated books I flagged are currently informing Ngram viewer results.
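To make concrete what "charting the frequency of a phrase over the years" means, here is a toy sketch. It is not Google's implementation: the function names, the whitespace tokenization, and the tiny corpus are all assumptions for illustration. For each year, it divides the number of times a phrase appears by the total number of same-length word sequences in that year's texts.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_frequency_by_year(corpus, phrase):
    """corpus maps a year to a list of texts; returns year -> share of n-grams."""
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in sorted(corpus.items()):
        hits = total = 0
        for text in texts:
            counts = ngram_counts(text.lower().split(), n)
            hits += counts[target]
            total += sum(counts.values())
        # Avoid dividing by zero for years with no text long enough.
        result[year] = hits / total if total else 0.0
    return result


# Hypothetical two-year corpus, purely for illustration.
toy_corpus = {
    2019: ["the market found its footing"],
    2024: ["as of my last knowledge update the market found its footing"],
}
print(relative_frequency_by_year(toy_corpus, "knowledge update"))
```

In this toy corpus the phrase has a frequency of zero in 2019 and a nonzero frequency in 2024. That is the kind of shift researchers would see if AI-generated books entered the real corpus.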

“Our automated systems aim to surface relevant, high quality books for a given search, and the books in question appeared for an uncommon, specific search query,” a Google spokesperson said. “None of the identified books have factored into Ngram viewer results.”

When I asked specifically if these books will be filtered out of Google Ngram viewer corpora in the future, the Google spokesperson said “We’re committed to ensuring that the Ngram viewer remains a high quality resource, and will continue to evaluate our approach as the world of book publishing evolves.”

Ngram viewer is far from perfect, but it’s often cited in academic papers because it allows researchers to track cultural change as it is reflected in books. If AI-generated books start informing Ngram viewer results in the future, the meaning of these results will change entirely. Either they will be unreliable for teaching us about human-made culture, or they will say something perhaps more bleak: that human-made culture is being replaced by AI-generated content. In some ways, this is already happening. Whether you’re on Facebook, porn sites, or looking for books on Amazon, much of what we’re seeing online is AI-generated and not always disclosed as such.

“This seems like there will be a type of runaway feedback loop, right?” Alex Hanna, director of research at the Distributed AI Research Institute (DAIR), told me in an email. “Ngram viewer is already a pretty noisy signal for things which computational social scientists and linguists may care about, but it'll probably be completely unusable in a few years.”

Google also didn’t say whether it has, or is formulating, a policy to filter out AI-generated books from Google Books, and it did not remove any of the AI-generated books I’ve flagged to the company. “We continually work to adapt our systems and policies to ensure users find helpful and relevant books within the Google Books corpus,” the Google spokesperson said.

“It strikes me as another instance of AI-generated text becoming an ouroboros, where AI-generated content will be ingested into Google Books, then Google using the content to train new models,” Hanna said. “I'm sure they will say they have a ‘quality filter’ but I'm sure the details of such won't be described anywhere publicly.”