A WordPress ‘Firehose’ Allows AI Companies to Buy Access to a Million Posts a Day

There is a complex chain of companies buying access to WordPress and Tumblr posts through a company called SocialGist.

Update: After this article was published, Automattic told 404 Media that it is "deprecating" the Firehose: "SocialGist is rolling off as a firehose customer this month and the remaining customers are winding down in the coming months, both things that were already in motion for different reasons," an Automattic spokesperson said. "We’re in the process of updating our developer page to indicate that we have been deprecating the old firehose for several months." The spokesperson did not answer the original questions we posed to them about the data supply chain for the Firehose.

In September 2023, WordPress.com quietly changed the language of a developer page explaining how to access a “Firehose” of roughly a million daily WordPress posts to add that the feeds are “intended for partners like search engines, artificial intelligence (AI) products and market intelligence providers who would like to ingest a real-time stream of new content from a wide spectrum of publishers.” Before then, this page did not note the AI use case. 

This is notable because of the fervor and confusion that have arisen this week after we broke the news that Automattic, which owns WordPress.com and Tumblr, was preparing to send user data to OpenAI and Midjourney. Since then, there has been much discussion about which WordPress blogs would be included, which would not, whether data was already sent, and whether people who opt out would have their data redacted retroactively.

We still do not know the answers to all of these questions, because Automattic has repeatedly ignored our detailed questions, will not get on the phone with us, and has instead chosen to frame a new opt-out feature as “protecting user choice.” We are at the point where individual Automattic employees are posting clarifications on their personal Mastodon accounts about what data is and is not included. 

The truth is that Automattic has been selling access to this “firehose” of posts for years, for a variety of purposes. This includes selling access to self-hosted blogs and websites that use a popular plugin called Jetpack; Automattic edited its original “protecting user choice” statement this week to say it will exclude Jetpack from its deals with “select AI companies.” These posts have been directly available via a data partner called SocialGist, which markets its services to “social listening” companies, marketing insights firms, and, increasingly, AI companies. Tumblr has its own Firehose, and Tumblr posts are available via SocialGist as well. 

Almost every platform has some sort of post “firehose,” API, or way of accessing huge amounts of user posts. Famously, Twitter and Reddit used to give these away for free. Now they do not, and charging for access to these posts has become big business for those companies. This is just to say that the existence of Automattic’s firehose is not anomalous in an internet ecosystem that trades on data. But this firehose also means that the average user doesn’t and can’t know what companies are getting direct access to their posts, and what they’re being used for.

Screenshot of SocialGist's data sources page

Data sharing like this is made possible by clauses buried in terms of service agreements that people don’t read, and can sometimes be opted out of in settings pages that most people don’t look at. These clauses then enable the sale of user data to business-to-business companies that specialize in hoovering up and analyzing data from every surface you could possibly think of. And as in the case of the Automattic news this week, it doesn't matter anymore if you do read the fine print of privacy policies; platforms can change those terms anytime, make you opt out rather than opt in, and call it “protecting user choice.”

While it’s not uncommon for internet platforms to sell user data like this, Automattic’s new deals with OpenAI and Midjourney have rightfully struck a specific nerve. OpenAI and Midjourney will use this data to improve their generative AI tools, which are built off of the labor and art of humans and attempt to replicate that labor and art.

This firehose appears to be distinct from any direct data sharing deal with Midjourney and OpenAI, in part because the documentation makes clear that data being sold via this firehose is not limited only to posts on WordPress.com, but also can include posts on self-hosted WordPress.org websites that use Jetpack, a wildly popular plugin that millions of sites use and that users are encouraged to install when setting up a WordPress site. The documentation says that Jetpack sites are subject to a “separate” feed of posts, and SocialGist advertises posts not just from WordPress.com but also from “popular WP-powered sites across the web.” Automattic has stressed, meanwhile, that Jetpack sites are not a part of its deals with “select AI companies.”

Automattic and its partners’ documentation makes clear that WordPress blogs and Tumblr posts are valuable at scale, that they are being sold at scale, and that they are being analyzed at scale. This means that third-party companies that are willing to pay for huge numbers of posts do not need to “scrape” them from individual, public websites or Tumblr pages. In its post earlier this week, Automattic said "we currently block, by default, major AI platform crawlers—including ones from the biggest tech companies—and update our lists as new ones launch." But it is simultaneously advertising direct access to posts for sale via the firehose, including for AI companies. Companies that want this data, then, can buy access to those posts, and get them delivered in an easy-to-ingest way.

The publicly available documentation shows a data supply chain where huge numbers of posts are passed in near real time to SocialGist, which makes them available to its customers, who do things like threat intelligence, market research, large-scale analytics, product development, and market analysis. A sample of WordPress data SocialGist advertises on the Amazon Web Services store is a blog from a site called Fresh24news.com about LeBron James’s Kobe Bryant tattoo.

Do you work at Automattic, SocialGist, or DataStreamer or know anything else about the WordPress Firehose? I would love to hear from you. Using a non-work device, you can message me securely on Signal at +1 202 505 1702. Otherwise, send me an email at jason@404media.co.

Last year, however, SocialGist began to advertise itself as creating “clean and compliant data” for AI and LLM training, and on its website says that “digital data is the treasure trove for training the next wave of AI and machine learning models” while linking to its own data sources, which include Tumblr, WordPress.com, and, notably, “popular WP-powered sites across the web.” This suggests that it is also selling access to posts from Jetpack sites. SocialGist further indicated its interest in providing data for AI purposes with a December partnership with a company called DataStreamer, which advertises companies doing AI training as one of its key client bases.

DataStreamer does not explicitly say that it will make WordPress and Tumblr posts available, but it is making blog posts and news posts from SocialGist available (SocialGist collects data from other news and blog platforms, too). We do not know if WordPress and Tumblr posts are available through DataStreamer, because these companies will not answer simple questions from us, and the data itself is very expensive, so we cannot buy it to analyze it ourselves. We do not know if there are different levels of restrictions on Jetpack posts, because none of the companies want to talk about it. Neither SocialGist nor DataStreamer responded to a request for comment. 

SocialGist calls itself the “world’s largest index of human to human conversational content,” advertises access to WordPress and Tumblr including “posts, comments and likes from the world's most popular blogging platform, including Wordpress.com and popular WP-powered sites around the web,” and on a page outlining its data sources, explains that “Tumblr is home to over 500 million microblogs generating 1 million posts per day that can provide you with deep insights into the mind and market of Millennials & Gen Z in 2020 and beyond.”

WordPress and Tumblr users have no way of knowing who SocialGist is selling data to, and for what specific purposes. Automattic’s documentation has a list of “prohibited uses,” which includes using the data for crime or passing it to the government for surveillance or military purposes, but does not include “training AI.”

I asked Automattic the following questions. It has not answered them, nor has it responded at all:

“- What are the terms of use for data sold via SocialGist? According to the terms, can the data be used to train LLMs? Are there different terms for JetPack data?

- To what extent does Automattic control what happens when SocialGist bundles data and sends it further down the supply chain? Is Automattic across what happens to that data when it gets to DataStreamer, for example? Is there any enforcement mechanism if the data gets misused?

- The Firehose documentation page mentions there is a separate firehose for JetPack. Are there additional privacy features / terms on the use of that data?

- What was the reason for the change in your terms in September to add AI training on the firehose documentation page?”

The large-scale selling of user posts to third parties is not unique to Automattic, and it is not really possible to track what those posts are being used for. But the practice reveals a complicated ecosystem of data sales to third parties for a variety of reasons, one that highlights the business models of so many large platforms. We know from covering data brokers in other industries for years that it is often difficult for the original company sharing the data to track specifically how it is used further down the supply chain. It can also be difficult to enforce against misuse; cutting off future data access, for example, doesn’t always mean that a bad actor is going to delete the data they’ve already gotten.

“There are so many links in the chain here for companies and individuals to launder their responsibility,” Jim Winstead, the founder of a blog-tracking site called blo.gs, told 404 Media. Winstead sold the site to Yahoo, which was in turn acquired by Verizon; the site is now owned by Automattic. “The project from WordPress.org isn't selling data, it just encourages you to install the Jetpack plugin, which has a feature called ‘enhanced distribution’ (enabled by default), which feeds data into the WordPress.com firehose, which has a terms of use that you can’t use the data for ‘a biased, misleading, or dishonest manner, for example, to promote or publicize a biased political point of view.’ How is that being enforced?”

The site he founded (which he left long ago) now states the following: “Want more data? Get access to a real-time, high-quality stream of WordPress site content with the WordPress.com Firehose.”

There is very little way for you, the WordPress user or Tumblr poster, to know if any one of your specific posts has been shared in this way, who it was shared with, how they used it, or what they used it for. Automattic itself could clear this up to some extent, but it also likely has little way of knowing what the data is ultimately used for, especially if it is mixed with other data sources down the line.

Given this complicated and ever-changing supply chain, even if you read the terms of service agreements back when you started posting to Tumblr and WordPress more than a decade ago, there was no way for you to know that your content would eventually be used by companies building AI tools that are actively working to replace the same type of human labor that created that content.
