Over the US holiday period, several posts were shared about an alleged leak of Google ranking-related data. The initial posts about the leak focused on “confirming” Rand Fishkin's long-held beliefs, with little attention paid to the context of the information or what it actually means.
Context Matters: The Document AI Warehouse
The leaked documents relate to a public Google Cloud platform called Document AI Warehouse, which is used to analyze, organize, search, and store data. Its public documentation is titled An Overview of Document AI Warehouse. A Facebook post states that the “leaked” data is an “internal version” of the publicly available Document AI Warehouse documentation. This is the context of this data:
Screenshot: Document AI Warehouse
@DavidGQuaid tweeted:
“I think it's pretty clear, as the name suggests, that this is an external-facing API for building document warehouses.”
This appears to pour cold water on the idea that the “leaked” data represents inside information about Google Search.
From what we know so far, the “leaked data” is similar to that found on the public Document AI Warehouse page.
Internal search data leak?
SparkToro's original post does not state that the data originated from Google Search. It says that the person who sent the data to Rand Fishkin is the one who made that claim.
One of the things I admire about Rand Fishkin is how precise he is in his writing, especially when it comes to caveats. Rand correctly points out that it is the person providing the data who claims that it originated from Google Search. There is no evidence, just allegations.
He writes:
“I received an email from someone claiming to have access to a large amount of leaked API documentation from Google's search division.”
Fishkin himself did not claim that the former Googlers confirmed that the data came from Google Search; he wrote that the person who emailed the data made that claim.
“The email further claimed that the leaked documents had been verified as authentic by former Google employees, who shared additional personal information about Google's search operations.”
Fishkin writes about a subsequent video conference call in which the leaker revealed that their contact with the former Googlers occurred when they met at a search industry event. Again, we must take the leaker's word for it regarding the former Googlers, including the claim that their statements came after a careful review of the data and were not offhand remarks.
Fishkin wrote that he contacted three former Googlers about the matter. Notably, the ex-Googlers did not explicitly confirm that the data was internal to Google Search. They confirmed only that the data resembles internal Google information, not that it originated from Google Search.
Fishkin writes that he heard the following from the former Googlers:
- “I didn't have access to this code when I worked there, but this certainly looks authentic.”
- “It has all the hallmarks of an internal Google API.”
- “It's a Java-based API. Someone spent a lot of time making sure it adhered to Google's own internal standards for documentation and naming.”
- “We need more time to know for sure, but this is consistent with the internal documents that I have.”
- “From what I saw in my brief review, there was nothing to indicate that this wasn't real.”
Saying something originated from Google Search is one thing, but saying it originated from Google is another thing entirely.
Keep an open mind
It's important to keep an open mind, as there's a lot that's unknown about the data. For example, we don't know whether this is documentation from an internal Search team, so it's probably not a good idea to derive actionable SEO advice from it.
Additionally, we don't recommend analyzing the data specifically to confirm long-held beliefs, as this leads to confirmation bias.
Confirmation bias definition:
“Confirmation bias is the tendency to search for, interpret, prefer, and recall information in a way that confirms or supports one's prior beliefs and values.”
Confirmation bias can cause us to deny things that are empirically true. For example, the theory that Google automatically keeps new sites from ranking, known as the sandbox, has been around for decades. Yet every day, people report that their new site or new page ranks in the top 10 of Google Search results almost instantly.
But if you are a fervent believer in the sandbox, you will ignore such directly observable experiences, no matter how many people report them.
Brenda Malone (LinkedIn profile), a freelance Senior SEO Technical Strategist and Web Developer, messaged me about the claims made about the sandbox.
“I know from practical experience that the sandbox theory is wrong. I indexed a personal blog with two posts in just two days. According to the sandbox theory, a small site with two posts should never be indexed.”
The point here is that even if the documentation turns out to originate from Google Search, searching it for confirmation of a long-held belief is the wrong way to analyze the data.
What is the Google data leak about?
There are five things to consider regarding leaked data:
- The context of the leaked information is unclear: is it related to Google Search? Or is it for some other purpose?
- The purpose of the data is unknown. Was the information used for actual search results or for internal data management and operations?
- The former Googlers did not confirm that the data was specific to Google Search, only that it appeared to come from Google.
- Keep an open mind. If you look for justification for a long-held belief, what do you think will happen? You will find it everywhere. This is called confirmation bias.
- Evidence suggests the data relates to an external-facing API for building a document warehouse.
What others say about the 'leaked' documents
Ryan Jones, someone with not only extensive experience in SEO but also a deep understanding of computer science, shared some rational insights on the so-called data leak.
Ryan tweeted:
“I don't know if this is for production or testing. My guess is that it's mainly for testing potential changes.
I don't know what's used for the web or other areas, some may just be used for Google Home, news, etc.
I don't know what the inputs are to the ML algorithm and what is used to train it. I would guess that clicks are not direct inputs but are used to train a model that predicts click likelihood (other than trend boosting).
I would also guess that some of these fields only apply to the training data set and not to all sites.
Am I saying Google isn't lying? Absolutely not. But let's take an open-minded and critical look at this leak.”
@DavidGQuaid tweeted:
“I don't know if this is for Google search or Google cloud docs search.
The API seems selective. I don't think the algorithm is executed this way. What if an engineer wants to skip the quality checks altogether? This seems like what would happen if you wanted to build a content warehouse app for an enterprise knowledge base.”
Does the “leaked” data relate to Google Search?
At this point, there is no solid evidence that this “leaked” data actually comes from Google Search. There is a lot of ambiguity as to what the purpose of the data is. Notably, there are hints that the data is merely “an external-facing API for building a document warehouse, as the name suggests,” and has nothing to do with ranking websites in Google Search at all.
While it's not yet conclusive that this data doesn't originate from Google Search, the weight of evidence seems to point in that direction.
Featured image: Shutterstock/Jaaak