Rand Fishkin, along with Mike King, may have published one of the largest leaks, outside of the Department of Justice antitrust case, about Google Search and its internal ranking features and signals. The document came from an anonymous source (no longer anonymous, see below), has been reviewed by Rand Fishkin, and contains a ton of details about how Google Search works.
More importantly, it appears to contradict statements made by numerous Google Search employees over the past two decades, as I've covered here before.
I haven't read it all yet, but I felt it was important for you all to read it for yourselves. You can find more details in the headings below.
“Many of their claims directly contradict statements made publicly by Googlers over the years, including the company's repeated denials that it employs click-centric user signals, that subdomains are considered separately in rankings, that it has a sandbox for new websites, and that it collects or considers domain age,” Rand wrote.
Mike King writes: “I reviewed the API reference documentation and compared it to other past Google leaks and DOJ antitrust testimony. I combined it with extensive patent and whitepaper research I did for my upcoming book, The Science of SEO. The documentation I reviewed doesn't provide any details about Google's scoring capabilities, but it does provide a wealth of information about the data it stores about content, links, and user interactions, and there are varying degrees of explanation (ranging from disappointingly sparse to surprisingly detailed) about the features it operates and stores. It's tempting to loosely refer to these as 'ranking factors,' but that's not accurate.”
Aleyda Solis wrote a short post on X summarizing some of the leaked information:
- The document includes 14K ranking features and more.
- Google has a calculator called “siteAuthority.”
- Navboost has a specific module that is entirely focused on click signals that represent users as voters, and clicks are stored as votes.
- Google stores the most clicked result during the session.
- Google has an attribute called hostAge that they use specifically to “sandbox new spam as it is delivered.”
- One of the modules related to page quality scores includes a site-level measure of views from Chrome.
I haven't had time to go through everything yet, but I'll be doing so over the next few days.
I also haven't seen any Googlers comment on this publicly yet. This is new, and I don't know whether any of them will.
This is a bit like the Yandex search rankings leak.
Here are a few social media posts about this. Again, this has only been out for a few hours, and no one other than Rand and Mike has had time to digest it in detail:
Sincerest thanks to the person I reached out to on Friday after seeing the leak, who has helped analyze and decipher many of the early findings: https://t.co/JGYdGydKlC
— Rand Fishkin (Follow @randderuiter on Threads) (@randfish) May 28, 2024
Let's get the party started!
A few weeks ago I said I was going to publish the most important thing I'd ever written. I was wrong.
A document about the Google search algorithm was leaked, so I spent the weekend dissecting it. https://t.co/v71B16Ggov
✌🏾
— Mike King (@iPullRank) May 28, 2024
🚨 Google Search internal engineering documents have leaked and been analyzed 👀 Many of these were denied by Google 👇
* The documentation includes 14K ranking features and more
* Google has a calculator called “siteAuthority”
* Navboost has… pic.twitter.com/dlpCIQdpDm
— Aleyda Solis 🕊️ (@aleyda) May 28, 2024
Here's a direct link to the leaked Google Ranking API documentation, until it's (presumably) taken down by Google's lawyers.
“google_api_content_warehouse v0.4.0”
Save this page! https://t.co/8RgmoF69z9 pic.twitter.com/9dXobbr2U1
— Cyrus SEO (@CyrusShepard) May 28, 2024
Very interesting blog post.
Among the many things he wrote that are worth saving as useful ⬇️ https://t.co/VZH8EARV1G
— Gianluca Fiorelli (@gfiorelli1) May 28, 2024
Apparently, someone at Google Search “accidentally” leaked an engineering document that reveals many secrets about how the search engine works, including the existence of a “golden document” flag that gives a higher weight to documents that are “human-labeled.” This could have some implications… pic.twitter.com/zeG79f161B
— Joe Youngblood (@YoungbloodJoe) May 28, 2024
If you want to broach this topic with me, I'll be updating this Google Doc with anything interesting for the next 30 minutes or so before returning to normal life. https://t.co/1iQ40nknZ0
— Glenn Allsopp 👾 (@ViperChill) May 28, 2024
#Google search #leak Over 14,000 ranking factors revealed…including “demotion of baby panda”?!?
Panda demoted…but to baby panda? Google seems to be getting lenient with low-quality sites these days pic.twitter.com/Ob2bndHnzH
— Shay Harrell (@RangerShay) May 28, 2024
Based on my personal experience over the years of watching Google's algorithms react in the exact opposite way to what commentators say, I don't think it's bias. They've been lying since day one, and anyone with even basic SEO experience can…
— Greg Boser (@GregBoser) May 28, 2024
The commit can be found here: https://t.co/4CqyJZXqZy
— Fili 🇪🇺🇳🇱 (@filiwiese) May 28, 2024
I'm looking forward to really digging into this.
Update: I read through these two stories briefly and looked a bit into the actual API documentation. Honestly, given everything I've followed about Google Search over the past 20+ years, they really do look legitimate. Some of the details in these docs are things I've heard, both on and off the record, described as actual ranking features. Some are no longer in use, to my understanding, and for others I'm not sure how they're used (for example, directly for ranking, or for post-mortem ranking validation). In my opinion, it's worth looking into these docs in more detail.
UPDATE 2: The leaker has spoken out – Erfan Azimi emailed me this video:
Forum discussion on X.