A large cache of documents that appears to explain how Google ranks search results has been published online, likely the result of accidental publication by an internal company bot.
The leaked documentation describes an older version of Google's Content Warehouse API and offers a glimpse into the inner workings of Google Search.
The material appears to have been accidentally committed to a public, Google-owned repository on GitHub by the web giant's automated tools around March 13. That automation added the Apache 2.0 open source license to the commit, which is standard for Google's public documentation. An additional commit on May 7 attempted to revert the leak.
The material was nonetheless discovered by Erfan Azimi, CEO of search engine optimization (SEO) company EA Digital Eagle, and was then publicized on Sunday by two fellow SEO practitioners: SparkToro CEO Rand Fishkin and iPullRank CEO Michael King.
The documents do not contain any source code; rather, they explain how to use Google's Content Warehouse API, presumably intended for internal use only, and contain numerous references to internal systems and projects. A similarly named Google Cloud API is already publicly available, but the GitHub publication appears to go much further than that.
These files are notable because they reveal what Google considers important when ranking the relevance of a webpage, which is of perpetual interest to anyone in the SEO business, or anyone running a website who hopes Google will send them traffic.
Over 2,500 pages of documentation, compiled for easy browsing here, detail more than 14,000 attributes accessible through or associated with the API. There is little information, however, about whether all these signals are actually used, or how important they are, making it difficult to determine how much weight Google gives any given attribute in its search result ranking algorithm.
However, SEO consultants believe the documents contain some noteworthy details, because they differ from official statements made by Google representatives.
“Many of [Azimi's] claims [in an email describing the leak] directly contradict public statements made by Googlers over the years, in particular the company's repeated denials that click-centric user signals are employed, denials that subdomains are considered separately in rankings, denials of a sandbox for newer websites, and denials that domain age is collected or considered,” SparkToro's Fishkin explained in his post.
In his post about the documents, iPullRank's King pointed to a quote from Google search advocate John Mueller, who said in a video that “there is no such thing as a website authority score,” a measure of whether Google considers a site authoritative and worthy of being ranked highly in search results.
But King points out that the documents make clear that Google can calculate a “siteAuthority” score as part of the compressed quality signals it stores for documents.
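The leaked files are API reference pages generated from internal definitions rather than ranking code. As a rough idea of what such a reference entry looks like, the sketch below is a hypothetical reconstruction, not text copied from the leak: only the attribute name siteAuthority is reported in the posts, while the module name, type, and layout here are illustrative assumptions.

```json
{
  "module": "CompressedQualitySignals (assumed name for illustration)",
  "fields": [
    {
      "name": "siteAuthority",
      "type": "integer",
      "description": "Site-level authority signal; the leak does not document how, or how heavily, it is weighted in ranking."
    }
  ]
}
```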
The two posts also cite several other revelations.
First, clicks matter. The type of click (good clicks, bad clicks, long clicks, and so on) factors into the ranking of a web page. Material surfaced in the United States vs Google antitrust lawsuit [PDF] has already indicated that click metrics are taken into account as a ranking factor for web searches.
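To illustrate what "the type of click factors into ranking" could mean in practice, here is a purely hypothetical sketch. The click categories mirror those named above, but the weights and formula are invented for illustration and are not Google's actual algorithm, which the leak does not disclose.

```python
# Hypothetical sketch: combine click-type counts into a single engagement
# score. The categories (good/bad/long clicks) come from the leak coverage;
# the weights and formula below are invented for illustration only.

def click_signal_score(good_clicks: int, bad_clicks: int, long_clicks: int) -> float:
    """Toy weighted score: reward good and long clicks, penalize bad ones."""
    weights = {"good": 1.0, "bad": -1.5, "long": 2.0}
    total = (weights["good"] * good_clicks
             + weights["bad"] * bad_clicks
             + weights["long"] * long_clicks)
    # Normalize by total clicks so high-traffic pages aren't favored by
    # raw counts alone.
    n = good_clicks + bad_clicks + long_clicks
    return total / n if n else 0.0

print(click_signal_score(80, 10, 10))  # mostly positive engagement -> 0.85
print(click_signal_score(10, 80, 10))  # mostly negative engagement -> -0.9
```

The point of the sketch is only that different click types can carry different signs and magnitudes, which is why "good" versus "bad" clicks would matter as a signal at all.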
Second, Google uses views of websites in Chrome as a quality signal, which appears in the API as a parameter named ChromeInTotal. “One of the modules related to Page Quality Score has the ability to measure views from Chrome at a site level,” King said.
Additionally, the documents indicate that Google also considers other factors, such as the freshness of the content, the authorship, whether a page is relevant to its site's central focus, the placement of the page title and content, and the “average weighted font size of terms within the body of the document.”
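The quoted "average weighted font size" phrase suggests a metric in which terms rendered in larger type pull a page-level average up. One plausible reading is sketched below; the leak does not spell out the formula, so both the interpretation and the function are assumptions for illustration.

```python
# Hypothetical sketch of an "average weighted font size of terms" metric:
# average the rendered font size across all terms in the body, so words
# displayed larger (headings, emphasized text) raise the average. This is
# one guess at the metric; the leaked docs do not define the formula.

def average_weighted_font_size(terms: list[tuple[str, int]]) -> float:
    """terms: (term, font_size_px) pairs extracted from the rendered body."""
    if not terms:
        return 0.0
    return sum(size for _, size in terms) / len(terms)

body = [("leak", 24), ("google", 24), ("search", 16), ("ranking", 16), ("api", 16)]
print(average_weighted_font_size(body))  # (24 + 24 + 16 + 16 + 16) / 5 = 19.2
```

Under this reading, styling a key term in a larger font would nudge the metric, which is presumably why font size would be worth recording as a signal at all.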
Google did not respond to a request for comment.