How Google Works: A Google Ranking Engineer’s Story #SMX

How Google Works: A Google Ranking Engineer’s Story #SMX was originally published on BruceClay.com, home of expert search engine optimization tips.

Google Software Engineer Paul Haahr has been at Google for more than 14 years. For two of them, he shared an office with Matt Cutts. He’s taking the SMX West 2016 stage to share how Google works from a Google engineer’s perspective – or, at least, share as much as he can in 30 minutes. Afterward, Webmaster Trends Analyst Gary Illyes will join him onstage, and the two will field questions from the SMX audience, with Search Engine Land Editor Danny Sullivan moderating.

From left: Google Webmaster Trends Analyst Gary Illyes, Google Software Engineer Paul Haahr and Search Engine Land Editor Danny Sullivan on the SMX West 2016 stage in San Jose.

How Google Works

Haahr opens by telling us what Google engineers do. Their job includes:

Writing code for searches
Optimizing metrics
Looking for new signals and combining old signals in new ways
Moving results with good ratings up
Moving results with bad ratings down
Fixing rating guidelines or developing new metrics when necessary

Two parts of a search engine:

Ahead of time (before the query)
Query processing

Before the Query

Crawl the web
Analyze the crawled pages

Extract links
Render contents
Annotate semantics

Build an index

The Index

Like the index of a book
For each word, a list of pages it appears on
Broken up into groups of millions of pages
Plus per-document metadata
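
The book-index analogy above can be sketched in a few lines of Python. This is only an illustration of the idea, not Google's implementation: the function name `build_shards`, the whitespace tokenizer, the tiny shard size and the `length` metadata field are all assumptions made for the example.

```python
from collections import defaultdict

def build_shards(pages, pages_per_shard):
    """Split pages into shards; each shard has its own inverted index and metadata."""
    shards = []
    items = sorted(pages.items())
    for start in range(0, len(items), pages_per_shard):
        index = defaultdict(set)  # word -> set of page ids the word appears on
        metadata = {}             # page id -> per-document metadata (hypothetical field)
        for page_id, text in items[start:start + pages_per_shard]:
            words = text.lower().split()  # naive tokenizer, for illustration only
            for word in words:
                index[word].add(page_id)
            metadata[page_id] = {"length": len(words)}
        shards.append({"index": dict(index), "metadata": metadata})
    return shards

# Made-up pages; a real shard would hold millions of pages, not two.
pages = {
    "a.example/1": "texas farm fertilizer prices",
    "b.example/2": "farm equipment for sale",
    "c.example/3": "fertilizer safety guide",
}
shards = build_shards(pages, pages_per_shard=2)
print(shards[0]["index"]["farm"])
```

Looking up a word in a shard's index returns the pages it appears on, which is the "for each word, a list of pages" idea from the talk.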

Query Processing

Query understanding and expansion

Does the query name any known entities?
Retrieval and scoring

Send the query to all the shards

Each shard

Finds the matching pages
Computes a score for query+page
Sends back the top N pages by score

Combine all the top pages
Sort by score
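
The retrieval steps above (send the query to every shard, take each shard's top N, combine, sort) can be sketched as follows. This is a toy under stated assumptions: each shard is a plain dictionary mapping a word to per-page term counts, and the score is simply the sum of those counts. The talk does not describe the real scoring function, and `search_shard` and `search` are hypothetical names.

```python
import heapq

def search_shard(shard, query_words, n):
    """Find pages in one shard that contain every query word; return its top n."""
    postings = [shard.get(word, {}) for word in query_words]
    if not postings:
        return []
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)  # keep only pages that match all words
    # Toy score (an assumption): sum of term counts for the query words.
    scored = [(sum(p[page] for p in postings), page) for page in matching]
    return heapq.nlargest(n, scored)

def search(shards, query, n):
    """Send the query to all shards, combine their top pages, and sort by score."""
    query_words = query.lower().split()
    candidates = []
    for shard in shards:
        candidates.extend(search_shard(shard, query_words, n))
    return heapq.nlargest(n, candidates)

# Made-up shards: word -> {page id: term count}.
shards = [
    {"farm": {"p1": 2, "p2": 1}, "fertilizer": {"p1": 1}},
    {"farm": {"p3": 3}, "fertilizer": {"p3": 2, "p4": 5}},
]
print(search(shards, "farm fertilizer", 2))
```

Because each shard returns only its own top N, the combining step never has to look at more than N results per shard, which is what makes the fan-out cheap to merge.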

Post-retrieval adjustments

Host clustering
Is there duplication?
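
The talk does not spell out how host clustering works. One common reading is that a results page should not be dominated by a single host, so a minimal sketch, assuming a simple per-host cap, might look like this. The function name `cluster_by_host` and the cap of two results are assumptions for illustration.

```python
from urllib.parse import urlsplit

def cluster_by_host(ranked_urls, max_per_host=2):
    """Keep results in ranked order, but at most max_per_host from any one host."""
    seen = {}  # host -> number of results encountered so far
    kept = []
    for url in ranked_urls:
        host = urlsplit(url).hostname
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= max_per_host:
            kept.append(url)
    return kept

# Made-up ranked results, best first.
ranked = [
    "https://a.example/1",
    "https://a.example/2",
    "https://b.example/x",
    "https://a.example/3",
]
print(cluster_by_host(ranked))
```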

Scoring Signals

A signal is:

A piece of information used in scoring
Query independent – feature of a page
Query dependent
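
To make the distinction concrete, here is a hedged sketch in which query-independent signals are stored per page ahead of time and query-dependent signals are computed when the query arrives. The signal names (`page_quality`, `term_match`), the weights, and the weighted-sum combination are all invented for the example; the talk does not say how signals are actually combined.

```python
def query_dependent_signals(query, page_text):
    """Computed at query time: fraction of query words that appear on the page."""
    query_words = query.lower().split()
    page_words = set(page_text.lower().split())
    if not query_words:
        return {"term_match": 0.0}
    hits = sum(1 for word in query_words if word in page_words)
    return {"term_match": hits / len(query_words)}

def score(page_signals, query_signals, weights):
    """Combine both kinds of signal with a weighted sum (an assumption)."""
    signals = {**page_signals, **query_signals}
    return sum(weights[name] * value for name, value in signals.items())

# Query-independent: a feature of the page, known before any query arrives.
page_signals = {"page_quality": 0.5}
weights = {"page_quality": 0.4, "term_match": 0.6}

query_signals = query_dependent_signals("farm fertilizer", "texas farm fertilizer prices")
print(score(page_signals, query_signals, weights))
```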

Metrics

“If you cannot measure it, you cannot improve it” – Lord Kelvin

Relevance

Does a page usefully answer the user’s query?
Ranking’s top-line metric

Quality

How good are the results we show?

Time to result (faster is better)

Google measures itself with live experiments:

a/b experiments on real traffic
look for changes in click patterns
a lot of traffic is in one experiment or another
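
A live experiment of this kind can be sketched as two pieces: a stable way to split traffic and a comparison of click patterns between the arms. The hashing scheme, the arm names and the click-through-rate metric below are assumptions for illustration, not details from the talk.

```python
import hashlib

def assign_arm(user_id, experiment_name):
    """Deterministically place a user in 'control' or 'experiment' (50/50 split)."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    return "experiment" if int(digest, 16) % 2 else "control"

def click_through_rate(events):
    """events: iterable of (arm, clicked) pairs -> {arm: fraction of views clicked}."""
    shown, clicks = {}, {}
    for arm, clicked in events:
        shown[arm] = shown.get(arm, 0) + 1
        clicks[arm] = clicks.get(arm, 0) + int(clicked)
    return {arm: clicks[arm] / shown[arm] for arm in shown}

# Made-up log of result views: (arm, whether the user clicked).
events = [
    ("control", True),
    ("control", False),
    ("experiment", True),
    ("experiment", True),
]
rates = click_through_rate(events)
print(rates)
```

Hashing the experiment name together with the user ID keeps each user in the same arm for the life of an experiment while letting different experiments split traffic independently. A real analysis would also need a significance test before reading anything into the difference.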

At one time, Google tested 41 different shades of blue to see which was best.

Google also does human rater experiments:

Show real people experimental search results
Ask how the results are
Ratings are aggregated across raters
Publish guidelines explaining criteria for raters
Tools support doing this in an automated way, similar to Mechanical Turk

Google judges pages on two main factors:

Needs Met (where mobile is front and center)
Page Quality

Needs Met Grades

Fully Meets
Very Highly Meets
Highly Meets
Moderately Meets
Slightly Meets
Fails to Meet
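
Since ratings are aggregated across raters, one simple way to picture the aggregation is to map the grades above to numbers and average them. Both the 0 to 5 numeric mapping and the use of a plain mean are assumptions made for this sketch; the talk does not say how the aggregation is done.

```python
from statistics import mean

# Grades from worst to best, as listed in the talk; the numeric values are assumed.
GRADES = [
    "Fails to Meet",
    "Slightly Meets",
    "Moderately Meets",
    "Highly Meets",
    "Very Highly Meets",
    "Fully Meets",
]
GRADE_VALUE = {grade: value for value, grade in enumerate(GRADES)}

def aggregate(ratings):
    """Average several raters' Needs Met grades for one result."""
    return mean(GRADE_VALUE[rating] for rating in ratings)

# Three hypothetical raters judging the same result.
result_score = aggregate(["Highly Meets", "Fully Meets", "Moderately Meets"])
print(round(result_score, 2))
```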

Page Quality Concepts:

Expertise
Authoritativeness
Trustworthiness

Google Engineer Development Process

Idea
Repeat until Ready

Write code
Generate data
Run experiments
Analyze

Launch report by quantitative analyst
Launch review
Launch

What goes wrong?

There are two kinds of problems:

Systematically bad ratings
Metrics don’t capture the things we care about

Here’s an example of a bad rating. Someone searches for [Texas farm fertilizer] and the search result provides a map to the manufacturer’s headquarters. It’s very unlikely that that’s what the searcher wants. Google determines this through live experiments. If raters see the map and rate it Highly Meets on the Needs Met scale, that is a rater failing.

Or, what if the metrics are missing? In 2009-2011, there were lots of complaints about low-quality content. But relevance metrics kept going up, due to content farms. Conclusion: Google wasn’t measuring what it needed to measure. Thus, the quality metric was developed apart from relevance.

Gary Illyes and Paul Haahr Answer Questions from the SMX Audience

SMX: How does RankBrain fit into all of this?

Haahr: RankBrain gets to see a subset of the signals. I can’t go into too much detail about how RankBrain works. We understand how it works but not as much what it’s doing. It uses a lot of the stuff that we’ve published about deep learning.

SMX: How would RankBrain know the authority of a page?

Haahr: It’s all a function of the training that it gets. It sees queries and other signals. I can’t say that much more that would be useful.

SMX: When you are logged into a Google app, do you differentiate by the information you gather? If you’re in Google Now vs. Chrome can that impact what you’re seeing?

Haahr: It’s really a question of if you’re logged in or not. We provide a consistent experience. Your browsing history follows you to either.

SMX: Does Google deliver different results for the same queries at different times in the day?

Illyes: I’m not sure. In Maps, for example, if we display something Maps-related, we will show the hours (but it doesn’t change what shows up, to Gary’s knowledge).

SMX: What’s going on with Panda and Penguin?

Illyes: I gave up on giving a date or timeline on Penguin. We are working on it, thinking about how to launch it, but I honestly don’t know a date and I don’t want to say a date because I was already wrong three or four times, and it’s bad for business.

SMX: Post-Google Authorship, how are you tracking author authority?

Haahr: There I’m not going to go into any detail. What I will say is the raters are expected to do that manually for a page that they are seeing. What we measure is whether we are able to do a good job of getting things that the raters think are good authorities.

SMX: Does that mean authority is used as a direct or indirect factor?

Haahr: I wouldn’t say yes or no. It’s much more complicated than that and I can’t give a direct answer.

SMX: When explicit authorship ended, Google did say to keep having bylines. Should you bother with rel=author at all?

Illyes: There is at least one team that is still looking into using the rel=author tag just for the sake of future developments. If I were an SEO, I would still leave the tag. It doesn’t hurt to have it. On new pages, however, it’s probably not worth it to have, though we might use it for something in the future.

SMX: What are you reading right now?

Haahr: I read a lot of journalism and very few books. However, I just finished “City on Fire” – it’s about New York in the ’70s. There are 900 pages and I was disappointed when it ended. I’ve just started “It Can’t Happen Here.”

Published on March 03, 2016 15:30