PRESIDENT’S INVITED COLUMN
For this month’s column, I asked Daryl Pregibon, a statistician (or “engineer”), to describe his life at Google. In this
article, he shares the different kinds of problems statisticians help solve. Read on and you will see that data are the
language and currency of a Google statistician.
~ Sally C. Morton
Statistics @ Google
Daryl Pregibon
Search engines (SEs) are ubiquitous in today’s society. We use them for entertainment, commerce, news, education, and research. Statistics plays a fundamental role in the services SEs deliver, from spelling suggestions to machine translation. This article highlights the “behind the scenes” areas where statisticians have helped improve Google’s basic services.

The Environment

At Google, most technical employees (generically called engineers) are computer scientists, many without any prior industry experience. Engineers are not organized by discipline, but by project. This means there is no ‘statistics department’ per se; rather, statisticians work on project teams with other engineers. Generally, software engineers focus on implementing some product or service, while statisticians focus on defining the measurement and analysis systems needed to make sure the product or service is functioning as intended. Communication is important, and the language of choice between statisticians and software engineers is data.

Nearly every question about products and services at Google is answered with data. For the most part, experiments are at the core of Google decision-making. But when experiments are not possible, observational data is collected from internal “logging” systems. For example, Google relies on distributed file systems that use huge numbers of inexpensive hardware components administered in a ‘smart’ way. These systems coordinate massive read and write operations in a situation where it is virtually certain that some components will not be operating correctly. As a result, optimal strategies for replication, data repair, and monitoring are essential to achieve good performance without wasting resources. Statistical and combinatorial models of the rate of failure at each layer help us determine the optimal repair and replication strategies to use to make these systems exceptionally reliable, without having to guarantee each hardware component is exceptionally reliable.

Our attitude toward statistical methodology is that if it already exists, “use it or modify it,” and if it doesn’t exist, “invent it.” We often find that statistical techniques developed years ago in a totally unrelated area provide the essentials of the solution we are looking for today. Generic methods such as logistic regression and iterative proportional fitting are routinely used, though the size of the underlying data sets is orders of magnitude larger than most new graduates in statistics are accustomed to. In general, Google statisticians use a variety of methods, including time series analysis, hypothesis testing, survival analysis, decision theory, and others from econometrics and epidemiology.

Before turning to experimentation, one last comment on what distinguishes statistics at Google from statistics in other industries. We are a dot-com–era company with a shallow management hierarchy. This allows most decisions to be made quickly. Statisticians at Google have to make data speak early and often to be effective. This seldom means “quick ’n dirty,” as management relies on statisticians for careful and correct analysis of data. Speed is attained by designing an analysis methodology that is sound and easy to automate. A good example of this is the maturity of our approach to live experiments.

Experiments: A Query Is too Important an Opportunity to Waste

When users type a query at an SE, they see “organic” search results and sponsored links. By organic, we mean that such links provide no commercial value to the SE. The sponsored links, however, are potentially revenue-generating; “potentially” because there is typically no fee to show an ad until it is clicked.

From a user’s perspective, a query was submitted and results appear. From Google’s perspective, the user has provided an opportunity to test something. What can we test? Well, there is so much to test that we have an Experiment Council that vets experiment proposals and quickly approves those that pass muster.

Simple A/B/C Experiments

Consider a ‘treatment’ that an engineer wants to study. It could be a new component in the user interface (UI), a new way to auto-generate the short text snippet in organic search results, or a new way
MAY 2009 AMSTAT NEWS 3