to decide which ads to place on top of the organic search results. Depending on the experiment, we can assign a fraction of incoming queries to the treatment group and another fraction to a control group. Users issuing queries assigned to the former will experience the new style of search results, while users issuing queries assigned to the latter will experience the standard results. Such experiments are called "random query" experiments because queries are the experimental units assigned to treatment and control groups.

For UI experiments, where it would be disconcerting for the same user to see the page format changing with each query, we perform "random cookie" experiments. Here, we divert traffic based on a user's cookie-id so a random subset of users experiences the 'treatment' page format and another subset experiences the usual page format (i.e., serves as controls). By exposing queries or users to experimental treatments and measuring responses such as clicks, revenue, etc., we can make informed decisions about whether the treatment will become part of the standard service.

Response Surface Experiments

There are many knobs SEs use to control their search and ads serving systems. These knobs can be varied and optimized one at a time in an A/B-style experiment, or they can be varied and optimized simultaneously. We use classical response surface designs to perform the latter.

Overlapping Experiments

As an SE prepares to launch a new feature under experimental test, the fraction of traffic exposed to the new feature is increased to 100%. Does this mean no other experiment can run during this ramp-up period? Put another way, what does one do if there are not enough queries to satisfy the dozens of ongoing experiments each day? Clearly, queries have to perform double/triple/… duty when they participate in multiple A/B comparisons.

We usually cannot do full factorial experiments because not all concurrent experiments start and stop at the same time, and certain treatment combinations might not make sense (e.g., an experiment varying font color might not mesh with another experiment varying background color). Similarly, even though a query is assigned to a random cookie comparison of some treatments, we would still like to use it in some random query experiments. To manage all this, we built an experiment infrastructure to plan, record, and execute all experiments.

Analysis Platforms for Live Experiments

Engineers launch their experiments using the experiment infrastructure. This guarantees that experimental units will be effectively assigned to their experiment and others with which it coexists. An analysis platform contains a tool to help engineers size their experiment (a requirement of the Experiment Council) based on the type of traffic/diversion they will see. In addition, we have automated the data collection and summary statistics so the metrics are consistently defined and calculated.

By "calculation," we do not mean the formula used, but rather that the raw data from all experiments are extracted consistently (e.g., all using the same filter to detect and eliminate "bot" traffic). As a sign of the statistical maturity of the analysis, the underlying logging system also captures counterfactual conditions (e.g., what ads would have shown in the absence of this treatment?) so we can compare behavior on the subset of queries actually affected by the experiment. A browser-based dashboard provides estimated effects and their standard errors (along with drill-down capabilities to slice and dice the data as needed). By automating those aspects of live experiments that are amenable to automation, each engineering team can spend more time interpreting the results and exploring unanticipated interactions.

Experiments for Advertisers

We evangelize experimentation to the extent that we provide a mechanism for advertisers to run their own experiments. Our Website Optimizer (www.google.com/websiteoptimizer) allows an advertiser to run a (full) factorial experiment on its web page. Advertisers can explore layout and content alternatives while Google randomly directs queries to the resulting treatment combinations. Simple analysis of click and conversion rates allows advertisers to explore a range of alternatives and their effect on user awareness and interest.

Quasi-Experiments

Experiments on the user community typically involve subtle changes to the results page that are transparent to users. This facilitates the assignment of queries or users to treatment combinations (i.e., we can randomize). Experiments on advertisers are different; it is usually not possible to randomly assign advertisers to treatment groups due to contractual obligations and/or their willingness to be 'experimental units' for a service for which they are paying. In such cases, we allow advertisers to opt in to new features in our campaign management system and use statistical methods to tease out causal inferences. Propensity score matching, inverse propensity weighting, and doubly robust estimation are some of the methods, established in the social and biological sciences, currently in use at Google when randomization is not possible.
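To make the quasi-experimental idea concrete, here is a minimal sketch of inverse propensity weighting (one of the methods named above) in Python. The data, covariate names, and numbers are entirely hypothetical and the propensity model is a simple per-stratum empirical rate; this illustrates the technique only, not Google's actual pipeline.

```python
# Hypothetical illustration of inverse propensity weighting (IPW)
# for an opt-in feature where treatment is not randomized.

def ipw_effect(records):
    """Estimate the average treatment effect of opting in.

    Each record is (treated, covariate, outcome). The propensity
    score P(treated | covariate) is estimated empirically within
    each covariate stratum, then used to reweight outcomes
    (Horvitz-Thompson style).
    """
    # Empirical propensity score per covariate stratum.
    counts = {}
    for treated, x, _ in records:
        n, n_t = counts.get(x, (0, 0))
        counts[x] = (n + 1, n_t + (1 if treated else 0))
    propensity = {x: n_t / n for x, (n, n_t) in counts.items()}

    # Reweighted means for each arm; their difference is the
    # estimated average treatment effect.
    n = len(records)
    treated_sum = sum(y / propensity[x] for t, x, y in records if t)
    control_sum = sum(y / (1 - propensity[x]) for t, x, y in records if not t)
    return treated_sum / n - control_sum / n

# Made-up data: "large" advertisers opt in more often AND convert
# better, confounding a naive treated-vs-control comparison.
records = (
    [(1, "large", 10.0)] * 8 + [(0, "large", 8.0)] * 2 +
    [(1, "small", 4.0)] * 2 + [(0, "small", 2.0)] * 8
)
```

In this toy data the true effect is 2.0 within each stratum, and `ipw_effect(records)` recovers it, whereas the naive difference in arm means (8.8 − 3.2 = 5.6) overstates the effect because opt-in is confounded with advertiser size.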
4 AMSTAT NEWS MAY 2009