The robodogs’ brains, which are tiny removable memory sticks, run programs based on machine learning algorithms. An algorithm is a sequence of logic that helps the dogs take action based on sensory data they acquire. The human brain works much the same way, says Stone. “Your brain is running an algorithm,” he explains. “It takes in your senses, what you see and hear, and figures out what actions to take. One goal of AI is to figure out the algorithm of the human brain.”

The algorithms spinning through the robodogs’ memory sticks give them the ability to learn through reinforcement. They perform an action, poorly at first, and improve the action based on positive feedback, just like a real dog getting a treat every time she sits on command. When training to walk, for example, the robodogs time themselves walking back-and-forth on the soccer field. They improve their walk by choosing gaits that increase their speed. They share information about what they are doing through wireless Ethernet, learning how to walk more effectively as a team. Initially, the dogs are clumsy and slow, but after three hours of training the dogs can trot around the field with the best of them. In fact, the robots on the UT Austin Villa team at one time had learned the fastest walk of any Aibo dog on record, a skill that comes in handy on the soccer field. (Other researchers have since improved on Stone and his students’ learning methods and achieved slightly faster walks.)

The UT Austin Villa team competes regularly in tournaments across the country and the world. The dogs placed third in the May 2005 U.S. Open and made it to the quarterfinals in the “legged league” at the World RoboCup 2005 competition in Osaka, Japan.

By 2050, Stone and other scientists involved in the RoboCup robot soccer project want to see fully independent humanoid robots beating flesh-and-bone world champion soccer players. It sounds like science fiction, but Stone is confident that his research will help produce such a vision.

“It’s unimaginable how we’ll get there, and there are huge challenges along the way,” he says. “But when we solve those challenges, we will have solved a lot of other much more practical challenges.”

He likens the journey from robodogs to robohumans playing soccer to the journey from the Wright brothers’ first flight to a man walking on the moon. Unimaginable progress can happen over fifty to seventy years, and the road leading to robohumans will be dotted with innumerable advances in AI technologies.

In other words, RoboCup competitions and soccer-playing robots are a means to an end. If teams of robots can play soccer, then they may also be able to perform risky tasks like searching for bombs and rescuing people from crumbling buildings. And machine learning algorithms developed in Stone’s lab could lead to autonomous cars that increase traveling efficiency and reduce automobile accidents; to self-healing computers; and to AI agents that manage business activities more effectively than humans (see example below).

Stone won’t be surprised to look out the window from the backseat of his autonomous car one day to see a world populated by robots working, playing and learning side-by-side with humans. He will have helped make it that way. ✥

To learn more about robot soccer, visit: www.cs.utexas.edu/~AustinVilla
Peter Stone’s webpage: www.cs.utexas.edu/~pstone

Another of Stone’s AI agents, TacTex-05, won the 2005 Trading Agent Competition in Edinburgh, Scotland. TacTex-05, a software agent, competed against other agents to manage a business supply chain for manufacturing PCs. TacTex finished the game with the most money in the virtual bank, which Stone credits to its ability to adapt its strategy from game to game using machine learning. For more information, see www.cs.utexas.edu/~TacTex/

[Background illustration: a page from a research paper on learning Aibo gaits. It shows fragments of Figure 7, an example of the crossover process used by a genetic algorithm (two parent policies, π1 and π2, combine into a new policy, π3), and Figure 8, reproduced below. The numeric values in Figure 7 are not recoverable.]

  π ← InitialPolicy
  while !done do
    {R1, R2, ..., Rt} = t random perturbations of π
    evaluate({R1, R2, ..., Rt})
    for n = 1 to N do
      Avg+ε,n ← average score for all Ri that have a positive perturbation in dimension n
      Avg+0,n ← average score for all Ri that have a zero perturbation in dimension n
      Avg−ε,n ← average score for all Ri that have a negative perturbation in dimension n
      if Avg+0,n > Avg+ε,n and Avg+0,n > Avg−ε,n then
        An ← 0
      else
        An ← Avg+ε,n − Avg−ε,n
      end if
    end for
    A ← (A / |A|) ∗ η
    π ← π + A
  end while

Figure 8: Pseudocode for the N-dimensional policy gradient algorithm. During each iteration of the main loop, t policies are sampled near π to estimate the gradient around π, then π is moved by an amount of η in the most favorable direction.

In the paper’s text, each Ri = {θ1 + ∆1, ..., θN + ∆N}, where each ∆j is chosen randomly to be either +εj, 0, or −εj, and each εj is a fixed value that is small relative to θj. The adjustment vector A is normalized and multiplied by a scalar step size η (the authors used a value of η = 2) so that the adjustment remains a fixed size each iteration.
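The policy gradient pseudocode from the background figure can be sketched in Python as follows. This is only an illustrative reading of that loop, not the team's implementation. The names `policy_gradient_step`, `learn`, and `toy_speed` are hypothetical. The default `t=15`, the fixed iteration count standing in for `while !done`, and the handling of an empty group are all assumptions, and `toy_speed` is a made-up stand-in for timing a real walk.

```python
import math
import random


def policy_gradient_step(policy, score, epsilons, t=15, eta=2.0, rng=random):
    """Run one pass of the loop body and return the adjusted policy."""
    n_dims = len(policy)
    # t random perturbations: each parameter shifts by +eps, 0, or -eps.
    signs = [[rng.choice((1, 0, -1)) for _ in range(n_dims)] for _ in range(t)]
    scores = [
        score([p + s * e for p, s, e in zip(policy, row, epsilons)])
        for row in signs
    ]

    adjustment = []
    for n in range(n_dims):
        # Group the scores by the sign of the perturbation in dimension n.
        groups = {1: [], 0: [], -1: []}
        for row, value in zip(signs, scores):
            groups[row[n]].append(value)
        if not all(groups.values()):
            # Assumption (not in the pseudocode): leave a dimension alone
            # when one of its three groups received no samples.
            adjustment.append(0.0)
            continue
        avg = {sign: sum(vals) / len(vals) for sign, vals in groups.items()}
        if avg[0] > avg[1] and avg[0] > avg[-1]:
            adjustment.append(0.0)  # the unperturbed value looks best
        else:
            adjustment.append(avg[1] - avg[-1])

    # Normalize A and scale it by the step size eta.
    norm = math.sqrt(sum(a * a for a in adjustment))
    if norm == 0.0:
        return list(policy)  # no direction to follow this round
    return [p + a / norm * eta for p, a in zip(policy, adjustment)]


def learn(policy, score, epsilons, iterations=40, **kwargs):
    """Repeat the step a fixed number of times (stand-in for `while !done`)."""
    for _ in range(iterations):
        policy = policy_gradient_step(policy, score, epsilons, **kwargs)
    return policy


def toy_speed(params):
    """Hypothetical stand-in for timing a walk: best when every value is 5.0."""
    return -sum((p - 5.0) ** 2 for p in params)
```

Calling `learn([0.0], toy_speed, [0.5], eta=0.5)` moves the single parameter toward 5.0 in steps of 0.5. On the robots, the score would instead come from timed walks across the field, which are noisy and slow to collect, so each iteration has to make do with a small number of evaluations.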
Spring 2006