Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach
Andreas ...
TRANSCRIPT
River monitoring

Want to monitor the ecological condition of a river. Need to decide where to make observations!
Mixing zone of San Joaquin and Merced rivers

[Figure: pH value (7.4 to 8) vs. position along transect (m)]

NIMS (UCLA)
Observation selection for spatial prediction

Gaussian processes: distribution over functions (e.g., how pH varies in space). Allows estimating uncertainty in prediction.

[Figure: pH value vs. horizontal position, showing observations, the unobserved process, the prediction, and confidence bands]
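The prediction and confidence bands come from standard GP regression. A minimal sketch, assuming a squared-exponential kernel and toy pH readings (the noise level and all numeric values are illustrative, not from the talk):

```python
import numpy as np

def se_kernel(s, t, amplitude=1.0, bandwidth=1.0):
    """Squared-exponential kernel (amplitude and bandwidth are assumed values)."""
    return amplitude * np.exp(-np.subtract.outer(s, t) ** 2 / bandwidth ** 2)

def gp_predict(train_x, train_y, test_x, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at test_x."""
    K = se_kernel(train_x, train_x) + noise * np.eye(len(train_x))
    Ks = se_kernel(test_x, train_x)               # cross-covariance
    Kss = se_kernel(test_x, test_x)
    mean = Ks @ np.linalg.solve(K, train_y)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

train_x = np.array([0.0, 1.0, 3.0])
train_y = np.array([7.5, 7.7, 7.6])               # e.g. pH readings
mean, var = gp_predict(train_x, train_y, np.linspace(0, 4, 9))
band = 2 * np.sqrt(var)                            # ~95% confidence band
```

The posterior variance shrinks near observed locations and widens between them, which is exactly what the confidence bands in the figure show.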
Mutual information [Caselton & Zidek 1984]

Finite set of possible locations V. For any subset A ⊆ V, can compute

    MI(A) = H(V\A) − H(V\A | A)

i.e., the entropy of the uninstrumented locations before sensing, minus their entropy after sensing.

Want: A* = argmax MI(A) subject to |A| ≤ k.
Finding A* is an NP-hard optimization problem.
Want to find A* = argmax_{|A|=k} MI(A).

The greedy algorithm for finding optimal a priori sets:
  Start with A = ∅
  For i = 1 to k:
    s* := argmax_s MI(A ∪ {s})
    A := A ∪ {s*}
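For jointly Gaussian variables, MI(A) reduces to log-determinants: MI(A) = ½(log|Σ_A| + log|Σ_{V\A}| − log|Σ_V|). That makes the greedy loop above easy to sketch on a toy 1-D sensor grid (the kernel, nugget, and budget below are assumed values):

```python
import numpy as np

def se_kernel(x, amplitude=1.0, bandwidth=2.0):
    return amplitude * np.exp(-np.subtract.outer(x, x) ** 2 / bandwidth ** 2)

def mi(cov, A):
    """MI(A) = H(X_A) + H(X_B) - H(X_V) for B = V \\ A (Gaussian entropies)."""
    B = [i for i in range(len(cov)) if i not in A]
    if not A or not B:
        return 0.0
    logdet = lambda idx: np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(list(A)) + logdet(B) - logdet(list(range(len(cov)))))

def greedy_mi(cov, k):
    """Greedy selection: repeatedly add the location with largest MI gain."""
    A = []
    for _ in range(k):
        candidates = [s for s in range(len(cov)) if s not in A]
        A.append(max(candidates, key=lambda s: mi(cov, A + [s])))
    return A

locs = np.linspace(0, 10, 21)
cov = se_kernel(locs) + 0.1 * np.eye(len(locs))   # nugget keeps Sigma well conditioned
A = greedy_mi(cov, 4)
```

Each iteration scans all remaining locations, so this naive version costs O(k·n) determinant evaluations; the theorem below says the resulting set is nonetheless near-optimal.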
Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]:

    MI(A_greedy) ≥ (1 − 1/e) max_{A: |A|=k} MI(A) − ε_MI

The result of the greedy algorithm is within a constant factor (~63%) of the optimal solution.
Sequential design

Observed variables depend on previous measurements and on the observation policy π.
MI(π) = expected MI score over the outcomes of the observations.

[Figure: observation policy as a decision tree, e.g. first observe X5, then branch on its value (< 20°C vs. ≥ 20°C) to decide which variable to observe next (X3 or X7), and so on. Each path through the tree has an MI score, e.g. MI(X5=17, X3=16, X7=19) = 3.4, other branches give MI = 2.1 and 2.4; MI(π) = 3.1 is the expectation over paths.]
A priori vs. sequential

Sets are very simple policies. Hence:

    max_A MI(A) ≤ max_π MI(π)   subject to |A| = |π| = k

Key question addressed in this work: how much better is sequential vs. a priori design?
Main motivation: performance guarantees about sequential design? A priori design is logistically much simpler!
GPs slightly more formally

Set of locations V; joint distribution P(X_V). For any A ⊆ V, P(X_A) is Gaussian. A GP is defined by
  Prior mean μ(s) [often constant, e.g., 0]
  Kernel K(s,t)

[Figure: pH value (7.4 to 8) vs. position along transect (m), with locations V indexing the random variables X_V]
Example: squared-exponential kernel

    K(s,t) = θ₁ exp(−‖s − t‖₂² / θ₂²)

  θ₁: variance (amplitude)
  θ₂: bandwidth

[Figure: correlation vs. distance for the squared-exponential kernel]
Known parameters (bandwidth, variance, etc.)

No benefit in sequential design!  max_A MI(A) = max_π MI(π)
Mutual information does not depend on the observed values:

    H(X_B | X_A = x_A) ∝ log|Σ(θ)_{B|A}|

Unknown (discretized) parameters: prior P(Θ = θ)

Sequential design can be better!  max_A MI(A) ≤ max_π MI(π)
Mutual information does depend on the observed values:

    P(x_B | x_A) = Σ_θ P(θ | x_A) N(x_B; μ(θ)_{B|A}, Σ(θ)_{B|A})

(the posterior over θ depends on the observations!)
Key result: how big is the gap?

If θ is known: MI(A*) = MI(π*). If θ is "almost" known: MI(A*) ≈ MI(π*).
The gap depends on H(Θ).

Theorem:

    MI(π*) ≤ MI(A*) + O(1)·H(Θ)
    (MI of best policy ≤ MI of best set + gap size)

    MI(π*) ≤ Σ_θ P(θ) max_{|A|=k} MI(A | θ) + H(Θ)
    (as H(Θ) → 0: MI of best policy → MI of best parameter-specific set)
Near-optimal policy if parameters approximately known

Use the greedy algorithm to optimize

    MI(A_greedy | Θ) = Σ_θ P(θ) MI(A_greedy | θ)

Note: |MI(A | Θ) − MI(A)| ≤ H(Θ). Can compute MI(A | Θ) analytically, but not MI(A).

Corollary [using our result from ICML 05]:

    MI(A_greedy) ≥ (1 − 1/e) MI(π*) − ε − O(1)·H(Θ)
    (greedy result vs. optimal sequential plan: ~63% constant factor; gap ≈ 0 for known parameters)
Exploration–exploitation for GPs

                        Reinforcement learning              Active learning in GPs
Parameters              P(S_{t+1} | S_t, A_t), Rew(S_t)     Kernel parameters θ
Known parameters        Find near-optimal policy            Find near-optimal policy
(exploitation)          by solving the MDP!                 by finding the best set
Unknown parameters      Try to quickly learn parameters!    Try to quickly learn parameters.
(exploration)           Need to waste only polynomially     How many samples do we need?
                        many robots!
Parameter info-gain exploration (IGE)

The gap depends on H(Θ). Intuitive heuristic: greedily select

    s* = argmax_s I(Θ; X_s) = argmax_s H(Θ) − H(Θ | X_s)

(parameter entropy before observing s, minus parameter entropy after observing s)

Does not directly try to improve spatial prediction. No sample complexity bounds.
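As a toy illustration of this heuristic, the info gain can be computed numerically for a discretized bandwidth after one reading. All values below (the two candidate bandwidths, the uniform prior, the reading x0) are assumptions; the code evaluates I(Θ; X_s | x0) as H(X_s | x0) − H(X_s | x0, Θ), which equals H(Θ | x0) − H(Θ | x0, X_s):

```python
import numpy as np

bandwidths = np.array([1.0, 3.0])   # discretized Theta (assumed)
prior = np.array([0.5, 0.5])        # uniform prior P(Theta)
x0 = 1.2                            # one reading already taken at location 0 (assumed)

def info_gain(d):
    """I(Theta; X_s | X_0 = x0) for a candidate s at distance d from the reading."""
    rho = np.exp(-d**2 / bandwidths**2)    # corr(X_0, X_s) under each theta
    mu = rho * x0                          # conditional means of X_s | x0, theta
    var = 1.0 - rho**2                     # conditional variances
    grid = np.linspace(-8, 8, 4001)
    dens = np.array([np.exp(-(grid - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
                     for m, v in zip(mu, var)])
    mix = prior @ dens                     # predictive mixture p(x_s | x0)
    dx = grid[1] - grid[0]
    h_mix = -np.sum(mix * np.log(np.maximum(mix, 1e-300))) * dx
    h_cond = prior @ (0.5 * np.log(2 * np.pi * np.e * var))
    return h_mix - h_cond

dists = np.linspace(0.2, 5, 25)
best = dists[np.argmax([info_gain(d) for d in dists])]
```

The gain is bounded by H(Θ) = ln 2 here, and it is largest at intermediate distances, where the two bandwidth hypotheses predict the most different readings.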
Implicit exploration (IE)

Intuition: any observation will help us reduce H(Θ). Sequential greedy algorithm: given previous observations X_A = x_A, greedily select

    s* = argmax_s MI({X_s} | X_A = x_A, Θ)

Contrary to the a priori greedy algorithm, this algorithm takes observations into account (it updates the parameter posterior).

Proposition: H(Θ | X_π) ≤ H(Θ), i.e., "information never hurts" holds for policies. Still no sample complexity bounds.
Learning the bandwidth

Can narrow down the kernel bandwidth by sensing inside and outside the bandwidth distance! For the squared-exponential kernel

    K(s,t) = exp(−‖s − t‖₂² / θ²)

sensors within the bandwidth are correlated; sensors outside the bandwidth are ≈ independent.

[Figure: correlation vs. distance with three sensors A, B, C marked inside and outside the bandwidth]

Choose pairs of samples at a given distance to test correlation!
Hypothesis testing: distinguishing two bandwidths

[Figure: correlation vs. distance under BW = 1 and BW = 3, with sample GP draws for each bandwidth; at a particular distance the correlation gap is largest]
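A quick numeric sketch of the "largest correlation gap" idea, assuming the squared-exponential correlation exp(−d²/bw²) and the two bandwidths from the figure:

```python
import numpy as np

def correlation(d, bw):
    """Correlation at distance d under bandwidth bw (assumed SE kernel form)."""
    return np.exp(-d**2 / bw**2)

# Find the test distance where the two hypotheses disagree the most.
d = np.linspace(0.01, 6, 600)
gap = correlation(d, 3.0) - correlation(d, 1.0)
d_star = d[np.argmax(gap)]        # most discriminative pair distance
```

The larger the gap at the chosen test distance, the fewer samples the hypothesis test needs to tell the two bandwidths apart.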
Hypothesis testing: sample complexity

Theorem: to distinguish bandwidths with minimum gap ρ in correlation and error < ε, we need

    N̂ = O((1/ρ²) log²(1/ε))

independent samples. In GPs, samples are dependent, but "almost" independent samples suffice! (details in paper)

Other tests can be used for variance/noise etc. What if we want to distinguish more than two bandwidths?

[Figure: posterior P(θ) over candidate bandwidths]
Hypothesis testing: binary searching for the bandwidth

Find the "most informative split" at the posterior median. The testing policy π_ITE needs only logarithmically many tests!

Theorem: if we have tests with error < ε_T, then

    E_T[MI(π_ITE + A_greedy | Θ)] ≥ (1 − 1/e) MI(π*) − k·ε_MI − O(ε_T)
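The logarithmic test count comes from halving the candidate set at each split. A toy sketch with an error-free oracle standing in for the real correlation test (`pairwise_test` and the geometric-mean split rule are illustrative, not the paper's procedure):

```python
import math

def pairwise_test(bw_lo, bw_hi, true_bw):
    """Oracle stand-in for a correlation test: is the truth on the low side?"""
    return true_bw <= math.sqrt(bw_lo * bw_hi)

def binary_search_bandwidth(candidates, true_bw):
    """Binary search over sorted candidate bandwidths, counting tests used."""
    lo, hi, tests = 0, len(candidates) - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2              # split the remaining candidates
        tests += 1
        if pairwise_test(candidates[mid], candidates[mid + 1], true_bw):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo], tests

candidates = [1, 2, 4, 8, 16, 32, 64, 128]
found, tests = binary_search_bandwidth(candidates, true_bw=8)
```

With 8 candidates the search needs at most 3 tests, i.e., log₂ of the number of hypotheses.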
Exploration–exploitation algorithm

Exploration phase:
  Sample according to the exploration policy.
  Compute a bound on the gap between the best set and the best policy.
  If the bound < a specified threshold, go to the exploitation phase; otherwise continue exploring.
Exploitation phase:
  Use the a priori greedy algorithm to select the remaining samples.

For hypothesis testing, guaranteed to proceed to exploitation after logarithmically many samples!
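The two phases can be sketched as a loop skeleton. Everything below is a toy stand-in: the halving `gap_bound` mimics the H(Θ)-based bound shrinking under hypothesis testing, and the exploitation score is a max-min-distance placeholder for the greedy MI objective:

```python
def exploration_exploitation(locations, k, threshold=0.1):
    """Toy two-phase skeleton: explore until the gap bound is small, then exploit."""
    observed = []
    gap_bound = 1.0                                   # toy: each test halves the bound
    # Exploration phase: sample until the set-vs-policy gap bound is small enough.
    while gap_bound >= threshold and len(observed) < k:
        observed.append(locations[len(observed)])     # placeholder exploration policy
        gap_bound /= 2
    # Exploitation phase: a priori greedy on the remaining budget
    # (max-min-distance spread as a placeholder score instead of MI).
    remaining = [s for s in locations if s not in observed]
    while len(observed) < k:
        best = max(remaining, key=lambda s: min(abs(s - o) for o in observed))
        remaining.remove(best)
        observed.append(best)
    return observed

picked = exploration_exploitation(list(range(20)), k=8)
```

With 20 locations and a budget of k = 8, this sketch spends 4 samples exploring before the bound drops below the threshold, then spreads the remaining 4 samples out over the domain.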
Results

[Figure: RMS error vs. number of observations, and parameter uncertainty vs. number of observations, for IE, ITE, and IGE]

None of the strategies dominates the others; usefulness depends on the application.
Temperature data

IGE: parameter info-gain; ITE: hypothesis testing; IE: implicit exploration.

[Figure: floor plan of the sensor deployment (server room, lab, kitchen, copy/elec, phone/quiet, storage, conference, offices) with 54 numbered sensor locations; coordinates in meters]
Nonstationarity by spatial partitioning

Isotropic GP for each region, weighted by region membership (a spatially varying linear combination).

[Figure: stationary fit vs. nonstationary fit]

Problem: the parameter space grows exponentially in the number of regions!
Solution: a variational approximation (BK-style) allows efficient approximate inference (details in paper).
Results on river data

Nonstationary model + active learning lead to lower RMS error.

[Figure: RMS error vs. number of observations for IE nonstationary, IE isotropic, and a priori nonstationary; second panel shows sample locations along the transect coordinates (m), larger bars = later sample, with per-region annotations (14.54/0.04), (13.10/0.03), (13.82/0.10), (14.49/0.02)]
Results on temperature data

IE reduces error most quickly; IGE reduces parameter entropy most quickly.

[Figure: RMS error vs. number of observations (IE isotropic, IGE nonstationary, IE nonstationary, random nonstationary), and parameter uncertainty vs. number of observations (IE nonstationary, IGE nonstationary)]
Conclusions

Nonmyopic approach towards active learning in GPs:
  If parameters are known, the greedy algorithm achieves near-optimal exploitation.
  If parameters are unknown, perform exploration:
    Implicit exploration
    Explicit, using information gain
    Explicit, using hypothesis tests, with logarithmic sample complexity bounds!
  Each exploration strategy has its own advantages.
  Can use the bound to compute a stopping criterion.
Presented extensive evaluation on real-world data.