machine learning summer school 2016

data science @ NYT Chris Wiggins Aug 8/9, 2016


Page 1: Machine Learning Summer School 2016

data science @ NYT

Chris Wiggins

Aug 8/9, 2016

Page 2: Machine Learning Summer School 2016

Outline

1. overview of DS@NYT

Page 3: Machine Learning Summer School 2016

Outline

1. overview of DS@NYT
2. prediction + supervised learning

Page 4: Machine Learning Summer School 2016

Outline

1. overview of DS@NYT
2. prediction + supervised learning
3. prescription, causality, and RL

Page 5: Machine Learning Summer School 2016

Outline

1. overview of DS@NYT
2. prediction + supervised learning
3. prescription, causality, and RL
4. description + inference

Page 6: Machine Learning Summer School 2016

Outline

1. overview of DS@NYT
2. prediction + supervised learning
3. prescription, causality, and RL
4. description + inference
5. (if interest) designing data products

Page 7: Machine Learning Summer School 2016

0. Thank the organizers!

Figure 1: prepping slides until last minute

Page 8: Machine Learning Summer School 2016

Lecture 1: overview of ds@NYT

Page 9: Machine Learning Summer School 2016

data science @ The New York Times

[email protected]

[email protected]

@chrishwiggins

references: http://bit.ly/stanf16

Page 10: Machine Learning Summer School 2016

data science @ The New York Times

Page 11: Machine Learning Summer School 2016

data science: searches

Page 12: Machine Learning Summer School 2016

data science: mindset & toolset

drew conway, 2010

Page 13: Machine Learning Summer School 2016

modern history:

2009

Page 14: Machine Learning Summer School 2016

modern history:

2009

Page 15: Machine Learning Summer School 2016

biology: 1892 vs. 1995

Page 16: Machine Learning Summer School 2016

biology: 1892 vs. 1995

biology changed for good.

Page 17: Machine Learning Summer School 2016

biology: 1892 vs. 1995

new toolset, new mindset

Page 18: Machine Learning Summer School 2016

genetics: 1837 vs. 2012

ML toolset; data science mindset

Page 19: Machine Learning Summer School 2016

genetics: 1837 vs. 2012

Page 20: Machine Learning Summer School 2016

genetics: 1837 vs. 2012

ML toolset; data science mindset

arxiv.org/abs/1105.5821 ; github.com/rajanil/mkboost

Page 21: Machine Learning Summer School 2016

data science: mindset & toolset

Page 22: Machine Learning Summer School 2016

1851

Page 23: Machine Learning Summer School 2016

news: 20th century

church state

Page 24: Machine Learning Summer School 2016

church

Page 25: Machine Learning Summer School 2016

church

Page 26: Machine Learning Summer School 2016

church

Page 27: Machine Learning Summer School 2016

news: 20th century

church state

Page 28: Machine Learning Summer School 2016

news: 21st century

church state

data

Page 29: Machine Learning Summer School 2016

1851 1996

newspapering: 1851 vs. 1996

Page 30: Machine Learning Summer School 2016

example:

millions of views per hour (2015)

Page 31: Machine Learning Summer School 2016
Page 32: Machine Learning Summer School 2016
Page 33: Machine Learning Summer School 2016

"...social activities generate large quantities of potentially valuable data...The data were not generated for the

purpose of learning; however, the potential for learning is great’’

Page 34: Machine Learning Summer School 2016

"...social activities generate large quantities of potentially valuable data...The data were not generated for the

purpose of learning; however, the potential for learning is great" - J. Chambers, Bell Labs, 1993, "GLS"

Page 35: Machine Learning Summer School 2016

data science: the web

Page 36: Machine Learning Summer School 2016

data science: the web

is your “online presence”

Page 37: Machine Learning Summer School 2016

data science: the web

is a microscope

Page 38: Machine Learning Summer School 2016

data science: the web

is an experimental tool

Page 39: Machine Learning Summer School 2016

data science: the web

is an optimization tool

Page 40: Machine Learning Summer School 2016

1851 1996

newspapering: 1851 vs. 1996 vs. 2008

2008

Page 41: Machine Learning Summer School 2016

“a startup is a temporary organization in search of a

repeatable and scalable business model” —Steve Blank

Page 42: Machine Learning Summer School 2016

every publisher is now a startup

Page 43: Machine Learning Summer School 2016

every publisher is now a startup

Page 44: Machine Learning Summer School 2016

every publisher is now a startup

Page 45: Machine Learning Summer School 2016
Page 46: Machine Learning Summer School 2016

every publisher is now a startup

Page 47: Machine Learning Summer School 2016

news: 21st century

church state

data

Page 48: Machine Learning Summer School 2016

news: 21st century

church state

data

Page 49: Machine Learning Summer School 2016

learnings

Page 50: Machine Learning Summer School 2016

learnings

- predictive modeling
- descriptive modeling
- prescriptive modeling

Page 51: Machine Learning Summer School 2016

(actually ML, shhhh…)

- (supervised learning)
- (unsupervised learning)
- (reinforcement learning)

Page 52: Machine Learning Summer School 2016

(actually ML, shhhh…)

h/t michael littman

Page 53: Machine Learning Summer School 2016

(actually ML, shhhh…)

h/t michael littman

Supervised Learning

Reinforcement Learning

Unsupervised Learning

(reports)

Page 54: Machine Learning Summer School 2016

learnings

- predictive modeling
- descriptive modeling
- prescriptive modeling

cf. modelingsocialdata.org

Page 55: Machine Learning Summer School 2016

stats.stackexchange.com

Page 56: Machine Learning Summer School 2016

from “are you a bayesian or a frequentist”

—michael jordan

L = ∑_{i=1}^{N} φ(y_i f(x_i; β)) + λ‖β‖

Page 57: Machine Learning Summer School 2016

predictive modeling, e.g.,

cf. modelingsocialdata.org

Page 58: Machine Learning Summer School 2016

predictive modeling, e.g.,

“the funnel”

cf. modelingsocialdata.org

Page 59: Machine Learning Summer School 2016

interpretable predictive modeling

super cool stuff

cf. modelingsocialdata.org

Page 60: Machine Learning Summer School 2016

interpretable predictive modeling

super cool stuff

cf. modelingsocialdata.org

arxiv.org/abs/q-bio/0701021

Page 61: Machine Learning Summer School 2016

optimization & learning, e.g.,

"How The New York Times Works", Popular Mechanics, 2015

Page 62: Machine Learning Summer School 2016

optimization & prediction, e.g.,

(some models)

(some moneys)

“newsvendor problem,” literally (+prediction+experiment)

Page 63: Machine Learning Summer School 2016

recommendation as inference

Page 64: Machine Learning Summer School 2016

recommendation as inference

bit.ly/AlexCTM

Pages 65-68: Machine Learning Summer School 2016

descriptive modeling, e.g.,

"segments"

argmax_z p(z|x) = 14

"baby boomer"

cf. modelingsocialdata.org

Page 69: Machine Learning Summer School 2016

- descriptive data product

cf. daeilkim.com

Page 70: Machine Learning Summer School 2016

descriptive modeling, e.g.,

cf. daeilkim.com ; import bnpy

Page 71: Machine Learning Summer School 2016

modeling your audience

bit.ly/Hughes-Kim-Sudderth-AISTATS15

Page 72: Machine Learning Summer School 2016

modeling your audience

(optimization, ultimately)

Page 73: Machine Learning Summer School 2016

also allows insight+targeting as inference

modeling your audience

Page 74: Machine Learning Summer School 2016

prescriptive modeling

Page 75: Machine Learning Summer School 2016

prescriptive modeling

Page 76: Machine Learning Summer School 2016

prescriptive modeling

Page 77: Machine Learning Summer School 2016

“off policy value estimation”

(cf. “causal effect estimation”)

cf. Langford `08-`16;

Horvitz & Thompson `52;

Holland `86

Page 78: Machine Learning Summer School 2016

“off policy value estimation”

(cf. “causal effect estimation”)

Vapnik’s razor

“ When solving a (learning) problem of interest,

do not solve a more complex problem as an

intermediate step.”

Page 79: Machine Learning Summer School 2016

prescriptive modeling

cf. modelingsocialdata.org

Page 80: Machine Learning Summer School 2016

prescriptive modeling

aka “A/B testing”;

RCT

cf. modelingsocialdata.org

Page 81: Machine Learning Summer School 2016

[diagram: Reporting, Learning, Test; aka "A/B testing"; business as usual (esp. predictive)]

Some of the most recognizable personalization in our service is the collection of "genre" rows. … Members connect with these rows so well that we measure an increase in member retention by placing the most tailored rows higher on the page instead of lower.

cf. modelingsocialdata.org

prescriptive modeling: from A/B to….

Page 82: Machine Learning Summer School 2016

real-time A/B -> “bandits”

GOOG blog:

cf. modelingsocialdata.org

Pages 83-86: Machine Learning Summer School 2016

prescriptive modeling, e.g.,

leverage methods which are predictive yet performant

Page 87: Machine Learning Summer School 2016

NB: data-informed, not data-driven

Page 88: Machine Learning Summer School 2016

predicting views/cascades: doable?

KDD 09: how many people are online?

Page 89: Machine Learning Summer School 2016

predicting views/cascades: doable?

WWW 14: FB shares

Page 90: Machine Learning Summer School 2016

predicting views/cascades: features?

Page 91: Machine Learning Summer School 2016

predicting views/cascades: features?

Page 92: Machine Learning Summer School 2016

predicting views/cascades: doable?

WWW 16: TWIT RT’s

Page 93: Machine Learning Summer School 2016

predicting views/cascades: doable?

Pages 94-95: Machine Learning Summer School 2016

[diagram: descriptive / predictive / prescriptive mapped onto Reporting, Learning, Test, Optimizing, Explore]

Page 96: Machine Learning Summer School 2016

things:

what does DS team deliver?

- build data product
- build APIs
- impact roadmaps

Page 97: Machine Learning Summer School 2016

data science @ The New York Times

[email protected]

[email protected]

@chrishwiggins

references: http://bit.ly/stanf16

Page 98: Machine Learning Summer School 2016
Page 99: Machine Learning Summer School 2016

Lecture 2: predictive modeling @ NYT

Page 100: Machine Learning Summer School 2016

desc/pred/pres

Figure 2: desc/pred/pres

caveat: difference between observation and experiment. why?

Page 101: Machine Learning Summer School 2016

blossom example

Figure 3: Reminder: Blossom

Page 102: Machine Learning Summer School 2016

blossom + boosting (‘exponential’)

Figure 4: Reminder: Surrogate Loss Functions

Pages 103-108: Machine Learning Summer School 2016

tangent: logistic function as surrogate loss function

define f(x) ≡ log[p(y = 1|x)/p(y = −1|x)] ∈ ℝ

p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−y f(x)))

−log₂ p(y_1^N) = ∑_i log₂(1 + e^(−y_i f(x_i))) ≡ ∑_i ℓ(y_i f(x_i))

ℓ″ > 0, ℓ(μ) > 1[μ < 0] ∀ μ ∈ ℝ

∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classification (though not strongly convex, cf. Yoram's talk)

but ∑_i log₂(1 + e^(−y_i wᵀh(x_i))) is not as easy as ∑_i e^(−y_i wᵀh(x_i))
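To make the surrogate-loss point concrete, here is a minimal numerical sketch (not from the slides): at a few margins μ = y f(x), both the logistic loss log₂(1 + e^(−μ)) and the exponential loss e^(−μ) are convex upper bounds on the 0-1 loss 1[μ < 0], but the exponential surrogate punishes large negative margins far more aggressively.

```python
import numpy as np

# margin mu = y * f(x): positive means the example is classified correctly
mu = np.linspace(-3, 3, 7)

zero_one = (mu < 0).astype(float)        # 1[mu < 0]
logistic = np.log2(1.0 + np.exp(-mu))    # log_2(1 + e^{-mu}): the negative log-likelihood
exponential = np.exp(-mu)                # e^{-mu}: the boosting surrogate

for m, z, l, e in zip(mu, zero_one, logistic, exponential):
    # both surrogates upper-bound the 0-1 loss at every margin
    print(f"mu={m:+.1f}  0-1={z:.0f}  logistic={l:.3f}  exp={e:.3f}")
```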

Pages 109-117: Machine Learning Summer School 2016

boosting¹

exponential surrogate loss function, summed over examples:

L[F] = ∑_i exp(−y_i F(x_i)) = ∑_i exp(−y_i ∑_{t′≤t} w_{t′} h_{t′}(x_i)) ≡ L_t(w_t)

Draw h_t ∈ H, a large space of rules s.t. h(x) ∈ {−1, +1}; label y ∈ {−1, +1}

L_{t+1}(w_t; w) ≡ ∑_i d_i^t exp(−y_i w h_{t+1}(x_i)) = ∑_{y_i = h′(x_i)} d_i^t e^(−w) + ∑_{y_i ≠ h′(x_i)} d_i^t e^(+w) ≡ e^(−w) D+ + e^(+w) D−

∴ w_{t+1} = argmin_w L_{t+1}(w) = (1/2) log(D+/D−)

L_{t+1}(w_{t+1}) = 2√(D+ D−) = 2 L_t √(ν+(1 − ν+)), where 0 ≤ ν+ ≡ D+/D = D+/L_t ≤ 1

update example weights d_i^{t+1} = d_i^t e^(∓w)

Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞, …

1 Duchi + Singer "Boosting with structural sparsity" ICML '09
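A minimal sketch of these closed-form updates on made-up data (the decision stumps and all variable names are illustrative, not from the slides): at each round pick a weak rule h, set w = (1/2) log(D+/D−) from the weighted correct and incorrect masses, and reweight each example by e^(∓w).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))  # labels in {-1, +1}

d = np.ones(len(y)) / len(y)     # example weights d_i^t
F = np.zeros(len(y))             # ensemble score F(x_i)

def best_stump(X, y, d):
    """Pick the single-feature threshold rule with smallest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            for sgn in (+1.0, -1.0):
                h = sgn * np.sign(X[:, j] - thr)
                h[h == 0] = sgn
                err = d @ (h != y)
                if best is None or err < best[0]:
                    best = (err, h)
    return best[1]

for t in range(10):
    h = best_stump(X, y, d)
    D_plus, D_minus = d @ (h == y), d @ (h != y)             # weighted correct / incorrect mass
    w = 0.5 * np.log((D_plus + 1e-12) / (D_minus + 1e-12))   # closed-form minimizer
    F += w * h
    d *= np.exp(-w * y * h)                                  # d_i^{t+1} = d_i^t e^{-/+ w}
    print(f"round {t}: training error = {np.mean(np.sign(F) != y):.3f}")
```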

Pages 118-120: Machine Learning Summer School 2016

predicting people

"customer journey" prediction

fun covariates

observational complication v structural models

Page 121: Machine Learning Summer School 2016

predicting people (reminder)

Figure 5: both in science and in the real world, feature analysis guides future experiments

Page 122: Machine Learning Summer School 2016

single copy (reminder)

Figure 6: from Lecture 1

Page 123: Machine Learning Summer School 2016

example in CAR (computer assisted reporting)

Figure 7: Tabuchi article

Pages 124-127: Machine Learning Summer School 2016

example in CAR (computer assisted reporting)

cf. Freedman's "Statistical Models and Shoe Leather"²

Takata airbag fatalities

2219 labeled³ examples from 33,204 comments

cf. Box's "Science and Statistics"⁴

2 Freedman, David A. "Statistical models and shoe leather." Sociological Methodology 21.2 (1991): 291-313.
3 By Hiroko Tabuchi, a Pulitzer winner
4 Science and Statistics, George E. P. Box, Journal of the American Statistical Association, Vol. 71, No. 356 (Dec., 1976), pp. 791-799.

Page 128: Machine Learning Summer School 2016

computer assisted reporting

Impact

Figure 8: impact

Page 129: Machine Learning Summer School 2016

Lecture 3: prescriptive modeling @ NYT

Pages 130-132: Machine Learning Summer School 2016

the natural abstraction

operators⁵ make decisions

faster horses v. cars

general insights v. optimal policies

5 In the sense of business deciders; that said, doctors, including those who operate, also have to make decisions, cf. personalized medicine

Pages 133-135: Machine Learning Summer School 2016

maximizing outcome

the problem: maximizing an outcome over policies…

…while inferring causality from observation

different from predicting outcome in absence of action/policy

Pages 136-140: Machine Learning Summer School 2016

examples

observation is not experiment

e.g., (Med.) smoking hurts vs unhealthy people smoke

e.g., (Med.) affluent get prescribed different meds/treatment

e.g., (life) veterans earn less vs the rich serve less⁶

e.g., (life) admitted to school vs learn at school?

6 Angrist, Joshua D. (1990). "Lifetime Earnings and the Vietnam Draft Lottery: Evidence from Social Security Administrative Records". American Economic Review 80 (3): 313–336.

Pages 141-144: Machine Learning Summer School 2016

reinforcement/machine learning/graphical models

key idea: model the joint p(y, a, x)

explore/exploit: family of joints p_α(y, a, x)

"causality": p_α(y, a, x) = p(y|a, x) p_α(a|x) p(x)  "a causes y"

nomenclature: 'response', 'policy'/'bias', 'prior' above

Page 145: Machine Learning Summer School 2016

in general

Figure 9: policy/bias, response, and prior define the distribution

also describes both the ‘exploration’ and ‘exploitation’ distributions

Page 146: Machine Learning Summer School 2016

randomized controlled trial

Figure 10: RCT: 'bias' removed, random 'policy' (response and prior unaffected)

also Pearl's 'do' distribution: a distribution with "no arrows" pointing to the action variable.

Pages 147-154: Machine Learning Summer School 2016

POISE: calculation, estimation, optimization

POISE: "policy optimization via importance sample estimation"

Monte Carlo importance sampling estimation

aka "off policy estimation"

role of "IPW"

reduction

normalization

hyper-parameter searching

unexpected connection: personalized medicine

Pages 155-160: Machine Learning Summer School 2016

POISE setup and Goal

"a causes y" ⟺ ∃ family p_α(y, a, x) = p(y|a, x) p_α(a|x) p(x)

define off-policy/exploration distribution p−(y, a, x) = p(y|a, x) p−(a|x) p(x)

define exploitation distribution p+(y, a, x) = p(y|a, x) p+(a|x) p(x)

Goal: maximize E+(Y) over p+(a|x) using data drawn from p−(y, a, x).

notation: x, a, y ∈ X, A, Y; i.e., E_α(Y) is not a function of y

Pages 161-166: Machine Learning Summer School 2016

POISE math: IS + Monte Carlo estimation = ISE

i.e., "importance sampling estimation"

E+(Y) ≡ ∑_{y,a,x} y p+(y, a, x)

E+(Y) = ∑_{y,a,x} y p−(y, a, x) (p+(y, a, x)/p−(y, a, x))

E+(Y) = ∑_{y,a,x} y p−(y, a, x) (p+(a|x)/p−(a|x))

E+(Y) ≈ (1/N) ∑_i y_i (p+(a_i|x_i)/p−(a_i|x_i))

let's spend some time getting to know this last equation, the importance sampling estimate of outcome in a "causal model" ("a causes y") among y, a, x
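As a way of getting to know it, here is a minimal simulation sketch (all data and names made up): log (x_i, a_i, y_i) under a known exploration policy p−(a|x), then estimate E+(Y) for a candidate policy p+(a|x) by reweighting each logged outcome with p+(a_i|x_i)/p−(a_i|x_i).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.binomial(1, 0.5, size=N)                 # one binary context feature

# logging (exploration) policy p_-(a=1|x): known here, inferred in general
p_minus = np.where(x == 1, 0.7, 0.3)
a = rng.binomial(1, p_minus)

# the response p(y=1|a, x) is unknown to the estimator; used only to simulate logs
y = rng.binomial(1, 0.2 + 0.5 * (a == x))

# candidate (exploitation) policy p_+(a=1|x): always match the context
p_plus = np.where(x == 1, 1.0, 0.0)

# E_+(Y) ~= (1/N) sum_i y_i * p_+(a_i|x_i) / p_-(a_i|x_i)
prob_logged = np.where(a == 1, p_minus, 1 - p_minus)
prob_target = np.where(a == 1, p_plus, 1 - p_plus)
print("IPS estimate of E_+(Y):", np.mean(y * prob_target / prob_logged), "(0.7 by construction)")
```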

Pages 167-171: Machine Learning Summer School 2016

Observation (cf. Bottou⁷)

factorizing P±(x): P+(x)/P−(x) = Π over factors; common factors cancel, leaving [factors in P+ but not P−]/[factors in P− but not P+]

origin: importance sampling E_q(f) = E_p(f q/p) (as in variational methods)

the "causal" model p_α(y, a, x) = p(y|a, x) p_α(a|x) p(x) helps here

factors left over are the numerator (p+(a|x), to optimize) and the denominator (p−(a|x), to infer if not an RCT)

unobserved confounders will confound us (later)

7 Counterfactual Reasoning and Learning Systems, arXiv:1209.2355

Pages 172-176: Machine Learning Summer School 2016

Reduction (cf. Langford⁸,⁹,¹⁰ ('05, '08, '09))

consider numerator for a deterministic policy: p+(a|x) = 1[a = h(x)]

E+(Y) ∝ ∑_i (y_i/p−(a_i|x_i)) 1[a_i = h(x_i)] ≡ ∑_i w_i 1[a_i = h(x_i)]

Note: 1[c = d] = 1 − 1[c ≠ d]

∴ E+(Y) ∝ constant − ∑_i w_i 1[a_i ≠ h(x_i)]

∴ reduces policy optimization to (weighted) classification

8 Langford & Zadrozny "Relating Reinforcement Learning Performance to Classification Performance" ICML 2005
9 Beygelzimer & Langford "The offset tree for learning with partial labels" (KDD 2009)
10 Tutorial on "Reductions" (including at ICML 2009)
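A minimal sketch of the reduction (synthetic data; the scikit-learn classifier is just an illustrative stand-in for any weighted classifier): each logged (x_i, a_i, y_i) becomes a classification example with target a_i and sample weight w_i = y_i/p−(a_i|x_i), and the fitted classifier is the policy h(x).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any classifier accepting sample weights

rng = np.random.default_rng(2)
N = 50_000
X = rng.normal(size=(N, 2))
p_minus = np.full(N, 0.5)                              # uniform exploration over two actions
a = rng.binomial(1, p_minus)
y = rng.binomial(1, 0.2 + 0.6 * (a == (X[:, 0] > 0)))  # reward favors matching sign(x_0)

# reduction: classify "which action was taken", weighted by w_i = y_i / p_-(a_i|x_i)
w = y / np.where(a == 1, p_minus, 1 - p_minus)
keep = w > 0                                           # zero-weight examples carry no signal
clf = LogisticRegression().fit(X[keep], a[keep], sample_weight=w[keep])

policy = clf.predict(X)                                # deterministic learned policy h(x)
print("learned policy matches sign(x_0):", np.mean(policy == (X[:, 0] > 0)))
```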

Pages 177-183: Machine Learning Summer School 2016

Reduction w/optimistic complication

Prescription ⟺ classification: L = ∑_i w_i 1[a_i ≠ h(x_i)]

weight w_i = y_i/p−(a_i|x_i), inferred or RCT

destroys measure by treating p−(a|x) differently than 1/p−(a|x)

normalize as L ≡ [∑_i y_i 1[a_i ≠ h(x_i)]/p−(a_i|x_i)] / [∑_i 1[a_i ≠ h(x_i)]/p−(a_i|x_i)]

destroys lovely reduction

simply¹¹: L(λ) = ∑_i (y_i − λ) 1[a_i ≠ h(x_i)]/p−(a_i|x_i)

hidden here is a 2nd parameter, in classification, ∴ harder search

11 Suggestion by Dan Hsu
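A minimal sketch (illustrative only) of the three objectives above for a fixed candidate policy h: the plain importance-weighted value, its normalized version, and the λ-shifted mismatch loss, which preserves the weighted-classification form at the price of one extra parameter to search over.

```python
import numpy as np

def policy_value_estimates(x, a, y, p_logged, h, lam=0.0):
    """Off-policy estimates for a deterministic policy h, from logged data."""
    match = (h(x) == a)
    inv_p = 1.0 / p_logged                                          # 1 / p_-(a_i|x_i)
    ips = np.mean(y * match * inv_p)                                # plain importance-sampling value
    normalized = np.sum(y * match * inv_p) / np.sum(match * inv_p)  # normalized value
    shifted = np.mean((y - lam) * (~match) * inv_p)                 # L(lambda), a mismatch loss
    return ips, normalized, shifted

# toy logged data: uniform logging over two actions, reward prefers action 1
rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
a = rng.binomial(1, 0.5, size=x.shape)
y = rng.binomial(1, np.where(a == 1, 0.8, 0.3)).astype(float)
p_logged = np.full(x.shape, 0.5)

always_one = lambda x: np.ones_like(x, dtype=int)                   # candidate policy h(x) = 1
print(policy_value_estimates(x, a, y, p_logged, always_one, lam=0.5))
```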

Pages 184-187: Machine Learning Summer School 2016

POISE punchlines

allows policy planning even with implicit logged exploration data¹²

e.g., two hospital story

"personalized medicine" is also a policy

abundant data available, under-explored IMHO

12 Strehl, Alex, et al. "Learning from logged implicit exploration data." Advances in Neural Information Processing Systems. 2010.

Pages 188-193: Machine Learning Summer School 2016

tangent: causality as told by an economist

different, related goal

they think in terms of ATE/ITE instead of policy

ATE: τ ≡ E_0(Y|a = 1) − E_0(Y|a = 0) ≡ Q(a = 1) − Q(a = 0)

CATE aka Individualized Treatment Effect (ITE): τ(x) ≡ E_0(Y|a = 1, x) − E_0(Y|a = 0, x) ≡ Q(a = 1, x) − Q(a = 0, x)
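A minimal sketch (made-up randomized data) of estimating τ and τ(x) by differencing group means; under randomization the conditional expectations Q(a) and Q(a, x) are just within-arm sample averages.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20_000
x = rng.binomial(1, 0.5, size=N)        # one binary covariate
a = rng.binomial(1, 0.5, size=N)        # randomized treatment (an RCT)
y = rng.binomial(1, 0.3 + 0.2 * a * x)  # treatment helps only the x = 1 stratum

ate = y[a == 1].mean() - y[a == 0].mean()                  # tau = Q(a=1) - Q(a=0)
cate = {xv: y[(a == 1) & (x == xv)].mean() - y[(a == 0) & (x == xv)].mean()
        for xv in (0, 1)}                                  # tau(x) = Q(1, x) - Q(0, x)
print(f"ATE ~ {ate:.3f} (truth 0.1); CATE ~ {cate} (truth 0.0 and 0.2)")
```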

Pages 194-198: Machine Learning Summer School 2016

Q-note: "generalizing" Monte Carlo w/kernels

MC: E_p(f) = ∑_x p(x) f(x) ≈ (1/N) ∑_{i∼p} f(x_i)

K: p(x) ≈ (1/N) ∑_i K(x|x_i)

⇒ ∑_x p(x) f(x) ≈ (1/N) ∑_i ∑_x f(x) K(x|x_i)

K can be any normalized function, e.g., K(x|x_i) = δ_{x,x_i}, which yields MC.

multivariate: E_p(f) ≈ (1/N) ∑_i ∑_{y,a,x} f(y, a, x) K1(y|y_i) K2(a|a_i) K3(x|x_i)
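A minimal one-dimensional sketch (illustrative) of the kernel generalization: estimate E_p(f) from samples either by plain Monte Carlo (K a delta) or by smoothing each sample with a normalized Gaussian kernel and summing f against the smoothed density on a grid.

```python
import numpy as np

rng = np.random.default_rng(5)
samples = rng.normal(loc=1.0, scale=1.0, size=2_000)   # x_i ~ p
f = lambda x: x ** 2                                   # E_p(f) = 1^2 + 1 = 2

# plain Monte Carlo: K(x|x_i) = delta_{x, x_i}
mc = np.mean(f(samples))

# kernel version: p(x) ~= (1/N) sum_i K(x|x_i) with a normalized Gaussian K
grid = np.linspace(-5.0, 7.0, 1201)
h = 0.2
K = np.exp(-0.5 * ((grid[None, :] - samples[:, None]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
p_hat = K.mean(axis=0)
kernel_est = np.sum(f(grid) * p_hat) * (grid[1] - grid[0])   # sum_x f(x) p_hat(x) dx

print(f"MC: {mc:.3f}   kernel: {kernel_est:.3f}   (target 2.0)")
```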

Pages 199-205: Machine Learning Summer School 2016

Q-note: application w/strata+matching, setup

Helps think about economists' approach:

Q(a, x) ≡ E(Y|a, x) = ∑_y y p(y|a, x) = ∑_y y p−(y, a, x)/(p−(a|x) p(x)) = [1/(p−(a|x) p(x))] ∑_y y p−(y, a, x)

stratify x using z(x) such that ∪z = X and z ∩ z′ = ∅

n(x) = ∑_i 1[z(x_i) = z(x)] = number of points in x's stratum

Ω(x) = ∑_{x′} 1[z(x′) = z(x)] = area of x's stratum

∴ K3(x|x_i) = 1[z(x) = z(x_i)]/Ω(x)

as in MC, K1(y|y_i) = δ_{y,y_i}, K2(a|a_i) = δ_{a,a_i}

Pages 206-215: Machine Learning Summer School 2016

Q-note: application w/strata+matching, payoff

∑_y y p−(y, a, x) ≈ (1/(N Ω(x))) ∑_{a_i = a, z(x_i) = z(x)} y_i

p(x) ≈ (n(x)/N) (1/Ω(x))

∴ Q(a, x) ≈ (1/(p−(a|x) n(x))) ∑_{a_i = a, z(x_i) = z(x)} y_i

"matching" means: choose each z to contain 1 positive example & 1 negative example, so p−(a|x) ≈ 1/2, n(x) = 2

∴ τ(a, x) = Q(a = 1, x) − Q(a = 0, x) = y_1(x) − y_0(x)

z-generalizations: graphs, digraphs, k-NN, "matching"

K-generalizations: continuous a, any metric or similarity you like, …

IMHO underexplored
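A minimal sketch (made-up observational data) of the stratified version: bin x into strata z(x), average outcomes per (stratum, action) to get Q(a, x), and difference to get τ per stratum; with exactly one treated and one control unit per stratum this collapses to the matching estimate y_1(x) − y_0(x).

```python
import numpy as np

rng = np.random.default_rng(6)
N = 30_000
x = rng.uniform(0, 1, size=N)
a = rng.binomial(1, 0.3 + 0.4 * x)               # logging policy p_-(a=1|x) depends on x
y = rng.binomial(1, 0.2 + 0.3 * a + 0.3 * x)     # true effect of a is +0.3 in every stratum

z = np.minimum((x * 10).astype(int), 9)          # stratify x into 10 equal-width bins z(x)

taus, weights = [], []
for stratum in range(10):
    in_z = z == stratum
    q1 = y[in_z & (a == 1)].mean()               # Q(a=1, x) within the stratum
    q0 = y[in_z & (a == 0)].mean()               # Q(a=0, x) within the stratum
    taus.append(q1 - q0)
    weights.append(in_z.mean())                  # p(x in stratum)

print("per-stratum tau(x):", np.round(taus, 3))
print("average over strata:", np.dot(taus, weights))   # recovers ~0.3
```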

Pages 216-218: Machine Learning Summer School 2016

causality, as understood in marketing

a/b testing and RCT

yield optimization

Lorenz curve (vs ROC plots)

Figure 11: Blattberg, Robert C., Byung-Do Kim, and Scott A. Neslin. Database Marketing, Springer New York, 2008

Pages 219-222: Machine Learning Summer School 2016

unobserved confounders vs. "causality" modeling

truth: p_α(y, a, x, u) = p(y|a, x, u) p_α(a|x, u) p(x, u)

but: p+(y, a, x, u) = p(y|a, x, u) p−(a|x) p(x, u)

E+(Y) ≡ ∑_{y,a,x,u} y p+(y, a, x, u) ≈ (1/N) ∑_{i∼p−} y_i p+(a_i|x_i)/p−(a_i|x_i, u_i)

denominator can not be inferred, ignore at your peril
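A minimal simulation sketch (made-up data) of the warning: if the logging policy actually depended on an unobserved u, reweighting by the marginal p−(a|x) instead of the true p−(a|x, u) yields a biased estimate of E+(Y), and nothing in the logged (x, a, y) data flags the problem.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
u = rng.binomial(1, 0.5, size=N)               # unobserved confounder
p_true = np.where(u == 1, 0.9, 0.1)            # true logging policy depends on u
a = rng.binomial(1, p_true)                    # marginally, p_-(a=1|x) looks like 0.5
y = rng.binomial(1, 0.2 + 0.6 * u)             # outcome driven by u, not by a

# target policy: always act (a = 1); its true value is E(Y) = 0.5 since a does nothing
naive = np.mean(y * (a == 1) / 0.5)            # weights use the marginal p_-(a|x) = 0.5
oracle = np.mean(y * (a == 1) / p_true)        # would need the unobservable p_-(a|x, u)
print(f"naive IPS: {naive:.3f}   oracle propensity: {oracle:.3f}   truth: 0.5")
```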

Pages 223-229: Machine Learning Summer School 2016

cautionary tale problem: Simpson's paradox

a: admissions (a=1: admitted, a=0: declined)

x: gender (x=1: female, x=0: male)

lawsuit (1973): .44 = p(a = 1|x = 0) > p(a = 1|x = 1) = .35

'resolved' by Bickel (1975)¹³ (see also Pearl¹⁴)

u: unobserved department they applied to

p(a|x) = ∑_{u=1}^{6} p(a|x, u) p(u|x)

e.g., gender-blind: p(a|1) − p(a|0) = p(a|u) · (p(u|1) − p(u|0))

13 P.J. Bickel, E.A. Hammel and J.W. O'Connell (1975). "Sex Bias in Graduate Admissions: Data From Berkeley". Science 187 (4175): 398–404

14 Pearl, Judea (December 2013). "Understanding Simpson's paradox". UCLA Cognitive Systems Laboratory, Technical Report R-414.
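A minimal numeric sketch of the decomposition with made-up numbers (two departments instead of six): each department admits women at an equal or higher rate, yet the aggregate rate for women is lower because they apply mostly to the harder department.

```python
# illustration of p(a|x) = sum_u p(a|x, u) p(u|x) with invented numbers
# u: department (0 = easy, 1 = hard); x: gender (1 = female, 0 = male)
p_admit = {(0, 0): 0.62, (0, 1): 0.62,          # easy department admits both at 62%
           (1, 0): 0.25, (1, 1): 0.27}          # hard department admits women slightly more
p_dept_given_gender = {0: {0: 0.8, 1: 0.2},     # men mostly apply to the easy department
                       1: {0: 0.3, 1: 0.7}}     # women mostly apply to the hard department

for gender in (0, 1):
    rate = sum(p_admit[(dept, gender)] * p_dept_given_gender[gender][dept]
               for dept in (0, 1))
    print(f"aggregate admit rate for x={gender}: {rate:.3f}")
# per-department rates tie or favor women, but the aggregate favors men
```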

Pages 230-235: Machine Learning Summer School 2016

confounded approach: quasi-experiments + instruments¹⁷

Q: does engagement drive retention? (NYT, NFLX, …)

we don't directly control engagement

nonetheless useful since many things can influence it

Q: does serving in the Vietnam war decrease earnings¹⁵?

US didn't directly control serving in Vietnam, either¹⁶

requires strong assumptions, including a linear model

15 Angrist, Joshua D. "Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records." The American Economic Review (1990): 313-336.

16 cf., George Bush, Donald Trump, Bill Clinton, Dick Cheney…

17 I thank Sinan Aral, MIT Sloan, for bringing this to my attention

Page 236: Machine Learning Summer School 2016

IV: graphical model assumption

Figure 12: independence assumption

Page 237: Machine Learning Summer School 2016

IV: graphical model assumption (sideways)

Figure 13: independence assumption

Pages 238-245: Machine Learning Summer School 2016

IV: review s/OLS/MOM/ (E is the empirical average)

a endogenous

e.g., ∃ u s.t. p(y|a, x, u), p(a|x, u)

linear ansatz: y = βᵀa + ε

if a exogenous (e.g., OLS), use E[Y A_j] = E[βᵀA A_j] + E[ε A_j] (note that E[A_j A_k] gives a square matrix; invert for β)

add instrument x uncorrelated with ε

E[Y X_k] = E[βᵀA X_k] + E[ε] E[X_k]

E[Y] = E[βᵀA] + E[ε] (from ansatz)

C(Y, X_k) = βᵀ C(A, X_k), not an "inversion" problem, requires "two stage regression"

Pages 246-248: Machine Learning Summer School 2016

IV: binary, binary case (aka "Wald estimator")

y = βa + ε

E(Y|x) = β E(A|x) + E(ε), evaluate at x = 0, 1

β = (E(Y|x = 1) − E(Y|x = 0)) / (E(A|x = 1) − E(A|x = 0))

Page 249: Machine Learning Summer School 2016

bandits: obligatory slide

Figure 14: almost all the talks I’ve gone to on bandits have this image

Pages 252-258: Machine Learning Summer School 2016

bandits

wide applicability: humane clinical trials, targeting, …

replace meetings with code

requires software engineering to replace decisions with, e.g., Javascript

most useful if decisions or items get "stale" quickly

less useful for one-off, major decisions to be "interpreted"

examples

ε-greedy (no context, aka 'vanilla', aka 'context-free')

UCB1 (2002) (no context) + LinUCB (with context)

Thompson Sampling (1933)¹⁸,¹⁹,²⁰ (general, with or without context)

18 Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika, 25(3–4):285–294, 1933.

19 AKA "probability matching", "posterior sampling"

20 cf., "Bayesian Bandit Explorer" (link)
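A minimal context-free sketch (Bernoulli arms, made-up rates) of two of the listed algorithms: ε-greedy explores with a fixed small probability forever, while Thompson Sampling keeps a Beta posterior per arm and plays the arm whose sampled mean is largest.

```python
import numpy as np

rng = np.random.default_rng(9)
true_rates = np.array([0.05, 0.04, 0.08])           # unknown click rates of 3 variants

def epsilon_greedy(T=20_000, eps=0.1):
    wins, plays = np.zeros(3), np.zeros(3)
    for _ in range(T):
        if rng.random() < eps or plays.min() == 0:
            arm = rng.integers(3)                    # explore
        else:
            arm = int(np.argmax(wins / plays))       # exploit the empirical best
        wins[arm] += rng.random() < true_rates[arm]
        plays[arm] += 1
    return plays

def thompson(T=20_000):
    alpha, beta = np.ones(3), np.ones(3)             # Beta(1, 1) priors per arm
    for _ in range(T):
        arm = int(np.argmax(rng.beta(alpha, beta)))  # sample from each posterior, act greedily
        reward = rng.random() < true_rates[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha + beta - 2                          # pulls per arm

print("eps-greedy pulls per arm:", epsilon_greedy())
print("Thompson   pulls per arm:", thompson())
```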

Page 270: Machine Learning Summer School 2016

TS: connecting w/ “generative causal modeling” 0

WAS: p(y, x, a) = p(y|x, a) p_α(a|x) p(x). These 3 terms were treated by:

response p(y|a, x): avoid regression/inferring by using importance sampling
policy p_α(a|x): optimize ours, infer theirs (NB: ours was deterministic: p(a|x) = 1[a = h(x)])
prior p(x): either avoid by importance sampling or estimate via kernel methods

In the economics approach we focus on the “treatment effect” τ(. . .) ≡ Q(a = 1, . . .) − Q(a = 0, . . .), where Q(a, . . .) = Σ_y y p(y| . . .)

In Thompson sampling we will generate 1 datum at a time, by asserting a parameterized generative model for p(y|a, x, θ), using a deterministic but averaged policy

Page 282: Machine Learning Summer School 2016

TS: connecting w/ “generative causal modeling” 1

model the true world response function p(y|a, x) parametrically as p(y|a, x, θ*)
(i.e., θ* is the true value of the parameter)21

if you knew θ:

could compute Q(a, x, θ) ≡ Σ_y y p(y|x, a, θ*) directly
then choose h(x; θ) = argmax_a Q(a, x, θ)
inducing policy p(a|x, θ) = 1[a = h(x; θ) = argmax_a Q(a, x, θ)]

idea: use prior data D = {y, a, x}_1^t to define a non-deterministic policy:

p(a|x) = ∫ dθ p(a|x, θ) p(θ|D)
p(a|x) = ∫ dθ 1[a = argmax_a′ Q(a′, x, θ)] p(θ|D)

hold up:

Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?

21Note that θ is a vector, with components for each action.

Page 295: Machine Learning Summer School 2016

TS: connecting w/ “generative causal modeling” 2

Q1: what’s p(θ|D)?

Q2: how am I going to evaluate this integral?

A1: p(θ|D) is definable by choosing a prior p(θ|α) and a likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ | y_1^t, a_1^t, x_1^t, α) ∝ p(y_1^t | a_1^t, x_1^t, θ) p(θ|α) = p(θ|α) Π_t p(y_t | a_t, x_t, θ)
warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here
warning 2: we don’t need a historical record of θ_t. (we used Bayes’ rule, but only in θ and y.)

A2: evaluate the integral by N = 1 Monte Carlo22
take 1 sample “θ_t” of θ from p(θ|D)
a_t = h(x_t; θ_t) = argmax_a Q(a, x_t, θ_t)

22AFAIK it is an open research area what happens when you replace N = 1 with N = N_0 t^(−ν).

Page 305: Machine Learning Summer School 2016

That sounds hard.

No, just general. Let’s do a toy case:

y ∈ {0, 1}, no context x, Bernoulli (coin flipping), keep track of

S_a ≡ number of successes flipping coin a
F_a ≡ number of failures flipping coin a

Then

p(θ|D) ∝ p(θ|α) Π_t p(y_t | a_t, θ)
       = ( Π_a θ_a^(α−1) (1 − θ_a)^(β−1) ) ( Π_t θ_(a_t)^(y_t) (1 − θ_(a_t))^(1−y_t) )
       = Π_a θ_a^(α+S_a−1) (1 − θ_a)^(β+F_a−1)

∴ θ_a ∼ Beta(α + S_a, β + F_a)
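
A minimal sketch of the resulting Beta–Bernoulli Thompson sampling loop; the arm names, the flat prior (α = β = 1), and the simulated success rates below are made up for illustration:

```python
import random

# hypothetical true success rates per arm; unknown to the algorithm
true_rates = {"headline_A": 0.05, "headline_B": 0.11, "headline_C": 0.08}

alpha, beta = 1.0, 1.0                 # Beta prior hyperparameters
S = {a: 0 for a in true_rates}         # S_a: successes for arm a
F = {a: 0 for a in true_rates}         # F_a: failures for arm a

for t in range(10000):
    # N = 1 Monte Carlo: draw one theta_a per arm from Beta(alpha + S_a, beta + F_a) ...
    theta = {a: random.betavariate(alpha + S[a], beta + F[a]) for a in true_rates}
    # ... and act greedily with respect to that single sample
    a_t = max(theta, key=theta.get)

    # observe a Bernoulli reward and update the counts for the played arm
    if random.random() < true_rates[a_t]:
        S[a_t] += 1
    else:
        F[a_t] += 1

print({a: (S[a], F[a]) for a in true_rates})   # plays concentrate on the best arm
```

Note the arm choice is deterministic given the sampled θ_t: the randomness comes only from sampling θ_t from the posterior, i.e., the “deterministic but averaged policy” above.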

Page 306: Machine Learning Summer School 2016

Thompson sampling: results (2011)

Figure 15: Chapelle and Li 2011

Page 307: Machine Learning Summer School 2016

TS: words

Figure 16: from Chapelle and Li 2011

Page 308: Machine Learning Summer School 2016

TS: p-code

Figure 17: from Chapelle and Li 2011

Page 309: Machine Learning Summer School 2016

TS: Bernoulli bandit p-code23

Figure 18: from Chapelle and Li 2011

Page 310: Machine Learning Summer School 2016

TS: Bernoulli bandit p-code (results)

Figure 19: from Chapelle and Li 2011

Page 311: Machine Learning Summer School 2016

UCB1 (2002), p-code

Figure 20: UCB1

from Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem.” Machine Learning 47.2-3 (2002): 235-256.
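
For comparison, a minimal sketch of the standard UCB1 rule (play each arm once, then play the arm maximizing its empirical mean plus sqrt(2 ln t / n_j)); the simulated arm rates are made up:

```python
import math
import random

true_rates = [0.05, 0.11, 0.08]        # hypothetical Bernoulli arms, unknown to UCB1
n = [0] * len(true_rates)              # n_j: number of plays of arm j
total = [0.0] * len(true_rates)        # cumulative reward of arm j

def pull(j):
    return 1.0 if random.random() < true_rates[j] else 0.0

# initialization: play each arm once
for j in range(len(true_rates)):
    total[j] += pull(j)
    n[j] += 1

for t in range(len(true_rates), 10000):   # t ~ total plays so far
    # UCB1 index: empirical mean plus exploration bonus sqrt(2 ln t / n_j)
    ucb = [total[j] / n[j] + math.sqrt(2.0 * math.log(t) / n[j])
           for j in range(len(true_rates))]
    j = max(range(len(true_rates)), key=lambda k: ucb[k])
    total[j] += pull(j)
    n[j] += 1

print(n)   # most plays should go to the arm with the highest rate
```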

Page 312: Machine Learning Summer School 2016

TS: with context

Figure 21: from Chapelle and Li 2011

Page 313: Machine Learning Summer School 2016

LinUCB: UCB with context

Figure 22: LinUCB
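
A minimal sketch of one common LinUCB variant (a per-arm ridge regression with an upper-confidence bonus); the context dimension, the confidence width α, and the simulated linear rewards are all made up for illustration:

```python
import numpy as np

d, K, alpha = 5, 3, 1.0                  # context dimension, number of arms, confidence width
rng = np.random.default_rng(0)
theta_true = rng.normal(size=(K, d))     # hypothetical true weights, unknown to LinUCB

A = [np.eye(d) for _ in range(K)]        # A_a = I + sum of x x^T over plays of arm a
b = [np.zeros(d) for _ in range(K)]      # b_a = sum of r x over plays of arm a

for t in range(5000):
    x = rng.normal(size=d)               # observed context for this round
    scores = []
    for a in range(K):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]         # ridge-regression estimate for arm a
        scores.append(theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x))
    a_t = int(np.argmax(scores))

    r = theta_true[a_t] @ x + rng.normal(scale=0.1)   # simulated reward
    A[a_t] += np.outer(x, x)
    b[a_t] += r * x
```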

Page 314: Machine Learning Summer School 2016

TS: with context (results)

Figure 23: from Chapelle and Li 2011

Page 315: Machine Learning Summer School 2016

Bandits: Regret via Lai and Robbins (1985)

Figure 24: Lai Robbins

Page 316: Machine Learning Summer School 2016

Thompson sampling (1933) and optimality (2013)

Figure 25: TS result

from S. Agrawal, N. Goyal, “Further optimal regret bounds for Thompson Sampling”, AISTATS 2013; see also Agrawal, Shipra, and Navin Goyal. “Analysis of Thompson Sampling for the Multi-armed Bandit Problem.” COLT 2012; and Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. “Thompson sampling: An asymptotically optimal finite-time analysis.” In Algorithmic Learning Theory, pages 199–213. Springer, 2012.

Page 317: Machine Learning Summer School 2016

other ‘Causalities’: structure learning

Figure 26: from Heckerman 1995

D. Heckerman. A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research, March 1995.

Page 321: Machine Learning Summer School 2016

other ‘Causalities’: potential outcomes

model distribution of p(y_i(1), y_i(0), a_i, x_i)
“action” replaced by “observed outcome”
aka Neyman-Rubin causal model: Neyman (’23); Rubin (’74)
see Morgan + Winship24 for connections between frameworks

24Morgan, Stephen L., and Christopher Winship. Counterfactuals and Causal Inference. Cambridge University Press, 2014.

Page 322: Machine Learning Summer School 2016

Lecture 4: descriptive modeling @ NYT

Page 330: Machine Learning Summer School 2016

review: (latent) inference and clustering

what does kmeans mean?

given x_i ∈ R^D
given d: R^D → R^1
assign z_i

generative modeling gives meaning

given p(x|z, θ)
maximize p(x|θ)
output assignment p(z|x, θ)

Page 339: Machine Learning Summer School 2016

actual math

define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log Σ_z P = log E_q[P/q] (cf. importance sampling)
Jensen’s: L ≥ ℒ ≡ E_q log(P/q) = E_q log P + H[q] = −(U − H) = −F
analogy to free energy in physics
alternate optimization on θ and on q
NB: q step gives q(z) = p(z|x, θ)
NB: log P convenient for independent examples w/ exponential families
e.g., GMMs: µ_k ← E[x|z] and σ²_k ← E[(x − µ)²|z] are sufficient statistics
e.g., LDAs: word counts are sufficient statistics
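
A tiny numeric check of the Jensen bound above, using a made-up joint P(z) ≡ p(x, z|θ) over three values of z: any q gives a lower bound on L, and the E-step choice q(z) = p(z|x, θ) makes the bound tight:

```python
import math
import random

P = [0.05, 0.2, 0.1]                     # made-up joint p(x, z|theta) for z = 0, 1, 2
L = math.log(sum(P))                     # log-likelihood log p(x|theta)

# an arbitrary q(z) gives E_q log(P/q) <= L (Jensen's inequality)
q = [random.random() for _ in P]
total = sum(q)
q = [qi / total for qi in q]
elbo = sum(qi * math.log(Pi / qi) for qi, Pi in zip(q, P))

# the E step's choice q(z) = p(z|x, theta) = P(z)/sum(P) makes the bound tight
q_star = [Pi / sum(P) for Pi in P]
elbo_star = sum(qi * math.log(Pi / qi) for qi, Pi in zip(q_star, P))

print(L, elbo, elbo_star)                # elbo <= L; elbo_star == L up to float error
```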

Page 347: Machine Learning Summer School 2016

tangent: more math on GMMs, part 1

Energy U (to be minimized):

−U ≡ E_q log P = Σ_z Σ_i q_i(z) log P(x_i, z_i) ≡ U_x + U_z
−U_x ≡ Σ_z Σ_i q_i(z) log p(x_i|z_i) = Σ_i Σ_z q_i(z) Σ_k 1[z_i = k] log p(x_i|z_i)
define r_ik = Σ_z q_i(z) 1[z_i = k]
−U_x = Σ_{i,k} r_ik log p(x_i|k).
Gaussian25 ⇒ −U_x = Σ_{i,k} r_ik ( −(1/2)(x_i − µ_k)² λ_k + (1/2) ln λ_k − (1/2) ln 2π )
simple to minimize for parameters ϑ = {µ_k, λ_k}

25math is simpler if you work with λ_k ≡ σ⁻²

Page 350: Machine Learning Summer School 2016

tangent: more math on GMMs, part 2

−U_x = Σ_{i,k} r_ik ( −(1/2)(x_i − µ_k)² λ_k + (1/2) ln λ_k − (1/2) ln 2π )

µ_k ← E[x|k]: setting ∂U_x/∂µ_k = 0 gives µ_k Σ_i r_ik = Σ_i r_ik x_i
λ_k⁻¹ ← E[(x − µ_k)²|k]: setting ∂U_x/∂λ_k = 0 gives λ_k⁻¹ Σ_i r_ik = Σ_i r_ik (x_i − µ_k)²
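
A minimal sketch of these updates as an EM loop for a 1-D, two-component GMM; the data-generating means and variances below are made up:

```python
import math
import random

# made-up data from two clusters
data = ([random.gauss(-2.0, 0.7) for _ in range(200)] +
        [random.gauss(3.0, 1.2) for _ in range(200)])

K = 2
pi = [0.5, 0.5]          # mixing weights
mu = [-1.0, 1.0]         # means
lam = [1.0, 1.0]         # precisions lambda_k = 1 / sigma_k^2

def log_gauss(x, m, l):
    # log N(x | m, 1/l), matching the -(1/2)(x-m)^2 l + (1/2) ln l - (1/2) ln 2pi term above
    return -0.5 * (x - m) ** 2 * l + 0.5 * math.log(l) - 0.5 * math.log(2 * math.pi)

for step in range(50):
    # E step: responsibilities r_ik = p(z_i = k | x_i, theta)
    r = []
    for x in data:
        w = [pi[k] * math.exp(log_gauss(x, mu[k], lam[k])) for k in range(K)]
        s = sum(w)
        r.append([wk / s for wk in w])

    # M step: mu_k <- E[x|k], 1/lambda_k <- E[(x - mu_k)^2 | k]
    for k in range(K):
        Nk = sum(ri[k] for ri in r)
        pi[k] = Nk / len(data)
        mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / Nk
        var_k = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / Nk
        lam[k] = 1.0 / var_k

print(pi, mu, [1.0 / l for l in lam])   # should roughly recover the two made-up clusters
```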

Page 360: Machine Learning Summer School 2016

tangent: Gaussians ∈ exponential family27

as before, −U = Σ_{i,k} r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ) · T(x) − A(θ) + B(x) )
e.g., Gaussian case26:

T_1 = x, T_2 = x²
η_1 = µ/σ² = µλ
η_2 = −(1/2)λ = −1/(2σ²)
A = λµ²/2 − (1/2) ln λ
exp(B(x)) = (2π)^(−1/2)

note that in a mixture model, there are separate η (and thus A(η)) for each value of z

26Choosing η(θ) = η is called ‘canonical form’
27NB: Gaussians ∈ exponential family, but GMM ∉ exponential family! (Thanks to Eszter Vértes for pointing out this error in an earlier title.)
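
A quick numeric check of this bookkeeping for a single 1-D Gaussian, with made-up µ and σ²: the canonical form exp(η·T(x) − A + B(x)) should reproduce the usual density:

```python
import math

mu, sigma2 = 1.3, 0.7
lam = 1.0 / sigma2
x = 0.4

eta1 = mu * lam                     # eta_1 = mu / sigma^2 = mu * lambda
eta2 = -0.5 * lam                   # eta_2 = -1 / (2 sigma^2)
A = lam * mu ** 2 / 2 - 0.5 * math.log(lam)
B = -0.5 * math.log(2 * math.pi)    # exp(B) = (2 pi)^(-1/2)

canonical = math.exp(eta1 * x + eta2 * x ** 2 - A + B)   # T_1 = x, T_2 = x^2
usual = math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
print(canonical, usual)             # the two should agree
```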

Page 364: Machine Learning Summer School 2016

tangent: variational joy ∈ exponential family

as before, −U = Σ_{i,k} r_ik ( η_k^T T(x_i) − A(η_k) + B(x_i) )

η_{k,α} solves Σ_i r_ik T_{k,α}(x_i) = (∂A(η_k)/∂η_{k,α}) Σ_i r_ik (canonical)
∴ ∂A(η_k)/∂η_{k,α} ← E[T_{k,α}|k] (canonical)

nice connection w/ physics, esp. mean field theory28

28read MacKay, David JC. Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003 to learn more. Actually you should read it regardless.

Page 369: Machine Learning Summer School 2016

clustering and inference: GMM/k-means case study

generative model gives meaning and optimization
large freedom to choose different optimization approaches

e.g., hard clustering limit
e.g., streaming solutions
e.g., stochastic gradient methods

Page 375: Machine Learning Summer School 2016

general framework: E+M/variational

e.g., GMM + hard clustering gives kmeans
e.g., some favorite applications:

HMM
vbmod: arXiv:0709.3512
ebfret: ebfret.github.io
EDHMM: edhmm.github.io

Page 376: Machine Learning Summer School 2016

example application: LDA+topics

Figure 27: From Blei 2003

Page 377: Machine Learning Summer School 2016

rec engine via CTM 29

Figure 28: From Blei 2011

29

Page 378: Machine Learning Summer School 2016

recall: recommendation via factoring

Figure 29: From Blei 2011

Page 379: Machine Learning Summer School 2016

CTM: combined loss function

Figure 30: From Blei 2011

Page 380: Machine Learning Summer School 2016

CTM: updates for factors

Figure 31: From Blei 2011

Page 381: Machine Learning Summer School 2016

CTM: (via Jensen’s, again) bound on loss

Figure 32: From Blei 2011

Page 382: Machine Learning Summer School 2016

Lecture 5: data product

Page 388: Machine Learning Summer School 2016

data science and design thinking

knowing customer
right tool for right job
practical matters:

munging
data ops
ML in prod

Page 389: Machine Learning Summer School 2016

Thanks!

Thanks MLSS students for your great questions; please contact me (@chrishwiggins, [email protected], or [email protected]) with any questions, comments, or suggestions!