networks and complex data: discussion · networks and complex data: discussion . ... gotas de...

39
Edoardo Airoldi Department of Statistics Harvard University Becker Friedman Institute, University of Chicago, September 24 th 2016 Networks and complex data: Discussion

Upload: doantuyen

Post on 23-Sep-2018

226 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Edoardo Airoldi Department of Statistics

Harvard University

Becker Friedman Institute, University of Chicago, September 24th 2016

Networks and complex data: Discussion

Page 2: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Analysis of network data

Becker Friedman Institute, 2016 2

Page 3: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Remarks

•  What researchers might do in practice

•  Complications arising from a population network

•  What we know and what we do not

Becker Friedman Institute, 2016 3

Page 4: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

A field experiment in Honduras

•  Education on standards of care for newborns (176 villages; 30,000 people; 3,000 target households)

•  Design: 2 target nomination schemes and 8 levels of treatment

•  Short- and long-term causal effects, including on social interactions 2-year post intervention

(N Christakis, J Fowler, D Spielman, R Negon, et al.)

4

Page 5: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Agua Buena

Agua Buena Norte

Agua Caliente

Agua Caliente

Agua Sucia

Agua Zarca

Aldea Nueva

Aldea Nueva

Banaderos

Barbascos

Barbascos Copan

Boca del monte

Buena Vista

Buena Vista Buenos Aires

Campamento 1

Carrizalito 1

Carrizalito 2

Carrizalon

Ceiba Segunda

ChimichalChonco

Colon Jubuco

Corralito

Crucitas

Delicias 1

Delicias 2

Descombros

Dos Quebradas

El Barrial

El Bonete

El Caobal

El Carrizal

El Cedral

El Chilar

El Chilcal

El Chorreron

El Cisne

El Conal

El Cordoncillo

El Eden/La Moya

El Encino

El Florido

El Guayabo

El Jaragua

El Jaral

El Llanetillo

El Llano

El Malcote

El Mirador

El Pedernal

El Pinabeton

El Porvenir 1

El Porvenir 2

El Prado

El Quebracho El Raizal

El Rosario

El Rosario

El Salitron

El sompopero

El Tamarindo

El Templador

El Tesorito

El Tigre

El Transito

El Trapiche

El Triunfo

El Volcan

El Zapote

El Zapote

Gotas de Sangre

Hacienda Grande

Hacienda San Juan

Ingenios

La Arada

La Canteada

La Casita

La Castellona

La Cuchilla

La Cuchilla

La cumbre

La Esperanza

La Esperanza

La Estanzuela

La Hermosura

La Huertona

La Laguna

La Leonita

La Libertad

La Linea/Pinalito

La Pintada

La Ramada 1

La Ramada 2

La Reforma

La Union 2

La Union, OtutaLa Vegona

La Vegona

La Zona

Lancetillal

Las Brisas

Las Delicias

Las Flores

Las Juntas 1

Las Juntas 2

Las Medias

Las Mesas

Las Pavas

Llano la Puerta

Londres

Los Achiotes

Los Amates

Los Arcos

Los Ranchos

mariposal

Mecatal

Mirasol

Mirasolito/Rio Negro

Monte Cristo

Monte los NegrosMoscarronal

Motagua

Mun. San Jeronimo

Naranjales

Naranjito

Nueva Alianza

Nueva Armenia

Nueva Esperanza Sur

Nueva Suyapa

Nuevo San Isidro

Penas 1

Penas 2

Piedras Coloradas

Pinabetal

Pinalito

Pinalito

Plan de Limon

Plan del Danto

Plan El Perico

Plan Grande

Planes de la Brea

Planon de la Libertad

Pueblo viejo

Queseras

Rastrojitos

Rincon del Buey

Rio Amarillo

Rio Blanco

Rio Negro

San Antonio del Norte

San Antonio Miramar

San Antonio Tapesco

San Cristobal

San Francisco

San Isidro 1

San Jeronimo

San Jose Miramar

San Juan

San Manuel

San Miguel

San Rafael

Santa Cruz

Santa Cruz Virginia

Santa Elena

Santa Rosita

Santo DomingoSesesmil 1

Sesesmil 2

Tierra Blanca

Tierra Fria 1

Tierra Fria 2

Union Cedral

Valle Maria

Vara de Cohete

Virginia 1

14.7

14.8

14.9

15.0

15.1

−89.2 −89.1 −89.0 −88.9lon

lat

Dosage

0

0.05

0.1

0.2

0.3

0.5

0.75

1

Treatment

random

friend nomination

Aldeas − Copan, Honduras − Treatment Blocking

Becker Friedman Institute, 2016 5

Page 6: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Networks mapped pre-intervention

•  Trellis for mapping 20+ aspects of social relations (Arguably no measurement error, but missing data)

Becker Friedman Institute, 2016 6

Page 7: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Other illustrative applications

•  LinkedIn –  Labor mobility, employment histories, recruiting –  Network only provides limited / partial view of

professional interactions –  Connections may be interventions of interest

•  Google display ads –  Selective callouts problem (who to invite in auctions) –  Many networks available –  Causal mechanisms are not well developed

Becker Friedman Institute, 2016 7

Page 8: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Remarks

•  What researchers might do in practice

•  Complications arising from a population network

•  What we know and what we do not

Becker Friedman Institute, 2016 8

Page 9: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Potential outcomes / table of science

Table of science Y is N=4 x |Z|=24, with element Yi(Z)

Causal inferential targets are defined as a function of Y Typically we assume Yi(Zi) – Y becomes Nx2 9

Page 10: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Network interference and exposure

•  Given network G, if no interference is untenable, we must assume how ZNi affects Yi(Zi, g(ZNi), rest)

•  Example: “i is exposed if it has 1+ treated neighbors” leads to 4 distinct potential outcomes

10

0

0

00

0

0

1

0

00

0

0

0

1

10

1

0

1

1

10

1

0

Control

Treated

Exposed

Treated & Exposedi

i

i

i

Becker Friedman Institute, 2016

Page 11: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

What fails?

•  ATE and TTE are no longer equal •  Allocations Zj on G with the same (nt,nc) have diffe-

rent configurations of treatment, exposure and control

•  Need randomization schemes that leverage G •  Need to revisit causal interpretation of parameters

11

0

1

1

0

1

0

00

0

0

0

1

0 0

01

0

1

Becker Friedman Institute, 2016

Page 12: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Causal interpretation of parameters

Recall Yiobs = Zi Yi(1) + (1−Zi) Yi(0); assume SUTVA

and additivity of treatment effect Yi(1) = Yi(0) + β •  Slope coefficient in a regression model for Yi

obs

equals the ATE; thus it has a clear interpretation as causal effect defined on table of science Y

In the presence of network interference? The practice is •  State a model for Yi

obs and argue why its parameter(s) capture (aspects of) the causal effect(s) of interest

12 Becker Friedman Institute, 2016

Page 13: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Neighborhood interference assumption (__N__IA)

Convenient to define

And parameters

Additive model: Yi(Z) = αi + βiZi + Γi(ZNi) + ZiΔi(Zni)

Main assumption and model

13 Becker Friedman Institute, 2016

Page 14: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Structural assumptions

(_AN__IA)

(__N_SIA)

(S_N__IA)

(__NA_IA)

Page 15: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Relations among twelve unique models

15 Becker Friedman Institute, 2016

Page 16: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Unambiguous causal interpretation

•  Can now write causal effects of interest in terms of parameters of the 12 additive models for Yi

obs, e.g., ATE = 1/n Σi (Yi(ei) − Yi(0))

= 1/n Σi βi (under NIA)

TTE ≡ 1/n Σi (Yi(1) − Yi(0)) = 1/n Σi (βi + Γi(di) + Δi(di)) (under NIA) = 1/n Σi (βi + γ di) (under SANASIA)

•  We can now spell out assumptions that justify previously published estimators / estimates

16 Becker Friedman Institute, 2016

Page 17: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Remarks

•  What researchers might do in practice

•  Complications arising from a population network

•  What we know and what we do not

Becker Friedman Institute, 2016 17

Page 18: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Complex landscape and set-up

•  Inferential targets: ATE, TTE, AIE, …

•  Assumptions about the network: fixed vs. model, observed pre-intervention or outcome; fully vs. partially observed, with and without errors, formal notion of interference

•  Sampling mechanism; finite vs. infinite population inference; models for the outcomes; observational vs. experimental

•  Treatment allocation strategy; estimator; complications …

18

Page 19: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

What do we know? Not much …

•  Network observed pre-intervention without error –  Fisher tests for interference (beyond first order neighbors) –  Durbin-Wu-Hausman style test SUTVA violations –  Estimation theory (LUE / MIVE) for ATE and failures –  New randomization and rerandomization strategies –  Homophily vs peer influence in observational studies

•  Network observed with error –  Inference from non-ignorable network sampling designs –  Partially revealed interference (network as outcome) –  Condition on a model for the network

19

Page 20: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Analysis of text data

Becker Friedman Institute, 2016 20

Page 21: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Analysis of word counts

•  Statistics and machine learning++

•  Common elements Matrix of word counts W (n documents x v terms) Mixture models (k components, interpreted as ??) Parameters µ are rates of occurrence (v terms x k components)

•  Problem specific Document covariates L (n documents x …; e.g., author(s), publication year, topic annotations by professional editors)

Becker Friedman Institute, 2016 21

Page 22: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Remarks

•  Statistics and machine learning: Classic papers

•  Evaluation and interpretation issues

•  Bringing causality back

Becker Friedman Institute, 2016 22

Page 23: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

An authorship attribution problem

Becker Friedman Institute, 2016 23

Page 24: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Parameterization and priors

24

•  Counts for term v are Poisson with rates (µvH, µv

M) •  Re-parameterize with total and differential rates

σv = µvH + µv

M τv = µv

H / (µvH + µv

M)

•  Priors σv ∝ constant τv = symmetric beta (α1 + α2 σv)

Also see Negative-Binomial (MW) and COM Poisson distributions (Kadane, Shmueli, and co-authors)

Page 25: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Becker Friedman Institute, 2016 25

Page 26: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Remarks

•  Authorship vector L is largely observed •  Clear interpretation of the 2 mixture components •  Evaluation

–  In-sample using agreement between posterior odds of authorship for undisputed papers and Lobs

–  Out-of-sample predictions for Lmis

26

Page 27: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Characterizing topics

27

Page 28: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Basic model

Data: count matrix W, document lengths N Re-parameterize rate matrix β where βvk = µvk / Σv µvk

For document d •  θd ~ Dirichlet (α) where θd is a k x 1 vector •  zd ~ Multinomial (θd,Nd) where zd is a k x 1 vector

•  w'dk ~ Multinomial (β·k, zdk) •  Wdv = Σk w'dkv

•  Place symmetric Dirichlet prior on columns β·k 28

Page 29: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

(Source: Blei, 2012) 29

Page 30: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Remarks

•  Topic vector L is entirely unobserved •  Often unclear interpretation of many of the k mixture

components •  Evaluation

–  Lists of most frequent words –  Predictions for Lmis using cross-validation, held-out log-lik

30

Page 31: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Remarks

•  Statistics and machine learning: Classic papers

•  Evaluation and interpretation issues

•  Bringing causality back

Becker Friedman Institute, 2016 31

Page 32: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Issues with evaluation standards

32

•  Original papers Interpret most frequent words for most frequent topics Supplemental websites to explore the entire model output

•  Follow-up papers Almost exclusively focus on frequency Qualitative and anecdotal evaluations

•  We need exhaustive and quantitative evaluations How to quantify topic diversity and coherence? How to maximize interpretability of components/topics?

Page 33: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Hypotheses and MTurk experiments

1.  Topic summaries based on frequency and exclu-sivity are more interpretable than frequency alone

2.  Regularizing rates by word yields better estimates of FREX scores than regularizing rates by topic

•  However, interpretability is hard to quantify

•  We carry out two experiments on Amazon MTurk that enlist human evaluators to execute a comparative analysis of the interpretability of topic summaries

Becker Friedman Institute, 2016 33

Page 34: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Design of experiments

•  Three strategies to generate topic summaries PCM FREX: Poisson, regularize by word, max FREX

LDA FREQ: Binomial, regularize by topic, most frequent

LDA FREX: Binomial, regularize by topic, max FREX (exclusivity estimated by renormalizing rates post inference)

•  Two tasks: (i) word intrusion, (ii) topic coherence

•  Top-5 words from models with 10, 25, 50, 100 topics

•  400 turkers for each model size; 2 replicates 34

Page 35: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Becker Friedman Institute, 2016 35

Page 36: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

36

Page 37: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Remarks

•  Statistics and machine learning: Classic papers

•  Evaluation and interpretation issues

•  Bringing causality back

Becker Friedman Institute, 2016 37

Page 38: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

A simple idea

•  Make the topic proportions and the rates a function of document level covariates

•  JASA paper with Molly Roberts, Brandon Stewart

•  R package stm, by Molly, Brandon and Dustin Tingley

•  What causal questions can we answer leveraging text data? Text as outcome or covariates.

Becker Friedman Institute, 2016 38

Page 39: Networks and complex data: Discussion · Networks and complex data: Discussion . ... Gotas de Sangre ... La Canteada La Casita La Castellona La Cuchilla La Cuchilla La cumbre La Esperanza

Acknowledgements and pointers

Rubin, Basse, Karwa, Sussman, Toulis (Harvard), Aral (MIT), Imbens (Stanford), Christakis (Yale), Manski (Northwestern), Azari, Lambert (Google), Agarwal, Xu, Ghosh (LinkedIn).

Some fundamental ideas for causal inference on large networks. (Donald B Rubin) Soon on PNAS. Optimal design of experiments in the presence of network-correlated outcomes. (Guillaume Basse) arXiv paper no. 1507.00803 Optimal design of experiments in the presence of network interference. (with discussion) Soon on EJS. Papers on text are on the JASA’s “latest articles” page