networks and complex data: discussion · networks and complex data: discussion . ... gotas de...

Edoardo Airoldi Department of Statistics

Harvard University

Becker Friedman Institute, University of Chicago, September 24th 2016

Networks and complex data: Discussion

Analysis of network data

Becker Friedman Institute, 2016 2

Remarks

•  What researchers might do in practice

•  Complications arising from a population network

•  What we know and what we do not


A field experiment in Honduras

•  Education on standards of care for newborns (176 villages; 30,000 people; 3,000 target households)

•  Design: 2 target nomination schemes and 8 levels of treatment

•  Short- and long-term causal effects, including on social interactions 2-year post intervention

(N Christakis, J Fowler, D Spielman, R Negon, et al.)

4

Agua Buena

Agua Buena Norte

Agua Caliente

Agua Caliente

Agua Sucia

Agua Zarca

Aldea Nueva

Aldea Nueva

Banaderos

Barbascos

Barbascos Copan

Boca del monte

Buena Vista

Buena Vista Buenos Aires

Campamento 1

Carrizalito 1

Carrizalito 2

Carrizalon

Ceiba Segunda

ChimichalChonco

Colon Jubuco

Corralito

Crucitas

Delicias 1

Delicias 2

Descombros

Dos Quebradas

El Barrial

El Bonete

El Caobal

El Carrizal

El Cedral

El Chilar

El Chilcal

El Chorreron

El Cisne

El Conal

El Cordoncillo

El Eden/La Moya

El Encino

El Florido

El Guayabo

El Jaragua

El Jaral

El Llanetillo

El Llano

El Malcote

El Mirador

El Pedernal

El Pinabeton

El Porvenir 1

El Porvenir 2

El Prado

El Quebracho El Raizal

El Rosario

El Rosario

El Salitron

El sompopero

El Tamarindo

El Templador

El Tesorito

El Tigre

El Transito

El Trapiche

El Triunfo

El Volcan

El Zapote

El Zapote

Gotas de Sangre

Hacienda Grande

Hacienda San Juan

Ingenios

La Arada

La Canteada

La Casita

La Castellona

La Cuchilla

La Cuchilla

La cumbre

La Esperanza

La Esperanza

La Estanzuela

La Hermosura

La Huertona

La Laguna

La Leonita

La Libertad

La Linea/Pinalito

La Pintada

La Ramada 1

La Ramada 2

La Reforma

La Union 2

La Union, OtutaLa Vegona

La Vegona

La Zona

Lancetillal

Las Brisas

Las Delicias

Las Flores

Las Juntas 1

Las Juntas 2

Las Medias

Las Mesas

Las Pavas

Llano la Puerta

Londres

Los Achiotes

Los Amates

Los Arcos

Los Ranchos

mariposal

Mecatal

Mirasol

Mirasolito/Rio Negro

Monte Cristo

Monte los NegrosMoscarronal

Motagua

Mun. San Jeronimo

Naranjales

Naranjito

Nueva Alianza

Nueva Armenia

Nueva Esperanza Sur

Nueva Suyapa

Nuevo San Isidro

Penas 1

Penas 2

Piedras Coloradas

Pinabetal

Pinalito

Pinalito

Plan de Limon

Plan del Danto

Plan El Perico

Plan Grande

Planes de la Brea

Planon de la Libertad

Pueblo viejo

Queseras

Rastrojitos

Rincon del Buey

Rio Amarillo

Rio Blanco

Rio Negro

San Antonio del Norte

San Antonio Miramar

San Antonio Tapesco

San Cristobal

San Francisco

San Isidro 1

San Jeronimo

San Jose Miramar

San Juan

San Manuel

San Miguel

San Rafael

Santa Cruz

Santa Cruz Virginia

Santa Elena

Santa Rosita

Santo DomingoSesesmil 1

Sesesmil 2

Tierra Blanca

Tierra Fria 1

Tierra Fria 2

Union Cedral

Valle Maria

Vara de Cohete

Virginia 1

14.7

14.8

14.9

15.0

15.1

−89.2 −89.1 −89.0 −88.9lon

lat

Dosage

0

0.05

0.1

0.2

0.3

0.5

0.75

1

Treatment

random

friend nomination

Aldeas − Copan, Honduras − Treatment Blocking


Networks mapped pre-intervention

•  Trellis for mapping 20+ aspects of social relations (Arguably no measurement error, but missing data)


Other illustrative applications

•  LinkedIn –  Labor mobility, employment histories, recruiting –  Network only provides limited / partial view of

professional interactions –  Connections may be interventions of interest

•  Google display ads –  Selective callouts problem (who to invite in auctions) –  Many networks available –  Causal mechanisms are not well developed


Remarks





Potential outcomes / table of science

Table of science Y is N=4 x |Z|=24, with element Yi(Z)

Causal inferential targets are defined as a function of Y Typically we assume Yi(Zi) – Y becomes Nx2 9

Network interference and exposure

•  Given network G, if no interference is untenable, we must assume how ZNi affects Yi(Zi, g(ZNi), rest)

•  Example: “i is exposed if it has 1+ treated neighbors” leads to 4 distinct potential outcomes

10

0

0

00

0

0

1

0

00

0

0

0

1

10

1

0

1

1

10

1

0

Control

Treated

Exposed

Treated & Exposedi

i

i

i

Becker Friedman Institute, 2016

What fails?

•  ATE and TTE are no longer equal •  Allocations Zj on G with the same (nt,nc) have diffe-

rent configurations of treatment, exposure and control

•  Need randomization schemes that leverage G •  Need to revisit causal interpretation of parameters

11

0

1

1

0

1

0

00

0

0

0

1

0 0

01

0

1

Becker Friedman Institute, 2016

Causal interpretation of parameters

Recall Yiobs = Zi Yi(1) + (1−Zi) Yi(0); assume SUTVA

and additivity of treatment effect Yi(1) = Yi(0) + β •  Slope coefficient in a regression model for Yi

obs

equals the ATE; thus it has a clear interpretation as causal effect defined on table of science Y

In the presence of network interference? The practice is •  State a model for Yi

obs and argue why its parameter(s) capture (aspects of) the causal effect(s) of interest

12 Becker Friedman Institute, 2016

Neighborhood interference assumption (__N__IA)

Convenient to define

And parameters

Additive model: Yi(Z) = αi + βiZi + Γi(ZNi) + ZiΔi(Zni)

Main assumption and model


Structural assumptions

(_AN__IA)

(__N_SIA)

(S_N__IA)

(__NA_IA)

Relations among twelve unique models


Unambiguous causal interpretation

•  Can now write causal effects of interest in terms of parameters of the 12 additive models for Yi

obs, e.g., ATE = 1/n Σi (Yi(ei) − Yi(0))

= 1/n Σi βi (under NIA)

TTE ≡ 1/n Σi (Yi(1) − Yi(0)) = 1/n Σi (βi + Γi(di) + Δi(di)) (under NIA) = 1/n Σi (βi + γ di) (under SANASIA)

•  We can now spell out assumptions that justify previously published estimators / estimates


Remarks





Complex landscape and set-up

•  Inferential targets: ATE, TTE, AIE, …

•  Assumptions about the network: fixed vs. model, observed pre-intervention or outcome; fully vs. partially observed, with and without errors, formal notion of interference

•  Sampling mechanism; finite vs. infinite population inference; models for the outcomes; observational vs. experimental

•  Treatment allocation strategy; estimator; complications …

18

What do we know? Not much …

•  Network observed pre-intervention without error –  Fisher tests for interference (beyond first order neighbors) –  Durbin-Wu-Hausman style test SUTVA violations –  Estimation theory (LUE / MIVE) for ATE and failures –  New randomization and rerandomization strategies –  Homophily vs peer influence in observational studies

•  Network observed with error –  Inference from non-ignorable network sampling designs –  Partially revealed interference (network as outcome) –  Condition on a model for the network

19

Analysis of text data


Analysis of word counts

•  Statistics and machine learning++

•  Common elements Matrix of word counts W (n documents x v terms) Mixture models (k components, interpreted as ??) Parameters µ are rates of occurrence (v terms x k components)

•  Problem specific Document covariates L (n documents x …; e.g., author(s), publication year, topic annotations by professional editors)


Remarks

•  Statistics and machine learning: Classic papers

•  Evaluation and interpretation issues

•  Bringing causality back


An authorship attribution problem


Parameterization and priors

24

•  Counts for term v are Poisson with rates (µvH, µv

M) •  Re-parameterize with total and differential rates

σv = µvH + µv

M τv = µv

H / (µvH + µv

M)

•  Priors σv ∝ constant τv = symmetric beta (α1 + α2 σv)

Also see Negative-Binomial (MW) and COM Poisson distributions (Kadane, Shmueli, and co-authors)

Remarks

•  Authorship vector L is largely observed •  Clear interpretation of the 2 mixture components •  Evaluation

–  In-sample using agreement between posterior odds of authorship for undisputed papers and Lobs

–  Out-of-sample predictions for Lmis

26

Characterizing topics

27

Basic model

Data: count matrix W, document lengths N Re-parameterize rate matrix β where βvk = µvk / Σv µvk

For document d •  θd ~ Dirichlet (α) where θd is a k x 1 vector •  zd ~ Multinomial (θd,Nd) where zd is a k x 1 vector

•  w'dk ~ Multinomial (β·k, zdk) •  Wdv = Σk w'dkv

•  Place symmetric Dirichlet prior on columns β·k 28

(Source: Blei, 2012) 29

Remarks

•  Topic vector L is entirely unobserved •  Often unclear interpretation of many of the k mixture

components •  Evaluation

–  Lists of most frequent words –  Predictions for Lmis using cross-validation, held-out log-lik

30

Remarks





Issues with evaluation standards

32

•  Original papers Interpret most frequent words for most frequent topics Supplemental websites to explore the entire model output

•  Follow-up papers Almost exclusively focus on frequency Qualitative and anecdotal evaluations

•  We need exhaustive and quantitative evaluations How to quantify topic diversity and coherence? How to maximize interpretability of components/topics?

Hypotheses and MTurk experiments

1.  Topic summaries based on frequency and exclu-sivity are more interpretable than frequency alone

2.  Regularizing rates by word yields better estimates of FREX scores than regularizing rates by topic

•  However, interpretability is hard to quantify

•  We carry out two experiments on Amazon MTurk that enlist human evaluators to execute a comparative analysis of the interpretability of topic summaries


Design of experiments

•  Three strategies to generate topic summaries PCM FREX: Poisson, regularize by word, max FREX

LDA FREQ: Binomial, regularize by topic, most frequent

LDA FREX: Binomial, regularize by topic, max FREX (exclusivity estimated by renormalizing rates post inference)

•  Two tasks: (i) word intrusion, (ii) topic coherence

•  Top-5 words from models with 10, 25, 50, 100 topics

•  400 turkers for each model size; 2 replicates 34

Remarks





A simple idea

•  Make the topic proportions and the rates a function of document level covariates

•  JASA paper with Molly Roberts, Brandon Stewart

•  R package stm, by Molly, Brandon and Dustin Tingley

•  What causal questions can we answer leveraging text data? Text as outcome or covariates.


Acknowledgements and pointers

Rubin, Basse, Karwa, Sussman, Toulis (Harvard), Aral (MIT), Imbens (Stanford), Christakis (Yale), Manski (Northwestern), Azari, Lambert (Google), Agarwal, Xu, Ghosh (LinkedIn).

Some fundamental ideas for causal inference on large networks. (Donald B Rubin) Soon on PNAS. Optimal design of experiments in the presence of network-correlated outcomes. (Guillaume Basse) arXiv paper no. 1507.00803 Optimal design of experiments in the presence of network interference. (with discussion) Soon on EJS. Papers on text are on the JASA’s “latest articles” page

networks and complex data: discussion · networks and complex data: discussion . ... gotas de...

Documents