individualized fusion learning (ifusion) for ......personalized (precision) medicine/individualized...

Individualized Fusion Learning (iFusion) for Individualized Inference

Min-ge Xie, Department of Statistics, Rutgers University

Joint work with Jieli Shen and Regina Liu

IMA Workshop on Precision Medicine, Minneapolis/St. Paul, Minnesota, USA; September 14-16, 2017

Research supported in part by grants from NSF and Dun&Bradstreet

Big data, heterogeneity & fusion learning

Today, the integration of computer technology into science and daily life has enabled the collection of big data across all fields

This imposes difficulties/challenges (and also leads to opportunities)

– Memory/storage issues: too large to fit into a single computer or a single site.

– Computing issues: too expensive to perform any computationally intensive analysis

– Statistical issues: heterogeneity (in designs, information, & more), sparsity, non-conventional (e.g., image/voice/text/network) data, missing data, random/stochastic components, etc.

My task today – Introduce an iFusion approach, which is an example of how to conduct fusion learning and process information in big data using a new statistical inference tool, known as confidence distributions (CDs).

Illustration Example: iFusion with big data

Red = Interest; Blue = Clique (similar ones)

Personalized (precision) medicine/individualized inference

◦ Bias-variance tradeoff

iFusion: Summarize ind. info. in CDs; form ‘clique’ (ties/near ties); combine info in clique – theoretically sound; division of labor

◦ Computationally feasible for big data (vs. conv. hierarchical/mixture/non-parametric Bayesian methods)

Illustrative Example: iFusion with big data

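Simulation study: iFusion of big data with no subgroup/clustering structures

Setting: Model $Y_{ik} = \alpha_k + \beta_k x_{ik} + N(0, 1)$, for $i = 1, \dots, n_k$, $k = 1, \dots, K$

$$\theta_k = (\alpha_k, \beta_k) = \left( R\cos\left(\left[\tfrac{k-1}{5}\right]\tfrac{2\pi}{1200}\right) + \frac{U(-1,1)}{n_k},\ R\sin\left(\left[\tfrac{k-1}{5}\right]\tfrac{2\pi}{1200}\right) + \frac{U(-1,1)}{n_k} \right)$$

Size/Target: K × n_k = 6000 × 40 = 0.24 million observations; Target: Individual-1500 (say)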

Chain structure: No subgroups

Figure: Parameter values (α_k, β_k) (left) and simulated samples (x_ik, y_ik) (right), i = 1, ..., 50, k = 1, ..., 6000, with target individual in blue and its clique in yellow.

Question: Can we borrow information from individuals whose θ is the same as or similar to θ_1500?

Fusion learning

Fusion learning refers to learning from different studies/sources (or

different parts of a single study) that leads to more effective inference

and prediction than any individual study/part alone.

– Such a learning methodology is of vital importance, especially in light of the trove of data collected routinely from various sources in almost all aspects of the real world and at all times!

Fusion learning is concerned with four V’s –

– In the data science era, we have an information explosion with big data of three Vs (Volume, Velocity, Variety!) from different databases, different data sources, different labs, ...

– What do we gain from combining inferences (fusion learning)? Validity (the 4th V) + enhanced/strengthened overall inference

◦ Meta-analysis is one type of fusion learning, although fusion learning is much broader.

Confidence distribution is a useful tool for fusion learning/meta-analysis.

Introduction to confidence distribution (CD)

Statistical inference (Parameter estimation):

Point estimate

Interval estimate

Distribution estimate (e.g., confidence distribution)

Example: X1, . . . ,Xn i.i.d. follows N(µ, 1)

Point estimate: $\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$

Interval estimate: $(\bar{x}_n - 1.96/\sqrt{n},\ \bar{x}_n + 1.96/\sqrt{n})$

Distribution estimate: $N(\bar{x}_n, \frac{1}{n})$

The idea of the CD approach is to use a sample-dependent distribution (or

density) function to estimate the parameter of interest.

(Xie and Singh 2013)
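The three kinds of estimate in the N(µ, 1) example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample is simulated (the seed, the sample size of 100 and the true mean 0.3 are hypothetical choices, not from the talk), and the function name `normal_mean_estimates` is made up for this sketch.

```python
import math
import random
from statistics import NormalDist, fmean

def normal_mean_estimates(sample):
    """Point, interval and distribution estimates of mu for N(mu, 1) data."""
    n = len(sample)
    xbar = fmean(sample)                                  # point estimate
    half_width = 1.96 / math.sqrt(n)
    interval = (xbar - half_width, xbar + half_width)     # 95% interval estimate
    cd = NormalDist(mu=xbar, sigma=1 / math.sqrt(n))      # distribution estimate N(xbar, 1/n)
    return xbar, interval, cd

rng = random.Random(0)                                    # hypothetical seed
sample = [rng.gauss(0.3, 1.0) for _ in range(100)]        # hypothetical data, true mu = 0.3
xbar, interval, cd = normal_mean_estimates(sample)

# The classical summaries can be read off the distribution estimate.
cd_median = cd.inv_cdf(0.5)                               # recovers the point estimate
cd_interval = (cd.inv_cdf(0.025), cd.inv_cdf(0.975))      # recovers the 95% interval
p_value_mu0 = 2 * min(cd.cdf(0.0), 1 - cd.cdf(0.0))       # two-sided p-value for H0: mu = 0

print(xbar, interval, cd_interval, p_value_mu0)
```

The median and the 2.5%/97.5% quantiles of the distribution estimate reproduce the point and interval estimates, which is the sense in which a CD carries more information than either alone.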

CD is very informative – Point estimators, confidence intervals, p-values & more

CD can provide meaningful answers for all questions in statistical inference –


(cf., Xie & Singh 2013; Singh et al. 2007)

Definition: Confidence Distribution

Definition:

A confidence distribution (CD) is a sample-dependent distribution

function on parameter space that can represent confidence intervals

(regions) of all levels for a parameter of interest.

– Cox (2013, Int. Stat. Rev.): The CD approach is “to provide simple and interpretable summaries of what can reasonably be learned from data (and an assumed model).”

– Efron (2013, Int. Stat. Rev.): The CD development is “a grounding process” to help solve “perhaps the most important unresolved problem in statistical inference” on “the use of Bayes theorem in the absence of prior information.”

Wide range of examples: bootstrap distribution, (normalized) likelihood function, empirical likelihood, p-value functions, fiducial distributions, some informative priors and Bayesian posteriors, among others

More CD examples

Under regularity conditions, we can prove that a normalized likelihood function (with respect to parameter θ)

$$\frac{L(\theta \mid \text{data})}{\int L(\theta \mid \text{data})\, d\theta}$$

is a confidence density function.

Example: X1, . . . ,Xn i.i.d. follows N(µ, 1)

Likelihood function

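$$L(\mu \mid \text{data}) = \prod f(x_i \mid \mu) = C e^{-\frac{1}{2}\sum (x_i - \mu)^2} = C e^{-\frac{n}{2}(\bar{x}_n - \mu)^2 - \frac{1}{2}\sum (x_i - \bar{x}_n)^2}$$

Normalized with respect to µ

$$\frac{L(\mu \mid \text{data})}{\int L(\mu \mid \text{data})\, d\mu} = \dots = \frac{1}{\sqrt{2\pi/n}}\, e^{-\frac{n}{2}(\mu - \bar{x}_n)^2}$$

It is the density of $N(\bar{x}_n, \frac{1}{n})$!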
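The normalization step can be checked numerically. The sketch below evaluates the N(µ, 1) likelihood on a grid, normalizes it by a Riemann sum, and compares the result with the N(x̄, 1/n) density. The eight data values, the grid range and the grid step are hypothetical choices made only for this illustration.

```python
import math
from statistics import NormalDist, fmean

def normalized_likelihood(sample, grid):
    """Likelihood of mu for N(mu, 1) data, normalized to integrate to 1 over the grid."""
    # Log-likelihood up to an additive constant that does not depend on mu.
    loglik = [-0.5 * sum((x - mu) ** 2 for x in sample) for mu in grid]
    peak = max(loglik)
    lik = [math.exp(v - peak) for v in loglik]   # rescale for numerical stability
    step = grid[1] - grid[0]
    total = sum(lik) * step                      # Riemann-sum approximation of the integral
    return [v / total for v in lik]

sample = [0.1, 0.5, -0.2, 0.9, 0.4, 0.3, 0.7, -0.1]   # hypothetical data
n = len(sample)
xbar = fmean(sample)

# Grid of 6001 points covering xbar +/- 3 (more than 8 standard errors here).
grid = [xbar - 3 + 6 * j / 6000 for j in range(6001)]
dens = normalized_likelihood(sample, grid)

target = NormalDist(mu=xbar, sigma=1 / math.sqrt(n))  # N(xbar, 1/n)
max_gap = max(abs(d - target.pdf(mu)) for d, mu in zip(dens, grid))
print(max_gap)  # expected to be tiny if the two densities agree
```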

More CD examples

Example: (Bivariate normal correlation) Let ρ denote the correlation coefficient of a bivariate normal population; ρ̂ be the sample version.

Fisher’s z

$$z = \frac{1}{2}\log\frac{1 + \hat{\rho}}{1 - \hat{\rho}}$$

has the limiting distribution $N\left(\frac{1}{2}\log\frac{1+\rho}{1-\rho},\ \frac{1}{n-3}\right)$ ⟹

$$H_n(\rho) = 1 - \Phi\left(\sqrt{n-3}\left(\frac{1}{2}\log\frac{1 + \hat{\rho}}{1 - \hat{\rho}} - \frac{1}{2}\log\frac{1 + \rho}{1 - \rho}\right)\right)$$

is an asymptotic CD for ρ, when sample size n → ∞.

– $H_n(\rho)$ is a cumulative distribution function on Θ = (−1, 1), the parameter space of ρ

– The quantiles of $H_n(\rho)$ can provide confidence intervals of all levels for ρ.
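As a sketch of how the quantiles of this CD give confidence intervals, the code below evaluates H_n(ρ) and inverts it in closed form. The values ρ̂ = 0.6 and n = 50 are hypothetical, and the function names are made up for this illustration.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal

def fisher_z(r):
    """Fisher's z transform, (1/2) log((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def cd_rho(rho, rho_hat, n):
    """Asymptotic CD H_n(rho) built from Fisher's z."""
    return 1 - PHI.cdf(math.sqrt(n - 3) * (fisher_z(rho_hat) - fisher_z(rho)))

def cd_quantile(p, rho_hat, n):
    """Solve H_n(rho) = p: fisher_z(rho) = fisher_z(rho_hat) + Phi^{-1}(p) / sqrt(n - 3)."""
    z = fisher_z(rho_hat) + PHI.inv_cdf(p) / math.sqrt(n - 3)
    return math.tanh(z)  # tanh inverts Fisher's z

rho_hat, n = 0.6, 50                     # hypothetical sample correlation and sample size
lower = cd_quantile(0.025, rho_hat, n)   # 95% interval from the 2.5% and 97.5% quantiles
upper = cd_quantile(0.975, rho_hat, n)
print(lower, upper)
```

Any other level follows the same way, for example `cd_quantile(0.05, ...)` and `cd_quantile(0.95, ...)` for a 90% interval.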

Three forms of CD presentations

[Figure: three panels over mu from 0.0 to 0.6, showing the CD density, the CD, and the CV.]

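Confidence density: in the form of a density function $h_n(\theta)$

e.g., $N(\bar{x}_n, \frac{1}{n})$ as $h_n(\theta) = \frac{1}{\sqrt{2\pi/n}}\, e^{-\frac{n}{2}(\theta - \bar{x}_n)^2}$.

Confidence distribution: in the form of a cumulative distribution function $H_n(\theta)$

e.g., $N(\bar{x}_n, \frac{1}{n})$ as $H_n(\theta) = \Phi\left(\sqrt{n}(\theta - \bar{x}_n)\right)$

Confidence curve: $CV_n(\theta) = 2\min\{H_n(\theta),\ 1 - H_n(\theta)\}$

e.g., $N(\bar{x}_n, \frac{1}{n})$ as $CV_n(\theta) = 2\min\left\{\Phi\left(\sqrt{n}(\theta - \bar{x}_n)\right),\ 1 - \Phi\left(\sqrt{n}(\theta - \bar{x}_n)\right)\right\}$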
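The three presentations can be written as three small functions of θ. The sketch below does this for the N(x̄, 1/n) example; the values x̄ = 0.3 and n = 100 are hypothetical, and `cd_forms` is a made-up helper name.

```python
import math
from statistics import NormalDist

def cd_forms(xbar, n):
    """Return the confidence density h, the CD H and the confidence curve CV for N(xbar, 1/n)."""
    cd = NormalDist(mu=xbar, sigma=1 / math.sqrt(n))

    def h(theta):
        return cd.pdf(theta)

    def H(theta):
        return cd.cdf(theta)

    def CV(theta):
        p = H(theta)
        return 2 * min(p, 1 - p)

    return h, H, CV

xbar, n = 0.3, 100                      # hypothetical values
h, H, CV = cd_forms(xbar, n)

# The level-(1 - a) interval is the set of theta with CV(theta) >= a.
z975 = NormalDist().inv_cdf(0.975)
endpoint = xbar + z975 / math.sqrt(n)   # upper end of the 95% interval
print(CV(xbar), CV(endpoint))
```

The confidence curve peaks at 1 at the median of the CD and falls to 0.05 at the ends of the 95% interval, so intervals of every level can be read off one plot.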

CD — a unifying concept for distributional inference

Our understanding/interpretation: Any approach, regardless of being

frequentist, fiducial or Bayesian, can potentially be unified under the

concept of confidence distributions, as long as it can be used to build

confidence intervals of all levels, exactly or asymptotically.

May provide a union for Bayesian, frequentist & fiducial (BFF) inferences

Supports new methodology developments, providing inference tools whose solutions were previously unavailable or unknown

◦ From our Rutgers group, for instance -

– New prediction approaches

– New testing methods

– New simulation schemes (⇒ Application to precision medicine??)∗

– Combining information from diverse sources through combining CDs (fusion learning/meta-analysis, split & conquer, etc.)

Fusion learning by CDs

Key idea (steps)

Summarize relevant data information using a CD in each study

Synthesize information from diverse sources/studies via combination of the CDs from these studies

General (& unifying) framework on combining CDs has been developed

o A simple illustrative example (Stouffer method):

$$H^{(c)}(\theta) = \Phi\left(\left\{\Phi^{-1}(H_1(\theta)) + \dots + \Phi^{-1}(H_K(\theta))\right\}/\sqrt{K}\right),$$

where $H_i(\theta)$ is the CD from the $i$-th study/source

o For more approaches and in-depth discussions, see Singh et al. (2005), Xie et al. (2011) and Schweder and Hjort (2016).
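The Stouffer-type combination can be sketched directly from the formula. In the example below the three study CDs are N(x̄_i, 1/n) with hypothetical means and a hypothetical common sample size; the clipping of probabilities is an implementation detail added here so that the inverse normal CDF is never evaluated at exactly 0 or 1.

```python
import math
from statistics import NormalDist, fmean

PHI = NormalDist()  # standard normal

def _clip(p, eps=1e-12):
    """Keep p strictly inside (0, 1) so that inv_cdf is defined."""
    return min(max(p, eps), 1 - eps)

def stouffer_combine(cds):
    """Combine CD functions H_1..H_K as Phi({sum_i Phi^{-1}(H_i(theta))} / sqrt(K))."""
    K = len(cds)

    def combined(theta):
        z = sum(PHI.inv_cdf(_clip(H(theta))) for H in cds)
        return PHI.cdf(z / math.sqrt(K))

    return combined

xbars, n = [0.1, 0.3, 0.5], 25          # hypothetical study means and common sample size
cds = [NormalDist(mu=x, sigma=1 / math.sqrt(n)).cdf for x in xbars]
combined = stouffer_combine(cds)

# With equal-variance normal CDs, the combination equals the CD from pooling all data.
pooled = NormalDist(mu=fmean(xbars), sigma=1 / math.sqrt(n * len(xbars)))
max_gap = max(abs(combined(t) - pooled.cdf(t)) for t in (0.1, 0.2, 0.3, 0.4, 0.5))
print(max_gap)
```

In this equal-variance normal case the combined CD coincides with the one obtained from the pooled sample, which illustrates why combining CDs loses no information here.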

Fusion learning by CDs

Why combine CDs in fusion learning?

CD is informative (much more than a single point or an interval)

CD concept is broad (covering a broad range of examples across BFF paradigms)

CD combination is supported by statistical theory (e.g., ensuring frequentist coverage, etc.)

It’s computationally feasible for big data (inherently a “divide-and-conquer” approach)

. . . , flexible, effective, versatile, etc.

Individualized fusion learning (iFusion) with big and heterogeneous data (Motivation)

iFusion - Individualized Fusion learning (by individual-to-clique) tackles a special type of problem encountered in complex/heterogeneous/big data:

e.g. FINANCIAL FORECASTING (CONSULTING FOR DUN&BRADSTREET)

– Provide credit scores for each of 100,000+ companies

Q: Data collected for a single company may be limited; can we use companies of similar types to help improve inference?

e.g. PERSONALIZED/PRECISION MEDICINE.

Provide healthcare/treatment tailored to individual patients.

Q: Information on an individual patient is often limited; can patients with similar traits help achieve better inference?

Goal: Make inference for an individual subject by borrowing information from

similar subjects (‘clique’) to improve efficiency.

iFusion with big and heterogeneous data

An iFusion approach following key steps:

S1 Obtain inference for each individual subject based on individual data (subset of data)

S2 Form a clique (or group) around the individual by the similarity of their shared parameters or models

S3 Finally obtain inference or model from the clique (by a weighted combination)

(Drawing inference from the clique allows borrowing strength from individuals with the same/similar parameters, thus enhancing inference efficiency.)

Bias-Variance tradeoff

– Inclusion: Similar individuals with small biases (“clique”); increased sample sizes reduce variance (improve efficiency)

– Exclusion: Individuals with big biases (outside the “clique”); the variance reduction from a larger combined sample size cannot overcome the big biases.
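The tradeoff can be made concrete with a little arithmetic. The sketch below assumes, for illustration only, that the target parameter is estimated by the equal-weight average of individual sample means, each based on n observations with standard deviation σ; the differences θ_k − θ_1 and the value n = 50 are hypothetical.

```python
def pooled_mse(deltas, n, sigma=1.0):
    """MSE for estimating theta_1 by averaging the sample means of m individuals.

    deltas[k] = theta_k - theta_1 (0.0 for the target itself); each individual
    contributes n observations with standard deviation sigma.
    """
    m = len(deltas)
    bias = sum(deltas) / m            # bias of the pooled mean
    variance = sigma ** 2 / (n * m)   # variance shrinks with the pooled sample size
    return bias ** 2 + variance

n = 50
mse_individual = pooled_mse([0.0], n)                     # target only: no bias, large variance
mse_clique = pooled_mse([0.0, 0.01, -0.01, 0.02], n)      # near ties: tiny bias, smaller variance
mse_far = pooled_mse([0.0, 1.0, 1.0, 1.0], n)             # distant individuals: bias dominates

print(mse_individual, mse_clique, mse_far)
```

Pooling near ties lowers the MSE relative to the individual analysis, while pooling distant individuals raises it, because the squared bias does not shrink as the combined sample grows.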



Simulation study (proof of concept): iFusion with big data

Simulation study/Proof of concept: iFusion vs Individual vs Subgroups

Simulation settings:

$y_{ik} \sim N(\theta_k, 0.1^2)$, $i = 1, \dots, n_k = 50$, $k = 1, \dots, 30$.

$\theta_k$ is a realization from a mixture distribution of six normals

$$0.1\,N(0, 0.1^2) + 0.3\,N(0.5, 0.5^2) + 0.2\,N(1, 1) + 0.1\,N(1.5, 0.2^2) + 0.15\,N(2, 0.4^2) + 0.15\,N(2.5, 0.1^2)$$

Three methods compared

1 iFusion (≡ fusion learning with individualized weights); e.g.,

$$H^{(c)}(\theta) = \Phi\left(\left\{w_{i1}\Phi^{-1}(H_1(\theta)) + \dots + w_{iK}\Phi^{-1}(H_K(\theta))\right\} \Big/ \sqrt{\sum_{k=1}^{K} w_{ik}^2}\right)$$

for a set of screening weights $w_{ik}$ (to be discussed further)

2 Conventional method (only individual data)

3 Subgroup method (i.e., use the K-means method to get K subgroups; then perform subgroup analysis)
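The weighted combination used by iFusion can be sketched as a small function of the individual CDs and a weight vector. The three CDs and the 0/1 weights below are hypothetical, and the probability clipping is an implementation detail added for numerical safety.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal

def _clip(p, eps=1e-12):
    """Keep p strictly inside (0, 1) so that inv_cdf is defined."""
    return min(max(p, eps), 1 - eps)

def weighted_combine(cds, weights):
    """Phi({sum_k w_k Phi^{-1}(H_k(theta))} / sqrt(sum_k w_k^2))."""
    norm = math.sqrt(sum(w * w for w in weights))

    def combined(theta):
        z = sum(w * PHI.inv_cdf(_clip(H(theta)))
                for H, w in zip(cds, weights) if w != 0)  # zero-weight studies are skipped
        return PHI.cdf(z / norm)

    return combined

se = 0.1 / math.sqrt(50)                       # standard error with sigma = 0.1, n = 50
means = [0.58, 0.60, 2.0]                      # hypothetical individual estimates
cds = [NormalDist(mu=m, sigma=se).cdf for m in means]

H_target_only = weighted_combine(cds, [1, 0, 0])   # reduces to the individual CD
H_clique = weighted_combine(cds, [1, 1, 0])        # pools the first two, excludes the third

reference = NormalDist(mu=0.59, sigma=0.1 / math.sqrt(100))  # pooled CD of the first two
print(H_clique(0.585), reference.cdf(0.585))
```

With weights (1, 0, 0) the formula returns the individual CD, and with equal weights on all K studies it reduces to the unweighted Stouffer combination.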

Simulation study (proof of concept): iFusion with big data

The target is the 9th individual with true θ_9 = 0.578

Long-run performance with 500 repetitions

               Ind.     SG (K=4)  SG (K=6)  SG (K=9)  iFusion
95% Coverage   0.938    0.000     0.032     0.206     0.924
Median Width   0.0551   0.0165    0.0187    0.0239    0.041

[Figure: CD density against Theta (0.0 to 2.5); legend: Truth, Individual − Target, Indep − Other, iFusion − Target.]

Typical outcomes based on a simulated dataset.
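Long-run coverage and median width of the kind reported for the individual-only method can be estimated by Monte Carlo. The sketch below is not the authors' code: it assumes a normal-quantile interval with the sample standard deviation (the talk does not state how the intervals were built), and the seed is a hypothetical choice, so the numbers it prints are not expected to match the table exactly.

```python
import math
import random
from statistics import NormalDist, fmean, median, stdev

def individual_ci_performance(theta, sigma=0.1, n=50, reps=500, seed=1):
    """Empirical coverage and median width of a 95% interval from one individual's data."""
    rng = random.Random(seed)             # hypothetical seed
    z = NormalDist().inv_cdf(0.975)
    hits, widths = 0, []
    for _ in range(reps):
        y = [rng.gauss(theta, sigma) for _ in range(n)]
        center = fmean(y)
        half = z * stdev(y) / math.sqrt(n)   # assumed interval: mean +/- z * s / sqrt(n)
        hits += (center - half <= theta <= center + half)
        widths.append(2 * half)
    return hits / reps, median(widths)

coverage, med_width = individual_ci_performance(theta=0.578)
print(coverage, med_width)
```

The same loop, with the interval replaced by one read from a combined CD, would give the corresponding summaries for a fusion method.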


Desirable features

Inference efficiency

- “Borrowing strength” from relevant studies;
- Oracle performance under mild conditions (by proper weighting)

Parameter assumption free

- No assumption on underlying individual parameters {θ_1, ..., θ_K}; any of the θ_i’s may or may not be equal to some others.

Computational feasibility

- “Divide, conquer, and combine”; scalable to big data.

Generality and flexibility

- Confidence distribution - a new, versatile, flexible and effective inference concept/tool.


– Suppose K studies (companies/patients) with true parameters θ_1, ..., θ_K (may or may not be the same – unknown to us).

– For simplicity, say, we are interested in θ_1 and n_1 = n_2 = ... = n_K ≡ n.

Clique (non-separable/“near-tie”) set:

$$C_1 = \{\theta_k : \|\theta_k - \theta_1\|_2 = o(n^{-1/2}),\ k = 1, \dots, K\}$$

- Bias is dominated by standard deviation ($= O_p(n^{-1/2})$) for any Study-k inside $C_1$

Ideal screening weight:

Place big weights on studies inside $C_1$ and exclude/downweight studies outside $C_1$

Oracle inference: (best target)

If we know $C_1$ a priori, we would pool data from the studies inside $C_1$ (and exclude data outside $C_1$) to make inference on θ_1


Theorem 1 (Main Theorem)

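Suppose $w^{(s)}_{1k}$ satisfies

$$w^{(s)}_{1k} = \begin{cases} c_{1,k} + o_p(n^{-1/2}) & \text{if } \theta_k \in C_1; \\ o_p(n^{-1/2}) & \text{otherwise,} \end{cases} \qquad (1)$$

for some constant $c_{1,k} \in (0, 1]$. If no $\theta_k$ is on the $\sqrt{n}$-boundary of $C_1$, then

(i) $H^{(c)}_1(\theta)$ by iFusion is a valid CD for $\theta_1$.

(ii) If further $c_{1,k} = 1$, then $H^{(c)}_1(\theta)$ by iFusion provides oracle inference, asymptotically (as $n \to \infty$).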

Translation:

(i) Inference by iFusion is valid (guaranteed coverage, test size, etc.)

(ii) Inference by iFusion is asymptotically equivalent to the “best” (oracle), which is the most efficient (achieving smallest MSEs)


A simple proposal for screening weights is

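$$w^{(s)}_{1k} = \mathbf{1}\{\|\hat{\theta}_1 - \hat{\theta}_k\|_2 \le b_1\}, \quad \text{for } k = 1, \dots, K$$

- $\hat{\theta}_k$ is a $\sqrt{n}$-consistent estimator of $\theta_k$; e.g., the CD median $\hat{\theta}_k = H_k^{-1}(1/2)$.

- $b_1 = O_p(\tau_n)$ satisfies $\tau_n/d_1 \to 0$ and $\tau_n\sqrt{n} \to \infty$, where $d_1 = \min_k\{\|\theta_1 - \theta_k\|_2 : \theta_k \notin C_1\}$ and $d_1\sqrt{n} \to \infty$

Lemma 1

Under the above assumptions, the simple screening weights satisfy (1) with $c_{1,k} = 1$ for all $k = 1, \dots, K$.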
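The indicator screening weights and the resulting individualized combination can be sketched for scalar parameters with normal CDs. Everything numeric below is hypothetical: the six estimates, σ = 0.1, n = 50, and in particular the threshold b = σ·sqrt(log(n)/n), which is just one choice that shrinks to 0 more slowly than n^{-1/2}; the talk does not prescribe a specific b. For normal CDs N(θ̂_k, se²), Φ^{-1}(H_k(θ)) equals (θ − θ̂_k)/se, which the code uses directly.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal

def screening_weights(theta_hats, target, b):
    """Indicator weights 1{|theta_hat_target - theta_hat_k| <= b} (scalar parameters)."""
    t = theta_hats[target]
    return [1 if abs(t - th) <= b else 0 for th in theta_hats]

def ifusion_cd(theta_hats, se, weights):
    """Weighted combination of normal CDs H_k = N(theta_hat_k, se^2)."""
    norm = math.sqrt(sum(w * w for w in weights))

    def combined(theta):
        # For a normal CD, Phi^{-1}(H_k(theta)) = (theta - theta_hat_k) / se.
        z = sum(w * (theta - th) / se for th, w in zip(theta_hats, weights))
        return PHI.cdf(z / norm)

    return combined

sigma, n = 0.1, 50                                  # hypothetical settings
theta_hats = [0.578, 0.60, 0.56, 1.9, 2.5, 0.02]    # hypothetical individual estimates
se = sigma / math.sqrt(n)
b = sigma * math.sqrt(math.log(n) / n)              # assumed threshold, not from the talk

weights = screening_weights(theta_hats, target=0, b=b)
H_c = ifusion_cd(theta_hats, se, weights)

# With these weights the combined CD is normal with the following center and scale.
center = sum(w * th for w, th in zip(weights, theta_hats)) / sum(weights)
scale = se * math.sqrt(sum(w * w for w in weights)) / sum(weights)
print(weights, center, scale)
```

Here the target's clique consists of the first three individuals, so the combined CD is centered at their average with a standard error reduced by a factor of sqrt(3) relative to the individual analysis.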


Additional Theorems (technical details omitted) –

(Boundary case) Some studies on the boundary of the clique set ∂C_1 (Theorems 2 & 3):

(i) iFusion still produces a consistent point estimator
(ii) MSE(iFusion) ≤ MSE(Ind.)
(iii) The estimator by iFusion is asymptotically most efficient, under some further conditions

(Extension to design heterogeneity) Extension to cases with heterogeneity in study designs (varying model structures) (Theorems 4 & 5)

o Generalizations of Theorems 1-3.

Illustrative Example: iFusion with big data

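Simulation study: iFusion of big data with no subgroup/clustering structures

$Y_{ik} = \alpha_k + \beta_k x_{ik} + N(0, 1)$, for $i = 1, \dots, n_k$, $k = 1, \dots, 6000$

$$\theta_k = (\alpha_k, \beta_k) = \left( R\cos\left(\left[\tfrac{k-1}{5}\right]\tfrac{2\pi}{1200}\right) + \frac{U(-1,1)}{n_k},\ R\sin\left(\left[\tfrac{k-1}{5}\right]\tfrac{2\pi}{1200}\right) + \frac{U(-1,1)}{n_k} \right)$$

$n_k \equiv 40$ or $400$; $R = 500$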

Target: Individual-1500

Figure: Parameter values (α_k, β_k) (left) and simulated samples (x_ik, y_ik) (right), i = 1, ..., 50, k = 1, ..., 6000, with target individual in blue and its clique in yellow.
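The chain of parameter values on the circle can be generated as follows. This is a sketch under assumptions: the bracket [(k−1)/5] is read as the integer part, so that consecutive groups of five individuals share an angle, and the seed is a hypothetical choice. K = 6000, R = 500 and n_k = 40 are taken from the setting above.

```python
import math
import random

def circle_parameters(K=6000, R=500.0, n_k=40, seed=0):
    """Generate theta_k = (alpha_k, beta_k), k = 1..K, on a circle of radius R."""
    rng = random.Random(seed)                            # hypothetical seed
    thetas = []
    for k in range(1, K + 1):
        angle = ((k - 1) // 5) * 2 * math.pi / 1200      # assumes [.] means integer part
        alpha = R * math.cos(angle) + rng.uniform(-1, 1) / n_k
        beta = R * math.sin(angle) + rng.uniform(-1, 1) / n_k
        thetas.append((alpha, beta))
    return thetas

thetas = circle_parameters()
target = 1500

# Individuals sharing the target's angle (its near ties under the reading above).
clique = [k for k in range(1, 6001) if (k - 1) // 5 == (target - 1) // 5]

within = max(math.dist(thetas[target - 1], thetas[k - 1]) for k in clique)
to_next_group = math.dist(thetas[target - 1], thetas[target])  # individual 1501
print(clique, within, to_next_group)
```

Under this reading, members of the same group differ only through the small U(−1,1)/n_k perturbations, while neighboring groups are separated by roughly 2R·sin(π/1200), so the parameters form a chain rather than a few well-separated clusters.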

Simulation study: iFusion using big data with no subgroup structure

Table: Long-run performance over 500 repetitions: MSE and 95% nominal CI.

                 MSE                      Empirical Coverage       Median Width
n    Para        Indiv   iFusion  Oracle  Indiv   iFusion  Oracle  Indiv   iFusion  Oracle
40   β1500,1     0.026   0.007    0.005   0.940   0.922    0.928   0.624   0.278    0.271
40   β1500,2     0.014   0.003    0.002   0.944   0.928    0.930   0.468   0.180    0.175
400  β1500,1     0.002   0.0005   0.0005  0.948   0.948    0.948   0.196   0.088    0.088
400  β1500,2     0.001   0.0002   0.0002  0.950   0.966    0.966   0.131   0.058    0.058


Compared to a nonparametric Bayesian approach (“global”)

Nonparametric Bayesian approach

– Linear mixed-effects model with Dirichlet process mixture prior, fitted using the DPlmm function in the DPpackage package

– For the target individual parameter, the last 2500 of the total 10000 MCMC samples are used to compute the posterior mean and credible intervals

Computational difficulty: Impossible to run on the entire K = 6000 individuals!

Limited to 30 neighboring points; took >3 min for n_k = 40 for a single random run!


Table: NPB approach, long-run performance over 500 repetitions: MSE and 95% nominal CI.

n    Para        MSE     Empirical Coverage  Median Width
40   β1500,1     0.005   0.964               0.277
40   β1500,2     0.002   0.972               0.180

iFusion with big and heterogeneous data – real data sets

Real data analysis – We have worked on three datasets

Dun & Bradstreet dataset of 100,000+ companies (cannot be published)

SHEP clinical trial data for personalized/precision medicine (still ongoing)

Daily returns of 49 individual portfolios from 2016/01/01-2016/12/31

Real data example: Fama-French factor model/portfolio returns

In asset pricing and portfolio management, the Fama-French 3-factor model is widely used to describe portfolio returns

$$r^e_{kt} = \alpha_k + b_k r^e_{Mt} + s_k \mathrm{SMB}_t + h_k \mathrm{HML}_t + \varepsilon_{kt}$$

$r^e_{Mt}$: Excess return of the market portfolio over the risk-free return at time t

$\mathrm{SMB}_t$: Return of a portfolio long small-capitalization stocks and short large-capitalization stocks

$\mathrm{HML}_t$: Return of a portfolio long high book-to-price stocks and short low book-to-price stocks

Data: Daily returns of 49 portfolios and Fama/French factors from 2016.01.01-2016.12.31

Real data example: Fama-French factor model/portfolio returns

Expansion for prediction

$$r^e_{k,t+h} = \alpha_k + b_k r^e_{Mt} + s_k \mathrm{SMB}_t + h_k \mathrm{HML}_t + \varepsilon_{k,t+h}$$

Method: 1-, 2-, 3-step-ahead rolling forecast with sliding window sizes 20 and 60; compare predictions of the individual approach and iFusion.

[Figure: six panels (1-Step, 2-Step and 3-Step forecasts, each with 3-month and 1-month windows) plotting Relative RPMSE (0.75 to 1.00) against Portfolio Index (0 to 50).]

Figure: h-step-ahead ratio of predictive mean squared errors: RPMSE = PMSE(iFusion)/PMSE(individual approach).
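The ratio in the caption is simple to compute once forecasts from both approaches are available. The sketch below uses made-up returns and forecasts purely to show the calculation; none of the numbers come from the portfolio data.

```python
def pmse(actual, predicted):
    """Predictive mean squared error over a set of forecasts."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rpmse(actual, pred_ifusion, pred_individual):
    """Ratio PMSE(iFusion) / PMSE(individual approach); values below 1 favor iFusion."""
    return pmse(actual, pred_ifusion) / pmse(actual, pred_individual)

# Hypothetical realized returns and forecasts for one portfolio.
actual = [0.010, -0.004, 0.007, 0.002]
pred_ifusion = [0.008, -0.002, 0.005, 0.003]
pred_individual = [0.004, 0.002, 0.001, 0.006]

ratio = rpmse(actual, pred_ifusion, pred_individual)
print(ratio)
```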

iFusion with big and heterogeneous data – Discussion

Comparison/connections to existing methods

Versus fixed/random effects (hierarchical) models, iFusion provides

- Asymptotically equivalent solutions if the fixed/random effects model is true;

- Superior solutions if the assumption of the fixed/random effects model is invalid.

Versus (finite) mixture models, iFusion

- Does not need to determine the number of mixture components (typically unknown) in advance, nor EM-type iterations; it is much more flexible.

Versus NP Bayesian models, iFusion

- Does not involve any prior distribution;
- Provides similar (numerical) conclusions;
- Requires much less computing and is scalable to big data.
