Efficient Outlier Identification in Lung Cancer Study

Shibing Deng
Pfizer, Inc.

Outline

Background and motivation
COPA statistics
    Existing methods
    A new method
Comparison of COPA statistics
Application to lung cancer data

What Are COPA Statistics?

COPA = Cancer Outlier Profile Analysis

Statistics designed to identify outliers in cancer gene expression profiles.

[Figure: ACTL8 expression (FC = 1.27, FDR = 0.031). log2(Intensity) by group, Normal (n = 37) vs. Tumor (n = 95); outlier samples are marked.]

Motivation

Differential gene expression (DGE) analysis is widely used to identify over- or under-expressed cancer genes.
    It assumes two distinct populations: tumor and normal.

However, cancer is not a homogeneous disease.
    It is genetically diverse.
    Oncogenes have heterogeneous activation patterns.
    DGE may occur in only a subset of samples.

COPA identifies DGE in a subset of cancer patients.

Example of Cancer Heterogeneity

Molecular Subsets of Lung Adenocarcinoma

Pao W, Hutchinson K. 2012 Mar 6;18(3):349-51

COPA Methods in Literature

Original COPA method: Tomlins et al. 2005
Outlier Sum (OS): Tibshirani and Hastie 2007
Outlier Robust T (ORT): Wu 2007
Likelihood Ratio Statistic (LRS): Hu 2008

Notation

n1 = # of normal samples
n2 = # of tumor samples
n = n1 + n2 is the total # of samples
x_ij is the expression value for sample i and gene j

For gene j (for simplicity, index j is not shown below), the samples are ordered with the normal group first:

Normal samples (n1): x_1, x_2, x_3, ..., x_{n1}
Tumor samples (n2): x_{n1+1}, x_{n1+2}, ..., x_i, ..., x_n

The Original COPA Method

Tomlins et al. (2005) proposed the original COPA method.
    Standardize each gene based on its median and MAD.
    Define the COPA statistic as the rth percentile (r = 75, 90, 95) of the standardized tumor samples.

Limitations:

1) Fixed r
    With r = 90, the statistic can only detect outliers with expression levels greater than those of 90% of the tumor samples.
    It is not efficient in differentiating the number of outliers.

2) MAD is calculated over all samples
    Outliers can affect the estimate of the MAD.
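The standardization and percentile steps above can be sketched for a single gene as follows. This is a minimal illustration using only the Python standard library; the function names and the linear-interpolation percentile rule are assumptions of the sketch, not taken from Tomlins et al.

```python
import statistics


def percentile(values, r):
    """r-th percentile (0-100), linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def mad(values):
    """Median absolute deviation about the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def copa_statistic(normal, tumor, r=90):
    """r-th percentile of the tumor values of one gene after median/MAD
    standardization computed over all samples (normal and tumor)."""
    pooled = list(normal) + list(tumor)
    center = statistics.median(pooled)
    scale = mad(pooled)  # assumes the gene is not constant (scale > 0)
    standardized_tumor = [(x - center) / scale for x in tumor]
    return percentile(standardized_tumor, r)
```

With one extreme tumor value out of five, the 75th percentile does not move while the 90th does, which illustrates the fixed-r limitation noted above.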

Outlier Sum (OS)

Standardize each gene:
    Median centering
    Scale on MAD based on normal samples (MAD1)

$$x'_{ij} = \frac{x_{ij} - \mathrm{median}_j}{\mathrm{MAD}_{1j}}$$

Define the OS statistic as the sum of the standardized data from outliers, where outliers are data above Q3 + IQR:

$$OS_j = \sum_{i > n_1} x'_{ij} \cdot I\left[x'_{ij} > q_{75}(x'_j) + IQR(x'_j)\right]$$

Improvement over COPA:

1. Outliers are defined based on the data distribution (not a fixed percentile).
2. It takes account of the number of outliers.
3. Better scaling factor: MAD1.
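A minimal single-gene sketch of the outlier sum, assuming the median is taken over all samples, MAD1 is the MAD of the normal samples about their own median, and the Q3 + IQR threshold is computed over all standardized samples. The function names and the percentile interpolation rule are choices made for this sketch.

```python
import statistics


def percentile(values, r):
    """r-th percentile (0-100), linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def outlier_sum(normal, tumor):
    """Sum of standardized tumor values that exceed Q3 + IQR."""
    pooled = list(normal) + list(tumor)
    center = statistics.median(pooled)
    normal_median = statistics.median(normal)
    # MAD1: scale estimated from normal samples only (assumed > 0)
    mad1 = statistics.median(abs(x - normal_median) for x in normal)
    z_all = [(x - center) / mad1 for x in pooled]
    z_tumor = z_all[len(normal):]
    q25, q75 = percentile(z_all, 25), percentile(z_all, 75)
    threshold = q75 + (q75 - q25)
    return sum(z for z in z_tumor if z > threshold)
```

Because only values above the threshold contribute, a gene without outliers scores exactly 0, which is why the null density of OS is shown for non-zero values only.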

Outlier Robust T (ORT)

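Similar to OS, with different centering (normal group median) and scaling (pooled MAD) factors:

$$x'_{ij} = \frac{x_{ij} - \mathrm{median}_{1j}}{\mathrm{median}\left(|x_{ij} - \mathrm{median}_{1j}|_{i \le n_1},\ |x_{ij} - \mathrm{median}_{2j}|_{i > n_1}\right)}$$

where median_1j and median_2j are the medians of gene j in the normal and tumor groups.

Define ORT as

$$ORT_j = \sum_{i > n_1} x'_{ij} \cdot I\left[x'_{ij} > q_{75}(x'_{kj} : k \le n_1) + IQR(x'_{kj} : k \le n_1)\right]$$

The outlier threshold is based on normal group data only.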
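The ORT definition can be sketched for one gene as below, under the same conventions as the OS description: centering on the normal median, scaling by the median of the absolute deviations of each group about its own median, and a threshold taken from the normal samples only. Function names and the percentile interpolation rule are assumptions of the sketch.

```python
import statistics


def percentile(values, r):
    """r-th percentile (0-100), linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def outlier_robust_t(normal, tumor):
    """Sum of standardized tumor values above the normal-group Q3 + IQR."""
    med1 = statistics.median(normal)
    med2 = statistics.median(tumor)
    # Pooled MAD: each group's deviations are taken about its own median
    deviations = [abs(x - med1) for x in normal] + [abs(x - med2) for x in tumor]
    scale = statistics.median(deviations)  # assumed > 0
    z_normal = [(x - med1) / scale for x in normal]
    z_tumor = [(x - med1) / scale for x in tumor]
    q25, q75 = percentile(z_normal, 25), percentile(z_normal, 75)
    threshold = q75 + (q75 - q25)
    return sum(z for z in z_tumor if z > threshold)
```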

Likelihood Ratio Statistic (LRS)

Outlier detection is treated as a change-point problem.

Group the normal and tumor samples separately, and sort them within each group in ascending order:

Normal samples (n1): x_(1), x_(2), x_(3), ..., x_(n1)
Tumor samples (n2): x_(n1+1), x_(n1+2), ..., x_(i), ..., x_(n)

Separate all the samples into two groups at the k-th tumor sample, k = n1+1, n1+2, ..., n-1, and form a two-sample t statistic:

$$t_k = \frac{\bar{x}_{i>k} - \bar{x}_{i \le k}}{\hat{s}_k}, \quad \text{with } \hat{s}_k = s_k \sqrt{\frac{1}{k} + \frac{1}{n-k}}$$

where s_k is the sample standard deviation.

Define

$$LRS = \max_{n_1 < k < n} (t_k)$$
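The change-point search can be sketched as a scan over split positions. This sketch assumes t_k is the usual pooled-variance two-sample t statistic comparing samples above the split with samples at or below it; it follows the description on this slide and is not claimed to reproduce the exact statistic of Hu (2008).

```python
import math


def lrs_statistic(normal, tumor):
    """Maximum two-sample t statistic over splits k = n1+1, ..., n-1."""
    x = sorted(normal) + sorted(tumor)  # sorted within each group, normal first
    n1, n = len(normal), len(x)
    best = -math.inf
    for k in range(n1 + 1, n):
        lower, upper = x[:k], x[k:]
        mean_lo = sum(lower) / k
        mean_up = sum(upper) / (n - k)
        ss = sum((v - mean_lo) ** 2 for v in lower)
        ss += sum((v - mean_up) ** 2 for v in upper)
        s_k = math.sqrt(ss / (n - 2))  # pooled SD; assumed > 0
        t_k = (mean_up - mean_lo) / (s_k * math.sqrt(1 / k + 1 / (n - k)))
        best = max(best, t_k)
    return best
```

Every gene yields some maximum, including genes with no outliers, so the value only becomes interpretable once it is compared with a null distribution.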

Comments on LRS

A maximum t statistic
Does not provide an explicit definition of outliers
Every gene provides a max(t)
Need a significance measure (p value) to define outliers

A New Method – Maximum Square Difference (MSD)

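Similar to LRS, but instead of a t statistic we use a squared difference:

$$d_k = \frac{\bar{x}_{i>k} - \bar{x}_{i \le k}}{\hat{s}_k}, \quad \text{with } \hat{s}_k^2 = s^2 \left(\frac{1}{k} + \frac{1}{n-k}\right) \text{ and } s = \text{SE for all samples}$$

Define

$$MSD = \max_{n_1 < k < n} (d_k^2)$$

More sensitive when the number of outliers is small.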

Comparison of the Methods - ROC

The methods were compared by simulation using ROC curves.

With n1 = n2 = 20, we simulate 8000 null genes from the standard normal distribution. We also simulate 2000 up-regulated genes in which k = 2, 5, 10 or 15 of the 20 tumor samples are drawn from N(2,1).

Based on the percentiles of the COPA statistic from the null genes, we define the detection threshold for a given false positive rate (FPR). The true positive rate (TPR) is

TPR = Prob(copa >= threshold | up-regulated genes)
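One point on such an ROC curve can be sketched as follows. The example uses a pooled two-sample t statistic as the plugged-in statistic (any of the compared statistics could be passed instead), and the gene counts default to one tenth of those on the slide to keep the run short. Function names, seeds and defaults are assumptions of the sketch.

```python
import math
import random
import statistics


def t_statistic(normal, tumor):
    """Pooled-variance two-sample t statistic (tumor minus normal)."""
    n1, n2 = len(normal), len(tumor)
    pooled_var = ((n1 - 1) * statistics.variance(normal)
                  + (n2 - 1) * statistics.variance(tumor)) / (n1 + n2 - 2)
    diff = statistics.mean(tumor) - statistics.mean(normal)
    return diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def simulate_gene(rng, n1, n2, k, shift=2.0):
    """One gene: N(0,1) everywhere except k tumor samples drawn from N(shift,1)."""
    normal = [rng.gauss(0, 1) for _ in range(n1)]
    tumor = [rng.gauss(shift, 1) for _ in range(k)]
    tumor += [rng.gauss(0, 1) for _ in range(n2 - k)]
    return normal, tumor


def tpr_at_fpr(stat, fpr, n1=20, n2=20, k=5, n_null=800, n_alt=200, seed=0):
    """TPR of `stat` when the threshold is the (1 - fpr) quantile of the null genes."""
    rng = random.Random(seed)
    null = sorted(stat(*simulate_gene(rng, n1, n2, 0)) for _ in range(n_null))
    alt = [stat(*simulate_gene(rng, n1, n2, k)) for _ in range(n_alt)]
    idx = min(math.ceil((1 - fpr) * n_null), n_null - 1)
    threshold = null[idx]
    return sum(a >= threshold for a in alt) / n_alt
```

Sweeping `fpr` over a grid and plotting the returned TPR values traces the ROC curve for one statistic and one value of k.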

ROC (n1=n2=20, k=2 and 5)

[Figure: ROC curves (TPR vs. FPR) for COPA, ORT, OS, LRS, T and MSD; panels k = 2 and k = 5.]

ROC (n1=n2=50, k=5, 10)

[Figure: ROC curves (TPR vs. FPR) for COPA, ORT, OS, LRS, T and MSD; panels k = 5 and k = 10.]

Comparison of the Methods - FDR

The methods can also be compared based on the false discovery rate (FDR).

Simulate n1 = n2 = 20 or 50 samples with 10000 genes, among which 2000 are up-regulated in k tumor samples.

For each detection threshold c of the COPA statistic, the FDR is the proportion of false positives among all claimed positives:

FDR = # of false positives / # of all claimed positives = sum(copa >= c | null genes) / sum(copa >= c | all genes)

A plot of FDR vs. positive rate is created.
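Given the statistics of the simulated null and up-regulated genes, the points of the FDR plot follow directly from the formula above. The function name and the tuple layout are choices made for this sketch.

```python
def fdr_curve(null_stats, alt_stats, thresholds):
    """For each threshold c, return (fraction declared positive, FDR).

    Thresholds at which no gene is declared positive are skipped,
    because the FDR is undefined there.
    """
    total = len(null_stats) + len(alt_stats)
    points = []
    for c in thresholds:
        false_pos = sum(s >= c for s in null_stats)
        true_pos = sum(s >= c for s in alt_stats)
        positives = false_pos + true_pos
        if positives == 0:
            continue
        points.append((positives / total, false_pos / positives))
    return points
```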

FDR: n1=n2=20, k=2, 5

[Figure: FDR vs. fraction of genes declared positive for COPA, ORT, OS, LRS, T and MSD; panels k = 2 and k = 5.]

FDR: n1=n2=50, k=5, 10

[Figure: FDR vs. fraction of genes declared positive for COPA, ORT, OS, LRS, T and MSD; panels k = 5 and k = 10.]

Comparison of the Methods - Summary

Our new MSD method performs best when a small percentage (≤ 20%) of tumor samples are differentially expressed (DE), i.e. are outliers.

For a moderate number of DE samples (20-50%), LRS performs better in ROC.

For a large number of DE samples (> 50% of tumors), the t statistic becomes more efficient.

When a relatively large number (> 30%) of DE samples exist, MSD, LRS, ORT and T have comparable FDR.

Assess Significance

The distributions of the COPA statistics are not known.
An analytic solution is not readily available.
A permutation test does not generate the correct null distribution.

Simulation:
Simulate the COPA statistic under the null and derive the null distribution from a relatively large number of simulations, say n = 10000.
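A minimal sketch of the simulation approach: draw null genes from N(0,1), evaluate the statistic on each, and read the p value of an observed statistic off the simulated values. `demo_statistic` is a hypothetical placeholder so the block runs on its own; the add-one correction in the p value is a common convention chosen for this sketch, not something stated on the slide.

```python
import random
import statistics


def demo_statistic(normal, tumor):
    """Hypothetical placeholder: largest tumor value minus the normal median."""
    return max(tumor) - statistics.median(normal)


def simulate_null(stat, n1, n2, n_sim=10000, seed=0):
    """Values of `stat` on n_sim genes with all samples drawn from N(0,1)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_sim):
        normal = [rng.gauss(0, 1) for _ in range(n1)]
        tumor = [rng.gauss(0, 1) for _ in range(n2)]
        values.append(stat(normal, tumor))
    return values


def empirical_p_value(observed, null_stats):
    """Upper-tail p value; the +1 keeps the estimate away from exactly 0."""
    exceed = sum(s >= observed for s in null_stats)
    return (exceed + 1) / (len(null_stats) + 1)
```

The null has to be re-simulated for each (n1, n2) design, since the distribution of each statistic depends on the group sizes.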

Distribution of COPA statistics

Simulated null for n1=n2=20.

[Figure: simulated null densities of the original COPA statistic, the OS statistic (non-zero values), the ORT statistic (non-zero values) and the LRS statistic.]

MSD Distribution

Simulated under the null, 10000 genes, n1=n2=20, data from N(0,1)

The figures display the pdf of both MSD and y = sqrt(MSD).

The fitted dashed line is a non-central chi-square density function for MSD and a normal density for y:

$$MSD \sim \sigma^2 \chi^2\left(1, \frac{\mu^2}{\sigma^2}\right), \qquad y \sim N(\mu, \sigma^2)$$

[Figure: histograms of the MSD statistic ("MSD Density") and of Sqrt(MSD) ("Square Root of MSD") with the fitted densities.]

MSD Distribution – Parameters

Both μ and σ are functions of n1 and n2, as well as of the underlying gene expression distribution. If we assume gene expression follows a N(0,1) distribution, then the MSD parameters are μ(n1,n2) and σ²(n1,n2).

Plots show that μ is driven by n2, and σ is driven by the n2/n1 ratio.
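If sqrt(MSD) is approximately normal under the null, the two parameters can be estimated from simulated null MSD values for a given (n1, n2), and a p value read from the normal upper tail. The sketch below assumes the null MSD values have already been simulated; the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def fit_sqrt_msd_null(null_msd_values):
    """Fit a normal distribution to y = sqrt(MSD) from simulated null genes."""
    y = [math.sqrt(v) for v in null_msd_values]
    return NormalDist(mu=mean(y), sigma=stdev(y))


def msd_p_value(observed_msd, null_dist):
    """Upper-tail p value of an observed MSD under the fitted normal for sqrt(MSD)."""
    return 1.0 - null_dist.cdf(math.sqrt(observed_msd))
```

Compared with an empirical p value, the parametric fit gives usable p values further into the tail than the number of simulations alone would allow, at the cost of relying on the normal approximation.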

Outlier Identification

COPA, OS and ORT define outlier samples in their methods.

MSD and LRS do not provide an explicit definition of outliers

The following procedure can be used for MSD (or LRS) outlier identification:
    Calculate MSD for all genes.
    Estimate the p value of MSD based on the simulated null.
    Calculate the FDR based on the Benjamini-Hochberg method.
    Define outliers as the samples above the max(MSD) sample index, for genes with FDR < 0.05.
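The last two steps of the procedure can be sketched as follows, assuming each gene already has a p value, its tumor sample IDs sorted by ascending expression, and the number of tumor samples above the split that maximizes MSD. The input layout and function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def identify_outliers(results, alpha=0.05):
    """results maps gene -> (p_value, sorted_tumor_ids, n_above_split).

    Returns gene -> outlier sample IDs for genes with FDR below alpha.
    """
    genes = list(results)
    fdr = benjamini_hochberg([results[g][0] for g in genes])
    outliers = {}
    for gene, q in zip(genes, fdr):
        if q < alpha:
            _, sample_ids, n_above = results[gene]
            outliers[gene] = sample_ids[len(sample_ids) - n_above:]
    return outliers
```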

Application – Lung Cancer Data

One of the drivers in NSCLC is the EML4-ALK fusion (Soda et al 2007).
ALK fusion was associated with high ALK gene expression (Zhang et al 2010).
The prevalence of ALK fusion in NSCLC is about 5%.
Xalkori® is a highly effective ALK inhibitor in treating NSCLC patients with ALK fusion.

NSCLC Expression Data

The Cancer Genome Atlas (TCGA) has expression data generated from 57 normal lung samples and 355 lung adenocarcinoma samples.

Expression data were obtained using RNAseq.

ALK Gene Expression

No significant difference using t-test.

[Figure: ALK Gene Expression in Normal and Tumor NSCLC Patients. Expression levels [log2(Intensity)] by group: 1 (n = 57), 2 (n = 353).]

ALK Outlier Analysis

The LRS method failed to find any outliers; MSD identified 16 outliers (4.5%).

[Figure: waterfall plots of tumor vs. normal ALK expression levels (median centered and scaled), with normal, tumor and outlier samples marked.]

ALK Gene Fusion

The ALK gene has 29 exons.
The break point of the fusion is between E19 and E20.

[Figure: the normal ALK transcript (exons ... 16 17 18 19 | 20 21 22 23 ...) and the EML4-ALK fusion transcript (EML4 or other partner | 20 21 22 23 ...). Labels: upstream of ALK, junction, downstream of ALK.]

RNAseq ALK Exon Expression

RNAseq provides a way to measure exon-level expression. Exons 20-29 showed high expression while exons 1-19 had very low expression, an indication of a fusion event.
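The exon pattern described above can be summarized per sample as the difference between the mean expression of the exons after the break point and the mean of the exons before it. This is only an illustrative sketch: the input layout (a list of 29 log2-scale exon values, exon 1 first) and the function name are assumptions, and no cutoff for calling a fusion is implied.

```python
def exon_imbalance(exon_expression, breakpoint_exon=19):
    """Mean expression of exons after the break point minus the mean of
    exons up to and including it (exon 1 is at index 0)."""
    upstream = exon_expression[:breakpoint_exon]    # exons 1-19
    downstream = exon_expression[breakpoint_exon:]  # exons 20-29
    return sum(downstream) / len(downstream) - sum(upstream) / len(upstream)
```

A sample with uniform expression across all 29 exons gives 0, while a sample expressing only exons 20-29 gives a large positive value.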

Fusion Samples

Among the 16 outlier samples, 7 showed fusion characteristics in exon expression.

Fusion Samples vs. Outlier Samples

Of all 355 tumor samples, 8 showed fusion characteristics in exon expression (marked by “+”); they are in the top 20 samples in ALK mRNA expression.

Summary

We proposed a new cancer outlier analysis method, MSD, and compared it to existing methods.

MSD was shown to be more sensitive in detecting outliers when the prevalence of outliers was small (<20%).

References

Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310(5748):644-8.

Tibshirani R, Hastie T. (2007) Outlier sums for differential gene expression analysis. Biostatistics 8:2-8.

Wu B. (2007) Cancer outlier differential gene expression detection. Biostatistics 8:566-75.

Hu J. (2008) Cancer outlier detection based on likelihood ratio test. Bioinformatics 24(19):2193-2199.

Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H, et al. (2007) Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448:561-566.

Zhang X, Zhang S, Yang X, Yang J, Zhou Q, et al. (2010) Fusion of EML4 and ALK is associated with development of lung adenocarcinomas lacking EGFR and KRAS mutations and is correlated with ALK expression. Mol Cancer 9:188.

Acknowledgements

Fred Immermann
Pfizer Oncology Research Unit at La Jolla, CA
Computational Biology
Asia Omics Project Team
