Shibing Deng, Pfizer, Inc.
Efficient Outlier Identification in Lung Cancer Study
Outline
Background and motivation
COPA statistics
Existing methods
A new method
Comparison of COPA statistics
Application to lung cancer data
What is COPA Statistics?
COPA = Cancer Outlier Profile Analysis
Statistics designed to identify outliers in cancer gene expression profiles
[Figure: ACTL8 expression (FC=1.27, FDR=0.031), log2(Intensity) by group, Normal (n=37) vs. Tumor (n=95), with the outliers labeled.]
Motivation
Differential gene expression (DGE) analysis is widely used to identify over- or under-expressed cancer genes.
It assumes two distinct populations: tumor and normal.
However, cancer is not a homogeneous disease:
  Genetically diverse
  Oncogenes have heterogeneous activation patterns
  DGE may occur in only a subset of samples.
COPA identifies DGE in a subset of cancer patients.
Example of Cancer Heterogeneity
Molecular Subsets of Lung Adenocarcinoma
Pao W, Hutchinson K. 2012 Mar 6;18(3):349-51
COPA Methods in Literature
Original COPA method Tomlins et al 2005
Outlier Sum (OS) Tibshirani and Hastie 2007
Outlier Robust T (ORT) Wu 2007
Likelihood Ratio Statistic (LRS) Hu 2008
Notation
$n_1$ = # of normal samples
$n_2$ = # of tumor samples
$n = n_1 + n_2$ is the total # of samples
$x_{ij}$ is the expression value for sample $i$ and gene $j$
For gene $j$ (for simplicity, index $j$ is not shown below):
Normal samples ($n_1$): $x_1, x_2, x_3, \ldots, x_{n_1}$
Tumor samples ($n_2$): $x_{n_1+1}, x_{n_1+2}, \ldots, x_i, \ldots, x_n$
The Original COPA Method
Tomlins et al (2005) proposed the original COPA method.
Standardize each gene based on median and MAD
Define the COPA statistic as the rth (r = 75, 90, 95) percentile of the standardized tumor samples
Limitations:
1) Fixed r
   With r = 90, it can only detect outliers with expression levels greater than those of 90% of the tumor samples
   Not efficient in differentiating the number of outliers
2) MAD is calculated over all samples
   Outliers can affect the estimate of MAD
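The standardize-then-percentile steps above can be sketched for a single gene as follows. This is a minimal illustration, not the published implementation: the function name `copa_statistic`, the toy data, and the linear-interpolation percentile are assumptions made here.

```python
from statistics import median

def copa_statistic(normal, tumor, r=90):
    """Return the r-th percentile of the standardized tumor values for one gene."""
    pooled = list(normal) + list(tumor)
    med = median(pooled)
    # MAD over all samples (normal and tumor together), as in the original method.
    mad = median(abs(x - med) for x in pooled)
    if mad == 0:
        raise ValueError("MAD is zero; the gene cannot be standardized")
    z = sorted((x - med) / mad for x in tumor)
    # Percentile by linear interpolation between order statistics
    # (an assumed convention; other percentile definitions differ slightly).
    pos = (len(z) - 1) * r / 100
    lo = int(pos)
    hi = min(lo + 1, len(z) - 1)
    return z[lo] + (z[hi] - z[lo]) * (pos - lo)

# Toy data (hypothetical): one tumor sample is far above the rest.
normal = [1.0, 2.0, 3.0]
tumor = [2.0, 2.0, 10.0]
for r in (75, 90, 95):
    print(r, copa_statistic(normal, tumor, r))
```

Because the percentile r is fixed in advance, the value returned does not say how many tumor samples are outlying, which is the first limitation listed above.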
Outlier Sum (OS)
Standardize each gene
  Median centering
  Scale on MAD based on normal samples
$$x'_{ij} = \frac{x_{ij} - \mathrm{median}_j}{\mathrm{MAD}_{1j}}$$
Define the OS statistic as the sum of the standardized data from outliers, where outliers are data above Q3+IQR
$$OS_j = \sum_{i > n_1} x'_{ij} \, I\left[x'_{ij} > q_{75}(x'_j) + \mathrm{IQR}(x'_j)\right]$$
Improvement over COPA:
1. Outliers are defined based on data distribution (not fixed)
2. Takes account of the number of outliers
3. Better scaling factor – MAD1
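A small sketch of the outlier sum for one gene, under stated assumptions: the centering median is taken over all samples, MAD1 is the MAD of the normal samples about their own median, and the quartiles are computed over all standardized samples with the "inclusive" convention of Python's `statistics.quantiles`. The function name and toy data are hypothetical.

```python
from statistics import median, quantiles

def outlier_sum(normal, tumor):
    """Return (OS, number of tumor outliers) for one gene."""
    normal, tumor = list(normal), list(tumor)
    pooled = normal + tumor
    med = median(pooled)                                  # centering: median of all samples
    med1 = median(normal)
    mad1 = median(abs(x - med1) for x in normal)          # scaling: MAD of normal samples (assumption)
    z_all = [(x - med) / mad1 for x in pooled]
    q1, _, q3 = quantiles(z_all, n=4, method="inclusive")
    cutoff = q3 + (q3 - q1)                               # Q3 + IQR
    z_tumor = z_all[len(normal):]
    outliers = [z for z in z_tumor if z > cutoff]
    return sum(outliers), len(outliers)

print(outlier_sum([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 20.0, 30.0]))
```

Unlike the fixed-percentile COPA statistic, the sum grows with both the number and the size of the outlying tumor values.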
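Outlier Robust T (ORT)
Similar to OS
Different centering (normal group median) and scaling factors (pooled MAD)
$$x'_{ij} = \frac{x_{ij} - \mathrm{median}_{1j}}{\mathrm{median}\left(|x_{ij} - \mathrm{median}_{1j}|_{i \le n_1},\ |x_{ij} - \mathrm{median}_{2j}|_{i > n_1}\right)}$$
where $\mathrm{median}_{1j}$ and $\mathrm{median}_{2j}$ are the normal and tumor group medians
Define ORT as
$$ORT_j = \sum_{i > n_1} x'_{ij} \, I\left[x'_{ij} > q_{75}(x'_{kj} : k \le n_1) + \mathrm{IQR}(x'_{kj} : k \le n_1)\right]$$
Outlier threshold is based on normal group data only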
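The ORT steps can be sketched the same way as OS. The sketch below assumes the pooled MAD is the median of the absolute deviations of each sample from its own group median, and uses the "inclusive" quartile convention of `statistics.quantiles`. The function name and toy data are hypothetical.

```python
from statistics import median, quantiles

def outlier_robust_t(normal, tumor):
    """Return (ORT, number of tumor outliers) for one gene."""
    normal, tumor = list(normal), list(tumor)
    med1, med2 = median(normal), median(tumor)
    # Pooled MAD: deviations of each sample from its own group median.
    scale = median([abs(x - med1) for x in normal] + [abs(x - med2) for x in tumor])
    z_normal = [(x - med1) / scale for x in normal]       # centered on the normal median
    z_tumor = [(x - med1) / scale for x in tumor]
    # Threshold from the normal group only.
    q1, _, q3 = quantiles(z_normal, n=4, method="inclusive")
    cutoff = q3 + (q3 - q1)
    outliers = [z for z in z_tumor if z > cutoff]
    return sum(outliers), len(outliers)

print(outlier_robust_t([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 20.0, 30.0]))
```

Because the cutoff is computed from the normal samples alone, it does not move when the tumor group contains many outliers.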
Likelihood Ratio Statistic (LRS)
Outlier => a change-point problem
Group normal and tumor samples separately, and sort them within each group in ascending order
Normal samples ($n_1$): $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n_1)}$
Tumor samples ($n_2$): $x_{(n_1+1)}, x_{(n_1+2)}, \ldots, x_{(i)}, \ldots, x_{(n)}$
Separate all the samples into two groups at the k-th tumor sample, k = n1+1, n1+2, …, n-1, and form a two-sample t statistic
$$t_k = \frac{\bar{x}_{i>k} - \bar{x}_{i \le k}}{\hat{s}_k} \quad \text{with} \quad \hat{s}_k = s\sqrt{\frac{1}{k} + \frac{1}{n-k}}$$
where $s$ is the sample standard deviation
Define
$$LRS = \max_{n_1 < k < n} (t_k)$$
Comments on LRS
A maximum t statistic
Does not provide an explicit definition of outliers
Every gene provides a max(t)
Need a significance measure (p value) to define outliers
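The change-point search can be sketched as a scan over split positions. In this illustration the standard deviation in the t statistic is taken as the pooled within-group standard deviation of the two groups formed by each split; that choice is an assumption, since the slide only says "sample standard deviation". The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean

def lrs_statistic(normal, tumor):
    """Return (max t, split index k) over splits k = n1+1, ..., n-1 for one gene."""
    normal, tumor = sorted(normal), sorted(tumor)
    x = normal + tumor                       # sorted within each group, normal first
    n1, n = len(normal), len(x)
    best_t, best_k = float("-inf"), None
    for k in range(n1 + 1, n):
        low, high = x[:k], x[k:]
        m_low, m_high = mean(low), mean(high)
        ss = sum((v - m_low) ** 2 for v in low) + sum((v - m_high) ** 2 for v in high)
        if ss == 0:
            continue                         # no within-group spread; t is undefined
        s = sqrt(ss / (n - 2))               # pooled SD (assumption, see lead-in)
        t_k = (m_high - m_low) / (s * sqrt(1 / k + 1 / (n - k)))
        if t_k > best_t:
            best_t, best_k = t_k, k
    return best_t, best_k

t_max, k = lrs_statistic([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 20.0, 30.0])
print(round(t_max, 3), k)
```

The split index k that attains the maximum is what later separates candidate outliers (samples above k) from the rest, but the statistic itself gives no cutoff for significance.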
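A New Method – Maximum Square Difference (MSD)
Similar to LRS, but instead of using a t statistic, we can use a squared difference
Define
$$d_k^2 = \frac{(\bar{x}_{i>k} - \bar{x}_{i \le k})^2}{\hat{s}_k \, s} \quad \text{with} \quad \hat{s}_k = s\sqrt{\frac{1}{k} + \frac{1}{n-k}} \quad \text{and } s = \text{SE for all samples}$$
$$MSD = \max_{n_1 < k < n} (d_k^2)$$
More sensitive when the number of outliers is small.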
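A hedged sketch of an MSD-type scan follows. It assumes the form d_k^2 = (difference of group means)^2 / (s_hat_k * s), with s_hat_k = s * sqrt(1/k + 1/(n-k)), and takes s to be the standard deviation of all n samples for the gene. Both the normalization and the choice of s are one reading of the definition, not a confirmed implementation, and the function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def msd_statistic(normal, tumor):
    """Return (max d_k^2, split index k) over splits k = n1+1, ..., n-1 for one gene."""
    normal, tumor = sorted(normal), sorted(tumor)
    x = normal + tumor                       # sorted within each group, normal first
    n1, n = len(normal), len(x)
    s = stdev(x)                             # spread of all samples (assumption, see lead-in)
    best_d2, best_k = float("-inf"), None
    for k in range(n1 + 1, n):
        diff = mean(x[k:]) - mean(x[:k])
        s_hat_k = s * sqrt(1 / k + 1 / (n - k))
        d2 = diff ** 2 / (s_hat_k * s)
        if d2 > best_d2:
            best_d2, best_k = d2, k
    return best_d2, best_k

d2_max, k = msd_statistic([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 20.0, 30.0])
print(round(d2_max, 3), k)
```

Under this reading, the squared difference is divided by sqrt(1/k + 1/(n-k)) rather than by (1/k + 1/(n-k)) as in a squared t statistic, so splits that isolate only a few top samples are penalized less.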
Comparison of the Methods - ROC
Comparisons of the methods were evaluated based on simulation using ROC curves.
With n1=n2=20, we simulate 8000 null genes from the standard normal. We also simulate 2000 up-regulated genes, in which k = 2, 5, 10 or 15 of the 20 tumor samples are drawn from N(2,1).
Based on the percentiles of the COPA statistic from the null genes, we define the detection threshold for a given false positive rate (FPR). The true positive rate (TPR) is
TPR = Prob(copa >= threshold | up-regulated genes)
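The simulation design can be sketched as below. The statistic used here (`mean_difference`) is only a stand-in so the example stays short; any COPA-type statistic taking `(normal, tumor)` could be passed instead. Gene counts are reduced from the 8000 + 2000 in the text to keep the run fast, and all function names are hypothetical.

```python
import random
from statistics import mean

def simulate_gene(rng, n1, n2, k_up=0, shift=2.0):
    """One gene: n1 normal values from N(0,1); the first k_up of n2 tumor values from N(shift,1)."""
    normal = [rng.gauss(0.0, 1.0) for _ in range(n1)]
    tumor = [rng.gauss(shift if i < k_up else 0.0, 1.0) for i in range(n2)]
    return normal, tumor

def mean_difference(normal, tumor):
    # Stand-in statistic for illustration only; swap in any COPA-type statistic.
    return mean(tumor) - mean(normal)

def tpr_at_fpr(null_stats, alt_stats, fpr):
    """TPR when the threshold is set so that a fraction fpr of null genes reach it."""
    ordered = sorted(null_stats)
    n_false = max(1, round(fpr * len(ordered)))
    threshold = ordered[-n_false]            # n_false null statistics are >= threshold
    return sum(s >= threshold for s in alt_stats) / len(alt_stats)

rng = random.Random(0)
null_stats = [mean_difference(*simulate_gene(rng, 20, 20)) for _ in range(800)]
alt_stats = [mean_difference(*simulate_gene(rng, 20, 20, k_up=5)) for _ in range(200)]
roc = [(fpr, tpr_at_fpr(null_stats, alt_stats, fpr)) for fpr in (0.01, 0.05, 0.1, 0.2, 0.5)]
for fpr, tpr in roc:
    print(f"FPR={fpr:.2f}  TPR={tpr:.3f}")
```

Repeating the last three lines for each statistic and each k gives the points of one ROC curve per method.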
ROC (n1=n2=20, k=2 and 5)
[Figure: ROC curves (TPR vs. FPR), panels k = 2 and k = 5; methods compared: COPA, ORT, OS, LRS, T, MSD.]
ROC (n1=n2=50,k=5,10)
[Figure: ROC curves (TPR vs. FPR), panels k = 5 and k = 10; methods compared: COPA, ORT, OS, LRS, T, MSD.]
Comparison of the Methods - FDR
Comparison of methods can also be evaluated based on false discovery rate (FDR).
Simulate n1=n2= 20, 50 samples with 10000 genes, among which 2000 are up-regulated in k tumor samples.
For each detection threshold c of the COPA statistic, FDR is the proportion of false positives among all positives.
FDR = # of false positives / # of all claimed positives = sum(copa >= c | null genes) / sum(copa >= c | all genes)
A plot of FDR vs positive rate is created
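The FDR-versus-positive-rate curve can be computed by ranking all genes by their statistic and sweeping the threshold down the ranking. A minimal sketch, assuming no ties among the statistics; the function name is hypothetical.

```python
def fdr_curve(null_stats, alt_stats):
    """Return (fraction declared positive, FDR) pairs, one per threshold.

    Each threshold c is set at one observed statistic, from largest to smallest,
    so that the top `rank` genes are declared positive.
    """
    labelled = [(s, True) for s in null_stats] + [(s, False) for s in alt_stats]
    labelled.sort(key=lambda pair: pair[0], reverse=True)
    total = len(labelled)
    points = []
    false_pos = 0
    for rank, (_, is_null) in enumerate(labelled, start=1):
        false_pos += is_null                 # null genes above the threshold are false positives
        points.append((rank / total, false_pos / rank))
    return points

# Toy example: three null genes and two up-regulated genes.
for positive_rate, fdr in fdr_curve([1.0, 2.0, 3.0], [10.0, 2.5]):
    print(f"positive rate={positive_rate:.1f}  FDR={fdr:.3f}")
```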
FDR : n1=n2=20, k=2, 5
[Figure: FDR vs. fraction of genes declared positive, panels k = 2 and k = 5; methods compared: COPA, ORT, OS, LRS, T, MSD.]
FDR : n1=n2=50, k=5, 10
[Figure: FDR vs. fraction of genes declared positive, panels k = 5 and k = 10; methods compared: COPA, ORT, OS, LRS, T, MSD.]
Comparison of the Methods - Summary
Our new MSD method performs best when a small percentage (≤ 20%) of tumor samples are differentially expressed (DE), i.e. are outliers.
For a moderate number of DE samples (20-50%), LRS performs better in ROC.
For a large number of DE samples (>50% of tumors), the t statistic becomes more efficient.
When a relatively large number (>30%) of DE samples exist, MSD, LRS, ORT and T have comparable FDR.
Assess Significance
The distributions of all COPA statistics are not known
An analytic solution was not easily available
Permutation test does not generate the correct null distribution.
Simulation:
Simulate COPA statistics under the null and derive the null distribution based on a relatively large number of simulations, say, n=10000.
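A simulated null and an empirical p value can be sketched as follows. The stand-in statistic `max_minus_median`, the function names, and the "+1" correction in the p value are assumptions for illustration; any COPA-type statistic taking `(normal, tumor)` can be substituted.

```python
import random
from bisect import bisect_left
from statistics import median

def simulate_null(statistic, n1, n2, n_sim=10000, seed=0):
    """Sorted statistics from n_sim genes with all samples drawn from N(0,1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sim):
        normal = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        tumor = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        draws.append(statistic(normal, tumor))
    return sorted(draws)

def empirical_p_value(observed, null_sorted):
    """Upper-tail p value; the +1 terms keep it away from exactly zero (a common convention)."""
    n_ge = len(null_sorted) - bisect_left(null_sorted, observed)
    return (1 + n_ge) / (1 + len(null_sorted))

def max_minus_median(normal, tumor):
    # Stand-in statistic for illustration only.
    return max(tumor) - median(normal)

null_sorted = simulate_null(max_minus_median, 20, 20, n_sim=2000)
print(empirical_p_value(4.0, null_sorted))
```

The smallest p value this approach can report is 1/(n_sim + 1), so the number of simulations limits how far into the tail a gene can be ranked.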
Distribution of COPA statistics
Simulated null for n1=n2=20.
[Figure: density plots of the simulated null distributions, panels: Original COPA, OS (non-zero values), ORT (non-zero values), LRS.]
MSD Distribution
Simulated under the null, 10000 genes, n1=n2=20, data from N(0,1)
The figures display the pdf of both MSD and y=sqrt(MSD).
The fitted dashed line is a non-central chi-square density function for MSD and a normal density for y.
$$y \sim N(\mu, \sigma^2), \qquad MSD \sim \sigma^2 \chi^2\left(1, \frac{\mu^2}{\sigma^2}\right)$$
[Figure: density plots of the MSD statistic and of sqrt(MSD), each with its fitted density as a dashed line.]
MSD Distribution – Parameters
Both $\mu$ and $\sigma$ are functions of n1 and n2, as well as of the underlying gene expression distribution. If we assume gene expression follows a N(0,1) distribution, then the MSD parameters will be $\mu(n_1,n_2)$ and $\sigma^2(n_1,n_2)$.
[Figure: fitted parameters plotted against n1 and n2.]
Plots show $\mu$ is driven by n2, and $\sigma$ is driven by the n2/n1 ratio.
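If sqrt(MSD) is treated as approximately normal under the null, a p value can be read from a fitted normal instead of from the raw simulated values. A minimal sketch, assuming the two parameters are estimated as the sample mean and standard deviation of sqrt(MSD) over the simulated null genes; the function names and toy numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def fit_sqrt_normal(null_msd):
    """Fit a normal distribution to y = sqrt(MSD) from simulated null genes."""
    y = [sqrt(v) for v in null_msd]
    return NormalDist(mean(y), stdev(y))

def msd_p_value(msd, fitted):
    """Upper-tail p value of an observed MSD under the fitted normal for sqrt(MSD)."""
    return 1.0 - fitted.cdf(sqrt(msd))

# Toy null values (hypothetical); in practice these come from the null simulation.
fitted = fit_sqrt_normal([1.0, 4.0, 9.0, 16.0, 25.0])
print(fitted.mean, fitted.stdev, msd_p_value(25.0, fitted))
```

A parametric fit of this kind allows p values smaller than 1/(number of simulations), at the cost of relying on the normal approximation in the tail.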
Outlier Identification
COPA, OS and ORT define outlier samples in their methods.
MSD and LRS do not provide an explicit definition of outliers
The following procedure can be used for MSD (or LRS) outlier identification
1. Calculate MSD for all genes
2. Estimate the p value of MSD based on the simulated null
3. Calculate FDR based on the Benjamini-Hochberg method
4. Define outliers as the samples above the max(MSD) sample index, for genes with FDR<0.05
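The last two steps of the procedure can be sketched as follows: a Benjamini-Hochberg adjustment of the per-gene p values, then selection of the tumor samples that sit above the split index k at which the statistic was maximal. The function names are hypothetical, and k is assumed to count the samples in the lower group (all n1 normal samples plus the lowest tumor samples).

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):             # from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def outlier_samples(tumor_values, n1, best_k):
    """Indices (into tumor_values) of the tumor samples ranked above split index best_k."""
    order = sorted(range(len(tumor_values)), key=lambda i: tumor_values[i])
    return order[best_k - n1:]

# Toy example: four genes; only genes with adjusted p < 0.05 have outliers called.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(fdr)
if fdr[0] < 0.05:
    print(outlier_samples([20.0, 1.0, 30.0, 2.0, 3.0], n1=5, best_k=8))
```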
Application – Lung Cancer Data
One of the drivers in NSCLC is the EML4-ALK fusion (Soda et al 2007).
ALK fusion was associated with high ALK gene expression (Zhang et al 2010)
The prevalence of ALK fusion in NSCLC is about 5%.
Xalkori® is a highly effective ALK inhibitor in treating NSCLC patients with ALK fusion.
NSCLC Expression Data
The Cancer Genome Atlas (TCGA) has expression data generated from 57 normal lung samples and 355 lung adenocarcinoma samples.
Expression data were obtained using RNAseq.
ALK Gene Expression
No significant difference using t-test
[Figure: ALK Gene Expression in Normal and Tumor NSCLC Patients. Expression levels, log2(Intensity), by group: 1 (n = 57) and 2 (n = 353).]
ALK Outlier Analysis
LRS method failed to find any outliers, MSD identified 16 outliers (4.5%)
[Figure: waterfall plots of tumor vs normal ALK expression levels (median centered and scaled); legend: Normal, Tumor, Outlier.]
ALK Gene Fusion
ALK gene has 29 exons
The break point of fusion is between E19 and E20.
[Figure: exon diagrams around the junction. Normal ALK transcript: exons 16-19 upstream and exons 20-23 downstream. EML4-ALK fusion: EML4 or other partner joined to ALK exons 20-23.]
RNAseq ALK Exon Expression
RNAseq provides a way to measure exon-level expression.
Exons 20-29 showed high expression while exons 1-19 had very low expression, an indication of a fusion event.
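The exon-level pattern can be summarized per sample as the gap between mean expression downstream and upstream of the break point. This is only an illustrative sketch: the function name, the log2-scale input, and the toy values are hypothetical, and no cutoff for calling a fusion is implied by the source.

```python
from statistics import mean

def exon_imbalance(exon_expr, breakpoint_exon=19):
    """Mean expression of exons after the break point minus mean of exons up to it.

    exon_expr: per-exon expression values for one sample, exon 1 first
    (29 values for ALK), assumed to be on a log2 scale.
    """
    upstream = exon_expr[:breakpoint_exon]       # exons 1-19
    downstream = exon_expr[breakpoint_exon:]     # exons 20-29
    return mean(downstream) - mean(upstream)

# Toy profiles (hypothetical values): fusion-like vs. flat expression.
fusion_like = [1.0] * 19 + [6.0] * 10
flat = [3.0] * 29
print(exon_imbalance(fusion_like), exon_imbalance(flat))
```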
Fusion Samples
Among the 16 outlier samples, 7 samples showed fusion characteristics in exon expression.
Fusion Samples vs. Outlier Samples
Of all 355 tumor samples, 8 showed fusion characteristics from exon expression (marked by "+"); they are in the top 20 samples in ALK mRNA expression.
Summary
We proposed a new cancer outlier analysis method, MSD, and compared it to existing methods.
MSD was shown to be more sensitive in detecting outliers when the prevalence of outliers was small (<20%).
References
Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310(5748):644-8.
Tibshirani R, Hastie T. (2007) Outlier sums for differential gene expression analysis. Biostatistics 8:2-8.
Wu B. (2007) Cancer outlier differential gene expression detection. Biostatistics 8:566-75.
Hu J. (2008) Cancer outlier detection based on likelihood ratio test. Bioinformatics 24(19):2193-2199.
Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H, et al. (2007) Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448:561-566.
Zhang X, Zhang S, Yang X, Yang J, Zhou Q, et al. (2010) Fusion of EML4 and ALK is associated with development of lung adenocarcinomas lacking EGFR and KRAS mutations and is correlated with ALK expression. Mol Cancer 9:188.
Acknowledgements
Fred Immermann
Pfizer Oncology Research Unit at La Jolla, CA
Computational Biology
Asia Omics Project Team