Gene Set Enrichment Analysis / Microarray Classification
STAT115
Jun S. Liu and Xiole Shirley Liu
Outline
• Gene ontology
  – Check differential expression and clustering results
  – Gene set enrichment analysis
• Unsupervised learning for classification
  – Clustering and KNN
  – PCA (dimension reduction)
• Supervised learning for classification
  – CART, SVM
• Expression and genome resources
GO
• Relationships:
  – Subclass: is_a
  – Membership: part_of
  – Topological: adjacent_to; Derivation: derives_from
  – E.g. 5_prime_UTR is part_of a transcript, and mRNA is_a kind of transcript
• The same term can be annotated on multiple branches
• Directed acyclic graph
Evaluate Differentially Expressed Genes
• NetAffx maps GO terms for all probe sets

              Whole genome   Up genes
  GO term X   100            80
  Total       20K            200

• Statistical significance?
• Binomial proportion test
  – p = 100 / 20K = 0.005
  – z = (x − np) / √(np(1 − p)) = (80 − 200 × 0.005) / √(200 × 0.005 × 0.995) ≈ 79.2
  – Check the z table
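The arithmetic above can be checked in a few lines of Python (a minimal sketch; the counts are the ones from the slide):

```python
import math

# Binomial proportion test for GO term enrichment (counts from the slide)
n = 200          # number of up-regulated genes
x = 80           # up-regulated genes annotated with GO term X
p = 100 / 20000  # background proportion: 100 of 20K genes carry the term

# z = (x - np) / sqrt(n p (1 - p))
z = (x - n * p) / math.sqrt(n * p * (1 - p))
print(round(z, 1))  # 79.2 -- far beyond any z-table cutoff
```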
Evaluate Differentially Expressed Genes
              Whole genome   Up genes
  GO term X   100            80
  Total       20K            200

• Chi-square test (expected counts in parentheses):

          Up           !Up              Total
  GO:     80 (1)       20 (99)          100
  !GO:    120 (199)    19780 (19701)    19900
  Total:  200          19800            20K

  χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

  – Check the chi-square table
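The same table can be run through a chi-square test; a sketch assuming SciPy is available (the observed !GO/!Up count is 19900 − 120 = 19780):

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2x2 contingency table: rows = GO / !GO, columns = Up / !Up
observed = np.array([[80, 20],
                     [120, 19780]])

# correction=False matches the plain formula sum_i (O_i - E_i)^2 / E_i
chi2, pval, dof, expected = chi2_contingency(observed, correction=False)
print(expected[0, 0])  # expected GO-and-Up count: 200 * 100 / 20000 = 1
print(round(chi2, 1))  # about 6335.7 on 1 df: massive enrichment
```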
GO Tools for Microarray Analysis
• 40 tools
GO on Clustering
• Evaluate and refine clustering
  – Check GO terms for members of the cluster
  – Are any GO terms significantly enriched?
  – Can we summarize what this cluster of genes does?
  – Are there conflicting members in the cluster?
• Annotate unknown genes
  – After clustering, check GO terms
  – Can we infer an unknown gene’s function based on the GO terms of its cluster members?
Gene Set Enrichment Analysis
• In some microarray experiments comparing two conditions, there may be no single gene that is significantly differentially expressed, but a group of genes that are all slightly differentially expressed
• Check a set of genes with similar annotation (e.g. GO) and examine their expression values
  – Kolmogorov-Smirnov test
  – One-sample z-test
• GSEA at the Broad Institute
Gene Set Enrichment Analysis
• Kolmogorov-Smirnov test
  – Determines whether two distributions differ significantly
  – Based on the cumulative fraction function
    • What fraction of genes fall below this fold change?
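A sketch of the idea on simulated fold changes, assuming SciPy; the gene set carries a modest coordinated shift that no single gene would show as individually significant:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical fold changes: background genes center at 0, while a
# 100-gene set shares a small coordinated shift
background = rng.normal(loc=0.0, scale=1.0, size=5000)
gene_set = rng.normal(loc=0.8, scale=1.0, size=100)

# KS statistic = maximum gap between the two cumulative fraction curves
stat, pval = ks_2samp(gene_set, background)
print(stat, pval)  # small p-value: the set as a whole is shifted
```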
Gene Set Enrichment Analysis
• A set of genes with a specific annotation may be involved in coordinated down-regulation
• Need to define the set before looking at the data
• The significance is only visible when looking at the whole set
Gene Set Enrichment Analysis
• Alternative to KS: one-sample z-test
  – The population of all genes follows a normal distribution ~ N(μ, σ²)
  – For the average X̄ of the |X| genes with a specific annotation:

    z = (X̄ − μ) √|X| / σ
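A minimal sketch of this z-test; μ, σ and the gene-set values below are made up for illustration:

```python
import math

# Population of all genes assumed N(mu, sigma^2); hypothetical values
mu, sigma = 0.0, 1.0
# Hypothetical (e.g. log fold-change) values for genes in the annotated set
gene_set = [0.5, 0.3, 0.8, 0.4, 0.6, 0.2, 0.7, 0.5, 0.4, 0.6]

n = len(gene_set)                   # |X|, the size of the gene set
xbar = sum(gene_set) / n            # average of the set
# z = (xbar - mu) * sqrt(|X|) / sigma
z = (xbar - mu) * math.sqrt(n) / sigma
print(round(z, 2))  # 1.58 for these toy numbers
```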
Dimension Reduction
• High-dimensional data points are difficult to visualize
• Always good to plot data in 2D
  – Easier to detect or confirm the relationship among data points
  – Catch stupid mistakes (e.g. in clustering)
• Two ways to reduce:
  – By genes: some experiments are similar or carry little information
  – By experiments: some genes are similar or carry little information
Principal Component Analysis
• An optimal linear transformation that chooses a new coordinate system for the data set, projecting the data onto new axes (the principal components) in order of decreasing variance
• Components are orthogonal (mutually uncorrelated)
• A few PCs may capture most of the variation in the original data
• E.g. reduce 2D data to 1D
Principal Component Analysis
• Achieved by singular value decomposition (SVD): X = U D Vᵀ
• X is the original N × p data
  – E.g. N genes, p experiments
• V is the p × p matrix of projection directions
  – Orthogonal matrix: VᵀV = I_p
  – v1 is the direction of the first projection
  – Each direction is a linear combination (relative importance) of the experiments (or of the genes, if PCA is on samples)
PCA
• U is N × p, the relative projection of the points
• D is a p × p scaling factor
  – Diagonal matrix with d1 ≥ d2 ≥ … ≥ dp ≥ 0, e.g.

    D = diag(5, 3, 2, 0.5)

• u_i1 d1 is the distance along v1 from the origin (the first principal component)
  – Expression value projected on v1
  – v2 is the 2nd projection direction; u_i2 d2 is the 2nd principal component, and so on
• Variance captured by the first m principal components:

    Σ_{i=1..m} d_i² / Σ_{j=1..p} d_j²
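The SVD machinery can be sketched with numpy on a random toy matrix (note that, following the slide, the columns are not mean-centered here, although standard PCA centers each column first):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy expression matrix: N = 6 genes x p = 4 experiments (random numbers)
X = rng.normal(size=(6, 4))

# Thin SVD: X = U D V^T, with d1 >= d2 >= ... >= dp >= 0
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(X, U @ np.diag(d) @ Vt)  # exact reconstruction

# Columns of V are projection directions; U*d are the principal components
scores = U * d
assert np.allclose(scores, X @ Vt.T)        # projecting X gives the same

# Fraction of variance captured by the first m components
m = 2
frac = (d[:m] ** 2).sum() / (d ** 2).sum()
print(frac)
```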
PCA
Original data × projection directions = projected values × scale:

    X (N × p) · V (p × p) = U (N × p) · D (p × p)

  X11 V11 + X12 V21 + X13 V31 + … = X′11 = U11 D11
  X21 V11 + X22 V21 + X23 V31 + … = X′21 = U21 D11   (1st principal component)
  X11 V12 + X12 V22 + X13 V32 + … = X′12 = U12 D22
  X21 V12 + X22 V22 + X23 V32 + … = X′22 = U22 D22   (2nd principal component)
PCA
[Figure: data plotted along the principal directions v1 and v2 as new axes]
PCA on Genes Example
• Cell cycle genes, 13 time points, reduced to 2D
• Genes: 1: G1; 4: S; 2: G2; 3: M

PCA Example
• Variance in the data explained by the first n principal components
PCA Example
• The weights of the first 8 principal directions [Figure: weights of v1–v4]
• This is an example of PCA to reduce samples
• Can do PCA to reduce the genes as well
  – Using the first 2–3 PCs to plot samples, and giving more weight to the more differentially expressed genes, can often reveal the sample classification
Microarray Classification
[Table: expression matrix with probe sets (39089_at, 35862_at, …) as rows and 14 samples as columns (Normal m412a–m430a, MM m282–m424a); the task is to predict the class of a new, unlabeled sample (“?”)]
Classification
• Equivalent to machine-learning methods
• Task: assign an object to a class based on measurements of the object
  – E.g. is a sample normal or cancer, based on its expression profile?
• Unsupervised learning
  – Ignores known class labels, e.g. cluster analysis or KNN
  – Sometimes cannot separate even the known classes
• Supervised learning
  – Extracts useful features, based on known class labels, to best separate the classes
  – Can overfit the data, so need to separate training and test sets (e.g. cross-validation)
Clustering Classification
• Which known samples does the unknown sample cluster with?
• No guarantee that the known samples will cluster together
• Try different clustering methods (semi-supervised)
  – E.g. change the linkage, or use a subset of genes
K Nearest Neighbor
• Also used in missing-value estimation
• For an observation X with unknown label, find the K observations in the training data closest to X (e.g. by correlation)
• Predict the label of X by majority vote among the K nearest neighbors
• K can be chosen by the predictability of the known samples: semi-supervised again!
• Offers little insight into mechanism
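A minimal KNN sketch using Pearson correlation as the similarity measure, with hypothetical two-class expression profiles:

```python
import numpy as np

def knn_predict(train, labels, x, k=3):
    """Classify profile x by majority vote among the k training profiles
    most correlated with it (an illustrative sketch, not a library API)."""
    cors = np.array([np.corrcoef(t, x)[0, 1] for t in train])
    nearest = np.argsort(cors)[::-1][:k]       # k most correlated samples
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)    # majority vote

# Hypothetical expression profiles: two normal samples, two tumors
train = np.array([[1.0, 2.0, 3.0, 4.0],
                  [1.1, 2.1, 2.9, 4.2],
                  [4.0, 3.0, 2.0, 1.0],
                  [4.2, 2.8, 2.1, 0.9]])
labels = ["normal", "normal", "cancer", "cancer"]

pred = knn_predict(train, labels, np.array([0.9, 2.2, 3.1, 3.8]), k=3)
print(pred)  # "normal": the query correlates with the increasing profiles
```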
STAT115 03/18/2008
Supervised Learning Performance Assessment
• If the error rate is estimated from the whole learning set, it will be over-optimistic (does well now, but poorly on future observations)
• Divide the observations into L1 and L2
  – Build the classifier using L1
  – Compute the classifier's error rate using L2
  – Requirement: L1 and L2 are iid (independent & identically distributed)
• N-fold cross-validation
  – Divide the data into N subsets of equal size; build the classifier on (N − 1) subsets and compute the error rate on the left-out subset
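The N-fold splitting described above can be sketched as an index generator (a minimal illustration, not a full classifier pipeline):

```python
import numpy as np

def nfold_splits(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for N-fold cross-validation:
    shuffle, cut into N roughly equal subsets, leave each out in turn."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

# Every sample is left out for testing exactly once across the N folds
all_test = np.concatenate([test for _, test in nfold_splits(14, 5)])
print(sorted(int(i) for i in all_test))  # [0, 1, ..., 13]
```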
Classification And Regression Tree
• Split the data using a set of binary (or multi-valued) decisions
• The root node (all the data) has a certain impurity; split the data to reduce the impurity
CART
• Measures of impurity
  – Entropy: −Σ_class P(class) log2 P(class)
  – Gini index: 1 − Σ_class P(class)²
• Example with Gini: multiply the impurity by the number of samples in the node
  – Root node (e.g. 8 normal & 14 cancer):

    22 × (1 − (8/22)² − (14/22)²) ≈ 10.18

  – Try a split on gene x_i (x_i ≥ 0: 13 cancer; x_i < 0: 1 cancer & 8 normal):

    13 × (1 − (13/13)²) + 9 × (1 − (1/9)² − (8/9)²) ≈ 0 + 1.78 = 1.78

  – Split on the gene with the biggest reduction in impurity
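The Gini arithmetic in the example can be reproduced in a few lines (counts taken from the slide):

```python
def gini_cost(*counts):
    """Gini impurity of a node scaled by its size: n * (1 - sum_c P(c)^2)."""
    n = sum(counts)
    return n * (1 - sum((c / n) ** 2 for c in counts))

# Root node: 8 normal and 14 cancer samples
root = gini_cost(8, 14)
print(round(root, 2))   # 10.18

# After splitting on gene x_i: {13 cancer} and {1 cancer, 8 normal}
split = gini_cost(13) + gini_cost(1, 8)
print(round(split, 2))  # 1.78 -- a large reduction in impurity
```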
CART
• Partitions are assumed independent, so the same level may split on different genes
• Stop splitting
  – When the impurity is small enough
  – When the number of samples in a node is small
• Pruning to reduce overfitting
  – Use the training set to split and a test set for pruning
  – Each split has a cost, compared against its gain
Support Vector Machine
• SVM
  – Which hyperplane is the best?
Support Vector Machine
• SVM finds the hyperplane that maximizes the margin
• The margin is determined by the support vectors (samples lying on the class edge); the others are irrelevant
Support Vector Machine
• SVM finds the hyperplane that maximizes the margin
• The margin is determined by the support vectors; the others are irrelevant
• Extensions:
  – Soft edge: support vectors get different weights
  – Non-separable case: slack variables > 0; maximize (margin − # bad)
Nonlinear SVM
• Project the data into a higher-dimensional space with a kernel function, so the classes can be separated by a hyperplane
  – E.g. K(x, y) = (x · y)²
• A few kernel functions are implemented in Matlab & BioConductor; the choice is usually trial and error and personal experience
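A small demonstration of why K(x, y) = (x · y)² works as a kernel: for 2-D inputs it equals an ordinary inner product in the 3-D feature space phi(x) = (x1², √2·x1·x2, x2²), so a hyperplane there corresponds to a curved boundary in the original space:

```python
import numpy as np

def K(x, y):
    """Polynomial kernel K(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Kernel trick: K computes the feature-space inner product without
# ever forming phi explicitly
print(K(x, y), np.dot(phi(x), phi(y)))  # both equal (1*3 + 2*(-1))^2 = 1
```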
Most Widely Used Sequence IDs
• GenBank: all submitted sequences
• EST: Expressed Sequence Tags (mRNA); some redundancy, might have contaminations
• UniGene: computationally derived, gene-based clusters of transcribed sequences
• Entrez Gene: comprehensive catalog of genes and associated information; ~ the traditional concept of a “gene”
• RefSeq: reference sequences for mRNAs and proteins, one per individual transcript (splice variant)
UCSC Genome Browser
• Can display custom tracks
Entrez: Main NCBI Search Engine
Public Microarray Databases
• SMD: Stanford Microarray Database; most Stanford and collaborators’ cDNA arrays
• GEO: Gene Expression Omnibus, an NCBI repository for gene expression and hybridization data; growing quickly
• Oncomine: Cancer Microarray Database
  – Published cancer-related microarrays
  – Raw data all processed; nice interface
Outline
• Gene ontology
  – Check differential expression and clustering, GSEA
• Microarray classification:
  – Unsupervised: clustering, KNN, PCA
  – Supervised learning for classification: CART, SVM
• Expression and genome resources
Acknowledgment
• Kevin Coombes & Keith Baggerly
• Darlene Goldstein
• Mark Craven
• George Gerber
• Gabriel Eichler
• Ying Xie
• Terry Speed & Group
• Larry Hunter
• Wing Wong & Cheng Li
• Ping Ma, Xin Lu, Pengyu Hong
• Mark Reimers
• Marco Ramoni
• Jenia Semyonov