Post on 19-Dec-2015
Statistics Tools in GeneSpring
The Center for Bioinformatics
UNC at Chapel Hill
Jianping Jin Ph.D.
Bioinformatics Scientist
Phone: (919)843-6015
E-mail: [email protected]
Fax: (919)966-6821
What Can GeneSpring Do?
• Works with both Affymetrix and two-color data.
• Views data graphically (classification, graph, tree, scatter plot, Venn diagram …).
• Performs statistical analyses.
• Annotates genes (updating from GenBank, LocusLink, UniGene; biochemical pathways).
• ……
• Clustering: k-means (non-hierarchical), self-organizing maps, gene trees (hierarchical dendrograms).
• Principal component analysis.
• t-test analyses (p-values).
• Genes like a known gene or an average of genes.
• Genes like a pattern drawn with the mouse.
• Genes with high confidence.
• Genes with relative expression in certain ranges.
• Pathway analysis: finding genes that fit in a certain place in a pathway.
• Sequence analysis to automatically find regulatory sequences.
• Automatic functional annotation of sub-trees in dendrograms.
• …
What statistical analyses does GS do?
Tree Clustering
1. Standard correlation
2. Smooth correlation
3. Change correlation
4. Upregulated correlation
5. Pearson correlation
6. Spearman correlation
7. Spearman confidence
8. Two-sided Spearman confidence
9. Distance
Notations to the Formulas
Result: the result of the calculation for genes A and B.
n: the number of samples being correlated over.
a: the vector $(a_1, a_2, a_3, \ldots, a_n)$ of expression values for gene A.
b: the vector $(b_1, b_2, b_3, \ldots, b_n)$ of expression values for gene B.
$a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$
$|a| = \sqrt{a \cdot a}$
Standard Correlation
• Equation: $a \cdot b / (|a||b|)$
• Also called “Pearson correlation around zero”.
• Measures the angular separation of the expression vectors for genes A and B.
• Answers the question “do the peaks match up?”
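As a minimal sketch, the standard correlation formula can be written directly from the notation above. The function names `dot` and `standard_correlation` are illustrative and are not GeneSpring names.

```python
import math

def dot(a, b):
    """Dot product: a_1*b_1 + a_2*b_2 + ... + a_n*b_n."""
    return sum(x * y for x, y in zip(a, b))

def standard_correlation(a, b):
    """a.b / (|a||b|): the cosine of the angle between the two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]   # same peaks as gene_a at twice the magnitude
print(standard_correlation(gene_a, gene_b))   # very close to 1
```

Because only the angle matters, two genes whose peaks line up score close to 1 even when their magnitudes differ.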
Pearson Correlation
• Equation: $A \cdot B / (|A||B|)$
• Very similar to the standard correlation, except that it measures the angle between the expression vectors for genes A and B around the mean of each expression vector.
• A is the vector made by subtracting the mean of all elements of a from each element of a.
• Do the same for b to make a vector B.
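A short sketch of the same calculation: center each vector on its own mean, then apply the standard correlation formula to the centered vectors. The function name is illustrative. The sign convention used for centering does not change the result as long as both genes are treated the same way.

```python
import math

def pearson_correlation(a, b):
    """Standard correlation of the mean-centered vectors A and B."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    A = [x - mean_a for x in a]   # centered copy of a
    B = [y - mean_b for y in b]   # centered copy of b
    numerator = sum(x * y for x, y in zip(A, B))
    denominator = math.sqrt(sum(x * x for x in A)) * math.sqrt(sum(y * y for y in B))
    return numerator / denominator

# The two profiles differ by a constant offset, so they correlate almost perfectly.
print(pearson_correlation([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))
```

A profile with no variation at all has a zero-length centered vector, so this sketch assumes neither gene is constant.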
Spearman Confidence
• r = the value of the Spearman correlation; SC = 1 - (probability you would get a value of r or higher by chance).
• A measure of similarity, not a correlation.
• The SC value is high when the Spearman correlation is high and the p-value is low.
• Takes account of the number of sub-experiments in your experiment set.
Two-sided Spearman Confidence
• A measure of similarity, very similar to the Spearman confidence.
• A two-sided test of whether the Spearman correlation is significantly greater than or less than zero.
• Answers the question “what genes behave similarly or opposite to a specific gene?”
• Probably not good for k-means or tree clustering.
• Equation: 1 - (probability you would get a Spearman correlation of |r| or higher, or -|r| or lower, by chance).
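The sketch below illustrates both confidence measures. The slides do not say how GeneSpring obtains the chance probability; as an assumption, this example uses an exact permutation test, which is only practical for a small number of samples. All function names are illustrative.

```python
import itertools
import math

def ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

def spearman_confidence(a, b, two_sided=False):
    """1 - P(chance correlation at least as extreme as the observed r)."""
    ra, rb = ranks(a), ranks(b)
    r = pearson(ra, rb)               # Spearman = Pearson on the ranks
    eps = 1e-12                       # tolerance for float comparison
    hits = total = 0
    for shuffled in itertools.permutations(rb):   # assumption: exact permutation test
        rp = pearson(ra, shuffled)
        total += 1
        if two_sided:
            hits += abs(rp) >= abs(r) - eps
        else:
            hits += rp >= r - eps
    return 1 - hits / total

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 5.0, 8.0, 10.0]        # rises whenever a rises
print(spearman_confidence(a, b), spearman_confidence(a, b, two_sided=True))
```

The number of samples matters because it sets how many orderings are possible: with only a few samples even a perfect Spearman correlation cannot reach a confidence very close to 1.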
Distance
• A measurement of dissimilarity, not a correlation at all.
• Euclidean distance between the expression profiles of genes A and B (each profile is a point in N-dimensional space, with one value per experiment point).
• Distance = $|a - b| / \sqrt{N}$, where N is the number of experiment points.
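A minimal sketch of the distance formula, with an illustrative function name:

```python
import math

def distance(a, b):
    """Euclidean distance |a - b| divided by sqrt(N)."""
    n = len(a)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) / math.sqrt(n)

# Two profiles that differ by 2 at every one of 4 experiment points.
print(distance([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))
```

Dividing by the square root of N makes the result the root-mean-square difference per experiment point, so it stays comparable between experiment sets of different sizes.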
Special Case Correlations
• Smooth correlation, change correlation and upregulated correlation.
• All three are modified versions of the standard correlation.
• They only make sense when the data are in a sequence, such as “before”/“after”, a time series, or a drug series.
Smooth Correlation
• Make a new vector A from a by interpolating the average of each consecutive pair of elements of a.
• Insert each new value between the two old values it was computed from.
• Do this for each pair of elements that would be connected by a line in the graph screen.
• Do the same to make a vector B from b.
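The interpolation step can be sketched as follows. The example assumes that every pair of consecutive points is connected by a line in the graph, and that the standard correlation is then applied to the two smoothed vectors. Function names are illustrative.

```python
import math

def smooth(a):
    """Insert the average of each consecutive pair between the pair."""
    out = []
    for left, right in zip(a, a[1:]):
        out.append(left)
        out.append((left + right) / 2)
    out.append(a[-1])          # assumes a is not empty
    return out

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def smooth_correlation(a, b):
    """Standard correlation of the smoothed vectors A and B."""
    return standard_correlation(smooth(a), smooth(b))

print(smooth([1.0, 3.0, 5.0]))   # a vector of n values becomes 2n - 1 values
```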
Change Correlation
• The opposite of what the smooth correlation looks for: it considers only the change in expression level between adjacent points.
• Similar to the standard correlation, but uses an arc tangent transformation of the ratio between adjacent pairs of points to create the expression vector. This is less sensitive to outliers than using the ratio directly.
• The value created between two values $a_i$ and $a_{i+1}$ is $\arctan(a_{i+1}/a_i) - \pi/4$.
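A sketch of the transformation, assuming positive, non-zero expression values so that every ratio is defined. An unchanged level (ratio 1) maps to 0 because atan(1) equals π/4. Function names are illustrative.

```python
import math

def change_vector(a):
    """atan(a[i+1] / a[i]) - pi/4 for each adjacent pair of points."""
    return [math.atan(a[i + 1] / a[i]) - math.pi / 4 for i in range(len(a) - 1)]

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def change_correlation(a, b):
    """Standard correlation of the two change vectors."""
    return standard_correlation(change_vector(a), change_vector(b))

# Both genes double and then halve, at very different absolute levels.
print(change_correlation([1.0, 2.0, 1.0], [10.0, 20.0, 10.0]))
```

A completely flat profile produces an all-zero change vector, for which the correlation is undefined; this sketch does not handle that case.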
Upregulated Correlation
• Very similar to the change correlation, but it only considers positive changes. All negative values of the arc tangent term are set to zero.
• Make a new vector A from a by looking at the change between each pair of elements of a.
• The value created between two values $a_i$ and $a_{i+1}$ is $\max(\arctan(a_{i+1}/a_i) - \pi/4,\ 0)$.
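The same idea as a sketch, again assuming positive, non-zero expression values. Decreases are clipped to zero, so only the increases contribute to the correlation. Function names are illustrative.

```python
import math

def upregulated_vector(a):
    """max(atan(a[i+1] / a[i]) - pi/4, 0) for each adjacent pair of points."""
    return [max(math.atan(a[i + 1] / a[i]) - math.pi / 4, 0.0)
            for i in range(len(a) - 1)]

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def upregulated_correlation(a, b):
    """Standard correlation of the two clipped change vectors."""
    return standard_correlation(upregulated_vector(a), upregulated_vector(b))

# Both genes go up at the first step; their different drops afterwards are ignored.
print(upregulated_correlation([1.0, 2.0, 1.0], [1.0, 3.0, 2.0]))
```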
Algorithm to Build Gene Tree
1. Determine if there is only one gene or subtree left. If yes, go to step five.
2. Find the two closest genes/subtrees.
3. Merge these two into one subtree.
4. Return to step one.
5. Merge together branches where the distance between sub-branches is less than the separation ratio, subject to considering genes with less than the minimum distance apart.
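The merge loop (steps one to four) can be sketched as below. Two things are assumptions here: the Distance measure is used to compare profiles, and a subtree is represented by the average profile of its genes. The final branch-merging step, which depends on the separation ratio and minimum distance, is not sketched. Names are illustrative.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def build_gene_tree(profiles):
    """profiles: dict of gene name -> expression vector. Returns nested tuples."""
    # Each entry is (subtree, average profile, number of genes in the subtree).
    clusters = [(name, list(vec), 1) for name, vec in profiles.items()]
    while len(clusters) > 1:                      # step 1: stop when one subtree is left
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: distance(clusters[p[0]][1],
                                                 clusters[p[1]][1]))   # step 2
        (tree_a, mean_a, n_a), (tree_b, mean_b, n_b) = clusters[i], clusters[j]
        # Assumption: the merged subtree is represented by its average profile.
        merged_mean = [(x * n_a + y * n_b) / (n_a + n_b)
                       for x, y in zip(mean_a, mean_b)]
        merged = ((tree_a, tree_b), merged_mean, n_a + n_b)            # step 3
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                   # step 4: loop back to step 1
    return clusters[0][0]

tree = build_gene_tree({"g1": [1.0, 1.0], "g2": [1.1, 1.0], "g3": [5.0, 5.0]})
print(tree)   # g1 and g2 are joined first, then g3 joins them
```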
Algorithm to Build Tree
• The minimum distance: how far down the tree discrete branches are depicted. A higher number puts more genes in each group and makes the groups less specific.
• The separation ratio: the correlation difference between groups of clustered genes, between 0 and 1. Increasing the separation ratio increases the branchiness of the tree.
Principal Components Analysis
• Not a clustering method.
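• PCA finds the principal components of the data: a set of expression patterns that are its most abundant building blocks.
• The 1st PC is obtained by finding the linear combination of expression patterns that accounts for most of the variability in the data, and so on for the following components.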
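As an illustration of the first principal component, the sketch below finds the direction of greatest variance by power iteration on the covariance matrix. This is one common way to compute it and is not necessarily what GeneSpring does internally. The starting vector and iteration count are arbitrary choices, and the method assumes the starting vector is not perpendicular to the answer.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: one expression vector per gene. Returns a unit-length pattern."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between experiment points.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d                                  # arbitrary starting guess
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                  # keep the vector at unit length
    return v

# Four genes whose two measurements rise together.
pc1 = first_principal_component([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
print(pc1)   # both weights are similar: the dominant pattern is "rise together"
```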
k-Means Clustering
• Begins by dividing genes into a user-defined number (k) of equal-sized groups; the final groups are based on the genes' expression patterns.
• Creates centroids at the average location of each group of genes.
• With each iteration, genes are reassigned to the group with the closest centroid.
• After all of the genes have been reassigned, the location of the centroids is recalculated.
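A minimal sketch of this loop. It assumes the initial equal-sized groups are formed by splitting the gene list in order, that the Distance measure is used to find the closest centroid, and that iteration stops when no gene changes group or after a fixed number of passes. Names and the iteration limit are illustrative.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def mean_profile(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def k_means(profiles, k, iterations=10):
    """Returns the group index (0..k-1) assigned to each profile."""
    n = len(profiles)
    # Assumption: initial groups are equal-sized slices of the gene list.
    assignment = [i * k // n for i in range(n)]
    for _ in range(iterations):
        centroids = []
        for g in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == g]
            centroids.append(mean_profile(members) if members else None)
        live = [g for g in range(k) if centroids[g] is not None]
        new_assignment = [min(live, key=lambda g: distance(p, centroids[g]))
                          for p in profiles]
        if new_assignment == assignment:   # nothing moved: converged
            break
        assignment = new_assignment
    return assignment

genes = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [2.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
print(k_means(genes, 2))   # low genes end up together, high genes together
```

Note that the final groups need not be equal in size; only the starting groups are.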
Self-Organizing Maps
• Similar to k-means clustering.
• Also shows the relationship between groups by arranging them in a 2-D map.
• Best represents the variability of the data while still maintaining similarity between adjacent nodes; e.g. node (1,2) is one unit away from node (1,3).
What does t-test mean in GS?
• Replicates: one-sample Student’s t-test.
• Comparisons for 2 groups: Student’s two-sample t-test.
• Comparisons for multiple groups: one-way analysis of variance (ANOVA).
• Filtering genes: based on a one-sample t-test of the mean expression level across replicates vs. a reference value (Expression Percentage Restriction).
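For the one-sample case, the t statistic compares the mean of the replicates with a reference value. The sketch below computes only the statistic and its degrees of freedom; turning it into a p-value needs the t distribution, which is left out. The reference value of 1.0 (a normalized level meaning "no change") and the replicate values are assumptions for illustration.

```python
import math
import statistics

def one_sample_t(values, reference):
    """Returns (t statistic, degrees of freedom) for mean(values) vs. reference."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)            # sample standard deviation (n - 1)
    t = (mean - reference) / (sd / math.sqrt(n))
    return t, n - 1

replicates = [1.8, 2.1, 2.4, 1.9]            # hypothetical normalized replicates
t, df = one_sample_t(replicates, reference=1.0)
print(round(t, 2), df)
```

A large absolute t with few replicates can still give a modest p-value, which is why the number of replicates matters when filtering.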
Filter Genes Analysis Tools
• Global Error Model: filters out genes with large standard deviations or error values.
• Raw data filtering: gets rid of genes too close to the background.
• Sample to sample comparison: fold comparison among different samples.
• Statistical Group comparison: filters out genes that do not vary significantly across different groups.
• Data File Restriction: filters based on other fields (P/S call, +/- pairs).
Statistical Group Comparison
• Finds genes with a statistically significant difference in mean expression level across all groups.
• For two groups: Student’s two-sample t-test.
• For multiple groups: ANOVA.
• Non-parametric comparison: for each gene, the rank order is used for the analysis. Uses the Wilcoxon two-sample test (Mann-Whitney U test) for two groups and the Kruskal-Wallis test for multiple groups.
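For two groups, the parametric and rank-based test statistics can be sketched as below. The t-test here is the classic pooled-variance form of Student's test; whether GeneSpring assumes equal variances is not stated in the slides, so treat that as an assumption. Only the statistics are computed, not the p-values, and the data are hypothetical.

```python
import math
import statistics

def two_sample_t(x, y):
    """Pooled-variance Student's t. Returns (t statistic, degrees of freedom)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x) +
                  (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(
        pooled_var * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def mann_whitney_u(x, y):
    """U for group x: pairs (x_i, y_j) with x_i > y_j, ties counting one half."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

treated = [5.1, 4.9, 5.3]     # hypothetical expression levels, group 1
control = [3.0, 3.2, 2.8]     # hypothetical expression levels, group 2
print(two_sample_t(treated, control), mann_whitney_u(treated, control))
```

The U statistic depends only on the ordering of the values, so it is unaffected by outliers that would distort the means and variances used by the t-test.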
Data Normalization
• In two-color experiments, normalize each gene against the control channel (green).
• Normalize each sample to itself or to a positive control, which makes different samples comparable to one another.
• Normalize each gene to itself: removes the differing intensity scales from multiple experiment readings (highly recommended if not using a two-color experiment).
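The per-sample and per-gene steps can be sketched as simple divisions. The slides do not say which statistic is used as the divisor; this example assumes the median in both cases. The matrix layout (one row per gene, one column per sample) and the function names are also assumptions.

```python
import statistics

def normalize_per_sample(matrix):
    """Divide every value by the median of its sample (column)."""
    medians = [statistics.median(column) for column in zip(*matrix)]
    return [[value / m for value, m in zip(row, medians)] for row in matrix]

def normalize_per_gene(matrix):
    """Divide every value by the median of its gene (row)."""
    return [[value / statistics.median(row) for value in row] for row in matrix]

# Three genes measured in two samples; the second sample is twice as bright overall.
raw = [[100.0, 200.0],
       [200.0, 400.0],
       [300.0, 600.0]]
print(normalize_per_sample(raw))   # the two samples become directly comparable
```

Per-sample normalization removes overall brightness differences between arrays, while per-gene normalization puts genes with very different absolute intensities on a common scale.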