Statistics Tools in GeneSpring

The Center for Bioinformatics

UNC at Chapel Hill

Jianping Jin Ph.D.

Bioinformatics Scientist

Phone: (919)843-6015

E-mail: [email protected]

Fax: (919)966-6821

What Can GeneSpring Do?

• Works with both Affymetrix and two-color data.

• Views data graphically (classification, graph, tree, scatter plot, Venn diagram, …).

• Performs statistical analyses.

• Annotates genes (updating from GenBank, LocusLink, UniGene; biochemical pathways).

• ……

• Clustering:

  • k-means (non-hierarchical)

  • Self-organizing map

  • Gene trees (hierarchical dendrograms)

• Principal component analysis

• t-test analyses (p-values)

• Like a known gene or average of genes

• Like a pattern drawn with the mouse

• Genes with high confidence

• Genes with relative expression in certain ranges

• Pathway analysis: finding genes that fit in a certain place in a pathway.

• Sequence analysis to automatically find regulatory sequences.

• Automatic functional annotation of sub-trees in dendrograms.

• …

What statistical analyses does GS do?

Tree Clustering

1. Standard correlation

2. Smooth correlation

3. Change correlation

4. Upregulated correlation

5. Pearson correlation

6. Spearman correlation

7. Spearman confidence

8. Two-sided Spearman confidence

9. Distance

Notations to the Formulas

Result: the result of the calculation for genes A and B.

n: the number of samples being correlated over.

a: the vector (a_1, a_2, a_3, ..., a_n) of expression values for gene A.

b: the vector (b_1, b_2, b_3, ..., b_n) of expression values for gene B.

a.b = a_1 b_1 + a_2 b_2 + ... + a_n b_n

|a| = square root(a.a)

Standard Correlation

• Equation: a.b/(|a||b|)

• also called “Pearson correlation around zero”.

• Measures the angular separation of the expression vectors for genes A and B.

• Answers the question “do the peaks match up?”
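
The standard correlation can be sketched in a few lines of Python. This is an illustration of the formula a.b/(|a||b|) only, not GeneSpring's own code; the function names are chosen here for the example.

```python
import math

def dot(a, b):
    """Dot product a.b = a_1*b_1 + a_2*b_2 + ... + a_n*b_n."""
    return sum(x * y for x, y in zip(a, b))

def standard_correlation(a, b):
    """Cosine of the angle between expression vectors a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two genes whose peaks line up give a value close to 1.
print(standard_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Vectors that point in the same direction score near 1, and orthogonal vectors score 0.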

Pearson Correlation

• Equation: A.B / (|A||B|)

• Very similar to the standard correlation, except that it measures the angle between the expression vectors for genes A and B around the mean of each expression vector.

• A: the vector obtained by subtracting each element of a from the mean of all elements of a, i.e. A_i = mean(a) - a_i.

• Do the same for b to make the vector B.
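
A minimal sketch of the Pearson correlation as described on this slide: center each vector around its own mean, then apply the same angle formula. The helper names are illustrative, and the vectors are assumed not to be constant (a constant vector has zero length after centering).

```python
import math

def center(v):
    """Return the vector with elements mean(v) - v_i."""
    m = sum(v) / len(v)
    return [m - x for x in v]

def pearson_correlation(a, b):
    """A.B / (|A||B|), where A and B are the centered versions of a and b."""
    A, B = center(a), center(b)
    num = sum(x * y for x, y in zip(A, B))
    den = math.sqrt(sum(x * x for x in A)) * math.sqrt(sum(y * y for y in B))
    return num / den

# Same shape at a different baseline still correlates perfectly.
print(pearson_correlation([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))
```

Because both vectors are centered with the same sign convention, the sign of the subtraction does not affect the result.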

Spearman Confidence

• r = the value of the Spearman correlation.

• SC = 1 - (probability you would get a value of r or higher by chance).

• A measure of similarity, not a correlation.

• SC is high when the Spearman correlation is high and the p-value is low.

• Takes into account the number of sub-experiments in your experiment set.

Two-sided Spearman Confidence

• A measure of similarity, very similar to the Spearman confidence.

• A two-sided test of whether the Spearman correlation is significantly greater than or less than zero.

• Answers the question “which genes behave similarly or opposite to a specific gene?”

• Probably not good for k-means or tree clustering.

• Equation: 1 - (probability you would get a Spearman correlation of |r| or higher, or -|r| or lower, by chance).
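
Both confidence measures can be illustrated with an exact permutation calculation. The slides do not say how GeneSpring obtains the chance probability, so the enumeration below is an assumption made for this sketch. It is only practical for a small number of samples (it visits n! orderings) and assumes neither vector is constant.

```python
import math
from itertools import permutations

def ranks(v):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx, dy = [v - mx for v in x], [v - my for v in y]
    num = sum(p * q for p, q in zip(dx, dy))
    return num / (math.sqrt(sum(p * p for p in dx)) * math.sqrt(sum(q * q for q in dy)))

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def spearman_confidence(a, b, two_sided=False):
    """1 - P(correlation at least as extreme as r by chance)."""
    ra, rb = ranks(a), ranks(b)
    r = pearson(ra, rb)
    eps = 1e-12  # tolerance for floating-point comparison
    hits = total = 0
    for perm in permutations(rb):
        rp = pearson(ra, perm)
        total += 1
        if two_sided:
            if abs(rp) >= abs(r) - eps:
                hits += 1
        elif rp >= r - eps:
            hits += 1
    return 1 - hits / total
```

With more sub-experiments there are more possible orderings, so the same r yields a higher confidence, which is how the measure takes the number of sub-experiments into account.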

Distance

• A measurement of dissimilarity, not a correlation at all.

• Euclidean distance between the expression profiles (the values for each point in N-dimensional space) of genes A and B.

• Distance = |a-b| / square root(N), where N is the number of experimental points.
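
A short sketch of the distance measure, following the formula on this slide. The function name is illustrative.

```python
import math

def distance(a, b):
    """Euclidean distance between profiles a and b, divided by sqrt(N)."""
    n = len(a)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) / math.sqrt(n)

# Two profiles that differ by 2 at every one of 4 points.
print(distance([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))
```

Dividing by the square root of N keeps the value comparable between experiments with different numbers of points.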

Special Case Correlations

• Smooth correlation, Change correlation and Upregulated correlation.

• All three are modified versions of the standard correlation.

• They only make sense when the data are in a sequence, such as “before”/“after”, a time series, or a drug series.

Smooth Correlation

• Make a new vector A from a by interpolating the average of each consecutive pair of elements of a.

• Insert this new value between the old values.

• Do this for each pair of elements that would be connected by a line in the graph screen.

• Do the same to make a vector B from b.
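
The interpolation step can be sketched as follows. The sketch assumes every consecutive pair of points is connected in the graph, and that the smoothed vectors are then compared with the standard correlation, since the smooth correlation is described as a modified standard correlation. Function names are illustrative.

```python
import math

def smooth(v):
    """Insert the average of each consecutive pair between the old values."""
    out = []
    for cur, nxt in zip(v, v[1:]):
        out.append(cur)
        out.append((cur + nxt) / 2)
    out.append(v[-1])  # assumes v is not empty
    return out

def standard_correlation(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def smooth_correlation(a, b):
    """Standard correlation of the interpolated vectors A and B."""
    return standard_correlation(smooth(a), smooth(b))

print(smooth([1, 3, 5]))
```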

Change Correlation

• The opposite of what the smooth correlation looks for: only the change in expression level between adjacent points is considered.

• Similar to the standard correlation, but uses an arc tangent transformation of the ratio between adjacent pairs of points to create the expression vector. This is less sensitive to outliers than using the ratio directly.

• The value created between two values a_i and a_{i+1} is atan(a_{i+1}/a_i) - π/4.
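
A minimal sketch of the change correlation. It assumes positive expression values (the ratio is undefined when a value is zero) and at least one change in each profile. Because atan(1) = π/4, an unchanged pair of points maps to 0. Function names are illustrative.

```python
import math

def change_vector(v):
    """atan(a_{i+1}/a_i) - pi/4 for each adjacent pair of points."""
    return [math.atan(nxt / cur) - math.pi / 4 for cur, nxt in zip(v, v[1:])]

def standard_correlation(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def change_correlation(a, b):
    """Standard correlation of the two change vectors."""
    return standard_correlation(change_vector(a), change_vector(b))

# Both genes go up, down, up: the change vectors point the same way.
print(change_correlation([1.0, 2.0, 1.0, 2.0], [1.0, 3.0, 1.0, 3.0]))
```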

Upregulated Correlation

• Very similar to the change correlation, but it only considers positive changes. All negative values of the arc tangent transformation are set to zero.

• Make a new vector A from a by looking at the change between each pair of elements of a.

• The value created between two values a_i and a_{i+1} is max(atan(a_{i+1}/a_i) - π/4, 0).
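
The upregulated correlation differs from the change correlation only in clipping negative changes to zero. The sketch below assumes positive expression values and that each profile has at least one upward change; otherwise the transformed vector is all zeros and the correlation is undefined. Function names are illustrative.

```python
import math

def upregulated_vector(v):
    """max(atan(a_{i+1}/a_i) - pi/4, 0) for each adjacent pair of points."""
    return [max(math.atan(nxt / cur) - math.pi / 4, 0.0) for cur, nxt in zip(v, v[1:])]

def standard_correlation(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def upregulated_correlation(a, b):
    """Standard correlation of the two clipped change vectors."""
    return standard_correlation(upregulated_vector(a), upregulated_vector(b))

# Decreases contribute nothing; only the shared increases are compared.
print(upregulated_vector([2.0, 1.0, 2.0]))
```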

Algorithm to Build Gene Tree

1. Determine whether there is only one gene or subtree left. If yes, go to step 5.

2. Find the two closest genes/subtrees.

3. Merge these two into one subtree.

4. Return to step 1.

5. Merge together branches where the distance between sub-branches is less than the separation ratio, subject to considering genes that are less than the minimum distance apart.
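
Steps 1 to 4 can be sketched as a simple agglomerative loop. The slides do not state how the distance between two subtrees is measured, so the average pairwise distance used here is an assumption, and the final branch-merging step is left out. All names are illustrative, and at least one gene is assumed.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def build_tree(profiles, distance=euclidean):
    """profiles: dict of gene name -> expression vector. Returns nested tuples."""
    # Each cluster is (subtree, list of member vectors).
    clusters = [(name, [vec]) for name, vec in profiles.items()]

    def cluster_distance(c1, c2):
        # Assumption: average distance over all pairs of members.
        pairs = [(u, v) for u in c1[1] for v in c2[1]]
        return sum(distance(u, v) for u, v in pairs) / len(pairs)

    while len(clusters) > 1:  # step 1: stop when one subtree is left
        # Step 2: find the two closest genes/subtrees.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Step 3: merge them into one subtree.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        # Step 4: loop back to step 1.
    return clusters[0][0]

tree = build_tree({"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [10.0, 10.0]})
print(tree)
```

Any of the similarity measures above could be passed in as `distance` after converting it to a dissimilarity, for example 1 minus the correlation.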

Algorithm to Build Tree

• The minimum distance: how far down the tree discrete branches are depicted. A higher number puts more genes in a group, making the groups less specific.

• The separation ratio: the correlation difference between groups of clustered genes, between 0 and 1. Increasing the separation ratio increases the branchiness of the tree.

Principal Components Analysis

• Not a clustering method.

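• PCA finds the most abundant building blocks of the data: a set of expression patterns.

• The 1st PC is obtained by finding the linear combination of expression patterns that accounts for the most variability in the data. The 2nd PC accounts for the most of the remaining variability, and so on.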
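
To make the idea of the 1st PC concrete, here is a small sketch that finds the direction of greatest variability by power iteration on the covariance matrix. This is one textbook way to compute it, not necessarily what GeneSpring does. It assumes at least two rows, non-zero variance, and a starting vector that is not orthogonal to the answer. Names are illustrative.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: list of observations, each a list of d values. Returns a unit vector."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Uneven starting vector, to reduce the chance of starting orthogonal to the PC.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Points lying along the diagonal: the 1st PC points along (1, 1).
print(first_principal_component([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]))
```

The sign of a principal component is arbitrary, so only its direction is meaningful.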

k-Means Clustering

• Divides genes into a user-defined number (k) of equal-sized groups, based on their expression patterns.

• Creates centroids at the average location of each group of genes.

• With each iteration, genes are reassigned to the group with the closest centroid.

• After all of the genes have been reassigned, the locations of the centroids are recalculated.
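
The loop described on this slide can be sketched as below. Two details are assumptions made for the example: the initial equal-sized groups are formed by list order, and closeness is measured by squared Euclidean distance. Names are illustrative.

```python
def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def k_means(profiles, k, max_iterations=100):
    """Return one group label (0..k-1) per expression profile."""
    n = len(profiles)
    # Start with k (roughly) equal-sized groups, assigned by list order.
    labels = [i * k // n for i in range(n)]
    for _ in range(max_iterations):
        # Centroid = average location of the genes in each non-empty group.
        centroids = {}
        for g in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == g]
            if members:
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
        # Reassign every gene to the group with the closest centroid.
        new_labels = []
        for p in profiles:
            nearest = min(centroids, key=lambda g: squared_distance(p, centroids[g]))
            new_labels.append(nearest)
        if new_labels == labels:  # no gene moved: converged
            break
        labels = new_labels
    return labels

print(k_means([[0, 0], [10, 10], [0, 1], [10, 11]], k=2))
```

After the first pass the groups are generally no longer equal in size; only the starting partition is.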

Self-Organizing Maps

• Similar to k-means clustering.

• Shows the relationship between groups in a 2-D map.

• Best represents the variability of the data while still maintaining similarity between adjacent nodes, e.g. point (1,2) is one unit away from point (1,3).

What does t-test mean in GS?

• Replicates: one-sample Student’s t-test.

• Comparisons for 2 groups: Student’s two-sample t-test.

• Comparisons for multiple groups: one-way analysis of variance (ANOVA).

• Filtering genes: based on a one-sample t-test of the mean expression level across replicates vs. a reference value (Expression Percentage Restriction).
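
As an illustration of the one-sample case, the t statistic for a gene's replicates against a reference value can be computed as follows. The sketch stops at the statistic; converting it to a p-value requires the t distribution with n-1 degrees of freedom. The function name is illustrative.

```python
import math

def one_sample_t(values, reference):
    """t = (mean - reference) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (mean - reference) / math.sqrt(variance / n)

# Three replicates of one gene tested against a reference value of 1.0.
print(one_sample_t([1.0, 2.0, 3.0], 1.0))
```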

Filter Genes Analysis Tools

• Global Error Model: filters out genes with large standard deviations or error values.

• Raw data filtering: gets rid of genes too close to the background.

• Sample-to-sample comparison: fold comparison among different samples.

• Statistical group comparison: filters out genes that do not vary significantly across different groups.

• Data File Restriction: based on other fields (P/S call, +/- pairs).

Statistical Group Comparison

• Finds genes with a statistically significant difference in mean expression level across all groups.

• For two groups: Student’s two-sample t-test.

• For multiple groups: ANOVA.

• Non-parametric comparison: for each gene, the rank order is used for the analysis. The Wilcoxon two-sample test (Mann-Whitney U test) is used for two groups, and the Kruskal-Wallis test for multiple groups.
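
For the two-group case, a sketch of the Student's two-sample t statistic with pooled variance is shown below. It assumes equal variances in the two groups, as the classical Student's test does, and at least two values per group. The function name is illustrative.

```python
import math

def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    pooled_variance = (ssx + ssy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled_variance * (1 / nx + 1 / ny))

# Expression of one gene in two groups of three samples each.
print(two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

The p-value would come from the t distribution with nx + ny - 2 degrees of freedom.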

Data Normalization

• In two-color experiments, normalize against the control channel (green) for each gene.

• Normalize each sample to itself or to a positive control. This makes different samples comparable to one another.

• Normalize each gene to itself: this removes the differing intensity scales from multiple experimental readings (highly recommended if not using a two-color experiment).
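
One way to normalize a gene to itself is to divide each of its readings by the median of all its readings. The slide does not name the statistic, so the median is an assumption made for this sketch, as is the function name. A non-zero median is assumed.

```python
from statistics import median

def normalize_gene_to_itself(values):
    """Divide each reading of one gene by the median of its readings (assumed choice)."""
    m = median(values)
    return [v / m for v in values]

# A gene measured on three samples; the middle reading becomes 1.0.
print(normalize_gene_to_itself([2.0, 4.0, 8.0]))
```

After this step, genes with very different raw intensities are expressed on the same relative scale.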

NCI-60 cell lines

DrugActivity_AT
