class #2: statistics, testing multiple hypotheses and bioinformatics · 2015-02-24 · class #2:...

Class #2: Statistics, Testing Multiple Hypotheses and Bioinformatics

ML4Bio 2012
January 20th, 2012

Quaid Morris

Page 2

Scooter Morris, Piet Molenaar, Gary Bader, Tero Aittokallio, Boris Steipe

Page 3

Overview

• Bioinformatics: gene lists, annotation, etc.
• Hypothesis testing:
  – T-test, Sign-rank, Rank-sum, Hypergeometric test (or Fisher’s Exact Test)
• Multiple test corrections:
  – Bonferroni, False Discovery Rate (Benjamini-Hochberg)

Page 4

Bioinformatics: gene annotations, ID mapping, etc.

Page 5

Interpreting Gene Lists

• My cool new screen worked and produced 1000 hits! …Now what?
• Genome-Scale Analysis (Omics)
  – Genomics, Proteomics
• Tell me what’s interesting about these genes

[Figure: ranking or clustering of genes, followed by "?"; image credit GenMAPP.org]

Page 6

Interpreting Gene Lists

• My cool new screen worked and produced 1000 hits! …Now what?
• Genome-Scale Analysis (Omics)
  – Genomics, Proteomics
• Tell me what’s interesting about these genes
  – Are they enriched in known pathways, complexes, functions

[Figure: ranking or clustering, combined with prior knowledge about cellular processes, feeds analysis tools; result: "Eureka! New heart disease gene!"]

Page 7

Where Do Gene Lists Come From?

• Molecular profiling, e.g. mRNA, protein
  – Identification → Gene list
  – Quantification → Gene list + values
  – Ranking, Clustering (biostatistics)
• Interactions: Protein interactions, microRNA targets, transcription factor binding sites (ChIP)
• Genetic screen, e.g. of knock-out library
• Association studies (Genome-wide)
  – Single nucleotide polymorphisms (SNPs)
  – Copy number variants (CNVs)

Page 8

What Do Gene Lists Mean?

• Biological system: complex, pathway, physical interactors
• Similar gene function, e.g. protein kinase
• Similar cell or tissue location
• Chromosomal location (linkage, CNVs)

[Figure label: Data]

Page 9

Gene Attributes

Available in databases:
• Function annotation
  – Biological process, molecular function, cell location
• Chromosome position
• Disease association
• DNA properties
  – TF binding sites, gene structure (intron/exon), SNPs
• Transcript properties
  – Splicing, 3’ UTR, microRNA binding sites
• Protein properties
  – Domains, secondary and tertiary structure, PTM sites
• Interactions with other genes

Page 10

What is the Gene Ontology (GO)?
www.geneontology.org

• Set of biological phrases (terms) which are applied to genes:
  – protein kinase
  – apoptosis
  – membrane
• Dictionary: term definitions
• Ontology: A formal system for describing knowledge

Jane Lomax @ EBI

Page 11

GO Structure

• Terms are related within a hierarchy
  – is-a
  – part-of
• Describes multiple levels of detail of gene function
• Terms can have more than one parent or child
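Because a term can have more than one parent, the GO hierarchy behaves like a directed acyclic graph rather than a tree. The following is a minimal Python sketch of collecting all ancestors of a term by following is-a / part-of links. The term names and edges are invented for illustration and are not taken from the real ontology; `PARENTS` and `ancestors` are hypothetical names.

```python
# Toy parent map: child term -> list of (relation, parent term).
# Names and edges are made up for illustration; they are NOT real GO content.
PARENTS = {
    "term_D": [("is-a", "term_B"), ("part-of", "term_C")],  # two parents
    "term_B": [("is-a", "term_A")],
    "term_C": [("is-a", "term_A")],
    "term_A": [],  # root
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("term_D")))
```

Tracking visited terms in `seen` matters here: with multiple parents, the same ancestor (`term_A`) is reachable along more than one path and should be counted once.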

Page 12

What GO Covers?

• GO terms divided into three aspects:
  – cellular component
  – molecular function
  – biological process (important pathway source)

Example terms shown: glucose-6-phosphate isomerase activity; cell division

Page 13

Terms

• Where do GO terms come from?
  – GO terms are added by editors at EBI and gene annotation database groups
  – Terms added by request
  – Experts help with major development
  – 32029 terms, >99% with definitions (as of July 15, 2010)
    • 19639 biological_process
    • 2859 cellular_component
    • 9531 molecular_function

Page 14

Annotations

• Genes are linked, or associated, with GO terms by trained curators at genome databases
  – Known as ‘gene associations’ or GO annotations
  – Multiple annotations per gene
• Some GO annotations created automatically (without human review)

Page 15

Annotation Sources

• Manual annotation
  – Curated by scientists
    • High quality
    • Small number (time-consuming to create)
  – Reviewed computational analysis
• Electronic annotation
  – Annotation derived without human validation
    • Computational predictions (accuracy varies)
    • Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin

Page 16

Evidence Types (for your information)

• Experimental Evidence Codes
  • EXP: Inferred from Experiment
  • IDA: Inferred from Direct Assay
  • IPI: Inferred from Physical Interaction
  • IMP: Inferred from Mutant Phenotype
  • IGI: Inferred from Genetic Interaction
  • IEP: Inferred from Expression Pattern
• Author Statement Evidence Codes
  • TAS: Traceable Author Statement
  • NAS: Non-traceable Author Statement
• Curator Statement Evidence Codes
  • IC: Inferred by Curator
  • ND: No biological Data available
• Computational Analysis Evidence Codes
  • ISS: Inferred from Sequence or Structural Similarity
  • ISO: Inferred from Sequence Orthology
  • ISA: Inferred from Sequence Alignment
  • ISM: Inferred from Sequence Model
  • IGC: Inferred from Genomic Context
  • RCA: Inferred from Reviewed Computational Analysis
• IEA: Inferred from Electronic Annotation

See http://www.geneontology.org

Page 17

Wide & Variable Species Coverage

Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.

Page 18

Accessing GO: QuickGO

http://www.ebi.ac.uk/ego/
See also AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

Page 19

Gene Attributes

• Function annotation
  – Biological process, molecular function, cell location
• Chromosome position
• Disease association
• DNA properties
  – TF binding sites, gene structure (intron/exon), SNPs
• Transcript properties
  – Splicing, 3’ UTR, microRNA binding sites
• Protein properties
  – Domains, secondary and tertiary structure, PTM sites
• Interactions with other genes

Page 20

Sources of Gene Attributes

• Ensembl BioMart (eukaryotes)
  – http://www.ensembl.org
• Entrez Gene (general)
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
  – E.g. SGD: http://www.yeastgenome.org/
• Also available through R

Page 21

Biomart 0.7

[Screenshot annotation: "This one"]

Page 22

Ensembl BioMart

• Convenient access to gene list annotation

[Screenshot steps: Select genome; Select filters; Select attributes to download]

Page 23

What Have We Learned?

• Many gene attributes in databases
  – Gene Ontology (GO) provides gene function annotation
• GO is a classification system and dictionary for biological concepts
• Annotations are contributed by many groups
• More than one annotation term allowed per gene
• Some genomes are annotated more than others
• Annotation comes from manual and electronic sources
• GO can be simplified for certain uses (GO Slim)
• Many gene attributes available from Ensembl and Entrez Gene

Page 24

Gene Lists Overview

• Interpreting gene lists
• Gene function attributes
  – Gene Ontology
    • Ontology Structure
    • Annotation
  – BioMart + other sources
• Gene identifiers and mapping

Page 25

Gene and Protein Identifiers

• Identifiers (IDs) are ideally unique, stable names or numbers that help track database records
  – E.g. Social Insurance Number, Entrez Gene ID 41232
• Gene and protein information stored in many databases
  – Genes have many IDs
• Records for: Gene, DNA, RNA, Protein
  – Important to recognize the correct record type
  – E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins, e.g. in RefSeq, which stores sequence.

Page 26

Common Identifiers (for your information)

Gene
  Ensembl: ENSG00000139618
  Entrez Gene: 675
  Unigene: Hs.34012

RNA transcript
  GenBank: BC026160.1
  RefSeq: NM_000059
  Ensembl: ENST00000380152

Protein
  Ensembl: ENSP00000369497
  RefSeq: NP_000050.2
  UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
  IPI: IPI00412408.1
  EMBL: AF309413
  PDB: 1MIU

Species-specific
  HUGO HGNC: BRCA2
  MGI: MGI:109337
  RGD: 2219
  ZFIN: ZDB-GENE-060510-3
  FlyBase: CG9097
  WormBase: WBGene00002299 or ZK1067.1
  SGD: S000002187 or YDL029W

Annotations
  InterPro: IPR015252
  OMIM: 600185
  Pfam: PF09104
  Gene Ontology: GO:0000724
  SNPs: rs28897757

Experimental Platform
  Affymetrix: 208368_3p_s_at
  Agilent: A_23_P99452
  CodeLink: GE60169
  Illumina: GI_4502450-S

Red = Recommended

Page 27

Identifier Mapping

• So many IDs!
  – Mapping (conversion) is a headache
• Four main uses
  – Searching for a favorite gene name
  – Link to related resources
  – Identifier translation
    • E.g. Genes to proteins, Entrez Gene to Affy
  – Unification during dataset merging
    • Equivalent records

Page 28

ID Mapping Services

• Synergizer
  – http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
  – http://www.ensembl.org
• PICR (proteins only)
  – http://www.ebi.ac.uk/Tools/picr/
• R language annotation databases
  – http://www.bioconductor.org

Page 29

ID Mapping Challenges

• Avoid errors: map IDs correctly
• Gene name ambiguity – not a good ID
  – e.g. FLJ92943, LFS1, TRP53, p53
  – Better to use the standard gene symbol: TP53
• Excel error-introduction
  – OCT4 is changed to October-4
• Problems reaching 100% coverage
  – E.g. due to version issues
  – Use multiple sources to increase coverage

Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004 Jun 23;5:80
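The two pitfalls above (aliases and spreadsheet date conversion) can be checked for programmatically before any mapping is attempted. The following Python sketch is only illustrative: the alias table is hard-coded from the slide's TP53 example, the date pattern is a heuristic guess at what a damaged symbol looks like, and `normalise` and `looks_date_mangled` are hypothetical helper names.

```python
import re

# Alias table built from the slide's example; a real table would come from a
# mapping service, not be hard-coded like this.
ALIAS_TO_SYMBOL = {
    "FLJ92943": "TP53",
    "LFS1": "TP53",
    "TRP53": "TP53",
    "P53": "TP53",
}

# Heuristic pattern for strings such as "4-Oct" or "October-4" that a
# spreadsheet may have produced from a gene symbol. This is an assumption
# about what the damage looks like, not an exhaustive detector.
_MONTHS = r"(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-{_MONTHS}|{_MONTHS}-\d{{1,2}})$", re.IGNORECASE
)


def normalise(name):
    """Map a known alias to its standard symbol; pass other names through."""
    cleaned = name.strip()
    return ALIAS_TO_SYMBOL.get(cleaned.upper(), cleaned)


def looks_date_mangled(name):
    """True if the string looks like a date rather than a gene symbol."""
    return bool(DATE_LIKE.match(name.strip()))


for raw in ["p53", "LFS1", "OCT4", "October-4", "4-Oct"]:
    print(raw, "->", normalise(raw), "| date-like:", looks_date_mangled(raw))
```

A date-like string cannot be reliably converted back to the original symbol, so a check like this is best used to flag rows for manual review rather than to repair them automatically.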

Page 30

Recommendations

• For proteins and genes
  – (doesn’t consider splice forms)
• Map everything to Entrez Gene IDs using a spreadsheet
• If 100% coverage desired, manually curate missing mappings
• Be careful of Excel auto conversions – especially when pasting large gene lists!
  – Format cells as ‘text’

Page 31

What Have We Learned?

• Genes and their products and attributes have many identifiers (IDs)

• Genomics work often requires converting (mapping) IDs from one type to another

• ID mapping services are available

• Use standard, commonly used IDs to reduce ID mapping challenges

Page 32

Hypothesis testing: P-values, enrichment analysis, T-tests, etc.

Page 33

What is a P-value?

• A) Probability of an incorrect rejection of the null hypothesis
• B) Probability that a sample of the test statistic under the null distribution is as or more extreme than its measured value
• C) False rejection probability
• D) Some subset of the above


Page 35

Hypothesis testing

• Random variables:
  – H: H0 (null hypothesis) or H1 (alternative hypothesis)
  – Data: X1, X2, …, XN (independent and identically distributed, IID)
  – T: sample from null distribution of the test statistic
  – t: observed value of test statistic
• Parameters:
  – α: significance level
• Goal:
  – Set P = Pr(T is “more extreme” than t | H = H0)
  – Reject H0 if P < α

Page 36

P-value facts

• Note that: Pr[P < p | H0 is true] = p (for a test statistic with a continuous null distribution)

• So under the null distribution, P is a random variable that is uniformly distributed between 0 and 1.
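The uniformity of P under the null can be checked by simulation. The sketch below is one possible illustration, not something from the slides: it assumes a two-sided z-test with known unit variance (chosen because its P-value needs only the normal CDF from the standard library), draws data with H0 true, and reports how often P falls below a few thresholds. The function names and simulation sizes are arbitrary choices.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_p_value(sample):
    """Two-sided P-value for H0: mean = 0, assuming data ~ Normal(mean, 1)."""
    n = len(sample)
    z = sum(sample) / n * n ** 0.5
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate_null_p_values(n_experiments=20000, n_per_experiment=10, seed=0):
    """Repeat the experiment many times with H0 true and collect P-values."""
    rng = random.Random(seed)
    return [
        z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(n_per_experiment)])
        for _ in range(n_experiments)
    ]


p_values = simulate_null_p_values()
for threshold in (0.01, 0.05, 0.5):
    fraction = sum(p < threshold for p in p_values) / len(p_values)
    print(f"Pr[P < {threshold}] is approximately {fraction:.3f}")
```

Each printed fraction should land close to its threshold, which is what "uniformly distributed between 0 and 1" means in practice.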

Page 37

P-value versus false rejections

• P-value is:
  – Pr[ T is “as or more extreme” than t | H0 is true ]
• False rejection probability:
  – Pr[ H0 is true | H0 is rejected ]
  – aka “False discovery rate”
• How do we go from one to another?

Page 38

(Gene Set) Enrichment analysis

• Given:
  1. Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast)
  2. Gene annotations: e.g. Gene Ontology, transcription factor binding sites in promoter
• Question: Are any of the gene annotations surprisingly enriched in the gene list?
• Details:
  – How to assess “surprisingly” (statistics)
  – How to correct for repeating the tests

Page 39

The hypergeometric test

Gene list: RRP6, MRD1, RRP7, RRP43, RRP42

H0: List is a random sample from population
H1: More black genes than expected

Background population: 500 black genes, 4500 red genes

Page 40

The hypergeometric function

Probability that a random sample of k genes contains q black genes when the background population contains m black genes out of n total genes:

$$\Pr[q \text{ black genes}] = \frac{\binom{m}{q}\binom{n-m}{k-q}}{\binom{n}{k}}$$

• $\binom{m}{q}$: # ways to choose q out of m genes
• $\binom{n-m}{k-q}$: # ways to choose k-q out of n-m genes
• $\binom{n}{k}$: # ways to choose k out of n genes

$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$ is called “n choose k”; for details see http://www.khanacademy.org/video/combinations
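The formula translates directly into code. The following is a small Python sketch using the standard library's `math.comb` for "n choose k"; the function name `hypergeom_pmf` and the numbers in the check are made up for illustration.

```python
from math import comb


def hypergeom_pmf(q, k, m, n):
    """Pr[exactly q black genes in a random sample of k genes],
    when the population has m black genes out of n genes in total."""
    if q < 0 or q > k or q > m or k - q > n - m:
        return 0.0  # impossible outcome
    return comb(m, q) * comb(n - m, k - q) / comb(n, k)


# Small check with made-up numbers: 3 black genes among 10, sample of 4.
probabilities = [hypergeom_pmf(q, k=4, m=3, n=10) for q in range(0, 5)]
print(probabilities)
print(sum(probabilities))  # probabilities over all possible q add up to 1
```

Python integers do not overflow, so the same function also works for genome-sized populations without any special handling of large factorials.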

Page 41

The hypergeometric test

Gene list: RRP6, MRD1, RRP7, RRP43, RRP42

Null distribution, P-value:

$$P = \frac{\binom{500}{4}\binom{4500}{1}}{\binom{5000}{5}} + \frac{\binom{500}{5}\binom{4500}{0}}{\binom{5000}{5}} = 4.6 \times 10^{-4}$$

Background population: 500 black genes, 4500 red genes
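The P-value on this slide is the upper tail of the hypergeometric distribution. The following Python sketch reproduces the sum; it assumes that 4 of the 5 listed genes are black (inferred from the terms being summed, q = 4 and q = 5), and `hypergeom_p_value` is a hypothetical helper name.

```python
from math import comb


def hypergeom_p_value(q_observed, k, m, n):
    """Pr[q_observed or more black genes in a random sample of k genes]
    from a population with m black genes out of n."""
    upper = min(k, m)  # cannot draw more black genes than exist or are sampled
    favourable = sum(
        comb(m, q) * comb(n - m, k - q) for q in range(q_observed, upper + 1)
    )
    return favourable / comb(n, k)


# Slide's setting: 500 black and 4500 red genes, list of 5 genes.
# Assumption: 4 of the 5 listed genes are black.
p = hypergeom_p_value(q_observed=4, k=5, m=500, n=5000)
print(f"{p:.2e}")
```

The result agrees with the slide's 4.6 × 10⁻⁴ to two significant figures. To test under-enrichment of black instead, the same function can be called with the red counts.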

Page 42

Important details

• One way to test for under-enrichment of “black”: test for over-enrichment of “red”
• Sometimes called “Fisher’s Exact Test”
• Need to choose the “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), only use that population as background.
• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply the hypergeometric test separately for each type. ***More on this later***

Page 43

Summary

• The P-value given by the hypergeometric test (Fisher’s Exact Test):
  – is “the probability that a random draw of the same size as the gene list from the background population would produce the observed number of annotations in the gene list or more”,
  – depends on the size of both the gene list and the background population, as well as the # of black genes in the gene list and in the background.

Page 44

Review: One sample T-test

• Does drug XyZiZy reduce tumour size?
• X1, X2, …, XN: change in tumour size after taking the drug.
• Calculate t = (X̄ – μ) / (s / √N), where X̄ is the sample mean and s is the sample standard deviation [here μ=0]
• Evaluate t under a Student t-distribution with N-1 degrees of freedom.
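The statistic can be computed with the Python standard library alone. The following is a minimal sketch: the data are hypothetical, `one_sample_t` is an illustrative name, and the final P-value step is only described in a comment because the standard library does not provide a Student t CDF.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu=0.0):
    """Return (t, degrees of freedom) for H0: the population mean equals mu."""
    n = len(sample)
    # stdev() is the sample standard deviation (divides by n - 1).
    t = (mean(sample) - mu) / (stdev(sample) / sqrt(n))
    return t, n - 1


# Hypothetical changes in tumour size (negative = shrinkage); not real data.
changes = [-1.2, -0.4, 0.3, -0.9, -1.5, -0.2, -0.7, 0.1]
t, dof = one_sample_t(changes)
print(f"t = {t:.2f} with {dof} degrees of freedom")
# The P-value is the tail area of a Student t-distribution with `dof` degrees
# of freedom; a statistics package (for example scipy.stats.t) would be needed
# for that last step.
```

Since the question is whether the drug reduces tumour size, a one-sided test (lower tail only) would be a natural choice here; that is a design decision the slide leaves open.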

Page 45

Notes about the T-test

• Assumptions:
  – 1) Xi is normally distributed
  – 2) Variance of Xi is unknown; its estimate from the sample (suitably scaled) has a chi-square distribution
  – Plus some other minor technical ones
• What should we do if these aren’t true, or we don’t know whether they are?

Page 46

Wilcoxon sign-rank test

• H0: median of distribution of Xi is zero
• H1: median of distribution of Xi is not zero
• Rank X1, X2, …, XN by absolute value
• Test statistics: W+ is the sum of the ranks of the positive values of Xi, W- is the rank sum of the negative values; S = W+ - W-.
• For large N (>20 or so), the null distribution of S is well approximated by S ~ Normal(0, σ²), where σ² = N(N+1)(2N+1)/6
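The steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions: no zero values and no tied absolute values (ties would need a correction), the large-N normal approximation, and a two-sided P-value. The data and the name `signed_rank_test` are made up.

```python
from math import sqrt
from statistics import NormalDist


def signed_rank_test(x):
    """Return (S, two-sided P-value) using the large-N normal approximation.

    Assumes no zeros and no tied absolute values.
    """
    n = len(x)
    # Rank 1 goes to the smallest absolute value, rank n to the largest.
    order = sorted(range(n), key=lambda i: abs(x[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if x[i] > 0)
    w_minus = sum(rank for rank, i in enumerate(order, start=1) if x[i] < 0)
    s = w_plus - w_minus
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 6)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(s) / sigma))
    return s, p


# Made-up data: the values 1..24, with every multiple of 4 made negative.
x = [i if i % 4 else -i for i in range(1, 25)]
s, p = signed_rank_test(x)
print(f"S = {s}, P = {p:.3f}")
```

With N = 24 the variance formula gives σ = 70 exactly, which makes the example easy to verify by hand.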

Page 47

Sign rank test notes

• Assumptions:
  – 1) Data is symmetrically distributed around the median
  – Plus some other minor technical ones
• Tied ranks need to be “corrected”
• Some versions use min(W+, W-) as the test statistic.

Page 48

(Wilcoxon) Sign Test

• Used when you can’t even assume that the distribution of Xi is symmetric.
• Test statistic is the # of positive values of Xi.
• Under the null hypothesis, the samples are IID and the probability that any Xi is positive is 0.5, so the null distribution of the test statistic is Binomial(0.5, N), where N is the # of samples.
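Because the null distribution is Binomial(0.5, N), the sign test can be computed exactly. The following is a minimal Python sketch; the two-sided alternative is an assumption (the slide does not say which tail is used), zeros are assumed absent, and `sign_test_p_value` is an illustrative name.

```python
from math import comb


def sign_test_p_value(x):
    """Two-sided exact sign test for H0: Pr[X_i > 0] = 0.5.

    Assumes no X_i is exactly zero.
    """
    n = len(x)
    n_positive = sum(1 for value in x if value > 0)
    # Under H0 the count of positives is Binomial(n, 0.5), which is symmetric,
    # so the two-sided P-value is twice the smaller tail, capped at 1.
    k = min(n_positive, n - n_positive)
    tail = sum(comb(n, j) for j in range(0, k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Made-up example: 9 positive values and 1 negative value.
print(sign_test_p_value([0.3, 1.2, 0.8, 2.1, 0.5, 1.7, 0.9, 0.4, 1.1, -0.6]))
```

Only the signs enter the calculation, so the magnitudes in the example list could be anything; that is exactly why the test needs so few assumptions and also why it has less power.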

Page 49

Summary

Tests, with the appropriate distributions for Xi shown alongside each:
• T-Test
• Sign-Rank test
• Sign test

Going down the list (T-Test to Sign test): fewer assumptions. Going up the list (Sign test to T-Test): more power.

Page 50

Summary

Tests and their test statistics (appropriate distributions for Xi shown alongside each):
• T-Test: T-statistic
• Sign-Rank test: difference of rank sums
• Sign test: # of positives

Going down the list (T-Test to Sign test): fewer assumptions. Going up the list (Sign test to T-Test): more power.

Page 51

Doing paired test

• The same set of tests usually applies for determining whether paired data are drawn from distributions with the same or different means (medians).
• If (Yi, Zi) is the paired data point for Yi and Zi from the two datasets, then simply set:
  – Xi = Yi – Zi.
• Note: paired Wilcoxon sign-rank is almost exactly the paired T-test applied to joint ranks.

Page 52

Gene lists and gene scores

[Figure, two panels, each a genes-by-time heat map yielding a gene list. Left: clustering. Right: thresholding a gene “score” (examples of gene scores). Sources: Eisen et al. (1998) PNAS 95; Gerber et al. (2006) PNAS 103]

Page 53

Enrichment analysis using gene scores

[Figure: a list of gene scores and the gene score distributions of the two groups]

Question: How likely are the differences between the two distributions due to chance?

Page 54

Enrichment analysis with non-paired, two sample T-test

Answer: Two-tailed T-test

[Figure: gene score distributions]

Black: N1 = 500, Mean: m1 = 1.1, Std: s1 = 0.9
Red: N2 = 5000, Mean: m2 = 4.9, Std: s2 = 1.0

$$\text{T-statistic} = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}} = -88.5$$

H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal
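The T-statistic needs only the group means, standard deviations and sizes. The Python sketch below evaluates the formula twice: once with the numbers as printed on the slide, and once with N2 = 4500, the number of red genes in the earlier hypergeometric example. The second call is an assumption added for comparison, and `two_sample_t` is an illustrative name.

```python
from math import sqrt


def two_sample_t(m1, s1, n1, m2, s2, n2):
    """Unpaired two-sample T-statistic from group means, stds and sizes."""
    return (m1 - m2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Values as printed on the slide (N2 = 5000).
t_as_printed = two_sample_t(1.1, 0.9, 500, 4.9, 1.0, 5000)
# Same, but with 4500 red genes as in the hypergeometric example (assumption).
t_with_4500 = two_sample_t(1.1, 0.9, 500, 4.9, 1.0, 4500)
print(round(t_as_printed, 1), round(t_with_4500, 1))
```

With the values as printed the statistic comes out near −89.1, while N2 = 4500 gives about −88.5, the value shown on the slide. Either way the statistic is so far into the tail that the conclusion is unaffected.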

Page 55

Enrichment analysis with non-paired, two sample T-test

[Figure: gene score distributions (probability density vs gene score) and the T-distribution (probability density vs T-statistic) with the observed value −88.5 marked]

P-value = shaded area * 2

$$\text{T-statistic} = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}} = -88.5$$

H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal

Page 56

T-test caveats (also see next slide)

1. Assumes black and red gene score distributions are both approximately Gaussian (i.e. normal)
  – Score distribution assumption is often true for:
    • Log ratios from microarrays
  – Score distribution assumption is rarely true for:
    • Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding site hits

2. Tests for significance of the difference in means of two distributions, but does not test for other differences between the distributions.

Page 57

Examples of inappropriate score distributions for T-tests

• Gene scores are positive and have increasing density near zero, e.g. sequence counts
• Distributions with gene score outliers, or “heavy-tailed” distributions
• Bimodal “two-bumped” distributions.

[Figure: three panels, each probability density vs gene score, one per case above]

Solutions:
1) Robust test for difference of medians (WMW)
2) Direct test of difference of distributions (K-S)

Page 58

Enrichment analysis with two-sample, not paired Wilcoxon Rank Sum
aka Mann-Whitney U test or simply “WMW”

1) Rank gene scores, calculate RB, sum of ranks of black gene scores

[Figure: N1 black gene scores and N2 red gene scores merged into one list ranked 1 to 10, next to the two gene score distributions (probability density vs gene score); RB = 21]

H0: Probability that a random sample from the distribution of red scores is > than one from black is 0.5
H1: Otherwise

Page 59

Wilcoxon-Mann-Whitney (WMW) test
aka Mann-Whitney U-test, Wilcoxon rank-sum test

RB = 21

2) Calculate Z-score:

$$Z = \frac{R_B - \dfrac{N_1 (N_1 + N_2 + 1)}{2}}{\sigma_U} = -1.4$$

where $N_1 (N_1 + N_2 + 1)/2$ is the mean of $R_B$ under H0 (“mean rank”).

3) Calculate P-value: P-value = shaded area * 2

[Figure: gene score distributions (probability density vs gene score) and the normal distribution (probability density vs Z) with Z = −1.4 marked]

H0: Probability that a random sample from the distribution of red scores is > than one from black is 0.5
H1: Otherwise

Page 60: Class #2: Statistics, Testing Multiple Hypotheses and bioinformatics · 2015-02-24 · Class #2: Statistics, Testing Multiple Hypotheses and bioinformatics ML4Bio 2012 January 20th,

WMW test details

• Described method is only applicable for large N1 and N2 and when there are no tied scores

• WMW test is robust to (a few) outliers

$$\sigma_U = \sqrt{N_1 N_2 (N_1 + N_2 + 1) / 12}$$
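The Z-score and σU formulas from the WMW slides can be sketched in a few lines of Python. This is only an illustration of the large-sample normal approximation with no tie correction. The function name `wmw_z_and_p` is hypothetical, and the group sizes N1 = N2 = 5 are an assumption inferred from the ten ranks and Z = -1.4 on the slides; they are not stated explicitly.

```python
import math

def wmw_z_and_p(rank_sum_black, n_black, n_red):
    """Normal-approximation Z-score and two-sided P-value for a rank sum.

    Assumes no tied scores; the slides note the approximation is meant
    for large N1 and N2.
    """
    n_total = n_black + n_red
    # Mean of the black rank sum under H0: N1 (N1 + N2 + 1) / 2
    mean_rank_sum = n_black * (n_total + 1) / 2.0
    # Standard deviation under H0: sqrt(N1 N2 (N1 + N2 + 1) / 12)
    sigma_u = math.sqrt(n_black * n_red * (n_total + 1) / 12.0)
    z = (rank_sum_black - mean_rank_sum) / sigma_u
    # Two-sided P-value = 2 * (one-tail area of the standard normal)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

# RB = 21 comes from the slides; N1 = N2 = 5 is an assumed split.
z, p = wmw_z_and_p(rank_sum_black=21, n_black=5, n_red=5)
print(round(z, 1), round(p, 2))
```

With so few scores the normal approximation is rough; an exact or tie-aware implementation would be preferable in practice.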


Kolmogorov-Smirnov (K-S) test for difference of distributions

[Figure: two panels; probability density vs. gene score, and the empirical (cumulative) distribution, cumulative probability (0 to 1.0) vs. gene score]

1) Calculate cumulative distributions of red and black


Kolmogorov-Smirnov (K-S) test

[Figure: the two empirical cumulative distributions (cumulative probability vs. gene score) with the largest vertical gap marked, Length = 0.4]

Test statistic: Maximum vertical difference between the two cumulative distributions
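As a rough sketch of the K-S test statistic described above, the following Python function builds the two empirical cumulative distributions and returns their largest vertical gap. The function name `ks_statistic` and the example scores are hypothetical, and the sketch stops at the statistic; it does not compute a P-value.

```python
from bisect import bisect_right

def ks_statistic(red_scores, black_scores):
    """Maximum vertical difference between two empirical CDFs."""
    red = sorted(red_scores)
    black = sorted(black_scores)
    max_gap = 0.0
    # The empirical CDFs only change at observed scores, so it is
    # enough to compare them at every pooled data point.
    for x in red + black:
        cdf_red = bisect_right(red, x) / len(red)        # fraction of red <= x
        cdf_black = bisect_right(black, x) / len(black)  # fraction of black <= x
        max_gap = max(max_gap, abs(cdf_red - cdf_black))
    return max_gap

# Hypothetical gene scores, for illustration only.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```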


WMW and K-S test caveats

• Neither test is as sensitive as the T-test, i.e. they require more data points to detect the same amount of difference, so use the T-test whenever it is valid.

• K-S test and WMW can give you different answers: K-S detects difference of distributions, WMW detects whether samples from one tend to be higher than those from the other (or vice versa)

• Rare problem: Tied scores and/or a small # of observations can be a problem for some implementations of the WMW test


Proper tests for different distributions

Gene scores are positive and have increasing density near zero, e.g. sequence counts

Distributions with gene score outliers, or “heavy-tailed” distributions

Bimodal “two-bumped” distributions.

[Figure: three probability density plots, one per distribution type listed above; x-axis: Gene score, y-axis: Probability density]

Recommended test (one per panel, in the order given on the slide): WMW or K-S | K-S only | WMW or K-S


What have we learned?

• T-test is not valid when one or both of the score distributions is not normal,

• If you need a “robust” test, or want to test for a difference of medians, use the WMW test,

• To test for overall difference between two distributions, use K-S test.


Central Limit Theorem (CLT)

• We’ve seen a few examples of test statistic distributions which, for large N, are well approximated by a normal distribution.

• This is often due to the CLT:
– If X1, X2, … XN are IID with a PDF with a finite mean and variance then, as N increases, the distribution of the mean of these variables approaches a Gaussian.

• Also holds in most cases when the independent random variables are not identically distributed, i.e. have different distributions.
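A small simulation can make the CLT statement concrete. The sketch below is an illustration only: it averages N draws from a Uniform(0, 1) distribution (an arbitrary choice of a non-Gaussian distribution), standardizes each mean, and reports how often it falls within ±1.96, which should be close to 0.95 if the means are approximately Gaussian. The function name and all parameter values are hypothetical.

```python
import math
import random

def fraction_within_1_96(n_per_mean, n_trials, seed=0):
    """Fraction of standardized sample means that fall in [-1.96, 1.96]."""
    rng = random.Random(seed)
    mu = 0.5                       # mean of Uniform(0, 1)
    sigma = math.sqrt(1.0 / 12.0)  # standard deviation of Uniform(0, 1)
    hits = 0
    for _ in range(n_trials):
        sample_mean = sum(rng.random() for _ in range(n_per_mean)) / n_per_mean
        # Standardize using the standard error of the mean, sigma / sqrt(N)
        z = (sample_mean - mu) / (sigma / math.sqrt(n_per_mean))
        if abs(z) <= 1.96:
            hits += 1
    return hits / n_trials

frac = fraction_within_1_96(n_per_mean=30, n_trials=20000)
print(frac)  # expected to be close to 0.95 for a Gaussian
```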


Other common tests and distributions

• Chi-squared (contingency table) test
– Useful if there are >2 values of annotation (e.g. red genes, black genes, and blue genes)
– Used as an approximation to Fisher’s Exact Test but is inaccurate for small gene lists
– Also used for goodness-of-fit (in general)

• Binomial test
– Tests if gene scores for red and black either come from N flips of the same coin or different coins.
– E.g. black genes are “expressed” in, on average, 5 out of 12 conditions and red genes are expressed in, on average, 2 out of 12 conditions; is the probability of being expressed significantly different for the black and red genes?
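To illustrate the contingency-table use of the chi-squared test with three annotation values, here is a minimal Python sketch. The counts are made up for illustration, and the function names are hypothetical. The P-value helper is valid only for a 2 x 3 table, where the statistic has 2 degrees of freedom and the chi-squared tail area has the closed form exp(-x/2).

```python
import math

def chi_squared_stat(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column labels are independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df2(stat):
    """Upper-tail area of a chi-squared distribution with exactly 2 d.o.f."""
    return math.exp(-stat / 2.0)

# Hypothetical counts. Rows: in gene list / not in gene list.
# Columns: red, black, blue genes.
table = [[30, 10, 20],
         [70, 90, 80]]
stat = chi_squared_stat(table)
print(stat, p_value_df2(stat))
```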


Permutation tests

• Often, the null distribution of the test statistic is unclear or not analytical.

• In these cases, you can generate an empirical distribution by sampling from the null distribution and then evaluating your test statistic against this distribution.

• In many genomic applications it is often possible to get a sample from the null distribution by randomizing (i.e. permuting) the association between genes and corresponding data.
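The permutation idea above can be sketched as follows. This is a minimal illustration, not a prescribed method: the test statistic (absolute difference of group means), the function name `permutation_p_value`, the number of permutations and the example scores are all assumptions chosen for the example. Shuffling the pooled scores breaks the association between genes and their labels, which gives samples from the null distribution.

```python
import random

def permutation_p_value(black_scores, red_scores, n_permutations=2000, seed=0):
    """Empirical P-value for a difference in group means via label shuffling."""
    rng = random.Random(seed)

    def mean_difference(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_difference(black_scores, red_scores)
    pooled = list(black_scores) + list(red_scores)
    n_black = len(black_scores)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomize which scores are labelled black or red
        if mean_difference(pooled[:n_black], pooled[n_black:]) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical, clearly separated scores, for illustration only.
p_sep = permutation_p_value([10, 11, 12, 13, 14, 15, 16, 17],
                            [0, 1, 2, 3, 4, 5, 6, 7])
print(p_sep)
```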


Multiple test correction: Bonferroni and False Discovery Rate


How to win the P-value lottery, part 1

Random draws … 7,834 draws later …

Expect a random draw with observed enrichment once every 1 / P-value draws

Background population: 500 black genes, 5000 red genes


How to win the P-value lottery, part 2

Keep the gene list the same, evaluate different annotations

Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42

Different annotations: RRP6, MRD1, RRP7, RRP43, RRP42


ORA tests need correction

From the Gene Ontology website:

Current ontology statistics: 25206 terms
• 14825 biological process
• 2101 cellular component
• 8280 molecular function


Simple P-value correction: Bonferroni

If M = # of annotations tested:

Corrected P-value = M x original P-value

Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”.
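The correction above is a one-line calculation; a minimal Python sketch follows. The function name `bonferroni` and the example P-values are hypothetical. Capping the result at 1 is a common convention added here as an assumption, since the slide only states M x original P-value.

```python
def bonferroni(p_values):
    """Bonferroni-corrected P-values: M x original, capped at 1."""
    m = len(p_values)  # M = number of annotations (tests)
    return [min(1.0, m * p) for p in p_values]

# Hypothetical P-values from testing three annotations.
corrected = bonferroni([0.01, 0.04, 0.5])
print(corrected)
```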


Bonferroni correction caveats

• Bonferroni correction is very stringent and can “wash away” real enrichments.

• Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.


False discovery rate (FDR)

• FDR is the expected proportion of the observed enrichments due to random chance.

• Compare to Bonferroni correction which is a bound on the probability that any one of the observed enrichments could be due to random chance.

• Typically FDR corrections are calculated using the Benjamini-Hochberg procedure.

• FDR threshold is often called the “q-value”


Controlling FDR using the Benjamini-Hochberg procedure I

• Say you want to bound the FDR at α; you need to calculate the corresponding P-value threshold t.

• First, calculate the P-values for all the tests, and then sort them so that p1 is the smallest (i.e. most significant) P-value, and pm is the least.

Benjamini, Y. & Hochberg, Y. (1995) J. R. Stat. Soc. B 57, 289–300


Controlling FDR using the Benjamini-Hochberg procedure II

• t = pr where r is the max value for which:

$$p_r \le \frac{r \alpha}{m}$$

where α is the FDR threshold, r is the rank and m is the # of tests.

Caveat: Assumes independent or positively correlated tests.
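The two Benjamini-Hochberg steps (sort the P-values, then find the largest rank r with pr ≤ rα / m) can be sketched in Python as below. The function name `bh_threshold` and the example P-values are hypothetical, and returning `None` when no rank satisfies the bound is a choice made for this sketch.

```python
def bh_threshold(p_values, alpha):
    """P-value threshold t that bounds the FDR at alpha, or None if no test passes."""
    m = len(p_values)  # number of tests
    threshold = None
    # Sort so that rank 1 is the smallest (most significant) P-value
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank * alpha / m:
            threshold = p  # keep overwriting: the last hit has the largest rank r
    return threshold

# Hypothetical P-values for ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
t = bh_threshold(p_values, alpha=0.05)
print(t)  # tests with P-value <= t are called significant
```

Note that the loop does not stop at the first failure: a larger rank can satisfy the bound even when a smaller one does not, and the procedure uses the largest such rank.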



Reducing multiple test correction stringency

• The correction to the P-value threshold depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be.

• Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.


Meta-analysis for multiple tests: Fisher’s method for combining P-values

• Given different tests for the same hypothesis H, with P-values p1, p2, …, pN, you can use Fisher’s method to combine them into a single P-value.

• The test statistic $X^2 = -2 \sum_i \ln p_i$ has a null distribution that is a chi-square with 2N degrees of freedom.
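Fisher’s method as stated above can be sketched in pure Python. The function name `fisher_combined_p` and the example P-values are hypothetical. Because the degrees of freedom 2N are always even, the chi-square upper-tail area has a closed form, exp(-x/2) times the sum over k = 0 … N-1 of (x/2)^k / k!, which avoids needing a statistics library. The sketch assumes the tests are independent and that every P-value is greater than 0.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's statistic X^2 = -2 * sum(ln p_i) and its combined P-value."""
    n = len(p_values)
    x2 = -2.0 * sum(math.log(p) for p in p_values)
    half = x2 / 2.0
    # Upper tail of a chi-square with 2N (even) degrees of freedom
    combined = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(n))
    return x2, combined

# Hypothetical P-values from two independent tests of the same hypothesis.
x2, combined_p = fisher_combined_p([0.05, 0.05])
print(x2, combined_p)
```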


Summary

• Multiple test correction
– Bonferroni: stringent, controls probability of at least one false positive*
– FDR: more forgiving, controls expected proportion of false positives* (typically uses Benjamini-Hochberg)

• Fisher’s Method to combine P-values
– If you have multiple, independent tests of the same hypothesis, you can combine the P-values into a single P-value.