Genes, Microarrays and Motifs
Lecture 8
CSC 2417/BCB 410
Michael Brudno
Many slides from various sources, including T. Hughes (U. of T.), S. Batzoglou (Stanford), Sanja Rogic (UBC), Manolis Kellis (MIT)
Outline
  Intro to genes and motifs
  Identifying Gene Structures
  Microarrays
  Identifying Regulatory Elements
Cells respond to environment
Cell responds to environment: various external messages
Genome is fixed – Cells are dynamic
A genome is static
  Every cell in our body has a copy of the same genome
A cell is dynamic
  Responds to external conditions
  Most cells follow a cell cycle of division
  Cells differentiate during development
Gene expression varies according to:
  Cell type
  Cell cycle
  External conditions
  Location
slide credits: M. Kellis
Where gene regulation takes place
  Opening of chromatin
  Transcription
  Translation
  Protein stability
  Protein modifications
Transcriptional Regulation
Efficient place to regulate:
  No energy wasted making intermediate products
However, slowest response time
After a receptor notices a change:
  1. Cascade message to nucleus
  2. Open chromatin & bind transcription factors
  3. Recruit RNA polymerase and transcribe
  4. Splice mRNA and send to cytoplasm
  5. Translate into protein
Transcription Factors Binding to DNA
Transcription regulation:
  Transcription factors bind DNA
Binding recognizes DNA substrings:
  Regulatory motifs
Promoter and Enhancers
  Promoter: necessary to start transcription
  Enhancers: can affect transcription from afar
Gene Regulation with TFs
[Figure, three panels, with labels: Transcription Factor (Protein), Regulatory Element, Gene, RNA polymerase, DNA, New protein]
[Figure: long stretches of raw genomic DNA sequence, shown twice, with a legend marking promoter motifs, 3' UTR motifs, exons, and introns]
Example: A Human heat shock protein
TATA box: positioning transcription start
TATA, CCAAT: constitutive transcription
GRE: glucocorticoid response
MRE: metal response
HSE: heat shock element
[Figure: promoter of heat shock gene hsp70, positions -158 to 0 upstream of the gene, with sites labeled TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1]
Splicing
frgjjthissentencehjfmkcontainsjunkelm
thissentencecontainsjunk
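The toy string above can be turned into a tiny worked example. This is only a sketch of the idea: the function name `splice` and the exon coordinates are assumptions picked by hand for this string, not something a cell or a gene finder would be given.

```python
def splice(pre_mrna, exons):
    """Join the exon intervals (0-based, end-exclusive) and drop everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


raw = "frgjjthissentencehjfmkcontainsjunkelm"
# Hypothetical exon coordinates, chosen by hand for this toy string.
exons = [(5, 17), (22, 34)]
mature = splice(raw, exons)
print(mature)
```

The hard part of gene finding is that the coordinates in `exons` are unknown and have to be inferred from the sequence itself.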
Gene structure
[Figure: a gene laid out as exon1, intron1, exon2, intron2, exon3; steps shown: transcription, splicing, translation]
exon = protein-coding
intron = non-coding
Codon: A triplet of nucleotides that is converted to one amino acid
Alternative splicing
[Figure: exon 1 to exon 5 of one gene combined in different ways into Isoform 1, Isoform 2, and Isoform 3]
Gene Finding: Different Approaches
Similarity-based methods (extrinsic) - use similarity to annotated sequences:
  proteins
  cDNAs
  ESTs
Comparative genomics - Aligning genomic sequences from different species
Ab initio gene-finding (intrinsic)
Integrated approaches
Similarity-based methods
Based on sequence conservation due to functional constraints
Use local alignment tools (Smith-Waterman algorithm, BLAST, FASTA) to search protein, cDNA, and EST databases
Will not identify genes that code for proteins not already in databases
Limits of the regions of similarity are not well defined
Comparative Genomics
Based on the assumption that coding sequences are more conserved than non-coding
Two approaches:
  intra-genomic (gene families)
  inter-genomic (cross-species)
Alignment of homologous regions
  Difficult to define limits of higher similarity
  Difficult to find optimal evolutionary distance
Using Comparative Information
Hox cluster is an example where everything is conserved
Patterns of Conservation
              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold
Summary for Extrinsic Approaches
Strengths:
  Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions
Weaknesses:
  Limited to pre-existing biological data
  Errors in databases
  Difficult to find limits of similarity
Ab initio Gene Finding, Part 1
Input: A DNA string over the alphabet {A,C,G,T}
Output: An annotation of the string showing for every nucleotide whether it is coding or not
AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG
  (Gene finder)
AAAGC ATG CAT TTA ACG A GT GCATC AG GA CTC CAT ACG TAA TGCCG
Ab initio Gene Finding, Part 2
Using only sequence information
Identifying only coding exons of protein-coding genes (transcription start site, 5' and 3' UTRs are ignored)
Integrates coding statistics with signal detection
Coding Statistics, Part 1
Unequal usage of codons in the coding regions is a universal feature of genomes
  uneven usage of amino acids in existing proteins
  uneven usage of synonymous codons
We can use this feature to differentiate between coding and non-coding regions of the genome
Coding statistic - a function that for a given DNA sequence computes a likelihood that the sequence is coding for a protein
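One way to make the definition of a coding statistic concrete is a codon-usage log-likelihood ratio. The sketch below is an illustration under stated assumptions, not the method of any particular gene finder: the function names, the pseudocount, the one-sequence toy training set, and the uniform background model are all hypothetical choices.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame sequences (uppercase A, C, G, T only)."""
    counts = Counter()
    for seq in seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(seq, coding_freq, background_freq):
    """Sum of log-odds (coding vs. background) over the codons of seq read in frame 0."""
    return sum(
        math.log(coding_freq[seq[i:i + 3]] / background_freq[seq[i:i + 3]])
        for i in range(0, len(seq) - 2, 3)
    )


# Hypothetical toy training set: one short "known coding" sequence.
training = ["ATGGCTGCTGCTGCTAAATAA"]
coding_freq = codon_frequencies(training)
# Assumed background model: all 64 codons equally likely.
background_freq = {c: 1 / 64 for c in CODONS}

print(coding_score("GCTGCT", coding_freq, background_freq))  # codon seen often in training
print(coding_score("TTTTTT", coding_freq, background_freq))  # codon never seen in training
```

A positive score means the window looks more like the coding training data than like background. A real statistic would be trained on many annotated genes and evaluated in all reading frames.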
Coding Statistics, Part 2
Many different ones
  codon usage
  hexamer usage
  GC content
  compositional bias between codon positions
  nucleotide periodicity
  …
Signal Sensors, Part 1
Signal – a string of DNA recognized by the cellular machinery
Signal Sensors, Part 2
Various pattern recognition methods are used for identification of these signals:
  consensus sequences
  weight matrices
  weight arrays
  decision trees
  Hidden Markov Models (HMMs)
  neural networks
  …
Signals:
  ATG (start codon)
  TGA, TAA, TAG (stop codons)
  GT (donor splice site)
  AG (acceptor splice site)
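Of the signal sensors listed, a weight matrix is the easiest to sketch. The example below builds a position weight matrix from a few aligned example sites and scans a sequence for the best-scoring window. The aligned sites, the pseudocount, the uniform background, and all function names are assumptions made for illustration only.

```python
import math


def build_pwm(sites, pseudocount=1.0):
    """Column-wise base probabilities estimated from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            for base in "ACGT"
        })
    return pwm


def score(window, pwm, background=0.25):
    """Log2-odds score of one window against a uniform background."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(window))


def best_hit(seq, pwm):
    """Start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1), key=lambda i: score(seq[i:i + width], pwm))


# Hypothetical aligned donor-like sites (all begin with GT); not real training data.
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(sites)
seq = "CCATGGTAAGTCC"
hit = best_hit(seq, pwm)
print(hit, seq[hit:hit + len(pwm)])
```

Unlike a consensus sequence, the matrix scores near-matches gradually, which fits the point that signal motifs are stochastic rather than exact strings.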
Stochastic Nature of Signal Motifs
Gene Prediction Summary
Expressed sequence (cDNA) or protein sequence available?
  Yes → Spliced alignment
    BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
  No → Integrated gene prediction
    Informant genome(s) available?
      Yes → Dual or n-genome de novo predictors:
        SGP2, Twinscan, NSCAN, (Genomescan – same or cross genome protein blastx)
      No → ab initio predictors
        geneid, genscan, augustus, fgenesh, genemark, etc.
Many newer gene predictors can run in multiple modes depending on the evidence available.
Microarray
Measure the level of mRNA messages in a cell
[Figure: array spots DNA 1 to DNA 6 for Gene 1 to Gene 6; RNA 4 and RNA 6 are converted by RT to cDNA 4 and cDNA 6; steps labeled Hybridize and Measure]
slide credits: M. Kellis
Typical use of cDNA Microarrays: "Internal" normalization using two colors
[Figure: cDNA pools from a control sample and a treatment sample (drug, mutation) on the same array; genes x, y, z are read as up, down, unchanged, or not present]
Alternative Splicing Microarray
Measure the expression of the various probes
Infer the expression of the different splice forms from the ratio of the inclusion and exclusion isoforms
Picture taken from: J. Calarco et al. Genes and Dev. 21, 2963-2975
“cDNA microarrays” are essentially dot-blots on glass slides
http://arrayit.com/Products/Printing/Stealth/stealth.html
• This slide was made with 16 pins
• 4.5 mm pin spacing matches 384-well plates (16 x 24)
• Done with robotics
• Slides usually coated with poly-lysine
• Spots are usually 100-150 microns
• Spot spacing is usually 200-300 microns
• Slides are 25 x 75 mm
• Easy to deposit 20K spots/slide
[Figure scale label: 0.45 mm]
Microarray expression profiling by 2-color assay ("cDNA arrays")
Array: PCR products, 6250 yeast ORFs
hybridized cDNAs: green = control, red = experiment
*Schena et al., 1995
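The two-color readout can be sketched as a per-spot log ratio of the experiment (red) channel to the control (green) channel. The thresholds below (a 2-fold cutoff and a minimum signal of 50) and the function names are hypothetical values chosen for illustration; real analyses calibrate them per array and usually apply further normalization.

```python
import math


def log_ratio(red, green, floor=1.0):
    """log10(red / green); intensities are clamped to `floor` to avoid dividing by zero."""
    return math.log10(max(red, floor) / max(green, floor))


def call_spot(red, green, fold=2.0, min_signal=50.0):
    """Classify one spot as up, down, unchanged, or not present (assumed thresholds)."""
    if red < min_signal and green < min_signal:
        return "not present"
    ratio = log_ratio(red, green)
    if ratio >= math.log10(fold):
        return "up"
    if ratio <= -math.log10(fold):
        return "down"
    return "unchanged"


# Hypothetical (red, green) intensities for four spots.
spots = {"x": (1000.0, 100.0), "y": (100.0, 1000.0), "z": (500.0, 450.0), "w": (10.0, 5.0)}
for name, (red, green) in spots.items():
    print(name, round(log_ratio(red, green), 2), call_spot(red, green))
```

Because both samples are hybridized to the same spot, spot-to-spot differences in the amount of printed DNA cancel in the ratio, which is the sense of "internal" normalization.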
Looking at data from a single experiment
[Figure: scatter plots of log10(expression ratio) against log10(intensity), both axes from -2 to 2, with points at P-value < 0.01 marked. Panels: 3-AT vs. no drug; wild-type vs. wild-type. Slides: 11120c01 - 11121c01; 11857c01 - 11858c01. A second pair of panels plots log10(ratio) against log10(average intensity) on the same scales.]
Clustering Algorithms
• Hierarchical
• K-means
[Figure: the same eight points, a to h, grouped two ways: as a hierarchical tree, and by K-means with three centers c1, c2, c3]
slide credits: M. Kellis
Hierarchical clustering
Bottom-up algorithm:
  Initialization: each point in a separate cluster
  At each step:
    Choose the pair of closest clusters
    Merge
The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
Avoids the problem of specifying the number of clusters
slide credits: M. Kellis
Distance between clusters
Single-link method: $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$
Complete-link method: $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$
Average-link method: $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$
Centroid method: $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$
slide credits: M. Kellis
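The bottom-up algorithm and the four cluster distances can be sketched in a few lines. This is a deliberately naive version (it recomputes all pairwise cluster distances at every step), with Euclidean distance assumed for D and hypothetical function names and toy points.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def cluster_distance(X, Y, linkage="single"):
    """CD(X, Y) under the single, complete, average, or centroid definition."""
    if linkage == "centroid":
        return math.dist(centroid(X), centroid(Y))
    distances = [math.dist(x, y) for x in X for y in Y]
    if linkage == "single":
        return min(distances)
    if linkage == "complete":
        return max(distances)
    if linkage == "average":
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, target_clusters=1, linkage="single"):
    """Start with one cluster per point; repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical 2-D points standing in for expression profiles.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, target_clusters=3, linkage="single"))
```

Running to `target_clusters=1` and recording each merge gives the full tree, which is why the number of clusters does not have to be fixed in advance.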
Hierarchical Clustering Result
[Figure: clustered heat map of transcript response index against experiment index; color scale runs from 10-fold repression to 10-fold induction; labeled experiment groups: RHO O/X, PKC O/X, ste mutants, treatment with alpha-factor]
Data from Roberts et al., Science (2000)
K-Means Clustering Algorithm
Each cluster $X_i$ has a center $c_i$
Define the clustering cost criterion
  $COST(X_1, \dots, X_k) = \sum_{X_i} \sum_{x \in X_i} |x - c_i|^2$
Algorithm tries to find clusters $X_1 \dots X_k$ and centers $c_1 \dots c_k$ that minimize COST
K-means algorithm:
  Initialize centers
  Repeat:
    Compute best clusters for given centers → Attach each point to the closest center
    Compute best centers for given clusters → Choose the centroid of points in cluster
  Until the changes in COST are "small"
slide credits: M. Kellis
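The K-means loop described above can be sketched directly. The version below is a minimal illustration: the random initialization from the data points, the tolerance, the handling of empty clusters, and the `init` parameter are assumptions added here, not part of the slides.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, tol=1e-9, seed=0):
    """Alternate assignment and centroid steps until COST stops improving."""
    if init is not None:
        centers = list(init)
    else:
        # Assumed initialization: k distinct data points chosen at random.
        centers = random.Random(seed).sample(points, k)
    prev_cost = math.inf
    clusters, cost = [], math.inf
    for _ in range(max_iter):
        # Best clusters for the given centers: attach each point to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Best centers for the given clusters: centroid (an empty cluster keeps its center).
        centers = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        cost = sum(math.dist(p, centers[i]) ** 2 for i, cl in enumerate(clusters) for p in cl)
        if prev_cost - cost < tol:
            break
        prev_cost = cost
    return centers, clusters, cost


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters, cost = kmeans(points, k=2, init=[(0, 0), (10, 10)])
print(centers, cost)
```

Each step can only lower COST or leave it unchanged, so the loop terminates, but the result depends on the initial centers and may be a local minimum; rerunning with several seeds is a common workaround.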
K-Means Algorithm
Assign data points to nearest clusters
Repeat … until convergence
Time: O(KNM) per iteration
  N: #genes
  M: #conditions
[Figure: K = 10; panels #1, #2, #3]
K-Means Result
One example: K-means (must choose K)
See: Sherlock G. Analysis of large-scale gene expression data. Curr Opin Immunol. 2000 Apr;12(2):201-5.
GO-Biological Process categories
[Table: example GO Biological Process categories at four levels of specificity (Very Broad, Broad, Mid-level, Narrow), with the number of annotated mouse genes in each. Categories shown: metabolism, development, CNS development, vision, ATP biosynthesis, striated muscle contraction, pigment metabolism, eye morphogenesis, eye pigment metabolism, insulin secretion. Gene counts shown: 163, 137, 21, 36, 25, 33, 34, 1548, and 2341 (development).]
GO-Biological Process hierarchy
[Figure: hierarchy linking metabolism, pigment metabolism, eye pigment metabolism, development, CNS development, and eye morphogenesis]
Other types of categorical annotations:
KEGG, EC numbers (describe biochemical “pathways”)
MIPS, YPD (yeast databases – older than GO)
Results of individual studies (localization, 2-hybrid screens, protein complexes, etc.)
Sequence motifs, structural domains (pfam, SMART)
Evaluating clusters – Hypergeometric Distribution
• N genes, p labeled +, (N-p) labeled –
• Cluster: k genes, m labeled +
• P-value of single cluster containing k genes of which at least r are +
Prob. that a random set of k genes has m + and (k-m) – genes:
  $P(m) = \binom{p}{m} \binom{N-p}{k-m} / \binom{N}{k}$
P-value that at least r genes are + in the cluster:
  $P(\geq r) = \sum_{m \geq r} \binom{p}{m} \binom{N-p}{k-m} / \binom{N}{k}$
slide credits: M. Kellis
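The hypergeometric P-value can be computed directly with exact binomial coefficients. The sketch below uses the slide's symbols (N, p, k, m, r); the function names and the small numeric example are hypothetical.

```python
from math import comb


def hypergeom_pmf(N, p, k, m):
    """Probability that a random set of k genes (out of N, p of them labeled +) has exactly m +."""
    return comb(p, m) * comb(N - p, k - m) / comb(N, k)


def enrichment_pvalue(N, p, k, r):
    """Probability that at least r of the k genes in the cluster are labeled +."""
    return sum(hypergeom_pmf(N, p, k, m) for m in range(r, min(k, p) + 1))


# Hypothetical example: 10 genes, 4 labeled +, cluster of 3 genes with at least 2 labeled +.
print(enrichment_pvalue(N=10, p=4, k=3, r=2))
```

For genome-scale N the same sum is usually taken from a statistics library's hypergeometric survival function, but the exact integer arithmetic here is adequate for a sketch.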
Analyzing clusters:
amino acid biosynthesis (p<10^-14)**
amino acid metabolism (p<10^-14)**
methionine metabolism (p=1.07×10^-7)
**When testing clusters against many different types of categorical annotations, should consider correcting for multiple-testing, and also consider that categories are often not independent
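As one simple illustration of a multiple-testing correction, the Bonferroni adjustment multiplies each P-value by the number of tests performed. This is a sketch of one standard option, not the correction used in the lecture; the category names and P-values below are hypothetical.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return {name: (adjusted p-value, significant?)} using the Bonferroni correction."""
    n = len(pvalues)
    return {
        name: (min(1.0, p * n), p * n <= alpha)
        for name, p in pvalues.items()
    }


# Hypothetical raw P-values for one cluster tested against three categories.
raw = {"category A": 1e-7, "category B": 0.01, "category C": 0.04}
for name, (adjusted, significant) in bonferroni(raw).items():
    print(name, adjusted, significant)
```

Bonferroni stays valid when categories overlap, but it becomes conservative in that case, which is one reason the lack of independence between categories matters.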
Cluster label | Sample genes
amino acid metabolism | TRP4, HIS3
arginine biosynthesis | ARG1, ARG3
arginine catabolism | CAR1, CAR2
aromatic AA metabolism | ARO9, ARO10
asparagine biosynthesis | ASN1, ASN2
branched chain AA synth | ILV1,2,3,6
lysine biosynthesis | LYS2, LYS9
methionine biosynthesis | MET3,16,28
sulfur AA tnsprt, metab | MUP1, MHT1
adenine biosynthesis | ADE1,4,8
aldehyde metabolism | AAD4,14,16
biotin biosynthesis | BIO3,4
citrate metabolism | CIT1,2
ergosterol biosynthesis | ERG1,5,11
fatty acid biosynthesis | FAS1,FAS2
gluconeogenesis | PGK1, TDH1,2,3
NAD biosynthesis | BNA4,6
one-carbon metabolism | GCV1,2,3
pyridoxine metabolism | SNO1, SNZ1
thiamin biosynthesis 1 | THI5,12
thiamin biosynthesis 2 | THI2,20
hexose transport | HXT4,GSY1
sodium ion transport | ENA1,2,5
polyamine transport | TPO2,3
nucleocytoplasmic transport | KAP123,NUP100
ribosome/RNA biogenesis | MAK16,CBF5
ribosomal proteins | RPS1A,RPL28
translational elongation | TEF1,2
protein folding | SSA1,HSP60
secretion | VTH1,KRE11
protein glycosylation | ALG6,CAX4
vesicle-mediated transport | VPS5,IMH1
proteasome | RPN6,RPT5
vacuole fusion | VTC1,3,4,PHO84
mitoribosome/respiration | MRPL1,MRPS5
Mitochond. electron trans. | ATP1,COX4
iron transport/TCA cycle | FRE1,FET3
Chromatin/transcription | SNF2,CHD1,DOT6
histones | HTA1,HHF1
MCM2/3/6/CDC47 | MCM2,3,6
DNA replication | RFA1,POL12
mitotic cell cycle | SPC110,CIN8
CLB1/CLB6/BBP1 | CLB1,6
cytokinesis | CTS1,EGT2
development | PAM1,GIC2
pheromone response | FUS3,FAR1
conjugation | CIK1,KAR3
sporulation/meiosis | SPO11,SPO19
response to oxidative stress | GDH3,HYR1
stress/heat shock | HSP104,SSA4
[Candidate regulator column, entries in slide order; some rows are blank on the slide, so the row alignment is not recoverable here:
GCN4; ARG80/81; ARG80/81/UME6/RPD3; ARO80; GCN4/HAP1/HAP2; LEU3, GCN4; LYS14; CBF1, MET28, MET32; MET31,MET32; BAS1, BAS2, GCN4
RTG3; ECM22/UPC2; INO4; GCR1
THI2/THI3; THI2/THI3; GCR1; NRG1,MIG1; HAA1; RRPE-binding factor; PAC/RRPE-binding factors
HAC1,ROX1; RLM1; XBP1
RPN4; PHO4
HAP2/3/4/5; MAC1/RCS1/AFT1/PDR1/3
HIR1,HIR2; ECB; MCB; HCM1; FKH1; ACE2,SWI4
MATALPHA2,STE12; KAR4; NDT80; ROX1,MSN2,MSN4; MSN2,MSN4]
[Figure: non-overlapping yeast gene expression clusters across 424 experiments; axis labels: 249 genes, 1,226 genes]
Chua et al., 2004
Microarray expression data
Co-regulated groups of genes
Functional categories
Predict functions of new genes
cis, trans regulators
Identifying Motifs
Genes are turned on or off by regulatory proteins
These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase
Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)
So finding the same motif in multiple genes' regulatory regions suggests a regulatory relationship amongst those genes
Identifying Motifs: Complications
We do not know the motif sequence
We do not know where it is located relative to the gene's start
Motifs can differ slightly from one gene to the next
How to discern it from "random" motifs?
Regulatory Motif Discovery
Find motifs within groups of co-regulated genes
[Figure: DNA upstream of a group of co-regulated genes, with a common subsequence marked]
slide credits: M. Kellis
Random Sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
Implanting Motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
Where is the Implanted Motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
Implanting Motif AAAAAAAAGGGGGGG with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
Where is the Motif???
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
Why Is Finding a (15,4) Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
cAAtAAAAcGGcGGG
..|..|||.|..|||
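The alignment above can be checked with a few lines of Python. This is a minimal sketch; the function name `hamming` and the variable names are illustrative and not from the slides. Each instance is 4 mismatches away from the implanted pattern, but the two instances are further from each other than either is from the pattern, which is what makes pairwise comparison of instances unreliable.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))

motif = "AAAAAAAAGGGGGGG"   # implanted pattern (15 nt)
inst1 = "AgAAgAAAGGttGGG"   # instance from the first sequence
inst2 = "cAAtAAAAcGGcGGG"   # instance from the second sequence

d1 = hamming(motif, inst1)   # distance of instance 1 to the pattern
d2 = hamming(motif, inst2)   # distance of instance 2 to the pattern
d12 = hamming(inst1, inst2)  # distance between the two instances
print(d1, d2, d12)
```

In general two instances of a (15,4)-motif can differ at up to 8 positions, so they may look barely more similar than random 15-mers.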
Challenge Problem
Find a motif in a sample of
- 20 “random” sequences (e.g. 600 nt long)
- each sequence containing an implanted pattern of length 15,
- each pattern appearing with 4 mismatches as (15,4)-motif.
Pevzner, et al
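A test set for the challenge problem could be generated roughly as follows. This is a sketch under stated assumptions: the function name `implant_motif`, the fixed seed, and the uniform random background are choices made here for illustration, not details given on the slide. Background and mutated bases are written in lowercase and unmutated motif bases in uppercase, mirroring the display used on the earlier slides.

```python
import random

def implant_motif(motif, n_seqs=20, length=600, mismatches=4, seed=0):
    """Return n_seqs random sequences, each with one mutated copy of motif."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    seqs = []
    for _ in range(n_seqs):
        # Uniform random background (an assumption), in lowercase.
        background = [rng.choice("acgt") for _ in range(length)]
        instance = list(motif.upper())
        # Mutate exactly `mismatches` distinct positions to a different base.
        for pos in rng.sample(range(len(motif)), mismatches):
            others = [b for b in "ACGT" if b != instance[pos]]
            instance[pos] = rng.choice(others).lower()
        # Overwrite a random window of the background with the instance.
        start = rng.randrange(length - len(motif) + 1)
        background[start:start + len(motif)] = instance
        seqs.append("".join(background))
    return seqs

seqs = implant_motif("AAAAAAAAGGGGGGG")
```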
Discrete Formulations
Given sequences S = {x_1, …, x_n}
A motif W is a consensus string w_1…w_K
Find motif W* with “best” match to x_1, …, x_n
Definition of “best”:
d(W, x_i) = min Hamming dist. between W and any word in x_i
d(W, S) = Σ_i d(W, x_i)
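The two distance definitions translate directly into code. The sketch below uses hypothetical helper names (`hamming`, `d_seq`, `d_total`) and a toy three-sequence sample that is not from the slides.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def d_seq(W, x):
    """d(W, x_i): minimum Hamming distance between W and any K-long word in x."""
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S): sum of d(W, x_i) over all sequences in S."""
    return sum(d_seq(W, x) for x in S)

# Toy sample: ACGT occurs exactly in the first sequence and with
# one mismatch in each of the other two.
S = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]
total = d_total("ACGT", S)
```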
Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities)
Find d(W, S)
Report W* = argmin( d(W, S) )
Running time: O(K N 4^K)
(where N = Σ_i |x_i|)
Advantage: Finds provably “best” motif W
Disadvantage: Time
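A minimal sketch of the pattern-driven search, assuming the distance definitions from the Discrete Formulations slide. The helper names and the toy sample are illustrative. The loop over all 4^K candidate words is what makes this approach impractical for K = 15.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def d_seq(W, x):
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_total(W, S):
    return sum(d_seq(W, x) for x in S)

def pattern_driven(S, K):
    """Try every K-mer over ACGT and return one that minimizes d(W, S)."""
    candidates = ("".join(p) for p in product("ACGT", repeat=K))
    # Ties are broken by enumeration order (first minimal word wins).
    return min(candidates, key=lambda W: d_total(W, S))

S = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]  # toy sample, K = 4
best = pattern_driven(S, 4)
best_score = d_total(best, S)
```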
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some x_i
Find d(W, S)
Report W* = argmin( d(W, S) )
or, Report a local improvement of W*
Running time: O(K N^2)
Advantage: Time
Disadvantage: If the true motif is weak and does not occur in data then a random motif may score better than any instance of true motif
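The sample-driven variant restricts the candidates to words that actually occur in the data. A sketch with the same hypothetical helpers and toy sample as before; the local-improvement step mentioned on the slide is not implemented here.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def d_seq(W, x):
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_total(W, S):
    return sum(d_seq(W, x) for x in S)

def sample_driven(S, K):
    """Score only K-mers that occur in some sequence of S."""
    candidates = {x[i:i + K] for x in S for i in range(len(x) - K + 1)}
    # Sorting makes tie-breaking deterministic.
    return min(sorted(candidates), key=lambda W: d_total(W, S))

S = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]  # toy sample, K = 4
best = sample_driven(S, 4)
best_score = d_total(best, S)
```

There are at most N candidate words and each costs O(K N) to score, which gives the O(K N^2) bound. The result can be worse than the pattern-driven optimum when the true consensus never appears unmutated in the data.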
Example of Consensus Sequence
obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
consensus sequence: TATAAT
consensus (IUPAC): TATRNT
Leads to loss of information and can produce many false positive or false negative predictions
MELON
MANGO
HONEY
SWEET
COOKY
consensus: MONEY
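The consensus rule can be sketched in a couple of lines. The function name `consensus` is illustrative, and the tie-breaking rule (alphabetical order, so A beats G at position 4 where both occur three times) is an assumption made here so that the result matches the slide.

```python
def consensus(seqs, alphabet="ACGT"):
    """Most frequent base per aligned column; ties go to the earlier letter."""
    # zip(*seqs) yields the alignment column by column; max() returns the
    # first maximal element, so alphabet order decides ties.
    return "".join(max(alphabet, key=col.count) for col in zip(*seqs))

sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
result = consensus(sites)
print(result)
```

Note that none of the information about how strongly each position is conserved survives in the output string, which is the loss of information the slide refers to.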
Sequence Logos
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA
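In a sequence logo the height of each column is commonly taken to be its information content, 2 bits minus the Shannon entropy of the observed base frequencies. The sketch below computes that quantity for the five sequences above; the function name is illustrative, and the small-sample correction that some logo tools apply is left out.

```python
import math

def information_content(seqs, alphabet="ACGT"):
    """Per-column information content in bits: log2(|alphabet|) - entropy."""
    result = []
    for col in zip(*seqs):
        entropy = 0.0
        for base in alphabet:
            p = col.count(base) / len(col)
            if p > 0:  # 0 * log(0) is treated as 0
                entropy -= p * math.log2(p)
        result.append(math.log2(len(alphabet)) - entropy)
    return result

sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
ic = information_content(sites)
```

Fully conserved columns (positions 1, 2, 4, 6, 7) reach the maximum of 2 bits, while the two mixed A/G columns carry roughly half of that.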
Characteristics of Regulatory Motifs
Tiny
Highly Variable
~Constant Size
  Because a constant-size transcription factor binds
Often repeated
Low-complexity-ish
Weight Matrices & Sequence Logos
Set of signal sequences:
     1  2  3  4  5  6  7  8  9 10 11 12 13 14
1    G  A  C  C  A  A  A  T  A  A  G  G  C  A
2    G  A  C  C  A  A  A  T  A  A  G  G  C  A
3    T  G  A  C  T  A  T  A  A  A  A  G  G  A
4    T  G  A  C  T  A  T  A  A  A  A  G  G  A
5    T  G  C  C  A  A  A  A  G  T  G  G  T  C
6    C  A  A  C  T  A  T  C  T  T  G  G  G  C
7    C  A  A  C  T  A  T  C  T  T  G  G  G  C
8    C  T  C  C  T  T  A  C  A  T  G  G  G  C
Position Frequency Matrix - PFM
     1  2  3  4  5  6  7  8  9 10 11 12 13 14
A    0  4  4  0  3  7  4  3  5  4  2  0  0  4
C    3  0  4  8  0  0  0  3  0  0  0  0  2  4
G    2  3  0  0  0  0  0  0  1  0  6  8  5  0
T    3  1  0  0  5  1  4  2  2  4  0  0  1  0
Position Weight Matrix - PWM
       1     2     3     4     5     6     7     8     9    10    11    12    13    14
A  -1.93   .79   .79 -1.93   .45  1.50   .79   .45  1.07   .79   .0  -1.93 -1.93   .79
C    .45 -1.93   .79  1.68 -1.93 -1.93 -1.93   .45 -1.93 -1.93 -1.93 -1.93   .0    .79
G    .0    .45 -1.93 -1.93 -1.93 -1.93 -1.93 -1.93  -.66 -1.93  1.3   1.68  1.07 -1.93
T    .45  -.66 -1.93 -1.93  1.07  -.66   .79   .0    .0    .79 -1.93 -1.93  -.66 -1.93
Score for New Sequence
       T     T     G     C     A     T     A     A     G     T     A     G     T     C
     .45  -.66   .79  1.68   .45  -.66   .79   .45  -.66   .79   .0   1.68  -.66   .79
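The path from signal sequences to PFM to PWM to a score can be sketched as follows. The slide does not state the weight formula; the one used here, a log2-odds against a uniform background with a pseudocount of sqrt(N), is an assumption that appears to reproduce the tabulated weights (for example 0 counts out of 8 gives about -1.93 and 8 out of 8 about 1.68). All function names are illustrative.

```python
import math

SIGNAL_SEQS = [
    "GACCAAATAAGGCA", "GACCAAATAAGGCA",
    "TGACTATAAAAGGA", "TGACTATAAAAGGA",
    "TGCCAAAAGTGGTC", "CAACTATCTTGGGC",
    "CAACTATCTTGGGC", "CTCCTTACATGGGC",
]

def position_frequency_matrix(seqs, alphabet="ACGT"):
    """Count each base in each aligned column."""
    return {b: [col.count(b) for col in zip(*seqs)] for b in alphabet}

def position_weight_matrix(pfm, n, background=0.25):
    """log2 odds per cell; the sqrt(n) pseudocount is an assumption."""
    pseudo = math.sqrt(n)
    return {
        b: [math.log2(((c + pseudo * background) / (n + pseudo)) / background)
            for c in counts]
        for b, counts in pfm.items()
    }

def score(pwm, seq):
    """Sum the weight of the observed base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(seq))

pfm = position_frequency_matrix(SIGNAL_SEQS)
pwm = position_weight_matrix(pfm, n=len(SIGNAL_SEQS))
new_score = score(pwm, "TTGCATAAGTAGTC")
```

Each PFM column sums to the number of signal sequences, which is a quick sanity check on any hand-built matrix. The pseudocount keeps the weight finite for bases never observed at a position.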
Sequence Logo & Information content
Motifs in Biological Sequences
[Figure: sequences 1 to K, each with a motif window; labels (R,l) and a 0.0 to 1.0 scale]
A = (a_1, …, a_K) – positions of the windows
Θ = (θ_{1,A}, …, θ_{w,T}) – probability of different bases in the window
θ_0 = (θ_A, …, θ_T) – background frequencies of nucleotides
Priors: A has uniform prior
θ_j has Dirichlet(N_0) prior – base frequency in genome. N_0 is pseudocounts
Natural Extensions to Basic Model
Correlated in Nucleotide Occurrence in Motif: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 6, 909-916.
Regulatory Modules: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84
[Figure labels: M1, M2, M3, Start, Stop, Gene A, Gene B]
Insertion-Deletion
BALSA: Bayesian algorithm for local sequence alignment Nucl. Acids Res., 30 1268-77.
[Figure labels: sequences 1 to K; windows w1, w2, w3, w4]