STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology
March 28, 2012
Daniel Fernandez
Alejandro Quiroz
1st ACT
• Information theory correction
• Motif Finding
• The Genome Browser
• Homework help Q1, Q2
INTERLUDE
• Electronic music with DJ Cistrome (10 min)
2nd ACT
• Dah Cistrome
• MA2C
• Homework help Q3
Information Theory
The amount of information transmitted through a channel equals the entropy (or uncertainty) associated with the source.
Entropy is maximized when the source can produce n possible outcomes, all with equal probability (1/n). In that case, the entropy is log2(n) bits.
Biologists borrowed this concept to characterize the amount of uncertainty associated with a motif, represented as a PWM. But your TF got confused… see why!
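As a quick check of the log2(n) claim, here is a minimal Python sketch of Shannon entropy for a single PWM column. The function name and the example columns are illustrative assumptions, not part of the course material.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical PWM columns, probabilities ordered A, C, G, T.
uniform_column = [0.25, 0.25, 0.25, 0.25]   # no base is preferred
conserved_column = [1.0, 0.0, 0.0, 0.0]     # always A

# Four equally likely bases give the maximum, log2(4) bits.
print(shannon_entropy(uniform_column))
# A perfectly conserved position carries no uncertainty.
print(abs(shannon_entropy(conserved_column)))
```

A fully conserved column has zero entropy, which is why entropy alone is an awkward measure of how "informative" a motif position is.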
[Figure: information and entropy. A source sends symbols (A, T, C, G) through a channel to a destination.]
But what happens when we want to compare the uncertainty between two sources? Or compare two probability distributions, i.e., the background sequence PWM and the motif PWM?
RELATIVE ENTROPY, or KULLBACK-LEIBLER DIVERGENCE, or INFORMATION CONTENT
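The comparison between a motif PWM and a background distribution can be sketched as follows. This is a minimal illustration that assumes the background probability is nonzero wherever the motif probability is nonzero; the function names and the toy PWM are made up for the example.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_content(pwm, background):
    """Sum of the per-column relative entropies of a PWM against a background."""
    return sum(relative_entropy(column, background) for column in pwm)

# Hypothetical data, probabilities ordered A, C, G, T.
background = [0.25, 0.25, 0.25, 0.25]
toy_pwm = [
    [1.0, 0.0, 0.0, 0.0],      # fully conserved position
    [0.25, 0.25, 0.25, 0.25],  # looks exactly like the background
    [0.5, 0.5, 0.0, 0.0],      # A or C with equal probability
]

for column in toy_pwm:
    print(relative_entropy(column, background))
print(information_content(toy_pwm, background))
```

With a uniform background, the relative entropy of a column equals 2 minus its entropy, so a conserved position scores high and a background-like position scores zero.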
Motif Example I: Prokaryotic Co-expression
Objective. Find the binding sites that control the regulation of co-expressed genes in Mycobacterium tuberculosis.
File. mt.fasta
Note. We assume that genes are co-expressed because they are under the control of the same transcription factor(s), and we use Gibbs sampling to try to identify the putative binding motif for the factor(s).
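To make the idea of Gibbs sampling for motif finding concrete, here is a deliberately simplified site sampler. It is a teaching sketch, not the algorithm implemented in any particular tool: it assumes exactly one site per sequence, uppercase A/C/G/T input, no background model, and no fragmentation. All names and the toy sequences are hypothetical.

```python
import random

BASES = "ACGT"

def build_profile(sites, pseudocount=1.0):
    """Column-wise base probabilities from aligned sites, with pseudocounts."""
    profile = []
    for col in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in BASES})
    return profile

def site_probability(site, profile):
    """Probability of a candidate site under the current profile."""
    prob = 1.0
    for col, base in enumerate(site):
        prob *= profile[col][base]
    return prob

def gibbs_sample(seqs, width, iterations=500, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        # Hold one sequence out and build the profile from the others.
        i = rng.randrange(len(seqs))
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
        profile = build_profile(others)
        # Resample the held-out start in proportion to its profile probability.
        weights = [site_probability(seqs[i][p:p + width], profile)
                   for p in range(len(seqs[i]) - width + 1)]
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Made-up sequences, each containing the planted word TTGACA.
toy_seqs = ["ACGTTTGACAGGCT", "TTGACACCGTAAGC", "GGCATTGACATTCA", "CCTTGACAGTGACG"]
starts = gibbs_sample(toy_seqs, width=6)
for seq, start in zip(toy_seqs, starts):
    print(start, seq[start:start + 6])
```

Because the sampler is stochastic, different seeds can settle on different sites; real tools typically run several restarts and score the resulting motifs.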
Motif parameters are designed to capture the features of binding sites for a classic bacterial helix-turn-helix (HTH) type transcription factor.
HTH-type TFs are typically symmetric homodimers, so they bind to symmetric (palindromic) DNA binding sites.
Furthermore, the two HTH regions of the dimeric TF typically contact bases in two adjacent major grooves of the DNA, so the two halves of the palindromic binding site span well over 10 bases (the approximate number of bases per helical turn of B-form DNA).
The bases contacted by a TF are not necessarily contiguous, so we use fragmentation to allow the Gibbs sampler to ignore positions that do not participate in the protein-DNA interaction and are therefore not conserved as part of the binding site.
To see an example of this, go to http://melolab.org/pdidb/web/content/home and search for 1lmb.
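"Palindromic" here means that a site reads the same as its reverse complement. A small sketch of that check follows; the function names and example sites are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def palindrome_mismatches(site):
    """Number of positions where a site differs from its reverse complement."""
    return sum(a != b for a, b in zip(site, reverse_complement(site)))

# Hypothetical sites: a perfect palindrome and a near-palindrome.
print(reverse_complement("GAATTC"))
print(palindrome_mismatches("GAATTC"))
print(palindrome_mismatches("GAATTG"))
```

Real binding sites are rarely perfect palindromes, which is why counting mismatches is more useful than a strict yes/no test.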
• BioProspector: http://ai.stanford.edu/~xsliu/BioProspector/
• WebLogo: http://weblogo.berkeley.edu/logo.cgi
DNA as Hereditary Material
Central Dogma of Molecular Biology
Gene Expression
Splicing
The Human Genome Project
• The goal is to understand the human genome and its role in health and disease.
– “The true payoff from the HGP will be the ability to better diagnose, treat and prevent disease”
• Francis Collins, Director of the HGP and NHGRI
Sequencing
• Thousands of researchers from 20 centers worked on the HGP
Assembly
• The sequence existed as millions of clones of small fragments
• Finding overlaps and putting together “contigs” was a huge challenge
Annotation
• What does it all mean?
• Where are the genes?
• What do they do?
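The overlap-finding step mentioned under Assembly can be illustrated with a toy suffix-prefix overlap function. This is only a sketch on exact matches with made-up fragments; real assemblers must cope with sequencing errors, repeats, and both strands.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    start = 0
    while True:
        # Look for the first min_length bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Accept only if the rest of a from that point is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def merge(a, b, min_length=3):
    """Join two fragments into a contig using their overlap."""
    return a + b[overlap(a, b, min_length):]

# Hypothetical fragments sharing the bases CGT.
print(overlap("TTACGT", "CGTGTGC"))
print(merge("TTACGT", "CGTGTGC"))
```

The `min_length` parameter guards against merging fragments on short overlaps that occur by chance.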
UCSC Genome browser
• http://genome.ucsc.edu/
Basic Features
• Species, assemblies
• Genome browser
• Gene sorter
• Sequence search (BLAT)
Advanced Features
• Coordinate conversion
• Custom tracks
• Table Browser
UCSC Genome Browser
• Consists of a suite of tools for viewing and mining genomic data.
Organization of Genomic Data
Genome Gateway: start page, basic search
Overview of the browser
The browser
Genome Gateway: start page, basic search
• Genome version
• Chromosome/region
• Gene
• Cytogenetic coordinates
• Phenotype of interest
• Key words: zinc fingers, kinase
Try the following example: Autism
• How many UCSC genes are located on chromosome X?
• How many RefSeq genes are associated with Autism?
Pick the gene: AUTS2 (uc011keg.1) at chr7:70231248-70257884
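Position strings like the one above are easy to handle programmatically. The sketch below parses one and computes the region length, assuming the browser's position box uses 1-based, inclusive coordinates; the function names are hypothetical.

```python
import re

def parse_position(position):
    """Split 'chrN:start-end' into (chrom, start, end); commas in numbers are allowed."""
    match = re.fullmatch(r"(chr\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"not a genomic position: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return chrom, start, end

def region_length(position):
    """Length in bases, assuming 1-based inclusive coordinates."""
    _, start, end = parse_position(position)
    return end - start + 1

chrom, start, end = parse_position("chr7:70231248-70257884")
print(chrom, start, end)
print(region_length("chr7:70231248-70257884"))
```

Note that files downloaded in BED format use 0-based, half-open coordinates instead, so the `+ 1` would not apply there.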
[Figure: browser view with callouts for base position, gene annotation, and tracks ("Tracks! Where we obtain information").]
UCSC Table Browser
• Retrieve the data associated with a track in text format
– To calculate intersections between tracks
– To retrieve the DNA sequence covered by a track
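What "intersection between tracks" means can be shown on a few intervals. This is a toy sketch, not how the Table Browser is implemented: it assumes BED-style (chrom, start, end) tuples with half-open coordinates, and the intervals are made up.

```python
def intersect(track_a, track_b):
    """Return the overlapping parts of two lists of (chrom, start, end) intervals."""
    hits = []
    for chrom_a, start_a, end_a in track_a:
        for chrom_b, start_b, end_b in track_b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # a non-empty overlap
                hits.append((chrom_a, start, end))
    return hits

# Hypothetical tracks, e.g. gene regions and binding regions.
genes = [("chr1", 100, 200), ("chr1", 300, 400)]
peaks = [("chr1", 150, 350), ("chr2", 100, 200)]
print(intersect(genes, peaks))
```

The nested loop is fine for a handful of intervals; genome-scale tracks call for sorted sweeps or interval trees.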
Homework help Q2
• How many RefSeq genes have more than 15 exons in human chromosome 1?
• How many genes on chromosome 22, on the positive strand, are associated with a disease in the OMIM database?
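Once a table has been downloaded as tab-separated text, a question like the first one can also be answered offline. The sketch below assumes the export has a header row with columns named `chrom`, `exonCount`, and `name2` (gene symbol); check these against your actual download. The sample rows are invented and do not describe real genes.

```python
import csv
import io

# Made-up rows in an assumed tab-separated layout with a header line.
SAMPLE = """name\tchrom\tstrand\texonCount\tname2
NM_000001\tchr1\t+\t16\tGENEA
NM_000002\tchr1\t+\t18\tGENEA
NM_000003\tchr1\t-\t4\tGENEB
NM_000004\tchr22\t+\t20\tGENEC
NM_000005\tchr1\t-\t16\tGENED
"""

def count_exon_rich(text, chrom, min_exons):
    """Count transcripts and distinct genes on chrom with more than min_exons exons."""
    transcripts = 0
    genes = set()
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if row["chrom"] == chrom and int(row["exonCount"]) > min_exons:
            transcripts += 1
            genes.add(row["name2"])
    return transcripts, len(genes)

n_transcripts, n_genes = count_exon_rich(SAMPLE, "chr1", 15)
print(n_transcripts, n_genes)
```

The two counts differ whenever a gene has several transcripts, so decide which one the question is asking for before reporting a number.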
The Cistrome: Understanding Genetic Regulation
• CisTrOme stands for Cis-acting regulatory elements searched across (Trans) the whole genOme.
– Visit and register at http://cistrome.org/
• The objective is to map the binding regions of a transcription factor across (trans) the genome in order to understand the regulatory mechanisms of gene expression in the chromosome where the gene is located (cis).
Types of Data and Peak-Calling Methods
• ChIP-chip data (ChIP on chip)
– Affymetrix one-color arrays: MAT (Model-based Analysis for Tiling arrays)
– NimbleGen two-color arrays: MA2C (Model-based Analysis for 2-Color arrays)
• ChIP-seq data (ChIP and NGS)
– Sequencing data (Illumina, Roche 454): MACS (Model-based Analysis for ChIP-Seq)
MA2C: Homework help Q3
Model-based Analysis for 2-Color arrays
• http://liulab.dfci.harvard.edu/MA2C/MA2C.htm
• Installation. You need Java Runtime Environment (JRE) 5.0 or higher. You can download it from http://java.sun.com
• Download MA2C.zip and uncompress it.
– Windows: open MA2C\dist\MA2C.bat
– From a terminal: go to MA2C/dist/ and run the command java -Xmx600m -jar MA2C.jar (or just double-click MA2C.jar)
MA2C: Data Normalization
• Download the data from the homework – SDC3 zip file
• Uncompress it and open MA2C
• Upload the SampleKeyIVtoX.txt to the sample key
• Select your control group (IP channel)
• Go to normalization tab and normalize your data – default parameters are ok.
MA2C: Peak Finding
• Go to the peak-detection tab.
• Change the parameters accordingly.
• Select find peaks.
• Voilà! The results have been written to the MA2C_output folder.
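To build intuition for what a window-based peak finder does with normalized probe scores, here is a generic sketch. It is not MA2C's actual algorithm or parameter set: it simply flags probes whose windowed median score passes a threshold and merges neighboring flagged probes. The function name, parameters, and probe values are all hypothetical.

```python
from statistics import median

def call_peaks(probes, window=200, threshold=2.0):
    """probes: position-sorted list of (position, score). Returns (start, end) peaks."""
    flagged = []
    for pos, _ in probes:
        # Median score of all probes within half a window of this probe.
        nearby = [score for p, score in probes if abs(p - pos) <= window // 2]
        if median(nearby) >= threshold:
            flagged.append(pos)
    peaks = []
    for pos in flagged:
        # Extend the current peak if this probe is close enough, else start a new one.
        if peaks and pos - peaks[-1][1] <= window:
            peaks[-1][1] = pos
        else:
            peaks.append([pos, pos])
    return [tuple(peak) for peak in peaks]

# Made-up normalized scores for probes spaced 100 bp apart.
scores = [0.1, 0.2, 0.0, 3.0, 3.5, 4.0, 3.2, 0.1, 0.0, 0.2]
toy_probes = [(i * 100, s) for i, s in enumerate(scores)]
print(call_peaks(toy_probes))
```

Using the median of a window rather than single probe values makes the call robust to one noisy probe, which is the general motivation behind window-based scores.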