cse182-l16 - university of california, san diego · cse182-l16 non-coding rna. biol. data analysis:...
TRANSCRIPT
CSE182-L16
Non-coding RNA
Biol. Data analysis: Review
Protein SequenceAnalysis
Sequence Analysis
Gene Finding
Assembly
Much other analysis is possible
Protein SequenceAnalysis
Sequence Analysis
Gene Finding
Assembly
ncRNA
GenomicAnalysis/Pop. Genetics
A Static picture of the cell is insufficient
• Each Cell is continuouslyactive,– Genes are being
transcribed into RNA– RNA is translated into
proteins– Proteins are PT
modified andtransported
– Proteins performvarious cellularfunctions
• Can we probe the Celldynamically
GeneRegulation Proteomic
profilingTranscript profiling
ncRNA gene finding
• Gene is transcribed but not translated.• What are the clues to non-coding genes?
– Look for signals selecting start of transcription andtranslation. Non coding genes are transcribed by Pol III
– Non-coding genes have structure. Look for genomicsequences that fold into an RNA structure
• Structure: Given a sequence, what is the structureinto which it can fold with minimum energy?
tRNA structure
RNA structure: Basics
• Key: RNA is single-stranded. Think of a string over 4letters, AC,G, and U.• The complementary bases form pairs.• Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.
RNA structure: pseudoknots
Sometimes, unpaired bases in loops form‘crossing pairs’. These are pseudoknots
RNA structure prediction
• Any set of non-crossing base-pairsdefines a secondary structure.
• Abstract Question:– Given an RNA string find a structure that
maximizes the number of non-crossing base-pairs
– Incorporate the true energetics of folding– Incorporate Pseudo-knots
A combinatorial problem
• Input:• A string over A,C,G,U• A pairs with U, C pairs with G
• Output:• A subset of possible base-pairs of maximum
size such that• No two base-pairs intersect
• How can we compute this set efficiently?
RNA structure
1. Nussinov’s algorithm1. Score B for every base-pair. No penalty for loops. No pesudo-knots.2. Let W(i,j) be the score of the best structure of the subsequence
from i to j.
for i = n down to 1 {for j = i+1 to n {
}}
†
W (i, j) = max
B(ri,rj ) + W (i +1, j -1),W (i, j -1),W(i +1, j)
W (i,k) + W (k +1, j) i £ k < j
Ï
Ì
Ô Ô
Ó
Ô Ô
Obtaining RNA structure
for i = n downto 1 {for j = i+1 to n {
}}
†
W (i, j) = max
B(ri,rj ) + W (i +1, j -1),W (i, j -1),W(i +1, j)
W(i,k) +W(k +1, j)
(1)(2)(3)(4)
Ï
Ì Ô Ô
Ó Ô Ô
†
if (1) { S(i, j) = / else if (2) S(i, j) = | else if(3) S(i, j) = - else S(i, j) = k }
Obtaining RNA StructureProcedure print_RNA(i,j) {
if S(i,j) = / { print “(i,j)”;
print_RNA(i+1,j-1); else if (S(i,j) = -) {
print_RNA(i+1,j);} else if (S(i,j) = |) {
print_RNA(i,j-1);} else { k=S(i,j) print_RNA(i,k); print_RNA(k+1,j);}
}
RNA structure: example
011231122
01111
0
i 1 2 3 4 5 6j
3
4
5
6
A C G A U UA C G A U U 1 2 3 4 5 6 1 2 3 4 5 6
2
RNA Structure: Details
Base-pairing & Loops• Base-pairs arise from complementary nucleotides• Single-stranded• Stack is when 2 base-pairs are contiguous• Loops arise when there are unpaired bases.• They are characterized by the number of base-pairs that close it.
• Hairpin: closed by 1 base-pair• Bulge/Interior Loops (2 base-pairs)• Multiple Internal loops (k base-pairs)
Scoring Loops, multi-loops
• Zuker-Turner Energy Rules• http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html
• Stacking Energies• Energy for Bulges and Interior Loops• Energy for Multi-loops
Other tricks for obtaining structure
• Alignment and Covariance
RNA: unsolved problems
• The structure problem is still unsolved.– De novo prediction does not work as well.– Co-variance models require prior alignment.
• Many undiscovered non-coding genes– miRNA, and others have only just been discovered.– Very hard to detect signal for these genes– Random sequence folds into low energy structures.
Other ncRNA: miRNA
• ncRNA ~22 nt in length• Pairs to sites within the 3’ UTR,
specifying translational repression.• Similar to siRNA (involved in RNAi)• Unlike siRNA, miRNA do not need
perfect base complementarity• Until recently, no computational
techniques to predict miRNA• Most predictions based on cloning
small RNAs from size fractionatedsamples
Gene Regulation
Gene expression
• The expression oftranscripts and protein inthe cell is not static. Itchanges in response tosignals.
• The expression can bemeasured using micro-arrays.
• What causes the changein expression?
Transcriptional machinery
• DNA polymerase (II) scans the genome, initiatingtranscription, and terminating it.
• The same machinery is used for every gene, so while Pol IIis required, it is not sufficient to confer specificity
TF binding
• Other transcriptionfactors interact with thecore machinery andupstream DNA toprovide specificity.
• TFs bind to TF bindingsites which are clusteredin upstream enhancer andpromoter elements.
• The enhancer elementsmay be located many kbupstream of the core-promoter
Upstream elements
Transcription factors
TF binding sites• TF binding sites are
weak signal (about 10bp with 5bpconserved)
• If two genes are co-regulated, they arelikely to share bindingsites
• Discovery of bindingsite motifs is animportant researchproblem.
TGAGGAGTCAGGAG
TCAGGTGTGAGGTGTCAGGTG
g1
g2
g3
g4
g5
http://www.gene-regulation.com/pub/databases.html#transfac
Discovering TF binding sites
• Identification of these TF bindingsites/switches is critical.
• Requires identification of co-regulatedgenes (genes containing the same set ofswitches).
• How do we find co-regulated genes?
Idea1: Use orthologous genes from differentspecies
ACGGCAGCTCGCCGCCGCGC||||| || ||||||| ||ACGGC-GGGCGCCGCCCCGC
ACGGCAGCTCGCCGCCGC-C| || | ||||||| | AGTGC-GGGCGCCGCCTCAT
ACGGC-GC-TCGCCGCCGCGC| | | || | | AT-ACGAAGTAGCGG-ATGGT
1. The species are too close (EX:humans and chimps). Binding& non-binding sites are bothconserved.
2. The species are distant. Bindingsites are conserved but notother sequence.
3. The species are very distant.Even binding sites are notconerved. The genes havealternative regulators.
Idea2: Measure expression of genes
• Northern Blot:– Quantitative
expression of afew genes
Microarray
• Expression level of all genes
Protein Expression using MS
Pathways
• Proteins interact totransduce signal,catalyze reactions, etc.
• The interactions can becaptured in a database.
• Queries on thisdatabase are aboutlooking for interestingsub-graphs in a largegraph.
Biological databases in NAR
• http://www3.oup.co.uk/nar/database/c• 548 databases in various categories
RfamGenbank
SwissProt
Stanford microarray db
PDBKegg
dbSNP/OMIM/seattleSNPs
SWISS 2D-page
Summary
• Biological databases cannot beunderstood without understandingthe data, and the tools forquerying and accessing these data.
• While database technology (XML,Relational OO databases, textformats) is used to store this data,its use is (often) transparent forBioinformatics people.
• In this course, we looked at variousdata-streams, and pointed todatabases that store these data-streams
• Nucleic Acids Research brings outa database issue every January
2004: 548 databases