bioinformatic approaches to functionally characterise rnas
TRANSCRIPT
Research program
Bioinformatic approaches to functionally characterise RNAs
Paul Gardner
June 2, 2011
Paul Gardner Bioinformatic approaches to functionally characterise RNAs
The data deluge and bioinformatics
I There is a deluge of data being generated by new techniques in sequencing and structure determination
I Bioinformatic analysis is the only way to analyse and annotate this volume of data, and it now drives many biological discoveries
[Figure: three growth plots.
The growth of GenBank: number of nucleotides (billions) by year, 1985 to 2005.
The growth of UniProt: number of proteins (millions) by year, 1998 to 2010.
The growth of PDB: number of structures (thousands) by year, 1980 to 2010.]
Bioinformatics is important
Table 1: The most cited articles for each OECD country.
Citations  Country  Reference
32399  de      Thompson et al. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment... NAR
30616  us      Altschul et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. NAR
20099  ch      Towbin et al. (1979) Electrophoretic transfer of proteins from polyacrylamide gels to nitrocellulose sheets... PNAS
18479  fr      Thompson et al. (1997) The CLUSTAL X windows interface... NAR
17011  uk      Bland & Altman (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet
16451  jp      Iijima (1991) Helical microtubules of graphitic carbon. Nature
9773   at      Kresse & Furthmuller (1996) Efficient iterative schemes for ab initio total-energy calculations... Physical Review B
8593   ie, il  Lander et al. (2001) Initial sequencing and analysis of the human genome. Nature
8113   no      Pedersen (1994) Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease... Lancet
7976   pt      Perdew et al. (1992) Atoms, molecules, solids, and surfaces: Applications of the generalized... Physical Review B
7202   it      Berendsen et al. (1984) Molecular dynamics with coupling to an external bath. The Journal of Chemical Physics
6972   be      Murshudov et al. (1997) Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr D.
6902   se      Huelsenbeck & Ronquist (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics
5850   pl      Cornell et al. (1995) A second generation force field for the simulation of proteins, nucleic acids... J Am Chem Soc
4681   fi      Simons & Ikonen (1997) Functional rafts in cell membranes. Nature
4588   au, es  Perlmutter et al. (1999) Measurements of Ω and Λ from 42 high-redshift Supernovae. Astrophysical Journal
3950   dk      Nielsen et al. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of... Protein Eng.
3805   nz      Ihaka & Gentleman (1996) R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics
3351   kr, mx  Eidelman et al. (2004) Review of particle physics. Physics Letters...
3244   si      Wilk et al. (2001) High-κ gate dielectrics: Current status and materials properties considerations. Journal of Applied Physics
2953   gr      Polymeropoulos et al. (1997) Mutation in the α-synuclein gene identified in families with Parkinson’s disease. Science
2881   sk      Miertus et al. (1981) Electrostatic interaction of a solute with a continuum... Chemical Physics
2393   hu      Morice et al. (2002) A randomized comparison of a sirolimus-eluting stent... New England Journal of Medicine
1977   is      Sever et al. (2003) Prevention of coronary and stroke events with atorvastatin... Lancet
1790   ca      Sedlak & Lindsay (1968) Estimation of total, protein-bound, and nonprotein sulfhydryl groups... Analytical Biochemistry
1366   tr      Ozgur et al. (2005) A comprehensive review of ZnO materials and devices. Journal of Applied Physics
291    lu      James et al. (2006) Reconstructing the early evolution of Fungi using a six-gene phylogeny. Nature
*This data was collected from Scopus in April 2011. NB. The US data is the union of multiple searches.
“All science is either physics or stamp collecting” – Ernest Rutherford
Stamp collectors
Rfam
I My aim is to build a periodic table for classifying RNA, enabling researchers to predict function.
[Figure: Rfam families arranged as a "periodic table", with log-scaled axes labelled "Sequences" and "Families". Families are grouped into Cis-reg., Gene, snRNA, snoRNA and Intron, and labelled by type: tRNA, splicing, thermoregulator, leader, HACA-box snoRNA, scaRNA, Intron, IRES, frameshift element, sRNA, riboswitch, antisense, rRNA, miRNA, CRISPR, ribozyme, CD-box snoRNA. Consensus secondary-structure diagrams with IUPAC consensus sequences (5’ to 3’) are shown for representative families.]
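The consensus sequences in the structure diagrams are written with IUPAC nucleotide codes (for example R for A or G, Y for C or U). As a minimal sketch of how such a consensus string can be derived from alignment columns, assuming an ungapped-or-gapped RNA alignment given as equal-length strings: the function name `iupac_consensus` and the toy alignment are illustrative assumptions, and this is not necessarily how the diagrams on the slide were produced.

```python
# IUPAC ambiguity codes, keyed by the set of nucleotides seen in a column.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("U"): "U",
    frozenset("AG"): "R", frozenset("CU"): "Y",
    frozenset("CG"): "S", frozenset("AU"): "W",
    frozenset("GU"): "K", frozenset("AC"): "M",
    frozenset("CGU"): "B", frozenset("AGU"): "D",
    frozenset("ACU"): "H", frozenset("ACG"): "V",
    frozenset("ACGU"): "N",
}


def iupac_consensus(alignment):
    """Return one IUPAC symbol per alignment column ('-' for gap-only columns)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*alignment):
        # Ignore gaps and any non-ACGU characters when collecting the bases.
        bases = frozenset(b for b in column if b in "ACGU")
        consensus.append(IUPAC[bases] if bases else "-")
    return "".join(consensus)


# Hypothetical three-sequence alignment, for illustration only.
toy_alignment = ["GACGU", "GGCCU", "GAUGU"]
print(iupac_consensus(toy_alignment))
```

A real consensus would usually apply a frequency threshold per column rather than encoding every observed base, so that a single outlier does not turn a conserved G into an R.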
Rfam: families of ncRNAs
[Diagram: the Rfam pipeline]
I SEED alignments (curated): 1,446 families, 27,045 regions
I RFAMSEQ 10.0: 169,604,735,232 nucs, 55,655,739 seqs
I Infernal: WU-BLAST, cmsearch, cmalign
I FULL alignments: 3,192,596 regions
I DESC
I Genome annotation: DAS, GFF, ENSEMBL, UCSC, ncRNA.org
I Benchmarks and training
I RNA Biology
I Percentages shown in the diagram: 40%, 88%, 5%, -1%, 178%
http://rfam.sanger.ac.uk
http://rfam.janelia.org
Why RNA?
I One of the major scientific realisations of this century is the importance of RNAs in genetic regulation, e.g.
I The Nobel Prize awarded to Andrew Fire and Craig Mello for discovering RNAi
I RNA is also involved in chromosome deactivation, initiation of DNA replication, transposon suppression, environmental sensors, ...
[Figure: Non-protein-coding RNA related publications. Number of publications ('000s) by year, 1950 to 2010.]
RNA and disease: progressive hearing loss
[Figure: consensus secondary structure of the miR-96 precursor (5’ to 3’), coloured by sequence conservation on a scale of 0 to 1.]
Mencía et al. (2009) Mutations in the seed region of human miR-96 are responsible for nonsyndromic progressive hearing loss. Nat. Genet.
Lewis et al. (2009) An ENU-induced mutation of miR-96 associated with progressive hearing loss in mice. Nat. Genet.
RNA and disease: Prader-Willi syndrome
[Figure: consensus secondary structure of a snoRNA family (5’ to 3’), coloured by sequence conservation on a scale of 0 to 1.]
Cavaillé et al. (2000) Identification of brain-specific and imprinted small nucleolar RNA genes exhibiting an unusual genomic organization. PNAS
Skryabin et al. (2007) Deletion of the MBII-85 snoRNA gene cluster in mice results in postnatal growth retardation. PLoS Genet.
Ding et al. (2008) SnoRNA Snord116 (Pwcr1/MBII-85) Deletion Causes Growth Deficiency and Hyperphagia in Mice. PLoS ONE
Sahoo et al. (2008) Prader-Willi phenotype caused by paternal deficiency for the HBII-85 C/D box small nucleolar RNA cluster. Nat Genet
de Smith et al. (2009) A Deletion of the HBII-85 Class of Small Nucleolar RNAs (snoRNAs) is Associated with Hyperphagia, Obesity and Hypogonadism. Hum. Mol. Genet.
RNA important in agriculture
I The Texel sheep, myostatin and miR-1
[Figure: consensus secondary structure of the miR-1 precursor (5’ to 3’), coloured by sequence conservation on a scale of 0 to 1.]
Clop et al. (2006) A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep. Nat. Genet.
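The idea of a mutation "creating a potential illegitimate microRNA target site" can be illustrated with a simple seed-match search. The sketch below assumes the common convention that a target site is a perfect match to the reverse complement of miRNA nucleotides 2 to 8. The miR-1 sequence is quoted from memory and should be checked against miRBase, and both UTR fragments are invented toy sequences, not the sheep myostatin 3' UTR.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna):
    """Reverse complement of miRNA nucleotides 2-8: the motif expected in the mRNA."""
    seed = mirna[1:8]
    return seed.translate(COMPLEMENT)[::-1]


def find_sites(utr, mirna):
    """Return 0-based start positions of perfect 7-mer seed matches in the UTR."""
    site = seed_site(mirna)
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


MIR1 = "UGGAAUGUAAAGAAGUAUGUAU"   # assumed miR-1 sequence; verify against miRBase
wild_type = "CCUUGACGUUCCAUAA"    # hypothetical UTR fragment
mutant = "CCUUGACAUUCCAUAA"       # same fragment with a single G-to-A change

print(seed_site(MIR1))
print(find_sites(wild_type, MIR1))
print(find_sites(mutant, MIR1))
```

In this toy case the wild-type fragment has no seed match and the single substitution creates one, which is the kind of gain-of-target-site event the citation describes. Real target prediction also considers site context and conservation.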
ncRNAs and human health
I Genetic diseases
  I RNase MRP variation and cartilage-hair hypoplasia
  I Mitochondrial tRNA variation: Leigh syndrome, MELAS syndrome, MERRF syndrome, cardiomyopathy, ophthalmoplegia, ...
I Cancer
  I Y RNA
  I Telomerase RNA
  I microRNAs
I Alzheimer’s disease
  I BACE1-AS, 38A
I Viral infection
  I Human miRNAs required for infection (eg. miR-122 and HCV)
  I Viral miRNAs required for infection (eg. HIV TAR miRNA)
I Many structured regulatory elements (eg. IRE, IRES)
RMfam: a motif alignment library
I Tetra loops
I T-loop
I K-turns
I Intrinsic terminators
I Group II intron domain V/U6 stem-loop
I Shine-Dalgarno
I Sm binding site
I · · ·
[Figure: consensus structure diagrams (5´ to 3´, IUPAC codes) for the motifs listed above.]
Bacterial intrinsic terminators
[Figure: intrinsic terminator consensus motif (a hairpin followed by a U-rich tract), with two bar-chart panels, "Native genomes" and "Permuted genomes". Each panel shows % RIT genes for RNIE (genome), RNIE (gene) and TransTermHP across B. subtilis, B. thetaiotaomicron, E. coli, D. radiodurans, S. enterica, C. difficile, U. parvum, F. nucleatum, C. pneumoniae, S. griseus, L. interrogans, F. nodosum, P. marinus, T. yellowstonii, M. infernorum, H. pylori and M. tuberculosis.]
Gardner et al. (2011) RNIE: genome-wide prediction of bacterial intrinsic terminators. NAR.
TRIT: A Mycobacterium specific terminator motif
[Figure, panels A to E: "TRIT proximity to genic features" (frequency against distance to nearest gene terminus, in nucs); the TRIT consensus secondary structure; WebLogo 3.0 sequence logos of the motif (bits); and "Distribution of P-values" (frequency against P-value, MFE.native vs MFE.shuffled).]
Gardner et al. (2011) RNIE: genome-wide prediction of bacterial intrinsic terminators. NAR.
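The "Distribution of P-values" panel compares the folding energy of native sequences with that of shuffled versions (MFE.native vs MFE.shuffled). The sketch below shows the general shape of such a shuffle test. It is not the method used for the slide: as a loudly labelled stand-in for a thermodynamic MFE calculation it maximises the number of nested base pairs (a Nussinov-style dynamic programme), and the function names, shuffle count and mononucleotide shuffling are all assumptions.

```python
import random

# Watson-Crick and G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (stand-in for a real MFE calculation)."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case: position i is unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in PAIRS:  # case: i pairs with k
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]


def shuffle_p_value(seq, n_shuffles=200, seed=0):
    """Empirical P-value: fraction of shuffles folding at least as well as native."""
    rng = random.Random(seed)
    native = max_pairs(seq)
    letters = list(seq)
    as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # mononucleotide shuffle preserves base composition
        if max_pairs("".join(letters)) >= native:
            as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (as_good + 1) / (n_shuffles + 1)


print(max_pairs("GGGAAACCC"))
print(shuffle_p_value("GGGGAAAACCCC", n_shuffles=50))
```

With a real folding program the comparison would be on free energies (lower is better) rather than pair counts, and a dinucleotide-preserving shuffle is often preferred as the null model.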
RNA derived pseudogenes
I There are millions of RNA-derived pseudogenes in the human genome, and it is very difficult to discriminate these from functional copies
I We have derived a metric that discriminates between RNA pseudogenes and functional RNAs
[Figure: SRP. Histogram of frequency against CM−HMM (bits), from −80 to 120, with series trueCmemit, trueEmblDe, pseudoEmblDe, pseudoHmmemit and pseudoRepeatMasker.]
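The axis label "CM−HMM (bits)" suggests that the metric is the difference between a covariance-model bit score and a profile-HMM bit score for the same sequence. Under that reading, a minimal sketch of the calculation is given below. The `Hit` type, the example scores and the threshold of 10 bits are all hypothetical and are not taken from the talk.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    cm_bits: float   # covariance-model score (sequence plus structure)
    hmm_bits: float  # profile-HMM score (sequence only)


def cm_minus_hmm(hit: Hit) -> float:
    """Score difference in bits; larger values mean structure adds more signal."""
    return hit.cm_bits - hit.hmm_bits


def classify(hit: Hit, threshold: float = 10.0) -> str:
    """Label a hit using a hypothetical cut-off on the score difference."""
    if cm_minus_hmm(hit) >= threshold:
        return "functional-like"
    return "pseudogene-like"


# Invented example scores, for illustration only.
hits = [Hit("candidate_1", 85.0, 40.0), Hit("candidate_2", 30.0, 42.0)]
for hit in hits:
    print(hit.name, cm_minus_hmm(hit), classify(hit))
```

Any real cut-off would have to be chosen from the score distributions of known functional RNAs and known pseudogenes, such as those plotted on the slide.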
New sequencing data: RNA-seq, transposon libraries, comparative genomics, ...
[Figure: RNAseq reads and TraDIS reads plotted against genome coordinate (994000 to 994400), with annotated features dsrA, a putative transcript and yodD.]
Perkins et al. (2009) A strand-specific RNA-Seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS Genet.
Langridge et al. (2009) Simultaneous assay of every Salmonella Typhi gene using one million transposon mutants. Genome Res.
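Plots of this kind show read depth per genome coordinate. As a minimal sketch of how such a track can be computed from mapped read intervals, assuming half-open `(start, end)` coordinates: the function name `coverage` and the example reads are invented for illustration and are not data from the cited studies.

```python
def coverage(reads, start, end):
    """Per-base read depth over [start, end) from half-open (read_start, read_end) intervals."""
    # Difference array: +1 where a read begins, -1 just past where it ends.
    diff = [0] * (end - start + 1)
    for read_start, read_end in reads:
        lo = max(read_start, start)  # clip the read to the window
        hi = min(read_end, end)
        if lo < hi:
            diff[lo - start] += 1
            diff[hi - start] -= 1
    depth = []
    running = 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth


# Hypothetical reads near the window shown on the slide.
reads = [(994000, 994003), (994002, 994005), (993000, 993010)]
print(coverage(reads, 994000, 994006))
```

The difference-array approach costs time proportional to the number of reads plus the window length, which matters when the same calculation is run over a whole genome.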
Thanks!
PPG is supported by a Rutherford Discovery Fellowship from Government funding, administered by the Royal Society of New Zealand.