G. Paolella Napoli, 21/2/ 2008 1
S.Co.P.E. Project – WP4
Bioinformatics in the SCOPE project
G. Paolella, M. Petrillo, L. Cozzuto, A. Boccia, C. Cantarella, L. Sepe
Our role within SCOPE
[Diagram: layers of the SCOPE infrastructure]
Hardware: nodes
Middleware: GRID software, high level middleware
Application: SCOPE web site; Astronomy, Chemistry, Physics, Bioinformatics
Tasks
• Provide a large number of users with general-purpose bioinformatic services that take advantage of high-performance hardware, allowing:
– Web access for quick operations, performed by the vast majority of users
– Unix-level access in the form of an integrated problem-solving environment
• Set up an automatic annotation system to be used in specific computational or experimental projects, based on the available services. Two specific applications:
– CST analysis by comparative genomics
– Mining for regulatory RNA within completely sequenced genomes
Bioinfo portal
Available services
Programs
Graphic interface to programs
CAPRI workflow
Various operations in a row: Complement -> Translation -> Isoelectric point of the resulting protein.
[Diagram: DNA -> Complement -> Translation -> Isoelectric point]
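The chain of operations in this workflow can be illustrated with a minimal Python sketch. It is not the CAPRI implementation: the function names are hypothetical, "Complement" is assumed to mean the reverse complement, translation is assumed to use the standard genetic code in frame 1, and the pKa values are illustrative textbook-style figures rather than those of any particular tool.

```python
# Illustrative sketch only; names and pKa values are assumptions.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumed pKa values for charged groups (real tools ship their own tables).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3' (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate frame 1 up to the first stop codon or incomplete codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3].upper()]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def net_charge(protein, ph):
    """Henderson-Hasselbalch estimate of the net charge at a given pH."""
    charge = 1 / (1 + 10 ** (ph - PKA_N_TERM))
    charge -= 1 / (1 + 10 ** (PKA_C_TERM - ph))
    for residue in protein:
        if residue in PKA_POSITIVE:
            charge += 1 / (1 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1 / (1 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(protein):
    """Bisect for the pH at which the estimated net charge is zero."""
    low, high = 0.0, 14.0
    while high - low > 1e-4:
        mid = (low + high) / 2
        if net_charge(protein, mid) > 0:
            low = mid   # still positively charged: the pI is higher
        else:
            high = mid
    return round((low + high) / 2, 2)


def workflow(dna):
    """Run the three steps in a row and return every intermediate result."""
    strand = reverse_complement(dna)
    protein = translate(strand)
    return strand, protein, isoelectric_point(protein)
```

Each step consumes the output of the previous one, which is the point of chaining them in a workflow instead of running three separate programs by hand.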
SRS: the database tool
SRS
Services organization
[Diagram: the WEB SERVER gives access to CAPRI, SRS and PISE; programs: Emboss, Fasta, Blast, other; data: user data, DB, primary remote databases, ENSEMBL]
Peripheral site: Medicine
Hardware attached to the system:
• 112-processor cluster
• Two 8-processor servers, several 2-processor servers
• Storage center (SCOPE)
• Campus GRID and beyond (SCOPE)
[Diagram: Broker, virtual nodes (each with a DB), Grid, and the underlying nodes. Labels: low latency scheduler; high level scheduler; 500 tasks/sec; 20-50 ms delay]
Joining the GRID
Hardware attached to the system:
• 1 Computing Element (CE)
• 5 dual-processor Worker Nodes (WN) (expandable up to 40)
• 1 Storage Element (SE) with 50 GB
• 1 User Interface (UI)
GRID bioinformatic tools
Available at: lfn:/grid/scope/bioinfo/
– programs/ (executables)
– dbs/ (datasets)
Currently installed tools
• Blast
• Randfold
• Infernal package
Support databases
• RFAM
• Blast (human, rat, dog, chicken and macaque genomes)
Tools ready to be installed
• Blastz
• Clustalw
• Dialignt
• Emboss package
• FASTA package
• Genscan
• Hmmer package
• MCL package
• Pcma
• Primer3
• RNAz
• Vienna package
• Multiz-tba
Two examples
Two examples of automatic annotation systems, used to identify and characterize DNA sequences with a possible functional role:
– small sequences conserved between human and other species (CSTs);
– sequences able to encode structured RNAs.
Objectives and approach
• Objective: an automatic sequence annotation system
• Motivation: computational analysis of non-coding sequences allows new functional elements to be identified
• Problem description and its solution: several kinds of predictive tests applied on a large scale to a large amount of data, either extracted from publicly available databases or derived from experiments.
• Need for HPC: given the large number of tests, multiprocessor clusters are generally used. Using the GRID allows the analysis to be extended to even larger data sets.
• Description of the solution in the HPC environment
Identification of CSTs
Identification and characterization, in other species, of nucleotide sequences conserved between human and mouse (CSTs).
[Diagram: H. sapiens and M. musculus sequences, with the CSTs shared between them]
CSTs identified in disease-associated genes: 64,495.
Analysis to be carried out with BLAST against other genomes (rat, dog, monkey, chicken, etc.).
CST annotation
Annotation is carried out through a pipeline that goes through the various phases without requiring human assistance. Tasks requiring intensive CPU usage, such as BLAST homology searches, are spread over several collaborating servers using a system specifically developed for load distribution and monitoring.
[Diagram: CST ANNOTATION. Components: PHP scripts, DB, remote servers]
Data collected for the CSTs, by source:
- ENSEMBL gene and gene structure data: chromosome position; type (e.g. intergenic, intronic, exonic, etc.); coding %; closest gene and relative distances; ...
- UCSC Log Score data: Max L-Score; Avg L-Score; ...
- BLAST: matches with EST, other genomes, proteins (BlastX)
- Repeat Masker: repeats type; repeats %
- CPS: Coding Potential Score
- Redundancy; overlapping; ...
DG-CST
1022 genes related to genetically transmitted disease
KinWeb
500 genes coding for human protein kinases
KinWeb DB
[Figure: panels (a) to (e)]
BLAST
• Executable submitted from a local program repository
• Genomic data libraries stored on the local SE and registered on the central SE scopelfc01.dsf.unina.it:/grid/scope/bioinfo
• Example: BLAST of the 65,597 CSTs against the dog, chicken, monkey and rat genomes.
• Number of jobs submitted: 67
• Input sequence group: 1000 sequences
• Total execution time for the 67 jobs: 4 hours
• Average time per job: 18 minutes (2 spent downloading the dataset).
• CPU time
– Search of 1 sequence in the mouse genome => 5 sec.
– 64,495 sequences => 3.75 days
– 10 genomes => 37.5 days
– MPIBLAST (installed only)
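The CPU-time figures on this slide follow from simple multiplication, sketched below in Python. The function name is hypothetical and the calculation assumes that every search costs the same 5 seconds and that searches run strictly one after another on a single CPU; the results land close to the rounded 3.75 and 37.5 days quoted on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sequential_days(n_sequences, seconds_per_search, n_genomes=1):
    """Days of CPU time if every search runs one after another."""
    total_seconds = n_sequences * seconds_per_search * n_genomes
    return total_seconds / SECONDS_PER_DAY


# Figures taken from the slide: 64,495 CSTs, 5 seconds per search.
one_genome = sequential_days(64_495, 5)
ten_genomes = sequential_days(64_495, 5, n_genomes=10)

print(f"1 genome:   {one_genome:.2f} days")
print(f"10 genomes: {ten_genomes:.1f} days")
```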
Bacterial SLSs
Pae-1 (Pseudomonas aeruginosa)
Eric (Escherichia coli)
Search for secondary structures
Identification and characterization, in bacterial genomes, of families of repeated sequences that share a conserved secondary structure.
Analysis to be carried out with INFERNAL on more than 300 bacterial genomes.
Example
Search for one family in one genome =====> 6 hours.
Search for 50 families in one genome =====> 12.5 days
Search for 50 families in 300 genomes =====> 10 years
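The scaling in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes a constant 6 hours per family per genome. The last function adds an idealized estimate that is not on the slide: it supposes that the family-by-genome searches are fully independent and divide evenly over a number of workers, which gives an optimistic lower bound, not a measurement.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def sequential_search_days(n_families, n_genomes, hours_per_search=6):
    """Days needed if every family/genome search runs one after another."""
    return n_families * n_genomes * hours_per_search / HOURS_PER_DAY


def ideal_parallel_days(total_days, n_workers):
    """Lower bound on wall-clock days, assuming perfect load balancing."""
    return total_days / n_workers


one_genome_days = sequential_search_days(50, 1)
all_genomes_days = sequential_search_days(50, 300)
all_genomes_years = all_genomes_days / DAYS_PER_YEAR

print(f"50 families, 1 genome:    {one_genome_days} days")
print(f"50 families, 300 genomes: {all_genomes_years:.1f} years")
# Hypothetical what-if: the same work spread ideally over 112 processors.
print(f"ideal, 112 workers:       {ideal_parallel_days(all_genomes_days, 112):.1f} days")
```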
Folding of the human genome
Aim: find potential regulatory sequences acting as structured RNAs.
Pilot project: analyses carried out on chromosome 21.
[Diagram labels: DNA, mRNA, Protein, Structured RNA]
Genome plan
Chromosome length: 46,944,323 bp
Transcriptome length: 14,609,025 bp
Potentially transcribed sequences have been split into overlapping fragments of 150 bp.
Fragments: 290,904 sequences
Total length: 43,726,912 bp
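Splitting a sequence into overlapping 150 bp fragments can be sketched as a sliding window. The slide does not state the step between fragments; the 50 bp default below is an assumption, loosely motivated by the total fragment length being about three times the transcriptome length. The function name and the handling of the final window are also illustrative choices, not the project's actual procedure.

```python
def overlapping_fragments(sequence, length=150, step=50):
    """Yield (start, fragment) windows of `length` bp, advancing by `step` bp.

    The final window is shifted back so that it ends on the last base,
    which keeps every fragment at full length (an assumed convention).
    """
    if len(sequence) <= length:
        # Short sequences are returned whole as a single fragment.
        yield 0, sequence
        return
    last_start = len(sequence) - length
    for start in range(0, last_start, step):
        yield start, sequence[start:start + length]
    yield last_start, sequence[last_start:]


# Usage on a toy 400 bp sequence.
fragments = list(overlapping_fragments("ACGT" * 100))
for start, fragment in fragments:
    print(start, len(fragment))
```

Overlap ensures that a structured element lying across a window boundary still falls entirely inside at least one fragment, provided it is shorter than the overlap.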
Results
Length: 46,944,323 bp
Total genes: 392
> miRNA genes: 10
> rRNA genes: 3
> snRNA genes: 7
> snoRNA genes: 8
> miscRNA: 8
Found known RNAs: 9
Transcriptome length: 14,609,025 bp
Potentially transcribed sequences have been split into overlapping fragments of 150 bp.
290,904 sequences
Evaluation of the results obtained
• RANDFOLD
– Program: randfold
– Executable submitted from a local repository of bioinformatics programs
– Input sequence group: 2500 sequences from transcribed regions of chr 21
– Number of jobs submitted: 117
• CPU time required
– Sequences derived from chromosome 21 genes: 291,589
– Prediction on 1 sequence => 45 sec.
– 291,589 sequences => 152 days.
How long?
Nodes   Sequences per node   Seconds      Day(s)
1       1                    45           0
1       291,589              13,121,505   152
117     2,500                112,500      1.3
About 3 days
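The rows of this table can be rederived from three inputs: the number of sequences, the time per prediction, and the batch size per job. The Python sketch below uses a hypothetical function name and assumes a constant 45 seconds per sequence, with all jobs running at the same time on separate nodes. It does not try to model why the run is reported as taking about 3 days in practice.

```python
import math

SECONDS_PER_DAY = 86_400


def grid_estimate(n_sequences, seconds_per_sequence, batch_size):
    """Return (jobs, single-node days, days per job) for a batched run."""
    jobs = math.ceil(n_sequences / batch_size)
    single_node_days = n_sequences * seconds_per_sequence / SECONDS_PER_DAY
    # With one job per node, the wall-clock time is that of one full batch.
    per_job_days = batch_size * seconds_per_sequence / SECONDS_PER_DAY
    return jobs, single_node_days, per_job_days


# Figures taken from the slides: 291,589 sequences, 45 s each, 2,500 per job.
jobs, single_node_days, per_job_days = grid_estimate(291_589, 45, 2_500)

print(f"jobs:        {jobs}")
print(f"single node: {single_node_days:.0f} days")
print(f"per job:     {per_job_days:.1f} days")
```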
Performance
[Plot: time (sec) against proc numbers; series: single node, grid]
Some extra applications
High throughput sequencing
[Diagram: Assemble -> Contigs, Scaffolds -> Annotation (geneA, tRNA, prom, oprA, oprB; gene cluster A)]
Annotation Steps
• Identification of genes and other genetic elements.
• Protein functional annotation.
• Cellular process annotation.

• Identification of ORFs, tRNAs, rRNAs
• Scanning for signals, such as promoters and microRNAs
• Identification of operons and gene clusters
• Comparison with known genomes/proteins
• Identification of orthologs and paralogs
• Characterization of protein domains
• Reconstruction of complete metabolic pathways
• …
• …
Annotation
The image processing system: IPROC
Image processing modules
[Diagram: image in -> iProcStep -> image out. The iProcStep variants (iProcStepImageMagick, iProcStepPHP, iProcStepPerl, commandLine program) reach the ImageMagick, PHP, Perl and command-line packages through adapters]
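The module layout on this slide, a common step interface with adapters in front of different back ends, can be sketched in Python as follows. This is only an illustration of the adapter idea: IPROC itself appears to be built around PHP, Perl and ImageMagick, and every class and function name below is hypothetical. Images are represented as raw bytes to keep the sketch self-contained.

```python
import subprocess
from abc import ABC, abstractmethod


class IProcStep(ABC):
    """Common interface: a step takes an image and returns an image."""

    @abstractmethod
    def apply(self, image: bytes) -> bytes:
        ...


class FunctionStep(IProcStep):
    """Adapter around an in-process function working on bytes."""

    def __init__(self, func):
        self.func = func

    def apply(self, image: bytes) -> bytes:
        return self.func(image)


class CommandLineStep(IProcStep):
    """Adapter around an external program that reads the image on stdin
    and writes the processed image to stdout (an assumed convention)."""

    def __init__(self, argv):
        self.argv = list(argv)

    def apply(self, image: bytes) -> bytes:
        result = subprocess.run(
            self.argv, input=image, stdout=subprocess.PIPE, check=True
        )
        return result.stdout


def run_steps(image: bytes, steps) -> bytes:
    """Feed the image through each step in order: image in -> image out."""
    for step in steps:
        image = step.apply(image)
    return image


# Usage with two in-process steps on a three-pixel greyscale "image".
invert = FunctionStep(lambda data: bytes(255 - value for value in data))
flip = FunctionStep(lambda data: data[::-1])
processed = run_steps(bytes([0, 10, 255]), [invert, flip])
```

Because every step exposes the same `apply` method, the caller can mix back ends in one chain without knowing which package does the work.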
IPROC architecture
[Diagram labels: HPC on cluster nodes; Gateway; iPage (page); iPane (three instances); image area; data + images; proc-steps]
Parallel processing
[Diagram: IPROC, three access servers, and the cluster nodes of the CLUSTER]
The group
Angelo Boccia
Gianluca Busiello
Mauro Petrillo
Concita Cantarella*
Luca Cozzuto
Leandra Sepe*

Vittorio Lucignano
Marisa Passaro
NIHRas, NIH3T3, NIHSrc wound healing
Three cell subpopulations: front, middle, and far from the wound

            SPEED (μm/40 min)         ANGLE (degree)              R
            FRONT   MIDDLE  FAR       FRONT    MIDDLE   FAR       FRONT               MIDDLE              FAR
NIH3T3      7.27    4.92    5.25      194.95   181.82   212.62    0.85 (coeff=0.49)   0.47 (coeff=0.56)   0.09 (coeff=0.30)
NIHRas      11.57   6.88    7.57      160.74   188.16   87.6      0.83 (coeff=0.51)   0.59 (coeff=0.54)   0.34 (coeff=0.30)
NIHSrc      6.05    5.1     3.71      181.08   168.29   156.95    0.89 (coeff=0.48)   0.74 (coeff=0.49)   0.56 (coeff=0.35)

            SPEED (μm/40 min)   R      Average angle (degree)
NIH3T3      8.43                0.02   226.73
NIHRas      11.73               0.37   203.79
NIHSrc      4.87                0.24   251.07

[Figure labels: front, middle, far]
Data input: text description
File format
Version number 1 features tab-delimited
Name filename
Depth size 16bit
wdim size 4 where files
cdim size 3 where files
pdim size n where files
tdim size n unit min scale 10 where files
ldim size n unit µm scale 0.4 where layers
[Diagram: dimensions of an acquisition: wells (well1 to well4), channels (Channel 1 to Channel 3), positions (Position 1 to Position n), time points (Time 1 to Time n), layers (l1 to ln)]
Image processing
[Screenshot of IPROC, with callouts: acquisition parameters; buttons to slide the acquisition; image processing menus; info panel for each frame; hide/show control command]
Hierarchical node organization
[Diagram: Broker, virtual nodes (each with a DB), Grid, and the underlying nodes]