G. Paolella, Napoli, 21/2/2008

S.Co.P.E. Project – WP4

Bioinformatics in the SCOPE project

G. Paolella, M. Petrillo, L. Cozzuto, A. Boccia, C. Cantarella, L. Sepe


Our role within SCOPE

[Figure: SCOPE layers. Hardware: nodes. Middleware: GRID software and high level middleware. Application: SCOPE web site, with Astronomy, Chemistry, Physics and Bioinformatics.]


Tasks

• Provide a large number of users with general purpose bioinformatic services, which take advantage of high performance hardware, allowing:
– Web access for quick operations, performed by the vast majority of users
– Unix level access in the form of an integrated problem solving environment

• Set up an automatic annotation system to be used in specific computational or experimental projects, based on the available services. Two specific applications:
– CST analysis by comparative genomics
– Mining for regulatory RNA within completely sequenced genomes


Bioinfo portal


Available services


Programs


Graphic interface to programs


Various operations in a row: Complement -> Translation -> Isoelectric point of the resulting protein.

[Figure: workflow diagram with the steps DNA, Complement, Translation, Isoelectric point]

CAPRI workflow
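
The chain Complement -> Translation -> Isoelectric point can be sketched in a few lines of Python. This is only an illustration of chaining the three operations, not the CAPRI implementation: the function names are hypothetical, the input is assumed to contain only A, C, G and T, and the pKa values are an assumption (one commonly used set; other sets give slightly different results).

```python
# Illustrative chain: complement -> translation -> isoelectric point.
# Not the CAPRI code; names and pKa values are assumptions.

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built by enumerating codons in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumed pKa values for the charged groups.
POSITIVE_PKA = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def complement(dna: str) -> str:
    """Base-by-base complement (no reversal)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))


def translate(dna: str) -> str:
    """Translate frame 1 until the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3].upper()]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def net_charge(protein: str, ph: float) -> float:
    """Net charge at a given pH (Henderson-Hasselbalch for each group)."""
    counts = {"Nterm": 1, "Cterm": 1}
    for residue in protein:
        counts[residue] = counts.get(residue, 0) + 1
    positive = sum(counts.get(group, 0) / (1 + 10 ** (ph - pka))
                   for group, pka in POSITIVE_PKA.items())
    negative = sum(counts.get(group, 0) / (1 + 10 ** (pka - ph))
                   for group, pka in NEGATIVE_PKA.items())
    return positive - negative


def isoelectric_point(protein: str) -> float:
    """pH at which the net charge is zero, found by bisection."""
    low, high = 0.0, 14.0
    for _ in range(60):
        mid = (low + high) / 2
        if net_charge(protein, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def workflow(dna: str) -> float:
    """Run the three steps in a row, each feeding the next."""
    return isoelectric_point(translate(complement(dna)))
```

Calling `workflow("TACTTTATT")` complements the input to `ATGAAATAA`, translates it to the peptide `MK` and returns its estimated isoelectric point.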


SRS: the database tool


SRS


[Figure: a web server hosting CAPRI, SRS and PISE; programs (Emboss, Fasta, Blast, other); user data; DB; primary remote databases; ENSEMBL]

Services organization


Peripheral site: Medicine

Hardware attached to the system:
• 112 processor cluster
• Two 8-processor servers, several 2-processor servers
• Storage center (SCOPE)
• Campus GRID and beyond (SCOPE)


[Figure: a broker connected to two virtual nodes, two DBs, the Grid and a pool of nodes]

Low latency scheduler

High level scheduler

500 tasks/sec

20-50 ms delay


Joining the GRID

Hardware attached to the system:
• 1 Computing Element (CE)
• 5 dual-processor Worker Nodes (WN), expandable up to 40
• 1 Storage Element (SE) with 50 GB
• 1 User Interface (UI)


Available at: lfn:/grid/scope/bioinfo/
• programs/ (executables)
• dbs/ (datasets)

Currently installed tools
• Blast
• Randfold
• Infernal package

Support databases
• RFAM
• Blast (human, rat, dog, chicken and macaque genomes)

GRID bioinformatic tools


• Blastz
• ClustalW
• Dialign-T
• Emboss package
• FASTA package
• Genscan
• Hmmer package
• MCL package
• Pcma
• Primer3
• RNAz
• Vienna package
• Multiz-tba

Tools ready to be installed



Two examples

Two examples of automatic annotation systems, used to identify and characterize DNA sequences with a possible functional role:

– short sequences conserved between human and other species (CSTs);

– sequences able to code for structured RNAs.


• Goal: an automatic sequence annotation system

• Motivation: computational analysis of non-coding sequences allows the identification of new functional elements

• Description of the problem and its solution: different kinds of predictive tests are applied on a large scale to large amounts of data, either extracted from publicly available databases or produced by experiments.

• Need for HPC: given the high number of tests, multiprocessor clusters are generally used. Using the GRID allows the analysis to be extended to even larger data sets.

• Description of the solution of the problem in the HPC environment

Goals and approach


Identification and characterization, in other species, of nucleotide sequences conserved between human and mouse (CSTs).

[Figure: CSTs shared between H. sapiens and M. musculus]

CSTs identified in disease-associated genes: 64,495. Analysis to be performed with BLAST against other genomes (rat, dog, monkey, chicken, etc.).

Identification of CSTs


Annotation is carried out through a pipeline which goes through the various phases without requiring human assistance. Tasks requiring intensive CPU usage, such as BLAST homology searches, are spread across several collaborating servers using a system specifically developed for load distribution and monitoring.

CST ANNOTATION

CSTs are annotated with:
• ENSEMBL gene and gene structure data: chromosome position; type (e.g. intergenic, intronic, exonic, etc.); coding %; closest gene and relative distances; ...
• UCSC Log Score data: Max L-Score; Avg L-Score; ...
• BLAST: matches with EST, other genomes, proteins (BlastX)
• Repeat Masker: repeats type; repeats %
• CPS: Coding Potential Score
• PHP Scripts: redundancy; overlapping; ...

[Figure: data are collected from remote servers and stored in the DB]

CST annotation
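
The idea of spreading CPU-intensive tasks across collaborating servers can be illustrated with a simple greedy scheduler. This is a minimal sketch, not the load distribution system developed for the pipeline, whose internals are not described here: the function name, the server names and the task costs are all hypothetical, and the strategy (largest task first, always to the least loaded server) is just one common heuristic.

```python
import heapq


def distribute(task_costs, servers):
    """Greedy assignment: biggest tasks first, each to the least loaded server.

    task_costs maps a task id to its estimated cost (e.g. CPU seconds).
    Returns (assignment, loads), both keyed by server name.
    """
    heap = [(0, name) for name in servers]      # (current load, server)
    heapq.heapify(heap)
    assignment = {name: [] for name in servers}
    for task_id, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, name = heapq.heappop(heap)        # least loaded server so far
        assignment[name].append(task_id)
        heapq.heappush(heap, (load + cost, name))
    loads = {name: sum(task_costs[t] for t in tasks)
             for name, tasks in assignment.items()}
    return assignment, loads


# Hypothetical BLAST batches with estimated costs, and two hypothetical servers.
costs = {"batch_a": 5, "batch_b": 4, "batch_c": 3,
         "batch_d": 2, "batch_e": 1, "batch_f": 1}
assignment, loads = distribute(costs, ["server1", "server2"])
```

A real system would also need the monitoring side (detecting failed or slow servers and resubmitting their tasks), which this sketch leaves out.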


DG-CST

1022 genes related to genetically transmitted diseases


KinWeb

500 genes coding for human protein kinases


[Figure: KinWeb DB screenshots, panels (a) to (e)]

KinWeb DB


BLAST

• Executable submitted from a local program repository
• Genomic data libraries stored on the local SE and registered on the central SE scopelfc01.dsf.unina.it:/grid/scope/bioinfo

• Example: BLAST of the 65597 CSTs against the dog, chicken, monkey and rat genomes.

• Number of jobs submitted: 67
• Input sequence group: 1000 sequences
• Total execution time of the 67 jobs: 4 hours
• Average time per job: 18 minutes (2 spent downloading the dataset).

• CPU time
– Searching 1 sequence in the mouse genome => 5 sec.
– 64,495 sequences => 3.75 days
– 10 genomes => 37.5 days
• MPIBLAST (installed only)
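
The figures on this slide can be checked with a short calculation. The sketch below splits a list of sequences into groups of 1000, estimates single-CPU time from the 5 sec per sequence quoted above, and derives the average number of jobs running at once from the job count and timings. Function names and sequence identifiers are hypothetical. A plain split of 65597 sequences into groups of 1000 gives 66 groups, while the slide reports 67 jobs, so the real grouping presumably differed slightly.

```python
SECONDS_PER_DAY = 86_400


def split_into_batches(items, batch_size):
    """Cut a list into consecutive groups of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def cpu_days(n_sequences, seconds_per_sequence, n_genomes=1):
    """Single-CPU time, in days, to search every sequence in every genome."""
    return n_sequences * seconds_per_sequence * n_genomes / SECONDS_PER_DAY


def average_concurrency(n_jobs, minutes_per_job, elapsed_minutes):
    """Average number of jobs running at the same time."""
    return n_jobs * minutes_per_job / elapsed_minutes


cst_ids = [f"CST{i}" for i in range(65_597)]      # hypothetical identifiers
batches = split_into_batches(cst_ids, 1000)       # one batch per grid job

one_genome = cpu_days(64_495, 5)                  # about 3.7 days
ten_genomes = cpu_days(64_495, 5, n_genomes=10)   # about 37 days
concurrency = average_concurrency(67, 18, 4 * 60) # about 5 jobs at a time
```

The computed values (about 3.73 and 37.3 days) are close to the rounded 3.75 and 37.5 days on the slide, and an average of about 5 concurrent jobs is consistent with the 5 worker nodes listed for the site.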


Bacterial SLSs

Pae-1 (Pseudomonas aeruginosa)
Eric (Escherichia coli)


Identification and characterization, in bacterial genomes, of families of repeated sequences that share a conserved secondary structure.

Analysis to be performed with INFERNAL on more than 300 bacterial genomes

Example
Searching one family in one genome => 6 hours.
Searching 50 families in one genome => 12.5 days
Searching 50 families in 300 genomes => 10 years

Secondary structure search
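
The estimate above follows from multiplying the 6 hours of a single search by the number of families and genomes. The sketch below reproduces it and, as a hypothetical extension, computes how many workers would be needed to finish within a target time, assuming that every (family, genome) search is an independent job and that scaling is perfect. Function names are illustrative.

```python
import math

HOURS_PER_SEARCH = 6  # one family against one genome, as quoted on the slide


def total_hours(n_families, n_genomes, hours_per_search=HOURS_PER_SEARCH):
    """Single-CPU hours to search every family in every genome."""
    return n_families * n_genomes * hours_per_search


def workers_needed(n_families, n_genomes, target_days,
                   hours_per_search=HOURS_PER_SEARCH):
    """Workers required to finish in target_days, assuming perfect scaling."""
    hours = total_hours(n_families, n_genomes, hours_per_search)
    return math.ceil(hours / (target_days * 24))


one_genome_days = total_hours(50, 1) / 24            # 12.5 days
all_genomes_years = total_hours(50, 300) / 24 / 365  # a little over 10 years
workers_for_a_month = workers_needed(50, 300, target_days=30)
```

Under these assumptions, about 125 workers would bring the 50 families by 300 genomes search down to roughly one month; real runs would be slower because of data transfer and scheduling overhead.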


Aim: find potential regulatory sequences acting as structured RNAs.

Pilot project: analyses carried out on chromosome 21.

[Figure: diagram with DNA, mRNA, Protein and Structured RNA]

Folding of the human genome


Chromosome length 46,944,323 bp

Transcriptome length 14,609,025 bp

Potentially transcribed sequences have been split into overlapping fragments 150 bp long.

Fragments 290,904 sequences

Total length 43,726,912 bp

Genome plan
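
Splitting a transcribed region into overlapping 150 bp fragments can be done with a sliding window, as sketched below. The step between windows is not given on the slide: 50 bp is an assumption, chosen only because 14,609,025 bp divided by 290,904 fragments is roughly 50. The function name and the handling of the last window are also illustrative.

```python
def overlapping_fragments(sequence, length=150, step=50):
    """Return overlapping windows of `length` bases, one every `step` bases.

    Sequences shorter than `length` are returned whole. If the regular
    windows do not reach the end, one extra window flush with the end
    of the sequence is added so that no base is left uncovered.
    """
    if len(sequence) <= length:
        return [sequence]
    starts = list(range(0, len(sequence) - length + 1, step))
    if starts[-1] + length < len(sequence):
        starts.append(len(sequence) - length)
    return [sequence[start:start + length] for start in starts]


# Toy 420 bp sequence: windows start at 0, 50, ..., 250, plus a final one at 270.
toy = "ACGT" * 105
fragments = overlapping_fragments(toy)
```

With a 50 bp step every base away from the ends is covered by three fragments, which is consistent with a total fragment length (43,726,912 bp) about three times the transcriptome length.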


Length: 46,944,323 bp

Total genes: 392
> miRNA genes: 10
> rRNA genes: 3
> snRNA genes: 7
> snoRNA genes: 8
> miscRNA: 8

Known RNAs found: 9

Transcriptome length: 14,609,025 bp

Potentially transcribed sequences have been split into overlapping fragments 150 bp long: 290,904 sequences

Results


Evaluation of the results obtained

• RANDFOLD
– Program: randfold
– Executable submitted from a local repository of bioinformatics programs
– Input sequence group: 2500 sequences from transcribed regions of chr 21
– Number of jobs submitted: 117

• CPU time required
– Sequences derived from chromosome 21 genes: 291,589
– Prediction on 1 sequence => 45 sec.
– 291,589 sequences => 152 days.


Node number   n_sequences   seconds      Day(s)
1             1             45           0
1             291,589       13,121,505   152
117           2,500         112,500      1.3

About 3 days

How long ?
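
The table can be reproduced from the 45 sec per sequence figure, and the comparison between the ideal 1.3 days and the "About 3 days" figure gives a rough idea of parallel efficiency. The sketch below assumes that "About 3 days" is the elapsed time actually observed on the grid, which the slide does not state explicitly; variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_SEQUENCE = 45
N_NODES = 117


def days(n_sequences, seconds_per_sequence=SECONDS_PER_SEQUENCE):
    """Days of CPU time needed to process n_sequences on one node."""
    return n_sequences * seconds_per_sequence / SECONDS_PER_DAY


serial_days = days(291_589)   # one node, all sequences: about 152 days
ideal_days = days(2_500)      # 117 nodes, 2,500 sequences each: about 1.3 days
observed_days = 3             # assumption: "About 3 days" is the measured time

speedup = serial_days / observed_days   # about 50x
efficiency = speedup / N_NODES          # about 0.43
```

Under that assumption the grid run is about 50 times faster than a single node, i.e. a little over 40% of the ideal 117-fold speedup; the gap would be explained by queueing, data download and uneven job lengths.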


[Figure: time (sec) versus proc numbers, for a single node and for the grid]

Performance


Some extra applications


[Figure: sequences are assembled into contigs and scaffolds, then annotated (geneA, tRNA, prom, oprA, oprB, gene cluster A)]

High throughput sequencing


• Identification of genes and other genetic elements.
• Protein functional annotation.
• Cellular process annotation.

• Identification of ORFs, tRNAs, rRNAs
• Scanning for signals, such as promoters and microRNAs
• Identification of operons and gene clusters
• Comparison with known genomes/proteins
• Identification of orthologs and paralogs
• Characterization of protein domains
• Reconstruction of complete metabolic pathways
• …

Annotation Steps


Annotation


The image processing system: IPROC


[Figure: image in -> iProcStep -> image out. The iProcStep variants (iProcStepImageMagick, iProcStepPHP, iProcStepPerl, command-line program) reach the ImageMagick, PHP, Perl and command-line packages through adapters.]

Image processing modules


[Figure: an iPage (page with image area, data + images) made of iPanes and proc-steps, connected through a gateway to HPC on cluster nodes]

IPROC architecture


[Figure: IPROC connected to the cluster nodes through three access servers]

Parallel processing


The group

Angelo Boccia, Gianluca Busiello, Mauro Petrillo, Concita Cantarella*, Luca Cozzuto, Leandra Sepe*

Vittorio Lucignano, Marisa Passaro


NIHRas, NIH3T3, NIHSrc wound healing

Three cell subpopulations: front, middle, and far from the wound

[Figure: wound with the front, middle and far regions marked]

          SPEED (μm/40 min)         ANGLE (degree)             R (coeff)
          FRONT   MIDDLE  FAR       FRONT   MIDDLE  FAR        FRONT        MIDDLE       FAR
NIH3T3    7.27    4.92    5.25      194.95  181.82  212.62     0.85 (0.49)  0.47 (0.56)  0.09 (0.30)
NIHRas    11.57   6.88    7.57      160.74  188.16  87.6       0.83 (0.51)  0.59 (0.54)  0.34 (0.30)
NIHSrc    6.05    5.1     3.71      181.08  168.29  156.95     0.89 (0.48)  0.74 (0.49)  0.56 (0.35)

          SPEED (μm/40 min)   R      Average angle (degree)
NIH3T3    8.43                0.02   226.73
NIHRas    11.73               0.37   203.79
NIHSrc    4.87                0.24   251.07


Version number 1 features tab-delimited

Name filename

Depth size 16bit

wdim size 4 where files

cdim size 3 where files

pdim size n where files

tdim size n unit min scale 10 where files

ldim size n unit µm scale 0.4 where layers

[Figure: dimensions of an acquisition: times (Time 1 ... Time n), wells (well1 ... well4), channels (Channel 1 ... Channel 3), positions (Position 1 ... Position n), layers (l1 ... ln)]

File format

Data input: text description
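
A text description of this kind can be read with a few lines of code. The sketch below assumes that each line is a key followed by tab-separated name/value pairs (for example `tdim`, then `size n`, `unit min`, `scale 10`, `where files`), or a key followed by a single value (`Name filename`). The exact tab layout was lost in the slide, so this reading of the format, like the function names, is an assumption.

```python
def parse_description_line(line):
    """Parse one line into (key, attributes).

    Assumed layout: key <TAB> name <TAB> value [<TAB> name <TAB> value ...],
    or key <TAB> value when a single field follows the key.
    """
    fields = line.rstrip("\n").split("\t")
    key, rest = fields[0], fields[1:]
    if len(rest) == 1:
        return key, {"value": rest[0]}
    if len(rest) % 2:
        raise ValueError(f"unpaired field in line: {line!r}")
    return key, dict(zip(rest[0::2], rest[1::2]))


def parse_description(text):
    """Parse a whole description into a dictionary keyed by line name."""
    return dict(parse_description_line(line)
                for line in text.splitlines() if line.strip())


example = (
    "Version\tnumber\t1\tfeatures\ttab-delimited\n"
    "Name\tfilename\n"
    "tdim\tsize\tn\tunit\tmin\tscale\t10\twhere\tfiles\n"
    "ldim\tsize\tn\tunit\tµm\tscale\t0.4\twhere\tlayers\n"
)
description = parse_description(example)
```

All values are kept as strings; converting `scale` to a number or resolving `n` to the actual count is left to the caller.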


[Figure: IPROC screenshot with callouts: acquisition parameters; buttons to slide the acquisition; image processing menus; info panel for each frame; hide/show control command]

Image processing


[Figure: a broker connected to two virtual nodes, two DBs, the Grid and a pool of nodes]

Hierarchical node organization
