Bioinformatics at USDA-ARS Livestock Issues Research Unit: Scot E. Dowd, Joaquin Zaragoza, Mel Oliver and Paxton Payton

Post on 18-Dec-2015


Page 1: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

Bioinformatics at USDA-ARS Livestock Issues Research Unit

Scot E. Dowd, Joaquin Zaragoza, Mel Oliver and Paxton Payton

Page 2: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

Projects

• Future: Interactive neural network based models to describe and predict gene expression in Livestock and Pathogens

• Present: Various Projects, Various States, Leading to the Future
– Molecular Modeling
– Gene Finding
– Distributed BLAST
– Whole Genome Comparison
– Functional Genomics and pathways
– Pathway or system targeted Microarray design

Page 3: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

Functional Genomics

• Functional Genomics/Gene Ontology: controlled vocabulary

• Define, annotate, categorize, and describe large genetic datasets (e.g. EST, mRNA)

• We have developed a custom curated database for functional domain BLAST (regular BLAST and RPS-BLAST using KOG, COG, Pfam, HMMER, SMART domains)

• Ultimately will become a comprehensive .NET suite of analyses for microarray design from new sequence all the way to result visualization.

Page 4: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

Ontology

• Annotation – propagation of error in definitions

• Ca

Page 5: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

BLAST: need for speed (II)

• We are working with roughly 5,000 to 100,000 queries against 1 GB databases

• 1 query takes a fairly fast PC 3 minutes to complete
– dual 3.2 GHz Xeon
– 6 GB RAM
– RAID0 SCSI-320 HD

• Other methods: MPI-BLAST, WU-BLAST, THREADED BLAST, SGE-BLAST, commercial TURBO BLAST, DNAstar etc.
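To make the "need for speed" concrete, the figures on this slide can be turned into a rough serial run-time estimate. This is only a back-of-the-envelope sketch: it assumes every query takes the quoted 3 minutes on a single machine, and the function name `serial_days` is made up for illustration.

```python
MINUTES_PER_QUERY = 3  # figure quoted on the slide for one query on a fast PC


def serial_days(n_queries, minutes_per_query=MINUTES_PER_QUERY):
    """Days needed to run n_queries one after another on a single machine."""
    return n_queries * minutes_per_query / (60 * 24)


if __name__ == "__main__":
    # The slide's range of job sizes: 5,000 to 100,000 queries.
    for n in (5_000, 100_000):
        print(f"{n:>7} queries: about {serial_days(n):.1f} days on one machine")
```

Under that assumption the small end of the range takes about 10 days and the large end about 208 days, which is the motivation for distributing the work.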

Page 6: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

BLAST ALGORITHM

Cgtcgctcgctgtaagtac – query, e.g. a 1000-letter word

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) A basic local alignment search tool. Journal of Molecular Biology 215, 403-410.

• What database sequence is most similar to my query?

• Databases: one of ours is 60 GB worth of letters

• BLAST generates statistics based upon similarity and substitution probabilities. In simplest form, purine to purine is better than purine to pyrimidine

• Slide along the 4 GB database, find a word match and try to extend
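The "find a word match and try to extend" idea can be sketched in a few lines. This is a toy illustration, not the BLAST implementation from Altschul et al.: the score values, the word size of 4, the drop-off threshold and all function names are assumptions chosen for the example, and the extension is ungapped. The scoring follows the slide's "simplest form", where a purine-to-purine (or pyrimidine-to-pyrimidine) substitution is penalized less than a purine-to-pyrimidine one.

```python
PURINES = {"A", "G"}


def substitution_score(a, b, match=2, transition=-1, transversion=-2):
    """Toy scoring: same-class substitutions cost less than cross-class ones."""
    if a == b:
        return match
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion


def index_words(subject, w):
    """Map every word of length w in the subject to its start positions."""
    index = {}
    for i in range(len(subject) - w + 1):
        index.setdefault(subject[i:i + w], []).append(i)
    return index


def extend_one_way(query, subject, q, s, step, start_score, x_drop):
    """Extend without gaps in one direction; stop once the score drops too far."""
    best = run = start_score
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        run += substitution_score(query[q], subject[s])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break
        q += step
        s += step
    return best, best_len


def extend(query, subject, q, s, w, x_drop=4):
    """Score the seed word, then extend to the right and to the left."""
    seed = sum(substitution_score(query[q + k], subject[s + k]) for k in range(w))
    score, right = extend_one_way(query, subject, q + w, s + w, +1, seed, x_drop)
    score, left = extend_one_way(query, subject, q - 1, s - 1, -1, score, x_drop)
    return score, q - left, s - left, w + left + right


def best_hit(query, subject, w=4):
    """Return (score, query_start, subject_start, length) of the best hit, or None."""
    query, subject = query.upper(), subject.upper()
    index = index_words(subject, w)
    best = None
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            hit = extend(query, subject, q, s, w)
            if best is None or hit[0] > best[0]:
                best = hit
    return best
```

Indexing the subject words first means each query word is looked up rather than compared against every database position, which is the part that matters when the database is gigabytes in size.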

Page 7: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

• BLASTX as an example: translate the query into 6 reading frames, then search the database with these 6 sequences using a word size of 3.

• Time to BLAST
– Up to a point, decreased time correlated with the number of slaves available
– Average test machines (2.4 GHz / 1 GB RAM / SATA150)
– e.g. 90 seq / 13 CPU / 3 min vs 90 seq / 1 CPU / 38.5 min (350 MB db, gigabit LAN), roughly a 12.8-fold speedup
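The six-frame translation that BLASTX performs on a nucleotide query (three offsets on each strand) can be sketched as follows. This is a minimal illustration assuming the standard genetic code; the function names are hypothetical, stop codons are written as `*`, and any codon containing a non-ACGT letter becomes `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse, to read the opposite strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the six protein translations keyed as +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Each of the six protein strings is then searched against the protein database, which is one reason a BLASTX query costs noticeably more than a plain nucleotide search.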

Page 8: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton
Page 9: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

.NET Distributed BLAST

• Take advantage of unused laboratory compute resources

• Provide easy, powerful tool for Distributing BLAST

• Target environment: Windows LAN

• Current Open Source Distributed BLAST Applications
– Require a server-class master or a version of UNIX
– Difficult to set up, configure databases, compile and submit jobs
– No fault tolerance for large jobs

Page 10: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

W.ND-BLAST: A bioinformatician promoting Windows?

• .NET C#

• First tests: Condor, MPI, a ported remote shell

• Contractor

• Project Manager

• Database formatter

• Worker machines

• Job leasing

• Output processing HT backend apps

Page 11: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

Gotta GUI

Page 12: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

Database formatter

Page 13: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton
Page 14: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton
Page 15: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton
Page 16: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

Functionality

• Network bandwidth would eventually become the limiting factor

• Fault tolerant to worker failure

• Resume upon reboot if Contractor fails

• No statistical problems with search results

• Complete BLAST database on each worker node if resources allow

• Easy to install, a breeze to use

Page 17: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

.NET Distributed BLAST

• Queue at each node
– Contractor only allows a maximum of two query sequences in each node’s queue
– Ensures the application waits a minimal amount of time between completion and the next job

• Thread per node
– Makes use of .NET Asynchronous Delegate / AD – scalability ???
– Thread invokes BLAST on the remote node
– Upon completion, the remote node sends a “finished” message to the Contractor
– The Contractor collects results and performs a validity check
– Once results are verified, the remote worker's BLAST starts on the queued sequence and the Contractor allocates a future job
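The queue-per-node and thread-per-node scheme above can be sketched as follows. The slides describe a C# implementation built on .NET asynchronous delegates; this Python version uses `threading` and `queue` as a stand-in technique, so it illustrates the scheduling idea rather than the authors' code. `run_blast` and `is_valid` are hypothetical placeholders, and treating the limit of two as a bound on waiting queries only is an assumption.

```python
import queue
import threading
import time

MAX_QUEUED = 2  # from the slide: at most two query sequences per node queue


def run_blast(node, query):
    """Hypothetical stand-in for invoking BLAST on a remote node."""
    return f"{node}:{query}"


def is_valid(result):
    """Hypothetical validity check on the returned output."""
    return bool(result)


def node_thread(node, node_queue, results, lock):
    """One thread per node: run queued queries until the stop sentinel arrives."""
    while True:
        query = node_queue.get()
        if query is None:  # sentinel: no more work for this node
            break
        result = run_blast(node, query)
        if is_valid(result):
            with lock:
                results[query] = result


def contractor(queries, nodes):
    """Keep every node's short queue topped up until all queries are dispatched."""
    results, lock = {}, threading.Lock()
    queues = {n: queue.Queue(maxsize=MAX_QUEUED) for n in nodes}
    threads = [
        threading.Thread(target=node_thread, args=(n, queues[n], results, lock))
        for n in nodes
    ]
    for t in threads:
        t.start()
    for query in queries:
        placed = False
        while not placed:
            for n in nodes:
                try:
                    queues[n].put_nowait(query)  # fails if the node queue is full
                    placed = True
                    break
                except queue.Full:
                    continue
            if not placed:
                time.sleep(0.01)  # every queue is full; wait for a node to drain
    for n in nodes:
        queues[n].put(None)
    for t in threads:
        t.join()
    return results
```

Keeping the per-node queue short means a slow or failed node never holds more than a couple of queries hostage, while still giving each worker something to start on as soon as it finishes.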

Page 18: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

.NET Distributed BLAST

• Fault Tolerance, revisited
– Task migration handled through application-level checkpointing
– If a worker encounters a fault or crashes, the Contractor redirects the failed node's sequence to another worker node
– Minimal loss of time

• Integrating QoS functionality (currently in the works)
– Decrease priority when the workstation is in use, based upon a remote system call checking CPU %, memory etc.
– GUI allows increasing or decreasing priority: rev gauges and throttles
– Storage requirement limitations: redirect query to another database source (working with the 10-connection limitation in XP Pro)
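The redirect-on-failure behaviour described under fault tolerance can be sketched with a small simulation. Everything here is an assumption for illustration: `WorkerFault`, `run_with_redirect` and `flaky_blast` are hypothetical names, workers are tried in simple rotation, and the `completed` dictionary stands in for the application-level checkpoint that a real Contractor would presumably persist to disk.

```python
from collections import deque


class WorkerFault(Exception):
    """Raised by the (hypothetical) BLAST runner when a worker crashes."""


def run_with_redirect(queries, workers, run_blast):
    """Run queries across workers, re-queueing a query if its worker fails.

    Returns (completed, unfinished): results keyed by query, plus any queries
    left over if every worker has failed.
    """
    pending = deque(queries)
    alive = list(workers)
    completed = {}  # acts as the checkpoint of finished work
    turn = 0
    while pending and alive:
        query = pending.popleft()
        worker = alive[turn % len(alive)]
        turn += 1
        try:
            completed[query] = run_blast(worker, query)
        except WorkerFault:
            alive.remove(worker)       # stop sending work to the failed node
            pending.appendleft(query)  # redirect its sequence to another worker
    return completed, list(pending)


def flaky_blast(worker, query):
    """Demo runner in which worker 'w2' always crashes."""
    if worker == "w2":
        raise WorkerFault(worker)
    return f"{worker}:{query}"
```

Only the query that was running on the failed node is repeated, which matches the slide's point that the loss of time is minimal.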

Page 19: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

Future Directions

• Quality of Service
– Allow Contractor to set priority for application

• Contractor Fault Tolerance

• Large Network Optimization
– Sub Contractors

• Asynchronous delegate thread limit: ewww kewl WEB SERVICE!

• Shadow (Sub) Contractors: network load balancing

Page 20: Bioinformatics at USDA-ARS Livestock Issues Research Unit Scot E. Dowd, Joaquin Zaragoza Mel Oliver and Paxton Payton

• The End!

• Questions?

• Suggestions?

• Advice?

• Even Criticism?