

dna-hexagon – massively parallel, efficient algorithm designed to work with NGS alignments to reference genomes. It is comparable to well-recognized algorithms in its quantitative and qualitative outcomes with significant improvements in speed and sensitivity.

Base-calling and SNP-profiler – utility that allows base-calling and SNP-calling and reports statistical significance, quality profile, and sequencing noise profile.

Post-alignment Quality Control Procedures – set of tools developed based on positional base-frequency entropic information content paradigm from information theory methodologies to validate the results of alignment algorithms and to distinguish artifacts from real biological variability.

Meta-genomic Recombination Analysis Tool – utility to study microbial populations of environmental samples, to discover genetic recombination events and resolve subspecies.

Clonal Population Discovery Tool – a powerful de novo utility capable of tracking bifurcation diagrams of read mappings along the reference-assisted profile and extracting tens or hundreds of reference genomes from a soup of meta-genomic material where multiple co-related species exist.

Comparative SNP Phylogenetic Analysis Tool – algorithm designed to show hierarchical information or graphs/networks of interconnected data on the screen. The visuals produced by this tool include interactive controls that let the user explore the data.

XL-table data analysis tool – Excel-like backend tool capable of executing information retrieval queries on multi-million-row tables: sorting, filtering, categorization, removal of redundant rows and more…

Support, adoption and adaptation of existing industry standard tools and utilities for next gen analysis.
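The post-alignment quality control entry above refers to positional base-frequency entropy. The poster does not give the formula HIVE uses, so the following is only a minimal sketch of the general idea, assuming plain Shannon entropy over the A/C/G/T counts in each alignment column. The function names and the toy reads are hypothetical.

```python
import math
from collections import Counter

def positional_entropy(column):
    """Shannon entropy (in bits) of the base frequencies in one alignment column."""
    counts = Counter(b for b in column.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # A column with a single base has entropy 0; two equally frequent bases give 1 bit.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def entropy_profile(reads):
    """Entropy per position for equal-length reads that are already aligned (assumption)."""
    return [positional_entropy("".join(col)) for col in zip(*reads)]

# Toy input: positions 0 and 1 are unanimous, positions 2 and 3 are split 50/50.
reads = ["ACGT", "ACGA", "ACTA", "ACTT"]
profile = entropy_profile(reads)
```

Low-entropy positions agree across reads. High-entropy positions are where a quality control step would need to decide between sequencing artifact and real variability, which is the distinction the tool description draws.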

Computational Pipeline

The submission process, usually a web-page CGI process or a client program, submits the initial information from an HTML form into the HIVE cloud server queue and retrieves a unique request identification number. Using this value, the request can be tracked in the system and associated information can be retrieved or updated.
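The submit, track and update cycle described above can be illustrated with a small in-memory queue. This is a sketch under stated assumptions, not the HIVE server API: the class `RequestQueue`, its method names, and the status strings are all hypothetical.

```python
import itertools
import queue

class RequestQueue:
    """Toy stand-in for a server queue that hands out unique request IDs."""

    def __init__(self):
        self._ids = itertools.count(1)   # monotonically increasing request IDs
        self._pending = queue.Queue()    # FIFO of IDs waiting to be processed
        self._records = {}               # request ID -> status and form data

    def submit(self, form_data):
        """Store the submitted form fields and return the new request ID."""
        req_id = next(self._ids)
        self._records[req_id] = {"status": "queued", "data": dict(form_data)}
        self._pending.put(req_id)
        return req_id

    def status(self, req_id):
        """Track a request by its ID."""
        return self._records[req_id]["status"]

    def update(self, req_id, **fields):
        """Update the information associated with a request."""
        self._records[req_id]["data"].update(fields)

    def data(self, req_id):
        """Retrieve the information associated with a request."""
        return dict(self._records[req_id]["data"])

    def run_next(self):
        """Take the oldest queued request and mark it done (no real work here)."""
        req_id = self._pending.get_nowait()
        self._records[req_id]["status"] = "done"
        return req_id

q = RequestQueue()
first = q.submit({"algorithm": "alignment", "reference": "ref1"})
second = q.submit({"algorithm": "snp-profile"})
q.update(first, priority="high")
q.run_next()
```

After this runs, `q.status(first)` is `"done"` while `q.status(second)` is still `"queued"`, and the ID returned by `submit` is the only handle the client needs.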

COMPUTATIONS

SECURITY

The High-performance Integrated Virtual Environment (HIVE) is a distributed storage and computational environment designed to securely handle next-generation sequencing (NGS) data. HIVE was developed in collaboration with researchers and FDA regulatory scientists to provide a web-based portal to store, retrieve, analyze, and review NGS data. HIVE infrastructure allows reviewers and scientists to perform NGS analysis in a manner that is both efficient and secure.

Data Deposition Pathway

Data deposition into HIVE is accomplished by two different tools: the downloader retrieves the file from a user-specified location; the archiver identifies the file type and executes the parser to identify and analyze the content of each file and extract the information to be stored in the system. Each field is assigned a unique ID that is used internally to recognize and access the path to a specific file.
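The archiver step described above (identify the file type, pick a parser, extract content, assign an ID) can be sketched as follows. This is an illustration only: the names `detect_type`, `parse_fasta`, `PARSERS` and `archive` are hypothetical, type detection by first character is a simplifying assumption, and the download step is replaced by text already in memory.

```python
import itertools

_next_id = itertools.count(1)  # hypothetical source of unique internal IDs

def detect_type(text):
    """Guess the file type from its first non-blank character (simplified)."""
    stripped = text.lstrip()
    if stripped.startswith(">"):
        return "fasta"
    if stripped.startswith("@"):
        return "fastq"
    return "unknown"

def parse_fasta(text):
    """Return {sequence name: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

PARSERS = {"fasta": parse_fasta}  # only FASTA is handled in this sketch

def archive(text):
    """Identify the type, run the matching parser, and attach a unique ID."""
    kind = detect_type(text)
    parser = PARSERS.get(kind)
    if parser is None:
        raise ValueError(f"no parser for file type {kind!r}")
    return {"id": next(_next_id), "type": kind, "records": parser(text)}

# The two example sequences shown on the poster, wrapped over two lines each.
fasta_text = """>sequence1
atgcgagagcttgtgtaca
ggtgagccgagagctt
>sequence2
gtacaggtgagccgagag
cttgtgtacaggtgag
"""
entry = archive(fasta_text)
```

A dispatch table keyed by file type keeps the archiver itself unchanged when a parser for another format is added.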

[Figure: data deposition pathway. Labels: Web Portal; FTP drop-box; Web Browser; Instrument; Metadata (Experiment, Sample, Run Information, Engineering data); Sequence Data; Parsing; Processing Pipeline; Metadata Database; Storage. Example sequence data shown in the figure:

>sequence1
atgcgagagcttgtgtacaggtgagccgagagctt
>sequence2
gtacaggtgagccgagagcttgtgtacaggtgag
]

Data Flow

HIVE addresses the data transfer bottleneck by integrating the distributed storage system, the metadata database, and the compute nodes into the same network. The distributed storage software layer is the key component for file and archive management and the backbone of the deposition pipeline. The data deposition backend adds the capability to automatically download external data sets into HIVE data repositories and keep them updated. Additionally, data parser backend procedures enable users to retrieve specific information from external data sources including NCBI, EBI, DDBJ, UniProt, PIR and others.

Example: SNP Profile Mapping

The SNP profile reports the frequency of each individual base, plotted against base position in either the reference genome (the index) or the consensus sequence.
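A per-position base-frequency profile of this kind can be sketched in a few lines. The poster does not describe HIVE's actual computation, so this is an assumption-laden toy: reads are ungapped `(start, sequence)` placements on the reference, and the `min_freq` and `min_depth` thresholds are arbitrary illustrative values, not HIVE defaults.

```python
from collections import Counter

def snp_profile(reference, aligned_reads):
    """Count bases per reference position from ungapped (start, sequence) placements."""
    counts = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                counts[pos][base.upper()] += 1
    profile = []
    for pos, c in enumerate(counts):
        depth = sum(c.values())
        # Frequencies are left empty where no read covers the position.
        freqs = {b: c[b] / depth for b in "ACGT"} if depth else {}
        profile.append({"pos": pos, "ref": reference[pos], "depth": depth, "freqs": freqs})
    return profile

def candidate_snps(profile, min_freq=0.2, min_depth=3):
    """List (position, reference base, alternate base, frequency) above the thresholds."""
    hits = []
    for row in profile:
        if row["depth"] < min_depth:
            continue
        for base, freq in row["freqs"].items():
            if base != row["ref"] and freq >= min_freq:
                hits.append((row["pos"], row["ref"], base, freq))
    return hits

reference = "ACGTAC"
reads = [(0, "ACGT"), (1, "CTTA"), (2, "TTAC"), (0, "ACT")]
profile = snp_profile(reference, reads)
snps = candidate_snps(profile)
```

In this toy input, position 2 is covered by four reads, three of which carry T where the reference has G, so it is the only position reported.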

Collaborative Environment

All available resources allow for the implementation of a collaborative environment. This creates an optional environment for users who want to enjoy free exchange of ideas and projects using the powerful HIVE engine.

Visual interfaces are implemented in a Data Driven Document model using JavaScript and HTML5, which are natively supported in all modern browsers.

Access Control Rights

User- and group-level access rights and encryption parameters.

Example: Genomic-Proteomic Mapping

Once the SNP positions in a genomic sequence are discovered, a natural follow-up question is how each mutation affects the protein sequence. To answer it, HIVE provides a graphic representation that shows the mapping between the genomic and proteomic space.
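The mapping from a SNP position to its effect on the protein can be illustrated with the standard genetic code. The sketch below is not HIVE's implementation: it assumes a forward-strand coding sequence that starts in frame at position 0, with no introns, and the function `snp_effect` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def snp_effect(cds, pos, alt):
    """Map a 0-based SNP position in an in-frame coding sequence to its residue change."""
    cds = cds.upper()
    start = (pos // 3) * 3          # first base of the codon containing the SNP
    ref_codon = cds[start:start + 3]
    if len(ref_codon) < 3:
        raise ValueError("position falls in an incomplete trailing codon")
    k = pos - start                 # offset of the SNP within the codon
    alt_codon = ref_codon[:k] + alt.upper() + ref_codon[k + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return {"residue": pos // 3 + 1, "ref_aa": ref_aa, "alt_aa": alt_aa, "kind": kind}

# Toy coding sequence ATG CGA GAG CTT, which translates to M R E L.
cds = "ATGCGAGAGCTT"
missense = snp_effect(cds, 7, "T")    # GAG -> GTG
silent = snp_effect(cds, 8, "A")      # GAG -> GAA
```

Here the first change replaces glutamate with valine at residue 3, while the second leaves the residue unchanged. A real annotation would also need strand, exon structure and reading-frame information that this toy ignores.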

ACKNOWLEDGEMENTS

Mark Walderhaug, Carolyn Wilson, HIVE team: Grace Alterovitz, Amanda Bell, Prince Birring, Ting-Chia Chang, Stefan Dabic, Hiral Desai, Hayley Dingerdissen, Lama Elzohary, Marianna Faradzheva, Brian Fochtman, Tigran Ghazanchyan, Anton Golikov, Naila Gulzar, Robel Kahsay, Konstantinos Karagiannis, Charles Hadley King, Ilya Mazo, Reza Mousavi, Rahi Navelkar, Ekaterina Osipova, Aleksey Pshenichnov, Yao Ren, Alexandre Rostovtsev, Xutian Ruan, Luis Santana-Quintero, Michelle Shen, Amanraj Singh, Krista Smith, Valery Tkachenko, Phuc VinhNguyen Lam, Jeet Kiran Vora, Alin Voskanian

[Figure: computational pipeline. Labels: Web Portal; Input selection; Parameter selection; Choose algorithms; web page form; crontab or user shell scheduled updates; submission CGI; Cloud Control Server; Task DB/Server; Metadata Database; Distributed Storage Cloud; Distributed Computational Cloud Nodes; High throughput data exchange highway; updates; access to working dumps; cloud production pipeline.]

Hierarchical Security Model

Users, groups, files, processes, metadata, visual elements and algorithms are all treated as security-enabled objects. A hierarchical security model allows granting of permissions down the hierarchy, up the hierarchy or to specific subjects with a minimal number of access control rules.
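One way to read the model above is as a tree of objects where each rule carries a propagation scope. The sketch below is an interpretation, not HIVE's design: the classes `Node` and `AccessRules`, the scope names `"self"`, `"down"` and `"up"`, and the rule that a grant applies along the ancestor chain are all assumptions made for illustration.

```python
class Node:
    """A security-enabled object placed in a hierarchy (hypothetical structure)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def ancestors(self):
        """Yield the parent, grandparent, and so on up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

class AccessRules:
    """Rules of the form (subject, node, permission, scope)."""

    SCOPES = ("self", "down", "up")

    def __init__(self):
        self._rules = []

    def grant(self, subject, node, permission, scope="self"):
        if scope not in self.SCOPES:
            raise ValueError(f"unknown scope {scope!r}")
        self._rules.append((subject, node, permission, scope))

    def allowed(self, subject, target, permission):
        for s, node, p, scope in self._rules:
            if s != subject or p != permission:
                continue
            if node is target:
                return True                      # rule set directly on the target
            if scope == "down" and node in target.ancestors():
                return True                      # inherited from an ancestor
            if scope == "up" and target in node.ancestors():
                return True                      # propagated up from a descendant
        return False

project = Node("project")
run = Node("run", parent=project)
reads = Node("reads", parent=run)
other = Node("other", parent=project)

rules = AccessRules()
rules.grant("alice", project, "read", scope="down")  # one rule covers the whole subtree
rules.grant("bob", reads, "read", scope="up")        # bob can also see the containers
rules.grant("carol", run, "read")                    # a specific object only
```

With these three rules, alice can read every object under `project`, bob can read `reads`, `run` and `project` but not the sibling `other`, and carol can read `run` alone, which shows how a few scoped rules can stand in for many per-object grants.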

High-performance Integrated Virtual Environment (HIVE): A Robust Infrastructure for Next-generation Sequence Data Analysis

Elaine E. Thompson (1), Vahan Simonyan (1), Raja Mazumder (2) and the HIVE Team (1,2)

(1) Food and Drug Administration, Center for Biologics Evaluation and Research, Silver Spring, MD, 20993
(2) Department of Biochemistry and Molecular Biology, The George Washington University, Washington, DC, 20037