big data exploration in genome-based data analysis

Big Data Exploration in Genome-based Data Analysis. Dr. Jittisak Senachak, Systems Biology and Bioinformatics (SBI) research group, King Mongkut’s University of Technology Thonburi (KMUTT) [Affiliation: National Center for Genetic Engineering and Biotechnology (BIOTEC)]. Trilateral Scientific Meeting Indonesia-Thailand-France: “Climate change, Big data management and Health”, IICC-Bogor, Indonesia, October 29, 2015

Upload: adb-health-sector-group

Post on 18-Feb-2016


DESCRIPTION

Presented by Dr. Jittisak Senachak, Systems Biology and Bioinformatics (SBI) research group, King Mongkut’s University of Technology Thonburi (KMUTT) last 28 October 2015 in Bogor, Indonesia

TRANSCRIPT

Page 1: Big Data Exploration in Genome-based Data Analysis

Big Data Exploration in Genome-based Data Analysis

Dr. Jittisak Senachak

Systems Biology and Bioinformatics (SBI) research group

King Mongkut’s University of Technology Thonburi (KMUTT)

[Affiliation: National Center for Genetic Engineering and Biotechnology (BIOTEC)]

Trilateral Scientific Meeting Indonesia-Thailand-France: “Climate change, Big data management and Health”

IICC-Bogor, Indonesia, October 29, 2015

Page 2: Big Data Exploration in Genome-based Data Analysis

@Bangkok, THAILAND
BM: BangMod - Main campus
BKT: Bangkhunthien campus
- R&D cluster + Industrial Park
- Pilot plants 1, 2, 3
- National Biopharmaceutical Facility
KX: Knowledge Exchange Center
- Big Data Exchange Center

@Ratchaburi, THAILAND
RAT: Ratchaburi Campus
- Residential campus
- Bee-Park

King Mongkut’s University of Technology Thonburi

http://global.kmutt.ac.th/

Page 3: Big Data Exploration in Genome-based Data Analysis

BangKhunThian Campus, BKK

Main Campus @BangMod, BKK Incity Innovation Center

Ratchaburi Campus

Page 4: Big Data Exploration in Genome-based Data Analysis

KX: Knowledge eXchange for Innovation Center

• KMUTT Learning Square
• Working + Learning + Sharing
• eXchange
• eXperience
• eXpert
• eXtension
• eXplore

Page 5: Big Data Exploration in Genome-based Data Analysis
Page 6: Big Data Exploration in Genome-based Data Analysis

BX: Data scientists meet with enterprise to solve business problems

• The big data ecosystem in Thailand
• eXplain: Education & training
• eXplore: Share ideas & best practices
• eXchange: Case studies, prototypes & surveys among academia & IT providers

[Figure labels: Mobilizing talent; Leveraging education; Big data trends; Analysis methods; Big Data; New Insight Data]

Page 7: Big Data Exploration in Genome-based Data Analysis

Agenda

• Intro … KMUTT’s facility for Big Data trend
• Big Data & characteristics; genome-based data
• Exploration of biological entities
  • Genome browsers
  • Application: Comparative genomics
• Exploration of relationships among the entities
  • Integrative tool with ~omics data
• Example: Big Data technology applied to genome-based data analysis
• Our current activities: conference & workshop

Page 8: Big Data Exploration in Genome-based Data Analysis

What is Big Data?

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

-- Gartner, 2015

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reduction and reduced risk.

-- Wikipedia, 2015

Page 9: Big Data Exploration in Genome-based Data Analysis

What is Big Data?

Characteristics (5V)

• Volume: Data at scale

• Variety: Data in various forms

• Velocity: Data flow

• Veracity: Data uncertainty

• Value

Page 10: Big Data Exploration in Genome-based Data Analysis

Genome-based data as Big data

Page 11: Big Data Exploration in Genome-based Data Analysis

How big is genome-based data?

Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi: 10.1371/journal.pbio.1002195

Metric | Astronomy | Twitter | YouTube | Genomics
Current rate | 7.5 TB/sec | 5M tweets/day | 300 hours/min | Raw 3.6 PB + 35 Pbp/year
Growth in Y2025 | 750 TB/sec | 1,200M tweets/day | 1,700 hours/min | ??
Acquisition-2025 (unit per year) | 25 ZB (~5 trillion DVDs) | 0.5-15M tweets | 500-900M hours | ??
Storage-2025 (byte/year) | 1 EB (~212 million DVDs) | 1-17 PB (~3.62 million DVDs) | 1-2 EB (~425 million DVDs) | ??
Analysis-2025 | In situ data reduction; real-time processing; massive volumes | Topic & sentiment mining; metadata analysis | Limited requirements | Heterogeneous data/analysis; variant calling; all-pairs genome alignments
Distribution-2025 | Dedicated lines from antennae to server (600 TB/s) | Small units of distribution | Major component of modern user’s bandwidth (10 MB/s) | Many small (10 MB/s) and fewer massive (10 TB/s) data movements

Zetta: 2^70 (~10^21); Exa: 2^60 (~10^18); Peta: 2^50 (~10^15); Tera: 2^40 (~10^12); Giga: 2^30 (~10^9); Single-sided DVD ~4.5GB
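The DVD equivalents quoted in the table can be checked with simple arithmetic. The following is a minimal sketch, assuming decimal (SI) prefixes and a nominal single-layer DVD capacity of 4.7 GB; that capacity is an assumption made here (the footnote says ~4.5GB, but the counts in the table are reproduced more closely with 4.7 GB). The name `dvd_equivalents` is illustrative only.

```python
# Decimal (SI) byte prefixes; the slide footnote also lists the binary powers.
PREFIX = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

# Assumption: nominal capacity of a single-layer DVD (4.7 GB).
DVD_BYTES = 4.7 * PREFIX["GB"]


def dvd_equivalents(amount: float, unit: str) -> float:
    """Return how many DVDs are needed to hold `amount` of `unit`."""
    return amount * PREFIX[unit] / DVD_BYTES


if __name__ == "__main__":
    # Upper ends of the ranges quoted in the table.
    for amount, unit in [(25, "ZB"), (1, "EB"), (17, "PB"), (2, "EB")]:
        print(f"{amount} {unit} ~ {dvd_equivalents(amount, unit):.3g} DVDs")
```

Under these assumptions 1 EB comes out at a little under 213 million DVDs and 2 EB at about 425 million, in line with the figures on the slide.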

Page 12: Big Data Exploration in Genome-based Data Analysis

• Human genome: approx. 25,000 proteins (over 3,000 Mbp); variants are ~0.1% of WGS data
  • Assembled: ~700 MB
  • Raw (30x): ~200,000 MB
  • Variants: ~125 MB
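The sizes above are close to what a back-of-the-envelope calculation gives. This is a rough sketch under assumptions that are not on the slide: 2 bits per base for the assembled sequence, about 2 bytes per sequenced base for raw reads (base call plus quality score), and "0.1%" read as roughly one variant per 1,000 bases.

```python
GENOME_BP = 3_000_000_000  # "over 3,000 Mbp" from the slide
MB = 10**6

# Assumption: four nucleotides can be packed into 2 bits each.
assembled_mb = GENOME_BP * 2 / 8 / MB

# Assumption: raw reads cost ~2 bytes per base (base call + quality score).
COVERAGE = 30
raw_mb = GENOME_BP * COVERAGE * 2 / MB

# Assumption: "0.1%" means about one variant per 1,000 bases.
n_variants = GENOME_BP // 1000
# Back out the storage per variant implied by the slide's ~125 MB.
bytes_per_variant = 125 * MB / n_variants

print(f"assembled ~ {assembled_mb:.0f} MB")
print(f"raw (30x) ~ {raw_mb:.0f} MB")
print(f"variants  ~ {n_variants:,} sites, ~{bytes_per_variant:.0f} bytes each")
```

With these assumptions the assembled genome is 750 MB and the 30x raw data 180,000 MB, the same order of magnitude as the slide's ~700 MB and ~200,000 MB.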

Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi: 10.1371/journal.pbio.1002195

Too much Data: Cautionary Tale of Sequencing Data

[Figure: number of human genomes sequenced over time (y-axis: # human genomes, x10^3 to x10^6), across First Gen, Next-Gen and Next-next Gen sequencers; curves: Moore’s law, Illumina estimated, Historical growth]

Page 13: Big Data Exploration in Genome-based Data Analysis

Not only human genomes are being sequenced

Currently, the Sequence Read Archive (SRA) @NIH/NCBI contains more than 3.6 Pbp
• ~32,000 microbes
• ~5,000 plants & animals
• ~250,000 human genomes

Massive sequencing projects are ongoing
• 3k rice genomes [public data on AWS]
• 1,000k plants & animals
• 100k (UK) + 100k (SA) + 320k (Iceland) + 1,000k (US) + 1,000k (CN)
• Y2025: around 25% of the population (developing) + 50% of the population (developed)

Page 14: Big Data Exploration in Genome-based Data Analysis

How big is genome-based data?

Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi: 10.1371/journal.pbio.1002195

Metric | Astronomy | Twitter | YouTube | Genomics
Current rate | 7.5 TB/sec | 5M tweets/day | 300 hours/min | Raw 3.6 PB + 35 Pbp/year
Growth in Y2025 | 750 TB/sec | 1,200M tweets/day | 1,700 hours/min | --estimated--
Acquisition-2025 (unit per year) | 25 ZB (~5 trillion DVDs) | 0.5-15M tweets | 500-900M hours | 1 ZB (~212.5 billion DVDs)
Storage-2025 (byte/year) | 1 EB (~212 million DVDs) | 1-17 PB (~3.62 million DVDs) | 1-2 EB (~425 million DVDs) | 2-40 EB (~8,500 million DVDs)
Analysis-2025 | In situ data reduction; real-time processing; massive volumes | Topic & sentiment mining; metadata analysis | Limited requirements | Heterogeneous data/analysis; variant calling (~2 trillion CPU-hours); all-pairs genome alignments (~10,000 trillion CPU-hours)
Distribution-2025 | Dedicated lines from antennae to server (600 TB/s) | Small units of distribution | Major component of modern user’s bandwidth (10 MB/s) | Many small (10 MB/s) and fewer massive (10 TB/s) data movements

Zetta: 2^70 (~10^21); Exa: 2^60 (~10^18); Peta: 2^50 (~10^15); Tera: 2^40 (~10^12); Giga: 2^30 (~10^9); Single-sided DVD ~4.5GB
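To get a feel for the CPU-hour estimates in the Analysis-2025 row, they can be converted to wall-clock time on a cluster. This is a minimal sketch assuming perfect parallel scaling; the cluster size of one million cores is hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def wall_clock_years(cpu_hours: float, cores: int) -> float:
    """Years of wall-clock time, assuming perfect scaling across `cores`."""
    return cpu_hours / cores / HOURS_PER_YEAR


if __name__ == "__main__":
    CORES = 1_000_000  # hypothetical cluster size, not from the slide
    estimates = [
        ("variant calling", 2e12),            # ~2 trillion CPU-hours
        ("all-pairs genome alignment", 1e16),  # ~10,000 trillion CPU-hours
    ]
    for task, cpu_hours in estimates:
        years = wall_clock_years(cpu_hours, CORES)
        print(f"{task}: ~{years:,.0f} years on {CORES:,} cores")
```

Even on a million cores, variant calling at that scale would take a couple of centuries under these assumptions, which is why the analysis column is the bottleneck the table highlights.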

Page 15: Big Data Exploration in Genome-based Data Analysis

Exploration of Biological Entities

Page 16: Big Data Exploration in Genome-based Data Analysis

Biological Entities: Genome (Genes in total)

~1,500 base pairs on this slide

Need 2,000,000 slides for the human genome

Page 17: Big Data Exploration in Genome-based Data Analysis

Genome Browser: chromosomes to genes to nucleotides

Page 18: Big Data Exploration in Genome-based Data Analysis

Genome Browser: chromosomes to genes to nucleotides

Page 19: Big Data Exploration in Genome-based Data Analysis

Comparative Genomics

- Gene presence/absence
- Gene gains/losses
- Evolutionary study
- Strain screening
Kit set for QA; Alt options
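Gene presence/absence and gains/losses can be illustrated with plain set operations. This is a toy sketch, not the method used in the talk: the strain names and ortholog-group IDs are hypothetical, and "gains/losses" here are simply differences against a chosen reference genome, whereas a real evolutionary study would place them on a phylogeny.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy data: genome name -> ortholog-group IDs found in it.
GENOMES: Dict[str, Set[str]] = {
    "strain_A": {"og1", "og2", "og3"},
    "strain_B": {"og1", "og3", "og4"},
    "strain_C": {"og1", "og2", "og4"},
}


def presence_absence(genomes: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Return {gene: {genome: 0 or 1}} for every gene seen in any genome."""
    all_genes = sorted(set().union(*genomes.values()))
    return {
        gene: {name: int(gene in genes) for name, genes in genomes.items()}
        for gene in all_genes
    }


def core_genes(genomes: Dict[str, Set[str]]) -> Set[str]:
    """Genes present in every genome."""
    return set.intersection(*genomes.values())


def gains_losses(reference: Set[str], other: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Genes only in `other` (gained) and only in `reference` (lost)."""
    return other - reference, reference - other


if __name__ == "__main__":
    for gene, row in presence_absence(GENOMES).items():
        print(gene, row)
    print("core:", core_genes(GENOMES))
    print("A -> B:", gains_losses(GENOMES["strain_A"], GENOMES["strain_B"]))
```

A presence/absence matrix of this kind is also the natural input for strain screening, since genes present in only one strain are candidate markers.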

Page 20: Big Data Exploration in Genome-based Data Analysis

Biological entities: varieties

Page 21: Big Data Exploration in Genome-based Data Analysis
Page 22: Big Data Exploration in Genome-based Data Analysis

Exploration of Relationships among entities

Page 23: Big Data Exploration in Genome-based Data Analysis

Relationships: Associations, Interactions, …

• Associations
  • Correlation
• Interactions
  • DNA-Protein
  • TF-Target protein
  • Physical Protein-protein
• Metabolic Reactions
  • Substrates, Products
  • Catalysts/Enzymes
  • Metabolites
• Pathways
  • Metabolic pathways
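The relationship types listed above can all be stored as typed edges between entities, which makes them easy to explore together. This is a minimal sketch with hypothetical entity names and relation labels; it does not reflect any particular database schema.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Relation(NamedTuple):
    source: str
    target: str
    kind: str  # e.g. "TF-target", "PPI", "catalyzes", "substrate-of"


# Hypothetical entities and relations, for illustration only.
RELATIONS = [
    Relation("tfA", "geneB", "TF-target"),
    Relation("protB", "protC", "PPI"),
    Relation("protC", "rxn1", "catalyzes"),
    Relation("metX", "rxn1", "substrate-of"),
    Relation("rxn1", "pathway1", "part-of"),
]


def index_by_kind(relations: List[Relation]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (source, target) pairs by relation kind."""
    by_kind = defaultdict(list)
    for r in relations:
        by_kind[r.kind].append((r.source, r.target))
    return dict(by_kind)


def neighbors(relations: List[Relation], entity: str) -> List[Tuple[str, str]]:
    """Entities directly related to `entity`, with the relation kind."""
    found = []
    for r in relations:
        if r.source == entity:
            found.append((r.target, r.kind))
        elif r.target == entity:
            found.append((r.source, r.kind))
    return found


if __name__ == "__main__":
    print(index_by_kind(RELATIONS))
    print(neighbors(RELATIONS, "rxn1"))
```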

Page 24: Big Data Exploration in Genome-based Data Analysis

Systems biology (integrative biology): velocity

How does the cell regulate itself? Signal-response study (condition + time)

Page 25: Big Data Exploration in Genome-based Data Analysis

Integrative analysis of ~omic data

…or google us: “SpirPro SBI”

J. Senachak et al. SpirPro: A Spirulina proteome database and web-based tools for the analysis of protein-protein interactions at the metabolic level in Spirulina (Arthrospira) platensis C1. BMC Bioinformatics 2015, 16:233 doi:10.1186/s12859-015-0676-z.

Page 26: Big Data Exploration in Genome-based Data Analysis

SpirPro: proteome-effect analysis

Proteome data: temporal stress response

Snapshot interactions: PPIs surrounding an expressed protein

Page 27: Big Data Exploration in Genome-based Data Analysis

SpirPro: proteome-effect analysis

Effect on metabolism (proteome over KEGG pathways)

Dashed line for expressed enzymes

Inter-pathways (expressed protein affecting other pathways via PPI)

The figure shows only the pathway of the left-hand-side protein and all possible PPIs to other pathways
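The inter-pathway idea can be sketched as a one-step lookup: take the direct PPI partners of an expressed protein and collect the pathways they belong to. The code below is an illustration only, with hypothetical protein IDs and pathway assignments; it is not SpirPro's implementation.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy data: undirected PPI pairs and pathway membership.
PPI: Set[Tuple[str, str]] = {("P1", "P2"), ("P1", "P3"), ("P4", "P5")}
PATHWAYS: Dict[str, Set[str]] = {
    "P1": {"photosynthesis"},
    "P2": {"nitrogen metabolism"},
    "P3": {"photosynthesis", "carbon fixation"},
    "P4": {"glycolysis"},
}


def partners(protein: str, ppi: Set[Tuple[str, str]]) -> Set[str]:
    """Proteins that interact directly with `protein` (the snapshot view)."""
    return {b if a == protein else a for a, b in ppi if protein in (a, b)}


def affected_pathways(
    protein: str, ppi: Set[Tuple[str, str]], pathways: Dict[str, Set[str]]
) -> Set[str]:
    """Pathways reached through one PPI step, excluding the protein's own."""
    own = pathways.get(protein, set())
    reached: Set[str] = set()
    for partner in partners(protein, ppi):
        reached |= pathways.get(partner, set())
    return reached - own


if __name__ == "__main__":
    print(partners("P1", PPI))
    print(affected_pathways("P1", PPI, PATHWAYS))
```

Restricting the search to one PPI step matches the "snapshot" view of interactions surrounding a single expressed protein; following more steps would quickly reach most of the network.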

Protein Interaction

Page 28: Big Data Exploration in Genome-based Data Analysis

SpirPro: proteome-effect analysis

• Interactive web-based platform for browsing
• Comparative study of 52 cyanobacterial genomes
• Ortholog analysis
  • Orthologs classified by the OrthoMCL algorithm
• Protein domain analysis
  • Pfam scan V.14 with in-house script for visualization
• Protein-protein interaction
  • Inferred from yeast two-hybrid screening in Synechocystis sp. PCC6803
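One common way to infer PPIs from another organism's experimental data is to carry each interacting pair across an ortholog map. The slide does not describe the exact procedure, so the following is only a sketch of that general idea, assuming a one-to-one ortholog mapping; all protein IDs are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical IDs: experimentally observed pairs in the source organism.
SOURCE_PPI = [("syn_a", "syn_b"), ("syn_b", "syn_c"), ("syn_c", "syn_d")]

# Hypothetical one-to-one ortholog map: source protein -> target protein.
ORTHOLOGS: Dict[str, str] = {"syn_a": "spi_1", "syn_b": "spi_2", "syn_c": "spi_3"}


def transfer_ppi(
    source_ppi: Iterable[Tuple[str, str]], orthologs: Dict[str, str]
) -> Set[Tuple[str, str]]:
    """Keep a pair only when both partners have an ortholog in the target."""
    inferred = set()
    for a, b in source_ppi:
        if a in orthologs and b in orthologs:
            # Sort so that each undirected pair has a single representation.
            inferred.add(tuple(sorted((orthologs[a], orthologs[b]))))
    return inferred


if __name__ == "__main__":
    print(sorted(transfer_ppi(SOURCE_PPI, ORTHOLOGS)))
```

The pair involving `syn_d` is dropped because it has no ortholog in the target, which is the main reason an inferred network covers only part of the proteome.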

…or google us: “SpirPro SBI”

J. Senachak et al. SpirPro: A Spirulina proteome database and web-based tools for the analysis of protein-protein interactions at the metabolic level in Spirulina (Arthrospira) platensis C1. BMC Bioinformatics 2015, 16:233 doi:10.1186/s12859-015-0676-z.

Page 29: Big Data Exploration in Genome-based Data Analysis

…or google us: “CyanoCOG”

Page 30: Big Data Exploration in Genome-based Data Analysis

[with-PPI]: Snapshot Interactions

<msa>: Multi-Sequence Alignment

[click-on-image]: Gene Location on Genome

Page 31: Big Data Exploration in Genome-based Data Analysis

Example: Speeding up the analysis pipeline by applying Big Data technology

Page 32: Big Data Exploration in Genome-based Data Analysis

Chromosome | VCFtools | Impala SQL
All chromosomes | 22x60 hr | 1.6 min
Per chromosome | 16.5-110 min | 2-7 sec

Speed up! ~1000X
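The order of magnitude of the speed-up can be checked from the per-chromosome timings. This small sketch pairs the shortest VCFtools time with the shortest Impala time and the longest with the longest; that pairing is an assumption, since the slide gives only ranges.

```python
def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Ratio of the slower run time to the faster one."""
    return slow_seconds / fast_seconds


# Per-chromosome timings from the slide, converted to seconds.
vcftools_s = (16.5 * 60, 110 * 60)  # 16.5-110 min
impala_s = (2, 7)                   # 2-7 sec

# Assumption: shortest pairs with shortest, longest with longest.
low = speedup(vcftools_s[0], impala_s[0])
high = speedup(vcftools_s[1], impala_s[1])

print(f"per-chromosome speed-up: ~{low:.0f}x to ~{high:.0f}x")
```

Under that pairing the speed-up ranges from roughly 500x to just under 950x, consistent with the slide's rounded "~1000X".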

Page 33: Big Data Exploration in Genome-based Data Analysis

1.5TB (uncompressed)

The analysis pipeline for Next-gen sequence data

Page 34: Big Data Exploration in Genome-based Data Analysis

Hadoop file system speeds up variant calling

Chromosome | VCFtools | Impala SQL
All chromosomes | 22x60 hr | 1.6 min
Per chromosome | 16.5-110 min | 2-7 sec

Speed up! ~1000X
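The slide names Impala SQL but does not show a query. As an illustration of how a VCFtools-style summary can be expressed as SQL over a table of variants, here is a sketch that uses Python's built-in `sqlite3` as a stand-in so it runs anywhere. It is not Impala and says nothing about performance; the table name, column names and rows are hypothetical.

```python
import sqlite3

# Hypothetical rows: (chrom, pos, ref, alt, alt allele count, total alleles).
ROWS = [
    ("chr1", 1005, "A", "G", 12, 200),
    ("chr1", 2010, "C", "T", 150, 200),
    ("chr2", 3300, "G", "A", 3, 200),
]


def summarize(rows, min_af=0.05):
    """Count variants per chromosome with alt allele frequency >= min_af."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE variants "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, ac INTEGER, an INTEGER)"
    )
    con.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", rows)
    query = """
        SELECT chrom, COUNT(*) AS n_variants
        FROM variants
        WHERE CAST(ac AS REAL) / an >= ?
        GROUP BY chrom
        ORDER BY chrom
    """
    result = con.execute(query, (min_af,)).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    print(summarize(ROWS))
    print(summarize(ROWS, min_af=0.0))
```

Once variants sit in a table, filters and per-chromosome aggregates become single declarative queries that a distributed SQL engine can spread across the nodes of a cluster.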

Page 35: Big Data Exploration in Genome-based Data Analysis

Use BigData technology

[Figure: cluster diagram with one Master and five Nodes (each labelled “12..”), connected by 10Gbps links]

Page 36: Big Data Exploration in Genome-based Data Analysis

Conclusions

• Genome-based data as big data
• Data visualization is important for genome-based research
  • Genome browsers: multi-level data
• Multi-level -omics data integration: regulatory network
• Biological networks and interactive tool: SpirPro
• Example: Big data technology applied to genome-based analysis

bigdataexperience

http://www.sbi.kmutt.ac.th/

http://www.bioinformatics.kmutt.ac.th/

Page 37: Big Data Exploration in Genome-based Data Analysis

Acknowledgements

Proteome (Algal Biotech)

Bioinformatics & Systems Biology (BIF) program

Systems Biology & Bioinformatics (SBI)

Medical research (collab w/ hospitals)

Cell & Physiology (Algal Biotech)

Comparative Genomics Analysis & Visualization

Page 38: Big Data Exploration in Genome-based Data Analysis

• Genome bioinformatics: current successes, challenges, and opportunities in the era of low-cost sequencing

• Big data analysis in metagenomics

• Third-generation sequencing for rapid surveillance

http://www.csbio.org/2015/

Page 39: Big Data Exploration in Genome-based Data Analysis

http://academy.sbi.kmutt.ac.th/cmg2015/

Page 40: Big Data Exploration in Genome-based Data Analysis