Hardware Accelerated Big Data Analytics for Genomics Analysis

Zaid Al-Ars (ce.ewi.tudelft.nl/zaid)
Computer Engineering (CE) Lab, Delft Data Science (DDS), Delft University of Technology (TUDelft)

Posted on 25-Jul-2020

What is Delft Data Science (DDS)

• TUDelft initiative for research, education and training in data science and technology

DDS Research Disciplines

Scale of big data

• Facebook has 300 PB of data + 500 TB/day
• eBay has 90 PB of data + 100 PB/day
• Google has 15 EB and processes 100 PB a day
• Sequencing DNA of world population requires 18 EB

640K ought to be enough for anybody.

[“Follow the Data”, https://followthedata.wordpress.com/2014/06/24/data-size-estimates/]
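To put the 18 EB figure in perspective, the short calculation below divides it by the world population. It is a back-of-the-envelope sketch only: the population value of 7.2 billion is an assumption (it is not given on the slide), and decimal units (1 EB = 10^18 bytes) are assumed as well.

```python
EXABYTE = 10**18   # decimal exabyte in bytes (assumed unit convention)
GIGABYTE = 10**9   # decimal gigabyte in bytes

total_bytes = 18 * EXABYTE   # figure quoted on the slide
world_population = 7.2e9     # ASSUMPTION: not stated in the source

# Implied storage per sequenced person.
per_person_gb = total_bytes / world_population / GIGABYTE
print(f"{per_person_gb:.1f} GB per person")
```

Under these assumptions the slide's figure works out to roughly 2.5 GB per person.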

Scale of big data

• >50% of the world's data was created in the last 2 years!
• Every day, we create ~2 exabytes of data
• Data sources everywhere:
  - Sensors
  - Posts to social media sites
  - Digital pictures and videos
  - Transaction records
  - Cell phone signals

Big data examples: genetics

• Until 2000: plant breeding by “trial & error”
• Insight into core genomes (tomato, rice, etc.) may reverse traditional breeding workflows
• First in silico mining for relevant genes -> data-driven crossing
• Opportunity to develop commercially interesting varieties faster

Big data examples: preterm births

• Intensive care units (ICUs) collect data with high frequency
• Only 3 pieces of info are stored every hour
• 19% of babies die of infections
• Dr. Carolyn McGregor (U. of Ontario) & IBM developed a big data system that can monitor vital signs continuously
• System is able to predict infections 24 hours in advance

Evolution of big data systems

• Until 2005: high performance machines, but at high cost
• Until 2010: Hadoop enables distributed storage and processing
• Until 2015: Faster in-memory processing with Spark, HBase, Hana

=> At scale, processing becomes the bottleneck

Today: Hardware accelerated computing. GPUs and FPGAs process tasks heavily in parallel: far more efficient for many data-intensive tasks than CPUs

[Figure: Evolution of Data Processing]

• 1990 - 2000’s, DATA WAREHOUSE: RDBMS & Data Warehouse technologies enable organizations to store and analyze growing volumes of data on high performance machines, but at high cost.
• 2005…, DISTRIBUTED STORAGE: Hadoop and MapReduce enable distributed storage and processing across multiple machines. Storing massive volumes of data becomes more affordable, but performance is slow.
• 2010…, AFFORDABLE IN-MEMORY: Affordable memory allows for faster data read and write. HBase, Hana, MemSQL provide faster analytics. At scale, processing now becomes the bottleneck.
• 2016…, GPU ACCELERATED COMPUTE: GPU cores bulk process tasks in parallel - far more efficient for many data-intensive tasks than CPUs, which process those tasks linearly. Infinite compute on Power hardware ushers in a new generation of possibilities….

Systems for big data: in-memory Spark

• Hadoop runs on hard disk to ensure reliability
• Spark uses resilient distributed datasets (RDDs)
  - Data objects spread across a cluster
  - Automatically rebuilt on failure and can be stored in RAM
• Spark is faster than Hadoop MapReduce for
  - Iterative algorithms
  - Streaming algorithms
  - Graph-based algorithms
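The RDD idea on this slide (partitions kept in RAM and rebuilt automatically on failure) can be illustrated with a toy model. The sketch below is not Spark's API: `ToyRDD` and its methods are hypothetical names, everything runs in a single process, and the source partitions are assumed to remain available, where a real cluster would re-read them from durable storage.

```python
class ToyRDD:
    """A dataset split into partitions that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: the per-element transformation
        # In-memory copies of partitions, keyed by partition index.
        self.cache = dict(enumerate(partitions)) if partitions is not None else {}

    def map(self, fn):
        # Lazy: only record the transformation, compute nothing yet.
        return ToyRDD(parent=self, fn=fn)

    def partition(self, index):
        # Serve from RAM if present; otherwise rebuild from the parent.
        if index not in self.cache:
            if self.parent is None:
                raise KeyError(f"source partition {index} is lost")
            self.cache[index] = [self.fn(x) for x in self.parent.partition(index)]
        return self.cache[index]

    def collect(self, n_partitions):
        return [x for i in range(n_partitions) for x in self.partition(i)]


source = ToyRDD(partitions=[[1, 2], [3, 4]])
squares = source.map(lambda x: x * x)
print(squares.collect(2))  # computes and caches both partitions
del squares.cache[1]       # simulate losing one in-memory partition
print(squares.collect(2))  # partition 1 is rebuilt from its lineage
```

Both calls print the same result. Only the lost partition is recomputed, which is the property that lets data stay in RAM without a disk copy of every intermediate result.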

Systems for big data: GPUs

• GPUs have 1000s of small, efficient cores
• Suited to compute-intensive tasks that repeat similar instructions
• This makes them well-suited to the compute-intensive workloads required of large data sets
• Examples are:
  - Image processing
  - Visualization
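The "repeated similar instructions" pattern can be sketched as a kernel: one small function that every logical thread runs on its own data element. The example below is a sequential Python stand-in, not GPU code; `brighten_kernel` and `launch` are hypothetical names, and the loop in `launch` only emulates what a GPU would execute across many cores at once.

```python
def brighten_kernel(thread_id, pixels, out, delta):
    """Work done by ONE logical thread: the same instructions for every pixel."""
    out[thread_id] = min(255, pixels[thread_id] + delta)


def launch(kernel, n_threads, *args):
    """Sequential stand-in for a GPU launch (a GPU runs these threads in parallel)."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


pixels = [0, 100, 200, 250]  # toy grayscale image, one value per pixel
out = [0] * len(pixels)
launch(brighten_kernel, len(pixels), pixels, out, 40)
print(out)
```

Because no thread reads another thread's output, the iterations are independent, which is what makes this kind of workload a good fit for thousands of small cores.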

Systems for big data: FPGAs

• Trend toward custom-made accelerators for big data
• Used for compute-intensive parts of algorithms
  - Matrix computation
  - Data flow analysis
  - Vector processing
• Examples are:
  - Microsoft Bing search engine
  - IBM Netezza database acceleration

SparkJNI: accelerators into Spark

• Spark runs on the Java Virtual Machine (JVM)
• Accelerators run natively
• SparkJNI, developed by TUDelft, connects Java to native runtime processors
• Allows transparent native programming from Java
• Incurs minimal overhead on performance

Use case: DNA diagnostics

• Exponentially growing data volumes
• Increasing complexity of analysis
• Both computational and data challenges

Societal challenge

• Urgent clinical diagnostics, for example
  - Targeted cancer & neo-natal diagnostics
  => We provide techniques to reduce compute time
• Cost prohibitive for society
  - More patients & diseases to be treated
  => We provide techniques to reduce cost

[Figure labels: COMPUTE COST, COMPUTE TIME]

Delft Data Science research agenda

• CE Lab provides a holistic approach to optimize big data infrastructure
  1. Addressing big data storage limitations
     - Effective compression techniques
  2. Addressing big data computational time
     - Acceleration of big data algorithms
  3. Addressing big data system cost
     - Effective utilization of system resources

[Figure labels: Storage limitations, Computational bottlenecks, Infrastructure cost optimizations]
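The slides do not say which compression techniques are used. As one illustration of why genomic data compresses well, the sketch below packs each DNA base into 2 bits instead of one 8-bit character. This is a generic textbook encoding chosen for illustration, not the CE Lab's method; the names `pack` and `unpack` are hypothetical, and real sequencing files also carry ambiguous bases (N) and quality scores, which this sketch ignores.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(sequence):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        packed[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from a packed byte string."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(length)
    )


read = "GATTACAGATTACA"
packed = pack(read)
assert unpack(packed, len(read)) == read  # the round trip is lossless
print(len(read), "bases ->", len(packed), "bytes")
```

The sequence length has to be stored alongside the packed bytes, because the last byte may be only partly filled.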

Collaboration opportunities

• Collaborations on big data infrastructure
  - Work together on industrially relevant challenges
  - Transfer of expert knowledge to organizations
• CE Lab is leading research in
  - Pipeline-wide performance optimization
  - Integrated system cost optimization
• Large network of leading technology providers
  - IBM, Intel, Xilinx, NVIDIA, etc.

Questions?

• Zaid Al-Ars, CE Lab / TUDelft, Mekelweg 4, 2628 CD Delft
• Email: [email protected]
• Web: ce.ewi.tudelft.nl/zaid
• Tel: 015 27 89097