Zaid Al-Ars
ce.ewi.tudelft.nl/zaid
Computer Engineering (CE) Lab
Delft Data Science (DDS)
Delft University of Technology (TUDelft)

Hardware Accelerated Big Data Analytics for Genomics Analysis

•  TUDelft initiative for research, education and training in data science and technology

What is Delft Data Science (DDS)

DDS Research Disciplines

•  Facebook has 300 PB of data + 500 TB/day
•  eBay has 90 PB of data + 100 PB/day
•  Google has 15 EB and processes 100 PB a day
•  Sequencing DNA of world population requires 18 EB

Scale of big data

640K ought to be enough for anybody.

[“Follow the Data” https://followthedata.wordpress.com/2014/06/24/data-size-estimates/]
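To put these figures in perspective, here is a back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1 EB = 10^18 bytes) and a world population of roughly 7.2 billion, neither of which is stated on the slide, and the helper names are illustrative.

```python
# Back-of-the-envelope arithmetic on the slide's figures.
# Assumptions (not from the slide): decimal units and a world
# population of about 7.2 billion.
PB = 10**15
EB = 10**18

WORLD_POPULATION = 7_200_000_000  # assumed value


def bytes_per_person(total_bytes: int, population: int) -> float:
    """Average storage per person if total_bytes covers everyone."""
    return total_bytes / population


def days_to_process(stored_bytes: int, bytes_per_day: int) -> float:
    """Days needed for one pass over the stored data at a daily rate."""
    return stored_bytes / bytes_per_day


per_person_gb = bytes_per_person(18 * EB, WORLD_POPULATION) / 10**9
google_days = days_to_process(15 * EB, 100 * PB)

print(f"Storage per sequenced person: {per_person_gb:.1f} GB")
print(f"Days for one pass over 15 EB at 100 PB/day: {google_days:.0f}")
```

Under these assumptions, the 18 EB estimate corresponds to about 2.5 GB per person, and a single pass over 15 EB at 100 PB a day takes about 150 days.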

•  More than 50% of the world's data was created in the last 2 years!
•  Every day, we create ~2 exabytes of data
•  Data sources are everywhere:
    -  Sensors
    -  Posts to social media sites
    -  Digital pictures and videos
    -  Transaction records
    -  Cell phone signals

Scale of big data

•  Until 2000: plant breeding by “trial & error”
•  Insight into core genomes (tomato, rice, etc.) may reverse traditional breeding workflows
•  First in silico mining for relevant genes, then data-driven crossing
•  Opportunity to develop commercially interesting varieties faster

Big data examples: genetics

•  Intensive care units (ICUs) collect data at high frequency
•  Only 3 pieces of information are stored every hour
•  19% of babies die of infections
•  Dr. Carolyn McGregor (U. of Ontario) & IBM developed a big data system that can monitor vital signs continuously
•  The system is able to predict infections 24 hours in advance

Big data examples: preterm births

•  Until 2005: high performance machines, but at high cost
•  Until 2010: Hadoop enables distributed storage and processing
•  Until 2015: faster in-memory processing with Spark, HBase, Hana

=> At scale, processing becomes the bottleneck

Today: hardware accelerated computing. GPUs and FPGAs process tasks heavily in parallel, which is far more efficient than CPUs for many data-intensive tasks

Evolution of big data systems

Figure: Evolution of Data Processing

•  1990 - 2000’s, DATA WAREHOUSE: RDBMS & Data Warehouse technologies enable organizations to store and analyze growing volumes of data on high performance machines, but at high cost.
•  2005…, DISTRIBUTED STORAGE: Hadoop and MapReduce enable distributed storage and processing across multiple machines. Storing massive volumes of data becomes more affordable, but performance is slow.
•  2010…, AFFORDABLE IN-MEMORY: Affordable memory allows for faster data read and write. HBase, Hana, MemSQL provide faster analytics. At scale, processing now becomes the bottleneck.
•  2016…, GPU ACCELERATED COMPUTE: GPU cores bulk process tasks in parallel, far more efficiently for many data-intensive tasks than CPUs, which process those tasks linearly. Infinite compute on Power hardware ushers in a new generation of possibilities….

•  Hadoop MapReduce writes intermediate results to hard disk to ensure reliability
•  Spark uses resilient distributed datasets (RDDs)
    •  Data objects spread across a cluster
    •  Automatically rebuilt on failure and can be stored in RAM
•  Spark is faster than Hadoop MapReduce for
    •  Iterative algorithms
    •  Streaming algorithms
    •  Graph-based algorithms

Systems for big data: in-memory Spark
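The "automatically rebuilt on failure" property can be illustrated with a toy model. This is a minimal sketch in plain Python and not Spark's API: the class `ToyRDD` and its methods are hypothetical names. Each dataset keeps the recipe (lineage) for computing its partitions, so a partition lost from memory is recomputed from its parent. Unlike Spark, where caching is opt-in, the toy always caches in memory.

```python
class ToyRDD:
    """Toy stand-in for an RDD: in-memory partitions plus their lineage."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute                  # partition index -> records
        self._cache = [None] * num_partitions    # in-memory copies

    @classmethod
    def from_list(cls, data, num_partitions):
        # The chunks stand for data held on durable storage.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn):
        # The child only remembers its parent and the function to apply.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)])

    def partition(self, i):
        # Recompute from lineage whenever the in-memory copy is missing.
        if self._cache[i] is None:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that wipes one cached partition."""
        self._cache[i] = None

    def collect(self):
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


numbers = ToyRDD.from_list(list(range(8)), num_partitions=4)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(2)      # "failure": partition 2 disappears from RAM
after = squares.collect()      # partition 2 is rebuilt from its lineage
print(before == after)
```

Because intermediate results stay in memory between passes, an iterative algorithm can reuse them without going back to disk on every iteration, which is the advantage the slide attributes to Spark.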

•  GPUs have 1000s of small, efficient cores
•  Suited to compute-intensive tasks that repeat similar instructions
•  This makes them well suited to the compute-intensive workloads that large data sets require
•  Examples are:
    •  Image processing
    •  Visualization

Systems for big data: GPUs
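The "repeated similar instructions" pattern can be sketched on a CPU. The example below is an illustration only, with hypothetical names (`launch`, `brighten_kernel`): one small function, the kernel, is run once per pixel, and every invocation is independent of the others. That independence is what lets a GPU spread the work across thousands of cores. Python threads model the programming pattern here and do not deliver GPU-like speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def launch(kernel, n_threads, *args):
    """Hypothetical launcher: run kernel(idx, *args) once per index."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() waits for all invocations and re-raises any error.
        list(pool.map(lambda idx: kernel(idx, *args), range(n_threads)))


def brighten_kernel(idx, pixels, out, gain):
    """Work of one 'thread': scale exactly one pixel, clamped to 255."""
    out[idx] = min(255, int(pixels[idx] * gain))


pixels = [0, 50, 100, 150, 200, 250]   # made-up 8-bit grayscale values
out = [0] * len(pixels)

launch(brighten_kernel, len(pixels), pixels, out, 1.5)
print(out)  # [0, 75, 150, 225, 255, 255]
```

No invocation reads another invocation's output, so the order in which the pixels are processed does not affect the result.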

•  Trend toward custom-made accelerators for big data
•  Used for compute-intensive parts of algorithms
    •  Matrix computation
    •  Data flow analysis
    •  Vector processing
•  Examples are:
    •  Microsoft Bing search engine
    •  IBM Netezza database acceleration

Systems for big data: FPGAs

•  Spark runs on the Java Virtual Machine (JVM)
•  Accelerators run natively
•  SparkJNI, developed by TUDelft, connects Java to the native runtime of the accelerators
•  Allows transparent native programming from Java
•  Incurs minimal performance overhead

SparkJNI: accelerators into Spark

•  Exponentially growing data volumes
•  Increasing complexity of analysis
•  Both computational and data challenges

Use case: DNA diagnostics

COMPUTE TIME
•  Urgent clinical diagnostics, for example
    •  Targeted cancer & neo-natal diagnostics
=> We provide techniques to reduce compute time

COMPUTE COST
•  Cost prohibitive for society
•  More patients & diseases to be treated
=> We provide techniques to reduce cost

Societal challenge

Delft Data Science research agenda

•  CE Lab provides a holistic approach to optimize big data infrastructure
    1.  Addressing big data storage limitations
        •  Effective compression techniques
    2.  Addressing big data computational time
        •  Acceleration of big data algorithms
    3.  Addressing big data system cost
        •  Effective utilization of system resources

Storage limitations
Computational bottlenecks
Infrastructure cost optimizations
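As one well-known illustration of compression for genomic data, a DNA sequence over the alphabet A, C, G, T can be stored in 2 bits per base instead of one 8-bit character. This is a generic textbook sketch and not necessarily a technique used by the CE Lab. The function names are illustrative, and real sequencing data also carries ambiguous bases (N) and quality scores, which this sketch ignores.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GATTACAGATTACA"
packed = pack(seq)
print(len(seq), "bases stored in", len(packed), "bytes")
print(unpack(packed, len(seq)) == seq)
```

The original length has to be stored alongside the packed bytes, because the last byte may contain padding.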

Collaboration opportunities

•  Collaborations on big data infrastructure
    •  Work together on industrially relevant challenges
    •  Transfer of expert knowledge to organizations
•  CE Lab is leading research in
    •  Pipeline-wide performance optimization
    •  Integrated system cost optimization
•  Large network of leading technology providers
    •  IBM, Intel, Xilinx, NVIDIA, etc.

Questions?

•  Zaid Al-Ars
   CE Lab / TUDelft
   Mekelweg 4, 2628 CD Delft
•  Email: [email protected]
•  Web: ce.ewi.tudelft.nl/zaid
•  Tel: 015 27 89097
