TRANSCRIPT
Hardware Accelerated Big Data Analytics for Genomics Analysis
Zaid Al-Ars ce.ewi.tudelft.nl/zaid Computer Engineering (CE) Lab Delft Data Science (DDS) Delft University of Technology (TUDelft)
What is Delft Data Science (DDS)
• TUDelft initiative for research, education and training in data science and technology
DDS Research Disciplines
Scale of big data
• Facebook has 300 PB of data + 500 TB/day
• eBay has 90 PB of data + 100 PB/day
• Google has 15 EB and processes 100 PB a day
• Sequencing the DNA of the world population requires 18 EB
“640K ought to be enough for anybody.”
[“Follow the Data”, https://followthedata.wordpress.com/2014/06/24/data-size-estimates/]
Scale of big data
• >50% of the world’s data was created in the last 2 years!
• Every day, we create ~2 exabytes of data
• Data sources everywhere:
- Sensors
- Posts to social media sites
- Digital pictures and videos
- Transaction records
- Cell phone signals
Big data examples: genetics
• Until 2000: plant breeding by “trial & error”
• Insight into core genomes (tomato, rice, etc.) may reverse traditional breeding workflows
• First in silico mining for relevant genes -> data-driven crossing
• Opportunity to develop commercially interesting varieties faster
Big data examples: preterm births
• Intensive care units (ICUs) collect data at high frequency
• Only 3 pieces of information are stored every hour
• 19% of babies die of infections
• Dr. Carolyn McGregor (U. of Ontario) & IBM developed a big data system that can monitor vital signs continuously
• The system is able to predict infections 24 hours in advance
Evolution of big data systems
• Until 2005: high-performance machines, but at high cost
• Until 2010: Hadoop enables distributed storage and processing
• Until 2015: faster in-memory processing (Spark, HBase, Hana)
=> At scale, processing becomes the bottleneck
• Today: hardware-accelerated computing; GPUs and FPGAs process tasks heavily in parallel, far more efficient for many data-intensive tasks than CPUs

[Figure: “Evolution of Data Processing” timeline
- 1990–2000s, Data Warehouse: RDBMS & data warehouse technologies enable organizations to store and analyze growing volumes of data on high-performance machines, but at high cost
- 2005, Distributed Storage: Hadoop and MapReduce enable distributed storage and processing across multiple machines; storing massive volumes of data becomes more affordable, but performance is slow
- 2010, Affordable In-Memory: affordable memory allows for faster data reads and writes; HBase, Hana and MemSQL provide faster analytics, but at-scale processing now becomes the bottleneck
- 2016, GPU-Accelerated Compute: GPU cores bulk-process tasks in parallel, far more efficient for many data-intensive tasks than CPUs, which process those tasks serially; compute on Power hardware ushers in a new generation of possibilities]
Systems for big data: in-memory Spark
• Hadoop writes to hard disk to ensure reliability
• Spark uses resilient distributed datasets (RDDs):
- Data objects spread across a cluster
- Automatically rebuilt on failure; can be stored in RAM
• Spark is faster than Hadoop MapReduce for:
- Iterative algorithms
- Streaming algorithms
- Graph-based algorithms
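The lineage-based resilience behind RDDs can be sketched in a few lines of plain Python. This is a toy model, not actual Spark code: the `ToyRDD` class and its methods are invented for illustration of the idea that a lost in-memory partition is rebuilt from its lineage rather than restored from a replica.

```python
# Toy illustration (plain Python, not Spark) of the RDD idea:
# each partition is kept in memory, but the recipe (lineage) that
# produced it is also kept, so a lost partition can be rebuilt
# instead of being replicated on disk.

class ToyRDD:
    def __init__(self, source, transform=lambda x: x):
        self.source = source          # original input partitions
        self.transform = transform    # lineage: how data is derived from source
        # materialize partitions in memory (like Spark's cache())
        self.partitions = [list(map(transform, p)) for p in source]

    def lose_partition(self, i):
        """Simulate a node failure dropping one in-memory partition."""
        self.partitions[i] = None

    def get_partition(self, i):
        if self.partitions[i] is None:
            # rebuild from lineage rather than from a replica
            self.partitions[i] = list(map(self.transform, self.source[i]))
        return self.partitions[i]

rdd = ToyRDD([[1, 2], [3, 4]], transform=lambda x: x * 10)
rdd.lose_partition(1)
print(rdd.get_partition(1))  # rebuilt from lineage: [30, 40]
```

Because rebuilding only needs the lineage, the data can stay in RAM without costly replication, which is what makes the in-memory approach both fast and fault-tolerant.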
Systems for big data: GPUs
• GPUs have 1000s of small, efficient cores
• Suited for compute-intensive tasks that repeat similar instructions
• This makes them well-suited to the compute-intensive workloads of large data sets
• Examples are:
- Image processing
- Visualization
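The GPU execution model, one small kernel applied identically to many data elements, can be loosely mimicked on the CPU as a sketch. This is an analogy only, not GPU code; the `kernel` function and pixel values are made up for illustration.

```python
# GPUs apply the same instruction across thousands of data elements
# (data parallelism). As a rough CPU-side analogy, a process pool
# applies one identical kernel function to many elements in parallel.
from concurrent.futures import ProcessPoolExecutor

def kernel(pixel):
    # the "same instruction" run on every element,
    # e.g. doubling brightness with saturation at 255
    return min(255, pixel * 2)

pixels = list(range(8))
with ProcessPoolExecutor() as pool:
    out = list(pool.map(kernel, pixels))
print(out)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

A real GPU runs thousands of such kernel instances simultaneously on dedicated cores, which is why the image-processing and visualization workloads above map onto it so well.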
Systems for big data: FPGAs
• Trend towards custom-made accelerators for big data
• Used for compute-intensive parts of algorithms:
- Matrix computation
- Data flow analysis
- Vector processing
• Examples are:
- Microsoft Bing search engine
- IBM Netezza database acceleration
SparkJNI: accelerators into Spark
• Spark runs on the Java VM; accelerators run natively
• SparkJNI, developed at TUDelft, connects Java code to native processing runtimes
• Allows transparent native programming from Java
• Incurs minimal performance overhead
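The managed-to-native bridging that SparkJNI performs via JNI can be illustrated with the analogous mechanism in another managed runtime: Python's ctypes calling into a native shared library. This is an analogy only, not SparkJNI code, and it assumes a Unix-like system where the native math library can be located.

```python
# SparkJNI bridges JVM code to native accelerator code via JNI.
# The same managed-to-native pattern, shown here with Python's ctypes:
# declare the native function's signature, then call it directly
# from the managed runtime.
import ctypes
import ctypes.util

# locate and load the system's native math library (libm on Linux)
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(16.0))  # 4.0 -- computed in native code, not in Python
```

As in JNI, the cost is a small per-call overhead for crossing the runtime boundary; the payoff is that the heavy computation runs at native (or accelerator) speed.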
Use case: DNA diagnostics
• Exponentially growing data volumes
• Increasing complexity of analysis
• Both computational and data challenges
Societal challenge
• Compute time: urgent clinical diagnostics, for example targeted cancer & neo-natal diagnostics
=> We provide techniques to reduce compute time
• Compute cost: cost prohibitive for society; more patients & diseases to be treated
=> We provide techniques to reduce cost
Delft Data Science research agenda
• CE Lab provides a holistic approach to optimize big data infrastructure:
1. Addressing big data storage limitations
- Effective compression techniques
2. Addressing big data computational time
- Acceleration of big data algorithms
3. Addressing big data system cost
- Effective utilization of system resources
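Why compression addresses the storage limitation is easy to demonstrate with a toy example using a general-purpose compressor. This is a sketch only: the sequence is made up, and real genomic pipelines use specialized formats and codecs that exploit the structure of DNA data far better than zlib does.

```python
# Sketch of why compression matters for genomic storage: DNA text has
# a 4-letter alphabet and much repetition, so even a general-purpose
# compressor shrinks it substantially (specialized genomic compressors
# do considerably better).
import zlib

dna = ("ACGT" * 250 + "GATTACA" * 100).encode()  # toy sequence, 1700 bytes
packed = zlib.compress(dna, level=9)
print(len(dna), len(packed))            # compressed size is a small fraction
assert zlib.decompress(packed) == dna   # compression is lossless
```

Scaled up to the 18 EB estimate for sequencing the world population, even modest compression ratios translate into enormous infrastructure savings.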
Collaboration opportunities
• Collaborations on big data infrastructure
- Work together on industrially relevant challenges
- Transfer of expert knowledge to organizations
• CE Lab is leading research in:
- Pipeline-wide performance optimization
- Integrated system cost optimization
• Large network of leading technology providers:
- IBM, Intel, Xilinx, NVidia, etc.
Questions?
• Zaid Al-Ars, CE Lab / TUDelft, Mekelweg 4, 2628 CD Delft
• Email: [email protected]
• Web: ce.ewi.tudelft.nl/zaid
• Tel: 015 27 89097