Peter Boncz Large-Scale Analytical Data Management

Upload: data-science-research-center

Post on 14-Jun-2015


DESCRIPTION

Peter Boncz (http://homepages.cwi.nl/~boncz/) describes the challenges that data poses for data management systems. He describes his links to other computer science disciplines within the DSRC and, importantly, outlines the need to train data scientists.

TRANSCRIPT

Page 1: Large Scale Analytical Data Management

Peter Boncz

Large-Scale Analytical Data Management

Page 2: Large Scale Analytical Data Management

Database Research

Data Mgmt Systems Research
• SIGMOD, TODS, PVLDB, ICDE, VLDBJ
– major industry connections (billion$/y)

Expanding Topic set & Societal Impact
– Data Stream Processing
– Data Mining
– Information Extraction, Text Retrieval
– RDF and Graph data management
– MapReduce + Cloud
– Data Privacy

Page 3: Large Scale Analytical Data Management

DB Research Highlights (1/4)

Data Storage and Query – efficiency/scalability
• Computer architecture vs DBMS architecture

http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster

Page 4: Large Scale Analytical Data Management

DB Research Highlights (1/4)

Data Storage and Query – efficiency/scalability
• Computer architecture vs DBMS architecture
– Columnar storage
– Fast Compression Methods
– Differential Storage Techniques (Positional Delta Trees)
– Vectorized Execution
• http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
– Robust Query Execution (“micro adaptivity”)
– Just-In-Time (JIT) Compilation
– Cooperative Scans – sharing scarce I/O bandwidth
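The "Columnar storage" and "Vectorized Execution" bullets above can be illustrated with a toy pipeline. This is a minimal sketch in plain Python, not the MonetDB or Vectorwise implementation: each operator consumes and produces small vectors of values from a single column instead of one tuple at a time. The names `scan`, `select_greater`, `sum_vectors` and the constant `VECTOR_SIZE` are hypothetical and chosen only for this example.

```python
from typing import Iterable, Iterator, List

# Hypothetical vector size; a real engine would tune this to the CPU cache.
VECTOR_SIZE = 4


def scan(column: List[int], vector_size: int = VECTOR_SIZE) -> Iterator[List[int]]:
    """Yield one column in fixed-size vectors rather than value by value."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]


def select_greater(vectors: Iterable[List[int]], threshold: int) -> Iterator[List[int]]:
    """Filter each vector in one tight loop; skip vectors that become empty."""
    for vec in vectors:
        out = [v for v in vec if v > threshold]
        if out:
            yield out


def sum_vectors(vectors: Iterable[List[int]]) -> int:
    """Aggregate vector by vector."""
    total = 0
    for vec in vectors:
        total += sum(vec)
    return total


# A single column stored contiguously, as in a column store.
prices = [5, 12, 7, 30, 18, 2, 25, 9, 40]

# Roughly: SELECT SUM(price) FROM t WHERE price > 10
result = sum_vectors(select_greater(scan(prices), 10))
print(result)  # 125
```

The per-operator call overhead is paid once per vector instead of once per value, which is the intuition behind the technique; the actual gains in a real system depend on compiled inner loops and cache behaviour that this sketch does not model.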

Page 5: Large Scale Analytical Data Management

DB Research Highlights (2/4)

Commodity Cluster Computing - Cloud
• Various MonetDB Cluster Projects
– Shared-nothing data storage, query optimization
• Hadoop VectorWise (VU MSc projects)
– cluster scalability & failover
– Tightly integrated Hadoop/YARN/HDFS
• CWI scilens cluster
– Amdahl number >1: large I/O resources
– Other uses: web crawl analysis, 500 billion triple BI BSBM benchmark
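For readers unfamiliar with the "Amdahl number" mentioned above: it is commonly defined as bits of sequential I/O per second divided by instructions per second, so a value near 1 indicates a machine whose I/O keeps pace with its CPUs. The sketch below assumes that definition; the hardware figures are hypothetical and are not measurements of the scilens cluster.

```python
def amdahl_number(io_bytes_per_sec: float, instructions_per_sec: float) -> float:
    """Bits of sequential I/O per second, per instruction per second."""
    return (io_bytes_per_sec * 8) / instructions_per_sec


# Hypothetical node: 1 GB/s of sequential I/O and 8 billion instructions/s.
balanced = amdahl_number(1e9, 8e9)

# Hypothetical compute-heavy node: same CPU rate, only 100 MB/s of I/O.
io_starved = amdahl_number(100e6, 8e9)

print(balanced, io_starved)
```

A cluster built for data-intensive work aims for the first kind of ratio; a typical compute cluster looks more like the second.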

Page 6: Large Scale Analytical Data Management

DB Research Highlights (3/4)

Adaptive Indexing
• DBA expertise extremely scarce
• Science workloads hard to predict & variable

Database Cracking: “every query is advice on how to store the data”
continuous self-steering data reorganization

+ Approximate Query Execution on Samples
+ Recycling – exploit overlap in workloads
+ Fingerprint Indexing – exploit local correlations
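The database cracking idea on this slide, where each query reorganizes the data a little, can be sketched as follows. This is a simplified illustration under stated assumptions, not the MonetDB implementation: the class `CrackedColumn` and its methods are hypothetical names, and the partitioning step builds new lists rather than swapping in place as a real cracker would.

```python
import bisect
from typing import List


class CrackedColumn:
    """Toy cracker index over a copy of one column.

    Invariant: for every recorded crack (pivot, pos),
    all of values[:pos] are < pivot and all of values[pos:] are >= pivot.
    """

    def __init__(self, values: List[int]) -> None:
        self.values = list(values)      # the copy that gets reorganized
        self.pivots: List[int] = []     # sorted pivot values seen so far
        self.positions: List[int] = []  # split position for each pivot

    def _crack(self, pivot: int) -> int:
        """Partition only the piece that contains `pivot`; return the split."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]    # already cracked here, no work needed
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        pos = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, pos)
        return pos

    def range_query(self, low: int, high: int) -> List[int]:
        """Return all values v with low <= v < high, cracking as a side effect."""
        start = self._crack(low)
        end = self._crack(high)
        return self.values[start:end]


col = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
first = col.range_query(5, 13)   # cracks the column at 5 and 13
second = col.range_query(2, 8)   # touches only the pieces around 2 and 8
print(sorted(first), sorted(second), col.pivots)
```

No index is built up front: the first query pays for a coarse partitioning, and later queries refine only the pieces they touch, so the column drifts toward sorted order in the ranges the workload actually asks about.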

Page 7: Large Scale Analytical Data Management

DB Research Highlights (4/4)

Support for non-tabular data
• Text (retrieval)
• Scientific
– Data vaults: directly query FITS, GeoTIFF, BEM, MSEED,..
– SciQL: Arrays as 1st class database objects
– MonetDB.R: using columns as arrays (and vice versa)
• Semantic Data – RDF
– “automatically discovering schemas in LOD data”
• Bridge gap between RDF and relational
• Graph Data Management
– Benchmark development

Page 8: Large Scale Analytical Data Management

Application Areas

– Business Intelligence
• Marketing/Sales, Fraud Detection, Churn (spin-offs)
• Social network analysis (LDBC)
– Security
• Digital Forensics (NFI - XIRAF)
• ...
– Science
• Astronomy (LOFAR transient search)
• Meteorology (Earthquake Analysis - KNMI)
– Linked Data
• Open government (LOD2)

Page 9: Large Scale Analytical Data Management

Areas of Activity

[Diagram. Layers: Data; Store and process; Analyze and model; Understand and decide. Labelled areas: Reasoning, Knowledge representation, Multimedia Retrieval, Modeling and simulation, Machine Learning, Information Retrieval, Decision Theory, Business Analytics, Visual Analytics, Distributed Processing, Large Scale Databases, Software Eng., System / Network Eng.]

Page 10: Large Scale Analytical Data Management

Data Science Education

enormous demand for (“big”) data scientists
• Possibilities/limitations of wide array of techniques
– Information extraction, cleaning
– Ranking, retrieval
– Data Mining, and its applications
– DB principles (Q-opt, query processing algorithms, storage techniques)
• Understand key performance factors
– Latency vs bandwidth
– Networks, computer architecture
– algorithm optimization techniques
• Practical skills
– Modern Software engineering methods
– Rapid prototyping languages
– Solving problems using Hadoop clusters

proposal: “Extreme Data Management” MSc course
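The "Latency vs bandwidth" item above can be made concrete with the usual first-order cost model, in which the time for one transfer is the fixed latency plus size divided by bandwidth. The sketch below assumes that model; the latency and bandwidth figures are hypothetical round numbers, not measurements of any particular device.

```python
def transfer_time(n_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """First-order cost model: fixed latency plus size over bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


# Hypothetical device: 5 ms per request, 100 MB/s sustained throughput.
LATENCY_S = 0.005
BANDWIDTH = 100e6

# Reading 1000 pages of 4 KiB as separate requests vs. as one sequential read.
many_small = 1000 * transfer_time(4096, LATENCY_S, BANDWIDTH)
one_large = transfer_time(1000 * 4096, LATENCY_S, BANDWIDTH)

print(f"1000 small reads: {many_small:.3f} s, one large read: {one_large:.3f} s")
```

Under these assumed numbers the small reads are dominated by latency and the single large read by bandwidth, which is why scan-oriented analytical systems favour large sequential accesses.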

Page 11: Large Scale Analytical Data Management

Opportunities: CWI

• Database Architecture Group
– research, application, data science experience
– MonetDB, Vectorwise technologies
– Scilens: data-intensive large compute cluster
• CWI motivators
– Dual Appointments
– Data Science MSc education
• Attracting top students into MSc projects / PhD
– DSRC co-positioning in future research funding

Page 12: Large Scale Analytical Data Management

Conclusion

• Database research present in Amsterdam
– research, application, valorisation
• Data Science Education!
– Proposal: Extreme Data Management course
• ..DSRC and the CWI..