Big Data Analytics and Data Warehousing with Data Cubes Carlos Ordonez University of Houston ATT Research Labs NY

Post on 30-Dec-2015


Page 1: Big Data Analytics and Data Warehousing with Data Cubes Carlos Ordonez University of Houston ATT Research Labs NY

Big Data Analytics and Data Warehousing with Data Cubes

Carlos Ordonez University of Houston ATT Research Labs NY

Page 2

Goals of talk

1. Big Data

2. Cubes

3. Highlight some of my “cubic” research


Page 3

Big Data: what is different from large databases? VVV+V

• Variety:
  – Loosely specified or no schema
  – Storage: Record -> files; data types -> any digital content
• Volume:
  – Higher volume, including streaming
  – Multiple levels of granularity
• Velocity: speed of arrival/processing
• Veracity: Internet, multiple versions of data

Page 4

Big Data: specific technical details

• Integration needed!

• Finer granularity than transactions; give up ACID?

• cannot be directly analyzed: pre-processing

• Diverse data sources, beyond relational/alphanumeric

• Volume requires parallelism

• Skip ETL: load files directly

• Web logs, user interactions, social nets, streams

• Still only HDD provides capacity and good $; SSD $$; future non-volatile RAM?


Page 5

Current technologies for Big Data

• DBMS
  – Row
  – Column
  – Other: array, XML, Datalog is back
• Hadoop stack
  – Apache: more important than GNU in the corporate world
  – MapReduce: forgotten?; HDFS: dominant
  – Hive; SPARQL; Cassandra, Impala, Cask
  – Many more: open-source

Page 6

Data Warehouse versus Data Lake (Data Swamp?)

Feature        | Data Warehouse                | Data Lake
Database model | ER model                      | None
ETL            | Involved, data transformation | Copy file
Querying       | SQL                           | SPARQL, Java program, SQL?
Grow # nodes   | Difficult, $$$                | Easy, $$

Page 7

Big Data Analytics Processing

Data Profiling
• Data Exploration; univariate stats
• Data Preparation

Analytic Modeling
• Multivariate Statistics
• Machine Learning Algorithms

Model Deployment
• Scoring
• Lifecycle Maintenance

Highly Iterative Process

Page 8

Processing: DBMS versus Hadoop

Task                                           | SQL     | Hadoop/noSQL
Available sequential open-source               | y       | y
Parallel open-source                           | n       | y
Fault tolerant on long jobs                    | n       | y
Libraries                                      | limited | many
Arrays and matrices                            | limited | good
Massive parallelism (# servers, 1000s of CPUs) | n       | y

Page 9

Some cons about big data not using DBMS technology

• No SQL
• No model, no DDL
• No consistency, although transactions are too stringent
• Web-scale data tough, but not universal
• Database integration and cleaning much harder
• Parallel processing with too much hardware
• Fact: SQL remains main query mechanism

Page 10

Why analysis inside a DBMS?

[Diagram labels: Teradata; Your PC with Warehouse Miner; ODBC]

• Huge data volumes: potentially better results with larger amounts of data; less processing time
• Minimizes data redundancy; eliminates proprietary data structures; simplifies data management; security
• Caveats: SQL, limited statistical functionality, complex DBMS architecture

Page 11

DBMS Sequential vs Parallel Physical Operators

• Serial DBMS (one CPU, RAID):
  – table scan
  – join: hash join, sort-merge join, nested loop
  – external merge sort
• Parallel DBMS (shared-nothing):
  – even row distribution, hashing
  – parallel table scan
  – parallel joins: large/large (sort-merge, hash); large/short (replicate short)
  – distributed sort
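The shared-nothing operators on this slide can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not how any particular parallel DBMS implements them: the "nodes" are plain Python lists, rows are distributed by hashing a key column, each node scans and aggregates its own partition, and the large/short join replicates the short table to every node. The function names (`hash_partition`, `parallel_sum`, `replicated_join`) and the sample tables are hypothetical.

```python
from collections import defaultdict

def hash_partition(rows, key, num_nodes):
    """Distribute rows across nodes by hashing the key column (shared-nothing)."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

def parallel_sum(nodes, group_col, measure_col):
    """Each node aggregates its own partition; partial results are then merged."""
    partials = []
    for partition in nodes:          # conceptually runs on every node at once
        local = defaultdict(float)
        for row in partition:        # local table scan
            local[row[group_col]] += row[measure_col]
        partials.append(local)
    merged = defaultdict(float)
    for local in partials:           # final merge step at a coordinator
        for group, total in local.items():
            merged[group] += total
    return dict(merged)

def replicated_join(nodes, short_table, key):
    """Large/short join: copy the short table to every node and probe locally."""
    lookup = {row[key]: row for row in short_table}   # replicated hash table
    joined = []
    for partition in nodes:
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**match, **row})
    return joined

# Hypothetical sample data: a "large" fact table and a "short" lookup table.
sales = [
    {"store": 1, "amount": 10.0},
    {"store": 2, "amount": 5.0},
    {"store": 1, "amount": 2.5},
    {"store": 3, "amount": 7.0},
]
stores = [{"store": 1, "city": "Houston"}, {"store": 2, "city": "Austin"}]

nodes = hash_partition(sales, "store", num_nodes=3)
totals = parallel_sum(nodes, "store", "amount")
joined = replicated_join(nodes, stores, "store")
```

Hashing on the grouping or join key keeps all rows with the same key on one node, which is why the local aggregates only need a cheap merge at the end.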

Page 12

Big Data Analytics Overview

• Simple:
  – Ad-hoc queries
  – Cubes: OLAP, MOLAP, includes descriptive statistics: histograms, means, plots, statistical tests
• Complex:
  – Statistical and Machine Learning Models
  – Patterns: Graphs subsuming other problems

Page 13

Cube Processing Input

• Data set F: n records, d dimensions, e measures
• Dimensions: discrete, measures: numeric
• Focus of the talk, d dimensions
• I/O bottleneck:
• Cube: lattice of d dimensions
• High d harder than n
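As an illustration of the lattice, the following Python sketch enumerates every cuboid (every subset of the dimensions) level by level for a small d. The dimension names are made up for the example, and the helper names are hypothetical. With d dimensions the lattice has 2^d cuboids, which is one reason a high d is harder to handle than a large n.

```python
from itertools import combinations

def cube_lattice(dimensions):
    """Return the cuboids of the lattice grouped by level (number of dimensions)."""
    levels = []
    for k in range(len(dimensions) + 1):
        levels.append([tuple(c) for c in combinations(dimensions, k)])
    return levels

def parents(cuboid, dimensions):
    """Cuboids one level up: the given cuboid plus one extra dimension."""
    extra = [dim for dim in dimensions if dim not in cuboid]
    return [tuple(sorted(cuboid + (dim,), key=dimensions.index)) for dim in extra]

dims = ["store", "product", "month"]   # hypothetical dimension names
lattice = cube_lattice(dims)
num_cuboids = sum(len(level) for level in lattice)

for k, level in enumerate(lattice):
    print(k, level)
```

Adding one more dimension doubles `num_cuboids`, whereas adding more records only lengthens each scan.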

Page 14

Cube computations

• Explore lattice of dimensions
• Large n: F cannot fit in RAM, minimize I/O
• Multidimensional
  – d: tens, maybe hundreds of dimensions
• Internally computed with data structures

Page 15

Cube algorithms: elevator story

• Behavior with respect to data set X:
  – Level-wise: k passes
• Time complexity bottleneck d: O(n 2^d)
• Cubes research today:
  – Parallel processing
  – Data structures incompatible with relational DB
  – different time complexity in SQL/Hadoop/MapReduce
  – Incremental and online

Page 16

Cubes inside DBMS: more involved

• Assumption:
  – data records are in the DBMS; exporting slow
  – row-based or column-based storage
• Programming alternatives:
  – SQL and UDFs: SQL code generation (JDBC), precompiled UDFs. Extra: SP, embedded SQL, cursors
  – Internal C code (direct access to file system and memory)
• DBMS advantages:
  – Columns 10X faster: compression + efficient projection
  – Important: storage, queries, security
  – Maybe: recovery, concurrency control, integrity, transactions (i.e. some ACID ok)

Page 17

Cubes outside DBMS: alternatives

• Hadoop: dump data to Data Lake; SQL-like later
• MOLAP tools:
  – Push hard aggregations with SQL
  – Memory-based lattice traversal
  – Interaction with spreadsheets
• Imperative programming languages instead of SQL: C++, Java
  – Arrays, functions, modules, classes
  – flexibility of control statements

Page 18

Cube Processing Optimizations: Algorithmic & Systems

• Algorithmic (90% research, but not in a DBMS)
  – accelerate/reduce cube computations
  – database systems focus: reduce I/O passes
  – approximate solutions: good for count(*), sum() looked at with suspicion
  – parallel
• Systems (SQL, Hadoop, MapReduce, Libraries)
  – Platform: parallel DBMS server vs cluster of computers vs multicore CPUs
  – Programming: SQL/C++ versus Java

Page 19

Research Highlights: research with my students

• Comprehensive
  – Modeling
  – Query processing
  – Visualization
• Biased
  – Motivated by DOLAP!
  – Influenced by Stonebraker
  – Mostly with my students
  – Hadoop ignored

Page 20

A glimpse

• Preparing and cleaning data takes time: ETL
• Lots of SQL and scripts written to prepare data sets for statistical analysis
• Data quality was hot; worth revisiting w/big data
• Graph analytics
• Cube computation is the most researched topic; cube result analysis/interpretation 2nd priority
• Is “Big data” different?

Page 21

SQL and ER: can they get closer?

• Goal: creating a data set X with d dimensions D(K,A), K commonly a single id
• Lots of SQL queries, many temporary tables
• Users do not like to look at someone else’s SQL code
• Decoupled from ER model, not reused
• Many transformations: cubes, variable creation, even math transformation for statistical analysis

Page 22

Representing Data Transformations done with SQL queries


Page 23

SQL transformations in ER


Page 24

Extended ER zoom in


Page 25

Referential Integrity QMs


Page 26

SQL Optimizations: Queries vs UDFs

• SQL query optimization
  – mathematical equations as queries
  – Turing-complete: SQL code generation and programming language
• UDFs as optimization
  – substitute difficult/slow math computations
  – push processing into RAM
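Both ideas on this slide can be sketched in a few lines of Python. The example below is an illustration under assumptions, not the author's system: it uses the standard-library `sqlite3` module as a stand-in for a DBMS, a host program generates one GROUP BY query per cuboid (SQL code generation), and a scalar UDF named `squared_diff` (a hypothetical name) moves a small math step out of SQL and into memory. The table `F` and its columns are made up for the example.

```python
import sqlite3
from itertools import combinations

def generate_cube_sql(table, dimensions, measure):
    """Generate one GROUP BY query per cuboid; the SQL text is built by the program."""
    queries = []
    for k in range(len(dimensions), -1, -1):
        for cuboid in combinations(dimensions, k):
            # Dimensions that are aggregated away are reported as NULL.
            select_cols = [d if d in cuboid else f"NULL AS {d}" for d in dimensions]
            sql = f"SELECT {', '.join(select_cols)}, SUM({measure}) AS total FROM {table}"
            if cuboid:
                sql += f" GROUP BY {', '.join(cuboid)}"
            queries.append(sql)
    return queries

def squared_diff(x, mean):
    """Scalar UDF: a math step evaluated in memory instead of in SQL."""
    return (x - mean) ** 2

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (store TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO F VALUES (?, ?, ?)", [
    ("s1", "p1", 10.0), ("s1", "p2", 4.0), ("s2", "p1", 6.0),
])
conn.create_function("squared_diff", 2, squared_diff)

# Run the generated queries and collect every cube cell.
queries = generate_cube_sql("F", ["store", "product"], "amount")
cube_rows = []
for sql in queries:
    cube_rows.extend(conn.execute(sql).fetchall())

# Call the UDF from SQL: sum of squared deviations from the mean.
(mean,) = conn.execute("SELECT AVG(amount) FROM F").fetchone()
(ssd,) = conn.execute("SELECT SUM(squared_diff(amount, ?)) FROM F", (mean,)).fetchone()
```

Generating the queries keeps all processing inside the database, at the cost of one scan per cuboid; the UDF route trades portability for speed on computations that are awkward to express in SQL.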

Page 27

SQL Query Processing

• Columns will take over rows [Stonebraker]
  – Vertica and MonetDB “pure” column
  – Hybrid: Oracle Exadata, Teradata, SQL Server indexes
• But a lot of work to do
  – OLTP: rows, not columns (slow conversion)
  – still a lot of data warehouses working in row form
  – Many external tools store by row

Page 28

Horizontal aggregations

• Create cross-tabular tables from cube
• PIVOT requires knowing values
• Aggregations in horizontal layout
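A cross-tabular result can be produced without listing the pivot values by hand if the values are first read from the data and the query is then generated. The Python sketch below shows one common way to do this, with one SUM(CASE ...) column per distinct value. It is an assumption about technique for illustration only, not necessarily how the author's horizontal aggregation operator works. It uses `sqlite3` as a stand-in DBMS; the table, columns, and function name are hypothetical.

```python
import sqlite3

def horizontal_aggregation_sql(conn, table, row_col, pivot_col, measure):
    """Build a cross-tab query with one SUM(CASE ...) column per distinct value."""
    # Step 1: discover the pivot values from the data instead of hard-coding them.
    values = [v for (v,) in conn.execute(
        f"SELECT DISTINCT {pivot_col} FROM {table} ORDER BY {pivot_col}")]
    # Step 2: generate one output column per value.
    # Values are spliced into the SQL text, so this is for trusted data only.
    cols = []
    for v in values:
        literal = "'" + v + "'"
        alias = '"' + v + '"'
        cols.append(
            f"SUM(CASE WHEN {pivot_col} = {literal} THEN {measure} ELSE 0 END) AS {alias}")
    sql = (f"SELECT {row_col}, {', '.join(cols)} FROM {table} "
           f"GROUP BY {row_col} ORDER BY {row_col}")
    return values, sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (store TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO F VALUES (?, ?, ?)", [
    ("s1", "jan", 10.0), ("s1", "feb", 4.0), ("s2", "jan", 6.0), ("s1", "jan", 1.0),
])

values, sql = horizontal_aggregation_sql(conn, "F", "store", "month", "amount")
rows = conn.execute(sql).fetchall()   # one row per store, one column per month
```

Each output row is one store, with the monthly totals laid out side by side, which is the layout most statistical tools expect as input.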

Page 29

Prepare Data Set Horizontal aggregations

Page 30

Horizontal Meta-optimizer


Page 31

Graph Analytics

• Recursive queries in SQL
• Patterns: paths, cycles, cliques
• Examples:
  – Twitter: who follows who?, how many #?
  – Facebook: family, close friends, social circles, friends in common
  – Airline: list all flights from A to B; balance cost/distance
• Surprisingly: SQL is good!, but with a column DBMS
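A recursive query for path patterns can be sketched with a recursive common table expression. The example below runs in SQLite through Python's `sqlite3` module, which is only a convenient row-store stand-in and not the column DBMS the slide refers to. The edge table `E`, the tiny graph, and the function name are hypothetical. The query counts paths by length, up to a bound k that also stops the recursion on cyclic graphs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO E VALUES (?, ?)", [(1, 2), (2, 3), (1, 3), (3, 4)])

# P holds one row per path: its start vertex, end vertex, and number of edges.
# UNION ALL keeps duplicates so that distinct paths between the same pair are counted.
PATHS_SQL = """
WITH RECURSIVE P(src, dst, hops) AS (
    SELECT src, dst, 1 FROM E
    UNION ALL
    SELECT P.src, E.dst, P.hops + 1
    FROM P JOIN E ON P.dst = E.src
    WHERE P.hops < :k
)
SELECT hops, COUNT(*) FROM P GROUP BY hops ORDER BY hops
"""

def count_paths_by_length(conn, k):
    """Return {path length: number of paths} for lengths 1..k."""
    return dict(conn.execute(PATHS_SQL, {"k": k}).fetchall())

path_counts = count_paths_by_length(conn, 3)
```

Each recursive step is a join of the current paths with the edge table, so the work per step is the kind of large join a DBMS is already built to optimize.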

Page 32

A benchmark to compute the # of paths of length k in a graph

Page 33

Cube computation with UDF (table function)

• Data structure in RAM; maybe one pass
• It requires maximal cuboid or choosing k dimensions

Page 34

Cube in UDF: Lattice manipulated with hash table
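The idea of holding the lattice in a hash table during a single scan can be sketched in plain Python. This is a simplified stand-in, not the author's UDF: a real table function would run inside the DBMS, and the slide before this one notes that the approach needs the maximal cuboid or a choice of k dimensions, which this sketch does not model. The marker `ALL`, the function name, and the sample rows are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # marker for a dimension that has been aggregated away

def one_pass_cube(rows, dimensions, measure):
    """Single scan of the data; a hash table holds every cell of every cuboid."""
    cuboids = [c for k in range(len(dimensions) + 1)
               for c in combinations(dimensions, k)]
    cells = defaultdict(lambda: [0, 0.0])      # key -> [count, sum]
    for row in rows:                           # one pass over the data set
        for cuboid in cuboids:                 # 2^d hash table updates per row
            key = tuple(row[d] if d in cuboid else ALL for d in dimensions)
            cell = cells[key]
            cell[0] += 1
            cell[1] += row[measure]
    return dict(cells)

# Hypothetical sample rows with d = 2 dimensions and one measure.
rows = [
    {"store": "s1", "product": "p1", "amount": 10.0},
    {"store": "s1", "product": "p2", "amount": 4.0},
    {"store": "s2", "product": "p1", "amount": 6.0},
]
cube = one_pass_cube(rows, ["store", "product"], "amount")
```

The scan touches each row once but performs 2^d updates per row, so the total work is proportional to n 2^d, and the hash table must fit in RAM for the single pass to pay off.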

Page 35

Cube visualization: harder than 2D or 3D data!

• Lattice exploration
• Projection into 2D
• Comparing cuboids

Page 36

Cube interpretation & visualization: statistical tests on cubes

Page 37

Can we do “search engines”? Keyword search, ranking

Page 38

Acknowledgments

• Il-Yeol Song, since we met in 2010, but I started sending papers to DOLAP in 2003
• Mike Stonebraker: one size does not fit all
• My students