Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG

Upload: marlene-cook

Post on 20-Jan-2016


Page 1: Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG

Analytics: SQL or NoSQL?

Richard Taylor, Chair, Business Intelligence SIG

Page 2

The NoSQL Movement

Meetup June 11, 2009 in San Francisco

NoSQL name proposed by Eric Evans

2004 BigTable (Google)

2007 Dynamo (Amazon)

2008 Cassandra (Facebook)

Hadoop/HBase (Yahoo)

Project Voldemort (LinkedIn)

NoSQL Conferences

Page 3

Relational Database/SQL

Page 4

Database Timeline

1969 CODASYL – Network database, Schema, DDL/DML
1970 Codd – Relational Model
1974 SEQUEL
1979 Oracle
1980 Gray – Transaction
1981 Bernstein and Goodman – Multi-version Concurrency Control
1989 SQL-89
1992 SQL-92
1995 Berenson et al. – Critique of ANSI SQL Isolation Levels
1999 SQL:1999 – Object Relational
2003 SQL:2003 – Analytics extensions

Page 5

Relational Model

• Table – n-tuple (rows and columns)
• Normalized data
– “Atomic”
– Multi-column Key
• Operations on tables: select, project, join
• Relationship on key
– Primary Key
– Foreign Key

Page 6

SQL

• Designed for Transaction Processing
• Good
– Easily handles simple cases
– Everyone has a Query Language
• Bad
– Data access language (not Turing complete)
– Declarative Language (4GL)
– Impedance mismatch with procedural languages
– Complicated cases get repetitive

Page 7

Normalization

• Refine design of structured data
– “Atomic”
– No repeating groups
– Data item depends on key (and nothing else)
• Avoid modification anomalies
– Ensure every data item is stored only once
• Avoid bias to any particular pattern of querying
– Allow data to be accessed from every angle

Denormalization

Page 8

Star Schema Example

• Dimension tables: Product, Store, Promotion, Date
• Fact Table columns: Date_key, Store_key, Promotion_key, Product_key, Receipt_number, Quantity, Revenue, Unit_price
• Date columns: Date_key, Day_in_week, Day_in_month, Day_in_year, Day_name, Week_in_month, Week_in_year, Month_nbr, Month_name, Quarter, Year, Holiday, Holiday_desc
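To make the star schema concrete, here is a minimal sketch using Python's standard `sqlite3` module. It keeps the fact table columns from the slide but trims the Date dimension to three columns; the table names `fact_sales` and `date_dim`, the column types, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, trimmed-down version of the slide's schema.
conn.executescript("""
CREATE TABLE date_dim (
    date_key   INTEGER PRIMARY KEY,
    month_name TEXT    NOT NULL,
    year       INTEGER NOT NULL
);
CREATE TABLE fact_sales (
    date_key       INTEGER NOT NULL REFERENCES date_dim(date_key),
    store_key      INTEGER NOT NULL,
    promotion_key  INTEGER NOT NULL,
    product_key    INTEGER NOT NULL,
    receipt_number INTEGER NOT NULL,
    quantity       INTEGER NOT NULL,
    unit_price     REAL    NOT NULL,
    revenue        REAL    NOT NULL
);
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO date_dim VALUES (?, ?, ?)",
    [(20090611, "June", 2009), (20090612, "June", 2009), (20090701, "July", 2009)],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (20090611, 1, 0, 10, 1001, 2, 5.0, 10.0),
        (20090612, 1, 0, 11, 1002, 1, 7.5, 7.5),
        (20090701, 2, 1, 10, 1003, 4, 5.0, 20.0),
    ],
)

# A typical star-schema query: join the fact table to a dimension
# and aggregate a measure by a dimension attribute.
revenue_by_month = conn.execute("""
    SELECT d.year, d.month_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    GROUP BY d.year, d.month_name
    ORDER BY MIN(d.date_key)
""").fetchall()

print(revenue_by_month)
```

The fact table holds only keys and measures, so the same join pattern applies to the Store, Product, and Promotion dimensions.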

Page 9

Database Summary

• Costs
– Fixed schema
– Normalization
– Transform data on load
– Cost of scaling
– Problems with large objects
– Complicated software
• Benefits
– Mature technology
– Precise querying
– Star Schema – historic data

Page 10

Tuple Store/NoSQL

Page 11

Tuple Storage Systems

• Google Database System
– Chubby – Lock/metadata manager
– Google File System – Distributed file system
– Bigtable – Tuple storage on GFS
– Map Reduce – Data processing on tuples
• Other tuple stores
– Voldemort – Amazon Dynamo
– Cassandra
– HBase
– Hypertable

Page 12

Tuple Store Model

• One Table
• Operate on Map
• Set of (Key, Value)
– Structured Key
– Unstructured Value
• Operations: select, project, Map Reduce

Diagram: Tuple Store maps Key to Value; the structured Key is (Key, Column, Timestamp)
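As a rough illustration of this model, the sketch below treats a tuple store as a single Python dictionary whose structured key is a (row key, column, timestamp) triple and whose value is an opaque byte string. The helper names `select`, `project`, and `latest`, and the sample data, are hypothetical and do not correspond to the API of any particular tuple store.

```python
from typing import Dict, Optional, Tuple

# Structured key: (row key, column, timestamp). The value is unstructured bytes.
Key = Tuple[str, str, int]
TupleStore = Dict[Key, bytes]

store: TupleStore = {
    ("com.example/index", "contents", 1): b"<html>v1</html>",
    ("com.example/index", "contents", 2): b"<html>v2</html>",
    ("com.example/index", "anchor:home", 1): b"Home",
    ("com.example/about", "contents", 1): b"<html>about</html>",
}


def select(store: TupleStore, row: str) -> TupleStore:
    """Select: keep the tuples whose row key matches."""
    return {k: v for k, v in store.items() if k[0] == row}


def project(store: TupleStore, column: str) -> TupleStore:
    """Project: keep the tuples that belong to one column."""
    return {k: v for k, v in store.items() if k[1] == column}


def latest(store: TupleStore, row: str, column: str) -> Optional[bytes]:
    """Return the value with the highest timestamp, or None if absent."""
    versions = [(k[2], v) for k, v in store.items() if k[0] == row and k[1] == column]
    return max(versions)[1] if versions else None


print(latest(store, "com.example/index", "contents"))
```

Note that there is no join: everything lives in one table, and any cross-referencing has to be done by the processing layer.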

Page 13

Map Reduce

• Define two functions
– Map
• Input: tuple
• Output: list of tuples
– Reduce
• Input: key, list of values
• Output: list or tuple
• Specify a cluster
• Specify input and output tuple stores
• Framework does the rest

{ Map(k1, v1) } -> { list(k2, v2) }

{ list(k2, v2) } -> { (k2, list(v2)) }

{ Reduce(k2, list(v2)) } -> { list(v3) } -> { (k2, v3) }
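The three transformations above can be sketched as three small functions running in a single process. This is only a toy model of what "the framework does": a real framework distributes each phase across a cluster. The function names and the receipt data are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[Any, Any]


def map_phase(store: Dict[Any, Any], map_fn: Callable[[Any, Any], Iterable[Pair]]) -> List[Pair]:
    """{ Map(k1, v1) } -> { list(k2, v2) }"""
    pairs: List[Pair] = []
    for k1, v1 in store.items():
        pairs.extend(map_fn(k1, v1))
    return pairs


def shuffle_phase(pairs: Iterable[Pair]) -> Dict[Any, List[Any]]:
    """{ list(k2, v2) } -> { (k2, list(v2)) }"""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for k2, v2 in pairs:
        groups[k2].append(v2)
    return dict(groups)


def reduce_phase(groups: Dict[Any, List[Any]], reduce_fn: Callable[[Any, List[Any]], Any]) -> Dict[Any, Any]:
    """{ Reduce(k2, list(v2)) } -> { (k2, v3) }"""
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def map_reduce(store, map_fn, reduce_fn):
    """Chain the three phases over an in-memory tuple store."""
    return reduce_phase(shuffle_phase(map_phase(store, map_fn)), reduce_fn)


# Made-up input store: receipt number -> (product, quantity).
receipts = {1001: ("apple", 2), 1002: ("pear", 1), 1003: ("apple", 4)}

totals = map_reduce(
    receipts,
    lambda receipt, item: [(item[0], item[1])],  # emit (product, quantity)
    lambda product, quantities: sum(quantities),  # total quantity per product
)
print(totals)
```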

Page 14

Map Reduce Example

For each web page count the number of pages that reference that page

Input tuple store is WWW

Map Function: for each anchor on web page, emit (anchorURL, 1)

Reduce Function: emit (anchorURL, sum(list))

{ Map(k1, v1) } -> { list(k2, v2) }

{ list(k2, v2) } -> { (k2, list(v2)) }

{ Reduce(k2, list(v2)) } -> { (k2, v3) }

Input tuples: { (URL, Web Page) }

Output tuple store is { (URL, count) }
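A small runnable sketch of this example follows, with a three-page dictionary standing in for the WWW tuple store. One deliberate deviation, stated as an assumption: the map function emits each distinct anchor URL once per page, so the result counts referencing pages rather than anchors. The class and function names and the sample pages are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def map_links(url, page):
    """Map: emit (anchorURL, 1) for each distinct anchor on the page."""
    parser = AnchorCollector()
    parser.feed(page)
    parser.close()
    return [(href, 1) for href in set(parser.hrefs)]


def reduce_links(anchor_url, counts):
    """Reduce: emit (anchorURL, sum(list))."""
    return (anchor_url, sum(counts))


# Toy stand-in for the WWW tuple store: { (URL, Web Page) }.
www = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://c.example/">c</a>',
    "http://c.example/": '<a href="http://a.example/">a</a> <a href="http://a.example/">a again</a>',
}

# Group the mapped pairs by key, then reduce each group.
grouped = defaultdict(list)
for url, page in www.items():
    for anchor_url, one in map_links(url, page):
        grouped[anchor_url].append(one)

link_counts = dict(reduce_links(k, v) for k, v in grouped.items())
print(link_counts)
```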

Page 15

Example in SQL

For each web page count the number of pages that reference that page

CREATE TABLE links (
    page     VARCHAR(2048) NOT NULL,  -- URL of the referencing page
    ref_page VARCHAR(2048) NOT NULL,  -- URL of the referenced page
    PRIMARY KEY (page, ref_page)
);

SELECT ref_page, COUNT(DISTINCT page)
FROM links
GROUP BY ref_page;
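For comparison with the Map Reduce version, the query can be tried out with Python's standard `sqlite3` module. This is a sketch under assumptions: the URL columns are stored as `TEXT`, the four sample links are made up, and an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (referencing page, referenced page) pair.
conn.execute("""
    CREATE TABLE links (
        page     TEXT NOT NULL,
        ref_page TEXT NOT NULL,
        PRIMARY KEY (page, ref_page)
    )
""")

# Made-up sample links between three pages.
conn.executemany(
    "INSERT INTO links (page, ref_page) VALUES (?, ?)",
    [
        ("http://a.example/", "http://b.example/"),
        ("http://a.example/", "http://c.example/"),
        ("http://b.example/", "http://c.example/"),
        ("http://c.example/", "http://a.example/"),
    ],
)

# Count the distinct pages that reference each page.
rows = conn.execute("""
    SELECT ref_page, COUNT(DISTINCT page)
    FROM links
    GROUP BY ref_page
    ORDER BY ref_page
""").fetchall()

print(rows)
```

The query itself is short, but the data first has to be extracted from the pages and loaded into the `links` table, which is the "transform data on load" cost.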

Page 16

Tuple Store Summary

• Semi-structured data
– No need to normalize data
• Simple implementations
– Cheap, fast, scalable
• Map Reduce Processing
– Simple programming (for geeks)
• Issues
– No guidance from schema
– No model for historic data

Hadoop wins Sort Benchmark

Page 17

Synthesis

Page 18

Summary

• SQL
– Structured data
– Precise
– Historic data
– Needs transformation
– Scalability issues
• NoSQL
– Cheap
– Scalable
– Handles large data

Page 19

Enterprise Model

Diagram labels: Money, Content, Analytics; Relational DB, NoSQL, ?; Metadata?

• Issues
– Data volume
– Query requirements

Page 20

Analytics Architecture

Diagram labels: Tuple Store; Map Reduce Processing (TB+/day); RDB Data Warehouse (GB++/day); Reports; Cubes, Reports, etc.
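One plausible reading of this architecture is that Map Reduce condenses the high-volume tuple store into much smaller summaries, which are then loaded into the relational warehouse for precise querying. The sketch below follows that reading end to end on toy data; the event layout, the `daily_revenue` table, and the function names are all assumptions, not part of the slides.

```python
import sqlite3
from collections import defaultdict

# Toy stand-in for the raw tuple store: (day, event id) -> event record.
raw_events = {
    ("2009-06-11", "evt-1"): {"product": "apple", "revenue": 10.0},
    ("2009-06-11", "evt-2"): {"product": "apple", "revenue": 5.0},
    ("2009-06-11", "evt-3"): {"product": "pear", "revenue": 7.5},
    ("2009-06-12", "evt-4"): {"product": "apple", "revenue": 20.0},
}


def map_event(key, value):
    """Map: emit ((day, product), revenue) for each raw event."""
    day, _event_id = key
    return [((day, value["product"]), value["revenue"])]


def reduce_revenue(key, revenues):
    """Reduce: total revenue for one (day, product) pair."""
    return sum(revenues)


# Map Reduce step: raw tuples -> small daily summary.
grouped = defaultdict(list)
for key, value in raw_events.items():
    for k2, v2 in map_event(key, value):
        grouped[k2].append(v2)
summary = {k2: reduce_revenue(k2, v2s) for k2, v2s in grouped.items()}

# Load step: summary rows go into a relational warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE daily_revenue (
        day     TEXT NOT NULL,
        product TEXT NOT NULL,
        revenue REAL NOT NULL,
        PRIMARY KEY (day, product)
    )
""")
warehouse.executemany(
    "INSERT INTO daily_revenue VALUES (?, ?, ?)",
    [(day, product, revenue) for (day, product), revenue in summary.items()],
)

# Report step: precise SQL querying over the reduced data.
report = warehouse.execute("""
    SELECT product, SUM(revenue)
    FROM daily_revenue
    GROUP BY product
    ORDER BY product
""").fetchall()

print(report)
```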

Page 21

Summary

It is all about structured data

How much do we want?

How much can we afford?