analytics: sql or nosql? richard taylor chair business intelligence sig

Analytics: SQL or NoSQL?

Richard Taylor, Chair, Business Intelligence SIG

The NoSQL Movement

Meetup June 11, 2009 in San Francisco

NoSQL name proposed by Eric Evans

2004 BigTable (Google)

2007 Dynamo (Amazon)

2008 Cassandra (Facebook)

Hadoop/HBase (Yahoo)

Project Voldemort (LinkedIn)

NoSQL Conferences

Relational Database/SQL

Database Timeline

1969 CODASYL: Network database, Schema, DDL/DML
1970 Codd: Relational Model
1974 SEQUEL
1979 Oracle
1980 Gray: Transaction
1981 Bernstein and Goodman: Multi-version Concurrency Control
1989 SQL-89
1992 SQL-92
1995 Bernstein et al: Critique of ANSI SQL Isolation Levels
1999 SQL:1999: Object Relational
2003 SQL:2003: Analytics extensions

Relational Model

• Table – n-tuple
• Normalized data
– “Atomic”
– Multi-column Key
• Relationship on key
– Primary Key
– Foreign Key
• Operations on tables: select, project, join

[Figure labels: Row, Column, Key]
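The select, project, and join operations on this slide can be illustrated with a minimal Python sketch. Tables are modeled as lists of dicts; the `products` and `sales` tables, their contents, and the helper names are hypothetical and exist only for illustration.

```python
# Tables as lists of rows (dicts). Sample data is made up for illustration.
products = [
    {"product_key": 1, "name": "apple"},
    {"product_key": 2, "name": "pear"},
]
sales = [
    {"receipt_number": 100, "product_key": 1, "quantity": 3},
    {"receipt_number": 101, "product_key": 2, "quantity": 1},
    {"receipt_number": 102, "product_key": 1, "quantity": 5},
]


def select(table, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def project(table, columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in table]


def join(left, right, key):
    """Equi-join: pair rows whose values for `key` match."""
    index = {}
    for b in right:
        index.setdefault(b[key], []).append(b)
    return [{**a, **b} for a in left for b in index.get(a[key], [])]


# Names and quantities of all sales of more than one unit.
big_sales = project(
    select(join(sales, products, "product_key"), lambda row: row["quantity"] > 1),
    ["name", "quantity"],
)
print(big_sales)
```

Here `product_key` plays the role of a foreign key in `sales` and a primary key in `products`; the join follows that relationship.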

SQL

Designed for Transaction Processing

• Good
– Easily handles simple cases
– Everyone has a Query Language
• Bad
– Data access language (not Turing complete)
– Declarative Language (4GL)
– Impedance mismatch with procedural languages
– Complicated cases get repetitive

Normalization

• Refine design of structured data
– “Atomic”
– No repeating groups
– Data item depends on key (and nothing else)
• Avoid modification anomalies
– Ensure every data item is stored only once
• Avoid bias to any particular pattern of querying
– Allow data to be accessed from every angle
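A modification anomaly is easiest to see with a small example. The sketch below is illustrative only: the table layouts, the store key, and the city names are hypothetical. It compares a flat layout that repeats a store's city on every sale with a normalized layout that stores the city once.

```python
# Denormalized: the store's city is repeated on every sale row.
sales_flat = [
    {"receipt_number": 100, "store_key": 7, "store_city": "San Francisco"},
    {"receipt_number": 101, "store_key": 7, "store_city": "San Francisco"},
]

# Updating only one copy leaves two conflicting cities for store 7.
sales_flat[0]["store_city"] = "Oakland"
cities_flat = {row["store_city"] for row in sales_flat if row["store_key"] == 7}

# Normalized: the city depends only on store_key and is stored once.
stores = {7: {"store_city": "San Francisco"}}
sales = [
    {"receipt_number": 100, "store_key": 7},
    {"receipt_number": 101, "store_key": 7},
]

# One update is seen by every sale that references the store.
stores[7]["store_city"] = "Oakland"
cities_normalized = {stores[row["store_key"]]["store_city"] for row in sales}

print(sorted(cities_flat))        # two different answers for one store
print(sorted(cities_normalized))  # a single consistent answer
```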

Denormalization

Star Schema Example

Fact Table: Date_key, Store_key, Promotion_key, Product_key, Receipt_number, Quantity, Revenue, Unit_price

Dimension tables: Product, Store, Promotion, Date

Date: Date_key, Day_in_week, Day_in_month, Day_in_year, Day_name, Week_in_month, Week_in_year, Month_nbr, Month_name, Quarter, Year, Holiday, Holiday_desc, …
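A typical star schema query joins the fact table to a dimension and aggregates. The following is a rough sketch using Python's built-in `sqlite3` module; the table names `sales_fact` and `date_dim`, the subset of columns, and all sample rows are assumptions made for illustration, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE date_dim (
        date_key   INTEGER PRIMARY KEY,
        month_name TEXT NOT NULL,
        year       INTEGER NOT NULL
    );
    CREATE TABLE sales_fact (
        date_key       INTEGER NOT NULL REFERENCES date_dim (date_key),
        receipt_number INTEGER NOT NULL,
        quantity       INTEGER NOT NULL,
        revenue        REAL NOT NULL
    );
    """
)
# Made-up sample rows.
conn.executemany(
    "INSERT INTO date_dim VALUES (?, ?, ?)",
    [(1, "May", 2009), (2, "June", 2009)],
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [(1, 100, 2, 10.0), (2, 101, 1, 4.0), (2, 102, 3, 6.0)],
)

# Revenue per month: join the fact table to the Date dimension, then group.
revenue_by_month = conn.execute(
    """
    SELECT d.year, d.month_name, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    GROUP BY d.year, d.month_name
    ORDER BY MIN(d.date_key)
    """
).fetchall()
print(revenue_by_month)
```

Because calendar attributes such as the month name live in the dimension, the fact table stays narrow and historic queries reduce to one join per dimension.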

Database Summary

• Costs
– Fixed schema
– Normalization
– Transform data on load
– Cost of scaling
– Problems with large objects
– Complicated software
• Benefits
– Mature technology
– Precise querying
– Star Schema – historic data

Tuple Store/NoSQL

Tuple Storage Systems

• Google Database System
– Chubby – Lock/metadata manager
– Google File System – Distributed file system
– Bigtable – Tuple storage on GFS
– Map Reduce – Data processing on tuples
• Other tuple stores
– Voldemort – Amazon Dynamo
– Cassandra
– HBase
– Hypertable

Tuple Store Model

• One Table
• Operate on Map
• Set of (Key, Value)
– Structured Key
– Unstructured Value
• Operations: select, project, Map Reduce

[Figure labels: Tuple Store; Key, Value; Key, Column, Timestamp]
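The model on this slide can be sketched as a map from a structured key to an opaque value. The class below is a toy, in-memory illustration and not how any of the systems named earlier is implemented; the class name, method names, and sample URLs are hypothetical, and the structured key is assumed to be the (key, column, timestamp) triple shown in the figure labels.

```python
class TupleStore:
    """Toy in-memory tuple store: (key, column, timestamp) -> value."""

    def __init__(self):
        self._cells = {}

    def put(self, key, column, timestamp, value):
        """Store one versioned cell."""
        self._cells[(key, column, timestamp)] = value

    def get(self, key, column):
        """Return the most recent value for (key, column), or None."""
        versions = [
            (ts, value)
            for (k, c, ts), value in self._cells.items()
            if k == key and c == column
        ]
        if not versions:
            return None
        return max(versions, key=lambda v: v[0])[1]

    def project(self, column):
        """Latest value of one column for every key that has it."""
        keys = {k for (k, c, _ts) in self._cells if c == column}
        return {k: self.get(k, column) for k in keys}


store = TupleStore()
store.put("http://a.example/", "contents", 1, "<html>old</html>")
store.put("http://a.example/", "contents", 2, "<html>new</html>")
store.put("http://b.example/", "contents", 1, "<html>b</html>")

latest = store.get("http://a.example/", "contents")
all_contents = store.project("contents")
print(latest)
```

The value is never interpreted by the store, which is what "Unstructured Value" implies. The linear scan in `get` is only acceptable in a toy.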

Map Reduce

• Define two functions
– Map
  • Input: tuple
  • Output: list of tuples
– Reduce
  • Input: key, list of values
  • Output: list or tuple
• Specify a cluster
• Specify input and output tuple stores
• Framework does the rest

{ Map(k1, v1) } -> { list(k2, v2) }

{ list(k2, v2) } -> { (k2, list(v2)) }

{ Reduce(k2, list(v2)) } -> { list(v3) } -> { (k2, v3) }
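The three transformations above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow: a real framework distributes each phase across the cluster. The function names and the sample sales tuples are hypothetical.

```python
from collections import defaultdict


def map_reduce(tuples, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce over an iterable of (k1, v1) tuples."""
    # Map: each (k1, v1) yields a list of (k2, v2).
    intermediate = []
    for k1, v1 in tuples:
        intermediate.extend(map_fn(k1, v1))

    # Group: collect all v2 that share the same k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce: each (k2, list(v2)) yields one (k2, v3).
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


# Hypothetical input: (receipt_number, (store_key, revenue)) tuples.
sales = [
    (100, ("store-1", 10.0)),
    (101, ("store-2", 4.0)),
    (102, ("store-1", 6.0)),
]


def map_sale(receipt_number, sale):
    """Re-key each sale by its store."""
    store_key, revenue = sale
    return [(store_key, revenue)]


def reduce_revenue(store_key, revenues):
    """Total the revenue collected for one store."""
    return sum(revenues)


revenue_by_store = map_reduce(sales, map_sale, reduce_revenue)
print(revenue_by_store)
```

The user supplies only `map_sale` and `reduce_revenue`; the grouping step in the middle is the part the framework "does for you".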

Map Reduce Example

For each web page count the number of pages that reference that page

Input tuple store is WWW

Map Function: for each anchor on web page, emit (anchorURL, 1)

Reduce Function: emit (anchorURL, sum(list))

{ Map(k1, v1) } -> { list(k2, v2) }

{ list(k2, v2) } -> { (k2, list(v2)) }

{ Reduce(k2, list(v2)) } -> { (k2, v3) }

Input tuples: (URL, Web Page), (URL, Web Page), (URL, Web Page), (URL, Web Page), …

Output tuple store is { (URL, count) }
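The inbound-link count can be sketched in Python as follows. The three-page `www` input and the `AnchorCollector` helper are hypothetical; anchors are extracted with the standard library's `html.parser`, and the grouping step that a framework would perform is written out inline.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def map_page(url, page):
    """Map: for each anchor on the web page, emit (anchorURL, 1)."""
    collector = AnchorCollector()
    collector.feed(page)
    return [(href, 1) for href in collector.hrefs]


def reduce_counts(anchor_url, ones):
    """Reduce: emit (anchorURL, sum(list))."""
    return (anchor_url, sum(ones))


# Hypothetical three-page "WWW" as (URL, Web Page) tuples.
www = [
    ("http://a.example/",
     '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>'),
    ("http://b.example/", '<a href="http://c.example/">c</a>'),
    ("http://c.example/", "<p>no links</p>"),
]

# Group the mapped tuples by anchor URL, then reduce each group.
grouped = defaultdict(list)
for url, page in www:
    for anchor_url, one in map_page(url, page):
        grouped[anchor_url].append(one)

link_counts = dict(reduce_counts(k, v) for k, v in grouped.items())
print(link_counts)
```

As written, a page that links to the same target twice contributes two counts. Counting distinct referring pages would require the map function to emit each anchor URL at most once per page.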

Example in SQL

CREATE TABLE links (
    page     VARCHAR(2048) NOT NULL,  -- URL of the linking page
    ref_page VARCHAR(2048) NOT NULL,  -- URL of the referenced page
    PRIMARY KEY (page, ref_page)
);

SELECT ref_page, COUNT(DISTINCT page)
FROM links
GROUP BY ref_page;

For each web page count the number of pages that reference that page
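To try the query, one option is Python's built-in `sqlite3` module. This is a minimal sketch: the column type `TEXT` and the three sample links are assumptions chosen so the example runs in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE links (
        page     TEXT NOT NULL,
        ref_page TEXT NOT NULL,
        PRIMARY KEY (page, ref_page)
    )
    """
)
# Made-up sample data: a links to b and c, b links to c.
conn.executemany(
    "INSERT INTO links (page, ref_page) VALUES (?, ?)",
    [
        ("http://a.example/", "http://b.example/"),
        ("http://a.example/", "http://c.example/"),
        ("http://b.example/", "http://c.example/"),
    ],
)

# For each referenced page, count the distinct pages that link to it.
inbound = conn.execute(
    """
    SELECT ref_page, COUNT(DISTINCT page)
    FROM links
    GROUP BY ref_page
    ORDER BY ref_page
    """
).fetchall()
print(inbound)
```

The primary key already prevents duplicate (page, ref_page) rows, so `COUNT(DISTINCT page)` and `COUNT(*)` give the same result here.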

Tuple Store Summary

• Semi-structured data
– No need to normalize data
• Simple implementations
– Cheap, fast, scalable
• Map Reduce Processing
– Simple programming (for geeks)
• Issues
– No guidance from schema
– No model for historic data

Hadoop wins Sort Benchmark

Synthesis

Summary

• SQL
– Structured data
– Precise
– Historic data
– Needs transformation
– Scalability issues
• NoSQL
– Cheap
– Scalable
– Handles large data

Enterprise Model

[Figure labels: Money, Content, Analytics; Relational DB, NoSQL, ?; Metadata?]

Issues:
- Data volume
- Query requirements

Analytics Architecture

[Figure labels: Tuple Store; Map Reduce Processing (TB+/day); RDB Data Warehouse (GB++/day); Reports; Cubes, Reports, etc.]

Summary

It is all about structured data

How much do we want?

How much can we afford?
