Analytics: SQL or NoSQL?
Richard Taylor, Chair, Business Intelligence SIG
The NoSQL Movement
Meetup June 11, 2009 in San Francisco
NoSQL name proposed by Eric Evans
2004 BigTable (Google)
2007 Dynamo (Amazon)
2008 Cassandra (Facebook)
Hadoop/HBase (Yahoo)
Project Voldemort (LinkedIn)
NoSQL Conferences
Relational Database/SQL
Database Timeline
1969 CODASYL – Network database, Schema, DDL/DML
1970 Codd – Relational Model
1974 SEQUEL
1979 Oracle
1980 Gray – Transaction
1981 Bernstein and Goodman – Multi-version Concurrency Control
1989 SQL-89
1992 SQL-92
1995 Berenson et al. – Critique of ANSI SQL Isolation Levels
1999 SQL:1999 – Object Relational
2003 SQL:2003 – Analytics extensions
Relational Model
• Table – n-tuple (Row, Column)
• Key
– Multi-column Key
– Primary Key
– Foreign Key
• Normalized data – "Atomic"
• Relationship on key
• Operations on tables: select, project, join
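The three table operations can be sketched in a few lines of Python. This is a minimal illustration, not how a database engine implements them; the `stores` and `sales` tables and their columns are hypothetical sample data.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]  # one n-tuple, keyed by column name


def select(table: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def project(table: List[Row], columns: List[str]) -> List[Row]:
    """Keep only the named columns, dropping duplicate rows."""
    seen = set()
    result = []
    for row in table:
        values = tuple(row[c] for c in columns)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(columns, values)))
    return result


def join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Equi-join: pair up rows whose key column values match."""
    index: Dict[object, List[Row]] = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical sample data: store_key is the primary key of `stores`
# and a foreign key in `sales`.
stores = [
    {"store_key": 1, "city": "San Francisco"},
    {"store_key": 2, "city": "Oakland"},
]
sales = [
    {"receipt": 100, "store_key": 1, "quantity": 3},
    {"receipt": 101, "store_key": 2, "quantity": 1},
    {"receipt": 102, "store_key": 1, "quantity": 5},
]

# Cities that had at least one sale of three or more items.
big_sales = select(sales, lambda row: row["quantity"] >= 3)
cities = project(join(big_sales, stores, "store_key"), ["city"])
print(cities)
```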
SQL
• Designed for Transaction Processing
• Good
– Easily handles simple cases
– Everyone has a Query Language
• Bad
– Data access language (not Turing complete)
– Declarative Language (4GL)
– Impedance mismatch with procedural languages
– Complicated cases get repetitive
Normalization
• Refine design of structured data
– "Atomic"
– No repeating groups
– Data item depends on key (and nothing else)
• Avoid modification anomalies
– Ensure every data item is stored only once
• Avoid bias to any particular pattern of querying
– Allow data to be accessed from every angle
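A modification anomaly is easiest to see with a small example. The sketch below uses hypothetical sales records in which the store's city is repeated on every row, then splits them so that the city depends only on the store key.

```python
# Denormalized: the store's city is repeated on every sale row.
flat_sales = [
    {"receipt": 100, "store_key": 1, "city": "San Francisco", "quantity": 3},
    {"receipt": 101, "store_key": 2, "city": "Oakland", "quantity": 1},
    {"receipt": 102, "store_key": 1, "city": "San Francisco", "quantity": 5},
]

# Normalized: city depends on store_key (and nothing else),
# so it is stored once per store.
stores = {row["store_key"]: row["city"] for row in flat_sales}
sales = [
    {"receipt": r["receipt"], "store_key": r["store_key"], "quantity": r["quantity"]}
    for r in flat_sales
]

# Modification anomaly: updating one flat row leaves the other copy stale.
flat_sales[0]["city"] = "SF"
cities_for_store_1 = {r["city"] for r in flat_sales if r["store_key"] == 1}
print(cities_for_store_1)  # two conflicting values for the same store

# In the normalized form a single update is seen by every sale.
stores[1] = "SF"
normalized_cities = {stores[s["store_key"]] for s in sales if s["store_key"] == 1}
print(normalized_cities)
```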
Denormalization
Star Schema Example
Fact Table (linked to dimension tables Date, Store, Promotion, Product)
– Date_key
– Store_key
– Promotion_key
– Product_key
– Receipt_number
– Quantity
– Revenue
– Unit_price
Date
– Date_key
– Day_in_week, Day_in_month, Day_in_year, Day_name
– Week_in_month, Week_in_year
– Month_nbr, Month_name
– Quarter
– Year
– Holiday, Holiday_desc
– …
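A typical star schema query joins the fact table to a dimension and aggregates. The sketch below uses Python's built-in `sqlite3` module with a cut-down version of the tables; the table names `sales_fact` and `date_dim` and all rows are hypothetical, and only a few of the columns from the slide are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    Date_key   INTEGER PRIMARY KEY,
    Day_name   TEXT,
    Month_name TEXT,
    Year       INTEGER
);
CREATE TABLE sales_fact (
    Date_key       INTEGER REFERENCES date_dim (Date_key),
    Product_key    INTEGER,
    Receipt_number INTEGER,
    Quantity       INTEGER,
    Revenue        REAL
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)", [
    (20090610, "Wednesday", "June", 2009),
    (20090611, "Thursday", "June", 2009),
    (20090701, "Wednesday", "July", 2009),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (20090610, 1, 100, 2, 20.0),
    (20090611, 1, 101, 1, 10.0),
    (20090611, 2, 102, 3, 45.0),
    (20090701, 2, 103, 1, 15.0),
])

# Revenue by month: join the fact table to the Date dimension, then group.
monthly_revenue = conn.execute("""
    SELECT d.Year, d.Month_name, SUM(f.Revenue)
    FROM sales_fact AS f
    JOIN date_dim AS d ON f.Date_key = d.Date_key
    GROUP BY d.Year, d.Month_name
    ORDER BY MIN(d.Date_key)
""").fetchall()
print(monthly_revenue)
```

The dimension table carries the redundant calendar attributes (day name, month name and so on) so that the query needs no date arithmetic.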
Database Summary
• Costs
– Fixed schema
– Normalization
– Transform data on load
– Cost of scaling
– Problems with large objects
– Complicated software
• Benefits
– Mature technology
– Precise querying
– Star Schema – historic data
Tuple Store/NoSQL
Tuple Storage Systems
• Google Database System
– Chubby – Lock/metadata manager
– Google File System – Distributed file system
– Bigtable – Tuple storage on GFS
– Map Reduce – Data processing on tuples
• Other tuple stores
– Voldemort – Amazon Dynamo
– Cassandra
– HBase
– Hypertable
Tuple Store Model
• One Table
• Operate on Map
– Set of (Key, Value)
– Structured Key
– Unstructured Value
• Operations:
– select, project
– Map Reduce
Tuple Store: (Key, Value), where the structured key is (Key, Column, Timestamp)
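The model can be sketched as an in-memory map. This is a toy under stated assumptions, not the API of Bigtable or any other product: the class name `TupleStore` and its methods are hypothetical, keys are (row key, column, timestamp) triples, and values are uninterpreted bytes.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, int]  # (row key, column, timestamp)


class TupleStore:
    """A single table: a map from structured keys to uninterpreted values."""

    def __init__(self) -> None:
        self._data: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._data[(row, column, timestamp)] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        """Return the most recent version of one cell, or None if absent."""
        versions = [(ts, v) for (r, c, ts), v in self._data.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def select(self, row_prefix: str) -> List[Tuple[Key, bytes]]:
        """Select: all tuples whose row key starts with the prefix."""
        return sorted((k, v) for k, v in self._data.items()
                      if k[0].startswith(row_prefix))

    def project(self, column: str) -> List[Tuple[Key, bytes]]:
        """Project: all tuples for one column."""
        return sorted((k, v) for k, v in self._data.items() if k[1] == column)


# Hypothetical usage: two versions of one page plus some titles.
store = TupleStore()
store.put("com.example/index", "contents", 1, b"<html>v1</html>")
store.put("com.example/index", "contents", 2, b"<html>v2</html>")
store.put("com.example/index", "title", 1, b"Example")
store.put("org.example/about", "title", 1, b"About")

latest = store.get("com.example/index", "contents")
titles = store.project("title")
print(latest, titles)
```

The store never inspects the value; all structure lives in the key.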
Map Reduce
• Define two functions
– Map
  • Input: tuple
  • Output: list of tuples
– Reduce
  • Input: key, list of values
  • Output: list or tuple
• Specify a cluster
• Specify input and output tuple stores
• Framework does the rest
{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { list(v3) } -> { (k2, v3) }
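The three steps in the notation above can be sketched as a single-process Python program. A real framework distributes each phase across a cluster; this sketch only shows the data flow. The function names and the word-count example are hypothetical.

```python
from collections import defaultdict


def map_phase(tuples, map_fn):
    """{ Map(k1, v1) } -> { list(k2, v2) }"""
    emitted = []
    for k1, v1 in tuples:
        emitted.extend(map_fn(k1, v1))
    return emitted


def group_phase(emitted):
    """{ list(k2, v2) } -> { (k2, list(v2)) }"""
    groups = defaultdict(list)
    for k2, v2 in emitted:
        groups[k2].append(v2)
    return dict(groups)


def reduce_phase(groups, reduce_fn):
    """{ Reduce(k2, list(v2)) } -> { (k2, v3) }"""
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def map_reduce(tuples, map_fn, reduce_fn):
    """Run the three phases in order; this is the part the framework does."""
    return reduce_phase(group_phase(map_phase(tuples, map_fn)), reduce_fn)


# Hypothetical example: count word occurrences across documents.
def emit_words(doc_id, text):
    return [(word, 1) for word in text.split()]


def total(word, ones):
    return sum(ones)


documents = [("doc1", "sql or nosql"), ("doc2", "sql and nosql and sql")]
counts = map_reduce(documents, emit_words, total)
print(counts)
```

The user writes only `emit_words` and `total`; everything else is generic.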
Map Reduce Example
For each web page count the number of pages that reference that page
Input tuple store is WWW
Map Function: for each anchor on web page, emit (anchorURL, 1)
Reduce Function: emit (anchorURL, sum(list))
{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { (k2, v3) }
Input tuples are { (URL, Web Page), … }
Output tuple store is { (URL, count) }
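The link-count example might look like this in Python, run in a single process. The three-page `www` input is a hypothetical stand-in for the real tuple store, and anchors are extracted with the standard library's `html.parser`.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def map_fn(url, page):
    """For each anchor on the web page, emit (anchorURL, 1)."""
    parser = AnchorParser()
    parser.feed(page)
    return [(href, 1) for href in parser.hrefs]


def reduce_fn(anchor_url, counts):
    """Emit (anchorURL, sum(list))."""
    return (anchor_url, sum(counts))


# Hypothetical input tuples (URL, Web Page).
www = [
    ("http://a.example/",
     '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>'),
    ("http://b.example/", '<a href="http://c.example/">c</a>'),
    ("http://c.example/", '<a href="http://a.example/">a</a>'),
]

# Group the mapped tuples by key, then reduce each group.
grouped = defaultdict(list)
for url, page in www:
    for anchor_url, one in map_fn(url, page):
        grouped[anchor_url].append(one)

output = dict(reduce_fn(k, v) for k, v in grouped.items())
print(output)
```

Note that this counts anchors, so a page that links to the same target twice is counted twice; de-duplicating per page would need an extra step in the map function.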
Example in SQL
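CREATE TABLE links (
  page     VARCHAR(2048) NOT NULL,  -- URL of the page containing the anchor
  ref_page VARCHAR(2048) NOT NULL,  -- URL of the page the anchor references
  PRIMARY KEY (page, ref_page)
);
SELECT ref_page, COUNT(DISTINCT page)
FROM links
GROUP BY ref_page;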
For each web page count the number of pages that reference that page
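To try the SQL version, the following sketch runs it against an in-memory SQLite database from Python. The column type `TEXT` and the sample rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE links (
        page     TEXT NOT NULL,   -- page containing the anchor
        ref_page TEXT NOT NULL,   -- page the anchor references
        PRIMARY KEY (page, ref_page)
    )
""")

# Hypothetical sample links: (page, ref_page).
conn.executemany("INSERT INTO links VALUES (?, ?)", [
    ("http://a.example/", "http://b.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://a.example/"),
])

rows = conn.execute("""
    SELECT ref_page, COUNT(DISTINCT page)
    FROM links
    GROUP BY ref_page
    ORDER BY ref_page
""").fetchall()
print(rows)
```

The query is shorter than the Map Reduce program, but the links must already have been extracted from the pages and loaded into the table.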
Tuple Store Summary
• Semi-structured data
– No need to normalize data
• Simple implementations
– Cheap, fast, scalable
• Map Reduce Processing
– Simple programming (for geeks)
• Issues
– No guidance from schema
– No model for historic data
Hadoop wins Sort Benchmark
Synthesis
Summary
• SQL
– Structured data
– Precise
– Historic data
– Needs transformation
– Scalability issues
• NoSQL
– Cheap
– Scalable
– Handles large data
Enterprise Model
Money, Content, Analytics
Relational DB, NoSQL, ?
Metadata?
Issues:
- Data volume
- Query requirements
Analytics Architecture
• Tuple Store – Map Reduce Processing, TB+/day, Reports
• RDB Data Warehouse – GB++/day, Cubes, Reports, etc.
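One way to read this architecture is that Map Reduce condenses the raw tuples into much smaller aggregates, which are then loaded into the relational warehouse for reporting. The sketch below illustrates that reading only; the event format, the `page_views` table and the data are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw events in the tuple store:
# key = (timestamp, user), value = page visited.
raw_events = [
    (("2009-06-11T10:00", "u1"), "/home"),
    (("2009-06-11T10:05", "u2"), "/home"),
    (("2009-06-11T11:00", "u1"), "/pricing"),
    (("2009-06-12T09:00", "u3"), "/home"),
]

# Map: emit ((day, page), 1). Group by key. Reduce: sum the ones.
grouped = defaultdict(list)
for (timestamp, _user), page in raw_events:
    grouped[(timestamp[:10], page)].append(1)
daily_views = [(day, page, sum(ones)) for (day, page), ones in grouped.items()]

# Load the aggregates into a relational warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page_views (
        day   TEXT NOT NULL,
        page  TEXT NOT NULL,
        views INTEGER NOT NULL,
        PRIMARY KEY (day, page)
    )
""")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", daily_views)

# Report from the warehouse with ordinary SQL.
report = conn.execute("""
    SELECT page, SUM(views)
    FROM page_views
    GROUP BY page
    ORDER BY page
""").fetchall()
print(report)
```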
Summary
It is all about structured data
How much do we want?
How much can we afford?