web-scale data processing: practical approaches for low-latency and batch
$>whoami
Edward Capriolo
Developer @ dstillery (the company formerly known as m6d aka media6degrees)
Hive: Project Management Committee
Hadoop'in it since 0.17.2
Cassandra'in it since 0.6.X
Hive'in it since 0.3.X
Incredibly skilled with PowerPoint
Agenda for this talk
Batch processing via Hadoop
Stream processing
Relational Databases and NoSQL
Life lessons, quips, and other perspectives
Before we talk tech...
Let's talk math!
Yay! math fun! (as people start leaving room)
Don't worry. It is only a couple slides.
Wanted to talk about relational algebra since it is the foundation of relational databases
Even in the NoSQL age, relational algebra is alive and well
Relational algebra...
A big slide with many words
Relational algebra received little attention outside of pure mathematics until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages.
In computer science, relational algebra is an offshoot of first-order logic and of algebra of sets concerned with operations over finitary relations, usually made more convenient to work with by identifying the components of a tuple by a name (called attribute) rather than by a numeric column index, which is called a relation in database terminology.
http://en.wikipedia.org/wiki/Relational_algebra
Operators of Relational algebra:
Projection
SELECT Age, Weight ...
Extended projections
SELECT Age+Weight as X ...
SELECT ROUND(Weight),Age+1 as X ...
Selection
SELECT * FROM Person
SELECT * FROM Person WHERE Age >=34
SELECT * FROM Person WHERE Age = Weight
Joins
SELECT * FROM Car JOIN Boat on (CarPrice >= BoatPrice)
SELECT * FROM Car JOIN Boat on (CarPrice = BoatPrice)
Aggregate
SELECT sum(C) FROM r
SELECT A, sum(C) FROM r GROUP BY A
http://www.cbcb.umd.edu/confcour/CMSC424/Relational_algebra.pdf
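The four operators above can be sketched outside of SQL as plain functions over rows. This is a minimal illustration, assuming a relation is just a list of dicts; the function names and sample rows are made up for the example and do not come from any library.

```python
# Toy relation as a list of dicts; names and values are invented for illustration.
person = [
    {"Name": "Ann", "Age": 34, "Weight": 60},
    {"Name": "Bob", "Age": 28, "Weight": 28},
    {"Name": "Cy", "Age": 40, "Weight": 75},
]

def project(rows, *cols):
    """Projection: keep only the named columns (SELECT Age, Weight ...)."""
    return [{c: r[c] for c in cols} for r in rows]

def extend(rows, name, fn):
    """Extended projection: add a computed column (SELECT Age+Weight AS X ...)."""
    return [{**r, name: fn(r)} for r in rows]

def select(rows, pred):
    """Selection: keep the rows that satisfy a predicate (WHERE ...)."""
    return [r for r in rows if pred(r)]

def theta_join(left, right, pred):
    """Join: pair every left row with every right row, keep pairs passing pred."""
    return [{**l, **r} for l in left for r in right if pred(l, r)]

def aggregate_sum(rows, key, value):
    """Aggregate: SELECT key, sum(value) ... GROUP BY key."""
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals

# Usage mirroring the SQL on the slides.
car = [{"Car": "sedan", "CarPrice": 20000}, {"Car": "coupe", "CarPrice": 5000}]
boat = [{"Boat": "dinghy", "BoatPrice": 5000}, {"Boat": "yacht", "BoatPrice": 90000}]
r = [{"A": "x", "C": 1}, {"A": "x", "C": 2}, {"A": "y", "C": 5}]

older = select(person, lambda p: p["Age"] >= 34)
same = select(person, lambda p: p["Age"] == p["Weight"])
ge_join = theta_join(car, boat, lambda c, b: c["CarPrice"] >= b["BoatPrice"])
eq_join = theta_join(car, boat, lambda c, b: c["CarPrice"] == b["BoatPrice"])
sums = aggregate_sum(r, "A", "C")
```

The join here is the naive nested loop, which is fine for a slide-sized example; real engines pick smarter strategies for the equality case.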
Other Operators
Set operations
Union
Intersection
Cartesian Product
Outer joins
LEFT
RIGHT
FULL
Semi Join / Exists
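A few of these operators are easy to confuse, so here is a small sketch of how they behave. It assumes toy relations (sets of tuples for the set operations, lists of dicts for the joins); the `customers` and `orders` data and the helper names are hypothetical.

```python
# Set operations on relations with the same shape.
a = {("Ann",), ("Bob",)}
b = {("Bob",), ("Cy",)}

union = a | b
intersection = a & b
cartesian = {x + y for x in a for y in b}  # every pairing of a row from a with a row from b

customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}]
orders = [{"cid": 1, "book": "Programming Hive"}]

def left_outer_join(left, right, key, right_cols):
    """Keep every left row; fill the right columns with None when nothing matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **m} for m in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out

def semi_join(left, right, key):
    """Keep left rows that have at least one match; no right columns are added."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] in keys]

with_orders = left_outer_join(customers, orders, "cid", ["book"])
buyers = semi_join(customers, orders, "cid")
```

The semi join answers "which customers have an order?" without duplicating a customer once per order, which is the same question an EXISTS subquery asks.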
Batch Processing and Big Data
When Hadoop came on the scene it was a game changer because:
Viable implementation of Google's MapReduce white paper
Worked with commodity hardware
Had no exorbitant software fees
Scaled processing and storage with growing companies, typically without needing processes to be redesigned
Archetype Hadoop deployment
(Facebook, circa 2009)
Web Servers
Scribe Writers
Realtime Hadoop Cluster
Hadoop Hive Warehouse
Oracle RAC
MySQL
Scribe MidTier
http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
The Hadoop archetype
Component generating events (web servers)
Component collecting logs into Hadoop (Scribe)
Translation of raw data using Hadoop and Hive
Output of rollups to Oracle and other data systems, with feedback loops (MySQL, Hive)
Use case: Book store
Our book store will be named (say it with me!):
Web scale,
Big Data,
No SQL,
Real Time Analytics,
Books!
One more time!
Web scale, Big Data, No SQL, Real Time Analytics, Books
(A buzzword bingo company)
Domain model
{
  "id": "00001",
  "refer": "http://affiliate1.superbooks.com",
  "ip": "209.191.139.200",
  "status": "ACCEPTED",
  "eventTimeInMillis": 1383011801439,
  "credit_hash": "ab45de21",
  "email": "[email protected]",
  "purchases": [
    { "name": "Programming Hive", "cost": 30.0 },
    { "name": "frAgile Software Development", "cost": 0.2 }
  ]
}
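To make the shape of this record concrete, the sketch below parses it and pulls out the two things later slides care about: the nested `purchases` array and the event time. The record is copied from the slide (the email value is reproduced exactly as it appears in the transcript); the variable names are just for illustration.

```python
import json
from datetime import datetime, timezone

# The transaction event from the slide, as a raw JSON string.
raw = """
{ "id":"00001",
  "refer":"http://affiliate1.superbooks.com",
  "ip":"209.191.139.200",
  "status":"ACCEPTED",
  "eventTimeInMillis":1383011801439,
  "credit_hash":"ab45de21",
  "email":"[email protected]",
  "purchases":[
    { "name":"Programming Hive", "cost":30.0 },
    { "name":"frAgile Software Development", "cost":0.2 } ]}
"""

event = json.loads(raw)

# The nested array is what makes this awkward as a flat, tab-delimited row.
total = sum(p["cost"] for p in event["purchases"])

# eventTimeInMillis is epoch milliseconds; convert to an aware UTC datetime.
when = datetime.fromtimestamp(event["eventTimeInMillis"] / 1000, tz=timezone.utc)
```

One record holds a variable-length list of purchases, so any flat representation has to either repeat the parent fields per purchase or serialize the list into a single column.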
Complex serialized payloads
In Facebook's case, the web logs being processed were NOT always tab-delimited text files
In many cases Scribe was logging complex structures in Thrift format
Hadoop (and Hive) can work with complex records that are not typical in an RDBMS
Log collection/ingestion
http://flume.apache.org/FlumeUserGuide.html
Several ingestion approaches
Scribe never took off
Chukwa (hangs around, not sexy)
Log servers write directly with the HDFS API
Duct-taped set of shell scripts
Flume seems to be the most widely used, feature rich, and supported system
Left up to the user...
What format do you want the raw data in
How should the data be staged in HDFS
hourly directories
by host
How to monitor
Semantics of what the pipeline should do if files stop appearing?
Application specific sanity checks
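As one possible answer to the staging and monitoring questions, here is a rough sketch of an hourly, per-host directory layout plus a check for hours where nothing arrived. The base path, host name, and directory scheme are all assumptions for the example, not a Flume or HDFS convention, and the check works on a plain set of hours rather than a real HDFS listing.

```python
from datetime import datetime, timedelta, timezone

def staging_path(base, host, event_time_millis):
    """Build an hourly, per-host directory: base/YYYY/MM/DD/HH/host (hypothetical layout)."""
    t = datetime.fromtimestamp(event_time_millis / 1000, tz=timezone.utc)
    return f"{base}/{t.strftime('%Y/%m/%d/%H')}/{host}"

def missing_hours(seen_hours, start, end):
    """Return every hour in [start, end) for which no files were seen."""
    missing = []
    hour = start
    while hour < end:
        if hour not in seen_hours:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing

# Usage: the event time from the domain model lands in the 01:00 UTC bucket.
path = staging_path("/logs/raw", "web01", 1383011801439)

start = datetime(2013, 10, 29, 0, tzinfo=timezone.utc)
seen = {start, start + timedelta(hours=2)}  # nothing arrived for 01:00
gaps = missing_hours(seen, start, start + timedelta(hours=3))
```

What to do with a gap (alert, block downstream jobs, or process anyway) is exactly the semantics the slide says are left up to you.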
Unleash the hounds!
SELECT refer, sum(purchase.cost)
FROM store_transaction
LATERAL VIEW explode(purchases) plist AS purchase
WHERE refer = 'y'
GROUP BY refer
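For readers who have not met LATERAL VIEW explode, the sketch below shows what the query is meant to compute: one output row per element of the nested array, filtered by `refer`, then summed per `refer`. It assumes the array field is named `purchases` as in the domain model; the sample rows and the function name are invented.

```python
from collections import defaultdict

# Hypothetical rows shaped like the domain model, trimmed to the fields the query uses.
store_transaction = [
    {"refer": "y", "purchases": [{"name": "Programming Hive", "cost": 30.0},
                                 {"name": "frAgile Software Development", "cost": 0.2}]},
    {"refer": "z", "purchases": [{"name": "Programming Hive", "cost": 30.0}]},
    {"refer": "y", "purchases": [{"name": "Programming Hive", "cost": 30.0}]},
]

def revenue_by_refer(rows, refer):
    """Explode purchases, filter on refer, then sum cost per refer."""
    # explode: one (refer, purchase) pair per element of the nested array
    exploded = ((r["refer"], p) for r in rows for p in r["purchases"])
    totals = defaultdict(float)
    for ref, purchase in exploded:
        if ref == refer:                      # WHERE refer = ...
            totals[ref] += purchase["cost"]   # GROUP BY refer with sum(cost)
    return dict(totals)

result = revenue_by_refer(store_transaction, "y")
```

In relational algebra terms this is an unnest, then a selection, then an aggregate.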
Hive and relational algebra