Introduction to Real-Time Analytics with Cassandra and Hadoop


Upload: patricia-gorla

Post on 10-May-2015


DESCRIPTION

This presentation examines the benefits of using Cassandra to store data, and how the Hadoop ecosystem can fit in to add aggregation functionality to your cluster. Accompanying code can be found online at bit.ly/1aB8Jy8. Talk delivered at StrataConf + Hadoop World 2013.

TRANSCRIPT

Page 1: Introduction to Real-Time Analytics with Cassandra and Hadoop

#strataconf + #hw2013

Real-Time Analytics with Cassandra and Hadoop

Patricia Gorla

Download code: bit.ly/1aB8Jy8 (12KB)

Page 2: Introduction to Real-Time Analytics with Cassandra and Hadoop

About Me
• Solr
• Cassandra
• DataStax MVP


Page 3: Introduction to Real-Time Analytics with Cassandra and Hadoop

Outline

• Introduction to Cassandra + 2 labs

15-minute break at ~14:30

• Analytics + 1 lab

15-minute break at ~16:30

• Extra Credit


Page 4: Introduction to Real-Time Analytics with Cassandra and Hadoop

Introduction


Page 5: Introduction to Real-Time Analytics with Cassandra and Hadoop

Getting Started

Architecture

Data Modeling


Page 6: Introduction to Real-Time Analytics with Cassandra and Hadoop

History
• Powered inbox search at Facebook
• Open-sourced in 2008

Page 7: Introduction to Real-Time Analytics with Cassandra and Hadoop

Why Cassandra?
• Linear scalability
• Availability
• Set it and forget it

Page 8: Introduction to Real-Time Analytics with Cassandra and Hadoop

Many companies use Cassandra.

...

Page 9: Introduction to Real-Time Analytics with Cassandra and Hadoop

What is Cassandra?
• Dynamo distributed cluster (no vector clocks)
• Bigtable data model
• No SPOF
• Tunably consistent
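
A note on "tunably consistent": with a replication factor of N, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The sketch below only checks that arithmetic; the function names are illustrative and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """True when any r replicas read must include one of the w replicas written."""
    return r + w > n
```

With N = 3, quorum(3) is 2, so QUORUM writes plus QUORUM reads overlap (2 + 2 > 3), while writing and reading at ONE (1 + 1) trades that guarantee for lower latency.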

Page 10: Introduction to Real-Time Analytics with Cassandra and Hadoop

Architecture

Cluster

Keyspace

Page 11: Introduction to Real-Time Analytics with Cassandra and Hadoop

Column Family 1

Keyspace

Column Family 2

Page 12: Introduction to Real-Time Analytics with Cassandra and Hadoop

Column Family 1

Keyspace

Column Family 2

row1: {col1:val1,time,TTL; … }
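
The row notation above (a column name with a value, a write time, and an optional TTL) can be modeled roughly as follows. This is a simplified sketch: the class and field names are hypothetical, and both the timestamp and the TTL are expressed in seconds for readability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int             # write time; the newest version wins
    ttl: Optional[int] = None  # seconds to live; None means the column never expires

    def is_live(self, now: int) -> bool:
        """A column with a TTL stops being readable once the TTL has elapsed."""
        return self.ttl is None or now < self.timestamp + self.ttl


def newest(a: Column, b: Column) -> Column:
    """Last write wins: keep whichever version has the higher timestamp."""
    return a if a.timestamp >= b.timestamp else b
```

For example, Column("col1", "val1", timestamp=100, ttl=10) is live at time 105 and expired at time 110.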

Page 13: Introduction to Real-Time Analytics with Cassandra and Hadoop

Lab

introduction/1-getting-started.md


Page 14: Introduction to Real-Time Analytics with Cassandra and Hadoop

Getting Started

Architecture

Data Modeling

Page 15: Introduction to Real-Time Analytics with Cassandra and Hadoop

Writes

Commit Log -> Memtable -> SSTables

Source: datastax.com
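
The write path (commit log, then memtable, then SSTables) can be sketched as a toy single-node model. Everything here is an assumption made for illustration: the Node class, the memtable_limit parameter, and the use of in-memory lists in place of on-disk files.

```python
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log, written first for durability
        self.memtable = {}     # in-memory view of the most recent writes
        self.sstables = []     # immutable sorted tables (plain lists in this sketch)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Persist the memtable as a sorted, immutable table and start fresh."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        """Check the memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

Writes never modify an existing SSTable in this model; a newer value simply shadows the older one at read time.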

Page 16: Introduction to Real-Time Analytics with Cassandra and Hadoop

Incoming write to cluster.

Page 17: Introduction to Real-Time Analytics with Cassandra and Hadoop
Page 18: Introduction to Real-Time Analytics with Cassandra and Hadoop
Page 19: Introduction to Real-Time Analytics with Cassandra and Hadoop

Data replicated to replicas.

Page 20: Introduction to Real-Time Analytics with Cassandra and Hadoop

Data partitioning by token ranges.
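
Partitioning by token ranges can be sketched as a sorted ring in which each node owns the range ending at its token. The node names and token values are hypothetical, MD5 is used only as a stand-in hash, and replica placement simply walks clockwise around the ring.

```python
import bisect
import hashlib


def token(key: str, ring_size: int = 2**32) -> int:
    """Hash a partition key onto the ring (stand-in for the real partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % ring_size


class TokenRing:
    def __init__(self, node_tokens):
        # node_tokens maps a node name to the single token it owns.
        self.ring = sorted((t, name) for name, t in node_tokens.items())
        self.tokens = [t for t, _ in self.ring]

    def owner_of_token(self, t):
        """First node whose token is >= t, wrapping past the highest token."""
        i = bisect.bisect_left(self.tokens, t) % len(self.ring)
        return self.ring[i][1]

    def replicas(self, t, replication_factor):
        """Owner plus the next nodes clockwise (assumes replication_factor <= node count)."""
        start = bisect.bisect_left(self.tokens, t)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(replication_factor)]
```

With nodes A, B, and C at tokens 0, 100, and 200, token 150 belongs to C, and with a replication factor of 2 its replicas are C and A.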

Page 21: Introduction to Real-Time Analytics with Cassandra and Hadoop

Data partitioning by virtual nodes.
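
With virtual nodes, each physical node owns many small token ranges instead of one large one. The sketch below assigns random tokens to each node and measures how much of the ring each node ends up owning; the function names, the seed, and the tokens-per-node value are assumptions chosen for illustration.

```python
import bisect
import random


def build_vnode_ring(nodes, tokens_per_node, ring_size=2**32, seed=0):
    """Give every node several distinct random tokens; return a sorted (token, node) list."""
    rng = random.Random(seed)
    picks = rng.sample(range(ring_size), len(nodes) * tokens_per_node)
    return sorted((tok, nodes[i // tokens_per_node]) for i, tok in enumerate(picks))


def owner(ring, t):
    """Node owning token t: the first ring entry with token >= t, wrapping around."""
    tokens = [tok for tok, _ in ring]
    return ring[bisect.bisect_left(tokens, t) % len(ring)][1]


def ownership(ring, ring_size=2**32):
    """Fraction of the ring owned by each node, summed over all of its ranges."""
    share = {}
    prev = ring[-1][0] - ring_size  # the first range wraps around from the last token
    for tok, node in ring:
        share[node] = share.get(node, 0) + (tok - prev)
        prev = tok
    return {node: size / ring_size for node, size in share.items()}
```

Raising tokens_per_node tends to even out the ownership fractions, which is the main appeal of virtual nodes over hand-assigned token ranges.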

Page 22: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reads

Page 23: Introduction to Real-Time Analytics with Cassandra and Hadoop

Source: fusionio.com

High-level overview of reads.

Page 24: Introduction to Real-Time Analytics with Cassandra and Hadoop

Source: datastax.com

Page 25: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 26: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 27: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 28: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 29: Introduction to Real-Time Analytics with Cassandra and Hadoop

Fault tolerance

Page 30: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 31: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 32: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 33: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 34: Introduction to Real-Time Analytics with Cassandra and Hadoop

Deletes
• Distributed deletes are tricky
• Tombstones may not be propagated
• Don’t rely on a delete-heavy system
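
Why an unpropagated tombstone is a problem can be shown with a toy two-replica model. A delete is stored as a timestamped marker; if one replica purges the marker before the other replica has ever seen it, the old value comes back. All names here are hypothetical, and gc() stands in for the tombstone purge that compaction performs after the grace period.

```python
TOMBSTONE = object()  # marker written in place of a value on delete


def write(replica, key, value, ts):
    """Apply a write only if it is newer than what the replica already holds."""
    current = replica.get(key)
    if current is None or ts > current[1]:
        replica[key] = (value, ts)


def delete(replica, key, ts):
    write(replica, key, TOMBSTONE, ts)


def reconcile(versions):
    """Newest version across replicas wins; returns None if deleted or absent."""
    versions = [v for v in versions if v is not None]
    if not versions:
        return None
    value, _ = max(versions, key=lambda v: v[1])
    return None if value is TOMBSTONE else value


def gc(replica):
    """Purge tombstones from one replica."""
    for key in [k for k, (v, _) in replica.items() if v is TOMBSTONE]:
        del replica[key]
```

If both replicas hold a value, the delete reaches only the first replica, and that replica then runs gc() before a repair, reconcile() returns the supposedly deleted value again.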

Page 35: Introduction to Real-Time Analytics with Cassandra and Hadoop

Getting Started

Architecture

Data Modeling

Page 36: Introduction to Real-Time Analytics with Cassandra and Hadoop

Protocols

Thrift
• Thrift, CQL
• Synchronous

Binary
• CQL
• Asynchronous

Page 37: Introduction to Real-Time Analytics with Cassandra and Hadoop

Why CQL?

• Familiar syntax
• Flexible data model over Cassandra

Page 38: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Creating a Keyspace

CREATE KEYSPACE "Patisserie" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE "Patisserie";

Page 39: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Creating a Column Family

CREATE TABLE "customers" (customer text, age int, PRIMARY KEY (customer));

CQL Schema

customer | age
Yves Laurent | 77
Coco Chanel | 130
Pierre Cardin |

Page 40: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Creating a Column Family

CREATE TABLE "customers" (customer text, age int, PRIMARY KEY (customer));

Physical Representation

"Yves Laurent": {"age": 77}

"Coco Chanel": {"age": 130}

"Pierre Cardin": {}

Page 41: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Composite Columns

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item text,
  PRIMARY KEY (customer, day)
);

customer | day | item
ylaurent | M | rivoli
ylaurent | T | mille feuille
cchanel | M | pain au chocolat
pcardin | W | mille feuille
pcardin | F | croissant

Page 42: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Composite Columns

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item text,
  PRIMARY KEY (customer, day)
);

"ylaurent": { "M:item": "rivoli", "T:item": "mille feuille" }

"cchanel": { "M:item": "pain au chocolat" }

"pcardin": { "W:item": "mille feuille", "F:item": "croissant" }
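
The mapping from the CQL rows to the physical representation above can be sketched as a small transformation: the first primary key column selects the storage row, and the second is prefixed onto each remaining column name. The function name to_wide_rows and its parameters are hypothetical. A real cluster keeps cells sorted by the clustering column, whereas this sketch keeps insertion order.

```python
def to_wide_rows(rows, partition_key, clustering_key):
    """Fold CQL rows into one wide row per partition key value."""
    wide = {}
    for row in rows:
        cells = wide.setdefault(row[partition_key], {})
        prefix = row[clustering_key]
        for column, value in row.items():
            if column not in (partition_key, clustering_key):
                cells[f"{prefix}:{column}"] = value
    return wide


purchases = [
    {"customer": "ylaurent", "day": "M", "item": "rivoli"},
    {"customer": "ylaurent", "day": "T", "item": "mille feuille"},
    {"customer": "cchanel", "day": "M", "item": "pain au chocolat"},
    {"customer": "pcardin", "day": "W", "item": "mille feuille"},
    {"customer": "pcardin", "day": "F", "item": "croissant"},
]
wide_rows = to_wide_rows(purchases, "customer", "day")
```

Here wide_rows["ylaurent"] holds the cells "M:item" and "T:item", matching the slide.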

Page 43: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Composite Primary Keys

CREATE TABLE "daily_sales_by_item" (day text, customer text, hour timestamp, item text, PRIMARY KEY ((day, customer), hour));

day | customer | hour | item
M | cchanel | 13 | rivoli
M | cchanel | 15 | mille feuille
M | ylaurent | 4 | rivoli
T | cchanel | 17 | mille feuille
W | pcardin | 20 | croissant

Page 44: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Composite Primary Keys

CREATE TABLE "daily_sales_by_item" (day text, customer text, hour timestamp, item text, PRIMARY KEY ((day, customer), hour));

"M:cchanel": { "13:item": "rivoli", "15:item": "mille feuille" }

"M:ylaurent": { "4:item": "rivoli" }

"T:cchanel": { "17:item": "mille feuille" }

"W:pcardin": { "20:item": "croissant" }

Page 45: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Collections

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item list<text>,
  PRIMARY KEY (customer, day)
);

customer | day | item
ylaurent | M | ['rivoli', 'rivoli', 'javanais']
cchanel | M | ['pain au chocolat']
pcardin | W | ['mille feuille', 'croissant']
pcardin | F | ['croissant']

Page 46: Introduction to Real-Time Analytics with Cassandra and Hadoop

Data Modeling Lab

introduction/2-data-modeling.md

Page 47: Introduction to Real-Time Analytics with Cassandra and Hadoop

Analytics

Page 48: Introduction to Real-Time Analytics with Cassandra and Hadoop

Cassandra and Analytics

Adapting the Data Model

MapReduce Paradigms

Page 49: Introduction to Real-Time Analytics with Cassandra and Hadoop

An Unlikely Union

• Batch processing analytics and real-time data store

• MapReduce, Hive, Pig, Sqoop, Mahout

Page 50: Introduction to Real-Time Analytics with Cassandra and Hadoop

Why Cassandra and Hadoop?

• Unified workload
• Availability
• Simpler deployment

Page 51: Introduction to Real-Time Analytics with Cassandra and Hadoop

DataStax Enterprise

Data Locality

Page 52: Introduction to Real-Time Analytics with Cassandra and Hadoop

DataStax Enterprise

Task Trackers

Job Tracker

Page 53: Introduction to Real-Time Analytics with Cassandra and Hadoop

CFS

MapReduce

Input and output pass through the CassandraFS layer.

Page 54: Introduction to Real-Time Analytics with Cassandra and Hadoop

Starting Analytics Node

$ bin/dse cassandra -t -j

# Starts the task tracker and job tracker on the node

Page 55: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hello, Wordcount

$ bin/dse hadoop fs -put wikipedia /

$ bin/dse hadoop jar wordcount.jar /wikipedia wc-output
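
The contents of wordcount.jar are not shown in the slides. Assuming it is the standard MapReduce word count, the three phases can be sketched in plain Python as below; the function names are illustrative and do not correspond to Hadoop classes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def wordcount(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())
```

On a cluster, the map calls run in parallel next to the data and the shuffle moves each word's counts to a single reducer; here everything runs in one process.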

Page 56: Introduction to Real-Time Analytics with Cassandra and Hadoop

Cassandra and Hadoop

Adapting the Data Model

MapReduce Paradigms

Page 57: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive

• SQL-like MapReduce abstraction
• Data types
• Efficient JOINs, GROUP BY

Page 58: Introduction to Real-Time Analytics with Cassandra and Hadoop

Cassandra and Hive

• Hive still has to have separate tables.
• DSE stores them in a separate keyspace.
• 1:1 mapping to Cassandra CFs
• Schemas must match or columns will be inaccessible.

Page 59: Introduction to Real-Time Analytics with Cassandra and Hadoop

CFS

MapReduce

Hive

The Hive metastore is persisted in the Cassandra layer.

Page 60: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive: Creating a DB

hive> CREATE EXTERNAL TABLE customers (id string, name string, age int)
STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler'
TBLPROPERTIES ("cassandra.ks.name" = "Oberweis", "cassandra.ks.repfactor" = "2", "cassandra.ks.strategy" = "o.a.c.l.SimpleStrategy");

Page 61: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive: Multiple Data Centers

hive> CREATE EXTERNAL TABLE customers (id string, name string, age int)
STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler'
TBLPROPERTIES ("cassandra.ks.name" = "Oberweis", "cassandra.ks.stratOptions" = "DC1:3, DC2:1", "cassandra.ks.strategy" = "o.a.c.l.NTStrategy");

Page 62: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive

• What about composite columns?

• They must be retrieved as binary data and then deserialized with a UDF.

Page 63: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive: Lab

• For each person, calculate how many pastries (and of what kind) they purchased.

Page 64: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive: Multiple Data Centers

hive> SELECT b.name, a.item, sum(a.amount)
FROM Oberweis.daily_purchases a
JOIN Oberweis.person b ON (a.person = b.id)
GROUP BY b.name, a.item;
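
What the query computes can be checked against a plain-Python equivalent: an inner join of purchases to people on the person id, then a sum of amounts per (name, item) pair. The function name and the sample rows are made up for illustration and are not taken from the lab data.

```python
from collections import defaultdict


def purchases_per_person(daily_purchases, person):
    """Inner join on person id, then total the amount per (name, item)."""
    names = {p["id"]: p["name"] for p in person}  # assumes ids are unique
    totals = defaultdict(int)
    for row in daily_purchases:
        name = names.get(row["person"])
        if name is None:
            continue  # inner join: skip purchases with no matching person
        totals[(name, row["item"])] += row["amount"]
    return dict(totals)


# Hypothetical sample rows.
people = [{"id": "cc", "name": "Coco Chanel"}, {"id": "yl", "name": "Yves Laurent"}]
sales = [
    {"person": "cc", "item": "rivoli", "amount": 2},
    {"person": "cc", "item": "rivoli", "amount": 1},
    {"person": "yl", "item": "croissant", "amount": 4},
    {"person": "zz", "item": "croissant", "amount": 9},
]
result = purchases_per_person(sales, people)
```

Hive compiles the same logic into MapReduce jobs, so the grouping key (name, item) plays the role of the reduce key.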

Page 65: Introduction to Real-Time Analytics with Cassandra and Hadoop

Extra Credit

Page 66: Introduction to Real-Time Analytics with Cassandra and Hadoop

Real Time Considerations

• What about real time?

• Neither Hadoop nor Hive is built for real-time workloads

• Cassandra provides you with data locality

Page 67: Introduction to Real-Time Analytics with Cassandra and Hadoop

Cassandra 2.0
• Transactions
• Triggers
• Prepared Statements

Page 68: Introduction to Real-Time Analytics with Cassandra and Hadoop

#strataconf + #hw2013

Q&A

@[email protected] on IRC (#cassandra, #python)