Design Patterns for Building 360-Degree Views with HBase and Kiji

Post on 10-May-2015

Category:

Software

DESCRIPTION

Speaker: Jonathan Natkins (WibiData)

Many companies aspire to have 360-degree views of their data. Whether they're concerned about customers, users, accounts, or more abstract things like sensors, organizations are focused on developing capabilities for analyzing all the data they have about these entities. This talk will introduce the concept of entity-centric storage, discuss what it means, what it enables for businesses, and how to develop an entity-centric system using the open-source Kiji framework and HBase. It will also compare and contrast traditional methods of building a 360-degree view on a relational database versus building against a distributed key-value store, and explain why HBase is a good choice for implementing an entity-centric system.

TRANSCRIPT

Design Patterns for 360º Views using HBase and Kiji

Jonathan Natkins

Who am I?

Jon “Natty” Natkins

Field Engineer at WibiData

Formerly at Cloudera/Vertica

What is a 360º View?

What is a 360º View For?

Past: What interactions has a customer had in the past?

Present: What is the customer doing right now?

Future: What is the customer likely to do next?

Past and present inform the future

What If I Don’t Care About Customers?

Generalizing the 360º View:Entity-Centric Systems

Goal of an Entity-Centric System

“Show me everything I know about Natty”

What Data Do I Need to Store?

Static data

Event-oriented data

Derived data

Building Entity-Centric Systems

Often, this is an EDW with a star schema

[Diagram: star schema with a central Fact table surrounded by four Dim tables]

Challenges With Star Schemas

How do we answer the original question?

Full table scan + joins

OLTP systems will likely fall over from the volume

OLAP systems are usually not optimized for single-row lookups

Need Something Else…

Why HBase?

HBase rows can store both static and event-oriented data

Cell versions are key

Single-row lookups are extremely fast

Kiji is for Building Entity-Centric Systems

Often used for:

Building recommendation systems

Personalized search

Real-time HBase applications

Underlying technologies:

Designing an Entity-Centric Datastore

Ask yourself this: what is the entity?

Determine your entity by determining how you want to analyze the data

It’s ok to have data organized in multiple ways

Schema Management with Kiji

Sometimes you actually want a schema layer

Defining a schema allows for data discoverability

Column Families in Kiji

Kiji has two types of column families

Group families are similar to relational tables

Predefined set of columns

Each column has its own data type

Map families specify columns at runtime

Every column has the same data type

[Diagram: example columns sessions:2345, sessions:1234, and info:purchases]

Knowing When To Use Different Family Types

Do you know all of your columns up front?

Then use a group family

Map families are for when you don’t know your columns ahead of time

[Diagram: example row with columns info:name, info:email, sessions:1234, sessions:2345, and info:purchases]
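The difference between the two family types can be sketched in a few lines of Python. This is only an illustrative model, not Kiji's API: the `GroupFamily` and `MapFamily` classes, the `check` method, and the choice of `int` for `info:purchases` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GroupFamily:
    """Fixed set of columns declared up front; each has its own type."""
    name: str
    columns: dict  # qualifier -> expected Python type (hypothetical schema)

    def check(self, qualifier, value):
        if qualifier not in self.columns:
            raise KeyError(f"{self.name}:{qualifier} was not declared")
        expected = self.columns[qualifier]
        if not isinstance(value, expected):
            raise TypeError(f"{self.name}:{qualifier} expects {expected.__name__}")


@dataclass
class MapFamily:
    """Qualifiers are chosen at write time; every value shares one type."""
    name: str
    value_type: type

    def check(self, qualifier, value):
        if not isinstance(value, self.value_type):
            raise TypeError(f"{self.name}:* expects {self.value_type.__name__}")


# Column names follow the slides; the value types are assumptions.
info = GroupFamily("info", {"name": str, "email": str, "purchases": int})
sessions = MapFamily("sessions", str)

info.check("name", "Natty")            # declared column, correct type
sessions.check("1234", "viewed home")  # qualifier invented at write time
sessions.check("2345", "viewed cart")  # another new qualifier, same type
```

The group family rejects a column it has never heard of, while the map family accepts any qualifier as long as the value type matches.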

Choosing a Row Key

Row keys in Kiji are componentized

[ ‘component1’, ‘component2’, 1234 ]

More efficient than byte arrays

Consider ‘1234567890’ versus [ 1234567890 ]

Good for scanning areas of the keyspace
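A rough way to see the size difference is to compare the text form of an ID with a fixed-width binary form. The sketch below assumes a big-endian fixed-width encoding purely for illustration; the slides do not specify how Kiji actually serializes numeric components.

```python
import struct

user_id = 1234567890

as_text = str(user_id).encode("utf-8")  # one byte per digit
as_int = struct.pack(">i", user_id)     # 32-bit big-endian (assumed encoding)
as_long = struct.pack(">q", user_id)    # 64-bit big-endian (assumed encoding)

print(len(as_text), len(as_int), len(as_long))

# Fixed-width big-endian values also sort bytewise in numeric order
# (for non-negative numbers), which text does not: "9" sorts after "10".
text_order_is_numeric = b"9" < b"10"
binary_order_is_numeric = struct.pack(">i", 9) < struct.pack(">i", 10)
print(text_order_is_numeric, binary_order_is_numeric)
```

Numeric ordering matters for the scanning point above: a range scan over typed components visits IDs in their natural order.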

A Common Use for Components

Known user IDs versus unknown IDs

On a website, how do you differentiate between a logged-in or cookie’d user and a brand new visitor?

[ ‘K’, ‘user1234’ ] or [ ‘U’, ‘unknown2345’ ]

Physically and logically separate rows

Run jobs over all known or unknown users
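To illustrate why a leading component makes "all known users" a single contiguous scan, here is a minimal sketch that models the keyspace as a sorted list of tuples. The `scan_prefix` helper and the sample IDs are hypothetical, and the model assumes keys sort by their components in order.

```python
from bisect import bisect_left

# Row keys as (type, id) tuples, kept in sorted order like a keyspace.
row_keys = sorted([
    ("K", "user1234"),
    ("U", "unknown2345"),
    ("K", "user0042"),
    ("U", "unknown0007"),
])


def scan_prefix(sorted_keys, first_component):
    """Return every key whose first component matches, in key order."""
    start = bisect_left(sorted_keys, (first_component,))
    matches = []
    for key in sorted_keys[start:]:
        if key[0] != first_component:
            break  # left the contiguous range for this prefix
        matches.append(key)
    return matches


known = scan_prefix(row_keys, "K")
unknown = scan_prefix(row_keys, "U")
```

A job over known users reads only the `K` range and never touches the rows for anonymous visitors.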

Identifying Known Users

Problem: Users have many cookies over time.

Challenge: Ideally, we would have a single row for each user. How do we ensure that new data goes to the right row?

Finding Known Users With Lookup Tables

HBase get operations are fast

It’s easy enough to create a table that contains a mapping of cookies to known user IDs

When data is loaded, check the lookup table to determine if you should write data to an existing row or a new one
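The lookup-table flow described above might look like the following sketch, with plain dictionaries standing in for the two HBase tables. The `EventLoader` class and its method names are assumptions for illustration, and migrating data a visitor generated before logging in is deliberately left out.

```python
class EventLoader:
    def __init__(self):
        self.cookie_to_user = {}  # lookup table: cookie ID -> known user ID
        self.rows = {}            # main table: row key -> list of events

    def register_login(self, cookie_id, user_id):
        """Record that this cookie belongs to a known user."""
        self.cookie_to_user[cookie_id] = user_id

    def row_key_for(self, cookie_id):
        """One fast get against the lookup table decides the target row."""
        user_id = self.cookie_to_user.get(cookie_id)
        if user_id is not None:
            return ("K", user_id)
        return ("U", cookie_id)

    def load_event(self, cookie_id, event):
        self.rows.setdefault(self.row_key_for(cookie_id), []).append(event)


loader = EventLoader()
loader.load_event("cookieA", "view home")   # not yet known: goes to a U row
loader.register_login("cookieA", "user1234")
loader.register_login("cookieB", "user1234")
loader.load_event("cookieA", "add to cart")
loader.load_event("cookieB", "purchase")    # different cookie, same user row
```

Both cookies now resolve to the same `K` row, which is the single-row-per-user goal stated in the challenge.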

Avoiding Hotspots

Unhashed Row Keys

[Diagram: Node 1, Node 2, and Node 3 hosting regions A-B, B-C, D-E, F-G, H-I, and J-K]

Hash-Prefixed Row Keys

[Diagram: Node 1, Node 2, and Node 3 hosting regions 00A-0fK, 10A-1fK, 20A-2fK, 30A-3fK, 40A-4fK, and 50A-5fK]
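The effect of a hash prefix can be sketched as follows. The MD5 hash, the one-byte prefix, the six evenly split regions, and the `region_for` helper are all assumptions chosen for the example; the slides do not say which hash or prefix length Kiji uses.

```python
import hashlib

NUM_REGIONS = 6  # matches the six regions drawn on the slide


def hash_prefixed_key(user_id: str) -> bytes:
    """Prepend one byte of a hash of the ID to the ID itself."""
    raw = user_id.encode("utf-8")
    prefix = hashlib.md5(raw).digest()[:1]
    return prefix + raw


def region_for(key: bytes) -> int:
    """Toy region assignment: split the first-byte range evenly."""
    return key[0] * NUM_REGIONS // 256


# Sequential IDs share a leading byte, so unhashed they pile into one region.
user_ids = [f"user{n:06d}" for n in range(1000)]
unhashed_regions = {region_for(uid.encode("utf-8")) for uid in user_ids}
hashed_regions = {region_for(hash_prefixed_key(uid)) for uid in user_ids}

print(len(unhashed_regions), len(hashed_regions))
```

Because the prefix is computed from the ID, a reader who knows the ID can recompute the full key, so single-row lookups still work.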

Storing Event Series

360º views need easy access to all the transactions and events for a user

HBase cells may contain more than one version

Kiji leverages this to store event series data like clicks or purchases

[Diagram: a user row with info:name and info:email, plus multiple cell versions stored under sessions:1234, sessions:2345, and info:purchases]
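As a rough model of how versioned cells hold an event series, the sketch below keeps every write to a column under its timestamp and reads back the newest versions first. The `Row` class and its methods are hypothetical and are not the HBase or Kiji client API.

```python
class Row:
    """Toy model of one entity row with timestamped cell versions."""

    def __init__(self):
        # (family, qualifier) -> {timestamp: value}
        self._cells = {}

    def put(self, family, qualifier, timestamp, value):
        self._cells.setdefault((family, qualifier), {})[timestamp] = value

    def get_versions(self, family, qualifier, max_versions=None):
        """Return (timestamp, value) pairs, newest first."""
        versions = self._cells.get((family, qualifier), {})
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


row = Row()
row.put("info", "name", 100, "Natty")  # static data: one version
row.put("info", "purchases", 100, "book")
row.put("info", "purchases", 200, "laptop")
row.put("info", "purchases", 300, "coffee")

latest_two = row.get_versions("info", "purchases", max_versions=2)
```

Static and event-oriented data live in the same row; the only difference is how many versions a column accumulates.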

How Many Events is Too Many?

The HBase book warns that too many versions of a cell can cause StoreFile bloat

HBase will never split a row

Common tactic is to add a timestamp range to the row key

Kiji makes this easy with componentized row keys
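One way to picture the timestamp-range tactic is to add a time bucket as an extra row key component, so a busy entity is spread over several rows. The monthly bucket size and the `bucketed_row_key` helper are assumptions for illustration only.

```python
from datetime import datetime, timezone


def bucketed_row_key(user_id, event_time):
    """Row key with a year-month bucket as its last component."""
    bucket = event_time.strftime("%Y-%m")
    return ("K", user_id, bucket)


june = bucketed_row_key("user1234", datetime(2013, 6, 13, tzinfo=timezone.utc))
july = bucketed_row_key("user1234", datetime(2013, 7, 2, tzinfo=timezone.utc))

# Buckets for the same user sort next to each other, so reading the
# full history is still one contiguous scan over that user's rows.
keys = sorted([
    july,
    bucketed_row_key("user9999", datetime(2013, 6, 1, tzinfo=timezone.utc)),
    june,
])
```

Each row now holds at most one month of events, which bounds row size at the cost of reading several rows to assemble the full history.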

Beware of Timestamp Misuse

A major reason the HBase book warns against mucking with timestamps is that they can be dangerous

What happens if you use a sequence number as a timestamp? Think about TTLs
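The TTL question can be made concrete with a simplified expiry check. This sketch assumes a cell expires once its timestamp is older than the current time minus the TTL; the function name and the sample values are hypothetical.

```python
DAY_MS = 24 * 60 * 60 * 1000


def is_expired(cell_timestamp_ms, now_ms, ttl_ms):
    """Simplified TTL rule: the cell is older than the allowed age."""
    return now_ms - cell_timestamp_ms > ttl_ms


now_ms = 1_400_000_000_000  # an arbitrary wall-clock time in milliseconds
ttl_ms = 30 * DAY_MS

recent_event = now_ms - 1_000  # written one second ago
sequence_number = 42           # a counter misused as a timestamp

print(is_expired(recent_event, now_ms, ttl_ms))
print(is_expired(sequence_number, now_ms, ttl_ms))
```

A sequence number reads as a moment shortly after the epoch, so under any finite TTL the cell looks decades old and is eligible for removal as soon as it is written.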

Iterate and Evolve

Why is Evolution Necessary?

No entity-centric system will be the end-all, be-all the first time around

Data sources in large enterprises are usually heavily siloed

Start small

Incorporate new data sources over time

Putting it Together

Kiji includes a shell to use DDL to create tables

Many of the features that have been discussed are declarative via the DDL

Users Table

CREATE TABLE 'user_events' WITH DESCRIPTION 'Events table for online users.'
ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id))
PROPERTIES (NUMREGIONS = 32)
WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
  MAXVERSIONS = INFINITY,
  TTL = FOREVER,
  INMEMORY = false,
  MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events'
),
LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' (
  MAXVERSIONS = 10,
  TTL = FOREVER,
  INMEMORY = true,
  FAMILY recs (
    recommended CLASS com.kiji.avro.ProductRecList WITH DESCRIPTION 'Recommended products.'
  )
);

In Summary…

Designing applications in an entity-centric fashion can make them easier to build and more efficient

Kiji can speed up the development process of 360º views

Questions?

Contact me

natty@wibidata.com

@nattyice

The Kiji Project: kiji.org
