hbasecon 2012 | storing and manipulating graphs in hbase

Storing and Manipulating Graphs in HBase Dan Lynn [email protected] @danklynn

Upload: cloudera-inc

Post on 27-May-2015


DESCRIPTION

Google’s original use case for BigTable was the storage and processing of web graph information, represented as sparse matrices. However, many organizations tend to treat HBase as merely a “web scale” RDBMS. This session covers several use cases for storing graph data in HBase, including social networks and web link graphs, and MapReduce processes such as cached traversal, PageRank, and clustering. It then looks at lower-level modeling details such as row key and column qualifier design, using FullContact’s graph processing systems as a real-world use case.

TRANSCRIPT

Page 1: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Storing and Manipulating Graphs in HBase

Dan Lynn [email protected]

@danklynn

Page 2: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Keeps Contact Information Current and Complete

Based in Denver, Colorado

CTO & Co-Founder

Page 3: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Turn Partial Contacts Into Full Contacts

Page 4: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Refresher: Graph Theory

Page 5: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Refresher: Graph Theory

Page 6: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Refresher: Graph Theory

Vertex

Page 7: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Refresher: Graph Theory

Edge

Page 8: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Social Networks

Page 9: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Tweets

@danklynn

@xorlev

“#HBase rocks”

author

follows

retweeted

Page 10: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Web Links

http://fullcontact.com/blog/

http://techstars.com/

<a href="...">TechStars</a>

Page 11: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Why should you care?

Vertex Influence

- PageRank

- Social Influence

- Network bottlenecks

Identifying Communities

Page 12: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Storage Options

Page 13: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

neo4j

Page 14: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Very expressive querying (e.g. Gremlin)

neo4j

Page 15: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Transactional

neo4j

Page 16: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Data must fit on a single machine

neo4j

:-(

Page 17: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

FlockDB

Page 18: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Scales horizontally

FlockDB

Page 19: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Very fast

FlockDB

Page 20: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

No multi-hop query support

:-(

FlockDB

Page 21: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

RDBMS (e.g. MySQL, Postgres)

Page 22: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Transactional

RDBMS

Page 23: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Huge amounts of JOINing

RDBMS

:-(

Page 24: HBaseCon 2012 | Storing and Manipulating Graphs in HBase
Page 25: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Massively scalable

HBase

Page 26: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Data model well-suited

HBase

Page 27: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Multi-hop querying?

HBase

Page 28: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Modeling Techniques

Page 29: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

[Figure: graph of three vertices labeled 1, 2 and 3]

Page 30: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

      1  2  3

1     0  1  1

2     1  0  1

3     1  1  0

Page 31: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

Can use vectorized libraries

Page 32: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

Requires O(n^2) memory, where n = number of vertices

Page 33: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency Matrix

Hard(er) to distribute
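
To make the adjacency matrix slides concrete, here is a minimal in-memory sketch of the three-vertex example. The class and method names (AdjacencyMatrixSketch, build, cellCount) are illustrative and not from the talk; the sketch assumes an undirected graph with 1-based vertex ids.

```java
import java.util.Arrays;

/** Illustrative only: an in-memory adjacency matrix for the 3-vertex example. */
class AdjacencyMatrixSketch {

    /** Builds an n x n matrix for an undirected graph; edges use 1-based vertex ids. */
    static int[][] build(int n, int[][] edges) {
        int[][] matrix = new int[n][n];
        for (int[] edge : edges) {
            matrix[edge[0] - 1][edge[1] - 1] = 1;
            matrix[edge[1] - 1][edge[0] - 1] = 1; // undirected: mirror the entry
        }
        return matrix;
    }

    /** Cells allocated for n vertices, no matter how few edges exist. */
    static long cellCount(int n) {
        return (long) n * n;
    }

    public static void main(String[] args) {
        int[][] matrix = build(3, new int[][] {{1, 2}, {1, 3}, {2, 3}});
        for (int[] row : matrix) {
            System.out.println(Arrays.toString(row));
        }
        System.out.println("cells for 1,000,000 vertices: " + cellCount(1_000_000));
    }
}
```

The cell count grows with the square of the vertex count even when most entries are zero, which is the memory cost the slide points at.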

Page 34: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency List

[Figure: graph of three vertices labeled 1, 2 and 3]

Page 35: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency List

1 2,3

2 1,3

3 1,2
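
The same three-vertex graph can be held as an adjacency list. The following is a small sketch, again with made-up names (AdjacencyListSketch, build) and assuming undirected edges, that stores only the neighbors that actually exist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: an in-memory adjacency list for the 3-vertex example. */
class AdjacencyListSketch {

    /** Maps each vertex to the list of its neighbors (undirected edges). */
    static Map<Integer, List<Integer>> build(int[][] edges) {
        Map<Integer, List<Integer>> adjacency = new TreeMap<>();
        for (int[] edge : edges) {
            adjacency.computeIfAbsent(edge[0], k -> new ArrayList<>()).add(edge[1]);
            adjacency.computeIfAbsent(edge[1], k -> new ArrayList<>()).add(edge[0]);
        }
        return adjacency;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adjacency = build(new int[][] {{1, 2}, {1, 3}, {2, 3}});
        // One line per vertex, in the same shape as the slide.
        adjacency.forEach((vertex, neighbors) -> System.out.println(vertex + " " + neighbors));
    }
}
```

Storage here is proportional to the number of edges rather than to the square of the number of vertices, and each vertex's list can live on its own, which is what makes the layout a natural fit for one row per vertex.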

Page 36: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency List Design in HBase

e:[email protected]

t:danklynn

p:+13039316251

Page 37: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adjacency List Design in HBase

row key              “edges” column family

e:[email protected]    p:+13039316251= ...    t:danklynn= ...

p:+13039316251       t:danklynn= ...        e:[email protected]= ...

t:danklynn           e:[email protected]= ...  p:+13039316251= ...
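
One way to picture this layout is as a sorted map of row keys, each holding a sorted map of column qualifiers. The sketch below is a plain in-memory stand-in, not HBase client code: EdgeTableSketch, putEdge and edgesOf are invented names, the e:user@example.com key is a placeholder for the address that is redacted in the transcript, and the edge values are arbitrary strings.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative only: models row key -> ("edges" qualifier -> value) with sorted maps. */
class EdgeTableSketch {

    // Row key is the source vertex; each qualifier is a destination vertex.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    /** Stores one outbound edge in the source vertex's row. */
    void putEdge(String source, String destination, String value) {
        rows.computeIfAbsent(source, k -> new TreeMap<>()).put(destination, value);
    }

    /** Returns all outbound edges of a vertex, which corresponds to reading a single row. */
    NavigableMap<String, String> edgesOf(String vertex) {
        return rows.getOrDefault(vertex, new TreeMap<>());
    }

    public static void main(String[] args) {
        EdgeTableSketch table = new EdgeTableSketch();
        table.putEdge("t:danklynn", "p:+13039316251", "1.0");     // hypothetical edge value
        table.putEdge("t:danklynn", "e:user@example.com", "0.5"); // placeholder address
        table.putEdge("p:+13039316251", "t:danklynn", "1.0");
        System.out.println(table.edgesOf("t:danklynn"));
    }
}
```

Because every neighbor of a vertex sits in that vertex's row, fetching the adjacency list of one vertex is a single-row read.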

Page 39: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Custom Writables

package org.apache.hadoop.io;

public interface Writable {
    void write(java.io.DataOutput dataOutput) throws java.io.IOException;
    void readFields(java.io.DataInput dataInput) throws java.io.IOException;
}

java

Page 40: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Custom Writables

class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
    }

    void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }

    // ...
}

groovy
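
The write/readFields pair can be exercised without Hadoop on the classpath, because java.io.DataOutputStream and DataInputStream implement the DataOutput and DataInput interfaces that Writable relies on. This is a sketch under that assumption: EdgeValueRoundTrip and its helpers are hypothetical names, and the class deliberately does not implement the real Writable interface so that it stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative only: mirrors the Writable method shapes using plain JDK streams. */
class EdgeValueRoundTrip {
    double weight;

    void write(DataOutput out) throws IOException {
        out.writeDouble(weight);
    }

    void readFields(DataInput in) throws IOException {
        weight = in.readDouble();
    }

    /** Serializes the value into a byte array (8 bytes for a single double). */
    static byte[] serialize(EdgeValueRoundTrip value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            value.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds a value from bytes produced by serialize. */
    static EdgeValueRoundTrip deserialize(byte[] bytes) {
        try {
            EdgeValueRoundTrip value = new EdgeValueRoundTrip();
            value.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EdgeValueRoundTrip original = new EdgeValueRoundTrip();
        original.weight = 0.75;
        System.out.println(deserialize(serialize(original)).weight);
    }
}
```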

Page 41: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Don’t get fancy with byte[]

class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }

}

groovy
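
A string-based encoding along the lines the slide recommends might look like the sketch below. StringEdgeValue is a hypothetical class, and storing the weight as its decimal text in UTF-8 is an assumption about what "use strings" means here; the likely benefit is that stored values stay readable when inspected by hand.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: encodes an edge weight as UTF-8 text instead of raw binary. */
class StringEdgeValue {
    final double weight;

    StringEdgeValue(double weight) {
        this.weight = weight;
    }

    /** Encodes the weight as its decimal string form. */
    byte[] toBytes() {
        return Double.toString(weight).getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes produced by toBytes. */
    static StringEdgeValue fromBytes(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        return new StringEdgeValue(Double.parseDouble(text));
    }

    public static void main(String[] args) {
        byte[] encoded = new StringEdgeValue(0.75).toBytes();
        System.out.println(new String(encoded, StandardCharsets.UTF_8));
        System.out.println(fromBytes(encoded).weight);
    }
}
```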

Page 42: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Querying by vertex

def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get)
result.noVersionMap.each { family, data ->
    // construct edge objects as needed
    // data is a Map<byte[],byte[]>
}

Page 43: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Adding edges to a vertex

def put = new Put(vertexKeyBytes)

put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)

// if using TableReducer
context.write(NullWritable.get(), put)

Page 44: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Page 45: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Page 46: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Pivot vertex

Page 47: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

MapReduce over outbound edges

Page 48: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Emit vertices and edge data grouped by the pivot

Page 49: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected]

t:danklynn

p:+13039316251

Reduce key

“Out” vertex

“In” vertex

Page 50: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

e:[email protected] t:danklynn

Reducer emits higher-order edge
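
The pivot steps on the preceding slides can be imitated in memory to show the data flow. This is a single-process sketch, not the MapReduce job from the talk: PivotTraversalSketch and iterate are invented names, the "map" and "reduce" phases are ordinary loops, and the e:user@example.com key stands in for the address redacted in the transcript.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: one pivot-based traversal pass over an in-memory edge map. */
class PivotTraversalSketch {

    /** Returns the new higher-order edges (source -> destinations) found in one pass. */
    static Map<String, Set<String>> iterate(Map<String, Set<String>> edges) {
        // "Map" phase: for each edge u -> v, u is an "in" vertex of pivot v
        // and v is an "out" vertex of pivot u.
        Map<String, Set<String>> inByPivot = new TreeMap<>();
        Map<String, Set<String>> outByPivot = new TreeMap<>();
        for (Map.Entry<String, Set<String>> row : edges.entrySet()) {
            for (String destination : row.getValue()) {
                inByPivot.computeIfAbsent(destination, k -> new TreeSet<>()).add(row.getKey());
                outByPivot.computeIfAbsent(row.getKey(), k -> new TreeSet<>()).add(destination);
            }
        }
        // "Reduce" phase: per pivot, pair every "in" vertex with every "out" vertex.
        Map<String, Set<String>> derived = new TreeMap<>();
        for (Map.Entry<String, Set<String>> pivot : inByPivot.entrySet()) {
            Set<String> outs = outByPivot.getOrDefault(pivot.getKey(), Collections.<String>emptySet());
            for (String in : pivot.getValue()) {
                Set<String> known = edges.getOrDefault(in, Collections.<String>emptySet());
                for (String out : outs) {
                    if (!in.equals(out) && !known.contains(out)) {
                        derived.computeIfAbsent(in, k -> new TreeSet<>()).add(out);
                    }
                }
            }
        }
        return derived;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> edges = new TreeMap<>();
        edges.put("e:user@example.com", new TreeSet<>(Collections.singleton("p:+13039316251")));
        edges.put("p:+13039316251", new TreeSet<>(Collections.singleton("t:danklynn")));
        // The phone number acts as the pivot between the address and the Twitter handle.
        System.out.println(iterate(edges));
    }
}
```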

Page 51: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 0

Page 52: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 1

Page 53: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 2

Page 54: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 2

Reuse edges created during previous iterations

Page 55: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 3

Page 56: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

Iteration 3

Reuse edges created during previous iterations

Page 57: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Distributed Traversal / Indexing

2^n hops requires only n iterations
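
The iteration slides above suggest that reusing edges created in earlier passes lets each pass roughly double the path length already covered. The sketch below checks that reading on a simple directed chain 0 -> 1 -> ... -> hops; HopDoublingSketch, expand and iterationsToReach are hypothetical names, and the in-memory loop stands in for repeated MapReduce jobs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: counts passes needed to link the two ends of a chain. */
class HopDoublingSketch {

    /** Adds source -> destination for every two-edge path in the current edge set. */
    static Map<Integer, Set<Integer>> expand(Map<Integer, Set<Integer>> edges) {
        Map<Integer, Set<Integer>> next = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> row : edges.entrySet()) {
            Set<Integer> reachable = new HashSet<>(row.getValue()); // keep existing edges
            for (Integer pivot : row.getValue()) {
                Set<Integer> outs = edges.get(pivot);
                if (outs != null) {
                    reachable.addAll(outs); // edges derived earlier are reused here
                }
            }
            reachable.remove(row.getKey()); // ignore self-loops
            next.put(row.getKey(), reachable);
        }
        return next;
    }

    /** Passes needed before vertex 0 has a direct edge to vertex 'hops'. */
    static int iterationsToReach(int hops) {
        if (hops < 1) {
            throw new IllegalArgumentException("hops must be at least 1");
        }
        Map<Integer, Set<Integer>> edges = new HashMap<>();
        for (int v = 0; v < hops; v++) {
            Set<Integer> out = new HashSet<>();
            out.add(v + 1);
            edges.put(v, out);
        }
        int iterations = 0;
        while (!edges.get(0).contains(hops)) {
            edges = expand(edges);
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        for (int hops : new int[] {2, 4, 8, 16}) {
            System.out.println(hops + " hops: " + iterationsToReach(hops) + " iterations");
        }
    }
}
```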

Page 58: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Tips / Gotchas

Page 59: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Do implement your own comparator

java

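public static class Comparator extends WritableComparator {

    public Comparator() {
        super(VertexKeyWritable.class); // key class this comparator handles
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // .....
    }

}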

Page 60: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Do implement your own comparator

java

static {
    WritableComparator.define(VertexKeyWritable.class, new VertexKeyWritable.Comparator());
}
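
The point of the byte-level compare signature is that keys can be ordered directly in their serialized form, without deserializing each one. The body is elided on the slide, so the following is only one plausible shape: RawBytesCompareSketch is a hypothetical stand-alone class (it does not extend WritableComparator), and it assumes the serialized key is just the raw key bytes, compared lexicographically as unsigned values.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: lexicographic comparison of two byte ranges. */
class RawBytesCompareSketch {

    /** Compares b1[s1 .. s1+l1) with b2[s2 .. s2+l2) byte by byte, unsigned. */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int common = Math.min(l1, l2);
        for (int i = 0; i < common; i++) {
            int left = b1[s1 + i] & 0xff;  // mask so bytes compare as unsigned
            int right = b2[s2 + i] & 0xff;
            if (left != right) {
                return left - right;
            }
        }
        return l1 - l2; // a key that is a prefix of the other sorts first
    }

    public static void main(String[] args) {
        byte[] email = "e:user@example.com".getBytes(StandardCharsets.UTF_8); // placeholder key
        byte[] twitter = "t:danklynn".getBytes(StandardCharsets.UTF_8);
        System.out.println(compare(email, 0, email.length, twitter, 0, twitter.length) < 0);
    }
}
```

Hadoop's WritableComparator also provides a static compareBytes helper with the same parameter list, which a real comparator could call instead of hand-rolling the loop.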

Page 61: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

MultiScanTableInputFormat

MultiScanTableInputFormat.setTable(conf,"graph");

MultiScanTableInputFormat.addScan(conf, new Scan());

job.setInputFormatClass(MultiScanTableInputFormat.class);

java

Page 62: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

TableMapReduceUtil

TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);

java

Page 63: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

Page 64: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

Page 65: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles

Copy to S3

Page 66: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles SequenceFiles

Copy to S3 Elastic MapReduce

Page 67: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles SequenceFiles

Copy to S3 Elastic MapReduce

Page 68: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles

SequenceFiles SequenceFiles

HFiles

Copy to S3 Elastic MapReduce

HFileOutputFormat.configureIncrementalLoad(job, outputTable)

Page 69: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Elastic MapReduce

HFiles → SequenceFiles → Copy to S3 → Elastic MapReduce → SequenceFiles → HFiles → HBase

HFileOutputFormat.configureIncrementalLoad(job, outputTable)

$ hadoop jar hbase-VERSION.jar completebulkload

Page 70: HBaseCon 2012 | Storing and Manipulating Graphs in HBase

Additional Resources

Google Pregel: BSP-based graph processing system

Apache Giraph: Implementation of Pregel for Hadoop

MultiScanTableInputFormat: (code to appear on GitHub)

Apache Mahout: Distributed machine learning on Hadoop