graph store

Linked data analytics in an ad-system (Slide outlines) Inder Singh, Srikanth Sundarrajan @inmobi

Upload: inder-singh

Post on 20-Jun-2015



TRANSCRIPT

Page 1: Graph store

Linked data analytics in an ad-system (Slide outlines)

Inder Singh, Srikanth Sundarrajan @inmobi

Page 2: Graph store

User store

• Why a store for user data?

– Just like an advertiser or publisher in the network, the consumer/user is a very important entity

• It is an entity that has associated activity

• Not an attribute on a network entity

• Value for the network, advertisers & publishers comes from showing ads that are very relevant to users

– We need to understand users better

– And leverage information about them better

Page 3: Graph store

What do we need to store?

• Activities that a user is involved in

• Profile of the user (ex. Demographics)

• Besides these

– Location

– Device

– Apps etc.

Page 4: Graph store

User Data Model

User: (Identifier, Age, Gender, Interest, Preference, …)

Page 5: Graph store

User Data Model

User: (Identifier, Age, Gender, Interest, Preference, …)

Site: Platform, Category …

Visits

Page 6: Graph store

User Data Model

User: (Identifier, Age, Gender, Interest, Preference, …)

Site: Platform, Category …

Visits

Visit: Time of day, Requests, Impressions, Clicks, Downloads, Burn, Engagement Time …

Page 7: Graph store

User Data Model

User: (Identifier, Age, Gender, Interest, Preference, …)

Site: Platform, Category …

Visits

Served: Impressions, Clicks, Downloads, Burn, Engagement Time …

AdGrp: Category, Objective …

Served

Page 8: Graph store

User Data Model

Owns: Request, Impressions, Clicks, Downloads, Burn, …

User: (Identifier, Age, Gender, Interest, Preference, …)

Site: Platform, Category …

Visits

AdGrp: Category, Objective …

Served

Geo

located

Device: manufacturer, OS type, …

owns

Page 9: Graph store

How does the data look at scale ?

U1

S1 S2

U2

U3

U4

U5

S3

Ad1

Ad2

C1

C2

D1

D2

Page 10: Graph store

What can we do with this data?

• Examples

– Get a user’s details and network activity to infer something about the user

– Target a segment of users based on the users’ attributes or aggregate activity

– Understand the reach of a targeting criterion

• Further

– What we can do is a function of how efficiently we can store this data and how quickly we can retrieve information

Page 11: Graph store

About User Data

• Very sparse & high cardinality (> 1 billion)

• Random id (Quality of data)

• Some popular network entities are associated with a large number of users (> 100 million)

• Many attributes about users are inferred; as we get stronger signals, these need to be mutated

Page 12: Graph store

How ?

• Store for Analytics / Insights

– Is all about organizing data in line with the retrieval use cases

– Retrieval time depends on how the data is organized and on the data size

– What is organization?

• Simply put: trade space for time

Page 13: Graph store

Popular storage structures

• Rows & Columns – Relational Table

– Indexing?

• Each column?

• Group of columns?

• Ingestion cost

• Cardinality

– Ideal if the queries/data extraction are reasonably well defined and conform to set patterns

– Appending is the most efficient way of ingestion

Page 14: Graph store

Popular storage structures

• Columnar storage (big table / db)

– Key based lookup / scan

– Optimized for use cases where not all data stored for the key needs to be retrieved

– Both mutations and appends are scalable ingestion patterns

Page 15: Graph store

Heavily Indexed Relation

Page 16: Graph store

An optimized representation

Page 17: Graph store

Starting point

Dimension            User Bitmap
Site1 + Adgroup1     1000011001…
Site1 + SFO          0001111000…
Site2 + nexus        1000011110…
…                    …

• Can do all kinds of set operations to get the user reach

• Found a user for the Site1 + Adgroup1 combination. Find the other apps/devices this user came from during a time interval.

– Could we walk the graph if we had a link from userid to Dimension?
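
The set operations on user bitmaps can be sketched in a few lines of Python. This is only an illustration: Python integers stand in for a real bitmap library, the helper names `bitmap` and `reach` are made up, and the bit positions are the visible (truncated) prefixes of the rows above read left to right, with the leftmost bit assumed to be user 0.

```python
def bitmap(*user_ids):
    """Build a bitmap with one bit set per user id (Python int as bitmap)."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid
    return bits

def reach(bits):
    """Number of distinct users in a bitmap (population count)."""
    return bin(bits).count("1")

# Dimension combination -> user bitmap (assumed reading of the slide rows).
index = {
    ("Site1", "Adgroup1"): bitmap(0, 5, 6, 9),
    ("Site1", "SFO"): bitmap(3, 4, 5, 6),
    ("Site2", "nexus"): bitmap(0, 5, 6, 7, 8),
}

# Users seen on Site1 for Adgroup1 AND in SFO (intersection).
both = index[("Site1", "Adgroup1")] & index[("Site1", "SFO")]
# Users seen for either combination (union).
either = index[("Site1", "Adgroup1")] | index[("Site2", "nexus")]
# Users seen for the first combination but not the second (difference).
only_first = index[("Site1", "Adgroup1")] & ~index[("Site1", "SFO")]

print(reach(both), reach(either), reach(only_first))  # prints: 2 6 2
```

Each reach query is a handful of bitwise operations followed by a bit count, which is why this layout suits "how many users match this targeting criterion" questions.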

Page 18: Graph store

Starting point extended (Logical Diagram)

Dimension            User Bitmap
Site1 + Adgroup1     1000011001…
Site1 + SFO          0001111000…
Site2 + nexus        1000011110…
…                    …

UserID: u1, u2, u3, u4, each linked back to the dimensions in which it occurs

• Now we can walk from a userID we found in a bitmap back to all the other places it occurs… life is cool ☺
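
A minimal sketch of the extended starting point, assuming the link from userID back to dimensions is kept as a plain reverse map. The function names and the small bitmaps are hypothetical and only show the shape of the walk.

```python
from collections import defaultdict

def user_ids(bits):
    """Yield the positions of the set bits in a bitmap."""
    uid = 0
    while bits:
        if bits & 1:
            yield uid
        bits >>= 1
        uid += 1

def build_reverse_index(dimension_bitmaps):
    """Map each user id to the set of dimensions whose bitmap contains it."""
    reverse = defaultdict(set)
    for dimension, bits in dimension_bitmaps.items():
        for uid in user_ids(bits):
            reverse[uid].add(dimension)
    return reverse

dimension_bitmaps = {
    "Site1 + Adgroup1": 0b1001,  # users 0 and 3
    "Site1 + SFO": 0b1010,       # users 1 and 3
    "Site2 + nexus": 0b0101,     # users 0 and 2
}
reverse = build_reverse_index(dimension_bitmaps)

# Start from a user found in the Site1 + Adgroup1 bitmap and walk back out.
start = next(user_ids(dimension_bitmaps["Site1 + Adgroup1"]))
print(start, sorted(reverse[start]))
# prints: 0 ['Site1 + Adgroup1', 'Site2 + nexus']
```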

Page 19: Graph store

As engineers, we want neat abstractions, so life is uncool again…

Page 20: Graph store

Directed multi property graphs

Page 21: Graph store

Buzzwords in Graph world

• Neo4j

• Titan

• Dex

• Pregel

• Giraph

• Gremlin

• TinkerPop Blueprints

Page 22: Graph store

What to do?

• Evaluate leading graph DBs: Neo4j, Titan, OrientDB, FlockDB (Twitter), Facebook TAO.

• Titan & Neo4j – challenges we faced

– Supernodes

– Queries like “give me Sites, Devices with exclusive users”, or in general the class of queries requiring lots of edge traversals over lots of supernodes, did not return for hours while the DB server went into GC.

• Talk to experts from Neo4j and Titan: still not there for huge scale and expensive queries.

Page 23: Graph store

• Research paper on DEX

• Formalizes and provides abstractions over our thinking of using bitmaps

• Compares the class of queries it supports against leading graph DBs, and the results are very promising.

Page 24: Graph store

Graph: Formal definition

• V = {v1, ..., vn}: finite set of vertices

• E = {e1, ..., em}: finite set of edges

• Relation sets

– T = {(e1, t1), ..., (em, tm)} is the set of tail pairs (ei, ti), which indicates that the tail of ei is the vertex ti ∈ V

– H = {(e1, h1), ..., (em, hm)} is the set of head pairs (ei, hi), which indicates that the head of ei is the vertex hi ∈ V

Page 25: Graph store

Formal definition contd.

• Labels: given an object o, which is either an edge or a vertex (o ∈ V ∪ E), we map a single label to each object: L = {(o, l) | o ∈ (V ∪ E), l ∈ string}

• Attributes: Ai = {(o1, c1), ..., (or, cr)}, which assigns an attribute value cj ∈ D to each object oj (where D is the set of valid data types such as int, boolean, timestamp, etc.)

• G = (V, E, L, T, H, A1, ..., Ap)
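
To make the definition concrete, here is a tiny instance of G written with plain Python sets of pairs. All vertex, edge and attribute names are illustrative, and the two helper functions are not part of the definition; they only check its constraints and show that neighbourhood queries can be derived from T and H alone.

```python
# A tiny instance of G = (V, E, L, T, H, A1, ..., Ap).
V = {"u1", "s1", "d1"}
E = {"e1", "e2"}
T = {("e1", "u1"), ("e2", "u1")}           # tail pairs: the vertex an edge leaves
H = {("e1", "s1"), ("e2", "d1")}           # head pairs: the vertex an edge enters
L = {("u1", "User"), ("s1", "Site"), ("d1", "Device"),
     ("e1", "VISITS"), ("e2", "OWNS")}     # exactly one label per object
A_clicks = {("e1", 100)}                   # one attribute set: object -> value

def is_well_formed(V, E, L, T, H):
    """Check the constraints stated in the definition."""
    tails, heads, labels = dict(T), dict(H), dict(L)
    return (
        set(tails) == E and set(heads) == E                  # each edge has a tail and a head
        and len(tails) == len(T) and len(heads) == len(H)    # and exactly one of each
        and set(tails.values()) <= V and set(heads.values()) <= V
        and set(labels) == V | E and len(labels) == len(L)   # a single label per object
    )

def out_neighbours(v, T, H):
    """Vertices reachable from v over one edge, derived only from T and H."""
    heads = dict(H)
    return {heads[e] for e, t in T if t == v}

print(is_well_formed(V, E, L, T, H), sorted(out_neighbours("u1", T, H)))
# prints: True ['d1', 's1']
```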

Page 26: Graph store

Some things we want

• Store large sets of objects and access them efficiently

• Given a key, find the matching objects (vertices/edges)

• Given an object, retrieve the set of values associated with it

Page 27: Graph store

Assumptions

• Every vertex/edge in the entire graph has a unique ID called oid (object identifier).

Page 28: Graph store

So what’s the real deal?

• Value sets: group all objects matching a value together, similar to an inverted index.

• Two sets of maps:

– VALUE_OID: maps a value to a bitmap of oids

– OID_VALUE: maps an oid to a value
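
The two maps can be sketched as follows, assuming Python integers as bitmaps and a hypothetical `build_value_set` helper that fills both maps from (oid, value) pairs. The ages and oids are made up.

```python
def build_value_set(pairs):
    """Build both maps of a value set from (oid, value) pairs."""
    value_oid = {}  # value -> bitmap of oids (Python int used as a bitmap)
    oid_value = {}  # oid -> value
    for oid, value in pairs:
        value_oid[value] = value_oid.get(value, 0) | (1 << oid)
        oid_value[oid] = value
    return value_oid, oid_value

# Hypothetical ages for five user vertices with oids 0..4.
value_oid, oid_value = build_value_set(
    [(0, 28), (1, 17), (2, 17), (3, 28), (4, 28)]
)

print(bin(value_oid[28]), bin(value_oid[17]), oid_value[3])
# prints: 0b11001 0b110 28
```

VALUE_OID answers "which objects have this value" in one lookup, while OID_VALUE answers "what is the value of this object", which is what makes walking from one object to the next possible.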

Page 29: Graph store

LABELS

• Two sets of maps

• VALUE_OID: like an inverted index, from a label to the objects carrying it

• OID_VALUE: allows walking links in the graph

Page 30: Graph store

TAILS

• VALUE_OID: inverted index of all edges outgoing from a vertex

• OID_VALUE: given an edge ID, the vertex the edge goes out of

Page 31: Graph store

HEADS

• VALUE_OID: inverted index of all incoming edges of a vertex

• OID_VALUE: go from an edge oid to its value (the vertex the edge comes into)

Page 32: Graph store

Attributes

• Table name = Attr_KEY_NAME

• VALUE_OID: inverted index of all edges/vertices with this value

• OID_VALUE: find the value of this attribute key for a given oid

Page 33: Graph store

Efficient bitmaps

• Partition the 64-bit bitmap space into sufficiently large ranges, one for each entity type like device, site, etc.

• Results in less sparse bitmaps and better compression.
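
One way to picture the partitioning is to reserve the top bits of a 64-bit oid for the entity type and use the rest as a per-type counter. The 8/56 split, the type table and the function names below are assumptions for illustration, not the layout used in the talk.

```python
TYPE_BITS = 8                    # assumed: top 8 bits select the entity type
LOCAL_BITS = 64 - TYPE_BITS      # remaining 56 bits count objects of that type
ENTITY_TYPES = {"user": 0, "site": 1, "device": 2, "adgroup": 3}  # hypothetical

def make_oid(entity_type, local_id):
    """Place local_id inside the contiguous range reserved for entity_type."""
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local id does not fit in the range")
    return (ENTITY_TYPES[entity_type] << LOCAL_BITS) | local_id

def split_oid(oid):
    """Recover (entity type, local id) from an oid."""
    type_id = oid >> LOCAL_BITS
    local_id = oid & ((1 << LOCAL_BITS) - 1)
    name = next(n for n, t in ENTITY_TYPES.items() if t == type_id)
    return name, local_id

site_oid = make_oid("site", 42)
print(hex(site_oid), split_oid(site_oid))
```

With such a layout, objects of one type are numbered densely inside their own range, so a bitmap over, say, users has its set bits clustered together and the long empty stretches elsewhere compress well.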

Page 34: Graph store

Primitive APIs: objects

• Bitmap Objects(NameSpace n, Value v): works on the VALUE_OID map

– Example: Objects(Age, 28), with Table = “Age” and Value = “28”, returns the bitmap of User vertices with age = “28”

VALUE    BITMAP(OIDS)
28       100111000…
17       011000…

OID: 456, 789, 645

Page 35: Graph store

Primitive APIs: lookup

• V lookup(Namespace n, Long oid): works on the OID_VALUE table.

• Example: Lookup(Age, 456) returns value = 17

VALUE    BITMAP(OIDS)
28       100111000…
17       011000…

OID: 456, 789, 645

Page 36: Graph store

Primitive APIs: domain

• Iterable<Key, Value> Domain(Namespace n)

• Example: Domain(Age) returns an iterator over the VALUE_OID table keys

VALUE    BITMAP(OIDS)
28       100111000…
17       011000…

OID: 456, 789, 645

Page 37: Graph store

Primitive APIs: insert, remove

• Insert(Namespace n, Long oid, Value v)

• remove(Namespace n, Long oid, Value v)
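
The primitive APIs can be sketched as one small class per namespace. This is an assumption-laden illustration: the `ValueSet` class name is made up, the namespace argument of the slides becomes the instance itself, Python integers stand in for bitmaps, and each oid is assumed to hold a single value per namespace.

```python
class ValueSet:
    """One namespace (e.g. "Age"): a VALUE_OID map plus an OID_VALUE map."""

    def __init__(self):
        self.value_oid = {}  # value -> bitmap of oids (Python int)
        self.oid_value = {}  # oid -> value

    def insert(self, oid, value):
        if oid in self.oid_value:          # mutation: drop the old value first
            self.remove(oid, self.oid_value[oid])
        self.value_oid[value] = self.value_oid.get(value, 0) | (1 << oid)
        self.oid_value[oid] = value

    def remove(self, oid, value):
        if self.oid_value.get(oid) != value:
            return                         # nothing stored under this pair
        del self.oid_value[oid]
        remaining = self.value_oid[value] & ~(1 << oid)
        if remaining:
            self.value_oid[value] = remaining
        else:
            del self.value_oid[value]      # drop empty bitmaps

    def objects(self, value):
        """Bitmap of all oids holding this value (0 when there are none)."""
        return self.value_oid.get(value, 0)

    def lookup(self, oid):
        """Value stored for this oid, or None."""
        return self.oid_value.get(oid)

    def domain(self):
        """Iterate over the (value, bitmap) entries of the VALUE_OID map."""
        return iter(self.value_oid.items())


age = ValueSet()
age.insert(0, 28)
age.insert(1, 17)
age.insert(2, 17)
age.insert(1, 28)   # oid 1 moves from 17 to 28
age.remove(2, 17)   # 17 no longer has any objects

print(bin(age.objects(28)), age.lookup(1), list(age.domain()))
# prints: 0b11 28 [(28, 3)]
```

Keeping both maps in step inside `insert` and `remove` is what lets inferred attributes be mutated later without leaving stale bits behind.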

Page 38: Graph store

TinkerPop APIs

• Let’s look at the code of AbstractGraph, ShadowfaxEdge, ShadowFaxvertex to understand how all of this works together

Page 39: Graph store

Walking the graph

• Find all users who own an iPhone and have 100 clicks in the system. Find the common sites among these users.

• Let’s break it down

– Start from the iphone vertex: objects(“label”, “iphone”) returns a bitmap with a single bit set, the iphone vertex.

Page 40: Graph store

Walking the graph

• From the iphone vertex, let’s go to all OWNS edges

– Bitmap1 = Objects(HEADS, “iphone-vid”): bitmap of all edges incident into the iphone vertex

– Bitmap2 = Objects(LABELS, “OWNS”): bitmap of all OWNS edges

– Bitmap3 = (Bitmap1 AND Bitmap2): all OWNS edges incident into iphone-vid

Page 41: Graph store

Walking the graph contd..

• Find all edges which have “clicks = 100”

– Bitmap4 = objects(click, “100”)

• Bitmap5 = (Bitmap3 AND Bitmap4): all OWNS edges incident into the iphone vertex that have 100 clicks.

Page 42: Graph store

Walking the graph contd..

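• For all OWNS edges incident into the iphone vertex, let’s walk to the users (pseudocode).

// ownsEdges: the bitmap of OWNS edges computed in the previous steps
for (Long oid : ownsEdges.vector()) {
    // find the user vertex from the edge
    Long userVertexOID = lookup(TAILS, oid);
    // find the sites visited by this user, i.e. walk from user to site:
    // take the edges going out of this vertex that have the label “VISITS”
    Bitmap visitEdges = objects(TAILS, userVertexOID) AND objects(LABELS, “VISITS”);
    // the head of each such edge is a site vertex
    for (Long edgeOID : visitEdges.vector()) {
        Long siteVertexOID = lookup(HEADS, edgeOID);
    }
}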
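
The whole walk can be tried end to end on a toy graph. The sketch below is not the authors' implementation: it keeps every namespace in two nested dictionaries, uses Python integers as bitmaps, and the graph (vertices 0 to 4, edges 10 to 14), the `add_edge` helper and the attribute name `clicks` are invented for the example.

```python
from collections import defaultdict

value_oid = defaultdict(lambda: defaultdict(int))  # namespace -> value -> bitmap
oid_value = defaultdict(dict)                      # namespace -> oid -> value

def insert(ns, oid, value):
    value_oid[ns][value] |= 1 << oid
    oid_value[ns][oid] = value

def objects(ns, value):
    return value_oid[ns].get(value, 0)

def lookup(ns, oid):
    return oid_value[ns][oid]

def oids(bitmap):
    """List the oids whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

def add_edge(oid, label, tail, head, **attrs):
    insert("LABELS", oid, label)
    insert("TAILS", oid, tail)   # vertex the edge goes out of
    insert("HEADS", oid, head)   # vertex the edge comes into
    for key, value in attrs.items():
        insert(key, oid, value)

# Hypothetical graph: one device, two users, two sites.
for oid, label in [(0, "iphone"), (1, "user"), (2, "user"), (3, "site"), (4, "site")]:
    insert("LABELS", oid, label)
add_edge(10, "OWNS", tail=1, head=0, clicks=100)
add_edge(11, "OWNS", tail=2, head=0, clicks=100)
add_edge(12, "VISITS", tail=1, head=3)
add_edge(13, "VISITS", tail=1, head=4)
add_edge(14, "VISITS", tail=2, head=4)

# OWNS edges into the iphone vertex that carry clicks = 100.
(iphone_vid,) = oids(objects("LABELS", "iphone"))
owns_edges = objects("HEADS", iphone_vid) & objects("LABELS", "OWNS")
owns_edges &= objects("clicks", 100)

# Walk edge -> user -> VISITS edges -> sites, intersecting the site sets.
common_sites = None
for edge in oids(owns_edges):
    user = lookup("TAILS", edge)
    visit_edges = objects("TAILS", user) & objects("LABELS", "VISITS")
    sites = {lookup("HEADS", e) for e in oids(visit_edges)}
    common_sites = sites if common_sites is None else common_sites & sites

print(sorted(common_sites))  # prints: [4]
```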

Page 43: Graph store

Is that the way to write code?

• No, use the TinkerPop APIs we provide; life is easy ☺

• How our implementation gives more power

– Expensive queries are optimized through native APIs.

Page 44: Graph store

Ingestion at Scale

[Diagram: four volatile graphs (in-memory graphs) feed four local graphs (persistent graphs), which are merged into one global graph. Caption: merge local persistent graphs.]
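
A rough sketch of the ingestion flow, assuming each graph can be reduced to a map from dimension to user bitmap so that merging is a bitwise OR per key. The `ingest`, `persist` and `merge` helpers and the event batches are hypothetical, and persistence is only simulated by copying.

```python
def ingest(events):
    """Build one volatile (in-memory) graph from (dimension, user_id) events."""
    graph = {}
    for dimension, user_id in events:
        graph[dimension] = graph.get(dimension, 0) | (1 << user_id)
    return graph

def persist(volatile_graph):
    """Stand-in for writing a volatile graph out as a local persistent graph."""
    return dict(volatile_graph)

def merge(graphs):
    """Merge graphs by OR-ing the bitmaps stored under equal keys."""
    merged = {}
    for graph in graphs:
        for key, bits in graph.items():
            merged[key] = merged.get(key, 0) | bits
    return merged

# Four hypothetical event batches, one per ingestion worker.
batches = [
    [("s1", 0), ("s1", 1)],
    [("s1", 2), ("d1", 0)],
    [("d1", 3)],
    [("s2", 1), ("s2", 2)],
]

local_graphs = [persist(ingest(batch)) for batch in batches]
global_graph = merge(local_graphs)

print({key: bin(bits) for key, bits in global_graph.items()})
```

Because OR is associative and commutative, the local graphs can be merged in any order or in stages without changing the global result.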

Page 45: Graph store
Page 46: Graph store

Shard1: u1 … u10, s1, s2, d1

Shard2: u11 … u20, s1, s2, d1

Partition on UserID and replicate metadata to all shards
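
The sharding idea can be sketched as below. The slide shows u1 and u10 on one shard and u11 and u20 on the other, so the sketch assumes range partitioning with ten users per shard; the `Shard` class, the metadata values and the visit events are invented.

```python
USERS_PER_SHARD = 10  # assumed range size (u1..u10, u11..u20, ...)

def shard_for_user(user_number):
    """Range-partition users: u1..u10 -> shard 0, u11..u20 -> shard 1, ..."""
    return (user_number - 1) // USERS_PER_SHARD

class Shard:
    def __init__(self, metadata):
        self.metadata = dict(metadata)  # full replica of site/device metadata
        self.visits = {}                # site -> set of user numbers on this shard

    def record_visit(self, user_number, site):
        self.visits.setdefault(site, set()).add(user_number)

    def reach(self, site):
        return len(self.visits.get(site, ()))

# Hypothetical metadata, replicated to every shard.
metadata = {
    "s1": {"category": "games"},
    "s2": {"category": "news"},
    "d1": {"os": "android"},
}
shards = [Shard(metadata), Shard(metadata)]

for user_number, site in [(1, "s1"), (10, "s1"), (11, "s1"), (20, "s2")]:
    shards[shard_for_user(user_number)].record_visit(user_number, site)

# Users are disjoint across shards, so per-shard counts can simply be added.
total_reach_s1 = sum(shard.reach("s1") for shard in shards)
print(total_reach_s1)  # prints: 3
```

Replicating the small metadata set means every shard can answer site- or device-level questions locally, and only the per-shard results need to be combined.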

Page 47: Graph store

Build up

• KV store: high throughput, low latency for larger values.

– Evaluated

• LevelDB: chosen at the moment

• LightningDB

• Others

Page 48: Graph store

Build up contd..

• Bitmap indexing library

– Evaluated

• FastBit: chosen

• JavaEWAH

Page 49: Graph store

Examples of costly queries

• getEntitiesWithMaxUserCount(startDate, endDate, entityType)

– Ex: work on a batch of sites in parallel; for a single site, work on multiple dates in parallel.

• getEntitiesWithExclusiveUsers(startDate, endDate, entityType)

• getRepeatingUserDistribution(startDate, endDate, entityType)
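
Two of these queries can be sketched on top of per-day user bitmaps. This is only an illustration under assumptions: the daily index, its dates and bit patterns are made up, Python integers stand in for bitmaps, the snake_case functions are hypothetical counterparts of the APIs named above, and a thread pool stands in for whatever parallel execution the real system uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical daily index: (entity, date) -> bitmap of users seen that day.
daily = {
    ("s1", "2015-06-01"): 0b00111, ("s1", "2015-06-02"): 0b00100,
    ("s2", "2015-06-01"): 0b01100, ("s2", "2015-06-02"): 0b01000,
}

def users_in_range(entity, dates):
    """OR together the per-day bitmaps of one entity."""
    bits = 0
    for date in dates:
        bits |= daily.get((entity, date), 0)
    return entity, bits

def entity_bitmaps(entities, dates):
    """Compute the per-entity bitmaps in parallel, one task per entity."""
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda e: users_in_range(e, dates), entities))

def entities_with_max_user_count(bitmaps):
    """Entities whose bitmap has the highest number of set bits."""
    best = max(bin(bits).count("1") for bits in bitmaps.values())
    return sorted(e for e, bits in bitmaps.items() if bin(bits).count("1") == best)

def exclusive_users(bitmaps):
    """For each entity, the users that appear on no other entity."""
    result = {}
    for entity, bits in bitmaps.items():
        others = 0
        for other, other_bits in bitmaps.items():
            if other != entity:
                others |= other_bits
        result[entity] = bits & ~others
    return result

bitmaps = entity_bitmaps(["s1", "s2"], ["2015-06-01", "2015-06-02"])
print(entities_with_max_user_count(bitmaps), exclusive_users(bitmaps))
```

The per-entity work is independent, which is what makes the "batch of sites in parallel" strategy possible; only the final comparison across entities needs all the bitmaps together.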