Avatara: OLAP for Web-scale Analytics Products

Upload: lili-wu

Post on 11-Jul-2015

Page 1: Avatara: OLAP for Web-scale Analytics Products

Avatara

OLAP for Web-scale Analytics Products

Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, Sam Shah

http://www.linkedin.com/in/liliwu

Page 2: Avatara: OLAP for Web-scale Analytics Products

About LinkedIn

•  World’s largest professional social network
•  175+ million members as of August 2012
•  2 new members per second
•  26th most visited website in June 2012*

* Based on comScore, Q2 2012

Page 3: Avatara: OLAP for Web-scale Analytics Products

Structured User Data
•  Industry
•  Country

Page 4: Avatara: OLAP for Web-scale Analytics Products

Activity Data
•  View profile
•  Apply for job

Structured User Data
•  Industry
•  Country

Page 5: Avatara: OLAP for Web-scale Analytics Products

Structured User Data + Activity Data = Analytical Insights

Page 6: Avatara: OLAP for Web-scale Analytics Products

Who Viewed My Profile (WVMP)

Page 12: Avatara: OLAP for Web-scale Analytics Products

If we only have Member and Industry attributes

                  Industry
Member   Computer Software   Recruiting & Staffing   Mobile Internet
Alice    260                 152                     293
Bob      233                 186                     121

Page 13: Avatara: OLAP for Web-scale Analytics Products

If we include the Country attribute

[Figure; axis label: Member]

Page 14: Avatara: OLAP for Web-scale Analytics Products

If we include the Country attribute

[Figure; axis label: Member]

If we add viewing time…

Page 15: Avatara: OLAP for Web-scale Analytics Products

OLAP

Page 16: Avatara: OLAP for Web-scale Analytics Products

OLAP

Online Analytical Processing (OLAP) is an approach to quickly answering multi-dimensional analytical queries.

Page 17: Avatara: OLAP for Web-scale Analytics Products

OLAP Cube

Stores data in multi-dimensional form

Page 18: Avatara: OLAP for Web-scale Analytics Products

OLAP Cube

[Figure: cube; labels: Member, Dimensions]

Page 19: Avatara: OLAP for Web-scale Analytics Products

OLAP Cube

[Figure: cube; labels: Member, Dimensions, Measure 5]

Page 20: Avatara: OLAP for Web-scale Analytics Products

Our challenge: Web-scale OLAP

Page 21: Avatara: OLAP for Web-scale Analytics Products

Our challenge: Web-scale OLAP

•  Horizontally scalable
   •  175+ million members
   •  Adding 2 new members per second

Page 22: Avatara: OLAP for Web-scale Analytics Products

Our challenge: Web-scale OLAP

•  Horizontally scalable
•  Low latency query
   •  In the request/response loop
   •  Tens of milliseconds

Page 23: Avatara: OLAP for Web-scale Analytics Products

Our challenge: Web-scale OLAP

•  Horizontally scalable
•  Low latency query
•  Highly available
   •  26th most visited website

Page 24: Avatara: OLAP for Web-scale Analytics Products

Our challenge: Web-scale OLAP

•  Horizontally scalable
•  Low latency query
•  Highly available
•  High read & write throughput
   •  Billions of monthly page views

Page 25: Avatara: OLAP for Web-scale Analytics Products

Our Options …

•  Traditional OLAP
•  Distributed OLAP
•  Materialize Cubes

Page 26: Avatara: OLAP for Web-scale Analytics Products

Traditional OLAP

For business intelligence (offline analysis)

•  SAP
•  Oracle Hyperion
•  MicroStrategy
•  …

Page 27: Avatara: OLAP for Web-scale Analytics Products

Traditional OLAP

For business intelligence

•  Few concurrent users
•  High latency for web traffic

Page 28: Avatara: OLAP for Web-scale Analytics Products

Traditional OLAP

For business intelligence

•  Few concurrent users
•  High latency for web traffic

Not well-suited for web-scale online traffic

Page 29: Avatara: OLAP for Web-scale Analytics Products

Distributed OLAP

[Diagram labels: Query, Result, Query Distribution and Processing Layer]

Page 34: Avatara: OLAP for Web-scale Analytics Products

Materialize Cubes

Materialize: pre-compute all combinations

Combination                   Count
{Alice}                       55
{Alice, Internet}             21
{Alice, Recruiting}           22
{Alice, Internet, U.S.}       10
{Alice, Internet, Canada}     11
…
{Bob}                         60
{Bob, Internet}               34
…

Grouping levels: { Member }, { Member, Industry }, { Member, Industry, Country }

Page 36: Avatara: OLAP for Web-scale Analytics Products

Materialize Cubes

•  Materializing requires more space and time
   175 million members, an average of 10 industries and 5 countries per member, 90 days

     175 million
   + 175 million × 10 industries
   + 175 million × 5 countries
   + 175 million × 90 days
   + 175 million × 10 industries × 5 countries
   + 175 million × 10 industries × 90 days
   + 175 million × 5 countries × 90 days
   + 175 million × 10 industries × 5 countries × 90 days
   + …
   ≈ 1 trillion keys

•  Each profile view is turned into 8 writes. With billions of page views, the write load is too high.
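The arithmetic on this slide can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes the reading that each optional dimension (industry, country, day) is either present or absent in a key. Under that reading the eight listed terms factor into 175 million × (1 + 10) × (1 + 5) × (1 + 90), and a single profile view touches one key per subset of the three optional dimensions, which would explain the 8 writes.

```java
// Back-of-the-envelope check of the materialized key count (illustrative only).
final class CubeKeyEstimate {

    // Every member is combined with every subset of the optional dimensions,
    // so each dimension contributes a factor of (1 + average distinct values).
    static long totalKeys(long members, long[] avgValuesPerDimension) {
        long keys = members;
        for (long values : avgValuesPerDimension) {
            keys *= (1 + values);
        }
        return keys;
    }

    // Keys updated by a single event: one per subset of the optional
    // dimensions, i.e. 2^d for d dimensions.
    static long writesPerEvent(int optionalDimensions) {
        return 1L << optionalDimensions;
    }

    public static void main(String[] args) {
        // Figures from the slide: 175 million members, 10 industries,
        // 5 countries, 90 days.
        long keys = totalKeys(175_000_000L, new long[] {10, 5, 90});
        System.out.println(keys);              // 1051050000000, about 1 trillion
        System.out.println(writesPerEvent(3)); // 8
    }
}
```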

Page 37: Avatara: OLAP for Web-scale Analytics Products

•  Traditional OLAP
•  Distributed OLAP
•  Materialize Cubes

Page 39: Avatara: OLAP for Web-scale Analytics Products

Our use cases…

Page 45: Avatara: OLAP for Web-scale Analytics Products

Our use cases…

•  Data can be sharded
   •  By member id for “Who Viewed My Profile”
   •  Data size per shard is small (< 2 MB)
   •  Goal: single disk I/O
   Many small cubes
•  Can tolerate some data staleness
   •  Within a few hours

Page 46: Avatara: OLAP for Web-scale Analytics Products

Our challenges

Our use cases

Page 47: Avatara: OLAP for Web-scale Analytics Products

Our challenges + Our use cases

Avatara

Page 48: Avatara: OLAP for Web-scale Analytics Products

Avatara

OLAP for Web-scale Analytical Products

Page 49: Avatara: OLAP for Web-scale Analytics Products

Avatara

OLAP for Web-scale Analytical Products

•  Used in production for 2+ years
•  Powers several analytical products

Page 50: Avatara: OLAP for Web-scale Analytics Products

Agenda
•  Architecture
•  Related work
•  Conclusion

Page 52: Avatara: OLAP for Web-scale Analytics Products

Avatara Architecture

An OLAP system:
•  Compute cubes
•  Serve queries

Together

Page 55: Avatara: OLAP for Web-scale Analytics Products

Avatara Architecture

•  Offline: Compute cubes
   •  Goal: high throughput
   •  Batch processing (Hadoop)
•  Online: Serve queries
   •  Goal: low latency, high availability
   •  Key-value store (Voldemort)

Page 56: Avatara: OLAP for Web-scale Analytics Products

Avatara Architecture

[Architecture diagram. Components: Activity data; Storage; Offline Batch Engine (Preprocessing, Projection + Join, Cubification); Key-value storage; Online Query Engine; Site]

Page 63: Avatara: OLAP for Web-scale Analytics Products

Offline Batch Engine

[Architecture diagram. Components: Activity data; Storage; Offline Batch Engine (Preprocessing, Projection + Join, Cubification); Key-value storage; Online Query Engine; Site]

Controlled by a configuration file

Page 64: Avatara: OLAP for Web-scale Analytics Products

Phase 1: Preprocessing

input.profile_views = /profile_views
input.member_info = /member_info

[Pipeline: Preprocessing → Projection + Join → Cubification]

Page 65: Avatara: OLAP for Web-scale Analytics Products

Phase 2: Projection + Join

dimensions = member_info.member_id, member_info.industry, member_info.country
facts = profile_views.viewee_id, profile_views.viewer_id, profile_views.time
measure = profile_views.visit
join = profile_views.viewer_id, member_info.member_id

[Pipeline: Preprocessing → Projection + Join → Cubification]
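The join configured on this slide (profile_views.viewer_id against member_info.member_id) can be pictured with a small in-memory sketch. The deck says this phase runs as Hadoop batch processing; the single-process hash join below is a stand-in for that, not the Avatara implementation, and the class name, array-based row layout, and sample values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the Projection + Join phase (hypothetical names).
final class ProjectionJoinSketch {

    // profileViews rows: {viewee_id, viewer_id, time}
    // memberInfo rows:   {member_id, industry, country}
    // Output rows:       {viewee_id, viewer member_id, industry, country, time}
    static List<String[]> join(List<String[]> profileViews, List<String[]> memberInfo) {
        // Build side: index member_info by member_id.
        Map<String, String[]> memberById = new HashMap<>();
        for (String[] member : memberInfo) {
            memberById.put(member[0], member);
        }
        // Probe side: look up each view's viewer and project the needed columns.
        List<String[]> joined = new ArrayList<>();
        for (String[] view : profileViews) {
            String[] viewer = memberById.get(view[1]);
            if (viewer == null) {
                continue; // inner join: skip views whose viewer has no member_info row
            }
            joined.add(new String[] {view[0], viewer[0], viewer[1], viewer[2], view[2]});
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> views = Arrays.asList(
            new String[] {"alice", "bob", "2012-08-01"},
            new String[] {"alice", "carol", "2012-08-02"});
        List<String[]> members = Arrays.asList(
            new String[] {"bob", "Internet", "U.S."},
            new String[] {"carol", "Recruiting", "Canada"});
        for (String[] row : join(views, members)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```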

Page 70: Avatara: OLAP for Web-scale Analytics Products

Phase 3: Cubification

cube.name = wvmp-cube-profile-views
cube.shard_key = profile_views.viewee_id

[Pipeline: Preprocessing → Projection + Join → Cubification]
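One way to picture cubification is as a group-by on the shard key: all joined rows for the same viewee end up in one small cube. The sketch below assumes that reading. The class name, the row layout, and the choice of a map from dimension tuple to visit count are illustrative assumptions; the deck does not describe the actual cube layout.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of building one small cube per shard key (hypothetical names).
final class CubificationSketch {

    // Input rows: {viewee_id, industry, country}, one row per profile view.
    // Output: shard key (viewee_id) -> cube, where a cube maps the dimension
    // tuple (industry, country) to the number of visits.
    static Map<String, Map<List<String>, Integer>> cubify(List<String[]> rows) {
        Map<String, Map<List<String>, Integer>> cubes = new HashMap<>();
        for (String[] row : rows) {
            Map<List<String>, Integer> cube =
                cubes.computeIfAbsent(row[0], key -> new HashMap<>());
            // Each view contributes a visit count of 1 to its cell.
            cube.merge(Arrays.asList(row[1], row[2]), 1, Integer::sum);
        }
        return cubes;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"alice", "Internet", "U.S."},
            new String[] {"alice", "Internet", "U.S."},
            new String[] {"alice", "Recruiting", "Canada"},
            new String[] {"bob", "Internet", "U.S."});
        System.out.println(cubify(rows).get("alice"));
    }
}
```

Because every cube is keyed by a single member, it can be stored as one value in a key-value store, which fits the deck's goal of answering a query with a single lookup.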

Page 71: Avatara: OLAP for Web-scale Analytics Products

Configuration File

input.profile_views = /profile_views
input.member_info = /member_info

dimensions = member_info.member_id, member_info.industry, member_info.country
facts = profile_views.viewee_id, profile_views.viewer_id, profile_views.time
measure = profile_views.visit
join = profile_views.viewer_id, member_info.member_id

cube.name = wvmp-cube-profile-views
cube.shard_key = profile_views.viewee_id
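The configuration shown here happens to look like a Java properties file, so a reader could experiment with it using `java.util.Properties`. That is an assumption about the syntax only: the deck does not say how Avatara parses this file, and the class and helper names below are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative parser for the slide's configuration, treated as a properties file.
final class CubeConfigSketch {

    // The configuration from the slide, one key per line.
    static final String EXAMPLE = String.join("\n",
        "input.profile_views = /profile_views",
        "input.member_info = /member_info",
        "dimensions = member_info.member_id, member_info.industry, member_info.country",
        "facts = profile_views.viewee_id, profile_views.viewer_id, profile_views.time",
        "measure = profile_views.visit",
        "join = profile_views.viewer_id, member_info.member_id",
        "cube.name = wvmp-cube-profile-views",
        "cube.shard_key = profile_views.viewee_id");

    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory text
        }
        return props;
    }

    // Splits a comma-separated value such as "dimensions" into trimmed entries.
    static List<String> list(Properties props, String key) {
        List<String> values = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Properties props = parse(EXAMPLE);
        System.out.println(props.getProperty("cube.shard_key"));
        System.out.println(list(props, "dimensions"));
    }
}
```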

Page 72: Avatara: OLAP for Web-scale Analytics Products

Blob Format

[Figure: key-value table with columns Key and Value; example key: Alice]

Page 73: Avatara: OLAP for Web-scale Analytics Products

Offline Batch Engine Recap

[Architecture diagram. Components: Activity data; Storage; Offline Batch Engine (Preprocessing, Projection + Join, Cubification); Key-value storage; Online Query Engine; Site]

Controlled by a configuration file

Page 74: Avatara: OLAP for Web-scale Analytics Products

Online Query Engine

[Architecture diagram. Components: Activity data; Storage; Offline Batch Engine (Preprocessing, Projection + Join, Cubification); Key-value storage; Online Query Engine; Site]

Page 75: Avatara: OLAP for Web-scale Analytics Products

Key-value Storage

[Figure: key-value table with columns Key and Value; example keys: Alice, Bob]

Bulk Load*

* R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah. Serving Large-scale Batch Computed Data with Project Voldemort. In FAST, pages 223–235, 2012.

Page 77: Avatara: OLAP for Web-scale Analytics Products

Online Query Engine

[Figure: Key-value Storage and Online Query Engine; example keys: Alice, Bob]

Single disk I/O

Page 78: Avatara: OLAP for Web-scale Analytics Products

Online Query Engine

•  Select
•  Where
•  Group-by
•  Having
•  Order
•  Limit
•  Count / Percent / Sum / Average

Page 88: Avatara: OLAP for Web-scale Analytics Products

Online Query Engine

AvataraQuery query = new AvataraSqlishBuilder()
    .setCube("wvmp-cube-profile-views")
    .setShardKey($member_id)
    .select("visit")
    .select("member_info.industry")
    .group("member_info.industry")
    .sum("visit")
    .order("visit", "desc")
    .limit(10)
    .build();

AvataraResult result = queryEngine.getCube(query);
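To make the query's semantics concrete, here is a rough sketch of what "group by industry, sum visits, order descending, limit" could look like when evaluated in memory over one member's small cube. It does not use or reproduce the `AvataraSqlishBuilder` API, whose internals the deck does not show. The class name and cube layout are assumptions; the Internet counts follow the Alice example earlier in the deck, and the remaining cells are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative in-memory evaluation of the example query (hypothetical names).
final class SmallCubeQuerySketch {

    // cube: (industry, country) -> visit count for a single shard key.
    // Returns industry -> total visits, highest first, at most `limit` rows.
    static Map<String, Integer> topIndustries(Map<List<String>, Integer> cube, int limit) {
        // GROUP BY industry, SUM(visit). TreeMap gives a deterministic base order.
        Map<String, Integer> byIndustry = new TreeMap<>();
        for (Map.Entry<List<String>, Integer> cell : cube.entrySet()) {
            byIndustry.merge(cell.getKey().get(0), cell.getValue(), Integer::sum);
        }
        // ORDER BY visit DESC. The sort is stable, so ties stay alphabetical.
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(byIndustry.entrySet());
        rows.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        // LIMIT
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> row : rows.subList(0, Math.min(limit, rows.size()))) {
            result.put(row.getKey(), row.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<List<String>, Integer> aliceCube = new HashMap<>();
        aliceCube.put(Arrays.asList("Internet", "U.S."), 10);
        aliceCube.put(Arrays.asList("Internet", "Canada"), 11);
        aliceCube.put(Arrays.asList("Recruiting", "U.S."), 22);   // made-up split
        aliceCube.put(Arrays.asList("Software", "U.S."), 5);      // made-up cell
        System.out.println(topIndustries(aliceCube, 2)); // {Recruiting=22, Internet=21}
    }
}
```

Since a shard is small (under 2 MB per the deck), this kind of aggregation at query time is cheap compared with materializing every combination in advance.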

Page 91: Avatara: OLAP for Web-scale Analytics Products

Cube Thinning

•  Heavy hitters
   •  Roll up data to a coarser granularity
   •  Drop data from a dimension
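Both thinning techniques listed here amount to re-keying cube cells and summing the cells that collide. The sketch below illustrates that idea under assumed details: cube keys of the form (industry, day as "yyyy-MM-dd"), a roll-up from day to month, and a variant that drops the time dimension entirely. The class and method names are hypothetical, and the deck does not say which granularities or dimensions Avatara actually thins.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative cube thinning by re-keying and summing (hypothetical names).
final class CubeThinningSketch {

    // Applies `coarsen` to every cell key and sums cells that map to the same key.
    static Map<List<String>, Integer> thin(Map<List<String>, Integer> cube,
                                           Function<List<String>, List<String>> coarsen) {
        Map<List<String>, Integer> thinned = new HashMap<>();
        for (Map.Entry<List<String>, Integer> cell : cube.entrySet()) {
            thinned.merge(coarsen.apply(cell.getKey()), cell.getValue(), Integer::sum);
        }
        return thinned;
    }

    // Roll up: (industry, "yyyy-MM-dd") -> (industry, "yyyy-MM").
    static List<String> dayToMonth(List<String> key) {
        return Arrays.asList(key.get(0), key.get(1).substring(0, 7));
    }

    // Drop a dimension: (industry, day) -> (industry).
    static List<String> dropTime(List<String> key) {
        return Collections.singletonList(key.get(0));
    }

    public static void main(String[] args) {
        Map<List<String>, Integer> cube = new HashMap<>();
        cube.put(Arrays.asList("Internet", "2012-08-01"), 3);
        cube.put(Arrays.asList("Internet", "2012-08-02"), 4);
        cube.put(Arrays.asList("Recruiting", "2012-07-31"), 1);
        System.out.println(thin(cube, CubeThinningSketch::dayToMonth).size()); // 2
        System.out.println(thin(cube, CubeThinningSketch::dropTime).size());   // 2
    }
}
```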

Page 92: Avatara: OLAP for Web-scale Analytics Products

Predicate Push-Down

•  Key-value storage nodes: I/O-bound
•  Our data: one blob
•  Computation done on storage nodes
•  Decrease data transfer

Page 93: Avatara: OLAP for Web-scale Analytics Products

Avatara Architecture Recap

[Architecture diagram. Components: Activity data; Storage; Offline Batch Engine (Preprocessing, Projection + Join, Cubification); Key-value storage; Online Query Engine; Site]

Page 94: Avatara: OLAP for Web-scale Analytics Products

Agenda
•  Architecture
•  Related Work
•  Conclusion

Page 95: Avatara: OLAP for Web-scale Analytics Products

Related Work

•  Distributed OLAP
   •  Scatter-gather

Page 96: Avatara: OLAP for Web-scale Analytics Products

Related Work

•  Distributed OLAP
•  MR Cube [Nandi11]
   •  Materializes cubes for holistic measures (median, distinct)
   •  Utilizes MapReduce
   •  No query engine

Page 97: Avatara: OLAP for Web-scale Analytics Products

Related Work

•  Distributed OLAP
•  MR Cube
•  Key-value store
   •  Amazon Dynamo [DeCandia07]
   •  Yahoo PNUTS [Cooper08]

Page 99: Avatara: OLAP for Web-scale Analytics Products

Experiences

•  In production for 2+ years
•  Powers several analytical products
•  Hadoop + Voldemort
•  Cost-effective: commodity hardware
•  Horizontally scalable

Page 100: Avatara: OLAP for Web-scale Analytics Products

Future work

•  Near real-time cubing
•  Scaling reads under high write throughput
•  Streaming joins
•  Multi-tenancy issues
•  Dimension and schema changes

Page 101: Avatara: OLAP for Web-scale Analytics Products

Conclusion

•  Problem: Web-scale OLAP with high throughput, low latency, and high availability
•  Insight: Many small cubes, sharded by key
•  Solution: Mix of batch computation and online serving

Page 102: Avatara: OLAP for Web-scale Analytics Products

Selected Bibliography

B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!’s Hosted Data Serving Platform. PVLDB, 1(2):1277–1288, 2008.

G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s Highly Available Key-Value Store. SIGOPS Operating Systems Review, 41(6):205–220, 2007.

A. Nandi, C. Yu, P. Bohannon, and R. Ramakrishnan. Distributed Cube Materialization on Holistic Measures. In ICDE, pages 183–194, 2011.

R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah. Serving Large-scale Batch Computed Data with Project Voldemort. In FAST, pages 223–235, 2012.

Page 103: Avatara: OLAP for Web-scale Analytics Products

Questions?

Thank you!
