HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Cosmin Lehene | Adobe Low Latency “OLAP” with HBase

Upload: cloudera-inc

Post on 10-May-2015


DESCRIPTION

Adobe Systems uses “SaasBase Analytics” to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in real time. Our goal was to process new data in real time (currently minutes) and have it ready for a large number of concurrent queries that execute in milliseconds. This set our problem apart from what is traditionally solved with Hive or Pig. In this talk I’ll describe the design and the strategies (and hacks) we used to achieve low latency and scalability, from the theoretical model to the entire process, from ETL to warehousing and queries.

TRANSCRIPT

Page 1: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Cosmin Lehene | Adobe

Low Latency “OLAP” with HBase

Page 2: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

What we needed … and built

OLAP Semantics
Low Latency Ingestion
High Throughput
Real-time Query API

Not hardcoded to web analytics or x-, y-, z- analytics, but extensible

Page 3: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Building Blocks

Dimensions, Metrics
Aggregations
Roll-up, drill-down, slicing and dicing, sorting

Page 4: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

OLAP 101 – Queries example

Date        Country  City     OS       Browser  Sale
2012-05-21  USA      NY       Windows  FF        0.0
2012-05-21  USA      NY       Windows  FF       10.0
2012-05-22  USA      SF       OSX      Chrome   25.0
2012-05-22  Canada   Ontario  Linux    Chrome    0.0
2012-05-23  USA      Chicago  OSX      Safari   15.0

5 visits, 3 days
2 countries: USA: 4, Canada: 1
4 cities: NY: 2, SF: 1
3 OSes: Win: 2, OSX: 2
3 browsers: FF: 2, Chrome: 2
3 sales, 50.0 total

Page 5: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

OLAP 101 – Queries example

Rolling up to country level:

SELECT COUNT(visits), SUM(sales)
GROUP BY country

Country  visits  sales
USA      4       $50
Canada   1       $0

“Slicing” by browser:

SELECT COUNT(visits), SUM(sales)
GROUP BY country
HAVING browser = 'FF'

Country  visits  sales
USA      2       $10
Canada   0       $0

Top browsers by sales:

SELECT SUM(sales), COUNT(visits)
GROUP BY browser
ORDER BY sales DESC

Browser  sales  visits
Chrome   $25    2
Safari   $15    1
FF       $10    2
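
The three queries on this slide can be reproduced over the five sample rows with a few lines of Python. This is only a sketch of the semantics, not SaasBase code; the `aggregate` helper and its `keep` parameter are names made up for illustration.

```python
from collections import defaultdict

# Sample rows copied from the slide's table.
COLUMNS = ("date", "country", "city", "os", "browser", "sale")
ROWS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def aggregate(rows, group_by, keep=None):
    """Return {group value: (visits, sales)}, optionally filtering rows first."""
    column = COLUMNS.index(group_by)
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if keep is not None and not keep(record):
            continue
        totals[row[column]][0] += 1               # COUNT(visits)
        totals[row[column]][1] += record["sale"]  # SUM(sales)
    return {key: (visits, sales) for key, (visits, sales) in totals.items()}


# Roll-up, slice, and top-n, in the same order as the slide.
rollup = aggregate(ROWS, "country")
firefox_slice = aggregate(ROWS, "country", keep=lambda r: r["browser"] == "FF")
top_browsers = sorted(aggregate(ROWS, "browser").items(),
                      key=lambda item: item[1][1], reverse=True)
```

Because the filter runs before grouping, a country with no matching rows is simply absent from `firefox_slice`, whereas the slide lists Canada with zeros.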

Page 6: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

OLAP – Runtime Aggregation vs. Pre-aggregation

Aggregate at runtime
  Most flexible
  Fast – scatter gather
  Space efficient
  But:
    I/O, CPU intensive
    Slow for larger data
    Low throughput

Pre-aggregate
  Fast
  Efficient – O(1)
  High throughput
  But:
    More effort to process (latency)
    Combinatorial explosion (space)
    No flexibility
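
A minimal sketch of the trade-off, assuming a toy event list and a single "sales by country" question (both are illustrative, not taken from SaasBase): runtime aggregation rescans every event per query, while pre-aggregation pays once at ingestion and then answers with a single lookup.

```python
from collections import defaultdict

# Illustrative events: (date, country, sale).
EVENTS = [
    ("2012-05-21", "USA", 0.0),
    ("2012-05-21", "USA", 10.0),
    ("2012-05-22", "USA", 25.0),
    ("2012-05-22", "Canada", 0.0),
    ("2012-05-23", "USA", 15.0),
]


def runtime_sales(events, country):
    """Runtime aggregation: every query scans all events."""
    return sum(sale for _, c, sale in events if c == country)


def build_view(events):
    """Pre-aggregation: one pass at ingestion builds a country -> sales view."""
    view = defaultdict(float)
    for _, country, sale in events:
        view[country] += sale
    return dict(view)


view = build_view(EVENTS)      # paid once, at processing time
usa_sales = view["USA"]        # each query is then a single key lookup
```

The view only answers the question it was built for, which is the "no flexibility" cost: a query by browser would need a different view or a fallback to scanning.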

Page 7: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Pre-aggregation

Data needs to be summarized
  Can’t visualize 1B data points (no, not even with Retina display)
  Difficult to comprehend correlations among more than 3 dimensions
Not all dimension groups are relevant
  Index on an as-needed basis (view selection problem)
Runtime aggregation == TeraSort for every query?
Pre-aggregate to reduce cardinality
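
To see why only some dimension groups get indexed, it helps to count them. The sketch below enumerates every possible group for the five dimensions of the sample table, then keeps only the groups that a declared set of queries needs. The query list and the `views_for` helper are hypothetical.

```python
from itertools import combinations

DIMENSIONS = ("date", "country", "city", "os", "browser")


def all_views(dimensions):
    """Every non-empty dimension group that could be pre-aggregated."""
    return [combo
            for size in range(1, len(dimensions) + 1)
            for combo in combinations(dimensions, size)]


def views_for(queries):
    """Keep only the dimension groups that some declared query uses."""
    return sorted({tuple(sorted(query)) for query in queries})


candidate_count = len(all_views(DIMENSIONS))   # 2**5 - 1 groups
needed = views_for([
    ("country",),
    ("browser",),
    ("country", "browser"),
    ("browser", "country"),   # same group as above, different order
])
```

Each extra dimension doubles the number of candidate groups, so indexing only the groups that are actually queried keeps space and processing time bounded.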

Page 8: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

SaasBase

We tune both:
  pre-aggregation level vs. runtime post-aggregation
  (ingestion speed + space) vs. (query speed)

Think materialized views from RDBMS

Page 9: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

SaasBase Domain Model Mapping

Page 10: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

SaasBase - Domain Model Mapping

Page 11: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

SaasBase - Ingestion, Processing, Indexing, Querying

Page 12: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

SaasBase - Ingestion, Processing, Indexing, Querying

Page 13: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Ingestion

Page 14: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Ingestion throughput vs. latency

Historical data (large batches): optimize for throughput
Increments (latest data, smaller): optimize for latency

Page 15: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Large, granular input strategies

Slow listing in HDFS
  Archive processed files
Filtering input
  FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
  TableInputFormat start/stop row
  File Index in HBase (track processed/new files)
Map tasks overhead - stitching input splits
  400K files => 400K map tasks => overhead, slow reduce copy
  CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
  FixedMappersTableInputFormat (e.g. 5-region splits)
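
Two of these strategies can be sketched in plain Python: filtering files by the hour encoded in their name, and stitching many small files into fewer, larger splits. The function names are hypothetical stand-ins, not the SaasBase or Hadoop classes, and the greedy packing ignores data locality, which the real CombineFileInputFormat takes into account.

```python
import re
from datetime import datetime

# Matches the slide's naming pattern: log-YYYY-MM-dd-HH.log
LOG_NAME = re.compile(r"^log-(\d{4}-\d{2}-\d{2}-\d{2})\.log$")


def filter_by_date(names, start, end):
    """Keep log files whose name-encoded hour falls in [start, end)."""
    kept = []
    for name in names:
        match = LOG_NAME.match(name)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
        if start <= stamp < end:
            kept.append(name)
    return kept


def combine_into_splits(files, max_split_bytes):
    """Greedily pack (name, size) pairs into splits of at most max_split_bytes."""
    splits, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits
```

With one map task per split instead of one per file, the task count depends on total input size rather than on the number of files.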

Page 16: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Ingestion – Bulk Import

HFileOutputFormat (HFOF)
  Hundreds of times faster than the HBase API
  No need to recover from failed jobs
  No unnecessary load on machines
* No shuffle - global reduce order required!
  e.g. first reduce key needs to be in the first region, last one in the last region
  Watch for uneven partitions
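
The global-order requirement means each reducer must own one contiguous key range, aligned with a region. A rough sketch of such a range partitioner follows, in the spirit of a total-order partitioner; the split points are invented for the example.

```python
from bisect import bisect_right


def region_for_key(row_key, split_points):
    """Index of the region (and reducer) whose key range contains row_key.

    split_points holds the sorted start keys of every region except the first;
    a key equal to a split point belongs to the region that starts there.
    """
    return bisect_right(split_points, row_key)


# Hypothetical region boundaries on date-prefixed row keys.
SPLIT_POINTS = ["2012-01", "2012-03", "2012-05"]

keys = ["2011-12-31/x", "2012-01", "2012-04-15/y", "2012-05-22/z"]
assignments = [region_for_key(key, SPLIT_POINTS) for key in keys]
```

Since sorted keys map to non-decreasing reducer indexes, concatenating the reducers' outputs in order yields one globally sorted sequence, with the first key in the first region and the last key in the last.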

Page 17: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

HFOF – FileSizeDatePartitioner

1 partition (reduce) / day for initial import
Uneven reduce (partitions) due to data growth over time
  Reduce k: 2010-12-04 = 500MB
  Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region
Balance reduce buckets based on input file sizes and the reduce key
Generate sub-partitions based on predefined size (e.g. 1GB)
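
The sizing arithmetic can be sketched as follows. This is an assumption about how such a partitioner might plan its reducers, not the actual FileSizeDatePartitioner: each day gets as many consecutive reducers as its input size requires at the target partition size.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def plan_sub_partitions(bytes_per_day, target_bytes):
    """Map each day to a consecutive range of reducer indexes, sized by input."""
    plan = {}
    next_reducer = 0
    for day in sorted(bytes_per_day):
        count = max(1, math.ceil(bytes_per_day[day] / target_bytes))
        plan[day] = range(next_reducer, next_reducer + count)
        next_reducer += count
    return plan


# The two days quoted on the slide, with a 1GB target partition size.
plan = plan_sub_partitions({"2010-12-04": 500 * MB, "2012-05-22": 5 * GB}, GB)
```

The 500MB day keeps a single reducer while the 5GB day is spread over five. Days are visited in sorted order, so reducer indexes still follow key order, as bulk import requires.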

Page 18: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Processing

Page 19: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Processing

Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real time

1 year: 1B events => 100B data points indexed
Query => scan 365 data points (e.g. daily page views)

Processing could be either MR or real-time (e.g. Storm)

Page 20: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Processing for OLAP semantics

GROUP BY (process, query)
COUNT, SUM, AVG, etc. (process, query)
SORT (process, query)
HAVING (mostly query, can define pre-process constraints)

Page 21: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

SaasBase vs. SQL Views Comparison

Page 22: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

reports.json entities definition

Page 23: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Processing Performance

read, map, partition, combine, copy, sort, reduce, write

Read:
  Scan.setCaching() (I/O ~ buffer)
  Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
  Even region distribution across cluster (distributes CPU, I/O)
Map:
  No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
  Avoid GC: new X() (CPU, Memory)
  Avoid system calls (context switching)
  Stripping unnecessary data (I/O)

Page 24: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Processing Performance

Hot (in memory) vs. Cold (on disk, on network) data
  Minimize I/O from disk/network
Single shot MR job: SuperProcessor
  Emit all groups from one map() call
Incremental processing
  Data format YYYY-MM-DD prefixed rowkey (HH:mm for more granularity)
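
A sketch of the single-shot idea: each input event is read once and emits one record per configured group, with the date as the row-key prefix. The group list, key layout, and function names are assumptions made for the example, not the SuperProcessor's actual format.

```python
from collections import Counter

# Hypothetical group definitions: each is a tuple of dimensions to group by.
GROUPS = [("country",), ("country", "city"), ("browser",)]


def map_event(event):
    """Yield one (row_key, visits, sales) record per group for a single event."""
    for dims in GROUPS:
        values = "/".join(event[dim] for dim in dims)
        row_key = f"{event['date']}/{'+'.join(dims)}/{values}"
        yield row_key, 1, event["sale"]


def reduce_all(events):
    """Sum visits and sales per row key, reading each event exactly once."""
    visits, sales = Counter(), Counter()
    for event in events:
        for row_key, count, sale in map_event(event):
            visits[row_key] += count
            sales[row_key] += sale
    return visits, sales


EVENTS = [
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 0.0},
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 10.0},
    {"date": "2012-05-22", "country": "USA", "city": "SF", "browser": "Chrome", "sale": 25.0},
    {"date": "2012-05-22", "country": "Canada", "city": "Ontario", "browser": "Chrome", "sale": 0.0},
    {"date": "2012-05-23", "country": "USA", "city": "Chicago", "browser": "Safari", "sale": 15.0},
]
visits, sales = reduce_all(EVENTS)
```

With the date leading every key, a new day's increment only adds keys under a new prefix, so earlier days do not have to be reprocessed.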

Page 25: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Indexing

Page 26: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

HBase natural order: hierarchical representation

Page 27: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Indexing - Why

Example: top 10 cities
  ~50K [country, city] combinations per day
  Top 10 cities for 1 year => 365 (days) X 50K ~= 18M data points scanned
  If you add gender => ~36M
  If you add Device, OS, Browser …
Might compress well, but think about the environment
  How much energy would you spend for just top 10 cities?

* Image from: http://my.neutralexistence.com/images/Green-Earth.jpg

Page 28: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Indexing with HBase

“10” < “2”: lexicographic sorting

GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10

2012/05/USA/0000000000/
2012/05/USA/4294961296/San Francisco = 1000 visits*
2012/05/USA/4294961396/New York = 900 visits*
. . .
2012/05/USA/9999999999/

scan 't', {STARTROW => '2012/05/USA/', LIMIT => 10}

* Padding numbers for lexicographic sorting: store MAX – visits at a fixed width so that larger counts sort first
  1000 -> MAX – 1000 = 4294961296 (the 10-digit values here are illustrative; Java's Long.MAX_VALUE is 9223372036854775807, which would give 19-digit keys)
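
The inverted, fixed-width count can be sketched like this. `MAX` is an assumed 10-digit ceiling chosen for readability, the helper names are made up, and the counts other than San Francisco's and New York's are invented.

```python
from bisect import bisect_left

MAX = 9_999_999_999        # assumption: a 10-digit ceiling for any count
WIDTH = len(str(MAX))


def index_key(prefix, visits, label):
    """Build a key whose string order is descending visit order within a prefix."""
    inverted = str(MAX - visits).zfill(WIDTH)
    return f"{prefix}{inverted}/{label}"


def top_n(sorted_keys, prefix, n):
    """Emulate a prefix scan with a limit over lexicographically sorted keys."""
    start = bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix) or len(hits) == n:
            break
        hits.append(key.rsplit("/", 1)[1])
    return hits


# San Francisco and New York come from the slide; the rest are illustrative.
usa_visits = {"San Francisco": 1000, "New York": 900, "Chicago": 20}
keys = sorted(
    [index_key("2012/05/USA/", count, city) for city, count in usa_visits.items()]
    + [index_key("2012/05/Canada/", 300, "Ontario")]
)
top_two = top_n(keys, "2012/05/USA/", 2)
```

Plain string comparison puts "10" before "2", which is why raw counts cannot be used in the key. After inversion and padding, a scan from the prefix reads only the first n rows.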

Page 29: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Query Engine

Always reads indexed, compact data
Query parsing
Scan strategy
  Single vs. multiple scans
  Start/stop rows (prefixes, index positions, etc.)
  Index selection (volatile indexes with incremental processing)
Deserialization
Post-aggregation, sorting, fuzzy-sorting etc.
Paging
Custom dimension/metric class loading
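
For prefix-based start/stop rows, a common technique is to derive the stop row by incrementing the last byte of the prefix. The sketch below shows that derivation on byte strings. It illustrates the general approach and is not the SaasBase query engine; the sample keys are invented.

```python
def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest row key greater than every key that starts with prefix.

    Returns b"" when no such key exists (prefix is empty or all 0xff bytes),
    which a caller would treat as "no upper bound".
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_scan(sorted_keys, prefix):
    """Select keys in [prefix, stop) the way a start/stop-row scan would."""
    stop = prefix_stop_row(prefix)
    return [key for key in sorted_keys
            if key >= prefix and (not stop or key < stop)]


keys = sorted([b"2012/05/Canada/x", b"2012/05/USA/a", b"2012/05/USA/b", b"2012/05/USB"])
usa_rows = prefix_scan(keys, b"2012/05/USA/")
```

The scan then touches only the contiguous block of rows under the prefix, with no per-row filtering needed.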

Page 30: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Conclusions

OLAP semantics on a simple data model
Data as first class citizen
Domain Specific “Language” for Dimensions, Metrics, Aggregations
Tunable performance, resource allocation
Framework for vertical analytics systems

Page 31: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

Thank you!

Cosmin Lehene @clehene
http://hstack.org

Credits:
Andrei Dragomir
Adrian Muraru
Andrei Dulvac
Raluca Podiuc
Tudor Scurtu
Bogdan Dragu
Bogdan Drutu


Page 33: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

OLAP 101 - Rollup

Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country

Country  Visits  Sale
USA      4       $50
Canada   1       $0

Page 34: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

OLAP 101 - Slicing

Filter or Segment or Slice (WHERE or HAVING)

Date        Country  City     OS       Browser  Sale
2012-03-02  USA      NY       Windows  FF        0.0
2012-03-02  USA      NY       Windows  FF       10.0
2012-03-03  USA      SF       OSX      Chrome   25.0
2012-03-03  Canada   Ontario  Linux    Chrome    0.0
2012-03-04  USA      Chicago  OSX      Safari   15.0

5 visits, 3 days
2 countries: USA: 4, Canada: 1
4 cities: NY: 2, SF: 1
3 OSes: Win: 2, OSX: 2
3 browsers: FF: 2, Chrome: 2
3 sales, 50.0 total

Page 35: HBaseCon 2012 | Low Latency OLAP with HBase - Cosmin Lehene, Adobe

OLAP 101 – Sorting, TOP n

SELECT SUM(sales) as total GROUP BY browser ORDER BY total DESC

Browser  Sale
Chrome   $25
Safari   $15
Firefox  $10