qcon sf-2015 stream processing in uber

STREAM PROCESSING @ UBER DANNY YUAN @ UBER

Upload: danny-yuan

Post on 06-Apr-2017

TRANSCRIPT

Page 1: QCon SF-2015 Stream Processing in uber

STREAM PROCESSING @ UBER

DANNY YUAN @ UBER

Page 2: QCon SF-2015 Stream Processing in uber

What is Uber

Page 3: QCon SF-2015 Stream Processing in uber

Transportation at your fingertips

Page 4: QCon SF-2015 Stream Processing in uber
Page 5: QCon SF-2015 Stream Processing in uber

Stream Data Allows Us To Feel The Pulse Of Cities

Page 6: QCon SF-2015 Stream Processing in uber

Marketplace Health

Page 7: QCon SF-2015 Stream Processing in uber

What’s Going on Now

Page 8: QCon SF-2015 Stream Processing in uber

What’s Happened?

Page 9: QCon SF-2015 Stream Processing in uber

Status Tracking

Page 10: QCon SF-2015 Stream Processing in uber
Page 11: QCon SF-2015 Stream Processing in uber
Page 12: QCon SF-2015 Stream Processing in uber
Page 13: QCon SF-2015 Stream Processing in uber

A Little Background

Page 14: QCon SF-2015 Stream Processing in uber

Uber’s Platform Is a Distributed State Machine

Rider States

Page 15: QCon SF-2015 Stream Processing in uber

Uber’s Platform Is a Distributed State Machine

Rider States Driver States

Page 16: QCon SF-2015 Stream Processing in uber

Applications can’t do everything

Page 17: QCon SF-2015 Stream Processing in uber

Instead, Applications Emit Events

Page 18: QCon SF-2015 Stream Processing in uber

Events Should Be Available In Seconds

Page 19: QCon SF-2015 Stream Processing in uber

Events Should Rarely Get Lost

Page 20: QCon SF-2015 Stream Processing in uber

Events Should Be Cheap And Scalable

Page 21: QCon SF-2015 Stream Processing in uber
Page 22: QCon SF-2015 Stream Processing in uber

Where are the challenges?

Page 23: QCon SF-2015 Stream Processing in uber

Many Dimensions

Dozens of fields per event

Page 24: QCon SF-2015 Stream Processing in uber

Granular Data

Page 25: QCon SF-2015 Stream Processing in uber

Granular Data

Page 26: QCon SF-2015 Stream Processing in uber

Granular Data

Over 10,000 hexagons in the city

Page 27: QCon SF-2015 Stream Processing in uber

Granular Data

7 vehicle types

Page 28: QCon SF-2015 Stream Processing in uber

Granular Data

1440 minutes in a day

Page 29: QCon SF-2015 Stream Processing in uber

Granular Data

13 driver states

Page 30: QCon SF-2015 Stream Processing in uber

Granular Data

300 cities

Page 31: QCon SF-2015 Stream Processing in uber

Granular Data

1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
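The figure above is just the product of the five cardinalities quoted on the preceding slides. A minimal sketch to check the arithmetic (the dictionary keys are illustrative names, not field names from the talk):

```python
from math import prod

# Dimension cardinalities quoted on the slides; key names are illustrative.
cardinalities = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}

# Every possible (city, hexagon, vehicle type, minute, state) cell for one day.
total_combinations = prod(cardinalities.values())
print(f"{total_combinations:,}")
```

The exact product is 393,120,000,000, which the slide rounds to 393 billion.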

Page 32: QCon SF-2015 Stream Processing in uber

Unknown Query Patterns

Any combination of dimensions

Page 33: QCon SF-2015 Stream Processing in uber

Variety of Aggregations

- Heatmap

- Top N

- Histogram

- count(), avg(), sum(), percent(), geo

Page 34: QCon SF-2015 Stream Processing in uber

Different Geo Aggregation

Page 35: QCon SF-2015 Stream Processing in uber

Large Data Volume

• Hundreds of thousands of events per second, or billions of events per day

• At least dozens of fields in each event

Page 36: QCon SF-2015 Stream Processing in uber

Tight Schedule

Page 37: QCon SF-2015 Stream Processing in uber

Key: Generalization

Page 38: QCon SF-2015 Stream Processing in uber

Data Type

• Dimensional Temporal Spatial Data

Dimension       Value
state           driver_arrived
vehicle type    uber X
timestamp       13244323342
latitude        12.23
longitude       30.00

Page 39: QCon SF-2015 Stream Processing in uber

Data Query

• OLAP on single-table temporal-spatial data

SELECT <agg functions>, <dimensions>
FROM <data_source>
WHERE <boolean filter>
GROUP BY <dimensions>
HAVING <boolean filter>
ORDER BY <sorting criteria>
LIMIT <n>
DO <post aggregation>
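To make the query template concrete, here is a small in-memory sketch of the same stages (filter, group, aggregate, having, order, limit). It is not how the talk's system evaluates queries; `run_query`, the `hexagon` field, and the `on_trip` state are hypothetical names used only for illustration.

```python
from collections import defaultdict

def run_query(events, where, group_by, agg,
              having=lambda value: True, limit=None):
    """Evaluate a single-table OLAP query over a list of flat event dicts."""
    groups = defaultdict(list)
    for event in events:
        if where(event):                                  # WHERE
            key = tuple(event[dim] for dim in group_by)   # GROUP BY
            groups[key].append(event)
    rows = [(key, agg(members)) for key, members in groups.items()]
    rows = [row for row in rows if having(row[1])]        # HAVING
    rows.sort(key=lambda row: row[1], reverse=True)       # ORDER BY agg DESC
    return rows if limit is None else rows[:limit]        # LIMIT

events = [
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": "h1"},
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": "h1"},
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": "h2"},
    {"state": "on_trip", "vehicle_type": "uberX", "hexagon": "h1"},
]

# "Top 1 hexagon by number of driver_arrived events."
top = run_query(
    events,
    where=lambda e: e["state"] == "driver_arrived",
    group_by=["hexagon"],
    agg=len,
    limit=1,
)
print(top)
```

A post-aggregation step (the DO clause) would then be an ordinary function applied to the returned rows.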

Page 40: QCon SF-2015 Stream Processing in uber

Finding the Right Storage System

Page 41: QCon SF-2015 Stream Processing in uber

Minimum Requirements

• OLAP with geospatial and time series support

• Support large amount of data

• Sub-second response time

• Query of raw data

Page 42: QCon SF-2015 Stream Processing in uber

It can’t be a KV store

Page 43: QCon SF-2015 Stream Processing in uber

Challenges to KV Store

Pre-computing all keys is O(2^n) for both space and time
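The exponential growth comes from the fact that a query may group or filter on any subset of the n dimensions, and a key-value store needs a precomputed key family for each subset. A minimal sketch that enumerates those subsets (the dimension names are illustrative):

```python
from itertools import combinations

# Illustrative dimension names; a real event carries dozens of fields.
dimensions = ["city", "hexagon", "vehicle_type", "minute", "driver_state"]

def dimension_subsets(dims):
    """Yield every subset of dimensions a query might group or filter on."""
    for size in range(len(dims) + 1):
        yield from combinations(dims, size)

subsets = list(dimension_subsets(dimensions))
print(len(subsets))  # number of key families to precompute
```

With 5 dimensions there are already 2^5 = 32 key families, and each family still has to be materialized for every combination of values within it. With dozens of fields per event the number of families alone is out of reach.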

Page 44: QCon SF-2015 Stream Processing in uber

It can’t be a relational database

Page 45: QCon SF-2015 Stream Processing in uber

Challenges to Relational DB

• Managing multiple indices is painful

• Scanning is not fast enough

Page 46: QCon SF-2015 Stream Processing in uber

A System That Supports

• Fast scan

• Arbitrary boolean queries

• Raw data

• Wide range of aggregations

Page 47: QCon SF-2015 Stream Processing in uber

Elasticsearch

Page 48: QCon SF-2015 Stream Processing in uber

Highly Efficient Inverted-Index For Boolean Query
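As a rough illustration of why an inverted index suits arbitrary boolean queries: each (field, value) pair maps to the set of matching documents, so AND and OR become set intersection and union. This toy `InvertedIndex` class is a hypothetical sketch and leaves out everything that makes a production index fast (sorted postings, compression, skipping).

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: (field, value) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, doc):
        for field, value in doc.items():
            self.postings[(field, value)].add(doc_id)

    def term(self, field, value):
        # Return a copy so callers cannot mutate the index.
        return set(self.postings.get((field, value), ()))

index = InvertedIndex()
index.add(1, {"state": "driver_arrived", "vehicle_type": "uberX"})
index.add(2, {"state": "driver_arrived", "vehicle_type": "uberBLACK"})
index.add(3, {"state": "on_trip", "vehicle_type": "uberX"})

# AND is set intersection, OR is set union.
arrived_and_x = index.term("state", "driver_arrived") & index.term("vehicle_type", "uberX")
trip_or_black = index.term("state", "on_trip") | index.term("vehicle_type", "uberBLACK")
print(arrived_and_x, trip_or_black)
```

No per-query index has to be planned in advance: any boolean combination of indexed fields reduces to set operations over existing postings.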

Page 49: QCon SF-2015 Stream Processing in uber

Built-in Distributed Query

Page 50: QCon SF-2015 Stream Processing in uber

Fast Scan with Flexible Aggregations

Page 51: QCon SF-2015 Stream Processing in uber

Storage

Page 52: QCon SF-2015 Stream Processing in uber

Are We Done?

Page 53: QCon SF-2015 Stream Processing in uber

Transformation e.g. (Lat, Long) -> (zipcode, hexagon)
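One way such a transformation can look, as a sketch only: the slides do not describe the actual hexagon scheme, so this assumes a flat plane (longitude as x, latitude as y), pointy-top hexagons, axial (q, r) cell ids, and a hypothetical `size` parameter in degrees. A real system would use a proper map projection.

```python
import math

def point_to_hexagon(lat, lon, size=0.01):
    """Map a point to the axial (q, r) id of the pointy-top hexagon containing it.

    Assumption: lat/lon are treated as planar y/x; `size` is the hexagon's
    center-to-corner distance in degrees.
    """
    x, y = lon, lat
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _hex_round(q, r)

def _hex_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon center."""
    s = -q - r                      # cube coordinates satisfy q + r + s = 0
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the component with the largest rounding error from the others.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)

print(point_to_hexagon(0.001, 0.001))   # a point close to the origin cell's center
print(point_to_hexagon(0.0, 0.01732))   # roughly the center of the cell one step east
```

Once every event carries a hexagon id, geographic aggregation becomes an ordinary group-by on that id.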

Page 54: QCon SF-2015 Stream Processing in uber

Dynamic Pricing

Page 55: QCon SF-2015 Stream Processing in uber

Trend Prediction

Page 56: QCon SF-2015 Stream Processing in uber

Supply and Demand Distribution

Page 57: QCon SF-2015 Stream Processing in uber

Technically Speaking: Clustering & Pr(D, S, E)

Page 58: QCon SF-2015 Stream Processing in uber

New Use Cases -> New Requirements

Page 59: QCon SF-2015 Stream Processing in uber

Pre-aggregation

Page 60: QCon SF-2015 Stream Processing in uber

Joining Multiple Streams

Page 61: QCon SF-2015 Stream Processing in uber

Sessionization
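Sessionization means grouping each user's events into sessions separated by periods of inactivity. A minimal batch sketch, assuming events are (user_id, timestamp) pairs and a hypothetical 30-minute gap; a streaming job would keep the last-seen timestamp per key in local state instead of sorting a finite list.

```python
def sessionize(events, gap_seconds=1800):
    """Group (user_id, timestamp) events into per-user sessions.

    A new session starts whenever the gap since the user's previous
    event exceeds `gap_seconds`.
    """
    sessions = {}
    for user_id, ts in sorted(events):
        user_sessions = sessions.setdefault(user_id, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap_seconds:
            user_sessions[-1].append(ts)    # continue the current session
        else:
            user_sessions.append([ts])      # start a new session
    return sessions

events = [("r1", 0), ("r2", 100), ("r1", 600), ("r1", 5000)]
sessions = sessionize(events)
print(sessions)
```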

Page 62: QCon SF-2015 Stream Processing in uber

Multi-Staged Processing

Page 63: QCon SF-2015 Stream Processing in uber

State Management

Page 64: QCon SF-2015 Stream Processing in uber

Apache Samza

Page 65: QCon SF-2015 Stream Processing in uber

Why Apache Samza?

Page 66: QCon SF-2015 Stream Processing in uber

DAG on Kafka

Page 67: QCon SF-2015 Stream Processing in uber

Excellent Integration with Kafka

Page 68: QCon SF-2015 Stream Processing in uber

Excellent Integration with Kafka

Page 69: QCon SF-2015 Stream Processing in uber

Built-in Checkpointing

Page 70: QCon SF-2015 Stream Processing in uber

Built-in State Management

Page 71: QCon SF-2015 Stream Processing in uber

Processing Storage

Page 72: QCon SF-2015 Stream Processing in uber

What If Storage Is Down?

Page 73: QCon SF-2015 Stream Processing in uber

What If Processing Takes Long?

Page 74: QCon SF-2015 Stream Processing in uber

Processing Storage

Page 75: QCon SF-2015 Stream Processing in uber

Are We Done?

Page 76: QCon SF-2015 Stream Processing in uber
Page 77: QCon SF-2015 Stream Processing in uber
Page 78: QCon SF-2015 Stream Processing in uber

Post Processing

Page 79: QCon SF-2015 Stream Processing in uber

Results Transformation and Smoothing

Page 80: QCon SF-2015 Stream Processing in uber

Scale of Post Processing

10,000 hexagons in a city

Page 81: QCon SF-2015 Stream Processing in uber

Scale of Post Processing

331 neighboring hexagons to look at

Page 82: QCon SF-2015 Stream Processing in uber

Scale of Post Processing

331 x 10,000 = 3.3 Million Hexagons to Process for a Single Query
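The slides do not say where 331 comes from. One reading consistent with the number is "every hexagon within 10 rings of a given cell, including the cell itself", since a hexagonal neighborhood of radius k contains 1 + 3k(k+1) cells. A sketch under that assumption, using axial (q, r) coordinates:

```python
def hexagons_within(radius):
    """Return axial (q, r) offsets of all hexagons within `radius` rings."""
    cells = []
    for q in range(-radius, radius + 1):
        r_min = max(-radius, -q - radius)
        r_max = min(radius, -q + radius)
        for r in range(r_min, r_max + 1):
            cells.append((q, r))
    return cells

neighborhood = hexagons_within(10)
work_per_query = len(neighborhood) * 10_000  # neighbors x hexagons in a city
print(len(neighborhood), f"{work_per_query:,}")
```

Under this reading, smoothing one city means visiting 3,310,000 (hexagon, neighbor) pairs per query, which is why the 70 ms figure on the next slide is notable.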

Page 83: QCon SF-2015 Stream Processing in uber

Scale of Post Processing

99%-ile Processing Time: 70ms

Page 84: QCon SF-2015 Stream Processing in uber

Post Processing

• Each processor is a pure function

• Processors can be composed by combinators
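A minimal sketch of what "pure functions composed by combinators" can mean in practice. The processors `scale` and `clamp` and the `pipe` combinator are hypothetical examples, not the processors from the talk; the point is that each one takes a result grid and returns a new one without mutating its input.

```python
from functools import reduce
from typing import Callable, Dict

Grid = Dict[str, float]              # hexagon id -> aggregated value
Processor = Callable[[Grid], Grid]

def pipe(*processors: Processor) -> Processor:
    """Combinator: run processors left to right, feeding each output onward."""
    return lambda grid: reduce(lambda acc, proc: proc(acc), processors, grid)

def scale(factor: float) -> Processor:
    return lambda grid: {k: v * factor for k, v in grid.items()}

def clamp(lo: float, hi: float) -> Processor:
    return lambda grid: {k: min(hi, max(lo, v)) for k, v in grid.items()}

post_process = pipe(scale(2.0), clamp(0.0, 5.0))

raw = {"h1": 1.0, "h2": 4.0}
result = post_process(raw)
print(result)
```

Because no processor touches shared state, independent parts of the grid can be processed in parallel and stages can be pipelined without locking.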

Page 85: QCon SF-2015 Stream Processing in uber

Post Processing

• Highly parallelized execution

• Pipelining

Page 86: QCon SF-2015 Stream Processing in uber

Post Processing

• Each processor is a pure function

• Processors can be composed by combinators

• Highly parallelized execution

Page 87: QCon SF-2015 Stream Processing in uber

Practical Considerations

Page 88: QCon SF-2015 Stream Processing in uber

Data Discovery

Page 89: QCon SF-2015 Stream Processing in uber

Elasticsearch Query Can Be Complex

Page 90: QCon SF-2015 Stream Processing in uber

/driverAcceptanceRate?
  geo_dist(10, [37, 22])&
  time_range(2015-02-04,2015-03-06)&
  aggregate(timeseries(7d))&
  eq(msg.driverId,1)
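For comparison, here is roughly what the compact query above might expand to as an Elasticsearch request body. This is a guess at the mapping, not the talk's actual translation. The field names `location` and `timestamp`, the kilometer unit, the reading of [37, 22] as (lat, lon), and the aggregation name are all assumptions; `fixed_interval` is the parameter name in recent Elasticsearch versions, while older releases used `interval`.

```python
import json

# Hypothetical expansion of the compact query into Elasticsearch query DSL.
es_query = {
    "size": 0,  # only aggregations are needed, not individual hits
    "query": {
        "bool": {
            "filter": [
                # geo_dist(10, [37, 22]); unit and lat/lon order are assumed
                {"geo_distance": {"distance": "10km",
                                  "location": {"lat": 37, "lon": 22}}},
                # time_range(2015-02-04, 2015-03-06)
                {"range": {"timestamp": {"gte": "2015-02-04",
                                         "lte": "2015-03-06"}}},
                # eq(msg.driverId, 1)
                {"term": {"msg.driverId": 1}},
            ]
        }
    },
    "aggs": {
        # aggregate(timeseries(7d))
        "per_week": {
            "date_histogram": {"field": "timestamp", "fixed_interval": "7d"}
        }
    },
}

print(json.dumps(es_query, indent=2))
```

Even for four clauses the native form is several levels deep, which is the motivation for putting a simpler query layer in front of it.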

Page 91: QCon SF-2015 Stream Processing in uber

Elasticsearch Query Can Be Optimized

• Pipelining

• Validation

• Throttling

Page 92: QCon SF-2015 Stream Processing in uber

[Chart; axis label: Time in seconds]

Page 93: QCon SF-2015 Stream Processing in uber

Elasticsearch Can Be Replaced

Page 94: QCon SF-2015 Stream Processing in uber

Storage Query Processing

Page 95: QCon SF-2015 Stream Processing in uber

There’s one more thing

Page 96: QCon SF-2015 Stream Processing in uber

There are always patterns in streams

Page 97: QCon SF-2015 Stream Processing in uber

There is always a need for quick exploration

Page 98: QCon SF-2015 Stream Processing in uber

How many drivers cancel a request 10 times in a row within a 5-minute window?

Page 99: QCon SF-2015 Stream Processing in uber

Which riders request pickups from locations 100 miles apart within a half-hour window?

Page 100: QCon SF-2015 Stream Processing in uber
Page 101: QCon SF-2015 Stream Processing in uber

Complex Event Processing

FROM driver_canceled#window.time(10 min)
SELECT clientUUID, count(clientUUID) as cancelCount
GROUP BY clientUUID
HAVING cancelCount > 10
INSERT INTO hipchat(room);
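To show what the query above asks the engine to do, here is a hand-written sketch of the same rule: a per-key sliding time window with a count threshold. `SlidingWindowCounter` is a hypothetical class, it assumes timestamps arrive in order for each key, and it returns a flag instead of posting to a chat room.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Flag keys whose event count within a time window exceeds a threshold."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = defaultdict(deque)   # key -> timestamps still in window

    def on_event(self, key, ts):
        window = self.events[key]
        window.append(ts)
        # Evict timestamps that have fallen out of the window.
        while window and window[0] <= ts - self.window_seconds:
            window.popleft()
        return len(window) > self.threshold   # HAVING cancelCount > threshold

# 10-minute window, alert on more than 10 cancellations, as in the query.
counter = SlidingWindowCounter(window_seconds=600, threshold=10)
alerts = [counter.on_event("client-1", ts) for ts in range(11)]
print(alerts)
```

The declarative query expresses the same logic in five lines and leaves windowing, grouping, and state to the engine.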

Page 102: QCon SF-2015 Stream Processing in uber

Implementation Becomes Easy

Page 103: QCon SF-2015 Stream Processing in uber

Thank You!