shortening the feedback loop


Upload: josh-baer

Post on 08-Jan-2017



Page 1: Shortening the feedback loop

Shortening the Feedback Loop How Spotify’s Big Data Ecosystem Has Evolved to Leverage Actionable Insights

Josh Baer ([email protected])

Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify

Page 2: Shortening the feedback loop

Who am I?

• Technical Product Owner at Spotify

• Working with fast processing infrastructure

• Previously built out Spotify’s 2,500-node Hadoop cluster

@l_phant

Page 3: Shortening the feedback loop

In 2008

• Spotify Launches

• Access to a gigantic catalog of music

• Click to play, instantaneously!

Page 4: Shortening the feedback loop

Behind the Scenes: Days to Insights

Page 5: Shortening the feedback loop

Behind the Scenes

Page 6: Shortening the feedback loop

Behind the Scenes

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

DAYS TO INSIGHTS

Page 7: Shortening the feedback loop

“Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

Page 8: Shortening the feedback loop

Real-time Processing

Batch Processing (Hadoop, Hive, BigQuery)

“Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

Page 9: Shortening the feedback loop

To leverage actionable insights, we need a faster feedback loop!

Page 10: Shortening the feedback loop

What is Spotify?

• Music Streaming Service

• Launched in 2008

• Premium and Free Tiers

• Available in 60 Countries

Page 11: Shortening the feedback loop

Over 100 Million Active Users

Page 12: Shortening the feedback loop

Over 30 Million Songs

Page 13: Shortening the feedback loop

Over 1 Billion Plays Per Day

Page 14: Shortening the feedback loop

And we have Data

Page 15: Shortening the feedback loop

Hadoop at Spotify

• ~2,500 Nodes

• >100 PB Capacity

• >100 TB Memory accessible by jobs

• 20K Jobs/Day

Page 16: Shortening the feedback loop

Apache Kafka at Spotify

• 500 Kafka-related machines

• 40 TB/day from logs

Page 17: Shortening the feedback loop

Real-Time at Spotify

• Storm Topologies fed via Kafka

• Mostly used for hack ideas or proofs of concept

Page 18: Shortening the feedback loop

Migrating to the Cloud

Page 19: Shortening the feedback loop

In the Beginning…

• Spotify was almost completely on-premises/bare metal

• Grew to a 2,500-node Hadoop cluster and over 10K total machines in production across four globally distributed data centers

• “Flirted” with cloud providers at various times

Page 20: Shortening the feedback loop

In 2014

• Maybe we should try this cloud thing for real

Page 21: Shortening the feedback loop

Why Move to the Cloud?

• Cloud providers have matured: costs have decreased while reliability and the variety of services offered have increased

• Owning and operating physical machines is not a competitive advantage for Spotify

Page 22: Shortening the feedback loop

Why Google’s Cloud?

• We believe Google’s industry-leading background in Big Data technologies will give us a data processing advantage

Page 23: Shortening the feedback loop

Google Cloud Data Building Blocks

Page 24: Shortening the feedback loop

BigQuery

• Ad-hoc and interactive querying service for massive datasets

• Like Hive, but without needing to manage Hadoop and servers

• Leverages Google’s internal tech

• Dremel (query execution engine)

• Colossus (distributed storage)

• Borg (distributed compute)

• Jupiter (network)

Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood

Page 25: Shortening the feedback loop

BigQuery vs. Hive

• Example Queries:

• What are the top 10 songs by popularity in Spain during October 2016?

• How many hours did users in Spain spend listening to Spotify during October?
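A minimal sketch of what the two example queries compute, written as plain Python over an in-memory list rather than as Hive or BigQuery SQL. The record layout (country, month, track, seconds played) and the function names are assumptions for illustration, and "popularity" is assumed here to mean play count; the slides do not define it.

```python
from collections import Counter

# Hypothetical play records: (country, month, track, seconds_played).
# This layout is an assumption; the real event schema is not shown in the slides.

def top_tracks(plays, country, month, n=10):
    """Return the n most-played tracks for one country and month as (track, plays)."""
    counts = Counter(
        track for c, m, track, _ in plays if c == country and m == month
    )
    return counts.most_common(n)

def listening_hours(plays, country, month):
    """Return total listening time in hours for one country and month."""
    seconds = sum(s for c, m, _, s in plays if c == country and m == month)
    return seconds / 3600
```

Both functions scan every record and filter on country and month, which is the same shape of work the batch engines perform at a much larger scale.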

Page 26: Shortening the feedback loop

BigQuery vs. Hive

• What are the top 10 songs by popularity in Spain during October 2016?

• Hive

• 2647s (44min, 7sec)

• 15.5 TB processed

• BigQuery

• 108s (1min, 48sec)

• 1.50 TB processed

Note: Hive was not tuned for this comparison: version 0.14, Avro input format, run on a ~2,500-node YARN cluster. This should not be considered a thorough benchmark.

Page 27: Shortening the feedback loop

Top 10 Tracks in Spain during October 2016

Rank  Artist(s)                                  Track Name
1     J Balvin                                   Safari
2     DJ Snake                                   Let Me Love You
3     Ricky Martin                               Vente Pa' Ca
4     Sebastian Yatra                            Traicionera
5     Zion & Lennox (feat. J Balvin)             Otra Vez
6     Carlos Vives, Shakira                      La Bicicleta
7     The Chainsmokers                           Closer
8     Major Lazer (feat. Justin Bieber & MØ)     Cold Water
9     Sia                                        The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)  Ay Mi Dios

Page 28: Shortening the feedback loop

BigQuery vs. Hive

• How much time did users in Spain spend listening to Spotify during October?

• Hive

• 969s (16min, 9 sec)

• 15.5 TB processed

• BigQuery

• 33s

• 780 GB processed

Note: Hive was not tuned for this comparison: version 0.14, Avro input format, run on a ~2,500-node YARN cluster. This should not be considered a thorough benchmark.
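The ratios implied by the two comparisons can be worked out directly from the figures on the slides. This sketch assumes decimal units (780 GB treated as 0.78 TB); since the comparison is explicitly not a thorough benchmark, the ratios are only indicative.

```python
# Figures copied from the two "BigQuery vs. Hive" slides.
# Assumption: decimal units, so 780 GB is treated as 0.78 TB.
COMPARISONS = {
    "top 10 songs": {"hive_s": 2647, "bq_s": 108, "hive_tb": 15.5, "bq_tb": 1.50},
    "listening time": {"hive_s": 969, "bq_s": 33, "hive_tb": 15.5, "bq_tb": 0.78},
}

def ratios(row):
    """Return (runtime ratio, data-processed ratio) of Hive relative to BigQuery."""
    return row["hive_s"] / row["bq_s"], row["hive_tb"] / row["bq_tb"]

for name, row in COMPARISONS.items():
    speedup, scan_ratio = ratios(row)
    print(f"{name}: {speedup:.1f}x faster, {scan_ratio:.1f}x less data processed")
```

On these numbers BigQuery finished about 24.5x and 29.4x faster while processing about 10.3x and 19.9x less data.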

Page 29: Shortening the feedback loop

Nearly 10,000 Years!

Page 30: Shortening the feedback loop

BigQuery at Spotify

• Interactive and ad-hoc querying immediately started to transfer to BQ once the data was available on the cloud

• Pace of learning increases as friction to question decreases

Page 31: Shortening the feedback loop

Cloud Pub/Sub

• At-least-once, globally distributed message queue

• For high-volume publish-subscribe workloads with a low number of topics (<10,000)

• Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
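At-least-once delivery means a subscriber may see the same message more than once, so consumers are usually written to be idempotent. A minimal sketch of that idea follows; it does not use the Cloud Pub/Sub client library, and the class name, the `apply` callback, and the in-memory set of seen IDs are all assumptions for illustration.

```python
class IdempotentHandler:
    """Applies each message at most once, keyed by message ID (hypothetical helper)."""

    def __init__(self, apply):
        self._apply = apply   # side-effecting callback supplied by the caller
        self._seen = set()    # unbounded here; a real consumer would expire old IDs

    def handle(self, message_id, payload):
        """Return True if the payload was applied, False if it was a redelivery."""
        if message_id in self._seen:
            return False
        self._apply(payload)
        # Record the ID only after the callback succeeds, so a failure is retried.
        self._seen.add(message_id)
        return True


applied = []
handler = IdempotentHandler(applied.append)
handler.handle("m1", "play")
handler.handle("m1", "play")  # redelivery of the same message is ignored
handler.handle("m2", "skip")
```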

Page 32: Shortening the feedback loop

Cloud Pub/Sub at Spotify

• 800K events/second? No problem

• P99 latency of ingestion into ES: 500 ms

• Ingestion from globally distributed non-GCP datacenters is painless
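A P99 figure is the latency that 99% of samples fall at or below. A minimal sketch using the nearest-rank method, which is one of several percentile definitions and not necessarily the one behind the number on the slide; the sample latencies are made up.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p99 = percentile(latencies_ms, 99)
```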

Page 33: Shortening the feedback loop

Cloud Dataflow

• Managed Service for running batch and streaming jobs

• Unified API for batch and streaming mode

• Inspired by internal Google tools like FlumeJava and MillWheel

• Programming model open-sourced as Apache Beam (currently incubating)

Page 34: Shortening the feedback loop

Cloud Dataflow (Batch) at Spotify

• Usually run via Scio: https://github.com/spotify/scio

• Scio provides a Scala API for running Dataflow jobs and integrates easily with BigQuery

• New batch processing jobs at Spotify are being written in Scio/Dataflow

Page 35: Shortening the feedback loop

Cloud Dataflow (Streaming) at Spotify

• Exactly-once stream processing framework

• A replacement for Spark/Flink streaming and Storm workloads at Spotify

• Optimizes for consistency, which can complicate real-time workloads

Page 36: Shortening the feedback loop
Page 37: Shortening the feedback loop

Spotify + Google Cloud Timeline

2015 2016

Beginning of Google Cloud evaluation

BigQuery begins to replace Hive

Cloud Pub/Sub begins to replace Kafka

Dataflow (streaming) begins to replace Storm

Dataflow (batch) replacing MapReduce

Note: Dates are approximations

Page 38: Shortening the feedback loop

Putting It All Together

Page 39: Shortening the feedback loop

The Problem

• We want to detect within minutes if we’ve introduced a bug in a client release that affects important event logging behavior
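One simple way to frame this kind of check is to compare the latest minute's event count against a baseline from the preceding minutes. The sketch below is an illustration only: the median baseline, the 50% threshold, and the function name are assumptions, not the detection logic used at Spotify.

```python
from statistics import median

def logging_drop(counts_per_minute, threshold=0.5):
    """Return True if the latest minute fell below threshold * median of earlier minutes.

    counts_per_minute: event counts for consecutive minutes, oldest first.
    """
    if len(counts_per_minute) < 2:
        return False  # not enough history to form a baseline
    *history, latest = counts_per_minute
    baseline = median(history)
    return baseline > 0 and latest < threshold * baseline


# Made-up counts for one event type from one client release.
healthy = [100, 110, 90, 105, 95]
broken = [100, 110, 90, 105, 20]
```

Running a check like this per event type and client version, as each minute closes, is what turns a bug in a new release into an alert within minutes rather than days.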

Page 40: Shortening the feedback loop

Before…

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

DAYS TO INSIGHTS

Page 41: Shortening the feedback loop

Getting Data from Clients to Pub/Sub

• Built Pulsar, a simple service aggregating data from Access Points and feeding it into Cloud Pub/Sub

• Replaces the Kafka real-time event feed

Page 42: Shortening the feedback loop

Pulsar

Page 43: Shortening the feedback loop

Dataflow

• Subscribes to important event Pub/Sub topics

• Aggregates events into one-minute windows

• Always running, no need to schedule or wait for results
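The windowing step can be sketched as fixed (tumbling) one-minute windows keyed by event type. This is plain Python over a finite list, not Dataflow or Scio code: a real streaming job also has to deal with unbounded input, out-of-order events, and late data. The `(timestamp, event_type)` tuple layout is an assumption.

```python
from collections import Counter

WINDOW_SECONDS = 60

def window_start(timestamp):
    """Floor a Unix timestamp (whole seconds) to the start of its one-minute window."""
    return timestamp - timestamp % WINDOW_SECONDS

def count_per_minute(events):
    """Count events per (window start, event type).

    events: iterable of (timestamp_seconds, event_type) tuples, in any order.
    """
    counts = Counter()
    for timestamp, event_type in events:
        counts[(window_start(timestamp), event_type)] += 1
    return dict(counts)


# Made-up events: two plays in the first minute, one play and one skip in the second.
events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip")]
per_minute = count_per_minute(events)
```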

Page 44: Shortening the feedback loop

BigQuery

• Receives aggregates from Dataflow

• Allows for ad-hoc inspection or slicing on different dimensions
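Slicing on different dimensions amounts to re-grouping the per-minute aggregates by whichever column is of interest. A minimal in-memory sketch; the row fields (`client_version`, `platform`, `count`) are hypothetical and stand in for whatever dimensions the aggregate table actually carries.

```python
from collections import defaultdict

def slice_by(rows, dimension):
    """Sum the 'count' field of aggregate rows, grouped by one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["count"]
    return dict(totals)


# Made-up aggregate rows, one per (minute, client_version, platform).
rows = [
    {"minute": 0, "client_version": "1.0", "platform": "ios", "count": 120},
    {"minute": 0, "client_version": "1.1", "platform": "ios", "count": 4},
    {"minute": 0, "client_version": "1.1", "platform": "android", "count": 90},
]
by_version = slice_by(rows, "client_version")
by_platform = slice_by(rows, "platform")
```

In this made-up data, slicing by platform shows the low count is confined to one platform of version 1.1, which is the kind of narrowing-down ad-hoc inspection is meant to support.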

Page 45: Shortening the feedback loop

Tableau

• Data Visualization Tool that integrates nicely with BigQuery

• Pulls data from BigQuery periodically and caches for quick inspection

Page 46: Shortening the feedback loop
Page 47: Shortening the feedback loop

Milliseconds to transfer

Milliseconds to process

Seconds to Query

SECONDS TO INSIGHTS

Page 48: Shortening the feedback loop
Page 49: Shortening the feedback loop

Faster Insights to Client Behavior

Page 50: Shortening the feedback loop

Problem

As a developer, I want to be able to instantly explore data being logged by the clients.

Page 51: Shortening the feedback loop

Solution

• Produce a topic for all employee client events

• Store in Elasticsearch

• Visualize in Kibana

Page 52: Shortening the feedback loop
Page 53: Shortening the feedback loop
Page 54: Shortening the feedback loop

Benefits

• Able to understand what’s being sent by the client as it happens

• Exploring events, visualizing distribution (e.g., does this field actually get populated?)

• Prototyping analysis based on a sample

• Dashboards for Employee Releases
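The "does this field actually get populated" question reduces to a population rate over a sample of events. A minimal sketch over plain dictionaries rather than Elasticsearch queries; the field names and the rule that `None` and the empty string count as unpopulated are assumptions.

```python
def population_rate(events, field):
    """Return the fraction of events in which `field` is present and non-empty."""
    if not events:
        return 0.0
    populated = sum(1 for event in events if event.get(field) not in (None, ""))
    return populated / len(events)


# Made-up sample of client events.
sample = [
    {"event": "play", "track_id": "t1", "device": "phone"},
    {"event": "play", "track_id": "t2", "device": ""},
    {"event": "play", "track_id": "t3"},
    {"event": "play", "track_id": "t4", "device": "desktop"},
]
device_rate = population_rate(sample, "device")
```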

Page 55: Shortening the feedback loop

Other Uses

Page 56: Shortening the feedback loop

Ad Targeting

• Real-time genre targeting

• Session insights — explicit filter

Page 57: Shortening the feedback loop

Real-time Recommendations

Page 58: Shortening the feedback loop

Live Results for X-Factor

• X-Factor: music competition

• Songs available on Spotify immediately after show airs

• Listener behavior determines the order of contestants on the playlist
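A minimal sketch of ordering a playlist by listener behavior. The slides do not say which signal was used, so this assumes a simple play count per contestant, with ties broken alphabetically; the names and counts are made up.

```python
def playlist_order(plays_by_contestant):
    """Return contestant names sorted by play count, highest first (ties by name)."""
    ranked = sorted(plays_by_contestant.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked]


# Made-up play counts collected after the show airs.
plays = {"Contestant A": 1200, "Contestant B": 3400, "Contestant C": 1200}
order = playlist_order(plays)
```

Recomputing the order as new play counts arrive is what lets the playlist track listener behavior live.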

Page 59: Shortening the feedback loop

Review

Page 60: Shortening the feedback loop

Real-time Processing

Batch Processing (Hadoop, Hive, BigQuery)

“Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

Page 61: Shortening the feedback loop

Behind the Scenes

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

DAYS TO INSIGHTS

Page 62: Shortening the feedback loop

To leverage actionable insights, we need a faster feedback loop!

Page 63: Shortening the feedback loop

Putting It All Together

Milliseconds to transfer

Milliseconds to process

Seconds to Query

SECONDS TO INSIGHTS

Page 64: Shortening the feedback loop

The Value of a Fast Feedback Loop

• Detecting data problems early avoids long backfills or long-term data loss

• Instant insights on newly developed features allow teams to iterate more quickly and take risks

• Providing a faster ad-hoc querying engine allows teams to ask more questions and learn faster

Page 65: Shortening the feedback loop

Use Anything and Everything

• Spotify has leveraged Google Cloud tools, such as Pub/Sub, Dataflow and BigQuery

• Open source and other cloud providers offer many alternatives to this stack

• Open source tools (Elasticsearch/Kibana) and proprietary solutions (Tableau) have also been useful additions

Page 66: Shortening the feedback loop

Where Are We Going?

• The real-time mission is in the early stages at Spotify

Page 67: Shortening the feedback loop

Stream Processing First

• The sun never sets on Spotify, so why impose boundaries on our datasets?

• What’s the shortest distance between two points? Zero!

• Can we reduce the feedback cycle to zero?

Page 68: Shortening the feedback loop

We’re Hiring!

Engineers, Managers, Product Owners needed in NYC and Stockholm

https://www.spotify.com/jobs

Page 69: Shortening the feedback loop

Thanks! BigDataSpain!

Page 70: Shortening the feedback loop