shortening the feedback loop

Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem Has Evolved to Leverage Actionable Insights

Josh Baer ([email protected])

Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify

Who am I?

• Technical Product Owner at Spotify

• Working with fast processing infrastructure

• Previously built out Spotify’s 2,500-node Hadoop cluster

@l_phant

In 2008

• Spotify Launches

• Access to a gigantic catalog of music

• Click to play instantaneous!

Behind the Scenes: Days to Insights

Behind the Scenes

Minutes to transfer

Hours to Clean and Bucket

Hours to Run Jobs or Ad Hoc Queries

DAYS TO INSIGHTS

“Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009

Real-time Processing

Batch Processing (Hadoop, Hive, BigQuery)


To leverage actionable insights, we need a faster feedback loop!

What is Spotify?

• Music Streaming Service

• Launched in 2008

• Premium and Free Tiers

• Available in 60 Countries

Over 100 Million Active Users

Over 30 Million Songs

Over 1 Billion Plays Per Day

And we have Data

Hadoop at Spotify

• ~2,500 Nodes

• >100 PB Capacity

• >100 TB Memory accessible by jobs

• 20K Jobs/Day

Apache Kafka at Spotify

• 500 Kafka-related machines

• 40 TB/day from logs

Real-Time at Spotify

• Storm Topologies fed via Kafka

• Mostly used for hack ideas or proofs of concept

Migrating to the Cloud

In the Beginning…

• Spotify was almost completely on-premise/bare metal

• Grew to a 2,500-node Hadoop cluster and over 10K total machines in production at four globally distributed data centers

• “Flirted” with cloud providers at various times

In 2014

• Maybe we should try this cloud thing for real

Why Move to the Cloud?

• Cloud providers have matured: costs have decreased while reliability and the variety of services offered have increased

• Owning and operating physical machines is not a competitive advantage for Spotify

Why Google’s Cloud?

• We believe Google’s industry-leading background in Big Data technologies will give us a data processing advantage

Google Cloud Data Building Blocks

BigQuery

• Ad-hoc and interactive querying service for massive datasets

• Like Hive, but without needing to manage Hadoop and servers

• Leverages Google’s internal tech

• Dremel (query execution engine)

• Colossus (distributed storage)

• Borg (distributed compute)

• Jupiter (network)

Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood

http://research.google.com/pubs/pub36632.html

http://static.googleusercontent.com/media/research.google.com/en//university/relations/facultysummit2010/storage_architecture_and_challenges.pdf


http://googlecloudplatform.blogspot.com/2015/06/A-Look-Inside-Googles-Data-Center-Networks.html


BigQuery vs. Hive

• Example Queries:

• What are the top 10 songs by popularity in Spain during October 2016?

• How many hours did users in Spain spend listening to Spotify during October?
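As a rough illustration of what these two questions look like as SQL, the sketch below runs them against a tiny in-memory SQLite table. SQLite only stands in for Hive or BigQuery here, and the `plays` table with its `track`, `country`, `played_at` and `ms_played` columns is a hypothetical schema, not Spotify's actual data model.

```python
import sqlite3

# Hypothetical schema: one row per play event. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (track TEXT, country TEXT, played_at TEXT, ms_played INTEGER)"
)
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?, ?)",
    [
        ("Safari", "ES", "2016-10-01", 200000),
        ("Safari", "ES", "2016-10-02", 180000),
        ("Closer", "ES", "2016-10-03", 240000),
        ("Closer", "SE", "2016-10-03", 240000),  # other country, filtered out
        ("Safari", "ES", "2016-11-01", 200000),  # other month, filtered out
    ],
)

# Question 1: top 10 tracks by number of plays in one country and month.
TOP_TRACKS = """
    SELECT track, COUNT(*) AS play_count
    FROM plays
    WHERE country = ? AND played_at >= ? AND played_at < ?
    GROUP BY track
    ORDER BY play_count DESC, track
    LIMIT 10
"""

# Question 2: total listening time, converted from milliseconds to hours.
LISTENING_HOURS = """
    SELECT SUM(ms_played) / 3600000.0
    FROM plays
    WHERE country = ? AND played_at >= ? AND played_at < ?
"""

params = ("ES", "2016-10-01", "2016-11-01")
top_tracks = conn.execute(TOP_TRACKS, params).fetchall()
listening_hours = conn.execute(LISTENING_HOURS, params).fetchone()[0]

print(top_tracks)
print(listening_hours)
```

Both queries are a filter plus a single aggregation, which is the kind of ad-hoc question that benefits most from an interactive engine.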

BigQuery vs. Hive

• What are the top 10 songs by popularity in Spain during October 2016?

• Hive

• 2647s (44min, 7sec)

• 15.5 TB processed

• BigQuery

• 108s (1min, 48sec)


Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark.

Top 10 Tracks in Spain during October 2016

Rank  Artist(s)                                     Track Name
1     J Balvin                                      Safari
2     DJ Snake                                      Let Me Love You
3     Ricky Martin                                  Vente Pa' Ca
4     Sebastian Yatra                               Traicionera
5     Zion & Lennox (feat. J Balvin)                Otra Vez
6     Carlos Vives, Shakira                         La Bicicleta
7     The Chainsmokers                              Closer
8     Major Lazer (feat. Justin Bieber & MØ)        Cold Water
9     Sia                                           The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)     Ay Mi Dios

BigQuery vs. Hive

• How much time did users in Spain spend listening to Spotify during October?

• Hive

• 969s (16min, 9 sec)


• BigQuery

• 33s

• 780 GB processed

Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark.

Nearly 10,000 Years!
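To put the timings quoted on the two comparison slides side by side, the short calculation below derives the speed-up factor for each query. It uses only the wall-clock numbers from the slides; since the Hive runs were unoptimized, the ratios are indicative rather than a benchmark result.

```python
# Wall-clock timings quoted on the slides, in seconds.
timings = {
    "top 10 tracks": {"hive": 2647, "bigquery": 108},
    "listening time": {"hive": 969, "bigquery": 33},
}

# Speed-up = Hive time divided by BigQuery time for the same question.
speedups = {
    query: seconds["hive"] / seconds["bigquery"]
    for query, seconds in timings.items()
}

for query, factor in speedups.items():
    print(f"{query}: BigQuery finished about {factor:.1f}x sooner")
```

On these numbers BigQuery answered roughly 25 to 29 times faster, which turns a coffee-break wait into an interactive one.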

BigQuery at Spotify

• Interactive and ad-hoc querying started moving to BigQuery as soon as the data was available in the cloud

• The pace of learning increases as the friction of asking questions decreases

Cloud Pub/Sub

• At-least-once, globally distributed message queue

• For high-volume publish-subscribe workloads with a low number of topics (<10,000)

• Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
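At-least-once delivery means a subscriber may occasionally see the same message twice, so consumers are usually written to be idempotent. The sketch below shows one common approach, deduplicating on a message ID. It is a minimal in-memory illustration, not the Pub/Sub client API: the `Message` and `IdempotentCounter` names are hypothetical, and a real consumer would bound or expire the set of seen IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical message shape: a unique ID plus a payload."""
    message_id: str
    payload: str


class IdempotentCounter:
    """Counts payloads while ignoring redelivered messages."""

    def __init__(self):
        self.seen_ids = set()  # unbounded here; expire old IDs in practice
        self.counts = {}

    def handle(self, msg):
        """Process a message once; return False if it was a duplicate."""
        if msg.message_id in self.seen_ids:
            return False
        self.seen_ids.add(msg.message_id)
        self.counts[msg.payload] = self.counts.get(msg.payload, 0) + 1
        return True


# Simulated delivery stream in which message "m1" is redelivered.
deliveries = [
    Message("m1", "play"),
    Message("m2", "skip"),
    Message("m1", "play"),
]

consumer = IdempotentCounter()
results = [consumer.handle(m) for m in deliveries]
print(results, consumer.counts)
```

The redelivered message is acknowledged but does not change the counts, so downstream aggregates stay correct despite duplicates.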

Cloud Pub/Sub at Spotify

• 800K events/second? No problem

• P99 Latency of ingestions into ES: 500ms

• Ingestion from globally distributed non-GCP datacenters is painless

Cloud Dataflow

• Managed service for running batch and streaming jobs

• Unified API for batch and streaming mode

• Inspired by internal Google tools like FlumeJava and MillWheel

• Programming model open-sourced as Apache Beam (currently incubating)
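The idea behind a unified batch and streaming API is that the pipeline logic is written once and applied to either a bounded or an unbounded source. The sketch below illustrates that idea with plain Python iterables. It is not the Dataflow or Apache Beam API, and the `kind:track` event format is an assumption made for the example.

```python
from itertools import count, islice


def parse_and_filter(events):
    """Pipeline logic written once: keep play events, emit the track name."""
    for event in events:
        kind, _, track = event.partition(":")
        if kind == "play":
            yield track


# Batch mode: a bounded input that is processed to completion.
batch_input = ["play:Safari", "skip:Closer", "play:Closer"]
batch_output = list(parse_and_filter(batch_input))


def endless_events():
    """Streaming mode: an unbounded source that never finishes."""
    for i in count():
        yield f"play:track-{i}" if i % 2 == 0 else f"skip:track-{i}"


# Results are consumed as they arrive; here we take just the first three.
stream_output = list(islice(parse_and_filter(endless_events()), 3))

print(batch_output)
print(stream_output)
```

The same `parse_and_filter` function serves both modes; only the source and the way results are consumed differ.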



Cloud Dataflow (Batch) at Spotify

• Usually run via Scio: https://github.com/spotify/scio

• Scio provides a Scala API for running Dataflow jobs and integrates easily with BigQuery

• New batch processing jobs at Spotify are being written in Scio/Dataflow

Cloud Dataflow (Streaming) at Spotify

• Exactly-once stream processing framework

• A replacement for Spark/Flink streaming and Storm workloads at Spotify

• Optimizes for consistency, which can complicate real-time workloads

Spotify + Google Cloud Timeline

2015 2016

Beginning of Google Cloud evaluation

BigQuery begins to replace Hive

Cloud Pub/Sub begins to replace Kafka

Dataflow (streaming) begins to replace Storm

Dataflow (batch) replacing Map/Reduce

Note: Dates are approximations

Putting It All Together

The Problem

• We want to detect within minutes if we’ve introduced a bug in a client release that affects important event logging behavior

Before…

Minutes to transfer




Getting Data from Clients to Pub/Sub

• Built Pulsar, a simple service aggregating data from Access Points and feeding it into Cloud Pub/Sub

• Replaces the Kafka real-time event feed

Pulsar

Dataflow

• Subscribes to important event Pub/Sub topics

• Aggregates events into one-minute windows

• Always running, no need to schedule or wait for results
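Aggregating into minute windows can be pictured as assigning each event to the start of its minute and counting per window. The sketch below is a plain-Python illustration of fixed (tumbling) windows, assuming events arrive as `(epoch_seconds, event_type)` pairs. A real streaming engine also deals with late and out-of-order data, which this example ignores.

```python
from collections import Counter


def window_start(epoch_seconds, size=60):
    """Return the start of the fixed window that contains the timestamp."""
    return epoch_seconds - epoch_seconds % size


def count_per_minute(events):
    """Count events per (window start, event type).

    `events` is an iterable of (epoch_seconds, event_type) pairs; this
    input shape is an assumption made for the example.
    """
    counts = Counter()
    for timestamp, event_type in events:
        counts[(window_start(timestamp), event_type)] += 1
    return counts


events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip"), (125, "play")]
per_minute = count_per_minute(events)

for (start, event_type), n in sorted(per_minute.items()):
    print(start, event_type, n)
```

Each output row is a small aggregate, which is what makes it cheap to store and query downstream.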

BigQuery

• Receives aggregates from Dataflow

• Allows for ad-hoc inspection or slicing on different dimensions
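Once per-minute aggregates are available, the original problem (noticing within minutes that a client release broke event logging) reduces to spotting a sudden drop in a count series. The slides do not say how this check is done, so the sketch below is only one possible approach: compare each minute against the mean of the preceding minutes. The function name, window length and threshold are all assumptions.

```python
from statistics import mean


def find_logging_drops(per_minute_counts, baseline_minutes=5, threshold=0.5):
    """Return indices of minutes whose count fell below
    `threshold` times the mean of the preceding `baseline_minutes`."""
    alerts = []
    for i in range(baseline_minutes, len(per_minute_counts)):
        baseline = mean(per_minute_counts[i - baseline_minutes:i])
        if baseline > 0 and per_minute_counts[i] < threshold * baseline:
            alerts.append(i)
    return alerts


# Hypothetical series: steady logging, then a sharp drop after a release.
counts = [1000, 1020, 990, 1010, 1005, 995, 400, 380]
alerts = find_logging_drops(counts)
print(alerts)
```

A trailing baseline adapts to normal traffic changes, at the cost of eventually treating a sustained outage as the new normal, so a fixed reference period may be preferable in practice.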

Tableau

• Data Visualization Tool that integrates nicely with BigQuery

• Pulls data from BigQuery periodically and caches for quick inspection

Milliseconds to transfer

Milliseconds to process

Seconds to Query

SECONDS TO INSIGHTS

Faster Insights to Client Behavior

Problem

As a developer, I want to be able to instantly explore data being logged by the clients.

Solution

• Produce a topic for all employee client events

• Store in Elasticsearch

• Visualize in Kibana

Benefits

• Able to understand what’s being sent by the client as it happens

• Exploring events, visualizing distribution (e.g. does this field actually get populated?)

• Prototyping analysis based on a sample

• Dashboards for Employee Releases

Other Uses

Ad Targeting

• Real-time genre targeting

• Session insights — explicit filter

Real-time Recommendations

Live Results for X-Factor

• X-Factor: music competition

• Songs available on Spotify immediately after show airs

• Listener behavior determines the order of contestants on the playlist
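Ordering a playlist by listener behavior can be as simple as sorting contestants by a live metric. The slides do not name the metric, so the sketch below assumes stream counts and uses made-up contestant names purely for illustration.

```python
def order_playlist(stream_counts):
    """Order contestants by stream count, highest first.

    Ties are broken alphabetically so the order is deterministic.
    `stream_counts` maps a contestant name to a count (assumed metric).
    """
    ranked = sorted(stream_counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked]


# Hypothetical live counts shortly after a show airs.
live_counts = {"Contestant A": 1200, "Contestant B": 3400, "Contestant C": 1200}
playlist_order = order_playlist(live_counts)
print(playlist_order)
```

Re-running the sort as counts update keeps the playlist in step with what listeners are actually playing.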

Review

Real-time Processing

Batch Processing (Hadoop, Hive, BigQuery)


Behind the Scenes

Minutes to transfer




To leverage actionable insights, we need a faster feedback loop!

Putting It All Together

Milliseconds to transfer

Milliseconds to process

Seconds to Query

SECONDS TO INSIGHTS

The Value of a Fast Feedback Loop

• Detecting problems early in data avoids long backfills or long term data loss

• Instant insights on newly developed features allow teams to iterate more quickly and take risks

• Providing a quicker ad-hoc querying engine allows teams to ask more questions and learn faster

Use Anything and Everything

• Spotify has leveraged Google Cloud tools, such as Pub/Sub, Dataflow and BigQuery

• Open source and other cloud providers offer many alternatives to this stack

• Open source tools (Elasticsearch/Kibana) and proprietary solutions (Tableau) have also been useful additions

Where Are We Going?

• The real-time mission is in the early stages at Spotify

Stream Processing First

• The sun never sets on Spotify, so why impose boundaries on our datasets?

• What’s the shortest distance between two points? Zero!

• Can we reduce the feedback cycle to zero?

We’re Hiring!

Engineers, Managers, Product Owners needed in NYC and Stockholm

https://www.spotify.com/jobs

Thanks, BigDataSpain!
