big data at spotify

Adam Kawa Data Engineer @ Spotify (Big) Data At Spotify

Upload: adam-kawa

Post on 15-Jan-2015


Category:

Technology



TRANSCRIPT

Page 1: Big Data At Spotify

Adam Kawa, Data Engineer @ Spotify

(Big) Data At Spotify

Page 2: Big Data At Spotify
Page 3: Big Data At Spotify

At Spotify,

important questions are being

asked all the time!

Page 4: Big Data At Spotify

Some of these questions

are ”relatively easy”

to answer…

Page 5: Big Data At Spotify

1. How many times has Coldplay been streamed this month?
2. Who was the most popular artist in NYC last week?
3. How many times was “Get Lucky” streamed during the first 24h?

Labels, Licensor, Partners, Advertisers

Page 6: Big Data At Spotify

■ Very granular reports are required
- Divided by gender, age, location and more

■ We have been delivering various reports from day 1
- Too much data for traditional solutions

Reporting
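
The reporting questions above come down to filtered counts over play events. A minimal sketch, assuming a hypothetical event layout of (artist, city, gender, day); the field names and sample rows are made up for illustration and are not Spotify's log format:

```python
from collections import Counter
from datetime import date

# Hypothetical, hand-made play events: (artist, city, gender, day).
PLAYS = [
    ("Coldplay", "NYC", "f", date(2013, 5, 1)),
    ("Coldplay", "London", "m", date(2013, 5, 2)),
    ("Jay-Z", "NYC", "m", date(2013, 5, 2)),
    ("Jay-Z", "NYC", "f", date(2013, 5, 3)),
    ("Coldplay", "NYC", "m", date(2013, 4, 30)),
]


def streams_per_artist(plays, start, end, city=None):
    """Count streams per artist between start and end (inclusive),
    optionally restricted to a single city."""
    counts = Counter()
    for artist, play_city, _gender, day in plays:
        if start <= day <= end and (city is None or play_city == city):
            counts[artist] += 1
    return counts


# "How many times has Coldplay been streamed this month?"
may = streams_per_artist(PLAYS, date(2013, 5, 1), date(2013, 5, 31))

# "Who was the most popular artist in NYC ...?"
nyc = streams_per_artist(PLAYS, date(2013, 5, 1), date(2013, 5, 31), city="NYC")
top_nyc_artist = nyc.most_common(1)[0][0]
```

At the data volumes quoted later in the deck, the same grouping would run as a Hive query or a MapReduce job rather than in a single process.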

Page 7: Big Data At Spotify

QUIZ!

Page 8: Big Data At Spotify

Question
Who was the most frequently streamed female artist in 2013?

Answer?
A) Katy Perry
B) Lady Gaga
C) Madonna
D) Rihanna

Popular Artists

Page 9: Big Data At Spotify

Question
Who was the most frequently streamed female artist in 2013?

Popular Artists

Page 10: Big Data At Spotify

■ The Most Popular Male Artist - Macklemore
■ The Most Popular Band - Imagine Dragons
■ The Most Popular Track - “Can't Hold Us”

Popular Artists In 2013

Page 11: Big Data At Spotify

■ Users love local artists!
- Berlin - Sido
- London - Coldplay
- Singapore - Vanessa-Mae
- Stockholm - Avicii

Popular Artists In 2013

Page 12: Big Data At Spotify

■ Users love local artists!
- NYC listens to Jay-Z 88% more than the rest of the world
- Stockholm listens to ABBA 110% more than the rest of the world

Popular Artists In 2013

Page 13: Big Data At Spotify

Question
What was the most “viral” track in 2013?

Popular Tracks

Page 14: Big Data At Spotify

Question
What was the most “viral” track in 2013?

Answer
“Get Lucky” by Daft Punk feat. Pharrell Williams

Popular Tracks

Page 15: Big Data At Spotify

Artist Analytics – Daft Punk

“Get Lucky” was released on April 19th, 2013.

Page 16: Big Data At Spotify

Artist Analytics – Daft Punk

Around 5x more streams when comparing the day before and the day after “Get Lucky”

Page 17: Big Data At Spotify

Artist Analytics – Daft Punk

What happened that day?

Page 18: Big Data At Spotify

Artist Analytics – Daft Punk

“Random Access Memories” was released on May 17th, 2013.

Page 19: Big Data At Spotify

■ 09.08.1963 – 11.02.2012

Artist Analytics – Whitney Houston

Page 20: Big Data At Spotify

■ One of the most popular Polish rock bands ever

Artist Analytics – Budka Suflera

What happened?

Page 21: Big Data At Spotify

■ One of the most popular Polish rock bands ever

Artist Analytics – Budka Suflera

Information about the retirement was

announced...

Page 22: Big Data At Spotify

1. What was the number of daily active users (DAU) yesterday?
2. How many users have signed up this week?
3. In which country should Spotify launch next?

Management And Investors

Page 23: Big Data At Spotify

■ Analyzing growth
- Number of active users, streamed songs, sign-ups and more
- Where to launch Spotify next

■ Company KPIs

Business Analytics

Page 24: Big Data At Spotify

However,

some of the questions are

really tricky to answer!

Page 25: Big Data At Spotify

1. What song to stream to Jay-Z when he wakes up?
2. Is Adam Kawa bored with Timbuktu today?
3. How to encourage Jeff to go for the Premium Account?

Data Scientists, Researchers

Page 26: Big Data At Spotify

■ Recommendations
- Powering features like Discover, Radio
- “Perfect music for every moment ♪♫ ♬ ♯”

■ Classification of songs and playlists by genre or mood

■ Top lists per country

Product Features

Page 27: Big Data At Spotify

■ Overall, in 2013
- Best Hangover Cure - “The Lazy Song”
- Best Song To Get Over An Ex - “Someone Like You”
- Best Party Starter - “Levels”
- Best Driving Song - “Bohemian Rhapsody”
- Best Work Out Song - “Eye of the Tiger”

Perfect Music For Every Moment

Page 28: Big Data At Spotify

1. Is this button nicer than the previous one?
2. How to personalize the messages displayed to users?
3. How should the search results be displayed?

Designers, Feature Owners

Page 29: Big Data At Spotify

■ A/B Test
- Come up with promising “look-and-feels” and do A/B tests

■ Explicit feedback from users
- But users usually do not like to rate things
- But users usually do not like to customize things

Designers, Feature Owners

Page 30: Big Data At Spotify

■ Sign-up Button On Facebook

A/B Test Use Case
Sign-up button on the landing page

Page 31: Big Data At Spotify

Sign-up Button On Facebook
Layouts of sign-up button

B – Test Group (50%)

A – Control Group (50%)

Page 32: Big Data At Spotify

Sign-up Button On Facebook

Which one performed better?

B – Test Group (50%)

A – Control Group (50%)

Layouts of sign-up button

Page 33: Big Data At Spotify

Sign-up Button On Facebook
Layouts of sign-up button

Many more sign-ups!

A – Control Group (50%)

B – Test Group (50%)

Page 34: Big Data At Spotify

■ “Only 10% are likely to cause a true uplift” - Google after 12K tests
- Be able to iterate fast!

■ “80% of the time, we are wrong about what consumers want”
- The truth is in data!

A/B Tests
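
Whether a test group really beat the control group is usually decided with a significance test. A minimal sketch using a two-proportion z-test; the choice of test and the sign-up counts below are illustrative assumptions, not figures or methods from the talk:

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a / n_a: sign-ups and visitors in the control group (A).
    conv_b / n_b: sign-ups and visitors in the test group (B).
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B are equal".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for a 50/50 split of 20,000 visitors.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
significant = p_value < 0.05
```

With these made-up counts the uplift from 2.0% to 2.6% is significant at the 5% level; with much smaller groups the same rates would not be, which is one reason fast iteration needs enough traffic per variant.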

Page 35: Big Data At Spotify

In the past,

we guesstimated a bit

(common sense, intuition,

gut feeling, observations,

inspirations)

Page 36: Big Data At Spotify

Isn't it inspired by the Windows Start menu button? ;)

“KöP!” means “BUY!”

Page 37: Big Data At Spotify

Today,

we make data-driven decisions

Page 38: Big Data At Spotify
Page 39: Big Data At Spotify

To make data-driven decisions,

data and data infrastructure

are required (among other things)

Page 40: Big Data At Spotify

■ Over 6 million paying subscribers
■ Over 24 million MAU (monthly active users)
■ 1.5 billion playlists created so far
■ Available in 55 countries
■ Over 20 million songs
■ 4.5 billion hours streamed in 2013

Users At Spotify

Page 41: Big Data At Spotify

■ Data generated by users and for users!
- 1.5 TB of compressed data from users per day
- 64 TB of data generated in Hadoop each day (triplicated)

(Big) Data At Spotify

Page 42: Big Data At Spotify

■ Apache Hadoop YARN
■ Many other systems including:
- Kafka, Luigi, Cassandra, PostgreSQL in production
- Giraph, Tez, Spark in evaluation mode

Data Infrastructure At Spotify

Page 43: Big Data At Spotify

■ Probably the largest commercial Hadoop cluster in Europe!
- 694 heterogeneous nodes
- 12.63 PB of data used
- ~7,000 jobs each day

Apache Hadoop

Page 44: Big Data At Spotify

■ Used for “off-line” processing
- When Hadoop is down, Spotify still plays music!
- When Hadoop is down, Data Analysts play FIFA, table tennis or … run queries locally

■ We mostly analyze logs from users' activity

Apache Hadoop

Page 45: Big Data At Spotify

■ Get insights to offer a better product
- “More data usually beats better algorithms”

■ Get insights to make better decisions
- Avoid “guesstimates”

■ Gain a competitive advantage
- More companies have started offering music streaming

What Does Hadoop Allow Us To Do?

Page 46: Big Data At Spotify

■ We use multiple tools and languages
- Hive is very popular among our data analysts
- Crunch for core pipeline jobs
- Some legacy code in Hadoop Streaming with Python
- A number of Pig and Java MapReduce jobs
- Avro as storage format (but we have started considering columnar formats)

How Do We Use Hadoop?

Page 47: Big Data At Spotify

■ Primarily used to transport logs
- from multiple servers
- to a central location for storage and analysis

■ A better fit for us than Flume
- We got higher throughput with Kafka

■ We added more features to Kafka
- End-to-end delivery
- Encryption

Apache Kafka

Page 48: Big Data At Spotify

■ A scalable and distributed key-value store
■ Provides fast read-write access for many small pieces of data
- We use it for playlists, user profiles, popularity count

■ Was a better fit for us than HBase
- The NameNode (NN) was a single point of failure (SPOF) at that time

Apache Cassandra

Page 49: Big Data At Spotify

■ Allows us to build complex pipelines of batch jobs
■ Handles dependency resolution, workflow management, visualization and more

■ Our alternative to Oozie and Azkaban
- Spotify, Foursquare, Bitly and more contribute

Luigi
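
Dependency resolution means running each job only after everything it depends on has finished. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9+). It is not Luigi's API (in Luigi, tasks declare their upstream tasks in a `requires()` method), and the job names are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
PIPELINE = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "top_lists": {"clean_logs"},
    "label_reports": {"clean_logs"},
    "dashboard": {"top_lists", "label_reports"},
}


def execution_order(pipeline):
    """Return the jobs in an order where every job follows its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
```

A workflow manager adds scheduling, retries, and visualization on top of this ordering, and can skip jobs whose output already exists.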

Page 50: Big Data At Spotify

We still use them!
■ Powering features that require transaction support and integrity constraints
- e.g. ordering Spotify gift cards

■ Semi-aggregated data for dashboards
■ Semi-aggregated data for quick analysis

RDBMS

Page 51: Big Data At Spotify

March 2013

Tricky questions were asked!

Page 52: Big Data At Spotify

1. How many servers do you need to buy to survive one year?
2. If we agree, what will you do to use them efficiently?
3. If we agree, do not come back to us this year, OK?

Finance Department

Page 53: Big Data At Spotify

■ Partially responsible for answering these questions!
■ One of the Data Engineers who
- takes care of the 694-node Hadoop-YARN cluster
- implements and troubleshoots users' jobs
- works in a team with Josh, Marcin, Rafal, Fabian and Wouter

■ Hadoop instructor for almost 2 years
■ Co-organizer of Warsaw and Stockholm HUGs
■ Blogger at HakunaMapData.com

Adam Kawa

Page 54: Big Data At Spotify

■ Latency analysis
- msec to wait for music after pressing the “Play” button

■ Capacity planning
- servers, bandwidth, data-center space and more

Operational Metrics

Page 55: Big Data At Spotify

■ Hadoop provides tons of metrics, logs and files
■ They can be analyzed by … Hadoop

Operational Metrics For Hadoop

Page 56: Big Data At Spotify

■ This knowledge can be useful to learn how to
- measure how fast our HDFS is growing
- calculate the empirical retention policy for datasets
- optimize the scheduler
- benchmark the cluster
- and more

What Hadoop Can Tell About Itself

Page 57: Big Data At Spotify

Let's see

a couple of examples

Page 58: Big Data At Spotify

5,000 TB of data created before October 1, 2013

Page 59: Big Data At Spotify
Page 60: Big Data At Spotify

Could we archive data accessed

before this day?

Page 61: Big Data At Spotify

■ You can analyze the FsImage file to learn how fast you grow
■ You can even correlate this data with
- number of DAU
- total size of logs generated by users
- activity of users e.g. hours streamed
- number of queries / day run by analysts

Advanced HDFS Capacity Planning

Page 62: Big Data At Spotify

■ You can also use the “trend” feature in Ganglia

Simplified HDFS Capacity Planning

If we do NOTHING, we will fill the cluster in September...
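
The "when will the cluster be full" estimate can be approximated by fitting a straight line to past usage and extrapolating it to the capacity limit. A minimal sketch, assuming linear growth; the usage samples and the capacity value are hypothetical and not taken from the cluster described in the talk:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Days after the last sample until the fitted trend reaches capacity.

    Returns None if usage is not growing.
    """
    slope, intercept = fit_line(days, used_tb)
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - days[-1])


# Hypothetical samples: day offset and HDFS space used (TB).
DAYS = [0, 30, 60, 90]
USED_TB = [9000.0, 9600.0, 10200.0, 10800.0]
remaining = days_until_full(DAYS, USED_TB, capacity_tb=12600.0)
```

The per-day usage samples could come from FsImage snapshots or from monitoring; real growth is rarely perfectly linear, so the result is a rough planning figure only.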

Page 63: Big Data At Spotify

What will we do

to survive longer than September?

Page 64: Big Data At Spotify

■ We introduced an automatic retention policy
- An owner of the dataset specifies a retention period
- If needed, a retention period can be calculated empirically
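
A retention policy of this kind can be sketched in two parts: selecting partitions older than the retention period, and deriving a period from observed access patterns. The slides do not say how the empirical period is calculated, so the percentile approach below is an assumption, as are all names and sample values:

```python
from datetime import date, timedelta


def expired_partitions(partition_days, retention_days, today):
    """Return the daily partitions that are older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(day for day in partition_days if day < cutoff)


def empirical_retention_days(read_ages_days, coverage_percent=99):
    """Smallest age (in days) that covers `coverage_percent` of observed reads.

    read_ages_days: for each past read, how old the data was when it was read.
    """
    ages = sorted(read_ages_days)
    needed = -(-coverage_percent * len(ages) // 100)  # ceiling division
    return ages[max(needed, 1) - 1]


# Hypothetical dataset with three daily partitions and a 90-day retention period.
PARTITIONS = [date(2013, 1, 1), date(2013, 6, 1), date(2013, 9, 20)]
to_delete = expired_partitions(PARTITIONS, retention_days=90, today=date(2013, 10, 1))

# Hypothetical ages (in days) of the data touched by past reads.
READ_AGES = [1, 1, 2, 3, 5, 8, 13, 21, 34, 400]
suggested = empirical_retention_days(READ_AGES, coverage_percent=90)
```

Here 90% of the reads touched data at most 34 days old, so a 34-day period would have served almost all past reads; the single read of 400-day-old data is the kind of outlier a dataset owner would have to weigh.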

Page 65: Big Data At Spotify

We continuously improve

our MapReduce jobs

Page 66: Big Data At Spotify

■ We schedule some jobs each hour, day or week e.g.:
- Top lists for each country
- Reports for the labels, partners, advertisers

Idea
■ Use job statistics from the previous executions of a job
- to optimize the current execution of this job
- to learn about the history of performance of a given job

Recurring MapReduce Jobs

Even a perfect manual setting may eventually become outdated when an input dataset grows!

Page 67: Big Data At Spotify

■ A tiny PoC ;)
■ The average task time set to 10 minutes (inspired by LinkedIn)

■ It should help in extreme cases: very short-lived and very long-lived tasks

type  | # map | # reduce | avg map time | avg reduce time     | job execution time
old_1 | 4826  | 25       | 46sec        | 1hrs, 52mins, 14sec | 2hrs, 52mins, 16sec
new_1 | 391   | 294      | 4mins, 46sec | 8mins, 24sec        | 23mins, 12sec

type  | # map | # reduce | avg map time | avg reduce time     | job execution time
old_2 | 4936  | 800      | 7mins, 30sec | 22mins, 18sec       | 5hrs, 20mins, 1sec
new_2 | 4936  | 1893     | 8mins, 52sec | 7mins, 35sec        | 1hrs, 18mins, 29sec

MapReduce Jobs Autotuning
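
One way to read the autotuning idea: take the task count and average task time from the previous run, and resize the task count so that the average lands near the 10-minute target. The formula below is a guess at such a heuristic, not the PoC's actual code, and it assumes the total amount of work stays roughly the same when tasks are resized:

```python
import math

TARGET_TASK_SECONDS = 10 * 60  # 10-minute average task time, as on the slide


def suggest_task_count(prev_tasks, prev_avg_seconds,
                       target_seconds=TARGET_TASK_SECONDS):
    """Suggest a task count so that the average task takes about target_seconds.

    Assumes total work (tasks * average time) is unchanged by resizing.
    """
    total_work_seconds = prev_tasks * prev_avg_seconds
    return max(1, math.ceil(total_work_seconds / target_seconds))


# Statistics of the old_1 run from the table above.
suggested_maps = suggest_task_count(4826, 46)                         # 46sec maps
suggested_reduces = suggest_task_count(25, 1 * 3600 + 52 * 60 + 14)   # 1h 52m 14s reduces
```

For old_1 this suggests 370 maps and 281 reducers, in the same range as the 391 and 294 used by new_1, so the PoC's rule was presumably similar but not identical. For old_2 the number of maps stayed at 4936, which this simple formula does not explain.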

Page 68: Big Data At Spotify

■ We make data-driven decisions to improve our product
■ Scalable and open-source projects allow us to do that
■ Hadoop, Cassandra, Kafka need love and care
- And passionate people who give it to them

■ Hadoop is like a salutary virus
- It quickly spreads across people and projects

Summary

Page 69: Big Data At Spotify

Questions?

Page 70: Big Data At Spotify

BONUS!

Page 71: Big Data At Spotify

One Question:
What could happen after some time of simultaneous development of MapReduce jobs, maintenance of a large cluster, and listening to perfect music for every moment?

Page 72: Big Data At Spotify

A Possible Answer:
You may discover Hadoop in the lyrics of many popular songs!

Page 73: Big Data At Spotify
Page 74: Big Data At Spotify

Check out spotify.com/jobs or @Spotifyjobs for more information

[email protected]

Check out my blog: HakunaMapData.com

Want to join the band?

Page 75: Big Data At Spotify

Thank you!