delivering personalized music discovery

38
Delivering Music Recommendations to Millions Sriram Malladi Rohan Singh (@rohansingh)

Upload: rohan-singh

Post on 10-May-2015

2.492 views

Category:

Technology


3 download

TRANSCRIPT

Page 1: Delivering Personalized Music Discovery

Delivering Music Recommendations to Millions!

!

Sriram Malladi Rohan Singh (@rohansingh)

Page 2: Delivering Personalized Music Discovery

A little bit about us

Over 24 million active users !

55 countries, 4 data centers !

20+ million tracks

Page 3: Delivering Personalized Music Discovery

Discovering music — then

3

Page 4: Delivering Personalized Music Discovery

4

Page 5: Delivering Personalized Music Discovery

5

Page 6: Delivering Personalized Music Discovery

What do you want to listen to?

6

Page 7: Delivering Personalized Music Discovery

The right music for every moment

7

Page 8: Delivering Personalized Music Discovery

The Plan

8

Page 9: Delivering Personalized Music Discovery

1. Collect lots of data

2. Generate personalized recommendations

3. Serve those to millions of listeners each day

9

Page 10: Delivering Personalized Music Discovery

Data, data, data

Track plays !

Radio feedback !

Playlists !

Follows

10

Page 11: Delivering Personalized Music Discovery

Collecting it

All data into and out of Spotify goes through access points !

Access points and services log data !

Logs are aggregated and shipped to Hadoop

11

Page 12: Delivering Personalized Music Discovery

12

access points

HadoopCassandraDiscover backend

Page 13: Delivering Personalized Music Discovery

Hadoop

Framework to store and process big, distributed data !

Hadoop Distributed File System (HDFS) !

Hadoop MapReduce !

hadoop.apache.org

13

Page 14: Delivering Personalized Music Discovery

14

Page 15: Delivering Personalized Music Discovery

15

Page 16: Delivering Personalized Music Discovery

Crunch the data

16

Page 17: Delivering Personalized Music Discovery

What you listen to !

Artists you follow !

What similar users listen to

17

Page 18: Delivering Personalized Music Discovery

Your buddy listened to Daft Punk all day !

An artist you follow released a new album !

A friend just created a new playlist

18

Page 19: Delivering Personalized Music Discovery

Region !

Age group !

Gender

19

Page 20: Delivering Personalized Music Discovery

What do you want to listen to?

20

Page 21: Delivering Personalized Music Discovery

21

Page 22: Delivering Personalized Music Discovery

Roll it out

22

Page 23: Delivering Personalized Music Discovery

Storing recommendations

About 80 kilobytes of data per user !

Regenerated daily for all active users !

> 1 TB of recommendations overall

23

Page 24: Delivering Personalized Music Discovery

Storing recommendations

Terabyte a day is a fair amount of data to write and index (+ replicas)

!

Tradeoff between storing data or looking things up on the fly

24

Page 25: Delivering Personalized Music Discovery

Apache Cassandra

Highly available, distributed, scalable !

Fast writes, decent reads !

We have one cluster per data center

25

Page 26: Delivering Personalized Music Discovery

Transferring worldwide

26

Page 27: Delivering Personalized Music Discovery

Recommendations all generated in Hadoop, in London !

Need to be shipped to our data centers worldwide !

So much Internet weather

27

Page 28: Delivering Personalized Music Discovery

hdfs2cass

Internal tool to copy data from Hadoop to Cassandra !

Creates table from data in HDFS !

Loads tables into Cassandra using a bulk loader !

github.com/spotify/hdfs2cass

28

Page 29: Delivering Personalized Music Discovery

Serve it up

29

Page 30: Delivering Personalized Music Discovery

1. Aggregate data from all our data sources

2. Decorate it with metadata

3. Shuffle!

30

Page 31: Delivering Personalized Music Discovery

Serving at scale…

Discover is Spotify’s home page 500+ requests per second

!

Thousands of requests to other services !

Database calls for each unique user

31

Page 32: Delivering Personalized Music Discovery

…or not?

Naive, first-stage prototype: !

10 to 20 seconds per request !

(doesn’t really scale)

32

Page 33: Delivering Personalized Music Discovery

Fail a lot

33

Page 34: Delivering Personalized Music Discovery

Make it webscale!

!

Find & rewrite slow sections !

Switch to C++ from Python for critical code

34

Page 35: Delivering Personalized Music Discovery

More improvements

!

Pregenerate more data !

Cache aggressively

35

Page 36: Delivering Personalized Music Discovery

Throw hardware at it

!

Switch to SSD’s !

Scale horizontally

36

Page 37: Delivering Personalized Music Discovery

Takeaways

You will need hardware — big data can still be expensive

!

Less data can be better !

Prototype, iterate, optimize— fail early but improve

37

Page 38: Delivering Personalized Music Discovery

Questions?

38