June 5, 2013 Wouter de Bie Team Lead Data Infrastructure Hadoop at Spotify Wednesday, June 5, 13


Page 1: Hadoop at Spotify - Meetupfiles.meetup.com/3171672/Hadoop at Spotify - Wouter de... · 2013-07-23 · Scaling Hadoop at Spotify Our Journey •Started with a small (scrap metal) cluster

June 5, 2013

Wouter de Bie, Team Lead Data Infrastructure

Hadoop at Spotify

Wednesday, June 5, 13


Hello Poland, Spotify is now playing!


Agenda
• Why Data?
• Why Hadoop?
• Use Cases
• Infrastructure overview
• Map/Reduce in different languages
• Scheduling
• Scaling Hadoop
• Lessons learned


Spotify? Spotify!


Some context
• Spotify started in 2006
• Now 850+ employees, 250+ engineers
• 26 million monthly active users
• 20+ million tracks available
• 12 data engineers building a platform for easy access to data


Why data?

We play music, right?


Listening behavior in Sweden

[Chart: x-axis runs from Mon 00 through Sun 22 in two-hour steps; y-axis runs from 0.00% to 1.60%; one series per age group for SE: 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60-150]


Listening behavior in Spain

[Chart: x-axis runs from Mon 00 through Sun 22 in two-hour steps; y-axis runs from 0.00% to 1.40%; one series per age group for ES: 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60-150]


Impact of Hurricane Sandy

29 October 2012


Impact of Hurricane Sandy

30 October 2012


Why data?

To get more insights. For example, the listening patterns per age group could be used for ad targeting.


Why Hadoop?

Millions of daily active users == LOTS OF DATA

• Too much data to fit into an RDBMS (or too costly)
• Hadoop provides good value for money
• Runs on “commodity” hardware
• Scales pretty well


Use Cases

We’re a data-driven company, so data is used almost everywhere

• Reporting
• Business Analytics
• Operational Analytics
• In-client features


Reporting

• Reporting to labels, licensors, partners and advertisers from day 1
• In 2008 an RDBMS would have worked for a while, but labels wanted very granular data, so we went with Hadoop instead


Business Analytics

• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• A/B testing
• NPS analysis
• Segmentation analysis


Operational metrics

• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)


Product features

• Radio
• Top lists
• Recommendations (better than external parties, because of the amount of data)


Some geeky numbers

• 600 GB of compressed data from users per day
• 150 GB of data from services per day
• 4 TB of data generated in Hadoop each day
• 190 node Hadoop cluster (12 core CPUs, 32 GB RAM, 24 TB disk space)
• Soon 690 nodes (12 core CPUs, 64 GB RAM, 48 TB disk space)
• 4 PB of storage capacity (soon 28 PB)
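As a rough cross-check of these figures, the raw disk capacity can be worked out from the node counts. This is only a sketch under assumptions the slide does not state explicitly: that the disk sizes are per node, that the 690-node cluster consists of the current 190 nodes at 24 TB each plus 500 new nodes at 48 TB each, and that 1 PB = 1000 TB.

```python
# Rough capacity check. Node counts and disk sizes come from the slide;
# the per-node interpretation and the old/new split are assumptions.
TB_PER_PB = 1000  # decimal units (assumption)

current_nodes, current_tb_per_node = 190, 24
new_nodes, new_tb_per_node = 500, 48  # assumed: 690 - 190 = 500 new nodes

current_pb = current_nodes * current_tb_per_node / TB_PER_PB
future_pb = current_pb + new_nodes * new_tb_per_node / TB_PER_PB

# Compressed user data plus service data arriving per day, in GB.
daily_ingest_gb = 600 + 150

print(f"current raw capacity: {current_pb:.2f} PB")
print(f"future raw capacity:  {future_pb:.2f} PB")
print(f"daily ingest:         {daily_ingest_gb} GB")
```

Under these assumptions the totals come to about 4.6 PB today and about 28.6 PB after the expansion, which is in line with the 4 PB and 28 PB quoted on the slide.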


Our data infrastructure


Spotify’s data infrastructure

[Diagram components: Backend services, HDFS, Map/Reduce, Luigi Scheduler, Operational Databases, Reporting, Analytical databases, Product features, Dashboards, Map/Reduce Jobs, Hive]


Map/Reduce language

Python with Hadoop Streaming
• Pros: fast development, many Spotify libraries available
• Cons: slower than Java, no access to Hadoop API

Java
• Pros: fast, access to Hadoop API
• Cons: verbose language, not many Spotify libraries available

PIG
• Pros: very small scripts, faster than streaming
• Cons: yet another language to learn, not many Spotify libs available

Hive
• Pros: SQL-like syntax (easy for non-programmers) and relational data model
• Cons: more moving parts (not well suited for a whole pipeline)
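To give a feel for the Python with Hadoop Streaming option, here is a minimal sketch of a streaming-style mapper and reducer. Hadoop Streaming feeds records to a script on stdin and expects tab-separated key/value lines on stdout, with the reducer input sorted by key. The input format (tab-separated user and track), the sample data and the function names are assumptions made for illustration, not Spotify's actual jobs.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'track<TAB>1' for each line of the assumed form 'user<TAB>track'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed records
        _user, track = fields
        yield f"{track}\t1"


def reducer(lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for track, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{track}\t{total}"


# Local simulation of map -> sort -> reduce on hypothetical sample records.
sample = ["alice\ttrack_a", "bob\ttrack_b", "carol\ttrack_a"]
counts = list(reducer(sorted(mapper(sample))))
print(counts)
```

In a real streaming job each function would read from sys.stdin and print its output, and the framework would do the sorting between the two steps.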


Some technical details

Technology we use in combination with Hadoop

• Java, PIG, Hive and Python for writing Map/Reduce 1.0 jobs
• Apache Kafka for log collection (in the process of replacing a batched transfer system)
• Apache Cassandra (no HBase ;))
• PostgreSQL
• Sqoop for database import/export
• Avro as storage format


Scheduling

We wrote and open-sourced our own scheduler: Luigi

• Nothing suitable out there..
• https://github.com/spotify/luigi
• Written in Python
• Generic scheduler and dependency system that supports Python M/R, Pig and Hive
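The idea of a scheduler driven by dependencies can be sketched in a few lines of plain Python. This is a toy illustration only: the Task class, its attributes, the build function and the task names are hypothetical and are not Luigi's actual API; the repository linked above has the real thing.

```python
class Task:
    """A unit of work that may depend on other tasks (hypothetical toy class)."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        self.done = False

    def run(self, log):
        # Stand-in for launching a Python M/R, Pig or Hive job.
        log.append(self.name)
        self.done = True


def build(task, log, _in_progress=None):
    """Run unfinished dependencies of `task` depth-first, then the task itself."""
    in_progress = _in_progress if _in_progress is not None else set()
    if task.done:
        return  # already completed, nothing to do
    if task.name in in_progress:
        raise ValueError(f"dependency cycle at {task.name}")
    in_progress.add(task.name)
    for dep in task.requires:
        build(dep, log, in_progress)
    in_progress.discard(task.name)
    task.run(log)


# Hypothetical pipeline: two outputs sharing the same upstream tasks.
import_logs = Task("import_logs")
clean_logs = Task("clean_logs", requires=[import_logs])
top_lists = Task("top_lists", requires=[clean_logs])
reports = Task("reports", requires=[clean_logs, import_logs])

order = []
build(top_lists, order)
build(reports, order)
print(order)
```

Each task runs once, after everything it requires, so shared upstream work such as the log import is not repeated for the second output.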


Scaling Hadoop at Spotify
Our Journey

• Started with a small (scrap metal) cluster of 37 servers
• Moved to Amazon Elastic Map/Reduce (EMR) and S3 to quickly scale
• Built an in-house cluster of 60 nodes because of EMR costs
• Capacity planning every 6 months, grown to 190 nodes today
• Just ordered 500 more nodes
• Put in place data-retention policy and data archive


Lessons learned

4+ years of Hadoop taught us

• Hadoop has brought us very far. We would never be able to handle the current volume with a “cheap” RDBMS

• “Commodity hardware” doesn’t mean cheap hardware
• Hadoop isn’t a silver bullet
• Hadoop is a complex system that needs love and care
• You will have to extend Hadoop (and eco-system components) to tailor it to your needs
• Hadoop is fun!


Questions?

Why? How? What?


One more thing..

What’s your biggest annoyance when running the following?

hadoop fs -ls /


It’s slow..

Why?
• JVM startup time
• Loading a lot of JAR files (Hadoop, logging, other stuff you don’t need)
• Problematic for us, since we call Hadoop from Python
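Because every hadoop fs invocation starts a new JVM, that startup cost is paid again on each call made from Python. Below is a minimal sketch of how such per-call overhead could be measured with the standard library. The helper name time_command is hypothetical, and the demo times a trivial Python child process as a stand-in so that it runs without a Hadoop installation.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Return the average wall-clock seconds per run of `cmd` (a list of arguments)."""
    start = time.perf_counter()
    for _ in range(runs):
        # Each run spawns a fresh process, so process startup is included.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return (time.perf_counter() - start) / runs


# Stand-in command; on a cluster one would pass ["hadoop", "fs", "-ls", "/"] instead.
per_call = time_command([sys.executable, "-c", "pass"])
print(f"{per_call:.3f} s per call")
```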


Welcome Snakebite!
A pure Python HDFS client

• Both a client library and a command line tool
• Uses Protocol Buffers to communicate with the NameNode
• Much faster than command line hadoop
• Tab completion included!
• Available at http://github.com/spotify/snakebite


How much faster?

wouter@foo:~$ time for i in {1..10}; do hadoop fs -ls /; done

real 0m14.464s
user 0m21.761s
sys 0m1.148s

wouter@foo:~$ time for i in {1..10}; do snakebite ls /; done

real 0m1.639s
user 0m1.072s
sys 0m0.160s
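Working through the numbers from the two runs above: each loop makes 10 calls, so the wall-clock ("real") times give a per-call latency and an overall speed-up. The variable names are chosen only for illustration.

```python
calls = 10
hadoop_real_s = 14.464    # "real" time for 10 x hadoop fs -ls /
snakebite_real_s = 1.639  # "real" time for 10 x snakebite ls /

hadoop_per_call = hadoop_real_s / calls
snakebite_per_call = snakebite_real_s / calls
speedup = hadoop_real_s / snakebite_real_s

print(f"hadoop fs: {hadoop_per_call:.2f} s per call")
print(f"snakebite: {snakebite_per_call:.2f} s per call")
print(f"speed-up:  {speedup:.1f}x")
```

That is roughly 1.4 s versus 0.16 s per listing, or close to a 9x speed-up in this particular measurement.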


Want to join the band?

Check out http://www.spotify.com/jobs or @Spotifyjobs for more information.

Come talk to Adam or me after the meetup!

Or mail: [email protected], Twitter: @xinit
