Page 1: Hadoop at Spotify - Meetupfiles.meetup.com/5139282/SHUG 1 - Hadoop at Spotify.pdf · Spotify is a Data-Driven company, so data is used practically everywhere.. Where do we use Hadoop?

January 2012

Hadoop at Spotify

Friday, January 25, 13

Page 2

Agenda

- Introduction
- Use cases
- Some stats
- Infrastructure overview
- Map/Reduce in different languages
- Scheduling
- Scaling Hadoop
- Lessons learned

Page 3

Introduction

Who am I and what am I doing here?!

Wouter de Bie - Team lead for Analytics & Data Infrastructure

- 3+ years experience with Hadoop
- Working with a team of 12 engineers to keep our data flowing
- Responsible for the data architecture

Don’t hesitate to ask questions during this talk!

Page 4

Introduction

Why do we use Hadoop?

Millions of daily active users == LOTS OF DATA

- Too much data to fit into an RDBMS (or too costly)
- Hadoop provides good value for money
- Runs on “commodity” hardware
- Scales pretty well

Page 5

Use cases

Where do we use Hadoop?

Spotify is a data-driven company, so data is used practically everywhere.

- Reporting
- Business analysis
- Operational analysis
- In-client features

Page 6

Reporting

This is where it all started.

- Reporting to labels, licensors, partners and advertisers from day 1
- In 2008 an RDBMS would have worked for a while, but labels wanted very granular data, so we went with Hadoop instead

Page 7

Business Analytics

We have so much data, so let’s use it!

- Analyzing growth, user behavior, sign-up funnels, etc.
- Company KPIs
- A/B testing
- NPS analysis
- Segmentation analysis

Page 8

An analysis I can share

Page 9

Listening behavior in Sweden

[Figure: line chart. X-axis: day of week and hour, labeled every two hours from Mon-00 to Sun-22. Y-axis: percentage, 0.00% to 1.60%. One series per age group: SE 0-17, SE 18-22, SE 23-27, SE 28-34, SE 35-44, SE 45-59, SE 60-150.]

Page 10

Listening behavior in Spain

[Figure: line chart. X-axis: day of week and hour, labeled every two hours from Mon-00 to Sun-22. Y-axis: percentage, 0.00% to 1.40%. One series per age group: ES 0-17, ES 18-22, ES 23-27, ES 28-34, ES 35-44, ES 45-59, ES 60-150.]
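The two charts plot one percentage series per age group over the slots of a week. The slides do not say how the percentages were computed, so the following is only a minimal sketch under an assumption: each point is the share of an age group's plays that falls in a two-hour slot. The input shape (pairs of listener age and play timestamp) and all function names are hypothetical; only the age ranges are taken from the chart legends.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Age ranges as they appear in the chart legends.
AGE_GROUPS = [(0, 17), (18, 22), (23, 27), (28, 34), (35, 44), (45, 59), (60, 150)]
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def age_group(age):
    """Return the legend label (e.g. '23-27') for a listener's age."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age out of range: {age}")


def slot(ts):
    """Return a two-hour slot label such as 'Mon-08' for a timestamp."""
    return f"{DAYS[ts.weekday()]}-{ts.hour - ts.hour % 2:02d}"


def listening_shares(plays):
    """Map each age group to {slot: percent of that group's plays}.

    `plays` is an iterable of (age, datetime) pairs (a hypothetical format).
    """
    counts = defaultdict(Counter)
    for age, ts in plays:
        counts[age_group(age)][slot(ts)] += 1
    return {
        group: {s: 100.0 * n / sum(per_slot.values()) for s, n in per_slot.items()}
        for group, per_slot in counts.items()
    }
```

At Spotify's scale this aggregation would run as a Map/Reduce job rather than in memory; the in-memory version only shows the grouping and normalization.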

Page 11

Operational Metrics

Things get interesting when you combine business metrics with operational data

- Root cause analysis
- Latency analysis
- Better capacity planning (servers, people, bandwidth)

Page 12

Product features

Leverage data to create better product features

- Recommendations (better than external parties, because of the amount of data)
- Radio
- Top lists

Page 13

Some geeky numbers

When I say “A LOT”, I mean:

- 200 GB of compressed data from users per day
- 100 GB of data from services per day
- 60+ GB of data generated in Hadoop each day
- 190-node Hadoop cluster (24-core CPUs, 32 GB RAM, 20 TB disk space)
- 4 PB of storage capacity
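As a quick sanity check on these figures, the sketch below redoes the arithmetic. It assumes the 20 TB of disk space is per node and that units are decimal (1 PB = 1000 TB); the slide states neither. Under those assumptions 190 nodes give 3.8 PB of raw disk, which is consistent with the quoted 4 PB, and the three daily sources add up to at least 360 GB per day.

```python
# Back-of-the-envelope check of the cluster figures quoted on the slide.
# Assumptions (not stated in the talk): 20 TB is per node, units are decimal.

NODES = 190
DISK_TB_PER_NODE = 20  # assumed to be per node

DAILY_GB = {
    "users": 200,     # compressed data from users
    "services": 100,  # data from services
    "hadoop": 60,     # "60+" on the slide, so this is a lower bound
}


def raw_capacity_pb(nodes, tb_per_node):
    """Total raw disk capacity in petabytes (1 PB = 1000 TB)."""
    return nodes * tb_per_node / 1000


def daily_volume_gb(sources):
    """Sum of the per-source daily volumes in gigabytes."""
    return sum(sources.values())


if __name__ == "__main__":
    print(f"raw capacity: {raw_capacity_pb(NODES, DISK_TB_PER_NODE):.1f} PB")
    print(f"daily volume: at least {daily_volume_gb(DAILY_GB)} GB")
```

Raw disk is not the same as usable space: HDFS stores multiple replicas of each block, so the amount of distinct data the cluster can hold is smaller than the raw figure.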

Page 14

Our data infrastructure

Page 15

Spotify’s data infrastructure

[Diagram with the following components: Backend services, HDFS, Map/Reduce, Luigi Scheduler, Operational Databases, Reporting, Analytical databases, Product features, Dashboards, Map/Reduce Jobs, Hive]

Page 16

Map/Reduce languages

We use multiple languages to run Map/Reduce jobs

Python with Hadoop Streaming
- Pros: fast development, many Spotify libraries available
- Cons: slower than Java, no access to Hadoop API

Java
- Pros: fast, access to Hadoop API
- Cons: verbose language, not many Spotify libs available

Pig
- Pros: very small scripts, faster than streaming
- Cons: yet another language to learn, not many Spotify libs available

Hive
- Pros: SQL-like syntax (easy for non-programmers) and relational data model
- Cons: more moving parts (not well suited for a whole pipeline)
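To make the "Python with Hadoop Streaming" option concrete, here is a minimal sketch of a streaming job that counts plays per track. With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from stdin and write tab-separated key/value lines to stdout, and the reducer receives its input sorted by key. The input format (tab-separated user_id, track_id, timestamp) and the `map`/`reduce` command-line argument are assumptions for illustration, not something described in the talk.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop Streaming job: count plays per track.

Hypothetical input format (not from the talk): one play per line,
tab-separated as user_id, track_id, timestamp.
"""
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'track_id<TAB>1' for every well-formed input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed records
        yield f"{fields[1]}\t1"


def reducer(lines):
    """Sum counts per key; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Run as "job.py map" or "job.py reduce"; does nothing without an argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Saved as `job.py` (a hypothetical name), the job can be tried locally without a cluster by piping through `sort`, which stands in for the shuffle phase: `cat plays.tsv | python3 job.py map | sort | python3 job.py reduce`.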

Page 17

Some technical details

Hadoop feels lonely without friends.

Technology we use in combination with Hadoop:

- Java, Pig, Hive and Python for writing Map/Reduce 1.0 jobs
- Apache Kafka for log collection (in the process of replacing a batched transfer system)
- Apache Cassandra (no HBase ;))
- PostgreSQL
- Sqoop for database import/export
- Avro as storage format

Page 18

Scheduling

Map/Reduce is simple, but a single Map/Reduce job is limited. You need to chain jobs.

We wrote and open-sourced our own scheduler: Luigi

- Nothing suitable out there
- https://github.com/spotify/luigi
- Written in Python
- Generic scheduler and dependency system that supports Python M/R, Pig and Hive
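The idea of chaining jobs through declared dependencies can be illustrated with a toy scheduler. This is deliberately not Luigi's API: the `Job` and `Scheduler` classes and the job names in the usage note are hypothetical, and the sketch only shows the core behavior of running each job once, after everything it requires.

```python
"""Toy illustration of chaining dependent jobs. This is NOT Luigi's API."""
from typing import Callable, List, Optional


class Job:
    """A named unit of work plus the jobs it depends on."""

    def __init__(self, name: str, action: Callable[[], None],
                 requires: Optional[List["Job"]] = None):
        self.name = name
        self.action = action
        self.requires = requires or []


class Scheduler:
    """Runs a job after its dependencies, executing each job at most once."""

    def __init__(self):
        self.completed = []        # job names in execution order
        self._in_progress = set()  # used to detect dependency cycles

    def run(self, job: Job) -> None:
        if job.name in self.completed:
            return  # already ran as a dependency of another job
        if job.name in self._in_progress:
            raise ValueError(f"dependency cycle at {job.name}")
        self._in_progress.add(job.name)
        for dep in job.requires:
            self.run(dep)
        job.action()
        self._in_progress.discard(job.name)
        self.completed.append(job.name)
```

For example, if `top_lists` and `label_report` both require `aggregate_plays`, which in turn requires `import_logs`, then running a final job that requires both reports executes `import_logs` and `aggregate_plays` only once. Luigi itself expresses the same idea with tasks that declare what they require and what output they produce; see the repository linked on the slide for the real interface.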

Page 19

Scaling Hadoop

Page 20

Scaling Hadoop

Our journey:

- Started with a small (scrap metal) cluster of 37 servers
- Moved to Amazon Elastic MapReduce (EMR) and S3 to quickly scale
- Built an in-house cluster of 60 nodes because of EMR costs
- Capacity planning every 6 months, grown to 190 nodes today
- Put in place data-retention policy and data archive

Page 21

Lessons learned

4+ years of Hadoop at Spotify taught us:

- Hadoop has brought us very far. We would never be able to handle the current volume with a “cheap” RDBMS
- “Commodity hardware” doesn’t mean cheap hardware
- Hadoop isn’t a silver bullet
- Hadoop is a complex system that needs love and care
- You will have to extend Hadoop (and ecosystem components) to tailor it to your needs
- Hadoop is fun!

Page 22

Thank you!

January 2013

Any questions?

[email protected] / @xinit

Shameless plug: Spotify is hiring! Talk to me or someone with a Spotify badge if you’re interested!