Data Democratization at SoundCloud - Bruno Sá (SoundCloud)

DATA DEMOCRATIZATION @ SOUNDCLOUD October 29th, 2013

Upload: jaxlondonconference

Post on 12-May-2015

DESCRIPTION

SoundCloud is the world’s leading social sound platform where anyone can create sounds and share them everywhere. 200 million people listen to sounds on SoundCloud every month. That is eight percent of the Internet. 12 hours of audio are uploaded to SoundCloud every minute. This means that SoundCloud not only deals with a lot of data (hundreds of terabytes, approximately) but also embraces the concept of “data democratization,” which means that all data must be available to anyone in the company who needs to access and work with it.

TRANSCRIPT

Page 1: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

DATA DEMOCRATIZATION @ SOUNDCLOUD

October 29th, 2013

Page 2: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

HI, I’M BRUNO

Page 3: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM

Page 4: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

Every minute, creators upload

12hrs of audio

Page 5: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

reaching over

200m

people every month

Page 6: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

8% of the internet

Page 7: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)
Page 8: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

FOO FIGHTERS, SNOOP LION, MADONNA, MACKLEMORE, PRESIDENT OBAMA, JOHN OLIVER (DAILY SHOW/BUGLE), SKRILLEX

Page 9: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)
Page 10: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

what gets listened to where?

how many new users did we get from that campaign?

how much revenue do we make in Brazil?

how do users use our iOS and Android apps?

what makes a sound successful?

did the product change affect feature x?

do comments on tracks correlate with listens?

what makes an artist successful?

Page 11: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

• Avoid Silos

• Remove unnecessary restrictions

• Provide simple tools

• Teach people how to use data

DATA DEMOCRATIZATION

Page 12: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

In one sentence:

DATA DEMOCRATIZATION

Deliver the right information to the right person at the right time.

Page 13: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

DATA ANALYSIS AND REPORTING

PRODUCTION DB

ANALYTICS DB

2010-2012

Page 14: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

DATA ANALYSIS AND REPORTING

Listens, Sounds, Users, Comments, Favorites, Shares, Reposts

Impressions, Clicks, Conversions, Suggestions, Downloads, Taggings, Uploads

Page 15: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

DATA ANALYSIS AND REPORTING

Listens

timestamp, duration, sound owner, listener, API key, (location) country
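
The listen fields above could be modelled as a simple record. This is a minimal sketch: the field names follow the slide, but the `ListenEvent` name, the types, the millisecond unit, and the sample values are assumptions, not SoundCloud's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ListenEvent:
    """One listen, with the fields listed on the slide (types are assumed)."""
    timestamp: datetime   # when the listen happened
    duration_ms: int      # how long the sound was played (unit is assumed)
    sound_owner_id: int   # user who owns the sound
    listener_id: int      # user who listened
    api_key: str          # client application that issued the request
    country: str          # country derived from the listener's location


# Hypothetical sample values, for illustration only.
event = ListenEvent(
    timestamp=datetime(2013, 10, 29, 12, 0, tzinfo=timezone.utc),
    duration_ms=183_000,
    sound_owner_id=1,
    listener_id=2,
    api_key="example-key",
    country="DE",
)
```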

Page 16: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

DATA ANALYSIS AND REPORTING

additional metadata:

• location within sound

• context (location on site)

• segmentation

Listening creates >6000 events/s

BIG DATA
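
To put the event rate in perspective, a quick back-of-the-envelope calculation helps. It assumes the rate of 6000 events/s from the slide holds constantly over a full day, which is a simplification, and the slide gives that figure only as a lower bound.

```python
EVENTS_PER_SECOND = 6000          # lower bound from the slide (">6000 events/s")
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400

# Assumes a constant rate over the whole day.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
print(f"{events_per_day:,} events per day")  # 518,400,000 events per day
```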

Page 17: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

HADOOP TO THE RESCUE

2 datacenters in AMS

200+ nodes

Page 18: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

HADOOP TO THE RESCUE

• listen data

• listen metadata

• search data

• recommender data

• product testing data

• backend production data

• backend logs

Page 19: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

HADOOP AND DATA DEMOCRATIZATION

• Data is siloed on Hadoop

• Data governance does not exist

• Technical hurdles for access

• Not real-time

• Slow access

Page 20: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

AMAZON REDSHIFT

Fast, fully managed data warehouse (DW) service

Optimized for datasets of a petabyte or more

Fast query and I/O performance

Columnar storage technology

Page 21: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

BI INFRASTRUCTURE (2013)

[Diagram, components in pipeline order:]

- Source Systems: Hadoop, MySQL, MySQL (production db), External Systems, ...

- Pig/Ruby Scripts, Amazon EMR

- COPY

- Staging Area

- ETL Scripts

- DataWarehouse

- ETL Scripts

- Data Exploration

Job execution powered by: [logo not preserved]

Page 22: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

First: Gather data from the several source systems into S3

Hadoop

MySQL

External Systems

MySQL (production db)

Full/Daily Imports

MapReduce for:

- Listens

- Plays

- Impressions

- Affiliations

- ...

IMPORT DATA FROM SOURCE SYSTEMS

Page 23: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

Second: Rebuild staging area tables for full imports

IMPORT DATA FROM SOURCE SYSTEMS

Staging Area

tracks, users, client applications

...

• Based on configuration files

• CREATE statements generated

• Re-create DISTKEYs and SORTKEYs

• Full control over changes in the data model

YAML config files
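
Generating the staging tables from configuration could look roughly like the sketch below. The config layout, the `staging.tracks` table, its columns, and the function names are all hypothetical; the slides do not show the real files. A plain dict stands in for a parsed YAML file so the example needs no third-party parser.

```python
# Hypothetical table definition, as it might look after parsing a YAML file.
TRACKS_CONFIG = {
    "table": "staging.tracks",
    "columns": [
        {"name": "id", "type": "BIGINT"},
        {"name": "user_id", "type": "BIGINT"},
        {"name": "title", "type": "VARCHAR(255)"},
        {"name": "created_at", "type": "TIMESTAMP"},
    ],
    "distkey": "user_id",
    "sortkeys": ["created_at"],
}


def create_table_sql(config: dict) -> str:
    """Render a Redshift CREATE TABLE statement from a table config."""
    columns = ",\n".join(
        f"    {col['name']} {col['type']}" for col in config["columns"]
    )
    sql = f"CREATE TABLE {config['table']} (\n{columns}\n)"
    if config.get("distkey"):
        sql += f"\nDISTKEY ({config['distkey']})"
    if config.get("sortkeys"):
        sql += f"\nSORTKEY ({', '.join(config['sortkeys'])})"
    return sql + ";"


def rebuild_sql(config: dict) -> list[str]:
    """Statements that drop and re-create a staging table for a full import."""
    return [f"DROP TABLE IF EXISTS {config['table']};", create_table_sql(config)]


for statement in rebuild_sql(TRACKS_CONFIG):
    print(statement)
```

Keeping the distribution and sort keys in the config means a change to the data model is a one-line edit followed by a rebuild, which is one way to read the "full control" point on the slide.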

Page 24: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

Third: Import the data from S3 to Redshift

Staging Area

tracks, users, client applications

...

Full import: TRUNCATE & COPY

Daily import: COPY

IMPORT DATA FROM SOURCE SYSTEMS
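
The difference between the two import modes can be sketched as a small statement builder. The table names, bucket, IAM role, and file format (tab-delimited, gzipped) are assumptions for illustration; the slides do not show the actual COPY options, and the talk may well have used a different authorization clause than `IAM_ROLE`.

```python
def import_statements(table: str, s3_path: str, iam_role: str, full: bool) -> list[str]:
    """Build the SQL for loading one staging table from S3 into Redshift.

    A full import empties the table first; a daily import only appends.
    """
    copy = (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' DELIMITER '\\t' GZIP;"
    )
    return [f"TRUNCATE {table};", copy] if full else [copy]


# Hypothetical bucket and role, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/example-redshift-load"

full_import = import_statements(
    "staging.tracks", "s3://example-bucket/full/tracks/", ROLE, full=True
)
daily_import = import_statements(
    "staging.listens", "s3://example-bucket/daily/listens/2013-10-29/", ROLE, full=False
)

for statement in full_import + daily_import:
    print(statement)
```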

Page 25: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

ETL scripts divided into layers:

- Layer 1: Staging -> DW (dimensions)

- Layer 2: Staging -> DW (fact tables - raw data)

- Layer 3: DW -> DW (aggregated fact tables)

- Layer 4: DW -> Reporting Data Cubes (reporting data)

ETL AND DW DATAMODEL
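
As an illustration of what a Layer 3 step (DW -> DW, aggregated fact tables) might do, the sketch below aggregates a raw fact table into a daily summary. The table and column names are hypothetical, and an in-memory SQLite database stands in for Redshift so the example runs anywhere; the SQL itself is kept generic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Layer 2 output: raw fact table, one row per listen (hypothetical schema).
    CREATE TABLE fact_listens (listen_date TEXT, country TEXT, track_id INTEGER);
    INSERT INTO fact_listens VALUES
        ('2013-10-28', 'DE', 1),
        ('2013-10-28', 'DE', 2),
        ('2013-10-28', 'BR', 1),
        ('2013-10-29', 'BR', 1);

    -- Layer 3: aggregate the raw facts per day and country.
    CREATE TABLE agg_listens_daily AS
    SELECT listen_date, country, COUNT(*) AS listens
    FROM fact_listens
    GROUP BY listen_date, country;
""")

rows = conn.execute(
    "SELECT listen_date, country, listens FROM agg_listens_daily "
    "ORDER BY listen_date, country"
).fetchall()
print(rows)
```

A Layer 4 step would then read from a table like `agg_listens_daily` to build the reporting cubes, so report queries never touch the raw facts.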

Page 26: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

ETL AND DW DATAMODEL

Staging Area

- ETL Layer 1 & 2: Data Cleaning, Data Transformation (Ruby/SQL Scripts)

DataWarehouse

- ETL Layer 3: Data Aggregation (Ruby/SQL Scripts)

- ETL Layer 4: Data Presentation (SQL)

Data Exploration

Page 27: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

JOB SCHEDULE AND EXECUTION

Job-scheduling tool developed internally

Set dependencies between jobs

Execution on multiple machines

Supports all the ETL layers
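
The internal scheduling tool is not shown in the slides, but the core idea of dependencies between jobs can be sketched with a topological sort from the Python standard library (`graphlib`, Python 3.9+). The job names and their dependencies are hypothetical, and this is not the tool the talk describes.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each key maps to the set of jobs it depends on.
JOBS = {
    "import_tracks": set(),
    "import_listens": set(),
    "layer1_dim_tracks": {"import_tracks"},
    "layer2_fact_listens": {"import_listens", "layer1_dim_tracks"},
    "layer3_agg_listens": {"layer2_fact_listens"},
    "layer4_report_cube": {"layer3_agg_listens"},
}


def run_in_waves(jobs: dict) -> list[list[str]]:
    """Group jobs into waves; jobs within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = run_in_waves(JOBS)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Jobs in the same wave could be handed to different machines, which matches the "execution on multiple machines" point, while the wave order guarantees that each ETL layer only starts once its inputs exist.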

Page 28: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

DATA EXPLORATION

Simple and fast access to data

More time for “deep dives” into data

Individualized Reporting

Allows interactivity between users

Integrated with Redshift

Page 29: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

TIMELINE

Week 2, Week 4, Week 6, Week 8, Week 10, Week 12, Week 14, Week 16

Analysis Stage (Requirement Analysis)

• Gap Analysis

• Business Exploration (requirements interviews)

• Information Mapping Design

• Solution Design (Draft)

Milestone: End of Analysis Stage

Design & Build

• Define Infrastructure

• Design Data Model

Milestone: Infrastructure Ready!

• Build ETL

• Build Data Cubes

• Design Reports/Dashboards (Presentation Layer)

Milestone: BI 1.0 is built!

Test & Deploy

• System/Integration Tests

• User Acceptance

Milestone: BI 1.0 is tested!

• User Workshops

• BI 1.0 Evaluation

Milestone: BI 1.0 is ready to use!

Page 30: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

• Reports designed by end users

• Central repository for data analysis

• User interaction

• Data from one source only

• Scalable solution

• Data to the people!

DATA DEMOCRATIZATION

Page 31: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

what gets listened to where?

how many new users did we get from that campaign?

what makes a sound successful?

did the product change affect feature x?

how much revenue do we make in Brazil?

how do users use our iOS and Android apps?

do comments on tracks correlate with listens?

what makes an artist successful?

Page 32: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

QUESTIONS?

Page 33: Data Democratization at Soundcloud - Bruno Sá (SoundCloud)

THANK YOU!

P.S. WE’RE HIRING.

SOUNDCLOUD.COM/JOBS