big data platform at pinterest

27
Confident ial Mao Ye Big Data Platform at interest 1

Upload: qubole

Post on 15-Apr-2017

16.031 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Page 1: Big Data Platform at Pinterest

Confidential

Mao Ye

Big Data Platform at interest

1

Page 2: Big Data Platform at Pinterest

Data Architecture

Design Choices for Hadoop Platform

Pinball for Workflow Management

Page 3: Big Data Platform at Pinterest

Data Architecture

Page 4: Big Data Platform at Pinterest

Data at Pinterest• 60 Billion Pins• 1 Billion boards• 100M MAU• 60 PB of data on S3• 3 PB processed every day• 2000 node Hadoop cluster• 250 engineers

Page 5: Big Data Platform at Pinterest

Pinterest Data ArchitectureApp

Page 6: Big Data Platform at Pinterest

Pinterest Data ArchitectureApp

events

Kafka

Secor

Singer

Page 7: Big Data Platform at Pinterest

Pinterest Data ArchitectureApp

events

Kafka

Secor

Singer

Page 8: Big Data Platform at Pinterest

Pinterest Data ArchitectureApp

events

Kafka

SecorSkyline

Pinball

Redshift

Pinalytics

Features

Qubole (Hadoop)

Singer

Page 9: Big Data Platform at Pinterest

Design Choices for Hadoop Platform

Page 10: Big Data Platform at Pinterest

•Ephemeral clusters

•Access control layer

•Shared data store

•Easy deployment

Hadoop Platform Requirements

•Isolated multi-tenancy

•Elasticity

•Support multiple clusters

Page 11: Big Data Platform at Pinterest

Decoupling compute & storageHadoop Cluster 1

Transient HDFS

Hadoop Cluster 2

Transient HDFS

S3 Persistent Store

Page 12: Big Data Platform at Pinterest

Centralized Hive Metastore

Hive Metastore

Pig

Cascading

Hive

HDFS/S3

DataMetadata

Page 13: Big Data Platform at Pinterest

Multi-layered PackagingMapreduce JobsHadoop Jars/Libs

Job/User level Configs

Software Packages/LibsConfigs (OS/Hadoop)

Misc Sys Admin

OSBootstrap Script

Core SW

Runtime Staging(on S3)

Automated Configuration

(Masterless Puppet)

Baked AMI

Page 14: Big Data Platform at Pinterest

Executor Abstraction Layer

Hive Metastore

HDFS/S3

Qubole

Managed Hadoop

EMR

Executor

Pinball

Dev Server

Page 15: Big Data Platform at Pinterest

•API for simplified executor abstraction

•Advanced support for spot instances

•Baked AMI customization

Why Qubole?•Hadoop & Spark as managed services

•Tight integration with Hive

•Graceful cluster scaling

Page 16: Big Data Platform at Pinterest

Confidential

Pinball for Workflow Management

Page 17: Big Data Platform at Pinterest

Confidential

● Scale:o 60 Billion Pinso Hundreds of workflowso Thousands of jobso 500+ jobs in a workflowo 3 petabytes processed daily

● Support:o Hadoop, Cascading, Hive, Spark …

Scale of Processing

job

workflow

Page 18: Big Data Platform at Pinterest

Confidential

Why Pinball?● Requirements

o Simple abstractionso Extensible in futureo Reliable stateless computingo Easy to debugo Scales horizontallyo Can be upgraded w/o aborting workflowso Rich features like auto-retries, per-job emails, overrun

policies… ● Options

o Apache Oozie, Azkaban, Luigi

Page 19: Big Data Platform at Pinterest

Confidential

Pinball Design

Master

Worker

Scheduler

Command Line Clients

UI

Page 20: Big Data Platform at Pinterest

Confidential

● Workflow o A directed graph

of nodes called jobs

● Edgeo Run after

dependence● Node

o Job is a node

Workflow Model

Page 21: Big Data Platform at Pinterest

Confidential

Job State● Job state is captured in a token● Tokens are named hierarchically

Master

Job Token

version: 123name: /workflow/w1/jobowner: worker_0expiration: 1234567data: JobTemplate(....)

Page 22: Big Data Platform at Pinterest

Confidential

Job State Machine

RUNNABLE

RUNNINGWAITING

Page 23: Big Data Platform at Pinterest

Confidential

● Master keeps the state● Workers claim and execute tasks● Horizontally scalable

Master Worker Interaction

Worker Master Persistent Store

1: request 2: update

3: ack

Page 24: Big Data Platform at Pinterest

Confidential

Master

● Entire state is kept in memory● Each state update is synchronously

persisted before master replies to client● Master runs on a single thread – no

concurrency issues

Page 25: Big Data Platform at Pinterest

Confidential

Worker

Page 26: Big Data Platform at Pinterest

Confidential

Open SourceGit repo: https://github.com/pinterest/pinball

Mailing list:https://groups.google.com/forum/#!forum/pinball-users

Page 27: Big Data Platform at Pinterest

Confidential

Thank You