big data platform at pinterest

Post on 15-Apr-2017

16.031 Views

Category:

Data & Analytics

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Confidential

Mao Ye

Big Data Platform at interest

1

Data Architecture

Design Choices for Hadoop Platform

Pinball for Workflow Management

Data Architecture

Data at Pinterest• 60 Billion Pins• 1 Billion boards• 100M MAU• 60 PB of data on S3• 3 PB processed every day• 2000 node Hadoop cluster• 250 engineers

Pinterest Data ArchitectureApp

Pinterest Data ArchitectureApp

events

Kafka

Secor

Singer

Pinterest Data ArchitectureApp

events

Kafka

Secor

Singer

Pinterest Data ArchitectureApp

events

Kafka

SecorSkyline

Pinball

Redshift

Pinalytics

Features

Qubole (Hadoop)

Singer

Design Choices for Hadoop Platform

•Ephemeral clusters

•Access control layer

•Shared data store

•Easy deployment

Hadoop Platform Requirements

•Isolated multi-tenancy

•Elasticity

•Support multiple clusters

Decoupling compute & storageHadoop Cluster 1

Transient HDFS

Hadoop Cluster 2

Transient HDFS

S3 Persistent Store

Centralized Hive Metastore

Hive Metastore

Pig

Cascading

Hive

HDFS/S3

DataMetadata

Multi-layered PackagingMapreduce JobsHadoop Jars/Libs

Job/User level Configs

Software Packages/LibsConfigs (OS/Hadoop)

Misc Sys Admin

OSBootstrap Script

Core SW

Runtime Staging(on S3)

Automated Configuration

(Masterless Puppet)

Baked AMI

Executor Abstraction Layer

Hive Metastore

HDFS/S3

Qubole

Managed Hadoop

EMR

Executor

Pinball

Dev Server

•API for simplified executor abstraction

•Advanced support for spot instances

•Baked AMI customization

Why Qubole?•Hadoop & Spark as managed services

•Tight integration with Hive

•Graceful cluster scaling

Confidential

Pinball for Workflow Management

Confidential

● Scale:o 60 Billion Pinso Hundreds of workflowso Thousands of jobso 500+ jobs in a workflowo 3 petabytes processed daily

● Support:o Hadoop, Cascading, Hive, Spark …

Scale of Processing

job

workflow

Confidential

Why Pinball?● Requirements

o Simple abstractionso Extensible in futureo Reliable stateless computingo Easy to debugo Scales horizontallyo Can be upgraded w/o aborting workflowso Rich features like auto-retries, per-job emails, overrun

policies… ● Options

o Apache Oozie, Azkaban, Luigi

Confidential

Pinball Design

Master

Worker

Scheduler

Command Line Clients

UI

Confidential

● Workflow o A directed graph

of nodes called jobs

● Edgeo Run after

dependence● Node

o Job is a node

Workflow Model

Confidential

Job State● Job state is captured in a token● Tokens are named hierarchically

Master

Job Token

version: 123name: /workflow/w1/jobowner: worker_0expiration: 1234567data: JobTemplate(....)

Confidential

Job State Machine

RUNNABLE

RUNNINGWAITING

Confidential

● Master keeps the state● Workers claim and execute tasks● Horizontally scalable

Master Worker Interaction

Worker Master Persistent Store

1: request 2: update

3: ack

Confidential

Master

● Entire state is kept in memory● Each state update is synchronously

persisted before master replies to client● Master runs on a single thread – no

concurrency issues

Confidential

Worker

Confidential

Open SourceGit repo: https://github.com/pinterest/pinball

Mailing list:https://groups.google.com/forum/#!forum/pinball-users

Confidential

Thank You

top related