Big Data Platform at Pinterest
Post on 21-Apr-2017
Confidential
Mao Ye
Big Data Platform at Pinterest
Data Architecture
Design Choices for Hadoop Platform
Pinball for Workflow Management
Data Architecture
Data at Pinterest
• 60 Billion Pins
• 1 Billion boards
• 100M MAU
• 60 PB of data on S3
• 3 PB processed every day
• 2000 node Hadoop cluster
• 250 engineers
Pinterest Data Architecture
[Diagram components: App, events, Singer, Kafka, Secor, Skyline, Pinball, Qubole (Hadoop), Redshift, Pinalytics, Features]
Design Choices for Hadoop Platform
Hadoop Platform Requirements
• Ephemeral clusters
• Access control layer
• Shared data store
• Easy deployment
• Isolated multi-tenancy
• Elasticity
• Support multiple clusters
Decoupling compute & storage
Hadoop Cluster 1
Transient HDFS
Hadoop Cluster 2
Transient HDFS
S3 Persistent Store
Centralized Hive Metastore
Hive Metastore
Pig
Cascading
Hive
HDFS/S3
Data
Metadata
Multi-layered Packaging
Runtime Staging (on S3): Mapreduce Jobs, Hadoop Jars/Libs, Job/User level Configs
Automated Configuration (Masterless Puppet): Software Packages/Libs, Configs (OS/Hadoop), Misc Sys Admin
Baked AMI: OS, Bootstrap Script, Core SW
Executor Abstraction Layer
Hive Metastore
HDFS/S3
Qubole
Managed Hadoop
EMR
Executor
Pinball
Dev Server
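The executor layer in the diagram sits between callers (Pinball, dev servers) and the Hadoop backends (Qubole, EMR). A minimal sketch of that idea follows. The class and function names (`Executor`, `RecordingExecutor`, `make_executor`, `run_hive_query`) are hypothetical and not taken from the talk, and the stand-in backend only records queries rather than calling any real Qubole or EMR API.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical interface that hides the Hadoop backend from job code."""

    @abstractmethod
    def run_hive_query(self, query: str) -> str:
        """Submit a Hive query and return a backend-specific job id."""


class RecordingExecutor(Executor):
    """Stand-in backend: records submitted queries instead of running them."""

    def __init__(self, backend_name: str) -> None:
        self.backend_name = backend_name
        self.submitted = []

    def run_hive_query(self, query: str) -> str:
        self.submitted.append(query)
        return f"{self.backend_name}-{len(self.submitted)}"


def make_executor(backend: str) -> Executor:
    """Pick a backend by name so callers never touch backend-specific code."""
    if backend not in ("qubole", "emr"):
        raise ValueError(f"unknown backend: {backend}")
    return RecordingExecutor(backend)


executor = make_executor("qubole")
job_id = executor.run_hive_query("SELECT 1")
```

With this shape, moving a job from one backend to another would only change the name passed to `make_executor`.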
Why Qubole?
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
Pinball for Workflow Management
Scale of Processing
● Scale:
  o 60 Billion Pins
  o Hundreds of workflows
  o Thousands of jobs
  o 500+ jobs in a workflow
  o 3 petabytes processed daily
● Support:
  o Hadoop, Cascading, Hive, Spark …
job
workflow
Why Pinball?
● Requirements
  o Simple abstractions
  o Extensible in future
  o Reliable stateless computing
  o Easy to debug
  o Scales horizontally
  o Can be upgraded w/o aborting workflows
  o Rich features like auto-retries, per-job emails, overrun policies…
● Options
  o Apache Oozie, Azkaban, Luigi
Pinball Design
Master
Worker
Scheduler
Command Line Clients
UI
Workflow Model
● Workflow
  o A directed graph of nodes called jobs
● Edge
  o Run-after dependence
● Node
  o Job is a node
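The workflow model can be illustrated with a small sketch: jobs are nodes, and each edge says which jobs a job must run after. The function name `execution_order`, the example job names, and the use of Kahn's topological sort are assumptions made for illustration; the talk does not say how Pinball itself orders jobs.

```python
from collections import deque
from typing import Dict, List


def execution_order(run_after: Dict[str, List[str]]) -> List[str]:
    """Return one valid job order for a workflow.

    run_after maps every job to the jobs it must run after (its parents).
    Every job, including those without parents, must appear as a key.
    """
    pending = {job: len(parents) for job, parents in run_after.items()}
    children: Dict[str, List[str]] = {job: [] for job in run_after}
    for job, parents in run_after.items():
        for parent in parents:
            children[parent].append(job)

    # Jobs with no unfinished parents are ready to run.
    ready = deque(sorted(job for job, count in pending.items() if count == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children[job]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(run_after):
        raise ValueError("workflow contains a cycle")
    return order


# Hypothetical four-job workflow.
workflow = {
    "extract": [],
    "clean": ["extract"],
    "aggregate": ["clean"],
    "report": ["clean", "aggregate"],
}
order = execution_order(workflow)
```

A cycle in the run-after edges leaves some jobs permanently blocked, which the sketch reports as an error instead of returning a partial order.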
Job State
● Job state is captured in a token
● Tokens are named hierarchically
Master
Job Token
version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
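A rough sketch of a token with the five fields shown on the slide follows. The `Token` dataclass and the `tokens_under` helper are hypothetical, as are the example names and data strings, and this is not Pinball's actual token definition. The helper shows one benefit of hierarchical names: all tokens of a workflow can be selected by prefix.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    version: int                      # changes whenever the token is updated
    name: str                         # hierarchical, e.g. /workflow/w1/job_a
    owner: Optional[str] = None       # worker currently holding the token
    expiration: Optional[int] = None  # time after which ownership lapses
    data: Optional[str] = None        # opaque payload describing the job


def tokens_under(tokens: List[Token], prefix: str) -> List[Token]:
    """Select tokens whose hierarchical name starts with the given prefix."""
    return [token for token in tokens if token.name.startswith(prefix)]


tokens = [
    Token(version=1, name="/workflow/w1/job_a", data="extract"),
    Token(version=2, name="/workflow/w1/job_b", owner="worker_0",
          expiration=1234567, data="clean"),
    Token(version=3, name="/workflow/w2/job_a", data="report"),
]
# The trailing slash keeps /workflow/w1/ from also matching /workflow/w10/.
w1_tokens = tokens_under(tokens, "/workflow/w1/")
```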
Job State Machine
RUNNABLE
RUNNING
WAITING
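The transcript preserves the three state names but not the arrows between them, so the transition table in this sketch is an assumption: a waiting job becomes runnable once its dependencies are met, a runnable job becomes running when a worker claims it, and a running job returns either to waiting (finished) or to runnable (retry). The names `JobState`, `ALLOWED`, and `transition` are illustrative only.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "WAITING"
    RUNNABLE = "RUNNABLE"
    RUNNING = "RUNNING"


# Assumed transitions; the slide names the states but not the arrows.
ALLOWED = {
    JobState.WAITING: {JobState.RUNNABLE},
    JobState.RUNNABLE: {JobState.RUNNING},
    JobState.RUNNING: {JobState.WAITING, JobState.RUNNABLE},
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


state = JobState.WAITING
state = transition(state, JobState.RUNNABLE)
state = transition(state, JobState.RUNNING)
```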
Master Worker Interaction
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable
Worker, Master, Persistent Store
1: request
2: update
3: ack
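The three numbered steps can be sketched as a single claim call: the worker sends a request, the master updates the persistent store, and only then acknowledges. `PersistentStore`, `Master`, and `claim` are hypothetical names, and the store here is an in-memory list standing in for a durable database.

```python
from typing import List, Optional


class PersistentStore:
    """Stand-in for a durable store; it just remembers what was written."""

    def __init__(self) -> None:
        self.records = []

    def write(self, name: str, owner: str) -> None:
        self.records.append((name, owner))


class Master:
    """Holds job ownership and hands unowned jobs to workers."""

    def __init__(self, store: PersistentStore, runnable: List[str]) -> None:
        self.store = store
        self.owners = {name: None for name in runnable}

    def claim(self, worker: str) -> Optional[str]:
        """Step 1 (request): a worker asks for any job nobody owns yet."""
        for name, owner in self.owners.items():
            if owner is None:
                self.store.write(name, worker)  # step 2 (update)
                self.owners[name] = worker
                return name                     # step 3 (ack)
        return None  # nothing left to claim


store = PersistentStore()
master = Master(store, ["/workflow/w1/job_a", "/workflow/w1/job_b"])
first = master.claim("worker_0")
second = master.claim("worker_1")
third = master.claim("worker_2")
```

Because workers hold no state of their own in this sketch, adding more of them only adds more callers of `claim`, which is one reading of the "horizontally scalable" bullet.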
Master
● Entire state is kept in memory
● Each state update is synchronously persisted before master replies to client
● Master runs on a single thread – no concurrency issues
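A minimal sketch of these three properties, under stated assumptions: requests arrive on a queue, one thread applies them in order, and each update is flushed to an append-only log file before the reply is sent. The `serve` function, the log format, and the `None` sentinel are all illustrative choices, not Pinball's implementation.

```python
import json
import os
import queue
import tempfile


def serve(requests: "queue.Queue", log_path: str) -> dict:
    """Apply queued updates one at a time on the calling thread."""
    state = {}  # entire state lives in memory
    with open(log_path, "a") as log:
        while True:
            item = requests.get()
            if item is None:  # sentinel: stop serving
                return state
            name, value, reply = item
            log.write(json.dumps({"name": name, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())  # on disk before the client hears back
            state[name] = value
            reply.put("ack")


log_path = os.path.join(tempfile.mkdtemp(), "master.log")
requests = queue.Queue()
reply = queue.Queue()
requests.put(("/workflow/w1/job", "RUNNABLE", reply))
requests.put(None)
final_state = serve(requests, log_path)
```

Since only one thread ever touches `state`, no locks are needed; the cost is that every request waits for the disk write of the one before it.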
Worker
Open Source
Git repo: https://github.com/pinterest/pinball
Mailing list: https://groups.google.com/forum/#!forum/pinball-users
Thank You