improving efficiency of twitter infrastructure using chargeback

Post on 23-Jan-2018

Improving efficiency of Twitter Infrastructure using Chargeback

@vinucharanya @micheal

AGENDA

• Brief History
• Problem
• Chargeback
• Engineering Challenges
• The product
• Impact
• Future

© Getty Images from http://www.fifa.com/worldcup/news/y=2010/m=7/news=pride-for-africa-spain-strike-gold-2247372.html

2010

3283 Tweets Per Sec (TPS)

5X increase on avg. TPS

©The Simpsons

MONOLITH SERVICES

FENCING & OWNERSHIP

Clear isolation of services and their ownership.

RELIABILITY

Failure isolation and graceful degradation

SCALABILITY & EFFICIENCY

Scale independently ensuring efficient use of infrastructure

DEVELOPER PRODUCTIVITY

Make it simple for engineers to build and launch services quickly and easily

(Micro) Services Oriented Model

2013

August 2 at 7:21:50 PDT

143,199 Tweets Per Sec (TPS)

28X increase on avg. TPS

Hundreds and thousands of #events at any given instant

Most Retweeted Tweet in History

RELIABILITY • DEVELOPER AGILITY • SCALABILITY • EFFICIENCY

“Do More with Less”

Fast forward to 2016

INFRASTRUCTURE AND DATACENTER MANAGEMENT

CORE APPLICATION SERVICES
• TWEETS
• USERS
• SOCIAL GRAPH

PLATFORM SERVICES
• SEARCH
• MESSAGING & QUEUES
• CACHE
• MONITORING AND ALERTING
• REVERSE PROXY

FRAMEWORK/LIBRARIES
• FINAGLE (RPC)
• SCALDING (Map Reduce in Scala)
• HERON (Streaming Compute)
• JVM

MANAGEMENT TOOLS
• SELF SERVE
• SERVICE DIRECTORY
• CHARGEBACK
• CONFIG

DATA & ANALYTICS PLATFORM
• INTERACTIVE QUERY
• DATA DISCOVERY
• WORKFLOW MANAGEMENT

INFRASTRUCTURE SERVICES
• STORAGE: MANHATTAN (Key-Val Store), HDFS (File System), BLOBSTORE, GRAPH STORE
• COMPUTE: AURORA (Scheduler), HADOOP (Map-Reduce), MESOS (Cluster Manager)

DEPLOY (Workflows)

THOUSANDS OF SERVICES

HUNDREDS OF TEAMS

What is the overall use of infrastructure & platform resources across Twitter’s services?

How do you attribute resource consumption to teams and organizations?

How do you incentivize the right behavior to improve efficiency of resource usage?

CHARGEBACK

The ability to meter the allocation and utilization of resources per service and per engineering team, and to charge each team accordingly.
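The definition above can be illustrated with a minimal sketch. It assumes metering produces one record per service and resource with an allocated and a utilized quantity, and that a bill is quantity times unit price summed per team. The record layout, the team names, the prices, and the `basis` switch are all hypothetical; the deck does not say whether teams are charged on allocation or on utilization.

```python
from collections import defaultdict

# Hypothetical metering records: one row per service and resource for a billing period.
USAGE = [
    {"team": "ads", "service": "Ads Shard", "resource": "core-days",
     "allocated": 100.0, "utilized": 60.0},
    {"team": "ads", "service": "Ads Prediction", "resource": "core-days",
     "allocated": 50.0, "utilized": 40.0},
    {"team": "growth", "service": "Who To Follow", "resource": "core-days",
     "allocated": 20.0, "utilized": 5.0},
]

# Hypothetical unit prices per resource (placeholder numbers, not real costs).
UNIT_PRICE = {"core-days": 2.0}


def team_bills(usage, unit_price, basis="allocated"):
    """Sum quantity * unit price per team; `basis` selects the metered column."""
    if basis not in ("allocated", "utilized"):
        raise ValueError("basis must be 'allocated' or 'utilized'")
    bills = defaultdict(float)
    for row in usage:
        bills[row["team"]] += row[basis] * unit_price[row["resource"]]
    return dict(bills)


def team_utilization(usage):
    """Return utilized / allocated per team, skipping teams with no allocation."""
    allocated = defaultdict(float)
    utilized = defaultdict(float)
    for row in usage:
        allocated[row["team"]] += row["allocated"]
        utilized[row["team"]] += row["utilized"]
    return {team: utilized[team] / allocated[team]
            for team in allocated if allocated[team] > 0}


if __name__ == "__main__":
    print(team_bills(USAGE, UNIT_PRICE))
    print(team_utilization(USAGE))
```

Charging on allocation while reporting utilization next to it is one way to make unused reservations visible to the owning team.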

COMPUTE • STORAGE • PLATFORM AND OTHER SERVICES

SERVICE: Tweet Service

SERVICE: Ads Shard

SERVICE: Who To Follow

RESOURCE: unit of abstraction

MULTI-TENANCY: tenant management using canonical identifiers

SERVICE IDENTITY

RESOURCE CATALOG

METERING AND CHARGEBACK

SERVICE METADATA

UNIFIED CLOUD PLATFORM

SERVICE IDENTITY

A canonical way of identifying a service that consumes resources on various platform infrastructure.

PROBLEM

• Disparate identifiers across infrastructure and platform services

• Multiple provisioning workflows (Self-Serve, Tickets)

• Disparate ownership trackers (Email, LDAP)

• Lack of support for public cloud Identity and Access Management (IAM) systems

Project: chargeback
Team: Cloud Infra Mgmt
Source code: /cim

COMPUTE
role: cim-service
job_name: ui
env: prod
id: <role>.<env>.<job_name>

STORAGE
app_id: cost_reporting
id: <app_id>

BATCH COMPUTE
role: cim-service
pool: etl_pipe_prod
job_name: compute_cost
id: <role>.<pool>.<job_name>

OUR APPROACH

• Designed an Entity Model that

  • Defines a canonical identifier scheme across infrastructure and platform services

  • Defines the ownership structure within the org

• Single pane of glass for every developer to manage their project IDs (including abstracting out public cloud IAM systems)

• Provider APIs for infrastructure services to provision and manage identity

[Diagram labels: DASHBOARD, API, IDENTITY MANAGER, PROVISION, CONSUMPTION, INFRASTRUCTURE SERVICE (x5)]

IMPACT

• Source of truth for the identifier-to-org-structure mapping, improving service ownership within the org

• Enables service-to-service authentication/authorization

ENTITY MODEL FOR SERVICE IDENTITY

A model that provides a canonical identifier across infrastructure and platform services and ties it to an org structure.

BUSINESS OWNER
  1:N TEAM
    1:N PROJECT
      1:N SERVICE/SYSTEM ACCOUNT
        1:N <INFRA, CLIENTID>

ENTITY MODEL: EXAMPLE of services running (on Aurora/Mesos)

BUSINESS OWNER: REVENUE
  TEAM: ADS SERVING
    PROJECT: adshard
      SERVICE/SYSTEM ACCOUNT: adshard
        <INFRA, CLIENTID>: <Aurora, adshard.prod.adshard>
  TEAM: ADS PREDICTION
    PROJECT: prediction
      SERVICE/SYSTEM ACCOUNT: ads-prediction
        <INFRA, CLIENTID>: <Aurora, ads-prediction.prod.campaign-x>
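The entity model can be sketched as nested records plus a reverse index from an `<INFRA, CLIENTID>` pair back to its owners. This is only an illustration: the class names, the `ownership_index` helper, and the placement of both teams under the REVENUE business owner are assumptions made for the example, not the deck's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ClientKey = Tuple[str, str]  # (infra, client_id), e.g. ("Aurora", "adshard.prod.adshard")


@dataclass
class ServiceAccount:
    name: str
    client_ids: List[ClientKey] = field(default_factory=list)


@dataclass
class Project:
    name: str
    accounts: List[ServiceAccount] = field(default_factory=list)


@dataclass
class Team:
    name: str
    projects: List[Project] = field(default_factory=list)


@dataclass
class BusinessOwner:
    name: str
    teams: List[Team] = field(default_factory=list)


def ownership_index(owners: List[BusinessOwner]) -> Dict[ClientKey, Tuple[str, str, str, str]]:
    """Map each (infra, client_id) to (business owner, team, project, account)."""
    index: Dict[ClientKey, Tuple[str, str, str, str]] = {}
    for owner in owners:
        for team in owner.teams:
            for project in team.projects:
                for account in project.accounts:
                    for key in account.client_ids:
                        if key in index:
                            # A canonical identifier must resolve to exactly one owner chain.
                            raise ValueError(f"duplicate identifier: {key}")
                        index[key] = (owner.name, team.name, project.name, account.name)
    return index


# The example from the slide, expressed with the hypothetical classes above.
REVENUE = BusinessOwner("REVENUE", [
    Team("ADS SERVING", [Project("adshard", [
        ServiceAccount("adshard", [("Aurora", "adshard.prod.adshard")])])]),
    Team("ADS PREDICTION", [Project("prediction", [
        ServiceAccount("ads-prediction", [("Aurora", "ads-prediction.prod.campaign-x")])])]),
])

INDEX = ownership_index([REVENUE])

if __name__ == "__main__":
    print(INDEX[("Aurora", "adshard.prod.adshard")])
```

With such an index, a metering record that carries only an infrastructure name and client ID can be attributed to a team and business owner.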

RESOURCE CATALOG

A consistent way of identifying and inventorying the resources of various platform infrastructure.

PROBLEM

• Lack of clarity on what is available and how many resources are consumed

• Need to capture resource fluidity across infrastructure and platform services

• Better support for modeling abstract resources (e.g., QPS, Tweets per Second)

• Need to define the TCO (Total Cost of Ownership) of a resource per unit time

COMPUTE: CPU, MEMORY, DISK

STORAGE: STORAGE IN GB, WPS, RPS

BATCH COMPUTE: CPU, FILES ACCESSED, STORAGE IN GB

CORES MEMORY DISK

application = Task(
  name = 'application',
  resources = Resources(cpu = 1.0, ram = 512 * MB, disk = 1024 * MB),
  processes = [stage_application, run_application],
  constraints = order(stage_application, run_application))

GPU NETWORK

Need for Fluidity!

• Defining unit price for a resource

• Framework to price resources

• Ensure Total Cost of Ownership, e.g. license cost, chargeback cost from other services, human cost, etc.

• Support for time granularity, e.g. machines/VMs used per day, cores used per day

Twitter Compute Platform: Total Cost of Ownership, $X per core-day

• Used Cores

• Operational Overhead

• Headroom

• Underutilized Quota Allocation

• Container Size Buffer (Underutilized Reservation)

• Excess Quota and Reservation

• Non-Prod Used Cores

• Disaster Recovery & Event Spikes
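One possible reading of the breakdown above is that the price per core-day has to recover the whole platform cost, including capacity that no tenant uses directly. The sketch below assumes exactly that: total cost for a period is divided by the core-days tenants actually used, so overhead, headroom, and excess reservation are folded into the price. The deck does not give the formula, and every name and number here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreDays:
    """Core-days in one billing period, split by the categories on the slide."""
    used_prod: float
    used_non_prod: float
    operational_overhead: float
    headroom: float
    disaster_recovery_and_spikes: float
    excess_quota_and_reservation: float

    def total(self) -> float:
        """All core-days the platform had to provide in the period."""
        return (self.used_prod + self.used_non_prod + self.operational_overhead
                + self.headroom + self.disaster_recovery_and_spikes
                + self.excess_quota_and_reservation)

    def billable(self) -> float:
        """Assumption: only core-days used by tenants are billed directly."""
        return self.used_prod + self.used_non_prod


def unit_price_per_core_day(total_cost: float, capacity: CoreDays) -> float:
    """Spread the full cost of ownership over the billable core-days."""
    billable = capacity.billable()
    if billable <= 0:
        raise ValueError("no billable core-days in this period")
    return total_cost / billable


# Placeholder figures for illustration only.
PERIOD = CoreDays(used_prod=600.0, used_non_prod=100.0, operational_overhead=50.0,
                  headroom=100.0, disaster_recovery_and_spikes=50.0,
                  excess_quota_and_reservation=100.0)
TOTAL_COST = 7000.0

PRICE = unit_price_per_core_day(TOTAL_COST, PERIOD)
RAW_COST = TOTAL_COST / PERIOD.total()

if __name__ == "__main__":
    print(PRICE, RAW_COST)
```

The gap between the two figures is the share of the price that pays for capacity held in reserve, which is why reducing excess quota can lower the unit price.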

ENTITY MODEL FOR RESOURCE CATALOG

A model that supports resource fluidity and captures and manages the unit price of a resource over time.

PROVIDER
  1:N INFRASTRUCTURE SERVICE
    1:N OFFERINGS
      1:N OFFER MEASURES
        1:N OFFER MEASURE COST

ENTITY MODEL: EXAMPLE of Resource Catalog

PROVIDER: TWITTER DC/PUBLIC CLOUD
  INFRASTRUCTURE SERVICE: AURORA
    OFFERING: COMPUTE
      OFFER MEASURE: CORE-DAYS
        OFFER MEASURE COST: $X

PROVIDER: TWITTER DC
  INFRASTRUCTURE SERVICE: HADOOP
    OFFERING: STORAGE
      OFFER MEASURE: GB-RAM, OFFER MEASURE COST: $X
      OFFER MEASURE: FILE ACCESSES, OFFER MEASURE COST: $Y
      …
    OFFERING: PROCESSING CLUSTER
      OFFER MEASURE: GB-RAM, OFFER MEASURE COST: $M
      OFFER MEASURE: FILE ACCESSES, OFFER MEASURE COST: $N
      …
    …
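A catalog with this shape could be sketched as nested dictionaries in which each offer measure carries a dated list of costs, so that the unit price in effect on a given day can be looked up. The class names, the `unit_cost` helper, the dates, and the prices are assumptions for illustration; the slide only gives the entity names and the placeholder `$X`.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class OfferMeasureCost:
    unit_cost: float       # price per unit of the measure
    effective_from: date   # first day this price applies


@dataclass
class OfferMeasure:
    name: str
    costs: List[OfferMeasureCost] = field(default_factory=list)

    def cost_on(self, day: date) -> float:
        """Return the most recent unit cost that took effect on or before `day`."""
        applicable = [c for c in self.costs if c.effective_from <= day]
        if not applicable:
            raise LookupError(f"no price for {self.name} on {day}")
        return max(applicable, key=lambda c: c.effective_from).unit_cost


@dataclass
class Offering:
    name: str
    measures: Dict[str, OfferMeasure] = field(default_factory=dict)


@dataclass
class InfrastructureService:
    name: str
    offerings: Dict[str, Offering] = field(default_factory=dict)


@dataclass
class Provider:
    name: str
    services: Dict[str, InfrastructureService] = field(default_factory=dict)


def unit_cost(provider: Provider, service: str, offering: str, measure: str, day: date) -> float:
    """Walk provider -> service -> offering -> measure and price it for `day`."""
    return provider.services[service].offerings[offering].measures[measure].cost_on(day)


# Placeholder prices standing in for the "$X" on the slide.
CORE_DAYS = OfferMeasure("CORE-DAYS", [
    OfferMeasureCost(2.0, date(2016, 1, 1)),
    OfferMeasureCost(1.5, date(2016, 7, 1)),
])
TWITTER_DC = Provider("TWITTER DC", {
    "AURORA": InfrastructureService("AURORA", {
        "COMPUTE": Offering("COMPUTE", {"CORE-DAYS": CORE_DAYS}),
    }),
})

if __name__ == "__main__":
    print(unit_cost(TWITTER_DC, "AURORA", "COMPUTE", "CORE-DAYS", date(2016, 8, 1)))
```

Keeping costs as dated entries rather than a single number lets past bills be recomputed with the price that applied at the time.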

METERING PIPELINE

HIGH LEVEL ARCHITECTURE

The Product

TEAM/ORG BILL

INFRASTRUCTURE PNL

ORG/TEAM BUDGET

CUSTOM REPORTS

• Infrastructure & Platform Owners
  • Overall cluster growth
  • Allocation vs. utilization of resources by customer team

• Service Owners
  • Allocation vs. utilization of resources across each infrastructure & platform

• Finance
  • Budget management (budget vs. spend)

• Execs
  • Efficiency
  • Trends

What has been the Impact?

[Chart: Twitter Compute Platform (Aurora/Mesos), Allocated Quota vs. Utilized Cores, 3 months (Jun 1 - Sep 1, 2015)]

[Chart: Twitter Compute Platform (Aurora/Mesos), Allocated Quota vs. Utilized Cores, 4 months (Sep 1, 2015 - Jan 1, 2016)]

33% more core usage against reservation compared to May 2015

IMPACT

• Ensures unit price computation that is true to cost

• Input for capacity planning and budgeting

• Visibility into organizational spend, enabling accountability

• Improved utilization of infrastructure service resources

• Enables comparison with public cloud offerings

• Improved service ownership

Kite - Unified Cloud Platform

A cloud-agnostic service lifecycle manager

• DASHBOARD (SINGLE PANE OF GLASS)

• SERVICE IDENTITY MANAGER

• RESOURCE PROVISIONING MANAGER

• REPORTING

• SERVICE LIFECYCLE WORKFLOWS

• METADATA, RESOURCE QUOTA MANAGEMENT, DEPLOY, METERING & CHARGEBACK, IDENTITY

• PROVIDER APIS & ADAPTERS

• INFRASTRUCTURE & PLATFORM SERVICES

@vinucharanya

@dpkagrawal

@pragashjj

@fvrojas

@micheal

@igb

@imjessicayuen

@_jordanly

@xcv58
