
An Overview of Cloud Computing

Raghu Ramakrishnan
Chief Scientist, Audience and Cloud Computing
Research Fellow, Yahoo! Research

Reflects many discussions with: Eric Baldeschwieler, Jay Kistler, Chuck Neerdaels, Shelton Shugar, and Raymie Stata

and joint work with the Sherpa team, in particular:
Brian Cooper, Utkarsh Srivastava, Adam Silberstein and Nick Puz in Y! Research
Chuck Neerdaels, P.P. Suryanarayanan and many others in CCDI


CCDI—Research Collaboration

Yahoo! Research

• Raghu Ramakrishnan
• Brian Cooper
• Utkarsh Srivastava
• Adam Silberstein
• Nick Puz
• Rodrigo Fonseca

CCDI

• Chuck Neerdaels
• P.P.S. Narayan
• Kevin Athey
• Toby Negrin
• Plus Dev/QA teams

SCENARIOS
Pie-in-the-sky

Living in the Clouds

• We want to start a new website, FredsList.com
• Our site will provide listings of items for sale, jobs, etc.
• As time goes on, we’ll add more features
  – And illustrate how more cloud capabilities (and corresponding infrastructure components) are used as needed
• List of capabilities/components is illustrative, not exhaustive
• Our cloud provides a “dataset” abstraction
  – FredsList doesn’t worry about the underlying components

Step 1: Listings

FredsList wants to store listings as (key, category, description)

1234323, transportation, For sale: one bicycle, barely used
5523442, childcare, Nanny available in San Jose
215534, wanted, Looking for issue 1 of Superman comic book

DECLARE DATASET Listings AS
( ID String PRIMARY KEY,
  Category String,
  Description Text )

[Diagram: FredsList.com application, Simple Web Service APIs, Database (Sherpa)]

Step 2: Search

FredsList’s customers quickly ask for keyword search

Example queries: “bicycle”, “dvd’s”, “nanny”

ALTER Listings
SET Description SEARCHABLE

[Diagram: FredsList.com application, Simple Web Service APIs, Database (Sherpa), Search (Vespa), Messaging (YMB)]

Step 3: Photos

FredsList decides to add photos to listings

ALTER Listings
ADD Photo BLOB

[Diagram: FredsList.com application, Simple Web Service APIs, Database (Sherpa), Search (Vespa), Messaging (YMB), Storage (MObStor; foreign key photo → listing)]

Step 4: Data Analysis

FredsList wants to analyze its listings to get statistics about category, do geocoding, etc.

ALTER Listings
MAKE ANALYZABLE

Batch export to Compute (Grid):
• Pig query to analyze categories
• Hadoop program to geocode data
• Hadoop program to generate fancy pages for listings

[Diagram: FredsList.com application, Simple Web Service APIs, Database (Sherpa), Search (Vespa), Messaging (YMB), Storage (MObStor; foreign key photo → listing), Compute (Grid)]
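The category analysis in Step 4 is, in shape, a group-and-count over the exported listings. The following is a minimal plain-Python sketch of that map/reduce pattern, not Pig or Hadoop code. The sample rows come from the Step 1 slide; the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Sample rows from the Step 1 slide: (id, category, description).
LISTINGS = [
    ("1234323", "transportation", "For sale: one bicycle, barely used"),
    ("5523442", "childcare", "Nanny available in San Jose"),
    ("215534", "wanted", "Looking for issue 1 of Superman comic book"),
]

def map_phase(listings):
    """Emit one (category, 1) pair per listing."""
    for _id, category, _description in listings:
        yield category, 1

def reduce_phase(pairs):
    """Sum the counts for each category key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def count_by_category(listings):
    """Hypothetical helper: chain the two phases over an exported batch."""
    return reduce_phase(map_phase(listings))
```

On a real grid the map and reduce phases would run on different machines over partitions of the export; here they are chained in one process to show the data flow only.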

Step 5: Performance

FredsList wants to reduce its data access latency

ALTER Listings
MAKE CACHEABLE

[Diagram: FredsList.com application, Simple Web Service APIs, Database (Sherpa), Search (Vespa), Messaging (YMB), Storage (MObStor; foreign key photo → listing), Compute (Grid; batch export), Caching (memcached)]
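The slide does not say how a cacheable dataset is kept in step with the database. One common arrangement is cache-aside with invalidation on write, sketched below under that assumption. The class name is hypothetical, and a plain dict stands in for a memcached client.

```python
class CachedListings:
    """Cache-aside reads: check the cache first, fall back to the store."""

    def __init__(self, store):
        self.store = store   # authoritative data (plays the role of the database)
        self.cache = {}      # stand-in for a memcached client (assumption)
        self.misses = 0      # counts reads that had to go to the store

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.misses += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def set(self, key, value):
        self.store[key] = value
        # Invalidate rather than update, so the next read reloads from the store.
        self.cache.pop(key, None)
```

Invalidating on write keeps the sketch simple; a hosted service could just as well refresh the cache from its update stream.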

EYES TO THE SKIES
Motherhood-and-Apple-Pie

Why Clouds?

• On-demand infrastructure to create a fundamental shift in the OE curve. Lets us:
  – Do things we can’t do
  – Reduce time to market
  – Build more robustly, more efficiently, more globally, more completely, for a given budget
• Cloud services should do the heavy lifting of scaling & high availability
  – Today, this is done at the app level, which is not productive


Requirements for Cloud Services

• Multitenant. A cloud service must support multiple, organizationally distant customers.

• Elasticity. Tenants should be able to negotiate and receive resources/QoS on-demand.

• Resource Sharing. Ideally, spare cloud resources should be transparently applied when a tenant’s negotiated QoS is insufficient, e.g., due to spikes.

• Horizontal scaling. It should be possible to add cloud capacity in small increments; this should be transparent to the tenants of the service.

• Metering. A cloud service must support accounting that reasonably ascribes operational and capital expenditures to each of the tenants of the service.

• Security. A cloud service should be secure in that tenants are not made vulnerable because of loopholes in the cloud.

• Availability. A cloud service should be highly available.

• Operability. A cloud service should be easy to operate, with few operators. Operating costs should scale linearly or better with the capacity of the service.

Types of Cloud Services

• Two kinds of cloud services:
  – Horizontal Cloud Services
    • Functionality enabling tenants to build applications or new services on top of the cloud
  – Functional Cloud Services
    • Functionality that is useful in and of itself to tenants. E.g., various SaaS instances, such as Salesforce.com; Google Analytics and Yahoo!’s IndexTools; Yahoo! properties aimed at end-users and small businesses, e.g., Flickr, Groups, Mail, News, Shopping
    • Could be built on top of horizontal cloud services or from scratch
    • Yahoo! has been offering these for a long while (e.g., Mail for SMB, Groups, Flickr, BOSS, Ad exchanges)

Horizontal Cloud Services

• Horizontal cloud services are foundations on which tenants build applications or new services. They should be:
  – Semantics-free. Must be “generic infrastructure,” and not tied to specific app-logic.
    • May provide the ability to inject application logic through well-defined APIs
  – Broadly applicable. Must be broadly applicable (i.e., it can’t be intended for just one or two properties).
  – Fault-tolerant over commodity hardware. Must be built using inexpensive commodity hardware, and should mask component failures.
• While each cloud service provides value, the power of the cloud paradigm will depend on a collection of well-chosen, loosely coupled services that collectively make it easy to quickly develop and operate innovative web applications.

What’s in the Horizontal Cloud?

Simple Web Service APIs

Horizontal Cloud Services
• Edge Content Services, e.g., YCS, YCPI
• Provisioning & Virtualization, e.g., EC2
• Batch Storage & Processing, e.g., Hadoop & Pig
• Operational Storage, e.g., S3, MObStor, Sherpa
• Other Services: Messaging, Workflow, virtual DBs & Webserving

Shared Infrastructure
• ID & Account Management
• Monitoring & QoS
• Metering, Billing, Accounting

Security

Common Approaches to QA, Production Engineering, Performance Engineering, Datacenter Management, and Optimization

Yahoo! CCDI Thrust Areas

• Fast Provisioning and Machine Virtualization: On demand, deliver a set of hosts imaged with desired software and configured against standard services
  – Multiple hosts may be multiplexed onto the same physical machine.
• Batch Storage and Processing: Scalable data storage optimized for batch processing, together with computational capabilities
• Operational Storage: Persistent storage that supports low-latency updates and flexible retrieval
• Edge Content Services: Support for dealing with network topology, communication protocols, caching, and BCP

Rest of today’s talk

Hadoop: Batch Storage/Analysis

Why is batch processing important?

• Whether it’s
  – response-prediction for advertising
  – machine-learned relevance for Search, or
  – content optimization for audience,
  data-intensive computing is increasingly central to everything Yahoo! does
  – Hadoop is central to addressing this need
• Hadoop is a case-study in our cloud vision
  – Processes enormous amounts of data
  – Provides horizontal scaling and fault-tolerance for our users
  – Allows those users to focus on their app logic

[Diagram: HDFS, Map-Reduce, High-level query layer (Pig), [Workflow]]


SHERPA

To Help You Scale Your Mountains of Data


The Yahoo! Storage Problem

– Small records – 100KB or less

– Structured records - tens, hundreds or thousands of fields

– Extreme data scale - Tens of TB

– Extreme request scale - Tens of thousands of requests/sec

– Low latency globally - 20+ datacenters worldwide

– High Availability - outages cost $millions

– Variable usage patterns - as applications and users change


The Sherpa Solution

The next generation global-scale record store

– Record-orientation: Routing, data storage optimized for low-latency record access

– Scale out: Add machines to scale throughput (while keeping latency low)

– Asynchrony: Pub-sub replication to far-flung datacenters to mask propagation delay

– Consistency model: Reduce complexity of asynchrony for the application programmer

– Cloud deployment model: Hosted, managed service to reduce app time-to-market and enable on demand scale and elasticity

What is Sherpa?

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)

Parallel database
Geographic replication
Structured, flexible schema
Hosted, managed infrastructure

A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E

[Figure: the table rows above appear several times in the diagram]

What Will Sherpa Become?

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)

Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
Hosted, managed infrastructure

A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E

[Figure: the table rows above appear several times in the diagram]

Sherpa Design Goals

Scalability
• Thousands of machines
• Easy to add capacity
• Restrict query language to avoid costly queries

Geographic replication
• Asynchronous replication around the globe
• Low-latency local access

High availability and fault tolerance
• Automatically recover from failures
• Serve reads and writes despite failures

Consistency
• Per-record guarantees
• Timeline model
• Option to relax if needed

Multiple access paths
• Hash table, ordered table
• Primary, secondary access

Hosted service
• Applications plug and play
• Share operational cost

Technology Elements

Applications
PNUTS API, Tabular API

PNUTS
• Query planning and execution
• Index maintenance

Distributed infrastructure for tabular data
• Data partitioning
• Update consistency
• Replication

YDOT FS
• Ordered tables

YDHT FS
• Hash tables

YMB
• Pub/sub messaging

Zookeeper
• Consistency service

YCA: Authorization

Data Manipulation

• Per-record operations
  – Get
  – Set
  – Delete
• Multi-record operations
  – Multiget
  – Scan
  – Getrange
• Web service (RESTful) API
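To make the operation list concrete, here is a minimal in-memory sketch of a table exposing the six operations named above. It is not the Sherpa REST API. The class, the method signatures, and the half-open range convention in `getrange` are all assumptions made for illustration.

```python
class RecordStore:
    """In-memory stand-in for a table with the operations named on the slide."""

    def __init__(self):
        self._rows = {}

    # Per-record operations
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        self._rows[key] = record

    def delete(self, key):
        self._rows.pop(key, None)

    # Multi-record operations
    def multiget(self, keys):
        """Return the records that exist for the given keys."""
        return {k: self._rows[k] for k in keys if k in self._rows}

    def scan(self):
        """Yield every (key, record) pair in key order."""
        for k in sorted(self._rows):
            yield k, self._rows[k]

    def getrange(self, low, high):
        """Return (key, record) pairs with low <= key < high (assumed convention)."""
        return [(k, r) for k, r in self.scan() if low <= k < high]
```

Note that `getrange` only makes sense cheaply on an ordered table; on a hash-organized table it degenerates into a full scan, as it does in this sketch.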


Tablets—Hash Table

Name        Description                            Price
Grape       Grapes are good to eat                 $12
Lime        Limes are green                        $9
Apple       Apple is wisdom                        $1
Strawberry  Strawberry shortcake                   $900
Orange      Arrgh! Don’t get scurvy!               $2
Avocado     But at what price?                     $3
Lemon       How much did you pay for this lemon?   $1
Tomato      Is this a vegetable?                   $14
Banana      The perfect fruit                      $2
Kiwi        New Zealand                            $8

[Figure: rows grouped into tablets by hash value; boundaries marked 0x0000, 0x2AF3, 0x911F, 0xFFFF]


Tablets—Ordered Table

Name        Description                            Price
Apple       Apple is wisdom                        $1
Avocado     But at what price?                     $3
Banana      The perfect fruit                      $2
Grape       Grapes are good to eat                 $12
Kiwi        New Zealand                            $8
Lemon       How much did you pay for this lemon?   $1
Lime        Limes are green                        $9
Orange      Arrgh! Don’t get scurvy!               $2
Strawberry  Strawberry shortcake                   $900
Tomato      Is this a vegetable?                   $14

[Figure: rows grouped into tablets by key order; boundaries marked A, H, Q, Z]
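The two tablet slides differ only in what is partitioned: the hash of the key, or the key itself. Below is a minimal sketch of both lookups. It reuses the boundary values shown on the slides, but the hash function (CRC32 truncated to 16 bits) and the function names are assumptions; the slides do not say how Sherpa hashes keys.

```python
import bisect
import zlib

# Upper bounds of three hash-range tablets, using the boundary values on the slide.
HASH_BOUNDS = [0x2AF3, 0x911F, 0xFFFF]
# Lower bounds of three key-range tablets; "" stands in for the minimum key,
# "H" and "Q" are the interior boundary letters on the slide.
KEY_LOWER_BOUNDS = ["", "H", "Q"]

def hash_tablet(key):
    """Tablet index for a hash-organized table."""
    # 16-bit hash; the choice of CRC32 is an assumption for illustration.
    h = zlib.crc32(key.encode("utf-8")) & 0xFFFF
    return bisect.bisect_left(HASH_BOUNDS, h)

def ordered_tablet(key):
    """Tablet index for an ordered table: the last range starting at or before the key."""
    return bisect.bisect_right(KEY_LOWER_BOUNDS, key) - 1
```

With these boundaries, `ordered_tablet` places Apple and Grape in the first tablet, Kiwi in the second, and Tomato in the third, so neighbouring keys stay together. `hash_tablet` scatters neighbouring keys, which spreads load but makes range scans expensive.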

Flexible Schema

Posted date  Listing id  Item   Price
6/1/07       424252      Couch  $570
6/1/07       763245      Bike   $86
6/3/07       211242      Car    $1123
6/5/07       421133      Lamp   $15

Additional fields present on only some records: Color (Red), Condition (Good, Fair)

Detailed Architecture

[Diagram: Clients, REST API, Routers, Tablet controller, Storage units, and YMB, shown for the local region and for remote regions]

Tablet Splitting and Balancing

Each storage unit has many tablets (horizontal partitions of the table)

Tablets may grow over time

Overfull tablets split

Storage unit may become a hotspot

Shed load by moving tablets to other servers

[Diagram legend: Storage unit, Tablet]
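The two mechanisms on this slide, splitting an overfull tablet and moving tablets off a hot storage unit, can be sketched as follows. The size threshold, the load measure (record count), and the choice to move the smallest tablet are all assumptions made for illustration; the slides do not describe Sherpa's actual policy.

```python
MAX_TABLET_SIZE = 4  # illustrative threshold, not a Sherpa setting

def split_if_overfull(tablet):
    """Split a sorted list of keys in half once it exceeds MAX_TABLET_SIZE."""
    if len(tablet) <= MAX_TABLET_SIZE:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]

def shed_load(units):
    """Move one tablet from the most loaded storage unit to the least loaded one.

    `units` maps a storage-unit name to its list of tablets; load is measured
    here simply as the number of records held.
    """
    def load(name):
        return sum(len(t) for t in units[name])

    hot = max(units, key=load)
    cold = min(units, key=load)
    if hot != cold and len(units[hot]) > 1:
        # Move the smallest tablet so the transfer itself stays cheap.
        tablet = min(units[hot], key=len)
        units[hot].remove(tablet)
        units[cold].append(tablet)
    return units
```

Because tablets are much smaller than a storage unit, moving one is a fine-grained adjustment, which is what lets capacity be added in small increments.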


QUERY PROCESSING


Accessing Data

1. Get key k
2. Get key k
3. Record for key k
4. Record for key k

[Diagram: numbered message flow for a single-record read; three storage units (SU) shown]

Bulk Read

1. {k1, k2, … kn}
2. Get k1, Get k2, Get k3

[Diagram: scatter/gather server in front of three storage units (SU)]
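A bulk read hands a set of keys to a scatter/gather server, which issues the individual gets and merges the answers. The sketch below shows one way to do that: group keys by storage unit, read the groups in parallel, and combine the results. The `locate` callback and the dict-backed storage units are stand-ins, not Sherpa components.

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(keys, locate, units):
    """Fetch many keys by grouping them per storage unit and reading in parallel.

    `locate(key)` returns the name of the storage unit holding the key, and
    `units` maps each name to a dict of records; both are hypothetical stand-ins.
    """
    groups = {}
    for key in keys:
        groups.setdefault(locate(key), []).append(key)

    def fetch(item):
        unit, unit_keys = item
        store = units[unit]
        return {k: store[k] for k in unit_keys if k in store}

    result = {}
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        for partial in pool.map(fetch, groups.items()):
            result.update(partial)
    return result
```

Grouping first means each storage unit sees one request per bulk read instead of one per key.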

Range Queries in YDOT

• Clustered, ordered retrieval of records

Router map (key range → storage unit):
  start … Cantaloupe → Storage unit 1
  Cantaloupe … Lime → Storage unit 3
  Lime … Strawberry → Storage unit 2
  Strawberry … end → Storage unit 1

Tablets:
  Apple, Avocado, Banana, Blueberry
  Cantaloupe, Grape, Kiwi, Lemon
  Lime, Mango, Orange
  Strawberry, Tomato, Watermelon

Query Grapefruit…Pear? is split into Grapefruit…Lime? and Lime…Pear?

[Diagram: router in front of Storage unit 1, Storage unit 2, and Storage unit 3]
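The Grapefruit…Pear example can be reproduced with a sorted list of tablet boundaries. The sketch below assumes the router keeps each tablet's starting key and its storage unit, as the slide's router table appears to show; the half-open ranges and all names are illustrative.

```python
import bisect

# Router map as read from the slide: each tablet starts at a boundary key.
# The empty string stands in for the minimum key (assumption).
BOUNDARIES = ["", "Cantaloupe", "Lime", "Strawberry"]
UNITS = ["Storage unit 1", "Storage unit 3", "Storage unit 2", "Storage unit 1"]

def split_range(low, high):
    """Split the key range [low, high) into per-tablet sub-ranges."""
    first = bisect.bisect_right(BOUNDARIES, low) - 1
    pieces = []
    for i in range(first, len(BOUNDARIES)):
        start = max(low, BOUNDARIES[i])
        end = BOUNDARIES[i + 1] if i + 1 < len(BOUNDARIES) else None
        if end is None or high <= end:
            # The query ends inside this tablet.
            pieces.append((UNITS[i], start, high))
            break
        pieces.append((UNITS[i], start, end))
    return pieces
```

Calling `split_range("Grapefruit", "Pear")` yields one piece on Storage unit 3 covering Grapefruit to Lime and one on Storage unit 2 covering Lime to Pear, matching the split drawn on the slide.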

Updates

1. Write key k
2. Write key k
3. Write key k
4. (no label)
5. SUCCESS
6. Write key k
7. Sequence # for key k
8. Sequence # for key k

[Diagram: numbered message flow among routers, message brokers, and three storage units (SU)]

ASYNCHRONOUS REPLICATION AND CONSISTENCY


Asynchronous Replication

Consistency Model

• Goal: make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key “Brian”?

[Timeline for the record: Record inserted, then a series of Updates, then Delete. Versions v. 1 through v. 8 along the Time axis form Generation 1.]

Consistency Model

[Timeline, repeated across the following slides: versions v. 1 through v. 8 of Generation 1 along the Time axis, with earlier versions marked “Stale version” and the latest marked “Current version”. Each slide overlays one operation:]

• Read
• Read up-to-date
• Read ≥ v.6
• Write
• Write if = v.7 (ERROR)

Mechanism: per-record mastership
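The operations overlaid on the timeline can be sketched as methods on a single versioned record. This is a single-process model with one lagging replica, written only to show the semantics; it is not the PNUTS implementation. The method names are illustrative, and the fallback to the master copy in `read_critical` is a simplification.

```python
class VersionedRecord:
    """One record's timeline: every write goes to the master copy and bumps the version."""

    def __init__(self, value):
        self.versions = [value]      # versions[i] holds version i + 1
        self.replica_version = 1     # a lagging replica's position on the timeline

    @property
    def current(self):
        return len(self.versions)

    def replicate(self, up_to):
        """Advance the lagging replica (asynchronous replication catching up)."""
        self.replica_version = max(self.replica_version, min(up_to, self.current))

    def read_any(self):
        """Read: cheap, but may return a stale version from the replica."""
        v = self.replica_version
        return v, self.versions[v - 1]

    def read_latest(self):
        """Read up-to-date: always returns the current version."""
        return self.current, self.versions[-1]

    def read_critical(self, required):
        """Read >= required: use the replica only if it is new enough."""
        if self.replica_version >= required:
            return self.read_any()
        return self.read_latest()

    def write(self, value):
        self.versions.append(value)
        return self.current

    def test_and_set_write(self, required, value):
        """Write only if the current version equals `required`; otherwise fail."""
        if self.current != required:
            raise ValueError(f"version is {self.current}, not {required}")
        return self.write(value)
```

With the record at v. 8, a conditional write that expects v. 7 raises, which corresponds to the ERROR shown on the slide. Because all writes to a record are ordered at its master, replicas can lag but never diverge.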

Mastering

A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E

[Figure: the table above shown at three replicas; label: Tablet master]

Bulk Insert/Update/Replace

1. Client feeds records to bulk manager
2. Bulk loader transfers records to SUs in batches
   • Bypass routers and message brokers
   • Efficient import into storage unit

[Diagram: Client, Source Data, Bulk manager]


Bulk Load in YDOT

• YDOT bulk inserts can cause performance hotspots

• Solution: preallocate tablets
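The slide gives only the one-line remedy. One plausible reading is that tablet boundaries are chosen from a sample of the incoming keys before loading, so that the inserts spread across tablets instead of piling into one. The sketch below follows that reading; the even-rank sampling rule and both function names are assumptions, not the method from the talk.

```python
import bisect

def choose_split_points(sample_keys, num_tablets):
    """Pick num_tablets - 1 boundary keys at evenly spaced ranks of a key sample."""
    ordered = sorted(set(sample_keys))
    if num_tablets < 2 or len(ordered) < num_tablets:
        return []
    step = len(ordered) / num_tablets
    return [ordered[int(i * step)] for i in range(1, num_tablets)]

def assign_to_tablets(keys, split_points):
    """Group keys into the preallocated tablets defined by the split points."""
    tablets = [[] for _ in range(len(split_points) + 1)]
    for key in sorted(keys):
        tablets[bisect.bisect_right(split_points, key)].append(key)
    return tablets
```

With eight keys `a` through `h` and four tablets, the split points are `c`, `e`, and `g`, and each tablet receives two keys.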

Index Maintenance

• How to have lots of interesting indexes, without killing performance?
• Solution: Asynchrony!
  – Indexes updated asynchronously when base table updated

Planned functionality
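Asynchronous index maintenance means a write returns once the base table is updated, and the index catches up later. The slide marks this as planned functionality, so the sketch below is purely illustrative: a change log (a `deque`) is drained by a separate step, and index reads may lag until it runs. All names are hypothetical.

```python
from collections import deque

class AsyncIndexedTable:
    """Base table whose secondary index is updated later from a change log."""

    def __init__(self, field):
        self.field = field       # record field to index, e.g. "category"
        self.rows = {}           # primary key -> record
        self.index = {}          # field value -> set of primary keys
        self.pending = deque()   # changes not yet applied to the index

    def set(self, key, record):
        """Fast path: update the base table and queue the index change."""
        old = self.rows.get(key)
        self.rows[key] = record
        self.pending.append((key, old, record))

    def apply_pending(self):
        """Background step: drain the log and bring the index up to date."""
        while self.pending:
            key, old, new = self.pending.popleft()
            if old is not None:
                self.index.get(old[self.field], set()).discard(key)
            self.index.setdefault(new[self.field], set()).add(key)

    def lookup(self, value):
        """Index read; may lag the base table until apply_pending() runs."""
        return sorted(self.index.get(value, set()))
```

The trade-off is visible in `lookup`: between a `set` and the next `apply_pending`, the index can return stale results, which is the price of keeping writes fast.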

SHERPA
IN CONTEXT


MObStor

• Yahoo!’s next-generation globally replicated, virtualized media object storage service

• Better provisioning, easy migration, replication, better BCP, and performance

• New features (Evergreen URLs, CDN integration, REST API, …)

• The object metadata problem is addressed using Sherpa, though MObStor is focused on blob storage.


Storage & Delivery Stack

The World Has Changed

• Web applications need:
  – Scalability!
    • Preferably elastic
  – Geographic distribution
  – High availability
  – Reliable storage
• Web applications can do without:
  – Complicated queries
  – Strong transactions

Web Data Management

Large data analysis (Hadoop)
• Scan oriented workloads
• Focus on sequential disk I/O
• $ per cpu cycle

Structured record storage (PNUTS)
• CRUD
• Point lookups and short scans
• Index organized table and random I/Os
• $ per latency

Blob storage (SAN/NAS)
• Object retrieval and streaming
• Scalable file storage
• $ per GB

Types of Record Stores

• Query expressiveness

Spectrum from Simple to Feature rich:
  Object retrieval: S3
  Retrieval from single table of objects/records: PNUTS
  SQL: Oracle

Types of Record Stores

• Consistency model

Spectrum from Best effort to Strong guarantees:
  Eventual consistency: S3
  Timeline consistency: PNUTS
  ACID: Oracle

[Figure annotations: Object-centric consistency, Program centric consistency]

Types of Record Stores

• Elasticity (ability to add resources on demand)

Spectrum from Not scalable to Elastic:
  Limited (via data distribution)
  VLSD (Very Large Scale Distribution/Replication)

Systems placed on the spectrum: Oracle, PNUTS, S3

Data Stores Comparison

Versus PNUTS

• User-partitioned SQL stores (Microsoft Azure SDS, Amazon SimpleDB)
  – More expressive queries
  – Users must control partitioning
  – Limited elasticity
• Multi-tenant application databases (Salesforce.com, Oracle on Demand)
  – Highly optimized for complex workloads
  – Limited flexibility to evolving applications
  – Inherit limitations of underlying data management system
• Mutable object stores (Amazon S3)
  – Object storage versus record management

Application Design Space

[Chart: one axis runs from Records to Files, the other from “Get a few things” to “Scan everything”. Systems plotted: Sherpa, MObStor, Everest, Hadoop, YMDB, MySQL, Filer, Oracle, BigTable]

Alternatives Matrix

[Matrix comparing systems (rows) against properties (columns); cell values are not present in the text]

Columns: Elastic, Operability, Global low latency, Availability, Structured access, Updates, Consistency model, SQL/ACID
Rows: Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, Cassandra

Further Reading

Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008)
Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan

PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008)
Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni

Opening Up Yahoo! Search

Phase 1
Giving site owners and developers control over the appearance of Yahoo! Search results.

Phase 2
BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.

Search Results of the Future

[Sites shown: babycenter, epicurious, yelp.com, answers.com, LinkedIn, webmd, Gawker, New York Times]

BOSS Offerings

BOSS offers two options for companies and developers and has partnered with top technology universities to drive search experimentation, innovation and research into next generation search.

API
A self-service, web services model for developers and start-ups to quickly build and deploy new search experiences.

CUSTOM
Working with 3rd parties to build a more relevant, brand/site specific web search experience.
This option is jointly built by Yahoo! and select partners.

ACADEMIC
Working with the following universities to allow for wide-scale research in the search field:
• University of Illinois Urbana Champaign
• Carnegie Mellon University
• Stanford University
• Purdue University
• MIT
• Indian Institute of Technology Bombay
• University of Massachusetts

(Slide courtesy Prabhakar Raghavan)


Partner Examples


QUESTIONS?
