1
Joshua Reich
Princeton University, Department of Computer Science
P2P File-Systems for Scalable Content Use
2
Goal: Scalable Content Distribution
• In crowd (WAN – gateways) or cloud (LAN – data-center servers)
• For use
• Not all parts of content are used at the same time!
  – Multimedia content
  – Executables
  – Virtual Appliances
3
Domain: Cloud
• Data-center w/
  – physical machines
  – network storage
• VM – software implementation of a hardware machine [the machine as an executable]
• VMM – software layer virtualizing hardware [an OS for VMs]
4
Motivation
• VM optimized for a specific purpose
  – Virtual Appliances
  – Virtual Servers
  – Virtual Desktops (VDI)
• Zero config, isolated, easy to replicate
• Shared infrastructure is cheaper
• Less IT headache
• 15,772 unique images on EC2 alone!*
• Hosted VDI market alone est. $65B in 2013**

*http://thecloudmarket.com/stats#/totals, checked 15 July 2011
**http://www.cio.com/article/487109/Hosted_Virtual_Desktop_Market_to_Cross_65_Billion_in_2013, 26 March 2009
5
High-level problem
• VM images stored on network
• Contention for networked storage results in I/O bottlenecks
• I/O bottlenecks significantly delay VM execution
6
Example: Hosted VDI Boot Storm
• VM image stored on SAN or NAS
• Accessed by servers hosting VDI instances
• Everyone comes to work in the morning and starts up their desktop
• SAN overloaded by simultaneous access
• Virtual Desktops stall
7
Specific Challenges
1. Large image size + high demand = contention-induced network bottleneck
2. VMM expects complete image
   – either download the image completely
   – or use continual remote access
3. Complex VM image access patterns
   – non-linear
   – differ from run to run
8
Analogy to Streaming Video
• Assume (2) & (3) aren’t problems
• Then this begins to look like video streaming
• Known approach: P2P Video-on-Demand
  – need to stream a series of ordered pieces
  – while maintaining swarm efficiency
  – use a mix of earliest-first & rarest-first piece selection (sketched below)
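A hedged sketch of that mixed policy (the window size, tie-breaking, and all names here are illustrative assumptions, not VMTorrent's code):

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Pick the next piece to request: pieces inside a small in-order
// window near the playback point are taken earliest-first; outside
// the window, fall back to rarest-first to keep the swarm healthy.
// Returns the piece index, or -1 if no piece is missing.
int next_piece(const std::vector<bool>& have,
               const std::vector<int>& swarm_copies,  // copies of each piece in swarm
               std::size_t playback_pos,
               std::size_t window = 8) {
    // Earliest-first inside the urgency window.
    for (std::size_t i = playback_pos;
         i < have.size() && i < playback_pos + window; ++i) {
        if (!have[i]) return static_cast<int>(i);
    }
    // Rarest-first everywhere else.
    int best = -1;
    int fewest = std::numeric_limits<int>::max();
    for (std::size_t i = 0; i < have.size(); ++i) {
        if (!have[i] && swarm_copies[i] < fewest) {
            fewest = swarm_copies[i];
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The window keeps playback (or, later, VM execution) from stalling, while rarest-first keeps piece diversity high enough for peers to trade.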
9
Novel VMTorrent Architecture
1. Large image / high demand -> P2P
2. Complete VM image required -> Quick-Start
3. Non-linear access -> Profile Prefetch
10
Related Work Matrix

| Work | Approach | Problem Addressed | Notes |
|------|----------|-------------------|-------|
| Mietzner:2008, Shi:2008 | Sequential distribution of VM images | VM deployment | Slow, doesn’t scale |
| O’Donnell:2008, Chen:2009 | Naive P2P distribution of VM images | VM deployment | Slow, scales |
| Industry | Hardware overprovisioning | VM deployment | Fast, expensive |
| Chandra:2005, Moka5 | Content prefetching + on-demand streaming | Virtual desktop delivery | Fast, highly structured |
| Vlavianos:2006, Zhou:2007 | Mix earliest-first / rarest-first prefetch | Video streaming | Fast, scales well |
| VMTorrent | Quick start + P2P + profile prefetch | VM deployment | Fast, scales well |
11
VMTorrent Architecture
[Diagram: a VMTorrent instance – an unmodified VM & VMM running over the custom FS, which is backed by a P2P manager that follows a profile and exchanges pieces with the swarm]
13
Traditional VM Execution
• VM runs on some host
• Virtual Machine: software implementation of a computer
• Implementation stored in an image
• Image stored on the host’s local file system
14
Traditional VM Execution
• Virtual Machine Monitor (VMM) virtualizes hardware
• Conducts I/O to the image through the FS
15
VM Execution Over Network
• Network backend used
  – either to download the image
  – or to access it via a remote FS
16
VM Execution Over Network
• Download – one big up-front performance hit
• Remote access – smaller hits, but also writes and re-reads over the network
17
Quick Start with Custom FS
• Introduce a custom file system
• Divide the image into pieces
• But provide the appearance of a complete image to the VMM (see the sketch below)
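A minimal FUSE 2.x-flavored sketch of that illusion; the image/piece sizes and the backend hooks (piece_present, fetch_piece_blocking, copy_piece) are hypothetical stand-ins for the prototype's internals, which are written in C over FUSE:

```cpp
#define FUSE_USE_VERSION 26
#include <cstring>
#include <fuse.h>
#include <sys/stat.h>

static const off_t  IMAGE_SIZE = 8LL << 30;   // assumed 8 GiB image
static const size_t PIECE_SIZE = 256 * 1024;  // assumed piece size

// Hypothetical hooks into the network backend / piece store.
bool piece_present(size_t idx);
void fetch_piece_blocking(size_t idx);
void copy_piece(size_t idx, char* buf, size_t n, off_t off);

// getattr always reports the full image size, so the VMM sees a
// complete file even though only some pieces are local.
static int vmt_getattr(const char* path, struct stat* st) {
    std::memset(st, 0, sizeof(*st));
    if (std::strcmp(path, "/") == 0) { st->st_mode = S_IFDIR | 0755; return 0; }
    st->st_mode = S_IFREG | 0644;
    st->st_size = IMAGE_SIZE;  // full size, regardless of what is present
    return 0;
}

// read stalls only when the touched piece is missing (read spanning
// multiple pieces is elided here for brevity).
static int vmt_read(const char* path, char* buf, size_t size, off_t off,
                    struct fuse_file_info* fi) {
    size_t piece = static_cast<size_t>(off) / PIECE_SIZE;
    if (!piece_present(piece))
        fetch_piece_blocking(piece);  // VM stalls here until delivery
    copy_piece(piece, buf, size, off);
    return static_cast<int>(size);
}
```

These handlers would be registered in a fuse_operations table and passed to fuse_main.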
18
Quick Start w/ Custom FS
• VMM attempts to read piece 1
• Piece 1 is present; the read completes
19
Quick Start w/ Custom FS
• VMM attempts to read piece 0
• Piece 0 isn’t local; the read stalls
• VMM waits for the I/O to complete; the VM stalls
20
Quick Start w/ Custom FS
• FS requests the piece from the backend
• Backend requests it from the network
21
Quick Start w/ Custom FS
• Later, the network delivers piece 0
• Custom FS receives it and updates the piece
• The read completes; the VMM resumes the VM’s execution (the stall/resume handoff is sketched below)
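One plausible way to implement that stall/resume handoff between the FS read path and the network backend is a condition variable over the piece table; a sketch under that assumption (not the prototype's actual synchronization):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

class PieceTable {
public:
    explicit PieceTable(std::size_t n) : present_(n, false) {}

    // Called from the FS read path: block until the piece is local.
    void wait_for(std::size_t idx) {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return static_cast<bool>(present_[idx]); });  // VM stalls here
    }

    // Called by the network backend when a piece arrives:
    // mark it present and wake any stalled readers.
    void deliver(std::size_t idx) {
        { std::lock_guard<std::mutex> lk(mu_); present_[idx] = true; }
        cv_.notify_all();  // the read completes; the VMM resumes the VM
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<bool> present_;
};
```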
22
Improved Performance w/ Custom FS
• No waiting for the image download to complete
• No more writes or re-reads over the network, as with a remote FS
24
Scaling w/ P2P Backend
• Alleviate the bottleneck to network storage
• Exchange pieces w/ the swarm
• The P2P copy remains pristine (see the sketch below)
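One way to keep the swarmed copy pristine while still absorbing guest writes is a copy-on-write overlay owned by the FS; the slides don't spell out the mechanism, so treat this sketch as an assumption:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Two views of the image: the P2P manager's pieces stay pristine so
// they can keep being shared with (and hash-verified by) the swarm,
// while guest writes land in a private overlay owned by the custom FS.
class ImageStore {
public:
    // Guest write: whole-piece copy-on-write into the overlay
    // (sub-piece writes elided); the pristine piece is never touched,
    // so swarm hashes still match.
    void write(std::size_t piece, std::vector<char> data) {
        overlay_[piece] = std::move(data);
    }

    // Guest read: prefer the overlay (latest guest state),
    // fall back to the pristine swarmed copy.
    const std::vector<char>& read(std::size_t piece) const {
        auto it = overlay_.find(piece);
        return it != overlay_.end() ? it->second : pristine_.at(piece);
    }

    // Swarm uploads always serve the pristine copy.
    const std::vector<char>& serve_to_swarm(std::size_t piece) const {
        return pristine_.at(piece);
    }

private:
    std::unordered_map<std::size_t, std::vector<char>> pristine_;  // filled by P2P
    std::unordered_map<std::size_t, std::vector<char>> overlay_;   // guest writes
};
```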
25
Minimizing Stall Time
• VMM accesses to non-local pieces trigger high-priority swarm requests (see the sketch below)
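A hedged sketch of such demand prioritization against libtorrent's 1.x-style interface; piece_priority and set_piece_deadline are real libtorrent calls, but the wiring shown here is assumed, not taken from VMTorrent:

```cpp
#include <libtorrent/torrent_handle.hpp>

// When the custom FS faults on a non-local piece, escalate that piece
// in the swarm so the stalled VM resumes as soon as possible.
void demand_fetch(libtorrent::torrent_handle& th, int piece) {
    // Highest normal priority, so the piece picker favors it...
    th.piece_priority(piece, 7);
    // ...and a near-immediate deadline, so it bypasses rarest-first
    // ordering and is requested right away.
    th.set_piece_deadline(piece, /*deadline ms=*/0,
                          libtorrent::torrent_handle::alert_when_available);
}
```

The design point is the split: demand misses jump the queue, while background prefetch keeps filling in the rest of the image.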
26
Custom FS + P2P Manager
[Diagram: the custom FS and the P2P manager each hold the image’s pieces; FS demand misses drive the P2P manager’s piece requests]
27
P2P Challenge: Request Fulfillment Latency
• Delays
  – network RTT
  – at the image source (peer or server)
• Impact
  – if even occasionally it takes 0.5 s to obtain a piece,
  – then over the course of thousands of requests,
  – tens of seconds may be lost (e.g., 0.5 s on just 2% of 4,000 requests wastes 40 s)
28
P2P Challenge: Network Capacity
[Plot: cumulative demand vs. time (s) on a 100Mb network for three ideal access rates – mem-cached (ideal rate for the given physical machine), demand FS (ideal rate with read-once, never-write access), and prefetching (ideal rate with perfect prefetching) – each giving a lower bound on delay, even assuming no latency]
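For intuition, a back-of-the-envelope bound (the 1 GB image size is an illustrative assumption, not from the deck): even with perfect swarming and zero latency, a node that must fetch the whole image over a 100 Mb/s link needs at least

t >= S / B = (8 × 10^9 bits) / (10^8 bits/s) = 80 s,

so the only ways below that line are to fetch less (demand-only access) or to overlap fetching with execution (prefetching).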
31
Solution: Generate Profile Using Statistical Ordering
• Collect access patterns for each VM/workload
• Determine expected accesses
  – divide accesses into blocks
  – sort by average access time
  – remove blocks accessed in only a small fraction of runs
• Encode the new order in a profile (a sketch follows)
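A sketch of that statistical ordering; the Access record, the first-access-per-run choice, and the 0.5 appearance threshold are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

struct Access { std::size_t block; double time_s; };  // one block access in one run

// Build a prefetch profile from several recorded runs: take each block's
// first access time per run, average across the runs that touched it,
// drop blocks seen in only a small fraction of runs, and emit the rest
// sorted by mean access time.
std::vector<std::size_t>
build_profile(const std::vector<std::vector<Access>>& runs,
              double min_fraction = 0.5) {
    std::map<std::size_t, std::pair<double, std::size_t>> agg;  // block -> (sum, runs seen)
    for (const auto& run : runs) {
        std::map<std::size_t, double> first;  // earliest access per block, this run
        for (const auto& a : run) {
            auto it = first.find(a.block);
            if (it == first.end() || a.time_s < it->second) first[a.block] = a.time_s;
        }
        for (const auto& [block, t] : first) {
            agg[block].first += t;
            agg[block].second += 1;
        }
    }
    std::vector<std::pair<double, std::size_t>> ordered;  // (mean time, block)
    for (const auto& [block, st] : agg)
        if (st.second >= min_fraction * runs.size())
            ordered.push_back({st.first / st.second, block});
    std::sort(ordered.begin(), ordered.end());
    std::vector<std::size_t> profile;
    profile.reserve(ordered.size());
    for (const auto& [t, block] : ordered) profile.push_back(block);
    return profile;
}
```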
32
P2P Challenge: In-order Profile Prefetch Is Inefficient
E.g., during a boot storm:
⇒ all peers actively fetch the same small set of pieces
⇒ low piece diversity
⇒ little opportunity for peers to share
⇒ low swarming efficiency
33
Solution: Randomization and Throttling
• Randomize prefetch order
• Rate limiting (based on priority)
• Deadline-based throttling
(the randomization and rate-limit budget are sketched below)
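A sketch combining the first two ideas: shuffle within a sliding window of the profile so peers that start together still fetch different pieces, and cap each round with a budget so demand traffic keeps priority. The window size and budgeting policy are assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Walk the profile in windows, shuffling each window before issuing
// requests: piece diversity is restored while the profile's order is
// roughly preserved.
class Prefetcher {
public:
    Prefetcher(std::vector<std::size_t> profile, std::size_t window)
        : profile_(std::move(profile)), window_(window),
          rng_(std::random_device{}()) {}

    // Return the next batch to request, at most `budget` pieces.
    // The budget is where rate limiting shows up: demand fetches get
    // the bandwidth first, prefetch gets whatever budget remains.
    std::vector<std::size_t> next_batch(std::size_t budget) {
        std::size_t end = std::min(pos_ + window_, profile_.size());
        std::shuffle(profile_.begin() + pos_, profile_.begin() + end, rng_);
        std::size_t take = std::min(budget, end - pos_);
        std::vector<std::size_t> batch(profile_.begin() + pos_,
                                       profile_.begin() + pos_ + take);
        pos_ += take;
        return batch;
    }

private:
    std::vector<std::size_t> profile_;
    std::size_t window_;
    std::size_t pos_ = 0;
    std::mt19937 rng_;
};
```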
34
VMTorrent Architecture
[Diagram: the full design – the custom FS backed by the P2P manager, which prefetches per the profile and exchanges pieces with the swarm]
35
VMTorrent Architecture
[Same diagram, delimiting the VMTorrent instance: everything beneath the unmodified VM & VMM]
37
VMTorrent Prototype
[Diagram: prototype instance – the custom FS is custom C using FUSE; the P2P manager is custom C++ & libtorrent, swarming over BitTorrent]
38
Emulab Testbed*
• Up to 101 modern hardware nodes
• One VMTorrent instance per node
• 100Mb LAN
*[White:2002]
41
Data Presentation
• We use normalized runtime (boot through shutdown)
• Normalized against memory-cached execution: normalized runtime = actual runtime / memory-cached runtime
• Allows easy cross-comparison across different VM/workload combinations
44
Hypothesis
• Larger swarm -> Longer time until full swarm efficiency
• Demand prioritization -> Relative loss of prefetch piece diversity -> Lower swarm efficiency