Latency Reduction Techniques for Remote Memory Access in ANEMONE Mark Lewandowski Department of Computer Science Florida State University


Page 1

Latency Reduction Techniques for Remote Memory Access in ANEMONE

Mark Lewandowski
Department of Computer Science
Florida State University

Page 2

Outline

• Introduction
• Architecture / Implementation
  • Adaptive NEtwork MemOry engiNE (ANEMONE)
  • Reliable Memory Access Protocol (RMAP)
  • Two Level LRU Caching
  • Early Acknowledgments
• Experimental Results
• Future Work
• Related Work
• Conclusions

Page 3

Introduction

• Virtual memory performance is bound by slow disks
• The state of computers today lends itself to the idea of shared memory:
  • Gigabit Ethernet
  • Machines on a LAN have lots of free memory
• Improvements to ANEMONE yield higher performance than both disk and the original ANEMONE system

[Figure: memory hierarchy — Registers, Cache, Memory, ANEMONE, Disk]

Page 4

Contributions

• Pseudo Block Device (PBD)
• Reliable Memory Access Protocol (RMAP)
  • Replaces NFS
  • Early Acknowledgments
  • Shortcut communication path
• Two Level LRU-Based Caching
  • Client
  • Memory Engine

Page 5

ANEMONE Architecture

| Component     | ANEMONE (NFS)                        | ANEMONE                         |
| Client        | NFS swapping                         | Pseudo Block Device (PBD)       |
|               | Swap daemon cache                    | Client cache                    |
| Memory Engine | No caching                           | Engine cache                    |
|               | Must wait for server to receive page | Early ACKs                      |
| Memory Server | Communicates with Memory Engine      | Communicates with Memory Engine |

Page 6

Architecture

[Figure: ANEMONE architecture — Client Module, RMAP Protocol, Engine Cache]

Page 7

Pseudo Block Device

• Provides a transparent interface between the swap daemon and ANEMONE
• Is not a kernel modification
• Handles READ/WRITE requests in order of arrival
  • No expensive elevator algorithm
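The in-order servicing policy can be sketched as a plain FIFO request queue in user-space C; the `pbd_request` and `pbd_queue` names are illustrative, not taken from the actual PBD driver:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative request descriptor: a READ or WRITE of one 4 KB block. */
enum pbd_op { PBD_READ, PBD_WRITE };

struct pbd_request {
    enum pbd_op op;
    unsigned long offset;        /* block offset in the remote store */
    struct pbd_request *next;
};

/* FIFO queue: requests are served strictly in arrival order, with no
 * elevator-style reordering by block position. */
struct pbd_queue {
    struct pbd_request *head, *tail;
};

static void pbd_enqueue(struct pbd_queue *q, struct pbd_request *r)
{
    r->next = NULL;
    if (q->tail)
        q->tail->next = r;
    else
        q->head = r;
    q->tail = r;
}

static struct pbd_request *pbd_dequeue(struct pbd_queue *q)
{
    struct pbd_request *r = q->head;

    if (r) {
        q->head = r->next;
        if (!q->head)
            q->tail = NULL;
    }
    return r;
}
```

An elevator scheduler would instead sort pending requests by block offset to minimize disk-head movement; when the backing store is remote RAM there is no head to move, so arrival order is the cheaper choice.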

Page 8

Reliable Memory Access Protocol (RMAP)

• Lightweight, reliable, flow-controlled protocol
• Sits next to the IP layer to give the swap daemon quick access to pages

[Figure: protocol stack — Application / Swap Daemon on top, RMAP alongside the Transport layer, then IP, then Ethernet]

Page 9

RMAP

• Window-based protocol
• Requests are served as they arrive
• Messages:
  • REG/UNREG – register/unregister the client with the ANEMONE cluster
  • READ/WRITE – fetch pages from / send pages to ANEMONE
  • STAT – retrieves statistics from the ANEMONE cluster
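A minimal sketch of how these message types might be carried on the wire, assuming a fixed 8-byte header; the field names, widths, and numeric type values are assumptions for illustration, not RMAP's real wire format:

```c
#include <assert.h>
#include <stdint.h>

/* Message types from the slide above; numeric values are assumed. */
enum rmap_type {
    RMAP_REG, RMAP_UNREG,    /* (un)register the client with the cluster */
    RMAP_READ, RMAP_WRITE,   /* page transfer */
    RMAP_STAT                /* statistics query */
};

/* Hypothetical fixed-size header. */
struct rmap_hdr {
    uint8_t  type;           /* one of enum rmap_type */
    uint8_t  flags;          /* e.g. an ACK bit */
    uint16_t seq;            /* sequence number for the send window */
    uint32_t offset;         /* page offset for READ/WRITE */
};

/* Pack the header into a buffer in big-endian (network) byte order. */
static void rmap_pack(const struct rmap_hdr *h, uint8_t buf[8])
{
    buf[0] = h->type;
    buf[1] = h->flags;
    buf[2] = (uint8_t)(h->seq >> 8);
    buf[3] = (uint8_t)h->seq;
    buf[4] = (uint8_t)(h->offset >> 24);
    buf[5] = (uint8_t)(h->offset >> 16);
    buf[6] = (uint8_t)(h->offset >> 8);
    buf[7] = (uint8_t)h->offset;
}

static void rmap_unpack(const uint8_t buf[8], struct rmap_hdr *h)
{
    h->type   = buf[0];
    h->flags  = buf[1];
    h->seq    = (uint16_t)((buf[2] << 8) | buf[3]);
    h->offset = ((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16)
              | ((uint32_t)buf[6] << 8)  | (uint32_t)buf[7];
}
```

The sequence number is what a window-based protocol needs to track outstanding, unacknowledged requests and to retransmit lost ones.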

Page 10

Why do we need a cache?

• It is a natural analogue to on-disk buffers
• Caching reduces network traffic
• Caching decreases latency
  • Write latencies benefit the most: the cache buffers requests before they are sent over the wire

Page 11

Basic Cache Structure

• A FIFO queue is used to keep track of the LRU page
• A hashtable is used for fast page lookups

[Figure: FIFO queue (head/tail) of cache_entry structs, indexed by a hash table]

struct cache_entry {
    struct list_head queue;  /* links into the FIFO list that makes up the cache */
    unsigned long offset;    /* offset of the page */
    u8 *page;                /* the page data */
    int write;               /* non-zero for a write (dirty) entry */
    struct sk_buff *skb;     /* may or may not point to an sk_buff; if it does,
                              * the cache must call kfree_skb() when the page is
                              * kicked out of memory (this avoids a memcpy) */
    int answered;            /* whether the request has been acknowledged */
};
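The queue-plus-hashtable design can be exercised in user space with a simplified entry (no page data or sk_buff); the bucket count and modulo hash are assumptions, and eviction removes the queue head, i.e. the oldest page:

```c
#include <assert.h>
#include <stdlib.h>

#define NBUCKETS 64   /* illustrative; the real cache size differs */

struct entry {
    unsigned long offset;
    struct entry *q_next, *q_prev;   /* FIFO queue links */
    struct entry *h_next;            /* hash chain link */
};

struct cache {
    struct entry *head, *tail;       /* FIFO queue: head = oldest */
    struct entry *bucket[NBUCKETS];  /* index for fast lookups */
    int count, capacity;
};

static unsigned hash(unsigned long off) { return off % NBUCKETS; }

static struct entry *cache_lookup(struct cache *c, unsigned long off)
{
    struct entry *e;

    for (e = c->bucket[hash(off)]; e; e = e->h_next)
        if (e->offset == off)
            return e;
    return NULL;
}

/* Evict the oldest entry (queue head) and return its offset. */
static unsigned long cache_evict(struct cache *c)
{
    struct entry *e = c->head;
    struct entry **pp = &c->bucket[hash(e->offset)];
    unsigned long off = e->offset;

    c->head = e->q_next;
    if (c->head) c->head->q_prev = NULL; else c->tail = NULL;
    while (*pp != e) pp = &(*pp)->h_next;  /* unlink from hash chain */
    *pp = e->h_next;
    c->count--;
    free(e);
    return off;
}

static void cache_insert(struct cache *c, unsigned long off)
{
    struct entry *e = calloc(1, sizeof *e);

    if (c->count == c->capacity)
        cache_evict(c);
    e->offset = off;
    e->q_prev = c->tail;                   /* append at queue tail */
    if (c->tail) c->tail->q_next = e; else c->head = e;
    c->tail = e;
    e->h_next = c->bucket[hash(off)];      /* push onto hash chain */
    c->bucket[hash(off)] = e;
    c->count++;
}
```

Lookups are O(1) expected via the index, and eviction is O(1) at the queue head, which matches the slide's division of labor between the FIFO queue (recency order) and the hash table (fast lookup).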

Page 12

ANEMONE Cache Details

• Client cache
  • 16 MB, write-back
  • Memory allocated at load time
• Engine cache
  • 80 MB, write-through
  • Partial memory allocation at load time
  • sk_buffs are copied when they arrive at the Engine

Page 13

Early Acknowledgments

[Figure: early-ACK message timeline — Client, Memory Engine, Memory Server]

• Reduce client wait time
• Can reduce write latency by up to 200 µs per write request
• Early ACK performance is limited by the small RMAP window size
• A small pool (~200) of sk_buffs is maintained for forward ACKing
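The saving can be illustrated with a toy latency model; the 100 µs one-way hop times below are placeholders chosen so the saving works out to the slide's 200 µs upper bound, not measurements from the testbed:

```c
#include <assert.h>

/* Assumed one-way latencies, in microseconds (illustrative only). */
#define CLIENT_TO_ENGINE_US 100
#define ENGINE_TO_SERVER_US 100

/* Without early ACK: the client's write completes only after the Engine
 * forwards the page to a Memory Server and the server's ACK returns. */
static int write_latency_us(void)
{
    return CLIENT_TO_ENGINE_US + ENGINE_TO_SERVER_US   /* page out */
         + ENGINE_TO_SERVER_US + CLIENT_TO_ENGINE_US;  /* ACK back */
}

/* With early ACK: the Engine ACKs as soon as the page is buffered,
 * moving both Engine-to-Server hops off the client's critical path. */
static int write_latency_early_ack_us(void)
{
    return CLIENT_TO_ENGINE_US + CLIENT_TO_ENGINE_US;
}
```

In this model the early ACK removes the Engine-to-Server round trip from the client's view of the write, which is why writes, not reads, benefit: a read still has to wait for the actual page data.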

Page 14

Experimental Testbed

The experimental testbed is configured with 400,000 blocks (4 KB pages) of memory (~1.6 GB).

Page 15

Experimental Description

• Latency
  • 100,000 read/write requests
  • Sequential/random
• Application run times
  • Quicksort / POV-Ray
  • Single/multiple processes
  • Execution times
• Cache performance
  • Measured cache hit rates at the Client and Engine

Page 16

Sequential Read

Page 17

Sequential Write

Page 18

Random Read

Page 19

Random Write

Page 20

Single Process Performance

• Single process size is increased by 100 MB for each iteration
• Quicksort: 298% performance increase over disk, 226% increase over the original ANEMONE
• POV-Ray: 370% performance increase over disk, 263% increase over the original ANEMONE

Page 21

Multiple Process Performance

• Number of 100 MB processes is increased by 1 for each iteration
• Quicksort: 710% increase over disk, and 117% increase over the original ANEMONE
• POV-Ray: 835% increase over disk, and 115% increase over the original ANEMONE

Page 22

Client Cache Performance

• Hits save ~500 µs each
• POV-Ray hit rate saves ~270 seconds for the 1200 MB test
• Quicksort hit rate saves ~45 seconds for the 1200 MB test
• The swap daemon interferes with cache hit rates
  • Prefetching

Page 23

Engine Cache Performance

• Cache hit rate levels out at ~10%
• POV-Ray does not exceed 10% because it performs over 3x the number of page swaps that Quicksort does
• The Engine cache saves up to 1000 seconds for the 1200 MB POV-Ray test

Page 24

Future Work

• More extensive testing
• Aggressive caching algorithms
• Data compression
• Page fragmentation
• P2P
• RDMA over Ethernet
• Scalability and fault tolerance

Page 25

Related Work

• Global Memory System [feeley95]
  • Implements a global memory management algorithm over ATM
  • Does not directly address virtual memory
• Reliable Remote Memory Pager [markatos96], Network RAM Disk [flouris99]
  • TCP sockets
• Samson [stark03]
  • Myrinet
  • Does not perform caching
• Remote Memory Model [comer91]
  • Implements a custom protocol
  • Guarantees in-order delivery

Page 26

Conclusions

• ANEMONE does not modify the client OS or applications
• Performance increases by up to 263% for single processes
• Performance increases by up to 117% for multiple processes
• Improved caching is a provocative line of research, but more aggressive algorithms are required

Page 27

Questions?

Page 28

Appendix A: Quicksort Memory Access Patterns

Page 29

Appendix B: POV-Ray Memory Access Patterns