
Speculative Paging for Future NVM Storage

Viacheslav Fedorov, Texas A&M University, [email protected]
Jinchun Kim, Texas A&M University, [email protected]
Mian Qin, Texas A&M University, [email protected]
Paul V. Gratz, Texas A&M University, [email protected]
A. L. Narasimha Reddy, Texas A&M University, [email protected]

ABSTRACT
The quest for greater performance and efficiency has driven modern cloud applications towards “in-memory” implementations, such as memcached and Apache Spark. Looking forward, however, the costs of DRAM, due to its low area density and high energy consumption, may make this trend unsustainable. Traditionally, OS paging mechanisms were intended to bridge the gap between expensive, under-provisioned DRAM and inexpensive, dense storage; over the past twenty years, however, the latency of storage relative to DRAM became too great to overcome without significant performance impact. Recent NVM storage devices, such as Intel Optane drives and aggressive 3D flash SSDs, may dramatically change the picture for OS paging. These new drives are expected to provide much lower latency than existing flash-based SSDs or traditional HDDs. Unfortunately, even these future NVM drives are still much too slow to replace DRAM, since the access latency of fast NVM storage is expected to be on the order of tens of microseconds, and they often require block-level access. Unlike traditional HDDs, for which the baseline OS paging policies are designed, these new SSDs incur no penalty for “random” access, and their access latency promises to be significantly lower than that of traditional SSDs, arguing for a rearchitecting of the OS paging system.

In this paper, we propose SPAN (Speculative PAging for future NVM storage), a software-only, OS swap-based page management and prefetching scheme designed for emerging NVM storage. Unlike the baseline OS swapping mechanism, which is highly optimized for traditional spinning disks, SPAN leverages the inherent parallelism of NVM devices to proactively fetch a set of pages from NVM storage into the small and fast main DRAM. In doing so, SPAN yields a speedup of ∼18% versus swapping into the NVM with the baseline OS (∼50% of the performance lost by the baseline OS versus placing the entire working set in DRAM memory). The proposed technique thus enables the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping performance comparable to a DRAM-only system.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MEMSYS 2017, October 2–5, 2017, Alexandria, VA, USA
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5335-9/17/10. . . $15.00
https://doi.org/10.1145/3132402.3132409

CCS CONCEPTS
• Software and its engineering → Main memory;

KEYWORDS
Memory System, Paging, Prefetching

ACM Reference Format:
Viacheslav Fedorov, Jinchun Kim, Mian Qin, Paul V. Gratz, and A. L. Narasimha Reddy. 2017. Speculative Paging for Future NVM Storage. In Proceedings of MEMSYS 2017, Alexandria, VA, USA, October 2–5, 2017, 12 pages. https://doi.org/10.1145/3132402.3132409

1 INTRODUCTION
Modern computing trends are increasingly turning towards “Big Data” and “Cloud Computing”. These emerging applications place tremendous, growing pressure on the memory system and storage [6]. Traditional backing store technologies, however, are orders of magnitude slower than DRAM-based main memory. As a result, computation providers have been forced to move to fully DRAM-resident designs to achieve scaling performance. This trend is evidenced by the growth of cloud computing workloads such as memcached [24] and Apache Spark [42], designed to run in main memory. However, the extra pressure on main memory has led to DRAM capacity becoming a major performance bottleneck [1, 18].

As modern workloads' data sets continue to grow, increasing DRAM capacity will become prohibitively expensive. Further, as DRAM footprints grow, their energy costs become a significant factor in data center design [7]. One way of keeping up with this rapid data set growth is through the use of emerging non-volatile memory (NVM) technologies, such as Intel 3D XPoint [10], fast and high-density Flash [37], and other emerging memory technologies [4, 31]. While these storage technologies have excellent capacity per cost, are randomly addressable, and are non-volatile, they typically have performance orders of magnitude slower than DRAM, making them impractical as a direct replacement for DRAM; further, many are still block-oriented. Unfortunately, as we will show, the current OS paging mechanisms, designed for traditional storage technologies, perform poorly when used to page onto these emerging NVM technologies. In this work we explore OS-level, speculative paging, designed with future NVM storage technologies as the backing store.

In current systems, when data does not fit in the main memory, it is swapped out to a storage device behind the memory. The OS manages the DRAM and the storage device to create the impression of a larger memory. However, application performance is significantly impacted by the slower storage devices. In current systems, the OS employs “demand paging” to bring pages from the swap device to memory on a “page fault” (i.e. when the application attempts to access a page that is currently not in memory). With emerging NVM devices, the performance gap between main memory and the swap devices will be much smaller, potentially enabling other approaches to management of the swap space. Further, emerging NVM devices exhibit much higher parallelism than magnetic disks. These characteristics warrant revisiting the problem of managing the swap devices behind the main memory. We propose to rearchitect how the OS performs paging and readahead on the swapping device to leverage the greater parallelism and higher performance available with emerging NVM devices. In particular, we propose to separate demand paging from readahead or prefetching of pages, giving the prefetching task to a separate thread. Thus, the faulting application need not wait for the full set of prefetched (readahead) pages to be pulled from the storage device. Further, with NVM storage freed from the constraints of large spinning disk seek times, we propose a sophisticated algorithm to predict future page accesses for prefetching, greatly reducing the number of useless pages pulled in from the NVM device. Our approach reduces the application running time by 18% on average, compared to conventional OS swapping to the NVM device. We are able to achieve 50% of the maximum potential speedup, i.e., the speedup when the application working set is fully in DRAM.

The rest of the paper is organized as follows. Section 2 provides a brief background on virtual memory and page swapping in the modern OS and discusses the motivation of this work. In Section 3 we provide the design considerations of our approach. We evaluate the framework and several prediction algorithms in Section 4. Related work is discussed in Section 5. Finally, Section 6 provides a summary of this work's contribution.

2 MOTIVATION
Here we briefly overview the operation of Virtual Memory (VM) and paging in modern OSes at a basic level necessary for understanding the following discussion. This is followed by a decomposition of the characteristics of current VM implementations and how they are unsuited to systems incorporating emerging NVM technologies.

2.1 Virtual Memory Background
Virtual Memory (VM) is a powerful abstraction used in practically every modern OS [5, 36, 40]. Virtual memory is implemented as a hardware-software system where the OS manages the virtual-to-physical address mapping for each distinct application, while the CPU is responsible for actually translating virtual memory addresses into physical main memory addresses. The granularity of such mappings is typically 4 kB and each such region is called a page1. The interface between the OS and the CPU is provided through a page table and is standardized in x86 systems to enable a hardware page table walker in the CPU core to traverse the entries and translate virtual addresses without OS intervention. The page table is indexed with the high-order bits of the virtual address. When the OS needs to establish a translation between a given physical memory page and its virtual address in VM, it sets up an entry in the page table to contain a translation to the physical address in question. The CPU then performs the page table lookup for every memory operation, and if there is no write protection violation and the translation is valid, the operation proceeds. Conversely, in the case of a violation, a page fault is triggered and the OS is responsible for handling the fault, whether by terminating the application (on a protection violation) or by allocating and installing a translation (on a missing one).

1 Note, from here on we use Linux on x86 hardware as the basis of our discussion. Other OSes and hardware implement similar mechanisms.
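To make the mapping granularity concrete, the sketch below (an illustration only, not kernel code; the example address is arbitrary) shows how a virtual address splits into the page number that the page table translates and the 12-bit offset within the 4 kB page.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                  /* 4 kB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
    uint64_t vaddr  = 0x7f3a12345678ULL;        /* example virtual address */
    uint64_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page number: selects the page table entry */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);  /* byte offset within the 4 kB page */

    printf("vpn = 0x%llx, offset = 0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)offset);
    return 0;
}
```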

The OS uses the storage system as a backing store for physical memory to transparently allow a memory space larger than the available physical memory, through paging or page swapping. When an application attempts to access a page which is currently not in physical memory, a page fault occurs, invoking the OS. The OS then selects a page currently in physical memory to “swap out” to the backing store, freeing space for the faulting page to be read into physical memory.

In current systems, there are several orders of magnitude difference in performance between DRAM physical memory and traditional magnetic-disk backing storage. As a result, the OS is carefully designed to select which pages to swap out, since removing a page which will be used in the near future will cause extra page faults. Most OSes implement replacement schemes based around the principle of least-recently used (LRU). VM replacement uses the physical memory as a fully associative cache and is implemented in coordination between the hardware and the OS. Rather than implementing a full LRU, the OS typically maintains an approximation of LRU through two lists of pages, active and inactive. All pages in memory have a hardware flag in their page table entry which is set by the hardware when the page is accessed. The OS initially clears this flag and periodically samples it to determine if the associated page was recently accessed. Pages in the inactive list which are touched are promoted to the active list; active-list pages which have not been touched in some time are demoted to the inactive list. No effort is made to keep relative recency information within each list. When a page must be selected for eviction, one of the pages from the inactive list is selected, effectively at random. Although there are many suggested improvements to page management algorithms in the literature [13, 21, 32, 43], they are beyond the scope of our discussion here.
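The two-list approximation can be modeled compactly. The following userspace sketch is our own simplification, not the Linux implementation: a per-page referenced flag stands in for the hardware accessed bit, a periodic scan demotes untouched active pages and promotes touched inactive pages, and eviction takes an arbitrary page from the inactive list.

```c
#include <stdbool.h>
#include <stddef.h>

struct page {
    bool referenced;            /* stands in for the hardware "accessed" bit */
    struct page *next;
};

/* Active and inactive lists; no recency order is kept inside either list. */
static struct page *active_list, *inactive_list;

static void push(struct page **list, struct page *p) { p->next = *list; *list = p; }

/* Periodic scan: demote active pages not touched since the last scan,
 * promote inactive pages that were touched, and clear the sampled flags. */
static void scan_lists(void)
{
    struct page *old_active = active_list, *old_inactive = inactive_list;
    struct page *p, *next;

    active_list = inactive_list = NULL;
    for (p = old_active; p; p = next) {
        next = p->next;
        if (p->referenced) { p->referenced = false; push(&active_list, p); }
        else               { push(&inactive_list, p); }
    }
    for (p = old_inactive; p; p = next) {
        next = p->next;
        if (p->referenced) { p->referenced = false; push(&active_list, p); }
        else               { push(&inactive_list, p); }
    }
}

/* Eviction: take any page from the inactive list ("effectively at random"). */
static struct page *select_victim(void)
{
    struct page *p = inactive_list;
    if (p)
        inactive_list = p->next;
    return p;
}
```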

2.2 Readahead and Prefetching
With paging, application performance is largely driven by a combination of two factors: the average latency of a page fault and the number of page faults. These factors must be carefully considered together in order to maximize performance in the presence of paging.

In an attempt to reduce the number of page faults, most OSes implement some form of “proactive read”. For example, the Linux OS implements a simple page prefetching scheme, “readahead”. On a demand page fault, an aligned block of pages (typically 8) containing the faulting page is requested from the backing store. Because the block is aligned by its location on the disk, typically the demand page is not the first page read from the disk. This design is optimized for traditional spinning disks, where seek times are high while the overhead of reading contiguous blocks is relatively low. The extra pages are placed in the swap cache (i.e. readahead pages are not installed into the application's address space until there is a demand for them), in the hope that the pages speculatively fetched might eventually be demanded by the application. Pages in the swap cache may be swapped out again without use if the readahead was inaccurate.
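For concreteness, the aligned 8-page window can be computed as in the sketch below (a simplified model of the behavior just described, not the actual kernel routine; the swap offset is hypothetical). Because the window is aligned on the swap-file offset, the demand page typically lands in the middle of the block rather than being the first page read.

```c
#include <stdio.h>

#define READAHEAD_PAGES 8   /* typical readahead block, in pages */

/* Return the first swap offset of the aligned block containing 'fault_off'. */
static unsigned long readahead_start(unsigned long fault_off)
{
    return fault_off & ~(unsigned long)(READAHEAD_PAGES - 1);
}

int main(void)
{
    unsigned long fault_off = 0x1234d;              /* hypothetical swap-file offset, in pages */
    unsigned long start = readahead_start(fault_off);

    /* All 8 pages of the aligned block are requested; here the demand page
     * (offset 0x1234d) is only the sixth page read from the block. */
    for (unsigned long off = start; off < start + READAHEAD_PAGES; off++)
        printf("read swap page 0x%lx%s\n", off,
               off == fault_off ? "  <- demand page" : "");
    return 0;
}
```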

Figure 1: OS Readahead accuracy with one application running, PageRank benchmark from the SparkBench suite [19].

Figure 2: Page access latency, normalized to no-readahead, HDD swap device vs an NVM swap device. The ShortestPaths benchmark from the SparkBench suite [19].


Fault-induced page swapping occurs in the context of the original application. The application is suspended until the main demand page and all readahead pages have been read; then it may continue. By blocking the faulting application and scheduling speculative readahead on every page fault, we can preload pages that are likely to be used in the near future. This is a reasonable policy when the accuracy of the readahead is high and/or the seek latency of the backing store is high.

Figure 1 depicts the accuracy (i.e. how many readahead pages were useful) of the OS readahead scheme with the PageRank application. During the first ∼900 k readaheads, the application exhibits a linear pattern of page faults and the readahead is 100% accurate. As the application continues running, however, the prediction accuracy of the readahead scheme drops dramatically, to 40–45% on average. This low accuracy can be explained if we refer to the mechanism of page swapping in the kernel. Because readahead selects pages that are contiguous on the backing store, pages which were initially swapped out at the same time are the pages selected for readahead along with the requested page. Unfortunately, temporal locality at swap-out time does not always indicate the same temporal locality at demand fault time.

2.3 Page Fetch Latency
With conventional readahead, the pressure on the swap device request queue is artificially increased. Figure 2 shows the demand fault request latency as well as the effective latency for all page faults, with 2-page, 4-page, and 8-page OS readahead degrees, for magnetic hard-disk drive (HDD)-based as well as NVM-based swap devices, normalized to no readahead for HDD and NVM, respectively. In all cases, increasing readahead leads to increasing latency on the demand faulting fetch. This is expected because page faults block the running application until the critical page, along with all readahead pages, is loaded into the swap cache. With traditional HDDs, however, the latency of the first swap page access is so high that the reduction in faults caused by readahead means that average page access times tend to decrease up to an 8-page readahead.

For NVM devices, however, the picture is quite different. The figure shows that both the faulting page access time and the overall page access time go up dramatically as the number of readahead pages increases. With NVM devices, the per-page access latency is much lower than with HDDs. Further, the cost of random access is much smaller than with HDDs. Thus, delaying demand fault page fetching for readahead pages, which ultimately may not be used (see Section 2.2), comes at a high performance penalty, an observation backed up in prior work by Park and Bahn [25].

Figure 3: Memory-level parallelism comparison, traditional HDD versus NVM. The ShortestPaths benchmark from the SparkBench suite [19].

Beyond simple fetch latency, the high seek times of traditional HDDs make it impractical to exploit memory-level parallelism. With NVM, on the other hand, several swap files can be organized on a single device for greater performance. Figure 3 presents the performance of the conventional OS swapping mechanism with NVM compared to the HDD swap device, while running the ShortestPaths benchmark from the SparkBench suite [19]. We see that increasing the number of swap files to ten on the conventional HDD yields a ∼2× slowdown due to the seek costs of each swap page access. Conversely, using ten swap files actually improves performance slightly on NVM. The higher parallelism within NVMs makes multiple simultaneous accesses feasible.

The I/O scheduler also plays a role in page fault latency. The conventional OS I/O scheduler typically 1) exploits spatial proximity of requests on the disk and 2) ensures that requests from multiple applications are interleaved such that constant progress is made in servicing those requests for all applications. With emerging NVM devices, spatial ordering of requests yields almost no advantage over random requests. Additionally, the high degree of internal parallelism calls for dispatching multiple requests from one application without waiting for the others to issue their requests. Thus, previous work has identified a need for novel scheduling approaches with NVM devices [34].

In summary, OS mechanisms assume that swap space is located on a traditional HDD. Thus:
• The relative cost of fetching additional adjacent pages to the demand faulting page is negligible, and hence on every demand fetch the OS prefetches a block of additional pages adjacent in the swap file. With NVM, with no seek latency, this practice of always prefetching adjacent pages ahead of the next likely demand fetch may not be ideal (Figure 2).
• Current NVM devices exhibit high internal parallelism, with many parallel accesses possible through multiple banks [27]. Hence, the current practice of only one swap file per device limits the exploitable concurrency (Figure 3).
• The current practice of fetching adjacent pages within a swap file is not accurate in high-concurrency workloads with many cores (Figure 1). We need more accurate mechanisms that can efficiently implement prefetching.

We take these observations into account in our design of prefetching mechanisms for NVM, as outlined below. Our goal is to attack the problem on two fronts: first, to build a framework that allows us to carefully inject speculative requests so as not to disturb the demand requests, keeping their effective latency low; and second, to implement prediction algorithms accurate enough to reduce the number of faults.

3 DESIGN
In this section we describe the design of our page-level speculative prefetching for future NVM-based backing stores. To improve performance, prefetches must be accurate, timely, and minimally interfere with demand requests. Here we describe our work to restructure the OS to support paging on NVM devices. We then examine prefetching algorithms, culminating with our proposed SPAN technique.

3.1 OS Support for Swapping on NVM Devices
Here we focus on the swap-in process (reading pages from the swap device back into main DRAM). Based on the observations in Section 2, the following fundamental changes are necessary:
• Utilizing a NOOP I/O request scheduler. By default, Linux sorts the requests and uses the elevator mechanism to optimize spinning disk performance. As discussed above, this can lead to increased latencies seen by demand requests. The NOOP I/O scheduler simply executes the requests in the order they arrive, allowing data to be retrieved in the order intended by the prefetching mechanism2.
• Critical-Page-First (CPF). Instead of waiting for an entire aligned block from the device, we first request the demand page and let the application continue as fast as possible while the remaining readahead/prefetched pages are fetched.

2 The NOOP scheduler is essential and already implemented in Linux; to be fair, we use it for both the baseline and our proposed technique in all evaluations.

• Separate readahead/prefetch logic from demand fetch logic. Since scheduling read requests may easily take dozens of microseconds, it is necessary to perform them outside of the faulting application's context. We use a separate thread for the OS-level page prefetching implementation here, though a hardware approach could be used.
• In order to keep the demand request latency as low as possible, we maintain two separate queues, one for demand requests and one for prefetches, and give higher priority to the demand queue. We schedule the prefetches only when there are no demand requests waiting. Incoming demand requests “jump ahead” of any prefetch/readahead requests in the queue (see details in Section 3.2).

Our proposed SPAN technique incorporates all of these OS-level changes.
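A minimal sketch of the dispatch rule implied by the last two points follows (a userspace model under our own naming, not the actual kernel patch): demand requests live in a main queue, speculative requests in a secondary queue, and a prefetch is handed to the swap device only when no demand request is waiting.

```c
#include <stddef.h>

/* One outstanding swap-in request, identified by its offset in the swap file. */
struct io_request {
    unsigned long swap_offset;
    struct io_request *next;
};

/* Simple FIFO, used for the main (demand) queue. */
struct queue { struct io_request *head, *tail; };

static void enqueue(struct queue *q, struct io_request *r)
{
    r->next = NULL;
    if (q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
}

static struct io_request *dequeue(struct queue *q)
{
    struct io_request *r = q->head;
    if (r) { q->head = r->next; if (!q->head) q->tail = NULL; }
    return r;
}

/* Demand requests always jump ahead of queued prefetches: a prefetch is
 * handed to the swap device only when the demand queue is empty. */
static struct io_request *next_to_issue(struct queue *demand_q, struct queue *prefetch_q)
{
    struct io_request *r = dequeue(demand_q);
    return r ? r : dequeue(prefetch_q);
}
```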

3.2 Prefetch Queue Management
The baseline OS prefetches speculative pages together with the demand page. Any new faults and prefetches have to wait until the older ones are executed; thus, old prefetches are favored over recent ones. However, with two separate queues for demand and speculative requests, we have a choice as to which prefetches to prioritize, based on which ones we believe have greater utility to the application. Intuitively, the prefetch requests from older demand faults will be more out of sync with what the application is currently doing than prefetches generated by more recent faults. In this case, we can give higher priority to the new prefetches over the older ones. This is achieved by using a LIFO (Last-In-First-Out) queue policy instead of the FIFO (First-In-First-Out) policy utilized by the conventional readahead. Thus, new demand page faults create prefetches that are placed in the prefetch queue ahead of prefetch requests from previous page faults. Once the prefetch requests from the current page fault have been fetched, those from earlier faults can then be serviced. To prevent the queue from overflowing, the oldest prefetches are dropped from the tail of the queue when space is required for new prefetches.

Note that before a swap request is enqueued, the OS must be fully prepared for the swapped-in page. Namely, a fresh free physical page frame must be allocated and locked, the relationship between the swap file offset and the physical page established, and the page inserted into the OS swap cache. The enqueued request may be further decomposed and/or merged with adjacent requests by the I/O scheduler. Because prefetches are speculative in nature, they require a means of “rolling back” (canceling) their fetch. Such roll-backs make canceled prefetches expensive. We adopt a secondary queue, as described above, which is one level higher than the main queue, i.e., the entries in the secondary queue are hints to the fetch mechanism and are not allocated physical memory until the moment they are actually issued to the swap device. This allows a nearly “free” roll-back (by simply erasing the entry in question), and only the actual prefetches go through the process of allocating physical memory.

In summary, to keep the demand fault latency low, any demand requests must be fetched before the prefetches. The recent faults are more important than the past ones; consequently, priority among the prefetches must be given to the prefetch requests induced by the recent demand faults. We fulfill both these requirements by implementing a secondary queue for prefetch requests, with LIFO ordering.
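The secondary-queue discipline summarized above can be sketched as a bounded LIFO of page-number hints (our own simplified model; the depth shown is illustrative, within the 128–512 range tested in Section 3.4). Hints from the newest fault are pushed on top and issued first; when the queue is full the oldest hint is dropped from the tail, which is a free roll-back because a hint owns no physical page until it is actually issued.

```c
#define PREFETCH_Q_DEPTH 256   /* illustrative depth */

/* Fixed-size LIFO of prefetch hints, kept as an array used as a stack.
 * hints[top-1] is the newest entry; hints[0] is the oldest (the "tail"). */
struct prefetch_queue {
    unsigned long hints[PREFETCH_Q_DEPTH];   /* virtual page numbers to consider */
    int top;
};

/* Push a new hint. If full, drop the oldest hint from the tail; since a hint
 * owns no physical page yet, dropping it is a free roll-back. */
static void prefetch_push(struct prefetch_queue *q, unsigned long vpn)
{
    if (q->top == PREFETCH_Q_DEPTH) {
        for (int i = 1; i < q->top; i++)      /* shift out the oldest entry */
            q->hints[i - 1] = q->hints[i];
        q->top--;
    }
    q->hints[q->top++] = vpn;
}

/* Pop the newest hint: prefetches from the most recent fault are issued first. */
static int prefetch_pop(struct prefetch_queue *q, unsigned long *vpn)
{
    if (q->top == 0)
        return 0;
    *vpn = q->hints[--q->top];
    return 1;
}
```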

Figure 4: The Temporal and Spatial schemes illustration. Fetched by Temporal (a), Spatial (b), both (c).


3.3 Prefetch Algorithm
A good prefetch algorithm should strive to predict pages that are soon to be utilized by the application. Unlike prefetching in other domains, such as processor caches, page prediction is somewhat complicated in that we can only observe the page faults, not the successful page accesses, without significant overhead.

Traditionally, there have been two approaches to such prediction: one based on temporal locality of accesses and the other based on spatial locality. Two pages are said to have temporal locality if they have been accessed together in the past and will be faulted together again in the future. Conversely, spatial locality implies that if a page is faulted, then a page adjacent to it in the virtual address space will also be faulted. Here we describe an implementation of these two traditional prefetch algorithms, along with our proposed SPAN algorithm, which identifies more complex patterns of reference within the application page fault stream.

3.3.1 OS Readahead. Current OS policies fetch physically contiguous pages within the swap file. The physical locality in the swap file is based on the temporal locality in evictions of pages from memory, as described in Section 2.2.

3.3.2 Temporal. Based on the observation that pages adjacent in the swap file were swapped out together, and thus were likely used together at some point in the past, it is reasonable to assume that they are likely to be demanded together again in the future. A simple temporal approach thus prefetches pages surrounding the current faulting page in the swap file. The two main distinctions from the standard OS readahead mechanism are that temporal prefetches have lower priority than the demand page faults, and that more recent temporal predictions have priority over older ones. This reduces the latency of the demand requests, as well as avoids executing potentially inaccurate stale prefetch requests. Refer to Figure 4 for an illustration.

3.3.3 Spatial. We also examine a simple spatial approach which predicts that certain consecutive page addresses will be faulted after the current page fault. Figure 4 illustrates the concept. In this figure, an original page fault to address A is followed by a stream of spatial prefetches A+1, A+2, ..., A+7. In an ideal case, if the application exhibits such spatial patterns, the temporal prefetcher should also be able to capture them. However, due to the inability to precisely track the page access patterns beyond the first usage of any given page, the OS LRU mechanism is imprecise, which leads to deviations in which pages are swapped out. Thus the temporal approach may not be able to capture the relationships between pages with adjacent virtual addresses.
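The difference between the two baseline predictors reduces to which address space the neighborhood is taken from, as the short sketch below illustrates (our own formulation of the schemes just described; queue priority handling is omitted).

```c
/* Generate up to 'n' prefetch candidates around a demand fault.
 * Temporal: neighboring slots in the swap file (pages swapped out together).
 * Spatial:  neighboring pages in the faulting virtual address space. */
static void temporal_candidates(unsigned long swap_slot, unsigned long *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = swap_slot + i + 1;   /* next slots in the swap file */
}

static void spatial_candidates(unsigned long vpn, unsigned long *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = vpn + i + 1;         /* next pages in the virtual address space */
}
```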

Figure 5: SPAN design overview.


3.3.4 SPAN Algorithm. While the OS readahead is designed for a traditional spinning disk, to maximize the benefit of sequential accesses, with new storage-class memory such as NVM there is no need to restrict prefetching to the spatial or temporal schemes. Prior studies on hardware cache line prefetching [11, 14, 16, 23, 28, 35] show that a delta pattern between consecutive cache misses can be used to predict prefetch candidates. We can apply the same notion of delta patterns to page fault addresses. For example, if pages with the repeating virtual addresses 0x4, 0x6, 0x9, 0x8, 0x4, 0x6, ... faulted, the corresponding delta pattern would be “+2, +3, -1, -4, +2, ...”. At any given point in this sequence, the next entry can be predicted from the string of deltas to that point. Thus, if a sequence of “+2, +3, -1” is seen, one can predict that the next fault will be at a delta of -4 pages in the virtual address map from the last faulting access.

Naïvely adopting existing cache line prefetching algorithms only marginally improves overall performance, because even a small number of inaccurate prefetches leads to significant overhead from additional page faults. To ensure minimal mis-speculated prefetches, our proposed SPAN prefetcher is loosely inspired by the Signature Path Prefetcher (SPP) [16], a hardware prefetcher from the processor cache domain. Similar to SPP, SPAN uses delta patterns as a signature in its page prediction and dynamically throttles the depth of prefetching when the prefetching accuracy becomes low. Unlike SPP, SPAN is fully implemented in software in the OS and learns delta signatures between virtual page numbers on a per-process-id (PID) basis for each page fault. Leveraging the PID greatly reduces the noise in the fault reference stream; thus, deltas seen in SPAN are typically much smaller than ±1000 pages.

Figure 5 shows the overall structure of SPAN. The SPAN module is a three-stage structure that consists of a Signature Table (ST) stage, a Pattern Table (PT) stage, and the Prefetch Filter (PF). SPAN is trained by every major page fault and issues prefetch requests into the secondary prefetch queue. The ST stage is indexed by Process ID (PID) and stores the previously seen memory access pattern as a compressed 16-bit signature. The PT is indexed by a history signature generated from the ST stage and stores delta patterns associated with the history signature. The PT stage also estimates the probability that a given access delta pattern will yield a useful prefetch. If a delta in the PT is found to have sufficient probability (above a configured threshold), this pattern is passed to the PF for filtering. The filter excludes prefetch requests from the PT stage that are already in flight or have been previously demanded. In the remainder of this section we describe the training and prefetch algorithm of SPAN in detail.
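Before the training walk-through, the three tables can be summarized with declarations along the following lines (a simplified userspace model: table sizes and field names are ours, chosen to mirror the notation in the text, and the paper's 4-bit Cdelta counters are widened to bytes here).

```c
#include <stdint.h>

#define ST_ENTRIES   256   /* Signature Table: one entry per active PID (illustrative size) */
#define PT_ENTRIES  4096   /* Pattern Table: indexed by the 16-bit signature (illustrative size) */
#define PT_WAYS        4   /* candidate deltas tracked per signature (illustrative) */
#define PF_ENTRIES  1024   /* direct-mapped Prefetch Filter (illustrative size) */

/* Signature Table (ST): per-PID last faulting page and compressed delta history. */
struct st_entry {
    int      pid;             /* tag: process id */
    uint64_t last_vpn;        /* virtual page number of the last fault from this PID */
    uint16_t signature;       /* compressed history of the last four deltas */
    int      valid;
};

/* Pattern Table (PT): per-signature candidate deltas with confidence counters. */
struct pt_entry {
    int16_t delta[PT_WAYS];   /* candidate next deltas, in pages */
    uint8_t c_delta[PT_WAYS]; /* Cdelta: occurrences of each delta after this signature */
    uint8_t c_sig;            /* Csig: total occurrences of this signature */
};

/* Prefetch Filter (PF): direct-mapped record of recently prefetched pages. */
struct pf_entry {
    uint64_t vpn_tag;
    int      valid;
};

static struct st_entry st[ST_ENTRIES];
static struct pt_entry pt[PT_ENTRIES];
static struct pf_entry pf[PF_ENTRIES];
```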

Figure 6: Update of the signature table and pattern table with a new delta pattern.


Training Algorithm. Figure 6 shows an example of a fault on page 0xA3 from PID 2 updating the ST and PT. The ST is designed to capture a memory access pattern between 4 KB virtual pages and to compress previous deltas into a 16-bit history signature. To capture delta patterns between each page fault in a specific process, the ST is indexed and tagged with a hash of the PID. Each entry in the ST also holds the last virtual page number (VPN) accessed by that PID, to calculate the delta pattern and associate it with the prefetch pattern in the PT. In this case, the ST finds a matching entry for PID 2. The entry shows that the previous signature for that PID was 0x1 and the VPN of the last page faulted was 0xA1. The signature is a compressed representation of the deltas of the faults produced by the given PID; it is generated via a series of XORs and shifts (see discussion below). Since the delta between pages 0xA1 and 0xA3 is (+2), we learn that the previous signature 0x1 leads to a delta of (+2). The correlation between signature and delta pattern is delivered to the PT to update the prefetch pattern.

The PT holds the potential next delta patterns that correspond to a specific history signature from the ST. Therefore, the PT is indexed by the signature and each set contains the predicted next deltas. Unlike the ST, whose entries correspond to individual PIDs, each PT entry can be shared regardless of the PID because multiple processes can show the same memory access behavior. In other words, if PID 1 and PID 2 show the same access pattern, they will generate the same signature, index the same entry in the PT, and update its delta patterns. In doing so, the globally shared PT accelerates the learning process in SPAN.

In Figure 6, a matching delta (+2) corresponding to signature 0x1 is found in the PT, and the 4-bit occurrence counter (Cdelta) for the (+2) delta is incremented by one. The signature's access count (Csig) is also incremented. Together, these counters provide a confidence in the relationship between the signature and the next delta stride. If there is no matching delta, the PT simply replaces the entry with the lowest Cdelta value. Replacing the lowest-confidence delta on a PT miss is particularly effective when a process shows a noisy random access pattern.

New Signature = (Old Signature << 4 bits) XOR (Delta)    (1)

After updating the PT, the ST is also updated with a new signature based on the current delta (+2). Equation 1 illustrates SPAN's new history signature generation. The old signature 0x1 is left-shifted by 4 bits and XORed with the current delta (+2), producing a new signature, 0x12. At this point, the new signature 0x12 represents the history of the access pattern in PID 2 (+1, +2). In this way, the 16-bit signature can represent the last four memory accesses in PID 2. We refer to this signature as being compressed because deltas greater than (+15) have the potential to alias on top of previous deltas in the signature, providing a graceful degradation in signature accuracy for large deltas3. The last page faulted in PID 2 is also updated from 0xA1 to 0xA3 in the ST.
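Putting the walk-through and Equation (1) together, one training step might look like the sketch below, building on the declarations above (a simplification: counter saturation is omitted, and because the paper does not spell out how negative deltas are encoded into the signature, the delta is simply masked to its low 4 bits here).

```c
/* Compress the delta history per Equation (1): shift the old signature left
 * by 4 bits and fold in the new delta. Masking the delta to 4 bits is our
 * simplification; the paper only notes that deltas above +15 may alias. */
static uint16_t update_signature(uint16_t old_sig, int16_t delta)
{
    return (uint16_t)((old_sig << 4) ^ ((uint16_t)delta & 0xF));
}

/* Train on one major page fault from process 'pid' at virtual page 'vpn'. */
static void span_train(int pid, uint64_t vpn)
{
    struct st_entry *se = &st[(unsigned)pid % ST_ENTRIES];

    if (se->valid && se->pid == pid) {
        int16_t delta = (int16_t)(vpn - se->last_vpn);
        struct pt_entry *pe = &pt[se->signature % PT_ENTRIES];
        int way = -1, victim = 0;

        /* Find the matching delta, or the way with the lowest Cdelta to replace. */
        for (int w = 0; w < PT_WAYS; w++) {
            if (pe->c_delta[w] && pe->delta[w] == delta) { way = w; break; }
            if (pe->c_delta[w] < pe->c_delta[victim])
                victim = w;
        }
        if (way < 0) {                    /* no match: replace the weakest way */
            way = victim;
            pe->delta[way] = delta;
            pe->c_delta[way] = 0;
        }
        pe->c_delta[way]++;               /* Cdelta for this (signature, delta) pair */
        pe->c_sig++;                      /* Csig for this signature */

        se->signature = update_signature(se->signature, delta);
    } else {                              /* first fault seen from this PID slot */
        se->pid = pid;
        se->signature = 0;
        se->valid = 1;
    }
    se->last_vpn = vpn;
}
```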

Prefetching and Lookahead. The new signature (0x12), generated as described above, is now used to index the PT, as shown in Figure 7. In the figure, the signature indexes two deltas, +1 and +2, as possible prefetch candidates. Each delta's confidence (P0) is computed, as shown in the figure, from the ratio between the delta's occurrence count (Cdelta) and the signature's access count (Csig). Deltas with a confidence above the prefetch threshold (TP) are marked as prefetch candidates and delivered to the PF.

In addition, the PT generates a lookahead signature that recursively indexes into the PT until the prefetching confidence falls below TP. Since there could be multiple deltas that exceed TP, the delta with the highest Cdelta is selected to generate the lookahead signature. Lookahead prefetching with a compressed signature provides three main advantages over traditional spatial or temporal prefetchers. First, if there is a sequential access that traverses through virtual pages with the same delta pattern, the lookahead signature will index into the same entry in the PT and accurately prefetch pages that would otherwise have faulted. Second, if there is a complex memory access pattern, the PT is able to provide a new delta pattern for each lookahead stage. Third, if there is a random delta pattern with low Cdelta in the PT, SPAN simply does not prefetch anything and saves memory bandwidth.
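The corresponding prediction step might then look as follows, reusing the tables and update_signature() from the sketches above (the threshold value is illustrative, and confidence compounding across lookahead levels is omitted). On a fault, span_train() followed by a call such as span_predict(sig, vpn, buf, 64) would produce up to 64 candidates (Section 3.4) to pass through the prefetch filter into the secondary queue.

```c
#define TP_THRESHOLD 0.50   /* prefetch confidence threshold TP (illustrative value) */

/* Predict prefetch candidates from the signature and VPN of the current fault.
 * Returns the number of virtual page numbers written to 'out'. */
static int span_predict(uint16_t sig, uint64_t vpn, uint64_t *out, int max)
{
    int n = 0;

    while (n < max) {
        struct pt_entry *pe = &pt[sig % PT_ENTRIES];
        int best = -1;

        if (pe->c_sig == 0)
            break;                          /* signature never seen: nothing to predict */
        for (int w = 0; w < PT_WAYS && n < max; w++) {
            double p0 = (double)pe->c_delta[w] / pe->c_sig;   /* P0 = Cdelta / Csig */
            if (p0 >= TP_THRESHOLD) {
                out[n++] = vpn + pe->delta[w];                /* candidate virtual page */
                if (best < 0 || pe->c_delta[w] > pe->c_delta[best])
                    best = w;                                 /* highest-Cdelta candidate */
            }
        }
        if (best < 0)
            break;                          /* confidence below TP: stop looking ahead */
        vpn += pe->delta[best];             /* walk the most likely path ... */
        sig = update_signature(sig, pe->delta[best]);   /* ... via the lookahead signature */
    }
    return n;
}
```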

Once the PT decides which page to prefetch, the prefetch candidate is delivered to the PF. The main objectives of the PF, also shown in Figure 7, are to reduce redundant prefetch requests and to track overall prefetching accuracy. The PF is a direct-mapped data structure that records prefetched pages. SPAN always checks the PF first, before it issues prefetches. If the PF already contains an entry for the given page, this means the page has already been prefetched, and SPAN drops the redundant prefetch request. Due to collisions, a filter entry may already be occupied by another prefetched page. In this case, SPAN simply replaces the old page, stores the current prefetch request in the filter, and issues the current prefetch. Note that this simple replacement policy might erase pages from the filter before they are evicted from main memory, which could lead to re-prefetching, but we find in practice that this happens very infrequently.

3 In practice, we found most deltas to be less than 16 pages; thus, increasing the shift and signature size provided no significant benefit.
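The filter check before issue can be sketched as below (direct-mapped with the overwrite-on-collision policy described above; the sizing and indexing are illustrative and build on the earlier declarations).

```c
/* Return 1 if 'vpn' should be prefetched, 0 if a prior prefetch for the same
 * page is already recorded. On an index collision the old entry is simply
 * overwritten, as described above. */
static int pf_should_prefetch(uint64_t vpn)
{
    struct pf_entry *e = &pf[vpn % PF_ENTRIES];   /* direct-mapped index */

    if (e->valid && e->vpn_tag == vpn)
        return 0;                                  /* redundant prefetch: drop it */
    e->vpn_tag = vpn;                              /* record (possibly replacing a victim) */
    e->valid   = 1;
    return 1;
}
```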

Figure 7: SPAN lookahead prefetching and filtering mechanism.


3.4 How Many Pages to Prefetch
It is challenging to balance the number of prefetched pages, given the limited bandwidth due to demand faults having higher priority, against the diminishing utilization (the farther ahead we prefetch, the greater the likelihood that the pages will not be utilized). We have experimented with throttling the number of prefetches issued per demand fault based on the occupancy of the secondary queue. Our experiments show, however, that this approach does not yield any significant improvement in performance. The major reason is that by throttling the newer, more useful prefetches we are in essence giving more weight to the older, potentially less useful, ones.

Experimentally, we determined that issuing up to 16 prefetches per fault into the secondary queue for the Temporal and Spatial approaches (since their accuracy is relatively low), and up to 64 prefetches for the SPAN scheme, yields the best results. Since demand fetches are given higher priority, fewer prefetches are actually issued when the number of demand faults is higher. Additionally, the LIFO scheme of selecting prefetch requests across multiple page faults makes the actual number of queued requests less critical. We tested secondary queue depths from 128 to 512 prefetches for all prefetching schemes; however, the overall performance was insensitive to the queue size, since the queues rarely overflowed.

4 EVALUATION
In this section we evaluate the performance improvement of our proposed schemes over the baseline OS. We also show the ideal performance with a full DRAM configuration that does not suffer from any hard page faults. The contributing aspects of our algorithm are also discussed in this section.

4.1 Methodology
We evaluated our schemes on a real system with 96 GB of DRAM and a six-core 1.9 GHz Intel Xeon processor. We implemented our approaches in the Linux kernel v3.13.0. We used applications from the SparkBench suite [19] of in-memory data analytics benchmarks written for the Apache Spark engine [42]. The applications were configured to have 8 GB working sets/memory footprints. Except where otherwise noted, for each experiment only 4 GB of DRAM are made available to the application, while the remaining 4 GB of the working set must be swapped to the backing store by the OS.

To emulate emerging NVM storage technologies, such as Intel 3D XPoint [10], the conventional RAMDisk driver (BRD) was augmented with a standard request queue with a configurable latency. We emulated a 50 µs device latency for the bulk of our experiments. This 50 µs latency is a reasonable approximation of the memory access latency of the Intel Optane SSD DC P4800X specification [9]. Unlike traditional page swap mechanisms, SPAN uses ten RAMDisk swap files to leverage the memory-level parallelism described in Section 3.1. All disks in all experiments, including the normalizing baseline, use the NOOP I/O scheduler, to avoid the performance impact of the default deadline scheduler on NVM storage, as discussed in Section 3.1. Using the NOOP scheduler provides a ∼3% improvement to all techniques versus the Deadline I/O scheduler.

In the experiments, the “Base OS” results have 4 GB of available DRAM with only one swap file, thus with a parallelism of one request per operation, and use the conventional OS readahead, as a freshly installed system would be configured. The “Full RAM” results depict the maximum practical performance improvement if the full working set were placed in DRAM main memory (here the full 96 GB of DRAM are provided to the application with the baseline OS).

Three prefetching techniques are examined: “TEMPORAL”, representing the temporal scheme (see Section 3.3.2); “SPATIAL”, representing the spatial scheme (see Section 3.3.3); and SPAN, the proposed page-level speculative prefetching technique (see Section 3.3.4). The above-mentioned schemes employ the offloading mechanism, with a separate CPU core queueing the prefetch requests into the secondary queue, out of the way of the demand faults. As described earlier, the requests from the secondary queue are serviced by the swap device only when the main queue is empty.

4.2 Performance Improvement
Figure 8 presents a comparison of the SparkBench applications with the various proposed prefetching schemes vs. the baseline OS readahead, with the NOOP I/O scheduler as discussed above. The bars represent the wall clock running time, normalized to the baseline wall clock time. The “AVG” bars are geometric mean averages of the respective schemes across the applications. Across all applications, we see that the performance differential between “Base OS” and “Full RAM” is quite significant, at 34% on average and as high as 56% for PageRank. This shows that the impact of placing half the working set in an NVM swap file can be quite high with the default OS.

Overall, the running time is improved by up to 30% (KMeans) and no worse than 1% (DecisionTree), and by 18% on average, or slightly more than half the difference between “Base OS” and “Full RAM”. We note that none of the schemes degrades performance. We observe that the “SPATIAL” and “TEMPORAL” predictors are approximately equal in performance improvement, while behaving differently with various applications (cf. KMeans, PregelOps, and TriangleCnt). SPAN beats both “TEMPORAL” and “SPATIAL” in performance improvement by approximately 2%. SPAN has an average accuracy of ∼80%, which is almost 2× higher than the other schemes. This allows SPAN to gain an 18% performance improvement on average while fetching almost half as many pages, thus saving swap bandwidth.

Figure 8: Speedup for SparkBench [19] suite applications with various prediction schemes.


DecisionTree and SVDpp show the lowest performance numbers, although for different reasons. DecisionTree has fewer page faults compared to the rest of the suite (as indicated by its 6% maximum performance increase with “Full RAM”), and thus SPAN may not have enough opportunity to train itself. Additionally, the cost of not predicting certain faults is more pronounced here because of the smaller number of faults. In contrast, SVDpp experiences many page faults, and the time between consecutive demand faults is frequently less than the NVM device latency. Since our algorithms prioritize demand faults, there are simply not many prefetching opportunities, as indicated by the low performance of all three schemes. Note, however, that SPAN is able to exploit those opportunities more effectively, which translates into a ∼2% improvement over the other two schemes.

Page Fault Reduction vs. Performance Improvement. Figure 9 presents the number of page faults in the applications. Interestingly, “Base OS” shows the fewest page faults of the schemes evaluated. This is despite the poor performance of “Base OS” compared to the other schemes (Figure 8). We note that the number of page faults by itself does not give a complete picture of performance with NVM. Traditional OS swapping is geared towards sequential-access hard disks; thus it fetches several pages together with the faulting page, forcing later demand page faults to wait for readahead prefetches. As Figure 2 shows, prefetching more pages leads to increased page fault latencies. Thus, for NVM drives with no seek time and low access latency, forcing further demand fetch faults to wait on the current readahead does reduce faults, but at the expense of much higher per-page access times due to these delays. It is important to take this effect on page fault latency into account, rather than only observing the raw number of page faults, if one is to accurately assess the performance of the schemes. Since we conduct our experiments on a real system, the wall clock time accurately reflects the total performance. Of the proposed schemes, SPAN achieves the fewest page faults while prefetching fewer pages, due to its higher prediction accuracy, which translates into less pressure on the swap device.

Figure 9: Page fault count with various prediction schemes. Average across the SparkBench suite.


4.3 Analysis
In this section we analyze the proposed schemes.

4.3.1 Prefetching Mechanism Decomposition. Figure 10 decomposes the benefits provided by the different components of the proposed SPAN prefetching mechanism. “NVM Parallelism” in the figure shows the benefit of using TEMPORAL readahead with 10 swap files, to exploit the NVM parallelism. Note that due to the way the OS manages multiple swap devices using a round-robin mechanism, adjacent pages in the LRU list get swapped out into consecutive swap files. Consequently, with ten swap files, page faults can exploit 10× the device-level parallelism, which allows the OS schemes to perform better with more swap files. However, the figure shows that further modifications to the OS readahead mechanism are needed to extract this performance potential out of the fast devices. In the figure, the conventional OS readahead mechanism with 10 swap files (NVM Parallelism) is only able to increase performance by 7% on average, compared to the baseline with 1 swap file.

One important technical contribution of our work is “Prefetch Offloading”. Here we offload prefetching to a separate thread (and core) in order to allow the application to continue running as soon as possible after the demand page fault has been serviced, which provides a significant performance boost. “Prefetch Offloading” inherently contains the critical-page-first approach, where the demand fault is serviced ahead of any prefetch requests, as well as utilizing 10 swap devices with the OS's readahead prediction. In the figure we see the proposed offloading mechanism provides an additional 9% performance gain over “NVM Parallelism” alone, increasing the total gain to 16% on average.

Figure 10: Speedup with different SPAN mechanisms.


In the figure, the final portion of the bar, “Prefetching Policy”, shows that the addition of the sophisticated SPAN prefetch algorithm contributes an additional 2% of performance and bumps the overall improvement to 18% on average, showing the benefit of a more accurate prefetching algorithm. As we will show, SPAN decreases the NVM traffic the most, while achieving the highest performance among the tested schemes.

We note that two of the stacked bars in Figure 10 show negative performance: NVM Parallelism in DecisionTree and Prefetch Offloading in SVDpp. In these benchmarks, these techniques incur a small performance penalty by themselves, though when implemented as part of the entire SPAN technique, the benchmarks show positive performance overall.

4.3.2 Prefetch Utilization. The goal of prefetching is to lower the page access latency. If the prediction is correct and the prefetched page is utilized, the effective latency of the respective page fault is reduced to zero. However, issuing too many prefetches, or prefetches with low accuracy, would unnecessarily keep the swap device and the main memory busy, wasting energy and potentially leading to higher page fault latencies. It is thus the balance between the accuracy and the number of prefetch requests issued that leads to better application performance.

Figure 11 presents the prefetched page utilization for the proposed schemes. We define utilization as the fraction of prefetched pages that were actually used by the application before being swapped out again. Generally we see that “SPATIAL” is superior to “TEMPORAL”, except in the TriangleCnt application, while SPAN presents a massive improvement in prefetch accuracy at 79% on average, compared to 48% and 51% for the “TEMPORAL” and “SPATIAL” schemes, respectively.

Note again that DecisionTree and SVDpp show counter-intuitive behavior. SPAN achieves high utilization in both applications, despite relatively small performance gains. We find that in DecisionTree, SPAN prefetches very few pages due to its confidence being low. This is because the application has few total faults, giving SPAN little time to train.

Figure 11: Prefetch Utilization for the proposed algorithms, SparkBench suite.

Figure 12: Prediction outcomes for the SPAN algorithm.

This is because the application has few total faults, giving SPAN little time to train.

4.3.3 Prediction Outcomes. Figure 12 presents a breakdown of the outcomes of the prefetch predictions as a fraction of the total number of predictions made. We observe that approximately 65% of the predictions make it to the secondary queue; thus, 35% of predictions are first filtered out by the PF in SPAN as having already been prefetched by a prior request. Approximately 45% of the predictions are ultimately "Wiped from queue"; these are predictions dropped because the rate of faults is too high to allow prefetching of the predicted pages. A further ∼10% of predictions are found to already be "in memory" by the time they are considered for prefetching from the secondary queue; these are cases where a demand fault brought the page into memory while the prefetch was waiting in the secondary queue.
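
The four outcome categories can be reproduced with a small two-stage model, sketched below in Python; the structure, the QUEUE_CAPACITY constant, and the wipe-on-overflow rule are our own simplifying assumptions rather than the exact SPAN implementation.

from collections import Counter, deque

QUEUE_CAPACITY = 64  # assumed secondary-queue size

def classify_predictions(predictions, prefetch_filter, resident_pages):
    stats = Counter()
    secondary = deque()
    for page in predictions:
        if page in prefetch_filter:           # already requested earlier
            stats["filtered_before_queue"] += 1
            continue
        prefetch_filter.add(page)
        if len(secondary) == QUEUE_CAPACITY:  # faults arrive too fast to keep up
            secondary.popleft()
            stats["wiped_from_queue"] += 1
        secondary.append(page)
    while secondary:                          # issued when the device is free
        page = secondary.popleft()
        if page in resident_pages:            # a demand fault beat the prefetch
            stats["in_memory"] += 1
        else:
            stats["prefetched"] += 1
    return stats

print(dict(classify_predictions([1, 2, 2, 3, 4], set(), {3})))
# -> {'filtered_before_queue': 1, 'prefetched': 3, 'in_memory': 1}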

As the figure shows, ultimately only 10–20% of the total predictions are actually prefetched. ConnectedComponent and DecisionTree stand out from the others in the figure. SPAN is very accurate in both; however, in the former there are not enough opportunities for the predictions to be fetched because of the high frequency of demand faults, as indicated by the high percentage of predictions wiped from the secondary queue. The remaining predictions are still accurate, but, by the time they are considered, the application has already brought those pages into memory via demand faults, hence the large fraction of "in memory" predictions. DecisionTree, on the other hand, has far fewer demand faults, so more predictions are considered for prefetch (only 5% of predictions are "Wiped from queue"); but, once again, while the predictions wait in the secondary queue, the corresponding pages are demand-faulted by the application, leading to the larger percentage of "in memory" predictions.


Figure 13: Total NVM traffic, normalized to the Base OS (TEMPORAL, SPATIAL, and SPAN; bars exceeding the axis are labeled 140%, 120%, 132%, and 133%).

Given the overall high number of prefetches "Wiped from queue", NVM devices with more parallelism and/or lower latencies should provide more bandwidth for prefetches and allow for increased performance.

4.3.4 Total NVM traffic. Figure 13 presents the total traffic to the NVM swap device (prefetches as well as demand fault requests) for the proposed schemes, normalized against the Base OS. Although the Base OS achieves the lowest number of page faults (Figure 9), it actually fetches more pages from the NVM than the other schemes. This is because the Base OS forces the application to wait for all readahead pages to be read before allowing the application to proceed; thus, the Base OS incurs the fewest faults but also brings more unused pages into memory.
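
Concretely, the normalization plotted in Figure 13 can be written as follows (the symbols are introduced here for exposition only):

\[ \mathrm{Traffic_{norm}}(s) = \frac{D_s + P_s}{D_{\mathrm{base}} + R_{\mathrm{base}}} \]

where D_s and P_s are the numbers of pages brought in by demand faults and by prefetches under scheme s, and D_base and R_base are the demand-faulted and readahead pages under the Base OS.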

We note that SPAN reduces NVM traffic by 30% versus the Base OS readahead. Additionally, SPAN generates on average ∼17% less NVM traffic than the other proposed techniques, while also improving application performance by an additional 2% on average. SPAN's accuracy leads to this reduction in NVM traffic because fewer useless pages are prefetched from the NVM. Note the high cost the SPATIAL and TEMPORAL schemes pay for their performance on the DecisionTree and SVDpp applications, where their traffic reaches 120%–140% of the Baseline. SPAN, in contrast, never increases the NVM traffic for any application tested.

4.3.5 Future NVM devices. NVM technologies are evolving quickly, and we expect higher-performance, lower-latency devices in the future. Figure 14 shows the impact of our schemes with faster and slower NVM devices, sweeping the emulated latency from 0 to 200 µs. The latency of 50 µs used in all prior experiments is ∼6× the latency of a page read from DRAM and is representative of the average access latency of recently produced high-speed NVM SSDs, such as the Intel Optane SSD DC P4800X [9].

First, we note that even with 0 µs devices there is still a performance loss compared to the full-RAM system. This loss represents the overhead of OS page bookkeeping as well as page-copying latencies. We observed that with faster NVM devices the distances between page faults shrink, since the applications are able to run faster; moreover, prefetching is less effective because the impact of an accurate prefetch is not as significant as with slower devices (i.e., the fault latency reduction is not as great). For reference, we observed a 16% geometric-mean speedup under the Base OS readahead for the SparkBench applications with 10-µs devices versus the 50-µs ones.

Figure 14: SparkBench suite geometric mean running time improvement, normalized to the Baseline, for NVM device latencies of 0, 10, 50, 100, and 200 µs (TEMPORAL, SPATIAL, SPAN, and Full RAM; bars exceeding the axis are labeled 34%, 43%, and 47%).

Further, with faster devices there is less opportunity for prefetching, and the overheads of the more complex algorithms may start to negatively affect performance. On the other hand, with slower devices (e.g., 200 µs latency), it is more challenging to fetch the correct pages in time to cover the larger access latency; the device queues also tend to be fuller, which further limits the opportunities for our schemes. Overall, the proposed SPAN approach adapts to the underlying NVM device characteristics by leveraging its high-accuracy predictions, and achieves between 35–55% of the maximum potential performance improvement given by "Full RAM".

5 RELATED WORK

An extensive amount of research has been devoted to disk prefetching approaches for file I/O. Trivedi [41] argues that it is hard to predict page traffic based on past behavior and the spatial contiguity of the working set. Patterson et al. [26] show that the growing working sets of I/O-bound workloads challenge traditional file system cache and readahead approaches; they present an algorithm for choosing between caching and prefetching and for managing OS buffers based on application hints. We note that modern applications often exhibit irregular data usage patterns, making it hard to provide such hints; however, dynamically adapting to application access patterns remains of great importance. Kaplan et al. [15] revisited prefetching approaches and showed that the choice of the best predictor depends on both the application's access behavior and the available memory. Griffioen and Appleton proposed "automatic prefetching" [8] for files based on the probability of a file being requested after a given file. As an alternative approach, Cao et al. [2] proposed letting applications control their own file cache replacement while keeping the kernel in control of cache space allocation between processes. While conceptually similar to our SPAN scheme, these approaches are all designed for files rather than pages, so their prediction mechanisms and outcomes are dramatically different.

To our knowledge, our work is the first to consider an OS-level swap prefetching approach for DRAM with emerging NVM backing storage. Similar to our two-level approach, schemes using DRAM as a cache for slower NVM memory have been proposed [20, 22, 29]. Page overlays [33] have been proposed to allow mapping of cache lines from multiple virtual pages into one physical page to improve the utilization of DRAM space in a layered memory architecture. These approaches require additional hardware modifications and thus cannot be applied in current systems.


By contrast, SPAN is a pure software approach, entirely contained in the OS. Further, these prior schemes assume that future NVM memories are byte addressable, while here we explore the range of future NVM-based block storage.

Apart from managing DRAM and NVM storage as a two-tier architecture, it is possible to place the two memory types on the same level. Adaptive page placement based on data usage patterns was proposed by Ramos et al. [30]. Park and Bahn [25] simulated DRAM-bus-attached PCM as a swap device and argued for turning off readahead. In contrast, our work proposes mechanisms that improve performance beyond what is possible by simply disabling prefetching.

Many techniques have been proposed for prefetching in CPU caches [3, 12, 39]. Somogyi et al. proposed the spatial memory streaming prefetcher [38]. These approaches operate a level below our proposed approach, are mostly implemented in hardware, and can be used independently of our schemes for further speedup. Our approach, being fully software-based, allows for more sophisticated prediction algorithms with higher accuracy; the drawback of page-level prediction is the limited feedback available at the OS level, which makes it harder to determine and prefetch the application's page access patterns. Some effort has been devoted to page prefetching in the Linux kernel in the past [17]. Those schemes assume a DRAM-only system, need ample free main memory, and take care to operate only during idle periods of system activity. They are orthogonal to our schemes, which aim to help actively running applications in a system with a restricted amount of DRAM backed by fast NVM storage.

6 SUMMARY

In order to lower the cost of main memory in systems with huge memory demand, it is becoming practical to substitute emerging, less expensive NVM storage for some of the DRAM in the system. The main challenge with NVM, however, is its relatively high access latency. Here we present SPAN, a software-only, OS swap-based page prefetching approach to managing hybrid DRAM and NVM systems. We show that it is possible to regain ∼55% of the performance lost due to swapping into the NVM, and thus to enable the use of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping performance comparable to a DRAM-only system.

7 ACKNOWLEDGMENTS

We thank the National Science Foundation, which partially supported this work through grants CCF-1320074 and I/UCRC-1439722, and Dell Corp. for their generous support.

REFERENCES
[1] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237–248. https://doi.org/10.1145/2485922.2485943
[2] Pei Cao, Edward W. Felten, and Kai Li. 1994. Implementation and performance of application-controlled file caching. In Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 13.
[3] Tien-Fu Chen and Jean-Loup Baer. 1995. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput. 44, 5 (May 1995), 609–623. https://doi.org/10.1109/12.381947
[4] Suock Chung, K. M. Rho, S. D. Kim, H. J. Suh, D. J. Kim, H. J. Kim, S. H. Lee, J. H. Park, H. M. Hwang, S. M. Hwang, J. Y. Lee, Y. B. An, J. U. Yi, Y. H. Seo, D. H. Jung, M. S. Lee, S. H. Cho, J. N. Kim, G. J. Park, Gyuan Jin, A. Driskill-Smith, V. Nikitin, A. Ong, X. Tang, Yongki Kim, J. S. Rho, S. K. Park, S. W. Chung, J. G. Jeong, and S. J. Hong. 2010. Fully integrated 54nm STT-RAM with the smallest bit cell dimension for high density memory application. In Electron Devices Meeting (IEDM), 2010 IEEE International. 12.7.1–12.7.4. https://doi.org/10.1109/IEDM.2010.5703351
[5] Peter J. Denning. 1970. Virtual Memory. ACM Comput. Surv. 2, 3 (Sept. 1970), 153–189. https://doi.org/10.1145/356571.356573
[6] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII). ACM, New York, NY, USA, 37–48. https://doi.org/10.1145/2150976.2150982
[7] M. Ghosh and H. H. S. Lee. 2007. Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs. In 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). 134–145. https://doi.org/10.1109/MICRO.2007.13
[8] James Griffioen and Randy Appleton. 1994. Reducing File System Latency Using a Predictive Approach. In Proceedings of the USENIX Summer 1994 Technical Conference - Volume 1 (USTC '94). USENIX Association, Berkeley, CA, USA, 13–13. http://dl.acm.org/citation.cfm?id=1267257.1267270
[9] Intel Optane SSD DC P4800X Series. 2017. http://www.intel.com/content/www/us/en/solid-state-drives/optane-ssd-dc-p4800x-brief.html

[10] Intel 3D XPoint Technology. 2015. https://www.intelsalestraining.com/infographics/memory/3DXPointc.pdf
[11] Yasuo Ishii, Mary Inaba, and Kei Hiraki. 2009. Access Map Pattern Matching for Data Cache Prefetch. In Proceedings of the 23rd International Conference on Supercomputing (ICS '09). ACM, New York, NY, USA, 499–500. https://doi.org/10.1145/1542275.1542349
[12] N. D. E. Jerger, E. L. Hill, and M. H. Lipasti. 2006. Friendly fire: understanding the effects of multiprocessor prefetches. In 2006 IEEE International Symposium on Performance Analysis of Systems and Software. 177–188. https://doi.org/10.1109/ISPASS.2006.1620802
[13] Song Jiang, Feng Chen, and Xiaodong Zhang. 2005. CLOCK-Pro: An Effective Improvement of the CLOCK Replacement. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC '05). USENIX Association, Berkeley, CA, USA, 35–35. http://dl.acm.org/citation.cfm?id=1247360.1247395
[14] David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul Gratz, and Daniel Jimenez. 2014. B-fetch: Branch prediction directed prefetching for chip-multiprocessors. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 623–634.
[15] Scott F. Kaplan, Lyle A. McGeoch, and Megan F. Cole. 2002. Adaptive Caching for Demand Prepaging. In Proceedings of the 3rd International Symposium on Memory Management (ISMM '02). ACM, New York, NY, USA, 114–126. https://doi.org/10.1145/512429.512445
[16] Jinchun Kim, Seth H. Pugsley, Paul V. Gratz, A. L. Narasimha Reddy, Chris Wilkerson, and Zeshan Chishti. 2016. Path Confidence based Lookahead Prefetching. In Microarchitecture, 2016. MICRO-49. 49th Annual IEEE/ACM International Symposium on. IEEE.
[17] Con Kolivas. 2005. Linux Swap Prefetching. https://lwn.net/Articles/153353/
[18] Christos Kozyrakis, Aman Kansal, Sriram Sankar, and Kushagra Vaid. 2010. Server Engineering Insights for Large-Scale Online Services. IEEE Micro 30, 4 (July 2010), 8–19. https://doi.org/10.1109/MM.2010.73
[19] Min Li, Jian Tan, Yandong Wang, Li Zhang, and Valentina Salapura. 2015. SparkBench: A Comprehensive Benchmarking Suite for in Memory Data Analytic Platform Spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers (CF '15). ACM, New York, NY, USA, Article 53, 8 pages. https://doi.org/10.1145/2742854.2747283
[20] Gabriel H. Loh and Mark D. Hill. 2012. Supporting Very Large DRAM Caches with Compound-Access Scheduling and MissMap. IEEE Micro 32, 3 (2012), 70–78. https://doi.org/10.1109/MM.2012.25
[21] Nimrod Megiddo and Dharmendra S. Modha. 2003. ARC: A Self-Tuning, Low Overhead Replacement Cache. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST '03). USENIX Association, Berkeley, CA, USA, 115–130. http://dl.acm.org/citation.cfm?id=1090694.1090708
[22] Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan. 2012. Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management. IEEE Comput. Archit. Lett. 11, 2 (July 2012), 61–64. https://doi.org/10.1109/L-CA.2012.2
[23] Pierre Michaud. 2016. A best-offset prefetcher. In High Performance Computer Architecture (HPCA), 2016 IEEE 20th International Symposium on. IEEE.


[24] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. 2013. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI '13). USENIX Association, Berkeley, CA, USA, 385–398. http://dl.acm.org/citation.cfm?id=2482626.2482663
[25] Yunjoo Park and Hyokyung Bahn. 2015. Management of Virtual Memory Systems Under High Performance PCM-based Swap Devices. In Proceedings of the 2015 IEEE 39th Annual Computer Software and Applications Conference - Volume 02 (COMPSAC '15). IEEE Computer Society, Washington, DC, USA, 764–772. https://doi.org/10.1109/COMPSAC.2015.136
[26] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. 1995. Informed Prefetching and Caching. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (SOSP '95). ACM, New York, NY, USA, 79–95. https://doi.org/10.1145/224056.224064
[27] Matthew Poremba, Tao Zhang, and Yuan Xie. 2016. Fine-granularity Tile-level Parallelism in Non-volatile Memory Architecture with Two-dimensional Bank Subdivision. In Proceedings of the 53rd Annual Design Automation Conference (DAC '16). ACM, New York, NY, USA, Article 168, 6 pages. https://doi.org/10.1145/2897937.2898024
[28] S. H. Pugsley, Z. Chishti, C. Wilkerson, P. F. Chuang, R. L. Scott, A. Jaleel, S. L. Lu, K. Chow, and R. Balasubramonian. 2014. Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 626–637. https://doi.org/10.1109/HPCA.2014.6835971
[29] Moinuddin K. Qureshi and Gabriel H. Loh. 2012. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '12). IEEE Computer Society, Washington, DC, USA, 235–246. https://doi.org/10.1109/MICRO.2012.30
[30] Luiz E. Ramos, Eugene Gorbatov, and Ricardo Bianchini. 2011. Page Placement in Hybrid Memory Systems. In Proceedings of the International Conference on Supercomputing (ICS '11). ACM, New York, NY, USA, 85–95. https://doi.org/10.1145/1995896.1995911
[31] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y. C. Chen, R. M. Shelby, M. Salinga, D. Krebs, S. H. Chen, H. L. Lung, and C. H. Lam. 2008. Phase-change random access memory: A scalable technology. IBM Journal of Research and Development 52, 4.5 (July 2008), 465–479. https://doi.org/10.1147/rd.524.0465
[32] John T. Robinson and Murthy V. Devarakonda. 1990. Data Cache Management Using Frequency-based Replacement. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '90). ACM, New York, NY, USA, 134–142. https://doi.org/10.1145/98457.98523
[33] Vivek Seshadri, Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry, and Trishul Chilimbi. 2015. Page Overlays: An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 79–91. https://doi.org/10.1145/2749469.2750379
[34] Kai Shen and Stan Park. 2013. FlashFQ: A Fair Queueing I/O Scheduler for Flash-based SSDs. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (USENIX ATC '13). USENIX Association, Berkeley, CA, USA, 67–78. http://dl.acm.org/citation.cfm?id=2535461.2535471
[35] Manjunath Shevgoor, Sahil Koladiya, Rajeev Balasubramonian, Chris Wilkerson, Seth H. Pugsley, and Zeshan Chishti. 2015. Efficiently Prefetching Complex Address Patterns. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 141–152. https://doi.org/10.1145/2830772.2830793
[36] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. 2008. Operating System Concepts (8th ed.). Wiley Publishing.
[37] Sivashankar and S. Ramasamy. 2014. Design and implementation of non-volatile memory express. In Recent Trends in Information Technology (ICRTIT), 2014 International Conference on. 1–6. https://doi.org/10.1109/ICRTIT.2014.6996190
[38] Stephen Somogyi, Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2006. Spatial Memory Streaming. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA '06). IEEE Computer Society, Washington, DC, USA, 252–263. https://doi.org/10.1109/ISCA.2006.38
[39] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. 2007. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture. 63–74. https://doi.org/10.1109/HPCA.2007.346185
[40] Andrew S. Tanenbaum and Albert S. Woodhull. 1997. Operating Systems (2nd ed.): Design and Implementation. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
[41] Kishor S. Trivedi. 1979. An analysis of prepaging. Computing 22, 3 (1979), 191–210.
[42] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. HotCloud 10 (2010), 10–10.
[43] Y. Zhou, Z. Chen, and K. Li. 2004. Second-level buffer cache management. IEEE Transactions on Parallel and Distributed Systems 15, 6 (June 2004), 505–519. https://doi.org/10.1109/TPDS.2004.13