
Rethinking Logging, Checkpoints, and Recovery for High-Performance Storage Engines

Michael Haubenschild
[email protected]
Tableau Software

Caetano Sauer
[email protected]
Tableau Software

Thomas Neumann
[email protected]
Technische Universität München

Viktor Leis
[email protected]
Friedrich-Schiller-Universität Jena

ABSTRACT

For decades, ARIES has been the standard for logging and recovery in database systems. ARIES offers important features like support for arbitrary workloads, fuzzy checkpoints, and transparent index recovery. Nevertheless, many modern in-memory database systems use more lightweight approaches that have less overhead and better multi-core scalability but only work well for the in-memory setting. Recently, a new class of high-performance storage engines has emerged, which exploit fast SSDs to achieve performance close to pure in-memory systems but also allow out-of-memory workloads. For these systems, ARIES is too slow whereas in-memory logging proposals are not applicable.

In this work, we propose a new logging and recovery design that supports incremental and fuzzy checkpointing, index recovery, out-of-memory workloads, and low-latency transaction commits. Our continuous checkpointing algorithm guarantees bounded recovery time. Using per-thread logging with minimal synchronization, our implementation achieves near-linear scalability on multi-core CPUs. We implemented and evaluated these techniques in our LeanStore storage engine. For working sets that fit in main memory, we achieve performance close to that of an in-memory approach, even with logging, checkpointing, and dirty page writing enabled. For the out-of-memory scenario, we outperform a state-of-the-art ARIES implementation by a factor of two.

ACM Reference Format:
Michael Haubenschild, Caetano Sauer, Thomas Neumann, and Viktor Leis. 2020. Rethinking Logging, Checkpoints, and Recovery for High-Performance Storage Engines. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20), June 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3318464.3389716

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGMOD'20, June 14–19, 2020, Portland, OR, USA
© 2020 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6735-6/20/06.
https://doi.org/10.1145/3318464.3389716

1 INTRODUCTION

Durability and recovery after failure are key features of database management systems. The design and implementation of recovery has implications across the entire database system architecture and affects overall performance. For decades, ARIES-style write-ahead logging (WAL) [38] has been the de facto standard for logging and recovery in disk-based database systems. This is due to the large feature set ARIES provides: it works with datasets and transaction footprints much larger than main memory, enables fast recovery in the presence of repeated crashes, provides native and transparent support for indexes and space management, allows page-based fuzzy checkpoints with low interference, and is able to recover from media failures.

Decades of rising DRAM capacities made it possible to keep many datasets in main memory rather than on disk. This hardware trend revealed major overheads in the traditional database system architecture. A study by Harizopoulos et al. [20] found that the Shore storage engine spends more than 50% of its time on buffer management and logging. This has led to the development of in-memory database systems like Silo [49] and VoltDB [36], which avoid buffer management altogether by keeping all data in main memory and rely on lightweight logging techniques rather than full-blown ARIES. Modern in-memory database systems are therefore much more efficient (and scalable) than traditional disk-based implementations. Whereas Shore requires over 200k instructions per neworder TPC-C transaction for logging alone [20], Silo executes the entire TPC-C transaction (including logging) using fewer than 100k instructions.

More recently, fast PCIe-attached solid-state drives (SSDs) have emerged, changing the hardware landscape once again. During the last five years, DRAM prices and capacities have stagnated, while flash-based SSDs have become much cheaper and faster [18]. Currently, main memory costs about 20 times more per gigabyte than SSD. A modern SSD typically has a bandwidth of over 3 GB/s, and a single server has enough PCIe lanes to directly attach 8 or 16 SSDs. This results in a hitherto unprecedented secondary storage bandwidth. SSDs are also much faster at random access than disks, making them suitable for both OLAP and OLTP.


Because SSDs provide block-wise access, and data needs to be loaded into DRAM before it can be processed, a number of high-performance storage engines have recently been proposed to exploit fast SSDs [7, 31, 39, 50]. Using techniques like pointer swizzling [31] and scalable optimistic synchronization [32, 33], systems like LeanStore [31] offer transparent buffer management with very little overhead. This SSD-optimized approach is a new database architecture that significantly differs from both the disk-based and the in-memory designs.

Another potential alternative to SSDs is persistent memory. Unfortunately, the first generation of byte-addressable persistent memory ("Intel Optane DC Persistent Memory") is currently almost as expensive as DRAM, and is therefore not (yet) the ideal medium for primary data storage. However, its low write latency makes it a perfect technology for persisting the log. Therefore, for high-performance storage engines, an economical design is to store the data itself on SSD and use a small amount of persistent memory to store just the tail of the WAL. What currently remains unclear is how to best implement logging and recovery for this new class of database systems.

Neither of the existing approaches (ARIES or in-memory) is a good fit for SSD-optimized high-performance storage engines. Conceptually, traditional ARIES-based designs have all the required features and are designed for page-wise storage access, but they have too much overhead and do not scale well on modern multi-core CPUs due to the centralized logging component. State-of-the-art in-memory recovery techniques, as used by SiloR [55] and Hekaton [13, 30], have little overhead and good scalability, but are not optimized for data sets larger than memory.

In this work, we propose a logging and recovery design that is well suited for modern high-performance storage engines. It extends the scalable logging scheme proposed by Wang and Johnson [52] with a novel continuous checkpointing algorithm, an efficient page provisioning strategy, and an optimized cross-log commit protocol. We integrate all these components into LeanStore [31], a lightweight buffer manager that uses pointer swizzling and optimistic synchronization. The resulting system can sustain transaction rates close to those of a pure in-memory system when all data fits in DRAM, and it handles out-of-memory workloads transparently. Our approach has low CPU overhead, scales well on multi-core CPUs, balances I/O overhead and recovery time, and exploits modern storage devices such as persistent memory for low-latency commits and PCIe-attached SSDs for larger-than-memory data sets. It offers many of the features typically associated with ARIES, including fuzzy checkpoints, index recovery, and transaction footprints larger than memory (steal), but has much lower overhead than the ARIES implementations used in disk-based systems. Compared to the over 200k instructions per TPC-C transaction spent on logging in Shore [20], our approach requires only 15k instructions and scales well on multi-core CPUs.

The key contributions of this work are (1) a framework of practical techniques for all recovery- and durability-related components in flash-optimized high-performance storage engines, (2) a logging and commit protocol that uses remote flush avoidance for low-latency transactions on persistent memory, (3) a novel design for low-overhead, continuous checkpointing that bounds recovery time and smoothly spreads writes over time, (4) a thorough discussion of page provisioning in the context of high-performance buffer managers, and (5) an implementation and evaluation of our approach with real persistent memory against existing disk-based and in-memory designs, demonstrating its low instruction footprint, good scalability, and fast recovery.

2 BACKGROUND

This section reviews system designs that are relevant to the approach proposed in this paper.

2.1 ARIES

ARIES [37, 38] has become the standard logging and recovery mechanism of disk-based DBMSs. One of the cornerstones of the ARIES design is physiological logging, which assigns a page identifier to each log record but permits logical updates within a page. Logging in such a page-oriented fashion implies that pages can be recovered in parallel and provides a good trade-off between log volume and replay efficiency. ARIES also supports in-place updates on database storage with a steal policy, i.e., uncommitted changes can be written to disk. In order to remove such changes during recovery, an undo phase executes and logs compensation operations that are the logical inverse of the operations performed during forward processing or redo recovery. Such logical compensation steps enable record-level locking (i.e., concurrency control at the sub-page level) and system transactions that perform maintenance steps such as page splits.

Physiological logging and logical undo also provide two key benefits, the first of which is fuzzy checkpointing. This means that a dirty page can be written at any time, regardless of transaction activity, as long as no physical operations are in progress on that page—i.e., as long as appropriate latches are acquired. Fuzzy checkpointing introduces very low overhead on running transactions and permits constant propagation of updates to disk, which guarantees bounded recovery time, steady forward processing performance, and well-balanced I/O throughput. In addition, fuzzy checkpointing makes it much easier to produce both full and incremental backups, and thus provides support for efficient media recovery.


The second benefit is the ability to log and recover arbitrary page-based data structures, which implies that indexes can be recovered along with primary data and thus need not be fully rebuilt during recovery.

Despite all the advantages mentioned here, ARIES has been deemed obsolete by many recent proposals for its excessive forward-processing overhead and lack of scalability. Unfortunately, most of these approaches (as detailed below) also forego the features and advantages mentioned here. Therefore, a key design objective of this work is to maintain the features of ARIES-style write-ahead logging without abandoning performance and scalability.

2.2 In-memory Database Systems

In-memory database systems, in contrast to ARIES, employ lightweight logging techniques. While many variations of logical and physical logging exist (which we review in Section 5), the value logging approach used in Silo [49, 55] is the most relevant for this paper. With value logging, a log record does not contain a page identifier; rather, it simply stores a logical tuple identifier and a transaction ID, along with the modified tuple's contents. The main advantage of this approach is that it removes the dependency between log records of the same page, which in physiological logging is a source of additional thread synchronization during both normal processing and recovery.

To eliminate the bottleneck of a centralized log, Silo employs distributed logging among CPU cores [28, 49, 52]. This approach uses multiple log buffers, where each transaction is assigned to a log buffer (i.e., logs are partitioned by transaction) and each buffer is written to its own file without any coordination with other threads. Since durability is guided by epochs and a group commit protocol, recovery processing requires establishing the maximum epoch that is persisted in every log file. Then, each log file can be replayed in parallel and in an arbitrary order, thanks to the value logging mechanism and the use of monotonically increasing transaction IDs (largest ID wins). Despite these advantages, value logging suffers from the following inherent disadvantages: it does not support index recovery, requires the entire data set to live in main memory, and lacks incremental checkpoints. Our approach combines the best of both worlds: the scalability and low overhead of value logging and the features of physiological logging.
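To make the contrast concrete, the following C++ sketch juxtaposes the two record layouts; the field names are illustrative and not taken from Silo or any ARIES implementation.

#include <cstdint>
#include <vector>

// Illustrative comparison of the two logging granularities (hypothetical
// field names, not from any actual system).
struct PhysiologicalLogRecord {          // ARIES-style
    uint64_t pageId;                     // page the change belongs to
    uint16_t slotId;                     // logical position within the page
    uint64_t lsn;                        // position in the single, global log
    std::vector<uint8_t> beforeImage;    // for undo
    std::vector<uint8_t> afterImage;     // for redo
};

struct ValueLogRecord {                  // Silo-style value logging
    uint64_t txnId;                      // monotonically increasing; largest ID wins on replay
    uint64_t tupleId;                    // logical tuple identifier, no page reference
    std::vector<uint8_t> tupleValue;     // full contents of the modified tuple
};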

2.3 Checkpoints

Checkpoints in a physiological logging system require writing dirty pages to persistent storage. Unfortunately, as we detail in Section 3.4, the state of the art has not been developed beyond the traditional disk-based techniques. Instead, recent research has mostly focused on in-memory databases that must write the entire database at a tuple granularity.

Figure 1: GSNs establish a partial order between log records sufficient for recovery. If two changes depend on each other, the second one will have a higher GSN (a). For independent changes, GSNs are unordered (b).

This paper bridges this gap by proposing checkpointing and eviction techniques that are appropriate for the high transaction rates and I/O volume made possible by modern hardware.

2.4 Scalable Logging

The obvious approach for improving the scalability of ARIES is to replace the single global log with multiple logs, and assign one or a few threads to each. Thus, when transactions write log records, they do not contend on a single global lock. However, on commit, a transaction has to persist the log records in other log partitions upon which its own changes depend. When the logs are stored on HDD, or even on SSD, flushing several logs becomes prohibitively expensive. Furthermore, distributed logs require additional measures for correct recovery: since each log assigns its own local LSN, the order in which to replay changes for a given page from different logs becomes nondeterministic.

Wang and Johnson [52] propose a scalable logging technique that tackles both issues by exploiting persistent memory and introducing the concept of global sequence numbers (GSNs). Persistent memory, which is by now commercially available (e.g., Intel Optane DC Persistent Memory), reduces the latency until changes are drained from CPU caches to persistent storage by more than an order of magnitude. Furthermore, these flushes can be done fully in parallel.

GSNs are a lightweight, decentralized mechanism that establishes a partial order between log records from different log partitions, similar to distributed clocks [29] (also known as "Lamport timestamps"). In the distributed clock analogy, both transactions and pages act like processes in a distributed system, with log records being the events that need to be ordered. A transaction's txnGSN (timestamp) is set to max(txnGSN, pageGSN) whenever it accesses a page (synchronizes its local clock). New log records (events) are created with txnGSN+1. When two transactions access the same page (Figure 1a), the protocol ensures that the log record for the first change has the smaller GSN.


On the other hand, when two transactions access distinct pages, their GSNs are not synchronized, and the latter event can even have a smaller GSN (Figure 1b). Note that the full protocol also ensures GSN ordering of log records inside each log, which we omitted from Figure 1 for simplicity. During recovery, log records for a page are gathered from all individual logs and sorted by GSN before they are applied.
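The GSN rule sketched above is essentially a Lamport-clock update performed on every page access; the following C++ fragment illustrates it under the simplifying assumption that the page latch is already held (the names are ours, not from the Shore-MT prototype).

#include <algorithm>
#include <cstdint>

using GSN = uint64_t;

struct Page        { GSN pageGSN = 0; /* ... page contents ... */ };
struct Transaction { GSN txnGSN  = 0; /* ... */ };

// On every page access, the transaction synchronizes its "clock" with the page.
void onPageAccess(Transaction& txn, const Page& page) {
    txn.txnGSN = std::max(txn.txnGSN, page.pageGSN);
}

// When the transaction modifies the page, the new log record (event) gets a
// timestamp higher than both clocks, so dependent changes are ordered by GSN.
GSN onPageModification(Transaction& txn, Page& page) {
    GSN recordGSN = std::max(txn.txnGSN, page.pageGSN) + 1;
    txn.txnGSN = recordGSN;
    page.pageGSN = recordGSN;
    return recordGSN;   // stored in the log record written to this thread's log
}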

For durability, Wang and Johnson propose passive group commit: when a transaction commits, it first flushes its own log and then waits in a group commit queue. Once all other logs are durable up to the transaction's commit record GSN, its commit is finally acknowledged. This design effectively solves the scalability issue for logging, and indeed we use it as the basis for logging in our system. However, its implementation in Shore-MT still suffers from high instruction overhead, has unnecessarily high transaction latencies, and relies on a custom kernel module. For reference, the reported TPC-C throughput [52] with 40 threads is below what our system achieves with a single thread (26k vs. 41k txn/s).

3 LOW-LATENCY LOGGING AND ROBUST CHECKPOINTING WITH BOUNDED RECOVERY

In this paper, we propose a holistic approach for logging, checkpointing, and dirty page writing that combines the benefits of ARIES with those of lightweight logging techniques. In our approach, each worker thread is assigned its own log, which eliminates the single point of contention and improves scalability on multi-core CPUs. Log records consist of a type, page ID, transaction ID, GSN (see Section 2.4), and the before and after image of each change. Using a small persistent memory buffer, transactions commit immediately when they are finished, without being appended to a group commit queue first. We improve over existing distributed logging schemes with a new mechanism we call Remote Flush Avoidance (RFA), which can detect and skip unnecessary flushes in logs of other threads. Furthermore, our system uses a novel continuous checkpointing algorithm that smoothly spreads I/O over time and thus avoids write bursts and spikes in transaction latency. Compared to traditional ARIES implementations, the instruction overhead for creating log entries and writing them to persistent storage is much lower in our system. Finally, we support larger-than-memory workloads, since we build our logging framework on top of the state-of-the-art LeanStore engine.
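For reference, a log record in this design carries roughly the following fields; the concrete layout below is a simplified sketch, not the actual LeanStore definition.

#include <cstdint>

using GSN = uint64_t;

enum class LogRecordType : uint8_t { Insert, Update, Delete, Commit, AbortEnd };

// Simplified, illustrative WAL record header; the variable-sized before and
// after images of the change follow directly behind it.
struct LogRecordHeader {
    LogRecordType type;
    uint64_t pageId;        // physiological logging: page the change applies to
    uint64_t txnId;         // transaction that created the record
    GSN gsn;                // partial order across log partitions (Section 2.4)
    uint16_t payloadSize;   // combined size of before and after image in bytes
    // followed by: uint8_t payload[payloadSize];
};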

3.1 Two-Stage Distributed Logging

The main scalability issue of ARIES is the synchronized access to the global log.

Figure 2: Overview of two-stage distributed logging (Stage 1: persistent memory, Stage 2: SSD, Stage 3: log archive).

Approaches such as Aether [22, 23], ELEDA [24], and Border-Collie [25] reduce contention on the log by minimizing the time spent in critical sections, but with a large-enough core count, the single log still becomes the scalability bottleneck. Therefore, in our system, each worker thread has its own separate log, as shown in Figure 2. Every transaction is pinned to a worker thread, so that its log records are all written to exactly one of those logs. Different transactions can still operate on the same page, so log records for a certain page can end up in different logs. This design is based on the transaction-level log partitioning approach of Wang and Johnson [52] explained in Section 2.4.

The log is organized into three stages. The first stage consists of a small number of log chunks organized in a circular list as shown in Figure 2. One of the chunks is always designated as the current chunk, which is where transactions append their log records. Whenever the current log chunk becomes full, it enters the full portion of the list. From there, chunks are picked up by a dedicated WAL writer thread—also one for each log—which flushes them into the second stage. After a chunk is successfully written, its buffer is zeroed out and placed into the free portion of the list, from which it will eventually be picked up as the current chunk to append new log records. Lastly, log files are archived into the third stage for media recovery—this is shown on the left side of Figure 2.
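A minimal sketch of this chunk rotation for one log partition is shown below (synchronization between the worker and its WAL writer thread is omitted; class and member names are illustrative).

#include <cstddef>
#include <cstring>
#include <deque>
#include <vector>

// One log partition: chunks cycle through current -> full -> (written to SSD
// by the WAL writer) -> free -> current.
struct LogChunk {
    std::vector<char> buffer;
    size_t used = 0;
    explicit LogChunk(size_t size) : buffer(size) {}
};

class LogPartition {
    LogChunk* current;
    std::deque<LogChunk*> fullChunks;   // waiting for the WAL writer
    std::deque<LogChunk*> freeChunks;   // zeroed, ready for reuse

public:
    LogPartition(size_t numChunks, size_t chunkSize) {
        for (size_t i = 0; i < numChunks; i++)
            freeChunks.push_back(new LogChunk(chunkSize));
        current = freeChunks.front();
        freeChunks.pop_front();
    }

    // Worker side: append a log record to the current chunk (stage 1).
    void append(const void* record, size_t size) {
        if (current->used + size > current->buffer.size()) {
            fullChunks.push_back(current);        // hand over to the WAL writer
            current = freeChunks.front();
            freeChunks.pop_front();
        }
        std::memcpy(current->buffer.data() + current->used, record, size);
        current->used += size;
    }

    // WAL writer side: flush one full chunk to SSD (stage 2), then recycle it.
    void writeOneChunk(/* stage-2 log file on SSD */) {
        LogChunk* c = fullChunks.front();
        fullChunks.pop_front();
        // ... write c->buffer to the SSD log file, later archive it (stage 3) ...
        std::memset(c->buffer.data(), 0, c->buffer.size());
        c->used = 0;
        freeChunks.push_back(c);
    }
};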

In the hardware configuration assumed throughout this paper, the first stage resides on persistent memory or battery-backed DRAM. Thus, transactions only need to flush CPU caches when they want to persist log records upon transaction commit, and do not have to wait for the staging of full log chunks to SSD. This allows for very low commit latency and high transaction throughput without the need for group commit. For the second stage, SSD storage is a good candidate because it is much cheaper than persistent memory. Furthermore, SSD bandwidth is sufficient for the amount of log volume that even the fastest transaction systems can produce.

Notwithstanding the benefits of logging to persistent memory, our design is not restricted to it. An alternative solution that keeps the first stage in DRAM and guarantees persistence once log chunks are flushed to SSD still achieves high throughput by employing an RFA-optimized version of group commit. This alternative is presented at the end of Section 3.2.


Figure 3: Two independent transactions and the synchronizing operations in different logging strategies (ARIES: global latch; distributed logging: flushes on n partitions; RFA: no contention and no remote flushes).

3.2 Low-Latency Commit Through RFA

Splitting the log into multiple partitions and pinning each transaction to one of those partitions allows worker threads to create log entries without synchronization. However, to guarantee consistency, all log partitions need to be flushed up to the transaction's current GSN, as depicted for "Distributed Logging" in Figure 3.

Let us give an example of why this is necessary: suppose transaction Txn1 deletes a tuple on page PA and records this change with GSN 12 in log L1, but does not yet commit. Then transaction Txn2 also inserts a tuple on PA and records this in its own log L2. The GSN protocol mandates that this change will be recorded with a higher GSN, e.g., GSN 13. When Txn2 commits, it has to ensure that all prior changes on this page have been persisted in the log. It therefore needs to flush all other log partitions up to GSN 13.

Profiling our system showed that these remote log flushes lead to significantly reduced scalability (see the experiment in Section 4.1). Wang and Johnson's approach [52] works around this issue by using passive group commit, where transactions only flush their own log and are put in a commit queue. A group commit thread periodically checks the minimum flushed GSN of all logs and sets those transactions to committed for which all necessary log records are persisted. Group commit is a good solution for workloads with many concurrent transactions and without low-latency requirements.

We propose a new technique called Remote Flush Avoidance (RFA), which enables high throughput and scalability while still providing low-latency single-transaction commits. The motivation for RFA comes from the observation that most transactions neither conflict logically nor modify the same set of physical pages. A commit dependency is therefore not necessary. Consider the example shown in Figure 1b, in which Txn1 pessimistically flushes the log of Txn2 since GSN5 < GSN8, despite the fact that the two transactions modify different pages (PA and PB). The basic distributed logging approach fails to catch the independence of Txn1 and Txn2 as it projects the dependency graph onto a single time axis. In other words, it linearizes the partial ordering of GSNs back to a total ordering, and thereby sacrifices some of its scalability advantages. When two independent transactions create log records concurrently, one of them will unavoidably have the smaller GSN, and thus the other transaction needs to flush it.

RFA adds a few lightweight steps to the GSN protocol that enable it to skip most remote partition flushes:

(1) For each page, we remember Llast, the log that contains the record for the most recent modification of this page. This information does not need to be persisted and can be stored in the buffer frame for that page.

(2) When a transaction starts, it determines GSNflushed, the maximum GSN up to which all logs have been flushed at that point. Any log record created afterwards is guaranteed to have a higher GSN (see Section 2.4).

(3) Lastly, each transaction maintains a needsRemoteFlush flag, which is initially set to false.

Whenever a transaction accesses a page (either for read or write), it checks if the page GSN is below GSNflushed. If that is the case, then all previous log records of that page have already been flushed and the page access can proceed. Otherwise, if the page GSN is above GSNflushed, the algorithm further checks if the current transaction's log is the same as Llast. If that is the case, the page access can also proceed, because it means that the last modification is in the same log and will thus be flushed anyway when the current transaction commits. Only if both checks fail is the needsRemoteFlush flag set to true, which causes all logs to be flushed when the transaction commits. In a nutshell, remote flushes are avoided whenever the log records of a transaction depend only on (1) changes that are guaranteed to be flushed or (2) changes from its own log.
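The RFA checks amount to only a few instructions on the page-access path; the following sketch uses illustrative names (lastLogId corresponds to Llast, gsnFlushedAtStart to GSNflushed).

#include <cstdint>

struct BufferFrame {
    uint64_t pageGSN;
    int lastLogId;                  // log holding the most recent change of this page (L_last)
};

struct Txn {
    int logId;                      // log partition this transaction is pinned to
    uint64_t gsnFlushedAtStart;     // GSN up to which all logs were flushed at transaction start
    bool needsRemoteFlush = false;
};

// Called on every page access (read or write).
void rfaCheck(Txn& txn, const BufferFrame& frame) {
    if (frame.pageGSN < txn.gsnFlushedAtStart)
        return;                     // all prior changes of this page are already durable
    if (frame.lastLogId == txn.logId)
        return;                     // last change is in our own log; flushed at commit anyway
    txn.needsRemoteFlush = true;    // both checks failed: flush all logs at commit
}

// Called at commit time.
void commit(Txn& txn /*, LogManager& logs */) {
    // logs.flushOwnLog(txn.logId, commitGSN);
    if (txn.needsRemoteFlush) {
        // logs.flushAllLogsUpTo(commitGSN);   // the rare case
    }
}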

Figure 3 summarizes the benefits of RFA. Given two independent transactions, ARIES requires synchronized access for each log record, while standard distributed logging frequently requires synchronization on commit. In order to avoid this, one could explicitly track dependencies, but this is prohibitively expensive, as noted previously [52]. Therefore, RFA detects such dependencies without expensive bookkeeping data structures, allowing the two transactions shown in Figure 3 to commit without any synchronization.

RFA and group commit are orthogonal optimizations that can be combined or used individually, yielding four designs.


When persistent memory is available, we argue that RFA without group commit is the best approach, as it enables low-latency commits. However, even in the absence of persistent memory, RFA helps to reduce transaction latency when combined with group commit. Without RFA, group commit requires a transaction to wait for all logs to be persisted before its commit is acknowledged. With RFA, transactions whose needsRemoteFlush flag is false can commit as soon as their own log is persisted. In this design, in addition to a global group commit queue, each log has its own queue for transactions that do not require a remote flush.

3.3 Challenges of Checkpointing

In a broad sense, a checkpoint is any technique that aims to bring the persistent state of a database up to date in order to reduce recovery time and recycle log space. The classic paper by Härder and Reuter [19] provides an abstract classification of checkpointing techniques and how they relate to the granularity of logging. Most in-memory systems write the entire contents of the database in a transaction-consistent manner when taking a checkpoint. To achieve that, these systems require shadow copies of individual tuples in main memory, usually in combination with multiversion concurrency control [49], or in some cases by duplicating the entire database [10]. These systems are usually slower while a checkpoint is being taken, because individual records must be versioned and transaction activity must halt—or at least be coordinated in multiple phases [45]—to establish a point of consistency. Given these restrictions, it is advisable to take checkpoints sparingly in such systems. However, this translates to longer recovery times, which is further worsened by the fact that log replay is substantially slower (see Section 5).

For the reasons given above, this paper focuses on page-based checkpointing to complement our page-based buffer management and logging schemes. Robust page-based checkpointing is a difficult challenge in practice because it requires finding an acceptable trade-off between opposing goals. In order to bound recovery time to an acceptable level, a system must flush dirty pages at a rate that is compatible with the rate of incoming transactions. If the flush rate is too high, I/O resources are wasted and write amplification increases needlessly. On the other hand, if the flush rate is too low, the log size cannot be bounded, which can result in a service outage when the log device fills up, and in a long recovery time in case of a failure. A low rate may also cause the buffer manager to be saturated with dirty pages, which can cause a drop in transaction throughput if insertions or queries cannot allocate memory. Even if an appropriate flush rate is found, the checkpointer must be smart about which pages to flush, since some dirty pages are worth flushing at a lower rate than others (e.g., hot pages with frequent updates). These issues have been largely ignored in recent research, which has focused mainly on alternative architectures that propagate changes at record (rather than page) granularity.

One good example of a checkpointing implementation in traditional WAL systems is PostgreSQL [48]. It uses two independent processes to write dirty pages: a checkpointer and a background writer. The former is responsible for bounding the size of the log and thus recovery time; it is triggered when the log reaches a certain size, or by a timeout (5 minutes by default). The checkpointer flushes every dirty page in the buffer pool (i.e., a direct checkpoint [19]). As such, the LSN of the checkpoint log record also establishes the starting point of redo recovery. One problem with this approach is that it incurs periodic bursts of high I/O activity and contention on the buffer manager data structures, leading to performance dips. To solve this problem, the background writer is triggered at fixed time intervals and writes a certain number of dirty pages calculated from configuration parameters. Besides PostgreSQL, other approaches argue that checkpointing should be done "periodically" [38], "every few seconds" [10], or "roughly 10 seconds after the previous checkpoint completed" [55]. The inherent problem of such time-based policies is that they fail to adapt to changes in the workload and shift the problem of setting reasonable values to the database administrator.

3.4 Continuous Checkpointing

Since the principal goal of checkpointing is to bound recovery time, which mainly depends on the amount of log that needs to be processed, it makes sense to couple the checkpointing policy to the amount of log written. Suppose we want to limit the size of the WAL to 20 GB. To achieve this, we need to write out certain (but not all) dirty pages. To be precise, all pages in the buffer pool that contain unflushed changes captured in log records that are more than 20 GB away from the tail of the log need to be written out before the log can be truncated to 20 GB. At the same time, we do not want the checkpointer to cause high CPU overhead and heavy write bursts by constantly scanning the whole buffer pool just to be able to truncate a few MB of WAL.

To solve this problem, we propose a new technique called continuous checkpointing. Instead of taking a checkpoint over the whole buffer pool at once, we incrementally checkpoint fractions of it, which avoids bursts in disk writes and leads to steadier performance. After each step, a small fraction of the WAL can be truncated.

To achieve this, we logically partition the buffer pool into S buffer shards. In each checkpointer invocation, which we call a checkpoint increment, the checkpointer picks a shard in a round-robin fashion and writes out all dirty pages in it. An increment is triggered whenever 1/S of the configured log limit is moved to the second stage of the log (see Figure 2).


// Maintain the persisted GSN for each shard
GSN maxChkptedInShard[num_shards]; // Initially zero
unsigned current_incr = 0;

// Triggered by writing a certain amount of WAL
void checkpoint_increment() {
   min_current_gsn = logs.getMinCurrentGsn();
   shard = current_incr % num_shards;
   pages = bufferPool.getPagesInShard(shard);
   foreach (page in pages) {
      if (page.isDirty()) write(page);
   }
   maxChkptedInShard[shard] = min_current_gsn;
   // Get the minimum GSN up to which all
   // shards are checkpointed
   chkpted_gsn = min_val(maxChkptedInShard);
   min_tx_gsn = txManager.getMinActiveTxGsn();
   logs.prune(min(chkpted_gsn, min_tx_gsn));
   ++current_incr;
}

Figure 4: Pseudo code for continuous checkpointing.

For example, with a WAL limit of 20 GB, a buffer pool of 50 GB, and S = 10 shards, a checkpoint increment of a 5 GB shard is triggered for every 2 GB of generated WAL.

The checkpointer maintains a table that, for each buffer shard, stores the GSN up to which all changes have been persisted in that shard. Before each increment, the checkpointer remembers the smallest current GSN among all logs, GSNmin. Any log records created afterwards are guaranteed to have a GSN larger than GSNmin. When the checkpointer finishes a shard, it updates that shard's table entry with GSNmin. The minimum value among all entries of this table is then used to determine the point up to which the log can be archived (as long as the GSN of the oldest active transaction also permits it—see Section 3.6). The algorithm is shown in Figure 4.

Figure 5 shows an example with a buffer pool of six pages and S = 3 shards. The most recent checkpoint increment has been on shard b1, at which point GSNmin was GSN 34 (table in the center). Since buffer shard b2 is the next one to be checkpointed, it consequently has the oldest persisted GSN and limits log truncation to records with GSN < 8. The staging of the orange WAL chunk now triggers the checkpoint increment of b2. The only dirty page in b2 is page G, which is written out. Next, the checkpointer updates the table entry for shard b2 with GSN 46, since that is the minimum current GSN across all logs when it started the increment. This makes GSN 24 the new overall minimum log record needed for recovery. Taking into account the minimum GSN of all active transactions, which is GSN 25, the checkpointer can now move the blue log chunk to the log archive, since that chunk only contains log records up to GSN 23.

Figure 5: Checkpointing example. The frequency of checkpoint increments is coupled to WAL volume.

This technique is the key to continuously making checkpoint progress (i.e., increasing the maximum persisted GSN) without time-based configuration settings while still avoiding excessive write amplification. In steady state, for each new fraction of WAL staged, the oldest one will be pruned and the WAL volume will stay stable at its configured size. Furthermore, this design eliminates both the manual tuning knob as well as any write bursts that would occur with full checkpoints, and it is effective in keeping the recovery time bounded. By increasing the number of shards to, e.g., S = 128, the smoothness of the writes can be increased and the absolute deviation from the configured WAL limit reduced.

3.5 Page Provisioning

Besides checkpointing, there is another reason to persist dirty pages in a buffer-managed system: when pages have to be read from disk but the buffer pool is already full, some other pages need to be evicted. If these pages are dirty, they need to be written out to disk first. Database systems usually have a dedicated background thread for this (e.g., the background writer of PostgreSQL [48] mentioned earlier), so worker threads do not have to issue synchronous writes before reading a page. However, research so far has looked at the dirty page writer mostly in isolation.

We take a holistic view of the different page flows that happen inside the storage manager and how they interact with each other. LeanStore partitions pages into a hot and a cool area for page eviction [31]. Pages in the hot area can be directly referenced by pointers, i.e., they are swizzled, while cool pages must be looked up with their page ID. As Figure 6 shows, we consider a storage manager to be a closed dynamic system in which buffer pages flow between different states. Assuming the workload is in steady state, worker threads continuously swap buffer pages into the swizzled (hot) state at a certain rate, e.g., 1,000 pages per second. They do so by allocating a new page, accessing a page on disk, or (this is specific to our storage manager design) accessing a page that is currently in the unswizzled (cool) state.


Figure 6: The division of LeanStore's buffer pool into hot (swizzled), cool (unswizzled), and free pages. Actions in the system cause pages to transition between the states. In steady state, worker threads request exactly as many free pages as the page provider supplies.

Since LeanStore achieves several hundred thousand transactions per second, even a moderate amount of additional work, such as consulting the page replacement strategy, can lead to a significant drop in throughput. Consequently, we exclude all other transitions (shaded arrows in Figure 6) from the critical path, so worker threads can focus solely on transaction processing. This means that in steady state, a certain number of pages is in the "free" state. These pages are kept in a separate list, allowing workers to request new pages with very little overhead, in particular without consulting global eviction data structures. The percentage of pages in the free list can be low, e.g., 1% as shown in Figure 6. It must only suffice to bridge short bursts of page requests.

For all support tasks, we introduce a dedicated page provider thread that does all of the extra housekeeping work:

(1) It unswizzles pages and puts them into the cool area, which is organized as a FIFO queue. From there they get re-swizzled if worker threads access them again in a timely manner. Access to the queue is sped up with a hash table lookup.

(2) Pages from the older end of the queue are evicted and put into the free list.

(3) Before a dirty page can be evicted, it is persisted to disk first.

A classical dirty page writer, as employed by PostgreSQL for example, is only concerned with the third point. Our page provisioner, in contrast, is responsible for keeping the whole buffer pool in equilibrium. Worker threads wake up the page provisioner whenever the buffer pool deviates from the desired composition. It then runs as many rounds as necessary until the target sizes of the FIFO queue and the free list are restored. In each round, it first unswizzles a fixed number of pages, e.g., 256, and inserts them at the front of the queue. Second, it iterates over the queue from the back and unlinks clean pages, buffering them in a local list. Once that list reaches a certain size, it is appended to the free list with a single latch acquisition. Dirty pages, on the other hand, are simply marked as writeBack and copied into a local writeback buffer. Pages in the writeBack state can still be modified. They can even transition between the swizzled and unswizzled state. However, they cannot be evicted, as otherwise it cannot be guaranteed that a subsequent read from disk will return the latest version. When the writeback buffer fills up, it is written out in one go and the disk cache is flushed. Note that the page provisioner gives up all latches during the I/O operation, so afterwards it starts again at the back of the queue. There it finds the (hopefully still clean) pages that it just wrote out and can move them to the free list. Thus, clean pages in the FIFO queue are accessed exactly once, while dirty pages are normally accessed twice.
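One round of the page provider can be sketched as follows; latching, the hash table over the cool queue, and the actual I/O are omitted, and all names are illustrative.

#include <cstdint>
#include <deque>
#include <vector>

struct Frame { uint64_t pageId; bool dirty = false; bool writeBack = false; };

struct Pool {
    std::deque<Frame*> coolQueue;   // front: recently unswizzled, back: eviction candidates
    std::vector<Frame*> freeList;   // published with a single latch acquisition
};

void pageProviderRound(Pool& pool, const std::vector<Frame*>& unswizzledBatch,
                       size_t writebackBatchSize) {
    // (1) Insert the freshly unswizzled pages at the front of the FIFO queue.
    for (Frame* f : unswizzledBatch)
        pool.coolQueue.push_front(f);

    // (2) Scan from the back: unlink clean pages into a local list, mark dirty
    //     pages as writeBack and collect (copies of) them for the writeback buffer.
    std::vector<Frame*> localFree;
    std::vector<Frame*> writeback;
    for (size_t i = pool.coolQueue.size(); i-- > 0 && writeback.size() < writebackBatchSize; ) {
        Frame* f = pool.coolQueue[i];
        if (!f->dirty) {
            pool.coolQueue.erase(pool.coolQueue.begin() + i);   // clean: evictable now
            localFree.push_back(f);
        } else {
            f->writeBack = true;        // still modifiable, but not evictable
            writeback.push_back(f);     // a copy of the page content is taken here
        }
    }

    // (3) Publish the clean pages with one latch acquisition, then write the
    //     dirty copies out in one go and flush the device cache. The next round
    //     revisits the just-written (now hopefully clean) pages and frees them.
    pool.freeList.insert(pool.freeList.end(), localFree.begin(), localFree.end());
    // writeAndFlush(writeback);   // asynchronous I/O + fdatasync, see Section 3.8
}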

One might wonder whether it would be better to introduce separate threads for unswizzling, page writing, and page eviction. However, all three actions depend on each other: pages are written out at the latest possible moment before they get evicted, thus avoiding write amplification from writing out pages multiple times. Also, in order to evict the correct pages, they need to migrate through the FIFO queue first. Thus, the queue must always stay appropriately filled. Doing all of this in the same thread proves to be a simple and robust design, where pages move continuously and no single action can outrun the others, which would introduce imbalance.

3.6 Transaction Abort

Traditional systems implement the steal paradigm, i.e., uncommitted changes may be written back to the persisted database, whereas for in-memory systems only a no-steal policy makes sense. In terms of CPU instructions, there is not much difference whether the undo images are stored in the log or in a dedicated in-memory rollback segment. For high-performance storage engines, in principle both approaches would be possible. With current DRAM sizes, one could consider a system that pins all pages with uncommitted changes in the buffer pool. The advantages of such a system are the omission of undo recovery and therefore less WAL volume. On the other hand, a system that implements steal can handle larger-than-RAM write transactions. Also, tracking all pages with uncommitted changes incurs overhead and can hold back the checkpointer from forcing out the pages necessary for truncating the log, which violates our design goals.

For these reasons, we use a steal architecture and write before-images to the WAL for each data modification. Transaction aborts are done logically, i.e., the regular access path is used to execute the reverse operation (e.g., delete for insert and vice versa), including writing new log records. Aborts are efficient, as all log records for a transaction can be found in a single log. After all changes are undone, an end-of-transaction record is inserted into the log. The final log flush required for successful transactions can be omitted.


Figure 7: Overview of the first two recovery phases. The final undo phase is not shown.

Most transaction aborts are fast because recent log records still reside in persistent memory. Therefore, only old or large transactions need to read from SSD. We found that the additional space required in the WAL for undo images is often small (e.g., only 20%, or 2,230 bytes vs. 1,850 bytes on average for a TPC-C transaction). The reason is that insert log records do not require a before-image at all, update records only contain the before and after image of the changed attributes, and deletes are infrequent in common workloads.

3.7 Recovery

Our checkpointing strategy ensures a tight bound on the log volume that needs to be recovered after a failure. However, the concrete value should be set in proportion to the buffer pool size, so that the ratio between the amount of WAL written and the corresponding checkpoint work stays constant. Thus, for large buffer pools, e.g., several hundred gigabytes, the amount of work during recovery is not trivial.

Similar to ARIES, recovery has three phases: log analysis, redo, and undo. In order to still be able to recover in an adequate time frame, we parallelize each phase. As Figure 7 shows, in the first phase, each thread scans all chunks of a log partition, including those residing in persistent memory, to separate winner and loser transactions. For winner transactions, it partitions the log records by page ID into a thread-local redo table. Log records for loser transactions are put into a separate undo list.

In the redo phase, each thread is assigned a range of page IDs. For each range, it merges the log records from all redo tables from the previous phase. Afterwards, it sorts them by (pageId, GSN), so that all log records for a page end up next to each other in GSN order. Then, redo recovery can proceed page by page. All log records for a page, including those of loser transactions, are applied at this step. Merge, sort, and redo can happen interleaved and without synchronization, since threads operate on distinct ranges.

Finally, in the undo phase, all changes from loser transactions are reverted by iterating over the log records in the undo list. Each change is undone logically, as explained for transaction aborts in Section 3.6.
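A sketch of the merge-and-sort step of the redo phase for one page-ID range, assuming the per-thread redo tables from the analysis phase are simple hash maps (names and types are illustrative):

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct RedoEntry { uint64_t pageId; uint64_t gsn; /* log record payload omitted */ };

// One redo table per log partition, built during log analysis:
// page ID -> log records collected for that page.
using RedoTable = std::unordered_map<uint64_t, std::vector<RedoEntry>>;

// Redo for one page-ID range [lo, hi); each recovery thread processes its own range.
void redoRange(const std::vector<RedoTable>& redoTables, uint64_t lo, uint64_t hi) {
    // 1. Merge: gather the records of this range from all per-thread redo tables.
    std::vector<RedoEntry> records;
    for (const RedoTable& table : redoTables)
        for (const auto& [pageId, entries] : table)
            if (pageId >= lo && pageId < hi)
                records.insert(records.end(), entries.begin(), entries.end());

    // 2. Sort by (pageId, GSN) so all records of a page are adjacent and in GSN order.
    std::sort(records.begin(), records.end(), [](const RedoEntry& a, const RedoEntry& b) {
        return a.pageId != b.pageId ? a.pageId < b.pageId : a.gsn < b.gsn;
    });

    // 3. Apply the records page by page.
    for (const RedoEntry& r : records) {
        (void)r;            // applyToPage(r) in the real system
    }
}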

3.8 Implementation Details

This section discusses implementation details and performance optimizations that we implemented in our system.

Persistent memory can be accessed either in memory mode or in app direct mode. We use app direct mode, which allows us to create an ext4 file system. A log chunk is allocated in persistent memory by creating a file of the desired size, which is then memory-mapped with the dax option; this bypasses OS buffers and allows direct access at cache-line granularity. We avoid thrashing the CPU cache when writing log records by using non-temporal store instructions, which are conveniently accessible through the Intel Persistent Memory Development Kit (PMDK) via pmem_memcpy.
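With PMDK's libpmem, mapping a chunk file and appending a record with a non-temporal copy looks roughly like this; the file path, chunk size, and error handling are illustrative.

#include <libpmem.h>
#include <cstdio>

int main() {
    size_t mappedLen;
    int isPmem;
    // Create and map a 20 MB log chunk file on a DAX-mounted ext4 file system.
    void* chunk = pmem_map_file("/mnt/pmem/wal_chunk_0", 20u << 20,
                                PMEM_FILE_CREATE, 0666, &mappedLen, &isPmem);
    if (chunk == nullptr) { perror("pmem_map_file"); return 1; }

    const char record[] = "example log record payload";
    // Non-temporal copy: the log record bypasses the CPU cache, so appending to
    // the WAL does not thrash the worker thread's cache.
    pmem_memcpy(chunk, record, sizeof(record),
                PMEM_F_MEM_NONTEMPORAL | PMEM_F_MEM_NODRAIN);
    // At commit time, drain the write-combining buffers to make the records durable.
    pmem_drain();

    pmem_unmap(chunk, mappedLen);
    return 0;
}

The sketch assumes linking against libpmem (-lpmem).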

Finding the last valid log record in persistent memory is not trivial, as log records can still leave the CPU in arbitrary order. Thus, one could consider storing the end offset within each chunk explicitly. However, this would require an expensive cache flush per log entry, as the data needs to be flushed before the offset to ensure correct ordering. Additionally, as Renen et al. [51] observed, repeated updates of the same persistent memory location are unreasonably expensive. They propose to use the popcnt instruction to compute a lightweight checksum of each record. During recovery, these checksums are used to determine the last fully written log entry. We implemented this approach in our system and found that it does not introduce any measurable overhead. Partially staged log chunks are not problematic, as in that case the intact copy in persistent memory takes precedence.
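One possible form of such a popcnt-based check (the record framing and function names are our assumptions):

#include <cstdint>
#include <cstring>

// Population-count "checksum" stored with every log record. During recovery,
// the scan stops at the first record whose stored count does not match the
// recomputed one, i.e., after the last fully written record.
uint64_t popcntChecksum(const void* data, size_t size) {
    const uint8_t* bytes = static_cast<const uint8_t*>(data);
    uint64_t count = 0;
    size_t i = 0;
    for (; i + 8 <= size; i += 8) {
        uint64_t word;
        std::memcpy(&word, bytes + i, 8);
        count += __builtin_popcountll(word);   // compiles to the popcnt instruction
    }
    for (; i < size; i++)
        count += __builtin_popcount(bytes[i]); // remaining tail bytes
    return count;
}

bool recordIsComplete(const void* payload, size_t size, uint64_t storedChecksum) {
    return popcntChecksum(payload, size) == storedChecksum;
}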

The Writeback Buffer mentioned in Section 3.5 accumulates pages locally before writing them out all at once. In order to saturate the bandwidth of high-speed NVMe SSDs, we use the asynchronous I/O API of the Linux kernel. Additionally, we open the database file with O_DIRECT to avoid OS caching effects and reduce CPU overhead. Thus, the only cache that remains to be flushed is on the device itself. We use the fdatasync call for that instead of the typical fsync, which provides enough guarantees but allows for a slightly faster implementation. We issue 1024 page writes at a time to keep the device queue filled. After a batch of pages has been written out and flushed, we update the GSNpersisted field in their buffer frames to the GSN of the written copy of the page. It is important that the GSNs are only updated after the writes have been flushed, as otherwise the checkpointer might prune the log too early. The writeback buffer is used by both the checkpointer and the page provisioner.
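A condensed sketch of how such a batch could be submitted with the Linux AIO interface (libaio) on a file opened with O_DIRECT; buffer alignment is only hinted at, and the helper names are illustrative.

#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <utility>
#include <vector>

constexpr size_t kPageSize = 16 * 1024;   // 16 KB pages, as in the evaluation

// Write one writeback batch (pointer + file offset per page), wait for
// completion, and flush the device cache before advancing GSNpersisted.
void writeBatch(int fd, const std::vector<std::pair<void*, off_t>>& batch) {
    io_context_t ctx = 0;
    io_setup(static_cast<int>(batch.size()), &ctx);

    std::vector<iocb> cbs(batch.size());
    std::vector<iocb*> cbPtrs(batch.size());
    for (size_t i = 0; i < batch.size(); i++) {
        // Buffers must be suitably aligned for O_DIRECT (e.g., via aligned_alloc).
        io_prep_pwrite(&cbs[i], fd, batch[i].first, kPageSize, batch[i].second);
        cbPtrs[i] = &cbs[i];
    }
    io_submit(ctx, static_cast<long>(batch.size()), cbPtrs.data());   // keep the device queue filled

    std::vector<io_event> events(batch.size());
    io_getevents(ctx, static_cast<long>(batch.size()), static_cast<long>(batch.size()),
                 events.data(), nullptr);

    fdatasync(fd);   // flush the on-device cache; only now may GSNpersisted be advanced
    io_destroy(ctx);
}

// The database file itself would be opened roughly as:
//   int fd = open("leanstore.db", O_RDWR | O_DIRECT);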

Log Compression reduces the amount of data that has to be transferred to both persistent memory and SSD. We do not employ a general-purpose compression scheme such as LZ4, as it would incur a non-negligible CPU overhead that would outweigh its benefits.


We instead reuse information from former log entries of the same category. For example, we maintain the ID of the last inserted page and the transaction that did so. We apply this optimization to insert and update entries, as they make up the majority of log volume in common workloads. If we create another insert entry, we check whether it can reuse that information. Note that we reset the cached IDs on chunk boundaries, so that log chunks stay independent and can be processed in parallel. Furthermore, update records only contain the before and after image of changed attributes, together with a bitmask in which the modified attributes are set to 1. We observe that these two techniques already reduce log volume by 30% in TPC-C.
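The update-record diffing can be illustrated as follows; the fixed-size attribute layout and all names are assumptions made for the example.

#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: an update log record stores a bitmask of modified attributes plus the
// before and after images of only those attributes.
struct UpdateDiff {
    uint64_t changedMask = 0;            // bit i set => attribute i was modified
    std::vector<uint8_t> beforeImages;   // concatenated old values of changed attributes
    std::vector<uint8_t> afterImages;    // concatenated new values of changed attributes
};

UpdateDiff makeUpdateDiff(const uint8_t* oldTuple, const uint8_t* newTuple,
                          const std::vector<uint16_t>& attrOffsets,
                          const std::vector<uint16_t>& attrSizes) {
    UpdateDiff diff;
    for (size_t i = 0; i < attrOffsets.size(); i++) {
        const uint8_t* oldAttr = oldTuple + attrOffsets[i];
        const uint8_t* newAttr = newTuple + attrOffsets[i];
        if (std::memcmp(oldAttr, newAttr, attrSizes[i]) != 0) {
            diff.changedMask |= (1ull << i);
            diff.beforeImages.insert(diff.beforeImages.end(), oldAttr, oldAttr + attrSizes[i]);
            diff.afterImages.insert(diff.afterImages.end(), newAttr, newAttr + attrSizes[i]);
        }
    }
    return diff;   // replay (or undo) only touches the attributes marked in changedMask
}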

For performance, it is vital that the checkpointer thread does not latch all pages in a shard at the same time, which can be tens of thousands of pages for realistic buffer sizes. Similar to the page provisioner (see Section 3.5), it only latches each page briefly to mark it as writeBack and copy it into a local writeback buffer before proceeding to the next page. The copy is necessary anyway in our design, because the in-memory representation of a page contains swizzled pointers, which must not be serialized to persistent storage. So before a page gets written out, all pointers in the copy are replaced by their corresponding page IDs [17]. This design minimizes the impact of checkpointing, as only one page is latched at a time and no page latches are held during I/O. To keep up with the worker threads on machines with many cores, our implementation has multiple checkpointer threads, which can work on distinct checkpoint increments in parallel. This does not affect correctness, as the minimum GSN in the table determines how far the log can be truncated. For the results reported in Section 4, we use two checkpointer threads.

4 EVALUATION

We implemented our logging and recovery approach in the high-performance LeanStore storage engine, which is written in C++17. LeanStore uses pointer swizzling and partitions the buffer pool into a hot and a cool area. For the benchmarks, we use TPC-C with all five transaction types. The benchmark driver is directly linked into the engine, so the results are unimpeded by network overhead. Relations and indexes are stored in B+-trees with a page size of 16 KB. Because our system does not yet implement full transaction isolation, we effectively run all experiments in read uncommitted mode.

All experiments are run on a system with an Intel Xeon Gold 6212U CPU, which has 24 cores and 48 Hyper-Threads. The system is equipped with 192 GB DRAM and 768 GB Intel Optane DC Persistent Memory. The operating system (Ubuntu 19.10) is installed on one SSD, while the database file resides on another PCIe-attached NVMe SSD.

0200400600800

10001200

1 12 24 36 48threads

TPC

-C [k

txn/

s]

"SiloR"-styleGroup CommitOur approachNo RFAAetherARIES

Hyper-Threads

Figure 8: TPC-C (500 warehouses, 100GB buffer pool).

log on the persistent memory device, which is able to absorbthe almost 2 GB/s of log volume we generate.Each thread gets allocated 5 WAL chunks of 20MB each,

which has shown to be sufficient for the background stagingthread to offload the WAL. We fix the clock frequency of theCPU at 2.6 GHz, which is the highest value it can sustainwhen all threads are active. This reduces noise and leads tomore comparable benchmark results.The main goal of this section is to evaluate the different

logging approaches for a high-performance storage engine.Since Silo is an in-memory system (and its recovery imple-mentation SiloR is not publicly available), we decided toimplement all competitors in our own system. This allowsus to quantify the exact impact of logging and checkpoint-ing as all other system parameters, i.e., the benchmark code,the storage engine, and the data structures, stay unchanged.The ARIES-style configuration uses a single global log. Wealso implemented a state-of-the-art variant that contains theoptimizations proposed by Aether [22], namely consolidationarrays, flush pipelining (group commit), and decoupled bufferfill. The SiloR-style approach uses per-thread logs and em-ploys epoch-based group commit. It is the only competitorwhich allocates log chunks in DRAM first. It does not writeGSNs, page IDs or undo images into the WAL. Thus, it doesnot support out-of-memory workloads that persist pageswith uncommitted changes to disk. All competitors take fullcheckpoints each time the log exceeds its limit, during whichSiloR persists the whole database, while ARIES/Aether writesout dirty pages.

4.1 Scalability

In our first experiment, we compare transaction throughput and scalability of our approach against Aether, ARIES, and SiloR-style logging when the workload fits into memory. We also compare the scalability of RFA against passive group commit [52] and a variant that flushes all logs synchronously upon commit (labeled "No RFA"). For that, we run TPC-C with 500 warehouses (which amounts to 50GB of initial data) and a memory budget of 100GB. Figure 8 shows that our approach is able to scale very well. It achieves 41k TPC-C transactions per second with one thread and scales to 750k txn/s with 24 threads. Beyond that, it further benefits from hyperthreading and reaches its peak performance of 857k txn/s with 40 threads. The only competitor that performs significantly better is the in-memory group commit "SiloR" approach, which achieves 43k txn/s single-threaded and 1.2M txn/s with 48 threads. We investigated why its scalability is better than that of the other competitors, even in comparison with the non-blocking passive group commit variant. We found that the persistent memory device is the culprit. By putting the first stage of the WAL into DRAM, and leaving all other parameters the same, our approach achieves 960k txn/s instead of 857k txn/s with 40 threads (scalability 22.6x instead of 20.9x), indicating that our first-generation persistent memory platform is oversaturated when many threads write into persistent memory concurrently.

ARIES-style logging is limited by the global log, so it reaches its peak performance of 123k txn/s already with 4 threads. Aether performs slightly better and achieves up to 180k txn/s with 16 threads, but eventually also runs into contention. This is consistent with previously reported results [52], and amplified in our system as our base performance is an order of magnitude higher, resulting in the single log becoming the bottleneck much sooner. The dashed line in Figure 8 shows the throughput of our system when RFA is disabled. It performs better than ARIES and Aether as it does not have to synchronize threads on the creation of log entries. However, because it still has to synchronize all per-thread logs upon each transaction commit, it starts to scale worse than RFA or group commit beyond 8 threads, reaching its peak of 690k txn/s with 48 threads.

As the scalability results show, RFA manages to avoid most remote log flushes, resulting in a throughput close to that of a group commit protocol which does not flush other logs at all. However, depending on the contention in the workload, remote log flushes may be inevitable. To demonstrate this effect, we varied the number of warehouses (which indirectly controls transaction interference in TPC-C) with 40 worker threads and measured the percentage of transactions requiring a remote flush:

                 w=1     w=5     w=10    w=50    w=100   w=500
rem. flushes     92.0%   91.8%   90.8%   15.5%   14.7%   8.1%
perf [txn/s]     596k    707k    721k    929k    861k    857k

As these results indicate, the more logical independence the workload has, the more likely it becomes that remote flushes can be avoided.

4.2 Log Volume and Checkpointing

Among the main features of our approach are the continuous checkpointing algorithm and its capability to bound the log volume. To understand the behavior of our approach, we run TPC-C for two minutes with 40 worker threads and 500 warehouses (50GB of initial data).

Figure 9: TPC-C performance over time (500 warehouses, 100GB WAL limit, 40 worker threads). [Two columns, in-memory (100GB buf.) and out-of-memory (40GB buf.); rows plot TPC-C [ktxn/s], checkp. [MB/s], persist [MB/s], WAL [MB/s], WAL vol. [GB], and reads [MB/s] over runtime [sec] for "SiloR"-style/Aether versus our approach.]

The left column of Figure 9 shows that with a buffer pool of 100GB, our system achieves a sustained transaction rate of 850k txn/s while constantly writing out 1.7 GB/s of WAL, and over 1 GB/s of pages necessary for checkpointing. After 32 seconds the WAL volume reaches its configured limit and stays there (a), which validates that our continuous checkpointing approach can keep the WAL bounded. In contrast, the checkpointer in the SiloR approach cannot keep up with the worker threads (b), resulting in a growing WAL volume. It has to write out the entire database (c), which grows quickly, for each checkpoint. After 66 seconds, when the memory budget of 100GB is exhausted, the SiloR system stops processing (d) because it runs out of memory. In our system, in contrast, the page provisioner starts supporting the checkpointer by writing out pages at 140MB/s to free up buffer space (e). Since this is done in the background, and the page replacement strategy identifies correct eviction candidates, processing continues with no observable performance drop (f).

The right column of Figure 9 shows system behavior when the buffer pool does not even fit the initial data set. For that, we set the buffer pool size to 40GB, while keeping the number of warehouses at 500. At the beginning of the benchmark, both Aether and our approach need a few seconds to load the working set into RAM (g), starting at over 1 GB/s. Afterwards, both are able to keep a stable performance level, but, even in this out-of-memory scenario, Aether's performance is still a factor of two lower due to log synchronization. This underlines that the classical disk-based database architecture is unsuited for fast SSDs. In steady state, our system writes 2GB/s (checkpointing: 280MB/s, page persisting to free up buffer space: 1GB/s, WAL writing: 650MB/s) and reads 700MB/s (k), while sustaining 300k txn/s (h) and bounding the WAL to 100GB.

Table 1: TPC-C transaction rates and CPU instructions (500 warehouses, 40 worker threads). We enable all necessary logging components step-by-step.

#  Component               txn/s    instr. per txn
1  no logging              1,439k   47k
2  + create WAL records    1,110k   59k
3  + stage WAL records     934k     61k
4  + remote log flushes    666k     78k
5  + RFA                   854k     65k
6  + checkpointing         850k     66k

4.3 Dissecting the Features

In the next experiment, we dissect the performance impact of the different logging components using TPC-C (40 threads, 500 warehouses, 100GB buffer pool). Table 1 shows that the basic storage manager, without any logging, checkpointing, or page cleaning, is able to achieve 1.4M TPC-C transactions per second (#1). Next, the worker threads allocate, write, and checksum the log records in persistent memory (#2). This incurs the highest increase in CPU instructions, from 47k to 59k per transaction, and decreases performance to 1.1M txn/s. Staging the log records to SSD (#3) is almost purely an I/O operation. Thus, the instructions per transaction do not increase much, but performance drops slightly to 934k txn/s since staging is done by a background thread that has to access the log. However, once transactions flush all logs during commit (#4), performance drops to 666k txn/s. By introducing RFA (#5), performance increases to 854k txn/s. Finally, the fully functioning system including checkpointing (#6) achieves 850k txn/s. Thus, the total amount of work for the recovery component is 19k instructions per transaction in our approach.

4.4 RFA and Contention

To analyze the impact of RFA in more detail, we also run a YCSB experiment with a fixed table size of 500M records, each consisting of an 8-byte key and a 64-byte value. The resulting database has a size of 36 GB. We run the benchmark with 24 threads, each executing a transaction consisting of a single-tuple update. This stresses log synchronization to the maximum, as much of the work consists of creating log records.

Figure 10: YCSB-style workload (100% single-tuple updates). Our approach is annotated with the percentage of remote flushes. [Plot of YCSB throughput in Mtxn/s over Zipf theta (θ) for "SiloR"-style, Group Commit, Our approach, No RFA, Aether, and ARIES; annotations for our approach: 4.8%, 4.5%, 4.4%, 5.3%, 18.8%, 50.4%, 79.2%, 86.2%.]

We vary the Zipf theta, resulting in different levels of skew. As Figure 10 shows, with θ = 0 (i.e., a uniform distribution) RFA is most effective and only 4.8% of all transactions need to synchronize their commit with other logs. Theta values up to 0.75 do not increase remote flushes significantly. An even higher θ increases the number of remote flushes, but performance stays competitive compared to group commit and SiloR. With θ = 1.25, contention in the workload starts to dominate, and with even higher values all competitors converge to a similar performance.

Figure 11: TPC-C & YCSB transaction latencies for different commit flush strategies. RFA is close to the optimum, while group commit has higher latencies. [Latency in microseconds for Delivery, Neworder, Payment, and YCSB under the strategies no flush, RFA, No RFA, and Grp. Commit.]

4.5 Transaction Latencies

In this experiment, we compare optimal transaction latencies, i.e., those that could be achieved by not flushing log records in other logs at all, to those with and without RFA, and group commit. Transactions enter the system based on a Poisson process where the time between two transactions is exponentially distributed.
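A minimal sketch of such an open-loop arrival generator, assuming a hypothetical submitTxn callback and target rate, draws exponentially distributed inter-arrival times so that arrivals form a Poisson process; it only illustrates the arrival model, not the benchmark driver itself.

// Open-loop load generation: exponential inter-arrival times => Poisson arrivals.
#include <chrono>
#include <random>
#include <thread>

void runOpenLoop(double txnPerSecond, int numTxns, void (*submitTxn)()) {
  std::mt19937_64 rng(42);
  std::exponential_distribution<double> interArrival(txnPerSecond); // rate lambda

  auto next = std::chrono::steady_clock::now();
  for (int i = 0; i < numTxns; i++) {
    // schedule the next arrival independently of how long transactions take
    next += std::chrono::nanoseconds(static_cast<long long>(interArrival(rng) * 1e9));
    std::this_thread::sleep_until(next);
    submitTxn(); // latency is measured from the scheduled arrival to completion
  }
}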

Figure 11 shows the execution latencies for the three write transactions in TPC-C for a transaction rate of 100k txn/s. Without flushing other logs, a single delivery transaction has a median execution time of 83 microseconds. With RFA, the median does not increase, and the 99th percentile only slightly increases from 113 µs to 124 µs. The median when always flushing all other logs is 88 µs, while with group commit it rises to 95 µs. Neworder and payment transactions show a similar behavior, while the general trend is that for short transactions, the added latency of group commit is more significant. In YCSB, the median latency for group commit is 7 µs, which is more than double the latency of RFA, which has a median latency of 2.8 µs.

For comparison, the epoch-based group commit system SiloR has transaction latencies in the order of milliseconds, which is three orders of magnitude higher than in our system.

4.6 Recovery

As the experiment in Section 4.2 shows, our approach is effective in bounding the log volume that needs to be processed during recovery. With a WAL limit of 100GB, we measure the different phases of recovery time. Using 40 threads, total recovery time is 38 seconds, of which 14 seconds are spent on partitioning the log chunks by page ID, and the remaining 24 seconds are used for merging, sorting, and applying the log records (undo time is negligible). This corresponds to 2.6 GB of WAL recovered per second. Within one second after recovery finished, TPC-C transaction performance approaches the value it had before the crash because recovery implicitly pre-warmed the buffer cache.
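The two phases could be outlined as follows, assuming simplified record and partition types: records are first partitioned by page ID so that each page is owned by one worker, then each partition is sorted (per page, by GSN) and redone. This is only a sketch of the measured pipeline under those assumptions, not the actual implementation.

// Sketch of two-phase parallel redo: partition by page ID, then sort and apply.
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

struct LogRecord { uint64_t pageId; uint64_t gsn; /* payload omitted */ };
void redo(const LogRecord&) { /* apply the record to its buffer-managed page */ }

void recover(const std::vector<std::vector<LogRecord>>& chunks, unsigned numWorkers) {
  // Phase 1: partition all chunk records by page ID.
  std::vector<std::vector<LogRecord>> partitions(numWorkers);
  for (const auto& chunk : chunks)
    for (const auto& rec : chunk)
      partitions[rec.pageId % numWorkers].push_back(rec);

  // Phase 2: per partition, order the records of each page by GSN and redo them.
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < numWorkers; w++)
    workers.emplace_back([&, w] {
      auto& part = partitions[w];
      std::sort(part.begin(), part.end(), [](const LogRecord& a, const LogRecord& b) {
        return a.pageId != b.pageId ? a.pageId < b.pageId : a.gsn < b.gsn;
      });
      for (const auto& rec : part) redo(rec);
    });
  for (auto& t : workers) t.join();
}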

For comparison, Zheng et al. [55] report 0.93 GB/s for processing the recovery log in the state-of-the-art SiloR system using 32 cores. However, because their checkpointing approach fails to put a tight upper bound on the log, it grows up to 180GB in their TPC-C experiment. This leads to a much higher recovery time of 211 seconds (which includes loading time for the last checkpoint). Therefore, despite frequent checkpointing, their recovery time is more than five times higher than in our system.

4.7 System Comparison

To the best of our knowledge, WiredTiger is one of the most efficient open-source storage engines, which we found to be faster than LevelDB, RocksDB, BerkeleyDB, and PostgreSQL on TPC-C. Furthermore, WiredTiger implements a typical recovery approach close to textbook ARIES and allows one to disable logging and checkpointing individually. We integrated WiredTiger 2.9.3 into our TPC-C benchmark driver. To make for a fair comparison with our persistent-memory-based approach, we deactivate fsync in WiredTiger.

Figure 12: WiredTiger performance for TPC-C over time (40GB buffer pool, 20 threads). We incrementally disable checkpointing and logging. [Plot of TPC-C throughput in ktxn/s over runtime in seconds, in-memory (100GB buf.) and out-of-memory (40GB buf.), for Our approach, WiredTiger (WT), WT w/o checkpointing, and WT w/o chkpt. or logging.]

As in previous experiments, we run TPC-C with 500 warehouses. Figure 12 shows the performance of WiredTiger over time, once with a buffer pool of 100GB where the workload fits in memory, and once with 40GB where the workload exceeds it. One apparent observation is that WiredTiger has very high variability, especially when checkpoints are enabled. This shows that checkpoints cause major drops in performance, and are thus an issue in existing systems. The performance of our system, in contrast, is not only much more stable, but also one order of magnitude faster in the out-of-memory case.

5 RELATED WORK

This section reviews alternative designs for logging, checkpointing, and recovery as well as complementary techniques.

In-memory database systems make various trade-offs in order to improve forward-processing performance at the expense of recovery performance. Broadly speaking, there are two classes of systems that represent these kinds of trade-offs, namely physical and logical logging. In the previous sections, we reviewed Silo as a representative for physical logging systems; in this section, we discuss the latter class.

Command logging [36] is a logical logging approach in which every transaction must run as a stored procedure, and it is logged as a single log record containing the procedure ID and its arguments. This approach has three major limitations: first, checkpoints must not only scan and write the entire database, but they must also be transaction-consistent, which incurs additional overhead. Second, recovery requires re-executing transactions serially, leading to prohibitively long outages in case of system failures. Adaptive logging [54] is a hybrid approach that attempts to improve the recovery time of command logging by tracking dependencies among transactions and selectively producing physical log records that eliminate such dependencies, thus speeding up log replay by permitting a higher degree of parallelism.

Hekaton [13] has a unique approach among in-memory systems. It takes checkpoints by sorting and merging log files (with value logging), essentially behaving like a log-structured merge tree [40]. As such, it permits incremental checkpoints, i.e., writing only recent modifications rather than the entire database. FineLine [46] employs a similar concept but it relies on physiological logging instead of value logging. FOEDUS [28] is a persistent-memory system that propagates changes from volatile images of pages into their persistent representation by means of log replay, which can be seen as a single-level merge tree. In contrast, the staging approach used in HANA [47], with its L1 and L2 deltas, can be seen as a two-level merge tree. However, in HANA changes are propagated at the granularity of whole columns instead of pages. Such log-structured approaches have two fundamental caveats: first, merging log files potentially increases I/O load and write amplification; second, they forbid evicting uncommitted changes to disk (i.e., they are no-steal).

The first designs proposed for in-memory databases had the limitation that the entire dataset had to fit into main memory. Given the advent of fast SSDs and persistent memory, this limitation can be a "deal breaker", especially from the economic perspective [4]. Thus, some techniques such as Anti-Caching [12] and Siberia [2, 34] were proposed to allow an in-memory database to grow larger than main memory without relying on a page-based buffer manager. Instead, they identify less-frequently used tuples and migrate them to separate data structures on persistent storage. To support reads, additional lookups in such data structures are required, which can be optimized with techniques like Bloom filtering. A buffer manager with pointer swizzling performs better when data is larger than main memory, since it does not require such lookups or tuple-level tracking of accesses.

Persistent memory provides byte addressability and low latency. It has been explored in multiple logging design proposals [11, 14, 21, 27]. These approaches aim to reduce commit latency while still relying on a centralized log at some point, thus overlooking the scalability issue. Our design not only addresses scalability but also considers the wider context of both logging and checkpointing.

Some designs for persistent-memory database systems also consider the radical approach of eliminating redo logging altogether (i.e., force) by performing out-of-place updates directly on the persisted database [8, 15, 41, 43, 44], in some cases delegating all storage to specialized search structures [5, 42]. However, this fundamental design trade-off presents some major limitations. First, since durability requires random persistent writes, commit latency is unavoidably higher than in logging approaches, as one flush operation is required per tuple modified. Second, data must not be updated in-place, which results in lower performance for some workloads [6, 44], and introduces garbage collection overhead as well as issues with indexing and space management. Third, since a system without redo logging cannot support media recovery, replication to a stand-by secondary server is required [53] to handle device failures.

Instant recovery [16] is a family of algorithms that allow new transactions to execute immediately after a system or even media failure, including localized single-page corruptions. A complementary technique is constant-time recovery [3], which exploits multi-version storage to allow instantaneous undo of arbitrarily long user transactions. Since both techniques rely on physiological logging, they can be combined with our approach to achieve efficient logging and checkpointing as well as faster recovery.

ERMIA [26] is a physiological logging system that improves the scalability of centralized logging by leveraging multi-version storage, which is used both for concurrency control as well as recovery purposes. Our approach, in contrast, is orthogonal to concurrency control and also works with single-version storage.

The general problem of synchronizing transaction activity in the presence of multiple logs has also been explored by Bernstein and Das [9], who propose a mechanism to parallelize the validation phase of an optimistic concurrency control (OCC) algorithm by partitioning the database. The key difference is that our approach considers a distributed log partitioned by transaction and not by database contents (i.e., by page), and that the focus here is not on OCC validation but on guaranteeing durability regardless of the employed concurrency control protocol.

Other concurrency-control techniques of both the pessimistic [1] and optimistic [35] kind require tracking dependencies among transactions to enforce commit-time ordering, but such tracking happens at the granularity of individual records. The RFA technique proposed in this paper, on the other hand, only tracks dependencies at the page level, and once detected, a dependency neither has any influence on the serializability order or the decision to commit or abort, nor does it cause any delay in the logical execution of other transactions; it just triggers the flush of different log buffers.

6 SUMMARY

Neither ARIES nor in-memory logging approaches work well for high-performance storage managers. We present a design that covers all aspects of logging, checkpointing, and page provisioning. Using continuous checkpointing, we eliminate spikes in transaction performance and I/O. Remote Flush Avoidance enables low-latency transaction commit and makes our approach scalable on multi-core CPUs. Overall, our implementation inside LeanStore achieves not only the rich feature set and robustness of a classical disk-based database architecture, but also the performance and scalability of state-of-the-art in-memory systems.

REFERENCES

[1] Divyakant Agrawal, Amr El Abbadi, Richard Jeffers, and Lijing Lin. 1995. Ordered Shared Locks for Real-Time Databases. VLDB J. 4, 1 (1995).
[2] Karolina Alexiou, Donald Kossmann, and Per-Åke Larson. 2013. Adaptive Range Filters for Cold Data: Avoiding Trips to Siberia. PVLDB 6, 14 (2013).
[3] Panagiotis Antonopoulos, Peter Byrne, Wayne Chen, Cristian Diaconu, Raghavendra Thallam Kodandaramaih, Hanuma Kodavalla, Prashanth Purnananda, Adrian-Leonard Radu, Chaitanya Sreenivas Ravella, and Girish Mittur Venkataramanappa. 2019. Constant Time Recovery in Azure SQL Database. PVLDB 12, 12 (2019).

[4] Raja Appuswamy, Renata Borovica-Gajic, Goetz Graefe, and Anastasia Ailamaki. 2017. The Five-minute Rule Thirty Years Later and its Impact on the Storage Hierarchy. In ADMS.
[5] Joy Arulraj, Justin J. Levandoski, Umar Farooq Minhas, and Per-Åke Larson. 2018. BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory. PVLDB 11, 5 (2018).
[6] Joy Arulraj, Andrew Pavlo, and Subramanya Dulloor. 2015. Let’s Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems. In SIGMOD.
[7] Joy Arulraj, Andy Pavlo, and Krishna Teja Malladi. 2019. Multi-Tier Buffer Management and Storage System Design for Non-Volatile Memory. CoRR (2019).
[8] Joy Arulraj, Matthew Perron, and Andrew Pavlo. 2016. Write-Behind Logging. PVLDB 10, 4 (2016).
[9] Philip A. Bernstein and Sudipto Das. 2015. Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log. IEEE Data Eng. Bull. 38, 1 (2015).
[10] Tuan Cao, Marcos Antonio Vaz Salles, Benjamin Sowell, Yao Yue, Alan J. Demers, Johannes Gehrke, and Walker M. White. 2011. Fast checkpoint recovery algorithms for frequently consistent applications. In SIGMOD.
[11] Joel Coburn, Trevor Bunker, Meir Schwarz, Rajesh Gupta, and Steven Swanson. 2013. From ARIES to MARS: transaction support for next-generation, solid-state drives. In SOSP.
[12] Justin DeBrabant, Andrew Pavlo, Stephen Tu, Michael Stonebraker, and Stanley B. Zdonik. 2013. Anti-Caching: A New Approach to Database Management System Architecture. PVLDB 6, 14 (2013).
[13] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL server’s memory-optimized OLTP engine. In SIGMOD.
[14] Ru Fang, Hui-I Hsiao, Bin He, C. Mohan, and Yun Wang. 2011. High performance database logging using storage class memory.
[15] Shen Gao, Jianliang Xu, Theo Härder, Bingsheng He, Byron Choi, and Haibo Hu. 2015. PCMLogging: Optimizing Transaction Logging and Recovery Performance with PCM. IEEE Trans. Knowl. Data Eng. 27, 12 (2015).
[16] Goetz Graefe, Wey Guy, and Caetano Sauer. 2016. Instant Recovery with Write-Ahead Logging: Page Repair, System Restart, Media Restore, and System Failover, Second Edition. Morgan & Claypool Publishers. https://doi.org/10.2200/S00710ED2V01Y201603DTM044
[17] Goetz Graefe, Haris Volos, Hideaki Kimura, Harumi A. Kuno, Joseph Tucek, Mark Lillibridge, and Alistair C. Veitch. 2014. In-Memory Performance for Big Data. PVLDB 8, 1 (2014).
[18] Gabriel Haas, Michael Haubenschild, and Viktor Leis. 2020. Exploiting Directly-Attached NVMe Arrays in DBMS. In CIDR.
[19] Theo Härder and Andreas Reuter. 1983. Principles of Transaction-Oriented Database Recovery. ACM Comput. Surv. 15, 4 (1983).
[20] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. 2008. OLTP through the looking glass, and what we found there. In SIGMOD.
[21] Jian Huang, Karsten Schwan, and Moinuddin K. Qureshi. 2014. NVRAM-aware Logging in Transaction Systems. PVLDB 8, 4 (2014).
[22] Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki. 2010. Aether: A Scalable Approach to Logging. PVLDB 3, 1 (2010).
[23] Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki. 2012. Scalability of write-ahead logging on multicore and multisocket hardware. VLDB J. 21, 2 (2012).
[24] Hyungsoo Jung, Hyuck Han, and Sooyong Kang. 2017. Scalable Database Logging for Multicores. PVLDB 11, 2 (2017).
[25] Jong-Bin Kim, Hyeongwon Jang, Seohui Son, Hyuck Han, Sooyong Kang, and Hyungsoo Jung. 2019. Border-Collie: A Wait-free, Read-optimal Algorithm for Database Logging on Multicore Hardware. In SIGMOD.
[26] Kangnyeon Kim, Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2016. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads. In SIGMOD.
[27] Wook-Hee Kim, Jinwoong Kim, Woongki Baek, Beomseok Nam, and Youjip Won. 2016. NVWAL: Exploiting NVRAM in Write-Ahead Logging. In ASPLOS.
[28] Hideaki Kimura. 2015. FOEDUS: OLTP Engine for a Thousand Cores and NVRAM. In SIGMOD.
[29] Leslie Lamport. 1978. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM 21, 7 (1978).
[30] Per-Åke Larson, Mike Zwilling, and Kevin Farlee. 2013. The Hekaton Memory-Optimized OLTP Engine. IEEE Data Eng. Bull. 36, 2 (2013).
[31] Viktor Leis, Michael Haubenschild, Alfons Kemper, and Thomas Neumann. 2018. LeanStore: In-Memory Data Management beyond Main Memory. In ICDE.
[32] Viktor Leis, Michael Haubenschild, and Thomas Neumann. 2019. Optimistic Lock Coupling: A Scalable and Efficient General-Purpose Synchronization Method. IEEE Data Eng. Bull. 42, 1 (2019).
[33] Viktor Leis, Florian Scheibner, Alfons Kemper, and Thomas Neumann. 2016. The ART of practical synchronization. In DaMoN. ACM.
[34] Justin J. Levandoski, Per-Åke Larson, and Radu Stoica. 2013. Identifying hot and cold data in main-memory databases. In ICDE.
[35] David B. Lomet, Alan Fekete, Rui Wang, and Peter Ward. 2012. Multi-version Concurrency via Timestamp Range Conflict Management. In ICDE.
[36] Nirmesh Malviya, Ariel Weisberg, Samuel Madden, and Michael Stonebraker. 2014. Rethinking main memory OLTP recovery. In ICDE.
[37] C. Mohan. 1999. Repeating History Beyond ARIES. In VLDB.
[38] C. Mohan, Don Haderle, Bruce G. Lindsay, Hamid Pirahesh, and Peter M. Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM TODS (1992).
[39] Thomas Neumann and Michael J. Freitag. 2020. Umbra: A Disk-Based System with In-Memory Performance. In CIDR.
[40] Patrick E. O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth J. O’Neil. 1996. The Log-Structured Merge-Tree (LSM-Tree). Acta Inf. 33, 4 (1996).
[41] Ismail Oukid, Daniel Booss, Wolfgang Lehner, Peter Bumbulis, and Thomas Willhalm. 2014. SOFORT: a hybrid SCM-DRAM storage engine for fast data recovery. In DaMoN.
[42] Ismail Oukid, Johan Lasperas, Anisoara Nica, Thomas Willhalm, and Wolfgang Lehner. 2016. FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory. In SIGMOD.
[43] Ismail Oukid, Wolfgang Lehner, Thomas Kissinger, Thomas Willhalm, and Peter Bumbulis. 2015. Instant Recovery for Main Memory Databases. In CIDR.
[44] Steven Pelley, Thomas F. Wenisch, Brian T. Gold, and Bill Bridge. 2013. Storage Management in the NVRAM Era. PVLDB 7, 2 (2013).
[45] Kun Ren, Thaddeus Diamond, Daniel J. Abadi, and Alexander Thomson. 2016. Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems. In SIGMOD.
[46] Caetano Sauer, Goetz Graefe, and Theo Härder. 2018. FineLine: log-structured transactional storage and recovery. PVLDB 11, 13 (2018).
[47] Vishal Sikka, Franz Färber, Wolfgang Lehner, Sang Kyun Cha, Thomas Peh, and Christof Bornhövd. 2012. Efficient transaction processing in SAP HANA database: the end of a column store myth. In SIGMOD.
[48] Hironobu Suzuki. 2016. The Internals of PostgreSQL. Self-published. http://www.interdb.jp/pg/


[49] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy transactions in multicore in-memory databases. In SIGOPS.
[50] Alexander van Renen, Viktor Leis, Alfons Kemper, Thomas Neumann, Takushi Hashida, Kazuichi Oe, Yoshiyasu Doi, Lilian Harada, and Mitsuru Sato. 2018. Managing Non-Volatile Memory in Database Systems. In SIGMOD.
[51] Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann, and Alfons Kemper. 2019. Persistent Memory I/O Primitives. In DaMoN.
[52] Tianzheng Wang and Ryan Johnson. 2014. Scalable Logging through Emerging Non-Volatile Memory. PVLDB 7, 10 (2014).
[53] Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2017. Query Fresh: Log Shipping on Steroids. PVLDB 11, 4 (2017).
[54] Chang Yao, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, and Sai Wu. 2016. Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases. In SIGMOD.
[55] Wenting Zheng, Stephen Tu, Eddie Kohler, and Barbara Liskov. 2014. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism. In OSDI.