
Socrates: The New SQL Server in the Cloud

Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, Vikram Wakade

Microsoft Azure & Microsoft Research

ABSTRACT
The database-as-a-service paradigm in the cloud (DBaaS) is becoming increasingly popular. Organizations adopt this paradigm because they expect higher security, higher availability, and lower and more flexible cost with high performance. It has become clear, however, that these expectations cannot be met in the cloud with the traditional, monolithic database architecture. This paper presents a novel DBaaS architecture, called Socrates. Socrates has been implemented in Microsoft SQL Server and is available in Azure as SQL DB Hyperscale. This paper describes the key ideas and features of Socrates, and it compares the performance of Socrates with the previous SQL DB offering in Azure.

CCS CONCEPTS
• Information systems → DBMS engine architectures;

KEYWORDS
Database as a Service, Cloud Database Architecture, High Availability

ACM Reference Format:
Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, Vikram Wakade. 2019. Socrates: The New SQL Server in the Cloud. In 2019 International Conference on Management of Data (SIGMOD '19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3299869.3314047


1 INTRODUCTION
The cloud is here to stay. Most start-ups are cloud-native. Furthermore, many large enterprises are moving their data and workloads into the cloud. The main reasons to move into the cloud are security, time-to-market, and a more flexible "pay-as-you-go" cost model which avoids overpaying for under-utilized machines. While all these reasons are compelling, the expectation is that a database runs at least as well in the cloud as on premises (if not better). Specifically, customers expect a "database-as-a-service" to be highly available (e.g., 99.999% availability), support large databases (e.g., a 100TB OLTP database), and be highly performant. Furthermore, the service must be elastic and grow and shrink with the workload so that customers can take advantage of the pay-as-you-go model.

It turns out that meeting all these requirements is not possible in the cloud using the traditional monolithic database architecture. One issue is cost elasticity, which never seemed to be a consideration for on-premise database deployments: it can be prohibitively expensive to move a large database from one machine to another to support a higher or lower throughput and make the best use of the computing resources in a cluster. Another, more subtle issue is that there is a conflict between the goal to support large transactional databases and high availability: high availability requires a small mean-time-to-recovery, which traditionally could only be achieved with a small database. This issue does not arise in on-premise database deployments because these deployments typically make use of special, expensive hardware for high availability (such as storage area networks, or SANs), hardware which is not available in the cloud. Furthermore, on-premise deployments control the software update cycles and carefully plan downtimes; this planning is typically not possible in the cloud.

To address these challenges, there has been research on new OLTP database system architectures for the cloud over the last ten years; e.g., [5, 8, 16, 17]. One idea is to decompose the functionality of a database management system and deploy the compute services (e.g., transaction processing) and storage services (e.g., checkpointing and recovery) independently. The first commercial system that adopted this idea is Amazon Aurora [20].


                   Today                  Socrates
Max DB Size        4TB                    100TB
Availability       99.99%                 99.999%
Upsize/downsize    O(data)                O(1)
Storage impact     4x copies (+backup)    2x copies (+backup)
CPU impact         4x single images       25% reduction
Recovery           O(1)                   O(1)
Commit Latency     3 ms                   < 0.5 ms
Log Throughput     50 MB/s                100+ MB/s

Table 1: Socrates Goals: Scalability, Availability, Cost

This paper presents Socrates, a new architecture for OLTP database systems born out of Microsoft's experience of managing millions of databases in Azure. Socrates is currently available in Azure under the brand SQL DB Hyperscale [2]. The Socrates design adopts the separation of compute from storage. In addition, Socrates separates the database log from storage and treats the log as a first-class citizen. As we will see, separating the log and storage tiers separates durability (implemented by the log) and availability (implemented by the storage tier). Durability is a fundamental property of any database system to avoid data loss. Availability is needed to provide good quality of service in the presence of failures. Traditionally, database systems have coupled the implementation of durability and availability by dedicating compute resources to the task of maintaining multiple copies of the data. However, there is significant, untapped potential in separating the two concepts: (a) in contrast to availability, durability does not require copies in fast storage; (b) in contrast to durability, availability does not require a fixed number of replicas. Separating the two concepts allows Socrates to use the best-fit mechanism for the task at hand. Concretely, Socrates requires fewer expensive copies of data in fast local storage, fewer copies of data overall, less network bandwidth, and less compute resources to keep copies up-to-date than other database architectures currently on the market.

Table 1 shows the impact of Socrates on Azure's DBaaS offerings in terms of database scalability, availability, elasticity, cost (CPU and storage), and performance (time to recovery, commit latency and log throughput). How Socrates achieves these improvements concretely is the topic of this paper.

The remainder of this paper is organized as follows: Section 2 discusses the state-of-the-art. Section 3 summarizes existing SQL Server features that we exploited to build Socrates. Section 4 explains the Socrates architecture. Section 5 gives an overview of important distributed workflows in Socrates. Section 6 demonstrates the flexibility of Socrates to address cost/availability/performance tradeoffs. Section 7 presents the results of performance experiments. Section 8 concludes this paper with possible avenues for future work.

2 STATE OF THE ART
This section revisits four prominent DBaaS systems which are currently used in the marketplace.

SQL DB is Microsoft's DBaaS in Azure. Before Socrates, SQL DB was based on an architecture called HADR that is shown in Figure 1. HADR is a classic example of a log-replicated state machine. There is a Primary node which processes all update transactions and ships the update logs to all Secondary nodes. Log shipping is the de facto standard to keep replicas consistent in distributed database systems [13]. Furthermore, the Primary periodically backs up data to Azure's Standard Storage Service (called XStore): the log is backed up every five minutes, a delta of the whole database once a day, and a full backup every week. Secondary nodes may process read-only transactions. If the Primary fails, one of the Secondaries becomes the new Primary. With HADR, SQL DB needs four nodes (one Primary and three Secondaries) to guarantee high availability and durability: if all four nodes fail, there is data loss because the log is backed up only every five minutes.

To date, the HADR architecture has been used successfully for millions of databases deployed in Azure. The service is stable and mature. Furthermore, HADR has high performance because every compute node has a full, local copy of the database. On the negative side, the size of a database cannot grow beyond the storage capacity of a single machine. A special case occurs with long-running transactions when the log grows beyond the storage capacity of the machine and cannot be truncated until the long-running transaction commits. O(size-of-data) operations also create issues. For instance, the cost of seeding a new node is linear in the size of the database. Backup/restore and scale-up/down are further examples of operations whose cost grows linearly with the size of the database. This is why SQL DB today limits the size of databases to 4TB (Table 1).

Another prominent example of a cloud database system that is based on log-replicated state machines is Google Spanner [11]. To address the O(size-of-data) issues, Spanner automatically shards data logically into partitions called splits. Multiple copies of a split are kept consistent using the Paxos protocol [9]. Only one of the copies, called the leader, can modify the data; the other copies are read-only. Spanner supports geo-replication and keeps all copies consistent with the help of a TrueTime facility, a datacenter-based time source which limits time drift between disparate replicas. Splits are divided and merged dynamically for load balancing and capacity management.

In the last decade, an alternative architecture called shared disk has been studied for databases in the cloud [5, 8, 16, 17].


Figure 1: HADR Arch. (Replicated State Machines)

This architecture separates Compute and Storage. AWS Aurora is the first commercial DBaaS that adopted this architecture. An Aurora Primary Compute node processes update transactions (as in HADR) and each log record is shipped to six Storage Servers which persist the data. These six Storage Servers are distributed across three availability zones and a transaction commits safely if its log is persisted successfully in four of the six Storage nodes. Within the Storage tier, the data (and log) is partitioned for scalability. Aurora supports a variable number of Secondary Compute nodes which are attached to the Storage nodes to source data pages.

Oracle pioneered yet a different DBaaS architecture based on Exadata and Oracle RAC. In this architecture, all the nodes of a cluster are tightly coupled via a fast interconnect and a shared Cache Fusion layer over a distributed storage tier with Storage Cells that use locally attached flash storage [3, 7].

3 IMPORTANT SQL SERVER FEATURES
Socrates builds on foundations already present in SQL Server. This section introduces several important (and not obvious) SQL Server features that were developed independently of Socrates and are critical for Socrates.

3.1 Page Version Store
SQL Server maintains versions of database records for the purpose of providing read snapshots in the presence of concurrent writers (e.g., to implement Snapshot Isolation [4]). In the HADR architecture, all this versioning is done locally in temporary storage. As will become clear in Section 4, Socrates adopts an extended shared-disk architecture which requires that row versions can no longer be kept locally in temporary storage: Compute nodes must also share row versions in the shared storage tier.

3.2 Accelerated Database Recovery
SQL Server capitalizes on the persistent version store with a new feature called Accelerated Database Recovery (ADR). Prior to ADR, SQL Server used an ARIES-style recovery scheme [18] that first analyzes the log, then redoes all transactions that have not committed before the last checkpoint, and finally undoes all uncommitted (failed) transactions. In this scheme, the undo phase can become unboundedly long in the presence of long-running transactions. In production with millions of hosted databases, this unbounded undo phase can indeed become a problem. It turns out that the version store can be used to improve this situation significantly: with a shared, persistent version store, the system has access to the committed versions of a row even after a failure, which makes it possible to eliminate the undo phase in many cases. The database becomes available immediately after the analysis and redo phases, a constant-time operation bounded by the checkpointing interval.

3.3 Resilient Buffer Pool Extension
In 2012, SQL Server released a feature called buffer pool extension (BPE) which spills the content of the in-memory database buffer pool to a local SSD file (using the same lifetime and eviction policies across both main memory and SSD). In Socrates, we extended this concept and made the buffer pool resilient; i.e., recoverable after a failure. We call this component RBPEX and it serves as the caching mechanism for pages both in the compute and the storage tiers (Section 4). Having a recoverable cache like RBPEX significantly reduces the mean-time-to-recovery until a node reaches peak performance (with warm buffers): if the failure is short (e.g., a reboot of a machine after a software upgrade), it is much cheaper to read and apply the log records of the (few) updated pages than to refetch all (cached) pages from a remote server, which is needed in a traditional, non-recoverable cache. A shorter mean-time-to-recovery increases availability [14].

Architecturally, RBPEX is a simple, straightforward concept. However, a careful implementation, integration, and management of RBPEX is critical for performance. If not done right, performance can even degrade. We built RBPEX as a table in our in-memory storage engine, Hekaton [15], which ensures that read I/O to RBPEX is as fast as direct I/O to the local SSD. Furthermore, Hekaton recovers RBPEX after a failure - just like any other Hekaton table. Write I/O to RBPEX needs to be carefully orchestrated because RBPEX metadata I/O cannot be allowed to stall data I/O and RBPEX failures cannot be allowed to corrupt the RBPEX state. In order to achieve this we intercepted the buffer pool page lifetime tracking mechanism, which is a highly performance-sensitive component.

3.4 RBIO protocol
As we will see, Socrates distributes the components of a database engine across multiple tiers. To support a richer distribution of computation, we extended the traditional SQL Server networking layer (called Unified Communication Stack) with a new protocol called Remote Block I/O, or RBIO for short. RBIO is a stateless, strongly typed protocol with support for automatic versioning; it is resilient to transient failures and has QoS support for best replica selection.

3.5 Snapshot Backup/Restore
SQL Server 2016 introduced the ability to take an almost instantaneous backup when the database files are stored in Azure. This feature relies on the blob snapshots feature implemented by Azure Storage (XStore) [10], which is organized as a log-structured storage system [19]. In a log-structured file system, a backup is a constant-time operation as it merely needs to keep a pointer (timestamp) to the current head of the log. Socrates extends this feature by making backup/restore work entirely on XStore snapshots. As a result, Socrates can do backups (and restores) in constant time without incurring any CPU or I/O cost in the Compute tier. With XStore's snapshot mechanism, the database files of even a very large database of hundreds of TBs can be restored in minutes. Of course, it takes longer to apply the log to recover to the right moment in time (using ADR), spin up the servers, and refresh the caches for the restored database, but none of these operations depend on the size of the database. Backup/restore is one prominent example where Socrates eliminated size-of-data operations from a critical operational workflow.
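To make the constant-time nature of such backups concrete, here is a minimal Python sketch of a log-structured store in which a "backup" is just a pointer into the append-only log. The class and method names are ours for illustration; this is not XStore's actual API.

```python
# Minimal sketch (illustrative only): in a log-structured store, a backup is
# just the current end-of-log position, so taking one is O(1) regardless of
# how much data has been written.

class LogStructuredStore:
    def __init__(self):
        self.log = []                # append-only log of (page_id, payload) records

    def write_page(self, page_id, payload):
        self.log.append((page_id, payload))

    def snapshot(self):
        return len(self.log)         # constant-time "backup": remember a log offset

    def read_page(self, page_id, snapshot=None):
        end = len(self.log) if snapshot is None else snapshot
        # newest version of the page at or before the snapshot point
        for pid, payload in reversed(self.log[:end]):
            if pid == page_id:
                return payload
        return None

store = LogStructuredStore()
store.write_page("X", "v1")
backup = store.snapshot()            # O(1) backup
store.write_page("X", "v2")
assert store.read_page("X") == "v2"            # current state
assert store.read_page("X", backup) == "v1"    # state as of the backup
```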

3.6 I/O Stack Virtualization
At the lowest level of the I/O stack, SQL Server uses an abstraction called FCB for "File Control Block". The FCB layer provides I/O capabilities while abstracting the details of the underlying device. Using this abstraction layer, SQL Server can support multiple file systems and a diverse set of storage platforms and I/O patterns. Socrates exploits this I/O virtualization tier extensively by implementing new FCB instances which hide the Socrates storage hierarchy from the compute process. This approach helped us to implement Socrates without changing most core components of SQL Server: most components "believe" they are components of a monolithic, standalone database system, and no component above the FCB layer needs to deal with the complexity of the distributed, heterogeneous system that Socrates indeed is.
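The following is a hypothetical sketch of the idea behind FCB-style I/O virtualization: the engine issues reads and writes against one abstract interface, and different implementations route them to a local file or to a remote Page Server. Class and method names are our own, not SQL Server's.

```python
# Sketch (assumed names): engine code above this layer only sees
# read_page/write_page; the concrete "FCB" decides where the bytes live.
from abc import ABC, abstractmethod

class FileControlBlock(ABC):
    @abstractmethod
    def read_page(self, page_id: int) -> bytes: ...
    @abstractmethod
    def write_page(self, page_id: int, data: bytes) -> None: ...

class LocalFileFCB(FileControlBlock):
    """Monolithic deployment: pages live in a local file (modeled as a dict)."""
    def __init__(self):
        self.pages = {}
    def read_page(self, page_id):
        return self.pages.get(page_id, b"")
    def write_page(self, page_id, data):
        self.pages[page_id] = data

class PageServerFCB(FileControlBlock):
    """Socrates-style deployment: reads become remote page requests; data
    pages are not written here at all (they reach storage via the log)."""
    def __init__(self, get_page_fn):
        self.get_page_fn = get_page_fn     # e.g. a remote GetPage call over RBIO
    def read_page(self, page_id):
        return self.get_page_fn(page_id)
    def write_page(self, page_id, data):
        pass                                # no-op: the log is the write path

def engine_read(fcb: FileControlBlock, page_id: int) -> bytes:
    # Components above the FCB layer are unaware of the deployment.
    return fcb.read_page(page_id)

local = LocalFileFCB()
local.write_page(1, b"page one")
assert engine_read(local, 1) == b"page one"

remote = PageServerFCB(lambda pid: b"fetched page %d" % pid)
assert engine_read(remote, 1) == b"fetched page 1"
```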

4 SOCRATES ARCHITECTURE

4.1 Design Goals and Principles
Today, Azure successfully hosts many databases using the HADR architecture described in Section 2. Operating all these databases in production has taught us important lessons that guided the design of Socrates. Before explaining the Socrates architecture, we describe these lessons and the corresponding Socrates design goals and principles.

4.1.1 Local Fast Storage vs. Cheap, Scalable, Durable Storage. The first lesson pertains to the storage hierarchy: direct attached storage (SSD) is required for high performance, whereas cheap storage (hard disks) is needed for durability and scalability of large databases. On premise, these requirements are met with storage systems like SANs that transparently optimize different kinds of storage devices in a single storage stack. In the cloud, such storage systems do not exist; instead, there is local storage (SSD) attached to each machine which is fast, limited, and non-durable as it is lost when the machine fails permanently. Furthermore, clouds like Azure feature a separate, remote storage service for cheap, unlimited, durable storage. To achieve good performance, scalability, and durability in the cloud, Socrates has a layered, scale-out storage architecture that explicitly manages the different storage devices and services available in Azure. One specific feature of this architecture is that it avoids fragmentation and expensive data movement for dynamic storage allocation of fast-growing databases.

4.1.2 Bounded-time Operations. As shown in Table 1, one important design goal of Socrates is to support large databases in the order of 100 TB. Unfortunately, the current HADR architecture involves many operations whose performance depends on the size of the database as described in Section 2. Fast creation of new replicas is particularly important to achieve high availability at low cost because this operation determines the mean-time-to-recovery which directly impacts availability for a given number of replicas [14]. The requirement to avoid any "size-of-data operations" has led us to develop new mechanisms for backup/restore (based on snapshots), management of the database log (staging), tiered caching with asynchronous seeding of replicas, and exploitation of the scale-out storage service.

4.1.3 From Shared-nothing to Shared-disk. One of the fundamental principles of the HADR architecture (and any other replicated state-machine DBMS architecture) is that each replica maintains a local copy of the database. This principle conflicts with our design goal to support large databases of several hundred TBs because no machine has that amount of storage. Even if it were possible, storage becomes the limiting factor and main criterion when placing databases on machines in HADR; as a result, CPU cycles go to waste if a large, say, 100TB database is deployed with a fairly light workload.

These observations motivated us to move away from the shared-nothing model of HADR (and replicated state machines) and towards a shared-disk design. In this design, all database compute nodes which execute transactions and queries have access to the same (remote) storage service. Sharing data across database nodes requires support for data versioning at different levels. To this end, Socrates relies on the shared version store described in Section 3.1. The combination of a shared version store and accelerated recovery (ADR, Section 3.2) makes it possible for new compute nodes to spin up quickly and to push the boundaries of read scale-out in Socrates well beyond what is possible in HADR.

4.1.4 Low Log Latency, Separation of Log. The log is a potential bottleneck of any OLTP database system. Every update must be logged before a transaction can commit and the log must be shipped to all replicas of the database to keep them consistent. The question is how to provide a highly performant logging solution at cloud scale.

The Socrates answer to this question is to provide a separate logging service. This way, we can tune and tailor the log to its specific access pattern. First, Socrates makes the log durable and fault-tolerant by replicating the log: a transaction can commit as soon as its log records have been made durable. It turns out that our implementation of quorum to harden log records is faster than achieving quorum in a replicated state machine (e.g., HADR). As a result, Socrates can achieve better commit performance as shown in Table 1.

Second, reading and shipping log records is more flexible and scalable if the log is decoupled from other database components. Socrates exploits the asymmetry of log access: recently created log records are in high demand whereas old log records are only needed in exceptional cases (e.g., to abort and undo a long-running transaction). Therefore, Socrates keeps recent log records in main memory and distributes them in a scalable way (potentially to hundreds of machines) whereas old log records are destaged and made available only upon demand.

Third, separating the log makes it possible to stand on giants' shoulders and plug in any external storage device to implement the log. As shown in Appendix A, this feature is already paying off as Socrates can leverage recent innovations in Azure storage without changing any of its architectural tenets. In particular, this feature allows Socrates to achieve low commit latencies without the need to implement its own log shipping, gossip quorum protocol, or log storage system.

4.1.5 Pushdown Storage Functions. One advantage of the shared-disk architecture is that it makes it possible to offload functions from the compute tier onto the storage tier, thereby moving the functions to the data. This way, Socrates can achieve significant performance improvements. Most importantly, every database function that can be offloaded to storage (whether backup, checkpoint, IO filtering, etc.) relieves the Primary Compute node and the log, the two bottlenecks of the system.

4.1.6 Reuse Components, Tuning, Optimization. SQL Server has a rich eco-system with many tools, libraries, and existing applications. Applications on the millions of existing SQL DB databases in Azure must be able to migrate to Socrates in a seamless way. Moreover, full backward compatibility with the millions of SQL Server on-premise databases that might one day be migrated to Azure is also a top priority. Thus, Socrates needs to support the same T-SQL programming language and basic APIs for managing databases. Furthermore, SQL Server is an enterprise-scale database system with decades of investment into robustness (testing) and high performance. We did not want to and cannot afford to reinvent the wheel and degrade customer experience under any circumstances. So, critical components of SQL Server such as the query optimizer, the query runtime, security, transaction management and recovery, etc. are unchanged. Furthermore, Socrates databases are tuned in the same way as HADR databases and Socrates behaves like HADR for specific workloads (e.g., repeated updates to hot rows). Socrates (like HADR) also embraces a scale-up architecture for high throughput as this is the state-of-the-art and sufficient for most OLTP workloads.

4.2 Socrates Architecture Overview
Figure 2 shows the Socrates architecture. As will become clear, it follows all the design principles and goals outlined in the previous section: (a) separation of Compute and Storage, (b) tiered and scaled-out storage, (c) bounded operations, (d) separation of Log from Compute and Storage, (e) opportunities to move functionality into the storage tier, and (f) reuse of existing components.

Figure 2: Socrates Architecture

The Socrates architecture has four tiers. Applications connect to Compute nodes. As in HADR, there is one Primary Compute node which handles all read/write transactions. There can be any number of Secondaries which handle read-only transactions or serve as failover targets. The Compute nodes implement query optimization, concurrency control, and security in the same way as today and support T-SQL and the same APIs (Section 4.1.6). If the Primary fails, one of the Secondaries becomes the new Primary. All Compute nodes cache data pages in main memory and on SSD in a resilient buffer pool extension (Section 3.3).

The second tier of the Socrates architecture is the XLOG service. This tier implements the "separation of log" principle, motivated in Section 4.1.4. While this separation has been proposed in the literature before [6, 8], this separation of log differentiates Socrates from other cloud database systems such as Amazon Aurora [20]. The XLOG service achieves low commit latencies and good scalability at the storage tier (scale-out). Since the Primary processes all updates (including DML operations), only the Primary writes to the log. This single-writer approach guarantees low latency and high throughput when writing to the log. All other nodes (e.g., Secondaries) consume the log in an asynchronous way to keep their copies of data up to date.

The third tier is the storage tier. It is implemented by Page Servers. Each Page Server has a copy of a partition of the database, thereby deploying a scale-out storage architecture which, as we will see, helps to bound all operations as postulated in Section 4.1.2. Page Servers play two important roles: First, they serve pages to Compute nodes. Every Compute node can request pages from a Page Server, following a shared-disk architecture (Section 4.1.3). We are currently working on implementing bulk operations such as bulk loading, index creation, DB reorgs, deep page repair, and table scans in Page Servers to further offload Compute nodes as described in Section 4.1.5. In their second role, Page Servers checkpoint data pages and create backups in XStore (the fourth tier). Like Compute nodes, Page Servers keep all their data in main memory or locally attached SSDs for fast access.

The fourth tier is the Azure Storage Service (called XStore), the existing storage service provided by Azure independently of Socrates and SQL DB. XStore is a highly scalable, durable, and cheap storage service based on hard disks. Data access is remote and there are throughput and latency limits imposed by storage at that scale and price point. Separating Page Servers with locally attached, fast storage from durable, scalable, cheap storage implements the design principles outlined in Section 4.1.1.

Compute nodes and Page Servers are stateless. They can fail at any time without data loss. The "truth" of the database is stored in XStore and XLOG. XStore is highly reliable and has been used by virtually all Azure customers for many years without data loss. Socrates leverages this robustness. XLOG is a new service that we built specifically for Socrates. It has high performance requirements and must be scalable, affordable, and never lose any data. We describe our implementation of XLOG, Compute nodes, and Page Servers in more detail next.

4.3 XLOG Service

Figure 3: XLOG Service

Figure 3 shows the internals of the XLOG Service. Starting in the upper left corner of Figure 3, the Primary Compute node writes log blocks directly to a landing zone (LZ) which is a fast and durable storage service that provides strong guarantees on data integrity, resilience and consistency; in other words, a storage service that has SAN-like capabilities.


The current version of SQL DB Hyperscale uses the Azure Premium Storage service (XIO) to implement the landing zone. For durability, XIO keeps three replicas of all data. As with every storage service, there is a performance, cost, availability, and durability tradeoff. Furthermore, there are many innovations in this space. Socrates naturally benefits from these innovations. Appendix A studies this effect by showing the performance impact of using an alternative storage service instead of XIO.

The Primary writes log blocks synchronously and directly to the LZ for the lowest possible commit latency. The LZ is meant to be fast (possibly expensive) but small. The LZ is organized as a circular buffer, and the format of the log is a backward-compatible extension of the traditional SQL Server log format used in all of Microsoft's SQL services and products. This approach obeys the design principle of not reinventing the wheel (Section 4.1.6) and maintains compatibility between Socrates and all other SQL Server products and services. One key property of this log extension is that it allows concurrent log readers to read consistent information in the presence of log writers without any synchronization (beyond wraparound protection). Minimizing synchronization between tiers leads to a system that is more scalable and more resilient.

The Primary also writes all log blocks to a special XLOG process which disseminates the log blocks to Page Servers and Secondaries. These writes are asynchronous and possibly unreliable (in fire-and-forget style) using a lossy protocol. One way to think about this scheme is that Socrates writes synchronously and reliably into the LZ for durability and asynchronously to the XLOG process for availability.

The Primary writes log blocks into the LZ and to the XLOG process in parallel. Without synchronization, it is possible that a log block arrives at, say, a Secondary before it is made durable in the LZ. We call such an unsynchronized approach speculative logging, and it can lead to inconsistencies and data loss in the presence of failures. To avoid these situations, XLOG only disseminates hardened log blocks. Hardened blocks are blocks which have already been made durable (with write quorum) in the LZ. To this end, the Primary writes all log blocks first into the "pending area" of the XLOG process. Furthermore, the Primary informs XLOG of all hardened log blocks. Once a block is hardened, XLOG moves it from the "pending area" to the LogBroker for dissemination, thereby also filling in gaps and reordering blocks that arrived out of order over the lossy protocol used to write into the "pending area".
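The following is a minimal sketch, with names of our own choosing, of the "pending area" idea: blocks arrive over the lossy path (possibly out of order), but only blocks the Primary has reported as hardened in the LZ are handed to the LogBroker, and only in sequence order.

```python
# Sketch (assumed names): XLOG buffers speculative blocks and releases them
# to consumers only once they are hardened and contiguous.

class XLogPendingArea:
    def __init__(self):
        self.pending = {}        # seq -> block bytes (may arrive out of order)
        self.hardened = set()    # seqs the Primary reported durable in the LZ
        self.next_seq = 0        # next sequence number the LogBroker expects
        self.broker = []         # blocks released for dissemination

    def on_block(self, seq, block):
        self.pending[seq] = block            # fire-and-forget write from the Primary
        self._release()

    def on_hardened(self, seq):
        self.hardened.add(seq)               # durability acknowledgement
        self._release()

    def _release(self):
        # move blocks to the LogBroker only when hardened and gap-free
        while self.next_seq in self.pending and self.next_seq in self.hardened:
            self.broker.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

x = XLogPendingArea()
x.on_block(1, b"B1")        # arrives before block 0: a gap, nothing is released
x.on_block(0, b"B0")
x.on_hardened(0)
x.on_hardened(1)
assert x.broker == [b"B0", b"B1"]   # released in order, only after hardening
```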

To disseminate and archive log blocks, the XLOG process implements a storage hierarchy. Once a block is moved into the LogBroker, an internal XLOG process called destaging moves the log to a fixed-size local SSD cache for fast access and to XStore for long-term retention. Again, XStore is cheap, abundant, durable, yet slow. We refer to this long-term archive for log blocks as LT. If not specified otherwise, SQL DB keeps log records for 30 days for point-in-time recovery and disaster recovery with fuzzy backups. It would be prohibitively expensive to keep 30 days' worth of log records in the LZ which is a low-latency, expensive service. This destaging pipeline must be carefully tuned: Socrates cannot process any update transactions once the LZ is full with log records that have not been destaged yet. While this tiered architecture is complex, no other log backup process is needed; between the LZ and LT (XStore), all log information is durably stored. Furthermore, this hierarchy meets all our latency (fast commit in LZ) and cost requirements (mass storage in XStore).

Consumers (Secondaries, Page Servers) pull log blocks from the XLOG service. This way, the architecture is more scalable as the LogBroker need not keep track of log consumers, possibly hundreds of Page Servers. At the top level, the LogBroker has a main-memory hash map of log blocks (called Sequence Map in Figure 3). In an ideal system, all log blocks are served from this Sequence Map. If the data is not found in the Sequence Map, the local SSD cache of the XLOG process is the next tier. This local SSD cache is another circular buffer of the tail of the log. If a consumer requests a block that has aged out of the local SSD cache, the log block is fetched from the LZ and, if that fails, from the LT as a last resort where the log block is guaranteed to be found.

The XLOG process implements a few other generic functions that are required by a distributed DBaaS system: leases for log lifetime, LT blob cleanup, backup/restore bookkeeping, progress reporting for log consumers, block filtering, etc. All of these functions are chosen carefully to preserve the stateless nature of the XLOG process, to allow for easy horizontal scaling, and to avoid affecting the main XLOG functions of serving and destaging log.
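The consumer read path described two paragraphs above can be summarized as a simple tiered lookup. The sketch below uses our own names and models each tier as a plain map; it is not the actual LogBroker code.

```python
# Sketch (assumed names): cheapest tier first; the long-term archive (LT)
# is the backstop where every destaged block can be found.

def get_log_block(seq, sequence_map, ssd_cache, landing_zone, long_term):
    for tier in (sequence_map, ssd_cache, landing_zone, long_term):
        block = tier.get(seq)        # each tier maps block seq -> bytes
        if block is not None:
            return block
    raise KeyError(f"log block {seq} missing even from LT (should not happen)")

# Example: a recent block is served from memory, an old one only from LT.
sequence_map = {42: b"recent"}
ssd_cache, landing_zone = {}, {}
long_term = {7: b"old", 42: b"recent"}
assert get_log_block(42, sequence_map, ssd_cache, landing_zone, long_term) == b"recent"
assert get_log_block(7, sequence_map, ssd_cache, landing_zone, long_term) == b"old"
```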

4.4 Primary Compute Node and GetPage@LSN

A Socrates Primary Compute node behaves almost identically to a standalone process in an on-premise SQL Server installation. The database instance itself is unaware of the presence of other replicas. It does not know that its storage is remote, or that the log is not managed in local files. In contrast, a HADR Primary node is well aware that it participates in a replicated state machine and achieves quorum to harden log and commit transactions. In particular, a HADR Primary knows all the Secondaries in a tightly coupled way. The Socrates Primary is, thus, simpler.

The core function of the Primary, however, is the same: process read/write transactions and produce log. There are several notable differences from an on-premise SQL Server:


• Storage-level operations such as checkpoint, backup/restore, page repair, etc. are delegated to the Page Servers and lower storage tiers.

• The Socrates Primary writes log to the LZ using the virtualized filesystem mechanism of Section 3.6. This mechanism produces an I/O pattern that is compatible with the LZ concept described in Section 4.3.

• The Socrates Primary makes use of the RBPEX cache (Section 3.3). RBPEX is integrated transparently as a layer just above the I/O virtualization layer.

• Arguably, the biggest difference is that a Socrates Primary does not keep a full copy of the database. It merely caches a hot portion of the database that fits into its main memory buffers and SSD (RBPEX).

This last difference requires a mechanism for the Primary to retrieve pages which are not cached in the local node. We call this mechanism GetPage@LSN. The GetPage@LSN mechanism is a remote procedure call which is initiated by the Primary from the FCB I/O virtualization layer using the RBIO protocol (Section 3.4). The prototype for this call has the following signature:

getPage(pageId, LSN)

Here, pageId uniquely identifies the page that the Primary needs to read, and LSN identifies a page log sequence number with a value at least as high as the last PageLSN of the page. The Page Server (Section 4.6) returns a version of the page that has applied all updates up to this LSN or higher.

To understand the need for this mechanism, consider the following sequence of events:

(1) The Primary updates Page X in its local buffers.

(2) The Primary evicts Page X from its local buffers (both buffer pool and RBPEX) because of memory pressure or accumulated activity. Prior to the page eviction, the Primary follows the write-ahead logging protocol [18] and flushes all the log records that describe the changes to Page X to XLOG.

(3) The Primary reads Page X again.

In this scenario, it is important that the Primary sees the latest version of Page X in Step 3 and, thus, issues a getPage(X, X-LSN) request with a specific X-LSN that guarantees that the Page Server returns the latest version of the page.

To guarantee freshness, the Page Server handles a getPage(X, X-LSN) request in the following way:

(1) Wait until it has applied all log records from XLOG up to X-LSN.

(2) Return Page X.

This simple protocol is all that is needed to make sure that the Page Server does not return a stale version of Page X to the Primary. (Section 4.6 contains more details on Page Servers.)

We have not yet described how the Primary knows which X-LSN to use when issuing the getPage(X, X-LSN) call. Ideally, X-LSN would be the most recent page LSN for Page X. However, the Primary cannot remember the LSNs of all pages it has evicted (essentially the whole database). Instead, the Primary builds a hash map keyed by pageId which stores in each bucket the highest LSN of every page evicted from the Primary. Given that Page X was evicted at some point from the Primary, this mechanism is guaranteed to give an X-LSN value that is at least as large as the largest LSN for Page X and is, thus, safe.
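Putting the two halves of the protocol together, here is a hedged Python sketch of the GetPage@LSN contract: the Primary remembers, per evicted page, the highest LSN it has seen, and the Page Server serves a page only once its applied log has reached the requested LSN. All names are illustrative, and the wait is modeled as an assertion.

```python
# Sketch (assumed names) of the GetPage@LSN contract described above.

class PageServer:
    def __init__(self):
        self.applied_lsn = 0           # how far this server has applied the log
        self.pages = {}                # pageId -> (page_lsn, payload)

    def apply_log(self, page_id, page_lsn, payload):
        self.pages[page_id] = (page_lsn, payload)
        self.applied_lsn = max(self.applied_lsn, page_lsn)

    def get_page(self, page_id, lsn):
        # In the real system this waits for log apply; the sketch just checks.
        assert self.applied_lsn >= lsn, "would wait for log apply to catch up"
        return self.pages[page_id][1]

class Primary:
    def __init__(self, page_server):
        self.evicted_lsn = {}          # pageId -> highest LSN seen at eviction
        self.page_server = page_server

    def evict(self, page_id, page_lsn):
        prev = self.evicted_lsn.get(page_id, 0)
        self.evicted_lsn[page_id] = max(prev, page_lsn)

    def read_uncached(self, page_id):
        # Ask for a version at least as new as the last one this node produced.
        lsn = self.evicted_lsn.get(page_id, 0)
        return self.page_server.get_page(page_id, lsn)

ps = PageServer()
p = Primary(ps)
ps.apply_log("X", 10, "X@10")
p.evict("X", 10)                       # Primary evicts Page X written at LSN 10
assert p.read_uncached("X") == "X@10"  # the returned page cannot be stale
```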

4.5 Secondary Compute Node
Following the reuse design principle (Section 4.1.6), a Socrates Secondary shares the same apply-log functionality as in HADR. A (simplifying) difference is that the Socrates Secondary need not save and persist log blocks because that is the responsibility of the XLOG service. Furthermore, Socrates is a loosely coupled architecture so that the Socrates Secondary does not need to know who produces the log (i.e., which node is the Primary). As in HADR, the Socrates Secondary processes read-only transactions (using Snapshot Isolation [4]). The most important components such as the query processor, the security manager, and the transaction manager are virtually unchanged from standalone SQL Server and HADR.

As with the Primary, the most significant changes between Socrates and HADR come from the fact that Socrates Secondaries do not have a full copy of the database. This fact is fundamental to achieving our goal to support large databases and making Compute nodes stateless (with a cache). As a result, it is possible that the Secondary processes a log record that relates to a page that is not in its buffers (neither main memory nor SSD). There are different policies conceivable for this situation. One possible policy is to fetch the page and apply the log record. This way, the Secondary's cache has roughly the same state as the Primary's cache (at least for updated pages) and performance is more stable after a fail-over to the Secondary.

The policy that SQL DB Hyperscale currently implements is that log records that involve pages that are not cached are simply ignored. This policy results in an interesting race condition because the check for existence in the cache can conflict with a concurrent, pending GetPage@LSN request from a read-only transaction processed by the Secondary. To resolve this conflict, Secondaries must register GetPage@LSN requests before making the actual call, and the apply-log thread of the Secondary queues log records of pending GetPage@LSN requests until the page is loaded.
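A minimal sketch (with our own names) of this ignore-uncached policy and of the registration that closes the race: the apply thread drops records for pages that are neither cached nor the target of a registered, pending GetPage@LSN request, and queues records for pending pages until the page arrives.

```python
# Sketch (assumed names): Secondary log-apply policy for uncached pages.

class SecondaryApply:
    def __init__(self):
        self.cache = {}            # pageId -> page state (a string in this sketch)
        self.pending = {}          # pageId -> queued log records awaiting the page

    def register_getpage(self, page_id):
        # Called by a reader thread *before* issuing the remote GetPage@LSN call.
        self.pending.setdefault(page_id, [])

    def on_page_arrived(self, page_id, page):
        self.cache[page_id] = page
        for rec in self.pending.pop(page_id, []):   # drain queued records in order
            self.cache[page_id] = self.cache[page_id] + rec

    def apply(self, page_id, rec):
        if page_id in self.cache:
            self.cache[page_id] = self.cache[page_id] + rec   # normal apply
        elif page_id in self.pending:
            self.pending[page_id].append(rec)                 # queue for later
        # else: page not cached and not being fetched -> record is ignored

s = SecondaryApply()
s.register_getpage("X")      # a reader is about to fetch Page X
s.apply("X", "+r1")          # not lost, even though X is not cached yet
s.on_page_arrived("X", "X")  # page arrives, queued record is applied
assert s.cache["X"] == "X+r1"
```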


Another interesting race condition arises when the Secondary does B-tree traversals to process read-only transactions. The Secondary applies the same GetPage@LSN protocol to fetch pages from Page Servers as the Primary. Again, the Secondary has a PageLSN cache to conservatively determine the right LSN. This LSN is typically lower than the last LSN written by the Primary at the same point in time. This can result in inconsistencies. Consider the following situation as part of a B-tree traversal:

• The Secondary reads B-tree Node P with LSN 23. (The Secondary has applied log until LSN 23.)

• The next Node C, a child of P, is not cached. The Secondary issues a getPage(C, 23) request.

• The Page Server returns Page C with LSN 25, following the protocol described in Section 4.4.

According to the GetPage@LSN protocol, the Page Server may return a page from the future. This can create inconsistencies. For instance, what if Page C was split at Time 24? In this case, the Secondary has looked at Page P from its present (before the split) and looks at Page C from the future (after the split), which may result in wrong results. Fortunately, it is easy to detect such inconsistencies. If the Secondary detects such an inconsistency during an index traversal, it will pause to give the log-apply thread some time to consume more log to refresh the stale index pages (i.e., Page P). After that pause, it will restart the B-tree traversal, hoping that the index structure is now consistent.

The key insight here is that persistent data structures in SQL Server have been designed such that logically coherent traversals can be achieved even in the presence of pages that are physically obtained from different points in time (from an LSN perspective). Logical traversals rely on the version store described in Section 3.1 to extract the right version of a record from a page given the transaction timestamp (using Snapshot Isolation).
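To make the pause-and-restart behavior described above concrete, here is a simplified, hypothetical sketch. It detects a "page from the future" simply by comparing the fetched page's LSN with the locally applied LSN, which is a deliberate simplification of the real consistency check; all names are ours.

```python
# Sketch (assumed names): restart a B-tree traversal when a fetched page
# comes "from the future" relative to the log applied on this Secondary.
import time

def traverse(root_id, key, applied_lsn_fn, get_page_fn, max_retries=10):
    for _ in range(max_retries):
        page_id = root_id
        while True:
            page = get_page_fn(page_id, applied_lsn_fn())
            if page["lsn"] > applied_lsn_fn():
                break                      # page from the future: abandon this pass
            if page["leaf"]:
                return page["rows"].get(key)
            page_id = page["child"]        # descend one level
        time.sleep(0.01)                   # let the log-apply thread catch up, retry
    raise RuntimeError("traversal did not reach a consistent view")

# Example where the applied LSN is already high enough for a clean traversal.
pages = {
    "root": {"lsn": 20, "leaf": False, "child": "leaf1"},
    "leaf1": {"lsn": 22, "leaf": True, "rows": {"k": "v"}},
}
assert traverse("root", "k", lambda: 25, lambda pid, lsn: pages[pid]) == "v"
```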

4.6 Page Servers
A Page Server is responsible for (i) maintaining a partition of the database by applying log to it, (ii) responding to GetPage@LSN requests from Compute nodes, and (iii) performing distributed checkpoints and taking backups.

Following the reuse principle of Section 4.1.6, Page Servers carry out the log-apply task in a manner that is similar to the procedure for Secondaries described in Section 4.5. In contrast to Secondaries, which need to be made aware of all changes to the database and, thus, need to consume all log blocks, Page Servers only ever care about log blocks that contain log records that involve pages of the database partition handled by that particular Page Server. To this end, the Primary includes sufficient out-of-band annotations for each log block indicating which partitions are affected by the log records in that block. XLOG uses this filtering information to disseminate only relevant log blocks to each Page Server.
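A small sketch of this filtering idea, using a data layout of our own invention: the Primary tags each log block with the set of partitions its records touch, and XLOG forwards the block only to Page Servers whose partition appears in that set.

```python
# Sketch (assumed names): per-block partition annotations drive dissemination.

def disseminate(log_block, page_servers):
    """log_block: dict with 'records' and an out-of-band 'partitions' set.
    page_servers: dict partition_id -> list collecting that server's blocks."""
    for partition_id in log_block["partitions"]:
        if partition_id in page_servers:
            page_servers[partition_id].append(log_block)

servers = {0: [], 1: [], 2: []}
block = {"records": ["update page 17", "update page 900"], "partitions": {0, 2}}
disseminate(block, servers)
assert servers[0] == [block] and servers[2] == [block] and servers[1] == []
```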

Serving GetPage@LSN requests is also straightforward. The Page Server simply follows the protocol of Section 4.4. To this end, Page Servers also use RBPEX, the SSD-extended, recoverable cache. The mechanisms to use RBPEX are the same for Page Servers as for Compute nodes, but the policy is different. Compute nodes cache the hottest pages for best performance; their caches are sparse. In contrast, Page Servers cache less hot pages, those pages that are not hot enough to make it into the Compute node's cache. That is why SQL DB Hyperscale currently implements Page Servers using a covering cache; i.e., all pages of the partition are stored in the Page Server's RBPEX. Furthermore, Socrates organizes a Page Server's RBPEX in a stride-preserving layout such that a single I/O request from a Compute node that covers a multi-page range translates into a single I/O request at the Page Server. Since the Page Server's cache is dense, the Page Server does not suffer from read amplification while the sparse RBPEX caches at Compute nodes do. This characteristic is important for the performance of scan operations that commonly read up to 128 pages. Another important characteristic of the Page Server cache is that it provides insulation from transient XStore failures. On an XStore outage, the Page Server continues operating in a mode where pages that were written in RBPEX but not in XStore are remembered, and checkpointing is resumed (and XStore is caught up) when XStore is back online. The same mechanism allows Socrates to aggregate multiple I/Os being sent to XStore into a single large write operation in order to get the best possible throughput out of the underlying storage service. Finally, when a new Page Server is started, its RBPEX is seeded asynchronously while the Page Server is already available and able to serve requests and apply log. Decoupling long-running operations such as seeding new Page Servers from other operational tasks (Section 4.1.3) is one of the key principles we tried to enforce at all layers in the Socrates stack.

Finally, Page Servers take checkpoints and do backups by interacting with XStore. How Page Servers and XStore operate together for these operations is the subject of the next subsection.

4.7 XStore for Durability and Backup/Restore

As shown in the previous sections, the truth of the database is stored in XStore. The details of XStore are described in [10]. In a nutshell, XStore is cheap (based on hard disks), durable (virtually no data loss due to a high degree of replication across availability zones), and provides efficient backup and restore by using a log-structured design (taking a backup merely requires keeping a pointer into the log) [19]. But XStore is potentially slow. The goal of all other components of the Socrates architecture is to add performance and availability. In other words, XStore plays in Socrates the same role as hard disks and tape in a traditional database system. The main-memory and SSD caches (RBPEX) of Compute nodes and Page Servers play in Socrates the same role as main memory in a traditional system.

Once these analogies are understood, it is straightforward to see how Socrates does checkpointing, recovery, backup, and restore. For checkpointing, a Page Server regularly ships modified pages to XStore. Backups are implemented using XStore's snapshot feature which makes it possible to create a backup in constant time by simply recording a timestamp [10]. When a user requests a Point-In-Time-Restore operation (PITR), the restore workflow identifies (i) the complete set of snapshots that have been taken just before the time of the PITR, and (ii) the log range required to bring this set of snapshots from their time to the requested time. These snapshots are then copied to new blobs and a restore operation starts in which each blob is attached to a new Page Server instance, a new XLOG process is bootstrapped on the copied log blobs, and the log is applied to bring the database all the way to the requested PITR time.
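A hedged sketch of the restore bookkeeping just described: pick, per partition, the newest snapshot taken at or before the requested time, and the log range from the oldest such snapshot up to the target time. The names, data shapes, and the assumption that every partition has at least one eligible snapshot are ours.

```python
# Sketch (assumed names): choose snapshots and the log range for a PITR.

def plan_pitr(snapshots, target_time):
    """snapshots: dict partition_id -> sorted list of snapshot timestamps.
    Returns (chosen snapshot per partition, (log_start, log_end))."""
    chosen = {}
    for partition, times in snapshots.items():
        eligible = [t for t in times if t <= target_time]
        chosen[partition] = max(eligible)        # newest snapshot not after target
    log_start = min(chosen.values())             # log needed from the oldest chosen snapshot
    return chosen, (log_start, target_time)

snaps = {"P1": [10, 40, 70], "P2": [20, 50]}
chosen, log_range = plan_pitr(snaps, target_time=60)
assert chosen == {"P1": 40, "P2": 50} and log_range == (40, 60)
```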

This Socrates PITR process is much more efficient than in HADR today. Like taking a snapshot, restoring a snapshot in XStore is a short, metadata operation that is executed in constant time (independent of the size of the data). Furthermore, the database becomes available almost instantaneously because the new Page Servers can start serving pages of the restored database immediately. However, performance will degrade until the caches of the Page Servers are restored because pages need to be fetched and recovered from XStore first. This example demonstrates again the importance of leveraging existing cloud services (Section 4.1.4).

5 SOCRATES AT WORK
The GetPage@LSN protocol described in Sections 4.4 to 4.6 is a nice example of how Socrates implements distributed protocols. The design principles are always the same: the Socrates mini-services like Primary, Secondaries, XLOG, and Page Servers are autonomous and decoupled, and communication is asynchronous whenever possible. For scalability, mini-services do not need to know about other mini-services; for instance, the Primary need not know how many Page Servers exist, possibly hundreds (Section 6). Synchronization is done by time travel whenever possible and by waiting if a mini-service is behind.

Socrates implements all distributed algorithms following these principles. Other examples, omitted for brevity, are distributed checkpointing (across all Page Servers), fail-over of the Primary, scaling up and down (i.e., serverless), creating a new Secondary, management of leases for log consumption, and creating new Page Servers.

6 DISCUSSION & SOCRATES DEPLOYMENTS

The Socrates architecture has many advantages. Separating Compute and Storage (Page Servers in Socrates) has been studied extensively in the literature [5, 8, 16, 17, 20]: it helps to build scalable database systems that grow beyond the storage capabilities of a single machine. Furthermore, this principle helps to establish a more fine-grained pay-as-you-go model in the cloud in which customers pay only for the storage that they need and independently for the compute that they consume. Socrates inherits all the advantages of this separation of compute and storage. It also inherits the disadvantages in that reads can become more expensive as they may need to access remote servers; Socrates remedies this disadvantage by aggressively caching data in Compute nodes in main memory and on disk, and Section 4 described the many innovations we made to get caching right.

A novel feature of the Socrates architecture is that it separates availability from durability. In Socrates, XLOG and XStore are the tiers that are responsible for durability whereas the Compute and Page Server tiers are only there for availability: if they fail, no data is lost, but the service becomes unavailable until they are restored. The big advantage of this separation is that it provides flexibility and fine-grained control to navigate the availability / performance / cost tradeoff. This way a Socrates deployment can be tailored to the specific needs of an application.

XLOG and XStore are needed in all Socrates deployments for durability. So, the simplest Socrates deployment consists of a single Compute node (the Primary Compute node, no Secondaries) and one Page Server that handles page requests for the entire database. This deployment is the most cost-effective; its performance depends on the hardware used to run the Primary and the Page Server. The downside of this minimum deployment is availability: If, say, the Primary Compute node fails, a new Compute node needs to be spun up to become the new Primary, resulting in downtime. Once the new Primary node is up, the system is available, but peak performance is only achieved after the cache of the new Primary is hot.

To achieve higher availability and avoid performance jitter after failures, any number of Secondaries can be added, at higher cost and with better performance because the Secondaries can be used for read-only transactions. Increasing the number of Page Servers also increases cost, performance, and availability. Interestingly, there are two ways in Socrates to add Page Servers, with subtly different availability impact.

One way to add a Page Server is to make the sharding of the database more fine-grained. This way, availability is improved because the partitions are smaller and, thus, the mean-time-to-recovery is lower, as it is faster to spin up a new Page Server for a partition. According to [14], a lower mean-time-to-recovery implies higher availability. Furthermore, this approach improves performance by increasing the degree of parallelism when bulk operations (e.g., large table scans or bulk loads) are pushed down to Page Servers [12]. With today's network and hardware parameters, we calculated that a good partition size for a Page Server is 128GB. So, a Socrates database with hundreds of TB will result in a deployment with thousands of Page Servers.
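A back-of-the-envelope check of this claim (the 128GB partition size is from the text; the example database sizes are made up):

    # Back-of-the-envelope Page Server count for fine-grained sharding.
    # The 128GB partition size comes from the text; the database sizes are illustrative.

    PARTITION_SIZE_GB = 128

    for db_size_tb in (1, 100, 300):
        partitions = (db_size_tb * 1024 + PARTITION_SIZE_GB - 1) // PARTITION_SIZE_GB
        print(f"{db_size_tb:>4} TB database -> {partitions} Page Servers")

    # 1 TB -> 8, 100 TB -> 800, 300 TB -> 2400:
    # hundreds of TB indeed imply thousands of Page Servers.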

A second way to add a Page Server is to create a replica of an existing Page Server. This approach increases availability as the replica is ready (and hot) when the Page Server fails.

Geo-replication is an important way to increase both availability and performance. Socrates allows Secondaries and Page Servers to be deployed in different data centers and availability zones. This approach improves performance, for instance, because the database can be queried by local Secondaries around the world. But, of course, geo-replication also comes at the cost of shipping the log across data centers.

Finally, an important advantage of the Socrates architecture is that it makes the best use of other, existing cloud services. The Azure XStore service is used to do efficient backups (for point-in-time recovery) and to implement durability in a scalable and cheap way. The XLOG tier depends on Azure Premium Storage. As we will see in Section 7, Socrates naturally benefits from innovations in these existing services. The Compute and Page Server tiers implement native database functionality only and do not replicate any functionality provided by other, more general cloud services.

7 PERFORMANCE EXPERIMENTS AND RESULTS

This section presents experimental results that assess the effectiveness of the Socrates design. As a baseline, we use the HADR architecture (Section 2).

7.1 Software and Services Used

We implemented Socrates as part of the SQL DB Hyperscale service in Azure. At the time of writing, this service was in preview, and we used this preview version for all experiments reported in this paper. We experimented with two different deployments:

• Production: This is the same version of SQL DB Hyperscale that Azure customers get. It allowed us to do an apples-to-apples comparison of Socrates with the HADR-based SQL DB service. It uses Azure Premium Storage (XIO) to implement the Socrates LZ.

• Test: We also deployed Socrates in a test cluster to study the impact of a new, improved premium storage service to implement the LZ.

The experiments with the Test system are presented in Appendix A.

We used Azure VMs of different T-shirt sizes. For the experiments in the Test cluster, we used VMs with 64 cores, 432 GB of main memory, and 32 disks. In the Production deployment, we used VMs with 8 and 16 cores for both Socrates and HADR.

We used the CDB benchmark [1], which is Microsoft's Cloud Database Benchmark (also known as the DTU benchmark) and has been used to test the performance of all Microsoft DBaaS offerings in Azure. CDB is based on a synthetic database with six tables and a scaling factor to generate databases of different sizes. If not stated otherwise, we used a 1TB CDB database, which is a size that is comfortably supported by HADR. We also did experiments that demonstrate the scalability of Socrates to much larger sizes (not supported by HADR). CDB defines a set of transaction types covering a wide range of operations, from simple point lookups to complex bulk updates. Furthermore, CDB specifies workload mixes that test system characteristics for specific workloads; e.g., read-only.

7.2 Experiment 1: CDB Default Mix, Throughput, Production Cluster

Table 2 shows the throughput of Socrates and HADR in production on an eight-core VM with 64 concurrent client threads to generate the workload. For these experiments we used the default workload mix of CDB, which executes all transaction types of the benchmark. The size of the database was 1TB.

              CPU %   Write TPS   Read TPS   Total TPS
    HADR      99.1    347         1055       1402
    Socrates  96.4    330         1005       1335

Table 2: CDB Throughput: HADR vs. Socrates (1TB)

Table 2 shows that Socrates throughput is about 5% lower than the throughput of HADR. This result is not unexpected for a service in preview, especially considering that HADR has been tuned over the years for this benchmark. It is interesting to understand how Socrates loses this 5%. HADR achieves 99.1% CPU utilization, which is almost perfect. Socrates has a lower CPU utilization as it needs to wait longer for remote I/Os. This lower CPU utilization explains half of the deficit. Recall that remote I/O is fundamental to any DBaaS architecture that scales database sizes beyond what can fit into a single machine. From performance profiles, it seems that the other 2.5% of the performance gap comes from a higher CPU cost of writing the log to a remote service (XLOG). We believe that Socrates can catch up and become even better than HADR with larger caches and further tuning of the I/O and network protocols.
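As a rough sanity check of this decomposition, the following sketch recomputes the gap from the Table 2 numbers; it is illustrative arithmetic, not part of the original analysis.

    # Rough sanity check of the throughput-gap decomposition using the Table 2 numbers.

    hadr_tps, socrates_tps = 1402, 1335
    hadr_cpu, socrates_cpu = 99.1, 96.4

    total_gap = (hadr_tps - socrates_tps) / hadr_tps   # ~4.8% overall throughput deficit
    cpu_gap   = (hadr_cpu - socrates_cpu) / hadr_cpu   # ~2.7% of CPU cycles lost waiting on remote I/O

    # Roughly half of the ~5% deficit tracks the lower CPU utilization; the paper
    # attributes the remaining ~2.5% to the CPU cost of writing the log to XLOG.
    print(f"total gap: {total_gap:.1%}, explained by idle CPU: {cpu_gap:.1%}")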

7.3 Experiment 2: Caching Behavior

    Data Size   Scale Factor   Memory Size   RBPEX Size   Local cache hit %
    1TB         20000          56GB          168GB        52

Table 3: Socrates Cache Hit Rate (CDB)

    Data Size   Customers   Memory Size   RBPEX Size   Local cache hit %
    30TB        3.1M        88GB          320GB        32

Table 4: Socrates Cache Hit Rate (TPC-E)

Table 3 shows the cache hit rate of Socrates for a 1TB CDB database and an RBPEX with 56GB of main-memory buffers and 168GB of SSD storage space. (Obviously, the hit rate is 100% in HADR because HADR stores a full copy of the database in every Compute node.) For this experiment, we again used the default workload mix of CDB. This workload randomly touches pages scattered across the entire database; thus, it is a bad case for caching. Nevertheless, we achieve a 50% hit rate with a cache that is only 15% of the size of the database. To study a more realistic scenario, we ran Socrates on a 30TB TPC-E database using the TPC-E benchmark. For this experiment, we increased the buffer pool of the Socrates Primary (88 GB of main memory and 320 GB of SSD). Table 4 shows that even though the cache is only about 1% of the size of the database, Socrates had a 32% hit rate.

These results indicate that a smart local SSD cache can be extremely effective. We believe that similar improvements can be made in other operational areas, too, thereby exploiting the flexible, decomposed Socrates architecture; e.g., to reduce the cost of recovery, the impact of database engine upgrades, and the time to return to peak performance after a machine restart.

7.4 Experiment 3: Update-heavy CDB, Log Throughput

The goal of this experiment was to evaluate the ability to sustain high update throughputs. Specifically, this experiment studied the logging throughput of Socrates and HADR using a special workload mix of CDB that produces the maximum amount of log data. In this experiment, the log is the bottleneck for both HADR and Socrates, and the performance of the system is determined by the logging bandwidth. For these experiments, we used VMs with 16 cores and 256 client threads to generate the transactional workload, to make sure that the logging component is indeed saturated.

              SF      Log MB/s   CPU %
    HADR      30000   56.9       46.2
    Socrates  30000   89.8       73.2

Table 5: CDB Log Throughput: HADR vs. Socrates

Table 5 shows that Socrates beats HADR in this experiment. The low CPU utilization indicates that in both systems the logging component is indeed the bottleneck.

Why is Socrates better in this experiment? What determines the logging bandwidth? To answer these questions, we need to look at the entire logging pipeline. HADR needs to drive log and database backup from the Compute nodes in parallel with the user workload. Log production is restricted to the level at which the log backup egress can be safely handled by the Azure Storage tier (XStore) underneath. In contrast, Socrates can leverage XStore's snapshot feature for backup, which results in a much higher log production rate upstream at the Socrates Primary.
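One way to see this effect is as a simple bottleneck model (an illustrative sketch with made-up bandwidth figures, not measurements from the paper): in HADR, sustained log production is capped by what the backup path can drain, whereas Socrates' snapshot-based backup removes that cap.

    # Illustrative model of the logging bottleneck; the bandwidth figures are made up.

    def hadr_log_rate(primary_log_write_mb_s: float, backup_egress_mb_s: float) -> float:
        # HADR must continuously back up the log it produces, so sustained log
        # production is capped by the slower of the two paths.
        return min(primary_log_write_mb_s, backup_egress_mb_s)

    def socrates_log_rate(primary_log_write_mb_s: float) -> float:
        # Socrates backs up via XStore snapshots, so the backup path does not
        # throttle log production at the Primary.
        return primary_log_write_mb_s

    print(hadr_log_rate(100.0, 60.0))   # 60.0  -> throttled by backup egress
    print(socrates_log_rate(100.0))     # 100.0 -> limited only by the log-write path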

This result is a good example of how pushing storage functions (such as backup in this case) down into a dedicated storage tier is an extremely powerful concept. Such optimizations in lower tiers of the system can significantly impact the performance of customer-facing Compute nodes, as there are complex performance dependencies even in a loosely coupled system like Socrates.

8 CONCLUSION

This paper presented Socrates, a new architecture for DBaaS offerings in the cloud. Socrates powers Microsoft's new SQL database service in Azure, called SQL DB Hyperscale. Socrates relies on the well-established principle of separating Compute and Storage to achieve better availability and elasticity. Furthermore, Socrates separates durability and availability. This approach is novel and has not been studied in the literature before. The big advantage of this separation is that it allows customer requirements regarding the cost / performance / availability tradeoff to be met flexibly.

Socrates is a novel architecture, and we are still in the early days of understanding and exploiting its full potential. We are currently implementing parallel execution of bulk operations in Socrates Page Servers. Further avenues for future work include exploring multi-master variants of Socrates, better HTAP support in Socrates, and making use of the log for other services such as audit and security.


Acknowledgments. Socrates and SQL DB Hyperscale are the result of a large collaborative effort across the Microsoft SQL organization and Microsoft Research. We would like to thank Elias Yousefi Amin Abadi, Ruokun An, Wayne Chen, Christian Damianidis, Paul Larson, Justin Lewandoski, Alexandru Nedelcu, Shweta Raje, Brendan Rowan, Swati Roy, Sridharan Sakthivelu (Intel), and Weidan Yan for their invaluable contributions to the design and implementation of Socrates.

REFERENCES

[1] 2018. CDB/DTU benchmark. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu.
[2] 2018. Microsoft SQL Hyperscale. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-service-tier-hyperscale.
[3] 2018. Oracle Cloud Database. https://cloud.oracle.com/database/.
[4] Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O'Neil, and Patrick O'Neil. 1995. A Critique of ANSI SQL Isolation Levels. In SIGMOD 1995.
[5] Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A Transactional Record Manager for Shared Flash. In CIDR 2011.
[6] Dhruba Borthakur. 2017. The Birth of RocksDB-Cloud. http://rocksdb.blogspot.com/2017/05/the-birth-of-rocksdb-cloud.html.
[7] Peter Braam, Sean Roberts, Matthew O'Keefe, and David Bonnie. 2016. The Limits of Open Source in Extreme-scale Storage Systems Design. https://docplayer.net/62056362-The-limits-of-open-source-in-extreme-scale-storage-systems-design.html.
[8] Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, and Tim Kraska. 2008. Building a Database on S3. In SIGMOD 2008.
[9] Alain Bui and Hacène Fouchal (Eds.). 2002. OPODIS 2002. Studia Informatica Universalis, Vol. 3.
[10] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. 2011. Windows Azure Storage: a highly available cloud storage service with strong consistency. In SOSP 2011.
[11] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. 2012. Spanner: Google's Globally-Distributed Database. In OSDI 2012.
[12] Michael J. Franklin, Björn Þór Jónsson, and Donald Kossmann. 1996. Performance Tradeoffs for Client-Server Query Processing. In SIGMOD 1996.
[13] Jim Gray, Pat Helland, Patrick E. O'Neil, and Dennis E. Shasha. 1996. The Dangers of Replication and a Solution. In SIGMOD 1996.
[14] Jim Gray and Andreas Reuter. 1990. Transaction Processing: Concepts and Techniques.
[15] Per-Åke Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman, Jignesh M. Patel, and Mike Zwilling. 2011. High-Performance Concurrency Control Mechanisms for Main-Memory Databases. PVLDB 5, 4 (2011).
[16] Simon Loesing, Markus Pilman, Thomas Etter, and Donald Kossmann. 2015. On the Design and Scalability of Distributed Shared-Data Databases. In SIGMOD 2015.
[17] David B. Lomet, Alan Fekete, Gerhard Weikum, and Michael J. Zwilling. 2009. Unbundling Transaction Services in the Cloud. In CIDR 2009.
[18] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-granularity Locking and Partial Rollbacks Using Write-ahead Logging. TODS 17, 1 (1992).
[19] Mendel Rosenblum and John K. Ousterhout. 1992. The Design and Implementation of a Log-Structured File System. TOCS 10, 1 (1992).
[20] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In SIGMOD 2017.


         STDEV   Min (us)   Median (us)   Max (us)
    XIO  431     2518       3300          36864
    DD   167     484        800           39857

Table 6: CDB UpdateLite Latency: XIO vs. DD

A EXPERIMENT 4: TEST CLUSTER, XIO VS. DIRECTDRIVE

The last set of experiments studied the impact of the storage service used to implement the LZ. Azure constantly improves its storage services, and Socrates allows SQL DB Hyperscale to take advantage of these innovations.

For this experiment, we ran Socrates in a Test cluster in two different configurations with identical hardware, running the same workload on the same database. The only difference was the implementation of the LZ: We compared an implementation based on XIO (as in all previous experiments) and another implementation based on a new Azure service called DirectDrive (DD for short). When we did these experiments, DD was also in preview. DD aggressively exploits new technology trends such as RDMA. DD uses Win32 and the Windows I/O stack. For this experiment, we used a CDB workload mix with mostly small updates and no read transactions. We reduced the number of clients and measured the latency of committing a transaction. In this workload, the CPU utilization is low because the clients did not generate enough work to saturate it.

Table 6 shows that DD has significantly better min and median latency. Only for the max latency are XIO and DD roughly the same.

Figure 4: Socrates UpdateLite Throughput vs. Threads

The results of Table 6 were obtained with a single client thread. Figure 4 depicts the throughput with a growing number of concurrent client threads. Obviously, lower latency translates into higher throughput as long as the CPU of the Primary is under-utilized. This rule is workload-dependent and depends, in particular, on the read/write ratio. Nevertheless, DD is the clear winner, indicating that Socrates benefits nicely from innovations such as DD.

         Threads   Log MB/s   CPU %
    XIO  128       69         30
    DD   16        70         9

Table 7: CDB UpdateLite Log Throughput: XIO vs. DD

Finally, Table 7 shows the CPU utilization. In this experiment, we varied the number of client threads such that Socrates with both XIO and DD has roughly the same log throughput of 70 MB per second. Table 7 shows that XIO needs 8x the load to achieve the same log throughput as DD, thereby consuming three times as much CPU in the Primary (and eight times as much CPU in the clients). This way, DD can significantly reduce the cost of Socrates databases, as CPU dominates the cost of running a DBaaS offering. Here, one factor is that XIO requires expensive REST calls, whereas DD requests go through cheaper Win32 calls. We hope that leveraging DD will also help Socrates to close the performance gap to SQL DB discussed in Experiment 1.
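The factors quoted above follow directly from the Table 7 numbers; a quick check:

    # Quick check of the factors quoted above, straight from Table 7.
    xio_threads, xio_cpu = 128, 30
    dd_threads,  dd_cpu  = 16, 9

    print(xio_threads / dd_threads)   # 8.0  -> XIO needs 8x the client load for ~70 MB/s of log
    print(xio_cpu / dd_cpu)           # ~3.3 -> and burns roughly three times the Primary CPU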

In summary, all these experiments make a simple point: Socrates can leverage storage innovations in ways that traditional, more monolithic DBaaS architectures cannot. These improvements were achieved without changing a single line of code. In the long run, we believe that these improvements outweigh the performance penalties due to reading data from remote servers.