

Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics

Ohad Shacham, Yahoo Research ([email protected])

Yonatan Gottesman, Yahoo Research ([email protected])

Aran Bergman, Technion ([email protected])

Edward Bortnikov, Yahoo Research ([email protected])

Eshcar Hillel, Yahoo Research ([email protected])

Idit Keidar, Technion and Yahoo Research ([email protected])

ABSTRACT

We describe how we evolve Omid, a transaction processing system for Apache HBase, to power Apache Phoenix, a cloud-grade real-time SQL analytics engine.

Omid was originally designed for data processing pipelines at Yahoo, which are, by and large, throughput-oriented monolithic NoSQL applications. Providing a platform to support converged real-time transaction processing and analytics applications – dubbed translytics – introduces new functional and performance requirements. For example, SQL support is key for developer productivity, multi-tenancy is essential for cloud deployment, and latency is cardinal for just-in-time data ingestion and analytics insights.

We discuss our efforts to adapt Omid to these new domains, as part of the process of integrating it into Phoenix as the transaction processing backend. A central piece of our work is latency reduction in Omid's protocol, which also improves scalability. Under light load, the new protocol's latency is 4x to 5x smaller than the legacy Omid's, whereas under increased loads it is an order of magnitude faster. We further describe a fast path protocol for single-key transactions, which enables processing them almost as fast as native HBase operations.

PVLDB Reference Format:
Shacham, Gottesman, Bergman, Bortnikov, Hillel, and Keidar. Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics. PVLDB, 11(12): 1795-1808, 2018.
DOI: https://doi.org/10.14778/3229863.3229868

1. INTRODUCTION

In recent years, transaction processing [35] technologies have paved their way into multi-petabyte big data platforms [39, 26, 40]. Modern industrial transaction processing systems (TPSs) [39, 40, 14, 8] complement existing underlying NoSQL key-value storage with atomicity, consistency, isolation and durability (ACID) semantics that enable programmers to perform complex data manipulation without over-complicating their applications. Google's Percolator [39] pioneered a transaction API atop the Bigtable storage. Apache Incubator projects Tephra [14] and Omid [40] followed suit with Apache HBase [2].

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 11, No. 12. Copyright 2018 VLDB Endowment 2150-8097/18/8. DOI: https://doi.org/10.14778/3229863.3229868

Such technologies must evolve in light of the recent shift towards massive deployment of big data services in public clouds; for example, the AWS cloud can run HBase as part of Elastic MapReduce [1], and SQL engines like Apache Phoenix increasingly target public cloud deployment. Yet adapting TPS technology for cloud use is not without challenges, as we now highlight.

1.1 Challenges

Diverse functionality. Large-scale transaction support was initially motivated by specific use cases like content indexing for web search [39, 40] but rapidly evolved into a wealth of OLTP and analytics applications (e.g., [41]). Today, users expect to manage diverse workloads within a single data platform, to avoid lengthy and error-prone extract-transform-load processes. A Forrester report [15] coins the notion of translytics as "a unified and integrated data platform that supports multi-workloads such as transactional, operational, and analytical simultaneously in real-time, ... and ensures full transactional integrity and data consistency".

As use cases become more complex, application developers tend to prefer high-level SQL abstractions to crude NoSQL data access methods. Indeed, scalable data management platforms (e.g., Google Spanner [26], Apache Phoenix [6], and CockroachDB [8]) now provide full-fledged SQL interfaces to support complex query semantics in conjunction with strong data guarantees. SQL APIs raise new requirements for transaction management, e.g., in the context of maintaining secondary indexes for accelerated analytics.

Scale. Public clouds are built to serve a multitude of applications with elastic resource demands, and their efficiency is achieved through scale. Thus, cloud-first data platforms are designed to scale well beyond the limits of a single application. For example, Phoenix is designed to scale to 10K query processing nodes in one instance, and is expected to process hundreds of thousands or even millions of transactions per second (tps).

Latency. Similarly to many technologies, the adoption of transactions took a "functionality-first" trajectory. For example, the developers of Spanner [26] wrote: "We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions". Yet the expectation for low latency is rapidly picking up. Whereas early applications of big data transaction processing systems were mostly throughput-sensitive [39, 40], with the thrust into new interactive domains like social networks [7], messaging [22] and algorithmic trading [11], latency becomes essential. SLAs for interactive user experience mandate that simple updates and point queries complete within single-digit milliseconds. The early experience with Phoenix shows that programmers often refrain from using transaction semantics altogether for fear of high latency, and instead adopt solutions that may compromise consistency, e.g., causing an index to inconsistently represent the data.

Multi-tenancy. Data privacy is key in systems that host data owned by multiple applications. Maintaining access rights is therefore an important design consideration for TPSs, e.g., in maintaining shared persistent state.

1.2 Evolving Omid for public clouds

The motivation of this work is to improve scalability, latency and functionality of Omid [5] to the scale needed in large, multi-tenant clouds. The paper describes recent functionality extensions and performance improvements we made in Omid to power Phoenix [6], a Hadoop SQL-compliant data analytics platform designed for cloud deployment.

In contrast to earlier SQL query engines for Hadoop (e.g., Hive [3] and Impala [4]), which focused on large-scale processing of immutable data, Phoenix targets converged data ingestion and analytics [16], which necessitates transaction semantics. Phoenix applications are both latency and throughput sensitive. Phoenix uses HBase as its key-value storage layer, and Omid as a transaction processing layer atop HBase.¹

Omid is an open source transaction manager for HBase. It provides a (somewhat relaxed) variant of snapshot isolation (SI) [21], similarly to other modern transaction managers [39, 26, 14, 8]. Its implementation is lock-free, which makes it scale better than traditional lock-based serializability protocols. Omid is database-neutral, i.e., it can use any data store with a traditional NoSQL API (see Section 2). In particular, it does not require any changes to HBase code.

This paper makes the following contributions:

1. Protocol re-design for low latency (Section 3). The new protocol, Omid Low Latency (Omid LL), dissipates Omid's major architectural bottleneck. It reduces the latency of short transactions by 5x under light load, and by 10x–100x under heavy load. It also scales the overall system throughput to 550K tps while remaining within real-time latency SLAs. In contrast to previously published protocols (e.g., [39]), our solution is amenable to multi-tenancy. The development of this new feature is reported in [10].

2. Novel fast-path (FP) algorithm (Section 4). Our novel Omid FP protocol maximizes the performance of single-key transactions that arise in many production use cases. It defines a dedicated API for single-key read, write, and read-modify-write transactions, which execute at the latency of native HBase operations regardless of system load. They run twice as fast as Omid LL transactions, and complete within 2–4 ms on mid-range hardware. This comes at the cost of a minor (15–25%) negative impact on long (latency-insensitive) transactions. Note that Omid LL and Omid FP are complementary latency-oriented improvements.

¹Tephra is supported as an alternative transaction manager, though its scalability and reliability are inferior to Omid's.

3. Evaluation and comparison with alternative designs (Section 5). We extensively evaluate the new algorithms. We further compare Omid's centralized architecture to a decentralized two-phase commit (2PC)-based approach, as adopted in Percolator and CockroachDB.

4. SQL compliance (Section 6). We add support for creating secondary indexes on-demand, without impeding concurrent database operations or sacrificing consistency. We further extend the traditional SI model, which provides single-read-point-single-write-point semantics, with multiple read and write points. This functionality is used to avoid recursive read-your-own-writes scenarios in complex queries. The development of these new features in Omid and Phoenix is reported in [9] and [12], respectively.

Finally, Section 7 reviews related work, and Section 8 concludes.

2. BACKGROUND

A transaction processing system runs atop an underlying key-value store and allows users to bundle multiple data operations into a single atomic transaction. Section 2.1 describes the NoSQL data model and the data access API of the underlying key-value store, and Section 2.2 defines the transaction semantics provided by the TPS. Section 2.3 provides background on the modus operandi of existing TPSs that support SI, including Omid. Finally, Section 2.4 overviews HBase and Phoenix.

2.1 Data store API

The data store holds objects (often referred to as rows or records) identified by unique keys. Each row can consist of multiple fields, representing different columns. We consider multi-versioned objects, where object values are associated with version numbers, and multiple versions associated with the same key may co-exist. We further assume that a write operation can specify the version number it writes to. The data store provides the following API:

get(key, version) – returns the requested version of key. The API further allows traversing (reading) earlier versions of the same key.

scan(fromKey, toKey, version) – a range query (extension of get). Returns an iterator that supports retrieval of multiple records using getNext() calls.

put(key, version, fields, values) – creates or updates an object, setting the specified fields to the specified values. If the version already exists, its value is updated; otherwise, a new version is added.

remove(key, version) – removes an object with the given key and version.

check&mutate(key, field, old, new) – checks the record associated with key. If field holds old, replaces it with new; either way returns field's previous value. If old is nil, creates field and initializes it with new only if key is missing in the data store.

Most NoSQL data store implementations guarantee atomic execution of each of the above APIs, except traversals of iterators returned by scan.
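For concreteness, this API can be captured by a small Java interface. The sketch below is ours, assuming string keys and opaque byte-array field values; the names VersionedStore and Cell are illustrative and appear neither in Omid nor in HBase.

// Illustrative sketch of the Section 2.1 data store API; the interface and type
// names (VersionedStore, Cell) are ours, not HBase's or Omid's.
import java.util.Iterator;
import java.util.Map;

interface VersionedStore {
    /** Returns the requested version of key; earlier versions can be traversed from it. */
    Cell get(String key, long version);

    /** Range query: iterator over records in [fromKey, toKey] at the given version. */
    Iterator<Cell> scan(String fromKey, String toKey, long version);

    /** Creates or updates the given version of key, setting the given fields to the given values. */
    void put(String key, long version, Map<String, byte[]> fieldValues);

    /** Removes the object with the given key and version. */
    void remove(String key, long version);

    /**
     * Atomically: if field of key holds old (or old is nil and key is absent),
     * replace it with newVal; either way return the field's previous value.
     */
    byte[] checkAndMutate(String key, String field, byte[] old, byte[] newVal);
}

/** A single versioned record as returned by get/scan. */
record Cell(String key, long version, Map<String, byte[]> fields) {}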

2.2 Transaction semantics

TPSs provide begin and commit APIs for delineating transactions: a transaction is a sequence of read and write operations on different objects that occur between begin and commit. For simplicity, we only describe single-key operations here; multi-key range queries (scans) are treated as sequences of reads. Thus, transactional reads and writes are implemented using the data store's get and put operations. Two transactions are said to be concurrent if their executions overlap, i.e., one of them begins between the begin time and commit time of the other; otherwise, we say that they are non-overlapping.

A TPS ensures the ACID properties for transactions: atomicity (all-or-nothing), consistency (preserving each object's semantics), isolation (in that concurrent transactions do not see each other's partial updates), and durability (whereby updates survive crashes).

Different isolation levels can be considered for the third property. We consider a variant of snapshot isolation (SI) [21] that, similarly to generalized snapshot isolation [30], relaxes the real-time order requirement. Nevertheless, our implementation only relaxes the ordering of fast path transactions (described in Section 4) relative to regular ones (that do not use the fast path); regular transactions continue to satisfy SI amongst themselves. Moreover, Omid LL, without the fast path, satisfies SI as Omid does.

Our relaxed correctness condition satisfies the key "snapshot" property of SI, which ensures that a transaction reading from the database does not see a mix of old and new values. For example, if a transaction updates the values of two stocks, then no other transaction may observe the old value of one of these stocks and the new value of the other. However, it relaxes the real-time order guarantee of SI by allowing (fast-path) transactions to take effect 'in the past'. Specifically, the system enforces a total order T on all committed transactions, so that (i) non-overlapping transactions that update the same key occur in T in order of their commit times; (ii) each transaction's read operations see a consistent snapshot of the database reflecting a prefix of T; and (iii) a transaction commits only if none of the items it updates is modified by a transaction ordered in T after its snapshot time and before its commit time.

Note that as with SI, two concurrent transactions conflict only if they both update the same item. In contrast, under serializability, a transaction that updates an item also conflicts with transactions that read that item. Snapshot isolation is thus amenable to implementations (using multi-versioning) that allow more concurrency than serializable ones, and hence scale better. It is provided by popular database technologies such as Oracle, PostgreSQL, and SQL Server, and TPSs such as Percolator, Omid, and CockroachDB.

Following a commit call, the transaction may successfully commit, whereby all of its operations take effect, or abort, in which case none of its changes take effect.
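As a concrete illustration of these semantics, the snippet below sketches a client that atomically moves a value between two keys under SI. The Tx/TxManager interfaces are illustrative stand-ins of ours, not Omid's actual client API.

// Hypothetical client-side usage illustrating begin/read/write/commit under SI;
// the interfaces are illustrative stand-ins, not Omid's actual client classes.
interface Tx {
    byte[] read(String key) throws Exception;
    void write(String key, byte[] value) throws Exception;
}
interface TxManager {
    Tx begin() throws Exception;
    void commit(Tx tx) throws Exception;   // throws on a write-write conflict (abort)
}

final class TransferExample {
    /** Moves `amount` from one counter to another; either both writes take effect or neither does. */
    static void transfer(TxManager tm, String from, String to, long amount) throws Exception {
        Tx tx = tm.begin();                                       // fixes the read snapshot (ts_r)
        long src = java.nio.ByteBuffer.wrap(tx.read(from)).getLong();  // both reads see one snapshot
        long dst = java.nio.ByteBuffer.wrap(tx.read(to)).getLong();
        tx.write(from, java.nio.ByteBuffer.allocate(8).putLong(src - amount).array());
        tx.write(to,   java.nio.ByteBuffer.allocate(8).putLong(dst + amount).array());
        tm.commit(tx);                                            // all-or-nothing; may abort
    }
}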

Figure 1: Transaction processing architecture: A client library exposes an API for executing transactions of data store operations. A centralized Transaction Manager (TM) handles transaction begin and commit requests, while data is written directly to the underlying data store. The TM has a backup for high availability.

Algorithm 1 TPS operation schema.

 1: procedure begin
 2:     obtain read timestamp tsr
 3: procedure write(tsr, key, fields, values)
 4:     ▷ transactional write
 5:     optionally check for conflicts and abort if found
 6:     indicate write intent for key with values and tsr
 7:     add key to local write-set
 8: procedure read(tsr, key)                    ▷ transactional read
 9:     if key has write intent then
10:         resolve, possibly abort writing transaction
11:     return highest version ≤ tsr of key
12: procedure commit(tsr, write-set)            ▷ check for write-write conflicts
13:     obtain commit timestamp tsc
14:     if validate(write-set, tsr) then
15:         write commit version tsc to persistent commit entry
16:     else
17:         abort
18:     post-commit: update meta-data

2.3 TPS operation schema

Figure 1 depicts, at a high level, the primary components of the TPS architecture, their APIs, and interaction with the data store.

In many TPSs, transaction processing follows the following general schema, outlined in Algorithm 1, while systems vary in their implementations of each of the steps.

Most of the systems employ a centralized transaction manager (TM) service [39, 34, 40, 14], sometimes called timestamp oracle, for timestamp allocation and other functionalities. Because a centralized service can become a single point of failure, the TM is sometimes implemented as a primary-backup server pair to ensure its continued availability following failures.

Begin. When a transaction begins, it obtains a read timestamp (version) tsr for reading its consistent snapshot. In most cases, this is done using the centralized TM [39, 34, 40, 14].

Transactional writes. During a transaction, a write operation indicates its intent to write to a single object a certain new value with a certain version number. In Omid, the version is the transaction's tsr, which exceeds all versions written by transactions that committed before the current transaction began. Note that the version order among concurrent transactions that attempt to update the same key is immaterial, since all but one of these transactions are doomed to abort.

It is possible to buffer write intents locally (at the client) in the course of the transaction, and add the write intents to the data store at commit time [39].

In some solutions writes check for conflicts before declaring their intents [8], whereas in others, all conflict detection is deferred to commit time [39, 34, 40, 14].

Transactional reads. The reads of a given transaction obtain a consistent snapshot of the data store at logical time (i.e., version) tsr. Each read operation retrieves the value of a single object associated with the highest timestamp that is smaller or equal to the transaction's tsr.

On encountering a write intent, a read cannot proceed without determining whether the tentative write should be included in its snapshot, for which it must know the writing transaction's commit status. To this end, TPSs keep per-transaction commit entries, which are the source of truth regarding the transaction status (pending, committed, or aborted). This entry is updated in line 15 of Algorithm 1 as we explain below, and is checked in order to resolve write intents in line 10. In some cases [39, 8], when the status of the writing transaction is undetermined, the read forcefully aborts it by updating the commit entry accordingly, as explained below.

Commit. Commit occurs in four steps:

1. Obtain a commit timestamp, tsc. In most cases, e.g., [39, 34, 40, 14], this is the value of some global clock maintained by a centralized entity.

2. Validate that the transaction does not conflict with any concurrent transaction that has committed since it had begun. For SI, we need to check for write-write conflicts only. If write intent indications are buffered, they are added at this point [39]. Validation can be centralized [34, 40, 14] or distributed [39, 8].

3. Commit or abort in one irrevocable atomic step by persistently writing to the commit entry, which can reside in a global table [40, 8] or alongside the first key written by the transaction [39].

4. Post-commit: Finally, a transaction changes its write intents to persistent writes in case of commit, and removes them in case of abort. This step is not essential for correctness, but reduces the overhead of future transactions. It occurs after the transaction is persistently committed or aborted via the commit entry, and can be done asynchronously.

2.4 Big data platforms

Apache HBase is one of the most scalable key-value storage technologies available today. Like many state-of-the-art data stores, it scales through horizontal sharding (partitioning) of data across regions. An HBase instance is deployed on multiple nodes (region servers), each of which typically serves hundreds of regions. Production HBase clusters of 1K nodes and above are becoming common. For example, Yahoo Japan leverages an HBase cluster of 3,800 nodes that collectively store 37PB of data [17].

Phoenix complements the HBase storage tier with a query processing (compute) tier. The latter scales independently (the current scalability goal is 10,000 query servers). Phoenix compiles every SQL statement into a plan, and executes it on one or more servers. Its query processing code invokes the underlying HBase for low-level data access, and a TPS (Omid or Tephra) for transaction management, through client libraries.

Wherever possible, Phoenix strives to push computation close to data (e.g., for filtering and aggregation), in order to minimize cross-tier communication. For this, it makes extensive use of server-side stored procedures, which in HBase are supported by the non-intrusive coprocessor mechanism. Omid uses HBase coprocessors too, both for performance improvements and for specific services, such as garbage collection of redundant data.

3. LOW-LATENCY TRANSACTIONS

We now describe Omid LL, a scalable low-latency TPS algorithm that satisfies standard (unrelaxed) SI semantics and is amenable to multi-tenancy. We saw above that, while many TPSs follow a similar schema, they make different design choices when implementing this schema. We overview our design choices in Section 3.1. We then proceed to give a detailed description of the protocol in Section 3.2.

3.1 Omid LL design choices

We now discuss our design choices, which are geared towards high performance without sacrificing cloud deployability. Table 1 compares them with choices made in other TPSs. We are not familiar with another TPS that makes the same design choices as Omid LL.

Table 1: Design choices in TPSs. C – centralized, D – distributed, R – replicated.

TPS                   | validation | commit entry updates | multi-tenancy | read force abort
Percolator, Omid 2PC  | D          | D                    | no            | yes
CockroachDB           | D          | D                    | yes           | yes
Omid1, Tephra         | C          | R                    | yes           | no
Omid                  | C          | C                    | yes           | no
Omid LL               | C          | D                    | yes           | yes

Centralized validation. Omid LL adopts Omid's centralized conflict detection mechanism, which eliminates the need for locking objects in the data store, and is extremely scalable [40].

Other TPSs (like Percolator and CockroachDB [8]) instead use a distributed 2PC-like protocol that locks all written objects during validation (either at commit time or during the write). To this end, they use atomic check&mutate operations on the underlying data store. This slows down either commits (in case of commit-time validation) or transactional writes (in case of write-time validation), which takes a toll on long transactions, where validation time is substantial. To allow us to compare the two approaches, we also implement a 2PC-based version of Omid, Omid 2PC, which follows Percolator's design.

Distributed commit entry updates with multi-tenancy. The early generation of Omid [34] (referred to as Omid1) and Tephra replicate commit entries of pending transactions among all active clients, which consumes high bandwidth and does not scale. Omid instead uses a dedicated commit table, and has the centralized TM persist all commits to this table. Our experiments show that the centralized access to commit entries is Omid's main scalability bottleneck, and while this bottleneck is mitigated via batching, this also increases latency. Omid chose this option as it was designed for high throughput.

Here, on the other hand, we target low latency. We therefore distribute the commit entry updates, and allow commit entries of different transactions to be updated in parallel by independent clients. We will see below that this modification reduces latency by up to an order of magnitude for small transactions.

We note that commit table updates are distributed also in CockroachDB and Percolator. The latter takes this approach one step further, and distributes not only the commit table updates but also the actual commit entries. There, commit entries reside in user data tables, where the first row written in a given transaction holds the commit entry for that transaction. The problem with this approach is that it assumes that all clients have permissions to access all tables. For example, a transaction attempting to read from table A may encounter a write intent produced by a transaction that accessed table B before table A, and will need to refer to that transaction's commit entry in table B in order to determine its status. This approach did not pose a problem in Percolator, which was designed for use in a single application, but is unacceptable in multi-tenant settings.

Unlike data tables, which are owned by individual applications that manage their permissions, the dedicated commit table is owned by the TPS; it is accessed exclusively by the TPS client library, which in turn is only invoked internally by the database engine, and not by application code.

The commit table is a highly contended resource. While it is not very big at any given time, it is accessed by every transaction that modifies data. In large clusters, the update rate may become huge (see Section 5). We therefore shard this table evenly across many nodes in order to spread the load.

Write intent resolution. As in other TPSs, reads resolve write intents via the commit entry. If the transaction status is committed, the commit time is checked, and, if smaller than or equal to tsr, it is taken into account; if the transaction is aborted, the value is ignored. In case the transaction status is pending, Omid LL, like Percolator and CockroachDB, has the reader force the writing transaction to abort. This is done using an atomic check&mutate operation to set the status in the writing transaction's commit entry to aborted.

Omid and Tephra, on the other hand, do not need to force such aborts², because they ensure that if the read sees a write intent by an uncommitted transaction, the latter will not commit with an earlier timestamp than the read. Omid LL avoids such costly mechanisms by allowing reads to force aborts.

3.2 Omid LL algorithm

Like Omid, Omid LL uses a dedicated commit table (CT) for storing commit entries. A transaction is atomically committed by adding to the CT a commit entry mapping its tsr to its tsc. The post-commit phase then copies this information to a dedicated commit column in the data table, in order to spare future reading transactions the overhead of checking the commit table.

²Omid's high availability mechanism may force such aborts in rare cases of TM failover.

Data Table: key=k1, value=a, version=3, commit=nil. Commit Table: (empty)
(a) Pending transaction

Data Table: key=k1, value=a, version=3, commit=nil. Commit Table: tsr=3, commit=7
(b) Committed transaction

Data Table: key=k1, value=a, version=3, commit=7. Commit Table: tsr=3, commit=7
(c) Post-committed transaction

Figure 2: Evolution of Omid LL metadata during a transaction. The transaction receives tsr = 3 in the begin stage and uses this as the version number for tentative writes. It receives a commit timestamp tsc = 7 when committing.

Algorithm 2 Omid LL’s TM algorithm with HA support.

 1: procedure begin
 2:     checkRenew()
 3:     return Clock.fetchAndIncrement()
 4: procedure commit(txid, write-set)
 5:     checkRenew()
 6:     tsc ← Clock.fetchAndIncrement()
 7:     if conflictDetect(txid, write-set, tsc) then
 8:         return tsc
 9:     else
10:         return abort
11: procedure checkRenew               ▷ HA support; δ is the lease time
12:     if lease < now + 0.2δ then     ▷ extend lease
13:         renew lease for δ time     ▷ atomic operation
14:         if failed then halt
15:     if Clock = epoch then          ▷ extend epoch
16:         epoch ← Clock + range
17:         if ¬CAS(maxTS, Clock, epoch) then halt

Figure 2 shows the metadata of a transaction during its stages of execution.

Whereas Omid's CT is updated by the TM, Omid LL distributes the CT updates amongst the clients. Its TM is thus a simplified version of Omid's TM, and appears in Algorithm 2. It has two roles: First, it allocates begin and commit timestamps by fetching-and-incrementing a monotonically increasing global clock. Second, upon commit, it calls the conflictDetect function, which performs validation (i.e., conflict detection) using an in-memory hash table by checking, for each key in the write-set, that its commit timestamp in the hash table is smaller than the committing transaction's tsr. If there are no conflicts, it updates the hash table with the write-set of the new transaction and its tsc. The checkRenew procedure supports the TM's high availability, and is explained at the end of this section.
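The sketch below illustrates one way such in-memory validation could look. It is our simplification (a plain hash map keyed by key, guarded by a single lock), not Omid LL's actual data structure.

// Simplified sketch of centralized write-write conflict detection (our illustration,
// not Omid LL's actual implementation). For each key we remember the commit
// timestamp of the last transaction that wrote it.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class ConflictDetector {
    private final Map<String, Long> lastCommit = new HashMap<>();

    /**
     * Returns true iff a transaction that began at tsr and wrote writeSet may
     * commit at tsc; on success, records tsc for every written key so that future
     * validations observe it. Synchronized so the check-and-update is atomic.
     */
    synchronized boolean validateAndUpdate(long tsr, Set<String> writeSet, long tsc) {
        for (String key : writeSet) {
            Long committed = lastCommit.get(key);
            if (committed != null && committed >= tsr) {
                return false;            // a concurrent transaction already committed this key
            }
        }
        for (String key : writeSet) {
            lastCommit.put(key, tsc);    // make our commit visible to later validations
        }
        return true;
    }
}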

Client operations proceed as follows (cf. Algorithm 3):

Begin. The client sends a begin request to the TM, which assigns it a read timestamp tsr.

Write. The client uses the data store's update API to add a tentative record to the data store, with the written key and value, version number tsr, and nil in the commit column. It also tracks key in its local write-set.

Read. The algorithm traverses data records (using the data store's ds.get API) pertaining to the requested key with a version that does not exceed tsr, latest to earliest, and returns the first value that is committed with a version smaller than or equal to tsr. To this end, it needs to discover the commit timestamp, tsc, associated with each data record it considers.


Algorithm 3 Omid LL’s client-side operations.

procedure begin
    return TM.begin

procedure write(tsr, key, fields, values)
    track key in write-set
    add commit to fields and nil to values
    return ds.put(key, tsr, fields, values)

procedure read(tsr, key)
    for rec ← ds.get(key, versions down from tsr) do
        ▷ set tsc to the commit timestamp associated with rec
        if rec.commit ≠ nil then                 ▷ commit cell exists
            tsc ← rec.commit
        else                                     ▷ commit cell is empty, check CT
            tsc ← CT.get(rec.ts)
            if tsc = nil then                    ▷ no CT entry, set status to ⊥
                tsc ← CT.check&mutate(rec.ts, commit, nil, ⊥)
                if tsc = nil then                ▷ forced abort successful
                    tsc ← ⊥
        if tsc = ⊥ then
            ▷ check for race: commit before ⊥ entry created
            re-read rec from ds
            if rec.commit ≠ nil then
                tsc ← rec.commit
                CT.remove(rec.ts)
            else continue                        ▷ writing transaction aborted
        if tsc < tsr then
            return rec.value
    return nil                                   ▷ no returnable version found in loop

procedure commit(tsr, write-set)
    tsc ← TM.commit(tsr, write-set)              ▷ may return abort
    if tsc ≠ abort then
        if CT.check&mutate(tsr, commit, nil, tsc) = ⊥ then
            tsc ← abort
    ▷ post-commit
    for all keys k ∈ write-set do
        if tsc = abort then ds.remove(k, tsr)
        else ds.put(k, tsr, commit, tsc)
    CT.remove(tsr)                               ▷ garbage-collect CT entry

If the commit cell of the record is empty (commit=nil), then we do not know whether the transaction that wrote it is still pending or simply not post-committed. Since read cannot return without determining its final commit timestamp, it must forcefully abort it in case it is still pending, and otherwise discover its outcome. To this end, read first refers to the CT. If there is no CT entry associated with the transaction (tsc=nil), read attempts to create an entry with ⊥ as the commit timestamp. This is done using an atomic check&mutate operation, due to a possible race with a commit operation; if a competing commit attempt finds the ⊥ entry, it aborts. Note that it is possible to wait a configurable amount of time before attempting to create a ⊥ entry, in order to allow the transaction to complete without aborting it.

There is, however, a subtle race to consider in case the commit record is set to ⊥ by a read operation. Consider a slow read operation that reads the data record rec when its writing transaction is still pending, and then stalls for a while. In the interim, the writing transaction successfully commits, updates the data record during post-commit, and finally garbage-collects its CT entry. At this point the slow reader wakes up and does not find the commit entry in the CT. It then creates a ⊥ entry even though the transaction had successfully committed. To discover this race, we have a read operation that either creates or finds a ⊥ entry in the CT re-read the data record, and if it does contain a commit entry, use it.

Commit. The client first sends a commit request to the TM. If the TM detects conflicts then the transaction aborts, and otherwise the TM provides the transaction with its commit timestamp tsc. The client then proceeds to commit the transaction, provided that no read had forcefully aborted it. To ensure the latter, the client uses an atomic check&mutate to create the commit entry.

To avoid an extra read of the CT on every transactional read, once a transaction is committed, the post-commit stage writes the transaction's tsc to the commit columns of all keys in the transaction's write-set. In case of an abort, it removes the pending records. Finally, it garbage-collects the transaction's CT entry.

3.2.1 High availability

The TM is implemented as a primary-backup process pair to ensure its high availability. The backup detects the primary's failure using timeouts. When it detects a failure, it immediately begins serving new transactions, without any recovery procedure. Because the backup may falsely suspect the primary due to unexpected network or processing delays, we take precautions to avoid correctness violations in cases when both TMs act as primaries at the same time.

To this end, we maintain two shared objects, managed in Apache ZooKeeper and accessed infrequently. The first is maxTS, which is the maximum timestamp an active TM is allowed to return to its clients. An active (primary) TM periodically allocates a new epoch of timestamps that it may return to clients by atomically increasing maxTS using an atomic compare-and-swap (CAS) operation. Following failover, the new primary allocates a new epoch for itself, and thus all the timestamps it returns to clients exceed those returned by the previous primary.

The second shared object is a locally-checkable lease, which is essentially a limited-time lock. The lease is renewed for some parameter δ time once 80% of this time elapses. As with locks, at most one TM may hold the lease at a given time. This ensures that no client will be able to commit a transaction in an old epoch after the new TM has started using a new one.
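A minimal sketch of the epoch-claiming step follows. It is our illustration only: an in-memory AtomicLong stands in for the ZooKeeper-managed maxTS object, the epoch size is an arbitrary illustrative constant, and lease handling is omitted.

// Illustrative epoch allocation for TM failover (our sketch; in Omid, maxTS is a
// ZooKeeper-managed object, represented here by an AtomicLong stand-in).
import java.util.concurrent.atomic.AtomicLong;

final class EpochAllocator {
    private static final long RANGE = 1L << 20;  // timestamps claimed per epoch (illustrative)
    private final AtomicLong maxTS;              // shared upper bound on usable timestamps

    EpochAllocator(AtomicLong sharedMaxTS) { this.maxTS = sharedMaxTS; }

    /**
     * Claims the next epoch [current, current + RANGE) for this TM. A CAS failure
     * means another TM (e.g., a newly promoted primary) claimed it first, so this
     * TM must halt rather than hand out timestamps from a stale epoch.
     */
    long claimNextEpoch() {
        long current = maxTS.get();
        if (!maxTS.compareAndSet(current, current + RANGE)) {
            throw new IllegalStateException("lost the race for maxTS: halt this TM");
        }
        return current;                          // first timestamp this TM may hand out
    }
}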

4. FAST-PATH TRANSACTIONS

The goal of our fast path is to forgo the overhead associated with communicating with the TM to begin and commit transactions. This is particularly important for short transactions, where the begin and commit overhead is not amortized across many operations. We therefore focus on single-key transactions.

To this end, we introduce in Section 4.1 a streamlined fast path (FP) API that jointly executes multiple API calls of the original TPS. We proceed, in Section 4.2, to explain a high-level general fast path algorithm for any system that follows the generic schema of Algorithm 1 above. Finally, in Section 4.3, we describe our implementation of the fast path in Omid FP, and important practical optimizations we applied in this context.

4.1 API

For brevity, we refer to the TPS's API calls begin, read, write, and commit as b, r, w, and c respectively, and we combine them to allow fast processing. The basic FP transactions are singletons, i.e., transactions that perform a single read or write. These are supported by the APIs:

brc(key) – begins an FP transaction, reads key within it, and commits.

bwc(key, val) – begins an FP transaction, writes val into a new version of key that exceeds all existing ones, and commits.

We further support a fast path transaction consisting of a read and a dependent write, via a pair of API calls:

br(key) – begins an FP transaction and reads the latest version of key. Returns the read value along with a version ver.

wc(ver, key, val) – validates that key has not been written since the br call that returned ver, writes val into a new version of key, and commits.

Read-only transactions never abort, but bwc and wc may abort. If an FP transaction aborts, it can be retried either via the fast path again, or as a regular transaction.
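The snippet below illustrates how a client might use this API for a single-key read-modify-write, falling back on a retry when the fast path aborts. FastPathClient, VersionedValue, and AbortException are our illustrative stand-in names, not classes of Omid FP's client library.

// Illustrative use of the fast-path API for a single-key read-modify-write;
// all names here are our stand-ins, not Omid FP's actual classes.
interface FastPathClient {
    VersionedValue brc(String key) throws Exception;                    // begin + read + commit
    void bwc(String key, byte[] val) throws AbortException;             // begin + write + commit
    VersionedValue br(String key) throws Exception;                     // begin + read
    void wc(long ver, String key, byte[] val) throws AbortException;    // validate + write + commit
}
record VersionedValue(long version, byte[] value) {}
class AbortException extends Exception {}

final class CounterIncrement {
    /** Increments a single-key counter via the fast path. */
    static void increment(FastPathClient fp, String key) throws Exception {
        try {
            VersionedValue vv = fp.br(key);                    // read the latest committed version
            long n = java.nio.ByteBuffer.wrap(vv.value()).getLong();
            fp.wc(vv.version(), key,                           // commits only if key is unchanged
                  java.nio.ByteBuffer.allocate(8).putLong(n + 1).array());
        } catch (AbortException e) {
            // conflict with a concurrent writer: retry via the fast path,
            // or fall back to a regular begin/read/write/commit transaction
        }
    }
}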

4.2 Generic fast path algorithm

Algorithm 4 Generic support for FP transactions; each data store operation is executed atomically.

Client-side logic:

 1: procedure brc(key)
 2:     rec ← ds.get(last committed record of key)     ▷ no preGet
 3:     return rec.value
 4: procedure bwc(key, value)
 5:     return ds.putVersion(∞, key, value)
 6: procedure br(key)
 7:     rec ← ds.get(last committed record of key)     ▷ no preGet
 8:     return 〈rec.version, rec.value〉
 9: procedure wc(ver, key, value)
10:     return ds.putVersion(ver, key, value)

Data store (stored procedure) logic:

11: procedure putVersion(old, key, value)              ▷ used by FP writes (bwc and wc)
12:     if key has no tentative version ∧
13:        last committed version of key ≤ old then
14:         ver ← F&I(key's maxVersion) + 1
15:         ds.put(key, value, ver, ver)                ▷ no prePut
16:         return commit
17:     else
18:         return abort
19: procedure preGet(tsr, key)
        ▷ executed atomically with ds.get call, once per key
20:     bump(key's maxVersion, tsr)
21: procedure prePut(key, val, tsr, nil)
        ▷ checked atomically before put (by transactional write)
22:     if key has a committed version that exceeds tsr then
23:         abort
24: procedure preUpdate(key, tsr, commit, tsc)
        ▷ executed atomically with ds.update call (by post-commit)
25:     bump(key's maxVersion, tsc)

The generic fast path algorithm, which we present in Algorithm 4, consists of two parts: client-side logic, and logic running within the underlying data store. The latter is implemented as a stored procedure. Specifically, it supports a new flavor of put, putVersion, which is used by singleton writes, and it extends the put, get, and update APIs used by regular transactions with additional logic. The new logic is presented in the preGet, prePut, and preUpdate procedures, which are executed before, and atomically with, the corresponding data store calls.

Singleton reads (line 1) simply return the value associated with the latest committed version of the requested key they encounter. They ignore tentative versions, which may lead to missing the latest commit in case its post-commit did not complete; this is allowed by our semantics. FP reads can forgo the begin call since they do not need to obtain a snapshot time a priori. They can also forgo the commit call, since they perform a single read, and hence their 'snapshot' is trivially valid.

In case bwc encounters a tentative version, it does not try to resolve it, but rather simply aborts. This may cause false aborts in case the transaction that wrote the tentative version has committed and did not complete post-commit, as well as in the case that it will eventually abort. In general, this mechanism prioritizes regular transactions over FP ones. We choose this approach since the goal is to complete FP transactions quickly, and if an FP transaction cannot complete quickly, it might as well be retried as a regular one.

Such a singleton write has two additional concerns: (1) it needs to produce a new version number that exceeds all committed ones and is smaller than any commit timestamp that will be assigned to a regular transaction in the future; and (2) it needs to make sure that conflicts with regular transactions are detected.

To handle these concerns, we maintain the timestamps as two-component structures, consisting of a global version and a locally advancing sequence number. In practice, we implement the two components in one long integer, with some number ℓ of least significant bits reserved for sequence numbers assigned by FP writes (in our implementation, ℓ = 20). The most significant bits represent the global version set by the TM. The latter increases the global clock by 2^ℓ upon every begin and commit request.

To support (1), a bwc transaction reads the object's latest committed version and increments it. The increment is done provided that it does not overflow the sequence number. In the rare case when the lower ℓ bits are all ones, the global clock must be incremented, and so the FP transaction aborts and is retried as a regular one.
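The encoding can be pictured as follows. This is our sketch of the scheme with ℓ = 20 as in the paper; the class and method names are illustrative, not Omid FP's.

// Illustrative two-component version encoding: the high bits hold the global
// version set by the TM, the low ℓ bits a sequence number advanced by FP writes.
final class Versions {
    static final int SEQ_BITS = 20;                       // ℓ in the paper
    static final long SEQ_MASK = (1L << SEQ_BITS) - 1;

    static long globalPart(long version) { return version & ~SEQ_MASK; }
    static long seqPart(long version)    { return version & SEQ_MASK; }

    // The TM advances its global clock by 2^ℓ per begin/commit, e.g.:
    //   globalClock.addAndGet(1L << SEQ_BITS);

    /**
     * Next version for a fast-path write over the latest committed version.
     * Throws when the sequence number would wrap, in which case the FP
     * transaction aborts and is retried as a regular one.
     */
    static long nextFastPathVersion(long latestCommitted) {
        if (seqPart(latestCommitted) == SEQ_MASK) {
            throw new IllegalStateException("sequence overflow: retry as a regular transaction");
        }
        return latestCommitted + 1;
    }
}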

It is important to note that the singleton write needs to atomically find the latest version and produce a new one that exceeds it, to make sure that no other transaction creates a newer version in the interim. This is done by a new putVersion function implemented in code that resides at the data store level. In Section 4.3 below, we explain how we implement such atomic conditional updates in an HBase coprocessor as part of Omid FP. The first parameter to putVersion is an upper bound on the key's current committed version; since a singleton write imposes no constraints on the object's current version, its upper bound is ∞.

Next, we address (2) – conflicts between FP and regular transactions. In case an ongoing regular transaction writes to a key before bwc accesses it, bwc finds the tentative write and aborts.

It therefore remains to consider the case that a regular transaction T1 writes to some key after FP transaction FP1, but T1 must abort because it reads the old version of the key before FP1's update. This scenario is illustrated in Figure 3. Note that in this scenario it is not possible to move FP1 to 'the past' because of the read.

Figure 3: Conflict between FP transaction FP1 and regular transaction T1.

In order for T1 to detect this conflict, the version written by FP1 has to exceed T1's snapshot time, i.e., tsr. To this end, we maintain a new field maxVersion for each key, which is at least as high as the key's latest committed version. The data store needs to support two atomic operations for updating maxVersion. The first is fetch-and-increment, F&I, which increments maxVersion and returns its old value; F&I throws an abort exception in case of wrap-around of the sequence number part of the version. The second operation, bump, takes a new version as a parameter and sets maxVersion to the maximum between this parameter and its old value.

Singleton writes increment the version using F&I (line 14), and the post-commit of transactional writes (line 24) bumps it to reflect the highest committed version. Every transactional read bumps the key's maxVersion to the reading transaction's tsr (line 19); transactional writes (line 21) are modified to check for conflicts, namely, new committed versions exceeding their tsr.

In the example of Figure 3, T1's read bumps x's maxVersion to its tsr, and so FP1, which increments x's maxVersion, writes 1 with a version that exceeds tsr. Thus, T1's write detects the conflict on x.

Note that this modification of transactional writes incurs an extra cost on regular (non-FP) transactions, which we quantify empirically in Section 5.

The br and wc operations are similar to brc and bwc, respectively, except that wc uses the version read by br as its upper bound in order to detect conflicting writes that occur between its two calls.

4.3 Implementation and optimization

Associating a maxVersion field with each key is wasteful, both in terms of space, and in terms of the number of updates this field undergoes. Instead, when implementing support for Omid FP's fast path in HBase, we aggregate the maxVersions of many keys in a single variable, which we call the Local Version Clock (LVC).

Our implementation uses one LVC in each region server. Using a shared LVC reduces the number of updates: a transactional read modifies the LVC only if its tsr exceeds it. In particular, a transaction with multiple reads in the same region server needs to bump it only once.

We implement the two required atomic methods, F&I and bump, on the LVC using atomic hardware operations (F&I and CAS, respectively). The HBase coprocessor mechanism enforces atomic execution of the stored code blocks by holding a lock on the affected key for the duration of the operation. Thus, putVersion executes as an atomic block, and calls the LVC's F&I method inside this block. Similarly, the calls to bump from preGet and preUpdate execute inside an atomic block with the ensuing get and update, respectively.

Note that although the coprocessor only holds a lock on the affected key, the joint update of the key and the LVC is consistent because it uses a form of two-phase locking: when the stored procedure begins, it locks the key, then the atomic access to the LVC effectively locks the LVC during its update; this is followed by an update of the key's record and the key's lock being released.

The LVC is kept in memory, and is not persisted. Migration of region control across region servers, which HBase performs to balance load and handle server crashes, must be handled by the LVC management. In both cases, we need to ensure that the monotonicity of the LVC associated with the region is preserved. To this end, when a region is started in a new server (following migration or recovery), we force the first operation accessing it in the new server to access the TM to increment the global clock, and then bump the local clock to the value of the global clock. Since the LVC value can never exceed the global clock's value, this bootstrapping procedure maintains its monotonicity.

5. EVALUATION

We describe our methodology and experiment setup in Section 5.1, and present our results in Section 5.2.

5.1 Methodology

5.1.1 Evaluated systems

We now evaluate our protocol improvements to Omid. We evaluate Omid LL as described in Section 3 and Omid FP as described in Section 4. Our baseline is the legacy Omid implementation [40]. We also compare the systems to Omid 2PC, which uses distributed conflict resolution. Omid 2PC defers writes to commit time (as Percolator does), whence it writes and validates keys using check&mutate. In Omid FP, single-key transactions use the FP API, whereas longer transactions use the standard API. The other systems use the standard API only.

Note that both CockroachDB and Percolator use 2PC-based conflict resolution, so our Omid 2PC mimics their protocols. Direct comparison with the actual systems is not feasible because Percolator is proprietary and CockroachDB supports SQL transactions over a multi-tier architecture using various components that are incompatible with HBase. We also do not compare against Omid1 and Tephra, since Omid significantly outperforms both [40].

To reduce latency, we configure the TPSs to perform the post-commit phase asynchronously, by a separate thread, after the transaction completes.

Note that translytics requires many short transactions and long scans. In this paper we improve the performance of short transactions, while incurring few aborts. The latter implies that scans do not spend much time forcing aborts, and therefore, their performance is the same as in vanilla Omid.


5.1.2 Experiment setup

Our experiment testbed consists of nine 12-core Intel Xeon 5 machines with 46GB RAM and 4TB SSD storage, interconnected by 10G Ethernet. We allocate three of these to HBase nodes, one to the TM, one to emulate the client whose performance we measure, and four more to simulate background traffic as explained below. Each HBase node runs both an HBase region server and the underlying Hadoop File System (HDFS) server within 8GB JVM containers.

Note that the only scalability bottleneck in the tested systems is the centralized TM. HBase, on the other hand, can scale horizontally across thousands of nodes, each of which processes a small fraction of the total load. Since each node typically serves millions of keys, data access rates remain load-balanced across nodes even when access to individual keys is highly skewed. And since read/write requests are processed independently by each HBase node, their performance remains constant as the workload and system size are scaled by the same factor.

Thus, to understand the system's performance at scale, we can run transactions over a small HBase cluster with an appropriately-scaled load, but need to stress the TM as a bigger deployment would. We do this at a ratio of 3:1000; that is, we run transactions on a 3-node HBase cluster and load the TM with a begin/commit request rate that would arise in a 1000-node HBase cluster with the same per-node load. For example, to test the client's latency at 100K tps, we have the TM handle 100K tps, and have an HBase deployment of three nodes handle read/write requests of 300 tps. As explained above, the HBase latency at 100K tps with 1000 servers would be roughly the same as in this deployment.

We use four machines to generate the background load on the TM using a custom tool [40] that asynchronously posts begin and commit requests on the wire and collects the TM's responses. We note that although in this experiment the TM maintains a small number of client connections (serving many requests per connection), the number in a true 1000-node system still falls well within the OS limit, hence no real bottleneck is ignored.

We measure the end-to-end client-incurred latency on a single client node that generates transactions over the 3-server HBase cluster. Note that the client also generates begin/commit requests, which account for ∼0.3% of the TM's load. The client runs the popular YCSB benchmark [25], exercising the synchronous transaction processing API in a varying number of threads.

5.1.3 Workload

Our test cluster stores approximately 23M keys (∼7M keys per node). The values are 2KB in size, yielding roughly 46GB of actual data, replicated three-way in HDFS. The keys are hash-partitioned across the servers. The data accesses are 50% reads and 50% writes. The key access frequencies follow a Zipf distribution, generated following the description in [36], with θ = 0.8, which yields the abort rate reported in the production deployment of Omid [40]. In this distribution, the first key is accessed 0.7% of the time.

We test the system with two transaction mixes:

Random mix – transaction sizes (number of reads and writes) follow a Zipf distribution with θ = 0.99, with a cutoff at 10. With these parameters, 63% of the transactions access three keys or less, and only 3% access 10 keys. We vary the system load from 30K to 500K transactions per second (tps).

BRWC – 80% of the transactions are drawn from the random mix distribution, and 20% perform a read and then a write to the same key.

We add the BRWC workload since single-key read+write transactions are common in production, but are highly unlikely to occur in our random mix, which uses random key selection with billions of keys.
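For readers unfamiliar with this kind of skewed key generator, the sketch below shows a standard constant-time Zipf sampler in the style of [36] (and of YCSB's zipfian generator). It is our illustration, not the benchmark code used in the experiments; the class name is hypothetical.

// Illustrative Zipf(theta) key sampler in the style of Gray et al. [36];
// not the actual load generator used in the paper's experiments.
import java.util.concurrent.ThreadLocalRandom;

final class ZipfKeyChooser {
    private final long n;
    private final double theta, alpha, zetan, eta;

    ZipfKeyChooser(long n, double theta) {       // e.g., n = 23_000_000, theta = 0.8
        this.n = n;
        this.theta = theta;
        this.zetan = zeta(n, theta);
        this.alpha = 1.0 / (1.0 - theta);
        this.eta = (1 - Math.pow(2.0 / n, 1 - theta)) / (1 - zeta(2, theta) / zetan);
    }

    private static double zeta(long n, double theta) {
        double sum = 0;
        for (long i = 1; i <= n; i++) sum += 1.0 / Math.pow(i, theta);
        return sum;
    }

    /** Returns a key index in [0, n); index 0 is the most popular key (~0.7% for theta=0.8, n=23M). */
    long next() {
        double u = ThreadLocalRandom.current().nextDouble();
        double uz = u * zetan;
        if (uz < 1.0) return 0;
        if (uz < 1.0 + Math.pow(0.5, theta)) return 1;
        return (long) (n * Math.pow(eta * u - eta + 1, alpha));
    }
}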

Figure 4: Throughput vs. latency, transaction size = 1. (Axes: throughput in tps × 1000; latency in msec. Series: Omid, Omid LL.)

5.2 Results

Throughput and latency of short transactions.Omid FP is motivated by the prevalence of short transac-tions in production, and is designed with the goal of accel-erating such transactions. Its advantage is less pronouncedfor long transactions, where the cost of begin and commit isamortized across many reads and writes. To make the com-parison meaningful, we classify transactions by their lengthsand whether they access a single key, and study each trans-action class separately.

We begin with short transactions. Figure 4 presents the average latency of single-key transactions run as part of the random mix, as a function of system throughput. Figure 5a then zooms in on the latency of such transactions under a throughput of 100K tps, and breaks up the different factors contributing to it. Figure 5b presents a similar breakdown under a high load of 500K tps; Omid is not included since it does not sustain such high throughput.

As we can see, under light load, Omid LL and Omid 2PC improve the latency of Omid by 4x to 5x. This is because in Omid, both begin and commit wait for preceding transactions to complete the writes of their commit entries; this stems from Omid's design choice to avoid the need for resolving pending write intents by aborting transactions; see the penultimate column in Table 1. Single-key writes suffer from both the begin and commit latencies, whereas single-key reads suffer only from begins (Figure 5a). Omid 2PC has longer commit latencies and shorter write latencies because it defers writes to commit time.

As load increases, Omid suffers from a well-pronounced bottleneck, and its latency at 250K tps is doubled, whereas the other systems are unaffected. The extra delay in Omid is due to batching of commit record updates, which its TM applies to handle congestion [40].

Under low load, Omid LL is slightly faster than Omid 2PC (due to the latter's use of atomic check&mutate). But under high load, Omid 2PC is slightly better (10% faster on average for single-key transactions) due to the centralized conflict analysis becoming a bottleneck in Omid LL.

Figure 5: Latency breakdown for single-key transactions under the random mix workload: (a) low load (100K tps), comparing Omid, Omid LL, Omid 2PC, and Omid FP; (b) high load (500K tps), comparing Omid LL, Omid 2PC, and Omid FP. (Bars show Write, Read, and BRWC transactions, broken down into begin, read/write, and commit components; latency in msec.)

The FP API delivers better performance for this traffic. For instance, under low load (Figure 5a), single writes take an average of 2.4ms using the bwc API versus 5.7ms in Omid LL (and 5.9ms in Omid 2PC). For comparison, a native HBase write takes roughly 2ms under this load. A single read executed using brc takes 1.5ms, which is the average latency of a native HBase read, versus 2.5ms as a regular transaction in Omid LL (2.7ms in Omid 2PC). For transactions that read and write a single key as part of the BRWC workload, the fast path implementation (consisting of br and wc calls) completes within 4ms, versus 6.5ms for Omid LL, 7.1ms for Omid 2PC, and 37.6ms for Omid. Under high load (Figure 5b), the fast path is even more beneficial: it reduces the latency of both read and write by more than a factor of 2.
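To make the use of these calls concrete, the following sketch shows how a client might exercise the single-key fast path operations named above (bwc, brc, br, wc). The FastPathClient and FastPathTx interfaces and their signatures are assumptions for illustration; they are not the actual Omid FP API.

    import org.apache.hadoop.hbase.util.Bytes;

    // Assumed single-key fast path interface (illustrative only).
    interface FastPathClient {
        void bwc(byte[] row, byte[] family, byte[] qualifier, byte[] value)
                throws FastPathAbortException;                    // begin + write + commit
        byte[] brc(byte[] row, byte[] family, byte[] qualifier);  // begin + read + commit
        FastPathTx br(byte[] row, byte[] family, byte[] qualifier); // begin + read
    }

    interface FastPathTx {
        byte[] value();                                           // value observed by br
        void wc(byte[] value) throws FastPathAbortException;      // write + commit
    }

    class FastPathAbortException extends Exception {}

    // Example: a BRWC-style read-modify-write of a single counter cell.
    class CounterExample {
        static void increment(FastPathClient fp, byte[] row, byte[] cf, byte[] q)
                throws FastPathAbortException {
            FastPathTx tx = fp.br(row, cf, q);        // begin + read on the fast path
            long next = Bytes.toLong(tx.value()) + 1;
            tx.wc(Bytes.toBytes(next));               // write + commit; may abort on a conflict
        }
    }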

Long transactions. We now examine longer transactions run as part of the random mix. Figure 6 shows the results for transactions of lengths 5 and 10. We see that the absolute latency gap between the new systems and Omid remains similar, but is amortized by other operations. Omid's control requests (begin and commit) continue to dominate the delay, and comprise 68% of the latency of 10-access transactions (under low load). In contrast, the transaction latency of Omid LL (and Omid FP) is dominated by data access, as only 13% (resp. 18%) of the time is spent on the control path. Omid 2PC spends 66% of its time executing commit because it defers writes to commit time, but this is offset by a shorter read/write execution time.

The FP mechanism takes a toll on the data path, which uses atomic check&mutate operations instead of simple writes. This is exacerbated for long transactions. For example, a 10-access transaction takes 24.8ms with Omid FP, versus 21.7ms with Omid LL. The performance of Omid 2PC with long transactions is similar to that of Omid FP – e.g., 25.6ms for 10-access transactions – because it also uses check&mutate operations during the commit validation phase.

Figure 7 summarizes the tradeoffs entailed by Omid FP relative to Omid LL for the different transaction classes. We see that under low load (Figure 7a), the speedup for single-write transactions is 2.3x, whereas the worst slowdown is 13%. In systems geared towards real-time processing, this is a reasonable tradeoff, since long transactions are infrequent and less sensitive to extra delay. Under high load (Figure 7b), the fast path speedup for writes is 2.7x.

Abort rates. We note that Omid FP yields slightly higher rates of transaction aborts compared to Omid (recall that Omid LL aborts tentative writes in favor of concurrent reads, whereas Omid FP also aborts singleton writes in the presence of concurrent tentative writes). However, the abort rates exhibited by all the systems are minor. Here, under the highest contention, Omid FP aborts approximately 0.1% of the transactions versus Omid LL's 0.08%, Omid 2PC's 0.08%, and Omid's 0.07%.

6. SQL-ORIENTED FEATURES

Comprehensive SQL support in Phoenix involved functionality extensions as well as performance optimizations. Performance-wise, Phoenix extensively employs stored procedures implemented as HBase coprocessors in order to eliminate the overhead of multiple round trips to the data store. We integrated Omid's code within such HBase-resident procedures. For example, Phoenix coprocessors invoke transactional reads, and so we implemented Omid's transactional read – the loop implementing read in Algorithm 3 – as a coprocessor as well. This allowed for a smooth integration with Phoenix and also reduced the overhead of transactional reads when multiple versions are present.

Functionality-wise, we added support for on-the-fly construction of secondary indexes (see Section 6.1) and extended Omid's SI model to allow for multiple read- and write-points, as required in some Phoenix applications (see Section 6.2).

6.1 Index construction

A secondary index in SQL is an auxiliary table that provides fast access to data in a table by a key that is different from the table's primary key. This need often emerges in analytics scenarios, in which data is accessed by multiple dimensions. Typically, the secondary key serves as the primary key of the secondary index, and is associated with a unique reference into the base table (e.g., primary key + timestamp). SQL query optimizers exploit secondary indexes in order to produce efficient query execution plans. Query speed is therefore at odds with update speed, since every write to a table triggers writes to all its indexes.

The SQL standard allows creating indexes on demand. When a user issues a Create Index command, the database (1) populates the new index table with historic data from the base table, and (2) installs a coprocessor to augment every new write to the base table with a write to the index table. It is desirable to allow temporal overlap between the two, in order to avoid stalling writes while the index is being populated.

Figure 6: Latency vs. throughput and latency breakdown for long transactions in the random mix workload: (a) throughput (tps × 1000) vs. latency (msec) for transactions of size 10, comparing Omid and Omid LL; (b) latency breakdown (begin, read/write, commit) for transactions of sizes 5 and 10, comparing Omid, Omid LL, HBase, and Omid FP.

Figure 7: Latency speedup with the fast path API in Omid FP, for Write, Read, BRWC, and transactions of sizes 5 and 10: (a) low load (100K tps); (b) high load (500K tps).

We exploit Omid's SI semantics in order to create such indexes. To achieve (1), Phoenix invokes a transaction that scans a snapshot of the base table and streams the data into the index table. This way, historic data is captured without blocking concurrent puts. Phoenix has two types of indexes: global (sharded independently of the base table) and local (co-sharded with the base table). In the latter case, the index-building transaction is pushed into a coprocessor; this local distributed execution speeds up the process significantly.

Once the bulk population completes, the index can become available to queries. To achieve (2), the database creates a coprocessor that augments all transactions that update the base table with an additional update of the secondary index.

In order to guarantee the new index's consistency with respect to the base table, the snapshot creation and the trigger setup must be atomic. In other words, all writes beyond the snapshot timestamp must be handled by the coprocessor. Omid achieves this through a new fence API implemented by the TM, which is invoked when the coprocessor is installed. A fence call produces a new fence timestamp by fetching-and-incrementing the TM's clock, and records the fence timestamp in the TM's local state. Subsequently, the TM aborts every transaction whose read timestamp precedes the fence's timestamp and that attempts to commit after it. Note that even though building an index can take hours, the fence operation is an atomic operation that occurs at index creation. Therefore, the fence only forces aborts of transactions that are active at the point in time when the fence is created and that write to the fence's table. In practice, this amounts to around 10 transactions, and even if transactions are long, it is still fewer than 100 transactions. Moreover, there is at most one fence timestamp per index at a given time.
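As a rough illustration of this mechanism, the following is a minimal sketch of TM-side fence bookkeeping under the description above; the class, method names, and data structures are hypothetical and do not reflect the actual Omid implementation.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class FenceManager {
        private final AtomicLong clock = new AtomicLong();                  // TM timestamp counter
        private final Map<String, Long> fences = new ConcurrentHashMap<>(); // table -> fence timestamp

        // Invoked when the index-maintenance coprocessor is installed on a table:
        // fetch-and-increment the TM clock and record the fence timestamp.
        public long fence(String table) {
            long fenceTs = clock.incrementAndGet();
            fences.put(table, fenceTs);   // at most one fence per table at a given time
            return fenceTs;
        }

        // Commit-time check: a transaction that began before the fence and writes
        // to the fenced table must abort.
        public boolean passesFences(long readTs, Iterable<String> tablesWritten) {
            for (String table : tablesWritten) {
                Long fenceTs = fences.get(table);
                if (fenceTs != null && readTs < fenceTs) {
                    return false;
                }
            }
            return true;
        }
    }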

Neither the bulk index population nor its incremental update requires write conflict detection among the index keys, for different reasons. The former does not contend with any other transaction, and hence is committed automatically – the commit cells are created simultaneously with the rest of the index data. The latter is made unnecessary by the TM detecting conflicts at the base table upon commit. Hence, in transactions augmented with secondary index updates, there is no need to embed the added index keys in the commit request, and so the load on the TM does not increase. Omid provides extensions to its put API for this scenario.

6.2 Extended snapshot semantics

Some Phoenix applications require semantics that deviate from the standard SI model in that a transaction does not see (all of) its own writes. We give two examples.

Checkpoints and snapshots. Consider a social networking application that stores its adjacency graph as a table of neighbor pairs. The transitive closure of this graph is computed in multiple iterations. Each iteration scans the table, computes new edges to add to the graph, and inserts them back into the table. Other functions like PageRank are implemented similarly. Such programs give rise to the pattern given in Figure 8.

Note that the SQL statement loops over entries of T. Normally, the SI model enforces read-your-own-writes semantics. However, in this case the desirable semantics are that the reads only see data that existed prior to the current statement's execution. This isolation level is called snapshot isolation exclude-current (SIX).

To support this behavior, we implement in Omid two new methods, snapshot and checkpoint. Following a checkpoint, we move to the SIX level, where reads see the old snapshot (i.e., updates that precede the checkpoint call) and writes occur in the next snapshot, which is not yet visible. The snapshot call essentially resets the semantics back to SI, making all previous writes visible. Given these two calls, Phoenix translates SQL statements as above to the loop in Figure 8 – it precedes the evaluation of the nested SQL statement with a checkpoint, and follows it with a snapshot. Thus, each SQL statement sees the updates of the preceding iteration but not of the current one.

SQL program that requires SIX isolation:

    for iterations i = 1, . . . do
        insert into T as select func(T.rec) from T where ...

Calls generated by Phoenix for the SQL program:

    for iterations i = 1, . . . do
        checkpoint                              ▷ move to SIX isolation
        iterator ← T.scan(. . . )               ▷ scan in current snapshot
        for all rec ∈ iterator.getNext() do     ▷ parallel loop
            T.put(func(rec))                    ▷ write to next snapshot
        snapshot                                ▷ make all previous writes visible

Figure 8: SQL program requiring SIX isolation and corresponding Phoenix execution plan.

To support the SIX isolation level in Omid, the TM promotes its transaction timestamp counter tsr in leaps of some ∆ > 1 (rather than 1 as in the legacy system). The client manages two distinct local timestamps, τr for reads and τw for writes. By default, it uses τr = τw = tsr, which achieves the traditional SI behavior. Omid then has the new methods, snapshot and checkpoint, increment τr and τw, respectively. If the consistency level is set to SIX, it maintains τw = τr + 1 < tsr + ∆, thereby separating the reads from the writes. A transaction can generate ∆ − 1 snapshots without compromising correctness. By default, Omid uses ∆ = 50.
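A minimal sketch of the client-side bookkeeping this implies is given below, using illustrative names (readTs for τr, writeTs for τw, DELTA for ∆); this is not the actual Omid client code.

    public class SnapshotContext {
        static final long DELTA = 50;   // the TM allocates tsr in leaps of DELTA
        private final long tsr;         // transaction timestamp issued by the TM
        private long readTs;            // τr: reads observe versions <= readTs
        private long writeTs;           // τw: new writes are stamped with writeTs

        public SnapshotContext(long tsr) {
            this.tsr = tsr;
            this.readTs = tsr;          // default SI: τr = τw = tsr
            this.writeTs = tsr;
        }

        // checkpoint: move to SIX; writes go to the next, not-yet-visible snapshot.
        public void checkpoint() {
            writeTs = readTs + 1;       // τw = τr + 1
            checkLeap();
        }

        // snapshot: back to SI; all previous writes become visible to reads.
        public void snapshot() {
            readTs = writeTs;           // τr catches up with τw
            checkLeap();
        }

        private void checkLeap() {
            // At most DELTA - 1 snapshots per transaction, so that τw < tsr + DELTA
            // never collides with the next transaction's timestamp.
            if (writeTs >= tsr + DELTA) {
                throw new IllegalStateException("too many checkpoints in one transaction");
            }
        }
    }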

Snapshot-all. When a transaction involves multiple snapshots that update the same key, it is sometimes required to read all versions of the key that existed during the transaction. This is called the snapshot-all isolation level.

We use this isolation level, for example, when aborting a transaction that involves multiple updates to a key indexed by a secondary index. Consider a key k that is initially associated with a value v0, and a secondary index maps v0 to k. When a transaction updates k to take value v1, a tentative deletion marker is added to v0 in the secondary index while v1 is added. If the transaction then updates the same key (in an ensuing snapshot) to take the value v2, then v1 is marked as deleted and v2 is added, and so on. In case the transaction aborts, we need to roll back all deletions. But recall that index updates are not tracked in the write-set, and so we need to query the data store in order to discover the set of updates to roll back. We do this by querying k in the primary table with the snapshot-all isolation level, to obtain all values that were associated with k throughout the transaction, beginning with v0. We can then clean up the redundant entries in the secondary index, and remove the deletion marker from v0.
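As a rough illustration, the following sketch shows what such abort-time cleanup could look like under the description above; the snapshotAll call and the index-table operations are hypothetical names, not the actual Omid or Phoenix API.

    import java.util.List;

    class IndexAbortCleanup {
        interface PrimaryTable {
            // All values of `key` visible under snapshot-all: the pre-transaction
            // value v0 followed by every value written within the transaction.
            List<byte[]> snapshotAll(byte[] key, long startTs, long lastWriteTs);
        }

        interface IndexTable {
            void removeDeletionMarker(byte[] secondaryKey, byte[] primaryKey);
            void removeEntry(byte[] secondaryKey, byte[] primaryKey);
        }

        // On abort: discover the values v0, v1, ..., vk associated with `key`
        // during the transaction and undo the corresponding index updates.
        static void rollback(PrimaryTable base, IndexTable index,
                             byte[] key, long startTs, long lastWriteTs) {
            List<byte[]> values = base.snapshotAll(key, startTs, lastWriteTs);
            byte[] v0 = values.get(0);
            index.removeDeletionMarker(v0, key);          // v0 is committed; undo its tentative delete
            for (int i = 1; i < values.size(); i++) {
                index.removeEntry(values.get(i), key);    // v1..vk are tentative; remove their entries
            }
        }
    }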

7. RELATED WORK

The past few years have seen a growing interest in distributed transaction management [38, 27, 18, 20, 42, 33, 32]. Recently, many efforts have been dedicated to improving performance using advanced hardware trends like RDMA and HTM [43, 28, 29]. These efforts are, by and large, orthogonal to ours.

Our work follows the line of industrial systems, such as Google's Spanner [26], Megastore [19], Percolator [39], CockroachDB [8], and most recently Apache Phoenix [6]; the latter employs transaction processing services provided by either Tephra [14] or Omid [40]. Production systems commonly develop transaction processing engines on top of existing persistent highly-available data stores: for example, Megastore is layered on top of Bigtable [24], Warp [32] uses HyperDex [31], and CockroachDB [8] uses RocksDB [13]. Similarly to Omid, Omid FP is layered atop HBase [2].

As discussed in Section 3, a number of these systems follow a common paradigm with different design choices, and we choose a new operation point in this design space. In particular, Omid LL eliminates the bottleneck of Omid by distributing the commit entry, and makes commits and begins faster than Omid's by allowing reads to abort pending transactions. Unlike Percolator and CockroachDB, Omid LL uses scalable centralized conflict detection. This eliminates the need to modify HBase to support locking, and facilitates committing the algorithm back to the production system.

As a separate contribution, we developed a fast path for single-key transactions, which is applicable to any of the aforementioned systems, and embodied it in Omid FP. A similar mechanism was previously developed in Mediator [23] for the earlier generation of Omid [34]. Mediator focused on reconciling conflicts between transactions and native atomic operations rather than adding an FP API as we do here. As a consequence, Mediator is less efficient than our fast path, and moreover, its regular transactions can starve in case of contention with native operations.

Our implementation of the fast path uses a local version clock. Similar mechanisms were used in Mediator and in CockroachDB, which uses per-region Hybrid Logical Clocks [37] for distributed timestamp allocation.

8. CONCLUSION

As transaction processing services begin to be used in new application domains, low transaction latency and rich SQL semantics become important considerations. In addition, public cloud deployments necessitate solutions compatible with multi-tenancy. Motivated by such use cases, we evolved Omid for cloud use.

As part of this evolution, we improved Omid's protocol to reduce latency (by up to an order of magnitude) and improve throughput scalability. We further designed a generic fast path for single-key transactions, which executes them almost as fast as native HBase operations (in terms of both throughput and latency), while preserving transactional semantics relative to longer transactions. Our fast path algorithm is not Omid-specific, and can be similarly supported in other transaction management systems. Our implementation of the fast path in Omid FP can process single-key transactions at a virtually unbounded rate, and improves the latency of short transactions by another 3x–5x under high load.

We have extended Omid with functionalities required in SQL engines, namely secondary index construction and multi-snapshot semantics. We have successfully integrated it into the Apache Phoenix translytics engine.

Acknowledgments

We thank James Taylor for fruitful discussions.


9. REFERENCES

[1] Amazon EMR. https://aws.amazon.com/emr/.
[2] Apache HBase. http://hbase.apache.org.
[3] The Apache Hive data warehouse. https://hive.apache.org.
[4] Apache Impala. https://impala.apache.org.
[5] Apache Omid. https://omid.incubator.apache.org/.
[6] Apache Phoenix. https://phoenix.apache.org.
[7] Chatter. https://www.salesforce.com/eu/products/chatter/overview/.
[8] CockroachDB. https://github.com/cockroachdb/cockroach/blob/master/docs/design.md.
[9] Omid-82. https://issues.apache.org/jira/browse/OMID-82.
[10] Omid-90. https://issues.apache.org/jira/browse/OMID-90.
[11] OpenTSDB – the scalable time series database. http://opentsdb.net.
[12] Phoenix-3623. https://issues.apache.org/jira/browse/PHOENIX-3623.
[13] RocksDB. http://rocksdb.org/.
[14] Tephra: Transactions for Apache HBase. https://tephra.io.
[15] The Forrester Wave: Translytical Data Platforms, Q4 2017. https://reprints.forrester.com/#/assets/2/364/RES134282/reports.
[16] Who is using Phoenix? https://phoenix.apache.org/who_is_using.html.
[17] Yahoo Japan: Hortonworks customer profile. https://hortonworks.com/customers/yahoo-japan-corporation/.

[18] M. K. Aguilera, J. B. Leners, and M. Walfish. Yesquel: Scalable SQL storage for web applications. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 245–262, 2015.
[19] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Proceedings of the Conference on Innovative Data system Research (CIDR), pages 223–234, 2011.
[20] M. Balakrishnan, D. Malkhi, T. Wobber, M. Wu, V. Prabhakaran, M. Wei, J. D. Davis, S. Rao, T. Zou, and A. Zuck. Tango: Distributed data structures over a shared log. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 325–340, 2013.
[21] H. Berenson, P. A. Bernstein, J. Gray, J. Melton, E. J. O'Neil, and P. E. O'Neil. A critique of ANSI SQL isolation levels. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995, pages 1–10, 1995.
[22] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer. Apache Hadoop goes realtime at Facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 1071–1080, New York, NY, USA, 2011. ACM.
[23] E. Bortnikov, E. Hillel, and A. Sharov. Reconciling transactional and non-transactional operations in distributed key-value stores. In Proceedings of International Conference on Systems and Storage, SYSTOR 2014, pages 10:1–10:10, New York, NY, USA, 2014. ACM.
[24] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, June 2008.
[25] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 143–154, New York, NY, USA, 2010. ACM.
[26] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 261–264, 2012.
[27] J. Cowling and B. Liskov. Granola: Low-overhead distributed transaction coordination. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC '12, pages 21–21, 2012.
[28] A. Dragojevic, D. Narayanan, M. Castro, and O. Hodson. FaRM: Fast remote memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 401–414, Apr. 2014.
[29] A. Dragojevic, D. Narayanan, E. B. Nightingale, M. Renzelmann, A. Shamis, A. Badam, and M. Castro. No compromises: Distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 54–70, 2015.
[30] S. Elnikety, W. Zwaenepoel, and F. Pedone. Database replication using generalized snapshot isolation. In 24th IEEE Symposium on Reliable Distributed Systems (SRDS 2005), pages 73–84, 2005.
[31] R. Escriva, B. Wong, and E. G. Sirer. HyperDex: A distributed, searchable key-value store. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 25–36, 2012.
[32] R. Escriva, B. Wong, and E. G. Sirer. Warp: Lightweight multi-key transactions for key-value stores. CoRR, abs/1509.07815, 2015.


[33] I. Eyal, K. Birman, I. Keidar, and R. van Renesse. Ordering transactions with prediction in distributed object stores. In LADIS, 2013.
[34] D. G. Ferro, F. Junqueira, I. Kelly, B. Reed, and M. Yabandeh. Omid: Lock-free transactional support for distributed data stores. In IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, pages 676–687, 2014.
[35] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1992.
[36] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, SIGMOD '94, pages 243–252, New York, NY, USA, 1994. ACM.
[37] S. Kulkarni, M. Demirbas, D. Madeppa, B. Avva, and M. Leone. Logical physical clocks and consistent snapshots in globally distributed databases. 2014.
[38] S. Patterson, A. J. Elmore, F. Nawab, D. Agrawal, and A. E. Abbadi. Serializability, not serial: Concurrency control and availability in multi-datacenter datastores. PVLDB, 5(11):1459–1470, 2012.
[39] D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10), 2010.
[40] O. Shacham, F. Perez-Sorrosal, E. Bortnikov, E. Hillel, I. Keidar, I. Kelly, M. Morel, and S. Paranjpye. Omid, reloaded: Scalable and highly-available transaction processing. In 15th USENIX Conference on File and Storage Technologies (FAST), 2017.
[41] J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Littlefield, D. Menestrina, S. Ellner, J. Cieslewicz, I. Rae, T. Stancescu, and H. Apte. F1: A distributed SQL database that scales. PVLDB, 6(11):1068–1079, 2013.
[42] A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 1–12, 2012.
[43] X. Wei, J. Shi, Y. Chen, R. Chen, and H. Chen. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 87–104, 2015.