
QTLS: High-Performance TLS Asynchronous Offload Framework with Intel® QuickAssist Technology

Xiaokang Hu∗ (Shanghai Jiao Tong University / Intel Asia-Pacific R&D Ltd.), [email protected]
Changzheng Wei (Intel Asia-Pacific R&D Ltd.), [email protected]
Jian Li† (Shanghai Jiao Tong University), [email protected]
Brian Will (Intel Corporation, Chandler, Arizona, USA), [email protected]
Ping Yu, Lu Gong (Intel Asia-Pacific R&D Ltd.), {ping.yu,lu.gong}@intel.com
Haibing Guan (Shanghai Jiao Tong University), [email protected]

Abstract

Hardware accelerators are a promising solution to optimize the Total Cost of Ownership (TCO) of cloud datacenters. This paper targets the costly Transport Layer Security (TLS) and investigates the TLS acceleration for the widely-deployed event-driven TLS servers or terminators. Our study reveals an important fact: the straight offloading of TLS-involved crypto operations suffers from the frequent long-lasting blockings in the offload I/O, leading to the underutilization of both CPU and accelerator resources.

To achieve efficient TLS acceleration for the event-driven web architecture, we propose QTLS, a high-performance TLS asynchronous offload framework based on Intel® QuickAssist Technology (QAT). QTLS re-engineers the TLS software stack and divides the TLS offloading into four phases to eliminate blockings. Then, multiple crypto operations from different TLS connections can be offloaded concurrently in one process/thread, bringing a performance boost. Moreover, QTLS is built with a heuristic polling scheme to retrieve accelerator responses efficiently and timely, and a kernel-bypass notification scheme to avoid expensive switches between user mode and kernel mode while delivering async events. The comprehensive evaluation shows that QTLS can provide up to 9x connections per second (CPS) with TLS-RSA (2048-bit), 2x secure data transfer throughput and 85% reduction of average response time compared to the software baseline.

∗ Work done as a software engineer intern in Intel DCG/NPG
† Corresponding author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PPoPP '19, February 16–20, 2019, Washington, DC, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6225-2/19/02...$15.00
https://doi.org/10.1145/3293883.3295705

CCS Concepts

• Computer systems organization → Heterogeneous (hybrid) systems; • Networks → Security protocols; • Hardware → Hardware accelerators.

Keywords

SSL/TLS, crypto operations, event-driven web architecture, crypto accelerator, asynchronous offload

1 Introduction

Nowadays, cloud datacenters demand continuous performance enhancement to fulfill the requirements of cloud services that are becoming globalized and scaling rapidly [5, 12, 25]. Hardware accelerators, including GPU, FPGA and ASIC, are a promising solution to optimize the Total Cost of Ownership (TCO) as both energy consumption and computation cost can be reduced by offloading similar and recurring jobs [12, 23, 29, 31].

This paper targets an important type of datacenter workload: the SSL/TLS processing in various TLS servers or terminators [30, 56]. As the backbone protocols for Internet security, SSL/TLS have been employed (typically in the form of HTTPS) by 64.3% of the 137,502 most popular websites on the Internet as reported in November 2018 [34]. However, due to the continual involvement of crypto operations, particularly the costly asymmetric encryption, the TLS processing incurs significant resource consumption compared to an insecure implementation on the same platform [6, 23, 32, 59].

There have been some studies that made efforts to offload TLS-involved crypto operations to general-purpose accelerators, such as GPU, FPGA and the Intel® Xeon Phi™ processor [18, 22–24, 40, 43, 58, 59]. These studies mainly concentrated on the programming to enable efficient calculation of the needed crypto algorithms (e.g., RSA [55]), without paying much attention to the offload I/O. However, our study reveals that for the widely-deployed event-driven TLS servers or terminators (i.e., servers handling multiple concurrent connections in one process/thread, instead of one thread per connection), such as Nginx [36], HAProxy [7] and Squid [8], the straight offloading of crypto operations is not sufficiently competent for the TLS acceleration. The main challenge is


the frequent long-lasting blockings in the offload I/O, leading to (1) a large amount of CPU cycles spent waiting and (2) a low utilization of the parallel computation units inside the accelerator. As both CPU and accelerator resources are underutilized, the anticipated performance enhancement cannot be reached.

In this paper, we propose QTLS, a high-performance TLS asynchronous offload framework based on Intel® QuickAssist Technology (QAT) [20], to achieve efficient TLS acceleration for the event-driven web architecture. As a modern ASIC-based crypto acceleration solution, QAT is competitive in energy-efficiency and cost-performance compared to the general-purpose accelerators [5, 25]. Moreover, it provides an additional security layer for sensitive data (e.g., private keys), which can be securely protected inside the ASIC hardware and never exposed in memory [45].

To eliminate blockings in the offload I/O, QTLS re-engineers the TLS software stack to enable the asynchronous support for crypto operations in all the layers. The TLS offloading is divided into four phases: pre-processing, QAT response retrieval, async event notification and post-processing. In the pre-processing phase, the offload jobs are paused after crypto submission to return control to the application process. When QAT responses for crypto results are retrieved, the application process is notified by async events to resume the paused offload jobs and begin the post-processing phase. In this novel framework, CPU resources are fully utilized to handle concurrent connections. Multiple crypto operations from different TLS connections can be offloaded concurrently in one process/thread, which greatly increases the utilization of the parallel computation engines inside the QAT accelerator. To further enhance performance, QTLS is built with (1) a heuristic polling scheme that leverages the application-level knowledge to achieve efficient and timely QAT response retrieval, and (2) a kernel-bypass notification scheme that introduces an application-defined async queue to avoid expensive switches between user mode and kernel mode while delivering async events.

QTLS leverages the QAT accelerator to offload TLS-involved crypto operations. Its main idea is also applicable to other types of ASIC crypto accelerators. We have implemented QTLS based on Nginx [36] and OpenSSL [13]. Nginx is an event-driven high-performance HTTP(S) and reverse proxy server, used by 48.5% of the top one million websites [33]. OpenSSL is an SSL/TLS and crypto library, widely used on a great variety of systems [53].

The performance advantages of QTLS have been validated through extensive experiments that covered both TLS 1.2 and 1.3 protocols, both full and abbreviated (i.e., session resumption) handshakes, and multiple mainstream cipher suites. It is demonstrated that QTLS greatly improves the TLS handshake performance, achieving up to 9x connections per second (CPS) with TLS-RSA (2048-bit) over the software baseline. In addition, the secure data transfer throughput is enhanced by more than 2x and the average response time is reduced by nearly 85%.

Figure 1. Classic RSA-wrapped TLS processing (the client and server exchange Client Hello, Server Hello, Encrypted Premaster and Finished messages over a TCP connection; the handshake involves a server-side asymmetric-key calculation and multiple PRF operations for crypto key generation, followed by secure data transfer with a symmetric cipher + MAC).

In summary, this work makes the following contributions:

1. An important fact is revealed: for the event-driven TLS servers or terminators, the straight offloading of crypto operations suffers from the frequent long-lasting blockings in the offload I/O.

2. We propose and design the TLS asynchronous offload framework for the event-driven web architecture. Due to the enormous performance advantages, it has been adopted by a number of service providers, like Alibaba for e-commerce [41] and Wangsu for CDN [21].

3. A heuristic polling scheme and a kernel-bypass notification scheme are designed to further enhance the performance of QTLS.

4. We show that QTLS can be practically implemented with Nginx and OpenSSL, and evaluate its performance with extensive experiments.

2 Background and Motivation

This section first gives an overview of the TLS protocols and the event-driven web architecture. Then, we present background knowledge about Intel® QAT. Finally, we highlight the challenges with TLS offloading for the event-driven web architecture.

2.1 TLS Overview

Transport Layer Security (TLS) and its predecessor, Secure Sockets Layer (SSL), have been de-facto standards for Internet security for more than 20 years [57]. The TLS 1.0-1.2 protocols are currently dominant and the TLS 1.3 protocol has recently been ratified and published (as RFC 8446) by the IETF [37].

A TLS connection can be divided into two main phases [10]: (1) the handshake phase that determines crypto algorithms, performs authentications and negotiates shared keys for data encryption and message authentication code (MAC); (2) the secure data transfer phase that sends encrypted data over the established TLS connection. The classic RSA-wrapped TLS processing (i.e., the TLS-RSA cipher suite) in the TLS 1.2 protocol is illustrated in Figure 1.


Handshake: The RSA-wrapped handshake requires two network round trips to exchange Hello messages, encrypted Premaster and Finished messages, as shown in Figure 1. At the server side, one asymmetric-key calculation (RSA sign) is involved for authentication, which is the most computing-intensive part of the TLS processing [6, 23]. Besides, multiple pseudo random function (PRF) [54] operations are needed for key derivation. For other cipher suites in the TLS 1.2 protocol with forward secrecy [3], such as the ECDHE-RSA cipher suite [48], the handshake is more complicated and involves two more asymmetric elliptic-curve cryptography (ECC) [47] calculations at the server side. In the TLS 1.3 protocol, although one network round trip can be saved in the handshake phase, the involved crypto operations cannot be omitted [9]. Also, the enhanced security requires more key derivation operations with the new HMAC-based key derivation function (HKDF) [28]. Table 1 summarizes the number of server-side crypto operations needed to perform a full handshake.

Table 1. Server-side crypto operations for full handshake

TLS   Cipher Suite    RSA   ECC   PRF/HKDF
1.2   TLS-RSA          1     0     4
1.2   ECDHE-RSA        1     2     4
1.2   ECDHE-ECDSA      0     3     4
1.3   ECDHE-RSA        1     2     > 4

Secure data transfer: This phase is based on the negotiated crypto keys, including at least two cipher keys and two MAC keys. Each key pair is used for data encryption and MAC calculation in one direction. Prior to encryption and transmission, the data object is fragmented into units of 16 KB if it is larger than this.

Resource consumption: TLS imposes a much higher resource consumption over an insecure implementation due to the costly crypto operations [6, 23, 32, 59]. A recent study [35] reported that a modern CPU core, which is able to serve over 30K HTTP connections per second, cannot handle more than 0.5K TLS handshakes per second with the mainstream ECDHE-RSA (2048-bit) cipher suite. The overhead mainly originates from the heavy asymmetric-key calculations [6, 23]. As the requested web object increases in size, the computation overhead incurred by data encryption and MAC calculation becomes considerable as well. Performance testing [35] showed that when the request size increases from 1 KB to 100 KB, the throughput drop compared to the unencrypted case increases sharply from 45.7% to 85.4%.

Session resumption: TLS provides a shortcut named session resumption (with either session ID or session ticket) that avoids performing the full handshake each time [10]. Specifically, the asymmetric-key calculations can be skipped in the handshake phase on later connections from the same client. Although session resumption is effective in reducing the TLS resource consumption at the server side, the protection afforded by forward secrecy is compromised [42]. In real-life deployment, service providers restrict the lifetime of session IDs or tickets (generally less than an hour [42]) to achieve a trade-off between performance and security.

Figure 2. A QAT endpoint and its usage model (hardware-assisted request/response ring pairs are grouped into crypto instances; each crypto instance is assigned to a processor thread, and requests are load-balanced across the parallel computation engines).

2.2 Event-driven Web Architecture

There are two competitive architectures for web applications: one is based on threads, while the other is based on events [11]. Thread-based web applications (e.g., Apache [14]) initiate a new thread for each incoming connection. This suffers from costly thread switches and high system resource consumption under high traffic. In contrast, event-driven web applications (e.g., Nginx [36]) have the ability to handle thousands of concurrent connections in one process/thread, bringing significant advances in both performance and scalability [15, 44]. To align with the multi-core architecture, multiple worker processes can be launched simultaneously to handle incoming connections in a balanced manner.

To consolidate multiple connections into a single flow of execution, the event-driven web architecture works with network sockets in an asynchronous (non-blocking) mode and monitors them with an event-based I/O multiplexing mechanism (e.g., epoll [49] or kqueue [52]). In the so-called event loop, the application process waits for events on the monitored sockets. Once a queue of new events is returned, the corresponding event handlers are invoked one by one.
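As a minimal illustration (not taken from any particular server), such an event loop built on epoll might look like the following sketch; the conn_t type and its handler field are hypothetical names used only for this example.

```c
#include <sys/epoll.h>
#include <stdlib.h>

typedef struct conn {
    int fd;
    void (*handler)(struct conn *c);   /* current event handler for this connection */
} conn_t;

/* Minimal event loop: one process multiplexes many non-blocking sockets. */
void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev, events[512];

    conn_t *listener = malloc(sizeof(conn_t));
    listener->fd = listen_fd;
    listener->handler = NULL;          /* an accept handler would be registered here */

    ev.events = EPOLLIN;
    ev.data.ptr = listener;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        /* Wait for events on all monitored sockets. */
        int n = epoll_wait(epfd, events, 512, -1);
        for (int i = 0; i < n; i++) {
            conn_t *c = events[i].data.ptr;
            if (c->handler)
                c->handler(c);         /* invoke the handler registered for this connection */
        }
    }
}
```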

2.3 Intel® QuickAssist Technology

Intel® QuickAssist Technology (QAT) is an ASIC-based solution to enhance security and compression performance for cloud, networking, big data and storage applications [20]. Both a kernel driver and a userspace driver (implemented as a library) are available to meet different requirements.

Usage model: As illustrated in Figure 2, a QAT endpoint possesses multiple parallel computation engines and a number of hardware-assisted request/response ring pairs. Software writes requests onto a request ring and reads responses back from a response ring. The QAT hardware load-balances requests from all rings across all available computation engines. The availability of responses can be indicated using either interrupts or polling.


Several request/response ring pairs (for different crypto types) are grouped into a QAT crypto instance, which is actually a logical unit that can be assigned to a process/thread running on a CPU core. A modern QAT endpoint can support up to 48 crypto instances to scale with the multi-core architecture. If there are multiple QAT endpoints in the server, one process can be assigned with multiple QAT instances from different endpoints to employ more computation engines.

Parallelism: On the one hand, crypto requests submitted from different processes to different QAT instances can be calculated in parallel on multiple computation engines. On the other hand, concurrent crypto requests submitted from one process to one QAT instance can also be calculated in parallel as long as there are available computation engines. Here, "concurrent" means to submit the next request without waiting for the completion of previous ones. If there are sufficient concurrent requests, it is possible to fully load all parallel computation engines with only one or two QAT instances.

OpenSSL QAT Engine: QAT integrates its crypto service into OpenSSL (a widely-used crypto library) [13] through a standard engine named QAT Engine, which allows seamless use by applications. When the QAT Engine is loaded, applications can transparently offload crypto operations to the QAT accelerator via the normal OpenSSL API. In general, the QAT Engine supports the offloading of asymmetric cryptography (e.g., RSA, ECDH and ECDSA), symmetric chained cipher (e.g., AES128-CBC-HMAC-SHA1) and PRF.
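As a rough sketch of what loading such an engine looks like through OpenSSL's standard ENGINE API (illustrative only; the engine id string "qatengine" and the minimal error handling are assumptions and may differ between QAT Engine releases):

```c
#include <openssl/engine.h>

/* Minimal sketch: load a hardware crypto engine and make it the default
 * provider for the algorithms it implements. */
int load_qat_engine(void)
{
    ENGINE_load_builtin_engines();

    ENGINE *e = ENGINE_by_id("qatengine");   /* assumed engine id */
    if (e == NULL)
        return 0;

    if (!ENGINE_init(e)) {                   /* acquire a functional reference */
        ENGINE_free(e);
        return 0;
    }
    /* Route supported crypto operations (RSA, EC, ciphers, ...) to the engine. */
    if (!ENGINE_set_default(e, ENGINE_METHOD_ALL)) {
        ENGINE_finish(e);
        ENGINE_free(e);
        return 0;
    }
    return 1;
}
```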

2.4 Challenges with TLS Offloading

A straight offload mode for TLS processing is to replace the crypto-related function call with an I/O call that interacts with the hardware accelerator. This mode reuses the existing TLS software stack with only a few modifications. However, for the event-driven web architecture, this kind of straight offloading is not sufficiently competent and cannot reach the anticipated performance enhancement.

In the software-based TLS processing, the crypto-related function call is a synchronous blocking call that uses the CPU resource for crypto calculation. Both the TLS library layer and the application layer cannot move on if a crypto operation has not been completed yet. This kind of design works well with the CPU-only architecture, but in the straight offload mode, it causes the frequent long-lasting blockings in the offload I/O, as illustrated in Figure 3. Although the QAT driver provides an inherently non-blocking interface (i.e., request/response), the QAT Engine cannot return control to upper layers after it submits a crypto request from the connection C1. That is to say, it has to wait until the QAT accelerator completes this request and generates a response. Only after the response for the crypto result is consumed can the application process continue execution to generate the next crypto request (giving rise to the next blocking).

Figure 3. TLS straight offload mode with QAT (the application process, TLS library and QAT Engine block and wait while each request from connections C1 and C2 is calculated on the computation engines, so requests can only be submitted one at a time).

For the event-driven web architecture, the blockings in the offload I/O lead to the underutilization of both CPU and QAT resources. First is the large amount of CPU cycles spent waiting. After a crypto request is submitted to the accelerator, the application process (typically running on a dedicated core) is in the waiting state (either busy-looping or sleeping to wait for the QAT response), stopping the cycle of handling events for a long time. Second is the low utilization of the parallel computation engines inside the QAT accelerator. Since the second crypto request can only be submitted after the completion of the first one, for each application process, no more than one computation engine can be employed at the same time.

3 QTLS Design

To achieve efficient TLS acceleration for the event-driven web architecture, we propose QTLS, a high-performance TLS asynchronous offload framework that leverages the QAT accelerator to offload crypto operations. Its main idea is also applicable to other types of ASIC crypto accelerators.

3.1 Overview

An overview of the QTLS framework is shown in Figure 4. To eliminate blockings in the offload I/O, QTLS re-engineers the TLS software stack to enable the asynchronous support for crypto operations in all the layers, including: the QAT Engine layer, the TLS library layer and the application layer. It divides the TLS offloading into four phases:

• Pre-processing: When a TLS handler (e.g., the handshake handler) encounters a crypto operation, the offload job is paused immediately after the crypto request is submitted to the QAT accelerator. The application process continues execution to handle other connections, instead of being blocked and waiting for the QAT response. Multiple crypto operations from different connections (C1, C2, ...) can be offloaded concurrently in one process.

• QAT Response Retrieval: As more and more concurrent crypto requests are submitted, the QAT responses for them need to be retrieved in time. A heuristic polling scheme is designed here to achieve efficient and timely retrieval.


Figure 4. An overview of the QTLS framework: pre-processing (the TLS state machine pauses the offload job after a non-blocking crypto request is submitted through the QAT Engine and the QAT userspace driver), QAT response retrieval (the heuristic polling scheme, guided by application-level knowledge, polls for new responses), async event notification (the response callback delivers events through the kernel-bypass scheme), and post-processing (the paused job is resumed to consume the crypto calculation result).

• Async Event Notification: Each time a new QAT response is retrieved, the pre-registered response callback generates an async event to inform the application about the completion of an async crypto request. A kernel-bypass notification scheme is designed here to avoid expensive switches between user mode and kernel mode while delivering async events.

• Post-processing: According to the captured async event, the corresponding TLS connection along with the same TLS handler will be rescheduled by the application process to resume the paused offload job. Since the crypto calculation result is now ready, this connection can move on to the next step.

In this novel framework, CPU resources are fully utilized to handle concurrent connections. Furthermore, the concurrently offloaded crypto requests greatly increase the utilization of the parallel computation engines inside the QAT accelerator.

3.2 Pre-processing and Post-processing

As discussed in §2.4, the root cause of the blockings in the offload I/O is that both the TLS library layer and the application layer cannot move on if a crypto operation has not been completed yet. We circumvent this restriction by introducing asynchronous crypto behaviors into the TLS software stack and splitting the offload I/O into a pre-processing phase (where async offload jobs are paused) and a post-processing phase (where async offload jobs are resumed).

In the TLS library layer, QTLS introduces crypto pause and crypto resumption. The former is leveraged by the QAT Engine to pause the current offload job after a crypto request is submitted and return control to the upper application with a specific error code. The latter is leveraged by the application process to resume a paused offload job to consume the crypto result. The pause and resumption of an offload job require careful management of the TLS context and the crypto state. We have developed two types of implementations based on OpenSSL, which will be elaborated in §4.1.

The application layer is the initiator and terminator of async crypto operations. QTLS introduces a new state named TLS-ASYNC into the application-maintained TLS state machine, which is connected to all other TLS states (e.g., Handshake and TLS-Write) that may involve crypto operations. Suppose that the currently-handled connection C1 is in the Handshake state (i.e., executing the handshake handler), as shown in Figure 4. Once the TLS state machine identifies a paused offload job from the specific error code returned by the TLS library, it changes the TLS state from Handshake to TLS-ASYNC. Each paused offload job corresponds to an async handler. Here, it is set to the handshake handler, indicating that the same handler needs to be rescheduled later to conduct the post-processing phase. In addition, the application layer is responsible for monitoring all the paused offload jobs and waiting for async events that signal the completion of crypto requests. The detailed design of monitoring and notification will be presented together in §3.4.

The QAT Engine layer is a bridge between the TLS library layer and the QAT driver layer. On the one hand, it registers a response callback (used to generate an async event) when it uses the non-blocking API provided by the QAT driver to submit a crypto request. On the other hand, after the submission of a crypto request, it uses the crypto pause provided by the TLS library to return control to the upper application.

A special case is the failure of crypto submission, possibly because the QAT request ring is already full. QTLS resolves this case by reusing the proposed framework. The offload job is paused after an unsuccessful crypto submission. The application is notified about this failure through either an async event or a specific return code. The same TLS handler will be rescheduled later to resume the paused offload job and retry the crypto submission.

3.3 QAT Response Retrieval

QAT responses can be retrieved through either interrupts or polling. QTLS leverages userspace I/O for crypto offloading, where one userspace-based polling operation has much less overhead than one kernel-based interrupt [19]. Therefore, QTLS selects polling to achieve high throughput, especially at the high-load state. A convenient polling method is to initiate an independent thread for each application process to poll the assigned QAT instances at regular intervals, as the QAT Engine does by default, but it inevitably comes with several drawbacks:

• It introduces frequent context switches between application processes and polling threads.

• It is hard to determine an appropriate polling interval: a small interval may lead to a large amount of ineffective polling operations while a big interval may incur a long latency or even impede the throughput.

To achieve better performance and adapt well to varied traffic, we design a heuristic polling scheme for QTLS. Since the application is the producer of crypto requests, it has the best knowledge about the proper time to retrieve QAT responses. The heuristic polling scheme is integrated into the application to (1) avoid an independent polling thread and (2) leverage the application-level knowledge to guide the polling behaviors. Sometimes, a polling operation should be postponed to coalesce sufficient responses; at other times, a polling operation may need to be executed immediately to reduce latency. The heuristic polling scheme integrates both efficiency and timeliness, with the goal of making the response retrieval rate always match the request submission rate.

Efficiency: When there are a large number of concurrent TLS connections (i.e., high offload traffic), it is reasonable to coalesce sufficient QAT responses into one polling operation. The number of inflight (submitted but not retrieved with a response) crypto requests is an appropriate indicator, and a polling operation is triggered when this number reaches a pre-defined threshold. A notable fact is that the asymmetric-key calculation takes a much longer time than the calculation of other types of crypto requests. Therefore, a bigger threshold is used if there exist inflight asymmetric crypto requests.

Timeliness: During off-hours with few TLS connections, timely polling is important and necessary. A TLS connection can be identified as (1) active if it is processing the handshake, reading an HTTPS request or writing the HTTPS response back; or (2) idle if it is waiting for a request from the end client (including the keepalive case). In the current design, each active TLS connection can only have one async crypto request at the same time. Therefore, once the number of inflight async crypto requests equals the number of active TLS connections, a polling operation needs to be executed immediately. Otherwise, the application process may get stuck as all active connections are waiting for QAT responses. This timeliness constraint greatly reduces the query latency when there are only a few end clients.

3.4 Async Event Notification

For the event-driven web architecture, a natural way to deliver async event notifications is to reuse the existing event-driven approach. Specifically, each time an async offload job is paused, a file descriptor (FD) [51] is allocated to it and monitored by the application's I/O multiplexing mechanism, together with the network sockets. When a QAT response is retrieved, the response callback function writes an event on the corresponding FD to inform the application.

However, as file descriptors are maintained in the kernel, this kind of FD-based notification inevitably introduces expensive switches between user mode and kernel mode. In order to avoid this performance penalty, we design a kernel-bypass notification scheme. An application-defined async queue is introduced to work as the monitoring and notification channel. When the TLS state machine identifies a paused offload job, it already has the knowledge about its async handler, which is actually the TLS handler that needs to be rescheduled when the application receives the async event. This knowledge can be shared with the lower software stack. Then, when a QAT response is retrieved, the response callback function can complete notification just by inserting the corresponding async handler into the tail of the async queue. This async queue will be processed at the end of the main event loop. As long as there exist inflight crypto requests, the main event loop keeps executing, instead of sleep-waiting for events on the I/O multiplexing mechanism.

4 Implementation Details

QTLS can be implemented on various event-driven TLS servers or terminators (e.g., Nginx, HAProxy and Squid). This section presents the implementation based on Nginx [36] and OpenSSL [13]. Nginx uses one master process for management and multiple worker processes for doing the real work.

4.1 OpenSSL Async Crypto Support

We have developed the crypto pause and resumption in OpenSSL for several years, and there are two kinds of implementations: stack async and fiber async.

Stack Async: Initially, we implemented the crypto pause and resumption by altering the normal sequence of program execution according to a state flag, as illustrated in Figure 5. When the crypto API (e.g., RSA sign) is invoked by the TLS API (e.g., handshake) for the first time, it executes normally to submit the crypto request. After a successful submission, it sets the state flag to inflight and returns to the caller. The state flag is set to ready when the QAT response is retrieved. Then, the same TLS API is called again to conduct crypto resumption, which requires careful skipping of some TLS operations (e.g., ssl3_get_message) that have been completed in the first call. As the state flag is ready now, the same crypto API jumps over the crypto submission part to directly consume the crypto result. For the failure of crypto submission, the state flag is set to retry and the application is responsible for calling the same TLS API later to retry the submission. The stack async implementation has good performance but it is intrusive to OpenSSL as its API is not compatible with the existing synchronous API.

Figure 5. Workflow of stack async (on the first call the crypto API submits the request and pauses with the flag set to inflight, or to retry if the submission fails; on resumption with the flag set to ready, it jumps over the submission to consume the crypto result, with careful skipping in the TLS API).
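A condensed, purely illustrative sketch of this state-flag pattern is shown below; none of these names come from OpenSSL or QTLS, and submit_request/consume_result are placeholders for the non-blocking driver interaction.

```c
enum crypto_state { CRYPTO_IDLE, CRYPTO_INFLIGHT, CRYPTO_READY, CRYPTO_RETRY };

struct crypto_op {
    enum crypto_state flag;   /* advanced by the submit path and the response callback */
    void *result;
};

extern int  submit_request(struct crypto_op *op);   /* placeholder: non-blocking submission */
extern void consume_result(struct crypto_op *op);   /* placeholder: use the crypto result   */

/* Returns 1 when the result is available, 0 when the caller must come back later. */
int crypto_call(struct crypto_op *op)
{
    switch (op->flag) {
    case CRYPTO_IDLE:
    case CRYPTO_RETRY:
        if (!submit_request(op)) {
            op->flag = CRYPTO_RETRY;          /* ring full: ask the caller to retry   */
            return 0;
        }
        op->flag = CRYPTO_INFLIGHT;           /* pause: return control to the caller  */
        return 0;
    case CRYPTO_INFLIGHT:
        return 0;                             /* response not retrieved yet           */
    case CRYPTO_READY:
        consume_result(op);                   /* jump over submission on resumption   */
        return 1;
    }
    return 0;
}
```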

Fiber Async: The intrusive stack async implementation is unacceptable to the OpenSSL community. After discussion with OpenSSL maintainers, we agreed to a new implementation that unifies the asynchronous and synchronous behaviors with the fiber mechanism [50]. Fibers are cooperative lightweight threads of execution in user space, which, in effect, provide a natural way to organize concurrent code based on asynchronous I/O [27]. By using explicit yielding and cooperative scheduling, the fiber async implementation manages multiple execution paths for async offload jobs in one process/thread, as illustrated in Figure 6.

When the TLS API (e.g., handshake) is called for the first time, it uses ASYNC_start_job to start a new fiber-based ASYNC_JOB to encapsulate (fiber context swap here) the running piece of the TLS connection. The fiber-based ASYNC_JOB can be paused at any point to return control to the main code and resumed later to directly jump to the pause point. If the invoked crypto API (e.g., RSA sign) identifies an async offload job through ASYNC_get_current_job, it first submits a crypto request in non-blocking mode and then leverages ASYNC_pause_job to return control (fiber context swap here). When the same TLS API is called again, it uses ASYNC_start_job with the paused ASYNC_JOB as an argument to directly jump (fiber context swap here) to the pause point and consume the crypto result.

The fiber async implementation has a slight performance penalty due to the fiber management and switches, but it is compatible with the existing API. It has been included in OpenSSL official releases since the 1.1.0 series.
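As a rough sketch of how this fiber mechanism is typically used with OpenSSL's ASYNC job API (simplified; submit_to_accelerator, crypto_result_ready and the surrounding plumbing are placeholders, error handling is omitted, and this is not the actual QAT Engine code):

```c
#include <openssl/async.h>

extern int submit_to_accelerator(void *req);      /* placeholder: non-blocking submit  */
extern int crypto_result_ready(void *req);        /* placeholder: set by QAT callback  */

struct crypto_args { void *req; };

/* Engine side: runs inside a fiber-based ASYNC_JOB. */
static int do_crypto(void *vargs)
{
    struct crypto_args *args = vargs;             /* copy made by ASYNC_start_job      */

    if (ASYNC_get_current_job() != NULL) {
        submit_to_accelerator(args->req);         /* non-blocking crypto submission    */
        while (!crypto_result_ready(args->req))
            ASYNC_pause_job();                    /* fiber swap back to the main code  */
    }
    /* ... consume the crypto result ... */
    return 1;
}

/* TLS-library side: the first call starts the job, later calls resume it. */
int run_async(ASYNC_JOB **job, ASYNC_WAIT_CTX *ctx, void *req)
{
    int ret;
    struct crypto_args args = { req };

    switch (ASYNC_start_job(job, ctx, &ret, do_crypto, &args, sizeof(args))) {
    case ASYNC_PAUSE:
        return -1;        /* crypto pause: report "want async" to the caller  */
    case ASYNC_FINISH:
        return ret;       /* crypto resumption finished, result consumed      */
    default:
        return 0;         /* ASYNC_ERR / ASYNC_NO_JOBS                        */
    }
}
```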

Figure 6. Workflow of fiber async (crypto pause: ASYNC_start_job starts a new ASYNC_JOB, the crypto API identified via ASYNC_get_current_job submits the request in non-blocking mode and calls ASYNC_pause_job, each step a fiber context swap; crypto resumption: ASYNC_start_job with the paused job jumps back to the pause point to consume the crypto result).

4.2 Nginx Modifications for Async Crypto Support

The crypto pause may occur in the following functions: ngx_ssl_handshake, ngx_ssl_handle_recv, ngx_ssl_write and ngx_ssl_shutdown. These functions are modified to recognize the new error code from OpenSSL named SSL_ERROR_WANT_ASYNC, which indicates that an async crypto request has been submitted. If this new error code is received, the async handler of the paused offload job is set to the current handler (e.g., ngx_ssl_handshake_handler).
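A simplified illustration of this pattern is given below (loosely modeled on a handshake handler; the conn_t fields and my_handshake_handler are placeholders rather than the actual Nginx code, and SSL_MODE_ASYNC is assumed to be enabled on the SSL context so that SSL_ERROR_WANT_ASYNC can be returned):

```c
#include <openssl/ssl.h>

/* Hypothetical connection object; the real Nginx code uses ngx_connection_t. */
typedef struct {
    SSL  *ssl;
    void (*async_handler)(void *conn);    /* handler to reschedule on the async event */
} conn_t;

extern void my_handshake_handler(void *conn);   /* placeholder for the current handler */

int do_handshake(conn_t *c)
{
    int n = SSL_do_handshake(c->ssl);
    if (n == 1)
        return 1;                                /* handshake completed */

    switch (SSL_get_error(c->ssl, n)) {
    case SSL_ERROR_WANT_ASYNC:
        /* Crypto request submitted to QAT: pause this connection and remember
         * which handler to run again when the async event arrives. */
        c->async_handler = my_handshake_handler;
        return 0;
    case SSL_ERROR_WANT_READ:
    case SSL_ERROR_WANT_WRITE:
        return 0;                                /* wait for socket events as usual */
    default:
        return -1;                               /* fatal error */
    }
}
```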

Ideally, after an async offload job is paused and the execution is returned to Nginx, the only type of event expected for this TLS connection is an async event, which will resume the paused job. However, the possibility exists that a read event (from the network socket) occurs before the expected async event. This read event could be the next message during the TLS handshake or the first HTTP request after connection establishment. This kind of event disorder may cause the HTTP or TLS state machine to go back to a previous state. To address this issue, QTLS clears and saves the handler of the read event when an async event is being expected, and restores it when the async event is being processed.

4.3 Heuristic Polling Scheme

The information about the inflight crypto requests is collected in the QAT Engine layer for accuracy. The different types of TLS crypto operations are counted independently, denoted by R_asym, R_cipher and R_prf respectively. Each time a crypto-related function in the QAT Engine (e.g., qat_rsa_priv_dec) is invoked, the corresponding number (e.g., R_asym) is increased by one. Once a QAT response is retrieved, the corresponding number is decreased by one in the response callback function. The total number of inflight crypto requests (denoted by R_total) is calculated by adding these three variables together and is shared with the upper application via a new engine command.

Nginx has a stub_status module that collects useful information including the number of alive connections and the number of idle connections that are waiting for a request. Based on this, QTLS adds a check for TLS-enabled connections and calculates the number of active TLS connections (denoted by TC_active) with the equation: TC_active = TC_alive − TC_idle.


The timeliness and efficiency constraints can be satisfied either when R_total increases or when TC_active decreases. Therefore, in Nginx, wherever a crypto operation may be involved or TC_active may be updated, it is required to check these constraints and determine whether to execute a polling operation. The default thresholds for the efficiency constraint are set to 48 (for R_asym > 0) and 24 (for R_asym = 0). We have exposed the threshold setting in the Nginx configuration file.
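An illustrative sketch of this decision logic (not the actual QTLS code): the counters and tc_active are assumed to be maintained as described above, the thresholds are the defaults quoted in the text, and qat_poll_instance stands in for the engine's polling entry point.

```c
static int r_asym, r_cipher, r_prf;     /* inflight requests per crypto type */
static int tc_active;                   /* active TLS connections            */

#define THRESHOLD_WITH_ASYM  48
#define THRESHOLD_NO_ASYM    24

extern void qat_poll_instance(void);    /* assumed: retrieves ready QAT responses */

void heuristic_poll_check(void)
{
    int r_total = r_asym + r_cipher + r_prf;
    if (r_total == 0)
        return;                          /* nothing inflight, nothing to poll */

    int threshold = (r_asym > 0) ? THRESHOLD_WITH_ASYM : THRESHOLD_NO_ASYM;

    /* Efficiency: enough responses can be coalesced into one polling call. */
    if (r_total >= threshold) {
        qat_poll_instance();
        return;
    }
    /* Timeliness: every active connection is stalled on an inflight request,
     * so poll immediately to avoid getting stuck. */
    if (r_total >= tc_active)
        qat_poll_instance();
}
```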

In our large number of tests, the heuristic polling scheme works well to retrieve all QAT responses in time. As a failover, a timer (e.g., five milliseconds) can be set inside Nginx to periodically check whether at least one heuristic polling operation was triggered during the last interval. If not, but there exist in-flight offload requests, an extra polling operation can be executed at once.

4.4 Async Event Notification

FD-based notification: The QAT Engine uses the set-FD API to associate an FD with an async job before the crypto pause. The application uses the get-FD API at the proper time to obtain information about added FDs (which need to be added to monitoring) and deleted FDs (which need to be deleted from monitoring). Normally, a new FD should be created for each paused offload job. Considering the fact that each TLS connection can only have one async offload job at the same time, an optimization for lower overhead is to share one FD across all async jobs from the same TLS connection.
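A minimal sketch of this FD-based path, using the async-FD accessors OpenSSL exposes for this purpose (SSL_get_changed_async_fds); the two-call pattern for obtaining counts, the epoll registration and the conn bookkeeping are simplified assumptions rather than the QTLS implementation.

```c
#include <openssl/ssl.h>
#include <openssl/async.h>
#include <sys/epoll.h>

/* After an SSL call returns SSL_ERROR_WANT_ASYNC, (de)register the async FDs
 * that OpenSSL reports so the event loop wakes up once the response callback
 * signals completion. Error handling is omitted. */
void update_async_fds(int epfd, SSL *ssl, void *conn)
{
    OSSL_ASYNC_FD add_fds[4], del_fds[4];
    size_t num_add = 0, num_del = 0;
    struct epoll_event ev;

    /* First call obtains the counts, second call fills the arrays. */
    SSL_get_changed_async_fds(ssl, NULL, &num_add, NULL, &num_del);
    SSL_get_changed_async_fds(ssl, add_fds, &num_add, del_fds, &num_del);

    for (size_t i = 0; i < num_del; i++)
        epoll_ctl(epfd, EPOLL_CTL_DEL, del_fds[i], NULL);   /* stop monitoring */

    for (size_t i = 0; i < num_add; i++) {
        ev.events = EPOLLIN;
        ev.data.ptr = conn;               /* resume this connection's handler later */
        epoll_ctl(epfd, EPOLL_CTL_ADD, add_fds[i], &ev);
    }
}
```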

Kernel-bypass notification: An application-level callback function is introduced to manipulate the async queue. This function takes the async handler information as the argument and inserts it into the tail of the async queue. In OpenSSL, two new members, named callback and callback_arg, are added to the data structure of the fiber-based ASYNC_JOB. Meanwhile, a series of new APIs are introduced for users to set or get them.

Each time the TLS state machine identifies a paused offload job, it invokes the SSL_set_async_callback API to set the application-level callback and the async handler. When a QAT response is retrieved, the response callback function leverages the ASYNC_WAIT_CTX_get_callback API to check whether the callback member has been set for the current offload job. If so, it can directly invoke the application-level callback function with callback_arg (i.e., the async-handler information) as the argument to complete the notification.
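The queue itself can be as simple as an intrusive singly-linked list owned by the worker process. The sketch below is illustrative only: the async_task_t type and the queue helpers are assumptions, and the exact signatures of the SSL_set_async_callback / ASYNC_WAIT_CTX_get_callback additions described above are not reproduced here.

```c
#include <stddef.h>

/* One entry per paused offload job: the TLS handler to reschedule. */
typedef struct async_task {
    void (*handler)(void *conn);      /* e.g. the handshake handler    */
    void *conn;                       /* the TLS connection to resume  */
    struct async_task *next;
} async_task_t;

/* Application-defined async queue: written by the response callback,
 * drained at the end of the main event loop. No kernel involvement. */
static async_task_t *queue_head, *queue_tail;

/* Installed as the application-level callback; invoked from the response
 * callback when a QAT response for this job is retrieved. */
int async_notify(void *arg)
{
    async_task_t *task = arg;         /* the callback_arg set at pause time */
    task->next = NULL;
    if (queue_tail)
        queue_tail->next = task;
    else
        queue_head = task;
    queue_tail = task;
    return 1;
}

/* Called at the end of each event-loop iteration (post-processing). */
void drain_async_queue(void)
{
    while (queue_head) {
        async_task_t *task = queue_head;
        queue_head = task->next;
        if (!queue_head)
            queue_tail = NULL;
        task->handler(task->conn);    /* resume the paused offload job */
    }
}
```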

5 Evaluation

5.1 Experimental Setup

We established an experimental testbed with one tested server and two client servers that were connected back-to-back via Intel® XL710 40GbE NICs. All servers were equipped with two 22-core Intel® Xeon® E5-2699 v4 CPUs (hyper-threading enabled) and 64GB RAM. We ran CentOS 7.3 with Linux Kernel 3.10 on them. QTLS was installed on the tested server with one Intel® DH8970 PCIe QAT card, which contains three independent QAT endpoints.

The tested QTLS was based on Nginx 1.10.3 and OpenSSL 1.1.0g. Nginx supports a number of working modes, such as a web server, a reverse proxy server or a load balancer. Since all these modes share the same TLS processing module, we only configured Nginx as an HTTPS web server to conduct the testing. In addition, the fiber async implementation was used in the evaluation as it has been included in OpenSSL official releases.

In the tested server, multiple Nginx workers were configured to occupy the same number of dedicated hyper-threading (HT) cores. Each worker process was equipped with one QAT instance. The allocated QAT instances were distributed evenly across the three QAT endpoints. The timer-based polling thread (needed in some configurations) for each Nginx worker was pinned to the same core as the worker.

The evaluation was conducted in terms of TLS handshake performance, secure data transfer throughput and average response time. We covered both TLS 1.2 and 1.3 protocols, both full and abbreviated (i.e., session resumption) handshakes, and multiple mainstream cipher suites in the evaluation. To evaluate all aspects of QTLS, the following five configurations are compared:

• SW: software calculation with modern AES-NI instructions for TLS-involved crypto operations.

• QAT+S: straight offload mode with the timer-based polling thread (10 µs polling interval by default) for QAT response retrieval.

• QAT+A: the TLS asynchronous offload framework with the timer-based polling thread and the FD-based notification scheme.

• QAT+AH: replacing the polling thread with the heuristic polling scheme, based on the QAT+A configuration.

• QTLS: the full QTLS with both the heuristic polling scheme and the kernel-bypass notification scheme.

5.2 Full Handshake Performance

The TLS handshake performance was measured with the OpenSSL s_time tool. In each of the two client servers, 1000 s_time processes were simultaneously launched to establish new TLS connections (i.e., full handshakes) with the Nginx running on the tested server. We first evaluated the full handshake performance for three mainstream cipher suites in the TLS 1.2 protocol: TLS-RSA, ECDHE-RSA and ECDHE-ECDSA.

The full handshake performance of TLS-RSA (2048-bit) is shown in Figure 7a. Here "2HT" indicates that two Nginx workers are running on two dedicated hyper-threading (HT) cores (belonging to the same physical core). It can be seen that the connections per second (CPS) value increases linearly for all five configurations when the number of Nginx workers varies from 2 to 24.


Figure 7. Full handshake performance for three mainstream cipher suites in the TLS 1.2 protocol: (a) TLS-RSA (2048-bit), (b) ECDHE-RSA (2048-bit), (c) ECDHE-ECDSA (six NIST curves); connections per second (K) for the SW, QAT+S, QAT+A, QAT+AH and QTLS configurations.

Figure 8. Full handshake performance in TLS 1.3 with ECDHE-RSA (2048-bit).

Figure 9. Session resumption performance in TLS 1.2 with ECDHE-RSA (2048-bit): (a) 100% abbreviated handshakes, (b) Full / Abbreviated = 1 : 9.

Taking the 8HT case as an example, the SW configuration gives 4.3K CPS. The straight offloading of RSA and PRF operations (i.e., the QAT+S configuration) improves the CPS by a factor of two. In the QAT+A configuration, the TLS asynchronous offload framework fully utilizes both CPU and QAT resources, bringing a CPS boost to 29.5K, which is a 7x improvement over the SW configuration. In the QAT+AH configuration, an additional 20% CPS enhancement (i.e., 35.8K) was obtained by the heuristic polling for QAT responses. Finally, the kernel-bypass notification scheme further improves the CPS by 8%, leading to a CPS value of 38.8K in the QTLS configuration. In general, the full QTLS provides a 9x CPS improvement over the software baseline. For the 32HT case, both the QAT+AH and the QTLS configurations give about 100K CPS, achieving the upper limit of the DH8970 QAT card.

Figure 7b shows the handshake performance of ECDHE-RSA (2048-bit), which involves two more ECC calculations compared to the TLS-RSA cipher suite. The OpenSSL default curve, NIST P-256, was used for the ECC computation. The QAT+S configuration shows no CPS improvement over the SW configuration due to the blockings in the offload I/O. The QAT+A configuration with the TLS asynchronous offload framework improves the CPS by a factor of more than four. The heuristic polling scheme and kernel-bypass notification scheme give an additional 20% and 6% CPS enhancement respectively. The full QTLS provides a 5.5x CPS improvement over the software baseline, with 16 Nginx workers to achieve the QAT upper limit (i.e., 40K CPS).

For another mainstream cipher suite, ECDHE-ECDSA, which involves three ECC calculations, we evaluated six NIST curves with four Nginx workers, as shown in Figure 7c. A striking phenomenon is that for the P-256 curve, the SW configuration provides an abnormally high CPS and outperforms the QAT+S configuration significantly. This is because the prime of the P-256 curve is "Montgomery Friendly" and can be implemented in the Montgomery domain [17]. This feature makes the ECDSA (P-256) sign 2.33x faster than the traditional implementation. Nevertheless, QTLS enhances the CPS by more than 70% compared to this special software implementation. It is worth noting that not all ECC prime curves can be optimized by using the Montgomery domain. For the stronger P-384 curve, QTLS provides a 14x CPS improvement over the software baseline. Moreover, according to SafeCurves [1], the widely-used P-256 curve has potential security risks. Therefore, we also evaluated another four counterpart curves: NIST binary curves (B-283 and B-409) and NIST Koblitz curves (K-283 and K-409). For all these four curves, QTLS enhances the CPS by more than 12x compared to the SW configuration.

TLS 1.3: We also evaluated the full handshake performance for the TLS 1.3 protocol (provided by OpenSSL 1.1.1-pre8) with the cipher suite ECDHE-RSA (2048-bit), as shown in Figure 8. QTLS provides a 3.5x CPS improvement over the software baseline, lower than in the TLS 1.2 case. The main reason is that the TLS 1.3 protocol introduces a new key derivation function named HKDF, which currently cannot be offloaded through the QAT Engine.

5.3 Session Resumption Performance

Session resumption is widely used in real-life deployment to reduce the resource consumption at the server side.


Figure 10. Secure data transfer with varied files (throughput in Gbps vs. requested file size from 4 KB to 1024 KB, for SW, QAT+S, QAT+A, QAT+AH and QTLS).

Figure 11. Latency under different concurrencies (average response time in ms vs. number of concurrent end clients, for SW, QAT+S, QAT+A and QTLS).

Figure 12. Comparison between the timer-based polling thread and the heuristic polling scheme: (a) full handshake with TLS-RSA (2048-bit), (b) secure data transfer with a 64 KB file, (c) average response time.

In the TLS 1.2 protocol, an abbreviated handshake for session resumption involves PRF calculations only. We first evaluated the session resumption performance with 100% abbreviated handshakes. The evaluation environment is identical to the full handshake testing. A difference is that the "reuse" option was used in the s_time tool to establish resumed connections. As shown in Figure 9a, with abbreviated handshakes only, QTLS can provide a 30%-40% CPS enhancement over the software baseline. In contrast, the straight offload mode (i.e., the QAT+S configuration) suffers from the blockings in the offload I/O and gives an obviously lower CPS compared to the SW configuration without TLS offloading.

As mentioned in §2.1, session resumption compromises the protection afforded by forward secrecy, and the lifetime of session IDs or tickets is typically restricted. Real-life traffic is a mixture of full and abbreviated handshakes. The ratio between them depends on the security policy. We evaluated an example ratio of 1:9 (i.e., 10% full handshakes) with the cipher suite ECDHE-RSA (2048-bit), as shown in Figure 9b. Compared to the SW configuration, QTLS improves the CPS by more than 2x. For a specific ratio between full and abbreviated handshakes, the CPS enhancement with the cipher suite ECDHE-RSA (2048-bit) ranges between 1.3x and 5.5x. A bigger percentage of full handshakes (i.e., a more stringent security policy) leads to a higher CPS improvement.

5.4 Secure Data Transfer Throughput

We used ApacheBench (ab) to measure secure data transfer throughput with the requested file size varying from 4 KB to 1024 KB. The cipher suite for secure data transfer was AES128-SHA. In the tested server, eight Nginx workers were launched and the keepalive setting was tuned to avoid the influence of the TLS handshake. In a client server, 400 ab processes were configured to continuously request a fixed file.

The experimental results are shown in Figure 10. For the 4 KB case, QTLS only provides a slightly higher throughput compared to the SW configuration. As the requested file size increases, the performance benefit of QTLS becomes bigger because the data encryption for the file incurs more cipher operations at the server side. Taking the 128 KB case as an example, one file incurs eight (calculated by 128/16) cipher operations. The QAT+A configuration improves the throughput by 60% compared to the SW configuration. The combination of the heuristic polling scheme and the kernel-bypass notification scheme contributes an additional 25%-30% throughput enhancement. The full QTLS provides more than 2x throughput improvement over the software baseline.

5.5 Average Response Time
The average response time was evaluated with different concurrencies, i.e., different numbers of concurrent end clients. Only one Nginx worker was launched in the tested server. In a client server, multiple ab processes were launched to simulate end clients and make concurrent requests for a small page (less than 100 bytes) with the cipher suite TLS-RSA (2048-bit). A full handshake is needed for each request.

As shown in Figure 11, for a concurrency of 1, the QAT+S configuration has the lowest latency due to the busy-loop waiting for the crypto result. The QTLS configuration with the heuristic polling scheme has the second-lowest response time. This is because once the number of in-flight offload requests equals the number of active TLS connections (here, 1), a heuristic polling operation is immediately performed to retrieve at least one QAT response. The QAT+A configuration shows a longer response time since QAT response retrieval is only triggered after a fixed time interval (here, 10 µs). The SW configuration has the highest latency due to the software calculation of the compute-intensive RSA operation.

With the increase of concurrency, the advantage of the TLS asynchronous offload framework becomes more significant. Since multiple crypto operations from different connections can be offloaded concurrently, the QAT+A configuration achieves about a 75% latency reduction over the software baseline when the concurrency reaches 64. After adding the heuristic polling scheme and the kernel-bypass notification scheme, the full QTLS decreases the average response time by nearly 85%.

5.6 Polling Thread vs. Heuristic Polling
We evaluated the timer-based polling thread with different polling intervals to demonstrate the advantages of the heuristic polling scheme. Three testing scenarios were designed with the TLS asynchronous offload framework: (1) 10 µs: using the timer-based polling thread with a 10 µs interval; (2) 1 ms: using the timer-based polling thread with a 1 ms interval; (3) heuristic: using the proposed heuristic polling scheme.

The experimental results are shown in Figure 12. The 10 µs configuration yields relatively low handshake performance (about a 20% gap) because it suffers from frequent context switches and a large number of ineffective polling operations. The 1 ms configuration, when there is only a small number of concurrent end clients, not only gives a high latency but also lowers the throughput dramatically, as it has to wait for 1 ms before retrieving QAT responses. The heuristic polling scheme leverages application-level knowledge to make the response retrieval rate match the request submission rate. It achieves the best performance in all the above experiments and adapts well to varied network traffic (i.e., different concurrencies).
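To make the trigger condition concrete, below is a minimal C sketch of the heuristic decision, assuming a per-worker count of in-flight requests and active connections. It is not the QTLS source: the names offload_state, maybe_heuristic_poll and qat_poll_batch are hypothetical, and the two thresholds simply mirror the qat_heuristic_poll_asym_threshold / qat_heuristic_poll_sym_threshold directives shown in Appendix A.7.

#include <stdio.h>
#include <stddef.h>

/* Illustrative thresholds, cf. the qat_heuristic_poll_*_threshold directives. */
#define ASYM_POLL_THRESHOLD 48
#define SYM_POLL_THRESHOLD  24

struct offload_state {
    size_t inflight_asym;     /* submitted asymmetric requests awaiting a response */
    size_t inflight_sym;      /* submitted symmetric requests awaiting a response  */
    size_t active_tls_conns;  /* TLS connections currently handled by this worker  */
};

/* Stand-in for the engine call that retrieves at least one QAT response. */
static void qat_poll_batch(void)
{
    printf("polling the QAT response ring\n");
}

/* Called from the worker's event loop after each request submission:
 * poll as soon as every active connection already has a request in flight
 * (no further work can be submitted) or a per-type threshold is reached,
 * so that the retrieval rate tracks the submission rate. */
static void maybe_heuristic_poll(const struct offload_state *s)
{
    size_t inflight = s->inflight_asym + s->inflight_sym;

    if ((s->active_tls_conns > 0 && inflight >= s->active_tls_conns) ||
        s->inflight_asym >= ASYM_POLL_THRESHOLD ||
        s->inflight_sym  >= SYM_POLL_THRESHOLD) {
        qat_poll_batch();
    }
}

int main(void)
{
    /* One active connection with one in-flight RSA request: poll immediately. */
    struct offload_state s = { .inflight_asym = 1, .inflight_sym = 0,
                               .active_tls_conns = 1 };
    maybe_heuristic_poll(&s);
    return 0;
}

Under this sketch, the low-concurrency case polls as soon as the single connection has an outstanding request, while at high concurrency the per-type thresholds bound how much work accumulates before a poll, which is the rate-matching behaviour described above.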

6 Related Work
TLS acceleration has been attracting more and more attention due to its increasing importance. The main TLS performance bottleneck is the costly crypto operations involved. A variety of efforts have been made to accelerate TLS from the following aspects.

Software acceleration: Shacham et al. [38] presented an algorithmic approach, batching the SSL handshakes on the web server with Batch RSA, to speed up SSL. Boneh et al. [2] described three fast variants of RSA (Batch RSA, Multi-factor RSA and Rebalanced RSA) and analyzed their security issues. Client-side caching of handshake information was proposed in [39] to reduce the network overhead. Castelluccia et al. [4] examined a technique for re-balancing RSA-based client/server handshakes, which puts more work on clients to achieve better SSL performance on the server.

CPU-level improvement: The AES instruction set, an extension to the x86 instruction set architecture, has been integrated into many processors to improve the speed of encryption and decryption using AES [46]. Kounavis et al. [26] presented a set of processor instructions along with software optimizations that can be used to accelerate end-to-end encryption and authentication. In [16], a novel constant run-time approach was proposed to implement fast modular exponentiation on IA processors.

Hardware offloading: This is a promising solution because accelerators typically provide much faster crypto calculation than the CPU. A number of studies have been conducted to enable TLS-involved crypto algorithms (e.g., RSA, AES and HMAC) on different types of hardware accelerators, including GPU [18, 23, 43, 58], FPGA [22, 24, 40] and the Intel® Xeon Phi™ processor [59]. The main focus of these studies is to program the general-purpose accelerators to enable efficient crypto calculation. In contrast, the QAT accelerator leveraged in this work inherently supports high-performance and energy-efficient calculation for mainstream crypto algorithms. Our focus is to eliminate blockings in the offload I/O and achieve high-performance TLS offloading for the event-driven TLS servers/terminators.

7 Conclusion
In this paper, we proposed and designed QTLS, a QAT-based high-performance TLS asynchronous offload framework for the popular event-driven web architecture. QTLS re-engineers the TLS software stack and divides the TLS offloading into four phases: pre-processing, QAT response retrieval, async event notification and post-processing. This novel framework eliminates blockings in the offload I/O to fully utilize both CPU and QAT resources. Moreover, a heuristic polling scheme and a kernel-bypass notification scheme were designed to further enhance the system performance. We have implemented QTLS based on the widely-deployed Nginx and OpenSSL. The comprehensive performance evaluation demonstrated that QTLS outperforms software-based TLS processing by a significant margin, especially in the handshake phase.

Acknowledgments
The authors would like to thank the anonymous referees for their valuable comments and the PC/AE chairs for their assistance in solving the problems we encountered. This work is supported in part by the National Key Research and Development Program of China (No. 2016YFB1000502) and the National Science Fund for Distinguished Young Scholars (No. 61525204).


References
[1] Daniel J. Bernstein and Tanja Lange. 2017. SafeCurves: choosing safe curves for elliptic-curve cryptography. Retrieved November 30, 2018 from https://safecurves.cr.yp.to/

[2] Dan Boneh and Hovav Shacham. 2002. Fast variants of RSA. CryptoBytes 5, 1 (2002), 1–9.

[3] Ran Canetti, Shai Halevi, and Jonathan Katz. 2003. A Forward-Secure Public-Key Encryption Scheme. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT). 255–271.

[4] Claude Castelluccia, Einar Mykletun, and Gene Tsudik. 2006. Improving secure server performance by re-balancing SSL/TLS handshakes. In Proceedings of the ACM Symposium on Information, Computer and Communications Security (ASIACCS). 26–34.

[5] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-Scale Acceleration Architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 7:1–7:13.

[6] Cristian Coarfa, Peter Druschel, and Dan S. Wallach. 2006. Performance Analysis of TLS Web Servers. ACM Transactions on Computer Systems (TOCS) 24, 1 (2006), 39–69.

[7] HAProxy Community. 2018. HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer. Retrieved November 30, 2018 from http://www.haproxy.org/

[8] Squid Community. 2018. Squid: Optimising Web Delivery. Retrieved November 30, 2018 from http://www.squid-cache.org/

[9] Cas Cremers, Marko Horvat, Jonathan Hoyland, Sam Scott, and Thyla van der Merwe. 2017. A Comprehensive Symbolic Analysis of TLS 1.3. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS). 1773–1788.

[10] Tim Dierks. 2008. The transport layer security (TLS) protocol version 1.2. Technical Report.

[11] Benjamin Erb. 2012. Concurrent programming for scalable web architectures. (2012).

[12] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 51–66.

[13] OpenSSL Software Foundation. 2018. OpenSSL: Cryptography and SSL/TLS Toolkit. Retrieved November 30, 2018 from https://www.openssl.org/

[14] The Apache Software Foundation. 2018. The Apache HTTP Server Project. Retrieved November 30, 2018 from https://httpd.apache.org/

[15] Owen Garrett. 2015. NGINX vs. Apache: Our View of a Decade-Old Question. Retrieved November 30, 2018 from https://www.nginx.com/blog/nginx-vs-apache-our-view/

[16] Vinodh Gopal, James Guilford, Erdinc Ozturk, Wajdi Feghali, Gil Wolrich, and Martin Dixon. 2009. Fast and constant-time implementation of modular exponentiation. (2009).

[17] Shay Gueron and Vlad Krasnov. 2015. Fast prime field elliptic-curve cryptography with 256-bit primes. Journal of Cryptographic Engineering 5, 2 (2015), 141–151.

[18] Owen Harrison and John Waldron. 2008. Practical Symmetric Key Cryptography on Modern Graphics Hardware. In Proceedings of the 17th USENIX Security Symposium (Security). 195–210.

[19] Intel. 2014. Intel® QuickAssist Technology Performance Optimization Guide. Technical Report. https://01.org/sites/default/files/page/330687_qat_perf_opt_guide_rev_1.0.pdf

[20] Intel. 2018. Intel® QuickAssist Technology (Intel® QAT). Retrieved November 30, 2018 from https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

[21] Intel and Wangsu. 2018. Working Together to Build a High Efficiency CDN System for HTTPS. Technical Report. https://01.org/sites/default/files/downloads/intelr-quickassist-technology/i12036-casestudy-intelqatcdnaccelerationen337190-001us.pdf

[22] Takashi Isobe, Satoshi Tsutsumi, Koichiro Seto, Kenji Aoshima, and Kazutoshi Kariya. 2010. 10 Gbps Implementation of TLS/SSL Accelerator on FPGA. In Proceedings of the 18th International Workshop on Quality of Service (IWQoS). 1–6.

[23] Keon Jang, Sangjin Han, Seungyeop Han, Sue B. Moon, and KyoungSoo Park. 2011. SSLShader: Cheap SSL Acceleration with Commodity Processors. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI).

[24] Zia-Uddin-Ahamed Khan and Mohammed Benaissa. 2015. Throughput/area-efficient ECC processor using Montgomery point multiplication on FPGA. IEEE Transactions on Circuits and Systems II: Express Briefs 62, 11 (2015), 1078–1082.

[25] Moein Khazraee, Lu Zhang, Luis Vega, and Michael Bedford Taylor. 2017. Moonwalk: NRE Optimization in ASIC Clouds. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 511–526.

[26] Michael E. Kounavis, Xiaozhu Kang, Ken Grewal, Mathew Eszenyi, Shay Gueron, and David Durham. 2010. Encrypting the Internet. In Proceedings of the Annual Conference of ACM Special Interest Group on Data Communication (SIGCOMM). 135–146.

[27] Oliver Kowalke. 2018. Boost.Fiber 1.68.0 Overview. Retrieved November 30, 2018 from https://www.boost.org/doc/libs/1_68_0/libs/fiber/doc/html/fiber/overview.html

[28] Hugo Krawczyk and Pasi Eronen. 2010. HMAC-based Extract-and-Expand Key Derivation Function (HKDF). Technical Report.

[29] Yang Liu, Jianguo Wang, and Steven Swanson. 2018. Griffin: Uniting CPU and GPU in Information Retrieval Systems for Intra-Query Parallelism. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 327–337.

[30] Peter Membrey, David Hows, and Eelco Plugge. 2012. SSL load balancing. In Practical Load Balancing. Springer, 175–192.

[31] Rui Miao, Hongyi Zeng, Changhoon Kim, Jeongkeun Lee, and Minlan Yu. 2017. SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs. In Proceedings of the Annual Conference of ACM Special Interest Group on Data Communication (SIGCOMM). 15–28.

[32] David Naylor, Alessandro Finamore, Ilias Leontiadis, Yan Grunenberger, Marco Mellia, Maurizio Munafò, Konstantina Papagiannaki, and Peter Steenkiste. 2014. The Cost of the "S" in HTTPS. In Proceedings of the 10th ACM International Conference on emerging Networking Experiments and Technologies (CoNEXT). 133–140.

[33] Q-Success. 2018. Usage of Nginx broken down by ranking. Retrieved November 30, 2018 from https://w3techs.com/technologies/breakdown/ws-nginx/ranking

[34] Qualys. 2018. SSL Pulse. Retrieved November 30, 2018 from https://www.ssllabs.com/ssl-pulse/

[35] Amir Rawdat. 2017. Testing the Performance of NGINX and NGINX Plus Web Servers. Retrieved November 30, 2018 from https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/

[36] Will Reese. 2008. Nginx: the High-Performance Web Server and Reverse Proxy. Linux Journal 2008, 173, Article 2 (2008).


[37] Eric Rescorla. 2018. The transport layer security (TLS) protocol version 1.3. Technical Report.

[38] Hovav Shacham and Dan Boneh. 2001. Improving SSL Handshake Performance via Batching. In Proceedings of the Cryptographer's Track at RSA Conference (CT-RSA). 28–43.

[39] Hovav Shacham, Dan Boneh, and Eric Rescorla. 2004. Client-side caching for TLS. ACM Transactions on Information and System Security (TISSEC) 7, 4 (2004), 553–575.

[40] Mostafa I. Soliman and Ghada Y. Abozaid. 2011. FPGA implementation and performance evaluation of a high throughput crypto coprocessor. Journal of Parallel and Distributed Computing (JPDC) 71, 8 (2011), 1075–1084.

[41] Alibaba Open Source. 2018. tengine qat ssl. Retrieved November 30, 2018 from http://tengine.taobao.org/document/tengine_qat_ssl.html

[42] Drew Springall, Zakir Durumeric, and J. Alex Halderman. 2016. Measuring the Security Harm of TLS Crypto Shortcuts. In Proceedings of the Internet Measurement Conference (IMC). 33–47.

[43] Robert Szerwinski and Tim Güneysu. 2008. Exploiting the Power of GPUs for Asymmetric Cryptography. In Proceedings of the 10th International Workshop on Cryptographic Hardware and Embedded Systems (CHES).

[44] LiteSpeed Technologies. 2018. Event-Driven vs. Process-Based Web Servers. Retrieved November 30, 2018 from https://www.litespeedtech.com/products/litespeed-web-server/features/event-driven-architecture

[45] Changzheng Wei, Jian Li, Weigang Li, Ping Yu, and Haibing Guan. 2017. STYX: A Trusted and Accelerated Hierarchical SSL Key Management and Distribution System for Cloud Based CDN Application. In Proceedings of the ACM Symposium on Cloud Computing (SoCC). 201–213.

[46] Wikipedia. 2018. AES instruction set. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/AES_instruction_set

[47] Wikipedia. 2018. Elliptic-curve cryptography. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Elliptic-curve_cryptography

[48] Wikipedia. 2018. Elliptic-curve Diffie–Hellman. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Elliptic-curve_Diffie-Hellman

[49] Wikipedia. 2018. Epoll. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Epoll

[50] Wikipedia. 2018. Fiber (computer science). Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Fiber_(computer_science)

[51] Wikipedia. 2018. File descriptor. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/File_descriptor

[52] Wikipedia. 2018. Kqueue. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Kqueue

[53] Wikipedia. 2018. OpenSSL. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/OpenSSL

[54] Wikipedia. 2018. Pseudorandom function family. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Pseudorandom_function_family

[55] Wikipedia. 2018. RSA (cryptosystem). Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/RSA_(cryptosystem)

[56] Wikipedia. 2018. TLS termination proxy. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/TLS_termination_proxy

[57] Wikipedia. 2018. Transport Layer Security. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Transport_Layer_Security

[58] Jason Yang and James Goodman. 2007. Symmetric Key Cryptography on Modern Graphics Hardware. In Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security (ASIACRYPT). 249–264.

[59] Shun Yao and Dantong Yu. 2017. PhiOpenSSL: Using the Xeon Phi Coprocessor for Efficient Cryptographic Calculations. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). 565–574.


A Artifact Appendix

A.1 Abstract
This artifact contains the source code of two important components of QTLS, i.e., the asynchronous offload mode and the heuristic polling scheme. At least one QAT acceleration device is required to conduct the performance evaluation.

A.2 Artifact check-list (meta-information)
• Program: OpenSSL, QAT Engine, Async Mode Nginx
• Compilation: gcc
• Binary: Linux executables
• Run-time environment: CentOS 7.3 or later
• Hardware: QAT acceleration device is required
• Execution: Shell commands
• Metrics: CPS (connections per second) for handshake and throughput (Mbps) for secure data transfer
• Experiments: Full handshake, session resumption and data transfer throughput
• Publicly available?: Yes, in both Zenodo and Github
• Workflow frameworks used?: No

A.3 Description

A.3.1 How delivered
Most of our work has been available as open source. QTLS includes three components: the asynchronous offload mode (the most important part), the heuristic polling scheme (an important optimization) and the kernel-bypass notification scheme (a further optimization). The source code of the former two has been placed on Zenodo (DOI: 10.5281/zenodo.2008661, web link: https://zenodo.org/record/2008661).

It is recommended to refer to our Github repositories for the latest code and guidelines.

• Intel® QAT OpenSSL Engine: https://github.com/intel/QAT_Engine

• Intel® QAT Async Mode Nginx: https://github.com/intel/asynch_mode_nginx

The fiber async infrastructure is available in official OpenSSL releases since the 1.1.0 series.
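As a point of reference for how an event-driven application typically consumes this infrastructure, the following is a simplified C sketch, assuming OpenSSL 1.1.0 or later with an async-capable engine (such as the QAT Engine) already loaded: it enables SSL_MODE_ASYNC and, when an operation reports SSL_ERROR_WANT_ASYNC, registers the engine's async file descriptors with epoll so the operation can be retried once they become readable. This is an illustration only, not the Async Mode Nginx implementation, and the helper name handshake_step is hypothetical; error handling is abbreviated.

/* Simplified illustration of OpenSSL async mode in an epoll-based loop. */
#include <openssl/ssl.h>
#include <openssl/async.h>
#include <sys/epoll.h>

/* Drive one non-blocking handshake step; returns 1 when the handshake is
 * done, 0 if the caller must wait for the registered fds, -1 on error. */
static int handshake_step(SSL *ssl, int epfd)
{
    SSL_set_mode(ssl, SSL_MODE_ASYNC);        /* crypto may complete asynchronously */

    int ret = SSL_do_handshake(ssl);
    if (ret == 1)
        return 1;                             /* handshake finished */

    int err = SSL_get_error(ssl, ret);
    if (err == SSL_ERROR_WANT_ASYNC) {
        /* The crypto operation was submitted to the engine. Register the
         * engine's async fds so the event loop wakes up when the response
         * is ready, then retry SSL_do_handshake() later. */
        OSSL_ASYNC_FD fds[8];
        size_t numfds = 0;

        if (!SSL_get_all_async_fds(ssl, NULL, &numfds) || numfds > 8)
            return -1;
        if (!SSL_get_all_async_fds(ssl, fds, &numfds))
            return -1;

        for (size_t i = 0; i < numfds; i++) {
            struct epoll_event ev = { .events = EPOLLIN, .data.ptr = ssl };
            epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
        }
        return 0;                             /* no blocking wait here */
    }
    if (err == SSL_ERROR_WANT_READ || err == SSL_ERROR_WANT_WRITE)
        return 0;                             /* normal non-blocking socket I/O */
    return -1;
}

The key property exercised by QTLS is visible even in this sketch: the worker never blocks on an individual offloaded operation, so other connections can be served while the accelerator works.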

A.3.2 Hardware dependencies
Two connected physical servers are required: one as the tested server and one as the client. At least one of the following QAT acceleration devices is required in the tested server:

• Intel® C62X Series Chipset
• Intel® Communications Chipset 8925 to 8955 Series
• Intel® Communications Chipset 8900 to 8920 Series

A.3.3 Software dependencies
• GNU C Library version 2.23 or later
• OpenSSL 1.1.0e or later
• Validated in CentOS 7.3 with kernel 3.10

A.4 Installation
A complete software environment in the tested server involves the installation of the QAT Driver, OpenSSL, the QAT Engine and Async Mode Nginx:

1. Intel® QAT Driver
• download the newest driver and guideline from the QAT website
• install the driver according to the guideline
2. OpenSSL (1.1.0e or later)
• download it from the OpenSSL website
• install it according to the README
3. QAT Engine (v0.5.37 or later)
• download it from our Github repository
• install it and configure the qat service according to the README
4. Async Mode Nginx
• download it from our Github repository
• install it according to the README

A.5 Experiment workflow
• Start Nginx in the tested server. One can modify the Nginx conf file for different configurations (refer to the README of Async Mode Nginx for more details).
• Launch benchmarks (e.g., OpenSSL s_time or ApacheBench) in the client server to evaluate the TLS performance of the running Nginx in the tested server.

A.6 Evaluation and expected result
Four configurations (SW, QAT+S, QAT+A and QAT+AH) can be evaluated with this artifact. A few notes on comparing the different configurations:

• Pay attention to hyper-threading, which has an obvious influence on the performance.
• Set CPU core affinity for both Nginx workers and the polling threads. Make sure that different configurations employ the same number of cores.
• Multiple benchmark processes may be needed in the client server to fully load the running Nginx.

After each QAT-related test, one can check how many crypto requests have been processed by the QAT accelerator:

[root@server]# cat /sys/kernel/debug/qat*/fw_counters

It is expected to see the following performance enhancements across the different configurations:

• QAT+S: in some cases, it provides higher performance compared to the SW configuration; in other cases, it provides similar or even lower performance.
• QAT+A: for the full handshake, it provides a significant performance boost compared to the SW configuration; in most other cases, it also provides an obvious performance enhancement.


• QAT+AH: benefiting from the heuristic polling scheme, it brings an additional performance enhancement (about 20% in most cases) over the QAT+A configuration.

A.7 Experiment customization
We extended the simple engine setting scheme in Nginx to an SSL Engine Framework, which provides flexible and powerful accelerator configuration directly in the Nginx conf file. One can customize the QAT offload behaviors through this framework, such as which crypto algorithms to offload, which polling method to use and the software/hardware switches. Besides, the number of Nginx workers and the TLS cipher suite to use can also be set in the Nginx conf file.

An example customization of the QAT offload behaviors is as follows:

worker_processes 8;
load_module modules/ngx_ssl_engine_qat_module.so;
...
ssl_engine {
    use qat_engine;
    default_algorithm RSA,EC,DH,PKEY_CRYPTO;
    qat_engine {
        qat_offload_mode async;
        qat_notify_mode poll;
        qat_poll_mode heuristic;
        qat_heuristic_poll_asym_threshold 48;
        qat_heuristic_poll_sym_threshold 24;
    }
}
...
