

Sirius Implications for Future Warehouse-Scale Computers

Demand is expected to grow significantly for cloud services that deliver sophisticated artificial intelligence on the critical path of user queries, as is the case with intelligent personal assistants such as Apple’s Siri. If this predicted trend holds, these types of applications will likely consume most of the world’s computing cycles. The Sirius project was motivated to investigate what this future might look like and how cloud architectures should evolve to achieve it.

“… Ultimately, that’s why [Clarity Lab] is running [the] Sirius project. The Apples and Googles and … Microsoft know how this new breed of service operates, but the rest of the world doesn’t. And they need to.”1

In this article, we discuss Sirius, an open end-to-end intelligent personal assistant (IPA) application, modeled after popular IPA services such as Apple’s Siri. Sirius leverages well-established open infrastructures for speech recognition, image recognition, and question-answering systems. We use Sirius to investigate the performance, power, and cost implications of hardware accelerator-based server architectures for future datacenter designs. Among the popular acceleration options, including GPUs, Intel Phi, and field-programmable gate arrays (FPGAs), the FPGA-accelerated server is the best server option for a homogeneous datacenter design when the design objective is to minimize latency or maximize energy efficiency with a latency constraint. The FPGA achieves an average 16 times reduction on the query latency across various query types over the baseline multicore system. GPUs provide the highest total cost of ownership (TCO) reduction on average. GPU-accelerated servers can achieve an average 10 times query-latency reduction, translating to a 2.6 times TCO reduction (while FPGA-accelerated servers achieve 1.4 times TCO reduction). When excluding FPGAs as an acceleration option, GPUs provide the best latency and cost reduction among the rest of the accelerator choices. Replacing FPGAs with GPUs leads to a 66 percent longer latency, but in return achieves a 47 percent TCO reduction.

Johann Hauswald, Michael A. Laurenzano, Yunqi Zhang, Cheng Li, Austin Rovinski, Arjun Khurana, Ronald G. Dreslinski, and Trevor Mudge, University of Michigan; Vinicius Petrucci, Federal University of Bahia; Lingjia Tang and Jason Mars, University of Michigan

42 Published by the IEEE Computer Society 0272-1732/16/$33.00 © 2016 IEEE

Motivation

Siri, Google’s Google Now, and Microsoft’s Cortana represent a class of emerging Web service applications known as IPAs. An IPA is an application that uses inputs such as the user’s voice, vision (images), and contextual information to provide assistance by answering questions in natural language, making recommendations, and performing actions. These IPAs are emerging as one of the fastest-growing Internet services; they recently have been deployed on well-known platforms such as iOS, Android, and Windows Phone, making them ubiquitous on mobile devices worldwide. In addition, the usage scenarios for IPAs are rapidly increasing, with recent offerings in wearable technologies such as smart watches and glasses. Recent projections predict that the wearables market will reach 485 million annual device shipments by 2018.

In this article, we present Sirius, the first open source voice and vision IPA. Prior to the release of Sirius, large companies such as Apple, Google, Microsoft, and Amazon had a monopoly on the intelligent assistant application space. But without access to a representative open intelligent assistant application, researchers cannot investigate the system architecture implications of this type of workload. Sirius is the first effort to serve this purpose. The Sirius study provides insights on the landscape of current accelerator hardware in datacenters for a future in which demand for intelligent assistants grows radically.

Specifically, our study found that IPAs differ from many Web service workloads present in modern warehouse-scale computers (WSCs). In contrast to the queries of traditional browser-centric services, IPA queries stream through software components that leverage recent advances in speech recognition, natural language processing (NLP), and computer vision to provide users with a speech- and/or image-driven contextually based question-and-answer system. Owing to the computational intensity of these components and the large data-driven models they use, service providers house the required computation in massive datacenter platforms in lieu of performing the computation on the mobile devices themselves. This offloading approach is used by both Siri and Google Now as they send compressed recordings of voice commands and queries to datacenters for speech recognition and semantic extraction.2 However, datacenters have been designed and tuned for traditional Web services, and questions arise as to whether the current design employed by modern datacenters, composed of general-purpose servers, is suitable for emerging IPA workloads.

In particular, IPA queries require a significant amount of computational resources compared to traditional text-based Web services such as search. The computational resources required for a single leaf query, for example, are in excess of 100 times more than that of traditional Web search. Figure 1 illustrates the scaling of computational resources in a modern datacenter required to sustain an equivalent throughput of IPA queries. Because of the looming scalability gap shown in the figure, both academia and industry have expressed significant interest in leveraging hardware acceleration in datacenters using various platforms such as GPUs, many-core coprocessors, and FPGAs to achieve high performance and energy efficiency. To gain further insight on whether there are sufficient acceleration opportunities for IPA workloads and what the best acceleration platform is, we must address the following challenges:

- Identify critical computational and performance bottlenecks throughout the end-to-end lifetime of an IPA query.

- Understand the performance, energy, and cost tradeoffs among popular accelerator options given the characteristics of IPA workloads.

- Design future server and datacenter solutions that can meet the amount of future user demand while being cost and energy efficient.

However, the lack of a representative, publicly available, end-to-end IPA system is prohibitive for investigating the design space of future accelerator-based server designs for this emerging workload. To address this challenge, we constructed Sirius as an end-to-end stand-alone IPA service that implements an IPA’s core functionalities, such as speech recognition, image matching, NLP, and a question-and-answer system. Sirius takes as input user-dictated speech and images captured by a camera. A voice command primarily exercises speech recognition on the server side to execute a command on the mobile device. A voice query additionally leverages a sophisticated NLP question-and-answer system to produce a natural language response to the user. A voice and image question—such as, “When does this restaurant close?” coupled with an image of the restaurant—also leverages image matching with an image database and combines the matching output with the voice query to select the best answer for the user.

[Figure 1. Impact of higher computational requirements for intelligent personal assistant (IPA) queries on datacenters. Illustrated is the scaling of computational resources in a modern datacenter required to sustain an equivalent throughput of IPA queries compared to Web search.]

MAY/JUNE 2016 43

Sirius also provides the Sirius Suite, a benchmark suite composed of the seven computational bottlenecks on a query’s pathway that represent 92 percent of that query’s execution time. The Sirius Suite also includes implementations of these bottlenecks for a spectrum of computational substrates, including CPU, GPU, FPGAs, and many-core (Phi), all of which are included in the open source release.3 Sirius was the top-trending open source project on GitHub for the first few weeks after its release. In fact, Sirius and the Sirius Suite have been downloaded tens of thousands of times since their release and are already being used in research papers as well as production projects.

In designing Sirius, we performed investigative research by going to several large companies and talking with key engineers on relevant teams. We asked for insights as to the algorithms and approaches actually used in production, and, using this information, we integrated three services built using well-established open source projects representative of those found in commercial systems. These included Carnegie Mellon University’s Sphinx,4 which represented speech recognition based on the widely used Gaussian mixture model (GMM); Kaldi5 and RWTH Aachen University’s Speech Recognition System (RASR),6 which represented the industry’s recent trend toward speech recognition based on the deep neural network (DNN); OpenEphyra, which represented the state-of-the-art question-and-answer system based on IBM’s Watson7; and SURF8 implemented using OpenCV,9 which represented the state-of-the-art image-matching algorithms widely used in various production applications. We used these open source projects to steer the construction of both Sirius and the Sirius Suite.

Sirius: An End-to-End IPA

In this section, we describe the design objectives for Sirius, then present an overview of Sirius and a taxonomy of the query types it supports. Finally, we detail the underlying algorithms and techniques used by Sirius.

Design Objectives

We had three key objectives when designing Sirius. The first objective was completeness: Sirius should provide a complete IPA service that takes the input of human voice and images and provides a response to the user’s question with natural language. The second objective was representativeness: the computational techniques used by Sirius to provide this response should be representative of state-of-the-art approaches used in commercial domains. The third objective was deployability: Sirius should be deployable and fully functional on real systems.

Overview: Life of an IPA Query

Figure 2 presents a high-level diagram of the end-to-end Sirius query pipeline. The life of a query begins with a user’s voice and/or image input through a mobile device. The audio is processed by an automatic speech recognition (ASR) front end that translates the user’s speech question into text. The text then goes through a query classifier that decides whether the speech is an action or a question. For an action, the command is sent back to the mobile device for execution. Otherwise, the Sirius back end receives the question in plain text. The question-answering service extracts information from the input, searches its database, and chooses the best answer to return to the user. If an image accompanies the speech input, Sirius uses computer-vision techniques to match the input image to its image database and return relevant information about the matched image using the image-matching service. For example, a user can ask, “What time does this restaurant close?” while capturing an image of the restaurant via smart glasses.3 Sirius can then return an answer to the query based not only on the user’s speech but also on information from the image. Table A in “Sirius Web Extra” summarizes the queries supported by Sirius (see http://extras.computer.org/extra/mmi2016030042s1.pdf).

TOP PICKS

44 IEEE MICRO

Design of Sirius: IPA Services and Algorithmic Components

We leverage open infrastructures that use the same algorithms as commercial applications. Speech recognition in Google Voice has used speaker-independent GMMs and hidden Markov models (HMMs) and is adopting DNNs. The OpenEphyra framework used for question answering is an open source release from Carnegie Mellon University’s prior research collaboration with IBM on the Watson system.7 OpenEphyra’s NLP techniques, including conditional random field (CRF), have been recognized as state-of-the-art and are used at Google and in other industry question-answering systems.10 Our image-matching pipeline design is based on the widely used SURF algorithm. We implement SURF using the open source computer vision (OpenCV9) library, which is employed in commercial products from companies including Google, IBM, and Microsoft.

Automatic Speech Recognition Pipeline

The ASR inputs are feature vectors representing the speech segment. The ASR component relies on a combination of an HMM and either a GMM or a DNN. Sirius’s GMM-based ASR uses Sphinx,4 whereas the DNN-based ASR includes Kaldi5 and RASR.6

As Figure 3 shows, the HMM builds a tree of states for the current speech frame using input feature vectors. The GMM or DNN scores the probability of the state transitions in the tree, and the Viterbi algorithm11 then searches for the most likely path based on these scores. The path with the highest probability represents the final translated text output. The GMM scores HMM state transitions by mapping an input feature vector into a multidimensional coordinate system and iteratively scores the features against the trained acoustic model.
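To make the decoding step concrete, the sketch below runs log-space Viterbi over precomputed per-frame state scores (standing in for the GMM or DNN acoustic scores). It is an illustrative simplification, not Sirius's C/C++ implementation; the state names, transition scores, and emission scores are all hypothetical.

```python
import math

def viterbi(states, log_trans, log_emit, n_frames):
    """Find the most likely HMM state path given per-frame emission scores.

    log_trans[s][t]: log-probability of transitioning from state s to t.
    log_emit[f][s]:  log-score of state s for frame f (in a real ASR
                     system this comes from the GMM or DNN scorer).
    """
    # Initialise with the first frame's emission scores.
    best = {s: log_emit[0][s] for s in states}
    back = []
    for f in range(1, n_frames):
        prev, best, ptr = best, {}, {}
        for t in states:
            # Pick the best predecessor for state t at frame f.
            s = max(states, key=lambda s: prev[s] + log_trans[s][t])
            best[t] = prev[s] + log_trans[s][t] + log_emit[f][t]
            ptr[t] = s
        back.append(ptr)
    # Trace back the highest-probability path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

In a full decoder the emission scores arrive frame by frame and the search is beam-pruned; the dynamic-programming recurrence, however, is exactly the one above.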

Image-Matching Pipeline

The image-matching pipeline receives an input image, attempts to match it against images in a preprocessed image database, and returns information about the matched images. Image keypoints are extracted from the input image using the SURF algorithm.8

In feature extraction, the image is downsampled and convolved multiple times to find interesting keypoints at different scales (see Figure 4). The keypoints are passed to the feature descriptor component, where they are assigned an orientation vector, and similarly oriented keypoints are grouped into feature descriptors. The descriptors from the input image are matched to preclustered descriptors representing the database images using an approximate nearest-neighbor search.

[Figure 2. End-to-end diagram of the Sirius pipeline. Sirius is built of three core components that communicate with one another to service IPA queries.]
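The final matching step can be illustrated with a brute-force nearest-neighbor search over descriptor vectors. This is a pure-Python stand-in: the real pipeline uses an approximate nearest-neighbor search over preclustered SURF descriptors, and the second-best-distance pruning shown here is a common heuristic, not something the article specifies.

```python
def match_descriptors(query, database, ratio=0.7):
    """Match query descriptors to database descriptors by nearest neighbour.

    Brute-force stand-in for the approximate nearest-neighbour search the
    pipeline uses; a match is kept only when the best distance is clearly
    better than the second best (a common pruning heuristic).
    """
    def dist2(a, b):
        # Squared Euclidean distance between two descriptor vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    matches = []
    for qi, q in enumerate(query):
        # Sort database descriptors by distance to this query descriptor.
        d = sorted((dist2(q, db), di) for di, db in enumerate(database))
        if len(d) > 1 and d[0][0] < (ratio ** 2) * d[1][0]:
            matches.append((qi, d[0][1]))
    return matches
```

Real SURF descriptors are 64-dimensional floats; the structure of the search is the same regardless of dimensionality.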

Question-Answering Pipeline

The text output from the ASR is passed to OpenEphyra,7 which uses word stemming, regular expression matching, and part-of-speech tagging. Figure 5 shows a diagram of the OpenEphyra engine incorporating these components, generating Web search queries, and filtering the returned results. The Porter Stemming12 algorithm (stemmer) exposes a word’s root by matching and truncating common word endings. OpenEphyra also uses a suite of regular-expression patterns to match common query words. The CRF classifier2 takes a sentence, the position of each word in the sentence, and the label of the current and previous word as input to make predictions on the part of speech for each word of an input query. Filters using the same techniques extract information from the returned documents; the document with the highest overall score after score aggregation is returned.
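The stemmer's suffix-matching idea can be sketched as follows. This is a deliberately simplified illustration, not the full Porter algorithm, which applies ordered rule phases with conditions on the measure of the remaining stem; the suffix list here is an assumption for demonstration.

```python
def simple_stem(word):
    """Strip a common English suffix to expose a word's root.

    Simplified sketch of the suffix-matching idea behind Porter stemming:
    try suffixes longest-first and truncate the first one that leaves a
    plausibly long stem.
    """
    for suffix in ("ational", "ization", "fulness", "ement", "tion",
                   "ness", "able", "ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

On the article's running example, "elected" reduces to "elect", matching the "-ed" truncation shown in Figure 5.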

[Figure 3. Automatic speech recognition pipeline. The decoding stage receives speech features and uses either a Gaussian mixture model or a deep neural network in combination with a hidden Markov model to transcribe the speech to text.]

[Figure 4. Image-matching pipeline. Keypoints are first extracted from the image and are used to build descriptors (grouping of keypoints). These descriptors are used to find similar images in the database.]


Real System Analysis for Sirius

In this section, we present a real-system analysis of Sirius. The experiments throughout this section are performed using an Intel Haswell server.

Scalability Gap

To gain insights on the required resource scaling for IPA queries in modern datacenters, we juxtapose the computational demand of an average Sirius query with that of an average Web search query. We compare the average query latency for both applications on a single core at a very low load.

Figure 6a presents the average latency of both Web search using open source Apache Nutch (http://nutch.apache.org) and Sirius queries. The average Nutch-based Web search query latency is 91 ms on the Haswell-based server. Sirius’s query latency is significantly longer, averaging around 15 s across 42 queries spanning our three query classes (voice command, VC; voice query, VQ; and voice image query, VIQ). Based on this significant difference in computational demand, we perform a back-of-the-envelope calculation of how the computational resources (machines) in current datacenters must scale to match the throughput in queries for IPAs and Web search.

Figure 6b presents the number of machines needed to support IPA queries as the number of queries increases. The x-axis shows the ratio between IPA queries and traditional Web search queries. The y-axis shows the ratio of computational resources needed to support IPA queries relative to Web search queries. Current datacenter infrastructures will need to scale their computational resources to 165 times their current size when the number of IPA queries scales to match the number of Web search queries. We refer to this throughput difference as the scalability gap.
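The back-of-the-envelope calculation follows directly from the two measured latencies. The helper below is a hypothetical restatement of the relation plotted in Figure 6b, using the 91 ms and roughly 15 s figures quoted above.

```python
# Single-core latencies measured above: 91 ms per Web-search query,
# roughly 15 s per Sirius (IPA) query.
web_search_latency_s = 0.091
ipa_latency_s = 15.0

# With the same hardware, one IPA query costs roughly this many
# Web-search queries' worth of machine time.
cost_ratio = ipa_latency_s / web_search_latency_s  # about 165

def relative_machines(q_ipa_over_q_ws):
    """Computational resources needed for the IPA traffic, relative to the
    existing Web-search datacenter, when IPA query volume is the given
    fraction of Web-search volume (hypothetical helper mirroring Fig. 6b)."""
    return q_ipa_over_q_ws * cost_ratio
```

At an IPA-to-Web-search query ratio of 1.0, this reproduces the 165-times scaling cited above.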

[Figure 5. OpenEphyra question-answering pipeline. The QA system uses a combination of natural language processing techniques to generate search queries for the document database and score the returned queries to pick out the best answer.]

Cycle Breakdown of Sirius Services

To identify computational bottlenecks, we perform top-down profiling of hot algorithmic components for each service. Figure 7 presents the average cycle breakdown results. We identify the architectural bottlenecks for these hot components to investigate the performance improvement potential for a general-purpose processor. From our analysis, even with all stall cycles removed (that is, perfect branch prediction, infinite cache, and so on), the maximum speedup is bound by about three times. Considering the orders of magnitude difference indicated by the scalability gap, further acceleration is needed to bridge the gap.
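The roughly three-times ceiling follows from a simple cycle-accounting bound: eliminating every stall can speed a kernel up by at most total cycles divided by non-stall cycles. The sketch below assumes, hypothetically, that about two-thirds of cycles are stalls, which is what a three-times bound implies.

```python
def stall_bound(stall_fraction):
    """Upper bound on single-core speedup from eliminating all stall cycles.

    speedup_bound = total_cycles / (total_cycles - stall_cycles)
                  = 1 / (1 - stall_fraction)
    """
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall fraction must be in [0, 1)")
    return 1.0 / (1.0 - stall_fraction)
```

Because this bound stays in the low single digits for realistic stall fractions, it cannot close a gap measured in orders of magnitude, which motivates the accelerator study that follows.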

Accelerating Sirius

In this section, we describe the platforms and methodology used to accelerate the key components of Sirius. We also present and discuss the results of accelerating each of these components across four different accelerator platforms.

Accelerator Platforms

We used a total of four platforms, summarized in Table B online, to accelerate Sirius. Our baseline platform was an Intel Xeon Haswell CPU running single-threaded kernels.

Each accelerator platform has advantages and disadvantages. The multicore CPU offers high clock frequency and is not limited by branch divergence, but it has the least amount of threads available. The GPU is massively parallel, but it is also power hungry, requires a custom ISA, has large data transfer overheads, and offers limited branch divergence handling. Intel Phi has a many-core, standard programming model (same ISA); offers optional porting and compiler help; handles branch divergence; and has high bandwidth. On the other hand, it has data transfer overheads and relies on the compiler. Finally, FPGAs can be tailored to implement efficient computation and data layout for the workload, but they also run at a much lower clock frequency, are expensive, and are hard to develop for and maintain with software updates.

[Figure 6. Scalability gap and latency. (a) Average latency of Web search using Apache Nutch and Sirius queries. (b) The number of machines needed to support IPA queries as the number of queries increases. (c) Latency across query types.]

[Figure 7. Cycle breakdown per service. (a) ASR (Sphinx); (b) ASR (RASR); (c) QA (OpenEphyra); (d) IMM (SURF). Across the services, several hot components emerge as good candidates for acceleration.]


Sirius Suite: A Collection of IPA Computational Bottlenecks

We extracted Sirius’s key computational bottlenecks to construct a suite of benchmarks called the Sirius Suite. The Sirius Suite and its implementations across the described accelerator platforms are available alongside the end-to-end Sirius application.3 We ported existing open source C/C++ implementations available for each algorithmic component to our target platforms. We additionally implemented stand-alone C/C++ benchmarks based on Sirius’s source code where none were currently available. For each Sirius Suite benchmark, we built an input set representative of IPA queries. The baseline implementations are summarized in column 3 of Table 1. The table also shows the granularity at which each thread performs the computation on the accelerators.

Porting Methodology

The common porting methodology used across all platforms is to exploit the large amount of data-level parallelism available throughout the processing of a single IPA query.

Multicore CPU

We used the Pthread library to accelerate the kernels on the multicore platform by dividing the size of the data. Each thread is responsible for a range of data over a fixed number of iterations. This approach lets each thread run concurrently and independently, synchronizing only at the end of the execution.

GPU

We used Nvidia’s CUDA library to port the Sirius components to the Nvidia GPU. To implement each CUDA kernel, we varied and configured the GPU block and grid sizes to achieve high resource utilization, matching the input data to the best thread layout. We ported additional string manipulation functions not currently supported in CUDA for the stemmer kernel.

Intel Phi

We ported our Pthread versions to the Intel Phi platform, leveraging the target compiler’s ability to parallelize the loops on the target platform. To investigate this platform’s potential to facilitate ease of programming, we used the standard programming model and custom compiler to extract performance from the platform. As such, the results represent what can be accomplished with minimal programmer effort.

FPGA

We used previously published details of FPGA implementations for several of our Sirius benchmarks in this work. We designed our own FPGA implementations for GMM and stemmer and evaluated them on a Xilinx FPGA. Our full paper offers more details on the FPGA designs.14

Table 1. The Sirius Suite and granularity of parallelism.

| Service | Benchmark | Baseline | Input set | Granularity |
| Automatic speech recognition (ASR) | Gaussian mixture model | Sphinx4 | Hidden Markov model (HMM) states | For each HMM state |
| ASR | Deep neural network | RASR6 | HMM states | For each matrix multiplication |
| Question answering (QA) | Porter Stemming (stemmer) | Porter12 | 4M word list | For each individual word |
| QA | Regular expression (regex) | Super Light Regular Expression Library (SLRE; http://cesanta.com) | 100 expressions/400 sentences | For each regex-sentence pair |
| QA | Conditional random fields (CRFs) | CRFSuite13 | CoNLL-2000 shared task | For each sentence |
| Image matching (IMM) | Feature extraction (FE) | SURF8 | JPEG image | For each image tile |
| IMM | Feature description (FD) | SURF8 | Vector of keypoints | For each keypoint |

Accelerator Results

Figure 8 presents the performance speedup achieved by the Sirius kernels running on each accelerator platform. For the numbers from prior literature, we scaled the FPGA speedup number to match our FPGA platform based on the fabric usage and area reported in prior work. We used numbers from the literature for kernels (regex and CRF) that were already ported to the GPU and yielded better speedups than our implementations.

ASR

The GMM implementation had the best performance on the GPU (70 times) after optimizations. The FPGA implementation using a single GMM core achieved a speedup of 56 times; when fully utilizing the FPGA fabric, we achieved a 169 times speedup using three GMM cores. RWTH’s DNN includes both multithreaded and GPU versions out of the box. RWTH’s DNN parallelizes the entire framework (both HMM search and DNN scoring) and achieves good speedup in both cases. In the cases where we used a custom kernel or cited literature, we assumed a 3.7 times speedup for the HMM15 as a reasonable lower bound.

Question Answering

The NLP algorithms as a whole have similar performance across platforms because of the nature of the workload: high input variability with many test statements causes high branch divergence. The FPGA stemmer implementation achieved a 6 times speedup over the baseline with a single core using only 17 percent of the FPGA. Scaling the number of cores to fully utilize the FPGA's resources yielded a 30 times speedup over the baseline.
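The stemmer's parallelism granularity (each individual word) makes it embarrassingly parallel, which is what core replication on the FPGA exploits. A toy sketch of that structure, with a made-up suffix rule standing in for Porter's actual algorithm:

```python
from concurrent.futures import ThreadPoolExecutor

def toy_stem(word):
    # Toy suffix-stripping rule in the spirit of (but much simpler than)
    # Porter's algorithm; not the real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_all(words, workers=4):
    # Each word is independent, so the kernel parallelizes trivially:
    # more cores (or replicated FPGA stemmer cores) process more words.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(toy_stem, words))
```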

Image Matching

The image-processing kernels achieved the best speedup on the GPU, which uses heavily optimized OpenCV9 SURF implementations, yielding speedups of 10.5 and 120.5 times for feature extraction and feature description, respectively. The tiled multicore version yields good speedup, but the performance does not scale as well on the Phi because the number of tiles is fixed, which means there is little advantage to having more threads available. The GPU version has better performance because it uses a data layout explicitly optimized for a larger number of threads.
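The fixed tile count caps the available parallelism: splitting a frame into tiles yields only as many independent work units as there are tiles, so extra Phi threads sit idle. A sketch of such a tiling (a hypothetical helper, not the SURF code):

```python
def tile_image(height, width, tile_h, tile_w):
    """Split an image into fixed-size tiles as (y, x, h, w) tuples.

    Each tile can be processed by one thread, so the tile count is a
    hard ceiling on thread-level parallelism for this decomposition.
    """
    tiles = []
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            # edge tiles are clipped to the image boundary
            tiles.append((y, x, min(tile_h, height - y), min(tile_w, width - x)))
    return tiles
```

For example, a 1,920 x 1,080 frame cut into 128 x 128 tiles yields only 135 work units, far fewer than the thousands of threads a GPU can schedule, which is why a finer-grained data layout wins there.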

Implications for Future Server Design

We first investigated the end-to-end latency reduction and power efficiency achieved across server configurations for Sirius's services, including ASR, question answering, and image matching.

Latency Improvement

Figure 9 presents the end-to-end query latency across Sirius's services on a single leaf node configured with each accelerator. For question answering, we focused on the NLP components comprising 88 percent of the question-answering cycles, because search has already been well studied.

Our baseline in this figure, CMP, is the latency of the original algorithm implementations of Sirius running on a single core of an Intel Haswell server, described in Table B online. CMP (subquery) is our Pthreaded implementation of each service exploiting parallelism within a single query. This is executed on four cores (eight threads) of the Intel Haswell server. CMP (subquery) in general achieves a 25 percent latency reduction over the


Figure 8. Heat map of acceleration results. The heat map presents the results of accelerating each of the seven kernels (x-axis) across the four platforms (y-axis), with the darker color representing higher speedups.


TOP PICKS


50 IEEE MICRO


baseline. Across all services, the GPU and FPGA significantly reduce the query latency.

Datacenter Design

Next, we evaluate multiple design choices for datacenters composed of accelerated servers to improve performance (throughput) and reduce the TCO. We also investigate each platform's energy efficiency and throughput improvement, which can be found online in the "Energy Efficiency" and "Throughput Improvement" sections of the web extra.

TCO Analysis

Improving throughput lets us reduce the amount of computing resources (servers) needed to serve a given load. However, this may not necessarily lead to a reduction in a datacenter's TCO. Although reducing the number of machines leads to a reduction in the datacenter construction cost and power/cooling infrastructure cost, we may increase the per-server capital or operational expenditure cost, either from the additional accelerator purchase cost or from energy cost.

We performed our TCO analysis using the TCO model recently proposed by Google.16 Table D online gives the parameters used in our TCO model. We based the server price and power usage on a server configuration based on the OpenCompute Project: one Intel Xeon E3 1240 V3 3.4-GHz CPU, 32 Gbytes of RAM, and two 4-Tbyte disks.

Figure 10 presents the datacenter TCO with various accelerator options, normalized by the TCO achieved by a datacenter that uses only CMPs. FPGAs and GPUs provide high TCO reduction. We further discuss the TCO results, derive our datacenter designs, and present query-level results online.
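The tradeoff behind these results can be sketched with a toy version of such a TCO model. All parameters below are illustrative placeholders, not the values from Table D or the Google model:

```python
def datacenter_tco(n_servers, server_price, accel_price, server_watts,
                   price_per_kwh=0.07, years=3, infra_per_watt=10.0):
    """Toy TCO: server capex + power/cooling infrastructure + energy."""
    capex = n_servers * (server_price + accel_price)
    infrastructure = n_servers * server_watts * infra_per_watt
    kwh = n_servers * server_watts / 1000.0 * 24 * 365 * years
    return capex + infrastructure + kwh * price_per_kwh

# An accelerator raises per-server cost and power, but if it serves the
# same load with far fewer machines, fleet-level TCO can still drop.
baseline = datacenter_tco(100, 2000, 0, 300)
accelerated = datacenter_tco(20, 2000, 1500, 450)
```

With these made-up numbers, the smaller accelerated fleet has a lower three-year TCO despite the pricier, hotter servers, mirroring the qualitative argument above.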

Overall Latency Reduction Results

Using the derived datacenter design and query-level results, Figure 11 presents the latency reduction of these two accelerated datacenters and shows how homogeneous accelerated datacenters can significantly reduce the scalability gap, from the current 165 times resource scaling, shown in Figure 6, down to 16 and 10 times for GPU- and FPGA-accelerated datacenters, respectively.

Since the release of Sirius, hundreds of companies around the world have downloaded the Sirius code base. Large companies

(Figure 9 plots latency in seconds per platform; ASR values marked with an asterisk include DNN and HMM combined.)

Figure 9. Latency across platforms for each service. The speedups translate into latency gains, with the x-axis presenting the end-to-end latency of a single query across the four accelerator platforms.


such as Ford, Intel, GE, and Wells Fargo, as well as numerous medium-sized and startup companies, have demonstrated interest in integrating Sirius into their products. With the release of Sirius, we have opened up the type of technology previously expected only from the big cloud companies. Now any smaller outfit can have a nicely packaged end-to-end solution and a set of tools to design its own IPA. Thus, Sirius represents the democratization of intelligent assistants: the technology is now in everyone's hands. MICRO

References

1. C. Metz, "Voice Control Will Force an Overhaul of the Whole Internet," Wired, 24 Mar. 2015; www.wired.com/2015/03/voice-control-will-force-overhaul-whole-internet.
2. J. Lafferty, A. McCallum, and F.C.N. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ACM, 2001.
3. "Sirius: An Open End-to-End Voice and Vision Personal Assistant," 2015; http://sirius.clarity-lab.org.
4. D. Huggins-Daines et al., "PocketSphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices," Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, 2006; doi:10.1109/ICASSP.2006.1659988.
5. D. Povey et al., "The Kaldi Speech Recognition Toolkit," Proc. IEEE Workshop Automatic Speech Recognition and Understanding, 2011; http://infoscience.epfl.ch/record/192584.
6. D. Rybach et al., "RASR—The RWTH Aachen University Open Source Speech Recognition Toolkit," Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2011.
7. D. Ferrucci et al., "Building Watson: An Overview of the DeepQA Project," AI Magazine, vol. 31, no. 3, 2010, pp. 59–79.
8. H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision–ECCV 2006, Springer, 2006, pp. 404–417.
9. G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Media, 2008.
10. J. Dean et al., "Large Scale Distributed Deep Networks," Proc. 25th Conf. Advances in Neural Information Processing Systems, 2012, pp. 1232–1240.
11. G.D. Forney Jr., "The Viterbi Algorithm," Proc. IEEE, vol. 61, no. 3, 1973, pp. 268–278.
12. M.F. Porter, "An Algorithm for Suffix Stripping," Program: Electronic Library and Information Systems, vol. 14, no. 3, 1980, pp. 130–137.
13. N. Okazaki, "CRFSuite: A Fast Implementation of Conditional Random Fields (CRFs)," 2007; www.chokkan.org/software/crfsuite.


Figure 10. Total cost of ownership across platforms for each service. Bars show the benefits of accelerator DCs normalized to the TCO achieved by a DC that uses only CMPs.


Figure 11. Bridging the scalability gap. The significant latency reductions using GPUs and FPGAs in homogeneous DCs help bridge the scalability gap.


14. J. Hauswald et al., "Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2015, pp. 223–238.
15. J. Chong, E. Gonina, and K. Keutzer, "Efficient Automatic Speech Recognition on the GPU," GPU Computing Gems Emerald Edition, Morgan Kaufmann, 2011.
16. L.A. Barroso, J. Clidaras, and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd ed., Morgan & Claypool, 2013.

Johann Hauswald is a PhD candidate in the Computer Science and Engineering Department at the University of Michigan. He received an MS in computer science and engineering from the University of Michigan. Contact him at [email protected].

Michael A. Laurenzano is a PhD candidate in the Computer Science and Engineering Department at the University of Michigan. He received an MS in computer science and engineering from the University of California, San Diego. Contact him at [email protected].

Yunqi Zhang is a PhD candidate in the Computer Science and Engineering Department at the University of Michigan. He received an MS in computer science and engineering from the University of California, San Diego. Contact him at [email protected].

Cheng Li is a PhD student in the Computer Science and Engineering Department at the University of Illinois, Urbana–Champaign. She received an MS in computer science and engineering from the University of Michigan, where she completed the work for this article. Contact her at [email protected].

Austin Rovinski is an undergraduate student in electrical engineering at the University of Michigan. Contact him at [email protected].

Arjun Khurana is a master's student in the Computer Science and Engineering Department at the University of Michigan. He received a BSE in electrical engineering from the University of Michigan. Contact him at [email protected].

Ronald G. Dreslinski is an assistant professor in the Computer Science and Engineering Department at the University of Michigan. He received a PhD in computer science and engineering from the University of Michigan. Contact him at [email protected].

Trevor Mudge is the Bredt Family Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor. He received a PhD in computer science from the University of Illinois, Urbana–Champaign. Contact him at [email protected].

Vinicius Petrucci is an assistant professor in the Department of Computer Science at the Federal University of Bahia, Brazil. He received a PhD in computer science from the Fluminense Federal University, Brazil. Contact him at [email protected].

Lingjia Tang is an assistant professor in the Computer Science and Engineering Department at the University of Michigan. She received a PhD in computer science from the University of Virginia. Contact her at [email protected].

Jason Mars is an assistant professor in the Computer Science and Engineering Department at the University of Michigan. He received a PhD in computer science from the University of Virginia. Contact him at [email protected].
