

A LANDSCAPE OF THE NEW DARK SILICON DESIGN REGIME


THE RISE OF DARK SILICON IS DRIVING A NEW CLASS OF ARCHITECTURAL TECHNIQUES THAT "SPEND" AREA TO "BUY" ENERGY EFFICIENCY. THIS ARTICLE EXAMINES FOUR RECENTLY PROPOSED DIRECTIONS ("THE FOUR HORSEMEN") FOR ADAPTING TO DARK SILICON, OUTLINES A SET OF EVOLUTIONARY DARK SILICON DESIGN PRINCIPLES, AND SHOWS HOW ONE OF THE DARKEST COMPUTING ARCHITECTURES, THE HUMAN BRAIN, OFFERS INSIGHTS INTO MORE REVOLUTIONARY DIRECTIONS FOR COMPUTER ARCHITECTURE.

Recent VLSI technology trends have led to a disruptive new regime for digital chip designers, where Moore's law continues but CMOS scaling provides increasingly diminished fruits. As in prior years, the computational capabilities of chips are still increasing by 2.8× per process generation. However, a utilization wall [1] limits us to only 1.4× of this benefit, causing large underclocked swaths of silicon area, hence the term dark silicon [2,3].

Fortunately, simple scaling theory makes the utilization wall easy to derive, helping us to think intuitively about the problem. Transistor density continues to improve by 2× every two years, and native transistor speeds improve by 1.4×. But transistor energy efficiency improves by only 1.4×, which, under constant power budgets, causes a 2× shortfall in the energy budget needed to power a chip at its native frequency. Therefore, our utilization of a chip's potential is falling exponentially by a jaw-dropping 2× per generation. Thus, if we are just bumping up against power limitations in the current generation, then in eight years, designs will be 93.75 percent dark!
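The compounding in that last sentence can be made concrete with a few lines (a sketch, not from the article; it simply iterates the 2× per-generation shortfall described above):

```python
def dark_fraction(generations):
    """Fraction of a chip that must stay dark after n process
    generations, given that utilization halves each generation."""
    utilization = 0.5 ** generations
    return 1.0 - utilization

# Four generations ~= eight years at two years per generation.
for n in range(1, 5):
    print(f"after {n} generation(s): {dark_fraction(n):.2%} dark")
# The fourth generation gives 93.75% dark, the figure quoted above.
```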

A recent paper refers to this widespread disruptive factor informally as the "dark silicon apocalypse" [4], because it officially marks the end of one reality (Dennard scaling [5]), where progress could be measured by improvements in transistor speed and count, and the beginning of a new reality (post-Dennard scaling), where progress is measured by improvements in transistor energy efficiency. Previously, we tweaked our circuits to reduce transistor delays and turbo-charged them with dual-rail domino to reduce fan-out-of-4 (FO4) delays. From now on, we will tweak our circuits to minimize the capacitance switched per function; we will strip our circuits down and starve them of voltage to squeeze out every femtojoule. Whereas once we would spend exponentially increasing quantities of transistors to buy performance, now we will spend these transistors to buy energy efficiency.

The CMOS scaling breakdown was the direct cause of industry's transition to multicore in 2005. Because filling chips with cores does not fundamentally circumvent utilization wall limits, multicore is not the final solution to dark silicon [3]; it is merely industry's initial, transitional response to the shocking onset of the dark silicon age. Increasingly over time, the semiconductor industry is adapting to this new design regime, realizing that multicore chips will not scale as transistors shrink and that the fraction of a chip that can be filled with cores running at full frequency is dropping exponentially with each process generation [1,3]. This reality forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark: either idle for long periods of time or significantly underclocked. As exponentially larger fractions of a chip's transistors become darker, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. The saved energy can then be applied to increase performance, to extend battery life, or to lower operating temperatures.

Michael B. Taylor, University of California, San Diego

IEEE Micro, September/October 2013. Published by the IEEE Computer Society, 0272-1732/13/$31.00 © 2013 IEEE

The utilization wall that causes dark silicon

Table 1 shows the derivation of the utilization wall [1] that causes dark silicon [2,3]. It employs a scaling factor, S, which is the ratio between the feature sizes of two processes (for example, S = 32/22 ≈ 1.4× between 32 and 22 nm). In both Dennard and post-Dennard scaling, the transistor count scales by S², and the transistor switching frequency scales by S. Thus, our net increase in computing performance is S³, or 2.8×.

However, to maintain a constant power envelope, these gains must be offset by a corresponding reduction in transistor switching energy. In both cases, scaling reduces transistor capacitance by S, improving energy efficiency by S. In Dennard scaling, we can scale the threshold voltage and thus the operating voltage, which yields another S² energy-efficiency improvement. However, in today's post-Dennard, leakage-limited regime, we cannot scale the threshold voltage without exponentially increasing leakage, and as a result, we must hold the operating voltage roughly constant. The end result is a shortfall of S², or 2× per process generation. This shortfall multiplies with each process generation, resulting in exponentially darker silicon over time.

This shortfall prevents multicore from being the solution to scaling [1,3]. Although advancing a single process generation would provide enough transistors to increase core count by 2×, and frequency could be 1.4× faster, the energy budget permits only a 1.4× total improvement. Per Figure 1, across two process generations (S = 2), designers could increase core count by 2× while leaving frequency constant, increase frequency by 2× while leaving core count constant, or choose some middle ground between the two. The remaining 4× of potential remains inaccessible.
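The bookkeeping above can be sketched directly (illustrative only; the specific core/frequency splits are the ones named in the text):

```python
# Post-Dennard budget across two process generations (S = 2).
S = 2.0
transistor_budget = S ** 2   # 4x more cores fit in the same area
frequency_budget = S         # each core could run 2x faster
raw_potential = S ** 3       # 8x potential throughput from both combined
energy_budget = S            # but constant power permits only 2x total

# Any (core-count, frequency) split must fit within the energy budget:
# 2x cores at 1x frequency, 1x cores at 2x frequency, or middle ground.
options = [(S, 1.0), (1.0, S), (2 ** 0.5, 2 ** 0.5)]
for cores, freq in options:
    assert cores * freq <= energy_budget + 1e-9

inaccessible = raw_potential / energy_budget
print(f"inaccessible potential: {inaccessible:.0f}x")  # 4x, as stated above
```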

More positively stated, the true new potential of Moore's law is a 1.4× energy-efficiency improvement per generation, which could be used to increase performance by 1.4×. Additionally, if we could somehow make use of dark silicon, we could do even better.

Although the utilization wall is based on a first-order model that simplifies many factors, it has proved to be an effective tool for designers to gain intuition about the future, and it has been remarkably accurate (see the sidebar "Is Dark Silicon Real? A Reality Check"). Follow-up work [6-8] has looked at extending this early work [1,3] on dark silicon and multicore scaling with more sophisticated models that incorporate factors such as application space and cache size.

Dark silicon misconceptions

Let's clear up a few misconceptions before proceeding. First, dark silicon does not mean blank, useless, or unused silicon; it's just silicon that is not used all the time, or not at its full frequency. Even during the best days of CMOS scaling, microprocessors and other circuits were chock full of "dark logic" used infrequently or only for some applications. For instance, caches are inherently dark because the average cache transistor switches in far less than one percent of cycles, and FPUs remain dark in integer codes.

Table 1. Dennard vs. post-Dennard (leakage-limited) scaling [1]. In contrast to Dennard scaling [5], which held until 2005, under the post-Dennard regime the total chip utilization for a fixed power budget drops by S² with each process generation. The result is an exponential increase in dark silicon for a fixed-size chip.

  Transistor property             Dennard   Post-Dennard
  Δ Quantity                      S²        S²
  Δ Frequency                     S         S
  Δ Capacitance                   1/S       1/S
  Δ V_DD²                         1/S²      1
  ⇒ Δ Power = Δ(QFCV²)            1         S²
  ⇒ Δ Utilization = 1/Δ Power     1         1/S²

Soon, the exponential growth of dark silicon area will push us beyond logic targeted for direct performance benefits toward swaths of low-duty-cycle logic that exists not for direct performance benefit, but for improving energy efficiency. This improved energy efficiency can then allow an indirect performance improvement, because it frees up more of the fixed power budget to be used for even more computation.

The four horsemen

Recently, researchers proposed a taxonomy, the four horsemen, that identifies four promising directions for dealing with dark silicon as we transition beyond the initial multicore stop-gap solution. These responses originally appeared to be unlikely candidates, carrying unwelcome burdens in design, manufacturing, or programming. None is ideal from an aesthetic engineering

[Figure 1: Example, 65 nm → 32 nm (S = 2). A 65-nm chip with 4 cores at 1.8 GHz can become, at 32 nm, 4 cores at 2 × 1.8 GHz (12 cores dark), 2 × 4 cores at 1.8 GHz (8 cores dark, 8 dim; industry's choice), or a point along the spectrum of trade-offs between number of cores and frequency. Chips are 75% dark after two generations and 93% dark after four.]

Figure 1. Multicore scaling leads to large amounts of dark silicon [3]. Across two process generations, there is a spectrum of trade-offs between frequency and core count; these include increasing core count by 2× but leaving frequency constant (top), and increasing frequency by 2× but leaving core count constant (bottom). Any of these trade-off points will have large amounts of dark silicon.

Is Dark Silicon Real? A Reality Check

A quick survey of recent designs from multicore outfits such as Tilera, Intel, and AMD indicates that industry has pursued core count and frequency combinations consistent with the utilization wall. For instance, Intel's 90-nm single-core Prescott chip ran at 3.8 GHz in 2004. Dennard scaling would suggest that a 22-nm multicore version should run at 15.5 GHz and contain 17 superscalar cores, for a total improvement of 69× in instruction throughput. Instead, the upcoming 2013 22-nm Intel Core i7 4960X runs at 3.6 GHz and has six superscalar cores, a 5.7× peak serial instruction throughput improvement. The darkness ratio is thus 91.74 percent, versus the 93.75 percent predicted by the utilization wall. The latest 2012 International Technology Roadmap for Semiconductors also shows that scaling has proceeded consistently with post-Dennard predictions.
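The sidebar's arithmetic can be checked in a few lines (a sketch using only the numbers quoted in the sidebar; rounding choices shift the darkness figure slightly from the quoted 91.74 percent):

```python
# Scaling factor from Prescott's 90-nm node to the i7 4960X's 22-nm node.
S = 90 / 22                                # ~4.09

# Dennard-scaling prediction: S^2 more cores, S times the frequency.
pred_cores = round(S ** 2)                 # 17 superscalar cores
pred_freq = 3.8 * S                        # ~15.5 GHz
pred_throughput = pred_cores * pred_freq / 3.8   # ~69x vs. 1 core at 3.8 GHz

# What actually shipped: 6 cores at 3.6 GHz.
actual_throughput = 6 * 3.6 / 3.8          # ~5.7x

darkness = 1 - actual_throughput / pred_throughput
print(f"predicted: {pred_throughput:.0f}x, actual: {actual_throughput:.1f}x")
print(f"darkness ratio: {darkness:.1%}")
```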


point of view. But the success of complex multiregime devices such as metal-oxide-semiconductor field-effect transistors (MOSFETs) has shown that engineers can tolerate complexity if the end result is better. Future chips are likely to employ not just one horseman, but all of them, in interesting and unique combinations.

The shrinking horseman

When confronted with the possibility of dark silicon, many chip designers insist that area is expensive, and that they would just build smaller chips instead of having dark silicon in their designs. Among the four horsemen, these "shrinking chips" are the most pessimistic outcome. Although all chips may eventually shrink somewhat, the ones that shrink the most will be those for which dark silicon cannot be applied fruitfully to improve the product. These chips will rapidly turn into low-margin businesses for which further generations of Moore's law provide small benefit. Below is an examination of the spectrum of second-order effects associated with shrinking chips.

Cost side of shrinking silicon. Understanding shrinking chips requires considering semiconductor economics. The "build smaller chips" argument has a ring of truth; after all, designers spend much of their time trying to meet area budgets for existing chip designs. But exponentially smaller chips are not exponentially cheaper; even if silicon begins as 50 percent of system cost, after a few process generations, it will be a tiny fraction. Mask costs, design costs, and I/O pad area will fail to be amortized, leading to rising costs per mm² of silicon, which ultimately will eliminate incentives to move the design to the next process generation. These designs will be "left behind" on older generations.

Revenue side of shrinking silicon. Shrinking silicon can also shrink the chip's selling price. In a competitive market, if there is a way to use the next process generation's bounty of dark silicon to attain a benefit to the end product, then competition will force companies to do so. Otherwise, they will generally be forced into low-end, low-margin, high-competition markets, and their competitors will take the high end and enjoy high margins. Thus, in scenarios where dark silicon could be used profitably, decreasing area in lieu of exploiting it would certainly decrease system costs, but would catastrophically decrease sale price. Hence, the shrinking-chips scenario is likely to happen only if we can find no practical use for dark silicon.

Power and packaging issues with shrinking chips. A major consequence of exponentially shrinking chips is a corresponding exponential rise in power density. Recent analysis of many-core thermal characteristics has shown that peak hotspot temperature rise can be modeled as T_max = TDP × (R_conv + k/A), where T_max is the rise in temperature, TDP is the target chip thermal design power, R_conv is the heat sink's thermal convection resistance (lower is a better heat sink), k incorporates many-core design properties, and A is chip area [8]. If area drops exponentially, the second term dominates and chip temperatures rise exponentially. This in turn will force a lower TDP so that temperature limits are met, reducing scaling below even the nominal 1.4× expected energy-efficiency gain. Thus, if thermals drive your shrinking-chip strategy, it is much better to hold frequency constant and increase core count by 1.4× with a net area decrease of 1.4× than it is to increase frequency by 1.4× and shrink your chip by 2×.
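The model's behavior under shrinking area is easy to see numerically. The sketch below evaluates T_max = TDP × (R_conv + k/A); the values of r_conv and k are placeholder constants chosen for illustration, not values from the cited analysis:

```python
def hotspot_rise(tdp_watts, area_mm2, r_conv=0.5, k=100.0):
    """Peak hotspot temperature rise per the model quoted above.
    r_conv and k are illustrative placeholders, not measured values."""
    return tdp_watts * (r_conv + k / area_mm2)

# Hold TDP fixed and shrink the die by 2x per step: the k/A term
# dominates and the temperature rise grows roughly exponentially.
for area in (100.0, 50.0, 25.0, 12.5):
    print(f"{area:6.1f} mm^2 -> {hotspot_rise(100, area):7.1f} K rise")
```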

The dim horseman

As exponentially larger fractions of a chip's transistors become dark, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that spend area to buy energy efficiency. If we move past unhappy thoughts of shrinking silicon and consider populating dark silicon area with logic that we use only part of the time, then we are led to some interesting new design possibilities.

The term dim silicon refers to techniques that put large amounts of otherwise-dark silicon area to productive use by employing heavy underclocking or infrequent use to meet the power budget; that is, the architecture strategically manages the chip-wide transistor duty cycle to enforce the overall power constraint [8,9]. Whereas early 90-nm designs such as Cell and Prescott were dimmed because actual power exceeded design-estimated power, we are converging on increasingly elegant methods that make better trade-offs.

Dim silicon techniques include dynamically varying the frequency with the number of cores in use, scaling up the amount of cache logic, employing near-threshold voltage (NTV) processor designs, and redesigning the architecture to accommodate bursts that temporarily allow the power budget to be exceeded, such as Turbo Boost and computational sprinting [10].

Turbo Boost 1.0. Although first-generation multicores had a ship-time-determined top frequency that was invariant of the number of currently active cores, Intel's Turbo Boost 1.0 enabled second-generation multicores to make real-time trade-offs between active core count and the frequency at which those cores run: the fewer the cores, the higher the frequency. When Turbo Boost is enabled, it uses the energy gained from turning off cores to increase the voltage, and then the frequency, of the active cores. This technique, known as dynamic voltage and frequency scaling (DVFS), increases power proportionally to the cube of the increase in frequency.
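The cubic relationship puts a hard ceiling on how far turning off cores can boost the rest. A minimal sketch (assuming, as in the classic DVFS model, that voltage scales roughly linearly with frequency, so dynamic power ~ C·V²·f ~ f³):

```python
def dynamic_power(freq, base_freq=1.0, base_power=10.0):
    """Relative dynamic power under DVFS, assuming V scales with f,
    so P ~ C * V^2 * f grows with the cube of frequency."""
    scale = freq / base_freq
    return base_power * scale ** 3

# Turning off half the cores frees roughly 2x power headroom for the
# rest, but cubic scaling converts that into only a cube-root boost:
headroom = 2.0
boost = headroom ** (1 / 3)
print(f"frequency boost from 2x power headroom: {boost:.2f}x")  # ~1.26x
```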

NTV processors. In the past, DVFS was also used to save cubic power when frequencies were decreased. However, today, processor manufacturers operate transistors at reduced voltages, around 2.5× the threshold voltage, an energy-delay optimal point. This point is right at the edge of an operating regime where frequency starts to drop precipitously as voltage is reduced, which makes downward DVFS much less effective.

Nonetheless, researchers have begun to explore this regime. One recent approach is near-threshold voltage (NTV) logic [11], which operates transistors in the near-threshold regime slightly above the threshold voltage, providing more palatable trade-offs between energy and delay than subthreshold circuits, for which frequency drops exponentially with voltage decreases. Researchers have explored wide-SIMD NTV processors [12], which seek to exploit data parallelism, along with NTV many-core processors [13] and an NTV x86 processor [14].

Although NTV per-processor performance drops faster than the corresponding savings in energy per instruction (a 5× energy improvement for an 8× performance cost), the performance loss can be offset by using 8× more processors in parallel if the workload allows it. Then, an additional 5× more processors could turn the energy-efficiency gains into additional performance. So, with ideal parallelization, NTV could offer a 5× throughput improvement by absorbing 40× the area. But this would also require 40× more free parallelism in the workload, relative to the parallelism consumed by an equivalent energy-limited super-threshold many-core processor.
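The bookkeeping above reduces to two multiplications; a short sketch using only the article's quoted figures:

```python
# NTV trade-off quoted above: each core is 8x slower but 5x more
# energy-efficient per instruction.
energy_gain = 5.0    # energy-per-instruction improvement
slowdown = 8.0       # per-core performance cost

cores_to_match = slowdown        # 8x cores just to recover lost throughput
cores_for_gain = energy_gain     # 5x more to spend the freed-up energy
total_area = cores_to_match * cores_for_gain

print(f"area (and spare parallelism) required: {total_area:.0f}x")  # 40x
print(f"throughput gain at fixed power:        {energy_gain:.0f}x")  # 5x
```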

In practice, for many applications, 40× additional parallelism can be elusive. For chips with large power budgets that can already sustain hundreds of cores, applications with this much spare parallelism are relatively rare. Interestingly, because of this effect, NTV's applicability across applications increases in low-energy environments, because the energy-limited baseline super-threshold design has consumed less of the available parallelism. Furthermore, NTV clearly becomes more applicable for workloads with extremely large amounts of parallelism.

NTV presents several circuit-related challenges that have seen active investigation, especially because technology scaling will exacerbate rather than ameliorate these factors. A significant NTV challenge has been susceptibility to process variability. As operating voltages drop, variation in transistor threshold due to random dopant fluctuation is proportionally higher, and leakage and operating frequency can vary greatly. Because NTV designs can expand area consumption by approximately 8× or more, variation issues are exacerbated. Other challenges include the penalties involved in designing low-operating-voltage static RAMs (SRAMs) and the increased interconnect energy consumption due to greater spreading of computation across cores.

Bigger caches. An often-proposed dim-silicon alternative is to simply allocate otherwise-dark silicon area to caches. Because only a subset of cache transistors (such as a wordline) is accessed each cycle, cache memories have low duty cycles and thus are inherently dark. Compared to general-purpose logic, a level-1 (L1) cache clocked at its maximum frequency can be about 10× darker per square millimeter, and larger caches can be even darker. Thus, adding cache is one way to simultaneously increase performance and lower power density per square millimeter. We can imagine, for instance, expanding per-core cache at a rate that soaks up the remaining dark silicon area: 1.4 to 2× more cache per core per generation. However, many applications do not benefit much from additional cache, and upcoming TSV-integrated DRAM will reduce the cache benefit for those applications that do.

Computational sprinting and Turbo Boost. Other techniques employ "temporal dimness" as opposed to "spatial dimness," temporarily exceeding the nominal thermal budget while relying on thermal capacitance to buffer against temperature increases, and then ramping back to a comparatively dark state. Intel's Turbo Boost 2.0 uses this approach to boost performance until the processor reaches its nominal temperature, relying on the heat sink's innate thermal capacitance. ARM's big.LITTLE employs four A15 cores until the thermal envelope is exceeded (anecdotally, after about 10 seconds), then switches over to four lower-energy, lower-performance A7 cores. Computational sprinting carries this a step further, employing phase-change materials that let chips exceed their sustainable thermal budget by an order of magnitude for several seconds, providing a short but substantial computational boost. These modes are especially useful for "race to finish" computations, such as webpage rendering, for which response latency is important, or for which speeding up the transition of both the processor and its support logic to a low-power state reduces energy consumption.

The specialized horseman

The specialized horseman uses dark silicon to implement a host of specialized coprocessors, each either much faster or much more energy efficient (100 to 1,000×) than a general-purpose processor [1]. Execution hops between coprocessors and general-purpose cores, running where it is most efficient. The unused cores are power- and clock-gated to keep them from consuming precious energy. Unlike dim silicon, which tends to focus on manipulating voltages, frequencies, and duty cycles as ways to manage power, specialized logic focuses on reducing the amount of capacitance that must be switched to perform a particular operation.

The promise of a future of widespread specialization is already being realized: we are seeing a proliferation of specialized accelerators that span diverse areas such as baseband processing, graphics, computer vision, and media coding. These accelerators enable orders-of-magnitude improvements in energy efficiency and performance, especially for computations that are highly parallel. Recent proposals have extrapolated this trend and anticipate that the near future will see systems comprising more coprocessors than general-purpose processors [1,7]. This article refers to these systems as coprocessor-dominated architectures, or CoDAs.

As specialization grows to combat the dark silicon problem, we are faced with a modern-day specialization "Tower of Babel" crisis that fragments our notion of general-purpose computation and eliminates the traditional clear lines of communication between programmers, software, and the underlying hardware. Already, we see the deployment of specialized languages such as CUDA that are not portable between similar architectures (for example, AMD and Nvidia). We see overspecialization problems that cause accelerators to become inapplicable to closely related classes of computations (such as double-precision scientific codes running incorrectly on a GPU's non-IEEE-compliant floating-point hardware). Adoption problems are also caused by the excessive costs of programming heterogeneous hardware (such as the slow uptake of the Sony PlayStation 3 versus the Xbox). Moreover, specialized hardware risks obsolescence as standards are revised (for example, a JPEG standard revision).

Insulating humans from complexity. These factors speak to potentially exponential increases in design, verification, and programming effort for these CoDAs. Combating the Tower of Babel problem requires defining a new paradigm for how specialization is expressed and exploited in future processing systems. We need new scalable architectural schemas that employ pervasively specialized hardware to minimize energy and maximize performance while at the same time insulating the hardware designer and programmer from such systems' underlying complexity.

Overcoming Amdahl-imposed limits on specialization. Amdahl's law provides an additional roadblock for specialization. To save energy across the majority of the computation, we must find broad-based specialization approaches that apply to both regular, parallel code and irregular code. We must also ensure that communication between specialized processors doesn't fritter away their energy savings on costly cross-chip transfers or shared-memory accesses.

Recent efforts. The UCSD GreenDroid processor (see Figure 2) [3,15] is one such CoDA-based system that seeks to address both complexity issues and Amdahl limits. GreenDroid is a mobile application processor that implements Android mobile-environment hotspots using hundreds of specialized cores called conservation cores, or c-cores [1,9]. C-cores, which target both irregular and regular code, are automatically generated from C or C++ source code, and they support a patching mechanism that lets them track software changes. They attain an estimated 8 to 10× energy-efficiency improvement, with no loss in serial performance, even on nonparallel code, and without any user or programmer intervention.

Unlike NTV processors, c-cores need not find additional parallelism in the workload to cover a serial performance loss. Thus, c-cores are likely to work across a wider range of workloads, including collections of serial programs. However, for highly parallel workloads in which execution time is loosely concentrated, NTV processors might hold an area advantage because of their reconfigurability.

Other specialized processors, such as the University of Wisconsin-Madison's DySER [16] and the University of Michigan's Beret [17], propose alternative architectures that exploit specialization like c-cores do, but focus on improving reconfigurability at the cost of energy savings. Recent efforts have also examined the use of approximate neural-network-based computing as an elegant way to package programmability, reconfigurability, and specialization [18].

The "deus ex machina" horseman

Of the four horsemen, this is by far the most unpredictable. "Deus ex machina" refers to a plot device in literature or theater in which the protagonists seem increasingly doomed until the very last moment, when something completely unexpected comes out of nowhere to save the day. For dark silicon, one deus ex machina would be a breakthrough in semiconductor devices. However, as we shall see, the breakthroughs required would have to be quite fundamental; in fact, we would most likely have to build transistors out of devices other than MOSFETs. Why? Because MOSFET leakage is set by fundamental principles of device physics and is limited to a subthreshold slope of 60 mV/decade at room temperature. This corresponds to a 10× reduction in leakage current for every 60 mV that the threshold voltage sits above Vss, and is determined by the properties of thermionic emission of carriers across a potential well. Thus, although innovations such as Intel's FinFET/TriGate transistor and high-K dielectrics represent significant achievements in maintaining a subthreshold slope close to historical values, they still remain within the scope of MOSFET-imposed limits, and they are one-time improvements rather than scalable changes.
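The 60 mV/decade limit translates directly into a leakage budget. A small sketch of that relationship (illustrative; the 30 mV/decade figure anticipates the TFET discussion that follows):

```python
def leakage_reduction(margin_mv, slope_mv_per_decade=60.0):
    """Factor by which subthreshold leakage drops when the threshold
    voltage sits margin_mv above the off voltage, for a given slope."""
    return 10.0 ** (margin_mv / slope_mv_per_decade)

# Room-temperature MOSFET: 10x less leakage per 60 mV of margin.
print(f"120 mV margin at 60 mV/decade: {leakage_reduction(120):.0f}x")
# A steeper (sub-60 mV/decade) device buys far more for the same margin.
print(f"120 mV margin at 30 mV/decade: {leakage_reduction(120, 30.0):.0f}x")
```

This is why threshold voltage cannot be scaled down freely: every 60 mV shaved off the margin costs 10× in leakage, which is the root of the post-Dennard constraint on operating voltage.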

Two VLSI candidates that bypass these limits because they are not based on thermal injection are tunnel field-effect transistors (TFETs),19 which are based on tunneling effects, and nanoelectromechanical system (NEMS) switches,20 which are based on physical relays. TFETs are reputed to have subthreshold slopes on the order of 30 mV/decade, twice as good as the ideal MOSFET, but with lower on-currents than MOSFETs, limiting their use in high-performance circuits. NEMS devices have essentially a near-zero subthreshold


14 IEEE MICRO


DARK SILICON


slope but slow switching times. Both TFETs and NEMS devices thus hint at orders-of-magnitude improvements in leakage but remain untamed and fall short of being integrated into real chips.

Realizing the importance of the fourth horseman, a recent $194 million DARPA/MARCO STARnet program is funding four centers, each focusing on a key direction for beyond-CMOS approaches: developing electron spin-based memory computation devices (C-SPIN), formulating new information-processing models that can leverage statistical (that is, nondeterministic)

Figure 2. The GreenDroid architecture, an example of a coprocessor-dominated architecture (CoDA). The GreenDroid Mobile Application Processor comprises 16 nonidentical tiles (a). Each tile (b) holds components common to every tile—the CPU, on-chip network (OCN), and shared level-1 (L1) data cache—and provides space for multiple conservation cores, or c-cores, of various sizes. A variety of in-tile networks (c) connect components and c-cores.


SEPTEMBER/OCTOBER 2013 15


beyond-CMOS devices (SONIC), engineering nonconventional atomic-scale engineered materials (FAME), and creating new devices that extend prior work on TFETs to operate at even lower voltages (LEAST).

Evolutionary design principles for dark silicon

While researchers work to mature the new ideas represented by the four horsemen, what principles should guide today's designs that must tackle dark silicon? Listed below is a set of evolutionary, rather than revolutionary, dark silicon design principles, motivated by the changing trade-offs created by dark silicon:

• Moving to the next generation will provide an automatic 1.4× energy-efficiency increase. Figure out how you will use it. As a baseline, chip capabilities will scale with energy, whether it is allocated to frequency or more cores. You can increase or decrease frequency or transistor counts, but transistors switched per unit time can increase by only 1.4×.

• The next generation will create a large amount of dark area. Determine, for your domain, how to trade mostly dark area for energy. If the die area is fixed, any scaling is going to yield a surplus of transistors. Which combination of the four horsemen is most effective in your domain? Should you go dim, with more caches? Underclocked arrays of cores? NTV on top of that? Add accelerators or c-cores? Use new kinds of devices? Shrink your chip?

• Pipelining makes less sense than it used to. Figure out if faster transistor delays will allow you to fit more logic in a pipeline stage without reducing frequency. Pipelining increases duty cycle and introduces additional capacitance in circuits (registers, prediction circuits, bypassing, and clock-tree fan-out), neither of which is dark silicon friendly. Reducing pipeline depth and increasing FO4 depth reduces capacitive overhead. Note, too, that excessive pipelining and frequency exacerbate the gap between processing and memory.

• Architectural multiplexing and logic sharing are becoming increasingly questionable optimizations. See if they still make sense. Sharing introduces additional energy consumption because it requires sharers to have longer wires to the shared logic, and it introduces additional performance and energy overheads from the control logic that manages the sharing. For example, architectures that have repositories of nonshared state sharing physical pipelines (such as large-scale multithreading) pay large wire capacitances inside these memories to share that state. As area gets cheaper, it will make less sense to pay these overheads, and the degree of sharing will decrease so that the energy cost of pulling state out of these state repositories is reduced.

• Multiplexing and RAMs that facilitate sharing of program data are still a good idea. Keep them. If different threads of control are truly sharing data, multiplexed structures, such as shared RAMs or crossbars, are often still more efficient than coherence protocols or other schemes.

• Architectural techniques for saving transistors should be applied only if they do not worsen energy efficiency. Transistors are getting exponentially cheaper, and we can't use them all at once, so why are we trying to save transistors? Locally, transistor-saving optimizations make sense, but an exponential wind is blowing against these optimizations in the long run.

• Power rails are the new clocks. Design with them in mind. Ten years ago, it was a big step to move beyond a few clock domains. Now, chips can have hundreds of clock domains, all with their own clock gates. With dark silicon, we will see the same effect with power rails: we will have hundreds and maybe thousands of power rails in the future, all with their own power gates, to manage the leakage of the many heterogeneous system components.

• Heterogeneity results from the shift from a 1D objective function (performance) to a 2D objective function (performance and energy). Design with the shape of this function in mind. Past designs lacked heterogeneity because they were largely measured along a single axis: performance. To first order, there was a single optimal design point. Now that performance and energy are both important, a Pareto curve trades off performance and energy, and there is no one optimal design across that curve; there are many optimal points. Optimal designs will incorporate several such points across these curves.
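As a toy illustration of the pipelining principle above (the model and all capacitance values are hypothetical, not taken from the article), per-operation dynamic energy grows with pipeline depth because every stage adds register capacitance that switches each cycle:

```python
# Illustrative model of why deep pipelines are less dark-silicon friendly:
# each extra stage adds pipeline-register capacitance on top of the logic.

def energy_per_op(logic_cap_ff, reg_cap_per_stage_ff, stages, vdd_v=1.0):
    """Dynamic energy per operation, E = C_total * V^2 (femtojoules)."""
    total_cap = logic_cap_ff + reg_cap_per_stage_ff * stages
    return total_cap * vdd_v ** 2

# Same logic, shallow vs. deep pipeline (capacitance values are illustrative):
shallow = energy_per_op(logic_cap_ff=100.0, reg_cap_per_stage_ff=10.0, stages=5)
deep = energy_per_op(logic_cap_ff=100.0, reg_cap_per_stage_ff=10.0, stages=20)
print(shallow, deep)  # 150.0 300.0 -- 4x the stages, 2x the energy per operation
```

The sketch omits the clock tree, prediction, and bypassing overheads the article mentions, all of which make the deep pipeline look even worse.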

These rules of thumb will guide our existing designs along an evolutionary path to become increasingly dark silicon friendly. But what then of more revolutionary approaches?

Insights from the brain: a dark technology

Perhaps one promising indicator that low-duty-cycle "dark technology" can be mastered, unlocking new application domains, is the efficiency and density of the human brain. The brain, even today, can perform many tasks that computers cannot, especially vision-related tasks. With 80 billion neurons and 100 trillion synapses operating at less than 100 mV, the brain is an existence proof of highly parallel, reliable, and dark operation, and it embodies three of the horsemen: dim, specialized, and deus ex machina. Neurons operate with extremely low duty cycles compared to processors: at best, 1 kilohertz. Although computing with silicon-simulated neurons introduces excessive "interpretive" overheads (neurons and transistors have fundamentally different properties), the brain can offer us insight and long-term ideas about how we can redesign systems for the extremely low duty cycles and low voltages called for by dark silicon. Here are some of these properties, which may give us insight into more revolutionary extensions to the evolutionary principles proposed in the previous section:

• Specialization. As with the specialized horseman, different groups of neurons serve different functions in cognitive processing, connect to different sensory organs, and allow reconfiguration, evolving over time synaptic connections customized to the computation.

• Very dark operation. Neurons fire at a maximum rate of approximately 1,000 switches per second. Compare this to arithmetic logic unit (ALU) transistors, which toggle three billion times per second. The most active neuron's activity is a millionth of that of processing transistors in today's processors.

• Low-voltage operation. Brain cells operate at approximately 100 mV, yielding CV² energy savings of 100× versus 1-V operation, in a clear parallel to the dim horseman's NTV circuits. Communication is low swing and low voltage, saving large amounts of energy.

• Limited sharing and memory multiplexing. Any given neuron can switch only 1,000 times per second, by definition, so it must have extremely limited sharing, because a point of multiplexing would be a bottleneck in parallel processing. The human visual system starts with 6 million cones in the retina, similar to a 2-megapixel display, processes the image with local neurons, and then sends it over the 1-million-neuron optic nerve to the visual cortex. There is no central memory store; each pixel has its own set of ALUs, so to speak, so energy waste due to multiplexing is minimal.

• Data decimation. The human brain reduces the data size at each step and operates on concise but approximate representations. If 2 megapixels suffice to handle color-related vision tasks, why use more than that? Larger sensors would just require more neurons to store and compute on the data. We should ensure that we are processing no more data than necessary to achieve the final outcome.

• Analog operation. The neuron performs a more complex basic operation than the typical digital transistor. On the input side, neurons combine information from many other neurons; on the output side, despite producing rail-to-rail digital pulses, they encode multiple bits of information via spike timings. Could this suggest that there are more efficient ways to map operations onto silicon-based technologies? In RF wireless front-end communications, analog processing enables computations that would be impossible to do at speed digitally. However, analog techniques might not scale well to deep-nanometer technology.

• Fast, static, "gather, reduce, and broadcast" operators. Neurons have fan-out and fan-in of approximately 7,000 to other neurons located significant distances away. Effectively, they can perform efficient operations that combine vector-style gather memory accesses to large numbers of static memory locations with a vector-style reduction operator and a broadcast. Do more efficient ways exist for implementing these operations in silicon? Such operators could be useful for computations that operate on finite-sized static graphs.
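The gather-reduce-broadcast pattern described in the last bullet can be sketched as one synchronous step over a small static graph (the graph, weights, and function name here are hypothetical illustrations, not from the article):

```python
# Sketch of "gather, reduce, broadcast" on a static graph: each node gathers
# values from its fixed fan-in, reduces them (here, a weighted sum), and the
# result is visible to its fan-out on the next step -- loosely analogous to
# a neuron integrating its inputs and firing to its synaptic targets.

def gather_reduce_broadcast(fan_in, weights, values):
    """One synchronous step: new[i] = sum of w * values[j] over j in fan_in[i]."""
    return [
        sum(w * values[j] for j, w in zip(fan_in[i], weights[i]))
        for i in range(len(fan_in))
    ]

# Tiny 3-node static graph (hypothetical): node 2 listens to nodes 0 and 1, etc.
fan_in  = [[1], [0, 2], [0, 1]]
weights = [[1.0], [0.5, 0.5], [1.0, 1.0]]
state   = [1.0, 2.0, 3.0]
state = gather_reduce_broadcast(fan_in, weights, state)
print(state)  # [2.0, 2.0, 3.0]
```

Because the graph and weights are static, an efficient silicon implementation could in principle hardwire the access pattern rather than rediscover it every step.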

Recently, both the EU and US governments have proposed initiatives to enable greater study of the computational capabilities of the brain. Although brain-inspired computing has already come and gone several times in the brief history of manmade computers, dark silicon may cause these approaches to become increasingly relevant.

Although silicon is getting darker, for researchers the future is bright and exciting. Dark silicon will cause a transformation of the computational stack and provide many opportunities for investigation. MICRO

Acknowledgments

This work was partially supported by NSF awards 0846152, 1018850, 0811794, and 1228992, by Nokia and AMD gifts, and by STARnet, an SRC program sponsored by MARCO and DARPA. I thank the anonymous reviewers for their valuable insights and suggestions.

References

1. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Architectural Support for Programming Languages and Operating Systems Conf., ACM, 2010, pp. 205-218.
2. R. Merritt, "ARM CTO: Power Surge Could Create 'Dark Silicon,'" EE Times, 22 Oct. 2009.
3. N. Goulding et al., "GreenDroid: A Mobile Application Processor for a Future of Dark Silicon," Hot Chips Symp., 2010.
4. M. Taylor, "Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse," Proc. 49th Ann. Design Automation Conf. (DAC 12), ACM, 2012, pp. 1131-1136.
5. R.H. Dennard, "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. SC-9, 1974, pp. 256-268.
6. H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ACM SIGARCH Computer Architecture News, vol. 39, no. 3, 2011, pp. 365-376.
7. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
8. W. Huang et al., "Scaling with Design Constraints: Predicting the Future of Big Chips," IEEE Micro, vol. 31, no. 4, 2011, pp. 16-29.
9. J. Sampson et al., "Efficient Complex Operators for Irregular Codes," Proc. 17th Int'l Symp. High Performance Computer Architecture (HPCA 11), IEEE CS, 2011, pp. 491-502.
10. A. Raghavan et al., "Computational Sprinting," Proc. IEEE 18th Int'l Symp. High-Performance Computer Architecture (HPCA 12), IEEE CS, 2012, doi:10.1109/HPCA.2012.6169031.
11. R. Dreslinski et al., "Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits," Proc. IEEE, vol. 98, no. 2, 2010, pp. 253-266.
12. E. Krimer et al., "Synctium: A Near-Threshold Stream Processor for Energy-Constrained Parallel Applications," IEEE Computer Architecture Letters, Jan. 2010, pp. 21-24.
13. D. Fick et al., "Centip3de: A 3930 DMIPS/W Configurable Near-Threshold 3D Stacked System with 64 ARM Cortex-M3 Cores," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2012, pp. 190-192.
14. S. Jain et al., "A 280 mV-to-1.2 V Wide-Operating-Range IA-32 Processor in 32 nm CMOS," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2012, pp. 66-68.
15. N. Goulding-Hotta et al., "The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future," IEEE Micro, vol. 31, no. 2, 2011, pp. 86-95.
16. V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically Specialized Datapaths for Energy Efficient Computing," Proc. IEEE 17th Int'l Symp. High-Performance Computer Architecture (HPCA 11), IEEE CS, 2011, doi:10.1109/HPCA.2011.5749755.
17. S. Gupta et al., "Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM, 2011, pp. 12-23.
18. H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs," Proc. 45th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE CS, 2012, pp. 449-460.
19. A. Ionescu et al., "Tunnel Field-Effect Transistors as Energy-Efficient Electronic Switches," Nature, 17 Nov. 2011, pp. 329-337.
20. F. Chen et al., "Demonstration of Integrated Micro-Electro-Mechanical Switch Circuits for VLSI Applications," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2010, pp. 150-151.

Michael B. Taylor is an associate professor in the Department of Computer Science and Engineering at the University of California, San Diego, where he leads the Center for Dark Silicon. His research interests include dark silicon, chip design, parallelization tools, and Bitcoin computing systems. Taylor has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Michael B. Taylor, 9500 Gilman Drive, MC 0404 EBU 3B 3202, La Jolla, CA 92093-0404; [email protected].
