

Spintronic In-Memory Pattern Matching Using Computational RAM (CRAM)

Zamshed I. Chowdhury, Student Member, IEEE, S. Karen Khatamifard, Member, IEEE, Zhengyang Zhao, Student Member, IEEE, Masoud Zabihi, Student Member, IEEE, Salonik Resch, Student Member, IEEE, Meisam Razaviyayn, Member, IEEE, Jian-Ping Wang, Fellow, IEEE, Sachin Sapatnekar, Fellow, IEEE, and Ulya R. Karpuzcu, Member, IEEE

Abstract—Traditional Von Neumann computing is falling apart in the era of exploding data volumes, as the overhead of data transfer becomes forbidding. Instead, it is more energy-efficient to fuse compute capability with the memory where the data reside. This is particularly critical for pattern matching, a key computational step in large-scale data analytics, which involves repetitive search over very large databases residing in memory. Emerging spintronic technologies show remarkable versatility for the tight integration of logic and memory. In this paper, we introduce SpinPM, a novel high-density, reconfigurable spintronic in-memory pattern matching substrate based on Computational RAM (CRAM) and Spin-Orbit-Torque (SOT) – specifically Spin-Hall-Effect (SHE) – switching, and demonstrate the performance benefit SpinPM can achieve over conventional and near-memory processing systems.

Index Terms—Computational Random Access Memory, Processing In Memory, Pattern Matching, SHE-MTJ.

    I. INTRODUCTION

Classical computing platforms are not optimized for efficient data transfer, which complicates large-scale data analytics in the presence of exponentially growing data volumes. Imbalanced technology scaling further exacerbates this situation by rendering data communication, and not computation, the critical bottleneck [1]. Specialization in hardware cannot help in this case unless conducted in a data-centric manner.

Tight integration of compute capability into the memory – processing in memory (PIM) – is especially promising, as the overhead of data transfer becomes forbidding at scale. The rich design space for PIM spans full-fledged processors and co-processors residing in memory [2]. Until the emergence of 3D-stacking, however, the incompatibility of state-of-the-art logic and memory technologies prevented practical prototype designs. Still, 3D-stacking can only achieve near-memory processing, NMP [3], [4], [5]. The main challenge remains fusing computation and memory without violating array regularity.

Emerging spintronic technologies show remarkable versatility for the tight integration of logic and memory. This paper introduces SpinPM, a high-density, reconfigurable spintronic in-memory compute substrate for pattern matching, which fuses computation and memory by using Spin-Orbit-Torque (SOT) – specifically, Spin-Hall-Effect (SHE) – as the switching principle to perform in-situ computation in spintronic memory arrays. The basic idea is adapting Computational RAM (CRAM) [6] to add compute capability to the magnetic tunnel junction (MTJ) based memory cell [7], [8], without breaking the array regularity. Thereby, each memory cell can participate in gate-level computation as an input or as an output.

This work was supported in part by NSF grant no. SPX-1725420. Zamshed I. Chowdhury, S. Karen Khatamifard, Zhengyang Zhao, Masoud Zabihi, Salonik Resch, Jian-Ping Wang, Sachin Sapatnekar, and Ulya R. Karpuzcu are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA, e-mail: [email protected]. Meisam Razaviyayn is with the University of Southern California, USA.

Computation is not disruptive, i.e., memory cells acting as gate inputs do not lose their stored values.

SpinPM can implement different types of basic Boolean gates that form a functionally complete set; therefore, there is no fundamental limit to the types of computation. Each column in a SpinPM array can have only one active gate at a time; however, computation in all columns can proceed in parallel. SpinPM provides true in-memory computing by reconfiguring cells within the memory array to implement logic functions. As all cells in the array are identical, the inputs and outputs of logic gates need not be confined to a specific physical location in the array. In other words, SpinPM can initiate computation at any location in the memory array.

Pattern matching is at the core of many important large-scale data analytics applications, ranging from bioinformatics to cryptography. The most prevalent form is string matching via repetitive search over very large reference databases residing in memory. Therefore, compute substrates such as SpinPM, which collocate logic and memory to prevent slow and energy-hungry data transfers at scale, have great potential. In this case, each step of computation attempts to map a short character string to (the most similar substring of) an orders-of-magnitude-longer character string, and repeats this process for a very large number of short strings, where the longer string is fixed and acts as a reference.

In this paper we detail the end-to-end design of the SpinPM accelerator for large-scale string matching. The design covers device, circuit, and architecture level details, including the programming interface. We evaluate SpinPM using representative benchmarks from emerging application domains to pinpoint its potential and opportunities for optimization. Specifically, Section II covers the basics of how SpinPM fuses compute with memory; Section III introduces a SpinPM implementation for pattern (string) matching; Sections IV and V provide the evaluation; Section VI compares and contrasts SpinPM with related work; and Section VII concludes the paper.

II. BASICS

A. Fusing Computation and Memory

Without loss of generality, SpinPM adapts Computational RAM (CRAM) [9] as the spintronic PIM substrate to accelerate large-scale string matching. In its most basic form, a CRAM array is essentially a 2D magneto-resistive RAM (MRAM). When compared to the standard 2T(ransistor)1M(TJ) MRAM cell, however, the array features an additional Logic Line (LL) per cell – as indicated in Fig. 1(a) and Fig. 1(b) – to connect cells to perform logic operations. A CRAM cell can operate as a regular MRAM memory cell or serve as an input/output to a logic gate.

Each MTJ consists of two layers of ferromagnets, termed the pinned and free layers, separated by a thin insulator. The magnetic spin orientation of the pinned layer is fixed; that of the free layer is controllable.

Fig. 1: (a) SpinPM cell; (b) 2-input gate formation in the array; (c) 2-input NOR gate circuit equivalent.

SpinPM uses Spin-Hall-Effect (SHE) MTJs, where – in order to separate the read and write paths for more effective optimization – a heavy metal layer (i.e., the SHE channel) is juxtaposed with the free layer. The SHE channel has a high resistance that brings down the write current, leading to lower energy consumption. Changing the spin orientation of the free layer entails passing a (polarized) current through the SHE channel, where the current direction sets the free layer orientation. The relative orientation of the free layer with respect to the pinned layer, i.e., anti-parallel (AP) or parallel (P), gives rise to two distinct MTJ resistance levels, Rhigh and Rlow, which encode logic 1 and 0, respectively.

Memory Configuration: When the array is configured as a memory, the Logic Line (LL) is active, i.e., connected to a voltage source. In the following, we detail the configuration of the Read Word Line (RWL) and Write Word Line (WWL) for the various memory operations:

• Data retention: RWL and WWL are set to 0 to isolate the cells and to prevent current flow through the MTJ and the SHE channel (which we refer to together as the SHE-MTJ).

• Read: RWL is set to 1 to connect each SHE-MTJ to its LL; WWL is set to 0. A small voltage pulse applied between LL and the Bit Select Line (BSL) induces a current through the SHE-MTJ, which is a function of the resistance level (i.e., the logic state) and which is captured by a sense amplifier attached to BSL.

• Write: WWL is set to 1 and transistor TM is switched on, to connect the SHE channel to its LL; RWL is set to 0. A large enough voltage pulse (on the order of the supply voltage) is applied between LL and BSL to induce a large enough current through LL and the SHE channel to change the spin orientation of the free layer. The three configurations are summarized in the sketch below.
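To make these configurations concrete, the following Python sketch models the three memory-mode settings at the behavioral level. The class and method names are illustrative, not from the paper; the resistance values reuse the SHE-MTJ RP/RAP entries of Table III.

    # Behavioral sketch of the memory-mode configurations (illustrative names).
    R_P, R_AP = 253.97e3, 507.94e3   # SHE-MTJ resistances from Table III (ohms)

    class SpinPMCell:
        def __init__(self, state=0):
            self.state = state       # 0: parallel (R_P), 1: anti-parallel (R_AP)

        def retain(self):
            # Data retention: RWL = WWL = 0 isolates the SHE-MTJ; no current flows.
            return {"RWL": 0, "WWL": 0}

        def read(self, v_pulse=0.1):
            # Read: RWL = 1 connects the SHE-MTJ to LL; a small pulse between LL
            # and BSL induces a state-dependent current, captured on BSL.
            r = R_AP if self.state else R_P
            return v_pulse / r       # current seen by the sense amplifier

        def write(self, value):
            # Write: WWL = 1 (TM on) connects the SHE channel to LL; a supply-level
            # pulse drives enough current through the channel to set the free layer.
            self.state = value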

Logic Configuration: In logic mode, LL establishes the connection between the inputs and outputs of logic gates. We also distinguish between two sets of BSL: even BSL (EBSL) and odd BSL (OBSL). These are utilized to connect all cells participating in computation, on a per-column basis, as input and output cells. Such cells may act as logic gate inputs or outputs, where the only restriction is having all inputs connected on one type of BSL, and the output on the other type. For each SpinPM input cell participating in computation, RWL is set to 1 to connect its MTJ to LL. For each SpinPM output cell participating in computation, WWL is set to 1 to turn the switch TM on, which in turn connects the SHE channel to LL. A voltage pulse applied between EBSL and OBSL induces a current, dependent on the resistance levels of the input cells, through the SHE channel of the output cell. As an example, Fig. 1(b) demonstrates the formation of a two-input logic gate in the array, where the cells labeled "0", "1", and "2" correspond to the inputs In0 and In1, and the output Out, respectively. The output cell is pre-set to a known logic value ("0" or "1"), depending on the type of logic operation to perform, by a standard write. Fig. 1(c) depicts the equivalent circuit: OBSL is grounded, while EBSL is set to voltage V0.

In0         In1         Out   IOut = I0 + I1
0 (Rlow)    0 (Rlow)    1     I00 > Icrit
0 (Rlow)    1 (Rhigh)   0     I01 < Icrit
1 (Rhigh)   0 (Rlow)    0     I10 = I01 < Icrit
1 (Rhigh)   1 (Rhigh)   0     I11 < Icrit

TABLE I: 2-input NOR truth table (Out pre-set = 0).

The value of V0 determines the current through the input MTJs, I0 and I1, as a function of their resistance values R0 and R1. Each input resistance captures the resistance of the corresponding SHE-MTJ, whereas the output resistance ROut is only the SHE channel resistance of the output cell. Input and output cells are connected to LL by setting the respective RWL and WWL to 1 (colored red in Fig. 1(b)). IOut = I0 + I1 flows through ROut. If IOut is higher than the critical switching current Icrit, it changes the free-layer orientation of Out's MTJ and, thereby, the logic state of Out. Otherwise, Out keeps its previous (pre-set) state.

We can easily expand this example to more than two inputs. The key observation is that we can change the logic state of the output as a function of the logic states of the inputs, within the array; the voltages applied between the BSLs of the participating cells dictate what these functions look like.

Continuing with the example from Fig. 1(b)/(c), let us try to implement a universal, 2-input NOR gate. Table I provides the truth table. Out would be 0 in this case for all input combinations but In0 = 0, In1 = 0, which incurs the lowest R0 and R1, and hence the highest IOut = I0 + I1. Let us refer to this value of IOut as I00, following Table I. Accordingly, if we pre-set Out to 0 (before computation starts) and determine V0 such that I00 exceeds Icrit while both I11 and I01 = I10 do not, Out switches from its pre-set value 0 to 1 only for the input combination In0 = 0, In1 = 0.

As Boolean gates of practical importance (such as NOR) are commutative, a single voltage level at the BSLs of the inputs suffices to define a specific logic function. Each voltage level can thus serve as a signature for a specific logic gate; in the following, we will refer to it as Vgate. In the example above, Vgate = VNOR. While the NOR gate is universal, we can implement different types of logic gates by following a similar methodology for mapping the corresponding truth tables to the SpinPM array.
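To illustrate the thresholding principle behind Table I, here is a minimal numerical sketch in Python. The normalized resistances, critical current, and V_NOR below are placeholders chosen only to satisfy the inequalities of Table I; they are not the calibrated values of Table III.

    # Current-summing gate evaluation (normalized, illustrative values).
    R_LOW, R_HIGH = 1.0, 2.0   # MTJ resistance for logic 0 / logic 1
    I_CRIT = 1.0               # critical switching current

    def evaluate_gate(inputs, v_gate, preset):
        """Each input drives I_k = v_gate / R_k into the output's SHE channel;
        the output flips from its preset iff the summed current exceeds I_CRIT."""
        i_out = sum(v_gate / (R_HIGH if bit else R_LOW) for bit in inputs)
        return (1 - preset) if i_out > I_CRIT else preset

    # Pick V_NOR so that only the 00 input combination exceeds I_CRIT:
    # I_00 = 2 * 0.6 / 1.0 = 1.2 > 1.0, while I_01 = 0.6 + 0.3 = 0.9 < 1.0.
    V_NOR = 0.6
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, evaluate_gate([a, b], V_NOR, preset=0))
    # Prints the NOR truth table: only (0, 0) yields 1.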

B. Basic Computational Blocks

We next introduce the basic SpinPM computational blocks for pattern matching: inverters (INV), buffers (COPY), 3-input and 5-input majority (MAJ) gates, and 1-bit full adders.

INV: INV is a single-input gate. Still, we can follow a methodology similar to the NOR implementation (Table I): pre-set the output to 0, and define VINV such that I0 (I1), i.e., the current if the input is 0 (1), is higher (lower) than Icrit, so that the output does (not) switch from the pre-set 0 to 1. By definition, I1 < I0 applies, as R1 > R0. However, in this case the input cell inevitably switches as well, due to the same current flowing through the input and output cells. INV is therefore a destructive gate, which can still be very useful if the input data is not needed in subsequent steps of computation – as is the case in our case studies.

COPY: For a 1-bit copy, two back-to-back invocations of INV suffice. A more time- and energy-efficient implementation, however, can perform the same function in one step as follows: pre-set the output to 1, and define VCOPY such that I0 (I1), i.e., the current if the input is 0 (1), is higher (lower) than Icrit, so that the output does (not) switch from the pre-set 1 to 0. By definition, I1 < I0 applies, as R1 > R0.

MAJ: MAJ gates accept an odd number of inputs and assign the majority (logic) state across all inputs to the output. The structure of a 3-input MAJ3 or 5-input MAJ5 gate is no different from the circuit structure in Fig. 1(c) except for the higher number of inputs. As an example, IOut of the MAJ3 gate assumes its highest value for the 000 assignment of the three inputs – as the resistances of the three inputs, R0, R1, and R2, assume their lowest values for 000.

Any input assignment having at least one 1 gives rise to a lower IOut than I000; having at least two 1s, to an even lower IOut. Finally, IOut reaches its minimum for the input assignment 111, for which the inputs assume their highest resistance. Accordingly, we can pre-set the output to 1 and define VMAJ3 such that IOut remains higher than Icrit only if the three inputs have fewer than two 1s, so that Out switches from the pre-set 1 to 0 to match the input majority. We can symmetrically define VMAJ5, assuming a pre-set of 1.

XOR: XOR is an especially useful gate for comparison; however, a single-gate SpinPM implementation is not possible. With a pre-set of 1, we would need Out to switch for 00 and 11 but not for 01 and 10; with a pre-set of 0, vice versa. However, due to I00 > I01 = I10 > I11, and assuming a pre-set of 1, we cannot let both I00 and I11 remain higher than Icrit (such that Out switches) while I01 = I10 remains lower than Icrit (such that Out does not switch). The same observation holds for a pre-set of 0, as well.

In0   In1   S1 = NOR(In0,In1)   S2 = COPY(S1)   Out = TH(In0,In1,S1,S2)
0     0     1                   1               0
0     1     0                   0               1
1     0     0                   0               1
1     1     0                   0               0

TABLE II: XOR implementation.

We can implement XOR using a combination of universal SpinPM gates such as NOR; thereby each XOR takes at least 4 steps (i.e., logic evaluations). For pattern matching, we instead rely on a more efficient 3-step implementation (Table II):

In Step-1, we compute S1 = NOR(In0, In1). In Step-2, we perform S2 = COPY(S1). In the final Step-3, we invoke a 4-input thresholding (TH) function, which renders a 1 only if its inputs contain more than two zeros: Out = TH(In0, In1, S1, S2).

TH has a pre-set of 0, and its operating principle is very similar to that of the majority gates, except that TH accepts an even number of inputs. We can further optimize this implementation and fuse Step-1 and Step-2 by implementing NOR as a two-output gate.
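Reusing evaluate_gate() and V_NOR from the sketch above, the 3-step XOR can be modeled as follows. V_TH = 0.3 is again a placeholder: with k zeros among the four inputs, the output current is v*(2 + k/2), so the preset-0 output switches for k >= 3 (1.05) but not for k = 2 (0.9).

    # 3-step XOR from NOR, COPY, and 4-input TH (Table II); illustrative model.
    V_TH = 0.3

    def xor(in0, in1):
        s1 = evaluate_gate([in0, in1], V_NOR, preset=0)  # Step-1: S1 = NOR(In0, In1)
        s2 = s1                                          # Step-2: S2 = COPY(S1)
        # Step-3: TH renders 1 only if more than two of its inputs are 0.
        return evaluate_gate([in0, in1, s1, s2], V_TH, preset=0)

    assert [xor(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))] == [0, 1, 1, 0]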

Full Adder: A full adder has three inputs, In0, In1, and the carry-in Ci; its two outputs are Sum and the carry-out Co. Like other logic functions, we can implement this adder using NOR gates. However, an implementation based on a pair of MAJ gates reduces the required number of steps significantly [10]. Fig. 2 provides a step-by-step overview:

Step-1: Co = MAJ(In0, In1, Ci)
Step-2: S1 = INV(Co)
Step-3: S2 = COPY(S1)
Step-4: Sum = MAJ(In0, In1, Ci, S1, S2)
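The MAJ-based adder can be checked against the same illustrative gate model. The voltages below are again placeholders, chosen so that the preset-1 MAJ outputs switch to 0 exactly when the inputs hold a majority of 0s, and the preset-0 INV output switches only for a 0 input.

    # Full adder from a MAJ3 + MAJ5 pair, reusing evaluate_gate() from above.
    V_INV, V_MAJ3, V_MAJ5 = 1.5, 0.45, 0.26

    def full_adder(in0, in1, ci):
        co   = evaluate_gate([in0, in1, ci], V_MAJ3, preset=1)          # Step-1
        s1   = evaluate_gate([co], V_INV, preset=0)                     # Step-2: INV(Co)
        s2   = s1                                                       # Step-3: COPY(S1)
        sum_ = evaluate_gate([in0, in1, ci, s1, s2], V_MAJ5, preset=1)  # Step-4
        return sum_, co

    # Exhaustive check against binary addition:
    for bits in range(8):
        a, b, c = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
        assert full_adder(a, b, c) == ((a + b + c) & 1, (a + b + c) >> 1)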

C. Column-level Parallelism

SpinPM can perform only one type of logic function in a column at a time. This is because there is only one LL spanning the entire column, and any cell within the column that participates in computation gets directly connected to this LL (Section II-A).

On the other hand, the voltage levels on the BSLs determine the type of the logic function, where each BSL spans an entire column. Furthermore, each WWL and RWL – which connect cells participating in computation to LL – spans an entire row. Therefore, all columns can perform the very same logic function in parallel, on the same set of rows. In other words, SpinPM supports a special form of SIMD (single instruction, multiple data) parallelism, where the instruction translates into a logic gate/operation, and the data into the input cells in each column, across all columns, which span the very same rows.

Multi-step operations are carried out in each column independently, one step at a time, while all columns operate in parallel. The output of each logic step performed within a column stays in that column and can serve as an input to subsequent logic steps (performed in the same column). All columns follow the same sequence of operations at the same time. In the case of a multi-bit full adder, as an example, the carry and sum bits are generated in the same column as the input bits, and are used in subsequent 1-bit additions in the very same column.
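The execution model (though not the circuit) can be pictured as a vectorized operation: one gate type, one set of rows, all columns at once. The following NumPy sketch uses arbitrary array dimensions for illustration.

    import numpy as np

    # SIMD execution model: one logic step applies the same gate to the same
    # rows of every column in parallel (contents and shape are illustrative).
    array = np.random.randint(0, 2, size=(2048, 10_000))   # rows x columns

    def columnwise_nor(row_a, row_b, row_out):
        # Every column evaluates NOR of its cells in rows row_a and row_b,
        # depositing the result in its own row_out cell.
        array[row_out, :] = 1 - (array[row_a, :] | array[row_b, :])

    columnwise_nor(0, 1, 2)   # 10,000 NOR gates evaluate in a single step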

To summarize, SpinPM can have some or all columns computing in parallel, or the entire array serving as memory. Regular memory reads and writes cannot proceed simultaneously with computation. Large-scale pattern matching problems can greatly benefit from this execution model, as we demonstrate next.

III. SPINTRONIC PATTERN MATCHING

Pattern matching is a key computational step in large-scale data analytics. The most common form by far is character string matching, which involves repetitive search over very large databases residing in memory. Therefore, compute substrates such as SpinPM, which collocate logic and memory to avoid the latency and energy overhead of expensive data transfers, have great potential. Moreover, comparison operations dominate the computation, and these represent excellent acceleration targets for SpinPM. As a representative and important large-scale string matching problem, in the following we will use DeoxyriboNucleic Acid (DNA) sequence pre-alignment [11] as a running example, without loss of generality, and expand SpinPM's evaluation to other string matching benchmarks in Section V.

At each step, DNA sequence pre-alignment tries to map a short character string to (the most similar substring of) an orders-of-magnitude-longer character string, and repeats this process for a very large number of short strings, where the longer string is fixed and acts as a reference. For each string, the characters come from the alphabet A(denine), C(ytosine), G(uanine), and T(hymine). The long string represents a complete genome; the short strings, short DNA sequences (from the same species). The goal is to extract the region of the reference genome to which the short DNA sequences correspond. We will refer to each short DNA sequence as a pattern, and to the long reference genome as the reference.

Fig. 2: Full adder implementation [9]. Output of each gate is depicted in red.

Aligning each pattern to the most similar substring of the reference usually involves character-by-character comparisons to derive a similarity score, which captures the number of character matches between the pattern and the (aligned substring of the) reference. Improving the throughput performance, in terms of the number of patterns processed per second in an energy-efficient manner, is especially challenging, considering that a representative reference (i.e., the human genome) can be around 10^9 characters long, that at least 2 bits are necessary to encode each character, and that a typical pattern dataset can have hundreds of millions of patterns to match [12]. SpinPM can help here due to its reduced data transfer overhead and (column-)parallel comparison/similarity score computations.

By effectively pruning the search space, DNA pre-alignment can significantly accelerate DNA sequence alignment – which, besides complex pattern matching in the presence of errors, includes pre- and post-processing steps typically spanning (input) data transformation for more efficient processing, search space compaction, or (output) data re-formatting. The execution time share of pattern matching (accounting for possible complex errors in the patterns and the reference) can easily reach 88% in highly optimized GPU implementations of popular alignment algorithms [13].¹ In the following, we will only cover basic pattern matching (which can still account for basic error manifestations in the patterns and the reference) within the scope of pre-alignment.

Mapping any computational task to the SpinPM array translates into co-optimizing the data layout, the data representation, and the spatio-temporal schedule of logic operations, to make the best use of SpinPM's column-level parallelism. This entails distributing the data to be processed, i.e., the reference and the patterns, in a way such that each column can perform independent computations.

The data representation itself, i.e., how we encode each character of the pattern and reference strings, has a big impact on both the storage and the computational complexity. Specifically, the data representation dictates not only the type, but also the spatio-temporal schedule, of (bit-wise) logic operations. Spatio-temporal scheduling, on the other hand, should take intermediate results during computation into account, which may or may not be discarded (i.e., overwritten), and which may or may not overwrite existing data, as a function of the algorithm or array size limitations.

A. Data Layout & Data Representation

We use the data layout captured by Fig. 3, folding the long reference over multiple SpinPM columns. Each column has four dedicated compartments to accommodate a fragment of the folded reference; one pattern; the similarity score (for the pattern when aligned to the corresponding fragment of the reference); and intermediate data (which we will refer to as scratch). The same format applies to each column, for efficient column-parallel processing. Each column contains a different fragment of the reference.


¹For this implementation of the common Burrows-Wheeler Aligner (BWA) algorithm, the time share of the pattern matching kernel, the inexact match caller, increases from 46% to 88% as the number of base mismatches allowed (an input parameter to the algorithm) is varied from one to four (both representing typical values).

Fig. 3: Data layout per SpinPM array.

We determine the number of rows allocated to each of the four compartments as follows. In the DNA pre-alignment problem, the reference corresponds to a genome and can therefore be very long; the species determines the length. As a case study for large-scale pattern matching, in this paper we use the approximately 3×10^9-character-long human genome. Each pattern, on the other hand, represents the output of a DNA sequencing platform, which biochemically extracts the locations of the four characters (i.e., bases) in a given (short) DNA strand. Hence, the sequencing technology determines the maximum length per pattern; around 100 characters is typical for modern platforms processing short DNA strands [14]. The size of the similarity score compartment, which keeps the character-by-character comparison results, is a function of the pattern length. Finally, the size of the scratch compartment depends on both the reference fragment length and the pattern length.

While the reference length and the pattern length are problem-specific constants, the (reference) fragment length, as determined by the folding factor, is a SpinPM design parameter. By construction, each fragment should be at least as long as each pattern. The maximum fragment length, on the other hand, is limited by the maximum possible SpinPM column height, considering the maximum affordable capacitive load (hence, RC delay) on column-wide control lines such as BSL and LL. However, column-level parallelism favors shorter fragments (for the same reference length): the shorter the fragments, the more columns the reference occupies, and the more columns – hence regions of the reference – are "pattern-matched" simultaneously.

For the data representation, we simply use 2 bits to encode the four (base) characters; hence, each character-level comparison entails two bit-level comparisons.
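As a concrete illustration (the 'A' and 'T' codes below follow Fig. 4a; the 'C' and 'G' assignments are arbitrary, since any bijection works):

    # 2-bit base encoding; character comparison becomes two bit-level XORs.
    ENCODE = {'A': (0, 0), 'C': (0, 1), 'T': (1, 0), 'G': (1, 1)}

    def encode(s):
        # Flatten a character string into its bit-level representation.
        return [bit for ch in s for bit in ENCODE[ch]]

    print(encode("GAT"))   # [1, 1, 0, 0, 1, 0]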

B. Proof-of-Concept SpinPM Design

SpinPM comprises two computational phases, which Algorithm 1 captures at the column level: match (i.e., aligned bit-wise comparison) and similarity score computation. As each column performs the very same computation in parallel, in the following we detail column-level operations.

Algorithm 1: 2-phase pattern matching at column level

    loc = 0
    while loc < len(fragment) − len(pattern) do
        Phase-1: Match (Aligned Comparison)
            align pattern to location loc of the reference fragment;
            (bit-wise) compare the aligned pattern to the fragment
        Phase-2: Similarity Score Computation
            count the number of character-wise matches;
            derive the similarity score from the count
        loc++
    end while


In Algorithm 1, len(fragment) and len(pattern) represent the (character) lengths of the reference fragment and the pattern, respectively; and loc, the index of the fragment string where we align the pattern for comparison. The computation in each column starts with aligning the fragment and the pattern string, from the first character location of the fragment onward. For each alignment, a bit-wise comparison of the fragment and pattern characters comes next. The outcome is a len(pattern)-bit-long string, where a 1 (0) indicates a character-wise (mis)match. We will refer to this string as the match string. The number of 1s in the match string hence acts as a measure of how similar the fragment and the pattern are when aligned at that particular character location (loc per Algorithm 1).

A reduction tree of 1-bit adders counts the number of 1s in the match string to derive the similarity score. Once the similarity score is ready, the next iteration starts. This process continues until the last character of the pattern reaches the last character of the fragment, when aligned.

Phase-1 (Match, i.e., Aligned Comparison): Each aligned character-wise comparison gives rise to two bit-wise comparisons, each performed by a 2-input XOR gate. Fig. 4a provides an example, where we compare the base character 'A' (encoded by '00') of the fragment with the base characters 'A' (i) and 'T' (encoded by '10') (ii) of the pattern. A 2-input NOR gate converts the 2-bit comparison outcome to a single bit, which renders a 1 (0) for a character-wise (mis)match. Recall that a NOR gate outputs a 1 only if both of its inputs are 0, and that an XOR gate generates a 0 only if both of its inputs are equal. The implementation of these gates follows Section II-B.
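Functionally, Phase-1 for a single alignment reduces to the following sketch. In SpinPM the XOR and NOR evaluations run serially within each column (and in parallel across columns); the helper below only mirrors the logic.

    def match_string(fragment_bits, pattern_bits):
        """Per character: two bit-level XORs, then a NOR of the XOR outputs,
        yielding 1 for a character-wise match and 0 for a mismatch."""
        out = []
        for i in range(0, len(pattern_bits), 2):
            x0 = fragment_bits[i] ^ pattern_bits[i]          # first bit of character
            x1 = fragment_bits[i + 1] ^ pattern_bits[i + 1]  # second bit
            out.append(1 if (x0 | x1) == 0 else 0)           # NOR of the XOR outputs
        return out

    # 'A' (00) vs 'A' (00) matches; 'A' (00) vs 'T' (10) does not, as in Fig. 4a.
    assert match_string([0, 0], [0, 0]) == [1]
    assert match_string([0, 0], [1, 0]) == [0]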

Fig. 4: Aligned bit-wise comparison (a), and adder reduction tree used for similarity score computation (b).

SpinPM can have only one gate active per column at a time (Section II-C). Therefore, for each alignment (i.e., for each loc or iteration of Algorithm 1), such a 2-bit comparison takes place len(pattern) times in each column, one after another. Thereby we compare all characters of the aligned pattern to all characters of the fragment before moving to the next alignment (at the next location loc per Algorithm 1). That said, each such 2-bit comparison takes place in parallel over all columns, where the very same rows participate in computation.

Phase-2 (Similarity Score Computation): For each alignment (i.e., iteration of Algorithm 1), once all bits of the match string are ready – i.e., the character-wise comparison of the fragment and the aligned pattern string is complete for all characters – we count the number of 1s in the match string to calculate the similarity score. A reduction tree of 1-bit adders performs the counting, as captured by Fig. 4b, with the carry and sum paths shown explicitly for the first two levels. The top row corresponds to the contents of the match string, and each ⊕ to a 1-bit adder from Section II-B. len(pattern), the pattern length in characters, is equal to the match string length in bits. Hence, the number of bits required to hold the final bit-count (i.e., the similarity score) is N = ⌊log2(len(pattern))⌋ + 1. A naive implementation for the addition of len(pattern) bits requires len(pattern) steps, with each step using an N-bit adder to generate an N-bit partial sum towards the N-bit end result. For a typical pattern length of around 100 [14], this translates into approximately 100 steps, with each step performing an N = 7 bit addition. Instead, to reduce both the number of steps and the operand width per step, we adopt the reduction tree of 1-bit adders from Fig. 4b. Each level adds bits in groups of two, using 1-bit adders. For a typical pattern length of around 100 [14], we thereby reduce the complexity to 188 1-bit additions in total.
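The reduction can be sketched functionally as follows. Integer additions stand in for the hardware's ripple-carry chains of 1-bit full adders; the step counter lands near the roughly 188 single-bit additions cited above (the exact count depends on how odd leftover operands are grouped).

    def similarity_score(match_bits):
        """Phase-2 sketch: reduce the match string with a tree of 1-bit adders
        (Fig. 4b), counting the 1-bit full-adder invocations along the way."""
        level, steps, width = list(match_bits), 0, 1
        while len(level) > 1:
            nxt = []
            for i in range(0, len(level) - 1, 2):
                nxt.append(level[i] + level[i + 1])
                steps += width           # a width-bit ripple add = width 1-bit adds
            if len(level) % 2:           # odd leftover rides up to the next level
                nxt.append(level[-1])
            level, width = nxt, width + 1
        return level[0], steps

    score, steps = similarity_score([1] * 100)   # typical pattern length ~100
    # score = 100, which fits in N = floor(log2(100)) + 1 = 7 bits; an alignment
    # can later be passed as a "match" if len(pattern) - score stays below a
    # tolerance t (introduced next).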

Alignment under basic error manifestations in the pattern and the reference is also straightforward in this case. For DNA sequence alignment, the most common errors take the form of substitutions (due to sequencing technology imperfections and genetic mutations), where a character assumes a value different from the actual one [15], [16], [17], [18]. We can set a tolerance value t (in terms of the number of mismatched characters) based on expected error rates, and pass an alignment as a "match" if fewer than t characters mismatch.

Assignment of Patterns to Columns: In each SpinPM array we can process a given pattern dataset in different ways. We can assign a different pattern to each column, where a different fragment of the reference resides, or distribute the very same pattern across all columns. Either option works as long as we do not miss the comparison of a given pattern to all fragments of the reference. In the following, we stick to the second option, without loss of generality. This option eases capturing alignments scattered across columns (i.e., where two consecutive columns partially carry the most similar region of the reference to the given pattern). A large reference can also occupy multiple arrays and give rise to scattered alignments at array boundaries, which column replication at array boundaries can address.

Many pattern matching algorithms rely on different forms of search space pruning to prevent unnecessary brute-force search across all possibilities. At the core of such pruning techniques lies indexing the reference, which is known ahead of time, in order to direct the detailed search for any given pattern to the most relevant portion of the reference (i.e., the portion that most likely incorporates the best match). The result is pattern matching at a much higher throughput. SpinPM, too, features search space pruning for efficient pattern matching. The idea is chunking each pattern and the reference into substrings of known length, and creating a hash (bit) for each substring. Thereby, both the pattern and the reference become bit-vectors of much shorter length than their actual representations. Search space pruning in SpinPM then simply translates into bit-wise comparison of each pattern bit-vector to the (longer) reference bit-vector, within the memory array, in a similar fashion to the actual full-fledged pattern matching algorithm, considering all possible alignments and exploiting SpinPM's massive parallelism at the column level. Thereby we eliminate unnecessary attempts at full-fledged matching (using the actual data representation and not hashes).
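The pruning idea can be sketched as follows. The chunk length, bit-vector width, and CRC-based hash are illustrative placeholders; the GRIM filter [27], used in Section IV, defines the actual scheme.

    import zlib

    # Hash every fixed-length chunk of a string into one bit of a short vector.
    Q, NUM_BITS = 4, 64   # placeholder chunk length and bit-vector width

    def bit_vector(s):
        vec = 0
        for i in range(len(s) - Q + 1):
            vec |= 1 << (zlib.crc32(s[i:i + Q].encode()) % NUM_BITS)
        return vec

    def may_match(pattern, region):
        # A region can contain the pattern only if its bit-vector covers the
        # pattern's; only the survivors proceed to full-fledged matching.
        p = bit_vector(pattern)
        return p & bit_vector(region) == p

    print(may_match("GATTACA", "TTGATTACAGG"))   # True: candidate survives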

    IV. EVALUATION SETUP

Technology Parameters: Table III provides technology parameters for a representative SHE-MTJ based implementation and for more conventional Spin-Torque-Transfer (STT)-MTJ based implementations (near-term and projected long-term). The critical current Icrit refers to an MTJ switching probability of 50%, which would incur a high write error rate (WER).


TABLE III: Technology Parameters.

Parameter                     SHE                     STT (Near-term)   STT (Long-term)
MTJ Type                      Interfacial PMTJ        Interfacial PMTJ  Interfacial PMTJ
MTJ Diameter (nm)             10                      45                10
TMR (%)                       100                     133 [22]          500
RA Product (Ω·µm²)            20                      5                 1 [23]
Critical Current Icrit (µA)   3.0 (SHE) / 3.9 (MTJ)   100               3.95
Switching Latency (ns)        1                       3 [24]            1 [22]
RP (kΩ)                       253.97                  3.15              12.7
RAP (kΩ)                      507.94                  7.34              76.39
RSHE (kΩ)                     64                      –                 –
Write Latency (ns)            1.72                    3.65              1.72
Read Latency (ns)             1.24                    1.21              1.24
Write Energy (fJ)             0.4                     12.41             2.62
Read Energy (fJ)              0.29                    0.29              0.29
VINV (V)                      1.05–1.81               0.84–1.30         0.23–0.48
VCOPY (V)                     1.05–1.81               0.84–1.30         0.23–0.48
VNOR (V)                      0.62–0.75               0.68–0.74         0.20–0.22
VMAJ3 (V)                     0.53–0.61               0.65–0.69         0.20–0.21
VMAJ5 (V)                     0.40–0.43               0.61–0.62         0.19–0.20
VTH (V)                       0.43–0.86               0.62–0.63         0.19–0.20

To compensate, when deriving gate latency and energy values, we conservatively assume a 2× (5×) larger Icrit for the near-term (long-term) STT-MTJ technology. The corresponding value for SHE-MTJ is by definition lower, due to Spin-Hall-Effect based switching. The long-term STT specification is based on projections from the literature [19]; the SHE-MTJ specification comes from [20]. We model access transistors after the 22nm (HP) PTM [21].

Simulation Infrastructure: We developed a step-accurate simulator to capture the throughput performance and energy consumption of SpinPM based pattern matching as a function of the technology parameters. We model the peripheral circuitry using NVSim [25] to extract the row decoder, mux, precharge, and sense amplifier induced energy and latency overheads at 22nm. NVSim captures parasitic effects such as the temperature impact on wire resistance and the contribution of the drain-to-channel capacitance to the drain capacitance.

Array Size & Organization: Depending on the pattern matching problem at hand, we might need SpinPM arrays ranging from modest to very large in size. The challenge is how to deal with sufficiently large arrays, as fabrication- and circuit-level-design-related limitations might restrict the design space. As an example, the proof-of-concept implementation requires 300 arrays of 10K columns and around 2K rows each for the string matching case study from genomics. This renders a total size of roughly 24Mb per array, which is not excessively large. Still, the fabrication technology might not be mature enough to synthesize such an array. Commercial MRAM manufacturers address this challenge by banking. For example, EverSpin [26] uses 8 banks in its 256Mb (32Mb × 8) MRAM product. Distributing array capacity to banks helps satisfy the latency and energy requirement per access, as well. For SpinPM based pattern matching, we too are inclined to use a hierarchy of banks to enhance scalability. While a clever data layout, operation scheduling, and parallel activation of banks can mask the time overhead, the energy and area overhead would be largely due to the replication of control hardware across banks. The most straightforward option for a banked SpinPM would be to treat each bank simply as an individual array, which would map even shorter fragments of the reference to patterns from the input pattern dataset.

Search Space Pruning: Without loss of generality, we use the GRIM filter [27] to convert the patterns and the reference to bit-vectors. Except for bit-vector generation, all operations (including bit-vector mapping) are implemented entirely in SpinPM arrays. We assume dedicated logic for bit-vector generation and synthesize it at 22nm to extract the energy and time overheads. We account for the overhead of search space pruning, which spans bit-vector generation and matching in the SpinPM arrays, throughout the evaluation.

Benchmarks: We evaluate SpinPM using four pattern matching applications (which also include common computational kernels for pattern matching, such as bit count), besides the running example of DNA sequence pre-alignment used throughout the paper. Table IV tabulates these applications along with the corresponding problem sizes.

TABLE IV: Benchmark Applications.

Benchmark         Problem Size               Pattern Length     Sub-Array Size
DNA               3G characters              100 characters     512×512
Bit count         1,000,000 32-bit vectors   1 bit              512×512
String Matching   10,396,542 words           10-char. string    512×512
Rivest Cipher 4   10,396,542 words           248 bits           512×512
Word count        1,471,016 words            32 bits            512×512

DNA sequence pre-alignment (DNA) is our running case study throughout the paper. We use a real human genome, NCBI36.54, from the 1000 Genomes Project [28] as the reference, and 3M 100-base-character-long real patterns from SRR1153470 [29].

Bit count (BC) [30] counts the number of ones in a set of vectors of fixed length. This includes the addition of bits within the vectors and the subsequent addition of all individual counts. The input vectors are mapped to the columns of SpinPM to exploit parallelism.

String Match (SM) [31] matches a search string against a pre-stored reference string to identify the part of the reference of highest or lowest similarity. Space-separated string segments and the search substring (which forms the pattern) are mapped to SpinPM columns such that all searches are performed in parallel.

Rivest Cipher 4 (RC4) is a popular stream cipher. Upon generating a cipher key, i.e., a string, it performs a bitwise XOR on the cipher key and the text to cipher. The same key is used to decipher the text, as well. Segments of the input text and the cipher key are mapped to SpinPM columns.

Word Count (WC) [31] counts the number of occurrences of specific words in an input text file, through word matching. The words are mapped to SpinPM columns along with the search words, and the word matching in each column is executed concurrently.

Baselines for Comparison:

Near-Memory-Processing (NMP) Baseline: For NMP based pattern matching, we use an HMC model based on published data [3], which covers the memory and logic layers, and the communication links. To favor the NMP baseline, we ignore the power required to navigate the global wires between the memory controller and the logic layer, and the intermediate routing elements. For the logic layer, we consider single-issue in-order cores, modeled after the ARM Cortex-A5 [32], with a 1GHz clock and 32KB instruction and data caches. We first consider a total of 64 cores to provide parallel processing. For communication, we assume an HMC-like configuration with four communication links operating at their peak rate of 160 GB/s. To derive the throughput performance, we use the same reference and input patterns to profile each benchmark. We then use the instruction and memory traces to calculate the throughput. We validated this model through CasHMC [33] simulations. For reference, we also include a hypothetical NMP variant which features 128 cores in the logic layer and incurs zero memory overhead.

SpinPM-STT Baseline: To demonstrate the effect of different cell technologies, we also use an STT-MRAM based SpinPM implementation as a baseline for comparison.


Fig. 5: Throughput and energy efficiency of SpinPM-SHE, normalized to NMP, NMP-Hyp, SpinPM-STT, and SpinPM-STT(Proj): (a) match rate (patterns/sec); (b) compute efficiency (patterns/sec/mW).

STT-MTJ based SpinPM (SpinPM-STT) performs logic operations following the same principle as the SHE-based (SpinPM-SHE) implementation, however transposed [9].

V. EVALUATION

We next characterize the benchmark applications, in terms of match rate and compute efficiency, when mapped to SHE-based SpinPM (SpinPM-SHE). We use the match rate (the number of patterns processed per second) for throughput, and the match rate per milliwatt for compute efficiency. Fig. 5a depicts the match rates for SpinPM-SHE normalized to four baselines: NMP, a hypothetical variant of NMP with no memory overhead (NMP-Hyp), and two variants of STT-based SpinPM (near-term SpinPM-STT and projected SpinPM-STT). Overall, we observe that SpinPM-SHE outperforms the NMP baselines on all benchmark applications. Moreover, in comparison to near-term SpinPM-STT, SpinPM-SHE shows a significant improvement in throughput performance and compute efficiency. All applications show a smaller improvement w.r.t. NMP-Hyp than w.r.t. NMP, since NMP-Hyp has no memory overhead and hence a much higher match rate than NMP to start with.

Fig. 5b depicts the outcome for compute efficiency. Generally we observe a trend similar to the match rate, with all benchmarks (but BC) featuring a ≥ 2× improvement even w.r.t. the ideal baseline NMP-Hyp. Overall, BC shows the least benefit in compute efficiency w.r.t. NMP-Hyp, since BC has a lower compute-to-memory-access ratio, and eliminating the memory overhead greatly improves NMP-Hyp's throughput and compute efficiency.

SpinPM-SHE performs significantly better than near-term SpinPM-STT, in terms of match rate and compute efficiency, due to its smaller switching latency and energy consumption. Moreover, the match rate and compute efficiency of SpinPM-SHE come quite close to those of the projected long-term STT-MTJ based implementation.

A. Impact of Process Variation

We conclude the evaluation with a discussion of the impact of process variation, which, due to imperfections in the manufacturing technology, may result in significant deviations of device parameters from their expected values. Both the access transistors and the SHE-MTJ in a SpinPM cell are subject to process variation. Since access transistors are fabricated in the relatively more mature CMOS technology, the effect of process variation is far less dominant than in its initial years. Being a relatively new technology, MTJ devices are more susceptible to process variation, which directly affects critical parameters such as the switching current and switching latency. However, as MTJ technology matures, it is likely that it, too, will be able to reduce the impact of process variation.

One concern is variation in the critical switching current, which can directly translate into variation in the bias voltages on the bit select lines, i.e., Vgate, which determines the gate type. However, different SpinPM gates featuring close Vgate values (and hence potentially subject to this type of variation) are usually distinguished either by a different preset value or by a different number of inputs, which makes it unlikely that the gate functions would overlap with each other as a result of variation. We validated this observation assuming variations in the switching current of ±5%, ±10%, and ±20%, respectively, for all evaluated gates implemented in the SpinPM array.

    B. Gate-level Characterization

Fig. 6: Throughput (GOp/s) of SpinPM-SHE, SpinPM-STT, and projected SpinPM-STT w.r.t. Ambit [34], for NOT, OR, NAND, and XOR.

We next compare the throughput performance of SpinPM with Ambit [34] and Pinatubo [35]. Ambit reports a comparative bulk throughput with respect to CPU and GPU baselines for executing basic logic operations on fixed-size vectors of one-bit operands. Pinatubo reports the bit-wise throughput of the OR operation only, on a 2^20-bit-long vector. We consider the highest throughput (for a 128-row operation) reported by Pinatubo. To conduct a fair comparison, we assume the same vector size of 32MB used in Ambit. Fig. 6 captures the outcome, w.r.t. Ambit, in terms of Giga operations per second (GOp/s), for the NOT, OR, NAND, and XOR implementations. We observe a higher throughput for SpinPM-SHE across all of these bitwise operations.

The high degree of parallelism and the lack of actual data transfer within the array – which is not the case for Ambit, per Section VI – are the main reasons behind this improvement. For the more complex logic operation XOR, the throughput improvements of SpinPM-SHE and projected SpinPM-STT are both ≈ 4× over Ambit, whereas near-term SpinPM-STT achieves only 1.34×. In this case, SpinPM-SHE performs slightly, although insignificantly (around 0.3%), worse than projected SpinPM-STT. In comparison to the OR throughput of Pinatubo, SpinPM-SHE shows a similar improvement to projected SpinPM-STT (12×). For this comparison, we do not optimize the data layout or operation scheduling for SpinPM. That said, Ambit is based on a mature (DRAM) technology, and is therefore more versatile for integration into conventional systems.

VI. RELATED WORK

Without loss of generality, we base SpinPM on the spintronic PIM substrate CRAM, which was briefly presented in [9] and evaluated for a single-neuron digit recognizer along with a small-scale 2D convolution in [19]. CRAM is unique in combining multi-grain (possibly dynamic) reconfigurability with true processing-in-memory semantics. The resistive Associative Processor [36] and the DRAM-based DRAF [37], on the other hand, rely on look-up tables to support reconfigurable fabrics like FPGAs.

Conventional bitline computing substrates employ sense-amplifier based logic operations and cannot truly eliminate the data movement overhead within the array boundary. The SRAM-based Compute Cache [38] can carry out different vector operations in the cache, but SpinPM supports a wider range of computations on much larger data than could fit in a cache.


Maintaining data coherence among the cores which constitute near-memory logic is also an issue [39], [40]; this is not the case for SpinPM, due to the absence of dedicated cores (with full-fledged memory hierarchies) to implement logic operations. Recent proposals for bit-wise in-memory computing include Ambit [34], DRISA [41], Pinatubo [35], and STT-CiM [42]. DRAM-based solutions such as Ambit and DRISA use modified sense-amplifier based designs. These designs support bitwise operations in DRAM, but can only perform computation on a designated set of rows. Thus, to compute on an arbitrary row, the row must first be copied to these dedicated compute rows, and copied back once the computation is complete. While both designs feature high degrees of (column) parallelism, they suffer from data movement overheads within the array boundary. Pinatubo [35], on the other hand, can perform bitwise operations on data residing in multiple rows, using a specialized sense amplifier with a variable reference voltage, which increases its susceptibility to variation.

VII. CONCLUSION

This paper introduces SpinPM, a novel, reconfigurable spintronic substrate for true in-memory pattern matching, a key computational step in large-scale data analytics. When configured as memory, SpinPM is no different from an MRAM array. Each MRAM cell, however, can act as an input or output to a logic gate, on demand. Therefore, reconfigurability does not compromise memory density. Each column can have only one logic gate active at a time, but the very same logic operation can proceed in all columns in parallel. We implement a proof-of-concept SpinPM array in SHE-MTJ technology for large-scale character string matching to pinpoint design bottlenecks and aspects subject to optimization. The encouraging results from Section V indicate great potential in terms of throughput performance and compute efficiency.

REFERENCES

[1] M. Horowitz, "Computing's Energy Problem (and What We Can Do About It)," keynote at the International Solid-State Circuits Conference (ISSCC), February 2014.
[2] G. H. Loh et al., "A Processing in Memory Taxonomy and a Case for Studying Fixed-function PIM," in Workshop on Near-Data Processing, in conjunction with MICRO, 2013.
[3] "Hybrid Memory Cube (HMC)," http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.18.3-memory-FPGA/HC23.18.320-HybridCube-Pawlowski-Micron.pdf.
[4] "High Bandwidth Memory (HBM)," http://www.amd.com/en-us/innovations/software-technologies/hbm.
[5] R. Nair et al., "Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems," IBM Journal of Research and Development, vol. 59, no. 2/3, 2015.
[6] J.-P. Wang et al., "General Structure for Computational Random Access Memory (CRAM)," US Patent 9224447 B2, 2015.
[7] A. Lyle et al., "Direct Communication between Magnetic Tunnel Junctions for Nonvolatile Logic Fanout Architecture," Applied Physics Letters, vol. 97, no. 152504, 2010.
[8] J. Wang et al., "Programmable Spintronics Logic Device Based on a Magnetic Tunnel Junction Element," Applied Physics Letters, vol. 97, no. 10D509, 2005.
[9] Z. Chowdhury et al., "Efficient In-Memory Processing Using Spintronics," IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 42–46, 2018.
[10] C. Augustine et al., "Low-power Functionality Enhanced Computation Architecture Using Spin-based Devices," in International Symposium on Nanoscale Architectures, 2011.
[11] R. Kaplan et al., "RASSA: Resistive Accelerator for Approximate Long Read DNA Mapping," arXiv preprint arXiv:1809.01127, 2018.
[12] S. S. Ajay et al., "Accurate and Comprehensive Sequencing of Personal Genomes," Genome Research, vol. 21, no. 9, 2011.
[13] P. Klus et al., "BarraCUDA: A Fast Short Read Sequence Aligner Using Graphics Processing Units," BMC Research Notes, vol. 5, no. 1, p. 27, 2012.
[14] "Illumina Sequencing by Synthesis (SBS) Technology," https://www.illumina.com/technology/next-generation-sequencing/sequencing-technology.html.
[15] M. Schirmer et al., "Illumina Error Profiles: Resolving Fine-scale Variation in Metagenomic Sequencing Data," BMC Bioinformatics, vol. 17, no. 1, 2016.
[16] J. M. Mullaney et al., "Small Insertions and Deletions (INDELs) in Human Genomes," Human Molecular Genetics, vol. 19, no. R2, 2010.
[17] S.-M. Ahn et al., "The First Korean Genome Sequence and Analysis: Full Genome Sequencing for a Socio-ethnic Group," Genome Research, vol. 19, no. 9, pp. 1622–1629, 2009.
[18] J. Wala et al., "Genome-wide Detection of Structural Variants and Indels by Local Assembly," bioRxiv, p. 105080, 2017.
[19] M. Zabihi et al., "In-Memory Processing on the Spintronic CRAM: From Hardware Design to Application Mapping," IEEE Transactions on Computers, 2018.
[20] M. Zabihi et al., "Using Spin-Hall MTJs to Build an Energy-Efficient In-Memory Computation Platform," in 20th International Symposium on Quality Electronic Design (ISQED), March 2019, pp. 52–57.
[21] "Predictive Technology Model," http://ptm.asu.edu/.
[22] G. Jan et al., "Demonstration of Fully Functional 8Mb Perpendicular STT-MRAM Chips with Sub-5ns Writing for Non-volatile Embedded Memories," in Symposium on VLSI Technology, June 2014.
[23] H. Maehara et al., "Tunnel Magnetoresistance above 170% and Resistance-Area Product of 1 Ω·µm² Attained by In Situ Annealing of Ultra-Thin MgO Tunnel Barrier," Applied Physics Express, vol. 4, no. 3, 2011.
[24] H. Noguchi et al., "A 3.3ns-access-time 71.2µW/MHz 1Mb Embedded STT-MRAM Using Physically Eliminated Read-disturb Scheme and Normally-off Memory Architecture," in IEEE International Solid-State Circuits Conference (ISSCC), February 2015.
[25] X. Dong et al., "NVSim: A Circuit-level Performance, Energy, and Area Model for Emerging Non-volatile Memory," in Emerging Memory Technologies. Springer, 2014, pp. 15–50.
[26] "Everspin Technologies," https://www.everspin.com/.
[27] J. S. Kim et al., "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies," BMC Genomics, vol. 19, no. 2, p. 89, 2018.
[28] "1000 Genomes Project," ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/.
[29] "SRR1153470," https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1153470.
[30] M. R. Guthaus et al., "MiBench: A Free, Commercially Representative Embedded Benchmark Suite," in IEEE International Workshop on Workload Characterization (WWC-4), 2001, pp. 3–14.
[31] C. Ranger et al., "Evaluating MapReduce for Multi-core and Multiprocessor Systems," in IEEE 13th International Symposium on High Performance Computer Architecture (HPCA), 2007, pp. 13–24.
[32] "Cortex-A5 Processor," http://www.arm.com/products/processors/cortex-a/cortex-a5.php/.
[33] D. Jeon et al., "CasHMC: A Cycle-accurate Simulator for Hybrid Memory Cube," IEEE Computer Architecture Letters, vol. 16, no. 1, pp. 10–13, January 2017.
[34] V. Seshadri et al., "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50), 2017, pp. 273–287. Available: http://doi.acm.org/10.1145/3123939.3124544.
[35] S. Li et al., "Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories," in 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), 2016, pp. 1–6.
[36] L. Yavits et al., "Resistive Associative Processor," IEEE Computer Architecture Letters, vol. 14, no. 2, 2015.
[37] M. Gao et al., "DRAF: A Low-power DRAM-based Reconfigurable Acceleration Fabric," in ISCA, 2016.
[38] S. Aga et al., "Compute Caches," in HPCA, 2017.
[39] A. Boroumand et al., "LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory," IEEE Computer Architecture Letters, vol. 16, no. 1, January 2017.
[40] M. Gao et al., "Practical Near-Data Processing for In-Memory Analytics Frameworks," in PACT, 2015.
[41] S. Li et al., "DRISA: A DRAM-based Reconfigurable In-Situ Accelerator," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50), 2017, pp. 288–301.
[42] S. Jain et al., "Computing in Memory with Spin-Transfer Torque Magnetic RAM," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 3, pp. 470–483, March 2018.
[43] Micron, "TN-41-01: Calculating Memory System Power for DDR3," https://www.micron.com/resource-details/3465e69a-3616-4a69-b24d-ae459b295aae.
[44] C.-M. Liu et al., "SOAP3: Ultra-fast GPU-based Parallel Alignment Tool for Short Reads," Bioinformatics, vol. 28, no. 6, pp. 878–879, 2012.
[45] J. Kim et al., "Genome Read In-Memory (GRIM) Filter: Fast Location Filtering in DNA Read Mapping Using Emerging Memory Technologies," https://people.inf.ethz.ch/omutlu/pub/GRIM-genome-read-in-memoryfilter_psb17-poster.pdf, 2017.



APPENDIX

A. Reconfigurability

Invoking a logic gate within the SpinPM array translates into pre-setting the output, connecting all cells participating in computation to LL by setting the corresponding WLs, and setting a voltage between the BSLs of the input and output cells that equals Vgate, which depends on the type of the logic gate. Therefore, modulo the output pre-set, the complexity of reconfiguration is very similar to the complexity of addressing in the memory array. SpinPM is reconfigurable along two dimensions:
• Each cell can serve as an input or as an output for a logic gate, depending on the computational demands of the workload within the course of execution.

• For a fixed input-output assignment, the logic function itself is reprogrammable. For example, we can reconfigure the gate from Fig. 1(b)/(c) to implement a function other than NOR by simply changing Vgate to, e.g., VNAND (and applying a different output pre-set, as need be).

By default, SpinPM acts as an MRAM array. A dedicated, architecturally visible set of registers keeps the configuration bits that program SpinPM cells as logic gate inputs/outputs. These configuration bits capture not only the physical location in the array, but also whether the cell represents an input or an output, the pre-set value for the output, and Vgate. A fixed or floating portion of the SpinPM array can keep these configuration bits as part of the machine state, as well.
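As an illustration, a software view of one such configuration entry might look as follows. This is a minimal sketch; the field names and widths are assumptions for exposition, not the architected register format:

    #include <stdint.h>

    /* Hypothetical layout of one SpinPM configuration entry, mirroring the
     * fields described above (location, role, pre-set value, gate voltage). */
    typedef struct {
        uint32_t row;         /* physical row of the cell in the array         */
        uint32_t col;         /* physical column of the cell                   */
        uint8_t  is_output;   /* 1: cell acts as gate output; 0: gate input    */
        uint8_t  preset_val;  /* pre-set value applied when the cell is output */
        uint16_t vgate_sel;   /* selects the Vgate level, i.e., the function   */
    } spinpm_cfg_entry;

One such entry per participating cell suffices to invoke a gate; the same entries can reside in a reserved region of the array as part of the machine state.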

B. Spatio-Temporal Scheduling

The goal of classic memory data layout optimizations is to perform as many computations as possible per unit of data delivered from the memory to the processor, as the data communication between the processor and the memory represents the bottleneck. SpinPM, on the other hand, brings compute capability to the data to be processed. The goal becomes minimizing the direct physical distance between the cells participating in computation. Considering that an output cell can serve as an input cell in subsequent steps of computation, the physical location of the cells carrying the input data for subsequent steps can dynamically change as computation proceeds.

This optimization problem gives rise to two strongly correlated sub-problems: the layout of the data to be processed in the memory array, and the spatio-temporal scheduling of computations within the array. In this regard, the optimization problem has many analogies to the floor-planning and placement algorithms deployed in the computer-aided design of digital systems, which aim to minimize the "distance" (in terms of wire length) between interconnected circuit blocks. In the SpinPM context, "interconnected blocks" translate into interconnected cells (over LL) participating in computation (Section II-A). We will look closer into this effect in Section C.

SpinPM hence features a unique trade-off between data replication and parallelism: due to the internal array structure, the same cell (unless replicated) can only participate in one computational step at a time, which may impair further opportunities for parallel execution. Data replication can unlock more parallelism in such cases, at the expense of a larger memory footprint.

As an example, Figure 7 illustrates a time lapse of logic operations on two datasets, each 3 bits in length, in a SpinPM array of four columns (rotated horizontally for ease of illustration). Time T0 captures the initial state. T1–T4 are spent on preset operations of the output cells in all four rows, conservatively through standard writes of one row at a time, before a sequence of bitwise ORs and MAJ3s takes place (in each column in parallel). Time T5 shows the OR operation on the first bits of the datasets, performed in each column at the same time (bold bit values are involved in computation). T6 and T7 do the same for the next two bits of the datasets. The last stage of computation, T8, performs MAJ3 on the three-bit result of the previous sequence of ORs. The last step, R, highlights the final result bits in each column.

C. Practical Considerations

Array Size: The maximum column height (i.e., the maximum number of rows) per SpinPM array depends on the gate voltage Vgate (Section II-A), the interconnect material for LL and BSL (which connect the input and output cells together in forming a gate), as well as the technology node. We conduct the following experiment to determine the maximum column height: we consider a two-input, one-output SpinPM gate which has the input cells and the output cell located in adjacent rows. In each experiment, we shift the output cell further away from the input cells, one cell at a time. The process continues until we reach the terminating condition, i.e., when the current through the output cell falls below the required critical switching current under the most conservative input cell resistance states.

Assuming copper interconnect segments of 160nm for LL, for representative SpinPM gates used in pattern matching, this analysis renders approximately 2K cells per column at 22nm, where the latency overhead induced by this maximum-distance computation remains below 1% of the switching time of the SHE-MTJ, as detailed in Section IV.
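The terminating condition lends itself to a simple numerical sweep. Below is a minimal sketch of this experiment under a lumped-resistance model; all device and interconnect values are illustrative placeholders rather than the calibrated parameters of Section IV:

    #include <stdio.h>

    int main(void) {
        /* Illustrative parameters (assumptions, not calibrated values). */
        const double v_gate = 0.8;    /* voltage between input/output BSLs (V)   */
        const double r_ap   = 7.5e3;  /* input MTJ resistance, anti-parallel (Ω) */
        const double r_out  = 3.0e3;  /* output MTJ resistance after pre-set (Ω) */
        const double r_seg  = 4.0;    /* LL resistance per 160nm Cu segment (Ω)  */
        const double i_crit = 60e-6;  /* critical switching current, output (A)  */

        /* Worst case: both input MTJs (in parallel) in the high-R state. */
        const double r_in = r_ap / 2.0;

        int max_rows = 0;
        for (int d = 1; d < 100000; d++) {  /* output shifted d cells away */
            double i_out = v_gate / (r_in + d * r_seg + r_out);
            if (i_out < i_crit)             /* terminating condition */
                break;
            max_rows = d;
        }
        printf("max column height: ~%d cells\n", max_rows);
        return 0;
    }

With these placeholder numbers the sweep terminates after roughly 1.6K segments, in the same spirit as the ~2K-cell result above; the actual analysis of course relies on circuit-level models rather than this lumped approximation.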

The feasibility of array dimensions is contingent upon the correct functionality of the array at subarray granularity. Our circuit-level analysis reveals a maximum subarray size of 512×512 bits without sacrificing the reliability of array functionality.

Since SpinPM cells use two transistors, the area of each cell is dominated by the transistors. The area of each SpinPM cell (at 22nm) is roughly 100F², considering the current density requirement of MTJ devices (MTJs are placed on top of the transistors and consume roughly 5% of the transistor area).

Array Periphery: Peripheral overheads, mainly induced by addressing and control operations, can play a vital role in determining the pattern matching throughput. Accordingly, throughout the evaluation, we consider the time and energy overheads of peripheral circuitry, including row and column decoders, multiplexers, and sense amplifiers. For memory read and write operations, a SpinPM array is not much different from a standard STT-MRAM array, hence we model the periphery after standard STT-MRAM. During computation, however, as all columns operate in parallel, the column decoder overhead does not apply (although we conservatively keep it). The periphery during computation rather becomes similar to that of Pinatubo [35], an alternative PIM substrate (although SpinPM computation relies on a different mechanism, entirely excluding sense amplifier involvement during computation, contrary to Pinatubo). Even during computation, where all columns are active, the current draw in a SpinPM array remains relatively modest. For example, for SHE-MTJ (as detailed in Section IV), a 128MB array would still consume considerably less current than a DDR3 SDRAM write operation [43].

Preset Overhead: Each logic operation requires the output to be (pre)set to a predefined value.


Fig. 7: Timing diagram of example logic execution. [Figure: time lapse over steps T0–T8 and R, showing Data1, Data2, and the output cells across Columns 1–4.]

Computation is column-parallel, i.e., in all columns the output cell resides in the very same row. Accordingly, before firing column-parallel computation, the corresponding row where the output cells reside should be preset. To this end, we can use a "gang" preset, which presets all cells in the output row(s) simultaneously. The alternative is relying on the standard write operation, which can preset (columns in) one row at a time. The gang preset is equivalent to a parallel COPY operation, where all columns compute in parallel and where the output cells are all in the respective rows subject to gang preset. Hence, the discussion about the periphery overhead during column-parallel computation directly applies here, and the current draw remains modest.

Read Disturbance: Read disturbance is an issue that arises when the read current and write current become similar, due to non-linear technology scaling of the different electrical components of an array. SHE-MTJ devices have separate read and write paths through the device, eliminating read disturbance effects.

D. System Interface

SpinPM can serve as a stand-alone compute engine or as a co-processor attached to a host processor. Following the near-memory processing taxonomy from [2], due to the reconfigurability (Section A), both SpinPM design points fall into the "programmable" class. A classic system has to specify how to offload both computation and data to the co-processor, and how to get the results back from it. For a SpinPM co-processor, we do not need to communicate data values; instead, the SpinPM array requires (ranges of) data addresses to identify the data to process, and the specification for computation, i.e., which function to perform on the corresponding data. We will next cover the SpinPM system stack to support in-memory execution semantics for pattern matching.

SpinPM Instructions: In addition to conventional memory read and write, SpinPM instructions cover computational building blocks for in-memory pattern matching. SpinPM instructions hence form two classes: data transfer (read, write) and computational (arithmetic/logic). By construction, computational SpinPM instructions are block instructions: two-dimensional vector instructions, which operate on all columns and on a subset of rows of a SpinPM array at a time. Hence, the key operands for any computational SpinPM instruction are the row numbers of the source(s) (i.e., inputs to computation) and destination(s) (i.e., outputs of computation). Depending on the size of the pattern matching problem, multiple SpinPM arrays may be deployed in parallel. Therefore, the computational subset of SpinPM instructions facilitates gang-execution on all SpinPM arrays, as well.

In the following, we will generically use the term SpinPM substrate to refer to all arrays participating in computation. We also make the distinction between macro- and micro-instructions. The set of micro-instructions covers the actual bit-level operations performed in the SpinPM substrate, while the set of macro-instructions forms the high-level programming interface.

Programming Interface: To match SpinPM's column-level parallelism, memory allocation and the declaration of variables (which represent inputs and outputs to computation) happen at column granularity. Depending on the problem, a variable may cover the entire column or only a portion of it. The following code snippet provides an example, where an integer variable x gets written (assigned) to row r and column c in a SpinPM array (line 5):

1  int x = ...;
2  ...
3  int y;
4  preset(r, ncell, val);
5  intpm xpm = writepm(x, r, c, sizeof(x));
6  y = readdirpm(xpm);

In this case, besides x and y, ncell, val, c, and r represent (already defined) integer values. The SpinPM-specific (composite) data type intpm captures the row and column coordinates of each variable stored in the array. xpm in line 5 keeps this information for variable x after it gets written to column c, from row r onwards, by the writepm function. The subsequent read in line 6, conducted by readdirpm, directly assigns the value of x to y. SpinPM also features a read function, readpm, which has a similar interface to writepm, with explicit row and column specification. We consider each such function a macro-instruction.

The preset function in line 4 presets ncell (consecutive) cells, starting from row r, each to the value val. SpinPM features different variants of this function, including one to gang-preset the entire scratch area (Fig. 3), and another where val is interpreted as a bitmask (of ncell bits), rather than a single-bit preset value that applies over the entire range of the specification.
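To make the variants concrete, the following is a minimal software model of the two preset flavors described above, emulating one column as a plain bit array; the bitmask function name and the array model are hypothetical, for exposition only:

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t column[64];  /* software stand-in for one SpinPM column */

    /* preset: set ncell consecutive cells, starting at row r, to val. */
    void preset(int r, int ncell, uint8_t val) {
        for (int i = 0; i < ncell; i++) column[r + i] = val;
    }

    /* Bitmask variant: bit i of mask supplies the preset value of row r+i. */
    void preset_mask(int r, int ncell, uint64_t mask) {
        for (int i = 0; i < ncell; i++) column[r + i] = (mask >> i) & 1u;
    }

    int main(void) {
        preset(8, 4, 1);         /* rows 8..11 <- 1,1,1,1 */
        preset_mask(8, 4, 0x5);  /* rows 8..11 <- 1,0,1,0 */
        for (int i = 8; i < 12; i++) printf("%u", column[i]);
        printf("\n");            /* prints 1010 */
        return 0;
    }

In the actual substrate, of course, the single-value variant maps to a gang preset of a row segment, rather than the cell-serial loop shown here.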

Each pattern matching problem to be mapped to SpinPM features three basic stages:
(i) allocating and initializing the reference, pattern, and scratch regions in each array (Fig. 3);
(ii) computation;
(iii) collecting the pattern matching outcome.
Variants of the preset and writepm functions cover stage (i); variants of read(dir)pm cover stage (iii).


Stage (ii) can take different forms depending on the encoding of the pattern and reference characters, but generally primitives such as addpm(int start, int end, intpm result) apply, which sum all cell contents between rows start and end, on a per-column basis, and write the result back to where result points. The addpm macro-instruction can directly implement Phase-2 from Algorithm 1 to calculate the bit-count on the match string (Section III-B).

Code Generation: Code generation entails translating a sequence of macro-instructions into a sequence of micro-instructions for the SpinPM memory controller (SMC) to drive in-array computation. Micro-instructions specify the type of operation and the rows to connect as inputs and outputs. For example, nand(ri, rj, rk) specifies row ri as the output, and rows rj and rk as the inputs, to form a NAND gate in the SpinPM array. The macro-instruction nandpm, on the other hand, performs the very same operation on multi-bit operands of width ncell: nandpm(ri, rj, rk, ncell). In this case, ri, rj, and rk still demarcate the starting rows of the destination and source operands (of ncell bits each). nandpm hence translates into a sequence of ncell nand micro-instructions (see the sketch following this subsection). For addpm-type macro-instructions, on the other hand, a spatio-temporal scheduling pass (Section B) determines the corresponding composition of micro-instructions. The goal is to maximize the throughput performance for the given data layout, which usually translates into masking the overhead of presets or other (per-row) writes by coalescing them when possible.

SpinPM Memory Controller (SMC): The SMC orchestrates computation in the SpinPM substrate. SpinPM features an internal clock. During computation, the SMC allocates each micro-instruction a specific number of cycles to finish, depending on the operation and the operand widths. This time window includes the peripheral overheads and the scheduling overhead due to the SMC, besides computation. After the allocated time elapses (and unless an exception occurs), the SMC fetches the next set of micro-instructions. The SMC features an instruction cache, where micro-instructions reside until they are issued to the SpinPM substrate. Before issue, the SMC decodes the micro-instructions using a look-up table to initiate the preset and, subsequently, to set the appropriate voltage level on the input and output BSLs (as a function of the operation, as explained in Section II-B), before activating the corresponding rows in the specified arrays for computation. The look-up table keeps the voltage level and the preset value for each bit-level operation from Section II-B, which forms a SpinPM micro-instruction. No look-up table access is necessary for read and write operations.
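As referenced above, the nandpm-to-nand expansion can be sketched as follows; the micro-instruction record and the emit interface are hypothetical stand-ins for exposition, not the SMC's actual encoding:

    #include <stdio.h>

    /* Hypothetical micro-instruction record; the real SMC encoding may differ. */
    typedef struct {
        const char *op;  /* bit-level operation, e.g., "nand" */
        int out_row;     /* row acting as the gate output     */
        int in_row_a;    /* first input row                   */
        int in_row_b;    /* second input row                  */
    } micro_inst;

    /* Stand-in for handing a micro-instruction to the SMC issue queue. */
    static void emit(micro_inst m) {
        printf("%s(r%d, r%d, r%d)\n", m.op, m.out_row, m.in_row_a, m.in_row_b);
    }

    /* nandpm(ri, rj, rk, ncell): expand the macro-instruction into ncell
     * bit-level nand micro-instructions, one per row offset. */
    void nandpm(int ri, int rj, int rk, int ncell) {
        for (int b = 0; b < ncell; b++) {
            micro_inst m = { "nand", ri + b, rj + b, rk + b };
            emit(m);
        }
    }

    int main(void) {
        nandpm(16, 0, 8, 4);  /* 4-bit operands at rows 0 and 8, result from row 16 */
        return 0;
    }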

    E. A Closer Look into Performance and Energy

We will next provide a detailed throughput performance and energy characterization, along with a sensitivity analysis, using DNA as a case study. SHE-MTJ is considered as the cell technology used in SpinPM. To characterize SpinPM-based DNA sequence pre-alignment, we use an NVIDIA Tesla K20X GPU based implementation of the common Burrows-Wheeler aligner algorithm [44] as the baseline. We deploy the very same reference and input pattern pools for the GPU and SpinPM. For the comparison to be fair, we only take the basic pattern matching portion of the GPU baseline into consideration (Section III).

We consider two design points, which differ in how the patterns (from the input pattern pool) get assigned to columns for matching, in other words, how patterns are scheduled for computation in the SpinPM array. The first is a Naive implementation, where we take one pattern and blindly copy it to every column of all arrays to perform the similarity search. The second implementation, on the other hand, features Oracular pattern scheduling, which can avoid assigning a pattern to a column where a too-dissimilar (reference) fragment resides. Oracular is straightforward to implement by adding a search-space pruning step before full-fledged mapping takes place, as explained in Section III-B, e.g., using hash-based filtering [45]. We leave the exploration of this rich design space to future work, but cover the overhead of a representative practical implementation in the following. Any practical SpinPM implementation would fall somewhere in the spectrum between these two extremes.

Naive Design (Naive): The caveat here is the very high overhead of redundant computation, due to processing one pattern at a time and mapping each such pattern naively to all reference fragments. As a single pattern is matched against the entire reference, across all arrays, at a time, the apparent serialization hurts the throughput in terms of the number of patterns matched per second, i.e., the match rate.

Oracular Pattern Scheduling (Oracular): The oracular scheduler resides between the input pattern pool and SpinPM, and controls to which column in which array each pattern goes. Oracular may still feed a given pattern to multiple columns, in multiple arrays; however, it does not consider columns which carry a too-dissimilar (reference) fragment. In other words, Oracular directs patterns to columns and arrays in a way such that achieving a high similarity score becomes more likely. While Oracular bases its pattern scheduling decisions on perfect information, a practical implementation of this idea would incur the overhead of gathering this information, i.e., extracting a schedule that keeps pattern matching confined to columns where a high similarity score is more likely. In any case, such smart scheduling of patterns benefits the throughput performance by reducing the redundant computation, which eats into the energy budget.
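One practical way to approximate Oracular's pruning is to compare compact bit-vector signatures of the pattern and of each column's resident fragment before scheduling. The sketch below assumes q-gram-style presence signatures over the DNA alphabet {A, C, G, T}; the scheme and all names are illustrative, not the GRIM-filter implementation of [45]:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* 16-bit presence vector over all 2-character chunks of {A,C,G,T}
     * (4 x 4 = 16 possible chunks); input is assumed to use this alphabet. */
    static uint16_t signature(const char *s, int len) {
        static const char alpha[] = "ACGT";
        uint16_t sig = 0;
        for (int i = 0; i + 1 < len; i++) {
            int hi = (int)(strchr(alpha, s[i])     - alpha);
            int lo = (int)(strchr(alpha, s[i + 1]) - alpha);
            sig |= (uint16_t)(1u << (hi * 4 + lo));
        }
        return sig;
    }

    /* Schedule a pattern to a column only if the resident fragment contains
     * every chunk of the pattern -- a necessary condition for an exact
     * substring match (approximate matching would relax this test). */
    static int worth_scheduling(uint16_t frag_sig, uint16_t pat_sig) {
        return (frag_sig & pat_sig) == pat_sig;
    }

    int main(void) {
        uint16_t frag = signature("ACGTACGA", 8);
        uint16_t pat  = signature("CGTA", 4);
        printf("%s\n", worth_scheduling(frag, pat) ? "schedule" : "prune");
        return 0;
    }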

However, since all columns in a SpinPM-SHE array perform pattern matching in parallel (in lock-step), we require that all columns have their patterns ready before computation begins. Scheduling patterns takes time, which might further affect the throughput performance of SpinPM if we let the array sit idle, waiting for scheduling decisions to take place. We can mask this overhead, as drawing pattern scheduling decisions for all the columns in an array takes less time than writing the patterns into the columns of that array. This, in effect, does not introduce any timing overhead on system throughput, although there is an energy overhead.

Fig. 8 shows the throughput performance and compute efficiency, normalized to the GPU baseline, for Naive and Oracular when processing a pool of 3M patterns. We use the match rate (in terms of the number of patterns processed per second) for throughput, and the match rate per milliwatt for compute efficiency. Naive yields very low throughput, as mapping each pattern to every column of each array at a time increases the total execution time significantly. Oracular pattern scheduling is very effective in eliminating this inefficiency: we observe that the throughput performance w.r.t. Naive increases by orders of magnitude in this case. The fundamental limitation for Naive is the redundancy in computation.

We next identify the individual contributions of the actual computation stages. Fig. 9 shows the distribution of energy and latency components. The preset overheads are 67.67% and 70.7% in energy and latency, respectively, while the bit-line (BL) driver energy and latency overheads are below 1%.


Fig. 8: Performance and energy characterization, normalized to the GPU baseline. [(a) Match rate (patt/sec): Naive 2.049e-4, NaiveOpt 2.135e-4, Oracular 614.623, OracularOpt 640.554. (b) Compute efficiency (patt/sec/mW): Naive 4.3124e-4, NaiveOpt 4.3127e-4, Oracular 1293.83, OracularOpt 1293.83.]

Fig. 9: Breakdown of energy and latency in computation. [(a) Energy: Add 55.73%, Match 34.58%, Read 9.64%, Write 0.04%. (b) Latency: Read 85.11%, Add 7.72%, Match 7.04%, Write 0.13%.]

The breakdowns in Fig. 9 do not contain the preset and BL-driver related overheads. Apart from these, we observe that the majority of the energy (Fig. 9a) is consumed by the match operations and the additions during similarity score computation. In the case of latency (Fig. 9b), however, the dominant component shifts to the read-outs of the similarity scores, while match and addition have similar shares. For both energy and latency, writes consume less than 1% of the share.

This breakdown clearly identifies the preset overhead as the essential bottleneck. Also, although the times required by the match and similarity-score compute phases are not drastically different, the energy required by the similarity-score compute phase is around 1.5× that of the match phase. Accordingly, we next look into the preset and similarity score computation operations for optimization opportunities.

Optimized Designs (NaiveOpt, OracularOpt): As the reduction tree for addition (Fig. 4b), which is at the core of similarity score computation, already represents an efficient design, we focus on optimizations to reduce the preset overhead. Since presets are inevitable for logic operations, it is not possible to get rid of them entirely. However, we can still hide the preset latency through careful scheduling of presets.

As presets do not correspond to actual computation, Naive and Oracular simply perform them in between computation steps. The challenge comes from successive steps of computation using the very same set of cells to implement logic functions. Instead of interrupting computation to preset these cells every time a few computation steps are completed, we can distribute such consecutive steps over different cells, using the scratch area from Fig. 3, and preset them all at once, before computation starts (a sketch of the two schedule shapes follows below). We call the resulting designs NaiveOpt and OracularOpt, respectively. The NaiveOpt and OracularOpt bars in Figure 8a and Figure 8b capture the resulting throughput and energy performance. We observe that, for each design option, the energy consumption of the optimized case is unchanged. This is because the optimization only changes the scheduling of the presets; the total number of presets performed remains the same. The throughput performance, on the other hand, skyrockets in both cases thanks to gang presets (Section C).
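The difference between the two schedules can be captured in a few lines; the following is an illustrative software model only, with stub functions standing in for the actual SpinPM operations:

    #include <stdio.h>

    /* Stubs standing in for SpinPM operations; illustrative only. */
    static void preset(int row, int ncell, int val)
        { printf("preset r%d (%d cells) <- %d\n", row, ncell, val); }
    static void gang_preset(int base, int nrows)
        { printf("gang-preset rows r%d..r%d\n", base, base + nrows - 1); }
    static void compute_step(int s, int out_row)
        { printf("step %d -> r%d\n", s, out_row); }

    int main(void) {
        const int nsteps = 3, ncell = 8, scratch_base = 32;

        /* Naive/Oracular: reuse one output row, so a (row-serial) preset
         * interrupts computation before every step. */
        for (int s = 0; s < nsteps; s++) {
            preset(scratch_base, ncell, 1);
            compute_step(s, scratch_base);
        }

        /* NaiveOpt/OracularOpt: spread outputs over consecutive scratch
         * rows and gang-preset them all once, before computation starts. */
        gang_preset(scratch_base, nsteps);
        for (int s = 0; s < nsteps; s++)
            compute_step(s, scratch_base + s);
        return 0;
    }

The total preset count is identical in both schedules, which is why the energy stays unchanged while the latency shrinks.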

Fig. 10: Impact of filtering inaccuracy on throughput. [x-axis: number of probable match locations (2–10); y-axis: normalized value; curves: match rate and compute efficiency.]

Practical Search Space Pruning: The throughput for Oracular represents the theoretically achievable maximum. We next consider a practical implementation, as detailed in Section III-B. For the GRIM-filter based implementation, we observe that the filtering overhead, compared to the actual pattern matching overhead, is negligible and therefore has effectively no impact, even considering very long sub-string lengths (used for chunking the patterns and the reference in converting them to bit-vectors). The accuracy of the filtering, however, still has an impact, as captured by Fig. 10. This figure shows, without loss of generality, how the throughput and compute efficiency of DNA (both normalized to the NMP baseline) change as the number of locations (column indices) in an array for possible matches increases (which would be the case under heavy aliasing during hashing). The decrease in the performance numbers is intuitive, since more match locations translate into more iterations of the same set of search patterns through the SpinPM arrays. Even for a high degree of filter inaccuracy, however, SpinPM-SHE still performs better than the baseline.

How close a practical implementation can come to Oracular strongly depends on the actual values of the patterns as well, which may or may not ease the scheduling decisions. Since each array keeps consecutive fragments of the reference, it is always possible that the patterns directed into a particular array have no matches in any of its columns. We may not always be able to eliminate such ill-schedules, depending on the pattern values, and the incurred redundant computation would degrade performance. The feasibility of any pattern scheduler is contingent upon the distribution of the patterns, in terms of the columns in the arrays where the most similar fragments reside.

Sensitivity Analysis: Up until now, we have used a pattern length of 100 characters. We next examine the impact of the pattern length on the energy and throughput characteristics. Without loss of generality, we confine the analysis to OracularOpt. For the purpose of design space exploration, we experiment with pattern lengths of 200 and 300 characters, which are representative values for the alignment of short DNA sequences [14]. We keep the array structure the same, while the reference length remains fixed by construction.

Fig. 11 summarizes the outcome. Understandably, with increasing pattern length, more computation becomes necessary to generate the similarity scores in each row. However, this effect does not directly translate into degraded performance: the throughput for increasing pattern lengths remains close to the baseline throughput for 100-character patterns. This is because the preset optimization is scalable.


Fig. 11: Sensitivity to pattern length for OracularOpt. [(a) Match rate (patt/sec): 100bp 640.553, 200bp 679.111, 300bp 733.971. (b) Compute efficiency (patt/sec/mW): 100bp 1293.83, 200bp 723.99, 300bp 549.89.]

Irrespective of the application domain, the maximum pattern length is ultimately limited by technology constraints, since the required number of cells per column also increases with increasing pattern length. We further observe that the compute efficiency (i.e., the match rate per mW) decreases due to the increase in computation per alignment, which is congruent with intuition.
