
Co-training of Feature Extraction and Classification using Partitioned Convolutional Neural Networks

Wei-Yu Tsai, Jinhang Choi, Tulika Parija, Priyanka Gomatam, Chita Das, John Sampson, and Vijaykrishnan Narayanan

{wzt114,jpc5731,txp5172,pkg5076}@psu.edu, {das,sampson,vijay}@cse.psu.edu
School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, PA USA

ABSTRACT
There are an increasing number of neuromorphic hardware platforms designed to efficiently support neural network inference tasks. However, many applications contain structured processing in addition to classification. Being able to map both neural network classification and structured computation onto the same platform is appealing from a system design perspective. In this paper, we perform a case study on mapping the feature extraction stage of pedestrian detection using Histogram of Oriented Gradients (HoG) onto a neuromorphic platform. We consider three implementations: one that approximates HoG using neuromorphic intrinsics, one that emulates HoG outputs using a trained network, and one that allows feature extraction to be absorbed into classification. The proposed feature extraction methods are implemented and evaluated on neuromorphic hardware (IBM Neurosynaptic System). Our study shows that both a designed approximation and a "parroted" emulation can achieve similar accuracy, and that the latter appears to better capitalize on limited training and resource budgets than the absorbed approach, while also being 6.5x to 208x more power efficient than the programmed approach.

CCS CONCEPTS
• Computing methodologies → Neural networks; • Computer systems organization → Neural networks;

KEYWORDS
Convolutional neural network; feature extraction; neuromorphic hardware

ACM Reference format:
Wei-Yu Tsai, Jinhang Choi, Tulika Parija, Priyanka Gomatam, Chita Das, John Sampson, and Vijaykrishnan Narayanan. 2017. Co-training of Feature Extraction and Classification using Partitioned Convolutional Neural Networks. In Proceedings of Design Automation Conference, Austin, Texas USA, June 2017 (DAC'54), 6 pages.
DOI: http://dx.doi.org/10.1145/3061639.3062218

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DAC'54, Austin, Texas USA
© 2017 ACM. 978-1-4503-4927-7/17/06...$15.00
DOI: http://dx.doi.org/10.1145/3061639.3062218

1 INTRODUCTION
Research into systems supporting classification, recognition, and other tasks via neural network inference is experiencing an explosion of interest. Because many formulations of these inference computations offer the acceleration-opportunity trifecta of being highly structured, highly parallel, and highly local, many neural network accelerators and neuromorphic hardware platforms have been proposed [7, 12–14]. In general, these specialized hardware accelerators have highly parallel processing units and specialized functionality in each unit, and they are optimized for primitives that reduce data movement between memory and processing elements, making them very efficient at classification tasks compared to traditional CPUs and GPUs.

Much of the work on systems that accelerate artificial neural networks has focused either on training and inference of large networks that classify raw inputs, or on training more modest networks to classify the output of structured feature extractors. This is partly because internal features are presumed to be opaque, and partly because these platforms have proved so successful as efficient classification engines. However, the same primitives that are used to accelerate inference can also be leveraged for other tasks. For instance, prior work has shown how the IBM neurosynaptic system (TrueNorth) [3, 13], a power-efficiency-oriented neuromorphic hardware platform that uses digital spikes for signals, can implement highly efficient audio feature extraction [17, 18] in addition to classification tasks. The ability to implement both efficient feature extraction and efficient classification on the same platform is very compelling from a systems-integration perspective.

In this paper, we perform a case study on pedestrian detection to explore uses of neuromorphic hardware beyond pure classification tasks, and the effectiveness of imposing explicit structure and semantics on the training process for a neuromorphic classification system. We target the IBM Neurosynaptic System [3, 13] as our neuromorphic platform of choice, but our high-level design approaches are broadly applicable to many neuromorphic platforms. Figure 1 shows the high-level flow for a pedestrian detection task based on the histogram-of-oriented-gradients (HoG) [8] feature representation, and how this feature extraction could be pursued on a neuromorphic platform. We consider three different design approaches for mapping HoG to a neuromorphic substrate: a programmed approach using approximations of HoG computation components built from neuromorphic primitives (NApprox), a network trained to mimic HoG outputs (Parrot), and feature extraction fundamentally entangled with end-to-end classification (Absorbed). We compare all of the approaches against an FPGA-based baseline [1]. The contributions of this paper include:


[Figure 1 diagram: Traditional HoG on FPGA (Ix/Iy, tan⁻¹ θ, amplitude, histogram, optional HoG block norm., classifier PP); NApprox HoG on neuromorphic HW (Ix/Iy, inner product, histogram, max pooling); Parrot HoG on neuromorphic HW (trained classifier); or an absorbed feature extractor + classifier (monolithic NN) on neuromorphic HW.]

Figure 1: Pedestrian detection flow using HoG, depicting multiple approaches for implementing feature extraction. In addition to an FPGA baseline implementation of HoG feature extraction, we consider approximation using neuromorphic primitives, parroting the output of an HoG extractor with a neural network, and a monolithic classification network approach wherein feature extraction is absorbed into the classifier.

• We show that the precision-recall curves for NApprox are comparable to those of the baseline FPGA system when using an equivalent FPGA classifier, and that NApprox and Parrot produce comparable curves when using an equivalent state-of-the-art neuromorphic classifier [11], demonstrating that all three of these approaches produce features of similar quality.

• We show that, while NApprox and Parrot produce results of similar quality, Parrot uses 6.5x to 208x less power. This indicates that trained mimicry can potentially map key components of a flow onto a neuromorphic platform much more efficiently than direct programmatic emulation via neuromorphic intrinsics. At the same time, the difficulty of training convergence seen in the Absorbed approach, when given the same resources as Parrot, indicates the value of imposing some degree of structure, even within a learned-model approach, when targeting a neuromorphic platform.

• Parrot HoG operates on stochastic input signals and generates output every clock tick. Therefore, signals and features can be represented by as little as a single spike, fired with probability proportional to the value. In that case, the throughput of a single Parrot HoG module can be boosted to 1000 cells/sec, leading to a low system power consumption of 192mW for full-HD video at 26 fps.
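The stochastic representation in the last contribution can be illustrated with a minimal NumPy sketch. It assumes that each of n spike slots fires independently with probability equal to the normalized value, so the decoded estimate is the fraction of slots that spiked; the function names are hypothetical, and this is not TrueNorth's actual input transducer.

```python
import numpy as np

def stochastic_encode(values, n_spikes, rng):
    """Rate-code values in [0, 1] as n_spikes Bernoulli spike slots.

    Each slot fires independently with probability equal to the value,
    so the spike count is Binomial(n_spikes, value).
    """
    values = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    # Shape (n_spikes, *values.shape); True marks a spike in that slot.
    return rng.random((n_spikes,) + values.shape) < values

def decode(spikes):
    """Estimate each value as the fraction of slots that spiked."""
    return spikes.mean(axis=0)

rng = np.random.default_rng(0)
x = rng.random(10_000)  # hypothetical feature values to transmit
for n in (1, 32):
    err = np.abs(decode(stochastic_encode(x, n, rng)) - x).mean()
    print(f"{n:2d}-spike code: mean absolute error {err:.3f}")
```

With a single spike the decoded value is only 0 or 1, so for uniformly distributed inputs the expected absolute error is 1/3; the 32-spike code cuts the error several-fold at the cost of a 32x longer encoding window, which is the precision/throughput trade-off the contribution describes.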

2 BACKGROUND
2.1 Pedestrian Detection with HoG
Pedestrian detection is a classic problem in the field of computer vision [9]. It has been shown that a human's salient features can be sufficiently represented by gradient orientations within spatial relationships to detect people within images [8], leading to the development of HoG variants as feature descriptors. Furthermore, it is possible to map the primitives of HoG efficiently to lower-precision and approximate versions of its component operations, which has spurred deployment of HoG on embedded systems [1, 16].

Pixel0 Pixel1 Pixel2
Pixel3 Pixel4 Pixel5
Pixel6 Pixel7 Pixel8

Figure 2: Pixel structure to illustrate gradient computation

State-of-the-art HoG approaches are heuristic methods consisting of the following procedures: color-space adjustment, gradient computation, orientation voting in cells, contrast normalization over spatial blocks, feature collection over a window, and classification based on scores from a trained model. Gradient computation starts by dividing the image into cells of 8 × 8 pixels. Following this, the algorithm can be broken down into the following steps, each applied to one cell at a time:

(i) Gradient vector: A discrete derivative mask is applied to the cell to compute the gradient. The original HoG study [8] found that a centered, 1-D point derivative ([-1, 0, 1]) gives the best performance. The result is the derivative of the image in the x- and y-directions at every pixel; in terms of Figure 2, this is equivalent to $I_x = \text{Pixel5} - \text{Pixel3}$ and $I_y = \text{Pixel1} - \text{Pixel7}$.

(ii) Computation of the gradient angle and magnitude: After finding the x and y derivatives, the angle and magnitude of the gradient are calculated at every pixel.

(iii) Orientation binning: The orientations of the pixels are then binned in a histogram. Each pixel's vote is typically weighted by its gradient magnitude, with bins evenly spaced at 20° over 0° to 180° (9 bins) or 0° to 360° (18 bins).

The feature descriptor of an image is obtained by concatenating the histograms of all the cells of the image.
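Steps (i) to (iii) for a single cell can be sketched in NumPy as follows. This is an illustrative, conventional (magnitude-weighted) version, not the FPGA or TrueNorth implementation; it assumes a 10 × 10 input (the 8 × 8 cell plus a one-pixel border, as used later in Section 4) and omits interpolation between neighboring bins.

```python
import numpy as np

def cell_histogram(patch, n_bins=9, signed=False):
    """Magnitude-weighted orientation histogram for one 8x8 HoG cell.

    `patch` is a 10x10 grayscale array: the 8x8 cell plus a one-pixel
    border, so the centered [-1, 0, 1] derivative exists at every cell
    pixel. No bilinear interpolation between bins is performed.
    """
    p = np.asarray(patch, dtype=float)
    # Centered differences matching Ix = Pixel5 - Pixel3 and
    # Iy = Pixel1 - Pixel7 in Figure 2 (row index grows downward).
    ix = p[1:-1, 2:] - p[1:-1, :-2]
    iy = p[:-2, 1:-1] - p[2:, 1:-1]
    magnitude = np.hypot(ix, iy)
    span = 360.0 if signed else 180.0
    angle = np.degrees(np.arctan2(iy, ix)) % span
    # Each bin covers span / n_bins degrees (20 degrees for 9 or 18 bins).
    bins = np.minimum((angle // (span / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())  # vote by magnitude
    return hist

patch = np.zeros((10, 10))
patch[:, 5:] = 1.0             # vertical edge: dark left, bright right
print(cell_histogram(patch))   # all 16 units of magnitude land in bin 0
```

Concatenating such histograms over all cells (and, in full HoG, normalizing them over blocks) yields the image descriptor.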

2.2 IBM Neurosynaptic System
For our neuromorphic platform, we target the TrueNorth chip [3, 13]. TrueNorth combines event-driven communication with colocated memory and computation to build a highly parallel, non-von Neumann architecture. The architectural abstraction can be represented as a large-scale, extremely low-power spiking neural network. The basic building block of a TrueNorth chip is called a neurosynaptic core. Each neurosynaptic core consists of 256 axons (input lines), 256 neurons (output lines), and 256 × 256 synapses represented by a crossbar that provides full connectivity between the neurons and axons of a single core. A single TrueNorth chip comprises 4096 neurosynaptic cores and thus a total of 1 million neurons and 256 million synapses. Connectivity on the TrueNorth chip has two layers: local connectivity within each neurosynaptic core, and on top of that a global, distributed on-chip and off-chip interconnect that wires multiple cores together. A neuron can be connected to an axon on the same core (local, intra-core) or on another core (long-distance, inter-core). A TrueNorth core consumes ∼16 µW (66mW for 4096 cores @ 0.8 V). A detailed description can be found in [3–5].
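The chip-level figures above follow directly from the per-core numbers; a quick arithmetic check (plain Python, using only values stated in this section) shows that "1 million" and "256 million" are binary millions and that 66mW spread over 4096 cores is about 16 µW per core.

```python
cores_per_chip = 4096
neurons_per_core = 256
synapses_per_core = 256 * 256          # full axon-to-neuron crossbar
chip_power_w = 66e-3                   # 4096 cores @ 0.8 V

neurons = cores_per_chip * neurons_per_core      # 1,048,576 = 2**20
synapses = cores_per_chip * synapses_per_core    # 268,435,456 = 256 * 2**20
power_per_core_uw = chip_power_w / cores_per_chip * 1e6
print(neurons, synapses, round(power_per_core_uw, 1))  # 1048576 268435456 16.1
```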


The function of each crossbar in a TrueNorth core is the inner product of the input vector, 256 binary elements indicating whether there is an incoming spike on each axon, and the weight matrix, 256 × 256 weights corresponding to the axons. More specifically, the weight of a crossbar point is determined by a 1-bit connectivity indicator and a per-neuron look-up table with four entries, one for each of the four possible axon types labeled on each axon. The neurons in the core update their membrane potentials with the inner products, and fire if the membrane potential exceeds the threshold (plus a random number, if stochastic mode is enabled).
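The crossbar computation just described can be sketched as one tick of a simplified core. This is an illustration under stated assumptions, not IBM's neuron model: leak, the various reset modes, and other neuron parameters are omitted, firing simply resets the potential to zero, and the names (`core_tick`, `max_noise`) are hypothetical.

```python
import numpy as np

def core_tick(spikes_in, connected, axon_type, weight_lut, potential,
              threshold, rng=None, max_noise=0):
    """One tick of a simplified TrueNorth-style neurosynaptic core.

    spikes_in  : (256,) bool, incoming spike on each axon this tick
    connected  : (256, 256) bool, crossbar bit for [axon, neuron]
    axon_type  : (256,) int in 0..3, type label of each axon
    weight_lut : (256, 4) int, each neuron's weight for each axon type
    potential  : (256,) int, membrane potentials, updated in place
    threshold  : (256,) int, firing thresholds
    """
    # Weight of crossbar point (a, n): neuron n's LUT entry for axon a's
    # type, gated by the 1-bit connectivity indicator.
    weights = connected * weight_lut[:, axon_type].T
    # Inner product of the binary spike vector with the weight matrix.
    potential += spikes_in.astype(np.int64) @ weights
    effective = threshold
    if rng is not None and max_noise > 0:
        # Stochastic mode (simplified): add a random number to the threshold.
        effective = threshold + rng.integers(0, max_noise + 1, threshold.shape)
    fired = potential >= effective
    potential[fired] = 0            # assumed reset-to-zero after a spike
    return fired
```

Because the weight of every crossbar point is drawn from only four per-neuron values, the core's effective weight precision is low, which is why both NApprox and Eedn must be designed around this constraint.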

Corelets [4] abstract the configuration of a TrueNorth network, encapsulating the number of cores, neuron and axon types, and connectivity into one model. Corelet programming can be hierarchical: a main corelet consists of a number of subcorelets, each of which performs a small portion of the overall operation. The corelet programming environment converts corelet objects into model files runnable on both the TrueNorth hardware and a validated simulator (1:1 mapping).

Instead of programming the detailed configuration of the hardware, IBM provides another design approach: the energy-efficient deep neuromorphic network (Eedn) [11]. Eedn is a TrueNorth-specific, CNN-like network that uses back-propagation to update the weights of the hidden layers during training. The differences between Eedn and traditional convolutional networks lie in the weights, the neuron representation, and the grouping of filters and layers. The weights in Eedn maintain a high-precision hidden value during training, which is then mapped to one of the trinary weights (−1, 0, 1) during network operation. The neurons in Eedn are spiking neurons with a threshold activation function, so the derivative of this function is approximated for training. Lastly, Eedn partitions layers and the corresponding filters into multiple groups so that the filters are sized to fit the 256 × 256 TrueNorth core crossbars.
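The three Eedn ingredients above can be sketched in NumPy. This is not the Eedn code: the clip-then-round ternarization, the triangular surrogate derivative, and the fan-in check `fits_core` are our assumptions, chosen as common ways to realize "hidden high-precision weights", "approximated derivative", and "groups sized to 256-axon crossbars".

```python
import numpy as np

def ternarize(hidden_w):
    """Map high-precision hidden weights to trinary {-1, 0, 1}.

    Assumption: clip to [-1, 1] and round to the nearest integer; the
    hidden copy keeps receiving full-precision gradient updates.
    """
    return np.rint(np.clip(hidden_w, -1.0, 1.0)).astype(np.int8)

def spike(x):
    """Threshold activation used in the forward pass."""
    return (x >= 0).astype(float)

def spike_grad(x):
    """Assumed surrogate derivative of the step: a triangular window.

    The true derivative is zero almost everywhere, so training needs a
    stand-in that is nonzero near the threshold.
    """
    return np.maximum(0.0, 1.0 - np.abs(x))

def fits_core(kernel_h, kernel_w, in_channels, groups, max_fan_in=256):
    """Check that one filter group's fan-in fits a 256-axon crossbar."""
    return kernel_h * kernel_w * in_channels // groups <= max_fan_in

hidden = np.array([-0.7, -0.2, 0.4, 1.3])
print(ternarize(hidden))                                # trinary weights -1, 0, 0, 1
print(fits_core(3, 3, 64, 1), fits_core(3, 3, 64, 4))  # False True
```

The grouping check shows why partitioning matters: a 3 × 3 filter over 64 input channels needs 576 inputs per neuron, but splitting the channels into 4 groups brings each group down to 144, which fits one core.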

2.3 Feature Extraction on TrueNorth
While most prior work utilizing TrueNorth has focused on its use as a classifier, some efforts have mapped more structured computations to the platform. Tsai et al. [17, 18] constructed a feature extractor out of TrueNorth primitives that performed audio feature extraction to replace an off-chip feature extractor, and evaluated multiple implementations on the TrueNorth platform. However, prior work has not considered the multiple design paradigms of approximated, parroted, and absorbed feature extraction, the cross-comparison of which is a primary focus of this work.

3 DESIGN PARADIGMS FOR NEUROMORPHIC FEATURE EXTRACTION

Mapping traditional compute-intensive algorithms onto TrueNorth, amidst constraints of accuracy and performance, presents a number of interesting trade-offs. To gain insights into how best to map structured computations onto a neuromorphic platform, we implement feature extraction on TrueNorth using three distinct design paradigms, each of which we detail below.

Operation | Original computation | TrueNorth computation
Gradient vector | Filters (-1 0 1) and (-1 0 1)'; result: $I_x$ and $I_y$ | Filters (-1 0 1), (1 0 -1), (-1 0 1)', (1 0 -1)'; result: $I_x$, $-I_x$, $I_y$, $-I_y$ (pattern matching)
Gradient angle | $\theta = \tan^{-1}(I_y / I_x)$ | $\theta$ for which $I_x\cos\theta + I_y\sin\theta$ is maximum (comparison)
Gradient magnitude | $\sqrt{I_x^2 + I_y^2}$ | $I_x\cos\theta + I_y\sin\theta$ (inner product)
Histogram | Binned by magnitude, with either 9 bins from 0° to 180° or 18 bins from 0° to 360° | Binned by count, with 18 bins from 0° to 360° (inner product)

Table 1: HoG implementation: conventional vs. approximation for TrueNorth

3.1 NApprox: HoG using TrueNorth Primitives
As discussed in the previous section, certain operations on the TrueNorth chip can be done with high speed and low cost. In particular, the inner product operation is very efficient on TrueNorth, as are pattern matching and comparison operations. However, implementing exact, full-precision versions of any given part of the HoG pipeline on TrueNorth is cumbersome at best. Thus, we construct an approximate HoG feature extraction pipeline by building an approximate version of each of the four major components of the HoG flow out of primitives that are efficient on TrueNorth.

Table 1 illustrates the primary components in the remapping of the HoG algorithm onto TrueNorth, in terms of their mathematical expressions and underlying TrueNorth primitives. The gradient vector is found by performing low-precision pattern matching. The projection $I_x\cos\theta + I_y\sin\theta$ of the gradient onto direction $\theta$ reaches its maximum, which equals the gradient magnitude $\sqrt{I_x^2 + I_y^2}$, when $\theta$ is the gradient angle; this inner-product form maps efficiently onto the hardware. The gradient angle is the direction in which the gradient is dominant, so it can be found by evaluating the projection in different directions and picking the one in which it is maximum. The resultant angles are binned in the histogram.
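The right-hand column of Table 1 can be mimicked in floating point to see how projection, comparison, and counting replace arctangent, square root, and magnitude voting. This is a sketch of the idea, not the TrueNorth corelet: the 18 directions at 20° steps follow Table 1, while skipping pixels with no positive projection is our assumption.

```python
import numpy as np

N_DIRS = 18                                   # 20-degree steps over 0-360
THETAS = np.deg2rad(np.arange(N_DIRS) * 360.0 / N_DIRS)

def napprox_cell_histogram(ix, iy):
    """NApprox-style orientation histogram for one cell (float sketch).

    For every pixel, project the gradient onto 18 candidate directions
    (inner product), take the best-matching direction as the angle
    (comparison), and vote by count rather than by magnitude.
    """
    ix = np.asarray(ix, float).ravel()
    iy = np.asarray(iy, float).ravel()
    # projections[p, k] = Ix * cos(theta_k) + Iy * sin(theta_k)
    projections = np.outer(ix, np.cos(THETAS)) + np.outer(iy, np.sin(THETAS))
    best = projections.argmax(axis=1)         # approximate angle bin
    magnitude = projections.max(axis=1)       # approximate |gradient|
    moving = magnitude > 0                    # assumed: flat pixels do not vote
    return np.bincount(best[moving], minlength=N_DIRS), magnitude

hist, mag = napprox_cell_histogram(np.ones((8, 8)), np.zeros((8, 8)))
print(hist[0], mag.max())                     # 64 votes in the 0-degree bin
```

Because the true angle is at most 10° from the nearest candidate direction, the maximum projection stays within a factor cos(10°) ≈ 0.985 of the exact magnitude.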

To validate this mapping approach, and to gain insights into which features are specific to TrueNorth and which are more broadly applicable to other neuromorphic platforms, we also developed a software model that operates equivalently to the NApprox HoG on TrueNorth. In testing with a thousand training images from the INRIA Person Dataset [8], the outputs of the hardware implementation and the software model achieved over 99.5% correlation when configured to operate with the same quantization width. Building the software model allows us to explore a variety of quantization options beyond those currently available on the TrueNorth platform, and thus to perform more direct comparisons with other implementation alternatives.
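A software model makes this kind of quantization sweep cheap. The sketch below is illustrative only: it quantizes a random stand-in image (not INRIA data) to a given number of levels and reports the Pearson correlation of the resulting gradient features with the full-precision ones. The image, the level counts, and the helper names are hypothetical, and the paper's 99.5% figure compares hardware and software outputs at the same width, which this sketch does not reproduce.

```python
import numpy as np

def quantize(x, levels):
    """Round values in [0, 1] onto `levels` evenly spaced steps."""
    return np.round(np.clip(x, 0.0, 1.0) * (levels - 1)) / (levels - 1)

def gradients(image):
    """Centered [-1, 0, 1] differences over the interior pixels."""
    ix = image[1:-1, 2:] - image[1:-1, :-2]
    iy = image[:-2, 1:-1] - image[2:, 1:-1]
    return np.stack([ix, iy])

def correlation(a, b):
    """Pearson correlation between two arrays, flattened."""
    return np.corrcoef(np.ravel(a), np.ravel(b))[0, 1]

rng = np.random.default_rng(0)
image = rng.random((130, 66))        # hypothetical 64x128 window plus border
reference = gradients(image)
for levels in (4, 16, 64):
    corr = correlation(reference, gradients(quantize(image, levels)))
    print(f"{levels:2d} levels: correlation {corr:.4f}")
```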

3.2 Parrot-HoG
Instead of implementing feature extraction by programming the detailed operations, we can train a classifier that behaves like a desired feature extractor. Prior work [10] has termed this a "Parrot transformation". In an HoG feature vector, each bin has a value correlated to the degree to which the input region matches the respective orientation. Similarly, one can construct a neural network in which the neurons of a particular class output the confidence that the input data belongs to that class, thereby producing an equivalent feature vector.

[Figure 3 image: example training samples (sample 1, sample 2, sample 3) with their labels; labels 1 to 16 are mapped to angle, with 90°, 180°, and 270° marked.]

Figure 3: Randomly generated training data for training the feature extractor.

To train the classifier to produce HoG feature vectors, we prepared randomly generated labeled training data, examples of which are shown in Figure 3. Automatic generation of labeled data is possible because HoG is a well-defined function of the input pixels, in contrast with the "pedestrian" function. Empirically, we discovered that the initial layer in the network needed to be provided with all 8 × 8 inputs of the cell; otherwise, it was difficult to train the response to cell-level, rather than local, gradient features. We also generate training samples with different ratios of 1s and 0s so that the feature extractor learns to handle samples with offsets. Because the samples in each class are somewhat similar to those in neighboring classes, what matters is that the distribution of confidence scores matches the HoG histogram, rather than the particular class the network would pick if it were used as a classifier.
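A generator for such data might look like the sketch below. Figure 3 is not reproduced in the text, so the edge model, the 16 classes (suggested by the figure's labels 1 to 16), and the offset range are our assumptions: each binary 8 × 8 patch is split by a straight edge at a random angle, the bright side is set to 1, and a random offset varies the ratio of 1s to 0s.

```python
import numpy as np

def make_sample(rng, n_classes=16, size=8):
    """One synthetic (image, label) pair for parroting an orientation feature.

    The edge normal points at a random angle; pixels on the positive side
    are 1, so the image gradient points along that normal. The offset
    shifts the edge, changing the ratio of 1s to 0s. The label is the
    orientation class of the normal.
    """
    angle = rng.uniform(0.0, 2 * np.pi)
    offset = rng.uniform(-0.25, 0.25) * size   # keeps both 0s and 1s present
    c = (size - 1) / 2.0
    rows, cols = np.mgrid[0:size, 0:size]
    x, y = cols - c, c - rows                  # y axis points up
    image = (x * np.cos(angle) + y * np.sin(angle) > offset).astype(np.uint8)
    label = int(angle // (2 * np.pi / n_classes)) % n_classes
    return image, label

def make_dataset(n, seed=0, **kwargs):
    """Stack n generated samples into (n, size, size) images and n labels."""
    rng = np.random.default_rng(seed)
    images, labels = zip(*(make_sample(rng, **kwargs) for _ in range(n)))
    return np.stack(images), np.array(labels)

X, y = make_dataset(1000)
print(X.shape, np.bincount(y, minlength=16))
```

Since the label is a deterministic function of the generated pixels, arbitrarily many training samples can be produced without manual annotation, which is the property the paragraph relies on.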

3.3 Absorbed Features (Monolithic Classifier)
As a final point of comparison, we also consider whether an explicit feature extraction stage is valuable in a neuromorphic environment, by examining a raw-image-to-classification system that does not impose particular feature extraction semantics. We provide this system with neuromorphic platform resources equivalent to the maximum combined resource count of extractor and classifier used by the previous two approaches, and train it on the same training set used to train the classifier for the explicit feature extraction designs.

Figure 4: Comparison of pedestrian detection feature extraction approaches with SVM classifiers via miss-rate/false-positive curves. The NApprox HoG performs similarly to the traditional FPGA HoG implementation whether using floating-point or 64-spike precision.

4 METHODOLOGY AND VALIDATION
To ensure that the HoG feature extractors produce high-quality outputs, we built support vector machine classifiers comparable to the one used in [2] for classifying FPGA-based HoG features. Support vector machines (SVMs) are a well-known supervised machine learning method. Using LIBSVM [6], we trained linear SVM classifiers on the 2,416 positive person images and 12,180 negative images in the INRIA Person Train Dataset, with hard-negative mining. In other words, after the training of an SVM model is completed, we go through the negative training images to find false positives and add them as negatives to augment the SVM model.

To separate algorithmic changes in the approximated HoG approach from TrueNorth-specific quantization effects, we validated both a high-precision software version and a low-precision, TrueNorth-compatible version of the feature extraction against the precision-recall characteristics of the FPGA approach in [2]. Figure 4 indicates that TrueNorth NApprox HoG, high-precision software NApprox HoG, and the original FPGA implementation provide comparable precision-recall characteristics when a resource-equivalent SVM is used as the classifier. More specifically, the configurations of the compared feature extractors are as follows: FPGA-HoG is an HoG of 9 orientation bins from [2] (weighted voting by magnitude, fixed-point computation); NApprox(fp) is the full-precision version of the neuromorphic-primitive HoG of 18 bins (voting by counts, floating-point (fp) computation); and NApprox is the TrueNorth-compatible, reduced-precision implementation of NApprox(fp). All HoGs exploit contrast normalization over 2 × 2 cells in a block, and l2norm denotes normalization by $v / \|v\|_2$, where $v$ is the block's feature vector.

In order to adapt to TrueNorth-specific resource constraints, the following changes were made to our proposed HoG schemes. First, color channels are reduced from RGB to grayscale. To compute the 8 × 8 gradient matrix for a cell described in Section 3, 10 × 10 pixels are fed to HoG. Weighted voting is conducted among 18 orientation bins. However, aliasing¹ is ignored to keep the approximation designs consistent. Assuming overlapping spatial blocks in contrast normalization, we set 2 × 2 cells to a block, with a stride of one cell both vertically and horizontally. As a result, given an HoG window of 64 × 128 pixels, we create 7,560 (= 7 × 15 × 18 × 4) feature elements for each training image.

¹ Discrete orientation bins cannot exactly represent an angular orientation between two neighboring bins. To mitigate aliasing, bilinear interpolation can be used as weighted voting by magnitude [8].


We use the INRIA Person Test Dataset [8] for test cases. Each SVM model infers person detections from 15 HoG window sizes, where each window size increases by 1.1×. As a result, there can be tens of thousands of detection windows in a full-HD (1920 × 1080) image. The detection windows are then narrowed down by performing non-maximum suppression (NMS) with ϵ = 0.2 [8]. Detection candidates are evaluated as false positives per image versus miss rate, as proposed by Dollár et al. [9], which is a proxy for precision-recall curves. A detection counts as a true positive only if the ratio of its overlapping region with a ground-truth box is greater than or equal to 0.5; otherwise, it is regarded as a false positive.
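The overlap test and window suppression can be sketched as follows. We read the overlap ratio as intersection over union, and we use plain greedy NMS as a common substitute for the NMS of [8]; how ϵ = 0.2 maps onto a greedy overlap threshold is not stated, so `max_overlap` here is a hypothetical free parameter.

```python
import numpy as np

def _area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    return inter / (_area(a) + _area(b) - inter)

def greedy_nms(boxes, scores, max_overlap=0.5):
    """Keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = [int(i) for i in np.argsort(scores)[::-1]]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < max_overlap]
    return keep

def is_true_positive(detection, ground_truth, min_overlap=0.5):
    """Match criterion: overlap with the ground-truth box of at least 0.5."""
    return iou(detection, ground_truth) >= min_overlap

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(greedy_nms(boxes, np.array([0.9, 0.8, 0.7])))   # [0, 2]
```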

5 RESULTS
In this section, we evaluate the extracted HoG features using the classifier running on neuromorphic hardware, in order to showcase end-to-end pedestrian detection system performance. In contrast to traditional software and FPGA approaches, performing normalization is costly on the TrueNorth platform. Therefore, the experiments elide block normalization when using the neuromorphic classifier.

5.1 Accuracy Analysis
We use the same Eedn network for the three cases: monolithic NN (absorbed features), NApprox HoG, and Parrot HoG. We design an 18-layer Eedn classifier, using 2864 cores, for pedestrian detection on extracted HoG features. We design another 2-layer Eedn classifier for the Parrot HoG feature extraction, using 8 cores for each cell of 8 × 8 pixels (in total, 1024 cores for a window of 64 × 128 pixels). Combining the two Eedn networks, 3888 cores are used.
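The core budget above follows from the window geometry; this short check uses only the numbers in the paragraph, plus the 4096-core chip size from Section 2.2.

```python
window_w, window_h, cell = 64, 128, 8
cells_per_window = (window_w // cell) * (window_h // cell)  # 8 x 16 = 128
parrot_cores = 8 * cells_per_window                          # 8 cores per cell
classifier_cores = 2864                                      # 18-layer Eedn classifier
total_cores = parrot_cores + classifier_cores
print(parrot_cores, total_cores, total_cores <= 4096)       # 1024 3888 True
```

The combined 3888-core network thus fits within a single 4096-core TrueNorth chip.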

For the monolithic (absorbed) approach depicted at the top of Figure 1, the Eedn network with the combined network structure (3888 cores) was trained as a single network by back-propagation. The resultant network always makes blind decisions (all-positive or all-negative), meaning that this combination of network configuration and training set does not converge to a useful learned response. Note that similar pixel-to-classification networks using Eedn have been successfully trained for vision tasks [11] within single-chip resource budgets, albeit for lower-resolution inputs. We therefore suspect that the network over-fits because the training set is insufficient for the size of network needed to process 64 × 128-pixel inputs. However, the same set of training images is sufficient to train the 2864-core Eedn classifiers (iso-resource) for both the Parrot and NApprox HoG approaches, both of which produce HoG features from the same 64 × 128-pixel inputs.

Figure 5 shows the miss rate versus false positive curves for the Parrot (32-spike stochastic coding) and NApprox HoGs with Eedn classifiers. NApprox and Parrot have very similar miss rate versus false positive trade-offs, implying that they produce features of similar quality. Importantly, however, the Parrot HoG uses substantially fewer resources than NApprox to achieve feature extraction of similar quality.

Figure 5: Comparison of pedestrian detection approaches with Eedn classifiers. The HoG feature extraction approaches, NApprox and Parrot, perform similarly despite divergent resource usage.

5.2 Power Efficiency

For a full-HD image, we use sliding windows at six different scales (1.5× between consecutive scaling layers). In each scaling layer, we process cells of 8 × 8 pixels, giving {240×135, 160×90, 106×60, 71×40, 47×26, 31×17} cells per layer, a total of 57749 cells per image. To match the 26 fps throughput of prior HD approaches on reconfigurable hardware [1], the system must sustain an overall throughput of about 1.5 million cells/second (57749 cells × 26 fps).
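The per-layer cell counts and the throughput target can be reproduced in a few lines. This sketch assumes each layer's grid is the full-HD cell grid (1920/8 × 1080/8) divided by 1.5 per layer and rounded down; 1.5 is the ratio between consecutive grid sizes listed above (240/160, 135/90), and the constant names are illustrative. Exact rational arithmetic is used so that layers landing exactly on an integer (for example 60 or 40 rows) are not rounded down by floating-point error.

```python
import math
from fractions import Fraction
from typing import List, Tuple

CELL = 8                       # cell size in pixels
FRAME_W, FRAME_H = 1920, 1080  # full-HD frame
LEVELS = 6                     # number of scaling layers
SCALE = Fraction(3, 2)         # assumed per-layer downscale (ratio of the listed grids)
TARGET_FPS = 26                # frame rate of the reconfigurable baseline [1]


def cell_grid(level: int) -> Tuple[int, int]:
    """(columns, rows) of 8 x 8 cells at a pyramid layer, rounded down."""
    factor = SCALE ** level  # exact rational, so exact multiples stay exact
    return (math.floor(Fraction(FRAME_W // CELL) / factor),
            math.floor(Fraction(FRAME_H // CELL) / factor))


grids: List[Tuple[int, int]] = [cell_grid(k) for k in range(LEVELS)]
cells_per_frame = sum(w * h for w, h in grids)
required_cells_per_s = cells_per_frame * TARGET_FPS

print(grids)                 # [(240, 135), (160, 90), (106, 60), (71, 40), (47, 26), (31, 17)]
print(cells_per_frame)       # 57749
print(required_cells_per_s)  # 1501474, i.e. about 1.5 million cells/s
```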

For the NApprox design, we use clock signals to accumulate the weighted sum in the membrane potentials over multiple clock ticks (typically 1 ms per tick), which yields more precise inner-product results. The input signals to the NApprox HoG are encoded in a 64-spike representation (6-bit fixed-point resolution), so each cell occupies 64 ticks. Therefore, a single NApprox HoG module, using 26 TrueNorth cores, provides a throughput of only about 15 cells/sec. Accordingly, the NApprox approach for full-HD images at 26 fps requires a very large number of cores (nearly 650 TrueNorth chips) and would consume 40 W. Thus, while we show that such a programmatic approach can be deployed on a neuromorphic platform, it is not always the most practical deployment option.
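A back-of-the-envelope version of this estimate is sketched below. Beyond the 15 cells/sec and 26 cores per module stated above, it assumes that each cell needs one 1 ms tick per input spike, that a TrueNorth chip has 4096 cores, and that each chip draws about 65 mW (the figure in the title of [3]). It is a first-order model, not the authors' estimation procedure, and the helper names are hypothetical.

```python
import math
from typing import Tuple

# Assumptions, not figures reported in this section: a TrueNorth chip has
# 4096 cores and draws about 65 mW (the figure in the title of [3]).
CORES_PER_CHIP = 4096
WATTS_PER_CHIP = 0.065
TICK_MS = 1.0                       # 1 ms per tick, as stated above
REQUIRED_CELLS_PER_S = 57749 * 26   # cells per frame x 26 fps


def module_throughput(spikes_per_cell: int) -> float:
    """Cells/s for one module, assuming one tick per input spike per cell."""
    return 1000.0 / (spikes_per_cell * TICK_MS)


def replicate(module_cells_per_s: float, cores_per_module: int) -> Tuple[int, int, float, float]:
    """Replicate modules until the target rate is met; return
    (modules, cores, fractional chips, watts)."""
    modules = math.ceil(REQUIRED_CELLS_PER_S / module_cells_per_s)
    cores = modules * cores_per_module
    chips = cores / CORES_PER_CHIP
    return modules, cores, chips, chips * WATTS_PER_CHIP


print(module_throughput(64))  # 15.625, quoted above as about 15 cells/sec
modules, cores, chips, watts = replicate(15, 26)
print(f"{modules} modules, {cores} cores, ~{chips:.0f} chips, ~{watts:.1f} W")
# 100099 modules, 2602574 cores, ~635 chips, ~41.3 W
```

The result is close to, but not identical with, the figures quoted above; the exact number depends on how cores are packed onto chips and on activity-dependent power, both of which this sketch ignores.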

For the Parrot HoG, we consider input representations whose precision ranges from 32 spikes down to 1 spike under stochastic coding. Figure 6 shows the trade-off between precision and classifier accuracy. With the 32-spike representation, each Parrot HoG module provides a throughput of 31 cells/sec; with a 1-spike representation, this rises to 1000 cells/sec. As a result, the Parrot HoG approach processes full-HD images at 26 fps using substantially fewer chips, with a total power of 6.15 W and 192 mW for the 32- and 1-spike representations, respectively.
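To make the precision/throughput trade-off concrete, the sketch below shows one common form of n-spike stochastic (rate) coding: a value in [0, 1] is sent over n ticks, each carrying a spike with probability equal to the value, and the receiver estimates the value as the spike count divided by n. This encoding is an illustrative assumption, not necessarily the exact transduction used in this work; it shows why fewer spikes give a noisier value while needing fewer 1 ms ticks per cell.

```python
import random
from statistics import mean
from typing import List


def encode(value: float, n_spikes: int, rng: random.Random) -> List[int]:
    """Stochastic rate code: one Bernoulli(value) spike per tick over n ticks."""
    return [1 if rng.random() < value else 0 for _ in range(n_spikes)]


def decode(train: List[int]) -> float:
    """Estimate the encoded value as the fraction of ticks that carried a spike."""
    return sum(train) / len(train)


def mean_abs_error(value: float, n_spikes: int, trials: int = 2000, seed: int = 0) -> float:
    """Average |decoded - value| over many independent encodings."""
    rng = random.Random(seed)
    return mean(abs(decode(encode(value, n_spikes, rng)) - value) for _ in range(trials))


for n in (32, 4, 1):
    cells_per_s = 1000 / n  # one cell per n ticks at 1 ms per tick
    print(f"{n:2d} spikes: mean |error| = {mean_abs_error(0.3, n):.3f}, "
          f"~{cells_per_s:.0f} cells/s per module")
```

Under these assumptions, the 1-spike code can only transmit 0 or 1 per cell input, which is why it gains roughly 32× in throughput over the 32-spike code at the cost of much coarser features.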

For a comparison across designs, we synthesized the HoG module in [1] on a Xilinx Virtex-7 690T FPGA [20] with an IBM CAPI interface [15] and estimated the power consumption using Xilinx Vivado [19]. The power for the logic part of the HoG accelerator in isolation is reported as 1.12 W. For a fair comparison with the proposed TrueNorth systems, which include internal signaling and data movement, we also consider the system-level consumption of the FPGA approach, including peripherals such as clocking and the CAPI interface; this is reported as 8.6 W. Table 2 compares the feature extraction approaches. Low-precision Parrot HoG encodings produce sufficiently high-quality output at substantially lower power: the 1-spike Parrot HoG is estimated at 192 mW, versus 8.6 W for the FPGA system and 40 W for NApprox.

Approach                          Signal resolution    Estimated power
High-precision HoG on FPGA [1]    16-bit               Logic only: 1.12 W; system: 8.6 W
NApprox HoG on TrueNorth          64-spike (6-bit)     40 W (≈ 650 TrueNorth chips)
Parrot HoG on TrueNorth           32-spike (5-bit)     6.15 W
Parrot HoG on TrueNorth           4-spike (2-bit)      768 mW
Parrot HoG on TrueNorth           1-spike (1-bit)      192 mW

Table 2: Estimated power consumption for HoG feature extraction approaches: FPGA (baseline), NApprox (direct programmatic mapping of the HoG algorithm via synaptic core intrinsics), and Parrot (trained neural network approximation of HoG).

Figure 6: Classifier accuracy and miss rate with different representations for the validation set of the training data.
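The Table 2 figures and the 6.5x to 208x power range quoted in the conclusions can be cross-checked with a few lines. The first part uses only Table 2 values. The second part is an assumed first-order model, not the authors' procedure: 8 cores per Parrot module (from Section 5.1), floor(1000/n) cells/sec for an n-spike code at 1 ms per tick, 4096 cores per chip, and about 65 mW per chip [3].

```python
import math

# Power figures from Table 2 (watts).
TABLE2_W = {
    "FPGA system": 8.6,
    "NApprox 64-spike": 40.0,
    "Parrot 32-spike": 6.15,
    "Parrot 4-spike": 0.768,
    "Parrot 1-spike": 0.192,
}

# Ratios behind the "6.5x to 208x" claim (NApprox power / Parrot power).
for name in ("Parrot 32-spike", "Parrot 1-spike"):
    print(f"NApprox / {name}: {TABLE2_W['NApprox 64-spike'] / TABLE2_W[name]:.1f}x")

# Assumed first-order model (not stated with Table 2).
CORES_PER_MODULE, CORES_PER_CHIP, WATTS_PER_CHIP = 8, 4096, 0.065
REQUIRED_CELLS_PER_S = 57749 * 26  # cells per frame x 26 fps


def parrot_watts(n_spikes: int) -> float:
    """Replicate 8-core Parrot modules to reach the target rate, then convert to watts."""
    cells_per_s = 1000 // n_spikes  # one cell per n ticks at 1 ms/tick, truncated
    modules = math.ceil(REQUIRED_CELLS_PER_S / cells_per_s)
    return modules * CORES_PER_MODULE / CORES_PER_CHIP * WATTS_PER_CHIP


for n in (32, 4, 1):
    table = TABLE2_W[f"Parrot {n}-spike"]
    print(f"Parrot {n}-spike: model {parrot_watts(n):.3f} W vs Table 2 {table} W")
```

Under these assumptions the model lands within a few percent of each Parrot entry. The Table 2 entries themselves scale linearly with spike count (6.15 W / 8 ≈ 768 mW, 6.15 W / 32 ≈ 192 mW), consistent with throughput per module being inversely proportional to the number of spikes per cell.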

6 CONCLUSIONS

We have presented two different explicit feature extraction approaches on the TrueNorth neuromorphic platform. We show that a mimicry (Parrot) approach can provide result quality equivalent to one using crafted intrinsics (NApprox) while using fewer resources and 6.5x to 208x less power. Moreover, our results suggest that HoG using the Parrot paradigm can be trained with a smaller training set and classifier than a monolithic classifier that absorbs feature extraction. This indicates that explicitly mapping at least some portion of feature extraction tasks to neuromorphic platforms is both possible and sometimes preferable. Compared to prior reconfigurable approaches, neuromorphic feature extraction can offer competitive quality-power trade-offs while integrating both extraction and classification tasks on a single platform. The partitioned CNNs developed in this work are prototypical implementations of applying structure within a learned-model approach suitable for neuromorphic platforms. Optimizing the combined Parrot HoG and Eedn network designs for better power efficiency, and exploring monolithic approaches more thoroughly for a more direct comparison, are subjects of our future work.

REFERENCES

[1] S. Advani, Y. Tanabe, K. Irick, J. Sampson, and V. Narayanan. A scalable architecture for multi-class visual object detection. In 2015 25th International Conference on Field Programmable Logic and Applications (FPL), pages 1–8, Sept 2015.

[2] S. K. Advani. Large-Scale Object Recognition for Embedded Wearable Platforms. PhD thesis, Pennsylvania State University, University Park, PA, USA, 2016.

[3] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. J. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha. TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537–1557, Oct 2015.

[4] A. Amir, P. Datta, W. P. Risk, A. S. Cassidy, J. A. Kusnitz, S. K. Esser, A. Andreopoulos, T. M. Wong, M. Flickner, R. Alvarez-Icaza, E. McQuinn, B. Shaw, N. Pass, and D. S. Modha. Cognitive computing programming paradigm: A corelet language for composing networks of neurosynaptic cores. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–10, Aug 2013.

[5] A. S. Cassidy, P. Merolla, J. V. Arthur, S. K. Esser, B. Jackson, R. Alvarez-Icaza, P. Datta, J. Sawada, T. M. Wong, V. Feldman, A. Amir, D. B. D. Rubin, F. Akopyan, E. McQuinn, W. P. Risk, and D. S. Modha. Cognitive computing building block: A versatile and efficient digital neuron model for neurosynaptic cores. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–10, Aug 2013.

[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, Apr. 2011.

[7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, pages 609–622, Washington, DC, USA, 2014. IEEE Computer Society.

[8] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), pages 886–893. IEEE, 2005.

[9] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell., 34(4):743–761, Apr. 2012.

[10] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural acceleration for general-purpose approximate programs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 449–460. IEEE Computer Society, 2012.

[11] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. CoRR, abs/1603.08270, 2016.

[12] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown. Overview of the SpiNNaker system architecture. IEEE Transactions on Computers, 62(12):2454–2467, Dec 2013.

[13] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.

[14] J. Schemmel, A. Grübl, S. Hartmann, A. Kononov, C. Mayr, K. Meier, S. Millner, J. Partzsch, S. Schiefer, S. Scholze, R. Schüffny, and M. O. Schwartz. Live demonstration: A scaled-down version of the BrainScaleS wafer-scale neuromorphic system. In 2012 IEEE International Symposium on Circuits and Systems, pages 702–702, May 2012.

[15] J. Stuecheli, B. Blaner, C. R. Johns, and M. S. Siegel. CAPI: A coherent accelerator processor interface. IBM Journal of Research and Development, 59(1):7:1–7:7, Jan 2015.

[16] A. Suleiman and V. Sze. An energy-efficient hardware implementation of HOG-based object detection at 1080HD 60 fps with multi-scale support. J. Signal Process. Syst., 84(3):325–337, Sept. 2016.

[17] W. Y. Tsai, D. Barch, A. Cassidy, M. Debole, A. Andreopoulos, B. Jackson, M. Flickner, J. Arthur, D. Modha, J. Sampson, and V. Narayanan. Always-on speech recognition using TrueNorth, a reconfigurable, neurosynaptic processor. IEEE Transactions on Computers, PP(99):1–1, 2016.

[18] W. Y. Tsai, D. R. Barch, A. S. Cassidy, M. V. DeBole, A. Andreopoulos, B. L. Jackson, M. D. Flickner, D. S. Modha, J. Sampson, and V. Narayanan. LATTE: Low-power audio transform with TrueNorth ecosystem. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 4270–4277, July 2016.

[19] Xilinx. Vivado design suite user guide: Getting started. https://www.xilinx.com/support/documentation/sw_manuals/xilinx2015_4/ug910-vivado-getting-started.pdf, Nov 2015.

[20] Xilinx. 7 series FPGAs data sheet: Overview. https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf, March 2017.