
A High Performance Hardware Architecture for Multilayer Spiking Neural Networks

by

M. Sc. Marco Aurelio Nuño Maganda

A Dissertation submitted to the Program in Computer Science

Computer Science Department, in partial fulfillment of the requirements for the degree

of

DOCTOR IN COMPUTER SCIENCE

at the

National Institute for Astrophysics, Optics and Electronics

October 2009
Tonantzintla, Puebla

Advisors:

Dr. MIGUEL ARIAS ESTRADA, INAOE
Dr. CESAR TORRES HUITZIL, CINVESTAV-TAMAULIPAS

© INAOE 2009
All rights reserved

The author hereby grants to INAOE permission to reproduce and to distribute copies of this thesis in whole or in part.


Abstract

In this work, a hardware architecture for multilayer Spiking Neural Networks (SNNs) has been developed. Several attempts at implementing multilayer SNNs have been reported, but none of them has explored a hardware implementation and its tradeoffs in detail. Spiking Neuron Models (SNMs) have attracted the interest of computer science researchers due to their potential use as a more biologically plausible processing model; they have been tested and successfully applied to several computational tasks, such as speech processing, computer vision and robot navigation, with fewer hardware resources and fewer iterations than classical models.

In this research, the hardware plausibility of SNNs has been demonstrated through the design of a dedicated hardware architecture for emulating a large number of spiking neurons. Dedicated modules have been developed for coding, recall and learning; the main difference and contribution of this work with respect to previously reported works is that the coding and the learning are carried out on a single chip. This implementation scheme reduces the bandwidth that would be required to process the different stages on separate hardware processors, and it allows the reuse of hardware processing elements.

In classical neural networks, a set of input patterns is presented to the network, and the network learns by adjusting its weights so that the actual network output approaches the target (desired) network output. In multilayer SNNs, three processing stages can be identified. In the first stage, the data coding process is performed. The coding scheme implemented for the multilayer SNNs is the Gaussian Receptive Fields (GRFs), which map an input value into a set of firing times. The input firing times obtained by the GRFs are used to feed the SNN. The main advantage of the GRFs is that they generate a sparse coding of the input values, avoiding several scale-adjustment problems of other coding schemes. In the second stage, the network output is computed by evaluating the input firing times through the activation function. Each neuron connection has a set of synaptic terminals. The number of synaptic terminals is fixed at compilation time, but the architecture can be extended to support a different number of terminals. The output firing time of a neuron depends on the weights and delays associated with each synaptic terminal. Once a neuron has generated its own firing time, it remains in an inactive state. In the third stage, the adjustment of the weights and delays of each synaptic connection is computed: the input pattern is presented and, according to the desired output firing times, the network weights and delays are adjusted.

For the proposed architecture, hardware resource and performance statistics are reported, as well as the potential application of the architecture to machine learning datasets. A very similar convergence behavior and a reduction in the number of iterations required by the SNNs are obtained, and the potential application of the proposed architecture to other interesting problems is shown.

The proposed hardware modules for each SNN processing stage achieve significant performance improvements. The GRF module achieves a speedup of at least 10X with respect to the software implementation. The recall module achieves a speedup of at least 4X with respect to the software implementations, and the overall proposed architecture obtains an average speedup of at least 9X with respect to the software implementation. These performance figures are obtained on a device where only 40 physical processing elements can be implemented; the potential performance is much higher on larger FPGA devices where more processors can be implemented. With the current design, a maximum of 512 neurons per layer and only 3-layer networks can be emulated. These improvements provide an important starting point for large SNNs on FPGAs. The proposed hardware modules have been validated on a Virtex-II Pro FPGA device. The proposed architecture is designed to be scalable, so that more processing elements can be supported and the reported performance can be improved by implementing it on very large FPGA devices.

The spiking nature of this network allows an efficient FPGA implementation, thanks to the use of fine-grained parallelism and of pipelining to speed up the SNN computations. Moreover, the scheme used provides insights for further improvements in the simulation of realistic and large-scale neural computation through embedded systems, which could provide a mechanism to analyze the dynamics of the perception/action loop of systems.


Resumen

In this thesis, a hardware architecture for multilayer spiking neural networks (SNNs) has been developed. There have been several attempts to implement this type of network, but none of them has examined the hardware implementation and its tradeoffs in detail. Spiking Neuron Models (SNMs) have attracted the interest of computer science researchers because of their potential use as a more biologically plausible processing model, which has been applied successfully to different computational tasks such as robot navigation, computer vision and speech recognition; in addition, SNMs use fewer hardware resources and require fewer processing iterations than classical models.

In this research, the hardware plausibility of SNNs has been demonstrated through the design of a dedicated hardware architecture that allows the emulation of a large number of spiking neurons. Dedicated modules have been implemented for the coding, recall and learning phases, the main difference and contribution of this work being that the coding and the learning are carried out on-chip. This implementation scheme reduces the bandwidth required for data transfers between processing modules on different platforms (PC-FPGA) and allows the reuse of hardware processing elements.

In classical neural networks, a set of patterns is presented as input to a given network, and the network must learn by adjusting the current weights according to the current output and the expected output (or objective function). In multilayer spiking neural networks, three basic processing stages are identified. In the first stage, the data coding is performed. The implemented coding scheme is the Gaussian Receptive Fields (GRFs), which convert the input data (continuous or discrete) into output firing times. The advantage of the GRFs is that they perform a sparse coding, which avoids certain scale-adjustment problems present in other coding techniques. In the second stage, the firing times obtained by the coding process are passed to the recall module, which is in charge of obtaining the network output, consisting of a set of output firing times. For each connection between neurons in a multilayer spiking neural network, a set of synaptic terminals is required. The number of synaptic terminals has not been completely defined; for the proposed architecture this number is fixed, although it can be modified at compilation time to support a different number of synaptic terminals. The output firing times are obtained by evaluating the input firing times through an activation function. Therefore, the output of a neuron depends on the input firing times and on the delays and weights associated with each synaptic connection. Once a neuron has generated its output firing time, it enters an inactive state. In the third processing stage, the adjustment of the weights and delays of each synaptic terminal associated with the network connections is computed. The adjustment applied to the weights and delays depends on the input pattern, the current output pattern and the desired output pattern.

For the proposed architecture, hardware resource utilization and performance statistics are reported, as well as the potential application of the architecture to datasets used in machine learning applications. Similar convergence results are obtained, and the potential application of the proposed architecture to other interesting problems is proposed.

The proposed hardware modules for each processing stage obtain important performance improvements compared with software implementations. For the GRF coding hardware module, the performance improvement is at least 10X. For the recall-phase hardware module, a performance improvement of at least a factor of 4X is obtained. For the overall architecture, a performance improvement of at least a factor of 9X is obtained. The current improvements are limited by the device selected for the present implementation, which allows the implementation of only 40 processing elements. The potential performance is much better on larger FPGA devices, where a larger number of processing elements can be implemented. These results are an important starting point for large multilayer spiking neural networks. With the current design, it is only possible to emulate 512 neurons per layer and a maximum of 3 layers. The proposed architecture was validated on a Virtex II PRO FPGA device. The proposed architecture is designed to be scalable, making it possible to implement a larger number of processing elements on larger FPGA devices.

The spiking nature of the implemented neural network allows its efficient implementation on FPGA devices, due to the use of fine-grained parallelism and of pipelining to accelerate the computations carried out by the implemented network. Moreover, the current architecture provides insights for future implementations of realistic simulations of large-scale neural computation, through embedded systems that provide mechanisms to analyze the dynamics of perception/action loop systems.


Acknowledgments

I would like to thank Dr. Miguel Arias Estrada and Dr. Cesar Torres Huitzil for their support and guidance during the development of this research work.

I would like to thank the members of my defense committee for their comments: Dr. Bernard Girau, Dr. Rene Cumplido, Dra. Claudia Feregrino, Dr. Jesus Ariel Carrasco and Dr. Carlos Alberto Reyes.

I would like to thank CONACYT for the financial support provided for this research work.

I would like to thank my workmates and colleagues from several INAOE research projects for the motivation and help they provided during the development of this research work.

I would like to thank the graduate students who gave me motivation and guidance during the development of this research work.


Dedication

To my mother, Natividad Maganda García.

To my wife, my girlfriend, my friend, my lover, my sweet breakfast, my perfect cake, my favorite drink, my favorite dish, my dream, my celebration, my joy, the most delicious food, my perfume, my drink ... to the woman who is everything in my life.


Contents

List of Figures xiii

List of Tables xv

1 Introduction  1
  1.1 Motivation  1
  1.2 Research Questions  4
  1.3 Research Objectives  4
    1.3.1 General Objective  4
    1.3.2 Specific Objectives  4
    1.3.3 Contributions  5
  1.4 Contents and Thesis Organization  5

2 Background and Previous Work  7
  2.1 Neuro-Biological Foundations  7
    2.1.1 Advantages of ANNs  9
    2.1.2 Disadvantages of ANNs  9
  2.2 Artificial Neuron and Neural Networks  10
  2.3 Neuron Models  10
    2.3.1 Classical Neuron Models  13
      2.3.1.1 Digital Output Neurons  13
      2.3.1.2 Analog Output Neurons  13
  2.4 Network Topologies  15
  2.5 Learning in Artificial Neural Networks  16
  2.6 Networks of Spiking Neurons  19
    2.6.1 Spiking Neuron Models (SNMs)  19
      2.6.1.1 Hodgkin-Huxley Model (HH)  20
      2.6.1.2 Leaky Integrate-and-fire Model (LIF)  21
      2.6.1.3 Spike Response Model (SRM)  21
    2.6.2 FeedForward Topology for SNNs  23
    2.6.3 Learning Algorithms in SNNs  27
  2.7 FPGA Technology  27
    2.7.1 FPGA Structure  29
    2.7.2 Reconfigurable Computing  31
  2.8 Parallel Processing  33
    2.8.1 Parallelism in Feed-Forward Neural Networks  33
    2.8.2 Fundamentals of Parallel Computing  34
    2.8.3 Speed Definition on FPGA Devices  34
    2.8.4 Performance Metrics  35
    2.8.5 General Considerations for FPGA-based Hardware Accelerators for Neural Networks  35
  2.9 FPGA Implementations for SNNs  36

3 Gaussian Receptive Fields (GRFs) and SpikeProp  41
  3.1 Coding in SNNs  41
    3.1.1 Introduction  41
    3.1.2 Types of Information Coding  41
    3.1.3 Biological Foundations of Receptive Fields  44
    3.1.4 Population Coding through Gaussian Receptive Fields  45
    3.1.5 Mathematical Background  46
  3.2 SpikeProp  50
    3.2.1 Recall phase  50
    3.2.2 Learning phase  52
      3.2.2.1 SpikeProp with Delay, Decay and Threshold Learning  52
      3.2.2.2 Other SpikeProp Improvements  58
  3.3 Algorithm Discussion  58

4 Proposed Architecture  61
  4.1 Issues to be Addressed, Restrictions and Limitations  61
  4.2 Architecture Overview  62
    4.2.1 Modules Description  62
    4.2.2 Modules Interaction  63
  4.3 Coding: Gaussian Receptive Fields Processor (GRFP)  64
    4.3.1 Modules Description  64
    4.3.2 Modules Interaction  66
    4.3.3 Integration of GRFPs with other SNN Modules  69
  4.4 Recall: Spiking Neural Layer Processor (SNLP)  70
    4.4.1 Single Hardware Neuron: Neural Processor (NP)  70
    4.4.2 Modules Description  71
    4.4.3 Modules Interaction  72
  4.5 Learning Modules (LMs)  72
    4.5.1 Module Description  72
  4.6 Additional Hardware Modules  75
    4.6.1 Class Encoder (CE)  75
  4.7 Performance and Parallelism Analysis  80
    4.7.1 Parallelism Levels in the Proposed Architecture  80
    4.7.2 Parallelism Models and Performance Estimation  80
  4.8 Discussion  85

5 Hardware Implementation  89
  5.1 Validation Environment  89
    5.1.1 SW Tools  89
      5.1.1.1 Handel-C  89
    5.1.2 HW Tools  89
      5.1.2.1 Prototyping Board  89
    5.1.3 FPGA flow (Environment Flow)  90
      5.1.3.1 Developing Flow  90
      5.1.3.2 Interfacing Framework  93
      5.1.3.3 FPGA Model  94
      5.1.3.4 Memory Map  94
  5.2 Main Block of the Full SNN System  97
  5.3 Hardware Implementation for Gaussian Receptive Fields  97
    5.3.1 Implementation  97
    5.3.2 Testbench  98
    5.3.3 Results  98
    5.3.4 Discussion  99
  5.4 Hardware Implementation for Recall Phase in Multilayer SNNs  102
    5.4.1 Implementation  102
    5.4.2 Issues Related to the Neuron Potential Computation  102
    5.4.3 Testbench  103
    5.4.4 Results  105
    5.4.5 Discussion  105
  5.5 Hardware Implementation for Learning Phase in Multilayer SNNs  105
    5.5.1 Results  105
    5.5.2 Learning comparison  108
  5.6 Overall Implementation  110
    5.6.1 Discussion  111

6 SNN Architecture Tests with Datasets  115
  6.1 State of the Art of SNNs Applied to Standard Machine Learning Datasets  115
  6.2 Application Problems  116
    6.2.1 IRIS Dataset  116
    6.2.2 Wisconsin Breast Cancer Diagnosis Dataset (WBCD)  118
  6.3 Discussion  121

7 Conclusions and Future Work  123
  7.1 Conclusions  123
  7.2 Summary of Contributions  124
  7.3 Publications  124
    7.3.1 Refereed Publications  124
    7.3.2 Unrefereed Publications  124
  7.4 Future Research Directions  125
  7.5 Research Activities  125

Bibliography  127


List of Figures

2.1 Biological Neuron  8
2.2 Generic Model of Artificial Neurons  12
2.3 Diagram of a McCulloch-Pitts Unit  14
2.4 Different Activation Functions  15
2.5 Different NN Topologies  17
2.6 Supervised Learning Process  18
2.7 A two-weight Layer Feed-forward Neural Network  19
2.8 Basic Electric Circuit for the HH Model  20
2.9 Basic Electric Circuit for the LIF Model  20
2.10 EPSP and IPSP for Different Decay Rates  22
2.11 Feedforward SNNs with Multiple Synaptic Connections  24
2.12 Simple SNN Model with 2 Synaptic Connections  25
2.13 Computation of Neural Output for a Neuron with 4 Inputs  26
2.14 Taxonomy of Neural Network Models, Topologies and Learning Algorithms  28
2.15 FPGA Structure  30
2.16 Virtex II PRO CLB  30
2.17 Virtex II PRO SLICE  31
2.18 Look-Up Table  32

3.1 Mean Firing Time  43
3.2 Temporal Coding  43
3.3 Gaussian Function with Different Parameters  47
3.4 Example of GRF Coding  48
3.5 Input Data Codification using Several GRFs  49

4.1 System Dataflow  65
4.2 Complete Architecture  67
4.3 Gaussian Fields Architecture  68
4.4 GRFPs Integrated with other SNNs Processes  70
4.5 Structure of a Neural Processor  71
4.6 Components of a SNLP  73
4.7 Learning Module  75
4.8 Class Encoder  76
4.9 Example of Encoding Output Class with 4 Neurons  77
4.10 Example of Encoding Output Class with 7 Neurons  78
4.11 Different Class Coding using 3 Output Neurons  79
4.12 Layer Parallelism using Pipeline Processing  83
4.13 Different Dividers for the MST  84
4.14 Execution Time obtained from the Model for Networks with a large number of Output Neurons  86
4.15 Execution Time obtained from the Model for Networks with a large number of Input Neurons  87
4.16 Execution Time obtained from the Model for Networks with a large number of Hidden Neurons  88
4.17 Execution Time obtained from the Model for Balanced Networks  88

5.1 Main Components of the AXM-XPL board  91
5.2 AlphaData ADMXPL Board  91
5.3 Alphadata ACD-PMC Board  92
5.4 Developing Flow  92
5.5 FPGA Model  95
5.6 FPGA Memory Map  96
5.7 Main Blocks of the Full SNN System  97
5.8 Performance Comparison between SW and HW Implementation of GRFs  100
5.9 Error Comparison  101
5.10 Activation Function Computation Space  104
5.11 Widths and Variables used for Neuron Potential  104
5.12 Threshold Evaluation  106
5.13 Performance Comparison for the Recall Phase Implementation in HW and SW (Real Datasets)  107
5.14 Comparison of Learning Performance of both SW and HW Implementations  108
5.15 Network Output Firing times for Selected Learning Iterations  112

6.1 MSE for Iris Dataset  118
6.2 Correct Classified Samples for IRIS Dataset  119
6.3 MSE for Wisconsin Breast Cancer Dataset  120
6.4 Correct Classified Samples for Wisconsin Breast Cancer Dataset  120


List of Tables

2.1 Spiking Neuron Models Comparison  23
2.2 Weights and Delays  25
2.3 Related Work Summary  40

5.1 Characteristics of the Target FPGA Device  90
5.2 Maximum of Weights and Delays in the Defined Region Memory  96
5.3 Synthesis Results for the GRFHA - FPGA Device: Virtex II PRO (XC2VP30)  99
5.4 Execution Time for both HW and SW GRFs Implementations (in milliseconds)  101
5.5 Synthesis Results for the Layer and Neural Processors - FPGA Device: Virtex II PRO (XC2VP30)  102
5.6 XOR dataset  109
5.7 Input and Output Firing Times (Patterns) for the XOR Problem  109
5.8 Network Configuration and Learning Parameters  110
5.9 Results for the Overall Architecture  113

6.1 Criteria for Classification Quality  117
6.2 Results for the IRIS Dataset  118
6.3 Results for the WBC Dataset  119


Chapter 1

Introduction

Artificial Neural Networks (ANNs) are parallel computational models composed of densely interconnected, simple, nonlinear, adaptive processing units, characterized by an inherent capability for storing experiential knowledge and making it available for use. Each processing unit, or neuron, receives inputs from other neurons, performs a weighted summation, applies an activation function to the weighted sum, and sends its result to other neurons in the network. ANNs resemble biological brains in two fundamental respects: first, knowledge is acquired by the network from its environment through a learning process, and second, synaptic weights are employed to store the acquired knowledge [1].

Parallel processing of neural network simulations has attracted interest in recent years. The learning and recall phases of neural networks can be represented mathematically as linear algebra operations on vectors and matrices [2].

ANNs are widely used in a number of applications, where they are usually implemented as software programs on ordinary digital computers. However, software implementations cannot fully exploit the essential parallelism found in biological NNs. In this respect, implementing ANNs with hardware elements such as Very-Large-Scale Integration (VLSI) circuits is beneficial [3].

Spiking Neural Networks (SNNs) have been proposed as an alternative to classical processing schemes. The main characteristic of this type of network is that information is exchanged among neurons by means of digital spikes, which reduces the bandwidth required for data transmission.

1.1 Motivation

In recent years, interest in SNNs has grown considerably. Several researchers argue that SNNs solve the same problems as classical neurons but with fewer iterations, which translates into less processing time and faster networks.

In [4], a performance comparison of different hardware platforms for the simulation of spiking neural networks with sizes from 8K up to 512K neurons is reported. Results for workstations (Sparc-Ultra), digital signal processors (TMS-C8X), neurocomputers (CNAPS, SYNAPSE) and small- and large-scale parallel computers (4xPentium, CM-2, SP2) are presented. Only supercomputers like the CM-2 can meet the performance requirements for simulating very large scale spiking neural networks. The main reason for the poor performance of the other parallel computers is their limited I/O (Input/Output) resources, a consequence of the I/O-bound character of spiking network computation. There is a considerable amount of work on spiking neural network accelerators. Recently, other processing alternatives have been explored, as in [5], where Graphics Processing Units (GPUs) are used for an efficient implementation of SNNs.

The parallelism exhibited by ANN models is lost when these models are implemented on computers with sequential processing schemes. Designing dedicated hardware architectures is a great challenge: many hardware architectures have been proposed, and many of them suffer from the following problems: low performance, low bandwidth, a fixed number of neurons, a fixed topology and fixed learning rules.

On the other hand, Field Programmable Gate Arrays (FPGAs) have recently become an alternative to full-custom VLSI design. FPGAs are reasonably priced and shorten the development cycle in comparison with VLSI. An FPGA is composed of a matrix of logic cells placed over a network of interconnection lines. The user can program the function of each cell as well as the interconnections between cells and to the I/O cells, which are also programmable [6].

FPGAs have been widely used for computer vision tasks, neural networks, data compression, cryptography, and any kind of computationally intensive algorithm that can be parallelized. One use of FPGAs is the implementation of Artificial Neural Networks. The inherent parallelism of neural network models has led to the design of a large variety of hardware implementations. Applications such as image processing, speech synthesis and analysis, and high-energy physics have found promising solutions in neural hardware.

Neurocomputing is an approach to information processing that does not require algorithm or rule development and that often significantly reduces the quantity of software that must be developed, allowing several types of problems to be solved. Formally, Neurocomputing is the engineering discipline concerned with non-programmed adaptive information processing systems (neural networks) that autonomously develop operational capabilities in adaptive response to an information environment. A neurocomputer is a parallel, distributed information processor that mimics the functioning of the human brain, including its capabilities for self-organization and learning. Neurocomputers can be built either as hard-wired machines (direct implementation techniques) capable of implementing only a limited set of neural network architectures, or as flexible, reconfigurable machines (emulation approach) that can run a wide variety of neural network architectures [7].

When implemented in hardware, neural networks can take full advantage of their inherent parallelism and execute several orders of magnitude faster than software simulations, making them adequate for real-world applications [8]. FPGAs have been used successfully in applications that demand a large amount of computation.


An FPGA-based implementation of ANNs with a large number of neurons is still a challenging task, because ANN algorithms are "multiplication-rich" and it is relatively expensive to implement multipliers on fine-grained FPGAs [3]. Several processing schemes have been proposed; some of the most relevant are Single Instruction - Multiple Data Stream (SIMD), Multiple Instruction - Multiple Data Stream (MIMD), pipelined computers, vector computers, and systolic arrays. MIMD computers are distinguished from SIMD or vector computers by having multiple control streams as well as multiple data streams; the multiple control streams execute in parallel and the operations performed may differ [9]. A multiprocessor is a single integrated system that contains multiple processors, each capable of executing an independent stream of instructions, but it needs an integrated mechanism for moving data among the processors, to memory, and to I/O devices. FPGAs are suitable for implementing SIMD, pipelined, and vector computers.

Extensive theoretical work has shown that SNNs are computationally more powerful than conventional artificial neural networks [10]. This implies that SNNs need fewer nodes than conventional ANNs to solve the same problem. FPGA implementation gives the flexibility to develop SNNs for a particular task without committing to costly silicon ASIC fabrication. In addition, digitally based SNNs provide a number of other desirable features, such as noise robustness and simple real-world interfaces. The motivations for the digital simulation of ANNs are as diverse as the backgrounds of researchers. Due to its programmability, digital hardware offers a high degree of flexibility and provides a platform for simulation at the neuron level as well as at the network level [11].

Several hardware implementations of SNNs can be found in the literature; the two most relevant to the hardware implementation proposed in this work are the following. In [12], a hardware/software partitioning approach is reported, where the hardware partition uses a processing scheme based on multiple instances of the MicroBlaze processor. That approach uses a controller that can time-multiplex a single neuron or synapse, so several neurons or synapses can be implemented on a single processor. In [13], a scheme based on Evolutionary Strategies (ES), in which several processors and layers operate in parallel, is reported. This processing scheme is interesting because of its capability to alter the network topology itself in order to test several topological variations. Both hardware implementations are interesting, but the application of the implemented architectures to classical machine learning problems is not reported. In [14], the application of SNNs to classical machine learning problems is reported, but there are no hardware implementations of SpikeProp learning; the architecture proposed in this work performs SpikeProp learning. From [12] we take the idea of dividing the neural processing among several processing elements or cores, but implementing a much simpler processor without using hard cores or soft cores. This idea allows the implementation of a large number of simple processors working in parallel, which enables potential performance increases and scalability. From [13] we take the idea of giving the implemented architecture on-chip learning capability for multilayer SNNs, where topology evolution is discarded but the weights and delays are adjusted using the SpikeProp algorithm.


1.2 Research Questions

• Is there an interesting tradeoff in the design of a hardware architecture that satisfies the processing requirements of parallel SNNs, one that can emulate the parallelism, the interconnection schemes and the dense information flow?

• Is it possible to define an interconnection scheme for achieving high-performance neural computations that maximizes the number of synaptic connections per time unit?

• Is it possible to define a learning block that performs the SpikeProp learning while minimizing memory accesses and allowing the reuse of hardware elements?

• Is it possible to find and adapt an appropriate parallel scheme for the SNN recall and learning phases?

• What factors determine the performance-resource utilization trade-off of the SNN hardware architecture to be proposed?

1.3 Research Objectives

1.3.1 General Objective

The objective of this research work is the design and implementation of a hardware platform for feedforward multilayer spiking neural networks, with on-chip capabilities for solving general machine learning problems. The proposed platform must be flexible enough to allow several topology variations, taking as input a dataset provided by the user, performing the recall and learning processes, and allowing the user to test several network parameters in order to find a better solution for a given application.

1.3.2 Specific Objectives

• Propose a reusable, scalable neural processor element to perform spiking neural computations and to improve on previous works.

• Propose an interconnection scheme to achieve a large number of synapse computations per second, and to overcome previous limitations in hardware NN connectivity, as reported in the literature.

• Propose a scalable, hardware-suitable learning block for implementing the SpikeProp algorithm.

• Analyze and propose a scaling scheme for neural processors and connectivity in order to achieve high parallelism.


• Validate the proposed architecture through its hardware implementation on a reconfigurable device.

Throughout this thesis, different levels of abstraction for the design and implementation of an architecture for multilayer SNNs are presented. The processing performed by the proposed architecture is divided into three main phases: the coding phase, the recall phase and the learning phase. Each of the proposed phases is mapped onto a single hardware core for debugging purposes. The presented abstraction levels and the physical implementation of the proposed architecture allow the validation of the proposed objectives and the definition of the main contributions of the present thesis.

1.3.3 Contributions

The main contributions of the proposed work can be summarized as follows:

• A hardware block for transforming real-valued data into spike times, which could be useful for several learning applications. The proposed hardware block uses the Gaussian Receptive Fields (GRFs) coding (a software sketch of this coding scheme is given after this list), but other coding schemes could be adapted to the proposed block.

• A hardware block for the recall-phase computation in multilayer SNNs, which takes as input the firing times generated by the coding block and obtains the network output. This block can be configured to fit several configurations with respect to the number of layers and the number of neurons per layer.

• A hardware block for the learning computation in multilayer SNNs, which computes the appropriate learning parameters and adjusts the weights and delays. This block provides the learning capability to the implemented network, allowing the exploration of several topological variations for solving a wide range of applications.

• A compact hardware core that includes the previously mentioned blocks, with the possibility of testing several implementation variations of the three proposed blocks for solving several machine learning applications.
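To make the role of the coding block concrete, the following Python sketch shows one common way of implementing Gaussian Receptive Field coding of a single real value into firing times. It follows the widely used Bohte-style parameterization; the centers, widths, firing cutoff and time scale below are illustrative assumptions and may differ from the parameters used by the hardware block.

import math

def grf_firing_times(x, x_min, x_max, m=8, gamma=1.5, t_max=10.0, cutoff=0.1):
    """Encode a scalar x into m firing times using Gaussian Receptive Fields.

    A neuron whose Gaussian center is close to x responds strongly and fires
    early; a weak response fires late; responses below `cutoff` do not fire
    (returned as None).
    """
    times = []
    for i in range(1, m + 1):
        # Centers spread over the input range, width controlled by gamma
        # (assumed Bohte-style placement, not taken from the thesis).
        center = x_min + (2 * i - 3) / 2.0 * (x_max - x_min) / (m - 2)
        sigma = (x_max - x_min) / (gamma * (m - 2))
        response = math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))
        times.append(round(t_max * (1.0 - response), 2) if response > cutoff else None)
    return times

# Example: encode the value 4.2, taken from the range [0, 10], with 8 receptive fields.
print(grf_firing_times(4.2, 0.0, 10.0))

The sparse character of the coding comes from the cutoff: only the few neurons whose receptive fields overlap the input value actually emit a spike.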

1.4 Contents and Thesis Organization

This thesis is structured as follows:

• In chapter 2, general concepts about ANNs are explained. A review of several hardware implementations of both classical and spiking neurons is also presented, and basic concepts about FPGAs and parallel processing are detailed.

• In chapter 3, the analyzed algorithms for coding, recall and learning are described in detail.

• In chapter 4, the proposed architecture is defined. It is divided into 3 main parts: the coding architecture, the architecture for obtaining the network output, and the architecture for computing the learning parameters and adjusting the weights and delays.

• In chapter 5, the implementation and results are shown. The target FPGA device and the platform that hosts it are described. Resource statistics and a comparison of several variations of the proposed architecture are presented, and a brief discussion of the implementation results is given.

• In chapter 6, the architecture is tested with several machine learning datasets. The accuracy for each dataset is obtained and compared with other state-of-the-art results.

• Finally, in chapter 7, conclusions and future work are presented.


Chapter 2

Background and Previous Work

2.1 Neuro-Biological Foundations

The brain can be roughly modeled as a complex, nonlinear, parallel computer; it can be conceived as a biological information-processing system, and it performs certain computations many times faster than current microprocessor-based digital computers [1] [4] [12]. A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two aspects:

• Knowledge is acquired by the network from its environment through a learning process.

• Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

The procedure used to modify the network's synaptic weights in order to attain a desired design objective is called the learning process or learning algorithm. A neural network obtains its computing power from its parallel, distributed organization and from its ability to learn and generalize.

Generalization refers to the ability of the neural network to produce reasonable outputs for inputs not encountered during training. This generalization capability gives neural networks the possibility of solving complex problems that are otherwise intractable.

The brain can be understood as a processing machine that performs information-processing tasks using relatively simple processing elements called neurons. Neurons are excitable brain cells that can be stimulated to produce electrical pulses (spikes), by which they communicate with neighboring cells. Camillo Golgi invented a revolutionary staining method that makes it possible to see neurons and their connections under the microscope, and Santiago Ramon y Cajal used Golgi's technique to systematically map out and draw, in meticulous and artful detail, the various cell types and network structures [15].


Figure 2.1: Biological Neuron [16]

The neuron has four components: synapses, dendrites, the axon and the cell body. Electrical currents arrive at the neuron through the dendrites. The soma processes those input currents and, if a threshold is surpassed, generates a new pulse that is propagated along the axon. Information is transmitted from an axon to a dendrite via a synapse, by means of chemical neurotransmitters [16]. A scheme of a biological neuron is shown in figure 2.1.

The biological neuron operates in the following way. The lipid cell membrane of a neuron maintains different concentrations of various ions (the main ones being Na+, K+, and Cl−) between the inside and the outside of the cell, by a combination of the action of active ion pumps and controllable ion channels. When the neuron is at rest, these channels are closed and, due to the activity of the pumps and the resulting concentration difference, the inside of the neuron has a stable net negative electric potential of around -70 mV compared to the fluid outside [15].

Communication between neurons occurs as a result of the release by the presynaptic cell of substances called neurotransmitters, and the subsequent absorption of these substances by the postsynaptic cell. When an action potential arrives at the presynaptic membrane, it changes the permeability of the membrane, producing an influx of calcium ions. These ions cause the vesicles containing the neurotransmitters to fuse with the presynaptic membrane and to release their neurotransmitters into the synaptic cleft.

The neurotransmitters diffuse across the junction and bind to the postsynaptic membrane at certain receptor sites. The chemical action at the receptor sites results in changes in the permeability of the postsynaptic membrane to certain ionic species. An influx of positive species into the cell tends to depolarize the resting potential; this effect is excitatory. If negative ions enter, a hyperpolarization effect occurs; this effect is inhibitory. Both are local effects that spread a short distance into the cell body and are summed at the axon hillock. If the sum is greater than a certain threshold, an action potential is generated [17].
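The summation-and-threshold behavior just described is what the leaky integrate-and-fire abstraction (introduced later, in section 2.6.1.2) retains from the biology. The following Python sketch is a minimal illustration of that abstraction only; the constants are arbitrary and the ionic mechanisms described above are not modeled.

def lif_simulation(input_current, v_rest=-70.0, v_thresh=-55.0, v_reset=-70.0,
                   tau=10.0, resistance=1.0, dt=1.0):
    """Minimal leaky integrate-and-fire neuron: the membrane potential integrates
    the injected current, leaks back toward rest, and a spike is emitted whenever
    the threshold is crossed."""
    v = v_rest
    spike_steps = []
    for step, current in enumerate(input_current):
        # Euler step of the leaky integration.
        v += dt / tau * (-(v - v_rest) + resistance * current)
        if v >= v_thresh:
            spike_steps.append(step)  # an action potential is generated
            v = v_reset               # the membrane potential resets after the spike
    return spike_steps

# A constant suprathreshold input current produces a regular spike train.
print(lif_simulation([20.0] * 50))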


2.1.1 Advantages of ANNs

ANNs have several advantages over conventional algorithms, which is why they are widely accepted as a successful computational element:

• ANNs have the ability to learn, enabling them to compute functions for which no formal mathematical representation is currently known. In addition, they are able to represent any function, simple or complex, linear or non-linear; they are considered universal approximators.

• They are more fault-tolerant than other algorithms, because small changes in the input values normally cause no changes in the output values at all. ANNs can also adapt to the failure of single or multiple neurons, allowing the whole system to remain efficient enough even when some parts fail.

• They have an inherently parallel nature, which motivates implementing ANNs in hardware or in software on parallel computers.

• The information in ANNs is stored associatively, that is, it is much simpler to recall a pattern that is close to the input pattern than it would be with conventional random-access-memory-based machines.

• They are capable of automatic classification and generalization of input patterns, allowing sensible default values to be chosen for incompletely specified input patterns.

2.1.2 Disadvantages of ANNs

Neural Networks have several disadvantages:

• A neural network typically behaves as a black box in the system-theoretical sense. When given an input value, the network produces an output value, but in general it is not possible to deduce the behavior of the network from its internal parameters without completely simulating it. This makes it difficult, if not impossible, to validate the correctness of ANNs in solving a given problem.

• Gaining knowledge within an ANN is in many cases impossible without using time-intensive learning algorithms.

• Almost all of the currently used learning algorithms are compute-intensive and require acceleration strategies, such as custom parallel implementations on VLSI circuits [18].


2.2 Artificial Neuron and Neural Networks

From an engineering point of view, neurons are poor substitutes for processors; they are slower and less reliable by several orders of magnitude. In the brain this setback is overcome by redundancy, that is, by making sure that a very large number of neurons is always involved in any process and by making them operate in parallel. This is in contrast to conventional computers, where individual operations are as a rule performed sequentially, one after the other, so that failure of any part of the chain is fatal. The other fundamental difference is that conventional computers can only execute a detailed specification of orders, the program, which requires the programmer to know exactly which data can be expected and how to respond. Any subsequent change in the actual situation not foreseen by the programmer leads to trouble [15].

In many real-world applications, we want computers to perform complex tasks that cannot easily be solved by conventional computing approaches (such as pattern recognition problems). The technology developed for this purpose is known as Artificial Neural Networks (ANNs), or simply Neural Networks (NNs).

The individual computational elements that make up most artificial neural systems are called artificial neurons, nodes or processing elements. Artificial neural networks are specified by the following characteristics:

• Artificial neuron model. Defines how the artificial neuron responds to the inputs.

• Network topology. Defines how the artificial neurons are connected and the direction of the information flow.

• Training or learning algorithm. Defines how a network adapts its internal parameters to a certain class of patterns.

2.3 Neuron Models

The simplest model of computation is the finite automaton. Historically, the first "infinite" model of computation was the Turing Machine. This model can be thought of as a finite automaton (control unit) equipped with infinite storage (memory); more precisely, the memory is an infinite one-dimensional array of cells.

The elements of a model of computation are:

computation = storage + transmission + processing    (2.1)

According to [19], there are five computation models:

• The mathematical model. The theory of computation is the branch of computerscience and mathematics that deals with whether and how efficiently problemscan be solved on a model of computation, using an algorithm. A model of com-putation is the definition of the set of allowable operations used in computation

10

Page 29: A High Performance Hardware Architecture for Multilayer Spiking … · 2017. 9. 21. · A High Performance Hardware Architecture for Multilayer Spiking Neural Networks by M. Sc. Marco

Background and Previous Work

and their respective costs. A diversity of mathematical models of computers havebeen developed. Typical mathematical models of computers are the following:State models include Turing Machine, Push-down automaton, Finite state au-tomaton, and Parallel Random Access Machines (PRAMs), functional modelsinclude lambda calculus, logical models include logic programing and concurrentmodels include Actor model and process calculi.

• The logic operational model. A Turing machine is composed of an infinite tape, in which symbols can be stored and read again. A read-write head can move to the left or to the right according to its internal state, which is updated at each step. The Turing thesis states that computable functions are those which can be computed with this type of device. The Turing approach made clear for the first time what “programming” means, at a time when no computer had yet been built.

• The computer model. The first electronic computing device was developed in the 1930s and 40s. Since then, “computation-with-the-computer” has been regarded as computability itself.

• Cellular Automata model. A cellular automaton is a collection of “colored” cells on a grid of specified shape that evolves through a number of discrete time steps according to a set of rules based on the states of neighboring cells. The main problems for cellular automata are communication and coordination among all the computing cells. This can be guaranteed through certain algorithms and conversions. Cellular automata as a computing model resemble the massively parallel multiprocessor systems that have attracted considerable interest recently. Neural networks take a different approach to problem solving than conventional computers. Neural networks and computers are not in competition but complement each other.

• The biological model (Neural Networks). Cellular automata and neural networks are potentially more efficient than conventional computers in certain application areas, but much processing and storage resources must be used for their implementation.

A neuron is an information-processing unit that is fundamental for the operation of a neural network [20]. The generic model of a neuron, which forms the basis for designing artificial neural networks, is shown in figure 2.2. The three basic elements of the neuronal model are [1]:

1. A set of synapses or connection links, each one characterized by a weight or strength of its own. Specifically, a signal xj at the input of synapse j, connected to neuron k, is multiplied by the synaptic weight wkj.

2. An adder for summing the input signals, weighted by the respective synapses of the neuron.


Figure 2.2: Generic Model of Artificial Neurons

3. An activation function for limiting the amplitude of the output of a neuron.

One classification of neural network models can be done based on generations of neural models. In [10], three generations of neural models are classified:

1. The first generation is based on McCulloch-Pitts neurons as computational units. This model of neuron is the base of several network models such as multilayer perceptrons, Hopfield nets, and the Boltzmann machine. A characteristic feature of these models is that they can only generate a digital output. In 1943, McCulloch and Pitts considered the computational power of simple binary units. The neuron spikes involve sudden, transient shifts of membrane voltages from negative to positive. These spikes somehow carry information through the brain. McCulloch and Pitts assumed a simple coding scheme for this information carrying: each spike would represent a binary 1 (or Boolean true), and the lack of a spike would represent a binary 0 (or Boolean false). They showed how these spikes could be combined to do logical and arithmetical operations [21].

2. The second generation is based on computational units that apply an “activation function” with a continuous set of possible output values to a weighted sum (or polynomial) of the inputs. Typical examples of networks of the second generation are feedforward and recurrent sigmoidal neural nets, and networks of radial basis function units. One characteristic feature of this second generation of neural network models is that they support learning algorithms that are based on gradient descent, such as backpropagation.

3. The third generation employs spiking neurons as computational units. This generation of neural network models will be described later in this chapter.


The first two generations of neural models are considered part of the classical approach, while the latter is considered part of the spiking approach.

2.3.1 Classical Neuron Models

2.3.1.1 Digital Output Neurons

The computing elements are a generalization of the common logic gates used in conventional computing and, since they operate by comparing their total input with a threshold, this field of research is known as threshold logic.

McCulloch-Pitts networks are built from even simpler computing units that use only binary signals, i.e., ones or zeros. The nodes produce only binary results and the connections transmit exclusively ones or zeros. The networks are composed of directed unweighted edges of excitatory or inhibitory type. Each McCulloch-Pitts unit is also provided with a certain threshold value θ. The inhibitory connections are marked using a small circle attached to the end of the edge. The rule for evaluating the input to a McCulloch-Pitts unit is the following:

• Assume that a McCulloch-Pitts unit gets an input x1, x2, ..., xn through n excitatory edges and an input y1, y2, ..., ym through m inhibitory connections.

• If m ≥ 1 and at least one of the signals y1, y2, ..., ym is 1, the unit is inhibited and the result of the computation is 0.

• Otherwise the total excitation x = x1 + x2 + ... + xn is computed and compared with the threshold θ of the unit; if x ≥ θ the unit outputs 1, otherwise it outputs 0.

The McCulloch-Pitts model seems very limited since only binary information can be produced and transmitted, but it already contains all necessary features to implement complex neuron models. In figure 2.3, an abstract McCulloch-Pitts computing unit is shown.
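As a minimal sketch of the evaluation rule just described (illustrative only, with hypothetical Python names):

def mcculloch_pitts(excitatory, inhibitory, theta):
    # Any active inhibitory input suppresses the unit: the output is 0.
    if any(inhibitory):
        return 0
    # Otherwise the total excitation is compared with the threshold theta.
    return 1 if sum(excitatory) >= theta else 0

# An AND gate built from a unit with threshold 2 and no inhibitory inputs.
print(mcculloch_pitts([1, 1], [], theta=2))  # prints 1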

2.3.1.2 Analog Output Neurons

In this type of neuron, the computation of logical operations using continuous variables is, or might be, possible. The behavior of the activation function is defined according to the neuron model representation or according to the class of problems to be solved [22]. The main activation functions used in neurons with analog output are:

1. Threshold function. This form is commonly referred to as the Heaviside function. In figure 2.4(a), the Heaviside function is shown. This function is a discontinuous function whose value is zero for a negative argument and one for a positive argument.

\varphi(\upsilon) = \begin{cases} 1, & \text{if } \upsilon \geq 0 \\ 0, & \text{if } \upsilon < 0 \end{cases} \qquad (2.2)


Figure 2.3: Diagram of a McCulloch-Pitts Unit

2. Piecewise-Linear function. A piecewise function is a function whose formula changes depending on which input values are used. The existence of the linear part allows the implementation of continuous activation functions on the basis of such elements, and the resulting activation function can be implemented relatively simply. In figure 2.4(b), an example of a piecewise function is shown. This function is used for saving CPU time, by approximating the sigmoid function via a series of straight lines.

\varphi(\upsilon) = \begin{cases} 1, & \upsilon \geq \tfrac{1}{2} \\ \upsilon, & \tfrac{1}{2} > \upsilon > -\tfrac{1}{2} \\ 0, & \upsilon \leq -\tfrac{1}{2} \end{cases} \qquad (2.3)

3. Sigmoid function. The sigmoid function is real-valued and differentiable, having either a non-negative or non-positive first derivative and exactly one inflection point, but this function is relatively difficult to implement. Its disadvantage is that it produces only positive output values. However, this disadvantage can be considered insignificant, because it can be eliminated at the level of the network structure by changing the element thresholds. In figure 2.4(c), the sigmoid function is shown.

\varphi(\upsilon) = \frac{1}{1 + \exp(-a\upsilon)} \qquad (2.4)

4. Hyperbolic tangent function. Sometimes it is desirable to have the activation function range from -1 to +1, in which case the activation function assumes an antisymmetric form with respect to the origin. This function is similar, by its properties, to the sigmoid function. The major difference is that it maps its input to values between -1 and 1. There is experimental evidence that the hyperbolic tangent function leads to a faster training of the network. In figure 2.4(d), the hyperbolic tangent function is shown.


Figure 2.4: Different Activation Functions. (a) Heaviside function, (b) piecewise-linear function, (c) sigmoid function, (d) hyperbolic tangent function.

\varphi(\upsilon) = \tanh(\upsilon) = \frac{e^{\upsilon} - e^{-\upsilon}}{e^{\upsilon} + e^{-\upsilon}} \qquad (2.5)
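The four activation functions above can be summarized in a short Python sketch (illustrative only; the middle branch of the piecewise-linear function follows equation 2.3 as given):

import math

def heaviside(v):
    return 1.0 if v >= 0.0 else 0.0              # equation 2.2

def piecewise_linear(v):
    if v >= 0.5:                                 # equation 2.3
        return 1.0
    if v <= -0.5:
        return 0.0
    return v

def sigmoid(v, a=1.0):
    return 1.0 / (1.0 + math.exp(-a * v))        # equation 2.4

def hyperbolic_tangent(v):
    return math.tanh(v)                          # equation 2.5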

2.4 Network Topologies

The manner in which the neurons of a neural network are structured is intimately linked with the learning algorithm used to train the network [1]. In general, the three fundamental classes of network architecture are:

1. Single-Layer Feedforward Networks. In a layered neural network the neurons are organized in the form of layers. In the simplest form of a layered network, an input layer of source nodes projects onto an output layer of neurons (computation nodes), but not the other way. This network is strictly of a feedforward or acyclic type. This type of network is named a single-layer network. In figure 2.5(a), an example of this type of network topology is shown.

2. A feedforward neural network (FFN) is a nonlinear function of its inputs, which is the composition of the functions of its neurons. A FFN is represented graphically as a set of neurons connected together, in which the information flows only in the forward direction, from inputs to outputs. In a graph representation, where the vertices are the neurons and the edges are the connections, the graph of a feedforward network is acyclic. The neurons that perform the final computation are called output neurons; the other neurons, which perform intermediate computations, are called hidden neurons [17]. In figure 2.5(b), an example of this type of network topology is shown. For FFNs, the gradient of the cost function can be computed with an algorithm called the backpropagation algorithm. For each example k, the backpropagation algorithm requires two steps: a) a propagation phase (also called recall phase or forward phase), where the inputs corresponding to example k are presented to the network, and the potentials and outputs of all neurons are computed, and b) a backpropagation phase (also called learning phase or backward phase), where the weight adjustments for each neuron in the network are computed. In figure 2.7, an example of this type of network topology is shown. A minimal sketch of the forward phase is given after this list.

3. In a Recurrent Neural Network, the connection graph exhibits cycles. In that graph, there exists at least one path that, following the connections, leads back to the starting vertex (neuron); such a path is called a cycle. Since the output of a neuron cannot be a function of itself, such an architecture requires that time be explicitly taken into account: the output of a neuron cannot be a function of itself at the same instant of time, but it can be a function of its past value(s) [17]. In figure 2.5(c), an example of this type of network topology is shown.
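The following Python sketch illustrates the forward (recall) phase of a fully connected feedforward network with sigmoidal units, as mentioned in item 2 above; it is an added illustration with hypothetical names and parameters, not the architecture developed in this thesis.

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward_layer(inputs, weights, biases):
    # One fully connected layer: a weighted sum per neuron followed by the activation.
    return [sigmoid(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]

def forward_pass(inputs, layers):
    # 'layers' is a list of (weights, biases) pairs, hidden layers first.
    activations = inputs
    for weights, biases in layers:
        activations = forward_layer(activations, weights, biases)
    return activations

# A small 2-2-1 network with arbitrary parameters.
hidden = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2])
output = ([[1.0, -1.0]], [0.0])
print(forward_pass([1.0, 0.0], [hidden, output]))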

2.5 Learning in Artificial Neural Networks

“Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place” [1]. The learning process implies the following sequence of events:

• The neural network is stimulated by an environment.

• The neural network undergoes changes in its free parameters as a result of the stimulation.

• The neural network responds in a new way to the environment because of the changes that have occurred in its internal structure.


Figure 2.5: Different NN Topologies. (a) Single-layer, (b) feedforward, (c) recurrent.


Figure 2.6: Supervised Learning Process

Artificial Neural Networks have been successfully used for recognizing objects from their feature patterns. For pattern classification, the ANNs should be trained prior to the recognition phase. The process of training an ANN can be broadly classified into three typical categories: i) supervised learning, ii) unsupervised learning, iii) reinforcement learning [23]. These categories are explained below:

• Supervised learning. The supervised learning process requires a trainer that submits both the input and the target patterns for the objects to be recognized. Given such input and output patterns for a number of objects, the task of supervised learning calls for the adjustment of network parameters (such as weights and non-linearities), which can consistently satisfy the input-output requirement for the entire object class. Among the supervised learning algorithms, the most common are back-propagation training, the Naive Bayes classifier, Support Vector Machines, Random Forests, Inductive Logic Programming, case-based reasoning, and decision tree learning, among others. In figure 2.6, the supervised learning process is shown.

• Unsupervised learning. The process of unsupervised learning is required in many recognition problems, where the target pattern is unknown. The unsupervised learning process attempts to generate a unique set of weights for one particular class of patterns. The objective of the unsupervised learning process is to adjust the weights autonomously, until an equilibrium condition is reached when the weights do not change further. The process of unsupervised learning, thus, maps a class of objects to a class of weights. Generally, the weight adaptation process is described by a recursive functional relationship. Depending on the topology of the neural nets and their applications, these recursive relations are constructed intuitively. Among the typical classes of unsupervised learning there are: Hopfield nets, associative memories, and cognitive neural networks.

Figure 2.7: A Two-Weight-Layer Feed-forward Neural Network

• Reinforcement learning. This type of learning may be considered as an intermediate form of the above two types of learning. Here, the learning machine does some action on the environment and gets a feedback response from the environment. The learning system grades its action good (rewarding) or bad (punishable) based on the environmental response and accordingly adjusts its parameters. Generally, parameter adjustment is continued until an equilibrium state occurs, following which there will be no more changes in its parameters. The self-organizing neural learning may be categorized under this type of learning.

2.6 Networks of Spiking Neurons

2.6.1 Spiking Neuron Models (SNMs)

Experimental evidence accumulated during the last few years indicates that many biological neural systems use the timing of single action potentials (or “spikes”) to encode information. These experimental results from neurobiology have led to the investigation of a third generation of neural network models which employs spiking neurons as computational units. In this section, the most used SNMs are described.


Figure 2.8: Basic Electric Circuit for the HH Model

Figure 2.9: Basic Electric Circuit for the LIF Model

2.6.1.1 Hodgkin-Huxley Model (HH)

Spiking neurons are based on conductance-based models, rooted in the electrical model of the spiking neuron defined by Hodgkin and Huxley in 1952. The basic idea is to model the electro-chemical information transmission of natural neurons by electric circuits made of capacitors and resistors. C is the capacitance of the membrane, the gi are the conductance parameters for the different ion channels (sodium, Na, potassium, K, etc.) and the Ei are the corresponding equilibrium potentials. Variables m, h and n describe the opening and closing of the voltage-dependent channels. This electrical model is shown in figure 2.8.

The HH model is realistic but too complex for the simulation of large SNNs on digital computers.


2.6.1.2 Leaky Integrate-and-fire Model (LIF)

The leaky integrate-and-fire is the best-known example of a formal spiking neuron model. The basic circuit of an integrate-and-fire model consists of a capacitor C in parallel with a resistor R driven by a current I(t). The driving current can be split into two components, I(t) = IR + ICAP. The first component is the resistive current IR which passes through the linear resistor R. It can be calculated from Ohm's law as IR = u/R, where u is the voltage across the resistor. The second component ICAP charges the capacitor C. From the definition of the capacitance as C = q/u (where q is the charge and u is the voltage), a capacitive current ICAP = C du/dt is defined, leading to equation 2.6. The basic circuit can be described as follows: a current I(t) charges the RC circuit. The voltage u(t) across the capacitance is compared to a threshold ϑ. If u(t) = ϑ at time t_i^{(f)}, an output pulse δ(t − t_i^{(f)}) is generated. The described circuit is shown in figure 2.9 [24].

I(t) = \frac{u(t)}{R} + C \frac{du}{dt} \qquad (2.6)

Defining the time constant of the neuron as τm = RC, which models the voltage leakage, the usual formula for the LIF model is given by equation 2.7.

\tau_m \frac{du}{dt} = u_{rest} - u(t) + R I(t) \qquad (2.7)

Spikes are generated when the membrane potential u crosses the threshold ϑ from below. The moment of threshold crossing defines the firing time t^{(f)}, as established in equation 2.8.

t^{(f)}: \; u(t^{(f)}) = \vartheta \quad \text{and} \quad \left.\frac{du(t)}{dt}\right|_{t=t^{(f)}} > 0 \qquad (2.8)

Immediately after t^{(f)}, the potential is reset to a given value u_{rest}. An absolute refractory period can be modeled by forcing the neuron to a value u = −u_{abs} during a time d_{abs} after a spike emission, and then restarting the integration with initial value u = u_{rest}.
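A minimal Python sketch of the LIF dynamics, using simple Euler integration of equation 2.7 and the threshold-and-reset rule of equation 2.8, is shown below; the parameter values are illustrative assumptions, not values used in this thesis.

def simulate_lif(current, dt=0.1, tau_m=10.0, resistance=1.0,
                 u_rest=0.0, threshold=1.0):
    # Euler integration of tau_m * du/dt = u_rest - u(t) + R*I(t) (equation 2.7).
    u = u_rest
    firing_times = []
    for step, i_t in enumerate(current):
        u += dt * (u_rest - u + resistance * i_t) / tau_m
        if u >= threshold:              # threshold crossing (equation 2.8)
            firing_times.append(step * dt)
            u = u_rest                  # reset after the spike emission
    return firing_times

# A constant suprathreshold input current produces a regular spike train.
print(simulate_lif([1.5] * 500))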

2.6.1.3 Spike Response Model (SRM)

The SRM has been introduced in [25], and it is an approximation of the dynamics of integrate-and-fire neurons expressed by differential equations. The neuron status is updated through a linear summation of the postsynaptic potentials resulting from the impinging spike trains at the connecting synapses.

The model expresses the membrane potential u at time t as an integral over the past, including a model of refractoriness, but without any additional differential equation. The SRM is a phenomenological model of the neuron (that is, a model based on empirical observation of phenomena, in a way which is consistent with the fundamental theory of SNNs, but which is not directly derived from theory), based on the occurrence of spike emissions.


The moment when a given neuron j emits an action potential will be referred to as the firing time of that neuron and will be denoted by t_j^{(f)}, with f varying from 1 to n. The spike train of neuron j is then characterized by the set of firing times given by equation 2.9.

F_j = \{ t_j^{(1)}, t_j^{(2)}, \ldots, t_j^{(n)} \} \qquad (2.9)

where t_j^{(n)} is the most recent spike of neuron j.

The potential difference between the interior of the cell (soma) and its surroundings is called the membrane potential. The state of a spiking neuron is represented by a voltage across its cell membrane and a threshold. This potential is directly affected by the postsynaptic potentials (PSPs) generated by the spikes received from presynaptic neurons. If the membrane potential reaches a threshold, an action potential (spike) is triggered and sent out through the axon and its branches to the postsynaptic neurons. If the postsynaptic potential is positive, it produces an increase in the neuron potential and it is said to be excitatory (EPSP); if the change is negative, it produces a decrease in the neuron potential and the synapse is said to be inhibitory (IPSP). The process of spike transmission through the axon has an associated delay, called the axonal delay [26].

When an action potential (the electrical signal which rapidly propagates along a neuron axon to other neurons and which is the result of a change in membrane potential) is received, the synapse transforms it into a change in the postsynaptic neuron membrane potential (postsynaptic potential) whose typical shape is shown in figures 2.10(a) and 2.10(b), where the membrane potential is increased (dotted line) or decreased (dashed line).

Figure 2.10: EPSP and IPSP for Different Decay Rates. (a) Decay = 0.07, (b) decay = 0.14.

Model   Biological Properties   Flops
HH      ++                      129
LIF     -+                      5
SRM     +-                      N/A
MLP     –                       N/A

Table 2.1: Spiking Neuron Models Comparison

The PSP is the response generated by a spike from the presynaptic neuron i in the postsynaptic neuron j. Due to interneural distances and finite axonal transmission times, there may be a delay between the emission of a spike and the beginning of its corresponding PSP. As defined in [25], equation 2.10 defines a generic PSP generated in neuron j by a spike emitted by neuron i.

\varepsilon_{ij}(t) = \left[ \exp\left( -\frac{t - t_i^{(f)} - \Delta_{ij}^{ax}}{\tau_m} \right) - \exp\left( -\frac{t - t_i^{(f)} - \Delta_{ij}^{ax}}{\tau_s} \right) \right] H\left( t - t_i^{(f)} - \Delta_{ij}^{ax} \right) \qquad (2.10)

where τs and τm are time constants, Δ_{ij}^{ax} is the axonal transmission delay, and H is the Heaviside function.

Since spikes are time-discrete digital events, they are fully characterized by their firing times. The threshold models are focused on the variable u. Some well-known instances of spiking neurons differ in the specific way that the dynamics of the variable u is defined. Despite its simplicity, the SRM is more general than the LIF model and is often able to compete with the HH model for the simulation of neuro-computational properties.
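A short Python sketch of the spike-response kernel of equation 2.10 is given below; the time constants are illustrative assumptions.

import math

def psp_kernel(t, t_f, axonal_delay, tau_m=0.3, tau_s=0.05):
    # Generic PSP of equation 2.10 for a presynaptic spike emitted at t_f.
    s = t - t_f - axonal_delay
    if s < 0.0:                   # Heaviside factor: no response before arrival
        return 0.0
    return math.exp(-s / tau_m) - math.exp(-s / tau_s)

# The kernel is zero before the delayed arrival and rises to a single peak afterwards.
print(psp_kernel(0.2, 0.0, 0.05))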

In [27], a comparison of biological properties for different SNMs is given. The previously described models exhibit both biological and computational properties. Also, a comparison of the hardware resource consumption for the implementation of each neuron model reported in [28] has been studied, but those properties will be used in the related work section in this chapter. Some of these properties are summarized in table 2.1. In this table, in the column of biological properties, the legend “++” indicates that the model exhibits a large number of physiological parameters and rich dynamics, while the legend “–” indicates the opposite. The legends “+-” and “-+” indicate intermediate points between the maximum and minimum.

2.6.2 FeedForward Topology for SNNs

The feedforward network architecture for SNNs consists of a feedforward network of spiking neurons with multiple delayed synaptic terminals. Spiking neurons generate action potentials when the internal neuron state variable, called the “membrane potential”, crosses a threshold [29]. Neurons in layer H receive connections from neurons in layer I. A single connection between two neurons consists of m delayed synaptic terminals.


Figure 2.11: Feedforward SNNs with Multiple Synaptic Connections

A synaptic terminal k is associated with a weight w_{ij}^{k} and a delay d^{k}. A spiking neuron i generates m delayed spike-response functions ε(t − (t_i + d^{k})), the sum of which generates the membrane potential in neuron h. The network topology is shown in figure 2.11.

The neurons in the network generate action potentials (spikes or output pulses) when the internal neuron potential (often also called the state variable) crosses a given threshold ϑ. The relation between the neuron potential and the input spikes is defined by a spiking neuron model; in this case, the neuron model used is the SRM defined in [25].

In order to show how the SRM model processes the input spikes, the behavior of one single neuron is analyzed. This minimal model is shown in figure 2.12. For simplicity, only 4 input neurons are defined, where each neuron connection has two synaptic connections that contribute to the state variable of the output neuron. The input delays and weights associated to each synaptic connection are shown in table 2.2. These weights and delays were randomly generated. Each one of the firing times of the input neurons is shown in figure 2.13(a). The same input firing times affected by their corresponding delays (giving a total of 8 contributions, 2 delays for each input spike) are shown in figure 2.13(b). The contribution functions of each synaptic connection (evaluating the activation function at each simulation time and making the product of that activation function by its corresponding weight) are shown in figure 2.13(c). The total contributions by each input neuron are shown in figure 2.13(d). Finally, the accumulated contribution is shown in figure 2.13(e), and the output firing time for the defined threshold (in this case, set to 50.00) is shown in figure 2.13(f). An illustrative sketch of this computation is given after table 2.2.


Neuron   Parameter   Connection 0   Connection 1
0        Delay       0.050000       0.080000
         Weight      9.090909       13.636364
1        Delay       0.100000       0.150000
         Weight      20.909091      21.818182
2        Delay       0.150000       0.230000
         Weight      22.727273      14.545455
3        Delay       0.200000       0.240000
         Weight      15.454545      7.272727

Table 2.2: Weights and Delays

Figure 2.12: Simple SNN Model with 2 Synaptic Connections
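The following Python sketch reproduces the kind of computation summarized in figure 2.13, accumulating the weighted, delayed spike-response contributions of the 4 input neurons (2 synaptic terminals each, with the weights and delays of table 2.2) and scanning for the first crossing of the threshold 50.00. The kernel time constants and the input firing times are illustrative assumptions, since the exact values used for figure 2.13 are not listed in the text.

import math

# Weights and delays from table 2.2: one entry per input neuron,
# each with two synaptic terminals given as (delay, weight) pairs.
TERMINALS = [
    [(0.05, 9.090909), (0.08, 13.636364)],
    [(0.10, 20.909091), (0.15, 21.818182)],
    [(0.15, 22.727273), (0.23, 14.545455)],
    [(0.20, 15.454545), (0.24, 7.272727)],
]

def kernel(s, tau_m=0.3, tau_s=0.05):
    # Spike-response function of equation 2.10 (time constants assumed).
    return 0.0 if s < 0.0 else math.exp(-s / tau_m) - math.exp(-s / tau_s)

def state_variable(t, input_firing_times):
    # Accumulated contribution of all delayed synaptic terminals (figure 2.13(e)).
    total = 0.0
    for t_f, synapses in zip(input_firing_times, TERMINALS):
        for delay, weight in synapses:
            total += weight * kernel(t - t_f - delay)
    return total

def output_firing_time(input_firing_times, threshold=50.0, dt=0.001, t_max=0.5):
    # First threshold crossing of the state variable (figure 2.13(f)).
    for k in range(int(t_max / dt)):
        t = k * dt
        if state_variable(t, input_firing_times) >= threshold:
            return t
    return None

# Illustrative input firing times for the four input neurons.
print(output_firing_time([0.0, 0.05, 0.10, 0.15]))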


Figure 2.13: Computation of Neural Output for a Neuron with 4 Inputs. (a) Input firing times, (b) delayed input firing times, (c) individual contribution to the state variable, (d) neuronal contribution to the state variable, (e) state variable, (f) output firing time.


2.6.3 Learning Algorithms in SNNs

A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to “train” the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule [30].

In [31], an interesting review of supervised learning methods for SNNs is given. The following schemes are described: methods based on gradient evaluation, statistical methods, linear algebra methods, evolutionary methods, learning in synfire chains and spike-based supervised Hebbian learning. Several advantages and disadvantages of the described schemes are analyzed. Based on the described learning methods, a mixed taxonomy of both spiking and classical learning methods is shown in figure 2.14. In this section, only the methods based on gradient evaluation are analyzed.

A model based on the IF neuron, which exhibits nonlinear properties and biological properties similar to those of radial basis function (RBF) models, is proposed in [32].

In [33], an Evolutionary Strategy (ES) for solving the supervised learning training combined with the SRM model is proposed. Accuracy results for different datasets are reported. In [34], the application of the Parallel Differential Evolution Algorithm as a supervised training algorithm for SNNs is reported.

The SpikeProp algorithm was derived analogously to [35]. Equations are derived for fully connected feedforward networks. This algorithm will be described in the next chapter. The original SpikeProp proposes a learning rule that adapts the weights of the network in a way similar to backpropagation in a classical feedforward network [36] [14]. Neuron decay, delays and thresholds are initialized and remain constant for the complete simulation period.

Several improvements have been proposed for the SpikeProp algorithm. In [37], the SpikeProp algorithm is extended to Recurrent Spiking Neural Networks (RSNNs). In [38] and [39] a set of learning rules for learning neuron delays, decays and thresholds is introduced. In [40] the weights are initialized so as to obtain the same results as [14]. Other experiments performed there allowed the use of negative weights for improving the algorithm convergence rate. In [41] a modification of SpikeProp that includes a momentum term in the weight update equation is proposed. In [42], a weight limitation approach is implemented to guarantee that neurons fire at a time that is not longer than the maximal network delay and that initial weights enable networks to converge rapidly.

2.7 FPGA Technology

Figure 2.14: Taxonomy of Neural Network Models, Topologies and Learning Algorithms

A Field-Programmable Gate Array (FPGA) is a semiconductor device that can be configured by the customer or designer after manufacturing, hence the name “field programmable”. FPGAs are programmed using a logic circuit diagram or source code in a hardware description language (HDL) to specify how the chip will work. They can be used to implement any logical function that an Application-Specific Integrated Circuit (ASIC) could perform, but the ability to update the functionality after shipping offers advantages for many applications [43].

2.7.1 FPGA Structure

An FPGA consists of three main parts: a set of programmable logic cells, also called logic blocks or configurable blocks; a programmable interconnection network; and a set of input and output cells around the device [44]. This structure is shown in figure 2.15. A function to be implemented on an FPGA is partitioned into modules, each of which can be implemented on a logic block. The logic blocks communicate through a programmable interconnection network that includes nearest-neighbor as well as hierarchical and long path wires [45]. The three basic components (logic blocks, interconnection network and input and output blocks) can be programmed by the user in the field. FPGAs can be programmed once or several times depending on the technology used.

The simple and homogeneous architecture of an FPGA device has evolved to become much more heterogeneous, including on-chip memory blocks as well as Digital Signal Processing (DSP) blocks such as multiply/multiply-accumulate units.

There are two basic types of FPGAs: SRAM-based reprogrammable and one-time programmable (OTP). The difference between these types of FPGAs lies in the implementation of the logic cell and the mechanism used to make connections in the device.

The dominant type of FPGA is SRAM-based and it can be reprogrammed by the user as often as the user chooses. In the SRAM logic cell, instead of conventional gates, there is a Look-Up Table (LUT) which determines the output based on the values of the inputs. SRAM bits are also used to make connections.

For Xilinx devices, the logic block is called a Configurable Logic Block (CLB). A CLB provides functional elements for both combinational asynchronous and synchronous logic, including basic storage elements. The exact number and features vary from device to device, but every CLB consists of a configurable switch matrix with 4 or 6 inputs, some selection circuitry (MUX, etc.), and flip-flops. The switch matrix is highly flexible and can be configured to handle combinational logic, shift registers or RAMs. In figure 2.16, a CLB for a Xilinx Virtex-II PRO device is shown. CLBs are organized in an array and are used to build combinational and synchronous logic designs. Each CLB element is tied to a switch matrix to access the general routing matrix. A CLB element comprises 4 similar slices, with a fast local feedback within the CLB. The four slices are split into two columns of two slices with two independent carry logic chains and one common shift chain. Each slice includes two 4-input function generators, carry logic, arithmetic logic gates, wide function multiplexers and two storage elements. Each 4-input function generator is programmable as a 4-input LUT, 16 bits of distributed RAM, or a 16-bit variable-tap shift register element [46]. The structure of a slice is shown in figure 2.17.

Figure 2.15: FPGA Structure

Figure 2.16: Virtex II PRO CLB (Reproduced from [46])

Figure 2.17: Virtex II PRO SLICE (Reproduced from [46])

The most common way of implementing combinational logic is using a Look-Up Table (LUT) [45]. The LUT operates as a memory with N address lines and 2^N memory locations. To implement a specific logic function with the memory, the truth table of the function is stored into the memory. Conceptually, a LUT can be understood as a set of programmable memory cells whose output is selected using a multiplexer, which allows the selection of a specific memory cell. In figure 2.18, a XOR function implemented on a 3-input LUT is shown.
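A small Python sketch (illustrative only) models a 3-input LUT as an 8-entry truth-table memory addressed by the input bits, configured here for the XOR example of figure 2.18.

def lut3(config_bits, a, b, c):
    # The three inputs form the address into the 2^3 = 8 configuration cells.
    address = (a << 2) | (b << 1) | c
    return config_bits[address]

# Configuration bits implementing a 3-input XOR (odd parity).
xor_bits = [a ^ b ^ c for a in (0, 1) for b in (0, 1) for c in (0, 1)]

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert lut3(xor_bits, a, b, c) == (a ^ b ^ c)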

The most often-stated reason for the development of custom neurocomputers is that conventional general-purpose processors do not fully exploit the parallelism inherent in neural-network models and that highly parallel architectures are required for that. It is evident that in general FPGAs cannot match ASIC processors in performance, and in this regard FPGAs have always lagged behind conventional microprocessors. If FPGA structures are considered as an alternative to software, then it is possible that FPGAs may be able to deliver better cost:performance ratios on given applications. Moreover, the capacity for reconfiguration means that a device may be extended to a range of applications, e.g. several different types of neural networks. Thus the main advantage of the FPGA is that it may offer a better cost:performance ratio than either custom ASIC neurocomputers or general-purpose processors, and with more flexibility. The promise of FPGAs is that they offer, in essence, the ability to realize “semi-custom” machines for neural networks and the possibility of outperforming conventional processors. By utilizing FPGA reconfigurability, there are strategies to implement ANNs on FPGAs cheaply and efficiently [47].

Figure 2.18: Look-Up Table

2.7.2 Reconfigurable Computing

Reconfigurable Computing (RC), the use of programmable logic to accelerate computation, arose in the late 1980s with the widespread commercial availability of Field-Programmable Gate Arrays (FPGAs). The innovative development of FPGAs whose configuration could be re-programmed an unlimited number of times marked the invention of a new field in which many different hardware algorithms could execute, in turn, on a single device, just as many different software algorithms can run on a conventional processor [45].

The speed advantage of direct hardware execution on the FPGA, roughly 10X to 100X over the equivalent software algorithm, attracted the attention of the supercomputing community as well as Digital Signal Processing (DSP) systems developers. RC researchers found that FPGAs offer significant advantages over microprocessors and DSPs for high performance, low volume applications, particularly for applications that can exploit customized bit widths and massive instruction-level parallelism.

There are two primary methods in conventional computing for the execution of algorithms:

• The first one consists of using hardwired technology, such as an ASIC or a board-level solution, which groups a set of individual components.

• The second one consists of using software-programmed microprocessors.

Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much higher performance than software, while maintaining a higher level of flexibility than hardware. Reconfigurable devices, including field-programmable gate arrays (FPGAs), contain an array of computational elements whose functionality is determined through multiple programmable configuration bits. FPGAs and reconfigurable computing have been shown to accelerate a variety of applications. Some of these applications are: automatic target recognition, string pattern matching, data compression, genetic algorithms, ANNs, etc. In order to achieve these performance benefits, a wide range of applications combine reconfigurable logic and general-purpose microprocessors. The processors perform the operations that cannot be done efficiently in the reconfigurable logic [48].

2.8 Parallel Processing

A parallel computer usually consists of a number of processing elements (PEs). Each processing element consists of a processor and memory. The memory can either be on the processing chip or on separate chip(s).

Parallel computers can be classified according to Flynn's classification, based on the number of simultaneous instruction and data streams [49]:

• SISD (single instruction stream, single data stream): A sequential computer with a single CPU.

• SIMD (single instruction stream, multiple data streams): A single program controls multiple execution units.

• MISD (multiple instruction streams, single data stream): Multiple processors applying different instructions to a single datum.

• MIMD (multiple instruction streams, multiple data streams): Computers with more than one processor and the ability to execute more than one program simultaneously.

2.8.1 Parallelism in Feed-Forward Neural Networks

According to [2], the BackPropagation algorithm has four different kinds of parallelism:

• Training Session Parallelism. Starts training sessions with different initial training parameters on different processing elements.

• Training Set Parallelism. Splits the training set across the processing elements. Each element has a local copy of the complete weight matrix and accumulates the weight change values for its given training patterns.

• Pipelining. Allows the training patterns to be “pipelined” between layers, that is, the hidden and output layers are computed on different processors. While the output layer processor calculates output and error values for the present training patterns, the hidden layer processor computes the next training patterns. The forward and backward phases may also be parallelized in a pipeline. Pipelining requires a delayed weight update.

• Node Parallelism. Computes the neuron activity within a layer in parallel (named neuron parallelism). Further, the computation within each neuron may also run in parallel.


2.8.2 Fundamentals of Parallel Computing

While microprocessor technology has delivered significant improvements in clock speed over the past decade, it has also exposed a variety of other performance bottlenecks. To alleviate these bottlenecks, microprocessor designers have explored alternate routes to cost-effective performance gains. Clock speeds of microprocessors have posted impressive gains, two to three orders of magnitude over the past 20 years. However, these increments in clock speed are severely diluted by the limitations of memory technology. At the same time, higher levels of device integration have also resulted in a very large transistor count, raising the issue of how best to utilize it. Consequently, techniques that enable the execution of multiple instructions in a single clock cycle have become popular [50].

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast [51].

2.8.3 Speed Definition on FPGA Devices

There are three main definitions of speed depending on the problem context: throughput, latency and timing [52]. These definitions are parameters that characterize the temporal response for a given application. The definitions are detailed below:

1. Throughput refers to the amount of data that is processed per clock cycle. A common metric for throughput is bits (or bytes) per second.

2. Latency refers to the time between data input and processed data output. The typical metric for latency is time or clock cycles.

3. Timing refers to the logic delays between sequential elements. When a design does not “meet timing”, this means that the delay of the critical path (the largest delay between flip-flops) is greater than the target clock period. The standard metrics for timing are clock period and frequency. Several methods for improving timing are:

• Add register layers. This technique consists of adding intermediate layers of registers to the critical path. This technique is widely used in pipelined designs, where an additional clock cycle of latency does not violate the design specifications and the overall functionality is not affected by the addition of the registers.

• Parallel structures. This technique consists of reorganizing the critical path so that the logical structures are implemented in parallel. For example, a function that currently evaluates through a serial string of logic can be broken into several parts and each part can be evaluated in parallel.

• Flatten logic structures. Typically, synthesis and layout tools duplicate logic to reduce the fanout, but a logic structure coded in a serial fashion is not broken up, because information about the design's priority requirements is not provided.

2.8.4 Performance Metrics

Two metrics are commonly used for the speed of neural network simulations. Performance during training is measured in connection updates per second (CUPS). This accounts for the number of weights updated per second. For the recall phase performance, connections per second (CPS) is used, which describes the number of weight multiplications in the forward pass per second.
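As a brief illustration of these metrics (the figures below are hypothetical, not measured results of this work):

# A fully connected 4-8-2 network has 4*8 + 8*2 = 48 connections.
connections = 4 * 8 + 8 * 2

recall_time = 1.0e-6      # assumed time for one forward pass, in seconds
training_time = 4.0e-6    # assumed time for one forward plus backward pass

cps = connections / recall_time      # connections per second (recall phase)
cups = connections / training_time   # connection updates per second (training)
print(cps, cups)                     # 4.8e7 CPS and 1.2e7 CUPS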

2.8.5 General Considerations for FPGA-based Hardware Accelerators for Neural Networks

There are several aspects that must be considered for the design of FPGA-based accelerators for neural networks [47] [3]. Some of them are described below:

• Data representation. The standard representation is generally based on two's complement, but other numeric formats can be useful. Some of the most extended numeric formats are fixed and floating point. The fixed-point representation is the most convenient for hardware implementation, because there is no special need for specific hardware for the fundamental operations, but the drawback of this representation is the loss of precision. The floating-point representation obtains more precise results, but its direct implementation on hardware is very complex, although other approaches can be followed. The interest in using integer weights lies in the fact that integer multipliers can be implemented more efficiently than floating-point ones. For example, in [53], an analysis of both the floating-point representation and the fixed-point representation used for implementing the backpropagation algorithm on FPGAs is described. The conclusion of this study is that the fixed-point representation obtains results very similar to the floating-point representation. (A fixed-point sketch is given after this list.)

• Sum of products computation. There are several ways of implementing the sum of products computation. The weight-input product computation can be implemented using either a chain or a tree of multiplier blocks (DSP48 slices on Xilinx Virtex-4 devices, or MULT18x18 for Virtex-II devices). With the use of a tree, the use of tile logic is quite uneven and less efficient than with a chain. Thus, there is a trade-off between latency and efficient use of the device.

• Storage and update of weights. For NN computations, distributed RAM is too small to hold most of the data that is to be processed. In general, Block-RAMs are used for storing both weights and input values in a single block and simultaneously reading them out (as the RAM is dual-ported). For large networks, even the Block-RAM may not be enough, and data has to be periodically loaded into and retrieved from the FPGA device.

• Learning. The typical learning algorithm is usually chosen based on how fast the algorithm converges. For hardware, this criterion is different: the selection of the implemented algorithm is based on how directly it can be implemented in hardware and on the costs and performance of that implementation.
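The fixed-point representation mentioned above can be sketched in a few lines of Python; the Q4.12 word format and the saturation policy are assumptions chosen for illustration, not the format used in this architecture.

FRACTIONAL_BITS = 12      # assumed Q4.12 format: 4 integer bits, 12 fractional bits
WORD_BITS = 16

def to_fixed(x):
    # Quantize a real-valued weight to a 16-bit two's complement word.
    value = int(round(x * (1 << FRACTIONAL_BITS)))
    lo = -(1 << (WORD_BITS - 1))
    hi = (1 << (WORD_BITS - 1)) - 1
    return max(lo, min(hi, value))   # saturate on overflow

def to_float(q):
    # Recover the quantized real value from the fixed-point word.
    return q / (1 << FRACTIONAL_BITS)

w = 0.731
q = to_fixed(w)
print(q, to_float(q))   # the difference from w illustrates the loss of precision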

2.9 FPGA Implementations for SNNs

In this section, a brief description of the existing FPGA implementations of both approaches is given.

In [54], several aspects related to SNNs are addressed. Some interesting contributions of Schrauwen's thesis are: the Analog Spiking Neuron Approximation Backpropagation (ASNAProp) learning rule, which allows the learning rule to know the full state of the membrane potential, not only at the firing times, and the hardware implementation of the Liquid State Machine (LSM). The LSM was originally presented as a general framework to perform real-time computation on temporal signals or as a cortical microcolumn model. In the LSM model, a 3D-structured, locally connected network of spiking neurons is randomly created. The ASNAProp was conceived as an error-backpropagation learning rule for SNNs, proposed as an alternative to SpikeProp and the improvements previously analyzed, but no hardware implementation is proposed [55]. The application of both the hardware LSM and a software implementation of ASNAProp to a speech recognition problem is reported.

In [56] a hybrid platform for simulating SNNs called RT-Spike is proposed. It is a hybrid approach that implements the neuron model in hardware and the network model and learning in software. The implemented neuron model is the SRM. The network learning is performed in software, but the authors state that it is possible to migrate the learning capability to the hardware component of the system. The purpose of the work is to investigate biologically realistic models for real-time robotic control operating with closed action-perception loops. Hardware resource statistics and performance are reported.

In [8], a simple model of the neuron is implemented in hardware, but the implemented model is highly simplified, and a compensation using a high number of neurons is proposed. Adaptation of synaptic weights is implemented using Hebbian learning. While for some classification tasks it remains useful, Hebbian learning is inaccurate for other applications, such as low-level vision or robotic navigation. Exploration of other techniques such as hybrid learning techniques, Hebbian learning for reinforcement of supervised learning and genetic algorithms could be used for evolving network topologies for exploring a larger search space.

In [57], a hardware accelerator for conductance-based spiking neuron models is proposed. One of the implemented neurons is the Hodgkin-Huxley model. The proposed architecture is fully described using Matlab and Xilinx System Generator for DSP.


Later, the models are validated on the target FPGA, which is an XC2V1000 FPGA device. An important technique used in the proposed system is compartmental multiplexing, which is a technique that reduces the required programmable space on the target device. Instead of being computed in parallel, compartments can be multiplexed in a similar manner to that in which different simulations were multiplexed previously. Each simulation time step contains a fixed number of time slots.

In [58] a cellular SNN model with reconfigurable connectivity is reported. The model was implemented on an FPGA; an array of 64 neurons has been implemented and successfully solves an obstacle avoidance task for a small mobile robot.

In [59] a real-time, large scale, leaky integrate-and-fire neural network processor implemented on an FPGA is presented. The proposed architecture is a Single Instruction path, Multiple Data path (SIMD) array processor. It consists of an array of Processing Elements (PEs) operating concurrently on the same instruction, issued from a central sequencer, using locally stored data. The input stimulus to the processor is ported via 2 input modules which can read in data asynchronously. Similarly there are also 2 output modules. The neurons and synapses are implemented in what the authors call Neural Processing Elements (NPEs). The SIMD neural processor, detailed in this report, has 10 NPEs, each of which emulates 120 virtual neurons and 912 synapses. The target FPGA was a Xilinx Virtex-II (XC2V1000-4), 1 million gate equivalent, clocked at 50 MHz and situated on a Celoxica RC200 development board. A total of 1200 neurons were implemented in hardware.

In [60], an accelerator for neural networks for pulse-coded neurons is presented. The neural model used is a modified version of the French and Stein neuron, which is a derivation of the LIF model. A comparison of this work with neurocomputers and neural accelerators is presented. There is no network learning reported.

In [61], a simulator for pulse-coded neural networks is presented. The neuron model used is based on the integrate-and-fire model with adaptive synapses, which allows adjusting the weight updating for the target task to be solved. Many applications in computer vision could be solved with the proposed model. The importance of embedding the learning and the routing of spikes into the FPGA to exploit the inherent parallel computing resources is acknowledged; however, the difficulties are also highlighted, and this requires the network topology to be stored and managed completely by the hardware component.

Different hardware and software accelerators for simulating spiking neural networks are difficult to compare since the time resolution of the integration, the objectives, the network topology and the neuron models are different. Compared with other implementations, the proposed approach implements feedforward networks of spiking neurons, using a backpropagation-based learning rule. From the mosaic of neuron models, topologies and learning rules, the following works will be used for a comparison with the architecture proposed in this work. The criteria for selecting these works are: the network topologies (explicitly defined as feed-forward), and the learning rule. None of the reported implementations has the same structure as the proposed architecture. Making a quantitative comparison in terms of performance with other implementations would be impossible, given the absence of a standard criterion for measuring performance. Several criteria additional to the minimum error achieved on different possible problems might be taken into account, such as: execution speed, learning speed, size of the neuron, generalization ability, possibility of learning on-chip or off-chip, biological plausibility, etc.

The approaches found in the state of the art that are most similar to the proposed architecture, sharing common characteristics (network topology and neuron model), are:

• In [13], an FPGA platform for on-line topology exploration of spiking neural networks is presented. Dynamic Partial Reconfiguration (DPR) is a reconfiguration mode in which certain areas of the device can be configured while other areas remain operational and unaffected by the reprogramming. This approach is used to perform the topology evolution, which allows the exploration of a search space and, when combined with weight learning, it can be considered a powerful problem solver. The weight learning rule used in this work is STDP, which is a type of Hebbian learning that modifies the synaptic weights considering the simultaneity of the firing times of pre- and post-synaptic neurons. A simplification of Hebbian learning oriented to digital hardware is proposed. The system consists of three parts: a hardware substrate, a computation engine and an adaptation mechanism. The hardware substrate provides the flexibility for allowing the adaptation mechanism to modify the engine. It provides a mechanism to test different possible topologies in a dynamic way, to change connectionism and to allow a wide-enough search space. The hardware substrate platform contains two fixed and one or more reconfigurable modules. Reconfigurable modules contain the neural network: neurons, layers, connection matrices and arrays. Different possible configurations must be available for each module, allowing different possible combinations of configurations for the network. A search algorithm should be responsible for searching the best combination of these configurations; specifically, a genetic algorithm determines which configuration bitstream is downloaded to the FPGA. The adaptation mechanism uses STDP learning for adapting the weights to the target task. The reported architecture implements a layered neural network with three layers, each layer containing 10 neurons and internally fully connected. The application of the implemented network to a frequency discriminator is used in order to test the capability of the learning network to solve a problem with dynamic characteristics. The network is implemented on a Xilinx xc2s200 FPGA device. Results for hardware resource utilization are reported.

• In [12] and [62], a partitioning approach between hardware and software is reported. The approach consists of using a controller that can time-multiplex a single neuron or synapse. The main component of the system is the MicroBlaze soft processor. The program running on the MicroBlaze soft processor is contained in the Local Memory Bus (LMB) Block RAM and is accessed through the LMB BRAM controller. The neuron/synapse/STDP blocks are implemented in the FPGA logic and accessed by the processor using the Fast Simplex Links (FSL) interface. A network topology with 100 input neurons fully connected to 100 output neurons in the output layer and 100 neurons in a "training" layer is proposed. The reported architecture is used to perform a one-dimensional coordinate transformation, which converts the arm angles of a haptic sensor into (x,y) coordinates. These coordinates are then used to generate a haptic image representation of the search space. The STDP algorithm is used for training the network. For testing the capability of the largest network possible, a network structure consisting of 1400 input, training and output synapses was implemented, with each processor responsible for 350 of the output neurons. The target platform is the Nallatech BenNuey system, which provides a multi-FPGA platform; two of the FPGAs in this platform were utilized for the 4-MicroBlaze system. The reported improvement factor is close to or higher than 100, but the performance comparison is done against a Matlab program. For a more reliable comparison, the architecture reported in that work should be compared with a program produced by an optimized compiler (not an interpreter such as Matlab).

The last two related works use STDP learning. STDP is a general term for functional changes in neurons and at synapses that are sensitive to the timing of action potentials in connected neurons. The STDP term typically refers to increases or decreases in the efficacy of synaptic transmission, though it can also refer to other functional changes, such as altered dendritic integration [25]. Some disadvantages of STDP learning are the following. Since the teacher currents suppress all undesired firing times during the training, the only correlations of pre- and postsynaptic activities occur around the target firing times; at other times there is no correlation, and thus no mechanism to weaken the synaptic weights that lead the neuron to fire at undesired times during the testing phase. Another disadvantage is that synapses continue to change their parameters even if the neuron fires exactly at the desired times. Thus, stable solutions can be achieved only by applying additional constraints or an extra learning rule to STDP [31].

The last two reviewed works have in common the feedforward topology, with fully connected neurons in consecutive layers, and the same number of layers, although the learning can cope with more layers. One of the best known learning rules for classical NNs is backpropagation learning, which has been adapted to SNNs in [63]; several interesting improvements are described in [38], [64] and [61]. Taking the FF topology as a starting point, it is interesting to design a hardware architecture for multilayer SNNs, but with a different approach for learning.

In table 2.3, a summary of the characteristics of each work in the literature is shown. Only the last two works are considered for the comparison with the proposed architecture, because they have similar configurations.


Work | Neuron Model | Network Topology | Network Learning | Neurons | FPGA device | Resource utilization (SLICES)
[54] | LIF | Structured 3D SNN | LSM | 1600 | Virtex-4 | n/a
[56] | SRM | n/a | Hebbian | 1024 | Virtex-2000E | 4581
[8] | LIF | Recurrent | Genetic Algorithm | 18 | n/a | n/a
[57] | Hodgkin-Huxley | n/a | n/a | n/a | XC2V1000 | n/a
[58] | LIF | CNN | n/a | 64 | APEX-20K-200E-2X | 5939
[59] | LIF | n/a | n/a | 1200 | XC2V1000 | n/a
[60] | French & Stein | n/a | Hebbian | n/a | n/a | n/a
[61] | LIF | Sparsely connected | n/a | 40 | Virtex-II Pro | n/a
[13] | LIF | FF | STDP | 30 | xc2s200 | 3000
[62] | LIF | FF | STDP | 168 | xc2v8000 | 27955

Table 2.3: Related Work Summary


Chapter 3

Gaussian Receptive Fields (GRFs) and SpikeProp

3.1 Coding in SNNs

3.1.1 Introduction

The mammalian brain contains more than 10^10 densely packed neurons that are connected in an intricate network. In every small volume of cortex, thousands of spikes are emitted each millisecond [25]. Several interesting questions about coding can be formulated:

• What is the information contained in such spatio-temporal patterns of pulses?

• What is the code used by the neurons to transmit that information?

• How might other neurons decode the signal?

• How can an external observer read the code and understand the message of the neural activity pattern?

The previous questions have been addressed by the neurophysiology community and several mathematical models have been proposed.

3.1.2 Types of Information Coding

One of the approaches related to neural coding establishes that the relevant information is contained in the mean firing time of the neuron, given by equation 3.1.

v = \frac{n_{sp}(t)}{T}    (3.1)

where T is the size of the time window and n_{sp}(t) is the number of spikes that occur in the time window. Dividing the spike count by the length of the time window gives the mean firing time.
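As a purely illustrative example, counting n_{sp} = 12 spikes in a window of T = 0.3 s gives v = 12/0.3 = 40 spikes per second.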


During recent years, more and more experimental evidence has been accumulated which suggests that a straightforward firing time concept based on temporal averaging may be too simplistic to describe brain activity. One of the main arguments is that the reaction times in behavioral experiments are often too short to allow long temporal averages. Humans can recognize and respond to a visual scene in less than 400 ms. Recognition and reaction involve several processing steps from the retinal input to the finger movement at the output. If, at each processing step, neurons had to wait and perform a temporal average in order to read the message of the presynaptic neurons, the reaction time would be much longer [25].

There are several coding schemes, but they can be divided into the following three approaches [25]:

1. Rate coding. The information is encoded in the firing rate of the neurons. The main types of rate coding are:

• Rate as a spike count (average over time). This is essentially the spike count in an interval of duration T divided by T. The length T of the time window is set by the experimenter and depends on the type of neuron recorded from and on the stimulus. In practice, to get sensible averages, several spikes should occur within the time window. In figure 3.1, an example of the average firing time is shown.

• Rate as a spike density (average over several runs). The experimenter records from a neuron while stimulating it with some input sequence. The same stimulation sequence is repeated many times. For each short interval of time, before, during and after the stimulation sequence, the experimenter counts the number of times that a spike has occurred and sums them over all repetitions of the experiment. The problem with this approach is that the spike density cannot be the decoding scheme used by neurons in the brain. Nevertheless, the experimental spike density measure can make sense if there are large populations of neurons which are independent of each other and sensitive to the same stimulus [65].

• Rate as population activity. The population activity is defined by equation 3.2:

A(t) = \frac{1}{\Delta} \frac{n(t, t+\Delta)}{N}    (3.2)

where N is the size of the population, ∆ a small interval and n(t, t+∆) the number of spikes, summed over all neurons in the population. This coding model is also named space rate code, and does not suffer the disadvantages of a firing rate averaged over time.
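As an illustrative example only: if N = 100 neurons emit n(t, t+∆) = 20 spikes within a window of ∆ = 5 ms, equation 3.2 gives A(t) = (1/0.005)(20/100) = 40 Hz.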

2. Temporal coding. The information is encoded by the timing of the spikes. This type of coding is also called time-to-first-spike coding. It is a very simple temporal coding method which consists of coding an analog variable directly in a finite interval. A sensory neuron might be driven by an external stimulus which is suddenly switched on at time t0. This code establishes that, for each successor neuron, the timing of the first spike to follow t0 contains all the information about the stimulus. A neuron which fires shortly after t0 signals a strong stimulus, whereas a neuron firing somewhat later signals a weak stimulus. In figure 3.2, the output of three spiking neurons used to encode two analog variables is shown.

Figure 3.1: Mean Firing Time

Figure 3.2: Temporal Coding

3. Population coding. The information is encoded by the activity of different pools (populations) of neurons, where a neuron may participate in several pools.

There is a strong ongoing debate about which neural codes are used by biological neural systems; there is growing evidence that the brain may use all three coding approaches previously mentioned, as well as combinations of them. Temporal coding seems to be especially relevant in the context of fast information processing [66].


3.1.3 Biological Foundations of Receptive Fields

Input to the nervous system is in the form of the senses: pain, vision, taste, smell and hearing. Vision, taste, smell and hearing are the special senses; pain, temperature and pressure are known as somatic senses. Sensory input begins with sensors that react to stimuli: the stimulus energy is converted into an action potential and sent to the Central Nervous System (CNS) [67]. Sensory receptors are classified according to the type of energy they can detect and respond to:

• Mechanoreceptors: hearing, balance and stretching.

• Photoreceptors: light.

• Chemoreceptors: smell and taste mainly, as well as internal sensors in the digestive and circulatory systems.

• Thermoreceptors: changes in temperature.

• Electroreceptors: detect electrical currents in the surrounding environment.

Neurons can be stimulated in various ways. There are specialized neurons that convert the physical energy of the environment into a neural signal. These neurons are called receptors. Visual receptors are found in the retina and they are responsible for converting light energy into an electrochemical neural signal for vision [68].

The receptive field of a sensory neuron is a region of space in which the presence of a stimulus will alter the firing of that neuron. Receptive fields have been identified for neurons of the following systems [67] [26]:

• The auditory system. In the auditory system, receptive fields can be volumes (or 3D regions) in auditory space, or can be regions of auditory frequencies.

• The somatosensory system. In the somatosensory system, receptive fields are regions of the skin or parts of internal organs. Some types of mechanoreceptors have large receptive fields, while others have smaller ones.

• The visual system. In the visual system, receptive fields are volumes in visual space. The receptive field of a single photoreceptor is a cone-shaped volume comprising all the visual directions in which light will alter the firing of that cell. The receptive field is often identified as the region of the retina where the action of light alters the firing of the neuron.

Characterizing the relationship between stimulus and response is difficult because neural responses are complex and variable. Neurons typically respond by producing complex spike sequences that reflect both the intrinsic dynamics of the neuron and the temporal characteristics of the stimulus. Isolating features of the response that encode changes in the stimulus can be difficult, especially if the stimulus changes on a time scale of the same order as the average interval between spikes. Neural responses can vary from trial to trial even when the stimulus is presented repeatedly. There are many potential sources of this variability, including variable levels of arousal and attention, randomness associated with various biophysical processes that affect neuron firing, and the effects of other cognitive processes taking place during a trial. Typically, many neurons respond to a given stimulus, and stimulus features are therefore encoded by the activities of large neural populations [69].

When trying to implement spike-based classifiers, the input encoding becomes a critical factor: as the coding is restricted to a fixed interval, the full range of the input can be encoded only by using small temporal differences. Alternatively, the input could be distributed over multiple input neurons. Since the simulation of spiking neurons is iterated with a fixed time-step, an increased temporal resolution for the input values imposes a computational penalty on the entire network [63]. This problem can arise when trying to use large datasets, where the range of the data contained in the dataset surpasses the fixed interval assigned for coding.

3.1.4 Population Coding through Gaussian Receptive Fields

There are several arguments in favor of population coding. The firing rate, defined as a temporal average over many spikes of a single neuron, only works well if the input is constant or if it changes on a time scale which is slow with respect to the size of the temporal averaging window. Sensory input in a real-world scenario, however, is never constant. Moreover, reaction times are often short, which indicates that neurons do not have the time for temporal averaging. Instead of an average over time, a rate may be defined as an average over a population of neurons with identical properties [70].

Sensory and motor variables are typically represented by a population of broadly tuned neurons. A coarser representation with broader tuning can often improve coding accuracy, but sometimes the accuracy may also improve with sharper tuning. Theoretical analysis concludes that the effects of tuning width on accuracy depend crucially on the dimension of the encoded variable. The results demonstrate a universal dimensionality effect in neural population coding [71].

Inspired by the local receptive fields of biological neurons, Gaussian Receptive Fields (GRFs) are applied for input coding in feedforward networks of spiking neurons. The GRFs are used as a technique for coding continuous input variables into firing times associated to each input neuron of the population. Each input dimension is associated with a set of graded, overlapping sensitivity profiles. Each dimension of the dataset is encoded separately, allowing the reduction of dimensionality problems when processing large datasets.

There are several interesting applications of population coding. For SNNs, a learning rule analogous to the backpropagation algorithm is SpikeProp. Backpropagation (BP) is one of the most widely used training algorithms for supervised networks, because of its capability for dealing with large scale learning problems [19]. The study of this algorithm is interesting due to its capability for solving complex problems with only one hidden layer. Several authors have investigated the SpikeProp algorithm; results for several datasets have been obtained and performance and convergence tests have been made [38] [64] [61].

Another application of population coding is reported in [72], where the GRF coding scheme is used for data clustering: the obtained firing times are passed to Recurrent Radial Basis Function Networks for obtaining the data clusters. The final test of that architecture consists of clustering a set of 3-band RGB satellite images of size 256x256 pixels. The proposed network was trained using 32,796 randomly chosen points to separate 4 clusters.

The Gaussian Receptive Fields Module described in this chapter fits into a larger project of embedded on-chip learning for SNNs, using SpikeProp learning and the Spike Response Model as the neuron model.

3.1.5 Mathematical Background

The application of GRFs for supervised learning was introduced in [63], where a set of analog variables passes through the GRFs and the resulting firing times are later sent to a Spiking Neural Network for processing and learning.

A Gaussian function is defined by:

f(x) = a \, e^{-\frac{(x-b)^2}{2\sigma^2}}    (3.3)

for real constants a > 0 and σ > 0, with e = 2.718281828 (Euler's number), where a is the height of the Gaussian peak, b is the position of the center of the peak and σ controls the width. In figure 3.3, Gaussian functions with different heights and widths are shown.

Datasets are usually collected from multiple sources and stored in a dataset repository. Sources may include multiple databases, data cubes or flat files. Different issues can arise during the integration of data to be processed by ANNs. These issues include data transformation, which involves smoothing, generalization of the data and attribute construction. In a dataset, each attribute has its own range. Data normalization uses different techniques to narrow down values to a certain range. One of the most common normalization techniques is Min-max normalization, which performs a linear transformation on a set of data. Suppose that min_A and max_A are the minimum and the maximum values for attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by equation 3.4.

v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A    (3.4)
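As an illustration only, a minimal software sketch of this normalization is given below; Python is used here purely for exposition, and the function and variable names are not part of the proposed architecture.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Map each value v of one attribute to [new_min, new_max] (equation 3.4)
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    return [(v - v_min) / span * (new_max - new_min) + new_min for v in values]

# Example: normalize one dataset column to the unit range
print(min_max_normalize([2.0, 5.0, 11.0]))   # [0.0, 0.333..., 1.0]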

The normalization problem can be divided into three phases:

• Reading the dataset.

• Finding out the statistics (minimum and maximum for the Min-max normalization).


Figure 3.3: Gaussian Function with Different Parameters (A=1, B=0, C=0.5; A=0.5, B=0.5, C=0.5; A=1.5, B=0, C=0.3)

• Normalize and store the data.

In a similar way as defined by [63], where the Min-max normalization is proposed, the input data are defined inside the range [I_min, I_max]. A variable can be coded by m neurons (or GRFs). For a neuron i, the center of that neuron is given by

b_i = \frac{i - \gamma}{m}    (3.5)

where the proposed value for γ belongs to the range [0, 1]. The widths are given by

\sigma = \frac{1}{K_1 \, m}    (3.6)

where the proposed value for K_1 belongs to the range [1, 2].
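For instance (an illustrative choice of parameters, not a prescription): with m = 4 fields and γ = 0.5, equation 3.5 places the centers at 0.125, 0.375, 0.625 and 0.875, and with K_1 = 1.5 equation 3.6 gives σ = 1/6 ≈ 0.167.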

The process of coding an analog variable is shown in figure 3.4. The total number of GRFs for this example is 4; each field corresponds to one firing time for one input layer neuron. The input value to be coded is 0.5 and the obtained firing time values are: N0 = 0.33 ms, N1 = 0.89 ms, N2 = 0.89 ms, N3 = 0.33 ms.

Figure 3.4: Example of GRF Coding

In figure 3.5, several configurations of GRFs are tested. The coded value is 0.5, but the number of GRFs varies: figure 3.5(a) uses 5 GRFs, figure 3.5(b) uses 8 GRFs, figure 3.5(c) uses 10 GRFs, and figure 3.5(d) uses 12 GRFs. There is no established method for computing an optimal size, width and number of Gaussian fields for supervised or unsupervised learning; only empirical tests have been reported in [64], [29], [14], [63], [72], [38] and [39].

The basic pseudocode for the GRF coding is shown in algorithm 1. The algorithm inputs are:

• Dataset. A matrix containing the input dataset, with M rows (samples) and N columns (variables).

• NumberofGaussianFields. The number of Gaussian fields (NoGFs) required. It is assumed that the NoGFs is the same for all the variables of the dataset; otherwise, this value can be replaced with a vector containing the number of Gaussian fields required for each input variable.

• CT. A vector containing the location of the centroid of each GF, computed according to equation 3.5.

• SIGMAC. A vector containing the squared width of each GF. The width of each GF is computed according to equation 3.6.

The algorithm output is the IFT matrix, with M rows and N * NumberofGaussianFields columns. Note that for a large number of Gaussian fields per variable the memory requirements are high, but the potential parallelism that can be exploited is high too.


Figure 3.5: Input Data Codification using Several GRFs: (a) coding with 5 GRFs, (b) coding with 8 GRFs, (c) coding with 10 GRFs, (d) coding with 12 GRFs


Algorithm 1 GRF Coding

for j = 1 to TotalVariables do
  // Step 1: Compute Maximum, Minimum and Range for each variable
  MAX = ComputeMax(j)
  MIN = ComputeMin(j)
  Range = MAX - MIN
  // Step 2: Compute GRF Coding
  for i = 1 to TotalSamples do
    // Step 2.A: Normalize Data
    Ein = (DataSet[i][j] - MIN) / Range
    // Step 2.B: Evaluate each Gaussian Field on the Normalized Data
    for k = 1 to NumberofGaussianFields do
      X = e^(-(Ein - CT[k])^2 / (2 * SIGMAC[k]))
      IFT[i][(j - 1) * NumberofGaussianFields + k] = X
    end for
  end for
end for
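A minimal software sketch of this coding step is given below; it follows algorithm 1 and equations 3.3, 3.5 and 3.6. Python is used only for illustration, the parameter values (gamma, K_1, m) are the suggested ranges above rather than values fixed by the architecture, and the Gaussian activations are interpreted here as normalized firing times.

import math

def grf_centers_width(m, gamma=0.5, k1=1.5):
    # Centers b_i = (i - gamma)/m (eq. 3.5) and width sigma = 1/(K1*m) (eq. 3.6)
    centers = [(i - gamma) / m for i in range(1, m + 1)]
    sigma = 1.0 / (k1 * m)
    return centers, sigma

def grf_code(value, v_min, v_max, m=4, gamma=0.5, k1=1.5):
    # Code one input value into m values, one per Gaussian receptive field
    x = (value - v_min) / (v_max - v_min)            # Min-max normalization (eq. 3.4)
    centers, sigma = grf_centers_width(m, gamma, k1)
    # Gaussian activation of each receptive field (eq. 3.3 with a = 1)
    return [math.exp(-((x - b) ** 2) / (2.0 * sigma ** 2)) for b in centers]

print(grf_code(0.5, 0.0, 1.0, m=4))   # about [0.08, 0.75, 0.75, 0.08] for these assumed parameters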

3.2 SpikeProp

In [63], a learning algorithm for multilayer SNNs called SpikeProp is proposed. This algorithm is analogous to the backpropagation algorithm. In [40], a more detailed explanation of the mathematical foundations of the SpikeProp algorithm and more detailed experiments are reported. In [64], learning methods analogous to the classical RProp and QuickProp algorithms are adapted to SNNs in order to achieve faster training. In [39], some improvements to SpikeProp learning are proposed; they consist of training the delay and the time constant of every connection, and the threshold of each neuron.

The SpikeProp algorithm has two main processing phases: recall and learning. The first processing phase is the recall phase; its main function is to propagate the input pattern through the hidden layers until the output is produced. The second processing phase is learning, which adjusts the weights as a function of the actual network inputs and outputs, with the purpose of minimizing the network error.

3.2.1 Recall phase

In FF SNNs, the recall phase can be defined formally as follows. A neuron j, having a set Γ_j of immediate predecessors ("pre-synaptic neurons"), receives a set of spikes with firing times t_i, i ∈ Γ_j. Any neuron generates at most one spike during the simulation interval, and fires when the internal state variable reaches a threshold ϑ. The dynamics of the state variable x(t) is determined by the input spikes, whose impact is described by the spike-response function ε(t) weighted by the synaptic efficacy ("weight") w_{ij}:


x_j(t) = \sum_{i \in \Gamma_j} w_{ij} \, \varepsilon(t - t_i)    (3.7)

In the network, each connection consists of a fixed number of m synaptic terminals, where each terminal serves as a sub-connection that is associated with a different delay d and weight w. The unweighted contribution y of a single synaptic terminal k to the state variable is given by:

y_i^k(t) = \varepsilon(t - t_i - d^k)    (3.8)

with ε(t) a spike response function, where ε(t) = 0 for t < 0. The time t_i is the firing time of a previous neuron i, and d^k is the delay associated with the synaptic terminal k. The spike response function is given by:

\varepsilon(t) = \frac{t}{\tau} \, e^{1 - \frac{t}{\tau}}    (3.9)
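For example (illustrative values only), with τ = 3 ms this kernel rises from zero, reaches its maximum ε(τ) = 1 at t = 3 ms and decays afterwards; at t = 1.5 ms it takes the value ε(1.5) = 0.5 · e^{0.5} ≈ 0.82.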

Extending 3.7 to include multiple synapses per connection and inserting 3.8, the state variable x_j of neuron j receiving input from all neurons i can be described as the weighted sum of the pre-synaptic contributions

x_j(t) = \sum_{i \in \Gamma_j} \sum_{k=1}^{m} w_{ij}^k \, y_i^k(t)    (3.10)

Equations 3.7, 3.8, 3.9 and 3.10 define the recall phase of the SpikeProp algorithm. These equations are mapped to algorithm 2, which computes the network output by obtaining the output firing times of both hidden and output neurons. The inputs and outputs of the proposed algorithm are:

• FiringTimeI. The input firing times of the network. These input firing times are obtained from a coding process (in this case, the GRFs, but other coding schemes could be evaluated).

• Threshold. The threshold must be the same for all the neurons in the network, but some improvements to SpikeProp propose that there could be different thresholds, or even that the thresholds could be dynamically modified (learned).

• WeightsIH. A matrix of weights for the connections between input and hidden neurons.

• WeightsHO. A matrix of weights for the connections between hidden and output neurons.

• DelaysIH. A matrix of delays for the connections between input and hidden neurons.

• DelaysHO. A matrix of delays for the connections between hidden and output neurons.

The outputs of the algorithm are:

• FiringTimeH. When a hidden neuron surpasses its established threshold, its output firing time is stored in this vector.

• FiringTimeO. When an output neuron surpasses its established threshold, its output firing time is stored in this vector.

• NeuronPotentialO. The final neuron potential of each output neuron at its firing time is stored in this vector.

• NeuronPotentialH. The final neuron potential of each hidden neuron at its firing time is stored in this vector.

3.2.2 Learning phase

The SpikeProp learning equations are derived for a fully connected feedforward network with layers labeled H (input), I (hidden) and J (output). The target of the SpikeProp algorithm is to learn a set of desired firing times t_j^d at the output neurons j ∈ J for a given set of input patterns P[t_1...t_h], where P[t_1...t_h] denotes a single input pattern described by a single spike time for each neuron h ∈ H. The error function is defined by:

E = \frac{1}{2} \sum_{j \in J} (t_j^a - t_j^d)^2    (3.11)

where t_j^d are the desired spike times and t_j^a are the actual firing times. For error backpropagation, each synaptic terminal is treated as a separate connection k with weight w_{ij}^k. To calculate the backprop rule, the following computations are carried out:

\Delta w_{ij}^k = -\eta \frac{\partial E}{\partial w_{ij}^k}    (3.12)

3.2.2.1 SpikeProp with Delay, Decay and Threshold Learning

In [39] and [55], several improvements to SpikeProp are analyzed. Some of these improvements include the possibility of using learning rules for delays, decays and thresholds. The proposed rules were only tested with the XOR data set; no tests with large data sets are provided. The proposed rules are described below. Equations 3.13 and 3.14 show the chain rule used to derive the learning rule for the delays; the delta terms for neurons located in hidden layers and in output layers are given by equations 3.15 and 3.16, respectively.


Algorithm 2 SpikeProp Algorithm (Recall Phase)

for r = 0.00 to SimulationTime, r += Resolution do
  for i = 1 to HiddenNeurons do
    NeuronPotentialH[i] ← 0
    for j = 1 to InputNeurons do
      C ← 0
      for k = 1 to SynapticConnections do
        W ← WeightsIH[j][i][k]
        ET ← r − FiringTimeI[j] − DelaysIH[j][i][k]
        F ← ActivationFunction(ET, τ)
        C ← C + W ∗ F
      end for
      NeuronPotentialH[i] ← NeuronPotentialH[i] + C
      if (NeuronPotentialH[i] > NeuronThreshold) and (NeuronEnabledH[i] = 1) then
        FiringTimeH[i] ← r
        NeuronEnabledH[i] ← 0
      end if
    end for
  end for
end for
for r = 0.00 to SimulationTime, r += Resolution do
  for i = 1 to OutputNeurons do
    NeuronPotentialO[i] ← 0
    for j = 1 to HiddenNeurons do
      C ← 0
      for k = 1 to SynapticConnections do
        W ← WeightsHO[j][i][k]
        ET ← r − FiringTimeH[j] − DelaysHO[j][i][k]
        F ← ActivationFunction(ET, τ)
        C ← C + W ∗ F
      end for
      NeuronPotentialO[i] ← NeuronPotentialO[i] + C
      if (NeuronPotentialO[i] > NeuronThreshold) and (NeuronEnabledO[i] = 1) then
        FiringTimeO[i] ← r
        NeuronEnabledO[i] ← 0
      end if
    end for
  end for
end for
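The inner computation of algorithm 2 for one post-synaptic neuron can be sketched in software as follows. This is a simplified Python sketch under the same assumptions (one spike per neuron, fixed simulation resolution); the function and parameter names are illustrative only.

import math

def eps(t, tau=3.0):
    # Spike response function of eq. 3.9; zero for t <= 0
    return (t / tau) * math.exp(1.0 - t / tau) if t > 0 else 0.0

def recall_neuron(in_times, weights, delays, threshold, t_max=16.0, dt=0.01, tau=3.0):
    # Return the firing time of one neuron, or None if the threshold is never reached.
    # weights[i][k] and delays[i][k] refer to synaptic terminal k of input neuron i.
    steps = int(t_max / dt)
    for s in range(steps):
        t = s * dt
        potential = 0.0
        for i, t_i in enumerate(in_times):               # pre-synaptic neurons
            for k in range(len(weights[i])):             # synaptic terminals
                potential += weights[i][k] * eps(t - t_i - delays[i][k], tau)
        if potential > threshold:                        # eq. 3.10 compared against the threshold
            return t                                     # the neuron fires at most once
    return None

print(recall_neuron([0.0, 1.0], [[1.0, 0.5], [0.8, 0.4]], [[1.0, 2.0], [1.0, 3.0]], 1.0))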


\frac{\partial E}{\partial d_{ij}^k} = \frac{\partial E}{\partial t_j}(t_j^a) \, \frac{\partial t_j}{\partial a_j(t)}(t_j^a) \, \frac{\partial a_j(t)}{\partial d_{ij}^k}(t_j^a)    (3.13)

\frac{\partial a_j(t)}{\partial d_{ij}^k}(t_j^a) = -w_{ij}^k \, \frac{\partial \varepsilon_{ij}^k}{\partial t}(t_j^a - t_i^a - d_{ij}^k)    (3.14)

For convenience, δ_i can be redefined as in equation 3.15. The pseudocode for the computation of this learning adjustment parameter is shown in algorithm 3. Note that the computation of the numerator of equation 3.15 requires the δ parameters of the output layer.

\delta_i = \frac{\sum_{j \in \Gamma^i} \delta_j \left\{ \sum_k w_{ij}^k \frac{\partial y_i^k(t_j^a)}{\partial t_i^a} \right\}}{\sum_{h \in \Gamma_i} \sum_l w_{hi}^l \frac{\partial y_h^l(t_i^a)}{\partial t_i^a}}    (3.15)

where Γ^i denotes the set of successors (output neurons) of hidden neuron i and Γ_i its set of predecessors (input neurons).

For convenience, δ_j can be redefined as in equation 3.16. The pseudocode for the computation of this learning adjustment parameter is shown in algorithm 4. Note that in the first part of the algorithm, the difference between the desired and the actual output firing times is used for computing the numerator of equation 3.16. The denominator of equation 3.16 is computed as a function of the actual output firing time and of the firing times and delays of the neurons located in the hidden layer.

\delta_j = \frac{t_j^d - t_j^a}{\sum_{i \in \Gamma_j} \sum_l w_{ij}^l \frac{\partial y_i^l(t_j^a)}{\partial t_j^a}}    (3.16)

For the hidden layer, the weight adaptation rule is defined by equation 3.17. For the output layer, the weight adaptation rule is defined by equation 3.18. The pseudocode for the weight adaptation rules for both hidden and output layers is shown in algorithm 5.

\Delta w_{ih}^k = -\eta_w \, \varepsilon_{ih}^k(t_i^a - t_h^a - d_{ih}^k) \, \delta_i    (3.17)

\Delta w_{ho}^k = -\eta_w \, \varepsilon_{ho}^k(t_o^a - t_h^a - d_{ho}^k) \, \delta_j    (3.18)

where η_w is the learning rate for weights. In an analogous way, the delay adjustment for hidden neurons is defined by equation 3.19, and for output neurons the delay adjustment is defined by equation 3.20.

\Delta d_{ih}^k = -\eta_d \, \varepsilon_{ih}^k(t_i^a - t_h^a - d_{ih}^k) \, \delta_i \left[ \frac{t_i^a - t_h^a - d_{ih}^k}{(\tau_{ih}^k)^2} - \frac{1}{\tau_{ih}^k} \right]    (3.19)

\Delta d_{ho}^k = -\eta_d \, \varepsilon_{ho}^k(t_o^a - t_h^a - d_{ho}^k) \, \delta_j \left[ \frac{t_o^a - t_h^a - d_{ho}^k}{(\tau_{ho}^k)^2} - \frac{1}{\tau_{ho}^k} \right]    (3.20)


Algorithm 3 SpikeProp Algorithm (Learning Phase - Delta Computation for Hidden Neurons)

for i = 1 to HiddenNeurons do
  // Compute the numerator part
  NUM ← 0
  SpikeT ← FiringTimeH[i]
  for j = 1 to OutputNeurons do
    O ← 0
    ActualT ← FiringTimeO[j]
    for k = 1 to SynapticConnections do
      W ← WeightsHO[i][j][k]
      D ← DelaysHO[i][j][k]
      if (FiringTimeO[j] > FiringTimeH[i] + D) then
        ET ← ActualT − SpikeT − D
        F ← ActivationFunction(ET, τ) ∗ (1/ET − 1/τ)
        O ← O + W ∗ F
      end if
      // terminals whose delayed spike arrives after the output neuron fired contribute nothing
    end for
    NUM ← NUM + DeltaO[j] ∗ O
  end for
  // Compute the denominator part
  Output ← 0
  for j = 1 to InputNeurons do
    C ← 0
    for k = 1 to SynapticConnections do
      W ← WeightsIH[j][i][k]
      D ← DelaysIH[j][i][k]
      if (FiringTimeH[i] > FiringTimeI[j] + D) then
        ET ← FiringTimeH[i] − FiringTimeI[j] − D
        F ← ActivationFunction(ET, τ) ∗ (1/ET − 1/τ)
        C ← C + W ∗ F
      end if
    end for
    Output ← Output + C
  end for
  // Compute the delta value
  DeltaH[i] ← NUM / Output
end for


Algorithm 4 SpikeProp Algorithm (Learning Phase - Delta Computation for Output Neurons)

for i = 1 to OutputNeurons do
  // Compute the numerator part
  NUM ← Desired[i] − Actual[i]
  // Compute the denominator part
  Output ← 0
  for j = 1 to HiddenNeurons do
    C ← 0
    for k = 1 to SynapticConnections do
      W ← WeightsHO[j][i][k]
      D ← DelaysHO[j][i][k]
      if (FiringTimeO[i] > FiringTimeH[j] + D) then
        ET ← FiringTimeO[i] − FiringTimeH[j] − D
        F ← ActivationFunction(ET, τ) ∗ (1/ET − 1/τ)
        C ← C + W ∗ F
      end if
    end for
    Output ← Output + C
  end for
  // Compute the delta value
  DeltaO[i] ← NUM / Output
end for


where η_d is the learning rate for delays. In a similar way to weight learning, the time constants (decays) can be adjusted as established in equation 3.21, and the thresholds can be adjusted as established in equation 3.22. These equations are not explained in detail because they were not implemented in this work, but they can be considered for future improvements.

Algorithm 5 SpikeProp Algorithm (Learning Phase - Weight and Delay Adjustment)

// Update the hidden-output layer weights and delays
for i = 1 to OutputNeurons do
  ActualFT ← FiringTimeO[i]
  ActualDelta ← DeltaO[i]
  for j = 1 to HiddenNeurons do
    SpikeT ← FiringTimeH[j]
    for k = 1 to SynapticConnections do
      ET ← ActualFT − SpikeT − DelaysHO[j][i][k]
      F ← ActivationFunction(ET, τ)
      ChangeW ← LearnRate ∗ F ∗ ActualDelta
      ChangeD ← LearnRate ∗ F ∗ ActualDelta ∗ ((1/ET) − (1/τ))
      WeightsHO[j][i][k] ← WeightsHO[j][i][k] + ChangeW
      DelaysHO[j][i][k] ← DelaysHO[j][i][k] + ChangeD
    end for
  end for
end for
// Update the input-hidden layer weights and delays
for i = 1 to HiddenNeurons do
  ActualFT ← FiringTimeH[i]
  ActualDelta ← DeltaH[i]
  for j = 1 to InputNeurons do
    SpikeT ← FiringTimeI[j]
    for k = 1 to SynapticConnections do
      ET ← ActualFT − SpikeT − DelaysIH[j][i][k]
      F ← ActivationFunction(ET, τ)
      ChangeW ← LearnRate ∗ F ∗ ActualDelta
      ChangeD ← LearnRate ∗ F ∗ ActualDelta ∗ ((1/ET) − (1/τ))
      WeightsIH[j][i][k] ← WeightsIH[j][i][k] + ChangeW
      DelaysIH[j][i][k] ← DelaysIH[j][i][k] + ChangeD
    end for
  end for
end for
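As a software sketch of the update performed by algorithm 5 for a single synaptic terminal, the following illustrative Python fragment can be used; it assumes that the delta value has already been computed as in algorithms 3 and 4, and its names and numbers are examples only.

import math

def eps(t, tau=3.0):
    return (t / tau) * math.exp(1.0 - t / tau) if t > 0 else 0.0

def update_terminal(w, d, post_t, pre_t, delta, learn_rate=0.01, tau=3.0):
    # Return the updated (weight, delay) of one synaptic terminal, as in algorithm 5
    et = post_t - pre_t - d                      # time argument of the spike response
    f = eps(et, tau)
    change_w = learn_rate * f * delta
    change_d = learn_rate * f * delta * (1.0 / et - 1.0 / tau) if et != 0 else 0.0
    return w + change_w, d + change_d

print(update_terminal(w=0.8, d=1.0, post_t=6.0, pre_t=2.0, delta=0.5))   # (0.805, 1.0) here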

\Delta \tau_{ij}^k = -\eta_\tau \frac{\partial E}{\partial \tau_{ij}^k}    (3.21)


\Delta \vartheta_j = -\eta_\vartheta \frac{\partial E}{\partial \vartheta_j}    (3.22)

3.2.2.2 Other SpikeProp Improvements

The modification of equations 3.17 and 3.18 to support a momentum term is defined by equation 3.23, where α is the momentum parameter. This modification is proposed in [41].

\Delta w(t) = -\eta \frac{\partial E}{\partial w}(t) + \alpha \, \Delta w(t-1)    (3.23)

Another interesting modification, proposed in [64], is the RProp algorithm. RProp is short for resilient propagation. RProp is unique in that the learning rate adjustment depends on the sign of the gradient, not on its magnitude. If the error surface is highly non-linear and complex, the gradient becomes unpredictable and is a poor measure of how the learning rate should be adjusted. The learning parameters are defined in equations 3.24 and 3.25, where t_min and t_max are the minimum and the maximum change in time allowed between a pair of neurons, from the moment a neuron receives an input spike to the moment it produces an output spike, after any applicable delay; |d| is the number of delayed connections between any two neurons, and |Γ_i| is the number of input neurons of a given layer. These equations are used to compute the initialization of the network weights as well.

w_{max} = \frac{\tau \, \vartheta}{|d| \, |\Gamma_i| \, t_{min} \, e^{(\frac{t_{min}}{\tau} - 1)}}    (3.24)

w_{min} = \frac{\tau \, \vartheta}{|d| \, |\Gamma_i| \, t_{max} \, e^{(\frac{t_{max}}{\tau} - 1)}}    (3.25)

QuickProp is a training method that tries to approximate the global error surface by examining the local one for each weight. In this method, two assumptions are made: the error for each weight can be approximated by a parabola opening upward, and the second derivative of the error with respect to one weight is not affected by the other weights that change at the same time. The weight updating rule for both hidden and output layers is defined by equation 3.26. This improvement is proposed in [64].

\Delta w(t) = \frac{\frac{\partial E}{\partial w}(t)}{\frac{\partial E}{\partial w}(t-1) - \frac{\partial E}{\partial w}(t)} \, \Delta w(t-1) = \beta \, \Delta w(t-1)    (3.26)

3.3 Algorithm Discussion

The presented algorithms have inherent parallelism, due to the large number of iteration cycles that must be performed to obtain the network output. For the GRFs, a scheme for evaluating several Gaussian functions in parallel can be defined, where the input coding of a given pattern benefits from using several hardware processors dedicated to the input coding. It is even possible to propose a parallel scheme for obtaining the coding of several patterns in parallel.

For the recall phase in SNNs, the network output of each neuron must be computed. A set of processing phases can be identified in the neuron output computation. It is possible to define a pipelined processing scheme for computing the neuron output, and to use several neurons implemented in hardware to accelerate the processing of the network. For large networks, it is possible to define a multiplexing scheme where several modules reuse the same hardware resources.

For the learning phase in SNNs, the weights and delays of each neuron must be adapted. This process is performed only once per neuron, but a pipelined processing scheme for accelerating the computation can also be implemented in hardware.

Globally, the hardware implementation of these algorithms together brings several advantages and potential performance improvements.


Chapter 4

Proposed Architecture

The implemented architecture can be divided into three parts: Coding, Recall and Learning. In the coding part, samples obtained from the input dataset are transformed into input firing times. In the Recall part, the input firing times are processed for obtaining the network output. In the Learning part, the gradient is computed and the network weights are updated. In this chapter, each one of the basic modules of the final architecture is described separately, and the integration of all the modules into one compact architecture is presented at the end of the chapter.

4.1 Issues to be Addressed, Restrictions and Limitations

In this work, a hardware architecture for multilayer FF SNNs is proposed, but before giving the details, several aspects that were taken into account in the architecture definition should be mentioned:

• Hardware resource simplification. In order to reduce and optimize hardware resources and allow a denser architecture, hardware simplification is required, and some arithmetic computations should be simplified. For the exponential function, for example, where divisions and multiplications are required, it is desirable to eliminate the division, which is one of the most hardware-greedy arithmetic operations. The division is eliminated by storing the reciprocal of the denominator in a set of registers and later performing a product instead of a division (a small software illustration of this idea is given after this list). For the multiplication, there are additional hardware resources available on FPGAs, such as the MULT18x18 modules on Xilinx FPGA devices.

• Flexibility. The integration of the final architecture with hardware cores for SNNs is possible. The user can define the degree of parallelism by adding processors, but the parallelism degree also depends on the FPGA space availability. In this sense, the architecture is as flexible as the target device permits.


• Data representation. The main limitation of FPGAs is the data width limitation compared with modern PCs. The design of specialized and minimal arithmetic is better suited to FPGAs than to modern PCs. The proposed arithmetic is specially designed for the specific computations performed by the architecture.
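The division-removal idea mentioned in the first item of this list can be illustrated with the following minimal Python sketch; the values, names and the use of floating point are assumptions made only for exposition.

# Replace y = x / d by y = x * r, where r = 1/d is precomputed and kept in a register.
# In the hardware the product would be mapped to a MULT18x18 block; plain floats are used here.
PRECOMPUTED_RECIPROCALS = {d: 1.0 / d for d in (2.0, 3.0, 7.0)}   # filled once, off the critical path

def divide_by_stored_reciprocal(x, d):
    return x * PRECOMPUTED_RECIPROCALS[d]        # one multiplication instead of a division

print(divide_by_stored_reciprocal(10.0, 7.0))    # ~1.4286, same result as 10/7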

4.2 Architecture Overview

4.2.1 Modules Description

In this section, the high-level modules of the proposed architecture are described in detail. The proposed modules, with the buses required for the data and control transfers, and the organization of these modules are shown in figure 4.2. The main modules of the architecture are:

• The GRFM is implemented by a set of Gaussian Receptive Field Processors (GRFPs). The basic mode of the proposed architecture works with only 1 GRFP, but if higher parallelism is desired, more GRFPs must be added.

• The SNNM is implemented by a set of Spiking Neural Layer Processors (SNLPs). One SNLP performs the computations required for a pair of layers. For example, for a three-layer network, 2 SNLPs must be defined.

• The WAUM is implemented in hardware by the Learning Modules (LMs). There is one LM for each SNLP defined.

• External RAM Interface. For the designed architecture, it is assumed that only one external memory can be used for data storage. In case more than one memory is available on the target platform, multiple RAM interfaces could be used, and each group of processing elements (GRFP, SNLP, LM and WAUM) could have its own RAM interface. This interface allows each group of processing elements to access the external memory.

• External Memory. This memory stores weight values, delay values and the firing times generated by each group of processing elements.

• Global Control Unit. Generates all the synchronization signals required by all the modules of the architecture.

• Routers. The routers can read or write data to/from the Global Data Bus. In a reading operation, the router reads data from the Global Data Bus and sends the data to the processing element connected directly to the router (GRFP, SNLP or LM). In a writing operation, the router reads from the processing element connected directly to the router (GRFP, SNLP or LM) and sends the data to the Global Data Bus. Only one router can be active at a time, so the control module generates the appropriate synchronization signals.


All the data transfers among modules are done via a Global Data Bus. All the control signals generated by the Global Control Unit are sent to the modules via a Global Control Bus. In the following sections, each one of the hardware components is detailed.

4.2.2 Modules Interaction

In figure 4.1, the different processing stages involved in the system are shown. Each processing module can be summarized as follows:

1. The dataset is loaded in one external memory. A dataset consists of a fixed number of patterns (rows) and a fixed number of variables (columns). Additionally, a class vector that contains a class label for each pattern can be stored in the dataset (as an additional column) or in a separate memory region. The dataset must be accessible at any time. Each element of the dataset is stored using a 16-bit fixed point representation, with 8 bits for the integer part and 8 bits for the fractional part. The class column is stored using an integer representation. The proposed architecture can only process integer classes; the internal representation format is an 8-bit integer, which allows the target network to process up to 256 different classes.

2. The dataset is accessed only by the Gaussian Receptive Fields Module (GRFM), which transforms the dataset into firing times. The GRFM transforms the input data into the output firing times, evaluating each normalized datum on each Gaussian field. The GRF coding consists in the evaluation of the input data on a set of overlapping Gaussian functions. Each Gaussian function can be understood as an input firing time, and the value of the input firing time depends on the size and center separation of each GRF and on the total number of GRFs. The generated firing times are stored in external memory or can be sent to the Spiking Neural Network (Recall Phase).

3. The SNNM takes as input the firing times generated in the previous phase. These firing times can be loaded from external memory, or can be taken directly from the GRFM. They represent the firing times of the input layer of the implemented SNN. The SNNM processes at least one layer, and the output firing times of the last layer of the SNN are sent to the exterior as the output firing times. These firing times can be stored in memory or sent to the Class Decoder Module.

4. The firing times generated by the Spiking Neural Network are sent to the Class Decoder, which decodes the firing times to obtain the class assigned by the network to the input pattern.

5. The class assigned by the network and the output firing times can be sent to the Weight Adjustment and Updating Module (WAUM), which obtains the adjustment for each weight in the network.

6. The Learning Module (LM) computes the weight adjustments and performs the weight updating according to the actual and desired firing times for each input pattern.

4.3 Coding: Gaussian Receptive Fields Processor (GRFP)

The proposed architecture uses a 16-bit fixed point representation. Hence, the maximum firing time that can be represented is 15.99, using 4 bits for the integer part and 12 bits for the fractional part. The main reason for this selection is the data width of the BlockRAM memories available on the target FPGA device. Another pre-built resource used by the proposed architecture is the 18-bit by 18-bit multiplier of the Virtex technology, which reduces the resources used for multiplication operations.
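A small sketch of this 4.12 fixed point format is given below; the Python code, the rounding mode and the unsigned saturation are illustrative assumptions, not a statement about the actual hardware implementation.

FRAC_BITS = 12                      # 4 integer bits + 12 fractional bits = 16-bit word

def to_fixed(x):
    # Quantize a real value into the 4.12 format (unsigned, saturating at the top)
    raw = int(round(x * (1 << FRAC_BITS)))
    return max(0, min(raw, (1 << 16) - 1))

def to_real(raw):
    return raw / float(1 << FRAC_BITS)

print(to_real(to_fixed(0.89)))       # ~0.8899, a quantized firing time
print(to_real((1 << 16) - 1))        # ~15.9998, the largest representable firing time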

The main architecture is designed to implement a maximum of 16 coding neurons. This is based on several works, where the number of coding neurons is between 4 and 16. If the user requirements ask for more neurons, the architecture can be modified with minimal changes to support more Gaussian fields.

4.3.1 Modules Description

In this section, the components of the GRFP are described. The architecture is designed to be flexible and modular, depending on the user requirements. The GRFP contains a set of computational elements that make possible the conversion of real values into firing times or spikes. It can be replicated as many times as needed to improve performance, or it can be defined in a minimal resource mode, where only one GRFP is used. In figure 4.3, the main components of the GRFHA are shown. These components are described below:

• External Memory. This unit stores the dataset to be processed. The columns of the dataset must be stored separately; later, the GRFHA performs the process of converting the input dataset into firing times or spikes.

• External Memory Access Unit. This unit allows the GRFPs to access the dataset stored in the External Memory through the Register Bank. It also allows the Firing Time Memory to write its contents to the external memory when the coding finishes.


Figure 4.1: System Dataflow


• Temporal Register Bank (TR). The purpose of this register bank is to store data when different data widths are used in the external and internal memories (currently, the external memory has a 64-bit width, while the internal BRAM memory has a 16-bit width).

• Maximum Minimum Range Computation Module (MMRCM). This module accesses the dataset and computes the maximum, minimum and range of each data column of the dataset. This module can be excluded from the architecture, or not synthesized, if these parameters are known or were precomputed before coding. The results obtained by this module are stored in the Maximum Minimum Range Memory (MMRM).

• Centroid Computation Module (CCM). This module computes the centroids and widths of the Gaussian fields as defined by equations 3.5 and 3.6. The number of computed centroids corresponds to the number of GFs required for each variable in the dataset. These centroids are stored in the Parameter Memory.

• Parameter Memory (PM). This memory stores the centroids and statistics obtained by the MMRCM and the CCM. Each GRFP accesses this memory module, and the parameters are distributed among the GRFPs.

• Gauss Processors (GPs). Each GP performs the computation of one Gaussian field as defined by equation 3.3. The exponential function is required for computing the Gaussian function, and it is evaluated through a series with a fixed number of elements. The denominators of the series terms can be pre-computed when the number of elements is fixed; the numerator terms are then calculated iteratively, and several computations can be made in pipeline when dedicated hardware is used for the implementation. The hardware module thus has a latency of K elements, where K is the number of elements of the series.

• Firing Time Memory (FM). This memory concentrates all the firing times obtained by each one of the Gauss Processors. Later, the firing times are stored in external memory.

4.3.2 Modules Interaction

The module interactions required for a GRFP can be summarized as follows:

1. The statistics are computed by the MMRCM for the data normalization.

2. The data to be codified are stored in the TRs.

3. The data are normalized using the parameters obtained from the PM.

4. The normalized data are evaluated by each GP, and each result is stored in the FTM.


Figure 4.2: Complete Architecture


Figure 4.3: Gaussian Fields Architecture


5. The firing times stored in the FTM are transferred to the external memory.

The proposed architecture can be configured in one of the following modes:

• Mode A. This mode processes one variable (column) of the dataset as fast as possible. More than one row value of the dataset is fetched from the external memory and stored in the Register Bank.

• Mode B. This mode processes several variables of the dataset. This is possible by adding more PMs (or by using multi-port memories instead of one-port memories). In this case, several samples of the dataset are stored in temporal memories, the coding of these samples is obtained in parallel, and the coded samples are later passed to other neural processing blocks.

4.3.3 Integration of GRFPs with other SNN Modules

The final integration of the proposed core with neural processing cores is described in this section. The proposed core could be used as the first block of an architecture where the neurons are implemented as a systolic array. A possible configuration of the proposed module with other SNN blocks is shown in figure 4.4. A summary of each one of the system blocks is given below:

• Input dataset, which can be organized in the following way:

– The data columns can be coded using GRFs.

– The class column can be coded using GRFs or other coding schemes, like the winner-takes-all coding scheme proposed in [63].

• The GRFPs code the real values obtained from the input dataset into firing times and send these values to the Neural Network Processor Module.

• The Neural Network Processor Module obtains the network output as a function of the input firing times and of the weights and delays associated to each network connection.

• A Learning block can perform the weight and delay adjustments for learning some task, or process the data in an unsupervised fashion, for data clustering applications.


Figure 4.4: GRFPs Integrated with other SNNs Processes

4.4 Recall: Spiking Neural Layer Processor (SNLP)

4.4.1 Single Hardware Neuron: Neural Processor (NP)

The basic implemented neuron model is shown in figure 4.5. This processor is called Neural Processor (NP), and it has the following inputs:

• Threshold. Determines the threshold that must be reached by the internal potential variable of the neuron to generate its output firing time. The threshold can be the same for every processor, or it can be different for each NP.

• Simulation time. Each neuron could potentially work with a different simulation time, but only configurations where all the neurons of the neuron array work with the same simulation time are implemented.

• CLK and Reset signals. These signals are required for module synchronization.

The NP has the following outputs:

• Firing time. When the neuron potential reaches the established threshold, the simulation time required for reaching that neuron potential is stored in the firing time register for future processing.

• Neuron status. A neuron status register bank is available to external processing elements. Each register of this bank contains information about the neuron status. If a '1' is stored, the neuron has fired; otherwise, the neuron is still processing and has not fired.

More information about the neuron behavior is detailed in section 2.6.1.3. A behavioral sketch of the NP is given below.
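The following C sketch summarizes the NP behavior described above: the potential is accumulated step by step, and the first time it crosses the threshold the current simulation step is latched as the firing time and the status flag is set. The spike-response kernel is the standard SpikeProp kernel; the fixed parameters (tau, step length, maximum number of steps, number of inputs) are illustrative assumptions.

#include <math.h>

#define MAX_STEPS  1024          /* assumed maximum number of timesteps    */
#define NUM_INPUTS 4             /* presynaptic connections (example)      */

/* Spike-response kernel eps(t) = (t/tau) * exp(1 - t/tau), tau assumed.   */
static double eps(double t, double tau) {
    return (t <= 0.0) ? 0.0 : (t / tau) * exp(1.0 - t / tau);
}

/* Returns the firing step, or -1 if the threshold is never reached.
 * 'status' is set to 1 when the neuron fires (mirrors the status register). */
int np_recall(const double w[NUM_INPUTS], const double delay[NUM_INPUTS],
              const double t_in[NUM_INPUTS], double threshold,
              double step, double tau, int *status) {
    *status = 0;
    for (int s = 0; s < MAX_STEPS; s++) {
        double t = s * step, potential = 0.0;
        for (int i = 0; i < NUM_INPUTS; i++)           /* sum weighted kernels */
            potential += w[i] * eps(t - t_in[i] - delay[i], tau);
        if (potential >= threshold) {                  /* latch firing time    */
            *status = 1;
            return s;
        }
    }
    return -1;                                         /* no spike in window   */
}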


Figure 4.5: Structure of a Neural Processor

4.4.2 Modules Description

In figure 4.6, the main components of a SNLP are shown. The main components are described below:

• Neural Processors (NPs). A Neural Processor obtains the neuron output as a function of the weights and delays, as established by equation (3.10). Each NP accesses its own Weight Memory (WM) and Delay Memory (DM).

• Weight Distribution Unit. This unit stores the weights and delays for a given neuron, sends them to the external memory reorganized in a different format, and then takes those weights and delays and distributes them to the corresponding Weight Memory (WM) and Delay Memory (DM).

• Load Weight-Delay Unit. Once the weights and delays are organized in a convenient way, they are loaded into the Weight Memory (WM) and Delay Memory (DM).

• Activation Function Memory (AFM). In the case of the implemented SNNs, only some values of the activation function (given by equation 3.9) are required. The domain and range are well known, which allows the exponential function to be precomputed: evaluating it directly would require a large amount of hardware resources. A LUT-based implementation reduces both the number of external memory accesses and the hardware resources needed for the hardware implementation. In the case of the Xilinx Virtex FPGA families, there are internal dual-port (BRAM) memories that can be used for data storage. If the constant τ is fixed and the resolution of the timescale is known, then it is possible to precompute the values and send that precomputed data to each LUT. An address computation is required to access the right memory elements and use them for the neuron potential calculation. The address computation is simple, since it is performed only with a subtraction between the input firing time and the current firing time (a LUT sketch is given after this list).

• External RAM Interface. Allows the SNLP to access the data stored in the external memory.

• External Memory. Stores weights, delays, input and output firing times, activation functions and architectural parameters required by the SNLPs.

• Global Control Unit. Generates the synchronization signals when the data transfer among the memories of the SNLP must be performed, computes the address location required for each NP, and generates the reset and clock signals for each NP.
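As a rough illustration of the AFM idea, the sketch below precomputes the activation function into a table indexed by timestep and looks values up with the subtraction-based addressing described above. The kernel, the table length (1024 entries) and the timestep resolution are assumptions taken from other parts of this document, not a register-accurate model of the BRAM contents.

#include <math.h>
#include <stdio.h>

#define LUT_LEN    1024          /* assumed BRAM depth (see Chapter 5)       */
#define RESOLUTION 0.01          /* assumed timestep length                  */

static double afm[LUT_LEN];      /* software stand-in for the AFM BRAM       */

/* Fill the LUT with eps(t) = (t/tau) * exp(1 - t/tau), t = i * RESOLUTION.  */
void afm_precompute(double tau) {
    for (int i = 0; i < LUT_LEN; i++) {
        double t = i * RESOLUTION;
        afm[i] = (t / tau) * exp(1.0 - t / tau);
    }
}

/* Subtraction-based addressing: the LUT index is the difference between the
 * current simulation step and the (delayed) input firing step.              */
double afm_lookup(int current_step, int input_step, int delay_steps) {
    int addr = current_step - input_step - delay_steps;
    if (addr < 0 || addr >= LUT_LEN)
        return 0.0;              /* outside the covered range: no contribution */
    return afm[addr];
}

int main(void) {
    afm_precompute(3.0);                                 /* tau assumed      */
    printf("eps at step 100 = %f\n", afm_lookup(150, 40, 10));
    return 0;
}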

4.4.3 Modules Interaction

The modules interaction required for a SNLP can be summarized as follows:

1. The activation function memory of each NP is loaded with the precomputed activation function.

2. Weights and delays are stored in the Weight Memory (WM) and Delay Memory (DM), respectively.

3. The NPs perform the neuron processing using the weights, delays and activation function stored in their internal memories.

4. When an NP obtains its output firing time, this firing time is stored in the FTR. The SR is modified with the new neuron status.

5. The internal control unit sends the output firing times stored in the FTRs to the external memory, so that these firing times can be processed by the next processing layer.

4.5 Learning Modules (LMs)

4.5.1 Module Description

The learning process is performed in hardware by the LMs. Two processing phases are performed by the LM:


Figure 4.6: Components of a SNLP


• Delta Computation. As a function of the weights and delays connecting each neuron j with its preceding layer i, the delta computation is performed. This computation is based on equations 3.15 and 3.16. As established in these equations, there are two computation cases. The first case is the delta computation for the last neuron layer. The second case is the delta computation for each neuron belonging to a hidden layer.

• New Weight Computation and Weight Updating. Each weight of the network is read from the external memory, and the new weight is obtained from the learning rate, the delta parameters associated to that neuron, and the current weight. When the new weights are obtained, they are written back to the external memory for future use in the learning or recall phases. This module reads the weights from the external memory, computes the adjustment factors for weights and delays (according to equations 3.16 and 3.15), performs the weight updating (according to equations 3.17 and 3.18) and the delay updating (according to equations 3.19 and 3.20), and finally sends the weights and delays back to the external memory for their future use in the learning or recall phases (a simplified sketch of this update loop follows).
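The sketch below shows the generic shape of this gradient-descent update loop: for each connection, an adjustment factor already derived from the deltas is scaled by the learning rate and subtracted from the stored weight (and, analogously, from the delay). It is only a schematic stand-in for equations 3.15-3.20; the exact delta and derivative expressions of the implemented SpikeProp variant are not reproduced here.

/* Schematic SpikeProp-style update for one layer: w and d are the weight and
 * delay arrays (one entry per connection), dEdw/dEdd are the adjustment
 * factors already computed from the deltas (eqs. 3.15/3.16), and eta_w/eta_d
 * are the learning rates.  The real module streams these values through the
 * external memory instead of plain arrays.                                   */
void update_layer(double *w, double *d,
                  const double *dEdw, const double *dEdd,
                  int n_connections, double eta_w, double eta_d) {
    for (int c = 0; c < n_connections; c++) {
        w[c] -= eta_w * dEdw[c];   /* weight updating, cf. eqs. 3.17/3.18 */
        d[c] -= eta_d * dEdd[c];   /* delay updating,  cf. eqs. 3.19/3.20 */
    }
}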

The main components of the Learning Module (LM) are shown in figure 4.7. Each component is described below:

• Control unit. This unit defines when the data must be loaded, and generates the memory address for each BRAM. In this case, the control is similar to the one implemented for the NP, because weights and delays must be reloaded from the external memory.

• Weights Memory (WM). This memory is the same as that used in each NP, but the control is granted to this module. This memory contains the weights of each connection between the neurons of layer l − 1 and the current neuron j, and these weights are used for computing the weight adjustment factor.

• Activation Function Memory (AFM). This memory is reused from each NP, but the control signals are driven by the current LM.

• Activation Firing Time Memory (AFTM). This memory stores the input firing times for the current neuron.

• Reciprocal Firing Time Memory (RFTM). Contains the reciprocal of each possible firing time. The memory contents can be computed at the beginning of the processing, or can be obtained from a host PC via the PCI bus.

• Delta Register (DR). Contains the delta computation for the current neuron.

• Numerator Register (NR). Contains the numerator part according to equation 3.15 or equation 3.16.


Figure 4.7: Learning Module

• Denominator Register (DR). Stores the denominator part according to equation 3.15 or equation 3.16.

The processing executed by this module is purely sequential, and no parallel processing is carried out. Therefore, the processing time depends only on the number of neurons in the network layers. This processing is executed only once per neuron, when the neuron has fired, and no further processing is required.

4.6 Additional Hardware Modules

4.6.1 Class Encoder (CE)

The components of the CE module are shown in figure 4.8. Basically, the control unit receives the class to be coded and generates the appropriate control signals. There is an output firing time memory, which stores as many firing times as there are different classes in the dataset (this number is equal to the number of output neurons in the network). A range is defined in terms of MAX and MIN values, which can be defined by the user depending on the requirements of the application. The basic pseudocode for this encoding is defined in algorithm 6.

In figure 4.9, the output firing times for 4 output neurons are shown. Each output neuron represents an output class label and has an associated output firing time.


Figure 4.8: Class Encoder

The firing time of each neuron depends on the limits of a defined range. In figure 4.10, the same encoding process for 7 output neurons is shown. There is no unique definition of how the firing times must be distributed among the output neurons. For an N-class problem, the first output firing time is located at the lower limit of the output range, the last output firing time is located at the upper limit of the output range, and the remaining firing times are located at intermediate positions distributed along the output range. Then, for a three-class problem, the first output firing time is located at the beginning of the output range, the second firing time is located in the middle of the output range, and the last firing time corresponds to the upper limit of the range. An alternative consists in making the separation between firing times constant, but this solution was not explored in this work. In figure 4.11, the three coded neurons and the three possible classes are shown.

For the encoding of the desired output firing times, the following considerations must be observed:

• The number of output classes must be the same as the number of output neurons.

• The output classes must be enumerated properly. For an N-class problem, the class enumeration must be 1, 2, ..., N.



Figure 4.9: Example of Encoding Output Class with 4 Neurons.

Algorithm 6 Desired Output Computation Algorithm
  R = TMAX − TMIN
  Increment = R / (TOTAL_CLASSES − 1)
  TTMIN = TMIN
  for i = 1 to TOTAL_CLASSES do
    FT[i] = TTMIN
    TTMIN = TTMIN + Increment
  end for
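A direct C translation of algorithm 6 is sketched below, spreading the desired firing times evenly between TMIN and TMAX as the surrounding text describes; the concrete TMIN/TMAX values are placeholders chosen for the example.

#include <stdio.h>

#define TOTAL_CLASSES 3
#define TMIN 0.0             /* lower limit of the output range (example)  */
#define TMAX 5.0             /* upper limit of the output range (example)  */

/* Desired output firing time for each class, evenly spaced over the range. */
void desired_output_times(double ft[TOTAL_CLASSES]) {
    double increment = (TMAX - TMIN) / (TOTAL_CLASSES - 1);
    for (int i = 0; i < TOTAL_CLASSES; i++)
        ft[i] = TMIN + i * increment;
}

int main(void) {
    double ft[TOTAL_CLASSES];
    desired_output_times(ft);
    for (int i = 0; i < TOTAL_CLASSES; i++)   /* class labels are 1..N      */
        printf("class %d -> desired firing time %.2f\n", i + 1, ft[i]);
    return 0;
}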



Figure 4.10: Example of Encoding Output Class with 7 Neurons.


Figure 4.11: Different Class Coding using 3 Output Neurons. (a) Desired output pattern for a class 1 pattern; (b) desired output pattern for a class 2 pattern; (c) desired output pattern for a class 3 pattern. (Each panel shows the firing time of the 3 output neurons on a 0-5 time axis.)


4.7 Performance and Parallelism Analysis

4.7.1 Parallelism Levels in the Proposed Architecture

We have defined a four-level hardware architecture that exploits several parallelism schemes. The levels are described as follows:

• At the first level, each NP implements a 5-stage pipeline, where temporal parallelism is used for breaking complex operations into simpler operations working in parallel. At the same level, each GP implements a k-stage pipeline (k is the number of Gaussian fields).

• At the second level, several GPs and NPs can be placed in parallel. The NPs are grouped into a larger processor (SNLP), while the GPs are grouped into a larger processor (GRFP).

• At the third level, several SNLPs and GRFPs can be grouped for simulating a multilayer network. In this way, a SNLP implements one neuron layer, and if N SNLPs are defined, then an N-layer network can be implemented, where each layer's processing has a time delay T with respect to the previous layer.

• At the fourth level, both GRFPs and SNLPs can work in parallel. The GRFPs obtain the GRF coding for some input patterns, while the SNLPs obtain the network output from other input firing times (previously obtained by the GRFPs).

The implementation of these parallelism levels can be used to generate an architecture with better performance than an implementation on a sequential computer, since pipelining is one of the most effective parallelism techniques used in neural network implementations. At the third level of parallelism, the number of NPs defined for each SNLP determines the number of "physical" neurons that can be computed by the architecture. Increasing the number of NPs allows larger networks to be processed with better performance.

4.7.2 Parallelism Models and Performance Estimation

The most complex processing phase of the implemented architecture is the recall phase. In this phase, the outputs of the hidden and output layers of the network are obtained. In this section, a brief formalization of the model for obtaining the total number of cycles of the proposed architecture, for any number of NPs and neurons, is presented.

The proposed model requires the following assumptions:

• There is only one external memory (weights and delays are stored in one single memory location).

• Only one fetch memory operation can be done at any instant of time.


• The NPs assigned to each layer process in parallel, but the weight and delay loading is performed sequentially.

The input parameters of the model are:

• Number of neurons in the input, hidden and output layers. The proposed architecture can process networks with any number of neurons and any number of layers.

• Number of NPs in each SNLP. For this model, the number of NPs is the same for each defined SNLP, but the model can be modified to obtain the number of cycles when an unbalanced number of NPs per SNLP is defined.

• Maximum Number of Steps (MNS). The MNS is the quotient between the Maximum Simulation Time (MST) and the length of each finite step or resolution, and represents the maximum number of steps required by the architecture for processing the output of all the neurons in the network. The MNS is defined by equation 4.1.

MNS = MST / Resolution   (4.1)

For the presented results, the resolution is 0.01 s and the MST is 10.24 s. Then, using equation 4.1, the obtained MNS is 1024.

• Divisor for MNS (DMNS). The processing required for the MNS can be performed in the following ways:

1. Process all the steps of the input-hidden layer and, when finished, start the processing of the hidden-output layer. This processing scheme guarantees that all the neurons are processed, but if early firing times occur, all the remaining time slots of the simulation time must still be processed.

2. Divide the execution steps into clusters of N_Steps, where the first N_Steps are processed by the input-hidden layer of the network; later, the following N_Steps are processed in the input-hidden layer in parallel with the first N_Steps of the hidden-output layer. This processing scheme divides the network processing in a pipelined way, as shown in figure 4.13. In figure 4.13(a), the computation of the network is obtained without dividing the execution steps. In figure 4.13(b) a division by 2 is used, and in figure 4.13(c) a division by 4 is used. The number of steps performed by each of the partitions generated by dividing the whole simulation time into small processing clusters is given by equation 4.2. The larger the divisor, the smaller the number of steps performed by each partition.

STEPs_IT = MNS / DMNS   (4.2)


The main advantage of using the MNS division is the possibility of processing both layers in parallel, which reduces the execution time required for completing the processing of the network. The DMNS can be set to any positive integer.

• Work Clock Frequency (WCF). This clock frequency can vary depending on the target FPGA device, but for the provided results this frequency was set to 100 MHz.

The processing required for obtaining the output firing times of the neurons of both hidden and output layers can be divided in two parts:

1. Memory Access. Each NP requires a set of memory accesses for reading weights and delays, or for writing its results. These data are stored in internal memories, which can be accessed by the module at any time.

2. Neuron Processing. The number of cycles (or time) required for computing each neuron potential state during a fixed time interval. If the neuron threshold is not reached, the neuron processing continues until one of the following conditions is satisfied:

• The threshold has been reached and the neuron has generated an output spike.

• The MNS has been reached and the neuron has not generated an output spike.

The reported number of cycles was obtained for the worst case. For the proposed architecture, the worst case occurs when all the neurons in both hidden and output layers fire late (very close to the maximum simulation interval).

The Number of Neurons - NPs Relation (NNPR) depends on both parameters. If the number of neurons to be simulated in the hidden or output layers is not larger than the number of NPs, this relation yields a value of at most 1. If the number of neurons is greater than the number of available NPs, the relation yields a value larger than 1, and the NPs must be multiplexed through time. Equation 4.3 defines how this relation is obtained.

NNPR = max{NHN, NON} / min{NPIH, NPHO}   (4.3)

If the number of neurons to be simulated is greater than the number of available NPs, these NPs must be multiplexed in time. In figure 4.12, an example of this multiplexing is shown. In the proposed example, only 4 NPs are available, and the number of neurons to be simulated in each layer is 8 (the network topology is 8-8-8). The data loading and processing sequence is described below:

• First, in the timeslot 0, weights and delays of the hidden neurons 0 to 3 are loaded.


Figure 4.12: Layer Parallelism using Pipeline Processing

• In timeslot 1, weights and delays for the output neurons 0 to 3 are loaded, and in parallel, the output of the hidden neurons 0 to 3 is computed.

• In timeslot 2, weights and delays of the hidden neurons 4 to 7 are loaded, and in parallel, the output of the output neurons 0 to 3 is computed.

• In timeslot 3, weights and delays of the output neurons 4 to 7 are loaded, and in parallel, the output of the hidden neurons 4 to 7 is computed.

• Finally, in timeslot 4, the output of the output neurons 4 to 7 is computed.

The number of clock cycles required for loading the weights and delays of all the NPs in the hidden layer is determined by equation 4.4.

LoadIH = (NIN ∗ NSC) ∗ NPsIH   (4.4)

where NIN is the Number of Input Neurons and NSC is the Number of Synaptic Connections. The number of clock cycles required for loading the weights and delays of all the NPs in the hidden-output layer is determined by equation 4.5.

LoadHO = (NHN ∗ NSC) ∗ NPsHO   (4.5)

where NHN is the Number of Hidden Neurons. For each NP, there is a latency of 5 clock cycles. This latency is justified by breaking several complex operations into simple operations, with the purpose of minimizing the combinational delay and increasing the clock frequency. The operations to be broken are:

1. To compute the address for the delay memory.

2. To access the activation function based on the delay obtained from the delay memory. In this step, the address for the weight memory is also computed.


(a) Without using Divider

(b) Using Divider 2

(c) Using Divider 4

Figure 4.13: Different Dividers for the MST

3. To compute the product of the activation function with the associated weight.

4. To accumulate the weight computation and compare it with the fixed threshold.

5. To store the resulting firing time and the neuron status.

The total number of cycles required for processing the data contained in a NP associated to the input-hidden layer is given by equation 4.6.

ProcessIH = LATENCY + (STEPs_IT ∗ (NIN ∗ NSC))   (4.6)

where LATENCY is the number of cycles required by the NP for generating the first result. For the presented results, the LATENCY is 5 cycles. The total number of cycles required for processing the data contained in a NP associated to the hidden-output layer is given by equation 4.7.

ProcessHO = LATENCY + (STEPs_IT ∗ (NHN ∗ NSC))   (4.7)

The maximum number of cycles required for the network processing depends on LoadIH, LoadHO, ProcessIH, ProcessHO and NNPR. This computation can be divided in two parts. The first part computes the overlapped processing of both layers, as shown in figure 4.12. This first part is given by equation 4.8.


TotalC = NNPR ∗ max{LoadHO, ProcessIH} + (NNPR − 1) ∗ max{LoadIH, ProcessHO}   (4.8)

The second part includes the first part (LoadIH) and the last part (ProcessHO) of the pipeline shown in figure 4.12. The number of cycles required for one single iteration is defined by equation 4.9.

Cycles_Iteration = LoadIH + TotalC + ProcessHO   (4.9)

Finally, the maximum execution time (in seconds) required for the architecture to finish its processing (in the worst case) is given by equation 4.10.

ExecutionTime = Cycles_Iteration ∗ DMNS ∗ (1 / WCF)   (4.10)
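To make the model concrete, the following C sketch evaluates equations 4.1-4.10 for one topology; the parameter values in main() (resolution, MST, clock frequency, synaptic terminals) mirror the ones quoted in this section, while the topology and NP counts are arbitrary example inputs.

#include <stdio.h>

/* Worst-case execution-time model of the recall phase (eqs. 4.1 - 4.10).    */
double execution_time(int NIN, int NHN, int NON,      /* neurons per layer   */
                      int NPsIH, int NPsHO,           /* NPs per SNLP        */
                      int NSC,                        /* synaptic terminals  */
                      double MST, double resolution,  /* simulation window   */
                      int DMNS, double WCF) {         /* divisor, clock (Hz) */
    const int LATENCY = 5;
    double MNS      = MST / resolution;                           /* eq. 4.1 */
    double steps_it = MNS / DMNS;                                  /* eq. 4.2 */
    int    nnpr     = (NHN > NON ? NHN : NON)
                      / (NPsIH < NPsHO ? NPsIH : NPsHO);           /* eq. 4.3 */
    if (nnpr < 1) nnpr = 1;                            /* at least one pass   */
    double loadIH   = (double)NIN * NSC * NPsIH;                   /* eq. 4.4 */
    double loadHO   = (double)NHN * NSC * NPsHO;                   /* eq. 4.5 */
    double procIH   = LATENCY + steps_it * NIN * NSC;              /* eq. 4.6 */
    double procHO   = LATENCY + steps_it * NHN * NSC;              /* eq. 4.7 */
    double totalC   = nnpr * (loadHO > procIH ? loadHO : procIH)
                    + (nnpr - 1) * (loadIH > procHO ? loadIH : procHO); /* 4.8 */
    double cycles   = loadIH + totalC + procHO;                    /* eq. 4.9 */
    return cycles * DMNS / WCF;                                    /* eq. 4.10 */
}

int main(void) {
    /* 8-8-8 network, 4 NPs per layer, 16 synaptic terminals (example only). */
    double t = execution_time(8, 8, 8, 4, 4, 16, 10.24, 0.01, 1, 100e6);
    printf("worst-case execution time: %f s\n", t);
    return 0;
}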

The proposed model works with both regular and irregular topologies. The graphs shown in this section help to understand the performance of the model in the worst case. The reported performance for a given network with the same topology and number of neurons will depend on the initialized weights, delays, thresholds and decay constants. For the shown graphs, the resolution is set to 10^-2 seconds and the simulation interval is set to 10.23 seconds. The results obtained by this model can differ from the results obtained by the physical implementation of the architecture on an FPGA device, due to several aspects such as the maximum clock frequency, the available hardware resources, and the weight, delay and decay initializations, among other factors. In figure 4.14, a set of topologies with a large number of output neurons was tested. In figure 4.15, a set of topologies with a large number of input neurons was tested. In figure 4.16, a set of topologies with a large number of hidden neurons was tested. Finally, in figure 4.17, the number of neurons is the same for all the layers of the network (the network is perfectly balanced).

4.8 Discussion

In this chapter, the proposed hardware blocks for the target architecture are described. The network dataflow identifies three processing stages: the coding phase, the recall phase and the learning phase. For each processing stage, a hardware block is proposed.

Regarding the hardware block for GRFs, the proposed hardware block is designed to be scalable by adding more GRFPs. The architecture allows the implementation of several GRFPs, with several GPs implemented on each GRFP.

Regarding the hardware block for the Recall Phase, the proposed architecture allows the implementation of one or more SNLPs, where the basic module is the NP. The architecture can be extended to work with more NPs.

Regarding the hardware block for learning, two processing stages can be defined in the learning block: the computation of the weight and delay adjustments, and the updating of the weights and delays. The proposed architecture implements the SpikeProp learning, but other learning techniques related to gradient descent computation can be adapted to the proposed hardware block.

[Plot: execution time (s) vs. number of NPs (4, 8, 16, 32, 64) for the topologies 64-128-512, 32-64-256, 16-32-128, 8-16-64 and 4-8-32.]

Figure 4.14: Execution Time obtained from the Model for Networks with a large number of Output Neurons

The performance estimation for the Recall phase is proposed as an instrument for predicting the network performance for a given combination of neurons and layers, and this estimation will be useful for validating the network performance when the hardware implementation of the proposed architecture is reported.


[Plot: execution time (s) vs. number of NPs (4, 8, 16, 32, 64) for the topologies 512-64-128, 256-32-64, 128-16-32, 64-8-16 and 32-4-8.]

Figure 4.15: Execution Time obtained from the Model for Networks with a large number of Input Neurons


[Plot: execution time (s) vs. number of NPs (4, 8, 16, 32, 64) for the topologies 128-512-64, 64-256-32, 32-128-16, 16-64-8 and 8-32-4.]

Figure 4.16: Execution Time obtained from the Model for Networks with a large number of Hidden Neurons

[Plot: execution time (s) vs. number of NPs (4, 8, 16, 32, 64) for the balanced topologies 512-512-512, 256-256-256, 128-128-128, 64-64-64 and 32-32-32.]

Figure 4.17: Execution Time obtained from the Model for Balanced Networks


Chapter 5

Hardware Implementation

In this chapter, the results of the implementation of each of the hardware modules and of the complete architecture described in Chapter 4 are presented.

5.1 Validation Environment

5.1.1 SW Tools

5.1.1.1 Handel-C

Handel-C is a programming language used for compiling programs that generate configuration files for FPGAs. Handel-C uses much of the syntax of conventional C, with the addition of operators for dedicated hardware operations. Sequential programs can be written in Handel-C, but to gain the maximum performance benefit from the target hardware, parallel constructs must be used. Since Handel-C is based on the syntax of the conventional C language, programs in Handel-C are implicitly sequential: writing one command after another indicates that those instructions should be executed in that exact order. To execute instructions in parallel, the par keyword must be used [73].

The target of the Handel-C compiler is low-level hardware. This means that applications requiring several operations or functions executed in parallel can be implemented using the parallelism constructs, and the resulting hardware architecture implements all the parallelism specified by the user. Handel-C parallelism is true parallelism, not the time-sliced parallelism used in general-purpose processors: when two instructions are defined to work in parallel, each instruction is executed at exactly the same time by two independent pieces of hardware.

5.1.2 HW Tools

5.1.2.1 Prototyping Board

The Alphadata ADC-PMC is a PCI plug-in card that allows two PMC boards to be used in a single slot of a PC [74]. This board is shown in figure 5.3. It hosts a daughter board called ADM-XPL, which is an advanced PCI Mezzanine Card (PMC) supporting Xilinx Virtex-II Pro (V2PRO) devices. The ADM-XPL board supports 2VP8, 2VP20 or 2VP30 devices, with either 1 or 3 embedded PowerPC processors. The ADM-XPL board uses an FPGA PCI bridge developed by Alpha Data supporting 64-bit PCI at up to 66 MHz. This development board is shown in figure 5.2. The main hardware components are shown in figure 5.1.

FPGA             4-Input LUTs   Slice-FFs   Total Slices   Total BRAMs   Total MULT18x18s
xc2vp30-6ff896   27,392         27,392      13,696         136           136

Table 5.1: Characteristics of the Target FPGA Device

The prototyping platform provides several on-board memory resources, including DDR SDRAM, pipelined ZBT and flash, all of which are optimized for direct use by the FPGA using IP and toolkits provided by Xilinx [75]. Also, a set of software libraries is provided through an SDK [76] for interfacing the target platform with user applications.

The FPGA device supported by the target platform is a Virtex-II Pro FPGA (part number: XC2VP30). The hardware resources available for the target FPGA device are shown in table 5.1. The clock frequency can be defined by the user through a programmable clock, but the oscillator available on the FPGA board limits the maximum clock frequency to 100 MHz. In addition to the logic available in the FPGA, the target device contains 2 PowerPC hardcore 32-bit processors, but these processors are not used in the proposed implementation.

5.1.3 FPGA flow (Environment Flow)

5.1.3.1 Developing Flow

The developing flow is shown in figure 5.4. The starting point of the proposed architecture is the software implementation of the algorithms. For the GRFs, an excellent tutorial and application is detailed in [72], and for the SpikeProp algorithm, Moore's M. Sc. thesis [40] includes a detailed explanation. The proposed software implementations were developed in MATLAB and Visual C++.

The HDL used for the hardware modeling of the proposed architecture is Handel-C. First, the proposed hardware is simulated using the Handel-C simulator to validate the correctness of the proposed architecture. Later, the proposed hardware is synthesized with the tools and applications of the target architecture to evaluate its performance on the target FPGA device.

The second step in the developing flow is the compilation from Handel-C to VHDL code. This VHDL component is inserted into a VHDL wrapper that performs all the transfer and synchronization tasks required for the execution of the implemented architecture. The hardware synthesis is done with the Xilinx ISE tools, and the bitstream configuration file is generated.

Figure 5.1: Main Components of the ADM-XPL board

Figure 5.2: AlphaData ADM-XPL Board

Figure 5.3: Alphadata ADC-PMC Board

Figure 5.4: Developing Flow

When the configuration file is generated, a C program performs the device configuration, reads the input data from text files (containing the dataset to be evaluated and the parameters of the simulation to be performed), and performs the data transfers between the PC RAM and the platform external RAM by reading the dataset from a text file generated by a MATLAB or C program. The dataset is sent to the external memories of the target platform, the implemented architecture takes control of the external memory, and makes the appropriate memory accesses for computing the network output or learning. Once the hardware architecture has finished its processing, it releases the memory banks and the host can access the data stored in the memory regions defined for this purpose. A schematic host-side flow is sketched below.
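The following C sketch outlines that host-side sequence. The functions board_open, dma_write, dma_read, fpga_start and fpga_wait_done are hypothetical stand-ins for the board SDK calls (the real API is documented in [76]); here they are stubs that only trace the intended call order, and the region indices are illustrative.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical SDK wrappers: stubs standing in for the board's real API;
 * they only trace the intended call sequence of the host program.          */
static int board_open(const char *bit)     { printf("configure FPGA with %s\n", bit); return 0; }
static int dma_write(unsigned r, size_t n) { printf("DMA write %zu bytes to region %u\n", n, r); return 0; }
static int dma_read(unsigned r, size_t n)  { printf("DMA read %zu bytes from region %u\n", n, r); return 0; }
static int fpga_start(void)                { printf("assert activation signal\n"); return 0; }
static int fpga_wait_done(void)            { printf("wait for bus-release signal\n"); return 0; }

int main(void) {
    /* Region indices follow the order of the memory map in section 5.1.3.4
     * (0: input-hidden weights ... 5: output firing times); illustrative only. */
    board_open("snn_recall.bit");
    dma_write(0, 1024);        /* input-hidden weights                          */
    dma_write(4, 256);         /* input firing times                            */
    fpga_start();              /* architecture takes control of the memory      */
    fpga_wait_done();          /* architecture releases the memory banks        */
    dma_read(5, 256);          /* read back the output firing times             */
    return 0;
}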

5.1.3.2 Interfacing Framework

Software Modules:

• Weight and delay generator module. Depending on the established network parameters, the weights and delays are generated and transferred to the external memory on the FPGA board.

• Load or generate large datasets module. This module obtains a predefined dataset from a text file, or can randomly generate one with predefined features (samples, variables, number of classes). This module also obtains basic statistics of the analyzed dataset (minimum, maximum, number of examples, number of classes, etc.).

• Activation function computing module. This module evaluates each possible firing time value using equation 2.10. The computed values are later sent to the memory of the target platform for use in the computation of the neuron outputs.

Hardware Modules:

• GRFs (documented previously in section 4.3). Currently, the results for this architecture are simulated, loaded by the SW wrapper core, and finally transferred to the architecture for processing.

• SNNs (documented previously in section 4.4). Currently, the inputs of this module are the input firing times, and its output consists of the hidden and output firing times.

• Learning (documented previously in section 4.5). Currently, when all the firing times of the implemented networks have been obtained, they are sent to the host computer and passed to the hardware emulator for validating the system.


5.1.3.3 FPGA Model

The abstraction of the target FPGA device is shown in figure 5.5. The steps for the Handel-C component to obtain full access to the external memory are described below:

• When the FPGA is programmed, the external memory is blocked for any process that requires a memory read or write operation.

• Once the FPGA has been programmed, the control logic takes control of all the peripherals (memories and PCI bus controller). Then the input data to be used is transferred from the host PC to the external board memory via DMA transfers. In this phase, the control logic has full control of the memory registers, and the Handel-C component must be deactivated, which implies that the activation signal is tied low; the architecture must wait until the communication logic ties this signal high.

• Once the Handel-C component is activated, it gets full control of the required memory registers.

• Once the Handel-C component has finished its processing, the bus-release signal must be tied low, which indicates to the wrapper that the architecture has finished its processing and that the data stored in the external memory can be read.

• The communication logic reads the bus-release signal and determines the end of processing of the Handel-C component; finally, the transfer of the resulting data to the host computer is started and the host takes control of the bus transfers.

5.1.3.4 Memory Map

There is a total of 8 MB of memory, organized as 1M words of 64 bits. For access purposes, it is divided into regions of 1 MByte (131,072 words of 8 bytes each, for a total of 8 memory regions). Each starting address can be calculated as the product of the region index by 131,072. The regions currently used by the architecture are shown in figure 5.6, and their functionality is described below (a small addressing sketch is given after this list):

• Input-hidden weights. In this memory region, the weights of the input-hidden layer are stored.

• Input-hidden delays. In this memory region, the delays of the input-hidden layer are stored.

• Hidden-output weights. In this memory region, the weights of the hidden-output layer are stored.


Figure 5.5: FPGA Model

• Hidden-output delays. In this memory region, the delays of the hidden-output layer are stored.

• Input layer firing times. In this memory region, the input firing times for the input neurons are stored.

• Output layer firing times. In this memory region, the output firing times obtained from the output layer are stored.

• Pre-computed activation function. In this memory region, the activation function values obtained by the activation function computing module are stored.
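A minimal sketch of this region-based addressing is given below; the region enumeration follows the order of the list above, and the word-index arithmetic is taken directly from the 131,072-words-per-region figure quoted in the text.

#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_REGION 131072u   /* 1 MByte of 64-bit (8-byte) words       */

/* Region indices in the order listed above (illustrative enumeration).      */
enum region {
    REG_IH_WEIGHTS, REG_IH_DELAYS, REG_HO_WEIGHTS, REG_HO_DELAYS,
    REG_IN_FIRING,  REG_OUT_FIRING, REG_ACT_FUNC
};

/* Starting word address of a region: region index times 131,072.            */
static uint32_t region_base(enum region r) {
    return (uint32_t)r * WORDS_PER_REGION;
}

int main(void) {
    printf("hidden-output weights start at word %u\n", region_base(REG_HO_WEIGHTS));
    printf("output firing times start at word %u\n", region_base(REG_OUT_FIRING));
    return 0;
}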

For the proposed architecture, the selected widths for storing weights and delays are 16 bits for the delay and 16 bits for the weight. The weight is stored using a fixed-point representation (8 bits for the integer part and 8 bits for the fractional part), but other variations of this representation are supported. For the delay storage, a 16-bit integer representation is selected. The maximum number of weights is shown in table 5.2.


Figure 5.6: FPGA Memory Map

Number of      Weight or delay width
Synapses       8-bit       16-bit      32-bit
2              524,288     262,144     131,072
4              262,144     131,072     65,536
8              131,072     65,536      32,768

Table 5.2: Maximum Number of Weights and Delays in the Defined Memory Region


Figure 5.7: Main Blocks of the Full SNN System

5.2 Main Block of the Full SNN System

The main blocks of the complete proposed system are shown in figure 5.7. Both SW and HW modules are included; both are detailed later in this chapter.

5.3 Hardware Implementation for Gaussian Receptive Fields

5.3.1 Implementation

The software performance is obtained by running the equivalent C-based implementation on a Pentium IV processor running at 3.66 GHz with 512 MB of RAM.

In table 5.3 the synthesis results are shown. Results for 1 and 2 GRFPs are presented, and for each of these experiments, 4, 8, 10 and 16 Gauss modules were synthesized. The reported parameters are: MULT18x18 and BRAM primitives used in the design, and total slices and gate count for each synthesized module. The target FPGA device is a Virtex II Pro (XC2VP30) with 13,696 available slices and 136 BRAM and MULT18x18 blocks.


To assess the data precision of the proposed architecture, a comparison of the HW versus SW implementations is presented. The minimum time-scale used for the simulation was 0.01 ms. In the graph shown in figure 5.9, almost all of the experiments have an SSD measure equal to or less than the minimum simulation time-scale. The relatively large error in the other experiments is caused by the periodic quotient by the number of Gaussian fields (e.g., 1/6), which produces a larger error due to the fixed-point conversion; neural simulations are not affected, because the obtained error is too small to affect the network output.

5.3.2 Testbench

The datasets used for the performed experiments are:

• A randomly generated dataset with 1,000 samples (in the results graphs, this experiment is shown with the legend "RAND-1000").

• A randomly generated dataset with 10,000 samples (in the results graphs, this experiment is shown with the legend "RAND-10000").

• The Iris dataset; the reported execution time is the total execution time for coding the complete dataset, which has 150 samples with 4 variables.

• The Wisconsin Breast Cancer dataset; the reported execution time is the total execution time for coding the complete dataset, which has 699 samples with 9 variables.

5.3.3 Results

In table 5.4, the execution time for several variants of GRFs is shown. For the hardware execution time, the number of clock cycles for the target process is obtained and divided by the clock frequency. For the software execution time, an average of 20 different runs was obtained using the Visual C++ profiler. A comparison of several experiments is shown in figures 5.8(a) and 5.8(b). The data width of the proposed experiments is set to 4 bits for the integer part and 12 bits for the fractional part (4-12). At this moment, only datasets that use 16-bit integers were coded, where the input range is [10-120], but the obtained results must be the same when coding datasets with fixed-point values (although this representation is limited to the 8-bit integer part, 8-bit fractional part format). The legends in the plotted graphs have the following meanings:

• “SW”. Software implementation on the PC with the characteristics previously mentioned.

• “HW-XGRFS-YGPs”, where X is the number of GRFPs and Y is the number of GPs for each defined GRFP.


GRFPs   Gauss Processors   MULT18X18s   BlockRAMs   Total Slices   Gate Count
1       4                  22           4           3,058          425,842
1       8                  38           4           3,792          503,851
1       10                 46           4           4,044          541,769
1       16                 70           4           5,173          658,477
2       4                  42           4           4,026          524,240
2       8                  74           4           4,953          675,641
2       10                 90           4           5,772          753,260
2       16                 136          4           7,715          753,260

Table 5.3: Synthesis Results for the GRFHA - FPGA Device: Virtex II PRO (XC2VP30)

5.3.4 Discussion

In the first part of the processing chain of the proposed architecture, the GRF coding is performed using GRFPs. When synthesizing the proposed architecture with only GRFPs (the SNLPs were excluded), execution time results were obtained, and the performance improvement falls in the range from 4X to 16X depending on the number of implemented GPs. The maximum number of GRFs that can be computed is set to 4, as proposed in [64], since 4 is good enough for obtaining an acceptable network performance; if more GRFs are required, the architecture can be synthesized to fulfill that requirement. These improvement rates are obtained because each GRF is mapped onto a separate module, and several patterns (when adding more GRFPs) can be processed in parallel. Only 30% of the target FPGA device is used when only GRF coding is implemented. This implementation is very compact, and the rest of the FPGA resources can be used for the neural computation architecture. The implemented GRFP uses only 4 GPs. If more GRFPs and GPs are required, the architecture can be extended, but this involves more hardware resources for the GRF coding. Using the implemented cores only for coding, at least a 2X performance improvement is obtained, and for the performed experiments, a maximum performance improvement of 16X is obtained.

The error function gives some interesting conclusions. There are some variations of Gaussian fields that give a very small error, and these variations can be used for applications where the error is a critical factor. The other variations can be used for applications with high-speed processing requirements. The proposed architecture has an average error less than or equal to 0.15 ms, which preserves the precision required for the coding and SNN problems.


[Two plots of processing time (s) versus number of samples (128, 256, 512, 1024, 2048): (a) 4 GRFs, comparing HW-2GRFP-4GPs, HW-1GRFP-4GPs and SW; (b) 8 GRFs, comparing HW-2GRFP-8GPs, HW-1GRFP-8GPs and SW.]

Figure 5.8: Performance Comparison between SW and HW Implementation of GRFs


Figure 5.9: Error Comparison

GRFs   SWET   HWET-1P   HWET-2P
4      1.15   0.50      0.32
5      1.35   0.53      0.34
6      1.58   0.56      0.36
7      1.78   0.59      0.38
8      1.99   0.62      0.40
9      2.19   0.65      0.42
10     2.28   0.68      0.44
11     2.59   0.71      0.46
12     2.81   0.74      0.48
13     3.00   0.77      0.50
14     3.20   0.80      0.52
15     3.41   0.83      0.54

Table 5.4: Execution Time for both HW and SW GRFs Implementations (in milliseconds)


NPs   MULT18x18s   BlockRAMs   Total Slices   Gate Count
2     21           10          2,870          1,487,012
4     33           14          3,500          2,302,151
8     57           22          5,011          3,934,759
16    105          38          7,871          7,197,227

Table 5.5: Synthesis Results for the Layer and Neural Processors - FPGA Device: Virtex II PRO (XC2VP30)

5.4 Hardware Implementation for Recall Phase in Multilayer SNNs

5.4.1 Implementation

A functional prototype has been implemented using the Handel-C HDL. The prototype allows several topologies to be explored and preliminary timing results to be obtained. Later, the prototyped architecture is mapped onto an FPGA for its validation. The proposed architecture was implemented on the ADM-XPL board previously described. For the proposed architecture, only 2 SNLPs with 16 NPs each have been validated, due to FPGA space limitations. The system is fully functional, and allows the exploration of several machine learning applications that can take advantage of the proposed architecture.

In table 5.5, resource statistics for the proposed implementations are reported.

5.4.2 Issues Related to the Neuron Potential Computation

The minimum timestep for the proposed architecture can be established at simulation time, but for the implemented architecture this timestep is fixed and set to 0.01 seconds. To control this timestep, there is a simulation time counter that stores the current simulation time computed by the network. An increment of one unit in the simulation time counter represents an advance of one resolution step (if the step is set to 0.01 seconds, then when the counter stores 0 the simulation time is 0.00 seconds, when the counter stores 1 the simulation time is 0.01 seconds, and so on). The maximum number of timesteps can be defined by the user, or the architecture can compute all the possible timesteps until the 16-bit counter reaches its maximum possible value, at which point the architecture stops the computation of neuron outputs. The current timestep is used for the computation of an activation function address. The activation function is computed in software and stored in BlockRAMs, and when the neuron potential is computed, an address must be generated for obtaining the value of the activation function from the BlockRAM. This is the main reason for the integer representation of the simulation time counter. Since one BlockRAM covers only a small part of the possible simulation time values, more BlockRAMs can be added if a larger space is required. The length of the space covered by the BlockRAMs depends on the configuration available for this component and the word length required for storing the activation function. For the particular target device, the length of the space covered is 1024 words, with 16-bit data stored in each word. This idea is shown in figure 5.10.

Weights and delays are also stored in BlockRAM memories, and the word length for the BlockRAMs is also 16-bit. This is the minimum possible configuration in Virtex-II Pro; other configurations can be established. Weights, delays and activation function data have the following numerical representations:

• The internal data representation used for the weights is a 16-bit fixed-point representation, with 8 bits for the integer part and 8 bits for the fractional part. This resolution can be changed at compilation time.

• The internal data representation used for the activation function is a fixed-point representation, with 1 bit for the integer part and 15 bits for the fractional part.

• The internal representation for the delays is an integer representation. For each simulation time, the index into the activation function memory is computed using the delays.

In figure 5.11, the process of neuron potential computation using the operators previously described is shown. Both the activation function value and the weight are stored in 16-bit registers. The operands are extended to unsigned 18-bit data, and these registers are the inputs of a Mult18x18 block, which computes the product of the two operands. The Mult18x18 primitive on Xilinx Virtex-II FPGA devices has 2 inputs of 18-bit length and one output of 36-bit length. Once the product of the weight and the activation function has been obtained, only 8 bits for the integer part and 8 bits for the fractional part are selected and added to the neuron potential, which has a fixed-point representation with 9 bits for the integer part and 9 bits for the fractional part. A small fixed-point sketch of this datapath follows.
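The sketch below mimics that fixed-point datapath in C: the weight is an 8.8 value, the activation function value a 1.15 value, the raw product is truncated back to an 8.8 contribution, and the running potential is kept in 9.9 format. The bit-slice positions are inferred from the formats quoted above and may differ in detail from the actual Handel-C implementation.

#include <stdint.h>
#include <stdio.h>

/* Fixed-point formats used in the datapath (as described above):
 *   weight     : Q8.8   (16 bits, scale 2^8)
 *   activation : Q1.15  (16 bits, scale 2^15)
 *   potential  : Q9.9   (scale 2^9)                                         */
typedef uint16_t q8_8_t;
typedef uint16_t q1_15_t;
typedef uint32_t q9_9_t;

static q8_8_t  to_q8_8(double x)   { return (q8_8_t)(x * 256.0); }
static q1_15_t to_q1_15(double x)  { return (q1_15_t)(x * 32768.0); }
static double  from_q9_9(q9_9_t x) { return (double)x / 512.0; }

/* One multiply-accumulate step: weight * activation, truncated to Q8.8,
 * then added to the Q9.9 potential (Q8.8 -> Q9.9 is a left shift by 1).     */
static q9_9_t mac_step(q9_9_t potential, q8_8_t w, q1_15_t af) {
    uint32_t product      = (uint32_t)w * (uint32_t)af;  /* scale 2^(8+15)   */
    uint32_t contrib_q8_8 = product >> 15;               /* back to scale 2^8 */
    return potential + (contrib_q8_8 << 1);              /* scale 2^9        */
}

int main(void) {
    q9_9_t potential = 0;
    potential = mac_step(potential, to_q8_8(1.25), to_q1_15(0.8));
    potential = mac_step(potential, to_q8_8(0.50), to_q1_15(0.4));
    printf("potential = %.4f (expected about %.4f)\n",
           from_q9_9(potential), 1.25 * 0.8 + 0.5 * 0.4);
    return 0;
}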

5.4.3 Testbench

The datasets used for the performed experiments are:

• The Iris dataset; the reported execution time is the total execution time for obtaining the network output for the complete dataset, which has 150 samples with 4 variables.

• The Wisconsin Breast Cancer dataset; the reported execution time is the total execution time for obtaining the network output for the complete dataset, which has 699 samples with 9 variables.


Figure 5.10: Activation Function Computation Space for both a 16-bit counter and a 10-bit counter

Figure 5.11: Widths and Variables used for Neuron Potential


5.4.4 Results

In figure 5.12, several network performance evaluations are shown. In figure 5.12(a), the execution time comparison for a 64-64-64 FF-SNN using thresholds in the range [2,128] is shown. The plot shows the software implementation ("SW" legend) and the hardware implementation with 2 SNLPs and 4 NPs ("HW-4P" legend), 8 NPs ("HW-8P" legend) and 16 NPs ("HW-16P" legend). In an analogous manner, the same comparison is performed in figure 5.12(b) for a 128-128-128 FF-SNN, and in figure 5.12(c) for a 256-256-256 FF-SNN.

5.4.5 Discussion

In the second part of the processing chain, the SNLPs are added to the implemented architecture. The obtained performance improvement ranges from 4X to 20X depending on the number of NPs. About 50% of the FPGA device is used, with 1 GRFP and 2 SNLPs implemented. This core combination can be considered the "minimal" implementation that can functionally perform all the computations required by the SNNs described in this work. When using the "minimal" architecture, at least a 3.5X performance improvement is obtained for the smallest tested network, and a maximum 9.5X performance improvement for the largest network tested in this work. When using more than 4 NPs per SNLP, the area required for the design increases by a factor of 1.5. For the last implementation, only 52% of the Mult18x18 embedded multiplier modules are used, while the number of BRAMs is close to the limit (96%). The hardware resources used scale linearly with the number of neurons, and the performance decreases linearly but with a small slope compared with the hardware resource increase. The performance improvement is good (on the order of 7X to 9X) when using regular topologies (networks with the same number of neurons in each layer), and not as good (about 2X - 4X) when using an unbalanced number of neurons in the layers.

5.5 Hardware Implementation for Learning Phase in Multilayer SNNs

5.5.1 Results

In figure 5.14, a performance comparison of both the recall and the learning phase is shown. The legends shown in the graphs have the following meanings:

• IRIS. The Iris dataset has 150 samples and 4 variables. This dataset has samples that belong to 3 different classes. The number of GRFs is set to 4 for each variable, and the final topology tested was 16-16-3.

• WBC. The Wisconsin Breast Cancer dataset has 699 samples with 9 variables. This dataset has samples that belong to 2 different classes. The number of GRFs is set to 4 for each variable, and the final topology tested was 36-8-2.


(a) 64-64-64 Network

(b) 128-128-128 Network

(c) 256-256-256 Network

Figure 5.12: Threshold Evaluation


[Two plots of processing time (s) versus network topology, with threshold 1: (a) IRIS dataset (150 samples), topologies 16-16-2, 16-32-2 and 16-64-2; (b) WBC dataset (699 samples), topologies 36-16-2, 36-32-2 and 36-64-2; series HW-16P, HW-8P, HW-4P and SW.]

Figure 5.13: Performance Comparison for the Recall Phase Implementation in HW and SW (Real Datasets)


Figure 5.14: Comparison of Learning Performance of both SW and HW Implementations

• RAND-X. This random dataset has 1000 samples with 8 variables. This dataset has samples that belong to 8 different classes. The number of GRFs is set to 4 for each variable, and the final topology tested was 32-8-8.

• RAND-Y. This random dataset has 512 samples with 16 variables. This dataset has samples that belong to 16 different classes. The number of GRFs is set to 8 for each variable, and the final topology tested was 128-8-16.

5.5.2 Learning comparison

XOR is a logical operator that is commonly used in data mining and machine learning as an example of a function that is not linearly separable. Table 5.6 represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function, such that Y = X1 XOR X2.

A machine learning algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. The XOR function rose to fame in the late 60's when Minsky and Papert demonstrated that perceptron-based systems cannot learn to compute the XOR function. Subsequently, when backpropagation-trained multi-layer perceptrons were introduced, one of the first tests performed was to demonstrate that they could in fact learn to compute XOR.


X1 X2 Y

0  0  0
0  1  1
1  0  1
1  1  0

Table 5.6: XOR dataset

Pattern  Input0  Input1  Input2  Output0  Output1

0        0.00    0.50    0.50    4.00     5.00
1        0.00    0.50    1.50    5.00     4.00
2        0.00    1.50    0.50    5.00     4.00
3        0.00    1.50    1.50    4.00     5.00

Table 5.7: Input and Output Firing Times (Patterns) for the XOR Problem.

XOR and generalizations of XOR are among the most popular tests of backpropagation-based systems. The XOR function requires hidden units to transform the input into the desired output.

In a similar way to that reported in [63], for encoding the XOR function the output is coded using a couple of neurons. Three input neurons must be used. The first one is the reference neuron, with firing time equal to 0.0. The second neuron codifies the first variable in the XOR function, and the third neuron codifies the second variable. For coding a 0 in an input variable, an "earlier" firing time is defined; for coding a 1, a "later" firing time is defined. For the output neurons, a 0 output is coded by assigning an "earlier" firing time to the first output neuron and a "later" one to the second. A 1 output is coded by assigning a "later" firing time to the first output neuron and an "earlier" one to the second. The temporally coded XOR is shown in table 5.7. The numbers in the table represent spike times with a resolution of 10^-2 seconds.
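To make the temporal coding concrete, the following minimal sketch (a plain software illustration, not part of the hardware implementation) reproduces the firing-time patterns of table 5.7; the constants 0.5/1.5 and 4.0/5.0 are the "earlier"/"later" firing times listed there.

# Minimal sketch of the temporal XOR encoding described above.
EARLY_IN, LATE_IN = 0.5, 1.5      # input firing times for logical 0 / 1
EARLY_OUT, LATE_OUT = 4.0, 5.0    # output firing times ("earlier" / "later")

def encode_xor_pattern(x1, x2):
    """Map one row of the XOR truth table to input/output firing times."""
    y = x1 ^ x2
    inputs = [0.0,                           # reference neuron always fires at t = 0
              LATE_IN if x1 else EARLY_IN,   # neuron coding X1
              LATE_IN if x2 else EARLY_IN]   # neuron coding X2
    # output 0 -> first output neuron fires first; output 1 -> second fires first
    outputs = [EARLY_OUT, LATE_OUT] if y == 0 else [LATE_OUT, EARLY_OUT]
    return inputs, outputs

if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print((x1, x2), encode_xor_pattern(x1, x2))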

The characteristics of the proposed implementation are summarized in table 5.8. The architecture starts the learning with the input patterns, and the resulting output firing times are shown. The thresholds, decays and learning rates are defined in a similar way to other works in the literature. The number of defined hidden neurons is 4, but similar results are obtained with a different number of hidden neurons in the range [3-12]. The proposed architecture is flexible with respect to the parameters shown in the table (with the exception of the number of synaptic terminals per connection, which is fixed to 2 synaptic terminals).

In figure 5.15, the output spikes at several iterations of the learning process are shown. The output firing times of the implemented network are shown for several iterations, and the desired firing times are shown in the last row of the figure. Note that at iteration number 2, the output firing times are different and their order can vary from the desired output times.


Input neurons                      3
Hidden neurons                     4
Output neurons                     2
Synaptic terminals per connection  2
Weights learning rate              2.5
Delays learning rate               0.5
Neuron threshold                   2.0
Neuron decay                       0.07
Timeslot resolution                0.01

Table 5.8: Network Configuration and Learning Parameters

At iteration number 20, the actual output firing times appear to be closer to the desired output firing times, and at iteration number 200, the actual output firing times are almost the same.

5.6 Overall Implementation

The architectures described previously can be combined into one compact core for performing all the phases required for SpikeProp learning. Each module can be implemented separately or it can be merged with others to form a more complex hardware block. It is possible to define as many GRFPs as necessary for performing the input coding. As discussed previously, this block has other potential applications.

When both the Coding and Recall hardware blocks (the latter performed by the SNLPs) are implemented, it is possible to evaluate several input patterns using a predefined set of weights and delays. In this scheme, it is possible to train the weights and delays using an external tool (a SW program running on a host computer), and later to load the weights and delays into the external memory for performing the recall phase (obtaining only the network output, without training or modifying weights and delays).

Finally, for completing the learning on hardware, the LM can be added to the overall hardware core. The LM performs the computations required for adjusting weights and delays without the additional external memory transfers to/from the host computer that are required when the learning is performed off-chip.

In table 5.9, hardware resource statistics for several variations of the proposed architecture are shown. All the implementations have 2 SNLPs with 4 NPs each, but they vary with respect to the number of GRFPs, which can be 1 or 2, and the number of GPs in each GRFP, which can be 4 or 8. Additionally, the implementation may or may not include the learning module.


5.6.1 Discussion

The obtained results are interesting because they show the advantages of the hardware implementation of both the GRF and SNN modules, but several trade-offs must be considered. The performance for a set of topologies with the same number of neurons in each of their layers was obtained. There is a trade-off concerning the number of NPs that can be implemented to obtain a high performance rate. When using from 8 to 16 NPs grouped in 2 SNLPs, the performance improvement is significant (about 30% over the performance of 8 NPs), but for more processors the performance gain stabilizes and remains almost the same. The area increase is not linear when the number of processors is doubled. The performance curve is almost linear with respect to the number of neurons in the network. If the number of processors is increased, then the network performance is also increased, but at a high cost in hardware resource utilization. In the actual implementation, the maximum number of NPs implemented for the proposed architecture is 40 (20 for each SNLP), allowing a performance improvement of at least 9X for the largest tested network. The limitation on the number of implemented NPs is given by the target FPGA resources.


Figure 5.15: Network Output Firing times for Selected Learning Iterations.


Implementation  GRFPs/GPs  SNLPs/NPs  Learning  MULT18x18s  BlockRAMs  Slices   Total gate count
HW1             1/4        2/4        N/A       35          36         7,042    2,646,401
HW2             1/8        2/4        N/A       51          36         7,936    2,726,128
HW3             2/4        2/4        N/A       54          36         8,330    2,745,688
HW4             2/8        2/4        N/A       86          36         10,373   2,908,649
HW5             1/4        2/4        A         51          49         9,385    3,605,252
HW6             1/8        2/4        A         67          49         9,817    3,686,959
HW7             2/4        2/4        A         70          49         10,717   3,705,058
HW8             2/8        2/4        A         102         49         12,796   3,868,030

Table 5.9: Results for the Overall Architecture


Chapter 6

SNN Architecture Tests with Datasets

In this chapter the application of the implemented architecture to several standard benchmarks is reported. Execution times, accuracy and the evolution of the MSE metric are shown.

6.1 State of the Art of SNNs Applied to Standard Machine Learning Datasets

The main references on tests of SNNs with standard machine learning datasets are:

• In [29] and [63], experiments with the following standard benchmark problems have been carried out: the IRIS dataset, the Wisconsin Breast Cancer dataset and the Statlog Landsat dataset. The network topology used is the feedforward topology and the learning algorithm is SpikeProp.

• In [77], the application of SNNs to the lip reading problem (Tulips Audiovisual database) is reported. The network topology used is a recurrent FF topology, and the learning algorithm is SpikeProp adapted for neurons that fire multiple spike times.

• In [41], the SNNs are tested with the IRIS dataset. The network topology used is a feedforward topology and the learning algorithm is SpikeProp, but the learning parameter is adaptively changed.

• In [64], the SNNs are tested with the IRIS dataset. The network topology used is a feedforward topology and the learning algorithm is SpikeProp, but with adaptations of the RProp and QProp algorithms.

• In [34], experiments with the following standard benchmark problems have been carried out: the Iris dataset and the Diabetes dataset. The network topology used is a feedforward topology and the learning algorithm is the Parallel Differential Evolution Algorithm (PARDE), based on Evolutionary Algorithms (EA).

• In [28], a hardware investigation of both spiking and classical neural models is reported. The Iris dataset benchmark problem is used. The evaluated models are arranged into specific network topologies to suit the problem and a supervised learning algorithm is employed. For MLP networks, gradient descent is used, and for the spiking models, an evolutionary strategy is applied. Later, a hardware prototype of each model is implemented and the required hardware resources are evaluated.

• In [33], experiments with the following standard benchmark problems have been carried out: the IRIS dataset and the Wisconsin Breast Cancer dataset. The network topology is a recurrent FF topology, and the learning algorithm is based on Evolutionary Strategies (ES).

6.2 Application Problems

The proposed architecture is used for solving several classification tasks. The architecture is configured for several network topologies, and the network outputs are a set of firing times. The criteria for translating the output firing times into pattern classes for evaluating the classification performance are shown in table 6.1. Basically, three possible cases can occur when analyzing the network output. The first one is the correct classification, in which the neuron assigned to the class is the first neuron to emit its firing time. The second case is when there are two or more neurons that fire at the same time; in this case, the classification is ambiguous. The third case is when the classification is wrong, that is, when the first neuron that fires is different from the output neuron assigned to the expected output class.
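As an illustration, the following minimal sketch applies these three criteria to one output pattern. It assumes, as described above, that only the earliest-firing output neuron determines the predicted class; the tolerance parameter for detecting simultaneous spikes is an illustrative addition, not a parameter of the actual implementation.

# Minimal sketch of the classification criteria of Table 6.1.
def classify(output_times, desired_times, tol=0.0):
    """Return 'accurate', 'ambiguous' or 'misclassification' for one pattern."""
    earliest = min(output_times)
    winners = [i for i, t in enumerate(output_times) if abs(t - earliest) <= tol]
    if len(winners) > 1:
        return "ambiguous"            # two or more neurons fire at the same time
    target = desired_times.index(min(desired_times))  # neuron assigned to the class
    return "accurate" if winners[0] == target else "misclassification"

# Examples taken from Table 6.1
print(classify([0.06, 0.08, 0.17], [0.05, 0.09, 0.19]))   # accurate
print(classify([0.06, 0.08, 0.06], [0.05, 0.09, 0.19]))   # ambiguous
print(classify([0.19, 0.08, 0.04], [0.05, 0.09, 0.19]))   # misclassification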

6.2.1 IRIS Dataset

The Iris flower dataset, or Fisher's Iris dataset, is a multivariate dataset introduced by Ronald Aylmer Fisher in 1936. The dataset consists of 50 samples for each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured for each sample: petal length, sepal length, petal width and sepal width (all measured in centimeters). Fisher developed a linear discriminant model to determine which species they are [78]. The Iris dataset is considered to be a reasonably simple classification problem, although the classes contained in the dataset are considered not linearly separable.

For the Iris dataset, the inputs for the neurons are defined by evaluating each input variable (or column) through a set of GRFs. In the reported results, each variable is coded using 4 GRFs, which corresponds to the number of input neurons per variable. In this case, the number of variables in the dataset is 4, and therefore the number of input neurons is 16.
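The following sketch illustrates this population coding for a single input variable. The evenly spaced centres, the Gaussian width and the 0-10 coding interval are illustrative assumptions only and are not the exact parameters of the hardware GRFPs.

# Minimal sketch of Gaussian receptive field (GRF) population coding of one value.
import math

def grf_encode(x, x_min, x_max, n_fields=4, t_max=10.0):
    """Return one firing time per receptive field for the input value x."""
    centres = [x_min + (i + 0.5) * (x_max - x_min) / n_fields for i in range(n_fields)]
    width = (x_max - x_min) / n_fields
    times = []
    for c in centres:
        activation = math.exp(-((x - c) ** 2) / (2.0 * width ** 2))  # in (0, 1]
        times.append((1.0 - activation) * t_max)   # strong activation -> early spike
    return times

# Encoding one Iris feature, e.g. a petal length of 1.4 cm in the range [1.0, 6.9]
print([round(t, 2) for t in grf_encode(1.4, 1.0, 6.9)])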


Table 6.1: Criteria for Classification Quality

Criterion                  Description                                          Output Example
Accurate Classification    The order of the firing times of the output         0.05 0.09 0.19 (desired)
                           neurons matches with the desired firing times       0.06 0.08 0.17 (obtained)
Ambiguous Classification   There are two or more neurons that fire at          0.05 0.09 0.19 (desired)
                           the same time                                       0.06 0.08 0.06 (obtained)
Misclassification          The order of the firing times of the output         0.05 0.09 0.19 (desired)
                           neurons is different from the desired firing times  0.19 0.08 0.04 (obtained)


IRISNET                                      Accuracy
            I    H   O   ITs      Train (%)   Test (%)
Weka BP     4    5   3   1,500    98.2%       95.5%
Matlab BP   4    5   3   1,500    97.4%       96.1%
SpikeProp   16   5   3   800      97.2%       96.8%

Table 6.2: Results for the IRIS Dataset

Figure 6.1: MSE for Iris Dataset (MSE, approximately 0.4 to 1.3, versus training iterations, 0 to 250).

The data was divided into two sets and classified using two-fold cross-validation. The results are reported in table 6.2. The number of hidden neurons is set to 5 (similar to the number of hidden neurons reported in previous work). The number of output neurons depends on the number of classes; in this case, the total number of classes for this dataset is three. In figure 6.1, the MSE along the training iterations is shown. The number of iterations for this dataset is 250. In figure 6.2, the number of correctly classified examples at each training iteration is shown.
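The following sketch outlines the two-fold cross-validation protocol used in these experiments; the train and evaluate callables are placeholders standing in for the SNN training and recall phases, not functions of the actual implementation.

# Minimal sketch of two-fold cross-validation over a labelled dataset.
import random

def two_fold_accuracy(samples, labels, train, evaluate, seed=0):
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    folds = (idx[:half], idx[half:])
    accs = []
    for test_fold, train_fold in ((folds[0], folds[1]), (folds[1], folds[0])):
        model = train([samples[i] for i in train_fold], [labels[i] for i in train_fold])
        correct = sum(evaluate(model, samples[i]) == labels[i] for i in test_fold)
        accs.append(correct / len(test_fold))
    return sum(accs) / 2.0   # average accuracy over the two folds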

6.2.2 Wisconsin Breast Cancer Diagnosis Dataset (WBCD)

The WBCD is a large dataset that consists of 699 patterns, of which 458 are benign samples and 241 are malignant samples. Each of these patterns consists of nine measurements taken from fine needle aspirates from a patient's breast. The measurements are graded 1 to 10 at the time of sample collection, with 1 being the closest to benign and 10 the most anaplastic. The characteristics of the nuclei analyzed in the images are: radius (R), texture (T), perimeter (P), area (A), smoothness (SM), compactness (CM), concavity (C), symmetry (S), concave points and fractal dimension.


Figure 6.2: Correct Classified Samples for IRIS Dataset (accuracy, approximately 0.35 to 0.85, versus training iterations, 0 to 250).

WBCNET                                       Accuracy
            I    H   O   ITs      Train (%)   Test (%)
Weka BP     36   5   2   1,500    97.6%       97.6%
Matlab BP   36   5   2   1,500    98.3%       98.0%
SpikeProp   36   5   2   400      98.9%       97.4%

Table 6.3: Results for the WBC Dataset

A linear programming method for pattern separation, called the Multisurface method, has been proposed in [79].

For the Wisconsin Breast Cancer dataset, the inputs for the neurons are defined by evaluating each input variable (or column) through a set of GRFs. In the reported results, each variable is coded using 4 GRFs, which corresponds to the number of input neurons per variable. In this case, the number of variables in the dataset is 9, and therefore the number of input neurons is 36. The data was divided into two sets and classified using two-fold cross-validation. The results are reported in table 6.3. The number of hidden neurons is set to 5 (similar to the number of hidden neurons reported in previous work). The number of output neurons depends on the number of classes; in this case, the total number of classes for this dataset is two. In figure 6.3, the MSE along the training iterations is shown. The number of iterations for this dataset is 150. In figure 6.4, the number of correctly classified examples at each training iteration is shown.


Figure 6.3: MSE for Wisconsin Breast Cancer Dataset (MSE, approximately 0 to 1.4, versus training iterations, 0 to 250).

Figure 6.4: Correct Classified Samples for Wisconsin Breast Cancer Dataset (accuracy, approximately 0.65 to 1.0, versus training iterations, 0 to 250).


6.3 Discussion

The MSE is used for evaluating the classifier performance. For the presented applications, several plots of the evolution of the MSE along the learning iterations are shown. The presented plots show that the algorithm converges in hardware, and they validate the functioning of the SpikeProp learning in hardware.
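As a reference for how such a curve can be produced, the sketch below computes an MSE over output firing times. It assumes the error is averaged over all output neurons and all patterns, which may differ from the exact normalization used in the reported plots.

# Minimal sketch of an MSE metric over desired and actual output firing times.
def mse(actual_times, desired_times):
    """Mean squared error between actual and desired output firing times."""
    errors = []
    for a, d in zip(actual_times, desired_times):              # one entry per pattern
        errors.extend((ai - di) ** 2 for ai, di in zip(a, d))  # one term per output neuron
    return sum(errors) / len(errors)

# Example with two of the XOR targets of Table 5.7 and partially trained outputs
print(mse([[4.3, 4.8], [4.7, 4.2]], [[4.0, 5.0], [5.0, 4.0]]))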

The accuracy measure is used for evaluating how reliable the classifier is. The accuracy obtained for the hardware learning is very close to the accuracy obtained using classical backpropagation with the Weka or Matlab software. This allows the conclusion that the SpikeProp algorithm implemented in hardware is very competitive with respect to SW implementations in terms of the obtained accuracy.

When comparing the number of iterations required by each evaluated algorithm, the SpikeProp learning requires fewer iterations than classical backpropagation implemented with the Weka or Matlab software. This is an important aspect, because it allows the verification of the proposed hardware and supports the conclusion that the SpikeProp learning requires fewer iterations than the SW implementations.

For machine learning purposes, the digital nature of spikes allows an efficient implementation of SNNs, especially in the case of sparse activity as achieved by the GRFs. The number of learning iterations for the proposed implementation is compared with several machine learning algorithms. The accuracy obtained for the proposed applications is very competitive, and the number of iterations is smaller than the number of iterations required by classical machine learning algorithms.


Chapter 7

Conclusions and Future Work

7.1 Conclusions

In this thesis, a scalable and modular hardware core for SNNs has been proposed. The architecture is based on three phases: Coding, Recall and Learning.

For the coding phase, the GRFs are computed by a set of processors called GRFPs, and a performance improvement of at least 9X is obtained compared with a SW-based implementation. The proposed architecture was tested with several datasets, including randomly generated datasets and standard machine learning datasets. The obtained error shows that the proposed architecture can be applied to solving machine learning classification problems. The hardware module for GRFs is flexible and scalable, and it can be used for other applications reported in the literature.

For the Recall phase, the neural processing is performed by a set of processors called SNLPs. An important performance improvement of at least 4X is obtained when the neural processing is performed by the hardware module. The Recall phase hardware module was tested with both random and real datasets for a performance comparison.

For the Learning phase, the learning is performed by a processor called the LM. The proposed processor implements all the operations required by the SpikeProp algorithm. Performance results and hardware resource statistics for the on-chip learning are presented. The application of the proposed learning to machine learning standard datasets is reported.

For the GRFs combined with the Recall and Learning phases, a performance improvement of at least 4X is obtained. The separate cores were reported previously, but several improvements to each core (logic resource reduction and performance increase) were implemented in the cores reported in this work. Finally, the integration of both cores is fully documented in this work, as well as performance and resource utilization statistics.

For the combination of all the proposed cores, the performance of the full system is presented. The full system can be implemented on a single FPGA device, and it can be implemented on larger FPGA devices for more complex applications.


7.2 Summary of Contributions

• The implementation of a high performance hardware core for GRF coding, which could be useful not only for the implemented SNN, but also for other applications such as clustering based on RBF networks, or recurrent SNNs.

• The implementation of a hardware core for the recall phase of multilayer SNNs, which obtains the network output for a large amount of input patterns, and whose performance improvement is at least 4X with respect to execution on a PC.

• The combination of the previously defined hardware cores into one single hardware core that performs the coding, recall and learning phases of an SNN. The proposed core can be applied to standard machine learning classification tasks.

7.3 Publications

7.3.1 Refereed Publications

The refereed publications derived from this research work are:

1. “An Efficient Scalable Parallel Hardware Architecture for Multilayer Spiking Neural Networks”, Marco Nuno-Maganda, Miguel Arias-Estrada, Cesar Torres-Huitzil. IEEE Southern Conference on Programmable Logic (SPLCONF'07), Mar del Plata, Argentina, February 2007, pp 167-170.

2. “High Performance Hardware Implementation of SpikeProp Learning: Potential and Tradeoffs”. Marco Nuno-Maganda, Miguel Arias-Estrada, Cesar Torres-Huitzil. IEEE International Conference on Field Programmable Technology (ICFPT'07). Kitakyushu, Japan, 2007. pp 129-136.

3. “A Population Coding Hardware Architecture for Spiking Neural Network Applications”. Marco Nuno-Maganda, Miguel Arias-Estrada, Cesar Torres-Huitzil and Bernard Girau. IEEE Southern Conference on Field Programmable Logic (SPLCONF'2009). Sao Carlos, Brazil, 2009. pp 83-87.

4. “Hardware Implementation of Spiking Neural Network Classifiers based on Backpropagation-based Learning Algorithms”. Marco Nuno-Maganda, Miguel Arias-Estrada, Cesar Torres-Huitzil and Bernard Girau. International Joint Conference on Neural Networks (IJCNN'09). Atlanta, USA, 2009. pp 2294-2301.

7.3.2 Unrefereed Publications

The unrefereed publications derived from this research work are:


1. “Towards the Implementation of a Parallel Architecture for Spiking Neural Networks”, Marco Nuno-Maganda, Miguel Arias-Estrada, Cesar Torres-Huitzil. Consorcio Doctoral, Encuentro Nacional de Computacion 2006 (ENC06), San Luis Potosi, Mexico, pp 8-9.

2. “An FPGA-based Architecture for Spiking Neural Networks: Initial Steps”, Marco Nuno-Maganda, Miguel Arias-Estrada, Cesar Torres-Huitzil. Septimo Encuentro de Investigacion INAOE, 2006.

3. “Hardware Implementation of SpikeProp Learning: Preliminary Results”, Marco Nuno-Maganda, Miguel Arias-Estrada, Cesar Torres-Huitzil. Octavo Encuentro de Investigacion INAOE, 2007, pp 201-205.

7.4 Future Research Directions

• The modification of the architecture to support other network topologies, such as recurrent networks or cellular neural networks, could be interesting for a wide range of applications.

• The extension of the implemented hardware neuron to support other neuron models could help to discover more powerful spike-based classifiers.

• The application of the proposed architecture to real-world classifier systems with more demanding performance requirements. These potential applications could validate the usefulness of the architecture in high performance applications.

• The implementation of the proposed architecture on larger FPGA devices could demonstrate its full potential, and the predicted performance could be confirmed.

• Further tests with more machine learning datasets, with the purpose of evaluating the implemented classifier, in order to conclude whether the proposed classifier can be used as a generic classifier that is competitive with other classical classifiers.

7.5 Research Activities

1. Short research stay with the CORTEX group at the Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), in Nancy, France, from November 1st to November 23rd, 2007. The main activities developed during the stay were:

• Implementation of spiking neural networks on FPGA devices with on-chip learning based on the SpikeProp algorithm.


• A study of the mechanisms to integrate the spiking modules into the software platform NNetWARE, developed by the CORTEX team for the automatic generation of spiking neural models on reconfigurable devices.

2. A second research stay with the CORTEX group at LORIA, from June 28th to August 23rd, 2008. The main activities developed during the stay were:

• Presentation of the partial results in a seminar.

• Exchanging ideas with other Ph.D. students at LORIA.


Bibliography

[1] Simon Haykin. Neural networks: a comprehensive foundation. Prentice Hall, 1994. ISBN 0-13-273350-1.

[2] N. Sundararajan and P. Saratchandran. Parallel Architectures for Artificial Neural Networks: Paradigms and Implementations. IEEE Computer Society Press, Los Alamitos, CA, USA, 1998. ISBN 0818683996.

[3] J. Zhu and P. Sutton. FPGA implementations of neural networks - a survey of a decade of progress. In P.Y.K. Cheung, G.A. Constantinides, and J.T. de Sousa, editors, Proc. of the 13th International Conference on Field Programmable Logic and Applications (FPL'03), number 2778 in LNCS, pages 1062–1066, Berlin, Heidelberg, September 2003. Springer Verlag.

[4] Axel Jahnke, Tim Schonauer, Ulrich Roth, K. Mohraz, and Heinrich Klar. Simulation of spiking neural networks on different hardware platforms. In ICANN, pages 1187–1192, 1997. URL citeseer.ist.psu.edu/jahnke97simulation.html.

[5] Fabrice Bernhard and Renaud Keriven. Spiking neurons on GPUs. In Vassil N. Alexandrov, Geert Dick van Albada, Peter M.A. Sloot, and Jack Dongarra, editors, Computational Science – ICCS 2006, volume 3994 of LNCS, pages 236–243. Springer, 2006. ISBN 3-540-34385-7.

[6] Jean-Luc Beuchat, Jacques-Olivier Haenni, and Eduardo Sanchez. Hardware reconfigurable neural networks. In IPPS/SPDP Workshops, pages 91–98, 1998.

[7] L. M. Patnaik. Overview of Neurocomputers. Current Science, 68(2):156–162, 1995. ISSN 0011-3891.

[8] A. Upegui, C. A. Pena-Reyes, and E. Sanchez. A functional spiking neuron hardware oriented model. In Proceedings of the International Work-conference on Artificial and Natural Neural Networks IWANN2003, volume 2686 of Lecture Notes in Computer Science, pages 136–143, Berlin Heidelberg, 2003. Springer.

[9] L. E. Jordan and Gita Alaghband. Fundamentals of Parallel Processing. Prentice Hall Professional Technical Reference, 2002. ISBN 0139011587.


[10] Wolfgang Maass. Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997.

[11] Axel Jahnke, Ulrich Roth, and Tim Schonauer. Digital simulation of spiking neural networks. In Pulsed neural networks, pages 237–257. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-626-13350-4.

[12] Liam P. Maguire, T. Martin McGinnity, Brendan P. Glackin, A. Ghani, Ammar Belatreche, and Jim Harkin. Challenges for large-scale implementations of spiking neural networks on FPGAs. Neurocomputing, 71(1-3):13–29, 2007. ISSN 0925-2312.

[13] Andres Upegui, Carlos Andres Pena-Reyes, and Eduardo Sanchez. An FPGA platform for on-line topology exploration of spiking neural networks. Microprocessors and Microsystems, 29(5):211–223, 2005. ISSN 0141-9331.

[14] S. M. Bohte, J. A. La Poutre, and Joost N. Kok. Unsupervised classification in a layered network of spiking neurons. IEEE Trans. Neural Networks, 13(2):426–435, 2002.

[15] A. C. C. Coolen, R. Kuhn, and P. Sollich. Theory of Neural Information Processing Systems. Oxford University Press, Inc., New York, NY, USA, 2005. ISBN 0198530242.

[16] Neil R. Carlson. Foundations of Physiological Psychology. Pearson Education, 2005. ISBN 0-205-427235.

[17] James A. Freeman and David M. Skapura. Neural networks: algorithms, applications, and programming techniques. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1991. ISBN 0-201-51376-5.

[18] R. Mayrhofer. A new approach to a fast simulation of spiking neural networks. Master's thesis, Johannes Kepler University of Linz, Austria, July 2002.

[19] Raul Rojas. Neural networks: a systematic introduction. Springer-Verlag New York, Inc., New York, NY, USA, 1996. ISBN 3-540-60505-3.

[20] Daniel R. Kunkle and Chadd Merrigan. Pulsed neural networks and their application. Technical report, Rochester Institute of Technology, Rochester, NY, 2002.

[21] William W. Lytton. From computer to brain: foundations of computational neuroscience. Springer, New York, 2002. ISBN 0387955283; 0387955267.

[22] Alexander I. Galushkin. Neural Network Theory. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007. ISBN 3540481249.

[23] Amit Konar. Artificial Intelligence and Softcomputing. CRC Press, 1999. ISBN 0-8493-1385-6.


[24] Helene Paugam-Moisy. Spiking Neuron Networks: A survey. Technical Report IDIAP-RR 06 11, IDIAP Research Institute, Martigny, Switzerland, February 2006.

[25] Wulfram Gerstner and Werner Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, New York, NY, USA, 2002. ISBN 0521890799.

[26] Eric R. Kandel, James H. Schwartz, and Thomas M. Jessel. Principles of Neural Science. Mc Graw Hill, 2000. ISBN 9780838577011.

[27] E. M. Izhikevich. Which model to use for cortical spiking neurons? Neural Networks, IEEE Transactions on, 15(5):1063–1070, 2004.

[28] Simon Johnston, Girijesh Prasad, Liam P. Maguire, and T. Martin McGinnity. Comparative investigation into classical and spiking neuron implementations on FPGAs. In Wlodzislaw Duch, Janusz Kacprzyk, Erkki Oja, and Slawomir Zadrozny, editors, ICANN (1), volume 3696 of Lecture Notes in Computer Science, pages 269–274. Springer, 2005. ISBN 3-540-28752-3.

[29] Sander M. Bohte. Spiking Neural Networks. PhD thesis, Universiteit Leiden, NL, 2003.

[30] Ben Krose and Patrick van der Smagt. An introduction to neural networks. URL ftp://ftp.informatik.uni-freiburg.de/papers/neuro/ann intro smag.ps.gz, The University of Amsterdam, 1996. URL citeseer.ist.psu.edu/ose96introduction.html.

[31] Andreej Kasinski and Filip Ponulak. Comparison of supervised learning methods for spike time coding in spiking neural networks. International Journal of Applied Mathematics and Computer Sciences, 16(1):101–113, 2006.

[32] Phill Rowcliffe, Jianfeng Feng, and Hilary Buxton. Spiking perceptrons. IEEE Transactions on Neural Networks, 17(3):803–807, 2006.

[33] Ammar Belatreche, Liam P. Maguire, and Martin McGinnity. Advances in design and application of spiking neural networks. Soft Comput., 11(3):239–248, 2006. ISSN 1432-7643.

[34] N. G. Pavlidis, D. K. Tasoulis, V. P. Plagianakos, G. Nikiforidis, and M. N. Vrahatis. Spiking neural network training using evolutionary algorithms. In Proc. of the International Joint Conference on Neural Networks (IJCNN05), Poznan University of Technology, Institute of Control and Information Engineering, pages 2190–2194, 2005.


[35] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. In Neurocomputing: foundations of research, pages 696–699. MIT Press, Cambridge, MA, USA, 1988. ISBN 0-262-01097-6.

[36] S. M. Bohte, J. A. La Poutre, and Joost N. Kok. Spike-prop: error-backpropagation in multi-layer networks of spiking neurons. In M. Verleysen, editor, Proceedings of the European Symposium on Artificial Neural Networks ESANN'2000, pages 419–425. D-Facto, 2000.

[37] Peter Tino and Ashely J. S. Mills. Learning beyond finite memory in recurrent networks of spiking neurons. Neural Computation, 18(3):591–613, 2006. ISSN 0899-7667.

[38] Benjamin Schrauwen and Jan Van Campenhout. Improving SpikeProp: Enhancements to an error-backpropagation rule for spiking neural networks. In Proceedings of the 15th ProRISC Workshop, 11 2004.

[39] Benjamin Schrauwen and Jan Van Campenhout. Extending SpikeProp. In Jan Van Campenhout, editor, Proceedings of the International Joint Conference on Neural Networks, pages 471–476, Budapest, 7 2004.

[40] Simon Christian Moore. Back-Propagation in Spiking Neural Networks. Master's thesis, University of Bath, United Kingdom, 2002.

[41] Jianguo Xin and Mark J. Embrechts. Supervised learning with spiking neural networks. In Proceedings of the 2001 International Joint Conference on Neural Networks (IJCNN), Washington, USA, July 2001.

[42] Q. X. Wu, T. Martin McGinnity, Liam P. Maguire, Brendan P. Glackin, and Ammar Belatreche. Learning under weight constraints in networks of temporal encoding spiking neurons. Neurocomputing, 69(16-18):1912–1922, 2006.

[43] Altera. FPGAs. http://www.altera.com/products/fpga.html, 2002.

[44] Christophe Bobda. Introduction to Reconfigurable Computing. Springer, 2007. ISBN 978-1-4020-6088-5.

[45] Maya Gokhale and Paul S. Graham. Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Springer, December 2005. ISBN 0387261052.

[46] Xilinx. Virtex II PRO and Virtex II PRO X Platform FPGAs: Complete Data Sheet. Xilinx, November 2007.

[47] Amos R. Omondi and Jagath C. Rajapakse. FPGA Implementations of Neural Networks. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387284850.


[48] Francisco Barat and Rudy Lauwereins. Reconfigurable instruction set processors: A survey. In RSP '00: Proceedings of the 11th IEEE International Workshop on Rapid System Prototyping (RSP 2000), page 168, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0-7695-0668-2.

[49] Ralph Duncan. A survey of parallel computer architectures. Computer, 23(2):5–16, 1990. ISSN 0018-9162.

[50] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to parallel computing: design and analysis of algorithms. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1994. ISBN 0-8053-3170-0.

[51] David A. Patterson and John L. Hennessy. Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008. ISBN 0123744938, 9780123744937.

[52] Steve Kilts. Advanced FPGA Design: Architecture, Implementation, and Optimization. Wiley-IEEE Press, 2007. ISBN 0470054379.

[53] Antony W. Savich, Medhat Moussa, and Shawki Areibi. The impact of arithmetic representation on implementing MLP-BP on FPGAs: A study. IEEE Transactions on Neural Networks, 18(1):240–252, 2007.

[54] Benjamin Schrauwen. Towards Applicable Spiking Neural Networks. PhD thesis, Parallel Information Systems (PARIS) group, Electronics and Information Systems (ELIS) department, Ghent University, 2008.

[55] Benjamin Schrauwen and Jan Van Campenhout. Backpropagation for population-temporal coded spiking neural networks. In Proceedings of the 2006 International Joint Conference on Neural Networks, pages 3463–3470. IEEE, 1 2006.

[56] Eduardo Ros, Eva M. Ortigosa, Rodrigo Agis, Richard R. Carrillo, and Mike Arnold. Real-time computing platform for spiking neurons (RT-Spike). IEEE Transactions on Neural Networks, 17(4):1050–1063, 2006.

[57] E. L. Graas, E. A. Brown, and H. Lee. An FPGA-based approach to high-speed simulation of conductance-based neuron models. Neuroinformatics, 2(4):417–436, December 2004. ISSN 1539-2791.

[58] Daniel Roggen, Stephane Hofmann, Yann Thoma, and Dario Floreano. Hardware spiking neural network with run-time reconfigurable connectivity in an autonomous robot. In EH '03: Proceedings of the 2003 NASA/DoD Conference on Evolvable Hardware, page 199, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1977-6.


[59] Martin J. Pearson, Ian Gilhespy, Kevin N. Gurney, Chris Melhuish, Benjamin Mitchinson, Mokhtar Nibouche, and Anthony G. Pipe. A real-time, FPGA based, biologically plausible neural network processor. In Wlodzislaw Duch, Janusz Kacprzyk, Erkki Oja, and Slawomir Zadrozny, editors, ICANN (2), volume 3697 of Lecture Notes in Computer Science, pages 1021–1026. Springer, 2005. ISBN 3-540-28755-8.

[60] G. Frank, G. Hartmann, A. Jahnke, and M. Schaefer. An accelerator for neural networks with pulse-coded model neurons. IEEE-NN, 10(3):527–538, 1999. URL citeseer.nj.nec.com/frank98accelerator.html.

[61] H. H. Hellmich, M. Geike, P. Griep, P. Mahr, M. Rafanelli, and H. Klar. Emulation engine for spiking neurons and adaptive synaptic weights. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN'05), volume 5, pages 3261–3266, 2005.

[62] Brendan P. Glackin, T. Martin McGinnity, Liam P. Maguire, Qingxiang Wu, and Ammar Belatreche. A novel approach for the implementation of large scale spiking neural networks on FPGA hardware. In IWANN, pages 552–563, 2005.

[63] S. M. Bohte, J. A. La Poutre, and Joost N. Kok. SpikeProp: Error-backpropagation in multi-layer networks of spiking neurons. Neurocomputing, 1-4(48):17–37, November 2002.

[64] S. McKennoch, D. Lui, and L. G. Bushnell. Fast modifications of the SpikeProp algorithm. IEEE World Congress on Computational Intelligence (WCCI), July 2006.

[65] Thomas Natschlaeger and Berthold Ruf. Spatial and temporal pattern analysis via spiking neurons. Network: Computation in Neural Systems, 9(13):319–332, 1998.

[66] Berthold Ruf. Computing and Learning with Spiking Neurons - Theory and Simulations. PhD thesis, Institute of Theoretical Computer Science, Technische Universitat Graz, Austria, 1998.

[67] Mike Farabee. The on-line biology book. World Wide Web electronic publication, 1999. URL http://www.estrellamountain.edu/faculty/farabee/biobk/biobooktoc.html.

[68] Peter K. Kaiser. The joy of visual perception: A web book. World Wide Web electronic publication, 1996. URL http://www.yorku.ca/eye/.

[69] Peter Dayan and L. F. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, 2001.

[70] W. Gerstner. Populations of spiking neurons. In Maass, W. and Bishop, C., editors, MIT Press, Cambridge, 1999.


[71] Kechen Zhang and Terrence J. Sejnowski. Neuronal tuning: To sharpen or broaden? Neural Computation, 11:75–84, 1999.

[72] R. Berredo. A review of spiking neuron models and applications. Master's thesis, Universidade Federal de Minas Gerais, Brazil, 2005.

[73] Celoxica. Handel-C Language Reference Manual, 2003.

[74] Alphadata. ADC-PMC Datasheet, Revision 1.0, 2009.

[75] Alpha Data. ADM-XRC-PRO-Lite (ADM-XPL) Hardware Manual. Alpha Data Parallel Systems Ltd, July 2002.

[76] Alpha Data. ADM-XRC SDK 4.4.1 User Guide (Win32). Alpha Data Parallel Systems Ltd, 2003.

[77] Olaf Booij. Temporal pattern classification using spiking neural networks. Master's thesis, University of Amsterdam, The Netherlands, 2004.

[78] Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.

[79] Olvi L. Mangasarian, W. Nick Street, and William H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43:570–577, 1995.
