An FPGA implementation of neutrino track detection for the IceCube telescope
Master's thesis in Computer Engineering
at Linköping Institute of Technology
by
Carl Wernhoff
LiTH-ISY-EX--10/4174--SE
Linköping 2010
Supervisors: Christian Bohm, Per Olof Hulth, Stockholm University
Examiner: Olle Seger, Linköping Institute of Technology
Linköping, 2010-03-24
Presentation date: 2010-03-10
Publication date (electronic version): 2010-03-24
Department of Electrical Engineering (Institutionen för systemteknik)
URL for electronic version: http://www.ep.liu.se
Title: An FPGA implementation of neutrino track detection for the IceCube telescope
Author: Carl Wernhoff
Keywords: FPGA, IceCube, neutrino, telescope, south pole, trigger, Track Engine
Language: English
Number of pages: 87
Type of publication: Master's thesis (Examensarbete)
ISRN: LiTH-ISY-EX--10/4174--SE
Contents
1 Abstract
2 Terms and abbreviations
3 Introduction
  3.1 Background
    3.1.1 The neutrino elementary particles
    3.1.2 The IceCube telescope
    3.1.3 Triggering and the Data Acquisition System
    3.1.4 The Track Engine
  3.2 Aim
  3.3 Method
    3.3.1 Research and reference material
    3.3.2 FPGA implementation
4 Analysis
  4.1 The need of the Track Engine
    4.1.1 Hits
    4.1.2 Local coincidence hits
    4.1.3 Read-out
    4.1.4 Triggering with the Track Engine
  4.2 Premises
    4.2.1 Interfaces to existing systems
    4.2.2 IceCube Coordinate System
  4.3 Track Engine algorithm
    4.3.1 The idea
    4.3.2 A 2-dimensional example
    4.3.3 The algorithm
  4.4 Performance
    4.4.1 Hit frequency and pair frequency (fh and fp)
    4.4.2 Number of hits and pairs (nh and np) in a time window
    4.4.3 Highest achievable number of hits per time window
  4.5 Placing pairs in angle bins
    4.5.1 Simple binning
    4.5.2 The binning problem
    4.5.3 Suggested binning
    4.5.4 Analytical expressions for the suggested binning
  4.6 Time sorting
  4.7 Implementation options
    4.7.1 Software (PCs)
    4.7.2 Logic (FPGA)
    4.7.3 Comments
  4.8 Borderline events
5 The implementation
  5.1 Full-TE system overview
  5.2 Introduction to FPGAs
  5.3 Hardware used
    5.3.1 The FPGA
    5.3.2 The embedded processor
    5.3.3 The evaluation board
  5.4 Communication within TE, data rates
    5.4.1 Currently used encoding of hits in IceCube
    5.4.2 Suggested encodings, resulting data rates
  5.5 The VHDL code
    5.5.1 Code amount
    5.5.2 VHDL code conventions used
  5.6 Internal representations
    5.6.1 DOM IDs
    5.6.2 Time stamps
    5.6.3 Position coordinates, lengths
  5.7 TE core and its units
    5.7.1 Input and output ports, other signal types
    5.7.2 System overview
    5.7.3 The preBuffer unit
    5.7.4 The pairProd unit
    5.7.5 The speedCrit unit
    5.7.6 The angles unit
    5.7.7 The preHistBuffer unit
    5.7.8 The histStage unit
  5.8 Comments on the size of TWbuffer
    5.8.1 Motivation
    5.8.2 Consequences of larger TWbuffer
    5.8.3 Behavior when TWbuffer is full
  5.9 BlockRAM usage
  5.10 Synthesis
    5.10.1 Synthesis options
    5.10.2 Area, logic utilization
  5.11 Testing with “TEtest”
    5.11.1 Testing of the histogram unit
    5.11.2 Testing the whole TE core design (system unit)
    5.11.3 “Problematic track” problem
6 Results and suggestions
  6.1 Realizing the TE algorithm in hardware
  6.2 Realizing the TE algorithm in software
  6.3 Interfacing the TE to existing systems
  6.4 Possible improvements of the TE
    6.4.1 String geometry and coordinate system
    6.4.2 Rectangular-shaped bins
  6.5 Borderline events-problem
7 References
Chapter 1
Abstract
The IceCube telescope is built within the ice at the geographical South Pole, in the middle of the Antarctic continent. The purpose of the telescope is to detect muon neutrinos, the muon neutrino being an elementary particle with minuscule mass coming from space.
The detector consists of some 5000 DOMs registering photon hits (light). A muon neutrino traveling through the detector might give rise to a track of photons making up a straight line, and by analyzing the hit output of the DOMs, looking for tracks, neutrinos and their direction can be detected.
When processing the output, triggers are used. Triggers are computationally efficient algorithms used to tell whether the hits seem to make up a track. If that is the case, all hits are processed more carefully to find the direction and other properties of the track.
The Track Engine is an additional trigger, specialized to trigger on low-energy events (few track hits), which are particularly difficult to detect. Low-energy events are of special interest in the search for Dark Matter.
An algorithm for triggering on low-energy events has been suggested. Its main idea is to divide time into overlapping time windows, find all possible pairs of hits in each time window, calculate the spherical coordinates θ and ϕ of the position vectors of the hits of the pairs, histogram the angles, and look for peaks in the resulting 2d-histogram. Such peaks would indicate a straight line of hits, and, hence, a track.
It is not believed that a software implementation of the algorithm would be fast enough. The aim of the Master's Thesis project has been to develop an FPGA implementation of the algorithm.
Such an FPGA implementation has been developed. Extensive tests on the design have yielded positive results showing that it is fully functional. The design can be synthesized to about 180 MHz, making it possible to handle an incoming hit rate of about 6 MHz, which gives a margin of more than a factor of two over the expected average hit rate of 2.6 MHz.
Chapter 2
Terms and abbreviations
ALU: Arithmetic Logic Unit, the unit of a CPU performing binary arithmetic and logical operations.
Bin: A “box” on the unit sphere. The unit sphere is divided into a couple of hundred bins.
BlockRAM or BRAM: Resources of RAM memory (volatile working memory) within the FPGA chip, available for the logic to use.
CPU: Central Processing Unit, common term for a processor.
DAQ: Data Acquisition System, a fundamental part of the IceCube detector, which, whenever a trigger has triggered, analyzes the hits that occurred within the time window triggered on.
DOM: Digital Optical Module, the photon-detecting units of the detector, which along with cabling make up the strings. The DOMs consist of a PMT and electronics to handle data acquisition and communications with the systems on the surface.
Event: Common name for hits caused by physics events such as traveling particles, as opposed to hits caused by PMT noise.
FIFO buffer: First In, First Out buffer, a buffer storing elements and outputting them in the order in which they were input.
FPGA: Field Programmable Gate Array, an integrated circuit that is configured by the designer after manufacturing to implement digital circuits.
ICL: IceCube Laboratory, a building at the surface above the IceCube detector housing hardware and computer equipment for communicating with the DOMs of the detector and for data analysis.
LC, Local Coincidence: A possible attribute of a hit, meaning that the DOM above or below the DOM registering the hit also registered a hit simultaneously.
LUT: Look-Up Table, a data structure used to replace calculations with look-ups of pre-calculated values.
Neutrino: Common name for the elementary particles electron neutrino, muon neutrino and tau neutrino. Neutrinos are electrically neutral, are able to pass through ordinary matter, travel at almost the speed of light and have a minuscule (but non-zero) mass.
PMT: Photomultiplier Tube, a very sensitive detector of light, able to detect individual photons. Each DOM in the detector contains a PMT.
RAM: Random Access Memory, volatile working memory.
ROM: Read-Only Memory, a memory that can only be read and not written to. For ROM in FPGAs, BlockRAM resources are used.
String: The vertical structures of cabling and DOMs making up the detector.
VHDL: VHSIC Hardware Description Language (VHSIC is an acronym for Very High Speed Integrated Circuit), a language for describing digital logic.
Chapter 3
Introduction
3.1 Background
IceCube is a neutrino telescope being built in Antarctica. The IceCube project is a large international collaboration of researchers and universities. In Sweden, the physics department at Stockholm University and a research group at Uppsala University are involved in the project.
The project is mainly funded by the National Science Foundation (US), but is also partly funded by the Knut and Alice Wallenberg Foundation and the Swedish Polar Research Secretariat.
3.1.1 The neutrino elementary particles
The neutrino elementary particles are created in various nuclear reactions, such as those that take place in the sun and in other parts of space. Their mass is close to zero and they travel at approximately the speed of light.
There are three kinds of neutrinos: electron, muon and tau neutrinos, each of them also having an antiparticle. IceCube is most sensitive to the muon neutrinos.
Neutrinos seldom interact with other particles. This property makes them hard to detect, but it also makes them particularly interesting, since their travel through space is not affected by magnetic fields and since they even travel through matter. Every second, billions of neutrinos pass through the human body.
By learning more about neutrinos, their origins and attributes, our knowledge of the universe and of such things as supernova explosions, gamma-ray bursts and black holes grows. We might also be given clues to what Dark Matter is, and what its properties are.
Figure 3.1: The Amundsen-Scott South Pole Station. (Photo from the IceCube Collaboration.)
Figure 3.2: The IceCube telescope with an illustrated track. Colored bulbs represent detected photon hits. (Graphics from the IceCube Collaboration.)
3.1.2 The IceCube telescope
IceCube is a neutrino telescope being built within the ice close to the geographical South Pole in Antarctica. An American research station, the Amundsen-Scott South Pole Station (Figure 3.1), is situated right beside the detector, and it contains the base camp for building and maintaining the detector.
In the (very rare) event of a muon neutrino colliding with a proton or a neutron in a water molecule, a muon is produced, and that muon will have roughly¹ the same direction as the neutrino had. When traveling through ice, the muon will radiate a so-called Cherenkov cone of photons. It is these photons that are detected in IceCube. Since they make up a track, the direction of the track can be found. Figure 3.2 shows the detector during a muon track event.
The photons are detected by photomultiplier tubes (PMTs), which are built into units called Digital Optical Modules (DOMs) (Figure 3.3). The detector will consist of 5200 DOMs when it is completed, spread out in a volume of 1 km³ of ice, at depths of 1500 to 2500 meters below the surface.
¹ Within about 1°.
Figure 3.3: A DOM just before deployment. (Photo from the IceCube Collaboration.)
Figure 3.4: The ICL. (Photo from the IceCube Collaboration.)
DOMs and their cabling make up vertical strings. On a string, the DOMs are spaced 17 meters² apart (vertically), and the strings are spaced about 125 meters apart (laterally).
Currently, a large part of the strings have been deployed in the ice, with the last strings to be deployed in the winter season 2010/2011. The detector is already running with the deployed DOMs, which are producing hits that give scientific data.
3.1.3 Triggering and the Data Acquisition System³
Hit data is collected by the Data Acquisition System (DAQ), housed in the IceCube Laboratory (ICL) (Figure 3.4), a building at the surface. The DAQ is implemented in software running on a large farm of industrial PCs. Neutrinos are looked for by searching for photon hits that seem to build up straight tracks.
A full analysis of all hits in the detector is impossible to perform continuously in real time due to the large number of hits. Software components known as triggers are used to make a rough estimate of whether the hits in the detector, at any given moment, might be of interest or not. Only if they seem to be are all hits read out from the detector and analyzed.
The current triggers only look at a few percent of the hits, namely those of highest quality⁴. The main triggering strategy is currently to look at the multiplicity (number) of such hits. Only if the multiplicity is above a defined threshold does the detector trigger, after which an analysis of all hits within this time window is performed.
² Six so-called Deep Core strings in the middle of the detector, not yet deployed, will have less vertical spacing for increased resolution.
³ A more detailed discussion of these issues and of the need of a Track Engine can be found in Section 4.1.
Many muons traversing the detector are not caused by neutrinos. There are two ways of filtering out those tracks: first, one can regard only the upward-going tracks⁵, and, second, one can regard only the tracks that start in the detector.
When the detector is completed, it is expected to find some 50 000 neutrinos each year.
3.1.4 The Track Engine
Triggering with the strategy of applying a multiplicity condition to the number of highest-quality hits is simple and can be performed continuously in real time, which is necessary. However, the existing triggers are typically not very good at triggering on dim tracks, that is, tracks with few photon hits, caused by low-energy muons.
The suggested Track Engine is a trigger that will take all hits into account, and that will not only consider the multiplicity, but also analyze whether the hits seem to make up a straight track. The challenge is to be able to perform such calculations in real time for the very large flow of hit data at hand.
A rather simple algorithm has been suggested, but it is believed that a software implementation would still be too slow to run continuously in real time, at least on a single PC. Therefore, the Track Engine will either have to be implemented in hardware, or some distributed solution using several PCs running software will have to be used.
3.2 Aim
The Master's Thesis project has had one primary and two secondary aims. The primary aim has been to
• make an FPGA design that implements as large a part as possible of the Track Engine,
and the secondary aims, to
• evaluate both options of realizing the Track Engine, in software (PCs) or in hardware (FPGA)
• investigate how the Track Engine should interface to the existing systems of the IceCube telescope.
⁴ Those hits being the local coincidence hits. This is further described in Section 4.1.
⁵ Neutrinos are the only particles able to travel through the whole earth; hence, upward-going muons must be caused by neutrinos.
3.3 Method
3.3.1 Research and reference material
There was no complete documentation of, e.g., the DAQ. Although such systems are in use and have been for some time, they are still under development, which is one reason why detailed documentation is difficult to find. For details, the developers were contacted personally.
3.3.2 FPGA implementation
The design is coded in VHDL, a hardware description language (HDL). The ModelSim software has been used for simulation and Xilinx's ISE has been used for synthesis.
For the VHDL coding, the design flow used in the beginning was typically
Coding → Test bench coding → Compiling for simulation → Running simulation → Synthesis
but as the work evolved, a work flow of
Coding → Compiling for simulation → Synthesis → Test bench coding → Compiling for simulation → Running simulation
appeared to be more convenient. Synthesizing before simulating makes sense since the timing performance and resource usage results tell whether the chosen design strategy is at all possible, whereas simulation runs typically reveal errors in details of the design. However, before synthesis, the code was compiled in the simulation tool to reveal syntactical errors.
Chapter 4
Analysis
4.1 The need of the Track Engine
A more detailed discussion on the need of the Track Engine follows.
4.1.1 Hits
When a PMT in a DOM detects a photon, it is said that there is a hit in the DOM. When a hit occurs, the DOM stores both the waveform of the PMT signal and the exact time when the hit occurred. The hit is stored in a buffer so that it can be sent up to the surface later.
The PMTs are said to have a noise rate of 500 Hz, meaning that each PMT, and hence each DOM in the ice, registers a hit about 500 times per second. Only a small fraction of the hits correspond to photons originating from neutrino events.
With 5200 DOMs, a noise rate of 500 Hz for each DOM and a waveform stored for each hit, the total raw data rate produced in the detector is very large.
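As a quick check of scale, the aggregate hit rate follows directly from the two figures given above. The sketch below only multiplies them; the function name is illustrative, and the byte rate is not computed since the size of a stored waveform is not stated here. The result, 2.6 MHz, agrees with the expected average hit rate quoted in the abstract.

```python
# Figures taken from the text above.
N_DOMS = 5200          # DOMs in the completed detector
NOISE_RATE_HZ = 500    # approximate hit rate of a single DOM


def total_hit_rate_hz(n_doms=N_DOMS, rate_per_dom_hz=NOISE_RATE_HZ):
    """Aggregate hit rate of the whole detector, in hits per second."""
    return n_doms * rate_per_dom_hz


print(f"{total_hit_rate_hz() / 1e6:.1f} MHz")  # prints "2.6 MHz"
```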
4.1.2 Local coincidence hits
A simple and rough measure of the quality of a hit is the local coincidence (LC). A hit is an LC hit if there is another simultaneous¹ hit in any of the neighboring DOMs, neighboring meaning the nearest DOM up or down along the string². It follows that LC hits appear in pairs.
If a hit is an LC hit, it is much more likely to be the result of a physics event of interest than of random noise. About 2% of the hits in the telescope are LC hits.
¹ “Simultaneous” being defined as a limit on the difference of the times.
² There is support for loosening this condition by considering DOMs further away (e.g. two DOMs up or down the string) to be neighboring DOMs.
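The LC condition can be illustrated with a small sketch. It is not the logic used in the DOMs; the `Hit` record, the DOM numbering along the string, the coincidence limit `window_ns` and the neighbor distance `span` are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    string: int      # string number
    position: int    # DOM position along the string (hypothetical numbering)
    time_ns: int     # time stamp in nanoseconds


def tag_local_coincidence(hits, window_ns=1000, span=1):
    """Return one boolean per hit, True where the hit is an LC hit.

    window_ns (the limit on the time difference) and span (how many DOMs
    up or down still count as neighbors) are illustrative assumptions.
    """
    flags = [False] * len(hits)
    for i, a in enumerate(hits):
        for j in range(i + 1, len(hits)):
            b = hits[j]
            neighbors = (a.string == b.string
                         and 1 <= abs(a.position - b.position) <= span)
            if neighbors and abs(a.time_ns - b.time_ns) <= window_ns:
                # Both hits of a coincident pair are tagged, which is why
                # LC hits always appear in pairs.
                flags[i] = flags[j] = True
    return flags
```

Calling the function with `span=2` corresponds to the loosened condition mentioned in the footnote, where DOMs two positions apart also count as neighbors.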
4.1.3 Read-out
When a trigger triggers, it issues a trigger signal, causing the DAQ to perform a read-out of all hits (including non-LC hits) in the detector within the time window triggered on. The hits are then analyzed in order to find out whether they make up a track and, if so, what its direction and possibly other properties are.
Data to be stored is written to tape, flown to the McMurdo base at the coast of the continent, shipped to the US and finally made accessible at the University of Wisconsin. Part of the data is sent to the University of Wisconsin over a satellite link directly from the Amundsen-Scott South Pole Station.
4.1.4 Triggering with the Track Engine
As mentioned, the existing triggers are likely to miss dim tracks caused by low-energy muons.
The Track Engine is an additional trigger suggested by prof. Dave Nygren, Berkeley Laboratories (a guest researcher at Stockholm University during 2008).
With a simple algorithmic approach, it should be possible to trigger also on the very dim tracks. This will make it possible to detect neutrinos of much lower energies than is otherwise possible. Finding those neutrinos is especially interesting in the search for Dark Matter.
4.2 Premises
4.2.1 Interfaces to existing systems
In the ICL, computers known as stringHubs communicate with the DOMs. The stringHubs are industrial PCs equipped with a number of so-called DOR (DOM Read-out) cards that handle the low-level communication with the DOMs.
Apart from acting as the interface to the DOMs, the stringHubs also have the important task of recalculating the time stamps of all hits from DOM time to IceCube time.
When detecting a hit, the DOM tags the hit with a time stamp according to its internal clock. The clocks within the DOMs will, however, be slightly out of phase, and the stringHubs therefore have a system of querying the DOMs about their internal time. By keeping track of each DOM's offset from the global time used, called the IceCube time, the stringHub can correct the time stamps it receives. This process is known as the time transformation.
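The principle of the time transformation can be sketched as a per-DOM offset table. This is only an illustration of the idea described above and assumes a constant offset per DOM; the actual stringHub procedure is not documented here and may well account for further effects. The class and method names are hypothetical.

```python
class OffsetTable:
    """Per-DOM clock offsets, assuming a constant offset for each DOM."""

    def __init__(self):
        self._offsets = {}

    def update(self, dom_id, dom_clock_reading, icecube_time_at_reading):
        # Called whenever a DOM has been queried about its internal time.
        self._offsets[dom_id] = icecube_time_at_reading - dom_clock_reading

    def transform(self, dom_id, dom_time_stamp):
        # Convert a time stamp from DOM time to IceCube time.
        return dom_time_stamp + self._offsets[dom_id]


table = OffsetTable()
table.update("dom-A", dom_clock_reading=1000, icecube_time_at_reading=1500)
print(table.transform("dom-A", 1200))  # prints 1700
```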
The DAQ is the system used for analyzing hits and storing the result. An overview of the detector, the stringHubs and the DAQ can be found in Figure 4.1.
Figure 4.1: The detector, stringHubs and DAQ. Current layout (above) and with the TE added (below). (Labels in the diagrams: IceCube (DOMs in the ice), stringHub, existing triggers, Track Engine, Data Acquisition System, ICL; signals: all hits, only LC hits, trigger signal.)
Figure 4.2: The definition of zenith (θ) and azimuth (ϕ)
From the above discussion it is found that the Track Engine will have to interface to the stringHubs, in order to receive the hits, and to the DAQ, to which it should provide the trigger signal. The trigger signal should contain the time window of interest, within which the hits will be read out.
The stringHub computers have an unused Ethernet port through which the Track Engine could receive hits.
The DAQ is typically interfaced to by Ethernet as well.
4.2.2 IceCube Coordinate System
A Cartesian coordinate system known as the IceCube Coordinate System is used within the IceCube collaboration. The coordinate system is most important when it comes to the exact location of each DOM.
The coordinate system has its origin in the middle of the detector, with the positive z-axis pointing up.
The zenith (θ) and azimuth (ϕ) angles are defined according to Figure 4.2. These definitions are the same as for θ and ϕ in (a common definition of) the spherical coordinate system. The intervals used for the angles are:
θ ∈ [0, π[
ϕ ∈ [0, 2π[
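A minimal sketch of these definitions, assuming the common spherical convention in which θ is measured from the positive z-axis and ϕ from the positive x-axis towards the positive y-axis. The function name is illustrative, and the code is floating-point Python, not the arithmetic used in the FPGA design. Note that a vector pointing straight down yields θ = π, which lies just outside the half-open interval above; the sketch does not treat that case specially.

```python
import math


def zenith_azimuth(dx, dy, dz):
    """Return (theta, phi) in radians for the direction vector (dx, dy, dz)."""
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        raise ValueError("a zero-length vector has no direction")
    # Clamp to guard against rounding pushing the ratio outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, dz / r)))
    # atan2 returns values in [-pi, pi]; map them to [0, 2*pi).
    phi = math.atan2(dy, dx) % (2.0 * math.pi)
    if phi >= 2.0 * math.pi:   # possible after rounding of tiny negative angles
        phi = 0.0
    return theta, phi
```

For example, a vector along the negative y-axis gives θ = π/2 and ϕ = 3π/2.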
4.3 Track Engine algorithm
The Track Engine algorithm was suggested by prof. David Nygren. Its aim is to determine whether the hits in the detector during some time window are random noise or whether they seem to make up a straight line, that is, a track originating from a muon.
Figure 4.3: Time windows. (Five overlapping time windows, numbered 1 to 5, drawn along a time axis marked from 1 to 8 µs.)
4.3.1 The idea
The maximum lifetime of a track within the detector depends on the velocity of the particle and the size of the detector. A time window of width
Tw = 5 µs
has been suggested. The time window is to be slid forward every 1 µs. This leads to a division of the hits into (overlapping) time windows according to Figure 4.3.
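The division into overlapping windows can be sketched as follows. The window width and step are taken from the text; that windows start at t = 0, that they are half-open intervals and that time stamps are non-negative are assumptions of the sketch, as is the function name.

```python
from bisect import bisect_left

TW_US = 5      # window width Tw, in microseconds
STEP_US = 1    # the window is slid forward by this much each time


def time_windows(hit_times_us, width_us=TW_US, step_us=STEP_US):
    """Return a list of (window_start, hits_in_window).

    Window k is assumed to cover the half-open interval
    [k * step_us, k * step_us + width_us).
    """
    times = sorted(hit_times_us)
    windows = []
    k = 0
    while times and k * step_us <= times[-1]:
        start = k * step_us
        lo = bisect_left(times, start)
        hi = bisect_left(times, start + width_us)
        windows.append((start, times[lo:hi]))
        k += 1
    return windows
```

With these parameters, each hit ends up in up to Tw / step = 5 different time windows, which is what makes it possible to catch a track regardless of where it falls relative to the window boundaries.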
For each time window, all possible pairs of hits are studied. All such pairs are considered candidates for belonging to a track.
With only noise hits, the expected number of hits in a time window, nh,tw, is on average nh,tw = fhTw = 13, where fh = 2.6 MHz is the total hit rate of the detector (5200 DOMs at about 500 Hz each).
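The arithmetic behind this figure can be checked with a few lines. Integer units are used to avoid rounding. The number of unordered pairs, n(n − 1)/2, is added here only as an illustration of how quickly the pair count grows for a window holding exactly the average number of hits; it is not a figure quoted above.

```python
from math import comb

F_H_HZ = 2_600_000   # total hit rate fh: 5200 DOMs at about 500 Hz each
T_W_US = 5           # time window width Tw, in microseconds

# fh * Tw, with Tw converted from microseconds to seconds.
hits_per_window = F_H_HZ * T_W_US // 1_000_000
pairs_per_window = comb(hits_per_window, 2)   # unordered pairs of hits

print(hits_per_window)    # prints 13
print(pairs_per_window)   # prints 78
```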
Some pairs can be immediately excluded on the basis of the geometrical distance between the hits (that is, between the DOMs that detected them), the difference of the time stamps and the expected speed of the particle. If these quantities do not make sense together, the pair is thrown away. Currently, the limits [0.3c, 1.0c] are used for the implied speed v, c being the speed of light. Although the muon travels at approximately the speed of light, the spread of the Cherenkov cone and possible refraction in the ice before the photon reaches the PMT make the lower limit of 0.3c reasonable.
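A minimal sketch of this speed criterion, assuming DOM positions in meters and time stamps in nanoseconds. The function name and the choice to reject pairs with identical time stamps are assumptions of the sketch. The distance is compared against v·∆t rather than dividing by ∆t; this is a choice made here for simplicity and is not claimed to be how the design handles it.

```python
import math

C_M_PER_NS = 0.299792458   # speed of light in vacuum, meters per nanosecond


def passes_speed_criterion(pos_a, pos_b, t_a_ns, t_b_ns, v_min=0.3, v_max=1.0):
    """True if the implied speed l / dt lies within [v_min * c, v_max * c]."""
    distance = math.dist(pos_a, pos_b)   # geometrical distance l
    dt = abs(t_a_ns - t_b_ns)            # time difference
    if dt == 0:
        return False                     # implied speed undefined (assumption)
    return v_min * C_M_PER_NS * dt <= distance <= v_max * C_M_PER_NS * dt
```

For two DOMs 17 m apart, a time difference of 100 ns passes the test, whereas 10 ns (implied speed above c) and 1000 ns (implied speed below 0.3c) do not.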
For the remaining pairs, it needs to be determined whether they seem to make up a track or not.
For each pair, the zenith (θ) and azimuth (ϕ) angles are calculated. Now, if all the hits are random noise, no θ or ϕ angle should be more frequent than any other. However, if some of the hits are hits within a track, the θ and ϕ of the track should be more common than the other angles. Histogramming the angles is a good way of seeing how they are distributed. This idea is shown in an example in the section below.
A 2-dimensional histogram over θ and ϕ is created. θ and ϕ are discretized, leading to a number of so-called bins. Figure 4.4 shows what such a histogram might look like³. Building the 2d-histogram is the same as keeping track of the number of pairs for each of the bins.
Figure 4.4: θ and ϕ angles histogrammed in a 2-dimensional histogram (axes: theta [deg], phi [deg] and multiplicity). A peak indicating a track can be seen at (θ, ϕ) ≈ (130°, 250°).
For all pairs, their bin is determined, and when a complete time window of pairs has been processed, there is for each bin information on how many pairs were placed in that bin. One additional piece of information is also necessary for each bin: whether at least one of the pairs placed in the bin had at least one LC-tagged hit.
When the whole time window of pairs has been processed, all bins with at least nhist.thresh.⁴ pairs, of which at least one pair has an LC-tagged hit, are considered to correspond to a detected track.
There may be none, one or several detected tracks in a time window. If there was at least one, a trigger signal is sent to the DAQ for that time window (see further details below).
³ In reality, the binning is not as indicated in the figure. The binning actually used is described in Section 4.5.
⁴ A typical value for nhist.thresh. would be around 5.
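The bookkeeping described above, a pair count and an LC flag per bin followed by a threshold test, can be sketched as below. The input is assumed to be one `(bin_number, has_lc)` tuple per pair that survived the speed test; how a pair is mapped to a bin number is left out, since the binning is treated in Section 4.5. The function name and the default threshold of 5 (the typical value mentioned in the footnote) are illustrative.

```python
from collections import defaultdict


def detect_tracks(binned_pairs, threshold=5):
    """Return {bin_number: pair_count} for bins counted as detected tracks.

    binned_pairs: iterable of (bin_number, has_lc) tuples, where has_lc is
    True if at least one hit of the pair is LC-tagged.
    """
    counts = defaultdict(int)      # number of pairs placed in each bin
    lc_seen = defaultdict(bool)    # whether any pair in the bin had an LC hit
    for bin_number, has_lc in binned_pairs:
        counts[bin_number] += 1
        lc_seen[bin_number] = lc_seen[bin_number] or has_lc
    return {b: n for b, n in counts.items()
            if n >= threshold and lc_seen[b]}
```

A time window would cause a trigger signal exactly when the returned dictionary is non-empty.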
4.3.2 A 2-dimensional example
The idea of the TE algorithm is shown in a 2-dimensional example in Figure 4.5.
The direction of a possible track is called ϕ, defined as the angle between the track and a horizontal line.⁵
With only noise, no significant peaks appear in the histogram. With a (weak) track signature, as in the middle graph, a peak around 25° appears in the histogram. More than one track signature in each time window can also be detected, as shown in the rightmost graph.
For the graphs, 13 noise hits and 13 hits for each track signature were used, the number of noise hits thus corresponding to nh,tw.
The track signatures used were:
l1 : y = 0.5x + 0.2
l2 : y = −0.7x + 1
Further, we have
ϕ1 = arctan(0.5) ≈ 27°
ϕ2 = arctan(−0.7) + 180° ≈ 145°
which corresponds to the peaks in the histograms.
Note: The detector is, very roughly, 1 km × 1 km × 1 km. The unit square in Figure 4.5 can be thought of as being 1 km × 1 km. The standard deviation used when spreading the track hits around the ideal lines l1 and l2 was σ = 0.05. This corresponds to 0.05 km = 50 m, which is comparable to the optical absorption length in the ice, λabs ≈ 100 m, and to the effective scattering length of photons in the ice, λeff ≈ 25 m.⁶
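The 2-dimensional example can be reproduced in spirit with a short sketch: noise points are scattered in the unit square, track points are scattered around the line l1, the angle of every pair is computed and the angles are histogrammed. The bin width of 10°, the random seed, the function names and the choice to apply the spread σ in the y direction only are assumptions of the sketch; the exact procedure behind Figure 4.5 is not given here.

```python
import math
import random
from itertools import combinations


def pair_angle_deg(p, q):
    """Angle in [0, 180) between the line through p and q and the horizontal."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0


def angle_histogram(points, bin_width_deg=10):
    """Histogram of the angles of all point pairs, as a list of bin counts."""
    n_bins = 180 // bin_width_deg
    counts = [0] * n_bins
    for p, q in combinations(points, 2):
        # The final modulo guards against an angle rounding up to 180.0.
        counts[int(pair_angle_deg(p, q) // bin_width_deg) % n_bins] += 1
    return counts


def sample_points(n_noise=13, n_track=13, k=0.5, m=0.2, sigma=0.05, seed=1):
    """Noise hits in the unit square plus hits spread around y = k*x + m."""
    rng = random.Random(seed)
    noise = [(rng.random(), rng.random()) for _ in range(n_noise)]
    track = []
    for _ in range(n_track):
        x = rng.random()
        track.append((x, k * x + m + rng.gauss(0.0, sigma)))
    return noise + track


counts = angle_histogram(sample_points())
peak_bin = max(range(len(counts)), key=counts.__getitem__)
print(f"fullest bin: {peak_bin * 10} to {peak_bin * 10 + 10} degrees")
```

With 26 points there are 325 pairs in total. Points lying exactly on l1 all fall in the bin from 20° to 30°, since arctan(0.5) ≈ 26.6°, so a track signature should show up as an excess in that bin.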
4.3.3 The algorithm
The Track Engine algorithm is applied to all hits within each time window,and, written out point by point, it is as follows:
1. Find all pairwise combinations of the hits.5The interval used for ϕ is [0, π], which means that no information of the direction
of the track, which would require an interval of [0, 2π], is at hand. In the real case, thedirection of the track is important.
6λabs and λeff from “The IceCube Data Acquisition System: Signal Capture, Digiti-zation, and Time-Stamping”, 2008-01-21 (unpublished), p. 6.
[Figure 4.5 panels: scatter plots of y against x over the unit square, titled “noise”, “noise and track signature l1” and “noise and track signatures l1 and l2”, with the corresponding histograms of multiplicity against phi [deg]. Plot parameters: noise samples: 13, line samples: 13, [k1 m1 k2 m2] = 0.5 0.2 -0.7 1, [phi1_deg phi2_deg] = 26.5651 145.008, [stddev1 stddev2] = 0.05 0.07.]
Figure 4.5: The idea of the TE algorithm shown in a 2-dimensional example.
2. For all pairs, find the geometrical distance l between the DOMs for the two hits and the time difference ∆t between the two time stamps. Find the implied velocity v and discard all pairs not conforming to
$$ v := \frac{l}{\Delta t} \in [0.3c,\ 1.0c] , $$
c being the speed of light.
3. Find the zenith angle θ of all pairs.
4. Find the azimuth angle ϕ of all pairs.
5. Find the bin of the (θ, ϕ) combination for all pairs.
6. Build a histogram over the bins for all pairs in the time window or, equivalently, for each bin, count how many pairs were assigned to that bin.
7. For all bins containing at least nhist.thresh. pairs, of which at least one pair has one LC-tagged hit, consider this to correspond to a detected track.
8. If there was at least one detected track in the time window, send out a trigger signal from the Track Engine.
The trigger signal is a packet of data containing:
• The start and end times of the time window
• The total number of pairs in the time window
• For all detected tracks (but maximum 10):
– The bin number (corresponding to intervals for θ and ϕ)
– The number of pairs in the bin
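The steps above can be sketched in Python for a single time window. This is a hedged illustration only: the function name, the hit tuple layout, the uniform (θ, ϕ) grid and its size are assumptions made for the example and do not describe the FPGA implementation, which uses the binning of Section 4.5.

```python
import math
from collections import Counter
from itertools import combinations

C = 0.299792458  # speed of light in m/ns

def detect_tracks(hits, n_theta=8, n_phi=16, hist_thresh=5):
    """hits: list of (t_ns, (x, y, z) in m, lc_flag), all within one time window.
    Returns {bin: pair_count} for the bins that qualify as detected tracks."""
    counts = Counter()
    has_lc = set()
    for a, b in combinations(sorted(hits), 2):       # step 1, earlier hit first
        (t1, p1, lc1), (t2, p2, lc2) = a, b
        dt = t2 - t1
        if dt <= 0:
            continue
        d = [q - p for p, q in zip(p1, p2)]
        l = math.sqrt(sum(c * c for c in d))
        if not 0.3 * C <= l / dt <= 1.0 * C:         # step 2, velocity criterion
            continue
        theta = math.acos(d[2] / l)                  # step 3, zenith angle
        phi = math.atan2(d[1], d[0]) % (2 * math.pi) # step 4, azimuth angle
        b_theta = min(int(theta / math.pi * n_theta), n_theta - 1)
        b_phi = min(int(phi / (2 * math.pi) * n_phi), n_phi - 1)
        key = (b_theta, b_phi)                       # step 5, assumed uniform grid
        counts[key] += 1                             # step 6, histogram
        if lc1 or lc2:
            has_lc.add(key)
    # Step 7: threshold on the pair count, and at least one pair with an LC hit.
    return {k: n for k, n in counts.items() if n >= hist_thresh and k in has_lc}
```

A non-empty return value corresponds to step 8, the case in which a trigger signal would be sent out.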
4.4 Performance
4.4.1 Hit frequency and pair frequency (fh and fp)
From the algorithm we can see that the Track Engine calculations are centered around pairs. We want to determine what rate of pairs a realization of the algorithm will have to be able to handle.
Call the rate of pairs fp. Call the rate of hits (earlier called the noise rate) fh and find fp from fh using Tw.
Consider the time window as a buffer always containing all hits that are not too old. The pairs are produced by pairing each incoming hit with all the hits in the time window. The time window contains fhTw hits. The number of hits entering every second to be paired with those hits is fh. Hence we have
$$ f_p = f_h^2 T_w . \qquad (4.1) $$
For example, with some reasonable numerical values,
fh = 5200 DOMs · 500 Hz/DOM = 2.6 MHz
Tw = 5 µs
we get
fp ≈ 30 MHz ,
that is, 30 million pairs each second to process for the expected average noise rate.
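As a quick numerical check of (4.1), the example values can be evaluated directly. The function and variable names below are chosen for this sketch only.

```python
def pair_rate(f_h, t_w):
    """Equation (4.1): pair rate in Hz from hit rate f_h [Hz] and window length t_w [s]."""
    return f_h ** 2 * t_w

f_h = 5200 * 500.0  # 5200 DOMs at 500 Hz each, i.e. 2.6 MHz
t_w = 5e-6          # time window of 5 microseconds
f_p = pair_rate(f_h, t_w)  # about 3.4e7 pairs per second, of the order of 30 MHz
```

Since the pair rate grows with the square of the hit rate, a doubling of the noise rate quadruples the number of pairs to process.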
We also want to determine the highest achievable hit frequency. We hence need the relation between fh and the clock frequency fc. That relation is given by the cycle efficiency ηcyc, explained in Section 5.7.4 on page 44, through the expression
$$ f_p = \eta_{cyc} f_c . \qquad (4.2) $$
Taking (4.2), substituting fp according to (4.1) and ηcyc according to (5.1) (page 44) with nextra = 4 (see footnote 7) in (5.1), and solving for fh yields the (positive) solution
$$ f_h(f_c) = -\frac{n_{extra}}{2T_w} + \sqrt{\frac{n_{extra}^2}{4T_w^2} + \frac{f_c}{T_w}} . \qquad (4.3) $$
(4.3) is plotted in Figure 4.6. Since the design is intended to be clocked at 180 MHz, we can see that roughly fh = 6 MHz will be the highest achievable hit frequency. This is more than twice the expected average hit frequency of 2.6 MHz.
It should be remembered that Expression (4.1) is only valid when TWbuffer is not full, since fhTw (the number of hits in TWbuffer) has a maximum value of 31 (see Section 5.8 on page 62). Also, fhTw in (5.1) has the same maximum value. In other words, the above is only valid for
$$ f_h T_w \le 31 \iff f_h \le \frac{31}{T_w} = 6.2\ \text{MHz} $$
which is true for the larger part of Figure 4.6.
7 The value of 4 for nextra can be understood through the state machine in Figure 5.7 on page 45 (although it is not supposed to be immediately obvious).
[Figure 4.6: plot of hit frequency fh [MHz] against clock frequency fc [MHz], for fc between 120 and 200 MHz.]
Figure 4.6: Achievable hit frequency vs. the clock frequency of the design.
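Expression (4.3) can be evaluated numerically as in the sketch below. The function name is an assumption for illustration. Note that (4.3) is the positive root of the quadratic fh²·Tw + nextra·fh = fc, a form that is inferred here from (4.3) itself, since (5.1) is not reproduced in this section.

```python
import math

def max_hit_rate(f_c, t_w, n_extra=4):
    """Positive solution (4.3): hit rate f_h [Hz] for clock frequency f_c [Hz]."""
    a = n_extra / (2.0 * t_w)
    return -a + math.sqrt(a * a + f_c / t_w)

t_w = 5e-6                          # time window of 5 microseconds
f_h = max_hit_rate(180e6, t_w)      # design clock of 180 MHz
within_buffer_limit = f_h * t_w <= 31  # validity condition on the TWbuffer size
```

Evaluating `max_hit_rate` over a range of clock frequencies gives a curve of the kind shown in Figure 4.6.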
4.4.2 Number of hits and pairs (nh and np) in a time window
We seek the relation between the number of hits and the number of pairs for one single time window. To find such a relation, we must assume a steady hit frequency, that is, we must assume several consecutive time windows with approximately the same number of hits.
We call the number of hits per time window nh. nh = fhTw will be true for a steady fh according to the previous section. A hit is not paired with itself, leading to nh(nh − 1) possible pairs. However, the pairing is only done in “one direction” (“all possible un-ordered pairs are produced”), e.g., for the three hits 1, 2 and 3, if (2,1) is a pair, (1,2) is not another pair, and all pairs would be (2,1), (3,1) and (3,2). From this it follows that, given the number of hits in a time window nh, the number of pairs np for that same time window would be
$$ n_p = \frac{n_h(n_h - 1)}{2} . $$
For a given np, solving the above expression for nh yields the positive solution
$$ n_h = \frac{1}{2} + \sqrt{\frac{1}{4} + 2n_p} . $$
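The two expressions can be checked against each other with a short sketch. The function names are chosen for this example.

```python
import math

def pairs_from_hits(n_h):
    """Number of un-ordered pairs among n_h hits."""
    return n_h * (n_h - 1) // 2

def hits_from_pairs(n_p):
    """Positive solution for the number of hits giving n_p pairs."""
    return 0.5 + math.sqrt(0.25 + 2 * n_p)

three_hit_pairs = pairs_from_hits(3)  # the pairs (2,1), (3,1) and (3,2)
```

For the three-hit example in the text, `pairs_from_hits(3)` gives 3 and `hits_from_pairs(3)` returns 3.0.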
4.4.3 Highest achievable number of hits per time window
We will estimate the number of pairs for each time window. With 1/Tw being the number of time windows per second, we expect
$$ n_{p,tw} = f_p \Big/ \frac{1}{T_w} = 150 , $$
np,tw being the number of pairs per time window. This is true for the expected average pair frequency, fp = 30 MHz. The highest achievable fp in the implementation is an fp close to the clock frequency fc, that is, handling one pair each clock cycle. If this is the case, we will get an np,tw close to fc/(1/Tw) = 900 pairs per time window.
How many hits per time window does this correspond to? This is dependent on the size of TWbuffer, np,twmax. Assume TWbuffer is full and will remain full for a whole set of incoming hits for a time window. Call the number of incoming hits for that time window nh∗, introducing starred variables for the highest achievable number of hits or pairs for a time window. All those hits will be paired with all hits in TWbuffer and this should give 900 pairs:
$$ 900 = n_{p,tw\,max}\, n_h^* \iff n_h^* = 30 $$
for np,twmax = 30, which is one suggested value for that parameter.
All this would mean that the TE is capable of processing np,tw ≈ 900 pairs or nh∗ = 30 hits each time window for Tw = 5 µs.
Having a higher rate of incoming hits than that would mean that the buffers grow. It should be remembered that, when exceeding this limit, more pairs are generated, the number of pairs growing with the square of the number of hits, potentially giving the TE a load of pairs so large that it won’t be able to catch up again. There should be some mechanism making sure that the TE is never reached by more than about 30 hits per time window.
4.5 Placing pairs in angle bins
As has been mentioned before, a bin is a combination of one interval for θ and one interval for ϕ. The pairs are placed in bins in order to determine if some direction is more common than others.
4.5.1 Simple binning
It would be straightforward to just divide the θ interval ([0, π]) and the ϕ interval ([0, 2π]) into some number of equally sized intervals. A binning according to this “simple” method is shown in Figure 4.7, where the bin limits are plotted over the unit sphere. The detector can be thought of as being in the middle of the sphere, with points on the sphere representing directions of incoming particles.
[Figure 4.7: 3-D plot with axes x, y and z, titled “Bins on the unit sphere (tot. 262 bins)”.]
Figure 4.7: Simple limits. The top and bottom bins have been given special treatment.
It can be clearly seen that the bins are not of the same area. Equal bin areas are desirable, since the number of pairs in a bin decides whether it is considered a real track or not.
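How unequal the bin areas of a uniform grid are can be estimated with a short calculation. The sketch below is an illustration under simplifying assumptions: the grid size is arbitrary, and the special treatment of the top and bottom bins mentioned in the caption of Figure 4.7 is ignored.

```python
import math

def simple_bin_areas(n_theta, n_phi):
    """Areas on the unit sphere of the bins in each theta band of a uniform grid."""
    d_phi = 2 * math.pi / n_phi
    edges = [math.pi * i / n_theta for i in range(n_theta + 1)]
    # Bin area: d_phi * (cos(theta_low) - cos(theta_high)), the same for every phi.
    return [d_phi * (math.cos(lo) - math.cos(hi)) for lo, hi in zip(edges, edges[1:])]

areas = simple_bin_areas(12, 24)      # 12 theta intervals, 24 phi intervals (assumed)
ratio = max(areas) / min(areas)       # equatorial bin area over polar bin area
```

For this grid the bins at the equator are several times larger than those next to the poles, so a fixed pair-count threshold would treat directions very differently.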
4.5.2 The binning problem
Instead, we should look for a binning giving bins of similar sizes as they appear on the unit sphere. This problem is similar to that of sewing a soccer ball: create the surface of the unit sphere by using equally sized surface segments. Joining pentagons and hexagons is an established means of achieving this.
However, we must keep the implementation in mind. The properties of a pair used to determine its bin are its θ and ϕ angles. With a pentagon/hexagon binning, either rather complex calculations or a look-up table must be used. There is room for neither in an FPGA implementation8. We need a binning with roughly consistent bin areas and a simple way of determining the bin from the θ and ϕ properties.
8 Such a LUT would be too big to fit in the Block-RAM of the FPGA. With 5300 DOMs there are 5300²/2 DOM pairs (without regard to order) and with 9 bits to identify the bin, 5300²/2 × 9 bits ≈ 16 Mbyte. Placing the LUT in external RAM probably wouldn’t be fast enough.
[Figure 4.8: 3-D plot of the bin limits on the unit sphere, with axes x, y and z.]
Figure 4.8: Suggested and used binning, here with 328 bins.
4.5.3 Suggested binning
The binning method I suggest and use is roughly as follows:
1. Slice the sphere a number of times parallel to the XY-plane. θ angles define where the slices should be made. Choose the discrete θ angles uniformly over [0, π].
2. Keep the top and bottom slices as they are.9
3. For the other slices, cut them into a number of segments, those segments becoming bins. ϕ angles define where the cuts should be made. Choose discrete ϕ angles uniformly over [0, 2π].
Note that the ϕ angles defining the cuts are different for each slice since the number of ϕ cuts depends on the latitude of the slice.
This binning produces bins on the unit sphere as shown in Figure 4.8.
The sizes of the bins are visualized in Figure 4.9. It can be seen that the biggest area differences are around 15%, and that the bins on all slices except the top and bottom slices and their neighbours (only about 20 out of over 300) have area differences within 4%. With “simple binning”, the greatest bin areas are several times as large as the smallest.
9 Dividing also the top and bottom slices into segments at different ϕ angles is possible, but by not doing so it is easier to achieve more similar areas of the bins.
[Figure 4.9: plot titled “Bin sizes”, showing size in % of the smallest bin (100 to 114) against bin no. (0 to 350).]
Figure 4.9: Bin sizes with suggested binning.
4.5.4 Analytical expressions for the suggested binning
Here follows an analytical description of the bin limits in terms of θ and ϕ according to the suggested and used binning method (shown in Figure 4.8).
Input parameter:
nb: number of bins to aim for
Output variables:
nθ: number of θ slices
nϕ,kθ: number of ϕ cuts for θ slice with index kθ
θkθ: the θ slice angles
ϕkθ,kϕ: the ϕ cut angles for θ slice with index kθ
with kθ = 0, 1, . . . , nθ − 1, kϕ = 0, 1, . . . , nϕ,kθ − 1
The expressions for the nθ and nϕ,kθ variables are:
$$ n_\theta = \frac{\sqrt{n_b \pi}}{2} - 1 $$
$$ n_{\phi,k_\theta} = \frac{c_{k_\theta}}{2} \sqrt{\frac{n_b}{\pi}} - 1 $$
[Figure 4.10: two sketches of the unit sphere with axes x, y and z, marking the slice angles θ0, θ1, θ2, ..., θnθ−1, the stripe width ∆θ, the cut angles ϕkθ,0, ϕkθ,1, ..., ϕkθ,nϕ−1 (and ϕ1,1, ϕ1,2, ϕ1,3 for the slice between θ1 and θ2), and the circumference c1.]
Figure 4.10: Variables in the binning expressions
The expressions for the θ slice angles are:
$$ \theta_0 = \frac{2}{\sqrt{n_b}} \quad \text{(top slice)} $$
$$ \theta_{n_\theta - 1} = \pi - \frac{2}{\sqrt{n_b}} \quad \text{(bottom slice)} $$
$$ \theta_{k_\theta} = \theta_0 + k_\theta \Delta\theta , \quad k_\theta = 1, 2, \ldots, n_\theta - 2 \quad \text{(middle slices)} $$
The expressions for the ϕ cut angles are10:
$$ \phi_{k_\theta,k_\phi} = 2\pi \frac{k_\phi + 1}{n_{\phi,k_\theta} + 1} $$
The expressions for the helper variables ckθ and ∆θ used above are:
$$ c_{k_\theta} = 2\pi \sin \frac{\pi (k_\theta + \frac{3}{2})}{n_\theta + 1} $$
$$ \Delta\theta = \frac{\pi - 2\theta_0}{n_\theta - 1} \quad \left( = \frac{\pi - 4/\sqrt{n_b}}{n_\theta - 1} \right) $$
The helper variables have geometrical meanings. ckθ is the circumference at the θ angle right between θkθ and θkθ+1. ∆θ is the difference (or distance along θ on the unit sphere) between two adjacent θ slice angles. Please refer to Figure 4.10.
Motivation
The unit sphere has an area of 4π, from which it follows that we aim at keeping each bin area as close as possible to
$$ A_b = \frac{4\pi}{n_b} . $$
The top and bottom slices are not divided with ϕ cuts. Call the top/bottom bin area At, seek an expression and solve for θ0:
$$ A_t \approx \pi \theta_0^2 = A_b \iff \theta_0 = \frac{2}{\sqrt{n_b}} \quad \left( \Rightarrow \theta_{n_\theta - 1} = \pi - \frac{2}{\sqrt{n_b}} \right) $$
We need to find the number of θ cuts desired. Since the bins will have square-like shapes we have a desired bin side length of $\sqrt{A_b}$. We hence have
$$ \Delta\theta = \sqrt{A_b} , \qquad (4.4) $$
and using a simplified expression for ∆θ,
$$ \Delta\theta = \frac{\pi}{n_\theta + 1} $$
we solve (4.4) for nθ and round nθ to the nearest integer11.
To find the middle θ slice angles, the rest of the θ interval should be split up uniformly over [θ0, θnθ−1]. Doing this gives the “stripe width”, the helper variable ∆θ, and the expression for θkθ for the middle slices follows.
We now seek the nϕ,kθ expression. Take a slice, cut at ϕ = 0 and consider the resulting stripe. It should be cut such a number of times that the lengths of the pieces are as close to $\sqrt{A_b}$ as possible. We seek the expression for the length of the stripe, which is the same as ckθ. The circumference at a θ is
$$ c_\theta = 2\pi \sin\theta . $$
However, for a given kθ we want the circumference right between θkθ and θkθ+1, that is, the circumference at
$$ \theta = \frac{\pi (k_\theta + \frac{3}{2})}{n_\theta + 1} $$
which gives the expression for ckθ as given above.
As soon as we know the number of ϕ cuts for a stripe, we need only to distribute them uniformly over [0, 2π], which gives the expression for ϕkθ,kϕ as given above.
10 There is also a ϕ cut at ϕ = 0.
11 In the algorithm actually used, the nearest even integer has been chosen in order not to get a θ limit at θ = π/2, since this is a singularity in the quantity used by the implementation. See further Section 5.7.6.
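The expressions above can be transcribed into a short Python sketch in order to inspect the resulting bin areas. Several details are assumptions made here and not taken from the thesis scripts: nθ and nϕ,kθ are rounded to the nearest integer (not the nearest even integer of footnote 11), stripe kθ is taken to lie between θkθ and θkθ+1 for kθ = 0, ..., nθ − 2, and the function names are invented for the example. The area spread it produces need not match Figure 4.9.

```python
import math

def suggested_binning(n_b):
    """Return (theta slice angles, number of bins in each middle stripe) for about n_b bins."""
    n_theta = round(math.sqrt(n_b * math.pi) / 2 - 1)
    theta0 = 2 / math.sqrt(n_b)
    d_theta = (math.pi - 2 * theta0) / (n_theta - 1)
    thetas = [theta0 + k * d_theta for k in range(n_theta)]
    bins_per_stripe = []
    for k in range(n_theta - 1):  # stripe k lies between thetas[k] and thetas[k + 1]
        c_k = 2 * math.pi * math.sin(math.pi * (k + 1.5) / (n_theta + 1))
        n_phi = round(c_k / 2 * math.sqrt(n_b / math.pi) - 1)  # cuts besides phi = 0
        bins_per_stripe.append(n_phi + 1)
    return thetas, bins_per_stripe

def bin_areas(thetas, bins_per_stripe):
    """Exact areas on the unit sphere: top cap, all stripe bins, bottom cap."""
    areas = [2 * math.pi * (1 - math.cos(thetas[0]))]
    for k, n in enumerate(bins_per_stripe):
        stripe = 2 * math.pi * (math.cos(thetas[k]) - math.cos(thetas[k + 1]))
        areas.extend([stripe / n] * n)
    areas.append(2 * math.pi * (1 + math.cos(thetas[-1])))
    return areas

thetas, bins_per_stripe = suggested_binning(300)
areas = bin_areas(thetas, bins_per_stripe)
spread = max(areas) / min(areas)
```

Comparing `spread` with the corresponding ratio for a uniform (θ, ϕ) grid shows why the extra bookkeeping per slice is worthwhile.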
4.6 Time sorting
As stated earlier, current triggers only handle LC (local-coincidence) hits. The LC hits are time-sorted in the string hubs, which means that the triggers receive a time-sorted stream of hits.12
If, however, all hits, LC and non-LC, are to be sent on from the string hubs, they will be unsorted. The TE algorithm must have a stream of sorted hits as its input. Hence, a time-sorting is needed, capable of sorting all hits. With 2.6 MHz of hits, dispersed in time by possibly ten minutes or more, it can be understood that such a task may not be simple. Being able to time-sort the hits is a major issue with the TE.
There exists a software sorting implementation with sufficient performance, developed at Stockholm University by Clyde Robson. It is able to sort all the hits with the algorithm running on an ordinary PC.13
The algorithm has been tested, the sending and receiving of the hits from other network clients included. Performance was up to 20 Mhits/s14 (receiving, sorting and sending). The performance needed for the TE at average load is, as stated earlier, around 2.6 MHz.
4.7 Implementation options
It has been discussed whether the TE can be implemented in none, one or both of software and hardware. The implementation must be able to execute the TE algorithm in real-time for the expected hit rates. The options are discussed below.
4.7.1 Software (PCs)
A single PC
It is generally hard to estimate the performance of software not yet written. However, some simple argumentation might give some clues.
For the pair rate fp we have fp ≈ 30 MHz. The clock rate for a typical PC processor is 3 GHz. This gives 100 clock pulses for each pair.
A software implementation of the TE must:
• Synthesize the pairs
• Check the speed criterion for each pair:
– Determine the positions of the two DOMs
– Calculate the distance
– Calculate the implied velocity from the distance and the time stamp difference
– Compare the implied velocity to given limits and reject or keep the pair
• Calculate θ and ϕ for each pair
• Build the histograms, search each finished histogram for peaks and report the peaks
• Handle the input (hits) and output (reported tracks) interfaces of the TE
It seems that managing this with the given pair rate will be hard.
Note: It could be argued that all the “calculate”s above could be replaced by “look-up”s. This might possibly give increased performance. It should however be kept in mind that those kinds of memory operations typically result in cache misses, and hence we are looking up in the external memory with its limited memory bus clock and big delay as compared to the processor.
12 The hits would typically not be sorted in time if no action to achieve that were taken. Buffers in the DOMs might hold the hits several minutes or more before passing them on to the surface.
13 The algorithm is briefly described in “A low energy muon trigger”, C. Bohm, D. Nygren, C. Robson, C. Wernhoff and G. Wikström, posted to the IEEE/NSS 2008 conference in Dresden.
14 “Track Engine status” (presentation slides), C. Bohm, D. Nygren, C. Robson, C. Wernhoff and G. Wikström, IceCube DeepCore workshop in Utrecht, 2008
Multiple PCs
There is a way of dividing the load of the algorithm between several PCs. Hence, in principle, any performance can be achieved by just using more PCs.
A structure of several PCs, of which one is called the arbiter and the others the clients, is suggested. The arbiter receives the hits through e.g. an ethernet interface card. It is also equipped with one ethernet card per client.
The arbiter’s task is to distribute the hits to the clients. It will direct the hit stream to one client at a time, changing which client is to receive the stream according to some condition, for example a time condition, jumping to the next client, say, every second.
Each client works in two phases, the receive phase and the compute phase. During the receive phase, the computer stores the incoming hit stream into memory. During the compute phase, it reads the stored hits from the memory, synthesizes the pairs and executes the algorithm. It then reports the detected tracks in some way.
If the arbiter simply redirects the stream of hits to the next client, the last time window at the old client and the first time window of the new client will have incomplete hit pairs. This could simply be accepted, since the number of lost tracks will be very small. Alternatively, the arbiter could overlap the hit streams by the time window size. In that case, the computer collecting the results from the clients would need to check for duplicated tracks and remove one of them.
[Figure 4.11 diagram: hits enter the arbiter, which distributes them to the clients.]
Figure 4.11: Using multiple PCs
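The overlapping variant of the arbiter can be sketched as follows. This is an illustration under assumptions: the function name is invented, the hand-over condition is a fixed time interval `chunk` longer than the time window, and the hits are represented by their time stamps only.

```python
def split_with_overlap(hit_times, chunk, t_w):
    """Distribute time-sorted hit times over consecutive clients.
    Client i receives the hits of interval i plus the first t_w of interval i + 1,
    so that no pair closer in time than t_w is split between two clients."""
    if not hit_times:
        return []
    t0 = hit_times[0]
    n_chunks = int((hit_times[-1] - t0) // chunk) + 1
    chunks = [[] for _ in range(n_chunks)]
    for t in hit_times:
        i = int((t - t0) // chunk)
        chunks[i].append(t)
        if i > 0 and (t - t0) - i * chunk < t_w:
            chunks[i - 1].append(t)  # hit in the overlap, also sent to the previous client
    return chunks

example = split_with_overlap([0, 1, 9, 10, 11, 19, 20.5], chunk=10, t_w=2)
```

Hits in the overlap regions are seen by two clients, which is why tracks reported from those regions have to be checked for duplicates when the results are collected.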
4.7.2 Logic (FPGA)
The TE can be implemented using logic. Performance is enough, with margin, for executing the algorithm in real-time. The larger part of the Master’s Thesis project has been about achieving such an implementation and the result was successful. The rest of this report provides details on the implementation.
4.7.3 Comments
Considering the special place where the TE is intended to be deployed, i.e. the South Pole, the FPGA option has some clear benefits when it comes to physical space and power consumption.
4.8 Borderline events
A drawback of the TE algorithm is the inability to handle borderline events, i.e., events scattered around a bin limit. Figure 4.12 visualizes the problem.
For the likelihood of triggering on an event like in b to be as high as for an event like in a, twice as many pairs are needed. For events like in c, four times as many pairs are needed.
[Figure 4.12 panels: a, b, c.]
Figure 4.12: The borderline events problem, illustrated by three ways that pairs from a track can be scattered relative to the bin limits on the unit sphere.
Chapter 5
The implementation
This chapter not only intends to present and give an overview of the FPGA implementation of the TE, but is also meant to be a reference and documentation of the design. The chapter is therefore rather detailed on some points.
5.1 Full-TE system overview
The “full-TE” consists of all new hardware units and cabling needed to run the TE with the existing systems within the IceCube DAQ. The unit executing the algorithm, the logic within the FPGA, is called the “TE core”. Often, the TE core is referred to simply as “the TE”.
The full-TE (see overview in Figure 5.1) gets its input from the stringHubs, which feed the full-TE with hits. The stringHubs have an unused ethernet port, and the stringHub software will be modified so that all hits will be sent on through that port.
There are 86 stringHubs, the same as the number of strings. Using a hierarchy of switches, all hits are presented to a unit called the Control Server through one single ethernet cable.
The Control Server has three main tasks:
• time-sort the stream of incoming hits
• buffer the hits
• interface to the FPGA processor (see below) and send the time-sortedstream of hits on to it
In the Virtex-5 FPGA chip, there is also a PowerPC processor which is used, called the FPGA processor. The FPGA processor has the following tasks:
• interface to the Control Server and receive the hits
CHAPTER 5. THE IMPLEMENTATION 31
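[Figure 5.1 diagram: the stringHubs send hits to the Control Server, which sends hits to the FPGA chip. The FPGA chip contains a CPU (input and output interfaces) and logic (“TE core”). Reported time windows are sent to the existing IceCube DAQ.]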
Figure 5.1: Overview of the so-called “full-TE”
• buffer the hits using the external RAM memory of the Xilinx ML-507 evaluation board
• interface to the TE core (the logic in the FPGA) in order to:
– send the hits on to the TE core
– receive detected tracks messages from the TE core
• interface to the IceCube DAQ in order to send on the detected tracks messages to it
There are various ways to communicate between the FPGA processor and the FPGA logic. Probably, one of the buses provided by the PowerPC processor, such as the PLB (Processor Local Bus), will be used. Another option is to use the APU (Auxiliary Processor Unit) of the processor.
5.2 Introduction to FPGAs
An FPGA, Field Programmable Gate Array, is a programmable silicon device. In it, logic is realized as specified by the user. The user describes what logic should be implemented, earlier by electronic logic circuit diagrams but nowadays usually by using a hardware description language (HDL). VHDL, VHSIC1 Hardware Description Language, is one such language and the one used for this project. An FPGA can be reprogrammed a very large number of times with new designs described by new or revised HDL code.
In the Track Engine project, there was a given algorithm that was to be implemented, the main options being software or an FPGA. The main advantage of the FPGA in this case is the much higher performance that can be achieved. In a simple CPU, only one binary computation (such as addition or comparison) can be performed at each clock cycle, since there is only one ALU2, the unit performing such operations, in the CPU.
In an FPGA, on the other hand, we could have literally hundreds of thousands of adders, if we wanted to, working in parallel or in series. Several such components in series make up a pipeline, similar to the pipeline in a RISC processor. The TE FPGA design is in essence a long pipeline with hundreds of stages.
1 Very High Speed Integrated Circuits
FPGA manufacturers usually offer evaluation boards, where the FPGA is pre-mounted on a circuit board with various components such as connectors, interfaces, buttons and LEDs. Such an evaluation board has been used and will also be used for the real implementation of the TE.
5.3 Hardware used
5.3.1 The FPGA
The TE core is intended for a Xilinx Virtex-5 FPGA. The exact FPGA model number is XC5VFX70T in package FFG1136 with speed grade -1, since this is the FPGA chip on the ML-507 board which is intended to be used.
The XC5VFX70T is intended for embedded systems and hence has a PowerPC processor embedded within the chip, which is a key feature for the TE since the design relies on a processor available within the FPGA chip. Some other features of the FPGA of interest for the TE are:
• 11,200 slices and 44,800 flip-flops
• Up to 550 MHz (BlockRAM and DSP48 slices)
• 5,328 Kbit = 670 Kbyte BlockRAM, dual-port
• 128 DSP48 slices (can be used as multipliers)
• 640 available (single-ended) user I/O-pins
• PCIe endpoint blocks
5.3.2 The embedded processor
Some details on the PowerPC block of interest for the TE are:
• Model: IBM PowerPC 440
• Type: 32-bit RISC
• 550 MHz clock frequency
• Separate memory bus and Processor Local Bus (PLB)
• Access to PCIe blocks
2 Arithmetic Logic Unit
5.3.3 The evaluation board
The TE is intended to be implemented in an FPGA on the Xilinx ML-507 evaluation board (Figure 5.2). Some features of this board of interest for the TE are:
• External DDR2 RAM (256 MB)
• Ethernet port
• SystemACE CompactFlash socket
Figure 5.2: The ML-507 evaluation board. Photo from xilinx.com.
5.4 Communication within TE, data rates
5.4.1 Currently used encoding of hits in IceCube
The current encoding of hits in IceCube is according to Listing 5.1.
With 38 bytes/hit, it is clear that the density of information of interest for the TE is low.
BYTES  TYPE  WHAT
4      INT   Record length in bytes self-inclusive
4      INT   PayloadID
8      INT   Timestamp
4      INT   TriggerType (-1 usually)
4      INT   TriggerConfigID (-1 usually)
4      INT   SourceID (12000 + string #)
8      INT   DomID
2      INT   TriggerMode (bitmask of trigger bits)
38           (total)
Listing 5.1: Excerpt from a specification of the hit encoding currently used in IceCube. Of interest to us are Timestamp, SourceID (string number) and DomID (DOM number).
The hits according to the specification appear in the stringHubs. As described, the full-TE collects all hits from all stringHubs into the Control Server, merging the data streams from the different stringHubs through switches.
With this encoding, the data rate for the Control Server to handle would be
DR = 2.6 Mhits/s · 38 bytes/hit ≈ 100 Mbyte/s ,
this being the expected average data rate, with peaks potentially significantly higher. Such a data rate is too high to be handled conveniently.
5.4.2 Suggested encodings, resulting data rates
“TE protocol”
A protocol called the TE protocol shall be used for (refer again to Figure 5.1 on page 31)
• sending hits from the stringHubs to the Control Server, and for
• sending hits from the Control Server to the FPGA processor.
The protocol won’t be fully specified here, but it will represent each hit by3:
BYTES  WHAT
2      DomID and LC tag
2      Timestamp offset
(The encodings used for DOM IDs with LC tag and time stamps are specified in Section 5.6.)
A global time stamp is sent more sparsely (with a negligible additional communication load), and the time stamp offset marks the time to be added to the last received absolute time stamp in order to get the absolute time stamp of the current hit.
With 4 bytes/hit according to the above and a hit rate of 2.6 MHz, the data rate would be about 10 Mbyte/s.
3 “DomID”, whenever used in this document, refers to a value uniquely identifying any of the 5300 DOMs in the detector and should not be confused with the DOM number, identifying a DOM on a given string.
Communication between the FPGA processor and the logic
The FPGA processor will calculate the absolute time stamp for each hit, and the full absolute time stamp will be included with each hit sent from the FPGA chip processor to the logic. The full absolute time stamp has been chosen to have 5 bytes. The hits fed to the TE logic will hence be described as:
BYTES  WHAT
2      DomID and LC tag
5      absolute time stamp
This totals 7 bytes/hit, making up a data rate of about 18 Mbyte/s at 2.6 MHz of hits.
Describing a hit in 32-bit words (4 bytes), 2 words are needed for each hit. With 2.6 MHz of hits there are 5.2 MHz of 32-bit words that the processor needs to send to the TE logic. A processor working at 550 MHz (such as the PowerPC in the Virtex-5 FPGAs) should be able to handle that easily.
The TE logic will report detected tracks back to the processor. This load should be negligible compared to that of the incoming hits.
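The data rates of the three hit encodings can be compared with a small calculation, here for the expected average hit rate of 2.6 MHz. The function and dictionary names are chosen for this sketch only.

```python
HIT_RATE = 2.6e6  # hits per second, the expected average hit rate

def data_rate_mbyte_per_s(bytes_per_hit, hit_rate=HIT_RATE):
    """Average data rate in Mbyte/s for a given hit size."""
    return bytes_per_hit * hit_rate / 1e6

rates = {
    "current IceCube encoding (38 bytes/hit)": data_rate_mbyte_per_s(38),
    "TE protocol (4 bytes/hit)": data_rate_mbyte_per_s(4),
    "processor to logic (7 bytes/hit)": data_rate_mbyte_per_s(7),
}
words_per_second = 2 * HIT_RATE  # two 32-bit words are needed for each 7-byte hit
```

Peak rates can be considerably higher than these averages, so some margin is needed at each interface.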
5.5 The VHDL code
The VHDL code for the implementation is placed in files named after the top-level units. There are two additional files:
miscPkg, a VHDL package with types, constants and functions for all signals and ports that are not specific to a single unit
miscComps, a .vhd file with entities and their architectures for general units that are used more than once (flip-flops of various kinds, registers, counters, tri-state buffers, FIFO buffers, RAM etc.)
5.5.1 Code amount
The total VHDL code amount of the project is a bit over 8000 lines. Figure 5.3 gives an idea of the code amount for the different units. The code amount is more or less proportional to the complexity of the units.
Figure 5.3: VHDL code amount for the different units. The code for the angles unit includes 480 lines of auto-generated code.
Furthermore, the project includes a bit over 1500 lines of Matlab code, used for various tasks such as
• writing input data for the DOM coordinate LUT,
• calculating bin limits and generating the VHDL code for the comparators, and
• doing some simple behavioral simulations giving performance measures.
In addition to that, some 3000 lines of Matlab code are included in the TEtest Matlab scripts for generating input data for, and analyzing output data from, the units making up the TE. TEtest will be further explained in Section 5.11.
fifo_inst: entity work.genFIFO
generic map(
    size => 11+27+15,
    -- clock cycles for speedCrit + angles + margin
    width => TanglesTimeStUnsigned'length,
    almostEmptyOffset => open, -- not used
    almostFullOffset => open, -- not used
    initFile => "" -- empty for no init file
)
port map(
    wrEN => fifoWrEN,
    rdAck => weAnglesOut,
    DI => anglesInUnsigned,
    almostFull => open,
    almostEmpty => open,
    empty => empty,
    full => full,
    DO => anglesOutUnsigned,
    clk => clk,
    rst => rst
);
Listing 5.2: Example of entity instantiation, here from the instantiation of the FIFO buffer in the preHistBuffer unit.
5.5.2 VHDL code conventions used
Some comments on the VHDL code conventions used follow.
Fully parameterized design: No design parameters or data type parameters are hard-coded within sub-units of the design. Instead, all such parameters are defined in a top-level VHDL package (miscPackage). Any references to such values are also given as actual references and not as numerical values. This minimizes consistency problems within the code (see the declaration of TdetTrackPkg in Listing 5.5 for an example, where the actual type declaration depends on other types and constants in several steps).
Instantiations coding style: VHDL supports component, configuration and entity instantiation. In the TE design, entity instantiation has been chosen since it considerably reduces the code amount and simplifies changing the design. An example of entity instantiation can be found in Listing 5.2.
Processes: Clocked sequential statements can be described by a process block with either a sensitivity list and “if rst=‘1’ then [...] elsif rising_edge(clk) [...]”-like code, or by no sensitivity list and a “wait until rising_edge(clk)” statement. The first of the two has been used in the TE.
5.6 Internal representations
5.6.1 DOM IDs
In the TE, DOM IDs are represented by 14-bit words. A DOM ID carries three pieces of information:
• LC (yes/no)
• String number
• DOM number
This information is contained in the DOM ID word according to:
BITS: MSB [ 1    2-8          9-14    ] LSB
WHAT:     [ LC   String No.   DOM No. ]
String number and DOM number are 0-based. For an LC hit, the LC bit is ‘1’, otherwise ‘0’.
The way pairs are produced in pairProd, they will be sorted in time with regard to hit1.timeSt.
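The bit layout above can be illustrated with a pair of helper functions. They are a sketch for illustration only and are not part of the VHDL design; the function names are invented for the example.

```python
def pack_dom_id(lc, string_no, dom_no):
    """Pack the LC flag (1 bit), string number (7 bits) and DOM number (6 bits)
    into a 14-bit DOM ID word, with the LC bit as the MSB."""
    assert 0 <= string_no < 128 and 0 <= dom_no < 64
    return (int(bool(lc)) << 13) | (string_no << 6) | dom_no

def unpack_dom_id(word):
    """Return (lc, string_no, dom_no) from a 14-bit DOM ID word."""
    return bool((word >> 13) & 1), (word >> 6) & 0x7F, word & 0x3F

word = pack_dom_id(True, 85, 59)  # LC hit, 0-based string 85, 0-based DOM 59
```

With 7 bits for the string number and 6 bits for the DOM number, the 86 strings fit within the available ranges.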
5.6.2 Time stamps
Each hit is tagged with a time stamp. The time stamp is important when calculating the implied speed, used for cutting away noise hits as described earlier. Also, when the TE reports detected tracks on to the DAQ, the start and end times of the time window in which they occurred are reported as well, these start and end times also originating from the time stamps.
The time stamps as reported by the DOMs have a resolution of about 1 ns. The DOMs report time stamps with an LSB significance of 0.1 ns. By stripping the three least-significant bits, we get a new LSB significance of 0.8 ns, which seems reasonable.
For the time stamps, 5 bytes are used. The largest time value that can be described is hence
$$ 0.8 \cdot 10^{-9} \cdot 2^{8 \cdot 5} / 60\ \text{[minutes]} \approx 15\ \text{minutes} . $$
[Figure 5.4 diagram: the TE core block with the ports hitIn, readyOut, weIn, detTrackPkg, detTrackPkgWe, clk and rstSig.]
Figure 5.4: The interfaces of system (TE core).
When reporting time windows with detected tracks on to the DAQ, the standardized way of describing times in the IceCube DAQ4 must be used. The Control Server converts the time to that format. This should not pose any problems since the delay in the TE core is much shorter than the largest representable time value.
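The time stamp representation can be checked numerically with the sketch below. The function name is invented, and the masking to 40 bits (wrap-around) is an assumption about how a longer DOM time stamp would be fitted into 5 bytes.

```python
LSB_DOM_NS = 0.1   # LSB significance of the DOM time stamps, in ns
STRIPPED_BITS = 3  # least-significant bits dropped by the TE
WIDTH_BITS = 8 * 5 # 5-byte TE time stamp

def te_timestamp(dom_ticks):
    """Drop the three least-significant bits and keep the lowest 40 bits (assumed)."""
    return (dom_ticks >> STRIPPED_BITS) & (2 ** WIDTH_BITS - 1)

lsb_te_ns = LSB_DOM_NS * 2 ** STRIPPED_BITS     # LSB significance after stripping
range_s = lsb_te_ns * 1e-9 * 2 ** WIDTH_BITS    # largest representable time, seconds
range_min = range_s / 60                        # the same in minutes
```

The delay through the TE core only needs to stay well below `range_min` for the converted time stamps to remain unambiguous.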
5.6.3 Position coordinates, lengths
Position coordinates are represented by words 13 bits wide, with the LSB having a significance of 0.3 m.
The largest representable coordinate value is hence
\[ 0.3 \cdot 2^{13}\ \text{m} \approx 2500\ \text{m} , \]
which is within our needs since the origin is placed in the middle of the detector and the detector is roughly 1 km × 1 km × 1 km.
5.7 TE core and its units
5.7.1 Input and output ports, other signal types
The top-level unit of the TE core is called system. Its entity declaration can be found in Listing 5.3.
The fundamental input to the TE core is hits, and the output is information on detected tracks. These signals, along with ready and write-enable signals, constitute the interface of the TE core (Figure 5.4).
readyOut is high when system is ready to receive another hit, which is written to the unit by laying it out on the hitIn port and pulling weIn high for one clock pulse.
To report a time window with detected track(s), a TdetTrackPkg structure is laid out on the detTrackPkg output port and detTrackPkgWe is pulled high for one clock pulse.
Footnote 4: The number of 0.1 ns units since some specified date and time, using an 8-byte word.
entity system is
  port (
    hitIn         : in  Thit;
    readyOut      : out std_logic;
    weIn          : in  std_logic;
    detTrackPkg   : out TdetTrackPkg;
    detTrackPkgWe : out std_logic;
    clk           : in  std_logic;
    rstSig        : in  std_logic
  );
end system;
Listing 5.3: Entity declaration of system, the top-level unit of TE core.
-- ranges within TdomID vector:
subtype stringNoRange is natural range 12 downto 6;
subtype domNoRange is natural range 5 downto 0;

subtype TdomID is unsigned(1+stringNoRange'high downto 0);
subtype TtimeSt is unsigned(8*5-1 downto 0); -- 5 bytes, LSB=0.8 ns

type Thit is record
  domID  : TdomID;
  timeSt : TtimeSt;
end record;
Listing 5.4: Declaration of the Thit structure
The Thit structure declaration, along with types and constants needed to declare it, is found in Listing 5.4.
The TdetTrackPkg structure declaration, along with types and constants needed to declare it, is found in Listing 5.5.
In the TE, the first thing to happen to the hits is that they are combined into pairs. Pairs are thereafter the data type that is pushed through the whole TE. The declaration of the Tpair data type is found in Listing 5.6.
In a pair, hit2 is always the first (oldest) hit of the two. Its time stamp will hence always be less than (or possibly equal to) the time stamp of hit1.
5.7.2 System overview
The TE core design can be thought of as a long pipeline. The object flowing through the pipe is a pair, consisting of two hits. Typically, at every positive clock edge, each pair is sent one step onwards. This long pipeline can also be thought of as a large shift register, with each cell in the shift register containing one pair. This likening of the system to a pipeline makes sense up to but not including the so-called histStage unit.

constant anglesAddrBits      : integer := 9;
constant totPairsCounterBits : integer := 14;
constant lcCounterBits       : integer := 3; -- 2^3-1=7

subtype Tuns1       is unsigned(0 downto 0);
subtype TanglesAddr is unsigned(anglesAddrBits-1 downto 0);
subtype TtotPairs   is unsigned(totPairsCounterBits-1 downto 0);
subtype ThistCount  is unsigned(3 downto 0);
subtype TlcCounter  is unsigned(lcCounterBits-1 downto 0);

type TdetectedTrackData is record
  we    : Tuns1;
  addr  : TanglesAddr;
  count : ThistCount;
end record;
type TdetTrackVec is array (0 to 9) of TdetectedTrackData;

type TdetTrackPkg is record
  timeStFrom   : TtimeSt;
  timeStTo     : TtimeSt;
  totNoOfPairs : TtotPairs;
  lcCounter    : TlcCounter;
  detTrack     : TdetTrackVec;
end record;
Listing 5.5: Declaration of the TdetTrackPkg structure

type Tpair is record
  hit1      : Thit;
  hit2      : Thit;
  validPair : std_logic;
end record;
Listing 5.6: Declaration of the Tpair structure. hit1.timeSt ≥ hit2.timeSt will always be the case.

[Figure 5.5: Overview of the sub-units of the TE: hits enter pairProd, and pairs flow through speedCrit, angles, preHistBuf and histStage, which outputs the reported time windows. The preBuffer unit is not shown. Only the principal information flow is shown through the arrows; in the design there are further connections between the sub-units.]
There might not always be a pair to send on. Simply filling the data fields of that pipeline stage with 0’s could not easily be distinguished from a real pair, and therefore there is one bit at each stage, called the validPair bit, signaling whether the registers at that stage actually contain information on a pair or whether that stage is empty.
The units of the top-level design are, from beginning to end (see also Figure 5.5):
• the preBuffer unit, buffering the hits from the FPGA processor and sending them on to the pairProd unit,
• the pairProd unit, which takes hits and combines them into all possible pairs with the condition that the time stamps of the two hits may differ by at most Ttw,
• the speedCrit unit, deleting (setting the validPair bit to 0) all pairs not conforming to the speed criterion,
• the angles unit, determining the θ and ϕ angle properties of the pair, using these quantities to find the bin of the pair, and from then on sending the bin number along with the pair,
• the preHistBuffer unit, a buffer with a special functionality necessary before the histStage, further explained later, and
• the histStage unit, building histograms in the FPGA’s RAM resources for all the (overlapping) time windows, finding time windows with histogram cells in which the number of pairs is above the threshold level, and reporting those time windows and their track candidates back to the FPGA processor.
[Figure 5.6: Rough sketch of the pairing principle. Elements shown: the incoming hit, currentHitReg, TWbuffer with its DI, DO and addr ports, the strobe signal, and the output pair consisting of “hit1”, “hit2” and validPair.]
The units are explained in more detail in the next section.
5.7.3 The preBuffer unit
The preBuffer unit is a FIFO buffer and is used in order to provide a buffer between the FPGA processor communication interface and the pairProd unit.
The FIFO buffer stores hits in the Block RAM. The buffer size can be configured through a parameter; typically the buffer would store tens or maybe hundreds of hits.
preBuffer communicates with pairProd. pairProd sets a “ready” bit high when it is ready to receive another hit. If preBuffer is not empty, it will write a hit to pairProd by putting the hit on a data bus and pulling a Write Enable bit high.
5.7.4 The pairProd unit
The functionality of the pairProd unit is based on a buffer called TWbuffer. Its size is an important and critical design parameter and is discussed in Section 5.8.
Pairing
Hits are paired into pairs by the principle shown in Figure 5.6.
When a new hit is received from preBuffer, it is written to the register currentHitReg. The hit in currentHitReg is always laid out as one of the hits on the output pair bus.
The TWbuffer ring buffer contains the other hits that are to be paired with the hit in currentHitReg. The hits in TWbuffer are pointed out one by one (“strobing”), and the output of TWbuffer is laid out as the second hit of the pair on the output pair bus.
The validPair bit of the output pair bus is pulled high as the contents of TWbuffer are pointed out. In this way, all pairs between the hit in currentHitReg and the hits in TWbuffer are sent on.
When the strobing is complete, the hit in currentHitReg is added to TWbuffer.
Removing hits too old from TWbuffer
There must also be a mechanism to ensure that only hits with time stamps differing by at most Ttw are paired together: somehow, hits must also be removed from TWbuffer. This is done every time currentHitReg has received a new hit, before the strobing.
When a new hit has been written to currentHitReg, its time stamp is compared to that of the oldest hit in TWbuffer. If the difference is too great, the oldest hit is deleted, and the procedure is repeated until the oldest hit is not too old. This whole procedure is controlled by a state machine. It is sketched out in Figure 5.7.
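The pairing and clean-up procedure can be summarised in a behavioural software model. This is a sketch under stated assumptions, not the state machine itself: the names Hit and produce_pairs are hypothetical, time stamps are plain integers, the strobe order (oldest hit first) is assumed, and a Python deque stands in for the TWbuffer ring buffer.

```python
from collections import deque
from typing import NamedTuple


class Hit(NamedTuple):
    dom_id: int
    time_st: int  # time stamp in LSB units


def produce_pairs(hits, t_tw, capacity=None):
    """Yield (hit1, hit2) for time-ordered hits at most t_tw apart."""
    tw_buffer = deque(maxlen=capacity)  # oldest hit at the left end
    for current in hits:  # 'current' plays the role of currentHitReg
        # Delete hits that are too old before strobing.
        while tw_buffer and current.time_st - tw_buffer[0].time_st > t_tw:
            tw_buffer.popleft()
        # Strobe: pair the current hit with every buffered hit.
        for older in tw_buffer:
            yield current, older  # hit1 = newer hit, hit2 = older hit
        # Write the current hit; a full deque drops its oldest entry.
        tw_buffer.append(current)


hits = [Hit(1, 0), Hit(2, 10), Hit(3, 25), Hit(4, 100)]
pairs = [(h1.time_st, h2.time_st) for h1, h2 in produce_pairs(hits, t_tw=30)]
print(pairs)
```

In this example the last hit arrives long after the others, so all buffered hits are deleted before strobing and no pair is produced for it.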
Cycle efficiency
Ideally, one pair would leave pairProd every clock cycle. Because of the design chosen, this will however not be the case. By studying the state machine, we understand that it is only while looping within the strobe state that pairs are sent on by pulling validPair high. However, some time will also be spent in the other states.
We define the cycle efficiency ηcyc as the fraction of the clock cycles in which pairProd really does output pairs (i.e., validPair is high), given that there are available hits to input at all times. We have
\[ \eta_{cyc} = \frac{f_h T_w}{f_h T_w + n_{extra}} , \tag{5.1} \]
with fh being the hit frequency as before and nextra being the number of clock cycles spent in states other than the strobe state within one full cycle in the state machine.
The expression is simply motivated by the fact that fhTw is the number of hits in one time window, and fhTw + nextra is the total number of clock cycles for one full cycle in the state machine.
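To give a feeling for Equation (5.1), the following sketch evaluates the cycle efficiency for a few operating points. The numbers are hypothetical and chosen only for illustration; they are not measurements from the TE.

```python
def cycle_efficiency(hits_per_window, n_extra):
    """Evaluate (5.1): eta_cyc = f_h*T_w / (f_h*T_w + n_extra)."""
    return hits_per_window / (hits_per_window + n_extra)


# Hypothetical operating points: hits per time window (f_h * T_w)
# with an assumed overhead of 5 clock cycles per state machine cycle.
for hits_per_window in (5, 20, 100):
    eta = cycle_efficiency(hits_per_window, n_extra=5)
    print(f"f_h*T_w = {hits_per_window:3d}: eta_cyc = {eta:.3f}")
```

The efficiency approaches 1 as the number of hits per time window grows, since the fixed overhead nextra is then spread over more strobe cycles.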
[Figure 5.7: Rough sketch of the state machine controlling the pairProd unit. Shows conditions for state transitions and actions taken (marked with “->”). States: S_waitForNewHit, S_delOldHits, S_strobe, S_writeNewHit. Conditions and actions in the figure: if we=0; if we=1; we=1 and TWbuffer empty; if still too-old hits -> delete oldest hit; if no more too-old hits; if TWbuffer empty (deleted all hits); if last hit not pointed out -> increment strobe counter -> validPair=1; if stallPairProd=1 -> validPair=0; if last hit pointed out; if last hit pointed out and TWbuffer is full -> delete oldest hit; -> transfer hit from curHitReg to TWbuffer.]
Stalling the pairProd unit
The histStage unit, creating and searching the histograms, may get busy, having no resources left to build new histograms. This should rarely happen, but if it does, histStage orders pairProd to stop creating pairs by pulling the signal stallPairProd high.
A number of pairs will already have left pairProd but not yet reached histStage. Those pairs are swallowed by the preHistBuffer, which is there for only that specific situation.
The implementation of this stalling functionality in the pairProd unit is simple. In the state machine, if stallPairProd is high while in the strobing state, validPair is pulled low, no counters for TWbuffer are incremented and the next state is the strobing state.
The timing bottleneck of the design
The timing bottleneck of the design, that is, what limits the clock frequency at which the design can be run, lies in the pairProd unit. More specifically, it is in the next-state logic of the state machine, where the difference of the time stamps is compared to the time window width constant in order to know whether the currently oldest hit should be removed or not.
Some effort has been put into trying to shorten the logic path causing the limit of the timing. Two actions have been taken and are implemented in the design:
• A kind of pipelining of the subtraction and compare operation has been implemented. In short, it consists of a flip-flop after the operation and some additional logic and modifications of the state machine to handle the fact that the result of the comparison is one clock cycle (too) old.
• The subtraction and comparison are not performed with full accuracy. Allowing pairs with time stamp differences of exactly Ttw is not necessary. Hence some of the least-significant bits are skipped in the comparison. The number of bits to skip can be easily set. In the current design, 8 bits are skipped, which gives an accuracy loss close to the significance of the 9th bit, which is 0.8 ns · 2^9 ≈ 0.41 µs.
For simulations, all bits are usually used in the comparison for predictability.
Each of these two actions has helped, but the compare operation still remains the timing bottleneck.
Some effort has been put into investigating the possibility of one additional pipeline stage in the comparison, which should certainly give the operation a logic path short enough not to make it the timing bottleneck. Such a modification would make it necessary to modify the state machine so that in principle every action it makes can be undone. This could be implemented, but it would require considerable work and, probably a greater disadvantage, it would lead to a much more complicated state machine that is more difficult to test, modify and troubleshoot.
5.7.5 The speedCrit unit
In speedCrit, pairs not conforming to v ∈ [0.3c, 1.0c], v being the implied velocity (see Section 4.3.3), are filtered out. Filtering out, i.e. removing a pair, is simply done by setting the validPair bit of the pair to 0.
The idea
As before, we have
\[ v := \frac{l}{\Delta t} . \]
For the two hits, we have their DOM IDs, and their respective positions are looked up in a look-up table, giving us their coordinates (x1, y1, z1) and (x2, y2, z2). We will avoid the square root operation since it is more difficult than the square operation to implement in logic. Hence, as the distance measure, we use
\[ l^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2 , \]
and the actual comparison that will be made is
\[ (x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2 \overset{?}{\in} \left[ ((t_2 - t_1)\,0.3c)^2 ,\ ((t_2 - t_1)\,1.0c)^2 \right] \]
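The squared comparison can be sketched in software as follows. This is a floating-point illustration of the relation above, not the fixed-point arithmetic of the unit; the function name, the units (metres and nanoseconds) and the example values are assumptions.

```python
C_M_PER_NS = 0.299792458  # speed of light in vacuum, metres per nanosecond


def passes_speed_criterion(pos1, pos2, t1_ns, t2_ns, v_min=0.3, v_max=1.0):
    """Return True if the implied speed lies in [v_min*c, v_max*c].

    Squared quantities are compared so that no square root is needed.
    """
    dist_sq = sum((b - a) ** 2 for a, b in zip(pos1, pos2))
    dt = t2_ns - t1_ns
    lower_sq = (dt * v_min * C_M_PER_NS) ** 2
    upper_sq = (dt * v_max * C_M_PER_NS) ** 2
    return lower_sq <= dist_sq <= upper_sq


# Two hits 30 m apart, with three different (made-up) time differences.
for dt_ns in (50, 200, 1000):
    ok = passes_speed_criterion((0, 0, 0), (0, 0, 30), 0, dt_ns)
    print(f"dt = {dt_ns:4d} ns -> keep pair: {ok}")
```

Only the middle case corresponds to an implied speed between 0.3c and 1.0c; the first is faster than light and the last too slow.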
Unit overview
An overview of the unit can be found in Figure 5.8.
The unit can be said to consist of three parallel pipelines. The delay in each is the same so that the pair, or information derived from the pair, will be at hand at the same time at the end of each pipeline. The three pipelines are:
• The coordinate pipeline, implementing the left-hand side of the relation above
• The time pipeline, implementing the right-hand side of the relation above
• A delay chain consisting of D flip-flops in series, just passing on the original pair, making sure that it is at hand at the end of the pipeline to be sent on if the comparison resulted in not removing the pair.
[Figure 5.8: Overview of the speedCrit unit. Blocks shown: validPairFilter, sameStringFilter, coordLUT, two arithmetics blocks, delay chain and comparator. The coordinate pipeline takes the DOM IDs from pairIn, looks up x, y, z and forms ∆x^2+∆y^2+∆z^2; the time pipeline takes t1, t2 and forms ((t2-t1)0.3c)^2 and ((t2-t1)1.0c)^2; the delay chain carries the pair to pairOut.]
There is also a comparator performing the actual comparison between the left- and right-hand sides of the relation above and changing the validPair bit if necessary.
There is also a logic block called validPairFilter at the beginning of speedCrit. It sets the contents of a pair with validPair=0 to all 0’s. This has no functional effect, but for data not representing an actual pair, it makes fewer gates and flip-flops in both pipelines toggle, thereby reducing power consumption/dissipation. The design is also easier to troubleshoot in simulations since the internal signals will be “clean”, without nonsense DOM IDs and time stamps for non-valid pairs.
The two parallel pipelines have a delay of 10 clock cycles. The comparator is pipelined in one stage.
The coordinate pipeline
The task of the coordinate pipeline is to provide the comparator with the quantity (x2 − x1)^2 + (y2 − y1)^2 + (z2 − z1)^2.
At first, the DOM position coordinates must be found. This is done by using a LUT implemented as a ROM in the BlockRAM of the Virtex-5. The LUT is initialized by a VHDL function reading binary data from a text file. That text file has in turn been generated by Matlab code from a text file containing geometry data for the detector (coordinates of the DOMs). That geometry data has been retrieved from the IceCube collaboration.
The ROM used for the LUT is simply a RAM without write functionality. The Virtex-5 supports so-called dual-port RAM, meaning that two different data words can be read or written independently from/to different memory addresses at the same clock pulse. This is used for the read operation: one pair each clock pulse means two hits with corresponding DOM coordinates.
After the LUT, the coordinates are subtracted from each other, thereafter the differences are squared and the squares are added. The resulting quantity is passed on to the comparator.
The time pipeline
The task of the time pipeline is to provide the comparator with the quantities ((t2 − t1)0.3c)^2 (lower limit) and ((t2 − t1)1.0c)^2 (upper limit).
The two times are subtracted in a subtractor and the result is fed to two multipliers, giving signals corresponding to the quantities (t2 − t1)0.3c and (t2 − t1)1.0c. Since the multiplication by c gives very large values, and the least significant bits are not needed for sufficient resolution, the result is shifted 10 bits to the right, ignoring the bits shifted out. Each of the modified limits is finally sent to a multiplier squaring the limit. The results are sent to the comparator.
5.7.6 The angles unit
Strategies for determining the bin
The angles unit has the task of determining the bin number for each pair. There are two main approaches to this task, of which the second is used.
1. The most straightforward approach would be to use a LUT, looking up each DOM pair and reading out the bin number. However, such a LUT does not fit in the BlockRAM of the FPGA: with 5200 DOMs, 5200^2 ≈ 27 Mwords would be needed, with over a byte for each word to describe the bin number. Using external RAM would require a VHDL RAM bus controller, a big project in itself. Furthermore, the external RAM on the board is needed by the PowerPC processor.
2. Geometry data of the DOMs can be used to calculate the θ and ϕ quantities for each pair. Then, (a large number of) comparators can be applied to those quantities to determine the bin number.
It should be noted that using a LUT would give total freedom in defining the geometrical shapes of the bins. That is however not the case with the approach used, which demands that the bins be defined by limits in θ and ϕ.
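The size of the pair LUT in approach 1 can be estimated with a few lines of arithmetic. The sketch assumes one entry per ordered DOM pair and the 344-bin example binning used elsewhere in this chapter; both are assumptions made for the estimate, not properties of an existing design.

```python
import math

N_DOMS = 5200  # number of DOMs quoted in the text
N_BINS = 344   # bin count of the example binning (assumption for this estimate)

n_entries = N_DOMS ** 2                        # one entry per ordered DOM pair
bits_per_entry = math.ceil(math.log2(N_BINS))  # bits needed for a bin number
total_bits = n_entries * bits_per_entry

print(f"{n_entries / 1e6:.1f} M entries, {bits_per_entry} bits each, "
      f"{total_bits / 8 / 2**20:.1f} MiB in total")
```

Exploiting the symmetry between the two hits of a pair would roughly halve the number of entries, which does not change the conclusion that the table is far too large for on-chip memory.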
The DOM coordinates are already at hand in the speedCrit unit where they were looked up in a LUT. They are sent on along with the pair to the angles unit through the signal miscCoords.
[Figure 5.9: Overview of the angles unit. Blocks shown: an arithmetics block fed by miscCoords and producing x^2+y^2, z^2, x and y; dividerWrapper with two dividers producing Tq and Pq; two further arithmetics blocks using z1 ≥ z2 and xyQuadrant to produce Tq,mod and Pq,mod; and angleComp producing addr.]
Overview
The unit is sketched out in Figure 5.9. The work flow of the unit is in short:
1. From miscCoords, form the quantities (x2 − x1)^2 + (y2 − y1)^2, |y2 − y1| and |x2 − x1| (left-most “arithmetics” box in the figure; see footnote 5),
2. perform the divisions, yielding Tq and Pq, which are measures of tan^2 θ and tan ϕ but without regard to quadrant (dividerWrapper in the figure),
3. determine quadrants and re-map Tq and Pq to Tq,mod and Pq,mod, each of those having a 1:1 relation to tan θ and tan ϕ, re-introducing quadrant information (the two right-most “arithmetics” boxes in the figure), and
4. compare Tq,mod and Pq,mod to hard-coded limits defining the bins to get the bin number of the pair (angleComp in the figure).
More details on the motivation for using quotients and on the transformation of Tq and Pq to Tq,mod and Pq,mod are found in the subsection “Comparing with quotients” further down.
dividerWrapper
The dividerWrapper instantiates two hardware dividers, one for Tq and one for Pq. The open-license divider used (see footnote 6) was found at opencores.org. The divider is written in Verilog but can be instantiated in a VHDL unit thanks to ModelSim and Xilinx both supporting mixed-language designs.
The (unsigned) hardware divider has a delay of one clock cycle per output binary digit. It is fully pipelined and can provide one division result at each positive clock edge.
dividerWrapper reshapes the signals to the appropriate width before they are fed to the dividers. It also delays the output of the Pq divider in order for the outputs of the two dividers to be in phase: Pq has fewer bits than Tq and is therefore available earlier than Tq.
Footnote 5: (x2 − x1)^2 and (y2 − y1)^2 are already at hand in miscCoords since they have been calculated in speedCrit, so the logic needs only to implement an adder and simple arithmetics and no multipliers.
Footnote 6: www.asics.ws
angleComp
The angleComp implements comparators for determining the bin number from Tq,mod and Pq,mod (corresponding to θ and ϕ). This is done by comparing Tq,mod and Pq,mod to pre-calculated constants defining the bins.
The larger part of the unit is made up of VHDL code auto-generated by a Matlab script. The Matlab script
1. reads a text file defining the bin limits in θ and ϕ,
2. calculates the corresponding Tq,mod and Pq,mod quantities,
3. generates VHDL code declaring constants for the Tq,mod and Pq,mod limits, and
4. generates VHDL code implementing the comparators with limits as defined by the constants.
An example of excerpts from the auto-generated code can be found in Listing 5.7.
The input text file defining the bins is typically generated by another Matlab script generating the limits as described in Section 4.5. An example of an input text file is found in Listing 5.8.
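Step 2 of the script can be illustrated for limits in the first quadrant, where the modified quotients equal the plain quotients tan^2 θ and tan ϕ. The sketch below is Python rather than Matlab and is not the author's script: the function names are hypothetical, and the scaling (12 fractional bits for the θ quotient, 8 for the ϕ quotient) and rounding to nearest are assumptions.

```python
import math

T_FRAC_BITS = 12  # fractional bits assumed for the theta quotient
P_FRAC_BITS = 8   # fractional bits assumed for the phi quotient


def tq_limit(theta):
    """Fixed-point tan^2(theta) for a theta limit below pi/2."""
    return round(math.tan(theta) ** 2 * 2 ** T_FRAC_BITS)


def pq_limit(phi):
    """Fixed-point tan(phi) for a phi limit below pi/2."""
    return round(math.tan(phi) * 2 ** P_FRAC_BITS)


theta_limits = [0.110432, 0.305147, 0.499863]  # Theta-high values, Listing 5.8
phi_limits = [0.698133, 1.396275]              # Phi-high of bins 2 and 3

t_hex = [f"{tq_limit(th):07x}" for th in theta_limits]
p_hex = [f"{pq_limit(ph):05x}" for ph in phi_limits]
print(t_hex)
print(p_hex)
```

Under these assumptions the results are 0000032, 0000196, 00004c6 and 000d7, 005ac, which agree with the leading entries of the constants t and p in Listing 5.7. Limits beyond the first quadrant additionally need the quadrant offsets of the modified quotients, which this sketch does not cover.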
Comparing with quotients
First, x, y and z should here and from now on be interpreted as the coordinates of the position vector formed from the coordinates of the two hits of a pair. Furthermore, unsigned is used for all numerical internal signals of the TE, resulting in, for x,
\[ x := |x_1 - x_2| , \tag{5.2} \]
and the corresponding expressions for y and z.
A coordinate (θ, ϕ) in spherical units corresponds to the coordinate (x, y, z) in cartesian units according to equations (5.3a) and (5.3b).
-- <generated code>
-- Generated from generateAnglesVhdl(...) 05-Nov-2009 17:48:03:
generateAnglesVhdl : block
  constant nBitsTh : integer := 28; -- (7 hex digits)
  constant nBitsPh : integer := 20; -- (5 hex digits)
  subtype TTquotLim is unsigned(nBitsTh-1 downto 0);
  subtype TPquotLim is unsigned(nBitsPh-1 downto 0);
  signal binAddrInt : integer range 0 to anglesAddrHigh;
  constant nTh : integer := 16;
  constant nPhMax : integer := 32;
  type Tt is array (0 to nTh-1) of TTquotLim;
  type Tp is array (0 to nTh-1-1, 0 to nPhMax-1) of TPquotLim;
  constant t : Tt :=
    (x"0000032", x"0000196", x"00004c6", x"0000b1b", x"0001851",
     x"000391e", x"000b0fe", x"0068d6b", x"1f972d4", x"1ff4f02",
     x"1ffc6e0", x"1ffe7ad", x"1fff4e3", x"1fffb38", x"1fffe68",
     x"1ffffcc");
  constant p : Tp := (
    -- [th0, th1[:
    (x"000d7", x"005ac", x"1fe43", x"1ffa1", x"2005b", x"201b9",
     x"3fa50", x"3ff25", x"fffff", x"fffff", x"fffff", x"fffff",
     x"fffff", x"fffff", x"fffff", x"fffff", x"fffff", x"fffff",
     x"fffff", x"fffff", x"fffff", x"fffff", x"fffff", x"fffff",
     x"fffff", x"fffff", x"fffff", x"fffff", x"fffff", x"fffff",
     x"fffff", x"fffff"),
    -- [th1, th2[:
    (x"0007b", x"00141", x"00462", x"1fb9c", x"1febd", x"1ff83",
     x"1fffe", x"20079", x"2013f", x"20460", x"3fb9a", x"3febb",
     x"3ff81", x"fffff", x"fffff", x"fffff", x"fffff", x"fffff",
    -- [... and on with a set of phi limits for each theta limit]
begin
  binAddrInt <=
    --[addr] [-- th condition --]        [--- ph condition -----]
     0 when tQ <t(0) else
     1 when tQ>=t(0) and tQ<t(1) and pQ <p(0,0) else
     2 when tQ>=t(0) and tQ<t(1) and pQ>=p(0,0) and pQ<p(0,1) else
     3 when tQ>=t(0) and tQ<t(1) and pQ>=p(0,1) and pQ<p(0,2) else
     4 when tQ>=t(0) and tQ<t(1) and pQ>=p(0,2) and pQ<p(0,3) else
     5 when tQ>=t(0) and tQ<t(1) and pQ>=p(0,3) and pQ<p(0,4) else
     6 when tQ>=t(0) and tQ<t(1) and pQ>=p(0,4) and pQ<p(0,5) else
     7 when tQ>=t(0) and tQ<t(1) and pQ>=p(0,5) and pQ<p(0,6) else
     8 when tQ>=t(0) and tQ<t(1) and pQ>=p(0,6) and pQ<p(0,7) else
     9 when tQ>=t(0) and tQ<t(1) and pQ>=p(0,7) else
    10 when tQ>=t(1) and tQ<t(2) and pQ <p(1,0) else
    11 when tQ>=t(1) and tQ<t(2) and pQ>=p(1,0) and pQ<p(1,1) else
    12 when tQ>=t(1) and tQ<t(2) and pQ>=p(1,1) and pQ<p(1,2) else
    13 when tQ>=t(1) and tQ<t(2) and pQ>=p(1,2) and pQ<p(1,3) else
    14 when tQ>=t(1) and tQ<t(2) and pQ>=p(1,3) and pQ<p(1,4) else
    -- [... up to addr 343 for 344 bins]
  binAddr <= to_unsigned(binAddrInt, anglesAddrBits);
end block generateAnglesVhdl;
-- </generated code>
Listing 5.7: Example of excerpts from the VHDL code auto-generated by Matlab.
ID  Theta-low  Theta-high  Phi-low   Phi-high
1   0          0.110432    0         6.283194
2   0.110432   0.305147    0         0.698133
3   0.110432   0.305147    0.698133  1.396275
4   0.110432   0.305147    1.396272  2.094424
5   0.110432   0.305147    2.094473  2.792531
6   0.110432   0.305147    2.792534  3.490676
7   0.110432   0.305147    3.490673  4.188836
8   0.110432   0.305147    4.188836  4.886933
9   0.110432   0.305147    4.886931  5.585072
10  0.110432   0.305147    5.585074  0
11  0.305147   0.499863    0         0.448823
12  0.305147   0.499863    0.448834  0.897616
13  0.305147   0.499863    0.897674  1.346428
14  0.305147   0.499863    1.346454  1.795273
15  0.305147   0.499863    1.795293  2.242956
[...]
Listing 5.8: Example of excerpt from an input text file defining the bin limits. The ID column contains the bin numbers. Angles are given in radians.
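\[ \tan\theta = \frac{\sqrt{x^2 + y^2}}{z} \tag{5.3a} \]
\[ \tan\phi = \frac{y}{x} . \tag{5.3b} \]
Solving for θ and ϕ yields
\[ \theta = \arctan\frac{\sqrt{x^2 + y^2}}{z} \tag{5.4a} \]
\[ \phi = \arctan\frac{y}{x} . \tag{5.4b} \]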
According to (5.2), the coordinates (x, y, z) reaching the angles unit carry no sign information and therefore expressions (5.3) and (5.4) are valid only in the first octant.
The angles unit is reached by coordinates (x, y, z), originating from a look-up in the LUT of the speedCrit unit, whereas the bin of a pair is determined by limits in θ and ϕ. Going from cartesian to spherical coordinates, that is, implementing equations (5.4), in logic is difficult: none of square root, division or trigonometric functions is easily implemented in hardware.
Hence the expressions (5.4) have been transformed to what is here called the quotients, named Tq and Pq: quantities that avoid the square root and arctangent functions. The logic need only calculate the transformed expressions, the quotients, as long as the limits to be compared to, defining the bin limits, are transformed accordingly.
The transformation operations are:
• For Expression (5.4a): apply tan, then square,
• For Expression (5.4b): apply tan.
This gives the transformed quotient quantities Tq and Pq:
\[ T_q := \tan^2\theta = \frac{x^2 + y^2}{z^2} \tag{5.5a} \]
\[ P_q := \tan\phi = \frac{y}{x} \tag{5.5b} \]
The right-hand sides of expressions (5.5), to be implemented in the logic, have no square root or trigonometric function operations. The division must however still be implemented, which has proved possible using the hardware divider previously mentioned. The left-hand sides are the pre-calculated constants, corresponding to the θ and ϕ limits for the bins, to be used in the comparators.
Since x, y and z carry no sign information, expressions (5.5) carry no information on the octant of the coordinate. This is handled by determining octants separately and offsetting Tq and Pq with constants, and doing the same for the limit constants.
Re-mapping the quadrants to modified quotients
For Tq, a point below the xy-plane gives the same value as the point mirrored to the positive side of the xy-plane. This can be seen in the graph of Tq in Figure 5.10. Since there is a singularity for Tq at θ = π/2, Tq has a maximum allowed magnitude Tq,max (Tq,max = 50 in the figure).
To preserve octant information, Tq is modified to the quantity Tq,mod according to:
\[ T_{q,mod} = \begin{cases} T_q , & \theta \in [0, \pi/2[ \\ 2T_{q,max} - T_q , & \theta \in [\pi/2, \pi[ \end{cases} \tag{5.6} \]
with Tq still being limited to Tq,max. Hence, Tq,mod is Tq mirrored at θ = π/2, or, in other words, mirrored in the xy-plane, as desired.
Pq is modified to the quantity Pq,mod according to:
\[ P_{q,mod} = \begin{cases} P_q , & \phi \in [0, \pi/2[ \\ 2P_{q,max} - P_q , & \phi \in [\pi/2, \pi[ \\ 2P_{q,max} + P_q , & \phi \in [\pi, 3\pi/2[ \\ 4P_{q,max} - P_q , & \phi \in [3\pi/2, 2\pi[ \end{cases} \tag{5.7} \]
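The two re-mappings can be written out as a small floating-point model. This is a sketch of equations (5.6) and (5.7) only: the function names are hypothetical, the signed components dx, dy, dz of the pair's direction vector are assumed to be available for deciding the quadrant, and the handling of the degenerate case dx = dy = 0 is an arbitrary choice.

```python
def tq_mod(dx, dy, dz, tq_max):
    """Equation (5.6): mirror Tq about theta = pi/2 when dz is not positive."""
    if dz == 0:
        tq = tq_max  # singularity at theta = pi/2, saturate
    else:
        tq = min((dx * dx + dy * dy) / (dz * dz), tq_max)
    return tq if dz > 0 else 2 * tq_max - tq


def pq_mod(dx, dy, pq_max):
    """Equation (5.7): offset Pq by quadrant so that it grows with phi."""
    if dx == 0 and dy == 0:
        return 0.0  # phi undefined; arbitrary choice for this sketch
    if dx == 0:
        pq = pq_max  # singularity at phi = pi/2 and 3*pi/2, saturate
    else:
        pq = min(abs(dy) / abs(dx), pq_max)
    if dx > 0 and dy >= 0:   # phi in [0, pi/2[
        return pq
    if dx <= 0 and dy > 0:   # phi in [pi/2, pi[
        return 2 * pq_max - pq
    if dx < 0 and dy <= 0:   # phi in [pi, 3*pi/2[
        return 2 * pq_max + pq
    return 4 * pq_max - pq   # phi in [3*pi/2, 2*pi[


# One direction per quadrant, using the plotting value Pq,max = 7.
for dx, dy in ((1, 1), (-1, 1), (-1, -1), (1, -1)):
    print((dx, dy), pq_mod(dx, dy, pq_max=7))
```

With these definitions Pq,mod is non-decreasing in ϕ over the full turn and Tq,mod is non-decreasing in θ, which is what allows the bins to be selected with simple magnitude comparators.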
[Figure 5.10: Tq and Tq,mod as functions of θ, Tq here saturated at Tq,max = 50. Left panel: Tq := tan^2(θ) with Tq,max = 50. Right panel: Tq,mod := 2·Tq,max − Tq for θ ≥ π/2, Tq for θ < π/2. The graphs illustrate the transformation (5.6).]
The transformation is illustrated in Figure 5.11.
Some notes:
• The reason for allowing only even nθ when laying out the bins on the unit sphere, as described in Section 4.5.4, leading to no θ-limit at θ = π/2, is now clear: since several θ’s around π/2 map to the same Tq,mod (see e.g. Figure 5.10), accuracy for such a θ-limit would be significantly lower than for other θ-limits.
• Tq and Pq are typically not integers: e.g. in a binning with 344 bins and 17 θ-slices, the first four θ-limits correspond to Tq’s less than 1. Internally, the unsigned bit vectors representing Tq and Pq are shifted left (actually, the numerators are shifted before division), allowing for a number of fractional bits. 12+12 bits (integer part + fractional part) are used for Tq and 8+8 bits for Pq. More bits are required for Tq than for Pq since Tq is the result of a division of squared quantities.
• The values used for Tq,max and Pq,max when plotting the graphs in Figures 5.10 and 5.11, Tq,max = 50 and Pq,max = 7, were chosen only to provide intuitive graphs. The values actually used in the implementation are Tq,max = 2^12 − 1 = 4095 and Pq,max = 2^8 − 1 = 255.
5.7.7 The preHistBuffer unit
[Figure 5.11: Pq and Pq,mod as functions of ϕ, Pq here saturated at Pq,max = 7. In the Pq,mod panel the 1st to 4th quadrants are marked, with levels Pq,max, 2Pq,max, 3Pq,max and 4Pq,max. The graphs illustrate the transformation (5.7).]
Buffering before the histStage unit might be needed for two reasons:
1. The histStage unit has a limited number of histogramming resources and might, for unfortunate (and improbable) combinations of input time stamps, get busy and temporarily be unable to receive new pairs for processing. If this happens, a stall signal is sent to the pairProd unit, making it stop producing pairs, but there will be pairs in the pipeline between pairProd and preHistBuffer which will be sent on and need to be buffered.
2. Because of the pipelined structure of the histogram building in histStage, pairs with equal bin numbers must never be closer together than four clock cycles, or some of those pairs would be lost. Therefore, the histStage unit has a sub-unit called pairSparser that spaces out pairs that have the same bin number and arrive too close together. When this happens, a received pair needs to be held for one to three clock cycles. During that time, histStage is unable to receive new pairs, and hence incoming pairs to preHistBuffer need to be buffered.
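The four-cycle spacing rule in reason 2 can be illustrated with a simple scheduling model. This is not the pairSparser logic itself: the function name is hypothetical, and the model idealises the input by assuming that pair i is available from clock cycle i onwards and that at most one pair is passed on per cycle.

```python
def sparse_schedule(bins, min_gap=4):
    """Return the clock cycle at which each pair is passed on.

    Pairs stay in order, at most one pair is passed on per cycle, and two
    pairs with the same bin number are kept at least min_gap cycles apart.
    """
    last_cycle_for_bin = {}
    schedule = []
    next_free = 0
    for arrival, bin_no in enumerate(bins):
        cycle = max(arrival, next_free)
        if bin_no in last_cycle_for_bin:
            cycle = max(cycle, last_cycle_for_bin[bin_no] + min_gap)
        schedule.append(cycle)
        last_cycle_for_bin[bin_no] = cycle
        next_free = cycle + 1
    return schedule


print(sparse_schedule([5, 6, 5, 5, 7]))
```

In this made-up sequence the second pair with bin number 5 is held for two cycles and the third for additional cycles, while pairs with other bin numbers pass as soon as a slot is free. Every cycle of holding delays all following pairs, which is why buffering is needed upstream.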
preHistBuffer is controlled by a state machine. A general buffer unit isinstantiated for the actual buffer.
A ModelSim screen shot of some signals in preHistBuffer during simula-tion of the TE is shown in Figure 5.12.
In a real case, it is very seldom that preHistBuffer must buffer hitsbecause of Case 1. above. Case 2. might not be that unlikely, however,it typically only requires buffering for one clock cycle. In this simulation,the allowSend signal was tweaked, so that buffering takes place much moreoften than is probable in a real case.
The signal alloc is no actual signal of the unit, it is provided for simula-tion showing the buffer allocation level. The dashed line at the top indicates
CHAPTER 5. THE IMPLEMENTATION 57
Figure 5.12: Allocation level and some other signals of the buffer in preHist-Buffer during a simulation of the TE.
the 100% (full) level.histStage shows with the signal allowSend whether preHistBuffer is al-
lowed to send any pairs on to histStage. It might go low for reason 1. or 2.as stated above.
When allowSend=0 is detected, buffering is started and stallPairProd is immediately pulled high. stallPairProd is connected to pairProd and affects its state machine so that no pairs are produced while stallPairProd is high. At that point the pairs still in the pipeline between pairProd and preHistBuffer start to be buffered, and the buffer allocation level rises, as indicated in the figure. The buffer allocation always reaches a certain maximum, depending on how many pairs were in the pipeline, and stays at that level until allowSend becomes high again and the buffer can be emptied.
A close-up of the buffering phase at time 8000 in Figure 5.12 is found in Figure 5.13.
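The rise and plateau of the allocation level can be reproduced with a small cycle-by-cycle toy model. The Python sketch below is only an illustration and not the VHDL state machine: the function name, the pipeline depth of four stages and the rate of one pair per clock cycle are assumptions made for the example.

```python
from collections import deque

def simulate_prehist_buffer(allow_send, pipeline_depth=4):
    """Toy model of preHistBuffer; returns the FIFO level after each cycle.

    allow_send: one 0/1 value per clock cycle, as driven by histStage.
    pipeline_depth: hypothetical number of register stages between
    pairProd and preHistBuffer (the real depth is not stated in the text).
    """
    in_flight = deque([None] * pipeline_depth)  # pairs on their way in
    fifo = deque()
    levels = []
    next_pair = 0
    for allowed in allow_send:
        # stallPairProd follows allowSend immediately: no new pair when stalled.
        if allowed:
            in_flight.append(next_pair)
            next_pair += 1
        else:
            in_flight.append(None)
        arriving = in_flight.popleft()
        if arriving is not None:
            fifo.append(arriving)       # pair reaches preHistBuffer
        if allowed and fifo:
            fifo.popleft()              # one pair per cycle is sent on
        levels.append(len(fifo))
    return levels

levels = simulate_prehist_buffer([1] * 6 + [0] * 8 + [1] * 6)
print(levels)
```

In this model the level rises by one per cycle until the pipeline has drained, stays at the pipeline depth while allowSend is low, and falls again once allowSend returns, which is the shape described above.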
5.7.8 The histStage unit
The histStage unit is responsible for histogramming the pairs and reporting time windows with tracks meeting the trigger condition.
The histStage can be said to consist of three stages connected in series (see Figure 5.14): pairSparser, 15 cascaded histogram units and detTrackDataCollector.
pairSparser is responsible for making sure that no pairs with the same bin number are closer than four clock cycles apart. This is necessary since the pipelined layout of histCollector in the histogram unit will miss pairs with the same bin that are closer than that.
15 instances of the histogram unit perform the actual histogramming. Each unit has two stages: collect and read-out. During collect, incoming pairs are stored in histograms represented by memory (BlockRAM). During read-out, the memory is scanned through to see if any histogram cell (memory address, bin) meets the trigger condition. If at least one does, the time window is reported along with details on the detected track(s). A ModelSim screen shot in Figure 5.17 on page 62 shows the states of the instances during a simulation. The unit is explained in more detail further down.
detTrackDataCollector receives data on time windows with tracks from all the histogram instances and reports them on.
Figure 5.13: Close-up of the buffering phase at time 8000 in Figure 5.12.
Figure 5.14: Overview of the histStage unit. Blocks: pairSparser, histogram, detTrackDataCollector. Signals: anglesIn, anglesInSparse, readyToSparser, detTrackPkg.
Figure 5.15: Overview of (one instance of) the histogram unit. Blocks: logic, state machine, collector, reader, clearer, memory. Signals: startThis, startNext, anglesIn, detTrackPkg, ce.
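The spacing rule enforced by pairSparser can be illustrated with a few lines of Python. This is a behavioral sketch under our own assumptions (one pair offered per clock cycle, a held pair also delaying all later pairs); the function name is hypothetical and the code does not describe the actual VHDL.

```python
def sparse_pairs(bins, min_gap=4):
    """Return the clock cycle at which each pair is passed on.

    bins: bin numbers of the incoming pairs, in arrival order.
    Pairs with the same bin number end up at least min_gap cycles apart.
    """
    last_sent = {}          # bin number -> cycle of the most recent pair sent
    cycle = 0
    schedule = []
    for b in bins:
        earliest = last_sent.get(b, -min_gap) + min_gap
        cycle = max(cycle, earliest)    # hold this pair (and all later ones)
        schedule.append(cycle)
        last_sent[b] = cycle
        cycle += 1
    return schedule

print(sparse_pairs([5, 5, 7, 5, 7]))    # [0, 4, 5, 8, 9]
```

In the example, the second pair would have been sent at cycle 1 but is held for three cycles, and the fourth pair is held for two cycles, in line with the "one to three clock cycles" stated above.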
The histogram unit
As mentioned, there are 15 cascaded instances of the histogram unit. The unit's internal structure is sketched in Figure 5.15, and the VHDL code to generate the instances is shown in Listing 5.9 (see footnote 7).
The idea is that all pairs are fed to all histogram instances. Each histogram instance is then responsible for making sure that only the desired pairs are processed.
Footnote 7: The real implementation is somewhat different: a VHDL generic is set for one of the histogram instances, making it start collecting pairs at the first pair received, so that the whole chain of histograms comes into operation after a reset.
    -- Generate the histogram instantiations:
    histGen: for i in 0 to noOfHists-1 generate
        histogram_inst: entity work.histogram
        port map (
            anglesInTimeSt => anglesInSparse,
            startThis      => startThisVec(i),
            startNext      => startNextVec(i),
            stopFeeding    => stopFeedingVec(i),
            detTrackPkg    => detTrackPkgVec(i),
            gotValidData   => gotValidDataVec(i),
            clk            => clk,
            rst            => rst
        );
        startThisVec(i) <= startNextVec(i-1);
    end generate histGen;
    startThisVec(0) <= startNextVec(noOfHists-1);
Listing 5.9: The VHDL code used to generate the instances of the histogram unit. The noOfHists constant has the value 15.
The work flow of a histogram unit is as follows:
1. Wait for a startThis pulse from the previous instance.
2. On the startThis pulse (see Figure 5.16 for the startThis/startNext signals along with the states),
• store timeStamp2 of the simultaneous pair in the startedAt register, so that the histogram keeps track of the beginning of its time window,
• start collecting incoming pairs into the memory, building up the histogram (see footnote 8).
3. At the first pair with timeStamp2 greater than startedAt + 1 µs, send a startNext pulse (fed to the startThis input of the next histogram).
4. At the first pair with timeStamp1 greater than startedAt + 5 µs (or the time window width, Tw), stop collecting pairs and start the read-out phase.
5. Read out the histogram in the memory by addressing all the memory cells (bins) and checking each bin for the trigger condition:
• store each bin conforming to the trigger condition, but at most 10 of them, in temporary registers,
• if startThis is received during the read-out phase, histStage is overloaded: the stopFeeding output port is pulled high, which in turn causes histStage to pull stallPairProd high, making the pairProd unit stop producing pairs (pairs between pairProd and histStage are buffered in preHistBuffer) (see footnote 9).
6. When all bins have been read, if at least one conformed to the trigger condition, report the time window to the detTrackDataCollector unit by laying out all the appropriate information on the detTrackPkg output port and pulling gotValidData high.
7. Enter the wait phase and start again at 1.
Footnote 8: Only pairs with both time stamps larger than startedAt are collected; other pairs are filtered away since they do not belong to the time window of the histogram.
Figure 5.16: States and startThis/startNext signals for three of the 15 histogram instances.
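The steps above can be summarized in a short behavioral model of one collect/read-out cycle. The Python sketch below is an illustration only: the function name and the data layout are ours, time stamps are given in µs, the read-out is treated as instantaneous, and the filter of footnote 8 is written with "greater than or equal" so that the starting pair itself is counted (whether the hardware comparison is strict is not stated here).

```python
from collections import Counter

def run_histogram_unit(pairs, n_thresh=4, max_report=10,
                       tw_delta=1.0, tw_width=5.0):
    """Model one time window of a histogram unit.

    pairs: list of (timeStamp1, timeStamp2, bin_no) in arrival order, the
    first pair being the one that coincides with the startThis pulse.
    Returns (index of the pair that triggers startNext, reported bins).
    """
    started_at = pairs[0][1]                 # step 2: startedAt <- timeStamp2
    counts = Counter()
    start_next_index = None
    for i, (ts1, ts2, bin_no) in enumerate(pairs):
        if ts1 > started_at + tw_width:      # step 4: leave the collect phase
            break
        if start_next_index is None and ts2 > started_at + tw_delta:
            start_next_index = i             # step 3: startNext pulse
        if ts1 >= started_at and ts2 >= started_at:   # footnote 8 (assumed >=)
            counts[bin_no] += 1
    # Step 5: scan all bins, keep at most max_report that meet the condition.
    reported = [(b, n) for b, n in sorted(counts.items()) if n >= n_thresh]
    return start_next_index, reported[:max_report]

pairs = [(0.3, 0.2, 12), (0.6, 0.3, 12), (0.9, 0.1, 40), (1.5, 1.3, 12),
         (2.5, 2.0, 12), (4.0, 3.0, 7), (5.5, 5.0, 12)]
print(run_histogram_unit(pairs))   # (3, [(12, 4)])
```

In the example, the third pair is filtered away because its timeStamp2 precedes startedAt, the fourth pair triggers startNext, and the last pair ends the collect phase; the time window is reported (step 6) because the list of reported bins is not empty.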
A ModelSim screen shot from a simulation of histStage showing the states of the 15 histogram instances can be found in Figure 5.17.
The functionality described above is implemented by (again, see also Figure 5.15)
• logic for the signals startThis, stopFeeding, startNext, the register startedAt and some other signals read by the state machine,
• a state machine keeping track of the current state (wait, collect or read-out),
• the units collector and reader, enabled by the state machine through a chipEnable signal during the states collect and read-out respectively, and given access to the memory bus through tri-state buffers for as long as the state machine asserts their chipEnable,
• a clearer unit, also connected to the memory bus through tri-state buffers, clearing the memory contents during reset,
• the memory bus that connects the memory of the histogram to the collector, reader and clearer units, and
• the memory, implemented as dual-port BlockRAM, with the same number of addresses as the number of bins, each memory cell containing a counter of the number of pairs for that bin.
Footnote 9: stopFeeding is not pulled high if startThis is received during the collect phase. That situation also means that histStage is overloaded (even more severely), but pulling stopFeeding high would immediately stop pairs from being sent on into the histogram units. No pair with time stamps satisfying the startedAt + 5 µs condition would then be received, so the collect phase of the histogram unit would never be left: the unit would be stuck and would stall the whole TE.
Figure 5.17: States of the 15 histogram instances during a simulation.
Note about overlapping time windows
With time windows 5 µs wide and TWdelta = 1 µs, an event not spread at all in time would ideally be detected five times. This need not be the case, however, because of the way new time windows are triggered in the implementation. If a first time window starts at t = 0, the next time window starts at the first pair whose smallest time stamp (timeStamp2) meets the criterion > 1 µs. That pair might have a timeStamp2 well above 1 µs. Under such conditions, some events will be covered by only three or four time windows.
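The effect can be illustrated numerically. The Python sketch below is a simplification under our own assumptions: it takes an already sorted list of timeStamp2 values in µs, starts a new window at the first time stamp more than TWdelta after the previous window start, and counts how many windows contain a given event time. Both function names are hypothetical.

```python
def window_starts(stamps, tw_delta=1.0):
    """Start times of the chained time windows for sorted time stamps (µs)."""
    starts = [stamps[0]]
    for t in stamps[1:]:
        if t > starts[-1] + tw_delta:   # first stamp beyond start + TWdelta
            starts.append(t)
    return starts

def windows_covering(starts, t_event, tw_width=5.0):
    """Number of windows [start, start + Tw] that contain t_event."""
    return sum(1 for s in starts if s <= t_event <= s + tw_width)

dense = [0.25 * k for k in range(60)]               # a pair every 0.25 µs
sparse = [0.0, 1.6, 3.2, 4.8, 6.4, 8.0, 9.6]        # a pair every 1.6 µs
print(windows_covering(window_starts(dense), 7.5))  # 5
print(windows_covering(window_starts(sparse), 6.4)) # 4
```

With densely spaced time stamps an event can still be covered by five windows, while with sparser time stamps the window starts drift apart and the same kind of event is covered by fewer windows.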
5.8 Comments on the size of TWbuffer
According to Section 4.3.1 on page 13, the average number of hits in a time window, nh,tw, is approximately nh,tw = fhTw = 13, these hits resulting from the PMT noise. At a muon event in the detector, especially a high-energy muon event, we expect a much larger number of hits in a time window.
The number of hits in TWbuffer at any time, for an infinitely large TWbuffer, is the number of hits from the oldest hit and one time window on. We must limit the number of hits in TWbuffer. If we did not, then for high-energy events, which the TE is not designed to handle anyway, we would be overloaded with pairs within one single time window (the number of pairs grows with the square of the number of hits). The more hits in TWbuffer, the more seldom we can let a new hit enter currentHitReg; that is, we deal with new incoming hits at a slower pace.
A very low limit on the number of hits in TWbuffer, on the other hand, will make it harder to distinguish low-energy events.
The size used for the TWbuffer is 31 hits. This is small enough not to get stuck with a large number of hits/pairs in the same time window, as will be shown below.
5.8.1 Motivation
The situation we want to avoid is having TWbuffer filled without being able to deal with new incoming hits fast enough. We argue that, in the worst case of TWbuffer being full, we should still be able to deal with new incoming hits at the pace we expect them to appear, with a considerable margin. That pace is about 2.6 MHz.
Call the size of the TWbuffer ntw,max. For each new hit going into currentHitReg, we must complete one turn in the state machine, one such turn taking ntw,max + nextra clock cycles (see Section 5.7.4 for nextra). With clock frequency fc we have
$$f_h = \frac{f_c}{n_{tw,max} + n_{extra}} \qquad (5.8)$$
with fh being the pace at which we can deal with new incoming hits. With clock frequency fc = 180 MHz, ntw,max = 31 and nextra = 7 we get fh ≈ 4.7 MHz, that is, not far from twice the expected hit rate of 2.6 MHz. This is a satisfactory margin.
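As a quick numerical check of (5.8), the Python sketch below evaluates the expression for the buffer size used here and for the larger size of 63 discussed in Section 5.8.2. The function name is ours and purely illustrative.

```python
def max_hit_rate_mhz(f_clk_mhz, n_tw_max, n_extra):
    """Hit rate (MHz) the state machine can sustain according to (5.8)."""
    return f_clk_mhz / (n_tw_max + n_extra)

for n_tw_max in (31, 63):
    print(n_tw_max, round(max_hit_rate_mhz(180.0, n_tw_max, 7), 2))
# 31 4.74
# 63 2.57
```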
A Matlab script mimicking the behavior of the pairProd unit was developed. The results of runs with two sizes of TWbuffer can be seen in Figure 5.18. The simulation does not take into account that the clock frequency fc is limited. Nor does it take into account that ηcyc ≠ 1, i.e., that nextra ≠ 0.
It can be seen that the relation between hit and pair frequency follows the analytical Expression (4.1) on page 18. Furthermore, a knee is clearly visible at the hit frequency where TWbuffer becomes full. This marks the point where the quadratic behavior is replaced by a linear one.
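The quadratic and linear regimes can be reproduced with a few lines of Python. The sketch below is not the Matlab script referred to above; it is a minimal stand-in that assumes hits arriving as a Poisson process, a time window of 5 µs, and a TWbuffer that drops its oldest hit when full. Like the Matlab simulation, it ignores the clock frequency limit. All names are ours.

```python
import random
from collections import deque

def simulate_pair_rate(f_hit, n_tw_max, t_w=5e-6, n_hits=20000, seed=1):
    """Return the simulated pair frequency (Hz) for hit frequency f_hit (Hz)."""
    rng = random.Random(seed)
    buf = deque(maxlen=n_tw_max)      # oldest hit is dropped when the buffer is full
    t = 0.0
    pairs = 0
    for _ in range(n_hits):
        t += rng.expovariate(f_hit)   # assumed Poisson arrivals
        while buf and buf[0] < t - t_w:
            buf.popleft()             # hit has left the time window
        pairs += len(buf)             # new hit is paired with every buffered hit
        buf.append(t)
    return pairs / t

for f_mhz in (2, 4, 8, 12):
    fp = simulate_pair_rate(f_mhz * 1e6, n_tw_max=31)
    print(f_mhz, "MHz hits ->", round(fp / 1e6, 1), "MHz pairs")
```

Under these assumptions the pair frequency follows Tw·fh² at low hit rates and approaches ntw,max·fh once the buffer is permanently full; the two expressions meet at fh = ntw,max/Tw, which is 6.2 MHz for ntw,max = 31.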
Figure 5.18: Pair frequency fp (MHz) plotted against hit frequency fh (MHz) for two (Matlab) simulations with different values for the ntw,max parameter (size of TWbuffer), 31 and 42. The analytical expression (4.1), fp = Tw·fh², is also plotted with a solid line. ntw,max = 31 is the value actually used in the implementation; the greater value is plotted to show how the knee point moves when the size of TWbuffer grows. For both simulations, 2000 hits were used. An annotation in the plot reads "System real-time, 180 MHz".
5.8.2 Consequences of larger TWbuffer
If one feels that a maximum of 31 hits in one time window is too few, raising ntw,max to 63 gives an fh of about 2.6 MHz, that is, the pace at which we assume the hits to appear with only noise events. This would be a gamble, however, since there would be a risk of all the buffers filling up and the TE never being able to work through all the hits and catch up with real time again.
5.8.3 Behavior when TWbuffer is full
When TWbuffer is full and a new hit is about to be written to it, the oldest hit in the buffer is deleted to make room for the new hit. This effectively means that the time window is shrunk: the oldest hit is removed although it is not too old. For a particle traveling through the detector and generating hits, there will then be no pair consisting of the first and the last hit (in time, and geometrically); instead, the pairs that appear will be pairs between hits in subsections of the track.
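The consequence for a long track can be made concrete with a small Python sketch. It is an illustration under simplifying assumptions: all hits are taken to lie within one time window, so only the size limit removes hits, and the function name is ours.

```python
from collections import deque

def produced_pairs(hit_ids, n_tw_max):
    """Pairs (older hit, newer hit) formed with a size-limited TWbuffer."""
    buf = deque(maxlen=n_tw_max)    # appending to a full deque drops the oldest hit
    pairs = []
    for h in hit_ids:
        pairs.extend((old, h) for old in buf)
        buf.append(h)
    return pairs

pairs = produced_pairs(range(40), n_tw_max=31)
print((0, 39) in pairs, (8, 39) in pairs)   # False True
```

For a track of 40 hits and a buffer of 31, the last hit is only paired with the 31 hits preceding it, so the pair of the first and the last hit never appears.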
5.9 BlockRAM usage
Figure 5.19: BlockRAM usage, 231 Kbit in total (5300 Kbit are available in the XC5VFX70T Virtex-5 FPGA).
Figure 5.19 shows the usage of the BlockRAM resources in the FPGA.
TWbuffer is the buffer in the pairProd unit used when combining hits into pairs.
histogram is the set of histograms, realized in RAM, with one histogram for each of the 15 histogram units contained in the histStage sub-unit.
preHistBuffer FIFO is the FIFO buffer in preHistBuffer used to receive and store pairs in the pipeline if histStage stalls.
coordinate LUT is a ROM containing the look-up table with the coordinates of all the DOMs, used in the speedCrit unit (the looked-up coordinates are also sent on to the angles unit and used there).
5.10 Synthesis
The final design synthesizes without errors.
The timing bottleneck of the design is the next-state logic in the state machine in the pairProd unit. This is described and commented on in Section 5.7.4 on page 46.
5.10.1 Synthesis options
The optimization goal setting used for the synthesis was speed.
Options allowing the synthesis tool to move flip-flops and push them through combinatorial logic, including duplicating flip-flops if necessary (the options register balancing, register duplication and move flip flops), were activated. This made it possible to describe logic that is to be pipelined simply by describing the logic as one big combinatorial unit and then adding a number of flip-flops at the inputs or outputs of the block. The synthesis tool then pushed the flip-flops into the logic, pipelining it.
5.10.2 Area, logic utilization
The utilization summary, included in the Synthesis Report, from a synthesis of the whole design is found in Listing 5.10.
In short, it shows that about 50% of the logic in the FPGA is used.
5.11 Testing with “TEtest”
For testing of individual units of the TE and of the whole TE, a set of VHDL and Matlab code, commonly referred to as “TEtest”, has been developed. TEtest allows for input and output of data/signals to/from the TE units by text files. Text files for input are generated by Matlab scripts, and the output text file is read and analyzed by Matlab scripts.
    Device utilization summary:
    ---------------------------

    Selected Device : 5vsx50tff1136-1

    Slice Logic Utilization:
     Number of Slice Registers:            18454  out of  32640    56%
     Number of Slice LUTs:                 17698  out of  32640    54%
        Number used as Logic:              16282  out of  32640    49%
        Number used as Memory:              1416  out of  12480    11%
           Number used as SRL:              1416

    Slice Logic Distribution:
     Number of LUT Flip Flop pairs used:   24938
       Number with an unused Flip Flop:     6484  out of  24938    26%
       Number with an unused LUT:           7240  out of  24938    29%
       Number of fully used LUT-FF pairs:  11214  out of  24938    44%
       Number of unique control sets:        578

    IO Utilization:
     Number of IOs:                          328
     Number of bonded IOBs:                  328  out of    480    68%
        IOB Flip Flops/Latches:              321

    Specific Feature Utilization:
     Number of Block RAM/FIFO:                22  out of    132    16%
        Number using Block RAM only:          22
     Number of BUFG/BUFGCTRLs:                 2  out of     32     6%
Listing 5.10: Utilization summary, excerpt from the Synthesis Report generated during a synthesis of the whole design.
Figure 5.20: Testing with TEtest. In this example, a test of the speedCriteria unit is shown. A Matlab script generates the input data (and the expected output) as an ASCII text file; the outputFile VHDL unit feeds it as pairs to the speed criteria unit in a ModelSim simulation; the writeFile VHDL unit writes the resulting pairs to an ASCII text file; and a Matlab script analyzes the output, e.g. by comparing it to the expected output.
There are two VHDL units for TEtest:
1. outputFile, for reading a text file and outputting its contents to the unit's output ports, and
2. writeFile, for writing the signals sent to the unit's input ports to a text file.
An overview of the VHDL units and how the Matlab scripts are used can be found in Figure 5.20.
The formats of the input and output text files are similar: the signals are listed as columns and the data as rows in ascending time order, each row being output at one clock pulse. The data values are written in hex.
The outputFile unit can output either one row at each clock cycle or rows paced by a ready/writeEnable interface to the unit it is connected to. The same goes for writeFile.
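To make the file format concrete, the Python sketch below writes and reads such a file: one row per clock pulse, one column per signal, values in hex. It is only an illustration and not part of TEtest, whose scripts are written in Matlab; the whitespace separator, the absence of a header row and the function names are assumptions.

```python
import io

def write_rows(stream, rows):
    """Write one line per row, each signal value as an upper-case hex number."""
    for row in rows:
        stream.write(" ".join(format(value, "X") for value in row) + "\n")

def read_rows(stream):
    """Read the rows back as lists of integers, skipping empty lines."""
    return [[int(token, 16) for token in line.split()]
            for line in stream if line.strip()]

rows = [[0x1A2B, 3, 255], [0x1A2C, 4, 0]]
text_file = io.StringIO()          # stands in for an ASCII text file on disk
write_rows(text_file, rows)
text_file.seek(0)
print(read_rows(text_file) == rows)   # True
```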
Some of the benefits of TEtest are:
• Repeating the test of a unit after e.g. changing a VHDL module is simple.
• Using ASCII text files means that anyone can provide input data and analyze the output.
• The TEtest VHDL modules are general; only the Matlab code is specific to the tested module.
Each sub-unit, and some sub-units of sub-units, have been tested thoroughly. The tests are not documented completely; instead, some examples are shown to illustrate the approach.
Testing of the angles unit
Listing 5.11 shows an excerpt from the Matlab prompt during a test of the angles unit.
98 761 pairs were fed to the angles unit. The pairs are in this case represented by their miscCoords (x, y, z, x2, y2, z2). 98 761 such coordinate sets, corresponding to random angles, were input to the unit.
The Matlab script analyzes the output in a number of ways and writes the result to the prompt. The output says:
• The time stamps were not changed: exactly those time stamps that were input were also found in the output.
• For each output bin number, the corresponding θ and ϕ limits were calculated. They were compared, with a tolerance, to the actual θ and ϕ angles of the coordinates that were input. All pairs passed this test. That means that some bin numbers might have differed from the ideal, but only for angles close to a bin limit.
• The ideal bins were calculated. This was done using 64-bit floating-point representations in Matlab and hence has higher precision than the way the bins are determined in the implementation. Some mismatches are therefore expected. The bins did not match for 342 of the 98 761 pairs (about 0.35%; the script output prints this as 3%). The greatest θ error was 0.05° and the mean θ error, taken over the mismatching bins only, was 0.009°. The corresponding values for ϕ can be found in the output.
With so few pairs resulting in bin mismatches, and with another test having shown that all the mismatches must occur for angles close to the bin limits, much indicates that the unit is working properly. However, if one of the bins or bin limits were responsible for all the errors, we would want to identify that problem.
To reveal any systematic pattern in the errors, the θs and ϕs that led to the errors were histogrammed. The result can be seen in Figure 5.21.
There is some systematic pattern among the errors, most significantly in ϕ. It is not that prominent, however, and it has not been investigated further. Some systematic pattern is of course expected, since the resolution the implementation can give for e.g. the ϕ angle depends on the angle. This is for example obvious since the re-mapping functions are non-linear and hence give different resolution for different angles (Figure 5.11 on page 56).
    >>
    Wrote 98761 rows to 'A_in.txt' (8493.863 kB).
    Saved input dataset to 'A_11.mat'.
    Execution time, data generation (pre-simulation): 1 min 26.5 sec
    Run simulation (98761 pairs), hit enter to read output file.
    Read 98761 rows from 'A_out.txt' (2864.286 kB).
    OK: time stamps.
    OK: theta-expected compared to theta-limits from output Bin No.'s.
    OK: phi-expected compared to phi-limits from output Bin No.'s.
    Note: There were 342 (3%) Bin No. mis-matches.
      no. of theta limit mis-matches: 100
      max. theta error magnitude:  0.00086934 rad = 0.04981 deg
      mean theta error magnitude:  0.00015581 rad = 0.0089271 deg
      no. of phi limit mis-matches: 242
      max. phi error magnitude:    0.0042899 rad = 0.24579 deg
      mean phi error magnitude:    0.0011491 rad = 0.065837 deg
    Execution time, data analysis (post-simulation): 3 min 43.2 sec
Listing 5.11: Excerpt from the Matlab command prompt during a test of the angles unit.
Figure 5.21: Histograms showing which angles generated the bin mis-matches. Two panels: histogram of the θexpected values that generated the θ errors (multiplicity against θ), and histogram of the φexpected values that generated the φ errors (multiplicity against φ).
    >>
    Data will be generated with the following properties:
      no. of TWs:      2500
      noise pairs/TW:  unif[0 10]
      LC's/TW:         unif[0 2]
      tracks/TW:       unif[1 2], unif[1 1]
      pairs/track:     unif[8 12]

    Data will be generated with the following properties:
      no. of TWs:      2500
      noise pairs/TW:  unif[0 10]
      LC's/TW:         unif[3 10]
      tracks/TW:       unif[1 2], unif[1 1]
      pairs/track:     unif[4 7]

    Wrote 167373 rows to 'A_in.txt' (5356.226 kB).
    Saved input dataset to 'A1_5.mat'.
    Execution time, data generation (pre-simulation): 9 min 32.4 sec
    Run simulation (167373 pairs), hit enter to read output file.
    Read 5000 rows from 'A_out.txt' (556.147 kB).
    OK: same number of expected and actual detected tracks.
    OK: expected and actual detected tracks data identical.
    Sim time 19423740 ns
Listing 5.12: Excerpt from the Matlab command prompt during a test of the histogram unit (sub-unit of histStage).
5.11.1 Testing of the histogram unit
Listing 5.12 shows an excerpt from the Matlab prompt during a test of the histogram unit. The histogram unit takes pairs (as bin numbers with time stamps) as input and outputs information on detected tracks.
First, the parameters for the generation of input data are stated. Two separate input data sets were generated.
Each consists of 2500 time windows with tracks. “unif[x y]” means that the parameter was given random values distributed uniformly on the interval [x, y].
It can be seen that the data generation resulted in 167 373 pairs and that the simulation led to the output of 5000 reported time windows. Those were analyzed, and it was found that they describe exactly those tracks that the input data represent.
5.11.2 Testing the whole TE core design (system unit)
Listing 5.13 shows an excerpt from the Matlab prompt during a test of the whole TE Core design.
    >>
    Initialized generate (dataset 2, simName 4), working...
    Tracks will be generated with the following properties:
      no. of tracks:  100 000
      hits:           unif[10 11]
      v:              unif[0.90 0.95]
      track lengths:  unif[400 950]

    Wrote 1 050 184 rows to 'A_in.txt' (18903.548 kB).
    Saved input dataset to 'A2_4.mat'.
    Execution time, data generation (pre-simulation): 33 min 9.4 sec
    Run simulation (1 050 184 hits), hit enter to read output file. >
    Read 116 998 rows from 'A_out.txt' (12870.927 kB).
    Note: 100 000 expected tracks, found 183 165 in output (x1.83).
    Note: first occurrence of an expected track for which no detTrack
          time window was found is for tStMid=65332 (row 6 in
          expDetTracks).
    Note: 93 770 out of 100 000 (93.77%) of the time windows with
          generated tracks had at least one detected track.
    Figures saved to 'hists*_A_4.fig'.
    Execution time, data analysis (post-simulation): 376 min 24.9 sec
Listing 5.13: Excerpt from the Matlab command prompt during a test of the whole TE Core design.
Input data
Some properties of the generated hits and tracks follow:
• 100 000 tracks were generated with 10 or 11 hits (equal probability) for each track. (This hence resulted in some 1 000 000 hits.)
• The velocity of the particle was randomized between 0.90c0 and 0.95c0.
• The track lengths were randomized between 400 and 950 m.
• No noise hits were generated.
• No multiple tracks (more than one track within the same time window) were generated; all tracks were spaced 10 µs apart (2 × TWwidth).
• No vertical tracks were generated (see footnote 10); a limit of 0.05π rad = 9° was used.
For the test, a trigger condition of nthresh = 4 (least number of pairs in a bin in order to trigger) was used.
Output data—efficiency
It is not immediately obvious how to determine from the output data whether the TE Core behaves as expected.
For some tracks, the pairs will have such a large spread in θ and ϕ that no bin will meet the trigger condition. A quantity called the efficiency tells how large a part of the input tracks were actually detected, and this figure does of course give some hint about the functionality of the design.
The efficiency was calculated to be 93.77% for the test run.
Output data—direction information
Some investigation of the direction information was made. It should be remembered that the TE is supposed to work as a trigger rather than to provide direction information. Nevertheless, the quality of the direction information provided by the bin number for detected tracks does say something about the functionality of the algorithm and of the implementation.
For detected tracks, a bin number is provided (translatable to θ and ϕ limits). The true θ and ϕ angles of the input tracks are known. These two angular measures were compared.
We call the expected θ for a track θexp, and similarly ϕexp. The detected θ for a track is called θdet, and similarly ϕdet. For θdet and ϕdet, the centre of the reported bin was used.
Footnote 10: The reason for not generating any vertical tracks is that these cannot be detected anyway. Vertical tracks have hits only within one single string, and all pairs on the same string are filtered away by the sameString filter in the speedCrit unit.
The differences between the expected and detected values for θ and ϕ should be investigated. It makes sense to see the differences as distances on the unit sphere. For the distances in θ and ϕ separately, called εθ and εϕ, we get
$$\varepsilon_\theta = \theta_{exp} - \theta_{det} \qquad (5.9)$$
$$\varepsilon_\varphi = (\varphi_{exp} - \varphi_{det}) \sin\theta_{mid} \qquad (5.10)$$
For θmid,
$$\theta_{mid} = \frac{\theta_{exp} + \theta_{det}}{2}$$
has been used.
The distance on the unit sphere corresponding to these differences, called εθϕ, is
$$\varepsilon_{\theta\varphi} = \sqrt{\varepsilon_\theta^2 + \varepsilon_\varphi^2} \qquad (5.11)$$
The ε quantities given by (5.9), (5.10) and (5.11) were histogrammed.
For one generated track there was often more than one detected track. For each of the three ε quantities, two histograms were generated: one with the ε values for all tracks and one with only the ε values for the strongest track, if there was more than one detected track for a generated track.
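Expressions (5.9) to (5.11) translate directly into code. The Python sketch below is an illustration with a function name of our own choosing; angles are in radians, and differences in ϕ are not wrapped across 2π since (5.10) does not mention wrapping.

```python
import math

def direction_errors(theta_exp, phi_exp, theta_det, phi_det):
    """Return (eps_theta, eps_phi, eps_theta_phi) according to (5.9)-(5.11)."""
    eps_theta = theta_exp - theta_det                     # (5.9)
    theta_mid = 0.5 * (theta_exp + theta_det)
    eps_phi = (phi_exp - phi_det) * math.sin(theta_mid)   # (5.10)
    eps_theta_phi = math.hypot(eps_theta, eps_phi)        # (5.11)
    return eps_theta, eps_phi, eps_theta_phi

# A track near the horizontal plane, detected 0.1 rad off in phi only.
print(direction_errors(math.pi / 2, 1.0, math.pi / 2, 0.9))
```

The factor sin θmid makes a given ϕ difference count for less close to the poles, where lines of constant ϕ lie closer together on the unit sphere.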
The histograms are found in Figure 5.22. The approximate average bin side length is also shown.
The figures show that the direction information provided by the TE is, at least, reasonable.
It is clear that the spread in ϕ is larger than in θ, which is expected because of the layout of the detector: DOMs are spaced 17 m apart in the z-direction, but 125 m apart in the xy-direction.
For εθ and εϕ, the peaks fall reasonably well within the bin widths; for εθϕ, however, the peak does not. This could indicate that the bins should be larger (fewer).
Figure 5.22: Histograms (multiplicity against error in degrees) of εθ (5.9), εϕ (5.10) and εθϕ (5.11) for all detected tracks (top) and for the strongest track (bottom). The (approximate) side length of the bins, 11.16°, and half of it, 5.58°, are also indicated in the graphs. Plot annotation: tracks: 100000, hits/1000m: unif[10 11], track lengths/m: unif[400 950], tracks detected: 93.77%.
5.11.3 “Problematic track” problem
When looking closer at some of the test runs of the whole TE Core, it soon became obvious that a certain kind of situation sometimes appeared, giving a very large spread in the ϕ angle. If that spread could be reduced, the efficiency of the TE could be increased, maybe considerably.
An example track is visualized in Figure 5.23. When studying the plot of the track in the XY-plane, it is obvious that, because of the geometry of the strings in the XY-plane, pairs with ϕ angles matching the true track will be few; instead, the pairs will give rise to peaks at the ϕ angles corresponding to the lines that the string geometry makes up.
Figure 5.24 visualizes the output of a simulation when inputting the track in Figure 5.23. We see that no strong peak in ϕ appears; instead, the pairs are very much spread in ϕ.
Possible ways of handling this problem are discussed in Section 6.4 on page 80.
Figure 5.23: An example of a “problematic track” in XY, XZ, YZ and 3D projection. The DOMs are marked by dots. In the XY projection it can be clearly seen that the peak for ϕ will not be aligned with the true ϕ of the track, but will rather appear at the ϕ values describing the angles of the lines that the strings make up.
Figure 5.24: A visualization of the output (reported detected tracks) of a simulation when inputting the track in Figure 5.23. The cross marks the origin and the line shows the track direction (position vector of the pair). The ring shows where the line crosses the unit sphere. The figures in the bins (only for detected tracks) are the multiplicities of pairs in those bins. Where several time windows have reported the same bin, their respective multiplicities are stated with spaces between them. nthresh = 4 was used for the simulation.
Chapter 6
Results and suggestions
The first three sections below correspond to the aims of the project as described in Section 3.2 on page 7.
6.1 Realizing the TE algorithm in hardware
That the TE algorithm can be implemented with logic in an FPGA has been shown, since such an implementation has been developed within the scope of this Master's Thesis project.
The design meets the performance needed according to Section 4.4.1. Figure 4.6 shows that when clocking the design at 180 MHz, the design can handle a hit frequency of 6 MHz, more than twice the expected average hit frequency of 2.6 MHz.
6.2 Realizing the TE algorithm in software
The possibility of realizing the TE in software was discussed in Section 4.7.1 on page 26.
Realizing the TE in software using several PCs is possible. It is harder to give a clear statement on the possibility of using one single PC. Nothing proves that it would be impossible, but it would certainly be challenging.
6.3 Interfacing the TE to existing systems
The TE needs to interface to the stringHubs and the DAQ as described in Section 4.2.1.
The data rates are kept reasonably low by sending only the necessary information from the stringHubs on to the controlServer. This is explained in Section 5.4.
6.4 Possible improvements of the TE
Studying the situation with the “problematic track”, as described in Section 5.11.3 on page 74, gave rise to two ideas for improving the TE algorithm.
6.4.1 String geometry and coordinate system
It is in the XY-plane that the string geometry exists. Furthermore, the ϕ angle in a spherical coordinate system is defined by projecting the point onto the XY-plane and then finding the angle to that point. It is this coincidence that causes the problem.
By rotating the coordinate system, e.g. 45° around the X-axis, we could remove this coincidence. Practically, this would mean describing the DOM coordinates in the rotated coordinate system. That would be all: no other modification needs to be made to any part of the algorithm or the implementation. The DOM positions are fetched from the DOM coordinate LUT, implemented as a ROM in BRAM. By re-calculating the initialization file for that ROM (multiplying each position by a rotation matrix), the TE implementation would instead work with the rotated coordinate system.
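The effect of the rotation can be illustrated with a minimal sketch. The string positions below are made up for illustration and are not the real IceCube geometry, and the direction of a pair is assumed here to be the difference vector between the two DOM positions. Before the rotation, every pair between the two vertical strings yields the same ϕ, namely the angle of the line through the strings in the XY-plane. After a 45° rotation around the X-axis, ϕ depends on the height difference within the pair.

```python
import math

def rotate_about_x(point, angle_deg):
    """Rotate an (x, y, z) point by angle_deg around the X-axis."""
    x, y, z = point
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return (x, c * y - s * z, s * y + c * z)

def pair_phi_deg(p, q):
    """Azimuth angle phi (degrees) of the vector from p to q."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))

# Hypothetical DOM positions (metres) on two vertical strings.
string_a = [(0.0, 0.0, z) for z in (-100.0, 0.0, 100.0)]
string_b = [(75.0, 100.0, z) for z in (-100.0, 0.0, 100.0)]

# Original coordinate system: phi is fixed by the string geometry alone.
phis_before = [pair_phi_deg(p, q) for p in string_a for q in string_b]

# Rotated coordinate system: this is what re-calculating the ROM
# initialization file would amount to for each stored DOM position.
rot_a = [rotate_about_x(p, 45.0) for p in string_a]
rot_b = [rotate_about_x(p, 45.0) for p in string_b]
phis_after = [pair_phi_deg(p, q) for p in rot_a for q in rot_b]

print(sorted({round(p, 3) for p in phis_before}))
print(sorted({round(p, 3) for p in phis_after}))
```

In the original system the nine pairs collapse onto a single ϕ value, while in the rotated system they spread over several values, so the string geometry no longer produces a peak of its own in ϕ.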
6.4.2 Rectangular-shaped bins
If the spread is larger in ϕ than in θ, then, intuitively, it would make sense to make the bins rectangular rather than square. The proportions of the side lengths should correspond to the differences in the spread, measured by, say, the standard deviations of the distributions.
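As a rough sketch of how such proportions could be derived, the function below scales the two bin sides by the ratio of the standard deviations. The function name, the sample values and the choice to keep the bin area equal to that of the original square bin are all assumptions made for this illustration; the text above does not prescribe them.

```python
import math
import statistics

def rectangular_bin_sides(theta_dev, phi_dev, square_side_deg):
    """Return (theta_side, phi_side) in degrees.

    theta_dev, phi_dev: samples of angular deviation from the true track
    direction (hypothetical input, e.g. taken from a simulation).
    The sides are chosen so that phi_side / theta_side equals the ratio
    of the standard deviations, while the bin area stays equal to
    square_side_deg ** 2 (an assumption of this sketch).
    """
    ratio = statistics.pstdev(phi_dev) / statistics.pstdev(theta_dev)
    theta_side = square_side_deg / math.sqrt(ratio)
    phi_side = square_side_deg * math.sqrt(ratio)
    return theta_side, phi_side

# Made-up deviations: the spread in phi is four times that in theta.
theta_side, phi_side = rectangular_bin_sides([-1.0, 1.0], [-4.0, 4.0], 10.0)
print(theta_side, phi_side)
```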
6.5 The borderline events problem
The inability to handle borderline events (Section 4.8, page 28) remains a major problem of the TE. The exact scope of the problem is not known, though: no simulations have been made that quantify how much efficiency is lost because of borderline events.
I sketch out two strategies that might be investigated in order to find a way to handle the borderline events:
1. Several FPGA chips with the TE are used in parallel, the only difference being that the bins are offset differently, in such a way that for each point on the unit sphere there is at least one FPGA chip whose bin limits are not in the immediate vicinity of the point. The number of FPGA chips used could be 4.¹
There is no change on the input side: all hits are fed to all FPGA chips. The merged output of the chips, i.e., the reported time windows, will typically contain several duplicates. The controlServer or the DAQ will have to identify and remove them. That shouldn’t be a problem, since it needs to be done anyway: the overlapping time windows also typically give rise to duplicates.
¹ One offset by 1/2 of the length of a bin side West, one by 1/2 North, and one by 1/2 of a bin diagonal Northwest, in an imagined 2D-projection.
2. Each pair is placed first in its calculated bin, and thereafter also in several adjacent bins.
This results in a kind of “blurred” direction information, which should reduce the borderline events problem.
Since more pairs are spread in the histograms, it follows that the threshold values for triggering a track will have to be increased.
The TE core design would need significant modifications. Those would be, roughly:
• To find the adjacent bins of a bin, using a LUT. Its size is about 25 Kbit² and it therefore fits easily within the BlockRAM.
• To change the way the memory operations are performed when building the histograms. Currently, a pipelined design is used in the collector unit of the histogram unit³, capable of adding one pair to a histogram cell each clock cycle. At most one pair each clock cycle will appear at the input.⁴ Suddenly demanding several memory operations for each pair is therefore a problem. It can be solved by buffering all pairs that are to increment the histogram cell counters.
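The second strategy can be sketched in software as follows. The rectangular 6 × 12 grid, the function names and the wrap-around rule in ϕ are assumptions made for this illustration and do not reflect the actual bin layout of the TE. The last lines reproduce the LUT size estimate from the figures given for it (344 addresses, 9 bits per bin address, up to 8 adjacent bins).

```python
N_THETA, N_PHI = 6, 12  # hypothetical grid, not the real TE binning

def bin_index(t, p):
    """Flatten a (theta row, phi column) position into one bin address."""
    return t * N_PHI + p

def build_adjacency_lut():
    """For each bin, list the addresses of its up to 8 adjacent bins.

    phi wraps around; theta does not, so bins in the first and last
    theta rows have fewer neighbours (assumption of this sketch).
    """
    lut = []
    for t in range(N_THETA):
        for p in range(N_PHI):
            neighbours = []
            for dt in (-1, 0, 1):
                for dp in (-1, 0, 1):
                    if dt == 0 and dp == 0:
                        continue  # skip the bin itself
                    nt = t + dt
                    if 0 <= nt < N_THETA:
                        neighbours.append(bin_index(nt, (p + dp) % N_PHI))
            lut.append(neighbours)
    return lut

def add_pair(hist, lut, b):
    """Count a pair in its calculated bin and in all adjacent bins."""
    hist[b] += 1
    for n in lut[b]:
        hist[n] += 1

lut = build_adjacency_lut()
hist = [0] * (N_THETA * N_PHI)
add_pair(hist, lut, bin_index(2, 0))

# LUT size estimate: addresses x bits per bin address x adjacent bins.
lut_bits = 344 * 9 * 8
print(sum(hist), lut_bits)
```

In this sketch a single pair increments up to nine counters instead of one, which illustrates both why the thresholds would have to be raised and why one memory operation per clock cycle is no longer sufficient without buffering.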
Two comments on the benefits and drawbacks of the two options:
• Option 1 requires four times as much hardware (evaluation boards), but apart from that, no or almost no changes to any part of the TE.
• In Option 2, the hardware is unchanged, but major modifications to the TE core are necessary.
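Why four differently offset bin grids are enough for Option 1 can be checked with a small sketch. It works in a flat 2D projection with coordinates measured in units of one bin side, and it assumes that one chip keeps the original bin limits while the other three are shifted by half a bin side along one axis, the other axis, or both. The sign of the shifts and all names are choices made for this illustration.

```python
# Offsets of the four bin grids, in units of one bin side.
OFFSETS = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]

def edge_margin(u, v, offset):
    """Smallest distance from the point (u, v) to a bin limit of the
    grid shifted by `offset`, in units of one bin side."""
    margins = []
    for coord, off in zip((u, v), offset):
        frac = (coord + off) % 1.0          # position inside the bin
        margins.append(min(frac, 1.0 - frac))
    return min(margins)

def best_grid(u, v):
    """Index of the grid in which (u, v) lies farthest from a bin limit."""
    return max(range(len(OFFSETS)),
               key=lambda i: edge_margin(u, v, OFFSETS[i]))

# Sweep one bin: in the best grid no point is closer than a quarter of
# a bin side to a bin limit.
worst = min(
    edge_margin(i / 40, j / 40, OFFSETS[best_grid(i / 40, j / 40)])
    for i in range(40) for j in range(40)
)
print(worst)
```

A point lying exactly on a bin limit of the unshifted grid (margin 0) sits in the middle of a bin in one of the shifted grids, which is the property asked for in Option 1.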
A final comment: No simulations have yet shown the extent of the borderline events problem or whether there actually is a significant reduction in efficiency if it is not handled. Intuitively, it might seem clear that that is the case. We should however keep in mind that the events we demand the TE to find are weak and that they will always appear in considerable noise. Noise hits might also help to “blur” the directions.
² 344 addresses × (9 bits/address × up to 8 adjacent bins)
³ See “The histogram unit” under Section 5.7.8 on page 57.
⁴ Typically significantly less, since a large part of the pairs are filtered away in the speedCrit unit.