Efficient Fixed-Radius Near Neighbors for Machine Learning
by David Porter Walter III
S.B., MIT (2018)
Submitted to the
Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2019
© Massachusetts Institute of Technology 2019. All rights reserved.
Author: _________________________________________
Department of Electrical Engineering and Computer Science May 24, 2019
Certified by: _________________________________________
Tomaso A. Poggio Professor of Brain and Cognitive Science Thesis Supervisor May 24, 2019
Accepted by: _________________________________________
Katrina LaCurts Chair, Master of Engineering Thesis Committee
Efficient Fixed-Radius Near Neighbors for Machine Learning
by David Porter Walter III
S.B., MIT (2018)
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Deep learning has enabled artificial intelligence systems to move away from manual feature engineering and toward feature learning and better performance. Convolutional neural networks (CNNs) have especially demonstrated super-human performance in many vision tasks. One big reason for the success of CNNs is the use of parallelizable software and hardware to run these models, making their use computationally practical. This work focuses on the design and implementation of an efficient and parallel fixed-radius near neighbors (FRNN) program. FRNN is a core component in a new type of machine learning model, object oriented deep learning (OODL), serving as a replacement for CNNs with goals of invariance, equivariance, interpretability, and computational efficiency that improve upon the abilities of CNNs. This efficient implementation of FRNN is a critical step in making OODL computationally efficient and practical.
Thesis Supervisor: Tomaso A. Poggio Title: Professor of Brain and Cognitive Science
Acknowledgements
Thank you to everyone at the Poggio Lab, especially Qianli Liao, for providing guidance and mentorship on this project. Thank you to my grandpa David Walter Sr. for being there on my MIT journey since day one. Thank you to Kathy Guerra for being there alongside me and believing in me since the first year of my five-year MIT journey; without you I would not have made it to the end of this thesis.
Contents
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Contributions
2 Background
2.1 Parallel Programming
2.2 PyTorch
3 Related Work
3.1 Convolutional Neural Networks
3.2 Dropout
3.3 Pooling
3.4 Object Oriented Deep Learning
4 Methods
4.1 Combinative Functions in Deep Learning
4.2 FRNN With Bins
4.3 FRNN With Scales and Bins
4.4 Parallel Binning
4.4.1 Mapping Points to Bins
4.4.2 Mapping Bins to Points
4.4.3 Mapping Bins to Neighbor Bins
4.5 Parallelized FRNN
4.5.1 Parallelizing Across Bins and Points
4.5.2 Storing Bin Points in Shared Memory
5 Results
5.1 Theoretical Runtime Analysis
5.2 Experimental Runtime Analysis
6 Evaluation
6.1 Practical Trade-offs of Different Algorithms
7 Future Work
7.1 Connected Components
8 Conclusion
References
Appendix
Python Code Mapping Points to their Bins
Python Code Mapping Bins to Points
Python Code Mapping Bins to Neighbor Bins
CPP CUDA Kernel Code for Parallel FRNN with Bins
List of Figures
3.1 OODL Voting and Binding
4.1 Brute Force FRNN
4.2 FRNN With Bins
4.3 FRNN With Bins and Scale
4.4 Experimental Example of Points Mapped to Bins
4.5 Experimental Example of Neighbors for a Single Bin
5.1 Parallel FRNN Runtime versus Radius for Different Numbers of Points
5.2 Parallel FRNN Runtime versus Number of Bins for Different Numbers of Points
5.3 Parallel FRNN Runtime versus Number of Outputted Edges for Different Numbers of Points
5.4 Parallel FRNN Runtime Per Edge versus Number of Outputted Edges for Different Numbers of Points
7.1 Connected components in an undirected graph
List of Tables
5-1 FRNN Time Complexity
5-2 FRNN Space Complexity
Chapter 1
Introduction
1.1 Motivation
The rise of deep learning marks a leap forward in the field of artificial intelligence, enabling humans to build computational systems that hierarchically process an input, learn a hierarchy of features, and use those features to transform an input into a classification or an action. Nature has genetically endowed humans with the ability to perform very complicated and intelligent tasks. As scientists continue to uncover some of the mechanisms behind neurally based intelligence, we can use that knowledge to build artificial systems that have many of the useful properties that we think neurally based systems have, and even move beyond some of the limitations of human intelligence. As the field of machine learning (ML) advances, we see a trend analogous to the transition from training wheels on a bike, to no training wheels, to modifying the bike itself, to scrapping the bike altogether and building a better form of transportation. What this means for AI is that we started by hard-coding in all the details for our computer systems, but increasingly we are heading toward a world where our computational systems are not only given the ability to make their own decisions to a small degree, but will be able to make decisions about how their decisions are made and about how they learn, with less human knowledge forced on them.
As good scientists and engineers of artificially intelligent systems, we need to be cognizant of both this long-term trajectory of making fewer human-made assumptions and the short-term benefits of injecting human knowledge into a system where learning that knowledge would be too difficult. In the context of machine learning and deep learning, we want to advance the field in a way that continues to minimize human-biased knowledge while also forcing an AI system to have properties that we know are good. The research in this thesis represents a movement in that direction. This work is focused on the design and implementation of an efficient and parallel fixed-radius near neighbors (FRNN) program. FRNN is a core component in a new type of machine learning model, object oriented deep learning (OODL), serving as a replacement for CNNs with goals of invariance, equivariance, interpretability, and computational efficiency. This efficient implementation of FRNN is a critical step in making OODL computationally efficient and practical.
1.2 Contributions
For the work of this thesis project I:
● Implemented a parallel system that maps points in both linear and exponential spaces to bins, and maps each bin to its neighbor bins.
● Implemented a fixed-radius near neighbors system that processes bins in parallel and pairs of points within bins in parallel.
● Showed that the runtime of this parallel FRNN implementation with binning improves upon the non-parallel, non-binning implementation by orders of magnitude, making this implementation practical for use in object oriented deep learning.
Chapter 2
Background
2.1 Parallel Programming
Machine learning, and specifically deep learning, has seen such a breakthrough in no small part because of the advancement of hardware with the capability to run parallelizable code on graphics processing units (GPUs). As the name suggests, GPUs were originally designed for graphics processing tasks that inherently benefit from massive amounts of parallelization. Specifically, we can thank the video game industry for the initial proliferation of GPUs used for graphics applications; other industries like blockchain and AI have since been able to benefit from the advancements in GPU hardware as well [1] [2]. There are many libraries and frameworks for writing GPU code, including OpenCL [3], but the most popular deep learning frameworks, including PyTorch and TensorFlow, use Nvidia’s CUDA. “CUDA is a parallel computing platform and programming model that makes using a GPU for general purpose computing simple and elegant” [4]. CUDA currently works with multiple programming languages, including C, C++, and Fortran. In simple terms, GPUs and the CUDA framework allow an engineer to transform a parallelizable algorithm into code that the hardware actually runs in parallel, with hundreds to thousands of threads executing at the same time. This goes beyond a normal CPU, which runs code that appears parallel in software, through abstraction, but in reality executes mostly in serial or on a very small number of CPU cores. While in theory a GPU is still a form of Turing machine (TM) with all of the same theoretical limitations, given the constant number of threads it can actually run in parallel, in practice the number of blocks and threads on a GPU makes it run certain parallel operations much faster than a CPU. Because this project specifically uses Nvidia GPUs and the CUDA framework, the most practically useful thing to know about the way Nvidia’s GPUs are programmed is that they have threads, like a normal CPU, and blocks, which represent groups of threads all undergoing the same operations in parallel. An Nvidia GPU runs multiple threads in a block in parallel and multiple blocks in parallel.
2.2 PyTorch
We have also chosen to use CUDA because PyTorch makes it relatively easy to create custom CUDA kernels and import them directly into PyTorch. PyTorch is a major member of the list of popular deep learning frameworks, which also includes TensorFlow and Theano. PyTorch garners its strength from its ability to dynamically run deep learning models, unlike frameworks such as TensorFlow, which needs to compile a static computational graph before running a deep learning model. This dynamic nature makes PyTorch much easier to use for research and development, including building and debugging models. The downside of PyTorch’s dynamic nature is that it leaves less room for optimization of the underlying implementation. While the opportunities for optimization in PyTorch may be limited to a small degree, PyTorch is practically easier to customize, owing to the simplicity, structure, and dynamic nature of the framework [5] [6].
Chapter 3
Related Work
3.1 Convolutional Neural Networks
Convolutional neural networks (CNNs) are to thank for much of the advancement in computer vision over the past several years, in no small part due to advancements in parallelizable hardware like GPUs [7]. One of the main properties that makes CNNs so powerful is translation invariance [8]. Invariance is the property that allows a system to be unaffected by a change in a feature. For example, translation invariance allows an ML model to not lose performance when an object is shifted. We would like computer vision models to also have rotation invariance for novel rotations, but CNNs struggle with this property compared to their ability to handle novel translations [9].
3.2 Dropout
Dropout is a method in which nodes in a deep learning model are randomly set to zero. Dropout can help our models with at least one notable property: disentanglement [10]. Disentanglement is the idea that a feature of the input is both captured in the system and isolated in a small subset of the system. By contrast, a system that is entangled has many of its important aspects spread across many different parts of the system. A system that is disentangled is modular, such that certain parts of the system can be identified with certain aspects of its computation. From an interpretability perspective, like modularity in software engineering, disentanglement in ML models is a desirable property that allows scientists and engineers to understand how the system works instead of treating it as a black box. Interpretability is a positive step in the direction of safety for AI systems, and a step toward scientists and engineers better understanding how our models work and how they can be improved.
3.3 Pooling
Pooling layers are used in CNNs to summarize groups of neurons within a k x k sliding kernel map [7]. Max pooling is a popular pooling layer type, where the output from each kernel map is the maximum value within that kernel map’s output before pooling. Max pooling serves as a way to propagate only the most prominent value within a region of the input and ignore the rest [11]. Pooling operations help CNNs achieve translation invariance, because the location of a feature matters less when it is propagated through a pooling layer; the pooling layer only cares about the presence of the feature. Nevertheless, in CNNs pooling helps the most with invariance to position, but not necessarily invariance to other features like rotation and size. With a different model type, pooling layers could aid in the problem of invariance to other feature types.
3.4 Object Oriented Deep Learning
Object Oriented Deep Learning (OODL) is a model whose highest aims are interpretability, disentanglement, and equivariance [12] [13]. Unlike the conventional CNN architecture, which uses N-dimensional feature tensors as the fundamental representation along with convolutional kernels, OODL’s basic representation is an object. In this context, an object is an entity that may have explicit properties like position, rotation, and size built in, plus an N-length signature vector that contains learned features for that object. As with most other encoding neural architectures that process images, the first layer takes an image as input, and each subsequent layer encodes its input into continually higher levels of abstracted objects and features, such that objects and features in high layers are some learned combination of features from lower layers. This is similar to the typical deep learning architecture because features are still a combination of lower-level features, but OODL is different because it also has objects as fundamental units, allowing a symbolic paradigm to emerge. This symbolic paradigm represents a movement beyond the simple statistical feature detectors of typically strong deep learning architectures like CNNs and toward an ability to understand the relations between objects in an input, allowing elements such as context and complex relationships to emerge. OODL follows a paradigm similar to other deep learning architectures, where the first, and lowest, layer starts with the individual pixels or other low-level features. If the input is a 2D image, the main difference in OODL’s approach is that these pixels are treated as individual objects, and each pixel has the properties of position and rotation. Then, higher layers contain fewer objects that are some combination of the objects in the lower layers, until the highest layer has few objects, or only one object, that is a weighted sum of all objects remaining in the image.
To transform objects from layer to layer, OODL uses a voting layer followed by a binding layer. A voting layer is currently implemented with a set of radially oriented weights that dot-multiply their values by each object’s signature and predict neighboring objects. Binding layers combine the objects in each layer by aggregating objects onto those with the most surrounding objects. This is why the voting layer is named as such: objects ‘vote’ for neighboring objects, and the binding layer aggregates the votes.
Figure 3.1: Visualization of a voting operation on the left and a binding operation on the right. [12]
Voting layers are more general than convolutional layers because the radial kernel is not constrained to the pixel grid or to a static rotation, or lack thereof. In terms of interpretability, the existence of individual objects makes it easier for a scientist or engineer to locate important areas of a model and interpret where a feature exists in an object’s properties and signature and how it affects the model’s computation. In addition, an OODL model has the potential to do less computation when the input is less complex, whereas other neurally based architectures like CNNs carry the same amount of computation independent of the complexity of the input.
Chapter 4
Methods
4.1 Combinative Functions in Deep Learning
A core aspect of computationally efficient and representative systems like neural networks and hierarchical representations is that there needs to be a way to combine low-level information into high-level representations. Language and vision fit this structure well. We see this in everyday tasks like reading, where we combine letters to formulate words and words to formulate phrases. We combine edges and dots to formulate objects in an image, like eyes and legs on an animal or wheels and windows on a car, and combine these sub-features and sub-objects into high-level features and high-level objects like animals and cars, respectively.
In AI systems, we usually have an input with many low-level features being mapped to an output with a few high-level features. Often this is a mapping from a larger input to a smaller output, and to do this, AI systems employ combinative mappings that have the net effect of reducing the size of the input. Hierarchical networks like deep learning embrace this paradigm fully. A vanilla feedforward deep network architecture can vary widely, but one usual, and important, aspect is that the final layer maps some larger internal representation to some smaller representation like an encoding, action, or classification. A self-driving car system can take in frames from a video and map them to actions like steering the vehicle and accelerating. A natural language system can read a paragraph and output the emotions of that paragraph, like happy, sad, or angry.
4.2 FRNN With Bins
Many of the important properties of OODL, including dynamic computational cost and equivariance, rely on a dynamic architecture that deviates from the less dynamic nature of vanilla deep learning approaches like feedforward neural networks, and even from slightly dynamic architectures like CNNs. A CNN’s core combinative function is the convolution with either a stride larger than one or dropout. While we are open to different combinative functions for OODL, OODL uses FRNN as its core combinative operation. We currently use the FRNN operation instead of convolutional filters because of its dynamic property that allows it to be applied to point clouds or to inputs and model layers of various sizes and shapes.
FRNN takes as input a set of points and a radius, and it outputs a matrix that maps each point to all other points that are within a distance of the radius from that point.
Algorithm 4.1: FRNN Brute Force.
1. initialize empty edges list
2. for all points xa
a. for all points xb
i. compute distance between xa and xb
ii. if distance < radius, push xa and xb to edges list
3. output edges list
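As an illustration, the brute force procedure can be sketched in a few lines of Python (a minimal sketch, not the thesis implementation; the function name is my own):

```python
import math

def frnn_brute_force(points, radius):
    """Compare every pair of points; O(n^2) in the number of points.

    points: list of coordinate tuples.
    Returns a list of (a, b) index pairs with distance < radius.
    """
    edges = []
    for a in range(len(points)):
        for b in range(len(points)):
            if a != b and math.dist(points[a], points[b]) < radius:
                edges.append((a, b))
    return edges
```

Note that each near-neighbor pair appears twice, once in each direction, matching the nested loops of Algorithm 4.1.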
Figure 4.1: Depiction of the operation for one point in the brute force FRNN algorithm in 2 dimensions. The blue, or lighter, points are “near neighbors” of the current point in the center of the circle of size radius.
To avoid needlessly comparing every point to every other point, a better algorithm than brute force places points in bins such that only points a constant number of bins away need to be checked. Imagine you are acting out FRNN in real life between you and everyone in the world, and the radius is 10 miles. It would be unwise to check distances to people all the way across the world. Instead, you know to only check within your town or city, and then possibly the few surrounding towns. Similarly, for FRNN with generic points, this program puts each point in a bin, much as people exist in cities. If the radius were smaller, the program could make the bins smaller, and similarly for larger radii, enabling us to only look at a relatively small number of bins.
Algorithm 4.2: FRNN with Bins.
1. ptsToBins2d ← floor((points2d - min(points2d)) / radius)
2. ptsToBins1d: map each ptsToBins2d to a unique 1d integer
3. binsToPts1d ← map each 1D bin to the ids in points2d
4. initialize empty edges list
5. for ba in binsToPts1d:
a. for bb in ba’s neighbor bins (including ba):
i. for xa in ba:
1. for xb in bb:
a. compute distance between xa and xb
b. if distance < radius, push xa and xb to edges list
6. output edges list
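The same steps can be sketched in Python for the 2D case (a hypothetical illustration of Algorithm 4.2, not the thesis code; here the 2D bin tuple itself serves as the unique bin id of step 2):

```python
import math
from collections import defaultdict

def frnn_with_bins(points, radius):
    """FRNN over 2D points using square bins of width `radius`.

    Because bins are as wide as the radius, a point's neighbors can
    only sit in its own bin or the 8 adjacent bins.
    """
    min_x = min(p[0] for p in points)
    min_y = min(p[1] for p in points)
    # Steps 1-3: map points to bins, then bins to their point ids.
    bins_to_pts = defaultdict(list)
    for i, (x, y) in enumerate(points):
        bin_id = (int((x - min_x) // radius), int((y - min_y) // radius))
        bins_to_pts[bin_id].append(i)
    edges = []
    # Steps 4-5: compare each bin only against its neighboring bins.
    for (bx, by), members in list(bins_to_pts.items()):
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for a in members:
                    for b in bins_to_pts.get((bx + dx, by + dy), []):
                        if a != b and math.dist(points[a], points[b]) < radius:
                            edges.append((a, b))
    return edges
```

The output matches the brute force algorithm, but far-apart points are never compared.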
Figure 4.2: FRNN in 2 dimensions with bins. In this case, the bins are as wide as the radius, so it is guaranteed that every neighbor of the current bin will be no more than one bin away.
4.3 FRNN With Scales and Bins
So far I have been discussing the FRNN problem in terms of points that exist in 2D or 3D linear Euclidean space. In this linear space, determining the distance between coordinates matches our usual intuitions, and the radius is the same no matter which pair of points you are considering. But when we introduce scales, the radius that we use to compare the distance between points is multiplied by the value of the scale of that point as well. In this way, the radius is made proportionally larger when the scale is larger, making our radius relative with respect to the scale. When we have scales, the radius r that is inputted into the FRNN program is changed to an adjusted radius equal to r·s, where s is the scale.
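Concretely, a radius of r = 1.0 stays 1.0 for a point at scale s = 1.0, but becomes 4.0 for a point at scale s = 4.0 (a toy illustration; the helper name is my own):

```python
def adjusted_radius(r, s):
    """Scale-relative radius: the input radius r is multiplied by the
    scale s of the point, so larger-scale points see a larger radius."""
    return r * s
```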
Figure 4.3: FRNN with one linear dimension and one scale dimension with bins, where the first dimension, on the x-axis, is in linear space, and the second, scale dimension is in logarithmic space. Bins that are not white are considered neighboring bins, but the light red, or lightest shaded, sections contain points that will never be neighboring points. Gaps between the scale bins are left to make room for the explanatory lines and arrows. As you will see in figure 4.4, the scale bins get taller as the scale increases.
Figure 4.4: Experimental Example of Points Mapped to Bins. We have one linear dimension in the x-axis and one exponential dimension in the y-axis, which is the scale. In this case we can see that bins get wider in the linear dimensions and taller in the scale dimension as the scale increases.
Figure 4.5: Experimental Example of Neighbors for a Single Bin. The middle bin in light blue is the current bin. This example corresponds with the same points and bins as in figure 4.4.
4.4 Parallel Binning
This section describes setting up the bin inputs that will be passed into the CUDA kernel. At the start, the FRNN algorithm is given the points (x, y, z, s, b), the linear radius for the first three dimensions x, y, and z, and the exponential scale radius for the fourth dimension, the scale. Here b is the batch id, which identifies points in unique inputs so that we do not compare distances between points in different inputs; this allows us to process multiple inputs in a single batch. Before the FRNN CUDA kernel is launched, we need the bins that each point is mapped to. The high-level steps for this part of the program are:
1. Compute the map from each point to its bin.
2. Compute the map from each bin to its points.
3. Compute the map from bins to their neighbor bins.
The Python code for all three steps of this algorithm is in the appendix.
4.4.1 Mapping Points to Bins
The main decision to be made here is how wide to make the bins. As the bin widths get larger, we have more points in each bin and fewer bins. If all points fit into one bin, we are back at the brute force FRNN algorithm; if each bin has only one point, we could still theoretically have a speedup, but practically we would not. Somewhere in the middle should produce the best performance. Although figure 4.2 and figure 4.3 already showed this choice of bin size, those figures do not justify the design choice. I chose the width of the linear bins to be the same as the radius, and the bin width in the fourth, scale dimension to be the scale radius. Considering only the first three linear dimensions, each bin forms a cube with each side having a length of the radius. This makes getting neighboring bins in linear space much easier, because any point in the current bin is guaranteed to find all of its neighboring points within the neighboring bins, as you can see in figure 4.2. Still considering linear space in 3D, this choice of bin size also means that if we assume the points are mostly distributed evenly, as in the average case, then each point will only make O(|E_i|) comparisons with other points, where E_i is the number of neighboring points for point i.
In addition, we do not want negative bin ids, so we also shift the value of every point by the lowest value, so that no point has a value less than 0. With this, the equation for mapping a point to a bin in linear space is:
w_i,j = ( x_i,j − min(X) ) / r ,  for j ∈ {0, 1, 2}
where X is the tensor of points, x_i,j is the jth linear coordinate of point i in X, and w_i,j is the unscaled bin coordinate of point i in dimension j.
If we consider the scale in the fourth dimension, then we have two separate equations: one for the bins in linear space and one for the bins in exponential space.
u = floor( log(s_i) / log(z) )
b_i,j = floor( w_i,j / z^(u+1) ) ,  for j ∈ {0, 1, 2}
b_i,3 = floor( log(s_i) / log(z) ) = u
where s_i is the scale value at point i, z is the scale radius, u is a placeholder used in the equation for b_i,j, b_i,j is the scaled bin for point i in linear dimension j for the first three dimensions, and b_i,3 is the scale bin for point i. The equations for b_i incorporate the resizing for the highest scale value in any given bin, due to the u+1 exponent. The idea of the highest scale in a bin limiting the width of the bin may make more sense when looking at figure 4.3 and the dotted lines that separate the area of the neighbor bins that could contain neighboring points, in light blue, from the area outside, in light red, that is guaranteed to contain no neighboring points for any point in the current bin. For intuition, the width of the bins gets exponentially larger as the scale increases linearly.
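A NumPy sketch of these three equations (my own illustrative code, not the appendix implementation; r is the linear radius, z the scale radius, and the minimum is taken per dimension):

```python
import numpy as np

def points_to_bins(points, scales, r, z):
    """Map points plus scales to 4D bin coordinates (Section 4.4.1).

    points: (n, 3) linear coordinates; scales: (n,) scale values.
    """
    # w = (x - min(X)) / r: shift to non-negative, divide by the radius.
    w = (points - points.min(axis=0)) / r
    # u = floor(log(s) / log(z)): the scale bin.
    u = np.floor(np.log(scales) / np.log(z)).astype(int)
    # b = floor(w / z^(u+1)): linear bins widen exponentially with scale.
    b_lin = np.floor(w / (z ** (u + 1))[:, None]).astype(int)
    return np.concatenate([b_lin, u[:, None]], axis=1)
```

For example, with r = 1 and z = 2, a point at scale 2 lands in scale bin 1 and its linear bins are 4 radii wide.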
4.4.2 Mapping Bins to Points
Algorithm: Mapping Bins to Points.
1. Get the sorted indices for the mapping of points to bins.
2. Compute the differences between consecutive sorted bin ids: this lets us set every point that does not have a next point in its bin to -1, and the rest to the index of the next point in the bin.
3. Set the values for the first point of each bin.
4. Set the values for the point indices and un-sort them so they are in the same order as the original points, because the indices correspond with the points based on their locations.
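These four steps amount to building a -1-terminated linked list over the points of each bin; a NumPy sketch (my own names and representation, assuming a 1D bin id per point as produced by the previous step):

```python
import numpy as np

def bins_to_points(pts_to_bins_1d):
    """Build a -1-terminated linked list over points sharing a bin.

    Returns (first, next_pt): first maps each bin id to its first
    point index; next_pt[i] is the next point in i's bin, or -1.
    """
    bins = np.asarray(pts_to_bins_1d)
    n = len(bins)
    # Step 1: sort point indices by bin id.
    order = np.argsort(bins, kind="stable")
    sorted_bins = bins[order]
    # Step 2: consecutive sorted entries with equal bin ids chain
    # to the following sorted index; otherwise the chain ends (-1).
    same = sorted_bins[:-1] == sorted_bins[1:]
    next_pt = np.full(n, -1)
    # Step 4: un-sort so next_pt lines up with the original points.
    next_pt[order[:-1]] = np.where(same, order[1:], -1)
    # Step 3: the first point of each bin is the sorted entry at
    # which that bin id first appears.
    starts = np.flatnonzero(np.r_[True, ~same])
    first = dict(zip(sorted_bins[starts].tolist(), order[starts].tolist()))
    return first, next_pt
```

Following first[b] and then next_pt until -1 visits every point in bin b.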
4.4.3 Mapping Bins to Neighbor Bins
When we add in scales, things get a little more complicated, because we cannot simply look one bin in each direction to get the neighbors. As shown in figure 4.3, when looking at the current bin and keeping the scale dimension constant, we simply get the neighbors that are one bin higher and lower. But when getting the neighbor bins at other scales, we look one bin higher and lower in the scale dimension, while how many bins to take in the other dimensions depends on the maximum radius at those other scales, so we are not grabbing exactly one bin over when at different scales. The main requirement for selecting neighboring bins is that all neighboring points of all points in the current bin are contained in the selected neighboring bins. Code for this algorithm is in the appendix, and figure 4.3 is the best way to understand this section spatially.
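For intuition, in the purely linear case (no scales) the neighbors of a bin are just the surrounding 3^d bins, one step in each direction; a minimal sketch (my own helper, not the appendix code):

```python
from itertools import product

def neighbor_bins(bin_coord):
    """Return all bins within one step of bin_coord in each dimension,
    including bin_coord itself: 3^d bins for d dimensions."""
    return [tuple(c + o for c, o in zip(bin_coord, offset))
            for offset in product((-1, 0, 1), repeat=len(bin_coord))]
```

With scales, the window in the linear dimensions must instead widen with the maximum radius at each neighboring scale, as described above.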
4.5 Parallelized FRNN
With the radius, the points, the bins and their neighbor bins, and the linked list mapping points to the next point in their bin, we can write parallel code in a CUDA kernel to implement FRNN. The kernel has two main aspects: parallelizing across bins and points, and storing a bin’s points in shared memory when comparing points in two neighboring bins. Code for this section can be found in the appendix.
4.5.1 Parallelizing Across Bins and Points
The lowest level of parallelization in a generic FRNN algorithm exists at the level of the point, but because this approach runs in O(n^2) time we want to process as many points in parallel as we can. We therefore assign separate threads to each point pair and a CUDA block to each bin, and compute the solution for this section of the algorithm in approximately the time it would take to compute the solution for one bin. Figure 4.2 shows a visual depiction of what the operation for one bin looks like. Algorithmically, here are the steps for processing each bin:
1. edges ← empty array
2. for point pa in bin ba:
a. for bin bb in ba’s neighbors:
i. for point pb in bin bb:
1. sa, sb ← scale of pa, scale of pb
2. scale_dist ← max(sa, sb) / min(sa, sb)
3. d ← |pa − pb| (distance in the linear dimensions)
4. avg_scale ← sqrt(sa · sb) (the average of sa and sb in log space)
5. if d ≤ radius · avg_scale and scale_dist < scale_radius:
a. edges.push([pa, pb])
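A serial Python sketch of the work one CUDA block performs for its bin (illustrative only; the names, the flat argument layout, and taking the log-space average of the scales as the geometric mean sqrt(sa·sb) are my own rendering of the steps above):

```python
import math

def edges_for_bin(points, scales, members, neighbor_members,
                  radius, scale_radius):
    """Serial sketch of the per-bin work one CUDA block performs.

    points: list of linear coordinate tuples; scales: per-point
    scale values; members: point indices in the current bin;
    neighbor_members: indices in the bin's neighbor bins (incl. itself).
    """
    edges = []
    for a in members:
        for b in neighbor_members:
            if a == b:
                continue
            sa, sb = scales[a], scales[b]
            # The ratio of scales must be within the scale radius.
            scale_dist = max(sa, sb) / min(sa, sb)
            d = math.dist(points[a], points[b])
            # The radius is relative to the log-space average scale.
            avg_scale = math.sqrt(sa * sb)
            if d <= radius * avg_scale and scale_dist < scale_radius:
                edges.append((a, b))
    return edges
```

In the kernel, the two inner comparisons run on separate threads of the bin’s block rather than in a loop.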
4.5.2 Storing Bin Points in Shared Memory
The other significant improvement of my implementation over other FRNN algorithms in CUDA is the speedup that comes from loading bins’ points into shared memory before comparing the points’ distances. Shared memory is cache that is allocated and controlled by the user, instead of being automatically allocated by the compiler and used by the programming language’s underlying memory management. As with cache, reads and writes to shared memory happen much faster than with global memory. The procedure here is that one thread in the CUDA kernel initially writes the points of bin_a into shared memory, and then for every neighboring bin_b the same thread loads the points of bin_b into shared memory. This produces a speedup because one thread can very quickly move all of the points into shared memory once, and then during the actual algorithm, when each point’s values need to be read O(n^2) times for the distance comparisons, those reads hit fast shared memory.
Chapter 5
Results
5.1 Theoretical Runtime Analysis
Time Complexity
Algorithm                  Best          Average       Worst
FRNN Brute Force           O(|X|^2)      O(|X|^2)      O(|X|^2)
FRNN with Bins             O(|E|)        O(|E|)        O(|X|^2)
Parallel FRNN with Bins    O(|E|/|B|)    O(|E|/|B|)    O(|X|^2)
Table 5-1: Time complexity in the best, average, and worst case for different FRNN algorithms, where X is the points matrix, E is the outputted edges matrix, and B is the 1-dimensional bins array.
In the FRNN brute force algorithm we compare all |X|²
combinations of points in every case. In the FRNN with bins algorithm we
expect to compare O(|E|) point pairs, because on average most of the points
in the neighbor bins will form an edge with the points in the bin currently
being considered. In the best case for the parallel algorithm, the points are
distributed as evenly as possible across the bins and each of the |B|
parallel blocks makes O(|E|/|B|) comparisons. In the worst case, where
many of the points are concentrated in very few bins, the runtime
complexity is O(|X|²), because this is essentially the brute force
algorithm run on a single bin or a few bins.
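For reference, the brute force baseline in the first row of Table 5-1 amounts to the sketch below. This is a hypothetical 2D simplification; the thesis versions also handle a scale dimension and batching.

```python
import math

def frnn_brute_force(pts, radius):
    """O(|X|²) reference: compare every unordered pair of points.

    pts is a list of (x, y) tuples; returns index pairs (i, j), i < j,
    whose Euclidean distance is at most radius.
    """
    edges = []
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if math.dist(pts[i], pts[j]) <= radius:
                edges.append((i, j))
    return edges
```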
Algorithm                  Space Complexity (All Cases)
FRNN Brute Force           O(|X| + |E|)
FRNN with Bins             O(|X| + |E| + |B|)
Parallel FRNN with Bins    O(|X| + |E| + |B|)

Table 5-2: Space complexity (identical in the best, average, and worst case) for different FRNN algorithms, where X is the points matrix, E is the outputted edges matrix, and B is the one-dimensional bins array.
For the space complexity, the only difference between the brute
force algorithm and the algorithms with bins is the additional storage for
the bins, whose size is a function of the radius.
5.2 Experimental Runtime Analysis
In this experiment I am testing the final version of the FRNN
program using an Nvidia Tesla K80 GPU, and for the parallel algorithm the
CUDA kernel is set to use 8192 blocks and 16 by 16 2D threads. These
settings for the number of blocks and number of threads give the best
experimental performance for my implementation. For the non-parallel
algorithm, I am using only one block and one thread. Below is a
comprehensive description of the other experimental settings.
Constant Hyper-Parameters:
1. dimension of points: two linear dimensions and one scale dimension
(x, y, s)
2. linear domain: x and y values are sampled from a uniform
distribution ranging from 0.0 to 100.0.
3. scale domain: the scale values are sampled from a uniform
distribution ranging from 1.0 to 1.5.
4. scale radius: 1.25
Independent Variables:
1. number of points: ranging from 2 to 20,000, inclusive.
2. radius: ranging from 1.0 to 100.0 (or 1% of the linear range to 100%
of the linear range).
a. number of bins: this is a function of the radius.
Dependent Variables:
1. runtime for binning step.
2. runtime for the FRNN kernel.
3. number of outputted edges.
The values described above were chosen to test the FRNN program
thoroughly, including values outside of what may be used in practice. In
OODL we are unlikely to use a radius as large as the full range of the
inputs, nor one that is only 1% of that range. However, we do not yet know
for certain what range of values the OODL program will use in practice,
and other non-OODL applications, such as collision detection, may involve
a large number of points and a small radius. It therefore makes sense to
test these values, if only to look for ways to improve the implementation.
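For concreteness, a generator matching the constant hyper-parameters above might look like the following. `sample_points` is a hypothetical helper; the thesis does not list its exact sampling code.

```python
import random

def sample_points(n_pts, lin_lo=0.0, lin_hi=100.0, s_lo=1.0, s_hi=1.5, seed=0):
    """Draw n_pts points (x, y, s) matching the experimental settings:
    x and y uniform on [0, 100], scale uniform on [1.0, 1.5].
    """
    rng = random.Random(seed)
    return [(rng.uniform(lin_lo, lin_hi),
             rng.uniform(lin_lo, lin_hi),
             rng.uniform(s_lo, s_hi)) for _ in range(n_pts)]
```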
Figure 5.1: Runtime comparison between the Brute Force FRNN implementation, the serial FRNN implementation with bins, and the parallel FRNN implementation with bins.
Figure 5.2: Parallel FRNN Runtime versus Radius for Different Numbers of Points.
Figure 5.3: Parallel FRNN Runtime versus Number of Bins for Different Numbers of Points.
Figure 5.4: Parallel FRNN Runtime versus Number of Outputted Edges for Different Numbers of Points. When the number of points is small, we do not have runtimes for a large number of edges because we do not have enough points to make a large number of edges, even if the radius were very large.
Chapter 6
Evaluation
6.1 Practical Trade-offs of Different Algorithms
Perhaps the most important result is Figure 5.1, which shows how
much faster the parallel bin FRNN algorithm is than the serial bin and
brute force algorithms. Because of the practical limits on how many blocks
and threads a GPU can actually run in parallel, the runtime of the parallel
bin algorithm will approach that of the serial bin algorithm as the input
size grows very large; for the practical input sizes that OODL will use,
however, we expect the parallel algorithm to do much better than either of
the serial algorithms.
The results also show that the parallel FRNN with bins becomes
more efficient as the size of the input and output increases. This is likely
due to the high startup cost of launching a CUDA kernel and of initializing
the bins: as the input and output grow, this startup time becomes a
smaller fraction of the total runtime. This is why it is important in
practice to run the algorithm with larger batch sizes, so that the CUDA
kernel is launched fewer times.
We can also see that when the radius is very small (likely smaller
than would ever be used in practice), the runtime is much higher than for
most other radii. This is likely because a very small radius produces a very
large number of bins, and the runtime of the kernel increases linearly with
the number of bins. More bins mean more blocks running in parallel, and
since any given GPU has a practical limit on the number of blocks it
actually runs in parallel, when the number of blocks is very large some of
those blocks run in series instead of in parallel.
I designed this algorithm specifically for OODL, so it parallelizes
well across bins and performs well for the input sizes relevant to OODL.
For other applications, such as graphics and collision detection, this
algorithm may or may not perform as well as an algorithm that is
specifically optimized for those applications.
Chapter 7
Future Work
7.1 Connected Components
While FRNN is good at grouping objects spatially based on their
distance from each other, this may not be enough if our goal is for a
computer system to segment out objects whose shapes do not closely
approximate a circle or sphere. To account for this, we are working on the
next step in the binding process: a connected components step. A
connected component of an undirected graph is a set of vertices such that
every vertex in the set has a path to every other vertex in the set.
Figure 7.1: Visualization of connected components in an undirected graph.
This diagram has three connected components: the five blue points in the
top left, the four green points on the right, and the one red point on the
bottom.
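As a sketch of what the connected components step computes, the FRNN edge list can be labeled with a simple union-find. This is an illustrative implementation, not the one used in the thesis:

```python
def connected_components(n_pts, edges):
    """Label each point with its connected component id using union-find.

    edges is a list of (i, j) index pairs, such as FRNN outputs.
    """
    parent = list(range(n_pts))

    def find(i):
        # walk to the root, compressing the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)
    # relabel roots as consecutive component ids
    roots = [find(i) for i in range(n_pts)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```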
We believe that object segmentation is a fundamental aspect of vision, so in
order to make that capability innate in an artificial vision system we
propose the combination of FRNN and connected components as a
sub-system that allows a computer system to learn to predict
neighboring objects by interpreting the cloud of points produced by the
voting layer in OODL. We hypothesize that, with this approach, objects and
shapes will be learned without any explicit domain knowledge, and that
difficulties such as occlusion or highly variable shapes will be handled
naturally.
At the time of submission, the connected components step is
already implemented; what remains is to complete the full pipeline and
the experiments combining the voting layers, FRNN, and connected
components into a complete OODL system.
Chapter 8
Conclusion
CNNs have been the preferred model for most computer vision
problems for several years, and OODL presents an alternative model that
aims to go beyond the capabilities of CNNs. Just as CNNs benefited from
parallel hardware and software implementations of the convolution
operation, this work represents a similar step for OODL and the FRNN
operation. For this thesis project I implemented a parallel system that
maps points to bins and maps each bin to its neighbor bins, along with a
parallel fixed-radius near neighbors system that processes bins and pairs
of points in parallel. The parallel bin FRNN algorithm performs well for
the input sizes and types necessary to train an OODL model, and a key
result is that this parallel FRNN with bins implementation is orders of
magnitude faster than the non-parallel, non-binning FRNN
implementation. With these results, the critical FRNN component of OODL
is in place, and the continued design of OODL can proceed knowing that its
core functionality is computationally efficient and practical.
References
[1] Owens, John D., et al. "A survey of general-purpose computation on
graphics hardware." Computer graphics forum. Vol. 26. No. 1. Oxford, UK:
Blackwell Publishing Ltd, 2007.
[2] Sze, Vivienne, et al. "Efficient processing of deep neural networks: A
tutorial and survey." Proceedings of the IEEE 105.12 (2017): 2295-2329.
[3] Stone, John E., David Gohara, and Guochun Shi. "OpenCL: A parallel
programming standard for heterogeneous computing systems." Computing
in science & engineering 12.3 (2010): 66.
[4] Yang, Zhiyi, Yating Zhu, and Yong Pu. "Parallel image processing based
on CUDA." 2008 International Conference on Computer Science and Software
Engineering. Vol. 3. IEEE, 2008.
[5] Ketkar, Nikhil. "Introduction to pytorch." Deep learning with python.
Apress, Berkeley, CA, 2017. 195-208.
[6] Paszke, Adam, et al. "Automatic differentiation in pytorch." (2017).
[7] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet
classification with deep convolutional neural networks." Advances in
neural information processing systems. 2012.
[8] LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images,
speech, and time series." The handbook of brain theory and neural networks
3361.10 (1995).
[9] Cheng, Gong, Peicheng Zhou, and Junwei Han. "Learning
rotation-invariant convolutional neural networks for object detection in
VHR optical remote sensing images." IEEE Transactions on Geoscience and
Remote Sensing 54.12 (2016): 7405-7415.
[10] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural
networks from overfitting." The Journal of Machine Learning Research 15.1
(2014): 1929-1958.
[11] Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. "Dynamic
routing between capsules." Advances in neural information processing
systems. 2017.
[12] Liao, Qianli, and Tomaso Poggio. Object-oriented deep learning. Center
for Brains, Minds and Machines (CBMM), 2017.
https://cbmm.mit.edu/publications/object-oriented-deep-learning
[13] Liao, Qianli, and Tomaso Poggio. Exact equivariance, disentanglement
and invariance of transformations. 2017.
Appendix
This appendix contains most of the code for this project:
Python Code Mapping Points to their Bins
import math
import torch

def get_pts2bins(pts, batch_ids, radius, scale_radius, device):
    """Map each point (x, y, z, s) to its 4D bin coordinates, then append the batch id."""
    ix, iy, iz, i_s = 0, 1, 2, 3
    n_pts, n_dims = pts.size()
    n_lin_dims = 3
    apply_4d = lambda func: torch.tensor(
        [func(pts[:, i]) for i in range(n_dims)], device=device)
    min_vals = apply_4d(torch.min)
    pts2bins = pts.clone()
    # steps for xyz bins:
    #   1.) shift to min 0
    #   2.) map by radius
    #   3.) scale by max scale in corresponding scale bin

    # xyz: 1.) shift to min 0
    pts2bins[:, ix : iz + 1].sub_(min_vals[ix : iz + 1])
    # xyz: 2.) map by radius
    pts2bins[:, ix : iz + 1].div_(radius)
    # xyz: 3.) scale by max scale in corresponding scale bin
    unshifted_scale_bins = torch.floor(
        torch.log(pts[:, i_s]) / math.log(scale_radius))
    max_unshifted_scale_bins = torch.max(unshifted_scale_bins).item()
    exp_range = scale_radius ** torch.arange(
        1, max_unshifted_scale_bins + 2, dtype=torch.float, device=device)
    scale_divs = exp_range[unshifted_scale_bins.long()].view(-1, 1)
    pts2bins[:, : n_lin_dims].div_(scale_divs)
    # steps for scale bins:
    #   1.) map by scale_radius
    pts2bins[:, i_s].log_()
    pts2bins[:, i_s].div_(math.log(scale_radius))
    pts2bins.floor_()
    pts2bins = pts2bins.long()
    # now add another dimension for the batch_ids
    batch_bins = batch_ids.view(-1, 1)
    pts2bins = torch.cat([pts2bins, batch_bins], dim=1)
    return pts2bins
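For intuition, the linear and scale binning above can be mirrored for a single point in plain Python. `bin_coords` is a hypothetical simplification: it omits the per-scale-bin rescaling of the linear coordinates that `get_pts2bins` performs.

```python
import math

def bin_coords(pt, min_vals, radius, scale_radius):
    """Compute the (x, y, z, scale) bin of one point (x, y, z, s):
    shift the linear coordinates to min 0 and divide by the radius,
    and bin the scale logarithmically in base scale_radius.
    """
    *lin, s = pt
    lin_bins = [math.floor((v - m) / radius) for v, m in zip(lin, min_vals)]
    s_bin = math.floor(math.log(s) / math.log(scale_radius))
    return (*lin_bins, s_bin)
```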
Python Code Mapping Bins to Points
def get_pt_idx_data(pts, bins_5d, device):
    bins = bins_5d
    ix, iy, iz, i_s, ib = 0, 1, 2, 3, 4
    _, n_dims = bins_5d.size()
    n_bins5d = torch.tensor(
        [torch.max(bins[:, i]) + 1 for i in range(n_dims)], device=device)
    # do sorting of (bin, pt) pair and sort by bin
    bins1d = bins[:, ib] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz] * n_bins5d[i_s]
    bins1d += bins[:, i_s] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz]
    bins1d += bins[:, iz] * n_bins5d[ix] * n_bins5d[iy]
    bins1d += bins[:, iy] * n_bins5d[ix]
    bins1d += bins[:, ix]
    n_bins1d = n_bins5d[0] * n_bins5d[1] * n_bins5d[2] * n_bins5d[3] * n_bins5d[4]
    sorted_idxs = torch.argsort(bins1d)
    # if the bins were: [0, 0, 1, 1, 1, 2]
    # we would need the pt_idxs to be:
    #   [1, -1, 3, 4, -1, -1]
    # and the first pt idxs to be:
    #   [0, 2, 5] (one for each bin)
    sorted_bins1d = bins1d[sorted_idxs]
    # now get the pt_idxs:
    diffs = sorted_bins1d[1:] - sorted_bins1d[:-1]
    diffs_mask = diffs != 0
    pt_idxs_sorted = torch.arange(
        bins1d.size()[0], dtype=torch.long, device=device) + 1
    pt_idxs_sorted[:-1][diffs_mask] = -1
    pt_idxs_sorted[-1] = -1
    mixed_pt_idxs = sorted_idxs[pt_idxs_sorted]
    mixed_pt_idxs[pt_idxs_sorted == -1] = -1
    # unsorts it...
    pt_idxs = mixed_pt_idxs[torch.argsort(sorted_idxs)]
    # now create the idxs for the linked list..
    first_pt_idxs = torch.zeros(n_bins1d, dtype=torch.long, device=device) - 1
    # have to manually add the first element
    min_bin1d = sorted_bins1d[0]
    first_pt_idxs[min_bin1d] = sorted_idxs[0]
    mixed_first_bins = sorted_bins1d[1:][diffs_mask]
    mixed_first_pts = sorted_idxs[1:][diffs_mask]
    first_pt_idxs[mixed_first_bins] = mixed_first_pts
    return pt_idxs, first_pt_idxs, n_bins5d, bins1d
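The linked-list layout that this function builds can be illustrated with the example from its comments. `bins_to_linked_lists` is a plain-Python sketch, not the vectorized PyTorch version above:

```python
def bins_to_linked_lists(bins1d, n_bins):
    """Build the per-bin linked-list arrays.

    For each point, pt_idxs holds the index of the next point in the same
    bin (-1 if it is the last), and first_pt_idxs holds each bin's first
    point (-1 for empty bins).
    """
    pt_idxs = [-1] * len(bins1d)
    first_pt_idxs = [-1] * n_bins
    last_in_bin = [-1] * n_bins
    for i, b in enumerate(bins1d):
        if first_pt_idxs[b] == -1:
            first_pt_idxs[b] = i       # first point seen in bin b
        else:
            pt_idxs[last_in_bin[b]] = i  # chain from the previous point
        last_in_bin[b] = i
    return pt_idxs, first_pt_idxs
```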
Python Code Mapping Bins to Neighbor Bins
def get_nebs2nebs(radius, scale_radius, n_bins5d, device):
    ix, iy, iz, i_s, ib = 0, 1, 2, 3, 4
    nx, ny, nz, ns, nb = n_bins5d
    ndims = 5
    n_lin_dims = 3
    n_bins1d = nx * ny * nz * ns * nb
    # map each bin to the start values in the ranges at each scale
    #
    # the number of neighbors in euclidean space at scale 1
    # (scale bin 0 is relative scale scale_radius**0, and
    #  scale bin 1 is relative scale scale_radius**1, etc.)
    # so for each scale bin we need to adjust this
    # n_nebs_left accordingly
    n_side = 1
    n_middle = 1
    # because log(scale_bin_width) / log(scale_radius) == 1
    n_nebs = n_side + n_middle + n_side
    bin_offsets = torch.arange(
        -n_side, -n_side + n_nebs, dtype=torch.float, device=device)
    all_bins = range5d(n_bins5d, torch.float, device).view(n_bins1d, ndims)
    nebs_at_scales = []
    for bin_offsets_i in bin_offsets:
        scale_transform = scale_radius ** bin_offsets_i
        starts = ((all_bins[:, : n_lin_dims] / scale_transform) - n_side).view(
            n_bins1d, 1, n_lin_dims)
        # stop = torch.ceil(n_side + n_middle / scale_transform + n_side).long().item()
        stop = math.ceil(n_side + n_middle / scale_transform + n_side)
        # + 1
        # range3d is the 3-D analogue of range5d (not shown in this appendix)
        range_at_scale_i = range3d(
            [stop, stop, stop], torch.float, device).view(
                stop ** n_lin_dims, n_lin_dims)
        nebs_at_scale_i = torch.empty(
            (n_bins1d, stop ** 3, ndims), device=device)
        # setting the x, y, z dims
        nebs_at_scale_i[:, :, : iz + 1] = starts + range_at_scale_i
        # setting the scale dim
        nebs_at_scale_i[:, :, i_s] = (
            all_bins.view(-1, 1, ndims)[:, :, i_s] + bin_offsets_i)
        # setting the batch_id dim
        nebs_at_scale_i[:, :, ib] = all_bins.view(-1, 1, ndims)[:, :, ib]
        # floor to round to the nearest bin below so that they can be longs
        nebs_at_scale_i = nebs_at_scale_i.floor().long()
        nebs_at_scales.append(nebs_at_scale_i)
    nebs_tensor5d = torch.cat(nebs_at_scales, dim=1)
    # filtering the negative values
    mask_out_of_bounds = torch.sum(nebs_tensor5d < 0, dim=2) > 0
    # filtering the values that are larger than the max possible values
    mask_out_of_bounds += torch.sum(nebs_tensor5d - n_bins5d >= 0, dim=2) > 0
    # reset the mask to only have values that are either 0 or 1
    mask_out_of_bounds = mask_out_of_bounds > 0
    # transform 5d bins to 1d bins
    nebs1d = (nebs_tensor5d[:, :, ib] * n_bins5d[ix] * n_bins5d[iy]
              * n_bins5d[iz] * n_bins5d[i_s])
    nebs1d += nebs_tensor5d[:, :, i_s] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz]
    nebs1d += nebs_tensor5d[:, :, iz] * n_bins5d[ix] * n_bins5d[iy]
    nebs1d += nebs_tensor5d[:, :, iy] * n_bins5d[ix]
    nebs1d += nebs_tensor5d[:, :, ix]
    nebs1d[mask_out_of_bounds] = -1
    return nebs1d
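The neighbor enumeration can be illustrated in two dimensions. `neighbor_bins_2d` is a hypothetical simplification of `get_nebs2nebs`, without the scale and batch dimensions or the scale-dependent window size:

```python
def neighbor_bins_2d(bin_xy, n_bins_xy):
    """Enumerate the 3x3 neighborhood of a 2D bin, dropping bins that
    fall outside the grid (the thesis code marks those with -1).
    """
    bx, by = bin_xy
    nx, ny = n_bins_xy
    nebs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            x, y = bx + dx, by + dy
            if 0 <= x < nx and 0 <= y < ny:
                nebs.append((x, y))
    return nebs
```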
def range5d(shape5d, dtype, device):
    lx, ly, lz, ls, lb = shape5d
    n_dims = 5
    r5d = torch.zeros((lb, ls, lz, ly, lx, n_dims), dtype=dtype, device=device)
    r5d[:, :, :, :, :, 0] += torch.arange(lb, dtype=dtype, device=device).view((-1, 1, 1, 1, 1))
    r5d[:, :, :, :, :, 1] += torch.arange(ls, dtype=dtype, device=device).view((-1, 1, 1, 1))
    r5d[:, :, :, :, :, 2] += torch.arange(lz, dtype=dtype, device=device).view((-1, 1, 1))
    r5d[:, :, :, :, :, 3] += torch.arange(ly, dtype=dtype, device=device).view((-1, 1))
    r5d[:, :, :, :, :, 4] += torch.arange(lx, dtype=dtype, device=device)
    # need to reverse each indiv coordinate.
    rev_idx = torch.arange(start=n_dims - 1, end=-1, step=-1, device=device)
    r5d = r5d.index_select(n_dims, rev_idx)
    return r5d
C++ CUDA Kernel Code for Parallel FRNN with Bins
// n_max_pts_bin, pt_size, offset_x/y/z/s, and edge_size are compile-time
// constants defined elsewhere in the project.
__global__ void frnn_cuda_forward_kernel(
        const int* neighbor_bins,
        const float* pts,
        const int* pt_idxs,
        const int* first_pt_idxs,
        const float radius,
        const float scale_radius,
        const int n_max_neighbors,
        const int n_bins,
        int* edges,
        int* i_edges,
        const int max_size_edges) {
    // stores the points for bin_a and bin_b
    __shared__ float bin_a[n_max_pts_bin * pt_size];
    __shared__ float bin_b[n_max_pts_bin * pt_size];
    // stores the pt ids for bin_a and bin_b
    __shared__ int bin_a_ids[n_max_pts_bin];
    __shared__ int bin_b_ids[n_max_pts_bin];

    for (int idx_i_bin_a = blockIdx.x; idx_i_bin_a < n_bins;
            idx_i_bin_a += gridDim.x) {
        __syncthreads();
        __threadfence();
        int i_bin_a = idx_i_bin_a;
        // if the bin is empty:
        if (first_pt_idxs[i_bin_a] == -1) { continue; }

        //////////////
        // load bin_a
        //////////////
        if (threadIdx.x == 0 && threadIdx.y == 0) {
            bool set_neg_1 = false;
            int inext = first_pt_idxs[i_bin_a];
            for (int i = 0; i < n_max_pts_bin; i++) {
                if (set_neg_1) { bin_a_ids[i] = -1; continue; }
                bin_a_ids[i] = inext;
                if (inext == -1) { set_neg_1 = true; continue; }
                int i_pts_start = inext * pt_size;
                int i_bin = i * pt_size;
                bin_a[i_bin + offset_x] = pts[i_pts_start + offset_x];
                bin_a[i_bin + offset_y] = pts[i_pts_start + offset_y];
                bin_a[i_bin + offset_z] = pts[i_pts_start + offset_z];
                bin_a[i_bin + offset_s] = pts[i_pts_start + offset_s];
                inext = pt_idxs[inext];
            }
        }

        for (int idx_i_bin_b = 0; idx_i_bin_b < n_max_neighbors;
                idx_i_bin_b += 1) {
            // int idx_i_bin_b = blockIdx.y;
            int i_bin_b = neighbor_bins[i_bin_a * n_max_neighbors + idx_i_bin_b];
            // neighboring bins in the matrix that are empty
            // should have been set to -1,
            // but there might be more bin_b's that
            // are not -1 after a bin_b that is -1
            if (i_bin_b == -1) { continue; }
            if (first_pt_idxs[i_bin_b] == -1) { continue; }
            // don't double check bin pairs
            if (i_bin_b < i_bin_a) { continue; }

            /*----------
              LOAD BIN B
            -----------*/
            bool set_neg_1 = false;
            int inext = first_pt_idxs[i_bin_b];
            for (int i = 0; i < n_max_pts_bin; i++) {
                if (set_neg_1) { bin_b_ids[i] = -1; continue; }
                bin_b_ids[i] = inext;
                if (inext == -1) { set_neg_1 = true; continue; }
                int i_pts_start = inext * pt_size;
                int i_bin = i * pt_size;
                bin_b[i_bin + offset_x] = pts[i_pts_start + offset_x];
                bin_b[i_bin + offset_y] = pts[i_pts_start + offset_y];
                bin_b[i_bin + offset_z] = pts[i_pts_start + offset_z];
                bin_b[i_bin + offset_s] = pts[i_pts_start + offset_s];
                inext = pt_idxs[inext];
            }
            __syncthreads();
            __threadfence();

            /*---------------------
              THE COMPARISONS
              now do the comparison between
              bin_a's pts and bin_b's pts
            ----------------------*/
            // ia is the bin index for the current pt a,
            // so it is NOT the index into the pts matrix
            for (int ia = threadIdx.x; ia < n_max_pts_bin; ia += blockDim.x) {
                if (bin_a_ids[ia] <= -1) { break; }
                float ax = bin_a[ia * pt_size + offset_x];
                float ay = bin_a[ia * pt_size + offset_y];
                float az = bin_a[ia * pt_size + offset_z];
                float as = bin_a[ia * pt_size + offset_s];
                for (int ib = threadIdx.y; ib < n_max_pts_bin; ib += blockDim.y) {
                    if (bin_b_ids[ib] <= -1) { break; }
                    // don't compare the same point to itself:
                    if (bin_b_ids[ib] == bin_a_ids[ia]) { continue; }
                    // if it's the same bin,
                    // only compare lower points to higher points
                    if ((i_bin_a == i_bin_b) && (bin_b_ids[ib] <= bin_a_ids[ia])) {
                        continue;
                    }
                    float bx = bin_b[ib * pt_size + offset_x];
                    float by = bin_b[ib * pt_size + offset_y];
                    float bz = bin_b[ib * pt_size + offset_z];
                    float bs = bin_b[ib * pt_size + offset_s];
                    // check that the scales are close enough
                    float scale_max = as;
                    float scale_min = bs;
                    if (as < bs) {
                        scale_max = bs;
                        scale_min = as;
                    }
                    if ((scale_max / scale_min) > scale_radius) { continue; }
                    float diffx = bx - ax;
                    float diffy = by - ay;
                    float diffz = bz - az;
                    float dist = diffx * diffx + diffy * diffy + diffz * diffz;
                    dist = sqrt(dist);
                    // geometric mean of the two points' scales
                    float avg_scale = sqrt(as * bs);
                    if (dist >= radius * avg_scale) { continue; }
                    int this_i_edges = atomicAdd(&i_edges[0], edge_size);
                    edges[this_i_edges + 0] = bin_a_ids[ia];
                    edges[this_i_edges + 1] = bin_b_ids[ib];
                }
            }
            __syncthreads();
            __threadfence();
        }
        __syncthreads();
        __threadfence();
    }
    return;
}