Efficient Fixed-Radius Near Neighbors for Machine Learning
by David Porter Walter III
S.B., MIT (2018)
Submitted to the
Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2019
© Massachusetts Institute of Technology 2019. All rights reserved.
Author: _________________________________________
Department of Electrical Engineering and Computer Science May 24, 2019
Certified by: _________________________________________
Tomaso A. Poggio Professor of Brain and Cognitive Science Thesis Supervisor May 24, 2019
Accepted by: _________________________________________
Katrina LaCurts Chair, Master of Engineering Thesis Committee
Efficient Fixed-Radius Near Neighbors for Machine Learning
by David Porter Walter III
S.B., MIT (2018)
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Deep learning has enabled artificial intelligence systems to move away from manual feature engineering and toward feature learning and better performance. Convolutional neural networks (CNNs) have especially demonstrated super-human performance in many vision tasks. One big reason for the success of CNNs is the use of parallelizable software and hardware to run these models, making their use computationally practical. This work focuses on the design and implementation of an efficient and parallel fixed-radius near neighbors (FRNN) program. FRNN is a core component in a new type of machine learning model, object oriented deep learning (OODL), serving as a replacement for CNNs with goals of invariance, equivariance, interpretability, and computational efficiency that improve upon the abilities of CNNs. This efficient implementation of FRNN is a critical step in making OODL computationally efficient and practical.
Thesis Supervisor: Tomaso A. Poggio Title: Professor of Brain and Cognitive Science
Acknowledgements
Thank you to everyone at the Poggio Lab, especially Qianli Liao, for providing guidance and mentorship on this project. Thank you to my grandpa David Walter Sr. for being there on my MIT journey since day one. Thank you to Kathy Guerra for being there alongside me and believing in me since the first year of my five-year MIT journey; without you I would not have made it to the end of this thesis.
Contents
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Contributions
2 Background
2.1 Parallel Programming
2.2 PyTorch
3 Related Work
3.1 Convolutional Neural Networks
3.2 Dropout
3.3 Pooling
3.4 Object Oriented Deep Learning
4 Methods
4.1 Combinative Functions in Deep Learning
4.2 FRNN With Bins
4.3 FRNN With Scales and Bins
4.4 Parallel Binning
4.4.1 Mapping Points to Bins
4.4.2 Mapping Bins to Points
4.4.3 Mapping Bins to Neighbor Bins
4.5 Parallelized FRNN
4.5.1 Parallelizing Across Bins and Points
4.5.2 Storing Bin Points in Shared Memory
5 Results
5.1 Theoretical Runtime Analysis
5.2 Experimental Runtime Analysis
6 Evaluation
6.1 Practical Trade-offs of Different Algorithms
7 Future Work
7.1 Connected Components
8 Conclusion
References
Appendix
Python Code Mapping Points to their Bins
Python Code Mapping Bins to Points
Python Code Mapping Bins to Neighbor Bins
CPP CUDA Kernel Code for Parallel FRNN with Bins
List of Figures
3.1 OODL Voting and Binding
4.1 Brute Force FRNN
4.2 FRNN With Bins
4.3 FRNN With Bins and Scale
4.4 Experimental Example of Points Mapped to Bins
4.5 Experimental Example of Neighbors for a Single Bin
5.1 Parallel FRNN Runtime versus Radius for Different Numbers of Points
5.2 Parallel FRNN Runtime versus Number of Bins for Different Numbers of Points
5.3 Parallel FRNN Runtime versus Number of Outputted Edges for Different Numbers of Points
5.4 Parallel FRNN Runtime Per Edge versus Number of Outputted Edges for Different Numbers of Points
7.1 Connected components in an undirected graph
List of Tables
5-1 FRNN Time Complexity
5-2 FRNN Space Complexity
Chapter 1
Introduction
1.1 Motivation
The rise of deep learning marks a leap forward in the field of artificial intelligence, enabling humans to build computational systems that hierarchically process an input, learn a hierarchy of features, and use those features to transform an input into a classification or an action. Nature has genetically endowed humans with the ability to perform very complicated and intelligent tasks. As scientists continue to uncover some of the mechanisms behind neurally based intelligence, we can use that knowledge to build artificial systems that have many of the useful properties that we think neurally based systems have, and even move beyond some of the limitations of human intelligence. As the field of machine learning (ML) advances, we see a trend analogous to the transition from training wheels on a bike, to no training wheels, to modifying the bike itself, to scrapping the bike altogether and building a better form of transportation. What this means for AI is that we started by hard-coding in all the details for our computer systems, but increasingly we are heading toward a world where our computational systems are not only given the ability to make their own decisions to a small degree, but will be able to make decisions about how their decisions are made and about how they learn, with less human knowledge forced on them.
As good scientists and engineers of artificially intelligent systems, we need to be cognizant of both this long-term trajectory of making fewer human-made assumptions and the short-term benefits of injecting human knowledge into a system where learning that knowledge would be too difficult. In the context of machine learning and deep learning, we want to advance the field in a way that continues to minimize human-biased knowledge while also forcing an AI system to have properties that we know are good. The research in this thesis represents a movement in that direction. This work is focused on the design and implementation of an efficient and parallel fixed-radius near neighbors (FRNN) program. FRNN is a core component in a new type of machine learning model, object oriented deep learning (OODL), serving as a replacement for CNNs with goals of invariance, equivariance, interpretability, and computational efficiency. This efficient implementation of FRNN is a critical step in making OODL computationally efficient and practical.
1.2 Contributions
For the work of this thesis project I:
● Implemented a parallel system that maps points in both linear and exponential spaces to bins, and maps each bin to its neighbor bins.
● Implemented a fixed-radius near neighbors system that processes bins in parallel and pairs of points within bins in parallel.
● Showed that the runtime of this parallel FRNN implementation with binning improves upon the non-parallel, non-binning implementation by orders of magnitude, making this implementation practical for use in object oriented deep learning.
Chapter 2
Background
2.1 Parallel Programming
Machine learning, and specifically deep learning, has seen such a breakthrough in no small part because of the advancement of hardware with the capability to run parallelizable code on graphics processing units (GPUs). As the name suggests, GPUs were originally designed for graphics processing tasks that inherently benefit from massive amounts of parallelization. Specifically, we can thank the video game industry for the initial proliferation of GPUs used for graphics applications; other industries like blockchain and AI have since been able to benefit from the advancements in GPU hardware as well [1] [2]. There are many libraries and frameworks for writing GPU code, including OpenCL [3], but the most popular deep learning frameworks, including PyTorch and TensorFlow, use Nvidia’s CUDA. “CUDA is a parallel computing platform and programming model that makes using a GPU for general purpose computing simple and elegant” [4]. CUDA currently works with multiple programming languages, including C, C++, and Fortran. In simple terms, GPUs and the CUDA framework allow an engineer to transform a parallelizable algorithm into code that the hardware actually runs in parallel, with hundreds to thousands of threads executing at the same time. This goes beyond a normal CPU, which runs code that appears parallel in software, through abstraction, but in reality executes mostly in serial or on a very small number of CPU cores. While in theory a GPU is still a form of Turing machine (TM) with all of the same theoretical limitations, given the constant number of threads it can actually run in parallel, in practice the number of blocks and threads on a GPU makes it run certain parallel operations much faster than a CPU. Because this project specifically uses Nvidia GPUs and the CUDA framework, the most practically useful thing to know about the way Nvidia’s GPUs are programmed is that they have threads, like a normal CPU, and blocks, which represent groups of threads all undergoing the same operations in parallel. An Nvidia GPU runs multiple threads in a block in parallel and multiple blocks in parallel.
2.2 PyTorch
We have also chosen to use CUDA because PyTorch makes it relatively easy to create custom CUDA kernels and import them directly into PyTorch. PyTorch is a major member of the list of popular deep learning frameworks, which also includes TensorFlow and Theano. PyTorch garners its strength from its ability to dynamically run deep learning models, unlike frameworks such as TensorFlow, which needs to compile a static computational graph before running a deep learning model. This dynamic nature makes PyTorch much easier to use for research and development, including building and debugging models. The downside of PyTorch’s dynamic nature is that it leaves less room for optimization of the underlying implementation. While the opportunities for optimization in PyTorch may be limited to a small degree, PyTorch is practically easier to customize, owing to the simplicity, structure, and dynamic nature of the framework [5] [6].
Chapter 3
Related Work
3.1 Convolutional Neural Networks
Convolutional neural networks (CNNs) are to thank for much of the advancement in computer vision over the past several years, in no small part due to advancements in parallelizable hardware like GPUs [7]. One of the main properties that makes CNNs so powerful is translation invariance [8]. Invariance is the property that allows a system to be unaffected by a change in a feature. For example, translation invariance allows an ML model to not lose performance when an object is shifted. We would like computer vision models to also have rotation invariance for novel rotations, but CNNs struggle with this property compared to their ability to handle novel translations [9].
3.2 Dropout
Dropout is a method in which nodes in a deep learning model are randomly set to zero. Dropout can help our models with at least one notable property: disentanglement [10]. Disentanglement is the idea that a feature of the input is both captured in the system and isolated in a small subset of the system. By contrast, a system that is entangled has many of its important aspects spread across many different parts of the system. A system that is disentangled is modular, such that certain parts of the system can be identified with certain aspects of its computation. From an interpretability perspective, like modularity in software engineering, disentanglement in ML models is a desirable property that allows scientists and engineers to understand how the system works instead of treating it as a black box. Interpretability is a positive step in the direction of safety for AI systems, and a step toward scientists and engineers better understanding how our models work and how they can be improved.
3.3 Pooling
Pooling layers are used in CNNs to summarize groups of neurons within a k x k sliding kernel map [7]. Max pooling is a popular pooling layer type, where the output from each kernel map is the maximum value within that kernel map’s output before pooling. Max pooling serves as a way to propagate only the most prominent value within a region of the input and ignore the rest [11]. Pooling operations help CNNs achieve translation invariance, because the location of a feature matters less when it is propagated through a pooling layer; the pooling layer only cares about the presence of the feature. Nevertheless, in CNNs pooling helps the most with invariance to position, but not necessarily invariance to other features like rotation and size. With a different model type, pooling layers could aid in the problem of invariance to other feature types.
3.4 Object Oriented Deep Learning
Object Oriented Deep Learning (OODL) is a model whose highest aims are interpretability, disentanglement, and equivariance [12] [13]. Unlike the conventional CNN architecture, which uses N-dimensional feature tensors as the fundamental representation along with convolutional kernels, OODL’s basic representation is an object. In this context, an object is an entity that may have explicit properties like position, rotation, and size built in, plus an N-length signature vector that contains learned features for that object. As with most other encoding neural architectures that process images, the first layer takes an image as input, and each subsequent layer encodes its input into continually higher levels of abstracted objects and features, such that objects and features in high layers are some learned combination of features from lower layers. This is similar to the typical deep learning architecture because features are still a combination of lower-level features, but OODL is different because it also has objects as fundamental units, allowing a symbolic paradigm to emerge. This symbolic paradigm represents a movement beyond the simple statistical feature detectors of typically strong deep learning architectures like CNNs and toward an ability to understand the relations between objects in an input, allowing elements such as context and complex relationships to emerge. OODL follows a paradigm similar to other deep learning architectures, where the first, and lowest, layer starts with the individual pixels or other low-level features. If the input is a 2D image, the main difference in OODL’s approach is that these pixels are treated as individual objects, and each pixel has the properties of position and rotation. Then, higher layers contain fewer objects that are some combination of the objects in the lower layers, until the highest layer has few objects, or only one object, that is a weighted sum of all objects remaining in the image.
To transform objects from layer to layer, OODL uses a voting layer followed by a binding layer. A voting layer is currently implemented with a set of radially oriented weights that dot-multiply their values by each object’s signature and predict neighboring objects. Binding layers combine the objects in each layer by aggregating objects onto those with the most surrounding objects. This is why the voting layer is named as such: objects ‘vote’ for neighboring objects, and the binding layer aggregates the votes.
Figure 3.1: Visualization of a voting operation on the left and a binding operation on the right. [12]
Voting layers are more general than convolutional layers because the radial kernel is not constrained to the pixel grid or to a static rotation, or lack thereof. In terms of interpretability, the existence of individual objects makes it easier for a scientist or engineer to locate important areas of a model and interpret where a feature exists in an object’s properties and signature and how it affects the model’s computation. In addition, an OODL model has the potential to do less computation when the input is less complex, whereas other neurally based architectures like CNNs carry the same amount of computation independent of the complexity of the input.
Chapter 4
Methods
4.1 Combinative Functions in Deep Learning
A core aspect of computationally efficient and representative systems like neural networks and hierarchical representations is that there needs to be a way to combine low-level information into high-level representations. Language and vision fit this structure well. We see this in everyday tasks like reading, where we combine letters to formulate words and words to formulate phrases. We combine edges and dots to formulate objects in an image, like eyes and legs on an animal or wheels and windows on a car, and combine these sub-features and sub-objects into high-level features and high-level objects like animals and cars, respectively.
In AI systems, we usually have an input with many low-level features being mapped to an output with a few high-level features. Often this is a mapping from a larger input to a smaller output, and to do this, AI systems employ combinative mappings that have the net effect of reducing the size of the input. Hierarchical networks like deep learning embrace this paradigm fully. A vanilla feedforward deep network architecture can vary widely, but one usual, and important, aspect is that the final layer maps some larger internal representation to some smaller representation like an encoding, action, or classification. A self-driving car system can take in frames from a video and map them to actions like steering the vehicle and accelerating. A natural language system can read a paragraph and output the emotions of that paragraph, like happy, sad, or angry.
4.2 FRNN With Bins
Many of the important properties of OODL, including dynamic computational cost and equivariance, rely on a dynamic architecture that deviates from the less dynamic nature of vanilla deep learning approaches like feedforward neural networks, and even from slightly dynamic architectures like CNNs. A CNN’s core combinative function is the convolution with either a stride larger than one or dropout. While we are open to different combinative functions for OODL, OODL uses FRNN as its core combinative operation. We currently use the FRNN operation instead of convolutional filters because of its dynamic property that allows it to be applied to point clouds or to inputs and model layers of various sizes and shapes.
FRNN takes as input a set of points and a radius, and it outputs a matrix that maps each point to all other points that are within a distance of the radius from that point.
Algorithm 4.1: FRNN Brute Force.
1. initialize empty edges list
2. for all points xa
a. for all points xb
i. compute distance between xa and xb
ii. if distance < radius, push xa and xb to edges list
3. output edges list
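As an illustration, the brute force procedure can be sketched in a few lines of Python (a minimal sketch, not the thesis implementation; the function name is my own):

```python
import math

def frnn_brute_force(points, radius):
    """Compare every pair of points; O(n^2) in the number of points.

    points: list of coordinate tuples.
    Returns a list of (a, b) index pairs with distance < radius.
    """
    edges = []
    for a in range(len(points)):
        for b in range(len(points)):
            if a != b and math.dist(points[a], points[b]) < radius:
                edges.append((a, b))
    return edges
```

Note that each near-neighbor pair appears twice, once in each direction, matching the nested loops of Algorithm 4.1.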
Figure 4.1: Depiction of the operation for one point in the brute force FRNN algorithm in 2 dimensions. The blue, or lighter, points are “near neighbors” of the current point in the center of the circle of size radius.
To avoid needlessly comparing every point to every other point, a better algorithm than brute force places points in bins such that only points a constant number of bins away need to be checked. Imagine you are acting out FRNN in real life between you and everyone in the world, and the radius is 10 miles. It would be unwise to check distances to people all the way across the world. Instead, you know to only check within your town or city, and then possibly the few surrounding towns. Similarly, for FRNN with generic points, this program puts each point in a bin, much as people exist in cities. If the radius were smaller, the program could make the bins smaller, and similarly for larger radii, enabling us to only look at a relatively small number of bins.
Algorithm 4.2: FRNN with Bins.
1. ptsToBins2d ← floor((points2d - min(points2d)) / radius)
2. ptsToBins1d: map each ptsToBins2d to a unique 1d integer
3. binsToPts1d ← map each 1D bin to the ids in points2d
4. initialize empty edges list
5. for ba in binsToPts1d:
a. for bb in ba’s neighbor bins (including ba):
i. for xa in ba:
1. for xb in bb:
a. compute distance between xa and xb
b. if distance < radius, push xa and xb to edges list
6. output edges list
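The same steps can be sketched in Python for the 2D case (a hypothetical illustration of Algorithm 4.2, not the thesis code; here the 2D bin tuple itself serves as the unique bin id of step 2):

```python
import math
from collections import defaultdict

def frnn_with_bins(points, radius):
    """FRNN over 2D points using square bins of width `radius`.

    Because bins are as wide as the radius, a point's neighbors can
    only sit in its own bin or the 8 adjacent bins.
    """
    min_x = min(p[0] for p in points)
    min_y = min(p[1] for p in points)
    # Steps 1-3: map points to bins, then bins to their point ids.
    bins_to_pts = defaultdict(list)
    for i, (x, y) in enumerate(points):
        bin_id = (int((x - min_x) // radius), int((y - min_y) // radius))
        bins_to_pts[bin_id].append(i)
    edges = []
    # Steps 4-5: compare each bin only against its neighboring bins.
    for (bx, by), members in list(bins_to_pts.items()):
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for a in members:
                    for b in bins_to_pts.get((bx + dx, by + dy), []):
                        if a != b and math.dist(points[a], points[b]) < radius:
                            edges.append((a, b))
    return edges
```

The output matches the brute force algorithm, but far-apart points are never compared.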
Figure 4.2: FRNN in 2 dimensions with bins. In this case, the bins are as wide as the radius, so it is guaranteed that every neighbor of the current bin will be no more than one bin away.
4.3 FRNN With Scales and Bins
So far I have been discussing the FRNN problem in terms of points that exist in 2D or 3D linear Euclidean space. In this linear space, determining the distance between coordinates matches our usual intuitions, and the radius is the same no matter which pair of points you are considering. But when we introduce scales, the radius that we use to compare the distance between points is multiplied by the value of the scale of that point as well. In this way, the radius is made proportionally larger when the scale is larger, making our radius relative with respect to the scale. When we have scales, the radius r that is inputted into the FRNN program is changed to an adjusted radius equal to r·s, where s is the scale.
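Concretely, a radius of r = 1.0 stays 1.0 for a point at scale s = 1.0, but becomes 4.0 for a point at scale s = 4.0 (a toy illustration; the helper name is my own):

```python
def adjusted_radius(r, s):
    """Scale-relative radius: the input radius r is multiplied by the
    scale s of the point, so larger-scale points see a larger radius."""
    return r * s
```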
Figure 4.3: FRNN with one linear dimension and one scale dimension with bins, where the first dimension, on the x-axis, is in linear space, and the second, scale dimension is in logarithmic space. Bins that are not white are considered neighboring bins, but the light red, or lightest shaded, sections contain points that will never be neighboring points. Gaps between the scale bins are left to make room for the explanatory lines and arrows. As you will see in figure 4.4, the scale bins get taller as the scale increases.
Figure 4.4: Experimental Example of Points Mapped to Bins. We have one linear dimension in the x-axis and one exponential dimension in the y-axis, which is the scale. In this case we can see that bins get wider in the linear dimensions and taller in the scale dimension as the scale increases.
Figure 4.5: Experimental Example of Neighbors for a Single Bin. The middle bin in light blue is the current bin. This example corresponds with the same points and bins as in figure 4.4.
4.4 Parallel Binning
This section describes setting up the bin inputs that will be passed into the CUDA kernel. At the start, the FRNN algorithm is given the points (x, y, z, s, b), the linear radius for the first three dimensions x, y, and z, and the exponential scale radius for the fourth dimension, the scale. Here b is the batch id, which identifies points in unique inputs so that we do not compare distances between points in different inputs; this allows us to process multiple inputs in a single batch. Before the FRNN CUDA kernel is launched, we need the bins that each point is mapped to. The high-level steps for this part of the program are:
1. Compute the map from each point to its bin.
2. Compute the map from each bin to its points.
3. Compute the map from bins to their neighbor bins.
The Python code for all three steps of this algorithm is in the appendix.
4.4.1 Mapping Points to Bins
The main decision to be made here is how wide to make the bins. As the bin widths get larger, we have more points in each bin and fewer bins. If all points fit into one bin, we are back at the brute force FRNN algorithm; if each bin has only one point, we could still theoretically have a speedup, but practically we would not. Somewhere in the middle should produce the best performance. Although figure 4.2 and figure 4.3 already showed this choice of bin size, those figures do not justify the design choice. I chose the width of the linear bins to be the same as the radius, and the bin width in the fourth, scale dimension to be the scale radius. Considering only the first three linear dimensions, each bin forms a cube with each side having a length of the radius. This makes getting neighboring bins in linear space much easier, because any point in the current bin is guaranteed to find all of its neighboring points within the neighboring bins, as you can see in figure 4.2. Still considering linear space in 3D, this choice of bin size also means that if we assume the points are mostly distributed evenly, as in the average case, then each point will only make O(|E_i|) comparisons with other points, where E_i is the number of neighboring points for point i.
In addition, we do not want negative bin ids, so we also shift the value of every point by the lowest value, so that no point has a value less than 0. With this, the equation for mapping a point to a bin in linear space is:
w_i,j = ( x_i,j − min(X) ) / r ,  for j ∈ {0, 1, 2}
where X is the tensor of points, x_i,j is the jth linear coordinate of point i in X, and w_i,j is the unscaled bin coordinate of point i in dimension j.
If we consider the scale in the fourth dimension, then we have two separate equations: one for the bins in linear space and one for the bins in exponential space.
u = floor( log(s_i) / log(z) )
b_i,j = floor( w_i,j / z^(u+1) ) ,  for j ∈ {0, 1, 2}
b_i,3 = floor( log(s_i) / log(z) ) = u
where s_i is the scale value at point i, z is the scale radius, u is a placeholder used in the equation for b_i,j, b_i,j is the scaled bin for point i in linear dimension j for the first three dimensions, and b_i,3 is the scale bin for point i. The equations for b_i incorporate the resizing for the highest scale value in any given bin, due to the u+1 exponent. The idea of the highest scale in a bin limiting the width of the bin may make more sense when looking at figure 4.3 and the dotted lines that separate the area of the neighbor bins that could contain neighboring points, in light blue, from the area outside, in light red, that is guaranteed to contain no neighboring points for any point in the current bin. For intuition, the width of the bins gets exponentially larger as the scale increases linearly.
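A NumPy sketch of these three equations (my own illustrative code, not the appendix implementation; r is the linear radius, z the scale radius, and the minimum is taken per dimension):

```python
import numpy as np

def points_to_bins(points, scales, r, z):
    """Map points plus scales to 4D bin coordinates (Section 4.4.1).

    points: (n, 3) linear coordinates; scales: (n,) scale values.
    """
    # w = (x - min(X)) / r: shift to non-negative, divide by the radius.
    w = (points - points.min(axis=0)) / r
    # u = floor(log(s) / log(z)): the scale bin.
    u = np.floor(np.log(scales) / np.log(z)).astype(int)
    # b = floor(w / z^(u+1)): linear bins widen exponentially with scale.
    b_lin = np.floor(w / (z ** (u + 1))[:, None]).astype(int)
    return np.concatenate([b_lin, u[:, None]], axis=1)
```

For example, with r = 1 and z = 2, a point at scale 2 lands in scale bin 1 and its linear bins are 4 radii wide.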
4.4.2 Mapping Bins to Points
Algorithm: Mapping Bins to Points.
1. Get the sorted indices for the mapping of points to bins.
2. Compute the differences between consecutive sorted bin ids: this lets us set every point that does not have a next point in its bin to -1, and the rest to the index of the next point in the bin.
3. Set the values for the first point of each bin.
4. Set the values for the point indices and un-sort them so they are in the same order as the original points, because the indices correspond with the points based on their locations.
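These four steps amount to building a -1-terminated linked list over the points of each bin; a NumPy sketch (my own names and representation, assuming a 1D bin id per point as produced by the previous step):

```python
import numpy as np

def bins_to_points(pts_to_bins_1d):
    """Build a -1-terminated linked list over points sharing a bin.

    Returns (first, next_pt): first maps each bin id to its first
    point index; next_pt[i] is the next point in i's bin, or -1.
    """
    bins = np.asarray(pts_to_bins_1d)
    n = len(bins)
    # Step 1: sort point indices by bin id.
    order = np.argsort(bins, kind="stable")
    sorted_bins = bins[order]
    # Step 2: consecutive sorted entries with equal bin ids chain
    # to the following sorted index; otherwise the chain ends (-1).
    same = sorted_bins[:-1] == sorted_bins[1:]
    next_pt = np.full(n, -1)
    # Step 4: un-sort so next_pt lines up with the original points.
    next_pt[order[:-1]] = np.where(same, order[1:], -1)
    # Step 3: the first point of each bin is the sorted entry at
    # which that bin id first appears.
    starts = np.flatnonzero(np.r_[True, ~same])
    first = dict(zip(sorted_bins[starts].tolist(), order[starts].tolist()))
    return first, next_pt
```

Following first[b] and then next_pt until -1 visits every point in bin b.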
4.4.3 Mapping Bins to Neighbor Bins
When we add in scales, things get a little more complicated, because we cannot simply look one bin in each direction to get the neighbors. As shown in figure 4.3, when looking at the current bin and keeping the scale dimension constant, we simply get the neighbors that are one bin higher and lower. But when getting the neighbor bins at other scales, we look one bin higher and lower in the scale dimension, while how many bins to take in the other dimensions depends on the maximum radius at those other scales, so we are not grabbing exactly one bin over when at different scales. The main requirement for selecting neighboring bins is that all neighboring points of all points in the current bin are contained in the selected neighboring bins. Code for this algorithm is in the appendix, and figure 4.3 is the best way to understand this section spatially.
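For intuition, in the purely linear case (no scales) the neighbors of a bin are just the surrounding 3^d bins, one step in each direction; a minimal sketch (my own helper, not the appendix code):

```python
from itertools import product

def neighbor_bins(bin_coord):
    """Return all bins within one step of bin_coord in each dimension,
    including bin_coord itself: 3^d bins for d dimensions."""
    return [tuple(c + o for c, o in zip(bin_coord, offset))
            for offset in product((-1, 0, 1), repeat=len(bin_coord))]
```

With scales, the window in the linear dimensions must instead widen with the maximum radius at each neighboring scale, as described above.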
4.5 Parallelized FRNN
With the radius, the points, the bins and their neighbor bins, and the linked list mapping points to the next point in their bin, we can write parallel code in a CUDA kernel to implement FRNN. The kernel has two main aspects: parallelizing across bins and points, and storing a bin’s points in shared memory when comparing points in two neighboring bins. Code for this section can be found in the appendix.
4.5.1 Parallelizing Across Bins and Points
The lowest level of parallelization in a generic FRNN algorithm exists at the level of the point, but because this approach runs in O(n^2) time we want to process as many points in parallel as we can. We therefore assign separate threads to each point pair and a CUDA block to each bin, and compute the solution for this section of the algorithm in approximately the time it would take to compute the solution for one bin. Figure 4.2 shows a visual depiction of what the operation for one bin looks like. Algorithmically, here are the steps for processing each bin:
1. edges ← empty array
2. for point pa in bin ba:
a. for bin bb in ba’s neighbors:
i. for point pb in bin bb:
1. sa, sb ← scale of pa, scale of pb
2. scale_dist ← max(sa, sb) / min(sa, sb)
3. d ← |pa − pb| (distance in the linear dimensions)
4. avg_scale ← sqrt(sa · sb) (the average of sa and sb in log space)
5. if d ≤ radius · avg_scale and scale_dist < scale_radius:
a. edges.push([pa, pb])
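A serial Python sketch of the work one CUDA block performs for its bin (illustrative only; the names, the flat argument layout, and taking the log-space average of the scales as the geometric mean sqrt(sa·sb) are my own rendering of the steps above):

```python
import math

def edges_for_bin(points, scales, members, neighbor_members,
                  radius, scale_radius):
    """Serial sketch of the per-bin work one CUDA block performs.

    points: list of linear coordinate tuples; scales: per-point
    scale values; members: point indices in the current bin;
    neighbor_members: indices in the bin's neighbor bins (incl. itself).
    """
    edges = []
    for a in members:
        for b in neighbor_members:
            if a == b:
                continue
            sa, sb = scales[a], scales[b]
            # The ratio of scales must be within the scale radius.
            scale_dist = max(sa, sb) / min(sa, sb)
            d = math.dist(points[a], points[b])
            # The radius is relative to the log-space average scale.
            avg_scale = math.sqrt(sa * sb)
            if d <= radius * avg_scale and scale_dist < scale_radius:
                edges.append((a, b))
    return edges
```

In the kernel, the two inner comparisons run on separate threads of the bin’s block rather than in a loop.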
4.5.2 Storing Bin Points in Shared Memory
The other significant improvement of my implementation over other FRNN algorithms in CUDA is the speedup that comes from loading bins’ points into shared memory before comparing the points’ distances. Shared memory is cache that is allocated and controlled by the user, instead of being automatically allocated by the compiler and used by the programming language’s underlying memory management. As with cache, reads and writes to shared memory happen much faster than with global memory. The procedure here is that one thread in the CUDA kernel initially writes the points of bin_a into shared memory, and then for every neighboring bin_b the same thread loads the points of bin_b into shared memory. This produces a speedup because one thread can very quickly move all of the points into shared memory once, and then during the actual algorithm, when each point’s values need to be read O(n^2) times for the distance comparisons, those reads hit fast shared memory.
Chapter 5
Results
5.1 Theoretical Runtime Analysis
Time Complexity
Algorithm                  Best          Average       Worst
FRNN Brute Force           O(|X|^2)      O(|X|^2)      O(|X|^2)
FRNN with Bins             O(|E|)        O(|E|)        O(|X|^2)
Parallel FRNN with Bins    O(|E|/|B|)    O(|E|/|B|)    O(|X|^2)
Table 5-1: Time complexity in the best, average, and worst case for different FRNN algorithms, where X is the points matrix, E is the outputted edges matrix, and B is the 1-dimensional bins array.
In the FRNN brute force algorithm we compare all |X|²
combinations of points in every case. In the FRNN with bins algorithm we
expect to compare O(|E|) point pairs, because on average most of the points
in the neighbor bins will form an edge with the points in the bin currently
being considered. In the best case for the parallel algorithm, the points are
distributed as evenly as possible across the bins and each of the |B|
parallel blocks makes O(|E|/|B|) comparisons. In the worst case, where
many of the points are concentrated in very few bins, the runtime
complexity is O(|X|²), because this is essentially the brute force
algorithm run on a single bin or a few bins.
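For reference, the brute force baseline in the first row of Table 5-1 amounts to the sketch below. This is a hypothetical 2D simplification; the thesis versions also handle a scale dimension and batching.

```python
import math

def frnn_brute_force(pts, radius):
    """O(|X|²) reference: compare every unordered pair of points.

    pts is a list of (x, y) tuples; returns index pairs (i, j), i < j,
    whose Euclidean distance is at most radius.
    """
    edges = []
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if math.dist(pts[i], pts[j]) <= radius:
                edges.append((i, j))
    return edges
```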
Algorithm                  Space Complexity (All Cases)
FRNN Brute Force           O(|X| + |E|)
FRNN with Bins             O(|X| + |E| + |B|)
Parallel FRNN with Bins    O(|X| + |E| + |B|)

Table 5-2: Space complexity (identical in the best, average, and worst case) for different FRNN algorithms, where X is the points matrix, E is the outputted edges matrix, and B is the one-dimensional bins array.
For the space complexity, the only difference between the brute
force algorithm and the algorithms with bins is the additional storage for
the bins, whose size is a function of the radius.
5.2 Experimental Runtime Analysis
In this experiment I am testing the final version of the FRNN
program using an Nvidia Tesla K80 GPU, and for the parallel algorithm the
CUDA kernel is set to use 8192 blocks and 16 by 16 2D threads. These
settings for the number of blocks and number of threads give the best
experimental performance for my implementation. For the non-parallel
algorithm, I am using only one block and one thread. Below is a
comprehensive description of the other experimental settings.
Constant Hyper-Parameters:
1. dimension of points: two linear dimensions and one scale dimension
(x, y, s)
2. linear domain: x and y values are sampled from a uniform
distribution ranging from 0.0 to 100.0.
3. scale domain: the scale values are sampled from a uniform
distribution ranging from 1.0 to 1.5.
4. scale radius: 1.25
Independent Variables:
1. number of points: ranging from 2 to 20,000, inclusive.
2. radius: ranging from 1.0 to 100.0 (or 1% of the linear range to 100%
of the linear range).
a. number of bins: this is a function of the radius.
Dependent Variables:
1. runtime for binning step.
2. runtime for the FRNN kernel.
3. number of outputted edges.
The values described above were chosen to test the FRNN program
thoroughly, including values outside of what may be used in practice. In
OODL we are unlikely to use a radius as large as the full range of the
inputs, nor one that is only 1% of that range. However, we do not yet know
for certain what range of values the OODL program will use in practice,
and other non-OODL applications, such as collision detection, may involve
a large number of points and a small radius. It therefore makes sense to
test these values, if only to look for ways to improve the implementation.
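For concreteness, a generator matching the constant hyper-parameters above might look like the following. `sample_points` is a hypothetical helper; the thesis does not list its exact sampling code.

```python
import random

def sample_points(n_pts, lin_lo=0.0, lin_hi=100.0, s_lo=1.0, s_hi=1.5, seed=0):
    """Draw n_pts points (x, y, s) matching the experimental settings:
    x and y uniform on [0, 100], scale uniform on [1.0, 1.5].
    """
    rng = random.Random(seed)
    return [(rng.uniform(lin_lo, lin_hi),
             rng.uniform(lin_lo, lin_hi),
             rng.uniform(s_lo, s_hi)) for _ in range(n_pts)]
```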
Figure 5.1: Runtime comparison between the Brute Force FRNN implementation, the serial FRNN implementation with bins, and the parallel FRNN implementation with bins.
Figure 5.2: Parallel FRNN Runtime versus Radius for Different Numbers of Points.
Figure 5.3: Parallel FRNN Runtime versus Number of Bins for Different Numbers of Points.
Figure 5.4: Parallel FRNN Runtime versus Number of Outputted Edges for Different Numbers of Points. When the number of points is small, we do not have runtimes for a large number of edges because we do not have enough points to make a large number of edges, even if the radius were very large.
Chapter 6
Evaluation
6.1 Practical Trade-offs of Different Algorithms
Perhaps the most important result is Figure 5.1, which shows how
much faster the parallel bin FRNN algorithm is than the serial bin and
brute force algorithms. Because of the practical limits on how many blocks
and threads a GPU can actually run in parallel, the runtime of the parallel
bin algorithm will approach that of the serial bin algorithm as the input
size grows very large; for the practical input sizes that OODL will use,
however, we expect the parallel algorithm to do much better than either of
the serial algorithms.
The results also show that the parallel FRNN with bins becomes
more efficient as the size of the input and output increases. This is likely
due to the high startup cost of launching a CUDA kernel and of initializing
the bins: as the input and output grow, this startup time becomes a
smaller fraction of the total runtime. This is why it is important in
practice to run the algorithm with larger batch sizes, so that the CUDA
kernel is launched fewer times.
We can also see that when the radius is very small (likely smaller
than would ever be used in practice), the runtime is much higher than for
most other radii. This is likely because a very small radius produces a very
large number of bins, and the runtime of the kernel increases linearly with
the number of bins. More bins mean more blocks running in parallel, and
since any given GPU has a practical limit on the number of blocks it
actually runs in parallel, when the number of blocks is very large some of
those blocks run in series instead of in parallel.
I designed this algorithm specifically for OODL, so it parallelizes
well across bins and performs well for the input sizes relevant to OODL.
For other applications, such as graphics and collision detection, this
algorithm may or may not perform as well as an algorithm that is
specifically optimized for those applications.
Chapter 7
Future Work
7.1 Connected Components
While FRNN is good at grouping objects spatially based on their
distance from each other, this may not be enough if our goal is for a
computer system to segment out objects whose shapes do not closely
approximate a circle or sphere. To account for this, we are working on the
next step in the binding process: a connected components step. A
connected component of an undirected graph is a set of vertices such that
every vertex in the set has a path to every other vertex in the set.
Figure 7.1: Visualization of connected components in an undirected graph.
This diagram has three connected components: the five blue points in the
top left, the four green points on the right, and the one red point on the
bottom.
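As a sketch of what the connected components step computes, the FRNN edge list can be labeled with a simple union-find. This is an illustrative implementation, not the one used in the thesis:

```python
def connected_components(n_pts, edges):
    """Label each point with its connected component id using union-find.

    edges is a list of (i, j) index pairs, such as FRNN outputs.
    """
    parent = list(range(n_pts))

    def find(i):
        # walk to the root, compressing the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)
    # relabel roots as consecutive component ids
    roots = [find(i) for i in range(n_pts)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```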
We believe that object segmentation is a fundamental aspect of vision, so in
order to make that capability innate in an artificial vision system we
propose the combination of FRNN and connected components as a
sub-system that allows a computer system to learn to predict
neighboring objects by interpreting the cloud of points produced by the
voting layer in OODL. We hypothesize that, with this approach, objects and
shapes will be learned without any explicit domain knowledge, and that
difficulties such as occlusion or highly variable shapes will be handled
naturally.
At the time of submission, the connected components step is
already implemented; what remains is to complete the full pipeline and
the experiments combining the voting layers, FRNN, and connected
components into a complete OODL system.
Chapter 8
Conclusion
CNNs have been the preferred model for most computer vision
problems for several years, and OODL presents an alternative model that
aims to go beyond the capabilities of CNNs. Just as CNNs benefited from
parallel hardware and software implementations of the convolution
operation, this work represents a similar step for OODL and the FRNN
operation. For this thesis project I implemented a parallel system that
maps points to bins and maps each bin to its neighbor bins, along with a
parallel fixed-radius near neighbors system that processes bins and pairs
of points in parallel. The parallel bin FRNN algorithm performs well for
the input sizes and types necessary to train an OODL model, and a key
result is that this parallel FRNN with bins implementation is orders of
magnitude faster than the non-parallel, non-binning FRNN
implementation. With these results, the critical FRNN component of OODL
is in place, and the continued design of OODL can proceed knowing that its
core functionality is computationally efficient and practical.
References
[1] Owens, John D., et al. "A survey of general-purpose computation on
graphics hardware." Computer graphics forum. Vol. 26. No. 1. Oxford, UK:
Blackwell Publishing Ltd, 2007.
[2] Sze, Vivienne, et al. "Efficient processing of deep neural networks: A
tutorial and survey." Proceedings of the IEEE 105.12 (2017): 2295-2329.
[3] Stone, John E., David Gohara, and Guochun Shi. "OpenCL: A parallel
programming standard for heterogeneous computing systems." Computing
in science & engineering 12.3 (2010): 66.
[4] Yang, Zhiyi, Yating Zhu, and Yong Pu. "Parallel image processing based
on CUDA." 2008 International Conference on Computer Science and Software
Engineering. Vol. 3. IEEE, 2008.
[5] Ketkar, Nikhil. "Introduction to pytorch." Deep learning with python.
Apress, Berkeley, CA, 2017. 195-208.
[6] Paszke, Adam, et al. "Automatic differentiation in pytorch." (2017).
[7] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet
classification with deep convolutional neural networks." Advances in
neural information processing systems. 2012.
[8] LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images,
speech, and time series." The handbook of brain theory and neural networks
3361.10 (1995).
[9] Cheng, Gong, Peicheng Zhou, and Junwei Han. "Learning
rotation-invariant convolutional neural networks for object detection in
VHR optical remote sensing images." IEEE Transactions on Geoscience and
Remote Sensing 54.12 (2016): 7405-7415.
[10] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural
networks from overfitting." The Journal of Machine Learning Research 15.1
(2014): 1929-1958.
[11] Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. "Dynamic
routing between capsules." Advances in neural information processing
systems. 2017.
[12] Liao, Qianli, and Tomaso Poggio. Object-oriented deep learning. Center
for Brains, Minds and Machines (CBMM), 2017.
https://cbmm.mit.edu/publications/object-oriented-deep-learning
[13] Liao, Qianli, and Tomaso Poggio. Exact equivariance, disentanglement
and invariance of transformations. 2017.
Appendix
This appendix contains most of the code for this project:
Python Code Mapping Points to their Bins
import math
import torch

def get_pts2bins(pts, batch_ids, radius, scale_radius, device):
    """Map each point (x, y, z, s) to its 4D bin coordinates, then append the batch id."""
    ix, iy, iz, i_s = 0, 1, 2, 3
    n_pts, n_dims = pts.size()
    n_lin_dims = 3
    apply_4d = lambda func: torch.tensor(
        [func(pts[:, i]) for i in range(n_dims)], device=device)
    min_vals = apply_4d(torch.min)
    pts2bins = pts.clone()
    # steps for xyz bins:
    #   1.) shift to min 0
    #   2.) map by radius
    #   3.) scale by max scale in corresponding scale bin

    # xyz: 1.) shift to min 0
    pts2bins[:, ix : iz + 1].sub_(min_vals[ix : iz + 1])
    # xyz: 2.) map by radius
    pts2bins[:, ix : iz + 1].div_(radius)
    # xyz: 3.) scale by max scale in corresponding scale bin
    unshifted_scale_bins = torch.floor(
        torch.log(pts[:, i_s]) / math.log(scale_radius))
    max_unshifted_scale_bins = torch.max(unshifted_scale_bins).item()
    exp_range = scale_radius ** torch.arange(
        1, max_unshifted_scale_bins + 2, dtype=torch.float, device=device)
    scale_divs = exp_range[unshifted_scale_bins.long()].view(-1, 1)
    pts2bins[:, : n_lin_dims].div_(scale_divs)
    # steps for scale bins:
    #   1.) map by scale_radius
    pts2bins[:, i_s].log_()
    pts2bins[:, i_s].div_(math.log(scale_radius))
    pts2bins.floor_()
    pts2bins = pts2bins.long()
    # now add another dimension for the batch_ids
    batch_bins = batch_ids.view(-1, 1)
    pts2bins = torch.cat([pts2bins, batch_bins], dim=1)
    return pts2bins
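For intuition, the linear and scale binning above can be mirrored for a single point in plain Python. `bin_coords` is a hypothetical simplification: it omits the per-scale-bin rescaling of the linear coordinates that `get_pts2bins` performs.

```python
import math

def bin_coords(pt, min_vals, radius, scale_radius):
    """Compute the (x, y, z, scale) bin of one point (x, y, z, s):
    shift the linear coordinates to min 0 and divide by the radius,
    and bin the scale logarithmically in base scale_radius.
    """
    *lin, s = pt
    lin_bins = [math.floor((v - m) / radius) for v, m in zip(lin, min_vals)]
    s_bin = math.floor(math.log(s) / math.log(scale_radius))
    return (*lin_bins, s_bin)
```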
Python Code Mapping Bins to Points
def get_pt_idx_data(pts, bins_5d, device):
    bins = bins_5d
    ix, iy, iz, i_s, ib = 0, 1, 2, 3, 4
    _, n_dims = bins_5d.size()
    n_bins5d = torch.tensor(
        [torch.max(bins[:, i]) + 1 for i in range(n_dims)], device=device)
    # do sorting of (bin, pt) pair and sort by bin
    bins1d = bins[:, ib] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz] * n_bins5d[i_s]
    bins1d += bins[:, i_s] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz]
    bins1d += bins[:, iz] * n_bins5d[ix] * n_bins5d[iy]
    bins1d += bins[:, iy] * n_bins5d[ix]
    bins1d += bins[:, ix]
    n_bins1d = n_bins5d[0] * n_bins5d[1] * n_bins5d[2] * n_bins5d[3] * n_bins5d[4]
    sorted_idxs = torch.argsort(bins1d)
    # if the bins were: [0, 0, 1, 1, 1, 2]
    # we would need the pt_idxs to be:
    #   [1, -1, 3, 4, -1, -1]
    # and the first pt idxs to be:
    #   [0, 2, 5] (one for each bin)
    sorted_bins1d = bins1d[sorted_idxs]
    # now get the pt_idxs:
    diffs = sorted_bins1d[1:] - sorted_bins1d[:-1]
    diffs_mask = diffs != 0
    pt_idxs_sorted = torch.arange(
        bins1d.size()[0], dtype=torch.long, device=device) + 1
    pt_idxs_sorted[:-1][diffs_mask] = -1
    pt_idxs_sorted[-1] = -1
    mixed_pt_idxs = sorted_idxs[pt_idxs_sorted]
    mixed_pt_idxs[pt_idxs_sorted == -1] = -1
    # unsorts it...
    pt_idxs = mixed_pt_idxs[torch.argsort(sorted_idxs)]
    # now create the idxs for the linked list..
    first_pt_idxs = torch.zeros(n_bins1d, dtype=torch.long, device=device) - 1
    # have to manually add the first element
    min_bin1d = sorted_bins1d[0]
    first_pt_idxs[min_bin1d] = sorted_idxs[0]
    mixed_first_bins = sorted_bins1d[1:][diffs_mask]
    mixed_first_pts = sorted_idxs[1:][diffs_mask]
    first_pt_idxs[mixed_first_bins] = mixed_first_pts
    return pt_idxs, first_pt_idxs, n_bins5d, bins1d
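The linked-list layout that this function builds can be illustrated with the example from its comments. `bins_to_linked_lists` is a plain-Python sketch, not the vectorized PyTorch version above:

```python
def bins_to_linked_lists(bins1d, n_bins):
    """Build the per-bin linked-list arrays.

    For each point, pt_idxs holds the index of the next point in the same
    bin (-1 if it is the last), and first_pt_idxs holds each bin's first
    point (-1 for empty bins).
    """
    pt_idxs = [-1] * len(bins1d)
    first_pt_idxs = [-1] * n_bins
    last_in_bin = [-1] * n_bins
    for i, b in enumerate(bins1d):
        if first_pt_idxs[b] == -1:
            first_pt_idxs[b] = i       # first point seen in bin b
        else:
            pt_idxs[last_in_bin[b]] = i  # chain from the previous point
        last_in_bin[b] = i
    return pt_idxs, first_pt_idxs
```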
Python Code Mapping Bins to Neighbor Bins
def get_nebs2nebs(radius, scale_radius, n_bins5d, device):
    ix, iy, iz, i_s, ib = 0, 1, 2, 3, 4
    nx, ny, nz, ns, nb = n_bins5d
    ndims = 5
    n_lin_dims = 3
    n_bins1d = nx * ny * nz * ns * nb
    # map each bin to the start values in the ranges at each scale
    #
    # the number of neighbors in euclidean space at scale 1
    # (scale bin 0 is relative scale scale_radius**0, and
    #  scale bin 1 is relative scale scale_radius**1, etc.)
    # so for each scale bin we need to adjust this
    # n_nebs_left accordingly
    n_side = 1
    n_middle = 1
    # because log(scale_bin_width) / log(scale_radius) == 1
    n_nebs = n_side + n_middle + n_side
    bin_offsets = torch.arange(
        -n_side, -n_side + n_nebs, dtype=torch.float, device=device)
    all_bins = range5d(n_bins5d, torch.float, device).view(n_bins1d, ndims)
    nebs_at_scales = []
    for bin_offsets_i in bin_offsets:
        scale_transform = scale_radius ** bin_offsets_i
        starts = ((all_bins[:, : n_lin_dims] / scale_transform) - n_side).view(
            n_bins1d, 1, n_lin_dims)
        # stop = torch.ceil(n_side + n_middle / scale_transform + n_side).long().item()
        stop = math.ceil(n_side + n_middle / scale_transform + n_side)
        # + 1
        # range3d is the 3-D analogue of range5d (not shown in this appendix)
        range_at_scale_i = range3d(
            [stop, stop, stop], torch.float, device).view(
                stop ** n_lin_dims, n_lin_dims)
        nebs_at_scale_i = torch.empty(
            (n_bins1d, stop ** 3, ndims), device=device)
        # setting the x, y, z dims
        nebs_at_scale_i[:, :, : iz + 1] = starts + range_at_scale_i
        # setting the scale dim
        nebs_at_scale_i[:, :, i_s] = (
            all_bins.view(-1, 1, ndims)[:, :, i_s] + bin_offsets_i)
        # setting the batch_id dim
        nebs_at_scale_i[:, :, ib] = all_bins.view(-1, 1, ndims)[:, :, ib]
        # floor to round to the nearest bin below so that they can be longs
        nebs_at_scale_i = nebs_at_scale_i.floor().long()
        nebs_at_scales.append(nebs_at_scale_i)
    nebs_tensor5d = torch.cat(nebs_at_scales, dim=1)
    # filtering the negative values
    mask_out_of_bounds = torch.sum(nebs_tensor5d < 0, dim=2) > 0
    # filtering the values that are larger than the max possible values
    mask_out_of_bounds += torch.sum(nebs_tensor5d - n_bins5d >= 0, dim=2) > 0
    # reset the mask to only have values that are either 0 or 1
    mask_out_of_bounds = mask_out_of_bounds > 0
    # transform 5d bins to 1d bins
    nebs1d = (nebs_tensor5d[:, :, ib] * n_bins5d[ix] * n_bins5d[iy]
              * n_bins5d[iz] * n_bins5d[i_s])
    nebs1d += nebs_tensor5d[:, :, i_s] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz]
    nebs1d += nebs_tensor5d[:, :, iz] * n_bins5d[ix] * n_bins5d[iy]
    nebs1d += nebs_tensor5d[:, :, iy] * n_bins5d[ix]
    nebs1d += nebs_tensor5d[:, :, ix]
    nebs1d[mask_out_of_bounds] = -1
    return nebs1d
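The neighbor enumeration can be illustrated in two dimensions. `neighbor_bins_2d` is a hypothetical simplification of `get_nebs2nebs`, without the scale and batch dimensions or the scale-dependent window size:

```python
def neighbor_bins_2d(bin_xy, n_bins_xy):
    """Enumerate the 3x3 neighborhood of a 2D bin, dropping bins that
    fall outside the grid (the thesis code marks those with -1).
    """
    bx, by = bin_xy
    nx, ny = n_bins_xy
    nebs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            x, y = bx + dx, by + dy
            if 0 <= x < nx and 0 <= y < ny:
                nebs.append((x, y))
    return nebs
```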
def range5d(shape5d, dtype, device):
    lx, ly, lz, ls, lb = shape5d
    n_dims = 5
    r5d = torch.zeros((lb, ls, lz, ly, lx, n_dims), dtype=dtype, device=device)
    r5d[:, :, :, :, :, 0] += torch.arange(lb, dtype=dtype, device=device).view((-1, 1, 1, 1, 1))
    r5d[:, :, :, :, :, 1] += torch.arange(ls, dtype=dtype, device=device).view((-1, 1, 1, 1))
    r5d[:, :, :, :, :, 2] += torch.arange(lz, dtype=dtype, device=device).view((-1, 1, 1))
    r5d[:, :, :, :, :, 3] += torch.arange(ly, dtype=dtype, device=device).view((-1, 1))
    r5d[:, :, :, :, :, 4] += torch.arange(lx, dtype=dtype, device=device)
    # need to reverse each indiv coordinate.
    rev_idx = torch.arange(start=n_dims - 1, end=-1, step=-1, device=device)
    r5d = r5d.index_select(n_dims, rev_idx)
    return r5d
C++ CUDA Kernel Code for Parallel FRNN with Bins
// n_max_pts_bin, pt_size, offset_x/y/z/s, and edge_size are compile-time
// constants defined elsewhere in the project.
__global__ void frnn_cuda_forward_kernel(
        const int* neighbor_bins,
        const float* pts,
        const int* pt_idxs,
        const int* first_pt_idxs,
        const float radius,
        const float scale_radius,
        const int n_max_neighbors,
        const int n_bins,
        int* edges,
        int* i_edges,
        const int max_size_edges) {
    // stores the points for bin_a and bin_b
    __shared__ float bin_a[n_max_pts_bin * pt_size];
    __shared__ float bin_b[n_max_pts_bin * pt_size];
    // stores the pt ids for bin_a and bin_b
    __shared__ int bin_a_ids[n_max_pts_bin];
    __shared__ int bin_b_ids[n_max_pts_bin];

    for (int idx_i_bin_a = blockIdx.x; idx_i_bin_a < n_bins;
            idx_i_bin_a += gridDim.x) {
        __syncthreads();
        __threadfence();
        int i_bin_a = idx_i_bin_a;
        // if the bin is empty:
        if (first_pt_idxs[i_bin_a] == -1) { continue; }

        //////////////
        // load bin_a
        //////////////
        if (threadIdx.x == 0 && threadIdx.y == 0) {
            bool set_neg_1 = false;
            int inext = first_pt_idxs[i_bin_a];
            for (int i = 0; i < n_max_pts_bin; i++) {
                if (set_neg_1) { bin_a_ids[i] = -1; continue; }
                bin_a_ids[i] = inext;
                if (inext == -1) { set_neg_1 = true; continue; }
                int i_pts_start = inext * pt_size;
                int i_bin = i * pt_size;
                bin_a[i_bin + offset_x] = pts[i_pts_start + offset_x];
                bin_a[i_bin + offset_y] = pts[i_pts_start + offset_y];
                bin_a[i_bin + offset_z] = pts[i_pts_start + offset_z];
                bin_a[i_bin + offset_s] = pts[i_pts_start + offset_s];
                inext = pt_idxs[inext];
            }
        }

        for (int idx_i_bin_b = 0; idx_i_bin_b < n_max_neighbors;
                idx_i_bin_b += 1) {
            // int idx_i_bin_b = blockIdx.y;
            int i_bin_b = neighbor_bins[i_bin_a * n_max_neighbors + idx_i_bin_b];
            // neighboring bins in the matrix that are empty
            // should have been set to -1,
            // but there might be more bin_b's that
            // are not -1 after a bin_b that is -1
            if (i_bin_b == -1) { continue; }
            if (first_pt_idxs[i_bin_b] == -1) { continue; }
            // don't double check bin pairs
            if (i_bin_b < i_bin_a) { continue; }

            /*----------
              LOAD BIN B
            -----------*/
            bool set_neg_1 = false;
            int inext = first_pt_idxs[i_bin_b];
            for (int i = 0; i < n_max_pts_bin; i++) {
                if (set_neg_1) { bin_b_ids[i] = -1; continue; }
                bin_b_ids[i] = inext;
                if (inext == -1) { set_neg_1 = true; continue; }
                int i_pts_start = inext * pt_size;
                int i_bin = i * pt_size;
                bin_b[i_bin + offset_x] = pts[i_pts_start + offset_x];
                bin_b[i_bin + offset_y] = pts[i_pts_start + offset_y];
                bin_b[i_bin + offset_z] = pts[i_pts_start + offset_z];
                bin_b[i_bin + offset_s] = pts[i_pts_start + offset_s];
                inext = pt_idxs[inext];
            }
            __syncthreads();
            __threadfence();

            /*---------------------
              THE COMPARISONS
              now do the comparison between
              bin_a's pts and bin_b's pts
            ----------------------*/
            // ia is the bin index for the current pt a,
            // so it is NOT the index into the pts matrix
            for (int ia = threadIdx.x; ia < n_max_pts_bin; ia += blockDim.x) {
                if (bin_a_ids[ia] <= -1) { break; }
                float ax = bin_a[ia * pt_size + offset_x];
                float ay = bin_a[ia * pt_size + offset_y];
                float az = bin_a[ia * pt_size + offset_z];
                float as = bin_a[ia * pt_size + offset_s];
                for (int ib = threadIdx.y; ib < n_max_pts_bin; ib += blockDim.y) {
                    if (bin_b_ids[ib] <= -1) { break; }
                    // don't compare the same point to itself:
                    if (bin_b_ids[ib] == bin_a_ids[ia]) { continue; }
                    // if it's the same bin,
                    // only compare lower points to higher points
                    if ((i_bin_a == i_bin_b) && (bin_b_ids[ib] <= bin_a_ids[ia])) {
                        continue;
                    }
                    float bx = bin_b[ib * pt_size + offset_x];
                    float by = bin_b[ib * pt_size + offset_y];
                    float bz = bin_b[ib * pt_size + offset_z];
                    float bs = bin_b[ib * pt_size + offset_s];
                    // check that the scales are close enough
                    float scale_max = as;
                    float scale_min = bs;
                    if (as < bs) {
                        scale_max = bs;
                        scale_min = as;
                    }
                    if ((scale_max / scale_min) > scale_radius) { continue; }
                    float diffx = bx - ax;
                    float diffy = by - ay;
                    float diffz = bz - az;
                    float dist = diffx * diffx + diffy * diffy + diffz * diffz;
                    dist = sqrt(dist);
                    // geometric mean of the two points' scales
                    float avg_scale = sqrt(as * bs);
                    if (dist >= radius * avg_scale) { continue; }
                    int this_i_edges = atomicAdd(&i_edges[0], edge_size);
                    edges[this_i_edges + 0] = bin_a_ids[ia];
                    edges[this_i_edges + 1] = bin_b_ids[ib];
                }
            }
            __syncthreads();
            __threadfence();
        }
        __syncthreads();
        __threadfence();
    }
    return;
}