Efficient Fixed-Radius Near Neighbors for Machine Learning 

 

by David Porter Walter III 

S.B., MIT (2018) 

 

Submitted to the 

Department of Electrical Engineering and Computer Science 

in partial fulfillment of the requirements for the degree of 

 

Master of Engineering in Electrical Engineering and Computer Science 

 

at the 

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 

June 2019 

© Massachusetts Institute of Technology 2019. All rights reserved. 

Author: _________________________________________
Department of Electrical Engineering and Computer Science
May 24, 2019

Certified by: _________________________________________
Tomaso A. Poggio
Professor of Brain and Cognitive Science
Thesis Supervisor
May 24, 2019

Accepted by: _________________________________________
Katrina LaCurts
Chair, Master of Engineering Thesis Committee

Efficient Fixed-Radius Near Neighbors for Machine Learning 

by David Porter Walter III 

S.B., MIT (2018)  

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of 

Master of Engineering in Electrical Engineering and Computer Science 

Abstract  

Deep learning has enabled artificial intelligence systems to move away from manual feature engineering and toward feature learning and better performance. Convolutional neural networks (CNNs) have especially demonstrated super-human performance in many vision tasks. One big reason for the success of CNNs is the use of parallelizable software and hardware to run these models, making their use computationally practical. This work is focused on the design and implementation of an efficient and parallel fixed-radius near neighbors program (FRNN). FRNN is a core component in a new type of machine learning model, object oriented deep learning (OODL), serving as a replacement for CNNs with goals of invariance, equivariance, interpretability, and computational efficiency that improve upon the abilities of CNNs. This efficient implementation of FRNN is a critical step in making OODL computationally efficient and practical.

 

Thesis Supervisor: Tomaso A. Poggio
Title: Professor of Brain and Cognitive Science

 

Acknowledgements 

Thank you to everyone at the Poggio Lab, especially to Qianli Liao for providing guidance and mentorship for me in this project. Thank you to my grandpa David Walter Sr. for being there on my MIT journey since day one. Thank you to Kathy Guerra for being there alongside me and believing in me since the first year of my five-year MIT journey, because without you I would not have made it to the end of this thesis.

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Motivation
   1.2 Contributions
2 Background
   2.1 Parallel Programming
   2.2 PyTorch
3 Related Work
   3.1 Convolutional Neural Networks
   3.2 Dropout
   3.3 Pooling
   3.4 Object Oriented Deep Learning
4 Methods
   4.1 Combinative Functions in Deep Learning
   4.2 FRNN With Bins
   4.3 FRNN With Scales and Bins
   4.4 Parallel Binning
      4.4.1 Mapping Points to Bins
      4.4.2 Mapping Bins to Points
      4.4.3 Mapping Bins to Neighbor Bins
   4.5 Parallelized FRNN
      4.5.1 Parallelizing Across Bins and Points
      4.5.2 Storing Bin Points in Shared Memory
5 Results
   5.1 Theoretical Runtime Analysis
   5.2 Experimental Runtime Analysis
6 Evaluation
   6.1 Practical Trade-offs of Different Algorithms
7 Future Work
   7.1 Connected Components
8 Conclusion
References
Appendix
   Python Code Mapping Points to their Bins
   Python Code Mapping Bins to Points
   Python Code Mapping Bins to Neighbor Bins
   CPP CUDA Kernel Code for Parallel FRNN with Bins

List of Figures

3.1 OODL Voting and Binding
4.1 Brute Force FRNN
4.2 FRNN With Bins
4.3 FRNN With Bins and Scale
4.4 Experimental Example of Points Mapped to Bins
4.5 Experimental Example of Neighbors for a Single Bin
5.1 Parallel FRNN Runtime versus Radius for Different Numbers of Points
5.2 Parallel FRNN Runtime versus Number of Bins for Different Numbers of Points
5.3 Parallel FRNN Runtime versus Number of Outputted Edges for Different Numbers of Points
5.4 Parallel FRNN Runtime Per Edge versus Number of Outputted Edges for Different Numbers of Points
7.1 Connected components in an undirected graph

List of Tables

5-1 FRNN Time Complexity
5-2 FRNN Space Complexity

Chapter 1 

Introduction 

1.1 Motivation 

The rise of deep learning marks a leap forward in the field of artificial intelligence, enabling humans to build a computational system that can hierarchically process an input, learn a hierarchy of features, and use those features to transform an input into a classification or an action. Nature has genetically endowed humans with the ability to perform very complicated and intelligent tasks. As scientists continue to uncover some of the mechanisms behind neurally based intelligence, we can use that knowledge to build artificial systems that have many of the useful properties that we think neurally based systems have, and even move beyond some of the limitations of human intelligence. As the field of machine learning (ML) advances, we see a trend analogous to the transition from training wheels on a bike, to no training wheels, to modifying the bike itself, to scrapping the bike altogether and building a better form of transportation. What this means for AI is that we started by hard-coding in all the details for our computer systems, but ever increasingly, we are heading toward a world where our computational systems are not only given the ability to make their own decisions to a small degree, but will be able to make decisions about how their decisions are even made and about how they learn, with less human knowledge forced on our computer systems.

As good scientists and engineers of artificially intelligent systems, we need to be cognizant of both this long-term trajectory of making fewer human-made assumptions and the short-term benefits of injecting human knowledge into a system where learning that knowledge would be too difficult. In the context of machine learning and deep learning, we want to advance the field in a way that continues to promote the minimization of human-biased knowledge while also forcing an AI system to have properties that we know are good. The research in this thesis represents a movement in that direction. This work is focused on the design and implementation of an efficient and parallel fixed-radius near neighbors program (FRNN). FRNN is a core component in a new type of machine learning model, object oriented deep learning (OODL), serving as a replacement for CNNs with goals of invariance, equivariance, interpretability, and computational efficiency. This efficient implementation of FRNN is a critical step in making OODL computationally efficient and practical.

1.2 Contributions 

For the work of this thesis project I: 

● Implemented a parallel system that maps points in both linear and exponential spaces to bins, and maps each bin to its neighbor bins.

● Implemented a fixed-radius near neighbors system that processes bins in parallel and pairs of points within bins in parallel.

● Showed that the runtime of this parallel FRNN implementation with binning improves upon the non-parallel and non-binning implementation by orders of magnitude, making this implementation practical for use in object oriented deep learning.

Chapter 2 

Background 

2.1 Parallel Programming 

Machine learning, and specifically deep learning, has seen such a breakthrough in no small part because of the advancement of hardware with the capability to run parallelizable code on graphics processing units (GPUs). As the name suggests, GPUs were originally designed for graphical processing tasks that inherently benefit from massive amounts of parallelization. Specifically, we can thank the videogame industry for the initial proliferation of GPUs used for graphics applications; other industries like blockchain and AI have since been able to benefit from the advancements in GPU hardware as well [1] [2]. There are many libraries and frameworks for writing GPU code, including OpenCL [3], but the most popular deep learning frameworks, including PyTorch and TensorFlow, use Nvidia's CUDA. "CUDA is a parallel computing platform and programming model that makes using a GPU for general purpose computing simple and elegant" [4]. CUDA currently works with multiple programming languages, including C, C++, and Fortran. In simple terms, GPUs and the CUDA framework allow an engineer to transform a parallelizable algorithm into code in which the hardware actually operates in parallel, with hundreds to thousands of threads running at the same time. This goes beyond a normal CPU, which runs code that seems parallel in software, through abstraction, but in reality executes the operations mostly in serial or on a very small number of CPU cores. While in theory a GPU is still a form of a Turing machine (TM) with all of the same theoretical limitations, based on the constant number of threads it can actually run in parallel, in practice the number of blocks and threads on a GPU makes it run certain parallel operations much faster than a CPU. Because this project specifically uses Nvidia GPUs and the CUDA framework, the most practically useful thing to know about the way Nvidia's GPUs are programmed is that they have threads, like a normal CPU, and blocks, which represent groups of threads all undergoing the same operations in parallel. An Nvidia GPU runs multiple threads in a block in parallel and multiple blocks in parallel.

2.2 PyTorch 

We have also chosen to use CUDA because PyTorch makes it relatively easy to create custom CUDA kernels and import them directly into PyTorch. PyTorch is a major member of the list of popular deep learning frameworks, which also includes TensorFlow and Theano. PyTorch garners its strength from its ability to dynamically run deep learning models, unlike frameworks like TensorFlow, which need to compile a static computational graph before running a deep learning model. This dynamic nature makes PyTorch much easier to use for research and development, including building and debugging models. The downside of PyTorch's dynamic nature is that it has less room for optimization of the underlying implementation. While optimization in PyTorch may be limited to a small degree, PyTorch is practically easier to customize because of the simplicity, structure, and dynamic nature of the framework [5] [6].
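For concreteness, the general mechanism PyTorch provides for this is its C++/CUDA extension interface. The sketch below shows how a custom CUDA kernel can be JIT-compiled and imported with torch.utils.cpp_extension; the file and function names are placeholders, not the exact ones used in this project.

import torch
from torch.utils.cpp_extension import load

# JIT-compile a custom CUDA extension and load it as a Python module.
# "frnn_cuda.cpp" and "frnn_kernel.cu" are placeholder source file names.
frnn_ext = load(
    name="frnn_cuda",
    sources=["frnn_cuda.cpp", "frnn_kernel.cu"],
    verbose=True,
)

# The compiled functions then operate directly on GPU tensors, for example:
# edges = frnn_ext.frnn(points.cuda(), radius, scale_radius)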

Chapter 3 

Related Work 

3.1 Convolutional Neural Networks 

Convolutional neural networks (CNNs) are to thank for much of the advancement in computer vision over the past several years, in no small part due to advancements in parallelizable hardware like GPUs [7]. One of the main properties that make CNNs so powerful is translation invariance [8]. Invariance is the property that allows a system to not be affected by a change in a feature. For example, translation invariance allows an ML model to not lose performance when an object is shifted. We would like computer vision models to also have rotation invariance for novel rotations, but CNNs struggle with this property compared to their ability to handle novel translations [9].

3.2 Dropout 

Dropout is a method in which nodes in a deep learning model are randomly set to zero. Dropout can help our models with at least one notable property: disentanglement [10]. Disentanglement is the idea that a feature of the input is both captured in the system and isolated in a small subset of the system. In contrast, a system that is entangled has many of its important aspects spread across many different parts of the system. A system that is disentangled is modular, such that certain parts of the system can be identified with certain aspects of the computation for that system. From an interpretability perspective, like modularity in software engineering, disentanglement in ML models is a desirable property that allows scientists and engineers to understand how the system works instead of treating the system as a black box. Interpretability is a positive step in the direction of safety for AI systems, and a step toward scientists and engineers better understanding how our models work and how they can be improved.

3.3 Pooling 

Pooling layers are used in CNNs to summarize groups of neurons within a k x k sliding kernel map [7]. Max pooling is a popular pooling layer type, where the output from each kernel map is the max value within the kernel map output before pooling. Max pooling serves as a way to propagate only the most prominent value within a region of the input and ignore the rest [11]. Pooling operations help CNNs achieve translation invariance, because the location of a feature matters less when it is propagated through a pooling layer, as the pooling layer only cares about the presence of the feature. Nevertheless, with CNNs, pooling helps the most with invariance to position, but not necessarily invariance to other features like rotation and size. With a different model type, pooling layers could aid in the problem of invariance to other feature types.

3.4 Object Oriented Deep Learning 

Object Oriented Deep Learning (OODL) is a model whose highest aims are interpretability, disentanglement, and equivariance [12] [13]. Unlike the conventional CNN architecture, which uses N-dimensional feature tensors as the fundamental representation along with convolutional kernels, OODL's basic representation is an object. In this context, an object is an entity that may have explicit properties like position, rotation, and size built in, and an N-length signature vector that contains learned features for that object. As with most other encoding neural architectures that process images, the first layer takes as input an image and each subsequent layer encodes the input into continually higher levels of abstracted objects and features, such that objects and features in high layers are some learned combination of features from lower layers. This is similar to a typical deep learning architecture because features are still a combination of lower-level features, but OODL is different because OODL also has objects as fundamental units, allowing for a symbolic paradigm to emerge. What this symbolic paradigm represents is a movement beyond the simple statistical feature detectors of typically strong deep learning architectures like CNNs and toward an ability to understand the relations between objects in an input, which allows elements such as context and complex relationships to emerge. OODL follows a paradigm similar to other deep learning architectures, where the first, and lowest, layer starts with the individual pixels or other low-level features. If the input is a 2D image, the main difference in OODL's approach is that these pixels are treated as individual objects, and each pixel has properties such as position and rotation. Then, in higher layers there exist fewer objects that are some combination of the objects in the lower layers, until the highest layer has few objects, or only one object that is a weighted sum of all objects remaining in the image.

To transform objects from layer to layer, OODL uses a voting layer followed by a binding layer. A voting layer is currently implemented with a set of radially oriented weights whose values are dot-multiplied with each object's signature to predict neighboring objects. Binding layers are used to combine objects in each layer by aggregating objects toward those with the most surrounding objects. This is why the voting layer is named as such: objects 'vote' for neighboring objects, and the binding layer aggregates the votes.

 

Figure 3.1: Visualization of a voting operation on the left and a binding operation on the right. [12] 

 

Voting layers are more general than convolutional layers because the radial kernel is not constrained to the pixel grid or to a static rotation (or lack thereof). In terms of interpretability, the existence of individual objects makes it easier for a scientist or engineer to locate important areas of a model and to interpret where a feature exists in an object's properties and signature and how it affects the model's computation. In addition, an OODL model has the potential to do less computation when the input is less complex, whereas other neurally based architectures like CNNs carry out the same amount of computation independent of the complexity of the input.

Chapter 4 

Methods 

4.1 Combinative Functions in Deep Learning 

A core aspect of computationally efficient and representative systems like neural networks and hierarchical representations is that there needs to be a way to combine low-level information into high-level representations. Language and vision fit this structure well. We see this in everyday tasks like reading, where we combine letters to formulate words and words to formulate phrases. We combine edges and dots to formulate objects in an image, like eyes and legs on an animal and wheels and windows on a car, and combine these sub-features and sub-objects into high-level features and high-level objects like animals and cars, respectively.

In AI systems, we usually have an input with many low-level features being mapped to an output with a few high-level features. Often this is a mapping from a larger sized input to a smaller sized output, and to do this, AI systems employ combinative mappings that have a net effect of reducing the size of the input. Hierarchical networks like deep learning embrace this paradigm fully. A vanilla feedforward deep network architecture can vary widely, but one usual, and important, aspect is that the final layer maps some larger internal representation to some smaller representation like an encoding, action, or classification. A self-driving car system can take in frames from a video and map them to actions like steering the vehicle and accelerating. A natural language system can read a paragraph and output the emotion of that paragraph, like happy, sad, or angry.

4.2 FRNN With Bins 

Many of the important properties of OODL, including dynamic computational cost and equivariance, rely on a dynamic architecture that deviates from the less dynamic nature of vanilla deep learning approaches like feedforward neural networks, and even from slightly dynamic architectures like CNNs. A CNN's core combinative function is the convolution with either a stride larger than one or dropout. While we are open to different combinative functions for OODL, OODL uses FRNN as its core combinative operation. We have currently settled on the FRNN operation instead of convolutional filters because of its dynamic property that allows it to be applied to point clouds or to inputs and model layers of various sizes and shapes.

FRNN takes as input a set of points and a radius, and it outputs a matrix that maps each point to all other points that are within a distance of the radius from that point.

Algorithm 4.1: FRNN Brute Force. 

1. initialize empty edges list 

2. for all points xa  

a. for all points xb  

i. compute distance between xa and xb 

ii. if distance < radius, push xa and xb to edges list 

3. output edges list 
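As a point of reference, a minimal PyTorch sketch of this brute force step might look like the following (illustrative only, not the project's exact implementation; it drops self-pairs, and points is assumed to be an N x d float tensor):

import torch

def frnn_brute_force(points: torch.Tensor, radius: float) -> torch.Tensor:
    # points: (N, d) tensor of coordinates.
    # Returns an (E, 2) tensor of index pairs (a, b) with distance(a, b) < radius.
    n = points.shape[0]
    dists = torch.cdist(points, points)   # (N, N) pairwise distances
    mask = (dists < radius) & ~torch.eye(n, dtype=torch.bool, device=points.device)
    return mask.nonzero()                 # each row is one edge

Every pair of points is compared, which is exactly the O(|X|^2) cost that the binning approach below is designed to avoid.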

 

 

Figure 4.1: Depiction of the operation for one point in the brute force FRNN algorithm in 2 dimensions. The blue, or lighter, points are "near neighbors" of the current point in the center of the circle of size radius.

 

In order to not needlessly compare every point to every other point, a better algorithm than the brute force algorithm would place points in bins, such that only points that are a constant number of bins away need to be checked. Imagine you are acting out FRNN in real life between you and everyone in the world, and the radius is 10 miles. It would be unwise to check distances to people all the way across the world. Instead, you know to only check within your town or city and then possibly check the few surrounding towns. Similarly for FRNN with generic points, this program puts each point in a bin, much as people belong to cities. If the radius were smaller, the program could make the bins smaller, and similarly for larger radii, enabling us to only look at a relatively small number of bins.

Algorithm 4.2: FRNN with Bins. 

1. ptsToBins2d ← floor( (points2d - min(points2d)) / radius )

2. ptsToBins1d: map each ptsToBins2d to unique 1d integer 

3. binsToPts1d ← map each 1D bin to the ids in points2d 

4. initialize empty edges list 

5. for ba in binsToPts1d: 

a. for bb in ba's neighbor bins (including ba itself):

i. for xa in ba: 

1. for xb in bb: 

a. compute distance between xa and xb 

b. if distance < radius, push xa and xb to edges list 

6. output edges list 
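A compact, serial Python sketch of this binned variant in two dimensions is shown below; it is meant for intuition only, while the project's actual, vectorized PyTorch code is listed in the appendix.

from collections import defaultdict
from math import floor, hypot

def frnn_with_bins_2d(points, radius):
    # points: list of (x, y) tuples. Returns (i, j) index pairs with distance < radius.
    min_x = min(p[0] for p in points)
    min_y = min(p[1] for p in points)

    # Steps 1-3: map each point to an integer bin of width `radius`,
    # and each bin to the ids of the points it contains.
    bins = defaultdict(list)
    for i, (x, y) in enumerate(points):
        bins[(floor((x - min_x) / radius), floor((y - min_y) / radius))].append(i)

    # Steps 4-6: compare points only against the 3 x 3 block of neighboring bins;
    # because bins are one radius wide, no near neighbor can be farther away.
    edges = []
    for (bx, by), ids in bins.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in ids:
                    for j in bins.get((bx + dx, by + dy), []):
                        if i != j and hypot(points[i][0] - points[j][0],
                                            points[i][1] - points[j][1]) < radius:
                            edges.append((i, j))
    return edges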

 

Figure 4.2: FRNN in 2 dimensions with bins. In this case, the bins are as wide as the radius, so it is guaranteed that every neighbor of the current bin will be no more than one bin away.

 

4.3 FRNN With Scales and Bins 

So far I have been discussing the FRNN problem in terms of points that exist in 2D or 3D linear Euclidean space. In this linear space, determining the distance between coordinates matches our usual intuitions, and the radius is always the same no matter which pair of points you are considering. But when we introduce scales, the radius that we use to compare the distance between points is also multiplied by the value of the scale of that point. In this way, the radius is made proportionally larger when the scale is larger, making our radius relative with respect to the scale. When we have scales, the radius r that is inputted into the FRNN program is changed to an adjusted radius equal to r * s, where s is the scale.

 

Figure 4.3: FRNN with one linear dimension and one scale dimension with bins, where the first dimension, on the x-axis, is in linear space, and the second, scale dimension is in logarithmic space. Bins that are not white are considered neighboring bins, but the light red (lightest shaded) sections contain points that will never be neighboring points. Gaps between the scale bins are left to make room for the explanatory lines and arrows. As you will see in figure 4.4, the scale bins get taller as the scale increases.

 

Figure 4.4: Experimental Example of Points Mapped to Bins. We have one linear dimension on the x-axis and one exponential dimension on the y-axis, which is the scale. In this case we can see that bins get wider in the linear dimensions and taller in the scale dimension as the scale increases.

 

Figure 4.5: Experimental Example of Neighbors for a Single Bin. The middle bin, in light blue, is the current bin. This example corresponds with the same points and bins as in figure 4.4.

4.4 Parallel Binning 

This section describes setting up the bin data structures that will be inputted into the CUDA kernel. At the start, the FRNN algorithm is given the points (x, y, z, s, b), the linear radius for the first 3 dimensions x, y, and z, and the exponential scale radius for the fourth dimension, the scale. Here b is the batch id, which identifies points in unique inputs so that we do not compare distances between points in different inputs; this allows us to process multiple inputs in a single batch. Before the FRNN CUDA kernel is initialized, we need the bins that each point is mapped to. The high-level steps for this part of the program involve:

1. Compute the map from each point to its bin. 

2. Compute the map from each bin to its points. 

3. Compute the map from bins to their neighbor bins. 

The Python code for all three steps of this algorithm is in the appendix. 

4.4.1 Mapping Points to Bins 

The main decision to be made here is how wide to make the bins. As the bin widths get larger and larger, we will have more points in each bin and fewer bins. If all points fit into one bin, then we would be back at the brute force FRNN algorithm, but if each bin has only one point then we could still theoretically have a speedup, though practically we would not. Somewhere in the middle should produce the best performance. Although figure 4.2 and figure 4.3 already show this choice of bin size, the figures alone do not make clear that the choice is deliberate. I chose the width of the linear bins to be the same as the radius, and the bin width in the fourth scale dimension to be the scale radius. Considering only the first 3 linear dimensions, each bin forms a cube with each side having a length of the radius. This makes getting neighboring bins in the linear space much easier, because any point in the current bin is guaranteed to find all of its neighboring points within the neighboring bins, as you can see in figure 4.2. Still considering linear space in 3D, what this choice of bin size also means is that if we assume the points are mostly distributed evenly, as in the average case, then each point will only make O(|E_i|) comparisons with other points, where |E_i| is the number of neighboring points for point i.

In addition, we do not want negative bin ids, so we also shift the value of every point by the lowest value, so that no point has a value less than 0. With this, the equation for mapping a point to a bin in linear space is:

w_{i,j} = ( x_{i,j} - min(X) ) / r ,  for j ∈ {0, 1, 2}

where X is the tensor of points, x_{i,j} is the jth linear coordinate of point i, and w_{i,j} is the radius-normalized coordinate of point i in dimension j (its linear bin coordinate before the scale adjustment below).

If we consider the scale in the fourth dimension, then we have two separate sets of equations: one for the bins in linear space and one for the bins in exponential space.

u = floor( log(s_i) / log(z) )

b_{i,j} = floor( w_{i,j} / z^{u+1} ) ,  for j ∈ {0, 1, 2}

b_{i,3} = floor( log(s_i) / log(z) )

where s_i is the scale value at point i, z is the scale radius, u is a placeholder used in the equation for b_{i,j}, b_{i,j} is the scaled bin for point i in linear dimension j for the first three dimensions, and b_{i,3} is the scale bin for point i. The equations for b_i incorporate the resizing for the highest scale value in any given bin, due to the u+1 exponent. The idea of the highest scale in the bin limiting the width of the bin may make more sense when looking at figure 4.3 and the dotted lines that separate the area of the neighbor bins that could contain neighboring points, in light blue, from the area outside, in light red, that is guaranteed not to contain any neighboring points for any of the points in the current bin. For intuition, the width of the bins gets exponentially larger as the scale increases linearly.
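A direct, unvectorized transcription of these equations into Python might look as follows; this is a sketch for intuition, while the vectorized PyTorch version used by the project is in the appendix.

from math import floor, log

def point_to_bin(pt, mins, radius, scale_radius):
    # pt: (x, y, z, s); mins: per-dimension minimums of the linear coordinates.
    # Returns the 4D bin index (b_0, b_1, b_2, b_scale) of this point.
    x_lin, s = pt[:3], pt[3]
    # Radius-normalized, shifted linear coordinates (the w_{i,j} above).
    w = [(x_lin[j] - mins[j]) / radius for j in range(3)]
    # Scale bin, which is also the exponent that widens linear bins at higher scales.
    u = floor(log(s) / log(scale_radius))
    b_lin = [floor(w[j] / scale_radius ** (u + 1)) for j in range(3)]
    return (*b_lin, u)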

4.4.2 Mapping Bins to Points 

Algorithm: Mapping Bins to Points:

1. Get the sorted indices for the mapping of points to bins.

2. Get the differences between consecutive sorted bin ids: this lets us set every point that does not have a next point in its bin to -1, and every other point to the index of the next point in its bin.

3. Set the values for the first point in each bin.

4. Set the values for the point indices and un-sort them so they are in the same order as the original points, because the indices correspond to the points based on their locations.
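One way to realize these four steps with PyTorch tensor operations is sketched below (an illustrative version, assuming pts2bins_1d holds each point's 1D bin id; the project's own code is in the appendix).

import torch

def bins_to_points(pts2bins_1d: torch.Tensor):
    # Returns (first_point, next_point): first_point maps each bin id to the index of
    # one point in that bin, and next_point[i] is the next point in i's bin, or -1.
    n = pts2bins_1d.numel()
    order = torch.argsort(pts2bins_1d)            # step 1: sort points by bin id
    sorted_bins = pts2bins_1d[order]

    # step 2: a difference between consecutive sorted bin ids marks the end of a bin.
    is_last = torch.ones(n, dtype=torch.bool)
    is_last[:-1] = sorted_bins[1:] != sorted_bins[:-1]

    # steps 3 and 4: link each point to the next point in its bin (-1 if it is last),
    # then un-sort so the links line up with the original point ordering.
    next_sorted = torch.full((n,), -1, dtype=torch.long)
    next_sorted[:-1] = torch.where(is_last[:-1], torch.full_like(order[1:], -1), order[1:])
    next_point = torch.empty(n, dtype=torch.long)
    next_point[order] = next_sorted

    is_first = torch.ones(n, dtype=torch.bool)
    is_first[1:] = is_last[:-1]
    first_point = {int(sorted_bins[i]): int(order[i])
                   for i in is_first.nonzero().flatten().tolist()}
    return first_point, next_point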

4.4.3 Mapping Bins to Neighbor Bins 

When we add in scales, things get a little more complicated, because we cannot simply look one bin in each direction to get the neighbors. As shown in figure 4.3, when looking at the current bin and keeping the scale dimension constant, we simply get the neighbors that are one bin higher and lower. But when getting the neighbor bins at other scales, we look one bin higher and lower in the scale dimension, while how many bins to take in the other dimensions depends on the maximum radius at those other scales, so we are not grabbing exactly one bin over when at different scales. The main requirement for getting neighboring bins is selecting them such that all neighboring points of all points in the current bin are contained within the selected neighboring bins. Code for this algorithm is in the appendix, and figure 4.3 is the best way to understand this section spatially.
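For intuition, in the purely linear case, where bins are exactly one radius wide, the neighbors of a bin are just the surrounding 3^d block of bins. A minimal sketch of that simple case (ignoring the scale-dependent widening described above):

from itertools import product

def neighbor_bins_linear(bin_idx, n_dims=3):
    # bin_idx: tuple of integer bin coordinates in linear space.
    # Returns the 3**n_dims bins (including bin_idx itself) that can contain
    # near neighbors when every bin is one radius wide.
    return [tuple(b + d for b, d in zip(bin_idx, offsets))
            for offsets in product((-1, 0, 1), repeat=n_dims)]

# e.g. neighbor_bins_linear((4, 7, 2)) yields 27 candidate bins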

4.5 Parallelized FRNN 

With the radius, the points, the bins and their neighbor bins, and the linked list mapping points to the next point in their bin, we can write parallel code in a CUDA kernel to implement FRNN. The kernel has two main aspects: parallelizing across bins and points, and storing a bin's points in shared memory when comparing points in two neighboring bins. Code for this section can be found in the appendix.

4.5.1 Parallelizing Across Bins and Points 

The lowest level of parallelization in a generic FRNN algorithm exists at the level of the point, but because this approach runs in O(n^2) time we want to run as many points in parallel as we can. With this in mind, we assign separate threads to each point pair and a CUDA block to each bin, and compute the solution for this section of the algorithm in approximately the time it would take to compute the solution for one bin. Figure 4.2 shows a visual depiction of what the operation for one bin looks like.

Algorithmically, here are the steps for processing each bin:

1. edges ← empty array 

2. for point pa in bin ba: 

a. for bin bb in ba’s neighbors: 

i. for point pb in bin bb: 

1. sa , sb = pa[0], pb[0] 

2. scale_dist ← max([sa, sb]) / min([sa, sb])  

3. d ← |pa - pb| 

4. avg_scale ← log(sa * sb) 

5. if d ≤ radius * avg_scale and scale_dist < scale_radius 

a. edges.push([pa, pb]) 

 

4.5.2 Storing Bin Points in Shared Memory 

The other significant improvement of my implementation over other FRNN algorithms in CUDA is the speedup that comes from loading bins' points into shared memory before comparing the points' distances. Shared memory is cache that is allocated and controlled by the user instead of automatically allocated by the compiler and used by the programming language's underlying memory management. As with cache, any reads and writes to shared memory happen much faster than with global memory. The procedure here is that one thread in the CUDA kernel initially writes the points for bin_a into shared memory, and then for every neighboring bin_b the same thread loads the points for bin_b into shared memory. This was designed to induce a speedup because one thread can very quickly move all of the points into shared memory once, and then when the distance comparisons are being done during the actual algorithm, where each point's values need to be read O(n^2) times, the reads from shared memory can be done very quickly.

Chapter 5 

Results 

5.1 Theoretical Runtime Analysis 

Algorithm                  Best         Average      Worst
FRNN Brute Force           O(|X|^2)     O(|X|^2)     O(|X|^2)
FRNN with Bins             O(|E|)       O(|E|)       O(|X|^2)
Parallel FRNN with Bins    O(|E|/|B|)   O(|E|/|B|)   O(|X|^2)

Table 5-1: Time complexity in the best, average, and worst case for different FRNN algorithms, where X is the points matrix, E is the outputted edges matrix, and B is the 1-dimensional bins array.

 

In the FRNN brute force algorithm we compare all |X|^2 combinations of points in all cases. In the FRNN with bins algorithm we expect the program to compare O(|E|) point pairs because, on average, most of the points in the neighbor bins will form an edge with the points in the current bin being considered. In the best case, all points will be as evenly distributed across the bins as possible and we will make O(|E| / |B|) comparisons. In the worst case, where many of the points are concentrated in very few bins, we will have a runtime complexity of O(|X|^2), because this is essentially like running the brute force algorithm on a single bin or a few bins.

Algorithm                  Space Complexity (All Cases)
FRNN Brute Force           O(|X| + |E|)
FRNN with Bins             O(|X| + |E| + |B|)
Parallel FRNN with Bins    O(|X| + |E| + |B|)

Table 5-2: Space complexity (the same in all cases) for different FRNN algorithms, where X is the points matrix, E is the outputted edges matrix, and B is the 1-dimensional bins array.

 

For the space complexity, the only difference between the brute force algorithms and the algorithms with bins is that we also need to include the storage necessary for the bins, because the storage size for the bins is a function of the radius.

5.2 Experimental Runtime Analysis 

In this experiment I am testing the final version of the FRNN program using an Nvidia Tesla K80 GPU, and for the parallel algorithm the CUDA kernel is set to use 8192 blocks and 16 by 16 2D threads. These settings for the number of blocks and number of threads give the best experimental performance for my implementation. For the non-parallel algorithm, I am using only one block and one thread. Below is a comprehensive description of the other experimental settings.

Constant Hyper-Parameters:

1. dimension of points: two linear dimensions and one scale dimension (x, y, s)
2. linear domain: x and y values are sampled from a uniform distribution ranging from 0.0 to 100.0.
3. scale domain: the scale values are sampled from a uniform distribution ranging from 1.0 to 1.5.
4. scale radius: 1.25

Independent Variables:

1. number of points: ranging from 2 to 20,000, inclusive.
2. radius: ranging from 1.0 to 100.0 (or 1% of the linear range to 100% of the linear range).
   a. number of bins: this is a function of the radius.

Dependent Variables:

1. runtime for the binning step.
2. runtime for the FRNN kernel.
3. number of outputted edges.

The values for the experiment described above were chosen to sufficiently test the FRNN program, even for values outside of what may be used in practice. In OODL, we would not likely have a radius that is as large as the range of inputs, nor would we likely have a radius that is only 1% of the range of the inputs. One reason to test these values is that we do not actually know for sure what range of values the OODL program will use in practice, nor what other non-OODL applications, like collision detection with a large number of points and a small radius, might need. Regardless, it makes sense to test these values, if only to look for ways to improve the implementation.
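As an illustration of how GPU runtimes like the ones reported below can be measured (a generic sketch, not the exact harness used for these experiments), queued CUDA work has to be synchronized before reading the clock:

import time
import torch

def time_gpu_call(fn, *args, warmup=3, repeats=10):
    # Run fn(*args) on the GPU and return the mean wall-clock time in milliseconds,
    # synchronizing so that queued CUDA work has actually finished before timing stops.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / repeats

# e.g. time_gpu_call(frnn_kernel, points.cuda(), radius, scale_radius)
# where frnn_kernel is a placeholder for the compiled FRNN function.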

Figure 5.1: Runtime comparison between the Brute Force FRNN implementation, the serial FRNN implementation with bins, and the parallel FRNN implementation with bins.

 

 

Figure 5.2: Parallel FRNN Runtime versus Radius for Different Numbers of Points.

 

 

Figure 5.3: Parallel FRNN Runtime versus Number of Bins for Different Numbers of Points.

 

 

Figure 5.4: Parallel FRNN Runtime versus Number of Outputted Edges for Different Numbers of Points. When the number of points is small, we do not have runtimes for a large number of edges, because we do not have enough points to make a large number of edges, even if the radius were very large.

Chapter 6 

Evaluation 

6.1 Practical Trade-offs of Different Algorithms 

Maybe the most important result is from figure 5.1, which shows how much faster the runtime of the parallel bin FRNN algorithm is than the serial bin and brute force algorithms. Due to the practical limits of a GPU and how many blocks and threads it can actually run in parallel, if the size of the inputs gets very large, the runtime of the parallel bin algorithm will approach the runtime of the serial bin algorithm; but with the practical sizes of the inputs that OODL will use, we expect the parallel algorithm to do much better than either of the serial algorithms.

From the results we can see that the parallel FRNN with bins gets more efficient as the size of the input and output increases. This is likely due to the high start-up time of a CUDA kernel and the start-up time of initializing the bins. As the size of the input and output increases, the start-up time becomes a smaller percentage of the total runtime. This is why it is important in practice to run this algorithm with larger batch sizes, so that the number of times the CUDA kernel has to be initialized is small.

We can see that when the radius is really small, likely smaller than will ever be used in practice, the runtime is much higher than for most of the other radii values. This is likely due to the fact that when the radius is really small, the number of bins is very large. The runtime of the kernel increases linearly with the number of bins. With more bins, we will have more blocks running in parallel, and given that any GPU has practical limitations on the number of blocks it can actually run in parallel, we know that when the number of blocks is very large, some of those blocks run in series instead of in parallel.

I designed this algorithm specifically for OODL, so it does well at parallelizing across bins and for input sizes that are relevant to OODL. In other applications, like graphics and detecting object collisions, this algorithm may or may not serve as well as another algorithm that is specifically optimized for those applications.

Chapter 7 

Future Work 

7.1 Connected Components 

While FRNN is a function that is good for spatially grouping objects together based on their distance from each other, it may not be enough if our goal is for a computer system to segment out objects with shapes that do not closely approximate a circle or sphere. To account for this, we are working on the next step in the binding process, which is a connected components step. A connected component in an undirected graph is a set of points such that each point has some path to every other point in that connected component.
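As a sketch of what this step could look like on top of an FRNN edge list, the union-find routine below labels each point with its connected component; it is illustrative, not necessarily the implementation used in our pipeline.

def connected_components(num_points, edges):
    # edges: iterable of (a, b) index pairs, e.g. the output of FRNN.
    # Returns a list mapping each point index to a component label.
    parent = list(range(num_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b         # merge the two components

    return [find(i) for i in range(num_points)]

# e.g. connected_components(5, [(0, 1), (1, 2), (3, 4)]) -> [2, 2, 2, 4, 4]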

 

Figure 7.1: Visualization of connected components in an undirected graph. This diagram has three connected components: the five blue points in the top left, the four green points on the right, and the one red point on the bottom.

 

We believe that object segmentation is a fundamental aspect of vision, so in order to make that capability innate in an artificial vision system, we propose the combination of FRNN and connected components as a sub-system that will allow a computer system to learn to predict neighboring objects by interpreting the cloud of points produced by the voting layer in OODL. Out of this we hypothesize that objects and shapes will be learned without any explicit domain knowledge, and that other difficulties, like occlusion or highly variable shapes, will be handled fundamentally by this approach.

At the time this thesis is being submitted, the connected components step is already implemented; what is left is to complete the entire pipeline and the experiments with the voting layers, FRNN, and connected components to have a complete OODL system.

Chapter 8 

Conclusion 

CNNs have been the preferred model for most computer vision problems for a few years, and OODL presents an alternative model that aims to go beyond the capabilities of CNNs. Just as CNNs benefited from the implementation of parallel hardware and software for the convolution operation, this work represents a similar step in that direction for OODL and the FRNN operation. For the work of this thesis project I implemented a parallel system that maps points to bins and maps each bin to its neighbor bins, and a parallel fixed-radius near neighbors system that processes bins and pairs of points in parallel. The parallel bin FRNN algorithm shows good performance for the input sizes and types necessary to train an OODL model, and a key result is that this parallel FRNN with bins implementation is orders of magnitude more efficient than the non-parallel and non-binning FRNN implementation. With these results, the critical FRNN component of OODL will allow the continued design of OODL to proceed knowing that its core functionality is computationally efficient and practical.


 

 

 

 

 

 

 

 

 


Appendix 

This appendix contains most of the code for this project.

Python Code Mapping Points to their Bins 

# imports used throughout the Python code in this appendix
import math

import torch


def get_pts2bins(pts, batch_ids, radius, scale_radius, device):
    """Map each point (x, y, z, scale) to its bin coordinates and append
    the batch id as a fifth bin coordinate."""
    ix, iy, iz, i_s = 0, 1, 2, 3
    n_pts, n_dims = pts.size()
    n_lin_dims = 3

    apply_4d = lambda func: torch.tensor(
        [func(pts[:, i]) for i in range(n_dims)], device=device)
    min_vals = apply_4d(torch.min)

    pts2bins = pts.clone()

    # steps for the xyz bins:
    #   1.) shift to min 0
    #   2.) map by radius
    #   3.) scale by the max scale in the corresponding scale bin

    # xyz: 1.) shift to min 0
    pts2bins[:, ix : iz + 1].sub_(min_vals[ix : iz + 1])

    # xyz: 2.) map by radius
    pts2bins[:, ix : iz + 1].div_(radius)

    # xyz: 3.) scale by the max scale in the corresponding scale bin
    unshifted_scale_bins = torch.floor(torch.log(pts[:, i_s]) /
                                       math.log(scale_radius))
    max_unshifted_scale_bins = torch.max(unshifted_scale_bins).item()
    exp_range = scale_radius ** torch.arange(
        1, max_unshifted_scale_bins + 2, dtype=torch.float, device=device)
    scale_divs = exp_range[unshifted_scale_bins.long()].view(-1, 1)
    pts2bins[:, : n_lin_dims].div_(scale_divs)

    # steps for the scale bins:
    #   1.) map by scale_radius
    pts2bins[:, i_s].log_()
    pts2bins[:, i_s].div_(math.log(scale_radius))

    pts2bins.floor_()
    pts2bins = pts2bins.long()

    # now add another dimension for the batch_ids
    batch_bins = batch_ids.view(-1, 1)
    pts2bins = torch.cat([pts2bins, batch_bins], dim=1)

    return pts2bins
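
A minimal usage sketch for the function above (the shapes follow the conventions in the code, but the point values, radius, and scale_radius are illustrative assumptions): pts is an (n_pts, 4) float tensor of (x, y, z, scale) with positive scales, and batch_ids is an (n_pts,) long tensor.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 1000 points with columns (x, y, z, scale); scales kept >= 1
pts = torch.rand(1000, 4, device=device) + 1.0
batch_ids = torch.zeros(1000, dtype=torch.long, device=device)

pts2bins = get_pts2bins(pts, batch_ids, radius=0.1,
                        scale_radius=2.0, device=device)
# pts2bins has shape (1000, 5): the x, y, z, scale, and batch bin indexes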

Python Code Mapping Bins to Points 

def get_pt_idx_data(pts, bins_5d, device):
    """Build a linked list of point indexes for every bin.

    Returns pt_idxs (for each point, the index of the next point in the
    same bin, or -1 if it is the last one), first_pt_idxs (for each 1-D
    bin, the index of the first point in that bin, or -1 if the bin is
    empty), n_bins5d, and the flattened 1-D bin id of each point, bins1d."""
    bins = bins_5d
    ix, iy, iz, i_s, ib = 0, 1, 2, 3, 4
    _, n_dims = bins_5d.size()
    n_bins5d = torch.tensor([torch.max(bins[:, i]) + 1 for i in range(n_dims)],
                            device=device)

    # flatten the 5-D bin coordinates into a single 1-D bin id
    bins1d = bins[:, ib] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz] * n_bins5d[i_s]
    bins1d += bins[:, i_s] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz]
    bins1d += bins[:, iz] * n_bins5d[ix] * n_bins5d[iy]
    bins1d += bins[:, iy] * n_bins5d[ix]
    bins1d += bins[:, ix]
    n_bins1d = n_bins5d[0] * n_bins5d[1] * n_bins5d[2] * n_bins5d[3] * n_bins5d[4]

    # sort the (bin, pt) pairs by bin
    sorted_idxs = torch.argsort(bins1d)

    # if the sorted bins were: [0, 0, 1, 1, 1, 2]
    # we would need the pt_idxs to be:
    #   [1, -1, 3, 4, -1, -1]
    # and the first pt idxs to be:
    #   [0, 2, 5]  (one for each bin)
    sorted_bins1d = bins1d[sorted_idxs]

    # now get the pt_idxs:
    diffs = sorted_bins1d[1:] - sorted_bins1d[:-1]
    diffs_mask = diffs != 0
    pt_idxs_sorted = torch.arange(bins1d.size()[0], dtype=torch.long,
                                  device=device) + 1
    pt_idxs_sorted[:-1][diffs_mask] = -1
    pt_idxs_sorted[-1] = -1
    mixed_pt_idxs = sorted_idxs[pt_idxs_sorted]
    mixed_pt_idxs[pt_idxs_sorted == -1] = -1
    # unsort back into the original point order
    pt_idxs = mixed_pt_idxs[torch.argsort(sorted_idxs)]

    # now create the idxs for the linked list
    first_pt_idxs = torch.zeros(n_bins1d, dtype=torch.long, device=device) - 1
    # have to manually add the first element
    min_bin1d = sorted_bins1d[0]
    first_pt_idxs[min_bin1d] = sorted_idxs[0]
    mixed_first_bins = sorted_bins1d[1:][diffs_mask]
    mixed_first_pts = sorted_idxs[1:][diffs_mask]
    first_pt_idxs[mixed_first_bins] = mixed_first_pts

    return pt_idxs, first_pt_idxs, n_bins5d, bins1d
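
The two returned index tensors form a singly linked list for every bin, which is how the CUDA kernel below walks each bin on the device. A minimal host-side sketch of the same traversal (the helper name is illustrative and not part of the project code):

def points_in_bin(bin1d, first_pt_idxs, pt_idxs):
    # collect the original point indexes stored in one flattened bin
    members = []
    i = first_pt_idxs[bin1d].item()   # head of the list, or -1 if empty
    while i != -1:
        members.append(i)
        i = pt_idxs[i].item()         # next point in the same bin, or -1
    return members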

Python Code Mapping Bins to Neighbor Bins 

def get_nebs2nebs(radius, scale_radius, n_bins5d, device):
    ix, iy, iz, i_s, ib = 0, 1, 2, 3, 4
    nx, ny, nz, ns, nb = n_bins5d
    ndims = 5
    n_lin_dims = 3
    n_bins1d = nx * ny * nz * ns * nb

    # map each bin to the start values in the ranges at each scale
    #
    # the number of neighbors in euclidean space at scale 1
    # (scale bin 0 is relative scale scale_radius**0, and
    #  scale bin 1 is relative scale scale_radius**1, etc.)
    # so for each scale bin we need to adjust this n_nebs_left accordingly
    n_side = 1
    n_middle = 1
    # because log(scale_bin_width) / log(scale_radius) == 1
    n_nebs = n_side + n_middle + n_side

    bin_offsets = torch.arange(-n_side, -n_side + n_nebs,
                               dtype=torch.float, device=device)

    all_bins = range5d(n_bins5d, torch.float, device).view(n_bins1d, ndims)

    nebs_at_scales = []
    for bin_offsets_i in bin_offsets:
        scale_transform = scale_radius ** bin_offsets_i
        starts = ((all_bins[:, : n_lin_dims] / scale_transform) - n_side).view(
            n_bins1d, 1, n_lin_dims)
        stop = math.ceil(n_side + n_middle / scale_transform + n_side)

        range_at_scale_i = range3d([stop, stop, stop], torch.float,
                                   device).view(stop ** n_lin_dims, n_lin_dims)

        nebs_at_scale_i = torch.empty((n_bins1d, stop ** 3, ndims), device=device)
        # setting the x, y, z dims
        nebs_at_scale_i[:, :, : iz + 1] = starts + range_at_scale_i
        # setting the scale dim
        nebs_at_scale_i[:, :, i_s] = all_bins.view(-1, 1, ndims)[:, :, i_s] + bin_offsets_i
        # setting the batch_id dim
        nebs_at_scale_i[:, :, ib] = all_bins.view(-1, 1, ndims)[:, :, ib]
        # floor to round to the nearest bin below so that they can be longs
        nebs_at_scale_i = nebs_at_scale_i.floor().long()

        nebs_at_scales.append(nebs_at_scale_i)

    nebs_tensor5d = torch.cat(nebs_at_scales, dim=1)

    # filtering the negative values
    mask_out_of_bounds = torch.sum(nebs_tensor5d < 0, dim=2) > 0
    # filtering the values that are larger than the max possible values
    mask_out_of_bounds += torch.sum(nebs_tensor5d - n_bins5d >= 0, dim=2) > 0
    # reset the mask to only have values that are either 0 or 1
    mask_out_of_bounds = mask_out_of_bounds > 0

    # transform 5-D bins to 1-D bins
    nebs1d = nebs_tensor5d[:, :, ib] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz] * n_bins5d[i_s]
    nebs1d += nebs_tensor5d[:, :, i_s] * n_bins5d[ix] * n_bins5d[iy] * n_bins5d[iz]
    nebs1d += nebs_tensor5d[:, :, iz] * n_bins5d[ix] * n_bins5d[iy]
    nebs1d += nebs_tensor5d[:, :, iy] * n_bins5d[ix]
    nebs1d += nebs_tensor5d[:, :, ix]
    nebs1d[mask_out_of_bounds] = -1

    return nebs1d


def range5d(shape5d, dtype, device):
    lx, ly, lz, ls, lb = shape5d
    n_dims = 5
    r5d = torch.zeros((lb, ls, lz, ly, lx, n_dims), dtype=dtype, device=device)
    r5d[:, :, :, :, :, 0] += torch.arange(lb, dtype=dtype, device=device).view((-1, 1, 1, 1, 1))
    r5d[:, :, :, :, :, 1] += torch.arange(ls, dtype=dtype, device=device).view((-1, 1, 1, 1))
    r5d[:, :, :, :, :, 2] += torch.arange(lz, dtype=dtype, device=device).view((-1, 1, 1))
    r5d[:, :, :, :, :, 3] += torch.arange(ly, dtype=dtype, device=device).view((-1, 1))
    r5d[:, :, :, :, :, 4] += torch.arange(lx, dtype=dtype, device=device)
    # need to reverse each individual coordinate
    rev_idx = torch.arange(start=n_dims - 1, end=-1, step=-1, device=device)
    r5d = r5d.index_select(n_dims, rev_idx)
    return r5d
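
Both get_pt_idx_data and get_nebs2nebs flatten the five bin coordinates (x, y, z, scale, batch) into a single index. The helper below is not part of the project code; it only restates that row-major mapping in one place, under the same coordinate ordering:

def flatten_bins5d(bins5d, n_bins5d):
    # bins5d: (n, 5) long tensor of (x, y, z, scale, batch) bin coordinates
    # n_bins5d: the number of bins along each of the five dimensions
    nx, ny, nz, ns, _ = n_bins5d
    x, y, z, s, b = bins5d.unbind(dim=1)
    # equal to b*(nx*ny*nz*ns) + s*(nx*ny*nz) + z*(nx*ny) + y*nx + x
    return x + nx * (y + ny * (z + nz * (s + ns * b)))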

C++ CUDA Kernel Code for Parallel FRNN with Bins 

// n_max_pts_bin, pt_size, offset_x/y/z/s, and edge_size are compile-time
// constants defined elsewhere in the source.
__global__ void frnn_cuda_forward_kernel(
        const int* neighbor_bins,
        const float* pts,
        const int* pt_idxs,
        const int* first_pt_idxs,
        const float radius,
        const float scale_radius,
        const int n_max_neighbors,
        const int n_bins,
        int* edges,
        int* i_edges,
        const int max_size_edges) {

    // stores the points for bin_a and bin_b
    __shared__ float bin_a[n_max_pts_bin * pt_size];
    __shared__ float bin_b[n_max_pts_bin * pt_size];
    // stores the pt ids for bin_a and bin_b
    __shared__ int bin_a_ids[n_max_pts_bin];
    __shared__ int bin_b_ids[n_max_pts_bin];

    for (int idx_i_bin_a = blockIdx.x; idx_i_bin_a < n_bins;
            idx_i_bin_a += gridDim.x) {
        __syncthreads();
        __threadfence();

        int i_bin_a = idx_i_bin_a;

        // if the bin is empty:
        if (first_pt_idxs[i_bin_a] == -1) { continue; }

        //////////////
        // load bin_a
        //////////////
        if (threadIdx.x == 0 && threadIdx.y == 0) {
            bool set_neg_1 = false;
            int inext = first_pt_idxs[i_bin_a];
            for (int i = 0; i < n_max_pts_bin; i++) {
                if (set_neg_1) { bin_a_ids[i] = -1; continue; }
                bin_a_ids[i] = inext;
                if (inext == -1) { set_neg_1 = true; continue; }
                int i_pts_start = inext * pt_size;
                int i_bin = i * pt_size;
                bin_a[i_bin + offset_x] = pts[i_pts_start + offset_x];
                bin_a[i_bin + offset_y] = pts[i_pts_start + offset_y];
                bin_a[i_bin + offset_z] = pts[i_pts_start + offset_z];
                bin_a[i_bin + offset_s] = pts[i_pts_start + offset_s];
                inext = pt_idxs[inext];
            }
        }

        for (int idx_i_bin_b = 0; idx_i_bin_b < n_max_neighbors;
                idx_i_bin_b += 1) {
            int i_bin_b = neighbor_bins[i_bin_a * n_max_neighbors + idx_i_bin_b];

            // neighboring bins in the matrix that are empty
            // should have been set to -1,
            // but there might be more bin_b's that
            // are not -1 after a bin_b that is -1 . . .
            if (i_bin_b == -1) { continue; }
            if (first_pt_idxs[i_bin_b] == -1) { continue; }
            // don't double check bin pairs
            if (i_bin_b < i_bin_a) { continue; }

            /*----------
              LOAD BIN B
            -----------*/
            bool set_neg_1 = false;
            int inext = first_pt_idxs[i_bin_b];
            for (int i = 0; i < n_max_pts_bin; i++) {
                if (set_neg_1) { bin_b_ids[i] = -1; continue; }
                bin_b_ids[i] = inext;
                if (inext == -1) { set_neg_1 = true; continue; }
                int i_pts_start = inext * pt_size;
                int i_bin = i * pt_size;
                bin_b[i_bin + offset_x] = pts[i_pts_start + offset_x];
                bin_b[i_bin + offset_y] = pts[i_pts_start + offset_y];
                bin_b[i_bin + offset_z] = pts[i_pts_start + offset_z];
                bin_b[i_bin + offset_s] = pts[i_pts_start + offset_s];
                inext = pt_idxs[inext];
            }

            __syncthreads();
            __threadfence();

            /*---------------------
              THE COMPARISONS
              now do the comparison between
              bin_a's pts and bin_b's pts
            ----------------------*/
            // ia is the bin index for the current pt a,
            // so it is NOT the index into the pts matrix
            for (int ia = threadIdx.x; ia < n_max_pts_bin; ia += blockDim.x) {
                if (bin_a_ids[ia] <= -1) { break; }
                float ax = bin_a[ia * pt_size + offset_x];
                float ay = bin_a[ia * pt_size + offset_y];
                float az = bin_a[ia * pt_size + offset_z];
                float as = bin_a[ia * pt_size + offset_s];

                for (int ib = threadIdx.y; ib < n_max_pts_bin; ib += blockDim.y) {
                    if (bin_b_ids[ib] <= -1) { break; }

                    // don't compare the same point to itself:
                    if (bin_b_ids[ib] == bin_a_ids[ia]) { continue; }
                    // if it's the same bin,
                    // only compare lower points to higher points
                    if ((i_bin_a == i_bin_b) && (bin_b_ids[ib] <= bin_a_ids[ia])) {
                        continue;
                    }

                    float bx = bin_b[ib * pt_size + offset_x];
                    float by = bin_b[ib * pt_size + offset_y];
                    float bz = bin_b[ib * pt_size + offset_z];
                    float bs = bin_b[ib * pt_size + offset_s];

                    // check that the scales are close enough
                    float scale_max = as;
                    float scale_min = bs;
                    if (as < bs) { scale_max = bs; scale_min = as; }
                    if ((scale_max / scale_min) > scale_radius) { continue; }

                    float diffx = bx - ax;
                    float diffy = by - ay;
                    float diffz = bz - az;
                    float dist = diffx * diffx + diffy * diffy + diffz * diffz;
                    dist = sqrt(dist);

                    // the effective radius is scaled by sqrt(as * bs)
                    float log_avg_scale = sqrt(as * bs);
                    if (dist >= radius * log_avg_scale) { continue; }

                    // record the edge
                    int this_i_edges = atomicAdd(&i_edges[0], edge_size);
                    edges[this_i_edges + 0] = bin_a_ids[ia];
                    edges[this_i_edges + 1] = bin_b_ids[ib];
                }
            }

            __syncthreads();
            __threadfence();
        }

        __syncthreads();
        __threadfence();
    }

    return;
}
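
For validating the kernel's output, a brute-force reference that applies the same scale-ratio and scaled-distance tests to every pair of points is useful; the thesis compares performance against a non-parallel, non-binning implementation, and the sketch below (in Python, with an assumed (n, 4) point layout and an illustrative function name) is only a minimal stand-in for such a reference computation:

def frnn_bruteforce(pts, radius, scale_radius):
    # pts: (n, 4) float tensor of (x, y, z, scale); returns index pairs
    # (i, j), i < j, that pass the same tests as the kernel above:
    # scales within a factor of scale_radius and euclidean distance
    # below radius * sqrt(s_i * s_j)
    edges = []
    n = pts.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            s_i, s_j = pts[i, 3].item(), pts[j, 3].item()
            if max(s_i, s_j) / min(s_i, s_j) > scale_radius:
                continue
            dist = torch.dist(pts[i, :3], pts[j, :3]).item()
            if dist >= radius * (s_i * s_j) ** 0.5:
                continue
            edges.append((i, j))
    return edges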
