blavatnik school of computer science matrix factorization ... · blavatnik school of computer...

Raymond and Beverly Sackler Faculty of Exact Science

Blavatnik School of Computer Science

Matrix Factorization Methods

for

Massive Data Processing

Thesis submitted for the degree of

“Doctor of Philosophy”

by

Yaniv Shmueli

The thesis was carried out under the supervision of

Prof. Amir Averbuch

Submitted to the Senate of Tel-Aviv University

August, 2014

“Imagination is more important than knowledge. For knowledge is limited,whereas imagination embraces the entire world, stimulating progress, giving birth

to evolution. It is, strictly speaking, a real factor in scientific research.”A. Einstein.

Abstract

The explosion of data being created by humanity requires the develop-ment of scalable learning methods to process such massive volumes to uncoverthe hidden knowledge and insights within them. One important set of nu-merical computing methodologies for addressing these challenges is matrixfactorization methods. They are widely used today and have been appliedto many different problems for quantitative data analysis. In this thesis, wepresent three different methods that use matrix factorization frameworks tosolve problems in sampling, classification and detection in massive datasets.The first is a multiscale-based method that accelerates the computation of aparticle filter algorithm, which is a powerful tool for tracking the state of atarget based on non-linear observations. Instead of calculating the particlesweights directly, we sample a small subset from the source particles usingmatrix factorization methods. Then, we apply an out-of-sample algorithmto achieve a function extension to recover the density function for the re-maining particles. This method is demonstrated on both simulated and realdata such as objects tracking in video sequences.

The second method is a fast randomized algorithm that computes a low-rank lower and upper (LU) triangular matrices decomposition. The algo-rithm uses random projection techniques to efficiently compute a low-rankapproximation of large matrices. This algorithm can be parallelized andfurther accelerated by using sparse random matrices in its projection step.Several different error bounds are derived for the algorithm approximations.We provide numerical results that demonstrate that the algorithm outper-forms other randomized based decomposition methods. This algorithm isthen used for a dictionary construction. It provides a dictionary that is usedfor file types classification based on their content data and for the task fordetecting malicious code in PDF files.

The third method is an algorithm that updates the existing factorizationof a given training profile. Such frequent updating is required when theinput data stream changes across time. Instead of performing the entirecomputation again, the algorithm updates the factorization based on smallperturbations that occur in parts of the input data. By using this method,

i

we develop an anomaly detection algorithm that can update the classificationmodel on-the-fly. It was applied to Web traffic data.

The presented methods in this thesis demonstrate the power of matrixfactorization tools and how they can be enhanced and improved further tosolve a diverse set of problems in the domain of quantitative data analysis.

ii

Acknowledgments

First and foremost, I would like to express my deepest gratitude to my thesisadviser, Prof. Amir Averbuch. Amir has been a fantastic thesis adviserand has provided me with the best guidance one could wish for. I haveknown Amir for almost two decades, as he was the academic mentor for myM.Sc. and Ph.D. studies. His profound knowledge, unique intuition, greatideas and never-ending humor had a critical impact on my research, butmore important than that, on my approach to solving problems and mycharacter as a scientist. I would also like to thank my collaborators andfellow lab members: Dr. Amit Bermanis, Aviv Rotbart, Gil Shabat, Dr.Guy Wolf, Moshe Salhov, Dr. Shachar Harussi, Dr. Tuomo Sipola and YarivAizenbud. I was privileged to have the experience of working with them.Their valuable contributions and dear friendship is something I will cherishforever. I also wish to express my appreciation to Prof. Yoel Shkolinsky forfruitful discussions and consultations during my research. I am certain thatwithout all these people, my Ph.D. studies would not have been so enrichingand enjoyable. Many thanks also to the administrative staff at Tel AvivUniversity who supported me during my studies. Finally, I would like tothank my beloved wife Adi, my parents Yonit and Moshe, and my wonderfuldaughters Ziv and Amit. I could not complete this amazing journey withouttheir continuous love, support, encouragement and inspiration.

iii

Contents

Introduction 1Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3Outline and Contributions of this Thesis . . . . . . . . . . . . . . . 4Published and Submitted Papers . . . . . . . . . . . . . . . . . . . 7Funding Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 8

I Accelerating Particle Filter Computation using Ma-trix Factorization Methods 9

1 Accelerating Particle Filter Computation using RandomizedMultiscale and Fast Multipole Type Methods 111.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.3 Particle Filter Computation . . . . . . . . . . . . . . . . . . . 151.4 Multiscale Function Extension Method . . . . . . . . . . . . . 18

1.4.1 Data Subsampling Through ID of a Gaussian Matrix . 181.4.2 Multiscale Function Extension Algorithm . . . . . . . . 20

1.5 Multiscale Particle Filter (MSPF) Computation . . . . . . . . 211.5.1 Particle Subsampling . . . . . . . . . . . . . . . . . . . 231.5.2 Weight Calculation using Function Extension . . . . . 23

1.6 Accelerating the Particle Sampling Step . . . . . . . . . . . . 251.6.1 Fast Multipole Method . . . . . . . . . . . . . . . . . . 251.6.2 Fast Gauss Transform (FGT) . . . . . . . . . . . . . . 261.6.3 Weighted Farthest Point Selection (WFPS) Algorithm 27

1.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 271.7.1 Comparison with Other Approximation Methods . . . 301.7.2 Multiple Targets Tracking . . . . . . . . . . . . . . . . 321.7.3 Comparison with the EMD Measurement . . . . . . . . 351.7.4 Weighted FPS in the Selection Step . . . . . . . . . . . 36

1.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

v

II Randomized LU Decomposition and its Applica-tions 39

2 Randomized LU Decomposition 412.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 432.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.3.1 Rank Revealing LU (RRLU) . . . . . . . . . . . . . . . 442.3.2 Sparse Random Matrices . . . . . . . . . . . . . . . . . 46

2.4 Randomized LU . . . . . . . . . . . . . . . . . . . . . . . . . . 482.4.1 Computational Complexity Analysis . . . . . . . . . . 502.4.2 Bounds for the Randomized LU . . . . . . . . . . . . . 502.4.3 Randomized LU for Sparse Matrices . . . . . . . . . . 562.4.4 Rank Deficient Least Squares . . . . . . . . . . . . . . 59

2.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . 602.5.1 Error Rate and Computational Time Comparisons . . . 602.5.2 Image Matrix Factorization . . . . . . . . . . . . . . . 63

2.6 Sparse Matrix Factorization . . . . . . . . . . . . . . . . . . . 662.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3 File Content Recognition using Fast LU Dictionary 693.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 713.3 Randomized LU based Classification Algorithm . . . . . . . . 733.4 Determining the Dictionaries Sizes . . . . . . . . . . . . . . . 763.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 78

3.5.1 First Scenario: Entire File is Analyzed . . . . . . . . . 783.5.2 Second Scenario: Fragments of a File . . . . . . . . . . 803.5.3 Third Scenario: Detecting Execution Code in PDF Files 863.5.4 Time Measurements . . . . . . . . . . . . . . . . . . . 87

3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

III Perturbed Matrix Factorization 91

4 Spectral Decomposition Update by Affinity Perturbations 934.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 954.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . 96

4.3.1 Finding a Low-Dimensional Embedded Space . . . . . 964.3.2 Updating the Embedding . . . . . . . . . . . . . . . . . 98

vi

4.4 The Recursive Power Iteration (RPI) Algorithm . . . . . . . . 984.4.1 First Order Approximations . . . . . . . . . . . . . . . 984.4.2 The Recursive Power Iteration Method . . . . . . . . . 1014.4.3 RPI Algorithm with First Order Approximations . . . 103

4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 1044.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5 Affinity Perturbations Usage to Detect Web Traffic Anoma-lies 1075.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1075.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085.3 Low Dimensional Space Projection . . . . . . . . . . . . . . . 109

5.3.1 Diffusion Maps . . . . . . . . . . . . . . . . . . . . . . 1095.3.2 Incremental Update of the Embedded Space . . . . . . 110

5.4 Recursive Power Iteration . . . . . . . . . . . . . . . . . . . . 1115.4.1 First Order Approximation of Eigenpairs . . . . . . . . 1115.4.2 RPI Algorithm . . . . . . . . . . . . . . . . . . . . . . 112

5.5 Sliding Window Diffusion Map . . . . . . . . . . . . . . . . . . 1125.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 1145.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

Conclusions 117

Bibliography 121

vii

List of Tables

1.7.1 Comparison between WFPS and ID acceleration times [sec],in the MSPF algorithm that uses EMD. Sampling rate was10% from the total number of particles. . . . . . . . . . . . . . 36

2.4.1 Calculated values of the success probability ξ (Eq. 2.4.2). Theterms l − k, β and γ appears in Eq. 2.4.2. . . . . . . . . . . . 49

2.4.2 Probability P of the failure of Conjecture 2.4.12. The Averagevalue of σk(G1) was computed 10, 000 times for different valuesof n, l, k and ρ. . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.5.1 Confusion matrix for the first scenario. 100 files of each typewere classified by Algorithm 3.5.1. . . . . . . . . . . . . . . . . 80

3.5.2 Confusion matrix for the second scenario where BFD+CDDbased features were chosen. 100 files of each type were classi-fied by Algorithm 3.5.2. . . . . . . . . . . . . . . . . . . . . . 83

3.5.3 Confusion matrix for the second scenario that is based onDBFD based features. 100 files of each type were classifiedby Algorithm 3.5.2. . . . . . . . . . . . . . . . . . . . . . . . . 84

3.5.4 Confusion matrix for the second scenario using MW basedfeatures. 100 files of each type were classified by Algorithm 3.5.2. 86

3.5.5 Confusion matrix for malicious PDF detection experiment.110 files were classified by Algorithm 3.5.2. . . . . . . . . . . . 87

3.5.6 Running times for the first scenario. The bold number inthe left column refers to the running time of Algorithm 3.3.1,excluding the computation of the dictionary size k. The cor-responding number in the right column refers to the runningtime of Algorithm 3.5.1. The times are normalized by thetraining data size (left column) and the testing data size (rightcolumn). This normalization allows to analyze the times re-gardless of the sizes of the files in our experiments, which arerandom and vary largely. . . . . . . . . . . . . . . . . . . . . . 88

ix

3.5.7 Running times for the second scenario. The bold numbers inthe left column refer to the running time of Algorithm 3.3.1(excluding the computation of the dictionary size k) for dif-ferent feature sets. The corresponding numbers in the rightcolumn refer to the running time of Algorithm 3.5.2. Thetimes in the table are normalized by the training and testingdata size, as done in Table 3.5.6. The analysis time of the testfiles however, is per file because Algorithm 3.5.2 samples afixed amount of content from each file, regardless of its actualsize. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

x

List of Figures

1.7.1 A set of representative frames from a basketball tracking se-quence. The object is tracked using the MSPF Algorithm1.5.1 with a direct computation of the weights for 10% fromthe total number of particles. . . . . . . . . . . . . . . . . . . 30

1.7.2 Comparison between the tracking success rate for a given com-putational budget with standard PF (Alg. 1.3.1 which we referto as the “naive PF”) and MSPF (Alg. 1.5.1). . . . . . . . . . 31

1.7.3 Comparison between the RMSE for different methods: mul-tiscale with ID sampling (Alg. 1.4.2), multiscale with WFPSsampling (Alg. 1.6.1), linear approximation and cubic approx-imation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

1.7.4 Computational time of the MSPF with different sampling rates.The total number of particles is 1500. . . . . . . . . . . . . . . 33

1.7.5 Comparison between the tracking error for each frame, for agiven computational budget. We processed 1000 frames ofthe Brownian motion movie using the naive PF (Alg. 1.3.1)and then using the MSPF (Alg. 1.5.1). For the naive PF, weused 150 particles. For the MSPF we used 500 particles and10% sampling rate. Both Algorithm’s execution time was 280seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

1.7.6 A selected set of representative frames from the tennis gamethat demonstrates the tracking performance. The two tennisplayers were tracked by the application of the MSPF Algo-rithm 1.5.1 with a direct weights computation for 10% fromthe total number of particles. . . . . . . . . . . . . . . . . . . 35

2.5.1 Comparison between the low-rank approximation error of dif-ferent algorithms: Randomized SVD, Randomized ID andRandomized LU. Randomized LU achieves the lowest error. . . 62

2.5.2 Comparison between the execution times of the same algo-rithms as in Fig. 2.5.1 running on a CPU. Randomized LUachieved the lowest execution time. . . . . . . . . . . . . . . . 62

xi

2.5.3 Comparison between the execution times from running Algo-rithm 2.4.1 on different computational platforms: CPU with 8cores and GPU. Randomized LU achieved the lowest executiontime. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

2.5.4 The original input image of size 2124×7225 that was factorizedby the randomized LU, randomized ID and randomized SVDalgorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

2.5.5 The reconstructed image from the randomized LU factoriza-tion with k = 200 and l = 203. . . . . . . . . . . . . . . . . . . 64

2.5.6 Comparison between the PSNR values from image reconstruc-tion application using randomized LU, randomized ID, ran-domized SVD and Lanczos SVD algorithms. . . . . . . . . . . 65

2.5.7 Comparison between the execution time of the randomizedLU, randomized ID, randomized SVD and Lanczos SVD algo-rithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

2.6.1 Comparison between the approximation error of the random-ized LU, randomized ID and randomized SVD algorithms, ex-ecuted on the sparse matrix eu-2005. . . . . . . . . . . . . . . 66

2.6.2 Comparison between the execution time of the randomizedLU, randomized ID and randomized SVD algorithms, exe-cuted on the sparse matrix eu-2005. . . . . . . . . . . . . . . . 67

2.6.3 Approximation error from the application of Algorithm 2.4.2to the matrix A with two different densities. . . . . . . . . . . 67

2.6.4 Execution time from the application of Algorithm 2.4.2 to thematrix A with two different densities. . . . . . . . . . . . . . . 68

3.5.1 Byte Frequency Distribution (BFD) features extracted fromthe file fragment “AABCCCDR”. . . . . . . . . . . . . . . . . 79

3.5.2 Consecutive Differences Distribution (CDD) features extractedfrom the file fragment “AABCCCDFG”. There are three consecutive-pairs of bytes with difference 0, four with difference 1 and onewith difference 2. These distributions are normalized to pro-duce the shown probabilities. The normalized factor is thelength of the sting minus one. In this example, the normal-ized factor is 8. . . . . . . . . . . . . . . . . . . . . . . . . . . 79

3.5.3 Error matrices produced by Algorithm 3.4.1. The matrix ispresented in cold to hot colormap to show ranges of low (blue)and high (red) errors. . . . . . . . . . . . . . . . . . . . . . . . 82

3.5.4 Features extracted from the file fragment “AABCCC” usingDouble Byte Frequency Distribution (DBFD). The normaliza-tion factor is one less then the legth of the string. . . . . . . . 84

xii

3.5.5 Markov Walk (MW) based features extracted from the filefragment “AABCCCF”. . . . . . . . . . . . . . . . . . . . . . 85

4.4.1 Approximation error rates. The x-axis is the index of theordered eigenvalues. The y-axis is the relative error of theapproximated value (Eq. 4.4.8). . . . . . . . . . . . . . . . . . 101

4.5.1 Comparison of the total wallclock time between the three RPIalgorithmic variations. We compute the first 10 eigenpairs ofa 104×104 matrix. Each bar represents a variation of the RPIalgorithm. Each group of bars is compared within a givenadmissible error. . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.5.2 Comparison of the total number of iterations between thethree RPI algorithmic variations. We compute the first 10eigenpairs of a 104 × 104 matrix. Each bar represents a vari-ation of the RPI algorithm. Each group of bars is comparedwithin a given admissible error. . . . . . . . . . . . . . . . . . 105

5.6.1 The scores for each point with window size 1000 using thesecond eigenvector. . . . . . . . . . . . . . . . . . . . . . . . . 115

xiii

List of Algorithms

1.3.1 Particle Filter (SIR) . . . . . . . . . . . . . . . . . . . . . . . . 161.4.1 Deterministic Interpolative Decomposition . . . . . . . . . . . . 191.4.2 Randomized Interpolative Decomposition . . . . . . . . . . . . 211.4.3 Single-Scale Extension . . . . . . . . . . . . . . . . . . . . . . . 221.4.4 Multiscale Data Sampling and Function Extension . . . . . . . 221.5.1 Multiscale Particle Filter (MSPF) . . . . . . . . . . . . . . . . 241.6.1 Weighted Farthest Point Selection (WFPS) . . . . . . . . . . . 28

2.4.1 Randomized LU Decomposition . . . . . . . . . . . . . . . . . . 482.4.2 Randomized LU Decomposition for Sparse Matrices . . . . . . . 562.4.3 Rank Deficient Least Squares using Randomized LU . . . . . . 60

3.3.1 Dictionaries Training using Randomized LU . . . . . . . . . . . 743.3.2 Dictionary based Classification . . . . . . . . . . . . . . . . . . 753.4.1 Dictionary Sizes Detection . . . . . . . . . . . . . . . . . . . . . 773.5.1 File Content Dictionary Classification . . . . . . . . . . . . . . 803.5.2 File Fragment Classification using Dictionary Learning . . . . . 83

4.4.1 Recursive Power Iteration Algorithm . . . . . . . . . . . . . . . 102

5.5.1 Sliding Window Diffusion Map with RPI . . . . . . . . . . . . . 113

xv

Introduction

Over the last few decades, we have witnessed two major trends in informationtechnology and scientific computing domains. The first trend is the explosionof data that is generated and recorded by humanity. From 2013 to 2020,the digital universe will grow by a factor of 10 - from 4.4 trillion gigabytesto 40 trillion, containing nearly as many digital bits as there are stars inthe universe [1, 2]. Organizations and companies such as NYSE, CERN,Facebook, Google, Boeing and Goldman Sachs generate dozens of terabyteseach day. In summary, the digital universe more then doubles every twoyears. Digital computing causes both explicit and implicit creation of datain the form of text, images, audio and video recordings that increases ourday-to-day digital output and cyberspace footprint. Gigantic volumes of datacan be saved nowadays due to advanced recording and storage technologies,making them available for further analysis and processing.

The second trend is the development of computational methods for re-trieving information and insights from accumulated and streaming of massivedata. There is a constant improvement in the ability to process and analyzesuch huge datasets that require minimal or no human intervention. Suchcapabilities are driven mostly by the development of mathematical methodsand machine learning methodologies.

The goal is to extract intelligence and make sense of massive amountsof data to understand the underlying structure of complex topics, connectthe dots between pieces of information, and turn data into insight. Bigdata creates tremendous value for the global economy, driving innovation,productivity, and growth. Therefore, we turn data into knowledge.

There are several methodologies to automatically discover connections indata that add to our understanding, to mine and interpret massive amountsof data quickly and easily. Methods like subsampling while preserving themutual relations among data points, stochastic modeling, random processing,dimensionality reduction, spectral and kernel methods, dictionary construc-tions, embedding into a lower dimension space, sparsification, to name someare common. In this thesis, we utilize a mathematical framework known

1

2

as matrix factorization of low rank matrices to develop machine learningalgorithms for clustering, ranking, tracking and anomaly detection in highdimensional big data. Generally speaking, a matrix factorization decomposesa matrix into a product of matrices having properties such as being trian-gular, orthogonal, diagonal, sparse or low-rank. We focus on three differentproblems and solve them using matrix factorization tools.

Tracking: How to accelerate a tracking algorithm known as particle filterusing matrix factorization methods. By subsampling the input dataefficiently, we can improve the algorithm performance while reducingits computational time. Then, a multiscale method is used for interpo-lating the values of the rest of the data, which enables us to maintaintracking accuracy while requiring fewer computational resources.

Dictionary construction: Efficient dictionary construction from a train-ing dataset for use in recognition and clustering of high dimensionaldata. We develop a coherent methodology (theory, algorithms, soft-ware) called randomized LU for dictionary learning by factorization ofthe training matrix. This algorithm provides a low-rank approxima-tion for the LU decomposition method. Several error bounds of thealgorithm’s accuracy are presented. We examine the computationalefficiency of the algorithm on sparse matrices, images and random ma-trices. For one demo application, the randomized LU is used for filecontent-based classification.

Efficient updating: The challenge is to update an existing factorizationwithout the need to perform the entire computation again in caseswhere the new input matrix contains small perturbations from an ex-isting matrix. The proposed algorithm updates the factorization anduses it to detect anomalies in Web traffic data. This problem becomesespecially important when dealing with data streams that frequentlychange. This imposes the need to have recurring updates to the trainingprofile to maintain the classifier accuracy level.

The solution for each of the above problems contains an algorithm, a proofof its correctness and a computational complexity analysis. It is appliedto simulated and real data to demonstrate its capabilities. The presentedsolutions can be applied to different input data types that include sparsedata. In this work, we test our algorithms using telemetric data, network logs,images and video sequences to demonstrate the robustness of our solutions.We also use several matrix factorization methods and enhance them further tobetter fit our problem requirements. An important property of our solution’s

3

portfolio is in its scaling abilities of the algorithms to process very largedatasets.

Related Work

Methods for data analysis have become very popular and diverse in the lastfew decades. They are used to clean, sample, process, transform, model,mine and predict data in order to extract useful information and insights,suggest conclusions and support decision-making. Many approaches exist toaddress this problem, and different techniques have been developed and usedin science, social and business domains.

Matrix factorization serves as a basis for many studies and algorithmsfor data analysis applications [3]. The dataset is first modeled by a matrixthat can represent raw samples, features, similarities or transition proba-bilities between samples. Then the matrix is factorized into a product oflow-rank matrices or matrices that have a unique structure. The factoriza-tion can reveal properties of the dataset such as dominant samples, clusteringand anomalies. Common factorizations methods used today are for exampleeigenvalue decomposition, singular value decomposition (SVD), non-negativematrix factorizations, LU decomposition, QR decomposition and interpola-tive decomposition (ID) [4]. Many of these methods have parallelized versionsand randomized implementations that compute a low-rank approximation ef-ficiently [5, 6, 7]. They provide powerful tools for constructing approximatematrix factorizations. They are able to deal with challenges such as han-dling very large matrices and matrices with missing or inaccurate elements,performance of minimal iterations over the input data and exploiting newcomputer architectures such as graphic processing units (GPU) and cloudcomputing.

A variety of problems and applications can be solved using matrix fac-torization: solution of a system of equations [5], regression [8, 9], collabo-rative filtering problems [10], dictionary construction [11], semi-supervisedlearning [12], principal component analysis (PCA) [13], dimensionality re-duction [14], clustering [15], anomaly detection [16], compression [4], noisereduction and sampling [17]. Famous examples of problems solved usingmatrix factorizations are Google’s PageRank algorithm for computing theimportance of a given Web page [18], Netflix’s data challenge for improvingtheir recommendation system [19], and human face and handwritten digitrecognition systems [20, 21]. These examples demonstrate the power of us-ing matrix factorizations. In this thesis, we expand these ideas by enhancingseveral matrix factorization methods and using them to solve a new set of

4

problems.

Outline and Contributions of the Thesis

This thesis explores the properties of several matrix factorization methodsand how can they be employed to solve machine learning tasks. The thesishas three parts. The first part shows how to compute efficiently the particlefilter problem by using matrix factorization. The second part introduces arandomized version for the LU decomposition and several dictionary learningapplications based on this decomposition. The third part provides an efficientfactorization of a constantly changing matrix and presents a Web trafficanomaly detection algorithm that is based on it. The rest of this sectionpresents brief overviews of each part.

Multiscale Particle Filter (Part I)

Particle filter (PF) is a powerful method for state tracking using non-linearobservations. In each cycle of the algorithm, we advance the state of a setof particles and then compute the probability of each state to represent theactual target accurately, using the input observations. The algorithm thenresamples the particles based on their probabilities, causing the best particlesto evolve into the next cycles. In Chapter 1, which is based on [22, 23], wepresent a multiscale-based method that accelerates the tracking computationdone by the particle filter algorithm. Unlike the conventional method, whichcalculates weights over all particles in each cycle of the algorithm, we samplea small subset from the source particles using matrix decomposition methods.Then we apply a function extension algorithm that uses this particle subsetto recover the density function for all the rest of the particles not included inthe chosen subset. The computational effort for computing with the entireparticles is substantial, especially when multiple objects are tracked concur-rently. The proposed algorithm significantly reduces the computational load.By using the fast Gaussian transform (FGT), the complexity of the particleselection step is reduced to a linear time in n and k, where n is the number ofparticles and k is the number of particles in the selected subset. We demon-strate our method on both simulated and real data, such as objects trackingin video sequences. The main contribution of this research is the accelera-tion of the PF tracking algorithm by using matrix factorization methods. Wemanage to accelerate the algorithm up to 10 times faster, while maintainingthe same tracking error.

5

Randomized LU Decomposition (Part II)

In Chapter 2, which is based on [24], a fast randomized algorithm that com-putes a low-rank LU decomposition is presented. The algorithm uses randomprojections techniques to efficiently compute a low-rank approximation oflarge matrices. The randomized LU algorithm can be parallelized and furtheraccelerated by using sparse random matrices in its projection step. Severalerror bounds for the algorithm’s approximations are proved. To prove thesebounds, recent results from random matrix theory related to sub-Gaussianmatrices are used. The algorithm, which can utilize sparse structures, isfully parallelized and thus can efficiently utilize GPU architectures. Numeri-cal examples, which illustrate the performance of the algorithm and compareit to other decomposition methods such as randomized SVD, randomized IDand Lanczos method of SVD running on sparse matrices, are described. Wealso show that the randomized algorithm can be parallelized, since it consistsmostly of matrix multiplication and pivoted LU. The results on GPU showthat it is possible to accelerate the computational time significantly even byusing only standard Matlab libraries.

We demonstrate the effectiveness of our algorithm in Chapter 3 by apply-ing it to dictionary learning applications. Distinctive-dictionary learning hasbeen a busy research area in recent years. In this learning method, one ormore dictionaries are learned from a training data and then used to classifytest signals. A new dictionary learning algorithm is introduced, based on ourrandomized LU decomposition. This method is fast and scalable. It worksextremely well on large sparse matrices. In contrast to existing methods, ran-domized LU decomposition constructs an under-complete dictionary, whichsimplifies both the construction process and the classification of unknownsignals. We demonstrate our algorithm on a file type detection application.This is a fundamental task in digital security performed by different systemssuch as firewalls, anti-virus systems and e-mail clients. We propose a content-based method for detecting file types, instead of relying on the file extensionand metadata. Such an approach is harder to deceive, and we show that onlya few file fragments are needed for a successful classification. We compareour results to recent studies to evaluate our classification success rate andexecution times. We also show that our method can effectively identify PDFfiles that contain execution code fragments by inspecting the content of eachPDF file and reconstructing it using different dictionaries.

6

Affinity Perturbations (Part III)

Many machine learning based algorithms contain a training step that is doneonce. The training step is usually computational expensive since it involvesprocessing huge matrices. If the training profile is extracted from an evolv-ing dynamic dataset, it has to be updated as some features of the trainingdataset are changed. In Chapter 4, which is based on [25], we propose a so-lution for updating this profile efficiently. We investigate how to update thetraining profile when the data is constantly evolving. In many algorithms forclustering and classification, a low-dimensional representation of the affinity(kernel) graph of the embedded training dataset is computed. The trainingdata is first modeled by a kernel method and then processed by a spectraldecomposition. Such representation is then used for classifying newly arriveddata points. We present methods for updating such embeddings of the train-ing datasets in an incremental way without the need to perform the entirecomputation from the beginning upon the occurrences of changes in a smallnumber of the training samples. Efficient computation of such an algorithmis critical in many Web-based applications. In Chapter 5, which is basedon [26], we use this approach to solve the problem of anomaly detectionwhen the training set is constantly evolving and changing. Evolving datasetsare challenging from this point of view because changing behavior requiresupdating the training. We propose a method for updating the training profileefficiently and a sliding window algorithm for processing the data on-line insmaller parts. We demonstrate the algorithm with a Web server request logwhere an actual intrusion attack is known to happen. The dynamic kernelupdate with the sliding window prevents the problem of single initial trainingand can process evolving datasets more efficiently.

Published and Submitted Papers

1. Y. Shmueli, G. Wolf, and A. Averbuch. Updating kernel methods inspectral decomposition by affinity perturbations. Linear Algebra and itsApplications, 437(6):1356 – 1365, 2012.

2. G. Wolf, Y. Shmueli, S. Harussi, and A. Averbuch. Polar classifica-tion of nominal data. In S. Repin, T. Tiihonen, and T. Tuovinen, ed-itors, Numerical Methods for Differential Equations, Optimization, andTechnological Problems, volume 27 of Computational Methods in AppliedSciences, pages 253–271. Springer Netherlands, 2013.

3. Y. Shmueli, G. Shabat, A. Bermanis, and A. Averbuch. Particle filteracceleration using multiscale sampling methods. In SampTA 2013: 10thinternational conference on Sampling Theory and Applications, Bremen,Germany, 2013.

4. Y. Shmueli, T. Sipola, G. Shabat, and A. Averbuch. Using affinityperturbations to detect web traffic anomalies. In SampTA 2013: 10thinternational conference on Sampling Theory and Applications, Bremen,Germany, 2013.

5. G. Shabat, Y. Shmueli, and A. Averbuch. Missing entries matrix approx-imation and completion. In SampTA 2013: 10th international conferenceon Sampling Theory and Applications, Bremen, Germany, 2013.

6. Y. Shmueli, G. Shabat, A. Bermanis, and A. Averbuch. Acceleratingparticle filter using multiscale methods. In Electrical & Electronics En-gineers in Israel (IEEEI), 2012 IEEE 27th Convention of, pages 1–4.IEEE, 2012.

7. G. Wolf, Y. Shmueli, S. Harussi, and A. Averbuch. Polar clustering. InECCOMAS Thematic Conference on Computational Analysis and Opti-mization, 2011.

8. G. Shabat, Y. Shmueli, and A. Averbuch. Randomized LU decomposi-tion. Submitted to SIAM Journal on Scientific Computing, 2014.

9. G. Shabat, Y. Shmueli, A. Bermanis, and A. Averbuch. Accelerating par-ticle filter using randomized multiscale and fast multipole type methods.Submitted to Pattern Analysis and Machine Intelligence, 2013.

7

8

Funding Acknowledgments

The author of this thesis was supported by the Israel Science Foundation(Grant No. 1041/10), by the Israeli Ministry of Science & Technology (GrantsNo. 3-9096, 3-10898) and by the US - Israel Binational Science Foundation(BSF 2012282).

Part I

Accelerating Particle FilterComputation using Matrix

Factorization Methods

9

Chapter 1

Accelerating Particle FilterComputation usingRandomized Multiscale andFast Multipole Type Methods

Particle filter (PF) is a powerful method for state tracking using non-linearobservations. In this chapter, we present a multiscale-based method that ac-celerates the tracking computation done by PF. Unlike the conventional way,which calculates weights over all the participating particles in each cycle ofthe algorithm, we sample a small subset from the source particles using ma-trix decomposition methods. Then, we apply a function extension algorithmthat uses a particle subset to recover the density function for all particlesexcluded from the chosen subset. The computational effort for the entire setof particles is substantial especially when multiple objects are tracked con-currently. The proposed algorithm reduces significantly the computationalload. By using the fast Gaussian transform and the fast multipole method(FMM), the complexity of the particle selection step is reduced to a lineartime in n and k , where n is the number of particles and k is the number ofparticles in the selected subset. We demonstrate our method on both sim-ulated and on real data such as objects tracking in video sequences. Theresults in this chapter appear in [22, 23].

1.1 Introduction

PF is a state tracking method based on non-linear observations that uses theMonte-Carlo approach [27]. PF implements a recursive Bayesian filter where

11

12

the probability density function (PDF) is represented by a set of randomsamples (particles) rather then its analytical form. The number of particlescontrols the approximation accuracy. A large number of particles will lead toa more accurate representation of the functional form of the PDF. The parti-cles are propagated and advanced under the control of the system dynamicsand under target measurement model. In the sequential importance resam-pling (SIR) version of the PF, particles are resampled at each cycle by usingimportance sampling on their probabilities that are called “weights”. Theparticle with the maximal likelihood is selected to be the current predictedstate of the target. Unlike Kalman filters, PFs are not restricted by station-ary linear-Gaussian assumptions that make them more robust and suitablefor a larger set of problems. Although the PF concept is fairly straightfor-ward , it becomes computationally expensive for practical implementations,as a very large number of particles are needed in order to accurately estimatethe PDF of the observed target(s). The increase in computational power inthe last decades have enabled to introduce PF-based solutions to real worldproblems.

The role of PF acceleration is to increase the number of propagated par-ticles while maintaining the same computational cost. A large number ofparticles represent the required distributions more accurately that leads tobetter results. Many systems are geared to track objects that are “buried”in huge data-streams such as video sequences, communication and telemetricdata. Predicting the next object state or tracking it has to be done in nearreal-time especially by devices that have limited computational resourcessuch as embedded devices.

In this work, we develop an improved PF algorithm that avoids the needto compute the likelihood function (the weights) for all the particles. Theparticle weights computation is expensive in cases such as tracking objectsin video sequences. Instead, the weights are computed for only a subset ofthe particles and their values are used to estimate the weights for the restof them. To select a representative set of particles, we use the interpolativedecomposition (ID) method [4]. Then, the weights are extended to the restof the particles using a multiscale function extension (MSE) method [28],which is an application of the modified Nystrom extension method ([29, 30])to particles excluded from the selected set. The extension is based on thesimilarities between particles that are not in the selected subset and theparticles in the selected subset. The MSE uses radial Gaussian functionswith varying scales to estimate the coefficients of the Nystrom extension.The MSE method is shown in [28] to be both accurate and numericallystable overcoming the deficiencies in the Nystrom method.

Interpolative decomposition (ID) is a method that approximates a matrix

13

by selecting a set of k independent columns in the input matrix that consti-tute a basis. We use the ID method to select particles that best representthe PDF. To find this particles subset, we compute an affinity matrix A forthe particles and apply the ID algorithm to compute a k independent set ofcolumns from A. These columns correspond to the most relevant particles.

Once the particles selection process is completed, the weights are com-puted only for the selected particles. The weight values are extended to therest of the particles by the application of MSE. The motivation for usingMSE as our extension method is based on the fact that the PDF is generallya smooth function and that the MSE method is strongly related to a Gaus-sian process regression (GPR) [31], which is an extension method in the fieldof statistical inference. We use a randomized implementation of the ID algo-rithm, which is based on random projections, to minimize the computationalcost of selecting the most relevant particles [6]. To reduce the computationalcost even further, we propose a selection algorithm that is based on the Far-thest Point Sampling (FPS) [32] method combined with density estimation.We refer to our method, which combines the FPS algorithm with densityestimator, as a weighted FPS (WFPS). The density estimator was imple-mented using the fast multipole method (FMM) [33] that asymptotically hasa lower computational cost.

In our experiments, we were able to accelerate the weight calculation stepto be approximately 10 times faster compared to the standard particle filter.The running time and tracking error of the algorithm were compared to thestandard PF (sequential importance resampling) as well as to other weightinterpolation methods indicating that the proposed algorithm was able tomaintain the same error rate with a much shorter running time.

This chapter is organized as follows: Section 1.2 presents related workon PF acceleration. Section 1.3 describes the PF algorithm, explains itslimitations and how it can be used for object tracking in video sequences.Section 1.4 describes the multiscale sampling and extension techniques withits mathematical tools such as the randomized interpolative decomposition(ID). Section 1.5 presents the full multiscale PF algorithm that acceleratesthe standard PF. An additional acceleration is achieved by the algorithmin Section 1.6 to overcome the multiscale PF scalability bottleneck. Ex-perimental results are presented in Section 1.7. We compare between theperformances of the presented method and other methods in different sce-narios.

14

1.2 Related Work

PFs have been studied in many works and used in different application do-mains such as computer vision, robotics, target tracking and finance. Exam-ple applications include hand gesture-based interface [34], real-time track-ing of soccer players [35], mobile robot localization [36] and visual trackingof human face [37]. The PF employs a sequential Monte Carlo approachto solve recursive Bayesian filtering problems. The Monte Carlo samplingmethod combined with the Bayesian inference enables the PF to provide asolution for non-linear and non-Gaussian problems. While PF can be robustto both the input observations distribution and to the observations noiselevel, its implementation is computationally expensive. Making it to workin real-time (computationally efficient) has become a major challenge whenobjects tracking is done in high dimensional state space, or when dealingwith multiple targets tracking. Such instances require to use more parti-cles and thus the problem quickly becomes intractable. In addition, becauseof the nature of the PF algorithm and its repeated sampling method, thealgorithm suffers from sample degeneracy where most of the particles havenegligible weights. Over the last two decades, different variations of thePF algorithm have emerged to overcome these limitations. Methods suchas auxiliary PF [38], Gaussian sum PF [39], unscented PF [40] and swarmintelligence-based PF [37] were developed in order to overcome these limi-tations by improving the underlining sampling techniques and by providingbetter evaluations to the proposal distribution of the particles in each stepof the algorithm.

The problem of tracking curves in a dense visual clutter is investigatedin [41], where a method for learning the dynamical models using visual ob-servations and propagate a randomly generated set over time to achieve nearreal time tracking is introduced. High dimensional models for tracking peopleusing Monte Carlo filtering and hybrid hypothesis methods are also studiedin [42, 43, 44, 45]. By using randomization methods for improving the parti-cles re-sampling and applying specific assumptions to the dynamical models,efficient visual tracking is achieved. Overcoming the uncertainty induced byocclusion, abrupt motion or appearance changes while still preventing sampleimpoverishment problems is demonstrated in [46, 47].

Additional methods for estimating the posterior densities were suggestedin [48, 49, 50, 51]. These methods propose effective ways for representingthe posterior density that result in using fewer particles. The unscented PF(UPF), for example, defines points that capture sufficient distribution statis-tics. Then, it propagates each particle and incorporates the new observationto produce a Gaussian estimation of its proposal distribution. By using the

15

input observations, the UPF can achieve more accurate proposal distributionestimation than what the regular PF implementation achieves. In then con-tinuous density propagation [51], density approximation and interpolationtechniques are employed to represent the proposal distribution efficiently.The density functions are represented by Gaussian mixtures where the num-ber of components, their coefficients and other related statistical parame-ters are automatically determined by the algorithm. Similar sampling andapproximation methods for accelerating nonparametric belief propagation(NBP) were also developed in [50]. The core of the NBP algorithm requiresa repeated sampling from products of Gaussian mixture, which makes thealgorithm computationally expensive. To accelerate the process, Gaussianmixture density approximation using mode propagation and kernel fitting isapplied. The products of the Gaussian mixture are approximated accuratelyby just a few mode propagations and kernel fitting steps. This significantlyaccelerates the sampling method since it uses a fewer samples. The ap-proach presented in this chapter is similar in that sense, but unlike the abovemethods, we compute the weights for only a particle subset. This subset isselected using matrix decomposition methods and the remaining weights areiteratively interpolated using multiscale extension methods which are accu-rate and numerically stable.

The challenge, which all these methods face, is how to incorporate a newobservation into the estimated proposal distribution function of the particleswhile maintaining a reasonable computation cost. Comprehensive tutorialsand surveys on different PF versions and recent advances in PF methods aregiven for example in [27, 52, 53].

1.3 Particle Filter Computation

PF is an on-line density estimation technique based on target simulation thatuses Monte Carlo methods for solving a recursive Bayesian filtering problem.PF is used to estimate the state xi at time i from noisy observations y1, ..., yi.Dynamic state space equations are used for modeling and prediction of thetarget state. The basic idea behind the PF approach is to use a sufficientlylarge number of “particles”. Each particle is an independent random variablewhich represents a possible target state. For example, a state can be locationand velocity. In this case, each particle represents a possible location anda possible velocity of the target from a proposal distribution. The systemmodel is applied to the particles in order to predict the next state. Then, eachparticle is assigned a weight, which represents its reliability or the probabilitythat it represents the real target state. The actual location (the output

16

of the PF algorithm) is usually determined as the maximum a posterioriprobability of the particle’s posterior distribution. The algorithm robustnessand accuracy are determined by the number of participated particles. Alarge number of particles is more likely to cover a wide state subspace inthe proximity of the target, as well as a better approximation of the statedistribution function. However, the computational cost of such an improvedtracking is high since each particle needs to be both advanced in time andweighted. This is repeated in each cycle of the algorithm.

Algorithm 1.3.1: Particle Filter (SIR)

Input: n number of particles; x0 initial state; y1, ..., yT currentobservations; q(·) proposal distribution function; p(·) approximatedposterior distribution functionOutput: x1, ..., xT estimated observations

1: Weights initialization: w(i)0 = 1

n, x

(i)0 ∼ p(x0), i = 1, ..., n.

2: for time steps t=1,...,T do3: Resample n new particles by their distribution determined by the

weights w(i)t−1.

4: Prediction: Apply the dynamic model to each particle to estimate thenext state using xt−1 and y1, ..., yt

x(i) ∼ q(xt|x(i)t−1, y1, ..., yt), i = 1, ..., n.

5: Weights calculation:

w(i)t ∝

p(yt|x(i)t )p(x

(i)t |x

(i)t−1)

q(xt|x(i)t−1, y1, ..., yt)

, i = 1, ..., n.

6: Weights normalization:

w(i)t =

w(i)t∑n

l=1w(l)t

, i = 1, ..., n.

7: xt is set to be the particle x(l)t where l = argmax

1≤i≤nw

(i)t .

8: end for

A description of the PF algorithm flow is given by Algorithm 1.3.1. As-sume that p represents the proposal distribution that is used to predict the

17

next particles states. The optimal proposal distribution is the target’s distri-bution, which is given by p(xk|xk−1, yk). Since this computation is imprac-tical, an estimated distribution, which is called the proposal distribution, isused. In our case, this distribution is computed by utilizing the physics ofthe system (physical model) to the particles.

The weights computation in Algorithm 1.3.1, step 5, can be expensive insome cases. For example, when the PF algorithm is used for tracking a targetin videos, it is common to use distances between histograms for weightingthe measurements. An RGB image with 256 gray-levels for each pixel willhave a histogram of 2563 = 16, 777, 216 bins. Then, the distance betweentwo histograms h1 and h2 (both vectors have length of size the number ofbins) can be measured, for example, by the Bhattacharyya coefficient

B =

√h1

Th2. (1.3.1)

The color histogram calculation complexity for a given particle dependson the number of bins we use and the number of pixels we need to iterate, aseach particle points to an image patch of the target. Assume that the numberof bins is b and the number of particles is n. The total weight calculationcomplexity in each cycle can be very expensive for large b and n values.On the other hand, high number of bins can help in improving the distanceaccuracy that affects the weights estimations.

Despite the robustness of the PF algorithm, it suffers from several limi-tations. Usually, each new observation requires some preprocessing followedby a weight calculation of each particle. Both steps can be computation-ally expensive in applications such as computer vision and robotics. In suchproblems, the observation contains a large amount of data. For example,when tracking a target within a video sequence, each measurement consistsof an image frame that contains up to several millions of pixels (as in HDformat). In addition, each particle is assigned a weight that is based onsome calculation applied to a subset of the measured data. This subset canbe relatively large (for example, an image patch with thousands of pixels).As the number of particles becomes large, the total computational load canbecome extremely expensive. When dealing with high dimensional particlesthat contain many parameters in each state, the number of needed particlesincreases exponentially to cover a region of interest around the current targetstate. However, using a large number of particles is important to obtain highdiversity for the particles so they will represent the solution space adequately.

18

1.4 Multiscale Function Extension Method

Given a set Pn = {p1, p2, . . . , pn} of n particles, we want to estimate thevalues of their weights using a small subset Pk ⊆ Pn of k particles. Here, k ≪n is a predefined number for which the weights of Pk are computed directly.Formally, our goal is to interpolate the weight function w : Pk → R from Pk

to Pn (as calculated in Step 5 in Algorithm 1.3.1). For that purpose, we usethe MSE method [28], which is a multiscale-based algorithm. The MSE isan iterative method. Each MSE iteration contains two phases: subsamplingand extension. The first phase is done by a special decomposition, known asinterpolative decomposition (ID) [4], of an affinities matrix associated withPk. The second phase extends the function from Pk to Pn using the outputfrom the first (sampling) phase. The essentials of the MSE are described inSections 1.4.1, 1.4.2 and in [28].

We use the following notation: s denotes the scale parameter, s = 0, 1, . . .,ϵs = 2−sϵ0 for some positive number ϵ0, and

g(s)(r)∆= exp{−r2/ϵs}. (1.4.1)

For a fixed scale s, we define the functions g(s)j : Pn → R, j = 1, . . . , n

g(s)j (p)

∆= g(s)(d(pj, p)) (1.4.2)

to be a Gaussian of width ϵs centered at pj. Here, d : Pn × Pn → R is adistance function defined among the particles. For example, it can be theEuclidean distance between their coordinates. Without loss of generality, weassume that the chosen particles are indexed such that Pk = {p1, . . . , pk}.Let A(s) be the k × k affinities matrix associated with Pk, whose (i, j)-thentry is g(s)(d(pi, pj)). In other words,

A(s)(i, j)∆= g(s)(d(pi, pj)), i, j = 1, . . . , k. (1.4.3)

Note that the j-th column of A(s) is the restriction of g(s)j to Pk. Pc

k is thecomplementary set of Pk in Pn. The spectral norm of a matrix A is denotedby ∥A∥ and its j-th singular value (in decreasing order) is denoted by σj(A).w = (w1, w2, . . . , wk)

T are the values of the weight function w on the particlesin Pk , where wj is the weight of pj.

1.4.1 Data Subsampling Through ID of a GaussianMatrix

19

Algorithm 1.4.1: Deterministic Interpolative Decomposition

Input: An m× n matrix A and an integer k, s.t. k < min {m,n} .Output: An m× k matrix B, whose columns are a subset of A’scolumns, and a k × n matrix P s.t.∥A−BP∥ ≤

√4k(n− k) + 1σk+1(A).

1: Apply a pivoted QR algorithm to A (Algorithm 5.4.1 in [5]),

APR = QR,

where PR is an n× n permutation matrix, Q is an m×m orthogonalmatrix and R is an m× n upper triangular matrix, where the diagonalabsolute values are decreasingly ordered.

2: Split R and Q s.t.

R∆=

[R11 R12

0 R22

], Q

∆=[Q1 Q2

]where R11 is k × k, R12 is k × (n− k), R22 is (m− k)× (n− k), Q1 ism× k and Q2 is m× (m− k).

3: Define the m× k matrixB

∆= Q1R11 (1.4.4)

4: Define the k × n matrix

P∆=[Ik R−1

11 R12

]P TR

where Ik is the k × k identity matrix.

Let s be a fixed scale. Our goal is to approximate w by a superposition ofthe columns in the affinity matrix A(s), then to extend w to p∗ ∈ Pc

k based onthe affinities between p∗ and the elements of Pk. Due to Bochner’s theorem,as long as Pk consists of k distinct particles, A(s), which is defined in Eq.1.4.3, is strictly positive definite. At first sight, we can solve the equationA(s)c = w and by using the radiality of g(s) (defined in Eq. 1.4.1), to extend

w to p∗ by w(p∗) =∑k

i=1 cig(s)i (p∗) (defined in Eq. 1.4.2), which is exact

on Pk. That is, wj = w(pj), j = 1, 2, . . . , k. This method is known as theNystrom extension [29, 30]. As proved in [28], the condition number of A(s)

is large for small values of s, namely A(s) is numerically singular. On theother hand, a too big s results in a short distance interpolation. Moreover,even if we choose such s for which A(s) is numerically non-singular and the

20

interpolation is not for a too short distance, interpolation by a superpositionof translated Gaussian of a fixed width will not necessarily fit the propertiesof w. In order to overcome the numerical singularity of A(s), we apply theID procedure to A(s).

An ID of order k of an m × n matrix A consists of an m × k matrix Bwhose columns consist of a subset of the columns of A, as well as a k × nmatrix P , such that a subset of the columns of P constitutes a k×k identitymatrix, and A ≈ BP in the sense that ∥A−BP∥ . O(n, σk+1(A)). Usually,k is chosen to be the numerical rank of A up to a certain accuracy δ > 0,i.e. k = #{j : σj(A) ≥ δσ1(A)}. This selection of k guarantees that thecolumns of B constitute a well conditioned basis to the range of A, whosecondition number is of order δ. The deterministic ID algorithm is describedin Algorithm 1.4.1 whose complexity is O(mn2).

Additionally to Algorithm 1.4.1, there are randomized versions of the IDalgorithm that require less computational operations. For example, Algo-rithm 1.4.2 is a random-projections based algorithm[6]. It produces an IDfor a general matrix m× n matrix A and an integer l < min{m,n}, s.t.

∥A−BP∥2 . l√mnσl+1(A). (1.4.5)

The complexity is lCA+lCAT +O(l2n log(n)), where CA is the cost of applyingA to a vector of length k, and CAT is the cost of applying AT to a vectorof length m. Algorithm 1.4.2 uses the deterministic ID Algorithm 1.4.1 byapplying it to a smaller matrix than A.

Each column of A(s), as defined in Eq. 1.4.3, corresponds to a singleparticle in Pk. The columns subset selection from A(s) is equivalent to Pk

particles that are subsampled from the associated Pn particles.

1.4.2 Multiscale Function Extension Algorithm

Let A(s) ≈ B(s)P (s) be the ID of A(s), where B(s) is a k × r matrix, whosecolumns constitute a subset of the columns of A(s), and P(s) = {ps1 , . . . , psr}is its associated sampled dataset. The extension of the weight function wfrom Pk to Pc

k is done by an orthogonal projection of w on the columns spaceof B(s), and by extending the projected function to Pc

k similar to Nystromextension method that uses the radiality of g(s). Algorithm 1.4.3, whosecomplexity is O(kr2), describes the single-scale extension algorithm.

Since the columns of B(s) do not necessarily constitute a basis of Rk, w(s)

is not necessarily equal to w, namely the output of Algorithm 1.4.3 is notan interpolant of w. This phenomenon is illustrated in Fig. 5.1 in [28]. Inthis case, we apply Algorithm 1.4.3 once again to the residual w−w(s) with

21

Algorithm 1.4.2: Randomized Interpolative Decomposition

Input: An m× n matrix A and two integers l < k, s.t.k < min{m,n} (for example, k = l + 8).Output: An m× l matrix B and an l × n matrix P that satisfy Eq.1.4.5.

1: Use a random number generator to form a real k ×m matrix G whoseentries are i.i.d Gaussian random variables of zero mean and unitvariance. Compute the k × n product matrix

W = GA.

2: Using Algorithm 1.4.1, form a k× l matrix S, whose columns constitutea subset of the columns of W and a real l × n matrix P , such that

∥SP −W∥2 ≤√

4l(n− l) + 1σl+1(W ).

3: From Step 2, the columns of S constitute a subset of the columns of W .In other words, there exists a finite sequence i1, i2, . . . , il of integers suchthat, for any j = 1, 2, . . . , l, the j-th column of S is the ij-th column ofW . The corresponding columns of A are collected into a real m× lmatrix B, such that, for any j = 1, 2, . . . , l, the j-th column of B is theij-th column of A. Then, the sampled dataset is Ds = {xi1 , xi2 , . . . , xil}.

a narrower Gaussian. This guarantees that the next-scale affinities matrixA(s+1) has a bigger numerical rank then A(s). As a consequence, it guaranteesa wider subspace to project the residual on. The above is summarized inAlgorithm 1.4.4 whose complexity is O(k3).

1.5 Multiscale Particle Filter (MSPF) Com-

putation

In order to accelerate the PF algorithm when it runs on a large numberof particles, we apply particles subsampling. We use the MSE method tocompute the weights for the rest of the particles. This will allow us tocompute a relatively small number of particle weights in each cycle of thealgorithm. This approach can be effective if the particle’s weight calculationon all particles is computationally expensive especially when the number ofparticles is high. Algorithm 1.5.1 describes our modified PF algorithm thatsupports multiscale subsampling and extension.

22

Algorithm 1.4.3: Single-Scale Extension

Input: Scale parameter s, k × r matrix B(s), the associated sampleddataset P(s) = {ps1 , . . . , psk}, a new data point p∗ ∈ Pc

k, and theweight function w : Pk → R to be extended.Output: The projection w(s) = (w

(s)1 , w

(s)2 , . . . w

(s)k )T of

w = (w1, w2, . . . , wk)T on B(s) and its extension w

(s)∗ to p∗.

1: Solve the least squares problem minc∈Rr∥B(s)c−w∥2 for c = (c1, c2, . . . , cr)

T .

2: Calculate the orthogonal projection of w on the columns of B(s),w(s) = B(s)c.

3: Calculate the extension w(s)∗ of w(s) to p∗ using Eq. 1.4.2:

w(s)∗ ,

r∑j=1

cjg(s)sj(p∗). (1.4.6)

Algorithm 1.4.4: Multiscale Data Sampling and Function Extension

Input: A dataset of k particles Pk = {p1, . . . , pk}, a positive numberϵ0, a new particle p∗ ∈ Pc

k, a weight function w : Pk → R, to beextended (represented by w = (w1, w2, . . . , wk)

T where wj is theweight of pj), and an error parameter err ≥ 0.Output: An approximation w = (w1 w2 . . . wk)

T of w on Pk thatsatisfies ∥w − w∥ ≤ err and its extension w∗ to p∗.

1: Set the scale parameter s = 0, the approximation to w, w = 0, and theextension of w to p∗, w∗ = 0.

2: while ∥w − w∥ > err do3: Form the Gaussian affinities matrix A(s) (Eq. 1.4.3) on Pk, with

ϵs = 2−sϵ0.4: Set r to be the numerical rank of A(s) (see Definition 3.1 in [28]).5: Apply Algorithm 1.4.1 to A(s) with the parameter r to get an k × r

matrix B(s) and the associated sampled dataset P(s).6: Apply Algorithm 1.4.3 to B(s), P(s), p∗, and w − w. We get the

approximation w(s) to w − w at scale s, and its extension w(s)∗ to p∗.

7: Accumulate the approximations from step 6: Set w = w + w(s) andw∗ = w∗ + w

(s)∗ .

8: Set s = s+ 1.9: end while

23

1.5.1 Particle Subsampling

In each cycle in Algorithm 1.5.1, we first resample a new set of k particlesfrom the set Pn using their weights as the distribution function. Once weapply the dynamic model to each particle and advance it, new weights haveto be computed. Therefore, we first select a small subset from all the nparticles. The goal is to find a good set of representative particle candidatesthat will capture the geometry and the activity of the source weight functionw : Pn → R. To identify these candidates, we define a distance metricbetween the n particles using a weighted Euclidean distance between eachtwo particles viewed as vectors. Other metrics can be used as well. Weselect the particle candidates using Algorithm 1.4.2 which is the randomizedID. We construct an affinity matrix A(s) that contains the affinities d(pi, pj)between the particles. The kernel, which is defined between the particles, is

[A(s)]ij∆= exp

(−d(pi, pj)2

ϵs

), i, j = 1, .., n. (1.5.1)

We calculate the affinities between all the particles in Pn such that A(s) isan n × n matrix defined by Eq. 1.5.1. The number of candidates we use isat most k. The output from the randomized ID algorithm (Algorithm 1.4.2)will be the set Pk of k particles that were selected from Pn. We computedirectly the weights for the k selected particles.

1.5.2 Weight Calculation using Function Extension

We obtained a set of particles Pk with their calculated weight values. Next,we continue and compute the weights for the rest of the particles that arenot included in Pk. We compute the weight value for each of the othern − k particles by applying Algorithm 1.4.4 to the set Pk that has the firstk columns of the affinity matrix A(s). These columns contain the affinitiesbetween each pair of particles in Pk, the affinities between the particles in Pk

and all the other particles. The output from Algorithm 1.4.4 is the weights ofthe n−k particles that were not selected in the previous step. This extensionmethod allows us to skip a direct weight computations for the remaining n−kparticles. Therefore, we keep the entire set of particles Pn from which weresample a small set of particles in the next PF algorithm step. This isespecially beneficial when we cannot compute the weights for all particles ifthe computation is too expensive. Once the n− k weights are calculated, weselect the particle with the maximum likelihood as the prediction result andcontinue to the next algorithmic cycle.

24

Weights computation for k particles and their extension to the othern − k particles reduce the number of operations and accelerates the PF asdemonstrated in Section 1.7.

Algorithm 1.5.1: Multiscale Particle Filter (MSPF)

Input: n number of particles; x0 initial state; y1, ..., yT currentobservations; q(·) proposal distribution function; p(·) approximatedposterior distribution functionOutput: x1, ..., xT estimated observations

1: Weights initialization: w(i)0 = 1

n, x

(i)0 ∼ p(x0), i = 1, ..., n.

2: for time steps t=1,...,T do3: Resample n new particles by their distributions determined by the

weights w(i)t−1.

4: Prediction: Apply the dynamic model q(·) to each particle toestimate the next state using xt−1 and y1, ..., yt

x(i)t ∼ q(x

(i)t |x

(i)t−1, y1, ..., yt), i = 1, ..., n.

5: Selection: Select a subset of size k from the new particles x(i)t by

computing the affinity matrix A(s) (Eq. 1.5.1) and by using the IDAlgorithm 1.4.2.

6: Calculate the weights of the k selected particles using

w(i)t ∝

p(yt|x(i)t )p(x

(i)t |x

(i)t−1)

q(xt|x(i)t−1, y1, ..., yt)

, i = 1, ..., n.

7: Weight extension: Calculate the weights of the n− k particles usingthe MSE Algorithm 1.4.4.

8: Weights normalization:

w(i)t =

w(i)t∑n

l=1w(l)t

, i = 1, ..., n.

9: xt is set to be the particle x(l)t where l = argmax

1≤i≤nw

(i)t .

10: end for

25

1.6 Accelerating the Particle Sampling Step

The computational bottleneck in Algorithm 1.5.1 lies in the selection step(Step 5) where we sample a subset of size k from the source particle set ofsize n. In this step, the randomized ID algorithm requires O(kn2 + k2nlogn)operations (see Section 5.3 in [6]). In addition, the input kernel matrixcalculation in the ID algorithm requires O(n2) operations. Therefore, Algo-rithm 1.5.1 does not scale well and the gained performance boost decreasesas the number of particles increases. To improve the performance of thesampling step, a different sampling method was developed. The method isbased on a variation of the Farthest Point Sampling (FPS) algorithm [32, 54].The FPS algorithm begins by selecting a random data point and adding itinto the sampled set. Then, in each step it adds the most farthest data pointfrom the sampled set, thus minimizing the distance between the original datapoints and the sampled set. The resulted sampled set contains k data pointswhich spans the original set.

Our sampling step uses the FPS selection approach, but the distancemetric between data points is affected by the particles densities. The stan-dard FPS algorithm is computed in O(klogn) operations [55]. However, datapoints densities calculation, which uses a Gaussian kernel, takes O(n2) op-erations. In our implementation, we use the multidimensional version of theGauss transform to calculate the density of each data point (in our case it isa particle). We use the fast multipole method (FMM) as an efficient way tocalculate the Gauss transform that is called the fast Gauss transform (FGT).This approach enables to reduce the computational cost of the particle selec-tion step from O(kn2 + k2nlogn) to O(n+ klogn) operations. Sections 1.6.1and 1.6.2 briefly describe the implementation of FMM and FGT, respectively.These descriptions were adopted from [56].

1.6.1 Fast Multipole Method

Assume that we want to evaluate the sum

v(yj) =N∑i=1

uiϕi(yj), j = 1, ...,M (1.6.1)

where {ϕi} is a family of functions that correspond to a source functionϕ centered around different locations xi, yj is a point in a d-dimensionalspace and ui is a weight. Direct evaluation of the sum in Eq. 1.6.1 requiresO(MN) operations. In the FMM algorithm [33], we assume that {ϕi} canbe expanded with a multipole series and with a local series centered at x∗

26

and y∗, respectively, such that

ϕ(y) =p−1∑n=0

bn(x∗)Sn(y − x∗) + ϵS(p)

ϕ(y) =p−1∑n=0

an(y∗)Rn(y − y∗) + ϵR(p)

(1.6.2)

where Sn and Rn are the multipole and the local basis functions, respectively,x∗ and y∗ are the expansion centers, {an} and {bn} are the expansion coef-ficients and ϵS(p) and ϵR(p) are the errors induced by truncating the seriesafter p terms. Rewriting the sum in Eq. 1.6.1 using one of the expansionsin 1.6.2 gives:

v(yj) =N∑i=1

uiϕi(yj)

=N∑i=1

ui

p−1∑n=0

cniRn(yj − y∗), j = 1, ...,M.

(1.6.3)

Here, cni is the coefficient an of ϕi. By rearranging the expression in Eq.1.6.3 we get

v(yj) =p−1∑n=0

[N∑i=1

uicni

]Rn(yj − y∗)

=p−1∑n=0

CnRn(yj − y∗).

(1.6.4)

The computation of Eq. 1.6.4 takes O(Mp+Np) operations where p deter-mines the desired accuracy. The FMM can be used to compute the Gausstransform efficiently and this is referred as the FGT.

1.6.2 Fast Gauss Transform (FGT)

The FGT can be evaluated by using the FMM directly by choosing ϕi(y) =e−∥y−xi∥2/h2

, and then expanding the Gaussian using Hermite Polynomials.In one dimension, this yields

e−∥y−xi∥2/h2

=

p−1∑n=0

1

n!(xi − x∗

h)nHn(

y − x∗

h) + ϵ(p) (1.6.5)

where Hn are Hermite polynomials defined by Hn(x) = (−1)nex2 dn

dxn (e−x2

).Extension to higher dimensions is done by considering the multivariate Gaus-sian function as a product of univariate Gaussians where the series factoriza-tions are applied to each dimension, see [56]. Equipped with a fast methodto calculate the densities, we present in Section 1.6.3 the full Weighted Far-thest Point Selection (WFPS) algorithm for selecting the representatives datapoints that replaces the randomized ID Algorithm 1.4.2.

27

1.6.3 Weighted Farthest Point Selection (WFPS) Al-gorithm

WFPS is a modification of the FPS Algorithm (described in Section 1.6).In WFPS, the metric value is bigger for two pairs of data points with equaldistance that are located in a high density area than if they are located in anarea with lower density. Therefore, the next sampled data point in each stepin the FPS algorithm is selected according to a density-weighted distancefunction. The idea to replace the ID algorithm with the WFPS is originatedfrom the empirical observations that in each step of the randomized ID algo-rithm the next most independent distance column of the matrix is selected.This can be viewed as selecting a data point which is the “most different” indistance terms. The density computation in the FPS algorithm is based onthe observation that the randomized ID algorithm favors data points whosedistance column have higher “energy” then the rest of the columns. Wefound that the WFPS algorithm yields an improved particle selection setin comparison to either randomly selected or a regular FPS selection. Thedensity weights cause the algorithm to prefer particle selections from denseareas. Similar augmentation to the FPS algorithm is shown in [57].

Algorithm 1.6.1 describes the WFPS algorithm for selecting k data pointsfrom a set of n data points in Rd.

When the selection step (Step 5) in the deterministic ID Algorithm 1.4.1is replaced by the WFPS Algorithm 1.6.1 (where the input set to the WFPSis the particle set Pn), we achieved even faster computational time. Theresults are presented in Section 1.7.4.

1.7 Experimental Results

The performance of the MSPF algorithm was evaluated by performing severalexperiments on objects tracking in both synthetic and real video sequences.The results were compared with the results from other tracking methods. Weused a video sequence to track a ball (Fig. 1.7.1), which moves in a non-linearway around a basketball player. Each particle is described by a vector withsix coordinates p = (x, y, vx, vy, w, h), which are the target’s location, speedin each axis, width (w) and height (h), respectively. The target’s initial statep0 is given as the input to the algorithm. Initially, the algorithm extracts acolor histogram from a tile BT that contains the target. The target’s tile isa rectangular defined by four points

B = {(x− 12w, y − 1

2h), (x+ 1

2w, y − 1

2h),

(x− 12w, y + 1

2h), (x+ 1

2w, y + 1

2h)}. (1.7.1)

28

Algorithm 1.6.1: Weighted Farthest Point Selection (WFPS)

Input: A set of data points X = {x1, .., xn} in Rd; k the numberselected data pointsOutput: k selected data points S

1: Set w1, ..., wn to be the calculated densities of the data points in Xusing FGT (Eq. 1.6.4).

2: Set S = {x1}.3: Set ds(xi) = wi∥xi − x1∥ for all data points in X.4: for step=2,...,k do5: Find the farthest data point in S:

s = argmaxx∈X

ds(x).

6: Add data point s to the set S.7: Update the distances of the data points in X:

ds(xi) = min(ds(xi), wi∥xi − s∥).

8: end for

This tile is used later when it is compared to the other color histograms ofthe other particles. We also used the weighted Euclidean distance betweentwo particles as the distance between the vectors that represents them suchthat

d(p(i), p(j)) = ∥p(i) − p(j)∥2. (1.7.2)

This metric was used in the affinity matrixA(s) (Eq. 1.4.3) in Algorithm 1.4.4.In each cycle of the algorithm, we sampled the particles and obtained a newset of n particles (see Step 3 in Algorithm 1.5.1). The model equations areapplied to each particle. In this case, it is done by adding the speed tothe corresponding location coordinates. Then, we perturbed each particleby adding a Gaussian noise with a standard deviation configured to eachcoordinate separately. The system dynamics is formulated as

pt =

1 0 1 0 0 00 1 0 1 0 00 0 1 0 0 00 0 0 1 0 00 0 0 0 1 00 0 0 0 0 1

pt−1 + nt (1.7.3)

29

where nt is a random Gaussian noise vector such that nt(i) ∼ N (0, σ2i ), and

σ2i , i = 1, . . . , 6, represents the variance we assigned for this coordinate. For

example, in a constant velocity, the variance of the velocity is zero, that isσ2vx = σ2

vy = 0. To calculate the weight of each particle p(i), i = 1, . . . , n, weprocess a tile, which is centered at (x, y) with width w and height h definedin Eq. 1.7.1, by calculating a color histogram for all the pixels within the tile.Then, the histogram is compared with the histogram from the original targettile using the Bhattacharyya distance (Eq. 1.3.1). Therefore, the weight ofthe particle is

w(i) =√h(BT )Th(Bi), i = 1, . . . , n, (1.7.4)

where h(Bi) is the color histogram of tile Bi. In the next step, we computethe weights of the k particles selected by the randomized ID Algorithm 1.4.2.Their values are used for the weights calculation for the rest of the parti-cles. This is done by applying the MSE with the defined distance metric(Eq. 1.7.2). Once all the weights were calculated, they are normalized andare used as the new distribution values for the n particles to be resampledin the next phase. The results from the application of the MSPF Algo-rithm 1.5.1 are applied to the basketball sequence and they are displayed inFig. 1.7.1. The basketball is being tracked while the camera is moving andthe background is changing constantly.

The performance of the MSPF Algorithm 1.5.1 was tested with differentn and k values. We found that for most tracking tasks, a small set of particles(between 5% to 10% from the total number of particles n) is sufficient as seenin Fig. 1.7.1.

To measure the acceleration of the MSPF Algorithm 1.5.1, the trackingquality of each algorithm was tested along with their weight computationtime. Each test was repeated 10 times and the average success tracking ratewas computed. The standard PF (Algorithm 1.3.1) was tested with a particlerange of 10−300 particles. The MSPF Algorithm was tested using 100−3000particles while a direct computation was done for 10% of the particles. Theresults are shown in Fig. 1.7.2. We can see that in the same computationtime, the tracking success ratio of MSPF Algorithm outperforms the standardPF (Algorithm 1.3.1).

In addition, the MSPF Algorithm 1.5.1 tracking success graph has lessjitter than the standard PF Algorithm. This is because the MSPF algorithmuses more particles to cover the same state space under similar computationalcost then the standard PF Algorithm. For example, the standard PF execu-tion time with 160 particles took 80 seconds while achieving a 45% trackingsuccess rate. The MSPF achieved a 98% tracking success rate when using800 particles where the weights were computed for only 10% of them while

30

Figure 1.7.1: A set of representative frames from a basketball tracking se-quence. The object is tracked using the MSPF Algorithm 1.5.1 with a directcomputation of the weights for 10% from the total number of particles.

achieving the same execution time.

1.7.1 Comparison with Other Approximation Meth-ods

In order to compare between the performance of the MSPF algorithm withdifferent approximation methods, we tested the MSPF algorithm using dif-ferent approximation methods to calculate the weights of the particles. Forthis comparison, we used a synthetic movie. We generated a video sequenceby moving a colored disc over a still image. The disc moved along a non-linear parametric function. This allows us to know the ground truth of thetarget at any frame. We applied the MSPF algorithm to the synthetic videosequence several times, each with different interpolation method. We com-

31

0 20 40 60 80 100 120 140 160 1800

10

20

30

40

50

60

70

80

90

100

Weight Computation Time

Per

cent

age

of fr

ames

with

suc

cess

ful t

rack

ing

Naive PFMultiScale PF

Figure 1.7.2: Comparison between the tracking success rate for a given com-putational budget with standard PF (Alg. 1.3.1 which we refer to as the“naive PF”) and MSPF (Alg. 1.5.1).

pared the total Root Mean Square Error (RMSE) for each approximationmethod measured on the distance between the MSPF algorithm output andthe real location of the target. The MSPF Algorithm 1.5.1 achieved the low-est error rate even when we sampled between 2%-5% particles. When suchsubsampling rate was used, all the other tested methods fail (error grew).

Next, we compared between the computational time performing Algo-rithm 1.4.4 using different sampling rates, by running the PF Algorithm 1.3.1.The weights computation of all particles took 200 seconds (on average). FromFig. 1.7.4, we can see that the MSPF Algorithm 1.5.1 achieved the lowestcomputational time when the sampling rate was lower than 13% of the totalnumber of particles. When the WFPS sampling was used, the computationaltime was even better. Overall, the MSPF Algorithm 1.5.1 achieved the lowestcomputational time while maintaining a low error rate.

We repeated the tests with another video sequence where the disc location

32

0 50 100 1500

50

100

150

200

250

Particle Sampling Number

Tra

ckin

g D

ista

nce

Err

or (

RM

SE

)

Multiscale(ID)multiscale(WFPS)linearCubic

Figure 1.7.3: Comparison between the RMSE for different methods: mul-tiscale with ID sampling (Alg. 1.4.2), multiscale with WFPS sampling(Alg. 1.6.1), linear approximation and cubic approximation.

was set to simulate Brownian motion such that the acceleration was a randomwhite noise. The comparison between the running time and tracking errorrate showed similar results as in Figs. 1.7.3 and 1.7.4. In Fig. 1.7.5, we cansee the tracking location error of each movie frame, when comparing betweenthe MSPF and the naive PF, under the same computational budget. TheMSPF manages to track the disc object with a low error rate, while the naivePF loses the tracking occasionally.

1.7.2 Multiple Targets Tracking

The MSPF Algorithm 1.5.1 was tested on a video sequence that containsmultiple objects. In such scenario, the tracking can be achieved by usingtwo separate PF algorithms. Each PF uses a different set of particles and aseparate set of observations. Here, each particle describes a state of a single

33

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.20

50

100

150

200

250

300

350

Particle Sampling Factor

Acc

umul

ated

Com

puta

tion

Tim

e (S

ec)

NoneMultiscale(ID)Multiscale(WFPS)

Figure 1.7.4: Computational time of the MSPF with different sampling rates.The total number of particles is 1500.

target.

Another approach to track multiple objects is to create a “super-state”particle, which describes the state of all the objects inside the video sequence.In this case, the number of fields inside the particle vector was n×k where k isthe number of targets and k is the number of parameters required to describea single target. In the latter scenario, the MSE Algorithm outperformed theother interpolation methods since it works better in high dimensions. Theadvantage of using the “super-state” particle is by enabling to advance aparticle state by dynamic model equations that took into account the stateof all the objects within a particle including dependencies between objects.

In order to test the tracking performance using the “super-state” particle,we tracked two tennis players in a video sequence. The players are representedby a single particle with 6 × 2 = 12 coordinates, 6 for each player (locationin x and y, velocity in x and y, width and hight). In each algorithmic cycle,the prediction step advanced the particles by the application of the model

34

Figure 1.7.5: Comparison between the tracking error for each frame, for agiven computational budget. We processed 1000 frames of the Brownianmotion movie using the naive PF (Alg. 1.3.1) and then using the MSPF(Alg. 1.5.1). For the naive PF, we used 150 particles. For the MSPF we used500 particles and 10% sampling rate. Both Algorithm’s execution time was280 seconds.

equations separately to each coordinate. The weight calculation was done ineach region separately and then multiplied the Bhattacharyya coefficient toobtain a single weight. Then, the extension step was applied as before usingthe weighted Euclidean metric for each particle that has 12 coordinates. Byusing Algorithm 1.5.1, we were able to track both targets successfully withthe lowest computational cost in comparison to other extension methods thatare based on standard interpolation such as B-splines, cubics and nearestneighbor. Figure 1.7.6 displays the results from the application of the MSPFAlgorithm 1.5.1 to achieve multiple targets tracking. We used 1500 particlesto track both players. In each step of the algorithm, we calculated the weightsfor 150 selected particles and interpolated the weights for the other 1350particles by using the MSE Algorithm. The complete videos of the basketball

35

Figure 1.7.6: A selected set of representative frames from the tennis gamethat demonstrates the tracking performance. The two tennis players weretracked by the application of the MSPF Algorithm 1.5.1 with a direct weightscomputation for 10% from the total number of particles.

and tennis games tracking can be viewed in our website1.

1.7.3 Comparison with the EMD Measurement

Recently, the Earth Moving Distance (EMD) [58] was used for particlesweight computation since this particle weight fits deformable objects [59].The EMD computational cost is significantly higher than other methods suchas color histograms. The MSPF becomes effective as the computational costof the weights increases. We tested Algorithm 1.5.1 with the EMD metricto demonstrate how well the extension scheme fits it. Several runs were con-ducted on the “Lemming” sequence from the PROST database. Each run

1http://www.cs.tau.ac.il/research/yaniv.shmueli/mspf

http://www.cs.tau.ac.il/research/yaniv.shmueli/mspf

36

used several frames executed on i7-2630QM 2.9GHz processor. Weights werecalculated for 10% from the total number of particles while the rest of theparticles were estimated using the MSE Algorithm. We verified that the tar-get was not lost during the tracking procedure in each execution when usingMSPF algorithm.

Table 1.7.1 shows the time differences between the standard version ofthe PF algorithm that uses the EMD metric (Algorithm 1.3.1) and our im-plementation that uses the MSE method (Algorithm 1.4.2) . For the latter,10% of the particles were sampled, and the MSE was applied to the other90% of the particles. We can see that the MSE algorithm reduces the PFtotal computation time. However, we can also observe that when the numberof particles increases, the acceleration becomes less significant, as seen in thesecond and third columns. We analyze this scalability issue in Section 1.7.4.

1.7.4 Weighted FPS in the Selection Step

Table 1.7.1 compares between the running times of theWFPS Algorithm 1.6.1and the randomized ID as the selection methods. Each time the MSPF Al-gorithm 1.5.1, which uses the EMD, was tested with a different particle setsizes. Performance comparisons were done between the following algorithms:standard PF, PF with MSE and ID selection, PF with MSE and WFPSselection. Table 1.7.1 shows that the acceleration factor is high even when10, 000 particles are used.

Table 1.7.1: Comparison between WFPS and ID acceleration times [sec], inthe MSPF algorithm that uses EMD. Sampling rate was 10% from the totalnumber of particles.

# of Time Time Time AccelerationParticles [No MSE] [MSE-ID] [MSE-WFPS] Factor

Alg. 1.3.1 Alg. 1.4.2 Alg. 1.6.1

2000 63 10.6 6.6 9.54000 125 32 14 8.96000 187 75.4 22 8.58000 260 151 32 8.110000 294 266 41 7.1

37

1.8 Conclusion

In this work, several contributions are presented. The PF computational timewas reduced by the application of MSE method that reduces the load of theparticle weight calculation. Therefore, it allows us to utilize more particleswithin a given computational budget. This improves the PF performance.The modified PF algorithm was tested on real video sequences to successfullytrack a single and multiple targets. In addition, the performance of the PFwas compared with other extension methods and demonstrated that the MSEmanaged to track the target with much fewer particles then PF was using.These enhancements can become effective when multiple targets are trackedin real-time.

Part II

Randomized LU Decompositionand its Applications

39

Chapter 2

Randomized LU Decomposition

In this chapter, we present a fast randomized algorithm that computes alow-rank LU decomposition. The algorithm uses random projections typetechniques to efficiently compute a low-rank approximation of large matrices.The randomized LU algorithm can be parallelized and further accelerated byusing sparse random matrices in its projection step. Several error boundsfor the algorithm’s approximations are proved. To prove these bounds, re-cent results from random matrix theory related to sub-Gaussian matrices areused. The algorithm, which can utilize sparse structures, is fully parallelizedand thus can utilize efficiently GPUs. Numerical examples, which illustratethe performance of the algorithm and compare it to other decompositionmethods, are presented. The results in this chapter appear in [24].

2.1 Introduction

Matrix factorizations and low-rank approximations play a major role in manyof today’s applications [3]. In mathematics, matrix decompositions are usedfor low-rank approximations that often reveal interesting properties of a ma-trix. Matrix decompositions are used for example in solving linear equationsand in finding least squares solutions. In engineering, matrix decompositionsare used in computer vision [17], machine learning [60], collaborative filter-ing and Big Data analytics [10]. As the size of the data grows exponentially,analysis of large datasets has gained an increasing interest. Such an analysiscan involve a factorization step of the input data given as a large sample-by-feature matrix or by a sample affinity matrix. Two main reasons for thedifficulties in analyzing huge data structures are high memory consumptionand the computational complexity of the factorization step. Recently, thereis an on-going interest in applying mathematical tools that are based on

41

42

randomized algorithms to overcome these difficulties.Some of the randomized algorithms use random projections, which project

the matrix to a set of random vectors. Formally, given a matrix A of sizem×n (supposem ≥ n) and a random matrix G of size n×k, then the productAG is computed to obtain a smaller matrix that potentially captures mostof the data activities in A. In most of these applications, k is set to be muchsmaller than n to obtain a compact approximation for A.

Fast randomized matrix decomposition algorithms are used for track-ing objects in videos [22], multiscale extensions for data [28] and detectinganomalies in network traffic for finding cyber attacks [61], to name some.There are randomized versions for many different matrix factorization al-gorithms [7], compressed sensing methods [62] and least squares problems[9].

In this work, we develop a randomized version of the LU decomposition.Given an m × n matrix A, we seek a lower triangular m × k matrix L andan upper triangular k × n matrix U such that

∥LU − PAQ∥2 = C(m,n, k)σk+1(A), (2.1.1)

where P and Q are orthogonal permutation matrices, σk+1(A) is the k + 1largest singular value of A and C(m,n, k) is a constant depending on m,nand k.

The interest in a randomized LU decomposition can be motivated (com-putationally wise) by three important properties of the classical LU decompo-sition: First, it can be applied efficiently to sparse matrices with computationtime depending on the number of non-zero elements. Second, LU decomposi-tion with full pivoting on sparse matrices can generate large regions of zerosin the factorized matrices [63, 64, 65]. Third, LU decomposition can befully parallelized that makes it applicable for running on Graphics Process-ing Units (GPU). GPUs are mostly used for computer games, graphics andvisualization such as movies and 3D display. Their powerful computationcapabilities can be used for fast matrix computations [66].

The contributions of this work are the following: We develop a random-ized version for LU decomposition. Such an algorithm does not appear in theliterature, and we provide several error bounds for the error ∥LU−PAQ∥2. Inaddition, we present a sparse version of our randomized LU algorithm alongwith a full implementation on a standard GPU card. We present numericalresults that compare our algorithm with other decomposition methods andshow it superiority.

The chapter is organized as follows: in Section 2.2, we overview relatedwork on matrix decomposition and approximation using randomized meth-ods. Section 2.3 reviews some mathematical facts that are needed for the

43

development of the randomized LU. Section 2.4 presents the randomized LUalgorithm and proves several error bounds on the approximation. We discussthe case of sparse matrices and also show how to solve rank deficient leastsquares problems using the randomized algorithm as an example. Section2.5 presents numerical results on the approximation error, the computationalcomplexity of the algorithm and compares it with other methods. The per-formance comparison was done on random matrices, images and large sparsematrices.

2.2 Related Work

Efficient matrix decomposition serves as a basis for many studies and al-gorithms for data analysis applications. There is a variety of methods andalgorithms that factorize a matrix into several matrices. Typically, the fac-torized terms have properties such as being triangular, orthogonal, diagonal,sparse or low-rank. It is possible to have a certain control on the desiredapproximation error on a factorized matrix.

Rank revealing factorization uses permutation matrices on the columnsand rows of A so that the factorized matrices structure have a strong rankportion and a rank deficient portion. The most known example for approx-imating an m × n matrix A by a low-rank k matrix is the truncated SVD.Other rank revealing factorizations can be used to achieve low-rank approx-imations. For example, both QR and LU factorizations have rank revealingversions such as RRQR decomposition [67], strong RRQR [68] decomposition,RRLU decomposition [69] and strong RRLU decomposition [70].

Other matrix factorization methods such as Interpolative Decomposition(ID) [4] and CUR decomposition [71], use columns and rows of the originalmatrix A in the factorization process. Such a property exposes the mostimportant terms that construct A. An ID factorization of order k of anm × n matrix A consists of an m × k matrix B whose columns consist of asubset of the columns of A, as well as a k × n matrix P , such that a subsetof the columns of P becomes a k × k identity matrix and A ≈ BP suchthat ∥A − BP∥ . O(n, σk+1(A)). Usually, k is chosen to be the numericalrank k = #{j : σj(A) ≥ δσ1(A)} of A up to a certain accuracy δ > 0. Thisselection of k guarantees that the columns of B constitute a well-conditionedbasis to the range of A [4].

Randomized version for many important algorithms have been developedin order to deal with computational complexity by approximating the so-lution to a desired rank. These include SVD, QR and ID factorizations[6], CUR decomposition as a randomized version [71] of the pseudo-skeleton

44

decomposition, methods for solving least squares problems [72, 8, 9] andlow-rank approximations [8, 73].

In general, randomization methods for matrix factorization have twosteps. First, a low-dimensional space, which captures most of the “energy”of A, is found using randomization. Then, A is projected into the retrievedsubspace and projected matrix is factorized [7].

Several different selections exist for the random projection matrix, whichis used in Step 1. For example, it can be a matrix of random signs (±1)[74, 75]; a matrix of i.i.d Gaussian random variables with zero mean and unitvariance [6]; a matrix whose columns are selected randomly from the iden-tity matrix with either uniform or non-uniform probability [76, 77]; a randomsparse matrix designed to enable fast multiplication with a sparse input ma-trix A [8, 73]; random structured matrices that use orthogonal transformssuch as discrete Fourier transform, Walsh-Hadamard transform and so on( [72, 9, 78]). In our algorithm, we use Gaussian matrices in Step 1 as wellas sparse Gaussian matrices (a special case of sub-Gaussian matrices) whenfactorizing sparse matrices.

2.3 Preliminaries

In this section, we review the rank revealing LU (RRLU) decompositionand several singular values bounds on random matrices that will be used toprove the error bounds for the randomized LU algorithm. Throughout thechapter, we use the following notation: for any matrix A, σj(A) is the jthlargest singular value and ∥A∥ is the spectral norm (the largest singular valueor l2 operator norm). If x is a vector then ∥x∥ is the standard l2 (Euclidean)norm. A† denotes the pseudo-inverse of A.

2.3.1 Rank Revealing LU (RRLU)

The following theorem is adapted from [69] (Theorem 1.2):

Theorem 2.3.1 ([69]). Let A be an m×n matrix (m ≥ n). Given an integer1 ≤ k < n, then the following factorization

PAQ =

(L11 0L21 In−k

)(U11 U12

0 U22

), (2.3.1)

holds where L11 is a unit lower triangular, U11 is an upper triangular, P andQ are orthogonal permutation matrices. Let σ1 ≥ σ2 ≥ ... ≥ σn ≥ 0 be the

45

singular values of A, then:

σk ≥ σmin(L11U11) ≥σk

k(n− k) + 1, (2.3.2)

andσk+1 ≤ ∥U22∥ ≤ (k(n− k) + 1)σk+1. (2.3.3)

Based on Theorem 2.3.1, we have the following definition:

Definition 2.3.1 (RRLU Rank k Approximation denoted RRLUk). Givena RRLU decomposition (Theorem 2.3.1) of a matrix A with an integer k (asin Eq. 2.3.1) such that PAQ = LU , then the RRLU rank k approximationis defined by taking k columns from L and k rows from U such that

RRLUk(PAQ) =

(L11

L21

)(U11U12

). (2.3.4)

where L11, L21, U11, U12, P and Q are defined in Theorem 2.3.1

Lemma 2.3.2 (RRLU Approximation Error). The error of the RRLUk ap-proximation of A is

∥PAQ− RRLUk(PAQ)∥ ≤ (k(n− k) + 1)σk+1. (2.3.5)

Proof. From Eqs. 2.3.1 and 2.3.4 we have

∥PAQ− RRLUk(PAQ)∥ =

∥∥∥∥(L11 0L21 In−k

)(U11 U12

0 U22

)−(L11

L21

)(U11U12

)∥∥∥∥= ∥U22∥ ≤ (k(n− k) + 1)σk+1.

(2.3.6)The last inequality is derived from Eq. 2.3.3.

Lemma 2.3.3 appears in [79], page 75:

Lemma 2.3.3 ([79]). Let A and B be two matrices and let σj(·) denotes thejth singular value of a matrix. Then, σj(AB) ≤ ∥A∥σj(B) and σj(AB) ≤∥B∥σj(A).

Lemma 2.3.4 was taken from [6] and it is an equivalent formulation forEq. 8.8 in [80].

Lemma 2.3.4 ([6]). Suppose that G is a real n× l matrix whose entries arei.i.d Gaussian random variables with zero mean and unit variance and let mbe an integer such that m ≥ l, m ≥ n, γ > 1 and

1− 1

4(γ2 − 1)√πmγ2

(2γ2

eγ2−1

)m

(2.3.7)

is non-negative. Then, ∥G∥ ≤√2mγ with probability not less than the value

in Eq. 2.3.7.

46

2.3.2 Sparse Random Matrices

Sparse matrices have a significant importance in many applications. Thecomputation of the degrees of separation between two individuals using Face-book 2011 connection matrix is a typical example. It requires factorizing asparse matrix of size 720, 000, 000× 720, 000, 000 that has 69 billions connec-tions. It means that only 1.33× 10−5 percent of the matrix is non-zero [81].The advantage of using sparse matrices is evident.

Definition 2.3.2 (Sparse Gaussian matrix). A = (ξij) is a sparse Gaussianmatrix if each entry is centered normally distributed with probability ρ andzero with probability 1−ρ. That is, ξij is a random variable whose probabilitydensity function (PDF) is given by

p(x) = (1− ρ)δ(x) +ρ√2πσ2

e−x2

2σ2 , (2.3.8)

where δ(x) is the Dirac delta function,σ > 0 and 0 < ρ ≤ 1. If A is a sparsematrix whose non-zero entries are Gaussian variables, then we refer to ρ asthe density of the matrix.

Definition 2.3.3. A real valued random variable X is called sub-Gaussianif there exists b > 0 such that for all t > 0, we have EetX ≤ eb

2t2/2 where Eis the expectation.

Suppose X is distributed as in Eq. 2.3.8 and E is the expectation. It canbe easily verified that:

1. EX = 0;

2. EX2 = ρσ2;

3. E|X|3 = 4ρσ3√2π;

4. X is sub-Gaussian.

We review several facts adapted from [82] and [83] about random matriceswhose entries are sub-Gaussian. We focus on the case where A is a tall m×nmatrix (m > (1 + 1

lnn)n). Similar results can be found in [84] for square and

almost square matrices.

Definition 2.3.4. Assume that µ ≥ 1, a1 > 0 and a2 > 0. A(µ, a1, a2,m, n)is the set of all m× n (m > n) random matrices A = (ξij) whose entries arei.i.d real valued centered random variables satisfying the following conditions:

1. Moments: E|ξij|3 ≤ µ3;

47

2. Norm: P(∥A∥ > a1√m) ≤ e−a2m where P is the probability function;

3. Variance: Eξ2ij ≥ 1.

It is shown in [82] that if A is sub-Gaussian then A ∈ A. In particular,a Gaussian matrix whose entries are zero with probability 1− ρ is also sub-Gaussian. Hence, this model (Definition 2.3.4) can also be used for sparseGaussian matrices with density ρ. For simplicity, we work with randommatrices with unit variance. In the case of sparse Gaussian matrices, we set:

σ2 = 1ρ, then µ =

(4√2πρ

) 13.

The following theorems are taken from Section 2 in [82]:

Theorem 2.3.5 ([82]). Every matrix A of size m×n (m ≥ n) whose entriesare sub-Gaussian with µ ≥ 1 and a2 ≥ 0 satisfies:

P(∥A∥ ≥ a1

√m)≤ e−a2m (2.3.9)

with a1 = 6µ√a2 + 4.

Theorem 2.3.5 provides an upper bound for the largest singular valuethat depends on the desired probability. Theorem 2.3.6 is used to boundfrom below the smallest singular value of sparse Gaussian matrices.

Theorem 2.3.6 ([82]). Let µ ≥ 1, a1, a2 > 0. Let A be an m × n matrixwith m > (1+ 1

lnn)n. m can be written as m = (1+ δ)n. Suppose the entries

of A are independent centered random variables such that conditions 1, 2, 3in Definition 2.3.4 hold. Then, there exist positive constants c1 and c2 suchthat:

P(σn(A) ≤ c1√m) ≤ e−m + e−c′′m/(2µ6) + e−a2m ≤ e−c2m. (2.3.10)

The exact values of constants c1, c2 and c′′ are given by:

c1 =b

e2c3

(b

3e2c3a1

) 1δ

, (2.3.11)

c′′ =27

211. (2.3.12)

Here, c3 = 4√

2π

(2µ9

a31+√π), b = min

(14, c′

5a1µ3

)and c′ =

(27213

) 12 . For the

constant c2, we need a small enough constant to satisfy the inequality in Eq.2.3.10 and set it, for simplification, as

c2 = min

(1,

c′

(2µ6), a2

)− ln 3

m. (2.3.13)

48

2.4 Randomized LU

In this section, we present the randomized LU algorithm (Algorithm 2.4.1)that computes the LU rank k approximation of a full matrix. In addition,we present a version (Algorithm 2.4.2) that approximates a sparse matrix.Error bounds are proven for each algorithm.

The algorithm starts by projecting the input matrix on a random ma-trix. The resulting matrix captures most of the information of the inputmatrix. Then, we compute a triangular basis for this matrix and project theinput matrix on it. Last, we find a second triangular basis for the projectedcolumns and multiply it with the original basis. The product leads to alower triangular matrix, L and the U matrix is the upper triangular matrixobtained from the second LU factorization.

Algorithm 2.4.1: Randomized LU Decomposition

Input: A matrix of size m× n to decompose; k desired rank; lnumber of columns to use.Output: Matrices P,Q, L, U such that ∥PAQ− LU∥ ≤ O(σk+1(A))where P and Q are orthogonal permutation matrices, L and U are thelower and upper triangular matrices, respectively.

1: Create a matrix G of size n× l whose entries are i.i.d. Gaussianrandom variables with zero mean and unit standard deviation.

2: Y ← AG.3: Apply RRLU decomposition (Theorem 2.3.1) to Y such that

PY Qy = LyUy.4: Truncate Ly and Uy by choosing the first k columns and the first k

rows, respectively, such that Ly ← Ly(:, 1 : k) and Uy ← Uy(1 : k, :)5: B ← L†

yPA.6: Apply LU decomposition to B with column pivoting BQ = LbUb.7: L← LyLb.8: U ← Ub.

Remark 2.4.1. The pseudo-inverse of Ly in step 5 can be computed byL†

y = (LTy Ly)

−1LTy . This can be done efficiently when it is computed on

platforms such as GPUs that can multiply matrices efficiently. Usually, theinversion is done on a small matrix since in many cases k ≪ n and it canbe done by the application of Gaussian elimination.

Remark 2.4.2. In practice, it is sufficient to perform step 3 in Algorithm2.4.1 using standard LU decomposition with partial pivoting instead of apply-ing RRLU. The cases where U grows exponentially are extremely rare (seesection 3.4.5 in [5] and [85]).

49

We now present our main error bound for Algorithm 2.4.1:

Theorem 2.4.3. Given a matrix A of size m × n. Then, its randomizedLU decomposition produced by Algorithm 2.4.1 with integers k and l (l ≥ k)satisfies:

∥LU − PAQ∥ ≤(2√2nlβ2γ2 + 1 + 2

√2nlβγ (k(n− k) + 1)

)σk+1(A),

(2.4.1)with probability not less than

ξ , 1− 1√2π(l − k + 1)

(e

(l − k + 1)β

)l−k+1

− 1

4(γ2 − 1)√

πnγ2

(2γ2

eγ2−1

)n

,

(2.4.2)where β > 0 and γ > 1.

The proof of Theorem 2.4.3 is given in Section 2.4.2. To show that thesuccess probability ξ in Eq. 2.4.2 is sufficiently high, we present in Table2.4.1 several calculated values of ξ. We omitted the value of n from Table2.4.1 since it does not affect the value of ξ since the second term in Eq. 2.4.2decays fast.

Table 2.4.1: Calculated values of the success probability ξ (Eq. 2.4.2). Theterms l − k, β and γ appears in Eq. 2.4.2.

l − k β γ ξ

3 5 5 1− 6.8× 10−5

5 5 5 1− 9.0× 10−8

10 5 5 1− 5.2× 10−16

3 30 5 1− 5.2× 10−8

5 30 5 1− 1.9× 10−12

10 30 5 1− 1.4× 10−24

3 30 10 1− 5.2× 10−8

5 30 10 1− 1.9× 10−12

10 30 10 1− 1.4× 10−24

In Section 2.5, we show that in practice, Algorithm 2.4.1 produces compa-rable results to other well-known randomized factorization methods of low-rank matrices such as randomized SVD and randomized ID.

50

2.4.1 Computational Complexity Analysis

To compute the number of floating points operations in Algorithm 2.4.1, weevaluate the complexity of each step:

1. Generating an n× l random matrix requires O(nl) operations.

2. Multiplying A by G to form Y requires lCA operations, where CA is thecomplexity of applying A to an n× 1 column vector.

3. Partial pivoting computation of LU for Y requires O(ml2) operations.

4. Selecting the first k columns (we do not modify them) requires O(1)operations.

5. Computing the pseudo inverse of Ly requires O(k2m + k3 + k2m) op-erations and multiplying it by A requires kCAT operations. Note thatP is a permutation matrix that does not modify the rows of A.

6. Computing the partial pivoting LU for B requires O(k2n) operations.

7. Computing L requires O(k2m) operations.

8. Computing U requires O(1) operations.

By summing up the complexities of all the steps above, then Algorithm 2.4.1necessitated

CRandLU = lCA + kCAT +O(l2m+ k3 + k2n) (2.4.3)

operations. Here, we used CA (and CAT ) to denote the complexity of applyingA (and AT ) to a vector, respectively. For a general A, CA = CAT = O(mn).

2.4.2 Bounds for the Randomized LU

In this section, we prove Theorem 2.4.3 and an additional complementarybound. This is done by finding a basis to a smaller matrix AG, which isachieved in practice by using RRLU. The assumptions are that L is numer-ically stable so its pseudo-inverse can be computed accurately, that thereexists a matrix U such that LU is a good approximation to AG and thatthere exists a matrix F such that ∥AGF −A∥ is small. As for the numericalstability of L, it is always stable since it has a small condition number [86].

For the proof of Theorem 2.4.3, several lemmas are needed. Lemma 2.4.4states that a given basis L can form a basis for the columns A by boundingthe error ∥LL†A − A∥. The lemma uses additional smaller matrices (F , U

51

and G). Later on, we use this lemma by using G and U from Algorithm 2.4.1and F from Lemma 2.4.5.

Lemma 2.4.4. Assume that A is an m × n matrix, L is an m × k matrixwith rank k , G is an n× l matrix, U is a k× l matrix and F is l×n (k ≤ m)matrix. Then,

∥LL†A− A∥ ≤ 2∥AGF − A∥+ 2∥F∥∥LU − AG∥. (2.4.4)

Proof. By using the triangular inequality we get

∥LL†A− A∥ ≤ ∥LL†A− LL†AGF∥+ ∥LL†AGF − AGF∥+ ∥AGF − A∥.(2.4.5)

Clearly, the first term can be bounded by

∥LL†A− LL†AGF∥ ≤ ∥LL†∥∥A− AGF∥ ≤ ∥A− AGF∥. (2.4.6)

The second term can be bounded by

∥LL†AGF − AGF∥ ≤ ∥F∥∥LL†AG− AG∥. (2.4.7)

Also,

∥LL†AG− AG∥ ≤ ∥LL†AG− LL†LU∥+ ∥LL†LU − LU∥+ ∥LU − AG∥.(2.4.8)

Since L†L = I, it follows that ∥LL†LU − LU∥ = 0 and that ∥LL†AG −LL†LU∥ ≤ ∥AG− LU∥. When combined with Eq. 2.4.8 we obtain:

∥LL†AG− AG∥ ≤ 2∥LU − AG∥. (2.4.9)

By substituting Eq. 2.4.9 in Eq. 2.4.7 we get

∥LL†AGF − AGF∥ ≤ 2∥F∥∥LU − AG∥. (2.4.10)

By substituting Eqs. 2.4.6 and 2.4.10 in Eq. 2.4.5 we get

∥LL†A− A∥ ≤ 2∥AGF − A∥+ 2∥F∥∥LU − AG∥. (2.4.11)

Lemma 2.4.5 appears in [6]. It uses a lower bound for the least singularvalue of a Gaussian matrix with zero mean and unit variance. This boundcan be found in [87].

52

Lemma 2.4.5 ([6]). Assume that k, l,m and n are positive integers suchthat k ≤ l, l ≤ m and l ≤ n. Assume that A is a real m × n matrix, G isn × l whose entries are i.i.d Gaussian random variables of zero mean andunit variance, β and γ are real numbers, such that β > 0, γ > 1 and thequantity

1− 1√2π(l − k + 1)

(e

(l − k + 1)β

)l−k+1

− 1

4(γ2 − 1)√πnγ2

(2γ2

eγ2−1

)n

(2.4.12)is non-negative. Then, there exists a real l × n matrix F such that

∥AGF − A∥ ≤√2nlβ2γ2 + 1σk+1(A) (2.4.13)

and∥F∥ ≤

√lβ (2.4.14)

with probability not less than the value in Eq. 2.4.12.

Lemma 2.4.6 rephrases Lemma 2.4.5 by utilizing the bounds that appearin Section 2.3.2. The proof is close to the argumentation that appear in theproof of Lemma 2.4.5.

Lemma 2.4.6. Let A be a real m×n (m ≥ n) matrix. Let G be a real n× lmatrix whose entries are Gaussian i.i.d with zero mean and unit variance.Let k and l be integers such that l < m, l < n and l >

(1 + 1

ln k

)k. We define

a1, a2, c1 and c2 as in Theorem 2.3.6. Then, there exists a real matrix F ofsize l × n such that:

∥AGF − A∥ ≤

√a21n

c21l+ 1σk+1(A), (2.4.15)

and

∥F∥ ≤ 1

c1√l

(2.4.16)

with probability not less than 1− e−c2l − e−a2n.

Proof. We begin by forming the SVD of A

A = UΣV T , (2.4.17)

where U is orthogonal m×m matrix, Σ is m× n diagonal matrix with non-negative entries and V is orthogonal matrix n×n. Given V T and G, supposethat

V TG =

(HR

), (2.4.18)

53

where H is k× l and R is (n−k)× l. Since G is a Gaussian i.i.d. matrix andV is an orthogonal matrix, V TG is also a Gaussian i.i.d. matrix. Therefore,H is a Gaussian i.i.d. matrix. Let us define F = PV T , where P is of sizel × n such that

P =(H† 0

).

Therefore,F =

(H† 0

)V T . (2.4.19)

Computing ∥F∥ using Theorem 2.3.6 gives:

∥F∥ = ∥PV T∥ = ∥H†∥ = ∥HT (HHT )−1∥ = 1

σk(H)≤ 1

c1√l

(2.4.20)

with probability not less than 1− e−c2l. Now we can bound ∥AGF −A∥. Byusing Eqs. 2.4.17, 2.4.18 and 2.4.19 we get

AGF − A = UΣ

((HR

)(H† 0

)− I

)V T . (2.4.21)

We define S to be the upper-left k×k block of Σ and T to be the lower-right(n− k)× (n− k) block. Then,

Σ

((HR

)(H† 0

)− I

)=

(S 00 T

)(0 0

RH† −I

)=

(0 0

TRH† −T

).

The norm of the last term can be rewritten as:∥∥∥∥( 0 0TRH† −T

)∥∥∥∥2 ≤ ∥TRH†∥2 + ∥T∥2. (2.4.22)

Therefore, by using Eqs. 2.4.21, 2.4.22 and the fact that ∥T∥ = σk+1(A), weget

∥AGF − A∥ ≤√∥TRH†∥2 + ∥T∥2 ≤

√∥H†∥2∥R∥2 + 1σk+1(A). (2.4.23)

Also we know that

∥R∥ ≤ ∥V TG∥ = ∥G∥ ≤ a1√n

with probability not less than 1− e−a2n. Combining Eq. 2.4.23 with the factthat ∥H†∥ ≤ 1

c1√land ∥R∥ ≤ a1

√n gives:

∥AGF − A∥ ≤ σk+1(A)

√a21n

c21l+ 1. (2.4.24)

54

Remark 2.4.7. In contrast to Lemma 2.4.5 where ∥AGF − A∥ = O(√nl)

, Lemma 2.4.6 provides the bound ∥AGF − A∥ = O(√

nl) that is tighter for

large values of l.

Remark 2.4.8. The condition l >(1 + 1

ln k

)k in Lemma 2.4.6 is satisfied

without a dramatic increase of the computational complexity of Algorithm2.4.1. However, there are bounds for the case where H is almost square(l ≈ k) and square (l = k) and they are given in [84].

Proof of Theorem 2.4.3. The error is given by the expression ∥LU − PAQ∥where L,U, P and Q are the outputs of Algorithm 2.4.1 whose inputs are thematrix A, integers k and l. From Steps 7 and 8 in Algorithm 2.4.1 we have

∥LU − PAQ∥ = ∥LyLbUb − PAQ∥. (2.4.25)

Here, Ly is the m× k matrix in step 4. By using the fact that BQ = LbUb =L†

yPAQ, we get

∥LU − PAQ∥ = ∥LyLbUb − PAQ∥ = ∥LyL†yPAQ− PAQ∥. (2.4.26)

Applying Lemma 2.4.4 gives that

∥LU − PAQ∥ = ∥LyL†yPAQ− PAQ∥

≤ 2∥PAQGF − PAQ∥+ 2∥F∥∥LyUy − PAQG∥.(2.4.27)

Here, Uy is the k × n matrix in step 4 in Algorithm 2.4.1. This holds forany matrix G. In particular, for a matrix G satisfying QG = GQy, where Gis a random Gaussian i.i.d. matrix. G is in fact G after row and columnspermutations. Therefore, the last term can be reformulated as ∥LyUy −PAQG∥ = ∥LyUy − PAGQy∥ where G is the random matrix in Algorithm2.4.1. By applying Lemmas 2.3.2 and 2.3.3 to ∥LyUy − PAQG∥ we get

∥LyUy − PAQG∥ = ∥LyUy − PAGQy∥

≤ (k(n− k) + 1)σk+1(AG)

≤ (k(n− k) + 1)∥G∥σk+1(A).

(2.4.28)

Lemma 2.4.5 gives that ∥PAQGF − PAQ∥ ≤√2nlβ2γ2 + 1σk+1(A) and

∥F∥ ≤√lβ. By combining Lemmas 2.4.5 and 2.3.4 we get

∥LU − PAQ∥ ≤(2√

2nlβ2γ2 + 1 + 2√2nlβγ (k(n− k) + 1)

)σk+1(A),

(2.4.29)which completes the proof.

55

Remark 2.4.9. The error in Theorem 2.4.3 may appear large, especiallyfor the case where k ≈ n

2and n is large. Yet, we performed extensive nu-

merical experiments showing that the actual error was much smaller whenusing Gaussian elimination with partial pivoting. Note that the error can de-crease by increasing k. This is applicable to certain applications. Numericalillustrations appear in section 2.5.

We now present an additional error bound that relies on [82]. Asymptot-ically, this is a sharper bound for large values of n and l, since it containsthe term

√nl, which is smaller than the term

√nl appear in Theorem 2.4.3.

See also Remark 2.4.7.

Theorem 2.4.10. Given a matrix A of size m × n, integers k and l suchthat l >

(1 + 1

ln k

)k and a2 > 0. By applying Algorithm 2.4.1 with A, k and l

as its input parameters, we get a randomized LU decomposition that satisfies

∥LU − PAQ∥ ≤

(2

√a21n

c21l+ 1 +

2a1√n

c1√l

(k(n− k) + 1)

)σk+1(A), (2.4.30)

with probability not less than 1 − e−a2n − e−c2l. The value of c1 is given inEq. 2.3.11 and the value of c2 is given in Eq. 2.3.13. Both values depend ona2.

Proof. By using steps 5,6,7 and 8 in Algorithm 2.4.1, we get that

∥LU − PAQ∥ = ∥LyL†yPAQ− PAQ∥. (2.4.31)

Then, from Lemma 2.4.4,

∥LyL†yPAQ− PAQ∥ ≤ 2∥PAQGF − PAQ∥+ 2∥F∥∥LyUy − PAQG∥.

(2.4.32)From Lemma 2.4.6 we get that

∥PAQGF − PAQ∥ ≤

√a21n

c21l+ 1σk+1(A). (2.4.33)

Using the same argumentation given in Theorem 2.4.3, we get

∥LyUy − PAQG∥ = ∥LyUy − PAGQy∥ ≤ (k(n− k) + 1) ∥G∥σk+1(A)(2.4.34)

where G is the matrix used in Algorithm 2.4.1 Step 1. Combining Eqs.2.4.32, 2.4.33, 2.4.34 and since ∥F∥ ≤ 1

c1√l, ∥G∥ ≤ a1

√n (see Lemma 2.4.6

56

and Theorem 2.3.5, respectively), we get that

∥LU − PAQ∥ ≤ 2

√a21n

c21l+ 1σk+1(A) +

2a1√n

c1√l

(k(n− k) + 1) σk+1(A).

(2.4.35)

Here, µ =(

4√2π

) 13, a1 is given by Theorem 2.3.5 and c1 is given by Eq.

2.3.11.

2.4.3 Randomized LU for Sparse Matrices

Assume that A is a sparse matrix. We want to compute its approximatedLU factorization by applying Algorithm 2.4.1 to A. If the random matrix G,which is used in step 1 in Algorithm 2.4.1, is sparse, then AG can be done byapplying a sparse matrix multiplication. If this product is also sparse, thenits LU decomposition can be computed by using sparse LU, which is moreefficient. Moreover, the resulted matrices L and U are also sparse. Thisobservation can help to accelerate even more the randomized LU on largesparse matrices. In this section, we derive the randomized LU approxima-tion error bound using the tools introduced in Section 2.3.2. As can be seenfrom Theorem 2.4.3, the error depends on both the largest and the smallestsingular values of the random matrix G which is a sparse random matrix.Therefore, we compute the error bounds for the singular values of G whenG is a sparse random matrix with density ρ. We present the randomizedLU algorithm for sparse matrices in Algorithm 2.4.2 and its error bound inTheorem 2.4.11.

Algorithm 2.4.2: Randomized LU Decomposition for Sparse Matrices

Input: A sparse matrix of size m× n to decompose; k desired rank; lnumber of columns to use; ρ - random matrix density.Output: Matrices P,Q, L and U such that∥PAQ− LU∥ ≤ O(σk+1(A)) where P and Q are orthogonalpermutation matrices and L and U are sparse lower and uppertriangular matrices, respectively.

1: Apply Algorithm 2.4.1 where G is a sparse random Gaussian matrix,with density ρ and its non-zero entries are gij ∼ N (0, 1

ρ).

Algorithm 2.4.2 finds an LU rank k decomposition of A since it is basedon Algorithm 2.4.1. However, the error bound proof (Theorem 2.4.10) for asparse Gaussian G is not applicable here since it relies on the fact that GU

57

is also a Gaussian i.i.d (U is an orthogonal matrix), which is not the casewhen G is a sparse Gaussian matrix. Therefore, we present a modified errorbound (Theorem 2.4.11) for Algorithm 2.4.2.

Theorem 2.4.11. Given a matrix A, integers k and l such that l >(1 + 1

ln k

)k,

k ≈ l2, a2 > 0 and 0 < ρ < 1. By applying Algorithm 2.4.2 with A, k, l, ρ as

its input parameters, the resulted randomized LU decomposition satisfies

∥LU − PAQ∥ ≤

(2

√a21n

c21l+ 1 +

2a1√n

c1√l

(k(n− k) + 1)

)σk+1(A), (2.4.36)

with probability not less than 1−e−a2n−e−c2l. The values of c1 and c2 (whichdepend on ρ) are given in Eqs. 2.3.11 and 2.3.13, respectively.

To prove the error bound for Algorithm 2.4.2, we present Conjecture2.4.12 that bounds the kth singular value of a sparse sub-Gaussian matrixmultiplied by an orthogonal matrix.

Conjecture 2.4.12. Let G be an n× l sparse Gaussian matrix with densityρ ≪ 1. Assume that l and k are integers such that k ≈ l

2. Q is an n × n

orthogonal matrix. Define G1 as a matrix with the k top rows of G and B1

as a matrix with the k top rows of QG. Then, for n and l sufficiently large,σk(G1) ≤ σk(B1) with high probability.

To verify that Conjecture 2.4.12 holds experimentally, we calculated thefailure probability P(σk(G1) > σk(B1)) by computing the kth singular valueof B1 and G1 10, 000 times for different values of n, l, k and ρ. The resultsare presented in Table 2.4.2.

58

Table 2.4.2: Probability P of the failure of Conjecture 2.4.12. The Averagevalue of σk(G1) was computed 10, 000 times for different values of n, l, k andρ.

n l k ρ Average σk(G1) P(σk(G1) > σk(B1))

3000 200 100 0.03 0.611 < 10−4

3000 300 200 0.03 1.324 < 10−4

3000 400 200 0.02 2.602 < 10−4

3000 700 400 0.02 4.165 < 10−4

3000 900 400 0.03 9.215 < 10−4

4000 300 100 0.03 3.663 < 10−4

4000 500 300 0.02 3.356 < 10−4

4000 700 400 0.03 5.718 < 10−4

4000 700 300 0.01 0.923 < 10−4

4000 700 300 0.02 6.578 < 10−4

4000 700 300 0.03 6.738 < 10−4

Conjecture 2.4.12 is used to prove Lemma 2.4.13, which is similar toLemma 2.4.6.

Lemma 2.4.13 (Based on Conjecture 2.4.12). Let A be a real m×n (m ≥ n)matrix. Let G be a real n × l sparse Gaussian matrix with density ρ ≪ 1,and i.i.d. entries with unit variance. Let k and l be integers such that l < m,l < n, l >

(1 + 1

ln k

)k and k ≈ l

2. Let F be a real matrix of size l × n. We

define a1, a2, c1, c2 as in Theorem 2.3.6. Then,

∥AGF − A∥ ≤

√a21n

c21l+ 1σk+1(A), (2.4.37)

and

∥F∥ ≤ 1

c1√l

(2.4.38)

with probability not less than 1− e−c2l − e−a2n.

Proof. The proof is almost identical to the proof of Lemma 2.4.6. Here,in order to prove Eq. 2.4.20 for G as a sparse Gaussian matrix, we useConjecture 2.4.12 and take Q = V T . Therefore, σk(H) ≥ σk(G1) where G1

is the upper k× l block of G. By combining Conjecture 2.4.12 and Theorem2.3.6 we get

∥F∥ = 1

σk(H)≤ 1

σk(G1)≤ 1

c1√l

(2.4.39)

59

where c1 depends on µ =(

4√2πρ

) 13.

Proof of Theorem 2.4.11. The proof is almost identical to the proof of Theo-rem 2.4.10, except that we use Lemma 2.4.13 instead of Lemma 2.4.6 that isapplicable to sparse Gaussian random matrices. The bound stays the sameas in Theorem 2.4.10, except that a1, c1 and c2 depend on µ = 4√

2πρ.

Remark 2.4.14. In practice, step 3 in Algorithm 2.4.1 can be done usingstandard LU decomposition with either partial or full pivoting, when it iscalled by Algorithm 2.4.2. By using full pivoting we can sparsify L evenmore.

2.4.4 Rank Deficient Least Squares

In this section, we present an application that uses the randomized LU andshow how it can be used to solve efficiently the Rank Deficient Least Squares(RDLS) problem.Assume that A is an m×n matrix (m ≥ n) with rank(A) = k, k < n and b isa column vector of size m× 1. We want to minimize ∥Ax− b∥. Because A isa rank deficient matrix, then the problem has an infinite number of solutionssince if x is a minimizer and z ∈ null(A), then x+ z is also a minimizer (i.e.a valid solution). We now show that the complexity of the solution dependson the rank of A and that the problem is equivalent to solving the followingtwo problems: a full rank Least Square (LS) problem of size m × k and asimplified undetermined linear system of equations that requires a matrixinversion of size k × k.The solution is derived by the application of Algorithm 2.4.1 to A to get

∥Ax− b∥ = ∥P TLUQTx− b∥ = ∥LUQTx− Pb∥, (2.4.40)

where L is an m × k matrix, U is a k × n matrix and both L and U are ofrank k. Let y = UQTx and c = Pb. Then, the problem is reformulated as

min ∥Ly − c∥. (2.4.41)

Note that L is a full rank matrix and the problem to be solved becomes astandard full rank LS problem. The solution is given by y = L†c. Next, wesolve

Uz = y, (2.4.42)

where z = QTx. Since U is a k×n matrix, Eq. 2.4.42 is an underdeterminedsystem. Assume that U = [U1 U2] and z = [z1 z2]

T , where U1 is a k × k

60

matrix, z1 is a k× 1 vector and z2 is a (n− k)× 1 vector. Then, the solutionis given by setting any value to z2 and solving

U1z1 = y − U2z2. (2.4.43)

For simplicity, we choose z2 = 0. Therefore, we get that z1 = U−11 y. The fi-

nal solution is given by x = Qz. This procedure is summarized in Algorithm2.4.3 that finds the solution to the deficient least squares problem that usesAlgorithm 2.4.1.

Algorithm 2.4.3: Rank Deficient Least Squares using Randomized LU

Input: matrix A of size m× n with rank k; l integer such that l ≥ k;b vector of size m× 1.Output: solution x that minimizes ∥Ax− b∥.

1: Apply Algorithm 2.4.1 to A with parameters k and l.2: y ← L†Pb.3: z1 ← U−1

1 y.4: z ←

(z1z2

), where z2 is an n− k zero vector.

5: x← Qz.

The complexity of Algorithm 2.4.3 is equal to the randomized LU com-plexity (Algorithm 2.4.1) with an additional inversion cost of the matrix U1

in Step 3, which is of size k × k. Note that the solution given by Algorithm2.4.3 is sparse in the sense that x contains at most k non-zero entries.

2.5 Numerical Results

In order to evaluate Algorithm 2.4.1, we present the numerical results bycomparing between the performances of several randomized low-rank ap-proximation algorithms. We tested the algorithms and compared betweenthem by applying them to random and sparse matrices and to images. Allthe results were computed using the standard MATLAB libraries on a ma-chine with two Intel Xeon CPUs X5560 2.8GHz that contains an nVidia GPUGTX TITAN card.

2.5.1 Error Rate and Computational Time Compar-isons

The performance of the randomized LU (Algorithm 2.4.1) was tested andcompared to a randomized SVD and to a randomized ID (see [6, 7]). The

61

tests compare between the normalized (relative) error of the low-rank approx-imation obtained by the examined methods. In addition, the computationaltime of each method was measured. If A is the original matrix and A is alow-rank approximation of A, then the relative approximation error is givenby:

err =∥A− A∥∥A∥

. (2.5.1)

We compared between the low-rank approximation achieved by the appli-cation of the randomized SVD, randomized ID and randomized LU withdifferent ranks k. Throughout the experiments, we chose l = k + 3 andthe test matrix was a random matrix of size 3000× 3000 with exponentiallydecaying singular values. The computations of the algorithms were donein a single precision. The results are presented in Fig. 2.5.1. The experi-ment shows that the error of the randomized ID is significantly larger thanthe error obtained from both the randomized SVD and the randomized LU(Algorithm 2.4.1), which are almost identical. In addition, we comparedbetween the execution time of these algorithms. The results are presentedin Fig. 2.5.2. The results show that the execution time of the randomizedLU (Algorithm 2.4.1) is significantly lower than the execution time of therandomized SVD and the randomized ID algorithms. The LU factorizationhas a parallel implementation (see [5] section 3.6). To see the impact of aparallel LU factorization implementation, the execution time for computinga randomized LU of a matrix of size 3000×3000 was measured on an nVidiaGTX TITAN GPU device and it is shown in Fig. 2.5.3. The execution timeon the GPU was up to ×10 faster than running it on an eight cores CPU.Thus, the algorithm scales well. For larger matrices (n and k are large) thedifferences between the performances running on CPU and on GPU are moresignificant.

62

0 100 200 300 400 500 60010

−4

10−3

10−2

10−1

100

101

Rank

Rel

ativ

e A

ppro

xim

atio

n E

rror

Randomized SVDRandomized IDRandomized LU

Figure 2.5.1: Comparison between the low-rank approximation error of dif-ferent algorithms: Randomized SVD, Randomized ID and Randomized LU.Randomized LU achieves the lowest error.

0 100 200 300 400 500 60010

−2

10−1

100

Rank

Exe

cutio

n T

ime

[sec

]

Randomized SVDRandomized IDRandomized LU

Figure 2.5.2: Comparison between the execution times of the same algorithmsas in Fig. 2.5.1 running on a CPU. Randomized LU achieved the lowestexecution time.

63

0 500 1000 1500 2000 250010

−2

10−1

100

101

Rank

Exe

cutio

n T

ime

[sec

]

Randomized LU (8 cores)Randomized LU (GPU)

Figure 2.5.3: Comparison between the execution times from running Algo-rithm 2.4.1 on different computational platforms: CPU with 8 cores andGPU. Randomized LU achieved the lowest execution time.

2.5.2 Image Matrix Factorization

Algorithm 2.4.1 was applied to images given in a matrix format. The factor-ization error and the execution time were compared with the performancesof the randomized SVD and the randomized ID. We also added the SVDerror computation and execution time as benchmark computed using Lanc-zos bidiagonalization [5] implemented in the PROPACK package [88]. Theimage size was 2124×7225 pixels and it has 256 gray levels. The parameterswere k = 200 and l = 203. The approximation quality (error) was measuredin PSNR defined by:

PSNR = 20 log10maxA

√N

∥A− A∥F(2.5.2)

where A is the original image, A is the approximated image (the result ofAlgorithm 2.4.1), maxA is the maximal pixel value of A and N the totalnumbers of pixels. ∥ · ∥F is the Frobenius norm.

64

1000 2000 3000 4000 5000 6000 7000

500

1000

1500

2000

Figure 2.5.4: The original input image of size 2124×7225 that was factorizedby the randomized LU, randomized ID and randomized SVD algorithms.

1000 2000 3000 4000 5000 6000 7000

500

1000

1500

2000

Figure 2.5.5: The reconstructed image from the randomized LU factorizationwith k = 200 and l = 203.

Figures 2.5.4 and 2.5.5 show the original and the reconstructed images,respectively. The image approximation quality (measured in PSNR) relatedto rank k is shown in Fig. 2.5.6 where for the same k the PSNR of Algorithm2.4.1 is higher than the randomized ID and almost identical to the random-ized SVD. Also, the PSNR is close to the result achieved by the applicationof the Lanczos SVD which is the best possible rank k approximation. Theexecution time of each algorithm is shown in Fig. 2.5.7. All the computa-tions were done in double precision. Here, the randomized LU is significantlyfaster than all the other compared methods making it applicable for real timeapplications.

65

0 50 100 150 200 250 300 350 400 450 5005

10

15

20

25

30

35

40

45

Rank

PS

NR

[dB

]

Lanczos SVDRandomized SVDRandomized IDRandomized LU

Figure 2.5.6: Comparison between the PSNR values from image reconstruc-tion application using randomized LU, randomized ID, randomized SVD andLanczos SVD algorithms.

0 50 100 150 200 250 300 350 400 450 50010

−2

10−1

100

101

102

Rank

Exe

cutio

n T

ime

[sec

]

Lanczos SVDRandomized SVDRandomized IDRandomized LU

Figure 2.5.7: Comparison between the execution time of the randomized LU,randomized ID, randomized SVD and Lanczos SVD algorithms.

66

2.6 Sparse Matrix Factorization

In order to test Algorithm 2.4.1 on sparse matrices, we used a binary sparsematrix of size 862, 664 × 862, 664 with 19, 235, 140 non-zero elements (ρ =2.58 × 10−5) that contains the results of crawling the .eu domain. The eu-2005 matrix was generated and studied in [89]. Each edge in the eu-2005graph represents a link between two websites. The approximation error ofeach algorithm applied to the eu-2005 matrix is shown in Fig. 2.6.1, and theexecution time is shown in Fig. 2.6.2. To test the factorization for large kvalues, we applied Algorithms 2.4.1 and 2.4.2 to a 100, 000×100, 000 randomgenerated sparse matrix A with density ρ = 10−6. Applying Algorithm 2.4.1to this matrix using k = 10, 000 requires a large memory size and CPUresources. This resulted in a very long execution time. A similar experimentwas conducted by applying Algorithm 2.4.2 to the same matrix A with kranging from 13, 000 to 25, 000. The random matrix G, selected in Step 1of Algorithm 2.4.2, was a random generated sparse matrix with two possibledensities ρ = 10−4 and ρ = 3 × 10−4. The approximation error for eachdensity-based execution is shown in Fig. 2.6.3 and the execution time isshown in Fig. 2.6.4. The graphs represent an averaging from successiveapplication of the algorithm.

0 50 100 1500.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Rank

Rel

ativ

e A

ppro

xim

atio

n E

rror

Lanczos SVDRandomized SVDRandomized LU

Figure 2.6.1: Comparison between the approximation error of the randomizedLU, randomized ID and randomized SVD algorithms, executed on the sparsematrix eu-2005.

67

0 50 100 15010

−1

100

101

102

103

Rank

Exe

cutio

n T

ime

[sec

]

Lanczos SVDRandomized SVDRandomized LU

Figure 2.6.2: Comparison between the execution time of the randomizedLU, randomized ID and randomized SVD algorithms, executed on the sparsematrix eu-2005.

1 1.5 2 2.5

x 104

0.4

0.5

0.6

0.7

0.8

0.9

1

Rank

Rel

ativ

e A

ppro

xim

atio

n E

rror

Sparsity 10−4

Sparsity 3 × 10−4

Figure 2.6.3: Approximation error from the application of Algorithm 2.4.2to the matrix A with two different densities.

68

1 1.5 2 2.5

x 104

1

2

3

4

5

6

7

Rank

Exe

cutio

n T

ime

[sec

]

Sparsity 10−4

Sparsity 3 × 10−4

Figure 2.6.4: Execution time from the application of Algorithm 2.4.2 to thematrix A with two different densities.

The approximation error in Fig. 2.6.3 depends on the sparsity ρ of therandom matrix G. The error increases as G became sparser. Figure 2.6.4shows the dependency between execution time and sparsity. A sparser matrixreduces the algorithm execution time. This trade-off enables the user tochoose between speed and accuracy by selecting a proper ρ value.

2.7 Conclusion

In this work, we presented a randomized algorithm for computing an LUrank k decomposition. Given an integer k, the algorithm finds an LU de-composition such that both L and U are of rank k with negligible failureprobability. We constructed error bounds for the approximation of the inputmatrix and proved that they are proportional to the k + 1 singular value.We also compared the performance of the algorithm with regard to the errorrate and to the computational time. We compared the results to random-ized SVD, randomized ID and to the application of Lanczos SVD runningon sparse matrices. We also showed that our algorithm can be parallelizedsince it consists mostly of matrix multiplication and pivoted LU. The re-sults on GPU show that it was possible to accelerate the computational timesignificantly even by using only the standard MATLAB libraries.

Chapter 3

File Content Recognition usingFast LU Dictionary

In recent years, distinctive-dictionary construction has become an active re-search area due to his usefulness in data processing. Usually, one or moredictionaries are constructed from a training data and then they are usedto classify signals that did not participate in the training process. A newdictionary construction algorithm is introduced. It is based on a low-rankmatrix factorization being achieved by the application of the randomizedLU decomposition to a training data. This method is fast, scalable, paral-lelizable, consumes low memory, outperforms SVD in these categories andworks also extremely well on large sparse matrices. In contrast to existingmethods, the randomized LU decomposition constructs an under-completedictionary, which simplifies both the construction and the classification pro-cesses of newly arrived signals. The dictionary construction is generic andgeneral and it fits different applications. We demonstrate the capabilitiesof this algorithm on file type identification, which is a fundamental task indigital security arena, performed nowadays for example by sandboxing mech-anism, deep packet inspection, firewalls and anti-virus systems. We proposea content-based method that detects file types that neither depends on fileextension nor on metadata. Such approach is harder to deceive and we showthat only a few file fragments from a whole file are needed for a success-ful classification. Based on the constructed dictionaries, we show that theproposed method can effectively identify execution code fragments in PDFfiles.

69

70

3.1 Introduction

Recent years have shown a growing interest in dictionary learning. Dictio-naries were found to be useful for applications such as signal reconstruction,denoising, image impainting, compression, sparse representation, classifica-tion and more. Given a data matrix A, a dictionary learning algorithmproduces two matrices D and X such that ||A − DX|| is small where D iscalled dictionary and X is a coefficients matrix also called representationmatrix. Sparsity of X, namely each signal from A is described with only afew signals (also called atoms) from the dictionary D, is a major property be-ing pursued by many dictionary learning algorithms. The algorithms, whichlearn dictionaries for sparse representations, optimize a goal function, whichconsiders both the accuracy and the sparsity of the solution, by optimizingalternately these two properties minD,X ||A−DX||+λ||X||0. This construc-tion is computationally expensive and does not scale well to big data. Itbecomes even worse when dictionary learning is used for classification sinceanother distinctive term in addition to the two aforementioned is being intro-duced in the objective function. This term provides the learned dictionarya discriminative ability. This can be seen for example in the optimizationproblem minD,X,W ||A−DX||+λ||X||0+ξ||H−WX|| where W is a classifierand H is a vector of labels. ||H −WX|| is a penalty term for wrong classifi-cation. In order to achieve the described properties, dictionaries are usuallyover-complete, namely, they contain more atoms than the signal dimension.As a consequence, dictionaries are redundant such that there are linear de-pendencies between atoms. Therefore, a given signal can be represented inmore than one way using dictionary atoms. This enables us on one hand toget sparse representations, but on the other hand it complicates the repre-sentation process because, in the unrestricted case, it is NP-hard to find thesparsest representation for a signal by an over-complete dictionary [90].

In this work, we provide a generic way to construct an under-completedictionary. Its capabilities will be demonstrated for signal classification task.Since we do not look for sparse signal representation, we get rid of the alter-nating optimization process exist in the construction of over-complete dictio-naries. Our dictionary construction is based on matrix factorization. We usethe randomized LU matrix factorization algorithm [24] for a dictionary con-struction. Instead of facing an NP-hard problem approximation algorithmsfor sparse signal reconstruction (like Orthogonal Matching Pursuit [91] orBasis Pursuit [92]), we use a fast projection method that represents the sig-nal by LU construction. The randomized LU algorithm, which is applied toa given data matrix A ∈ Rm×n, decomposes A into two matrices L and U ,where L is the dictionary and U is the coefficient matrix. The size of the

71

dictionary, which is the numerical rank of the underlying training matrix ofm measurements and n features, is determined by the decaying spectrum ofthe singular values of this matrix. It is bounded by min{n,m} where A is ofsize m × n. In our application, the dictionary size is small and determinedby the numerical rank of the matrix A. Both L and U are linearly indepen-dent. The proposed dictionary construction has couple of advantages: it isfast, scalable, parallelizable and thus can run on GPU and multicore-basedsystems, consumes low memory, outperforms SVD in these categories andworks extremely well on large sparse matrices.

In order to evaluate the performance of the dictionaries, which are con-structed by the application of the randomized LU algorithm, we use themto classify files types. The experiments were conducted on a dataset thatcontains files of various types. The goal is to classify each file or portion of afile to the class describing its type. To the best of our knowledge, our workis the first to use dictionary learning method for file type classification. Ourwork considers three different scenarios that represent real security tasks:examining the full content of tested files, classifying a file type using a smallnumber of fragments from the file and detecting malicious code hidden in-side innocent looking files. While the first two scenarios were examined byother works, none of the described papers dealt with the third scenario. It isdifficult to compare our results to other algorithms, since the datasets usedare not publicly available, but for similar testing conditions, we improve thestate-of-the-art results. Our datasets will be made publicly available.

3.2 Related Work

Dictionary-based classification models have been the focus of much recentresearch leading to results in face recognition [93, 94, 95, 96, 97, 98], digitrecognition [96], object categorization [97, 98] and more. Many of theseworks [98, 93, 97] utilize the K-SVD [11] for their training, or in otherwords for their dictionary learning step. Others define different objectivefunctions such as the Fisher Discriminative Dictionary Learning [96]. Ma-jority of these methods use an alternating optimization process in order toconstruct their dictionary. This optimization procedure seeks a dictionarywhich is re-constructive, enables sparse representation and sometimes alsodiscriminative. In some works (see for example [93, 98]) the dictionary learn-ing algorithm requires meta parameters to regulate these properties of thelearned dictionary. Finding the optimal values for these parameters is a chal-lenging task that adds complexity to the proposed solutions. A dictionaryconstruction, which uses a multivariate optimization process, is computation-

72

ally expensive task (as described in [97], for example). Our approach suggeststo avoid these complexities by using the randomized LU Algorithm [24]. Thedictionary it creates is under-complete where the number of atoms is smallerthan the signal dimension. The outcome is that the dictionary constructionis fast and without compromising its abilities to achieve high classificationaccuracy. We improve upon state-of-the-art results in file type classifica-tion [99] and it is demonstrated by our running example.

The testing phase in many dictionary learning schemes is simple. Usually,linear classifier is used to assign test signals to one of the learned classes [93,98]. Classifier learning combined with dictionary learning add additionaloverhead to the process [93, 98]. Our method does not require to allocatespecial attention to a classifier learning. We utilize the output from therandomized LU algorithm to create a projection matrix. This matrix is usedto measure the distance between a test signal and the dictionary. The signalis then classified as belonging to the class that approximates it best. Theclassification process is fast and simple. The results described in Section 3.5show high accuracy in the content-based file type classification task.

We used this classification task to test the randomized LU dictionaryconstruction and to measure its discriminative power. This task is useful incomputer security applications like anti-virus systems and firewalls that needto detect files transmitted through network and response quickly to threats.Previous works in this field use mainly deep packet inspection (DPI) andbyte distribution frequency features (1-gram statistics) in order to describe afile [100, 101, 99, 102, 103, 104, 105, 106, 107]. In some works, other featureswere tested like consecutive byte differences [99, 100] and statistical proper-ties of the content [100]. The randomized LU decomposition [24] constructionis capable of dealing with a large number of features. This enables us to testour method on high dimensional feature sets like double-byte frequency dis-tributions (2-grams statistics) where each measurement has 65536 Markovwalk-based features. We refer the reader to [99] and references within foran exhaustive comparison of the existing methods for content-based file typeclassification.

Throughout this work, when A is a matrix, the norm ∥A∥ indicates thespectral norm (the largest singular value of A) and when A is a vector itindicates the standard l2 norm (Euclidean norm).

73

3.3 Randomized LU based Classification Al-

gorithm

In Chapter 2, we presented the randomized LU decomposition algorithm forcomputing the rank k LU approximation of a full matrix (Algorithm 2.4.1).The main building blocks of the algorithm are random projections and RankRevealing LU (RRLU) [69] to obtain a stable low-rank approximation for aninput matrix A.

The RRLU algorithm, used in the randomized LU algorithm, reveals theconnection between LU decomposition of a matrix and its singular values.This property is very important since it connects between the size of thedecomposition to the actual numerical rank of the data. Similar algorithmsexist for rank revealing QR decompositions (see, for example [68]). The run-ning time complexity and the error bound of Algorithm 2.4.1 are presentedin is Section 2.4 of Chapter 2.

In this section we utilize 2.4.1 for constructing a dictionary from a train-ing set and using it for classification of new samples. The training phaseincludes dictionary construction for each learned class from a given dataset.The classification phase assigns a newly arrived signal to one of the classesbased on its similarity to the learned dictionaries. Let X ∈ Rm×n be thematrix whose n columns are the training signals (samples). Each column isdefined by m features. Based on Section 2.4, we apply the randomized LUdecomposition (Algorithm 2.4.1) to X, yielding PXQ ≈ LU . The outputsP and Q are orthogonal permutation matrices. Theorem 3.3.1 shows thatP TL forms (up to a certain accuracy) a basis to A. This is the key propertyof the classification algorithm.

Theorem 3.3.1. Given a matrix A. Its randomized LU decomposition isPAQ ≈ LU . Then, the error of representing A by P TL satisfies:

∥(P TL)(P TL)†A−A∥ ≤(2√

2nlβ2γ2 + 1 + 2√2nlβγ (k(n− k) + 1)

)σk+1(A)

(3.3.1)with the same probability as in Theorem 2.4.3.

Proof. By combining Theorem 2.4.3 with the fact thatBQ = LbUb = L†yPAQ

we get

∥LU − PAQ∥ = ∥LyLbUb − PAQ∥ = ∥LyL†yPAQ− PAQ∥. (3.3.2)

Then, by using the fact that Lb is square and invertible we get

∥LyL†yPAQ− PAQ∥ = ∥LyLbL

−1b L†

yPAQ− PAQ∥ = ∥LL†PAQ− PAQ∥.(3.3.3)

74

By using the fact that the spectral norm is invariant to orthogonal projec-tions, we get

∥LL†PAQ− PAQ∥ = ∥LL†PA− PA∥ = ∥P TLL†PA− A∥ =

= ∥(P TL)(P TL)†A−A∥ ≤(2√2nlβ2γ2 + 1 + 2

√2nlβγ (k(n− k) + 1)

)σk+1(A),

(3.3.4)

with the same probability as in Theorm 2.4.3.

Assume that our dataset is composed of the sets X1, X2, . . . , Xl. Wedenote by Di = P T

i Li the dictionary learned from the class Xi by Algo-rithm 2.4.1. UiQ

Ti is the corresponding coefficient matrix. It is used to

reconstruct signals from Xi as a linear combination of atoms from Di. Thetraining phase of the algorithm is done by the application of Algorithm 2.4.1to different training datasets that correspond to different classes. For eachclass, a different dictionary is learned. The size of Di, namely its numberof atoms, is determined by the parameter ki that is related to the decayingspectrum of the singular values of each training dataset. The dictionariesdo not have to be of equal sizes. A discussion about the dictionary sizesappears later in this section and in Section 3.4. The third parameter, whichAlgorithm 2.4.1 needs, is the number of projections l on the random matrixcolumns. l is related to the error bound in Theorem 2.4.3 and it is used toensure high success probability for Algorithm 2.4.1. Taking l to be a littlebigger than k is sufficient. The training process of our algorithm is describedin Algorithm 3.3.1.

Algorithm 3.3.1: Dictionaries Training using Randomized LU

Data: X = {X1, X2, . . . , Xr} - training datasets for r sets;K = {k1, k2, . . . , kr} - Dictionary size of each set;

Output: D = {D1, D2, . . . , Dr} - set of dictionaries;for t ∈ {1, 2, . . . , r} do

Lt, Ut, Pt, Qt = Randomized LU Decomposition(Xt, kt, kt + 5),

(l∆= kt + 5);(Algorithm 2.4.1)

Dt = P Tt Lt;

endD = {D1, D2, . . . , Dr};Output D;

For the test phase of the algorithm, we need a similarity measure thatprovides a distance between a given signal and a dictionary.

Definition 3.3.1. Let x be a signal and D be a dictionary. The distancebetween x and the dictionary D is defined by

75

dist(x,D) , ||DD†x− x||,

where D† is the pseudo-inverse of the matrix D.

The geometric meaning of dist(x,Di) is related to the projection of xonto the column space of Di, where Di is the dictionary learned for class i ofthe problem. dist(x,Di) measures the distance between x and dictionary Di.If x ∈ column-span{Di} then Theorem 3.3.1 guarantees that dist(x,Di) < ϵ.For x /∈ column-span{Di} then dist(x,Di) is large. Thus, dist is used forclassification as described in Algorithm 3.3.2.

Algorithm 3.3.2: Dictionary based Classification

Data: x - input test signal;D = {D1, D2, . . . , Dr} - set of dictionaries;Output: i - the classified class label for x;for t ∈ {1, 2, . . . , r} do

ERRt = dist(x,Dt) ;endtX = argmint {ERRt} ;Output tX ;

The core of Algorithm 3.3.2 is the dist function from Definition 3.3.1.This is done by examining portion of the signal that is spanned by the dic-tionary atoms. If the signal can be expressed with high accuracy as a linearcombination of the dictionary atoms then their dist will be small. The bestaccuracy is achieved when the examined signal belongs to the span of Di. Inthis case, their dist is very small and bounded by Theorem 2.4.3. On theother hand, if the dictionary atoms cannot express well a signal then theirdist will be large. The largest distance is achieved when a signal is orthog-onal to the dictionary atoms. In this case, dist will be equal to the norm ofthe signal. Signal classification is accomplished by finding a dictionary witha minimal distance to it. This is where the dictionary size comes into play.The more atoms a dictionary has, the larger is the space of signals that havelow dist to it and vice versa. By adding or removing atoms from a dictionary,the distances between this dictionary and the test signals are changed. Thisaffects the classification results of Algorithm 3.3.2. The practical meaning ofthis observation is that dictionary sizes need to be chosen carefully. Ideally,we wish that each dictionary will be of dist zero to test signals of its type,and of large dist values for signals of other types. However, in reality, sometest signals are represented more accurately by a dictionary of the wrongtype than by a dictionary of their class type. For example, we encounteredseveral cases where GIF files were represented more accurately by the PDF

76

dictionary than by the GIF dictionary. By an incorrect selection of k, the sizeof the dictionary will result in either a dictionary that cannot represent wellsignals of its own class (causes miss-detections), or in a dictionary that repre-sents too accurately signals from other classes (causes false alarms). The firstproblem occurs when the dictionary is too small whereas the second occurswhen the dictionary is too large. In Section 3.4, we discuss the problem offinding the optimal dictionaries sizes and how they relate it to the spectrumof the training data matrices.

3.4 Determining the Dictionaries Sizes

One possible way to find the dictionaries sizes is to observe the spectrumdecay of the training data matrix. In this approach, the number of atoms ineach dictionary is selected as the number of singular values that capture mostof the energy of the training matrix. This method is based on estimating thenumerical rank of the matrix, namely on the dimension of its column space.Such a dictionary approximates well the column space of the data but isless likely to approximate well signals from other classes. Nevertheless, itis possible in this construction that dictionary of a certain class will havehigh rate of false alarms. In other words, this dictionary might approximatesignals from other classes with a low error rate.

Two different actions can be taken to prevent this situation. The firstoption is to reduce the size of this dictionary so that it approximates mainlysignals of its class and not from other classes. This should be done care-fully so that this dictionary still identifies well signals of its class better thanother dictionaries. The second option is to increase the sizes of other dic-tionaries in order to overcome their miss-detections. This also should bedone with caution since we might represent well signals from other classesusing these enlarged dictionaries. Therefore, relying only on the spectrumanalysis of the training data is insufficient, because this method finds thesize of each dictionary independently from the other dictionaries. It ignoresthe interrelations between dictionaries, while the classification algorithm isbased on those relations. Finding the optimal k values can be described bythe following optimization problem:

arg mink1,k2,...,kr

∑1≤i≤r1≤j≤ri=j

Ci,j,X(ki, kj), (3.4.1)

where Ci,j,X(ki, kj) is the number of signals from class i in the dataset Xclassified as belonging to class j for the respective dictionary sizes ki and

77

kj. The term, which we wish to minimize in Eq. 3.4.1, is therefore thetotal number of wrong classifications in our dataset X when using a set ofdictionaries D1, D2, . . . , Dr with sizes k1, k2, . . . , kr, respectively.

We propose an algorithm for finding the dictionary sizes by examiningeach specific pair of dictionaries separately, and thus identifying the opti-mized dictionary sizes for this pair. Then, the global k values for all dic-tionaries will be determined by finding an agreement between all the localresults. This process is described in Algorithm 3.4.1.

Algorithm 3.4.1: Dictionary Sizes Detection

Data: X = {X1, X2, . . . , Xr} - training datasets for the r classes;Krange - a set of possible values of k to search in;Output: K = {k1, k2, . . . , kr} - dictionaries sizes;for i, j ∈ {1, 2, . . . , r}, i < j do

for ki, kj ∈ Krange doERRORi,j(ki, kj) = Ci,j,X(ki, kj) + Cj,i,X(kj, ki) ;

end

endK = find optimal agreement({ERRORi,j}1≤i,j≤r, i =j) ;Output K ;

Algorithm 3.4.1 examines each pair of classes i and j for different k valuesand produces the matrix ERRORi,j, such that the element ERRORi,j(s, t) isthe number of classification errors for those two classes, when the dictionarysize of class i is s and the dictionary size of class j is t. This number is the sumof signals from each class that were classified as belonging to the other class.The matrix ERRORi,j reveals the ranges of k values for which the numberof classification errors is minimal. These are the ranges that fit when dealingwith a problem that contains only the two classes of signals. However, manyclassification problems need to deal with a large number of classes. For thiscase, we create the ERROR matrix for each possible pairs, find the k rangesfor each pair and then find the optimal agreement between all pairs. Thefunction find optimal agreement describes this step in Algorithm 3.4.1.Finding this agreement can be done by making a list of constraints for eachpair and then finding k values that satisfy all the constraint and bring theminimal solution to the problem described in Eq. 3.4.1. The constraints canbound from below or above the size of a specific dictionary, or the relationbetween sizes of two dictionaries (for example, the dictionary of the first classshould have 10 more elements than the dictionary of the second class). Theusage of Algorithm 3.4.1 is demonstrated in Section 3.5.2.

78


In order to evaluate the performance of the dictionary construction and classi-fication algorithms from Section 3.3, Algorithm 3.3.2 was applied to a datasetthat contains six different file types. The goal is to classify each file or por-tion of a file to the class describing its type. The dataset consists of 1200files that were collected in the wild using automated Web crawlers. The fileswere equally divided into six types: PDF, EXE, JPG, GIF, HTM and DOC.We selected these types as they are among the most popular file formats ex-changed in computer networks. 100 files of each type were chosen randomlyas the training dataset and the other 100 files served as a testing dataset. Inorder to get results that reflect the true nature of the problem, no restrictionswere imposed on the file collection process. Thus, some files contain only afew kilobytes while others are of several megabytes in size. In addition, someof the PDF files contain pictures, which make it hard for a content-basedalgorithm to classify the correct file type. Similarly, DOC files may containpictures and executables may contain text and pictures. Clearly, these phe-nomena have negative effect on the accuracy of the results in this section.However, we chose to leave the dataset in its original form.

Throughout this work, we came across several similar works [100, 101,99, 102, 103, 104, 105, 106, 107] that classify unknown file types based ontheir content. Although it can be very useful, none of these works made theirdatasets publicly available for analysis and comparison with other methods.We decided to publicize the dataset of files that we collected to enable futurecomparisons. The details about downloading and using the dataset can beobtained by contacting one of the authors.

Three different scenarios were tested with the common goal of classifyingfiles or portions of files to their class type, namely, assigning them to oneof the six file types described above. In each scenario, six dictionaries werelearned that correspond to the six file types. Then, the classification algo-rithm (Algorithm 3.3.1) was applied to classify the type of a test fragmentor a file. The learning phase, which is common to all scenarios, was doneby applying Algorithm 3.4.1 to find the dictionary sizes and Algorithm 3.3.1to construct the dictionaries. The testing phase varies according to the spe-cific goal of each scenario. Sections 3.5.1, 3.5.2 and 3.5.3 provide a detaileddescription for each scenario and the classification results.

3.5.1 First Scenario: Entire File is Analyzed

In this scenario, we are given a whole file and the extracted features fromits entire content. The features are byte frequency distribution (BFD) that

79

contains 256 features from consecutive differences distribution (CDD). Totalof 512 features are measured for each training and testing files. CDD isused in addition to BFD because the latter fails to capture any informationabout bytes ordering in the file. CDD turned out to be very discriminativeand improved the classification results. The features extracted from each filewere normalized by its size since files of various sizes exist in the dataset.Example for BFD construction is described in Fig. 3.5.1 and example forCDD construction is given in Fig. 3.5.2.

AABCCCDR =⇒

Byte Probability (BFD)A 0.25B 0.125C 0.375D 0.125R 0.125... 0

Figure 3.5.1: Byte Frequency Distribution (BFD) features extracted fromthe file fragment “AABCCCDR”.

AABCCCDFG =⇒

Difference Probability (CDD)0 0.3751 0.52 0.125... 0

Figure 3.5.2: Consecutive Differences Distribution (CDD) features extractedfrom the file fragment “AABCCCDFG”. There are three consecutive-pairsof bytes with difference 0, four with difference 1 and one with difference 2.These distributions are normalized to produce the shown probabilities. Thenormalized factor is the length of the sting minus one. In this example, thenormalized factor is 8.

This scenario can be useful when the entire tested file is available forinspection. The training was done by the application of Algorithms 3.4.1and 3.3.1 to the training data. The dictionary sizes were 60 atoms per dictio-nary. The set of dictionariesD = {DPDF , DDOC , DEXE, DGIF , DJPG, DHTM}is the output from these algorithms. Each test file was analyzed using Algo-rithm 3.5.1 and classified to one of the six classes. The classification resultsare presented as a confusion matrix in Table 3.5.1. Each column correspondsto an actual file type and the rows correspond to the classified file type byAlgorithm 3.5.1. A perfect classification produces a table with score 100 onthe diagonal and zero elsewhere. Our results are similar to those achieved

80

in [99] (Table II) that use different methods. However, we did not have thedataset that [99] used and there is no way to perform a fair comparison.

Algorithm 3.5.1: File Content Dictionary Classification

Data: x - input file; D = {DPDF , DDOC , DEXE, DGIF , DJPG, DHTM} -set of dictionaries;

Output: tx - file type predicted for x;for t ∈ {PDF,DOC,EXE,GIF, JPG,HTM} do

ERRt = dist(x,Dt);endtx = argmint {ERRt};Output tx;

Correct File TypePDF DOC EXE GIF JPG HTM

Classified File Type

PDF 98 0 1 1 0 0DOC 0 97 1 0 0 0EXE 0 3 98 2 1 0GIF 0 0 0 97 1 0JPG 2 0 0 0 98 0HTM 0 0 0 0 0 100

Table 3.5.1: Confusion matrix for the first scenario. 100 files of each typewere classified by Algorithm 3.5.1.

3.5.2 Second Scenario: Fragments of a File

The second scenario describes a situation in which the entire file is unavail-able for the analysis but only some fragments that were taken from randomlocations are available. The goal is to classify the file type based on thispartial information. This serves a real application such as a firewall thatexamines packets transmitted through a network or a file being downloadedfrom a network server. This scenario contains three experiments where dif-ferent features were used in each. The training phase, which is common toall three experiments, includes extracted features from a 10 kilobytes frag-ments that belong to the training data. These features serve as an input toAlgorithm 3.3.1, which produces the dictionaries for the classification phase.The second parameter in Algorithm 3.3.1 is the set of dictionary sizes, whichwere determined by Algorithm 3.4.1. We use the first set of features in thisscenario (described hereafter) to demonstrate more deeply how this algo-rithm works. The sizes of six dictionaries need to be determined based on

81

agreement between the pairwise error matrices. Figure 3.5.3 shows the ma-trices ERRORPDF,JPG and ERRORPDF,EXE. It can be seen in Fig. 3.5.3(a)that the types PDF and JPG are very close to each other because for manydictionary size k have overlapped signals. Only few values in one cell abovethe diagonal provide good results for this pair. We observe that the JPGdictionary should have 10 atoms more than the PDF dictionary. We alsolearn that both dictionaries sizes should be greater than 50 atoms. The PDFand EXE error values in Fig. 3.5.3(b) indicate that these dictionaries are wellseparated. There is a large set of dictionary sizes near the diagonal for whichthe classification error is low.

The following intuition helps understanding why a large range of lowerrors will achieve better classification results. The error matrices are buildbased on training data that represents the classification error CPDF,JPG,X +CJPG,PFG,X of the algorithm when it applies to this data (See Eq. 3.4.1 whenusing 2 sets). The best k values from the ERROR matrix fit the trainingdata. However, a PDF test signal might be spanned by less atoms fromthe PDF dictionary than we would expect, or by more atoms from the JPGdictionary than we would expect (or both). This means that for this signalthe PDF dictionary size is smaller than the real size of this dictionary andthe JPG dictionary is larger than its real size. Moving in a perpendiculardirection to the diagonal in Fig. 3.5.3, can either increase or decrease theclassification error. This shift will increase the classification error because allthe dictionaries off the diagonal in Fig. 3.5.3(a) have high errors numbers.On the other hand, there is a low probability to get a classification error inFig. 3.5.3(b), because there are many off-diagonal options for dictionary sizesthat will generate a low error. The pair JPG-PDF is more sensitive to noisethan the pair EXE-PDF. This observation is supported by the confusionmatrix of the first experiment, as shown in Table 3.5.2.

In the first experiment, the dictionary sizes, which were determined byAlgorithm 3.4.1, are 150 atoms for PDF, DOC, EXE, GIF, HTM and 160atoms for JPG. 10 random fragments of 1500 bytes each were sampled fromeach examined file. BFD and CDD based features were extracted from eachfragment and then normalized by the fragment size (similarly to the nor-malization by file size conducted in the first scenario). Then, the distancebetween each fragment and each of the six dictionaries was calculated. Foreach dictionary, the mean value of the distances was computed. Eventually,the examined file was classified to the class that has the minimal mean value.This procedure is described in Algorithm 3.5.2. The classification results are

82

kjpg

k pdf

50 100 150 200 250 300 350

50

100

150

200

250

300

350 0

1

2

3

4

5

6

7

8

9

10

(a) Error matrix for the pair PDF-JPG

kexe

k pdf

50 100 150 200 250 300 350

50

100

150

200

250

300

350 0

1

2

3

4

5

6

7

8

9

10

(b) Error matrix for the pair PDF-EXE

Figure 3.5.3: Error matrices produced by Algorithm 3.4.1. The matrix ispresented in cold to hot colormap to show ranges of low (blue) and high(red) errors.

83

presented in Table 3.5.2.

Algorithm 3.5.2: File Fragment Classification using Dictionary Learn-ing

Data: X = {x1, x2, · · · , xr} - input fragments;D = {DPDF , DDOC , DEXE, DGIF , DJPG, DHTM} - set of dictionaries;Output: tX - file type predicted to X;for i = 1, . . . , r do

for t ∈ {PDF,DOC,EXE,GIF, JPG,HTM} doERRi,t = dist(Xi, Dt) ;

end

endfor t ∈ {PDF,DOC,EXE,GIF, JPG,HTM} do

MEANt = mean{ERRi,t}ri=1

endtX = argmint {MEANt} ;Output tX ;




Table 3.5.2: Confusion matrix for the second scenario where BFD+CDDbased features were chosen. 100 files of each type were classified by Algo-rithm 3.5.2.

The second experiment used a double-byte frequency distribution (DBFD),which contains 65536 features. Figure 3.5.4 demonstrates the DBFD featureextraction from a small file fragment.

84

AABCCC =⇒

Double-Byte Probability (DBFD)AA 0.2AB 0.2BC 0.2CC 0.4... 0

Figure 3.5.4: Features extracted from the file fragment “AABCCC” usingDouble Byte Frequency Distribution (DBFD). The normalization factor isone less then the legth of the string.

Similarly to the first experiment, 10 fragments were sampled from randomlocations at each examined file. However, this time we used 2000 bytes perfragment since smaller fragment sizes do not capture sufficient informationwhen DBFD features are used. The feature vectors were normalized bythe fragment’s size as before. Algorithm 3.5.2 was applied to classify thetype of each examined file. The dictionaries sizes in this experiment are 80atoms for PDF, DOC and JPG and 60 atoms for EXE, GIF and HTM. Theclassification results of this experiment are presented in Table 3.5.3. We seethat DBFD based features reveal patterns in the data that were not revealedby using BFD and CDD based features. In particular, it captures very wellGIF files that BFD and CDD based features fail to capture.




Table 3.5.3: Confusion matrix for the second scenario that is based on DBFDbased features. 100 files of each type were classified by Algorithm 3.5.2.

The third experiment defines a Markov-walk (MW) like set of 65536 fea-tures extracted from the dataset for each signal. The transition probabilitybetween each pair of bytes is calculated. Figure 3.5.5 demonstrates how toextract MW type features from a file fragment.

85

AABCCCF =⇒

Transition Probability (MW)A → A 0.5A → B 0.5B → C 1C → C 0.66C → F 0.33

..

. 0

Figure 3.5.5: Markov Walk (MW) based features extracted from the filefragment “AABCCCF”.

Both MW based features and DBFD based features are calculated usingthe double byte frequencies, but they capture different information from thedata. DBFD based features are focused on finding pairs of bytes that aremost prevalent and those who have low chances of appearing in a file. Onthe other hand, MW based features represent the probability that a specificbyte will appear in the file given the appearance of a previous byte. This iswell suited to file types such as EXE where similar addresses and opcodesare used repeatedly. Each memory address or opcode is comprised of twoor more bytes, therefore, it can be described by the transition probabilitybetween these bytes. Text files also constitute a good example for the appli-cability of MW based features because it is well known that natural languagecan be described by patterns of transition probabilities between words or let-ters. Our study shows that MW based features capture also the structureof media files like GIF and HTM files. The relatively unsatisfactory per-formance on JPG files is because our PDF dictionary was trained on PDFfiles containing pictures. Therefore, it captured some of the JPG files. Thepredicted results are described in Table 3.5.4. These results (97% avg. accu-racy) outperforms these that were obtained by the BFD+CDD and DBFDfeatures, and also improve over all the surveyed methods in [99] (Table VI),including the algorithm proposed in [99], which has 85.5% average accuracy.However, it should be noted that we used 10 fragments for the classificationof each file whereas in [99] a single fragment is used. In this scenario, thedictionary sizes are 500 atoms for PDF, DOC and EXE, 600 for GIF, 800for JPG and 220 for HTM. The HTM dictionary is smaller than the otherdictionaries. This is due to the fact that the HTM training set contains only230 samples of this file type, and the LU dictionary size is bounded by thedimension of the training matrix (see Algorithm 2.4.1).

86




Table 3.5.4: Confusion matrix for the second scenario using MW based fea-tures. 100 files of each type were classified by Algorithm 3.5.2.

3.5.3 Third Scenario: Detecting Execution Code inPDF Files

Portable Document Format (PDF) is a common file format that can containdifferent media elements such as text, fonts, images, vector graphics andmore. This format is widely used in the Web due to the fact that it is selfcontained and platform independent. While PDF format is considered to besafe, it can contain any file format including executables such as EXE filesand various script files. Detecting malicious PDF files can be challengingas it requires a deep inspection into every file fragment that can potentiallyhide executable code segments. The embedded code is not automaticallyexecuted when the PDF is being viewed using a PDF reader since it firstrequires to exploit a vulnerability in the viewer code or in the PDF format.Still, detecting such potential thread can lead to a preventive action by theinspecting system.

To evaluate how effective our method can be in detecting executable codeembedded in PDF files, we generated several PDF files which contain text,images and executable code. We used four datasets of PDF files as ourtraining data:

1. XPDF : 100 regular PDF files containing mostly text.

2. XGIFinPDF : 100 PDF files containing GIF images.

3. XJPGinPDF : 100 PDF files containing JPG images.

4. XEXEinPDF : 100 PDF files containing EXE files.

All the GIF, JPG and EXE files were taken from previous experiments andwere embedded into the PDF files. We generated 4 dictionaries for each

87

dataset using Algorithm 3.3.1. The input for the algorithm was

X = {XPDF , XGIFinPDF , XJPGinPDF , XEXEinPDF}.

We then created a test dataset which consisted of: 100 regular PDF files and10 PDF files that contain executable code. Algorithm 3.5.2 classified the 110files. The input fragments X were the PDF file fragments. The input set ofdictionaries

D = {DPDF , DGIFinPDF , DJPGinPDF , DEXEinPDF}

were the output from Algorithm 3.3.1. A file is classified as malicious (con-taining executable code) if we find more than TEXE fragments of type EXEinside, otherwise it is classified as a safe PDF file. We used TEXE = 10as our threshold since it minimized the total number of miss-classifications.The training step was applied to 10 kilobytes fragments and the classifica-tion step was applied to five kilobytes fragments. We used the MW basedfeatures (65,536 extracted features). By using Algorithm 3.5.2, we managedto detect all the 10 malicious PDF files with 8% of false alarm rate (8 PDFfiles that were classified as malicious PDF files). The results are summarizedin Table 3.5.5. Other file formats, which contain embedded data (DOC filesfor example), can be classified in the same way.

Correct File TypePDF Malicious PDF

Classified File TypeSafe PDF 92 0Malicious PDF 8 10

Table 3.5.5: Confusion matrix for malicious PDF detection experiment. 110files were classified by Algorithm 3.5.2.

3.5.4 Time Measurements

Computer security software face the the scenarios described in sections 3.5.1and 3.5.2 frequently. Therefore, any solution to file type classification mustprovide a quick response to queries. We measured the time required for boththe training phase and the classification phase of our method that classifiesa file or a fragment of a file. Since the training phase operates offline it needsnot need to be fast. On the other hand, classification query should be fastfor real-time considerations and for high-volume applications. The runningtimes of the training and classification steps of each scenario are presentedin Tables 3.5.6 and 3.5.7.

88

Features Training time (sec) Classification time (sec)per 1 MB of data per 1 MB of data

BFD+CDD Preprocessing 1.8 1.88Analysis 0.004 0.0005Total 1.804 1.8805

Table 3.5.6: Running times for the first scenario. The bold number in theleft column refers to the running time of Algorithm 3.3.1, excluding thecomputation of the dictionary size k. The corresponding number in theright column refers to the running time of Algorithm 3.5.1. The times arenormalized by the training data size (left column) and the testing data size(right column). This normalization allows to analyze the times regardless ofthe sizes of the files in our experiments, which are random and vary largely.

Features Training time (sec) Classification time (sec)per 1 MB of data per 1 MB of data

BFD+CDD Preprocessing 1.93 0.1Analysis 0.008 0.01 (per file)Total 1.938

DBFD Preprocessing 13.78 1.6Analysis 0.54 0.26 (per file)Total 14.32

MW Preprocessing 18.42 2.41Analysis 0.65 0.27 (per file)Total 19.07

Table 3.5.7: Running times for the second scenario. The bold numbers in theleft column refer to the running time of Algorithm 3.3.1 (excluding the com-putation of the dictionary size k) for different feature sets. The correspondingnumbers in the right column refer to the running time of Algorithm 3.5.2.The times in the table are normalized by the training and testing data size,as done in Table 3.5.6. The analysis time of the test files however, is per filebecause Algorithm 3.5.2 samples a fixed amount of content from each file,regardless of its actual size.

89

For each scenario, we describe the training time for dictionaries construc-tions and the classification time of an unknown file. The times are dividedinto preprocessing step and the actual analysis step. The preprocessing in-cludes feature extraction from files (data preparation) and loading this datainto Matlab. The feature extraction was done in Python and the output fileswere loaded to Matlab. Obviously, this is not an optimal configuration as itinvolves intensive slow disk I/O. We did not optimize these steps. The anal-ysis time refers to the time needed to build six dictionaries (in the trainingphase) and to classify a single file to one of the six classes (right column inthe classification phase). Our classification process is fast. The preprocess-ing step can be further optimized for real-time applications. All experimentswere conducted on a windows 64-bit, Intel i7, 2.93 GHz CPU machine with8 GB of RAM.

3.6 Conclusion

In this work, we presented a novel algorithm for dictionary construction,which is based on randomized LU decomposition. By using the constructeddictionary the algorithm classify the content of a file and can deduct its typeby examining few file fragments. The algorithm can also detect anomalies inPDF files (or any other rich content formats) which can be malicious. Thisapproach can be applied to detect suspicious files that can potentially containmalicious payload. Anti-virus systems and firewalls can therefore analyze andclassify PDF files using the described method and block suspicious files. Theusage of dictionary construction and classification in our algorithm is differentfrom other classical methods for file content detection, which uses statisticalmethods and pattern matching in the file header for classification. The fastdictionary construction allows to rebuild the dictionary from scratch when itis out-of-date which is important when building evolving systems that classifycontinuously changing data.

Part III

Perturbed Matrix Factorization

91

Chapter 4

Spectral Decomposition Updateby Affinity Perturbations

Many machine learning based algorithms contain a training step that is doneonce if the data is static. The training step is usually computational expen-sive since it involves processing of large matrices. If the training profile isextracted from an evolving dynamic dataset, it has to be updated as somefeatures of the training dataset are changed. This work proposes a solutionhow to update this profile efficiently. Therefore, we investigate how to up-date the training profile when the data is constantly evolving. We assumethat the data is modeled by a kernel method and processed by a spectraldecomposition. In many algorithms for clustering and classification, a low-dimensional representation of the affinity (kernel) graph of the embeddedtraining dataset is computed. Then, it is used for classifying newly arriveddata points. We present methods for updating such embeddings of the train-ing datasets in an incremental way without the need to perform the entirecomputation upon the occurrences of changes in a small number of the train-ing samples. Efficient computation of such an algorithm is critical in manyWeb-based applications. The results in this chapter appear in [25].

4.1 Introduction

Studying a dataset by extracting constructive information from it is a chal-lenging task. The computational complexity increases when we process evolv-ing data that requires frequent updates of the profile that represents the ini-tial training set we use. As time advances, the training profile, which waspreviously extracted from evolving dynamic data, stops representing accu-rately the behavior of the current data. Therefore, the extracted profile has

93

94

to be updated frequently.A straightforward approach to update the training profile is to repeat the

whole computational process that generated previously this training profile.However, it becomes computationally impractical when dealing with large-scale data while the changes in the training data are small. For example,consider a face recognition application. Assume the training dataset reflectsmany facial features such as color, glasses, haircut, age, etc. Assume someof the features (not all of them) were modified slightly. This happens of-ten. We propose a method for updating efficiently the training profile whileperforming a limited and efficient computation that takes into considerationonly the current modified features and not taking all the features.

A common practice in kernel methods is to extract features from a largefinite high dimensional dataset that becomes a training dataset. Then, a sim-ilarity graph between the features in the training dataset is formed. We willuse the Diffusion Maps (DM) methodology [108, 14] as our exemplary kernelmethod (see section 4.3.1 for the description of this method) to compute anembedding of this graph into a low-dimensional space. This embedding isaccomplished by eigenvectors computation of the graph affinity matrix thatbelong to the largest eigenvalues. Changes in the affinity matrix will resultin changes in the eigenvectors, thus, it will force us to compute them fre-quently. In this work, we propose a solution that is based on the PowerIteration algorithm combined with the first order approximation of the per-turbed eigenvectors and eigenvalues (eigenpairs) that enables us to updatethe training profile dataset by considering only the changes in the trainingdataset. By using ideas from perturbation theory, we update the eigenpairsthat are based on the perturbations (changes) in the affinity matrix thateventually require less computational efforts. We tested our algorithm onaffinity matrices that were generated by the diffusion operator of the train-ing similarity graph in the DM methodology.

Consider a set of n sensors. Each sensor measures m different parameters(features). Suppose that we get the measurement data matrixX of size n×m,where the cell (i, j) represents the j measurement of sensor i. Therefore,each sensor is a data point in Rm. We want to embed the sensed datain a low-dimensional space for clustering and classification [61]. We canalso predict the parameters values in locations nearby our sensors by usingout-of-sample extension methods [109, 28, 110]. Embedding the data intoa low-dimensional space is usually done by computing the singular valuedecomposition (SVD) of the sensors affinity matrix or by some variationof it (kernel matrix, probability matrix or diffusion operator matrix). Thevalues of the eigenvectors are used as the coordinates in the low-dimensionalspace. This affinity can be defined in several ways and it usually depends on

95

the measurements types. DM, for example, provides an affinity among thefeatures (sensor measurements). Since sensor data is dynamic and evolving,the embedded low-dimensional space have to be updated as the training datadoes not represent adequately the incoming data that did not participate inthe training phase. Even if most of the sensors readings were unchanged, wewill still need to perform the entire computation since we cannot determinethe effect of such a change on the embedded space. Therefore, the goal ofthis work is to provide an efficient way to update the embedding coordinateswithout the need to re-compute the entire SVD again and again providedthat the sensors reading are regarded as the perturbations from the originalreadings of the sensors affinity matrix. The perturbations are assumed to besufficiently small.

The chapter has the following structure: related works are described inSection 4.2. Section 4.3 provides a formal definition of the problem. The mainalgorithm and its validity are given in Section 4.4. Experimental results arepresented in Section 4.5.

4.2 Related Work

There are several works [111] that describe how to adapt elements from ma-trix perturbation theory to achieve eigenpairs approximation. Many of theseworks focus on updating the left principal eigenvector π of a stochastic ma-trix P where π = πP for eigenvalue 1. Here, π is the stationary distributionof a Markov chain defined by P . By the Ergodic Theorem for Markov chains,π is unique if P is aperiodic and irreducible [112] .The above updating meth-ods can help to accelerate algorithms such as Pagerank and HIT (see [113]for more details) that use the stationary distribution values as rating scores.Recently, new algorithms [112, 114] were introduced to further improve theconvergence rate of Google’s Pagerank algorithm. These types of methodsare suitable only for updating the first eigenvector of the perturbed matrix,whereas we have to update the first k dominant eigenvectors to create theembedding.

Another way to approximate the eigenvectors is by using the group inverseof A [115]. The group inverse A# of A is defined as the matrix that satisfiesAA#A = A, A#AA# = A#, and AA# = A#A. If the matrix A is non-singular then A# = A−1. By using the group inverse we can approximate theperturbed eigenvectors to be ϕi = ϕi− (A− λiI)

#Aϕi. However, calculatingthe group inverse is not trivial even if we only compute its approximation.Computing it for each new perturbation update of A is inefficient.

Alternatively, random algorithms such as [6] are proved to be effective in a

96

direct computation of the SVD approximation for large scale matrices. Thesemethods treat the perturbed matrix A as a new SVD approximation problemand neither use A nor its eigenpairs. Traditional computational methods suchas the Power Iteration [5], Inverse Iteration and the Lanczos [116] methodsoperate in the same way and compute the eigenpairs of each update of theperturbed matrix. Here, the computation is performed with a random guessas the initial input without taking the unperturbed matrix and its eigenpairsinto consideration.

Incremental versions of low-dimensional embedding algorithms were tai-lored specifically to fit Local Linear Embeddings(LLE) [117] and ISOMAP [118].These algorithms utilize manifold learning methods. They modify the origi-nal LLE and ISOMAP algorithms to process the data iteratively rather thanby batch processing. When a new data point arrives, these algorithms addit to the embedding and then efficiently update all the existing data pointsin the low dimensional space.

4.3 Problem Description

4.3.1 Finding a Low-Dimensional Embedded Space

We are given a set of data points xi in RD. We want to find a set of datapoints yi in Rd, d≪ D, that preserves up to a small controlled distortion theaffinities between them. In other words, nearby data points remain nearbywhile distant data points remain distant. This general framework has threesteps:

1. Build a graph G(V,E) which represents the data points xi. Nearbypoints are connected with an edge.

2. Build a similarity matrix using weights on the edges E in G.

3. Build an embedded graph by using the eigenvectors of the distancematrix.

The changes between each dimensionality reduction method are driven fromthe way we build the similarity matrix in step 2.

In words, we build an observational method of the inspected data thatcan be a large computer network for example. The fundamental ingredient isthe ability to organize and model the data into a simple (reduced dimension)geometry. Vectors of observations on the data are collected. They are orga-nized as a graph in which various vectors of observations are linked by their

97

similarity (affinity). Then, a second graph is built in which the actual entriesin the observation vector are linked through their mutual dependence. Spec-tral and harmonic analysis of the similarity matrix (or dependence matrix)is performed thus enabling the organization of the empirical observationsinto simpler low-dimensional structures. Nonlinear extension of conventionallinear statistical tools such as principal components analysis (PCA), andindependent components analysis (ICA) are used. These methods reducethe observed data to allow a small number of parameters (coordinates) tomodel all the variabilities in the observations. A robust similarity relation-ship between two observation vectors is computed as a combination of allchains of pairs that link them. In the DM case ([14]), these are the diffu-sion inference metrics. Clustering in this metric leads to robust clusteringof the observations and their characterization. Various local criteria of link-age between observations lead to distinct geometries. In these geometries,the user can redefine relevance and filter away unrelated information. Thetop eigenfunctions of the matrix define the pair linkages to provide globalorganization of the given set of observations. DM embeds the data into alow-dimensional Euclidean space that converts isometrically the (diffusion)relational inference metric to the corresponding Euclidean distance. Diffu-sion metrics can be computed efficiently as an ordinary Euclidean distancein a low-dimensional embedding by the DM. Total computational time scaleslinearly with the data size wet get, and it can be updated on line. The dif-fusion geometry, which is induced by various chains of inference, enables amultiscale hierarchical organization of regional folders of observations corre-sponding to various states of the network [119].

To better understand the proposed algorithm, we review the DiffusionMaps (DM) methodology [14, 108] that performs non-linear dimensionalityreduction. Given our sensor reading matrix X, we define a weighted graphover the sensor set, where the weight between sensor i and j is given by thekernel

k(i, j) , e−∥xi−xj∥

ε . (4.3.1)

The degree of a sensor (vertex) i in this graph is

d(i) ,∑j

k(i, j). (4.3.2)

Normalizing the kernel with this degree produces an n × n row stochastictransition matrix whose cells are [P ]ij = p(i, j) = k(i, j)/d(i) for sensors iand j. This defines a Markov process over the sensors set.

The dimensionality reduction achieved by this diffusion process is a resultof the spectral analysis of the kernel. Thus, it is preferable to work with a

98

symmetric conjugate to P that we denote by A and its cells are denoted by

[A]ij = a(i, j) =k(i, j)√d(i)

√d(j)

=√

d(i)p(i, j)1√d(j)

. (4.3.3)

The eigenvalues 1 = λ1 ≥ λ2 ≥ . . . of A and their corresponding eigenvectorsϕk (k = 1, 2, . . .) are used to obtain the desired dimensionality reduction bymapping each i onto the data point Φ(i) = (λ2ϕ2(i), λ3ϕ3(i), ..., λδϕδ(i)) fora sufficiently small δ, which is dependent on the decay of the spectrum of A.

4.3.2 Updating the Embedding

We are given the perturbation matrix A of the matrix A. We can assumethat the perturbations are sufficiently small, that is ∥A − A∥ < ε for somesmall ε. We also assume that A is symmetric since we compute it in the sameway as A was computed. We wish to update the eigenpairs of A based on Aand its eigenpairs. We now present the problem in mathematical terms.

Given a symmetric n× n matrix A where its k dominant eigenvalues areλ1 ≥ λ2 ≥ ... ≥ λk and its eigenvectors are ϕ1, ϕ2, ..., ϕk, respectively, and aperturbed matrix A such that ∥A − A∥ < ε, find the perturbed eigenvaluesλ1 ≥ λ2 ≥ ... ≥ λk and its eigenvectors ϕ1, ϕ2, ..., ϕk of A in the most efficientway.

4.4 The Recursive Power Iteration (RPI) Al-

gorithm

4.4.1 First Order Approximations

To efficiently update each eigenpair of the perturbed matrix A, we will firstcompute the first order approximation of each eigenpair. Later, it will beused in our algorithm as the initial guess to the RPI algorithm.

Given an eigenpair (ϕi, λi) of a symmetric matrix A where Aϕi = λiϕi,we compute the first order approximation of the eigenpair of the perturbedmatrix A = A + ∆A. We assume that the change ∆A is sufficiently small,which result in a small perturbation in ϕi and λi. We look for ∆λi and ∆ϕi

that satisfy the equation

(A+∆A)(ϕi +∆ϕi) = (λi +∆λi)(ϕi +∆ϕi). (4.4.1)

This equation is expanded to

Aϕi + [∆A]ϕi +A[∆ϕi] + [∆A][∆ϕi] = λiϕi + λi[∆ϕi] + [∆λi]ϕi + [∆λi][∆ϕi].

99

It becomes

[∆A]ϕi + A[∆ϕi] = λi[∆ϕi] + [∆λi]ϕi +O(∆2). (4.4.2)

For the rest of the computation, we will ignore the term O(∆2) as it providesrelatively small error. Since A is symmetric, its eigenvectors are orthogonaland can be used as a basis for the perturbed eigenvector ∆ϕi =

∑Nj=1 ϵijϕj

for some constants ϵij. Substituting this construction in Eq. 4.4.2, we get

[∆A]ϕi + A

(N∑j=1

ϵijϕj

)= λi

(N∑j=1

ϵijϕj

)+ [∆λi]ϕi,

or

[∆A]ϕi +N∑j=1

ϵijAϕj = λi

(N∑j=1

ϵijϕj

)+ [∆λi]ϕi.

By using that fact that Aϕj = λjϕj, we get

[∆A]ϕi +N∑j=1

ϵijλjϕj = λi

(N∑j=1

ϵijϕj

)+ [∆λi]ϕi. (4.4.3)

We can now simplify Eq. 4.4.3 by multiplying both sides by ϕTi

ϕTi [∆A]ϕi +

N∑j=1

ϵijλjϕTi ϕj = λi

(N∑j=1

ϵijϕTi ϕj

)+ [∆λi]ϕ

Ti ϕi.

By using the fact that for j = i , ϕj are orthogonal to ϕi, we get

ϕTi [∆A]ϕi + ϵiiλiϕ

Ti ϕi = λiϵiiϕ

Ti ϕi + [∆λi]ϕ

Ti ϕi,

and since ϕTi ϕi = 1 the equation becomes

[∆λi] = ϕTi [∆A]ϕi. (4.4.4)

By multiplying both sides of Eq. 4.4.3 by ϕTk , k = i, we get

ϕTk [∆A]ϕi +

N∑j=1

ϵijλjϕTk ϕj = λi

(N∑j=1

ϵijϕTk ϕj

)+ [∆λi]ϕ

Tj ϕ.

Therefore,

ϕTk [∆A]ϕi + ϵikλkϕ

Tk ϕk = λi[ϵikϕ

Tk ϕk] + [∆λi]ϕ

Tk ϕi.

100

Since ϕTk ϕk = 1 and ϕT

k ϕi = 0, k = i, we get

ϕTk [∆A]ϕi + ϵikλk = ϵikλi + 0,

which yields

ϵik =ϕTk [∆A]ϕi

λi − λk

. (4.4.5)

Finally, we require that the perturbed eigenvectors will also be an orthonor-mal basis and (ϕi + [∆ϕi])

T (ϕi + [∆ϕi]) = 1. If we expand this equation, weget

ϕTi ϕi + 2ϕT

i [∆ϕi] + [∆ϕi]T [∆ϕi] = 1.

After the removal of high order terms, we get 1+2ϕTi [∆ϕi] = 1, or ϕT

i [∆ϕi] =⟨ϕi, [∆ϕi]⟩ = 0. Since this product is ϵii, therefore ϵii = 0. Using ϵii = 0 andEq. 4.4.5, then [∆ϕi] becomes

[∆ϕi] =N∑j=1

ϵijϕj =∑j =i

ϕTj [∆A]ϕi

λi − λj

ϕj.

To conclude, we obtained that the following first order approximationsfor the eigenvalues and eigenvectors of A by

λi = λi + ϕTi [∆A]ϕi (4.4.6)

and

ϕi = ϕi +∑j =i

ϕTj [∆A]ϕi

λi − λj

ϕj. (4.4.7)

The relative error of the perturbed x from x is defined by

errx =∥x− x∥L2

∥x∥L2

. (4.4.8)

To analyze the accuracy of these approximations, we calculated the rel-ative error of a 200 × 200 random matrix A, where [A]ij ∼ U(0, 1). Theperturbed matrix A was created by adding some noise to the elements of Aat a rates of 1% and 5%.

101

(a) Eigenpairs approximation errors for 1% perturbations

(b) Eigenpairs approximation errors for 5% perturbations

Figure 4.4.1: Approximation error rates. The x-axis is the index of theordered eigenvalues. The y-axis is the relative error of the approximatedvalue (Eq. 4.4.8).

We can see that the error increases for eigenvectors that correspond toeigenvalues of smaller magnitudes. In other words, the error does not affectthe largest eigenpairs. Since the embedding into a lower dimension space byDM depends on a few largest eigenvalues, the error does not affect the qualityand the validity of the lower dimension space as a faithful representation ofthe original source space.

While this method provides a fast computation for the perturbed values,it is limited as it only uses the first order approximations, which might notbe sufficiently accurate for our needs.

4.4.2 The Recursive Power Iteration Method

Power Iteration method has proved to be effective when calculating the prin-ciple eigenvector of a matrix [112]. However, this method cannot find the

102

other eigenvectors of the matrix. In general, an initial guess of the eigenvec-tor is also important to guarantee fast convergence of the algorithm. In thealgorithm, which we call Recursive Power Iteration (RPI), the original eigen-vectors of A will be the initial guess for each power iteration (eventually thischoice will be refined in Section 4.4.3). Once the eigenvector ϕi is obtainedin step i, we transform A into a matrix that has ϕi+1 as its principle eigen-vector. We iterate this step until we recover the k dominant eigenvectors ofA.

Algorithm 4.4.1: Recursive Power Iteration Algorithm

Input: Perturbed symmetric matrix An×n, number of eigenvectors tocalculate k, initial eigenvectors guesses {vi}ki=1, admissibleerror err

Output: Approximated eigenvectors{ϕi

}k

i=1, approximated

eigenvalues{λi

}k

i=11: for i = 1→ k do2: ϕ← vi3: repeat

4: ϕnext ← Aϕ

∥Aϕ∥5: errϕ ← ∥ϕ− ϕnext∥6: ϕ← ϕnext

7: until errϕ ≤ err8: ϕi ← ϕ

9: λi ← ϕTi Aϕi

ϕTi ϕi

10: A← A− ϕiλiϕTi

11: end for

The correctness of the RPI algorithm is proved based on the fact thatthe power iteration method converges and on the spectral decompositionproperties of A.

Proposition 4.4.1. Algorithm 4.4.1 finds the first k eigenpairs of A.

Proof. We prove the correctness of the algorithm by induction on k in thealgorithm.

For k = 1, we apply the power iteration method on A that converges tothe principle component ϕ1 with its corresponding eigenvector λ1. λ1 is thelargest eigenvalue of A.

103

Let us assume that the algorithm found the first k eigenpairs of A. Ineach step, we subtract the matrix ϕiλiϕ

Ti from A. Then, in step k + 1, we

apply the power iteration loop to the matrix B = A −∑k

i=1 ϕiλiϕTi . A is

symmetric and has a spectral decomposition of the form A =∑n

i=1 ϕiλiϕTi ,

where ϕi, λi are the eigenpairs of A. Therefore, B =∑n

i=k+1 ϕiλiϕTi . Since

λk+1 ≥ λk+2 ≥ ... ≥ λn, the principle component of B is ϕk+1. This principlecomponent is found once we apply the power iteration method to B, which isexactly what happens in step k+1. Therefore, in step k+1 of the algorithm,the power iteration method will recover the eigenvector ϕk+1 of A. After k+1steps, the algorithm recovers the k + 1 dominant eigenpairs of A.

To analyze the computational complexity of the RPI algorithm, we ob-serve that during the execution of the algorithm, we perform (I1+ ...+Ik)CA

operations, where Im is the number of iterations needed in step m and CA

is the cost of applying the matrix A to a vector followed by its normaliza-tion. We also need to update A during the k steps which costs kn2 operations.Therefore, the total complexity of this algorithm is O(kn2+(I1+ ...+Ik)CA).

4.4.3 RPI Algorithm with First Order Approximations

The RPI finds in each step the principle eigenvector of the modified A byiterating the equation vk+1 = Avk

∥Avk∥. The convergence rate depends on the

initial guess which we provide to the iteration loop. Algorithm 4.4.1 uses theunperturbed eigenvector as the initial guess. To improve the convergencerate, we apply Algorithm 4.4.1 but use the eigenvectors of the first orderapproximation, which were computed in section 4.4, as our initial guess.

The justification for this approach is that the first order approximation ofthe perturbed eigenvector is inexpensive, and each RPI step will guaranteethat this approximation converges to the actual eigenvector of A. The firstorder approximation should be close to the actual solution we seek and there-fore requires fewer iterations steps to converge. A comparison between thenumber of iteration needed to compute the eigenpairs is given in Section 4.5for different variations of the RPI algorithm.

The fact that Algorithm 4.4.1 iterates over k to recover the perturbedeigenvectors can assist us in automatically selecting the number of eigenvec-tors to approximate, which is the dimension of the embedded space. If weapproximate the perturbed matrix A by the first k eigenvectors that formthe matrix Ak, which has rank k, we know that the approximated error is∥A − Ak∥ ≤ λk. Since in each iteration we also calculate the eigenvalues,we know to stop when λk is sufficiently small. This is an advantage over

104

methods that compute the SVD approximation of A that sometimes requirek as an a priori input parameter.


We compare the execution time as well as the total number of iterations∑ki=1Ci of the three variations of the RPI algorithm. The first algorithm

uses a random vector as the initial guess in each step. The second algorithmuses the unperturbed (known) vector of A as the initial guess. The thirdalgorithm uses the first order approximation of the perturbed eigenvector of Aas the initial guess. In this benchmark, we used a 104×104 perturbed diffusionmatrix from the DM methodology. The first k = 10 eigenpairs are computed.We compared the results for different admissible errors ∥ϕ− ϕnext∥L2 . As wecan see in Fig. 4.5.1, the RPI algorithm with the first order approximationhas the lowest total running time. It also needs fewer number of iterationsto complete, as illustrated in Fig. 4.5.2.

Figure 4.5.1: Comparison of the total wallclock time between the three RPIalgorithmic variations. We compute the first 10 eigenpairs of a 104 × 104

matrix. Each bar represents a variation of the RPI algorithm. Each groupof bars is compared within a given admissible error.

105

Figure 4.5.2: Comparison of the total number of iterations between the threeRPI algorithmic variations. We compute the first 10 eigenpairs of a 104×104

matrix. Each bar represents a variation of the RPI algorithm. Each groupof bars is compared within a given admissible error.

4.6 Conclusion

In this work, we presented several contributions. The dominant eigenpairserror is relatively small when the first order approximation is used as theinitial guess in the computation. We presented the RPI algorithm, which usesthe Power Iteration method, to compute the first k eigenpairs of a perturbedmatrix. The algorithm uses the given eigenpairs of the unperturbed matrix.We improved the algorithm by using the first order approximation of theeigenpairs as the initial guess and showed that it accelerates the convergencerate of each iteration. After proving the correctness of the algorithm, weshowed that it also performs well on real data.

Chapter 5

Affinity Perturbations Usage toDetect Web Traffic Anomalies

The initial training phase of machine learning algorithms is usually com-putationally expensive since it involves processing huge matrices. Evolvingdatasets are challenging from this point of view because changing behav-ior requires updating the training. We propose a method for updating thetraining profile efficiently and a sliding window algorithm for processing thedata on-line in smaller parts. This assumes that the data is modeled by akernel method that includes spectral decomposition. We demonstrate thealgorithm with a Web server request log where an actual intrusion attack isknown to happen. The dynamic kernel update with sliding window preventsthe problem of single initial training and can process evolving datasets moreefficiently. The results in this chapter appear in [26].

5.1 Introduction

Evolving data that requires frequent updates to the training profile is a chal-lenging target when extracting constructive information. The computationalcomplexity of the training phase increases with such datasets because an ear-lier profile will not accurately represent the behavior of current data. There-fore, we need to update frequently the extracted profile. A straightforwardapproach for updating the training profile is to repeat the entire compu-tational process that generated the original profile. This work presents amethod for efficiently updating the evolving profile.

A common practice in kernel methods is to extract features from a highdimensional dataset, and to form a similarity graph between the features inthe dataset. In this research we apply the Diffusion Maps (DM) method-

107

108

ology [14] to a Web traffic log. DM finds the embedded coordinates for alow-dimensional representation of the data. This embedding is accomplishedby eigenvectors computation of the graph affinity matrix. Changes in theaffinity matrix will result in changes in the eigenvectors, and thus will forceus to calculate them frequently. To accelerate this computation, our algo-rithm uses Recursive Power Iteration (RPI) together with an initial approx-imation calculated for the eigenvalues and eigenvectors (eigenpairs) of theupdated affinity matrix. Such approach enables to update the profile of thedataset while taking into consideration the changes in the original dataset,thus requiring less computational effort.

Internet data constantly evolve and change. For this reason, the embed-ded space should be updated since the original training information we gotdata does not represent the newly arrived data. This can happen if, for ex-ample, the new data was not included in the first training cycle. Even ifmost of the network data in our interest window remained unchanged, theentire computation has to be done since we cannot predict the possible im-pact on data in the embedded space. We develop an efficient method thatcan update the embedding coordinates while avoiding re-computation of theentire SVD repeatably. To achieve that , we treat new log lines features asperturbations of the original network log profile of the feature affinity matrix.By applying a sliding window technique to the arriving network data, we areable to process the data on-line, and keep embedding it using our calculatedlow-dimensional space. We test the method on real Web traffic data andcompare our results to the true classification.

5.2 Related Work

Several algorithms such as Power Iteration [5] and Lanczos method [116]operate in an iterative way. In each iteration they calculate the eigenpairs ofthe updated input matrix. The computation of the eigenpairs is performedusing a random guess as the initial input. Such approach does not utilize theoriginal matrix (without the perturbations) and its known eigenpairs, whencomputing the new eigenpairs.

There are also incremental algorithms for updating an embedding in alow-dimensional space that were developed for Local Linear Embeddings(LLE) and for Isomatric Maps (ISOMAP) [117, 118]. Both algorithms usemanifold learning methods. The incremental LLE and ISOMAP can pro-cess data iteratively in multiple bulks instead of processing it in a singleexecution that requires that the entire data will be given in advance. Uponthe arrival of new data samples, these algorithms start by adding them into

109

the low-dimensional space and then update all the historical data samplesefficiently.

Network security has been one focus among the machine learning com-munity. Kruegel and Vigna studied the parameters of HTTP queries using atraining step with unlabeled data with various methods. Their character dis-tribution analysis uses similar feature extraction as our current study [120].Hubballi et al. described an n-gram approach to detect intrusions from net-work packets [121]. Ringberg et al. studied IP packets using principal compo-nent analysis-based dimensionality reduction [122]. Callegari et al. analyzedsimilar low level packet data [123].

Diffusion maps have been also used for network security problems. Davidstudied the use of diffusion map methodology for detecting intrusions in net-work traffic [61], while David and Averbuch used low level features to classifynetwork traffic protocols [119]. Network server logs have also been studiedwith diffusion maps with an offline approach using n-gram features and spec-tral clustering [124, 125]. In these works, data analysis was performed in abatch fashion, processing all recordings as a single, off-line dataset.

5.3 Low Dimensional Space Projection

5.3.1 Diffusion Maps

Projecting data into a low-dimensional space is an important step in un-derstanding high dimensional data more profoundly as it recovers importantrelations between data samples and enables visualization. To provide theneeded background for the presented algorithm, we start by explaining theDM method [14, 126] that performs non-linear dimensionality reduction. As-sume that we have a Web log feature matrix X. Each row in X is a featurevector corresponding to a Web log line. Define a weighted graph G,whereeach log line is a vertex in G. The weight between the lines i and j is theweight of the edge (i, j) ∈ G and defined by the kernel

k(i, j) , e−∥xi−xj∥

ε , (5.3.1)

for some ε > 0. Note that the distance ∥xi − xj∥ and the value of ε areselected after studying the dataset and the extracted features. In Section 5.6we detail the distance metric that we used in our experiments.

Each log line i (represented by a vertex) in this graph has the degree

d(i) ,∑j

k(i, j). (5.3.2)

110

Next, we normalize the kernel using Eq. 5.3.2 to produces a transition matrixP that is n× n and row stochastic. The entries of P are

[P ]ij = p(i, j) =k(i, j)

d(i)(5.3.3)

for log lines i and j. The matrix P induces a Markov process, where [P ]ij isthe probability of going from vertex i to j.

The dimensionality reduction is obtained by such a diffusion process isthe outcome of the spectral analysis of the kernel (Eq. 5.3.1). To simplifyour computations we will use a symmetric conjugate of the matrix P thatwe denote by A. The entries of A are defined as

[A]ij = a(i, j) =k(i, j)√d(i)

√d(j)

=√

d(i)p(i, j)1√d(j)

. (5.3.4)

The eigenvalues 1 = λ1 ≥ λ2 ≥ . . . of P and their corresponding eigenvec-tors vk (k = 1, 2, . . .) are derived from the eigenvectors uk of A. We use theeigenvectors vk (k = 1, 2, . . .) to get the dimensionality reduction by mappingeach original data point i onto the data point

Ψ(i) = (λ2v2(i), λ3v3(i), ..., λδvδ(i)) (5.3.5)

for a sufficiently small δ ≥ 0. Note that δ depends on the decay rate ofthe spectrum of A [14]. When selecting a suitable value for ε, the decayof the spectrum of A is fast. This will enable us to use a small number ofcoordinates for the diffusion maps.

In matrix notation, the operator A (Eq. 5.3.4 is defined as

A = D− 12KD− 1

2 = D12PD− 1

2 (5.3.6)

where D is the diagonal matrix containing the d(i) value in cell Dii. Toretrieve the eigenvectors V = [v1, v2, ...vn] of P from the eigenvectors of A,we use the transformation

V = D− 12U, (5.3.7)

where U is the eigenvector column matrix of A. The eigenvectors V are thenscaled by dividing each one by the first value of the first eigenvector.

5.3.2 Incremental Update of the Embedded Space

Once we have the DM embedding of the initial matrix A, we need to keepupdating the embedding for the next arriving samples. By replacing the

111

oldest samples with the samples that arrived, we can model such changes.This is done by treating the update matrix A as the original data matrix Awith some perturbations added to it. Assume that A has perturbations thatare sufficiently small. In other words, ∥A−A∥ < ε for a small ε. The matrixA is symmetric since it represents the operator defined in Eq. 5.3.4. We seekto update the eigenpairs of A given that the A and its eigenpairs are knownto us.

To present the problem in a more formal way, let us assume that A isan n × n symmetric matrix. Assume that its k dominant eigenvalues areλ1 ≥ λ2 ≥ ... ≥ λk and its eigenvectors are ϕ1, ϕ2, ..., ϕk, respectively. Givena perturbed matrix A such that ∥A−A∥ < ε, calculate the perturbed eigen-values λ1 ≥ λ2 ≥ ... ≥ λk and the corresponding eigenvectors ϕ1, ϕ2, ..., ϕk ofA in an efficient way.

In Section 5.4, we detail how such processing can be done using the RPIAlgorithm. The algorithm pseudo code is shown in Algorithm 4.4.1, whichappears in Chapter 4. This Algorithm finds iteratively the eigenvectors ofthe perturbed matrix using an initial approximation followed by an iterationloop in order to recover the perturbed eigenpairs.

5.4 Recursive Power Iteration

5.4.1 First Order Approximation of Eigenpairs

To update the eigenpairs of a perturbed matrix A, we calculate the firstorder approximation of each eigenpair. Then we use the approximation asthe initial value in the RPI algorithm (Step 2 in Algorithm 4.4.1).

Given an eigenpair (ϕi, λi) of a symmetric matrix A where Aϕi = λiϕi, wecalculate the first order approximation of the eigenpair (ϕi, λi) of the matrixA = A +∆A. Let us assume that the change ∆A is sufficiently small, thatresults in a small perturbation in ϕi and λi. We look for ∆λi and ∆ϕi thatcan satisfy

(A+∆A)(ϕi +∆ϕi) = (λi +∆λi)(ϕi +∆ϕi). (5.4.1)

By using the methods described in [25], we have the first order approxima-tions of the eigenvalues and eigenvectors of A as

λi = λi + ϕTi [∆A]ϕi (5.4.2)

and

ϕi = ϕi +∑j =i

ϕTj [∆A]ϕi

λi − λj

ϕj. (5.4.3)

112

5.4.2 RPI Algorithm

Power iteration method is an effective and efficient method for calculatingthe principal eigenvector of a matrix [112]. This method only recovers theprincipal eigenpair and not the other eigenpairs of the matrix. The initialvalue of the eigenvector plays a critical role in enabling the algorithm toconverge fast enough. In Algorithm 4.4.1 the first order approximations ofthe perturbed eigenvectors of A are used as the initial value for each poweriteration loop. Each time that the eigenvector ϕi is computed in cycle i,we transform A to a matrix that has ϕi+1 as its principal eigenvector. Thisstep is iterated k times, until the algorithm recovers the k most dominanteigenvectors of A.

The logic behind this approach is that the first order approximation of theperturbed eigenvector is calculated efficiently, and each RPI step guaranteesthat this approximation will converge to the eigenvector of A. Since the firstorder approximation is an more accurate then a random vector, it requires lessiteration steps to converge to the correct solution. The proof that Algorithm4.4.1 produce correct results, and its complexity analysis are given in [25].

5.5 Sliding Window Diffusion Map

Using DM to process high volumes of data can be computationally intensive.This task becomes even more complex when the data is generated on-line andneeds to be processed continuously. Therefore, we try to process the arrivingdata with iterative methodology by using the sliding window model. A slidingwindow X takes into account the n latest measurements. In practice, it is ann ×m matrix with features on the columns and samples on the rows. Thesamples are high dimensional, so the dimensionality of the sliding window isreduced from m to d using DM. The n × d matrix Xr now contains a low-dimensional representation of the dataset. This reduction is done each timea new sample appears and the window moves. The consecutive update of theDM is a time-consuming process that requires singular value decomposition(SVD) during each window.

When updating the window, we can replace the oldest measurement witha new one in the matrix X, therefore changing a single row in X. Thismeans that one row and one column of the K matrix in the DM algorithmchange. This change can be interpreted as a perturbation to the matrix K,and furthermore to the matrix A, defined using the K matrix. Algorithm4.4.1 with the first order approximation solves the eigenvectors for perturbedmatrices. This leads us to use Algorithm 4.4.1 instead of using SVD on the

113

matrix A.

Algorithm 5.5.1 outlines the sliding window DM method. First, it solvesthe eigenvectors for the initial window using SVD. Then the algorithm iter-atively process the following windows until no new samples are available.

Algorithm 5.5.1: Sliding Window Diffusion Map with RPI

Require: X Dataset ; n window width ;k embedded dimension ; err admissible error.

Ensure: Anomaly score for points in X.1: ϵ← estimate kernel parameter for first window of size n.

2: [K]ij ← exp(− ||xi−xj ||2

ϵ

), where i, j = 1 . . . n

3: D ← diag(∑n

i=1[K]ij)

4: A← D− 12KD− 1

2

5: U,Λ, UT ← SVD(A)6: while new sample xt available, where t > n do7: l← t mod n8: Replace the row l in X with the new sample xt.9: Update both row l and column l of the affinity matrix K.10: D ← diag(

∑ni=1[K]ij)

11: A← D− 12KD− 1

2

12: U,Λ← RPI (Algorithm 4.4.1)with first order approximation(A, A, k, U, Λ, err)

13: V ← D− 12U

14: V ← VV1,1

15: Ψ ← V Λ16: Find anomalies in Ψ and give scores to each sample in X.17: A← A18: end while19: Return aggregated anomaly scores for each sample in X.

There are some practical problems with such implementation that needto be taken into consideration. First, Algorithm 4.4.1 might not be ableto solve the eigenvectors for some low-rank matrices. To prevent this, itis possible to use standard SVD when a low-rank (or otherwise unsuitable)matrix is encountered. Second, the window size itself has to be decided. Thechanging scales of the data over time introduce a challenge to the slidingwindow algorithm. The initial window still determines the profile and scalefor the beginning of the analysis. Big windows have the advantage of coveringa larger representation of the data and thus include a much more varied

114

overview of the normal behavior. With smaller windows, the percentage ofanomalies within the data might get too big, and detecting the normal statecan become more difficult. Small windows, on the other hand, require lesscomputational time since they induce smaller matrices. Optimal windowsize would therefore be the smallest possible that could still contain a smallenough percentage of anomalies within the data, enabling it to capture thenormal samples correctly.

Detecting the anomalies in the low-dimensional representation can bedone in various ways. A straightforward approach is to calculate the Eu-clidean distances between the embedded samples and find the ones thatdeviate too far from the center of the dataset. This and other spectralclustering methods give good results for datasets that contain clear sepa-ration [127, 124, 125]. Similarly, k-means or any other clustering algorithmcan find possible normal as well as anomalous behavior in the data [15]. Thedensity of points in the low-dimensional space tells how far they are fromthe more clustered areas of the data. These methods calculate the distancesto neighboring points [61, 128]. All these methods usually need a thresholdvalue for the anomalous region.

In each iteration, we evaluate the anomaly level of the samples within ourwindow. Each sample gets a score if it is classified as an anomaly. The scoresare added as the window moves. This anomaly score histogram is used todetermine the anomaly level of a point. Scoring is used because locally insidea window some samples might appear anomalous but globally, consideringthe whole dataset, they are not. In this way, even if the sample looks like ananomaly in some windows, it still gets only a few scores in the global pointof view.


For the experimental part of this study, we use a labeled proprietary datasetknown to contain some network attacks. The dataset consists of queries to aWeb server. These Web queries are in Apache combined log format (a textfile). To extract numerical features from this text file, only the changingparameter values are used. The frequencies of 2-grams in these parametersare calculated to a matrix. In this matrix, the rows represent the log lines,and the columns represent the different 2-grams we found. The entries inthis matrix count how many times each specific 2-gram appeared in theparameters of a log line. For more information about the dataset and thefeature extraction method, see related literature [124, 125].

The Web log we use has 4292 lines and contains 480 different 2-grams.

115

0 500 1000 1500 2000 2500 3000 3500 4000 45000

10

20

30

40

50

60

70

80

90

100

Sample

Ano

mal

ous

in h

ow m

any

win

dow

s

Anomaly detection scores

Figure 5.6.1: The scores for each point with window size 1000 using thesecond eigenvector.

Thus, the feature matrix has dimensions 4292 × 480. The experiment sim-ulates the initial state when n samples, or log lines, have arrived. When anew line arrives, it is added to the current window, while the oldest sampleis removed from the matrix. This is continued until no new samples areavailable. The algorithm tracks only the samples within the window so thatthe dynamically changing nature of the data can be followed. As the size ofthe window does not change, the eigenpair problem stays reasonably sized.

The anomaly detection finds the most deviating samples within each win-dow. This leads to false alarms when using simple normalized anomalymetrics because inside a window a point might look anomalous. Its localabnormality might be evident, but it should not be classified as one sinceglobally it is just a small deviation from the normal state. This fact pro-motes thresholding the non-normalized but centered low-dimensional repre-sentation dk = |Ψk − mean(Ψk)| using statistical threshold θk = c · std(dk),where the parameter c has to be adjusted empirically and k is the number ofdimensions in the embedded space.

Figure 5.6.1 shows the scores each point gets as the sliding window moves.

116

This experiment uses only the second eigenvector for the low-dimensionalrepresentation. In our analysis, we use a value of c = 10 for the anomalythreshold calculation. These scores themselves indicate in how many win-dows each sample is considered anomalous. Notice that a sample might beconsidered anomalous in several windows, but in the global view it is not ananomaly. Therefore, we use another threshold, that is the horizontal red linein Fig. 5.6.1.

To assess the validity of the results we calculate the accuracy and precisionmetrics for the detection. Accuracy was defined using the true and falsepositive and negative results in the following way: accuracy = (tp+tn)/(tp+fp + fn + tn). Similarly, precision was defined: precision = tp/(tp + fp).For more details refer to [129, p. 361]. With this setup, we manage to reachan accuracy of 92.5% and a precision of 99.7%.

5.7 Conclusion

In this chapter we presented an on-line algorithm for anomaly detection indata streams with new samples arriving in real time. An on-line approachis especially important with dynamically changing datasets, for example,finding attacks in Web traffic logs. In such datasets, the profile of what isconsidered normal data can change frequently over time. We use a slidingwindow approach combined with dimensionality reduction for each window ofthe samples. Algorithm 5.5.1 provides an efficient method for projecting thedataset into the embedding space where the anomaly detection is done. Thismethodology could be applied to other evolving datasets, including socialnetwork streams and news feeds. The theoretical bounds of the error rateand number of iterations for the presented algorithms need further study.

Conclusions

Several matrix factorization methods were enhanced and expanded in thisthesis. They were eused for various machine learning tasks such as predic-tion, dictionary learning, updating a training profile and anomaly detection.In Chapter 1, which is based on [22, 23], we used the randomized inter-polative decomposition method to accelerate a tracking algorithm known asparticle filter. We showed that by selecting only a few particles and com-puting their weights, we can maintain the tracking error rate in each cycleof the PF algorithm while estimating the weights of the rest of the particles.Using this approach, we increased the number of particles to maintain a lowtracking error while limiting the tracking computational effort. The proposedalgorithm significantly reduces the computational load, and we compared ourresults with similar tracking methods to show our superiority. We applied ourmethod on both simulated and real data such as objects tracking in videosequences. To further accelerate the algorithm, we used another methodfor selecting the particles by developing a weighted version for the farthestpoint selection algorithm. It replaces the ID algorithm in the selection step.An open question we encountered during our research was the relationshipbetween the ID and FPS algorithms. Our experiments showed that bothalgorithms return similar sets, and in future work we intend to study thisconnection that probably provide evidence that the FPS can approximatethe randomized ID selection with a bounded error rate.

In Chapter 2, which is based on [24], we developed a randomized algo-rithm for low-rank LU approximation. Several error bounds for the algo-rithm’s approximations were proved. To prove these bounds, recent resultsfrom random matrix theory related to sub-Gaussian matrices were used. Thealgorithm, which can utilize sparse structures, was fully parallelized and thuscould utilize efficiently GPUs. We provided numerical examples that illus-trated the performance of the algorithm and compared it to other decompo-sition methods. The LU factorization has many applications and is widelyused today in scientific computing. Our algorithm can offer a bounded errorand a low computational time, and therefore serves as a compelling alter-

117

118

native for various systems. The randomized LU algorithm uses a Gaussianrandom matrix in its projection step. In future work, we plan to experimentwith different random structured matrices that use orthogonal transformsthat can reduce the computational cost of this step even further. Such struc-tured matrices can be, for example, the discrete Fourier transform and theWalsh-Hadamard transform [72, 9, 78].

In Chapter 3, the randomized LU algorithm was used as a basis for theconstruction of a content-based dictionary suited for classification tasks. Weaddressed the problem of file type identification, which is a fundamental taskin digital security performed nowadays by firewalls and anti-virus systems.We proposed a content-based method that detects file types that dependsneither on file extension nor on metadata of the file. Such an approachis harder to deceive, and we showed that only a few file fragments from awhole file are needed for a successful classification. Based on the constructeddictionaries, we also showed that the proposed method can effectively identifyexecution code fragments in PDF files.

In Chapter 4, which is based on [25], we develop an algorithm for incre-mental update of a training profile. If the training profile is extracted froman evolving dynamic dataset, it has to be updated frequently as some fea-tures of the training dataset are changed over time. The incremental updatealgorithm solves this problem by efficiently updating the profile instead of ex-ecuting the entire training computation. In many algorithms for clusteringand classification, a low-dimensional representation of the affinity (kernel)graph of the embedded training dataset is computed. Then, this embeddingis used for classifying newly arrived data points that didn’t participate inthe training process. We present methods for updating such embeddings ofthe training datasets in an incremental way without the need to perform theentire computation due to changes in a small number of the training samples.

In Chapter 5, which is based on [26], we used this algorithm for ananomaly detection application for Web traffic data. We presented a methodfor updating the training profile efficiently by a sliding window algorithm forprocessing the data incrementally. To achieve that, the data was modeledby a kernel method that included spectral decomposition. We demonstratedthe algorithm using Web server logs where an actual intrusion attack wasknown to occur.

The increase in computing power and data explosion we see today enablesus to harness the power offered by these methods to unlock new secretsand discoveries in many data-driven domains. Matrix factorization methodsare efficient, scalable, robust, and applicable to many problems. This factserves as solid evidence of their power and relevance in the current datascientist’s toolbox. Our contributions in both expanding the theory of matrix

119

factorization and applying them to solving new problems strengthen thisclaim even further.

Bibliography

[1] J. Gantz and D. Reinsel, “The digital universe in 2020: Big data, biggerdigital shadows, and biggest growth in the far east,” IDC iView: IDCAnalyze the Future, 2012.

[2] G. Bell, T. Hey, and A. Szalay, “Beyond the data deluge,” Science,vol. 323, no. 5919, pp. 1297–1298, 2009.

[3] G. Stewart, “The decompositional approach to matrix computation,”Computing in Science & Engineering, vol. 2, no. 1, pp. 50–59, 2000.

[4] H. Cheng, Z. Gimbutas, P. Martinsson, and V. Rokhlin, “On the com-pression of low rank matrices,” SIAM Journal on Scientific Computing,vol. 26, no. 4, pp. 1389–1404, 2005.

[5] G. H. Golub and C. F. Van Loan, Matrix computations, vol. 4. JohnHopkins University Press, 2012.

[6] P. Martinsson, V. Rokhlin, and M. Tygert, “A randomized algorithm forthe decomposition of matrices,” Applied and Computational HarmonicAnalysis, vol. 30, no. 1, pp. 47–68, 2011.

[7] N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure withrandomness: Probabilistic algorithms for constructing approximate ma-trix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011.

[8] K. L. Clarkson and D. P. Woodruff, “Low rank approximation and re-gression in input sparsity time,” in Proceedings of the 45th annual ACMsymposium on Symposium on theory of computing, pp. 81–90, ACM, 2013.

[9] H. Avron, P. Maymounkov, and S. Toledo, “Blendenpik: Supercharginglapack’s least-squares solver,” SIAM Journal on Scientific Computing,vol. 32, no. 3, pp. 1217–1236, 2010.

[10] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques forrecommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.

121

122

[11] M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for de-signing overcomplete dictionaries for sparse representation,” IEEE Trans.on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[12] M. Belkin and P. Niyogi, “Semi-supervised learning on riemannian man-ifolds,” Machine learning, vol. 56, no. 1-3, pp. 209–239, 2004.

[13] I. Jolliffe, Principal component analysis. Wiley Online Library, 2005.

[14] R. Coifman and S. Lafon, “Diffusion maps,” Applied and ComputationalHarmonic Analysis, vol. 21, no. 1, pp. 5–30, 2006.

[15] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Anal-ysis and an algorithm,” in Advances in Neural Information ProcessingSystems 14, pp. 849–856, MIT Press, 2001.

[16] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A sur-vey,” ACM Comput. Surv., vol. 41, no. 3, pp. 1–58, 2009.

[17] M. Elad and M. Aharon, “Image denoising via sparse and redundant rep-resentations over learned dictionaries,” Image Processing, IEEE Trans-actions on, vol. 15, no. 12, pp. 3736–3745, 2006.

[18] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citationranking: Bringing order to the web.,” 1999.

[19] J. Bennett and S. Lanning, “The netflix prize,” in Proceedings of KDDcup and workshop, vol. 2007, p. 35, 2007.

[20] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791,1999.

[21] W. Liu and N. Zheng, “Non-negative matrix factorization based meth-ods for object recognition,” Pattern Recognition Letters, vol. 25, no. 8,pp. 893–897, 2004.

[22] Y. Shmueli, G. Shabat, A. Bermanis, and A. Averbuch, “Acceleratingparticle filter using multiscale methods,” in Electrical & Electronics En-gineers in Israel (IEEEI), 2012 IEEE 27th Convention of, pp. 1–4, IEEE,2012.

[23] G. Shabat, Y. Shmueli, A. Bermanis, and A. Averbuch, “Acceleratingparticle filter using randomized multiscale and fast multipole type meth-ods,” Submitted to Pattern Analysis and Machine Intelligence, 2013.

123

[24] G. Shabat, Y. Shmueli, and A. Averbuch, “Randomized LU decomposi-tion,” Submitted to SIAM Journal on Scientific Computing, 2014.

[25] Y. Shmueli, G. Wolf, and A. Averbuch, “Updating kernel methods inspectral decomposition by affinity perturbations,” Linear Algebra and itsApplications, vol. 437, no. 6, pp. 1356 – 1365, 2012.

[26] Y. Shmueli, T. Sipola, G. Shabat, and A. Averbuch, “Using affinityperturbations to detect web traffic anomalies,” in SampTA 2013: 10thinternational conference on Sampling Theory and Applications, (Bremen,Germany), 2013.

[27] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorialon particle filters for online nonlinear/non-gaussian bayesian tracking,”IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188,2002.

[28] A. Bermanis, A. Averbuch, and R. Coifman, “Multiscale data samplingand function extension,” Applied and Computational Harmonic Analysis,vol. 34, pp. 15–29, 2013.

[29] C. Baker, The numerical treatment of integral equations, vol. 13. Claren-don press Oxford, 1977.

[30] B. Flannery, W. Press, S. Teukolsky, and W. Vetterling, “Numericalrecipes in C,” Press Syndicate of the University of Cambridge, New York,1992.

[31] C. E. Rasmussen, “Gaussian processes in machine learning,” in AdvancedLectures on Machine Learning, pp. 63–71, Springer, 2004.

[32] T. Gonzalez, “Clustering to minimize the maximum intercluster dis-tance,” Theoretical Computer Science, vol. 38, pp. 293–306, 1985.

[33] L. Greengard and V. Rokhlin, “A fast algorithm for particle simula-tions,” Journal of computational physics, vol. 73, no. 2, pp. 325–348,1987.

[34] N. Gupta, P. Mittal, S. Roy, S. Chaudhury, and S. Banerjee, “Developinga gesture-based interface,” Journal of the Institution of Electronics andTelecommunication Engineers, vol. 48, no. 3, pp. 237–244, 2002.

[35] K. Nummiaro, E. Koller-Meier, and L. Van Gool, “An adaptive color-based particle filter,” Image and Vision Computing, vol. 21, no. 1, pp. 99–110, 2003.

124

[36] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, “Robust monte carlolocalization for mobile robots,” Artificial intelligence, vol. 128, no. 1,pp. 99–141, 2001.

[37] X. Zhang, W. Hu, and S. Maybank, “A smarter particle filter,” Com-puter Vision–ACCV 2009, pp. 236–246, 2010.

[38] M. Pitt and N. Shephard, “Filtering via simulation: Auxiliary particlefilters,” J. of the American Statistical Association, pp. 590–599, 1999.

[39] J. Kotecha and P. Djuric, “Gaussian sum particle filtering,” IEEE Trans-actions on Signal Processing, vol. 51, no. 10, pp. 2602–2612, 2003.

[40] R. Van Der Merwe, A. Doucet, N. De Freitas, and E. Wan, “The un-scented particle filter,” Advances in Neural Information Processing Sys-tems, pp. 584–590, 2001.

[41] M. Isard and A. Blake, “Condensation-conditional density propagationfor visual tracking,” International journal of computer vision, vol. 29,no. 1, pp. 5–28, 1998.

[42] T.-J. Cham and J. M. Rehg, “A multiple hypothesis approach to fig-ure tracking,” in Computer Vision and Pattern Recognition, 1999. IEEEComputer Society Conference on., vol. 2, IEEE, 1999.

[43] R. Urtasun, D. J. Fleet, and P. Fua, “3d people tracking with gaussianprocess dynamical models,” in Computer Vision and Pattern Recognition,2006 IEEE Computer Society Conference on, vol. 1, pp. 238–245, IEEE,2006.

[44] K. Choo and D. J. Fleet, “People tracking using hybrid monte carlofiltering,” in Computer Vision, 2001. ICCV 2001. Proceedings. EighthIEEE International Conference on, vol. 2, pp. 321–328, IEEE, 2001.

[45] C. Sminchisescu and B. Triggs, “Kinematic jump processes for monoc-ular 3d human tracking,” in Computer Vision and Pattern Recognition,2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 1,pp. I–69, IEEE, 2003.

[46] J. Deutscher, A. Blake, and I. Reid, “Articulated body motion captureby annealed particle filtering,” in Computer Vision and Pattern Recogni-tion, 2000. Proceedings. IEEE Conference on, vol. 2, pp. 126–133, IEEE,2000.

125

[47] F. Wang and M. Lu, “Efficient visual tracking via hamiltonian montecarlo markov chain,” The Computer Journal, 2012.

[48] E. Wan, A. Doucet, R. van der Merwe, and N. de Freitas, “The unscentedparticle filter,” tech. rep., Technical report CUED/F-INFENG/TR380,Cambridge University, 2000.

[49] Y. Rui and Y. Chen, “Better proposal distributions: Object trackingusing unscented particle filter,” in Computer Vision and Pattern Recogni-tion, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer SocietyConference on, vol. 2, pp. II–786, IEEE, 2001.

[50] T. X. Han, H. Ning, and T. S. Huang, “Efficient nonparametric beliefpropagation with application to articulated body tracking,” in ComputerVision and Pattern Recognition, 2006 IEEE Computer Society Confer-ence on, vol. 1, pp. 214–221, IEEE, 2006.

[51] B. Han, Y. Zhu, D. Comaniciu, and L. S. Davis, “Visual tracking bycontinuous density propagation in sequential bayesian filtering frame-work,” Pattern Analysis and Machine Intelligence, IEEE Transactionson, vol. 31, no. 5, pp. 919–930, 2009.

[52] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman filter:Particle filters for tracking applications. Artech House Publishers, 2004.

[53] A. Doucet and A. Johansen, “A tutorial on particle filtering and smooth-ing: Fifteen years later,” Handbook of Nonlinear Filtering, pp. 656–704,2009.

[54] A. Bronstein, M. Bronstein, and R. Kimmel, Numerical geometry ofnon-rigid shapes. Springer-Verlag New York Inc, 2008.

[55] T. Feder and D. Greene, “Optimal algorithms for approximate cluster-ing,” in Proceedings of the twentieth annual ACM symposium on Theoryof computing, pp. 434–444, ACM, 1988.

[56] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis, “Improved fastgauss transform and efficient kernel density estimation,” in Ninth IEEEInternational Conference on Computer Vision, 2003., pp. 664–671, IEEE,2003.

[57] Y. Eldar, M. Lindenbaum, M. Porat, and Y. Zeevi, “The farthest pointstrategy for progressive image sampling,” IEEE Transactions on ImageProcessing, vol. 6, no. 9, pp. 1305–1315, 1997.

126

[58] Y. Rubner, C. Tomasi, and L. Guibas, “The earth mover’s distance asa metric for image retrieval,” International Journal of Computer Vision,vol. 40, no. 2, pp. 99–121, 2000.

[59] S. Avidan, D. Levi, A. Bar-Hillel, and S. Oron, “Locally orderless track-ing,” in 2012 IEEE Conference on Computer Vision and Pattern Recog-nition, pp. 1940–1947, IEEE, 2012.

[60] R. Mazumder, T. Hastie, and R. Tibshirani, “Spectral regularization al-gorithms for learning large incomplete matrices,” The Journal of MachineLearning Research, vol. 99, pp. 2287–2322, 2010.

[61] G. David, Anomaly Detection and Classification via Diffusion Processesin Hyper-Networks. PhD thesis, School of Computer Science, Tel AvivUniversity, March 2009.

[62] D. L. Donoho, “Compressed sensing,” Information Theory, IEEE Trans-actions on, vol. 52, no. 4, pp. 1289–1306, 2006.

[63] O. Schenk, K. Gartner, and W. Fichtner, “Efficient sparse lu factoriza-tion with left-right looking strategy on shared memory multiprocessors,”BIT Numerical Mathematics, vol. 40, no. 1, pp. 158–176, 2000.

[64] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. Liu,“A supernodal approach to sparse partial pivoting,” SIAM Journal onMatrix Analysis and Applications, vol. 20, no. 3, pp. 720–755, 1999.

[65] T. A. Davis and I. S. Duff, “An unsymmetric-pattern multifrontalmethod for sparse lu factorization,” SIAM Journal on Matrix Analysisand Applications, vol. 18, no. 1, pp. 140–158, 1997.

[66] D. Kirk, “Nvidia cuda software and gpu parallel computing architec-ture,” in ISMM, vol. 7, pp. 103–104, 2007.

[67] T. F. Chan, “Rank revealing QR factorizations,” Linear Algebra and ItsApplications, vol. 88, pp. 67–82, 1987.

[68] M. Gu and S. C. Eisenstat, “Efficient algorithms for computing a strongrank-revealing QR factorization,” SIAM Journal on Scientific Comput-ing, vol. 17, no. 4, pp. 848–869, 1996.

[69] C.-T. Pan, “On the existence and computation of rank-revealing LU fac-torizations,” Linear Algebra and its Applications, vol. 316, no. 1, pp. 199–222, 2000.

127

[70] L. Miranian and M. Gu, “Strong rank revealing lu factorizations,” Linearalgebra and its applications, vol. 367, pp. 1–16, 2003.

[71] P. Drineas, M. W. Mahoney, and S. Muthukrishnan, “Relative-error curmatrix decompositions,” SIAM Journal on Matrix Analysis and Applica-tions, vol. 30, no. 2, pp. 844–881, 2008.

[72] V. Rokhlin and M. Tygert, “A fast randomized algorithm for overde-termined linear least-squares regression,” Proceedings of the NationalAcademy of Sciences, vol. 105, no. 36, pp. 13212–13217, 2008.

[73] D. Achlioptas and F. Mcsherry, “Fast computation of low-rank matrixapproximations,” Journal of the ACM (JACM), vol. 54, no. 2, p. 9, 2007.

[74] K. L. Clarkson and D. P. Woodruff, “Numerical linear algebra in thestreaming model,” in Proceedings of the 41st annual ACM symposium onTheory of computing, pp. 205–214, ACM, 2009.

[75] A. Magen and A. Zouzias, “Low rank matrix-valued chernoff bounds andapproximate matrix multiplication,” in Proceedings of the Twenty-SecondAnnual ACM-SIAM Symposium on Discrete Algorithms, pp. 1422–1436,SIAM, 2011.

[76] A. Frieze, R. Kannan, and S. Vempala, “Fast monte-carlo algorithms forfinding low-rank approximations,” Journal of the ACM (JACM), vol. 51,no. 6, pp. 1025–1041, 2004.

[77] P. Drineas, R. Kannan, and M. W. Mahoney, “Fast monte carlo algo-rithms for matrices ii: Computing a low-rank approximation to a matrix,”SIAM Journal on Computing, vol. 36, no. 1, pp. 158–183, 2006.

[78] C. Boutsidis and A. Gittens, “Improved matrix algorithms via the sub-sampled randomized hadamard transform,” SIAM Journal on MatrixAnalysis and Applications, vol. 34, no. 3, pp. 1301–1340, 2013.

[79] R. Bhatia, Matrix analysis, vol. 169. Springer, 1997.

[80] H. H. Goldstine and J. Von Neumann, “Numerical inverting of matricesof high order. ii,” Proceedings of the American Mathematical Society,vol. 2, no. 2, pp. 188–202, 1951.

[81] L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna, “Four de-grees of separation,” in Proceedings of the 3rd Annual ACM Web ScienceConference, pp. 33–42, ACM, 2012.

128

[82] A. E. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann,“Smallest singular value of random matrices and geometry of randompolytopes,” Advances in Mathematics, vol. 195, no. 2, pp. 491–523, 2005.

[83] M. Rudelson and R. Vershynin, “Smallest singular value of a randomrectangular matrix,” Communications on Pure and Applied Mathematics,vol. 62, no. 12, pp. 1707–1739, 2009.

[84] A. Litvak and O. Rivasplata, “Smallest singular value of sparse randommatrices,” Stud. Math, vol. 212, pp. 195–218, 2010.

[85] L. N. Trefethen and R. S. Schreiber, “Average-case stability of gaussianelimination,” SIAM Journal on Matrix Analysis and Applications, vol. 11,no. 3, pp. 335–360, 1990.

[86] G. Stewart, “The triangular matrices of gaussian elimination and relateddecompositions,” Tech. Rep. TR-3533, Department of Computer Scienceand Institute for Advanced Computer Studies, University of Maryland,College Park, MD, 1995.

[87] Z. Chen and J. J. Dongarra, “Condition numbers of gaussian randommatrices,” SIAM Journal on Matrix Analysis and Applications, vol. 27,no. 3, pp. 603–620, 2005.

[88] R. Larsen, “Lanczos bidiagonalization with partial reorthogonalization,”Tech. Rep. DAIMI PB-357, Department of Computer Science, AarhusUniversity, 1998.

[89] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, “Ubicrawler: A scal-able fully distributed web crawler,” Software: Practice & Experience,vol. 34, no. 8, pp. 711–726, 2004.

[90] B. K. Natarajan, “Sparse approximate solutions to linear systems,”SIAM journal on computing, vol. 24, no. 2, pp. 227–234, 1995.

[91] J. Tropp, “Greed is good: algorithmic results for sparse approximation,”Information Theory, IEEE Transactions on, vol. 50, no. 10, pp. 2231–2242, 2004.

[92] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decompositionby basis pursuit,” SIAM journal on scientific computing, vol. 20, no. 1,pp. 33–61, 1998.

129

[93] Q. Zhang and B. Li, “Discriminative k-svd for dictionary learning inface recognition,” in Computer Vision and Pattern Recognition (CVPR),2010 IEEE Conference on, pp. 2691–2698, IEEE, 2010.

[94] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robustface recognition via sparse representation,” Pattern Analysis and MachineIntelligence, IEEE Transactions on, vol. 31, no. 2, pp. 210–227, 2009.

[95] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry, “Feature selection inface recognition: A sparse representation perspective,” submitted to IEEETransactions Pattern Analysis and Machine Intelligence, 2007.

[96] M. Yang, D. Zhang, and X. Feng, “Fisher discrimination dictionarylearning for sparse representation,” in Computer Vision (ICCV), 2011IEEE International Conference on, pp. 543–550, IEEE, 2011.

[97] D.-S. Pham and S. Venkatesh, “Joint learning and dictionary construc-tion for pattern recognition,” in Computer Vision and Pattern Recogni-tion, 2008. CVPR 2008. IEEE Conference on, pp. 1–8, IEEE, 2008.

[98] Z. Jiang, Z. Lin, and L. S. Davis, “Learning a discriminative dictionaryfor sparse coding via label consistent k-svd,” in Computer Vision andPattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1697–1704,IEEE, 2011.

[99] M. C. Amirani, M. Toorani, and S. Mihandoost, “Feature-based typeidentification of file fragments,” Security and Communication Networks,vol. 6, no. 1, pp. 115–128, 2013.

[100] W. C. Calhoun and D. Coles, “Predicting the types of file fragments,”Digital Investigation, vol. 5, pp. S14–S20, 2008.

[101] I. Ahmed, K.-s. Lhee, H. Shin, and M. Hong, “On improving the accu-racy and performance of content-based file type identification,” in Infor-mation Security and Privacy, pp. 44–59, Springer, 2009.

[102] I. Ahmed, K.-s. Lhee, H. Shin, and M. Hong, “Fast file-type identifica-tion,” in Proceedings of the 2010 ACM Symposium on Applied Computing,pp. 1601–1602, ACM, 2010.

[103] C. J. Veenman, “Statistical disk cluster classification for file carving,”in Information Assurance and Security, 2007. IAS 2007. Third Interna-tional Symposium on, pp. 393–398, IEEE, 2007.

130

[104] M. Karresand and N. Shahmehri, “File type identification of data frag-ments by their binary structure,” in Information Assurance Workshop,2006 IEEE, pp. 140–147, IEEE, 2006.

[105] R. F. Erbacher and J. Mulholland, “Identification and localization ofdata types within large-scale file systems,” in Systematic Approaches toDigital Forensic Engineering, 2007. SADFE 2007. Second InternationalWorkshop on, pp. 55–70, IEEE, 2007.

[106] W.-J. Li, K. Wang, S. J. Stolfo, and B. Herzog, “Fileprints: Identifyingfile types by n-gram analysis,” in Information Assurance Workshop, 2005.IAW’05. Proceedings from the Sixth Annual IEEE SMC, pp. 64–71, IEEE,2005.

[107] M. McDaniel and M. H. Heydari, “Content based file type detectionalgorithms,” in System Sciences, 2003. Proceedings of the 36th AnnualHawaii International Conference on, pp. 10–pp, IEEE, 2003.

[108] S. Lafon, Diffusion Maps and Geometric Harmonics. PhD thesis, YaleUniversity, May 2004.

[109] R. Coifman and S. Lafon, “Geometric harmonics: A novel tool formultiscale out-of-sample extension of empirical functions,” Applied andComputational Harmonic Analysis, vol. 21, no. 1, pp. 31–52, 2006.

[110] B. Flannery, W. Press, S. Teukolsky, and W. Vetterling, Numericalrecipes in C. Cambridge University Press, 1992.

[111] G. Stewart and J. Sun, Matrix perturbation theory, vol. 175. Academicpress New York, 1990.

[112] A. Langville and C. Meyer, “Updating markov chains with an eye ongoogle’s pagerank,” SIAM journal on matrix analysis and applications,vol. 27, no. 4, pp. 968–987, 2006.

[113] A. Langville and C. Meyer, “A survey of eigenvector methods for webinformation retrieval,” SIAM review, pp. 135–161, 2005.

[114] S. Kamvar, T. Haveliwala, and G. Golub, “Adaptive methods for thecomputation of pagerank,” Linear Algebra and its Applications, vol. 386,pp. 51–65, 2004.

[115] C. Meyer and G. Stewart, “Derivatives and perturbations of eigenvec-tors,” SIAM Journal on Numerical Analysis, pp. 679–691, 1988.

131

[116] C. Lanczos, An iteration method for the solution of the eigenvalue prob-lem of linear differential and integral operators. United States Governm.Pr. Office, 1950.

[117] O. Kouropteva, O. Okun, and M. Pietikainen, “Incremental locallylinear embedding algorithm,” Image Analysis, pp. 145–159, 2005.

[118] M. Law and A. Jain, “Incremental nonlinear dimensionality reductionby manifold learning,” Pattern Analysis and Machine Intelligence, IEEETransactions on, vol. 28, no. 3, pp. 377–391, 2006.

[119] G. David and A. Averbuch, “Hierarchical data organization, clusteringand denoising via localized diffusion folders,” Applied and ComputationalHarmonic Analysis, vol. 33, no. 1, pp. 1–23, 2012.

[120] C. Kruegel and G. Vigna, “Anomaly detection of web-based attacks,”in Proceedings of the 10th ACM conference on Computer and communi-cations security, pp. 251–261, ACM, 2003.

[121] N. Hubballi, S. Biswas, and S. Nandi, “Layered higher order n-gramsfor hardening payload based anomaly intrusion detection,” in Availability,Reliability, and Security, 2010. ARES’10 International Conference on,pp. 321–326, IEEE, 2010.

[122] H. Ringberg, A. Soule, J. Rexford, and C. Diot, “Sensitivity of PCA fortraffic anomaly detection,” ACM SIGMETRICS Performance EvaluationReview, vol. 35, no. 1, pp. 109–120, 2007.

[123] C. Callegari, L. Gazzarrini, S. Giordano, M. Pagano, and T. Pepe,“A novel PCA-based network anomaly detection,” in Communications(ICC), 2011 IEEE International Conference on, pp. 1–5, IEEE, 2011.

[124] T. Sipola, A. Juvonen, and J. Lehtonen, “Anomaly detection fromnetwork logs using diffusion maps,” in Engineering Applications of Neu-ral Networks (L. Iliadis and C. Jayne, eds.), vol. 363 of IFIP Advancesin Information and Communication Technology, pp. 172–181, SpringerBoston, 2011.

[125] T. Sipola, A. Juvonen, and J. Lehtonen, “Dimensionality reductionframework for detecting anomalies from network logs,” Engineering In-telligent Systems, 2012. forthcoming.

[126] S. Lafon and A. B. Lee, “Diffusion maps and coarse-graining: a uni-fied framework for dimensionality reduction, graph partitioning, and data

132

set parameterization,” Pattern Analysis and Machine Intelligence, IEEETransactions on, vol. 28, no. 9, pp. 1393 –1403, 2006.

[127] U. von Luxburg, “A tutorial on spectral clustering,” Statistics andComputing, vol. 17, pp. 395–416, 2007.

[128] J. Turkka, T. Ristaniemi, G. David, and A. Averbuch, “Anomaly de-tection framework for tracing problems in radio networks,” in Proc. toICN 2011, 2011.

[129] J. Han and M. Kamber, Data mining: concepts and techniques. MorganKaufmann, 2006.

. ההטלה בשלב דלילות אקראיות במטריצות שימוש ידי עלמציגים כיצד ניתן להאיץ אותו

על מנת להוכיח את נכונותם .האלגוריתם שלאנו מפתחים מספר חסמים לשגיאת הקירוב

ת מטריצות אקראיות ימתאורי האחרונות תוצאותשל חסמים אלו, אנו עושים שימוש ב

דליל ה המבנהמנצל את , שפותח האלגוריתם. גאוסיות דלילות-ודלילות וכן מטריצות תת

של מאיצים ארכיטקטורותגבוהה וליישום על ולכן ניתן למיקבול ברמה של המטריצה

האלגוריתם של הביצועים את הממחישותאנו מציגים דוגמאות מעשיות . (GPUגרפיים )

random -ו ,אקראי SVD פירוק כגון, אחרותפירוק לשיטותומשווים אותו

interpolative decomposition .פרק ב האלגוריתם של ויעילות את מדגימים אנו

. האחרונות שניםטכניקות לבניית מילון נחקרו רבות ב .סיווג לטובת בניית מילון השלישי,

משתמשים בהם כדי מכן ולאחר אימון נתונימ מילוניםאנו לומדים מספר , המוצגת בשיטה

LU פירוק על מבוסס המוצג בפרק זה מילון למידת אלגוריתם. נתונים חדשים לסווג

, מדרגה נמוכה מילון בונה אקראי LU פירוק, קיימות לשיטות בניגוד. אותו פיתחנו אקראי

מדגימים אנו. חדשים אותות של םסיווגאת ו הבנייה תהליך את גם מפשט אשרדבר

דיגיטלית באבטחה בסיסית משימה זוהי. קובץ סוג יהויבעיית ז על שלנו האלגוריתם

דואר ושרתי וירוס-אנטי מערכות, אש-חומות כגון שונות מערכות ידי על המבוצעת

להסתמך במקום, כנם הגולמי של הקבציםות על המבוססת שיטה מציעים אנו. אלקטרוני

קשה יותר היא כזו גישה. או נתוני מסגרת אחרים בו , שם הקובץהקובץ סיומת על

יווגנדרשים רק מספר מועט של מסגרות מידע מתוך הקובץ לס כי מראים ואנו, למעקף

בתחום זיהוי לאחרונה שנעשהדומים למחקריםאלו תוצאות את משווים אנו. מוצלח

אנו, כן כמו. שלה ביצועה וזמנישיטה היכולות הסיווג של את להעריך מנת עלהקבצים,

ידי עלוזאת , זדוניים קוד קטעי המכילים PDF קבצי הצביע עלל יכולה זו שיטה כי מראים

.שונים מילונים באמצעות תכנו תא לשחזרניסיון ו, PDF-ה קובץ תוכן בדיקת

(IIIהפרעות מבוססות זיקה )חלק

אימון ראשוני המתבצע פעם שלבבדרך כלל מכילים למידה חישובית מבוססי אלגוריתמים

כאשר. ענק מטריצותבעיבוד כרוך הוא שכן חישוביתזה יקר אחת בלבד. שלב האימון

אימון באופן תדיר ליך התנדרש לבצע את , אימון מתבסס על מידע המשתנה בציר הזמןה

אנו[, 02המבוסס על ]תזה, הבפרק הרביעי של על מנת שפרופיל האימון יהיה מעודכן.

כאשר בלי לחשב אותו מחדש,מו ביעילות זהכאימון פרופילראים כיצד ניתן לעדכן מ

גרעין מיוצגים על ידי הנתוניםאנו מייצגים את . ללא הרףומשתנים תפתחיםמ הנתונים

((kernel גרףמימד נמוך למ ייצוגמחושב ובעזרתו ספקטרלי פירוקבוצע לו מאשר

חדשים נתונים לסיווגבייצוג זה שימוש נעשה מכן לאחר .דגימות האימון בין המרחקים

שיטות מציגים אנו. , על ידי חישוב מרחקן מדגימון האימון במימד הנמוךהגיעו עתה שזה

הפירוק חישוב מלא של לבצע צורך ללא, מסוג זה פרופיל ממימד נמוךהדרגתי של לעדכון

פרק ב. אימוןקטנים בדגימות השינויים מידול העדכונים כהספקטרלי, ותוך הסתמכות על

לבנות פרופיל אימון כדי זו בשיטה משתמשים אנו [,02המבוסס על ] החמישי של התזה,

את מדגימים אנו חריגות בתעבורה זו.של תעבורת רשת המתעדכן באופן תדיר, ומציאת

שרת ומראים בוצעו ניסיונות חדירה ל שבו אינטרנט שרתל שתרתעבורת על האלגוריתם

ניסיונות חדירה אלו כחריגות, על ידי עדכון שוטף של פרופיל האימון להצביע עלכי ניתן

ושימוש בו לטובת הסיווג.

מבנה ותרומת התזה

וכיצד ניתן להשתמש בהן מטריצות רוקיפל שיטות מספר של המאפיינים את בוחנת זה תזה

החלק. חלקים שלושההתזה מחולקת ל. ממוחשבתהלמידה המתחום בעיות לפתור מנת על

פרוק באמצעות להאיץ את חישובה ניתן זה ואיך חלקיקים מסנן לבעיית מתייחס הראשון

בניית מילון ם של יישומי ומספר LU לפירוק אקראית גרסה מציג השני החלק. מטריצות

משתנה מטריצה של רוקיהפ לחישוב פתרון מספק השלישי החלק. זה פירוק על מבוססה

סעיף. על שיטה זו המבוסס האינטרנט תעבורתב אנומליות זיהויל אלגוריתם ומציגבזמן,

. אלו מחלקים אחד כל של קצרה סקירה מציג זה

I)מידה )חלק -מסנן חלקיקים מרובה קנה

.ותליניארי שאינםקלט תצפיות באמצעות מטרה אחר למעקב שיטה הינה חלקיקים מסנן

של מחזור בכל. כלשהו קיקים בעלי מצב נתוןהמטרה מיוצגת על ידי אוסף של חל

וכן את , ע"פ חוקים מוגדרים מראש,חלקיקיםה שלמחשבים את מצבם החדש , האלגוריתם

תוך שימוש בתצפיות הקלט.חלקיק לייצג את המטרה בצורה נאמנה, כל של ההסתברות

גורםדבר ש, שלהם ההסתברויותעל בסיס חלקיקיםמיד לאחר מכן, דוגמים מחדש את ה

עלות חישוב זה היא .ולהתקדם למחזורים הבאים להתפתח יותר הטובים חלקיקיםל

משתמשים במספר רב של חלקיקים כפי שנדרש בכדי לעקוב כאשר במיוחדמשמעותית

מציגים אנו [,02, 00המבוסס על ] בפרק הראשון של תזה זו, ביל.אחר מספר מטרות במק

ידי על נעשהה העקיבה חישוב את איץלה יבכד multiscale)מידה )-מרובת קנה שיטה

מחזור בכלמשקלו של כל חלקיק את שמחשבת, המקובלת לשיטה בניגוד. חלקיקיםה מסנן

בשיטות שימוש תוך המקור חלקיקיומייצגת של קטנה קבוצה דוגמים אנו, האלגוריתם של

לאחר מכן אנו מרחיבים את פונקציית . מטריצות, ומחשבים את המשקל רק עבורם פירוקל

, משנהה בקבוצת יםכלול שאינם החלקיקים שאר כל עבורהמשקל ומשערכים אותה

באופן מפחית המוצע האלגוריתם ומסיקים ממנה את פונקצית ההסתברות המבוקשת.

אנו ,(FGT)מהיר גאוסטרנספורם ב שימוש ידי על. החישובי עומסהאת משמעותי

ים אותה לזמן מקבוצת החלקיקים המייצגים ומצמצמפשטים גם את שלב הבחירה של

ו במעקב אחר אובייקטים את האפקטיביות של שיטה ז מדגימים אנו ליניארי בגודל הקלט.

מסנן אלגוריתם האצתב היא זה מחקר של העיקרית תרומתו טוני וידאו.שונים בסר

מצליחים אנחנו. מטריצות, דבר שלא נעשה עד היום פרוקל שיטות באמצעותהחלקיקים

. רמת עקיבה אותה על שמירה תוך, במקרים מסויימים פי עשרה עד האלגוריתם את להאיץ

(IIאקראי )חלק LUפירוק

שמחשב אקראיו מהיר אלגוריתםאנו מציגים [, 02המבוסס על ]בפרק השני של התזה,

לחשבכדי אקראיות הטלה בטכניקות משתמש האלגוריתם. מדרגה נמוכה LU פירוק

ואנו לולמקב ניתןשפותח אלגוריתםה. גדולותקלט מטריצותל דרגת נמוך קירוב ביעילות

המטריצה החדשה כהפרעות שחלו במטריצה שיטה זו מתבססת על מידול שוב את הפירוק.

על פירוק פרופיל אימון המבוססעל מנת לבחון את יעילותה של השיטה לעדכון המקורית.

זו בעיה. באינטרנט נתונים בתעבורת חריגות לזהות כדימטריצות, אנו משתמשים בה

יםלעת שמשתניםנדרש להתמודד עם מקורות מידע כאשר במיוחד חשובה להיות הופכת

גבוהה דיוק רמת על לשמור כדיב האימון לפרופיל תכופים עדכונים יםרשנד לכן, קרובות

.מסווגשל ה

ובפרט קלט , קלט נתוני של שונים סוגים ליישם ניתןאת פתרונות שאנו מציעים בתיזה זו,

את השיטות בוחנים אנו, זו בעבודה ( שאופייני למידע ממימד גבוה.sparseדליל )

, על מנת וידאו וסירטוני תמונותהקלטות של תעבורת רשת, , נתוני מדידהשפיתחנו על

שיטות במספר משתמשים אנולהדגים את יעילות השיטה ושימושיה הרחבים. זאת ועוד,

ולאפשר הבעיה דרישותל להתאים אותן ומשפרים אותם על מנת מטריצות פרוקל שונות

. במיוחדלעבוד ביעילות על מאגרי מידע גדולים לעהן

עבודות קודמות

שמשיםמ הם .האחרונות השנים בעשרותומגוונות פופולרית הפכו נתונים לניתוח שיטות

יידע ותובנות, מתן השערות לחלץ כדי ,נתוניםעיבוד, מידול, ושיערוך של , דגימה, לניקוי

נמצאות שונות וטכניקותות אלו, בבעי לטיפול רבות גישותקיימות . החלטות קבלתוכן ל

כבסיסשיטות מתמטיות לפירוק מטריצות משמשות . ובתעשיה במדעבשימוש שוטף

, המידע שברשותנו אלו בשיטות. נתונים ניתוחל אלגוריתמיםמוטמעות בו ,רבים למחקרים

שחולצו תכונותהגולמיות, דגימותה את לייצג שיכולהמאוכסן באמצעות מטריצת מידע

את לגורמיםאנו מפרקים . הדגימות בין מעבר הסתברויותבין דגימות או דמיון, מהמידע

, ומייצגים אותה כמכפלה של מטריצות מדרגה נמוכה או בעלות מבנה ייחודי המטריצה

. פירוק זה מאפשר חשיפה של מאפיינים )מטריצות משולשות, אורטוגונליות וכו'( אחר

חלקות, והצפה של חלוקה למ, דומיננטיות דגימותהצבעה על גוןשל המידע כ חשובים

eigenvalueות הנמצאות בשימוש נרחב כיום הן: חריגות. שיטות נפוצות לפירוק מטריצ

decomposition, singular value decomposition ,non-negative matrix

factorization , interpolative decomposition, QR decomposition וLU

decomposition. לחלק משיטות אלו קיימות גרסאות אקראיות וכן מימושים המחשבים

פתרון מקורב לפירוק המטריצה, ומסוגלים להתמודד עם מספקים הםקירוב מדרגה נמוכה.

לא או חסרים במיוחד, מטריצות עם אברים גדולות מטריצותב טיפול אתגרים שונים כגון

חדשות מחשב ארכיטקטורות ניצולכן ו ,הקלט נתוניכמות מעברים מינימלית על , מדויקים

בעיות של רחב מגווןניתן לפתור כיום, . ענןמבוסס ועיבוד( GPU)מאיצים גרפיים כגון

בעיות, רגרסיה בעיותליניאריות, משוואות מערכת פתרון: מטריצות פירוק באמצעות

, הורדת ( PCA)רכיבים מרכזי ניתוחחישובית, למידה, מילון בניית, שיתופיות סינון

דוגמאות .ודגימהניקוי רעשים , דחיסהמימדים, סיווג למחלקות, גילוי חריגות,

האלגוריתם של אשר נפתרו תוך שימוש בפירוק מטריצות הן בעיות של המפורסמות

Google לדירוג תוצאות חיפוש (page rankה ,)לשיפור נטפליקסחברת של אתגר

אלו דוגמאות. פניםומערכות לזיהוי אותיות, ספרות ותווי לסרטים שלהם ההמלצה מערכת

אלו רעיונותמרחיבים אנו, זו בתזה הכוח הטמון בשימוש בפירוק מטריצות. את ממחישות

.חדשות בעיות לפתור כדי בהם ושימוש מטריצות פרוקל שיטות מספרפיתוח ידי על

תקציר

מידעה תטכנולוגי בעידן עיקריות מגמות לשתי עדים אנו, האחרונים העשורים במהלך

נוצרתר שא נתוניםהעלייה הגוברת בכמות ה היא הראשונה המגמה. מדעיה מחשובוה

היקום הדיגיטלי גדל בקצב אקספוננציאלי ומחקרים . האנושותכלל ידי על ונאגרת

כמות הביטים ביקום הדיגיטלי תהיה כמספר הכוכבים בשמיים. 0202מעריכים כי עד שנת

מכשיריםכמות ב גידולכן הו, דיגיטליות רשתות, תקשורת מערכות, מחשבים שלהמצאתם

, טקסטבתצורות של יםנתונ שלעקיפה ישירה ו ליצירה גורמיםכל אלו , סביבנו דיגיטליים

, שצילמנו בעברש ממה תמונות יותר מצלמים אנחנו. וידאושמע וסירטוני הקלטות ,תמונות

מגדילים באופן תדיר ו, והודעות מסמכיםמקליטים יותר ויותר סרטוני וידאו, כותבים יותר

מערכות ,בנוסף. הקיברנטי מרחבבנוכחותנו את כמות המידע הדיגיטלי השייך לנו ואת

. שלנו הסביבה של לצלםולהקליט , מדידות מסוגים שונים לייצר מתוכנתות סביבנומחשב

ואחסון הקלטה טכנולוגיות בזכות כיום לשמור ניתןאת הכמות העצומה של מידע מסוג זה,

הפיתוח היא השנייה המגמה. עתידי ועיבוד לניתוח זמינים אותם הופךגם ש מה, מתקדמות

הוצאת תובנות ומסקנות ממאגרי נתונים ו מידע לאחזורשונות חישוביות שיטות שלהמואץ

אלו תוך נתונים מאגרי את ולנתח לעבדגדולים. קיימת מגמת שיפור מתמיד ביכולת

בעיקר מונעות אלו יכולות. מינימאלית אנושית תערבותתחת הו מחשבשימוש במערכות

מחשב למערכות המאפשרים חישובית,למידה ואלגוריתמי מתמטיות שיטות יתוחבעזרת פ

שיטות למיצוי מידע מסוגים שונים . החבויות בו תובנות את עבד את המידע ולספקל

, ביטחון, כלכלה, ביולוגיה, רובוטיקה כגון חיינו של תחום בכל כמעטמשולבות היום

. חברתיות ורשתות פרסום, תקשורת

פתח, על מנת למטריצות פרוקל שיטות המכונים מתמטיים כלים סט רותמים אנו, זו בתזה

רוקיפ. ותחריג ואיתור משימות של סיווג, דירוג, דגימה, עקיבהל למידה אלגוריתמי

ותבעל מטריצות שלהנה פעולה מתמטית בה אנו מבטאים מטריצת קלט, כמכפלה מטריצה

עלות מטריצות משולשיות, אורטוגונליות, אלכסוניות, דלילות או ב כגוןמוגדרות תכונות

פרוקל כלים באמצעות אותן ופותרים שונות בעיות שלושחוקרים אנו דרגה נמוכה.

תוך ,חלקיקים מסנן המכונה יבהעק אלגוריתםהאצת היא הראשונה הבעיה. מטריצות

את יםשפרמ אנו, באופן יעיל הקלט נתוני דגימת ידי על. מטריצות רוקיפל בשיטות שימוש

משתמשים אנו מכן לאחר. לפעולתו הנדרש החישובזמן מפחיתים את ו האלגוריתם

טובת שיערוך הפרמטרים הנדרשים ל( multiscaleמידה )-מרובות קנה תובשיט

עקיבה גבוהה, וביעילות דיוקרמת על לשמור לנו תאפשר. טכניקה זו מבאלגוריתם

יעילה של בנייה היאבתזה זו, חוקריםאותה אנו השנייה הבעיה. חישובית טובה יותר

אנו. יםחדש נתונים יקלטעל מנת לזהות ולסווג בו , ושימושתוך סט נתוני קלטמ מילון

המידע וליצר ללמוד כדיחדש לפירוק מטריצות ומשתמשים בו אקראי אלגוריתם מפתחים

אנו .LUהידועה כפירוק פירוק לשיטת דרגה-נמוך קירוב מספק האלגוריתם .מילוןעבורו

,אקראיות ומטריצות תמונות, דלילות מטריצות על האלגוריתם של היעילות את בוחנים

לאחר מכן אנו בונים שיטה לסיווג קבצים על פי תוכנם, בעזרת למידה של המילון ו

לבצע בלימ , וזאתקייםמטריצות רוקיפמתמקדת בעדכון של השלישית הבעיה. המתאים

תמצית

מתקדמות שיטות פיתוח מחייב ,האנושות ידי עלהנוצרת תוניםכמות הנהמתמיד בהגידול

, ידע ולחלץ מהם גולמיים, לניתוח מידע אשר יכולות לעבד כמות מאסיבית של נתונים

הן ,אלו אתגרים עםקבוצת כלים מתמטיים חשובה להתמודדות . ומסקנות חבויות תובנות

מאפשרות פתרון בעיות ו, נרחב בשימוששיטות אלו נמצאות כיום . מטריצות פרוקל שיטות

המבוססות שונות ותשיט שלוש מציגים אנו, זו בתזה. כמותיים נתונים ניתוחרבות בתחום

מאגרי מידע חריגות ב וזיהוימידע סיווג, דגימה של בעיות לפתור כדיב מטריצות פרוקעל

. גדולים במיוחד

מידה-חלוקת המידע למספר שכבות קנה על המבוססת שיטה היא הראשונה השיטה

((multiscale ,חלקיקים מסנןהידוע בשם אלגוריתםבעזרתה ניתן להאיץ אשר שונות

(particle filter אלגוריתם זה מאפשר .)עלמתבסס וה של מטרה כלשהי, מצב אחר מעקב

מסנן .הסתברותיים , וייצוג המטרה כאוסף של חלקיקיםותליניארי שאינןקלט תצפיות

בשיטה .המערכות אבטחו "מ, רובוטיםבשימוש נרחב במערכות מככיום נמצא חלקיקים

קבוצה-תת דוגמים אנו, ישיר באופן החלקיקים של המשקללחשב את במקום ,המוצגת

מפעילים אנו, אחר כך. מטריצות רוקיפל בשיטות שימוש תוך המקור מחלקיקימייצגת

אנו . הנותרים החלקיקים עבור הצפיפות תיפונקצי את שחזרשיטת שיערוך על מנת ל

בעקיבה אחר מטרות בסרטוני וידאו מסוגים שונים.שיטה השל האת יעילות מדגימים

פירוקבאופן יעיל ר מבצעשאאקראי אלגוריתם היאהמוצגת בתזה, השנייה השיטה

random) תאקראי הטלה תבטכניק משתמש האלגוריתם .LUמטריצות בשם פירוק

projection ) פירוק ביעילות לחשבעל מנתLU גדולות מטריצות עבור ממימד נמוך .

בשלב דלילות אקראיות במטריצות שימוש ידי עללהאיץ ו למקבל ניתן זה אלגוריתם

תוצאותה מציגים מספר חסמי שגיאה עבור האלגוריתם ומוכיחים את נכונותם.אנו . הטלהה

האלגוריתם משפר שיטות אקראיות אחרות לפירוק מטריצות.כי מראות מוצגות בתזה זו ה

ילוןהאלגוריתם מחשב מ. מילון בניית של בעיה לפתרוןבנוסף, אנו מיישמים אלגוריתם זה

בלבד. מילון זה גם מאפשר לזהות ומתבסס על תוכנםעל פי סוגם, בציםק לסיווג המשמש

ריצה.-זדוניים המכילים קוד PDFקבצי

פרופיל של קייםספקטרלי רוקיפ עדכןהמאפשר ל אלגוריתם היא השלישית השיטה

משתנים על הקלט נתוני כאשר. עבור מטריצת קלט המשתנה על ציר הזמן נתון אימונים

חשב מחדש את פרופיל האימון ל במקום פרופיל האימון.פני הזמן, נדרש עדכון שוטף של

המתרחשות הפרעותפרופיל ומתבסס על מידול הה את מעדכן האלגוריתם, שוב ושוב

לעדכןהמסוגל ותחריג זיהויל אלגוריתם מפתחים אנו, זו שיטה באמצעות. הקלט בנתוני

תעבורתהקלטות של שיטה זו על של הומדגימים יעילות ,בזמן ריצה הסיווג מודל את

וזיהוי פעילות חריגה. אינטרנטהרשת מ נתונים

ניתן להרחיב וכיצדמדגימות כוחן שיטות לפירוק מטריצות זו בתזה תגוהמוצ השיטות

.נתוניםכמותי של ניתוח בתחוםולשפר אותן, על מנת לפתור סט נרחב של בעיות

מדעים מדוייקים על שם ריימונד ובברלי סאקלרהפקולטה ל

מדעי המחשב על שם בלבטניקבית הספר ל

בעזרת בנפח גבוהעיבוד מידע

מטריצות ישיטות מבוססות פירוק

לשם קבלת תוארחיבור

"דוקטור לפילוסופיה"

מאת

יניב שמואלי

החיבור בוצע בהנחייתו של

פרופ' אמיר אורבוך

הוגש לסנאט אוניברסיטת תל אביב

תשע"ד ,אלול

blavatnik school of computer science matrix factorization ... · blavatnik school of computer...

Documents