review linearalgebra mle scholar - purdue university · •linear algebra •using python in the...

Data Mining CS57300 Purdue University Jan 11, 2018 Bruno Ribeiro 1


Post on 02-May-2020


Page 1: Review LinearAlgebra MLE Scholar - Purdue University · •Linear Algebra •Using Python in the Scholar cluster 3. Probability and Statistical Estimation 4. Populations and samples

Data Mining

CS57300, Purdue University

Jan 11, 2018

Bruno Ribeiro


The Need for Formalism in Data Mining

• A woman is leading an environmental protest outside PMU today

• What is more likely?

a) That she is an investment banker?

b) That she is an investment banker studying Environmental Engineering at Purdue?

$$P[A] = \sum_{b \in B} P[A, b]$$
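The marginalization rule settles the banker question: P[A] is a sum of non-negative joint terms, so no single joint event can be more likely than A itself. A minimal sketch, assuming made-up joint probabilities (the numbers and category names below are illustrative, not from the slides):

```python
# Hypothetical joint probabilities P[A, b]; the numbers are illustrative only.
# A = "she is an investment banker"; b ranges over what she studies.
joint = {
    "environmental engineering": 0.001,
    "another major": 0.004,
    "not a student": 0.045,
}

# Marginalize: P[A] = sum over b of P[A, b]
p_banker = sum(joint.values())

# Every term in the sum is non-negative, so the joint event (option b)
# can never be more likely than the marginal event (option a).
p_banker_and_env_eng = joint["environmental engineering"]
print(f"P[A] = {p_banker:.3f}, P[A, b] = {p_banker_and_env_eng:.3f}")
```

Whatever values are plugged in, option (a) is at least as likely as option (b).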


Goals today

• Review some basic concepts

• Statistical Inference

• Linear Algebra

• Using Python in the Scholar cluster


Probability and Statistical Estimation


Populations and samples

• In data mining we often work with a sample of data from the population of interest

• Estimation techniques allow inferences about population properties from sample data

• If we had the population we could calculate the properties of interest


Estimating characteristics from sampling

Data processing inequality:

“No processing can increase the amount of statistical information already contained in the data”

[Figure: pipeline from the world to raw observations, features, and a characteristic summary, annotated with the data processing inequality]


Populations and samples

• Elementary units:

• Entities (e.g., persons, objects, events) that meet a set of specified criteria

• Example: All people who’ve purchased something at Walmart

• Population:

• Aggregate of elementary units (i.e., all items of interest)

• Sample:

• Sub-group of the population

• Serves as a reference group for estimating characteristics about the population and drawing conclusions


Sampling

• Sampling is the main technique employed for data selection

• It is often used for both the preliminary investigation of the data and the final data analysis

• Reasons to sample

• Obtaining the entire set of data of interest is too expensive or time consuming

• Processing the entire set of data of interest is too expensive or time consuming

• Note: Even if you use an entire dataset for analysis, you should be aware that most datasets are just samples of the entire population

Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.


Sampling …

• The key principle for effective sampling is the following:

• Using a sample will work almost as well as using the entire dataset, if the sample is representative for the task you are interested in

• A sample is representative if it has approximately the same property (of interest) as the original data

• Sample (observation) bias is a HUGE problem in data mining


Warning: Careful with Sampling (Observation) Biases


Warning: Observation Biases are Prevalent

• Your experience with buses at peak hours:

• Bus at 99% capacity at peak hours

• You wait on average 17 minutes and 9 seconds for it to arrive

• Transportation admin:

• buses at peak hour are at 60% capacity

• average bus inter-arrival time is 10 minutes?


Inspection Paradox

• 40 minutes / 4 buses = 10 min inter-arrival time

• How long do you wait?

• Assume you arrive uniformly during these 40 minutes

• What is the probability you will arrive within the 37-minute interval?

• P[Arrive in 37-min interval] = 37/40

• What is the average waiting time if you arrive in the 37-minute interval?

• E[Wait | Arrive in 37-min interval] = 37/2 = 18.5 minutes

[Figure: timeline of four buses; Bus 1, Bus 2, and Bus 3 arrive close together, followed by a 37-minute gap before Bus 4]
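The rider's 17 minutes and 9 seconds can be reproduced with a few lines of arithmetic. This is a sketch under an assumption the transcript does not state outright: the 40-minute cycle consists of three 1-minute gaps plus the 37-minute gap.

```python
# Assumed gaps (minutes) between consecutive buses over a 40-minute cycle:
# three 1-minute gaps and one 37-minute gap (our reading of the figure).
gaps = [1.0, 1.0, 1.0, 37.0]
total = sum(gaps)

# What the transportation admin sees: the average gap between buses.
mean_gap = total / len(gaps)

# What a rider sees: arrive uniformly in [0, total), land in a gap with
# probability gap/total, then wait gap/2 on average.
expected_wait = sum((g / total) * (g / 2.0) for g in gaps)

minutes = int(expected_wait)
seconds = round((expected_wait - minutes) * 60)
print(f"mean inter-arrival time: {mean_gap} min")
print(f"expected rider wait:     {minutes} min {seconds} s")
```

Long gaps are over-represented in the rider's experience because a uniformly arriving rider is more likely to land in them, which is the observation bias the slide warns about.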


End of Observation Bias Warning


Statistical inference

• Infer properties of an unknown distribution with sample data generated from that distribution

• Parameter estimation

• Infer the value of a population parameter based on a sample statistic (e.g., estimate the mean)

• Hypothesis testing

• Infer the answer to a question about a population parameter based on a sample statistic (e.g., is the mean non-zero?)


Common distributions

• Bernoulli

• Binomial

• Multinomial

• Poisson

• Normal


Bernoulli

• Binary variable (0/1) that takes the value of 1 with probability p

• E.g., Outcome of a fair coin toss is Bernoulli with p=0.5


Binomial

• Describes the number of successful outcomes in n independent Bernoulli(p) trials

• E.g., Number of heads in a sequence of 10 tosses of a fair coin is Binomial with n=10 and p=0.5
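The coin example can be checked numerically. A minimal sketch using the standard binomial PMF, P(X = k) = C(n, k) p^k (1 - p)^(n - k); the helper name `binomial_pmf` is ours, not from the slides:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Slide example: number of heads in 10 tosses of a fair coin.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(f"P(exactly 5 heads) = {pmf[5]:.4f}")
print(f"sum over all k     = {sum(pmf):.4f}")
```

With n = 1 the same function reduces to the Bernoulli distribution.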


Multinomial

• Generalization of binomial to k possible outcomes; outcome i has probability p_i of occurring

• E.g., Number of {outs, singles, doubles, triples, home runs} in a sequence of 10 times at bat is Multinomial

• Let X_i denote the number of times the i-th outcome occurs in n trials:
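The formula that followed this bullet did not survive in the transcript. The standard multinomial probability mass function, which is presumably what the slide showed, is:

```latex
P(X_1 = x_1, \ldots, X_k = x_k)
  = \frac{n!}{x_1! \, x_2! \cdots x_k!} \prod_{i=1}^{k} p_i^{x_i},
\qquad \sum_{i=1}^{k} x_i = n, \quad \sum_{i=1}^{k} p_i = 1
```

For k = 2 this reduces to the binomial distribution.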


Poisson

• Describes the number of successful outcomes occurring in a fixed interval of time (or space) if the “successes” occur independently with a known average rate

• E.g., Number of emergency calls to a service center per hour, when the average rate per hour is λ=10


Normal (Gaussian)

• Important distribution with the well-known bell shape

• Central limit theorem:

• Distribution of the mean of n independent samples approaches a normal distribution as n ↑, regardless of the distribution of the underlying population (provided it has finite variance)
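The central limit theorem is easy to see in simulation. A sketch assuming a skewed Exponential(1) population (an arbitrary choice; the sample sizes and seed are also ours):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Skewed population: Exponential with mean 1 and variance 1.
n = 50            # samples averaged per mean
trials = 20_000   # number of sample means drawn

sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# CLT prediction: sample means are approximately Normal(1, 1/n).
predicted_std = 1 / np.sqrt(n)
print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std(), "vs predicted", predicted_std)

# A normal distribution puts about 68.3% of its mass within one std of the mean.
within_one_std = np.mean(np.abs(sample_means - 1.0) <= predicted_std)
print("fraction within one predicted std:", within_one_std)
```

Even though the population is far from bell-shaped, the sample means cluster around 1 with spread close to 1/sqrt(n).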


Multivariate RV

• A multivariate random variable X is a set X1, X2, ..., Xp of random variables

• Joint density function: P(x)=P(x1, x2, ..., xp)

• Marginal density function: the density of any subset of the complete set of variables, e.g.,:

• Conditional density function: the density of a subset conditioned on particular values of the others, e.g.,:
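The two example formulas are missing from the transcript. Standard forms consistent with the joint density P(x1, x2, ..., xp) above would be (sums become integrals for continuous variables; the choice of x_1 as the example variable is ours):

```latex
% Marginal of x_1: sum the joint over all other variables
P(x_1) = \sum_{x_2} \cdots \sum_{x_p} P(x_1, x_2, \ldots, x_p)

% Conditional of x_1 given the rest: joint divided by the marginal of the rest
P(x_1 \mid x_2, \ldots, x_p) = \frac{P(x_1, x_2, \ldots, x_p)}{P(x_2, \ldots, x_p)}
```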


Data Likelihood

• Consider a random variable X whose distribution has parameters W

• We will often use P(X | W) to denote the probability of X

• Consider X1, …, Xn random variables that are independent and identically distributed with distribution P(X | W)

• Dataset: The likelihood of observations x1, …, xn is

$$P(X_1 = x_1, \ldots, X_n = x_n \mid W) = \prod_{i=1}^{n} P(X_i = x_i \mid W)$$
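As a concrete case, here is the likelihood of i.i.d. Bernoulli observations, where W is the success probability. The dataset and the function name are hypothetical; the sketch also shows that the product form and the numerically safer sum-of-logs form agree.

```python
import numpy as np

def bernoulli_log_likelihood(x, w):
    """log prod_i P(X_i = x_i | W = w) for i.i.d. Bernoulli(w) observations."""
    x = np.asarray(x)
    return np.sum(x * np.log(w) + (1 - x) * np.log(1 - w))

# Hypothetical dataset: 7 ones and 3 zeros.
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
w = 0.6

# Product of per-observation probabilities, exactly as in the formula above.
likelihood = np.prod([w if xi == 1 else 1 - w for xi in data])

# Same quantity computed through logs, which avoids underflow for large n.
log_likelihood = bernoulli_log_likelihood(data, w)
print(likelihood, np.exp(log_likelihood))
```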


Parameter Estimation

• Maximum likelihood estimate (MLE)

$$\hat{W} = \arg\max_{W} P[\text{Data} \mid W]$$

• Maximum a posteriori probability estimate (MAP)

$$\hat{W} = \arg\max_{W} P[\text{Data} \mid W] \, P[W]$$
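A small worked comparison of the two estimates, assuming Bernoulli data and a Beta(2, 2) prior on W (both the data and the prior are our choices for illustration). The closed forms are checked against a brute-force argmax over a grid.

```python
import numpy as np

# Hypothetical data: 7 successes in 10 Bernoulli trials.
successes, n = 7, 10

# MLE: argmax_W P[Data | W] for a Bernoulli model is the sample frequency.
w_mle = successes / n

# MAP with an assumed Beta(a, b) prior P[W]: the posterior is
# Beta(successes + a, n - successes + b), and its mode is the MAP estimate.
a, b = 2.0, 2.0
w_map = (successes + a - 1) / (n + a + b - 2)

# Brute-force check: evaluate both objectives (in log form) on a grid of W values.
grid = np.linspace(0.001, 0.999, 999)
log_lik = successes * np.log(grid) + (n - successes) * np.log(1 - grid)
log_prior = (a - 1) * np.log(grid) + (b - 1) * np.log(1 - grid)
w_mle_grid = grid[np.argmax(log_lik)]
w_map_grid = grid[np.argmax(log_lik + log_prior)]

print("MLE:", w_mle, "grid:", w_mle_grid)
print("MAP:", w_map, "grid:", w_map_grid)
```

The prior pulls the MAP estimate toward 0.5; with more data the two estimates move closer together.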


Parameter Estimation by Gradient Ascent

• Consider the j-th parameter of P(Data | W), Wj

• We can maximize the likelihood by gradient ascent:

$$W_{j,t+1} = W_{j,t} + \eta \frac{\partial}{\partial W_j} P(\text{Data} \mid W)$$

• The above is called (full) batch gradient ascent, as it uses the entire dataset.

• For more complex models (or large datasets), we will often use only a subset of the data to compute an estimate of the gradient
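A minimal sketch of full-batch gradient ascent for a one-parameter Bernoulli model. Note one deliberate substitution: the update climbs the log-likelihood rather than the raw likelihood written on the slide; both have the same maximizer, and the log form is better behaved numerically. The data, the learning rate, and the iteration count are assumptions.

```python
import numpy as np

# Hypothetical i.i.d. Bernoulli data; W is the success probability.
data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1], dtype=float)

def grad_log_likelihood(w, x):
    """d/dw of sum_i log P(x_i | w) for Bernoulli(w), using the whole dataset."""
    return np.sum(x / w - (1 - x) / (1 - w))

w = 0.5      # initial guess W_0
eta = 0.01   # learning rate (assumed value)
for t in range(500):
    w = w + eta * grad_log_likelihood(w, data)   # W_{t+1} = W_t + eta * gradient
    w = min(max(w, 1e-6), 1 - 1e-6)              # keep W inside (0, 1)

print("gradient ascent estimate:", w)
print("closed-form MLE:         ", data.mean())
```

Replacing `data` inside the loop with a random subset at each step would give the mini-batch variant mentioned in the last bullet.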


Linear Algebra Review


Why Linear Algebra?

• Why Algebra?

• Computing is all about algebra

• Way to mathematically describe data

• Why Linear?

• Fast tools

• Easy to understand

• Many non-linear problems can be transformed or approximated as linear systems

• Combination works well in practice


Example of Linear Algebra Application: Explain Relationships

• Marvel character appears in a comic book

• Described as an adjacency matrix (representing a graph)

Matrix Transpose & Symmetry

3. $(AB)^T = B^T A^T$

3.1 Symmetric matrix: $A = A^T$

3.2 $(B^T)^T = B$

3.3 $(BB^T)^T = ?$

Cocitation networks

→ $C = A^T A$ (Newman’s notation: $C = AA^T$)

Bibliographic coupling

→ $B = AA^T$

[Figure: bipartite graph linking Marvel characters to comic books]
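To make the products A A^T and A^T A concrete, here is a sketch with a tiny character-by-comic matrix. The matrix entries are invented for illustration; rows are characters and columns are comics, which is our convention rather than one fixed by the slide.

```python
import numpy as np

# Hypothetical incidence matrix: A[i, j] = 1 if character i appears in comic j.
characters = ["Spider-Man", "Iron Man", "Hulk"]
A = np.array([
    [1, 1, 0, 0],   # Spider-Man appears in comics 0 and 1
    [0, 1, 1, 0],   # Iron Man appears in comics 1 and 2
    [0, 1, 1, 1],   # Hulk appears in comics 1, 2 and 3
])

# (A A^T)[i, k] = number of comics in which characters i and k both appear.
shared_comics = A @ A.T

# (A^T A)[j, l] = number of characters that appear in both comic j and comic l.
shared_characters = A.T @ A

print(shared_comics)
print(shared_characters)
```

Both products are symmetric, which also answers item 3.3: (BB^T)^T = BB^T.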


Important Properties

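Linear algebra

1. Matrix multiplication is not commutative

$$\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \neq \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$$

2. Trace: Tr(AB) = Tr(BA)

2. Inner product: $\langle x, y \rangle = x^T y = \sum_{\forall i} x_i y_i$

2.1 $\langle x, x \rangle \geq 0$

2.2 $\|x\|^2 \equiv \langle x, x \rangle$

2.3 $\langle x, y \rangle = 0$ iff $x \perp y$

2.4 $\langle x, y \rangle = \|x\| \|y\| \cos\theta$, thus y⟨x ,y⟩= ∥x∥u, where u = y/∥y∥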
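These properties can be verified directly in NumPy with the same 2×2 matrices as in item 1. The vectors x and y are our own example, chosen to be orthogonal.

```python
import numpy as np

A = np.array([[0, 1], [0, 0]])
B = np.array([[0, 0], [1, 0]])

AB = A @ B   # [[1, 0], [0, 0]]
BA = B @ A   # [[0, 0], [0, 1]]

# The products differ, but their traces agree.
products_differ = not np.array_equal(AB, BA)
traces_match = np.trace(AB) == np.trace(BA)
print(products_differ, traces_match)

# Inner product, norm, and orthogonality for two example vectors.
x = np.array([3.0, 4.0])
y = np.array([-4.0, 3.0])
inner = x @ y                 # <x, y> = x^T y
norm_x = np.sqrt(x @ x)       # ||x|| = sqrt(<x, x>)
cos_theta = inner / (norm_x * np.linalg.norm(y))
print(inner, norm_x, cos_theta)
```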


Common Matrix Representations

Adjacency matrix

Undirected graph: $A = A^T$

Bipartite graph (undirected):

$$A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}$$

k connected components:

$$A = \begin{bmatrix} A_1 & \cdots & 0 \\ 0 & \ddots & 0 \\ 0 & \cdots & A_k \end{bmatrix}$$


Orthogonal vectors (linearly independent)

• Consider vectors $u_1, \ldots, u_k$

• Orthonormal if both:

• Normalized if $u_i u_i^T = 1$, also defined as $\|u_i\| = 1$, $i = 1, \ldots, k$

• Orthogonal if $u_i u_j^T = 0$, also defined as $u_i \perp u_j$, $i \neq j$

• If U is orthonormal then $UU^{-1} = UU^T = I$, where I is the identity matrix


Orthogonal Projections

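Note that $I = UU^T = \sum_{i=1}^{k} u_i u_i^T$

Thus $a = Ia = U(U^T a)$, $a \in \mathbb{R}^k$

The vector $(U^T a)$ is the projection of a onto U

Thus, $a = \sum_{i=1}^{k} (u_i^T a) u_i$

[Figure: projection of a onto the basis, with labels u_2 and a_2]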
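The decomposition of a into its projections can be checked numerically. In this sketch the orthonormal basis is obtained from a QR decomposition of a random matrix, which is just one convenient way to produce one; the vector a is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an orthonormal basis of R^3: the Q factor of a QR decomposition
# has orthonormal columns u_1, u_2, u_3.
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# Check I = U U^T = sum_i u_i u_i^T.
identity_from_outer = sum(np.outer(U[:, i], U[:, i]) for i in range(3))

# Coordinates of a in the basis U are the projections u_i^T a.
a = np.array([1.0, 2.0, 3.0])
coeffs = U.T @ a

# Rebuild a as sum_i (u_i^T a) u_i.
a_rebuilt = sum(coeffs[i] * U[:, i] for i in range(3))
print(a_rebuilt)
```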


Can we have non-orthogonal basis?

• Yes

• For instance:

Let $U = [u_1, u_2]$ be an orthonormal basis and let

$v_1 = 2u_1 + u_2$

$v_2 = u_1 + 2u_2$

The vectors $v_1$ and $v_2$ are not orthogonal:

$v_1 v_2^T = 4$,

but still form a basis for $\mathbb{R}^2$
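A quick numerical check of this example, assuming u_1 and u_2 are the standard basis of R^2 (any orthonormal pair gives the same inner product). The vector a is an arbitrary test vector.

```python
import numpy as np

# Assumed orthonormal basis: the standard basis of R^2.
u1 = np.array([1.0, 0.0])
u2 = np.array([0.0, 1.0])

v1 = 2 * u1 + u2
v2 = u1 + 2 * u2

# Not orthogonal: the inner product is 2*1 + 1*2 = 4, not 0.
dot = v1 @ v2

# Still a basis: the matrix with columns v1, v2 has full rank,
# so every vector in R^2 has unique coordinates in it.
V = np.column_stack([v1, v2])
rank = np.linalg.matrix_rank(V)

a = np.array([3.0, 5.0])
coords = np.linalg.solve(V, a)   # a = coords[0] * v1 + coords[1] * v2
print(dot, rank, coords)
```

The price of a non-orthogonal basis is that coordinates require solving a linear system instead of taking simple inner products.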


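Eigenvalues and Eigenvectors

$A > 0$ is $n \times n$ full rank, $\lambda$ eigenvalue

$$\det(A - \lambda I) = 0$$

and x (right) eigenvector

$$Ax = \lambda x,$$

we assume $\|x\| = 1$.

$$A \begin{bmatrix} x_1, \cdots, x_n \end{bmatrix} = \begin{bmatrix} x_1, \cdots, x_n \end{bmatrix} \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix},$$

where $x_i$ is the i-th eigenvector of A, ordered s.t. $\lambda_1 \geq \cdots \geq \lambda_n$

If A has n linearly independent eigenvectors

$$A = V \Lambda V^{-1}$$

Inverse is $A^{-1} = V \Lambda^{-1} V^{-1}$, where

$$\Lambda^{-1} = \begin{bmatrix} 1/\lambda_1 & & \\ & \ddots & \\ & & 1/\lambda_n \end{bmatrix}$$

If A is symmetric positive semidefinite, $V^{-1} = V^T$

Square is $A^2 = V \Lambda^2 V^{-1}$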
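The identities for the inverse and the square can be checked on a small symmetric positive definite matrix. The matrix below is our own example; `np.linalg.eigh` is used because A is symmetric, and it returns eigenvalues in ascending order, so they are reordered to match the convention λ1 ≥ … ≥ λn.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite example

eigvals, V = np.linalg.eigh(A)        # ascending eigenvalues, orthonormal V
order = np.argsort(eigvals)[::-1]     # reorder so that lambda_1 >= lambda_2
eigvals, V = eigvals[order], V[:, order]

# A = V Lambda V^{-1}, and V^{-1} = V^T because A is symmetric.
A_rebuilt = V @ np.diag(eigvals) @ V.T

# Inverse and square only touch the eigenvalues.
A_inv = V @ np.diag(1.0 / eigvals) @ V.T
A_sq = V @ np.diag(eigvals**2) @ V.T

print(eigvals)
```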


Singular Value Decomposition (SVD)

• X(i,j) = value of user i for property j

• X(Alice, cholesterol) = 10

• X(i,j) = number of times i buys j

• X(i,j) = how much i pays j

• X(i,j) = 1 if i and j are friends, 0 otherwise

• X(i,j) = temperature of sensor j at time i

[Figure: data matrix X with rows indexed by i and columns indexed by j]

© 2015 Bruno Ribeiro


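SVD Illustrated

$$X = U \Sigma V^T$$

• X: data, with columns $x_1, x_2, \ldots, x_m$

• U: left singular vectors $u_1, u_2, \ldots, u_k$; columns of U are orthonormal

• Σ: singular values $\sigma_1, \sigma_2, \ldots, \sigma_k$ on the diagonal

• V: right singular vectors $v_1, v_2, \ldots, v_k$; columns of V are orthonormal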


SVD Properties (I)

U and V are orthonormal (orthogonal & unit norm)


SVD Definition

• SVD gives the best rank-k approximation of X in the L2 and Frobenius norms

$$X = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots$$

Each term $\sigma_i u_i v_i^T$ is an outer product.


SVD Properties (II)

• “Almost unique” decomposition

• There are two sources of ambiguity

• Orientation of singular vectors

• Permute rows of left singular vector and corresponding rows of right singular vector

• If I is identity matrix: $I = U I U^T$, for all orthonormal U

• “Hypersphere ambiguity”

• Related to rotational ambiguity of PCA (we will see this later)

$$X = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots$$


SVD Properties (III)

• Theorem (Eckart-Young, 1936)

• $U \Sigma_1 V^T$ is the best rank 1 approximation of X, that is $\|X - U \Sigma_1 V^T\|_2 \leq \|X - Y\|_2$ for every rank 1 matrix Y

• $U \Sigma_1 V^T + U \Sigma_2 V^T$ is the best rank 2 approximation of X, that is $\|X - U \Sigma_1 V^T - U \Sigma_2 V^T\|_2 \leq \|X - Y\|_2$ for every rank ≤ 2 matrix Y

• Also for 3, 4, …, r
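The theorem can be illustrated with NumPy's SVD by truncating to the top k singular triplets. The sketch reads Σ_i as Σ with only the i-th singular value kept, which is our interpretation of the slide's notation; the random data matrix and the helper `rank_k_approx` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # arbitrary data matrix

# Thin SVD: X = U diag(s) Vt, singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def rank_k_approx(k):
    """Sum of the first k outer products sigma_i * u_i * v_i^T."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

for k in range(1, 4):
    Xk = rank_k_approx(k)
    spectral_err = np.linalg.norm(X - Xk, 2)       # equals sigma_{k+1}
    frobenius_err = np.linalg.norm(X - Xk, "fro")  # sqrt of the sum of discarded sigma^2
    print(k, spectral_err, frobenius_err)

# Any other rank-1 matrix does no better than the truncated SVD.
Y = np.outer(rng.normal(size=6), rng.normal(size=4))
err_svd = np.linalg.norm(X - rank_k_approx(1), 2)
err_other = np.linalg.norm(X - Y, 2)
```

The approximation error is determined entirely by the discarded singular values, which is why a fast-decaying spectrum means a low-rank model fits well.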


Singular Value Decomposition

• SVD factorization $A = U \Sigma V^{\star}$ is more general than eigenvalue / eigenvector factorization $A = V \Lambda V^{-1}$.

• SVD also gives the Moore–Penrose pseudoinverse (inverse of non-square matrices):

Moore–Penrose pseudoinverse is $A^{+} = V \Sigma^{+} U^T$
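A sketch of the pseudoinverse built from the SVD for a real, non-square matrix, compared against `np.linalg.pinv`. Σ⁺ inverts the nonzero singular values and leaves zeros in place; the tolerance used to decide what counts as zero is an assumption, and the matrix is our own example.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # 3x2, so it has no ordinary inverse

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma^+: invert singular values above a small tolerance, zero out the rest.
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])

# A^+ = V Sigma^+ U^T
A_plus = Vt.T @ np.diag(s_plus) @ U.T
print(A_plus)
```

For a full-column-rank A like this one, A⁺ b is the least-squares solution of A x = b.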


Programming Languages

Python + Scholar Cluster

• Detailed tutorial:

• https://www.cs.purdue.edu/homes/ribeirob/courses/Spring2018/notes/python/python-how-to.html

• Links to Python resources

• Detailed instructions on the use of the Scholar cluster


Python example in the cluster

import numpy as np
from matplotlib import use
# Avoid using the X terminal to create plots (needed if plotting on Scholar)
use("Agg")
import matplotlib.pyplot as plt
import scipy as sp
import math

p = 0.8
MAX_DEGREE = 101
ECCDF = 1.0
x = []
y = []
# range replaces Python 2's xrange (the anaconda py35 module is Python 3)
for d in range(MAX_DEGREE):
    # Be careful with machine precision
    x.append(d)
    y.append(ECCDF)
    ECCDF = ECCDF - (1-p)*p**d

plt.xlim([1, max(x)])
plt.xlabel("node degree", fontsize=18)
plt.ylabel("ECCDF", fontsize=18)
plt.loglog(x, y, "ro")
plt.savefig('ECCDF_plot.pdf')


Scholar Cluster

• High Performance Computing Cluster

• Jobs must be submitted with qsub (do not use the main terminal to run tasks or you will be banished)

• You can also use your own computer (but the Scholar learning curve pays off later)


Qsub Example File

#!/bin/bash -l
# Example submission file (myjob.sub)
# choose queue to use (e.g. standby or scholar-b)
#PBS -q standby
# FILENAME: myjob.sub
module load devel
module load gcc
module load anaconda/4.4.1-py35
module load r

cd $PBS_O_WORKDIR

unset DISPLAY

python matlibplot_example.py


Scholar Cluster Job Submission

• qsub myjob.sub

• qstat -u <your Purdue username>

• After job finishes (submission directory has output files):

• myjob.sub.o<jobID> (std output)

• myjob.sub.e<jobID> (error)

• ECCDF_plot.pdf (your plot)


Unsupported: Jupyter Hub

• https://www.rcac.purdue.edu/compute/scholar/

• In the Jupyter Hub you will be unable to load some libraries (e.g., pytorch).


Job Submission System is Not for Debugging

• Submission takes a while to run (a few minutes)

• How to iteratively debug code?

1. qsub -I -q standby -l walltime=01:00:00
# asks for a 1-hour interactive terminal to debug problems in your code
# remember to load the modules you need

2. Or you can quickly run your code in the terminal to see if it is working.

• Further instructions at: https://www.cs.purdue.edu/homes/ribeirob/courses/Spring2018/notes/python/python-how-to.html

• Where to get further help: https://www.rcac.purdue.edu/compute/scholar/guide/
