08 dimensionality reduction - sjtu › ~yshen › courses › bigdata › 08... · 5/2/17 12 ¡a =...

33
5/2/17 1 Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org ¡ Assumption: Data lies on or near a low d-dimensional subspace ¡ Axes of this subspace are effective representation of the data J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2

Upload: others

Post on 04-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

1

Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

¡ Assumption: Dataliesonornearalowd-dimensionalsubspace

¡ Axesofthissubspaceareeffectiverepresentationofthedata

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 2

Page 2: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

2

¡ Compress/reducedimensionality:§ 106 rows;103 columns;noupdates§ Randomaccesstoanycell(s);smallerror:OK

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 3

The above matrix is really “2-dimensional.” All rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1]

¡ Q:Whatisrank ofamatrixA?¡ A: Numberoflinearlyindependent columnsofA¡ Forexample:§ MatrixA= hasrankr=2

§ Why?Thefirsttworowsarelinearlyindependent,sotherankisatleast2,butallthreerowsarelinearlydependent(thefirstisequaltothesumofthesecondandthird)sotherankmustbelessthan3.

¡ Whydowecareaboutlowrank?§ WecanwriteA astwo“basis”vectors:[121][-2-31]§ Andnewcoordinatesof:[10][01][1-1]

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 4

Page 3: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

3

¡ Cloudofpoints3Dspace:§ Thinkofpointpositionsasamatrix:

¡ Wecanrewritecoordinatesmoreefficiently!§ Oldbasisvectors: [100][010][001]§ Newbasisvectors:[121][-2-31]§ ThenA hasnewcoordinates:[10].B:[01],C:[1-1]

§ Notice:Wereducedthenumberofcoordinates!

1 row per point:

ABC

A

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 5

¡ Goalofdimensionalityreductionistodiscovertheaxisofdata!

Rather than representingevery point with 2 coordinateswe represent each point with1 coordinate (corresponding tothe position of the point on the red line).

By doing this we incur a bit oferror as the points do not exactly lie on the line

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 6

Page 4: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

4

Whyreducedimensions?¡ Discoverhiddencorrelations/topics§ Wordsthatoccurcommonlytogether

¡ Removeredundantandnoisyfeatures§ Notallwordsareuseful

¡ Interpretationandvisualization¡ Easierstorageandprocessingofthedata

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 7

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 8

A[mxn] =U[mxr] S [ rxr] (V[nxr])T

¡ A:Inputdatamatrix§ m xnmatrix(e.g.,m documents,n terms)

¡ U:Leftsingularvectors§ m xrmatrix (m documents,r concepts)

¡ S:Singularvalues§ r xr diagonalmatrix(strengthofeach‘concept’)(r :rankofthematrixA)

¡ V:Rightsingularvectors§ n xrmatrix(n terms,r concepts)

Page 5: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

5

9

Am

n

Sm

n

U

VT

»

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

T

10

Am

n

» +

s1u1v1 s2u2v2

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

σi … scalarui … vectorvi … vector

T

Page 6: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

6

Itisalways possibletodecomposearealmatrixA intoA=US VT ,where

¡ U,S,V:unique¡ U,V:columnorthonormal§ UT U=I;VT V=I (I:identitymatrix)§ (Columnsareorthogonalunitvectors)

¡ S:diagonal§ Entries(singularvalues)arepositive,andsortedindecreasingorder(σ1 ³ σ2 ³ ...³ 0)

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 11

Nice proof of uniqueness: http://www.mpi-inf.mpg.de/~bast/ir-seminar-ws04/lecture2.pdf

¡ A=US VT - example:UserstoMovies

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 12

=SciFi

Romnce

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

Sm

n

U

VT

“Concepts” AKA Latent dimensionsAKA Latent factors

Page 7: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

7

¡ A=US VT - example:UserstoMovies

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 13

=SciFi

Romnce

x x

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

¡ A=US VT - example:UserstoMovies

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 14

SciFi-conceptRomance-concept

=SciFi

Romnce

x x

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

Page 8: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

8

¡ A=US VT - example:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 15

Romance-concept

U is “user-to-concept” similarity matrix

SciFi-concept

=SciFi

Romnce

x x

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

¡ A=US VT - example:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 16

SciFi

Romnce

SciFi-concept

“strength” of the SciFi-concept

=SciFi

Romnce

x x

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

Page 9: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

9

¡ A=US VT - example:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 17

SciFi-concept

V is “movie-to-concept”similarity matrix

SciFi-concept

=SciFi

Romnce

x x

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 18

‘movies’,‘users’and‘concepts’:¡ U:user-to-conceptsimilaritymatrix

¡ V:movie-to-conceptsimilaritymatrix

¡ S:itsdiagonalelements:‘strength’ofeachconcept

Page 10: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

10

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 20

v1

first right singular vector

Movie 1 rating

Mov

ie 2

ratin

g

¡ Insteadofusingtwocoordinates(𝒙, 𝒚) todescribepointlocations,let’suseonlyonecoordinate 𝒛

¡ Point’spositionisitslocationalongvector𝒗𝟏¡ Howtochoose𝒗𝟏?Minimizereconstructionerror

Page 11: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

11

¡ Goal:Minimizethesumofreconstructionerrors:

)) 𝑥+, − 𝑧+,/

0

,12

3

+12§ where𝒙𝒊𝒋 arethe“old”and𝒛𝒊𝒋 arethe“new”coordinates

¡ SVDgives‘best’axistoprojecton:§ ‘best’=minimizingthereconstructionerrors

¡ Inotherwords,minimumreconstructionerror

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 21

v1

first right singularvector

Movie 1 rating

Mov

ie 2

ratin

g

¡ A=US VT- example:§ V: “movie-to-concept”matrix§ U:“user-to-concept”matrix

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 22

v1

first right singularvector

Movie 1 rating

Mov

ie 2

ratin

g

= x x

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

Page 12: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

12

¡ A=US VT- example:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 23

v1

first right singularvector

Movie 1 rating

Mov

ie 2

ratin

g

variance (‘spread’) on the v1 axis

= x x

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

A=US VT- example:¡ US: Givesthecoordinatesofthepointsintheprojectionaxis

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 24

v1

first right singularvector

Movie 1 rating

Mov

ie 2

ratin

g

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

1.61 0.19 -0.015.08 0.66 -0.036.82 0.85 -0.058.43 1.04 -0.061.86 -5.60 0.840.86 -6.93 -0.870.86 -2.75 0.41

Projection of users on the “Sci-Fi” axis (U S) T:

Page 13: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

13

Moredetails¡ Q: Howexactlyisdim.reductiondone?

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 25

= x x

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

Moredetails¡ Q: Howexactlyisdim.reductiondone?¡ A:Setsmallestsingularvaluestozero

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 26

= x x

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

Page 14: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

14

Moredetails¡ Q: Howexactlyisdim.reductiondone?¡ A:Setsmallestsingularvaluestozero

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 27

x x

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

»

Moredetails¡ Q: Howexactlyisdim.reductiondone?¡ A:Setsmallestsingularvaluestozero

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 28

x x

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

»

Page 15: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

15

Moredetails¡ Q: Howexactlyisdim.reductiondone?¡ A:Setsmallestsingularvaluestozero

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 29

» x x

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.020.41 0.070.55 0.090.68 0.110.15 -0.590.07 -0.730.07 -0.29

12.4 0 0 9.5

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.69

Moredetails¡ Q: Howexactlyisdim.reductiondone?¡ A:Setsmallestsingularvaluestozero

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 30

»

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.92 0.95 0.92 0.01 0.012.91 3.01 2.91 -0.01 -0.013.90 4.04 3.90 0.01 0.014.82 5.00 4.82 0.03 0.030.70 0.53 0.70 4.11 4.11

-0.69 1.34 -0.69 4.78 4.780.32 0.23 0.32 2.01 2.01

Frobenius norm:

ǁMǁF = ÖΣij Mij2 ǁA-BǁF = Ö Σij (Aij-Bij)2

is“small”

Page 16: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

16

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 31

A USigma

VT=

B USigma

VT=

B is best approximation of A

¡ Theorem:Let A =US VT and B =US VT whereS = diagonalrxr matrix withsi=σi (i=1…k)elsesi=0thenB isa best rank(B)=k approx.toA

Whatdowemeanby“best”:§ B isasolutiontominB ǁA-BǁF whererank(B)=k

Σ𝜎22

𝜎88

𝐴 − 𝐵 ; = ) 𝐴+, − 𝐵+,/

+,

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 32

Page 17: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

17

¡ Theorem: Let A =US VT (σ1³σ2³…,rank(A)=r)then B =US VT

§ S = diagonalrxr matrix wheresi=σi (i=1…k)elsesi=0isabestrank-k approximationtoA:§ B isasolutiontominB ǁA-BǁF whererank(B)=k

¡ Wewillneed2facts:§ 𝑀 ; = ∑ 𝑞++ /�

+ whereM =P Q R isSVDofM§ US VT- US VT =U (S - S)VT

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 33

Σ𝜎22

𝜎88

Details!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 34

¡ Wewillneed2facts:§ 𝑀 ; = ∑ 𝑞AA /�

A whereM =P Q R isSVDofM

§ US VT- US VT =U (S - S)VT

We apply:-- P column orthonormal-- R row orthonormal-- Q is diagonal

Details!

Page 18: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

18

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 35

¡ A =US VT ,B =US VT (σ1³σ2³…³ 0, rank(A)=r)§ S = diagonalnxn matrixwheresi=σi (i=1…k)elsesi=0then B issolutiontominB ǁA-BǁF ,rank(B)=k

¡ Why?

¡ Wewanttochoosesi tominimize∑ 𝜎+ − 𝑠+ /�+

¡ Solutionistosetsi=σi (i=1…k)andothersi=0

ååå+=+==

=+-=r

kii

r

kii

k

iiis s

i1

2

1

2

1

2)(min sss

å=

=-=-S=-

r

iiisFFkBrankBsSBA

i1

2

)(,)(minminmin s

We used: U S VT - U S VT = U (S - S) VT

Details!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 36

Equivalent:‘spectraldecomposition’ofthematrix:

= x xu1 u2

σ1

σ2

v1

v2

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

Page 19: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

19

Equivalent:‘spectraldecomposition’ofthematrix

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 37

= u1σ1 vT1 u2σ2 vT

2+ +...

n

m

n x 1 1 x m

k terms

Assume: σ1 ³ σ2 ³ σ3 ³ ... ³ 0

Why is setting small σi to 0 the right thing to do?Vectors ui and vi are unit length, so σiscales them.So, zeroing small σi introduces less error.

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 38

Q:Howmanyσs tokeep?A: Rule-of-athumb:keep80-90%of‘energy’ = ∑ 𝝈𝒊𝟐�

𝒊

= u1σ1 vT1 u2σ2 vT

2+ +...n

m

Assume: σ1 ³ σ2 ³ σ3 ³ ...

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

Page 20: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

20

¡ TocomputeSVD:§ O(nm2) orO(n2m) (whicheverisless)

¡ But:§ Lesswork,ifwejustwantsingularvalues§ orifwewantfirstk singularvectors§ orifthematrixissparse

¡ Implementedinlinearalgebrapackageslike§ LINPACK,Matlab,SPlus,Mathematica ...

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 39

¡ SVD: A=US VT:unique§ U:user-to-conceptsimilarities§ V:movie-to-conceptsimilarities§ S :strengthofeachconcept

¡ Dimensionalityreduction:§ keepthefewlargestsingularvalues(80-90%of‘energy’)

§ SVD:picksuplinearcorrelations

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 40

Page 21: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

21

¡ SVDgivesus:§ A = U S VT

¡ Eigen-decomposition:§ A = X L XT

§ Aissymmetric§ U,V,Xareorthonormal (UTU=I),§ L, S arediagonal

¡ Nowlet’scalculate:§ AAT= US VT(US VT)T = US VT(VSTUT) = USST UT

§ ATA = V ST UT (US VT) = V SST VT

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 41

¡ SVDgivesus:§ A = U S VT

¡ Eigen-decomposition:§ A = X L XT

§ Aissymmetric§ U,V,Xareorthonormal (UTU=I),§ L, S arediagonal

¡ Nowlet’scalculate:§ AAT= US VT(US VT)T = US VT(VSTUT) = USST UT

§ ATA = V ST UT (US VT) = V SST VT

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 42

X L2 XT

X L2 XT

Shows how to computeSVD using eigenvalue

decomposition!

Page 22: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

22

¡ A AT =U S2 UT

¡ ATA =V S2 VT

¡ (ATA) k=V S2k VT

§ E.g.:(ATA)2=V S2 VTV S2 VT=V S4 VT

¡ (ATA) k~v1 σ12k v1T fork>>1

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 43

Page 23: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

23

¡ Q:Findusersthatlike‘Matrix’¡ A:Mapqueryintoa‘conceptspace’– how?

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 45

=SciFi

Romnce

x x

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

1 1 1 0 03 3 3 0 04 4 4 0 05 5 5 0 00 2 0 4 40 0 0 5 50 1 0 2 2

0.13 0.02 -0.010.41 0.07 -0.030.55 0.09 -0.040.68 0.11 -0.050.15 -0.59 0.650.07 -0.73 -0.670.07 -0.29 0.32

12.4 0 00 9.5 00 0 1.3

0.56 0.59 0.56 0.09 0.090.12 -0.02 0.12 -0.69 -0.690.40 -0.80 0.40 0.09 0.09

¡ Q:Findusersthatlike‘Matrix’¡ A:Mapqueryintoa‘conceptspace’– how?

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 46

5 0 0 0 0

q =

Matrix

Alie

n

v1

q

v2

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

Project into concept space:Inner product with each ‘concept’ vector vi

Page 24: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

24

¡ Q:Findusersthatlike‘Matrix’¡ A:Mapqueryintoa‘conceptspace’– how?

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 47

v1

q

q*v1

5 0 0 0 0

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

v2

MatrixA

lien

q =

Project into concept space:Inner product with each ‘concept’ vector vi

Compactly,wehave:qconcept =qV

E.g.:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 48

movie-to-conceptsimilarities (V)

=

SciFi-concept

5 0 0 0 0

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

q =

0.56 0.120.59 -0.020.56 0.120.09 -0.690.09 -0.69

x 2.8 0.6

Page 25: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

25

¡ Howwouldtheuserd thatrated(‘Alien’,‘Serenity’)behandled?dconcept =dV

E.g.:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 49

movie-to-conceptsimilarities (V)

=

SciFi-concept

0 4 5 0 0

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

q =

0.56 0.120.59 -0.020.56 0.120.09 -0.690.09 -0.69

x 5.2 0.4

¡ Observation: Userd thatrated(‘Alien’,‘Serenity’)willbesimilar touserq thatrated(‘Matrix’),althoughd andq havezeroratingsincommon!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 50

0 4 5 0 0

d =

SciFi-concept

5 0 0 0 0

q =

Mat

rix

Alie

n

Sere

nity

Casa

blan

ca

Am

elie

Zero ratings in common Similarity ≠ 0

2.8 0.6

5.2 0.4

Page 26: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

26

+ Optimallow-rankapproximationintermsofFrobenius norm

- Interpretabilityproblem:§ Asingularvectorspecifiesalinearcombinationofallinputcolumnsorrows

- Lackofsparsity:§ Singularvectorsaredense!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 51

=U

SVT

Page 27: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

27

¡ Goal:ExpressAasaproductofmatricesC,U,RMakeǁA-C·U·RǁF small

¡ “Constraints”onCandR:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 53

A C U R

Frobenius norm:

ǁXǁF = Ö Σij Xij2

¡ Goal:ExpressAasaproductofmatricesC,U,RMakeǁA-C·U·RǁF small

¡ “Constraints”onCandR:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 54

Pseudo-inverse of the intersection of C and R

A C U R

Frobenius norm:

ǁXǁF = Ö Σij Xij2

Page 28: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

28

¡ Let:Ak bethe“best”rankk approximationtoA (thatis,Ak isSVDofA)

Theorem [Drineas etal.]CUR inO(m·n)timeachieves§ ǁA-CURǁF £ ǁA-AkǁF +eǁAǁFwithprobabilityatleast1-d,bypicking§ O(klog(1/d)/e2) columns,and§ O(k2log3(1/d)/e6) rows

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 55

In practice:Pick 4k cols/rows

¡ Samplingcolumns(similarlyforrows):

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 56

Note this is a randomized algorithm, same column can be sampled more than once

Page 29: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

29

¡ LetW bethe“intersection”ofsampledcolumnsC androwsR§ LetSVDofW= XZ YT

¡ Then: U =Y (Z+)2XT

§ Z+:reciprocalsofnon-zerosingularvalues: Z+

ii =1/ Zii

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 57

AC

R W

»

¡ Forexample:

§ Select𝒄 = 𝑶 𝒌 𝒍𝒐𝒈 𝒌𝜺𝟐

columnsofAusingColumnSelect algorithm

§ Select𝒓 = 𝑶 𝒌 𝒍𝒐𝒈 𝒌𝜺𝟐

rowsofAusingColumnSelect algorithm

§ Set𝑼 = 𝑾P

¡ Then:withprobability98%

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 58

In practice:Pick 4k cols/rowsfor a “rank-k” approximation

SVD errorCUR error

Page 30: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

30

+ Easyinterpretation• Sincethebasisvectorsareactualcolumnsandrows

+ Sparsebasis• Sincethebasisvectorsareactualcolumnsandrows

- Duplicatecolumnsandrows• Columnsoflargenormswillbesampledmanytimes

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 59

Singular vectorActual column

¡ Ifwewanttogetridoftheduplicates:§ Throwthemaway§ Scale(multiply)thecolumns/rowsbythesquarerootofthenumberofduplicates

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 60

ACd

Rd

Cs

Rs

Construct a small U

Page 31: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

31

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 61

SVD: A = U S VT

Huge but sparse Big and dense

CUR: A = C U RHuge but sparse Big but sparse

dense but small

sparse and small

¡ DBLPbibliographicdata§ Author-to-conferencebigsparsematrix§ Aij:Numberofpaperspublishedbyauthori atconferencej

§ 428Kauthors(rows),3659conferences(columns)§ Verysparse

¡ Wanttoreducedimensionality§ Howmuchtimedoesittake?§ Whatisthereconstructionerror?§ Howmuchspacedoweneed?

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 62

Page 32: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

32

¡ Accuracy:§ 1– relativesumsquarederrors

¡ Spaceratio:§ #outputmatrixentries/#inputmatrixentries

¡ CPUtime

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 63

SVDCURCUR no duplicates

SVDCURCUR no dup

Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM ’07.

CUR

SVD

¡ SVDislimitedtolinearprojections:§ Lower-dimensionallinearprojectionthatpreservesEuclideandistances

¡ Non-linearmethods:Isomap§ Dataliesonanonlinearlow-dimcurveakamanifold

§ Usethedistanceasmeasuredalongthemanifold

§ How?§ Buildadjacencygraph§ Geodesicdistanceisgraphdistance

§ SVD/PCAthegraphpairwise distancematrix

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 64

Page 33: 08 Dimensionality Reduction - SJTU › ~yshen › courses › BigData › 08... · 5/2/17 12 ¡A = U SVT -example: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

5/2/17

33

¡ Drineas etal.,FastMonteCarloAlgorithmsforMatricesIII:ComputingaCompressedApproximateMatrixDecomposition,SIAMJournalonComputing,2006.

¡ J.Sun,Y.Xie,H.Zhang,C.Faloutsos:LessisMore:CompactMatrixDecompositionforLargeSparseGraphs,SDM2007

¡ Intra- andinterpopulation genotypereconstructionfromtaggingSNPs,P.Paschou,M.W.Mahoney,A.Javed,J.R.Kidd,A.J.Pakstis,S.Gu,K.K.Kidd,andP.Drineas,GenomeResearch,17(1),96-107(2007)

¡ Tensor-CURDecompositionsForTensor-BasedData,M.W.Mahoney,M.Maggioni,andP.Drineas,Proc.12-thAnnualSIGKDD,327-336(2006)

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 65