latent semantic analysis of wikipedia with spark
TRANSCRIPT
![Page 1: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/1.jpg)
1© Cloudera, Inc. All rights reserved.
LSA-ing Wikipedia with SparkSandy Ryza | Senior Data Scientist
![Page 2: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/2.jpg)
2© Cloudera, Inc. All rights reserved.
Me
• Data scientist at Cloudera
• Recently lead Cloudera’s Apache Spark development
• Author of Advanced Analytics with Spark
![Page 3: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/3.jpg)
3© Cloudera, Inc. All rights reserved.
LSA-ing Wikipedia with SparkSandy Ryza | Senior Data Scientist
![Page 4: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/4.jpg)
4© Cloudera, Inc. All rights reserved.
Latent Semantic Analysis
• Fancy name for applying a matrix decomposition (SVD) to text data
![Page 5: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/5.jpg)
5© Cloudera, Inc. All rights reserved.
![Page 6: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/6.jpg)
6© Cloudera, Inc. All rights reserved.
![Page 7: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/7.jpg)
7© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-Document
MatrixSVD
Interpret Results
![Page 8: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/8.jpg)
8© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-Document
MatrixSVD
Interpret Results
![Page 9: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/9.jpg)
9© Cloudera, Inc. All rights reserved.
Wikipedia Content Data Set
• http://dumps.wikimedia.org/enwiki/latest/• XML-formatted• 46 GB uncompressed
![Page 10: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/10.jpg)
10© Cloudera, Inc. All rights reserved.
<page><title>Anarchism</title><ns>0</ns><id>12</id><revision><id>584215651</id><parentid>584213644</parentid><timestamp>2013-12-02T15:14:01Z</timestamp><contributor><username>AnomieBOT</username><id>7611264</id></contributor><comment>Rescuing orphaned refs ("autogenerated1" from rev584155010; "bbc" from rev 584155010)</comment><text xml:space="preserve">{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}{{Redirect|Anarchists}}{{pp-move-indef}}{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] that advocates [[stateless society|stateless societies]] often defined as [[self-governance|self-governed]] voluntaryinstitutions,<ref>"ANARCHISM, a social philosophy that rejectsauthoritarian government and maintains that voluntary institutions are best suitedto express man's natural social tendencies." George Woodcock."Anarchism" at The Encyclopedia of Philosophy</ref><ref>"In a society developed on these lines, the voluntary associations whichalready now begin to cover all the fields of human activity would take a stillgreater extension so as to substitute...
![Page 11: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/11.jpg)
11© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-Document
MatrixSVD
Interpret Results
![Page 12: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/12.jpg)
12© Cloudera, Inc. All rights reserved.
import org.apache.mahout.text.wikipedia.XmlInputFormatimport org.apache.hadoop.conf.Configurationimport org.apache.hadoop.io._
val path = "hdfs:///user/ds/wikidump.xml"val conf = new Configuration()conf.set(XmlInputFormat.START_TAG_KEY, "<page>")conf.set(XmlInputFormat.END_TAG_KEY, "</page>")val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf)val rawXmls = kvs.map(p => p._2.toString)
![Page 13: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/13.jpg)
13© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-Document
MatrixSVD
Interpret Results
![Page 14: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/14.jpg)
14© Cloudera, Inc. All rights reserved.
Lemmatization
“the boy’s cars are different colors”
“the boy car be different color”
![Page 15: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/15.jpg)
15© Cloudera, Inc. All rights reserved.
CoreNLP
def createNLPPipeline(): StanfordCoreNLP = { val props = new Properties() props.put("annotators", "tokenize, ssplit, pos, lemma") new StanfordCoreNLP(props)}
![Page 16: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/16.jpg)
16© Cloudera, Inc. All rights reserved.
Stop Words
“the boy car be different color”
“boy car different color”
![Page 17: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/17.jpg)
17© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-Document
MatrixSVD
Interpret Results
![Page 18: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/18.jpg)
18© Cloudera, Inc. All rights reserved.
Tail Monkey Algorithm Scala
Document 1 1.5 1.8
Document 2 2.0 4.3
Document 3 1.4 6.7
Document 4 1.6
Document 5 1.2
Term-Document Matrix
![Page 19: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/19.jpg)
19© Cloudera, Inc. All rights reserved.
tf-idf
• (Term Frequency) * (Inverse Document Frequency)
• tf(document, word) = # times word appears in document
• idf(word) = 1 / (# documents that contain word)
![Page 20: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/20.jpg)
20© Cloudera, Inc. All rights reserved.
val rowVectors: RDD[Vector] = ...
![Page 21: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/21.jpg)
21© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-Document
MatrixSVD
Interpret Results
![Page 22: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/22.jpg)
22© Cloudera, Inc. All rights reserved.
Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V• m = # documents• n = # terms• U is m x n• S is n x n• V is n x n
![Page 23: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/23.jpg)
23© Cloudera, Inc. All rights reserved.
Low Rank Approximation
• Account for synonymy by condensing related terms.• Account for polysemy by placing less weight on terms that have multiple meanings.• Throw out noise.
SVD can find the rank-k approximation that has the lowest Frobenius distance from the original matrix.
![Page 24: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/24.jpg)
24© Cloudera, Inc. All rights reserved.
Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V• m = # documents• n = # terms• k = # concepts• U is m x n• S is k x k• V is k x n
![Page 25: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/25.jpg)
25© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V
![Page 26: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/26.jpg)
26© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V
![Page 27: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/27.jpg)
27© Cloudera, Inc. All rights reserved.
rowVectors.cache()val mat = new RowMatrix(rowVectors)val k = 1000val svd = mat.computeSVD(k, computeU=true)
![Page 28: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/28.jpg)
28© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-Document
MatrixSVD
Interpret Results
![Page 29: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/29.jpg)
29© Cloudera, Inc. All rights reserved.
What are the top “concepts”?
I.e. what dimensions in term-space and document-space explain most of the variance of the data?
![Page 30: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/30.jpg)
30© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V
![Page 31: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/31.jpg)
31© Cloudera, Inc. All rights reserved.
U S V
![Page 32: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/32.jpg)
32© Cloudera, Inc. All rights reserved.
U S V
![Page 33: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/33.jpg)
33© Cloudera, Inc. All rights reserved.
def topTermsInConcept(concept: Int, numTerms: Int) : Seq[(String, Double)] = { val v = svd.V.toBreezeMatrix val termWeights = v(::, k).toArray.zipWithIndex val sorted = termWeights.sortBy(-_._1) sorted.take(numTerms)}
![Page 34: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/34.jpg)
34© Cloudera, Inc. All rights reserved.
def topDocsInConcept(concept: Int, numDocs: Int) : Seq[Seq[(String, Double)]] = { val u = svd.U val docWeights = u.rows.map(_.toArray(concept)).zipWithUniqueId() docWeights.top(numDocs)}
![Page 35: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/35.jpg)
35© Cloudera, Inc. All rights reserved.
Concept 1
Terms: department, commune, communes, insee, france, see, also,
southwestern, oise, marne, moselle, manche, eure, aisne, isère
Docs: Communes in France, Saint-Mard, Meurthe-et-Moselle,
Saint-Firmin, Meurthe-et-Moselle, Saint-Clément, Meurthe-et-Moselle,
Saint-Sardos, Lot-et-Garonne, Saint-Urcisse, Lot-et-Garonne, Saint-Sernin,
Lot-et-Garonne, Saint-Robert, Lot-et-Garonne, Saint-Léon, Lot-et-Garonne,
Saint-Astier, Lot-et-Garonne
![Page 36: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/36.jpg)
36© Cloudera, Inc. All rights reserved.
Concept 2
Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum,
snail, database, natural, find, geometridae, reference, museum, noctuidae
Docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini,
Cribrilinidae, Tahla (genus), Gigartinales, Parapodia (genus),
Alpina (moth), Arycanda (moth)
![Page 37: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/37.jpg)
37© Cloudera, Inc. All rights reserved.
Querying
• Given a set of terms, find the closest documents in the latent space
![Page 38: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/38.jpg)
38© Cloudera, Inc. All rights reserved.
Reconstructed Matrix(U * S * V)
Doc
Term
![Page 39: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/39.jpg)
39© Cloudera, Inc. All rights reserved.
def topTermsForTerm( normalizedVS: BDenseMatrix[Double], termId: Int): Seq[(Double, Int)] = { val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray) val termScores = (normalizedVS * rowVec).toArray.zipWithIndex termScores.sortBy(- _._1).take(10)}
val VS = multiplyByDiagonalMatrix(svd.V, svd.s)val normalizedVS = rowsNormalized(VS)topTermsForTerm(normalizedVS, id, termIds)
![Page 40: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/40.jpg)
40© Cloudera, Inc. All rights reserved.
printRelevantTerms("radiohead")
radiohead 0.9999999999999993
lyrically 0.8837403315233519
catchy 0.8780717902060333
riff 0.861326571452104
lyricsthe 0.8460798060853993
lyric 0.8434937575368959
upbeat 0.8410212279939793
Term Similarity
![Page 41: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/41.jpg)
41© Cloudera, Inc. All rights reserved.
printRelevantTerms("algorithm")
algorithm 1.000000000000002
heuristic 0.8773199836391916
compute 0.8561015487853708
constraint 0.8370707630657652
optimization 0.8331940333186296
complexity 0.823738607119692
algorithmic 0.8227315888559854
Term Similarity
![Page 42: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/42.jpg)
42© Cloudera, Inc. All rights reserved.
(algorithm,1.000000000000002), (heuristic,0.8773199836391916),
(compute,0.8561015487853708), (constraint,0.8370707630657652),
(optimization,0.8331940333186296), (complexity,0.823738607119692),
(algorithmic,0.8227315888559854), (iterative,0.822364922633442),
(recursive,0.8176921180556759), (minimization,0.8160188481409465)
![Page 43: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/43.jpg)
43© Cloudera, Inc. All rights reserved.
def topDocsForTerm(US: RowMatrix, V: Matrix, termId: Int) : Seq[(Double, Long)] = { val rowArr = row(V, termId).toArray val rowVec = Matrices.dense(termRowArr.length, 1, termRowArr) val docScores = US.multiply(termRowVec) val allDocWeights = docScores.rows.map( _.toArray(0)). zipWithUniqueId() allDocWeights.top( 10)}
![Page 44: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/44.jpg)
44© Cloudera, Inc. All rights reserved.
printRelevantDocs("fir")
Silver tree 0.006292909647173194
See the forest for the trees 0.004785047583508223
Eucalyptus tree 0.004592837783089319
Sequoia tree 0.004497446632469554
Willow tree 0.004429936059594164
Coniferous tree 0.004381572286629475
Tulip Tree 0.004374705020233878
Document Similarity
![Page 45: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/45.jpg)
45© Cloudera, Inc. All rights reserved.
• https://github.com/sryza/aas/tree/master/ch06-lsa• https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html•
More detail?
![Page 46: Latent Semantic Analysis of Wikipedia with Spark](https://reader034.vdocuments.site/reader034/viewer/2022052311/55a6befd1a28ab4d688b46a7/html5/thumbnails/46.jpg)
46© Cloudera, Inc. All rights reserved.
Thank you@sandysifting