Scala Matsuri 2016: Japanese Text Mining with Scala and Spark


Upload: eduardo-gonzalez

Post on 16-Apr-2017






Page 1: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Japanese Text Mining with Scala and Spark

Eduardo Gonzalez

Scala Matsuri 2016

Page 2: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

About Me
• Eduardo Gonzalez
• Japan Business Systems
  • Japanese System Integrator (SIer)
  • Social Systems Design Center (R&D)
• University of Pittsburgh
  • Computer Science
  • Japanese


Page 3: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Agenda
• Intro to Text Mining with Spark
• Pre-processing Japanese Text
  • Japanese Word Breaking
  • Spark Gotchas
• Topic Extraction with LDA
• Intro to Word2Vec
• Recommendation with Word Embedding

The talk proceeds in this order: introduction, pre-processing (word segmentation, Spark pitfalls), topic analysis, Word2Vec, and recommendation.

Page 4: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Machine Learning Vocabulary
• Feature: A number that represents something about a data point
• Label: A feature of the data we want to predict
• Document: A block of text with a unique ID
• Model: A learned set of parameters that can be used for prediction
• Corpus: A collection of documents

The machine learning vocabulary assumed in this talk: Feature, Label, Document, Model, and Corpus.

Page 5: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

What is Apache Spark

• A library that defines a Resilient Distributed Dataset (RDD) type and a set of transformations
  • RDDs are only representations of calculations
• A runtime that can execute RDDs in a distributed manner
  • A master process that schedules and monitors executors
  • Executors actually do the calculations and can keep results in their memory
• Spark SQL, MLlib and GraphX define special types of RDDs

Spark is a general-purpose distributed processing platform with components for SQL, machine learning, and graph processing.
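To make "RDDs are only representations of calculations" concrete, here is a minimal sketch. It assumes Spark is on the classpath and runs in local mode; the object name, the app name and the `local[*]` master are illustrative choices, not taken from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  // Builds a small RDD pipeline and runs a single action on it.
  def sumOfSquares(sc: SparkContext, upTo: Int): Int = {
    val numbers = sc.parallelize(1 to upTo) // describes the data, computes nothing yet
    val squares = numbers.map(n => n * n)   // transformation: still only a description
    squares.reduce(_ + _)                   // action: the executors run the calculation now
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch can run on a single machine
    val conf = new SparkConf().setMaster("local[*]").setAppName("lazy-rdd-sketch")
    val sc = new SparkContext(conf)
    try println(sumOfSquares(sc, 10)) // prints 385
    finally sc.stop()
  }
}
```

Nothing is sent to the executors until `reduce` is called, which is why building long chains of transformations is cheap.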

Page 6: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Apache Spark Example

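import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val sc = new SparkContext(new SparkConf())

  // Read the input file as an RDD of lines
  val text = sc.textFile("hdfs:///kjb.txt")

  // Split lines into words, pair each word with 1, and sum the counts per word
  val counts = text.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // Bring the results back to the driver and print them
  counts.collect().foreach(println)
}

This is what a WordCount application looks like in Spark.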

Page 7: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Spark’s Text-Mining Tools
• LDA for topic extraction
• Word2Vec, an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word count
• HashingTF-IDF calculates important words of a document with respect to the corpus
• And much more

Spark's text-mining tools include LDA, CountVectorizer, HashingTF-IDF, and others.
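As a rough illustration of what "important words of a document with respect to the corpus" means, the sketch below computes TF-IDF with plain Scala collections. It is not Spark's implementation: there is no feature hashing, and the object and method names are hypothetical. The IDF uses the smoothed form log((N + 1) / (df + 1)) described in the Spark MLlib documentation.

```scala
object TfIdfSketch {
  type Doc = Seq[String]

  /** Term frequency: how often each word occurs in one document. */
  def termCounts(doc: Doc): Map[String, Int] =
    doc.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

  /** Inverse document frequency over the corpus: log((N + 1) / (df + 1)). */
  def idf(corpus: Seq[Doc]): Map[String, Double] = {
    val n = corpus.size
    // Document frequency: the number of documents that contain the word at least once
    val df = corpus.flatMap(_.distinct).groupBy(identity).map { case (w, ds) => w -> ds.size }
    df.map { case (word, d) => word -> math.log((n + 1.0) / (d + 1.0)) }
  }

  /** TF-IDF weights for one document; words unseen in the corpus get weight 0. */
  def tfIdf(doc: Doc, idfs: Map[String, Double]): Map[String, Double] =
    termCounts(doc).map { case (word, count) => word -> count * idfs.getOrElse(word, 0.0) }
}
```

A word that appears in every document gets an IDF of 0, so it contributes nothing to the document's vector no matter how often it occurs.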

Page 8: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

How to use Spark LDA

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

Page 9: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark


However, the LDA input data looks nothing like sentences:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

(´Д` )

This does not look like text
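Each row of that input is one document, written as word counts over a fixed vocabulary. A minimal sketch of how such a row arises, using a hypothetical three-word vocabulary and hypothetical names:

```scala
object CountVectorSketch {
  /** One count per vocabulary entry, in vocabulary order. */
  def countVector(tokens: Seq[String], vocabulary: Seq[String]): Seq[Int] =
    vocabulary.map(word => tokens.count(_ == word))

  def main(args: Array[String]): Unit = {
    val vocabulary = Seq("spark", "scala", "text") // hypothetical vocabulary
    val document = Seq("scala", "spark", "scala")  // an already tokenized document
    println(countVector(document, vocabulary).mkString(" ")) // prints: 1 2 0
  }
}
```

The rest of the LDA steps are about getting from raw Japanese text to rows of this shape.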

Page 10: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

LDA Step 0: Get words

Before running LDA, the words must first be extracted.

Page 11: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Word Segmentation
• Hard to actually get right.
• Simple in theory with English
  • Str.Split(" ")
• But not enough for real data.
  • (Take parens for example.)
  • ["(Take", "parens", "for", "example.)"]
  • Etc.
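The parenthesis problem is easy to reproduce. The sketch below contrasts splitting on spaces with a regex that keeps only runs of letters; the regex is just one possible fix and has its own limits (it would split "don't" into two tokens, for example). The object and method names are hypothetical.

```scala
object EnglishSplitSketch {
  /** Naive segmentation: punctuation stays glued to the words. */
  def naive(s: String): List[String] = s.split(" ").toList

  /** Keeps only runs of Unicode letters, dropping punctuation and digits. */
  def lettersOnly(s: String): List[String] = "\\p{L}+".r.findAllIn(s).toList

  def main(args: Array[String]): Unit = {
    val sentence = "(Take parens for example.)"
    println(naive(sentence))       // List((Take, parens, for, example.))
    println(lettersOnly(sentence)) // List(Take, parens, for, example)
  }
}
```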


Page 12: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Word Segmentation

• Since Japanese lacks spaces, it’s hard even in theory
• A probabilistic approach is necessary
• Thankfully there are libraries that can help
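To give a feel for what a probabilistic approach can look like, here is a toy unigram segmenter: it picks the split whose words have the highest combined probability, using dynamic programming. Real morphological analyzers use much richer models and dictionaries; the probabilities and all names below are made up for illustration.

```scala
object UnigramSegmenterSketch {
  // Toy unigram probabilities (hypothetical values, not taken from a real dictionary)
  val wordProb: Map[String, Double] = Map(
    "東京" -> 0.02, "京都" -> 0.02, "東京都" -> 0.01,
    "東" -> 0.005, "都" -> 0.005, "に" -> 0.05, "住む" -> 0.01
  )

  private val maxWordLength = wordProb.keys.map(_.length).max

  /** The segmentation with the highest product of word probabilities, if one exists. */
  def segment(text: String): Option[List[String]] = {
    // best(i) holds the best (log-probability, reversed word list) for the first i characters
    val best = Array.fill[Option[(Double, List[String])]](text.length + 1)(None)
    best(0) = Some((0.0, Nil))
    for (end <- 1 to text.length; start <- math.max(0, end - maxWordLength) until end) {
      val word = text.substring(start, end)
      for ((score, words) <- best(start); p <- wordProb.get(word)) {
        val candidate = score + math.log(p)
        if (best(end).forall(_._1 < candidate)) best(end) = Some((candidate, word :: words))
      }
    }
    best(text.length).map(_._2.reverse)
  }
}
```

With these toy numbers, `segment("東京都に住む")` prefers 東京都 / に / 住む over splits such as 東 / 京都 / に / 住む, because the single word 東京都 is more probable than either two-word alternative. Text that cannot be covered by dictionary words yields `None`.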



Page 13: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Morphological Analyzers

• Include POS tagging, pronunciation and stemming
• MeCab
  • Written in C++ with SWIG bindings to pretty much everything
• Kuromoji
  • Written in Java, available via Maven
• Others

Libraries such as MeCab and Kuromoji are available for morphological analysis (POS tagging, pronunciation, stemming).

Page 14: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

JMecab & Spark/Hadoop

• Not impossible but difficult
  • Add MeCab to each node
  • Add jar to classpaths
  • Include jar in project for compilation
• Not too bad with own hardware but painful with Amazon EMR or Azure HDInsight

JMecab has to be installed in advance, so it is manageable on-premises but hard to run in cloud environments.

Page 15: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Kuromoji & Spark/Hadoop

• Easy
  • Include dependency in build.sbt
  • Include jar file in FatJar with sbt-

Kuromoji is easy to use: just add the dependency and build a FatJar.
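A build definition along these lines might look as follows. This is a sketch under several assumptions: the project name, all version numbers, the resolver URL and the Kuromoji coordinates are not from the slides and should be checked against the Kuromoji and Spark documentation, and the fat-jar plugin is assumed to be sbt-assembly.

```scala
// build.sbt (sketch; name, versions, resolver and coordinates are assumptions)
name := "japanese-text-mining"

scalaVersion := "2.10.6"

// Assumed repository for the Kuromoji release that provides org.atilika.kuromoji.Tokenizer
resolvers += "Atilika Open Source repository" at "http://www.atilika.org/nexus/content/repositories/atilika"

libraryDependencies ++= Seq(
  // "provided": the cluster supplies Spark, so it stays out of the fat jar
  "org.apache.spark" %% "spark-core"  % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.0" % "provided",
  // Bundled into the fat jar, so nothing has to be installed on the nodes
  "org.atilika.kuromoji" % "kuromoji" % "0.7.7"
)

// project/plugins.sbt would then contain a line such as:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
```

Because the tokenizer and its dictionary travel inside the application jar, the same artifact can be submitted to a managed cluster without touching the nodes.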

Page 16: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Using Kuromoji

import org.atilika.kuromoji.Tokenizer

object Main extends App {
  import scala.collection.JavaConverters.asScalaBufferConverter

  val tokenizer = Tokenizer.builder().build()

  val ex1 = "リストのような構造の物から条件を満たす物を探す"
  // tokenize returns a java.util.List; asScala exposes it as a Scala Buffer
  val res1 = tokenizer.tokenize(ex1).asScala

  // Print the base form and part of speech of each token
  for (token <- res1) {
    println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
  }
}

Page 17: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Using Kuromoji

With Kuromoji, the text is recognized like this:

厚生	名詞,一般,*,*
年金	名詞,一般,*,*
基金	名詞,一般,*,*
脱退	名詞,サ変接続,*,*
に	助詞,格助詞,一般,*
伴う	動詞,自立,*,*
手続き	名詞,サ変接続,*,*
について	助詞,格助詞,連語,*
の	助詞,連体化,*,*
リマ	名詞,固有名詞,地域,一般
インド	名詞,固有名詞,地域,国
です	助動詞,*,*,*

リスト	名詞,一般,*,*
の	助詞,連体化,*,*
よう	名詞,非自立,助動詞語幹,*
だ	助動詞,*,*,*
構造	名詞,一般,*,*
の	助詞,連体化,*,*
物	名詞,非自立,一般,*
から	助詞,格助詞,一般,*
条件	名詞,一般,*,*
を	助詞,格助詞,一般,*
満たす	動詞,自立,*,*
物	名詞,非自立,一般,*
を	助詞,格助詞,一般,*
探す	動詞,自立,*,*

Page 18: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 1: Build Vocabulary


Page 19: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

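Vocabulary

// Built lazily, on first use
lazy val tokenizer = Tokenizer.builder().build()

val text = sc.textFile("documents")
// Every token of every line, reduced to its base form
val words = for {
  line <- text
  token <- tokenizer.tokenize(line).asScala
} yield token.getBaseForm

// Assign each distinct word a unique index
val vocab = words.distinct().zipWithIndex().collectAsMap()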

Page 20: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 2: Create Corpus


Page 21: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

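Corpus

// The receivers of the map calls were lost in extraction; they are restored here from the declared types
// Tokenize every line into the base forms of its words
val documentWords: RDD[Array[String]] =
  text.map(line => tokenizer.tokenize(line).asScala.map(t => t.getBaseForm).toArray)
// Count how often each distinct word occurs in its document
val documentCounts: RDD[Array[(String, Int)]] =
  documentWords.map(words => words.distinct.map { word => (word, words.count(_ == word)) })
// Replace each word by its vocabulary index
val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
  documentCounts.map(wordsAndCount => wordsAndCount.map {
    case (word, count) => (vocab(word).toInt, count.toDouble)
  }.toSeq)
// Build one sparse count vector per document and give each document a unique ID
val corpus: RDD[(Long, Vector)] =
  documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)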

Page 22: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 3: Learn Topics


Page 23: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

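Learn Topics

val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)

// describeTopics returns (term indices, weights) per topic;
// vocabulary(i) is assumed to be the word with index i
val topics = ldaModel.describeTopics(10).map { case (terms, weights) =>
  terms.map(vocabulary(_)).zip(weights)
}

topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}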

Page 24: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 4: Evaluate


Page 25: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Topics?

Topic 0:
です	0.10870545899718176
。	0.09623411796419644
さん	0.06105040403724023

Topic 1:
の	0.11035671185240525
を	0.07860862808644907
する	0.05605566895190625

Topic 2:
お願い	0.07579177409154919
ご	0.04431117457391179
よろしく	0.032788330612439916


Page 26: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 5: GOTO 2

Page 27: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

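Filter Stopwords

// Treat the 50 most frequent words as stopwords
val popular = words
  .map(w => (w, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)
  .take(50)
  .map(_._1)
  .toSet

// Rebuild the vocabulary without the stopwords
val vocabIndicies = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
// collectAsMap returns a scala.collection.Map
val vocab: scala.collection.Map[String, Long] = vocabIndicies.collectAsMap()
// vocabulary(i) is the word whose index is i
val vocabulary = vocabIndicies.collect().map(_._1)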


Page 28: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Topics!

Topic 0:
変更	0.032952997236706624
サーバー	0.03140777729144046
設定	0.021643554361727567
エラー	0.017955380768330902

Topic 1:
ログ	0.028665774057609564
時間	0.026686704628121154
時	0.02404938565591628
発生	0.020797622509804107

Topic 2:
様	0.0474658820402456
株式会社	0.026174292703953685
お世話	0.021939329774535308

Page 29: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Using the LDA model

• Prediction requires a LocalLDAModel
  • Use .toLocal if isInstanceOf[DistributedLDAModel]
• Convert to Vector using same steps
  • Be sure to filter out words not in the vocabulary
• Call topicDistributions to see topic scores

The LDA model is used to predict topics.
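The two preparation steps, getting a local model and turning tokens into a count vector over the training vocabulary, could be sketched as below. It assumes Spark MLlib 1.5 or later on the classpath; the object and helper names are hypothetical, and a pattern match stands in for the `isInstanceOf` check.

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object LdaPredictionSketch {
  /** Returns a LocalLDAModel, converting a distributed model when necessary. */
  def toLocalModel(model: LDAModel): LocalLDAModel = model match {
    case distributed: DistributedLDAModel => distributed.toLocal
    case local: LocalLDAModel             => local
    case other => throw new IllegalArgumentException(s"Unsupported model: ${other.getClass}")
  }

  /** Sparse count vector over the training vocabulary; unknown words are dropped. */
  def toCountVector(tokens: Seq[String], vocab: scala.collection.Map[String, Long]): Vector = {
    val counts = tokens
      .filter(vocab.contains) // words outside the vocabulary have no index
      .groupBy(identity)
      .map { case (word, occurrences) => (vocab(word).toInt, occurrences.size.toDouble) }
      .toSeq
    Vectors.sparse(vocab.size, counts)
  }
}
```

Given an RDD of `(id, vector)` pairs built with `toCountVector`, `toLocalModel(ldaModel).topicDistributions(documents)` returns one topic-score vector per document.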

Page 30: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Topics Prediction

New document topics: 0.091084004103132,0.1044111561202625,0.09090943947509807,0.11607354553753861,0.10404284803971378,0.09697071269561051,0.09571658794577831,0.0919546186785918,0.09176248930132802,0.11707459810294643

New document topics: 0.09424474530277152,0.1183270779577911,0.09230776874419214,0.09835759337114718,0.13159581881630272,0.09279638945611612,0.094124104743527,0.09295449996673977,0.09291472297512052,0.09237727866629193


Topic 0 Topic 1 Topic 2 Topic …

Page 31: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Now what?

• Find the minimum logLikelihood in a set of documents you know are OK

• Report anomaly whenever a new document has a lower logLikelihood
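The thresholding rule itself is simple. A minimal sketch in plain Scala, with hypothetical names and made-up scores; in practice each score would come from the model's logLikelihood for one document:

```scala
object AnomalyThresholdSketch {
  /** The lowest log-likelihood observed among documents known to be OK. */
  def threshold(knownGoodScores: Seq[Double]): Double = {
    require(knownGoodScores.nonEmpty, "need at least one known-good score")
    knownGoodScores.min
  }

  /** A new document is anomalous when it scores below every known-good document. */
  def isAnomaly(score: Double, threshold: Double): Boolean = score < threshold

  def main(args: Array[String]): Unit = {
    val limit = threshold(Seq(-120.5, -98.2, -110.0)) // made-up scores
    println(isAnomaly(-150.0, limit)) // true: less likely than anything seen before
    println(isAnomaly(-100.0, limit)) // false
  }
}
```

One caveat worth keeping in mind: log-likelihood generally falls as documents get longer, so comparing documents of very different lengths may need some form of normalization.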


Page 32: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Anomaly Detection

val newDoc = sc.parallelize(Seq("平素は当社サービスをご利用いただき、誠にありがとうございます。"))

def stringToCountVector(strings: RDD[String]) = { . . . }

val score = lda.logLikelihood(stringToCountVector(newDoc))




Page 33: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Word2Vec
• Creates vectors that represent points in meaning space
• Unsupervised but requires a lot of data to generate good vectors
• Google’s sample vectors trained on 100 billion words (~X00GB?)
• Vectors with less data can provide interesting similarities but can’t do so consistently

Word2Vec turns words into vectors, giving a quantitative representation that makes it possible to compute the similarity between words.

Page 34: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Word2Vec Intuition

• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
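The regularity that paper is known for is that vector offsets encode relations, as in king - man + woman ≈ queen. The sketch below reproduces the arithmetic on hand-made four-dimensional vectors. They are toy values chosen for illustration, not trained embeddings, and all names are hypothetical.

```scala
object WordAnalogySketch {
  type Vec = Array[Double]

  def plus(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def minus(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  /** Cosine similarity; defined as 0 when either vector is all zeros. */
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Toy vectors with dimensions (royal, male, female, young); hand-made, not trained
  val toyVectors: Map[String, Vec] = Map(
    "king"   -> Array(1.0, 1.0, 0.0, 0.0),
    "queen"  -> Array(1.0, 0.0, 1.0, 0.0),
    "man"    -> Array(0.0, 1.0, 0.0, 0.0),
    "woman"  -> Array(0.0, 0.0, 1.0, 0.0),
    "prince" -> Array(1.0, 1.0, 0.0, 1.0),
    "girl"   -> Array(0.0, 0.0, 1.0, 1.0)
  )

  /** Answers "a is to b as c is to ?" with the nearest remaining word by cosine similarity. */
  def analogy(a: String, b: String, c: String, vectors: Map[String, Vec]): String = {
    val target = plus(minus(vectors(b), vectors(a)), vectors(c))
    val exclude = Set(a, b, c)
    vectors.filter { case (word, _) => !exclude(word) }
      .maxBy { case (_, v) => cosine(target, v) }._1
  }
}
```

Here `analogy("man", "king", "woman", toyVectors)` returns "queen". With trained vectors the match is only approximate, which is why the nearest neighbour is used rather than an exact lookup.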


Page 35: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Vector Concatenation






サポート. . .

Page 36: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 1: Make vectors


Page 37: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Making Word2VecModel

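// The map over the input lines was lost in extraction; base forms are assumed, as in the LDA steps
val documentWords: RDD[Seq[String]] =
  text.map(line => tokenizer.tokenize(line).asScala.map(_.getBaseForm).toSeq)

val model = new Word2Vec().setVectorSize(300).fit(documentWords)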

Page 38: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 2: Use vectors


Page 39: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Using Word2VecModel

model.findSynonyms("日本", 5).foreach(println)

(マイクロソフト,3.750299190465294)
(ビジネス,3.7329870992662104)
(株式会社,3.323983664186244)
(システムズ,3.1331352923187987)
(ビジネスプロダクティビティ,2.595931613590554)

A big dataset is very important.

Page 40: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark


• Paragraph Vectors
  • Not available in Spark T_T

Recommendation based on vectorizing whole sentences is not possible in Spark.

Page 41: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Embedding with Vector Concatenation
• Calculate the sum of the word vectors in the description
• Add it to vectors from Word2VecModel.getVectors with special keyword (Ex. ITEM_1234)
• Create new Word2VecModel using constructor
• ※ Not state of the art but can produce reasonable recommendations without user rating data

Embedding by vector concatenation: for each "item", sum the vectors of the words it contains.
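The idea can be tried without Spark. The sketch below sums word vectors per item, stores each sum under the item's key next to the word vectors, and ranks other items by cosine similarity. All names and the two-dimensional vectors are hypothetical, and unlike a fallback-vector approach this version simply skips words that have no vector.

```scala
object ItemEmbeddingSketch {
  type Vec = Array[Double]

  /** Element-wise sum of the given vectors; all zeros when the list is empty. */
  def sum(vectors: Seq[Vec], size: Int): Vec =
    vectors.foldLeft(Array.fill(size)(0.0)) { (acc, v) => acc.zip(v).map { case (x, y) => x + y } }

  /** Cosine similarity; defined as 0 when either vector is all zeros. */
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  /** Adds one entry per item, keyed by item ID, holding the sum of its known word vectors. */
  def withItems(wordVectors: Map[String, Vec], items: Map[String, Seq[String]], size: Int): Map[String, Vec] =
    wordVectors ++ items.map { case (id, words) => id -> sum(words.flatMap(wordVectors.get), size) }

  /** Other item keys, most similar first. */
  def similarItems(id: String, vectors: Map[String, Vec], prefix: String = "ITEM_"): Seq[String] =
    vectors.keys.toSeq
      .filter(key => key.startsWith(prefix) && key != id)
      .sortBy(key => -cosine(vectors(id), vectors(key)))
}
```

Because items and words live in the same map, the same nearest-neighbour lookup serves both "items similar to this item" and "items similar to this new sentence".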

Page 42: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Item Embedding (1/2)

val embeds = Map(
  "ITEM_001_01" -> "営業部門の情報共有と活用をサポートし",
  "ITEM_001_02" -> "組織的な営業力 売れる仕組みを構築します・",
  "ITEM_001_03" -> "営業情報のコミュニケーション基盤を構築する",
  "ITEM_002_01" -> "一般的なサーバ、ネットワーク機器や OSレベルの監視に加え",
  "ITEM_002_02" -> "またモニタリングポータルでは、アラームの発生状況",
  "ITEM_002_03" -> "監視システムにより取得されたパフォーマンス情報が逐次ダッシュボード形式",
  "ITEM_003_01" -> "IPネットワークインフラストラクチャを構築します",
  "ITEM_003_02" -> "導入にとどまらず、アプリケーションや OAシステムとの融合を図ったユニファイドコミュニケーション環境を構築",
  "ITEM_003_03" -> "企業内および企業外へのコンテンツの効果的な配信環境、閲覧環境をご提供します"
)

Page 43: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Item Embedding (2/2)

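def stringToVector(s: String): Array[Double] = {
  // Base forms are assumed here, matching the tokenization used in the earlier steps
  val words = tokenizer.tokenize(s).asScala.map(_.getBaseForm)
  // Words missing from the model fall back to the vector for "は"
  val vectors = words.map(word => Try(model.transform(word)).getOrElse(model.transform("は")))
  val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
  // Element-wise sum of all word vectors; vectorLength is the Word2Vec vector size
  val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
  concat.toArray
}

// One summed vector per item, stored under the item's key
val embedVectors: Map[String, Array[Float]] = embeds.map {
  case (key, value) => (key, stringToVector(value).map(_.toFloat))
}

// A new model that contains both the item vectors and the original word vectors
val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)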

Page 44: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Recommending Similar

embedModel.findSynonyms("ITEM_001_01", 5).foreach(println)








Page 45: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Recommending New

val newSentence = stringToVector("会計・受発注及び生産管理を中心としたシステム")
embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)









Page 46: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Thank you

• Questions?

• Example source code at: