spark + clojure for topic discovery - zalando tech clojure/conj talk
TRANSCRIPT
Hunter Kelly (@retnuh)
All the Topics on the Interwebs
Perhaps this?
Or maybe this?
embassy wikileaks assange
german merkel cables
snowden spiegel spying
Wut?
❖ What are we actually doing?
➢ Mining web pages for insights
❖ How?
➢ Using Machine Learning to do the heavy lifting
■ Use classifiers to filter/bucket the data
■ Build topic models to try to discover concepts related to words
❖ Getting Data
➢ DMOZ
➢ Common Crawl
❖ Manipulating Data
➢ Spark
➢ Sparkling
■ RDDs
■ DataFrames
❖ Data Science
➢ MLLib
➢ Classification - Random Forests™
➢ LDA (Latent Dirichlet Allocation)
DMOZ & Common Crawl
❖ DMOZ
➢ “The largest human edited directory of the web”
➢ Useful when you think of it in terms of “free crowdsourced labeled data”
➢ Fairly ancient, borderline decrepit
➢ Crowdsourced is a double-edged sword
❖ Common Crawl (CC)
➢ “An open repository of web crawl data that can be accessed and analyzed by anyone.”
➢ Monthly crawls
➢ Readily accessible index
➢ Tons of free data - raw, links, plain text formats
❖ How to use them together!
➢ Use DMOZ to sample positive and negative “seed links”
➢ Look up and expand your “seed links” using the CC index
➢ Fetch your data with little/no fuss using CC index information
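The talk doesn't show the index lookup itself; as a minimal sketch, looking up a seed link against the CC index amounts to building a query URL. The `query-url` helper and the crawl id below are our own illustrative assumptions, not code from the talk - check index.commoncrawl.org for current crawl ids.

```clojure
(defn query-url
  "Hypothetical helper: build a Common Crawl index query URL for a domain.
   The crawl id and endpoint shape are assumptions for illustration."
  [crawl-id domain]
  (str "http://index.commoncrawl.org/" crawl-id "-index"
       "?url=" domain "%2F*"   ;; match all pages under the domain
       "&output=json"))

(query-url "CC-MAIN-2015-48" "example.com")
;; => "http://index.commoncrawl.org/CC-MAIN-2015-48-index?url=example.com%2F*&output=json"
```

Each JSON line in the response carries the WARC filename, offset, and length needed to fetch the page data directly.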
Spark & Sparkling
❖ Apache Spark
➢ The “next big thing”
➢ Or arguably the “current” big thing
❖ Sparkling
➢ Clojure bindings for Spark
➢ Great presentation (highly recommended)
➢ RDDs
➢ DataFrames
RDDs
❖ RDDs
➢ Resilient Distributed Datasets
➢ Easy to think of them as partitioned (or sharded) seqs
➢ Transformations (map, filter, etc.) are lazy
➢ Operations (count, collect, reduce, etc.) cause evaluation
➢ Very familiar paradigms for Clojure programmers
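The lazy/eager split maps directly onto core Clojure. A small sketch, no Spark required: map/filter build a lazy pipeline (like RDD transformations), while count forces it (like an RDD action).

```clojure
;; Plain-Clojure analogy for RDD semantics. The atom counts how many
;; times the "transformation" actually runs.
(def evaluations (atom 0))

(def pipeline
  (->> (range 10)
       (map (fn [x] (swap! evaluations inc) (* x x)))  ;; lazy "transformation"
       (filter odd?)))                                 ;; still lazy

;; Nothing has been evaluated yet:
@evaluations ;; => 0

;; An "action" forces the whole pipeline:
(count pipeline) ;; => 5  (the odd squares 1 9 25 49 81)
@evaluations     ;; => 10
```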
(defn sieve-prime-multiples [n primes numbers]
  (let [max-prime (last primes)
        upto (* max-prime max-prime)
        prime-multiples (->> primes
                             (r/mapcat #(generate-multiples % n (odd? %)))
                             (into #{}))
        candidates (->> numbers
                        (r/remove prime-multiples))
        new-primes (->> candidates
                        (r/filter #(< % upto))
                        r/foldcat
                        sort
                        (into []))
        remaining (->> candidates
                       (r/remove (set new-primes))
                       r/foldcat)]
    [new-primes remaining]))
Clojure using Reducers
(defn sieve-prime-multiples [ctx n primes numbers-rdd]
  (let [max-prime (last primes)
        upto (* max-prime max-prime)
        prime-multiples-rdd (->> (spark/parallelize ctx primes)
                                 (spark/flat-map #(generate-multiples % n (odd? %))))
        candidates-rdd (spark/cache (.subtract numbers-rdd prime-multiples-rdd))
        new-primes-rdd (->> candidates-rdd
                            (spark/filter #(< % upto))
                            spark/cache)
        new-primes (vec (sort (spark/collect new-primes-rdd)))
        remaining-rdd (.subtract candidates-rdd new-primes-rdd)]
    (.unpersist candidates-rdd false)
    (.unpersist new-primes-rdd false)
    [new-primes remaining-rdd]))
Clojure using Spark
❖ A Historical Tangent
➢ “Those who cannot remember the past are condemned to repeat it.”
➢ ~15 years ago, everything was running MySQL, Oracle, etc.
➢ ~7 years ago, everyone was abandoning SQL+RDBMS for NoSQL
➢ Now looping back to SQL - Spark SQL, Google F1, etc.
DataFrames
❖ DataFrames
➢ DataFrames are the new hotness
➢ They're how Python and R can now achieve similar speeds
➢ The Catalyst execution engine can plan intelligently - behind the scenes it generates source code, makes heavy use of Scala macros, optimizes away boxing/unboxing calls, etc.
➢ Focus is clearly on DataFrames and the upcoming DataSets
❖ DataFrames (cont)
➢ Great in Scala, not so much via JVM interop
➢ Heavy use of Scala magic like implicits, etc.
➢ Working with DataFrames from Clojure can be… less than pleasant
➢ Scala folks really like their static, declared types
➢ Going to get worse with DataSets
(def FEATURE-TYPE [[:feature DataTypes/IntegerType]])
(def FEATURE-SCHEMA (types->schema FEATURE-TYPE))

(defn create-feature-table [sql-ctx table-name features]
  (let [ctx (.sparkContext sql-ctx)
        features-rdd (->> (spark/parallelize (JavaSparkContext. ctx) (seq features))
                          (spark/map (fn [i] (RowFactory/create (to-array [i])))))
        features-df (.createDataFrame sql-ctx features-rdd FEATURE-SCHEMA)]
    (.registerTempTable features-df table-name)
    features-df))
Creating a single column DataFrame
(let [query-df (-> bow-df
                   (.select "word" (into-array ["index"])))]
  (reduce (fn [[bow rbow] row]
            [(assoc bow (.getString row 0) (.getInt row 1))
             (assoc rbow (.getInt row 1) (.getString row 0))])
          [{} {}]
          (.collectAsList query-df)))
(-> bow-df
    (.join features-df (.equalTo ind-col (.col features-df "feature")))
    (.select (into-array [(.col bow-df "*") feature-index-col]))
    (.orderBy (into-array [feature-index-col])))
Machine Learning Elevator Pitch
❖ Machine Learning Key Points
➢ Uses statistical methods on large amounts of data to (hopefully) gain insights
➢ Uses vectors of numbers extracted (by you) from your data - “feature vectors”
➢ Classification puts things into buckets, e.g. “fashion related website” vs. “everything else”
➢ Topic modeling - a way of finding patterns in a bunch of documents - a “corpus”
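As a concrete illustration of the “feature vector” idea: fix a vocabulary and turn each document into a vector of word counts. This is a pure-Clojure sketch; the tiny vocabulary and helper name are ours, standing in for the real bag of words built later in the talk.

```clojure
(require '[clojure.string :as str])

;; Illustrative-only vocabulary; in the talk the vocabulary is the
;; bag of words mined from the crawl data.
(def vocabulary ["fashion" "dress" "shoes" "crawl"])

(defn text->feature-vector
  "Hypothetical helper: count each vocabulary word's occurrences in text."
  [text]
  (let [counts (frequencies (str/split (str/lower-case text) #"\s+"))]
    (mapv #(get counts % 0) vocabulary)))

(text->feature-vector "Fashion dress dress shoes")
;; => [1 2 1 0]
```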
MLLib
❖ MLLib
➢ Spark’s Machine Learning (ML) library
➢ “Its goal is to make practical machine learning scalable and easy”
➢ Divides into two packages:
■ spark.mllib - built on top of RDDs
■ spark.ml - built on top of DataFrames
❖ MLLib (cont)
➢ All the basics - Vectors, Sparse Vectors, LabeledPoints, etc.
➢ A good variety of algorithms, all designed for running in parallel
➢ Well documented
➢ Large community
MLLib gives us this...
But we want this!
❖ Example - Metrics
➢ BinaryClassificationMetrics has some useful things, but not basic things
➢ Have to use MulticlassMetrics for some of the most wanted metrics, even on a binary classifier
➢ Neither actually gives you the count of items by label - but BinaryClassificationMetrics logs it to INFO
➢ End up iterating your data 3 (!) times to get all desired metrics
Computing metrics

(defn metrics [rdd model]
  (let [pl (->> rdd
                (spark/map (fn [point]
                             (let [y (.label point)
                                   x (.features point)]
                               (spark/tuple (.predict model x) y))))
                spark/cache)
        multi-metrics (MulticlassMetrics. (.rdd pl))
        metrics (BinaryClassificationMetrics. (.rdd pl))
        r {:area-under-pr (.areaUnderPR metrics)
           :f-measure (.fMeasure multi-metrics 1.0)
           ;; Others elided
           :label-counts (->> rdd
                              (spark/map-to-pair
                                (fn [point] (spark/tuple (.label point) 1)))
                              spark/count-by-key)}]
    (.unpersist pl false)
    r))
❖ Examples - Eye on the prize?
➢ HashingTF - oh boy
■ Lose all access to the original word
■ Uses a gigantic Array instead of a HashMap
➢ ChiSqSelector - used to select top N features
■ But how do we determine N? Can’t ask
■ End up grubbing around in the source to find it uses Statistics/chiSqTest
Computing Chi-Square Test
(let [sql-ctx (spark-util/make-sql-context ctx)
      labels-features-df (spark-util/maybe-sample-df
                           options
                           (spark-util/load-table sql-ctx "features" input))
      labeled-points-rdd (->> (lf/load-labels-and-features-from-parquet labels-features-df true)
                              (spark/map (fn [m] (get-in m [:labeled-points :term-count]))))
      [bow rbow] (bow/load-bow-maps-from-table
                   sql-ctx
                   (spark-util/load-table sql-ctx "bow" bow-input))
      chi-sq-arr (Statistics/chiSqTest labeled-points-rdd)]
  (doseq [[ind tst] (map-indexed vector (seq chi-sq-arr))]
    (log/info "Feature:" ind (rbow ind) "tst:" tst)))
Classification w/Random Forests
❖ Classification
➢ Using lots of data to tell things apart
➢ Can put stuff into two buckets (or “classes”) - Binary Classifier
➢ Or into many buckets - Multi-class Classifier
➢ Lots of different techniques
➢ Supervised learning - each sample needs:
■ “features” - a vector of numeric data
■ “label” - a label specifying its class
❖ The Bag of Words
➢ We started with very basic word cleansing - lowercase, remove non letters/digits, 3 char minimum length, drop things that are just numbers
➢ Managed to make it this far in the talk without having to use word count!
➢ But ultimately most Data Science/ML tasks involving text end up heavily dependent on word count
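The word count in question is the classic one, which in core Clojure is essentially a one-liner (a sketch, not code from the talk):

```clojure
(require '[clojure.string :as str])

;; Word count: the perennial starting point for text-heavy ML tasks.
(defn word-count [text]
  (frequencies (str/split (str/lower-case text) #"\s+")))

(word-count "the quick the lazy the")
;; => {"the" 3, "quick" 1, "lazy" 1}
```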
❖ The Bag of Words (cont)
➢ Ended up with too many words (1.3M) even on a sample
➢ Were working on a bare baseline, so no stopword removal or stemming, following the KISS principle
➢ We did say a word must occur on >= 5 distinct sites (not documents), which reduced the size to 460k words
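The `clean-word-seq` helper used in the next snippet isn't shown in the talk; here is a hedged pure-Clojure reconstruction of the cleansing rules described above (lowercase, strip non letters/digits, 3-char minimum, drop pure numbers):

```clojure
(require '[clojure.string :as str])

(defn clean-word-seq
  "Hypothetical reconstruction of the talk's word cleansing: lowercase,
   strip non letters/digits, keep words >= 3 chars that aren't just numbers."
  [text]
  (->> (str/split (str/lower-case text) #"\s+")
       (map #(str/replace % #"[^a-z0-9]" ""))
       (remove #(< (count %) 3))
       (remove #(re-matches #"\d+" %))))

(clean-word-seq "The 2015 Fall/Winter look-book, 100% wool!")
;; => ("the" "fallwinter" "lookbook" "wool")
```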
(defn create-bow-site-occurance [json-lines-rdd]
  (->> json-lines-rdd
       (spark/map-to-pair (fn [m]
                            (spark/tuple (site (:url m))
                                         (set (clean-word-seq (:raw_text m))))))
       (spark/reduce-by-key union)
       (spark/flat-map-to-pair
         (s-de/key-value-fn (fn [site words]
                              (map spark/tuple words (repeat 1)))))
       (spark/reduce-by-key +)
       (spark/filter
         (s-de/key-value-fn (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT))))
       spark/sort-by-key))
Bag of Words
❖ Random Forests™
➢ Ensemble of Decision Trees
➢ Uses “bootstrapping” for selection of feature set and training set
➢ Not “Deep Learning” but extremely easy to use and very effective
➢ “Any sufficiently advanced technology is indistinguishable from magic.”
➢ Able to get pretty decent results! F-measure 0.86
Train the Random Forest from LabeledPoints

(defn train-random-forest [num-trees max-depth max-bins seed labeled-points-rdd]
  (let [p {:num-classes 2
           :categorical-feature-info {}
           :feature-subset-strategy "auto"
           :impurity "gini"
           :max-depth max-depth
           :max-bins max-bins}]
    (RandomForest/trainClassifier labeled-points-rdd
                                  (:num-classes p)
                                  (:categorical-feature-info p)
                                  num-trees
                                  (:feature-subset-strategy p)
                                  (:impurity p)
                                  (:max-depth p)
                                  (:max-bins p)
                                  seed)))
Prepare to train/test the Random Forest

(defn load-and-train-random-forest [rdd num-trees max-depth max-bins seed & [sample-fraction]]
  (let [sampled-rdd (if sample-fraction
                      (spark/sample false sample-fraction seed rdd)
                      rdd)
        labeled-rdd (->> sampled-rdd
                         (spark/map #(labeled-point lf/fashion? %)))
        [train test] (.randomSplit labeled-rdd (double-array [0.9 0.1]) seed)
        cached-train (spark/cache train)
        cached-test (spark/cache test)
        model (train-random-forest num-trees max-depth max-bins seed cached-train)]
    [cached-train cached-test model]))
Topic Modelling with LDA
❖ LDA - Latent Dirichlet Allocation
➢ Topic Model which infers topics from a text corpus
➢ Topics -> cluster centers, docs -> rows
➢ Features are vectors of word counts (Bag of Words)
➢ Unsupervised Learning technique (but you do supply the topic count)
❖ LDA (cont)
➢ Quite tetchy to run at large scale
➢ OutOfMemory errors on executors
➢ Job aborted due to stage failure: Serialized task 4341:0 was 365752339 bytes, which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.
➢ WTF?
➢ BTW, do not ever change “spark.akka.frameSize”...
❖ LDA (moar cont)
➢ Finally able to get a trained model after reducing the BoW to a more manageable size: ~11k, down from ~160k
➢ Trained on ~100k documents, roughly an even split between fashion/non-fashion
➢ These models are for demonstration purposes; moar fanciness planned
Train an LDA Model

(defn train-lda-model [num-topics seed features-fn maps-rdd]
  (let [rdd (->> maps-rdd
                 (spark/map (fn [{:keys [doc-number] :as m}]
                              (spark/tuple doc-number (features-fn m))))
                 spark/cache)
        corpus-size (spark/count rdd)
        mbf (mini-batch-fraction-batch-size corpus-size 5000)
        max-iters (int (Math/ceil (/ mbf)))
        optimizer (doto (OnlineLDAOptimizer.)
                    (.setMiniBatchFraction (min 1.0 mbf)))
        model (-> (doto (LDA.)
                    (.setOptimizer optimizer)
                    (.setK num-topics)
                    (.setSeed seed)
                    (.setMaxIterations max-iters))
                  (.run (.rdd rdd)))]
    (.unpersist rdd false)
    model))
Demo!
So what’s the point?
❖ So what did we do?
➢ We took pre-scraped, “pre-labeled” data
➢ Used Clojure and Spark/Sparkling to munge the data
➢ Used state of the art ML tools to analyze the data
➢ Explored for insights
❖ So what can YOU do?
➢ This will work for almost ANY domain
➢ There’s a lot of interesting information even at this stage
➢ There’s a ton of interesting directions this can go
■ Run the classifier over all of the CC data
■ Build domain-specific LDA models
➢ Do cool things and have fun doing it!