spark + clojure for topic discovery - zalando tech clojure/conj talk
TRANSCRIPT
Hunter Kelly (@retnuh)
All the Topics on the Interwebs
Perhaps this?
Or maybe this?
embassy wikileaks assange
german merkel cables
snowden spiegel spying
Wut?
❖ What are we actually doing?
➢ Mining web pages for insights
❖ How?
➢ Using Machine Learning to do the heavy lifting
■ Use classifiers to filter/bucket the data
■ Build topic models to try to discover concepts related to words
❖ Getting Data
➢ DMOZ
➢ Common Crawl
❖ Manipulating Data
➢ Spark
➢ Sparkling
■ RDDs
■ DataFrames
❖ Data Science
➢ MLLib
➢ Classification - Random Forests™
➢ LDA (Latent Dirichlet Allocation)
DMOZ & Common Crawl
❖ DMOZ
➢ “The largest human edited directory of the web”
➢ Useful when you think of it in terms of “free crowdsourced labeled data”
➢ Fairly ancient, borderline decrepit
➢ Crowdsourced is a double-edged sword
❖ Common Crawl (CC)
➢ “An open repository of web crawl data that can be accessed and analyzed by anyone.”
➢ Monthly crawls
➢ Readily accessible index
➢ Tons of free data - raw, links, plain text formats
❖ How to use them together!
➢ Use DMOZ to sample positive and negative “seed links”
➢ Look up and expand your “seed links” using the CC index
➢ Fetch your data with little/no fuss using CC index information
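The talk doesn't show the index lookup itself; as a minimal sketch, looking up a seed link against the CC index amounts to building a query URL. The `query-url` helper and the crawl id below are our own illustrative assumptions, not code from the talk - check index.commoncrawl.org for current crawl ids.

```clojure
(defn query-url
  "Hypothetical helper: build a Common Crawl index query URL for a domain.
   The crawl id and endpoint shape are assumptions for illustration."
  [crawl-id domain]
  (str "http://index.commoncrawl.org/" crawl-id "-index"
       "?url=" domain "%2F*"   ;; match all pages under the domain
       "&output=json"))

(query-url "CC-MAIN-2015-48" "example.com")
;; => "http://index.commoncrawl.org/CC-MAIN-2015-48-index?url=example.com%2F*&output=json"
```

Each JSON line in the response carries the WARC filename, offset, and length needed to fetch the page data directly.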
Spark & Sparkling
❖ Apache Spark
➢ The “next big thing”
➢ Or arguably the “current” big thing
❖ Sparkling
➢ Clojure bindings for Spark
➢ Great presentation (highly recommended)
➢ RDDs
➢ DataFrames
RDDs
❖ RDDs
➢ Resilient Distributed Datasets
➢ Easy to think of them as partitioned (or sharded) seqs
➢ Transformations (map, filter, etc.) are lazy
➢ Operations (count, collect, reduce, etc.) cause evaluation
➢ Very familiar paradigms for Clojure programmers
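The lazy/eager split maps directly onto core Clojure. A small sketch, no Spark required: map/filter build a lazy pipeline (like RDD transformations), while count forces it (like an RDD action).

```clojure
;; Plain-Clojure analogy for RDD semantics. The atom counts how many
;; times the "transformation" actually runs.
(def evaluations (atom 0))

(def pipeline
  (->> (range 10)
       (map (fn [x] (swap! evaluations inc) (* x x)))  ;; lazy "transformation"
       (filter odd?)))                                 ;; still lazy

;; Nothing has been evaluated yet:
@evaluations ;; => 0

;; An "action" forces the whole pipeline:
(count pipeline) ;; => 5  (the odd squares 1 9 25 49 81)
@evaluations     ;; => 10
```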
(defn sieve-prime-multiples [n primes numbers]
  (let [max-prime (last primes)
        upto (* max-prime max-prime)
        prime-multiples (->> primes
                             (r/mapcat #(generate-multiples % n (odd? %)))
                             (into #{}))
        candidates (->> numbers
                        (r/remove prime-multiples))
        new-primes (->> candidates
                        (r/filter #(< % upto))
                        r/foldcat
                        sort
                        (into []))
        remaining (->> candidates
                       (r/remove (set new-primes))
                       r/foldcat)]
    [new-primes remaining]))
Clojure using Reducers
(defn sieve-prime-multiples [ctx n primes numbers-rdd]
  (let [max-prime (last primes)
        upto (* max-prime max-prime)
        prime-multiples-rdd (->> (spark/parallelize ctx primes)
                                 (spark/flat-map #(generate-multiples % n (odd? %))))
        candidates-rdd (spark/cache (.subtract numbers-rdd prime-multiples-rdd))
        new-primes-rdd (->> candidates-rdd
                            (spark/filter #(< % upto))
                            spark/cache)
        new-primes (vec (sort (spark/collect new-primes-rdd)))
        remaining-rdd (.subtract candidates-rdd new-primes-rdd)]
    (.unpersist candidates-rdd false)
    (.unpersist new-primes-rdd false)
    [new-primes remaining-rdd]))
Clojure using Spark
❖ A Historical Tangent
➢ “Those who cannot remember the past are condemned to repeat it.”
➢ ~15 years ago, everything was running MySQL, Oracle, etc.
➢ ~7 years ago, everyone was abandoning SQL+RDBMS for NoSQL
➢ Now looping back to SQL - Spark SQL, Google F1, etc.
DataFrames
❖ DataFrames
➢ DataFrames are the new hotness
➢ They're how Python and R can now achieve similar speeds
➢ The Catalyst execution engine can plan intelligently - behind the scenes it generates source code, makes heavy use of Scala macros, optimizes away boxing/unboxing calls, etc.
➢ Focus is clearly on DataFrames and the upcoming DataSets
❖ DataFrames (cont)
➢ Great in Scala, not so much via JVM interop
➢ Heavy use of Scala magic like implicits, etc.
➢ Working with DataFrames from Clojure can be… less than pleasant
➢ Scala folks really like their static, declared types
➢ Going to get worse with DataSets
(def FEATURE-TYPE [[:feature DataTypes/IntegerType]])
(def FEATURE-SCHEMA (types->schema FEATURE-TYPE))

(defn create-feature-table [sql-ctx table-name features]
  (let [ctx (.sparkContext sql-ctx)
        features-rdd (->> (spark/parallelize (JavaSparkContext. ctx) (seq features))
                          (spark/map (fn [i] (RowFactory/create (to-array [i])))))
        features-df (.createDataFrame sql-ctx features-rdd FEATURE-SCHEMA)]
    (.registerTempTable features-df table-name)
    features-df))
Creating a single column DataFrame
(let [query-df (-> bow-df
                   (.select "word" (into-array ["index"])))]
  (reduce (fn [[bow rbow] row]
            [(assoc bow (.getString row 0) (.getInt row 1))
             (assoc rbow (.getInt row 1) (.getString row 0))])
          [{} {}]
          (.collectAsList query-df)))
(-> bow-df
    (.join features-df (.equalTo ind-col (.col features-df "feature")))
    (.select (into-array [(.col bow-df "*") feature-index-col]))
    (.orderBy (into-array [feature-index-col])))
Machine Learning Elevator Pitch
❖ Machine Learning Key Points
➢ Uses statistical methods on large amounts of data to (hopefully) gain insights
➢ Uses vectors of numbers extracted (by you) from your data - “feature vectors”
➢ Classification puts things into buckets, e.g. “fashion related website” vs. “everything else”
➢ Topic modeling - a way of finding patterns in a bunch of documents - a “corpus”
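As a concrete illustration of the “feature vector” idea: fix a vocabulary and turn each document into a vector of word counts. This is a pure-Clojure sketch; the tiny vocabulary and helper name are ours, standing in for the real bag of words built later in the talk.

```clojure
(require '[clojure.string :as str])

;; Illustrative-only vocabulary; in the talk the vocabulary is the
;; bag of words mined from the crawl data.
(def vocabulary ["fashion" "dress" "shoes" "crawl"])

(defn text->feature-vector
  "Hypothetical helper: count each vocabulary word's occurrences in text."
  [text]
  (let [counts (frequencies (str/split (str/lower-case text) #"\s+"))]
    (mapv #(get counts % 0) vocabulary)))

(text->feature-vector "Fashion dress dress shoes")
;; => [1 2 1 0]
```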
MLLib
❖ MLLib
➢ Spark’s Machine Learning (ML) library
➢ “Its goal is to make practical machine learning scalable and easy”
➢ Divides into two packages:
■ spark.mllib - built on top of RDDs
■ spark.ml - built on top of DataFrames
❖ MLLib (cont)
➢ All the basics - Vectors, Sparse Vectors, LabeledPoints, etc.
➢ A good variety of algorithms, all designed for running in parallel
➢ Well documented
➢ Large community
MLLib gives us this...
But we want this!
❖ Example - Metrics
➢ BinaryClassificationMetrics has some useful things, but not basic things
➢ Have to use MulticlassMetrics for some of the most wanted metrics, even on a binary classifier
➢ Neither actually gives you the count of items by label - but BinaryClassificationMetrics logs it to INFO
➢ End up iterating your data 3 (!) times to get all desired metrics
Computing metrics

(defn metrics [rdd model]
  (let [pl (->> rdd
                (spark/map (fn [point]
                             (let [y (.label point)
                                   x (.features point)]
                               (spark/tuple (.predict model x) y))))
                spark/cache)
        multi-metrics (MulticlassMetrics. (.rdd pl))
        metrics (BinaryClassificationMetrics. (.rdd pl))
        r {:area-under-pr (.areaUnderPR metrics)
           :f-measure (.fMeasure multi-metrics 1.0)
           ;; Others elided
           :label-counts (->> rdd
                              (spark/map-to-pair
                                (fn [point] (spark/tuple (.label point) 1)))
                              spark/count-by-key)}]
    (.unpersist pl false)
    r))
❖ Examples - Eye on the prize?
➢ HashingTF - oh boy
■ Lose all access to the original word
■ Uses a gigantic Array instead of a HashMap
➢ ChiSqSelector - used to select top N features
■ But how do we determine N? Can’t ask
■ End up grubbing around in the source to find it uses Statistics/chiSqTest
Computing Chi-Square Test
(let [sql-ctx (spark-util/make-sql-context ctx)
      labels-features-df (spark-util/maybe-sample-df
                           options
                           (spark-util/load-table sql-ctx "features" input))
      labeled-points-rdd (->> (lf/load-labels-and-features-from-parquet labels-features-df true)
                              (spark/map (fn [m] (get-in m [:labeled-points :term-count]))))
      [bow rbow] (bow/load-bow-maps-from-table
                   sql-ctx
                   (spark-util/load-table sql-ctx "bow" bow-input))
      chi-sq-arr (Statistics/chiSqTest labeled-points-rdd)]
  (doseq [[ind tst] (map-indexed vector (seq chi-sq-arr))]
    (log/info "Feature:" ind (rbow ind) "tst:" tst)))
Classification w/Random Forests
❖ Classification
➢ Using lots of data to tell things apart
➢ Can put stuff into two buckets (or “classes”) - Binary Classifier
➢ Or into many buckets - Multi-class Classifier
➢ Lots of different techniques
➢ Supervised learning - each sample needs:
■ “features” - a vector of numeric data
■ “label” - a label specifying its class
❖ The Bag of Words
➢ We started with very basic word cleansing - lowercase, remove non letters/digits, 3 char minimum length, drop things that are just numbers
➢ Managed to make it this far in the talk without having to use word count!
➢ But ultimately most Data Science/ML tasks involving text end up heavily dependent on word count
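The word count in question is the classic one, which in core Clojure is essentially a one-liner (a sketch, not code from the talk):

```clojure
(require '[clojure.string :as str])

;; Word count: the perennial starting point for text-heavy ML tasks.
(defn word-count [text]
  (frequencies (str/split (str/lower-case text) #"\s+")))

(word-count "the quick the lazy the")
;; => {"the" 3, "quick" 1, "lazy" 1}
```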
❖ The Bag of Words (cont)
➢ Ended up with too many words (1.3M) even on a sample
➢ Were working on a bare baseline, so no stopword removal or stemming, following the KISS principle
➢ We did say a word must occur on >= 5 distinct sites (not documents), which reduced the size to 460k words
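The `clean-word-seq` helper used in the next snippet isn't shown in the talk; here is a hedged pure-Clojure reconstruction of the cleansing rules described above (lowercase, strip non letters/digits, 3-char minimum, drop pure numbers):

```clojure
(require '[clojure.string :as str])

(defn clean-word-seq
  "Hypothetical reconstruction of the talk's word cleansing: lowercase,
   strip non letters/digits, keep words >= 3 chars that aren't just numbers."
  [text]
  (->> (str/split (str/lower-case text) #"\s+")
       (map #(str/replace % #"[^a-z0-9]" ""))
       (remove #(< (count %) 3))
       (remove #(re-matches #"\d+" %))))

(clean-word-seq "The 2015 Fall/Winter look-book, 100% wool!")
;; => ("the" "fallwinter" "lookbook" "wool")
```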
(defn create-bow-site-occurance [json-lines-rdd]
  (->> json-lines-rdd
       (spark/map-to-pair (fn [m]
                            (spark/tuple (site (:url m))
                                         (set (clean-word-seq (:raw_text m))))))
       (spark/reduce-by-key union)
       (spark/flat-map-to-pair
         (s-de/key-value-fn (fn [site words]
                              (map spark/tuple words (repeat 1)))))
       (spark/reduce-by-key +)
       (spark/filter
         (s-de/key-value-fn (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT))))
       spark/sort-by-key))
Bag of Words
❖ Random Forests™
➢ Ensemble of Decision Trees
➢ Uses “bootstrapping” for selection of feature set and training set
➢ Not “Deep Learning” but extremely easy to use and very effective
➢ “Any sufficiently advanced technology is indistinguishable from magic.”
➢ Able to get pretty decent results! F-measure 0.86
Train the Random Forest from LabeledPoints

(defn train-random-forest [num-trees max-depth max-bins seed labeled-points-rdd]
  (let [p {:num-classes 2
           :categorical-feature-info {}
           :feature-subset-strategy "auto"
           :impurity "gini"
           :max-depth max-depth
           :max-bins max-bins}]
    (RandomForest/trainClassifier labeled-points-rdd
                                  (:num-classes p)
                                  (:categorical-feature-info p)
                                  num-trees
                                  (:feature-subset-strategy p)
                                  (:impurity p)
                                  (:max-depth p)
                                  (:max-bins p)
                                  seed)))
Prepare to train/test the Random Forest

(defn load-and-train-random-forest [rdd num-trees max-depth max-bins seed & [sample-fraction]]
  (let [sampled-rdd (if sample-fraction
                      (spark/sample false sample-fraction seed rdd)
                      rdd)
        labeled-rdd (->> sampled-rdd
                         (spark/map #(labeled-point lf/fashion? %)))
        [train test] (.randomSplit labeled-rdd (double-array [0.9 0.1]) seed)
        cached-train (spark/cache train)
        cached-test (spark/cache test)
        model (train-random-forest num-trees max-depth max-bins seed cached-train)]
    [cached-train cached-test model]))
Topic Modelling with LDA
❖ LDA - Latent Dirichlet Allocation
➢ Topic Model which infers topics from a text corpus
➢ Topics -> cluster centers, docs -> rows
➢ Features are vectors of word counts (Bag of Words)
➢ Unsupervised Learning technique (but you do supply the topic count)
❖ LDA (cont)
➢ Quite tetchy to run at large scale
➢ OutOfMemory errors on executors
➢ Job aborted due to stage failure: Serialized task 4341:0 was 365752339 bytes, which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.
➢ WTF?
➢ BTW, do not ever change “spark.akka.frameSize”...
❖ LDA (moar cont)
➢ Finally able to get a trained model after reducing the BoW to a more manageable size: ~11k, down from ~160k
➢ Trained on ~100k documents, roughly an even split between fashion/non-fashion
➢ These models are for demonstration purposes; moar fanciness planned
Train an LDA Model

(defn train-lda-model [num-topics seed features-fn maps-rdd]
  (let [rdd (->> maps-rdd
                 (spark/map (fn [{:keys [doc-number] :as m}]
                              (spark/tuple doc-number (features-fn m))))
                 spark/cache)
        corpus-size (spark/count rdd)
        mbf (mini-batch-fraction-batch-size corpus-size 5000)
        max-iters (int (Math/ceil (/ mbf)))
        optimizer (doto (OnlineLDAOptimizer.)
                    (.setMiniBatchFraction (min 1.0 mbf)))
        model (-> (doto (LDA.)
                    (.setOptimizer optimizer)
                    (.setK num-topics)
                    (.setSeed seed)
                    (.setMaxIterations max-iters))
                  (.run (.rdd rdd)))]
    (.unpersist rdd false)
    model))
Demo!
So what’s the point?
❖ So what did we do?
➢ We took pre-scraped, “pre-labeled” data
➢ Used Clojure and Spark/Sparkling to munge the data
➢ Used state of the art ML tools to analyze the data
➢ Explored for insights
❖ So what can YOU do?
➢ This will work for almost ANY domain
➢ There’s a lot of interesting information even at this stage
➢ There’s a ton of interesting directions this can go
■ Run the classifier over all of the CC data
■ Build domain-specific LDA models
➢ Do cool things and have fun doing it!