practical machine learning for smarter search with spark+solr

Practical Machine Learning for Smarter Search with Solr and Spark

Jake Mannix

@pbrane

Lead Data Engineer, Lucidworks

$ whoami

Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D

Previously:
• Allen Institute for AI: semantic search on academic research publications
• Twitter: account search, user interest modeling, content recommendations
• LinkedIn: profile search, generic entity-to-entity recommender systems

Prehistory:
• other software companies, algebraic topology, particle cosmology

Overview

• Why Spark and Solr for Data Engineering?
• Quick intro to Solr
• Quick intro to Spark
• Example: Many NewsGroups
  • data exploration
  • clustering: unsupervised ML
  • classification: supervised ML
  • recommender: collaborative filtering + content-based
  • search ranking

Practical Data Science with Spark and Solr

Why does Solr need Spark?

Why does Spark need Solr?

Why do data engineering with Solr and Spark?

Solr
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in

Spark
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Fast, large scale iterative algorithms
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Lots of integrations with other big data systems

Why does Spark need Solr?

Typical Hadoop / Spark data-engineering task, start with some data on HDFS:

$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 6304388 Feb 4 18:22 part-00001.lzo
-rw-r--r-- 1 jake staff 7977085 Feb 4 18:22 part-00002.lzo
-rw-r--r-- 1 jake staff 7210817 Feb 4 18:22 part-00003.lzo
-rw-r--r-- 1 jake staff 1215048 Feb 4 18:22 part-00004.lzo

Now what? What’s in these files?

Solr gives you:

• random access data store

• full-text search

• fast aggregate statistics

• just starting out: no HDFS / S3 necessary!

• world-class multilingual text analytics:

• no more: tokens = str.toLowerCase().split("\\s+")

• relevancy / ranking

• realtime HTTP service layer
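A minimal sketch of the "realtime HTTP service layer" point: a Solr query is an HTTP GET against a collection's `/select` handler. The host, the collection name (`mail`), and the field names below are hypothetical; `q`, `fq`, `rows`, and `wt` are standard Solr query parameters. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def solr_select_url(base_url, collection, query, filters=(), rows=10):
    """Build a Solr /select URL (no request is sent)."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    # Each filter query narrows the result set without affecting the score.
    params += [("fq", fq) for fq in filters]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = solr_select_url(
    "http://localhost:8983/solr",          # assumed local Solr instance
    "mail",                                # hypothetical collection name
    "body:faceting",                       # hypothetical field name
    filters=["project:lucene-solr-user"],  # hypothetical field name
)
print(url)
```

Any HTTP client (or a browser) can fetch a URL of this shape, which is what makes Solr easy to explore before any cluster tooling is involved.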

Solr Key Features

• Full-text search (information retrieval)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Language detection
• Extensible
• Massive Scale/Fault tolerance

Why Spark for Solr?

• Spark-shell: a Big Data REPL with all your fave JVM libs!

• Build the index in parallel very, very quickly!

• Aggregations

• Boosts, stats, iterative global computations

• Offline compute to update index with additional info (e.g. PageRank, popularity)

• Whole corpus analytics and ML: clustering, classification, CF, rankers

• General-purpose distributed computation

• Joins with other storage (Cassandra, HDFS, DB, HBase)
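A sketch of the "offline compute to update index" idea, in plain Python rather than Spark: compute a PageRank-style score over a tiny, made-up reply graph, then shape the result as Solr atomic-update documents. The graph, the document ids, and the `pagerank_d` field name are assumptions for illustration; the `{"set": value}` form is Solr's atomic-update syntax.

```python
import json


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [outgoing neighbours]}."""
    nodes = sorted(set(links) | {n for outs in links.values() for n in outs})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not links.get(n))
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src, outs in links.items():
            for dst in outs:
                new_rank[dst] += damping * rank[src] / len(outs)
        rank = new_rank
    return rank


# Hypothetical graph: message id -> ids of the messages it replies to.
replies = {"msg-2": ["msg-1"], "msg-3": ["msg-1"], "msg-4": ["msg-2"]}
ranks = pagerank(replies)

# Atomic updates: only the named field changes on each existing document.
updates = [
    {"id": doc_id, "pagerank_d": {"set": round(score, 6)}}
    for doc_id, score in ranks.items()
]
print(json.dumps(updates, indent=2))
```

At corpus scale the iteration would run in Spark and the updates would be written back through an indexing client; the shape of the update documents stays the same.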

Spark Key Features

• General purpose, high powered cluster computing system

• Modern, faster alternative to MapReduce

• 3x faster w/ 10x less hardware for Terasort

• Great for iterative algorithms

• APIs for Java, Scala, Python and R

• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems

• Deploys: Standalone, Hadoop YARN, Mesos, AWS, Docker, …

Example: Many NewsGroups

• Initial exploration of ASF mailing-list archives
• Index it into Solr
• Explore a bit deeper: unsupervised Spark ML
• Exploit labels: predictive analytics

Many NewsGroups: Initial Exploration

• Initial exploration of ASF mailing-list archives
• index into Solr: just need to turn your records into JSON
• facet:
  • fields with low cardinality or with sensible ranges
  • document size histogram
  • projects, authors, dates
• find: broken fields, automated content, expected data missing, errors
• now: load into a Spark RDD via SolrRDD:
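The index-and-facet steps on this slide, before anything is loaded into Spark, can be sketched in plain Python. The sample messages, the field names (`project`, `author`, `body`, and so on), and the 100-character bucket size are all made up for illustration. The counters are client-side stand-ins for what Solr returns from a field facet (`facet=true&facet.field=author`) or a range facet.

```python
import json
from collections import Counter
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

RAW_MESSAGES = [  # tiny made-up sample standing in for the mail archive
    "From: alice@example.org\n"
    "Date: Mon, 02 Feb 2015 10:00:00 +0000\n"
    "Subject: Faceting question\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "How do I facet on author?",
    "From: bob@example.org\n"
    "Date: Mon, 02 Feb 2015 11:30:00 +0100\n"
    "Subject: Re: Faceting question\n"
    "Message-ID: <2@example.org>\n"
    "\n"
    "Use facet.field=author.",
    "From: alice@example.org\n"
    "Date: Tue, 03 Feb 2015 09:00:00 +0000\n"
    "Subject: Re: Faceting question\n"
    "Message-ID: <3@example.org>\n"
    "\n"
    "Thanks, that worked.",
]


def to_solr_doc(raw, project):
    """Turn one raw email into a flat JSON-ready record."""
    msg = message_from_string(raw)
    sent = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
    return {
        "id": msg["Message-ID"],
        "project": project,
        "author": msg["From"],
        "subject": msg["Subject"],
        "date": sent.strftime("%Y-%m-%dT%H:%M:%SZ"),  # Solr date format (UTC)
        "body": msg.get_payload(),
    }


docs = [to_solr_doc(raw, "lucene-solr-user") for raw in RAW_MESSAGES]
print(json.dumps(docs[0], indent=2))

# Stand-in for a field facet: number of documents per author.
author_facet = Counter(doc["author"] for doc in docs)
# Stand-in for a range facet: body length in 100-character buckets.
size_facet = Counter(len(doc["body"]) // 100 * 100 for doc in docs)
print(author_facet, size_facet)
```

Skewed facet counts are often the first hint of the problems listed above: one "author" with most of the messages usually means automated content.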

• cleanup/filtering via Spark DataFrame operations:

• create thread groups:
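A plain-Python sketch of the cleanup and thread-grouping steps. The original slides did this with Spark DataFrame operations, which are not reproduced here. The sample records, the list of automated senders, and the rule "a thread is the set of messages sharing a subject once reply prefixes are stripped" are all assumptions for illustration.

```python
import re
from collections import defaultdict

messages = [  # made-up records
    {"id": "1", "author": "alice@example.org", "subject": "Faceting question"},
    {"id": "2", "author": "bob@example.org", "subject": "Re: Faceting question"},
    {"id": "3", "author": "jira@apache.org", "subject": "[jira] Created: SOLR-1"},
    {"id": "4", "author": "carol@example.org", "subject": "RE: re: Faceting question"},
]

AUTOMATED_SENDERS = {"jira@apache.org"}  # assumed list of bot senders
REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def thread_key(subject):
    """Normalize a subject line so replies share a key with the head message."""
    return REPLY_PREFIX.sub("", subject).strip().lower()


# Cleanup: drop automated content before any grouping.
human = [m for m in messages if m["author"] not in AUTOMATED_SENDERS]

# Thread groups: message ids keyed by normalized subject.
threads = defaultdict(list)
for m in human:
    threads[thread_key(m["subject"])].append(m["id"])

print(dict(threads))
```

In a DataFrame the same two steps would be a filter followed by a group-by on the derived key column.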


• try other text analyzers: (no more str.split("\\s+")! )


ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
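To see why whitespace splitting falls short, here is a small plain-Python comparison. The "better" tokenizer is only a stand-in built on Unicode normalization and a regex; it is not a Lucene analyzer and does none of the stemming, stop-word, or language-specific handling that one provides. The sample sentence is made up.

```python
import re
import unicodedata

WORD = re.compile(r"\w+")


def naive_tokens(text):
    """Lowercase and split on whitespace: punctuation stays glued to words."""
    return text.lower().split()


def better_tokens(text):
    """Normalize, case-fold, and keep only runs of word characters."""
    # NFKC folds compatibility forms; casefold handles cases such as ß -> ss.
    normalized = unicodedata.normalize("NFKC", text).casefold()
    return WORD.findall(normalized)


text = "Faceting is FAST, right? Straße!"
print(naive_tokens(text))
print(better_tokens(text))
```

With the naive version, "fast," and "fast" are different terms, so counts and matches are silently split across variants.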

Many NewsGroups: Exploratory Data Science

• Unsupervised machine learning:
  • clustering documents with KMeans
  • extract topics with Latent Dirichlet Allocation
  • learn word vectors with Word2Vec

• Vectorize and run KMeans:
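A plain-Python stand-in for this step, not the Spark ML API: TF-IDF vectors followed by Lloyd's k-means. The four "documents" are made-up strings of already-analyzed tokens, the smoothed IDF formula is one common variant, and seeding the centroids with the first k points is a simplification chosen to keep the run deterministic.

```python
import math
from collections import Counter

docs = [  # made-up, already-analyzed token strings
    "solr facet query index",
    "spark rdd dataframe shuffle",
    "solr index query schema",
    "spark dataframe executor shuffle",
]


def tfidf_vectors(texts):
    """Dense TF-IDF vectors over a sorted vocabulary."""
    tokenized = [t.split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    n = len(texts)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF: rarer terms get larger weights.
        vectors.append([tf[w] * math.log((n + 1) / (doc_freq[w] + 1)) for w in vocab])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

On real mail the vectors are sparse and high-dimensional, which is where a distributed implementation earns its keep.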


• Build topic models with LDA:
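For intuition about what a topic model fits, here is a compact collapsed Gibbs sampler for LDA in plain Python. This is a swapped-in teaching implementation, not the algorithm or API that Spark uses; the documents, the hyperparameters `alpha` and `beta`, and the iteration count are arbitrary choices for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    doc_topic = [[0] * num_topics for _ in docs]                # tokens per (doc, topic)
    topic_word = [[0] * len(vocab) for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic

    def update(d, z, wid, delta):
        doc_topic[d][z] += delta
        topic_word[z][wid] += delta
        topic_total[z] += delta

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, z, word_id[w], +1)
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                update(d, assignments[d][i], wid, -1)  # forget this token's topic
                # Unnormalized probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                update(d, z, wid, +1)

    # Smoothed word distribution for each topic.
    topics = [
        [(topic_word[t][w] + beta) / (topic_total[t] + len(vocab) * beta)
         for w in range(len(vocab))]
        for t in range(num_topics)
    ]
    return vocab, topics, doc_topic


docs = [  # made-up tokenized documents
    ["solr", "facet", "query"],
    ["solr", "index", "query"],
    ["spark", "rdd", "shuffle"],
    ["spark", "dataframe", "shuffle"],
]
vocab, topics, doc_topic = lda_gibbs(docs, num_topics=2)
for t, dist in enumerate(topics):
    top_words = [w for _, w in sorted(zip(dist, vocab), reverse=True)[:3]]
    print(t, top_words)
```

Each topic is a probability distribution over the vocabulary, and each document gets a mix of topics; the printed top words are how such a model is usually inspected.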


• Build word vector representations with Word2Vec:
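Training word vectors is beyond a short example, but the data that the skip-gram formulation of Word2Vec learns from is easy to show: (center word, context word) pairs drawn from a sliding window. The sketch below covers only that pair-generation step; the token list and window size are made up.

```python
def skipgram_pairs(tokens, window=2):
    """All (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


tokens = ["solr", "builds", "inverted", "index"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Words that keep appearing in each other's windows end up with nearby vectors, which is what makes the result useful for query expansion and related-term exploration.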


Many NewsGroups: Supervised Learning

• Now for some real Data Science:
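The deck does not say which classifier was used, so the following is only an illustration of the supervised setup: predict which mailing list a message belongs to from its tokens. Multinomial Naive Bayes with add-one smoothing is a choice made here for brevity, written in plain Python; the training examples and label names are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (tokens, label) pairs. Returns a simple model dict."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in examples for w in tokens}
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def predict(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, count in model["labels"].items():
        words = model["words"][label]
        denom = sum(words.values()) + vocab_size  # add-one (Laplace) smoothing
        score = math.log(count / total_docs)      # log prior
        for w in tokens:
            if w in model["vocab"]:               # ignore words unseen in training
                score += math.log((words[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [  # made-up (tokens, mailing list) examples
    (["facet", "query", "index"], "solr-user"),
    (["schema", "query"], "solr-user"),
    (["rdd", "shuffle", "executor"], "spark-user"),
    (["dataframe", "shuffle"], "spark-user"),
]
model = train_naive_bayes(training)
print(predict(model, ["facet", "query"]))
print(predict(model, ["rdd", "shuffle"]))
```

The labels come for free here because every message already belongs to a list, which is what makes the archive a convenient supervised-learning playground.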

Many NewsGroups: Next steps?

• What else could you do?
• Try other classification algs, cross-validate to pick!
• Recommender Systems
  • content-based:
    • mail-thread as "item", head msgs grouped by replier as "user" profile
    • search query of users against items to recommend
  • collaborative-filtering:
    • users replying to a head msg "rate" them positively
    • train a Spark ML ALS RecSys model
• Train search rankers on click logs
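A sketch of the data-preparation half of the collaborative-filtering idea: turn "who replied to which thread" into implicit (user, item, rating) triples, then map the string ids to integers, since Spark ML's ALS expects numeric user and item ids. The reply log, the names, and the use of reply counts as the rating are assumptions for illustration; no model is trained here.

```python
from collections import Counter

# Hypothetical reply log: (replier, id of the thread's head message).
replies = [
    ("bob", "thread-1"),
    ("carol", "thread-1"),
    ("bob", "thread-2"),
    ("bob", "thread-1"),
]


def implicit_ratings(reply_log):
    """Count replies per (user, thread) and use the count as a positive rating."""
    counts = Counter(reply_log)
    return sorted((user, item, float(n)) for (user, item), n in counts.items())


ratings = implicit_ratings(replies)

# Integer ids for users and items, assigned in sorted order.
user_index = {u: i for i, u in enumerate(sorted({u for u, _, _ in ratings}))}
item_index = {t: i for i, t in enumerate(sorted({t for _, t, _ in ratings}))}
als_rows = [(user_index[u], item_index[t], r) for u, t, r in ratings]

print(ratings)
print(als_rows)
```

Rows of this shape are the usual input to a matrix-factorization recommender; the content-based variant instead issues a search query built from a user's replies against the thread index.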

Resources

• spark-solr: https://github.com/Lucidworks/spark-solr

• Company: http://www.lucidworks.com

• Our blog: http://www.lucidworks.com/blog

• Apache Solr: http://lucene.apache.org/solr

• Apache Spark: http://spark.apache.org

• Fusion: http://www.lucidworks.com/products/fusion

• Twitter: @pbrane
