practical machine learning for smarter search with spark+solr

Practical Machine Learning for Smarter Search with Solr and Spark Jake Mannix @pbrane Lead Data Engineer, Lucidworks

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Practical Machine Learning for Smarter Search with Spark+Solr

Practical Machine Learning for Smarter Search

with Solr and Spark

Jake Mannix

@pbrane

Lead Data Engineer, Lucidworks

Page 2: Practical Machine Learning for Smarter Search with Spark+Solr

$ whoami

Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D

Previously:

• Allen Institute for AI: Semantic Search on academic research publications

• Twitter: account search, user interest modeling, content recommendations

• LinkedIn: profile search, generic entity-to-entity recommender systems

Prehistory:

• other software companies, algebraic topology, particle cosmology

Page 3: Practical Machine Learning for Smarter Search with Spark+Solr

Overview

• Why Spark and Solr for Data Engineering?

• Quick intro to Solr

• Quick intro to Spark

• Example: Many NewsGroups

    • data exploration

    • clustering: unsupervised ML

    • classification: supervised ML

    • recommender: collaborative filtering + content-based

    • search ranking

Page 4: Practical Machine Learning for Smarter Search with Spark+Solr

Practical Data Science with Spark and Solr

Why does Solr need Spark?

Why does Spark need Solr?

Page 5: Practical Machine Learning for Smarter Search with Spark+Solr

Why do data engineering with Solr and Spark?

Solr:

• Data exploration and visualization

• Easy ingestion and feature selection

• Powerful ranking features

• Quick and dirty classification and clustering

• Simple operation and scaling

• Stats and math built in

Spark:

• General purpose batch/streaming compute engine

Whole collection analysis!

• Fast, large scale iterative algorithms

• Advanced machine learning: MLlib, Mahout, Deeplearning4j

• Lots of integrations with other big data systems

Page 6: Practical Machine Learning for Smarter Search with Spark+Solr

Why does Spark need Solr?

Typical Hadoop / Spark data-engineering task, start with some data on HDFS:

$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 6304388 Feb 4 18:22 part-00001.lzo
-rw-r--r-- 1 jake staff 7977085 Feb 4 18:22 part-00002.lzo
-rw-r--r-- 1 jake staff 7210817 Feb 4 18:22 part-00003.lzo
-rw-r--r-- 1 jake staff 1215048 Feb 4 18:22 part-00004.lzo

Now what? What’s in these files?

Page 7: Practical Machine Learning for Smarter Search with Spark+Solr

Solr gives you:

• random access data store

• full-text search

• fast aggregate statistics

• just starting out: no HDFS / S3 necessary!

• world-class multilingual text analytics:

• no more: tokens = str.toLowerCase().split("\\s+")

• relevancy / ranking

• realtime HTTP service layer

Page 8: Practical Machine Learning for Smarter Search with Spark+Solr

Solr Key Features

• Full text search (Info Retr.)

• Facets/Guided Nav galore!

• Lots of data types

• Spelling, auto-complete, highlighting

• Cursors

• More Like This

• De-duplication

• Apache Lucene

• Grouping and Joins

• Stats, expressions, transformations and more

• Lang. Detection

• Extensible

• Massive Scale/Fault tolerance

Page 9: Practical Machine Learning for Smarter Search with Spark+Solr

Why Spark for Solr?

• Spark-shell: a Big Data REPL with all your fave JVM libs!

• Build the index in parallel very, very quickly!

• Aggregations

• Boosts, stats, iterative global computations

• Offline compute to update index with additional info (e.g. PageRank, popularity)

• Whole corpus analytics and ML: clustering, classification, CF, rankers

• General-purpose distributed computation

• Joins with other storage (Cassandra, HDFS, DB, HBase)
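
The PageRank item in the list above is a typical whole-corpus, iterative computation. As a minimal single-machine sketch in plain Python (not Spark code, and not taken from the slides), the power iteration below computes one score per document from a link graph. The `links` dictionary and its document ids are hypothetical. In the workflow the slide describes, scores like these would be computed offline and written back into the index as a field that ranking can boost on.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of {source: [targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: documents "c" and "d" both point at "a".
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = pagerank(links)
for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 4))
```

Each pass reads every node's current rank, which is why this kind of job suits a whole-collection engine rather than a per-query search request.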

Page 10: Practical Machine Learning for Smarter Search with Spark+Solr

Spark Key Features

• General purpose, high powered cluster computing system

• Modern, faster alternative to MapReduce

• 3x faster w/ 10x less hardware for Terasort

• Great for iterative algorithms

• APIs for Java, Scala, Python and R

• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems

• Deploys: Standalone, Hadoop YARN, Mesos, AWS, Docker, …

Page 11: Practical Machine Learning for Smarter Search with Spark+Solr

Example: Many NewsGroups

• Initial exploration of ASF mailing-list archives

• Index it into Solr

• Explore a bit deeper: unsupervised Spark ML

• Exploit labels: predictive analytics

Page 12: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Initial Exploration

• Initial exploration of ASF mailing-list archives

    • index into Solr: just need to turn your records into JSON

    • facet:

        • fields with low cardinality or with sensible ranges

        • document size histogram

        • projects, authors, dates

    • find: broken fields, automated content, expected data missing, errors

    • now: load into a Spark RDD via SolrRDD:
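
The faceting step on this slide can be sketched as a plain HTTP query. The example below builds a Solr select URL with the standard `q`, `rows`, `wt`, `facet` and `facet.field` parameters, then reads facet counts out of a response. The host, the collection name `mail`, the field names `project` and `author`, and the sample response are all hypothetical; only the parameter names and the flat value/count layout of `facet_fields` in Solr's default JSON output are assumed from Solr itself. No request is actually sent.

```python
import json
from urllib.parse import urlencode


def facet_query_url(base_url, collection, facet_fields, query="*:*"):
    """Build a select URL that returns only facet counts (rows=0)."""
    params = [("q", query), ("rows", 0), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical host, collection and field names.
url = facet_query_url("http://localhost:8983/solr", "mail", ["project", "author"])

# Hand-written sample in the shape of a facet response, not real data.
sample_json = """
{"facet_counts": {"facet_fields": {
    "project": ["lucene-solr-user", 3, "spark-user", 2]}}}
"""
counts = facet_counts(json.loads(sample_json), "project")
print(url)
print(counts)
```

Scanning counts like these per field is a quick way to spot the broken fields and missing data the slide mentions.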

Page 13: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Initial Exploration

• cleanup/filtering via Spark DataFrame operations:

• create thread groups:
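
The slide's own code is not part of this transcript. As a rough stand-in in plain Python rather than Spark DataFrames, the sketch below shows one way the two steps could look: drop messages with an empty body, then group the rest into threads by subject with reply prefixes stripped. The field names (`subject`, `from`, `date`, `body`) and the sample messages are hypothetical. Real mail archives are usually threaded through the In-Reply-To and References headers; subject matching is a simplification.

```python
import re
from collections import defaultdict

# Leading "Re:" / "Fw:" / "Fwd:" markers, possibly repeated.
REPLY_PREFIX = re.compile(r"^(?:\s*(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def thread_key(subject):
    """Normalize a subject line so replies share a key with the head message."""
    return REPLY_PREFIX.sub("", subject).strip().lower()


def group_threads(messages):
    """Filter out empty messages, then group by thread key, oldest first."""
    threads = defaultdict(list)
    for msg in messages:
        if not msg.get("body", "").strip():
            continue  # cleanup step: skip messages without content
        threads[thread_key(msg["subject"])].append(msg)
    for msgs in threads.values():
        msgs.sort(key=lambda m: m["date"])  # ISO dates sort as strings
    return dict(threads)


# Hypothetical records; field names are assumptions for illustration.
messages = [
    {"subject": "Re: Faceting question", "from": "bob",
     "date": "2015-02-02", "body": "Try facet.field."},
    {"subject": "Faceting question", "from": "alice",
     "date": "2015-02-01", "body": "How do I facet?"},
    {"subject": "RE: re: faceting question", "from": "alice",
     "date": "2015-02-03", "body": "Thanks!"},
    {"subject": "Build failed", "from": "jenkins",
     "date": "2015-02-01", "body": ""},
]
threads = group_threads(messages)
```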

Page 14: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Initial Exploration

• try other text analyzers: (no more str.split("\\s+")! )

ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
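
To make the point about whitespace splitting concrete, here is a small plain-Python comparison. It is not the LuceneTextAnalyzer and does not use Lucene; the regex tokenizer and the tiny stopword list are stand-ins chosen for illustration, far simpler than a real analysis chain.

```python
import re

# Tiny illustrative stopword list, not a standard one.
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})

# Runs of letters/digits, optionally followed by an apostrophe suffix.
TOKEN = re.compile(r"[^\W_]+(?:'[^\W_]+)?")


def whitespace_tokens(text):
    """The naive approach: lowercase, then split on whitespace."""
    return text.lower().split()


def analyze(text, stopwords=STOPWORDS):
    """Lowercase, tokenize on word characters, drop stopwords."""
    return [t for t in TOKEN.findall(text.lower()) if t not in stopwords]


text = "Faceting in Solr, explained: the basics!"
naive = whitespace_tokens(text)
analyzed = analyze(text)
print(naive)     # punctuation stays glued to the tokens
print(analyzed)  # clean terms, stopwords removed
```

With the naive split, "solr," and "solr" would count as different terms in every downstream vectorizer, which is the problem a proper analyzer avoids.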

Page 15: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Exploratory Data Science

• Unsupervised machine learning:

    • clustering documents with KMeans

    • extract topics with Latent Dirichlet Allocation

    • learn word vectors with Word2Vec

Page 16: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Exploratory Data Science

• Vectorize and run KMeans:
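
The slide's code is missing from this transcript, so what follows is a single-machine sketch of the same two steps in plain Python, not the Spark ML pipeline. It builds L2-normalized TF-IDF vectors and runs Lloyd's k-means on them. The token lists are hypothetical, and the initial centers are a seeded random sample of the documents, which is a simplification.

```python
import math
import random
from collections import Counter


def tfidf_vectors(token_docs):
    """Dense, L2-normalized TF-IDF vectors with idf = log((n+1)/(df+1))."""
    vocab = sorted({t for doc in token_docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    n = len(token_docs)
    df = Counter(t for doc in token_docs for t in set(doc))
    vectors = []
    for doc in token_docs:
        vec = [0.0] * len(vocab)
        for term, count in Counter(doc).items():
            vec[index[term]] = count * math.log((n + 1) / (df[term] + 1))
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Lloyd's algorithm: assign to nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        for i, v in enumerate(vectors):
            assignment[i] = min(range(k), key=lambda c: sq_dist(v, centers[c]))
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Hypothetical tokenized messages.
docs = [
    ["solr", "query", "facet"],
    ["solr", "facet", "index"],
    ["spark", "rdd", "job"],
    ["spark", "job", "cluster"],
]
vectors = tfidf_vectors(docs)
assignment, centers = kmeans(vectors, k=2)
print(assignment)
```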

Page 17: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Exploratory Data Science

• Build topic models with LDA:
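
As an illustration of what fitting an LDA topic model involves, the sketch below uses collapsed Gibbs sampling in plain Python. This is a swapped-in inference method picked because it is short to write; it is not the slide's code, and Spark's LDA uses different optimizers. The documents, the hyperparameters `alpha` and `beta`, and the iteration count are assumptions for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler; returns per-topic word counts and per-doc topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample proportional to (doc-topic) * (topic-word) terms.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic


# Hypothetical tokenized messages.
docs = [
    ["solr", "query", "facet"],
    ["solr", "facet", "index"],
    ["spark", "rdd", "job"],
    ["spark", "job", "cluster"],
]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2, iterations=50)
for k, counts in enumerate(topic_word):
    top = [w for w, c in counts.most_common(3) if c > 0]
    print("topic", k, top)
```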

Page 18: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Exploratory Data Science

• Build word vector representations with Word2Vec:
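
Training Word2Vec itself is too long to show here, so this sketch covers only the first step of the skip-gram variant: turning a token sequence into (center, context) pairs with a sliding window. The tokens and the window size are hypothetical. A real trainer would then learn vectors that make each center word predictive of its context words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical token sequence.
pairs = skipgram_pairs(["solr", "spark", "search"], window=1)
print(pairs)
```

Words that keep appearing in each other's windows end up with similar vectors, which is what makes the result useful for query expansion and related-term lookups.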

Page 19: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Supervised Learning

• Now for some real Data Science:
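
The slide's code and exact task are not in this transcript. One plausible task with this data, assumed here, is predicting which mailing list a message came from based on its tokens. The sketch below does that with multinomial Naive Bayes and additive smoothing in plain Python, not Spark ML. The list name `spark-user` and the training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs, smoothing=1.0):
    """Fit multinomial Naive Bayes; returns {label: (log_prior, log_likelihoods)}."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_docs.values())
    model = {}
    for label, n_docs in class_docs.items():
        denom = sum(term_counts[label].values()) + smoothing * len(vocab)
        model[label] = (
            math.log(n_docs / total_docs),
            {t: math.log((term_counts[label][t] + smoothing) / denom) for t in vocab},
        )
    return model


def predict(model, tokens):
    """Pick the label with the highest log posterior; unseen tokens are ignored."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(
            log_likelihood[t] for t in tokens if t in log_likelihood
        )
    return max(model, key=score)


# Hypothetical labeled messages: (tokens, mailing list).
training = [
    (["solr", "query", "facet"], "lucene-solr-user"),
    (["solr", "index", "facet"], "lucene-solr-user"),
    (["spark", "rdd", "job"], "spark-user"),
    (["spark", "job", "cluster"], "spark-user"),
]
model = train_naive_bayes(training)
print(predict(model, ["solr", "facet"]))
```

The labels come for free here because every archived message already records the list it was sent to.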

Page 20: Practical Machine Learning for Smarter Search with Spark+Solr

Many NewsGroups: Next steps?

• What else could you do?

    • Try other classification algs, cross-validate to pick!

    • Recommender Systems

        • content-based:

            • mail-thread as "item", head msgs grouped by replier as "user" profile

            • search query of users against items to recommend

        • collaborative-filtering:

            • users replying to a head msg "rate" them positively

            • train a Spark ML ALS RecSys model

    • Train search rankers on click logs
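
For the collaborative-filtering idea, the first step is turning reply activity into (user, item, rating) triples, the input shape an ALS recommender trains on. The sketch below is plain Python and stops before any model training. The record fields `replier` and `thread_id`, the sample replies, and the choice of reply count as the implicit rating are all assumptions for illustration.

```python
from collections import Counter


def implicit_ratings(replies):
    """Count replies per (user, thread) and emit them as rating triples."""
    counts = Counter((r["replier"], r["thread_id"]) for r in replies)
    return [(user, thread, float(n)) for (user, thread), n in sorted(counts.items())]


def index_ids(values):
    """Map string ids to dense integers, since ALS libraries typically want numeric ids."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


# Hypothetical reply records.
replies = [
    {"replier": "alice", "thread_id": "t1"},
    {"replier": "alice", "thread_id": "t1"},
    {"replier": "bob", "thread_id": "t1"},
    {"replier": "bob", "thread_id": "t2"},
]
ratings = implicit_ratings(replies)
user_ids = index_ids(user for user, _, _ in ratings)
item_ids = index_ids(thread for _, thread, _ in ratings)
numeric = [(user_ids[u], item_ids[t], r) for u, t, r in ratings]
print(numeric)
```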

Page 21: Practical Machine Learning for Smarter Search with Spark+Solr

Resources

• spark-solr: https://github.com/Lucidworks/spark-solr

• Company: http://www.lucidworks.com

• Our blog: http://www.lucidworks.com/blog

• Apache Solr: http://lucene.apache.org/solr

• Apache Spark: http://spark.apache.org

• Fusion: http://www.lucidworks.com/products/fusion

• Twitter: @pbrane