data science with solr and spark


Lucene/Solr Revolution 2016

October 13-16, 2016 | Boston, MA

http://lucenerevolution.com

Data Science with Solr and Spark

Grant Ingersoll

@gsingers

CTO, Lucidworks

Get Started

• ASF mailing lists, lucidworks.com, docs.lucidworks.com

• Solr for Data Science: http://www.slideshare.net/gsingers/solr-for-data-science

• Fusion for Data Science: https://www.youtube.com/watch?v=oSs5DNTVxv4

Lucidworks Fusion Is Search-Driven Everything

Fusion is built on three core principles:

• Drive next-generation relevance via Content, Collaboration, and Context

• Harness best-in-class open source: Apache Solr + Spark

• Simplify application development and reduce ongoing maintenance

Fusion Architecture

[Architecture diagram] Lucidworks View and other clients reach Fusion through a REST API. Fusion's core services (Admin UI, alerting/messaging, NLP, pipelines, blob storage, scheduling, recommenders/signals) run on top of Apache Spark (cluster manager plus workers) and Apache Solr (shards), with HDFS as optional storage. Apache ZooKeeper (ZK 1 ... ZK N) provides shared config management, leader election, and load balancing. Connectors ingest from database, web, file, logs, Hadoop, and cloud sources. Security is built in throughout.

Let's Do This

• Why Search for Data Engineering?

• Quick intro to Solr

• Quick intro to Spark

• Solr + Spark

• Demo!

The Importance of Importance: Search-Driven Everything

• Customer Service

• Customer Insights

• Fraud Surveillance

• Research Portal

• Online Retail

• Digital Content

Solr Key Features

• Full text search (information retrieval)

• SQL and JDBC! (v6)

• Facets/guided navigation galore

• Lots of data types

• Spelling, auto-complete, highlighting

• Cursors

• More Like This

• De-duplication

• Graph traversal

• Grouping and joins

• Stats, expressions, transformations and more

• Language detection

• Extensible

• Massive scale/fault tolerance
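
A minimal SolrJ sketch touching a few of these features (full text search, filtering, faceting); the ZooKeeper address, collection, and field names are assumptions, not from the deck:

    import scala.collection.JavaConverters._
    import org.apache.solr.client.solrj.SolrQuery
    import org.apache.solr.client.solrj.impl.CloudSolrClient

    // hypothetical ZooKeeper address and collection
    val client = new CloudSolrClient("localhost:9983")
    client.setDefaultCollection("tweets")

    val q = new SolrQuery("solr spark")   // full text search
    q.addFilterQuery("lang_s:en")         // filtering
    q.addFacetField("type_s")             // facets / guided navigation
    q.setRows(10)

    val resp = client.query(q)
    resp.getResults.asScala.foreach(doc => println(doc.getFieldValue("id")))
    client.close()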

Lucene: the Ace in the Hole

• Vector Space or Probabilistic, it’s your choice!

• Killer FST

• Wicked fast

• Pluggable compression, queries, indexing and more

• Advanced similarity models: language modeling, divergence from randomness, and more

• Easy to plug-in ranking
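
A small sketch of how pluggable ranking looks in Lucene; the similarity classes are real Lucene APIs, while the index path is hypothetical:

    import java.nio.file.Paths
    import org.apache.lucene.index.DirectoryReader
    import org.apache.lucene.search.IndexSearcher
    import org.apache.lucene.search.similarities.{BM25Similarity, LMDirichletSimilarity}
    import org.apache.lucene.store.FSDirectory

    // open an existing index (path is made up)
    val reader = DirectoryReader.open(FSDirectory.open(Paths.get("/data/index")))
    val searcher = new IndexSearcher(reader)

    // swap the ranking model per searcher: probabilistic (BM25) ...
    searcher.setSimilarity(new BM25Similarity())
    // ... or a language-modeling similarity, one line either way
    // searcher.setSimilarity(new LMDirichletSimilarity())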

Spark Key Features

• General purpose, high powered cluster computing system

• Modern, faster alternative to MapReduce

• 3x faster w/ 10x less hardware for Terasort

• Great for iterative algorithms

• APIs for Java, Scala, Python and R

• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems

• Deploys: Standalone, Hadoop YARN, Mesos
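
A tiny spark-shell sketch of the programming model (the input path is hypothetical):

    // word count in spark-shell; sc is the REPL's built-in SparkContext
    val counts = sc.textFile("hdfs:///logs/app.log")   // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .cache()               // held in memory, so repeat passes are cheap

    counts.sortBy(_._2, ascending = false).take(10).foreach(println)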

Why do data engineering with Solr and Spark?

Solr:

• Data exploration and visualization

• Easy ingestion and feature selection

• Powerful ranking features

• Quick and dirty classification and clustering

• Simple operation and scaling

• Stats and math built in

Spark:

• Advanced machine learning: MLlib, Mahout, Deeplearning4j

• Fast, large-scale iterative algorithms

• General purpose batch/streaming compute engine

• Whole collection analysis!

• Lots of integrations with other big data systems

Why Spark for Solr?

• Spark-shell!

• Build the index in parallel very, very quickly! (see the indexing sketch after this list)

• Aggregations

• Boosts, stats, iterative computations

• Offline compute to update index with additional info (e.g. PageRank, popularity)

• Whole corpus analytics and ML: clustering, classification, CF, Rankers

• Joins with other storage (Cassandra, HDFS, DB, HBase)
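
A hedged sketch of the "build the index in parallel" point using the spark-solr library's DataFrame writer; the options mirror the read example later in the deck, and the DataFrame df, ZooKeeper address, and collection are assumptions:

    import org.apache.spark.sql.SaveMode

    // assumes df is an existing DataFrame and "tweets" exists in SolrCloud
    df.write
      .format("solr")
      .options(Map("zkhost" -> "localhost:9983", "collection" -> "tweets"))
      .mode(SaveMode.Append)
      .save()   // each Spark partition sends its rows to Solr concurrently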

Why Solr for Spark?

• Massive simplification of operations!

• World-class multilingual text analytics:

• No more: tokens = str.toLowerCase().split("\\s+") (see the analyzer sketch after this list)

• Non “dumb” distributed, resilient storage

• Random access with smart queries

• Table scans

• Advanced filtering, feature selection

• Schemaless when you want, predefined when you don’t

• Spatial, columnar, sparse
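
To make the "no more toLowerCase().split" point concrete, a minimal sketch with Lucene's StandardAnalyzer; the field name and sample text are made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

    val analyzer = new StandardAnalyzer()
    // a real analysis chain: Unicode-aware tokenization plus lowercasing
    val ts = analyzer.tokenStream("body", "Solr & Spark: schemaless when you want!")
    val term = ts.addAttribute(classOf[CharTermAttribute])
    ts.reset()
    while (ts.incrementToken()) println(term.toString)
    ts.end()
    ts.close()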

Spark + Solr in Anger

http://github.com/lucidworks/spark-solr

Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");
DataFrame df = sqlContext.read().format("solr").options(options).load();
long count = df.filter(df.col("type_s").equalTo("echo")).count();

Demo: Spark shell; run k-Means, LDA, Word2Vec, Random Forest; index clusters
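
The demo itself isn't reproduced in the transcript; a hedged sketch of its likely first step in spark-shell, reading a Solr collection into a DataFrame and clustering numeric fields with MLlib k-means (collection, field names, and k are assumptions):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val df = sqlContext.read.format("solr")
      .options(Map("zkhost" -> "localhost:9983", "collection" -> "tweets"))
      .load()

    // turn two hypothetical numeric Solr fields into MLlib feature vectors
    val features = df.select("retweets_i", "favorites_i")
      .map(r => Vectors.dense(r.getInt(0).toDouble, r.getInt(1).toDouble))
      .cache()

    val model = KMeans.train(features, 5, 20)   // k = 5 clusters, 20 iterations
    model.clusterCenters.foreach(println)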

Resources

• This code: published in a few weeks

• Company: http://www.lucidworks.com

• Our blog: http://www.lucidworks.com/blog

• Book: http://www.manning.com/ingersoll

• Solr: http://lucene.apache.org/solr

• Fusion: http://www.lucidworks.com/products/fusion

• Twitter: @gsingers

Appendix A: Deeper Solr Details

Just When You Thought SQL was Dead

• Parallel execution of SQL across SolrCloud

• Realtime Map-Reduce ("ish") functionality

• Parallel relational algebra

• Builds on streaming capabilities in 5.x

• JDBC client in the works

Full, Parallelized SQL Support

• Lots of functions: Search, Merge, Group, Unique, Parallel, Select, Reduce, innerJoin, hashJoin, Top, Rollup, Facet, Stats, Update, JDBC, Intersect, Complement, Logit

• Composable Streams

• Query optimization built in

SQL Guts

Example

select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i)
from collection1 where text='XXXX' group by str_s

rollup(
    search(collection1,
           q="(text:XXXX)",
           qt="/export",
           fl="str_s,field_i",
           partitionKeys=str_s,
           sort="str_s asc",
           zkHost="localhost:9989/solr"),
    over=str_s,
    count(*),
    sum(field_i),
    min(field_i),
    max(field_i),
    avg(field_i))

Make Connections, Get Better Results

• Graph traversal

  • Find all tweets mentioning "Solr" by me or people I follow

  • Find all draft blog posts about "Parallel SQL" written by a developer

  • Find 3-star hotels in NYC my friends stayed in last year

• BM25F default similarity

• Geo3D search

Streaming API & Expressions

• API

  • Java API to provide a programming framework

  • Returns tuples as a JSON stream

  • org.apache.solr.client.solrj.io

• Expressions

  • String query language

  • Serialization format

  • Allows non-Java programmers to access the Streaming API

DocValues must be enabled for any field to be returned
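
On the Java API side, a minimal SolrJ sketch of consuming a stream programmatically; CloudSolrStream and Tuple are the real solrj.io classes, while the ZooKeeper address and collection are assumptions mirroring the curl example below:

    import org.apache.solr.client.solrj.io.stream.CloudSolrStream
    import org.apache.solr.common.params.ModifiableSolrParams

    val params = new ModifiableSolrParams()
    params.set("q", "*:*")
    params.set("fl", "id,field_i")        // fields need docValues enabled
    params.set("sort", "field_i asc")

    // hypothetical ZooKeeper address and collection
    val stream = new CloudSolrStream("localhost:9983", "sample", params)
    try {
      stream.open()
      var tuple = stream.read()
      while (!tuple.EOF) {                // the {"EOF": true} tuple ends the stream
        println(tuple.getString("id"))
        tuple = stream.read()
      }
    } finally {
      stream.close()
    }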

Streaming Expression Request

curl --data-urlencode 'stream=search(sample,
    q="*:*",
    fl="id,field_i",
    sort="field_i asc")' http://localhost:8901/solr/sample/stream

Streaming Expression Response

{"responseHeader":  {"status":  0,  "QTime":  1},          "tuples":  {                  "numFound":  -­‐1,                  "start":  -­‐1,                  "docs":  [                          {"id":  "doc1",  "field_i":  1},                          {"id":  "doc2",  "field_i":  2},                          {"EOF":  true}]          }}

Architecture

• MapReduce-ish: borrows the shuffling concept from M/R

• Logical tiers for performing the query:

  • SQL tier: translates SQL to streaming expressions for a parallel query plan, selects worker nodes, merges results

  • Worker tier: executes the parallel query plan, streams tuples back from the data tables

  • Data table tier: queries SolrCloud collections, performs the initial sort and partitioning of results for the worker nodes

JDBC Client

• Parallel SQL includes a "thin" JDBC client

• Expanded to include SQL clients such as DbVisualizer (SOLR-8502)

• Client only works with Parallel SQL features
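
A hedged sketch of the thin client; the DriverImpl class and jdbc:solr URL scheme follow the Solr 6 reference guide, while the ZooKeeper address and collection are assumptions:

    import java.sql.DriverManager

    Class.forName("org.apache.solr.client.solrj.io.sql.DriverImpl")

    // ZooKeeper connection string plus target collection (values are hypothetical)
    val conn = DriverManager.getConnection(
      "jdbc:solr://localhost:9983?collection=collection1&aggregationMode=map_reduce")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("select str_s, count(*) from collection1 group by str_s")
    while (rs.next()) {
      println(s"${rs.getString(1)} -> ${rs.getLong(2)}")
    }
    rs.close(); stmt.close(); conn.close()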

Learning More

• Joel Bernstein's presentation at Lucene Revolution: https://www.youtube.com/watch?v=baWQfHWozXc

• Apache Solr Reference Guide:

  • https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

  • https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface

Spark Architecture

[Architecture diagram]

• Spark Master (daemon): keeps track of live workers; Web UI on port 8080; task scheduler; restarts failed tasks. Losing a master prevents new applications from being executed; HA is achievable with ZooKeeper and multiple master nodes. When selecting which node to execute a task on, the master takes data locality into account, so tasks are assigned based on data locality.

• Spark Worker Node (1...N of these): runs the Spark Slave (daemon); the Spark Executor (JVM process) runs in a separate process from the slave daemon, executes tasks, and holds a cache. Each task works on some partition of a data set to apply a transformation or action.

• My Spark App (driver): my-spark-job.jar (w/ shaded deps) with a SparkContext; maintains the RDD graph, DAG scheduler, block tracker, and shuffle tracker.
