solr 6 feature preview

© Cloudera, Inc. All rights reserved.

Solr 6 Feature Preview

Yonik Seeley
3/09/2016


My Background

• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford


Solr 6

• Happy Birthday Solr!
• 10 years at the Apache Software Foundation as of 1/2016

• Release branch has been cut
• ETA before April
• Java 8+ only


Streaming Expressions


Solr Streaming Expressions

• Generic platform for distributed computation
  • The basis for implementing distributed SQL
• Works across entire result sets (or subsets)
  • Normal search operations are designed for fast top-N operations
• Map-reduce-like "shuffle" partitions result sets for greater scalability
• Worker nodes can be allocated from a collection for parallelism


Tuple Streams

• A streaming expression compiles/parses to a tuple stream
  • Direct mapping from a streaming expression function -> tuple_stream
• Stream Sources – produce a tuple stream
• Stream Decorators – operate on tuple streams
• Designed to include streams from non-Solr systems
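The source/decorator split can be pictured with plain Python generators. This is only a sketch of the idea, not Solr's Java classes: `list_source` and `rename_field` are hypothetical names, and the EOF marker tuple is modeled on the `{"EOF":true}` entry that ends a /stream response.

```python
from typing import Dict, Iterable, Iterator

Tuple = Dict[str, object]

def list_source(docs: Iterable[Tuple]) -> Iterator[Tuple]:
    """Stream source: produces tuples, then a final EOF marker tuple."""
    for doc in docs:
        yield doc
    yield {"EOF": True}

def rename_field(stream: Iterator[Tuple], old: str, new: str) -> Iterator[Tuple]:
    """Stream decorator: wraps another stream and renames one field."""
    for t in stream:
        if t.get("EOF"):
            yield t  # pass the end-of-stream marker through untouched
            return
        yield {(new if k == old else k): v for k, v in t.items()}

# Decorators nest around sources, just like nested expression functions.
stream = rename_field(list_source([{"id": "a", "price": 1.0}]), "price", "cost")
result = list(stream)
```

Because each stage pulls one tuple at a time from the stage it wraps, nothing has to be buffered in between.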


search() expression

$ curl http://localhost:8983/solr/techproducts/stream -d 'expr=search(techproducts, q="*:*", fl="id,price,score", sort="id asc")'

{"result-set":{"docs":[{"score":1.0,"id":"0579B002","price":179.99},{"score":1.0,"id":"100-435805","price":649.99},{"score":1.0,"id":"3007WFP","price":2199.0},{"score":1.0,"id":"VDBDB1A16"},{"score":1.0,"id":"VS1GB400C3","price":74.99},{"EOF":true,"RESPONSE_TIME":6}]}}

resulting tuple stream
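A client can treat the response as a list of tuples terminated by the EOF tuple. The sketch below parses a shortened copy of the response shown above; `read_tuples` is a hypothetical helper, and it loads the whole body at once, whereas a real client would more likely parse incrementally.

```python
import json

def read_tuples(body: str):
    """Yield tuples from a /stream response body, stopping at the EOF marker."""
    for t in json.loads(body)["result-set"]["docs"]:
        if t.get("EOF"):
            return
        yield t

# Shortened copy of the response above (two documents plus the EOF tuple).
body = ('{"result-set":{"docs":['
        '{"score":1.0,"id":"0579B002","price":179.99},'
        '{"score":1.0,"id":"VDBDB1A16"},'
        '{"EOF":true,"RESPONSE_TIME":6}]}}')

ids = [t["id"] for t in read_tuples(body)]
prices = [t.get("price") for t in read_tuples(body)]  # a field may be absent
```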


Search Tuple Stream

[Diagram: replicas of Shards 1, 2, and 3 feed tuple streams into a Worker]

/stream worker executing the "search" expression

• search() is a stream source
• SolrCloud aware (CloudSolrStream java class)
• Fully streaming (no big buffers)
• Worker node doesn't need to be a Solr node


search expression args

search(                             // parses to CloudSolrStream java class
  techproducts,                     // name of the collection to search
  zkHost="localhost:9983",          // (opt) zookeeper address of collection to search
  qt="/select",                     // (opt) the request handler to use (/export is also available)
  rows=1000000,                     // (opt) number of rows to retrieve
  q=*:*,                            // query to match returned documents
  fl="id,price,score",              // which fields to return
  sort="id asc, price desc",        // how to sort the results
  aliases="id=myid,price=myprice"   // (opt) renames output fields
)
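When expressions are generated by a program, it helps to render the argument list from a dictionary and URL-encode the result for the `expr` POST parameter. A minimal sketch: `search_expr` is a hypothetical helper, and it simply quotes every value, which matches the quoted style used in the curl example earlier.

```python
from urllib.parse import urlencode

def search_expr(collection, **params):
    """Render search(collection, k="v", ...), quoting every parameter value."""
    args = [collection] + ['%s="%s"' % (k, v) for k, v in params.items()]
    return "search(%s)" % ", ".join(args)

expr = search_expr("techproducts", q="*:*", fl="id,price,score", sort="id asc")

# Form-encoded body for a POST to /solr/techproducts/stream
body = urlencode({"expr": expr})
```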


reduce() streaming expression

• Groups tuples by common field values
• Emits one group-head per group
• Each group-head contains a list of tuples
• "by" parameter must match up with "sort" parameter
• Any partitioning should be done on the same group field

reduce(
  search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc, price desc"),
  by="manu",
  group(sort="price desc", n=100)   // stream operation
)
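The reason "by" must line up with "sort" is that grouping on a stream only looks at adjacent tuples. A rough Python sketch of that behavior follows; `reduce_by` is a hypothetical name, and storing the members under a "group" field is an assumption about the output shape.

```python
from itertools import groupby

def reduce_by(stream, by, n):
    """Emit one group-head per run of equal `by` values, carrying up to n tuples."""
    # groupby only merges adjacent tuples, so the input must be sorted on `by`.
    for _, run in groupby(stream, key=lambda t: t[by]):
        members = sorted(run, key=lambda t: t["price"], reverse=True)
        head = dict(members[0])       # the group-head is the first tuple after sorting
        head["group"] = members[:n]   # assumed field name for the member list
        yield head

sorted_stream = [
    {"id": "1", "manu": "apple", "price": 399.0},
    {"id": "2", "manu": "belkin", "price": 19.95},
    {"id": "3", "manu": "belkin", "price": 11.5},
]
heads = list(reduce_by(sorted_stream, by="manu", n=100))
```

If the input were sorted on some other field, "belkin" tuples could arrive in separate runs and produce two group-heads instead of one.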


rollup() expression

• Groups tuples by common field values
• Emits rollup value along with metrics
• Closest equivalent to faceting

rollup(
  search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc"),
  over="manu",
  count(*),      // metrics
  max(price)
)

{"result-set":{"docs":[{"manu":"apple","count(*)":1.0},{"manu":"asus","count(*)":1.0},{"manu":"ati","count(*)":1.0},{"manu":"belkin","count(*)":2.0},{"manu":"canon","count(*)":2.0},{"manu":"corsair","count(*)":3.0},[...]
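A rollup can be computed in a single pass over the sorted stream, keeping only one running accumulator per group. The sketch below assumes the stream is already sorted on the `over` field; the function name and the in-memory input are illustrative only.

```python
from itertools import groupby

def rollup(stream, over):
    """Emit one tuple per group with count(*) and max(price), in one pass."""
    for value, run in groupby(stream, key=lambda t: t[over]):
        count, max_price = 0, None
        for t in run:
            count += 1
            price = t.get("price")  # tuples without a price still count
            if price is not None and (max_price is None or price > max_price):
                max_price = price
        yield {over: value, "count(*)": count, "max(price)": max_price}

sorted_stream = [
    {"manu": "apple", "price": 399.0},
    {"manu": "belkin", "price": 19.95},
    {"manu": "belkin", "price": 11.5},
]
rolled = list(rollup(sorted_stream, over="manu"))
```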


facet() expression

• Like search+rollup, but pushes down computation to JSON Facet API

facet(
  techproducts,
  q="*:*",
  buckets="manu",
  bucketSorts="count(*) desc",
  bucketSizeLimit=1000,
  count(*),
  sum(price),
  max(popularity)
)

{"result-set":{"docs":[{"avg(price)":129.99, "max(popularity)":7.0,"manu":"corsair","count(*)":3},{"avg(price)":15.72,"max(popularity)":1.0,"manu":"belkin","count(*)":2},{"avg(price)":254.97,"max(popularity)":7.0,"manu":"canon","count(*)":2},{"avg(price)":399.0,"max(popularity)":10.0,"manu":"apple","count(*)":1},{"avg(price)":479.95,"max(popularity)":7.0,"manu":"asus","count(*)":1},{"avg(price)":649.98,"max(popularity)":7.0,"manu":"ati","count(*)":1},{"avg(price)":0.0,"max(popularity)":"NaN","manu":"boa","count(*)":1},[...]
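"Pushing down" means the buckets and metrics are computed inside the search nodes rather than by streaming every tuple out. As a rough illustration, the expression above corresponds to a JSON Facet API request along these lines. The facet names `sum_price` and `max_popularity` are made up for this sketch, and the exact request the facet() stream sends is an assumption.

```python
import json

# A terms facet on "manu" with nested aggregations, in JSON Facet API syntax.
json_facet = {
    "manu": {
        "type": "terms",
        "field": "manu",
        "limit": 1000,              # bucketSizeLimit
        "sort": "count desc",       # bucketSorts
        "facet": {
            "sum_price": "sum(price)",
            "max_popularity": "max(popularity)",
        },
    }
}

# Request parameters for a regular query; rows=0 because only buckets are wanted.
params = {"q": "*:*", "rows": 0, "json.facet": json.dumps(json_facet)}
```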


Parallel Tuple Stream

[Diagram: replicas of Shards 1, 2, and 3 feed tuple streams into Worker Partition 1 and Worker Partition 2, which feed a final Worker that emits the tuple stream]


Streaming Expressions – parallel

• Wraps a stream and sends to N worker nodes
• The first parameter is the collection to use for the intermediate worker nodes
• partitionKeys must be provided to underlying workers
  • Usually makes sense to partition by what you are grouping on
• Inner and outer sorts should match

parallel(collection1,
  rollup(
    search(techproducts, q="*:*", fl="id,manu,price", sort="manu asc", partitionKeys="manu"),
    over="manu"),
  workers=2,
  zkHost="localhost:9983",
  sort="manu asc")
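The two rules above (partition on the grouping field, keep inner and outer sorts the same) can be seen in a small simulation. This is a sketch, not Solr's implementation: `crc32` stands in for whatever hash Solr uses, and the workers are plain function calls rather than nodes.

```python
import heapq
import zlib
from itertools import groupby

def partition(tuples, key, workers):
    """Split tuples into per-worker lists by a stable hash of the partition key."""
    parts = [[] for _ in range(workers)]
    for t in tuples:
        parts[zlib.crc32(t[key].encode("utf-8")) % workers].append(t)
    return parts

def worker_rollup(part, over):
    """Each worker sorts its own partition and counts tuples per group."""
    part = sorted(part, key=lambda t: t[over])
    return [{over: k, "count(*)": len(list(g))}
            for k, g in groupby(part, key=lambda t: t[over])]

docs = [{"manu": m} for m in
        ["corsair", "apple", "belkin", "corsair", "belkin", "corsair"]]

# Partitioning on "manu" puts every tuple of a group on the same worker,
# so each worker's counts are already final.
partials = [worker_rollup(p, "manu") for p in partition(docs, "manu", 2)]

# The outer sort matches the inner sort, so a streaming merge keeps global order.
merged = list(heapq.merge(*partials, key=lambda t: t["manu"]))
```

Partitioning on a different field would split a manufacturer across workers and yield several partial counts for the same group.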


Joins!

innerJoin(
  search(people, q=*:*, fl="personId,name", sort="personId asc"),
  search(pets, q=type:cat, fl="personId,petName", sort="personId asc"),
  on="personId"
)

Also available: leftOuterJoin, hashJoin, outerHashJoin
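Both inputs of the innerJoin example are sorted on the join field, which allows a merge join that walks the two streams in step. A minimal sketch of that technique, assuming ascending sort on the join key; the function name and sample data are illustrative.

```python
def inner_join(left, right, on):
    """Merge join: both inputs must already be sorted ascending on `on`."""
    right = iter(right)
    r = next(right, None)
    run_key, run = None, []  # right-hand tuples sharing the most recent key
    for left_t in left:
        key = left_t[on]
        if key != run_key:
            # Skip right-hand tuples whose key is below the current left key.
            while r is not None and r[on] < key:
                r = next(right, None)
            # Collect the run of right-hand tuples that match this key.
            run_key, run = key, []
            while r is not None and r[on] == key:
                run.append(r)
                r = next(right, None)
        for match in run:
            yield {**left_t, **match}

people = [{"personId": 1, "name": "alice"},
          {"personId": 2, "name": "bob"},
          {"personId": 3, "name": "carol"}]
pets = [{"personId": 1, "petName": "Tom"},
        {"personId": 1, "petName": "Felix"},
        {"personId": 3, "petName": "Kit"}]

pairs = [(t["name"], t["petName"]) for t in inner_join(people, pets, "personId")]
```

Only one run of matching tuples is held in memory at a time. A hash join, by contrast, trades memory for not needing sorted inputs.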


More decorators

• complement – emits tuples from A which do not exist in B
• intersect – emits tuples from A which do exist in B
• merge
• top – reorders the stream and returns the top N tuples
• unique – emits only the first tuple for each value
• select – select, rename, or give default values to fields in a tuple
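Two of these decorators are easy to picture in a few lines of Python. The sketch assumes `unique` receives a stream sorted on the field it deduplicates, and that `top` ranks by a single numeric field in descending order; both function signatures are hypothetical.

```python
import heapq

def unique(stream, over):
    """Emit only the first tuple of each run of equal `over` values."""
    last = object()  # sentinel that never equals a real field value
    for t in stream:
        if t[over] != last:
            last = t[over]
            yield t

def top(stream, n, field):
    """Return the n tuples with the largest `field`, in descending order."""
    return heapq.nlargest(n, stream, key=lambda t: t[field])

stream = [
    {"manu": "apple", "price": 399.0},
    {"manu": "belkin", "price": 19.95},
    {"manu": "belkin", "price": 11.5},
    {"manu": "canon", "price": 329.95},
]
firsts = [t["price"] for t in unique(stream, "manu")]
best = [t["manu"] for t in top(stream, 2, "price")]
```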


Interesting streams

• update stream – indexes input into another SolrCloud collection!
• daemon stream – blocks until more data is available from underlying stream
• topic stream – a publish/subscribe messaging service
  • checkpoints are persisted in a Solr collection
  • resubmit to get new stuff
  • combine with daemon stream to automatically get continuous updates over time
  • further combine with update stream to push all matches to another collection

topic(checkpointCollection, dataCollection, id="topicA", q="solr rocks", checkpointEvery="1000")
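The checkpoint idea can be sketched as "remember the highest version already delivered, and return only newer matches on the next call". The toy class below keeps the checkpoint in memory, whereas Solr persists it in the checkpoint collection. The `version` field and the `Topic` class are assumptions made for illustration.

```python
class Topic:
    """Toy topic: remembers the highest document version already delivered."""

    def __init__(self):
        self.checkpoint = 0  # in Solr this would live in the checkpoint collection

    def poll(self, docs, matches):
        """Return matching documents newer than the checkpoint, then advance it."""
        new = sorted((d for d in docs
                      if d["version"] > self.checkpoint and matches(d)),
                     key=lambda d: d["version"])
        if new:
            self.checkpoint = new[-1]["version"]
        return new

def matches(doc):
    return "solr rocks" in doc["text"]

topic = Topic()
docs = [{"id": "a", "version": 1, "text": "solr rocks"},
        {"id": "b", "version": 2, "text": "something else"}]

first = topic.poll(docs, matches)            # initial submit
docs.append({"id": "c", "version": 3, "text": "solr rocks again"})
second = topic.poll(docs, matches)           # resubmit: only the new match
```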


jdbc() expression stream

join with other data sources!

innerJoin(   // example from JDBCStreamTest
  select(
    search(collection1, fl="personId_i,rating_f", q="rating_f:*", sort="personId_i asc"),
    personId_i as personId,
    rating_f as rating
  ),
  select(
    jdbc(connection="jdbc:hsqldb:mem:.",
         sql="select PEOPLE.ID as PERSONID, PEOPLE.NAME, COUNTRIES.COUNTRY_NAME from PEOPLE inner join COUNTRIES on PEOPLE.COUNTRY_CODE = COUNTRIES.CODE order by PEOPLE.ID",
         sort="ID asc",
         get_column_name=true),
    ID as personId,
    NAME as personName,
    COUNTRY_NAME as country
  ),
  on="personId"
)


Parallel SQL


/sql Handler

• /sql handler is there by default on all Solr nodes
• Translates SQL -> parallel streaming expressions
• SQL tables map to SolrCloud collections
• Query planner / optimizer
• Currently uses Presto parser
  • May switch to Apache Calcite?


Simplest SQL Example

$ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"

{"result-set":{"docs":[{"id":"EN7800GTX/2DHTV/256M"},{"id":"100-435805"},{"id":"UTF8TEST"},{"id":"SOLR1000"},{"id":"9885A004"},[...]

tables map to collections


SQL handler HTTP parameters

curl http://localhost:8983/solr/techproducts/sql -d '
  &stmt=<sql_statement>
  &numWorkers=4                    // currently used by GROUP BY and DISTINCT (via parallel stream)
  &workerCollection=collection1    // where to create intermediate workers
  &workerZkhost=localhost:9983     // cluster (zookeeper ensemble) address
  &aggregationMode=map_reduce | facet'


The WHERE clause

• WHERE clauses are all pushed down to the search layer

select id
  where popularity=10                                        // simple match on numeric field "popularity"
  where popularity='[5 TO 10]'                               // solr range query (note the quotes)
  where name='hard drive'                                    // phrase query on the "name" field
  where name='((memory retail) AND popularity:[5 TO 10])'    // arbitrary solr query
  where name='(memory retail)' AND popularity='[5 TO 10]'    // boolean logic


Ordering and Limiting

select id,score from techproducts
  where text='(memory hard drive)'
  ORDER BY popularity desc   // default order is score desc for limited queries
  LIMIT 100

• Limited queries use /select handler
• Unlimited queries use /export handler
  • fields selected need to be docValues
  • fields in "order by" need to be docValues
  • no "score" field allowed


More SQL examples

select distinct fieldA as fa, fieldB as fb
  from tableA
  order by fa desc, fb desc

// simple stats
select count(fieldA) as count, sum(fieldB) as sum
  from tableA
  where fieldC = 'Hello'

select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY)
  from tableA
  where fieldC = 'term1 term2'
  group by fieldA, fieldB
  having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
  order by sum(fieldC) asc


Solr JDBC Driver


Solr JDBC driver works with Zeppelin


More Solr6 Features


Graph Query

• Basic (non-distributed) graph traversal query
• Follows nodes to edges, optionally filtering during traversal
• Currently only a "filter" query (produces a set of documents)
• Parameters: from, to, traversalFilter, returnRoot, returnOnlyLeaf, maxDepth

• This example query matches “Philip J. Fry” and all of his ancestors:
  fq={!graph from=parent_id to=id}id:"Philip J. Fry"
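The from/to semantics amount to a breadth-first traversal: take the "from" field values of the current documents and find documents whose "to" field matches them. A small in-memory sketch follows; the family data other than "Philip J. Fry" is made up, `parent_id` is assumed to be multi-valued, and filtering options such as traversalFilter are left out.

```python
def graph_query(docs, roots, from_field, to_field, max_depth=-1):
    """Follow each doc's from_field values to docs whose to_field matches."""
    by_to = {}
    for d in docs:
        by_to.setdefault(d[to_field], []).append(d)

    seen = {d["id"] for d in roots}
    result = list(roots)  # root documents are included in the result
    frontier, depth = list(roots), 0
    while frontier and (max_depth < 0 or depth < max_depth):
        next_frontier = []
        for d in frontier:
            for value in d.get(from_field, []):
                for hit in by_to.get(value, []):
                    if hit["id"] not in seen:
                        seen.add(hit["id"])
                        next_frontier.append(hit)
        result.extend(next_frontier)
        frontier, depth = next_frontier, depth + 1
    return result

docs = [
    {"id": "Philip J. Fry", "parent_id": ["Yancy Fry Sr.", "Mrs. Fry"]},
    {"id": "Yancy Fry Sr.", "parent_id": ["Grandpa Fry"]},
    {"id": "Mrs. Fry", "parent_id": []},
    {"id": "Grandpa Fry", "parent_id": []},
    {"id": "Bender", "parent_id": []},
]
roots = [d for d in docs if d["id"] == "Philip J. Fry"]

ancestors = [d["id"] for d in graph_query(docs, roots, "parent_id", "id")]
one_hop = [d["id"] for d in graph_query(docs, roots, "parent_id", "id", max_depth=1)]
```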


Scoring changes

• For docCount (i.e. idf) in scoring, use the number of documents with that field rather than the number of documents in the whole index (maxDoc).
  • can add documents of a different type and not disturb/skew scoring
• BM25 scoring by default
  • tweakable on a per-fieldType basis ("k1" and "b" factors)
  • classic tf-idf still available
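To see what "k1" and "b" control, here is the textbook BM25 score for a single term, written as a small Python function. It is a sketch of the formula rather than Lucene's code: the defaults k1=1.2 and b=0.75 are the commonly used values, and `doc_count` stands for the number of documents that have the field, as described above.

```python
import math

def bm25_term_score(freq, doc_len, avg_doc_len, doc_freq, doc_count,
                    k1=1.2, b=0.75):
    """BM25 score of one term in one document.

    k1 controls how quickly repeated occurrences saturate;
    b controls how strongly long documents are penalized.
    """
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1) / (freq + length_norm)

# Same term, same statistics, but the second document is twice the average length.
average_doc = bm25_term_score(freq=1, doc_len=10, avg_doc_len=10,
                              doc_freq=1, doc_count=10)
long_doc = bm25_term_score(freq=1, doc_len=20, avg_doc_len=10,
                           doc_freq=1, doc_count=10)
```

Because `doc_count` only counts documents containing the field, adding documents of a different type leaves these inputs, and therefore the score, unchanged.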


Cross DC Replication


Thank you!
