solr 6 feature preview
1 © Cloudera, Inc. All rights reserved. Solr 6 Feature Preview Yonik Seeley 3/09/2016

Upload: yonik-seeley

Post on 11-Feb-2017

TRANSCRIPT

Page 1: Solr 6 Feature Preview

© Cloudera, Inc. All rights reserved.

Solr 6 Feature Preview

Yonik Seeley
3/09/2016

Page 2: Solr 6 Feature Preview

My Background

• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford

Page 3: Solr 6 Feature Preview

Solr 6

• Happy Birthday Solr!
• 10 years at the Apache Software Foundation as of 1/2016

• Release branch has been cut
• ETA before April
• Java 8+ only

Page 4: Solr 6 Feature Preview

Streaming Expressions

Page 5: Solr 6 Feature Preview

Solr Streaming Expressions

• Generic platform for distributed computation
• The basis for implementing distributed SQL

• Works across entire result sets (or subsets)
  • normal search operations are designed for fast top-N operations

• Map-reduce like "shuffle" partitions result sets for greater scalability
• Worker nodes can be allocated from a collection for parallelism

Page 6: Solr 6 Feature Preview

Tuple Streams

• A streaming expression compiles/parses to a tuple stream
  • direct mapping from a streaming expression function -> tuple_stream

• Stream Sources – produce a tuple stream
• Stream Decorators – operate on tuple streams
• Designed to include streams from non-Solr systems

Page 7: Solr 6 Feature Preview

search() expression

$ curl http://localhost:8983/solr/techproducts/stream -d 'expr=search(techproducts, q="*:*", fl="id,price,score", sort="id asc")'

{"result-set":{"docs":[{"score":1.0,"id":"0579B002","price":179.99},{"score":1.0,"id":"100-435805","price":649.99},{"score":1.0,"id":"3007WFP","price":2199.0},{"score":1.0,"id":"VDBDB1A16"},{"score":1.0,"id":"VS1GB400C3","price":74.99},{"EOF":true,"RESPONSE_TIME":6}]}}

resulting tuple stream
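
A client reading this response has to separate the data tuples from the final EOF marker tuple. The following is a minimal Python sketch that assumes only the response shape shown above; the function name `read_tuples` is hypothetical and not part of any Solr client library.

```python
import json

def read_tuples(response_text):
    """Return the data tuples from a /stream response, stopping at the EOF marker."""
    docs = json.loads(response_text)["result-set"]["docs"]
    tuples = []
    for doc in docs:
        if doc.get("EOF"):  # the last tuple carries EOF and RESPONSE_TIME, not data
            break
        tuples.append(doc)
    return tuples

# Abbreviated copy of the response shown on the slide.
sample = (
    '{"result-set":{"docs":['
    '{"score":1.0,"id":"0579B002","price":179.99},'
    '{"score":1.0,"id":"VDBDB1A16"},'
    '{"EOF":true,"RESPONSE_TIME":6}]}}'
)

for t in read_tuples(sample):
    # A tuple may lack a field (VDBDB1A16 has no price), so use .get().
    print(t["id"], t.get("price"))
```

Note that a field missing from a document is simply absent from its tuple, so consumers should not assume every key is present.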

Page 8: Solr 6 Feature Preview

Search Tuple Stream

[Diagram: Shards 1-3 (Replicas 1 and 2), Tuple Stream, Worker, Tuple Stream]

/stream worker executing the "search" expression

• search() is a stream source
• SolrCloud aware (CloudSolrStream java class)
• Fully streaming (no big buffers)
• Worker node doesn't need to be a Solr node
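
One way to picture "fully streaming" is a lazy k-way merge: if each shard returns tuples already sorted on the same field, a worker can emit a globally sorted stream while holding only one pending tuple per shard. The sketch below uses Python's `heapq.merge` to illustrate the idea; it is an assumption-based illustration, not the actual CloudSolrStream code, and `merge_shard_streams` is a made-up name.

```python
import heapq

def merge_shard_streams(shard_streams, sort_field):
    """Lazily merge per-shard iterators that are each sorted on sort_field."""
    return heapq.merge(*shard_streams, key=lambda t: t[sort_field])

# Three hypothetical shard responses, each sorted by id ascending.
shard1 = iter([{"id": "a"}, {"id": "d"}])
shard2 = iter([{"id": "b"}, {"id": "e"}])
shard3 = iter([{"id": "c"}])

merged = [t["id"] for t in merge_shard_streams([shard1, shard2, shard3], "id")]
print(merged)
```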

Page 9: Solr 6 Feature Preview

search expression args

search(                              // parses to CloudSolrStream java class
  techproducts,                      // name of the collection to search
  zkHost="localhost:9983",           // (opt) zookeeper address of collection to search
  qt="/select",                      // (opt) the request handler to use (/export is also available)
  rows=1000000,                      // (opt) number of rows to retrieve
  q=*:*,                             // query to match returned documents
  fl="id,price,score",               // which fields to return
  sort="id asc, price desc",         // how to sort the results
  aliases="id=myid,price=myprice"    // (opt) renames output fields
)

Page 10: Solr 6 Feature Preview

reduce() streaming expression

• Groups tuples by common field values
• Emits one group-head per group
• Each group-head contains a list of tuples
• "by" parameter must match up with "sort" parameter
• Any partitioning should be done on the same group field.

reduce(
  search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc, price desc"),
  by="manu",
  group(sort="price desc", n=100)
)

stream operation
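
The grouping behaviour described on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Solr's implementation: the input must already be sorted on the `by` field (which is why "by" has to match "sort"), and the `group` key used to hold the member tuples is an assumed name.

```python
from itertools import groupby

def reduce_group(tuples, by, sort_key, n):
    """Emit one group head per run of equal `by` values.

    Assumes `tuples` is already sorted on `by`, as the slide requires.
    """
    for _, run in groupby(tuples, key=lambda t: t[by]):
        # Keep at most n members, ordered by sort_key descending.
        members = sorted(run, key=lambda t: t[sort_key], reverse=True)[:n]
        head = dict(members[0])   # copy of the first tuple in the group
        head["group"] = members   # the group head carries the member tuples
        yield head

# Made-up tuples, sorted by manu ascending.
stream = [
    {"id": "1", "manu": "apple", "price": 399.0},
    {"id": "2", "manu": "belkin", "price": 19.95},
    {"id": "3", "manu": "belkin", "price": 11.5},
]

heads = list(reduce_group(stream, by="manu", sort_key="price", n=100))
for head in heads:
    print(head["manu"], [t["id"] for t in head["group"]])
```

Because only runs of adjacent equal values are grouped, an unsorted input would produce several partial groups for the same value.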

Page 11: Solr 6 Feature Preview

rollup() expression

• Groups tuples by common field values
• Emits rollup value along with metrics
• Closest equivalent to faceting

rollup(
  search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc"),
  over="manu",
  count(*),
  max(price)
)

metrics

{"result-set":{"docs":[{"manu":"apple","count(*)":1.0},{"manu":"asus","count(*)":1.0},{"manu":"ati","count(*)":1.0},{"manu":"belkin","count(*)":2.0},{"manu":"canon","count(*)":2.0},{"manu":"corsair","count(*)":3.0},[...]
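
As a rough mental model, rollup walks a stream sorted on the `over` field and keeps running metrics per run of equal values. The Python sketch below computes `count(*)` and a single `max(...)` metric; the function and its signature are hypothetical, and the output key names simply mirror the style of the response above.

```python
from itertools import groupby

def rollup(tuples, over, max_field):
    """Emit one tuple per run of equal `over` values with count(*) and max(field)."""
    for value, run in groupby(tuples, key=lambda t: t[over]):
        run = list(run)
        values = [t[max_field] for t in run if max_field in t]
        out = {over: value, "count(*)": len(run)}
        if values:  # skip the metric when no tuple in the run has the field
            out[f"max({max_field})"] = max(values)
        yield out

# Made-up tuples, sorted by manu ascending.
stream = [
    {"manu": "apple", "price": 399.0},
    {"manu": "belkin", "price": 19.95},
    {"manu": "belkin", "price": 11.5},
]

rows = list(rollup(stream, over="manu", max_field="price"))
for row in rows:
    print(row)
```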

Page 12: Solr 6 Feature Preview

facet() expression

• Like search+rollup, but pushes down computation to JSON Facet API

facet(
  techproducts,
  q="*:*",
  buckets="manu",
  bucketSorts="count(*) desc",
  bucketSizeLimit=1000,
  count(*),
  sum(price),
  max(popularity)
)

{"result-set":{"docs":[{"avg(price)":129.99, "max(popularity)":7.0,"manu":"corsair","count(*)":3},{"avg(price)":15.72,"max(popularity)":1.0,"manu":"belkin","count(*)":2},{"avg(price)":254.97,"max(popularity)":7.0,"manu":"canon","count(*)":2},{"avg(price)":399.0,"max(popularity)":10.0,"manu":"apple","count(*)":1},{"avg(price)":479.95,"max(popularity)":7.0,"manu":"asus","count(*)":1},{"avg(price)":649.98,"max(popularity)":7.0,"manu":"ati","count(*)":1},{"avg(price)":0.0,"max(popularity)":"NaN","manu":"boa","count(*)":1},[...]

Page 13: Solr 6 Feature Preview

Parallel Tuple Stream

[Diagram: Shards 1-3 (Replicas 1 and 2), Worker Partition 1, Worker Partition 2, Worker, Tuple Stream]

Page 14: Solr 6 Feature Preview

Streaming Expressions – parallel

• Wraps a stream and sends to N worker nodes
• The first parameter is the collection to use for the intermediate worker nodes
• partitionKeys must be provided to underlying workers
  • usually makes sense to partition by what you are grouping on
• inner and outer sorts should match

parallel(collection1,
  rollup(
    search(techproducts, q="*:*", fl="id,manu,price", sort="manu asc", partitionKeys="manu"),
    over="manu"),
  workers=2,
  zkHost="localhost:9983",
  sort="manu asc")
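
Why partition on the grouping field? If every tuple with the same `manu` value lands on the same worker, each worker can finish its groups alone and the results only need to be merged, never re-aggregated. The Python sketch below illustrates this with a CRC32 hash; the hash function is a stand-in chosen for determinism and is an assumption, not a claim about how Solr partitions.

```python
import zlib
from collections import Counter

def partition(tuples, key, workers):
    """Assign each tuple to a worker by hashing its partition key."""
    parts = [[] for _ in range(workers)]
    for t in tuples:
        idx = zlib.crc32(str(t[key]).encode("utf-8")) % workers
        parts[idx].append(t)
    return parts

# Made-up stream of tuples.
stream = [{"manu": m} for m in ["apple", "belkin", "belkin", "canon", "canon", "corsair"]]
parts = partition(stream, "manu", workers=2)

# Each worker counts only its own partition. A given manu value always hashes
# to the same worker, so no group is split across workers.
per_worker = [Counter(t["manu"] for t in part) for part in parts]
total = sum(per_worker, Counter())
print(dict(total))
```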

Page 15: Solr 6 Feature Preview

Joins!

innerJoin(
  search(people, q=*:*, fl="personId,name", sort="personId asc"),
  search(pets, q=type:cat, fl="personId,petName", sort="personId asc"),
  on="personId"
)

leftOuterJoin, hashJoin, outerHashJoin,
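
Both inner searches sort on `personId`, which is what allows the join to run as a streaming merge rather than loading either side fully into memory. The following Python sketch shows a generic sorted merge join under that assumption; it illustrates the technique and is not taken from Solr's source.

```python
def inner_merge_join(left, right, on):
    """Inner-join two iterables of dicts, both sorted ascending on `on`."""
    left, right = iter(left), iter(right)
    lt = next(left, None)
    rt = next(right, None)
    while lt is not None and rt is not None:
        if lt[on] < rt[on]:
            lt = next(left, None)
        elif lt[on] > rt[on]:
            rt = next(right, None)
        else:
            key = lt[on]
            # Buffer only the right-side tuples that share the current key.
            matches = []
            while rt is not None and rt[on] == key:
                matches.append(rt)
                rt = next(right, None)
            while lt is not None and lt[on] == key:
                for m in matches:
                    yield {**lt, **m}
                lt = next(left, None)

# Made-up data, each list sorted by personId.
people = [
    {"personId": 1, "name": "Ann"},
    {"personId": 2, "name": "Bob"},
    {"personId": 3, "name": "Cy"},
]
pets = [
    {"personId": 1, "petName": "Tom"},
    {"personId": 1, "petName": "Felix"},
    {"personId": 3, "petName": "Kit"},
]

rows = list(inner_merge_join(people, pets, on="personId"))
for row in rows:
    print(row["name"], row["petName"])
```

A hash-based join would instead load one side into a dictionary, trading memory for the freedom to skip the sort.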

Page 16: Solr 6 Feature Preview

More decorators

• complement – emits tuples from A which do not exist in B
• intersect – emits tuples from A which do exist in B
• merge
• top – reorders the stream and returns the top N tuples
• unique – emits only the first tuple for each value
• select – select, rename, or give default values to fields in a tuple
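
Two of these decorators are easy to sketch in Python. The functions below are hypothetical illustrations of the described behaviour: `unique` assumes its input is sorted on the field it deduplicates, and `top` uses a heap so it never needs to sort the whole stream.

```python
import heapq
from itertools import groupby

def unique(tuples, over):
    """Keep the first tuple of each run of equal `over` values (input sorted on `over`)."""
    for _, run in groupby(tuples, key=lambda t: t[over]):
        yield next(run)

def top(tuples, n, field):
    """Return the n tuples with the largest `field` values, highest first."""
    return heapq.nlargest(n, tuples, key=lambda t: t[field])

# Made-up tuples, sorted by manu ascending.
data = [
    {"manu": "belkin", "price": 19.95},
    {"manu": "belkin", "price": 11.5},
    {"manu": "canon", "price": 329.95},
]

firsts = list(unique(data, over="manu"))
best = top(data, n=2, field="price")
print(firsts)
print(best)
```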

Page 17: Solr 6 Feature Preview

Interesting streams

• update stream – indexes input into another SolrCloud collection!
• daemon stream – blocks until more data is available from underlying stream
• topic stream – a publish/subscribe messaging service
  • checkpoints are persisted in a Solr collection
  • resubmit to get new stuff
  • combine with daemon stream to automatically get continuous updates over time
  • further combine with update stream to push all matches to another collection

topic(checkpointCollection, dataCollection, id="topicA", q="solr rocks", checkpointEvery="1000")
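
The "resubmit to get new stuff" behaviour comes down to remembering a checkpoint between calls. The toy Python class below illustrates that idea only: the `version` field, the in-memory checkpoint, and the `Topic` class are all assumptions made for the sketch, whereas the real topic stream persists its checkpoints in a Solr collection.

```python
class Topic:
    """Toy topic: returns only documents newer than the stored checkpoint."""

    def __init__(self):
        self.checkpoint = 0  # stands in for the checkpoint persisted in a collection

    def poll(self, docs):
        fresh = [d for d in docs if d["version"] > self.checkpoint]
        if fresh:
            self.checkpoint = max(d["version"] for d in fresh)
        return fresh

# Made-up matching documents with a hypothetical monotonically increasing version.
docs = [{"id": "a", "version": 1}, {"id": "b", "version": 2}]

topic = Topic()
first = topic.poll(docs)    # everything is new on the first call
second = topic.poll(docs)   # nothing changed, so nothing is returned
docs.append({"id": "c", "version": 3})
third = topic.poll(docs)    # only the newly added document

print(len(first), len(second), [d["id"] for d in third])
```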

Page 18: Solr 6 Feature Preview

jdbc() expression stream

join with other data sources!

innerJoin(  // example from JDBCStreamTest
  select(
    search(collection1, fl="personId_i,rating_f", q="rating_f:*", sort="personId_i asc"),
    personId_i as personId,
    rating_f as rating
  ),
  select(
    jdbc(connection="jdbc:hsqldb:mem:.",
         sql="select PEOPLE.ID as PERSONID, PEOPLE.NAME, COUNTRIES.COUNTRY_NAME from PEOPLE inner join COUNTRIES on PEOPLE.COUNTRY_CODE = COUNTRIES.CODE order by PEOPLE.ID",
         sort="ID asc",
         get_column_name=true),
    ID as personId,
    NAME as personName,
    COUNTRY_NAME as country
  ),
  on="personId"
)

Page 19: Solr 6 Feature Preview

Parallel SQL

Page 20: Solr 6 Feature Preview

/sql Handler

• /sql handler is there by default on all Solr nodes
• Translates SQL -> parallel streaming expressions
• SQL tables map to SolrCloud collections
• Query planner / optimizer
• Currently uses Presto parser
• May switch to Apache Calcite?

Page 21: Solr 6 Feature Preview

Page 22: Solr 6 Feature Preview

Simplest SQL Example

$ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"

{"result-set":{"docs":[{"id":"EN7800GTX/2DHTV/256M"},{"id":"100-435805"},{"id":"UTF8TEST"},{"id":"SOLR1000"},{"id":"9885A004"},[...]

tables map to collections

Page 23: Solr 6 Feature Preview

SQL handler HTTP parameters

curl http://localhost:8983/solr/techproducts/sql -d '
  &stmt=<sql_statement>
  &numWorkers=4                      // currently used by GROUP BY and DISTINCT (via parallel stream)
  &workerCollection=collection1      // where to create intermediate workers
  &workerZkhost=localhost:9983       // cluster (zookeeper ensemble) address
  &aggregationMode=map_reduce | facet'
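
When these parameters are sent from code rather than curl, the SQL statement has to be form-encoded. The Python sketch below builds such a request body with the standard library; the helper name `sql_request_body` and the example statement are made up, and the parameter names are the ones listed on this slide.

```python
from urllib.parse import urlencode, parse_qs

def sql_request_body(stmt, **options):
    """Build the form-encoded POST body for the /sql handler."""
    params = {"stmt": stmt}
    params.update(options)
    return urlencode(params)

body = sql_request_body(
    "select manu, count(*) from techproducts group by manu",
    numWorkers=4,
    workerCollection="collection1",
    workerZkhost="localhost:9983",
    aggregationMode="facet",
)
print(body)

# Decoding the body recovers the original parameter values.
decoded = parse_qs(body)
print(decoded["stmt"][0])
```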

Page 24: Solr 6 Feature Preview

The WHERE clause

• WHERE clauses are all pushed down to the search layer

select id
  where popularity=10                                        // simple match on numeric field "popularity"
  where popularity='[5 TO 10]'                               // solr range query (note the quotes)
  where name='hard drive'                                    // phrase query on the "name" field
  where name='((memory retail) AND popularity:[5 TO 10])'    // arbitrary solr query
  where name='(memory retail)' AND popularity='[5 TO 10]'    // boolean logic

Page 25: Solr 6 Feature Preview

Ordering and Limiting

select id,score from techproducts
  where text='(memory hard drive)'
  ORDER BY popularity desc    // default order is score desc for limited queries
  LIMIT 100

• Limited queries use /select handler
• Unlimited queries use /export handler
  • fields selected need to be docValues
  • fields in "order by" need to be docValues
  • no "score" field allowed

Page 26: Solr 6 Feature Preview

More SQL examples

select distinct fieldA as fa, fieldB as fb
  from tableA
  order by fa desc, fb desc

// simple stats
select count(fieldA) as count, sum(fieldB) as sum
  from tableA
  where fieldC = 'Hello'

select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY)
  from tableA
  where fieldC = 'term1 term2'
  group by fieldA, fieldB
  having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
  order by sum(fieldC) asc

Page 27: Solr 6 Feature Preview

Solr JDBC Driver

Page 28: Solr 6 Feature Preview

Solr JDBC driver works with Zeppelin

Page 29: Solr 6 Feature Preview

More Solr6 Features

Page 30: Solr 6 Feature Preview

Graph Query

• Basic (non-distributed) graph traversal query
• Follows nodes to edges, optionally filtering during traversal
• Currently only a "filter" query (produces a set of documents)
• Parameters: from, to, traversalFilter, returnRoot, returnOnlyLeaf, maxDepth

• This example query matches “Philip J. Fry” and all of his ancestors:
  fq={!graph from=parent_id to=id}id:"Philip J. Fry"
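
The traversal can be pictured as a breadth-first walk: start from the documents matching the root query, read their `from` field values, and hop to documents whose `to` field holds one of those values. The Python sketch below works over an in-memory list of made-up documents; it illustrates the idea only and ignores options such as traversalFilter and maxDepth.

```python
def graph_query(docs, roots, from_field, to_field):
    """Return ids of the root docs plus every doc reachable by following
    from_field values to docs whose to_field matches."""
    by_to = {}
    for d in docs:
        by_to.setdefault(d[to_field], []).append(d)

    seen = set()
    frontier = [d for d in docs if d["id"] in roots]
    while frontier:
        nxt = []
        for d in frontier:
            if d["id"] in seen:
                continue
            seen.add(d["id"])
            for value in d.get(from_field, []):
                nxt.extend(by_to.get(value, []))
        frontier = nxt
    return seen

# Made-up family tree; only the root id comes from the slide.
docs = [
    {"id": "Philip J. Fry", "parent_id": ["Parent A", "Parent B"]},
    {"id": "Parent A", "parent_id": ["Grandparent A"]},
    {"id": "Parent B", "parent_id": []},
    {"id": "Grandparent A"},
    {"id": "Unrelated", "parent_id": ["Parent A"]},
]

ancestors = graph_query(docs, {"Philip J. Fry"}, from_field="parent_id", to_field="id")
print(sorted(ancestors))
```

The "Unrelated" document points at "Parent A" but is never reached, because edges are only followed from `parent_id` to `id`, not the other way around.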

Page 31: Solr 6 Feature Preview

Scoring changes

• For docCount (i.e. idf) in scoring, use the number of documents with that field rather than the number of documents in the whole index (maxDoc).
  • can add documents of a different type and not disturb/skew scoring

• BM25 scoring by default
  • tweakable on a per-fieldType basis ("k1" and "b" factors)
  • classic tf-idf still available
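
To see what "k1" and "b" control, and why the choice of docCount matters, here is the commonly cited BM25 term-score formula as a small Python function. It is a sketch of the textbook form with the usual defaults (k1=1.2, b=0.75); Lucene's implementation may differ in details such as how document lengths are stored, and all numbers in the example are made up.

```python
import math

def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document.

    k1 controls term-frequency saturation; b controls length normalization.
    doc_count is the number of documents used for idf.
    """
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1) / (freq + norm)

# Same term statistics, scored against two different document counts:
# 100 documents that have the field vs. 1000 documents in the whole index.
field_based = bm25_term_score(freq=2, doc_len=10, avg_doc_len=10, doc_count=100, doc_freq=5)
index_based = bm25_term_score(freq=2, doc_len=10, avg_doc_len=10, doc_count=1000, doc_freq=5)
print(field_based, index_based)
```

Adding 900 unrelated documents without the field changes the index-based score but leaves the field-based score untouched, which is the point of the docCount change.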

Page 32: Solr 6 Feature Preview

Cross DC Replication

Page 33: Solr 6 Feature Preview

Thank you!

[email protected]