Data IO: Next Generation Search with Lucene and Solr 4

© Copyright 2013 Next Generation Search with Lucene and Solr 4 Grant Ingersoll CTO, LucidWorks Read More: http://ibm.co/1dJvL9k

Upload: grant-ingersoll

Post on 27-Jan-2015


Category:

Technology



DESCRIPTION

Overview talk on Lucene and Solr 4 features, using search for alternative problems.

TRANSCRIPT

Page 1: Data IO: Next Generation Search with Lucene and Solr 4

© Copyright 2013

Next Generation Search with Lucene and Solr 4

Grant Ingersoll
CTO, LucidWorks
Read More: http://ibm.co/1dJvL9k

Page 2: Data IO: Next Generation Search with Lucene and Solr 4

© 2013 LucidWorks

• Search is Everywhere!

• The Bar is Raised

• Holistic view of the data AND the users is critical

Search is Dead, Long Live Search

Page 3: Data IO: Next Generation Search with Lucene and Solr 4


Search is good for…

• Classic: Fast, fuzzy text matching across a large document collection

• NoSQL and de-normalized data: “light” relational

• Top N problems

• Faceting, slicing and dicing of numerical/enumerated data

• Spatial, spell checking, record linkage, highlighting

Page 4: Data IO: Next Generation Search with Lucene and Solr 4


Page 5: Data IO: Next Generation Search with Lucene and Solr 4


Lucene: Speed and Memory

• Native Near Real Time (NRT) support
  - Per segment
  - FieldCache can be controlled to only load new segments
  - Soft commit: faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
  - Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• String -> BytesRef
  - Much improved data structure
  - … means less memory and less garbage collection effort

Page 6: Data IO: Next Generation Search with Lucene and Solr 4


Up and to the Right

• http://people.apache.org/~mikemccand/lucenebench/indexing.html


Page 7: Data IO: Next Generation Search with Lucene and Solr 4


Lucene: Flexibility

• Flexible Index Formats
  - New posting list codecs: Block, Simple Text, Append (HDFS..), etc
  - Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks
• Pluggable Scoring
  - Decoupled from TF/IDF
  - Built in alternatives include BM25 & DFR, and others
    » http://en.wikipedia.org/wiki/Okapi_BM25
    » http://terrier.org/docs/v3.5/dfr_description.html
  - Add your own
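To make the BM25 alternative concrete, here is a minimal sketch of the textbook Okapi BM25 score for a single query term. It is not taken from Lucene's source. The function name and the `k1=1.2`, `b=0.75` values are assumptions for illustration (they are commonly used defaults).

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document (textbook Okapi BM25).

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    num_docs    -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Inverse document frequency; the 1.0 inside the log keeps it non-negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: long documents need more occurrences to score the same.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * tf * (k1 + 1.0) / (tf + norm)


if __name__ == "__main__":
    for tf in (1, 2, 5, 20):
        print(tf, round(bm25_term_score(tf, 100, 100, 1000, 10), 3))
```

Unlike raw TF/IDF, the term-frequency part is bounded above by `k1 + 1`, so repeating a term many times yields diminishing returns.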

Page 8: Data IO: Next Generation Search with Lucene and Solr 4


FS(A|T)

• Keys:
  - byte[] – write-once
  - Linear time build of min. automata (n log n if not sorted, which isn’t our case)
  - Compression
  - Reverse lookups
  - Weights (used for auto-suggest)
  - Pluggable Algebra
• Uses:
  - Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
  - FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
  - http://slidesha.re/vKtpVA
  - http://bit.ly/Pkjyu0
  - “Smaller Representation of Finite State Automata”
    » Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.

Page 9: Data IO: Next Generation Search with Lucene and Solr 4


Recent Additions

• Replication module

• New Faceting capabilities

• New Suggester to handle infix suggestions

• Analysis Additions
  - Norwegian, Scandinavian alternatives

• Memory and FST improvements

Page 10: Data IO: Next Generation Search with Lucene and Solr 4


Page 11: Data IO: Next Generation Search with Lucene and Solr 4


Solr 4: New Features

• Search/Faceting/Relevance
  - New Relevance Function Queries (tf, df, others)
  - Pivot Faceting
  - Pseudo-join
  - Improved Spatial (more later)
  - Full support for Lucene Codecs, pluggable scoring
• Indexing
  - New Update Processors, including scripting option
  - Near real time
• Codec/Similarity support from Lucene 4
• Other
  - New Admin UI

Page 12: Data IO: Next Generation Search with Lucene and Solr 4


Geospatial improvements

• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle using Well Known Text
• Indexing:
  - "geo":"43.17614,-90.57341"
  - "geo":"Circle(4.56,1.23 d=0.0710)"
  - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
  - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
  - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
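As a rough sketch of how a client might assemble the values above, the following Python builds the polygon string, a document, and the `fq` parameter using only the standard library. The helper names and the `doc1` id are hypothetical. The `geo` field name and shape syntax are copied from the slide, and nothing here talks to a Solr server.

```python
import json
from urllib.parse import urlencode


def polygon_wkt(points):
    """Format (x, y) pairs as a WKT POLYGON, closing the ring if needed."""
    ring = list(points)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # WKT rings must end on their first point
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


def intersects_fq(field, shape):
    """Build a filter query in the Intersects(...) form shown on the slide."""
    return f'{field}:"Intersects({shape})"'


# Hypothetical document: only the "geo" value mirrors the slide.
doc = {
    "id": "doc1",
    "geo": polygon_wkt([(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0)]),
}
body = json.dumps([doc])

# Query string for a bounding-box filter (minX minY maxX maxY, as on the slide).
params = urlencode({
    "q": "*:*",
    "fq": intersects_fq("geo", "-74.093 41.042 -69.347 44.558"),
})

if __name__ == "__main__":
    print(body)
    print(params)
```

Building the shape strings in one place keeps quoting and ring-closing mistakes out of the request code.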

Page 13: Data IO: Next Generation Search with Lucene and Solr 4


Scaling Solr

• Distributed/sharded indexing & search
  - Auto distributes updates and queries to appropriate shards
  - Near Real Time (NRT) indexing capable
• Dynamically scalable
  - New SolrCloud instances add indexing and query capacity
  - Supports re-balancing
• Reliable
  - No single point of failure
  - Transactions logged
  - Robust, automatic recovery

• http://wiki.apache.org/solr/SolrCloud

Page 14: Data IO: Next Generation Search with Lucene and Solr 4


Solr as NoSQL

• Characteristics
  - Non-traditional data stores
  - Not designed for SQL type queries
  - Distributed fault tolerant architecture
  - Document oriented, data format agnostic (JSON, XML, CSV, binary)
• Update durability via transaction log
• Real-time /get fetches latest version w/o hard commit
• Versioning and optimistic locking
  - w/ Real Time GET, allows read/write/update w/o conflicts
• Atomic updates
  - Can add/remove/change and increment a field in existing doc w/o re-indexing
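The atomic update and optimistic locking bullets can be sketched as a JSON payload builder. This assumes Solr's JSON update syntax, with `set`, `add`, and `inc` modifiers and a `_version_` field carrying the expected version. The helper name, document id, and field names are made up for illustration, and the payload is only built, not sent.

```python
import json

# Modifiers this sketch knows about; "set" with None clears a field.
ALLOWED_MODIFIERS = {"set", "add", "inc"}


def atomic_update(doc_id, expected_version=None, **ops):
    """Build one atomic-update document.

    ops maps a field name to a (modifier, value) pair,
    e.g. price=("set", 9.99) or views=("inc", 1).
    """
    doc = {"id": doc_id}
    for field, (modifier, value) in ops.items():
        if modifier not in ALLOWED_MODIFIERS:
            raise ValueError(f"unsupported modifier: {modifier}")
        doc[field] = {modifier: value}
    if expected_version is not None:
        # Optimistic locking: the update should be rejected if the stored
        # version no longer matches the one the client last read.
        doc["_version_"] = expected_version
    return doc


payload = json.dumps([
    atomic_update(
        "book1",
        expected_version=1234,      # hypothetical version from an earlier /get
        price=("set", 9.99),        # change a field
        tags=("add", "sale"),       # add a value to a multi-valued field
        views=("inc", 1),           # increment a numeric field
        subtitle=("set", None),     # remove a field (serialised as null)
    )
])

if __name__ == "__main__":
    print(payload)
```

A typical loop would read the document and its version through real-time get, send an update like this one, and retry from the read if the server reports a version conflict.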

Page 15: Data IO: Next Generation Search with Lucene and Solr 4


Recent Additions

• HDFS backed directory for storing index and transaction logs in Apache Hadoop

• New Core discovery capabilities

• Schemaless/External Schema/Field Guessing

• Schema APIs

• Add documents from the Admin UI

Page 16: Data IO: Next Generation Search with Lucene and Solr 4


Applications


Page 17: Data IO: Next Generation Search with Lucene and Solr 4


… Find your Keys, Store Your Content

• Lucene/Solr is a fast key-value store
  - Bonus: search your values!
• NoSQL before NoSQL was cool
• Solr: distributed key/value
  - Durable, Isolated, Redundant, Fast, Real-time
  - Joins, Column Storage
• Solr or Tika + Lucene can index popular office formats
• Solr can backup/replicate and scale as content grows

Page 18: Data IO: Next Generation Search with Lucene and Solr 4


… Find Love! Upsell! Cross-sell!

• Cross recommendation as search
  - with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
  - but not as a search engine for content
  - more like a search engine for behavior
• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
  - http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
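One way to picture "a search engine for behavior" is a toy inverted index in which each item is indexed by behaviors associated with it, and a user's recent behavior is the query. The sketch below is an illustration only: the data is invented, and a plain match count stands in for the relevance scoring a real engine (or the method in the referenced talk) would apply.

```python
from collections import Counter, defaultdict


def build_index(item_indicators):
    """Inverted index: behavior token -> set of items indexed under it."""
    index = defaultdict(set)
    for item, indicators in item_indicators.items():
        for token in indicators:
            index[token].add(item)
    return index


def recommend(index, user_history, top_n=3):
    """Use the user's recent behaviors as the query; rank items by matches."""
    scores = Counter()
    for token in user_history:
        for item in index.get(token, ()):
            scores[item] += 1
    seen = set(user_history)  # do not recommend what the user already touched
    ranked = [item for item, _ in scores.most_common() if item not in seen]
    return ranked[:top_n]


# Invented data: each item lists other items whose users also engaged with it.
ITEM_INDICATORS = {
    "tent": ["sleeping_bag", "stove"],
    "lantern": ["stove", "tent"],
    "novel": ["bookmark"],
}

if __name__ == "__main__":
    idx = build_index(ITEM_INDICATORS)
    print(recommend(idx, ["stove", "tent"]))
```

In a search engine the indicator list would simply be another field on the item document, and the recommendation request an ordinary query against that field.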

Page 19: Data IO: Next Generation Search with Lucene and Solr 4


… Avoid Delays

Page 20: Data IO: Next Generation Search with Lucene and Solr 4


… Wibbly-wobbly Timey-wimey Stuff

• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
  - Useful for Open Hours, Shifts, etc.
• Query using rectangle intersections
  - q = shift:"Intersects(0 19 23 365)"

https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
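The rectangle trick is easier to follow with the encoding spelled out. The sketch below assumes each time range is indexed as a 2D point with x = start and y = end, and that values run from 0 to 365. Both are assumptions about the slide's example, not something it states. Under that encoding, a range overlaps the query window when its start is at or before the window's end and its end is at or after the window's start, which is exactly a rectangle in (start, end) space.

```python
def overlaps_query(field, query_start, query_end, min_value=0, max_value=365):
    """Build an Intersects query matching ranges that overlap the window.

    Rectangle corners are written as: minX minY maxX maxY.
      x (range start) must lie in [min_value, query_end]
      y (range end)   must lie in [query_start, max_value]
    """
    return f'{field}:"Intersects({min_value} {query_start} {query_end} {max_value})"'


def point_in_rect(start, end, query_start, query_end, min_value=0, max_value=365):
    """Pure-Python version of the same test, handy for checking the geometry."""
    return min_value <= start <= query_end and query_start <= end <= max_value


if __name__ == "__main__":
    print(overlaps_query("shift", 19, 23))
    print(point_in_rect(10, 20, 19, 23))   # starts before, ends inside the window
    print(point_in_rect(1, 5, 19, 23))     # ends before the window starts
```

With a window of 19 to 23 the builder reproduces the query string shown on the slide.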

Page 21: Data IO: Next Generation Search with Lucene and Solr 4


Summary

• Lucene/Solr 4.x:
  - Faster
  - More Flexible
  - Easier than ever scaling
  - More reliable than ever

• If you need to rank a bunch of stuff according to some notion of similarity, a search engine is the way to go

Page 22: Data IO: Next Generation Search with Lucene and Solr 4


Where to Next?

• Full article: http://ibm.co/1dJvL9k
• http://www.lucidworks.com
• http://lucene.apache.org/

• Training: http://bit.ly/lws-training

• LucidWorks Search (Solr++) more info: http://bit.ly/lws-more-info

• Twitter: @gsingers, @LucidWorks

• Taming Text: http://www.manning.com/ingersoll