solr search engine with multiple table relation

Powerful Full-Text Search with Solr Jay Bharat [email protected] Carmatec It solution, Bangalore 1 July 2013 1

Upload: jay-bharat

Post on 07-May-2015

DESCRIPTION

Here you can learn how to use the Solr search engine and integrate it into your own application, for example one built on PHP/MySQL. I also introduce how to handle data from multiple tables in Solr.

TRANSCRIPT

Page 1: Solr search engine with multiple table relation

Powerful Full-Text Search with Solr

Jay Bharat [email protected]

Carmatec It solution, Bangalore 1 July 2013


Page 2: Solr search engine with multiple table relation

Implementing search with free software

An introduction to Solr


Page 3: Solr search engine with multiple table relation

Solr Tm -1/2


Page 4: Solr search engine with multiple table relation

Solr Tm-2/2


Page 5: Solr search engine with multiple table relation

What is Solr?

•  Solr is an open source enterprise search server based on the Lucene Java search library.

•  Solr runs in a Java servlet container such as Tomcat or Jetty

•  Solr is free software and a project of the Apache Software Foundation

•  Solr is a sub-project of Lucene and can be found at http://lucene.apache.org/solr/


Page 6: Solr search engine with multiple table relation

Key Features

•  Advanced full-text search
•  Optimized for high-volume web traffic
•  Standards-based open interfaces: XML and HTTP
•  Comprehensive HTML administration interface
•  Server statistics exposed over JMX for monitoring
•  Scalability through efficient replication
•  Flexibility with XML configuration and plugins
•  Push vs. crawl indexing method

Page 7: Solr search engine with multiple table relation

Solr Clients

•  Solr can be integrated with, among others…
   – Ruby
   – PHP
   – Java
   – Python
   – JSON
   – Forrest/Cocoon
   – C# (Deveel Solr Client or SolrNet)
   – ColdFusion
   – Drupal (the apacheSolr project for Drupal)

Page 8: Solr search engine with multiple table relation

Indexing

•  Push vs. crawl
•  schema.xml
•  Add documents
•  HTML interface
   – Update
   – Delete
   – Commit
•  DataImportHandler
   – For searching databases

Page 9: Solr search engine with multiple table relation

Searching

•  Full-text search
   http://localhost:8983/solr/select?q=Iraq
•  Search only within a field
   http://localhost:8983/solr/select?q=category:news
•  Control which fields are displayed in the result
   http://localhost:8983/solr/select?q=video&fl=id,category
•  Provide ranges for fields
   http://localhost:8983/solr/select?q=price:[0 TO 400]&fl=id,name,price
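The query URLs on this slide can also be assembled in code, which takes care of URL-encoding characters such as spaces and brackets. The following is a minimal sketch using only the Python standard library; the helper name `select_url` is an illustrative assumption and not part of any Solr client, and the base URL is the one shown on the slide.

```python
from urllib.parse import urlencode

# Base URL taken from the slide; adjust it for your own installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(q, fl=None, **extra):
    """Build a /select URL (hypothetical helper, not a Solr API)."""
    params = {"q": q}
    if fl:
        # fl is a comma-separated list of fields to return
        params["fl"] = ",".join(fl)
    params.update(extra)
    return SOLR_SELECT + "?" + urlencode(params)


print(select_url("Iraq"))                                  # full-text search
print(select_url("category:news"))                         # search within a field
print(select_url("video", fl=["id", "category"]))          # choose returned fields
print(select_url("price:[0 TO 400]", fl=["id", "name", "price"]))  # range query
```

The encoded form (for example `%3A` for the colon) is equivalent to the readable URLs above; Solr decodes it on receipt.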

Page 10: Solr search engine with multiple table relation

More Searching

•  Faceting information
   http://localhost:8983/solr/select?q=news&fl=id,description&facet=true&facet.field=category
•  More like this (MLT)
   http://localhost:8983/solr/select?q=Iraq&mlt=true&mlt.fl=headline&mlt.mindf=1&mlt.mintf=1&fl=id,score&rows=100
•  More information on how this works and the options available can be found at http://wiki.apache.org/solr/MoreLikeThis

Page 11: Solr search engine with multiple table relation

QueryResponseWriter

•  A QueryResponseWriter is a Solr plugin that defines the response format for any request
•  All of the requests we have made so far are formatted with the XMLResponseWriter
•  Other formats can be applied by appending wt=format to the search string, like this:
   http://localhost:8983/solr/select?q=date:[1998%20TO%201999]&fl=id,name,date,headline&rows=200&wt=xslt&tr=example.xsl

Page 12: Solr search engine with multiple table relation

Acknowledgements

•  Search smarter with Apache Solr, Part 1: Essential features and the Solr schema
   – http://www.ibm.com/developerworks/java/library/j-solr1/
•  Solr Tutorial from Lucid Imagination
   – http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Solr-Tutorial
•  Solr Wiki
   – http://wiki.apache.org/solr/

Page 13: Solr search engine with multiple table relation

Powered by Lucene

•  Wikipedia
•  Internet Archive
•  LinkedIn
•  monster.com

Page 14: Solr search engine with multiple table relation

Indexing

Documents:
   0  Little Red Riding Hood
   1  Robin Hood
   2  Little Women

Inverted index (term → IDs of the documents that contain it):
   aardvark
   hood      0 1
   little    0 2
   red       0
   riding    0
   robin     1
   women     2
   zoo
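The inverted index on this slide can be reproduced with a few lines of code. This is a toy sketch, assuming a trivial "analyzer" that only lowercases and splits on whitespace; Lucene's real analysis chain and index format are far more elaborate. The function names are illustrative.

```python
from collections import defaultdict

# The three example titles from the slide, keyed by document ID.
docs = {0: "Little Red Riding Hood", 1: "Robin Hood", 2: "Little Women"}


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # Naive analysis: lowercase and split on whitespace.
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


def search_all(index, *terms):
    """Return the IDs of documents that contain every given term (AND)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])

print(search_all(index, "little", "hood"))
```

A search only has to look up each query term and combine the posting lists, which is why lookups stay fast even when the collection is large.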

Page 15: Solr search engine with multiple table relation

Search

•  Core parameters
   – qt: query type (request handler)
   – wt: writer type (response writer)
•  Common parameters
   – q
   – sort
   – start
   – rows
   – fq: filters
   – fl: return fields

Page 16: Solr search engine with multiple table relation

Search Syntax

•  field:term (*:* returns everything)
•  A score is generated at query time. The value itself has no absolute meaning; scores are only meaningful relative to each other (a scale)
•  fq filters the query results based on a supplied condition
•  wt is the response format of the results (xml, json, etc.)
•  qt is the request handler used to process the request (default is "standard")
•  fl is the list of fields to return (each field must be stored)
•  q is the query string
•  You can specify the start offset (start) and the maximum number of rows to return (rows)

Page 18: Solr search engine with multiple table relation

What is Lucene

•  High-performance, scalable, full-text search library
•  Focus: indexing + searching documents
   – A "document" is just a list of name+value pairs
•  No crawlers or document parsing
•  Flexible text analysis (tokenizers + token filters)
•  100% Java, no dependencies, no config files

Page 19: Solr search engine with multiple table relation

What is SOLR

•  Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine. Solr 4 adds NoSQL features.

Page 21: Solr search engine with multiple table relation

Solr Features

•  Advanced full-text search capabilities
•  Optimized for high-volume web traffic
•  Standards-based open interfaces: XML, JSON and HTTP
•  Comprehensive HTML administration interfaces
•  Linearly scalable, auto index replication, auto failover and recovery
•  Near real-time indexing
•  Flexible and adaptable with XML configuration
•  Extensible plugin architecture

Page 22: Solr search engine with multiple table relation

Indexing Data

HTTP POST to http://localhost:8983/solr/update

<add><doc>
  <field name="id">05991</field>
  <field name="name">Peter Parker</field>
  <field name="supername">Spider-Man</field>
  <field name="category">superhero</field>
  <field name="powers">agility</field>
  <field name="powers">spider-sense</field>
</doc></add>
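When documents come from application data rather than hand-written XML, the `<add>` message can be generated programmatically so that special characters are escaped correctly. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `build_add_xml` is an assumption made for illustration, and the field values are the ones from the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build an <add><doc>…</doc></add> message.

    fields is a list of (name, value) pairs; repeat a name to produce
    a multi-valued field, as with "powers" below.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


payload = build_add_xml([
    ("id", "05991"),
    ("name", "Peter Parker"),
    ("supername", "Spider-Man"),
    ("category", "superhero"),
    ("powers", "agility"),
    ("powers", "spider-sense"),
])
print(payload)
```

The resulting string is what would be sent as the body of the HTTP POST to the update URL.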

Page 23: Solr search engine with multiple table relation

Indexing CSV data

Guru, Saurabh, Vivek, Siddhartha | Lubaib , Venugopal|superhero, php|bangalore|benguluru, Magneto, Mumbai|Bombay, GB|gigabytes, cm|centimeter, Purvankara

http://localhost:8983/solr/update/csv?
   fieldnames=supername,Vivek,Magento,gb
   &separator=,
   &f.name.split=true&f.name.separator=|
   &f.powers.split=true&f.powers.separator=|

Page 24: Solr search engine with multiple table relation

Data upload methods

URL=http://localhost:8983/solr/update/csv

•  HTTP POST body (curl, HttpClient, etc.)
   curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
•  Multi-part file upload (browsers)
•  Request parameter
   ?stream.body='Cyclops, Scott Summers,…'
•  Streaming from URL (must enable)
   ?stream.url=file://data/info.csv
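The first upload method, an HTTP POST with the CSV in the request body, can be sketched without curl as follows. This uses only `urllib.request` from the Python standard library. The CSV header row (`supername,name`) is an assumed example, and the actual send is left commented out because it needs a running Solr instance at the URL from the slide.

```python
import urllib.request

# Update URL taken from the slide.
URL = "http://localhost:8983/solr/update/csv"


def build_csv_post(csv_text):
    """Prepare (but do not send) a POST request carrying CSV data."""
    return urllib.request.Request(
        URL,
        data=csv_text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


# Assumed sample data: a header row followed by one record.
request = build_csv_post("supername,name\nCyclops,Scott Summers\n")
print(request.get_method(), request.full_url)

# To actually send it (requires a running Solr server):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```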

Page 25: Solr search engine with multiple table relation

Indexing with SolrJ

// Solr's Java Client API… remote or embedded/local!
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("player", "Dravid");
doc.addField("name", "Kumar Rahul");
doc.addField("category", "superhero");
server.add(doc);
server.commit();

Page 26: Solr search engine with multiple table relation

Deleting Documents

•  Delete by ID (most efficient)
   <delete>
     <id>05591</id>
     <id>32552</id>
   </delete>
•  Delete by query
   <delete>
     <query>category:supervillain</query>
   </delete>

Page 27: Solr search engine with multiple table relation

Commit

•  <commit/> makes changes visible
   – Triggers static cache warming in solrconfig.xml
   – Triggers autowarming from existing caches (default on)
•  <optimize/> same as commit, but also merges all index segments for faster searching

Lucene Index Segments
   _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del
   _1.fnm _1.fdt _1.fdx […]

Page 28: Solr search engine with multiple table relation

Searching

http://localhost:8983/solr/select?q=powers:agility&start=0&rows=2&fl=supername,category

<response>
  <result numFound="427" start="0">
    <doc>
      <str name="supername">Spider-Man</str>
      <str name="category">superhero</str>
    </doc>
    <doc>
      <str name="supername">Mystique</str>
      <str name="category">supervillain</str>
    </doc>
  </result>
</response>
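A client has to turn this XML response into native data structures. The following is a small sketch with Python's standard `xml.etree.ElementTree`, run against a copy of the response shown on the slide. It assumes every field is a simple `<str>` element, as in this example; real responses also contain other element types (numbers, arrays, a response header) that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Copy of the example response from the slide.
SAMPLE = """<response>
  <result numFound="427" start="0">
    <doc>
      <str name="supername">Spider-Man</str>
      <str name="category">superhero</str>
    </doc>
    <doc>
      <str name="supername">Mystique</str>
      <str name="category">supervillain</str>
    </doc>
  </result>
</response>"""


def parse_response(xml_text):
    """Return (total number of hits, list of documents as dicts)."""
    result = ET.fromstring(xml_text).find("result")
    docs = [
        {field.get("name"): field.text for field in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


total, docs = parse_response(SAMPLE)
print(total)
for doc in docs:
    print(doc["supername"], "/", doc["category"])
```

Note that `numFound` is the total number of matches, while only `rows` documents are returned per request.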

Page 29: Solr search engine with multiple table relation

Response Format

•  Add &wt=json for a JSON-formatted response
   {"response": {"numFound": 427, "start": 0,
     "docs": [
       {"supername": "Spider-Man", "category": "superhero"},
       {"supername": " Magento", "category": " Purvankara"}
     ]
   }}
•  Also Python, Ruby, PHP, SerializedPHP, XSLT

Page 30: Solr search engine with multiple table relation

Scoring

•  Query results are sorted by score, descending
•  VSM: Vector Space Model
•  tf: term frequency, the number of matching terms in the field
•  lengthNorm: number of tokens in the field
•  idf: inverse document frequency
•  coord: coordination factor, the number of matching terms
•  document boost
•  query clause boost

http://lucene.apache.org/java/docs/scoring.html

Page 31: Solr search engine with multiple table relation

Explain

http://solr/select?q=super fast&indent=on&debugQuery=on

<lst name="debug">
 <lst name="explain">
  <str name="id=Flash,internal_docid=6">
   0.16389132 = (MATCH) product of:
     0.32778263 = (MATCH) sum of:
       0.32778263 = (MATCH) weight(text:fast in 6), product of:
         0.5012072 = queryWeight(text:fast), product of:
           2.466337 = idf(docFreq=5)
           0.20321926 = queryNorm
         0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
           1.4142135 = tf(termFreq(text:fast)=2)
           2.466337 = idf(docFreq=5)
           0.1875 = fieldNorm(field=fast, doc=6)
     0.5 = coord(1/2)
  </str>
  <str name="id=Superman,internal_docid=7">
   0.1365761 = (MATCH) product of:
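The explain output is a tree of products and sums, so the final score can be recomputed by hand. The sketch below multiplies out the numbers shown for the document "Flash". The only assumption beyond the slide is that tf is the square root of the term frequency, which matches the value 1.4142135 printed for termFreq=2.

```python
import math

# Values copied from the explain output on the slide.
idf = 2.466337            # idf(docFreq=5)
query_norm = 0.20321926   # queryNorm
field_norm = 0.1875       # fieldNorm
coord = 1 / 2             # coord(1/2): one of two query terms matched

# Assumption: tf = sqrt(termFreq); sqrt(2) matches the 1.4142135 shown.
tf = math.sqrt(2)

query_weight = idf * query_norm          # queryWeight(text:fast)
field_weight = tf * idf * field_norm     # fieldWeight(text:fast in 6)
term_score = query_weight * field_weight # weight(text:fast in 6)
score = coord * term_score               # final score for the document

print(f"queryWeight = {query_weight:.7f}")
print(f"fieldWeight = {field_weight:.8f}")
print(f"score       = {score:.8f}")
```

Reading explain output this way makes it easier to see which factor (idf, field length, or coord) is responsible when a document ranks unexpectedly high or low.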

Page 32: Solr search engine with multiple table relation

Lucene Query Syntax

1.  justice league
    •  Equiv: justice OR league
    •  QueryParser default operator is "OR"/optional
2.  +justice +league -name:aquaman
    •  Equiv: justice AND league NOT name:aquaman
3.  "justice league" -name:aquaman
4.  title:spiderman^10 description:spiderman
5.  description:"spiderman movie"~100

Page 33: Solr search engine with multiple table relation

Lucene Query Examples 2

1.  releaseDate:[2000 TO 2007]
2.  Wildcard searches: sup?r, su*r, super*
3.  spider~
    •  Fuzzy search: Levenshtein distance
    •  Optional minimum similarity: spider~0.7
4.  *:*
5.  (Superman AND "Lex Luthor") OR (+Batman +Joker)

Page 34: Solr search engine with multiple table relation

DisMax Query Syntax

•  Good for handling raw user queries
   – Balanced quotes for phrase query
   – '+' for required, '-' for prohibited
   – Separates query terms from query structure

http://solr/select?qt=dismax
   &q=super man                  // the user query
   &qf=title^3 subject^2 body    // fields to query
   &pf=title^2,body              // fields to do phrase queries
   &ps=100                       // slop for those phrase q's
   &tie=.1                       // multi-field match reward
   &mm=2                         // # of terms that should match
   &bf=popularity                // boost function

Page 35: Solr search engine with multiple table relation

DisMax Query Form

•  The expanded Lucene query:

   +( DisjunctionMaxQuery( title:super^3 | subject:super^2 | body:super)
      DisjunctionMaxQuery( title:man^3 | subject:man^2 | body:man)
    )
    DisjunctionMaxQuery(title:"super man"~100^2 body:"super man"~100)
    FunctionQuery(popularity)

•  Tip: set up your own request handler with default parameters to avoid clients having to specify them

Page 36: Solr search engine with multiple table relation

Function Query

•  Allows adding a function of a field value to the score
   – Boost recently added or popular documents
•  Current parser only supports function notation
•  Example: log(sum(popularity,1))
•  sum, product, div, log, sqrt, abs, pow
•  scale(x, target_min, target_max)
   – calculates min & max of x across all docs
•  map(x, min, max, target)
   – useful for dealing with defaults

Page 37: Solr search engine with multiple table relation

Boosted Query

•  Score is multiplied instead of added
   – New local params {!...} syntax added
   &q={!boost b=sqrt(popularity)}super man
•  Parameter dereferencing in local params
   &q={!boost b=$boost v=$userq}
   &boost=sqrt(popularity)
   &userq=super man

Page 38: Solr search engine with multiple table relation

Configuring Relevancy

<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Page 39: Solr search engine with multiple table relation

Field Definitions

•  Field attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
   <field name="id" type="string" indexed="true" stored="true"/>
   <field name="sku" type="textTight" indexed="true" stored="true"/>
   <field name="name" type="text" indexed="true" stored="true"/>
   <field name="inStock" type="boolean" indexed="true" stored="false"/>
   <field name="price" type="sfloat" indexed="true" stored="false"/>
   <field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
•  Dynamic fields
   <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
   <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
   <dynamicField name="*_t" type="text" indexed="true" stored="true"/>

Page 40: Solr search engine with multiple table relation

copyField

•  Copies one field to another at index time
•  Use case #1: Analyze the same field in different ways
   – copy into a field with a different analyzer
   – boost exact-case, exact-punctuation matches
   – language translations, thesaurus, soundex
   <field name="title" type="text"/>
   <field name="title_exact" type="text_exact" stored="false"/>
   <copyField source="title" dest="title_exact"/>
•  Use case #2: Index multiple fields into a single searchable field

Page 44: Solr search engine with multiple table relation

Facet Query

http://solr/select?q=foo&wt=json&indent=on
   &facet=true&facet.field=cat
   &facet.query=price:[0 TO 100]
   &facet.query=manu:IBM

{"response":{"numFound":26,"start":0,"docs":[…]},
 "facet_counts":{
   "facet_queries":{
     "price:[0 TO 100]":6,
     "manu:IBM":2},
   "facet_fields":{
     "cat":[ "electronics",14, "memory",3, "card",2, "connector",2]
 }}}
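In the JSON response, each entry under facet_fields is a flat list that alternates value and count. A client usually wants pairs or a mapping instead. Here is a minimal sketch of that conversion using Python's standard `json` module; the sample mirrors the response on the slide, with the elided `docs` list replaced by an empty list so that the text is valid JSON, and the helper name is illustrative.

```python
import json

# Sample modeled on the slide; "docs" is emptied because the slide elides it.
SAMPLE = """
{"response": {"numFound": 26, "start": 0, "docs": []},
 "facet_counts": {
   "facet_queries": {"price:[0 TO 100]": 6, "manu:IBM": 2},
   "facet_fields": {
     "cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2]
   }}}
"""


def facet_field_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


response = json.loads(SAMPLE)
categories = facet_field_counts(response, "cat")
for value, count in categories.items():
    print(f"{value}: {count}")

# facet.query results are already a mapping of query string to count.
print(response["facet_counts"]["facet_queries"]["manu:IBM"])
```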

Page 45: Solr search engine with multiple table relation

Filters

•  Filters are restrictions in addition to the query
•  Use in faceting to narrow the results
•  Filters are cached separately for speed

1. User queries for memory; the query sent to Solr is
   &q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
   &q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
   &q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2&…
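The drill-down sequence above amounts to appending one more fq parameter each time the user picks a facet value. The following sketch shows this with the Python standard library; the helper name `filtered_query` is an assumption for illustration, and the field names are the ones from the slide.

```python
from urllib.parse import urlencode


def filtered_query(q, filters):
    """Build a query string with one fq parameter per active filter."""
    params = [("q", q)]
    params += [("fq", f) for f in filters]  # fq may be repeated
    params.append(("facet", "true"))
    return urlencode(params)


active_filters = ["inStock:true"]
print(filtered_query("memory", active_filters))   # step 1: initial query

active_filters.append("size:1GB")                 # step 2: user picks a size
print(filtered_query("memory", active_filters))

active_filters.append("type:DDR2")                # step 3: user picks a type
print(filtered_query("memory", active_filters))
```

Keeping each condition in its own fq parameter, rather than folding everything into q, lets Solr cache and reuse each filter independently.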

Page 46: Solr search engine with multiple table relation

Highlighting

http://solr/select?q=lcd&wt=json&indent=on
   &hl=true&hl.fl=features

{"response":{"numFound":5,"start":0,"docs":[
    {"id":"3007WFP", "price":899.95}, …]},
 "highlighting":{
   "3007WFP":{
     "features":["30\" TFT active matrix <em>LCD</em>, 2560 x 1600"]},
   "VA902B":{
     "features":["19\" TFT active matrix <em>LCD</em>, 8ms response time, 1280 x 1024 native resolution"]}}}

Page 47: Solr search engine with multiple table relation

MoreLikeThis

•  Selects documents that are "similar" to the documents matching the main query.

&q=id:6H500F0
&mlt=true&mlt.fl=name,cat,features

"moreLikeThis":{
  "6H500F0":{"numFound":5,"start":0,
    "docs": [
      {"name":"Apple 60 GB iPod with Video Playback Black",
       "price":399.0,
       "inStock":true,
       "popularity":10,
       […]
      }, […]
    ]
  […]

Page 48: Solr search engine with multiple table relation

High Availability

[Architecture diagram. Components: Load Balancer, Appservers, Solr Searchers, Solr Master, DB, Updater, admin terminal. Labeled flows: HTTP search requests, dynamic HTML generation, updates, admin queries, index replication.]

Page 49: Solr search engine with multiple table relation

Resources

•  WWW
   – http://lucene.apache.org/solr
   – http://lucene.apache.org/solr/tutorial.html
   – http://wiki.apache.org/solr/
•  Mailing lists
   – [email protected]
   – [email protected]