Solr search engine with multiple table relations
DESCRIPTION
Learn how to use the Solr search engine and integrate it into an application, for example one built on PHP/MySQL. The deck also introduces how to handle data from multiple tables in Solr.
TRANSCRIPT
Powerful Full-Text Search with Solr
Jay Bharat
Carmatec IT Solution, Bangalore, 1 July 2013
Implementing search with free software
An introduction to Solr
What is Solr?
• Solr is an open source enterprise search server based on the Lucene Java search library.
• Solr runs in a Java servlet container such as Tomcat or Jetty
• Solr is free software and a project of the Apache Software Foundation
• Solr is a sub-project of Lucene and can be found at http://lucene.apache.org/solr/
Key Features
• Advanced Full-Text search
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces – XML and HTTP
• Comprehensive HTML Administration Interface
• Server statistics exposed over JMX for monitoring
• Scalability through efficient replication
• Flexibility with XML configuration and Plugins
• Push vs Crawl indexing method
Solr Clients
• Solr can be integrated with, among others:
  – Ruby
  – PHP
  – Java
  – Python
  – JSON
  – Forrest/Cocoon
  – C# or Deveel Solr Client or solrnet
  – Coldfusion
  – Drupal or apacheSolr project for Drupal
Indexing
• Push vs Crawl
• Schema.xml
• Add documents
• HTML interface
  – Update
  – Delete
  – Commit
• DataImportHandler – For searching databases
Searching
• Full text search
  http://localhost:8983/solr/select?q=Iraq
• Search only within a field
  http://localhost:8983/solr/select?q=category:news
• Control which fields are displayed in the result
  http://localhost:8983/solr/select?q=video&fl=id,category
• Provide ranges to fields
  http://localhost:8983/solr/select?q=price:[0 TO 400]&fl=id,name,price
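The search URLs on this slide can also be assembled programmatically so that special characters such as ":" and "[" are percent-encoded. The following is a minimal sketch, assuming the base URL from the slide; the helper name select_url is hypothetical and not part of any Solr client.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Base URL taken from the slide examples; adjust for your own server.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(q, **params):
    """Return a /select URL for query string q plus optional parameters such as fl."""
    query = {"q": q}
    query.update(params)
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return SOLR_SELECT + "?" + urlencode(query)


full_text = select_url("Iraq")
in_field = select_url("category:news")
with_fields = select_url("video", fl="id,category")
price_range = select_url("price:[0 TO 400]", fl="id,name,price")

print(price_range)
```

Decoding the generated query string gives back the original q and fl values, which is a quick way to check that nothing was mangled by the encoding.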
More Searching
• Faceting information
  http://localhost:8983/solr/select?q=news&fl=id,description&facet=true&facet.field=category
• More like this (MLT)
  http://localhost:8983/solr/select?q=Iraq&mlt=true&mlt.fl=headline&mlt.mindf=1&mlt.mintf=1&fl=id,score&rows=100
• More information on how this works and the options available can be found at http://wiki.apache.org/solr/MoreLikeThis
QueryResponseWriter
• A QueryResponseWriter is a Solr plugin that defines the response format for any request
• All of the requests we have made so far are formatted with the XMLResponseWriter
• Other formats can be applied by appending wt=format to the search string like this:
  http://localhost:8983/solr/select?q=date:[1998%20TO%201999]&fl=id,name,date,headline&rows=200&wt=xslt&tr=example.xsl
Acknowledgements
• Search smarter with Apache Solr, Part 1: Essential features and the Solr schema – http://www.ibm.com/developerworks/java/library/j-solr1/
• Solr Tutorial from Lucid Imagination – http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Solr-Tutorial
• Solr Wiki – http://wiki.apache.org/solr/
Powered by Lucene
• Wikipedia
• Internet Archive
• LinkedIn
• monster.com
Indexing
Documents:
  0  Little Red Riding Hood
  1  Robin Hood
  2  Little Women
Inverted index (term: IDs of the documents that contain it):
  aardvark: (none)
  hood: 0, 1
  little: 0, 2
  red: 0
  riding: 0
  robin: 1
  women: 2
  zoo: (none)
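The inverted index on this slide maps each term to the IDs of the documents that contain it. A minimal sketch of how such a structure could be built is shown below; it uses plain whitespace splitting and lowercasing, which is far simpler than Lucene's real analysis and storage, and the function name is hypothetical.

```python
from collections import defaultdict

# The three example documents from the slide; list position is the document ID.
docs = ["Little Red Riding Hood", "Robin Hood", "Little Women"]


def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # set() avoids listing a document twice when a term repeats in it.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, which is why searching an inverted index stays fast as the number of documents grows.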
Search
• Core parameters
  – qt: query type (request handler)
  – wt: writer type (response writer)
• Common parameters
  – q
  – sort
  – start
  – rows
  – fq: filters
  – fl: return fields
Search Syntax
• field:term (*:* returns everything)
• A score is generated at query time. The value itself has no absolute meaning; scores are relevant only relative to each other (a scale)
• fq filters the query based on a supplied condition
• wt is the return type of the results (xml, json, etc.)
• qt is the request handler used to process the request (default is "standard")
• fl is the list of fields to return (fields must be stored)
• q is the query string
• You can specify the start value and the maximum number of rows (start and rows)
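The start and rows parameters are what make paging possible. The sketch below is one way to translate a page number into those parameters; the function name page_params and its defaults (page size 10, wt=json) are assumptions for illustration only.

```python
from urllib.parse import urlencode


def page_params(q, page, page_size=10, fq=None, fl=None, wt="json"):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1:
        raise ValueError("page must be >= 1")
    params = [
        ("q", q),
        ("start", (page - 1) * page_size),  # offset of the first result to return
        ("rows", page_size),                # maximum number of results to return
        ("wt", wt),
    ]
    if fq:
        params.append(("fq", fq))
    if fl:
        params.append(("fl", ",".join(fl)))
    return params


params = page_params("category:news", page=3, page_size=20, fl=["id", "name"])
print(urlencode(params))
```

With a page size of 20, page 3 starts at offset 40, so the request asks for results 41 to 60.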
What is Lucene
• High performance, scalable, full-text search library
• Focus: Indexing + Searching Documents
  – "Document" is just a list of name+value pairs
• No crawlers or document parsing
• Flexible Text Analysis (tokenizers + token filters)
• 100% Java, no dependencies, no config files
What is SOLR
• Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable.[1] Solr is the most popular enterprise search engine.[2] Solr 4 adds NoSQL features.[3]
Solr Features
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Linearly scalable, auto index replication, auto failover and recovery
• Near Real-time indexing
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
Indexing Data
HTTP POST to http://localhost:8983/solr/update
<add><doc>
  <field name="id">05991</field>
  <field name="name">Peter Parker</field>
  <field name="supername">Spider-Man</field>
  <field name="category">superhero</field>
  <field name="powers">agility</field>
  <field name="powers">spider-sense</field>
</doc></add>
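The XML update message above can be generated from application data rather than written by hand. This is a minimal sketch using Python's standard library; the helper name build_add_xml is hypothetical, and actually sending the body would need an HTTP POST to a running Solr server, which is not shown here.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Build an <add><doc>...</doc></add> message from a dict of field values.

    A list value produces one <field> element per item, which is how
    multi-valued fields such as "powers" appear on the slide.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


xml_body = build_add_xml({
    "id": "05991",
    "name": "Peter Parker",
    "supername": "Spider-Man",
    "category": "superhero",
    "powers": ["agility", "spider-sense"],
})
print(xml_body)
```

Building the message with an XML library rather than string concatenation means characters such as "&" or "<" in field values are escaped automatically.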
Indexing CSV data
Guru, Saurabh, Vivek, Siddhartha | Lubaib , Venugopal|superhero, php|bangalore|benguluru, Magneto, Mumbai|Bombay, GB|gigabytes, cm|centimeter, Purvankara
http://localhost:8983/solr/update/csv?fieldnames=supername,Vivek,Magento,gb
  &separator=,
  &f.name.split=true&f.name.separator=|
  &f.powers.split=true&f.powers.separator=|
Data upload methods
URL=http://localhost:8983/solr/update/csv
• HTTP POST body (curl, HttpClient, etc)
  curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
• Multi-part file upload (browsers)
• Request parameter
  ?stream.body='Cyclops, Scott Summers,…'
• Streaming from URL (must enable)
  ?stream.url=file://data/info.csv
Indexing with SolrJ
// Solr's Java Client API… remote or embedded/local!
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("player", "Dravid");
doc.addField("name", "Kumar Rahul");
doc.addField("category", "superhero");
server.add(doc);
server.commit();
Deleting Documents
• Delete by Id, most efficient
  <delete>
    <id>05591</id>
    <id>32552</id>
  </delete>
• Delete by Query
  <delete>
    <query>category:supervillain</query>
  </delete>
Commit • <commit/> makes changes visible
– Triggers static cache warming in solrconfig.xml
– Triggers autowarming from existing caches default on
• <optimize/> same as commit, merges all index segments for faster searching
Lucene Index Segments
  _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del
  _1.fnm _1.fdt _1.fdx […]
Searching
http://localhost:8983/solr/select?q=powers:agility&start=0&rows=2&fl=supername,category
<response>
  <result numFound="427" start="0">
    <doc>
      <str name="supername">Spider-Man</str>
      <str name="category">superhero</str>
    </doc>
    <doc>
      <str name="supername">Mystique</str>
      <str name="category">supervillain</str>
    </doc>
  </result>
</response>
Response Format
• Add &wt=json for JSON formatted response
  {"response": {"numFound":427, "start":0,
    "docs": [
      {"supername":"Spider-Man", "category":"superhero"},
      {"supername":"Magento", "category":"Purvankara"}
    ]
  }}
• Also Python, Ruby, PHP, SerializedPHP, XSLT
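A JSON response is straightforward to consume from application code. The sketch below parses a hard-coded sample modeled on this deck's search example; it assumes the top-level key is "response", as in the facet and highlighting slides later in the deck, and the function name summarize is hypothetical.

```python
import json

# Hypothetical sample body, shaped like the deck's wt=json examples.
raw = """
{"response": {"numFound": 427, "start": 0, "docs": [
  {"supername": "Spider-Man", "category": "superhero"},
  {"supername": "Mystique", "category": "supervillain"}
]}}
"""


def summarize(raw_json):
    """Return (total hits, offset, list of supername values) from a Solr JSON body."""
    body = json.loads(raw_json)["response"]
    names = [doc.get("supername") for doc in body["docs"]]
    return body["numFound"], body["start"], names


total, start, names = summarize(raw)
print(total, start, names)
```

Note that numFound is the total number of matches, while docs holds only the rows requested for the current page.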
Scoring
• Query results are sorted by score descending
• VSM – Vector Space Model
• tf – term frequency: number of matching terms in field
• lengthNorm – number of tokens in field
• idf – inverse document frequency
• coord – coordination factor, number of matching terms
• document boost
• query clause boost
http://lucene.apache.org/java/docs/scoring.html
Explain
http://solr/select?q=super fast&indent=on&debugQuery=on
<lst name="debug">
 <lst name="explain">
  <str name="id=Flash,internal_docid=6">
0.16389132 = (MATCH) product of:
  0.32778263 = (MATCH) sum of:
    0.32778263 = (MATCH) weight(text:fast in 6), product of:
      0.5012072 = queryWeight(text:fast), product of:
        2.466337 = idf(docFreq=5)
        0.20321926 = queryNorm
      0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
        1.4142135 = tf(termFreq(text:fast)=2)
        2.466337 = idf(docFreq=5)
        0.1875 = fieldNorm(field=fast, doc=6)
  0.5 = coord(1/2)
  </str>
  <str name="id=Superman,internal_docid=7">
0.1365761 = (MATCH) product of:
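The numbers in the explain output for document "Flash" can be checked by hand. The sketch below recomputes them from the leaf values shown above, assuming Lucene's classic TF-IDF similarity, in which tf is the square root of the term frequency; idf, queryNorm and fieldNorm are copied from the output rather than derived.

```python
import math

# Leaf values copied from the explain output for document "Flash".
idf = 2.466337            # idf(docFreq=5)
query_norm = 0.20321926   # queryNorm
field_norm = 0.1875       # fieldNorm
term_freq = 2             # termFreq(text:fast)
coord = 1 / 2             # one of the two query terms matched

tf = math.sqrt(term_freq)              # classic similarity: tf = sqrt(freq)
query_weight = idf * query_norm        # "queryWeight ... product of"
field_weight = tf * idf * field_norm   # "fieldWeight ... product of"
score = query_weight * field_weight * coord

print(round(tf, 7), round(query_weight, 7), round(field_weight, 7), round(score, 7))
```

The results agree with the explain output to about six decimal places; the remaining difference comes from Lucene computing in single precision.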
Lucene Query Syntax
1. justice league
   • Equiv: justice OR league
   • QueryParser default operator is "OR"/optional
2. +justice +league -name:aquaman
   • Equiv: justice AND league NOT name:aquaman
3. "justice league" -name:aquaman
4. title:spiderman^10 description:spiderman
5. description:"spiderman movie"~100
Lucene Query Examples 2
1. releaseDate:[2000 TO 2007]
2. Wildcard searches: sup?r, su*r, super*
3. spider~
   • Fuzzy search: Levenshtein distance
   • Optional minimum similarity: spider~0.7
4. *:*
5. (Superman AND "Lex Luthor") OR (+Batman +Joker)
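Fuzzy search is based on Levenshtein distance, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal sketch of the textbook dynamic-programming algorithm; it is not Lucene's implementation, and it does not reproduce how Lucene converts a distance into the similarity value used in spider~0.7.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("spider", "spiner"))
```

A term such as "spiner" is one edit away from "spider", so a fuzzy query for spider~ is the kind of query that can match it.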
DisMax Query Syntax
• Good for handling raw user queries
  – Balanced quotes for phrase query
  – '+' for required, '-' for prohibited
  – Separates query terms from query structure
http://solr/select?qt=dismax
  &q=super man                 // the user query
  &qf=title^3 subject^2 body   // fields to query
  &pf=title^2,body             // fields to do phrase queries
  &ps=100                      // slop for those phrase q's
  &tie=.1                      // multi-field match reward
  &mm=2                        // # of terms that should match
  &bf=popularity               // boost function
DisMax Query Form
• The expanded Lucene Query:
  +( DisjunctionMaxQuery( title:super^3 | subject:super^2 | body:super)
     DisjunctionMaxQuery( title:man^3 | subject:man^2 | body:man) )
  DisjunctionMaxQuery( title:"super man"~100^2 body:"super man"~100 )
  FunctionQuery(popularity)
• Tip: set up your own request handler with default parameters to avoid clients having to specify them
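A DisjunctionMaxQuery scores a document by its best-matching field, plus the tie parameter times the scores of the other matching fields. The sketch below shows only that combination rule with made-up per-field scores; the function name and the sample numbers are hypothetical.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: the maximum, plus tie times the sum of the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores for one term matched in title, subject and body.
scores = [0.9, 0.5, 0.1]
print(dismax_score(scores, tie=0.1))
```

With tie=0 only the best field counts, and with tie=1 the result equals a plain sum over fields; small values such as the .1 on the previous slide sit in between and mildly reward matches in several fields.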
Function Query
• Allows adding function of field value to score
  – Boost recently added or popular documents
• Current parser only supports function notation
• Example: log(sum(popularity,1))
• sum, product, div, log, sqrt, abs, pow
• scale(x, target_min, target_max)
  – calculates min & max of x across all docs
• map(x, min, max, target)
  – useful for dealing with defaults
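To make the function names concrete, here is a small Python sketch of what log(sum(popularity,1)), map and scale compute. It assumes that log is base 10, as in Solr's function query documentation, and it models scale over an in-memory list instead of an index; the fq_ prefixed names are hypothetical helpers, not Solr APIs.

```python
import math


def fq_log(x):
    """log: base-10 logarithm."""
    return math.log10(x)


def fq_sum(*values):
    """sum: add all arguments."""
    return sum(values)


def fq_map(x, lo, hi, target):
    """map: replace any x inside [lo, hi] with target, otherwise keep x."""
    return target if lo <= x <= hi else x


def fq_scale(values, target_min, target_max):
    """scale: linearly rescale values so min/max land on target_min/target_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: constant input maps to target_min.
        return [float(target_min) for _ in values]
    span = target_max - target_min
    return [(v - lo) / (hi - lo) * span + target_min for v in values]


print(fq_log(fq_sum(99, 1)))         # log(sum(popularity,1)) for popularity=99
print(fq_map(0, 0, 0, 1))            # a missing/zero value replaced by a default of 1
print(fq_scale([0, 5, 10], 1, 2))
```

Adding 1 before taking the logarithm keeps the function defined for documents whose popularity is 0.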
Boosted Query
• Score is multiplied instead of added
  – New local params {!...} syntax added
  &q={!boost b=sqrt(popularity)}super man
• Parameter dereferencing in local params
  &q={!boost b=$boost v=$userq}
  &boost=sqrt(popularity)
  &userq=super man
Configuring Relevancy
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldType>
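An analyzer is a pipeline: a tokenizer splits the text, then each filter transforms the token stream in order. The sketch below imitates three of the five stages configured above (whitespace tokenizer, lowercase filter, stop filter); the synonym and stemming stages are left out, and the stop word set is a hypothetical stand-in for stopwords.txt.

```python
# Hypothetical stand-in for the contents of stopwords.txt.
STOPWORDS = {"a", "an", "the", "of", "and"}


def analyze(text, stopwords=STOPWORDS):
    """Run text through a simplified tokenizer + filter chain and return tokens."""
    tokens = text.split()                                # whitespace tokenizer
    tokens = [t.lower() for t in tokens]                 # lowercase filter
    tokens = [t for t in tokens if t not in stopwords]   # stop filter
    return tokens


print(analyze("The Return of the Jedi"))
```

Because the same chain is applied to both indexed text and query text, a query for "JEDI" and a document containing "Jedi" end up as the same token.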
Field Definitions
• Field Attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="sku" type="textTight" indexed="true" stored="true"/>
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="inStock" type="boolean" indexed="true" stored="false"/>
  <field name="price" type="sfloat" indexed="true" stored="false"/>
  <field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
• Dynamic Fields
  <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
copyField
• Copies one field to another at index time
• Usecase #1: Analyze same field different ways
  – copy into a field with a different analyzer
  – boost exact-case, exact-punctuation matches
  – language translations, thesaurus, soundex
  <field name="title" type="text"/>
  <field name="title_exact" type="text_exact" stored="false"/>
  <copyField source="title" dest="title_exact"/>
• Usecase #2: Index multiple fields into single searchable field
Facet Query
http://solr/select?q=foo&wt=json&indent=on
  &facet=true&facet.field=cat
  &facet.query=price:[0 TO 100]
  &facet.query=manu:IBM
{"response":{"numFound":26,"start":0,"docs":[…]},
 "facet_counts":{
   "facet_queries":{
     "price:[0 TO 100]":6,
     "manu:IBM":2},
   "facet_fields":{
     "cat":[ "electronics",14, "memory",3, "card",2, "connector",2]
 }}}
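In the JSON above, each facet field comes back as one flat list that alternates value and count. The sketch below turns that list into (value, count) pairs; the facet_counts data is copied from the slide and the function name facet_pairs is hypothetical.

```python
# facet_counts section copied from the slide's JSON response.
facet_counts = {
    "facet_queries": {"price:[0 TO 100]": 6, "manu:IBM": 2},
    "facet_fields": {
        "cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2],
    },
}


def facet_pairs(flat):
    """Turn [value, count, value, count, ...] into a list of (value, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))


cat = facet_pairs(facet_counts["facet_fields"]["cat"])
for value, count in cat:
    print(value, count)
```

Keeping the pairs in a list instead of a dictionary preserves the order in which the facet values were returned.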
Filters
• Filters are restrictions in addition to the query
• Use in faceting to narrow the results
• Filters are cached separately for speed
1. User queries for memory, query sent to solr is
   &q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
   &q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
   &q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2&…
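The drill-down steps above keep the same q and add one more fq each time the user narrows the results. A minimal sketch of building such query strings follows; the function name drill_down is hypothetical, and the field names are the ones used on the slide.

```python
from urllib.parse import urlencode, parse_qs


def drill_down(q, filters):
    """Return a query string with one fq parameter per selected filter."""
    params = [("q", q)]
    params += [("fq", f) for f in filters]  # repeated fq parameters are all applied
    params.append(("facet", "true"))
    return urlencode(params)


step1 = drill_down("memory", ["inStock:true"])
step2 = drill_down("memory", ["inStock:true", "size:1GB"])
step3 = drill_down("memory", ["inStock:true", "size:1GB", "type:DDR2"])
print(step3)
```

Sending each restriction as its own fq, instead of folding everything into q, is what lets each filter be cached separately as the slide describes.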
Highlighting
http://solr/select?q=lcd&wt=json&indent=on&hl=true&hl.fl=features
{"response":{"numFound":5,"start":0,"docs":[
    {"id":"3007WFP", "price":899.95}, …]},
 "highlighting":{
   "3007WFP":{ "features":["30\" TFT active matrix <em>LCD</em>, 2560 x 1600"]},
   "VA902B":{ "features":["19\" TFT active matrix <em>LCD</em>, 8ms response time, 1280 x 1024 native resolution"]}}}
MoreLikeThis
• Selects documents that are "similar" to the documents matching the main query.
  &q=id:6H500F0&mlt=true&mlt.fl=name,cat,features
  "moreLikeThis":{
    "6H500F0":{"numFound":5,"start":0,
      "docs": [
        {"name":"Apple 60 GB iPod with Video Playback Black", "price":399.0, "inStock":true, "popularity":10, […] },
        […] ]
  […]
High Availability
[Architecture diagram; labels: HTTP search requests, Load Balancer, Appservers (Dynamic HTML Generation), Solr Searchers, Index Replication, Solr Master, Updater, DB, updates, admin queries, admin terminal]
Resources
• WWW
  – http://lucene.apache.org/solr
  – http://lucene.apache.org/solr/tutorial.html
  – http://wiki.apache.org/solr/
• Mailing Lists
  – [email protected]
  – [email protected]