Scaling search to a million pages with Solr, Python, and Django
DESCRIPTION
A talk given to DJUGL on 26th July 2010, describing and introducing Solr, and discussing how we use it at Timetric to drive navigation across over a million dataseries.
TRANSCRIPT
![Page 2: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/2.jpg)
1,079,446!!!
![Page 3: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/3.jpg)
![Page 4: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/4.jpg)
Data store
Big Bad Web
Django
![Page 5: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/5.jpg)
Data store
Big Bad Web
Django
![Page 6: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/6.jpg)
Key-Value Store
Filesystem, Berkeley DB
MySQL
} unstructured
structured-
![Page 7: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/7.jpg)
Foreign Key (RDBMS)
SQLite, MySQL, Postgres, Oracle, ...
related content through JOINs
over structured data
![Page 8: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/8.jpg)
Search Engines
Solr (Lucene), Xapian, (Whoosh)
Denormalized, Inverted Index
over unstructured/semi-structured data
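As a rough illustration of what "inverted index" means here, the following is a minimal sketch in plain Python, not how Lucene stores its index. The sample titles and the names `build_inverted_index` and `search` are invented for this example. Each term maps to the set of documents that contain it, so a query becomes a set intersection rather than a scan over every document.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample titles, invented for illustration only.
DOCS = {
    1: "UK unemployment rate",
    2: "Belgium unemployment rate by gender",
    3: "UK crime statistics",
}
INDEX = build_inverted_index(DOCS)

print(search(INDEX, "unemployment rate"))
print(search(INDEX, "uk unemployment"))
```

A real engine adds analysis (tokenizing, stemming), relevance scoring, and on-disk storage on top of this basic term-to-documents mapping.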
![Page 9: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/9.jpg)
Other routes to full-text search
http://www.postgresql.org/docs/8.4/static/textsearch.html
http://code.google.com/p/djangosearch/
http://www.sphinxsearch.com/
![Page 10: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/10.jpg)
Solr: HTTP interface to Lucene
Lucene written by Doug Cutting (Hadoop), first release 2001.
Solr in-house CNET project, open-sourced in 2006
Solr + Lucene merged in March 2010
Solr 1.4, Lucene 3.0 released November 2009
Next version - 1.5/3.1/4.0 - not for production use yet.
![Page 11: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/11.jpg)
Solr Index
composed of Documents
ALL DOCUMENTS HAVE THE SAME STRUCTURE
RDBMS Table
composed of Rows
![Page 12: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/12.jpg)
• Optional columns
• Denormalized data
Book: Title, ISBN, Author (FK Person), Contributor (M2M Person)
Magazine: Title, ISSN, Publication Frequency, Editor (FK Person)
Person: First name, Last name
Document: Identifier, Entity type, Title, Pub. Frequency, Associated name, Default Search
Field options: uniqueKey, required, multiValued, default
copyField: Title, Associated Name → Default Search
![Page 13: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/13.jpg)
There is no update, only overwrite!!!
Solar Enterprise
Search Server
Book
Identifier
Pub. Freq.
David Smiley, Eric Pugh
Solr 1.4 Enterprise
Search Server
Book
Identifier
Pub. Freq.
David Smiley, Eric Pugh
Solr can't overwrite without a uniqueKey
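The overwrite behaviour on this slide can be sketched as follows. The field names (`identifier`, `entity_type`, `title`, `associated_name`), the value `book-1`, and the `ToyIndex` class are assumptions made up for illustration. `add_message` builds the standard Solr XML `<add>` update message. `ToyIndex` is only an in-memory stand-in that mimics how a second add with the same uniqueKey replaces the whole document.

```python
import xml.etree.ElementTree as ET


def add_message(doc):
    """Serialize one document as a Solr XML <add> update message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # A multiValued field is sent as repeated <field> elements.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


class ToyIndex:
    """In-memory stand-in for an index keyed on its uniqueKey field."""

    def __init__(self, unique_key):
        self.unique_key = unique_key
        self.docs = {}

    def add(self, doc):
        # An add whose key already exists replaces the whole document;
        # there is no per-field update.
        self.docs[doc[self.unique_key]] = dict(doc)


# Hypothetical field names and identifier, invented for this sketch.
first = {
    "identifier": "book-1",
    "entity_type": "Book",
    "title": "Solar Enterprise Search Server",
    "associated_name": ["David Smiley", "Eric Pugh"],
}
corrected = dict(first, title="Solr 1.4 Enterprise Search Server")

index = ToyIndex(unique_key="identifier")
index.add(first)
index.add(corrected)  # same identifier, so this overwrites the first add

message = add_message(corrected)
print(message)
```

Because the replacement is wholesale, the corrected document has to carry every field again, not just the title that changed.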
![Page 14: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/14.jpg)
<field name="title" type="text" indexed="true" stored="true" required="true" multiValued="false"/>
Schema design
What do you want to search on?
What do you want to do with results?
╳query
text, int, long, float, double, date
![Page 15: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/15.jpg)
Solr
Ingest (HTTP): <xml>, csv
Output (HTTP): <xml>, {json}, exec. python
Query: URL-escaped Lucene query syntax (yuck)
![Page 16: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/16.jpg)
GET http://localhost:8983/solr/select/?q=searchterm
GET http://localhost:8983/solr/current/select/?fq=private
%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR
+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index
%3Atrue%5E100%29
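To show where URLs like these come from, here is a minimal sketch that builds a select URL with only the Python standard library. The helper name `select_url` is hypothetical, and the query is a shortened version of the one on the slide. `urlencode` does the escaping that makes the raw URL so hard to read (`:` becomes `%3A`, `"` becomes `%22`, spaces become `+`).

```python
from urllib.parse import urlencode

# Base URL as shown on the slide; adjust for your own Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/select/"


def select_url(query, params=None):
    """Build a Solr select URL from a Lucene query string and extra parameters."""
    pairs = [("q", query)]
    pairs.extend((params or {}).items())
    return SOLR_SELECT + "?" + urlencode(pairs)


url = select_url(
    'tags:"united kingdom" AND NOT is_mapreduce:true',
    {
        "fq": "private:false",
        "facet": "true",
        "facet.field": "tags",
        "rows": 20,
        "start": 0,
    },
)
print(url)
```

This only handles the escaping. Composing the Lucene query string itself by hand is still error-prone.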
![Page 17: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/17.jpg)
Need ORM equivalent (OIM?)
http://haystacksearch.org/
(cleaves close to Django, not schema-driven)
Sunburnt:
http://timetric.com/about/opensource/#sunburnt
http://github.com/tow/sunburnt
![Page 18: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/18.jpg)
GET http://localhost:8983/solr/current/select/?fq=private
%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A
%22united+kingdom%22%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united
+kingdom%22+AND+is_index%3Atrue%5E100%29
solr.query(tags="ons:dataseries-fullid=YBUKQA")\
    .query(tags="united kingdom")\
    .filter(private=False)\
    .boost_relevancy(100, is_index=True)\
    .facet_by("tags", mincount=1, limit=20)\
    .paginate(rows=20)
![Page 19: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/19.jpg)
![Page 20: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/20.jpg)
Faceting
MoreLikeThis
Highlighting
Pagination
Sorting
http://wiki.apache.org/solr/FrontPage
http://packtpub.com/solr-1-4-enterprise-search-server
![Page 21: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/21.jpg)
Scaling to a million pages ...
- talk to the Guardian (Content API)
Decouple read/write
Re-indexing/optimizing strategies
FieldType/Analyzer/Tokenizer tweaks
![Page 22: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/22.jpg)
Decouple read/write
Separate processes - many readers, single write pipeline. Beware multiple writers!
Remember standard DB practice - write to master, read from slave.
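One way to get "many readers, single write pipeline" inside a Python process is to funnel every write through a queue drained by one thread. This is a minimal sketch under that assumption, not a description of Timetric's setup. `SingleWriter` and `write_batch` are hypothetical names, and `write_batch` stands in for whatever actually posts documents to Solr.

```python
import queue
import threading

_STOP = object()  # sentinel telling the writer thread to finish


class SingleWriter:
    """Funnel all index writes through one thread, in batches."""

    def __init__(self, write_batch, batch_size=100):
        self._write_batch = write_batch
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, doc):
        """Safe to call from any thread; never touches the index directly."""
        self._queue.put(doc)

    def close(self):
        """Flush any remaining documents and stop the writer thread."""
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._write_batch(batch)
                batch = []
        if batch:
            self._write_batch(batch)


# Stand-in for the real write: record each batch instead of posting it.
written = []
writer = SingleWriter(written.append, batch_size=2)
for n in range(5):
    writer.submit({"identifier": n})
writer.close()
print(written)
```

Request handlers only ever call `submit`, so there is exactly one writer no matter how many web processes or threads produce documents.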
![Page 23: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/23.jpg)
Index: Add documents → Commit → Optimize → Fast Index → Warm up facet cache
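The sequence on this slide can be sketched as a small function. `reindex`, `post_update`, and `run_query` are hypothetical names. The two callables stand in for HTTP calls to Solr's update and select handlers, while `<commit/>` and `<optimize/>` are the standard Solr XML update messages. The `"<add>...</add>"` strings are placeholders, not real documents.

```python
def reindex(post_update, run_query, add_messages, warmup_queries):
    """Add everything, commit once, optimize, then warm the caches."""
    for message in add_messages:
        post_update(message)    # added documents are not searchable yet
    post_update("<commit/>")    # make the new documents visible to searches
    post_update("<optimize/>")  # merge index segments for faster reads
    for query in warmup_queries:
        run_query(query)        # e.g. facet queries, to fill the caches


# Recording stand-ins so the sketch runs without a Solr server.
log = []
reindex(
    post_update=lambda body: log.append(("update", body)),
    run_query=lambda params: log.append(("query", params)),
    add_messages=["<add>...</add>", "<add>...</add>"],
    warmup_queries=[{"q": "*:*", "facet": "true", "facet.field": "tags"}],
)
for entry in log:
    print(entry)
```

Committing once at the end, rather than after every add, keeps the expensive steps (commit, optimize, cache warming) off the per-document path.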
![Page 24: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/24.jpg)
![Page 25: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/25.jpg)
![Page 26: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/26.jpg)
"UK crime: Betting, gaming and lotteries (year ending 5th April)"
Betting
Tokenizer
Analyzer (Porter stemmer)
bet
Belgium, Unemployment rate by gender, Total (BE,T)
BE,T
Tokenizer (whitespace)
Tokenizer (character filter)
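The two analysis problems on this slide can be imitated in plain Python. This is only a sketch of the idea, not Solr's analysis chain: the function names are invented, and `crude_stem` is a deliberately crude stand-in, not the Porter algorithm. The first pipeline splits on punctuation and stems, so "Betting" is indexed as "bet". The second splits on whitespace and strips only surrounding punctuation, so a code such as "BE,T" survives as one token.

```python
import re


def word_tokenize(text):
    """Split on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def crude_stem(word):
    """Crude stand-in for a stemmer (NOT Porter): strip '-ing', undouble."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    return word


def analyze_text(text):
    """Word tokenizer, lowercase, then stem: suits natural-language titles."""
    return [crude_stem(token.lower()) for token in word_tokenize(text)]


def analyze_codes(text, strip_chars="(),.:;"):
    """Whitespace tokenizer, then strip punctuation from the token edges only."""
    tokens = (token.strip(strip_chars) for token in text.split())
    return [token for token in tokens if token]


crime = "UK crime: Betting, gaming and lotteries (year ending 5th April)"
belgium = "Belgium, Unemployment rate by gender, Total (BE,T)"

print(analyze_text(crime))
print(analyze_codes(belgium))
print(word_tokenize(belgium))
```

Running the word tokenizer over the Belgian title splits the code into separate "BE" and "T" tokens, which is why fields holding codes may want a different field type from fields holding prose.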
![Page 27: Scaling search to a million pages with Solr, Python, and Django](https://reader034.vdocuments.site/reader034/viewer/2022051314/54b794134a795953368b475a/html5/thumbnails/27.jpg)
In the small
Understand Solr schemas - build one for your data.
how do you want to query?
how do you want to show results?
In the large
Understand Solr architecture - build around your data-flow.
how/when do you want to read/write?
what shape/characteristics does your corpus have?