Implementing Solr in Online Media, Bo Raun
TRANSCRIPT
-
8/3/2019 Implementing Solr in Online Media Bo Raun
Suddenly. Solr
Implementing Solr in Online Media as an Alternative to Commercial Search Products
Bo Raun, Nordjyske Medier, DK
-
Introduction
About Nordjyske Medier
Our Search Challenges
Discovering Search with Solr
Making the Transition
Lessons Learned
Looking ahead
5/25/2010 Apache Lucene EuroCon
-
Nordjyske Medier
First publication in 1767
Danish media company for web, radio, TV and print media
News and adverts for Northern Denmark
About 600 employees: media, call centers, in-store TV, application development, etc.
Media reaches 75% of the local population daily, 90% weekly
Net & mobile: developing our online business
-
My background
20 years' experience
Analysis, design and programming
Pascal, Delphi, C, VB, C#
RDBMS and SQL
-
The environment and legacy
A long history working with RDBMS systems
IT strategy built on Microsoft technologies and other closed-source solutions, like Citrix, VMware, etc.
No tradition or experience using open source search
IT organization & development skills: .NET, Visual Studio, Windows, websites and web services (XML), app/integration development
MS SQL is the de facto storage for data
Main media sites: 183,000 users, March 2010
Yellow Pages users: 2,700 in August 2009, 10,900 in March 2010
-
Why search is important to Nordjyske Medier
Yellow (and White) Pages:
Major source of revenue in advertising; tying online display adverts to Yellow Pages directory listings
Add-on to advertising campaign packages
Editorial content:
Articles (in-depth, review, short syndicated news feeds, picture captions)
-
Yellow pages and White pages (www.folkogfag.dk)
Names, addresses, phone numbers, directions, vCards
Yellow Pages:
Daily updates, a few hundred bytes to 5 kB
Advertisers get boosting, links, keywords, profiles
500k documents
White Pages:
4 million documents, a few hundred bytes
-
Yellow pages design: keeping it simple
-
Editorial archive (www.nordjyske.dk)
Updates almost every minute during primetime
Changes pulled and indexed every 10 minutes
Offers different media versions for same story (default web)
Quite simple interface (for now)
-
Editorial search design
-
What were we using, and where did it fall short?
Original strategy, c. 2008:
Search is about crawling. Let's get a search appliance and have it crawl content that we want to repurpose/present, and we can work in the Yellow/White Pages data and every other website we want
Ten million terms in the index should be more than enough
Relevance out of the box: let the search appliance do its magic
Lean on the branding value of a well-known search technology
-
Appliance? Sounds good...
Up and running in no time
Excellent response times
Commercial support
Strong brand name
-
There were problems
Articles and yellow profile pages being missed
Slow updates: let's turn off crawling and post data by scripting
How do we control boosting of customer profiles?
Where's the test environment? Or the development environment?
The index is running full: put white pages in SQL and join them client side
Yellow pages polluted with text from news feeds and HTML layout
Strong brand name, and so what?
Struggling with core functionality past deadline
-
MS SQL: Strengths and Weaknesses
Microsoft FAST?
Server already well known and supported inside the organization
Integrates well with .NET and Visual Studio
Virtually unlimited document space
But Yellow/White Pages didn't perform well
Full-text search goes some of the way
Not that many options compared to alternatives
-
Demand > technology
Appliance + SQL Server wasn't doing it for search.
Alternatives? Maybe Open Source?
-
Prototyping: faster than teaching a 9-year-old Ju-Jitsu
In the course of 60 minutes on my laptop:
Found out Solr 1.3 was probably worth a try
Downloaded and installed
Posted example documents into the Solr index and started searching, figuring out operations
Feeding the family: late-night coding, feeding Solr
Repurposed the utility that posted XML into the appliance to stage XML for Solr
Built a harness to test Solr with random terms: it did quite well
-
The old SQL dog learns new tricks, or how I learned to stop worrying and love non-relational data structures
How do I make schema relations?
No direct row editing for debugging, no direct data manipulation statements
XML-driven query and retrieval: very appealing at first, reuse of existing scripts
Boosting documents instead of sorting: relevance takes care of the rest
Faceting instead of extra requests for counting results
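The faceting point can be pictured as a single request that returns both the hits and the per-value counts. The following is a minimal sketch in Java that only builds such a request URL. The host and core name are taken from the URLs shown later in the talk, while the field name `category` and the query text are hypothetical. `facet`, `facet.field`, `facet.mincount`, `rows` and `q` are standard Solr request parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FacetQuery {
    // Builds a Solr select URL that asks for hits and facet counts in one request.
    // facetField is a hypothetical field name; use one defined in your own schema.
    static String build(String coreUrl, String userQuery, String facetField) {
        String q = URLEncoder.encode(userQuery, StandardCharsets.UTF_8);
        String field = URLEncoder.encode(facetField, StandardCharsets.UTF_8);
        return coreUrl + "/select?q=" + q
                + "&rows=10"
                + "&facet=true"
                + "&facet.field=" + field
                + "&facet.mincount=1";
    }

    public static void main(String[] args) {
        // Prints the URL only; no request is sent.
        System.out.println(build("http://solr01:8983/solr/Nordjyske", "pizza aalborg", "category"));
    }
}
```

Without faceting, a client would typically issue one counting query per category value; here the counts arrive alongside the first page of results.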
-
Idea for sale, but who's buying?
Are we Java based now?
.NET library (SolrSharp) for the developers
Are we based on open source now? What about support?
Small adventures, e.g. MySQL, had been ill-introduced
Lucid Imagination and Findwise
Support contract (ExpertLink) provided a support scenario similar to commercial products
Support: for disaster scenarios and for stuck developers
Time for a sanity check
Assessment report, ensuring stability
Annual search health check
-
How Solr was implemented
Solution: leverage the example schema!
Platform: 32-bit Windows machine w/2GB RAM, as Solr (unfortunately) used very little capacity in proof of concept
Data handling done by old scripts
VMware machine snapshot backup
-
Results: It worked!
Customers boosted as promised
Excellent response times
Instant indexing
Full control over data (eventually disabled profile text indexing)
But now the editorial archive is having frequent timeouts: what to do about that?
-
Upgrading the editorial search
Configuration from scratch
Content directly from SQL
More challenges: ontology integration
More features wanted
-
Handy Solr features out of the box
Stemming (Danish supported, not dictionary perfect)
Example: specialist vs specialisterne (specialists)
-
Handy Solr features out of the box
Special characters
Example: Ålborg vs Aalborg
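One way to picture the Ålborg/Aalborg case is a character mapping applied to both the indexed text and the query, so that the two spellings end up as the same term. The sketch below is plain Java written for illustration, not Solr's own filter code, and the mapping is an assumption limited to the single rule "å" to "aa".

```java
import java.text.Normalizer;
import java.util.Locale;

class DanishFold {
    // Composes the text (so "A" + combining ring becomes one character),
    // lowercases it, and rewrites "å" as "aa" so both spellings share one key.
    static String fold(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return composed.toLowerCase(Locale.ROOT).replace("\u00e5", "aa");
    }

    public static void main(String[] args) {
        // "\u00c5lborg" is "Ålborg" written with a Unicode escape.
        System.out.println(fold("\u00c5lborg").equals(fold("Aalborg"))); // prints: true
    }
}
```

In Solr the equivalent step would be configured in the field's analysis chain rather than coded by hand.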
-
Data import handlers (DIH) vs posting XML: goodbye import scripts, hello XML-SQL and curl
Easy import SQL scripting
-
The usual data-juggling... The Solr way
Getting the data for a core: http://solr01:8983/solr/Nordjyske/dataimport?command=full-import
Incremental delta imports: http://solr01:8983/solr/Nordjyske/dataimport?command=delta-import
Oops: http://solr01:8983/solr/Nordjyske/dataimport?command=abort
Reload without restart: http://solr01:8983/solr/Nordjyske/dataimport?command=reload-config
Curl + scheduled tasks = saves hours of plumbing & programming
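The talk drives these commands with curl from scheduled tasks. For readers who would rather trigger them from code, here is a minimal sketch in Java using the standard `java.net.http` client. The class and method names are illustrative, and the host `solr01` and core `Nordjyske` are simply the ones shown in the URLs above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class DihCommand {
    // Builds the DIH URL for a command such as full-import, delta-import, abort or reload-config.
    static URI commandUri(String coreUrl, String command) {
        return URI.create(coreUrl + "/dataimport?command=" + command);
    }

    // Sends the command with a plain GET and returns the response body.
    static String run(String coreUrl, String command) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(commandUri(coreUrl, command)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        String coreUrl = "http://solr01:8983/solr/Nordjyske";
        // Only prints the URLs; call run(coreUrl, command) to actually contact a server.
        for (String command : new String[] {"full-import", "delta-import", "abort", "reload-config"}) {
            System.out.println(commandUri(coreUrl, command));
        }
    }
}
```

A scheduler would call `run(coreUrl, "delta-import")` at the chosen interval, which is the same thing the curl one-liner does.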
-
... Is this thing turned on...?
http://solr01:8983/solr/Nordjyske/dataimport?command=status
busy
0:2:24.925312621
Indexing completed. Added/Updated: 21 documents. Deleted 0 documents.
This response format is experimental. It is likely to change in the future.
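A script that waits for an import to finish needs to read the status value out of that response. The sketch below, in Java, assumes the response is XML in which the state appears as a `str` element with `name="status"`. The sample document in `main` is a simplified, hand-written stand-in, not a captured Solr response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DihStatus {
    // Returns the text of the first <str name="status"> element, or null if none is present.
    static String status(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList strs = doc.getElementsByTagName("str");
            for (int i = 0; i < strs.getLength(); i++) {
                Element e = (Element) strs.item(i);
                if ("status".equals(e.getAttribute("name"))) {
                    return e.getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("response is not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        // Simplified sample, assumed for illustration only.
        String sample = "<response><str name=\"status\">busy</str></response>";
        System.out.println(status(sample));
    }
}
```

Since the response itself warns that its format is experimental, any parser like this should be treated as tied to one Solr version.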
-
Boosting Yellow Page customer happiness
The old XML posting
599113
-
Custom transformers
Data Import Handler transformer call
0){row.put("$docBoost", 2.5f);return row;
}}
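Most of the slide's code was lost in extraction; only the `row.put("$docBoost", 2.5f)` call survives. The sketch below shows what a row transformer of that kind could look like in Java. It is not the author's class: the class name, the column name `advert_count` and the "greater than zero" condition are assumptions made for illustration. `$docBoost` is the special DIH key from the fragment, and the `transformRow(Map)` method shape follows the DIH documentation as best I recall it, so check it against the Solr version in use.

```java
import java.util.HashMap;
import java.util.Map;

class AdvertiserBoostTransformer {
    // Called once per imported row; returns the (possibly modified) row.
    // Rows whose hypothetical "advert_count" column is positive get a document boost.
    public Object transformRow(Map<String, Object> row) {
        Object adverts = row.get("advert_count");
        if (adverts instanceof Number && ((Number) adverts).intValue() > 0) {
            row.put("$docBoost", 2.5f);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("advert_count", 3);
        System.out.println(new AdvertiserBoostTransformer().transformRow(row));
    }
}
```

Keeping the boost rule in one small class means the SQL query stays a plain select, and the business rule can change without touching the import configuration.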
-
Integration with Semaphore (ontology classification)
Ontology engine server adds in keywords automatically
Subjects (Emner)
People (Mennesker)
Places (Steder)
Companies (Firmaer)
analyze text
Documents User search
Enhance search
Create topic pages seamlessly
-
Semaphore integration?
Custom transformers save the day yet again
Repackage documents, throw them at the Classification server, fill in metatags, save to Solr
Smartlogic, August 2009: "Solr?"
Now has built-in integration for Solr (Baseliner)
-
Semaphore and Solr
The basic setup
Diagram components: Solr, Content DB, Web, Search Enhancement Server, Ontology Server, Classification Server, Indexing Pipeline, Baseliner, Rulebases
-
Nutch integration
A little more tricky than just turning on Solr
Cygwin
Seems to work OK, requires some nursing (like GSA)
Useful for external sites and closed turn-key websites from third parties
Currently on hold: crawling just got less important
-
Conclusion: Lessons learned
The right tools for the right job
To crawl or not to crawl
No such thing as magic relevance
Prototyping is the key
Get buy-in from the people who will run it
Commercial support foundation
Extensibility, integration and no limits to document base
Solr pops up everywhere! In the new CMS, next editorial backend, ontology integration
Get in touch: [email protected]