Turning search upside down with powerful open source search software

Charlie Hull - Managing Director, 28th November 2014
[email protected]
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
Turning Search Upside Down with open source software

Upload: charlie-hull

Post on 15-Jul-2015

TRANSCRIPT

Page 1: Turning search upside down with powerful open source search software

Charlie Hull - Managing Director
28th November 2014

[email protected]
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch

Turning Search Upside Down

with open source software

Page 5: Turning search upside down with powerful open source search software

Who are Flax?

We design, build and support open source powered search applications

Based in Cambridge U.K., technology agnostic & independent – but open source exponents & committers

UK Authorized Partner of

Customers include Reed Specialist Recruitment, Mydeco, NLA, Gorkana, Financial Times, News UK, EMBL-EBI, Accenture, University of Cambridge, UK Government...

Page 6: Turning search upside down with powerful open source search software

Who are Flax?

We design, build and support open source powered search applications

Based in Cambridge U.K., technology agnostic & independent – but open source exponents & committers

UK Authorized Partner of

Customers in recruitment, government, e-commerce, news & media, bioinformatics, consulting, law...

Page 8: Turning search upside down with powerful open source search software

What I'll cover today

Monitoring & classifying data with stored queries

Media monitoring - how it used to be done

What we can't change

Turning search upside down with Flax's Luwak

Case studies: Media monitoring in Scandinavia

Who else is using Luwak?

Conclusions

Page 11: Turning search upside down with powerful open source search software

Monitoring & classifying with stored queries

'Saved searches' / 'alerting' / 'stored profiles'

Use cases:
– Monitor news for client interests
– Tag/classify incoming data according to certain rules

Volume & speed
– 10k → 1m stored queries
– 10k → 1m new items a day
– Latencies as low as 50-100ms

Page 16: Turning search upside down with powerful open source search software

Media monitoring

Provide a news monitoring service to clients

Usually also provide a searchable news archive

People translate requirements & verify output...

...but volume is handled by software

UK/Europe – Gorkana, Precise, Cision; USA – BurellesLuce

Page 18: Turning search upside down with powerful open source search software

Media monitoring – how it used to be done

Manual process of translating client interests into stored queries
– Many pages of Boolean terms
– 10k-250k characters
– Evolved over many years, grown not built
– Difficult to maintain and troubleshoot

Page 19: Turning search upside down with powerful open source search software

Like this...

((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR "G.P.R.S" OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR !MOBILE COMM*" OR "HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA" OR "H.S.U.P.A" OR "HIGH SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR "M.V.N.O" OR "SMS" OR "SHORT MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*" OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*" OR "!LANDLINE*" OR "!TELEPHONE*" OR "PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR "T-MOBILE" OR "TMOBILE" OR "!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR "SMARTPHONE*" OR "!VIRGIN !MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48 (("PROFIT*" OR "LOSS*" OR "BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR "OFFICE OF FAIR TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR "BUY-OUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY" OR "CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR "NEW VENTURE*" OR "PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR "SALE*" OR "SELL*" OR "FULL YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR "!LEGISLAT*" OR "GREEN PAPER" OR "WHITE PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*" OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR "!COMPLAIN*" OR "MIS-SOLD" OR "MIS-SELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*" OR "FIBRE OPTIC*" OR "TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR "STAFF" OR "WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR "UNFAIR" OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR "TARIFF*" OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR "READER OFFER" OR (("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT LANDLINE*"))))

…and that's an easy one!
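
To make the operators in a query like this one concrete, here is a minimal Python sketch of a proximity test. It rests on assumptions the slides do not confirm: that `W/n` means "within n words", that `*` and `?` are wildcards for any run of characters and for a single character, and that the `!` and `%` prefixes can be ignored. The function names `phrase_positions` and `within` are hypothetical and are not taken from any of the products mentioned.

```python
from fnmatch import fnmatchcase

def phrase_positions(tokens, phrase):
    """Return start positions where each word pattern of `phrase` matches in sequence."""
    patterns = phrase.upper().split()
    hits = []
    for i in range(len(tokens) - len(patterns) + 1):
        if all(fnmatchcase(tokens[i + j], p) for j, p in enumerate(patterns)):
            hits.append(i)
    return hits

def within(text, left, right, n):
    """True if any phrase in `left` starts within `n` words of any phrase in `right`.

    Assumption: this is what a W/n operator means; real engines may count differently.
    """
    tokens = text.upper().split()
    left_pos = [p for phrase in left for p in phrase_positions(tokens, phrase)]
    right_pos = [p for phrase in right for p in phrase_positions(tokens, phrase)]
    return any(abs(a - b) <= n for a in left_pos for b in right_pos)

doc = "Vodafone said handset sales lifted full year profits"
print(within(doc, ["HANDSET*", "VODAFONE"], ["PROFIT*"], 48))  # wide window
print(within(doc, ["VODAFONE"], ["PROFIT*"], 3))               # narrow window
```

Even this toy version has to scan the document once per phrase, which hints at why thousands of such queries, each with dozens of terms, are expensive to run one by one.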

Page 20: Turning search upside down with powerful open source search software

Media monitoring – how it used to be done

Complex pipelines with many sources monitored
– Print & scan & OCR
– XML feeds (e.g. NLA media access in UK)
– Web content, Twitter etc.

Lots of search engines running in parallel
– dtSearch on 100+ servers, then merged
– 'Stored query' functionality built in e.g. Verity

Page 24: Turning search upside down with powerful open source search software

What we can't change...

Syntax of the stored queries (sometimes)
– Rewriting these is a huge, risky job for the business

Volume keeps increasing

The need to test new and old queries on archive data - using the same syntax

Source data will often be low quality
– Scans
– PDF extracts
– Re-typed XML

Page 26: Turning search upside down with powerful open source search software

Speaking the same language

Parse the old query syntax into a new form
– Either on the fly (dtSearch, Verity)
– Or off-line (Verity)
– In theory we can translate any old syntax
– A neutral syntax protects against lock-in
  • Why not?

Testing very, very important
– Agree a test set
– Watch for broken queries (in several ways)
– Use the client's resources to help
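
As an illustration of parsing an old syntax into a neutral form, the sketch below turns a toy Boolean syntax into nested tuples. Everything here is an assumption made for the example: the grammar (quoted phrases, `AND`, `OR`, `AND NOT`, parentheses) is far simpler than dtSearch or Verity syntax, all operators are given equal precedence, and the names `tokenize` and `parse` are hypothetical.

```python
import re

# One token: a parenthesis, a quoted phrase, or an operator ("AND NOT" before "AND").
TOKEN = re.compile(r'\s*(\(|\)|"[^"]*"|AND NOT|AND|OR)')
OPS = {"AND": "and", "OR": "or", "AND NOT": "and_not"}

def tokenize(query):
    """Split a query string into tokens, failing loudly on anything unrecognised."""
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if not m:
            raise ValueError(f"cannot tokenize at position {pos}: {query[pos:pos + 20]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(tokens):
    """Build a neutral tree: ("phrase", text) leaves under (operator, left, right) nodes."""
    def term(i):
        if i >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[i]
        if tok == "(":
            node, i = expr(i + 1)
            if i >= len(tokens) or tokens[i] != ")":
                raise ValueError("missing closing parenthesis")
            return node, i + 1
        if tok.startswith('"'):
            return ("phrase", tok.strip('"')), i + 1
        raise ValueError(f"unexpected token {tok!r}")

    def expr(i):
        node, i = term(i)
        while i < len(tokens) and tokens[i] in OPS:
            op = OPS[tokens[i]]
            right, i = term(i + 1)
            node = (op, node, right)  # left-associative, flat precedence
        return node, i

    node, i = expr(0)
    if i != len(tokens):
        raise ValueError(f"unexpected token {tokens[i]!r}")
    return node

tree = parse(tokenize('("BT" OR "VODAFONE") AND NOT "READER OFFER"'))
print(tree)
```

Raising an error on anything the parser does not understand is one way to watch for broken queries: a query that fails to parse is flagged at translation time rather than silently matching nothing.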

Page 30: Turning search upside down with powerful open source search software

Turning search upside down..

[Diagram: boxes labelled Docs, Stored Queries (Query, Query, ...) and Result, annotated as follows]
– 1 million queries, some 250k long, complex rules
– 1 million new documents a day
– Within 5-100ms
– $$$$$$

Page 32: Turning search upside down with powerful open source search software

Turning search upside down..

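[Diagram: boxes labelled Docs, Stored Queries (Query, Query, ...), Query Subset, Doc and Result, with steps labelled "1. Pre" and "2. Search", annotated as follows]
– 1 million queries, some 250k long, complex rules
– ~200
– 1 million new documents a day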

Page 35: Turning search upside down with powerful open source search software

Turning search upside down..

• How to avoid running 1 million searches on 1 million documents? – don't run all of them!

– 1. Presearch - Find the likely candidates first by searching the stored queries using the document

– 2. Search - Then perform a normal search, but with far fewer queries, and do it in memory for speed
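
The two steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Luwak's implementation: stored queries are reduced to a set of required terms plus a set of excluded terms, each query is indexed under a single required term, and the names `StoredQuery` and `Monitor` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredQuery:
    query_id: str
    all_of: frozenset                 # every term must appear (assumed non-empty)
    none_of: frozenset = frozenset()  # no term may appear

    def matches(self, doc_terms):
        return self.all_of <= doc_terms and not (self.none_of & doc_terms)

class Monitor:
    def __init__(self, queries):
        self.queries = {q.query_id: q for q in queries}
        self.index = defaultdict(set)
        for q in queries:
            # Index each query under one required term: a document
            # that lacks this term cannot possibly match the query.
            self.index[min(q.all_of)].add(q.query_id)

    def presearch(self, doc_terms):
        """Step 1: use the document's terms to look up candidate queries."""
        candidates = set()
        for term in doc_terms:
            candidates |= self.index.get(term, set())
        return candidates

    def match(self, text):
        """Step 2: run only the candidate queries against the document, in memory."""
        doc_terms = frozenset(text.lower().split())
        candidates = self.presearch(doc_terms)
        matched = sorted(qid for qid in candidates if self.queries[qid].matches(doc_terms))
        return matched, len(candidates)

monitor = Monitor([
    StoredQuery("telecoms", frozenset({"vodafone", "profit"})),
    StoredQuery("offers", frozenset({"reader", "offer"})),
    StoredQuery("tax", frozenset({"tax"}), frozenset({"offer"})),
])
matched, candidates_run = monitor.match("Vodafone reports record profit after tax")
print(matched, candidates_run)
```

The presearch step may return candidates that do not match in the end, but it must never drop a query that would have matched. That is why each query is indexed under a term it strictly requires.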

Page 36: Turning search upside down with powerful open source search software

Introducing Luwak

Based on a (slight) fork of Apache Lucene

A library for efficiently running many stored queries

Two stages:
– Presearcher (find candidate queries)
– Monitor Searcher (run candidate queries against a single document in-memory index)

Very fast – 70-100k queries per second when first tested and 3-4 times faster now

https://github.com/flaxsearch/luwak

Page 37: Turning search upside down with powerful open source search software

The gory details...

Custom QueryParsers for dtSearch, Verity VQL/OTL etc.

Hacked Lucene to expose positional information
– LUCENE-3831, LUCENE-3827 in Lucene 4.x
– LUCENE-2878 'Nuke Spans' coming in Lucene 5.0

Works with standard Lucene queries
– TermQuery, BooleanQuery, RegexpQuery...
– Can be extended to work with custom query types

Lots (and lots) of performance tuning

Page 38: Turning search upside down with powerful open source search software

Using Luwak in a larger system

Run multiple instances of Luwak to handle throughput

Use message passing platforms (e.g. RabbitMQ, Apache Kafka)

Build a separate index of articles – a searchable archive – for testing new stored queries and/or for customers

Use RESTful Web Service APIs everywhere

Test against known matches from older systems
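
Testing against known matches can be as simple as comparing two sets of results per article. The sketch below assumes both systems can export their matches as a mapping from article id to the set of stored-query ids that fired; the function name `compare_matches` and the sample ids are hypothetical.

```python
def compare_matches(old, new):
    """Compare per-article matches from the old system with those from the new one.

    old, new: dict mapping article id -> set of stored-query ids that matched.
    """
    report = {"agree": 0, "missing": {}, "extra": {}}
    for article_id in sorted(set(old) | set(new)):
        expected = old.get(article_id, set())
        actual = new.get(article_id, set())
        if expected == actual:
            report["agree"] += 1
        if expected - actual:
            report["missing"][article_id] = sorted(expected - actual)  # old matched, new did not
        if actual - expected:
            report["extra"][article_id] = sorted(actual - expected)    # new matched, old did not
    return report

old_system = {"a1": {"q1", "q2"}, "a2": {"q3"}, "a3": set()}
new_system = {"a1": {"q1"}, "a2": {"q3"}, "a3": {"q4"}}
print(compare_matches(old_system, new_system))
```

A difference in either direction is a prompt to inspect the query, not automatically a bug in the new system: the older system may have been the one getting it wrong.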

Page 42: Turning search upside down with powerful open source search software

Likely problems

Some old queries will be broken
– Replicate the broken behaviour, or fix it?

Some queries will need troubleshooting
– Queries can be very large
– Possibly in a foreign language

Performance tuning is hard
– JVM settings, caching, storage types

Remember this is a different system – 1 to 1 matching correspondence is impossible!

Page 43: Turning search upside down with powerful open source search software

Case study – media monitoring in Scandinavia

“The leading Danish provider of media intelligence, i.e. media search, media monitoring and media analysis”

Old system based on Verity (monitoring) and Autonomy IDOL (archive search)

Issues:
– Systems at full capacity - even with lots of hardware
– Archive search very, very slow
– Complex workflow with some very old parts (FTP, batch files...)
– Proprietary query format (Verity VQL)

Page 45: Turning search upside down with powerful open source search software

Case study – media monitoring in Scandinavia

Flax engaged in 2013 as part of a complete re-architecture

Phased approach:
1. Translate all old queries to new in-house syntax 'IQL'
2. New monitoring pipeline (RabbitMQ, RESTful Web Services, Flax's Luwak)
3. New archive search (Apache Solr)

Performance/volume targets:
– Monitoring 20 articles/second, aim for 100 eventually
– 100 qps for archive search, response within 250ms
– 17000 stored queries, 85 million articles in archive
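
A little arithmetic shows what these targets imply. The sketch below uses only the figures on the slide; the function names are hypothetical, the "brute force" figure assumes every stored query is run against every article with no presearch, and the concurrency figure assumes every archive query takes the full 250ms.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def monitoring_load(articles_per_second, stored_queries):
    """Daily article volume and the query evaluations a brute-force matcher would need."""
    return {
        "articles_per_day": articles_per_second * SECONDS_PER_DAY,
        "brute_force_evaluations_per_second": articles_per_second * stored_queries,
    }

def queries_in_flight(queries_per_second, response_seconds):
    """Little's law: average concurrency = arrival rate x time in the system."""
    return queries_per_second * response_seconds

print(monitoring_load(20, 17_000))   # initial monitoring target
print(monitoring_load(100, 17_000))  # eventual monitoring target
print(queries_in_flight(100, 0.25))  # archive search at the response-time limit
```

At 20 articles a second the brute-force approach would mean 340,000 query evaluations every second, rising to 1.7 million at the eventual target, which is the load the presearch stage is meant to cut down.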

Page 48: Turning search upside down with powerful open source search software

Case study – media monitoring in Scandinavia

Monitoring using 2 servers, Archive search using 8 servers (including failover)

6 cores, 18GB per server
– Launch more servers to scale the system

All stored queries translated to new IQL syntax
– Including the ones Verity didn't tell us weren't working...

Monitoring customers are being migrated in stages from Oct 2014, Archive search beta testing during Oct 2014, completion by end 2014

Page 49: Turning search upside down with powerful open source search software

Case study – media monitoring in Scandinavia

The benefits:

– No vendor lock-in (open source, IQL)

– Industry standard software (Apache Lucene/Solr, RabbitMQ)

– Predictable scaling & costs

Page 50: Turning search upside down with powerful open source search software

Who else is using Luwak?

Bloomberg http://www.flax.co.uk/blog/2014/03/24/london-search-meetup-serious-solr-at-bloomberg-elasticsearch-1-0/

Booz Allen Hamilton http://www.slideshare.net/BryanBende/realtime-inverted-search-nyc-aslug-oct-2014

Early versions are installed at

We know of users in other sectors e.g. recruitment

We're hoping to make it a standard part of Lucene/Solr and Elasticsearch (i.e. to improve the Percolator)

Apache Samza integration?
https://github.com/romseygeek/samza-luwak

Page 55: Turning search upside down with powerful open source search software

Conclusions

To solve a hard problem, sometimes you need to turn it upside down

When you migrate to open source, you can keep some of the parts of your system (if you must)

Open source helps you scale - with predictable costs

Leading companies are doing it

Go and turn Search Upside Down!

Page 56: Turning search upside down with powerful open source search software

Another thing...

BioSolr – a funded project to improve the way search technology is used in bioinformatics
– EBI & NCBI are involved
– http://www.flax.co.uk/blog/2014/10/02/biosolr-begins-with-a-workshop-day/
– Already producing Solr patches e.g. SOLR-1387
– Come and join us!

Page 57: Turning search upside down with powerful open source search software

Thank you!

Any questions?

[email protected]
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch