Solr @ Etsy - Apache Lucene Eurocon


Upload: giovanni-fernandez-kincade

Post on 09-Jul-2015


TRANSCRIPT

Page 1: Solr @ Etsy - Apache Lucene Eurocon

Solr @ Etsy

Page 2: Solr @ Etsy - Apache Lucene Eurocon

Things I’m not going to talk about:

A/B Testing

i18n

Continuous Deployment

Page 3: Solr @ Etsy - Apache Lucene Eurocon

About Us

Page 4: Solr @ Etsy - Apache Lucene Eurocon
Page 5: Solr @ Etsy - Apache Lucene Eurocon
Page 6: Solr @ Etsy - Apache Lucene Eurocon

10+ Million Listings

500 qps

Page 7: Solr @ Etsy - Apache Lucene Eurocon

Architecture Overview

Page 8: Solr @ Etsy - Apache Lucene Eurocon

Architecture Overview: Thrift

Page 9: Solr @ Etsy - Apache Lucene Eurocon

Architecture Overview: Thrift

struct Listing {
  1: i64 listing_id
}

struct ListingResults {
  1: i64 count,
  2: list<Listing> listings
}

service Search {
  ListingResults search(1:string query)
}

Page 10: Solr @ Etsy - Apache Lucene Eurocon

Architecture Overview: Thrift

Generated Java server code:

public class Search {
  public interface Iface {
    public ListingResults search(String query) throws TException;
  }
}

Generated PHP client code:

class SearchClient implements SearchIf {
  /**...**/
  public function search($query) {
    $this->send_search($query);
    return $this->recv_search();
  }
}

Page 11: Solr @ Etsy - Apache Lucene Eurocon

Architecture Overview: Thrift

Why use Thrift?

• Service Encapsulation
• Reduced Network Traffic

Page 12: Solr @ Etsy - Apache Lucene Eurocon

Architecture Overview: Thrift

Why only return IDs?

• Index Size
• Easy to scale PK lookups
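A minimal sketch of the ID-only pattern, under stated assumptions: the search tier returns nothing but listing IDs, and the caller hydrates them with primary-key lookups. The class name, the in-memory maps standing in for the index and the datastore, and the sample listings are all hypothetical; none of them come from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "search returns IDs, caller does PK lookups".
class IdOnlySearchSketch {

    // Stand-in for the search index: query term -> ranked listing IDs.
    static final Map<String, List<Long>> INDEX = new HashMap<>();

    // Stand-in for the primary datastore, keyed by listing_id.
    static final Map<Long, String> LISTING_STORE = new HashMap<>();

    static {
        INDEX.put("scarf", Arrays.asList(101L, 999L)); // 999 is stale: indexed, later deleted
        INDEX.put("earrings", Arrays.asList(102L));
        LISTING_STORE.put(101L, "Hand-knit wool scarf");
        LISTING_STORE.put(102L, "Silver leaf earrings");
    }

    // The search call returns IDs only, so the index stores no display fields.
    static List<Long> search(String query) {
        return INDEX.getOrDefault(query, Collections.<Long>emptyList());
    }

    // Hydrate IDs with primary-key lookups, preserving the ranked order.
    static List<String> hydrate(List<Long> ids) {
        List<String> titles = new ArrayList<>();
        for (Long id : ids) {
            String title = LISTING_STORE.get(id);
            if (title != null) {
                titles.add(title); // IDs missing from the store are skipped
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(hydrate(search("scarf")));
    }
}
```

Keeping display data out of the index is what keeps the index small, and the hydration step is a plain key lookup that can be scaled independently of search.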

Page 13: Solr @ Etsy - Apache Lucene Eurocon

The Search Server

Page 14: Solr @ Etsy - Apache Lucene Eurocon

Architecture Overview: Search Server

• Identical Code + Hardware
• Roles/Behavior controlled by Env variables
• Single Java Process
• Solr running as a Jetty Servlet
• Thrift Servers
• Smoker

Page 15: Solr @ Etsy - Apache Lucene Eurocon

Architecture Overview: Search Server

Master-specific processes:

• Incremental Indexer
• External File Field Updaters
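A rough sketch of how one identical build could pick its role from the environment. The variable name SEARCH_ROLE, its values, and the component labels are assumptions made for illustration; the talk does not name the actual variables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: same code on every box, role chosen by an env variable.
class RoleFromEnvSketch {

    // Returns the components a node would start for the given environment.
    static List<String> componentsFor(Map<String, String> env) {
        // SEARCH_ROLE is an invented name; any non-master value is treated as a replica.
        String role = env.getOrDefault("SEARCH_ROLE", "replica");

        List<String> components = new ArrayList<>();
        components.add("solr-jetty-servlet");
        components.add("thrift-servers");
        components.add("smoker");

        // Only the master runs the indexing-side processes.
        if ("master".equalsIgnoreCase(role)) {
            components.add("incremental-indexer");
            components.add("external-file-field-updaters");
        }
        return components;
    }

    public static void main(String[] args) {
        System.out.println(componentsFor(System.getenv()));
    }
}
```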

Page 16: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing

Page 17: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Thrift TSocketPool

Page 18: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Thrift TSocketPool

Page 19: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Thrift TSocketPool

Page 20: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Server Affinity

Page 21: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Server Affinity Algorithm

$serversNew = array();
$numServers = count($servers);

while ($numServers > 0) {
    // Take the first 4 chars of the md5sum of the server count
    // and the query, mod the available servers
    $key = hexdec(substr(md5($numServers . '+' . $query), 0, 4)) % $numServers;
    $keySet = array_keys($servers);
    $serverId = $keySet[$key];

    // Push the chosen server onto the new list and remove it
    // from the initial list
    array_push($serversNew, $servers[$serverId]);
    unset($servers[$serverId]);
    --$numServers;
}

[“host2”, “host3”, “host1”, “host4”]
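The same ordering logic can be sketched in Java. This is a port written for illustration, not code from the talk; the class and method names are invented. It mirrors the PHP step by step: hash the remaining server count plus the query, take the first four hex characters of the MD5 digest, and use that value modulo the remaining count to pick the next server.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative Java port of the PHP server affinity loop.
class ServerAffinitySketch {

    // Equivalent of hexdec(substr(md5($s), 0, 4)): the first 4 hex characters
    // of the digest are its first 2 bytes, read as an unsigned 16-bit value.
    static int md5Prefix(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return ((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    // Returns a query-specific, deterministic ordering of the servers.
    static List<String> orderFor(String query, List<String> servers) {
        List<String> remaining = new ArrayList<>(servers);
        List<String> ordered = new ArrayList<>();
        while (!remaining.isEmpty()) {
            int numServers = remaining.size();
            int key = md5Prefix(numServers + "+" + query) % numServers;
            // Move the chosen server from the remaining list to the result.
            ordered.add(remaining.remove(key));
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("host1", "host2", "host3", "host4");
        System.out.println("jewelry -> " + orderFor("jewelry", hosts));
        System.out.println("scarf   -> " + orderFor("scarf", hosts));
    }
}
```

Because the ordering depends only on the query and the server list, every client computes the same preference list for a given query, and the later entries act as fallbacks when the first choice is unavailable.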

Page 22: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Server Affinity Algorithm

“jewelry” [“host2”, “host3”, “host1”, “host4”]

“scarf” [“host2”, “host3”, “host1”, “host4”]

$key = hexdec(substr(md5($query),0,4))

Page 23: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Server Affinity Algorithm

“jewelry” [“host2”, “host3”, “host1”, “host4”]

“scarf” [“host2”, “host1”, “host4”, “host3”]

$key = hexdec(substr(md5($numServers . '+' . $query),0,4))%(count($servers));

Page 24: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Server Affinity Results

2% 20%

Page 25: Solr @ Etsy - Apache Lucene Eurocon

Load Balancing: Server Affinity Caveats

• Stemming / Analysis
• Be wary of query distribution
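One reading of the first caveat is that queries which analyze to the same terms should hash to the same server, otherwise they warm different caches. The sketch below assumes that reading and applies a deliberately simple normalization (trim, lowercase, collapse whitespace) before the query is hashed. It is not the analysis chain used in the talk, and a real key would also need to account for stemming.

```java
import java.util.Locale;

// Hypothetical sketch: normalize a raw query before using it as an affinity key.
class AffinityKeySketch {

    // Trim, lowercase, and collapse runs of whitespace to a single space.
    static String affinityKey(String rawQuery) {
        return rawQuery.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(affinityKey("  Silver   Ring "));
    }
}
```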

Page 26: Solr @ Etsy - Apache Lucene Eurocon

Replication

Page 27: Solr @ Etsy - Apache Lucene Eurocon

Replication: The Problem

Page 28: Solr @ Etsy - Apache Lucene Eurocon

Replication: The Problem

Page 29: Solr @ Etsy - Apache Lucene Eurocon

Replication: Multicast Rsync?

Page 30: Solr @ Etsy - Apache Lucene Eurocon

Replication: Multicast Rsync?

[15:25] <engineer> patrick: i'm gonna test multi-rsyncing some indexes from host1 to host2 and host3 in prod. I'll be watching the graphs and what not, but let me know if you see anything funky with the network

[15:26] <patrick> ok....

[15:31] <keyur> is the site down?

Page 31: Solr @ Etsy - Apache Lucene Eurocon

Replication: Multicast Rsync?

Page 32: Solr @ Etsy - Apache Lucene Eurocon

Hmm... Bit Torrent?

Page 33: Solr @ Etsy - Apache Lucene Eurocon

Replication: Bit Torrent POC

Using BitTornado:

Page 34: Solr @ Etsy - Apache Lucene Eurocon

Replication: Bit Torrent + Solr

Fork of TTorrent: https://github.com/etsy/ttorrent

• Multi-File Support
• Performance Enhancements

Page 35: Solr @ Etsy - Apache Lucene Eurocon

Replication: Bit Torrent + Solr

Page 36: Solr @ Etsy - Apache Lucene Eurocon

Replication: Bit Torrent + Solr

Page 37: Solr @ Etsy - Apache Lucene Eurocon

Replication: Bit Torrent + Solr

Page 38: Solr @ Etsy - Apache Lucene Eurocon

Replication: Bit Torrent + Solr

Page 39: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp

Page 40: Solr @ Etsy - Apache Lucene Eurocon

QParsers

Page 41: Solr @ Etsy - Apache Lucene Eurocon

“writing query strings is for suckers”

Page 42: Solr @ Etsy - Apache Lucene Eurocon
Page 43: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: QParsers

http://host:8393/solr/person/select/?q=_query_:%22{!dismax%20qf=$fqf%20v=$fnq}%22%20OR%20(_query_:%22{!dismax%20qf=$fiqf%20v=$fiq}%22%20AND%20(_query_:%22{!dismax%20qf=$lwqf%20v=$lwq}%22%20OR%20_query_:%22{!dismax%20qf=$lqf%20v=$lq}%20%22))&fnq=%22giovanni%20fernandez-kincade%22&fqf=full_name^4&fiq=giovanni&fiqf=first_name^2.0%20first_name_syn&qt=standard&lwq=fernandez-kincade*&lwqf=last_name&lq=fernandez-kincade&lqf=last_name^3

Page 44: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: QParsers

http://host:8393/solr/person/select/?q={!personrealqp}giovanni%20fernandez-kincade

Page 45: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: QParsers

class PersonNameRealQParser extends QParser {
  public PersonNameRealQParser(String qstr, SolrParams localParams,
                               SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

Page 46: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: QParsers

  @Override
  public Query parse() throws ParseException {
    TermQuery exactFullNameQuery = new TermQuery(new Term("full_name", qstr));
    exactFullNameQuery.setBoost(4.0f);

    String[] userQueryTerms = qstr.split("\\s+");
    Query firstLastQuery = null;

    if (2 == userQueryTerms.length)
      firstLastQuery = parseAsFirstAndLast(userQueryTerms[0], userQueryTerms[1]);
    else
      firstLastQuery = parseAsFirstOrLast(userQueryTerms);

    DisjunctionMaxQuery realNameQuery = new DisjunctionMaxQuery(0);
    realNameQuery.add(exactFullNameQuery);
    realNameQuery.add(firstLastQuery);

    return realNameQuery;
  }

Page 47: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: QParsers

The QParserPlugin that returns our new QParser:

public class PersonNameRealQParserPlugin extends QParserPlugin {
  public static final String NAME = "personrealqp";

  @Override
  public void init(NamedList args) {}

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new PersonNameRealQParser(qstr, localParams, params, req);
  }
}

Page 48: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: QParsers

Registering the plugin in solrconfig.xml:

<queryParser name="personrealqp" class="com.etsy.person.solr.PersonNameRealQParserPlugin" />

Page 49: Solr @ Etsy - Apache Lucene Eurocon

Custom Stemmer

Page 50: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: Custom Stemmer

Page 51: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: Custom Stemmer

banded, banding, birding, bouldering, bounded, buffing, bundler, canning, carded, circled, coupler, dangler, doubler, firring, foiling, hooper, japanned, lipped, napped, papered, pebbled, pitted, pocketed, reductive, ricer, rooter, roper, seeded, shouldered, silvered, skinning, spindling, staining, stitcher, strapped, threaded, yellowing

Page 52: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: Custom Stemmer

First we extend KStemmer and intercept stem calls:

public class LStemmer extends KStemmer {

  /**.....**/

  @Override
  String stem(String term) {
    String override = overrideStemTransformations.get(term);
    if (override != null) return override;
    return super.stem(term);
  }
}
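The override-then-delegate pattern can be tried without Lucene on the classpath. In this sketch the base stemmer is a toy suffix stripper, not KStem, and the choice to map an overridden word to itself is an assumption; the talk does not show what overrideStemTransformations contains. All names here are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-alone version of "check the override map, else delegate".
class OverrideStemmerSketch {

    private final Map<String, String> overrides;
    private final UnaryOperator<String> baseStemmer;

    OverrideStemmerSketch(Map<String, String> overrides,
                          UnaryOperator<String> baseStemmer) {
        this.overrides = overrides;
        this.baseStemmer = baseStemmer;
    }

    // An override wins; every other term goes to the base stemmer.
    String stem(String term) {
        String override = overrides.get(term);
        return override != null ? override : baseStemmer.apply(term);
    }

    // Toy base stemmer (NOT KStem): strips a trailing "ing" or "ed".
    static String toyStem(String term) {
        if (term.endsWith("ing") && term.length() > 5) {
            return term.substring(0, term.length() - 3);
        }
        if (term.endsWith("ed") && term.length() > 4) {
            return term.substring(0, term.length() - 2);
        }
        return term;
    }

    public static void main(String[] args) {
        Map<String, String> overrides = new HashMap<>();
        overrides.put("birding", "birding"); // assumed: keep the word unstemmed
        OverrideStemmerSketch stemmer =
                new OverrideStemmerSketch(overrides, OverrideStemmerSketch::toyStem);
        System.out.println(stemmer.stem("birding"));
        System.out.println(stemmer.stem("walking"));
    }
}
```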

Page 53: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: Custom Stemmer

Then create a TokenFilter that uses the new Stemmer:

final class LStemFilter extends TokenFilter {

  /**.....**/

  protected LStemFilter(TokenStream input, int cacheSize) {
    super(input);
    stemmer = new LStemmer(cacheSize);
  }

  @Override
  public boolean incrementToken() throws IOException { /**....**/ }
}

Page 54: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: Custom Stemmer

Create a FilterFactory that exposes it:

public class LStemFilterFactory extends BaseTokenFilterFactory {
  private int cacheSize = 20000;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    String cacheSizeStr = args.get("cacheSize");
    if (cacheSizeStr != null) {
      cacheSize = Integer.parseInt(cacheSizeStr);
    }
  }

  @Override
  public TokenStream create(TokenStream in) {
    return new LStemFilter(in, cacheSize);
  }
}

Page 55: Solr @ Etsy - Apache Lucene Eurocon

Solr InterOp: Custom Stemmer

And finally plug it into your analysis chain:

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="solr/common/conf/stopwords.txt"/>
  <filter class="com.etsy.solr.analysis.LStemFilterFactory" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

Page 56: Solr @ Etsy - Apache Lucene Eurocon

Thanks!