Solr @ Etsy - Apache Lucene EuroCon

Post on 09-Jul-2015


Solr @ Etsy

Things I’m not going to talk about:

• A/B Testing
• i18n
• Continuous Deployment

About Us

• 10+ Million Listings
• 500 qps

Architecture Overview

Architecture Overview: Thrift

struct Listing {
  1: i64 listing_id
}

struct ListingResults {
  1: i64 count,
  2: list<Listing> listings
}

service Search {
  ListingResults search(1: string query)
}

Generated Java server code:

public class Search {
  public interface Iface {
    public ListingResults search(String query) throws TException;
  }
  /**...**/
}

Generated PHP client code:

class SearchClient implements SearchIf {
  /**...**/
  public function search($query) {
    $this->send_search($query);
    return $this->recv_search();
  }
  /**...**/
}

Why use Thrift?

• Service Encapsulation
• Reduced Network Traffic

Why only return IDs?

• Index Size
• Easy to scale PK lookups
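One way to picture the ID-only design: the search service returns nothing but listing IDs, and the caller hydrates them with primary-key lookups against a separate store. The following is a minimal sketch under that assumption, with an in-memory map standing in for the store. `IdHydration`, `LISTING_STORE`, `hydrate`, and the sample data are hypothetical names for illustration, not part of the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdHydration {
    // Hypothetical stand-in for a primary-key store (for example a sharded database).
    static final Map<Long, String> LISTING_STORE = new LinkedHashMap<>();

    // Look up each ID returned by search, preserving the search ranking order.
    static List<String> hydrate(List<Long> listingIds) {
        List<String> titles = new ArrayList<>();
        for (long id : listingIds) {
            String title = LISTING_STORE.get(id);
            if (title != null) { // skip IDs that are no longer in the store
                titles.add(title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        LISTING_STORE.put(1L, "silver ring");
        LISTING_STORE.put(2L, "wool scarf");
        LISTING_STORE.put(3L, "beaded necklace");

        // Pretend the search service ranked listing 3 ahead of listing 1.
        System.out.println(hydrate(Arrays.asList(3L, 1L)));
    }
}
```

Because the index stores no display fields, it stays small, and the lookup side can be scaled independently of search.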

The Search Server

Architecture Overview: Search Server

• Identical Code + Hardware
• Roles/Behavior controlled by Env variables
• Single Java Process
• Solr running as a Jetty Servlet
• Thrift Servers
• Smoker

Master-specific processes:

• Incremental Indexer
• External File Field Updaters
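Role selection through environment variables could look roughly like the sketch below: every node starts the same services, and a node whose role variable says "master" also starts the indexing processes. The variable name `SEARCH_ROLE`, the value `master`, and the service labels are assumptions made for illustration; the talk does not name them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SearchServerRole {
    // Hypothetical variable name; the talk does not say which variables are used.
    static final String ROLE_VAR = "SEARCH_ROLE";

    // Every node runs the same services; a master adds the indexing processes.
    static List<String> servicesFor(Map<String, String> env) {
        List<String> services = new ArrayList<>();
        services.add("solr-jetty");
        services.add("thrift-servers");
        services.add("smoker");
        if ("master".equals(env.get(ROLE_VAR))) {
            services.add("incremental-indexer");
            services.add("external-file-field-updaters");
        }
        return services;
    }

    public static void main(String[] args) {
        // Reads the real process environment; prints the services this node would start.
        System.out.println(servicesFor(System.getenv()));
    }
}
```

Passing the environment in as a map keeps the decision testable without touching the real process environment.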

Load Balancing

Load Balancing: Thrift TSocketPool

Load Balancing: Server Affinity

Load Balancing: Server Affinity Algorithm

$serversNew = array();

$numServers = count($servers);
while ($numServers > 0) {
    // Take the first 4 chars of the md5sum of the server count
    // and the query, mod the available servers
    $key = hexdec(substr(md5($numServers . '+' . $query), 0, 4)) % ($numServers);
    $keySet = array_keys($servers);
    $serverId = $keySet[$key];

    // Push the chosen server onto the new list and remove it
    // from the initial list
    array_push($serversNew, $servers[$serverId]);
    unset($servers[$serverId]);
    --$numServers;
}

["host2", "host3", "host1", "host4"]

"jewelry" ["host2", "host3", "host1", "host4"]
"scarf" ["host2", "host3", "host1", "host4"]

$key = hexdec(substr(md5($query),0,4))

"jewelry" ["host2", "host3", "host1", "host4"]
"scarf" ["host2", "host1", "host4", "host3"]

$key = hexdec(substr(md5($numServers . '+' . $query),0,4))%(count($servers));
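For readers on the JVM, the same ordering idea can be sketched in Java. This is a port written for illustration, not code from the talk: `ServerAffinity`, `first16BitsOfMd5`, and `order` are hypothetical names. It hashes the remaining server count together with the query at every step, so two queries that share a first-choice server can still fall back to different servers.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ServerAffinity {
    // Equivalent of PHP's hexdec(substr(md5($s), 0, 4)): the first four hex
    // characters of the digest are its first two bytes, read big-endian.
    static int first16BitsOfMd5(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xff) << 8) | (d[1] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    // Build a per-query preference order over the servers.
    static List<String> order(List<String> servers, String query) {
        List<String> remaining = new ArrayList<>(servers);
        List<String> ordered = new ArrayList<>();
        while (!remaining.isEmpty()) {
            int n = remaining.size();
            // Hash the remaining count plus the query, mod the remaining servers.
            int key = first16BitsOfMd5(n + "+" + query) % n;
            // Move the chosen server from the remaining list to the result.
            ordered.add(remaining.remove(key));
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("host1", "host2", "host3", "host4");
        System.out.println("jewelry -> " + order(hosts, "jewelry"));
        System.out.println("scarf   -> " + order(hosts, "scarf"));
    }
}
```

The ordering is deterministic for a given query and server list, which is what lets repeated queries land on the same server and its warm caches.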

Load Balancing: Server Affinity Results

2% 20%

Load Balancing: Server Affinity Caveats

• Stemming / Analysis
• Be wary of query distribution

Replication

Replication: The Problem

Replication: Multicast Rsync?

[15:25]  <engineer> patrick: i'm gonna test multi-rsyncing some indexes from host1 to host2 and host3 in prod. I'll be watching the graphs and what not, but let me know if you see anything funky with the network
[15:26]  <patrick> ok....
[15:31]  <keyur> is the site down?

Hmm... Bit Torrent?

Replication: Bit Torrent POC

Using BitTornado:

Replication: Bit Torrent + Solr

Fork of TTorrent: https://github.com/etsy/ttorrent

• Multi-File Support
• Performance Enhancements

Solr InterOp

QParsers

“writing query strings is for suckers”

Solr InterOp: QParsers

http://host:8393/solr/person/select/?q=_query_:%22{!dismax%20qf=$fqf%20v=$fnq}%22%20OR%20(_query_:%22{!dismax%20qf=$fiqf%20v=$fiq}%22%20AND%20(_query_:%22{!dismax%20qf=$lwqf%20v=$lwq}%22%20OR%20_query_:%22{!dismax%20qf=$lqf%20v=$lq}%20%22))&fnq=%22giovanni%20fernandez-kincade%22&fqf=full_name^4&fiq=giovanni&fiqf=first_name^2.0%20first_name_syn&qt=standard&lwq=fernandez-kincade*&lwqf=last_name&lq=fernandez-kincade&lqf=last_name^3

http://host:8393/solr/person/select/?q={!personrealqp}giovanni%20fernandez-kincade
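With the custom parser in place, the client only has to prefix the user's input with the parser name and URL-encode it. The sketch below shows that step using the JDK alone. `PersonQueryUrl` and `selectUrl` are hypothetical names, and `URLEncoder` produces form encoding, so it emits `+` for spaces and percent-encodes the braces rather than reproducing the URL above byte for byte.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PersonQueryUrl {
    // Build a select URL that routes the raw user query to a named QParser
    // via Solr's local-params prefix, e.g. {!personrealqp}.
    static String selectUrl(String baseUrl, String parserName, String userQuery) {
        String q = "{!" + parserName + "}" + userQuery;
        try {
            return baseUrl + "/select/?q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://host:8393/solr/person",
                "personrealqp", "giovanni fernandez-kincade"));
    }
}
```

All of the boosting and field logic now lives server-side in the parser, so the client no longer needs to know the field names.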

class PersonNameRealQParser extends QParser {
  public PersonNameRealQParser(String qstr, SolrParams localParams,
                               SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

  @Override
  public Query parse() throws ParseException {
    // Exact match on the whole input against full_name, boosted highest
    TermQuery exactFullNameQuery = new TermQuery(new Term("full_name", qstr));
    exactFullNameQuery.setBoost(4.0f);

    String[] userQueryTerms = qstr.split("\\s+");
    Query firstLastQuery = null;

    // Exactly two terms are treated as "first last"; anything else as first OR last
    if (2 == userQueryTerms.length)
      firstLastQuery = parseAsFirstAndLast(userQueryTerms[0], userQueryTerms[1]);
    else
      firstLastQuery = parseAsFirstOrLast(userQueryTerms);

    // Score a document by its best-matching clause (tie breaker of 0)
    DisjunctionMaxQuery realNameQuery = new DisjunctionMaxQuery(0);
    realNameQuery.add(exactFullNameQuery);
    realNameQuery.add(firstLastQuery);

    return realNameQuery;
  }

  /**...**/
}

The QParserPlugin that returns our new QParser:

public class PersonNameRealQParserPlugin extends QParserPlugin {
  public static final String NAME = "personrealqp";

  @Override
  public void init(NamedList args) {}

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new PersonNameRealQParser(qstr, localParams, params, req);
  }
}

Registering the plugin in solrconfig.xml:

<queryParser name="personrealqp" class="com.etsy.person.solr.PersonNameRealQParserPlugin" />

Custom Stemmer

Solr InterOp: Custom Stemmer

banded, banding, birding, bouldering, bounded, buffing, bundler, canning, carded, circled, coupler, dangler, doubler, firring, foiling, hooper, japanned, lipped, napped, papered, pebbled, pitted, pocketed, reductive, ricer, rooter, roper, seeded, shouldered, silvered, skinning, spindling, staining, stitcher, strapped, threaded, yellowing

First we extend KStemmer and intercept stem calls:

public class LStemmer extends KStemmer {
  /**.....**/

  @Override
  String stem(String term) {
    String override = overrideStemTransformations.get(term);
    if (override != null) return override;
    return super.stem(term);
  }
}
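The override pattern can be tried without Solr on the classpath. The sketch below is a self-contained approximation: `OverrideStemmer`, `addOverride`, and the toy `naiveStem` fallback are hypothetical, and mapping a term to itself assumes the intent is to protect that word from the default stemmer, which the slides do not state outright.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class OverrideStemmer {
    private final Map<String, String> overrides = new HashMap<>();
    private final UnaryOperator<String> fallback;

    OverrideStemmer(UnaryOperator<String> fallback) {
        this.fallback = fallback;
    }

    void addOverride(String term, String stem) {
        overrides.put(term, stem);
    }

    // Check the override table first, then defer to the wrapped stemmer.
    String stem(String term) {
        String override = overrides.get(term);
        return override != null ? override : fallback.apply(term);
    }

    // Toy fallback for illustration only (not KStem): strips a trailing "ing" or "ed".
    static String naiveStem(String term) {
        if (term.endsWith("ing")) return term.substring(0, term.length() - 3);
        if (term.endsWith("ed")) return term.substring(0, term.length() - 2);
        return term;
    }

    public static void main(String[] args) {
        OverrideStemmer stemmer = new OverrideStemmer(OverrideStemmer::naiveStem);
        stemmer.addOverride("bouldering", "bouldering"); // assumed: keep the term as-is

        System.out.println(stemmer.stem("bouldering")); // override wins
        System.out.println(stemmer.stem("painted"));    // falls through to naiveStem
    }
}
```

Keeping the overrides in a lookup table means individual bad stems can be corrected without modifying the underlying stemmer.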

Then create a TokenFilter that uses the new Stemmer:

final class LStemFilter extends TokenFilter {
  /**.....**/

  protected LStemFilter(TokenStream input, int cacheSize) {
    super(input);
    stemmer = new LStemmer(cacheSize);
  }

  @Override
  public boolean incrementToken() throws IOException {
    /**....**/
  }
}

Create a FilterFactory that exposes it:

public class LStemFilterFactory extends BaseTokenFilterFactory {
  private int cacheSize = 20000;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    String cacheSizeStr = args.get("cacheSize");
    if (cacheSizeStr != null) {
      cacheSize = Integer.parseInt(cacheSizeStr);
    }
  }

  @Override
  public TokenStream create(TokenStream in) {
    return new LStemFilter(in, cacheSize);
  }
}

And finally plug it into your analysis chain:

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="solr/common/conf/stopwords.txt"/>
  <filter class="com.etsy.solr.analysis.LStemFilterFactory" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

Thanks!
