Deduplication Using Solr: Presented by Neeraj Jain, StubHub

Upload: lucidworks

Post on 12-Jul-2015

TRANSCRIPT

Page 1: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub
Page 2: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Deduplication using SOLR
Neeraj Jain, Software Engineer, Stubhub, Inc.

Page 3: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

About myself

RIDE ON

Page 4: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

StubHub is about...

We enable "access" to events

We want to be more!!!

World's Largest Ticketing marketplace

10M active listings

Page 5: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Some Fun Facts about StubHub!
Ø An eBay owned company
Ø Over 25 million users and growing
Ø We sell one ticket per second
Ø ~8.5 million page views a day, on average
Ø ~3 million additional page views per day on mobile devices
Ø ~10 M tickets for sale in sports, concerts and others
Ø ~1 TB of data processed monthly by the analytics infrastructure. This number will go up significantly as we bring in data from many of the unstructured data sources
Ø ~300 million SQL executions/day

Page 6: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Search at Stubhub!

[Timeline figure. Years: 2010, 2011, 2012, 2013, 2014. Milestones shown: SOLR 1.2; SOLR 3+ with geo spatial search; SOLR Cloud; SOLR 4 with NRT.]

Page 7: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Agenda
Ø Use case
Ø Challenges
Ø Legacy solution
Ø Our approach
Ø Results

Page 8: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Use Case : Content Ingestion

[Diagram: content ingestion flow. Sources: Feed-1, Feed-2, Feed-3, ..., Feed-n, Form. Stages: Input record, Pre deduplication, Deduplication, Post deduplication. Steps shown: Normalize, Geocode, Filtering, Classification, Review, Insert, Update, Discard. Target: Event DB.]

Page 9: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Challenges : Deduplication
Ø Problem space
    ² Event catalog
Ø Performance considerations
    ² Real time processing
    ² Batch processing
Ø Speed and data quality

Page 10: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Legacy Solution : Deduplication Flow

[Sequence diagram. Participants: Client (Feed Ingestor, UGC, Batch Job), DeduplicationModule, Event DB. Calls: 1: getDuplicates(); 2: getSubsetByLocation(); 3: loop (for each document, for each field: Normalize, Filter, Compute Score); 4: DuplicateList; 5: upsert().]

Page 11: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Approach : Problem Model

Ø Milpitas Library vs Milpitas Public Library
Ø 1601 E 7th St vs 1601 E. Seventh St.
Ø Pick the right algorithm, e.g. edit distance or Jaccard similarity.
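The two pairs above differ only in an extra token and in abbreviations, which is the case token-set measures handle well. The following is a minimal sketch of Jaccard similarity, assuming lowercasing, punctuation stripping, whitespace tokenization and a small hypothetical synonym table; none of these normalization rules are taken from the talk.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class JaccardSketch {
    // Hypothetical normalization table; the talk does not list its rules.
    private static final Map<String, String> SYNONYMS = Map.of(
            "seventh", "7th",
            "street", "st");

    // Lowercase, strip punctuation, split on whitespace, map known variants.
    static Set<String> tokens(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9\\s]", "").trim();
        return Arrays.stream(cleaned.split("\\s+"))
                .filter(t -> !t.isEmpty())
                .map(t -> SYNONYMS.getOrDefault(t, t))
                .collect(Collectors.toSet());
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 1.0 for two empty sets.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("Milpitas Library", "Milpitas Public Library"));
        System.out.println(jaccard("1601 E 7th St", "1601 E. Seventh St."));
    }
}
```

With this normalization the two address strings produce the same token set, while the two venue names share two of three distinct tokens. Edit distance would be the alternative when the variation is inside a token (for example a misspelling) rather than between tokens.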

[Diagram: Venue Deduplication, shown as layered steps.]
Ø Subset - Geo spatial distance
    ² e.g. Milpitas Library: 160 N. Main St; 40 N. Milpitas Blvd. Distance: ~0.5 mi
Ø Subset - Text similarity on categories
    ² Library, Restaurant, etc.
Ø Dup detection - name, address, etc.
Ø Boost
    ² e.g. venue name, street number
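The geo spatial subset step narrows candidates to venues within some radius of the input venue. In the talk this is done inside Solr; the sketch below only illustrates the underlying great-circle computation with the haversine formula. The class name, the method names and the choice of miles are assumptions made for illustration.

```java
class GeoDistanceSketch {
    // Mean Earth radius in miles.
    static final double EARTH_RADIUS_MILES = 3958.8;

    // Great-circle distance between two latitude/longitude points (degrees), in miles.
    static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 to guard against rounding slightly above the asin domain.
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    // True when the candidate lies within radiusMiles of the input point.
    static boolean withinRadius(double lat1, double lon1, double lat2, double lon2,
                                double radiusMiles) {
        return haversineMiles(lat1, lon1, lat2, lon2) <= radiusMiles;
    }
}
```

A radius filter like this keeps two records for the same venue in the candidate set even when their addresses differ, as in the Milpitas Library example, provided the radius is chosen larger than the expected geocoding disagreement.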

Page 12: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Approach : Deduplication Flow

[Sequence diagram. Participants: Client (Feed Ingestor, UGC, Batch Job), DeduplicationService (QueryBuilder, QueryExecuter, Scorer), Filter (NameFilter, AddressFilter, *Filter), SOLR Index, Event DB, IndexUpdater. Calls: 1: /dedupe; 2: build(); 3: execute(); 3.1: /select; 4: compute(); 6: DedupeResponse; 7: /update; 8: upsert(). IndexUpdater calls: A1: poll(); A2: /update, /delete.]
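The build() and /select steps in the flow above can be pictured as turning an input venue into Solr request parameters. The sketch below is one way that might look, using the standard `q`, `fq`, `fl` and `rows` parameters and Solr's `geofilt` filter. The field names `name` and `location`, the boost value, the row limit and the class itself are hypothetical and are not taken from the talk.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class DedupeQuerySketch {
    // Backslash-escape characters that are special in the Lucene query syntax.
    static String escape(String s) {
        return s.replaceAll("([+\\-!(){}\\[\\]^\"~*?:\\\\/&|])", "\\\\$1");
    }

    // Build /select parameters: boosted name match, restricted to a radius around the point.
    static Map<String, String> buildParams(String name, double lat, double lon, double radiusKm) {
        Map<String, String> params = new LinkedHashMap<>();
        // Hypothetical field "name" with an illustrative boost of 2.
        params.put("q", "name:(" + escape(name) + ")^2");
        // geofilt keeps documents within d kilometers of pt; "location" is a hypothetical field.
        params.put("fq", "{!geofilt sfield=location pt=" + lat + "," + lon + " d=" + radiusKm + "}");
        params.put("fl", "id,name,score");
        params.put("rows", "10");
        return params;
    }

    // URL-encode the parameters into a query string for an HTTP GET on /select.
    static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // Placeholder coordinates, not real venue data.
        System.out.println(toQueryString(buildParams("Milpitas Library", 37.43, -121.9, 1.0)));
    }
}
```

Putting the radius restriction in `fq` rather than `q` keeps it out of the relevance score, so the score reflects text similarity only and the Scorer step can weigh distance separately.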

Page 13: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Approach : Deduplication Service

public interface DeduplicationService<T> {

    /**
     * Checks for duplicate entities and returns a DeduplicationResponse containing
     * information about the duplicates found. For each possible duplicate, there is
     * a justification as to why it is a duplicate.
     *
     * @param t entity for which duplicates need to be found.
     * @param options options used to find and filter the results.
     * @return a non-null DeduplicationResponse.
     * @throws DeduplicationConnectivityException if there was an issue connecting
     *         to the dedupe data store.
     */
    public DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
            throws DeduplicationConnectivityException;
}
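The slide shows only the interface; `DeduplicationResponse`, `DedupeOptions` and the exception are referenced but never defined. To make the contract concrete (a non-null response, with a justification per duplicate), here is a toy in-memory sketch. Every type body, field and the `InMemoryNameDeduplicationService` class are hypothetical stand-ins, and the case-insensitive match replaces the Solr-backed lookup purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for types the slides reference but do not show.
class DedupeOptions {
    final double minScore;
    DedupeOptions(double minScore) { this.minScore = minScore; }
}

class DeduplicationConnectivityException extends Exception {
    DeduplicationConnectivityException(String message, Throwable cause) { super(message, cause); }
}

class Duplicate<T> {
    final T entity;
    final double score;
    final String justification;
    Duplicate(T entity, double score, String justification) {
        this.entity = entity;
        this.score = score;
        this.justification = justification;
    }
}

class DeduplicationResponse<T> {
    private final List<Duplicate<T>> duplicates;
    DeduplicationResponse(List<Duplicate<T>> duplicates) { this.duplicates = List.copyOf(duplicates); }
    List<Duplicate<T>> getDuplicates() { return duplicates; }
    boolean hasDuplicates() { return !duplicates.isEmpty(); }
}

interface DeduplicationService<T> {
    DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
            throws DeduplicationConnectivityException;
}

// Toy implementation: names are duplicates when they match ignoring case.
class InMemoryNameDeduplicationService implements DeduplicationService<String> {
    private final List<String> known;

    InMemoryNameDeduplicationService(List<String> known) { this.known = List.copyOf(known); }

    @Override
    public DeduplicationResponse<String> findDuplicates(String name, DedupeOptions options) {
        List<Duplicate<String>> found = new ArrayList<>();
        for (String candidate : known) {
            double score = candidate.equalsIgnoreCase(name.trim()) ? 1.0 : 0.0;
            if (score > 0 && score >= options.minScore) {
                found.add(new Duplicate<>(candidate, score, "names match ignoring case"));
            }
        }
        // Always return a response object, never null, even when nothing matched.
        return new DeduplicationResponse<>(found);
    }
}
```

Because the in-memory version touches no external store, it narrows the `throws` clause away; an implementation backed by a search index would keep it to report connection failures.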

Page 14: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Approach : Deduplication Service

import org.springframework.stereotype.Component;

// Venue-specific implementation, registered as a named Spring bean.
@Component(value = "VenueDeduplicationService")
public class VenueDeduplicationService implements DeduplicationService<Venue> {

    @Override
    public DeduplicationResponse<Venue> findDuplicates(Venue venue, DedupeOptions options)
            throws DeduplicationConnectivityException {
        // Body not shown on the slide.
    }
}

// Event-specific implementation, registered as a named Spring bean.
@Component(value = "EventDeduplicationService")
public class EventDeduplicationService implements DeduplicationService<Event> {

    @Override
    public DeduplicationResponse<Event> findDuplicates(Event event, DedupeOptions options)
            throws DeduplicationConnectivityException {
        // Body not shown on the slide.
    }
}

Page 15: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Approach : Optimizations
Ø How to keep the score consistent?
    ² <similarity class="TfSimilarity"/>
Ø Auto commit settings
    ² <autoSoftCommit><maxTime>5</maxTime></autoSoftCommit>
Ø Custom PostFilter
    ² <queryParser name="fdist" class="DistanceQParserPlugin"/>
Ø Custom update handler
    ² <processor class="VenueUpdateProcessorFactory"></processor>

Page 16: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Results : Sample Output

Input Venue | Matched Venue | Score | Distance
Jillian's Billiards Club, 101 Fourth St. | Jillian's, 175 4th St. | 1.5573 | 5.6352
Lush Lounge, 1092 Post St. | Lush Lounge, 1221 Polk St. | 12.9836 | 16.6501
Mountain Theatre, 10 Panoramic Hwy. | Mountain Theater, Nearby E Ridgecrest Boulevard and Pantoll Road | 3.2509 | 5.8913

Page 17: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Results : Sample Output

Input Venue | Matched Venue | Score | Distance
The Hedley Club at Hotel DeAnza, 233 W. Santa Clara St. | Hedley Club, 233 W. Santa Clara St. | 5.0805 | 0.0000
Sonya Paz Fine Art Gallery, 1793 Lafayette St. | Sonya Paz Gallery and Studio, 1793 Lafayette St. Suite 110 | 6.6764 | 0.0069
Pearl Avenue Library Community Room, 4270 Pearl Ave. | Pearl Avenue Branch Library, 4270 Pearl Ave. | 5.7024 | 0.0000
Milpitas Library, 160 N. Main St. | Milpitas Library, 40 N. Milpitas Blvd. | 16.4318 | 0.7284

Page 18: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Summary
Ø Use case
    ² Content ingestion
Ø Challenges
    ² Deduplication
Ø Legacy solution
Ø Our approach
    ² Used SOLR for text similarity
    ² Extended default behavior
    ² REST endpoint over SOLR interface
Ø Next steps
    ² Big data
    ² Performer matching
    ² I18n
Ø Results

Page 19: Deduplication Using Solr: Presented by Neeraj Jain, Stubhub

Thank You