structure, personalization, scale: a deep dive into linkedin search

56
Recruiting Solutions 1 Asif Danie l Structure, Personalization, Scale: A Deep Dive into LinkedIn Search

Upload: daniel-tunkelang

Post on 08-Sep-2014

4.821 views

Category:

Technology


6 download

DESCRIPTION

Structure, Personalization, Scale: A Deep Dive into LinkedIn Search Presented at QCon New York 2014 in the Applied Science and Machine Learning track https://qconnewyork.com/presentation/structure-personalization-scale-deep-dive-linkedin-search Also presented at the NYC Search, Discovery, and Analytics Meetup http://www.meetup.com/NYC-Search-and-Discovery/events/168265892/ Video: http://new.livestream.com/hugeinc/search-at-linkedin Abstract All of us are familiar with search as users. And as software engineers, many of us have worked on search problems in the context of web search, site search, or enterprise search. But search at LinkedIn is different. Our corpus is a richly structured professional graph comprised of 300M+ people, 3M+ companies, 2M+ groups, and 1.5M+ publishers. Our members perform billions of searches (over 5.7B in 2012), and each of those searches is highly personalized based on the searcher's identity and relationships with other professional entities in LinkedIn's economic graph. And all this data is in constant flux as LinkedIn adds more than 2 members every second in over 200 countries (2/3 of our members are outside the United States). As a result, we’ve built a system quite different from those used for other search applications. In this talk, we will discuss some of the unique challenges we've faced as we deliver highly personalized search over semi-structured data at massive scale. Asif Makhani heads Search at LinkedIn. Prior to that, he was a founding member of A9 and led the development and launch of Amazon CloudSearch, a fully managed and elastic search service in the AWS Cloud. Asif has a Masters in Computer Science from Stanford University and a BMath from University of Waterloo. @asifm Speaker Bios Daniel Tunkelang leads LinkedIn's efforts around query understanding. Before that, he led LinkedIn's product data science team. He previously led a local search quality team at Google and was a founding employee of Endeca (acquired by Oracle in 2011). He has written a textbook on faceted search, and is a recognized advocate of human-computer interaction and information retrieval (HCIR). He has a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT. @dtunkelang

TRANSCRIPT

Page 1: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

Recruiting SolutionsRecruiting SolutionsRecruiting Solutions1

Asif Daniel

Structure, Personalization, Scale: A Deep Dive into LinkedIn Search

Page 2: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

Overview

What is LinkedIn search and why should you care?

What are our systems challenges?

What are our relevance challenges?

2

Page 3: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

3

Page 4: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

4

Page 5: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

5

Search helps members find and be found.

Page 6: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

6

Search for people, jobs, groups, and more.

Page 7: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

7

Page 8: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

8

A separate product for recruiters.

Page 9: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

9

Search is the core of key LinkedIn use cases.

Page 10: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

10

What’s unique

Personalized Part of a larger product experience

– Many products– Big part

Task-centric– Find a job, hire top talent, find a person, …

Page 11: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

11

Systems Challenges

Page 12: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

12

Evolution of LinkedIn’ Search Architecture

2004: No Search Engine Iterate through your network

and filter

2004

Page 13: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

13

2004

2007

Lucene Lucene

Lucene Lucene(Single Shard)

Updates Queries

Results

2007: Introducing Lucene (single shard, multiple replicas)

Page 14: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

14

2004

2007

2008

Lucene Lucene

Lucene

Updates Queries

ResultsLucene

Zoie

2008: Zoie - real-time search (search without commits/shutdown)

Page 15: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

15

2004

2007

2008

Lucene Lucene

Lucene

Source 1

Queries

Results

Lucene

ZoieSource 2

….

Source N

Content Store

….

2008: Content Store (aggregating multiple input sources)

Page 16: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

16

2004

2007

2008

Source

1

Queries

Results

Source

2

….

Source

N

Content Store

….

Sharded

Broker

2008: Sharded search

Page 17: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

17

2004

2007

2008

2009

Source

1

Queries

Results

Source

2

….

Source

N

Content Store

…. Sensei

Broker

LuceneZoie

Bobo

2009: Bobo – Faceted Search

Page 18: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

18

2004

2007

2008

2009

2010

2010: SenseiDB (cluster management, new query language, wrapping existing pieces)

Page 19: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

19

2004

2007

2008

2009

2010

2011

2011: Cleo (instant typeahead results)

Page 20: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

20

2004

2007

2008

2009

2010

2011

2013

2013: Too many stacks

Group SearchArticle/Post SearchAnd more…

Page 21: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

21

Challenges

Index rebuilding very difficult Live updates are at an entity granularity Scoring is inflexible Lucene limitations Fragmentation – too many components, too many stacks

Economic Graph

Opportunity

Page 22: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

22

2004

2007

2008

2009

2010

2011

2013

2014

2014: Introducing Galene

Page 23: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

23

Life of a Query

Query Rewriter/Planner

ResultsMerging

UserQuery

SearchResults

Search Shard

Search Shard

Page 24: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

24

Life of a Query – Within A Search Shard

RewrittenQuery

TopResults

FromShard

INDEX

TopResults

Retrieve aDocument

Score theDocument

Page 25: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

25

Life of a Query – Within A Rewriter

Query

DATAMODEL

RewriterState

RewriterModule

DATAMODEL

DATAMODEL

RewrittenQuery

RewriterModule

RewriterModule

Page 26: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

26

Life of Data - Offline

INDEX

Derived DataRaw Data

DATAMODEL

DATAMODEL

DATAMODEL

DATAMODEL

DATAMODEL

Page 27: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

27

Improvements

Regular full index builds using Hadoop– Easier to reshard, add fields

Improved Relevance– Offline relevance, query rewriting frameworks

Partial Live Updates Support– Allows efficient updates of high frequency fields (no sync)– Goodbye Content Store, Goodbye Zoie

Early termination– Ultra low latency for instant results– Goodbye Cleo

Indexing and searching across graph entities/attributes Single engine, single stack

Page 28: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

28

Galene Deep dive

Page 29: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

29

Primer on Search

Page 30: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

30

Lucene

An open source API that supports search functionality: Add new documents to index Delete documents from the index Construct queries Search the index using the query Score the retrieved documents

Page 31: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

31

The Search Index

Inverted Index: Mapping from (search) terms to list of documents (they are present in)

Forward Index: Mapping from documents to metadata about them

Page 32: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

32

BLAH BLAH BLAH DanielBLAH BLAH LinkedIn BLAH BLAH BLAH BLAH

BLAH BLAH AsifBLAH LinkedIn BLAH BLAH BLAH BLAH BLAH BLAH BLAH2.

1.

Daniel Asif LinkedIn

2

1

Inverted Index Forward Index

Page 33: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

33

The Search Index

The lists are called posting lists Upto hundreds of millions of posting lists Upto hundreds of millions of documents Posting lists may contain as few as a single hit and as

many as tens of millions of hits Terms can be

– words in the document– inferred attributes about the document

Page 34: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

34

Lucene Queries

term:“asif makhani” term:asif term:daniel +term:daniel +prefix:tunk +asif +linkedIn +term:daniel connection:50510 +term:daniel industry:software connection:50510^4

Page 35: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

35

Early termination

We order documents in the index based on a static rank – from most important to least important

An offline relevance algorithm assigns a static rank to each document on which the sorting is performed

This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)

Also works well with personalized search– +term:asif +prefix:makh

+(connection:35176 connection:418001 connection:1520032)

Page 36: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

36

Partial Updates

Lucene segments are “document-partitioned” We have enhanced Lucene with “term-partitioned”

segments We use 3 term-partitioned segments:

– Base index (never changed)– Live update buffer– Snapshot index

Page 37: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

37

Base IndexSnapshot

IndexLive Update

Buffer

Page 38: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

38

Going Forward

Consolidation across verticals Improved Relevance Support

– Machine-learned models, query rewriting, relevant snippets,…

Improved Performance Search as a Service (SeaS) Exploring the Economic Graph

Page 39: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

39

Quality Challenges

Page 40: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

40

The Search Quality Pipeline

New document

Features

Machine

learning model

scoreNew

document

Features

Machine

learning model

score

New document

Features

Machine learning model

score

Ordered listOrdered

listOrdered list

spellcheck query tagging vertical intent query expansion

Page 41: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

41

Spellcheck

PEOPLE NAMESCOMPANIES

TITLES

PAST QUERIES

n-gramsmarissa => ma ar ri is ss sa

metaphonemark/marc => MRK

co-occurrence countsmarissa:mayer = 1000

marisa meyer yahoo

marissa

marisa

meyer

mayer

yahoo

Page 42: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

42

Query Tagging

machine learning data scientist brooklyn

Page 43: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

43

Vertical Intent: Results Blending

[company]

[employees]

[jobs]

[name search]

Page 44: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

44

Vertical Intent: Typeahead

P(mongodb | mon) = 5%

P(monsanto | mons): 50%

P(mongodb | mong): 80%

Page 45: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

45

Query Expansion

Page 46: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

46

Ranking

New document

Features

Machine

learning model

scoreNew

document

Features

Machine

learning model

score

New document

Features

Machine learning model

score

Ordered listOrdered

listOrdered list

Page 47: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

47

Ranking is highly personalized.

kevin scott

Page 48: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

48

Not just for name search.

Page 49: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

49

Relevance Model

keywords

document

Features

Machine learning model

Page 50: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

50

Examples of Features

Search keywords matching title = 3

Searcher location = Result location

Searcher network distance to result = 2

Page 51: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

51

Model Training: Traditional Approach

Documents for training

Features

Human evaluation

Labels

Machine learning model

Page 52: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

52

Model Training: LinkedIn’s Approach

Documents for training

Features

Human evaluation

Search logs

Labels

Machine learning model

Page 53: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

53

Fair Pairs and Easy Negatives

Flipped

[Radlinski and Joachims, 2006]

• Sample negatives from bottom results• But watch out for variable length result sets.• Compromise, e.g., sample from page 10.

Page 54: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

54

Model Selection

Select model based on user and query features.– e.g., person name queries, recruiters making skills queries

Resulting model is a tree with logistic regression leaves. Only one regression model evaluated for each document.

X 2=0

X2=?

X2=1

X10< 0.1234 ?

Yes

No

Page 55: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

Summary

What is LinkedIn search and why should you care?

LinkedIn search enables the participants inthe economic graph to find and be found.

What are our systems challenges?

Indexing rich, structured content; retrievingusing global and social factors; real-time updates.

What are our relevance challenges?

Query understanding, personalizedmachine-learned ranking models.

55

Page 56: Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

56

Asif MakhaniDaniel [email protected] [email protected]://linkedin.com/in/asifmakhani https://linkedin.com/in/dtunkelang