Search as Communication: Lessons from a Personal Journey


DESCRIPTION

Search as Communication: Lessons from a Personal Journey, by Daniel Tunkelang (Head of Query Understanding, LinkedIn). Presented at Etsy's Code as Craft series on May 21, 2013.

When I tell people I spent a decade studying computer science at MIT and CMU, most assume that I focused my studies on information retrieval; after all, I've spent most of my professional life working on search. But that's not how it happened. I learned about information extraction as a summer intern at IBM Research, where I worked on visual query reformulation. I learned how search engines work by building one at Endeca. It was only after I'd hacked my way through the problem for a few years that I started to catch up on the rich scholarly literature of the past few decades. As a result, I developed a point of view about search without the benefit of academic conventional wisdom. Specifically, I came to see search not so much as a ranking problem as a communication problem. In this talk, I'll explain my communication-centric view of search, offering examples, general techniques, and open problems.

Daniel Tunkelang is Head of Query Understanding at LinkedIn. Educated at MIT and CMU, he has spent his career working on big data, addressing key challenges in search, data mining, user interfaces, and network analysis. He co-founded enterprise search and business intelligence pioneer Endeca, where he spent a decade as Chief Scientist. In 2011, Endeca was acquired by Oracle for over $1B. Prior to LinkedIn, he led a team at Google working on local search quality. Daniel has authored fifteen patents, written a textbook on faceted search, and created the annual symposium on human-computer interaction and information retrieval.

TRANSCRIPT

Search as Communication: Lessons from a Personal Journey

Daniel Tunkelang, Head of Query Understanding, LinkedIn

These are great textbooks on information retrieval.

Unfortunately, I never read them in school.

But I did study graphs and stuff.

I found myself developing a search engine.

And the next thing I knew, I was a search guy.

So what did I learn along the way?

Search isn't a ranking problem. It's a communication problem.

Outline

1. Lessons from Library Science
2. Adventures with Information Extraction
3. A Moment of Clarity

1. Lessons from Library Science

[Diagram: the USER turns an information need into a query and selects from results; the SYSTEM ranks using an IR model (tf-idf, PageRank).]

A bird's-eye view of how search engines work.

Old school search: ask a librarian.

Search lives in an information-seeking context.

[Pirolli and Card, 2005]


Recognize ambiguity and ask for clarification.

Clarify, then refine.

[Screenshot: facet categories such as Computers and Books.]

Faceted search. It's not just for e-commerce.

Give users transparency, guidance, and control.
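A minimal sketch of the facet counting behind that guidance (the documents and fields are invented for illustration):

from collections import Counter

def facet_counts(results, field):
    """Count how many matching documents fall under each value of a
    facet field, so the UI can offer refinement options with counts."""
    return Counter(doc[field] for doc in results if field in doc)

# Invented result set for the query "python".
results = [
    {"title": "Learning Python",  "category": "Books",     "year": 2013},
    {"title": "Python IDE",       "category": "Computers", "year": 2012},
    {"title": "Python Cookbook",  "category": "Books",     "year": 2011},
]
print(facet_counts(results, "category"))   # Counter({'Books': 2, 'Computers': 1})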

Take-away for search engine developers:

Act like a librarian. Communicate with your user.

2. Adventures with Information Extraction

String matching is great but has limits.


Query segmentation via beam search, reconstructed here as runnable Python from the slide's pseudocode (Pc(s) is the probability that the string s is a concept):

def segment(words, Pc, k=10):
    """Beam search over query segmentations. B[i] holds the k most
    probable segmentations of words[:i] as (segments, prob) pairs."""
    n = len(words)
    B = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        s = " ".join(words[:i])
        if Pc(s) > 0:
            B[i].append(({s}, Pc(s)))            # whole prefix as one segment
        for j in range(1, i):
            for segs, prob in B[j]:
                s = " ".join(words[j:i])         # extend a segmentation of the
                if Pc(s) > 0:                    # first j words with words j+1..i
                    B[i].append((segs | {s}, prob * Pc(s)))
        B[i].sort(key=lambda entry: entry[1], reverse=True)
        del B[i][k:]                             # keep only the k best (the beam)
    return B[n]
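A hypothetical run, with a toy concept-probability table invented for illustration:

table = {"new york": 0.5, "times square": 0.4,
         "new": 0.1, "york": 0.1, "times": 0.1, "square": 0.1}
Pc = lambda s: table.get(s, 0.0)

for segs, prob in segment("new york times square".split(), Pc, k=3):
    print(segs, prob)
# Best segmentation: {'new york', 'times square'} with probability 0.2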

People search for entities. Recognize them!

Named entity recognition is free, as in free beer.

Problem: they process each document separately.

[Diagram: Entity Detection System.]

Why not take advantage of corpus features?

Give your documents the right to vote!

Use a high-recall method to collect candidates.
• e.g., all title-case spans of words, other than a single word beginning a sentence.

Process each document separately.
• Each candidate is assigned an entity type, or no type at all.

If a candidate is mostly assigned a single entity type, extrapolate to all its occurrences (sketched below).
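A minimal sketch of that corpus-level vote, assuming per-candidate label lists collected from any off-the-shelf per-document recognizer (the data structure and threshold are my own illustration):

from collections import Counter

def vote(per_doc_labels, threshold=0.7):
    """If a candidate is mostly assigned one entity type across the
    corpus, extrapolate that type to all of its occurrences.
    per_doc_labels maps candidate -> one label (or None) per mention."""
    resolved = {}
    for candidate, labels in per_doc_labels.items():
        votes = Counter(label for label in labels if label is not None)
        if not votes:
            continue
        best, count = votes.most_common(1)[0]
        if count / len(labels) >= threshold:
            resolved[candidate] = best
    return resolved

# Invented example: three of four mentions agree on PERSON.
per_doc_labels = {
    "Daniel Tunkelang": ["PERSON", "PERSON", "ORG", "PERSON"],
    "Code as Craft":    ["ORG", None, "ORG"],    # 2/3 < 0.7: left unresolved
}
print(vote(per_doc_labels))   # {'Daniel Tunkelang': 'PERSON'}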

Looking for topics? Use idf, and its cousin ridf.

Inverse document frequency (idf)
• Too low? Probably a stop word.
• Too high? Could be noise.

Residual inverse document frequency (ridf)
• Predict idf using a Poisson model.
• ridf is the difference between observed idf and predicted idf.

“a good keyword is far from Poisson” [Church and Gale, 1995]
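As a sketch, here is that recipe in Python (the formula follows Church and Gale's residual idf; the counts below are invented for illustration):

import math

def ridf(df, cf, N):
    """Residual idf: observed idf minus the idf a Poisson model
    predicts from the collection frequency.
    df: documents containing the term; cf: total occurrences of the
    term in the collection; N: total number of documents."""
    idf = -math.log2(df / N)
    lam = cf / N                                   # Poisson rate per document
    predicted_idf = -math.log2(1 - math.exp(-lam))
    return idf - predicted_idf

# A bursty topical term vs. an evenly spread term, same total count.
print(ridf(df=100, cf=1000, N=100_000))   # concentrated: ridf ≈ 3.3
print(ridf(df=900, cf=1000, N=100_000))   # near-Poisson: ridf ≈ 0.15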

Terminology extraction? Try data recycling.

Obtain entities by any means necessary.

Take-away for search engine developers:

Entity detection is crucial. And it isn't that hard.

3. A Moment of Clarity

[Diagram repeated from Section 1: the USER turns an information need into a query and selects from results; the SYSTEM ranks using an IR model (tf-idf, PageRank).]

Let's go back to our pigeons for a moment.

What does this process look like to the system?


And here's what it looks like to the user.

[Screenshots: a GOOD results page vs. a NOT SO GOOD one.]

But can the system tell the difference?

User experience should reflect system confidence.


Derived from [Jansen et al., 2007]. http://searchengineland.com/getting-organized-paid-search-user-intent-the-search-funnel-116312

Searches reflect a variety of information needs.


(Query segmentation pseudocode repeated; see the Python reconstruction in Section 2.)

We can segment information need from the query.

We can learn from analyzing user behavior.

And we can look at our relevance scores.

[Chart: Navigational vs. Exploratory queries, from Claudia Hauff, Query Difficulty for Digital Libraries (2009).]

There are many pre- and post-retrieval signals.
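For instance, a toy post-retrieval signal built from the relevance scores mentioned above (the measure and the example numbers are my own illustration, not from the talk):

import statistics

def score_confidence(scores, k=10):
    """A toy difficulty signal: the normalized spread of the top-k
    relevance scores. A dominant top result suggests an easy query;
    a flat score curve suggests a hard one."""
    top = sorted(scores, reverse=True)[:k]
    if len(top) < 2 or top[0] <= 0:
        return 0.0
    return statistics.pstdev(top) / top[0]

easy = [9.1, 4.2, 4.0, 3.8, 3.7]   # one result dominates
hard = [4.1, 4.0, 4.0, 3.9, 3.9]   # everything looks the same
print(score_confidence(easy))      # ≈ 0.23
print(score_confidence(hard))      # ≈ 0.02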

Take-away for search engine developers:

Queries vary in difficulty. Recognize and adapt.

Review  

1. Lessons from Library Science
• Act like a librarian. Communicate with users.

2. Adventures with Information Extraction
• Entity detection is crucial. And it isn't that hard.

3. A Moment of Clarity
• Queries vary in difficulty. Recognize and adapt.

Conclusion: Read the textbooks.

But treat search as a communication problem.

WE'RE HIRING! http://data.linkedin.com/search

   

Contact me: dtunkelang@linkedin.com

http://linkedin.com/in/dtunkelang
