ranking the web with spark · the anatomy of a large-scale hypertextual web search engine (1998)...

47
Ranking the Web with Spark Apache Big Data Europe 2016 [email protected] @sylvinus

Upload: others

Post on 18-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Ranking the Web with Spark

Apache Big Data Europe 2016

[email protected] @sylvinus

Page 2: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

/usr/bin/whoami

• Jamendo (Founder & CTO, 2004-2011)

• TEDxParis (Co-founder, 2009-2012)

• dotConferences (Founder, 2012-)

• Pricing Assistant (Co-founder & CTO, 2012-)

Page 3: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

transparency

reproducibility

Page 4: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features
Page 5: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

https://uidemo.commonsearch.org

Page 6: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

https://explain.commonsearch.org/?q=python&g=en

Page 7: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Ranking

Page 8: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Disclaimer: IANASRE(I Am Not A Search Relevance Engineer)

Page 9: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

What's in a score

score = fn( doc, query, language, user, time )

Page 10: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

What's in a score

score = fn( doc, query )

Page 11: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

What's in a score

score = fn( static_score, dynamic_score ( query ))

Page 12: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Static score

Page 13: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Static features

• Scopes:

• Page: URL depth, markup stats, ...

• Domain: Age, page count, blacklists, ...

• WebGraph: PageRank, ...

Page 14: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

http://infolab.stanford.edu/~backrub/google.htmlThe Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

Crawler

Indexer

Database

SearcherRanker

Page 15: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Dynamic score

Page 16: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Dynamic features

• Text match: TF-IDF, BM25, proximity, topic, ...

• Query-level: number of words, popularity, ...

• Usage: clicks, dwell time, reformulations, ...

• Time

Page 17: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Scoring function

Page 18: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Users

DatabaseElasticsearch

IndexerPython, Spark

Data sourcesCommon Crawl, Alexa top 1M, ...

words, static score

query top 10 docs, final scores

Offline

OnlineSearcher

Go

Page 19: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features
Page 20: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

https://explain.commonsearch.org/?q=python&g=en

Page 21: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Issues with this architecture

• Static & dynamic scoring are in different codebases

• No control over result diversity

• Hard to optimize

• Very dependent on Elasticsearch

Page 22: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Rescoring

Page 23: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Users

Database

Indexer

words, static score, features

query

Searcher

top 1k docs, features

Rescorer

final 10 docs

Page 24: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Issues with rescoring

• Latency

• Pagination

• Harder to explain

Page 25: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Learning to rank

Page 26: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

LTR Model

• Features

• Training dataset

• Evaluation: NDCG, ERR, ...

• Algorithms: AdaRank, ListNet, LambdaMART, ...

• Learning with Spark!

Page 27: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

The right questions

• What do users expect?

• What features?

• How to evaluate and fine-tune in the real world?

Page 28: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

PageRank with Spark

Page 29: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features
Page 30: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

http://commoncrawl.org

Page 31: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

https://github.com/commonsearch/cosr-back

Page 32: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Common Search Pipeline

Doc sourcesCommon Crawl,

WARC files, URLs ...

Filter plugins

Document parsing

Output plugins

Data outputDatabase, file, HDFS, S3, ...

Page 33: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Most popular Wikipedia pages

Page 34: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Dumping the web graph

Page 35: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Naive pyspark PageRank

Page 36: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

GraphFrames

Page 37: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

SparkSQL PageRank

Page 38: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

SparkSQL PageRank

https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py

Page 39: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Tests

http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm

https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py

Page 40: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

https://about.commonsearch.org/developer/get-started

Page 41: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features
Page 42: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Top 10

Page 43: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features
Page 44: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Spam

Page 45: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features
Page 46: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Spamdexing• Keyword stuffing, hidden text

• Scraper sites, Mirrors

• Link farms

• Splogs, Comment spam

• Domaining

• Cloaking

• Bombing

Page 47: Ranking the Web with Spark · The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database Ranker Searcher. Dynamic score. Dynamic features

Questions?https://about.commonsearch.org/contributing

https://github.com/commonsearch [email protected]

slack.commonsearch.org