LARGE-SCALE SEARCH TECHNOLOGIES

Post on 12-May-2022

TRANSCRIPT

Page 1: TECHNOLOGIES LARGE-SCALE SEARCH

LARGE-SCALE SEARCH TECHNOLOGIES

Page 2: TECHNOLOGIES LARGE-SCALE SEARCH

Why is building an internet search engine so hard?

Over 150 trillion webpages. Only 1 of every 3,000 webpages is indexed.

It is practically impossible to index all the words in every webpage

One cannot build a competitive search engine without a huge amount of user data

It’s too expensive to compute search results on the fly

Page 3: TECHNOLOGIES LARGE-SCALE SEARCH

State-of-the-art technologies

State-of-the-art technologies would not work for powering large-scale web search engines, for two reasons:

Scalability

Noise

Page 4: TECHNOLOGIES LARGE-SCALE SEARCH

Scalability challenges

The core algorithm of an inverted index scheme is based on a list intersection

This is a computationally inefficient operation, which is impractical when it comes to the web and its billions of webpages.
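To make the list-intersection step concrete, here is a minimal sketch of an inverted index with a merge-style intersection of sorted posting lists. The function names and the three toy documents are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Intersect two sorted posting lists in O(len(a) + len(b)) steps."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)  # start with the shortest list to keep intermediates small
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


# Hypothetical toy corpus.
docs = {
    1: "cyberpunk 2077 release date",
    2: "cyberpunk 2077 ps4 review",
    3: "ps4 console release date",
}
index = build_inverted_index(docs)
print(search(index, "cyberpunk release"))  # [1]
```

Each query walks the full posting list of every term it contains, so the cost grows with the number of pages containing common terms. That is the cost the slide refers to at web scale.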

Scalability

Page 5: TECHNOLOGIES LARGE-SCALE SEARCH

Noise reduction challenges

Textual measures: word intersections, word distances, word positions, tf-idf, BM25, etc.

No matter which text-based approach you take, it might work well on one query but poorly on another.
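As an illustration of one of the textual measures listed above, the following is a small sketch of Okapi BM25 scoring. It assumes whitespace tokenization, the common defaults k1 = 1.5 and b = 0.75, and the smoothed idf variant ln((N - n + 0.5) / (n + 0.5) + 1). The toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus.
docs = [
    "cyberpunk 2077 release date",
    "cyberpunk 2077 ps4 review",
    "ps4 console",
]
scores = bm25_scores("cyberpunk ps4", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The score depends only on term statistics, so a page can rank highly for a query without being what the user wanted. That is the noise problem the slide describes.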

Noise

Page 6: TECHNOLOGIES LARGE-SCALE SEARCH

The conclusion

There is no point trying to index the parts of a webpage that are unlikely to appear in prospective users’ search queries

Page 7: TECHNOLOGIES LARGE-SCALE SEARCH

Query log

Suppose that for each webpage you had a list of all the search queries that ended with a user clicking on this webpage (say, from Google).

Any part of a webpage that does not overlap with any of the search queries leading to it is irrelevant to the retrieval process.

Hence, you could focus on only indexing the search queries.
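The idea of indexing only the queries can be sketched as follows. This is a minimal illustration that assumes a click log of (query, clicked URL) pairs is available. The log entries, URLs, and function names are hypothetical.

```python
from collections import defaultdict


def index_from_query_log(click_log):
    """Build term -> set of URLs using only the queries that led to each URL."""
    index = defaultdict(set)
    for query, url in click_log:
        for term in query.lower().split():
            index[term].add(url)
    return index


def lookup(index, query):
    """Return the URLs whose logged queries cover every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical click log: (query, clicked URL).
click_log = [
    ("cyberpunk 2077 release date", "example.com/cyberpunk"),
    ("cyberpunk ps4 release", "example.com/cyberpunk"),
    ("ps4 console price", "example.com/ps4"),
]
index = index_from_query_log(click_log)
print(lookup(index, "cyberpunk release date"))
```

The page text never enters the index, so its size is bounded by the vocabulary of the queries rather than by the vocabulary of the pages.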

Page 8: TECHNOLOGIES LARGE-SCALE SEARCH

Example – Cyberpunk 2077

Page 9: TECHNOLOGIES LARGE-SCALE SEARCH

Smaller index size

Less noise with more accurate results

User queries

Page 10: TECHNOLOGIES LARGE-SCALE SEARCH

Predict first, index later

Indexing only the prospective user queries

The indexing problem is turned into a prediction problem

Page 11: TECHNOLOGIES LARGE-SCALE SEARCH

Data collection

Long-tail queries

No in-depth understanding

Not independent

Expensive infrastructure

Without massive data collection, are we doomed to fail?

Page 12: TECHNOLOGIES LARGE-SCALE SEARCH

Data prediction

Use AI to predict the prospective user's query

Page 13: TECHNOLOGIES LARGE-SCALE SEARCH

What about recall and precision?

This solves the indexing problem

The next challenge is to maximize the recall and precision
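For reference, precision is the share of retrieved results that are relevant, and recall is the share of relevant results that were retrieved. A minimal sketch, with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set against a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 documents are actually relevant.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)
```

Here two of the four returned results are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).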

Page 14: TECHNOLOGIES LARGE-SCALE SEARCH

It’s all about the context

What is the release date of Cyberpunk 2077 to PS4 and Xbox One?

Page 15: TECHNOLOGIES LARGE-SCALE SEARCH

Not an NLP problem

What is the release date of Cyberpunk 2077 to PS4 and Xbox One?

release date Cyberpunk 2077 PS4 Xbox One

Page 16: TECHNOLOGIES LARGE-SCALE SEARCH

What will you choose?

What is the release date of Cyberpunk 2077 to PS4 and Xbox One?

Cyberpunk 2077 PS4 console

Page 17: TECHNOLOGIES LARGE-SCALE SEARCH

Contextual Graph

The power of context

Page 18: TECHNOLOGIES LARGE-SCALE SEARCH

Taxonomy vs. Contextuality

[Figure: taxonomy versus contextuality. Nodes: Nike, Shoes, Sporting Goods, Running, Basketball, Air Max, Cortez, Adidas, Lebron James. Labels: Static, Dynamic.]

Page 19: TECHNOLOGIES LARGE-SCALE SEARCH

Contextual graph

[Figure: graph over the nodes Nike, Air Max, and Sporting Goods, with edge weights 30%, 90%, 5%, and ≈0.01%. Labels: Contextual, Taxonomical.]

Page 20: TECHNOLOGIES LARGE-SCALE SEARCH

Contextual graph

[Figure: contextual graph over the nodes Nike, Shoes, Running, Basketball, Sporting Goods, Air Max, Cortez, Lebron James, and Adidas, with weighted edges ranging from ≈0% to 90%.]
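One way to picture a graph with percentage-weighted edges is as a directed graph whose weight on the edge a → b is the share of contexts mentioning a that also mention b. This interpretation is an assumption made for illustration. The slides do not say how the weights are computed, and the contexts below are hypothetical.

```python
from collections import Counter
from itertools import permutations


def build_contextual_graph(contexts):
    """Return {(a, b): weight}, where weight is the share of contexts
    containing a that also contain b (an assumed definition)."""
    entity_count = Counter()
    pair_count = Counter()
    for ctx in contexts:
        entities = set(ctx)
        entity_count.update(entities)
        pair_count.update(permutations(entities, 2))  # all ordered pairs
    return {(a, b): n / entity_count[a] for (a, b), n in pair_count.items()}


# Hypothetical contexts, e.g. entities found together on one webpage.
contexts = [
    {"Nike", "Air Max"},
    {"Nike", "Air Max", "Running"},
    {"Nike", "Basketball", "Lebron James"},
    {"Nike", "Sporting Goods"},
    {"Air Max", "Nike", "Shoes"},
]
graph = build_contextual_graph(contexts)
print(graph[("Air Max", "Nike")], graph[("Nike", "Air Max")])
```

Under this definition the weights are asymmetric: every context with Air Max also mentions Nike, but only three of the five Nike contexts mention Air Max. The weights also change as new contexts arrive, unlike a fixed taxonomy.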

Page 21: TECHNOLOGIES LARGE-SCALE SEARCH

Graph statistics

Input: 5 billion webpages

10 billion entities

300 billion edges

10 million entities added daily

Entity types: Cities, Books, Shoes, Restaurants, Cars, Songs, Quotes, Company, Sites
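A quick back-of-the-envelope reading of these figures, using only the numbers on the slide (the derived ratios are simple averages, not claims from the source):

```python
pages = 5_000_000_000          # input webpages
entities = 10_000_000_000      # graph nodes
edges = 300_000_000_000        # graph edges
daily_new_entities = 10_000_000

edges_per_entity = edges / entities            # average edges per entity
entities_per_page = entities / pages           # average entities per input page
daily_growth = daily_new_entities / entities   # fraction of the graph added per day

print(edges_per_entity, entities_per_page, daily_growth)
```

On average this gives 30 edges per entity, 2 entities per input page, and daily growth of about 0.1% of the entity count.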

Page 22: TECHNOLOGIES LARGE-SCALE SEARCH

Challenges

Data collection

What to extract

How to connect

Scalable architecture

Page 23: TECHNOLOGIES LARGE-SCALE SEARCH

Solution

Neuroscience instead of NLP

Brain-inspired auto-associative networks and statistical pattern recognition
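The slides do not say which auto-associative model is used. As a generic illustration only, the classic Hopfield network is one well-known auto-associative memory: it stores patterns with a Hebbian rule and recovers a stored pattern from a corrupted copy. The sketch below is not the author's method, and the patterns are hypothetical.

```python
def train(patterns):
    """Hebbian outer-product rule over +1/-1 patterns of equal length."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, state, max_sweeps=10):
    """Update units one at a time until the state stops changing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w[i][j] * state[j] for j in range(len(state)))
            new = 1 if field >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state


# Two hypothetical, mutually orthogonal patterns.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train(stored)
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]  # first pattern with one unit flipped
print(recall(weights, noisy) == stored[0])  # True
```

The network completes a partial or noisy cue into the full stored pattern, which is the associative behavior the slide alludes to.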

Page 24: TECHNOLOGIES LARGE-SCALE SEARCH

Technology

Applications to Search

Page 25: TECHNOLOGIES LARGE-SCALE SEARCH

Possible query forms

Greater Toronto Area

Page 26: TECHNOLOGIES LARGE-SCALE SEARCH

Possible query forms

Page 27: TECHNOLOGIES LARGE-SCALE SEARCH

Query augmentation

Page 28: TECHNOLOGIES LARGE-SCALE SEARCH

Contextual Graph

It’s all about the context