
How Does Google Work? The Technology behind Google's Great Results

Emre Altug Yavuz Ph.D. candidate

Data Communications Lab.

Electrical & Computer Engineering

University of British Columbia (UBC)

Vancouver, BC, CANADA

2004 © Emre A. Yavuz. EECE, UBC


What is Google?

A fully automated search engine, which employs robots known as “spiders” to crawl the web frequently and find sites for inclusion in the Google database or index.


Some Google Factoids

• Named for the mathematical term “googol”, or 10^100, the number represented by the numeral 1 followed by 100 zeros.

• Global unique users per month: 81.9 million.

• Selected by Yahoo (2000) and AOL (2002) as their search engine partner.

• Indexes the largest number of Internet-accessible documents.

• Designed to scale well to extremely large data sets.

• Efficient usage of storage space to store the index.

• Optimized data structures for fast and efficient access.


Who invented it, when and why?

• In the early 1990s, search engines started springing up out of academic projects.

• The low quality of the results returned by existing, poorly designed search engines paved the way for the birth of Google.

• Designed and created by Sergey Brin and Larry Page.

• On September 7, 1998, Google Inc. opened its doors in a garage in Menlo Park, California.


How does Google work?

• When you perform a Google search, you are not actually searching the web, but rather an index of a copy of the web stored on Google’s servers.

• The index is compiled from all the pages returned by a multitude of spiders, collectively called GoogleBot, that crawl the web.

• When a user types in a query, the search terms are looked up in the index, and the results are then returned from a separate set of document servers, along with advertisements.

• All of these pieces are assembled, with the help of Google’s PageRank technology, into the page of search results.
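
The idea of searching an index rather than the web itself can be sketched in a few lines of Python. This is only an illustration: the three pages, their URLs, and the function names `build_index` and `search` are invented for the example and say nothing about how Google's servers are actually organised.

```python
from collections import defaultdict

# Hypothetical crawled pages (stand-ins for the copy of the web fetched by spiders).
PAGES = {
    "a.example/home": "google crawls the web with spiders",
    "b.example/faq": "spiders fetch pages for the index",
    "c.example/ads": "ads appear next to search results",
}

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every query term, sorted by URL."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches)

INDEX = build_index(PAGES)
print(search(INDEX, "spiders"))
print(search(INDEX, "spiders index"))
```

The query never touches `PAGES` directly: once the index is built, a lookup is a handful of set operations, which is why answering from an index is fast.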


What is PageRank?

• A method of measuring a page’s “importance”.

• An application of academic citation analysis to the web.

• An extension of simple citation (backlink) counting: links from different pages are not counted equally, and each link is normalized by the number of links on the page it comes from.

• If pages t1 to tn link to page A, the PageRank of page A is given by:

PR(A) = (1 - d) + d · (PR(t1)/C(t1) + … + PR(tn)/C(tn))

where C(A) is the number of links going out of page A, and d is a damping factor between 0 and 1 (usually set to 0.85).
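
Because each PR value depends on the PR of other pages, the formula is normally evaluated by repeated substitution until the values settle. The following Python sketch does that for a made-up three-page web. The link graph, the starting value of 1.0, the fixed count of 50 iterations, and the choice d = 0.85 are assumptions of the example, not details given in the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # arbitrary starting guess
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page t that links here contributes PR(t) divided by its
            # number of outgoing links C(t).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical web: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
RANKS = pagerank(LINKS)
for page in sorted(RANKS):
    print(page, round(RANKS[page], 3))
```

In this toy graph C ends up with the highest value because it receives links from both other pages, while B, which receives only half of A's vote, ends up lowest.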


How to tell what the PageRank of a page is

• Download the toolbar from http://toolbar.google.com.

• Once it is installed, there will be a bar graph at the top of the browser showing a version of PageRank for the page being browsed.

• Hold the mouse over the bar to see a number from 0 to 10.

• The number only gives you a rough idea and is not very accurate. If the page being viewed is not in the index but a closely related page is, the toolbar sometimes guesses. It is just a representation of the actual PageRank.

• Whilst PageRank is linear, Google uses a non-linear graph to portray it.


How significant is PageRank?

• The significance of any factor in search engine algorithms depends on the quality of the information it supplies.

• A factor’s importance is known as its weight.

• Originally, when the Meta keyword tag was new, it could be used as an indicator of what the page was about.

• However, its weight quickly approached zero, because webmasters could easily abuse and manipulate it.

• PageRank is harder to manipulate, but manipulating it is not impossible.


Is PageRank enough to determine the quality of a page? (1)

“People only link to pages they think are good.” However, there may be other reasons for a link, such as:

Reciprocal links – “Link to me and I’ll link to you.”

Link requirements – “Using our script requires you to put a link to our website.” or “We’ll give you an award in return for a link to our website.”

Friends and family – “This is my friend Pete’s site.”

Free Page Add-ons – “This counter is provided by …”


Is PageRank enough to determine the quality of a page? (2)

• If a webmaster picks outbound links by searching on Google, then PageRank itself influences the number of links to a page, in a circular way.

• Thus the links will no longer be based solely on human judgement and the increase will not be solely because it is a good page, but because its PageRank is already high.

Therefore, PageRank is not enough to produce high precision results.


Other System Features

• Title tag – the most important factor, since most engines and directories place a high level of importance on it.

• Proximity of search terms – how often do they appear? How close together are they?

• Text characteristics – font size and type; search terms in a larger or bolder font are weighted higher than others.

• Anchor text – anchors often provide more accurate descriptions of web pages than the pages themselves. They may also exist for documents that cannot be indexed by a text-based search engine, such as images, programs and databases.
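
The two proximity questions (how often, and how close together) can be made concrete with a small Python sketch. The sample sentence and the helper names `term_positions` and `min_distance` are invented for illustration, and real engines certainly use more refined measures than a raw word gap.

```python
def term_positions(text, term):
    """Return the word offsets at which a term occurs in the text."""
    words = text.lower().split()
    return [i for i, w in enumerate(words) if w == term.lower()]

def min_distance(text, term_a, term_b):
    """Smallest gap, in words, between two terms; None if either is absent."""
    pos_a = term_positions(text, term_a)
    pos_b = term_positions(text, term_b)
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

DOC = "cheap flights to vancouver and cheap hotels near vancouver airport"
print(len(term_positions(DOC, "cheap")))     # how often the term appears
print(min_distance(DOC, "cheap", "hotels"))  # how close together two terms are
```

A smaller distance suggests that the terms belong together in the text, so a page where "cheap" sits right next to "hotels" would plausibly be a better match for that query than one where the words are far apart.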


The difference between PageRank and other factors

• Title tag – can only be listed once.

• Keywords in body text – each successive repetition is less important. Proximity is important.

• Anchor text – highly weighted, but like keywords in body text, there is a cutoff point where further anchor text is no longer worthwhile.

• PageRank – potentially infinite. You are always capable of increasing your PageRank significantly, but it takes work.


How does Google rank pages?

• Find all pages matching the keywords of the search.

• Rank them using “on-the-page factors”, such as whether the keywords appear in bold or in a relatively larger font.

• Take the inbound anchor text into account.

• Adjust the results by PageRank scores.
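
The four steps above can be sketched as a tiny scoring pipeline in Python. Everything numeric here is invented: the slides do not say how the factors are weighted or combined, so the addition of the on-page and anchor scores and the multiplication by PageRank are assumptions chosen only to make the ordering of steps visible.

```python
# Hypothetical candidate pages for one query. All scores are made up.
CANDIDATES = [
    {"url": "a.example", "on_page": 0.9, "anchor": 0.2, "pagerank": 1.0},
    {"url": "b.example", "on_page": 0.6, "anchor": 0.8, "pagerank": 4.0},
    {"url": "c.example", "on_page": 0.0, "anchor": 0.0, "pagerank": 9.0},
]

def rank(candidates):
    # Step 1: keep only pages that match the keywords at all.
    matching = [c for c in candidates if c["on_page"] > 0 or c["anchor"] > 0]

    def score(c):
        # Steps 2-3: combine on-the-page factors with inbound anchor text.
        relevance = c["on_page"] + c["anchor"]
        # Step 4: adjust the result by PageRank (assumed here: a simple product).
        return relevance * c["pagerank"]

    return sorted(matching, key=score, reverse=True)

ORDER = [c["url"] for c in rank(CANDIDATES)]
print(ORDER)
```

Note that the page with the highest PageRank is dropped in step 1 because it does not match the query at all: in this sketch PageRank only reorders pages that are already relevant.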


System Anatomy (1)

• Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.

• The URLserver sends lists of URLs to be fetched to the crawlers.

• The fetched web pages are sent to the storeserver to be compressed and stored in a repository.

• Every web page has an associated ID number called a docID.

• The indexer reads the repository, uncompresses the documents and parses them into a set of word occurrences called hits.
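
A minimal Python sketch of the store-then-index flow described above follows. It is only a toy: the in-memory `REPOSITORY` dictionary, the sequential docID scheme, and the function names are assumptions of the example, `zlib` is used simply because it ships with Python, and a hit here records only the word and its position.

```python
import zlib

REPOSITORY = {}  # docID -> compressed page text (stand-in for the repository)

def store(url_to_doc_id, url, text):
    """Assign a docID to the URL (reusing an existing one) and store the page compressed."""
    doc_id = url_to_doc_id.setdefault(url, len(url_to_doc_id))
    REPOSITORY[doc_id] = zlib.compress(text.encode("utf-8"))
    return doc_id

def index_document(doc_id):
    """Uncompress a stored page and turn it into (word, position) hits."""
    text = zlib.decompress(REPOSITORY[doc_id]).decode("utf-8")
    return [(word.lower(), pos) for pos, word in enumerate(text.split())]

DOC_IDS = {}
FIRST = store(DOC_IDS, "http://a.example/", "Google crawls the web")
SECOND = store(DOC_IDS, "http://b.example/", "The web is large")
print(FIRST, SECOND)
print(index_document(SECOND))
```

Keeping storage and indexing separate, as the slide describes, means the index can be rebuilt from the repository without crawling the web again.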


[Figure: High Level Google Architecture]


System Anatomy (2)

• The hits record the word, its position in the document, font size and capitalization.

• The indexer distributes these hits into a set of barrels. It also parses out all the links in every web page and stores important information about them in an anchors file.

• The URLresolver reads the anchors file and converts relative URLs into absolute URLs and, in turn, into docIDs.

• The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID. It also produces a list of wordIDs and offsets into the inverted index.
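
The re-sorting step is what turns "which words are in this document" into "which documents contain this word". The Python sketch below shows that inversion on a tiny hand-made barrel, followed by the kind of relative-to-absolute URL conversion the URLresolver performs. All docIDs, wordIDs, and URLs are invented, and the dictionary layout is an assumption of the example rather than the real barrel format.

```python
from collections import defaultdict
from urllib.parse import urljoin

# Forward barrel: hits grouped by docID, as an indexer might emit them.
# Each hit is (wordID, position).
FORWARD = {
    0: [(7, 0), (3, 1)],
    1: [(3, 0), (9, 1)],
}

def invert(forward):
    """Re-sort docID-ordered hits by wordID to obtain an inverted index."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, position in forward[doc_id]:
            inverted[word_id].append((doc_id, position))
    return dict(sorted(inverted.items()))

INVERTED = invert(FORWARD)
print(INVERTED)

# URL resolution: a relative link found on a page becomes an absolute URL.
ABSOLUTE = urljoin("http://a.example/docs/index.html", "../about.html")
print(ABSOLUTE)
```

With the inverted form, answering a query for wordID 3 is a single lookup that returns every document and position at once.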


System Anatomy (3)

• A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon.

• The searcher is run by a webserver and uses the lexicon together with the inverted index and the PageRank to answer queries.


How does Google make money?

• Initially, sold targeted banner advertisements and provided search services to other websites including Yahoo.

• Later, launched AdWords – a system for automatically selling and displaying advertisements alongside search results. The ads are also ranked according to their popularity.

• Using the base created by AdWords, launched a context-targeted advertisement system – AdSense.

• Google’s “next generation corporate software” – query and document update software, released on June 2, 2004.


How do you maximize your place on Google? (1)

• Make sure that all your pages are indexed in the first place.

• Pay a great deal of attention to your web page titles.

• Have keywords well represented in the body of the web page.

• Add content to your pages and to your website; Google likes sites with lots of content.

• Use keywords as hyperlink names.


How do you maximize your place on Google? (2)

• Have a good system of navigation between your web pages, because PageRank gets passed along the internal links of a website.

• Get external links to as many pages on your site as you can. Each external link will add to the PageRank not only of the page that is linked, but also of every webpage on your site, if you have good site navigation.

• Do not submit a redirection web page. Most search engines will skip your web site completely in that case.

• Try to avoid using frames in your web site.


References

• “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page.

• “PageRank Uncovered”, Chris Ridings and Mike Shishigin.

• “Google! Everything you always wanted to know, but didn’t have time to find out”, Judy Broom, Betsy Chessler and Katherine Foster.

• And not surprisingly http://www.google.com


THANKS

Questions?


Some Features of Google (1)

• daterange: limits your search to a particular date or range of dates that a page was indexed by Google.

• It only works with Julian dates, so you’ll need to find a Julian date converter online. The Julian date must be an integer (no decimals).

• Usage: daterange:start-stop

e.g. stjohns daterange:2452401-2452766
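
Instead of an online converter, the integer Julian day number can be computed with Python's standard library. The sketch below relies on the fixed offset of 1721425 between `date.toordinal()` and the Julian day number; the function names are made up for the example. Decoding the range in the example above with it gives 6 May 2002 to 6 May 2003.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian day number.
JDN_OFFSET = 1721425

def to_julian_day(d):
    """Integer Julian day number for a calendar date."""
    return d.toordinal() + JDN_OFFSET

def from_julian_day(jdn):
    """Calendar date for an integer Julian day number."""
    return date.fromordinal(jdn - JDN_OFFSET)

def daterange_query(terms, start, stop):
    """Build a query string of the form 'terms daterange:start-stop'."""
    return f"{terms} daterange:{to_julian_day(start)}-{to_julian_day(stop)}"

print(from_julian_day(2452401), from_julian_day(2452766))
print(daterange_query("stjohns", date(2002, 5, 6), date(2003, 5, 6)))
```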


Some Features of Google (2)

• filetype: restricts your results to files ending in ".doc" (or .xls, .ppt, etc.), and shows you only files created with the corresponding program.

• The “dot” in the file extension – .doc – is optional.

• Usage: filetype:extension

e.g. stjohns -filetype:pdf (the leading “-” excludes PDF files; without it, only PDF files are returned)


Some Features of Google (3)

• inanchor: restricts the results to text in a page’s link anchors.

• Usage: inanchor:terms

e.g. stjohns -inanchor:"ubc" (the leading “-” excludes pages that match)

• intext: ignores link text, URLs, and titles, and searches only body text. It helps you find query words that are too common in URLs and links.

• Usage: intext:terms

e.g. stjohns -intext:"ubc.ca"


Some Features of Google (4)

• intitle: restricts the results to documents containing a particular word in their title.

• inurl: restricts the results to documents containing a particular word in their URL.

• site: restricts the results to pages from a given website or domain.

• cache: shows the version of a web page that Google has in its cache.
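
Operators of the form name:value are simply part of the query text, so they can be assembled programmatically. The Python sketch below builds such a query and URL-encodes it. The helper names, the sample values, and the use of the familiar /search?q= URL form are assumptions of the example, not something taken from the slides.

```python
from urllib.parse import urlencode

def build_query(terms, **operators):
    """Join plain terms with operator:value pairs, e.g. site:ubc.ca."""
    parts = [terms] + [f"{name}:{value}" for name, value in operators.items()]
    return " ".join(p for p in parts if p)

def search_url(query):
    # Assumes the familiar /search?q=... URL form.
    return "http://www.google.com/search?" + urlencode({"q": query})

QUERY = build_query("stjohns", site="ubc.ca", intitle="college")
print(QUERY)
print(search_url(QUERY))
```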


Some Features of Google (5)

• link: restricts the results to those web pages that have links to the specified URL.

• related: lists web pages that are "similar" to a specified web page.

• info: presents some information that Google has about a particular web page.


Some Features of Google (6)

• There are actually three different Google phonebook operators.

• Using phonebook: searches the entire Google phonebook.

• Using rphonebook: searches residential listings only.

• Using bphonebook: searches business listings only.


Some Features of Google (7)

• If you begin a query with stocks: Google will treat the rest of the query terms as stock ticker symbols, and will link to a Yahoo Finance page showing stock information for those symbols.

• If you begin a query with define: Google will display definitions for the word or phrase that follows, if definitions are available.
