
Term Project Proposal

Google Web Search Engine

CSCI 5408 Theory of Graph

Computer Science and Engineering

University of Colorado Denver

Yuguang Qiu ([email protected])

2014.4.3

Abstract

Above all, my research focuses on the general ideas behind the graph algorithms underlying the Google search engine. Basically, probability, PageRank, and adjacency matrices are fundamental concepts that contribute to an efficient and accurate web search. My term project discusses these concepts and presents my own understanding and findings of how they work in the web search field. Afterward, I will write a mini-implementation in Mathematica to demonstrate my ideas on the topic. In addition, I will dialectically analyze the current methodologies used to implement Google's web search, then point out their advantages and disadvantages from my perspective.

Motivation and Problem

As is known to all, web search engines dominate website traffic; these engines are popular around the world and have become an important part of our daily life. Google, Baidu, and Bing are perfect examples of successful web search engines. We may raise the question: how does a search engine discover a satisfying web link or page for users from an Internet that contains millions of websites and links that may relate to the search? It is an interesting but comprehensive question, and meanwhile very complicated to implement with proper algorithms. My motivation is to gain both depth and breadth of knowledge of this project and, besides, to implement my ideas in Mathematica.


Solutions

- Eigenvector & Probability

Basically, my research on the Google web search engine is related to graph theory and tree concepts. Probability can contribute significantly to search engine optimization: landing page optimization, for example, is based on statistics, and statistics is based on probability theory.

The eigenvector method has acquired considerable currency today because of its web applications. When a user searches for a keyword on Google, it matters whether other important web sites link to a page; in simple words, the search engine needs to rank and identify the priority of web sites. Therefore, I want the rank of each web site to be proportional to the sum of the importances of all of the sites that link to it. This can be demonstrated with the following mathematical expressions [1]:

Suppose x1, x2, ..., xn are the importances of all n of the web sites in the world:

x1 = K(x2 + x3 + x4)
x2 = K(x1000 + x2000 + x22001 + x10000)
...

In these equations, K is the constant of proportionality, and the right side of each equation is the sum of the importances of all of the sites that link to the one on the left. Now I extend this to a giant n×n matrix A whose (i, j) entry is 1 if web site j links to web site i and 0 otherwise; the whole system then reads x = KAx, so Ax = (1/K)x. Thus ranking the web sites by priority becomes the problem of finding an eigenvector of A.

Actually, Google uses a variant of this idea to find the importance of a large number of web sites. When you request a search, the results that pop up on the screen are listed in decreasing order of importance, which means the first page of results should have the strongest connection to your query. This methodology saves a lot of time in finding the best-matching links among all displayed results. On the other hand, it introduces other problems: for example, a popular web link might be flooded by traffic, leading to a server crash. The eigenvector method has many applications in ranking things or people by measuring the influence they have over each other.
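To make this concrete, here is a minimal Mathematica sketch of eigenvector ranking on a hypothetical four-site web; the link matrix below is an invented example, not real data:

(* Hypothetical 4-site web: A[[i, j]] is 1 when site j links to site i. *)
A = {{0, 1, 1, 1},
     {1, 0, 0, 1},
     {1, 1, 0, 0},
     {0, 1, 1, 0}};
(* Scale each column to sum to 1, so column j spreads site j's
   importance evenly over the sites it links to. *)
S = Transpose[#/Total[#] & /@ Transpose[A]];
(* The importance vector satisfies S.x == x, so it is the dominant
   eigenvector of S (its largest eigenvalue is 1). *)
x = First[Eigenvectors[N[S]]];
rank = Abs[x]/Total[Abs[x]] (* normalized importances; larger means higher priority *)

Sorting the sites by the entries of rank gives the order in which a search engine would list them.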


- Page Rank

My research will introduce the sensitivity of PageRank to connection errors. We know that large networks often contain "connection errors" in which edges are mistakenly recorded. For example, a set of extra edges can be wrongly added, a set of edges can be wrongly deleted, or a set of edges may be "transposed", that is, an edge that should be recorded as going from node A to node B is falsely recorded as going from node A to node C [2].

I will use a Mathematica demonstration to examine the sensitivity of the PageRank measure of node centrality to these connection errors in a random graph. In simple words, here are the steps (a non-interactive sketch follows the list):

Step 1: Choose the number of nodes and edges in the original random graph by moving the "variant" slider.

Step 2: Choose an example of a random graph satisfying these criteria for the original random graph.

Step 3: Choose how many random edges to add, delete, or transpose from the graph using the "operation" control in conjunction with the "error parameter" slider.

- Adjacency Matrices

An adjacency matrix is a means of representing which vertices (or nodes) of a graph are adjacent

to which other vertices. Another matrix representation for a graph is the incidence matrix. [3]

Here

is an example of an Adjacency Matrice. I will demonstrate how this is used in search engine.
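As a small illustration, this Mathematica snippet builds an invented three-site directed graph and reads off its adjacency matrix:

(* A hypothetical 3-site web: an edge i -> j means site i links to site j. *)
g = Graph[{1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1}];
(* AdjacencyMatrix returns a sparse matrix; MatrixForm displays it. *)
MatrixForm[AdjacencyMatrix[g]]

Note that Mathematica puts a 1 in entry (i, j) for an edge from vertex i to vertex j, which is the transpose of the convention used in the eigenvector discussion above.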

References

1. Kenneth Appel and Wolfgang Haken. A proof of the four color theorem. Discrete Math., 16(2):179–180, 1976.
2. http://demonstrations.wolfram.com/TheSensitivityOfPageRankToConnectionErrors/
3. http://www1.chapman.edu/library/instruction/tutorials/searchmethod/prox.html