Term Project Proposal
Google Web Search Engine
CSCI 5408 Theory of Graph
Computer Science and Engineering
University of Colorado Denver
Yuguang Qiu ([email protected])
2014.4.3
Abstract
My research focuses on the general ideas behind the graph algorithms underlying the Google search engine. Probability, PageRank, and adjacency matrices are fundamental concepts that contribute to efficient and accurate web search. This term project discusses these concepts and presents my own understanding and findings of how they work in the web search field, together with a mini-implementation in “Mathematica” to demonstrate the ideas. In addition, I will critically analyze the methodologies currently used to implement Google’s web search and point out their advantages and disadvantages from my perspective.
Motivation and Problem
Web search engines dominate website traffic. These engines are popular around the world and have become an important part of our daily lives; Google, Baidu, and Bing are prime examples of successful web search engines. This raises a question: how does a search engine find a satisfying web link or page for the user from an Internet that contains millions of websites and links potentially related to the search? It is an interesting but broad question, and a very complicated one to solve with suitable algorithms. My motivation is to gain both depth and breadth of knowledge on this topic and to implement my ideas in “Mathematica”.
Solutions
- Eigenvector & Probability
My research on the Google web search engine draws on graph theory and tree concepts. Probability can contribute significantly to search engine optimization: for example, landing page optimization is based on statistics, and statistics is based on probability theory.
The eigenvector is a method that has acquired considerable currency today because of web applications.
When a user searches for a keyword on Google, it matters whether other important web sites link to a page; in simple terms, the search engine needs to rank the web sites and identify their priority. Therefore, I want the rank of each web site to be proportional to the sum of the importance of all of the sites that link to it. This can be expressed mathematically as follows [1]:
Suppose x1, x2, ..., xn are the importance scores of all n web sites in the world. Then

x1 = K(x2 + x3 + x4)
x2 = K(x1000 + x2000 + x22001 + x10000)
...

In these equations, K is the constant of proportionality, and the right side of each equation is the sum of the importance of all of the sites that link to the one on the left. This system can be rewritten using a giant N×N matrix A whose (i, j) entry is 1 if web site j links to web site i and 0 otherwise.
Ranking the web sites by priority thus becomes the problem of finding an eigenvector of A.
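Although the proposal plans a “Mathematica” implementation, the eigenvector-ranking idea can be sketched in Python as well. The sketch below uses a small hypothetical 4-site link matrix (the sites and links are invented for illustration) and power iteration, a standard way to approximate the dominant eigenvector whose entries give the importance scores x_i:

```python
import numpy as np

# Hypothetical 4-site link graph: A[i][j] = 1 if site j links to site i.
A = np.array([
    [0, 1, 1, 1],   # sites 2, 3, and 4 link to site 1
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

# Power iteration: repeatedly apply A and renormalize. The vector
# converges to the dominant eigenvector, whose entries are the
# importance scores x_i (up to the constant of proportionality K).
x = np.ones(4) / 4
for _ in range(100):
    x = A @ x
    x = x / np.linalg.norm(x, 1)

# Sites ordered from most to least important.
ranking = np.argsort(-x)
```

For this particular matrix the dominant eigenvalue is 2 and the importance vector is proportional to (6, 5, 3, 4), so site 1, which receives the most inbound links, ranks first.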
The Google web search engine uses a variant of this idea to estimate the importance of a large number of web sites. When you request a search, the results that appear on the screen are listed in decreasing order of importance, so the first page of results tends to have the strongest connection to your query. This methodology saves you considerable time in finding well-matched links among all displayed results. On the other hand, it introduces other problems: for example, a popular web link might be flooded with traffic, possibly crashing its servers. Eigenvectors have many applications to ranking things or people by measuring the influence they have over each other.
- Page Rank
My research will examine the sensitivity of PageRank to connection errors. Large networks often contain “connection errors” in which edges are mistakenly recorded. For example, a set of extra edges can be wrongly added, a set of edges can be wrongly deleted, or a set of edges may be “transposed”; that is, an edge that should be recorded as going from node A to node B is falsely recorded as going from node A to node C [2].
I will build a “Mathematica” demonstration to examine the sensitivity of the PageRank measure of node centrality to these “connection errors” in a random graph. In simple terms, the steps are:
Step 1: Choose the number of nodes and edges in the original random graph by moving the
"variant" slider.
Step 2: Choose an example of a random graph satisfying these criteria for the original random
graph.
Step 3: Choose how many random edges to add, delete, or transpose from the graph using the
"operation" control in conjunction with the "error parameter" slider.
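The experiment behind these steps can be sketched outside of “Mathematica” as well. The Python sketch below (a simplified stand-in for the planned demonstration, with invented sizes and a basic damped power-iteration PageRank rather than the Wolfram implementation) builds a random directed graph, deletes a few random edges as a “connection error”, and recomputes PageRank so the two rankings can be compared:

```python
import random
import numpy as np

def pagerank(edges, n, d=0.85, iters=100):
    """Damped power-iteration PageRank over a directed edge list."""
    out_deg = [0] * n
    for u, v in edges:
        out_deg[u] += 1
    r = np.ones(n) / n
    for _ in range(iters):
        new = np.full(n, (1 - d) / n)
        for u, v in edges:
            new[v] += d * r[u] / out_deg[u]
        # Dangling nodes (no outgoing edges) spread their rank uniformly.
        for u in range(n):
            if out_deg[u] == 0:
                new += d * r[u] / n
        r = new
    return r

random.seed(0)
n = 20  # illustrative sizes, not the demonstration's actual parameters
edges = list({(random.randrange(n), random.randrange(n)) for _ in range(60)})
edges = [(u, v) for u, v in edges if u != v]

base = pagerank(edges, n)

# "Connection error": delete a few random edges and recompute PageRank.
corrupted = edges[:]
for e in random.sample(corrupted, 5):
    corrupted.remove(e)
perturbed = pagerank(corrupted, n)

# Sensitivity can then be judged by how much the ranking order changes.
base_order = np.argsort(-base)
pert_order = np.argsort(-perturbed)
```

The same loop structure accommodates the other two error types: adding random edges, or "transposing" an edge by replacing its target with a random node.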
- Adjacency Matrices
An adjacency matrix is a means of representing which vertices (or nodes) of a graph are adjacent
to which other vertices. Another matrix representation of a graph is the incidence matrix [3].
Below is an example of an adjacency matrix; I will demonstrate how it is used in a search engine.
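As a small illustration (using an invented 4-vertex graph, in Python rather than the planned “Mathematica” notebook), an adjacency matrix can be built directly from an edge list, and its powers count walks through the graph, which is the property link-analysis methods exploit when propagating importance along links:

```python
import numpy as np

# Adjacency matrix for a hypothetical 4-vertex directed graph:
# A[i][j] = 1 means there is an edge from vertex i to vertex j.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1

# (A^k)[i][j] counts the walks of length k from vertex i to vertex j.
A2 = np.linalg.matrix_power(A, 2)
```

Here, for instance, A2[0][3] is 1 because the only length-2 walk from vertex 0 to vertex 3 is 0 → 2 → 3.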
Reference
1. Kenneth Appel and Wolfgang Haken. A proof of the four color theorem. Discrete Math.,
16(2):179–180, 1976.
2. http://demonstrations.wolfram.com/TheSensitivityOfPageRankToConnectionErrors/
3. http://www1.chapman.edu/library/instruction/tutorials/searchmethod/prox.html