Basic WWW Technologies & Mathematic Background
(Chap 2 & 1, Baldi)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University
2006/10/5
World Wide Web
• The World Wide Web (Web) is a network of information resources.
• The Web relies on three mechanisms to make these resources available:
1. A uniform naming scheme for locating resources on the Web (e.g., URIs).
2. Protocols for access to named resources over the Web (e.g., HTTP).
3. Hypertext for easy navigation among resources (e.g., HTML).
Internet vs. Web
• Internet:
– A more general term
– Includes the physical aspects of the underlying networks and mechanisms such as email, FTP, HTTP…
• Web:
– Associated with information stored on the Internet
– Also refers to a broader class of networks, e.g., the web of English literature
• Both the Internet and the Web are networks
Essential Components of WWW
• Resources (HTML, HyperText Markup Language)
– Conceptual mappings to concrete or abstract entities, which do not change in the short term
– Tagging support for structuring and laying out documents
• Resource identifiers (hyperlinks):
– Strings of characters that represent generalized addresses and may contain instructions for accessing the identified resource
– http://www.google.com/ is used to identify the Google homepage
• Transfer protocols (HTTP, HyperText Transfer Protocol)
– Conventions that regulate the communication between a browser (web user agent) and a server
Standard Generalized Markup Language (SGML)
• Based on GML (Generalized Markup Language), developed by IBM in the 1960s
• An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document
– Markup: extra information characterizing the structure of a document
• Gave birth to the Extensible Markup Language (XML), a W3C recommendation in 1998
SGML Components
• SGML documents have three parts:
– Declaration: specifies which characters and delimiters may appear in the application
– DTD (Document Type Definition)/style sheet: defines the syntax of markup constructs
– Document instance: the actual text (with the tags) of the document
• More information can be found at: http://www.W3.Org/markup/SGML
HTML Background
• HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.
• The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.
• HTML standards are organized by W3C : http://www.w3.org/MarkUp/
HTML Functionalities
• HTML gives authors the means to:
– Publish online documents with headings, text, tables, lists, photos, etc.
– Include spreadsheets, video clips, sound clips, and other applications directly in their documents
– Link information via hypertext links, at the click of a button
– Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.
Sample Webpage: HTML Structure
• <HTML>
• <HEAD>
• <TITLE>The title of the webpage</TITLE>
• </HEAD>
• <BODY> <P>Body of the webpage
• </BODY>
• </HTML>
HTML Structure
• An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>)
• The title of the document appears in the head (along with other information about the document)
• The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>
HTML Hyperlink
• <a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource to another
• It has two ends, called anchors, and a direction
• Starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document)
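As a rough illustration of source and destination anchors, the sketch below collects the destinations of every `<a href=...>` element on a page, using only Python's standard library. The base URL `http://www.example.edu/index.html` is a made-up placeholder and is not taken from the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the destination of every <a href="..."> source anchor."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page that contains the links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve a relative destination against the source page.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = AnchorCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical source page containing the link from the slide.
page = '<a href="relations/alumni">alumni</a>'
print(extract_links(page, "http://www.example.edu/index.html"))
```

The relative destination `relations/alumni` only becomes a complete address once it is resolved against the page it appears on, which is why the base URL is passed in.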
Resource Identifiers
• Uniform Resource Identifiers (URIs) include two overlapping subsets of identifiers:
– URL: Uniform Resource Locators
– URN: Uniform Resource Names
Introduction to URIs
• Every resource available on the Web has an address that may be encoded by a URI
• URIs typically consist of three pieces:
– The naming scheme of the mechanism used to access the resource (e.g., HTTP, FTP)
– The name of the machine hosting the resource
– The name of the resource itself, given as a path
URI Example
• http://www.w3.org/TR
• There is a document available via the HTTP protocol
• Residing on the machines hosting www.w3.org
• Accessible via the path "/TR"
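The three pieces of this example URI can be pulled apart with Python's standard `urllib.parse.urlsplit`. This is only a small sketch; the function name `uri_pieces` is an illustrative choice.

```python
from urllib.parse import urlsplit


def uri_pieces(uri):
    """Return (scheme, host, path) for a URI such as http://www.w3.org/TR."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


scheme, host, path = uri_pieces("http://www.w3.org/TR")
print(scheme)  # naming scheme of the access mechanism
print(host)    # machine hosting the resource
print(path)    # name of the resource, given as a path
```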
Protocols
• Describe how messages are encoded and exchanged
• Different Layering Architectures
• ISO OSI 7-Layer Architecture
• TCP/IP 4-Layer Architecture
TCP/IP Layering Architecture
• A simplified model that provides an end-to-end reliable connection
• The network layer
– Hosts drop packets into this layer, and the layer routes them towards the destination
– Only promises best-effort delivery ("try my best")
• The transport layer
– Reliable byte-oriented stream
Hypertext Transfer Protocol (HTTP)
• An application-layer protocol that runs over TCP, a connection-oriented transport protocol, and carries WWW traffic between a browser and a server
• One of the application protocols supported by the Internet
• HTTP communication is established via a TCP connection, by default to server port 80
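To make the message format concrete, the sketch below builds the bytes of a minimal HTTP GET request and parses the first line of a response. It does not open a network connection, and the canned response is invented for illustration. In a real client the request bytes would be written to a TCP socket connected to port 80 of the server.

```python
def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",  # request line: method, path, version
        f"Host: {host}",         # names the server being addressed
        "",                      # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response):
    """Return (version, status_code, reason) from raw response bytes."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


request = build_get_request("www.w3.org", "/TR")
print(request)

# Made-up response used only to exercise the parser.
canned = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<HTML></HTML>"
print(parse_status_line(canned))
```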
Form
• <HTML>
<Form action="http://140.116.246.174/cgi-bin/meshdb.cgi" method="post">
[1] Median Eminence (multiple selections allowed):
1. <input type="checkbox" name="Median Eminence" value="分泌"> Secretion
2. <input type="checkbox" name="Median Eminence" value="一般"> General
3. <input type="checkbox" name="Median Eminence" value="王錫崗"> 王錫崗
4. <input type="checkbox" name="Median Eminence" value="垂體"> Pituitary
Other: <input type="text" name="Median Eminence">
<input type="submit" value="Confirm">
</Form>
</HTML>
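When a form like this is submitted with method=post, the browser sends the checked values as name/value pairs in the request body. The sketch below shows one plausible encoding of two checked boxes using Python's `urllib.parse`. It assumes UTF-8 percent-encoding, which is `urlencode`'s default. The encoding the original page actually used is not stated in the slides.

```python
from urllib.parse import parse_qs, urlencode

# Two checked boxes share the field name "Median Eminence", so the
# name appears once per selected value.
selected = [
    ("Median Eminence", "分泌"),
    ("Median Eminence", "垂體"),
]

body = urlencode(selected)  # application/x-www-form-urlencoded text
print(body)

# What a server-side program would recover from that body.
decoded = parse_qs(body)
print(decoded)
```

Because every checkbox uses the same name, the receiving program sees a list of values for that single field.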
CGI (Common Gateway Interface)
[Figure: CGI architecture. Components: Web Browser, Web Server, CGI, Database. Flows: Service Request, Service Processing, Output, Service Response.]
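The request/response flow in the figure can be sketched as a tiny CGI-style handler. Under CGI the web server passes request details to the program through environment variables such as `QUERY_STRING`, and the program writes a header block, a blank line, and the output. The parameter name `q` and the dictionary standing in for the database are hypothetical.

```python
from urllib.parse import parse_qs

# Hypothetical stand-in for the database in the figure.
FAKE_DATABASE = {"alumni": "Alumni relations page"}


def handle_request(environ):
    """Build a CGI response from a mapping of CGI environment variables."""
    # Service request: the server exposes the query string to the program.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    key = query.get("q", [""])[0]

    # Service processing: look the key up in the stand-in database.
    result = FAKE_DATABASE.get(key, "no match")

    # Service response: headers, a blank line, then the output.
    return "Content-Type: text/plain\r\n\r\n" + result + "\n"


# A real CGI program would pass os.environ and write the result to stdout.
print(handle_request({"QUERY_STRING": "q=alumni"}))
```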
Homework (1)
• Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user.
• Homework: Develop a meta-search engine that responds to a user query with combined search results from a few search engines.
Domain Name System
• DNS (Domain Name System): maps domain names to IP addresses
• IPv4:
– IPv4 was initially deployed on January 1, 1983 and is still the most commonly used version.
– 32-bit address, written as a string of 4 decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255.
• IPv6:
– Revision of IPv4 with 128-bit addresses
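The dotted notation is just a readable way of writing one 32-bit number. The sketch below converts a dotted IPv4 string to its integer value by hand and cross-checks the result against Python's standard `ipaddress` module. The helper name `dotted_to_int` is an illustrative choice.

```python
import ipaddress


def dotted_to_int(dotted):
    """Convert an IPv4 address such as '255.255.255.255' to a 32-bit integer."""
    parts = dotted.split(".")
    if len(parts) != 4:
        raise ValueError("an IPv4 address has exactly 4 decimal numbers")
    value = 0
    for part in parts:
        octet = int(part)
        if not 0 <= octet <= 255:
            raise ValueError("each number must be between 0 and 255")
        value = (value << 8) | octet  # each number contributes 8 bits
    return value


print(dotted_to_int("0.0.0.0"))
print(dotted_to_int("255.255.255.255") == 2**32 - 1)
print(dotted_to_int("140.116.246.174") == int(ipaddress.IPv4Address("140.116.246.174")))
print(ipaddress.IPv6Address("::1").max_prefixlen)  # address width in bits for IPv6
```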
Top Level Domains (TLD)
• Top-level domain names: .com, .edu, .gov, and ISO 3166 country codes such as .de, .fr, .it
• There are three types of top-level domains:
– Generic domains were created for use by the Internet public
– Country code domains were created to be used by individual countries
– The .arpa (Address and Routing Parameter Area) domain is designated to be used exclusively for Internet-infrastructure purposes
Server Log Files
• Server Transfer Log: transactions between a browser and server are logged
– IP address, the time of the request
– Method of the request (GET, HEAD, POST…)
– Status code, a response from the server
– Size in bytes of the transaction
• Referrer Log: where the request originated
• Agent Log: browser software making the request (spider)
• Error Log: request resulted in errors (404)
Server Log Analysis
• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search engines
• What are the searched keywords
• How many clicks/page views a page received
• Error reports, like broken links
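A couple of these analyses can be sketched over a transfer log. The example assumes the widely used Common Log Format for each line. The sample entries, addresses, and paths are invented for illustration and do not come from a real server.

```python
import re
from collections import Counter

# One Common Log Format entry:
# ip ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Made-up sample log lines.
SAMPLE_LOG = [
    '192.0.2.1 - - [05/Oct/2006:10:00:00 +0800] "GET /index.html HTTP/1.0" 200 1043',
    '192.0.2.2 - - [05/Oct/2006:10:00:05 +0800] "GET /missing.html HTTP/1.0" 404 209',
    '192.0.2.1 - - [05/Oct/2006:10:01:00 +0800] "GET /index.html HTTP/1.0" 200 1043',
    '192.0.2.3 - - [05/Oct/2006:10:02:00 +0800] "POST /cgi-bin/search HTTP/1.0" 200 512',
]


def analyze(lines):
    """Return (page view counts, broken link counts) keyed by path."""
    page_views = Counter()
    broken_links = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if match["status"] == "404":
            broken_links[match["path"]] += 1
        else:
            page_views[match["path"]] += 1
    return page_views, broken_links


views, broken = analyze(SAMPLE_LOG)
print(views.most_common())  # most visited pages first
print(dict(broken))         # error report: requested paths that do not exist
```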
Search Engines
• According to Pew Internet & American Life Project Report (2002), search engines are the most popular way to locate information online
• About 33 million U.S. Internet users query search engines on a typical day.
• More than 80% have used search engines
• Search Engines are measured by coverage and recency
Web Crawler
• A crawler is a program that picks up a page and follows all the links on that page
• Crawler = Spider
• Types of crawler:
– Breadth First
– Depth First
Breadth First Crawlers
• Use breadth-first search (BFS) algorithm
• Get all links from the starting page, and add them to a queue
• Pick the 1st link from the queue, get all links on the page and add to the queue
• Repeat above step till queue is empty
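The steps above can be sketched on a toy link graph instead of the live Web. `TOY_WEB` is a hypothetical in-memory stand-in for fetching a page and extracting its links. The `visited` set is an addition not mentioned on the slide: without it, pages that link back to each other would keep the queue from ever emptying.

```python
from collections import deque

# Hypothetical "web": page name -> links found on that page.
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "A"],
    "D": [],
}


def bfs_crawl(start, get_links):
    """Visit pages level by level, returning them in crawl order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:                       # repeat until the queue is empty
        page = queue.popleft()         # pick the first link from the queue
        order.append(page)
        for link in get_links(page):   # get all links on the page
            if link not in visited:
                visited.add(link)
                queue.append(link)     # add them to the end of the queue
    return order


print(bfs_crawl("A", lambda page: TOY_WEB.get(page, [])))
# prints ['A', 'B', 'C', 'D']
```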
Depth First Crawlers
• Use depth first search (DFS) algorithm
• Get the 1st link not visited from the start page
• Visit link and get 1st non-visited link
• Repeat above step till no non-visited links
• Go to next non-visited link in the previous level and repeat 2nd step
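A recursive sketch of these steps on a toy link graph follows. `TOY_WEB` is a hypothetical stand-in for fetching pages, and returning from the recursive call plays the role of going back to the previous level. A real crawler would also need a depth limit, which is left out here.

```python
# Hypothetical "web": page name -> links found on that page.
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "A"],
    "D": [],
}


def dfs_crawl(start, get_links):
    """Follow the first non-visited link as deep as possible, then back up."""
    visited = set()
    order = []

    def visit(page):
        visited.add(page)
        order.append(page)
        for link in get_links(page):
            if link not in visited:   # first link not yet visited
                visit(link)           # go one level deeper
        # falling out of the loop returns to the previous level

    visit(start)
    return order


print(dfs_crawl("A", lambda page: TOY_WEB.get(page, [])))
# prints ['A', 'B', 'D', 'C']
```

On the same graph a breadth-first crawl would reach C before D. The depth-first crawl follows A, B, D to the end before returning to C.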
Coverage
• Overlap analysis used for estimating the size of the indexable web
• W: set of webpages
• Wa, Wb: pages crawled by two independent engines a and b
• P(Wa), P(Wb): probabilities that a page was crawled by a or b
– P(Wa) = |Wa| / |W|
– P(Wb) = |Wb| / |W|
Overlap Analysis
• P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb) = |Wa ∩ Wb| / |Wb|
• If a and b are independent:
– P(Wa ∩ Wb) = P(Wa) * P(Wb)
– P(Wa ∩ Wb | Wb) = P(Wa) * P(Wb) / P(Wb) = (|Wa| / |W|) * (|Wb| / |W|) / (|Wb| / |W|) = |Wa| / |W| = P(Wa)
Overlap Analysis
• Using |W| = |Wa| / P(Wa), the researchers found:
– The Web had at least 320 million pages in 1997
– 60% of the Web was covered by six major engines
– Maximum coverage of a single engine was 1/3 of the Web
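The estimate can be tried on toy data. Under the independence assumption, P(Wa) is approximated by the fraction of b's pages that a also crawled, |Wa ∩ Wb| / |Wb|, and then |W| = |Wa| / P(Wa). The page sets below are synthetic integers, not real crawl data.

```python
def estimate_web_size(pages_a, pages_b):
    """Estimate |W| from the pages crawled by two independent engines."""
    overlap = len(pages_a & pages_b)       # |Wa ∩ Wb|
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    p_a = overlap / len(pages_b)           # P(Wa) ≈ |Wa ∩ Wb| / |Wb|
    return len(pages_a) / p_a              # |W| = |Wa| / P(Wa)


# Synthetic example: engine a crawled 50 pages, engine b crawled 100,
# and 25 pages were crawled by both.
engine_a = set(range(0, 50))
engine_b = set(range(25, 125))
print(estimate_web_size(engine_a, engine_b))  # 50 / (25 / 100) = 200.0
```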
How to Improve the Coverage?
• Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user.
• Any suggestions?
• Homework: Develop a meta-search engine that responds to a user query with combined search results from a few search engines.
Probability
• Model uncertainty: make inferences about events given observed data
• An event e: a proposition or statement about the world at large
– “the number of Web pages in existence on 1 January 2003 was greater than five billion”
• A probability P(e): can be viewed as a number that reflects our uncertainty about whether e is true or false in the real world, given whatever information we have available.
Learning from a Bayesian Perspective
• A conditional probability P(e | D): represent the degree of belief (Bayesian interpretation of probability), where D is the background information (data) on which our belief is based.
• Bayesian approach: probability as being a dynamic entity updated when more data arrive
– Prior probability: P(e) is your belief in the event e before you see any data
– Posterior probability: P(e | D) reflects your updated belief in event e given the observed data D
– Likelihood: P(D | e) is the probability of the data under the assumption that e is true
• How to model P(D | e)?
$$P(e \mid D) = \frac{P(D \mid e)\,P(e)}{P(D)}$$
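A small numeric sketch of Bayes' theorem, P(e | D) = P(D | e) P(e) / P(D), may help. The probabilities are made-up numbers. P(D) is expanded over the two cases "e is true" and "e is false", which requires an extra quantity, P(D | not e), that the slide does not name.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Return P(e | D) given P(e), P(D | e) and P(D | not e)."""
    # P(D) = P(D | e) P(e) + P(D | not e) (1 - P(e))
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence


# Illustrative numbers: prior belief 0.2, and the data are nine times
# more probable when e is true than when it is false.
print(posterior(prior=0.2, likelihood_if_true=0.9, likelihood_if_false=0.1))
```

With these numbers the belief in e rises from the prior 0.2 to 0.18 / 0.26, roughly 0.69, after seeing the data.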
Standard Probabilistic Distribution
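• Discrete distributions
– Binomial: $P(X = k \mid n, p) = \binom{n}{k}\, p^{k} (1-p)^{n-k}$
– Multinomial: $P(X_1 = k_1, \ldots, X_m = k_m) = \frac{n!}{k_1! \cdots k_m!}\, p_1^{k_1} \cdots p_m^{k_m}$
– Geometric: $P(X = k) = (1-p)^{k-1}\, p$
– Poisson: $P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}$
• Continuous distributions
– Normal: $N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / (2\sigma^2)}$
– Exponential: $f(x \mid \lambda) = \lambda e^{-\lambda x}$
– Gamma: $f(x \mid \alpha, \lambda) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}$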
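Two of the discrete distributions can be evaluated directly as a sanity check. The sketch assumes the textbook forms: binomial P(X = k | n, p) = C(n, k) p^k (1 - p)^(n - k), and Poisson P(X = k | λ) = e^(-λ) λ^k / k!. The function names are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of k events when the expected number of events is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# A probability mass function must sum to 1 over its support.
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))
print(binomial_pmf(2, 4, 0.5))                         # 6 / 16
print(sum(poisson_pmf(k, 2.0) for k in range(50)))     # close to 1
```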
Learning from a Bayesian Perspective (cont.)
$$P(e \mid D) = \frac{P(D \mid e)\,P(e)}{P(D)}$$
• Take logarithms for easier operations
$$\log P(e \mid D) = \log P(D \mid e) + \log P(e) - \log P(D)$$
• Obtain more data D2 (second data set)
$$P(e \mid D, D_2) = \frac{P(D_2 \mid e, D)\,P(e \mid D)}{P(D_2 \mid D)}$$
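The sketch below works in log space and updates a belief twice, first with D and then with D2, using the posterior after D as the prior for D2. It makes one simplifying assumption that the slides do not state: D2 is conditionally independent of D given e (and given not e), so P(D2 | e, D) = P(D2 | e). All numbers are illustrative.

```python
import math


def log_posterior(log_prior, log_lik_true, log_lik_false):
    """log P(e | D) = log P(D | e) + log P(e) - log P(D)."""
    log_joint_true = log_lik_true + log_prior
    # log P(not e) = log(1 - P(e))
    log_joint_false = log_lik_false + math.log1p(-math.exp(log_prior))
    log_evidence = math.log(math.exp(log_joint_true) + math.exp(log_joint_false))
    return log_joint_true - log_evidence


log_prior = math.log(0.5)

# First data set D: P(D | e) = 0.8, P(D | not e) = 0.4.
after_d = log_posterior(log_prior, math.log(0.8), math.log(0.4))

# Second data set D2: P(D2 | e) = 0.6, P(D2 | not e) = 0.3.
after_d2 = log_posterior(after_d, math.log(0.6), math.log(0.3))

# Processing both data sets at once gives the same belief.
batch = log_posterior(log_prior, math.log(0.8 * 0.6), math.log(0.4 * 0.3))

print(math.exp(after_d), math.exp(after_d2), math.exp(batch))
```

Under the stated independence assumption, updating one data set at a time and updating with all the data at once agree.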
Parameter Estimation from Data
• Maximum a posteriori (MAP)
– The objective of parameter estimation is to find or approximate the best set of parameters θ for a model, i.e., to find the set of parameters maximizing the posterior P(θ|D), or log P(θ|D). This is called maximum a posteriori (MAP) estimation.
– To deal with positive quantities, we can minimize −log P(θ|D):
$$E(\theta) = -\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \log P(D)$$
– P(D) plays the role of a normalizing constant and is thus irrelevant for the optimization, i.e., the minimization of
$$E(\theta) = -\log P(D \mid \theta) - \log P(\theta)$$
– If the prior P(θ) is uniform over the sample space, then the problem reduces to finding the maximum of P(D|θ), or log P(D|θ). This is known as maximum likelihood (ML) estimation.
– Simpler ML estimation procedure, i.e., the minimization of
$$E(\theta) = -\log P(D \mid \theta)$$
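A toy comparison of the two estimators follows, with θ standing for the parameter being estimated (here the bias of a coin). The ML estimate minimizes −log P(D | θ), while the MAP estimate minimizes −log P(D | θ) − log P(θ). The data (7 heads, 3 tails), the Beta(2, 2) prior, and the simple grid search are all assumptions chosen for illustration.

```python
import math


def neg_log_likelihood(theta, heads, tails):
    """-log P(D | theta) for independent coin flips, up to a constant."""
    return -(heads * math.log(theta) + tails * math.log(1 - theta))


def neg_log_prior(theta, a, b):
    """-log P(theta) for a Beta(a, b) prior, dropping the normalizing constant."""
    return -((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


def argmin_on_grid(energy, steps=1000):
    """Return the grid point in (0, 1) with the smallest energy."""
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=energy)


heads, tails = 7, 3

# ML: minimize -log P(D | theta).
theta_ml = argmin_on_grid(lambda t: neg_log_likelihood(t, heads, tails))

# MAP: minimize -log P(D | theta) - log P(theta).
theta_map = argmin_on_grid(
    lambda t: neg_log_likelihood(t, heads, tails) + neg_log_prior(t, 2, 2)
)

print(theta_ml, theta_map)
```

The prior pulls the MAP estimate towards 0.5, so it lands near 2/3 while the ML estimate stays at the observed frequency of 0.7. Dropped additive constants such as log P(D) do not move either minimum.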