web mining - sharifce.sharif.edu/courses/85-86/1/ce925/resources/root/class... · december 24, 2006...
Web Mining
Kyumars Sheykh Esmaili
Data Mining Course
Sharif University of Technology
Fall 2006
December 24, 2006 Web Mining 2
Table of Contents
• Introduction
• Web Content Mining
  • Feature Selection and Similarity Measures
• Web Structure Mining
  • Web as Social Network
  • Features and Similarity Measures
  • Social Network Analysis Algorithms
    • PageRank
    • Cyber-Communities
      • HITS
      • CT
• Web Content-Structure Clustering
• Web Usage Mining
• Some Concrete Applications of Web Mining
  • Focus Crawling
  • Web Search Result Clustering
• Summary
Introduction
Information overload on the Web
Size:
• 2001: 6 exabytes of new information created (1 exabyte = 10^18 bytes); 10 billion (non-spam) e-mail messages sent per day.
• 2002: 12 exabytes of new information created.
• 2003: the public Internet contained about 1 trillion pages and was growing at a rate of approximately 8 million pages per day.
• 2005: 35 billion messages per day.
Challenges on WWW Interactions
• Finding relevant information
• Creating knowledge from the available information
• Personalization of information
• Learning about customers / individual users
Web mining can play an important role!
Introduction
• Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services
• Web mining research integrates research from several communities:
  • Database (DB)
  • Information retrieval (IR)
  • Sub-areas of machine learning (ML)
  • Natural language processing (NLP)
Web Data
• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
  • Profiles
  • Registration information
  • Cookies
Web Data Categories
• Web Data
  • Content Data
    • Free Texts
    • HTML Files
    • XML Files
    • Dynamic Content
    • Multimedia
  • Structure Data
    • Static Link
    • Dynamic Link
  • Usage Data
  • User Profile Data
Web Mining Taxonomy
• Web Mining
  • Web Structure Mining
  • Web Content Mining
  • Web C-S (Content-Structure) Mining
  • Web Usage Mining
Web Mining: Subtasks
• Resource finding: the task of retrieving the intended Web documents
• Information selection and pre-processing: automatically selecting and pre-processing specific information from the retrieved Web resources
• Generalization: automatic discovery of patterns in Web sites
• Analysis: validation and/or interpretation of the mined patterns
Feature Selection for Web Mining
For the purposes of automated text classification, text features should be:
• Relatively few in number
• Moderate in frequency of assignment
• Low in redundancy
• Low in noise
• Related in semantic scope to the classes to be assigned
• Relatively unambiguous in meaning
Feature Selection
Potential features:
• BODY
• META
• TITLE
• Snippet: the sentences attached to URL u in search results
• Anchor Window: the anchor text and the text around the hyperlink v->u in the source page v
• MT: the union of META and TITLE content
• BMT: the union of BODY, META and TITLE content
Feature Selection for Content Mining
[Figure: percentage of Web pages with words in HTML tags]
Feature Selection For Web Pages
[Figure: classification performance for various representations of Web pages]
Vector Space Model for Content-Similarity
• IR systems usually adopt index terms to process queries
• Index term:
  • a keyword or group of selected words
  • any word (more general)
• Stemming might be used:
  • connect: connecting, connection, connections
• An inverted file is built for the chosen index terms
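The stemming and inverted-file ideas above can be sketched as follows. This is a minimal illustration only: the suffix list, the `crude_stem` helper and the three sample documents are assumptions made for the example, and a real IR system would likely use a proper stemmer such as Porter's.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration; not a real stemming algorithm.
SUFFIXES = ("ions", "ion", "ing", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix so related word forms share one stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_file(docs: dict) -> dict:
    """Map each stemmed index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[crude_stem(token)].add(doc_id)
    return dict(index)


# Made-up sample documents.
docs = {
    "d1": "connecting web pages",
    "d2": "a connection between pages",
    "d3": "connections matter",
}
inverted = build_inverted_file(docs)
print(sorted(inverted["connect"]))
```

All three word forms from the slide (connecting, connection, connections) end up under the single index term `connect`, so a query on that stem reaches all three documents.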
Vector Space Model - Basic Concepts
• $k_i$ is an index term
• $d_j$ is a document
• $t$ is the number of index terms
• $K = \{k_1, k_2, \ldots, k_t\}$ is the set of all index terms
• $w_{ij} \ge 0$ is a weight associated with the pair $(k_i, d_j)$
• $w_{ij} = 0$ indicates that the term does not belong to the document
• $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is the weighted vector associated with the document $d_j$
• $g_i(\vec{d_j}) = w_{ij}$ is a function that returns the weight associated with the pair $(k_i, d_j)$
The Vector Space Model
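$$\mathrm{sim}(d_k, d_j) = \cos\Theta = \frac{\vec{d_k} \cdot \vec{d_j}}{|\vec{d_k}| \, |\vec{d_j}|} = \frac{\sum_{i=1}^{t} w_{ik} \, w_{ij}}{|\vec{d_k}| \, |\vec{d_j}|}$$
• Since $w_{ij} \ge 0$ and $w_{ik} \ge 0$, $0 \le \mathrm{sim}(d_k, d_j) \le 1$
• A document is retrieved even if it matches the target document's terms only partially
[Figure: document vectors $\vec{d_j}$ and $\vec{d_k}$ separated by the angle $\Theta$]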
The Vector Space Model: Example
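[Figure: documents d1 to d7 placed among the index terms k1, k2 and k3]

| | k1 | k2 | k3 | q • dj | \|dj\| | Sim(dj, q) |
|---|---|---|---|---|---|---|
| d1 | 1 | 0 | 1 | 2 | 1.41 | 0.82 |
| d2 | 1 | 0 | 0 | 1 | 1 | 0.58 |
| d3 | 0 | 1 | 1 | 2 | 1.41 | 0.82 |
| d4 | 1 | 0 | 0 | 1 | 1 | 0.58 |
| d5 | 1 | 1 | 1 | 3 | 1.73 | 1 |
| d6 | 1 | 1 | 0 | 2 | 1.41 | 0.82 |
| d7 | 0 | 1 | 0 | 1 | 1 | 0.58 |

Query: q = (1, 1, 1), |q| = 1.73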
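As a check on the example values, here is a small sketch that recomputes the cosine similarity of each document against the query. The function name `cosine_similarity` and the zero-vector convention are choices made for this illustration, not part of the slides.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Binary weights over (k1, k2, k3), copied from the example.
documents = {
    "d1": (1, 0, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 1),
    "d4": (1, 0, 0),
    "d5": (1, 1, 1),
    "d6": (1, 1, 0),
    "d7": (0, 1, 0),
}
query = (1, 1, 1)

similarities = {
    name: round(cosine_similarity(vector, query), 2)
    for name, vector in documents.items()
}
for name, value in similarities.items():
    print(name, value)
```

Documents sharing two of the three query terms score 0.82, those sharing one score 0.58, and d5, which matches the query exactly, scores 1.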
The Vector Space Model - Weighting
$$\mathrm{sim}(q, d_j) = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{|\vec{d_j}| \, |\vec{q}|}$$
• How should the weights $w_{ij}$ and $w_{iq}$ be computed?
• A good weight must take two effects into account:
  • quantification of intra-document content (similarity): the tf factor, the term frequency within a document
  • quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
• $w_{ij} = tf(i,j) \times idf(i)$
The Vector Model - Weighting
Example:
• A collection includes 10,000 documents
• The term A appears 20 times in a particular document
• The maximum number of appearances of any term in this document is 50
• The term A appears in 2,000 of the collection's documents
• $f(i,j) = \mathrm{freq}(i,j) / \max_l \mathrm{freq}(l,j) = 20/50 = 0.4$
• $idf(i) = \log_2(N/n_i) = \log_2(10{,}000/2{,}000) = \log_2 5 = 2.32$
• $w_{ij} = f(i,j) \times \log_2(N/n_i) = 0.4 \times 2.32 = 0.93$
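The arithmetic of this example can be reproduced with a few lines of code. The sketch below assumes a base-2 logarithm, since that is the base for which log(5) comes to 2.32; the function and parameter names are illustrative.

```python
import math


def tf_idf_weight(freq_in_doc, max_freq_in_doc, num_docs, docs_with_term):
    """Normalized term frequency multiplied by inverse document frequency."""
    tf = freq_in_doc / max_freq_in_doc           # f(i,j) = freq(i,j) / max freq in d_j
    idf = math.log2(num_docs / docs_with_term)   # idf(i) = log2(N / n_i), base 2 assumed
    return tf * idf


# Values from the example: 20 occurrences, maximum 50, N = 10,000, n_i = 2,000.
weight = tf_idf_weight(20, 50, 10_000, 2_000)
print(round(weight, 2))
```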
Social network analysis
• Social network analysis is the study of social entities (people in an organization, called actors) and their interactions and relationships.
• The interactions and relationships can be represented with a network or graph, in which each vertex (or node) represents an actor and each link represents a relationship.
• From the network, we can study the properties of its structure, and the role, position and prestige of each social actor.
• We can also find various kinds of sub-graphs, e.g., communities formed by groups of actors.
Social network and the Web
• Social network analysis is useful for the Web because the Web is essentially a virtual society, and thus a virtual social network:
  • each page is a social actor and each hyperlink is a relationship.
• Many results from social network analysis can be adapted and extended for use in the Web context.
Web Structure Mining
• The Web consists not only of pages, but also of hyperlinks pointing from one page to another
• These hyperlinks contain an enormous amount of latent human annotation
• Assumption: a link from page A to page B is a recommendation of page B by A
• If A and B are connected by a link, there is a higher probability that they are on the same topic
Web Link Analysis
Used for:
• Ordering documents matching a user query: ranking
• Deciding what pages to add to a collection: crawling
• Page categorization
• Finding related pages
• Finding duplicated Web sites
Structural Similarity Measures
• We must define the similarity of two nodes
• Method I: for page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A
  • Not so good. Consider the home pages of IBM and Microsoft, which are clearly related but unlikely to link to each other.
[Figure: Page A and Page B connected by a hyperlink]
Structural Similarity Measures
Method II (from bibliometrics)
• Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B.
• Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B.
[Figure: two diagrams of Page A and Page B, one showing pages that cite both and one showing pages cited by both]
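Both measures reduce to counting shared neighbours in the link graph. The sketch below assumes the graph is stored as a dictionary from each page to the set of pages it links to; the page names and links are made up for the example.

```python
# Hypothetical link graph: page -> set of pages it links to (cites).
out_links = {
    "P1": {"A", "B"},
    "P2": {"A", "B"},
    "P3": {"A"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
}


def co_citation(graph, a, b):
    """Number of pages that link to both a and b."""
    return sum(1 for targets in graph.values() if a in targets and b in targets)


def bibliographic_coupling(graph, a, b):
    """Number of pages that both a and b link to."""
    return len(graph.get(a, set()) & graph.get(b, set()))


print(co_citation(out_links, "A", "B"))             # P1 and P2 cite both
print(bibliographic_coupling(out_links, "A", "B"))  # only Y is cited by both
```

Co-citation looks at in-links (who cites the pair), while bibliographic coupling looks at out-links (what the pair cites), so the two scores can differ for the same pair of pages.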
Using the link structure of the Web (cont.)
• There are two famous link-structure-based algorithms for ranking:
  • PageRank
  • HITS
• Nearly all other algorithms are based on these two:
  • SALSA, Clever, ...
PageRank
• Introduced by Page et al. (1998)
• An offline algorithm (query independent)
• The weight of a page is determined by the ranks of its parents (the pages that link to it)
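The slide's formula did not survive extraction, so the following is only a sketch of the commonly cited formulation, in which each page passes its rank in equal shares to the pages it links to: PR(p) = (1 - d)/N + d · Σ PR(q)/outdeg(q) over the parents q of p. The damping factor d = 0.85, the iteration count, the handling of pages without out-links and the three-page graph are all assumptions made for the example.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration sketch of PageRank on a dict: page -> list of linked pages."""
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = out_links.get(page, ())
            if targets:
                # A parent splits its rank equally among its children.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no out-links: spread its rank evenly (one common convention).
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Made-up graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

Because the computation depends only on the link graph, it can run offline once for the whole collection, which is what makes the algorithm query independent.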
A Practical Example for PageRank
What is a cyber-community?
• A community on the Web is a group of Web pages sharing a common interest
  • E.g. a group of Web pages talking about pop music
  • E.g. a group of Web pages interested in data mining
• Main properties:
  • Pages in the same community should be similar to each other in content
  • The pages in one community should differ from the pages in another community
  • Similar to a cluster
Cyber Communities
Two different types of communities
• Explicitly defined communities: well-known ones, such as the resources listed by Yahoo!
  • E.g. the directory hierarchy Arts > Music (Classic, Pop), Painting
• Implicitly defined communities: communities that are unexpected or invisible to most users
  • E.g. the group of Web pages interested in a particular singer
Different types of communities
• The explicit communities are easy to identify
  • E.g. Yahoo!, InfoSeek, Clever System
• In order to extract the implicit communities, we need to analyze the Web graph objectively
• In research, people are more interested in the implicit communities
Methods of clustering
Clustering methods based on co-citation analysis
Methods derived from HITS (Kleinberg)Using co-citation matrix
CT Method
HITS: Hubs and Authorities
• Hub: a Web page that links to a collection of prominent sites on a common topic
• Authority: a prominent page on a topic; a Web page pointed to by many hubs
• Mutually reinforcing relationship: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities
Authority and Hubness
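[Figure: two diagrams; in one, pages 2, 3 and 4 point to page 1; in the other, page 1 points to pages 5, 6 and 7]
With x denoting the authority weight and y the hub weight:
$x(1) = y(2) + y(3) + y(4)$
$y(1) = x(5) + x(6) + x(7)$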
HITS Steps (1)
Creating root and base sets
HITS Steps (2)
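Calculating weights
• Authority weight: $x(p) = \sum_{q \,:\, q \to p} y(q)$
• Hub weight: $y(p) = \sum_{q \,:\, p \to q} x(q)$
• Matrix notation: $A$ is the adjacency matrix, with $A(i, j) = 1$ if the $i$-th page points to the $j$-th page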
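The weight calculation can be sketched as an iteration that alternates the two updates and normalizes after each round. In matrix form this corresponds to x = Aᵀy followed by y = Ax. The normalization by Euclidean length, the fixed iteration count and the sample graph are assumptions of this sketch.

```python
import math


def hits(out_links, iterations=50):
    """Iterate authority (x) and hub (y) updates on a dict: page -> list of linked pages."""
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub weights of the pages pointing to p.
        new_authority = {p: 0.0 for p in pages}
        for source, targets in out_links.items():
            for target in targets:
                new_authority[target] += hub[source]
        # Hub update: sum of the authority weights of the pages p points to.
        new_hub = {
            p: sum(new_authority[t] for t in out_links.get(p, ())) for p in pages
        }
        # Normalize both vectors to unit length so the weights stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_authority.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        authority = {p: v / a_norm for p, v in new_authority.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return authority, hub


# Made-up graph: three hub-like pages pointing at two authority-like pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
authority, hub = hits(graph)
print(max(authority, key=authority.get), max(hub, key=hub.get))
```

In this example a1 receives the highest authority weight because all three hubs point to it, and h1 and h2 outrank h3 as hubs because they point to both authorities.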
Final Result of HITS
HITS Results – 3D perspective
A Practical Example for HITS
Difference between PageRank and HITS
• PageRank is computed for all Web pages stored in the database, prior to any query; HITS is performed on the set of retrieved Web pages, for each query.
• HITS computes authorities and hubs; PageRank computes authorities only.
• PageRank: non-trivial to compute. HITS: easy to compute, but real-time execution is hard.
A cheaper method
• The previous methods are expensive
• There is another, simpler method called community trawling (CT)
• It has been implemented on a graph of 200 million pages, where it worked very well
Basic idea of CT
• Definition of communities: dense directed bipartite subgraphs
• Bipartite graph: the nodes are partitioned into two sets, F and C
  • Every directed edge in the graph goes from a node u in F to a node v in C
  • The graph is dense if many of the possible edges between F and C are present
[Figure: a bipartite graph with the fans (F) on one side and the centers (C) on the other]
Basic idea of CT
• Bipartite core: a complete bipartite subgraph with at least i nodes from F and at least j nodes from C
  • i and j are tunable parameters
  • Such a subgraph is called an (i, j) bipartite core
• Every community has such a core for a certain i and j.
[Figure: an (i=3, j=3) bipartite core]
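To make the definition concrete, the sketch below enumerates (i, j) bipartite cores by brute force: it tries every group of i fans and keeps the group if the fans share at least j common centers. This is a naive illustration of the definition only, not the trawling algorithm itself, and it would not scale to a Web-sized graph. The graph and node names are made up.

```python
from itertools import combinations


def find_bipartite_cores(out_links, i, j):
    """Return (fans, centers) pairs in which every fan links to every center."""
    cores = []
    for fans in combinations(sorted(out_links), i):
        # Centers that all of the chosen fans point to.
        common = set.intersection(*(set(out_links[f]) for f in fans))
        if len(common) >= j:
            cores.append((fans, frozenset(common)))
    return cores


# Hypothetical fan -> centers graph containing one (3, 3) core.
graph = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1"},
}
cores = find_bipartite_cores(graph, i=3, j=3)
print(cores)
```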
Basic idea of CT
• A bipartite core is the identity of a community
• Extracting all the communities amounts to enumerating all the bipartite cores on the Web
• The authors devised an efficient algorithm to enumerate the bipartite cores. Its main idea is iterative pruning (elimination-generation pruning)
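The slides do not spell out the pruning steps, so the following is only one plausible reading of the elimination part: in an (i, j) core every fan points to at least j centers and every center is pointed to by at least i fans, so nodes that fail these degree tests can be removed repeatedly until nothing changes. The generation part is not shown, and the function name and sample graph are assumptions.

```python
def prune_for_core(out_links, i, j):
    """Repeatedly drop fans with fewer than j out-links and centers with fewer than i in-links."""
    graph = {fan: set(centers) for fan, centers in out_links.items()}
    changed = True
    while changed:
        changed = False
        # A fan in an (i, j) core must point to at least j surviving centers.
        for fan in [f for f, centers in graph.items() if len(centers) < j]:
            del graph[fan]
            changed = True
        # A center in an (i, j) core must be pointed to by at least i surviving fans.
        in_degree = {}
        for centers in graph.values():
            for center in centers:
                in_degree[center] = in_degree.get(center, 0) + 1
        weak_centers = {c for c, degree in in_degree.items() if degree < i}
        if weak_centers:
            for fan in graph:
                graph[fan] -= weak_centers
            changed = True
    return graph


# Hypothetical fan -> centers graph; only f1, f2 and f3 can belong to a (3, 3) core.
graph = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1", "c4"},
    "f5": {"c4", "c5", "c6"},
}
pruned = prune_for_core(graph, i=3, j=3)
print(sorted(pruned))
```

Each removal can push other nodes below the threshold, which is why the loop runs until a full pass makes no change. Here f5 passes the first fan test but is dropped once its centers are eliminated.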
Content Link Clustering
In CLC, each web page q in data set D is represented as 3 vectors:
qOut
qIn
qKword
with M, N and L as the respective vector dimensions
The ith item of vector qOut indicates whether q has the ith of the M out-links (and likewise qIn for the N in-links). If yes, the ith item is 1, else 0.
The kth item of vector qKword is the frequency with which the kth of the L terms appears in page q.
Similarity Measure
The similarity of two pages Q and R is the linear combination of three parts:
$p_{out}\,S(Q_{out}, R_{out}) + p_{in}\,S(Q_{in}, R_{in}) + p_{term}\,S(Q_{term}, R_{term})$
$p_{out} + p_{in} + p_{term} = 1$
$S(Q_{out}, R_{out})$ is defined as the cosine of the two out-link vectors.
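The weighted combination can be illustrated with a short Python sketch. The slide only states that the out-link part is a cosine; using cosine for the in-link and term parts as well is an assumption here, and the dictionary layout, the function names and the example vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def clc_similarity(q, r, p_out, p_in, p_term):
    """Linear combination of the out-link, in-link and term similarities."""
    if abs(p_out + p_in + p_term - 1.0) > 1e-9:
        raise ValueError("weighting factors must sum to 1")
    return (p_out * cosine(q["out"], r["out"])
            + p_in * cosine(q["in"], r["in"])
            + p_term * cosine(q["term"], r["term"]))

# Two made-up pages: identical out-links, partly shared in-links, disjoint terms.
q = {"out": [1, 0, 1], "in": [1, 1], "term": [2, 0, 1, 0]}
r = {"out": [1, 0, 1], "in": [0, 1], "term": [0, 3, 0, 1]}
sim = clc_similarity(q, r, p_out=0.2, p_in=0.3, p_term=0.5)
```

Setting one weight to 1 and the others to 0 reduces the measure to a purely link-based or purely term-based similarity.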
Tuning the similarity measure
By varying the weighting factors in the similarity formula, it is possible to study the effects of out-links, in-links and terms on the clustering process.
The results of term-based clustering are rather coarse and usually include very general groups that are semantically very different from each other.
E.g. for the topic “jaguar”, the “car” group and the “animal” group are two very general groups with very different semantic topics.
Tuning the similarity measure
So term-based clustering can only roughly separate pages into general semantic groups and fails to handle finer distinctions,
such as “racing car” versus “car driver club”, since both pages may include terms like “car”, “model”, etc.
The main reasons for the poor “purity” of clusters produced by term-based clustering are:
Noise pages are included in clusters instead of being removed, since they share some unimportant terms with other pages;
Pages on different finer topics (but the same general topic) are mixed together.
Tuning the similarity measure
Hyperlinks represent the authors’ view of the relationships among Web pages
Hyperlink-based clustering expresses the “association” of pages.
Therefore, we could say that clusters produced by link-based clustering are of finer granularity.
The problem with link-based clustering is that some similar pages (e.g. newly created pages) may not have enough co-citations/citations to be grouped together. That is to say, recall is somewhat low.
Tuning the similarity measure
“T”, “L” and “CLC” denote the term-based (pout, pin and pKword as (0, 0, 1)), link-based (pout, pin and pKword as (0.5, 0.5, 0)) and content-link coupled (pout, pin and pKword as (0.2, 0.3, 0.5)) clustering approaches respectively.
Parameters are:
Similarity threshold
Weighting factors
The label of each cluster is identified automatically from the term vector of the cluster's centroid.
Content Link Mining
Web Usage Mining
Web usage mining, also known as Web log mining
Applies mining techniques to discover interesting usage patterns from the secondary data derived from users' interactions while surfing the web
This data includes
web log data, click-stream data, cookies, user queries, and any other data resulting from users' interaction with the web
Web Usage Mining Applications
Target potential customers for electronic commerce
Enhance the quality and delivery of Internet information services to the end user
Improve Web server system performance
Identify potential prime advertisement locations
Facilitate personalization/adaptive sites
Improve site design
Fraud/intrusion detection
Predict user’s actions (allows prefetching)
Web Log Clustering Applications
Association rules
– Find pages that are often viewed together
Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
Server Logs
Fields
Client IP: 128.101.228.20
Authenticated User ID: - -
Time/Date: [10/Nov/1999:10:16:39 -0600]
Request: "GET / HTTP/1.0"
Status: 200
Bytes: -
Referrer: "-"
Agent: "Mozilla/4.61 [en] (WinNT; I)"
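As an illustration, the fields above can be pulled out of a raw log line with a regular expression. This sketch assumes the server writes the common "combined" log layout and that the example values appeared on one line in that order; the pattern and the `parse_log_line` helper are hypothetical names, not part of any particular tool.

```python
import re
from datetime import datetime

# One named group per field listed on the slide (ident and user together
# form the "Authenticated User ID: - -" pair).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" byte count means no size was logged.
    entry["bytes"] = None if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

line = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
        '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 [en] (WinNT; I)"')
entry = parse_log_line(line)
```

Lines that do not match return None, which gives a later data-cleaning step a simple way to discard malformed entries.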
WUM – Pre-Processing
Data Cleaning
Removes log entries that are not needed for the mining process
Data Integration
Synchronizes data from multiple server logs
User Identification
Associates page references with different users
Session/Episode Identification
Groups a user’s page references into user sessions
Path Completion
Fills in page references missing due to browser and proxy caching
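Session identification is often done with a simple inactivity timeout. The sketch below assumes that user identification has already produced a key per request (for example IP plus agent) and uses a 30-minute gap to start a new session; the timeout value, the tuple layout and the `sessionize` name are assumptions for illustration, not taken from the slides.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed heuristic, not from the slides

def sessionize(entries, timeout=SESSION_TIMEOUT):
    """Group log entries into per-user sessions.

    entries: iterable of (user_key, timestamp, url).
    Returns a list of (user_key, [urls]) in order of session start.
    """
    sessions = []
    last_seen = {}  # user_key -> timestamp of that user's previous request
    current = {}    # user_key -> url list of that user's open session
    for user, ts, url in sorted(entries, key=lambda e: e[1]):
        # Start a new session on a user's first request or after a long gap.
        if user not in current or ts - last_seen[user] > timeout:
            current[user] = []
            sessions.append((user, current[user]))
        current[user].append(url)
        last_seen[user] = ts
    return sessions

t0 = datetime(1999, 11, 10, 10, 16, 39)
log = [
    ("u1", t0, "/"),
    ("u1", t0 + timedelta(minutes=2), "/products/"),
    ("u2", t0 + timedelta(minutes=3), "/"),
    ("u1", t0 + timedelta(hours=2), "/"),
]
sessions = sessionize(log)
```

Here the third request from u1 arrives two hours after the previous one, so it opens a second session for the same user.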
WUM – Association Rule Generation
Discovers the correlations between pages that are most often referenced together in a single server session
Provides information such as:
What are the sets of pages frequently accessed together by Web users?
What page will be fetched next?
What paths are frequently accessed by Web users?
Association rule
A ⇒ B [ Support = 60%, Confidence = 80% ]
Example
“50% of visitors who accessed URLs /infor-f.html and labo/infos.html also visited situation.html”
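Support and confidence for a rule over sessions can be computed directly from their definitions: support is the fraction of all sessions containing both sides of the rule, and confidence is the fraction of sessions containing the left side that also contain the right side. The sessions below are made-up toy data chosen so that the rule from the example reaches 50% confidence; the function name is hypothetical.

```python
def support_confidence(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    sessions: list of collections of page URLs, one per session.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(sessions)
    n_a = sum(1 for s in sessions if antecedent <= set(s))
    n_ab = sum(1 for s in sessions if (antecedent | consequent) <= set(s))
    support = n_ab / n_total if n_total else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# Toy sessions (invented for this sketch).
sessions = [
    {"/infor-f.html", "labo/infos.html", "situation.html"},
    {"/infor-f.html", "labo/infos.html"},
    {"/infor-f.html", "situation.html"},
    {"/index.html"},
]
support, confidence = support_confidence(
    sessions, {"/infor-f.html", "labo/infos.html"}, {"situation.html"})
```

In this toy data two sessions contain both antecedent pages and one of them also contains situation.html, giving a confidence of 0.5 and a support of 0.25.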
WUM – Clustering
Groups together a set of items having similar characteristics
User Clusters
Discover groups of users exhibiting similar browsing patterns
Page recommendation
User’s partial session is classified into a single cluster
The links contained in this cluster are recommended
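The page-recommendation idea can be sketched in a few lines. This sketch assumes each user cluster is summarized by the set of pages its members visit and uses Jaccard similarity to classify the partial session; both choices, as well as the cluster names and URLs, are illustrative assumptions rather than the method used in the slides.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(partial_session, clusters):
    """Assign the partial session to its most similar cluster and
    recommend that cluster's pages the user has not visited yet."""
    best = max(clusters, key=lambda name: jaccard(partial_session, clusters[name]))
    return best, sorted(set(clusters[best]) - set(partial_session))

# Hypothetical clusters, each summarized by its pages.
clusters = {
    "shoppers": {"/products/", "/cart", "/checkout"},
    "readers": {"/docs/", "/faq", "/blog"},
}
name, pages = recommend(["/products/", "/cart"], clusters)
```

With these toy clusters the partial session falls into "shoppers", and the only page of that cluster not yet visited, /checkout, is recommended.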
Web Usage Clustering –Sample Results
Clients who often access /products/software/webminer.html tend to be from educational institutions.
Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.
75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.
Focused Crawling
Only visit links from a page if that page is determined to be relevant.
Classifier is static after the learning phase.
Components:
Classifier, which assigns a relevance score to each page based on the crawl topic.
Distiller, to identify hub pages.
Crawler, which visits pages based on classifier and distiller scores.
Focused Crawling
Classifier also determines how useful outgoing links are
Hub pages contain links to many relevant pages. They must be visited even if they do not have a high relevance score.
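How the three components interact can be sketched with a priority queue. To stay self-contained, the sketch replaces fetching with an in-memory link table, the classifier with a precomputed score per page, and the distiller with a precomputed set of hubs; all of these, together with the 0.5 threshold and the function name, are assumptions for illustration.

```python
import heapq

def focused_crawl(seeds, links, relevance, hubs, threshold=0.5, limit=100):
    """Visit pages in order of relevance, expanding only relevant pages and hubs.

    links: dict page -> list of out-linked pages (stands in for fetching).
    relevance: dict page -> classifier score in [0, 1].
    hubs: set of pages the distiller marked as hubs.
    Returns the pages in the order they were visited.
    """
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-relevance.get(p, 0.0), p) for p in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        # Follow links only from relevant pages, or from hubs regardless of score.
        if relevance.get(page, 0.0) < threshold and page not in hubs:
            continue
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-relevance.get(target, 0.0), target))
    return visited

# Toy web: "hub" scores low but links to relevant pages; "offtopic" is a dead end.
links = {"seed": ["hub", "offtopic"], "hub": ["a", "b"], "offtopic": ["c"]}
relevance = {"seed": 0.9, "hub": 0.2, "offtopic": 0.1, "a": 0.8, "b": 0.7, "c": 0.9}
order = focused_crawl(["seed"], links, relevance, hubs={"hub"})
```

In the toy web the hub is expanded despite its low score, while the link out of the off-topic page is never followed, so page "c" is not reached.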
Focused Crawling
Motivation
In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs
i.e.: engine
search engine
car part
Engine Corp.
Why not other data mining techniques?
(1) Using Contents of Documents
Creating clusters based on snippets returned by web search engines.
Clusters based on snippets are almost as good as clusters created using the full text of Web documents.
Suffix Tree Clustering (STC): incremental, O(n) time algorithm
Linear
Incremental
Overlapping
Can be extended to hierarchical
STC algorithm
Step 1: Cleaning
Stemming
Sentence boundary identification
Punctuation elimination
Step 2: Suffix tree construction
Produces base clusters (internal nodes)
Base clusters are scored based on size and phrase score (which depends on length and word “quality”)
Step 3: Merging base clusters
Highly overlapping clusters are merged
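The three steps can be imitated on a small scale as follows. This is a simplified sketch, not STC itself: instead of building a suffix tree it enumerates short word phrases directly (which gives the same phrase-to-documents map for short phrases but is not linear time), it skips stemming and scoring, and the 0.5 overlap threshold is an assumed value. All function names are hypothetical.

```python
import re
from itertools import combinations

def clean(snippet):
    """Step 1 (simplified): lowercase, drop punctuation, split into words."""
    return re.findall(r"[a-z0-9]+", snippet.lower())

def base_clusters(snippets, max_len=3, min_docs=2):
    """Step 2 (simplified): map each phrase to the set of documents containing it,
    keeping only phrases shared by at least min_docs documents."""
    phrase_docs = {}
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                phrase = " ".join(words[start:start + n])
                phrase_docs.setdefault(phrase, set()).add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

def merge_clusters(base, overlap=0.5):
    """Step 3: merge base clusters whose document sets overlap strongly,
    taking connected components of the 'highly overlapping' relation."""
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        common = len(base[a] & base[b])
        if common / len(base[a]) > overlap and common / len(base[b]) > overlap:
            parent[find(a)] = find(b)

    merged = {}
    for p in phrases:
        merged.setdefault(find(p), set()).update(base[p])
    return sorted(sorted(docs) for docs in merged.values())

snippets = [
    "Jaguar car dealers and prices",
    "New jaguar car models",
    "Jaguar cat habitat in the rainforest",
    "The jaguar cat hunts at night",
]
base = base_clusters(snippets)
clusters = merge_clusters(base)
```

On these made-up snippets the result contains a "car" group, a "cat" group and one group holding all four documents, which shows that the clusters may overlap.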
(2) Using user’s usage logs
Advantage: relevancy information is objectively reflected by the usage logs
An experimental result on www.nasa.gov/
Cluster 1
/shuttle/missions/41-c/news
/shuttle/missions/61-b…
Cluster 2
/history/apollo/sa-2/news/
/history/apollo/sa-2/images…
Cluster 3
/software/winvn/userguide/3_3_2.htm
/software/winvn/userguide/3_3_4.htm
…
…
(3) Using hyperlinks
For each URL P in search results R, we extract all its out-links as well as its top n in-links using AltaVista's services
This gives all N distinct out-links and M distinct in-links for all URLs in R.
Each page P in R (result set) is represented as 2 vectors:
POut (N-dimensional)
PIn (M-dimensional)
(3) Using Hyperlinks: continued
Concerns on current methods
Each method has pros and cons
Using hyperlinks: the best accuracy, with still some room to improve
STC: best for browsing and for incrementality.
Sample systems
Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
SHOC
Grouper
Online
Operates on query result snippets
Clusters together documents with large common subphrases
Suffix Tree Clustering (STC)
STC induces labeling
Summary
Web Mining
  Web Structure Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking
Thank You