Download - Data Mining and Information Retrieval
Data Mining and Information Data Mining and Information RetrievalRetrieval
Introduction to Web MiningIntroduction to Web Mining
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 2 / 23
What is Web Mining?What is Web Mining?
Discovering useful information from the World-Wide Web and its usage patterns.
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 3 / 23
Web Mining vs. Data MiningWeb Mining vs. Data MiningStructure (or lack of it)
Textual information and linkage structureScale
Data generated per day is comparable to largest conventional “data warehouses”
SpeedOften need to react to evolving usage patterns in real-time (e.g., merchandising)
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 4 / 23
Web Mining topicsWeb Mining topicsWeb graph analysisPower Laws and The Long TailStructured data extractionWeb advertising Systems Issues
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 5 / 23
Size of the WebSize of the WebNumber of pages
Technically, infiniteMuch duplication (30-40%)Best estimate of “unique” static HTML pages comes from search engine claims
Google = 8 billion(?), Yahoo = 20 billion
Number of web sites Netcraft survey says 206,675,938 sites (March 2010)
(http://news.netcraft.com/archives/web_server_survey.html)
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 6 / 23
NetcraftNetcraft SurveySurvey
http://news.netcraft.com/archives/web_server_survey.html
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 7 / 23
The Web as a GraphThe Web as a GraphPages = nodes, hyperlinks = edges
Ignore contentDirected graph
High linkage8-10 links/page on averagePower-law degree distribution
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 8 / 23
Structure of Web GraphStructure of Web GraphLet’s take a closer look at structure
Broder et al (2000) studied a crawl of 200M pages and other smaller crawlsBow-tie structure
Not a “small world”
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 9 / 23
BowBow--tie Structuretie Structure
Source: Broder et al, 2000
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 10 / 23
What can the graph tell us?What can the graph tell us?Distinguish “important” pages from unimportant ones
Page rankDiscover communities of related pages
Hubs and AuthoritiesDetect web spam
Trust rank
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 11 / 23
Web Mining topicsWeb Mining topicsWeb graph analysisPower Laws and The Long TailStructured data extractionWeb advertising Systems Issues
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 12 / 23
PowerPower--law degree distributionlaw degree distribution
Source: Broder et al, 2000
log
Long tail
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 13 / 23
PowerPower--laws galorelaws galoreStructure
In-degreesOut-degreesNumber of pages per site
Usage patternsNumber of visitorsPopularity
And much more…
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 14 / 23
Web Mining topicsWeb Mining topicsWeb graph analysisPower Laws and The Long TailStructured data extractionWeb advertising Systems Issues
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 15 / 23
Extracting Structured DataExtracting Structured Data
http://www.simplyhired.com
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 16 / 23
Web Mining topicsWeb Mining topicsWeb graph analysisPower Laws and The Long TailStructured data extractionWeb advertising Systems Issues
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 17 / 23
Searching the WebSearching the Web
Content aggregatorsThe Web Content consumers
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 18 / 23
Ads vs. search resultsAds vs. search results
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 19 / 23
Ads vs. search resultsAds vs. search resultsSearch advertising is the revenue model
Multi-billion-dollar industryAdvertisers pay for clicks on their ads
Interesting problemsWhat ads to show for a search?If I’m an advertiser, which search terms should I bid on and how much to bid?
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 20 / 23
Web Mining topicsWeb Mining topicsWeb graph analysisPower Laws and The Long TailStructured data extractionWeb advertising Systems Issues
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 21 / 23
Systems architectureSystems architecture
Memory
Disk
CPUMachine Learning, Statistics
“Classical” Data Mining
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 22 / 23
Very LargeVery Large--Scale Data MiningScale Data Mining
Mem
Disk
CPU
Mem
Disk
CPU
Mem
Disk
CPU…
Cluster of commodity nodes
CMPT 454: Database Systems II CMPT 454: Database Systems II –– Introduction to Web MiningIntroduction to Web Mining 23 / 23
Systems IssuesSystems IssuesWeb data sets can be very large
Tens to hundreds of terabytesCannot mine on a single server!
Need large farms of serversHow to organize hardware/software to mine multi-terabye data sets