Mobile Web Crawling Master Thesis Defense Jan Fiedler 04/17/98

Mobile Web Crawling

Master Thesis Defense

Jan Fiedler

04/17/98

Presentation Outline

• Resource Discovery Problem
• Web Crawling Techniques
  – Traditional Web Crawling
  – Mobile Web Crawling
• Mobile Crawling Architecture
  – Distributed Runtime Environment
  – Application Framework
  – Performance Evaluation
• Summary and Conclusion

Resource Discovery Problem

• The Web forms a large distributed hypertext system
  – 1.6 million Web sites
  – 320 million Web documents
  – 40% of Web content changes within a month
  – exponential growth rate
  – lack of structure (i.e. no strict hierarchy)

Goal: overlay the distributed Web structure with a centralized information system which allows resource discovery

Web Indices and Search Engines

• Search engine statistics:
  – index size 30-110 million pages (approx. 700GB)
  – web coverage 10%-35%
  – daily crawl 3-10 million pages (approx. 60GB)
• Year 2000 estimates:
  – index size 880 million pages (approx. 5.6TB)
  – daily crawl 80 million pages (approx. 480GB)

Traditional Web crawling will experience severe scaling problems in the near future.
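
The size figures above imply a roughly constant average page size, which is why the year 2000 estimate reads as a straight scale-up of the current numbers. A quick check, as a sketch under two assumptions that are not stated on the slide: decimal units (1 GB = 10^9 bytes), and each size estimate paired with the upper end of its page-count range.

```python
# Implied average page size from the slide's figures.
# Assumptions (not from the slide): decimal units, and each size
# estimate belongs to the upper end of the quoted page range.
GB = 10**9
TB = 10**12

figures = {
    "1998 index": (110e6, 700 * GB),
    "1998 daily crawl": (10e6, 60 * GB),
    "2000 index": (880e6, 5.6 * TB),
    "2000 daily crawl": (80e6, 480 * GB),
}

def bytes_per_page(pages, total_bytes):
    """Average number of bytes stored or transferred per page."""
    return total_bytes / pages

implied = {name: bytes_per_page(*pair) for name, pair in figures.items()}

for name, size in implied.items():
    print(f"{name}: about {size / 1000:.1f} KB per page")
```

Under these assumptions all four figures land between 6 and 6.5 KB per page, so the projected load grows in proportion to the number of pages.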

Traditional Crawling Overview

[Figure: crawling architecture inside the Google domain (one LAN). Labeled components: URL Server, four Crawlers accessing the Web over HTTP, Store Server, Repository, Indexer, Anchors, URL Resolver.]

Traditional Web Crawling

• Characteristics of traditional Web crawling:
  – remote data access
  – focus on rapid data retrieval
  – centralized, database oriented architecture
  – brute force download of Web content
  – resource intensive approach

Traditional Web crawling techniques do not exploit information about the pages being crawled in order to reduce the crawling costs.

Mobile Crawling Overview

[Figure: a search engine (Index, Crawler Manager) connected through the Web to three remote hosts, each with an HTTP server.]

Mobile Web Crawling

• Characteristics of mobile Web crawling:
  – local data access
  – focus on effective data retrieval
  – distributed, data source oriented architecture
  – intelligent download of significant Web content
  – resource preserving approach

Mobility allows a Web crawler to analyze Web pages before investing Web resources for their transmission

Mobile Crawling Advantages

• Remote page selection
  – determine significance of a page prior to transmission
  – applicable for specialized search engines
• Remote page filtering
  – use effective page representation model
  – applicable for non-fulltext search engines
• Remote page compression
  – compress page data prior to transmission
  – applicable for all search engines
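
The three remote operations above can be sketched as one small pipeline that runs next to the HTTP server and returns only the bytes worth transmitting. This is a minimal illustration, not the thesis implementation: the keyword test, the page representation (URL, title, links), the JSON encoding, and all function names are assumptions made for the example.

```python
import json
import re
import zlib

def select(page_html, keywords):
    """Remote page selection: keep a page only if it mentions a keyword."""
    text = page_html.lower()
    return any(k.lower() in text for k in keywords)

def filter_page(url, page_html):
    """Remote page filtering: reduce a page to a compact representation
    (here URL, title, and links; a hypothetical representation model)."""
    title = re.search(r"<title>(.*?)</title>", page_html, re.I | re.S)
    links = re.findall(r'href="([^"]+)"', page_html, re.I)
    return {
        "url": url,
        "title": title.group(1).strip() if title else "",
        "links": links,
    }

def compress(records):
    """Remote page compression: compress the result before transmission."""
    return zlib.compress(json.dumps(records).encode("utf-8"))

def crawl_locally(pages, keywords):
    """Run all three steps on the remote host; return the bytes to send."""
    kept = [filter_page(u, h) for u, h in pages.items() if select(h, keywords)]
    return compress(kept)

# Stand-in for pages served by the local HTTP server.
pages = {
    "/a.html": "<html><title>Mobile agents</title><body>"
               + "crawler " * 200
               + '<a href="/b.html">b</a></body></html>',
    "/b.html": "<html><title>Recipes</title><body>"
               + "soup " * 200
               + "</body></html>",
}

payload = crawl_locally(pages, ["crawler"])
raw_size = sum(len(h) for h in pages.values())
print(len(payload), "bytes sent instead of", raw_size)
```

A stationary crawler would have to download both pages in full before it could make any of these decisions.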

Crawler Specification

• Rule-based programming paradigm
  – represent crawler data as facts (e.g. page-facts)
  – describe crawler behavior as a set of rules which operate upon facts
• Advantages
  – it is easier to specify crawling rules than to devise a crawling algorithm
  – no need to model control flow
  – rule-based programs have very simple runtime states
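
To make the paradigm concrete, here is a minimal sketch of a crawler written as facts plus rules. The fact types (`Visit`, `Page`, `Result`), the three rules, and the in-memory `SITE` are hypothetical and stand in for whatever rule language the thesis actually uses. Note that no rule says in which order pages are visited; the loop simply applies rules until nothing new can be derived.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Visit:          # a URL that should be loaded
    url: str

@dataclass(frozen=True)
class Page:           # a loaded page (a "page-fact")
    url: str
    text: str
    links: tuple

@dataclass(frozen=True)
class Result:         # a page judged significant
    url: str

# Stand-in for the local HTTP server: url -> (text, links).
SITE = {
    "/": ("welcome to the crawler lab", ("/a", "/b")),
    "/a": ("mobile crawler notes", ("/",)),
    "/b": ("cooking recipes", ()),
}

def rule_fetch(facts):
    """A scheduled URL without a page-fact gets loaded."""
    loaded = {f.url for f in facts if isinstance(f, Page)}
    return {
        Page(f.url, *SITE[f.url])
        for f in facts
        if isinstance(f, Visit) and f.url in SITE and f.url not in loaded
    }

def rule_follow(facts):
    """Every link on a loaded page becomes a URL to visit."""
    return {Visit(link) for f in facts if isinstance(f, Page) for link in f.links}

def rule_select(facts):
    """Pages mentioning the keyword become results."""
    return {Result(f.url) for f in facts if isinstance(f, Page) and "crawler" in f.text}

RULES = [rule_fetch, rule_follow, rule_select]

def run(facts, rules):
    """Apply all rules until no rule adds a new fact (a fixpoint)."""
    facts = set(facts)
    while True:
        new = set().union(*(rule(facts) for rule in rules)) - facts
        if not new:
            return facts
        facts |= new

final = run({Visit("/")}, RULES)
print(sorted(f.url for f in final if isinstance(f, Result)))
```

The whole runtime state is the set of facts, which is what keeps the state simple enough to ship from host to host.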

Mobile Crawling Architecture

[Figure: two layers connected over the network. Application Framework Architecture: Crawler Manager (with Crawler Spec), Query Engine, Archive Manager (Database Command Manager, Connection Manager, SQL access to a DB), and a Communication Subsystem with Inbox and Outbox. Distributed Crawler Runtime Environment: four hosts, each with a Communication Subsystem, a Virtual Machine, and an HTTP Server.]

Mobile Crawling Architecture

• Distributed Crawler Runtime Environment
  – provide platform independent execution environment
  – virtual machine for remote crawler execution
  – communication layer for crawler migration
• Application Framework
  – support for crawler specification and configuration
  – crawler manager for crawler specification
  – query engine as crawler/application interface
  – archive manager as database connectivity framework

Crawler Virtual Machine

• How to execute a rule-based crawler specification?
  – crawler execution = rule application upon the fact base
  – use an inference engine for the rule application process

1. Initialization
  • insert rules and facts into the inference engine
2. Rule application
  • start the rule application process within the inference engine
3. Finalization
  • extract rules and facts once rule application has stopped
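
The three phases can be sketched as follows. `InferenceEngine` here is a toy forward-chaining engine written for this example, not the engine used in the thesis, and the dict-based crawler object and the reachability rule are likewise assumptions.

```python
class InferenceEngine:
    """Toy forward-chaining engine (a stand-in for a real one)."""

    def __init__(self):
        self.rules = []
        self.facts = set()

    def insert(self, rules, facts):
        self.rules.extend(rules)
        self.facts.update(facts)

    def run(self):
        """Apply every rule until a full pass adds no new fact."""
        changed = True
        while changed:
            derived = set()
            for rule in self.rules:
                derived |= set(rule(self.facts))
            changed = not derived <= self.facts
            self.facts |= derived

    def extract(self):
        return list(self.rules), set(self.facts)


def execute(crawler):
    """Run a crawler (a dict with 'rules' and 'facts') in three phases."""
    engine = InferenceEngine()
    engine.insert(crawler["rules"], crawler["facts"])      # 1. initialization
    engine.run()                                           # 2. rule application
    crawler["rules"], crawler["facts"] = engine.extract()  # 3. finalization
    return crawler


def reach_rule(facts):
    """If x is reachable and x links to y, then y is reachable."""
    reachable = {f[1] for f in facts if f[0] == "reachable"}
    return {("reachable", f[2]) for f in facts if f[0] == "link" and f[1] in reachable}


crawler = execute({
    "rules": [reach_rule],
    "facts": {
        ("reachable", "/"),
        ("link", "/", "/a"),
        ("link", "/a", "/b"),
        ("link", "/x", "/y"),
    },
})
print(sorted(f[1] for f in crawler["facts"] if f[0] == "reachable"))
```

Because finalization puts rules and facts back into the crawler object, the object can migrate to another host and resume there with everything it has derived so far.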

Crawler Virtual Machine

[Figure: the Virtual Machine consists of a Communication Layer, a Scheduling component, and four Execution Threads, each with its own Inference Engine.]

Crawler Query Engine

• How to access the crawler knowledge?
  – provide a query facility to query the crawler fact base
  – implement a SQL subset as query language
  – represent query result as data tuples, not as facts
  – allows the user to reason about crawling results
  – query engine implementation uses inference engine

Query engine serves as the primary interface between the user application and the mobile crawler
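
As a rough illustration of this idea, the sketch below compiles a very small SQL subset into a "query rule", a function that is applied to the fact base and yields plain tuples. The supported grammar (`SELECT cols FROM type [WHERE attr = 'value']`), the dict-based fact layout, and the name `compile_query` are assumptions for the example; the SQL subset actually supported by the thesis is not given on the slide.

```python
import re

QUERY = re.compile(
    r"^\s*SELECT\s+(?P<cols>[\w\s,]+?)\s+FROM\s+(?P<type>\w+)"
    r"(?:\s+WHERE\s+(?P<attr>\w+)\s*=\s*'(?P<value>[^']*)')?\s*$",
    re.IGNORECASE,
)

def compile_query(sql):
    """Compile the tiny SQL subset into a query rule: a function
    that maps a fact base to a list of result tuples."""
    m = QUERY.match(sql)
    if m is None:
        raise ValueError(f"unsupported query: {sql!r}")
    cols = [c.strip() for c in m.group("cols").split(",")]
    fact_type, attr, value = m.group("type"), m.group("attr"), m.group("value")

    def query_rule(facts):
        return [
            tuple(f.get(c) for c in cols)
            for f in facts
            if f.get("type") == fact_type
            and (attr is None or (attr in f and str(f[attr]) == value))
        ]

    return query_rule

# Hypothetical crawler fact base after a crawl.
facts = [
    {"type": "page", "url": "/a", "title": "Mobile agents", "lang": "en"},
    {"type": "page", "url": "/b", "title": "Rezepte", "lang": "de"},
    {"type": "link", "src": "/a", "dst": "/b"},
]

rows = compile_query("SELECT url, title FROM page WHERE lang = 'en'")(facts)
print(rows)
```

Returning tuples instead of facts means the application needs no knowledge of the crawler's internal fact format.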

Crawler Query Engine

[Figure: within the Query Engine, the Query Compiler turns a user query into a query rule; the Inference Engine applies it to the crawler object's facts and rules and produces result tuples.]

Performance Evaluation Setup

• Use distributed virtual machines to support mobile as well as traditional Web crawling

[Figure: evaluation setup with a remote and a local side. Labeled components: Crawler Manager with Crawler Spec, two Virtual Machines, three Communication Subsystems, and an HTTP Server with the HTML data set.]

Performance Evaluation

• Controlled environment setup
  – static HTML data set with known properties
  – personal HTTP server
  – unshared communication channel (dialup line)
• Measurements
  1. network load for traditional (stationary) crawler
  2. network load for mobile crawler without page compression
  3. network load for mobile crawler with page compression

Benefit of Remote Page Selection

[Chart: total load (KB), axis 0 to 450, for crawlers S1, M1, M2, M3, M4; series: uncompressed, compressed.]

Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection

Benefit of Remote Page Filtering

Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)

[Chart: network load, axis 0% to 120%, against filter degree (90% down to 10%); series: load uncompressed, load compressed.]

Benefit of Page Compression

Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages

[Chart: total load (KB), axis 0 to 900, against retrieved pages (1, 10, 22, 51, 82, 158); series: stationary, mobile uncompressed, mobile compressed.]

Costs and Benefits

• Overhead
  – overhead due to crawler migration (<5K)
  – overhead due to fact-based data representation (6%)
• Benefits without page compression
  – as soon as less than 85% of each page needs to be preserved
  – as soon as less than 90% of all pages are transmitted
• Benefits with page compression
  – reduction in network load by a factor of 4.5

Summary and Conclusion

• Mobile crawling advantages:
  – approach fits better in a distributed Web environment
  – approach beneficial for all types of search engines
  – better support for specialized search engines
  – network overhead due to crawler mobility is small

Mobile crawling solves the scaling problems of the traditional crawling approach by allowing remote operations to be performed on the crawled data.

Approach provides a base for smart Web crawling.

Future Work

• Security
  – crawler identification based on digital signatures
  – restrict crawler execution to positively identified crawlers
  – implement virtual machine as a secure sandbox
• Crawler mobility support
  – integrate virtual machine into web servers
• Mobile crawling algorithms
  – optimize crawling algorithms with crawler mobility in mind (e.g. crawler communication)