84079707-nemo-ppt (1)

Upload: ajay-agrawal

Post on 08-Aug-2018

8/23/2019 84079707-nemo-ppt (1)

http://slidepdf.com/reader/full/84079707-nemo-ppt-1 1/48

Sailing the Web with Captain Nemo: a Personalized Metasearch Engine

(http://www.dblab.ntua.gr/~stef/nemo)

Stefanos Souldatos, Theodore Dalamagas, Timos Sellis

(National Technical University of Athens, Greece)

INTRODUCTION

Metasearching

Personalization

Metasearching & Personalization

Metasearching

[Diagram, built up over several slides: the Web, Search Engine 1, Search Engine 2, Search Engine 3, and a Metasearch Engine]

Personalization


Metasearching & Personalization

Result Retrieval

Result Presentation

Result Administration

INTRODUCTION TO CAPTAIN NEMO

Personalization in Captain Nemo

Contribution

Personalization in Captain Nemo

Result Retrieval: Personal Retrieval Model (search engines, #pages, timeout)

Result Presentation: Personal Presentation Style (grouping, ranking, appearance)

Result Administration: Topics of Personal Interest (semi-automatic classification)

Contribution

- We present personalization techniques for metasearch engines (presentation style, retrieval model, ranking algorithm).

- We suggest semi-automatic classification techniques in order to recommend relevant topics of interest to classify the retrieved Web pages.

- We present a fully functional metasearch engine, called Captain Nemo, that implements the above framework.

RELATED WORK

Personalization in Retrieval

WebCrawler, Search, Ixquick, Infogrid, Mamma, Profusion, Query Server

User defines the:

- search engines to be used

- timeout option (i.e. max time to wait for search engine results)

- number of pages to be retrieved by each engine

Personalization in Presentation

Search Engines

- Alltheweb: personal stylesheets to customize the look 'n' feel

- AltaVista: high or low details in the description of the results

Metasearch Engines

- WebCrawler, MetaCrawler, Dogpile: result grouping by the search engine that retrieved them

Topics of Personal Interest

- Buntine et al. (2004): topic-based open source search engine

- Northern Light: organizes search results into custom folders

- Inquirus2: recognises categories and improves queries towards a category

- Chakrabarti et al. (1998): exploit link information for hypertext categorization

CAPTAIN NEMO

User Profile

Personal Retrieval Model

Personal Presentation Style

Topics of Personal Interest

Personal Retrieval Model

Setting                  Search Engine 1   Search Engine 2   Search Engine 3
Search Engines (used)    ✓                 ✓                 ✓
Number of Results        20                30                10
Search Engine Timeout    6                 8                 4
Search Engine Weight     7                 10                5
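One way to picture the personal retrieval model is as a small per-user settings record. The sketch below is only an illustration under assumptions: the class name, field names and helper functions are hypothetical, the values are the ones shown on the slide, and the unit of the timeout is not stated in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineSettings:
    """Per-engine settings of one user's retrieval model (hypothetical names)."""
    enabled: bool      # whether the user selected this search engine
    num_results: int   # number of results to request from the engine
    timeout: int       # maximum wait for results (unit not given on the slide)
    weight: int        # user-assigned weight, used later for ranking


# Values taken from the slide's example profile.
PROFILE = {
    "Search Engine 1": EngineSettings(True, 20, 6, 7),
    "Search Engine 2": EngineSettings(True, 30, 8, 10),
    "Search Engine 3": EngineSettings(True, 10, 4, 5),
}


def active_engines(profile):
    """Return only the engines the user has switched on."""
    return {name: s for name, s in profile.items() if s.enabled}


def total_requested(profile):
    """Total number of results requested across all active engines."""
    return sum(s.num_results for s in active_engines(profile).values())
```

With this profile, all three engines are active and the engine with the largest weight is Search Engine 2.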

Personal Presentation Style

Result Grouping

- Merged in a single list

- Grouped by search engine

- Grouped by relevant topic of interest
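The three grouping options can be sketched as a single function that takes a mode. This is a minimal illustration, not Captain Nemo's implementation: the result records, their keys (`title`, `engine`, `topic`) and the mode names are all assumptions made for the example.

```python
from collections import defaultdict


def group_results(results, mode="merged"):
    """Group result records according to the chosen presentation mode.

    mode is one of "merged", "engine" or "topic" (hypothetical names).
    """
    if mode == "merged":
        # One flat list holding every result.
        return {"all": list(results)}
    if mode not in ("engine", "topic"):
        raise ValueError(f"unknown grouping mode: {mode}")
    groups = defaultdict(list)
    for result in results:
        groups[result[mode]].append(result)
    return dict(groups)


# Illustrative records only; the titles and topics are made up.
RESULTS = [
    {"title": "A", "engine": "Search Engine 1", "topic": "Science"},
    {"title": "B", "engine": "Search Engine 2", "topic": "Sports"},
    {"title": "C", "engine": "Search Engine 1", "topic": "Sports"},
]
```

Calling `group_results(RESULTS, "engine")` yields one list per search engine, while `"topic"` yields one list per topic of interest.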

Result Content

- Title

- Title, URL

- Title, URL, Description

Look ‘n’ Feel

- Color Themes (XSL Stylesheets)

- Page Layout

- Font Size

Topics of Personal Interest

Topics Administration

- The user defines topics of personal interest (i.e. thematic categories).

- Each thematic category has a name and a description of 10-20 words.

- The system offers an environment for the administration of the thematic categories and their content.

Semi-automatic Classification

- The system proposes the most appropriate thematic category for each result.

- The user can save each result in the proposed category or in another one.

- The classification implements a Nearest Neighbor algorithm (Witten et al., 1999), comparing the title and description of each result with the name and description of the thematic categories.

Classification Example

Topics of Interest

(t1) Sports: football basketball baseball swimming tennis soccer game

(t2) Science: scientific maths physics computer technology

(t3) Arts: decorating art painting poetry sculpture music

Result

Alen Computer Co. can teach you the art of programming... Technology is just a game now... computer science for beginners

Similarity of the result to each topic: t1 = 0.287, t2 = 0.892, t3 = 0.368
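The comparison between a result and the topics can be sketched as a nearest-neighbour search over word vectors. The slides do not say which term weighting or similarity measure Captain Nemo uses, so this sketch assumes raw word counts and cosine similarity; it therefore does not reproduce the scores above, only the idea of picking the closest topic. The function names are hypothetical, and the truncated last word of the Arts description is assumed to be "music".

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def propose_topic(result_text, topics):
    """Return the closest topic name and the similarity to every topic."""
    result_vec = word_counts(result_text)
    scores = {
        name: cosine(result_vec, word_counts(name + " " + description))
        for name, description in topics.items()
    }
    return max(scores, key=scores.get), scores


TOPICS = {
    "Sports": "football basketball baseball swimming tennis soccer game",
    "Science": "scientific maths physics computer technology",
    "Arts": "decorating art painting poetry sculpture music",
}

RESULT = ("Alen Computer Co. can teach you the art of programming... "
          "Technology is just a game now... computer science for beginners")

best, scores = propose_topic(RESULT, TOPICS)
```

Under these assumptions the result shares three words with Science and one word each with Sports and Arts, so Science is proposed, which agrees with the ordering of the scores on the slide.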

METASEARCH RANKING

Two Ranking Approaches

- Using the initial scores of search engines

- Not using the initial scores of search engines

Using Initial Scores

- Rasolofo et al. (2001) argue that the initial scores of the search engines can be exploited.

- Normalization is required in order to achieve a common measure of comparison.

- A weight factor incorporates the reliability of each search engine. Search engines that return more Web pages should receive a higher weight, on the assumption that the number of relevant Web pages retrieved is proportional to the total number of Web pages retrieved as relevant.
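The score-based approach can be sketched in two steps: normalize each engine's scores, then combine them with a per-engine weight. The slides do not specify the normalization or the exact weight, so this sketch assumes min-max normalization and a weight equal to the engine's result count divided by the largest result count. Both choices, the function names and the sample scores are assumptions for illustration, not the method of Rasolofo et al.

```python
def min_max(scores):
    """Rescale one engine's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_by_score(engine_scores):
    """Merge non-empty per-engine score tables into one ranked list."""
    largest = max(len(scores) for scores in engine_scores.values())
    merged = {}
    for scores in engine_scores.values():
        # Assumed weight: engines returning more pages count for more.
        weight = len(scores) / largest
        for doc, s in min_max(scores).items():
            merged[doc] = merged.get(doc, 0.0) + weight * s
    # Highest combined score first; ties broken alphabetically.
    return sorted(merged, key=lambda d: (-merged[d], d))


# Made-up scores on two incompatible scales.
ENGINE_SCORES = {
    "A": {"x": 10, "y": 5, "z": 0},
    "B": {"y": 0.9, "x": 0.1},
}
```

Here document `y` ends up first because it scores reasonably in engine A and best in engine B, even though the two engines report scores on different scales.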

Not Using Initial Scores

- The scores of various search engines are not compatible or comparable, even when normalized.

- Towell et al. (1995) note that the same document receives different scores in various search engines.

- Gravano and Papakonstantinou (1998) point out that the comparison is not feasible even among engines using the same ranking algorithm.

- Dumais (1994) concludes that scores depend on the document collection used by a search engine.

Aslam and Montague (2001)

- Bayes-fuse uses probability theory to calculate the probability that a result is relevant to a query.

- Borda-fuse is based on democratic voting. Each search engine is considered to give votes to the results it returns (N votes to the first result, N-1 to the second, etc.). The metasearch engine gathers the votes, and the ranking is determined democratically by summing up the votes.
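The voting scheme of Borda-fuse can be sketched in a few lines. The slide does not define N, so this sketch assumes N is the length of the longest result list; the function name, the tie-breaking rule and the sample rankings are also illustrative assumptions.

```python
from collections import defaultdict


def borda_fuse(rankings):
    """Fuse ranked result lists by summing Borda votes per result."""
    # Assumption: N is the length of the longest list returned by any engine.
    n = max((len(ranked) for ranked in rankings.values()), default=0)
    votes = defaultdict(int)
    for ranked in rankings.values():
        for position, url in enumerate(ranked):
            # First result gets N votes, second N-1, and so on.
            votes[url] += n - position
    # Most votes first; ties broken alphabetically for determinism.
    return sorted(votes, key=lambda u: (-votes[u], u))


# Made-up rankings from three engines.
RANKINGS = {
    "SE1": ["a", "b", "c"],
    "SE2": ["b", "a"],
    "SE3": ["c", "b", "d"],
}
```

In this example `b` collects 2 + 3 + 2 = 7 votes, `a` collects 5, `c` collects 4 and `d` collects 1, so the fused order is `b`, `a`, `c`, `d`.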

- Weighted borda-fuse: a weighted alternative to borda-fuse, in which search engines are not treated equally; their votes are weighted according to the reliability of each search engine.

Weighted Borda-Fuse

- $V(r_{i,j}) = w_j \cdot (\max_k(r_k) - i + 1)$

- $V(r_{i,j})$: votes of the $i$-th result of the $j$-th search engine

- $w_j$: weight of the $j$-th search engine (set by user)

- $\max_k(r_k)$: maximum number of results

- Example:

Search engine   Weight     Votes before weighting   Votes after weighting
SE1             W1 = 7     2 3 4 5                  14 21 28 35
SE2             W2 = 10    3 4 5                    30 40 50
SE3             W3 = 5     1 2 3 4 5                5 10 15 20 25
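As a check on the formula, the sketch below recomputes the votes of the example. It assumes SE1, SE2 and SE3 returned 4, 3 and 5 results respectively (inferred from the number of votes listed for each engine); the function and variable names are hypothetical.

```python
def weighted_borda_votes(result_counts, weights):
    """Votes per engine, listed from the first result (i = 1) to the last.

    Implements V(r_i,j) = w_j * (max_k(r_k) - i + 1).
    """
    max_results = max(result_counts.values())
    return {
        engine: [weights[engine] * (max_results - i + 1)
                 for i in range(1, count + 1)]
        for engine, count in result_counts.items()
    }


# Result counts inferred from the slide; weights as set by the user.
RESULT_COUNTS = {"SE1": 4, "SE2": 3, "SE3": 5}
WEIGHTS = {"SE1": 7, "SE2": 10, "SE3": 5}

VOTES = weighted_borda_votes(RESULT_COUNTS, WEIGHTS)
```

The lists come out in rank order, so they are the slide's numbers read from highest to lowest: 35, 28, 21, 14 for SE1; 50, 40, 30 for SE2; and 25, 20, 15, 10, 5 for SE3. The top result of SE2 therefore outvotes the top results of the other two engines.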

CONCLUSION – FUTURE WORK

Conclusion

- We presented Captain Nemo, a fully functional metasearch engine with personal search spaces.

- Users can define their personal retrieval model, presentation style and topics of interest.

- Captain Nemo recommends a relevant topic of interest to classify each result, exploiting Nearest-Neighbour classification techniques.

Future Work

- To replace the flat model of topics of interest by a hierarchy of topics in the spirit of Kunz and Botsch (2002).

- To improve the classification process, exploiting background knowledge in the form of ontologies (Bloehdorn & Hotho, 2004).

Captain Nemo 

http://www.dblab.ntua.gr/~stef/nemo 

Links

Introduction

Introduction to Captain Nemo

Related work

Captain Nemo: Personal Retrieval Model

Captain Nemo: Personal Presentation Style

Captain Nemo: Topics of Personal Interest

Metasearch Ranking

Conclusion – Future Work