The Development of a search engine & Comparison according to algorithms



Page 1

The Development of a search engine & Comparison according to algorithms

20032017 Sungsoo Kim
20032066 Haebeom Lee

The mid-term progress report

Page 2

Topic of our term project

- Compare the performance of the algorithms used in information retrieval.
- On the basis of that comparison, build an efficient search engine and demonstrate it.

Page 3

Procedures

I. Extracting the position of the text information from raw files.
II. Extracting the keywords or index words from the text.
III. Making the index file.
IV. Gathering and sorting those index files.
V. Getting information about the index.
VI. Boolean retrieval.
VII. Natural language retrieval using the vector and probability models.

Page 4

Procedure (I)-1

- Raw document: the HTML files are put together into a single file.

ex)

<HTML> …document… </HTML>
<HTML> …document… </HTML>

- Get the text information with a string matching algorithm.
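As a rough illustration of this step, the sketch below splits a concatenated raw file into its documents by searching for the `<HTML>` and `</HTML>` markers. The function name `split_documents` and the sample string are hypothetical, and Python's built-in `str.find` stands in for whichever string matching routine the project actually uses.

```python
def split_documents(raw: str, open_tag: str = "<HTML>", close_tag: str = "</HTML>") -> list[str]:
    """Return the text found between each open_tag ... close_tag pair in raw.

    Assumption: tags appear exactly as written (same case, no attributes),
    as in the example on the slide.
    """
    documents = []
    pos = 0
    while True:
        start = raw.find(open_tag, pos)          # locate the next document start
        if start == -1:
            break
        body_start = start + len(open_tag)
        end = raw.find(close_tag, body_start)    # locate the matching end marker
        if end == -1:
            break                                # unterminated document: stop here
        documents.append(raw[body_start:end].strip())
        pos = end + len(close_tag)
    return documents


raw_file = "<HTML> first document </HTML>\n<HTML> second document </HTML>"
print(split_documents(raw_file))
```

Recording `body_start` and `end` instead of the extracted text would give the text positions mentioned in the procedure list.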

Page 5

Procedure (I)-2

Tuned Boyer-Moore Algorithm

[Figure: the pattern "Park" shifted along the text "BalkParcMoraPark"]

- Modified from the Boyer-Moore algorithm
- Uses the bad-character shift function
- Easy to apply
- Can search in about 1/3 of the time of a general search algorithm
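To make the bad-character shift concrete, here is a minimal sketch of a search driven by that shift alone. It is a simplified Horspool-style variant, not the tuned algorithm itself, and the function names are hypothetical. The text and pattern are taken from the slide's figure.

```python
def bad_character_table(pattern: str) -> dict[str, int]:
    """Map each character (except the pattern's last one) to the distance
    from its rightmost occurrence to the end of the pattern."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def search(text: str, pattern: str) -> list[int]:
    """Return every index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = bad_character_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text character aligned with the pattern's
        # last position; characters absent from the pattern allow a full jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(search("BalkParcMoraPark", "Park"))  # [12]
```

On this input the window jumps by 4, 4, 2 and 2 characters before matching, so only five alignments are tested instead of thirteen.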

Page 6

Procedure (II)

- Statistical information from the extracted text
- The results contain
  - average text length
  - total number of texts
  - average number of texts per document
- This information is not used directly in analyzing the search engine

Page 7

Procedure (III)

- Making the temporary index
- There are a number of ways to make index words.
- Exclude stopwords from the index words
  Ex) Stopwords: "the", "of", "and", "to"
- Stored in an AVL tree
- An AVL tree stays balanced as nodes are inserted or deleted, which helps the machine search efficiently.
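The following sketch shows one way these two ideas could fit together: words are filtered against a stopword set and the remaining index words are inserted into an AVL tree. The stopword set holds only the four examples from the slide. The tokenizer, class and function names are assumptions made for illustration, and deletion is omitted for brevity.

```python
import re

STOPWORDS = {"the", "of", "and", "to"}   # only the examples given on the slide


class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y


def insert(node, key):
    """Insert key and rebalance; returns the new root of this subtree."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node                          # word is already indexed
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:                          # left-heavy
        if key > node.left.key:              # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:                         # right-heavy
        if key < node.right.key:             # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def in_order(node):
    return in_order(node.left) + [node.key] + in_order(node.right) if node else []


def index_words(text):
    """Lowercase, split into alphabetic words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


root = None
for word in index_words("The development of a search engine and the comparison of algorithms"):
    root = insert(root, word)
print(in_order(root))
```

An in-order walk returns the index words in sorted order, and the rebalancing keeps the tree height logarithmic in the number of words.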

Page 8

Procedure (IV)-1

- Gathering and getting information on the index terms.
- The document index consists of pairs of an index word from a document and the location where that index word appeared.
- That location information is recorded in the lexicon and posting files.

Page 9

Procedure (IV)-2

Sample document

Document No.   Contents
1              Pease porridge hot, pease porridge cold
2              Pease porridge in the pot
3              Nine days old
4              Some like it hot, some like it cold
5              Some like it in the pot
6              Nine days old

Lexicon file   Posting file
cold   2       1, 4
days   2       3, 6
hot    2       1, 4
in     2       2, 5
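A minimal sketch of how a lexicon and posting file like these might be built and then queried with Boolean operators. It assumes the six sample documents are the lines of the "Pease porridge" rhyme and that the number stored in the lexicon is the count of documents containing the term. `build_index` and the tokenizer are hypothetical.

```python
import re
from collections import defaultdict

DOCUMENTS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def build_index(documents):
    """Return (lexicon, postings): term -> document count, term -> sorted document numbers."""
    postings = defaultdict(list)
    for doc_no in sorted(documents):
        # Each document contributes at most one posting per distinct term.
        for term in set(re.findall(r"[a-z]+", documents[doc_no].lower())):
            postings[term].append(doc_no)
    lexicon = {term: len(doc_nos) for term, doc_nos in postings.items()}
    return lexicon, dict(postings)


lexicon, postings = build_index(DOCUMENTS)
for term in ("cold", "days", "hot", "in"):
    print(term, lexicon[term], postings[term])

# Boolean retrieval reduces to set operations on the posting lists.
all_docs = set(DOCUMENTS)
hot, cold, pot = (set(postings[t]) for t in ("hot", "cold", "pot"))
print(sorted(hot & cold))       # hot AND cold -> [1, 4]
print(sorted(hot | pot))        # hot OR pot   -> [1, 2, 4, 5]
print(sorted(all_docs - hot))   # NOT hot      -> [2, 3, 5, 6]
```

The four printed lexicon and posting entries agree with the table on the slide.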

Page 10

Typical information retrieval

- Boolean model
  - set model: the query and the documents are expressed as sets
  - "not", "or", "and"
  - easy to understand but difficult for users to use
- Vector model
  - assigns a weighted value to each index term
  - calculates the similarity and ranks the results
  - most popular model
- Probability model
  - suggested by Robertson & Sparck Jones in 1976
  - based on probability and Bayes' theorem
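The vector model's two steps, weighting and similarity ranking, can be sketched as follows. The slide does not say which weighting or similarity measure is used, so tf-idf weights and cosine similarity are assumptions here, as are the function names and the sample documents (the lines of the "Pease porridge" rhyme).

```python
import math
from collections import Counter

DOCS = [
    "Pease porridge hot, pease porridge cold",
    "Pease porridge in the pot",
    "Nine days old",
    "Some like it hot, some like it cold",
    "Some like it in the pot",
    "Nine days old",
]


def tokenize(text):
    return text.lower().replace(",", " ").split()


def tf_idf_vectors(docs):
    """Return one {term: weight} vector per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(docs) / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (document number, score) pairs with a positive score, best first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), i + 1) for i, vec in enumerate(vectors)]
    return [(doc_no, round(score, 3)) for score, doc_no in sorted(scored, reverse=True) if score > 0]


for doc_no, score in rank("hot porridge", DOCS):
    print(doc_no, score)   # documents 1, 2 and 4, in that order
```

Unlike Boolean retrieval, a document that matches only one of the query terms is still returned, just with a lower score.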

Page 11

Until now … & next

- Extracted information from the raw files.
- Extracted the keywords and index words.
- Currently making the index file and the lexicon/posting files.
- Will survey the models (Boolean, vector, probability).
- Will build an engine consisting of three parts (one per model).
- Will compare their performance and suggest a simple engine.

Page 12

Development system

- System: Pentium 4 (1.6 GHz), Windows XP
- OS: Red Hat Linux on VMware
- Interface:
  - Runs from the console command line
  - Text-based results