![Page 1: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/1.jpg)
The Development of a Search Engine & Comparison According to Algorithms
20032017 Sungsoo Kim, 20032066 Haebeom Lee
The mid-term progress report
![Page 2: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/2.jpg)
Topic of our term project
Compare the performance of the algorithms used in information retrieval.
On the basis of that comparison, build an efficient search engine and demonstrate it.
![Page 3: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/3.jpg)
Procedures
I. Extracting the position of the text information from raw files.
II. Extracting the keywords or index words from the text.
III. Making the index file.
IV. Gathering and sorting the index files.
V. Getting information about the index.
VI. Boolean retrieval.
VII. Natural language retrieval using the vector and probability models.
![Page 4: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/4.jpg)
Procedure (I)-1
Raw document: the HTML files are put together into a single file.
ex)
<HTML> …document … </HTML>
<HTML> …document …. </HTML>
Get the text information with a string matching algorithm.
![Page 5: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/5.jpg)
Procedure (I)-2
Tuned Boyer-Moore Algorithm
[Figure: the pattern "Park" aligned at successive shifts against the text "BalkParcMoraPark"]
Modified from the Boyer-Moore algorithm
Uses the bad-character shift function
Easy to apply
Can search in about 1/3 of the time taken by a general search algorithm
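As a rough illustration of the bad-character shift idea, the Python sketch below searches a text with a Horspool-style simplification of Boyer-Moore. It is not the project's implementation: the loop unrolling and sentinel that distinguish the tuned variant are left out, and the names `bad_char_table` and `find_all` are hypothetical.

```python
def bad_char_table(pattern: str) -> dict[str, int]:
    """Map each character (except the pattern's last one) to the distance
    from its rightmost occurrence to the end of the pattern."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def find_all(text: str, pattern: str) -> list[int]:
    """Return every start position of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = bad_char_table(pattern)
    hits = []
    j = 0
    while j <= n - m:
        if text[j:j + m] == pattern:
            hits.append(j)
        # Shift by the table entry of the text character aligned with the
        # pattern's last position; unknown characters allow a full jump.
        j += shift.get(text[j + m - 1], m)
    return hits


# Locate the document boundaries in a concatenated raw file (made-up input).
raw = "<HTML>a</HTML><HTML>b</HTML>"
starts = find_all(raw, "<HTML>")
```

Because the shift depends only on one text character per alignment, most positions in the text are never compared against the pattern at all, which is where the speed-up over a character-by-character scan comes from.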
![Page 6: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/6.jpg)
Procedure (II)
Statistical information from the extracted text
The result contains
- average text length
- total number of texts
- average number of texts per document
This information is not used directly in analyzing the search engine.
![Page 7: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/7.jpg)
Procedure (III)
Making a temporary index
There are a number of ways of making index words.
Exclude stopwords from the index words
Ex) Stopwords: "the", "of", "and", "to"
Stored in an AVL tree
An AVL tree lets the machine insert or delete nodes and helps it search efficiently.
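The two ideas on this slide can be sketched together in Python: drop the stopwords, then insert the remaining words into an AVL tree that rebalances itself by rotations. This is a minimal sketch under stated assumptions, not the project's code; the tokenization (lower-casing and whitespace splitting) and every name here (`Node`, `insert`, `build_index`, `in_order`) are hypothetical, and deletion is omitted.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

STOPWORDS = {"the", "of", "and", "to"}  # the examples from the slide


@dataclass
class Node:
    term: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    height: int = 1


def height(n: Optional[Node]) -> int:
    return n.height if n else 0


def update(n: Node) -> None:
    n.height = 1 + max(height(n.left), height(n.right))


def rotate_right(y: Node) -> Node:
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x


def rotate_left(x: Node) -> Node:
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y


def insert(node: Optional[Node], term: str) -> Node:
    """Insert term and return the (possibly new) root of this subtree."""
    if node is None:
        return Node(term)
    if term < node.term:
        node.left = insert(node.left, term)
    elif term > node.term:
        node.right = insert(node.right, term)
    else:
        return node  # term already indexed
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:  # left subtree too tall
        if term > node.left.term:  # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:  # right subtree too tall
        if term < node.right.term:  # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def in_order(node: Optional[Node]) -> Iterator[str]:
    if node:
        yield from in_order(node.left)
        yield node.term
        yield from in_order(node.right)


def build_index(text: str) -> Optional[Node]:
    root = None
    for word in text.lower().split():
        if word not in STOPWORDS:
            root = insert(root, word)
    return root
```

Even when words arrive in sorted order, which would degrade a plain binary search tree into a list, the rotations keep the tree height logarithmic in the number of index words.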
![Page 8: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/8.jpg)
Procedure (IV)-1
Gathering and getting information about index terms.
A document index consists of pairs of an index word from a document and the location where that index word appeared.
That location information is linked to the lexicon and the posting file.
![Page 9: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/9.jpg)
Procedure (IV)-2
Sample document

| Document No. | Contents |
|---|---|
| 1 | Pease porridge hot, pease porridge cold |
| 2 | Pease porridge in the pot |
| 3 | Nine days old |
| 4 | Some like it hot, some like it cold |
| 5 | Some like it in the pot |
| 6 | Nine days old |

Lexicon file and posting file

| Lexicon file: term | Lexicon file: count | Posting file: document numbers |
|---|---|---|
| cold | 2 | 1, 4 |
| days | 2 | 3, 6 |
| hot | 2 | 1, 4 |
| in | 2 | 2, 5 |
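The lexicon and posting files for the sample can be reproduced with a short Python sketch. It assumes that the two truncated lines (documents 1 and 4) end in "cold" and that document 2 ends in "pot", as the posting lists imply; the in-memory dictionaries and the names `tokenize` and `build_inverted_index` are hypothetical stand-ins for the on-disk files.

```python
import re
from collections import defaultdict

# Sample documents, with the assumptions stated above.
DOCS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep runs of letters as terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs: dict[int, str]):
    """Return (lexicon, postings): the lexicon maps a term to the number of
    documents containing it, postings maps a term to those document numbers."""
    postings = defaultdict(list)
    for doc_no in sorted(docs):
        for term in set(tokenize(docs[doc_no])):
            postings[term].append(doc_no)
    lexicon = {term: len(doc_nos) for term, doc_nos in postings.items()}
    return lexicon, dict(postings)


lexicon, postings = build_inverted_index(DOCS)
```

Under those assumptions, `postings["hot"]` is `[1, 4]` and `lexicon["hot"]` is `2`, matching the entries shown on the slide.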
![Page 10: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/10.jpg)
Typical information retrieval
Boolean model
- set model: queries and documents are expressed as sets
- "not", "or", "and"
- easy to understand but difficult for users to use
Vector model
- assigns weights to index terms
- calculates the similarity and ranks the results
- most popular model
Probability model
- suggested by Robertson & Sparck Jones in 1976
- based on probability and Bayes' theorem
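To make the contrast between the first two models concrete, here is a minimal Python sketch over the six sample documents (assuming the truncated lines end in "cold" and document 2 ends in "pot"). The Boolean part uses plain set operations; the vector part uses tf-idf weights with cosine similarity, which is one common weighting choice and not necessarily the one this project adopts. All names are hypothetical.

```python
import math
from collections import Counter

DOCS = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}
TOKENS = {d: text.split() for d, text in DOCS.items()}

# Boolean model: each term maps to the set of documents containing it.
POSTINGS: dict[str, set[int]] = {}
for d, words in TOKENS.items():
    for w in set(words):
        POSTINGS.setdefault(w, set()).add(d)
ALL_DOCS = set(DOCS)

hot_and_porridge = POSTINGS["hot"] & POSTINGS["porridge"]  # "and"
hot_or_pot = POSTINGS["hot"] | POSTINGS["pot"]             # "or"
not_hot = ALL_DOCS - POSTINGS["hot"]                       # "not"


# Vector model: tf-idf weights, cosine similarity, ranked output.
def idf(term: str) -> float:
    return math.log(len(DOCS) / len(POSTINGS[term]))


def vector(words: list[str]) -> dict[str, float]:
    tf = Counter(w for w in words if w in POSTINGS)
    return {t: f * idf(t) for t, f in tf.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query: str) -> list[int]:
    """Return matching document numbers, best match first."""
    q = vector(query.split())
    scores = {d: cosine(q, vector(words)) for d, words in TOKENS.items()}
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])
```

The Boolean queries return unordered sets with no notion of a better or worse match, while `rank("pease porridge")` orders the matching documents by similarity, which is the practical difference the comparison in this project is meant to measure.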
![Page 11: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/11.jpg)
Until now … & next
Extracted information from the raw files.
Extracted the keywords and index words.
Currently making the index file and the lexicon/posting files.
Will survey the models (Boolean, vector, probability).
Will build an engine consisting of three parts (one for each of the 3 models).
Will compare their performance and suggest a simple engine.
![Page 12: The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report](https://reader036.vdocuments.site/reader036/viewer/2022082518/56649ea85503460f94bab5d7/html5/thumbnails/12.jpg)
Development system
System:
- Pentium 4 (1.6 GHz), Windows XP
OS:
- Red Hat Linux on VMware
Interface:
- Runs from the console command line
- Text-based results