Crawler-Based Search Engine, Milestone IV, by Ryan Caplet, Morris Wright and Bryan Chapman
Post on 20-Dec-2015
Topics Breakdown
• Updated Task Breakdown
• Parts of the Search Engine that are within the System
• Diagram
• Testing and Integration
Task Breakdown
• Bryan
– Crawler
– Keyword Generator
• Morris
– Database and Server Administrator
– Search Function
• Ryan
– Part of Crawler
– Search Function
– User Interface
• All
– Testing System Components
Breakdown of System Components
• Recursive wget
• Crawler / Indexer
• Keyword Generator
• Search Page
Recursive wget
• Run recursively against the UConn network
• More than 2,800 web pages were downloaded into the www folder
• About 3 GB in total
The Crawler – new_strip.pl
• Written in Perl
• Extracts the title and URL of each page and stores them in the Page Index database
• Uses the File::Basename module to derive a title from the filename when a page has none
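
The title-with-fallback step can be illustrated with a minimal sketch. The project's crawler is the Perl script new_strip.pl, which is not reproduced here; this is a Python stand-in, and the names TitleParser and page_title are hypothetical.

```python
import os
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def page_title(html_text, path):
    """Return the page's <title>, or the file's base name if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    if parser.title:
        return parser.title
    # Fallback, analogous to what File::Basename provides in Perl:
    # use the filename without its directory or extension.
    return os.path.splitext(os.path.basename(path))[0]
```

For example, a downloaded file www/dept/about.html with no title element would be indexed under the title "about".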
Keyword Generator
• Uses the index built by the crawler
• Applies a stemming algorithm to each word
• PHP is used to stem the words; Perl is used to interact with the Keywords database
• Filenames: process2.php, fileopen.php, stemming.php and processKeyword.pl
Side Topic: Stemming Algorithm
• Process of finding the root or natural form of a word.
• Example: “stemmer”, “stemming”, “stemmed” are based on “stem”. “Stem” is the stem.
• In our case, the stemmer returns the stem for each of these word variations
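
To make the idea concrete, here is a toy suffix-stripping stemmer. It is only a sketch: it is not the algorithm in stemming.php, it handles just three suffixes, and the name toy_stem is hypothetical. A real stemmer (the Porter algorithm, for instance) uses a much larger rule set.

```python
SUFFIXES = ("ing", "ed", "er")


def toy_stem(word):
    """Strip one common suffix, then undo consonant doubling."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words such as "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "stemm" -> "stem": drop a doubled final consonant.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word
```

With these rules, "stemmer", "stemming" and "stemmed" all reduce to "stem", while "stem" itself is left unchanged.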
Keyword Generator Cont’d
• The Keyword Generator will produce thousands of tables, one for each word.
• Each table will contain the URLs where the word appears and the word's frequency at each URL.
• Use of md5 checksum
• This is what we will be searching from!
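
The one-table-per-word layout can be sketched as follows. Several things here are assumptions, not taken from the project: the slides do not say exactly how the md5 checksum is used, so this sketch assumes each table is named after the md5 digest of its word; SQLite stands in for the project's database; and table_name, index_page, lookup and the example URL are hypothetical.

```python
import hashlib
import sqlite3
from collections import Counter


def table_name(word):
    # Assumption: the table is named after the word's md5 hex digest.
    # The "kw_" prefix keeps the name from starting with a digit.
    return "kw_" + hashlib.md5(word.encode("utf-8")).hexdigest()


def index_page(conn, url, words):
    """Record how often each (already stemmed) word occurs at url."""
    for word, freq in Counter(words).items():
        name = table_name(word)
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{name}" '
            "(url TEXT PRIMARY KEY, frequency INTEGER)"
        )
        conn.execute(f'INSERT OR REPLACE INTO "{name}" VALUES (?, ?)', (url, freq))


def lookup(conn, word):
    """Return (url, frequency) rows for word, or [] if it has no table."""
    name = table_name(word)
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    if not exists:
        return []
    return conn.execute(f'SELECT url, frequency FROM "{name}"').fetchall()


conn = sqlite3.connect(":memory:")
index_page(conn, "http://example.edu/a", ["stem", "stem", "cell"])
```

Under this assumption the digest contains only hexadecimal characters, so a table name built from a user's search word is always a safe identifier, and a search can only ever reach tables that the keyword script created.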
Search Page
• Written in HTML and PHP
• Filenames: index.html and results.php
• Will access the database and search the tables for the words specified
• Uses the Quicksort algorithm to sort results by frequency
• Uses an md5 checksum so that it searches only what was generated by the keyword script
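
The sorting step can be sketched as a quicksort over (URL, frequency) pairs, highest frequency first. This is a Python illustration, not the code in results.php; the function name and the sample rows are hypothetical.

```python
def quicksort_by_frequency(rows):
    """Sort (url, frequency) pairs by frequency, highest first."""
    if len(rows) <= 1:
        return list(rows)
    # Pivot on the frequency of the middle element.
    pivot = rows[len(rows) // 2][1]
    higher = [r for r in rows if r[1] > pivot]
    equal = [r for r in rows if r[1] == pivot]
    lower = [r for r in rows if r[1] < pivot]
    # Rows with equal frequency keep their original relative order.
    return quicksort_by_frequency(higher) + equal + quicksort_by_frequency(lower)


results = quicksort_by_frequency([("a", 3), ("b", 10), ("c", 1), ("d", 10)])
```

This list-building variant trades the in-place partitioning of the classic algorithm for readability, which is usually acceptable for a single page of search results.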
Testing Entry Criteria
• Each component must work adequately for its creator.
• Once the first party sees that it works, it is verified by a second party.
Integration Strategy Points
• All parts of the system are relatively separate.
• Yet the later parts depend on the earlier parts' output.
• Integration is done as shown in the diagram.
Exit Criteria
• In order for this system to be ready for beta testing:
– The search page must be tested thoroughly to make sure that it functions correctly, with security concerns addressed as they come up.
– The keyword tables must build properly and be accessible by the search page.