IST 441 Example Projects
Undergrad Project
• Find a customer (e.g., someone interested in Xbox game forums).
• Build a search engine for Xbox game forums etc.
• Compare two approaches: Google CSE and LucidWorks.
• Steps:
• Crawl websites (at most 5).
• Determine the crawl depth and how to include or exclude certain pages and file types.
• Extract information and build the index.
• Experiment with different rankings (see the “relevancy workbench” app in your LucidWorks installation).
• http://ist441.ist.psu.edu:8988/relevancy/experiment
• Perform search and compare the precision@K values.
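The precision@K comparison in the last step can be sketched as follows. This is a minimal illustration, not part of the course tooling: the function name `precision_at_k`, the document IDs, and the relevance judgments are all made up, and it assumes you have already collected each engine's ranked results and marked which ones are relevant by hand.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that were judged relevant.

    Divides by k even when fewer than k results were returned, so a
    short result list is penalized (one common convention).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


# Hypothetical judgments for one query: IDs of pages a human marked relevant.
relevant_docs = {"a", "b", "c"}

# Hypothetical ranked result lists returned by the two systems.
google_cse_results = ["a", "x", "b", "y", "c"]
lucidworks_results = ["x", "a", "y", "z", "b"]

for k in (1, 3, 5):
    p_cse = precision_at_k(google_cse_results, relevant_docs, k)
    p_lw = precision_at_k(lucidworks_results, relevant_docs, k)
    print(f"P@{k}: Google CSE = {p_cse:.2f}, LucidWorks = {p_lw:.2f}")
```

In practice you would average precision@K over several queries before comparing the two systems, since a single query can favor either engine by chance.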
Graduate Project
• Crawling academic institution webpages in Qatar (it’s a small domain).
• Integrating a more powerful crawler such as Nutch/Heritrix with the LucidWorks system.
• Focused crawling, i.e., crawling for specific types of pages such as researchers’ home pages.
• Modifying the parser to extract specific information such as email addresses and phone numbers from a web page.
• Modifying the Solr schema and/or ranking functions.
• Comparing search results with Google CSE.
• Discuss with the instructor for more information.
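The extraction of email addresses and phone numbers could be prototyped along these lines before touching the actual parser. This is only a sketch using the Python standard library; it is not the LucidWorks or Solr parser API. The names `extract_contacts`, `EMAIL_RE`, and `PHONE_RE` are hypothetical, the sample page and its contact details are made up, and both regular expressions are deliberately crude assumptions (the phone pattern will also match other long digit runs).

```python
import re
from html.parser import HTMLParser

# Simplified patterns (assumptions, not a full RFC 5322 or E.164 check).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth of script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def extract_contacts(html: str) -> dict:
    """Return de-duplicated, sorted email addresses and phone numbers."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted({m.strip() for m in PHONE_RE.findall(text)}),
    }


# Made-up page resembling a researcher's home page.
sample_page = (
    "<html><body><p>Contact: jane.doe@example.edu.qa, "
    "tel. +974 5550 0100</p>"
    "<script>var x = 'fake@nowhere.com';</script></body></html>"
)
print(extract_contacts(sample_page))
```

The extracted values could then be stored in dedicated index fields (for example, an email field and a phone field added to the Solr schema) so they can be searched or displayed separately.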