web based information retrieval

Upload: tpitikaris

Post on 07-Apr-2018

  • 8/6/2019 Web Based Information Retrieval

    1/83

    WEB BASED INFORMATION RETRIEVAL

    by Theodoros Pitikaris

    A thesis submitted in partial fulfillment of the requirements for the degree of:

    BSc in Computing and Information Technology

    Department of Computing

    University of Surrey


    UNIVERSITY OF SURREY

    ABSTRACT

    WEB BASED INFORMATION RETRIEVAL

    by Theodoros Pitikaris

    Supervisory Committee:
    Dr. Bogdan Vrusias, Department of Computing
    Dr. Nick Antonopoulos, Department of Computing

    The World Wide Web contains large sets of information. This characteristic, however, can become a real pain for users who seek sources that are both of high quality and relevant to their information needs. In this final year project we examine some methods for retrieving information stored on the web. The main focus is on whether and how software agents could enhance the information retrieval process.

    Another topic that we examine in this final year project is the requirements, phases and evaluation process that are necessary in the software design and production process.


    Table of Contents

    Introduction
        Final year project objectives
        Final year project structure
    Chapter 1. ANALYSIS I - LITERATURE REVIEW
        GOOGLE search engine
        Text Retrieval Methods
        Natural Language Processing
        Neural Network as infrastructure in retrieval
        Latent Semantic Indexing
        Latent Semantic Algorithm
        Advantages of Neural Network Models over Traditional IR Models
        Special issues on web Information Retrieval
        The Agent's Technology
        Introduction
        Categories of agents in more detail
    Chapter 2. SYSTEM Development Process
        Definition of software development process
        System Development Life Cycle (SDLC)
        Agile Software Development in detail
        General Characteristics of SDLC
        Requirement Gathering and Prioritization
        Software requirements analysis
        Requirements Gathering
        Problems & Difficulties
        Main techniques of Information Gathering
    Chapter 3. Software Requirements Specification
        Introduction
        Identification
        System overview
        Definitions, Acronyms, and Abbreviations
        Reference
        General Description
        User Personas and Characteristics
        Product Perspective
        Overview of Functional Requirements
        Overview of Data Requirements
        General Constraints, Assumptions, Dependencies, Guidelines
        External Interface Requirements
        Detailed Description of Functional Requirements
        Performance Requirements
        Quality Attributes
        Other Requirements
    Chapter 4. System Design
        Methodology Chosen
        System Overview
        System Core and front-ends
        Project development process
    Chapter 5. Software Development PHASES in Detail
        Design Overview
        Facilities
        The core system
        Software development platform
        Integrated Development Environment
        System Design
        Unit Testing
        Integration Testing
    Chapter 6. DISCUSSION
        Interesting parts during development process
        Prototype evaluation
        Comments on the evaluation results and related work
        Overall project Evaluation
    Chapter 7. Conclusions
        Future work
    INDEX


    LIST OF TABLES

    Table 1 Agile vs Waterfall methodology (available from http://en.wikipedia.org/wiki/Agile_software_development)
    Table 2 Development Phases
    Table 3 Sample of a Matrix candidate for SVD

    List of figures

    Figure 1 Google database development
    Figure 2 The Waterfall Model
    Figure 4 Waterfall vs. Agile
    Figure 5 System Use Case
    Figure 6 System State Diagram
    Figure 7 Users' opinion about the system


    Acknowledgments

    The author wishes to express sincere appreciation to Mr Staurakakis Emanuel and Mr Tsagatsakis John for their assistance in the preparation of this Final Year Project report.


    INTRODUCTION

    In 1996 the Bank of Sweden Prize in Economic Sciences in Memory of Alfred Nobel was awarded to James Mirrlees and William Vickrey for their fundamental contributions to the theory of incentives under asymmetric information.

    With their work (http://www.nobel.se/economics/laureates/2001/ecoadv.pdf) they have validated not only the importance of information but also the importance of access to this information.

    Nowadays nearly everyone in the West, especially after the development of the internet, has access to large amounts of data, in electronic or paper form. The main problem that we usually face is that the volume of this information is so large that we cannot easily handle it or, worse, it is of no use.

    In order to take advantage of this information we need to categorize it into thematically cohesive, and thus manageable, data. A few decades ago this was the librarians' job, but as already mentioned the volume of data has increased so dramatically (Society, 2004) that the traditional methods of indexing are not in a position to face this new challenge.

    The problem gets bigger when we need to categorize new documents based on their content. Many documents do have an abstract at the top, but in fact only scientific papers written for a special purpose take this form; for example, an abstract is essential for a paper but not for a newspaper or a magazine.


    Some people believe that when we talk about retrieving dat> ough" initial system prototype, which should normally be presented to the supervisory committee for comments.

    These comments would be taken into account in the next prototype version. This iterative process is repeated until no new comments are expressed by the supervisory committee.

    The final system evolves gradually through this process of trial and error, as the system is refined by each iteration.

    System Overview

    System Core and front-ends

    The proposed system runs in two parts. The first part, which consists of the main application, receives queries from users through either the command line or a web interface. The query is passed to the Google search engine, from which the system receives as a result a list of URLs (maximum 50) that, according to Google, are correlated with the user's initial query. The initial query is stored in a map and in a database for future reference.

    The system then crawls each of these URLs and produces two HashMap-type objects: one that contains all the terms that occurred in all the documents crawled, and a second one that contains the current document's term index and the frequency with which each term occurred in the text. During the map creation phase closed class words (Van Petten C, 1991) are removed, while the remaining terms pass through a stemmer that implements the Porter Stemmer algorithm (C.J. van Rijsbergen, 1980).

    After we have finished crawling all the URLs we end up with 50 HashMap objects (one for each document) and one large HashMap with all the words that we have met during the crawl. Each URL is represented by an n×1 vector, where n is the number of terms that live in each document.

    At this stage we combine the large HashMap and each URL's individual HashMap in order to produce one large 2D array with the values of the all-terms hash map as rows and the visited URLs as columns.

    We then decompose this large 2D array using the Singular Value Decomposition. The next step is to use the Latent Semantic Analysis technique and the Euclidean distance to classify the relevance of each document to the original user query. The Euclidean distance of two vectors $P = (p_x, p_y, p_z)$ and $Q = (q_x, q_y, q_z)$ is defined by the formula

    $$\mathrm{Edistance}(P, Q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}$$
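
The map-building and matrix-building steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis code: the class name `TermDocumentMatrixSketch`, the tiny stop-word list and the sample sentences are all hypothetical, the crawling step is replaced by in-memory strings, and the Porter stemmer and the SVD step are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TermDocumentMatrixSketch {
    // Hypothetical stop list; the thesis does not say which closed class words it removes.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of", "in", "to", "is");

    // Per-document map: term -> number of occurrences, with stop words dropped.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            // A stemmer would normally be applied to the token here (omitted in this sketch).
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Global map: every term seen in any document -> its row index (sorted for reproducibility).
    static Map<String, Integer> vocabulary(List<Map<String, Integer>> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (Map<String, Integer> doc : docs) {
            terms.addAll(doc.keySet());
        }
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String term : terms) {
            vocab.put(term, vocab.size());
        }
        return vocab;
    }

    // 2D array with terms as rows and documents as columns; absent terms stay at 0.
    static double[][] termDocumentMatrix(List<Map<String, Integer>> docs, Map<String, Integer> vocab) {
        double[][] matrix = new double[vocab.size()][docs.size()];
        for (int col = 0; col < docs.size(); col++) {
            for (Map.Entry<String, Integer> e : docs.get(col).entrySet()) {
                matrix[vocab.get(e.getKey())][col] = e.getValue();
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = new ArrayList<>();
        docs.add(termFrequencies("The agent crawls the web"));
        docs.add(termFrequencies("Web search and web agents"));
        Map<String, Integer> vocab = vocabulary(docs);
        double[][] matrix = termDocumentMatrix(docs, vocab);
        for (Map.Entry<String, Integer> e : vocab.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(matrix[e.getValue()]));
        }
    }
}
```

A matrix of this shape is what would then be handed to an SVD routine; which library (if any) performed the decomposition is not stated in this part of the text.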

    The user can access the relevance list through a web interface or via the standard output.
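
As a rough illustration of the final ranking step, the sketch below orders documents by their Euclidean distance from a query vector and prints the resulting list to standard output. It assumes the document and query vectors have already been produced (for example from the reduced space obtained after the SVD); the class name `EuclideanRankingSketch` and the sample vectors are hypothetical and not taken from the thesis.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class EuclideanRankingSketch {
    // Square root of the sum of squared coordinate differences.
    static double euclideanDistance(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns document indices ordered from the closest to the farthest from the query.
    static List<Integer> rankByDistance(double[] query, double[][] docVectors) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docVectors.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> euclideanDistance(query, docVectors[i])));
        return order;
    }

    public static void main(String[] args) {
        // Made-up vectors standing in for documents in the reduced space.
        double[] query = {1.0, 0.0};
        double[][] docs = {{5.0, 5.0}, {1.0, 1.0}, {1.0, 0.5}};
        for (int index : rankByDistance(query, docs)) {
            System.out.println("document " + index + " distance " + euclideanDistance(query, docs[index]));
        }
    }
}
```

A smaller distance is treated here as higher relevance, which is the usual reading of a distance-based ranking.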


    The other part of this application implements some characteristics of an agent. This agent-like part is initiated via a time scheduler, and its scope is to rework the user queries of the previous 24 hours, this time by getting extra results from yahoo.com. This part is launched daily at 5 GMT, since after several tests this was found to be the best time in terms of lower network congestion both in Europe and in the majority of the USA (please refer to Appendix I with the Greek Networks LT Weathermap).
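
One possible way to schedule such a daily run in Java is sketched below. It is an assumption-laden illustration, not the thesis implementation: the class `DailyAgentSchedulerSketch` and its methods are hypothetical, "5 GMT" is read as 05:00 UTC, and the task that would re-run the stored queries is passed in as a plain `Runnable`.

```java
import java.time.Duration;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DailyAgentSchedulerSketch {
    // Time from `now` until the next occurrence of hourUtc:00 UTC (strictly in the future).
    static Duration delayUntilNextRun(ZonedDateTime now, int hourUtc) {
        ZonedDateTime nowUtc = now.withZoneSameInstant(ZoneOffset.UTC);
        ZonedDateTime next = nowUtc.withHour(hourUtc).withMinute(0).withSecond(0).withNano(0);
        if (!next.isAfter(nowUtc)) {
            next = next.plusDays(1);
        }
        return Duration.between(nowUtc, next);
    }

    // Runs the task once a day at the given UTC hour until the returned scheduler is shut down.
    static ScheduledExecutorService startDaily(Runnable task, int hourUtc) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long initialDelaySeconds = delayUntilNextRun(ZonedDateTime.now(ZoneOffset.UTC), hourUtc).getSeconds();
        scheduler.scheduleAtFixedRate(task, initialDelaySeconds, TimeUnit.DAYS.toSeconds(1), TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        // Only reports the delay; call startDaily(task, 5) to actually keep the agent running.
        Duration delay = delayUntilNextRun(ZonedDateTime.now(ZoneOffset.UTC), 5);
        System.out.println("Next agent run in " + delay.toMinutes() + " minutes");
    }
}
```

Computing the initial delay separately keeps the time arithmetic testable without waiting for the scheduler to fire.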

    Project development process

    In order to accomplish the tasks of this project, the development process was segmented into discrete phases. The software itself, since Java is a language that promotes reusability and object orientation, was developed with a modular logic.

    Phase I

    1. The initial task was to determine the idea that the project should serve. In the beginning there was a thought about creating an intelligent search engine using AI.
    2. After some discussion with the project supervisor, a decision was taken to incorporate some research about agents. Additionally, there was an agreement to implement some of the agents' characteristics if time and resources were adequate.
    3. The second task was to decide on what platform the system would be developed. The platform concept included the programming language and the Human Computer Interaction.

    Phase II

    The next step was to undertake bibliographical research on how the state-of-the-art web search engines work. The main search engine of interest was Google. At this stage some literature research took place in order to understand this area of computer science.

    Phase III

    1. After the bibliographical research was finished, the next task was to decide about the modules that should comprise the final project. The domain charts and initial class diagrams were constructed. The top-down strategy was employed in order to visualize an overview of the system without going into detail for any part of it. The first unit was the web crawler system, for collecting the web data.
    2. Testing is a continuous process in the prototyping methodology, so after the WebCrawler design and implementation finished, some tests were carried out to determine the efficiency and the stability of this specific unit. Test and fix.
    3. User Evaluation.
    4. The next module was the software part that would count the occurrence of each term in every document, and the interconnection with task 6.
    5. Again some testing took place. Test and fix.
    6. Incorporate AI techniques in order to test the relevance of each retrieved URL to the original question, and interconnect the new software with the software from tasks 8 and 6.
    7. User Evaluation.
    8. Test and fix.

    Phase IV

    1. System Front-End: how the user would interact with the core system. For safety reasons (since knowledge of graphical UI was limited) both the console mode and the web interface methods have been employed.
    2. User Evaluation.
    3. Test and fix.
    4. Introduce agent characteristics. Autonomous.
    5. Test and fix.

    Phase V

    1. Total system testing.
    2. Final system evaluation from users.
    3. Produce the final documentation.


    Bibliographical research on how the state-of-the-art web search engines work (Google & msn)
    Bibliographical research on how to design and implement a soft-agent
    Design and implement an agent (crawler) for collecting the web-data
    Design the storage database
    Decide what AI methods are applicable to our domain problem
    Design and produce an output interface
    Evaluate usability of the interface using users' feedback