

EECS 767 Information Retrieval

Midterm Progress Report - Semester Project

Written by

Chinmay Ratnaparkhi

Kunal Karnik

Our project can be accessed at:

http://people.eecs.ku.edu/~cratnapa/ir/


Table of contents

1. Preface
    ○ Project Goals
    ○ Work Summary
2. Programming Platform Selection
    ○ Implementation in Haskell
    ○ Implementation in Scheme (Racket)
3. Data Structures
    ○ Stat
    ○ Tf-Posit
4. Parsing Functions
    ○ Prep Functions
    ○ Pre-Processing and Processing
5. Dictionary (Inverted Index)
6. Vectorization
    ○ Document Vectorization
    ○ Query Vectorization
7. Web Spider
8. Accepting & Processing Queries
9. User Interface
10. Testing and Performance
11. Roadmap from Here On


Preface

Project Goals
To create a simple Information Retrieval System (search engine) using the Vector Space Model, add optimization techniques such as query expansion and relevance feedback, and test the product on a practical data set obtained by crawling a specific domain with a niche crawler.

Work Summary
We began writing micro-algorithms for our project, planning to work in a common object-oriented programming language such as C++, Java, or JavaScript. We thought it would be easier to implement our own data structures, since we could custom-build them, but we quickly realized that most of the functionality we needed can be easily obtained from the standard libraries that ship with most modern programming languages. We also shared an interest in functional programming, so we switched gears and decided to implement the system in Haskell. After some basic testing, we decided that Scheme (Racket) was a better choice for our implementation.

After writing the basic structure of our project, we tested our code using the document examples provided in the class quizzes; this way we could ensure that we were on the right track. Toward the end of the implementation of our inverted index and vectorization functions, we began testing with dummy data. We downloaded Wikipedia pages of TV shows that we had watched and added pages to our collection from rating websites such as Rotten Tomatoes, IMDb, TV.com, etc. While testing, we tried to reach a particular page by querying descriptions instead of keywords for shows. For instance, when looking for the show "The Big Bang Theory", we would search for words such as "Bazinga!", only to realize that such terms were too specific to the show and would not appear on any other pages.


Programming Platform Selection

Implementation in Haskell
Last year both members of our team were enrolled in EECS 776 - Functional Programming and Domain Specific Languages. We enjoyed learning about this programming paradigm and very much appreciated the quirks and benefits of functional programming over object-oriented programming, so we decided to use a functional programming language for the implementation of our project. We quickly realized, as the type signatures below show, that the files we read in are not simply lists of strings. Since they are obtained through an input action, the resulting list of words is wrapped in Haskell's IO monad.

Expected Format : Doc :: [String]
Actual Format   : Doc :: IO [String]

This meant that all of the functions we had written assuming plain lists of strings would now either need to be mapped over these monadic values (fmap :: (a -> b) -> f a -> f b) or bound to them ((>>=) :: m a -> (a -> m b) -> m b). This made our implementation complicated and significantly reduced the readability of our work, so we abandoned the plan and discontinued our work in Haskell.

Implementation in Scheme (Racket)
We decided to use Scheme (Racket) since it offers a programming style and functionality similar to Haskell's, without the strong typing and the monadic plumbing. We quickly translated our functions into Scheme and tested them on our dummy documents; the implementation seemed to be working well, so we decided to proceed with Scheme.

Data Structures
For building our inverted index for the vector space model, we decided to use several data structures. We mainly relied on the built-in list to represent documents, token streams, vectors, etc., and on the library hash table to contain the final inverted index, giving constant-time access to the entries. Other than these, we only had to build a handful of data structures to hold term statistics, postings of terms corresponding to documents, and term-proximity data. The following is a list of the data structures we have used in our implementation.


Stat

(define-type Stat
  (mtStat)
  (stat (docFreq number?)
        (termFreq number?)
        (postings hash?)))

Stat contains the entirety of the statistics corresponding to each term in the dictionary. A freshly created term contains mtStat, i.e. no statistics. A new record is then added with the stat constructor and contains the following information -

● docFreq (of type number) represents the document frequency of the term.
● termFreq (of type number) represents the total term frequency in the collection.
● postings (of type hash table) contains all the postings. Keys represent the document IDs and the values are of type Tf-Posit (described below).

Tf-Posit

(define-type Tf-Posit
  (noPos)
  (tf-pos-list (doc-term-freq number?)
               (positions list?)))

In the dictionary of terms, values of type Tf-Posit are stored in the hash table representing the posting list for each term. An empty Tf-Posit is defined as noPos and does not contain any information. tf-pos-list contains the following information -

● doc-term-freq (of type number) represents the total number of times the term appears in a particular document.

● positions (of type list) contains a list of numbers representing where the term appears in that document.

The Stat and Tf-Posit records for the term "arrived" in the example document "doc 1" below would be:

Doc 1    : "Arrived with gold. Another arrived with silver. Then arrived platinum"
Stat     : (stat 1 3 (postings-hashtable))
Tf-Posit : (tf-pos-list 3 '(3 7 11))
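As a minimal sketch (assuming the PLAI-style constructors generated by the define-type forms above), these records could be built by hand like so:

(define arrived-postings (make-hash))
;; doc 1: "arrived" occurs 3 times, at the positions listed above
(hash-set! arrived-postings 1 (tf-pos-list 3 '(3 7 11)))
;; docFreq = 1 (one document so far), collection-wide termFreq = 3
(define arrived-stat (stat 1 3 arrived-postings))
;; the dictionary entry would then be (hash-set! dictionary "arrived" arrived-stat)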


Parsing Functions

(read-file file-name)
● Reads the file named 'file-name' from the 'data/' directory located inside the project folder.

(clean-up str), (tag-spacer x), (tokenize-string str)
● The (clean-up … ) method accepts a string as an argument and analyzes each character in the string to remove special characters. specialChars is a statically defined set of characters.

● (tag-spacer … ) Accepts a string and adds spaces before and after the '<' and '>' characters to separate HTML tags from one another (should they appear without spaces). This helps us parse the input.

● (tokenize-string … ) Uses space (#\space) and tab (#\tab) as delimiters to separate parts of strings as tokens.

(trim xs) and (rev-trim xs)
● These functions accept a stream of tokens and remove style and script tags along with the content they enclose. They complete the task by making tail calls to one another.

(de-tag list-of-tokens) and (remove-stops xs)
● (de-tag … ) Works in conjunction with (is-member x xs) to remove all the HTML tags from a given list of tokens.
● (remove-stops … ) accepts a list of processed tokens and removes stop words.

Pre-processing and Processing
We combined several of the functions mentioned above and apply them to an input stream sequentially; the entire process is divided into three routines.

Prepping (prep):
(map string-downcase (map list->string (map tag-spacer (map clean-up (read-file file-name)))))

Pre-Processing (pre-proc):
(de-tag (flatten (map tokenize-string (trim (prep file-name)))))

Processing:
(remove-stops (pre-proc file-name))
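As a sketch, the three routines can be wrapped into named functions, assuming the parsing functions described above (read-file, clean-up, tag-spacer, tokenize-string, trim, de-tag, remove-stops) are defined:

(define (prep file-name)
  (map string-downcase
       (map list->string
            (map tag-spacer
                 (map clean-up (read-file file-name))))))

(define (pre-proc file-name)
  (de-tag (flatten (map tokenize-string (trim (prep file-name))))))

(define (process file-name)
  (remove-stops (pre-proc file-name)))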


Dictionary (Inverted Index)

(define dictionary (make-hash))

After processing each document into an independent list of tokens (without removing duplicates), we hash these tokens into a dictionary-like structure using the standard library hash table. The idea is to use the terms as keys for the dictionary, and their respective values are data structures representing the statistics of the term across the entire collection. Figure 1 below describes the structure of the dictionary we have implemented.

Figure 1 - Structural representation of the dictionary, implemented using nested hash tables. It makes use of two custom data structures, Stat and Tf-Posit.

The key function used to process each term is defined as hash, which accepts the following parameters:

1. word - The term that is being inserted into the hash table (dictionary)
2. doc - The document identification number (1, 2, 3, …)
3. posit - The position of the term in the document

The hash function follows a simple algorithm (a sketch in Racket follows the steps below):

1. Initial check : Does the term already exist in the dictionary?


2. No, the term is not in the dictionary
   ○ Since the term is not present in the dictionary, a fresh record is needed.
   ○ A new instance of the stat data structure is created.
   ○ The starting document frequency of the term is 1.
   ○ The starting total term frequency of the term is 1.
   ○ For the posting list of this term (which is another embedded hash table):
      ■ Use doc as the key.
      ■ Initiate a list with posit as the position.

3. Yes, the term is already in the dictionary
   ○ Since the term is already present in the dictionary, its stat record needs to be updated and the new position needs to be documented.
   ○ The associated stat record is retrieved for the term.
   ○ If this term has previously appeared in the document that is currently being processed, the document frequency is not increased. The postings are accessed, the new position of the term is added to the list of positions, and the term frequency associated with the document is increased by 1.
   ○ If this term is appearing for the first time in the document that is currently being processed, the document frequency is increased by 1 and a fresh record is added to the postings hash table with:
      ■ the key being the value of doc
      ■ the associated value being a term frequency of 1 (since the term appears for the first time in this document) and a new list containing the value of posit
   ○ In both cases mentioned above, the total term frequency is increased by 1.
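The steps above can be sketched roughly as follows. hash-term is an illustrative name for the report's hash function (which would otherwise shadow Racket's built-in hash), and the field accessors (stat-docFreq, stat-termFreq, stat-postings, tf-pos-list-doc-term-freq, tf-pos-list-positions) are assumed to be the ones generated by define-type; our actual implementation may differ in detail.

(define (hash-term word doc posit)
  (if (not (hash-has-key? dictionary word))
      ;; case 2: first occurrence of the term anywhere in the collection
      (let ([postings (make-hash)])
        (hash-set! postings doc (tf-pos-list 1 (list posit)))
        (hash-set! dictionary word (stat 1 1 postings)))
      ;; case 3: the term is already in the dictionary
      (let* ([s        (hash-ref dictionary word)]
             [postings (stat-postings s)]
             [seen?    (hash-has-key? postings doc)])
        (if seen?
            ;; already seen in this document: extend its position list, bump its tf
            (let ([tp (hash-ref postings doc)])
              (hash-set! postings doc
                         (tf-pos-list (add1 (tf-pos-list-doc-term-freq tp))
                                      (append (tf-pos-list-positions tp) (list posit)))))
            ;; first time in this document: add a fresh posting
            (hash-set! postings doc (tf-pos-list 1 (list posit))))
        ;; the collection-wide term frequency always grows by one;
        ;; the document frequency grows only when the document is new for this term
        (hash-set! dictionary word
                   (stat (if seen? (stat-docFreq s) (add1 (stat-docFreq s)))
                         (add1 (stat-termFreq s))
                         postings)))))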

Vectorization

Document Vectorization

(define vectors (make-hash)) This definition creates an empty hash table named "vectors".

(define (dictionary-terms) .. ) This function goes through the hash table dictionary and makes a list of all the keys it contains. The list obtained is later used for word comparison between an issued query and all the terms that exist in the dictionary.
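A one-line sketch of this, assuming Racket's built-in hash-keys:

(define (dictionary-terms) (hash-keys dictionary))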


(define (make-vector all-terms cur-doc) .. ) Creates the vector for the current document, given by the variable cur-doc. It goes through the entire hash table and checks whether the current document exists in the postings of each word, using the exists? function below. If it does, it accesses the term frequency for that document, calculates the idf, multiplies the two, and adds the result to the vector at that position. If the term is not present, 0 is added to the vector.

(define (exists? ht word) .. ) This function simply checks if the "word" is present in the hash table "ht".

Inverted index entry for a term:

"Entropy" → (DocFreq 3) (TermFreq 7)
            (HashTable
              (2 → (4 '(1 22 99 165)))
              (7 → (2 '(23 68)))
              (9 → (1 '(45))))
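A sketch of the per-term weight that make-vector adds to a document vector, assuming the usual idf = log(N / docFreq) (the report does not spell out its idf formula) and the accessors generated by define-type; n-docs and term-weight are illustrative names:

(define (idf s n-docs)
  (log (/ n-docs (stat-docFreq s))))

(define (term-weight word cur-doc n-docs)
  (let* ([s        (hash-ref dictionary word)]
         [postings (stat-postings s)])
    (if (hash-has-key? postings cur-doc)
        ;; term frequency in this document times the term's idf
        (* (tf-pos-list-doc-term-freq (hash-ref postings cur-doc))
           (idf s n-docs))
        0)))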

(define (vectorize-helper start max list) .. ) This function maps the overall vectorization process over all the documents (a sketch follows the list below).

❖ Max is a variable used to keep count of the documents remaining to vectorize. It goes down with every recursive iteration, and we are done with the vectorization process when it hits 0.

❖ Start is the current document being vectorized. The normalize and make-vector functions are called on it to get the final vector.

❖ Thus Start goes up with every iteration and Max keeps coming down until it eventually reaches 0, at which point the vectorization process is complete.
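A rough sketch of this recursion; normalize and the report's make-vector (which shadows Racket's built-in of the same name) are assumed to be defined as described above, and the third parameter is renamed all-terms here purely for readability:

(define (vectorize-helper start max all-terms)
  (when (> max 0)
    ;; build, normalize and store the vector for the current document
    (hash-set! vectors start (normalize (make-vector all-terms start)))
    ;; start counts up, max counts down until it reaches 0
    (vectorize-helper (add1 start) (sub1 max) all-terms)))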

(define (vectorize) .. ) This function calls the above vectorize-helper function on the hash table of dictionary terms, after checking that the tokenization and hashing process has successfully completed.

(define (get-vector x) .. ) We have a hash table called "vectors" which stores all the vectors we have created in the process of vectorization. This function is simply a reference function that accepts an id and returns the corresponding vector from the vectors hash table, e.g. passing 1 to this function will return the vector of the first document.


The Web Spider
We decided to continue working in Racket (Scheme) for our web spider as well. Following are the important data structures and functions we used in the implementation of our spider.

(define q (box '())) For our frontier we decided to use a standard queue. Since mutation is not primarily used or appreciated in functional programming languages, we had to dig around a little bit to figure out if it is even possible. 'Box' offers mutable behavior in Scheme. We start out with an empty queue, i.e. an empty list in a box. We then implemented standard queue functions such as push, pop and full? to interact with the queue.
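A minimal sketch of the boxed-list frontier and the queue operations named above; our real definitions may differ in detail, and the reading of full? is a guess here:

(define q (box '()))                                              ; the frontier, as defined above
(define (push url) (set-box! q (append (unbox q) (list url))))    ; enqueue at the back
(define (pop)
  (let ([front (car (unbox q))])
    (set-box! q (cdr (unbox q)))
    front))                                                       ; dequeue from the front
(define (full?) (not (null? (unbox q))))                          ; read here as "does the frontier still hold URLs?"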

(define crawled (make-hash)) We decided to keep a hash table of all the URLs we have visited and downloaded sources from. The ‘crawled’ hash table served this purpose. We had associated functions such as ‘crawled?’ which would check if a given URL is already in the hashtable. This gave us constant time performance for checking whether a given URL needs to be visited or not.
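A sketch of this visited-URL bookkeeping; mark-crawled! is an illustrative helper name:

(define crawled (make-hash))                            ; as defined above
(define (crawled? url) (hash-has-key? crawled url))
(define (mark-crawled! url) (hash-set! crawled url #t))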

(define (crawl-helper limit … ) ...) This function would make a call to the crawl function, but we realized that sometimes websites return errors while downloading sources from them. Whenever this happened, the spider would come to a complete halt. To avoid that, we decided to implement this crawl-helper function which would report an error to the console with a message, skip that website, readjust all the parameters and keep the crawl function going. This way, our crawl did not end until the user-specified limit was reached, regardless of whether there were any errors in the process.
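A rough sketch of this error-skipping behavior using Racket's with-handlers; crawl-one is an illustrative name for a single fetch-and-parse step, and whether the limit counts attempts or successful downloads is simplified here:

(define (crawl-helper limit)
  (when (> limit 0)
    (with-handlers ([exn:fail? (lambda (e)
                                 ;; report the error and move on instead of halting the crawl
                                 (displayln (string-append "skipping page: " (exn-message e))))])
      (crawl-one (pop)))
    (crawl-helper (sub1 limit))))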

(define (get-title) … ) During the crawl, before saving the source of a webpage as an HTML document, the (get-title…) method runs through the source code and fetches the text wrapped in the <title></title> tag. An index is added at the front of the extracted text and this new string is used as the name of the HTML document when saving it, e.g. "03 International Student Services". The purpose of the index is to differentiate pages with the same title. We realized that some of the pages were being overwritten when they had the same title, hence the strategy.
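A minimal sketch of title extraction with a regular expression; the report's get-title works over the already-parsed source and zero-pads the index (e.g. "03"), which is simplified here:

(define (get-title source index)
  (let ([m (regexp-match #px"<title>(.*?)</title>" source)])
    (string-append (number->string index) " "
                   (if m (cadr m) "untitled"))))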


(define (extract-urls … ) … ) The (extract-urls…) method runs through the entire source code before saving it as an HTML document and spots all the chunks of text that contain signature elements of a URL, for example 'http', 'https', '.com', '.org', '.net', etc. An array of parsing functions is first applied to the source code to separate HTML tags and get rid of unnecessary information, so that URLs are not missed because of misrecognition. Each URL is tested for validity with the (valid? ...) method, then cross-referenced with the 'crawled' hash table to ensure that it was not already looked at. If it is a fresh, valid URL, it is pushed into the frontier queue.

(define (crawl … ) …) The crawl method combines all the methods listed above. It is an overloaded procedure which can be invoked in a couple of different ways. Following are the instructions on how to use this spider in compliance with our search engine.

Instructions
Run the "Spider.rkt" file first. A list of URLs can be provided as seeds, which will be crawled to download pages from the internet. Please use the following commands after running the Spider.rkt file:

(seed '("http://www.example.com" "http://www.another.com"))

The crawl can be started in two ways: a free crawl, i.e. one with no domain restriction, where all valid URLs are used; and a restricted crawl, where only pages whose URLs contain the provided domain will be used and the rest will be ignored even if they are valid.

1. Free crawl : (crawl #no_of_pages)
2. Restricted : (crawl! #no_of_pages "xyz" "abc.com" "")

In a restricted crawl, a page will be downloaded only if the URL contains either "xyz" or "abc.com". The fourth argument (currently an empty string) is the baser, useful for websites using relative links (more on it in the documentation).

Output
After the crawl has completed, the following files will be created in the "Exports/" directory. It is important to include these files on your server along with the index files created by the 'main.rkt' file, to be able to run the search engine successfully.

1. urls.txt
2. Titles.txt


Accepting and Processing Queries
After finishing our implementation of the Main program and the Spider, we considered developing a server in Scheme and hosting a web page that would take user queries and invoke our main program to process each query and compare it against all the documents. This would require our Main program to run continuously on the server as a process listening for requests from the web page, and it would involve a lot of server-side programming, most of which would be out of the scope of this course. We decided to rely on some non-traditional methods that would serve the purpose without being very complicated. After a little thought, we decided to let our Spider run and download a few pages, let our Main program process all the downloaded documents, and then export all the processed data as JSON objects.

We wrote a number of parsing functions which take the Scheme representations of our data structures, i.e. the main dictionary hash table and the embedded Stat and Tf-Posit structures, and translate them into a string that looks like a JSON object. We then write this newly built string into a file. Using this technique, we decided to export the following files (a small serialization sketch follows the list):

1. Term-IDFs - A list of all terms with their corresponding IDF values
2. Vectors - All the document vectors
3. Terms - A list of all the terms in the dictionary
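A rough sketch of the idea for the first of these files: serializing the term-to-IDF table into a JSON-like string and writing it to an index file. Here term-idfs, idf-json and write-index-file are illustrative names; the real export functions also cover the vectors and postings.

(define (idf-json term-idfs)
  (string-append
   "{"
   (string-join
    (hash-map term-idfs
              (lambda (term idf)
                (string-append "\"" term "\": " (number->string idf))))
    ", ")
   "}"))

(define (write-index-file name content)
  (call-with-output-file name
    (lambda (out) (display content out))
    #:exists 'replace))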

While downloading each document, the Spider was already storing information such as the corresponding URL of the page and the title extracted from its source. So, similarly, from the Spider Scheme file we decided to export the following two files:

1. URLS - A list of URLs written sequentially, corresponding to each downloaded document

2. Titles - A list of all the page titles in the same order as above.


We decided to call these files our Index Files. These five files were then uploaded to the server from which our webpage could request information. For processing the query we decided to rely on JavaScript; simple jQuery methods such as $.get() helped us fetch data from our Index Files.

For vectorization of the query, an array of 0s is created whose length equals the total number of terms; a vector in JavaScript is simply an array as long as the number of terms we have. The IDFs of the terms in the query are then added at the appropriate indices in this array, which lets us vectorize the query provided by the user very quickly. We noted that JavaScript was extremely fast and could process the Index Files in fractions of a second; it was not even noticeable that we were loading the entire index for more than 1500 documents! (We later changed this to load only the data for the terms present in the query, but more on that later.)

Calculating the similarities between the query vector and each document vector returns an array of objects containing the document ID, its similarity value, and its title.


User Interface
The user interface of this project is written in HTML, CSS and JavaScript. We have inherited most of our CSS from Bootstrap, although we have customized it quite a bit to meet our needs. Upon entering our website, the user lands on the Main page. This is an introductory page that gives the user an overview of our project. We have listed all three categories of searches that we have designed in the implementation of our project. The user can pick one of the search categories, which then takes them to the corresponding page. The user may pick from the following three categories:

1. Regular Search
2. Search with Relevance Feedback
3. Search with Proximity

In the Regular Search, upon searching for a particular query, the query is compared against all of our documents. The top 10 documents, i.e. the documents with the highest similarity, are then listed on the page in the following format:

Title of the page
http://www.url-of-the-page.com
A nice description of the page, with important words (i.e. words that appear in the query) highlighted to draw attention. This makes our page look very pretty!

The Relevance Feedback page is presented in a similar manner. In front of the title of each page, we offer 'Relevant' and 'Irrelevant' buttons, with which the user can mark a document. As soon as a document is marked, the 'Refine Search' button is enabled. After marking as many documents as the user wishes, the search can be refined.

Title of the page    [Relevant] [Irrelevant]
http://www.url-of-the-page.com
A nice description of the page, with important words (i.e. words that appear in the query) highlighted to draw attention. This makes our page look very pretty!

For example, in one search the user entered the query 'Lisa'. He then decided that the page titled Lisa Kudrow - IMDB was a relevant page, and that the other two pages, Yeardley Smith - IMDB and The Simpsons - IMDB, were irrelevant. The user's selections are displayed at the top right corner of the page.

In the final, proximity-based search, upon searching for a particular query, the query is compared against all of our documents. The top 10 documents, i.e. the documents with the highest similarity, are picked out and then ranked based on the distance of the query terms from one another in the actual documents, on the assumption that terms appearing closer together add semantic value to the document for that particular information need.

Testing and Performance
For a test run, we performed a restricted crawl on IMDB and downloaded 500 IMDB pages. Processing these 500 pages took:

● 3 min and 23 seconds to read in the words and tokenize all the documents
● 11 seconds to hash the data into an Inverted Index

We considered this to be a decent speed, but we conjectured that the following optimizations would make our tokenization faster:

● Accomplishing multiple tasks in a single iteration, so that the number of document traversals is decreased.

● Using a hash table of stop words instead of a list, so that linear-time access is improved to constant-time access (a small sketch follows).
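A small sketch of that planned stop-word optimization; stop-word-list is assumed to be the existing list of stop words:

(define stop-words (make-hash))
(for-each (lambda (w) (hash-set! stop-words w #t)) stop-word-list)
(define (stop-word? w) (hash-has-key? stop-words w))
(define (remove-stops tokens)
  (filter (lambda (t) (not (stop-word? t))) tokens))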

Roadmap from Here On
As can be seen in the description of the user interface and the definitions above it, we are currently getting the results delivered back to the web page as a JSON object and parsing them with JavaScript to produce a beautified result page, which shows the title of the document, a URL for the webpage, and a small description of the page obtained from the metadata fetched from its URL. The title can be clicked to reach the actual page, for the ease of the user.

There are three successfully running versions of the engine: a simple one using cosine similarities only, one with relevance feedback (version Beta) {intended for images}, and one basing results on the proximity of query terms from one another in the documents.

We are still planning to implement a multi-threaded spider and have the document processor run in parallel with it, so the two keep running side by side. This will keep our index growing until we are noticed by a sufficiently big company (hopefully Google) that makes us a multibillion-dollar offer.
