Sample Searches Query: smalheiser Response: DBLP Neil Smalheiser PNAS Abstract Smalheiser et et al 97 MBC abstract Smalheiser UIC Dept of Psychiatry Neil Smalheiser

Upload: clinton-higgins

Post on 16-Jan-2016


TRANSCRIPT

Page 1:

Sample Searches

• Query: smalheiser

• Response: DBLP Neil Smalheiser PNAS Abstract Smalheiser et et al 97 MBC abstract Smalheiser … UIC Dept of Psychiatry Neil Smalheiser

Page 2:

Query: computer science genetics

• Campus program: university, campus, college and employment resources

• SpringerLink (On-line journals and books in science, technology and medicine)

• Course ( Advanced topics in computer science and computational genetics)

• Annual Review of Computer Science

• …

Page 3:

Comments for Slides 1-2

The first two searches are intended to show that Google is reasonably good at retrieving Web pages when given a few keywords. In the first query, the name Smalheiser is submitted. Ideally, the home page of Smalheiser should be retrieved first. Instead, his publications in computer science are retrieved in the first document. This is followed by some of his publications in the medical area. Finally, his home page in Psychiatry is retrieved.

The second query asks for important documents in the intersection of the two areas “computer science” and “genetics”. The first retrieved document seems to be unrelated to the query. The second retrieved document seems OK. The third document is a course in both areas.

The examples show that Google is still far from perfect.

Page 4:

Information Retrieval

Document Representation:

Remove stop words:

Eg. In “Automatically Identifying Gene terms in MEDLINE Abstracts”

Remove “in”

Stemming:

“Automatically” becomes “automatic”;

“Identifying” becomes “identify”;

“Abstracts” becomes “abstract”

Page 5:

Comments on last slide

The first two steps in constructing a document representation consist of eliminating non-content words (stop words) and mapping variations of the same word to the same stem via a process called stemming.
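The two steps can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the three suffix rules below are assumptions chosen to cover the slide's example, not the rules of any real stemmer (a production system would more likely use something like the Porter stemmer).

```python
# Toy stop-word list and suffix rules; both are illustrative assumptions.
STOP_WORDS = {"in", "the", "of", "a", "an", "and", "for", "to"}

def toy_stem(word: str) -> str:
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    if word.endswith("ally") and len(word) > 6:
        return word[:-4]          # automatically -> automatic
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]          # identifying -> identify
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]          # abstracts -> abstract
    return word

def represent(text: str) -> list[str]:
    """Lower-case the text, drop stop words, then stem what remains."""
    words = text.lower().split()
    return [toy_stem(w) for w in words if w not in STOP_WORDS]

terms = represent("Automatically Identifying Gene terms in MEDLINE Abstracts")
print(terms)
# ['automatic', 'identify', 'gene', 'term', 'medline', 'abstract']
```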

Page 6:

Document representation

• Document is a set of content words or terms:

{ automatic, identify, gene, term, medline, abstract}

Sometimes, keep locations of terms. Eg. “automatic” first word in title

Page 7:

Comments on last slide

Location information can be important in differentiating the ordering of content words in a query from other orderings of the same words. It is also useful in determining phrases.

Page 8:

Assign weights to terms

Term frequency: no. of times the term occurs in the document

Document frequency: no. of documents having the term

The weight of a term in a document: proportional to term frequency, inversely proportional to document frequency

Eg. term frequency * log(N / document frequency)

N = no. of documents in collection

Page 9:

Comments on the last slide

The well-known tf-idf weighting scheme to assign a weight to a term is given. The weight is proportional to the term frequency and inversely proportional to its document frequency. There are numerous variations of this formula, but all of them have the property that higher weights are given to terms with higher term frequencies and lower document frequencies.
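As a minimal sketch of the formula on the slide (term frequency * log(N / document frequency)), the following Python computes one weight per term per document. The function name `tf_idf` and the three tiny documents are made up for illustration; the natural logarithm is an assumption, since the slide does not fix the base.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return, for each document, a dict mapping term -> tf * log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))       # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical collection of already stemmed documents.
docs = [
    ["gene", "gene", "abstract"],
    ["gene", "term"],
    ["medline", "abstract"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

In the second document, "term" occurs in only one document of the collection and therefore outweighs "gene", which occurs in two.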

Page 10:

Other factors in assigning weights

Terms in title;

Terms in abstract;

Terms in big fonts etc.

get heavier weights

Page 11:

Comments to last slide

If a term occurs in the title, it usually gets a higher weight than the same term occurring in the main text. The same may apply to a term appearing in the abstract. If the term occurs in a big font or in a way that attracts the reader’s attention, it should also get a higher weight.

All these situations can be implemented by assuming that each occurrence of such a term is equivalent to k occurrences of the same term in the main text with k >1.
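The "k occurrences" idea can be sketched as a boosted term-frequency count. The field names and the k values in `FIELD_BOOST` are hypothetical; the slides only require k > 1 for the emphasized fields.

```python
from collections import Counter

# Hypothetical boost factors: one occurrence in a field counts as k body occurrences.
FIELD_BOOST = {"title": 3, "abstract": 2, "body": 1}

def boosted_term_frequency(fields: dict[str, list[str]]) -> Counter:
    """Count terms, weighting each occurrence by the boost of its field."""
    tf = Counter()
    for field, terms in fields.items():
        k = FIELD_BOOST.get(field, 1)   # unknown fields are treated like body text
        for term in terms:
            tf[term] += k
    return tf

doc = {
    "title": ["gene"],
    "abstract": ["gene", "term"],
    "body": ["gene", "medline"],
}
tf = boosted_term_frequency(doc)
print(tf)
```

The boosted counts can then be used wherever a plain term frequency would be used, for example in a tf-idf weight.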

Page 12:

Query representation

Two common models:

Vector space model: query as a set of terms, possibly ordered

Boolean Model: Terms connected by “AND”, “OR” and “NOT”

Page 13:

Comments to the last slide

In the information retrieval literature, it has been shown that the vector space model is usually better than the Boolean model. If a query contains quite a few terms connected by “AND”s, there may not be any document satisfying the query. If the terms are connected by “OR”s, there may be too many unordered documents satisfying the query, and the user has no efficient way to separate the useful documents from the irrelevant ones.

In practice, it is likely that a hybrid model having features of both models is used for effective retrieval.

Page 14:

Vector space model

Each dimension of a vector represents a distinct term;

#dimensions = number of distinct terms in the collection, including proper names

Eg. Automatic identify gene … ( 1, 1, 1, ….)

Page 15:

Compute the similarity between a query and a document

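$Q = (q_1, \ldots, q_n)$, $D = (d_1, \ldots, d_n)$

$\mathrm{Dot}[Q, D] = \sum_i q_i d_i$

#terms in common, favors long documents

$\mathrm{Norm}(D) = \sqrt{\sum_i d_i^2}$

$\mathrm{Cosine}(Q, D) = \mathrm{Dot}[Q, D] / (\mathrm{Norm}(D) \cdot \mathrm{Norm}(Q))$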

Page 16:

Comments on last slide

When the documents are binary vectors, the Dot product similarity function gives the number of terms in common between the two vectors. When the terms are weighted, the weights are incorporated into the similarity function. Clearly, this favors a long document such as an encyclopedia.

To compensate for this, the norm (length) of a document is included in the denominator of the similarity function so that a longer document gets a larger denominator. The query norm is used to ensure that the Cosine function returns a value between 0 and 1, if all terms have non-negative weights. When the two vectors differ from each other by a positive multiplicative constant, their angular distance is 0 and the Cosine value is 1.
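A small sketch of the Dot, Norm and Cosine functions described above, using plain Python lists as vectors. The example vectors are invented; `long_doc` is simply `q` scaled by 2, to show that the dot product grows with document length while the cosine does not.

```python
import math

def dot(q: list[float], d: list[float]) -> float:
    return sum(qi * di for qi, di in zip(q, d))

def norm(v: list[float]) -> float:
    return math.sqrt(dot(v, v))

def cosine(q: list[float], d: list[float]) -> float:
    denominator = norm(q) * norm(d)
    return dot(q, d) / denominator if denominator else 0.0

q = [1, 1, 0]
short_doc = [1, 1, 0]
long_doc = [2, 2, 0]    # same direction as q, twice as long
mixed_doc = [1, 0, 1]   # shares only one term with q

print(dot(q, short_doc), dot(q, long_doc))        # the dot product favors the longer vector
print(cosine(q, short_doc), cosine(q, long_doc))  # the cosine treats both alike
print(cosine(q, mixed_doc))
```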

Page 17:

Boolean model

gene AND abstract; (sometimes “+” is used to require that a term be present)

gene OR abstract;

gene AND NOT abstract; (“-” is used to indicate undesired terms; Eg. +gene -abstract)
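One common way to evaluate such queries, sketched here under the assumption of a simple inverted index (term mapped to a set of document ids), is with set operations. The index contents are invented for illustration.

```python
# Hypothetical inverted index: term -> ids of the documents containing it.
index = {
    "gene": {1, 2, 3},
    "abstract": {2, 3, 4},
}

def postings(term: str) -> set[int]:
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())

and_result = postings("gene") & postings("abstract")      # gene AND abstract
or_result = postings("gene") | postings("abstract")       # gene OR abstract
and_not_result = postings("gene") - postings("abstract")  # gene AND NOT abstract

print(and_result, or_result, and_not_result)
```

Note that the results are unordered sets, which is the weakness of the pure Boolean model mentioned in the comments on the query representation slide.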

Page 18:

Other features

Phrase search: “information retrieval”

Proximity search: information NEAR retrieval

Date search: 2003

Field search: Eg in the field “Author”, look for “Neal”

Wildcard search: smal*er

Page 19:

Comments on last slide

Some systems require a query phrase such as “information retrieval” to be placed in quotes. This may require a retrieved document to contain exactly that phrase. If a document containing the words “retrieval of information” is desired, the query can be reformulated as “information” NEAR “retrieval”.

Filtering operations can be specified by filling in additional information in specific fields such as the author field. Wildcard entries such as smal*er, where “*” denotes zero or more characters, are allowed, provided that “*” does not occur in the first few characters (say 3); otherwise the space of candidate matching strings will be too large.
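The wildcard restriction can be illustrated with a short sketch. The vocabulary and the `MIN_PREFIX` value of 3 are assumptions taken from the "say 3" in the comment above; a real engine would look the prefix up in a sorted term index rather than scan a list.

```python
import re

MIN_PREFIX = 3  # assumed minimum number of literal characters before the first "*"

def wildcard_search(pattern: str, vocabulary: list[str]) -> list[str]:
    """Match a pattern in which "*" stands for zero or more characters."""
    prefix = pattern.split("*", 1)[0]
    if "*" in pattern and len(prefix) < MIN_PREFIX:
        raise ValueError("wildcard too close to the start of the pattern")
    # Escape the literal parts and join them with ".*" for the wildcards.
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    # Only words sharing the literal prefix can match, which keeps the search small.
    return [w for w in vocabulary if w.startswith(prefix) and regex.fullmatch(w)]

vocabulary = ["smalheiser", "smaller", "small", "smile", "smalltalk"]
matches = wildcard_search("smal*er", vocabulary)
print(matches)
# ['smalheiser', 'smaller']
```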

Page 20:

Additional features

Case sensitive: java gets java, Java, JAVA; Java gets Java and possibly JAVA (a capitalized first letter implies a proper name)

ordered query terms eg. stray dog

spelling error: if there is no such word, some search engines suggest similar words

Page 21:

Comments on last slide

Location information in documents and the query can be used to differentiate stray dog from dog stray.

If a word does not exist in the index of all words in the documents, then some search engines may suggest some neighboring words which differ from the misspelled word by 1 or 2 characters. Note that proper names are included in the index.
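A sketch of this suggestion step, assuming that "differ by 1 or 2 characters" is measured with the standard edit (Levenshtein) distance; the slides do not name a specific measure, and the index terms below are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def suggest(word: str, index_terms: set[str], max_distance: int = 2) -> list[str]:
    """Suggest index terms near a word that is missing from the index."""
    if word in index_terms:
        return []  # the word exists, so nothing needs to be suggested
    scored = sorted((edit_distance(word, t), t) for t in index_terms)
    return [t for d, t in scored if d <= max_distance]

# Hypothetical index; proper names such as "smalheiser" are included.
index_terms = {"smalheiser", "gene", "genetics", "retrieval"}
print(suggest("smalhieser", index_terms))
print(suggest("gen", index_terms))
```

Comparing against every index term is fine for a toy index but too slow for a large one; that is a limitation of this sketch, not a claim about how search engines implement it.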

Page 22:

Directory search

Specify subtree:

[Figure: directory tree with top-level nodes computer, finance and medicine; hardware and software appear under computer, and the remaining subtrees are elided]

query “memory” under computer means computer memory vs human memory in medicine

Page 23:

Comments on last slide

Directory search may reduce ambiguities. In the given directory, documents or pages are classified under each node. For example, one set of documents is classified under computer and another under medicine. The former class contains documents about computer memory while the latter contains documents about human memory. If the query “memory” is restricted to the class “computer”, then only documents in the former class, relating to computer memory, will be retrieved.

Page 24:

Feedback

identify relevant documents and possibly irrelevant documents

re-formulate query using terms from relevant documents and from irrelevant documents;

Query: apple; Rel Doc: computer; Irrel: fruit

Modified query: apple, computer, - fruit

Page 25:

Comments to last slide

The user needs to identify relevant documents and possibly irrelevant documents. Terms from the relevant documents may be added to the query, while terms from the irrelevant documents may be used to exclude documents having such terms from being retrieved in the next round. In the example, the term “computer” is found in the relevant documents and is added to the query, while the term “fruit” is found in the irrelevant documents and is used to exclude documents containing that word.
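The apple example can be sketched as follows. This is a deliberately simple Boolean-style reformulation; the function names and the three documents are made up, and weighted feedback schemes (such as Rocchio's) would adjust term weights rather than add and exclude terms outright.

```python
def reformulate(query: list[str], relevant_terms: list[str],
                irrelevant_terms: list[str]) -> tuple[list[str], list[str]]:
    """Add terms from relevant documents; mark terms from irrelevant ones as excluded."""
    positive = list(dict.fromkeys(query + relevant_terms))  # keep order, drop duplicates
    negative = [t for t in dict.fromkeys(irrelevant_terms) if t not in positive]
    return positive, negative

def retrieve(positive: list[str], negative: list[str],
             docs: dict[str, set[str]]) -> list[str]:
    """Return documents with at least one positive term and no negative term."""
    hits = []
    for doc_id, terms in docs.items():
        if terms & set(negative):
            continue
        if terms & set(positive):
            hits.append(doc_id)
    return hits

# Hypothetical collection.
docs = {
    "d1": {"apple", "computer", "mac"},
    "d2": {"apple", "fruit", "pie"},
    "d3": {"computer", "hardware"},
}
positive, negative = reformulate(["apple"], ["computer"], ["fruit"])
print(positive, negative)
print(retrieve(positive, negative, docs))
```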

Page 26:

Web

Surface Web: linked together

Deep Web: Not linked; documents can be generated dynamically by programs

Quite a few medical databases and bio-medical databases are in the Deep Web

Page 27:

Comments on the last slide

The Web is roughly classified into the Surface Web and the Deep Web. The pages in the former are hyperlinked, while pages in the latter are accessible only by submitting queries to query interfaces.

Web crawlers which extract content information from Surface Web pages are unable to get into Deep Web pages for lack of hyperlinks.

Page 28:

Retrieval from the surface Web

Anchor text: belongs to the document pointed to.

<a href="http://tigger.uic.edu/htbin/cgiwrap/bin/newsbureau/cgi-bin/index.cgi">More News</a>

Page rank: importance of a Web page

$\mathrm{Rank}(P) = \sum_i \mathrm{Rank}(Q_i) / \mathrm{Out}(Q_i)$

for every $Q_i$ pointing to P; iterative; Web surfing interpretation

Page 29:

Comments on last slide

There are some differences between retrieval from the Web and from non-Web sources. In the former case, words known as anchor texts which appear together with the link from a page A to another page B should be utilized for retrieval. Specifically, the anchor words should be used as content words for page B, as they describe the contents of B as observed by the user who creates A.

Page 30:

Example to illustrate page rank

Rank(P) = ½ Rank(A) + 1/3 Rank(B)

A lot of pages pointing to the IBM home page, implying that it has a very high page rank.

[Figure: link graph with pages A, B and P; A and B point to P]

Page 31:

Comments on last slide

The example illustrates how the page rank of a page can be computed. In practice, all pages are initialized with the same rank and the page rank formula is applied to compute the page ranks of all pages. This process is repeated until convergence is reached. Under some reasonable assumptions, convergence is guaranteed. The page rank information is utilized to rank pages for any user query.
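The iteration described above can be sketched as follows, where each page passes its rank in equal shares to the pages it links to. The three-page graph is invented, and the sketch assumes every page has at least one outgoing link and that the graph is well behaved enough for the iteration to converge; the damping factor used in practical systems is deliberately left out because the slides do not mention it.

```python
def page_rank(links: dict[str, list[str]], tol: float = 1e-10,
              max_iter: int = 1000) -> dict[str, float]:
    """Iterate Rank(P) = sum of Rank(Q) / Out(Q) over all Q pointing to P."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}   # all pages start with the same rank
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_rank[p] += rank[q] / len(targets)
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = page_rank(links)
print(rank)
```

For this graph the ranks settle near 0.4 for A, 0.2 for B and 0.4 for C, which satisfies the formula: Rank(C) = ½ Rank(A) + Rank(B).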

Page 32:

Query: IBM

Thousands of pages have that word, but among those pages, the IBM home page has the largest rank.

Google utilizes page rank

Page 33:

Comments on last slide

There are a number of ways to utilize page ranks to rank pages for a given query. One way is to first retrieve pages which have reasonable similarities with the query. Then the retrieved pages are re-ranked in descending order of page rank. Another way is to compute the relevance of a page based on a function of the similarity of the page with the query and its page rank. Then pages are re-ranked in descending order of relevance.
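Both strategies can be sketched side by side. The page names, similarity values and page ranks are invented, and the linear combination with weight `alpha` is only one possible choice of the "function of the similarity and the page rank"; the slides do not specify which function is used.

```python
# Hypothetical candidates: page -> (similarity with the query, page rank).
candidates = {
    "ibm_home": (0.60, 0.90),
    "ibm_fan_page": (0.80, 0.10),
    "ibm_news": (0.70, 0.40),
}

def rerank_by_page_rank(cands: dict[str, tuple[float, float]],
                        min_similarity: float = 0.5) -> list[str]:
    """Way 1: keep reasonably similar pages, then order them by page rank."""
    eligible = [p for p, (sim, _) in cands.items() if sim >= min_similarity]
    return sorted(eligible, key=lambda p: -cands[p][1])

def rerank_by_relevance(cands: dict[str, tuple[float, float]],
                        alpha: float = 0.5) -> list[str]:
    """Way 2: order by a combined score (here an assumed linear mix)."""
    return sorted(cands, key=lambda p: -(alpha * cands[p][0] + (1 - alpha) * cands[p][1]))

print(rerank_by_page_rank(candidates))
print(rerank_by_relevance(candidates, alpha=0.5))
print(rerank_by_relevance(candidates, alpha=0.9))  # similarity dominates
```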

Page 34:

Authority and Hub

• Query retrieves documents based on similarities

• Expand this set by adding their parents and their children

• Compute A(p) = sum H(q) for each edge (q,p)

• Compute H(p) = sum A(q) for each edge (p,q)

Page 35:

Authority and Hub continued

• Normalize A(p) and H(p)

• Repeat until A() and H() converge

• Output pages with top authority scores (It has been shown that convergence is guaranteed.)

• www.teoma.com (This company claims to have an advanced search capability which is more accurate than the standard authority and hub technique.)
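The authority and hub steps listed on these two slides can be sketched as a short loop. The edge list is invented, the Euclidean normalization is an assumption (the slides only say "normalize"), and a fixed number of iterations stands in for a proper convergence test.

```python
import math

def normalize(scores: dict[str, float]) -> dict[str, float]:
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / length if length else 0.0) for p, v in scores.items()}

def hits(edges: list[tuple[str, str]], iterations: int = 50):
    """Each edge (q, p) means page q links to page p."""
    pages = sorted({page for edge in edges for page in edge})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_auth = {p: 0.0 for p in pages}
        for q, p in edges:
            new_auth[p] += hub[q]        # A(p) = sum of H(q) over edges (q, p)
        new_hub = {p: 0.0 for p in pages}
        for p, q in edges:
            new_hub[p] += new_auth[q]    # H(p) = sum of A(q) over edges (p, q)
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical expanded result set: h1, h2, h3 link to a1; h1 also links to a2.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
auth, hub = hits(edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Here a1 ends up with the top authority score because three hubs point to it, and h1 is the top hub because it points to both authorities.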

Page 36:

Various features of different search engines, including Google, AltaVista, Hotbot etc

Search Engines for the World Wide Web

By Alfred and Emily Glossbrenner, 3rd edition, Peachpit Press, 2001.

Page 37:

Metasearch engine

Connects to numerous search engines.

Given query Q, finds suitable search engines to process the query, invokes the selected search engines to search and merges their results.

Page 38:

Comments on last slide

Instead of using a search engine such as Google, a metasearch engine which connects to numerous search engines can be utilized. Upon receiving a user query, a metasearch engine sends the query ( with possibly some modifications) to appropriate search engines and merges and re-ranks the retrieved documents returned from the invoked search engines.
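A rough sketch of the invoke-and-merge step follows. The two "engines" are stand-in functions returning made-up (page, score) pairs, and dividing each engine's scores by its top score before merging is just one simple assumption for making scores comparable; engine selection and query modification are omitted.

```python
def engine_a(query: str) -> list[tuple[str, float]]:
    """Stand-in for a real search engine; scores on an arbitrary scale."""
    return [("pageX", 10.0), ("pageY", 5.0)]

def engine_b(query: str) -> list[tuple[str, float]]:
    """Second stand-in engine with a different score scale."""
    return [("pageY", 0.9), ("pageZ", 0.3)]

def merge_results(result_lists: list[list[tuple[str, float]]]) -> list[str]:
    """Normalize each engine's scores by its top score, then keep the best per page."""
    merged: dict[str, float] = {}
    for results in result_lists:
        if not results:
            continue
        top = max(score for _, score in results)
        for page, score in results:
            normalized = score / top if top else 0.0
            merged[page] = max(merged.get(page, 0.0), normalized)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(merged, key=lambda page: (-merged[page], page))

def metasearch(query: str, engines) -> list[str]:
    return merge_results([engine(query) for engine in engines])

ranking = metasearch("smalheiser", [engine_a, engine_b])
print(ranking)
```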

Page 39:

Advantages of Metasearch Engines over Search Engines

Do not need substantial hardware relative to large search engines;

Large coverage;

up-to-date information.

Page 40:

Comments to last slide

There is no need for substantial hardware, because the searches are done by the underlying search engines. The coverage of a metasearch engine is the union of the coverages of the individual search engines. That it may have more up-to-date information than a large search engine will be explained by the next few slides.

Page 41:

Up-to-date information

• Search engine crawler gets data

• Builds large index database

• Time consuming to update large index database

• Metasearch engine connects to numerous small search engines

Page 42:

Comments to last slide

A search engine utilizes a crawler to extract contents from Surface Web pages and then builds an index database. Upon receiving a query, the search engine searches the index database to determine the pages to return to the user. Since the contents of Web pages keep on changing, the index database needs to be updated. However, the index database is large and refreshing it may take a long time, say weeks. In contrast, if a metasearch engine is connected to numerous small search engines and each of these search engines keeps its database up-to-date, the metasearch engine may be able to provide current information.

Page 43:

Utilizes dictionary/ontology

Wordnet: ordinary dictionary terms

MeSH hierarchy: medical terms

May want to include synonyms and hyponyms of query terms into query

Person --- (Synonyms: human, people)

Hyponyms: man woman

Page 44:

Comments to last slide

Dictionaries or ontologies may be utilized to achieve high retrieval effectiveness. A common dictionary in a general domain is Wordnet, which provides synonyms, hyponyms and other relationships for each ordinary word. As an example, if a query contains the word “person”, its synonyms and hyponyms may be added to the query. Note that a word may have multiple senses (meanings), so selecting suitable synonyms and hyponyms is essential. It is worthwhile to explore the use of the MeSH hierarchy for effective retrieval in the medical domain.
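Query expansion with the "person" example might look like the sketch below. The `THESAURUS` dictionary is a hand-made stand-in holding only the entry from the slide; it is not the Wordnet or MeSH interface, and it ignores the sense-selection problem noted above.

```python
# Hand-made stand-in for a dictionary/ontology lookup (not the Wordnet API).
THESAURUS = {
    "person": {"synonyms": ["human", "people"], "hyponyms": ["man", "woman"]},
}

def expand_query(terms: list[str], use_hyponyms: bool = True) -> list[str]:
    """Append synonyms (and optionally hyponyms) of each query term."""
    expanded: list[str] = []
    for term in terms:
        entry = THESAURUS.get(term, {})
        candidates = [term] + entry.get("synonyms", [])
        if use_hyponyms:
            candidates = candidates + entry.get("hyponyms", [])
        for candidate in candidates:
            if candidate not in expanded:   # avoid duplicates, keep order
                expanded.append(candidate)
    return expanded

expanded = expand_query(["person", "memory"])
print(expanded)
# ['person', 'human', 'people', 'man', 'woman', 'memory']
```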

Page 45:

Difficulty

A word sometimes has many senses

Eg Query: drugs for mental patients

senses for drugs: prescription drugs; illegal drugs

useful to include antidepressant;

will retrieve a lot of irrelevant documents if heroin is included

Page 46:

Comments to the last slide

The example shows that a correct addition of a hyponym (antidepressant is a hyponym of drug) will lead to high retrieval effectiveness while an incorrect addition (heroin is also a hyponym of drug) leads to poor retrieval results.

Page 47:

Natural Language Processing for Information Retrieval

• find the part of speech of each word;

• identify noun phrases;

• identify proper names;

• recognize acronyms: eg. CHF = congestive heart failure

• Word sense disambiguation

eg. Apple CPU

Page 48:

Comments to last slide

Natural language processing plays a role in information retrieval. However, so far, it is used to identify parts of speech of words, named entities and phrases only.

Recognition of acronyms is also useful for information retrieval.

Page 49:

Word Sense Disambiguation

Pine 1 kinds of evergreen tree with needle-shaped leaves

2 waste away through sorrow or illness

Cone 1 solid body which narrows to a point

2 fruit of certain evergreen trees

Find the combination of descriptions which have the largest number of words in common.

Page 50:

Comments to last slide

If the query is “pine cone” and each of the two words has multiple senses, the correct sense may be identified by finding the combination of senses whose descriptions have the largest number of words in common. In this example, sense 1 of pine and sense 2 of cone have the words “evergreen tree” in common in their descriptions (after stemming “trees” to “tree”). These common words may be added to the query to improve retrieval effectiveness.
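The overlap computation for the pine cone example can be sketched as follows. The sense descriptions are copied from the slide; the stop-word list and the crude "strip trailing s" stemming are assumptions made only so that "tree" and "trees" match.

```python
from itertools import product

STOP_WORDS = {"of", "with", "to", "a", "or", "which", "through"}  # assumed list

SENSES = {
    "pine": ["kinds of evergreen tree with needle-shaped leaves",
             "waste away through sorrow or illness"],
    "cone": ["solid body which narrows to a point",
             "fruit of certain evergreen trees"],
}

def content_words(description: str) -> set[str]:
    """Drop stop words and strip trailing "s" (a very crude stemmer)."""
    return {w.rstrip("s") for w in description.lower().split() if w not in STOP_WORDS}

def disambiguate(word_a: str, word_b: str) -> tuple[tuple[int, int], set[str]]:
    """Return the 1-based sense pair with the largest description overlap."""
    best_pair, best_common = None, set()
    for i, j in product(range(len(SENSES[word_a])), range(len(SENSES[word_b]))):
        common = content_words(SENSES[word_a][i]) & content_words(SENSES[word_b][j])
        if best_pair is None or len(common) > len(best_common):
            best_pair, best_common = (i + 1, j + 1), common
    return best_pair, best_common

pair, common = disambiguate("pine", "cone")
print(pair, sorted(common))
# (1, 2) ['evergreen', 'tree']
```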

Page 51:

Information extraction

Information retrieval obtains whole documents; often users want small parts of retrieved documents.

Examples:

From certain papers on heart disease, extract names of authors;

from experimental sections of papers, extract tables of interest.

Page 52:

Techniques

(1) Construct rules involving patterns or keywords to identify parts of interest; utilize a grammar to extract the required information

Eg. To identify terrorist events

useful keywords: kill, bomb etc. use a grammar to identify the subjects (terrorists) and the objects (victims)

Page 53:

Comments

Traditionally, information extraction is achieved with manually constructed rules, written after examining numerous instances of what is desired. To save labor cost, machine learning techniques are introduced. Rules are constructed automatically and, based on positive and negative examples, promising rules are kept for future extraction activities.

Page 54:

(2) Use machine learning techniques to construct rules

Positive and negative examples can be given to guide the construction

Aim: Reduce manual construction of rules

Page 55:

Machine learning

Example: Pavilion a230 Minitower

AMD ® Athlon XP .. GHz

…….

Pavilion a210n Minitower

Intel ® Celeron … … GHz

Rule: (var1) * ‘®’ ( var2) ‘GHz’

Page 56:

Comments

In this example, the user supplies a few positive instances to be extracted. Then, the system automatically constructs the rule with ® and GHz as landmarks. The words before the landmarks are captured by variables.

In the Web environment, HTML or XML documents have tags and they may be used to construct rules. However, rules involving tags may be site dependent, implying that new rules may need to be generated when there is a site change.
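The landmark rule from the machine learning slide can be approximated with a regular expression. This is one possible reading of the rule, assuming var1 is the word just before “®” and var2 is the text between “®” and “GHz”; the input lines are copied from the slide, including its elided clock speeds, and the regular expression itself is hand-written here rather than learned.

```python
import re

# One reading of the rule (var1) * '®' (var2) 'GHz':
# var1 = word before the "®" landmark, var2 = text between "®" and "GHz".
RULE = re.compile(r"(\w+)\s*®\s*(.+?)\s*GHz")

PAGE_LINES = [
    "Pavilion a230 Minitower",
    "AMD ® Athlon XP .. GHz",
    "Pavilion a210n Minitower",
    "Intel ® Celeron … … GHz",
]

def extract(lines: list[str]) -> list[tuple[str, str]]:
    """Apply the rule to each line and collect the captured variables."""
    records = []
    for line in lines:
        match = RULE.search(line)
        if match:
            records.append((match.group(1), match.group(2)))
    return records

records = extract(PAGE_LINES)
print(records)
```

Lines without both landmarks, such as the model names, are simply skipped.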

Page 57:

Rules involving tags

<b> Martino Motor Sales </b>

<b> Currie Motors Lincoln Memory </b>

Rule: * <b> ‘Var’ </b>

Extracted data may not be that structured

Layout of document can be site dependent, implying that new correct rules need to be constructed for new sites
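A tag-based rule such as * <b> ‘Var’ </b> can be sketched with a regular expression, shown here only as an illustration (a real extractor would more likely use an HTML parser). The second call hints at the site-dependence problem: the same rule finds nothing on a hypothetical site that marks names with a different tag.

```python
import re

# Capture whatever sits between <b> and </b>, ignoring surrounding spaces.
TAG_RULE = re.compile(r"<b>\s*(.*?)\s*</b>", re.IGNORECASE | re.DOTALL)

html = "<b> Martino Motor Sales </b>\n<b> Currie Motors Lincoln Memory </b>"
names = TAG_RULE.findall(html)
print(names)
# ['Martino Motor Sales', 'Currie Motors Lincoln Memory']

# Hypothetical page from another site that uses <strong> instead of <b>.
other_site = "<strong> Martino Motor Sales </strong>"
print(TAG_RULE.findall(other_site))
# []
```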

Page 58:

Summary

• Information retrieval:

user’s point of view:

eg. phrase, case sensitive

system point of view:

eg. Feedback query construction

Web retrieval vs non-web retrieval

search engine vs metasearch engine

Page 59:

Summary continued

• Natural language processing

Eg. acronym recognition

• Information extraction rules: manual, machine learning

Can be site dependent