
Page 1:

Information Retrieval and Information Extraction

Akhil Deshmukh (06D5007)
Narendra Kumar (06D05008)
Kumar Lav (06D05012)

image: google

Page 2:

IR and IE

IR is the process of selecting documents from a collection.
IE is the retrieval of specific information from a document.
IE transforms unstructured data into structured data.
Together, they provide a powerful tool for text processing.

Page 3:

contd....

IR and IE can be seen in our day-to-day life:
• The email inbox.
• The searches we make on the internet.
• This seminar itself is a result of IR and IE.

Page 4:

Information Retrieval Tasks

• Search engines (e.g. Google and Yahoo).
• Feedback.
• Meta search.
• Data clustering and visualization of search results.

Page 5:

Clustering of search results

Page 6:

IR Measures

Division of the document collection into relevant and retrieved sets.
• Precision
• Recall
• F-score
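For reference, the standard definitions, with A the set of relevant documents and R the set of retrieved documents:

```latex
\mathrm{Precision} = \frac{|R \cap A|}{|R|}, \qquad
\mathrm{Recall}    = \frac{|R \cap A|}{|A|}, \qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```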

Page 7:

Models for IR

Language Models (LM): the idea is to compute the conditional probability P(q | d), i.e. the probability of generating a query q given the observation of a document d.

Vector Space Model (VSM): an algebraic model for representing objects as vectors of identifiers.

Page 8:

Language Models for IR

Model 1: Given a query q, we want to compute the relevance of a document d. We find the probability p(r=1 | q, d), where r is the binary label for 'relevant'.

Applying Bayes' rule:

    p(r=1 | q, d) = p(q, d | r=1) p(r=1) / p(q, d)

We make two assumptions:
1. p(q | d, r=1) is computed from a unigram model created from d,
2. p(q | d, r=0) = p(q | r=0).

Here we consider the log-odds ratio:

    log [ p(r=1 | q, d) / p(r=0 | q, d) ] = log [ p(q | d, r=1) / p(q | d, r=0) ] + log [ p(r=1 | d) / p(r=0 | d) ]

• Disadvantages
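A minimal sketch of assumption 1, scoring log p(q | d, r=1) with a unigram model built from d (illustrative only: the whitespace tokenizer, add-one smoothing, and toy strings are assumptions, not part of the model as stated on the slide):

```python
import math
from collections import Counter

def unigram_log_likelihood(query, doc, vocab_size):
    """log p(q | d, r=1) under a unigram model built from d."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    score = 0.0
    for term in query.split():
        # add-one smoothing keeps unseen query terms from zeroing the product
        p = (counts[term] + 1) / (total + vocab_size)
        score += math.log(p)
    return score

doc = "information retrieval is the process of selecting documents from the collection"
print(unigram_log_likelihood("information retrieval", doc, vocab_size=10000))
```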

Page 9:

Vector Space Model for IR

Each document d is represented as a vector of term weights tf·idf, known as the tf-idf scheme, where "tf" is the normalized term frequency and "idf" is the inverse document frequency:

    tf_w = c(w, d) / max_v c(v, d),   where c = count
    idf_w = log(N / n_w),   where N = number of documents in the collection and n_w = number of documents containing w

so we have tf·idf_w = tf_w × idf_w.
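A small sketch of the two formulas above on a toy corpus (the corpus strings are made up, and the natural-log base is an assumption; the slide leaves the log base unspecified):

```python
import math
from collections import Counter

docs = [
    "information retrieval selects documents from a collection",
    "information extraction fills database entries",
    "retrieval and extraction together process text",
]

def tf(w, doc):
    counts = Counter(doc.split())
    return counts[w] / max(counts.values())      # tf_w = c(w, d) / max_v c(v, d)

def idf(w, corpus):
    n_w = sum(1 for d in corpus if w in d.split())
    return math.log(len(corpus) / n_w)           # idf_w = log(N / n_w)

print(tf("information", docs[0]) * idf("information", docs))  # tf·idf weight
```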

Page 10:

Fig: http://www.miislita.com/term-vector/term-vector-3.html

Page 11:

sim(d, q) = cos(θ) = (dᵀq) / (|d| |q|) = (∑_j w_d,j × w_q,j) / (|d| |q|)

here |d| = √(∑_j w_d,j²) and |q| = √(∑_j w_q,j²),

and w_d,j = weight of term j in document d, i.e. w_d,j = tf_d,j × idf_j,
w_q,j = weight of term j in query q.

• Normalization?
• Documents are ranked by their cosine similarity to q.
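The same formula in code (a sketch; the example weight vectors are made-up numbers, not real tf-idf values):

```python
import math

def cosine_sim(d, q):
    """Cosine similarity between sparse tf-idf vectors given as {term: weight}."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

d = {"information": 0.41, "retrieval": 0.62, "documents": 0.20}
q = {"retrieval": 1.0}
print(cosine_sim(d, q))   # documents are ranked by this score
```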

Page 12:

Trigger Language Model

Here we calculate the probability p(q | Md), where Md is the language model for document d.

a) Estimation:
    p(qi | Md) = tf(qi, d) / dld,
    where tf(qi, d) = frequency of qi in d and dld = total number of terms in d.

b) Zero counts:
    p(qi | Md) = p(qi | Md)   if tf(qi, d) > 0
               = cfqi / cs    otherwise,
    where cfqi = raw count of qi in all documents and cs = total number of terms in all documents.

c) Smoothing: calculate pavg(qi), the mean of p(qi | Md) over the documents that contain qi.

d) Final term probability:
    p#(qi | Md) = p(qi | Md)^(1−R) × pavg(qi)^R   if tf(qi, d) > 0
                = cfqi / cs                       otherwise

Page 13:

e) Finally:

    p(q | Md) = Π(qi ∈ q) p#(qi | Md) × Π(qi ∉ q) [1 − p#(qi | Md)]

To minimize the smoothing risk,

    R = [1 / (1 + f)] × [f / (1 + f)]^tf,

where f is the mean term frequency of the term, f = pavg(qi) × dld.

Fig by: ZHANG Jun-lin, SUN Le, QU Wei-min, SUN Yu-fang, 2002
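Putting steps (a)-(e) together in a sketch (the toy corpus is an assumption, and pavg is computed as the mean of p(qi | Md) over documents containing qi, which is one reading of step (c)):

```python
import math
from collections import Counter

corpus = [
    "information retrieval selects documents".split(),
    "information extraction fills slots".split(),
]
cs = sum(len(d) for d in corpus)            # total terms in all documents
cf = Counter(t for d in corpus for t in d)  # raw counts over all documents

def p_ml(qi, d):
    return Counter(d)[qi] / len(d)          # tf(qi, d) / dld

def p_avg(qi):
    ds = [d for d in corpus if qi in d]
    return sum(p_ml(qi, d) for d in ds) / len(ds)

def p_hash(qi, d):
    tf = Counter(d)[qi]
    if tf == 0:
        return cf[qi] / cs                  # back off to the collection model
    f = p_avg(qi) * len(d)                  # mean term frequency, f = pavg(qi) * dld
    R = (1 / (1 + f)) * (f / (1 + f)) ** tf # smoothing risk
    return p_ml(qi, d) ** (1 - R) * p_avg(qi) ** R

print(p_hash("retrieval", corpus[0]))
```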

Page 14:

Information Extraction (IE)

The task of locating specific information in a natural language document.
Converts information from different text documents into database entries (a structured format).
Runs on top of IR systems.

Page 15:

Are IE and IR the same?

IE systems do not understand the text in the input documents.
IR retrieves relevant documents; IE extracts relevant information.
IE sees text as structure; for IR, documents are bags of words.
Together, they provide powerful tools for text processing.

Page 16:

Extraction

Free text:
◦ Natural language, e.g. articles in newspapers.
◦ Uses NLP techniques like POS tagging.
◦ Rules based on syntactic and semantic information.

Structured text:
◦ Text as in a database or tables.
◦ Extraction is easy and direct.
◦ Rules are based on the structure of the text.

Page 17:

Extraction contd...

Semi-structured text:
◦ lies between free and structured text.
◦ comprises phrases.
◦ regular expressions are used.

Information may be:
◦ single-slot: a single piece of information.
  e.g. Microsoft CEO Bill Gates
◦ multi-slot: a set of information.
  e.g. Powai: 1 BHK Rs. 7,000 p.m., 2 BHK Rs. 20,000 p.m.

Page 18:

IE System Design

Two main approaches:

Knowledge engineering:
◦ grammars expressing rules are constructed by hand.
◦ specific to the application.
◦ relies on the skill of the knowledge engineer.

Automatic training:
◦ based on the training-testing paradigm.
◦ accuracy depends highly on the corpus.

Page 19:

Learning Pattern by Language Processing: (LP)2

It learns rules by generalising over a set of examples marked via SGML tags in a training corpus.

It induces two types of symbolic rules.

Tagging rules:
◦ insert a single SGML tag.
◦ select a tag in the training corpus and extract from the text a window of w words to the left and w words to the right.
◦ windows are tagged by SGML tags.
◦ use context rules.
◦ e.g. "Seminar at 9 pm". After tagging: "Seminar at <time> 9 </time> pm".
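As a toy illustration of what one induced tagging rule does, a hand-written regular expression stands in for a learned rule here ((LP)2 induces such rules from the tagged corpus; this regex is only an analogy):

```python
import re

# A rule of the form "a number between 'at' and 'pm'/'am' starts a <time> element",
# hand-coded as a regex for illustration.
rule = re.compile(r"\bat (\d{1,2}) (pm|am)\b")

text = "Seminar at 9 pm in the main hall."
print(rule.sub(r"at <time> \1 </time> \2", text))
# -> Seminar at <time> 9 </time> pm in the main hall.
```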

Page 20:

(LP)2 contd..

Correction rules:
◦ shift wrongly positioned tags to the correct positions.
◦ identical to the tagging rules, except that:
  - patterns also match the tags inserted by the tagging rules,
  - actions shift misplaced tags rather than adding new ones.

Page 21:

(LP)2 contd..

Generalisation: it generalises the initially induced rules. Each generalisation is tested on the training corpus and the best k are retained, by:
◦ adding wildcard characters, e.g. "4 pm" → "* pm",
◦ using a POS tagger,
◦ using NLP knowledge.

Page 22:

(LP)2 contd..

Information extraction: information is extracted in the following way:
1. the best pool rules are used to tag,
2. correction rules are applied to remove errors,
3. tags are validated, i.e. uncoupled tags are removed.

Page 23:

(LP)2 contd ..

NLP based.

[7]

Page 24:

IE Systems [6]

System name | Text style | Multi-slot | Need syntax | Need trim
WI          | Structured | Yes        | No          | No
SRV         | Semi       | No         | No          | No
RAPIER      | Semi       | No         | No          | No
AutoSlog    | Free       | No         | Yes         | Yes
CRYSTAL     | Free       | Yes        | Yes         | Yes
CRYSTAL     | Semi       | Yes        | Yes         | Yes
WHISK       | Structured | Yes        | No          | No
WHISK       | Semi       | Yes        | No          | No
WHISK       | Free       | Yes        | Yes         | No

Page 25:

WHISK:

Input data.
WHISK rule representation.
WHISK algorithm.

Page 26:

Input Data:

Structured input:
  e.g. profiles on a web host; consider Facebook, Orkut, etc.
Semi-structured:
  e.g. ads in a newspaper.
Free text:
  e.g. any random article.

Single-slot / Multi-slot

Page 27:

WHISK Rules:

For structured and semi-structured text, e.g. consider an ad for renting a house. The corresponding rule will be of the form:

• ID:: 1
  Pattern:: * ( Digit ) 'BR' * '$' ( Number )
  Output:: Rental {Bedrooms $1} {Price $2}

• Disjunction is allowed:
  Bdrm = (brs | br | bds | bdrm | bd | bedrooms | bedroom | bed)

• Thus the newly formed rule will be:
  ID:: 2
  Pattern:: * ( Nghbr ) * ( Digit ) ' ' Bdrm * '$' ( Number )
  Output:: Rental {Neighborhood $1} {Bedrooms $2} {Price $3}

We have used Nghbr; what does it stand for? It is a user-defined semantic class listing neighborhood names.
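To make rule 2 concrete, here is a rough translation into a regular expression (an approximation: WHISK's semantic classes Nghbr, Digit, and Number are replaced by a hard-coded alternation and \d patterns):

```python
import re

NGHBR = r"(Capitol Hill|Queen Anne|Powai)"  # stand-in for the Nghbr semantic class
pattern = re.compile(NGHBR + r".*?(\d+)\s*(?:brs|br|bds|bdrm|bd|bedrooms|bedroom|bed)"
                     r".*?\$\s*([\d,]+)", re.IGNORECASE | re.DOTALL)

ad = "Capitol Hill - 1 br twnhme, undrgrnd pkg incl $675."
m = pattern.search(ad)
if m:
    print({"Neighborhood": m.group(1), "Bedrooms": m.group(2), "Price": m.group(3)})
# -> {'Neighborhood': 'Capitol Hill', 'Bedrooms': '1', 'Price': '675'}
```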

Page 28:

Contd:

Extension of the rules for grammatical text:

• Pre-processing required: syntactic analysis.

@S[ {SUBJ @PN[ C. Vincent Protho ]PN , @PS[ chairman and chief executive officer ]PS
     of this maker of semiconductors, }
   {VB @Passive was named @nam }
   {PP to the additional post of @PS[ president ]PS , }
   {REL_V succeeding @succeed @PN[ John W. Smith ]PN ,
     who resigned @resign to pursue @pursu other interests. }
]@S 8910130051-1

• The corresponding rule pattern will be:

ID:: 3
Pattern:: * ( Person ) * '@Passive' *F 'named' * {PP *F ( Position ) * '@succeed' ( Person )
Output:: Succession {PersonIn $1} {Post $2} {PersonOut $3}

Page 29:

WHISK Algorithm:

Features:
◦ A supervised learning algorithm.
◦ The tagging process is interleaved with the learning.
◦ Presents the user with a batch of instances to tag.
◦ Induces a set of rules from the expanded training set.

Progress:
◦ Begins with a reservoir of untagged instances.
◦ At each iteration, a set of untagged instances is selected from the reservoir and presented to the user to annotate.
◦ The user adds a tag for each case frame to be extracted from the instance: these are the 'training instances'.

Page 30:

Algorithm:

@S[ Capitol Hill – 1 br twnhne. Fplc D/W W/D. Undrgrnd pkg incl $675. 3 BR, upper flr
of turn of ctry HOME. Incl gar, grt N. Hill loc $995. (206) 999-9999 <br><i>
<font size=-2> (This ad last ran on 08/03/97.) </font> </i> <hr>]@S 5

@@TAGS Rental {Neighborhood Capitol Hill} {Bedrooms 1} {Price 675}
@@TAGS Rental {Neighborhood Capitol Hill} {Bedrooms 3} {Price 995}

WHISK(Reservoir)
    RuleSet = NULL
    Training = NULL
    Repeat at user's request
        Select a batch of NewInst from Reservoir
        (User tags the NewInst)
        Add NewInst to Training
        Discard rules with errors on NewInst
        For each Inst in Training
            For each Tag of Inst
                If Tag is not covered by RuleSet
                    Rule = GROW_RULE(Inst, Tag, Training)
        Prune RuleSet

Page 31:

Contd:

Anchoring the extraction slots:

Empty rule: "* ( * ) * ( * ) * ( * ) *"

Anchoring slot 1:
    Base_1: * ( Nghbr )
    Base_2: '@start' ( * ) ' -'
Anchoring slot 2:
    Base_1: * ( Nghbr ) * ( Digit )
    Base_2: * ( Nghbr ) * '- ' ( * ) ' br'
Anchoring slot 3:
    Base_1: * ( Nghbr ) * ( Digit ) * ( Number )
    Base_2: * ( Nghbr ) * ( Digit ) * '$' ( * ) '.'

Page 32:

Contd:

GROW_RULE(Inst, Tag, Training)
    Rule = empty rule (terms replaced by wildcards)
    For i = 1 to number of slots in Tag
        ANCHOR(Rule, Inst, Tag, Training, i)
    Do until Rule makes no errors on Training or no improvement in Laplacian
        EXTEND_RULE(Rule, Inst, Tag, Training)

ANCHOR(Rule, Inst, Tag, Training, i)
    Base_1 = Rule + terms just within extraction i
    Test first i slots of Base_1 on Training
    While Base_1 does not cover Tag
        EXTEND_RULE(Base_1, Inst, Tag, Training)
    Base_2 = Rule + terms just outside extraction i
    Test first i slots of Base_2 on Training
    While Base_2 does not cover Tag
        EXTEND_RULE(Base_2, Inst, Tag, Training)
    Rule = Base_1
    If Base_2 covers more of Training than Base_1
        Rule = Base_2

Laplacian = (e + 1) / (n + 1), where e is the number of errors and n is the number of extractions made on the training set.

Page 33:

Contd:

EXTEND_RULE(Rule, Inst, Tag, Training)
    Best_Rule = NULL
    Best_L = 1.0
    If Laplacian of Rule within error tolerance
        Best_Rule = Rule
        Best_L = Laplacian of Rule
    For each Term in Inst
        Proposed = Rule + Term
        Test Proposed on Training
        If Laplacian of Proposed < Best_L
            Best_Rule = Proposed
            Best_L = Laplacian of Proposed
    Rule = Best_Rule

WHISK performs a form of hill climbing and cannot guarantee that the rules it grows are optimal.

Page 34:

Interactive Preparation of Training

Selecting informative instances: in each iteration of WHISK, a batch of instances is selected from the reservoir of untagged instances, presented to the user for tagging, and then added to the training set.

Classes of untagged instances:
◦ instances covered by an existing rule
◦ instances that are near misses of a rule
◦ instances not covered by any rule

When to stop tagging?

Page 35:

Conclusion

Till now, IR has been done under the assumption of term independence; work is going on to overcome this assumption.

Ability to use dynamic knowledge in IR.

Visual IR and IE will improve our ability to use the WWW.

Page 36:

References

1. Xiaojin Zhu, "Advanced NLP: Information Retrieval", CS838-1, WISC, 2007.
2. Ian Soboroff, "Information Retrieval", CMSC 491I/691I, UMBC, 2002.
3. Nazli Goharian, "Language Models", CS429, Illinois Institute of Technology, 2007.
4. Fabio Ciravegna, "(LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts", IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, 2001.
5. Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Machine Learning 34, 233–272, 1999.
6. Line Eikvil, "Information Extraction from the World Wide Web: A Survey", Report No. 945, ISBN 82-539-0429-0, 1999.