Hierarchical Indexing and Flexible Element Retrieval for Structured Document

April 14, 2003
Hang Cui (School of Computing, NUS), Ji-Rong Wen (Microsoft Research Asia) and Tat-Seng Chua (School of Computing, NUS)


Presentation for ECIR’03, Pisa, Italy

Outline

• Motivations and problems

• Hierarchical index propagation and pruning

• Flexible element retrieval

• Evaluation

• Conclusions

Motivations

• More structured and semi-structured documents on the Web.
• Users want to explore more of the document structure.
  – Access only relevant parts of a document, i.e. sections or paragraphs
• IR can’t help
  – Document as the smallest resulting unit.
• Not Question Answering!
  – Can’t provide views of the internal document structure.

Encarta Articles – An Example

• Online encyclopedia.
• Well structured XML documents.
• Nodes (elements) – documents, sections and paragraphs (leaf nodes)
• Text contained in paragraphs, which constitute sections and documents.

[Figure: tree of an Encarta article with a Document node at the root, Section nodes at the intermediate levels and Paragraph nodes as leaves.]

Problems

• A document covers multiple aspects of a central topic
  – Represented by sections or paragraphs.
  – Users usually want just one of the aspects.
• How to achieve this goal by utilizing the document structure?
  – Flexible element retrieval to get elements at an arbitrary level rather than only leaf nodes.
  – Give each element, at every level, a proper keyword description.

Our contributions

• Building an index with the same hierarchical structure as the document.
  – Not just indexing the leaf nodes.
• Keyword propagation mechanism.
  – Assign proper keywords to the nodes at each level (push broad-sense keywords to upper-level nodes).
  – Why not use a weight propagation technique?
  – Considering terms’ distributions.
• Flexible element retrieval according to queries.
  – With the hierarchical index, the system can access arbitrary-level elements – documents, sections or paragraphs – w.r.t. queries.
  – Avoid assembling separate text fragments, which retrieval of leaf nodes only would require.

Hierarchical Indexing for Structured Documents

• Term weighting for the leaf nodes and the intermediate elements.
  – Combining the statistics of the term occurrences and the distributions.
  – Term selection threshold.
• Propagation and pruning of the index terms

Term Weighting for Paragraphs

• Paragraphs are “atomic”, without child elements.
• Consider the term occurrences only – TFIDF measure.

$$\mathrm{Weight}(t_i, P_j) = \ln\big(tf(t_i, P_j)\big) \times \ln\frac{N}{n_i}$$
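
A minimal Python sketch of the paragraph weighting on this slide. It assumes the weight is ln(tf) multiplied by ln(N / n), that N is the total number of paragraphs, and that n is the number of paragraphs containing the term; the slide does not define N and n, so those meanings are assumptions. The function name `paragraph_weights` and the whitespace tokenization are illustrative only.

```python
import math
from collections import Counter

def paragraph_weights(paragraphs):
    """Return one {term: weight} dict per paragraph.

    Assumed reading: Weight(t, P) = ln(tf(t, P)) * ln(N / n_t), where N is the
    number of paragraphs and n_t the number of paragraphs containing t.
    """
    tokenized = [p.lower().split() for p in paragraphs]
    n_paragraphs = len(tokenized)
    # n_t: number of paragraphs in which each term occurs at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: math.log(count) * math.log(n_paragraphs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights
```

Under this reading, a term that occurs only once in a paragraph gets weight 0, because ln(1) = 0.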

Term Weighting for Intermediate elements

• Document-level or section-level elements.
• Taking into account the term distributions in the immediately descendant elements.

$$\mathrm{Weight}(t_i, E_j) = \ln\big(1 + tf(t_i, E_j)\big) \times I(t_i, E_j)$$

Measuring Term Distributions

• Entropy-like measurement
  – How evenly a term is distributed over all the immediate-descendant elements of an intermediate element.
  – Normalization factor – the theoretical maximum entropy.

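$$I(t_i, E_j) = \frac{\displaystyle\sum_{sub_k \in Sub(E_j)} \frac{tf(t_i, sub_k)}{tf(t_i, E_j)} \ln\frac{tf(t_i, sub_k)}{tf(t_i, E_j)}}{\displaystyle\sum_{sub_k \in Sub(E_j)} \frac{1}{N(sub)} \ln\frac{1}{N(sub)}} = \frac{\displaystyle\sum_{sub_k \in Sub(E_j)} tf(t_i, sub_k) \ln\frac{tf(t_i, sub_k)}{tf(t_i, E_j)}}{tf(t_i, E_j)\,\ln\dfrac{1}{N(sub)}}$$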
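
The entropy-like measure and the intermediate-element weight can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: tf(t, E) is taken to be the sum of the term's counts in the immediate children, N(sub) is taken to be the number of immediate children, 0 · ln 0 is treated as 0, and an element with a single child returns 0 because the normalizer ln(1/N(sub)) would be 0. The names `distribution` and `element_weight` are hypothetical.

```python
import math

def distribution(child_tfs):
    """Entropy-like evenness I(t, E) of a term over the children of E.

    child_tfs holds tf(t, sub_k) for every immediate child, zeros included.
    The result is 1 for a perfectly even spread and 0 for a term confined
    to one child.
    """
    n_children = len(child_tfs)
    total = sum(child_tfs)  # assumed to equal tf(t, E)
    if n_children < 2 or total == 0:
        # Assumption: the ratio is undefined here, so report no evenness.
        return 0.0
    entropy = -sum((tf / total) * math.log(tf / total)
                   for tf in child_tfs if tf > 0)
    # Dividing by ln(N(sub)), the maximum entropy, normalizes to [0, 1].
    return entropy / math.log(n_children)

def element_weight(child_tfs):
    """Weight(t, E) = ln(1 + tf(t, E)) * I(t, E)."""
    return math.log(1 + sum(child_tfs)) * distribution(child_tfs)
```

A term spread evenly across all children keeps its full ln(1 + tf) weight, while a term concentrated in one child is weighted down toward 0, which is what lets broad-sense terms rise to upper-level elements.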

Term Selection

• Term weights are normalized to the range between 0 and 1 for the purpose of comparison.
• Compare the terms within one element.
  – Select the terms whose weights exceed a threshold as the index terms for this element.
• Repeat this process from the bottom up.
  – Broader-sense terms can be propagated to upper-level elements.
  – Term pruning to avoid duplication of index terms.

Terms Propagation and Pruning Algorithm

1. For each leaf element, i.e. paragraph, calculate the weights of all its terms.

2. For each composite element Ej at the next upper level, calculate the terms’ weights by measuring these terms’ occurrences in this element and their distributions in the immediate-descendant elements of Ej.

3. For term ti, if Weight(ti, Ej) >= average(Ej) + std_dev(Ej), then this term is selected as an index term of the element Ej, and all the descendant elements of Ej eliminate ti from their index term lists. This process is called index term propagation and pruning.

4. Recursively perform step 2 onwards until the root node (i.e., the document) is reached.
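
The four steps above can be sketched in Python as below. The sketch assumes the normalized term weights of every element have already been computed, that a leaf starts with all of its weighted terms as index terms, and that std_dev is the population standard deviation; the slides do not settle these points. The `Element` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class Element:
    name: str
    weights: dict                        # term -> normalized weight in this element
    children: list = field(default_factory=list)
    index_terms: set = field(default_factory=set)

def descendants(element):
    """Yield every element below `element`, at any depth."""
    for child in element.children:
        yield child
        yield from descendants(child)

def propagate_and_prune(element):
    """Build the index term lists bottom-up for the tree rooted at `element`."""
    if not element.children:
        # Assumption: a leaf starts with all of its weighted terms.
        element.index_terms = set(element.weights)
        return
    for child in element.children:
        propagate_and_prune(child)       # lower levels are processed first
    values = list(element.weights.values())
    if not values:
        return
    threshold = mean(values) + pstdev(values)
    for term, weight in element.weights.items():
        if weight >= threshold:
            element.index_terms.add(term)            # propagate to this level
            for below in descendants(element):
                below.index_terms.discard(term)      # prune from all descendants
```

For example, with two sections that both mention "china" and a document in which "china" is by far the heaviest term, calling `propagate_and_prune` on the document leaves "china" indexed only at the document level.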

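An illustration of the process

[Figure with two panels, “The Document Structure” and “The Index Structure”. In the document structure, two sections under one document carry the terms Qing, Dynasty, Manchu, Kang Xi, History, China and Tang, Dynasty, Sui, Sui Yang, History, China. In the index structure, the section-level entries are reduced to Qing, Manchu, Kang Xi and Tang, Sui, Yang, while History, Dynasty, Economy and China are held at upper levels.]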

Flexible Element Retrieval

• No term duplications along one path.
• The path of an element
  – including all the elements from this node to the root.
• Ranking relevant elements is equivalent to ranking their paths.

$$\mathrm{Relevance}(Path_p) = \sum_{i=1}^{|Q|} \mathrm{Weight}(t_i, Path_p) \times \ln\frac{N}{n_i}$$

Path Ranking Algorithm

1. Find all elements that contain at least one query term.

2. Get paths for all candidate elements and merge the paths, that is, merge two paths into one if one is a part of the other.

3. Assign the weights of the query terms for the elements to their paths respectively.

4. Rank these paths using the relevance equation on the previous slide.

5. Return, in descending order of rank, the elements corresponding to the ranked paths whose ranks satisfy the pre-defined threshold.
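
A rough Python sketch of the five steps, under several assumptions: the hierarchical index is a dict mapping an element id to its parent id and its index term weights, the ln(N / n) factor for each term is supplied ready-made in `idf`, each query term is listed once, and the function returns ranked paths with their scores rather than rendered elements. All names here are hypothetical.

```python
def path_to_root(index, element_id):
    """Return the element ids from `element_id` up to the root, as a tuple."""
    path = []
    while element_id is not None:
        path.append(element_id)
        element_id = index[element_id]["parent"]
    return tuple(path)

def rank_paths(index, idf, query_terms, threshold=0.0):
    # Step 1: candidates hold at least one query term among their index terms.
    candidates = [eid for eid, entry in index.items()
                  if any(t in entry["terms"] for t in query_terms)]
    # Step 2: build the paths, then drop any path that is part of a longer one.
    paths = {path_to_root(index, eid) for eid in candidates}
    merged = [p for p in paths
              if not any(p != q and q[-len(p):] == p for q in paths)]
    # Steps 3-4: a path's relevance sums the query term weights found on it.
    scored = []
    for path in merged:
        score = sum(index[eid]["terms"].get(t, 0.0) * idf.get(t, 0.0)
                    for eid in path for t in query_terms)
        scored.append((path, score))
    # Step 5: best first, keeping only paths that reach the threshold.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(path, score) for path, score in scored if score >= threshold]
```

Because index terms are not duplicated along a path, each query term contributes at most once per path, so a section whose ancestors index the broader query terms can outrank an isolated paragraph.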

Result Browsing

• The prototype interface can
  – Highlight the relevant parts of the selected document.
  – Allow the user to browse results in the original document structure.
• Query example – “the Manchu Qing Dynasty”
  – A section in “China”
  – The whole document for “Qing Dynasty”

Prototype Interface

Evaluation

• Data Set
  – 41,942 XML documents on various topics from the Encarta online encyclopedia.
  – Ten experimental queries
    • Can be answered by only parts of the relevant document, e.g. “Fleet Street in London” answered by a paragraph of the document London.
    • Relevance judgment made by human assessors – for each query, there is a group of paragraphs representing relevant sections or such paragraphs.
  – Baseline system (TFIDF Para)
    • Indexing paragraph nodes only.
    • Applying the TFIDF measure to weight terms and using cosine similarity to retrieve answers.
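
For orientation, a paragraph-only TFIDF baseline with cosine similarity might look like the Python sketch below. The exact TFIDF variant used by TFIDF Para is not given on the slide; raw term frequency multiplied by ln(N / df) and whitespace tokenization are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    """Return sparse TFIDF vectors (dicts) for the paragraphs, plus the idf table."""
    tokenized = [p.lower().split() for p in paragraphs]
    n = len(tokenized)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(paragraphs, query):
    """Return (score, paragraph position) pairs, best match first."""
    vectors, idf = tfidf_vectors(paragraphs)
    q = {t: c * idf.get(t, 0.0)
         for t, c in Counter(query.lower().split()).items()}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)),
                  reverse=True)
```

Such a baseline can only return individual paragraphs, which is the limitation the hierarchical index is meant to remove.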

Performance Evaluation

• Use precision, recall and F-value as performance metrics.
• Two sets of hierarchical index
  – Utilizing titles and without considering titles.
• Answer selection threshold
  – Fixed numbers 0.1 – 0.9, used by most existing systems.
  – Dynamic thresholds – Avg and Avg + Std_Dev
• Compared our system with TFIDF Para using different answer selection thresholds.
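
The dynamic thresholds and the three metrics can be illustrated with a short Python sketch. It assumes that Avg and Std_Dev are the mean and population standard deviation of the candidate answers' scores, and that the F-value is the balanced F-measure (harmonic mean of precision and recall); neither detail is spelled out on the slide. The function names are hypothetical.

```python
from statistics import mean, pstdev

def select_answers(scores, mode="avg+std"):
    """Keep answers whose score reaches a threshold derived from all scores.

    scores maps an answer id to its relevance score.
    mode "avg" uses the mean; "avg+std" adds one standard deviation.
    """
    values = list(scores.values())
    if not values:
        return set()
    threshold = mean(values)
    if mode == "avg+std":
        threshold += pstdev(values)
    return {item for item, score in scores.items() if score >= threshold}

def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-value) for two sets of answer ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_value = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_value
```

A dynamic threshold adapts to the score distribution of each query, whereas a fixed cut-off such as 0.5 treats every query alike.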

Results of Performance Comparison

• Figures are impressive
  – Improvements in precision are 48.83% (w/ titles) and 41.67% (w/o titles) on average.
  – For F-values, improvements are 56.02% (w/ titles) and 40.89% (w/o titles).
  – Recall slightly decreases with some threshold settings (too rigorous a threshold for index term selection).
• User feedback
  – Our system can find more meaningful units instead of separate paragraphs, including some paragraphs not actually containing query terms.
  – Users are clear about their context when browsing the answers within the original document structure.

Threshold Setting

• Our system is less sensitive to the answer selection threshold settings.
• Dynamic threshold is a good alternative for such structured document retrieval.

[Figure: “Comparison of F-Values with Different Selection Thresholds”. Axes: Thresholds and F-Values. Series: TFIDF Para, Flexible Retrieval (with title terms), Flexible Retrieval (without title terms).]

Conclusions

• A novel hierarchical index propagation and pruning mechanism to generate a structured index.
• Flexible element retrieval, which returns relevant elements at arbitrary levels, is realized on the hierarchical index.
• It can better satisfy users than previous passage retrieval systems.
• More work can be done on generating hierarchical indexes for federated search.

Thanks!