Discipline-Independent Canonical Representation Extraction

Discipline-Independent and Style-Free Canonical Representation Extraction for Heterogeneously Styled References Using Knowledge from the Web

Sung Hee Park

Department of Computer Science, Virginia Tech

October 14, 2011


Introduction

Problem Contexts, Requirements, and Problems

Problem Contexts

- Scholarly Digital Libraries
  - citation analysis (CiteSeerX and the ACM Digital Library)
- Text Processing, e.g., Supporting Machine Reading
  - unsupervised learning
  - domain-independent methods

[Figure slides: Problem Contexts; Problem Contexts - Inverse Problem; Problem Contexts - Database Version; Problem Contexts - Rendered Version]

Requirements

- Scalability across Disciplines
  - VT-ETD db:
    - Over 18,000 ETDs
    - 8 colleges
    - 79 departments
  - arXiv:
    - 662,023 e-prints
    - 7 large disciplines
    - 148 sub-categories
- Exposing References for Wider Utilization
- Citation Reference Analysis for Impact Evaluation
- Citation Metadata Extraction as Inputs to Zotero + COinS
  - Zotero: Firefox tool
  - COinS (ContextObjects in Spans): convention for bibliographic metadata


Problems

- Surface & Semantics Mapping
  - semantic labeling
- Disciplines & Styles
  - scalability across disciplines and styles
- Domain Specific & Implicit General Knowledge
  - acquisition


Challenges and Opportunities

Challenges

Wide Variety of Citation Styles

Different Document Types

Discipline Dependent Properties

Lexical Ambiguities (e.g., acronyms, homonyms)

Errors in Typing and Other Inaccuracies


Opportunities from the Web

Abundant Bibliographic Data

Easy Access to Data and Information

Availability of Training Data


Research Questions, Approach

Research Questions

1 What open Internet information and knowledge is available to help address this problem?

2 What features should be selected for training?

3 What methods give the best performance and greatest effectiveness?


Approach

We solve the Representation Extraction Problem:

- Effective and efficient information extraction from references found in publications
- To generate canonical representations from heterogeneously styled references

Using:

- An integration of machine learning and knowledge-based approaches
  - entity (e.g., name, city) lists identified on the WWW
  - and a variety of training data
- Two-stage classifier collections of style descriptions

Without knowledge of a reference’s style or domain


Related Work

Metadata Extraction Methods

Metadata Extraction Methods

Table: Comparison of Previous Approaches (S = supervised, U = unsupervised)

Approach          Author & Year           Description  Sup/Unsup
Rule-based        Day et al. (2006)       INFOMAP      S
                  Cortez et al. (2007)    FLUX-CiM     U
                  Afzal et al. (2010)     TIERL        U
Machine learning  Councill et al. (2008)  CRF          S
                  Hong et al. (2009)      CRF          S
                  Hetzner (2008)          HMM          S


Table: Pros vs. Cons

Approach              Pros                                  Cons
Rule/Knowledge-based  Unsupervised methods exist            Not easy to extract rules; discipline-dependent
Machine learning      If training data are ready, scalable  Difficult to get training data


Features

Table: Features for Canonical Representation Extraction

Features             Description
Local features       Non-lexical information about the token
Lexical features     Information about the meaning of the words within the token
Contextual features  Lexical or local features of a token’s neighbours
Layout features      Relative position of a word in the entire reference string
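
The contextual and layout rows of the table can be illustrated with a minimal Python sketch. The function names, the one-token neighbour window, and the `rel_pos` normalization to [0, 1] are assumptions made for illustration; the slides do not specify them.

```python
def is_initcap(token):
    """Illustrative local feature: does the token start with a capital letter?"""
    return token[:1].isupper()


def contextual_and_layout_features(tokens, local_feature):
    """Attach neighbour (contextual) and relative-position (layout) features.

    The one-token window and the [0, 1] position scale are assumptions.
    """
    n = len(tokens)
    rows = []
    for i, tok in enumerate(tokens):
        rows.append({
            "token": tok,
            "local": local_feature(tok),
            # Contextual: the same local feature computed on each neighbour.
            "prev_local": local_feature(tokens[i - 1]) if i > 0 else None,
            "next_local": local_feature(tokens[i + 1]) if i < n - 1 else None,
            # Layout: relative position of the token in the reference string.
            "rel_pos": i / (n - 1) if n > 1 else 0.0,
        })
    return rows


rows = contextual_and_layout_features(["Park", "S.", "2011"], is_initcap)
```

Here the middle token `S.` sees a capitalized predecessor, a non-capitalized successor, and sits at relative position 0.5.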


Knowledge Bases

Knowledge Bases

Acquisition Methods

1 Manual

2 Semi-Automatic

3 Automatic

Knowledge Scope

1 Common Sense

2 Domain-Specific


Methodology

Our Proposed Hybrid Method

1 Knowledge Bases

2 Feature Extraction

3 Learning & Classification

[Figure: Our Proposed Hybrid Method]

Building Knowledge Bases from Mining the Web

Building Knowledge Bases from Mining the Web

1 Knowledge Bases

2 Bibliographic Databases on the Web

3 General World Knowledge

4 Domain-Specific Knowledge


Knowledge Bases

A knowledge base is defined as a set of pairs $K = \{(o_1, i_1), (o_2, i_2), \ldots, (o_n, i_n)\}$, where

- $o_n$ is a bibliographic field, such as
  - author
  - title
  - journal
- $i_n$ is its corresponding instance

Ex:
- ('AUTHOR', 'Sung Hee Park'),
- ('JOURNAL', 'ACM Transactions on Information Systems'),
- ('YEAR', '2011').
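
A minimal Python sketch of this definition follows, assuming the knowledge base is held in memory as a set of (field, instance) tuples. The helper names `index_by_field` and `fields_for` are hypothetical, and the second author entry is added only to make the example less trivial.

```python
from collections import defaultdict

# K as a set of (field, instance) pairs, modeled on the examples above.
K = {
    ("AUTHOR", "Sung Hee Park"),
    ("AUTHOR", "Edward A. Fox"),  # extra illustrative entry
    ("YEAR", "2011"),
}


def index_by_field(kb):
    """Group instances by bibliographic field, giving one lexicon per field."""
    index = defaultdict(set)
    for field, instance in kb:
        index[field].add(instance)
    return dict(index)


def fields_for(kb, instance):
    """Return every field under which an instance string is recorded."""
    return sorted(field for field, inst in kb if inst == instance)


lexicons = index_by_field(K)
```

Indexing by field turns the pair set into per-field lexicons, which is the form a lexicon-matching feature would look words up in.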


Bibliographic Databases on the Web

CiteULike

Google Scholar

DBLP (Digital Bibliography & Library Project)


General World Knowledge Sources

Knowledge Type           Instances                     Sources
Person names             Sung Hee Park, Edward A. Fox  DBLP, Wiki
Cities                   Blacksburg, Washington DC     World Factbook, IEEE conf. info. service
Publishers               Springer, MIT Press           CiteULike, DBLP
Years, Dates             2011, Jan.                    IEEE Conference Information Schedule
DOI type identifiers     doi://100.100.1.1.            Crossref (http://www.crossref.org)
URL/URL Identifiers      DBLP:conf/iccsa/2005-2        DBLP, IEEE Conference Information Schedule
Reference Output Styles  APA, IEEE, AAAI               EndNote

[Figure slides: Knowledge Bases]

Building Knowledge Bases for Output Styles


Algorithm

1 Extract all style names from the bibliographic reference generation interface.
2 Import a reference set into EndNoteWeb.
3 Generate all output styled references (see Appendix).
4 Convert HTML files to text files.
5 Convert raw files to training sets.

[Figure: Knowledge Bases for Output Styles]

Feature Extraction

Feature Extraction

1 Tokenization
2 Feature Types
  - Local Features
  - Lexical Features
  - Contextual Features
  - Layout Features


Local Features

Categories                  Names                    Descriptions                                     Examples
Letter Patterns             INITCAP                  Starts with a capitalized letter                 Computer Science
                            ALLCAP                   All letters are capitalized                      COMPUTER
                            ACRO                     Acronyms                                         WWW
                            LONELYINITIAL            One single capitalized letter                    S.
Special Character Patterns  CONTAINSDOTS             Contains at least one dot                        S., C4.5
                            CONTAINSDASH             Contains at least one dash                       123-124
                            PUNC                     Punctuation                                      dot ("."), comma (",")
                            Ended with dot (.)       Regular expression for ending with a dot         A.
Special Patterns            EMAIL                    Regular expression for e-addresses               [email protected]
                            WORD                     Word                                             references
                            Pagination pattern       Regular expression for pagination formats        200-5, H100-H105
Numeric Patterns            Four-digit year pattern  Regular expression for four-digit year patterns  2005
                            Six-digit pattern        Regular expression for six-digit patterns        2005
                            CONTAINSDIGITS           Contains at least one digit                      1, F1, A1*
Length Patterns             fieldLength              # of characters the token has                    fieldLength(style)=5
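
Several of these local features can be sketched as regular expressions in Python. The slides name the features but do not give the patterns, so every regex below is an assumption (for instance, the year pattern accepts 1500 to 2099, and `ENDSWITHDOT` stands in for the "Ended with dot" row).

```python
import re

# Assumed patterns; the slides list feature names only.
LOCAL_PATTERNS = {
    "INITCAP": re.compile(r"^[A-Z]"),
    "ALLCAP": re.compile(r"^[A-Z]+$"),
    "LONELYINITIAL": re.compile(r"^[A-Z]\.?$"),
    "CONTAINSDOTS": re.compile(r"\."),
    "CONTAINSDASH": re.compile(r"-"),
    "ENDSWITHDOT": re.compile(r"\.$"),
    "FOURDIGITYEAR": re.compile(r"^(1[5-9]|20)\d{2}$"),
    "PAGINATION": re.compile(r"^[A-Z]?\d+-[A-Z]?\d+$"),
    "CONTAINSDIGITS": re.compile(r"\d"),
}


def local_features(token):
    """Return the binary local features of a token plus its length."""
    feats = {name: bool(p.search(token)) for name, p in LOCAL_PATTERNS.items()}
    feats["fieldLength"] = len(token)  # number of characters in the token
    return feats
```

For example, `local_features("H100-H105")["PAGINATION"]` is true, and `local_features("style")["fieldLength"]` is 5, matching the table's length example.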


Lexical Features

Names          Descriptions                          Examples
FAMILYNAME     Match word in family name lexicon     Smith, Johns
AFFILIATION    Word like University, Institute       University, Institution, Labs
ADDRESS        Match word in address lexicon         Blacksburg, Virginia
AUTHOR         Match word in author lexicon          Blacksburg, Virginia
ARTICLE TITLE  Match word in article title lexicon   Blacksburg, Virginia
JOURNAL TITLE  Match word in journal title lexicon   Blacksburg, Virginia
TRUNCATION     The word is et or al, or et., or al.  et al, et. al.
PAGE           The word is pp. or p., or pp, or p    Blacksburg, Virginia
DATE           Match word such as Jan., Feb.         Jan., Feb.
NOTES          Words like appeared, submitted        submitted, in print
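
Lexical features of this kind reduce to set-membership tests against lexicons. The Python sketch below uses tiny lexicons seeded only with words that appear in the table; in the proposed method they would come from the Web-mined knowledge bases. Lowercasing and stripping trailing commas are assumptions of this sketch.

```python
# Tiny illustrative lexicons built from the table's own example words.
LEXICONS = {
    "FAMILYNAME": {"smith", "johns"},
    "AFFILIATION": {"university", "institute", "institution", "labs"},
    "ADDRESS": {"blacksburg", "virginia"},
    "TRUNCATION": {"et", "al", "et.", "al."},
    "PAGE": {"pp.", "p.", "pp", "p"},
    "DATE": {"jan.", "feb."},
    "NOTES": {"appeared", "submitted"},
}


def lexical_features(token):
    """Mark which lexicons contain the token (case-insensitive).

    Trailing commas, semicolons and colons are stripped; dots are kept
    because entries such as 'pp.' and 'Jan.' depend on them.
    """
    key = token.lower().rstrip(",;:")
    return {name: key in words for name, words in LEXICONS.items()}
```

With these lexicons, `lexical_features("Smith,")` sets only `FAMILYNAME`, and `lexical_features("pp.")` sets only `PAGE`.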


Style-Free Canonical Representation Extraction through Two-Stage Method

Two-Stage Method


1 Output Style Classification

2 Canonical Representation Extraction


Output Style Classification

1 Multi-class classification problem
2 Multiple binary class classification problem
3 SVM
  - maximum margin classifier
  - kernel function method


Output Style Classification: Algorithm

1 Input data: any styled reference string $S$.
  - Tokenized by a set of $m$ delimiters $D = \{d_1, d_2, \ldots, d_m\}$,
2 Segmented into a set of $n$ tokens $T = \{t_1, t_2, \ldots, t_n\}$.
3 A set of $q$ features $F_i = \{f_{i,1}, f_{i,2}, \ldots, f_{i,q}\}$ is extracted for each token $t_i$.
4 A reference feature vector $\vec{r} = (f_{1,1}, f_{1,2}, \ldots, f_{n,q})$, where $f_{i,j}$ is the $j$th feature of the $i$th token,
  - is the input to the SVM classifier stated above.
5 This SVM is already trained on a training corpus of references (label 3),
  - with world general knowledge from the Web (e.g., the EndNoteWeb reference management tool, label 4).
  - The output is one of the output styles (labels 5-7).
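
Steps 1 to 4 above can be sketched in Python as follows. The delimiter set, the three per-token features, and the zero-padding to a fixed number of tokens are all assumptions of this sketch; the slides do not say how variable-length references are mapped to a fixed-size SVM input, and the classifier itself is omitted.

```python
import re

DELIMITERS = [" ", ",", ";", ":", "(", ")"]  # assumed delimiter set D
FEATURES_PER_TOKEN = 3                       # assumed number of features per token


def tokenize(reference, delimiters=DELIMITERS):
    """Split a reference string on any run of delimiter characters."""
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    return [t for t in re.split(pattern, reference) if t]


def token_features(token):
    """Three illustrative binary features for one token."""
    return [
        int(token[:1].isupper()),              # starts with a capital letter
        int(any(c.isdigit() for c in token)),  # contains a digit
        int(token.endswith(".")),              # ends with a dot
    ]


def reference_vector(reference, max_tokens=8):
    """Concatenate per-token features, truncating or zero-padding to a fixed size."""
    vec = []
    for tok in tokenize(reference)[:max_tokens]:
        vec.extend(token_features(tok))
    vec.extend([0] * (FEATURES_PER_TOKEN * max_tokens - len(vec)))
    return vec


tokens = tokenize("Park, S. H. (2011). Title.")
vector = reference_vector("Park, S. H. (2011). Title.")
```

The resulting fixed-length vector is what a multi-class style classifier would take as input.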


Canonical Representation Extraction

- Sequence labeling problem
- Conditional Random Field
  - discriminative probabilistic model
  - models the conditional $P(Y \mid X; W)$ and predicts $\arg\max_Y P(Y \mid X; W)$
  - instead of working with the joint, $\arg\max_Y P(Y, X)$
    - $Y$ is a sequence of labels drawn from the label set $L = \{l_1, l_2, \ldots, l_k\}$,
    - $X$ is an input reference string,
    - transformed into a set of tokens $T = \{t_1, t_2, \ldots, t_n\}$, and
    - $W$ is a set of weights for feature functions, $W = \{w_1, w_2, \ldots, w_m\}$.
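
For reference, the conditional distribution of a linear-chain CRF is commonly written as below. The slides do not state which CRF variant is used, so the linear-chain form and the feature-function signature $f_j(y_{i-1}, y_i, X, i)$ are assumptions; $n$ is the number of tokens and $m$ the number of weights, as defined above.

```latex
P(Y \mid X; W) = \frac{1}{Z(X; W)}
  \exp\!\Bigl( \sum_{i=1}^{n} \sum_{j=1}^{m} w_j \, f_j(y_{i-1}, y_i, X, i) \Bigr),
\qquad
Z(X; W) = \sum_{Y'} \exp\!\Bigl( \sum_{i=1}^{n} \sum_{j=1}^{m} w_j \, f_j(y'_{i-1}, y'_i, X, i) \Bigr)
```

The normalizer $Z(X; W)$ sums over all candidate label sequences $Y'$, which is what makes the model conditional on $X$ rather than a joint model of $(Y, X)$.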


Sequence tagging: Algorithm

1 Input data: any styled reference string $S$.
  - Tokenized by a set of $m$ delimiters $D = \{d_1, d_2, \ldots, d_m\}$,
2 Segmented into a set of $n$ tokens $T = \{t_1, t_2, \ldots, t_n\}$.
3 A set of $q$ features $F_i = \{f_{i,1}, f_{i,2}, \ldots, f_{i,q}\}$ is extracted for each token $t_i$.
4 A reference feature vector $\vec{r} = (f_{1,1}, f_{1,2}, \ldots, f_{n,q})$, where $f_{i,j}$ is the $j$th feature of the $i$th token,
  - is the input to the CRF classifier stated above.
5 This CRF is already trained on a training corpus of references (label 11),
  - with world general knowledge from the Web (e.g., the EndNoteWeb reference management tool, label 4).
  - The output is tagged references (labels 12-14).
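
Once every token carries a label, the tagged sequence can be folded into a field-to-value record. The Python sketch below shows one way to do this; the function name `to_canonical`, the label names, and the choice to join tokens with single spaces are assumptions, not the author's output format.

```python
def to_canonical(tokens, labels):
    """Merge consecutive tokens that share a label into one field value.

    Returns a dict mapping each label to a list of values, so a label that
    reappears after a gap yields a separate entry.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    record = {}
    prev = None
    for tok, lab in zip(tokens, labels):
        if lab == prev:
            record[lab][-1] += " " + tok            # extend the current run
        else:
            record.setdefault(lab, []).append(tok)  # start a new run
        prev = lab
    return record


record = to_canonical(
    ["Park", "S.", "H.", "2011", "Canonical", "extraction"],
    ["AUTHOR", "AUTHOR", "AUTHOR", "YEAR", "TITLE", "TITLE"],
)
```

The resulting record no longer depends on the order or punctuation of the original citation style.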


Evaluation

Evaluation

1 Preliminary Experiment

2 Knowledge Bases

3 Features

4 Sequence Labeling Methods


Preliminary Experiments

Experiment Design

1 Objectives
  - Check how reference metadata extraction accuracy depends on the output style.
2 Datasets
  - 2,500 references from 10 different output styles
  - 1) AAG, 2) ACS, 3) API, 4) APA, 5) Chicago15A, 6) IEEE, 7) JAMA, 8) MLA, 9) NLM, and 10) Turabian
3 Method
  - CRF


Metrics

$$\text{Accuracy} = \frac{\#\text{ of (true positive + true negative) tokens}}{\#\text{ of (true positive + false positive + true negative + false negative) tokens}}$$

$$\text{Precision} = \frac{\#\text{ of true positive tokens}}{\#\text{ of (true positive + false positive) tokens}}$$

$$\text{Recall} = \frac{\#\text{ of true positive tokens}}{\#\text{ of (true positive + false negative) tokens}}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
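
The token-level metrics above can be computed per field with a short Python function. Treating one field as the positive class and all other labels as negative is an assumption of this sketch, as is returning 0.0 when a denominator is zero.

```python
def token_metrics(gold, predicted, field):
    """Token-level accuracy, precision, recall and F1 for one field."""
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if p == field and g == field:
            tp += 1
        elif p == field:
            fp += 1
        elif g == field:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


gold = ["AUTHOR", "AUTHOR", "YEAR", "TITLE", "TITLE"]
pred = ["AUTHOR", "TITLE", "YEAR", "TITLE", "TITLE"]
author_scores = token_metrics(gold, pred, "AUTHOR")
```

In this toy example one of the two AUTHOR tokens is mislabeled as TITLE, so AUTHOR precision is 1.0 while recall is 0.5.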

[Figure: Average of Each Field Extraction]


Results

1 Overall
2 Discussion
  - Features

[Figure: AUTHOR Field Extraction]


Results

1 Author
2 Discussion
  - Features

[Figure: JOURNAL Field Extraction]


Results

1 Journal
2 Discussion
  - Features


Knowledge Bases, Features, and Classification Methods

Knowledge Bases

1 Experiment Design
  - Objectives
    - What WWW information and knowledge is available to help address this problem?
  - Datasets
    - VT-ETDs
    - arXiv
  - Knowledge bases
    - World general knowledge
    - Domain-specific knowledge
  - Methods
    - SVM+CRF (our two-stage method)
  - Metrics
    - Accuracy
    - Precision
    - Recall
    - F1


Features

1 Experiment Design
  - Objectives
    - What features are effective to solve the style-free canonical reference representation extraction problem?
  - Datasets
    - VT-ETDs
    - arXiv
  - Features
    - Local features
    - Lexical features
    - Contextual features
    - Layout features
  - Methods
    - SVM+CRF (our two-stage method)
  - Metrics
    - Accuracy
    - Precision
    - Recall
    - F1


Classification Methods

1 Experiment Design
  - Objectives
    - What methods give the best performance and greatest effectiveness in improving canonical reference extraction?
  - Datasets
    - VT-ETDs
    - arXiv
  - Methods
    - SVMstruct
    - HMM (Hidden Markov Model)
    - MEMM (Maximum Entropy Markov Model)
    - SVM+CRF (our two-stage method)
  - Metrics
    - Accuracy
    - Precision
    - Recall
    - F1


Research Timeline

[Figure: Research Timeline]

Contributions & Deliverables

Contributions

1 Scalable, discipline-independent citation metadata extraction method

2 Feature list supporting the extraction

3 Knowledge bases for discipline-independent citation extraction

4 Weakly supervised learning technologies generalizing across disciplines

Deliverables

1 A two-stage machine learning classifier and labeler

2 A feature list and extraction software tool

3 Knowledge bases and acquisition scripts

4 Training dataset builder scripts


Publications

Related Publications

Book chapters

1 Nadia P. Kozievitch, Ricardo da Silva Torres, Edward A. Fox, Sung Hee Park, Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao: Rethinking fingerprint evidence through integration of very large digital libraries. In: Castelli, D., Ioannidis, Y., Manghi, P., Pagano, P., Ross, S. (eds.). Proceedings of the Second DL.org Workshop on Making DLs Interoperable: Challenges & Approaches (MDLI2010) and Proceedings of the Third Workshop on Very Large Digital Libraries (VLDL2010), in conjunction with the European Conference on Digital Libraries 2010, Glasgow, Scotland (UK), 10th of September 2010, Springer LNCS, to appear in 2011


Peer-reviewed papers

1 Nadia P. Kozievitch, Ricardo da Silva Torres, Sung Hee Park, Edward A. Fox, Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao. Rethinking Fingerprint Evidence Through Integration of Very Large Digital Libraries. VLDL Workshop at 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL2010), Glasgow, Sept. 6-10, 8 pages

2 Sung Hee Park, Nicholas Lynberg, Jesse Racer, Phil McElmurray, Edward A. Fox. HTML5 ETDs. Refereed paper for ETD 2010 - 13th International Symposium on Electronic Theses and Dissertations. Austin, TX. June 16-18, 2010

3 Sung Hee Park, Jonathan P. Leidig, Lin Tzy Li, Edward A. Fox, Nathan J. Short, Kevin E. Hoyle, A. Lynn Abbott, and Michael S. Hsiao, Experiment and Analysis Services in a Fingerprint Digital Library for Collaborative Research, 1st Theory and Practice in Digital Libraries (TPDL 2011), Berlin, Sept. 26-28, 2011, submitted

Posters

1 Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao, Nadia Kozievitch, Sung Hee Park, Edward Fox. Latent Fingerprint Matching. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010

2 Supratik Misra, Nathan Short, Michael Hsiao, Lynn Abbott, Edward Fox, Sung Hee Park, Nadia Kozievitch. Fingerprint Sufficiency. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010

3 Sung Hee Park, Nadia Kozievitch, Edward A. Fox, Michael Hsiao, Lynn Abbott, Nathan Short, Supratik Misra. Model-based fingerprint image quality Analysis. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010

4 Nadia Kozievitch, Sung Hee Park, Supratik Misra, Nathan Short, Michael Hsiao, Lynn Abbott, Edward A. Fox. Database for Fingerprint Experiments. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010

5 Ryan Richardson, Venkat Srinivasan, Xiaoyu Zhang, Weihua Zhu, Sung Hee Park, Pramodh Pochu, Siva Sanagavarapu, Mustafa Rafique, Min He, Jiao Jiao, Edward Fox. Making ETDs More Usable for Students in a Multilingual World, 2009


Q & A
