Discipline-Independent and Style-Free Canonical Representation Extraction for Heterogeneously Styled References Using Knowledge from the Web

Sung Hee Park

Department of Computer Science, Virginia Tech

October 14, 2011

Introduction

Problem Contexts, Requirements, and Problems

Problem Contexts

Scholarly Digital Libraries
  - citation analysis (CiteSeerX and the ACM Digital Library)

Text Processing, e.g., Supporting Machine Reading
  - unsupervised learning
  - domain-independent methods

Problem Contexts - Inverse Problem

Problem Contexts - Database Version

Problem Contexts - Rendered Version

Requirements

Scalability across Disciplines
  - VT-ETD db:
      * Over 18,000 ETDs
      * 8 colleges
      * 79 departments
  - arXiv:
      * 662,023 e-prints
      * 7 large disciplines
      * 148 sub-categories

Exposing References for Wider Utilization

Citation Reference Analysis for Impact Evaluation

Citation Metadata Extraction as Inputs to Zotero + COinS
  - Zotero: Firefox tool
  - COinS (ContextObjects in Spans): convention for embedding bibliographic metadata

Problems

Surface & Semantics Mapping
  - semantic labeling

Disciplines & Styles
  - scalability across disciplines and styles

Domain-Specific & Implicit General Knowledge
  - acquisition

Challenges and Opportunities

Challenges

Wide Variety of Citation Styles

Different Document Types

Discipline Dependent Properties

Lexical Ambiguities (e.g., acronyms, homonyms)

Errors in Typing and Other Inaccuracies

Opportunities from the Web

Abundant Bibliographic Data

Easy Access to Data and Information

Availability of Training Data

Research Questions, Approach

Research Questions

1 What open Internet information and knowledge is available to help address this problem?

2 What features should be selected for training?

3 What methods give the best performance and greatest effectiveness?

Approach

We solve the Representation Extraction Problem

Effective and efficient information extraction from references found in publications

To generate canonical representations from heterogeneously styled references

Using:

An integration of machine learning and knowledge-based approaches
  - entity (e.g., name, city) lists identified on the WWW
  - and a variety of training data

Two-stage classifier collections of style descriptions

Without knowledge of a reference’s style or domain

Related Work

Metadata Extraction Methods


Table: Comparison of Previous Approaches

Approach          Author & Year           Description  Supervised (S) / Unsupervised (U)
Rule-based        Day et al. (2006)       INFOMAP      S
                  Cortez et al. (2007)    FLUX-CiM     U
                  Afzal et al. (2010)     TIERL        U
Machine learning  Councill et al. (2008)  CRF          S
                  Hong et al. (2009)      CRF          S
                  Hetzner (2008)          HMM          S

Table: Pros vs. Cons

Approach              Pros                                 Cons
Rule/Knowledge-based  Unsupervised methods exist           Not easy to extract rules; discipline-dependent
Machine learning      Scalable if training data are ready  Difficult to get training data

Features

Table: Features for Canonical Representation Extraction

Features             Description
Local features       Non-lexical information about the token
Lexical features     Information about the meaning of the words within the token
Contextual features  Lexical or local features of a token's neighbours
Layout features      Relative position of a word in the entire reference string
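
The contextual and layout rows of this table can be illustrated with a minimal sketch. It assumes each token already carries a dictionary of local or lexical features; the function name, the "@-1"/"@+1" key suffixes, the window size, and the REL_POS name are all hypothetical choices, not taken from the slides.

```python
def add_context_and_layout(token_features, window=1):
    """Copy each neighbour's features onto the token and add its relative position.

    token_features: list of dicts, one per token (assumed to exist already).
    """
    n = len(token_features)
    enriched = []
    for i, feats in enumerate(token_features):
        out = dict(feats)
        # Contextual features: features of the neighbours within the window.
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < n:
                continue
            for name, value in token_features[j].items():
                out[f"{name}@{offset:+d}"] = value
        # Layout feature: position of the token in the reference, scaled to [0, 1].
        out["REL_POS"] = i / (n - 1) if n > 1 else 0.0
        enriched.append(out)
    return enriched


tokens = [{"INITCAP": True}, {"INITCAP": False}, {"INITCAP": True}]
enriched = add_context_and_layout(tokens)
```

A wider window gives the labeler more context at the cost of a larger feature space; the best value would have to be found experimentally.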

Knowledge Bases

Acquisition Methods

1 Manual

2 Semi-Automatic

3 Automatic

Knowledge Scope

1 Common Sense

2 Domain-Specific

Methodology

Our Proposed Hybrid Method

1 Knowledge Bases

2 Feature Extraction

3 Learning & Classification

Building Knowledge Bases from Mining the Web


1 Knowledge Bases

2 Bibliographic Databases on the Web

3 General World Knowledge

4 Domain-Specific Knowledge

Knowledge Bases

A knowledge base is defined as a set of pairs K = {(o_1, i_1), (o_2, i_2), ..., (o_n, i_n)}
  - each o_k is a bibliographic field, such as
      * author
      * title
      * journal
  - each i_k is its corresponding instance

Ex:
  - ('AUTHOR', 'Sung Hee Park'),
  - ('JOURNAL', 'ACM Transactions on Information Systems'),
  - ('YEAR', '2011').
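
As a minimal sketch of this definition, the knowledge base can be held as a set of (field, instance) pairs and indexed by instance for lookup. The helper names `build_index` and `candidate_fields` and the case-insensitive matching are assumptions made for illustration.

```python
from collections import defaultdict

# The knowledge base K as a set of (field, instance) pairs.
KB = {
    ("AUTHOR", "Sung Hee Park"),
    ("JOURNAL", "ACM Transactions on Information Systems"),
    ("YEAR", "2011"),
}


def build_index(kb):
    """Map each lower-cased instance to the set of fields it is known under."""
    index = defaultdict(set)
    for field, instance in kb:
        index[instance.lower()].add(field)
    return index


def candidate_fields(index, text):
    """Return the fields whose known instances match the text exactly (ignoring case)."""
    return index.get(text.lower(), set())


index = build_index(KB)
fields = candidate_fields(index, "sung hee park")
```

Indexing by instance rather than by field makes the lookup direction match the extraction task: given a span of text, ask which fields it could belong to.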

Bibliographic Databases on the Web

CiteULike

Google Scholar

DBLP (Digital Bibliography & Library Project)

General World Knowledge Sources

Knowledge Type           Instances                     Sources
Person names             Sung Hee Park, Edward A. Fox  DBLP, Wiki
Cities                   Blacksburg, Washington DC     World Factbook, IEEE conf. info. service
Publishers               Springer, MIT Press           CiteULike, DBLP
Years, Dates             2011, Jan.                    IEEE Conference Information Schedule
DOI type identifiers     doi://100.100.1.1.            Crossref (http://www.crossref.org)
URL/URL Identifiers      DBLP:conf/iccsa/2005-2        DBLP, IEEE Conference Information Schedule
Reference Output Styles  APA, IEEE, AAAI               EndNote

Building Knowledge Bases for Output Styles

Algorithm

1 Extract all style names from the bibliographic reference generation interface.
2 Import a reference set into EndNote Web.
3 Generate all output-styled references (see Appendix).
4 Convert HTML files to text files.
5 Convert raw files to training sets.
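
Steps 4 and 5 of the algorithm could look roughly like the sketch below. It assumes that each style's export is an HTML fragment with one reference per block element; the real EndNote Web export format is not shown in the slides, so the tag list and both function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text, starting a new line at each block-level tag (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.lines = [""]

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "br", "div", "li"):
            self.lines.append("")

    def handle_data(self, data):
        self.lines[-1] += data


def html_to_references(html):
    """Step 4: strip markup and return one whitespace-normalised string per reference."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return [" ".join(line.split()) for line in parser.lines if line.strip()]


def build_training_set(styled_html):
    """Step 5: pair every reference string with the name of the style that produced it."""
    return [
        (reference, style)
        for style, html in styled_html.items()
        for reference in html_to_references(html)
    ]


sample = "<p>Smith, J. (2005). <i>Data mining</i>.</p><p>Lee, K. (2010). Text.</p>"
training_set = build_training_set({"APA": sample})
```

Because the same reference set is rendered in every style, the resulting (reference, style) pairs form labeled training data for a style classifier without manual annotation.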

Feature Extraction

1 Tokenization
2 Feature Types
  - Local Features
  - Lexical Features
  - Contextual Features
  - Layout Features

Local Features

Categories                  Names                    Descriptions                                    Examples
Letter Patterns             INITCAP                  Starts with a capitalized letter                Computer Science
                            ALLCAP                   All letters are capitalized                     COMPUTER
                            ACRO                     Acronyms                                        WWW
                            LONELYINITIAL            One single capitalized letter                   S.
Special Character Patterns  CONTAINSDOTS             Contains at least one dot                       S., C4.5
                            CONTAINSDASH             Contains at least one dash                      123-124
                            PUNC                     Punctuation                                     dot ("."), comma (",")
                            Ended with dot (.)       Regular expression for ending with a dot        A.
Special Patterns            EMAIL                    Regular expression for e-mail addresses         [email protected]
                            WORD                     Word                                            references
                            Pagination pattern       Regular expression for pagination formats       200-5, H100-H105
Numeric Patterns            Four-digit year pattern  Regular expression for four-digit year patterns 2005
                            Six-digit pattern        Regular expression for six-digit patterns       2005
                            CONTAINSDIGITS           Contains at least one digit                     1, F1, A1*
Length Patterns             fieldLength              # of characters the token has                   fieldLength(style)=5
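
A few of these local features can be sketched with plain string tests and regular expressions. The regular expressions below are illustrative guesses that reproduce the examples in the table; the slides do not give the actual patterns, and keys such as ENDSWITHDOT, FOURDIGITYEAR, and PAGINATION are hypothetical names.

```python
import re

# Hypothetical patterns chosen to match the examples in the table above.
YEAR_RE = re.compile(r"(1[5-9]|20)\d{2}")
PAGINATION_RE = re.compile(r"[A-Z]?\d+-[A-Z]?\d+")
INITIAL_RE = re.compile(r"[A-Z]\.?")


def local_features(token):
    """Return non-lexical (surface) features of a single token."""
    return {
        "INITCAP": token[:1].isupper(),
        "ALLCAP": token.isalpha() and token.isupper(),
        "LONELYINITIAL": bool(INITIAL_RE.fullmatch(token)),
        "CONTAINSDOTS": "." in token,
        "CONTAINSDASH": "-" in token,
        "CONTAINSDIGITS": any(ch.isdigit() for ch in token),
        "ENDSWITHDOT": token.endswith("."),
        "FOURDIGITYEAR": bool(YEAR_RE.fullmatch(token)),
        "PAGINATION": bool(PAGINATION_RE.fullmatch(token)),
        "fieldLength": len(token),
    }


features = local_features("H100-H105")
```

None of these tests needs a lexicon, which is why such features transfer across disciplines more easily than word lists do.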

Lexical Features

Names          Descriptions                                      Examples
FAMILYNAME     Match word in family name lexicon                 Smith, Johns
AFFILIATION    Word like University, Institute                   University, Institution, Labs
ADDRESS        Match word in address lexicon                     Blacksburg, Virginia
AUTHOR         Match word in author lexicon                      Blacksburg, Virginia
ARTICLE TITLE  Match word in article title lexicon               Blacksburg, Virginia
JOURNAL TITLE  Match word in journal title lexicon               Blacksburg, Virginia
TRUNCATION     The word is "et" or "al", or "et." or "al."       et al, et. al.
PAGE           The word is "pp." or "p.", or "pp" or "p"         pp., p.
DATE           The word is a month abbreviation                  Jan., Feb.
NOTES          Words like appeared, submitted                    submitted, in print
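
Lexical features of this kind reduce to membership tests against word lists. The sketch below uses tiny in-line lexicons built from the examples in the table; in the proposed system the lexicons would come from the Web-mined knowledge bases, and the function name and normalisation rule are assumptions.

```python
# Toy lexicons seeded from the table's examples; real ones would be mined from the Web.
LEXICONS = {
    "FAMILYNAME": {"smith", "johns"},
    "AFFILIATION": {"university", "institute", "institution", "labs"},
    "ADDRESS": {"blacksburg", "virginia"},
    "DATE": {"jan", "feb"},
    "NOTES": {"appeared", "submitted"},
    "TRUNCATION": {"et", "al"},
    "PAGE": {"pp", "p"},
}


def lexical_features(token):
    """Flag every lexicon that contains the token, ignoring case and edge punctuation."""
    word = token.strip(".,;:()").lower()
    return {name: word in lexicon for name, lexicon in LEXICONS.items()}


features = lexical_features("Smith,")
```

Stripping edge punctuation before lookup lets "al." and "al" hit the same entry, so the lexicon does not need one entry per punctuation variant.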

Style-Free Canonical Representation Extraction through Two-Stage Method

Two-Stage Method

1 Output Style Classification

2 Canonical Representation Extraction

Output Style Classification

1 Multi-class classification problem

2 Multiple binary class classification problem
3 SVM
  - maximum margin classifier
  - kernel function method
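
The reduction from one multi-class problem to several binary problems can be sketched as a one-vs-rest scheme. To stay dependency-free, the binary learner here is a simple linear classifier trained on the hinge condition (update whenever the margin is below 1); it stands in for a real SVM solver, which would also maximise the margin and could use a kernel. The three cue features and the toy data are invented for illustration.

```python
def train_binary(X, y, lr=0.1, epochs=200):
    """Linear classifier updated whenever y * score < 1 (a stand-in for an SVM)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def train_one_vs_rest(X, labels):
    """Train one binary classifier per style: that style (+1) against all others (-1)."""
    models = {}
    for style in sorted(set(labels)):
        y = [1 if label == style else -1 for label in labels]
        models[style] = train_binary(X, y)
    return models


def predict(models, x):
    """Pick the style whose binary classifier gives the highest score."""
    def score(style):
        w, b = models[style]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)


# Hypothetical cue features: one distinguishing surface cue per style.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = ["APA", "IEEE", "MLA"]
models = train_one_vs_rest(X, labels)
predictions = [predict(models, x) for x in X]
```

One-vs-rest needs only as many classifiers as there are styles, which matters when the number of output styles is large.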

Output Style Classification: Algorithm

1 Input data: any styled reference string S.
  - Tokenized by a set of m delimiters D = {d_1, d_2, ..., d_m},
2 Segmented into a set of n tokens T = {t_1, t_2, ..., t_n}.
3 A set of q features F_i = {f_{i,1}, f_{i,2}, ..., f_{i,q}} is extracted for each token t_i.
4 A reference feature vector r = (f_{1,1}, f_{1,2}, ..., f_{p,q}), where f_{i,j} is the j-th feature of the i-th token,
  - Inputs of the SVM classifier stated above
5 This SVM is already trained on a training corpus of references (label 3),
  - World general knowledge from the Web (e.g., EndNote Web reference management tool of label 4).
  - Output is one of the output styles (labels 5-7).
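
Steps 1 to 4 can be sketched as follows. The delimiter set, the four per-token features, and the fixed token budget `max_tokens` (the p in the reference vector) are assumptions made so that every reference maps to a vector of the same length; the slides do not specify them.

```python
import re

DELIMITERS = [" ", ",", ";", ":"]   # hypothetical delimiter set D (single characters)
FEATURES_PER_TOKEN = 4              # q in the notation above


def tokenize(reference, delimiters=DELIMITERS):
    """Steps 1-2: split the reference string on any run of delimiters."""
    pattern = "[" + "".join(re.escape(d) for d in delimiters) + "]+"
    return [token for token in re.split(pattern, reference) if token]


def token_features(token):
    """Step 3: a small illustrative feature set for one token."""
    core = token.strip(".()")
    return [
        float(token[:1].isupper()),                         # INITCAP
        float(core.isalpha() and core.isupper()),           # ALLCAP
        float(any(ch.isdigit() for ch in token)),           # CONTAINSDIGITS
        float(bool(re.fullmatch(r"(19|20)\d{2}", core))),   # four-digit year
    ]


def reference_vector(reference, max_tokens=8):
    """Step 4: concatenate per-token features, truncating or zero-padding to p tokens."""
    vector = []
    for token in tokenize(reference)[:max_tokens]:
        vector.extend(token_features(token))
    vector.extend([0.0] * (max_tokens * FEATURES_PER_TOKEN - len(vector)))
    return vector


tokens = tokenize("Smith, J. 2005. Data mining")
vector = reference_vector("Smith, J. 2005. Data mining")
```

The resulting fixed-length vector is the kind of input a style classifier such as an SVM expects; padding with zeros is one simple way to handle references of different lengths.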

Canonical Representation Extraction

Sequence labeling problem

Conditional Random Field
  - discriminative probabilistic model
  - models the conditional probability and predicts argmax_Y P(Y | X; W)
  - instead of the joint argmax_Y P(Y, X)
      * Y is a sequence of labels drawn from the label set L = {l_1, l_2, ..., l_k},
      * X is an input reference string,
      * transformed into a set of tokens T = {t_1, t_2, ..., t_n}, and
      * W is a set of weights for feature functions W = {w_1, w_2, ..., w_m}.
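
For reference, the standard linear-chain form of a CRF is written out below. The slides do not state which CRF variant is used, so the linear-chain structure and the feature-function signature f_j(y_{i-1}, y_i, X, i) are assumptions; the symbols Y, X, W, n, and m follow the definitions above.

```latex
P(Y \mid X; W) = \frac{1}{Z(X)} \exp\!\left( \sum_{i=1}^{n} \sum_{j=1}^{m} w_j \, f_j(y_{i-1}, y_i, X, i) \right),
\qquad
Z(X) = \sum_{Y'} \exp\!\left( \sum_{i=1}^{n} \sum_{j=1}^{m} w_j \, f_j(y'_{i-1}, y'_i, X, i) \right)

Y^{*} = \arg\max_{Y} P(Y \mid X; W)
```

Training chooses the weights W that maximise the conditional likelihood of the labeled references; labeling a new reference then amounts to finding Y*, for which Z(X) can be ignored because it does not depend on Y.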

Sequence tagging: Algorithm

1 Input data: any styled reference string S.
  - Tokenized by a set of m delimiters D = {d_1, d_2, ..., d_m},
2 Segmented into a set of n tokens T = {t_1, t_2, ..., t_n}.
3 A set of q features F_i = {f_{i,1}, f_{i,2}, ..., f_{i,q}} is extracted for each token t_i.
4 A reference feature vector r = (f_{1,1}, f_{1,2}, ..., f_{p,q}), where f_{i,j} is the j-th feature of the i-th token,
  - Inputs of the CRF classifier stated above
5 This CRF is already trained on a training corpus of references (label 11),
  - World general knowledge from the Web (e.g., EndNote Web reference management tool of label 4).
  - Output is tagged references (labels 12-14).
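
The final tagging step can be sketched with Viterbi decoding, the usual way to find the best label sequence in a linear-chain model. The three labels, the hand-set start, transition, and emission scores, and the example reference are all invented for illustration; in a trained CRF these scores would come from the learned weights W.

```python
import re

LABELS = ["AUTHOR", "YEAR", "TITLE"]
START = {"AUTHOR": 1.0, "YEAR": 0.0, "TITLE": 0.0}   # hypothetical start scores


def transition(prev, cur):
    """Hand-set transition scores favouring the order AUTHOR -> YEAR -> TITLE."""
    if prev == cur or (prev, cur) in {("AUTHOR", "YEAR"), ("YEAR", "TITLE")}:
        return 1.0
    if (prev, cur) == ("AUTHOR", "TITLE"):
        return 0.0
    return -2.0


def emission(label, token):
    """Hand-set emission scores built from simple surface cues."""
    is_year = bool(re.fullmatch(r"(19|20)\d{2}", token.strip(".,()")))
    if label == "YEAR":
        return 3.0 if is_year else -3.0
    if is_year:
        return -3.0
    if label == "AUTHOR":
        looks_like_name = token.endswith(",") or re.fullmatch(r"[A-Z]\.", token)
        return 1.0 if looks_like_name else 0.0
    return 0.5 if token[:1].islower() else 0.0   # TITLE


def viterbi(tokens):
    """Return the highest-scoring label sequence for a non-empty token list."""
    best = [{label: START[label] + emission(label, tokens[0]) for label in LABELS}]
    back = []
    for token in tokens[1:]:
        scores, pointers = {}, {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transition(p, cur))
            scores[cur] = best[-1][prev] + transition(prev, cur) + emission(cur, token)
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    label = max(LABELS, key=lambda l: best[-1][l])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return path[::-1]


tags = viterbi(["Smith,", "J.", "2005.", "Data", "mining"])
```

Viterbi runs in time linear in the number of tokens and quadratic in the number of labels, so it remains cheap even for long references.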

Evaluation

1 Preliminary Experiment

2 Knowledge Bases

3 Features

4 Sequence Labeling Methods

Preliminary Experiments

Experiment Design

1 Objectives
  - Check how reference metadata extraction accuracy depends on the output style.
2 Datasets
  - 2,500 references from 10 different output styles
  - 1) AAG, 2) ACS, 3) API, 4) APA, 5) Chicago15A, 6) IEEE, 7) JAMA, 8) MLA, 9) NLM, and 10) Turabian
3 Method
  - CRF

Metrics

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}

\mathrm{Precision} = \frac{TP}{TP + FP}

\mathrm{Recall} = \frac{TP}{TP + FN}

F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

where TP, FP, TN, and FN are the numbers of true positive, false positive, true negative, and false negative tokens.
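
These token-level metrics can be computed per field as in the sketch below. The function name, the one-label-per-token input format, and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def field_metrics(gold, predicted, field):
    """Token-level accuracy, precision, recall, and F1 for one bibliographic field."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == field and p == field for g, p in pairs)
    fp = sum(g != field and p == field for g, p in pairs)
    fn = sum(g == field and p != field for g, p in pairs)
    tn = sum(g != field and p != field for g, p in pairs)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


gold = ["AUTHOR", "AUTHOR", "YEAR", "TITLE", "TITLE"]
predicted = ["AUTHOR", "TITLE", "YEAR", "TITLE", "TITLE"]
author_metrics = field_metrics(gold, predicted, "AUTHOR")
```

In this toy case one of the two AUTHOR tokens is mislabeled as TITLE, so precision for AUTHOR is perfect while recall is halved.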

Average of Each Field Extraction

Results

1 Overall
2 Discussion
  - Features

AUTHOR Field Extraction

Results

1 Author
2 Discussion
  - Features

JOURNAL Field Extraction

Results

1 Journal
2 Discussion
  - Features

Knowledge Bases, Features, and Classification Methods

Knowledge Bases

1 Experiment Design
  - Objectives
      * What WWW information and knowledge is available to help address this problem?
  - Datasets
      * VT-ETDs
      * arXiv
  - Knowledge bases
      * World general knowledge
      * Domain-specific knowledge
  - Methods
      * SVM+CRF (our two-stage method)
  - Metrics
      * Accuracy
      * Precision
      * Recall
      * F1

Features

1 Experiment Design
  - Objectives
      * What features are effective for solving the style-free canonical reference representation extraction problem?
  - Datasets
      * VT-ETDs
      * arXiv
  - Features
      * Local features
      * Lexical features
      * Contextual features
      * Layout features
  - Methods
      * SVM+CRF (our two-stage method)
  - Metrics
      * Accuracy
      * Precision
      * Recall
      * F1


Classification Methods

1 Experiment Design
  - Objectives
      * What methods give the best performance and greatest effectiveness in improving canonical reference extraction?
  - Datasets
      * VT-ETDs
      * arXiv
  - Methods
      * SVMstruct
      * HMM (Hidden Markov Model)
      * MEMM (Maximum Entropy Markov Model)
      * SVM+CRF (our two-stage method)
  - Metrics
      * Accuracy
      * Precision
      * Recall
      * F1

Research Timeline

Contributions & Deliverables

Contributions

1 Scalable, discipline-independent citation metadata extraction method

2 Feature list supporting the extraction

3 Knowledge bases for discipline-independent citation extraction

4 Weakly supervised learning techniques generalizing across disciplines

Deliverables

1 A two-stage machine learning classifier and labeler

2 A feature list and extraction software tool

3 Knowledge bases and acquisition scripts

4 Training dataset builder scripts

Publications

Related Publications

Book chapters

1 Nadia P. Kozievitch, Ricardo da Silva Torres, Edward A. Fox, Sung Hee Park, Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao: Rethinking fingerprint evidence through integration of very large digital libraries. In: Castelli, D., Ioannidis, Y., Manghi, P., Pagano, P., Ross, S. (eds.). Proceedings of the Second DL.org Workshop on Making DLs Interoperable: Challenges & Approaches (MDLI2010) and Proceedings of the Third Workshop on Very Large Digital Libraries (VLDL2010), in conjunction with the European Conference on Digital Libraries 2010, Glasgow, Scotland (UK), 10th of September 2010, Springer LNCS, to appear in 2011

Peer-reviewed papers

1 Nadia P. Kozievitch, Ricardo da Silva Torres, Sung Hee Park, Edward A. Fox, Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao. Rethinking Fingerprint Evidence Through Integration of Very Large Digital Libraries. VLDL Workshop at 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL2010), Glasgow, Sept. 6-10, 8 pages

2 Sung Hee Park, Nicholas Lynberg, Jesse Racer, Phil McElmurray, Edward A. Fox. HTML5 ETDs. Refereed paper for ETD 2010 - 13th International Symposium on Electronic Theses and Dissertations. Austin, TX. June 16-18, 2010

3 Sung Hee Park, Jonathan P. Leidig, Lin Tzy Li, Edward A. Fox, Nathan J. Short, Kevin E. Hoyle, A. Lynn Abbott, and Michael S. Hsiao, Experiment and Analysis Services in a Fingerprint Digital Library for Collaborative Research, 1st Theory and Practice in Digital Libraries (TPDL 2011), Berlin, Sept. 26-28, 2011, submitted

Posters

1 Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao, Nadia Kozievitch, Sung Hee Park, Edward Fox. Latent Fingerprint Matching. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010

2 Supratik Misra, Nathan Short, Michael Hsiao, Lynn Abbott, Edward Fox, Sung Hee Park, Nadia Kozievitch. Fingerprint Sufficiency. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010

3 Sung Hee Park, Nadia Kozievitch, Edward A. Fox, Michael Hsiao, Lynn Abbott, Nathan Short, Supratik Misra. Model-based fingerprint image quality Analysis. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010

4 Nadia Kozievitch, Sung Hee Park, Supratik Misra, Nathan Short, Michael Hsiao, Lynn Abbott, Edward A. Fox. Database for Fingerprint Experiments. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010

5 Ryan Richardson, Venkat Srinivasan, Xiaoyu Zhang, Weihua Zhu, Sung Hee Park, Pramodh Pochu, Siva Sanagavarapu, Mustafa Rafique, Min He, Jiao Jiao, Edward Fox. Making ETDs More Usable for Students in a Multilingual World, 2009

Q & A
