Discipline-Independent Canonical Representation Extraction
Discipline-Independent and Style-Free Canonical Representation Extraction for Heterogeneously Styled
References Using Knowledge from the Web
Sung Hee Park
Department of Computer Science, Virginia Tech
October 14, 2011
Introduction
Problem Contexts, Requirements, and Problems
Problem Contexts
Scholarly Digital Libraries
  - citation analysis (CiteSeerX and the ACM Digital Library)
Text Processing, e.g., Supporting Machine Reading
  - unsupervised learning
  - domain-independent methods
Problem Contexts - Inverse Problem
Problem Contexts - Database Version
Problem Contexts - Rendered Version
Requirements
Scalability across Disciplines
  - VT-ETD db:
    - Over 18,000 ETDs
    - 8 colleges
    - 79 departments
  - arXiv:
    - 662,023 e-prints
    - 7 large disciplines
    - 148 sub-categories
Exposing References for Wider Utilization
Citation Reference Analysis for Impact Evaluation
Citation Metadata Extraction as Inputs to Zotero + COinS
  - Zotero: Firefox tool
  - COinS (ContextObjects in Spans): convention for embedding bibliographic metadata in HTML
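To make the COinS target concrete, here is a minimal sketch of how extracted citation fields could be serialized as a COinS span for a journal article. The function name `coins_span` and the input field names (`title`, `journal`, `author`, `year`) are hypothetical and not part of the slides; the span class `Z3988` and the `rft.*` keys follow the COinS convention as commonly documented.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical mapping from extracted field names to OpenURL KEV keys.
KEY_MAP = {
    "title": "rft.atitle",
    "journal": "rft.jtitle",
    "author": "rft.au",
    "year": "rft.date",
}

def coins_span(fields):
    """Build a COinS <span> for a journal article from extracted fields."""
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    for name, key in KEY_MAP.items():
        if name in fields:
            pairs.append((key, fields[name]))
    # URL-encode the key/value pairs, then HTML-escape the separators.
    title_attr = escape(urlencode(pairs), quote=True)
    return '<span class="Z3988" title="%s"></span>' % title_attr

span = coins_span({"author": "Sung Hee Park", "year": "2011"})
```

A tool such as Zotero can then pick up the metadata from the `title` attribute of the span when it scans the page.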
Problems
Surface & Semantics Mapping
  - semantic labeling
Disciplines & Styles
  - scalability across disciplines and styles
Domain Specific & Implicit General Knowledge
  - acquisition
Challenges and Opportunities
Challenges
Wide Variety of Citation Styles
Different Document Types
Discipline Dependent Properties
Lexical Ambiguities (e.g., acronyms, homonyms)
Errors in Typing and Other Inaccuracies
Opportunities from the Web
Abundant Bibliographic Data
Easy Access to Data and Information
Availability of Training Data
Research Questions, Approach
Research Questions
1 What open Internet information and knowledge is available to help address this problem?
2 What features should be selected for training?
3 What methods give the best performance and greatest effectiveness?
Approach
We solve the Representation Extraction Problem
Effective and efficient information extraction from references found in publications
To generate canonical representations from heterogeneously styled references
Using:
An integration of machine learning and knowledge-based approaches
  - entity (e.g., name, city) lists identified on the WWW
  - and a variety of training data
Two-stage classifier collections of style descriptions
Without knowledge of a reference’s style or domain
Related Work
Metadata Extraction Methods
Metadata Extraction Methods
Table: Comparison of Previous Approaches
Approach | Author & Year | Description | Sup/Unsup
Rule-based | Day et al. (2006) | INFOMAP | S
Rule-based | Cortez et al. (2007) | FLUX-CiM | U
Rule-based | Afzal et al. (2010) | TIERL | U
Machine learning | Councill et al. (2008) | CRF | S
Machine learning | Hong et al. (2009) | CRF | S
Machine learning | Hetzner (2008) | HMM | S
Table: Pros vs. Cons
Approach | Pros | Cons
Rule/Knowledge-based | Unsupervised methods exist | Not easy to extract rules; discipline-dependent
Machine learning | Scalable if training data are ready | Difficult to get training data
Features
Table: Features for Canonical Representation Extraction
Features | Description
Local features | Non-lexical information about the token
Lexical features | Information about the meaning of the words within the token
Contextual features | Lexical or local features of a token’s neighbours
Layout features | Relative position of a word in the entire reference string
Knowledge Bases
Knowledge Bases
Acquisition Methods
1 Manual
2 Semi-Automatic
3 Automatic
Knowledge Scope
1 Common Sense
2 Domain-Specific
Methodology
Our Proposed Hybrid Method
1 Knowledge Bases
2 Feature Extraction
3 Learning & Classification
Building Knowledge Bases from Mining the Web
Building Knowledge Bases from Mining the Web
1 Knowledge Bases
2 Bibliographic Databases on the Web
3 General World Knowledge
4 Domain-Specific Knowledge
Knowledge Bases
A knowledge base is defined as a set of pairs K = {(o_1, i_1), (o_2, i_2), ..., (o_n, i_n)}
  - o_n is a bibliographic field like
    - author
    - title
    - journal
  - i_n is its corresponding instance
Ex:
  - ('AUTHOR', 'Sung Hee Park'),
  - ('JOURNAL', 'ACM Transactions on Information Systems'),
  - ('YEAR', '2011').
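The definition above maps directly onto a set of tuples. The following is a minimal sketch, assuming a case-insensitive exact-match lookup; the helper names `build_index` and `fields_for` are hypothetical, and the pairs are modeled on the slide's examples.

```python
from collections import defaultdict

# K as a set of (field, instance) pairs.
K = {
    ("AUTHOR", "Sung Hee Park"),
    ("JOURNAL", "ACM Transactions on Information Systems"),
    ("YEAR", "2011"),
}

def build_index(kb):
    """Index instances (lowercased) to the set of fields they appear under."""
    index = defaultdict(set)
    for field, instance in kb:
        index[instance.lower()].add(field)
    return index

def fields_for(index, text):
    """Return the sorted list of fields whose instances match text exactly."""
    return sorted(index.get(text.lower(), set()))

index = build_index(K)
print(fields_for(index, "sung hee park"))
```

An instance may legitimately map to more than one field (a surname can also be a city name), which is why the index stores a set of fields per instance.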
Bibliographic Databases on the Web
CiteULike
Google Scholar
DBLP (Digital Bibliography & Library Project)
General World Knowledge Sources
Knowledge Type | Instances | Sources
Person names | Sung Hee Park, Edward A. Fox | DBLP, Wiki
Cities | Blacksburg, Washington DC | World Factbook, IEEE conf. info. service
Publishers | Springer, MIT Press | CiteULike, DBLP
Years, Dates | 2011, Jan. | IEEE Conference Information Schedule
DOI type identifiers | doi://100.100.1.1. | Crossref (http://www.crossref.org)
URL/URL Identifiers | DBLP:conf/iccsa/2005-2 | DBLP, IEEE Conference Information Schedule
Reference Output Styles | APA, IEEE, AAAI | EndNote
Building Knowledge Bases for Output Styles
Algorithm
1 Extract all style names from the bibliographic reference generation interface.
2 Import a reference set into EndNote Web.
3 Generate all output styled references (see Appendix).
4 Convert HTML files to text files.
5 Convert raw files to training sets.
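Step 4 of the algorithm above can be sketched with the standard library alone. This is only an illustration: the slides do not show the HTML produced by the reference manager, so the assumption here is that each styled reference sits in its own block element (`p`, `div`, or `li`). The names `_TextCollector` and `html_to_references` are hypothetical.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect one text line per block element, dropping inline markup."""

    BLOCK_TAGS = ("p", "div", "li")

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS or tag == "br":
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def _flush(self):
        # Join buffered text and normalize runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(text)
        self._buf = []

    def close(self):
        super().close()
        self._flush()

def html_to_references(html):
    """Return the plain-text reference strings found in an HTML fragment."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Inline tags such as `<i>` are dropped while their text is kept, so an italicized journal title stays inside its reference string.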
Knowledge Bases for Output Styles
Feature Extraction
Feature Extraction
1 Tokenization
2 Feature Types
  - Local Features
  - Lexical Features
  - Contextual Features
  - Layout Features
Local Features
Categories | Names | Descriptions | Examples
Letter Patterns | INITCAP | Starts with a capitalized letter | Computer Science
Letter Patterns | ALLCAP | All letters are capitalized | COMPUTER
Letter Patterns | ACRO | Acronyms | WWW
Letter Patterns | LONELYINITIAL | One single capitalized letter | S.
Special Character Patterns | CONTAINSDOTS | Contains at least one dot | S., C4.5
Special Character Patterns | CONTAINSDASH | Contains at least one dash | 123-124
Special Character Patterns | PUNC | Punctuation | dot ("."), comma (",")
Special Character Patterns | Ended with dot (.) | Regular expression for ending with a dot | A.
Special Patterns | EMAIL | Regular expression for e-addresses | [email protected]
Special Patterns | WORD | Word | references
Special Patterns | Pagination pattern | Regular expression for pagination formats | 200-5, H100-H105
Numeric Patterns | Four-digit year pattern | Regular expression for four-digit year patterns | 2005
Numeric Patterns | Six-digit pattern | Regular expression for six-digit patterns | 2005
Numeric Patterns | CONTAINSDIGITS | Contains at least one digit | 1, F1, A1*
Length Patterns | fieldLength | # of characters the token has | fieldLength(style)=5
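Several of the local features in the table can be expressed as regular expressions over a single token. The sketch below is illustrative only: the slides name the features but do not give the expressions, so every pattern here (and the feature key `ENDSWITHDOT`) is an assumption.

```python
import re

# Assumed regular expressions for a subset of the local features.
LOCAL_PATTERNS = {
    "INITCAP": re.compile(r"^[A-Z][a-z]"),
    "ALLCAP": re.compile(r"^[A-Z]{2,}$"),
    "LONELYINITIAL": re.compile(r"^[A-Z]\.?$"),
    "CONTAINSDOTS": re.compile(r"\."),
    "CONTAINSDASH": re.compile(r"-"),
    "CONTAINSDIGITS": re.compile(r"\d"),
    "ENDSWITHDOT": re.compile(r"\.$"),
    "FOURDIGITYEAR": re.compile(r"^(1[5-9]|20)\d{2}$"),
    "PAGINATION": re.compile(r"^[A-Za-z]?\d+-[A-Za-z]?\d+$"),
}

def local_features(token):
    """Return a dict of boolean pattern features plus the token length."""
    feats = {name: bool(p.search(token)) for name, p in LOCAL_PATTERNS.items()}
    feats["fieldLength"] = len(token)
    return feats

print(local_features("H100-H105"))
```

Features are not mutually exclusive: a token such as `S.` fires LONELYINITIAL, CONTAINSDOTS, and the ends-with-dot pattern at once, and the learner is left to weigh them.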
Lexical Features
Names | Descriptions | Examples
FAMILYNAME | Match word in family name lexicon | Smith, Johns
AFFILIATION | Word like University, Institute | University, Institution, Labs
ADDRESS | Match word in address lexicon | Blacksburg, Virginia
AUTHOR | Match word in author lexicon | Blacksburg, Virginia
ARTICLE TITLE | Match word in article title lexicon | Blacksburg, Virginia
JOURNAL TITLE | Match word in journal title lexicon | Blacksburg, Virginia
TRUNCATION | The word is et, al, et., or al. | et al, et. al.
PAGE | The word is pp., p., pp, or p | pp., p.
DATE | Match month words such as Jan., Feb. | Jan., Feb.
NOTES | Words like appeared, submitted | submitted, in print
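Lexical features of this kind reduce to membership tests against lexicons. The sketch below assumes tiny in-memory lexicons seeded only with words that appear in the table; in the proposed method the lexicons would come from the Web-mined knowledge bases. The function name `lexical_features` is hypothetical.

```python
# Toy lexicons seeded from the table's own descriptions and examples.
LEXICONS = {
    "FAMILYNAME": {"smith", "johns"},
    "AFFILIATION": {"university", "institute", "institution", "labs"},
    "ADDRESS": {"blacksburg", "virginia"},
    "TRUNCATION": {"et", "al", "et.", "al."},
    "PAGE": {"pp.", "p.", "pp", "p"},
    "DATE": {"jan.", "feb."},
    "NOTES": {"appeared", "submitted"},
}

def lexical_features(token):
    """Return the sorted names of all lexicons that contain the token."""
    # Strip surrounding punctuation but keep dots, which some entries need.
    key = token.strip(",;:()").lower()
    return sorted(name for name, words in LEXICONS.items() if key in words)

print(lexical_features("Smith,"))
```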
Style-Free Canonical Representation Extraction through Two-Stage Method
Two-Stage Method
1 Output Style Classification
2 Canonical Representation Extraction
Output Style Classification
1 Multi-class classification problem
2 Multiple binary class classification problem
3 SVM
  - maximum margin classifier
  - kernel function method
Output Style Classification: Algorithm
1 Input data: any styled reference string S.
  - Tokenized by a set of m delimiters D = {d_1, d_2, ..., d_m},
2 Segmented into a set of n tokens T = {t_1, t_2, ..., t_n}.
3 A set of features F_i = {f_{i,1}, f_{i,2}, ..., f_{i,j}} is extracted for each token t_i.
4 A reference feature vector r = (f_{1,1}, f_{1,2}, ..., f_{p,q}), where f_{i,j} is the jth feature of the ith token,
  - is the input of the SVM classifier stated above.
5 This SVM is already trained on a training corpus of references (label 3),
  - with world general knowledge from the Web (e.g., the EndNote Web reference management tool, label 4).
  - Output is one of the output styles (labels 5-7).
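Steps 1 to 4 above can be sketched as follows. The delimiter set, the three per-token features, and the fixed vector length are assumptions chosen for illustration; the SVM itself is omitted, and the resulting vector is what a style classifier would take as input.

```python
import re

# Hypothetical delimiter set D.
DELIMITERS = [" ", ",", ";", ":", "(", ")"]

def tokenize(reference, delimiters=DELIMITERS):
    """Split a reference string on any run of delimiter characters."""
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    return [t for t in re.split(pattern, reference) if t]

def token_features(token):
    """Three toy binary features per token (an assumed feature set)."""
    return [
        int(token[0].isupper()),               # starts with a capital
        int(any(c.isdigit() for c in token)),  # contains a digit
        int(token.endswith(".")),              # ends with a dot
    ]

def reference_vector(reference, max_tokens=8):
    """Concatenate per-token features into one fixed-length vector."""
    vec = []
    for tok in tokenize(reference)[:max_tokens]:
        vec.extend(token_features(tok))
    # Pad so that every reference yields a vector of the same length.
    vec.extend([0] * (3 * max_tokens - len(vec)))
    return vec

print(reference_vector("Park, S. (2011)", max_tokens=4))
```

Padding or truncating to a fixed number of tokens is one simple way to give a vector-based classifier inputs of equal length; the slides do not say how the original system handles references of different lengths.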
Canonical Representation Extraction
Sequence labeling problem
Conditional Random Field
  - discriminative probabilistic model
  - models the conditional P(Y | X; W) and predicts argmax_Y P(Y | X; W)
  - instead of modeling the joint P(Y, X)
    - Y is a sequence of labels drawn from a set of labels L = {l_1, l_2, ..., l_k}
    - X is an input reference string,
    - transformed into a set of tokens T = {t_1, t_2, ..., t_n}, and
    - W is a set of weights for feature functions W = {w_1, w_2, ..., w_m}.
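For reference, the standard linear-chain form of this conditional model is written out below. The slides do not state it explicitly, so the feature-function notation f_k and the normalizer Z(X) are assumptions taken from the usual CRF formulation, using the symbols Y, X, W, n, and m defined above.

```latex
P(Y \mid X; W) = \frac{1}{Z(X)}
  \exp\!\left( \sum_{t=1}^{n} \sum_{k=1}^{m} w_k \, f_k(y_{t-1}, y_t, X, t) \right),
\qquad
Z(X) = \sum_{Y'} \exp\!\left( \sum_{t=1}^{n} \sum_{k=1}^{m} w_k \, f_k(y'_{t-1}, y'_t, X, t) \right)
```

Each f_k scores a pair of adjacent labels against the whole input, which is how local, lexical, contextual, and layout features can all enter the same model.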
Sequence tagging: Algorithm
1 Input data: any styled reference string S.
  - Tokenized by a set of m delimiters D = {d_1, d_2, ..., d_m},
2 Segmented into a set of n tokens T = {t_1, t_2, ..., t_n}.
3 A set of features F_i = {f_{i,1}, f_{i,2}, ..., f_{i,j}} is extracted for each token t_i.
4 A reference feature vector r = (f_{1,1}, f_{1,2}, ..., f_{p,q}), where f_{i,j} is the jth feature of the ith token,
  - is the input of the CRF classifier stated above.
5 This CRF is already trained on a training corpus of references (label 11),
  - with world general knowledge from the Web (e.g., the EndNote Web reference management tool, label 4).
  - Output is tagged references (labels 12-14).
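Producing the tagged output from a trained sequence model is usually done with Viterbi decoding. The sketch below is a generic Viterbi decoder over additive scores, not the author's implementation; the label set and the `toy_emit` and `toy_trans` scoring functions are hand-written stand-ins for what a trained CRF would supply.

```python
def viterbi(tokens, labels, emit, trans):
    """Return the highest-scoring label sequence under additive scores."""
    if not tokens:
        return []
    best = [{lab: emit(tokens[0], lab) for lab in labels}]
    back = []
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for lab in labels:
            # Best previous label for reaching `lab` at this position.
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, lab))
            scores[lab] = best[-1][prev] + trans(prev, lab) + emit(tok, lab)
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

LABELS = ["AUTHOR", "YEAR", "TITLE"]
ORDER = {"AUTHOR": 0, "YEAR": 1, "TITLE": 2}

def toy_emit(token, label):
    """Hand-written token scores (stand-in for learned weights)."""
    if label == "YEAR":
        return 2.0 if token.isdigit() and len(token) == 4 else -2.0
    if label == "AUTHOR":
        return 1.0 if token[0].isupper() and token.endswith((".", ",")) else 0.0
    return 0.5

def toy_trans(prev, label):
    """Penalize moving backwards in the assumed field order."""
    return 0.0 if ORDER[label] >= ORDER[prev] else -3.0

tags = viterbi(["Park,", "S.", "2011", "Canonical", "extraction"],
               LABELS, toy_emit, toy_trans)
print(tags)
```

The transition score is where style knowledge would matter most: a style that puts the year last would need a different field order than one that puts it right after the authors.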
Evaluation
Evaluation
1 Preliminary Experiment
2 Knowledge Bases
3 Features
4 Sequence Labeling Methods
Preliminary Experiments
Experiment Design
1 Objectives
  - Check whether reference metadata extraction accuracy depends on the output style.
2 Datasets
  - 2,500 references from 10 different output styles
  - 1) AAG, 2) ACS, 3) API, 4) APA, 5) Chicago15A, 6) IEEE, 7) JAMA, 8) MLA, 9) NLM, and 10) Turabian
3 Method
  - CRF
Metrics
All counts are over tokens: TP (true positive), FP (false positive), TN (true negative), FN (false negative).
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
\text{Precision} = \frac{TP}{TP + FP}
\text{Recall} = \frac{TP}{TP + FN}
F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
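The four metrics can be computed per field from aligned gold and predicted token labels. This is a minimal sketch; the function name `field_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def field_metrics(gold, predicted, field):
    """Token-level accuracy, precision, recall, and F1 for one field."""
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if p == field and g == field:
            tp += 1
        elif p == field:
            fp += 1
        elif g == field:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

gold = ["AUTHOR", "AUTHOR", "YEAR", "TITLE"]
pred = ["AUTHOR", "TITLE", "YEAR", "TITLE"]
print(field_metrics(gold, pred, "AUTHOR"))
```

In this example one of the two AUTHOR tokens is mislabeled as TITLE, so precision for AUTHOR stays at 1.0 while recall drops to 0.5.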
Average of Each Field Extraction
Results
1 Overall
2 Discussion
  - Features
AUTHOR Field Extraction
Results
1 Author
2 Discussion
  - Features
JOURNAL Field Extraction
Results
1 Journal
2 Discussion
  - Features
Knowledge Bases, Features, and Classification Methods
Knowledge Bases
1 Experiment Design
  - Objectives
    - What WWW information and knowledge is available to help address this problem?
  - Datasets
    - VT-ETDs
    - arXiv
  - Knowledge bases
    - World general knowledge
    - Domain-specific knowledge
  - Methods
    - SVM+CRF (our two-stage method)
  - Metrics
    - Accuracy
    - Precision
    - Recall
    - F1
Features
1 Experiment Design
  - Objectives
    - What features are effective to solve the style-free canonical reference representation extraction problem?
  - Datasets
    - VT-ETDs
    - arXiv
  - Features
    - Local features
    - Lexical features
    - Contextual features
    - Layout features
  - Methods
    - SVM+CRF (our two-stage method)
  - Metrics
    - Accuracy
    - Precision
    - Recall
    - F1
Classification Methods
1 Experiment Design
  - Objectives
    - What methods give the best performance and greatest effectiveness in improving canonical reference extraction?
  - Datasets
    - VT-ETDs
    - arXiv
  - Methods
    - SVMstruct
    - HMM (Hidden Markov Model)
    - MEMM (Maximum Entropy Markov Model)
    - SVM+CRF (our two-stage method)
  - Metrics
    - Accuracy
    - Precision
    - Recall
    - F1
Research Timeline
Research Timeline
Contributions & Deliverables
Contributions
1 Scalable, discipline-independent citation metadata extraction method
2 Feature list supporting the extraction
3 Knowledge bases for discipline-independent citation extraction
4 Weakly supervised learning technologies generalizing across disciplines
Deliverables
1 A two-stage machine learning classifier and labeler
2 A feature list and extraction software tool
3 Knowledge bases and acquisition scripts
4 Training dataset builder scripts
Publications
Related Publications
Book chapters
1 Nadia P. Kozievitch, Ricardo da Silva Torres, Edward A. Fox, Sung Hee Park, Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao: Rethinking fingerprint evidence through integration of very large digital libraries. In: Castelli, D., Ioannidis, Y., Manghi, P., Pagano, P., Ross, S. (eds.). Proceedings of the Second DL.org Workshop on Making DLs Interoperable: Challenges & Approaches (MDLI2010) and Proceedings of the Third Workshop on Very Large Digital Libraries (VLDL2010), in conjunction with the European Conference on Digital Libraries 2010, Glasgow, Scotland (UK), 10th of September 2010, Springer LNCS, to appear in 2011
Peer-reviewed papers
1 Nadia P. Kozievitch, Ricardo da Silva Torres, Sung Hee Park, Edward A. Fox, Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao. Rethinking Fingerprint Evidence Through Integration of Very Large Digital Libraries. VLDL Workshop at 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL2010), Glasgow, Sept. 6-10, 8 pages
2 Sung Hee Park, Nicholas Lynberg, Jesse Racer, Phil McElmurray, Edward A. Fox. HTML5 ETDs. Refereed paper for ETD 2010 - 13th International Symposium on Electronic Theses and Dissertations. Austin, TX. June 16-18, 2010
3 Sung Hee Park, Jonathan P. Leidig, Lin Tzy Li, Edward A. Fox, Nathan J. Short, Kevin E. Hoyle, A. Lynn Abbott, and Michael S. Hsiao, Experiment and Analysis Services in a Fingerprint Digital Library for Collaborative Research, 1st Theory and Practice in Digital Libraries (TPDL 2011), Berlin, Sept. 26-28, 2011, submitted
Posters
1 Nathan Short, Lynn Abbott, Supratik Misra, Michael Hsiao, Nadia Kozievitch, Sung Hee Park, Edward Fox. Latent Fingerprint Matching. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010
2 Supratik Misra, Nathan Short, Michael Hsiao, Lynn Abbott, Edward Fox, Sung Hee Park, Nadia Kozievitch. Fingerprint Sufficiency. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010
3 Sung Hee Park, Nadia Kozievitch, Edward A. Fox, Michael Hsiao, Lynn Abbott, Nathan Short, Supratik Misra. Model-based fingerprint image quality Analysis. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010
4 Nadia Kozievitch, Sung Hee Park, Supratik Misra, Nathan Short, Michael Hsiao, Lynn Abbott, Edward A. Fox. Database for Fingerprint Experiments. Poster for CESCA (Center for Embedded Systems for Critical Applications) Day, Virginia Tech, Blacksburg, VA, May 6, 2010
5 Ryan Richardson, Venkat Srinivasan, Xiaoyu Zhang, Weihua Zhu, Sung Hee Park, Pramodh Pochu, Siva Sanagavarapu, Mustafa Rafique, Min He, Jiao Jiao, Edward Fox. Making ETDs More Usable for Students in a Multilingual World, 2009