Impact of different relation extraction methods on network analysis results

Jana Diesner

Motivation

Text Data, Network Data, Applications

• Need: scalable, reliable, robust methods & tools

• Unstructured

• At any scale

• Network Analysis

• Answer substantive and graph-theoretic questions

• Develop and test hypotheses and theories

• Visualizations

• Populate databases

• Input to further computations, e.g. simulations, machine learning

Research Questions and Relevance

• How do network data and analysis results obtained by using different relation extraction methods compare to each other?

• Why does it matter?

– Increased comparability, generalizability, transparency of methods and tools

– Increased control and power for developers and users

– Supports drawing of reasonable and valid conclusions

Relation Extraction Methods

• Text, manual (TextM)

• Text, automated (TextA)

• Meta-data (META)

• Subject Matter Experts (SME)

[Diagram labels: "Codebook", "Proximity-based linkage of nodes" (shown three times), "Database query", "Meta-Data", "Data"]
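Proximity-based linkage of nodes can be illustrated with a minimal sketch. It assumes that entities have already been identified as single tokens and that two entities are linked whenever they occur within a fixed token window. The function name `proximity_links`, the window size, and the toy sentence are illustrative assumptions, not the tooling used in the talk.

```python
from collections import Counter
from itertools import combinations


def proximity_links(tokens, entities, window=5):
    """Count links between distinct entities that occur fewer than
    `window` token positions apart (hypothetical helper)."""
    # Positions of all tokens that are known entities, in text order.
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    links = Counter()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i < window:
            # Store undirected links under a sorted key.
            links[tuple(sorted((a, b)))] += 1
    return links


# Toy input, made up for illustration only.
tokens = "alice emailed bob about the merger before carol briefed bob".split()
links = proximity_links(tokens, {"alice", "bob", "carol"}, window=5)
for pair, weight in sorted(links.items()):
    print(pair, weight)
```

A larger window yields denser networks, so the window size is one of the settings that can change downstream analysis results.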

                      Sudan Corpus      Funding Corpus                  Enron Corpus
Genre                 Newswire          Scientific Writing              Emails
Size                  80,000 articles   56,000 proposals                53,000 emails
Source                LexisNexis        Cordis                          FERC/SEC
Time span             8 years           22 years                        4 years
Text-based networks   Article bodies    Project description             Email bodies
Meta-data network     Index terms       Index terms and collaborators   Email headers

• Large-scale, over-time, open source data from different domains

Results I

1. Text automated vs. manual: total number of nodes of sub-type “generic” far higher than “specific”

– Rethink focus of network analysis: collectives vs. individuals

– Importance of detecting unnamed entities

2. Ground truth data (SME) are barely reproduced by networks extracted from text bodies, and not at all by meta-data networks

– In the best case, the overlap is 50% of nodes and 20% of links

3. Agreement in structure and key entities depends on type of network
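Node and link overlap against a reference network, as reported in point 2, can be computed along the following lines. This is a sketch under stated assumptions: links are treated as undirected, overlap is the share of reference items that also appear in the candidate network, and the function names and toy edge lists are hypothetical.

```python
def undirected(edges):
    """Normalize edge pairs so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


def share_recovered(reference, candidate):
    """Fraction of reference items that also occur in the candidate."""
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


def compare_networks(ref_edges, cand_edges):
    """Return (node overlap, link overlap) relative to the reference."""
    ref_e, cand_e = undirected(ref_edges), undirected(cand_edges)
    # Node sets are derived from the endpoints of the edges.
    ref_n = set().union(*ref_e)
    cand_n = set().union(*cand_e)
    return share_recovered(ref_n, cand_n), share_recovered(ref_e, cand_e)


# Toy data, made up for illustration only.
reference_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
extracted_edges = [("b", "a"), ("c", "b"), ("a", "e")]
node_overlap, link_overlap = compare_networks(reference_edges, extracted_edges)
print(node_overlap, link_overlap)
```

Link overlap cannot exceed what the shared nodes allow, which is one reason link agreement tends to be lower than node agreement.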

Results II

3. Agreement among text-based networks, and between text-based and meta-data networks, depends on the type of network

Social networks

Text-based networks:

- Substantial overlap between manual and automated, esp. w.r.t. key players

- Localized view on geo-political entities and culture

Meta-data network:

- Major international key players

- Small overlap in key entities with text-based networks

Knowledge networks

Text-based networks:

- Gist of information in terms of common sense entities

- Minimal overlap between manual and automated

Meta-data network:

- Seem more informative (mini-summaries)

- Fewer coreference resolution issues

- Minimal overlap with text-based

For a more complete view, combine the automated text-based network with the meta-data network
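One way to quantify agreement in key entities between two networks is to compare their top-ranked nodes. The sketch below assumes degree as the ranking measure and Jaccard similarity as the agreement score; the talk does not state which measures were used, and all names and edge lists here are hypothetical.

```python
from collections import Counter


def top_k_by_degree(edges, k):
    """Return the k highest-degree nodes; ties are broken alphabetically."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    return ranked[:k]


def key_entity_agreement(edges_a, edges_b, k):
    """Jaccard similarity of the two top-k node sets."""
    top_a = set(top_k_by_degree(edges_a, k))
    top_b = set(top_k_by_degree(edges_b, k))
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


# Toy networks, made up for illustration only.
text_edges = [("x", "y"), ("x", "z"), ("x", "w"), ("y", "z")]
meta_edges = [("x", "q"), ("x", "r"), ("x", "s"), ("q", "r")]
top_text = top_k_by_degree(text_edges, 2)
agreement = key_entity_agreement(text_edges, meta_edges, 2)
print(top_text, agreement)
```

Other centrality measures could be substituted for degree without changing the comparison step.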

Acknowledgements

• This work was supported by the National Science Foundation (NSF) IGERT 9972762, the Army Research Institute (ARI) W91WAW07C0063, the Army Research Laboratory (ARL/CTA) DAAD19-01-2-0009, the Air Force Office of Scientific Research (AFOSR) MURI FA9550-05-1-0388, the Office of Naval Research (ONR) MURI N00014-08-11186, and a Siebel Scholarship. Additional support was provided by the CASOS Center at Carnegie Mellon University. The views and conclusions contained in this talk are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the NSF, ARI, ARL, AFOSR, ONR, or the United States Government.


Thank You! Questions, Comments, Feedback: [email protected]
