6/29/05 1 New Frontiers in Corpus Annotation Workshop, 6/29/05 Ann Bies – Linguistic Data Consortium* Seth Kulick – Institute for Research in Cognitive Science* Mark Mandel – Linguistic Data Consortium* Parallel Entity and Treebank Annotation


New Frontiers in Corpus Annotation Workshop, 6/29/05

Parallel Entity and Treebank Annotation

Ann Bies – Linguistic Data Consortium*
Seth Kulick – Institute for Research in Cognitive Science*
Mark Mandel – Linguistic Data Consortium*
*University of Pennsylvania


Mining the Bibliome: Information Extraction from the Biomedical Literature

• NSF ITR grant EIA-0205448

• Collaboration with Division of Oncology, Children’s Hospital of Philadelphia

• PubMed abstracts – mining cancer literature for associations that link variations in genes with malignancies

• http://bioie.ldc.upenn.edu – release 0.9 available: 1157 abstracts entity annotated, 318 also treebanked


Outline

• Entity Annotation

• Treebank Annotation
  • Modifications from Penn Treebank guidelines

• Annotation Process and Merged Format

• Entity-Constituent Mapping – How successful?


Entity Annotation

• Gene X with genomic Variation event Y is correlated with Malignancy Z

• Gene – composite entity, can refer to gene or protein: Gene-generic, Gene-protein, Gene-RNA

• (Malignancy – under development, not included in release 0.9)

• Variation Event – Relation between entities representing different aspects of a variation


Entity Annotation - Variations

• Variation – A relation between variation component entities

• “a single nucleotide substitution at codon 249, predicting a serine to cysteine amino acid substitution”
  • Var-type – substitution
  • Var-location – codon 249
  • Var-state-orig – serine
  • Var-state-altered – cysteine
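The variation relation in the example above can be pictured as a small record whose fields are the component entities. This is only a minimal Python sketch: the class name `Variation` and its field names are illustrative assumptions, not part of the annotation release.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variation:
    """Hypothetical container for one variation relation."""
    var_type: Optional[str] = None           # e.g. "substitution"
    var_location: Optional[str] = None       # e.g. "codon 249"
    var_state_orig: Optional[str] = None     # e.g. "serine"
    var_state_altered: Optional[str] = None  # e.g. "cysteine"

    def components(self):
        """Return only the components that are actually annotated."""
        return {k: v for k, v in vars(self).items() if v is not None}


example = Variation("substitution", "codon 249", "serine", "cysteine")
print(example.components())
```

Leaving every field optional reflects that a mention need not express all four components.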


A Change in Tokenization

• Tokenization – Many hyphenated words treated as separate tokens

• “New York-based”

• Old (Penn Treebank) tokenization: [New] [York-based]

• New tokenization: [New][York][-][based]
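The new tokenization can be approximated with a single regular expression. The sketch below assumes that every hyphen becomes its own token; the slide says only that "many" hyphenated words are split, so the actual guidelines presumably have exceptions that this ignores. The function name `tokenize` is illustrative.

```python
import re

# Assumption: every hyphen is split off as its own token.
# A token is either a run of non-space, non-hyphen characters or one hyphen.
TOKEN = re.compile(r"[^\s-]+|-")


def tokenize(text):
    """Split on whitespace and emit each hyphen as a separate token."""
    return TOKEN.findall(text)


print(tokenize("New York-based"))
```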


Discontinuous Entities

• E.g.: “K- and N-ras”

• Tokenization: [K][-][and][N][-][ras]

• Entity annotation:
  • [K][-]… [ras] – “chain” of discontinuous tokens
  • [N][-][ras] – Contiguous tokens

• Splitting up not always done, depends on coordination
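One way to hold both kinds of entity in the same structure is to store each entity as a list of token ranges, so that a chain is simply an entity with more than one range. The representation below is an assumption made for illustration; it is not the format used in the release.

```python
tokens = ["K", "-", "and", "N", "-", "ras"]

# Each entity is a list of (start, end) token ranges, end exclusive.
# More than one range means a "chain" of discontinuous tokens.
k_ras = [(0, 2), (5, 6)]   # [K][-] ... [ras]
n_ras = [(3, 6)]           # [N][-][ras]


def is_chain(entity):
    """An entity is a chain when it has more than one token range."""
    return len(entity) > 1


def entity_tokens(entity, tokens):
    """Collect the tokens covered by every range of the entity."""
    return [tok for start, end in entity for tok in tokens[start:end]]


print(entity_tokens(k_ras, tokens), is_chain(k_ras))
print(entity_tokens(n_ras, tokens), is_chain(n_ras))
```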


Treebank Annotation

• Default NP right-branching structure

• (NP (JJ primary) (NN liver) (NN cancer))

• Simplifies multi-token nominal annotation

• Allows recovery of implicit constituents:
  • (NP (JJ primary) (newnode (NN liver) (NN cancer)))

• Entities sometimes map to such implicit constituents
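Recovering the implicit constituents of a flat NP amounts to wrapping each proper suffix of two or more children in a new node. A minimal sketch, assuming trees are nested tuples and that no NML exceptions are present; the function names and the `newnode` default label are illustrative.

```python
def right_branch(label, children, new_label="newnode"):
    """Wrap every proper suffix of two or more children in a new node."""
    if len(children) <= 2:
        return (label, *children)
    head, rest = children[0], children[1:]
    return (label, head, right_branch(new_label, rest, new_label))


def to_brackets(tree):
    """Render a nested tuple as a Penn Treebank style bracketing."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(to_brackets(t) for t in tree) + ")"


flat = [("JJ", "primary"), ("NN", "liver"), ("NN", "cancer")]
print(to_brackets(right_branch("NP", flat)))
```

For the three-word example this reproduces the bracketing with `newnode` over "liver cancer" shown on the slide.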


Treebank Annotation

• Exceptions to right-branching marked by NML

• So: Any two or more non-final elements that form a constituent are a NML
  • (ADJP (NML (NNP New) (NNP York)) (HYPH -) (VBN based))
  • (ADJP (NML (NN breast) (NN cancer)) (HYPH -) (VBN associated))
  • (NP (NML (NN human) (NN liver) (NN tumor)) (NN analysis))


Treebank Annotation

• Placeholder *P* for distributed material in coordinated nominal structures

• “K- and N-ras”

(NP (NP (NN K) (HYPH -) (NML-1 (-NONE- *P*)))
    (CC and)
    (NP (NN N) (HYPH -) (NML-1 (NN ras))))


Treebank Annotation

• To the left or right

• “codon 12 or 13”

(NP (NP (NML-1 (NN codon)) (CD 12))
    (CC or)
    (NP (NML-1 (-NONE- *P*)) (CD 13)))
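Reading a *P* placeholder means looking up the node that carries the same index and real content, then substituting its words. The sketch below assumes trees are nested lists of the form `[label, child, ...]` with `[tag, word]` preterminals; the helper names are hypothetical.

```python
def is_placeholder(tree):
    """True for a node whose only child is the empty element *P*."""
    return len(tree) == 2 and tree[1] == ["-NONE-", "*P*"]


def find_antecedents(tree, found=None):
    """Map each indexed label (e.g. NML-1) to its subtree with real content."""
    if found is None:
        found = {}
    label = tree[0]
    for child in tree[1:]:
        if isinstance(child, list):
            find_antecedents(child, found)
    indexed = "-" in label and label.rsplit("-", 1)[1].isdigit()
    if indexed and not is_placeholder(tree):
        found[label] = tree
    return found


def words(tree, antecedents):
    """Yield of a subtree, reading *P* nodes as their antecedent."""
    if is_placeholder(tree):
        return words(antecedents[tree[0]], antecedents)
    if isinstance(tree[1], str):          # preterminal: [tag, word]
        return [tree[1]]
    return [w for child in tree[1:] for w in words(child, antecedents)]


tree = ["NP",
        ["NP", ["NML-1", ["NN", "codon"]], ["CD", "12"]],
        ["CC", "or"],
        ["NP", ["NML-1", ["-NONE-", "*P*"]], ["CD", "13"]]]

antecedents = find_antecedents(tree)
for conjunct in (tree[1], tree[3]):
    print(" ".join(words(conjunct, antecedents)))
```

The lookup works in either direction because the antecedents are collected from the whole tree before any yield is read.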


First Release

• Goal – let users choose how to handle the integration of entity and treebank levels

• Standoff annotation for entity and treebank

• Identical tokenization

• Merged representation
  • Penn Treebank style
  • (POSTag:[from..to] terminal)

• Entity listing before each tree.
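Terminals in the `(POSTag:[from..to] terminal)` form can be pulled out of a merged tree with one regular expression. This is a sketch under assumptions: the sample string is an excerpt of the merged output shown in this talk, and the pattern and function name are illustrative rather than part of any released tool.

```python
import re

# Matches one terminal of the form (POSTag:[from..to] terminal).
TERMINAL = re.compile(r"\((\S+?):\[(\d+)\.\.(\d+)\]\s+([^\s()]+)\)")


def terminals(tree_text):
    """Return (tag, start, end, token) for each terminal in a merged tree."""
    return [(tag, int(start), int(end), token)
            for tag, start, end, token in TERMINAL.findall(tree_text)]


sample = "(NML (NN:[379..383] exon) (CD:[384..385] 2))"
for tag, start, end, token in terminals(sample):
    print(tag, start, end, token)
```

In this sample the end offset appears to be exclusive, since "exon" has four characters and spans 379..383.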


Merged Output Example

sentence 4 Span:331..605
;In the present study, we screened for
;the K-ras exon 2 point mutations in a
;group of 87 gynecological neoplasms
;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type: "point mutations"
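The entity lines of this listing follow a regular `;[from..to]:type:"text"` pattern, so they can be separated from the sentence text with a small parser. The sketch below is an assumption based only on the contiguous entities in this example; the format of chained entities is not shown here and is not handled.

```python
import re

# Matches lines such as ;[373..378]:gene-rna:"K-ras"
ENTITY = re.compile(r'^;\[(\d+)\.\.(\d+)\]:([\w-]+):\s*"(.*)"$')


def parse_entities(lines):
    """Return (start, end, type, text) for every entity line; skip the rest."""
    out = []
    for line in lines:
        m = ENTITY.match(line.strip())
        if m:
            start, end, etype, text = m.groups()
            out.append((int(start), int(end), etype, text))
    return out


listing = [
    ';the K-ras exon 2 point mutations in a',
    ';[373..378]:gene-rna:"K-ras"',
    ';[379..385]:variation-location:"exon 2"',
    ';[386..401]:variation-type: "point mutations"',
]
for entity in parse_entities(listing):
    print(entity)
```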


Merged Output Example

[…] ((VP (VBD:[356..364] screened)
         (PP-CLR (IN:[365..368] for)
                 (NP (DT:[369..372] the)
                     (NN:[373..378] K-ras)
                     (NML (NN:[379..383] exon)
                          (CD:[384..385] 2))
                     (NN:[386..391] point)
                     (NNS:[392..401] mutations))) […]


Merged Output Example

((VP (VBD:[356..364] screened)
     (PP-CLR (IN:[365..368] for)
             (NP (DT:[369..372] the)
                 (NN:[373..378] K-ras)
                 (NML (NN:[379..383] exon)
                      (CD:[384..385] 2))
                 (NN:[386..391] point)
                 (NNS:[392..401] mutations)))

;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type: "point mutations"


Entity-Constituent Mapping : Exact Match

• Exact Match: A node in the tree yields exactly the entity:

;[379..385]:variation-location:"exon 2"

(NP (DT:[369..372] the) (NN:[373..378] K-ras) (NML (NN:[379..383] exon) (CD:[384..385] 2)) (NN:[386..391] point) (NNS:[392..401] mutations))
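An exact match can be tested by computing the character span of every node and looking the entity span up in that set. The sketch below is one possible way to do it, not the authors' tool: it assumes a well-formed excerpt in which every "(" is followed by a label, and the function name `node_spans` is hypothetical.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")
SPAN = re.compile(r":\[(\d+)\.\.(\d+)\]$")


def node_spans(tree_text):
    """Return {(start, end): [labels]} for each node of a merged-format tree."""
    spans = {}
    stack = []                      # open nodes as [label, start, end]
    expect_label = False
    for tok in TOKEN.findall(tree_text):
        if tok == "(":
            expect_label = True
        elif tok == ")":
            label, start, end = stack.pop()
            if start is None:       # node without offsets, nothing to record
                continue
            spans.setdefault((start, end), []).append(label)
            if stack:               # widen the parent to cover this child
                parent = stack[-1]
                parent[1] = start if parent[1] is None else min(parent[1], start)
                parent[2] = end if parent[2] is None else max(parent[2], end)
        elif expect_label:
            m = SPAN.search(tok)
            if m:                   # terminal label carrying its own offsets
                stack.append([tok[:m.start()], int(m.group(1)), int(m.group(2))])
            else:                   # nonterminal, span filled in by children
                stack.append([tok, None, None])
            expect_label = False
        # any other token is the terminal string itself


    return spans


tree = ("(NP (DT:[369..372] the) (NN:[373..378] K-ras) "
        "(NML (NN:[379..383] exon) (CD:[384..385] 2)) "
        "(NN:[386..391] point) (NNS:[392..401] mutations))")
spans = node_spans(tree)
print(spans.get((379, 385)))   # entity "exon 2"
print(spans.get((386, 401)))   # entity "point mutations"
```

A lookup that returns a label list is an exact match; a missing key means the entity needs either a new node or crosses a boundary.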


Entity-Constituent Mapping : Missing Node

• Missing Node – Possible to add a node to yield exactly the entity

;[386..401]:variation-type: "point mutations"

(NP (DT:[369..372] the) (NN:[373..378] K-ras) (NML (NN:[379..383] exon) (CD:[384..385] 2)) (NN:[386..391] point) (NNS:[392..401] mutations))
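A missing node is possible when the entity covers a run of adjacent children of one existing node. The sketch below checks that condition for a single parent, with the child spans written out by hand from the NP above; the function name and return convention are illustrative assumptions.

```python
def missing_node(entity, child_spans):
    """Return the slice of adjacent children a new node would have to cover.

    child_spans: (start, end) of each child of one existing node, in order.
    Returns (i, j) such that children i..j-1 yield exactly the entity,
    or None when no run of two or more (but not all) children does.
    """
    n = len(child_spans)
    for i in range(n):
        for j in range(i + 2, n + 1):
            if j - i == n:
                continue            # all children: the parent itself matches
            if (child_spans[i][0], child_spans[j - 1][1]) == entity:
                return i, j
    return None


# Children of the NP: the, K-ras, (NML exon 2), point, mutations
np_children = [(369, 372), (373, 378), (379, 385), (386, 391), (392, 401)]
print(missing_node((386, 401), np_children))   # "point mutations"
print(missing_node((379, 385), np_children))   # "exon 2" is already a node
```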


Entity-Constituent Mapping : Missing Node

• Done for internal research purposes, not in release (implicit constituents)

• NML already in release (explicit constituents)

(NP (DT:[369..372] the) (NN:[373..378] K-ras) (NML (NN:[379..383] exon) (CD:[384..385] 2)) (newnode (NN:[386..391] point) (NNS:[392..401] mutations)))


Entity-Constituent Mapping : Crossing

• Crossing: Cuts across constituent boundaries, so cannot even add a node yielding the entity

• Typical case: entity containing text corresponding to a prepositional phrase

One ER showed a G-to-T mutation in the second position of codon 12

[1280..1307]: variation-location: “second position of codon 12”


Entity-Constituent Mapping : Crossing

• Crossing - Determiner in NP but not in entity.

• Could relax matching, or modify the entity or treebank annotation; neither was done.

(NP (NP (DT:[1276..1279] the) (JJ:[1280..1286] second) (NN:[1287..1295] position)) (PP (IN:[1296..1298] of) (NP (NN:[1299..1304] codon) (CD:[1305..1307] 12))))

[1280..1307]: variation-location: “second position of codon 12”
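The crossing condition can be stated as a span test: an entity crosses a constituent when the two overlap but neither contains the other. In the sketch below the constituent spans are read off the tree above by hand, and the function names are illustrative.

```python
def crosses(entity, constituent):
    """True when the two spans overlap but neither contains the other."""
    (a, b), (c, d) = entity, constituent
    overlap = a < d and c < b
    nested = (a <= c and d <= b) or (c <= a and b <= d)
    return overlap and not nested


def crossing_constituents(entity, constituents):
    """List every constituent span that the entity cuts across."""
    return [c for c in constituents if crosses(entity, c)]


# Nonterminal spans for "the second position of codon 12"
constituents = [
    (1276, 1307),   # NP  the second position of codon 12
    (1276, 1295),   # NP  the second position
    (1296, 1307),   # PP  of codon 12
    (1299, 1307),   # NP  codon 12
]
entity = (1280, 1307)   # "second position of codon 12"
print(crossing_constituents(entity, constituents))
```

The only span returned is the inner NP "the second position", which contains the determiner that the entity leaves out.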


Entity-Constituent Mapping – Chain Exact Match

• “codon 12 or 13”

• Entities: “codon 12”, “codon..13”

(NP (NP (NML-1 (NN codon)) (CD 12))
    (CC or)
    (NP (NML-1 (-NONE- *P*)) (CD 13)))


Entity-Constituent Mapping – Chain Not an Exact Match

• “specific codons (12, 13, and 61)”

• Entities: “codons..12”, “codons..13”, “codons..61”

(NP (JJ specific) (NNS codons) (PRN (-LRB- -LRB-) (NP (NP (CD 12)) (, ,) (NP (CD 13)) (, ,) (CC and) (NP (CD 61))) (-RRB- -RRB-)))


Multiple Token Entities (Non-Chained)

Entity Type         Total   Exact Match   Missing Node   Crossing
Gene-generic            6             4              1          1
Gene-protein          349           236            103         10
Gene-RNA              156           115             35          6
Var-location          445           348             68         29
Var-state-orig          5             3              1          1
Var-state-altered      10             8              0          2
Var-type              271           123            142          6
Total                1242           837            350   55 (4.4%)


Multiple Token Entities (Chained)

Entity Type         Total   Exact Match   Not Exact Match
Gene-generic            0             0                 0
Gene-protein            6             4                 2
Gene-RNA               36            29                 7
Var-location          125           103                22
Var-state-orig          0             0                 0
Var-state-altered       0             0                 0
Var-type                1             0                 1
Total                 168           136          32 (19%)


Conclusion

• Annotation of entities and treebank done together

• Identical tokenization for entities and trees, with standoff annotation

• Allows flexibility in use of integrated annotation

• Only 6.2% of the multiple-token entities (87 of 1410) cannot be mapped to an implicit or explicit constituent node
  • Changes in Treebank guidelines
  • Use of Relations for potentially large entities

• Next: Relation annotation and integrated taggers
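The 6.2% figure can be reproduced from the totals of the two multiple-token tables, assuming it combines the crossing non-chained entities with the chained entities that are not an exact match. A quick arithmetic check, with illustrative variable names:

```python
# Totals taken from the two multiple-token tables.
non_chained_total, crossing = 1242, 55
chained_total, chain_not_exact = 168, 32

unmapped = crossing + chain_not_exact            # 87
multi_token = non_chained_total + chained_total  # 1410

print(f"crossing, non-chained:     {crossing / non_chained_total:.1%}")
print(f"not exact, chained:        {chain_not_exact / chained_total:.1%}")
print(f"unmapped, all multi-token: {unmapped / multi_token:.1%}")
```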




Entity Annotation - Variations

• “(S249C)”
  • Var-type – none
  • Var-location – 249
  • Var-state-orig – S
  • Var-state-altered – C

• Gene-{RNA,generic,protein} disambiguates gene metonymy

• Var-{type,location,state-orig,state-altered} are different kinds of entities
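A compact mention such as “(S249C)” packs three component entities into one string. The sketch below splits it with a regular expression; it assumes the shorthand is always one capital letter, digits, and one capital letter, which is a simplification, and the function name is hypothetical.

```python
import re

# Assumption: mentions look like one letter, digits, one letter, e.g. S249C,
# optionally wrapped in parentheses.
SHORTHAND = re.compile(r"^\(?([A-Z])(\d+)([A-Z])\)?$")


def split_shorthand(mention):
    """Split a compact mention such as "(S249C)" into its component entities."""
    m = SHORTHAND.match(mention)
    if m is None:
        return None
    orig, location, altered = m.groups()
    return {
        "var_type": None,             # no explicit type word in the text
        "var_location": location,
        "var_state_orig": orig,
        "var_state_altered": altered,
    }


print(split_shorthand("(S249C)"))
```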


Entities

                                    --Multiple Tokens--
Entity Type         Single Tokens   Non-chains   Chains
Gene-generic                  104            6        0
Gene-protein                  921          349        6
Gene-RNA                     1987          156       36
Var-location                   95          445      125
Var-state-orig                151            5        0
Var-state-altered             162           10        0
Var-type                      235          271        1


Introduction

• Corpus for biomedical IE with several levels of annotation:
  • Entity
  • Syntactic Structure (Treebank)
  • Relations (McDonald et al, ACL 2005)

• Ideal - entities mapped to treebank constituents

• Allow users to choose how to integrate the levels


Annotation Process

• Tokenization → Entity → POS → Treebanking → Merged Representation

• Minimal requirement: identical tokenization for entity and treebank annotation

• Did not require an entity/constituent correspondence – but how did it work out?