lecture 19 lexical networks

Lecture 19 Lexical networks Slides modified from Dragomir R. Radev

Upload: chelsey

Post on 28-Jan-2016


TRANSCRIPT

Page 1: Lecture 19 Lexical networks

Lecture 19

Lexical networks

Slides modified from Dragomir R. Radev

Page 2: Lecture 19 Lexical networks

Social data

Blog postings
News stories
Speeches in Congress
Query logs
Movie and book reviews
Scientific papers
Financial reports
Encyclopedia entries
Email
Chat room discussions
Social networking sites

WHAT DO ALL OF THESE HAVE IN COMMON?


Page 3: Lecture 19 Lexical networks

Natural language processing

Part of speech tagging
Prepositional phrase attachment
Parsing
Word sense disambiguation
Document indexing
Text summarization
Machine translation
Question answering
Information retrieval
Social network extraction
Topic modeling


Page 4: Lecture 19 Lexical networks

Talk outline

Lexical networks

Semantic networks

Lexical centrality

Latent networks

Conclusion


Page 5: Lecture 19 Lexical networks

Lexical networks

Page 6: Lecture 19 Lexical networks

Lexical networks

A special case of networks where nodes are words or documents and edges link semantically related nodes

Other examples:
Words used in dictionary definitions
Names of people mentioned in the same story
Words that translate to the same word

A semantic network consists of a set of nodes that are connected by labeled arcs.

The nodes represent concepts and the arcs represent relations between concepts.

Page 7: Lecture 19 Lexical networks

Semantic network


Page 8: Lecture 19 Lexical networks

The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. M. Steyvers, J. B. Tenenbaum (2005). Cognitive Science, 29(1)

Free word associations

Page 9: Lecture 19 Lexical networks

Dependency network

[Figure: dependency graph over the words Meredith, bought, green, apples, yesterday]

Page 10: Lecture 19 Lexical networks

Dependency network


Page 11: Lecture 19 Lexical networks

Semantic Networks

Page 12: Lecture 19 Lexical networks

So again… A Semantic Network is…

A semantic (or associative) network is a simple representation scheme which uses a graph of labeled nodes and labeled, directed arcs to encode knowledge.

Labeled nodes: objects/classes/concepts

Labeled links: relations/associations between nodes

Labels define the semantics of nodes and links

Usually used to represent static, taxonomic, concept dictionaries

Page 13: Lecture 19 Lexical networks

Nodes and Arcs

Nodes denote objects/classes; arcs define binary relationships between objects.

mother(john,sue)
age(john,5)
wife(sue,max)
age(sue,34)
...

[Figure: network with nodes john, Sue, Max, 5, 34 and arcs labeled mother, father, wife, husband, age]
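The facts on this slide can be stored directly as labeled arcs. The following is a minimal sketch, assuming a simple list of (relation, subject, value) triples; the names `FACTS`, `query`, and `arcs_from` are illustrative and not part of the slides.

```python
# Facts from the slide, stored as (relation, subject, value) triples.
# Each triple is one labeled arc: subject --relation--> value.
FACTS = [
    ("mother", "john", "sue"),
    ("age", "john", 5),
    ("wife", "sue", "max"),
    ("age", "sue", 34),
]


def query(relation, subject):
    """Return every value v such that relation(subject, v) is a stored fact."""
    return [v for (r, s, v) in FACTS if r == relation and s == subject]


def arcs_from(node):
    """Return the labeled arcs leaving a node as (label, target) pairs."""
    return [(r, v) for (r, s, v) in FACTS if s == node]
```

For example, `query("age", "john")` looks up the arc labeled age that leaves the node john, and `arcs_from("sue")` lists every arc leaving sue.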

Page 14: Lecture 19 Lexical networks

Common Semantic Relations

There is no standard set of relations for semantic networks, but the following relations are very common:

INSTANCE: X is an INSTANCE of Y if X is a specific example of the general concept Y.

Example: Elvis is an INSTANCE of Human

ISA: X ISA Y if X is a subset of the more general concept Y.

Example: sparrow ISA bird

HASPART: X HASPART Y if the concept Y is a part of the concept X.

Or this can be any other property

Example: sparrow HASPART tail

Page 15: Lecture 19 Lexical networks
Page 16: Lecture 19 Lexical networks

ISA hierarchy

The ISA (is a) or AKO (a kind of) relation is often used to link a class and its superclass.

And sometimes an instance and its class.

Some links (e.g. has-part) are inherited along ISA paths.

The semantics of a semantic net can be relatively informal or very formal, often defined at the implementation level.

[Figure: ISA hierarchy over the nodes Red, Rusty, Robin, Bird, Animal, with isa arcs and a hasPart arc to Wings]
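Inheritance of has-part along ISA paths can be sketched in a few lines. The arcs below are an assumption read off the figure labels (Red and Rusty isa Robin, Robin isa Bird, Bird isa Animal, Bird hasPart Wings); the function names are illustrative.

```python
# Assumed arcs: each node has at most one ISA parent in this toy hierarchy.
ISA = {"Red": "Robin", "Rusty": "Robin", "Robin": "Bird", "Bird": "Animal"}
# Assumed: the hasPart arc is attached to Bird.
HAS_PART = {"Bird": {"Wings"}}


def ancestors(node):
    """Yield the node itself, then each superclass reached by following ISA."""
    while node is not None:
        yield node
        node = ISA.get(node)


def parts(node):
    """Collect has-part values declared on the node or inherited via ISA."""
    found = set()
    for n in ancestors(node):
        found |= HAS_PART.get(n, set())
    return found
```

Under these assumed arcs, `parts("Red")` inherits Wings from Bird, while `parts("Animal")` is empty because has-part is inherited downward only.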

Page 17: Lecture 19 Lexical networks

Inference by association

Red (a robin) is related to Air Force One by association: directed paths originating from these two nodes join at the nodes Wings and Fly.

Bob and George are not related (no paths originating from them join in this network).

[Figure: network with nodes Red, Rusty, Robin, Bird, Animal, Wings, Fly, Air Force One, Boeing 747, Airplane, Machine, Bob, George and arcs labeled isa, has-part, can-do, owner, passenger]
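Inference by association amounts to intersecting the sets of nodes reachable from two start nodes. A minimal sketch follows, assuming the arc directions below (in particular that the owner and passenger arcs point from Air Force One to Bob and George, which is what makes the two people unrelated); the edge list is a reading of the figure, not a transcription of it.

```python
from collections import deque

# Assumed directed arcs (labels dropped, only direction matters here).
EDGES = {
    "Red": ["Robin"],
    "Rusty": ["Robin"],
    "Robin": ["Bird"],
    "Bird": ["Animal", "Wings", "Fly"],
    "Air Force One": ["Boeing 747", "Bob", "George"],
    "Boeing 747": ["Airplane"],
    "Airplane": ["Machine", "Wings", "Fly"],
}


def reachable(start):
    """Return every node on a directed path that originates at start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def meeting_nodes(a, b):
    """Nodes where directed paths from a and from b join."""
    return reachable(a) & reachable(b)
```

Two nodes are related by association when `meeting_nodes` is non-empty; with the assumed arcs, Bob and George have no outgoing paths, so their result is the empty set.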

Page 18: Lecture 19 Lexical networks

Frames – A Semantic Network with properties

A frame represents an entity as a set of slots (attributes) and associated values.

Frames act and look like objects in C++.

A frame is a more robust/compact version of a semantic network.

Each slot may have constraints that describe legal values that the slot can take.

A frame can represent a specific entity, or a general concept.

Frames are implicitly associated with one another because the value of a slot can be another frame.
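The points above (slots, constraints on legal values, and slots whose value is another frame) can be illustrated with a small class. This is a sketch under stated assumptions: the `Frame` class, its method names, and the age constraint are hypothetical, and the john/sue values are borrowed from the earlier Nodes and Arcs slide.

```python
class Frame:
    """A frame: named slots with values and optional per-slot constraints."""

    def __init__(self, name):
        self.name = name
        self.slots = {}        # slot name -> value (possibly another Frame)
        self.constraints = {}  # slot name -> predicate describing legal values

    def constrain(self, slot, predicate):
        """Register a predicate that every value of this slot must satisfy."""
        self.constraints[slot] = predicate

    def set(self, slot, value):
        """Store a slot value, rejecting values the constraint forbids."""
        check = self.constraints.get(slot)
        if check is not None and not check(value):
            raise ValueError(f"illegal value for slot {slot!r}: {value!r}")
        self.slots[slot] = value

    def get(self, slot):
        """Return the value stored in a slot."""
        return self.slots[slot]


sue = Frame("sue")
sue.constrain("age", lambda v: isinstance(v, int) and v >= 0)
sue.set("age", 34)

john = Frame("john")
john.set("age", 5)
john.set("mother", sue)  # a slot value that is itself a frame
```

Because the mother slot of john holds the sue frame, `john.get("mother").get("age")` walks from one frame to the other, which is the implicit association the slide describes.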

Page 19: Lecture 19 Lexical networks


Page 20: Lecture 19 Lexical networks

Semantic Networks

Rules are appropriate for some types of knowledge, but do not easily map to others.

Semantic nets can easily represent inheritance and exceptions, but are not well-suited for representing negation, disjunction, preferences, conditionals, and cause/effect relationships.

Frames allow arbitrary functions (demons) and typed inheritance. Implementation is a bit more cumbersome.

Page 21: Lecture 19 Lexical networks

Lexical Centrality

Page 22: Lecture 19 Lexical networks

LexRank – Centrality in Text Graphs

Vertices

Units of text (sentences or documents)

Edges

Pairwise similarity between text


Page 23: Lecture 19 Lexical networks

LexRank – Centrality in Text Graphs

Intuition

LexRank score is propagated through edges

Central vertices are those that are similar to other central vertices

Page 24: Lecture 19 Lexical networks

LexRank – Centrality in Text Graphs

Recurrence Relation

Can guarantee a solution by allowing a “jump” probability d/N.
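One common way to write the LexRank recurrence, consistent with the jump probability d/N, is sketched below. The notation is an assumption: N is the number of vertices, adj(u) is the set of neighbors of u, sim is the pairwise similarity on the edges, and p(u) is the score of vertex u.

```latex
p(u) = \frac{d}{N} + (1 - d) \sum_{v \in \mathrm{adj}(u)}
       \frac{\mathrm{sim}(u, v)}{\sum_{z \in \mathrm{adj}(v)} \mathrm{sim}(z, v)} \, p(v)
```

With d > 0 every vertex can be reached from every other vertex in one step, so the underlying random walk has a unique stationary distribution, which is why the jump guarantees a solution.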

[Figure: example similarity graph with edge weights 0.5, 0.3, 0.8, 0.2, 0.1, 0.3, 0.9, 0.2, 0.4]
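The score propagation can be sketched as a power iteration over a similarity matrix. This is a minimal illustration, not the reference implementation: the function name, the default jump probability d = 0.15, the convergence tolerance, and the 3-by-3 example matrix are all assumptions, and every vertex is assumed to have at least one neighbor.

```python
def lexrank(sim, d=0.15, tol=1e-10, max_iter=1000):
    """Iterate p(u) = d/N + (1 - d) * sum_v sim[v][u] / rowsum(v) * p(v).

    sim is a square list of lists of non-negative similarities.
    d is the jump probability (assumed default, not from the slides).
    """
    n = len(sim)
    row_sums = [sum(row) for row in sim]
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_p = []
        for u in range(n):
            incoming = sum(
                sim[v][u] / row_sums[v] * p[v]
                for v in range(n)
                if row_sums[v] > 0
            )
            new_p.append(d / n + (1 - d) * incoming)
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Made-up symmetric similarities between three text units.
example = [
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.3],
    [0.2, 0.3, 0.0],
]
scores = lexrank(example)
```

In this toy matrix the middle unit has the largest total similarity to the others, so it ends up with the highest score, matching the intuition that central vertices are similar to other central vertices.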

Page 25: Lecture 19 Lexical networks


Page 26: Lecture 19 Lexical networks

http://tangra.si.umich.edu/clair/lexrank/

Page 27: Lecture 19 Lexical networks

NLP and network analysis

Page 28: Lecture 19 Lexical networks

Part of speech tagging [Biemann 2006]

... , sagte der Sprecher bei der Sitzung .
... , rief der Vorsitzende in der Sitzung .
... , warf in die Tasche aus der Ecke .

C1: sagte, warf, rief
C2: Sprecher, Vorsitzende, Tasche
C3: in
C4: der, die

Word sense disambiguation [Mihalcea et al 2004]

Document indexing [Mihalcea et al 2004]

Subjectivity analysis [Pang and Lee 2004]

Semantic class induction [Widdows and Dorow 2002]

Passage retrieval [Otterbacher, Erkan, Radev 05]

[Figure: query node Q linked to passages by relevance; passages linked to each other by inter-similarity]

Page 29: Lecture 19 Lexical networks

MavenRank – Centrality in Speech Graphs

Vertices

Speech transcripts from a given topic

Edges

tf-idf cosine similarity (with threshold)

Hypothesis

Key speakers will have speeches with high centrality.
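The edge definition can be sketched as follows: compute tf-idf vectors for the speeches, take pairwise cosine similarity, and keep an edge only when the similarity reaches a threshold. The idf variant log(N/df), the threshold of 0.1, and the three tokenized toy speeches are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_graph(docs, threshold=0.1):
    """Return {(i, j): similarity} for document pairs at or above threshold."""
    vecs = tfidf_vectors(docs)
    edges = {}
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            s = cosine(vecs[i], vecs[j])
            if s >= threshold:
                edges[(i, j)] = s
    return edges


# Made-up speeches, already tokenized.
speeches = [
    ["tax", "cut", "budget"],
    ["tax", "budget", "deficit"],
    ["forest", "fire", "relief"],
]
edges = similarity_graph(speeches)
```

Only the first two toy speeches share vocabulary, so they are the only pair connected; a centrality score computed on the resulting graph would then be used to test the hypothesis about key speakers.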


Page 30: Lecture 19 Lexical networks

MavenRank: Example

[Figure: similarity graph over speeches 1 to 8, grouped into Speaker 1 speeches, Speaker 2 speeches, and Speaker 3 speeches]

Speech Scores

Speech  Score
1       0.13
2       0.13
3       0.10
4       0.19
5       0.10
6       0.14
7       0.08
8       0.13

Speaker Scores (mean speech score)

Speaker  Score
1        0.12
2        0.15
3        0.12
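The speaker score is the mean of that speaker's speech scores. The sketch below uses the speech scores from the slide; the assignment of speeches to speakers (1 to 3, 4 to 5, 6 to 8) is an assumption, chosen because it is consistent with the rounded speaker scores shown.

```python
# Speech scores as listed on the slide.
SPEECH_SCORES = {1: 0.13, 2: 0.13, 3: 0.10, 4: 0.19,
                 5: 0.10, 6: 0.14, 7: 0.08, 8: 0.13}

# Assumed grouping of speeches by speaker (not stated in the transcript).
SPEECHES_BY_SPEAKER = {1: [1, 2, 3], 2: [4, 5], 3: [6, 7, 8]}


def speaker_scores(speech_scores, speeches_by_speaker):
    """Return {speaker: mean score of that speaker's speeches}."""
    return {
        speaker: sum(speech_scores[s] for s in ids) / len(ids)
        for speaker, ids in speeches_by_speaker.items()
    }


scores = speaker_scores(SPEECH_SCORES, SPEECHES_BY_SPEAKER)
```

Under this grouping the means are 0.12, 0.145, and about 0.117, in line with the 0.12, 0.15, and 0.12 reported for the three speakers.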

Page 31: Lecture 19 Lexical networks


Page 32: Lecture 19 Lexical networks

GIN: Gene Interaction Network

Motivation:

Biomedical literature is growing rapidly.

Manually curated databases cover a small portion of the available information.

Most protein interaction information is uncovered in biomedical articles.

Approach: text mining and network analysis for

Automatic extraction of molecule interactions

Automatic article summarization

Interaction and citation networks

Inferring gene-disease associations

Page 33: Lecture 19 Lexical networks

Feature Extraction from Dependency Trees

Path1: KaiC – nsubj – interacts – obj – SasA

Path2: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiA

Path3: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiB

Path4: SasA – conj_and – KaiA

Path5: SasA – conj_and – KaiB

Path6: KaiA – prep_with – SasA – conj_and – KaiB

“The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA.”
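Path features like these can be produced by a breadth-first search over the dependency graph. The sketch below is an illustration under assumptions: the edge list is taken from Path1 to Path5 (the prep_with arc of Path6 is left out), edges are treated as undirected, and the function names are hypothetical.

```python
from collections import deque

# Assumed labeled dependency edges: (word, relation, word).
DEP_EDGES = [
    ("KaiC", "nsubj", "interacts"),
    ("interacts", "obj", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]


def build_graph(edges):
    """Build an undirected adjacency map: word -> [(relation, neighbor)]."""
    graph = {}
    for a, label, b in edges:
        graph.setdefault(a, []).append((label, b))
        graph.setdefault(b, []).append((label, a))
    return graph


def dependency_path(graph, source, target):
    """Return the shortest path as alternating words and relation labels."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]  # paths always end on a word, never on a label
        if node == target:
            return path
        for label, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [label, nxt])
    return None


graph = build_graph(DEP_EDGES)
path2 = dependency_path(graph, "KaiC", "KaiA")
```

Joining `path2` with " – " reproduces the Path2 string above; a classifier for interaction extraction would then use such paths between two protein names as features.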


Page 34: Lecture 19 Lexical networks

Inferring Genes Related to Prostate Cancer

Hypothesis: Genes that interact with many genes known to be related to prostate cancer are likely to be related to prostate cancer.

Approach:

Automatically extract from the literature the interaction network of genes (seed genes) that are known to be related to prostate cancer.

Infer new genes related to prostate cancer from the network topology.

Use eigenvector centrality to rank gene-prostate cancer associations.

Hypothesis restatement: Genes central in the constructed network are most probably related to prostate cancer.
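The ranking step can be sketched with a plain power iteration for eigenvector centrality. Everything concrete here is hypothetical: the node names SEED1 to SEED3 and CAND_A, CAND_B stand in for seed genes and their neighbors, and the edges are made up. The iteration uses A + I rather than A, which has the same eigenvectors but avoids oscillation on bipartite graphs.

```python
def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I) for an undirected, connected graph.

    adj maps each node to the set of its neighbors.
    """
    score = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # Each node keeps its own score and adds its neighbors' scores.
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        norm = max(new.values())
        score = {n: v / norm for n, v in new.items()}
    return score


# Hypothetical interaction network: three seeds and two candidate genes.
GRAPH = {
    "SEED1": {"SEED2", "CAND_A", "CAND_B"},
    "SEED2": {"SEED1", "CAND_A"},
    "SEED3": {"CAND_A"},
    "CAND_A": {"SEED1", "SEED2", "SEED3"},
    "CAND_B": {"SEED1"},
}

centrality = eigenvector_centrality(GRAPH)
candidates = [g for g in GRAPH if g.startswith("CAND_")]
ranking = sorted(candidates, key=centrality.get, reverse=True)
```

The candidate that interacts with all three seeds is ranked above the one that interacts with a single seed, which is the behavior the hypothesis relies on.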

Page 35: Lecture 19 Lexical networks

Approach

Corpus: PMCOA (PubMed Central Open Access), full-text articles

Articles in PMCOA are split into sentences, and the sentences are tagged with GeniaTagger.

Compile a seed list of genes known to be related to prostate cancer:

20 genes compiled from the OMIM (Online Mendelian Inheritance in Man) database

Seed gene list extended with synonyms from the HGNC (HUGO Gene Nomenclature Committee) database

Use the automatic interaction extraction pipeline to extract the interaction network of the seed genes and their neighbors (genes interacting with the seed genes).

Page 36: Lecture 19 Lexical networks

Seed Genes

Gene | Description
AR | androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease)
BRCA2 | breast cancer 2, early onset
MSR1 | macrophage scavenger receptor 1
EPHB2 | EPH receptor B2
KLF6 | Kruppel-like factor 6
MAD1L1 | MAD1 mitotic arrest deficient-like 1 (yeast)
TUSC3 | tumor suppressor candidate 3
HIP1 | huntingtin interacting protein 1
CBX8 | chromobox homolog 8 (Pc class homolog, Drosophila)
CD82 | CD82 molecule
ZFHX3 | zinc finger homeobox 3
ELAC2 | elaC homolog 2 (E. coli)
MXI1 | MAX interactor 1
PTEN | phosphatase and tensin homolog (mutated in multiple advanced cancers 1)
RNASEL | ribonuclease L (2',5'-oligoisoadenylate synthetase-dependent)
HPC1 | hereditary prostate cancer 1
CHEK2 | CHK2 checkpoint homolog (S. pombe)
HPCX | hereditary prostate cancer, X-linked
PCAP | predisposing for prostate cancer
PRCA1 | prostate cancer 1

20 genes that are reported in OMIM to be related to prostate cancer

Page 37: Lecture 19 Lexical networks

Interactions of the seed genes (gene names normalized to their HGNC symbols)

Page 38: Lecture 19 Lexical networks

Sample Extracted Interaction Sentences

A study by Jin et al. [20] indicated that the association of Tax with hsMAD1, a mitotic spindle checkpoint (MSC) protein, led to the translocation of both MAD1 and MAD2 to the cytoplasm.

PTEN is transcriptionally regulated by transcription factors such as p53, Egr-1, NFκB and SMADs, while protein levels and activity are modulated by phosphorylation, oxidation, subcellular localisation, phospholipid binding and protein stability [29].

Interestingly, one of these, HPC1, is linked to RNASEL [10,11].

In response to DNA damage, the cell-cycle checkpoint kinase CHEK2 can be activated by ATM kinase to phosphorylate p53 and BRCA1, which are involved in cell-cycle control, apoptosis, and DNA repair [1,2].

The interactions of RAD51 with TP53, RPA and the BRC repeats of BRCA2 are relatively well understood (see Discussion).

The interaction of BRCA2 with HsRad51 is significantly more different to both RadA and RecA (Figure 2c).

Max interactor protein, MXI1 (gene L07648) competes for MAX thus negatively regulates MYC function and may play a role in insulin resistance.

Mad2 binds to Cdc20, an activator of the anaphase-promoting complex (APC), to inhibit APC activity and arrest cells in metaphase in response to checkpoint activation.

Page 39: Lecture 19 Lexical networks

Inferred Genes (evaluation of top-20 scoring genes)

6 are seed genes; 14 genes are inferred to be related to prostate cancer (check the GeneGo Pathway database; if no evidence there, check the PubMed literature)

9 genes: marked as being related to prostate cancer by the GeneGo Pathway database
1 gene: evidence found in PubMed that the gene is related to prostate cancer
4 genes: no evidence found

Gene | Description | Evidence
TP53 | tumor protein p53 (Li-Fraumeni syndrome) | GeneGo
BRCA1 | breast cancer 1, early onset | GeneGo
EREG | epiregulin | no
AKT1 | v-akt murine thymoma viral oncogene homolog 1 | GeneGo
MAPK1 | mitogen-activated protein kinase 1 | no
TNF | tumor necrosis factor (TNF superfamily, member 2) | GeneGo
CCND1 | cyclin D1 | GeneGo
MYC | v-myc myelocytomatosis viral oncogene homolog (avian) | GeneGo
APC | adenomatosis polyposis coli | PubMed
CDKN1B | cyclin-dependent kinase inhibitor 1B (p27, Kip1) | GeneGo
MAPK8 | mitogen-activated protein kinase 8 | GeneGo
NR3C1 | nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor) | no
VEGFA | vascular endothelial growth factor A | GeneGo
MDM2 | mouse double minute 2, human homolog of; p53-binding protein | no

Page 40: Lecture 19 Lexical networks


Page 41: Lecture 19 Lexical networks

Other networks

Diabetes Type I

Diabetes Type II

Bipolar Disorder


Page 42: Lecture 19 Lexical networks

Properties of lexical networks

Page 43: Lecture 19 Lexical networks

Dependency network


Page 44: Lecture 19 Lexical networks

Random network


Page 45: Lecture 19 Lexical networks

Analyzing networks

Properties of networks

Clustering coefficient: Watts/Strogatz cc = #triangles/#triples, computed around each node and averaged over nodes

Power law coefficient

Diameter (longest shortest path)

Average shortest path (ASP)

Properties of nodes

Centrality: degree, closeness, betweenness, eigenvector
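Three of the network properties listed here can be computed with a few lines of breadth-first search. This is a minimal sketch assuming an undirected, connected graph stored as a dictionary of neighbor sets; the four-node example graph and the function names are made up for illustration.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_asp(adj):
    """Return (longest shortest path, average shortest path)."""
    lengths = [
        d
        for s in adj
        for t, d in bfs_distances(adj, s).items()
        if t != s
    ]
    return max(lengths), sum(lengths) / len(lengths)


def clustering_coefficient(adj):
    """Average over nodes of #triangles / #triples centered on the node."""
    total = 0.0
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # a node with fewer than two neighbors contributes 0
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += closed / (k * (k - 1) / 2)
    return total / len(adj)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
TOY = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
diameter, asp = diameter_and_asp(TOY)
cc = clustering_coefficient(TOY)
```

On the toy graph the local coefficients are 1, 1, 1/3, and 0, so the averaged clustering coefficient is 7/12; the longest shortest path has length 2 and the average over all pairs is 4/3.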

Page 46: Lecture 19 Lexical networks

Types of networks

Regular networks
Uniform degree distribution

Random networks
Memoryless
Poisson degree distribution: $P(k) = \frac{e^{-\langle k \rangle} \langle k \rangle^{k}}{k!}$, with $\langle k \rangle \approx Np$
Characteristic value
Low clustering coefficient
Large ASP

Small world networks
High transitivity
Presence of hubs (memory)
High clustering coefficient (e.g., 1000 times higher than random)
Small ASP
Power law degree distribution: $P(k) \sim k^{-\gamma}$ (typical value of $\gamma$ between 2 and 3)

Page 47: Lecture 19 Lexical networks

Comparing the dependency graph to a random (Poisson) graph

            Random    Actual
n           5563      5584
M           14440     14472
Diameter    21        13
ASP         8.788     4.01
W/S cc      0.00062   0.092
γ           n/a       2.1947

Page 48: Lecture 19 Lexical networks

Properties of lexical networks

Entries in a thesaurus [Motter et al. 2002]

c/c0 = 260 (n=30,000)

Co-occurrence networks [Dorogovtsev and Mendes 2001, Sole and Ferrer i Cancho 2001]

c/c0 = 1,000 (n=400,000)

Mental lexicon [Vitevitch 2005]

c/c0 = 278 (n=19,340)

[Figure: word network over letter, actor, character, nature, universe, world]

Page 49: Lecture 19 Lexical networks

[Figure: syntactic dependency degree distribution on a log-log scale; both axes run from 0 to 8]
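A degree distribution that follows a power law appears as a straight line on a log-log plot, and the slope of that line gives a rough estimate of the exponent. The sketch below fits the slope by ordinary least squares; this estimator is an assumption chosen for simplicity (not necessarily how the value in the comparison table was obtained), and the degree list is synthetic.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Return {k: P(k)} estimated from a list of node degrees."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in counts.items()}


def loglog_slope(dist):
    """Least-squares slope of log P(k) against log k (k = 0 is skipped)."""
    pts = [(math.log(k), math.log(p)) for k, p in dist.items() if k > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    num = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    den = sum((x - mean_x) ** 2 for x, _ in pts)
    return num / den


# Synthetic degrees whose counts fall off exactly as k^-2 (made-up data).
degrees = [k for k in (1, 2, 4, 8, 16, 32) for _ in range(1024 // (k * k))]
gamma = -loglog_slope(degree_distribution(degrees))
```

For the synthetic data the estimate recovers an exponent of 2; on real degree data the tail is noisy, so a straight least-squares fit should be read as a rough indication only.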