Cancer Hallmark Text Classification

Using Convolutional Neural Networks

Simon Baker, Anna Korhonen, Sampo Pyysalo

Cambridge Language Technology Lab

Introduction and motivation

• A major goal of cancer research is to understand the biological mechanisms involved: how tumorous growths start in the body, how they are sustained, and how they turn malignant.

• Cancer is often described in the biomedical literature by its hallmarks: a set of interrelated biological properties and behaviours that enable cancer to thrive in the body.

Introduction and motivation

• The hallmarks of cancer were first introduced in the seminal paper of Hanahan and Weinberg (2000), the most cited paper in the journal Cell.

• The paper introduced six hallmarks, which a follow-up paper (Hanahan and Weinberg 2011) extended by another four, forming the set of ten hallmarks known today.

Introduction and motivation

In recent work, Baker et al. (2016) introduced a corpus of over 1,800 abstracts from biomedical publications annotated with the ten hallmarks of cancer.

They also proposed a machine-learning method for classifying abstracts according to the hallmarks. The approach uses a conventional NLP pipeline to extract a feature-rich representation, which is then used to train support vector machine (SVM) classifiers.

The method achieves an average F-score of 77%.

Introduction and motivation

A conventional pipeline method is expensive:

• Computationally demanding.

• Requires handcrafting and feature engineering.

• Error propagation through the pipeline.

Our goal is to overcome these challenges by applying Convolutional Neural Networks to this task.

The Hallmarks of Cancer

• Sustaining proliferative signalling

• Evading growth suppressors

• Resisting cell death

• Enabling replicative immortality

• Inducing angiogenesis

• Activating invasion & metastasis

• Genome instability & mutation

• Tumor-promoting inflammation

• Deregulating cellular energetics

• Avoiding immune destruction

Data

• Corpus of 1,853 scientific abstracts.

• Each abstract is labelled with zero or more hallmarks.

• Inter-annotator agreement on a subset of 155 abstracts: Kappa = 0.81.

• Data split into three sets: train (70%), development (10%), test (10%).

• We used a sampling strategy that preserves the overall distribution of the 10 classes.

• We train ten independent binary classifiers (one for each hallmark).
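Training one binary classifier per hallmark means turning each abstract's set of labels into ten separate yes/no targets. The following is a minimal sketch of that decomposition; the function name and the toy annotations are invented for illustration and are not taken from the corpus.

```python
# Hallmark names as listed on the "Hallmarks of Cancer" slide.
HALLMARKS = [
    "Sustaining proliferative signalling",
    "Evading growth suppressors",
    "Resisting cell death",
    "Enabling replicative immortality",
    "Inducing angiogenesis",
    "Activating invasion & metastasis",
    "Genome instability & mutation",
    "Tumor-promoting inflammation",
    "Deregulating cellular energetics",
    "Avoiding immune destruction",
]


def to_binary_tasks(doc_labels):
    """Turn multi-label annotations into one 0/1 target list per hallmark.

    doc_labels: list of sets of hallmark names, one set per abstract
                (an empty set means the abstract has no hallmark).
    """
    return {
        hallmark: [1 if hallmark in labels else 0 for labels in doc_labels]
        for hallmark in HALLMARKS
    }


# Toy annotations, invented for illustration only.
toy_labels = [
    {"Resisting cell death"},
    set(),
    {"Inducing angiogenesis", "Resisting cell death"},
]
tasks = to_binary_tasks(toy_labels)
print(tasks["Resisting cell death"])   # [1, 0, 1]
print(tasks["Inducing angiogenesis"])  # [0, 0, 1]
```

Each of the ten target lists can then be paired with the same documents to train its own classifier.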

Data

Class distribution per hallmark:

Hallmark                       Positives  Negatives
Proliferative signaling        25%        75%
Evading growth                 13%        87%
Resisting cell death           23%        77%
Replicative immortality        6%         94%
Angiogenesis                   8%         92%
Invasion & metastasis          16%        84%
Genomic instability            18%        82%
Tumor promoting inflammation   13%        87%
Cellular energetics            6%         94%
Avoiding immune destruction    6%         94%

Convolutional Neural Networks

We base our CNN architecture on the simple model of Kim (2014).

Convolutional Neural Networks

We implemented the neural network using Keras.

Model hyperparameters and the training setup were initially fixed to those applied by Kim (2014), summarized below:

• Word vector size: 300 (Google News vectors)

• Filter sizes: 3, 4, 5

• Number of filters: 300 (100 of each size)

• Dropout probability: 0.5

• Minibatch size: 50
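To make the role of the filter sizes concrete, here is a dependency-free sketch of the core operation in a Kim-style CNN: each filter spans a window of h consecutive word vectors, and only its largest response over the sentence is kept (max-over-time pooling). It is not the Keras implementation used in this work; the toy dimensions, random weights and function names are assumptions, and dropout and the output layer are omitted.

```python
import random


def conv_max_pool(sentence, filt, bias=0.0):
    """Slide one filter over a sentence and keep its largest ReLU response.

    sentence: list of word vectors (each a list of floats)
    filt:     list of h weight vectors, where h is the filter size in words
    """
    h = len(filt)
    best = 0.0  # ReLU outputs are never negative, so 0.0 is a safe floor
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        activation = bias + sum(
            w * x
            for w_row, x_row in zip(filt, window)
            for w, x in zip(w_row, x_row)
        )
        best = max(best, max(0.0, activation))
    return best


def extract_features(sentence, filters):
    """One pooled feature per filter; these feed the final classifier."""
    return [conv_max_pool(sentence, f) for f in filters]


rng = random.Random(0)
DIM = 4  # toy embedding size; the slides use 300


def random_filter(h):
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(h)]


# Two filters per size here; the slides use 100 each for sizes 3, 4 and 5.
filters = [random_filter(h) for h in (3, 4, 5) for _ in range(2)]
sentence = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(12)]
features = extract_features(sentence, filters)
print(len(features))  # 6
```

With the hyperparameters listed above, the same procedure would yield a 300-dimensional feature vector per abstract.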

Convolutional Neural Networks

Adapting the model to our task:

• Oversampling positive examples

• Pre-training embeddings

• Tuning filter sizes

Oversampling

• We oversampled the positive examples (2X, 4X, 8X, 16X). We selected a balanced oversampling strategy, in which the two classes have equal numbers of examples.

• Oversampling improves the F-score to 86.1%, compared to 85.1% without oversampling.
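Balanced oversampling can be sketched as follows: positive examples are duplicated at random until they match the number of negatives. This is an illustrative sketch rather than the code used in this work; the function name, the seed and the toy data are assumptions.

```python
import random


def oversample_balanced(examples, labels, seed=0):
    """Duplicate random positives until both classes are the same size."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    if not pos or len(pos) >= len(neg):
        # Nothing to duplicate, or positives are already the majority.
        return list(examples), list(labels)
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    data = [(x, 1) for x in pos + extra] + [(x, 0) for x in neg]
    rng.shuffle(data)
    xs, ys = zip(*data)
    return list(xs), list(ys)


# Toy training set with one positive in sixteen (about 6%).
docs = [f"abstract-{i}" for i in range(16)]
labels = [1] + [0] * 15
xs, ys = oversample_balanced(docs, labels)
print(sum(ys), len(ys) - sum(ys))  # 15 15
```

Oversampling is normally applied to the training portion only, so that development and test scores still reflect the original class distribution.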

Pre-training Embeddings

We consider a variety of word embeddings:

- general-domain Google News vectors.

- PubMed (PM).

- PMC.

- Wikipedia texts.

- PMC-based vectors introduced for the BioASQ shared task.

- Finally, we consider two variants of PubMed-based vectors introduced by Chiu et al. (2016).
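Swapping one set of pre-trained vectors for another amounts to rebuilding the embedding matrix for the task vocabulary. The sketch below assumes the vectors are stored in the word2vec text format (a "count dimension" header followed by one word per line); the function names, the toy file and the initialisation range for out-of-vocabulary words are illustrative assumptions.

```python
import io
import random


def load_text_vectors(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vd'."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors, dim


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """One row per vocabulary word; words without a vector get small random values."""
    rng = random.Random(seed)
    matrix, missing = [], []
    for word in vocab:
        if word in vectors:
            matrix.append(vectors[word])
        else:
            missing.append(word)
            matrix.append([rng.uniform(-0.25, 0.25) for _ in range(dim)])
    return matrix, missing


# Toy vector file, invented for illustration.
toy_file = io.StringIO("2 3\ntumor 0.1 0.2 0.3\ncell 0.4 0.5 0.6\n")
vectors, dim = load_text_vectors(toy_file)
matrix, missing = build_embedding_matrix(
    ["tumor", "angiogenesis", "cell"], vectors, dim
)
print(missing)  # ['angiogenesis']
```

The length of `missing` gives a quick indication of how well a given set of vectors covers the task vocabulary.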

Pre-training Embeddings

On development data:

Embeddings     F-score (%)
Chiu-win-30    85.6
Chiu-win-2     86.1
Wiki+PM+PMC    84.9
PM+PMC         85.2
PM             85.2
PMC            85.3
GoogleNews     86.1

Pre-training Embeddings

On development data:

Embeddings     AUC (%)
Chiu-win-30    97.3
Chiu-win-2     97.6
Wiki+PM+PMC    97.2
PM+PMC         97.2
PM             97.3
PMC            97.2
GoogleNews     97.5

Selecting Filter Sizes

• The base model uses three filter sizes: 3, 4, 5.

• We investigate what happens to performance when changing the filter sizes (1-10) and the number of distinct filter sizes (1-5).

• We keep the total number of filters fixed for each filter size.
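To get a feel for the size of this search, the sketch below enumerates candidate configurations under the assumption that any combination of one to five distinct sizes between 1 and 10 may be used. The slides do not say whether only some combinations (for example contiguous ranges) were actually tried, so the count is an upper bound under that assumption; the function name is hypothetical.

```python
from itertools import combinations


def filter_size_configs(min_size=1, max_size=10, max_sizes_used=5):
    """Yield every set of 1..max_sizes_used distinct filter sizes."""
    sizes = range(min_size, max_size + 1)
    for k in range(1, max_sizes_used + 1):
        yield from combinations(sizes, k)


configs = list(filter_size_configs())
print(len(configs))          # 637
print((3, 4, 5) in configs)  # True, the base model's configuration
```

Since each configuration means training and evaluating ten binary classifiers, even this one hyperparameter is costly to explore exhaustively.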

Baseline

• CNN with original Kim (2014) hyperparameters

• SVM with Bag-of-Words features

• SVM with rich features (Baker et al. 2016)
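For readers unfamiliar with the bag-of-words representation used by the simplest baseline, here is a minimal sketch of turning text into count vectors. The slides do not specify the tokenisation, weighting or SVM implementation, so the regular-expression tokeniser and raw counts below are assumptions, and the example sentences are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts):
    """Map each distinct training token to a column index."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(text, vocab):
    """Count vector over the vocabulary; unseen tokens are ignored."""
    counts = Counter(tokenize(text))
    # Dicts preserve insertion order, so iteration follows the column indices.
    return [counts.get(token, 0) for token in vocab]


train_texts = ["Tumor cells evade apoptosis", "Angiogenesis feeds tumor growth"]
vocab = build_vocab(train_texts)
print(bow_vector("tumor tumor growth", vocab))  # [2, 0, 0, 0, 0, 0, 1]
```

Vectors of this kind would be the input to the SVM in the SVM-BoW baseline, whereas the CNN consumes the word sequence directly.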

Baseline – Feature-rich SVM

[Figure: the feature-rich SVM pipeline, from input article to classifier. Processing stages shown: data cleaning, metadata extraction, tokenisation, POS tagging, lemmatisation, dependency parsing, named entity recognition, verb class clustering, n-gram extraction, feature encoding, feature selection, classifier. Extracted features: LBoW, noun bigrams, GR, verb classes, Chem & MeSH, named entities.]

Results (Average F-score %)

System      Average F-score (%)
SVM-BoW     69.2
SVM-Rich    76.8
CNN-Base    76.6
CNN-Tuned   81
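The averaged figures above combine the scores of the ten binary classifiers. A minimal sketch of one way to compute such an average follows, assuming the balanced F1 measure and an unweighted mean over classes; the slides do not state the exact variant, and the toy predictions are invented.

```python
def f_score(gold, pred):
    """Balanced F-score (F1) for one binary task; 0.0 if nothing is correct."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f_score(gold_by_class, pred_by_class):
    """Unweighted mean of the per-class F-scores."""
    scores = [f_score(gold_by_class[c], pred_by_class[c]) for c in gold_by_class]
    return sum(scores) / len(scores)


# Two toy classes: A has precision 1.0 and recall 0.5, B the reverse.
gold = {"A": [1, 1, 0, 0], "B": [1, 0, 0, 0]}
pred = {"A": [1, 0, 0, 0], "B": [1, 1, 0, 0]}
print(round(average_f_score(gold, pred), 3))  # 0.667
```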

Results (Average AUC %)

System      Average AUC (%)
SVM-BoW     93.1
SVM-Rich    94.9
CNN-Base    97.1
CNN-Tuned   97.8
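AUC, the area under the ROC curve, measures how well a classifier ranks positives above negatives, independently of any decision threshold. The sketch below computes it from its pairwise definition; it is quadratic in the number of examples and intended only as an illustration, with invented scores.

```python
def roc_auc(gold, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for g, s in zip(gold, scores) if g == 1]
    neg = [s for g, s in zip(gold, scores) if g == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


gold = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
# Five of the six positive/negative pairs are ordered correctly.
print(round(roc_auc(gold, scores), 3))  # 0.833
```

Because most hallmarks have few positive examples, AUC and F-score can give different pictures of the same classifier, which is why both are reported.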

Results (F-score %)

[Figure: per-hallmark F-score (%) for CNN-Tuned, CNN-Base, SVM-Rich and SVM-BoW across the ten hallmarks; the axis runs from 50 to 95.]

Results

• CNNs greatly reduce the burden of handcrafting and feature engineering for text classification.

• They are more portable than an SVM pipeline.

• The hyperparameter space is large, and exhaustive search is prohibitive.

Conclusions

• We investigated the application of CNNs to the biomedical domain text classification task of identifying the Hallmarks of Cancer.

• We demonstrated that a CNN using only text and embeddings can achieve performance competitive with a feature-heavy SVM classifier.

• We further adapted the CNN to the task by oversampling positive examples, using embeddings induced from biomedical text, and tuning hyperparameters. We achieve a substantive improvement over the previous state of the art.

Thank you for listening!
