Using Machine Learning Techniques in Stylometry Ramyaa, Congzhou He, Dr. Khaled Rasheed

Post on 05-Jan-2016

 Using Machine Learning Techniques in Stylometry

Ramyaa, Congzhou He, Dr. Khaled Rasheed

Introduction

• Stylometry

• Major problems facing stylometry

• Decision trees

• Artificial Neural Networks

Stylometry

• The measure of style

• Fundamental assumption: there is an unconscious aspect to an author’s style that cannot be consciously manipulated but which possesses quantifiable and distinctive features.

• Major applications today: clinical tools in disease detection and forensic tools in court trials, text categorization, author attribution.

Major problems facing stylometry

• No consensus as to which characteristic features to use

• Which indicators to use – word length, sentence length, tests of position, the distribution of once-occurring words (hapax legomena), the frequencies of marker words, letter sequence, syllable length or syntactical measures?
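Two of the indicators named above, hapax legomena and word length, can be illustrated with a minimal Python sketch. The tokenizer (lowercased runs of letters and apostrophes) and the function names are assumptions made for illustration; the slides do not say how the authors tokenized their texts.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a token is a lowercased run of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def hapax_legomena(text):
    """Return, in sorted order, the words that occur exactly once."""
    counts = Counter(tokenize(text))
    return sorted(word for word, n in counts.items() if n == 1)


def word_length_distribution(text):
    """Map each word length to the number of tokens of that length."""
    counts = Counter(len(word) for word in tokenize(text))
    return dict(sorted(counts.items()))


sample = "It was the best of times, it was the worst of times."
print(hapax_legomena(sample))            # ['best', 'worst']
print(word_length_distribution(sample))  # {2: 4, 3: 4, 4: 1, 5: 3}
```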

Major problems facing stylometry

• No consensus as to what methodology or techniques to apply in standard research

Which techniques to use: statistical methods or automated pattern recognition methods?

• Statistical methods: e.g. Bayesian analysis, cluster analysis, and the widely used Principal Components Analysis (PCA).

• Automated pattern recognition methods: e.g. Artificial Neural Networks (ANN), Genetic Programming (GP).

Significant Features of Our Paper

• Recognizing the works of five authors

• Use of unconventional indicators such as punctuation marks as well as standard indicators such as function words

• Only 21 indicators, which shows that, contrary to common belief, not many features are required for high-performance classification

Data Extraction

78 samples from five popular nineteenth-century authors

– Jane Austen
• Pride and Prejudice Chapters 1-5
• Mansfield Park Chapters 1-5
• Emma Chapters 1-5
• Sense and Sensibility Chapters 1-5

– Charles Dickens
• David Copperfield Chapters 1-5
• Great Expectations Chapters 1-5
• Hard Times Chapters 1-6
• A Tale of Two Cities Chapters 1-6

– William Thackeray
• Vanity Fair Chapters 1-6
• Men’s Wives Chapters 1-6

– Emily Bronte
• Wuthering Heights Chapters 1-12

– Charlotte Bronte
• Jane Eyre Chapters 1-12

21 attributes as input

• type-token ratio
• mean word length
• mean sentence length
• standard deviation of sentence length
• mean paragraph length
• chapter length
• number of commas per thousand tokens
• number of semicolons per thousand tokens
• number of quotation marks per thousand tokens
• number of exclamation marks per thousand tokens
• number of hyphens per thousand tokens
• number of and’s per thousand tokens
• number of but’s per thousand tokens
• number of however’s per thousand tokens
• number of if’s per thousand tokens
• number of that’s per thousand tokens
• number of more’s per thousand tokens
• number of must’s per thousand tokens
• number of might’s per thousand tokens
• number of this’s per thousand tokens
• number of very’s per thousand tokens
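A few of these attributes can be computed with a short Python sketch. It is only an illustration under stated assumptions: the tokenizer, the naive sentence split on ".", "!" and "?", the use of the population standard deviation, and the function name extract_features are all hypothetical choices, since the slides do not describe the authors' extraction procedure.

```python
import re
import statistics

WORD = re.compile(r"[A-Za-z']+")


def per_thousand(count, n_tokens):
    """Scale a raw count to a rate per thousand tokens."""
    return 1000.0 * count / n_tokens if n_tokens else 0.0


def extract_features(text, marker_words=("and", "but", "however")):
    """Compute a subset of the attributes listed above for one sample."""
    tokens = WORD.findall(text.lower())
    # Assumption: sentences end at '.', '!' or '?'; abbreviations are ignored.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sentence_lengths = [len(WORD.findall(s)) for s in sentences]
    n = len(tokens)
    features = {
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
        "mean_word_length": statistics.fmean([len(t) for t in tokens]) if n else 0.0,
        "mean_sentence_length": statistics.fmean(sentence_lengths) if sentence_lengths else 0.0,
        # Assumption: population standard deviation.
        "sd_sentence_length": statistics.pstdev(sentence_lengths) if sentence_lengths else 0.0,
        "commas_per_1000": per_thousand(text.count(","), n),
        "semicolons_per_1000": per_thousand(text.count(";"), n),
    }
    for word in marker_words:
        features[f"{word}_per_1000"] = per_thousand(tokens.count(word), n)
    return features


for name, value in extract_features("I came, and I saw; but I left. However, I returned!").items():
    print(f"{name}: {value:.3f}")
```

Rates per thousand tokens make chapters of different lengths comparable, which is presumably why most of the attributes are expressed that way.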

Decision Tree Learning

• See5 package by Quinlan, the successor to C4.5, which in turn descends from the ID3 algorithm

• Features of decision trees: results easy to understand; focus on individual attributes

• Use fuzzy thresholds for continuous values

• Either winnowing or boosting gives the best result: 82.4% accuracy, significantly above random guess (20%).
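To show how a decision tree can focus on one continuous attribute at a time, here is a minimal Python sketch that picks a threshold by information gain, the split criterion of the ID3 family. It is not See5 itself: fuzzy thresholds, winnowing and boosting are not reproduced, and the attribute values and author labels below are hypothetical, not taken from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold` with the highest gain."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold fits between equal values
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical semicolons-per-thousand-tokens values for two made-up authors.
values = [2.0, 3.0, 4.0, 9.0, 10.0, 11.0]
labels = ["A", "A", "A", "B", "B", "B"]
print(best_threshold(values, labels))  # (6.5, 1.0)
```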

Result from winnowing: Evaluation on test data (17 cases):

Decision Tree
----------------
Size Errors
5 3(17.6%) <<

(a) (b) (c) (d) (e) <-classified as
---- ---- ---- ---- ----
4 1 (a): class jane
5 1 (b): class charles
2 (c): class william
1 1 (d): class emily
2 (e): class charlotte

Results from boosting: Evaluation on test data (17 cases):

boost 3(17.6%) <<

 

(a) (b) (c) (d) (e) <-classified as

---- ---- ---- ---- ----

4 1 (a): class jane

5 1 (b): class charles

2 (c): class william

1 1 (d): class emily

2 (e): class charlotte
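The reported figures can be checked with a little arithmetic: 3 errors on 17 test cases leave 14 correct, and five authors give a 20% chance level. The Python sketch below also estimates how unlikely 14 or more correct answers would be under pure guessing. The binomial model with a uniform 1-in-5 success probability is an assumption added here for illustration; it is not a test the authors report.

```python
from math import comb


def accuracy(cases, errors):
    """Fraction of test cases classified correctly."""
    return (cases - errors) / cases


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


cases, errors, n_authors = 17, 3, 5
acc = accuracy(cases, errors)
chance = 1 / n_authors  # assumption: every author is equally likely under guessing
p_value = binomial_tail(cases, cases - errors, chance)

print(f"accuracy = {acc:.1%}, chance = {chance:.0%}")  # accuracy = 82.4%, chance = 20%
print(f"P(at least {cases - errors} correct by guessing) = {p_value:.2e}")
```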

Artificial Neural Network (ANN) Learning

• practical and powerful method of pattern recognition

• can invent new features that are not explicit in the input

• all attributes taken into consideration

• inductive rules not accessible to humans

• Many architectures were tried.

• Kohonen SOM, probabilistic nets, and nets based on statistical models were tried

• Back propagation feed forward nets gave the best results

• The best network had 21 inputs and 10 outputs

• The best architecture had 15 hidden nodes in the first hidden layer and 11 in the second
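The layer sizes above (21 inputs, hidden layers of 15 and 11 nodes, 10 outputs) can be sketched as a forward pass in plain Python. Everything beyond the layer sizes is an assumption for illustration: the sigmoid activation, the random initial weights, and the function names are hypothetical, and training by back propagation is not shown.

```python
import math
import random

LAYER_SIZES = [21, 15, 11, 10]  # inputs, first hidden, second hidden, outputs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(sizes, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Propagate one input vector through every layer and return the outputs."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


network = init_network(LAYER_SIZES)
outputs = forward(network, [0.5] * 21)  # placeholder input, not real feature values
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in network)
print(len(outputs), n_parameters)  # 10 626
```

With these sizes the network has 626 trainable weights and biases, which is large relative to 78 samples, so a held-out validation set matters when judging its accuracy.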

Predictor analysis

[Figure: error (axis from -1 to 1) plotted by pattern; series JA, CD, WT, EB, CB]

Results from ANN

(a) (b) (c) (d) (e) classified as

---- ---- ---- ---- ----

2 (a): class jane

2 (b): class charles

2 (c): class william

2 4 (d): class emily

5 (e): class charlotte

 

Misclassifications:

• No. 4: Pride and Prejudice Chapter 3 is misclassified as written by Charlotte Bronte

• Nos. 67 & 71: A Tale of Two Cities Chapter 1 and Chapter 5 are misclassified as written by William Thackeray.

• All the other samples are correctly classified. (88.2% accuracy on the validation set)

Conclusion

• Very good results were obtained in both experiments

• Artificial Intelligence provides stylometry with excellent classifiers that require fewer input variables than traditional statistics

• Future Research

– GA/GP

– A general classifier applicable to all authors

– Different sets of features

Thank you

?