Words, Words, Words: Reading Shakespeare with Python
TRANSCRIPT
Words, words, words
Reading Shakespeare with Python
Prologue
Motivation
How can we use Python to supplement our reading of Shakespeare?
How can we get Python to read for us?
Act I
Why Shakespeare?
Polonius: What do you read, my lord?
Hamlet: Words, words, words.
P: What is the matter, my lord?
H: Between who?
P: I mean, the matter that you read, my lord.
--II.2.184
Why Shakespeare?
(Also the XML)
(thank you, https://github.com/severdia/PlayShakespeare.com-XML !!!)
Shakespeare XML
Shakespeare XML
Challenges
• Language, especially English, is messy
• Texts are usually unstructured
• Pronunciation is not standard
• Reading is pretty hard!
Humans and Computers
Nuance
Ambiguity
Close reading
Counting
Repetitive tasks
Making graphs
Humans are good at: Computers are good at:
Act II
(leveraging metadata)
Who is the main character in _______?
Who is the main character in Hamlet?
[Bar chart: number of lines per character in Hamlet]
Who is the main character in King Lear?
[Bar chart: number of lines per character in King Lear]
Who is the main character in Macbeth?
[Bar chart: number of lines per character in Macbeth]
Who is the main character in Othello?
[Bar chart: number of lines per character in Othello]
Iago and Othello, Detail
[Bar chart detail: number of lines, Iago vs. Othello]
Obligatory Social Network
Act III
First steps with natural language processing (NLP)
What are Shakespeare’s most interesting rhymes?
Shakespeare’s Sonnets
• A sonnet is a 14-line poem
• There are many different rhyme schemes a sonnet can have; Shakespeare was pretty unique in choosing just one
• This is a huge win for us, since we can “hard code” his rhyme scheme in our analysis
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
  So long as men can breathe or eyes can see,
  So long lives this, and this gives life to thee.
http://www.poetryfoundation.org/poem/174354
ababcdcdefefgg
Sonnet 18
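Since Shakespeare's scheme is fixed, we can hard-code it in our analysis. A minimal sketch, with illustrative names (not the talk's actual code), grouping each line's last word by its rhyme letter:

```python
# A minimal sketch: group line-ending words by rhyme letter, using
# Shakespeare's fixed sonnet scheme. Names here are illustrative.
SCHEME = "ababcdcdefefgg"

def rhyme_pairs(lines, scheme=SCHEME):
    """Group the last word of each line by its rhyme letter."""
    groups = {}
    for line, letter in zip(lines, scheme):
        last_word = line.rstrip(".,;:?!").split()[-1]
        groups.setdefault(letter, []).append(last_word)
    return list(groups.values())

quatrain = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "And summer's lease hath all too short a date;",
]
print(rhyme_pairs(quatrain))  # [['day', 'May'], ['temperate', 'date']]
```

Fed the whole sonnet, this yields seven groups: six rhyme pairs from the quatrains plus the final couplet.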
Rhyme Distribution
• Most common rhymes
• nltk.FreqDist
Frequency Distribution
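A minimal sketch of `nltk.FreqDist` (a `Counter`-like class) on a made-up list of rhyme words, not the talk's actual data:

```python
# FreqDist counts how often each sample appears; the rhyme words
# below are illustrative, not extracted from the sonnets.
from nltk import FreqDist

rhyme_words = ["thee", "me", "thee", "eyes", "thee", "me", "lies"]
fdist = FreqDist(rhyme_words)

print(fdist.most_common(2))  # [('thee', 3), ('me', 2)]
print(fdist["thee"])         # 3
```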
• Given a word, what is the frequency distribution of the words that rhyme with it?
• nltk.ConditionalFreqDist
Conditional Frequency Distribution
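A minimal sketch of `nltk.ConditionalFreqDist`: condition on a word, then count the words paired (rhymed) with it. The pairs below are made up:

```python
# ConditionalFreqDist takes (condition, sample) pairs and keeps one
# FreqDist per condition; the rhyme pairs here are illustrative.
from nltk import ConditionalFreqDist

pairs = [("thee", "me"), ("thee", "usury"), ("thee", "me"), ("day", "May")]
cfd = ConditionalFreqDist(pairs)

print(cfd["thee"].most_common())  # [('me', 2), ('usury', 1)]
print(cfd["day"]["May"])          # 1
```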
Rhyme Distribution
Rhyme Distribution
1) “Boring” rhymes: “me” and “thee”
2) “Lopsided” rhymes: “thee” and “usury”
Interesting Rhymes?
Act IV
Classifiers 101
Writing code that reads
Our Classifier
Can we write code to tell if a given speech is from a tragedy or comedy?
● Requires labeled text
  ○ (in this case, speeches labeled by genre)
  ○ [(<speech>, <genre>), ...]
● Requires “training”
● Predicts labels of text
Classifiers: overview
Classifiers: ingredients
● Classifier
● Vectorizer, or Feature Extractor
● Classifiers only interact with features, not the text itself
Vectorizers (or Feature Extractors)
● A vectorizer, or feature extractor, transforms a text into quantifiable information about the text.
● Theoretically, these features could be anything, e.g.:
  ○ How many capital letters does the text contain?
  ○ Does the text end with an exclamation point?
● In practice, a common model is “Bag of Words”.
Bag of Words is a kind of feature extraction where:
● The set of features is the set of all words in the text you’re analyzing
● A single text is represented by how many of each word appears in it
Bag of Words
Bag of Words: Simple Example
Two texts:
● “Hello, Will!”
● “Hello, Globe!”
Bag of Words: Simple Example
Two texts:
● “Hello, Will!”
● “Hello, Globe!”
Bag: [“Hello”, “Will”, “Globe”]
“Hello” “Will” “Globe”
Bag of Words: Simple Example
Two texts:
● “Hello, Will!”
● “Hello, Globe!”
Bag: [“Hello”, “Will”, “Globe”]
                 “Hello”  “Will”  “Globe”
“Hello, Will”       1       1       0
“Hello, Globe”      1       0       1
Bag of Words: Simple Example
Two texts:
● “Hello, Will!”
● “Hello, Globe!”
                 “Hello”  “Will”  “Globe”
“Hello, Will”       1       1       0
“Hello, Globe”      1       0       1
“Hello, Will” → “A text that contains one instance of the word ‘Hello’, one instance of the word ‘Will’, and no instances of the word ‘Globe’.”
(Less readable for us, more readable for computers!)
Live Vectorizer:
Why are these called “Vectorizers”?
text_1 = "words, words, words"
text_2 = "words, words, birds"
[Plot: text_1 and text_2 as points in the plane; x-axis: # times “words” is used, y-axis: # times “birds” is used]
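Vectorizing these two texts makes the “vector” concrete: each text becomes a point whose coordinates are word counts. A sketch (not the talk's code), with columns in `CountVectorizer`'s alphabetical order ("birds", "words"):

```python
# Each row is one text as a point in "word-count space".
from sklearn.feature_extraction.text import CountVectorizer

text_1 = "words, words, words"
text_2 = "words, words, birds"

counts = CountVectorizer().fit_transform([text_1, text_2])
print(counts.toarray())
# [[0 3]    text_1: 0 "birds", 3 "words"
#  [1 2]]   text_2: 1 "birds", 2 "words"
```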
Act V
Putting it all Together
Classifier Workflow
Classification: Steps
1) Split pre-labeled text into training and testing sets
2) Vectorize text (extract features)
3) Train classifier
4) Test classifier
Text → Features → Labels
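Step 1 can be sketched with scikit-learn's `train_test_split`; the speeches and genres below are illustrative stand-ins for the labeled XML data:

```python
# Split pre-labeled speeches into training and testing sets.
from sklearn.model_selection import train_test_split

speeches = [
    "O, I die, Horatio...",             # tragedy
    "If music be the food of love...",  # comedy
    "Out, damned spot!",                # tragedy
    "All the world's a stage...",       # comedy
]
genres = ["tragedy", "comedy", "tragedy", "comedy"]

# hold out 25% of the labeled speeches for testing
train_speeches, test_speeches, train_labels, test_labels = train_test_split(
    speeches, genres, test_size=0.25, random_state=0)

print(len(train_speeches), len(test_speeches))  # 3 1
```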
Training
Classifier Training
# assumes train_speeches and train_labels from the train/test split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# learn the vocabulary from the training speeches,
# then turn each speech into a vector of word counts
vectorizer = CountVectorizer()
vectorizer.fit(train_speeches)
train_features = vectorizer.transform(train_speeches)

# train a Naive Bayes classifier on (features, labels)
classifier = MultinomialNB()
classifier.fit(train_features, train_labels)
Testing
test_speech = test_speeches[0]
print(test_speech)
Farewell, Andronicus, my noble father,
The woefull'st man that ever liv'd in Rome.
Farewell, proud Rome, till Lucius come again;
He loves his pledges dearer than his life.
...
(From Titus Andronicus, III.1.288-300)
Classifier Testing
Classifier Testing
test_speech = test_speeches[0]
test_label = test_labels[0]
test_features = vectorizer.transform([test_speech])
prediction = classifier.predict(test_features)[0]
print(prediction)
>>> 'tragedy'
print(test_label)
>>> 'tragedy'
test_features = vectorizer.transform(test_speeches)
print(classifier.score(test_features, test_labels))
>>> 0.75427682737169521
Classifier Testing
Critiques
• "Bag of Words" assumes a correlation between word use and label. This correlation is stronger in some cases than in others.
• Beware of highly-disproportionate training data.
Epilogue