Words, Words, Words: Reading Shakespeare with Python
TRANSCRIPT
Words, words, words
Reading Shakespeare with Python
Prologue
Motivation
How can we use Python to supplement our reading of Shakespeare?
How can we get Python to read for us?
Act I
Why Shakespeare?
Polonius: What do you read, my lord?
Hamlet: Words, words, words.
P: What is the matter, my lord?
H: Between who?
P: I mean, the matter that you read, my lord.
--II.2.184
Why Shakespeare?
(Also the XML)
(thank you, https://github.com/severdia/PlayShakespeare.com-XML !!!)
Shakespeare XML
Shakespeare XML
Challenges
• Language, especially English, is messy
• Texts are usually unstructured
• Pronunciation is not standard
• Reading is pretty hard!
Humans and Computers
Nuance
Ambiguity
Close reading
Counting
Repetitive tasks
Making graphs
Humans are good at: Computers are good at:
Act II
(leveraging metadata)
Who is the main character in _______?
Who is the main character in Hamlet?
[Bar chart: number of lines per character in Hamlet]
Who is the main character in King Lear?
[Bar chart: number of lines per character in King Lear]
Who is the main character in Macbeth?
[Bar chart: number of lines per character in Macbeth]
Who is the main character in Othello?
[Bar chart: number of lines per character in Othello]
Iago and Othello, Detail
[Bar chart detail: number of lines, Iago vs. Othello]
Obligatory Social Network
Act III
First steps with natural language processing (NLP)
What are Shakespeare’s most interesting rhymes?
Shakespeare’s Sonnets
• A sonnet is a 14-line poem
• There are many different rhyme schemes a sonnet can have; Shakespeare was pretty unique in choosing just one
• This is a huge win for us, since we can “hard code” his rhyme scheme in our analysis
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
  So long as men can breathe or eyes can see,
  So long lives this, and this gives life to thee.
http://www.poetryfoundation.org/poem/174354
ababcdcdefefgg
Sonnet 18
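Since Shakespeare's scheme is fixed, we can hard-code it in our analysis. A minimal sketch, with illustrative names (not the talk's actual code), grouping each line's last word by its rhyme letter:

```python
# A minimal sketch: group line-ending words by rhyme letter, using
# Shakespeare's fixed sonnet scheme. Names here are illustrative.
SCHEME = "ababcdcdefefgg"

def rhyme_pairs(lines, scheme=SCHEME):
    """Group the last word of each line by its rhyme letter."""
    groups = {}
    for line, letter in zip(lines, scheme):
        last_word = line.rstrip(".,;:?!").split()[-1]
        groups.setdefault(letter, []).append(last_word)
    return list(groups.values())

quatrain = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "And summer's lease hath all too short a date;",
]
print(rhyme_pairs(quatrain))  # [['day', 'May'], ['temperate', 'date']]
```

Fed the whole sonnet, this yields seven groups: six rhyme pairs from the quatrains plus the final couplet.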
Rhyme Distribution
• Most common rhymes
• nltk.FreqDist
Frequency Distribution
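A minimal sketch of `nltk.FreqDist` (a `Counter`-like class) on a made-up list of rhyme words, not the talk's actual data:

```python
# FreqDist counts how often each sample appears; the rhyme words
# below are illustrative, not extracted from the sonnets.
from nltk import FreqDist

rhyme_words = ["thee", "me", "thee", "eyes", "thee", "me", "lies"]
fdist = FreqDist(rhyme_words)

print(fdist.most_common(2))  # [('thee', 3), ('me', 2)]
print(fdist["thee"])         # 3
```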
• Given a word, what is the frequency distribution of the words that rhyme with it?
• nltk.ConditionalFreqDist
Conditional Frequency Distribution
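A minimal sketch of `nltk.ConditionalFreqDist`: condition on a word, then count the words paired (rhymed) with it. The pairs below are made up:

```python
# ConditionalFreqDist takes (condition, sample) pairs and keeps one
# FreqDist per condition; the rhyme pairs here are illustrative.
from nltk import ConditionalFreqDist

pairs = [("thee", "me"), ("thee", "usury"), ("thee", "me"), ("day", "May")]
cfd = ConditionalFreqDist(pairs)

print(cfd["thee"].most_common())  # [('me', 2), ('usury', 1)]
print(cfd["day"]["May"])          # 1
```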
Rhyme Distribution
Rhyme Distribution
1) “Boring” rhymes: “me” and “thee”
2) “Lopsided” rhymes: “thee” and “usury”
Interesting Rhymes?
Act IV
Classifiers 101
Writing code that reads
Our Classifier
Can we write code to tell if a given speech is from a tragedy or comedy?
● Requires labeled text
  ○ (in this case, speeches labeled by genre)
  ○ [(<speech>, <genre>), ...]
● Requires “training”
● Predicts labels of text
Classifiers: overview
Classifiers: ingredients
● Classifier
● Vectorizer, or Feature Extractor
● Classifiers only interact with features, not the text itself
Vectorizers (or Feature Extractors)
● A vectorizer, or feature extractor, transforms a text into quantifiable information about the text.
● Theoretically, these features could be anything, e.g.:
  ○ How many capital letters does the text contain?
  ○ Does the text end with an exclamation point?
● In practice, a common model is “Bag of Words”.
Bag of Words is a kind of feature extraction where:
● The set of features is the set of all words in the text you’re analyzing
● A single text is represented by how many of each word appears in it
Bag of Words
Bag of Words: Simple Example
Two texts:
● “Hello, Will!”
● “Hello, Globe!”
Bag of Words: Simple Example
Two texts:
● “Hello, Will!”
● “Hello, Globe!”
Bag: [“Hello”, “Will”, “Globe”]
“Hello” “Will” “Globe”
Bag of Words: Simple Example
Two texts:
● “Hello, Will!”
● “Hello, Globe!”
Bag: [“Hello”, “Will”, “Globe”]
                 “Hello”  “Will”  “Globe”
“Hello, Will”       1       1       0
“Hello, Globe”      1       0       1
Bag of Words: Simple Example
Two texts:
● “Hello, Will!”
● “Hello, Globe!”
                 “Hello”  “Will”  “Globe”
“Hello, Will”       1       1       0
“Hello, Globe”      1       0       1
“Hello, Will” → “A text that contains one instance of the word ‘Hello’, one instance of the word ‘Will’, and no instances of the word ‘Globe’.”
(Less readable for us, more readable for computers!)
Live Vectorizer:
Why are these called “Vectorizers”?
text_1 = "words, words, words"
text_2 = "words, words, birds"
[Plot: text_1 and text_2 as points in the plane; x-axis: # times “words” is used, y-axis: # times “birds” is used]
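Vectorizing these two texts makes the “vector” concrete: each text becomes a point whose coordinates are word counts. A sketch (not the talk's code), with columns in `CountVectorizer`'s alphabetical order ("birds", "words"):

```python
# Each row is one text as a point in "word-count space".
from sklearn.feature_extraction.text import CountVectorizer

text_1 = "words, words, words"
text_2 = "words, words, birds"

counts = CountVectorizer().fit_transform([text_1, text_2])
print(counts.toarray())
# [[0 3]    text_1: 0 "birds", 3 "words"
#  [1 2]]   text_2: 1 "birds", 2 "words"
```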
Act V
Putting it all Together
Classifier Workflow
Classification: Steps
1) Split pre-labeled text into training and testing sets
2) Vectorize text (extract features)
3) Train classifier
4) Test classifier
Text → Features → Labels
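Step 1 can be sketched with scikit-learn's `train_test_split`; the speeches and genres below are illustrative stand-ins for the labeled XML data:

```python
# Split pre-labeled speeches into training and testing sets.
from sklearn.model_selection import train_test_split

speeches = [
    "O, I die, Horatio...",             # tragedy
    "If music be the food of love...",  # comedy
    "Out, damned spot!",                # tragedy
    "All the world's a stage...",       # comedy
]
genres = ["tragedy", "comedy", "tragedy", "comedy"]

# hold out 25% of the labeled speeches for testing
train_speeches, test_speeches, train_labels, test_labels = train_test_split(
    speeches, genres, test_size=0.25, random_state=0)

print(len(train_speeches), len(test_speeches))  # 3 1
```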
Training
Classifier Training
# assumes train_speeches and train_labels from the train/test split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# learn the vocabulary from the training speeches,
# then turn each speech into a vector of word counts
vectorizer = CountVectorizer()
vectorizer.fit(train_speeches)
train_features = vectorizer.transform(train_speeches)

# train a Naive Bayes classifier on (features, labels)
classifier = MultinomialNB()
classifier.fit(train_features, train_labels)
Testing
test_speech = test_speeches[0]
print(test_speech)
Farewell, Andronicus, my noble father,
The woefull'st man that ever liv'd in Rome.
Farewell, proud Rome, till Lucius come again;
He loves his pledges dearer than his life.
...
(From Titus Andronicus, III.1.288-300)
Classifier Testing
Classifier Testing
test_speech = test_speeches[0]
test_label = test_labels[0]
test_features = vectorizer.transform([test_speech])
prediction = classifier.predict(test_features)[0]
print(prediction)
>>> 'tragedy'
print(test_label)
>>> 'tragedy'
test_features = vectorizer.transform(test_speeches)
print(classifier.score(test_features, test_labels))
>>> 0.75427682737169521
Classifier Testing
Critiques
• "Bag of Words" assumes a correlation between word use and label. This correlation is stronger in some cases than in others.
• Beware of highly-disproportionate training data.
Epilogue