Text Categorization
Moshe Koppel
Lecture 4: Author Profiling
With Shlomo Argamon, Jonathan Schler, James Pennebaker, Kfir Zigdon and others
Profiling
In real life:
1. We don’t have a closed set of candidate authors
2. We don’t have writing samples from each of them
We can still try to say something about the author:
• Gender
• Age group
• Linguistic background
• …
Which is Male/Female?
• My aim in this article is to show that given a relevance theoretic approach to utterance interpretation, it is possible to develop a better understanding of what some of these so-called apposition markers indicate. It will be argued that the decision to put something in other words is essentially a decision about style, a point which is, perhaps, anticipated by Burton-Roberts when he describes loose apposition as a rhetorical device. However, he does not justify this suggestion by giving the criteria for classifying a mode of expression as a rhetorical device. Nor does he specify what kind of effects might be achieved by a reformulation or explain how it achieves those effects. In this paper I follow Sperber and Wilson's (1986) suggestion that rhetorical devices like metaphor, irony and repetition are particular means of achieving relevance. As I have suggested, the corrections that are made in unplanned discourse are also made in the pursuit of optimal relevance. However, these are made because the speaker recognises that the original formulation did not achieve optimal relevance .
• The main aim of this article is to propose an exercise in stylistic analysis which can be employed in the teaching of English language. It details the design and results of a workshop activity on narrative carried out with undergraduates in a university department of English. The methods proposed are intended to enable students to obtain insights into aspects of cohesion and narrative structure: insights, it is suggested, which are not as readily obtainable through more traditional techniques of stylistic analysis. The text chosen for analysis is a short story by Ernest Hemingway comprising only 11 sentences. A jumbled version of this story is presented to students who are asked to assemble a cohesive and well formed version of the story. Their re-constructions are then compared with the original Hemingway version.
British National Corpus
• 920 documents labelled for
– author gender
– document genre
• Used 566, controlled for genre
Fiction / Female 132
Fiction / Male 132
Non-fiction / Female 151
Non-fiction / Male 151
Arts (Non-academic) 16
Arts (Academic) 24
Belief & Thought 24
Biography 54
Commerce 10
Leisure 16
Science 26
Soc. Sci. (Non-ac.) 52
Soc. Sci. (Ac.) 38
World Affairs 42
Experiment
Features: 400+ function words (FW); 600+ POS n-grams
Learner: exponential gradient / linear SVM
Test: 10-fold cross-validation
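As a rough illustration of this setup, the sketch below turns a document into relative function-word frequencies and splits a corpus of 566 documents into 10 folds. The short FUNCTION_WORDS list and both helper names are assumptions made for this example. The lecture's list has 400+ entries, and the learner itself (exponential gradient or linear SVM) is not shown.

```python
import re

# Tiny stand-in list (assumption); the lecture uses 400+ function words.
FUNCTION_WORDS = ["a", "the", "as", "she", "for", "with", "not", "of", "and", "in"]

def function_word_vector(text, vocab=FUNCTION_WORDS):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [tokens.count(w) / total for w in vocab]

def k_fold_indices(n_docs, k=10):
    """Yield (train, test) index lists; fold i tests on every k-th doc starting at i."""
    for fold in range(k):
        test = [i for i in range(n_docs) if i % k == fold]
        train = [i for i in range(n_docs) if i % k != fold]
        yield train, test

vec = function_word_vector("She went to the market with the basket")
folds = list(k_fold_indices(566, k=10))
```

Each fold's training documents would be fed to the learner and the held-out documents used only for scoring, so every document is tested exactly once.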
Results per Feature Set
[Bar chart: accuracy (%), axis from 50 to 85, for All docs, Fiction and Non-Fiction; series: FW, POS, FW+POS]
• Handle fiction and non-fiction separately
• Use full feature set
Results per Genre
Testing on genre                  # of docs   Train on all   Train on fiction
Fiction                           264         74.5           79.5
Fiction / Female                  132         74.8           81.7
Fiction / Male                    132         74.2           77.3

Testing on genre                  # of docs   Train on all   Train on non-fiction
Non-fiction                       302         79.7           82.6
Non-fiction / Female              151         79.2           83.3
Non-fiction / Male                151         80.2           81.9
Arts (Non-academic)               16          76.0           76.3
Arts (Academic)                   24          75.6           77.5
Belief & Thought                  24          85.0           85.0
Biography                         54          87.0           90.0
Commerce                          10          60.0           84.0
Leisure                           16          85.7           81.3
Science                           26          74.2           78.5
Social Science (Non-academic)     52          77.5           83.0
Social Science (Academic)         38          82.9           78.4
World Affairs                     42          79.2           82.9
Learning-Based Feature Reduction
• Apply learning algorithm
• Eliminate features with low weights
• Learn again
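The three steps above can be sketched as follows. The perceptron is only a stand-in linear learner chosen to keep the example self-contained; it is not the learner used in the lecture, and the function names and toy data are assumptions.

```python
def train_perceptron(X, y, epochs=20):
    """Stand-in linear learner: returns one weight per feature (labels are +1/-1)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            if yi * score <= 0:  # misclassified: move weights toward the label
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
    return w

def reduce_features(X, y, keep, train=train_perceptron):
    """Learn, keep the `keep` features with the largest |weight|, then learn again."""
    w = train(X, y)
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    kept = sorted(ranked[:keep])
    X_small = [[xi[j] for j in kept] for xi in X]
    return kept, train(X_small, y)

# Toy data: feature 0 determines the label, feature 1 is noise, feature 2 is constant.
X = [[1.0, 0.2, 1.0], [0.9, -0.1, 1.0], [-1.0, 0.1, 1.0], [-0.8, -0.2, 1.0]]
y = [1, 1, -1, -1]
kept, weights = reduce_features(X, y, keep=1)
```

Calling reduce_features repeatedly with a shrinking `keep` (128, 64, 32, ...) would give one accuracy measurement per feature-set size.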
Results: Feature Reduction
[Line chart, Fiction: accuracy (axis from 0.6 to 0.9) vs. number of features (all, 128, 64, 32, 16, 8); series: FW, POS, FW+POS]
Results: Feature Reduction
[Line chart, Non-fiction: accuracy (axis from 0.6 to 0.9) vs. number of features (all, 128, 64, 32, 16, 8); series: FW+POS, POS, FW]
What are the Distinguishing Features?
• Fiction
– Male: a, the, as
– Female: she, for, with, not
• Non-Fiction
– Male: that, one, of, PRP, AT0
– Female: she, for, with, and, in, PNP
Feature Frequencies
(each cell is μ ± stderr)
Feature   Fiction, Male   Fiction, Female   Non-fiction, Male   Non-fiction, Female
PNP       732 ± 14        809 ± 15          291 ± 12            331 ± 17
he        145 ± 4.7       135 ± 4.7         47.5 ± 3.5          48.1 ± 4.3
she       67 ± 4.3        139 ± 6.9         8.73 ± 1.7          21.5 ± 2.3
AT0       735 ± 9.5       626 ± 8.7         884 ± 9.1           822 ± 12
DT0       160 ± 2.9       153 ± 2.0         220 ± 4.0           204 ± 4.6
the       520 ± 8.6       418 ± 7.5         611 ± 8.4           614 ± 12
XX0       84 ± 2.4        98 ± 2.2          54 ± 1.5            55 ± 2.3
PRP       623 ± 6.0       615 ± 5.7         767 ± 5.9           763 ± 7.0
PRF       170 ± 4.2       158 ± 3.7         355 ± 7.2           324 ± 7.9
for       55.7 ± 1.1      61.3 ± 1.0        77.9 ± 1.6          90.7 ± 1.4
with      58.6 ± 1.1      66.5 ± 1.0        56.9 ± 1.1          67.8 ± 1.4
and       234 ± 4.9       249 ± 5.5         242 ± 3.9           287 ± 5.2
Summary: Male vs. Female Style
Males use more (informational features):
• Determiners
• Adjectives
• of modifiers (e.g. pot of gold)
Females use more (involvedness features):
• Pronouns
• for and with
• Negation
• Present tense
Which is Male/Female?
• My aim in this article is to show that given a relevance theoretic approach to utterance interpretation, it is possible to develop a better understanding of what some of these so-called apposition markers indicate. It will be argued that the decision to put something in other words is essentially a decision about style, a point which is, perhaps, anticipated by Burton-Roberts when he describes loose apposition as a rhetorical device. However, he does not justify this suggestion by giving the criteria for classifying a mode of expression as a rhetorical device. Nor does he specify what kind of effects might be achieved by a reformulation or explain how it achieves those effects. In this paper I follow Sperber and Wilson's (1986) suggestion that rhetorical devices like metaphor, irony and repetition are particular means of achieving relevance. As I have suggested, the corrections that are made in unplanned discourse are also made in the pursuit of optimal relevance. However, these are made because the speaker recognises that the original formulation did not achieve optimal relevance .
• The main aim of this article is to propose an exercise in stylistic analysis which can be employed in the teaching of English language. It details the design and results of a workshop activity on narrative carried out with undergraduates in a university department of English. The methods proposed are intended to enable students to obtain insights into aspects of cohesion and narrative structure: insights, it is suggested, which are not as readily obtainable through more traditional techniques of stylistic analysis. The text chosen for analysis is a short story by Ernest Hemingway comprising only 11 sentences. A jumbled version of this story is presented to students who are asked to assemble a cohesive and well formed version of the story. Their re-constructions are then compared with the original Hemingway version.
Blog Corpus
• 85,000 blogs
• blogger-provided profiles (gender, age, occupation, astrological sign)
• harvested August 2004
• non-text ignored (formatting, quoting)
Example 1
Yesterday we had our second jazz competition. Thank God we weren't competing. We were sooo bad. Like, I was so ashamed, I didn't even want to talk to anyone after. I felt so rotton, and I wanted to cry, but...it's ok.
Example 2
My gracious boss had agreed to let me have one week off of "work." He did finally give me my report back after eight freakin' days! Now I only have the rest of this week and then one full week after my vacation to finish this damned thing.
Example 3
So about a month or two ago, I met Katy N. at a party in New York. Katy's friend, Kevin M., whom she met while living in Barcelona last year, lives in Miami and is working on getting a TV series produced. Kevin is friends with a guy named Charlie P.
Blog Corpus
age       female   male    Total
unknown   12287    12259   24546
13-17     6949     4120    11069
18-22     7393     7690    15083
23-27     4043     6062    10105
28-32     1686     3057    4743
33-37     860      1827    2687
38-42     374      819     1193
43-48     263      584     847
>48       314      906     1220
Total     34169    37324   71493
Final balanced corpus:
• 19,320 total blogs
– 8240 in “10s”
– 8086 in “20s”
– 2994 in “30s”
• 681,288 total posts
• 141,106,859 total words
Experimental Setup
Feature sets:
• Content: words (filtered by infogain on train set)
• Style: parts-of-speech, function words, blog slang
Learning algorithms:
• Real-valued balanced winnow (RBW)
• Bayesian Multinomial Regression (BMR)
Evaluation: 10-fold cross-validation
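To make the "filtered by infogain" step concrete, here is a minimal sketch of information gain for a single word, treated as a binary feature (the word occurs in the document or it does not). Representing each document as a set of words, the function names and the toy data are all assumptions of this example.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, word):
    """IG of the binary feature 'word occurs in document' with respect to the labels."""
    classes = sorted(set(labels))
    with_word = [l for d, l in zip(docs, labels) if word in d]
    without = [l for d, l in zip(docs, labels) if word not in d]

    def class_counts(subset):
        return [subset.count(c) for c in classes]

    n = len(labels)
    remainder = sum(len(s) / n * entropy(class_counts(s))
                    for s in (with_word, without) if s)
    return entropy(class_counts(labels)) - remainder

# Toy training set: each document is a set of words.
docs = [{"homework", "bored"}, {"homework", "mum"}, {"tax", "marriage"}, {"tax", "son"}]
labels = ["10s", "10s", "30s", "30s"]
ig_homework = information_gain(docs, labels, "homework")  # separates the classes perfectly
ig_absent = information_gain(docs, labels, "beer")        # never occurs, so no gain
```

Words would then be ranked by this score on the training folds only, and the top-scoring ones kept as content features.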
Age: Classification
                  RBW     BMR
Style & Content   75.0%   77.4%
Function Words    67.7%   69.4%
Content Words     75.9%   76.2%
The lifecycle of the common blogger...
feature       10s    20s    30s
bored         3.84   1.11   0.47
boring        3.69   1.02   0.63
awesome       2.92   1.28   0.57
mad           2.16   0.8    0.53
homework      1.37   0.18   0.15
mum           1.25   0.41   0.23
maths         1.05   0.03   0.02
dumb          0.89   0.45   0.22
sis           0.74   0.26   0.1
crappy        0.46   0.28   0.11

feature       10s    20s    30s
college       1.51   1.92   1.31
bar           0.45   1.53   1.11
apartment     0.18   1.23   0.55
beer          0.32   1.15   0.7
student       0.65   0.98   0.61
drunk         0.77   0.88   0.41
album         0.64   0.84   0.56
dating        0.31   0.52   0.37
semester      0.22   0.44   0.18
someday       0.35   0.4    0.28

feature       10s    20s    30s
son           0.51   0.92   2.37
local         0.38   1.18   1.85
marriage      0.27   0.83   1.41
development   0.16   0.5    0.82
tax           0.14   0.38   0.72
campaign      0.14   0.38   0.7
provide       0.15   0.54   0.69
democratic    0.13   0.29   0.59
systems       0.12   0.36   0.55
workers       0.1    0.35   0.46
Gender: Classification
RBW BMR
Style & Content 80.0%
Style Words 77.0%
Content Words 73.0%
Men are from Mars...Women are from Venus...
LIWC category male female
job 68.1±0.6 56.5±0.5
money 43.6±0.4 37.1±0.4
sports 31.2±0.4 20.4±0.2
tv 21.1±0.3 15.9±0.2
sex 32.4±0.4 43.2±0.5
family 27.5±0.3 40.6±0.4
eating 23.9±0.3 30.4±0.3
friends 20.5±0.2 25.9±0.3
sleep 18.4±0.2 23.5±0.2
pos-emotions 248.2±1.9 265.1±2
neg-emotions 159.5±1.3 178±1.4
Relating Age & Gender
• Let's examine the connection between age and gender a little more generally...
• Consider the most distinctive words for both Age and Gender:
– Intersection of the 1000 words with highest Age information gain and the 1000 words with highest Gender information gain
– Total of 316 words
– Consider log(30s/10s) vs. log(male/female)
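A small sketch of the two log-ratios for a pair of words. The age values for "homework" and "son" come from the blog frequency tables in this lecture; the male/female values are invented purely so the example runs, and the use of the natural logarithm is also an assumption, since the slides do not state the base.

```python
from math import log

def log_ratio(freq_a, freq_b):
    """Natural log of freq_a / freq_b; positive means more typical of group a."""
    return log(freq_a / freq_b)

# (10s, 30s) values for two words from the lecture's blog tables.
age_freq = {"homework": (1.37, 0.15), "son": (0.51, 2.37)}
# (male, female) values: hypothetical, invented only to make the example run.
gender_freq = {"homework": (0.4, 0.8), "son": (1.2, 0.9)}

# One (x, y) point per word: x = log(male/female), y = log(30s/10s).
points = {
    w: (log_ratio(*gender_freq[w]), log_ratio(age_freq[w][1], age_freq[w][0]))
    for w in age_freq
}
```

Plotting one such point per word for all 316 words gives the comparison of age against gender described above.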
Relating Age & Gender
[Scatter plot of the 316 words: x-axis log(male/female), from -2 to 2; y-axis log(30s/10s), from -8 to 8. One point is labelled “husband”.]
Native Language
Given English text, can we determine the author’s native language?
• In the second part of this outhor’s novel, called Time Passes, time has passed indeed and Mrs Ramsay has died.
• There are pejudments of small groups, such as homosexuals, inmigrants, aids diseaseds, etc. But "political correctness" has have positive and negative consecuences.
• There is one more kind of films irritating many television viewers - "soap" serials. «Santa Barbara» has even won "Oskar" prize.
Try it yourself. These were written by a Russian speaker, a French speaker and a Spanish speaker, in some order. Can you tell which is which?
Possible Clues
Patterns of native language are typically reflected in how other languages are spoken (Rado61, Corder81):
• Word selection
• Syntax
• Spelling
Measurable Features for Automated Native Language Detection
• Frequency of function words
• Frequency of letter sequences (adapted from Peng+ 04)
• Idiosyncrasies
We will gather idiosyncrasy data automatically.
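A minimal sketch of the letter-sequence feature: relative frequencies of letter n-grams counted inside words. The function name, the choice of n=2 and the decision to ignore case and non-letters are assumptions of this example, not details given in the lecture.

```python
from collections import Counter

def letter_ngram_frequencies(text, n=2, top=None):
    """Relative frequency of each letter n-gram (case-folded, within words only)."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return {gram: c / total for gram, c in counts.most_common(top)}

# Two misspellings of the kind discussed in this lecture.
freqs = letter_ngram_frequencies("confortable cuality", n=2)
```

Sequences such as "nf" or "cu" are rare in standard English spelling, so their frequencies can hint at the writer's native language.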
Orthographic Idiosyncrasies
• Repeated letter (e.g. remmit instead of remit)
• Double letter appears once (e.g. comit instead of commit)
• One letter substituted for another (e.g. firsd instead of first)
• Letter inversion (e.g. fisrt instead of first)
• Inserted letter (e.g. friegnd instead of friend)
• Missing letter (e.g. frend instead of friend)
• Conflated words (e.g. stucktogether)
Syntactic Idiosyncrasies
• Sentence Fragment
• Run-on Sentence
• Repeated Word
• Missing Word
• Mismatched Singular/Plural
• Mismatched Tense
• that/which confusion
• Rare POS pairs (Chodorow-Leacock 00)
Automatically Finding Idiosyncrasies
1. Run text through automated spell/grammar checker
2. Compare flagged word to best suggestion
3. Mark error accordingly
e.g. text=remmit suggestion=remit
mark as “repeated letter”
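Steps 2 and 3 can be sketched as a function that compares a flagged word with the checker's best suggestion and names the orthographic error type. The spell checker itself is assumed to exist and is not shown; the function name and the exact label strings are choices made for this example.

```python
def classify_error(word, suggestion):
    """Name the orthographic error that turns `suggestion` into `word`."""
    if " " in suggestion and suggestion.replace(" ", "") == word:
        return "conflated words"
    if len(word) == len(suggestion):
        diffs = [i for i in range(len(word)) if word[i] != suggestion[i]]
        if len(diffs) == 1:
            return "letter substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and word[diffs[0]] == suggestion[diffs[1]]
                and word[diffs[1]] == suggestion[diffs[0]]):
            return "letter inversion"
        return "other"
    if len(word) == len(suggestion) + 1:  # the word has one extra letter
        for i in range(len(word)):
            if word[:i] + word[i + 1:] == suggestion:
                doubled = ((i > 0 and word[i - 1] == word[i])
                           or (i + 1 < len(word) and word[i + 1] == word[i]))
                return "repeated letter" if doubled else "inserted letter"
        return "other"
    if len(word) + 1 == len(suggestion):  # the word lacks one letter
        for i in range(len(suggestion)):
            if suggestion[:i] + suggestion[i + 1:] == word:
                c = suggestion[i]
                doubled = ((i > 0 and suggestion[i - 1] == c)
                           or (i + 1 < len(suggestion) and suggestion[i + 1] == c))
                return "double letter appears once" if doubled else "missing letter"
        return "other"
    return "other"

label = classify_error("remmit", "remit")
```

Counting how often each label occurs in a document gives one numeric feature per error type.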
Summary: Features Used
• 400 function words
• 200 letter sequences
• 185 error types
• 250 rare POS pairs
Each document is represented as a numerical vector of length 1035
Test Corpus
International Corpus of Learner English (Granger98)
• 11 countries
• Subjects same age, proficiency level
• Samples same genre, length
• Actually used in study: 258 docs from each of
– France
– Spain
– Bulgaria
– Czech Rep.
– Russia
SVM Classification Accuracy (10-fold CV)
[Bar chart: accuracy (%), axis from 30 to 90, for four feature sets: function words + letter n-grams, function words, letter n-grams, errors. Shaded bars: without error features; white bars: with error features. Baseline = 20%.]
Confusion Matrix Classified As
Actual \ Classified as   Czech   French   Bulgarian   Russian   Spanish
Czech                    209     1        18          20        10
French                   9       219      13          12        5
Bulgarian                14      8        211         18        7
Russian                  24      8        24          194       8
Spanish                  16      10       10          7         215
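For reference, overall accuracy and per-language recall can be read off the confusion matrix as below. The counts are copied from the matrix above; the variable names are just for this sketch.

```python
LANGS = ["Czech", "French", "Bulgarian", "Russian", "Spanish"]
# Rows: actual language; columns: classified as (same order as LANGS).
CONFUSION = [
    [209, 1, 18, 20, 10],
    [9, 219, 13, 12, 5],
    [14, 8, 211, 18, 7],
    [24, 8, 24, 194, 8],
    [16, 10, 10, 7, 215],
]

correct = sum(CONFUSION[i][i] for i in range(len(LANGS)))  # diagonal entries
total = sum(sum(row) for row in CONFUSION)
accuracy = correct / total
# Recall per language: correct / number of documents actually in that language.
recall = {lang: CONFUSION[i][i] / sum(CONFUSION[i]) for i, lang in enumerate(LANGS)}
print(f"overall accuracy: {accuracy:.3f}")
```

With these counts, 1048 of the 1290 documents are classified correctly, about 81.2%. Russian has the lowest recall of the five languages.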
What Gives It Away?
• Russian – over, the (infrequent), number_reladverb
• French – indeed, Mr (no period), misused o (e.g. outhor)
• Spanish – c-q confusion (e.g. cuality), m-n confusion (e.g. confortable), undoubled consonant (e.g. comit)
• Bulgarian – most_ADVERB, cannot (uncontracted)
• Czech – doubled consonant (e.g. remmit)
French: In the second part of this outhor’s novel, called Time Passes, time has passed indeed and Mrs Ramsay has died.
Spanish: There are pejudments of small groups, such as homosexuals, inmigrants, aids diseaseds, etc. But "political correctness" has have positive and negative consecuences.
Russian: There is one more kind of films irritating many television viewers - "soap" serials. «Santa Barbara» has even won "Oskar" prize.
Let’s look back at our examples. Now it’s pretty obvious.
Real-Life Issues
• Many candidate languages
• Very short texts
• Unpredictable English proficiency
Personality
• Pennebaker data:
– Students wrote essays
– Same students took personality assessment tests
• Experiment: Given text, determine if author is
– Open
– Conscientious
– Neurotic
– Extroverted
– Agreeable
Accuracy Results
– Open 66%
– Conscientious 65%
– Neurotic 63%
– Extroverted 62%
– Agreeable 60%
Key Features
• Openness
– consciousness, strange, thoughts, maybe, you
– hope, feel, home, friends, football, team
• Conscientiousness
– school, always, high, grades
– damn, bad, hate, you, more