Text Categorization Moshe Koppel Lecture 5: Authorship Verification with Jonathan Schler and Shlomo Argamon


Page 1

Text Categorization
Moshe Koppel

Lecture 5: Authorship Verification

with Jonathan Schler and Shlomo Argamon

Page 2

Attribution vs. Verification

• Attribution – Which of authors A1,…,An wrote doc X?

• Verification – Did author A write doc X?

Page 3

Authorship Verification: Did the author of S also write X?

Story: Ben Ish Chai, a 19th C. Baghdadi Rabbi, is the author of a corpus, S, of 500+ legal letters.

Ben Ish Chai also published another corpus of 500+ legal letters, X, but denied authorship of X, despite external evidence that he wrote it.

How can we determine if the author of S is also the author of X?

Page 4

Verification is Harder than Attribution

In the absence of a closed set of alternate suspects to S, we’re never sure that we have a representative set of not-S documents.

Let’s see why this is bad.

Pages 5–7

Round 1: "The Lineup"

D1,…,D5 are corpora written by other Rabbis of the same region and period as Ben Ish Chai. They will play the role of "impostors".

1. Learn a model for S vs. (each of) the impostors.

2. For each document in X, check if it is classed as S or as an impostor.

3. If "many" are classed as impostors, exonerate S.

In fact, almost all are classified as S (i.e., many mystery documents seem to point to S as the "guilty" author).

Does this mean S really is the author?
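The three lineup steps can be sketched with a toy nearest-profile classifier. Everything here is an illustrative assumption on my part (the helper names and the word-frequency profile stand in for whatever classifier the lecture's experiments actually used):

```python
from collections import Counter

def profile(docs):
    """Toy author 'profile': mean relative word frequencies over a corpus."""
    total = Counter()
    for d in docs:
        words = d.lower().split()
        n = len(words) or 1
        for w, c in Counter(words).items():
            total[w] += c / n
    return {w: v / len(docs) for w, v in total.items()}

def closer_to(doc, prof_a, prof_b):
    """Return 'A' if doc's word frequencies are closer (L1 distance) to prof_a."""
    words = doc.lower().split()
    n = len(words) or 1
    freqs = {w: c / n for w, c in Counter(words).items()}
    vocab = set(freqs) | set(prof_a) | set(prof_b)
    da = sum(abs(freqs.get(w, 0) - prof_a.get(w, 0)) for w in vocab)
    db = sum(abs(freqs.get(w, 0) - prof_b.get(w, 0)) for w in vocab)
    return "A" if da <= db else "B"

def lineup(S_docs, impostor_corpora, X_docs):
    """Steps 1-3: for each impostor, what fraction of X is classed as S?"""
    prof_S = profile(S_docs)
    results = {}
    for name, docs in impostor_corpora.items():
        prof_I = profile(docs)
        wins = sum(closer_to(x, prof_S, prof_I) == "A" for x in X_docs)
        results[name] = wins / len(X_docs)
    return results
```

If nearly every fraction in the returned dictionary is close to 1.0 (as the lecture reports for Ben Ish Chai), the lineup fails to exonerate S.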

Page 8

Why “The Lineup” Fails

No. This only shows that S is a better fit than these impostors, not that he is guilty.

The real author may simply be some other person more similar to S than to (any of) these impostors.

(One important caveat: suppose we had, say, 10000 impostors. That would be a bit more convincing.)

Well, at least we can rule out these impostors.

Pages 9–10

Round 2: Composite Sketch. Does X Look Like S?

Learn a model for S vs. X. If CV "fails" (so that we can't distinguish S from X), S is probably guilty (especially since we already know that we can distinguish S [and X] from each of the impostors).

In fact, we obtain 98% CV accuracy for S vs. X.

So can we exonerate S?

Page 11

Why Composite Sketch Fails

No. Superficial differences, due to:

• thematic differences,
• chronological drift,
• different purposes or contexts,
• deliberate ruses

would be enough to allow differentiation between S and X even if they were by the same author.

We call these differences “masks”.

Pages 12–13

Example: The House of the Seven Gables

This is a crucial point, so let's consider an example where we know the author's identity.

With what CV accuracy can we distinguish The House of the Seven Gables from the known works of Hawthorne, Melville, and Cooper (respectively)?

In each case, we obtain 95+% (even though Hawthorne really wrote it).

Page 14

Example (continued)

A small number of features allow House to be distinguished from Hawthorne's other work (The Scarlet Letter). For example: he, she.

What happens when we eliminate features like those?

Page 15

Round 3: Unmasking

1. Learn models for X vs. S and for X vs. each impostor.

2. For each of these, drop the k best features (k = 5, 10, 15, …; best = highest weight in the SVM) and learn again.

3. “Compare curves.”
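The unmasking loop above can be sketched in a few lines. This is a toy illustration under stated assumptions: a tiny perceptron stands in for the lecture's SVM, training-set accuracy stands in for CV accuracy, and a fixed number of features is dropped per round rather than a growing k:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Tiny linear classifier with inspectable weights (SVM stand-in)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # mistake-driven update
                w += yi * xi
                b += yi
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean(np.sign(X @ w + b) == y))

def unmasking_curve(X, y, rounds=5, drop_per_round=2):
    """Repeatedly train, record accuracy, and remove the strongest features."""
    X = X.astype(float)
    active = np.arange(X.shape[1])  # indices of features still in play
    curve = []
    for _ in range(rounds):
        w, b = train_perceptron(X[:, active], y)
        curve.append(accuracy(X[:, active], y, w, b))
        if len(active) <= drop_per_round:
            break
        # keep all but the highest-|weight| features
        keep = np.argsort(np.abs(w))[:-drop_per_round]
        active = active[np.sort(keep)]
    return curve
```

For a same-author pair, the curve should collapse quickly once the few "mask" features are gone; for a different-author pair it should stay high.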

Page 16

House of Seven Gables (concluded)

[Figure: unmasking curves for The House of the Seven Gables; y-axis: accuracy (50–100%), x-axis: elimination round (1–9); one curve each for Melville, Cooper, and Hawthorne.]

Page 17

Does Unmasking Always Work?

Experimental setup:

• Several similar authors, each with multiple books (chunked into approximately equal-length examples)

• Construct unmasking curve for each pair of books

• Compare same-author pairs to different-author pairs

Page 18

Unmasking: 19th C. American Authors (Hawthorne, Melville, Cooper)

[Figure: two panels of unmasking curves, "Non Identical Authors" and "Identical Authors"; y-axis: accuracy (0.5–1.0), x-axis: elimination round (0–8).]

Page 19

Unmasking: 19th C. English Playwrights (Shaw, Wilde)

[Figure: two panels of unmasking curves, "Identical Authors" and "Non Identical Authors"; y-axis: accuracy (0.5–1.0), x-axis: elimination round (1–8).]

Page 20

Unmasking: 19th C. American Essayists (Thoreau, Emerson)

[Figure: two panels of unmasking curves, "Identical Authors" and "Non Identical Authors"; y-axis: accuracy (0.5–1.0), x-axis: elimination round (1–8).]

Page 21

Experiment

• 21 books; 10 authors (= 210 labelled examples)

• Represent unmasking curves as vectors

• Leave-one-book-out experiments

• Use the training books to learn to separate same-author curves from diff-author curves

• Classify the left-out book (yes/no) for each author (independently)

• Use "The Lineup" to filter false positives

Page 22

Results

• 2 misclassified out of 210

• A simple rule that almost always works: if

  · accuracy after 6 elimination rounds is lower than 89%, and

  · the second-highest accuracy drop over two consecutive iterations is greater than 16%,

  then the books are by the same author.
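The rule can be written down directly. Note that the exact indexing (which point on the curve counts as "after 6 elimination rounds") and the reading of "drop in two consecutive iterations" as a two-step window are my assumptions, not details given on the slide:

```python
def same_author(curve, base_round=6, acc_threshold=0.89, drop_threshold=0.16):
    """Two-part rule applied to an unmasking accuracy curve.

    curve: list of accuracies (as fractions), one per elimination round,
    with curve[0] the accuracy before any features are dropped (assumption).
    """
    if len(curve) <= base_round:
        raise ValueError("curve too short for the rule")
    # Part 1: accuracy after 6 elimination rounds is below 89%
    cond1 = curve[base_round] < acc_threshold
    # Part 2: second-highest drop over any two consecutive iterations > 16%
    drops = [curve[i] - curve[i + 2] for i in range(len(curve) - 2)]
    cond2 = sorted(drops, reverse=True)[1] > drop_threshold
    return cond1 and cond2
```

A sharply collapsing curve satisfies both conditions; a flat, high curve satisfies neither.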

Page 23

Unmasking Ben Ish Chai

[Figure: unmasking curve for S vs. X; y-axis: accuracy (50–100%), x-axis: elimination round (1–11).]

Page 24

Unmasking: Summary

• This method works very well in general (provided X and S are both large).

• Key is not how similar/different two texts are, but how robust that similarity/difference is to changes in the feature set.

Page 25

Now let’s try a much harder problem…

• Suppose, instead of one candidate, we have 10,000 candidate authors – and we aren’t even sure if any of them is the real author. (This is two orders of magnitude more than has ever been tried before.)

• Building a classifier doesn’t sound promising, but information retrieval methods might have a chance.

• So, let’s try assigning an anonymous document to whichever author’s known writing is most similar (using the usual vector space/cosine model).

Pages 26–27

IR Approach

• We tried this on a corpus of 10,000 blogs, where the object was to attribute a short snippet from each blog. (Each attribution problem is handled independently.)

• Our feature set consisted of character 4-grams.

• 46% of "snippets" are correctly attributed.
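A minimal sketch of this attribution scheme, using raw character-4-gram counts and cosine similarity. The helper names are mine, and a real system would weight features and work over far larger texts:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=4):
    """Character n-gram counts (the lecture's feature set uses n = 4)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[g] * b.get(g, 0) for g in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar_author(snippet, known_texts):
    """Attribute a snippet to the candidate whose known text is most
    cosine-similar in character-4-gram space."""
    sn = char_ngrams(snippet)
    return max(known_texts,
               key=lambda a: cosine(sn, char_ngrams(known_texts[a])))
```

This is exactly the "most similar wins" baseline that gets 46% on the blog corpus; the rest of the lecture is about deciding when to trust its answer.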

Page 28

IR Approach

• 46% is not bad but completely useless in most applications.

• What we’d really like to do is figure out which attributions are reliable and which are not.

• In an earlier attempt (KSA 2006), we tried building a meta-classifier that could solve that problem (but meta-classifiers are a bit fiddly).

Page 29

When does most similar = actual author?

• Can generalize unmasking idea.

• Check if the similarity between the snippet and an author's known text is robust w.r.t. changes in the feature set.

  – If it is, that's the author.

  – If not, we just say we don't know. (If in fact none of the candidates wrote it, that's the best answer.)

Page 30

Algorithm

1. Randomly choose subset of features.

2. Find most similar author (using that FS).

3. Iterate.

4. If S is most similar for at least k% of iterations, S is the author. Otherwise, say Don't Know. (The choice of k trades off precision against recall.)
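Steps 1–4 can be sketched as follows. The count-overlap similarity and the function names here are illustrative assumptions; the lecture's system uses cosine similarity over character 4-grams:

```python
import random

def robust_attribution(snippet_feats, author_feats, k=0.90, iters=100,
                       frac=0.5, seed=0):
    """Score each candidate by how often they are most similar under random
    halves of the feature set; attribute only if the winner clears k."""
    rng = random.Random(seed)
    features = sorted(snippet_feats)          # snippet's feature vocabulary
    wins = {a: 0 for a in author_feats}
    for _ in range(iters):
        # Step 1: randomly choose a subset of features
        subset = rng.sample(features, max(1, int(len(features) * frac)))
        # Step 2: most similar author under that subset
        # (toy similarity: overlap of feature counts, a cosine stand-in)
        def sim(a):
            av = author_feats[a]
            return sum(min(snippet_feats[f], av.get(f, 0)) for f in subset)
        wins[max(author_feats, key=sim)] += 1  # Step 3: iterate and tally
    # Step 4: attribute only if one author wins at least k% of iterations
    top = max(wins, key=wins.get)
    return top if wins[top] >= k * iters else None
```

Returning `None` is the "Don't Know" answer; raising k raises precision at the cost of recall.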

Page 31

Results

• 100 iterations, 50% of features per iteration

• training text = 2,000 words, snippet = 500 words

• 1,000 candidates: 93.2% precision at 39.2% recall (k = 90)


Page 32

Results

How often do we attribute a snippet not written by any candidate to somebody?

(k = 90)

• 10,000 candidates – 2.5%

• 5,000 candidates – 3.5%

• 1,000 candidates – 5.5%

(The fewer candidates, the greater the chance some poor shnook will consistently be most similar.)

Page 33

Comments

• Can give an estimate of the probability that A is the author. Almost all variance in recall/precision is explained by:

  – snippet length

  – known-text length

  – number of candidates

  – score (number of iterations in which A is most similar)

• Method is language independent.

Page 34

So Far…

• Have covered cases of many authors (closed or open set).

• Unmasking covers cases of open set, few authors, lots of text.

• The only problem still uncovered is the ultimate one: open set, few authors, little text.

• Can we convert this case to our problem by adding artificial candidates?