Text Categorization Moshe Koppel Lecture 5: Authorship Verification with Jonathan Schler and Shlomo Argamon


Page 1

Text Categorization
Moshe Koppel

Lecture 5: Authorship Verification

with Jonathan Schler and Shlomo Argamon

Page 2

Attribution vs. Verification

• Attribution – Which of authors A1,…,An wrote doc X?

• Verification – Did author A write doc X?

Page 3

Authorship Verification: Did the author of S also write X?

Story: Ben Ish Chai, a 19th C. Baghdadi Rabbi, is the author of a corpus, S, of 500+ legal letters.

Ben Ish Chai also published another corpus of 500+ legal letters, X, but denied authorship of X, despite external evidence that he wrote it.

How can we determine if the author of S is also the author of X?

Page 4

Verification is Harder than Attribution

In the absence of a closed set of alternate suspects to S, we’re never sure that we have a representative set of not-S documents.

Let’s see why this is bad.

Pages 5–7

Round 1: "The Lineup"

D1,…,D5 are corpora written by other Rabbis of the same region and period as Ben Ish Chai. They will play the role of "impostors".

1. Learn a model for S vs. (each of) the impostors.

2. For each document in X, check if it is classed as S or as an impostor.

3. If "many" are classed as impostors, exonerate S.

In fact, almost all are classified as S (i.e., many mystery documents seem to point to S as the "guilty" author).

Does this mean S really is the author?
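The three lineup steps can be sketched with a toy nearest-profile classifier. Everything here is an illustrative assumption on my part (the helper names and the word-frequency profile stand in for whatever classifier the lecture's experiments actually used):

```python
from collections import Counter

def profile(docs):
    """Toy author 'profile': mean relative word frequencies over a corpus."""
    total = Counter()
    for d in docs:
        words = d.lower().split()
        n = len(words) or 1
        for w, c in Counter(words).items():
            total[w] += c / n
    return {w: v / len(docs) for w, v in total.items()}

def closer_to(doc, prof_a, prof_b):
    """Return 'A' if doc's word frequencies are closer (L1 distance) to prof_a."""
    words = doc.lower().split()
    n = len(words) or 1
    freqs = {w: c / n for w, c in Counter(words).items()}
    vocab = set(freqs) | set(prof_a) | set(prof_b)
    da = sum(abs(freqs.get(w, 0) - prof_a.get(w, 0)) for w in vocab)
    db = sum(abs(freqs.get(w, 0) - prof_b.get(w, 0)) for w in vocab)
    return "A" if da <= db else "B"

def lineup(S_docs, impostor_corpora, X_docs):
    """Steps 1-3: for each impostor, what fraction of X is classed as S?"""
    prof_S = profile(S_docs)
    results = {}
    for name, docs in impostor_corpora.items():
        prof_I = profile(docs)
        wins = sum(closer_to(x, prof_S, prof_I) == "A" for x in X_docs)
        results[name] = wins / len(X_docs)
    return results
```

If nearly every fraction in the returned dictionary is close to 1.0 (as the lecture reports for Ben Ish Chai), the lineup fails to exonerate S.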

Page 8

Why “The Lineup” Fails

No. This only shows that S is a better fit than these impostors, not that he is guilty.

The real author may simply be some other person more similar to S than to (any of) these impostors.

(One important caveat: suppose we had, say, 10000 impostors. That would be a bit more convincing.)

Well, at least we can rule out these impostors.

Pages 9–10

Round 2: Composite Sketch. Does X Look Like S?

Learn a model for S vs. X. If CV "fails" (so that we can't distinguish S from X), S is probably guilty (especially since we already know that we can distinguish S [and X] from each of the impostors).

In fact, we obtain 98% CV accuracy for S vs. X.

So can we exonerate S?

Page 11

Why Composite Sketch Fails

No. Superficial differences, due to:

• thematic differences,
• chronological drift,
• different purposes or contexts,
• deliberate ruses

would be enough to allow differentiation between S and X even if they were by the same author.

We call these differences “masks”.

Pages 12–13

Example: The House of the Seven Gables

This is a crucial point, so let's consider an example where we know the author's identity.

With what CV accuracy can we distinguish The House of the Seven Gables from the known works of Hawthorne, Melville, and Cooper (respectively)?

In each case, we obtain 95+% (even though Hawthorne really wrote it).

Page 14

Example (continued)

A small number of features allow House to be distinguished from Hawthorne's other work (The Scarlet Letter). For example: he, she.

What happens when we eliminate features like those?

Page 15

Round 3: Unmasking

1. Learn models for X vs. S and for X vs. each impostor.

2. For each of these, drop the k best features (k = 5, 10, 15, …; best = highest weight in the SVM) and learn again.

3. “Compare curves.”
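The unmasking loop above can be sketched in a few lines. This is a toy illustration under stated assumptions: a tiny perceptron stands in for the lecture's SVM, training-set accuracy stands in for CV accuracy, and a fixed number of features is dropped per round rather than a growing k:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Tiny linear classifier with inspectable weights (SVM stand-in)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # mistake-driven update
                w += yi * xi
                b += yi
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean(np.sign(X @ w + b) == y))

def unmasking_curve(X, y, rounds=5, drop_per_round=2):
    """Repeatedly train, record accuracy, and remove the strongest features."""
    X = X.astype(float)
    active = np.arange(X.shape[1])  # indices of features still in play
    curve = []
    for _ in range(rounds):
        w, b = train_perceptron(X[:, active], y)
        curve.append(accuracy(X[:, active], y, w, b))
        if len(active) <= drop_per_round:
            break
        # keep all but the highest-|weight| features
        keep = np.argsort(np.abs(w))[:-drop_per_round]
        active = active[np.sort(keep)]
    return curve
```

For a same-author pair, the curve should collapse quickly once the few "mask" features are gone; for a different-author pair it should stay high.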

Page 16

House of Seven Gables (concluded)

[Figure: unmasking curves for The House of the Seven Gables; y-axis: accuracy (50–100%), x-axis: elimination round (1–9); one curve each for Melville, Cooper, and Hawthorne.]

Page 17

Does Unmasking Always Work?

Experimental setup:

• Several similar authors, each with multiple books (chunked into approximately equal-length examples)

• Construct unmasking curve for each pair of books

• Compare same-author pairs to different-author pairs

Page 18

Unmasking: 19th C. American Authors (Hawthorne, Melville, Cooper)

[Figure: two panels of unmasking curves, "Non Identical Authors" and "Identical Authors"; y-axis: accuracy (0.5–1.0), x-axis: elimination round (0–8).]

Page 19

Unmasking: 19th C. English Playwrights (Shaw, Wilde)

[Figure: two panels of unmasking curves, "Identical Authors" and "Non Identical Authors"; y-axis: accuracy (0.5–1.0), x-axis: elimination round (1–8).]

Page 20

Unmasking: 19th C. American Essayists (Thoreau, Emerson)

[Figure: two panels of unmasking curves, "Identical Authors" and "Non Identical Authors"; y-axis: accuracy (0.5–1.0), x-axis: elimination round (1–8).]

Page 21

Experiment

• 21 books; 10 authors (= 210 labelled examples)

• Represent unmasking curves as vectors

• Leave-one-book-out experiments

• Use the training books to learn to separate same-author curves from diff-author curves

• Classify the left-out book (yes/no) for each author (independently)

• Use "The Lineup" to filter false positives

Page 22

Results

• 2 misclassified out of 210

• A simple rule that almost always works: if

  · accuracy after 6 elimination rounds is lower than 89%, and

  · the second-highest accuracy drop over two consecutive iterations is greater than 16%,

  then the books are by the same author.
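The rule can be written down directly. Note that the exact indexing (which point on the curve counts as "after 6 elimination rounds") and the reading of "drop in two consecutive iterations" as a two-step window are my assumptions, not details given on the slide:

```python
def same_author(curve, base_round=6, acc_threshold=0.89, drop_threshold=0.16):
    """Two-part rule applied to an unmasking accuracy curve.

    curve: list of accuracies (as fractions), one per elimination round,
    with curve[0] the accuracy before any features are dropped (assumption).
    """
    if len(curve) <= base_round:
        raise ValueError("curve too short for the rule")
    # Part 1: accuracy after 6 elimination rounds is below 89%
    cond1 = curve[base_round] < acc_threshold
    # Part 2: second-highest drop over any two consecutive iterations > 16%
    drops = [curve[i] - curve[i + 2] for i in range(len(curve) - 2)]
    cond2 = sorted(drops, reverse=True)[1] > drop_threshold
    return cond1 and cond2
```

A sharply collapsing curve satisfies both conditions; a flat, high curve satisfies neither.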

Page 23

Unmasking Ben Ish Chai

[Figure: unmasking curve for S vs. X; y-axis: accuracy (50–100%), x-axis: elimination round (1–11).]

Page 24

Unmasking: Summary

• This method works very well in general (provided X and S are both large).

• Key is not how similar/different two texts are, but how robust that similarity/difference is to changes in the feature set.

Page 25

Now let’s try a much harder problem…

• Suppose, instead of one candidate, we have 10,000 candidate authors – and we aren’t even sure if any of them is the real author. (This is two orders of magnitude more than has ever been tried before.)

• Building a classifier doesn’t sound promising, but information retrieval methods might have a chance.

• So, let’s try assigning an anonymous document to whichever author’s known writing is most similar (using the usual vector space/cosine model).

Pages 26–27

IR Approach

• We tried this on a corpus of 10,000 blogs, where the object was to attribute a short snippet from each blog. (Each attribution problem is handled independently.)

• Our feature set consisted of character 4-grams.

• 46% of "snippets" are correctly attributed.
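A minimal sketch of this attribution scheme, using raw character-4-gram counts and cosine similarity. The helper names are mine, and a real system would weight features and work over far larger texts:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=4):
    """Character n-gram counts (the lecture's feature set uses n = 4)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[g] * b.get(g, 0) for g in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar_author(snippet, known_texts):
    """Attribute a snippet to the candidate whose known text is most
    cosine-similar in character-4-gram space."""
    sn = char_ngrams(snippet)
    return max(known_texts,
               key=lambda a: cosine(sn, char_ngrams(known_texts[a])))
```

This is exactly the "most similar wins" baseline that gets 46% on the blog corpus; the rest of the lecture is about deciding when to trust its answer.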

Page 28

IR Approach

• 46% is not bad but completely useless in most applications.

• What we’d really like to do is figure out which attributions are reliable and which are not.

• In an earlier attempt (KSA 2006), we tried building a meta-classifier that could solve that problem (but meta-classifiers are a bit fiddly).

Page 29

When does most similar = actual author?

• Can generalize unmasking idea.

• Check if the similarity between the snippet and an author's known text is robust w.r.t. changes in the feature set.

  – If it is, that's the author.

  – If not, we just say we don't know. (If in fact none of the candidates wrote it, that's the best answer.)

Page 30

Algorithm

1. Randomly choose subset of features.

2. Find most similar author (using that FS).

3. Iterate.

4. If S is most similar for at least k% of iterations, S is the author. Otherwise, say Don't Know. (The choice of k trades off precision against recall.)
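Steps 1–4 can be sketched as follows. The count-overlap similarity and the function names here are illustrative assumptions; the lecture's system uses cosine similarity over character 4-grams:

```python
import random

def robust_attribution(snippet_feats, author_feats, k=0.90, iters=100,
                       frac=0.5, seed=0):
    """Score each candidate by how often they are most similar under random
    halves of the feature set; attribute only if the winner clears k."""
    rng = random.Random(seed)
    features = sorted(snippet_feats)          # snippet's feature vocabulary
    wins = {a: 0 for a in author_feats}
    for _ in range(iters):
        # Step 1: randomly choose a subset of features
        subset = rng.sample(features, max(1, int(len(features) * frac)))
        # Step 2: most similar author under that subset
        # (toy similarity: overlap of feature counts, a cosine stand-in)
        def sim(a):
            av = author_feats[a]
            return sum(min(snippet_feats[f], av.get(f, 0)) for f in subset)
        wins[max(author_feats, key=sim)] += 1  # Step 3: iterate and tally
    # Step 4: attribute only if one author wins at least k% of iterations
    top = max(wins, key=wins.get)
    return top if wins[top] >= k * iters else None
```

Returning `None` is the "Don't Know" answer; raising k raises precision at the cost of recall.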

Page 31

Results

• 100 iterations, 50% of features per iteration

• training text = 2,000 words, snippet = 500 words

• 1,000 candidates: 93.2% precision at 39.2% recall (k = 90)


Page 32

Results

How often do we attribute a snippet not written by any candidate to somebody?

(k = 90)

• 10,000 candidates – 2.5%

• 5,000 candidates – 3.5%

• 1,000 candidates – 5.5%

(The fewer candidates, the greater the chance some poor shnook will consistently be most similar.)

Page 33

Comments

• Can give an estimate of the probability that A is the author. Almost all variance in recall/precision is explained by:

  – snippet length

  – known-text length

  – number of candidates

  – score (number of iterations in which A is most similar)

• Method is language independent.

Page 34

So Far…

• Have covered cases of many authors (closed or open set).

• Unmasking covers cases of open set, few authors, lots of text.

• The only problem still uncovered is the ultimate one: open set, few authors, little text.

• Can we convert this case to our problem by adding artificial candidates?