


Both methods give significant improvements in accuracy over two strong baselines:

Above curves were generated using rationales from a single annotator, A0. What about other annotators? We solicited rationales from several other annotators (on 100 documents) and saw similar improvements in accuracy:

[Figure: the words (x) of a review and their rationale tags (r).]

Armageddon

This disaster flick is a disaster alright. Directed by Tony Scott (Top Gun), it's the story of an asteroid the size of Texas caught on a collision course with Earth. After a great opening, in which an American spaceship, plus NYC, are completely destroyed by a comet shower, NASA detects said asteroid and go into a frenzy. They hire the world's best oil driller (Bruce Willis), and send him and his crew up into space to fix our global problem.

The action scenes are over the top and too ludicrous for words. So much so, I had to sigh and hit my head with my notebook a couple of times. Also, to see a wonderful actor like Billy Bob Thornton in a film like this is a waste of his talents. The only real reason for making this film was to somehow out-perform Deep Impact. Bottom line is, Armageddon is a failure.

[The same review rendered in Arabic, illustrating the learner's-eye view of the text.]

Omar F. Zaidan, Jason Eisner, Christine D. Piatko

Machine Learning with Annotator Rationales to Reduce Annotation Cost

Task: classification of movie reviews into positive and negative reviews. Typically, an annotator is given a set of unannotated documents and annotates the correct class for each one.

This classification task is actually hard for a machine learner. Place yourself in its position and imagine the task in Arabic instead. If you are not fluent in Arabic, class data alone might not be very helpful. This is truly the situation from a computer's point of view!

But what if, in addition to the class, you were also told which segments of the text actually support that class? That should make it easier for you to learn the true model. Presumably, the same is true for a machine learner as well.

Proposal: we wish to better utilize annotators by having them tell us more about their classification process. We propose that annotators indicate not only what the correct answers are, but also provide hints about why.

We propose that they highlight relevant portions of the example, such as substrings, that help to justify their annotations. We call such hints rationales.

We have collected rationales for the movie review dataset of [PL04], and have developed two methods that use the rationales during training. One is a discriminative method [ZEP07] and one is generative [ZE08]. Both methods yield significant accuracy improvements…

From the rationale annotations on a positive example x_i, we construct several "not-quite-as-positive" contrast examples v_ij. Each contrast v_ij is obtained by starting with the original and "masking out" a rationale substring:

The intuition: a correct model should be less sure of a positive classification on the contrast example v_ij than on the original example x_i, because v_ij lacks evidence the annotator found significant.
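As a concrete sketch (illustrative code, not the authors' implementation; the function name and span encoding are assumptions), masking one rationale span at a time to build the contrast examples might look like:

```python
# Illustrative sketch: build "not-quite-as-positive" contrast examples v_ij
# by masking out one rationale span at a time from the original example x_i.
def make_contrasts(tokens, rationale_spans):
    """tokens: list of words; rationale_spans: list of (start, end) token indices."""
    contrasts = []
    for start, end in rationale_spans:
        # v_ij = the original document with one rationale substring removed
        contrasts.append(tokens[:start] + tokens[end:])
    return contrasts

doc = "bottom line is armageddon is a failure".split()
spans = [(5, 7)]                      # rationale: "a failure"
print(make_contrasts(doc, spans))     # [['bottom', 'line', 'is', 'armageddon', 'is']]
```

Each span yields one contrast, so a document with several rationales produces several v_ij, exactly one rationale weaker than the original.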

We express our intuition as additional constraints on an SVM-like model: we want (for each j) to have w · x_i − w · v_ij ≥ μ, where μ ≥ 0 controls the size of the margin between an original example and its contrasts:

What this means in practice

Standard SVM cares about this margin

Modified SVM cares about both margins

We typically choose parameters that explain class labels y of training data. With rationales, we propose that parameters be chosen to explain rationale data r in addition to class labels y. For instance, a conditional log-linear model:

would have a parameter vector chosen so that:

The second factor corresponds to a model for r. Its parameters, φ, capture how the true θ influence the annotator. To do this, we encode the rationales as a tag sequence and model it using a CRF:

The first-order "emission" features of g relate the tag r_m to (x_m, y, θ_{x_m}), whereas the second-order "transition" features of g relate the tag r_m to r_{m−1}.

What this means in practice

p_θ(y | x) ∝ exp( θ · f(x, y) )

(θ, φ) = argmax_{θ,φ} ∏_{i=1}^{n} p_θ(y_i | x_i) · p_φ(r_i | x_i, y_i, θ)

Here θ and φ are learned jointly: we try to model the class labels and the rationales well. The rationale model is a CRF:

p_φ(r | x, y, θ) ∝ exp( φ · g(r, x, y, θ) )

[Figure: a linear-chain CRF over the tags r_1, r_2, …, r_M; each tag r_m is linked to r_{m−1} and, through the g_φ(·) features, to the word x_m, the class y, and the weight θ_{x_m}.]
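A minimal numeric sketch of the classifier factor p_θ(y|x) ∝ exp(θ · f(x, y)), assuming the common feature map f(x, y) = y · x over bag-of-words counts (an assumption for illustration, not necessarily the poster's exact feature set):

```python
import numpy as np

# Sketch of the conditional log-linear classifier p_theta(y|x) ∝ exp(theta·f(x,y)),
# with the assumed feature map f(x, y) = y * x (bag-of-words counts).
def p_y_given_x(theta, x, labels=(-1, +1)):
    scores = np.array([y * (theta @ x) for y in labels])
    e = np.exp(scores - scores.max())        # numerically stable softmax
    return dict(zip(labels, e / e.sum()))

theta = np.array([1.0, -0.5])   # weights for two hypothetical words
x = np.array([2.0, 0.0])        # word counts in one review
probs = p_y_given_x(theta, x)
print(probs[+1] > probs[-1])    # True: the positively weighted word dominates
```

The joint objective above multiplies this factor by the CRF's probability of the observed rationale tags for each training example.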

[Figure: the alphabetized word list of the review being classified, shown twice — once as scored by the standard model and once as scored by our model.]

Our model: θ_ZE08 · f(x, y = +1) = +0.045 > 0, so y* = +1.

Standard model: θ_std · f(x, y = +1) = −0.041 < 0, so y* = −1.

g_rel(r, x, y, θ) = Σ_{m=1}^{M} I(r_m = I) · B_10(θ_{x_m} · y)

g_{I-O}(r, x, y, θ) = Σ_{m=1}^{M} I(r_{m−1} = I and r_m = O)
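The two g features above can be computed directly from a tag sequence. In this sketch, B10 is a hypothetical stand-in (a simple positive-value threshold); the poster's "What is B10?" panel defines the real binning function:

```python
# Sketch of the two CRF feature counts from the generative model.
# B10 here is a hypothetical stand-in thresholding theta_{x_m} * y.
def g_rel(tags, words, y, theta, B10=lambda v: 1.0 if v > 0 else 0.0):
    # sum over rationale ("I") positions of B10(theta_{x_m} * y)
    return sum(B10(theta.get(w, 0.0) * y)
               for t, w in zip(tags, words) if t == "I")

def g_io(tags):
    # count I -> O transitions, i.e. how often a rationale ends
    return sum(1 for a, b in zip(tags, tags[1:]) if a == "I" and b == "O")

tags  = list("OOOOIIIIIO")
words = "the prince of egypt succeeds where other movies failed .".split()
theta = {"succeeds": 1.2, "failed": -0.8}
print(g_io(tags))                     # 1: the single rationale ends before "."
print(g_rel(tags, words, +1, theta))  # 1.0: only "succeeds" has theta * y > 0
```

The g_rel count rewards tagging high-weight words as rationale material; g_{I-O} lets the CRF learn how often rationales end.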

[Figure: learning curves of accuracy (%) (79–95) vs. training set size (0–1600 documents) for the Generative Approach, the Discriminative Approach, the SVM Baseline, and the Log-Linear Baseline; the two rationale methods are significantly different from (above) their baselines.]

Accuracy (%) by annotator (100 training documents):

Method                  A0     A3     A4     A5
Discriminative Method   75.0   73.0   74.0   72.0
SVM baseline            72.0   72.0   72.0   70.0
Generative Method       76.0   76.0   77.0   74.0
Log-linear baseline     71.0   73.0   71.0   70.0

References:

[PL04] B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. Proc. ACL, pages 271–278, 2004.

[ZEP07] O. Zaidan, J. Eisner, and C. Piatko. Using “annotator rationales” to improve machine learning for text categorization. Proc. NAACL HLT, pages 260–267, 2007.

[ZE08] O. Zaidan and J. Eisner. Modeling annotators: A generative approach to learning from annotator rationales. Proc. EMNLP, pages 31–40, 2008.


Armageddon

This disaster flick is a disaster alright. Directed by Tony Scott (Top Gun), it's the story of an asteroid the size of Texas caught on a collision course with Earth. After a great opening, in which an American spaceship, plus NYC, are completely destroyed by a comet shower, NASA detects said asteroid and go into a frenzy. They hire the world's best oil driller (Bruce Willis), and send him and his crew up into space to fix our global problem.

The action scenes are over the top and too ludicrous for words. So much so, I had to sigh and hit my head with my notebook a couple of times. Also, to see a wonderful actor like Billy Bob Thornton in a film like this is a waste of his talents. The only real reason for making this film was to somehow out-perform Deep Impact. Bottom line is, Armageddon is a failure.

Original Example (x_i)

Contrast Examples (v_ij)

Take-home Message:

- Annotators are underutilized! Richer annotations, such as rationales, can aid machine learning.

- Existing machine learning methods can be modified to exploit rationales.

- Remember, at test time:
  - No change to the decision rule.
  - No new features.
  - No need for rationales.
  Improvements are due solely to the better-learned w / θ.

- Doing an annotation project? Collect rationales! Even a small number could help.


What is B_10?

Classifying pos_987.txt (correct class: +1):

Minimize:
  (1/2) ||w||² + C Σ_i ξ_i + C_contrast Σ_{i,j} ξ_ij

Subject to (for all i and j):
  y_i (w · x_i) ≥ 1 − ξ_i
  y_i (w · x_i − w · v_ij) ≥ μ (1 − ξ_ij)
  ξ_i ≥ 0,  ξ_ij ≥ 0

Equivalently, each contrast constraint is a standard SVM constraint y_i (w · x̃_ij) ≥ 1 − ξ_ij on the pseudo example x̃_ij = (x_i − v_ij) / μ.

[Figure: the weight vector w separates each original example x_i from its contrasts v_i1, v_i2, …, v_ij by margin μ.]
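The modified objective can be sketched with a plain subgradient loop. This is illustrative only (in practice the contrast constraints can be folded into pseudo examples and handed to an off-the-shelf SVM solver, as noted above); the toy data and hyperparameters are assumptions:

```python
import numpy as np

# Illustrative subgradient descent on the modified SVM objective
# (1/2)||w||^2 + C*sum_i xi_i + C_contrast*sum_{i,j} xi_ij.  Not the
# authors' solver -- just the two hinge terms made explicit.
def train(X, y, contrasts, mu=0.5, C=1.0, Cc=1.0, lr=0.05, epochs=300):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = w.copy()                       # from the (1/2)||w||^2 term
        for i in range(len(y)):
            if y[i] * (w @ X[i]) < 1:         # standard hinge is active
                grad -= C * y[i] * X[i]
            for v in contrasts[i]:            # contrast hinge: we want
                d = X[i] - v                  #   y_i * w.(x_i - v_ij) >= mu
                if y[i] * (w @ d) < mu:
                    grad -= Cc * y[i] * d / mu
        w -= lr * grad
    return w

# Two toy documents; each contrast is the document minus its rationale feature.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, -1])
contrasts = [[np.array([0.0, 0.0])], [np.array([0.0, 0.0])]]
w = train(X, y, contrasts)
print(w @ X[0] > 0, w @ X[1] < 0)   # the learned w separates the two classes
```

Dividing the contrast term by μ mirrors the pseudo-example scaling x̃_ij = (x_i − v_ij)/μ, so smaller margins μ make the contrast constraints harder to satisfy.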

A rationale encoded as a tag sequence (I = inside a rationale, O = outside):

x:  the  prince  of  egypt  succeeds  where  other  movies  failed  .
r:  O    O       O   O      I         I      I      I       I       O

(Plus several other g features…)