End-User Debugging of Machine Learning Systems
Weng-Keen Wong
Oregon State University
School of Electrical Engineering and Computer Science
http://www.eecs.oregonstate.edu/~wong
Collaborators
• Margaret Burnett
• Simone Stumpf
• Tom Dietterich
• Jon Herlocker
• Erin Fitzhenry
• Lida Li
• Ian Oberst
• Vidya Rajaram
• Russell Drummond
• Erin Sullivan
(Slide columns, in order: Faculty, Grad Students, Undergrads)
Papers
Stumpf, S., Rajaram, V., Li, L., Burnett, M., Dietterich, T., Sullivan, E., Drummond, R., Herlocker, J. (2007). Toward Harnessing User Feedback For Machine Learning. In Proceedings of IUI 2007.
Stumpf, S., Rajaram, V., Li, L., Wong, W.-K., Burnett, M., Dietterich, T., Sullivan, E., Herlocker, J. (2008). Interacting Meaningfully with Machine Learning Systems: Three Experiments. (Submitted to IJHCS)
Stumpf, S., Sullivan, E., Fitzhenry, E., Oberst, I., Wong, W.-K., Burnett, M. (2008). Integrating Rich User Feedback into Intelligent User Interfaces. In Proceedings of IUI 2008.
Motivation
Date: Mon, 28 Apr 2008 23:59:00 (PST)
From: John Doe <[email protected]>
To: Weng-Keen Wong <[email protected]>
Subject: CS 162 Assignment
I can’t get my Java assignment to work! It just won’t compile and it prints out lots of error messages! Please help!
public class MyFrame extends JFrame {
    private AsciiFrameManager reader;
    private JPanel displayPanel;

    public MyFrame(String filename) throws Exception {
        reader = new AsciiFrameManager(filename);
        displayPanel = new JPanel();
        ...
[Figure: candidate folders for the message: CS 162, John Doe, Trash, ?]
• Machine learning tool adapts to end user
• Similar situation in recommender systems, smart desktops, etc.
Motivation
Date: Mon, 28 Apr 2008 23:51:00 (PST)
From: Bella Bose <[email protected]>
To: Weng-Keen Wong <[email protected]>
Subject: Teaching Assignments
I’ve compiled the teaching preferences for all the faculty. Here are the teaching assignments for next year:
Fall Quarter
CS 160 (Computer Science Orientation) – Paul Paulson
CS 161 (Introduction to Programming I) – Chris Wallace
CS 162 (Introduction to Programming II) – Weng-Keen Wong
...
[Figure: folder label shown for the message: Trash]
• Machine Learning systems are great when they work correctly, aggravating when they don’t
• The end user is the only person at the computer
• Can we let end users correct machine learning systems?
Motivation
• Learn to correct behavior quickly
    • Sparse data on start
    • Concept drift
• Rich end-user knowledge
    • Effects of user feedback on accuracy?
    • Effects on users?
Overview
[Diagram: a loop between the Machine Learning Algorithm and the End-User, with arrows labeled "Explanation" and "End user feedback"]
Related Work
Explanation
• Expert Systems (Swartout 83, Wick and Thompson 92)
• TREPAN (Craven and Shavlik 95)
• Description Logics (McGuinness 96)
• Bayesian networks (LaCave and Diez 00)
• Additive classifiers (Poulin et al. 06)
• Others (Crawford et al. 02, Herlocker et al. 00)
End user interaction
• Active Learning (Cohn et al. 96, many others)
• Constraints (Altendorf et al. 05, Huang and Mitchell 06)
• Ranks (Radlinski and Joachims 05)
• Feature Selection (Raghavan et al. 06)
• Crayons (Fails and Olsen 03)
• Programming by Demonstration (Cypher 93, Lau and Weld 99, Lieberman 01)
Outline
1. What types of explanations do end users understand? What types of corrective feedback could end users provide? (IUI 2007)
2. How do we incorporate this feedback into a ML algorithm? (IJHCS 2008)
3. What happens when we put this together? (IUI 2008)
What Types of Explanations do End Users Understand?
• Thinkaloud study with 13 participants
• Classify Enron emails
• Explanation systems: rule-based, keyword-based, similarity-based
• Findings:
    • Rule-based best but not a clear winner
    • Evidence indicates multiple explanation paradigms needed
What types of corrective feedback could end users provide?
Suggested corrective feedback in response to explanations:
1. Adjust importance of word
2. Add/remove word from consideration
3. Parse / extract text in a different way
4. Word combinations
5. Relationships between messages/people
Outline
1. What types of explanations do end users understand? What types of corrective feedback could end users provide? (IUI 2007)
2. How do we incorporate this feedback into a ML algorithm? (IJHCS 2008)
3. What happens when we put this together? (IUI 2008)
Incorporating Feedback into ML Algorithms
Two approaches:
• Constraint-based
• User co-training
Constraint-based approach
Constraints:
1. If weight on word reduced or word removed, remove the word as a feature
2. If weight of word increased, word assumed to be important for that folder
3. If weight of word increased, word is a better predictor for that folder than other words
For an up-weighted word $x_j$ and its folder $y_k$ (inequality direction and primed indices inferred from constraints 2 and 3):
$P(Y = y_k \mid x_j = 1) > P(Y = y_k \mid x_{j'} = 1)$
$P(x_j = 1 \mid Y = y_k) > P(x_j = 1 \mid Y = y_{k'})$
Estimate parameters for Naive Bayes using MLE with these constraints
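A rough sketch of how constraints 1 and 2 might be checked against a Naive Bayes estimate; it is not the constrained optimizer from the talk. It assumes a Bernoulli (word present/absent) model with add-one smoothing, and it reads constraint 2 as "the up-weighted word is more likely in its folder than in any other folder". The function names `bernoulli_estimates` and `violated_constraints`, and the toy folders and words, are hypothetical.

```python
def bernoulli_estimates(docs, labels, removed=frozenset()):
    """Estimate P(x_j = 1 | Y = y) for each word j and folder y.

    Assumption: add-one smoothing; plain MLE would use present / n[y].
    Words in `removed` are dropped as features (constraint 1).
    """
    vocab = {w for d in docs for w in d} - set(removed)
    folders = sorted(set(labels))
    n = {y: labels.count(y) for y in folders}
    theta = {}
    for y in folders:
        for w in vocab:
            present = sum(1 for d, l in zip(docs, labels) if l == y and w in d)
            theta[(w, y)] = (present + 1) / (n[y] + 2)
    return theta, folders


def violated_constraints(theta, folders, increased):
    """Return the (word, folder) pairs whose assumed constraint 2 does not hold.

    increased: list of (word, folder) pairs whose weight the user raised.
    """
    return [
        (w, y)
        for w, y in increased
        if any(theta.get((w, y), 0.0) <= theta.get((w, o), 0.0)
               for o in folders if o != y)
    ]


docs = [["java", "error"], ["java", "help"], ["sale", "java"], ["sale"]]
labels = ["CS 162", "CS 162", "Trash", "Trash"]
theta, folders = bernoulli_estimates(docs, labels, removed={"error"})
todo = violated_constraints(theta, folders, [("java", "CS 162"), ("sale", "CS 162")])
```

Here the constraint for "java" already holds under the unconstrained estimate, so only ("sale", "CS 162") would need the optimizer to move any parameters.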
Standard Co-training
Create classifiers C1 and C2 based on the two independent feature sets.
Repeat i times
    Add most confidently classified messages by any classifier to training data
    Rebuild C1 and C2 with the new training data
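The loop above can be sketched in a few lines of Python. This is an illustrative sketch only: the tiny Naive Bayes class `TinyNB`, the helper `build`, the choice of subject tokens and body tokens as the two views, and the folder name "Admin" are all assumptions made for the example, not details from the talk.

```python
import math
from collections import Counter, defaultdict


class TinyNB:
    """Multinomial Naive Bayes with add-one smoothing over token lists."""

    def fit(self, docs, labels):
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict_proba(self, tokens):
        logp = {}
        for y, n_y in self.prior.items():
            total = sum(self.counts[y].values()) + len(self.vocab)
            logp[y] = math.log(n_y) + sum(
                math.log((self.counts[y][t] + 1) / total)
                for t in tokens if t in self.vocab
            )
        top = max(logp.values())
        z = sum(math.exp(v - top) for v in logp.values())
        return {y: math.exp(v - top) / z for y, v in logp.items()}


def build(labeled):
    """Train one classifier per view from ((view1, view2), label) pairs."""
    labels = [y for _, y in labeled]
    return tuple(TinyNB().fit([v[i] for v, _ in labeled], labels) for i in (0, 1))


def co_train(labeled, unlabeled, rounds=3, per_round=1):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        c1, c2 = build(labeled)
        # Score each message by its single most confident prediction
        # from either classifier.
        scored = []
        for idx, (v1, v2) in enumerate(unlabeled):
            p, y = max(
                (p, y)
                for clf, view in ((c1, v1), (c2, v2))
                for y, p in clf.predict_proba(view).items()
            )
            scored.append((p, idx, y))
        chosen = sorted(scored, reverse=True)[:per_round]
        labeled += [(unlabeled[idx], y) for _, idx, y in chosen]
        taken = {idx for _, idx, _ in chosen}
        unlabeled = [m for i, m in enumerate(unlabeled) if i not in taken]
    return build(labeled), labeled


labeled = [
    ((["java", "compile"], ["error", "help"]), "CS 162"),
    ((["teaching"], ["faculty", "quarter"]), "Admin"),
]
unlabeled = [(["java"], ["error"]), (["teaching"], ["faculty"])]
(c1, c2), grown = co_train(labeled, unlabeled, rounds=2, per_round=1)
```

Each round moves the most confidently classified message into the training data with its predicted label, then both classifiers are rebuilt.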
User Co-training
C_USER = “Classifier” based on user feedback
C_ML = Machine learning algorithm
For each “session” of user feedback
    Add most confidently classified messages by C_USER to training data
    Rebuild C_ML with the new training data
We’ll expand the inner loop on the next slide
User Co-training
For each folder f, let vector v_f = words with weights increased by the user
For each message m in the unlabeled set
    For each folder f,
        Compute Prob_f from the machine learning classifier
        Score_f = (# of words in v_f appearing in the message) * Prob_f
    $f_{max} = \arg\max_{f \in Folders} Score_f$
    $Score_{other} = \max_{f \in Folders \setminus f_{max}} Score_f$
    $Score_m = Score_{f_{max}} - Score_{other}$
Sort Score_m for all messages in decreasing order
Select the top k messages to add to the training set along with their folder label f_max
Rebuild C_ML with the new training data
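The selection step of user co-training might look like the following in Python. This is a minimal sketch under stated assumptions: the function name `select_for_training` is hypothetical, the machine learning classifier is passed in as any function returning per-folder probabilities, and "# of words in v_f appearing in the message" is read as a count of distinct words.

```python
def select_for_training(unlabeled, ml_proba, user_words, k=1):
    """Pick the k unlabeled messages that the user-feedback scores rank highest.

    unlabeled:  list of token lists, one per message
    ml_proba:   function(tokens) -> {folder: Prob_f} from the ML classifier
    user_words: {folder: set of words whose weight the user increased} (v_f)
    Returns a list of (message index, f_max) pairs.
    """
    ranked = []
    for idx, tokens in enumerate(unlabeled):
        present = set(tokens)
        probs = ml_proba(tokens)
        # Score_f = (# of words in v_f appearing in the message) * Prob_f
        score = {
            f: len(user_words.get(f, set()) & present) * p
            for f, p in probs.items()
        }
        f_max = max(score, key=score.get)
        others = [s for f, s in score.items() if f != f_max]
        score_other = max(others) if others else 0.0
        # Score_m = Score_fmax - Score_other
        ranked.append((score[f_max] - score_other, idx, f_max))
    ranked.sort(key=lambda r: r[0], reverse=True)
    return [(idx, f) for _, idx, f in ranked[:k]]


# Toy usage with a stub classifier that is undecided between two folders.
messages = [["java", "assignment", "compile"], ["lunch", "friday"]]
v = {"CS 162": {"java", "compile"}, "Trash": {"lunch"}}
picked = select_for_training(messages, lambda t: {"CS 162": 0.5, "Trash": 0.5}, v, k=2)
```

The margin between the best folder and the runner-up acts as the confidence of the user "classifier", so messages matching several up-weighted words for a single folder are added first.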
Constraint-based vs User co-training
Constraint-based
• Difficult to set “hardness” of constraint
• Constraints often already satisfied
• End-user can over-constrain the learning algorithm
• Slow
User co-training
• Requires unlabeled emails in inbox
• Better accuracy than constraint-based
Results
[Two bar charts of accuracy (0% to 100%) by algorithm: "Feedback from keyword-based paradigm" and "Feedback from similarity-based paradigm"]
Outline
1. What types of explanations work for end users? What types of corrective feedback could end users provide? (IUI 2007)
2. How do we incorporate this feedback into a ML algorithm? (IJHCS 2008)
3. What happens when we put this together? (IUI 2008)
Experiment: Email program
Experiment: Procedure
• Intelligent email system to classify emails into folders
• 43 English-speaking, non-CS students
• Background questionnaire
• Tutorial (email program and folders)
• Experiment task on feedback set
    • Correct folders. Add, remove, change weight on keywords.
• 30 interaction logs
• Post-session questionnaire
Experiment: Data
• Enron data set
• 9 folders
• 50 training messages
    • 10 each for 5 folders with folder labels
• 50 feedback messages
    • For use in experiment
    • Same for each participant
• 1051 test messages
    • For evaluation after experiment
Experiment: Classification algorithm
• “User co-training”
• Two classifiers: User, Naïve Bayes
• Slight modification on user classifier
    • Score_f = sum of weights in v_f appearing in the message
    • Weights can be modified interactively by user
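The modified user classifier can be pictured as a small weight table per folder. The sketch below is illustrative only: the class name `UserClassifier` and its methods are hypothetical, and it assumes each keyword counts once per message regardless of how often it occurs.

```python
class UserClassifier:
    """Per-folder keyword weights that a user can edit interactively."""

    def __init__(self):
        self.weights = {}  # {folder: {word: weight}}, i.e. v_f with weights

    def set_weight(self, folder, word, weight):
        """Add a keyword or change its weight for a folder."""
        self.weights.setdefault(folder, {})[word] = weight

    def remove_word(self, folder, word):
        """Drop a keyword from a folder; no error if it is absent."""
        self.weights.get(folder, {}).pop(word, None)

    def score(self, folder, tokens):
        """Score_f = sum of weights of words in v_f appearing in the message."""
        present = set(tokens)
        return sum(w for word, w in self.weights.get(folder, {}).items()
                   if word in present)


user = UserClassifier()
user.set_weight("CS 162", "java", 2.0)
user.set_weight("CS 162", "compile", 1.0)
user.set_weight("Trash", "sale", 1.5)
message = ["java", "assignment", "java", "compile"]
before = user.score("CS 162", message)
user.remove_word("CS 162", "compile")
after = user.score("CS 162", message)
```

Replacing the word count with a weighted sum lets a single strongly weighted keyword outrank several weak ones.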
Results: Accuracy improvements of rich feedback
[Chart: accuracy Δ over folder feedback, by subject]
Rich Feedback: participant folder labels and keyword changes
Folder feedback: participant folder labels
Results: Accuracy improvements of rich feedback
[Chart: accuracy Δ over baseline, by subject]
Rich Feedback: participant folder labels and keyword changes
Baseline: original Enron labels
Results: Accuracy summary
• 60% of participants saw accuracy improvements, some very substantial
• Some dramatic decreases
• More time between filing emails or more folder assignments → higher accuracy
Interesting bits
1. Need to communicate the effects of the user’s corrective feedback
2. Unstable classifier period
    • With sparse training data, a single new training example can dramatically change the classifier’s decision boundaries
    • Wild fluctuations in classifier’s predictions frustrate end users
    • Causes “wall of red”
Interesting bits: Unstable classifier period
[Line chart: accuracy (0 to 0.7) vs. number of training data points (0 to 350)]
Moved test emails into training set to look for effect on accuracy (Baseline, participant 101)
Interesting bits
3. “Unlearning” important, especially to correct undesirable changes
4. Gender differences
    • Females took longer to complete
    • Females added twice as many keywords
    • Commented more on unlearning
Interesting directions for HCI
1. Gender differences
2. More directed debugging
3. Other forms of feedback
4. Communicating effects of corrective feedback
    • Users need to detect the system is listening to their feedback
5. Explanations
    • Form
    • Fidelity
Interesting directions for Machine Learning
1. Algorithms for learning from corrective feedback
2. Modeling reliability of user feedback
3. Explanations
4. Incorporating new features
Future work
ML Whyline (with Andy Ko)