Building Continuous Learning Systems


TRANSCRIPT

Page 1: Building Continuous Learning Systems

Continuous Online Learners

Anuj Gupta, Saurabh Arora (Freshdesk)

Page 2: Building Continuous Learning Systems

Agenda

1. Problem v 1.0
2. Solution
3. Issues
   a. Drift
   b. Evolving vocab
   c. Feedback loop
4. Problem v 2.0
5. Our solution
   a. Global
   b. Local
   c. glocal
   d. Drift detection
6. Local: pros and cons
7. Way forward
8. Conclusion / takeaways

Page 3: Building Continuous Learning Systems

Problem Statement – v 1.0

• Build a spam filter for Twitter.

• Use case: in customer service, we listen to Twitter on behalf of brands and figure out what it is that brands can respond to.

• Examples: (sample tweets omitted)

• In short: filter spam out of the actionable tweets in a brand's real-time Twitter stream.

Page 4: Building Continuous Learning Systems

Twitter is noisy

There is ~65-70% noise in consumer-to-business communication (and ~100% noise in business-to-consumer).

The percentage of noise is only higher if you are a big B2C company.

Page 5: Building Continuous Learning Systems

Solution

• Model it as a (binary) classification problem.

• Acquire a good-quality dataset.

• Engineer features: there are some very good indicators.

• Select an algorithm.

• Train, test, tune: ~85% accuracy.

• Deploy.

(Figure: incoming tweets classified into Actionable vs. Spam.)

Page 6: Building Continuous Learning Systems

Paradise lost

In production the model started off very well; however, as time* went by, we found the running accuracy of our model falling.

*within a couple of weeks of deployment

Page 7: Building Continuous Learning Systems

Behind the Scenes

• Our data was changing, and changing fast.

• Non-stationary distributions: a stationary process is time-independent, i.e. its averages remain more or less constant. Our data was not stationary.

• This is also called drift: the distribution generating the data changes over time.

Page 8: Building Continuous Learning Systems

Behind the Scenes

• The vocabulary of our dataset was increasing.

o Unlike any other language, Twitter vocabulary evolves faster; significantly faster.

Page 9: Building Continuous Learning Systems
Page 10: Building Continuous Learning Systems

Behind the Scenes

• Not learning from mistakes: in our system, the user (a brand agent) has the option to let the system know when a classification is wrong.

• The model was not utilizing these signals to improve.

Page 11: Building Continuous Learning Systems

In a Nutshell

• Given the last few slides, the degradation (over time) in the prediction accuracy of our model shouldn't come as a surprise.

• This is not specific to Twitter data. In general, these problems are likely to occur in the following domains:

o Monitoring and anomaly detection (one-class classification) in adversarial settings.

o Recommendations (where user preferences are continuously changing; evolving labels).

o Stock market prediction (concept drift; evolving distributions).

Page 12: Building Continuous Learning Systems

Problem Statement – v 2.0

• Build a spam filter for Twitter which can:

o Handle drift in the data.

o Learn (and improve) from feedback.

o Handle a fast-evolving vocabulary.

• More generally: build a classifier which can:

o Handle drift in the data.

o Learn (and improve) from feedback.

o Handle a fast-evolving vocabulary.

Page 13: Building Continuous Learning Systems

Possible Solutions

• Frequently retrain your model on the updated data and redeploy it.

o Training, testing, fine-tuning: a lot of work. Doesn't scale at all.

o You lose all the old learning.

• Continuous learning: the model adapts to the new incoming data.

Page 14: Building Continuous Learning Systems

What worked for us

• Global: a deep-learning model, batch-trained on a large corpus; no short-term updates.

• Local: a per-brand model; a fast learner with instant feedback.

• Plus: detect drift.

Page 15: Building Continuous Learning Systems

Text Representation

• Preprocess the tweets: replace mentions, hashtags, URLs, emojis, dates, numbers, and currency amounts by relevant constants. Remove stop words.

• How good is your preprocessing? Check against Zipf's law.

• Zipf's law: given a large corpus, if t1, t2, t3, ... are the most common terms (in decreasing order of frequency) and cf_i is the collection frequency of the i-th most common term, then cf_i ∝ 1/i.
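As an illustrative sketch of this step (the replacement constants and the tiny stop-word list are our own assumptions, not the authors' exact choices), the preprocessing plus a quick Zipf check might look like:

    import re
    from collections import Counter

    # Twitter-specific replacements; the constant names are illustrative.
    PATTERNS = [
        (re.compile(r"https?://\S+"), "_URL_"),
        (re.compile(r"@\w+"), "_MENTION_"),
        (re.compile(r"#\w+"), "_HASHTAG_"),
        (re.compile(r"[$€£]\s?\d+(?:[.,]\d+)?"), "_CURRENCY_"),
        (re.compile(r"\d+(?:[./-]\d+)*"), "_NUMBER_"),
    ]
    STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}  # tiny illustrative set

    def preprocess(tweet):
        tweet = tweet.lower()
        for pattern, constant in PATTERNS:
            tweet = pattern.sub(constant, tweet)
        return [tok for tok in tweet.split() if tok not in STOP_WORDS]

    def zipf_check(corpus, top_n=20):
        # Under Zipf's law cf_i * i is roughly constant, so the last
        # printed column should stay in the same ballpark across ranks.
        counts = Counter(tok for tweet in corpus for tok in preprocess(tweet))
        for rank, (term, cf) in enumerate(counts.most_common(top_n), start=1):
            print(rank, term, cf, cf * rank)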

Page 16: Building Continuous Learning Systems

Raw dataset - Zipf’s (mis)fit

Page 17: Building Continuous Learning Systems

Preprocessed dataset - Zipf’s fit

Page 18: Building Continuous Learning Systems

Text Representation

• Word embeddings:

o Use Google's pre-trained word2vec model to replace each word by its corresponding embedding (300 dimensions).

o For a tweet, we average the word-embedding vectors of its constituent words.

o For missing words, we generate a random number in (-0.25, 0.25) for each of the 300 dimensions. (Yann LeCun 2014)

o Final representation: tweet = 300-dimensional vector of real numbers.
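A minimal sketch of this representation, assuming the GoogleNews word2vec binary loaded via gensim (the file name and the per-word OOV cache are our assumptions; the (-0.25, 0.25) range is from the slide):

    import numpy as np
    from gensim.models import KeyedVectors

    # Assumes the pre-trained GoogleNews vectors are available locally.
    w2v = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)
    DIM = 300
    rng = np.random.default_rng(0)
    oov = {}  # cache one random vector per unseen word so it stays stable

    def tweet_vector(tokens):
        """Average the 300-dim embeddings of a tweet's tokens."""
        vecs = []
        for tok in tokens:
            if tok in w2v:
                vecs.append(w2v[tok])
            else:
                if tok not in oov:
                    oov[tok] = rng.uniform(-0.25, 0.25, DIM)
                vecs.append(oov[tok])
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)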

Page 19: Building Continuous Learning Systems

Global model

● DeepNet:

○ CNN.

○ Trained over a corpus of ~8 million tweets.

○ An off-the-shelf architecture gave us ~86% CV accuracy.
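The deck doesn't spell out the architecture, so purely as a hedged sketch of what an "off-the-shelf" text CNN could look like in Keras (all sizes here are assumptions, not the authors' configuration):

    import tensorflow as tf
    from tensorflow.keras import layers

    # Hypothetical sizes; the deck only says "off-the-shelf CNN" over ~8M tweets.
    VOCAB_SIZE, MAX_LEN, EMBED_DIM = 50_000, 40, 300

    global_model = tf.keras.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.Conv1D(128, 5, activation="relu"),   # n-gram feature detectors
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),      # P(spam)
    ])
    global_model.compile(optimizer="adam",
                         loss="binary_crossentropy", metrics=["accuracy"])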

Page 20: Building Continuous Learning Systems

Local

• Goals:

o Strictly improve with every feedback.

o Higher retention of older concepts.

• Desired properties:

o Online learner.

o Fast learner; aggressive model updates.

o Incorporates (every last) feedback successfully (after a model update, if the same data point is presented, the model must predict its class label correctly).

o Doesn't forget recent data points (after a model update, if the last N data points are presented, the model must predict their class labels with higher accuracy).

Page 21: Building Continuous Learning Systems

Building the feedback loop

(Diagram: the ML model predicts <Tweet, Yp>; the agent sends back <Tweet, Y> as feedback whenever Y ≠ Yp.)

Page 22: Building Continuous Learning Systems

Possible Approaches

● Reinforcement learning (mini-batches):

○ Reward/punish the model if the prediction is right/wrong.

○ For a binary classification problem, the underlying MDP is too small (2 states); the model doesn't learn much.

○ Works fine if the velocity of feedback data is high (you don't have to wait long to accumulate a mini-batch of feedbacks); many applications don't have high velocity.

● Instant feedback (tiny batches):

○ Just 1 data point can skew the model.

Page 23: Building Continuous Learning Systems

Building the feedback loop

• We model a feedback point <Tweet, Y> as a data point presented to the local model in an online setting.

• Thus, a bunch of feedback points = an incoming data stream. Thus, we use an online learner.

• The online method in ML:

o Data is modeled as a stream.

o The model makes a prediction (y') when presented with a data point (x).

o The environment reveals the correct class label (y).

o If y ≠ y', update the model.

Page 24: Building Continuous Learning Systems

Online Algorithms

http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_comparison.html

You can try various online classifiers on your dataset. We chose Crammer's PA-II as our local model.
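For illustration: scikit-learn's PassiveAggressiveClassifier with loss="squared_hinge" implements the PA-II update, so a minimal sketch of the online loop from the previous slide could look like this (the seed batch and the class encoding are stand-ins):

    import numpy as np
    from sklearn.linear_model import PassiveAggressiveClassifier

    # loss="squared_hinge" selects the PA-II update; C controls aggressiveness.
    local = PassiveAggressiveClassifier(C=0.1, loss="squared_hinge")
    CLASSES = np.array([0, 1])  # e.g. 0 = actionable, 1 = spam

    # Seed with an initial labelled batch (stand-in random data here).
    X0, y0 = np.random.randn(100, 300), np.random.randint(0, 2, 100)
    local.partial_fit(X0, y0, classes=CLASSES)

    def on_feedback(x, y_true):
        """One feedback point <tweet_vector, correct_label> in the online setting."""
        x = x.reshape(1, -1)
        y_pred = local.predict(x)[0]
        if y_pred != y_true:      # mistake-driven update, as on the previous slide
            local.partial_fit(x, [y_true])
        return y_pred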

Page 25: Building Continuous Learning Systems

Results of Local

• Dataset: 160K tweets from 2015, time-sequenced.

• Feedback incorporation improves accuracy:

o We trained the model (offline, batch mode) on the first 100K data points.

o On the test set (the last 60K data points) it gave 74% accuracy (offline batch mode).

o We then ran the model on the test data in an online fashion. The model made a total of 9,028 mistakes; these mistakes were instantaneously fed back to the local model. This gives an accuracy of ~85% across the test set.

o We gained ~11% accuracy by incorporating feedback.

Page 26: Building Continuous Learning Systems

PA-II parameter tuning

Page 27: Building Continuous Learning Systems

Improving accuracy

Page 28: Building Continuous Learning Systems

It's no fluke

We tested the local model by feeding it deliberately wrong feedback:

Page 29: Building Continuous Learning Systems

glocal: ensembling global and local

• We use online stacking to ensemble our continuously adapting local model and the erudite DeepNet model.

• The outputs of the global and the local models go to an Online SVM.

• We train the ensemble offline in batch mode, but continue to train it further on feedback points in an online fashion.

• We get a CV accuracy of 82%.

(Diagram: Global + Local → Online SVM → glocal)
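A sketch of the online stacking, under stated assumptions: global_score is a hypothetical helper returning the CNN's spam probability, local is the PA-II model from the earlier sketch, and SGDClassifier with hinge loss stands in for the "Online SVM":

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Hinge loss makes SGDClassifier a linear SVM that supports online updates.
    ensemble = SGDClassifier(loss="hinge")
    CLASSES = np.array([0, 1])

    def meta_features(tweet_vec):
        """2-dim meta input: the global and local scores for one tweet."""
        g = global_score(tweet_vec)                               # assumed helper: CNN's P(spam)
        l = local.decision_function(tweet_vec.reshape(1, -1))[0]  # PA-II margin
        return np.array([[g, l]])

    # Pre-train in batch offline (loop this over a labelled set), then keep
    # updating the stacked model on every feedback point:
    def glocal_feedback(tweet_vec, y_true):
        ensemble.partial_fit(meta_features(tweet_vec), [y_true], classes=CLASSES)

Stacking keeps the two models decoupled: the CNN stays frozen between batch retrains, while the lightweight meta-learner keeps absorbing feedback.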

Page 30: Building Continuous Learning Systems
Page 31: Building Continuous Learning Systems

Last but not the Least: Handle Drift

● Periodically replace the model.

○ Shooting in the dark, especially when drifts are few and far between.

● Better: find out whether a drift has actually occurred.

○ If it has, adapt to the changes.

○ 3 main algorithms: DDM (Gama et al. 2004), EDDM, DDD.

○ What about the old model? It knows the old concept, so keep it in case the old distribution lingers.

Page 32: Building Continuous Learning Systems

Handle Drift

We borrow the Drift Detection Method (DDM) of Gama et al. (2004).
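A compact sketch of DDM as described in the paper: track the running error rate p and its standard deviation s = sqrt(p(1-p)/n); warn when p + s ≥ p_min + 2·s_min, and flag drift when p + s ≥ p_min + 3·s_min (the warm-up threshold of 30 samples follows the paper's recommendation):

    import math

    class DDM:
        """Drift Detection Method (Gama et al. 2004), minimal sketch."""
        def __init__(self, min_samples=30):
            self.n = self.errors = 0
            self.min_samples = min_samples
            self.p_min = self.s_min = float("inf")

        def update(self, mistake):
            """Feed one prediction outcome; returns 'stable', 'warning' or 'drift'."""
            self.n += 1
            self.errors += int(mistake)
            p = self.errors / self.n
            s = math.sqrt(p * (1 - p) / self.n)
            if self.n < self.min_samples:
                return "stable"
            if p + s < self.p_min + self.s_min:
                self.p_min, self.s_min = p, s
            if p + s >= self.p_min + 3 * self.s_min:
                return "drift"      # adapt: retrain or swap in a new model
            if p + s >= self.p_min + 2 * self.s_min:
                return "warning"    # start buffering recent data for retraining
            return "stable"

Each prediction feeds detector.update(y_pred != y_true); on "drift" you retrain or replace the model, keeping the old one around in case the old distribution lingers, as the previous slide suggests.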

Page 33: Building Continuous Learning Systems

Pros

• Improves running accuracy.

• Personalization: the notion of spam varies from brand to brand. Some brands treat 'Hi', 'Hello' as spam while others treat them as actionable. The local model serves well as a per-user statistical model, thus bringing in user personalization: learning from feedback, the model adapts to the notions of the brand.

• It's lightweight and fast, and thus easy to bootstrap, deploy, and scale.

Page 34: Building Continuous Learning Systems

Cons

• The PA-II decision boundary is a hyperplane that divides the feature space into 2 half-spaces.

• The margin of a data point is the distance between the data point and the hyperplane.

• A model update produces a new hyperplane that remains as close as possible to the current one while achieving at least a unit margin on the most recent data point.

• Thus, incorporating a feedback point is nothing but shifting the hyperplane to attain a unit margin on that point.

• Let's see this visually.
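For reference, the PA-II update behind this picture (Crammer et al. 2006, reference 1), with hinge loss ℓ_t and aggressiveness parameter C:

    \ell_t = \max\{0,\; 1 - y_t\,(\mathbf{w}_t \cdot \mathbf{x}_t)\}, \qquad
    \tau_t = \frac{\ell_t}{\|\mathbf{x}_t\|^2 + \frac{1}{2C}}, \qquad
    \mathbf{w}_{t+1} = \mathbf{w}_t + \tau_t\, y_t\, \mathbf{x}_t

The 1/(2C) term damps the step: as C grows, the update approaches the exact unit-margin shift described above, which is what makes each feedback point pull the hyperplane toward its own class.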

Page 35: Building Continuous Learning Systems

Cons

• This shifting of the hyperplane increases the model's accuracy on one class (the correct label of the feedback point) while decreasing its accuracy on the other class.

• To verify this, split the test set into 2 chunks by class, and run the local model on only 1 chunk. If the hypothesis above is true, then:

o The number of feedbacks should be very small, and only in the initial part of the dataset.

o The running accuracy should only increase.

Page 36: Building Continuous Learning Systems

• Changing the algorithm doesn't help much: all online-learning classifiers in the current literature are linear.

Page 37: Building Continuous Learning Systems

Way Forward

• Instead of modeling the problem as classification, model it as ranking (Gmail's Priority Inbox does this).

• Actionable tweets rank high; spam tweets rank low.

• Actionable vs. spam = finding a cut-off in the ranking.

• Incorporating feedback = updating the algorithm to get a better ranking without getting biased towards one class.

• This is a work in progress.

Page 38: Building Continuous Learning Systems

Take Home

• Incorporating feedback is an important step in improving your model's performance.

• Global + Local is a great way to introduce personalization in ML.

• PA-II does well as the local model, provided your data is such that most data points are far from the decision hyperplane.

• For domains where distributions are continuously evolving, handling drift is a must.

Page 39: Building Continuous Learning Systems

References

1. "Online Passive-Aggressive Algorithms" – Crammer et al., JMLR 2006
2. "The Learning Behind Gmail Priority Inbox" – Aberdeen et al., LCCC: NIPS Workshop 2010
3. "Learning with Drift Detection" – Gama et al., SBIA 2004
4. "Early Drift Detection Method" – Baena-García et al., IWKDDS 2006
5. "DDD: A New Ensemble Approach for Dealing with Concept Drift" – Minku et al., IEEE TKDE 2012
6. "Adaptive Regularization of Weight Vectors" – Crammer et al., NIPS 2009
7. "Exact Soft Confidence-Weighted Learning" – Wang et al., ICML 2012
8. LIBOL, a library for online learning algorithms: https://github.com/LIBOL/LIBOL