Building Continuous Learning Systems


Continuous Online Learners

Anuj Gupta, Saurabh Arora (Freshdesk)

Agenda

1. Problem v 1.0

2. Solution

3. Issues: a. Drift, b. Evolving Vocab, c. Feedback loop

4. Problem v 2.0

5. Our Solution: a. Global, b. Local, c. Glocal, d. Drift Detection

6. Local – pros and cons

7. Way Forward

8. Conclusion/takeaway

Problem Statement – v 1.0

• Build a spam filter for Twitter.

• Use case: in customer service, we listen to Twitter on behalf of brands and figure out what the brand can respond to.

• Examples:

Goal: filter spam from the actionable tweets in a brand's real-time Twitter stream.

Twitter is noisy

There is ~65-70% noise in consumer-to-business communication [and 100% noise in business-to-consumer ].

The % of noise is only higher if you are a big B2C company.

Solution

• Model it as a (binary) classification problem.

• Acquire good quality dataset.

• Engineer features – there are some very good indicators.

• Select an algorithm.

• Train-test-tune, ~85% accuracy.

• Deploy.
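A minimal sketch of what such a batch pipeline might look like (illustrative only: the file name, column names and TF-IDF features are assumptions standing in for the engineered features above):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labelled dataset with columns "text" and "is_spam".
df = pd.read_csv("labelled_tweets.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["is_spam"], test_size=0.2, random_state=42)

# TF-IDF stands in for the hand-engineered features; any classifier would do.
clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))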

Actionable vs. Spam (example tweets)

Paradise Lost

In production the model started off very well; however, as time* went by, we found the running accuracy of our model started falling.

*within a couple of weeks of deployment

• Our data was changing and changing fast.

Behind the Scenes

Non-stationary distributions

A stationary process is time-independent: its averages remain more or less constant.

This is also called drift: the distribution generating the data changes over time.

• The vocabulary of our dataset was growing.

o Unlike most languages, Twitter vocabulary evolves significantly faster.

Behind the Scenes

• Not learning from mistakes: in our system, the user (a brand agent) has the option to tell the system when its classification is wrong.

• The model was not utilizing these signals to improve.

Behind the Scenes

In a Nutshell

• Based on the last few slides, the degradation (with time) in the prediction accuracy of our model shouldn't come as a surprise.

• This is not specific to Twitter data. In general, these problems are likely to occur in the following domains:

o Monitoring & anomaly detection (one-class classification) in an adversarial setting

o Recommendations (where the user preferences are continuously changing; evolving labels)

o Stock market predictions (concept drift; evolving distributions).

• Build a spam filter for Twitter which can:

o Handle drift in the data.

o Learn (and improve) using feedback.

o Handle a fast-evolving vocabulary.

Problem Statement – v 2.0

• Build a classifier which can:

o Handle drift in the data.

o Learn (and improve) using feedback.

o Handle a fast-evolving vocabulary.

Possible Solutions

• Frequently retrain your model on the updated data and redeploy it.

o Training, testing, fine-tuning: a lot of work. Doesn't scale at all.

o Lose all the old learnings.

• Continuous learning: the model adapts to the new incoming data.

What worked for us

• Global: a deep learning model, batch trained on a large corpus, no short-term updates.

• Local: a per-brand model, a fast learner, updated with instant feedback.

• Drift detection.

Text Representation

• Preprocess the tweets: replace mentions, hashtags, URLs, emojis, dates, numbers and currency with relevant constants. Remove stop words.

• How good is your preprocessing? Check against Zipf's law.

• Given a large corpus, if t1, t2, t3, … are the most common terms (ranked by frequency) and cf_i is the collection frequency of the i-th most common term, then cf_i ∝ 1/i.

Raw dataset - Zipf’s (mis)fit

Preprocessed dataset - Zipf’s fit
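A rough sketch of this preprocessing and the Zipf check (the regex patterns and constants are illustrative, not the exact ones used):

import re
from collections import Counter

def preprocess(tweet):
    # Replace Twitter-specific tokens with constants (illustrative patterns).
    tweet = re.sub(r"https?://\S+", " _URL_ ", tweet)
    tweet = re.sub(r"@\w+", " _MENTION_ ", tweet)
    tweet = re.sub(r"#\w+", " _HASHTAG_ ", tweet)
    tweet = re.sub(r"\d+", " _NUM_ ", tweet)
    return tweet.lower()

def zipf_check(tweets, top=20):
    # Under Zipf's law, rank * collection frequency stays roughly constant.
    counts = Counter(tok for t in tweets for tok in preprocess(t).split())
    for rank, (term, cf) in enumerate(counts.most_common(top), start=1):
        print(rank, term, cf, rank * cf)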

Text Representation

• Word embeddings:

o Use Google’s pre-trained word2vec model to replace a word by its corresponding embedding (300 dimensions).

o For a tweet, we average all the word embedding vectors for its constituent words.

o For missing words, we generate a random number in (-0.25, 0.25) for each of the 300 dimensions. (Yann LeCun 2014)

o Final representation:

Tweet = 300-dimensional vector of real numbers
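A sketch of this representation with gensim's word2vec loader (the model file name and the per-word caching of random out-of-vocabulary vectors are assumptions):

import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
oov = {}  # one fixed random vector per unseen word

def tweet_vector(tokens, dim=300):
    vecs = []
    for tok in tokens:
        if tok in w2v:
            vecs.append(w2v[tok])
        else:
            # Missing word: random values in (-0.25, 0.25) per dimension.
            vecs.append(oov.setdefault(tok, np.random.uniform(-0.25, 0.25, dim)))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)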

● DeepNet

○ CNN

○ Trained over a corpus of ~8 million tweets

○ An off-the-shelf architecture gave us ~86% CV accuracy.

Global model
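One plausible off-the-shelf architecture in this spirit is a Kim-2014-style 1-D CNN over the sequence of word embeddings; the layer sizes below are assumptions, not the deck's actual model:

import tensorflow as tf

max_len, dim = 50, 300  # assumed padded tweet length and embedding size
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len, dim)),       # pre-computed word2vec vectors
    tf.keras.layers.Conv1D(128, 3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # spam vs. actionable
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])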

Local

• Goals:

o Strictly improve with every feedback.

o Higher retention of older concepts.

• Desired properties:

o An online learner.

o A fast learner; aggressive model updates.

o Incorporates every (last) feedback successfully (after a model update, if the same data point is presented again, the model must predict its class label correctly).

o Doesn't forget recent data points (after a model update, if the last N data points are presented, the model must predict their class labels with higher accuracy).

Building the feedback loop

The ML model emits a prediction <Tweet, Yp>; if the agent's label Y disagrees (Y ≠ Yp), the pair <Tweet, Y> is sent back to the model as feedback.

Possible Approaches

• Reinforcement learning: reward/punish the model when the prediction is right/wrong. For a binary classification problem the underlying MDP is too small (2 states), so it doesn't learn much.

• Mini-batches of feedback: works fine if the velocity of feedback data is high (you don't have to wait long to accumulate a mini-batch of feedbacks). Many applications don't have high velocity.

• Instant feedback, tiny batches: just 1 data point can skew the model.

Building the feedback loop

• We model a feedback point <Tweet, Y> as a data point presented to the local model in an online setting.

• Thus, a bunch of feedbacks = an incoming data stream.

• Thus, we use an online learner.

• The online method in ML:

o Data is modeled as a stream.

o The model makes a prediction (y') when presented with a data point (x).

o The environment reveals the correct class label (y).

o If y ≠ y', update the model.

Online Algorithms

http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_comparison.html

You can try various online classifiers on your dataset. We chose Crammer's PA-II as our local model.
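A minimal sketch of such a local learner using scikit-learn's PassiveAggressiveClassifier (loss="squared_hinge" corresponds to PA-II); X_seed, y_seed and tweet_vec are placeholders:

import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

local = PassiveAggressiveClassifier(C=1.0, loss="squared_hinge")  # squared_hinge => PA-II
# Seed the learner with an initial labelled batch (placeholder arrays).
local.partial_fit(X_seed, y_seed, classes=[0, 1])

def on_feedback(tweet_vec, true_label):
    # The agent flagged a wrong prediction: update on this single point.
    local.partial_fit(tweet_vec.reshape(1, -1), [true_label])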

• Dataset: 160K tweets from 2015, time-sequenced.

• Feedback incorporation improves accuracy:

o We trained the model (offline, batch mode) on the first 100K data points.

o On the test set (the last 60K data points) it gave 74% accuracy (offline, batch mode).

o We then ran the model on the test data (50K data points) in an online fashion. The model made a total of 9028 mistakes; these were instantaneously fed into the local model as feedback. This gives an accuracy of ~85% across the test set.

○ We gained ~11% accuracy by incorporating feedback.

Results of the Local Model:

PA-II parameter tuning

Improving accuracy

It's no fluke

We tested the local model by feeding it wrong feedbacks:

glocal: Ensembling global and local

• We use online stacking to ensemble our continuously adapting local model and the erudite DeepNet model.

• The outputs of the global and local models go to an online SVM.

• We train the ensemble in batch offline but continue to train it further on feedback points in an online fashion.

• We get a CV accuracy of 82%.

[Diagram: Global + Local → Online SVM → glocal]
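A rough sketch of the stacking step, assuming each base model exposes a scoring function (global_score, local_score, X_train and y_train are placeholders); SGDClassifier with hinge loss plays the role of the online SVM:

import numpy as np
from sklearn.linear_model import SGDClassifier

ensemble = SGDClassifier(loss="hinge")  # a linear SVM trained online

def meta_features(tweet_vec):
    # 2-dimensional meta input: the scores of the two base models.
    return np.array([global_score(tweet_vec), local_score(tweet_vec)])

# Offline batch phase on the labelled training set.
M_train = np.array([meta_features(x) for x in X_train])
ensemble.partial_fit(M_train, y_train, classes=[0, 1])

def on_feedback(tweet_vec, true_label):
    # Keep adapting the stacker on every feedback point.
    ensemble.partial_fit(meta_features(tweet_vec).reshape(1, -1), [true_label])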

● Handle Drift

○ Periodically replace the model.

■ Shooting in the dark, especially when drifts are few and far between.

○ Find whether a drift has indeed occurred.

■ If it has, adapt to the changes.

■ 3 main algorithms:

● DDM (Gama et al., 2004) ● EDDM ● DDD

■ What about the old model? It knows the old concept, so keep it in case the old distribution lingers.

Last but not the least

Handle Drift

We borrow the Drift Detection Method (Gama et al., 2004).
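A compact sketch of DDM's core rule (the 2-sigma warning and 3-sigma drift thresholds follow the paper; everything else is simplified):

import math

class DDM:
    def __init__(self):
        self.n, self.errors = 0, 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, prediction_was_wrong):
        self.n += 1
        self.errors += int(prediction_was_wrong)
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1 - p) / self.n)    # its standard deviation
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            return "drift"      # replace/retrain the model
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"    # start buffering data for a new model
        return "in control"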

Pros

• Improves running accuracy.

• Personalization: the notion of spam varies from brand to brand. Some brands treat 'Hi', 'Hello' as spam while others treat them as actionable.

The local model serves well as a per-user statistical model, thus bringing in user personalization. By learning from feedback, the model adapts to the brand's notion of spam.

• It is lightweight and fast, and thus easy to bootstrap, deploy and scale.

Cons

• PA-II's decision boundary is a hyperplane that divides the feature space into 2 half-spaces.

• The margin of a data point is the distance between the data point and the hyperplane.

• A model update produces a new hyperplane that remains as close as possible to the current one while achieving at least a unit margin on the most recent data point.

• Thus, incorporating a feedback is nothing but shifting the hyperplane to achieve a unit margin on the feedback point (the update rule is sketched after this list).

• Let's see this visually.
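For reference, a sketch of the PA-II update from Crammer et al. (2006) behind these bullets, in LaTeX (w_t is the current hyperplane, (x_t, y_t) the feedback point, C the aggressiveness parameter):

w_{t+1} = \arg\min_{w} \; \tfrac{1}{2}\lVert w - w_t \rVert^2 + C\,\xi^2 \quad \text{s.t. } y_t (w \cdot x_t) \ge 1 - \xi

which has the closed-form solution

\ell_t = \max\bigl(0,\, 1 - y_t (w_t \cdot x_t)\bigr), \qquad \tau_t = \frac{\ell_t}{\lVert x_t \rVert^2 + \frac{1}{2C}}, \qquad w_{t+1} = w_t + \tau_t\, y_t\, x_t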

Cons

• This shifting of the hyperplane increases the model's accuracy on one class (the correct label of the feedback point) while decreasing its accuracy on the other class.

• To verify this, split the test set into 2 chunks by class, and run the local model on only 1 chunk. If the above hypothesis is true, then:

• the number of feedbacks should be very small and occur only in the initial part of the dataset, and

• the running accuracy should only increase.

• Changing the algorithm doesn't help much: all online learning classifiers in the current literature are linear.

Way Forward

• Instead of modeling the problem as classification, model it as ranking (Gmail’s priority inbox does this).

• Actionable tweets rank high, spam tweets rank low.

• Actionable vs. spam = finding a cut-off in the ranking.

• Incorporating feedback = updating the algorithm to get a better ranking without getting biased towards one class.

• This is a work in progress.
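A toy illustration of the cut-off idea (purely hypothetical; score() stands in for the future ranking model's output):

def triage(tweets, score, cutoff=0.0):
    # Rank tweets by actionability score; everything above the cut-off is actionable.
    ranked = sorted(tweets, key=score, reverse=True)
    return [t for t in ranked if score(t) >= cutoff]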

Take Home

• Incorporating feedback is an important step in improving your model’s performance.

• Global + Local is a great way to introduce personalization in ML.

• PA-II does well as the local model provided most of your data points are far from the decision hyperplane.

• For domains where distributions are continuously evolving, handling drift is a must.

References

1. "Online Passive-Aggressive Algorithms" - Crammer et al., JMLR 2006

2. "The Learning Behind Gmail Priority Inbox" - Aberdeen et al., LCCC: NIPS Workshop 2010

3. "Learning with Drift Detection" - Gama et al., SBIA 2004

4. "Early Drift Detection Method" - Baena-García et al., Workshop on Knowledge Discovery from Data Streams, 2006

5. "DDD: A New Ensemble Approach for Dealing with Concept Drift" - Minku et al., IEEE Transactions on Knowledge and Data Engineering, 2012

6. "Adaptive Regularization of Weight Vectors" - Crammer et al., NIPS 2009

7. "Soft Confidence-Weighted Algorithms" - Wang et al., 2012

8. LIBOL - A Library for Online Learning Algorithms. https://github.com/LIBOL/LIBOL
