Daeil Kim: Machine Learning at The New York Times


Machine Learning @ NYT
Dae Il Kim - [email protected]

Overview
● Assisting Great Journalism: The Story of Faulty Takata Airbags
○ Using logistic regression to help uncover suspicious comments
● Extracting insights from big data - a Bayesian perspective
○ BNPy: A fully pythonic framework for Bayesian nonparametric models
○ Refinery: A locally deployable web app for scalable topic modeling
● Using ML to help with news-related, non-journalistic problems
○ Single Copy - Using ML to effectively predict the number of papers to print
○ Subscribers - Retention and audience acquisition
○ Recommendations - Using collaborative topic models for recommendations

Part 1: The Story of Faulty Takata Airbags

Complaints data from the NHTSA (National Highway Traffic Safety Administration)

The Data
The data contains 33,204 comments, 2,219 of which were painstakingly labeled as suspicious (by Hiroko Tabuchi).

A Machine Learning Approach
Develop an algorithm that can predict whether a comment is suspicious or not. The algorithm then learns from the dataset which features are representative of a suspicious comment.

The Machine Learning Approach
A sample comment. We will preprocess this data for the algorithm:

- NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) - LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK, FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB

TOKENIZE
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break the comment into individual words
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break the comment into bigrams (every two-word combination)

FILTER
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in fewer than 5 comments
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in fewer than 5 comments

DATA IS READY FOR TRAINING!

The data now consists of 33,204 examples with 56,191 features
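As a hedged sketch of this preprocessing step, here is how it might look with scikit-learn's CountVectorizer; this is a stand-in for whatever tooling was actually used, and the file name and column names ("complaint_text", "label") are assumptions for illustration.

```python
# Hedged sketch of the preprocessing described above, using scikit-learn's
# CountVectorizer. File name and column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

comments = pd.read_csv("nhtsa_complaints.csv")   # 33,204 comments (hypothetical file)

# Unigrams + bigrams (ngram_range), dropping any token or bigram that
# appears in fewer than 5 comments (min_df=5), as described above.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=5, lowercase=True)
X = vectorizer.fit_transform(comments["complaint_text"])        # sparse counts, ~56K columns
y = (comments["label"] == "suspicious").astype(int).to_numpy()  # 1 = suspicious
```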

Cross-Validation
Each comment ID is a row of features (e.g. word frequencies such as 0 0 0 3 1 0 2 0 ...) together with a label (S = Suspicious, NS = Not Suspicious).
A subset of the data is taken as the training set; the held-out remainder is the test set, used after training to obtain accuracy measures.

How did we do?

Experiment Setup
We hold out 25% of both the suspicious and the not-suspicious comments for testing and train on the rest. We do this 5 times, creating random splits and retraining the model on each split.
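A minimal sketch of this experimental setup, assuming scikit-learn and the X, y arrays from the preprocessing sketch above:

```python
# Sketch of the experiment: 5 random stratified 75/25 splits,
# logistic regression on the training portion, AUC on the held-out portion.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
aucs = []
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]   # P(suspicious)
    aucs.append(roc_auc_score(y[test_idx], scores))

print("Mean AUC over 5 splits:", np.mean(aucs))
```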

Performance!
We obtain a very high AUC (~0.97) on our test sets.

Check what we missed
These comments are potentially worth checking twice.

The most predictive words / features

Predictive of a suspicious comment

Predictive of a normal comment.
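One common way to read off the most predictive features from a trained logistic regression model is to sort its coefficients. A hedged sketch, assuming the fitted model and vectorizer from the sketches above (get_feature_names_out requires a recent scikit-learn):

```python
# Sketch: rank features by their logistic-regression coefficients.
# Large positive weights push a comment toward "suspicious";
# large negative weights push it toward "not suspicious".
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
weights = model.coef_.ravel()
order = np.argsort(weights)

print("Predictive of suspicious:", feature_names[order[-15:]][::-1])
print("Predictive of normal:    ", feature_names[order[:15]])
```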

After training the model, we then applied it to the full dataset.

We looked for comments that Hiroko did not label as suspicious but the algorithm did, and followed up on them (374 out of 33K total).

Result: 7 new cases in which a passenger was injured were discovered among the comments she had missed.
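A sketch of how that follow-up list might be pulled out, assuming the trained model and the X, y arrays from the earlier sketches (variable names are assumptions):

```python
# Sketch: apply the trained model to all 33,204 comments and flag those
# the model calls suspicious but that were not labeled suspicious by hand.
import numpy as np

predicted_suspicious = model.predict(X).astype(bool)   # the model's verdict
labeled_suspicious = y.astype(bool)                    # the hand labels

follow_up = np.where(predicted_suspicious & ~labeled_suspicious)[0]
print(len(follow_up), "comments flagged for a second look")
```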

Part 2: Extracting Interpretable Insights from Big Data

Understanding Documents using Topic Models

There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…

Decompose documents into a probability distribution over "topic" indices.

Example: the document above decomposes into the topics "Politics", "Climate Change", and "Genetics", each with a proportion between 0 and 1.

Topics in turn represent probability distributions over the unique words in your vocabulary.
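In equation form (standard topic-model notation, not copied from the slides): the probability of a word w in document d mixes the per-document topic proportions θ_d with the per-topic word distributions β_k,

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K}
  \underbrace{p(w \mid \text{topic } k)}_{\text{topic--word distribution } \beta_k}\;
  \underbrace{p(\text{topic } k \mid d)}_{\text{document--topic proportions } \theta_d}
```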

Topic Models: A Graphical Model Perspective
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)

Blei et al., 2001

Example document word counts: dna: 2, obama: 1, state: 1, gene: 2, climate: 3, government: 1, drug: 2, pollution: 3, generated from a mixture of the topics "Politics", "Climate Change", and "Genetics" (proportions between 0 and 1).

Bayes' Theorem

Prior belief about the world. In terms of LDA, our modeling assumptions / priors.

Normalization constant makes this problem a lot harder. We need this for valid probabilities.

Likelihood. Given our model, how likely is this data?

Posterior distribution. Probability of our new model given the data.
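Written out, the theorem the slide annotates is (a reconstruction matching the labels above, not copied from the slide):

```latex
\underbrace{p(\text{model} \mid \text{data})}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{p(\text{data} \mid \text{model})}^{\text{likelihood}}\;
        \overbrace{p(\text{model})}^{\text{prior}}}
       {\underbrace{p(\text{data})}_{\text{normalization constant}}}
```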

Posterior Inference in LDA

GOAL: Obtain this posterior

which means that we need to calculate this intractable term:

For LDA, this is the posterior over the latent variables: the per-document topic proportions θ (how much a document contains of each topic k) and the topic assignments z for each word.
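In the notation of Blei et al., the posterior the slide refers to and the intractable normalization term it requires are (reconstructed from the standard LDA write-up for a single document with N words, not copied from the slide):

```latex
p(\theta, z \mid w, \alpha, \beta)
  \;=\; \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},
\qquad
p(w \mid \alpha, \beta)
  \;=\; \int p(\theta \mid \alpha)
        \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\; d\theta
```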


Scalable Learning & Inference in Topic Models

LDA: Latent Dirichlet Allocation (Bayesian Topic Model)

Blei et al., 2001

Analyze a subset of your total documents before updating.

Update θ, z, and β after analyzing each mini-batch of documents.
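A minimal sketch of this mini-batch (online / stochastic variational) scheme using scikit-learn's LatentDirichletAllocation; this is a generic stand-in rather than the exact inference code used at NYT or in BNPy, and doc_term_matrix is an assumed document-term count matrix (e.g. from CountVectorizer).

```python
# Sketch: online (mini-batch) variational inference for LDA.
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=50,           # number of topics K
    learning_method="online",  # update global topics (beta) after each mini-batch
    batch_size=256,            # documents analyzed before each update
    random_state=0,
)

lda.fit(doc_term_matrix)       # internal mini-batching over the corpus

# Or stream your own mini-batches and update incrementally:
# for batch in document_term_batches:
#     lda.partial_fit(batch)

topic_word = lda.components_                 # beta: topics x vocabulary
doc_topic = lda.transform(doc_term_matrix)   # theta: documents x topics
```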

Please check out BNPy (Bayesian Nonparametric Python)

Open source and supports a large set of powerful Bayesian nonparametric models. Actively maintained and highly scalable code.

git clone https://bitbucket.org/michaelchughes/bnpy-dev/

Refinery: An open source web-app for large document analyses

Daeil Kim @ New York Times - Founder of Refinery - [email protected]

Ben Swanson @ MIT Media Lab - Co-Founder of Refinery - [email protected]

Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org

Installing Refinery

3 simple steps to get Refinery running (install the prerequisites first!):
1) Command → git clone https://github.com/daeilkim/refinery.git
2) Go to the root folder. Command → vagrant up
3) Open a browser and go to → 11.11.11.11:8080

A Typical Refinery Pipeline

Step 1: Upload documents

Step 2: Extract Topics from a Topic Model

Step 3: Find a subset of documents with topics of interest.

Step 4: Discover Interesting Phrases

A Quick Refinery Demo

Extracting NYT articles from keyword “obama” in 2013.

What themes / topics defined the Obama administration during 2013?

Future Directions: Better tools for Investigative Reporting

Pipeline: (1) Collecting & Scraping Data → (2) Filtering & Cleaning Data → (3) Extracting Insights
Great tools like DocumentCloud take care of steps 1 & 2.
Refinery focuses on extracting insights from relatively clean data (step 3).
Enterprise stories might be completed in a fraction of the time.

Part 3: Using ML to help the bottom line (non-journalistic endeavors)

Training predictive models for each part of this funnel

We’re interested in developing a meaningful loyal relationship with our readers. Can we discover covariates that indicate better ways to obtain and maintain that relationship with our audience?

Starbucks Single Copy

Using machine learning to predict the number of actual copies we should sell to Starbucks outlets across the nation.

Understanding international audiences

Part of expanding The New York Times internationally will be leveraging algorithms based on topic models to understand reading patterns and behaviors.

Making better recommendations

Given how people read the news and some of their demographic info, can we make better recommendations for articles?

Even better, if they haven't read anything, what kind of recommendations can we make given just their metadata?

Example: a reader (Age: 32, State: NY, Job: Student) reads articles, and the system recommends related ones.

Attract first-time users with relevant content.
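As a much-simplified, hypothetical sketch of the idea (the collaborative topic model mentioned in the overview couples matrix factorization with LDA; here articles are simply scored by topic similarity to a reader's history, and the cold-start case would substitute a metadata-based profile):

```python
# Much-simplified sketch: recommend articles whose topic vectors are closest
# to the average topic vector of what a reader has already read.
# doc_topic (documents x topics) would come from a fitted topic model;
# read_article_ids is a hypothetical reading history for one user.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(doc_topic, read_article_ids, n=5):
    user_profile = doc_topic[read_article_ids].mean(axis=0, keepdims=True)
    scores = cosine_similarity(user_profile, doc_topic).ravel()
    scores[read_article_ids] = -np.inf        # do not re-recommend what was read
    return np.argsort(scores)[::-1][:n]       # indices of the top-n articles

# Example (with doc_topic from the LDA sketch above):
# recommend(doc_topic, read_article_ids=[3, 17, 42])
```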