TRANSCRIPT
Machine Learning Startups
My Background
Lessons Learned
Turn hard problems into easy ones
ML in practice requires carefully formulating research problems
...and being creative about bootstrapping training data
Lessons Learned
Many ways to capture dependencies
Training data and features > models
Lessons Learned
A model is not a product
Nobody cares about your ideas
Flightcaster
Predicting the real-time state of the global air traffic network
The Prediction Problem
Flight F departing at time T
Likelihood that F departs at T, T+n1, T+n2
Featurizing
Carrier, FAA, weather data
Nightly reset is a natural cadence for feature vecs
Every aircraft has a unique tail #
Fuzzy k-way join on tail #, time, location
Isolate incorrect joins by keeping feature vecs independent
positions in past - already delayed at prediction time?
weather and status - FAA groundings at airports on path?
featurizing time - how delayed and how many mins from departure?
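The fuzzy join above can be sketched in a few lines; the record shape, field names, and 30-minute tolerance here are assumptions for illustration, not the production implementation.

```python
from datetime import datetime, timedelta

def fuzzy_join(carrier_rows, faa_rows, time_tol=timedelta(minutes=30)):
    """Join two record sources on (tail number, location) with a time
    tolerance. Hypothetical record shape: dicts with 'tail', 'airport',
    'time' keys."""
    joined = []
    for c in carrier_rows:
        for f in faa_rows:
            if (c["tail"] == f["tail"]
                    and c["airport"] == f["airport"]
                    and abs(c["time"] - f["time"]) <= time_tol):
                joined.append({**f, **c})  # carrier fields win on conflict
    return joined

carrier = [{"tail": "N123AB", "airport": "SFO",
            "time": datetime(2010, 5, 1, 9, 0), "sched_dep": "09:15"}]
faa = [{"tail": "N123AB", "airport": "SFO",
        "time": datetime(2010, 5, 1, 9, 20), "status": "taxiing"},
       {"tail": "N123AB", "airport": "JFK",
        "time": datetime(2010, 5, 1, 9, 5), "status": "arrived"}]

matches = fuzzy_join(carrier, faa)
# only the SFO record joins: same tail, same airport, within 30 minutes
```

Keeping each joined feature vector independent, as the slide notes, means a bad join corrupts one row rather than a whole aircraft history.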
Models
Trees could pick up dependencies that a linear model couldn’t
But the performance gain became trivially incremental once we added more sophisticated ways of featurizing dependencies
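A minimal illustration of featurizing a dependency so a linear model can use it: the feature names are invented, but the mechanism (an explicit cross term) is the standard trick.

```python
def cross(features, a, b):
    """Add an explicit interaction (AND) feature so a linear model can
    weight the conjunction independently of either base feature."""
    features = dict(features)
    features[f"{a}&{b}"] = features.get(a, 0) * features.get(b, 0)
    return features

x = {"bad_weather": 1, "evening_departure": 1}
x = cross(x, "bad_weather", "evening_departure")
# x now contains 'bad_weather&evening_departure' = 1, a dependency a
# tree would otherwise have to discover by splitting
```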
Tools and Deployment
Clojure on hadoop for featurizing and model training
Wrap complexity in simple API
FP awesome for data pipelines
Write models to json
Product team used Rails
Read json and make predictions
Predictions stored in production DB for eval
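A sketch of the deploy-as-JSON pattern: train offline, write the weights to JSON, and let the product side load the blob and score. The schema and feature names are hypothetical.

```python
import json
import math

def export_model(weights, intercept):
    """Serialize trained logistic weights to a JSON blob (hypothetical schema)."""
    return json.dumps({"weights": weights, "intercept": intercept})

def predict(model_json, features):
    """What the Rails side conceptually does: load the blob, dot product, sigmoid."""
    m = json.loads(model_json)
    z = m["intercept"] + sum(m["weights"].get(k, 0.0) * v
                             for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

blob = export_model({"already_delayed": 2.0, "storm_on_path": 1.5}, -1.0)
p = predict(blob, {"already_delayed": 1, "storm_on_path": 0})
# p == sigmoid(-1.0 + 2.0) ≈ 0.731
```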
Pain Points
Log-based-debugging paradigm sucks
Don’t want to catch ETL and feature eng issues in hadoop setting
At the same time, can’t catch them at tiny scale either, because they need real data at material scale
dirty data -- manual entry
early days of clojure / hadoop
deploying json models rather than services
Lessons learned
Model selection mattered less than featurizing
Many ways to capture dependencies
Intuitions of domain expert useful but also often misleading
Use domain experts to identify data sources
Then build good tools and take scientific approach to exploring the feature space
Computational graph with higher-order functions (HOF) in order to log structured data
Inspired fast debugging with plumbing.graph at prismatic
Isolate issues: single thread, multi thread, multi process and multi machine
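The plumbing.graph idea can be sketched as a map from node names to keyword functions, executed by dependency resolution, with every intermediate value logged as structured data. This toy executor only illustrates the shape; the real library also handles errors, laziness, and parallelism.

```python
import inspect

def run_graph(graph, inputs, log=None):
    """Tiny plumbing.graph-style executor: each node is a function whose
    argument names refer to other nodes (or raw inputs). Intermediate
    values are appended to `log` as structured (name, value) pairs."""
    values = dict(inputs)
    remaining = dict(graph)
    while remaining:
        progressed = False
        for name, fn in list(remaining.items()):
            deps = inspect.signature(fn).parameters
            if all(d in values for d in deps):
                values[name] = fn(**{d: values[d] for d in deps})
                if log is not None:
                    log.append((name, values[name]))
                del remaining[name]
                progressed = True
        if not progressed:
            raise ValueError("unresolvable dependencies: %s" % list(remaining))
    return values

log = []
result = run_graph(
    {"n": lambda xs: len(xs),
     "total": lambda xs: sum(xs),
     "mean": lambda total, n: total / n},
    {"xs": [1, 2, 3]},
    log=log)
# result["mean"] == 2.0; `log` records every intermediate node for debugging
```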
Production was OK not great
Better to put ML behind services
Full stack product team calls backend team APIs
A model is not a product
Humans don’t understand probability distributions
Even if discretized or turned into classification
Solve a human need directly -- turn into recommendations, etc
Prismatic
Personalized Ranking of People, Topics, and Content
The Personalized Ranking Problem
Given an index of content, display the content that maximizes the likelihood of user engagement
Intention: max LT engagement
Proxy: max session interactions
Content
Focused crawling of Twitter, FB and web
Maximum coverage algorithms
Spam content and de-duping
Featurizing
Content and interaction features
Feature crosses and hacks for dependencies
Bootstrapping weight hacks -- can’t train on overly sparse interactions
Scores for interests (topics, people, publishers, …)
Models: Personalized Ranking
Logistic -- newsfeed ranking has to be ultra fast in prod (~100 ms)
Learning to rank -- inversions
Universal features, user specific weight-vectors
Snapshot every session
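The "universal features, user-specific weight-vectors" setup amounts to one dot product per story at serve time, which is how ranking can stay inside the latency budget. The feature and weight names below are made up for illustration.

```python
def score(user_weights, story_features):
    """Rank score: dot product of a per-user weight vector with
    universal story features (hypothetical feature names)."""
    return sum(user_weights.get(k, 0.0) * v for k, v in story_features.items())

def rank(user_weights, stories):
    """Order candidate stories for one user, highest score first."""
    return sorted(stories, key=lambda s: score(user_weights, s["features"]),
                  reverse=True)

alice = {"topic:ml": 2.0, "topic:sports": -1.0, "freshness": 0.5}
stories = [
    {"id": "a", "features": {"topic:sports": 1, "freshness": 1}},
    {"id": "b", "features": {"topic:ml": 1, "freshness": 1}},
]
ordered = rank(alice, stories)
# story "b" ranks first for this user
```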
Models: Classification
How do you train a large set of topic classifiers?
Latent topic models don’t work
But how would we get labeled data to train a classifier for each topic?
Enter distant supervision
Create mechanism to bootstrap training data with noisy labels
Requires lots of heuristics and clever hacks
Snarf docs with twitter queries, etc
Create pos and negs using filters and distance measures
Lots of techniques to featurize text for filters and training
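One way to sketch the bootstrapping step: treat a topic's seed terms as the noisy filter, take documents close to the seeds as positives and documents far from them as negatives, and leave the ambiguous middle unlabeled. The similarity measure and thresholds are illustrative choices, not the production heuristics.

```python
def jaccard(a, b):
    """Set overlap between token sets, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def bootstrap_labels(docs, seed_terms, pos_thresh=0.2, neg_thresh=0.05):
    """Distant-supervision sketch: noisy positives near the seed terms,
    noisy negatives far from them, everything else dropped."""
    labeled = []
    for doc in docs:
        sim = jaccard(doc.lower().split(), seed_terms)
        if sim >= pos_thresh:
            labeled.append((doc, 1))
        elif sim <= neg_thresh:
            labeled.append((doc, 0))
    return labeled

seeds = {"neural", "network", "training", "model"}
docs = ["training a neural network model", "the recipe needs two eggs"]
pairs = bootstrap_labels(docs, seeds)
# → [('training a neural network model', 1), ('the recipe needs two eggs', 0)]
```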
Tools and Deployment
Clojure! Plumbing on github
Clj backend and cljs frontend
Graph, schema, ml libs
Pain Points
Presentation biases
People click what they're shown
Biases clicks toward top stories
Self-reinforcing viral propagation engine
Data Issues
Dupes are easy, but spambots and botnets keep getting more sophisticated, esp. on Twitter
Bootstrapping distant supervision is hard but OK
Bootstrapping ranking with sparse interactions is super hard
Social vs interest based personalization
What’s interesting vs what’s viral?
How do you define what’s interesting?
How much is a share worth compared with dwell time?
Researchers are biased toward their own prefs
Lessons learned
Went overboard with Clojure NIH (not invented here)
Environment changes fast -- missed spark etc
Automated classifier training-data generation and retraining with zero intervention
Can optimize interactions a lot (> 50%)
When data is too sparse, optimize product before optimizing models
Heuristic IR may be good enough for a while
Investment in learning to rank is massive
Goal
10 Companies · 3 Years · $65M
So far
2 Companies · 6 Months · $1MM
Unsexy low beta
Prop models · Prop data
Cyber MGA
Indirect losses
● stock price
● credit rating
● sales
Market
➔ 2.5B today
➔ 35% growth
➔ 50B in 10 years
Catalysts
➔ SEC and EU regs
➔ High profile breaches
➔ Large indirect losses
➔ Positives: recorded breaches
➔ Negatives: random sample of companies (not attacked)
➔ Features: security features
● DNS records, certs, service vulnerabilities, …
First iteration - Supervised Learning
#FAIL
➔ Incorrect assumption: that breached companies have worse security and that the negative samples were never attacked
First iteration - Supervised Learning
Likelihood of breach
Absence of historical data and nonstationarity create a challenging environment
➔ Rich current data isn’t available historically and decays in predictive power over time
➔ Could static data be a more robust and stable predictor of risk?
Relationship with catastrophes
Insurance models for earthquakes, floods, hurricanes
Sparse events (cannot estimate probs from freqs)
Events are correlated (how true is this for cyber?)
Can we draw from ideas in cat risk to model cyber risk?
Relationship with catastrophes
Cat:
➔ Stochastic simulation using physical models
➔ Impacts change in magnitude but not type
Cyber:
➔ Behavior of incentivized cyber-attackers hard to model
➔ Impacts change over time
[Model diagram: inputs include time series / dynamic behavior, industry baseline, infrastructure security, social engineering, assets (+ lifecycle), and an uncertainty load; outputs are frequency of breaches and size of loss]
Broader Approach
Premium Decomposition
Premium Decomposition
Simplifying assumption: we can start incrementally, and loss magnitude will always hit the limit
Likelihood and uncertainty depend on the breach sample
➔ Estimate uncertainty from confidence
➔ Estimate likelihood from risk features
Indirect Losses
Quantifying indirect losses is complicated
➔ Normalizing market and industry effects
➔ Effect of news and corporate events?
➔ Over what time period?
➔ How do we define a statistically significant loss?
Investigation Tools
Roadmap
   | Freq. estimation    | Loss model    | Pricing support
V1 | Industry-based freq. | Stock loss    | Uncertainty from variance
V2 | Net. security model  | --            | --
V3 | Behavior of company  | --            | Better uncertainty quant.
V4 | --                   | Sales losses  | --
V5 | --                   | Credit rating | --
V6 | Social engineering   | --            | --
V7 | --                   | --            | Pricing model
Future Challenges
Accumulation risk
➔ Correlated breaches
➔ Autonomous vehicles
➔ Supply chain
➔ Physical damage
Bloomberg for Back Office
The world’s first AI-enabled compliance solution
Market
➔ Banks spend ~100B on compliance
➔ ~20B on analytics alone
➔ growing at 20% annually
Catalysts
➔ 9/11 and 2008 crisis
➔ 20X explosion in fines
➔ Exec departures
Computer Vision
Image due diligence
Image distance for ID check
➔ Detect faces in the image using pre-trained models
➔ Transform the image using real-time pose estimation
➔ Represent the face on the hypersphere using a neural network
➔ Apply any classification technique to the resulting features
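Once the first three steps have produced embeddings on the hypersphere, the ID check reduces to thresholded similarity. The embeddings and threshold below are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def same_person(emb_a, emb_b, threshold=0.8):
    """Final step of the pipeline: two face embeddings belong to the
    same person iff they are close enough on the hypersphere."""
    return cosine(emb_a, emb_b) >= threshold

id_photo = [0.6, 0.8, 0.0]   # hypothetical unit-length embeddings
selfie   = [0.58, 0.81, 0.05]
other    = [0.0, 0.1, 0.99]
# same_person(id_photo, selfie) → True; same_person(id_photo, other) → False
```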
Image due diligence
➔ Check whether the photos on several IDs belong to the same person
➔ Perform image due diligence in the databases of criminals and other databases
NLP: Detecting Adverse News
IR approach: name + keyword in same sentence
● Low false negs
● High false pos
John Smith
● Judge John Smith sentenced James Doe for money laundering.
● Amy Smith is accused of murder of her brother John Smith.
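The baseline rule and its failure mode can be shown directly: both example sentences trigger the rule for "John Smith" even though neither describes wrongdoing by him.

```python
def adverse_hit(sentence, name, keywords):
    """Baseline IR rule: flag when the name and an adverse keyword
    co-occur in one sentence. High recall, poor precision."""
    s = sentence.lower()
    return name.lower() in s and any(k in s for k in keywords)

kws = ["sentenced", "money laundering", "accused", "murder"]
s1 = "Judge John Smith sentenced James Doe for money laundering."
s2 = "Amy Smith is accused of murder of her brother John Smith."
flags = [adverse_hit(s, "John Smith", kws) for s in (s1, s2)]
# flags == [True, True]: two false positives for John Smith
```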
Raptor NLP: detecting adverse news
Classification approaches:
● General entity-centric “sentiment” classifier
○ High coverage
○ Not easy to interpret and understand what is going on
● Multiple specific relationship extractors (X sentenced for Y, X accused of Y, …)
○ Lower coverage
○ Easy to debug and understand
Problem formulation
Training data: generate noisy training data using heuristics
Positive examples: Look for mentions of people with bad news
Negative examples: Tricky and hard part. Many heuristics:
● Use list of judges and attorneys, search for their mentions
● Simple syntactic rules: “X said”, ...
Distant Supervision: Training data
Heuristics fall into three categories:
a) Poor: doesn’t work
b) Low coverage: only catches a few samples
c) Good: big impact on performance
Distant Supervision: Training data
Different sources have different rates of true vs false positives (think bbc.com vs court proceedings reports).
Use this info with some other heuristic to gain a lot of negative samples.
One heuristic might even be a previous version of the classifier.
Distant Supervision: One nice heuristic
We have to work in multiple languages which limits use of features coming from tools like dependency parsers.
Currently exploring heuristics based on parse trees and machine translation.
Distant Supervision: Languages
Need a model that captures entity centric features and word order
Logistic with classic text features (raw bigrams, entity centric features, dependency parse features, …)
○ Lots of time spent building features
○ Easier to understand, interpret, and debug than neural nets
Deep learning: RNN/CNNs
○ Saves time on feature engineering
○ Hard to debug, understand, and interpret
○ Currently, slightly better performance than features + logistic
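The "classic text features" half of the comparison might look like this: raw bigrams plus entity-centric context words, fed to a logistic model. The feature-name conventions are invented.

```python
def text_features(tokens, entity_idx):
    """Classic feature map for the logistic baseline: raw bigrams plus
    entity-centric features (words adjacent to the entity mention)."""
    feats = {}
    for a, b in zip(tokens, tokens[1:]):
        feats[f"bigram:{a}_{b}"] = 1
    if entity_idx > 0:
        feats[f"ent_prev:{tokens[entity_idx - 1]}"] = 1
    if entity_idx + 1 < len(tokens):
        feats[f"ent_next:{tokens[entity_idx + 1]}"] = 1
    return feats

toks = "smith was charged with fraud".split()
f = text_features(toks, entity_idx=0)  # entity mention is "smith"
# contains bigram:smith_was, bigram:was_charged, ..., and ent_next:was
```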
Distant Supervision: Modeling Approach
Modeling Approach: Recurrent Networks
Pre-trained word embeddings
a) Help to achieve better performance
b) Can be easily obtained for any language
c) Can be shared across multiple tasks
Distant Supervision: Modeling Approach
Modeling Approach: Convolutional Networks
CNN vs RNN setups for NLP
a) CNNs are coming into NLP from CV
b) CNNs are faster than RNNs and can have similar performance
c) In our case, currently a tie
Distant Supervision: Modeling Approach
Open problems with false negatives:
● Information spanning multiple sentences:
○ Coreference resolution (John is mayor of Boston. He was sentenced for …)
○ Discourse analysis (relations between sentences)
● Analysis of formatted text (tables, bullet points, …)
Key Takeaways and open problems
Key Takeaways:
● Improving training data helps a lot more than tweaking the model
● Avoid the academic trap of testing many neural net architectures
Key Takeaways and open problems
Risk Ranking and Networks
Google for Risk
➔ Google wins in ranking because it has the most user click data
➔ We win in risk because we have analyst annotation data
Task: Search for CDD/EDD sources and rank the results based on the risk they represent
Goal: Do not miss anything important AND filter as many false positives as possible
Google for risk
Query
fraud, charged, forgery
Raptor
Ranked results
Problems:
● accurate identification of the person (name collisions)
● identify the right context the person is mentioned in
Additional requirements:
● interpretable results on all levels: rank, risk, NLP
● utilize user feedback: implicit vs explicit
Google for risk - approaches
Risk Model Validation and Interpretability
Prediction vs. Ranking:
Prediction
● Want scores and filtering
● Still interesting to order results
● Loss is error
Learning to Rank
● Want optimal ordering of results
● Scores not interesting
● Loss is number of inversions
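The learning-to-rank loss, counting inversions, is easy to make concrete: count the pairs that the predicted ordering puts in the wrong relative order.

```python
def inversions(predicted, relevant_order):
    """Number of item pairs whose relative order in `predicted`
    contradicts the reference ordering `relevant_order`."""
    pos = {item: i for i, item in enumerate(predicted)}
    count = 0
    for i, a in enumerate(relevant_order):
        for b in relevant_order[i + 1:]:
            if pos[a] > pos[b]:
                count += 1
    return count

truth = ["a", "b", "c", "d"]
pred  = ["b", "a", "c", "d"]
# one swapped pair (a, b) → loss of 1
```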
Google for risk - approaches
Risk Networks
Task: Identify risks from the person’s social network
Evaluate risk in network on different levels:
● node
● edge
● path
● subgraph
Facebook for risk
Facebook for risk: Streamline investigation with the risk network
➔ Links between all people and business entities
➔ PageRank for risk
➔ See the riskiest paths through the network
➔ Drill down into high-risk customer-customer and customer-entity relationships
Building the multigraph (1)
Nodes = entities (people and orgs)
Edges = relationships
1. Start with extracting relationships from structured databases:
○ Wikipedia
○ Company Registers
○ Panama Papers, etc.
2. De-duplicate nodes across different datasets
○ another ML problem
Facebook for risk
[Example network diagram: David Cameron linked through Ian Donald Cameron, Mary Fleur Mount, Nancy Gwen, and Arthur Elwen to Blairmore Holdings Inc, Cameron Optometry Limited, Univel Limited, Accelerated Mailing & Marketing Limited, and Mannings Heath Hotel Limited; sources: Wikipedia, Open Corporates, Panama Papers]
Query: David Cameron → Risk: Blairmore Holdings Inc
Building the multigraph (2)
Extract additional relationships from text => more NLP
○ named entity extraction for nodes
○ relation extraction for edges
Facebook for risk
Ian Cameron was a director of Blairmore Holdings Inc, an investment fund run from the Bahamas but named after the family’s ancestral home in Aberdeenshire.
(subject: Ian Cameron; relation: was a director of; object: Blairmore Holdings Inc)
Roadmap
1. Start with CDD/EDD risk scores at node level
2. Propagate risk across edges to derive edge weights
3. Social Network Analysis:
○ random walk with restarts: PageRank, HITS, Personalized PageRank
4. Subgraph risk ranking: use SNA approaches to featurize graph for ranking
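Step 3, random walk with restarts, can be sketched as Personalized PageRank by power iteration: restart mass returns to seed (known-risk) nodes rather than to all nodes uniformly. The graph, damping factor, and iteration count below are illustrative.

```python
def personalized_pagerank(graph, seeds, alpha=0.85, iters=50):
    """Random walk with restarts: with prob. (1 - alpha) jump back to a
    seed node, otherwise follow a random outgoing edge."""
    nodes = list(graph)
    restart = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue
            share = alpha * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

g = {"risky": ["shell_co"], "shell_co": ["risky", "clean"], "clean": ["shell_co"]}
pr = personalized_pagerank(g, seeds={"risky"})
# mass concentrates around the seed's neighborhood: shell_co outranks clean
```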
Facebook for risk
Link prediction
Task: Inferring new relationships from network and behavior
Approach
1. add behavior data to network
2. extract features from network
○ node based: Jaccard’s coef., preferential attachment, …
○ path based: Katz, PageRank, Personalized PageRank, …
○ feature matrices over pairs of nodes: Path Ranking Algorithm
3. combine with semantic features for each node
4. treat as a binary classification problem
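The node-based features in step 2 are simple set operations over neighborhoods; here are Jaccard's coefficient and preferential attachment for a candidate edge (the toy graph is made up).

```python
def neighbors(graph, n):
    """Neighbor set of node n in an adjacency-list graph."""
    return set(graph.get(n, ()))

def jaccard_coef(graph, a, b):
    """Neighborhood-overlap feature for a candidate edge (a, b)."""
    na, nb = neighbors(graph, a), neighbors(graph, b)
    return len(na & nb) / len(na | nb) if na | nb else 0.0

def preferential_attachment(graph, a, b):
    """Degree-product feature: well-connected nodes tend to link."""
    return len(neighbors(graph, a)) * len(neighbors(graph, b))

g = {"a": ["x", "y"], "b": ["y", "z"], "c": ["w"]}
# jaccard_coef(g, "a", "b") == 1/3; preferential_attachment(g, "a", "b") == 4
```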
Facebook for risk
CEAI Topics
Computational Finance
Computer Vision
Computational Bio & Medicine
Proactive Full-stack MGAs