The Future Of Kaggle
Where we came from and where we’re going
kaggle.com/benhamner
@benhamner
Our mission is to help the world learn from data
We got started running supervised learning competitions
Since 2010, we’ve run
● 240 general competitions
● 1,610 university classroom competitions
We’re now doing this at scale
This has attracted a talented and diverse community
We’ve taught machine learning to hundreds of thousands of people
We’ve pushed the state of the art forward
● What techniques work well
● How people win competitions
● Why our community participates
● What major pain points data scientists hit
● How we can help data scientists ameliorate these pain points
We’ve learned a tremendous amount along the way
Great data scientists optimize the entire ML workflow
GBMs and deep neural networks are incredibly effective
Model ensembling almost always ekes out gains
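A minimal sketch of why averaging can help, not taken from the talk. The targets and the two sets of predictions below are made-up numbers, chosen so that the two hypothetical models err in different directions. Real models are usually more correlated, so the gain is typically much smaller.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ensemble(*prediction_lists):
    """Blend several models by taking the plain mean of their predictions."""
    return [sum(preds) / len(preds) for preds in zip(*prediction_lists)]


# Illustrative data only: two hypothetical models whose errors partly cancel.
y_true = [3.0, 5.0, 2.0, 8.0]
model_a = [3.5, 4.0, 2.5, 7.0]
model_b = [2.0, 5.5, 1.0, 8.5]

blend = average_ensemble(model_a, model_b)

print("model A MSE:", mean_squared_error(y_true, model_a))
print("model B MSE:", mean_squared_error(y_true, model_b))
print("blend MSE:  ", mean_squared_error(y_true, blend))
```

With these numbers the blend scores a lower error than either model alone, because each model's mistakes are partly offset by the other's.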
Successful participants avoid overfitting
We’ve seen major pain points
Today’s practices are like programming in assembly
Compared with software engineering tools, ML tools feel like they came from the Stone Age
Accessing data is tough
Getting high quality data is even tougher
Cleaning data is painful
Essay: “This essay got good marks, but as far as I can tell, it's gibberish.”
Human Scores: 5/5, 4/5
Data leakage is common and subtle
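One subtle form of leakage, sketched here as an illustration rather than an example from the talk, is computing preprocessing statistics on training and test data together. The values and helper names below are hypothetical. The point is only that the test row influences its own scaling in the leaky version.

```python
def fit_mean_std(values):
    """Return the mean and (population) standard deviation of a list."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


def standardize(values, mean, std):
    """Scale values using statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


# Made-up feature values: three training rows and one unusual test row.
train = [1.0, 2.0, 3.0]
test = [10.0]

# Leaky: the test row contributes to the statistics used to scale it.
leaky_mean, leaky_std = fit_mean_std(train + test)
leaky_z = standardize(test, leaky_mean, leaky_std)[0]

# Clean: statistics come from the training rows only.
clean_mean, clean_std = fit_mean_std(train)
clean_z = standardize(test, clean_mean, clean_std)[0]

print("leaky mean:", leaky_mean, "clean mean:", clean_mean)
print("leaky z-score:", leaky_z, "clean z-score:", clean_z)
```

In the leaky version the test row looks far less extreme than it really is, since its own value pulled the mean and spread toward it.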
Going from research to production can be brutal
Reproducing work takes days to months
We can do better than this
Accessing data should be seamless
You should never need to repeat work others have done
A single command should reproduce everything start-to-end
> make all
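The same idea can be sketched in Python as a single entry point that runs every step in order. This is an illustration under stated assumptions, not a tool from the talk: the step names, the toy data, and the `run_all` helper are all hypothetical.

```python
def run_all(steps):
    """Run each (name, function) step in order, passing along earlier outputs."""
    artifacts = {}
    for name, func in steps:
        artifacts[name] = func(artifacts)
    return artifacts


def load_data(_artifacts):
    # Stand-in for reading a raw dataset.
    return [1, 2, 3, 4]


def clean_data(artifacts):
    # Stand-in for a cleaning rule: keep even values only.
    return [x for x in artifacts["load"] if x % 2 == 0]


def build_report(artifacts):
    # Stand-in for the final output of the analysis.
    return sum(artifacts["clean"])


# Hypothetical pipeline: the order here plays the role of Makefile dependencies.
PIPELINE = [("load", load_data), ("clean", clean_data), ("report", build_report)]

result = run_all(PIPELINE)
print(result["report"])
```

Unlike `make`, this sketch reruns every step each time and does not skip work whose inputs are unchanged.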
Making a successful one-line update should take seconds
Helpful metadata shouldn’t stay buried in minds or emails
Best practices should be easy defaults, not complicated custom contraptions
We’re changing this
We’ve launched two new products: Kernels and Datasets
We recently joined Google Cloud to accelerate our growth
Datasets, Kernels, and Competitions have an exciting future
The world’s data will be accessible with a common interface
That captures the important code and metadata on top of it
A central searchable hub for your organization’s data
A kernel is an atom of reproducible data science
Kernels will be your continuous integration server for data
We’ve started running code competitions
● Backtested time series
● Live data feeds
● Reinforcement learning
● Generative modeling
● Adversarial learning
● Machine learning under computational constraints
● Sensitive datasets