data science popup austin: data do's and dont's: lessons from the front line

DATA SCIENCE POP UP

AUSTIN

Data Do's and Dont's: Lessons From the Front Line

Ryan Orban

VP of Product and Strategy, Data Scientist, Galvanize

@ryanorban


#datapopupaustin

April 13, 2016

Galvanize, Austin Campus

http://www.dominodatalab.com

Data Do’s and Dont’s: Lessons from the Frontline

Ryan Orban @ryanorban

Co-Founder & CEO, Zipfian Academy

EVP of Product and Strategy, Galvanize

We believe an opportunity belongs to anyone with aptitude and ambition.

Galvanize 2015

NODES ON THE NETWORK

COLORADO (BOULDER, DENVER, FORT COLLINS)

SEATTLE, WA

SAN FRANCISCO, CA

AUSTIN, TX (OPENING Q1 2016)

Programs: Full Stack Immersive, Data Science Immersive, Entrepreneurship


Programs: Full Stack Immersive, Data Science Immersive, Data Engineering Immersive, Masters of Science in Data Science, Entrepreneurship




5 PROGRAMS

• Full Stack Immersive

• Data Science Immersive

• Data Engineering Immersive

• Master of Science in Data Science (University of New Haven)

• Startup Membership

Projected: over 500 student member graduates in 2015

Currently over 1500 members


PLACEMENT STATS

                          FULL STACK IMMERSIVE    DATA SCIENCE IMMERSIVE
Pre-program Salary        $43K                    $72K
Average Starting Salary   $77K                    $114K
Placement Rate*           97%                     94%

*Galvanize is a founding member of NESTA (New Economy Skills Training Association), a trade organization founded to regulate the new “bootcamp” market. This placement rate is more rigorous than that requested by state licensure agencies. The placement rate is calculated 6 months after graduation.

How These Things Relate to Each Other

[Figure: diagram of related skills. Full-Stack Web Development and Data Science are in gray circles. The orange words are the most important things we teach.]

Terms shown: Software Engineering, Data Science, Data Analysis, Data Engineering, Machine Learning, Java, Linux, UNIX, Mobile Development, Objective C, C, C++, C#, Web Development, Ruby on Rails, JavaScript, Front-end, PHP, Full-Stack, Excel, Python, SQL, NLP, Hadoop, Databases, Network Analysis, Assembly, Statistics, R


DATA SCIENCE IMMERSIVE

Week 1 - Exploratory Data Analysis and Software Engineering Best Practices

Week 2 - Statistical Inference, Bayesian Methods, A/B Testing, Multi-Armed Bandit

Week 3 - Regression, Regularization, Gradient Descent

Week 4 - Supervised Machine Learning: Classification, Validation, Ensemble Methods

Week 5 - Clustering, Topic Modeling (NMF, LDA), NLP

Week 6 - Network Analysis, Matrix Factorization, and Time Series

Week 7 - Hadoop, Hive, and MapReduce

Week 8 - Data Visualization with D3.js, Data Products, and Fraud Detection Case Study

Weeks 9-10 - Capstone Projects

Week 12 - Onsite Interviews

Data Manipulation | Model Creation | Prediction

Data Manipulation

Don’t

• Assume your data is friendly

Do

• Automate cleaning and transformation pipelines
  • ETL and feature engineering are largely opaque to others (and to yourself after enough time away)

• Build functional code to be reused; export it into plain code files and track it with Git
  • Jupyter and RStudio are great for EDA, but have issues with collaboration and version control
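As one way to picture the "automate and reuse" advice, here is a minimal sketch of a cleaning pipeline built from small functions that can live in a plain `.py` file under Git. The column names, the sample data, and the step functions are all hypothetical; they are not from the talk.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, str]
Step = Callable[[Row], Optional[Row]]


def strip_whitespace(row: Row) -> Row:
    """Trim stray spaces from every key and value."""
    return {key.strip(): value.strip() for key, value in row.items()}


def drop_missing_age(row: Row) -> Optional[Row]:
    """Discard rows whose age is blank or not a whole number."""
    return row if row.get("age", "").isdigit() else None


def normalize_city(row: Row) -> Row:
    """Use one spelling per city so later group-bys do not split."""
    return {**row, "city": row["city"].title()}


def run_pipeline(rows: Iterable[Row], steps: List[Step]) -> List[Row]:
    """Apply each step in order; a step returning None drops the row."""
    cleaned = []
    for row in rows:
        for step in steps:
            row = step(row)
            if row is None:
                break
        else:
            # Only reached when no step dropped the row.
            cleaned.append(row)
    return cleaned


# Hypothetical messy input: padded fields, a bad age, inconsistent casing.
RAW_CSV = "name, age, city\n Ana , 34, austin\nBo, n/a, DENVER\nCy,29 , seattle \n"
STEPS = [strip_whitespace, drop_missing_age, normalize_city]

rows = run_pipeline(csv.DictReader(io.StringIO(RAW_CSV)), STEPS)
print(rows)
```

Because each step is an ordinary function, it can be unit-tested on its own, and the order of `STEPS` documents the transformation for whoever reads the code later.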

Model Creation

Don’t

• Use accuracy as your main metric
  • You can have 99% accuracy but 0% predictive power
  • Unbalanced classes; sampling

Do

• Use metrics like precision and recall
  • Aggregate metrics like F1-score, AUC/AIC/BIC are also good
  • Remember that the models with the highest scores are not always the ones you need; choose permissive vs. conservative based on the use case
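The "99% accuracy but 0% predictive power" point can be illustrated with a small sketch. The labels below are invented for illustration (10 positives in 1,000 cases), and the two "models" are just hard-coded prediction lists, not trained classifiers.

```python
from typing import Dict, List


def classification_report(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # Ratios with an empty denominator are reported as 0.0 by convention here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical unbalanced data: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990

# "Model" A never predicts the positive class.
always_negative = [0] * 1000

# "Model" B finds 8 of the 10 positives at the cost of 20 false alarms.
screening_model = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

lazy = classification_report(y_true, always_negative)
useful = classification_report(y_true, screening_model)
print("always negative:", lazy)
print("screening model:", useful)
```

Model A scores higher on accuracy yet never finds a positive case, while model B trades some accuracy for recall. Whether B's false alarms are acceptable is the permissive vs. conservative decision, and it depends on the use case.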

Don’t

• Start with the most complicated models first (deep learning, gradient boosting, SVMs, etc.)

• Focus on the algorithm
  • “More data always beats better algorithms”
  • But better features usually beat better algorithms*

Do

• Start with a baseline model, then continuously “close the loop”
  • Create a base case to optimize against
  • Does a 1% greater F1-score outweigh a 10x training time in production? Usually not, unless you operate at Google scale.
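A rough sketch of "create a base case to optimize against": fit the simplest possible model first, record its score, and only then measure what a slightly smarter model adds. The toy data, the majority-class baseline, and the one-feature threshold rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List

Model = Callable[[float], int]


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for binary labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_majority_baseline(y_train: List[int]) -> Model:
    """Baseline: ignore the feature and always predict the most common label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda x: majority


def fit_threshold_rule(x_train: List[float], y_train: List[int]) -> Model:
    """Try each observed value as a cut point; keep the best training F1."""
    best_cut, best_f1 = None, -1.0
    for cut in sorted(set(x_train)):
        score = f1_score(y_train, [int(x >= cut) for x in x_train])
        if score > best_f1:
            best_cut, best_f1 = cut, score
    return lambda x: int(x >= best_cut)


def evaluate(model: Model, x_test: List[float], y_test: List[int]) -> float:
    return f1_score(y_test, [model(x) for x in x_test])


# Hypothetical toy data: one numeric feature, binary label.
x_train = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y_train = [0, 0, 0, 0, 0, 1, 1, 1]
x_test = [0.15, 0.5, 0.75, 0.95, 0.65]
y_test = [0, 0, 1, 1, 1]

baseline_f1 = evaluate(fit_majority_baseline(y_train), x_test, y_test)
rule_f1 = evaluate(fit_threshold_rule(x_train, y_train), x_test, y_test)
print(f"baseline F1={baseline_f1:.2f}, threshold rule F1={rule_f1:.2f}")
```

Every later model is then judged by how much it improves on `baseline_f1`, which makes it easier to ask whether a small gain justifies a much longer training time.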

Don’t

• Assume your cross-validation metrics will hold up against real-life data

Do

• Separate your application and prediction code
  • Fast iteration cycles are key. Create a “scoring service” that is decoupled from application code.
  • APIs and service-oriented architectures typically work best
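One possible shape for such a scoring service, sketched with only the Python standard library. The weights, feature names, port, and the `handle_payload` helper are hypothetical stand-ins; a real service would load a trained model artifact and would likely use a production web framework.

```python
import json
import math
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

# Hypothetical hand-set weights standing in for a trained model artifact.
WEIGHTS = {"visits": 0.3, "purchases": 1.2}
BIAS = -2.0


def score(features: Dict[str, float]) -> float:
    """Logistic score in (0, 1); features missing from the request count as zero."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def handle_payload(body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into (HTTP status, JSON response body)."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object of features")
        status, result = 200, {"score": score(features)}
    except (ValueError, TypeError) as err:
        # Covers malformed JSON and non-numeric feature values.
        status, result = 400, {"error": str(err)}
    return status, json.dumps(result).encode("utf-8")


class ScoringHandler(BaseHTTPRequestHandler):
    """The application only ever sees this HTTP contract, never the model code."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 8000) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer(("127.0.0.1", port), ScoringHandler)
```

Running `make_server().serve_forever()` would start the service locally, and the application would POST a JSON object of features to it. Swapping in a retrained model then means redeploying only this service, which is what keeps iteration cycles short.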

Communication

Don’t

• Focus on the “how”, i.e. cover every trial and tribulation

Do

• Cut to the chase
  • After a presentation, I always ask the class two questions:
    • What is one sentence that describes what the speaker learned?
    • Why do I care?


TALENT

• Early Access to Students

• Candidate Matching

• Curriculum Development

• Corporate Student Sponsorship

• Diversity


ACCESS

• Membership

• Organic Relationships

• Course Content

• Mentorship

• Community

• Events


EXPERTISE

• Galvanize Experts

• Capstone Projects

• Internship

• Corporate Training

THANK YOU

RYAN ORBAN | EVP, STRATEGY | [email protected] | @ryanorban

www.galvanize.com


@datapopup #datapopupaustin
