Posted on 22-Jan-2018

Page 1: How to Build Consistent and Scalable Workspaces for Data Science Teams

How to build consistent, scalable workspaces for data science teams

Elaine Lee

Page 2: How to Build Consistent and Scalable Workspaces for Data Science Teams

Data science is hard. Doing data science is even harder.

Ensuring enough resources
Managing dependencies

http://www.seriouseats.com/assets_c/2014/06/20140525-294370-best-deep-dish-pizza-art-of-pizza-primary-thumb-1500xauto-404176.jpg
https://s-media-cache-ak0.pinimg.com/736x/91/6b/f0/916bf0f23660fc7019353800668060af.jpg

Page 3: How to Build Consistent and Scalable Workspaces for Data Science Teams

Nail it down

Identify system requirements for base Docker image
Stabilize dependencies for data science work environment
Increase test coverage
Get continuous integration (CI) platform on the same page
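One way to keep dependencies stable is to pin exact versions and fail fast when an environment drifts from the pins. The sketch below is a minimal illustration under that assumption, not the team's actual tooling. The function names, the pin list, and the installed versions are all hypothetical.

```python
def parse_pins(requirements_text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "==" not in line:
            raise ValueError(f"unpinned requirement: {line!r}")
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_drift(pins, installed):
    """Return {package: (pinned, installed_or_None)} for every mismatch."""
    drift = {}
    for name, wanted in pins.items():
        found = installed.get(name)
        if found != wanted:
            drift[name] = (wanted, found)
    return drift


# Hypothetical pins and environment, for illustration only.
PINS = parse_pins("""
numpy==1.13.3
pandas==0.21.0  # pinned for the base image
""")
INSTALLED = {"numpy": "1.13.3", "pandas": "0.22.0"}
DRIFT = find_drift(PINS, INSTALLED)
```

Running the same check in the Docker image build and in CI is one way to get both onto "the same page": a non-empty DRIFT would fail the build.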

Page 4: How to Build Consistent and Scalable Workspaces for Data Science Teams

Scale it up

Create a pool of worker machines ready to accept jobs
Set up an asynchronous task queue
Provide a simple command line interface for data scientists
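The queue-and-workers pattern can be sketched in a single process. Here threads stand in for worker machines and Python's standard queue.Queue stands in for the asynchronous task queue. The slides do not say which queue system is used, so this is only an illustration of the pattern, and every name in it is hypothetical.

```python
import queue
import threading


def worker(tasks, results):
    """Pull jobs until a None sentinel arrives, then exit."""
    while True:
        job = tasks.get()
        if job is None:
            tasks.task_done()
            break
        job_id, func, args = job
        try:
            results[job_id] = ("ok", func(*args))
        except Exception as exc:  # report the failure instead of killing the worker
            results[job_id] = ("error", repr(exc))
        finally:
            tasks.task_done()


def run_pool(jobs, n_workers=3):
    """Submit jobs to a shared queue and wait for the pool to drain it."""
    tasks, results = queue.Queue(), {}
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        tasks.put(job)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results


# Two illustrative jobs: one succeeds, one raises inside the worker.
RESULTS = run_pool([("a", pow, (2, 10)), ("b", int, ("not a number",))])
```

The submitter never waits on a particular worker. It only puts jobs on the queue, which is what lets the pool grow or shrink without changing the calling code.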

Page 5: How to Build Consistent and Scalable Workspaces for Data Science Teams

Putting it all together

CI platform (on each commit):
- Commit pushed to GitHub
- Pull changes
- Start Docker container
- Run test suite
- Report pass/fail
- Export image for commit to s3 (123abc…)

workers (on each request):
- Request arrives in queue
- Get image for commit from s3 (123abc…)
- Start container from image
- Run task
- Report result
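The hand-off between CI and the workers hinges on images being keyed by commit. The sketch below models that with an in-memory dictionary in place of s3 and a plain dict in place of a Docker image. Exporting only when the tests pass is an assumption made here, since the diagram does not say so, and the class, function names, and commit ids are hypothetical.

```python
class ImageStore:
    """In-memory stand-in for the s3 bucket of exported images."""

    def __init__(self):
        self._images = {}

    def put(self, commit, image):
        self._images[commit] = image

    def get(self, commit):
        if commit not in self._images:
            raise KeyError(f"no image exported for commit {commit}")
        return self._images[commit]


def ci_build(store, commit, run_tests):
    """CI side: run the test suite, then export an image keyed by commit.

    Assumption: the image is exported only when the suite passes.
    """
    passed = run_tests()
    if passed:
        store.put(commit, {"commit": commit})
    return passed


def handle_request(store, request):
    """Worker side: fetch the image for the requested commit, then run the task."""
    image = store.get(request["commit"])
    return {"commit": image["commit"], "result": request["task"]()}


# Hypothetical commit ids, for illustration only.
STORE = ImageStore()
PASSED = ci_build(STORE, "deadbeef", lambda: True)
FAILED = ci_build(STORE, "feedface", lambda: False)
OUT = handle_request(STORE, {"commit": "deadbeef", "task": lambda: 2 + 2})
```

Because a request names its commit, the worker runs the task in the environment that was tested for that commit rather than whatever happens to be installed on the machine.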

Page 6: How to Build Consistent and Scalable Workspaces for Data Science Teams

Benefits

Flexible to any composition of EC2 instances
- Extensible to EMR

Task environment guaranteed
- Isolated from other tasks
- Identical to conditions at time of development

One-time configuration
- EC2 AMI

Extensible command line interface
- R interface
- Cluster management
- Job monitoring
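An extensible command line interface is often built from subcommands, so that job monitoring or cluster management can be added without touching job submission. The sketch below uses Python's standard argparse. The program name, subcommands, and flags are hypothetical and are not taken from the actual tool.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per capability."""
    parser = argparse.ArgumentParser(prog="jobs")
    sub = parser.add_subparsers(dest="command", required=True)

    # Submit a script to run against the image for a given commit.
    submit = sub.add_parser("submit", help="queue a task")
    submit.add_argument("script")
    submit.add_argument("--commit", required=True)

    # Job monitoring.
    status = sub.add_parser("status", help="check on a job")
    status.add_argument("job_id")

    # Cluster management.
    cluster = sub.add_parser("cluster", help="manage workers")
    cluster.add_argument("action", choices=["up", "down"])
    cluster.add_argument("--size", type=int, default=1)
    return parser


# Hypothetical invocation: jobs submit train.py --commit deadbeef
ARGS = build_parser().parse_args(["submit", "train.py", "--commit", "deadbeef"])
```

Each new capability is one more add_parser call, which keeps the interface small for data scientists while leaving room to grow.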

Page 7: How to Build Consistent and Scalable Workspaces for Data Science Teams

Use case: Quality assurance

CI testing

Other tests
- Data validation

- Model consistency
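Data validation and model consistency checks can run as ordinary tasks on the workers. The sketch below shows one plausible shape for each: a row-level validator that reports problems, and a comparison of a rebuilt model's predictions against stored ones. The column names, ranges, and tolerance are hypothetical.

```python
import math


def validate_rows(rows, required, ranges):
    """Return a list of readable problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: missing {col}")
        for col, (lo, hi) in ranges.items():
            value = row.get(col)
            if value is not None and not lo <= value <= hi:
                problems.append(f"row {i}: {col}={value} outside [{lo}, {hi}]")
    return problems


def predictions_consistent(baseline, current, tol=1e-6):
    """True when a rebuilt model reproduces the stored predictions within tol."""
    return len(baseline) == len(current) and all(
        math.isclose(b, c, abs_tol=tol) for b, c in zip(baseline, current)
    )


# Hypothetical rows, for illustration only.
ROWS = [{"age": 30, "income": 50000}, {"age": -1, "income": None}]
PROBLEMS = validate_rows(ROWS, ["age", "income"], {"age": (0, 120)})
```

Because each task runs in the image built for a specific commit, a consistency failure points at a change in code or data rather than a change in the environment.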

http://img.pandawhale.com/post-52368-thanks-obama-making-sandwich-m-whnc.jpeg

Page 8: How to Build Consistent and Scalable Workspaces for Data Science Teams

Use case: Parallelizable tasks

Data manipulation
- Feature engineering

Model builds
- Advanced machine learning algorithms

- Hyperparameter search
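Hyperparameter search parallelizes well because each combination can be scored independently. The sketch below fans a grid out over a thread pool from Python's standard concurrent.futures. The threads stand in for worker machines, and the scoring function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def grid(space):
    """Expand {name: [values]} into a list of parameter dicts."""
    names = sorted(space)
    return [dict(zip(names, combo))
            for combo in product(*(space[n] for n in names))]


def search(score, space, max_workers=4):
    """Score every combination concurrently; return (best_params, best_score)."""
    candidates = grid(space)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_score(params):
    # Hypothetical objective that peaks at depth=3, rate=0.1.
    return -((params["depth"] - 3) ** 2) - abs(params["rate"] - 0.1)


BEST, BEST_SCORE = search(toy_score, {"depth": [1, 3, 5], "rate": [0.01, 0.1, 1.0]})
```

In a setup like the one described in these slides, each candidate would presumably become one request on the task queue instead of one thread, with the same "score independently, then pick the best" structure.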

https://pbs.twimg.com/media/Buw8Bz6IIAAxgxg.png

Page 9: How to Build Consistent and Scalable Workspaces for Data Science Teams

Elaine Lee
Data Engineer

[email protected]
@elaineklee

avant.com
