TRANSCRIPT
DATA SCIENCE MEETS SOFTWARE DEVELOPMENT
Alexis Seigneurin - Ippon Technologies
Who I am
• Software engineer for 15 years
• Consultant at Ippon Tech in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
Data Scientists
Who they are & how they work
Skill set of a Data Scientist
• Strong in:
  • Science (maths / statistics)
  • Machine Learning
  • Analyzing data
• Good / average in:
  • Programming
• Not good in:
  • Software engineering
Programming languages
• Mostly Python, incl. frameworks:
  • NumPy
  • Pandas
  • Scikit-Learn
• SQL
• R
Development environments
• IPython Notebook
• Dataiku
Machine Learning
• Algorithms:
  • Logistic regression
  • Decision trees
  • Random forests
• Implementations:
  • Dataiku
  • Scikit-Learn
  • Vowpal Wabbit
Skill set of a Developer
• Strong in:
  • Software engineering
  • Programming
• Good / average in:
  • Science (maths / statistics)
  • Analyzing data
• Not good in:
  • Machine Learning
How Developers work
• Programming languages:
  • Java
  • Scala
• Development environments:
  • Eclipse
  • IntelliJ IDEA
• Toolbox:
  • Maven
  • …
A typical Data Science project
In the Lab
Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model:
   1. Split the dataset: training/validation/test datasets
   2. Train the model
4. Apply the model on new data
Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
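As an illustration, here is a minimal pandas sketch of these cleansing steps; the file name and column names are hypothetical:

import pandas as pd

# Hypothetical input file and column names, for illustration only
df = pd.read_csv("contracts.csv", dtype=str)

# Convert strings to numbers / booleans (unparseable values become NaN)
df["premium"] = pd.to_numeric(df["premium"], errors="coerce")
df["is_active"] = df["is_active"].str.lower().map({"yes": True, "no": False})

# Parse dates (incorrectly formatted dates become NaT instead of failing)
df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")

# Handle missing values
df["premium"] = df["premium"].fillna(df["premium"].median())
df = df.dropna(subset=["start_date"])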
Feature Engineering
• Transform data into numerical features
• E.g.:
  • A birth date → age
  • Dates of phone calls → number of calls
  • Text → vector of words
  • 2 names → Levenshtein distance (see the sketch below)
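A hedged sketch of three of these transformations in Python; the column names are hypothetical, and the Levenshtein function is a standard dynamic-programming implementation, not the project's own code:

import datetime
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical columns, for illustration only
df = pd.DataFrame({
    "birth_date": pd.to_datetime(["1980-05-12", "1992-11-03"]),
    "claim_text": ["water damage in kitchen", "car accident on highway"],
})

# A birth date -> age
today = pd.Timestamp(datetime.date.today())
df["age"] = ((today - df["birth_date"]).dt.days // 365).astype(int)

# Text -> vector of words (bag-of-words counts)
word_vectors = CountVectorizer().fit_transform(df["claim_text"])

# 2 names -> Levenshtein distance (classic dynamic-programming version)
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]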
Machine Learning
• Train a model
• Test an algorithm with different params
• Cross validation (Grid Search)
• Compare different algorithms (see the sketch below), e.g.:
  • Logistic regression
  • Gradient boosting trees
  • Random forest
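A minimal scikit-learn sketch of a cross-validated grid search comparing two algorithms; it uses the modern sklearn.model_selection API, and the toy data and parameter grids are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for the real features
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validated grid search, run for each candidate algorithm
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=42), {"n_estimators": [50, 100], "max_depth": [5, 10]}),
]
for model, grid in candidates:
    search = GridSearchCV(model, grid, cv=5)
    search.fit(X_train, y_train)
    print(type(model).__name__, search.best_params_, search.best_score_)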
Machine Learning
• Evaluate the accuracy of the model (see the sketch below):
  • Root Mean Square Error (RMSE)
  • ROC curve
  • …
• Examine predictions:
  • False positives, false negatives…
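A short sketch of these evaluation steps with scikit-learn; the labels and scores are toy values:

import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score, roc_curve

# Hypothetical predictions from a trained model
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Root Mean Square Error
rmse = np.sqrt(mean_squared_error(y_true, y_scores))

# ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

# Examine predictions: false positives / false negatives at a 0.5 threshold
y_pred = (y_scores >= 0.5).astype(int)
false_positives = np.sum((y_pred == 1) & (y_true == 0))
false_negatives = np.sum((y_pred == 0) & (y_true == 1))
print(rmse, auc, false_positives, false_negatives)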
Industrialization
Cookbook
Disclaimer
• Context of this project:
  • Not So Big Data (but Smart Data)
  • No real-time workflows (yet?)
Distribute the processing
R E C I P E # 1
Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
Distribute the processing
• In production:
  • H/W resources are constrained
  • Large data sets to process
• Spark:
  • Included in CDH
  • DataFrames (Spark 1.3+) ≃ Pandas DataFrames (see the sketch below)
  • Fast!
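For illustration, a PySpark sketch of the Pandas-like DataFrame API; it uses the SparkSession entry point of later Spark versions, and the path and columns are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("distributed-processing").getOrCreate()

# Hypothetical HDFS path; the processing is distributed over the cluster
df = spark.read.csv("hdfs:///data/contracts.csv", header=True, inferSchema=True)

# Pandas-like DataFrame operations, executed in parallel by Spark
result = (df.filter(F.col("premium") > 0)
            .groupBy("region")
            .agg(F.avg("premium").alias("avg_premium")))
result.show()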
Use a centralized data store
R E C I P E # 2
Use a centralized data store
• Data Scientists store data on their workstations:
  • Limited storage
  • Data not shared within the team
  • Data privacy not enforced
  • Subject to data losses
Use a centralized data store
• Store data on HDFS:
  • Hive tables (SQL)
  • Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
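A hedged PySpark sketch of both storage options; the paths are hypothetical and a "lab" Hive database is assumed to exist:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("centralized-store")
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.csv("hdfs:///landing/contracts.csv", header=True, inferSchema=True)

# Parquet files on HDFS: columnar, compressed, redundant storage
df.write.mode("overwrite").parquet("hdfs:///warehouse/contracts.parquet")

# Hive table: the same data, queryable in SQL by the whole team
df.write.mode("overwrite").saveAsTable("lab.contracts")
spark.sql("SELECT COUNT(*) FROM lab.contracts").show()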
Rationalize the use of programming languages
R E C I P E # 3
Programming languages
• Data Scientists write code on their workstations
  • This code may not run in the datacenter
• Language variety → Hard to share knowledge
Programming languages
• Use widely adopted languages
• Spark in Python/Scala:
  • Support for R is too young
• Provide assistance to ease the adoption!
Use an IDE
R E C I P E # 4
Use an IDE
• Notebooks:
  • Powerful for exploratory work
  • Weak for code editing and structuring
  • Inadequate for code versioning
Use an IDE
• IntelliJ IDEA / PyCharm:
  • Code compilation
  • Refactoring
  • Execution of unit tests
  • Support for Git
Source Control
R E C I P E # 5
Source Control
• Data Scientists work on their workstations:
  • Code is not shared
  • Code may be lost
  • Intermediate versions are not preserved
• Lack of code review
Source Control
• Git + GitHub / GitLab
• Versioning:
  • Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
Packaging the code
R E C I P E # 6
Packaging the code
• Source code has dependencies
• Dependencies in production ≠ at dev time
• Assemble the code + its dependencies
Packaging the code
• Freeze the dependencies:
  • Scala → Maven
  • Python → Setuptools
• Packaging:
  • Scala → Jar (Maven Shade plugin)
  • Python → Egg (Setuptools)
• Compliant with spark-submit (see the sketch below)
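A minimal setup.py sketch with Setuptools; the package name and pinned versions are hypothetical:

# setup.py - minimal Setuptools sketch; name and versions are hypothetical
from setuptools import setup, find_packages

setup(
    name="datalab-pipeline",
    version="1.0.0",
    packages=find_packages(),
    # Freeze dependencies to exact versions so prod matches dev
    install_requires=[
        "numpy==1.9.2",
        "pandas==0.16.0",
    ],
)

Running "python setup.py bdist_egg" then produces an Egg under dist/ that can be shipped to the cluster via spark-submit's --py-files option.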
Secure the build process
R E C I P E # 7
Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
Secure the build process
• Jenkins:
  • Unit test report
  • Code coverage report
  • Packaging: Jar / Egg
  • Dashboard
  • Notifications (Slack + email)
Automate the process
R E C I P E # 8
Automate the process
• Data is loaded manually in HDFS:
  • CSV files, sometimes compressed
  • Often received by email
  • Often samples
Automate the process
• No human intervention should be required:
  • All steps should be code / tools
  • E.g. automate file transfers, unzipping… (see the sketch below)
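A sketch of such an automated ingestion step in Python; all paths are hypothetical:

import subprocess
import zipfile

# Hypothetical paths, for illustration only
archive = "/incoming/contracts-2015-10.zip"
local_dir = "/tmp/contracts"
hdfs_dir = "/landing/contracts"

# Unzip the received file
with zipfile.ZipFile(archive) as zf:
    zf.extractall(local_dir)

# Push the extracted files to HDFS without human intervention
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_dir, hdfs_dir])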
Adapt to living data
R E C I P E # 9
Adapt to living data
• Data Scientists work with:
  • Frozen data
  • Samples
• Risks with data received on a regular basis:
  • Incorrect format (dates, numbers…)
  • Corrupt data (incl. encoding changes)
  • Missing values
Adapt to living data
• Data Checking & Cleansing:
  • Preliminary steps before processing the data
  • Decide what to do with invalid data (see the sketch below)
• Thetis:
  • Internal tool
  • Performs most checking & cleansing operations
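Thetis itself is internal, so as a generic illustration only, here is a PySpark sketch of a checking step that parses dates and numbers and sets invalid rows aside; the paths and columns are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("hdfs:///landing/contracts.csv", header=True)

# Checking: try to parse dates and numbers; rows that fail become NULL
checked = (df.withColumn("start_date", F.to_date("start_date", "yyyy-MM-dd"))
             .withColumn("premium", F.col("premium").cast("double")))

# Decide what to do with invalid data: here, set it aside for inspection
is_invalid = F.col("start_date").isNull() | F.col("premium").isNull()
checked.filter(is_invalid).write.mode("overwrite").parquet("hdfs:///rejected/contracts")
valid = checked.filter(~is_invalid)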
Provide a library of transformations
R E C I P E # 1 0
Library of transformations
• Dataiku "shakers":
  • Parse dates
  • Split a URL (protocol, host, path, …)
  • Transform a post code into a city / department name
  • …
• Cannot be used outside Dataiku
Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library (see the sketch below):
  • Transformation = DataFrame → DataFrame
  • Unit tests
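A minimal sketch of what such a library can look like in PySpark, with each transformation as a pure DataFrame → DataFrame function; the function and column names are hypothetical:

from pyspark.sql import DataFrame, functions as F

# Each transformation is a pure function from DataFrame to DataFrame,
# so transformations compose and can be unit tested in isolation.
def parse_date(df: DataFrame, column: str, fmt: str = "yyyy-MM-dd") -> DataFrame:
    return df.withColumn(column, F.to_date(F.col(column), fmt))

def birth_date_to_age(df: DataFrame, column: str = "birth_date") -> DataFrame:
    return df.withColumn(
        "age", (F.datediff(F.current_date(), F.col(column)) / 365).cast("int"))

# Transformations chain naturally:
# df = birth_date_to_age(parse_date(df, "birth_date"))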
Unit test the data pipeline
R E C I P E # 1 1
Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
Unit test the data pipeline
• Unit test each data transformation stage:
  • Scala: ScalaTest
  • Python: unittest
• Use mock data
• Compare DataFrames:
  • No library (yet?)
  • Compare lists of lists (see the sketch below)
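A hedged unittest sketch that tests the hypothetical parse_date transformation from recipe #10, comparing collected rows as lists of lists:

import datetime
import unittest
from pyspark.sql import SparkSession

# Hypothetical module holding the transformation library from recipe #10
from transformations import parse_date

class ParseDateTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_parse_date(self):
        # Mock data instead of production data
        df = self.spark.createDataFrame([("2015-10-01",), ("not a date",)], ["start_date"])
        result = parse_date(df, "start_date")
        # No DataFrame comparison library: compare lists of lists
        actual = [list(row) for row in result.collect()]
        expected = [[datetime.date(2015, 10, 1)], [None]]
        self.assertEqual(actual, expected)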
Assemble the Workflow
R E C I P E # 1 2
Assemble the Workflow
• Separate transformation processes:
  • Transformations applied to some data
  • Results are frozen and used in other processes
• Jobs are launched manually:
  • No built-in scheduler in Spark
Assemble the Workflow
• Oozie:
  • Spark
  • Map-Reduce
  • Shell
  • …
• Scheduling
• Alerts
• Logs
Summary & Conclusion
Summary
• Keys:
  • Use industrialization-ready tools
  • Pair Programming: Data Scientist + Developer
• Success criteria:
  • Lower time to market
  • Higher processing speed
  • More robust processes
Thank you!
@aseigneurin - @ipponusa