TRANSCRIPT
DATA SCIENCE MEETS SOFTWARE DEVELOPMENT
Alexis Seigneurin - Ippon Technologies
Who I am
• Software engineer for 15 years
• Consultant at Ippon Tech in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
Data Scientists
Who they are & how they work
Skill set of a Data Scientist
• Strong in:
  • Science (maths / statistics)
  • Machine Learning
  • Analyzing data
• Good / average in:
  • Programming
• Not good in:
  • Software engineering
Programming languages
• Mostly Python, incl. frameworks:
  • NumPy
  • Pandas
  • Scikit-Learn
• SQL
• R
Development environments
• IPython Notebook
• Dataiku
Machine Learning
• Algorithms:
  • Logistic regression
  • Decision trees
  • Random forests
• Implementations:
  • Dataiku
  • Scikit-Learn
  • Vowpal Wabbit
Skill set of a Developer
• Strong in:
  • Software engineering
  • Programming
• Good / average in:
  • Science (maths / statistics)
  • Analyzing data
• Not good in:
  • Machine Learning
How Developers work
• Programming languages:
  • Java
  • Scala
• Development environments:
  • Eclipse
  • IntelliJ IDEA
• Toolbox:
  • Maven
  • …
A typical Data Science project
In the Lab
Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model:
   1. Split the dataset: training/validation/test datasets
   2. Train the model
4. Apply the model on new data
Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
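As an illustration, here is a minimal pandas sketch of these cleansing steps; the file name and column names are hypothetical:

import pandas as pd

# Hypothetical input file and column names, for illustration only
df = pd.read_csv("contracts.csv", dtype=str)

# Convert strings to numbers / booleans (unparseable values become NaN)
df["premium"] = pd.to_numeric(df["premium"], errors="coerce")
df["is_active"] = df["is_active"].str.lower().map({"yes": True, "no": False})

# Parse dates (incorrectly formatted dates become NaT instead of failing)
df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")

# Handle missing values
df["premium"] = df["premium"].fillna(df["premium"].median())
df = df.dropna(subset=["start_date"])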
Feature Engineering
• Transform data into numerical features
• E.g.:
  • A birth date → age
  • Dates of phone calls → number of calls
  • Text → vector of words
  • 2 names → Levenshtein distance (see the sketch below)
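A hedged sketch of three of these transformations in Python; the column names are hypothetical, and the Levenshtein function is a standard dynamic-programming implementation, not the project's own code:

import datetime
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical columns, for illustration only
df = pd.DataFrame({
    "birth_date": pd.to_datetime(["1980-05-12", "1992-11-03"]),
    "claim_text": ["water damage in kitchen", "car accident on highway"],
})

# A birth date -> age
today = pd.Timestamp(datetime.date.today())
df["age"] = ((today - df["birth_date"]).dt.days // 365).astype(int)

# Text -> vector of words (bag-of-words counts)
word_vectors = CountVectorizer().fit_transform(df["claim_text"])

# 2 names -> Levenshtein distance (classic dynamic-programming version)
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]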
Machine Learning
• Train a model
• Test an algorithm with different params
• Cross validation (Grid Search)
• Compare different algorithms (see the sketch below), e.g.:
  • Logistic regression
  • Gradient boosting trees
  • Random forest
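A minimal scikit-learn sketch of a cross-validated grid search comparing two algorithms; it uses the modern sklearn.model_selection API, and the toy data and parameter grids are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for the real features
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validated grid search, run for each candidate algorithm
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=42), {"n_estimators": [50, 100], "max_depth": [5, 10]}),
]
for model, grid in candidates:
    search = GridSearchCV(model, grid, cv=5)
    search.fit(X_train, y_train)
    print(type(model).__name__, search.best_params_, search.best_score_)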
Machine Learning
• Evaluate the accuracy of the model (see the sketch below):
  • Root Mean Square Error (RMSE)
  • ROC curve
  • …
• Examine predictions:
  • False positives, false negatives…
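A short sketch of these evaluation steps with scikit-learn; the labels and scores are toy values:

import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score, roc_curve

# Hypothetical predictions from a trained model
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Root Mean Square Error
rmse = np.sqrt(mean_squared_error(y_true, y_scores))

# ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

# Examine predictions: false positives / false negatives at a 0.5 threshold
y_pred = (y_scores >= 0.5).astype(int)
false_positives = np.sum((y_pred == 1) & (y_true == 0))
false_negatives = np.sum((y_pred == 0) & (y_true == 1))
print(rmse, auc, false_positives, false_negatives)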
Industrialization
Cookbook
Disclaimer
• Context of this project:
  • Not So Big Data (but Smart Data)
  • No real-time workflows (yet?)
Distribute the processing
R E C I P E # 1
Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
Distribute the processing
• In production:
  • H/W resources are constrained
  • Large data sets to process
• Spark:
  • Included in CDH
  • DataFrames (Spark 1.3+) ≃ Pandas DataFrames (see the sketch below)
  • Fast!
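For illustration, a PySpark sketch of the Pandas-like DataFrame API; it uses the SparkSession entry point of later Spark versions, and the path and columns are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("distributed-processing").getOrCreate()

# Hypothetical HDFS path; the processing is distributed over the cluster
df = spark.read.csv("hdfs:///data/contracts.csv", header=True, inferSchema=True)

# Pandas-like DataFrame operations, executed in parallel by Spark
result = (df.filter(F.col("premium") > 0)
            .groupBy("region")
            .agg(F.avg("premium").alias("avg_premium")))
result.show()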
Use a centralized data store
R E C I P E # 2
Use a centralized data store
• Data Scientists store data on their workstations:
  • Limited storage
  • Data not shared within the team
  • Data privacy not enforced
  • Subject to data losses
Use a centralized data store
• Store data on HDFS:
  • Hive tables (SQL)
  • Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
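A hedged PySpark sketch of both storage options; the paths are hypothetical and a "lab" Hive database is assumed to exist:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("centralized-store")
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.csv("hdfs:///landing/contracts.csv", header=True, inferSchema=True)

# Parquet files on HDFS: columnar, compressed, redundant storage
df.write.mode("overwrite").parquet("hdfs:///warehouse/contracts.parquet")

# Hive table: the same data, queryable in SQL by the whole team
df.write.mode("overwrite").saveAsTable("lab.contracts")
spark.sql("SELECT COUNT(*) FROM lab.contracts").show()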
Rationalize the use of programming languages
R E C I P E # 3
Programming languages
• Data Scientists write code on their workstations
  • This code may not run in the datacenter
• Language variety → Hard to share knowledge
Programming languages
• Use widely adopted languages
• Spark in Python/Scala:
  • Support for R is too young
• Provide assistance to ease the adoption!
Use an IDE
R E C I P E # 4
Use an IDE
• Notebooks:
  • Powerful for exploratory work
  • Weak for code editing and structuring
  • Inadequate for code versioning
Use an IDE
• IntelliJ IDEA / PyCharm:
  • Code compilation
  • Refactoring
  • Execution of unit tests
  • Support for Git
Source Control
R E C I P E # 5
Source Control
• Data Scientists work on their workstations:
  • Code is not shared
  • Code may be lost
  • Intermediate versions are not preserved
• Lack of code review
Source Control
• Git + GitHub / GitLab
• Versioning:
  • Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
Packaging the code
R E C I P E # 6
Packaging the code
• Source code has dependencies
• Dependencies in production ≠ at dev time
• Assemble the code + its dependencies
Packaging the code
• Freeze the dependencies:
  • Scala → Maven
  • Python → Setuptools
• Packaging:
  • Scala → Jar (Maven Shade plugin)
  • Python → Egg (Setuptools)
• Compliant with spark-submit (see the sketch below)
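A minimal setup.py sketch with Setuptools; the package name and pinned versions are hypothetical:

# setup.py - minimal Setuptools sketch; name and versions are hypothetical
from setuptools import setup, find_packages

setup(
    name="datalab-pipeline",
    version="1.0.0",
    packages=find_packages(),
    # Freeze dependencies to exact versions so prod matches dev
    install_requires=[
        "numpy==1.9.2",
        "pandas==0.16.0",
    ],
)

Running "python setup.py bdist_egg" then produces an Egg under dist/ that can be shipped to the cluster via spark-submit's --py-files option.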
Secure the build process
R E C I P E # 7
Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
Secure the build process
• Jenkins:
  • Unit test report
  • Code coverage report
  • Packaging: Jar / Egg
  • Dashboard
  • Notifications (Slack + email)
Automate the process
R E C I P E # 8
Automate the process
• Data is loaded manually in HDFS:
  • CSV files, sometimes compressed
  • Often received by email
  • Often samples
Automate the process
• No human intervention should be required:
  • All steps should be code / tools
  • E.g. automate file transfers, unzipping… (see the sketch below)
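A sketch of such an automated ingestion step in Python; all paths are hypothetical:

import subprocess
import zipfile

# Hypothetical paths, for illustration only
archive = "/incoming/contracts-2015-10.zip"
local_dir = "/tmp/contracts"
hdfs_dir = "/landing/contracts"

# Unzip the received file
with zipfile.ZipFile(archive) as zf:
    zf.extractall(local_dir)

# Push the extracted files to HDFS without human intervention
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_dir, hdfs_dir])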
Adapt to living data
R E C I P E # 9
Adapt to living data
• Data Scientists work with:
  • Frozen data
  • Samples
• Risks with data received on a regular basis:
  • Incorrect format (dates, numbers…)
  • Corrupt data (incl. encoding changes)
  • Missing values
Adapt to living data
• Data Checking & Cleansing:
  • Preliminary steps before processing the data
  • Decide what to do with invalid data (see the sketch below)
• Thetis:
  • Internal tool
  • Performs most checking & cleansing operations
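Thetis itself is internal, so as a generic illustration only, here is a PySpark sketch of a checking step that parses dates and numbers and sets invalid rows aside; the paths and columns are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("hdfs:///landing/contracts.csv", header=True)

# Checking: try to parse dates and numbers; rows that fail become NULL
checked = (df.withColumn("start_date", F.to_date("start_date", "yyyy-MM-dd"))
             .withColumn("premium", F.col("premium").cast("double")))

# Decide what to do with invalid data: here, set it aside for inspection
is_invalid = F.col("start_date").isNull() | F.col("premium").isNull()
checked.filter(is_invalid).write.mode("overwrite").parquet("hdfs:///rejected/contracts")
valid = checked.filter(~is_invalid)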
Provide a library of transformations
R E C I P E # 1 0
Library of transformations
• Dataiku "shakers":
  • Parse dates
  • Split a URL (protocol, host, path, …)
  • Transform a post code into a city / department name
  • …
• Cannot be used outside Dataiku
Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library (see the sketch below):
  • Transformation = DataFrame → DataFrame
  • Unit tests
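A minimal sketch of what such a library can look like in PySpark, with each transformation as a pure DataFrame → DataFrame function; the function and column names are hypothetical:

from pyspark.sql import DataFrame, functions as F

# Each transformation is a pure function from DataFrame to DataFrame,
# so transformations compose and can be unit tested in isolation.
def parse_date(df: DataFrame, column: str, fmt: str = "yyyy-MM-dd") -> DataFrame:
    return df.withColumn(column, F.to_date(F.col(column), fmt))

def birth_date_to_age(df: DataFrame, column: str = "birth_date") -> DataFrame:
    return df.withColumn(
        "age", (F.datediff(F.current_date(), F.col(column)) / 365).cast("int"))

# Transformations chain naturally:
# df = birth_date_to_age(parse_date(df, "birth_date"))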
Unit test the data pipeline
R E C I P E # 1 1
Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
Unit test the data pipeline
• Unit test each data transformation stage:
  • Scala: ScalaTest
  • Python: unittest
• Use mock data
• Compare DataFrames:
  • No library (yet?)
  • Compare lists of lists (see the sketch below)
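A hedged unittest sketch that tests the hypothetical parse_date transformation from recipe #10, comparing collected rows as lists of lists:

import datetime
import unittest
from pyspark.sql import SparkSession

# Hypothetical module holding the transformation library from recipe #10
from transformations import parse_date

class ParseDateTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_parse_date(self):
        # Mock data instead of production data
        df = self.spark.createDataFrame([("2015-10-01",), ("not a date",)], ["start_date"])
        result = parse_date(df, "start_date")
        # No DataFrame comparison library: compare lists of lists
        actual = [list(row) for row in result.collect()]
        expected = [[datetime.date(2015, 10, 1)], [None]]
        self.assertEqual(actual, expected)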
Assemble the Workflow
R E C I P E # 1 2
Assemble the Workflow
• Separate transformation processes:
  • Transformations applied to some data
  • Results are frozen and used in other processes
• Jobs are launched manually:
  • No built-in scheduler in Spark
Assemble the Workflow
• Oozie:
  • Spark
  • Map-Reduce
  • Shell
  • …
• Scheduling
• Alerts
• Logs
Summary & Conclusion
Summary
• Keys:
  • Use industrialization-ready tools
  • Pair Programming: Data Scientist + Developer
• Success criteria:
  • Lower time to market
  • Higher processing speed
  • More robust processes
Thank you!
@aseigneurin - @ipponusa