Data Science meets Software Development
![Page 1: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/1.jpg)
DATA SCIENCE MEETS SOFTWARE DEVELOPMENT
Alexis Seigneurin - Ippon Technologies
![Page 2: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/2.jpg)
Who I am
• Software engineer for 15 years
• Consultant at Ippon Tech in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
![Page 3: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/3.jpg)
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
![Page 4: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/4.jpg)
The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
![Page 5: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/5.jpg)
Data Scientists
Who they are & How they work
![Page 6: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/6.jpg)
Skill set of a Data Scientist
• Strong in:
  • Science (maths / statistics)
  • Machine Learning
  • Analyzing data
• Good / average in:
  • Programming
• Not good in:
  • Software engineering
![Page 7: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/7.jpg)
Programming languages
• Mostly Python, incl. libraries:
  • NumPy
  • Pandas
  • scikit-learn
• SQL
• R
![Page 8: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/8.jpg)
Development environments
• IPython Notebook
![Page 9: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/9.jpg)
Development environments
• Dataiku
![Page 10: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/10.jpg)
Machine Learning
• Algorithms:
  • Logistic Regression
  • Decision trees
  • Random forests
• Implementations:
  • Dataiku
  • scikit-learn
  • Vowpal Wabbit
![Page 12: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/12.jpg)
Skill set of a Developer
• Strong in:
  • Software engineering
  • Programming
• Good / average in:
  • Science (maths / statistics)
  • Analyzing data
• Not good in:
  • Machine Learning
![Page 13: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/13.jpg)
How Developers work
• Programming languages
  • Java
  • Scala
• Development environment
  • Eclipse
  • IntelliJ IDEA
• Toolbox
  • Maven
  • …
![Page 14: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/14.jpg)
A typical Data Science project
In the Lab
![Page 15: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/15.jpg)
Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model
   1. Split the dataset: training/validation/test datasets
   2. Train the model
4. Apply the model on new data
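The dataset split in step 3 can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions chosen for the example.

```python
import random


def split_dataset(rows, train_pct=60, validation_pct=20, seed=42):
    """Shuffle rows and split them into training/validation/test lists.

    The percentages and the seed are illustrative assumptions; whatever
    is left after training and validation goes to the test set.
    """
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return training, validation, test


training, validation, test = split_dataset(range(100))
```

Keeping the test set untouched until the very end is what makes the final evaluation trustworthy; the validation set is the one used while tuning parameters.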
![Page 16: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/16.jpg)
Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
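The cleansing steps listed above could look like this in plain Python. It is a sketch under assumptions: the column names (`age`, `subscribed`, `signup_date`), the accepted boolean spellings and the date format are hypothetical, and unparseable values are mapped to `None` so that a later step can decide how to handle them.

```python
from datetime import date, datetime

# Hypothetical spellings accepted as booleans in the raw files.
_BOOLEANS = {"yes": True, "true": True, "1": True,
             "no": False, "false": False, "0": False}


def to_int(value):
    """Convert a string to an int; None when missing or malformed."""
    try:
        return int(value.strip())
    except (AttributeError, ValueError):
        return None


def to_bool(value):
    """Convert a string to a bool; None when missing or unrecognised."""
    try:
        return _BOOLEANS.get(value.strip().lower())
    except AttributeError:
        return None


def to_date(value, fmt="%Y-%m-%d"):
    """Parse a date string; None when missing or in an incorrect format."""
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except (AttributeError, ValueError):
        return None


def cleanse(row):
    """Apply the conversions to one raw row (a dict of strings)."""
    return {
        "age": to_int(row.get("age")),
        "subscribed": to_bool(row.get("subscribed")),
        "signup_date": to_date(row.get("signup_date")),
    }


# Month 13 does not exist, so the date is treated as invalid.
clean = cleanse({"age": " 42 ", "subscribed": "Yes", "signup_date": "2015-13-01"})
```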
![Page 17: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/17.jpg)
Feature Engineering
• Transform data into numerical features
• E.g.:
  • A birth date → age
  • Dates of phone calls → Number of calls
  • Text → Vector of words
  • 2 names → Levenshtein distance
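Two of the examples above, birth date → age and two names → Levenshtein distance, can be sketched with the standard library alone. The function names are illustrative; the distance uses the classic two-row dynamic programming formulation.

```python
from datetime import date


def age_on(birth_date, reference_date):
    """Age in whole years on reference_date."""
    had_birthday = ((reference_date.month, reference_date.day)
                    >= (birth_date.month, birth_date.day))
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]
```

Passing the reference date explicitly, rather than calling `date.today()` inside the function, keeps the feature reproducible when the same data is reprocessed later.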
![Page 18: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/18.jpg)
Machine Learning
• Train a model
• Test an algorithm with different params
• Cross validation (Grid Search)
• Compare different algorithms, e.g.:
  • Logistic regression
  • Gradient boosting trees
  • Random forest
![Page 19: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/19.jpg)
Machine Learning
• Evaluate the accuracy of the model
  • Root Mean Square Error (RMSE)
  • ROC curve
  • …
• Examine predictions
  • False positives, false negatives…
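As an illustration of the evaluation step, the sketch below computes the RMSE of numeric predictions and counts false positives and false negatives for binary predictions. It is a plain-Python approximation of what the machine learning libraries provide out of the box; the function names and dictionary keys are assumptions.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["true_positive"] += 1
        elif p:
            counts["false_positive"] += 1
        elif a:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts
```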
![Page 20: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/20.jpg)
Industrialization Cookbook
![Page 21: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/21.jpg)
Disclaimer
• Context of this project:
  • Not So Big Data (but Smart Data)
  • No real-time workflows (yet?)
![Page 22: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/22.jpg)
Distribute the processing
RECIPE #1
![Page 23: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/23.jpg)
Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
![Page 24: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/24.jpg)
Distribute the processing
• In production:
  • Hardware resources are constrained
  • Large data sets to process
• Spark:
  • Included in CDH
  • DataFrames (Spark 1.3+) ≃ Pandas DataFrames
  • Fast!
![Page 25: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/25.jpg)
Use a centralized data store
RECIPE #2
![Page 26: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/26.jpg)
Use a centralized data store
• Data Scientists store data on their workstations
  • Limited storage
  • Data not shared within the team
  • Data privacy not enforced
  • Subject to data losses
![Page 27: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/27.jpg)
Use a centralized data store
• Store data on HDFS:
  • Hive tables (SQL)
  • Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
![Page 28: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/28.jpg)
Rationalize the use of programming languages
RECIPE #3
![Page 29: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/29.jpg)
Programming languages
• Data Scientists write code on their workstations
  • This code may not run in the datacenter
• Language variety → Hard to share knowledge
![Page 30: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/30.jpg)
Programming languages
• Use widespread languages
• Spark in Python/Scala
  • Support for R is still too immature
• Provide assistance to ease the adoption!
![Page 31: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/31.jpg)
Use an IDE
RECIPE #4
![Page 32: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/32.jpg)
Use an IDE
• Notebooks:
  • Powerful for exploratory work
  • Weak for code editing and code structuring
  • Inadequate for code versioning
![Page 33: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/33.jpg)
Use an IDE
• IntelliJ IDEA / PyCharm
  • Code compilation
  • Refactoring
  • Execution of unit tests
  • Support for Git
![Page 34: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/34.jpg)
Source Control
RECIPE #5
![Page 35: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/35.jpg)
Source Control
• Data Scientists work on their workstations
  • Code is not shared
  • Code may be lost
  • Intermediate versions are not preserved
• Lack of code review
![Page 36: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/36.jpg)
Source Control
• Git + GitHub / GitLab
• Versioning
  • Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
![Page 37: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/37.jpg)
Packaging the code
RECIPE #6
![Page 38: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/38.jpg)
Packaging the code
• Source code has dependencies
• Dependencies in production ≠ at dev time
• Assemble the code + its dependencies
![Page 39: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/39.jpg)
Packaging the code
• Freeze the dependencies:
  • Scala → Maven
  • Python → Setuptools
• Packaging:
  • Scala → Jar (Maven Shade plugin)
  • Python → Egg (Setuptools)
• Compatible with spark-submit
![Page 40: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/40.jpg)
RECIPE #7
Secure the build process
![Page 41: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/41.jpg)
Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
![Page 42: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/42.jpg)
Secure the build process
• Jenkins
  • Unit test report
  • Code coverage report
  • Packaging: Jar / Egg
  • Dashboard
  • Notifications (Slack + email)
![Page 43: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/43.jpg)
Automate the process
RECIPE #8
![Page 44: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/44.jpg)
Automate the process
• Data is loaded manually in HDFS:
  • CSV files, sometimes compressed
  • Often received by email
  • Often samples
![Page 45: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/45.jpg)
Automate the process
• No human intervention should be required
  • All steps should be code / tools
  • E.g. automate file transfers, unzipping…
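As an example of turning a manual step into code, the sketch below decompresses incoming CSV files into a staging directory. The directory layout, the `.csv.gz` naming convention and the function name are assumptions; the transfer to HDFS itself is left out because it depends on the cluster tooling.

```python
import gzip
import shutil
from pathlib import Path


def decompress_incoming(inbox, staging):
    """Copy *.csv and decompress *.csv.gz files from inbox into staging.

    Returns the list of files produced; other files are ignored.
    """
    inbox, staging = Path(inbox), Path(staging)
    staging.mkdir(parents=True, exist_ok=True)
    produced = []
    for source in sorted(inbox.iterdir()):
        if source.name.endswith(".csv.gz"):
            # Strip the trailing ".gz" to get the target file name.
            target = staging / source.name[:-3]
            with gzip.open(source, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
        elif source.suffix == ".csv":
            target = staging / source.name
            shutil.copyfile(source, target)
        else:
            continue
        produced.append(target)
    return produced
```

A scheduler can then call this function on every delivery instead of someone unzipping attachments by hand.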
![Page 46: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/46.jpg)
Adapt to living data
RECIPE #9
![Page 47: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/47.jpg)
Adapt to living data
• Data Scientists work with:
  • Frozen data
  • Samples
• Risks with data received on a regular basis:
  • Incorrect format (dates, numbers…)
  • Corrupt data (incl. encoding changes)
  • Missing values
![Page 48: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/48.jpg)
Adapt to living data
• Data Checking & Cleansing
  • Preliminary steps before processing the data
  • Decide what to do with invalid data
• Thetis
  • Internal tool
  • Performs most checking & cleansing operations
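Thetis is an internal tool and its API is not shown in the slides, so the following is only a generic sketch of a checking step: each column gets a predicate, and rows that fail are set aside with the names of the failing columns so that a decision can be made about them. All names here (`check_rows`, `is_iso_date`, `is_number`, the column names) are hypothetical.

```python
from datetime import datetime


def is_iso_date(value):
    """True when value is a string in YYYY-MM-DD format."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def is_number(value):
    """True when value can be converted to a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def check_rows(rows, checks):
    """Split rows into (valid, rejected).

    checks maps a column name to a predicate; rejected holds
    (row, failing_columns) pairs.
    """
    valid, rejected = [], []
    for row in rows:
        failures = [column for column, is_ok in checks.items()
                    if not is_ok(row.get(column))]
        if failures:
            rejected.append((row, failures))
        else:
            valid.append(row)
    return valid, rejected


rows = [
    {"date": "2016-01-31", "amount": "12.5"},
    {"date": "31/01/2016", "amount": "12.5"},  # incorrect date format
    {"date": "2016-02-01", "amount": None},    # missing value
]
valid, rejected = check_rows(rows, {"date": is_iso_date, "amount": is_number})
```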
![Page 49: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/49.jpg)
Provide a library of transformations
RECIPE #10
![Page 50: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/50.jpg)
Library of transformations
• Dataiku "shakers":
  • Parse dates
  • Split a URL (protocol, host, path, …)
  • Transform a post code into a city / department name
  • …
• Cannot be used outside Dataiku
![Page 51: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/51.jpg)
Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library
  • Transformation = DataFrame → DataFrame
  • Unit tests
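The "Transformation = DataFrame → DataFrame" idea can be illustrated without Spark by using a list of dicts as a stand-in for a DataFrame (an assumption made to keep the example self-contained). Every transformation has the same shape, so transformations can be reused and chained; the names `split_url`, `drop_column` and `pipeline` are hypothetical.

```python
from functools import reduce
from urllib.parse import urlsplit


def split_url(column):
    """Build a transformation that adds protocol/host/path fields."""
    def transform(rows):
        result = []
        for row in rows:
            parts = urlsplit(row[column])
            result.append({**row,
                           column + "_protocol": parts.scheme,
                           column + "_host": parts.netloc,
                           column + "_path": parts.path})
        return result
    return transform


def drop_column(column):
    """Build a transformation that removes one column."""
    def transform(rows):
        return [{k: v for k, v in row.items() if k != column} for row in rows]
    return transform


def pipeline(*transformations):
    """Chain transformations, left to right, into a single one."""
    return lambda rows: reduce(lambda data, t: t(data), transformations, rows)


prepare = pipeline(split_url("url"), drop_column("url"))
result = prepare([{"id": 1, "url": "https://example.com/docs/index.html"}])
```

Because every step takes rows and returns new rows without mutating its input, each one can be unit tested in isolation.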
![Page 52: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/52.jpg)
Unit test the data pipeline
RECIPE #11
![Page 53: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/53.jpg)
Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
![Page 54: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/54.jpg)
Unit test the data pipeline
• Unit test each data transformation stage
  • Scala: ScalaTest
  • Python: unittest
• Use mock data
• Compare DataFrames:
  • No library (yet?)
  • Compare lists of lists
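A minimal `unittest` sketch of the "mock data + compare lists of lists" approach is shown below. The transformation `add_full_name` is a hypothetical example; with Spark, the lists of lists would be obtained by collecting the rows of the resulting DataFrame, which is not shown here.

```python
import unittest


def add_full_name(rows):
    """Hypothetical transformation under test: append 'first last'."""
    return [row + [row[0] + " " + row[1]] for row in rows]


class AddFullNameTest(unittest.TestCase):
    def test_appends_full_name(self):
        mock_rows = [["Ada", "Lovelace"], ["Alan", "Turing"]]
        expected = [["Ada", "Lovelace", "Ada Lovelace"],
                    ["Alan", "Turing", "Alan Turing"]]
        # Sorting both sides makes the comparison independent of row
        # order, which a distributed engine does not guarantee.
        self.assertEqual(sorted(add_full_name(mock_rows)), sorted(expected))

    def test_empty_input(self):
        self.assertEqual(add_full_name([]), [])


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddFullNameTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the test case would live in its own module and be picked up by the build server rather than run through `run_tests()`.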
![Page 55: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/55.jpg)
Assemble the Workflow
RECIPE #12
![Page 56: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/56.jpg)
Assemble the Workflow
• Separate transformation processes:
  • Transformations applied to some data
  • Results are frozen and used in other processes
• Jobs are launched manually
  • No built-in scheduler in Spark
![Page 57: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/57.jpg)
Assemble the Workflow
• Oozie:
  • Spark
  • Map-Reduce
  • Shell
  • …
• Scheduling
• Alerts
• Logs
![Page 58: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/58.jpg)
Summary & Conclusion
![Page 59: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/59.jpg)
Summary
• Keys:
  • Use industrialization-ready tools
  • Pair Programming: Data Scientist + Developer
• Success criteria:
  • Lower time to market
  • Higher processing speed
  • More robust processes
![Page 60: Data Science meets Software Development](https://reader031.vdocuments.site/reader031/viewer/2022021918/58ae89641a28abdf068b4da5/html5/thumbnails/60.jpg)
Thank you!
@aseigneurin - @ipponusa