Introduction to Guerrilla Analytics
Presented by:
Enda Ridge, PhD
A Practical Approach to Doing Data Science
What this talk is about
• Why doing Data Science is hard
• What is Guerrilla Analytics?
• How to cope in a Guerrilla Analytics environment
– The Guerrilla Analytics Principles
• Applying the Principles
• Next steps and research topics
Copyright Enda Ridge 2014 1
What we are told about Data Science
“Data is the new science. Big data holds the answers.”
“the sexy job in the next 10 years will be statisticians”
“Data Scientist: The Sexiest Job of the 21st Century”
“Information is the oil of the 21st century, and analytics is the combustion engine.”
http://www.gapminder.org/
http://www.statistics.com/data-science-quotes/
https://github.com/mbostock/d3/wiki/Gallery
The Data Science reality
Hi, we need to produce last week’s list of customers with incorrect addresses and report their total value. It’s going to the Chief Risk Officer this afternoon.
Um. Which list? I sent at least 2 lists and Jo sent one too. Maybe.
I’ll check my mailbox and send you the Excel file.
And we’ll also need the change in customer population since last week’s report.
Er... the customer population has changed with the new data from this morning.
Oh, and we might have deleted some duplicate customers yesterday, so we can’t go back to last week.
My background
Mechanical Engineer
PhD Computer Science
(York 2007)
• “Design of Experiments for the Tuning of Algorithms”
Boutique Consultancy
• Social Network Analysis for Fraud
Forensic Data Analytics
• Professional Services
Senior Manager
• Data Science Product Development
Common Perception of the Analytics Workflow
Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22
Project Reality – Guerrilla Analytics
• Changing Data
• Changing Requirements
• Changing Resources
• Changing Business Rules
• Limited Time
• Limited Toolsets
• Limited Resources
• Reproducible
• Explainable
• Testable
Some of the Guerrilla Analytics Principles
• Space is cheap, confusion is expensive.
• Prefer simple, visual project structures over heavily documented and project-specific rules.
• Prefer automation with program code over manual graphical methods.
• Maintain a link between DATA on the file system, in the ANALYTICS environment, and in work products (INSIGHT).
• Version control changes to data and program code.
Stage 2: Data Receipt
Guerrilla Analytics Environment
• Lost Data
• Multiple Copies of data
• No supporting information
• Local copies of data
• Renamed data
Guerrilla Analytics Approach
• Have 1 Data location
• Data Unique Identifiers
• Data log
• Keep supporting material near its data
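The approach listed above can be sketched in code. This is a minimal sketch, assuming a single data root folder, a sequential identifier format such as D001, and a CSV data log. The function name `register_delivery`, the folder layout, and the log columns are illustrative assumptions and do not come from the talk.

```python
import csv
import shutil
from datetime import date
from pathlib import Path


def register_delivery(source: Path, data_root: Path, supplier: str, note: str) -> str:
    """Copy a received file into the one data location under a new unique ID
    and append a row to the data log. Returns the ID (hypothetical format D001)."""
    data_root.mkdir(parents=True, exist_ok=True)
    log_path = data_root / "data_log.csv"
    log_exists = log_path.exists()

    # Next ID = number of deliveries already logged + 1.
    logged = 0
    if log_exists:
        with log_path.open(newline="") as f:
            logged = sum(1 for _ in csv.DictReader(f))
    data_id = f"D{logged + 1:03d}"

    # Each delivery gets its own folder; the original file name is kept, not renamed.
    target_dir = data_root / data_id
    target_dir.mkdir()
    shutil.copy2(source, target_dir / source.name)

    # Supporting material is stored next to the data it describes.
    (target_dir / "README.txt").write_text(note)

    # Append one row per delivery to the data log, writing the header once.
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if not log_exists:
            writer.writerow(["data_id", "received", "supplier", "original_name"])
        writer.writerow([data_id, date.today().isoformat(), supplier, source.name])
    return data_id
```

With a log like this, a question such as "which list?" can be answered by looking up the data ID rather than searching mailboxes.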
Stage 3: Data Load
Guerrilla Analytics Environment
• Data corruption
• Data preparation for load
• Loss of loaded data
• Cluttered Data Manipulation Environment
Stage 4: Analytics Examples & Themes
• Multiple languages
• Multiple code files
• Output of images
SQL code file that goes through a supplier address table and identifies the address country for each supplier.
This derived address country is added to the dataset as a new data field and plotted on a map.
• Data manipulation on file system
• External tools
Python script that runs through thousands of office documents, calls a tool to convert these to XML, and saves the XML file.
This process is to prepare the data for further entity enrichment with another tool.
• Larger number of code files
• Multiple languages
• Multiple tools
Twenty code files are run in a particular order to manipulate and reshape data so it can be imported into a data-mining tool.
• Even simple code snippets
Export of a dataset out of the Data Manipulation Environment so the customer can do their own work with the data.
Stage 4: Analytics
Guerrilla Analytics Approach
Clearly labelled running order
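One way to make a running order explicit is to prefix each code file with a number and let a small script derive the order from the names. The sketch below assumes a hypothetical naming convention such as `010_load.sql`; the convention and the `running_order` function are illustrative and are not taken from the talk.

```python
import re
from pathlib import Path


def running_order(code_dir: Path) -> list[Path]:
    """Return code files sorted by their leading numeric prefix, e.g. 010_load.sql.
    Files without a numeric prefix are ignored."""
    numbered = []
    for path in code_dir.iterdir():
        match = re.match(r"(\d+)_", path.name)
        if path.is_file() and match:
            numbered.append((int(match.group(1)), path.name, path))
    # Sort by prefix number, then by file name so ties are deterministic.
    numbered.sort(key=lambda item: (item[0], item[1]))
    return [path for _, _, path in numbered]
```

Because the order is carried by the file names, it works across languages (SQL, Python, R) and stays visible in any file browser.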
[Diagram, Guerrilla Analytics Environment. Nodes: Raw 1, Raw 2, Union_and_clean, Known_Population, Result]
[Diagram, Guerrilla Analytics Approach. Nodes: Raw 1, Raw 2, Union, Clean 1, Clean 2, Tagged, Known_Population, Result_filtered]
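One reading of the node labels above is that a single combined step is split into small, separately named datasets, and that records are tagged rather than deleted so the final filter can be explained and repeated. The sketch below illustrates that reading only; the field names, the cleaning rule, and the sample rows are invented for the example.

```python
def union(raw_1, raw_2):
    """Stack both raw datasets, recording which source each row came from."""
    return ([{**row, "source": "raw_1"} for row in raw_1]
            + [{**row, "source": "raw_2"} for row in raw_2])


def clean(rows):
    """Illustrative cleaning rule: trim and upper-case the customer_id."""
    return [{**row, "customer_id": row["customer_id"].strip().upper()} for row in rows]


def tag(rows, known_population):
    """Flag rows instead of deleting them, so no record is lost."""
    return [{**row, "is_known": row["customer_id"] in known_population} for row in rows]


def filter_result(rows):
    """The final filter only reads the tag; earlier datasets stay intact."""
    return [row for row in rows if row["is_known"]]


# Invented sample data for the sketch.
raw_1 = [{"customer_id": " a1 "}, {"customer_id": "b2"}]
raw_2 = [{"customer_id": "C3"}]
known_population = {"A1", "C3"}

# Every intermediate dataset keeps its own name and can be inspected later.
unioned = union(raw_1, raw_2)
cleaned = clean(unioned)
tagged = tag(cleaned, known_population)
result_filtered = filter_result(tagged)
```

Each step returns a new dataset and leaves its input untouched, which is what makes it possible to go back and explain how a result was reached.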
Stage 7: Reporting – Guerrilla Analytics approach
• Select min/max of transaction_time (WP_030)
• Select min/max of customer_age (WP_035)
• Purchases by type (WP_042)
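A minimal sketch of how report figures might be traced back to work products, assuming the WP_ identifiers above label the work product behind each figure. The `REPORT_TRACE` mapping and the `cite` helper are hypothetical names introduced for illustration.

```python
# Hypothetical mapping from each figure in the report to its work-product ID.
REPORT_TRACE = {
    "Select min/max of transaction_time": "WP_030",
    "Select min/max of customer_age": "WP_035",
    "Purchases by type": "WP_042",
}


def cite(figure: str) -> str:
    """Return the figure label with its work-product ID, or fail loudly if untraced."""
    if figure not in REPORT_TRACE:
        raise KeyError(f"No work product recorded for report figure: {figure!r}")
    return f"{figure} [{REPORT_TRACE[figure]}]"
```

Failing on an untraced figure means a number cannot reach the report without a work product behind it.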
Wrap Up
Discussed
• Guerrilla Analytics Projects
– Disruptions
– Constraints
• Guerrilla Analytics Principles and Practice Tips
– Data Receipt and Load
– Analytics and Reporting
– Consolidation in Builds
Other topics
• Testing in a Guerrilla Analytics Environment
• Capability – People, Process, Technology
• Data Gymnastics – common analytics patterns
Open questions and challenges
Software Engineering
• Version control of data and of analytics
• Build tools across multiple languages
Workflows & Project Management
• Appropriate workflow and supporting tools?
• Project Management methodologies
Testing
• Types (Builds, ad-hoc, data quality)
• Test Harnesses (multi-language, dataset vs code)
‘Big Data’
• Do Guerrilla Analytics Principles work with volume / velocity?
• NoSQL analytics
Keep in Touch!
@Enda_Ridge
www.guerrilla-analytics.net