[srijan wednesday webinars] from data management to data analysis pipelines
TRANSCRIPT
From Data Management to Data Analysis Pipelines
Open source based architectures to get the job done
Young-Jin [email protected]
What will be covered
@srijan #SrijanWW
1. Common Data Management Problems in NPOs and NGOs
2. Obstacles for Data Analysis rooted in Data Management Practices
3. Start with "Why?" not with "How?"
4. You have more data than you think
5. How to miss the on-ramp to Data Analysis by going top-down
6. Getting to Data Analysis from the ground-up using pipelines
7. Some useful data infrastructure architectures
8. Some useful tools to build the data pipelines
9. Questions
If "Data is the New Oil"
Then most of the NPO/NGO sector is pumping it by hand
and isn't refining it from crude to create greater value.
Everyday Data Battles of NPOs/NGOs
Many types of NPOs and NGOs share similar pain points when managing mission-critical data on programs, clients, donors and volunteers:
● Ad hoc, non-uniform data collection tools across the organization
● Managing data is time-consuming and inefficient
● Difficulty tracking NPO staff-client interactions over time
● Organization has high turnover in staff/volunteers/clients
● People cannot update their own contact information
● Missing linkage between real-world entities due to duplicates
● Problems syncing data between local on-the-ground efforts and the national umbrella organization
Data Silos → Blocked Data Flows
How NPOs manage data:
● Donor Fundraising Software
● Program Data Spreadsheets
● Event Ticketing System
● Contacts Database / CRM
● Web & Email Marketing
...directly leads to isolated data silos, which make it impossible to perform data analysis.
Obstacles to Data Analysis
Operational Data Stores (ODS) without proper governance, integration and tools will lack data flows and pose serious obstacles to data analysis for the organization.
● ODS → data silos → blocked data flows
○ Missing integrations into unified Database of Record
○ Weak or Missing Data Governance Rules and Policies
● Data Quality Issues in each ODS
○ Lack of Data Hygiene and Quality Assurance Policies
○ Missing Entity Resolution within ODS and across ODSs
● No Data Strategy leads to ad hoc tactical technology stopgaps
○ "We need the new proprietary XYZ system now, we will work out if/how XYZ integrates with our current systems later..."
Top-down favors "How?" not "Why?"
Which leader of an organization doesn't want the latest and
greatest fashionable "How?" answers, buzzwords or products:
● Data Lake to replace the Enterprise Data Warehouse
● Hadoop/Spark Cluster for Streaming Big Data Processing
● Predictive Analytics Platform for Decision Support
● Drag-and-drop self-service visualizations and drill-downs
● Business Intelligence Platform with A/B testing
Start with "Why?" to avoid "cargo cult" data science, which
usually results from top-down mandates by leadership to become
a more data-driven organization. Putting in place all the
"How?" answers and systems never fully answers the "Why?"
Dangers of "How?" ahead of "Why?"
"Ceci n'est pas un phone." (This is not a phone.)
Fallacy of "cargo cult" data science
Invest in and build the latest and greatest data
systems, and rich insights and data-driven
decision-making will spew forth from the
systems in deus ex machina style.
Data Pipelines and Food Preparation
Raw Data (unwieldy) → Software Systems (clean, refine, transform) → Insights (actionable)
Raw Ingredients (inedible) → Cooking Techniques (clean, cut, prepare) → Delicious Dish (enjoyable)
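To make the analogy concrete, here is a minimal sketch of a pipeline with clean, refine and transform stages, using only the Python standard library. The sample rows, field names and function names are hypothetical and only illustrate the idea; they are not taken from the talk.

```python
from collections import Counter

# Hypothetical raw export: inconsistent casing, stray whitespace, a blank row.
RAW_ROWS = [
    {"name": "  Ada Lovelace ", "event": "Gala 2015"},
    {"name": "ada lovelace", "event": "gala 2015"},
    {"name": "", "event": "Gala 2015"},
    {"name": "Alan Turing", "event": "Gala 2015 "},
    {"name": "Grace Hopper", "event": "Volunteer Day"},
]

def clean(rows):
    """Drop rows without a name and strip stray whitespace."""
    for row in rows:
        name = row["name"].strip()
        if name:
            yield {"name": name, "event": row["event"].strip()}

def refine(rows):
    """Normalize casing so the same person and event compare equal."""
    for row in rows:
        yield {"name": row["name"].title(), "event": row["event"].title()}

def transform(rows):
    """Aggregate into something actionable: distinct attendees per event."""
    distinct = {(row["name"], row["event"]) for row in rows}
    return Counter(event for _name, event in distinct)

def run_pipeline(rows):
    # Each stage consumes the output of the previous one.
    return transform(refine(clean(rows)))

attendance = run_pipeline(RAW_ROWS)
```

Each stage is a small function, so a stage can be improved or replaced later without rebuilding the whole pipeline.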
You likely have more data than you think
Take a data inventory:
● the "obvious" data sources: what you're probably collecting already (say, what's in your CRM, event attendance lists)
● the less-obvious data sources:
○ not collecting something you could: leaving the data on the floor (data exhaust)
○ collecting something, but then throwing it away: webserver logs
● don't collect everything
○ over time you may even forget why it's there (or why it's important), making cleanup difficult
○ the less data you store, the lower your risk exposure if there is a break-in
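As one illustration of data that is often thrown away, the sketch below counts successful page views per path from webserver log lines. It assumes the logs are in Common Log Format; the sample lines, the regular expression and the function name are hypothetical. In keeping with "don't collect everything", it keeps only aggregate counts and discards the client IP addresses.

```python
import re
from collections import Counter

# Hypothetical sample lines in Common Log Format; real logs would be read from a file.
SAMPLE_LOG = """\
203.0.113.5 - - [10/Mar/2015:13:55:36 +0000] "GET /donate HTTP/1.1" 200 5120
203.0.113.9 - - [10/Mar/2015:13:56:01 +0000] "GET /events/gala HTTP/1.1" 200 2048
198.51.100.7 - - [10/Mar/2015:13:57:12 +0000] "GET /donate HTTP/1.1" 200 5120
198.51.100.7 - - [10/Mar/2015:13:57:40 +0000] "GET /missing HTTP/1.1" 404 312
"""

# Matches the quoted request and the status code that follows it.
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def page_hits(log_text):
    """Count successful (status 200) requests per path, ignoring who made them."""
    hits = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits

hits = page_hits(SAMPLE_LOG)
```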
Pathways to Data Analysis
Master Data Management (MDM) consists of processes, governance, policies, standards and tools
that consistently define and manage the critical data of an
organization to provide a single point of reference in a Database
of Record (DBOR).
Master data management has the objective of providing processes for collecting, aggregating,
matching, consolidating, quality-assuring, persisting and distributing such data throughout an
organization to ensure consistency and control in the ongoing maintenance and application use
of this information.
http://en.wikipedia.org/wiki/Master_data_management
"Fail Slow": use the top-down approach
Implementing MDM in a multi-year top-down project with full
requirements gathering in a waterfall-based approach will fail:
● takes a long time → very expensive
● weak support within organization: perceived value is low
● project's scope keeps shifting and thus is never done
● new systems and bad data sets are added as the project progresses, so it never finishes
● insights are slow in coming since data analysis follows full MDM implementation
● MDM-first approach = analysis paralysis: you never have all the information to know which measure is valuable where
“Data is the new oil? No: Data is the new soil.”
– David McCandless
lay seeds → growing → stewardship → harvesting fruits
bottom-up organic view
Doing Data Analysis from the bottom-up
Implementing MDM from the bottom-up in an agile, iterative
process is preferred. Incremental refinement of data into an
eventual MDM is more powerful. Here's why:
● faster insights from data by harvesting low-hanging fruit
● grow support within organization: perceived value increases
● project is a work in progress, so its iterative nature is understood
● new systems and bad data sets are added and requirements shift; both are handled incrementally
● insights and data analysis steadily improve over time as the eventual MDM implementation nears full MDM
● MDM-eventually approach allows the organization's analytical capabilities to grow and is also more cost-effective
Data Architectures and Best Practices
Golden Record with Incremental Data Refining
Operationalize data insights early and often, which in turn
incrementally aligns the organization around better data practices
and organically builds data governance structures and policies.
Program Data, Events DB, CRM, CMS, Donor DB
→ Incremental ETLs with Cleansing (dedupe, record linkage)
→ Golden Record
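A minimal sketch of how records from several systems might be linked into one golden record. It uses a deliberately simple technique, exact matching on a normalized email address, rather than the probabilistic matching a dedicated entity resolution library would provide. The field names, sample records and function names are hypothetical.

```python
def normalize_email(email):
    """Cleansing step: trim whitespace and lowercase so equal emails compare equal."""
    return email.strip().lower()

def build_golden_records(sources):
    """Merge records from several systems, keyed by normalized email.

    Later sources only fill in fields that are still missing, so the
    order of `sources` expresses which system is trusted most.
    """
    golden = {}
    for system, records in sources:
        for record in records:
            key = normalize_email(record["email"])
            merged = golden.setdefault(key, {"email": key, "seen_in": []})
            merged["seen_in"].append(system)
            for field, value in record.items():
                if field != "email" and value and not merged.get(field):
                    merged[field] = value.strip()
    return golden

# Hypothetical extracts from two operational data stores.
crm = [{"email": "Ada@Example.org ", "name": "Ada Lovelace", "phone": ""}]
donor_db = [
    {"email": "ada@example.org", "name": "A. Lovelace", "phone": "555-0100"},
    {"email": "grace@example.org", "name": "Grace Hopper", "phone": ""},
]

golden = build_golden_records([("crm", crm), ("donor_db", donor_db)])
```

Here the CRM is listed first, so its spelling of the name wins, while the phone number that only the donor database holds is filled in afterwards. The `seen_in` list records which systems contributed to each golden record.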
Open Source Tools: build data pipelines
OpenDataKit: collect survey data on mobile devices
Drupal: CMS widely adopted by the NPO/NGO community
CiviCRM: open source CRM for the NPO/NGO sector
Pentaho Data Integration: powerful open source ETL tool
OpenRefine: data cleansing
Python Dedupe Library: entity resolution
KNIME Analytics Platform: machine learning platform
Python Analytics Stack: IPython + pandas + scikit-learn
RStudio: R-language IDE for statistical analysis and visualization
dc.js: dimensional charting visualizations (d3 + crossfilter)
Elasticsearch, Neo4j, MongoDB, Hadoop, Spark, PostGIS etc.
Open Source based Data Architecture
Program Data, Ticketing DB, CiviCRM (CRM), Drupal (CMS), Raiser's Edge (DB)
→ Incremental ETLs with Cleansing (dedupe, record linkage)
→ Golden Record + REST API
→ Data Analysis, Visualizations, Machine Learning
Useful Resources
● NTEN: A Consumer's Guide to Donor Management Systems
● NTEN: Getting Started with Data-Driven Decision Making: A Workbook
● PwC: Data lakes and the promise of unsiloed data
Young-Jin [email protected]
Thank You!
Take this conversation online by tweeting using the hashtag #SrijanWW