TRANSCRIPT
© 2015 IBM Corporation
The Data Reservoir: Architecture, Best Practices and Governance
Jo Ramos, Distinguished Engineer,
Chief Architect for Information Integration & Governance
Agenda for Today
• How to become a data driven analytics organization
• Modern Data Architecture Considerations
• The Data Reservoir & Data Governance
• 500 million DVDs worth of data is generated daily.
• 1 trillion connected objects and devices by 2015.
• 80% of the world's data is unstructured.
Data is becoming the world's new natural resource.
• 25% of the world's applications will be available in the cloud by 2016.
• 85% of new software is being built for cloud.
• 72% of developers say cloud-based services or APIs are central to the applications they are designing.
The emergence of cloud is transforming IT and business processes into digital services.
• 80% of individuals are willing to trade their information for a personalized offering.
• 84% of millennials say social and user-generated content has an influence on what they buy.
• 5 minutes: the response time users expect once they have contacted a company via social media.
Social, mobile and access to data are changing how individuals are understood and engaged.
Three major shifts in our industry
Data Monetization Drivers
• Data volumes
• Data variety
• Real-time execution
• Cost of data storage
• BI & Analytics visibility
• Increasing value of data
• Under-utilized resource
Data driven analytics platform capabilities move from
information to insights and from reporting to action
Real-Time Decision Management: What action should I take? What is the Next Best Action?
Prediction / Simulation: What will happen? Uses
data to forecast based on complex algorithms or
rules; forecasting of customers’ buying behavior.
Evaluation: Why did it happen? Thoroughly
analyzes data to support important decisions /
understand root causes for unusual observations;
evaluation of campaign responses to understand
changes in customer behavior.
Data Mining: Why did it happen? Looks for patterns
in the data to explain a not yet understood
observation; understand why certain customers
show a certain behavioral pattern.
Monitoring: What is happening now? Looks for
trigger information in the data indicating need for
action; monitoring live campaigns and making
optimization decisions as necessary.
Reporting: What happened? Slices and dices data
to create transparency on campaign performance
and financial or quality outcomes.
(Chart: these capabilities plot business value/impact against technological complexity, each axis running from low to high.)
Cognitive: What did I learn, what's best?
Multi-structured Data Mashups provide the Greatest Enterprise Value
Structured Data: small data; clearly formatted; quantitative; objective; logical; a puzzle; repeatable; linear.
Unstructured Data: big data; language-based; qualitative; subjective; intuitive; a mystery; exploratory; dynamic.
Hadoop, Streams, Spark
Systems of Record: structured data from operational systems; 20% of all data generated.
Systems of Engagement: data that "connects" companies with their customers, partners and employees; 80% of all data generated.
Systems of Insight: diverse data types that combine structured and unstructured data for business insight.
Data Warehouses · Advanced Analytics · Context Accumulation · Enterprise Integration
Traditional Sources: transaction data, ERP data, mainframe data, OLTP system data, electronic health records.
New Data Sources: audio, documents, images, RFID, emails, sensors, social data, video, web logs.
A growing data demand … and organizational tensions
• Data scientists seeking data for new analytics models.
• Marketers seeking data for new campaigns.
• Fraud investigators seeking data to understand the details of suspicious activity.
Lines of business (knowledge workers, application developers) want agility, data access freedom, any kind of data, and powerful analysis and visualization. The IT organization must ensure security, data privacy, regulatory compliance and standards. The CDO sits between them.
Modern Data Architectures: What to aim for
We need IT solutions that are like Legos:
– Agile, flexible, adaptive, efficient
– Able to give fast responses to business needs
– Taking advantage of cloud, mobile, social, localization, Big Data
IT should not be like pouring cement:
– Rigid, hardcoded business processes
– Monolithic
– Expensive to change
– Out of sync with business needs
Modern Data Architecture
– Simplification of the IT environment
– Eliminate redundancies
– Reduce cost
– Decouple systems
– Decouple transactions
– Fast, easy reuse
– Faster time-to-market
– Explore opportunities
– React to problems
– Generate insights from information in real time
– Use insights to improve customer experience, anticipate facts
Modern Data Architecture enables better customer engagement:
Smart Channels: mobile & web; "SoLoMo"; context-enriched.
Social: profiles, preferences, activities.
Sensors: geolocation, movement, events, Internet of Things.
Transactional: customer info, transaction history, product rules.
Context-Aware Experiences: personalized services, real-time intelligence, "Moments of Truth", opportunistic offers.
Analytics: "Next Best Action", expert systems, future trends, Big Data.
Processes: tasks & milestones, integration, monitoring & SLAs.
Analytics Lifecycle
Search & Survey
(understand what data is available)
Analytics Discovery
(application development)
Deploy & Consume
(application & workflow)
Supported by IT Operations and an online collaboration workspace.
Big Data Lakes or Swamps?
• As we bring data together, are we creating a data swamp?
– No one is sure of the origin or purity of data.
– No one can find the data they need.
– No one knows what data is present and if it is being adequately protected.
• How do we build trust in big data?
– Need trust both to share and to consume data.
– Need understanding of quality, origin and ownership of data.
– Need classification of data to govern and protect it.
– Need timely, reliable data feeds and results.
– All built on secure and reliable infrastructure.
IBM’s Data Lake
IBM’s Data Lake = Efficient Management, Governance, Protection and Access.
The Data Lake (System of Insight) is composed of an Information Management and Governance Fabric, Data Lake Services and Data Lake Repositories.
Users supported by the data lake: analytics teams; the governance, risk and compliance team; the information curator; line of business teams; data lake operations; enterprise IT.
Connected systems: other data lakes, systems of engagement, systems of automation, systems of record, new sources.
The Data Lake Subsystems (Services)
Data Lake (System of Insight) services:
• Information Management and Governance Fabric
• Catalogue
• Self-Service Access (for analytics teams and line of business teams)
• Enterprise IT Data Exchange
Users: analytics teams; the governance, risk and compliance team; the information curator; line of business teams; data lake operations; enterprise IT.
Connected: other data lakes, systems of engagement, systems of automation, systems of record, new sources, and the data lake repositories.
Considerations for a well-managed and governed data lake
• No direct access to repositories.
• Business-led information governance and management.
• Catalog of data, ownership, meaning and permitted usage.
• Moderated, view-based self-service access to data and analytics for lines of business.
• Access to raw data to develop new production analytics.
• Effective interchange of data and insight with other systems.
• Data-centric security.
• Multiple repositories organized by source and usage, hosted on data platforms appropriate to the workload.
• Curation of all data to define meaning and classifications.
• Active monitoring and management of data.
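The catalog consideration above (data, ownership, meaning and permitted usage) can be sketched as a minimal data structure. The field names and the `advertise` helper are illustrative assumptions for this transcript, not an IBM product API.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative catalog record: what a governed data lake tracks per data set."""
    name: str                 # business name of the data set
    owner: str                # accountable subject-area owner
    meaning: str              # business definition of the data
    classification: str       # e.g. "public", "personal", "sensitive"
    permitted_usage: list = field(default_factory=list)  # approved purposes

catalog = {}

def advertise(entry: CatalogEntry) -> None:
    """The information curator advertises a source in the catalog."""
    catalog[entry.name] = entry

advertise(CatalogEntry(
    name="customer_profile",
    owner="marketing",
    meaning="Master record of customer contact details and preferences",
    classification="personal",
    permitted_usage=["campaign targeting", "service personalization"],
))
print(catalog["customer_profile"].classification)  # personal
```

A record like this is what makes the other considerations enforceable: classification drives protection, and permitted usage drives moderated self-service access.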
View from the user community: fraud
• Conform to regulations.
• Investigate fraud cases.
• Develop new fraud models.
• Detect and prevent fraud.
Data repositories support multiple zones
Data Lake interfaces: catalog interfaces, raw data interaction, enterprise IT interaction, view-based interaction.
Data lake repositories (under Information Integration & Governance): descriptive data, context data, deposited data, historical data, harvested data, published data.
Big data needs a variety of repositories for cost, access and performance reasons
Data Lake interfaces: catalog interfaces, raw data interaction, enterprise IT interaction, view-based interaction.
Repositories, under Information Integration & Governance:
• Descriptive data: information views, catalog.
• Context data: asset hubs, activity hubs, code hub, content hub.
• Deposited, historical and harvested data: information warehouse, deep data, audit data, operational history, search index, log data.
• Published data: data marts, object cache, export area.
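Organizing repositories by source and usage can be sketched as a routing decision. The zone names come from the slide; the selection rules below are illustrative assumptions, not a prescribed algorithm.

```python
# Hypothetical routing of incoming data sets to the repository zones named on
# the slide. Rules are assumptions for illustration only.
ZONES = ["descriptive", "context", "deposited", "historical", "harvested", "published"]

def route(source: str, usage: str) -> str:
    """Pick a repository zone based on where data came from and how it is used."""
    if usage == "consumption":
        return "published"     # data marts, object cache, export area
    if source == "self-service":
        return "deposited"     # data dropped in by users and teams
    if source == "operational":
        return "historical"    # operational history, audit data, logs
    if source == "external":
        return "harvested"     # data gathered from outside feeds
    return "context"           # asset/activity/code/content hubs

print(route("operational", "analysis"))  # historical
print(route("external", "analysis"))     # harvested
```

Descriptive data (information views and the catalog) is metadata about the other zones rather than a routing target, which is why it never appears as a result here.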
Data lake logical architecture (summary of the diagram):
• Line of business applications reach the lake through information service calls, search requests, report requests, data access, and by deploying decision models.
• Data lake operations cover curation, interaction management, data deposit and decision model management; deployed real-time decision models receive events to evaluate and raise notifications.
• The governance, risk and compliance team understands and reports on compliance; the information curator understands information sources and advertises them through the catalog interfaces.
• Raw data interaction covers APIs, data ingestion, publishing feeds and continuous analytics (streaming analytics, event correlation), with secure access to sandboxes.
• View-based interaction gives consumers of insight secure access for simple ad hoc discovery and analysis, reporting, analytical insight applications and analytics tools, plus search access, sandboxes, and feedback to refine the data.
• Data lake repositories: descriptive data (information views, catalog); context data (asset hubs, activity hubs, code hub, content hub); deposited, historical and harvested data (information warehouse, deep data, audit data, operational history, search index, log data); published data (data marts, object cache, export area).
• Information Integration & Governance provides an information broker, an operational governance hub, a code hub, workflows, staging areas, guards, monitoring and an offline archive.
• Enterprise IT interaction connects, via the enterprise service bus, to enterprise IT, new sources (third party feeds, third party APIs, internal sources), other systems of insight, other data lakes, system of record applications, systems of engagement and systems of automation.
Differing user perspectives on the data stores and the information governance catalogue:
• Curate metadata about the stores, models and definitions.
• Search for, locate and download data and related artifacts; provision sandboxes.
• Add additional insight into data sources through automated analysis.
• Develop data management models and implementations.
• Define governance policies, rules and classifications; monitor compliance.
• View lineage (business and technical) and perform impact analysis.
Governance Rules: defined for each classification, for each situation.
• Personal information masked here.
• Sensitive information masked here.
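The idea of rules keyed by classification and situation can be sketched in a few lines. The classification names, situations and deny-by-default policy here are illustrative assumptions, not rules from the deck.

```python
# Sketch of "governance rules defined for each classification for each
# situation": protected data is masked unless an approved exception applies.
REVEAL = {("personal", "fraud-investigation")}  # approved exceptions (assumed)
PROTECTED = {"personal", "sensitive"}           # classifications masked by default

def apply_rule(value: str, classification: str, situation: str) -> str:
    """Return the value as-is or masked, per classification and situation."""
    if classification in PROTECTED and (classification, situation) not in REVEAL:
        return "****"
    return value

print(apply_rule("jo@example.com", "personal", "self-service"))         # ****
print(apply_rule("jo@example.com", "personal", "fraud-investigation"))  # jo@example.com
print(apply_rule("blue", "public", "self-service"))                     # blue
```

Masking by default and enumerating reveal exceptions keeps the rule table small and fails safe when a new situation appears.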
Integrated Metadata
• Data Lineage (Traceability)
– Where does this data come from?
– Why is this data incorrect?
– Why is this data incomplete?
– Can I trust this value?
• Impact Analysis
– Where is this element used?
– What happens if I change this?
• Optimization
– Where is the redundancy?
– How can I make this run more efficiently?
• Understanding
– What does this mean?
– How is this used?
• Control
– Why is this parameter set to this value?
– Who made this change?
– Can I change this to meet new business requirements?
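Lineage and impact analysis are both traversals of the same flow graph, in opposite directions. The element names and flows below are a made-up example for illustration.

```python
# Minimal sketch: lineage ("where does this data come from?") and impact
# analysis ("where is this element used?") over an illustrative flow graph.
FLOWS = {  # element -> elements it feeds (downstream)
    "crm.customer": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["mart.campaign_report", "mart.churn_model"],
}

def downstream(element, graph=FLOWS):
    """Impact analysis: everything that would change if `element` changes."""
    out = []
    for target in graph.get(element, []):
        out.append(target)
        out.extend(downstream(target, graph))
    return out

def upstream(element, graph=FLOWS):
    """Lineage: everything `element` is derived from."""
    out = []
    for src, targets in graph.items():
        if element in targets:
            out.append(src)
            out.extend(upstream(src, graph))
    return out

print(downstream("crm.customer"))
# ['warehouse.dim_customer', 'mart.campaign_report', 'mart.churn_model']
print(upstream("mart.churn_model"))
# ['warehouse.dim_customer', 'crm.customer']
```

A real metadata repository adds cycle detection and distinguishes business from technical lineage, but the traversal idea is the same.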
Secure access to the data lake's data
• The data lake's security is assured by a combination of business processes and technical mechanisms.
• Administration: isolated repositories; data-centric security access; well-defined access points; data access approval by the subject-area owner; data curation for security and protection.
• At runtime: access monitoring and logging.
• Review: security analytics and investigation; security audit and review.
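Two of the mechanisms above, owner approval and runtime logging, can be combined in a single access-point sketch. The approval table and function are assumptions for illustration, not a product API.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("lake.access")

# Approvals granted by subject-area owners (illustrative data, not a real API).
APPROVALS = {("alice", "fraud-cases")}  # (user, subject area) pairs

def access(user: str, subject_area: str) -> bool:
    """A well-defined access point: checks owner approval, logs every attempt."""
    granted = (user, subject_area) in APPROVALS
    log.info("access attempt: user=%s area=%s granted=%s", user, subject_area, granted)
    return granted

print(access("alice", "fraud-cases"))  # True
print(access("bob", "fraud-cases"))    # False
```

Logging both grants and denials is what makes the later review stage (security analytics, audit) possible.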
Building a data lake
• The first step in creating the lake is to establish the following:
– The information integration and governance components
– The staging areas for integration
– The catalog
– The common data standards
• The build-out of the lake then proceeds iteratively, based on the following processes:
– Governing a data lake subject area
– Managing an information source
– Managing an information view
– Enabling analytics
– Maintaining the data lake infrastructure
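The iterative build-out can be sketched as onboarding one information source at a time through the steps above. Every name and field here is a placeholder assumption, not a prescribed process model.

```python
# Illustrative sketch of the iterative build-out: onboarding one information
# source walks the steps listed above.
def onboard_source(name: str, catalog: dict, staging: list) -> None:
    catalog[name] = {"status": "advertised"}  # curator advertises it in the catalog
    staging.append(name)                      # raw data lands in a staging area
    catalog[name]["classified"] = True        # curation: meaning and classification
    catalog[name]["governed"] = True          # subject-area governance applied
    catalog[name]["status"] = "available"     # published for managed self-service access

catalog, staging = {}, []
onboard_source("web_logs", catalog, staging)
print(catalog["web_logs"]["status"])  # available
```

Repeating this per source, rather than loading everything up front, is what keeps the lake from becoming the swamp described earlier.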