TRANSCRIPT
© 2015 IBM Corporation
The Data Reservoir: Architecture, Best Practices and Governance
Jo Ramos, Distinguished Engineer,
Chief Architect for Information Integration & Governance
Agenda for Today
• How to become a data driven analytics organization
• Modern Data Architecture Considerations
• The Data Reservoir & Data Governance
• 500 million DVDs worth of data is generated daily.
• 1 trillion connected objects and devices by 2015.
• 80% of the world's data is unstructured.
Data is becoming the world's new natural resource.
• 25% of the world's applications will be available in the cloud by 2016.
• 85% of new software is being built for cloud.
• 72% of developers say cloud-based services or APIs are central to the applications they are designing.
The emergence of cloud is transforming IT and business processes into digital services.
• 80% of individuals are willing to trade their information for a personalized offering.
• 84% of millennials say social and user-generated content has an influence on what they buy.
• 5 minutes: the response time users expect once they have contacted a company via social media.
Social, mobile and access to data are changing how individuals are understood and engaged.
Three major shifts in our industry
Data Monetization Drivers
• Data volumes
• Data variety
• Real-time execution
• Cost of data storage
• BI & Analytics visibility
• Increasing value of data
• Under-utilized resource
Data driven analytics platform capabilities move from
information to insights and from reporting to action
Real-Time Decision Management: What action should I take? What is the Next Best Action?
Prediction / Simulation: What will happen? Uses
data to forecast based on complex algorithms or
rules; forecasting of customers’ buying behavior.
Evaluation: Why did it happen? Thoroughly
analyzes data to support important decisions /
understand root causes for unusual observations;
evaluation of campaign responses to understand
changes in customer behavior.
Data Mining: Why did it happen? Looks for patterns
in the data to explain a not yet understood
observation; understand why certain customers
show a certain behavioral pattern.
Monitoring: What is happening now? Looks for
trigger information in the data indicating need for
action; monitoring live campaigns and making
optimization decisions as necessary.
Reporting: What happened? Slices and dices data
to create transparency on campaign performance
and financial or quality outcomes.
(Chart: these capabilities plot business value/impact against technological complexity, each axis running from low to high.)
Cognitive: What did I learn, what's best?
Multi-structured Data Mashups provide the Greatest Enterprise Value
Structured Data: small data; clearly formatted; quantitative; objective; logical; a puzzle; repeatable; linear.
Unstructured Data: big data; language-based; qualitative; subjective; intuitive; a mystery; exploratory; dynamic.
Hadoop, Streams, Spark
Systems of Record: structured data from operational systems; 20% of all data generated.
Systems of Engagement: data that "connects" companies with their customers, partners and employees; 80% of all data generated.
Systems of Insight: diverse data types that combine structured and unstructured data for business insight.
Data Warehouses · Advanced Analytics · Context Accumulation · Enterprise Integration
Traditional Sources: transaction data, ERP data, mainframe data, OLTP system data, electronic health records.
New Data Sources: audio, documents, images, RFID, emails, sensors, social data, video, web logs.
A growing data demand … and organizational tensions
• Data scientists seeking data for new analytics models.
• Marketers seeking data for new campaigns.
• Fraud investigators seeking data to understand the details of suspicious activity.
Lines of business (knowledge workers, application developers) want agility, data access freedom, any kind of data, and powerful analysis and visualization. The IT organization must ensure security, data privacy, regulatory compliance and standards. The CDO sits between them.
Modern Data Architectures: What to aim for
We need IT solutions that are like Legos:
– Agile, flexible, adaptive, efficient
– Able to give fast responses to business needs
– Taking advantage of cloud, mobile, social, localization, Big Data
IT should not be like pouring cement:
– Rigid, hardcoded business processes
– Monolithic
– Expensive to change
– Out of sync with business needs
Modern Data Architecture
– Simplification of the IT environment
– Eliminate redundancies
– Reduce cost
– Decouple systems
– Decouple transactions
– Fast, easy reuse
– Faster time-to-market
– Explore opportunities
– React to problems
– Generate insights from information in real time
– Use insights to improve customer experience, anticipate facts
Modern Data Architecture enables better customer engagement:
Smart Channels: mobile & web; "SoLoMo"; context-enriched.
Social: profiles, preferences, activities.
Sensors: geolocation, movement, events, Internet of Things.
Transactional: customer info, transaction history, product rules.
Context-Aware Experiences: personalized services, real-time intelligence, "Moments of Truth", opportunistic offers.
Analytics: "Next Best Action", expert systems, future trends, Big Data.
Processes: tasks & milestones, integration, monitoring & SLAs.
Analytics Lifecycle
Search & Survey
(understand what data is available)
Analytics Discovery
(application development)
Deploy & Consume
(application & workflow)
Supported by IT Operations and an online collaboration workspace.
Big Data Lakes or Swamps?
• As we bring data together, are we creating a data swamp?
– No one is sure of the origin or purity of data.
– No one can find the data they need.
– No one knows what data is present and if it is being adequately protected.
• How do we build trust in big data?
– Need trust both to share and to consume data.
– Need understanding of quality, origin and ownership of data.
– Need classification of data to govern and protect it.
– Need timely, reliable data feeds and results.
– All built on secure and reliable infrastructure.
IBM’s Data Lake
IBM’s Data Lake = Efficient Management, Governance, Protection and Access.
The Data Lake (System of Insight) is composed of an Information Management and Governance Fabric, Data Lake Services and Data Lake Repositories.
Users supported by the data lake: analytics teams; the governance, risk and compliance team; the information curator; line of business teams; data lake operations; enterprise IT.
Connected systems: other data lakes, systems of engagement, systems of automation, systems of record, new sources.
The Data Lake Subsystems (Services)
Data Lake (System of Insight) services:
• Information Management and Governance Fabric
• Catalogue
• Self-Service Access (for analytics teams and line of business teams)
• Enterprise IT Data Exchange
Users: analytics teams; the governance, risk and compliance team; the information curator; line of business teams; data lake operations; enterprise IT.
Connected: other data lakes, systems of engagement, systems of automation, systems of record, new sources, and the data lake repositories.
Considerations for a well-managed and governed data lake
• No direct access to repositories.
• Business-led information governance and management.
• Catalog of data, ownership, meaning and permitted usage.
• Moderated, view-based self-service access to data and analytics for lines of business.
• Access to raw data to develop new production analytics.
• Effective interchange of data and insight with other systems.
• Data-centric security.
• Multiple repositories organized by source and usage, hosted on data platforms appropriate to the workload.
• Curation of all data to define meaning and classifications.
• Active monitoring and management of data.
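The catalog consideration above (data, ownership, meaning and permitted usage) can be sketched as a minimal data structure. The field names and the `advertise` helper are illustrative assumptions for this transcript, not an IBM product API.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative catalog record: what a governed data lake tracks per data set."""
    name: str                 # business name of the data set
    owner: str                # accountable subject-area owner
    meaning: str              # business definition of the data
    classification: str       # e.g. "public", "personal", "sensitive"
    permitted_usage: list = field(default_factory=list)  # approved purposes

catalog = {}

def advertise(entry: CatalogEntry) -> None:
    """The information curator advertises a source in the catalog."""
    catalog[entry.name] = entry

advertise(CatalogEntry(
    name="customer_profile",
    owner="marketing",
    meaning="Master record of customer contact details and preferences",
    classification="personal",
    permitted_usage=["campaign targeting", "service personalization"],
))
print(catalog["customer_profile"].classification)  # personal
```

A record like this is what makes the other considerations enforceable: classification drives protection, and permitted usage drives moderated self-service access.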
View from the user community: fraud
• Conform to regulations.
• Investigate fraud cases.
• Develop new fraud models.
• Detect and prevent fraud.
Data repositories support multiple zones
Data Lake interfaces: catalog interfaces, raw data interaction, enterprise IT interaction, view-based interaction.
Data lake repositories (under Information Integration & Governance): descriptive data, context data, deposited data, historical data, harvested data, published data.
Big data needs a variety of repositories for cost, access and performance reasons
Data Lake interfaces: catalog interfaces, raw data interaction, enterprise IT interaction, view-based interaction.
Repositories, under Information Integration & Governance:
• Descriptive data: information views, catalog.
• Context data: asset hubs, activity hubs, code hub, content hub.
• Deposited, historical and harvested data: information warehouse, deep data, audit data, operational history, search index, log data.
• Published data: data marts, object cache, export area.
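Organizing repositories by source and usage can be sketched as a routing decision. The zone names come from the slide; the selection rules below are illustrative assumptions, not a prescribed algorithm.

```python
# Hypothetical routing of incoming data sets to the repository zones named on
# the slide. Rules are assumptions for illustration only.
ZONES = ["descriptive", "context", "deposited", "historical", "harvested", "published"]

def route(source: str, usage: str) -> str:
    """Pick a repository zone based on where data came from and how it is used."""
    if usage == "consumption":
        return "published"     # data marts, object cache, export area
    if source == "self-service":
        return "deposited"     # data dropped in by users and teams
    if source == "operational":
        return "historical"    # operational history, audit data, logs
    if source == "external":
        return "harvested"     # data gathered from outside feeds
    return "context"           # asset/activity/code/content hubs

print(route("operational", "analysis"))  # historical
print(route("external", "analysis"))     # harvested
```

Descriptive data (information views and the catalog) is metadata about the other zones rather than a routing target, which is why it never appears as a result here.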
Data lake logical architecture (summary of the diagram):
• Line of business applications reach the lake through information service calls, search requests, report requests, data access, and by deploying decision models.
• Data lake operations cover curation, interaction management, data deposit and decision model management; deployed real-time decision models receive events to evaluate and raise notifications.
• The governance, risk and compliance team understands and reports on compliance; the information curator understands information sources and advertises them through the catalog interfaces.
• Raw data interaction covers APIs, data ingestion, publishing feeds and continuous analytics (streaming analytics, event correlation), with secure access to sandboxes.
• View-based interaction gives consumers of insight secure access for simple ad hoc discovery and analysis, reporting, analytical insight applications and analytics tools, plus search access, sandboxes, and feedback to refine the data.
• Data lake repositories: descriptive data (information views, catalog); context data (asset hubs, activity hubs, code hub, content hub); deposited, historical and harvested data (information warehouse, deep data, audit data, operational history, search index, log data); published data (data marts, object cache, export area).
• Information Integration & Governance provides an information broker, an operational governance hub, a code hub, workflows, staging areas, guards, monitoring and an offline archive.
• Enterprise IT interaction connects, via the enterprise service bus, to enterprise IT, new sources (third party feeds, third party APIs, internal sources), other systems of insight, other data lakes, system of record applications, systems of engagement and systems of automation.
Differing user perspectives on the data stores and the information governance catalogue:
• Curate metadata about the stores, models and definitions.
• Search for, locate and download data and related artifacts; provision sandboxes.
• Add additional insight into data sources through automated analysis.
• Develop data management models and implementations.
• Define governance policies, rules and classifications; monitor compliance.
• View lineage (business and technical) and perform impact analysis.
Governance Rules: defined for each classification, for each situation.
• Personal information masked here.
• Sensitive information masked here.
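The idea of rules keyed by classification and situation can be sketched in a few lines. The classification names, situations and deny-by-default policy here are illustrative assumptions, not rules from the deck.

```python
# Sketch of "governance rules defined for each classification for each
# situation": protected data is masked unless an approved exception applies.
REVEAL = {("personal", "fraud-investigation")}  # approved exceptions (assumed)
PROTECTED = {"personal", "sensitive"}           # classifications masked by default

def apply_rule(value: str, classification: str, situation: str) -> str:
    """Return the value as-is or masked, per classification and situation."""
    if classification in PROTECTED and (classification, situation) not in REVEAL:
        return "****"
    return value

print(apply_rule("jo@example.com", "personal", "self-service"))         # ****
print(apply_rule("jo@example.com", "personal", "fraud-investigation"))  # jo@example.com
print(apply_rule("blue", "public", "self-service"))                     # blue
```

Masking by default and enumerating reveal exceptions keeps the rule table small and fails safe when a new situation appears.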
Integrated Metadata
• Data Lineage (Traceability)
– Where does this data come from?
– Why is this data incorrect?
– Why is this data incomplete?
– Can I trust this value?
• Impact Analysis
– Where is this element used?
– What happens if I change this?
• Optimization
– Where is the redundancy?
– How can I make this run more efficiently?
• Understanding
– What does this mean?
– How is this used?
• Control
– Why is this parameter set to this value?
– Who made this change?
– Can I change this to meet new business requirements?
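Lineage and impact analysis are both traversals of the same flow graph, in opposite directions. The element names and flows below are a made-up example for illustration.

```python
# Minimal sketch: lineage ("where does this data come from?") and impact
# analysis ("where is this element used?") over an illustrative flow graph.
FLOWS = {  # element -> elements it feeds (downstream)
    "crm.customer": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["mart.campaign_report", "mart.churn_model"],
}

def downstream(element, graph=FLOWS):
    """Impact analysis: everything that would change if `element` changes."""
    out = []
    for target in graph.get(element, []):
        out.append(target)
        out.extend(downstream(target, graph))
    return out

def upstream(element, graph=FLOWS):
    """Lineage: everything `element` is derived from."""
    out = []
    for src, targets in graph.items():
        if element in targets:
            out.append(src)
            out.extend(upstream(src, graph))
    return out

print(downstream("crm.customer"))
# ['warehouse.dim_customer', 'mart.campaign_report', 'mart.churn_model']
print(upstream("mart.churn_model"))
# ['warehouse.dim_customer', 'crm.customer']
```

A real metadata repository adds cycle detection and distinguishes business from technical lineage, but the traversal idea is the same.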
Secure access to the data lake's data
• The data lake's security is assured by a combination of business processes and technical mechanisms.
• Administration: isolated repositories; data-centric security access; well-defined access points; data access approval by the subject-area owner; data curation for security and protection.
• At runtime: access monitoring and logging.
• Review: security analytics and investigation; security audit and review.
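Two of the mechanisms above, owner approval and runtime logging, can be combined in a single access-point sketch. The approval table and function are assumptions for illustration, not a product API.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("lake.access")

# Approvals granted by subject-area owners (illustrative data, not a real API).
APPROVALS = {("alice", "fraud-cases")}  # (user, subject area) pairs

def access(user: str, subject_area: str) -> bool:
    """A well-defined access point: checks owner approval, logs every attempt."""
    granted = (user, subject_area) in APPROVALS
    log.info("access attempt: user=%s area=%s granted=%s", user, subject_area, granted)
    return granted

print(access("alice", "fraud-cases"))  # True
print(access("bob", "fraud-cases"))    # False
```

Logging both grants and denials is what makes the later review stage (security analytics, audit) possible.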
Building a data lake
• The first step in creating the lake is to establish the following:
– The information integration and governance components
– The staging areas for integration
– The catalog
– The common data standards
• The build-out of the lake then proceeds iteratively, based on the following processes:
– Governing a data lake subject area
– Managing an information source
– Managing an information view
– Enabling analytics
– Maintaining the data lake infrastructure
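The iterative build-out can be sketched as onboarding one information source at a time through the steps above. Every name and field here is a placeholder assumption, not a prescribed process model.

```python
# Illustrative sketch of the iterative build-out: onboarding one information
# source walks the steps listed above.
def onboard_source(name: str, catalog: dict, staging: list) -> None:
    catalog[name] = {"status": "advertised"}  # curator advertises it in the catalog
    staging.append(name)                      # raw data lands in a staging area
    catalog[name]["classified"] = True        # curation: meaning and classification
    catalog[name]["governed"] = True          # subject-area governance applied
    catalog[name]["status"] = "available"     # published for managed self-service access

catalog, staging = {}, []
onboard_source("web_logs", catalog, staging)
print(catalog["web_logs"]["status"])  # available
```

Repeating this per source, rather than loading everything up front, is what keeps the lake from becoming the swamp described earlier.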