Post on 20-May-2020
© 2017 IBM Corporation
How Information Governance is getting Analytics on Big Data's Best Friend
Albert Maier
amaier@de.ibm.com
How many astronauts are there in Argentina?
Traditional Governance: Ensures proper Management and Use of Information
Information Governance
Compliance: Are people and systems operating properly?
Quality: Is data quality sufficient for use?
Lifecycle: Is data kept for an appropriate length of time?
Protection: Is data properly protected from loss or inappropriate use?
Standards: Are systems built to appropriate standards?
Diagram elements: Policy Administration, Policy Enforcement, Policy Monitoring, Policy Implementation, Information Values, Information Dependencies, Information Requirements, Information Supply Chain Integrity, Information Identification, Information Retention, Information Usage, Information Privacy, Information Architecture, Information Disposal
A growing demand …
Business teams want
• Self-service access to more information (“big data”)
• More powerful analysis and visualization tools
IT teams are
• Concerned about cost
• Concerned about governance and regulatory requirements
Governance mitigates these concerns and enables the self-service world
This is related to the “Rise of the CDO”
Chief Data Officers need
• To enable access to enterprise-wide information assets
• Collaboration and sharing of assets
• Enhanced compliance with regulations
• …
“Governance 2.0”: Drives the Self-Service World
Information is Accurate
Information is Secure
Information is Understood
Information is Current
Information is Holistic
Information is Findable
Creates confidence to both consume and share information
Governed data lakes are an excellent example scenario to discuss how governance helps to achieve these goals
Data Lake (IBM’s view)
Data Lake = Efficient Management, Governance, Protection and Access of Big Data
Data Lake
Information Management and Governance Fabric
Data Lake Services
Data Lake Repositories
All services integrated with governance
Governed Data Lake: Users and Subsystems
Data Lake (System of Insight)
Information Management and Governance Fabric
Subsystems: Catalogue, Self-Service Access, Enterprise IT Data Exchange, Data Lake Repositories
Users: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Lake Operations
Enterprise IT: Other Data Lakes, Systems of Engagement, Systems of Automation, Systems of Record, New Sources
Callout: Governance is important here
Governance & Data Lake Summary
No direct access to repositories
Business-led information governance
Catalog of data, ownership, meaning and permitted usage
Moderated, view-based self-service access to data and analytics for line of business.
Governed access to raw data to develop new production analytics. Shop for data.
Effective and governed interchange of data and insight with other systems.
Data-centric Security
Multiple repositories organized based on source and usage; hosted on appropriate data platforms for workload.
Curation of all data to define meaning and classifications
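A catalog that records ownership, meaning and permitted usage, and that mediates access instead of exposing repositories directly, could be sketched roughly as follows. This is only an illustration: `CatalogEntry`, `Catalog`, `shop_for_data`, `may_access` and the sample purposes are hypothetical names invented here, not part of any IBM product or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record: what the data is, who owns it, how it may be used."""
    name: str
    owner: str
    meaning: str                                 # business description of the data
    classifications: frozenset = frozenset()     # e.g. {"personal data"}
    permitted_purposes: frozenset = frozenset()  # usages the owner has approved


class Catalog:
    """Users work through the catalog; they never touch repositories directly."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def shop_for_data(self, keyword):
        """Return entries whose name or meaning mentions the keyword."""
        kw = keyword.lower()
        return [e for e in self._entries.values()
                if kw in e.name.lower() or kw in e.meaning.lower()]

    def may_access(self, name, purpose):
        """Allow access only for a purpose listed on the catalog entry."""
        entry = self._entries.get(name)
        return entry is not None and purpose in entry.permitted_purposes


catalog = Catalog()
catalog.register(CatalogEntry(
    name="customer_accounts",
    owner="Retail Banking",
    meaning="Active customer accounts with status",
    classifications=frozenset({"personal data"}),
    permitted_purposes=frozenset({"fraud analytics"}),
))
print([e.name for e in catalog.shop_for_data("account")])
print(catalog.may_access("customer_accounts", "marketing"))
```

The design choice worth noting is that the access decision reads only catalog metadata, which is what makes curation of meaning and classifications a precondition for self-service.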
Selected challenges demanding innovation on the technology side
▪ A central metadata catalog is not realistic, independent of technology choice
• How to design and implement the “virtual” metadata catalog of the future?
• How to keep governance services such as data lineage efficient in a distributed world?
=> How to design and implement efficient Open Metadata capabilities?
▪ Standard search for information and governance assets fails to deliver results that are good enough for business users
• How to build efficient contextual search capabilities?
• How to keep this extensible for all the asset types relevant for a specific enterprise?
=> How to design and implement open and efficient contextual search solutions?
▪ Business-level classification of information assets is still a costly manual process
• Current discovery technologies fail to propose good enough classification candidates
• Exploitation of machine learning for this domain is in its infancy
=> How to design and implement an efficient automated classification of information?
Open Metadata: What problem is it solving?
All industrial products for metadata and governance are built on top of a central metadata repository. This turns out not to be future-proof for various reasons, including:
• Cloud platforms, open data and the API economy mean that an organization no longer owns and manages all of the data it uses. Maintaining a single inventory of metadata is untenable since IT is no longer in control of all of the data.
• The data landscape is evolving too rapidly to maintain a metadata repository that is fed with snapshots of metadata from data platforms. The metadata repository can quickly get out of date and become untrusted. Metadata for big data needs to be local to this data.
• Metadata is typically locked in specific tools and platforms in proprietary formats. Supporting the ever-increasing variety of data platforms, data types and functions in a proprietary model is expensive and needs significant development bandwidth.
Open Metadata Management
▪ Peer-to-peer network of repositories
▪ Metadata stored and managed close to its source
▪ Open, extensible metadata structures for metadata exchange and federation – extending coverage of the types of resources that need to be described.
▪ Open source infrastructure sharing cost of development and maintenance between vendors
▪ Support for open standards where available
Peer repositories shown: Collaboration Space Metadata, Analytics Platform Metadata, Application Metadata, Cloud SaaS Platform Metadata, Hadoop Platform Metadata
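The peer-to-peer idea, with metadata kept close to its source and queries federated across repositories, can be illustrated with a small in-memory sketch. It is not the Apache Atlas or any IBM API: `MetadataRepository`, `connect`, `find_local` and `find_federated` are hypothetical names, and a real implementation would exchange metadata over a network in an open format.

```python
from collections import deque


class MetadataRepository:
    """Hypothetical peer repository holding metadata close to its source."""

    def __init__(self, name):
        self.name = name
        self._assets = {}   # asset id -> dict of metadata properties
        self._peers = []    # directly connected repositories

    def connect(self, other):
        """Register a bidirectional peer link."""
        if other is not self and other not in self._peers:
            self._peers.append(other)
            other._peers.append(self)

    def add_asset(self, asset_id, **properties):
        # Record the home repository so federated results stay traceable.
        self._assets[asset_id] = dict(properties, home=self.name)

    def find_local(self, predicate):
        return [dict(a, id=i) for i, a in self._assets.items() if predicate(a)]

    def find_federated(self, predicate):
        """Query this repository and every reachable peer exactly once."""
        visited = {self.name}
        pending = deque([self])
        results = []
        while pending:
            repo = pending.popleft()
            results.extend(repo.find_local(predicate))
            for peer in repo._peers:
                if peer.name not in visited:
                    visited.add(peer.name)
                    pending.append(peer)
        return results


hadoop = MetadataRepository("hadoop")
analytics = MetadataRepository("analytics")
saas = MetadataRepository("saas")
hadoop.connect(analytics)
analytics.connect(saas)

hadoop.add_asset("clickstream", type="file", classification="web log")
saas.add_asset("crm_contacts", type="table", classification="personal data")

hits = hadoop.find_federated(lambda a: a["classification"] == "personal data")
print(hits)
```

No repository holds a copy of another's metadata here; the query travels to the metadata instead, which is the property the slide argues a central repository cannot offer.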
IBM is proposing and working towards an Open Metadata Ecosystem on top of Apache Atlas
• Significant enhancements of Apache Atlas are necessary to broaden its scope and to mature it
• IBM has started to engage strongly in this community and to contribute code (e.g. a graph abstraction and capabilities to address high availability)
• IBM has been working on additional componentry (Open Metadata Access Services, Open Connector Framework, Governance Action Framework, Open Discovery Framework) and intends to contribute significant parts of this work to the community
Diagram components: Open Connector Framework (Connector Broker, Connectors), Governance Action Framework, Open Discovery Framework, Open Metadata Access Services, Metadata Repository, Metadata Connectors
Connected resources: Databases, Applications, Functions, Operational Logs, Engine, Files
Key: Apache Atlas, IBM Value-add, Others Value-add
More details here: http://www.ibmbigdatahub.com/blog/insightout-role-apache-atlas-open-metadata-ecosystem
Background: What we learned from customers and user studies
Primary Focus: Business Analysts and Data Scientists
Common Challenges and What’s needed
1. Getting started is hard… What’s needed: Contextual search (e.g., ‘Shop For Data’) to quickly find relevant information, assets and experts
2. …Teams are diverse, and ad hoc sharing is vital. What’s needed: Seamless conversation as the common denominator across tools
3. Context is critical to establish trustworthiness. What’s needed: Provenance information captured automatically and transparently
=> LabBook project (IBM Research)
LabBook’s heart is a graph that is both populated and consumed by third-party tools
Source systems: Data Integration Tools, Data Science Tools, Social Networking Tools, Business Analyst Tools
Contextual Usage Graph (node types shown: COMMUNITY, COMMENT, WORKSTREAM, PERSON, DATASET, VISUALIZATION, APP, RESPONSE; edge shown: INVOKES)
Embeddable Widgets: Contextual Search, Social Widgets, Recommendations, Activity Streams, Contextual Graph Browser
User Interfaces for: Business users, Business analysts, Data Scientists, IT staff
What context is currently captured in the graph?
▪ Schematic: how data is structured
▪ Semantic: what data means
▪ Collaborative: how people work together
▪ Usage: how data is used
Node types: ORGANIZATION, PERSON, COMMUNITY, DATASOURCE, DATASET, DATAFILE, DATABASE, SCHEMA, TABLE, COLUMN, ONTOLOGYREF, APPLICATION, INVOCATION, QUERY, VISUALIZATION, COMMENT, RESPONSE, NOTE
Edge labels: memberOf, follows, publishes, contains, visualize, is, similarTo, consumes, produces, derivedFrom, collaborates, createdBy, has, authorOf, replyTo, respondTo, outputs, downloads
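To make the graph idea concrete, here is a minimal Python sketch of a labelled graph that reuses a few of the node types and edge labels from the slide. It is not LabBook code: the `ContextGraph` class, its methods, the sample node names and the edge directions are all assumptions made for illustration.

```python
from collections import defaultdict


class ContextGraph:
    """Tiny labelled graph; node types and edge labels follow the slide."""

    def __init__(self):
        self.node_type = {}                  # node id -> type, e.g. "DATASET"
        self.out_edges = defaultdict(list)   # node id -> [(label, target id)]
        self.in_edges = defaultdict(list)    # node id -> [(label, source id)]

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))
        self.in_edges[target].append((label, source))

    def lineage(self, dataset):
        """Follow derivedFrom edges upstream and return every ancestor once."""
        ancestors, stack = [], [dataset]
        while stack:
            for label, target in self.out_edges[stack.pop()]:
                if label == "derivedFrom" and target not in ancestors:
                    ancestors.append(target)
                    stack.append(target)
        return ancestors

    def experts_for(self, node_id):
        """People directly linked to a node, e.g. through authorOf."""
        return [src for _, src in self.in_edges[node_id]
                if self.node_type.get(src) == "PERSON"]


g = ContextGraph()
g.add_node("alice", "PERSON")
for name in ("raw_sales", "clean_sales", "sales_report"):
    g.add_node(name, "DATASET")
g.add_edge("clean_sales", "derivedFrom", "raw_sales")
g.add_edge("sales_report", "derivedFrom", "clean_sales")
g.add_edge("alice", "authorOf", "clean_sales")

print(g.lineage("sales_report"))
print(g.experts_for("clean_sales"))
```

Usage context (who produced what from what) and collaborative context (who to ask) come out of the same structure, which is why a single graph can serve both provenance and expert-finding widgets.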
Classification: the situation on the technology side
▪ Classification is about tagging information assets (e.g. columns) with their semantic meaning (e.g. social security number, date of birth, account status, …)
• This is crucial for finding the right information
• This is crucial for managing information according to regulations and company policies
▪ Many existing capabilities and assets (within IBM products, competitor products, research, …)
• Typically focusing on either low-hanging classification based on simple syntactic analysis (regular expressions, simple code, …) or very specialized domains (e.g. finding address information)
• Typically only able to automatically classify a small percentage of the information assets
▪ No silver bullet on the technology side
• All existing algorithms are specialized to address specific scenarios, e.g.
• they work for specific data formats only (e.g. text data only)
• they assume certain metadata is available and useful (e.g. descriptions)
• many have no machine learning; for others, training sets and proper feedback have been an issue
▪ No common technology base: everybody has been re-inventing the wheel, and nothing brings diverse technologies together to play in concert
Similar issues exist for the broader area of automating data understanding
This motivated us to build an Open Discovery Framework (ODF)
The intention is to contribute this to open source soon
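The “simple syntactic analysis” approach mentioned above can be sketched in a few lines of Python. The patterns, the `classify_column` function and the 0.8 threshold are assumptions chosen for illustration; production rules would be stricter and locale-aware.

```python
import re

# Hypothetical patterns: each maps a class label to a value-level regex.
PATTERNS = {
    "social security number": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def classify_column(values, threshold=0.8):
    """Return (label, match ratio) candidates whose ratio meets the threshold."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return []
    candidates = []
    for label, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        ratio = matches / len(non_empty)
        if ratio >= threshold:
            candidates.append((label, ratio))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


print(classify_column(["123-45-6789", "987-65-4321", "", "078-05-1120"]))
print(classify_column(["active", "closed", "active"]))
```

The sketch also shows the limits the slide describes: a regular expression can recognize that a value looks like a date, but not that it is a date of birth, and a free-form column such as account status yields no candidate at all.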
A closer look at Open Discovery Framework (ODF)
▪ Pluggable framework to enrich metadata with discovery results
• Developers writing discovery and classification algorithms can easily plug in their code
▪ Built on open-source stack: Atlas, Kafka, Spark, Zookeeper
▪ Extension points to support other environments. IBM is using these e.g. for
• Information Server: Use XMeta instead of Atlas
• Bluemix: Message Hub service instead of plain Kafka, Cloudant as config store
▪ Jenkins-based build pipeline for build and test automation
▪ The IBM Information Analyzer profiling and data quality analysis services are available as plugins for this framework
▪ IBM started to develop diverse new classification services, specifically
• A “term classification” service comparing information asset metadata against business glossary content
• A “fingerprint” based classification service comparing statistical fingerprints against fingerprints of already classified information assets
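As an illustration of the fingerprint idea, the following Python sketch summarises a column as a small vector of statistics and assigns the label of the nearest already-classified fingerprint. The feature set, the scaling constant, the distance threshold and the function names are all assumptions; the slides do not describe how the IBM service computes or compares fingerprints.

```python
import math


def fingerprint(values):
    """Summarise a column as a small vector of statistics (hypothetical features)."""
    values = [str(v) for v in values]
    if not values:
        raise ValueError("cannot fingerprint an empty column")
    chars = "".join(values)
    total = max(len(chars), 1)
    return (
        sum(len(v) for v in values) / len(values) / 32.0,  # mean length, roughly scaled
        sum(c.isdigit() for c in chars) / total,           # share of digits
        sum(c.isalpha() for c in chars) / total,           # share of letters
        len(set(values)) / len(values),                    # distinct-value ratio
    )


def classify_by_fingerprint(values, known, max_distance=0.25):
    """Return the label of the nearest known fingerprint, or None if none is close."""
    fp = fingerprint(values)
    best_label, best_dist = None, max_distance
    for label, known_fp in known.items():
        dist = math.dist(fp, known_fp)  # Euclidean distance (Python 3.8+)
        if dist <= best_dist:
            best_label, best_dist = label, dist
    return best_label


# Fingerprints of columns that were already classified, e.g. by a curator.
known = {
    "social security number": fingerprint(["123-45-6789", "987-65-4321"]),
    "account status": fingerprint(["active", "closed", "active", "active"]),
}

print(classify_by_fingerprint(["078-05-1120", "219-09-9999", "457-55-5462"], known))
print(classify_by_fingerprint(["open", "closed", "open", "open"], known))
print(classify_by_fingerprint(["a1", "b2", "c3"], known))
```

Unlike a regular expression, this approach needs no hand-written rule per class, but it depends on having good previously classified columns, and a coarse feature set like this one will confuse classes with similar statistics.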
Open Discovery Framework Architecture
ODF Core: Service Choreography, Declarative Request Processor, Request Notifications, Config Store, Metadata Access, Annotation Store
APIs: ODF REST API, ODF Java API, ODF Event API
Kafka: Spark runtime Queue, Java runtime Queue, REST service Queue, Notification Topic
Discovery services: Service 1, Service 2
External stores: Some Metadata Store, Some Config Store
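The flow suggested by the architecture (services plugged into a core, requests routed to a queue per runtime, results written to an annotation store and announced on a notification topic) can be mimicked in-process as below. This is a rough sketch, not the ODF API: `DiscoveryCore` and its methods are hypothetical, plain Python deques stand in for Kafka queues, and lists stand in for the annotation store and notification topic.

```python
from collections import deque


class DiscoveryCore:
    """In-process stand-in for a discovery core with one queue per runtime."""

    def __init__(self):
        self.services = {}        # service name -> (runtime, callable)
        self.queues = {}          # runtime -> deque of pending requests
        self.annotations = []     # stand-in for the annotation store
        self.notifications = []   # stand-in for the notification topic

    def register(self, name, runtime, func):
        """Plug in a discovery service: func(asset) -> dict of annotation properties."""
        self.services[name] = (runtime, func)
        self.queues.setdefault(runtime, deque())

    def submit(self, asset, service_names):
        """Queue one request per requested service on that service's runtime queue."""
        for name in service_names:
            runtime, _ = self.services[name]
            self.queues[runtime].append((name, asset))

    def run(self):
        """Drain all queues, storing annotations and emitting notifications."""
        for pending in self.queues.values():
            while pending:
                name, asset = pending.popleft()
                _, func = self.services[name]
                result = func(asset)
                self.annotations.append(
                    {"asset": asset["name"], "service": name, **result})
                self.notifications.append((name, asset["name"], "finished"))


core = DiscoveryCore()
core.register("row-counter", "java",
              lambda asset: {"rows": len(asset["rows"])})
core.register("null-profiler", "spark",
              lambda asset: {"nulls": sum(r is None for r in asset["rows"])})

core.submit({"name": "orders", "rows": [1, None, 3]},
            ["row-counter", "null-profiler"])
core.run()
print(core.annotations)
```

Keeping the service contract to "asset in, annotation properties out" is what lets algorithms written for different runtimes be combined in one request.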
Take Aways & Outlook
▪ Take Aways
• Governance is extending from “Governance for Compliance” to “Governance for Insights”
• Data lakes are helping CDOs to implement a vision of a data-driven enterprise, but data lakes need to be fully governed to live up to this value proposition
• Governance and the underlying metadata, metadata discovery and exploitation technologies are not mature enough for big data, there is a lot of … (and vice versa, big data systems are not mature enough to be a player in a governed landscape)
▪ IBM Governance Strategic Directions:
• Huge focus on Governance for Insights (comprising topics like shop for data, recommendation-driven tools, machine learning, collaboration, ...)
• Moving to an open-source base (Kafka, Spark, Atlas, ...)
• Re-basing governance on an open, non-centralized metadata infrastructure
• Huge focus on “Unified Governance” to bring IBM’s governance capabilities together (across structured and unstructured data, across cloud and on-prem, across all information governance domains)
Questions?
© Copyright IBM Corporation 2017. All rights reserved. The information contained in these materials is provided for informational purposes
only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use
of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any
warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement
governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in
all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole
discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any
way. IBM, the IBM logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United
States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.