Digital Event Experience
Introduction to the IBM DataOps methodology and practice
Julie Lockner, Director, Portfolio Optimization and Offering Management, IBM Data and AI
Steven Eliuk, VP, Deep Learning & Governance Automation, IBM Global CDO
Think 2020 / May 5, 2020 / © 2020 IBM Corporation
81% do not understand the data required for AI
8X: AI pioneers are 8X more likely to have a robust data architecture
There is no AI without IA (information architecture)
“No amount of AI algorithmic sophistication will overcome a lack of data (architecture)...”
Data collection & preparation is the most time-consuming and difficult part of AI.
The AI Ladder: A prescriptive approach to the journey to AI
COLLECT - Make data simple and accessible
ORGANIZE - Create a business-ready analytics foundation
ANALYZE - Build and scale AI with trust & explainability
INFUSE - Operationalize AI throughout the business
MODERNIZE - Unlock the value of data for an AI and multicloud world
One Platform, Any Cloud
Talent & Skills
ORGANIZE: DataOps delivers business-ready data fast
Know your data
Trust your data
Use your data
ORGANIZE: Critical information architecture capabilities
Stages shown: COLLECT, ORGANIZE, ANALYZE
Pillars: Know, Trust, Use
Capabilities:
• Catalog & Metadata Management
• Data Governance and Curation
• Data Quality
• Master Data Management
• Data Integration
• Data Replication
• Data Virtualization
• Self-service interaction for data preparation and testing
Prepare Data Pipelines: “Most dreaded part of AI”
Data Operations: Discover, understand, ingest, integrate, assess quality, clean data
Build, Run, Manage
Months - Quarters
Problem Statement: Business users need access to high quality data fast. Data pipelines are the primary source of bottlenecks.
Poor Data Quality and Governance Cause Negative Business Impact
“Our study shows that 95% of organizations see negative impacts from poor data quality, resulting in wasted resources and additional costs.”
Source: https://www.experian.co.uk/assets/data-quality/experian-global-data-management-report-jan-2019.pdf
Introducing DataOps
“DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization.”
Gartner
DataOps Consistently Delivers High Quality Data Fast
Prepare, Build, Run, Manage: Months - Quarters reduced to Hours - Days
DataOps expedites delivery of high-quality data by:
- Streamlining data pipeline processes.
- Automating core operations on data.
- Incorporating agile processes and workflows.
- Tapping into data sources and consumers for end-to-end DataOps.
- Automating test data generation and management.
- Enabling collaborative communication across key stakeholders and SMEs.
DataOps Impact - Know Your Data in Minutes: Data Inventory Case Study
Financial Services, Telecommunications, Retail Examples, Healthcare Payer
85%: Reduction in business glossary creation time
90%: Reduction in time to discover metadata and assign terms
200,000: Number of technical assets across multiple clouds discovered in less than 5 mins
2 Hour ROI: Uncovered Protected Health Information (PHI) / PII exposure
DataOps Impact - Trust Your Data: Data Quality Case Study, International Bank
Data records update speed: 13 per hour (manual); 50 per min (automated)
Data quality score: 6% per hour (manual); 93% per min (automated)
With DataOps:
2 years: Net promoter score
230x: Data quality improvement
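The 230x figure on this slide appears to match the ratio of the two update speeds once both are put on the same time unit. The slide labels it a data quality improvement, so this reading is an assumption rather than something the deck states. A minimal sketch of the arithmetic, with hypothetical names:

```python
# Figures taken from the slide; variable and function names are illustrative only.
MANUAL_RECORDS_PER_HOUR = 13
AUTOMATED_RECORDS_PER_MINUTE = 50


def speedup(manual_per_hour: float, automated_per_minute: float) -> float:
    """Return automated throughput divided by manual throughput, both per hour."""
    automated_per_hour = automated_per_minute * 60
    return automated_per_hour / manual_per_hour


ratio = speedup(MANUAL_RECORDS_PER_HOUR, AUTOMATED_RECORDS_PER_MINUTE)
print(f"Automated update rate is roughly {ratio:.1f}x the manual rate")
```

Under that assumption, 50 records per minute is 3,000 per hour, and 3,000 / 13 comes out a little above 230.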
DataOps Impact - Use Your Data: Data Integration Use Case, Leading European Retailer
Customer affinity analysis
Inventory stock positions
>3 weeks
< 2 minutes
Data change delay on reporting systems
20 days
< 1 day
~24 hours
< 4 hours
DataOps Impact
Comparing the two scenarios. Which one is yours?
Without DataOps: Single iteration; Months-Quarters; One outcome, costly if wrong
With DataOps: Multiple iterations; Days-Weeks; Multiple outcomes, more chances for success
80% Data Prep
DataOps requires Automation and Multicloud Architecture
Automated data curation and quality services
Automated metadata management and catalog services
Self-service interaction
Automated data integration
Automated test data management services
Automated master data management
Governed data access services
Business-ready data
Organize: DataOps Delivers Business Ready Data Fast
On-Prem
DataOps Maturity Model
No DataOps
• Know: Spreadsheets
• Trust: Emails
• Use: Hand coding
Foundational DataOps
• Know: Departmental / LOB Catalog
• Trust: Data Quality Program
• Use: Data Virtualization, Data Integration and Data Replication
Developed DataOps
• Know: Enterprise Catalog
• Trust: Data Governance Program with Data Stewardship and Business Glossary
• Use: Self Service Data Prep and Test Data Management
Advanced DataOps
• Know: Enforced and Enriched Catalog
• Trust: Compliance, Business Ontology and Automated Classification
• Use: DataOps for All Data Pipelines
Increased business value and speed in delivering business-ready data.
DataOps Methodology Automates Data Management Best Practices
DataOps Methodology
— Prioritize and align data pipelines with business objective and success criteria.
— Associated with the Data Engineering discipline
— Automatically measures accuracy and speed of data capture, quality and use.
— Automates data and metadata ingestion and classification.
— Automatically assesses data quality issues and alerts when anomalies are detected.
— Automatically initiates remediation via workflow.
— Automates test data management
— Automatically ensures authorized use of published data assets by enforcing data privacy and governance policies.
Inventory and categorize data
Publish data and use
Deliver quality and governance
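The methodology bullets on automatic quality assessment, alerting and remediation can be illustrated with a small sketch. It is an assumption-laden example, not IBM's implementation: the completeness metric, the 0.9 threshold, the `QualityReport` type and the ticket string are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualityReport:
    completeness: float                # share of required fields that are filled
    alert: bool                        # True when completeness is below threshold
    remediation_ticket: Optional[str]  # placeholder for a workflow hand-off


def assess_quality(records, required_fields, threshold=0.9):
    """Score completeness and flag the data set when it falls below threshold."""
    total = len(records) * len(required_fields)
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    completeness = filled / total if total else 0.0
    alert = completeness < threshold
    # A real pipeline would open a workflow item here; this only builds a label.
    ticket = f"remediate:completeness={completeness:.2f}" if alert else None
    return QualityReport(completeness, alert, ticket)


records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": None},
    {"id": 4, "email": "d@example.com"},
]
report = assess_quality(records, required_fields=["id", "email"])
print(report)
```

Completeness is only one possible quality dimension; validity, uniqueness or timeliness checks could be added the same way and fed into the same alert decision.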
DataOps Interoperates with Peer Organizations
DataOps Interoperates Cross-Functionally
- Application development teams publish source data and incorporate feedback from DataOps to improve data definitions and data quality.
- IT security and compliance teams publish security, privacy and governance policies to DataOps teams to be enforced and respond to audits when necessary.
- Data science teams consume data assets published by data engineering and leverage DataOps for model lineage, data definitions and security and privacy policies.
- Lines-of-business leverage the output of DataOps for accessing high-quality data quickly and efficiently while providing feedback for data definitions, data quality and submitting new assets to be catalogued, assessed and published.
DataOps combines people, process and technology
Organization design
Executive Sponsor
Executive Steering Committee: CDO, CIO, LOB Execs, Chief Risk Officer
Data Architecture Working Group: Enterprise Data Architect, Data Modelers, Database Administrators
Enterprise Data Governance Council: Data Governance Manager, Business Process Owners, Compliance and Legal, Data Custodians
Data Governance Office: Meta Administrator, Data Governance Analyst, Domain Data Stewards, Lead Data Steward
DataOps: Data Pipeline Deployment & Test, DataOps Monitoring & Management, Self-Service Operations, Data Engineers
DataOps in Action at IBM’s Global Chief Data Office
IBM CEO
SVP Finance & Operations, Chief Financial Officer
Global Chief Data Office
VP Finance, Controller
Enterprise Ops & Services
CAO
CIO
Enterprise Data Standards
E2E Data Flows
Enterprise Governance Workflow automation
Data Acquisition (M&A, 3rd Party, Public)
Data Stewardship
Advanced Technology
Hybrid Cloud Development Environment
Production Platform & Solutions Engineering Delivery
Business Controls, Support & Operations
Production Platform Release Mgmt & Project Mgmt
Discovery
Budget & Financial Controls
Platform Adoption
AI Accelerator
BUDO Network
Client Reference Data
Product Data
Modernization & Transformation leveraging Enterprise Data & AI Platform
Enterprise Data & AI Platform Adoption & Value Creation
Client & Product Master Data
Enterprise Data Governance
Deep Learning
IBM Global Chief Data Office Organizational Structure
Importance of Metadata
Metadata: Every enterprise struggles with the problem of labeling
It can take DAYS for SMEs to review/approve a business term
Large risk item, consider:
• Untapped potential in dark data
• Data Governance, Compliance, Audits, potential Leakage of sensitive data
METADATA makes data visible and understandable
Metadata unlocks data
Users can easily find, understand and trust the data they need to drive business insights WITH SPEED
Examples of Metadata Benefits
Regulatory Compliance
Metadata management conducted on a unified platform that provides stewardship, data lineage, and impact analysis services is the best assurance that an organization can validate and demonstrate that the data reported is true.
• e.g., GDPR, Government Owned Entity
Productivity & Discovery
Data is abundant. Much of it comes from existing systems and data stores for which no documentation exists or the documentation does not reflect the changes and updates of those systems and data stores.
• Data scientists can spend 80% of their time finding and cleaning data prior to using it!
Risk Avoidance
Metadata management provides the measure of trust that businesses need. Through data lineage and impact analysis, businesses can know the accuracy, completeness and currency of the data used in their planning or decision-making models.
IBM GCDO automated metadata generation (AMG)
Automated Metadata Generation (AMG) uses automation and data science to link data
• A complex series of organic Deep Learning models was developed for CEDP metadata classifications
• Backed by micro-services: Can be installed anywhere (cloud, container)
• ~60TB of labeled training data in addition to public sources and synthetically generated data
Distributed Federated Learning
Challenges addressed
• Lack of data for model training impacts performance
• Local restrictions on processing business information within the limits of certain jurisdictions
Implementation
• Compliance with local regulation
• Larger volume of training data yields better performance
• No isolated business units that lack training data
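The slide does not say which federated learning algorithm AMG uses. As a generic illustration only, the sketch below uses federated averaging (FedAvg) on a toy linear model: each site takes a training step on its own data, and only model parameters, never raw records, are sent back and averaged. All names and data are hypothetical.

```python
def local_update(weights, data, lr):
    """One gradient-descent step for y = w*x + b using only this site's data."""
    w, b = weights
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return (w - lr * grad_w, b - lr * grad_b)


def federated_average(updates, sizes):
    """Average site parameters, weighting each site by its number of records."""
    total = sum(sizes)
    w = sum(u[0] * s for u, s in zip(updates, sizes)) / total
    b = sum(u[1] * s for u, s in zip(updates, sizes)) / total
    return (w, b)


def train(sites, rounds, lr):
    """Run federated rounds; raw data never leaves the site lists."""
    weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(weights, data, lr) for data in sites]
        weights = federated_average(updates, [len(data) for data in sites])
    return weights


# Two hypothetical jurisdictions holding disjoint samples of y = 2x + 1.
sites = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]
w, b = train(sites, rounds=500, lr=0.05)
print(f"w={w:.3f}, b={b:.3f}")
```

With one local step per round and size-weighted averaging, this toy setup behaves like gradient descent on the pooled data, which is the sense in which a site with little data still benefits from the larger combined volume.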
IBM GCDO Automated Metadata Generation (AMG)
An AI-powered process for curating, verifying, and classifying data that enhances speed and usability
Unified. Classifying terabytes of data to make it easily discoverable while providing the data stewardship, lineage, and impact analysis to assure it is trustworthy
Dramatically enhanced Data Quality with regulatory & governance checks
~$27 million in productivity savings
Up to 95% reduction in cycle time: targeted at full automation in 18 months
Small Tag Set as a Product
How we define it:
• Better prediction quality is available for the small tag set
• No need to provide top-5 recommendations, the choice is easy
600 terms
2500 terms
30% of data
70% of data
Project Stages
1. To provide top-5 recommendation: 5x less workload
2. To provide single recommendation: 20x less workload (~95% workload decrease)
3. To provide the correct Metadata: NO workload, almost.
Goal: full automation, i.e. zero SME involved
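The project stages can be read as moving from a top-5 shortlist to a single recommendation. The sketch below shows that idea and the workload arithmetic behind "20x less workload" and "~95% workload decrease". The scores and term names are invented for illustration, and the assumption that workload scales as 1/factor is ours, not the slide's.

```python
import heapq


def top_k_terms(scores, k=5):
    """Return the k highest-scoring business terms, best first."""
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [term for term, _ in best]


def workload_reduction(factor):
    """Fraction of work removed if the remaining workload is 1/factor."""
    return 1.0 - 1.0 / factor


# Hypothetical classifier scores for one column of data.
scores = {
    "Customer Name": 0.62,
    "Account ID": 0.21,
    "Email Address": 0.09,
    "Postal Code": 0.04,
    "Phone Number": 0.03,
    "Order Date": 0.01,
}
top5 = top_k_terms(scores, k=5)   # stage 1: an SME picks from five candidates
top1 = top_k_terms(scores, k=1)   # stage 2: an SME confirms a single candidate
print(top5, top1, workload_reduction(20))
```

Under that scaling assumption, a 20x reduction leaves 1/20 of the work, which corresponds to the roughly 95% decrease quoted for stage 2, while 5x corresponds to 80%.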
Watson Knowledge Catalog
Automated cataloging to discover, classify, prepare & share data
• ML-driven intelligent discoverability of data sources, models, notebooks, AI artifacts
• Operationalize Data governance program
• Data lineage in the language of the Business
Watson Knowledge Catalog now with automated metadata generation
Up to 96% accuracy on holdout data
Up to 70% accuracy on data that was once inaccessible
Business terms can differ across the different groups in an organization.
To address this: AMG's classifications in the current release use an "umbrella" set of 25 terms defined to cover the varying cases we see at the GCDO
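One way to picture an umbrella term set is as a lookup that folds each group's local vocabulary into a shared label. The slide does not list the 25 terms or describe how AMG maps to them, so every term and name below is hypothetical.

```python
# Hypothetical mapping from group-specific labels to shared umbrella terms.
UMBRELLA_BY_LOCAL_TERM = {
    "client name": "Customer Name",
    "customer full name": "Customer Name",
    "acct no": "Account Identifier",
    "account number": "Account Identifier",
}


def to_umbrella(local_term, mapping=UMBRELLA_BY_LOCAL_TERM, default="Unclassified"):
    """Normalize case and spacing, then look up the umbrella term."""
    key = " ".join(local_term.lower().split())
    return mapping.get(key, default)


labels = ["Client  Name", "ACCT NO", "Shoe Size"]
normalized = [to_umbrella(label) for label in labels]
print(normalized)
```

A static dictionary is only a stand-in here; the deck describes the actual classification as model-driven.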
AMG capability roadmap
Timeline shown: Q1 2018, Q2 2018, Q4 2018, Q4 2019, 2020
Milestones shown:
• Concept development of MVP 1
• MVP 2
• MVP 3
• Proven internally in GCDO and on external enterprise use cases
• Released in Watson Knowledge Catalog services for Cloud Pak for Data
• Subsequent release 1
• Subsequent release 2
Getting Started
— Try Watson Knowledge Catalog today at ibm.com/Watson-Knowledge-Catalog
— Schedule a DataOps Garage Workshop with one of our DataOps Center of Excellence Experts by contacting [email protected]
— Learn more about IBM DataOps at ibm.com/DataOps
Use your data Know your data
Trust your data
Notices and disclaimers
© 2020 International Business Machines Corporation. No part of this document may be reproduced or transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights — use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed “as is” without any warranty, either express or implied. In no event, shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted per the terms and conditions of the agreements under which they are provided. The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions.
IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
Notices and disclaimers continued
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer follows any law.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at: www.ibm.com/legal/copytrade.shtml.