The Emerging Data Lake IT Strategy
DESCRIPTION
Meaning making – separating signal from noise. How do we transform the customer's next input into an action that creates a positive customer experience? We make the data more intelligent, so that it is able to guide our actions. The Data Lake builds on Big Data strengths by automating many of the manual development tasks, providing several self-service features to end users, and adding an intelligent management layer to organize it all. This results in a lower cost to create solutions, "smart" analytics, and faster time to business value.
TRANSCRIPT
© 2014
The Emerging Data Lake IT Strategy
An Evolving Approach for Dealing with Big Data & Changing Environments
SPEAKERS:
Thomas Kelly, Practice Director, Cognizant Technology Solutions
Sean Martin, Founder and CTO, Cambridge Semantics
bit.ly/DataLake
We’re living in an amazing world of information sharing,
connecting with family, neighbors, vendors, and customers all over the world
Telling the world about what we like and don’t like
#HIMYMfinale
@MLB
… is now following Cognizant Technology Solutions and Cambridge Semantics
What we’re doing and how we’re succeeding
We’re deciding what advertising we want to see…
… and what we don’t
Unsubscribe
Influencing how business and customers engage
Many businesses have emerged that embrace this model of customer engagement
and we’ve said goodbye to businesses that didn’t
10 million stays in 2013, without owning a hotel
Grew to nearly $75B in annual retail revenue in 2013, without opening a storefront
Shares over 40 million photos each day
Retail
Engaging in a more personalized shopping experience, retailers are building a stronger relationship with each customer
Customer Service
Delivering a positive and successful experience for each customer
Life Sciences and Healthcare
Combining health, genetic, clinical, and public sciences data to bring effective therapies to patients sooner
Financial Services
Delivering innovative products and services, based on a 360° view of the Customer, across all business lines, engaging all available data assets, internal and external
The Challenges That We're Addressing
Onboarding and Integrating Data is Slow and Expensive
• Transforming data from a growing variety of technologies
• Custom coded ETL
• Existing ETL processes are not reusable
• Optimization for analytics is time-consuming and costly
• Onboarding often waits until there is a defined need for a set of data, delaying benefits realization
Data Provenance is Often Poorly Recorded
• Data meaning is “lost in translation”
• Data transformations tracked in spreadsheets
• Post-onboarding maintenance and analysis costs for onboarded data are high
• Recreating data lineage is manual, time-consuming, and error-prone
The Challenges That We're Addressing
Target Data is Difficult to Consume
• Optimization favors known analytics, but is not well suited to new requirements
• A one-size-fits-all canonical view is used rather than fit-for-purpose views
• Or, the target data lacks a conceptual model that would make it easy to consume
• Difficult to identify what data is available, how to get access, and how to integrate the data to answer a question
Industrializing the Big Data Environment is Difficult to Manage
• Proliferation of data silos leads to inconsistency/syncing issues
• Conflicting objectives of opening access to data assets while managing security and privacy requirements
• Velocity of business change rapidly invalidates data organization and analytics optimizations
• Managing the integration/interaction with the multiple data management technologies that make up the Big Data environment
The Data Lake is made up of four key components
• Data Ingestion
• Data Lake Management
• Data Management
• Query Management
Delivering
• Low Cost, High Performance Storage
• Flexible, Easy-to-Use Data Organization
• Performance-Optimized Analytics
• Automation of most manual Development and Query Activities
• Self-Service End-User Features
• Intelligent Processing
Data Ingestion
[Diagram: Data Lake components (Data Ingestion, Data Lake Management, Data Management, Query Management)]
Data Sources
• Linked Data
• Internet of Things (IoT)
• Desktop and Mobile
• Operational Systems
• Social Media and Cloud
Data Ingestion
• On-Demand Query
• Streaming
• Semantic Tagging
• Scheduled Batch Load
• Model-Driven
• Self-Service
Data Management
[Diagram: Data Lake components (Data Ingestion, Data Lake Management, Data Management, Query Management)]
Data Sources: Linked Data, Internet of Things (IoT), Desktop and Mobile, Operational Systems, Social Media and Cloud
Data Ingestion: On-Demand Query, Streaming, Semantic Tagging, Scheduled Batch Load, Model-Driven, Self-Service
Data Management
• Provenance
• Data Movement
• Semantic
• Graph
• Columnar
• In Memory
• NoSQL
• Map Reduce
• HDFS Storage (Structured and Unstructured Data)
Data Lake Management
[Diagram: Data Lake components (Data Ingestion, Data Lake Management, Data Management, Query Management)]
Data Sources: Linked Data, Internet of Things (IoT), Desktop and Mobile, Operational Systems, Social Media and Cloud
Data Ingestion: On-Demand Query, Streaming, Semantic Tagging, Scheduled Batch Load, Model-Driven, Self-Service
Data Management: Provenance, Data Movement, Semantic, Graph, Columnar, In Memory, NoSQL, Map Reduce, HDFS Storage (Structured and Unstructured Data)
Data Assets Catalog
• Internal and External Data Assets
• Defined Data Orgs (ontologies, taxonomies, thesauri)
Workflow
• Processes
• Schedules
• Provenance Capture
Models
Data Mappings
• Source-to-Target
• Transformations
Business-Focused
• Business Unit Data Organization and Terms
• Optimized to Assist Analytics
Data Governance
• Focus on Shared Data
• Standard Models
• Controlled Vocabulary
• Common Definitions
• Standards-based Data Views (FIBO, CDISC/RDF)
Access Management
• Authorization and Access Rules
• Rule-based Security
• Group, Role, and User Level Authorization
• Auditable Access
Monitoring
• Monitor and Manage Data Lake Operations
Query Management
[Diagram: Data Lake components (Data Ingestion, Data Lake Management, Data Management, Query Management)]
Data Sources: Linked Data, Internet of Things (IoT), Desktop and Mobile, Operational Systems, Social Media and Cloud
Data Ingestion: On-Demand Query, Streaming, Semantic Tagging, Scheduled Batch Load, Model-Driven, Self-Service
Data Management: Provenance, Data Movement, Semantic, Graph, Columnar, In Memory, NoSQL, Map Reduce, HDFS Storage (Structured and Unstructured Data)
Query Management
• Query Data, Metadata, and Provenance
• Capture and Share Analytics Expertise
• Semantic Search
• Analytics Directed to the Best Query Engine
• Data Discovery
Semantic Technology Delivers “Smart” Data
Integrates a network of internal and external data assets, insulating end users from the details of the underlying technologies
Captures expertise (logic, inferencing) and integrates it with the data, delivering “smart” data to non-expert users
Manages a comprehensive inventory of the data assets
Secures access to the right data assets by the right users
Key W3C Standards in Semantic Technology
Resource Description Framework (RDF)
Framework for storing and integrating data and data definitions in the form of subject-predicate-object expressions, or “triples”. Relationships are organized in a logical graph model.
Reduced development time and cost; faster time-to-business value.
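As a rough illustration of the subject-predicate-object idea, the sketch below models a graph as a plain Python set of 3-tuples. It is not an RDF library; the `ex:` identifiers and the helper names `objects` and `subjects` are made up for this example.

```python
# A triple is (subject, predicate, object); a graph is a set of triples.
# All "ex:" identifiers are hypothetical and exist only for illustration.
graph = {
    ("ex:customer42", "rdf:type", "ex:Customer"),
    ("ex:customer42", "ex:name", "Pat Smith"),
    ("ex:order7", "rdf:type", "ex:Order"),
    ("ex:order7", "ex:placedBy", "ex:customer42"),
}


def objects(graph, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def subjects(graph, predicate, obj):
    """Return every subject that has the given predicate and object."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


# Recording a new kind of fact needs no schema change: just add a triple.
graph.add(("ex:customer42", "ex:follows", "ex:cognizant"))

print(objects(graph, "ex:customer42", "ex:name"))
print(subjects(graph, "ex:placedBy", "ex:customer42"))
```

Because every fact has the same three-part shape, new relationships can be added without redesigning a table layout, which is the flexibility the slide attributes to the graph model.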
Web Ontology Language (OWL)
An ontology is a comprehensive model of data definitions and relationships that is human- and machine-readable. Ontologies are inheritable and extensible.
Improved application quality, flexible iterative / investigative approach, easily adapts to business change.
SPARQL Query Language
SQL-like query language for semantic data that can leverage the ontological relationships and constructs to execute smarter queries. Access multiple internal and external databases simultaneously in a single query.
Access and integrate data across business silos.
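To give a feel for how a pattern-based query can span more than one source, here is a minimal sketch of triple-pattern matching in plain Python. It is not SPARQL and not a real query engine; the `match` function, the `crm` and `oms` data sets, and the `ex:` names are assumptions made for illustration. Terms beginning with `?` act as variables, as they do in SPARQL.

```python
def match(graph, patterns, binding=None):
    """Yield variable bindings that satisfy every triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(graph, rest, candidate)


# Two hypothetical sources: a CRM system and an order management system.
crm = {
    ("ex:c42", "ex:name", "Pat Smith"),
    ("ex:c43", "ex:name", "Lee Wong"),
}
oms = {
    ("ex:o7", "ex:placedBy", "ex:c42"),
    ("ex:o7", "ex:status", "shipped"),
    ("ex:o8", "ex:placedBy", "ex:c43"),
    ("ex:o8", "ex:status", "open"),
}

# Roughly: SELECT ?name WHERE { ?o ex:status "shipped" .
#                               ?o ex:placedBy ?c . ?c ex:name ?name }
query = [
    ("?o", "ex:status", "shipped"),
    ("?o", "ex:placedBy", "?c"),
    ("?c", "ex:name", "?name"),
]
names = sorted(b["?name"] for b in match(crm | oms, query))
print(names)
```

The query never mentions which system holds which fact; the shared identifiers join the two sources. A real deployment would rely on a SPARQL engine rather than a hand-written matcher like this one.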
Inference
Reasoning over data through business rules. Expertise is captured and embedded in the ontology model, accessible through user queries. This is the “smart” in Smart Data.
Easier end user access to expertise; intelligent systems capabilities.
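One simple form of inference is class inheritance: if a model says every premium customer is a customer, a query for customers should also return premium customers. The sketch below forward-chains two RDFS-style rules over a set of triples until nothing new can be derived. The `infer` function and the `ex:` class names are hypothetical, and the rules shown are only a small subset of what a real reasoner supports.

```python
def infer(graph):
    """Apply two RDFS-style rules repeatedly until no new triples appear."""
    inferred = set(graph)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != "rdfs:subClassOf":
                continue
            for s2, p2, o2 in inferred:
                # Rule 1: A subClassOf B and B subClassOf C implies A subClassOf C.
                if p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdfs:subClassOf", o2))
                # Rule 2: x is an A and A subClassOf B implies x is a B.
                if p2 == "rdf:type" and o2 == s:
                    new.add((s2, "rdf:type", o))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical model (class hierarchy) plus one data fact.
graph = {
    ("ex:PremiumCustomer", "rdfs:subClassOf", "ex:Customer"),
    ("ex:Customer", "rdfs:subClassOf", "ex:Party"),
    ("ex:c42", "rdf:type", "ex:PremiumCustomer"),
}

closed = infer(graph)
customers = sorted(s for s, p, o in closed if p == "rdf:type" and o == "ex:Customer")
print(customers)
```

The only fact stored about `ex:c42` is that it is a premium customer; the membership in `ex:Customer` comes from the model, so the expertise lives with the data definitions rather than in each report.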
Linked Data
Connects data contained in different databases, allowing queries to find, share and combine data so insights can be identified across the Web.
Connect disparate databases to navigate and integrate data regardless of location or technology platform.
RDB to RDF Mapping Language (R2RML)
Preserving current investments in relational technology, R2RML maps relational data to an ontology. SPARQL can query RDF and relational databases simultaneously.
Low cost of entry to use Semantic Technology to deliver high-value solutions
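The idea behind a relational-to-RDF mapping can be sketched with Python's built-in `sqlite3` module. This is not R2RML syntax or an R2RML processor; the `mapping` dictionary only mimics the general shape of such a mapping (a subject template, a class, and column-to-predicate pairs), and the table, column, and `ex:` names are invented for the example.

```python
import sqlite3

# A small in-memory relational table standing in for an existing system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(42, "Pat Smith", "Boston"), (43, "Lee Wong", "Austin")],
)

# Hypothetical mapping: how one table is expressed in terms of the ontology.
mapping = {
    "table": "customer",
    "subject_template": "ex:customer/{id}",
    "class": "ex:Customer",
    "columns": {"name": "ex:name", "city": "ex:city"},
}


def rows_to_triples(conn, mapping):
    """Turn each row of the mapped table into a set of triples."""
    # The table name comes from trusted configuration, not from user input.
    cursor = conn.execute(f"SELECT * FROM {mapping['table']}")
    column_names = [description[0] for description in cursor.description]
    triples = set()
    for row in cursor:
        record = dict(zip(column_names, row))
        subject = mapping["subject_template"].format(**record)
        triples.add((subject, "rdf:type", mapping["class"]))
        for column, predicate in mapping["columns"].items():
            triples.add((subject, predicate, record[column]))
    return triples


triples = rows_to_triples(conn, mapping)
for triple in sorted(triples):
    print(triple)
```

The relational table stays where it is; only the mapping is new, which is the sense in which the slide describes a low cost of entry.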
The Common Model is the “Data Glue”
Lead (SFA system)
Quote (Quote system)
Order (OMS system)
Contract (CMS system)
Common Model (“Data Glue”)
Source Systems
• Different business entities in physical systems actually share many of the same concepts, meanings, and relationships
• Semantic data science exposes common business concepts and connects them with their physical expression in production systems
• Data is “glued” together by its business meaning, rather than physical structures dictated by the underlying technologies
The conceptual model can be directly used by both business and IT users to operationalize data services, understand the data landscape, track data lineage, and conduct downstream analytics.
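A minimal sketch of the "data glue" idea follows, assuming each source system names the same business concepts differently. The field names, the `FIELD_MAP` table, and the sample records are all hypothetical; only the four system names (SFA, Quote, OMS, CMS) come from the slide.

```python
# Hypothetical mapping from each system's physical field names
# to shared business concepts in the common model.
FIELD_MAP = {
    "SFA": {"prospect": "customer", "est_value": "amount"},
    "Quote": {"account_name": "customer", "quoted_price": "amount"},
    "OMS": {"buyer": "customer", "order_total": "amount"},
    "CMS": {"party": "customer", "contract_value": "amount"},
}

# Sample records as each system might store them (invented data).
records = [
    ("SFA", {"prospect": "Acme", "est_value": 1000}),
    ("Quote", {"account_name": "Acme", "quoted_price": 950}),
    ("OMS", {"buyer": "Acme", "order_total": 950}),
    ("CMS", {"party": "Globex", "contract_value": 5000}),
]


def to_common(system, record):
    """Re-express a source record using common-model concept names."""
    mapping = FIELD_MAP[system]
    common = {concept: record[field] for field, concept in mapping.items() if field in record}
    common["source"] = system
    return common


def history(customer):
    """Collect everything known about a customer across all systems."""
    translated = (to_common(system, record) for system, record in records)
    return [entry for entry in translated if entry.get("customer") == customer]


for entry in history("Acme"):
    print(entry)
```

Once the records are expressed in shared concepts, a lead, a quote, and an order for the same customer line up without anyone needing to know each system's physical field names.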
Semantic Models Relate Data by Business Meaning
Life Events
Life Style
Preferences
Interests
Customer
Music
Purchasing
Personal Network
Entertainment
Profession
Implications to the Existing IT Architecture and Practices
• User Tools to Discover and Optimize Data Relationships
• Structured and Unstructured Data, Voice, and Video
• Data Analysis Automation
• Extends Existing Investments in IT Architecture
• Manages Secure Access
• Builds Out Enterprise Data Models, with Integration Hub Capabilities
• Self-Service Data Feeds and Analytics
• Infrastructure Capacity Elasticity
• Reduction of Data Mart Silos
• Easier Access to External Data
Data Lake Approach to Meeting Business Needs
Business Need: Onboard New Data
Traditional Technologies and Practices
• Comprehensive analysis creates rigid structure that is difficult to change, or minimal definition of data organization requires detailed understanding of data contents
Data Lake Technologies and Practices
• Flexible data model can be revised or extended without redesign of the database
• Agile, evolutionary refinement of the data organization, leveraging new insights as users work with the data
Business Need: Connect External Data
Traditional Technologies and Practices
• External data is collected and loaded into the analytics repository.
• Data is streamed, or is refreshed on a scheduled frequency.
Data Lake Technologies and Practices
• External data can be sourced from databases, spreadsheets, Web pages, news feeds, and more; data is queried through common methods, without regard to location, with real-time values delivered at query time.
Business Need: Integrate Data between Business Units or Business Partners
Traditional Technologies and Practices
• Governance activities establish common vocabulary and data definitions
• Shared data is copied to an integrated database.
• Organization-specific definitions may require duplicating certain data in marts
Data Lake Technologies and Practices
• And, systems of record publish existing data specifications or ontology model; each organization defines data in a manner that is best suited for its business.
• Federation and virtualization features provide choices in which data to copy and which data to retain in the system(s) of record
• All models can be supported through a single copy of the data, maintained in the data lake or system of record.
Business Need: Capture and Embed Expertise
Traditional Technologies and Practices
• Expertise often captured in the reporting and analytics; change management challenge when updates required.
Data Lake Technologies and Practices
• Expertise captured in the data definitions; single, shared definition minimizes change management efforts
Lessons learned from early adopters
Prioritize: Prioritize data onboarding by the data’s ability to contribute to customer engagement
Onboard: Onboard data assets as they become available
Connect: Connect to available internal and external data assets
Load: Load the data unfiltered/untransformed
Organize: Use models to provide organization to the data
Customize: Create models that are tailored to the needs of the business groups
Search: Make it easy to find data
Secure: Manage security and privacy, but make it easy to authorize access to data that users need
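The Load, Organize, and Customize lessons can be pictured as keeping raw records untouched and applying a group-specific model only when the data is read. The sketch below is one possible reading of that idea, not a description of any product; the record layout, the `MODELS` table, and the `read_with_model` function are assumptions made for illustration.

```python
import json

# Raw records are stored exactly as they arrived (invented sample data).
raw_lake = [
    '{"src": "web", "user": "c42", "evt": "view", "sku": "A1"}',
    '{"src": "pos", "cust_id": "c42", "item": "A1", "qty": 2}',
    '{"src": "web", "user": "c43", "evt": "view", "sku": "B9"}',
]

# Hypothetical models: each business group names the fields it cares about
# and says where to find them in each source's raw records.
MODELS = {
    "marketing": {"web": {"customer": "user", "product": "sku"}},
    "sales": {"pos": {"customer": "cust_id", "product": "item", "quantity": "qty"}},
}


def read_with_model(raw_lake, model):
    """Apply a model at read time; the stored raw records are never changed."""
    for line in raw_lake:
        record = json.loads(line)
        fields = model.get(record.get("src"))
        if fields is None:
            continue  # this model says nothing about this source
        yield {name: record.get(source_key) for name, source_key in fields.items()}


sales_view = list(read_with_model(raw_lake, MODELS["sales"]))
marketing_view = list(read_with_model(raw_lake, MODELS["marketing"]))
print(sales_view)
print(marketing_view)
```

Both views are served from the same single copy of the raw data, and adding a model for another business group does not require reloading anything.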
Addressing Challenges
- Privacy vs Personal Value
- Granularity of customer understanding
- Delivering strategic objectives when projects tend to have a technical focus
- Opening access to data
- Need for executive sponsorship
- Access to external data
- Establishing firewalls
- Persistent, pervasive data quality issues
Clues to better customer engagement will be found in the ever-growing volume of data that we’re creating
A Data Lake Strategy helps you to create a personalized, engaging experience with each customer
Visibility
Self-Service
Smart
Provenance
Open, yet Secure
Internet Scale
Agile
Adaptable
Universal Data Access
Questions?
Thank you!