The Emerging Data Lake IT Strategy

Upload: thomas-kelly-pmp

Post on 27-Jan-2015

DESCRIPTION

Meaning making – separating signal from noise. How do we transform the customer's next input into an action that creates a positive customer experience? We make the data more intelligent, so that it is able to guide our actions. The Data Lake builds on Big Data strengths by automating many of the manual development tasks, providing several self-service features to end users, and adding an intelligent management layer to organize it all. This results in lower cost to create solutions, "smart" analytics, and faster time to business value.

TRANSCRIPT

Page 1: The Emerging Data Lake IT Strategy

© 2014

The Emerging Data Lake IT Strategy An Evolving Approach for Dealing with Big Data & Changing Environments

SPEAKERS:

Thomas Kelly, Practice Director Cognizant Technology Solutions

Sean Martin, Founder and CTO Cambridge Semantics

bit.ly/DataLake

Page 2: The Emerging Data Lake IT Strategy

We’re living in an amazing world of information sharing, connecting with family, neighbors, vendors, and customers all over the world

Page 3: The Emerging Data Lake IT Strategy

Telling the world about what we like and don’t like

#HIMYMfinale

@MLB

… is now following Cognizant Technology Solutions and Cambridge Semantics

Page 4: The Emerging Data Lake IT Strategy

What we’re doing and how we’re succeeding

Page 5: The Emerging Data Lake IT Strategy

We’re deciding what advertising we want to see…

… and what we don’t

Unsubscribe

Influencing how business and customers engage

Page 6: The Emerging Data Lake IT Strategy

Many businesses have emerged that embrace this model of customer engagement, and we’ve said goodbye to businesses that didn’t

10 million stays in 2013, without owning a hotel

Grew to nearly $75B in annual retail revenue in 2013, without opening a storefront

Shares over 40 million photos each day

Page 7: The Emerging Data Lake IT Strategy

Retail

Engaging in a more personalized shopping experience, retailers are building a stronger relationship with each customer

Page 8: The Emerging Data Lake IT Strategy

Customer Service

Delivering a positive and successful experience for each customer

Page 9: The Emerging Data Lake IT Strategy

Life Sciences and Healthcare

Combining health, genetic, clinical, and public sciences data to bring effective therapies to patients sooner

Page 10: The Emerging Data Lake IT Strategy

Financial Services

Delivering innovative products and services, based on a 360° view of the Customer, across all business lines, engaging all available data assets, internal and external

Page 11: The Emerging Data Lake IT Strategy

The Challenges That We're Addressing

Onboarding and Integrating Data is Slow and Expensive

• Transforming data from a growing variety of technologies

• Custom coded ETL

• Existing ETL processes are not reusable

• Optimization for analytics is time-consuming and costly

• Onboarding often waits until there is a defined need for a set of data, which delays benefits realization

Data Provenance is Often Poorly Recorded

• Data meaning is “lost in translation”

• Data transformations tracked in spreadsheets

• After onboarding, maintenance and analysis costs for the data are high

• Recreating data lineage is manual, time-consuming, and error-prone

Page 12: The Emerging Data Lake IT Strategy

The Challenges That We're Addressing

Target Data is Difficult to Consume

• Optimization favors known analytics, but is not well suited to new requirements

• A one-size-fits-all canonical view is used rather than fit-for-purpose views

• Or, the target data lacks a conceptual model that would make it easy to consume

• Difficult to identify what data is available, how to get access, and how to integrate the data to answer a question

Industrializing the Big Data Environment is Difficult to Manage

• Proliferation of data silos leads to inconsistency/syncing issues

• Conflicting objectives of opening access to data assets while managing security and privacy requirements

• Velocity of business change rapidly invalidates data organization and analytics optimizations

• Managing the integration/interaction with the multiple data management technologies that make up the Big Data environment

Page 13: The Emerging Data Lake IT Strategy

The Data Lake is made up of four key components

• Data Ingestion

• Data Lake Management

• Data Management

• Query Management

Delivering

• Low Cost, High Performance Storage

• Flexible, Easy-to-Use Data Organization

• Performance-Optimized Analytics

• Automation of most manual Development and Query Activities

• Self-Service End-User Features

• Intelligent Processing

Page 14: The Emerging Data Lake IT Strategy

Data Ingestion

Diagram labels, Data Sources: Linked Data; Internet of Things (IoT); Desktop and Mobile; Operational Systems; Social Media and Cloud

Diagram labels, Data Ingestion: On-Demand Query; Streaming; Semantic Tagging; Scheduled Batch Load; Model-Driven; Self-Service

Page 15: The Emerging Data Lake IT Strategy

Data Management

Diagram labels, Data Management: Provenance; Data Movement; Semantic; Graph; Columnar; In Memory; NoSQL; Map Reduce; HDFS Storage (Structured and Unstructured Data)

Page 16: The Emerging Data Lake IT Strategy

Data Lake Management

Data Assets Catalog: Internal and External Data Assets; Defined Data Orgs (ontologies, taxonomies, thesauri)

Workflow: Processes; Schedules; Provenance Capture

Access Management: Authorization and Access Rules; Rule-based Security; Group, Role, and User Level Authorization; Auditable Access

Models

Data Mappings: Source-to-Target; Transformations

Business-Focused: Business Unit Data Organization and Terms; Optimized to Assist Analytics

Data Governance: Focus on Shared Data; Standard Models; Controlled Vocabulary; Common Definitions; Standards-based Data Views (FIBO, CDISC/RDF)

Monitoring: Monitor and Manage Data Lake Operations
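
The Access Management items on this slide (rule-based security, authorization at group, role, and user level, and auditable access) could be sketched roughly as follows. This is only an illustration, not the presenters' design: the `Rule` and `User` types, the "most specific level wins" precedence, the default-deny behavior, and all user and asset names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    level: str       # "user", "role", or "group"
    principal: str   # the user, role, or group name the rule applies to
    asset: str       # hypothetical data asset name
    allow: bool


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset = frozenset()
    groups: frozenset = frozenset()


# Assumed precedence: a user-level rule overrides role-level, which overrides group-level.
LEVELS = ("user", "role", "group")
audit_log = []  # every decision is recorded so that access is auditable


def is_allowed(user, asset, rules):
    principals = {"user": {user.name}, "role": user.roles, "group": user.groups}
    decision = False  # default deny when no rule matches
    for level in LEVELS:
        matching = [r for r in rules
                    if r.level == level and r.asset == asset
                    and r.principal in principals[level]]
        if matching:
            decision = all(r.allow for r in matching)
            break
    audit_log.append((user.name, asset, decision))
    return decision


rules = [
    Rule("group", "analysts", "sales_data", allow=True),
    Rule("user", "bob", "sales_data", allow=False),
]
alice = User("alice", groups=frozenset({"analysts"}))
bob = User("bob", groups=frozenset({"analysts"}))
carol = User("carol")

results = [is_allowed(u, "sales_data", rules) for u in (alice, bob, carol)]
print(results)
print(audit_log)
```

Here the group rule opens the asset to all analysts, the user-level rule carves out one exception, and anyone without a matching rule is denied.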

Page 17: The Emerging Data Lake IT Strategy

Query Management

Diagram labels, Query Management: Query Data, Metadata, and Provenance; Capture and Share Analytics Expertise; Semantic Search; Analytics Directed to the Best Query Engine; Data Discovery

Page 18: The Emerging Data Lake IT Strategy

Semantic Technology Delivers “Smart” Data

Integrates a network of internal and external data assets, insulating end users from the details of the underlying technologies

Captures expertise (logic, inferencing) and integrates it with the data, delivering “smart” data to non-expert users

Manages a comprehensive inventory of the data assets

Secures access to the right data assets by the right users

Page 19: The Emerging Data Lake IT Strategy

Key W3C Standards in Semantic Technology

Resource Description Framework (RDF)

Framework for storing and integrating data and data definitions in the form of subject-predicate-object expressions, or “triples”. Relationships are organized in a logical graph model. Reduced development time and cost; faster time-to-business value.
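
To make the triple idea concrete, here is a minimal sketch that holds subject-predicate-object triples as plain Python tuples and indexes them as a graph. It does not use an RDF library, and every identifier (`customer:42`, `ex:placedOrder`, and so on) is a hypothetical name chosen for illustration.

```python
from collections import defaultdict

# Each fact is one (subject, predicate, object) triple. All names are made up.
triples = [
    ("customer:42", "rdf:type", "ex:Customer"),
    ("customer:42", "ex:name", "Pat Lee"),
    ("customer:42", "ex:placedOrder", "order:1001"),
    ("order:1001", "rdf:type", "ex:Order"),
    ("order:1001", "ex:total", 250.0),
]


def build_graph(triples):
    """Index triples by subject so each node lists its outgoing edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


# A new kind of fact needs no schema change: it is just another triple.
triples.append(("customer:42", "ex:follows", "brand:acme"))
graph = build_graph(triples)

for predicate, obj in graph["customer:42"]:
    print(predicate, "->", obj)
```

The point of the sketch is the last step: extending the data with a new relationship is an append, not a table redesign.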

Web Ontology Language (OWL)

An ontology is a comprehensive model of data definitions and relationships that is human- and machine-readable. Ontologies are inheritable and extensible. Improved application quality, flexible iterative / investigative approach, easily adapts to business change.

SPARQL Query Language

SQL-like query language for semantic data that can leverage the ontological relationships and constructs to execute smarter queries. Access multiple internal and external databases simultaneously in a single query. Access and integrate data across business silos.
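
The core of such a query is graph pattern matching: patterns with variables are matched against triples, and shared variables join the results. The sketch below imitates that in plain Python over two hypothetical "silos"; it is not SPARQL syntax and not a SPARQL engine, and the data and the `?variable` convention are assumptions for illustration.

```python
def is_var(term):
    """Pattern terms that start with '?' are treated as variables."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend a variable binding so the pattern equals the triple, or return None."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if p_term in new_binding and new_binding[p_term] != t_term:
                return None
            new_binding[p_term] = t_term
        elif p_term != t_term:
            return None
    return new_binding


def query(triples, patterns):
    """Match all patterns in order, joining on shared variables."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Two hypothetical silos: a CRM system and an order management system.
crm = [("customer:42", "ex:name", "Pat Lee"), ("customer:7", "ex:name", "Sam Roy")]
oms = [("customer:42", "ex:placedOrder", "order:1001"), ("order:1001", "ex:total", 250.0)]

results = query(crm + oms, [
    ("?c", "ex:name", "?name"),
    ("?c", "ex:placedOrder", "?o"),
    ("?o", "ex:total", "?total"),
])
print(results)
```

Because both silos use the same subject identifier for the customer, a single query joins across them without a prior integration step.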

Inference

Reasoning over data through business rules. Expertise is captured and embedded in the ontology model, accessible through user queries. This is the “smart” in Smart Data. Easier end user access to expertise; intelligent systems capabilities.
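
As a small illustration of inference, the sketch below forward-chains two RDFS-style rules (subclass transitivity and type inheritance) over a set of triples. Real reasoners support far richer rule sets; the class names and the customer identifier are hypothetical.

```python
def infer(triples):
    """Apply two rules repeatedly until no new facts appear:
    (A subClassOf B), (B subClassOf C) => (A subClassOf C)
    (x type A), (A subClassOf B)       => (x type B)
    """
    facts = set(triples)
    while True:
        derived = set()
        for s, p, o in facts:
            if p != "rdfs:subClassOf":
                continue
            for s2, p2, o2 in facts:
                if p2 == "rdfs:subClassOf" and s2 == o:
                    derived.add((s, "rdfs:subClassOf", o2))
                elif p2 == "rdf:type" and o2 == s:
                    derived.add((s2, "rdf:type", o))
        if derived <= facts:
            return facts
        facts |= derived


# The expertise ("a platinum customer is a loyalty customer") lives in the model.
model = [
    ("ex:PlatinumCustomer", "rdfs:subClassOf", "ex:LoyaltyCustomer"),
    ("ex:LoyaltyCustomer", "rdfs:subClassOf", "ex:Customer"),
]
data = [("customer:42", "rdf:type", "ex:PlatinumCustomer")]

facts = infer(model + data)
customers = sorted(s for s, p, o in facts if p == "rdf:type" and o == "ex:Customer")
print(customers)
```

A user asking for all customers gets `customer:42` even though the source data only ever recorded it as a platinum customer.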

Linked Data

Connects data contained in different databases, allowing queries to find, share and combine data so insights can be identified across the Web. Connect disparate databases to navigate and integrate data regardless of location or technology platform.

RDB to RDF Mapping Language (R2RML)

Preserving current investments in relational technology, R2RML maps relational data to an ontology. SPARQL can query RDF and relational databases simultaneously. Low cost of entry to use Semantic Technology to deliver high-value solutions
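
The idea behind such a mapping can be sketched as a small declarative description that turns relational rows into triples. The example below uses an in-memory SQLite table and a plain Python dictionary; it is not R2RML syntax (real R2RML mappings are themselves written in RDF), and the table, columns, and `ex:` names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(42, "Pat Lee", "Boston"), (7, "Sam Roy", "Austin")],
)

# Illustrative mapping: which table, how to mint a subject, which class,
# and which predicate each column becomes.
mapping = {
    "table": "customer",
    "subject_template": "customer:{id}",
    "class": "ex:Customer",
    "columns": {"name": "ex:name", "city": "ex:city"},
}


def rows_to_triples(conn, mapping):
    """Emit one type triple plus one triple per mapped column for each row."""
    triples = []
    # The table name comes from our own mapping, never from user input.
    for row in conn.execute(f"SELECT * FROM {mapping['table']} ORDER BY id"):
        subject = mapping["subject_template"].format(**dict(row))
        triples.append((subject, "rdf:type", mapping["class"]))
        for column, predicate in mapping["columns"].items():
            triples.append((subject, predicate, row[column]))
    return triples


triples = rows_to_triples(conn, mapping)
for triple in triples:
    print(triple)
```

The relational table stays where it is; only the mapping is new, which is the "low cost of entry" the slide refers to.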

Page 20: The Emerging Data Lake IT Strategy

The Common Model is the “Data Glue”

Lead (SFA system)

Quote (Quote system)

Order (OMS system)

Contract (CMS system)

Common Model (“Data Glue”)

Source Systems

• Different business entities in physical systems actually share many of the same concepts, meanings, and relationships

• Semantic data science exposes common business concepts and connects them with their physical expression in production systems

• Data is “glued” together by its business meaning, rather than physical structures dictated by the underlying technologies

The conceptual model can be directly used by both business and IT users to operationalize data services, understand the data landscape, track data lineage, and conduct downstream analytics.
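
A rough sketch of the "data glue" idea: records from the four source systems named on this slide use different physical field names, and a common model maps each of them to shared business concepts. The field names, the three common concepts (`id`, `customer`, `amount`), and the sample values are all invented for illustration.

```python
# Physical field name -> common business concept, per source system (all hypothetical).
SOURCE_TO_COMMON = {
    "SFA": {"lead_id": "id", "prospect_name": "customer", "est_value": "amount"},
    "Quote": {"quote_no": "id", "cust": "customer", "quoted_price": "amount"},
    "OMS": {"order_id": "id", "customer_name": "customer", "order_total": "amount"},
    "CMS": {"contract_ref": "id", "party": "customer", "contract_value": "amount"},
}


def to_common(system, record):
    """Re-express a source record in terms of the common model."""
    field_map = SOURCE_TO_COMMON[system]
    common = {field_map[k]: v for k, v in record.items() if k in field_map}
    common["source_system"] = system  # keep lineage back to the physical system
    return common


records = [
    ("SFA", {"lead_id": "L-17", "prospect_name": "Acme Corp", "est_value": 50000}),
    ("Quote", {"quote_no": "Q-88", "cust": "Acme Corp", "quoted_price": 48000}),
    ("OMS", {"order_id": "O-301", "customer_name": "Acme Corp", "order_total": 47500}),
    ("CMS", {"contract_ref": "C-9", "party": "Acme Corp", "contract_value": 47500}),
]

unified = [to_common(system, record) for system, record in records]
acme = [r for r in unified if r["customer"] == "Acme Corp"]
print([(r["source_system"], r["amount"]) for r in acme])
```

Once every record speaks the common vocabulary, following one customer from lead to contract is a simple filter rather than four system-specific queries.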

Page 21: The Emerging Data Lake IT Strategy

Semantic Models Relate Data by Business Meaning

Life Events

Life Style

Preferences

Interests

Customer

Music

Purchasing

Personal Network

Entertainment

Profession

Page 22: The Emerging Data Lake IT Strategy

Implications to the Existing IT Architecture and Practices

User Tools to Discover and Optimize Data Relationships

Structured and Unstructured Data, Voice, and Video

Data Analysis Automation

Extends Existing Investments in IT Architecture

Manages Secure Access

Builds Out Enterprise Data Models, with Integration Hub Capabilities

Self-Service Data Feeds and Analytics

Infrastructure Capacity Elasticity

Reduction of Data Mart Silos

Easier Access to External Data

Page 23: The Emerging Data Lake IT Strategy

Data Lake Approach to Meeting Business Needs

Business Need: Onboard New Data

Traditional Technologies and Practices: Comprehensive analysis creates rigid structure that is difficult to change, or minimal definition of data organization requires detailed understanding of data contents

Data Lake Technologies and Practices: Flexible data model can be revised or extended without redesign of the database. Agile, evolutionary refinement of the data organization, leveraging new insights as users work with the data

Business Need: Connect External Data

Traditional Technologies and Practices: External data is collected and loaded into the analytics repository. Data is streamed, or is refreshed on a scheduled frequency.

Data Lake Technologies and Practices: External data can be sourced from databases, spreadsheets, Web pages, news feeds, and more; data is queried through common methods, without regard to location, with real-time values delivered at query time.

Business Need: Integrate Data between Business Units or Business Partners

Traditional Technologies and Practices:

• Governance activities establish common vocabulary, and data definitions

• Shared data is copied to an integrated database.

• Organization-specific definitions may require duplicating certain data in marts

Data Lake Technologies and Practices:

• And, systems of record publish existing data specifications or ontology model; each organization defines data in a manner that is best suited for its business.

• Federation and virtualization features provide choices in which data to copy and which data to retain in the system(s) of record

• All models can be supported through a single copy of the data, maintained in the data lake or system of record.

Business Need: Capture and Embed Expertise

Traditional Technologies and Practices: Expertise often captured in the reporting and analytics; change management challenge when updates required.

Data Lake Technologies and Practices: Expertise captured in the data definitions; single, shared definition minimizes change management efforts

Page 24: The Emerging Data Lake IT Strategy

Lessons learned from early adopters

Prioritize: Prioritize data onboarding by the data’s ability to contribute to customer engagement

Onboard: Onboard data assets as they become available

Connect: Connect to available internal and external data assets

Load: Load the data unfiltered/untransformed

Organize: Use models to provide organization to the data

Customize: Create models that are tailored to the needs of the business groups

Search: Make it easy to find data

Secure: Manage security and privacy, but make it easy to authorize access to data that users need
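
The Load, Organize, and Customize lessons above can be read as "store raw, apply models at read time". The sketch below illustrates that reading under simple assumptions: raw JSON lines are kept exactly as received, and each business group gets its own lightweight model that is applied only when data is read. The store, the model names, and the field names are hypothetical.

```python
import json

raw_store = []  # stands in for low-cost storage; holds records exactly as received


def load_raw(line):
    """Load: keep the data unfiltered and untransformed."""
    raw_store.append(line)


# Customize: one model per business group, mapping business terms to raw fields.
MODELS = {
    "marketing": {"customer": "cust_id", "channel": "src"},
    "finance": {"customer": "cust_id", "revenue": "amt"},
}


def read_with_model(group):
    """Organize: apply a group's model at read time, leaving the raw data untouched."""
    model = MODELS[group]
    for line in raw_store:
        record = json.loads(line)
        yield {term: record.get(field) for term, field in model.items()}


load_raw('{"cust_id": "c42", "src": "email", "amt": 19.99, "ts": "2014-03-01"}')
load_raw('{"cust_id": "c7", "src": "web", "amt": 5.0}')

finance = list(read_with_model("finance"))
marketing = list(read_with_model("marketing"))
print(finance)
print(marketing)
```

Both groups work from the same single copy of the data, and adding a third group's view means adding a model, not reloading anything.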

Page 25: The Emerging Data Lake IT Strategy

Addressing Challenges

- Privacy vs Personal Value

- Granularity of customer understanding

- Delivering strategic objectives when projects tend to have a technical focus

- Opening access to data

- Need for executive sponsorship

- Access to external data

- Establishing firewalls

- Persistent, pervasive data quality issues

Page 26: The Emerging Data Lake IT Strategy

Clues to better customer engagement will be found in the ever-growing volume of data that we’re creating

Page 27: The Emerging Data Lake IT Strategy

A Data Lake Strategy helps you to create a personalized, engaging experience with each customer

Visibility Self-Service

Smart Provenance

Open, yet Secure

Internet Scale

Agile

Adaptable

Universal Data Access

Page 28: The Emerging Data Lake IT Strategy

Questions?

Page 29: The Emerging Data Lake IT Strategy

Thank you!