How to Build a Successful Data Lake

Alex Gorelik, Founder and CEO, Waterline Data

Upload: hadoop-summit

Post on 06-Jan-2017


TRANSCRIPT

Page 1: How to build a successful Data Lake

How to Build a Successful Data Lake

Alex Gorelik

Founder and CEO, Waterline Data

Page 2: How to build a successful Data Lake

Data Lakes Power Data-Driven Decision Making

Page 3: How to build a successful Data Lake

Maximize Business Value With a Data Lake

How Do You Democratize the Data Lake to Maximize Business Value?

[Chart: Data Swamp, Data Puddle, DW Off-loading, and Data Lake positioned by Business Value (No Value to Enterprise Impact) and Data Democratization (Tight Control to “Governed” Self-Service)]

Page 4: How to build a successful Data Lake

Data Swamps

Raw data

Can’t find or use data

Can’t allow access without protecting sensitive data

Page 5: How to build a successful Data Lake

Data Warehouse Offloading: Cost Savings

I prefer a data warehouse--it’s more predictable

It takes IT 3 months of data architecture and ETL work to add new data to the data lake

I can’t get the original data

Page 6: How to build a successful Data Lake

Data Puddles: Limited Scope and Value

Low variety of data and low adoption

• Focused use case (e.g., fraud detection)

• Fully automated programs (e.g., ETL off-loading)

• Small user community (e.g., data science sandbox)

Strong technical skill set requirement

Page 7: How to build a successful Data Lake

What Makes a Successful Data Lake?

Right Platform + Right Data + Right Interface

Page 8: How to build a successful Data Lake

Right Platform:

• Volume—Massively scalable

• Variety—Schema on read

• Future proof—modular—same data can be used by many different projects and technologies

• Platform cost – extremely attractive cost structure
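To make "schema on read" and "same data can be used by many different projects" concrete, here is a minimal Python sketch. It is not from the talk: the records, field names, and the `read_with_schema` helper are all hypothetical, and a real lake would apply the same idea through a query engine over files in HDFS rather than an in-memory list.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no upfront schema).
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "ts": "2016-06-28T10:00:00", "device": "ios"}',
    '{"user": "u2", "amount": "5.00", "ts": "2016-06-28T10:05:00"}',
    '{"user": "u1", "amount": "42.50", "ts": "2016-06-28T11:30:00", "coupon": "SUMMER"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: keep only the fields a project needs and cast them.

    `schema` maps field name -> casting function. Fields absent from a record become None.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({name: (cast(record[name]) if name in record else None)
                     for name, cast in schema.items()})
    return rows


# Two projects read the same raw data with different schemas.
finance_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
marketing_view = read_with_schema(RAW_EVENTS, {"user": str, "device": str, "coupon": str})

# The finance project can aggregate without the raw data ever being reshaped on write.
total_by_user = {}
for row in finance_view:
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]
```

Because nothing is discarded or reshaped at ingestion, a later project can define a third schema over the same records without touching the first two.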

Page 9: How to build a successful Data Lake

Right Data Challenges: Most Data Is Lost, So It Can’t Be Analyzed Later

Only a small portion of data in enterprises today is saved in data warehouses

Data Exhaust

Page 10: How to build a successful Data Lake

Right Data: Save Raw Data Now to Analyze Later

• Don’t know now what data will be needed later

• Save as much data as possible now to analyze later

Page 11: How to build a successful Data Lake

Right Data: Save Raw Data Now to Analyze Later

• Don’t know now what data will be needed later

• Save as much data as possible now to analyze later

• Save raw data, so it can be treated correctly for each use case

Page 12: How to build a successful Data Lake

Right Data Challenges: Data Silos and Data Hoarding

• Departments hoard and protect their data and do not share it with the rest of the enterprise

• Frictionless ingestion does not depend on data owners

Page 13: How to build a successful Data Lake

Right Interface: Key to Broad Adoption

• Data marketplace for data self-service

• Providing data at the right level of expertise

Page 14: How to build a successful Data Lake

Providing Data at the Right Level of Expertise

Data scientists: raw data

Business analysts: clean, trusted, prepared data

Page 15: How to build a successful Data Lake

Roadmap to Data Lake Success

Organize the lake

Set up for self-service

Open the lake to the users

Page 16: How to build a successful Data Lake

Organize the Data Lake into Zones

Organize the lake

Page 17: How to build a successful Data Lake

Multi-modal IT – Different Governance Levels for Different Zones

Zones and governance levels:

Raw or Landing: minimal governance; make sure there is no sensitive data

Work: minimal governance; make sure there is no sensitive data

Gold or Curated: heavy governance; trusted, curated data; lineage, data quality

Sensitive: heavy governance; restricted access

Users: Data Stewards, Data Scientists, Data Engineers, Business Analysts
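The zone layout can be written down as a small policy table. The Python sketch below is illustrative only: the `Zone` class, its flags, and the `choose_zone` routing rule are hypothetical, and the assumption that sensitive data is diverted to the Sensitive zone (rather than rejected or de-identified in place) is one reading of "make sure there is no sensitive data". The slide does not say whether the Gold zone may hold sensitive data; the sketch assumes it may not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    governance: str          # "minimal" or "heavy"
    allows_sensitive: bool   # may sensitive data live here? (assumption for Gold)
    restricted_access: bool  # is access limited to approved users?


# Hypothetical encoding of the four zones named on the slide.
ZONES = {
    "raw": Zone("Raw or Landing", "minimal", False, False),
    "work": Zone("Work", "minimal", False, False),
    "gold": Zone("Gold or Curated", "heavy", False, False),
    "sensitive": Zone("Sensitive", "heavy", True, True),
}


def choose_zone(requested: str, contains_sensitive: bool) -> Zone:
    """Return the zone a data set should land in, diverting sensitive data."""
    zone = ZONES[requested]
    if contains_sensitive and not zone.allows_sensitive:
        return ZONES["sensitive"]
    return zone
```

For example, `choose_zone("raw", contains_sensitive=True)` returns the Sensitive zone, while a clean data set requested for `"raw"` stays in Raw or Landing under minimal governance.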

Page 18: How to build a successful Data Lake

Business Analyst Self-Service Workflow

Find and Understand

Provision

Prep

Analyze

Set up for self-service

Page 19: How to build a successful Data Lake

Finding, understanding and governing data in a data lake is like shopping at a flea market

“We have 100 million fields of data – how can anyone find or trust anything?” – Telco Executive

Page 20: How to build a successful Data Lake

Botond Horvath / Shutterstock.com

DATA SCIENTIST / BUSINESS ANALYST

Need data to use with self-service tools but can’t explore everything manually to find and understand data

DATA STEWARD

Can’t govern and trust data (unknown metadata, data quality, PII, data lineage)

BIG DATA ARCHITECT

Can’t catalog all the data manually and keep up with data provisioning

Page 21: How to build a successful Data Lake

Instead, Imagine Shopping on Amazon.com

Catalog

Find, Understand and Collaborate

Provision

Page 22: How to build a successful Data Lake

Waterline Data is like Amazon for Data in Hadoop

Catalog

Find, Understand and Collaborate

Provision

Page 23: How to build a successful Data Lake

Finding and Understanding Data

• Crowdsource metadata and automate creation of a catalog

• Institutionalize tribal data knowledge

• Automate discovery to cover all data sets

• Establish trust

  • Curated annotated data sets

  • Lineage

  • Data quality

  • Governance

Find and Understand
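A toy Python sketch of how automated discovery and crowdsourced metadata might feed one catalog. This is not Waterline Data's implementation: the `RULES` patterns, the 80% threshold, the `Catalog` class, and the field names are all hypothetical, chosen only to show the two sources of tags side by side.

```python
import re
from collections import defaultdict

# Hypothetical detection rules; a real catalog would use far richer profiling.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}$"),
}


def discover_tags(values, threshold=0.8):
    """Suggest a tag when at least `threshold` of the non-empty values match its rule."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return set()
    return {tag for tag, pattern in RULES.items()
            if sum(bool(pattern.match(v)) for v in non_empty) / len(non_empty) >= threshold}


class Catalog:
    def __init__(self):
        # field name -> tag -> set of sources ("auto" or a user name)
        self.tags = defaultdict(lambda: defaultdict(set))

    def profile(self, field, values):
        """Automated discovery: record every tag the rules suggest."""
        for tag in discover_tags(values):
            self.tags[field][tag].add("auto")

    def annotate(self, field, tag, user):
        """Crowdsourcing: a person records what they know about a field."""
        self.tags[field][tag].add(user)

    def search(self, tag):
        """Return the fields that carry a tag, from either source."""
        return sorted(f for f, tags in self.tags.items() if tag in tags)


catalog = Catalog()
catalog.profile("cust.contact", ["a@x.com", "b@y.org", ""])
catalog.profile("cust.postal", ["94040", "10001", "n/a"])   # only 2 of 3 match: no auto tag
catalog.annotate("cust.postal", "us_zip", "analyst_1")      # tribal knowledge fills the gap
```

Keeping the source of each tag lets a steward later distinguish machine suggestions from human annotations when curating the catalog.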

Page 24: How to build a successful Data Lake

Accessing and Provisioning Data

You cannot give all access to all users

You must protect PII data and sensitive business information

Agile/Self-service approach

• Create a metadata-only catalog

• When users request access, data is de-identified and provisioned

Top-down approach

• Find and de-identify all sensitive data

• Provide access to every user for every dataset as needed

Provision
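The agile, self-service approach can be sketched in a few lines of Python. Everything here is hypothetical: the `DATASET` structure, the field names, and the choice of a salted SHA-256 digest as the de-identification step. A salted hash is only one simple option and is closer to pseudonymization than to full anonymization.

```python
import hashlib

# Hypothetical data set; in the agile approach only its metadata is browsable.
DATASET = {
    "name": "claims_2016",
    "sensitive_fields": {"ssn", "email"},
    "rows": [
        {"ssn": "123-45-6789", "email": "pat@example.com", "amount": 120.0},
        {"ssn": "987-65-4321", "email": "sam@example.com", "amount": 75.5},
    ],
}


def catalog_entry(dataset):
    """Metadata only: field names and sensitivity flags, never the values."""
    fields = sorted(dataset["rows"][0].keys())
    return {"name": dataset["name"],
            "fields": {f: ("sensitive" if f in dataset["sensitive_fields"] else "open")
                       for f in fields}}


def mask(value, salt):
    """Replace a value with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()[:12]


def provision(dataset, salt="demo-salt"):
    """De-identify on request: mask sensitive fields, pass the rest through unchanged."""
    return [{k: (mask(v, salt) if k in dataset["sensitive_fields"] else v)
             for k, v in row.items()}
            for row in dataset["rows"]]


entry = catalog_entry(DATASET)   # what every user can browse
granted = provision(DATASET)     # what a user receives after an approved request
```

The work of de-identifying is deferred until someone actually asks for a data set, which is the contrast with the top-down approach of finding and de-identifying all sensitive data first.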

Page 25: How to build a successful Data Lake

Provide a Self-Service Interface to Find, Understand, and Provision Data

Page 26: How to build a successful Data Lake

Prepare Data for Analytics

Prep

Clean data

• Remove or fix bad data, fill in missing values, convert to common units of measure

Shape data

• Combine (join, concatenate)

• Resolve entities (create a single customer record from multiple records or sources)

• Transform (aggregate, bucketize, filter, convert codes to names, etc.)

Blend data

• Harmonize data from multiple sources to a common schema or model

Tooling

• Many great dedicated data wrangling tools on the horizon

• Some capabilities in BI and data visualization tools

• SQL and scripting languages for the more technical analysts
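For the scripting route, the clean, shape, and blend steps might look like the plain-Python sketch below. The two sources, their field names, and the helper functions are hypothetical. Matching customers by lower-cased email is a deliberately naive form of entity resolution, and filling a missing weight with 0.0 is only a placeholder policy.

```python
# Hypothetical records from two sources with different field names and units.
CRM = [
    {"cust_email": "Ann@Example.com", "weight_lb": 10.0},
    {"cust_email": "bob@example.com", "weight_lb": None},
]
SHOP = [
    {"email": "ann@example.com", "weight_kg": 2.0},
]

LB_TO_KG = 0.45359237


def harmonize_crm(row):
    """Blend: map a CRM row onto the common schema (email, weight_kg)."""
    lb = row["weight_lb"]
    return {"email": row["cust_email"].strip().lower(),
            "weight_kg": None if lb is None else lb * LB_TO_KG}  # common unit of measure


def harmonize_shop(row):
    """Blend: map a shop row onto the same common schema."""
    return {"email": row["email"].strip().lower(), "weight_kg": row["weight_kg"]}


def fill_missing(rows, field, default):
    """Clean: fill in missing values with a default."""
    return [{**r, field: default if r[field] is None else r[field]} for r in rows]


def resolve_and_aggregate(rows):
    """Shape: treat rows with the same email as one customer and sum their weights."""
    totals = {}
    for r in rows:
        totals[r["email"]] = totals.get(r["email"], 0.0) + r["weight_kg"]
    return totals


# Combine (concatenate) the harmonized sources, then clean and shape.
combined = [harmonize_crm(r) for r in CRM] + [harmonize_shop(r) for r in SHOP]
cleaned = fill_missing(combined, "weight_kg", 0.0)
per_customer = resolve_and_aggregate(cleaned)
```

Dedicated wrangling tools wrap the same operations in an interactive interface, which is what makes them accessible to less technical analysts.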

Page 27: How to build a successful Data Lake

Data Analysis

• Many wonderful self-service BI and data visualization tools

• Mature space with many established and innovative vendors

Magic Quadrant for Business Intelligence and Analytics Platforms, 04 February 2016 | ID: G00275847 | Analyst(s): Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich

Analyze

Page 28: How to build a successful Data Lake

Unlock the Value of the Data Lake with the Waterline Data Smart Data Catalog

Time To Value Tribal Knowledge Sharing Trust

Page 29: How to build a successful Data Lake

Waterline Data Is The Only Smart Data Catalog For The Data Lake

“Use an Information Catalog to Maximize Business Value From Information Assets”

“automatically identify, profile, and metatag files in HDFS and make them available for analysis and exploration”

“tapped into an important and underserved opportunity”

“comprehensive big data governance and discovery platform”

“opens the data to a wider variety of people”

“fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data”

Page 30: How to build a successful Data Lake

Current Customers

Healthcare

Insurance

Life Sciences

Aerospace

Automotive

Banking

Government

Marketing

“Opening up a data lake for self-service analytics requires a data catalog that's smart enough to automatically catalog every field of data so business analysts can maximize time to value” -- Jerry Megaro, Global Head Of Data Analytics, Merck KGaA

“Understanding where your data came from and what it means in context is vital to making a data lake initiative successful and not just another data quagmire – the catalog plays a critical component in this” -- Global Head of Data Governance, Risk, and Standard, International Multi-Line Insurer

“A governed yet agile data catalog is key to open up the data lake to business people” -- Paolo Arvati, Big Data, CSI-Piemonte

Page 31: How to build a successful Data Lake

We Run Natively On Hadoop And Integrate With Existing Tools

Page 32: How to build a successful Data Lake

Workflow of Enabling Self-Service Analytics With Hortonworks

[Diagram components]

Smart Data Discovery: Profiling, Sensitive Data & Data Lineage Discovery, Automated Tagging

Data Stewardship: Curate Tags

Self-Service Data Catalog: Find, Collaborate and Take Action

Hortonworks Atlas and Ranger

Data Prep

Analytics & Visualization

[Diagram flows]

Metadata, Tags, Data Lineage

Metadata, Tags, Roles & Access Control

Roles & Access Control

Page 33: How to build a successful Data Lake

A Successful Data Lake

Right Platform + Right Data + Right Interface

Page 34: How to build a successful Data Lake

Come to Booth 303 to see a demo and talk to us about your data lake

Come to the Atlas session at 4:00 PM on Thursday in room 210C

Page 35: How to build a successful Data Lake

Waterline Data

The Smart Data Catalog Company