Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Upload: dataworks-summithadoop-summit

Post on 16-Apr-2017

Page 1: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Page 2: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Disclaimer

This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.

Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.

This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.

Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.

Page 3: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Speakers

Andrew Ahn, Governance Director, Product Management

Page 4: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Agenda

• Atlas Overview
• Near term roadmap
• Taxonomy Benefits
• Questions

Page 5: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Apache Atlas Overview

Page 6: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

DGI* Community becomes Apache Atlas

• Dec 2014: DGI group kickoff
• Feb 2015: Prototype built
• May 2015: Apache Atlas incubation
• July 2015: HDP 2.3 Foundation GA release

First kickoff to GA in 7 months

Faster & safer: co-development driven by customer use cases (Global Financial Company)

* DGI: Data Governance Initiative

Page 7: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Vision - Enterprise Data Governance Across Platforms

[Diagram labels: Structured, Unstructured, Traditional RDBMS, MPP Appliances, Data Lake, Streaming, Metadata, Projects 1, 3, 4, 5, 6]

Atlas: Metadata Truth in Hadoop

Data Management: along the entire data lifecycle with integrated provenance and lineage capability

Modeling with Metadata: enables comprehensive data lineage through a hybrid approach with enhanced tagging and attribute capabilities

Interoperable Solutions: across the Hadoop ecosystem, through a common metadata store

Page 8: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Apache Atlas: Metadata Services

• Cross-component dataset lineage. Centralized location for all metadata inside HDP

• Single interface point for metadata exchange with platforms outside of HDP

• Business taxonomy based classification: conceptual, logical and technical

[Diagram labels: Apache Atlas; Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi]

Page 9: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Big Data Management Through Metadata

Management Scalability

Many traditional tools and patterns do not scale when applied to multi-tenant data lakes. Many enterprises have siloed data and metadata stores that collide in the data lake. This is compounded by the ability to have very large windows (years). Can traditional EDW tools manage 100 million entities effectively with room to grow?

Metadata Tools

Scalable, decoupled, decentralized management driven through metadata is the only viable solution. This allows quick integration with automation and other metamodels.

Tags for Management, Discovery and Security

Proper metadata is the foundation for business taxonomy, stewardship, attribute-based security and self-service.

Page 10: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Apache Atlas High Level Architecture

[Architecture diagram labels: Type System, Repository, Search DSL, REST API, Graph DB, Search, Bridge, Hive, Storm, Falcon, Others, Kafka, Sqoop, Connectors, Messaging Framework]
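The REST API and Search DSL boxes in the architecture can be illustrated with a small sketch. It only builds the URL for a DSL search and sends nothing. The endpoint path `/api/atlas/v2/search/dsl`, the host `atlas.example.com:21000`, and the sample query are assumptions that are not stated on the slide; the path in particular depends on the Atlas version in use.

```python
from urllib.parse import urlencode


def build_dsl_search_url(base_url, dsl_query, limit=10,
                         search_path="/api/atlas/v2/search/dsl"):
    """Return the GET URL for an Atlas DSL search (no request is sent).

    search_path is an assumed default; older Atlas releases expose the
    DSL search under a different path.
    """
    params = urlencode({"query": dsl_query, "limit": limit})
    return f"{base_url.rstrip('/')}{search_path}?{params}"


# Hypothetical host and query, for illustration only.
url = build_dsl_search_url("http://atlas.example.com:21000",
                           'hive_table where name="drivers"')
print(url)
```

Keeping the path as a parameter means the same helper can target whichever search endpoint a given deployment exposes.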

Page 11: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Taxonomy Benefits:

• Discovery – Business catalog of conceptual, logical and physical assets

• Security – Dynamic metadata-based access control

Page 12: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Near Term Roadmap: Summer 2016

Page 13: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Expanded Native Connector: Dataset Lineage

[Diagram labels: Sqoop, Teradata Connector, Apache Kafka, Custom Activity Reporter, Metadata Repository, RDBMS]

Page 14: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Business Catalog

Breadcrumbs for taxonomy context path

Contents at taxonomy context

Page 15: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Technical and Logical Metadata Exchange

[Diagram labels: Knowledge Store, Atlas REST API, Structured, Unstructured, Files: XML / JSON, 3rd Party Vendors, Custom Reporter, Non-Hadoop Taxonomy, Data Lineage, Technical Metadata]

Page 16: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Governance Ready Certification Program

[Diagram labels: Discovery, Tagging; Prep / Cleanse, ETL; Governance, BPM; Self Service Visualization]

Curated: Selected group of vendor partners to provide rich, complementary and complete features

Choice: Customers choose the features that they want to deploy, a la carte versus vendor lock-in

Agile: Low switching costs, faster deployment and innovation

Standard: Common SLA & common open metadata store

Flexibility: Interoperability of products through Atlas metadata

HDP at core to provide stability and interoperability

Page 17: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Business Taxonomy Inheritance

[Diagram: logical business taxonomy over data assets. Parent: Human Resources, tagged PII. Children: Drivers (Dimension) and Timesheets (Facts), each tagged PII.]
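The inheritance shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Atlas's type system: the `TaxonomyNode` class and its methods are hypothetical names, and the only facts taken from the slide are the node names and the PII tag on the parent.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class TaxonomyNode:
    """One term in a hierarchical business taxonomy (hypothetical model)."""
    name: str
    parent: Optional["TaxonomyNode"] = None
    own_tags: Set[str] = field(default_factory=set)

    def effective_tags(self) -> Set[str]:
        """Tags set directly on this node plus those inherited from ancestors."""
        tags = set(self.own_tags)
        node = self.parent
        while node is not None:
            tags |= node.own_tags
            node = node.parent
        return tags

    def path(self) -> str:
        """Breadcrumb-style context path from the root down to this node."""
        parts = []
        node: Optional["TaxonomyNode"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))


# The PII tag is attached once, on the parent.
hr = TaxonomyNode("Human Resources", own_tags={"PII"})
drivers = TaxonomyNode("Drivers", parent=hr)
timesheets = TaxonomyNode("Timesheets", parent=hr)

print(drivers.path())                       # Human Resources > Drivers
print(sorted(timesheets.effective_tags()))  # ['PII']
```

Because tags are resolved by walking up to the parent, tagging or untagging Human Resources changes the effective tags of every child without touching the children themselves.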

Page 18: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Dynamic Access Policy: Apache Ranger + Atlas Integration

Page 19: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Dynamic Access Policy Driven by metadata

Page 20: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Tag-based Access Policy Requirements

• Basic Tag policy – PII example. Access and entitlements must be tag based ABAC and scalable in implementation.

• Geo-based policy – Policy based on IP address; proxy IP substitution may be required. Rule enforcement must be geo-aware.

• Time-based policy – Timer for data access, de-coupled from deletion of data.

• Prohibitions – Prevention of combination of Hive tables that may pose a risk together.
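The four requirement types above can be combined in one small evaluation routine. The sketch below is illustrative only and is not Ranger's policy model: `TagPolicy`, `Request`, and `is_allowed` are hypothetical names, the request's country is assumed to have been resolved from the IP address already, prohibitions are modeled as tag combinations on the assets touched by one query, and a request that matches no policy is allowed by default.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class TagPolicy:
    """Access rule attached to a tag rather than to a specific asset."""
    tag: str
    allowed_groups: FrozenSet[str]
    allowed_countries: Optional[FrozenSet[str]] = None  # None: no geo limit
    expires_at: Optional[datetime] = None               # None: no time limit


@dataclass(frozen=True)
class Request:
    user_groups: FrozenSet[str]
    country: str                 # assumed to be resolved from the client IP
    when: datetime
    asset_tags: FrozenSet[str]   # tags of every asset the query touches


def is_allowed(request: Request,
               policies: Iterable[TagPolicy],
               prohibited_combinations: Iterable[FrozenSet[str]] = ()) -> bool:
    # Prohibition: deny when a risky combination of tags appears together.
    for combo in prohibited_combinations:
        if combo <= request.asset_tags:
            return False
    for policy in policies:
        if policy.tag not in request.asset_tags:
            continue  # this policy does not apply to the request
        if not (policy.allowed_groups & request.user_groups):
            return False  # basic tag policy: group entitlement
        if (policy.allowed_countries is not None
                and request.country not in policy.allowed_countries):
            return False  # geo-based policy
        if policy.expires_at is not None and request.when >= policy.expires_at:
            return False  # time-based policy: access window has ended
    return True


pii = TagPolicy(tag="PII",
                allowed_groups=frozenset({"hr_admins"}),
                allowed_countries=frozenset({"US"}),
                expires_at=datetime(2017, 1, 1))
request = Request(user_groups=frozenset({"hr_admins"}), country="US",
                  when=datetime(2016, 6, 1), asset_tags=frozenset({"PII"}))
print(is_allowed(request, [pii]))  # True
```

Note that the time-based check only ends access; it never deletes data, which matches the requirement that the timer be decoupled from deletion.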

Page 21: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

How does Atlas work with Ranger at scale?

Atlas provides: Metadata

• Business classification (taxonomy): Company > HR > Driver

• Hierarchy with inheritance of attributes to child objects: the sensitive “PII” tag of department HR will be inherited by group HR > Driver

• Atlas will notify Ranger of changes via a Kafka topic

[Diagram labels: Apache Atlas; Hive, Ranger, Falcon, Kafka, Storm]

Atlas provides the metadata tag to create policies

Ranger provides: Access & Entitlements

• Ranger will cache tags and asset mapping for performance

• Ranger will have a policy based on tags instead of roles.

• Example: PII = <group>. This can work for many assets.

Page 22: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Use cases drive design – high reliability

[Diagram labels]

Atlas: Metastore

• Tags
• Assets
• Entities

Notification Framework: Kafka Topics (notification of metadata updates)

Atlas Client

• Subscribes to topic
• Gets metadata updates

Ranger: PDP, Resource Cache

Message durability

Optimized for speed

Event-driven updates
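The event-driven flow on this slide can be sketched as a subscriber that applies tag-change messages to a local cache. This is a simplified stand-in: an in-process `queue.Queue` plays the role of the Kafka topic, and the JSON message fields (`op`, `asset`, `tag`) and the `TagCache` class are hypothetical, not the actual Atlas notification format or Ranger's cache.

```python
import json
from queue import Empty, Queue


class TagCache:
    """Local asset-to-tags mapping kept current by update events."""

    def __init__(self):
        self._tags_by_asset = {}

    def apply(self, message: str) -> None:
        """Apply one JSON-encoded tag change (hypothetical message format)."""
        event = json.loads(message)
        asset, tag = event["asset"], event["tag"]
        if event["op"] == "ADD":
            self._tags_by_asset.setdefault(asset, set()).add(tag)
        elif event["op"] == "REMOVE":
            self._tags_by_asset.get(asset, set()).discard(tag)

    def tags_for(self, asset: str) -> frozenset:
        return frozenset(self._tags_by_asset.get(asset, ()))


def drain(topic: Queue, cache: TagCache) -> int:
    """Apply every pending message to the cache and return how many were applied."""
    applied = 0
    while True:
        try:
            message = topic.get_nowait()
        except Empty:
            return applied
        cache.apply(message)
        applied += 1


# A plain queue stands in for the Kafka topic in this sketch.
topic = Queue()
topic.put(json.dumps({"op": "ADD", "asset": "hive.hr.drivers", "tag": "PII"}))
topic.put(json.dumps({"op": "ADD", "asset": "hive.hr.timesheets", "tag": "PII"}))
topic.put(json.dumps({"op": "REMOVE", "asset": "hive.hr.timesheets", "tag": "PII"}))

cache = TagCache()
print(drain(topic, cache))                      # 3
print(sorted(cache.tags_for("hive.hr.drivers")))  # ['PII']
```

Access decisions then read only from the local cache, which is why the design can be optimized for speed while the durable topic carries the updates.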

Page 23: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Preview Demo

• Security
• Discovery & Lineage

Page 24: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Availability:

• Tech Preview VMs: May 2016
• GA Release: Summer 2016

Page 25: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Questions?

Page 26: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Reference

Page 27: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Online Resources

VM (download public preview VM): https://s3.amazonaws.com/demo-drops.hortonworks.com/HDP-Atlas-Ranger-TP.ova

Tutorial: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview

Blog: http://hwxjojo.wpengine.com/blog/the-next-generation-of-hadoop-based-security-data-governance/ (this link was returning an error at the time of writing)

Learn More: http://hortonworks.com/solutions/atlas-ranger-integration/

Page 28: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Tag Based Security Video:

https://drive.google.com/file/d/0B0wjjMSH77srLXFZN3lmWHVJWVU/view?usp=sharing

Page 29: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Thank You

Page 30: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

HDF: Dataflow Governance Solution

Page 31: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION

Dataflow Security Use case Requirements

Accelerated Data Collection: An integrated, data source agnostic collection platform

Increased Security and Unprecedented Chain of Custody: Secure from source to storage with high fidelity data provenance

The Internet of Any Thing (IoAT): A Proven Platform for the Internet of Things

http://hortonworks.com/hdf/

Page 32: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Enterprise Grade Governance Dataflow Solution

Event level versus Dataset level

HDF - NiFi: Operation Control
Maximum Fidelity, Event Level, Months Lineage

• Guaranteed Delivery
• Data Buffering
• Prioritized Queueing
• Flow specific QoS
• Visual Command & Control

HDP – Atlas: Governance Management
Medium / Low Fidelity, Dataset Level, Years Lineage

• HDP Taxonomy
• Centralized Metadata Repository
• Downstream HDP Impacts
• Cross component lineage
• 3rd Party integration

[Diagram labels: Filtered Metadata; Reference Taxonomy (Tags)]

Page 33: Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies

Expanded visibility throughout the eco-system

Centralized Repository for multiple NiFi Deployments

End to end data lineage

[Diagram labels: HDF, NiFi (×4), Security Appliance (×6), Kafka, ETL, Hive (×2), Hive Hook (Native) (×3), HDP, Atlas Metadata Repository, Data, Metadata]