Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
Speakers
Andrew Ahn, Governance Director, Product Management
Agenda
• Atlas Overview
• Near Term Roadmap
• Taxonomy Benefits
• Questions
Apache Atlas Overview
DGI* Community becomes Apache Atlas
• Dec 2014: DGI group kickoff
• Feb 2015: Prototype built
• May 2015: Apache Atlas incubation
• July 2015: HDP 2.3 Foundation GA release
First kickoff to GA in 7 months
Global Financial Company
Faster & Safer: Co-development driven by customer use cases
* DGI: Data Governance Initiative
Vision - Enterprise Data Governance Across Platforms
STRUCTURED
UNSTRUCTURED
TRADITIONAL RDBMS
METADATA
MPP APPLIANCES
Project 1
Project 5
Project 4
Project 3
Metadata
Project 6
DATA LAKE
STREAMING
Atlas: Metadata Truth in Hadoop
Data Management: along the entire data lifecycle with integrated provenance and lineage capability
Modeling with Metadata: enables comprehensive data lineage through a hybrid approach with enhanced tagging and attribute capabilities
Interoperable Solutions: across the Hadoop ecosystem, through a common metadata store
Apache Atlas: Metadata Services
• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single interface point for metadata exchange with platforms outside of HDP
• Business taxonomy based classification: conceptual, logical and technical
Apache Atlas
Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi
Big Data Management Through Metadata
Management Scalability
Many traditional tools and patterns do not scale when applied to multi-tenant data lakes. Many enterprises have siloed data and metadata stores that collide in the data lake. This is compounded by the ability to retain very large time windows (years) of data. Can traditional EDW tools manage 100 million entities effectively with room to grow?
Metadata Tools
Scalable, decoupled, decentralized management driven through metadata is the only viable solution. This allows quick integration with automation and other metamodels.
Tags for Management, Discovery and Security
Proper metadata is the foundation for business taxonomy, stewardship, attribute based security and self-service.
Apache Atlas High Level Architecture
Type System
Repository
Search DSL
Bridge
Hive, Storm, Falcon, Others
REST API
Graph DB
Search
Kafka
Sqoop
Connectors
Messaging Framework
Taxonomy Benefits:
• Discovery – Business catalog of conceptual, logical and physical assets
• Security – Dynamic, metadata-based access control
Near Term Roadmap: Summer 2016
Sqoop
Teradata Connector
Apache Kafka
Expanded Native Connector: Dataset Lineage
Custom Activity Reporter
Metadata Repository
RDBMS
Business Catalog
Breadcrumbs for taxonomy context path
Contents at taxonomy context
Technical and Logical Metadata Exchange
Knowledge Store
Atlas REST API
Structured / Unstructured
Files: XML / JSON
3rd Party Vendors
Custom Reporter
Non-Hadoop Taxonomy
Data Lineage
Technical Metadata
Governance Ready Certification Program
Discovery / Tagging
Prep / Cleanse / ETL
Governance / BPM
Self Service Visualization
Curated: Selected group of vendor partners to provide rich, complementary and complete features
Choice: Customers choose the features they want to deploy, a la carte rather than vendor lock-in
Agile: Low switching costs, faster deployment and innovation
Standard: Common SLA & common open metadata store
Flexibility: Interoperability of products through Atlas metadata
HDP at core to provide stability and interoperability
Business Taxonomy Inheritance
Human Resources
Drivers (Dimension)
Timesheets (Facts)
PII
PII, PII
Parent
Child, Child
Logical Business Taxonomy
Data Assets
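The inheritance shown on this slide (a parent taxonomy term tagged PII, with child assets picking up the same tag) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: `TaxonomyNode`, `effective_tags` and `path` are hypothetical names invented for this sketch and are not part of the Atlas API.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class TaxonomyNode:
    """One term in a hypothetical business taxonomy (not an Atlas class)."""
    name: str
    parent: Optional["TaxonomyNode"] = None
    own_tags: Set[str] = field(default_factory=set)

    def effective_tags(self) -> Set[str]:
        # Tags applied directly to this node plus everything inherited
        # from its ancestors.
        inherited = self.parent.effective_tags() if self.parent else set()
        return self.own_tags | inherited

    def path(self) -> str:
        # Breadcrumb-style context path from the root down to this node.
        prefix = self.parent.path() + " > " if self.parent else ""
        return prefix + self.name


# Parent term carries the PII tag; the children declare no tags of their own.
hr = TaxonomyNode("Human Resources", own_tags={"PII"})
drivers = TaxonomyNode("Drivers", parent=hr)
timesheets = TaxonomyNode("Timesheets", parent=hr)

print(drivers.path(), sorted(drivers.effective_tags()))
```

Because tags are resolved by walking up the parent chain, tagging the parent once is enough to classify every asset beneath it, which is the management benefit the slide is pointing at.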
Dynamic Access Policy: Apache Ranger + Atlas Integration
Dynamic Access Policy Driven by metadata
Tag-based Access Policy Requirements
• Basic tag policy – PII example. Access and entitlements must be tag-based (ABAC) and scalable in implementation.
• Geo-based policy – Policy based on IP address; proxy IP substitution may be required. Rule enforcement must be geo-aware.
• Time-based policy – Timer for data access, de-coupled from deletion of data.
• Prohibitions – Prevention of combination of Hive tables that may pose a risk together.
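The four requirement types above can be illustrated with a small attribute-based check. This is a minimal sketch, not Ranger's policy engine: `TagPolicy`, `Request` and `is_allowed` are hypothetical names, the requester's country is assumed to have been resolved from the IP address beforehand, and a prohibition is modeled as a set of tags that must not appear together in one request.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy attached to a tag rather than to a role."""
    tag: str
    allowed_groups: FrozenSet[str]
    allowed_countries: Optional[FrozenSet[str]] = None  # None: no geo limit
    expires_at: Optional[datetime] = None               # None: no time limit


@dataclass(frozen=True)
class Request:
    user_groups: FrozenSet[str]
    country: str                 # assumed already derived from the IP address
    asset_tags: FrozenSet[str]   # union of tags on all assets being accessed
    when: datetime


def is_allowed(request: Request,
               policies: Dict[str, TagPolicy],
               prohibited_combos: Iterable[FrozenSet[str]] = ()) -> bool:
    # Prohibition: deny if a risky combination of tags is requested together.
    if any(combo <= request.asset_tags for combo in prohibited_combos):
        return False
    for tag in request.asset_tags:
        policy = policies.get(tag)
        if policy is None:
            continue  # this sketch leaves tags without a policy unrestricted
        # Basic tag policy: the user must belong to an allowed group.
        if not (policy.allowed_groups & request.user_groups):
            return False
        # Geo-based policy.
        if (policy.allowed_countries is not None
                and request.country not in policy.allowed_countries):
            return False
        # Time-based policy: access expires, the data itself is not deleted.
        if policy.expires_at is not None and request.when >= policy.expires_at:
            return False
    return True


policies = {
    "PII": TagPolicy("PII", frozenset({"hr_admins"}),
                     frozenset({"US"}), datetime(2017, 1, 1)),
}
request = Request(frozenset({"hr_admins"}), "US",
                  frozenset({"PII"}), datetime(2016, 6, 1))
print(is_allowed(request, policies))
```

One policy keyed on the tag covers every asset that carries the tag, so the number of policies grows with the number of classifications rather than with the number of tables.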
How does Atlas work with Ranger at scale?
Atlas provides: Metadata
• Business Classification (taxonomy): Company > HR > Driver
• Hierarchy with inheritance of attributes to child objects: the sensitive “PII” tag of department HR will be inherited by group HR > Driver
• Atlas will notify Ranger of changes via a Kafka topic
Apache Atlas
Hive, Ranger, Falcon, Kafka, Storm
Atlas provides the metadata tag to create policies
Ranger provides: Access & Entitlements
• Ranger will cache tags and asset mapping for performance
• Ranger will have a policy based on tags instead of roles.
• Example: PII = <group>. This can work for many assets.
Use cases drive design – high reliability
Metastore
• Tags
• Assets
• Entities
Notification Framework
Kafka Topics
Atlas
Atlas Client
• Subscribes to topic
• Gets metadata updates
PDP
Resource Cache
Ranger
Notification Metadata updates
Message durability
Optimized for Speed
Event driven updates
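The event-driven flow on this slide (a client subscribes to a topic, receives metadata updates, and keeps a local cache current) can be sketched as follows. This is a simplified illustration only: an in-memory `queue.Queue` stands in for the Kafka topic, and `TagCache`, `drain` and the event fields `op`, `asset` and `tag` are assumptions made for the sketch, not the actual Atlas notification format.

```python
import queue
from typing import Dict, Set


class TagCache:
    """Hypothetical local cache mapping an asset name to its tags."""

    def __init__(self) -> None:
        self._tags: Dict[str, Set[str]] = {}

    def apply(self, event: dict) -> None:
        # Each event adds or removes one tag on one asset.
        asset, tag = event["asset"], event["tag"]
        if event["op"] == "add":
            self._tags.setdefault(asset, set()).add(tag)
        elif event["op"] == "remove":
            self._tags.get(asset, set()).discard(tag)

    def tags_for(self, asset: str) -> Set[str]:
        # Return a copy so callers cannot mutate the cache.
        return set(self._tags.get(asset, set()))


def drain(topic: "queue.Queue[dict]", cache: TagCache) -> int:
    """Apply every pending event to the cache; return how many were applied."""
    applied = 0
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            return applied
        cache.apply(event)
        applied += 1


# In-memory stand-in for the notification topic.
topic: "queue.Queue[dict]" = queue.Queue()
topic.put({"op": "add", "asset": "hive.hr.drivers", "tag": "PII"})
topic.put({"op": "add", "asset": "hive.hr.timesheets", "tag": "PII"})
topic.put({"op": "remove", "asset": "hive.hr.timesheets", "tag": "PII"})

cache = TagCache()
applied = drain(topic, cache)
print(applied, cache.tags_for("hive.hr.drivers"))
```

Policy checks then read only from the local cache, so authorization does not wait on the metadata store, and a durable topic lets a restarted client catch up on updates it missed.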
Preview Demo
• Security
• Discovery & Lineage
Availability:
- Tech Preview VMs: May 2016
- GA Release: Summer 2016
Questions?
Reference
Online Resources
VM (public preview download): https://s3.amazonaws.com/demo-drops.hortonworks.com/HDP-Atlas-Ranger-TP.ova
Tutorial: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview
Blog: http://hwxjojo.wpengine.com/blog/the-next-generation-of-hadoop-based-security-data-governance/ (this link is currently returning an error)
Learn More: http://hortonworks.com/solutions/atlas-ranger-integration/
Tag Based Security Video:
https://drive.google.com/file/d/0B0wjjMSH77srLXFZN3lmWHVJWVU/view?usp=sharing
Thank You
HDF: Dataflow Governance Solution
Dataflow Security Use case Requirements
Accelerated Data Collection: An integrated, data source agnostic collection platform
Increased Security and Unprecedented Chain of Custody: Secure from source to storage with high fidelity data provenance
The Internet of Any Thing (IoAT): A Proven Platform for the Internet of Things
http://hortonworks.com/hdf/
Enterprise Grade Governance Dataflow Solution
Filtered Metadata
Reference Taxonomy (Tags)
Event level versus dataset level

HDF - NiFi: Operation Control
• Maximum fidelity
• Event level
• Months of lineage
• Guaranteed delivery
• Data buffering
• Prioritized queueing
• Flow-specific QoS
• Visual command & control

HDP – Atlas: Governance Management
• Medium / low fidelity
• Dataset level
• Years of lineage
• HDP taxonomy
• Centralized metadata repository
• Downstream HDP impacts
• Cross-component lineage
• 3rd-party integration
Expanded visibility throughout the eco-system
[Diagram labels: HDF (multiple NiFi instances, Kafka); HDP (ETL, Hive, native Hive Hooks, Atlas Metadata Repository); multiple Security Appliances; Data and Metadata flows]
Centralized repository for multiple NiFi deployments
End-to-end data lineage