multi-tenant apache hadoop clusters - bi consulting
TRANSCRIPT
![Page 1: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/1.jpg)
1© Cloudera, Inc. All rights reserved.
Multi-tenant Apache HadoopClustersDániel Schöberle | Cloudera
Designated Support Engineer for Bank of America
![Page 2: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/2.jpg)
2© Cloudera, Inc. All rights reserved.
What is multi-tenancy?
Single tenant Free-for-all Multi-tenancy
![Page 3: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/3.jpg)
3© Cloudera, Inc. All rights reserved.
Why do we need it?
• Optimize resource usage
• Share infrastructure
• Allow different groups access to storage/data
• Support wide audience (developers, analysts, data scientists from different organizational units)
• Allow the little guy access to big resources
![Page 4: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/4.jpg)
4© Cloudera, Inc. All rights reserved.
What should multi-tenancy solve?
• Resource Management / Sharing
• Access control / Security
• Reporting / Operations / Management considerations
![Page 5: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/5.jpg)
5© Cloudera, Inc. All rights reserved.
• Single General Purpose Hadoop Cluster
• Multiple distinct user groups with code & data that need to be separated
• Sharing storage (HDFS) & processing resources (cores & RAM)
• Mixed work loads storage only, batch & interactive processing
• Typically run by an in-house data center team on-premise or in the cloud
What is a multi-tenant Hadoop cluster?
BI Marketing Engineering
![Page 6: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/6.jpg)
6© Cloudera, Inc. All rights reserved.
Resource Isolation & Management
![Page 7: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/7.jpg)
7© Cloudera, Inc. All rights reserved.
YARN scheduler (Dynamic resource allocation)
• Capacity or fair scheduler (Cloudera recommended) should be used to control to cluster resource by YARN applications
• Allocation is dynamic, based on queues
• Resources are divided between queues. If a queue is not allocating any resources, they can be distributed to other queues
• Access to queues can be restricted based on user/group executing the YARN job
• Works with: MapReduce, Spark, Hive, Oozie , …
![Page 8: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/8.jpg)
8© Cloudera, Inc. All rights reserved.
• Applications outside of YARN need to be tamed
• Linux Control Groups (cgroups) allows for per-resource isolation between services and roles
• Services are allocated a static percentage of total resources:
• CPU shares
• I/O weight
• Memory usage
Cgroups (Static resource allocation)
![Page 9: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/9.jpg)
9© Cloudera, Inc. All rights reserved.
Security
![Page 10: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/10.jpg)
10© Cloudera, Inc. All rights reserved.
• We need to know who the users are
• We need to know which groups they belong to
• We need to know what can they access
• And what level of permissions they have
• Kerberos is the only authentication method supported by most components
• LDAP can be used for some components (HiveServer 2 / Impala)
• LDAP allows group management by integrating with Identity Management solutions (AD, Centrify, SSSD)
Authentication / Authorization
![Page 11: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/11.jpg)
11© Cloudera, Inc. All rights reserved.
Apache Sentry I
![Page 12: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/12.jpg)
12© Cloudera, Inc. All rights reserved.
Apache Sentry II
• It’s an authorization service usable by many components
• Familiar SQL syntax, manages permissions, stores them in private database
• Role-based access control /GRANT SELECT ON TABLE data TO Analyst/
• Objects (Hive/Impala) are: server, database, table, column, HDFS URI
• Objects are mapped to HDFS directories for jobs outside of Hive/Impala
• Roles are mapped to groups /GRANT ROLE Analyst TO GROUP finance-managers/
• Permissions SELECT(rx), INSERT(wx) and ALL(rwx) are mapped to POSIX file permissions outside of Hive/Impala
![Page 13: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/13.jpg)
13© Cloudera, Inc. All rights reserved.
Data in-transit
• SSL/TLS needs to be enabled to encrypt data between clients and services’ endpoints (Hive, Hue, …)
• Certificates and key management tasks are usually outside of scope of Hadoop cluster
• Keys and certificates are configured per service/role
Data at-rest (Key Trustee)
• Multiple encryption zones on HDFS allow only authorized users to access the data.
• Data is transmitted in encrypted form as the encryption is on HDFS block level.
• Keys can be stored in Java keystore or HSM
Encryption
![Page 14: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/14.jpg)
14© Cloudera, Inc. All rights reserved.
HDFS considerations
• Organize your data, think namespaces (directory structure and name conventions)
• Make sure nobody uses too much space, enable HDFS quotas
• Unix file permissions are not enough, enable ACLs
• If using Sentry, enable Sentry HDFS sync plug-in
![Page 15: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/15.jpg)
15© Cloudera, Inc. All rights reserved.
Operations / Managing the Cluster
![Page 16: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/16.jpg)
16© Cloudera, Inc. All rights reserved.
Managing the cluster
Shared Nothing Shared Management Shared Resources
![Page 17: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/17.jpg)
17© Cloudera, Inc. All rights reserved.
Reports on user activity
Monitor, monitor, monitor!
• “How much CPU & memory did each tenant use?”
• “I set up fair scheduler. Did each of my tenants get their fair share?”
• “Which tenants had to wait the longest for their applications to get resources?
• “Which tenants asked for the most memory but used the least?”
• “When do I need to add nodes to my cluster?”
![Page 18: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/18.jpg)
18© Cloudera, Inc. All rights reserved.
Cloudera Manager reports
![Page 19: Multi-tenant Apache Hadoop Clusters - BI Consulting](https://reader030.vdocuments.site/reader030/viewer/2022012606/619a9e1d5131c448fb0eb5d3/html5/thumbnails/19.jpg)
19© Cloudera, Inc. All rights reserved.
Start small
• 2-3 tenants
Plan ahead!
• user management
• data governance
Configure Kerberos• http://www.cloudera.com/documentation/enterprise/l
atest/topics/cm_sg_authentication.html
Enable HDFS ACLs:<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>
Enable fair scheduler:• http://www.cloudera.com/documentation/enterprise/l
atest/topics/admin_fair_scheduler.html
Look into Sentry:• http://www.cloudera.com/documentation/enterprise/l
atest/topics/cm_sg_sentry_service.html
How to start?