Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 - Part 2
TRANSCRIPT
Hadoop Admin Best Practices with HDP 2.3
Part-2
We offer INSTRUCTOR-LED training - both Online LIVE & Classroom sessions
Classroom sessions are available in Bangalore & Delhi (NCR)
We are the ONLY Education delivery partners for Mulesoft, Elastic, Pivotal & Lightbend in India
We have delivered more than 5000 trainings and have over 400 courses and a vast pool of over 200 experts to make YOU the EXPERT!
FOLLOW US ON SOCIAL MEDIA TO STAY UPDATED ON THE UPCOMING WEBINARS
Online and Classroom Training on Technology Courses at SpringPeople
Certified Partners
Non-Certified Courses
The Hadoop Ecosystem
The HDP 2.3 Platform Versions
Covered Till Now
1. Use Ambari – Cluster Management Tool
2. More of WebHDFS Access
3. WebHDFS
4. Use More of HDFS Access Control Lists
5. Use HDFS Quotas
6. Understanding of YARN Components
7. Adding, Deleting, or Replacing Worker Nodes
8. Rack Awareness
9. NameNode High Availability
10. ResourceManager High Availability
11. Ambari Metrics System
12. What to Backup?
13 - Setting Appropriate Directory Space Quota
• Best practice is to also set space limits on home directories
• To set a 12TB limit: $ hdfs dfsadmin -setSpaceQuota 12t /user/username
• The quota includes space for replication - it limits the actual raw space used
• Example: if storing 1TB with a replication factor of 3, 3TB of quota is needed
• A quota can be set on any directory
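A sketch of the full quota workflow; the path /user/alice is illustrative, not from the original slides:

```shell
# Set a 12 TB space quota on a (hypothetical) home directory
hdfs dfsadmin -setSpaceQuota 12t /user/alice

# Verify: -count -q reports name quota, remaining name quota,
# space quota, and remaining space quota for the directory
hdfs dfs -count -q /user/alice

# Clear the quota again if it is no longer needed
hdfs dfsadmin -clrSpaceQuota /user/alice
```

Remember that the space quota is checked against raw (replicated) bytes, so a 12TB quota holds only 4TB of data at replication factor 3.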
14 - Configuring Trash
• Enable by setting the time delay for trash checkpoint removal in core-site.xml: fs.trash.interval
• The delay is set in minutes (24 hours would be 1440 minutes)
• Recommendation is to set it to 360 minutes (6 hours)
• Setting the value to 0 disables Trash
• Files deleted programmatically are deleted immediately
• Files can be deleted immediately from the command line using -skipTrash
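A minimal core-site.xml fragment for the recommended 6-hour setting:

```xml
<!-- core-site.xml: deleted files are kept under .Trash and
     permanently removed after 360 minutes (6 hours);
     a value of 0 disables Trash entirely -->
<property>
  <name>fs.trash.interval</name>
  <value>360</value>
</property>
```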
15 - Compression Needs and Tradeoffs
• Compressing data can speed up data-intensive I/O operations
  • MapReduce jobs are almost always I/O bound
• Compressed data can save storage space and speed up data transfers across the network
  • Capital allocation for hardware can go further
• Reduced I/O and network load can result in significant performance improvements
  • MapReduce jobs can finish faster overall
• But CPU utilization and processing time increase during compression and decompression
  • Understanding the tradeoffs is important for the MapReduce pipeline's overall performance
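A common way to apply this tradeoff is to compress intermediate map output with a fast codec such as Snappy; a sketch of the relevant mapred-site.xml properties, assuming the Snappy native libraries are installed on the cluster:

```xml
<!-- mapred-site.xml: compress the intermediate map output.
     Snappy trades a modest CPU cost for less shuffle I/O. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```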
16 - Sqoop Security
• Database authentication: Sqoop needs to authenticate to the RDBMS
• How? Usually with a username/password (Oracle Wallet is the exception)
• Passwords can be hard-coded in scripts (not recommended)
• Passwords are usually stored in plaintext in a file protected only by filesystem permissions
• The Hadoop Credential Management Framework was added in HDP 2.2
  • Not a keystore itself, but a way to interface with keystore backends
  • Passwords can be stored in a keystore rather than in plain text
  • Can help meet "no passwords in plaintext" requirements
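A hedged sketch of using the credential framework with Sqoop; the alias name, keystore path, and connection details are illustrative:

```shell
# Store the database password in an HDFS-backed JCEKS keystore
# (prompts for the password once; nothing is written in plaintext)
hadoop credential create mydb.password.alias \
  -provider jceks://hdfs/user/sqoop/mydb.password.jceks

# Reference the alias instead of passing a plaintext password
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoop_user \
  --password-alias mydb.password.alias \
  --table orders
```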
17 - distcp Configurations
• If distcp runs out of memory before copying:
  • Possible cause: the number of files/directories being copied from the source path(s) is extremely large (e.g. 100,000 paths)
  • Change: increase the client heap size
    - export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"
• Map sizing
  • If -m is not specified, distcp defaults to a maximum of 20 maps
  • Tune the number of maps to:
    - the size of the source and destination clusters
    - the size of the copy
    - the available bandwidth
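Both tunings can be sketched together; the NameNode hostnames and paths are illustrative:

```shell
# Raise the client heap for a copy involving a very large number of paths
export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"

# Copy with an explicit map count (40 instead of the default max of 20)
hadoop distcp -m 40 \
  hdfs://prod-nn:8020/data/events \
  hdfs://backup-nn:8020/data/events
```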
18 - Falcon: Centrally Manages the Data Lifecycle
• Centralized definition & management of pipelines for data ingest, processing, and export
• Supports business continuity and disaster recovery
  • Out-of-the-box policies for data replication and retention
  • End-to-end monitoring of data pipelines
• Addresses basic audit & compliance requirements
  • Visualize data pipeline lineage
  • Track data pipeline audit logs
  • Tag data with business metadata
19 - Running Balancer
• Can be run periodically as a batch job (for example, every 24 hours or weekly)
• Run after new nodes have been added to the cluster
• To run the balancer:
  hdfs balancer [-threshold <threshold>] [-policy <policy>]
• Runs until there are no more blocks to move, or until it has lost contact with the NameNode
• Can be stopped with Ctrl+C
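A minimal invocation, for example after adding new worker nodes; the threshold value is illustrative:

```shell
# Rebalance until every DataNode's utilization is within 5 percentage
# points of the cluster average (the default threshold is 10)
hdfs balancer -threshold 5
```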
20 - HDFS Snapshots
• Create HDFS directory snapshots
• Fast operation - only metadata is affected
• Results in a .snapshot/ directory inside the HDFS directory
• Snapshots are named, or default to a timestamp
• Directories must first be made snapshottable
• Snapshot steps:
  - Allow snapshots on the directory: hdfs dfsadmin -allowSnapshot foo/bar/
  - Create a snapshot for the directory, optionally providing a snapshot name: hdfs dfs -createSnapshot foo/bar/ mysnapshot_today
  - Verify the snapshot: hdfs dfs -ls foo/bar/.snapshot
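Two related commands are often useful alongside the steps above; the second snapshot name here is an illustrative assumption:

```shell
# List what changed between two snapshots of the same directory
hdfs snapshotDiff foo/bar/ mysnapshot_yesterday mysnapshot_today

# Delete a snapshot that is no longer needed
hdfs dfs -deleteSnapshot foo/bar/ mysnapshot_yesterday
```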
21 - HDFS Data – Automate & Restore
• Use Falcon/Oozie to automate backups
  • Falcon utilizes Oozie as a workflow scheduler
  • distcp is an Oozie action - use -update and -prbugp
• Restoring is the reverse of the backup process:
  1. On your backup cluster, choose which snapshot to restore
  2. Remove/move the target directory on the production system
  3. Run distcp without the -update option
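The restore steps can be sketched as shell commands; the cluster hostnames, paths, and snapshot name are illustrative:

```shell
# 1. On the backup cluster, pick a snapshot, e.g. .snapshot/s_nightly
# 2. Move the damaged production directory aside
hdfs dfs -mv /data/events /data/events.broken

# 3. Copy the snapshot contents back without -update, preserving
#    replication, block size, user, group, and permissions (-prbugp)
hadoop distcp -prbugp \
  hdfs://backup-nn:8020/data/events/.snapshot/s_nightly \
  hdfs://prod-nn:8020/data/events
```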
22 - Apache Ranger
Upcoming Hortonworks Classes at SpringPeople
Classroom (Bengaluru): 05 - 08 Sept, 26 - 28 Sept, 10 - 13 Oct, 07 - 10 Nov, 05 - 08 Dec, 19 - 21 Dec
Online LIVE: 22 - 31 Aug, 05 - 17 Sept, 19 Sept - 01 Oct