Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 - Part 2
TRANSCRIPT
Hadoop Admin Best Practices with HDP 2.3
Part-2
We offer INSTRUCTOR-LED training - both Online LIVE & Classroom sessions
Classroom sessions are available in Bangalore & Delhi (NCR)
We are the ONLY Education delivery partners for Mulesoft, Elastic, Pivotal & Lightbend in India
We have delivered more than 5000 trainings and have over 400 courses and a vast pool of over 200 experts to make YOU the EXPERT!
FOLLOW US ON SOCIAL MEDIA TO STAY UPDATED ON THE UPCOMING WEBINARS
Online and Classroom Training on Technology Courses at SpringPeople
Certified Partners
Non-Certified Courses
The Hadoop Ecosystem
The HDP 2.3 Platform Versions
Covered Till Now
1. Use Ambari – Cluster Management Tool
2. More of WebHDFS Access
3. WebHDFS
4. Use More of HDFS Access Control Lists
5. Use HDFS Quotas
6. Understanding of YARN Components
7. Adding, Deleting, or Replacing Worker Nodes
8. Rack Awareness
9. NameNode High Availability
10. ResourceManager High Availability
11. Ambari Metrics System
12. What to Backup?
13 - Setting Appropriate Directory Space Quota
• Best practice is to also set space limits on home directories
• To set a 12TB limit: $ hdfs dfsadmin -setSpaceQuota 12t /user/username
• The quota includes space for replication - it limits the actual raw space used
• Example: if storing 1TB with a replication factor of 3, 3TB of quota is needed
• A quota can be set on any directory
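A sketch of the full quota workflow; the path /user/alice is illustrative, not from the original slides:

```shell
# Set a 12 TB space quota on a (hypothetical) home directory
hdfs dfsadmin -setSpaceQuota 12t /user/alice

# Verify: -count -q reports name quota, remaining name quota,
# space quota, and remaining space quota for the directory
hdfs dfs -count -q /user/alice

# Clear the quota again if it is no longer needed
hdfs dfsadmin -clrSpaceQuota /user/alice
```

Remember that the space quota is checked against raw (replicated) bytes, so a 12TB quota holds only 4TB of data at replication factor 3.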
14 - Configuring Trash
• Enable by setting the time delay for trash checkpoint removal in core-site.xml: fs.trash.interval
• The delay is set in minutes (24 hours would be 1440 minutes)
• Recommendation is to set it to 360 minutes (6 hours)
• Setting the value to 0 disables Trash
• Files deleted programmatically are deleted immediately
• Files can be deleted immediately from the command line using -skipTrash
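A minimal core-site.xml fragment for the recommended 6-hour setting:

```xml
<!-- core-site.xml: deleted files are kept under .Trash and
     permanently removed after 360 minutes (6 hours);
     a value of 0 disables Trash entirely -->
<property>
  <name>fs.trash.interval</name>
  <value>360</value>
</property>
```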
15 - Compression Needs and Tradeoffs
• Compressing data can speed up data-intensive I/O operations
  • MapReduce jobs are almost always I/O bound
• Compressed data can save storage space and speed up data transfers across the network
  • Capital allocation for hardware can go further
• Reduced I/O and network load can result in significant performance improvements
  • MapReduce jobs can finish faster overall
• But CPU utilization and processing time increase during compression and decompression
  • Understanding the tradeoffs is important for the MapReduce pipeline's overall performance
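A common way to apply this tradeoff is to compress intermediate map output with a fast codec such as Snappy; a sketch of the relevant mapred-site.xml properties, assuming the Snappy native libraries are installed on the cluster:

```xml
<!-- mapred-site.xml: compress the intermediate map output.
     Snappy trades a modest CPU cost for less shuffle I/O. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```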
16 - Sqoop Security
• Database authentication: Sqoop needs to authenticate to the RDBMS
• How? Usually with a username/password (Oracle Wallet is the exception)
• Passwords can be hard-coded in scripts (not recommended)
• Passwords are usually stored in plaintext in a file protected only by filesystem permissions
• The Hadoop Credential Management Framework was added in HDP 2.2
  • Not a keystore itself, but a way to interface with keystore backends
  • Passwords can be stored in a keystore rather than in plain text
  • Can help meet "no passwords in plaintext" requirements
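A hedged sketch of using the credential framework with Sqoop; the alias name, keystore path, and connection details are illustrative:

```shell
# Store the database password in an HDFS-backed JCEKS keystore
# (prompts for the password once; nothing is written in plaintext)
hadoop credential create mydb.password.alias \
  -provider jceks://hdfs/user/sqoop/mydb.password.jceks

# Reference the alias instead of passing a plaintext password
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoop_user \
  --password-alias mydb.password.alias \
  --table orders
```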
17 - distcp Configurations
• If distcp runs out of memory before copying:
  • Possible cause: the number of files/directories being copied from the source path(s) is extremely large (e.g. 100,000 paths)
  • Change: increase the client heap size
    - export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"
• Map sizing
  • If -m is not specified, distcp defaults to a maximum of 20 maps
  • Tune the number of maps to:
    - the size of the source and destination clusters
    - the size of the copy
    - the available bandwidth
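Both tunings can be sketched together; the NameNode hostnames and paths are illustrative:

```shell
# Raise the client heap for a copy involving a very large number of paths
export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"

# Copy with an explicit map count (40 instead of the default max of 20)
hadoop distcp -m 40 \
  hdfs://prod-nn:8020/data/events \
  hdfs://backup-nn:8020/data/events
```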
18 - Falcon: Centrally Manages the Data Lifecycle
• Centralized definition & management of pipelines for data ingest, processing, and export
• Supports business continuity and disaster recovery
  • Out-of-the-box policies for data replication and retention
  • End-to-end monitoring of data pipelines
• Addresses basic audit & compliance requirements
  • Visualize data pipeline lineage
  • Track data pipeline audit logs
  • Tag data with business metadata
19 - Running Balancer
• Can be run periodically as a batch job (for example, every 24 hours or weekly)
• Run after new nodes have been added to the cluster
• To run the balancer:
  hdfs balancer [-threshold <threshold>] [-policy <policy>]
• Runs until there are no more blocks to move, or until it has lost contact with the NameNode
• Can be stopped with Ctrl+C
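A minimal invocation, for example after adding new worker nodes; the threshold value is illustrative:

```shell
# Rebalance until every DataNode's utilization is within 5 percentage
# points of the cluster average (the default threshold is 10)
hdfs balancer -threshold 5
```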
20 - HDFS Snapshots
• Create HDFS directory snapshots
• Fast operation - only metadata is affected
• Results in a .snapshot/ directory inside the HDFS directory
• Snapshots are named, or default to a timestamp
• Directories must first be made snapshottable
• Snapshot steps:
  - Allow snapshots on the directory: hdfs dfsadmin -allowSnapshot foo/bar/
  - Create a snapshot for the directory, optionally providing a snapshot name: hdfs dfs -createSnapshot foo/bar/ mysnapshot_today
  - Verify the snapshot: hdfs dfs -ls foo/bar/.snapshot
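Two related commands are often useful alongside the steps above; the second snapshot name here is an illustrative assumption:

```shell
# List what changed between two snapshots of the same directory
hdfs snapshotDiff foo/bar/ mysnapshot_yesterday mysnapshot_today

# Delete a snapshot that is no longer needed
hdfs dfs -deleteSnapshot foo/bar/ mysnapshot_yesterday
```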
21 - HDFS Data – Automate & Restore
• Use Falcon/Oozie to automate backups
  • Falcon utilizes Oozie as a workflow scheduler
  • distcp is an Oozie action - use -update and -prbugp
• Restoring is the reverse of the backup process:
  1. On your backup cluster, choose which snapshot to restore
  2. Remove/move the target directory on the production system
  3. Run distcp without the -update option
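The restore steps can be sketched as shell commands; the cluster hostnames, paths, and snapshot name are illustrative:

```shell
# 1. On the backup cluster, pick a snapshot, e.g. .snapshot/s_nightly
# 2. Move the damaged production directory aside
hdfs dfs -mv /data/events /data/events.broken

# 3. Copy the snapshot contents back without -update, preserving
#    replication, block size, user, group, and permissions (-prbugp)
hadoop distcp -prbugp \
  hdfs://backup-nn:8020/data/events/.snapshot/s_nightly \
  hdfs://prod-nn:8020/data/events
```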
22 - Apache Ranger
Upcoming Hortonworks Classes at SpringPeople
Classroom (Bengaluru): 05 - 08 Sept, 26 - 28 Sept, 10 - 13 Oct, 07 - 10 Nov, 05 - 08 Dec, 19 - 21 Dec
Online LIVE: 22 - 31 Aug, 05 - 17 Sept, 19 Sept - 01 Oct