Cloudera certificate


1. Infrastructure Objectives: Recognize and identify Apache Hadoop daemons and how they function both in data storage and processing.

Hadoop comprises five separate daemons. Each of these daemons runs in its own JVM.

The following 3 daemons run on Master Nodes:
NameNode - Stores and maintains the metadata for HDFS.
Secondary NameNode - Performs housekeeping functions for the NameNode.
JobTracker - Manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker.

The following 2 daemons run on each Slave node:
DataNode - Stores actual HDFS data blocks.
TaskTracker - Responsible for instantiating and monitoring individual Map and Reduce tasks.

Name node: Refer: http://www.aosabook.org/en/hdfs.html#footnote-1

Job Tracker: JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster (this applies to the Hadoop 1.x versions; HA configurations allow multiple job trackers). JobTracker runs in its own JVM process. In a typical production cluster it runs on a separate machine. Each slave node is configured with the job tracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.

JobTracker in Hadoop performs the following actions:
Client applications submit jobs to the JobTracker.
The JobTracker talks to the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored; if they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.

The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if that is not possible, it looks for an empty slot on a machine in the same rack.

Task Tracker: A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node. TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the task tracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the task tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

Understand how Apache Hadoop exploits data locality.

Hadoop does its best to run the map task on a node where the input data resides in HDFS, because this doesn't use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task's input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.

Reduce tasks don't have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. Basically, the first TaskTracker that heartbeats to the JobTracker and has enough slots available will get a reduce task.

Identify the role and use of both MapReduce v1 (MRv1) and MapReduce v2 (MRv2 / YARN) daemons.

Analyze the benefits and challenges of the HDFS architecture.

Smaller splits mean better load balancing, since faster machines can process proportionally more splits over the course of the job. On the other hand, if splits are too small, the overhead of managing the splits and map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created.
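
As a hedged sketch (assuming the new MapReduce API and an already-configured Job object), the split size bounds can be influenced with the static helpers on FileInputFormat; the values below are illustrative only and should be derived from the cluster's actual block size:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    // Illustrative only: pin the minimum split size to one 128 MB block and
    // cap it at two blocks, so each map task reads roughly block-sized input.
    public static void configureSplits(Job job) {
        long blockSize = 128L * 1024 * 1024;
        FileInputFormat.setMinInputSplitSize(job, blockSize);
        FileInputFormat.setMaxInputSplitSize(job, 2 * blockSize);
    }
}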

Analyze how HDFS implements file sizes, block sizes, and block abstraction.

HDFS block size is 64 MB (very large compared to other file systems). Also, unlike other file systems, a file smaller than the block size does not occupy a complete block's worth of storage. The block size is kept large so that the time spent on disk seeks is small compared to the time spent transferring the data.

Why block abstraction:
Files can be bigger than individual disks.
Filesystem metadata does not need to be associated with each and every block.
Simplifies storage management - it is easy to figure out the number of blocks which can be stored on each disk.
Fault tolerance and storage replication can be easily done on a per-block basis (storage/HA policies can be run on individual blocks).

The namenode does not store the block locations persistently; they are reconstructed from the data nodes as the system starts.
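
A small sketch of how the block abstraction surfaces in the client API (the file path is hypothetical): FileSystem.getFileBlockLocations() returns, for each block of a file, its offset, length, and the DataNodes currently holding a replica, which is the same information the NameNode rebuilds from DataNode block reports at startup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path; replace with a real HDFS path.
        Path file = new Path("/user/example/data.txt");
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block: offset, length and replica hosts.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}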

Understand default replication values and storage requirements for replication.

To ensure high availability of data, Hadoop replicates the data. When we store files into HDFS, the Hadoop framework splits each file into a set of blocks (64 MB or 128 MB) and these blocks are then replicated across the cluster nodes. The configuration property dfs.replication specifies how many replicas are required. The default value for dfs.replication is 3, but this is configurable depending on your cluster setup. The first replica of a block is stored on the client node if available, or on a random node in the cluster. The next 2 replicas are stored in a different rack than the first replica. The maximum number of replicas per rack is 2.
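
A hedged sketch of controlling replication from a client (the dfs.replication property comes from the text above; the file path is hypothetical): setting dfs.replication in the client Configuration affects files created through that client, while FileSystem.setReplication() changes the replication factor of an existing file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Files created through this client will ask for 3 replicas (the default).
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing (hypothetical) file to 2.
        fs.setReplication(new Path("/user/example/data.txt"), (short) 2);
    }
}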

Determine how HDFS stores, reads, and writes files.

When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. The list is sorted by the network topology distance from the client. The client contacts a DataNode directly and requests the transfer of the desired block.

When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Choice of DataNodes for each block is likely to be different.
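
A minimal client-side sketch of the read and write paths described above (the paths are hypothetical): the client asks the NameNode for metadata through FileSystem, while the actual bytes flow to and from DataNodes behind the returned streams.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write: the client streams bytes through a pipeline of DataNodes
        // chosen by the NameNode for each block.
        try (FSDataOutputStream out = fs.create(new Path("/user/example/out.txt"))) {
            out.writeUTF("hello hdfs");
        }

        // Read: the client asks the NameNode for block locations, then reads
        // the blocks directly from the closest DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/user/example/out.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}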

Identify the role of Apache Hadoop Classes, Interfaces, and Methods. Understand how Hadoop Streaming might apply to a job workflow.

2. Data Management Objectives:
Import a database table into Hive using Sqoop.
Create a table using Hive (during Sqoop import).
Successfully use key and value types to write functional MapReduce jobs.
Given a MapReduce job, determine the lifecycle of a Mapper and the lifecycle of a Reducer.
Analyze and determine the relationship of input keys to output keys in terms of both type and number, the sorting of keys, and the sorting of values.
Given sample input data, identify the number, type, and value of emitted keys and values from the Mappers as well as the emitted data from each Reducer and the number and contents of the output file(s).
Understand implementation and limitations and strategies for joining datasets in MapReduce.
Understand how partitioners and combiners function, and recognize appropriate use cases for each.

Recognize the processes and role of the sort and shuffle process.

MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle.

The Combiner will be invoked after the Partitioner has completed, in the same JVM as the Mapper runs.
Refer: http://www.bigsynapse.com/mapreduce-internals
Refer: http://ercoppa.github.io/HadoopInternals
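
A hedged sketch of wiring a partitioner and combiner into a word-count-style job with Text keys and IntWritable counts; the class names here are hypothetical. The KeyHashPartitioner simply partitions by key hash (which mirrors the default HashPartitioner), and the library IntSumReducer is reused as the combiner, which is safe only because summation is commutative and associative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class PartitionerCombinerExample {

    // Hypothetical partitioner: routes each key to a reduce partition by hash.
    public static class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    public static void configure(Job job) {
        job.setPartitionerClass(KeyHashPartitioner.class);
        // The combiner runs on map output inside the mapper's JVM, after
        // partitioning, to shrink the data that is spilled and shuffled.
        job.setCombinerClass(IntSumReducer.class);
    }
}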

Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default (the size can be tuned by changing the mapreduce.task.io.sort.mb property). When the contents of the buffer reach a certain threshold size (mapreduce.map.sort.spill.percent, which has the default value 0.80, or 80%), a background thread will start to spill the contents to disk.

Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete. Spills are written in round-robin fashion to the directories specified by the mapreduce.cluster.local.dir property, in a job-specific subdirectory.

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record, there could be several spill files.

Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams to merge at once; the default is 10.

It is often a good idea to compress the map output as it is written to disk, because doing so makes it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer. By default, the output is not compressed, but it is easy to enable this by setting mapreduce.map.output.compress to true. The compression library to use is specified by mapreduce.map.output.compress.codec.

The reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapreduce.reduce.shuffle.parallelcopies property.
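
The properties named above can all be set per job through the Configuration. A hedged sketch follows; the values are illustrative rather than recommendations, and DefaultCodec is used only because it needs no native libraries.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;

public class ShuffleTuningExample {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);             // map-side sort buffer size
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // spill threshold
        conf.setInt("mapreduce.task.io.sort.factor", 10);          // streams merged at once
        conf.setBoolean("mapreduce.map.output.compress", true);    // compress map output
        conf.setClass("mapreduce.map.output.compress.codec",
                DefaultCodec.class, CompressionCodec.class);
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5); // reduce-side copier threads
        return conf;
    }
}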

Understand common key and value types in the MapReduce framework and the interfaces they implement. Use key and value types to write functional MapReduce jobs.

3. Job Mechanics Objectives: Construct proper job configuration parameters and the commands used in job submission.

A Job object forms the specification of the job and gives you control over how the job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specifying the name of the JAR file, we can pass a class in the Job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.

Job job = new Job();
job.setJarByClass(MaxTemperature.class);

Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods.

job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);

The setOutputKeyClass() and setOutputValueClass() methods control the output types for the reduce function, and must match what the Reduce class produces. The map output types default to the same types, so they do not need to be set if the mapper produces the same types as the reducer (as it does in our case). However, if they are different, the map output types must be set using the setMapOutputKeyClass() and setMapOutputValueClass() methods.

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

Analyze a MapReduce job and determine how input and output data paths are handled.

Having constructed a Job object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in which case, the input forms all the files in that directory), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths.

FileInputFormat.addInputPath(job, new Path(args[0]));

The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reduce function are written. The directory shouldn't exist before running the job because Hadoop will complain and not run the job. This precaution is to prevent data loss (it can be very annoying to accidentally overwrite the output of a long job with that of another).

FileOutputFormat.setOutputPath(job, new Path(args[1]));
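
Putting the fragments above together, a minimal driver might look like the following sketch; the MaxTemperatureMapper and MaxTemperatureReducer classes are the ones named in the snippets above and are assumed to exist on the classpath.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // The output directory must not exist before the job runs.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish, printing progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}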

Given a sample job, analyze and determine the correct InputFormat and OutputFormat to select based on job requirements. Analyze the order of operations in a MapReduce job. Understand the role of the RecordReader, and of sequence files and compression. Use the distributed cache to distribute data to MapReduce job tasks. Build and orchestrate a workflow with Oozie.

4. Querying Objectives: Write a MapReduce job to implement a HiveQL statement.

A UDF must satisfy the following two properties:
A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
A UDF must implement at least one evaluate() method.
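
A minimal hedged sketch of such a UDF (the class name and behavior are hypothetical; it simply lowercases a string). It satisfies both properties: it extends org.apache.hadoop.hive.ql.exec.UDF and provides an evaluate() method.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF: returns its input lowercased.
public class Lower extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}

After packaging it into a JAR, the function would typically be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before being used in a query.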

Write a MapReduce job to query data stored in HDFS.

There are several notable differences between the two APIs:

The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The old API can still be found in org.apache.hadoop.mapred.

The new API favors abstract classes over interfaces, since these are easier to evolve. This means that you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. For example, the Mapper and Reducer interfaces in the old API are abstract classes in the new API.

The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from the old API.

In both APIs, key-value record pairs are pushed to the mapper and reducer, but in addition, the new API allows both mappers and reducers to control the execution flow by overriding the run() method. For example, records can be processed in batches, or the execution can be terminated before all the records have been processed. In the old API, this is possible for mappers by writing a MapRunnable, but no equivalent exists for reducers.
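
A hedged sketch of this new-API control flow: a hypothetical mapper that overrides run() to stop after a fixed number of records, the kind of early termination the old API could only express for mappers via MapRunnable.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: processes at most the first 1,000 records of its split,
// then terminates early by leaving run() before the input is exhausted.
public class FirstRecordsMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int LIMIT = 1000;

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            int seen = 0;
            while (seen < LIMIT && context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
                seen++;
            }
        } finally {
            cleanup(context);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass records through unchanged; a real mapper would do useful work here.
        context.write(value, key);
    }
}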

Job control is performed through the Job class in the new API, rather than the old JobClient, which no longer exists in the new API.

Configuration has been unified in the new API. The old API has a special JobConf object for job configuration, which is an extension of Hadoop's vanilla Configuration object (used for configuring daemons; see The Configuration API). In the new API, job configuration is done through a Configuration, possibly via some of the helper methods on Job.

Output files are named slightly differently: in the old API both map and reduce outputs are named part-nnnnn, whereas in the new API map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from 00000).

User-overridable methods in the new API are declared to throw java.lang.InterruptedException. This means that you can write your code to be responsive to interrupts so that the framework can gracefully cancel long-running operations if it needs to.

In the new API, the reduce() method passes values as a java.lang.Iterable, rather than a java.lang.Iterator (as the old API does). This change makes it easier to iterate over the values using Java's for-each loop construct.
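
For illustration, a sketch of a new-API reducer in the style of the MaxTemperatureReducer referenced earlier in these notes; because the values arrive as an Iterable, they can be walked directly with a for-each loop.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// New-API reducer: emits the maximum IntWritable value seen for each key.
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        // values is a java.lang.Iterable in the new API, so for-each works directly.
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}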

For the details in mapred-site.xml, refer: https://hadoop.apache.org/docs/r1.0.4/mapred-default.html

HDFS High Availability: Hadoop 2 remedied the NameNode single point of failure by adding support for HDFS high availability (HA). In this implementation, there is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.