Hadoop Interview Questions Part 2
ACADGILDACADGILD
Before going through Hadoop Interview Questions Part 2, we recommend that readers go through our previous post, Hadoop Interview Questions Part 1.
In this second part of the Hadoop interview questions, we discuss various questions
related to the Big Data Hadoop ecosystem.
Most questions come with a relevant post that you can refer to for the practical
implementation.
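1. Can you join or transform tables/columns when importing using Sqoop?
Yes. Besides importing a whole table, Sqoop can import the result of a free-form SQL query passed
with the --query option, so joins and column transformations can be done as part of the import.
The query must contain the $CONDITIONS placeholder, and the destination must be given with --target-dir.
For more information, please refer to this post – Beginners Guide to Sqoop.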
2. What is the importance of indexing in Hive and how does this relate to Partition and
Bucketing?
The goal of Hive indexing is to improve the speed of query lookup on certain columns of a
table. Without an index, a query with a predicate like 'WHERE tab1.col1 = 10' loads the entire
table or partition and processes all the rows. However, if an index exists for col1, then only a
portion of the file needs to be loaded and processed.
Indexes become even more important as tables grow extremely large, and Hive is built for
large tables. We can index tables that are partitioned or bucketed. Note that index support was
removed in Hive 3.0, so this applies to earlier versions.
Bucketing:
Bucketing is usually used for join operations, as you can optimize joins by bucketing records by a
specific 'key' or 'id'. Records with the same 'key' always land in the same bucket, so a join only has to
compare matching buckets and runs faster. More generally, bucketing is a technique for
decomposing data sets into more manageable parts.
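The idea that equal keys always land in the same bucket can be sketched in a few lines of plain Java. This is only an illustration of hash-based bucketing, not Hive's own code: the class name BucketingSketch, the helper bucketFor and the sample ids are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class BucketingSketch {
    // Hypothetical helper: maps a key to one of numBuckets buckets.
    // Masking the sign bit keeps the result non-negative for any hash code.
    static int bucketFor(Object key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        // Sample ids (made up): each id goes to the bucket chosen by its hash.
        int[] ids = {1, 2, 5, 6, 9, 12};
        for (int id : ids) {
            buckets.get(bucketFor(id, numBuckets)).add(id);
        }
        for (int i = 0; i < numBuckets; i++) {
            System.out.println("bucket " + i + " -> " + buckets.get(i));
        }
    }
}
```

If two tables are bucketed on the join key with the same number of buckets, a join only needs to compare bucket i of one table with bucket i of the other.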
Partitioning:
When the data in a Hive table needs to be split into multiple sections, partitioning the table is
highly recommended.
The rows of the dataset are segregated by the value of the partition column(s), and each group is stored in
its own partition. When we write a query that filters on the partition column, only the required partitions of the
table are read, which reduces the time taken by the query to yield the result.
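The effect of reading only the required partition can be illustrated with a small in-memory sketch. It assumes a made-up table of (name, country) rows partitioned by country; the class PartitionPruningSketch and its methods are hypothetical and only mimic what Hive does with one directory per partition value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionPruningSketch {
    // Groups rows of the form {name, country} by country.
    // Each map entry stands in for one partition directory.
    static Map<String, List<String>> partitionByCountry(List<String[]> rows) {
        Map<String, List<String>> partitions = new TreeMap<>();
        for (String[] row : rows) {
            partitions.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
        }
        return partitions;
    }

    // A query filtered on the partition column touches one partition only.
    static List<String> scan(Map<String, List<String>> partitions, String country) {
        return partitions.getOrDefault(country, Collections.emptyList());
    }

    // Made-up sample data for the illustration.
    static List<String[]> sampleRows() {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Asha", "IN"});
        rows.add(new String[] {"Bob", "US"});
        rows.add(new String[] {"Chen", "IN"});
        return rows;
    }

    public static void main(String[] args) {
        Map<String, List<String>> partitions = partitionByCountry(sampleRows());
        System.out.println("partitions: " + partitions.keySet());
        System.out.println("country = IN: " + scan(partitions, "IN"));
    }
}
```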
3. How many types of joins are present in Hadoop and when to use them?
In Hadoop, there are two types of joins: the map side join and the reduce side join.
Map Side Join:
Joining datasets in the map phase is called a map side join. It is preferred when you need
to join one larger dataset with one smaller dataset. Map side joins are faster because the join is
done in memory and no shuffle is needed. They rely on a facility called the Distributed Cache, through which
the smaller dataset is copied to every node that runs a map task. The smaller dataset must therefore be
small enough to fit in the memory available to each map task.
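The core of a map side join is a hash lookup against the small dataset held in memory. The sketch below shows that lookup in plain Java, without the Hadoop API; the class MapSideJoinSketch, the department and employee data, and the method names are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideJoinSketch {
    // departments: the small dataset, fully loaded in memory (deptId -> name).
    // employees: the large dataset, streamed record by record as {name, deptId}.
    static List<String> join(Map<Integer, String> departments, List<String[]> employees) {
        List<String> joined = new ArrayList<>();
        for (String[] employee : employees) {
            String department = departments.get(Integer.valueOf(employee[1]));
            if (department != null) { // inner join: drop unmatched records
                joined.add(employee[0] + "," + department);
            }
        }
        return joined;
    }

    // Made-up sample data for the illustration.
    static Map<Integer, String> sampleDepartments() {
        Map<Integer, String> departments = new HashMap<>();
        departments.put(10, "Sales");
        departments.put(20, "HR");
        return departments;
    }

    static List<String[]> sampleEmployees() {
        List<String[]> employees = new ArrayList<>();
        employees.add(new String[] {"Asha", "10"});
        employees.add(new String[] {"Bob", "20"});
        employees.add(new String[] {"Chen", "30"});
        return employees;
    }

    public static void main(String[] args) {
        System.out.println(join(sampleDepartments(), sampleEmployees()));
    }
}
```

In a real job, the in-memory map would typically be filled once per map task from the cached file, and the lookup would happen inside map().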
Reduce Side Join:
Joining datasets in the reduce phase is called a reduce side join. When both datasets are large, we
use a reduce side join. Reduce side joins are less efficient than map side joins because both datasets have to go
through the sort and shuffle phase.
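A reduce side join can be pictured as three steps: tag each record with its source, group the tagged records by join key (the job of the shuffle), and pair them up per key. The following plain-Java sketch simulates those steps in memory; the class ReduceSideJoinSketch, the tags "C" and "O", and the customer/order data are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {
    // customers: {id, name}; orders: {customerId, item}.
    static List<String> join(List<String[]> customers, List<String[]> orders) {
        // "Map" and "shuffle": tag every record and group it by join key.
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (String[] c : customers) {
            grouped.computeIfAbsent(c[0], k -> new ArrayList<>()).add(new String[] {"C", c[1]});
        }
        for (String[] o : orders) {
            grouped.computeIfAbsent(o[0], k -> new ArrayList<>()).add(new String[] {"O", o[1]});
        }
        // "Reduce": for each key, pair every customer record with every order record.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> entry : grouped.entrySet()) {
            List<String> names = new ArrayList<>();
            List<String> items = new ArrayList<>();
            for (String[] tagged : entry.getValue()) {
                if (tagged[0].equals("C")) {
                    names.add(tagged[1]);
                } else {
                    items.add(tagged[1]);
                }
            }
            for (String name : names) {
                for (String item : items) {
                    joined.add(entry.getKey() + "," + name + "," + item);
                }
            }
        }
        return joined;
    }

    // Made-up sample data for the illustration.
    static List<String[]> sampleCustomers() {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "Asha"});
        rows.add(new String[] {"2", "Bob"});
        return rows;
    }

    static List<String[]> sampleOrders() {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "book"});
        rows.add(new String[] {"1", "pen"});
        rows.add(new String[] {"3", "lamp"});
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(join(sampleCustomers(), sampleOrders()));
    }
}
```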
4. How can you optimize Hive queries?
Refer to the blog post linked below for tips on optimizing your Hive queries:
Optimizing Hive Queries
5. What are combiners in Hadoop?
A combiner class summarizes the map output records that share the same key, and the output (a key-value
collection) of the combiner is sent over the network to the actual reducer task as input. This
helps to cut down the amount of data shuffled between the mappers and the reducers.
For more information, please refer to this post – Combiners in Hadoop.
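The saving a combiner brings is easiest to see with a word count. The plain-Java sketch below is not Hadoop code; it assumes each map output record is a (word, 1) pair and uses a hypothetical class CombinerSketch to show how local summing shrinks the number of records that would be shuffled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Each element of mapOutputKeys stands for one (word, 1) record emitted by a mapper.
    // The combiner sums the counts per word on the mapper's machine.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : mapOutputKeys) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("hive", "pig", "hive", "hive", "pig");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records before combining: " + mapOutput.size());
        System.out.println("records after combining:  " + combined.size());
        System.out.println(combined);
    }
}
```

Summing works here because addition is associative and commutative; a combiner should only be used when applying it zero, one or several times does not change the final result.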
6. What is the difference between a combiner and an in-mapper combiner in
Hadoop?
You are probably already aware that a combiner is a process that runs locally on each mapper machine
to pre-aggregate data before it is shuffled across the network to the various reducers in the cluster.
The in-mapper combiner takes this optimization a bit further: the aggregations are not even written
to local disk. They occur in memory, in the mapper itself.
The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods in
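org.apache.hadoop.mapreduce.Mapper. The partial aggregates are held in a map that is created in setup(),
updated on every map() call, and written out once in cleanup().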
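The setup(), map() and cleanup() life cycle described above can be sketched without the Hadoop API. The class below only imitates a mapper: the method names mirror those of Mapper, but the class InMapperCombinerSketch, the emitted list (standing in for context.write) and the run helper are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InMapperCombinerSketch {
    private Map<String, Integer> counts;
    // Stand-in for context.write(): collects what the mapper would emit.
    final List<String> emitted = new ArrayList<>();

    // Called once before the first record: create the in-memory aggregate.
    void setup() {
        counts = new HashMap<>();
    }

    // Called once per input line: update the aggregate, emit nothing yet.
    void map(String line) {
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Called once after the last record: emit one record per distinct word.
    void cleanup() {
        for (Map.Entry<String, Integer> entry : new TreeMap<>(counts).entrySet()) {
            emitted.add(entry.getKey() + "\t" + entry.getValue());
        }
    }

    // Hypothetical driver that plays the role of the framework.
    static List<String> run(List<String> lines) {
        InMapperCombinerSketch mapper = new InMapperCombinerSketch();
        mapper.setup();
        for (String line : lines) {
            mapper.map(line);
        }
        mapper.cleanup();
        return mapper.emitted;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("hive pig hive");
        lines.add("pig hive");
        System.out.println(run(lines));
    }
}
```

The trade-off is memory: the map of partial aggregates lives in the mapper's heap, so a real implementation may need to flush it when it grows too large.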
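7. Let’s consider this scenario: if I have a folder consisting of n files
(datasets) and I want to apply the same mapper and reducer logic to all of them, what should
I do?
Passing the folder as the input path of the job is enough to run the same mapper and reducer logic over
every file in it. The default TextInputFormat hands each line of a file to the mapper as one record. If you
instead want the whole file to arrive as a single record, use a WholeFileInputFormat. It is not shipped with
Hadoop; it is a custom FileInputFormat subclass that marks files as non-splittable and returns the entire
file content as the input to the mapper.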
8. Suppose you have 50 mappers and 1 reducer. How will your cluster
perform? And if the job takes a lot of time, how can you reduce it?
With 50 mappers and 1 reducer, the whole program will take a long time to run,
because the single reducer has to collect the output of every mapper before it can process it. So,
there are two things we can do:
A. If possible, add a combiner, so that the amount of output coming from each mapper, and with it
the load on the reducer, is reduced.
B. Enable map output compression (set mapreduce.map.output.compress to true), so that less data is sent to the reducer.
9. Explain some string functions in Hive
String functions perform operations on columns of the string data type. Some of the string functions are as
follows:
ascii(string A) – Returns the ASCII value of the first character of the string.
concat(string A, string B, ...) – Returns the strings passed as arguments, concatenated in order.
substr(string A, int start) – Returns the substring from the given start position (counted from 1) until the end.
upper(string A) – Returns the string converted to upper case.
lower(string A) – Returns the string converted to lower case.
trim(string A) – Returns the string with the spaces trimmed from both ends.
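For readers who think in Java, the behaviour of these functions can be mirrored with ordinary String methods. The sketch below is only an analogy: the class HiveStringFunctionsSketch and its methods are hypothetical, they are not Hive's implementation, and edge cases (null inputs, negative start positions) are left out.

```java
class HiveStringFunctionsSketch {
    // ascii: numeric value of the first character (0 is assumed here for an empty string).
    static int ascii(String s) {
        return s.isEmpty() ? 0 : s.charAt(0);
    }

    // concat: joins the arguments in order.
    static String concat(String... parts) {
        return String.join("", parts);
    }

    // substr: start is counted from 1, unlike Java's 0-based substring.
    // Only positive start positions are handled in this sketch.
    static String substr(String s, int start) {
        return s.substring(start - 1);
    }

    static String upper(String s) {
        return s.toUpperCase();
    }

    static String lower(String s) {
        return s.toLowerCase();
    }

    // trim: removes space characters from both ends.
    static String trim(String s) {
        return s.replaceAll("^ +| +$", "");
    }

    public static void main(String[] args) {
        System.out.println(ascii("Hive"));
        System.out.println(concat("Had", "oop"));
        System.out.println(substr("Hadoop", 4));
        System.out.println(upper("hive") + " " + lower("HIVE"));
        System.out.println("[" + trim("  hive ") + "]");
    }
}
```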
10. Can you create a table in Hive that skips the header lines of the
dataset?
Yes, we can include the skip.header.line.count property inside the tblproperties while creating the
table.
For example:
CREATE TABLE Employee (Emp_Number INT, Emp_Name STRING, Emp_sal INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="1");
11. What are the binary storage formats available in Hive?
By default, Hive stores tables as plain text files (TextFile), which is not a binary format. The binary
storage formats it supports include Sequence Files, Avro data files, RCFiles, ORC files, and Parquet files.
You can refer to the posts below to learn more about these file formats.
File Formats in Apache Hive
Introduction to Avro in Hive
Parquet File Format in Hadoop
12. Can you use multiple Hive instances at the same time? If yes, how can you do
that?
By default, Hive comes with an embedded Derby database as its metastore. Embedded Derby accepts only
one connection at a time, so you cannot use multiple Hive instances with it. However, if you configure
the Hive metastore to use MySQL, then you can use multiple Hive instances at the same time.
You can refer to the post – MySQL Metastore Integration with Hive, to learn how to configure the Hive
metastore with MySQL.
13. Is there any testing available in Pig? If yes, how can you do it?
Yes, we can do unit testing for Pig scripts, for example with the PigUnit framework.
You can refer to the post – Unit testing Pig scripts, to learn how Pig scripts can be tested.
14. Can you run Pig scripts using Java? If yes, how can you do it?
Yes, it is possible to embed Pig scripts inside Java code.
You can refer to the post – Embedding Pig in Java, to learn how Pig scripts can be run using Java.
15. Can you automate a Flume job by running it up to a stipulated time? If yes,
how can you do that?
A Flume job can be run for a stipulated time using a Java program. For this, Flume provides an
Application class that can be started from Java code.
The following code runs the Flume configuration file using a Java program:
import org.apache.flume.node.Application;
import org.apache.log4j.BasicConfigurator;

public class flume {
    public static void main(String[] args) {
        // Set the property before the agent starts.
        System.setProperty("hadoop.home.dir", "/");
        BasicConfigurator.configure();
        // Agent name and configuration file, as on the command line.
        String[] args1 = new String[] { "agent", "-nTwitterAgent", "-fflume.conf" };
        Application.main(args1);
    }
}
We can automate this program by keeping the code inside a thread and stopping that thread after the
stipulated time.
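One way to keep a job inside a thread and stop it after a fixed time is an ExecutorService with a timed wait. The sketch below uses only the Java standard library; the class TimedRunSketch, the runFor helper and the fakeAgent task are assumptions made for the illustration, with fakeAgent standing in for the call that starts Flume. Stopping a real Flume agent may need more than an interrupt (for example, shutting down the agent or exiting the JVM), and that part is not shown here.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedRunSketch {
    // Runs the job in its own thread and waits at most 'limit' for it.
    // Returns true if the job had to be stopped, false if it finished by itself.
    static boolean runFor(Runnable job, long limit, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> handle = executor.submit(job);
        try {
            handle.get(limit, unit);
            return false;
        } catch (TimeoutException e) {
            handle.cancel(true); // interrupts the worker thread
            return true;
        } catch (InterruptedException e) {
            handle.cancel(true);
            Thread.currentThread().interrupt();
            return true;
        } catch (ExecutionException e) {
            throw new IllegalStateException("job failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Placeholder for the long-running job; it loops until interrupted.
        Runnable fakeAgent = () -> {
            try {
                while (true) {
                    Thread.sleep(50);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        boolean stopped = runFor(fakeAgent, 200, TimeUnit.MILLISECONDS);
        System.out.println("stopped by time limit: " + stopped);
    }
}
```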
We hope this post has been useful in helping you prepare for that big interview. If you have any queries,
feel free to comment below and we will get back to you at the earliest.
Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.