Hadoop Interview Questions Part 2



Before going through this Hadoop interview questions part 2, we recommend our users to go through our previous post on Hadoop Interview Questions Part 1.

In this second part of Hadoop interview questions, we will be discussing various questions related to the Big Data Hadoop ecosystem. We have linked relevant posts with most of the questions, which you can refer to for practical implementation.

1. Can you join or transform tables/columns when importing using Sqoop?

Yes. Sqoop's free-form query import (the --query option) lets you run an arbitrary SQL statement, including joins and column transformations, while importing the data into HDFS.
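For example, a free-form query import with a join might look like the command below. The connection string, credentials, table, and column names are only placeholders; note that --query requires the $CONDITIONS token in the WHERE clause and either --split-by or a single mapper:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --query 'SELECT o.order_id, c.customer_name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.id WHERE $CONDITIONS' \
  --split-by o.order_id \
  --target-dir /user/hadoop/orders_with_customers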

For more information, please refer to this post – Beginners Guide to Sqoop.

2. What is the importance of indexing in Hive and how does this relate to partitioning and bucketing?

The goal of Hive indexing is to improve the speed of query lookups on certain columns of a table. Without an index, queries with predicates like 'WHERE tab1.col1 = 10' load the entire table or partition and process all the rows. However, if an index exists for col1, then only a portion of the file needs to be loaded and processed.

Indexes become even more essential when the tables grow extremely large, and as you now undoubtedly know, Hive thrives on large tables. We can index tables that are partitioned or bucketed.

Bucketing:

Bucketing is usually used for join operations, as you can optimize joins by bucketing records on a specific 'key' or 'id'. This way, when you want to perform a join, records with the same 'key' will be in the same bucket, and the join operation will be faster. Bucketing is essentially a technique for decomposing data sets into more manageable parts.

Partitioning:

When a user wants the data contained in a table to be split across multiple sections of a Hive table, partitioning is highly suggested.

The entries for the various columns of the dataset are segregated and then stored in their respective partitions. When we write a query to fetch values from the table, only the required partitions are scanned, which reduces the time taken by the query to yield its result.
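For example, an index, and a table that is both partitioned and bucketed, can be created as follows (the table, column, and index names are illustrative; the CREATE INDEX statement applies only to older Hive versions, since indexes were removed in Hive 3.0):

-- Index on col1 of tab1 (older Hive versions only)
CREATE INDEX idx_col1 ON TABLE tab1 (col1)
AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX idx_col1 ON tab1 REBUILD;

-- A table partitioned by country and bucketed by user_id
CREATE TABLE user_logs (user_id INT, action STRING)
PARTITIONED BY (country STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;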


3. How many types of joins are present in Hadoop and when to use them?

In Hadoop, there are two types of joins: map-side join and reduce-side join.

Map Side Join:

Joining of datasets in the map phase is called a map-side join. A map-side join is preferred when you need to join one larger dataset with one smaller dataset. Map-side joins are faster because they are executed in memory using the distributed cache: the smaller dataset is shipped to all the data nodes through the cache, so the size of the smaller dataset is limited by the memory available for the cache on each node.
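The following is a minimal sketch of a map-side join mapper. It assumes the smaller dataset (a customers file) was added to the distributed cache in the driver with job.addCacheFile(), and that the cached file is available by name in the task's working directory; all class, file, and field names are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Small dataset (customer_id -> customer_name) held in memory on every mapper.
    private final Map<String, String> customers = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Files added with job.addCacheFile() in the driver are localized on each node;
        // here the first cached file is read by its name from the task's working directory.
        URI[] cacheFiles = context.getCacheFiles();
        String fileName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                customers.put(parts[0], parts[1]);    // customer_id -> customer_name
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each record of the large dataset: order_id,customer_id,amount
        String[] fields = value.toString().split(",");
        String customerName = customers.get(fields[1]);   // the join happens here, in memory
        if (customerName != null) {
            context.write(new Text(fields[0]), new Text(customerName + "," + fields[2]));
        }
    }
}
// In the driver: job.addCacheFile(new URI("/data/customers.csv"));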

Reduce Side Join:

Joining of datasets in the reduce phase is called a reduce-side join. When both datasets are large, we use a reduce-side join. Reduce-side joins are less efficient than map-side joins because both datasets have to go through the sort and shuffle phase.

4. How can you optimize Hive queries?

Follow the blog link below for tips on optimizing your Hive queries:

Optimizing Hive Queries

5. What are combiners in Hadoop?

A combiner class summarizes the map output records that share the same key, and the combiner's output (a key-value collection) is sent over the network to the actual reducer task as its input. This helps cut down the amount of data shuffled between the mappers and the reducers.
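As a concrete sketch, the classic word count can reuse its sum reducer as the combiner; this only works because summation is associative and commutative. The class below is a self-contained illustration, not code from the original post:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);            // emit (word, 1) for every token
            }
        }
    }

    // Usable as both combiner and reducer because summation is associative and commutative.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);    // pre-aggregates (word, 1) pairs on each mapper node
        job.setReducerClass(IntSumReducer.class);     // final aggregation across all mappers
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}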

For more information, please refer to this post – Combiners in Hadoop.

6. What is the difference between a combiner and an in-mapper combiner in Hadoop?

You are probably already aware that a combiner is a process that runs locally on each mapper machine to pre-aggregate data before it is shuffled across the network to the various cluster reducers. The in-mapper combiner takes this optimization a bit further: the aggregation is not even written to local disk; it happens in memory, inside the mapper itself.

The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of org.apache.hadoop.mapreduce.Mapper.
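The following is a minimal sketch of an in-mapper combiner for word count (illustrative names): counts are accumulated in a HashMap created in setup(), and each key is emitted exactly once from cleanup():

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombinerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        counts = new HashMap<>();                 // in-memory aggregation buffer
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);  // aggregate in memory, emit nothing yet
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit each key exactly once, after the whole input split has been processed.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}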

7. Let's consider this scenario: if I have a folder consisting of n files (datasets) and I want to apply the same mapper and reducer logic to all of them, what should I do?

The default TextInputFormat passes each row of a file as input to the mapper. If, instead, you want to take a whole file as a single input record, you need a WholeFileInputFormat (a custom InputFormat; it is not shipped with Hadoop), which hands the whole file to the mapper, as sketched below.
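A minimal sketch of such a WholeFileInputFormat is shown here, assuming each file is small enough to fit in memory; it marks files as non-splittable and reads each one into a single BytesWritable value. FileInputFormat.addInputPath() can then be pointed at the folder itself, so every file in it becomes one record for one mapper call:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;                                     // never split a file across mappers
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new RecordReader<NullWritable, BytesWritable>() {
            private FileSplit fileSplit;
            private TaskAttemptContext ctx;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            public void initialize(InputSplit s, TaskAttemptContext c) {
                fileSplit = (FileSplit) s;
                ctx = c;
            }

            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                // Read the entire file into a single value.
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(ctx.getConfiguration());
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                value.set(contents, 0, contents.length);
                processed = true;
                return true;
            }

            public NullWritable getCurrentKey() { return NullWritable.get(); }
            public BytesWritable getCurrentValue() { return value; }
            public float getProgress() { return processed ? 1.0f : 0.0f; }
            public void close() { }
        };
    }
}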

8. Suppose you have 50 mappers and 1 reducer; how will your cluster performance be? And if it takes a lot of time, how can you reduce it?

If there are 50 mappers and 1 reducer, it will take a lot of time to run the whole program, because the single reducer needs to collect all the mappers' output before it can process it. To reduce this time, we can do two things (both are sketched below):

A. If possible, you can add a combiner so that the amount of output coming from each mapper is reduced, which in turn reduces the load on the reducer.

B. You can enable map output compression so that less data is sent over the network to the reducer.
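Option A is the setCombinerClass() call shown in the combiner sketch for question 5. For option B, two configuration settings in the job driver are enough; the fragment below assumes conf is the job's Configuration object and that the Snappy codec is available on the cluster:

// In the job driver, before creating the Job:
conf.setBoolean("mapreduce.map.output.compress", true);            // compress intermediate map output
conf.setClass("mapreduce.map.output.compress.codec",
        org.apache.hadoop.io.compress.SnappyCodec.class,           // assumes Snappy is installed
        org.apache.hadoop.io.compress.CompressionCodec.class);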

9. Explain some string functions in Hive

String functions perform operations on string data type columns. The various string functions are as follows:

ascii(string A) – Returns the ASCII value of the first character of the string.

concat(string A, string B, ...) – Concatenates the given strings into a single string.

substr(string A, int start) – Returns the substring starting from the given index until the end of the string.

upper(string A) – Returns the string converted to upper case.

lower(string A) – Returns the string converted to lower case.

trim(string A) – Returns the string with spaces trimmed from both ends.
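For example, these functions can be combined in a single query (the employees table and its columns are hypothetical):

SELECT ascii(emp_name),              -- ASCII value of the first character
       concat(emp_name, '-', dept),  -- concatenation
       substr(emp_name, 3),          -- substring from index 3 to the end
       upper(emp_name),              -- upper case
       lower(emp_name),              -- lower case
       trim(emp_name)                -- leading/trailing spaces removed
FROM employees;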

10. Can you create a table in Hive which skips the header lines from the dataset?

Yes, we can include the skip.header.line.count property inside TBLPROPERTIES while creating the table.

For example:

CREATE TABLE Employee (Emp_Number INT, Emp_Name STRING, Emp_sal INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="1");

11. What are the binary storage formats available in Hive?

The default storage format in Hive is plain text (TEXTFILE), but Hive supports many other file formats such as SequenceFiles, Avro data files, RCFiles, ORC files, Parquet files, etc.
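A binary format is chosen with the STORED AS clause when the table is created; the table below is only an illustration:

CREATE TABLE employee_orc (emp_number INT, emp_name STRING, emp_sal INT)
STORED AS ORC;

-- Other formats are declared the same way, e.g. STORED AS PARQUET,
-- STORED AS SEQUENCEFILE, STORED AS AVRO or STORED AS RCFILE.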

You can refer to the posts below to know more about these file formats:

File Formats in Apache Hive

Introduction to Avro in Hive

Parquet File Format in Hadoop

12. Can you use multiple Hive instances at the same time? If yes, how can you do that?

By default, Hive comes with an embedded Derby database, so you cannot use multiple instances against the Derby metastore. However, if you change the Hive metastore to MySQL, then you can use multiple Hive instances at the same time.
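A sketch of the relevant hive-site.xml entries for a MySQL-backed metastore is shown below; the host, database name, user, and password are placeholders:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>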

You can refer to the post – MySQL Metastore Integration with Hive – to know how to configure MySQL as the Hive metastore.

13. Is there any testing framework available for Pig? If yes, how can you do it?

Yes, we can do unit testing of Pig scripts, for example with PigUnit.
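A minimal sketch of such a test is shown below; the script path, the alias name 'counted', and the expected tuples are placeholders, and PigUnit (shipped as a separate jar with Pig) plus JUnit are assumed to be on the classpath:

import org.apache.pig.pigunit.PigTest;
import org.junit.Test;

public class WordCountPigTest {
    @Test
    public void countsWordsCorrectly() throws Exception {
        PigTest test = new PigTest("scripts/wordcount.pig");  // hypothetical Pig script
        String[] expected = { "(hello,2)", "(world,1)" };     // expected output tuples
        test.assertOutput("counted", expected);               // 'counted' is an alias in the script
    }
}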

You can refer to the post – Unit Testing Pig Scripts – to know how Pig scripts can be tested.

14. Can you run Pig scripts using Java? If yes, how can you do it?


Yes, it is possible to embed Pig scripts inside Java code.
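A minimal sketch using Pig's PigServer API is shown below; the input path, output path, and aliases are placeholders:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);   // or ExecType.LOCAL for testing
        pig.registerQuery("lines = LOAD '/data/input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "/data/wordcount_output");       // triggers execution and stores the result
    }
}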

You can refer to the post – Embedding Pig in Java – to know how Pig scripts can be run using Java.

15. Can you automate a Flume job by running it up to a stipulated time? If yes, how can you do that?

A Flume job can be run for a stipulated time using a Java program. For this, Flume provides an Application class that can be invoked from a Java program.

import org.apache.flume.node.Application;
import org.apache.log4j.BasicConfigurator;

public class FlumeRunner {
    public static void main(String[] args) {
        // Arguments normally passed to the flume-ng command: agent name and configuration file
        String[] flumeArgs = new String[] { "agent", "-n", "TwitterAgent", "-f", "flume.conf" };
        System.setProperty("hadoop.home.dir", "/");   // set before the agent starts
        BasicConfigurator.configure();                 // basic log4j setup
        Application.main(flumeArgs);                   // start the Flume agent
    }
}

The above code runs the Flume agent with the given configuration file from a Java program. We can automate it by keeping this code inside a thread.

We hope this post has been useful in helping you prepare for that big interview. In case of any queries, feel free to comment below and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
