
8/19/2019 Big Data Hadoop Interview Questions and Answers


Big Data Hadoop Interview Questions and Answers

These are basic Hadoop interview questions and answers for freshers and experienced professionals.

1. What is Big Data?

Big data is defined as the voluminous amount of structured, unstructured or semi-structured
data that has huge potential for mining but is so large that it cannot be processed using
traditional database systems. Big data is characterized by its high velocity, volume and variety,
which require cost-effective and innovative methods of information processing to draw
meaningful business insights. More than the volume of the data, it is the nature of the data that
determines whether it is considered Big Data or not.


2. What do the four V’s of Big Data denote?

IBM has a nice, simple explanation for the four critical features of big data:

a) Volume - Scale of data

b) Velocity - Analysis of streaming data

c) Variety - Different forms of data

d) Veracity - Uncertainty of data


3. How does big data analysis help businesses increase their revenue? Give an example.

Big data analysis is helping businesses differentiate themselves. For example, Walmart, the
world's largest retailer in 2014 in terms of revenue, is using big data analytics to increase its
sales through better predictive analytics, providing customized recommendations and launching
new products based on customer preferences and needs. Walmart observed a significant 10%
to 15% increase in online sales, for $1 billion in incremental revenue. There are many more
companies like Facebook, Twitter, LinkedIn, Pandora, JPMorgan



2) Hadoop MapReduce - This is a Java-based programming paradigm of the Hadoop framework that
provides scalability across various Hadoop clusters. MapReduce distributes the workload into
various tasks that can run in parallel. Hadoop jobs perform 2 separate tasks: the map job and
the reduce job. The map job breaks down the data sets into key-value pairs or tuples. The reduce
job then takes the output of the map job and combines the data tuples into a smaller set of
tuples. The reduce job is always performed after the map job is executed.
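The map-then-reduce flow described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not Hadoop's Java API; the function names `map_phase`, `reduce_phase` and `run_job` are made up for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Map job: break one input line into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(key, values):
    """Reduce job: combine all values for one key into a single pair."""
    return key, sum(values)

def run_job(lines):
    # All map output is collected first; reducing starts only afterwards.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = run_job(["big data big insights", "data moves fast"])
```

In a real cluster the map calls run in parallel on different nodes and the grouping step is the shuffle; here everything runs in one process to keep the sketch short.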


Data Integration



Hadoop HDFS Interview Questions and Answers

1. What is a block and block scanner in HDFS?

Block - The minimum amount of data that can be read or written is generally referred to as a
"block" in HDFS. The default size of a block in HDFS is 64MB (128MB from Hadoop 2.x onward).

Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies
them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve
disk bandwidth on the datanode.
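As a quick worked example of what the block size means for a file, the sketch below computes how many blocks a file is split into. The helper names are hypothetical and the 64MB default is simply the figure quoted in the answer; pass a different `block_size_mb` for clusters configured otherwise.

```python
import math

BLOCK_SIZE_MB = 64  # default quoted in the answer; configurable per cluster

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

def last_block_size(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Size of the final block, which holds only the leftover data."""
    remainder = file_size_mb % block_size_mb
    return remainder if remainder else min(file_size_mb, block_size_mb)
```

For a 200MB file this gives four blocks: three full 64MB blocks and a final block holding the remaining 8MB.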

2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.

NameNode: NameNode is at the heart of the HDFS file system. It manages the metadata, i.e. the
data of the files is not stored on the NameNode; rather, it holds the directory tree of all the
files present in the HDFS file system on a Hadoop cluster. NameNode uses two files for the
namespace:

fsimage file - It keeps track of the latest checkpoint of the namespace.

edits file - It is a log of changes that have been made to the namespace since the checkpoint.

Checkpoint Node:



7. Explain the difference between NAS and HDFS.

• NAS runs on a single machine and thus there is no probability of data redundancy, whereas
HDFS runs on a cluster of different machines, so there is data redundancy because of the
replication protocol.

• NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed
across the local drives of the machines.

• In NAS, data is stored independently of the computation and hence Hadoop MapReduce cannot
be used for processing, whereas HDFS works with Hadoop MapReduce as the computations in
HDFS are moved to the data.

8. Explain what happens if during the PUT operation, an HDFS block is assigned a replication
factor 1 instead of the default value 3.

Replication factor is a property of HDFS that can be set for the entire cluster to adjust the
number of times the blocks are to be replicated to ensure high data availability. For every
block that is stored in HDFS with replication factor n, the cluster will have n-1 duplicated
blocks. So, if the replication factor during the PUT operation is set to 1 instead of the
default value 3, then there will be a single copy of the data. Under these circumstances, if
the DataNode holding that copy crashes, the only copy of the data would be lost.
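The arithmetic behind this answer is small enough to write down. The sketch below is illustrative only (the function name and the returned keys are assumptions for this example): for a replication factor n it reports the n-1 duplicates, the raw storage consumed, and how many DataNode failures the data can survive.

```python
def replica_summary(file_size_mb, replication_factor=3):
    """Raw storage used and DataNode losses survivable for one file."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return {
        "copies": replication_factor,
        "duplicates": replication_factor - 1,          # the n-1 extra copies
        "raw_storage_mb": file_size_mb * replication_factor,
        "tolerated_node_losses": replication_factor - 1,
    }
```

With the default factor of 3, a 100MB file occupies 300MB across the cluster and survives two DataNode losses; with a factor of 1 it occupies 100MB and survives none.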

9. What is the process to change the files at arbitrary locations in HDFS?

HDFS does not support modifications at arbitrary offsets in the file or multiple writers; files
are written by a single writer in append-only format, i.e. writes to a file in HDFS are always
made at the end of the file.
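To make the single-writer, append-only rule concrete, here is a toy in-memory model. It is not the HDFS client API; the class `AppendOnlyFile` and its methods are invented for this illustration.

```python
class AppendOnlyFile:
    """Toy model of a file that allows one writer and end-of-file writes only."""

    def __init__(self):
        self.data = bytearray()
        self.writer = None  # client currently allowed to write, if any

    def open(self, client):
        if self.writer is not None:
            raise PermissionError(f"{self.writer} is already writing this file")
        self.writer = client

    def append(self, client, payload):
        if client != self.writer:
            raise PermissionError("only the current writer may append")
        self.data += payload  # new bytes always land at the end of the file

    def write_at(self, client, offset, payload):
        raise NotImplementedError("writes at arbitrary offsets are not supported")

    def close(self, client):
        if client == self.writer:
            self.writer = None
```

A second client calling `open` while the first is still writing is rejected, and there is simply no operation that overwrites bytes in the middle of the file.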

10. Explain about the indexing process in HDFS.

Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that
further points to the address where the next part of the data chunk is stored.

11. What is rack awareness and on what basis is data stored in a rack?

All the data nodes put together form a storage area, i.e. the physical location of the data
nodes is referred to as a Rack in HDFS. The rack information, i.e. the rack id of each data
node, is acquired by the NameNode. The process of selecting closer data nodes depending on the
rack information is known as Rack Awareness.



The contents present in the file are divided into data blocks as soon as the client is ready to
load the file into the Hadoop cluster. After consulting with the NameNode, the client allocates
3 data nodes for each data block. For each data block, there exist 2 copies in one rack and the
third copy is present in another rack. This is generally referred to as the Replica Placement
Policy.
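The placement rule as stated here (two copies in one rack, the third in another) can be sketched as follows. This is a simplified illustration under that stated rule, not the NameNode's actual algorithm; `place_replicas` and the cluster layout are hypothetical.

```python
def place_replicas(nodes_by_rack):
    """Pick 3 DataNodes: two from one rack and one from a different rack.

    nodes_by_rack maps a rack id to the list of DataNodes on that rack.
    """
    racks = [rack for rack, nodes in nodes_by_rack.items() if nodes]
    primary = next((r for r in racks if len(nodes_by_rack[r]) >= 2), None)
    if primary is None:
        raise ValueError("need a rack with at least two DataNodes")
    other = next((r for r in racks if r != primary), None)
    if other is None:
        raise ValueError("need a second rack for the third replica")
    return nodes_by_rack[primary][:2] + nodes_by_rack[other][:1]

cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
chosen = place_replicas(cluster)
```

Spreading the copies over two racks means the block stays readable even if an entire rack becomes unreachable.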


Hadoop MapReduce Interview Questions and Answers

1. Explain the usage of Context Object.

Key, Value, context)

3) cleanup () - This method is called only once at the end of the reduce task for clearing all
the temporary files.


Function Definition - public void cleanup (context)

3. Explain about the partitioning, shuffle and sort phase

Shuffle Phase: Once the first map tasks are completed, the nodes continue to perform
several other map tasks and also exchange the intermediate outputs with the reducers as
required. This process of moving the intermediate outputs of map tasks to the reducer is
referred to as Shuffling.

Sort Phase - Hadoop MapReduce automatically sorts the set of intermediate keys on a single
node before they are given as input to the reducer.

Partitioning Phase: The process that determines which intermediate keys and values will be
received by each reducer instance is referred to as partitioning. The destination partition is
the same for any key irrespective of the mapper instance that generated it.
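The guarantee that a key always lands on the same reducer follows from computing the partition from the key alone. Below is a minimal sketch of hash partitioning in Python; the function names are hypothetical, and `zlib.crc32` stands in for the key's hash code because Python's built-in string hash differs between processes.

```python
import zlib

def partition_for(key, num_reducers):
    """Map a key to a reducer index using a stable hash of the key only."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_pairs(pairs, num_reducers):
    """Group intermediate (key, value) pairs by destination reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition_for(key, num_reducers)].append((key, value))
    return buckets
```

Because the mapper that produced a pair never enters the calculation, pairs for the same key emitted by different mappers end up in the same bucket, which is what lets one reducer see every value for that key.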

4. How to write a custom partitioner for a Hadoop MapReduce job?

Steps to write a



8. What are the primary phases of a Reducer?

The 3 primary phases of a reducer are -

1) Shuffle

2) Sort

3) Reduce

9. What is a



2) When data is stored in the form of collections

3) If the application demands key based access to data while retrieving.

Key components of HBase are -

Region - This component contains the memory data store and HFile.

Region Server - This monitors the Region.

HBase Master - It is responsible for monitoring the region server.

Zookeeper - It takes care of the coordination between the HBase Master component and the
client.



RDBMS does not have support for in-built partitioning whereas in HBase there is automated
partitioning.

RDBMS stores normalized data whereas HBase stores de-normalized data.

5. Explain about the different catalog tables in HBase?

The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the
META table is, and the META table stores all the regions in the system.

6. What are column families? What happens if you alter the block size of a ColumnFamily on an
already populated database?

The logical division of data is represented through a key known as the column family.



There are 3 different types of tombstone markers in HBase for deletion -

1) Family Delete Marker - This marker marks all columns for a column family.

2) Version Delete Marker - This marker marks a single version of a column.

3)

WAL) in which all the HLog edits are written immediately. WAL edits remain in the
memory till the flush period in case of deferred log flush.


Hadoop Sqoop Interview Questions and Answers

1. Explain about some important Sqoop commands other than import and export.

Create Job (--create)

Here we are creating a job with the name myjob, which can import the table data from an RDBMS
table to HDFS. The following command is used to create a job that imports data from the
employee table in the db database to an HDFS file.

$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee -m 1

Verify Job (--list)

'--list' argument is used to verify the saved jobs. The following command is used to verify the
list of saved Sqoop jobs.

$ sqoop job --list

Inspect Job (--show)

'--show' argument is used to inspect or verify particular jobs and their details. The following
command is used to inspect a job called myjob.

$ sqoop job --show myjob

Execute Job (--exec)

'--exec' option is used to execute a saved job. The following command is used to execute a
saved job called myjob.

$ sqoop job --exec myjob

2. How can Sqoop be used in a Java program?

The Sqoop jar should be included in the classpath of the Java code. After this, the
Sqoop.runTool() method must be invoked. The necessary parameters should be passed to Sqoop
programmatically, just as they would be on the command line.

3. What is the process to perform an incremental data load in Sqoop?

The process to perform an incremental data load in Sqoop is to synchronize the modified or
updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be
loaded through the incremental load command in Sqoop.

Incremental load can be performed by using the Sqoop import command or by loading the data into
Hive without overwriting it. The different attributes that need to be specified during an
incremental load in Sqoop are -

1) Mode (incremental) - The mode defines how Sqoop will determine what the new rows are.
The mode can have the value Append or Last Modified.

2) last-value - This denotes the maximum value of the check column from the previous
import operation.
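The role of the check column and the last value can be illustrated without a database. The sketch below is a plain-Python simulation, not Sqoop itself: `incremental_rows` is a hypothetical helper, and the list of dicts stands in for an RDBMS table.

```python
def incremental_rows(rows, check_column, last_value, mode="append"):
    """Select the delta rows that a follow-up import would pick up.

    In both modes the filter is "check column greater than last_value";
    the modes differ in which column is checked (an ever-growing id for
    append, a last-updated timestamp for lastmodified).
    """
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")
    delta = [row for row in rows if row[check_column] > last_value]
    new_last_value = max((row[check_column] for row in delta), default=last_value)
    return delta, new_last_value
```

The second return value is what would be remembered as the last value for the next run, so each import only transfers rows that appeared or changed since the previous one.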

4. Is it possible to do an incremental import using Sqoop?

Yes, Sqoop supports two types of incremental imports -

1) Append

2) Last Modified

To insert only new rows, Append should be used in the import command; for inserting rows and
also updating existing ones, Last-Modified should be used in the import command.

5. What is the standard location or path for Hadoop Sqoop scripts?

/usr/bin/Hadoop Sqoop

6. How can you check all the tables present in a single database using Sqoop?

The command to check the list of all tables present in a single database using Sqoop is as
follows -

sqoop list-tables --connect jdbc:mysql://localhost/user

7. How are large objects handled in Sqoop?

Sqoop provides the capability to store large sized data into a single field based on the type of
data. Sqoop supports the ability to store -

1).

Dist



Hadoop Flume Interview Questions and Answers

1. Explain about the core components of Flume.

The core components of Flume are -

Event - The single log entry or unit of data that is transported.

Source - This is the component through which data enters Flume workflows.

Sink - It is responsible for transporting data to the desired destination.
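How these components fit together can be shown with a toy pipeline. This is an illustrative sketch in plain Python, not Flume's API: `make_event` and `run_flow` are invented names, and a `deque` plays the part of the buffer that sits between source and sink.

```python
from collections import deque

def make_event(line):
    """Wrap one log entry as an event with headers and a byte body."""
    return {"headers": {}, "body": line.encode("utf-8")}

def run_flow(log_lines, deliver):
    channel = deque()                  # buffer between the source and the sink
    for line in log_lines:             # source: data enters the workflow
        channel.append(make_event(line))
    delivered = 0
    while channel:                     # sink: drain events to the destination
        deliver(channel.popleft())
        delivered += 1
    return delivered

received = []
count = run_flow(["GET /index", "POST /login"], received.append)
```

Here `received.append` stands in for the destination; in a real deployment the sink would write to a store such as HDFS or HBase.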



• AsyncHBaseSink has better performance than HBase sink as it can easily make non-blocking
calls to HBase.

Working of the HBaseSink

In HBaseSink, a Flume Event is converted into HBase Increments or Puts. The serializer
implements the HBaseEventSerializer, which is instantiated when the sink starts. For every
event, the sink calls the initialize method in the serializer, which then translates the Flume
Event into HBase increments and puts to be sent to the HBase cluster.

Working of the AsyncHBaseSink:

AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called
only once by the sink when it starts. The sink invokes the setEvent method and then makes calls
to the getIncrements and getActions methods, just as in the HBase sink. When the sink stops,
the cleanUp method is called by the serializer.

4. Explain about the different channel types in Flume. Which channel type is faster?

The 3 different built in channel types available in Flume are -

MEMORY


6. Explain about the replication and multiplexing selectors in Flume.

3. What is the role of Zookeeper in HBase architecture?

In HBase architecture, ZooKeeper is the monitoring server that provides different services like -
tracking server failure and network partitions, maintaining the configuration information,
establishing communication between the clients and region servers, and using ephemeral
nodes to identify the available servers in the cluster.

4. Explain about ZooKeeper in Kafka

Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. ZooKeeper is
used by Kafka to store various configurations and use them across the Hadoop cluster in a
distributed manner. To achieve distributed-ness, configurations are distributed and replicated
throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot directly
connect to Kafka by bypassing ZooKeeper because if ZooKeeper is down it will not be
able to serve the client request.

5. Explain how Zookeeper works

ZooKeeper is referred to as the King of



• Storm, which relies on ZooKeeper, is used by popular companies like Groupon and Twitter.

7. How to use Apache Zookeeper command line interface?

ZooKeeper has command line client support for interactive use. The command line interface
of ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a
hierarchy of Znodes where each znode can contain data, just like a file. Each znode can
also have children, just like directories in the UNIX file system.

The zookeeper-client command is used to launch the command line client. If the initial prompt is
hidden by the log messages after entering the command, users can just hit ENTER to view the
prompt.

8. What are the different types of Znodes?

Besides ordinary persistent Znodes, there are 2 special types of Znodes, namely Ephemeral and
Sequential Znodes.

• The Znodes that get destroyed as soon as the client that created them disconnects are
referred to as Ephemeral Znodes.

• A Sequential Znode is one for which a sequence number is chosen by the ZooKeeper ensemble
and appended to the name the client assigns to the znode.
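The two behaviours can be modelled in a few lines. The class below is a toy in-memory stand-in, not a ZooKeeper client: `TinyZk` and its methods are hypothetical, a single counter replaces the per-parent counters of a real ensemble, and the sequence number is attached as a ten-digit zero-padded suffix as ZooKeeper documents it.

```python
class TinyZk:
    """Toy in-memory model of ephemeral and sequential znodes."""

    def __init__(self):
        self.nodes = {}    # path -> owning session, or None if persistent
        self.counter = 0   # stand-in for the ensemble-chosen sequence number

    def create(self, path, session=None, ephemeral=False, sequential=False):
        if ephemeral and session is None:
            raise ValueError("an ephemeral znode needs an owning session")
        if sequential:
            path = f"{path}{self.counter:010d}"
            self.counter += 1
        self.nodes[path] = session if ephemeral else None
        return path

    def close_session(self, session):
        # Ephemeral znodes vanish when the client that created them disconnects.
        self.nodes = {path: owner for path, owner in self.nodes.items()
                      if owner is None or owner != session}
```

Combining both flags is the usual building block for locks and leader election: each client creates an ephemeral sequential znode, the lowest sequence number wins, and a crashed client's entry disappears on its own.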

9. What are watches?



7. Differentiate between Hadoop MapReduce and Pig

• Pig provides a higher level of abstraction whereas MapReduce provides a low level of
abstraction.

• MapReduce requires the developers to write more lines of code when compared to Apache Pig.

• The Pig coding approach is comparatively slower than the fully tuned MapReduce coding
approach.

Read More in Detail - http://www.dezyre.com/article/-mapreduce-vs-pig-vs-hive/163

8. What is the usage of foreach operation in Pig scripts?

FOREACH

Sometimes there is data in a tuple or bag, and if we want to remove the level of nesting from
that data then the Flatten modifier in Pig can be used. Flatten un-nests bags and tuples. For
tuples, the Flatten operator will substitute the fields of a tuple in place of the tuple, whereas
un-nesting bags is a little more complex because it requires creating new tuples.
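The difference between the two cases is easy to see on small data. The sketch below mimics the described behaviour in plain Python rather than Pig Latin; rows are Python tuples, a bag is a list of tuples, and both function names are made up for this illustration.

```python
def flatten_tuple_field(row, index):
    """Flatten a tuple field: splice the tuple's fields into the row."""
    return row[:index] + tuple(row[index]) + row[index + 1:]

def flatten_bag_field(row, index):
    """Flatten a bag field: emit one new row per tuple in the bag."""
    return [row[:index] + tuple(item) + row[index + 1:] for item in row[index]]
```

Flattening a tuple field keeps a single row and only widens it, while flattening a bag field multiplies the row, one output row per tuple in the bag, which is why the bag case requires creating new tuples.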


Hadoop Hive Interview Questions and Answers

1. What is a Hive Metastore?

Hive Metastore is a central repository that stores metadata in an external database.


2. Are multiline comments supported in Hive?

No

3. What is ObjectInspector functionality?

ObjectInspector is used to analyze the structure of individual columns and the internal structure
of the row objects. ObjectInspector in Hive provides access to complex objects which can be
stored in multiple formats.

Hadoop Hive Interview Questions and Answers for Freshers - Q.Nos-1,2,3

Hadoop YARN Interview Questions and Answers

1. What are the stable versions of Hadoop?

Release 2.7.1 (stable)

Release 2.4.1

Release 1.2.1 (stable)

2. What is Apache Hadoop YARN?

YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large
scale distributed system for running big data applications.

3. Is YARN a replacement of Hadoop MapReduce?

YARN is not a replacement of Hadoop but it is a more powerful and efficient technology that
supports MapReduce and is also referred to as Hadoop 2.0 or MapReduce 2.


Hadoop Interview Questions - Answers Needed

Interview Questions on Hadoop Hive

1) Explain about the different types of join in Hive.

2) How can you configure remote metastore mode in Hive?

3) Explain about the SMB Join in Hive.

4) Is it possible to change the default location of Managed Tables in Hive, if so how?

5) How does data transfer happen from Hive to HDFS?

6) How can you connect an application, if you run Hive as a server?

7) What does the overwrite keyword denote in a Hive load statement?

8) What is SerDe in Hive? How can you write your own custom SerDe?

9) In case of embedded Hive, can the same metastore be used by multiple users?

Hadoop YARN Interview Questions

1) What are the additional benefits YARN brings in to Hadoop?

2) How can native libraries be included in YARN jobs?

3) Explain the differences between Hadoop 1.x and Hadoop 2.x

Or

4) Explain the difference between MapReduce1 and MapReduce 2/YARN

5) What are the modules that constitute the Apache Hadoop 2.0 framework?

6) What are the core changes in Hadoop 2.0?

7) How is the distance between two nodes defined in Hadoop?

8) Differentiate between NFS, Hadoop NameNode and JournalNode.

We hope that these Hadoop Interview Questions and Answers have pre-charged you for your
next Hadoop Interview. Get the ball rolling and answer the unanswered questions in the
comments below. Please do! It's all part of our shared mission to ease Hadoop Interviews for all
prospective Hadoopers. We invite you to get involved.
