big data hadoop interview questions and answers

Upload: cosmicblue

Post on 07-Jul-2018


  • 8/19/2019 Big Data Hadoop Interview Questions and Answers

    Big Data Hadoop Interview Questions and Answers

    These are Hadoop Basic Interview Questions and Answers for freshers and experienced.

    1. What is Big Data?

    Big data is defined as the voluminous amount of structured, unstructured or semi-structured
    data that has huge potential for mining but is so large that it cannot be processed using
    traditional database systems. Big data is characterized by its high velocity, volume and variety,
    which require cost effective and innovative methods for information processing to draw
    meaningful business insights. More than the volume of the data, it is the nature of the data that
    defines whether it is considered Big Data or not.

    Here is an interesting and explanatory visual on "What is Big Data?"

     

    2. What do the four V’s of Big Data denote?

    IBM has a nice, simple explanation for the four critical features of big data:

    a) Volume - Scale of data

    b) Velocity - Analysis of streaming data

    c) Variety - Different forms of data

    d) Veracity - Uncertainty of data

    Here is an explanatory video on the four V's of Big Data

    3. How big data analysis helps businesses increase their revenue? Give example.

    Big data analysis is helping businesses differentiate themselves. For example, Walmart, the
    world's largest retailer in 2014 in terms of revenue, is using big data analytics to increase its
    sales through better predictive analytics, providing customized recommendations and launching
    new products based on customer preferences and needs. Walmart observed a significant 10%
    to 15% increase in online sales for $1 billion in incremental revenue. There are many more
    companies like Facebook, Twitter, LinkedIn, Pandora, JPMorgan

    2) Hadoop MapReduce - This is a Java based programming paradigm of the Hadoop framework that
    provides scalability across various Hadoop clusters. MapReduce distributes the workload into
    various tasks that can run in parallel. Hadoop jobs perform 2 separate tasks: the map job and
    the reduce job. The map job breaks down the data sets into key-value pairs or tuples. The reduce
    job then takes the output of the map job and combines the data tuples into a smaller set of
    tuples. The reduce job is always performed after the map job is executed.
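    The map and reduce jobs described above can be illustrated with a small word count. This is
    a minimal single-process sketch in plain Python, not Hadoop code: the function names
    map_phase and reduce_phase and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Map job: break each input line into (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def reduce_phase(pairs):
    """Reduce job: combine the tuples that share a key into a smaller set."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; on a cluster each line could be handled by a different map task.
lines = ["Hadoop stores data", "Hadoop processes data"]
word_counts = reduce_phase(map_phase(lines))
print(word_counts)
```

    The reduce step only starts from the pairs the map step has produced, which mirrors the
    ordering described in the answer.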

    Here is a visual that clearly explains HDFS and Hadoop MapReduce

    Data Integration

    • Hadoop Interview Questions and Answers for Experienced - Q.Nos-3,8,9,10


    Hadoop HDFS Interview Questions and Answers

    1. What is a block and block scanner in HDFS?

    Block - The minimum amount of data that can be read or written is generally referred to as a
    "block" in HDFS. The default size of a block in HDFS is 64MB (Hadoop 1.x; the default is
    128MB from Hadoop 2.x onward).

    Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies
    them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve
    disk bandwidth on the DataNode.
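    To make the block size concrete, the sketch below works out how a file would be cut into
    blocks. It is plain arithmetic, not an HDFS API; the function name split_into_blocks and the
    200MB example file are assumptions for illustration, and the 64MB constant is the default
    quoted in the answer.

```python
BLOCK_SIZE_MB = 64  # default quoted in the answer; the value is configurable per cluster


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size would occupy."""
    if file_size_mb <= 0:
        return []
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


blocks = split_into_blocks(200)
print(blocks)
```

    A 200MB file therefore needs three full blocks plus one partial block under this block size.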

    2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.

    NameNode: NameNode is at the heart of the HDFS file system which manages the metadata,
    i.e. the data of the files is not stored on the NameNode but rather it has the directory tree of all
    the files present in the HDFS file system on a hadoop cluster. NameNode uses two files for the
    namespace-

    fsimage file- It keeps track of the latest checkpoint of the namespace.

    edits file- It is a log of changes that have been made to the namespace since checkpoint.
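    The relationship between the two files can be sketched as "checkpoint plus replayed log".
    The example below is a toy model under stated assumptions: the namespace is a set of paths
    and each edit is an (operation, path) tuple. Neither is the real on-disk format, and the
    function name apply_edits is hypothetical.

```python
def apply_edits(fsimage, edits):
    """Replay logged namespace changes on top of the last checkpoint."""
    namespace = set(fsimage)  # copy, so the checkpoint itself is left untouched
    for operation, path in edits:
        if operation == "create":
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return namespace


# Hypothetical checkpoint and the changes logged since it was taken.
fsimage = {"/user", "/user/a.txt"}
edits = [("create", "/user/b.txt"), ("delete", "/user/a.txt")]
current_namespace = apply_edits(fsimage, edits)
print(sorted(current_namespace))
```

    Writing current_namespace out as a new fsimage and starting an empty edits log is, in this
    model, what taking a checkpoint amounts to.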

    Checkpoint Node:

    7. Explain the difference between NAS and HDFS.

    • NAS runs on a single machine and thus there is no probability of data
    redundancy whereas HDFS runs on a cluster of different machines thus there is data
    redundancy because of the replication protocol.

    • NAS stores data on a dedicated hardware whereas in HDFS all the data blocks
    are distributed across local drives of the machines.

    • In NAS data is stored independent of the computation and hence Hadoop
    MapReduce cannot be used for processing whereas HDFS works with Hadoop
    MapReduce as the computations in HDFS are moved to data.

    8. Explain what happens if during the PUT operation, HDFS block is
    assigned a replication factor 1 instead of the default value 3.

    Replication factor is a property of HDFS that can be set for the entire cluster to
    adjust the number of times the blocks are to be replicated to ensure high data availability. For
    every block that is stored in HDFS with replication factor n, the cluster will have n-1 duplicated
    blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default
    value 3, then there will be a single copy of the data. Under these circumstances, if the DataNode
    holding that block crashes, the only copy of the data is lost.
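    The effect of the replication factor on a DataNode crash can be shown with a few lines of
    bookkeeping. This is a minimal sketch, not HDFS code; the node names dn1..dn3 and the
    helper names are assumptions for illustration.

```python
def duplicate_count(replication_factor):
    """Number of extra copies kept besides the first one (n - 1)."""
    return replication_factor - 1


def surviving_copies(replica_nodes, failed_nodes):
    """Return the DataNodes that still hold a copy of the block after failures."""
    return [node for node in replica_nodes if node not in failed_nodes]


# Default replication factor 3: the block lives on three hypothetical DataNodes.
default_survivors = surviving_copies(["dn1", "dn2", "dn3"], failed_nodes={"dn1"})

# Replication factor 1: the block lives on a single DataNode.
single_survivors = surviving_copies(["dn1"], failed_nodes={"dn1"})

print(default_survivors, single_survivors)
```

    With three replicas the block is still readable after one crash; with a single replica the
    list of survivors is empty, which is the data loss described in the answer.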

    9. What is the process to change the files at arbitrary locations in HDFS?

    HDFS does not support modifications at arbitrary offsets in the file or multiple writers but files
    are written by a single writer in append only format, i.e. writes to a file in HDFS are always made
    at the end of the file.

    10. Explain about the indexing process in HDFS.

    Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that
    further points to the address where the next part of data chunk is stored.

    11. What is a rack awareness and on what basis is data stored in a rack?

    All the data nodes put together form a storage area, i.e. the physical location of the data nodes is
    referred to as Rack in HDFS. The rack information, i.e. the rack id of each data node, is acquired
    by the NameNode. The process of selecting closer data nodes depending on the rack
    information is known as Rack Awareness.

    The contents present in the file are divided into data blocks as soon as the client is ready to load
    the file into the hadoop cluster. After consulting with the NameNode, client allocates 3 data
    nodes for each data block. For each data block, there exist 2 copies in one rack and the third
    copy is present in another rack. This is generally referred to as the Replica Placement Policy.
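    The "two copies in one rack, third copy in another rack" rule can be sketched as a small
    selection function. This is a simplified illustration only: the real NameNode also weighs
    the writer's location, load and free space, and the names place_replicas, rack1, dn1 and so
    on are hypothetical.

```python
def place_replicas(racks):
    """Choose 3 DataNodes so that two share a rack and the third sits in another rack.

    `racks` maps a rack id to the list of DataNodes in that rack.
    """
    rack_ids = sorted(racks)
    for primary in rack_ids:
        if len(racks[primary]) < 2:
            continue  # this rack cannot hold two of the copies
        for secondary in rack_ids:
            if secondary != primary and racks[secondary]:
                return racks[primary][:2] + racks[secondary][:1]
    raise ValueError("need one rack with two nodes and a second rack with one node")


cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas(cluster)
print(replicas)
```

    Losing all of rack1 in this example still leaves the copy on dn3, which is the point of
    spreading the replicas over two racks.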

    We have further categorized Hadoop HDFS Interview Questions for Freshers and Experienced-

    • Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3,7,9,10,11

    • Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,2,4,5,6,7,8


    Hadoop MapReduce Interview Questions and Answers

    1. Explain the usage of Context Object.

    Key,Value,context)

    3)cleanup () - This method is called only once at the end of reduce task for clearing all the
    temporary files.

    Function Definition - public void cleanup (context)

    3. Explain about the partitioning, shuffle and sort phase

    Shuffle Phase- Once the first map tasks are completed, the nodes continue to perform
    several other map tasks and also exchange the intermediate outputs with the reducers as
    required. This process of moving the intermediate outputs of map tasks to the reducer is
    referred to as Shuffling.

    Sort Phase- Hadoop MapReduce automatically sorts the set of intermediate keys on a single
    node before they are given as input to the reducer.

    Partitioning Phase- The process that determines which intermediate keys and values will be
    received by each reducer instance is referred to as partitioning. The destination partition is the
    same for any key irrespective of the mapper instance that generated it.
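    The property that a key always lands in the same partition, whichever mapper emitted it,
    follows from deriving the partition from the key alone. The sketch below shows this with a
    hash-modulo rule in plain Python. CRC32 is used only as a stable stand-in hash and is an
    assumption, not Hadoop's own hash function; the mapper outputs are made up.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index using a stable hash of the key alone."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


NUM_REDUCERS = 3
mapper_a_output = [("apple", 1), ("banana", 1)]   # hypothetical mapper outputs
mapper_b_output = [("apple", 1), ("cherry", 1)]

# Record every partition each key was sent to, across both mappers.
assignments = {}
for mapper_output in (mapper_a_output, mapper_b_output):
    for key, _value in mapper_output:
        assignments.setdefault(key, set()).add(partition(key, NUM_REDUCERS))

print(assignments)
```

    Every key ends up with exactly one partition number, so all values for "apple" reach the
    same reducer even though two different mappers produced them.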

    4. How to write a custom partitioner for a Hadoop MapReduce job?

    Steps to write a

    8. What are the primary phases of a Reducer?

    The 3 primary phases of a reducer are -

    1)Shuffle

    2)Sort

    3)Reduce

    9. What is a

    2)When data is stored in the form of collections

    3)If the application demands key based access to data while retrieving.

    Key components of HBase are -

    Region- This component contains memory data store and Hfile.

    Region Server- This monitors the Region.

    HBase Master- It is responsible for monitoring the region server.

    Zookeeper- It takes care of the coordination between the HBase Master component and the
    client.

    RDBMS does not have support for in-built partitioning whereas in HBase there is automated
    partitioning.

    RDBMS stores normalized data whereas HBase stores de-normalized data.

    5. Explain about the different catalog tables in HBase?

    The two important catalog tables in HBase are ROOT and META. ROOT table tracks where the
    META table is and META table stores all the regions in the system.

    6. What is column families? What happens if you alter the block size of
    ColumnFamily on an already populated database?

    The logical deviation of data is represented through a key known as column Family.

    There are 3 different types of tombstone markers in HBase for deletion-

    1)Family Delete Marker- This marker marks all columns for a column family.

    2)Version Delete Marker- This marker marks a single version of a column.

    3)WAL) in which all the HLog edits are written immediately. WAL edits remain in the
    memory till the flush period in case of deferred log flush.

    We have further categorized Hadoop HBase Interview Questions for Freshers and Experienced-

    • Hadoop Interview Questions and Answers for Freshers - Q.Nos-1,2,4,5,7

    • Hadoop Interview Questions and Answers for Experienced - Q.Nos-2,3,6,8,9,10

    Hadoop Sqoop Interview Questions and Answers

    1. Explain about some important Sqoop commands other than import and
    export.

    Create Job (--create)

    Here we are creating a job with the name myjob, which can import the table data from an RDBMS
    table to HDFS. The following command is used to create a job that is importing data from the
    employee table in the db database to the HDFS file.

    $ sqoop job --create myjob \
    -- import \
    --connect jdbc:mysql://localhost/db \
    --username root \
    --table employee -m 1

    Verify Job (--list)

    '--list' argument is used to verify the saved jobs. The following command is used to verify the list
    of saved Sqoop jobs.

    $ sqoop job --list

    Inspect Job (--show)

    '--show' argument is used to inspect or verify particular jobs and their details. The following
    command is used to verify a job called myjob.

    $ sqoop job --show myjob

    Execute Job (--exec)

    '--exec' option is used to execute a saved job. The following command is used to execute a
    saved job called myjob.

    $ sqoop job --exec myjob

    2. How Sqoop can be used in a Java program?

    The Sqoop jar should be included in the classpath of the Java code. After this the
    Sqoop.runTool () method must be invoked. The necessary parameters should be passed to
    Sqoop programmatically just like for the command line.

    3. What is the process to perform an incremental data load in Sqoop?

    The process to perform incremental data load in Sqoop is to synchronize the modified or
    updated data (often referred to as delta data) from RDBMS to Hadoop. The delta data can be
    facilitated through the incremental load command in Sqoop.

    Incremental load can be performed by using Sqoop import command or by loading the data into
    hive without overwriting it. The different attributes that need to be specified during incremental
    load in Sqoop are-

    1)Mode (incremental) - The mode defines how Sqoop will determine what the new rows are.
    The mode can have value as Append or Last Modified.

    2)(last-value) - This denotes the maximum value of the check column from the previous
    import operation.

    4. Is it possible to do an incremental import using Sqoop?

    Yes, Sqoop supports two types of incremental imports-

    1)Append

    2)Last Modified

    To insert only rows Append should be used in import command and for inserting the rows and
    also updating Last-Modified should be used in the import command.
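    The difference between the two incremental modes can be modelled without a database. The
    sketch below is a simplified in-memory model, not Sqoop itself: rows are dictionaries, the
    target is a dict keyed by row id, and the function name incremental_import and all sample
    rows are assumptions for illustration.

```python
def incremental_import(source_rows, target, mode, check_column, last_value):
    """Copy rows whose check column exceeds last_value into target.

    target is a dict keyed by row id. Returns the last value to use for the next run.
    """
    if mode not in ("append", "lastmodified"):
        raise ValueError(f"unsupported mode: {mode}")
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    for row in new_rows:
        if mode == "append":
            target.setdefault(row["id"], row)  # insert only; existing rows stay as they are
        else:
            target[row["id"]] = row            # insert new rows and overwrite updated ones
    return max([last_value] + [row[check_column] for row in new_rows])


source = [
    {"id": 1, "name": "a", "updated": 100},
    {"id": 2, "name": "b-edited", "updated": 250},  # changed since the previous import
    {"id": 3, "name": "c", "updated": 300},         # added since the previous import
]
already_imported = {
    1: {"id": 1, "name": "a", "updated": 100},
    2: {"id": 2, "name": "b", "updated": 100},
}

appended = dict(already_imported)
append_last = incremental_import(source, appended, "append", "id", 2)

merged = dict(already_imported)
modified_last = incremental_import(source, merged, "lastmodified", "updated", 100)

print(appended[2]["name"], merged[2]["name"])
```

    Append mode picks up only the new row 3 and leaves row 2 stale, while the last-modified
    mode also refreshes row 2. The returned value plays the role of the last-value attribute for
    the next run.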

    5. What is the standard location or path for Hadoop Sqoop scripts?

    /usr/bin/Hadoop Sqoop

    6. How can you check all the tables present in a single database using
    Sqoop?

    The command to check the list of all tables present in a single database using Sqoop is as
    follows-

    sqoop list-tables --connect jdbc:mysql://localhost/user

    7. How are large objects handled in Sqoop?

    Sqoop provides the capability to store large sized data into a single field based on the type of
    data. Sqoop supports the ability to store-

    1).

    Dist


     

    • Hadoop Inter'ie (uestions and )nsers for Freshers - (.*os- ,/,0,3

    • Hadoop Inter'ie (uestions and )nsers for Experien%ed - (.*os-

    +,,4,0,1,2,+5

    Hadoop Flume Interview Questions and Answers

    1 Explain about the core components of Flume.

    The core components of Flume are -

    Event- The single log entry or unit of data that is transported.

    Source- This is the component through which data enters Flume workflows.

    Sink- It is responsible for transporting data to the desired destination.

    • AsyncHBaseSink has better
    performance than HBase sink as it can easily make non-blocking calls to HBase.

    Working of the HBaseSink:

    In HBaseSink, a Flume Event is converted into HBase Increments or Puts. Serializer
    implements the HBaseEventSerializer which is then instantiated when the sink starts. For every
    event, sink calls the initialize method in the serializer which then translates the Flume Event into
    HBase increments and puts to be sent to HBase cluster.

    Working of the AsyncHBaseSink:

    AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called
    only once by the sink when it starts. Sink invokes the setEvent method and then makes calls to
    the getIncrements and getActions methods, similar to HBase sink. When the sink stops, the
    cleanUp method is called by the serializer.

    4 Explain about the different channel types in Flume. Which channel type
    is faster?

    The 3 different built in channel types available in Flume are-

    MEMORY

    6 Explain about the replication and multiplexing selectors in Flume.

    3 What is the role of Zookeeper in HBase architecture?

    In HBase architecture, ZooKeeper is the monitoring server that provides different services like -
    tracking server failure and network partitions, maintaining the configuration information,
    establishing communication between the clients and region servers, usability of ephemeral
    nodes to identify the available servers in the cluster.

    4 Explain about ZooKeeper in Kafka

    Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. Zookeeper is
    used by Kafka to store various configurations and use them across the hadoop cluster in a
    distributed manner. To achieve distributed-ness, configurations are distributed and replicated
    throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot directly
    connect to Kafka by bypassing ZooKeeper because if the ZooKeeper is down it will not be
    able to serve the client request.

    5 Explain how Zookeeper works

    ZooKeeper is referred to as the King of

    • Storm that relies on ZooKeeper is used by popular companies like Groupon
    and Twitter.

    7 How to use Apache Zookeeper command line interface?

    ZooKeeper has a command line client support for interactive use. The command line interface
    of ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a
    hierarchy of Znodes where each znode can contain data, similar to a file. Each znode can
    also have children just like directories in the UNIX file system.

    Zookeeper-client command is used to launch the command line client. If the initial prompt is
    hidden by the log messages after entering the command, users can just hit ENTER to view the
    prompt.

    8 What are the different types of Znodes?

    There are 2 types of Znodes namely- Ephemeral and Sequential Znodes.

    • The Znodes that get destroyed as soon as the client that created them
    disconnects are referred to as Ephemeral Znodes.

    • Sequential Znode is the one in which a sequential number is chosen by the
    ZooKeeper ensemble and is appended to the name the client assigns to the znode.
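    The two znode types can be illustrated with a toy in-memory store. This is a sketch under
    stated assumptions and not the ZooKeeper client API: the class TinyZnodeStore, its methods
    and the session names are hypothetical, and a single counter is used for the whole store
    where ZooKeeper keeps one per parent znode. ZooKeeper adds the counter to the end of the
    requested name as a zero-padded ten-digit number, which the sketch imitates.

```python
class TinyZnodeStore:
    """Toy in-memory model of znode creation; not the ZooKeeper client API."""

    def __init__(self):
        self.nodes = {}   # path -> owning session (None for persistent znodes)
        self.counter = 0  # simplification: one counter for the whole store

    def create(self, path, session=None, ephemeral=False, sequential=False):
        if sequential:
            # The store, not the client, picks the number added to the name.
            path = f"{path}{self.counter:010d}"
            self.counter += 1
        self.nodes[path] = session if ephemeral else None
        return path

    def close_session(self, session):
        # Ephemeral znodes disappear together with the session that created them.
        self.nodes = {p: s for p, s in self.nodes.items() if s != session}


store = TinyZnodeStore()
config = store.create("/config")
lock1 = store.create("/locks/lock-", session="s1", ephemeral=True, sequential=True)
lock2 = store.create("/locks/lock-", session="s2", ephemeral=True, sequential=True)
store.close_session("s1")
print(sorted(store.nodes))
```

    After session s1 closes, only the persistent znode and the znode owned by s2 remain.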

    9 What are watches?

    7 Differentiate between Hadoop MapReduce and Pig

    • Pig provides higher level of abstraction whereas MapReduce provides low
    level of abstraction.

    • MapReduce requires the developers to write more lines of code when
    compared to Apache Pig.

    • Pig coding approach is comparatively slower than the fully tuned MapReduce
    coding approach.

    Read More in Detail- http://www.dezyre.com/article/-mapreduce-vs-pig-vs-hive/163

    8 What is the usage of foreach operation in Pig scripts?

    8?CAig?

    Sometimes there is data in a tuple or bag and if we want to remove the level of nesting from that
    data then Flatten modifier in Pig can be used. Flatten un-nests bags and tuples. For tuples, the
    Flatten operator will substitute the fields of a tuple in place of a tuple whereas un-nesting bags is
    a little complex because it requires creating new tuples.
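    The two cases can be mimicked in plain Python to show why bags are the harder one. This is
    an illustrative sketch, not Pig Latin: Python tuples stand in for Pig tuples, a list of
    tuples stands in for a bag, and the function names are assumptions.

```python
def flatten_tuple(record, index):
    """Replace the nested tuple at `index` with its fields, in place of the tuple."""
    return record[:index] + tuple(record[index]) + record[index + 1:]


def flatten_bag(record, index):
    """Emit one new tuple for every tuple held in the bag at `index`."""
    return [record[:index] + tuple(inner) + record[index + 1:] for inner in record[index]]


# A record with a nested tuple: one output record, just wider.
flat_tuple = flatten_tuple(("alice", (1, 2)), 1)

# A record with a bag of two tuples: two output records.
flat_bag = flatten_bag(("alice", [(1, 2), (3, 4)]), 1)

print(flat_tuple)
print(flat_bag)
```

    Flattening the tuple keeps a single record, while flattening the bag multiplies the record
    by the number of tuples in the bag, which is the creation of new tuples the answer refers to.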

    We have further categorized Hadoop Pig Interview Questions for Freshers and Experienced-

    • Hadoop Interview Questions and Answers for Freshers - Q.Nos-1,2,4,7,9

    • Hadoop Interview Questions and Answers for Experienced - Q.Nos- 3,5,6,8,10

    Hadoop Hive Interview Questions and Answers

    1 What is a Hive Metastore?

    Hive Metastore is a central repository that stores metadata in an external database.

    2 Are multiline comments supported in Hive?

    No

    3 What is ObjectInspector functionality?

    ObjectInspector is used to analyze the structure of individual columns and the internal structure
    of the row objects. ObjectInspector in Hive provides access to complex objects which can be
    stored in multiple formats.

     

    Hadoop Hive Interview Questions and Answers for Freshers- Q.Nos-1,2,3

    Hadoop YARN Interview Questions and Answers

    1 What are the stable versions of Hadoop?

    Release 2.7.1 (stable)

    Release 2.4.1

    Release 1.2.1 (stable)

    2 What is Apache Hadoop YARN?

    YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large
    scale distributed system for running big data applications.

    3 Is YARN a replacement of Hadoop MapReduce?

    YARN is not a replacement of Hadoop but it is a more powerful and efficient technology that
    supports MapReduce and is also referred to as Hadoop 2.0 or MapReduce 2.

    We have further categorized Hadoop YARN Interview Questions for Freshers and Experienced-

    • Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3

    • Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1

    Hadoop Interview Questions Answers Needed

    Interview Questions on Hadoop Hive

    1)Explain about the different types of join in Hive.

    2)How can you configure remote metastore mode in Hive?

    3)Explain about the SMB Join in Hive.

    4)Is it possible to change the default location of Managed Tables in Hive, if so how?

    5)How data transfer happens from Hive to HDFS?

    6)How can you connect an application, if you run Hive as a server?

    7)What does the overwrite keyword denote in Hive load statement?

    8)What is SerDe in Hive? How can you write your own custom SerDe?

    9)In case of embedded Hive, can the same metastore be used by multiple users?

    Hadoop YARN Interview Questions

    1)What are the additional benefits YARN brings in to Hadoop?

    2)How can native libraries be included in YARN jobs?

    3)Explain the differences between Hadoop 1.x and Hadoop 2.x

    Or

    4)Explain the difference between MapReduce1 and MapReduce 2/YARN

    5)What are the modules that constitute the Apache Hadoop 2.0 framework?

    6)What are the core changes in Hadoop 2.0?

    7)How is the distance between two nodes defined in Hadoop?

    8)Differentiate between NFS, Hadoop NameNode and JournalNode.

    We hope that these Hadoop Interview Questions and Answers have pre-charged you for your
    next Hadoop Interview. Get the Ball Rolling and answer the unanswered questions in the
    comments below. Please do! It's all part of our shared mission to ease Hadoop Interviews for all
    prospective Hadoopers. We invite you to get involved.