Hadoop Pre-Test

Q1. In a traditional setup, your application runs on one machine and your database runs on another machine at a different location. Assume the database size is 1 TB and the network bandwidth is 10 MBps. What do you think could be the performance bottleneck? Choose the best option.

Network bandwidth is the performance bottleneck. If we can increase the network bandwidth, we can increase the performance, since it will increase the data transfer rate.
The huge amount of data in the database is the bottleneck. Transferring the data to the application server over the low-bandwidth network takes a lot of time, and this is the main reason for the performance bottleneck.
Performance depends only on the latency involved in the application reading and processing the data. There is no performance degradation due to the network or the database size.
All of the above
There is no performance bottleneck
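
To make the bottleneck concrete, here is a quick back-of-the-envelope check (an illustrative sketch, not part of the original test, treating 10 MBps as 10 megabytes per second) of how long it takes just to move a 1 TB database to the application server:

    // Illustrative only: transfer time for 1 TB at 10 MB/s.
    public class TransferTime {
        public static void main(String[] args) {
            double databaseBytes = 1e12;            // 1 TB
            double bandwidthBytesPerSec = 10e6;     // 10 MB/s
            double seconds = databaseBytes / bandwidthBytesPerSec;
            System.out.printf("%.0f seconds (~%.1f hours)%n", seconds, seconds / 3600);
            // ~100,000 seconds, roughly 27.8 hours, before any processing starts.
        }
    }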

Q2. You are performing matrix multiplication on a quad-core machine (4 cores per CPU). The matrices are 1024 x 1024 and each element is a random decimal number. The multiplication is currently implemented as a single-threaded application, and you have observed that it takes too much time. How can you increase the performance?

Since I have 4 cores, I will run 4 threads to perform the multiplication, each thread working on 256 rows.
I will transpose one of the matrices to avoid cache misses, and I will also employ 4 threads to utilize all the cores, each thread working on 256 rows.
I will transpose both of the matrices and employ 4 threads to utilize all the cores, each thread working on 256 rows.
Since I can run as many threads as I want, I will run 256 threads and each thread will take only 4 rows.
I will run 1024 threads and each thread will take only 1 row, and hence it will be very fast.
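
For reference, here is a minimal sketch (an assumed example, not the test's answer key; the class and method names are arbitrary) of the threaded, cache-friendly approach: matrix B is transposed so the inner loop reads both operands row-wise, and 4 worker threads each handle 256 rows.

    import java.util.concurrent.*;

    public class ThreadedMatMul {
        static final int N = 1024, THREADS = 4;

        public static double[][] multiply(double[][] a, double[][] b) throws InterruptedException {
            // Transpose b so that b[k][j] accesses become sequential bt[j][k] reads.
            double[][] bt = new double[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    bt[j][i] = b[i][j];

            double[][] c = new double[N][N];
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            int rowsPerThread = N / THREADS;                // 256 rows per thread
            for (int t = 0; t < THREADS; t++) {
                final int start = t * rowsPerThread, end = start + rowsPerThread;
                pool.submit(() -> {
                    for (int i = start; i < end; i++)
                        for (int j = 0; j < N; j++) {
                            double sum = 0;
                            for (int k = 0; k < N; k++)
                                sum += a[i][k] * bt[j][k];  // both operands read row-wise
                            c[i][j] = sum;
                        }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            return c;
        }
    }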

Q3. You want to store 10 TB of data on a machine with 100 GB of hard disk capacity. What would be your approach? Choose the most feasible and realistic option.

I will upgrade the machine's hard disk capacity to 100 TB.
I will buy a high-end machine and store the data on it.
I will partition the data into 100 GB chunks and distribute the chunks across several machines.
All of the above
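
As an aside, here is a minimal sketch (an assumed example; the class name, chunk size, and file layout are all illustrative) of splitting one large file into fixed-size chunks that could then be copied to different machines:

    import java.io.*;
    import java.nio.file.*;

    public class FileChunker {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB here; 100 GB in the question

        public static void split(Path input, Path outputDir) throws IOException {
            byte[] buffer = new byte[8 * 1024];
            int chunkIndex = 0;
            try (InputStream in = Files.newInputStream(input)) {
                int read = in.read(buffer);
                while (read != -1) {
                    // Start a new chunk file and fill it up to roughly CHUNK_SIZE bytes.
                    Path chunk = outputDir.resolve("chunk-" + chunkIndex++);
                    long written = 0;
                    try (OutputStream out = Files.newOutputStream(chunk)) {
                        while (read != -1 && written < CHUNK_SIZE) {
                            out.write(buffer, 0, read);
                            written += read;
                            read = in.read(buffer);
                        }
                    }
                }
            }
        }
    }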

Q4. If 1 TB of data is stored on a machine with 64 GB of RAM, how will you perform a sort operation on the 1 TB of data? Choose the best option.

A. I will employ the quicksort algorithm to perform the sorting.
B. I will first sort the first 100 GB of data, then the second 100 GB of data, and so on. Finally, I will do a final sort over all the intermediate outputs.
C. I will do a combination of A and B.
D. I will first sort the first 64 GB of data, then the second 64 GB of data, and so on. Finally, I will do a final sort over all the intermediate outputs.
E. I will do a combination of A and D.
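
The chunk-then-merge idea in the options is essentially an external merge sort. Below is a minimal sketch (an assumed example, scaled down to a text file of integers, one per line; the class name and chunk size are arbitrary): chunks that fit in memory are sorted and written out as runs, then the runs are merged with a priority queue.

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class ExternalSort {
        static final int CHUNK_LINES = 1_000_000;        // lines per in-memory chunk

        public static void sort(Path input, Path output) throws IOException {
            List<Path> runs = new ArrayList<>();

            // Phase 1: sort the input chunk by chunk and write each sorted run to disk.
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<Long> chunk = new ArrayList<>(CHUNK_LINES);
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(Long.parseLong(line.trim()));
                    if (chunk.size() == CHUNK_LINES) runs.add(writeRun(chunk));
                }
                if (!chunk.isEmpty()) runs.add(writeRun(chunk));
            }

            // Phase 2: k-way merge of the sorted runs using a min-heap of {value, run index}.
            List<BufferedReader> readers = new ArrayList<>();
            PriorityQueue<long[]> heap =
                new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[0]));
            for (int i = 0; i < runs.size(); i++) {
                BufferedReader r = Files.newBufferedReader(runs.get(i));
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new long[]{Long.parseLong(first), i});
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                while (!heap.isEmpty()) {
                    long[] smallest = heap.poll();
                    out.write(Long.toString(smallest[0]));
                    out.newLine();
                    String next = readers.get((int) smallest[1]).readLine();
                    if (next != null) heap.add(new long[]{Long.parseLong(next), smallest[1]});
                }
            }
            for (BufferedReader r : readers) r.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }

        private static Path writeRun(List<Long> chunk) throws IOException {
            Collections.sort(chunk);                     // in-memory sort of one chunk
            Path run = Files.createTempFile("run", ".txt");
            try (BufferedWriter out = Files.newBufferedWriter(run)) {
                for (long v : chunk) { out.write(Long.toString(v)); out.newLine(); }
            }
            chunk.clear();
            return run;
        }
    }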

Q5. You have 10 million records in a traditional database and you fire a query with a "where" clause, such as "SELECT * FROM employeeTable WHERE emp_salary > 100000". Choose the options that best describe this operation. (Choose 2 options)

Traditional databases are row oriented in nature, with rows and columns tightly coupled together, so the database takes the entire 10 million records into memory, does the table scan, and gives back the result. Since this operation is done in memory, it is very fast.
Traditional databases are row oriented in nature, with rows and columns tightly coupled together, so the database takes the entire 10 million records into memory, does the table scan, and gives back the result. It is very slow, since it performs the comparison operation against every record.
The internal data structures of the database are B+ trees, so when the database size grows beyond a few hundred GB, retrieving the data becomes very slow.
Traditional databases are column oriented in nature, where importance is given to columns instead of rows, so the database knows which column to select depending on the condition, and it is very fast.
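
To make the table-scan options concrete, here is a minimal illustration (an assumed example; the class and field names are arbitrary) of what an unindexed WHERE clause amounts to in a row-oriented store: every record is visited and the predicate is evaluated against it.

    import java.util.ArrayList;
    import java.util.List;

    public class TableScan {
        static class Employee {
            final int id; final String name; final double salary;
            Employee(int id, String name, double salary) {
                this.id = id; this.name = name; this.salary = salary;
            }
        }

        // Roughly: SELECT * FROM employeeTable WHERE emp_salary > 100000
        static List<Employee> filterBySalary(List<Employee> table, double minSalary) {
            List<Employee> result = new ArrayList<>();
            for (Employee row : table) {                 // one comparison per record
                if (row.salary > minSalary) result.add(row);
            }
            return result;
        }
    }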

Q6. Choose the options that best describe traditional databases. (Choose the best options)

o Rigid schema
o Dynamic column creation
o No dynamic column creation
o Table scanning
o No table scanning

Q7. If you are storing unstructured data in a traditional RDBMS, which option best describes this scenario?

A. I can store unstructured data in a traditional RDBMS, but I will be storing a lot of nulls along with the data.
B. I cannot store unstructured data.
C. I can store unstructured data, but the problem is that I will be continuously changing the schema depending on the columns.
D. Both A and C

Q8. What is an efficient way of serializing data over the network? Assume that you are passing some integer values from machine 1 to machine 2.

I will transfer the data as it is.
I will convert the data into binary and then transfer it.
I will compress the integer values using some compression algorithm and then transfer them.
I will convert the integer values into binary, apply a compression algorithm, and then transfer the data.
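
For illustration, here is a minimal sketch (an assumed example; the class name is arbitrary and GZIP stands in for "some compression algorithm") of writing integers as compact binary and compressing the resulting byte stream before it goes on the wire:

    import java.io.*;
    import java.util.zip.GZIPOutputStream;

    public class IntSerializer {
        static byte[] serialize(int[] values) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out =
                     new DataOutputStream(new GZIPOutputStream(buffer))) {
                out.writeInt(values.length);              // record count first
                for (int v : values) out.writeInt(v);     // 4 bytes per value, binary
            }
            return buffer.toByteArray();                  // compressed binary payload
        }
    }

The receiving machine would reverse the steps with a GZIPInputStream wrapped in a DataInputStream.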

Q9. Choose the best option for why you would want to go for distributed computing.

A. Because I am facing a storage problem
B. Because I am facing a performance problem
C. Both A and B
D. None of the above

Q10. What is data localization?

Executing the application on the machine where the data resides
Moving the data to the machine where the application is running
Distributing the data across several machines
Localizing the entire data on a single machine
None of these

Q11. What is the disadvantage of traditional applications/systems?

Data is local to the application
Data is moved to the application over a low-bandwidth network
Finite network bandwidth is used
Reading is slow
None of these
