
A Data Locality Optimization Algorithm for Large-scale Data Processing in Hadoop

Yanrong Zhao, Weiping Wang, Dan Meng, Xiufeng Yang

Institute of Computing Technology, Chinese Academy of Sciences

Graduate University, Chinese Academy of Sciences, Beijing, China

{zhaoyanrong, wpwang, md, yangxiufeng}@ncic.ac.cn

Shubin Zhang, Jun Li, Gang Guan
Data Platform Department

Tencent, Shenzhen, China

{brantzhang, joeyli, gangguan}@tencent.com

Abstract—Data-intensive applications are increasingly designed to execute on large computing clusters. Our observations of Tencent production systems indicate that the join query is one of the most important queries in large-scale data processing. When a join query runs on the Hive system, its job is divided into a map phase and a reduce phase, and large amounts of intermediate results must be transferred over the network, which is inefficient. In this paper, we propose an algorithm called CHMJ whose general idea is to exploit data locality to accelerate computation. It includes four parts: a data distribution strategy, the parallel HashMapJoin algorithm, CoLocation scheduling, and a delay scheduling strategy. CHMJ has been adopted in the Tencent data warehouse and plays an important role in Tencent's daily operations. Our experiments demonstrate the feasibility and efficiency of the solution.

Keywords-Hadoop; MapReduce; join query

I. INTRODUCTION

With the rapid development of the Internet, we have entered an era of data explosion. The MapReduce framework is increasingly being used to analyze large-scale data, and Hadoop MapReduce is one of its open-source implementations.

Our observations of Tencent production systems indicate that, in OLAP applications, the join query is one of the most frequently used queries, taking up more than 70% of the time spent in data analysis and processing. Traditional join processing based on Hadoop MapReduce[1][2] uses SortMergeReduceJoin (Reduce Join). The job of a join query is divided into two phases, a map phase and a reduce phase. The map side may produce a large amount of intermediate data, which must be transferred to the reduce side; transferring that much data inevitably causes high network bandwidth usage. As map output copies accumulate on the reduce side's disk, a background thread must also merge the copies from many map tasks into larger sorted files over several rounds. Therefore, the reduce phase often takes a long time to execute. However, on production systems, many queries are response-time critical in order to satisfy both real-time requests and the heavy workloads of decision-support queries submitted by highly concurrent users.

Our work is originally motivated by this frequent use of join queries. In this paper, we propose a novel join query processing algorithm called CHMJ (CoLocation Hash Map Join). Our contributions include: First, we designed a new data distribution algorithm that distributes table data over the cluster according to the hash values of the join attribute, which improves data locality and ensures data availability. Second, on the basis of this data distribution, we propose a parallel join query processing algorithm to improve the efficiency of join queries. Third, we propose CoLocation scheduling for fault tolerance. Finally, we introduce a delay scheduling strategy into the job scheduler to raise data locality. We compare the performance of CHMJ with other join algorithms and conclude that data distribution is very important for join queries, yet modern parallel join algorithms based on Hadoop ignore it.

Our experimental results show that CHMJ significantly improves the efficiency of join queries, with an execution time of only about 20% of the traditional SortMergeReduceJoin. It also reduces network I/O significantly. CHMJ has been adopted in the Tencent Distributed Data Warehouse (TDW) by Tencent, China's largest and most used Internet service portal. The underlying principle of this method is also instructive for GroupBy query optimization.

The remainder of the paper is organized as follows: Section II describes related work and traditional query processing algorithms. Section III presents CHMJ, a new data-locality join algorithm. In Section IV, we present the experimental results of CHMJ. Finally, we conclude in Section V.

II. RELATED WORK

Large-scale data processing systems based on cloud computing technology have developed rapidly. Distributed processing frameworks, such as Yahoo's Hadoop and Google's MapReduce, have been successful at harnessing expansive data center resources for large-scale data analysis. Figure 1 shows the current large-scale data processing software stack.


Figure 1. Current large-scale data processing software stack

The interface layer provides programming APIs, shell, or web GUI interfaces. When users submit SQL or other statements, the interface layer translates them into jobs for the execution layer. At present, widely used interface-layer software includes Hive[3][4][5], Pig[6], SCOPE[7], DryadLINQ[8], and so on.

The execution layer is responsible for job execution. MapReduce, proposed by Google in 2004, has become one of the most popular programming models at this layer and can be applied in large clusters for large-scale data analysis and processing. Microsoft's Dryad[9] is another popular programming model for large-scale data analysis and processing.

The storage layer provides data storage and access services. HDFS is among the most commonly used storage-layer software. It is a distributed, scalable, and portable file system designed for large-scale distributed data processing under frameworks such as MapReduce.

A. Traditional join query processing algorithms

Join is among the most frequently executed operations in OLAP processing. Over the past few decades, significant efforts have been made to develop efficient join algorithms. Traditional join algorithms include the nested-loop join (NLJ), the sort-merge join (SMJ), and the hash join (HJ). Among these, the sort-merge join was dominant in early relational database systems; later, the hash join became popular. The hash join is usually faster than the sort-merge join, but it puts a considerable load on memory for storing the full hash table. Consider two datasets P and Q. The algorithm for a hash join looks like this:

for all p ∈ P do
    load p into in-memory hash table H
end for
for all q ∈ Q do
    if H contains a p matching q then
        add <p, q> to the result
    end if
end for
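As a concrete illustration, a minimal in-memory version of this build-and-probe pattern in Java might look like the following sketch, where P is loaded into the hash table and Q probes it. The String[] tuple representation and key-column parameters are illustrative assumptions, not code from any system discussed here.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimpleHashJoin {
    // Build a hash table on `build` (e.g. P), then probe it with `probe` (e.g. Q);
    // emits the concatenation of each matching pair of tuples.
    static List<String[]> join(List<String[]> build, List<String[]> probe,
                               int buildKey, int probeKey) {
        // Build phase: index every build-side tuple by its join key.
        Map<String, List<String[]>> h = new HashMap<>();
        for (String[] b : build) {
            h.computeIfAbsent(b[buildKey], k -> new ArrayList<>()).add(b);
        }
        // Probe phase: stream the probe side and emit <build, probe> matches.
        List<String[]> out = new ArrayList<>();
        for (String[] p : probe) {
            for (String[] b : h.getOrDefault(p[probeKey], Collections.emptyList())) {
                String[] joined = new String[b.length + p.length];
                System.arraycopy(b, 0, joined, 0, b.length);
                System.arraycopy(p, 0, joined, b.length, p.length);
                out.add(joined);
            }
        }
        return out;
    }
}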

Many variants have also been designed for in-memory databases[10][11][12] and parallel databases[13][14][15]. Parallel algorithms greatly improve the performance of the relational join in shared-nothing and shared-memory systems.

B. Parallel join query processing algorithms on Hadoop MapReduce

Hadoop MapReduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Modern join query processing based on Hadoop MapReduce, such as Hive's, uses SortMergeReduceJoin (a parallel variant of the sort-merge join). It is straightforward, consisting of a map phase and a reduce phase in sequence, with the reduce side producing the final output.

Under the Hadoop MapReduce framework, a map task is started for each DFS block and sequentially processes each row group in the block. The output of the map task is written to local disk and partitioned, with one partition per reduce task. A reduce task has no data-locality advantage: it is fed by many map tasks and uses a small number of copier threads to fetch map output in parallel. As map output copies accumulate on disk, a background thread merges them into larger, sorted files over several rounds. During the reduce phase, the reduce side outputs the final results. The SortMergeReduceJoin algorithm is illustrated in Figure 2.

Figure 2. Process of SortMergeReduceJoin. [The figure traces a worked example: a page_view table (pageid, userid, time) is joined with a user table (user id, age, gender) on the user id. In the map phase each input table is emitted as key-value pairs keyed on the join column; the shuffle, sort, partition, and combine steps group pairs with the same key onto the same reducer; and the reduce phase emits the joined (pageid, age) rows.]

SortMergeReduceJoin is inefficient, as Figure 2 suggests. First, the reduce phase cannot start until the map phase completes. Second, reduce tasks have no data-locality guarantee and must fetch data from other nodes. Third, the reduce side needs sort/merge operations to combine the copied data into larger, sorted files over several rounds. Therefore, the sort-merge approach is not always the most efficient way to perform a join query.
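To make the contrast concrete, the reduce side of such a repartition join is essentially the following generic sketch (the table tag in column 0 is an illustrative convention; this is not Hive's actual code):

import java.util.ArrayList;
import java.util.List;

class ReduceSideJoin {
    // Called once per join key, after the shuffle and sort have grouped all
    // tagged tuples from both tables; emits their cross product.
    static List<String[]> reduce(String key, List<String[]> taggedTuples) {
        List<String[]> left = new ArrayList<>();
        List<String[]> right = new ArrayList<>();
        for (String[] t : taggedTuples) {
            // t[0] tags which table the tuple came from; t[1..] are its columns.
            if ("1".equals(t[0])) left.add(t); else right.add(t);
        }
        List<String[]> out = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : right) {
                out.add(new String[] { key, l[1], r[1] }); // one joined row
            }
        }
        return out;
    }
}

Every tagged tuple must cross the network to reach the reducer responsible for its key, which is exactly the cost CHMJ sets out to avoid.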

Hive also implements a parallel version of the hash join (HashJoin). However, it can only be used when one input is a small file that can be replicated to each map task.

III. CHMJ JOIN QUERY OPTIMIZATION ALGORITHM

Our previous work has indicated that, in distributed systems, data distribution has a significant impact on join performance. Although many parallel join algorithms have been studied extensively on Hadoop systems, such as Hive's SortMergeReduceJoin and HashJoin, all of these algorithms ignore the importance of data distribution to join query performance.


CHMJ is a map-side join query optimization algorithm. In comparison with previous parallel join algorithms, CHMJ takes data distribution into account. It comprises four parts, a data distribution strategy, the parallel HashMapJoin algorithm, CoLocation scheduling, and a delay scheduling strategy, which together raise data locality to accelerate computation and reduce network I/O.

A. CoLocation Data Distribution Strategy

The default data placement policy of Hadoop tries to balance load by placing blocks randomly; it does not take any data characteristics into account. Therefore, relevant data can end up spread randomly over the cluster.

We have designed and implemented a new data distribution strategy for Hadoop called CoLocation. It extends Hadoop to collocate relevant data at the file system level, in order to exploit data locality to accelerate computation and reduce network I/O. CoLocation can improve the efficiency of many operations, such as join, GroupBy, and column-store operations.

We change the underlying data storage strategy of Hadoop DFS. The storage layer of TDW automatically identifies the type of data through its path. If the data belongs to a hash partition or a column store, the storage layer stores it according to the CoLocation strategy; otherwise, the data is stored in accordance with the default rack-aware strategy[16] of Hadoop DFS.

CoLocation uses a consistent-hashing-like algorithm called the Cyclic Consistent Hashing algorithm (CCH), a data distribution algorithm based on consistent hashing[17]. Unlike the original Consistent Hash algorithm, which does not address the problem of multiple copies, CCH supports storing multiple copies of the same data. It also supports the Dynamo[18] concept of virtual nodes, where multiple virtual nodes can be mapped to one physical node.

Figure 3. Cyclic Consistent Hash algorithm. [The figure places nodes A through G on a hash ring according to configured hash numbers. A block with three replicas is stored on its coordinator node and the next two nodes clockwise; a block with a single replica is stored on its coordinator alone.]

The Cyclic Consistent Hashing algorithm proceeds in the following steps:

• Configure a hash ring via a configuration file.
• Each node in the system is assigned a value within this space, representing its position on the hash ring, also via the configuration file. Each node thereby becomes responsible for the region of the ring between itself and its predecessor.
• If a block belongs to a partitioned table, its position on the ring is derived from the hash partition serial number encoded in its distributed file system path. If the block belongs to a column store, its position is the hash value of the path itself.
• After the block's position is determined, the algorithm walks the ring clockwise to find the first node with a position larger than the block's. That node is deemed the coordinator for this position and stores the first replica of the block. If the number of replicas is N, the remaining replicas are placed on the N-1 successors of the coordinator on the ring.
• If a node is in an abnormal condition, for example its disk is full, the node is skipped and the algorithm tries the next node clockwise.

In this way, blocks are placed on the hash ring according to their values, and each block is mapped to N consecutive nodes clockwise on the ring, as sketched below.
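A minimal sketch of this placement logic, assuming a TreeMap-based ring keyed by the configured positions (the class and method names are illustrative; this is not the TDW implementation):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

class CyclicConsistentHash {
    // Ring position -> node id, loaded from the hash ring configuration file.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final Set<String> abnormal = new HashSet<>(); // e.g. disk-full nodes

    void addNode(long position, String node) { ring.put(position, node); }
    void markAbnormal(String node) { abnormal.add(node); }

    // Choose N distinct nodes for a block: the coordinator (first node clockwise
    // with a larger position) and then its successors, skipping abnormal nodes.
    List<String> placeReplicas(long blockPosition, int n) {
        List<String> chosen = new ArrayList<>();
        if (ring.isEmpty()) return chosen;
        Long key = ring.higherKey(blockPosition);
        for (int steps = 0; steps < ring.size() && chosen.size() < n; steps++) {
            if (key == null) key = ring.firstKey(); // wrap around the ring
            String node = ring.get(key);
            // Skip abnormal nodes and duplicates (several virtual nodes may map
            // to the same physical node).
            if (!abnormal.contains(node) && !chosen.contains(node)) {
                chosen.add(node);
            }
            key = ring.higherKey(key);
        }
        return chosen;
    }
}

The blockPosition argument would come from the hash partition serial number or the path hash, as described in the steps above.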


Figure 4. Node failures in Cyclic Consistent Hash algorithm

In distributed systems, node failures are the norm rather than the exception. For example, as shown in Figure 4, a block with N (N=3) replicas is stored on nodes E, F, and A. When node E goes down, the block loses a replica, so the failure detection system starts to work: the master randomly chooses a source from node F and node A and replicates the block to a new node, node B.

The Cyclic Consistent Hashing algorithm ensures that adding a node to or removing a node from the cluster does not significantly change the mapping of data to nodes. It also avoids scattering relevant data over many different nodes: no matter how many files a hash partition contains, or how many blocks a file contains, all of these relevant blocks are stored on the same set of nodes.

B. Parallel HashMapJoin Algorithm

SortMergeReduceJoin seems like the natural way to join datasets using MapReduce. It uses the framework's built-in capability to sort the intermediate key-value pairs before they reach the reducer, and the reducer must also merge the copied data into larger, sorted files over several rounds. This is usually a very time-consuming step and is inefficient. Therefore, we offer HashMapJoin.

HashMapJoin is a new parallel hash join algorithm for Hadoop MapReduce, which allows Hadoop to quickly fuse two data sets together based on pre-computed hash values. It is not only easy to parallelize but also able to handle large tables efficiently. HashMapJoin and hash partitioning are implemented at the interface layer of our TDW in about 20,000 lines of Java code. Users submit HashMapJoin queries via Tencent-SQL statements, which are ultimately translated into the corresponding MapReduce jobs through syntax analysis, semantic analysis, query optimization, and other steps. HashMapJoin is triggered by a hint in the SQL query of the following form:

SELECT /*+ hashmapjoin(a) */ a.key, a.val, b.val
FROM table1 a JOIN table2 b ON a.key = b.key

A traditional map task deals with a block, while a HashMapJoin map task deals with hash partitions. Figure 5 illustrates a map task of HashMapJoin.

Figure 5. A map task of HashMapJoin

Hash partitioning is independent of the join algorithm itself, but it is a necessary condition for HashMapJoin. The partitions are formed by applying a hash function to the join key, which guarantees that any two joining tuples fall into the same pair of partitions: tuples from the two tables can only join if they hash into partitions with the same index. The task of joining two large inputs is thus reduced to multiple smaller instances of the same task. The algorithm is as follows:

// create partitioned tables
for all tuples p in table P do
    hash on the join attribute p(a);
    put p into the appropriate hash partition P[i];
end for
for all tuples q in table Q do
    hash on the join attribute q(b);
    put q into the appropriate hash partition Q[i];
end for
// map tasks are executed in parallel
for each map task[i] in the MapReduce tasks do
    map task[i] loads the hash partition Q[i] into memory;
    for each tuple s in P[i] do
        map task[i] uses s to probe the hash partition Q[i];
        output any matches to the result relation;
    end for
end for

Based on hash partitioning, HashMapJoin completes its work entirely on the map side, with no reduce side needed. The number of hash partitions is ultimately converted into the corresponding number of map tasks; each task does its own work with no interaction with other tasks, and the union of the outputs over all hash partitions is the final result.
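As a rough illustration of this partition-per-task parallelism, the following sketch runs one join per co-partitioned pair, using threads in place of MapReduce map tasks and reusing the SimpleHashJoin helper sketched in Section II. The types and pool size are illustrative assumptions, not TDW code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class HashMapJoinDriver {
    // P.get(i) and Q.get(i) are hash partitions built on the same join key,
    // so every partition pair joins independently, with no shuffle or reduce.
    static List<String[]> run(List<List<String[]>> P, List<List<String[]>> Q,
                              int pKey, int qKey) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<List<String[]>>> parts = new ArrayList<>();
        for (int i = 0; i < P.size(); i++) {
            final int part = i;
            // Each "map task" loads Q[part] into an in-memory hash table and
            // probes it with the tuples of P[part].
            parts.add(pool.submit(
                () -> SimpleHashJoin.join(Q.get(part), P.get(part), qKey, pKey)));
        }
        List<String[]> result = new ArrayList<>();
        for (Future<List<String[]>> f : parts) {
            result.addAll(f.get()); // the union of partition outputs is the final result
        }
        pool.shutdown();
        return result;
    }
}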

C. CoLocation Scheduling

With CoLocation, the data of hash partitions and column stores is stored on the corresponding nodes according to the Cyclic Consistent Hash algorithm. However, over a long period of time, for instance when new nodes join the cluster or the hash ring configuration file is reset, this distribution of data may be destroyed.

To handle this problem, we implemented the CoLocation scheduler in about 4,000 lines of Java. The CoLocation scheduler is responsible for re-placing the scattered data and can be run by the cluster administrator on a live Hadoop DFS cluster node.

The CoLocation master asks the namenode for block information to determine which blocks need to be transferred to their corresponding nodes according to the Cyclic Consistent Hash algorithm. Each node has its own CoLocation scheduling queue in the memory of the CoLocation master; because of memory limitations, each queue stores only part of the block information. The CoLocation scheduler moves blocks from source datanodes to destination datanodes iteratively. At the end of each iteration, it obtains updated block information from the namenode and tries the next iteration, until all blocks achieve CoLocation.

Figure 6. CoLocation Scheduling

Figure 6 illustrates a CoLocation scheduling step in which a block needs to be transferred from datanode4 to datanode2. In the figure, datanode2 and datanode4 are not in the same rack, datanode2 and datanode3 are in the same rack, and datanode3 holds a copy of datanode4's block.

The CoLocation scheduling algorithm takes advantage of the multiple-copies mechanism to cut down network traffic and accelerate CoLocation scheduling. The CoLocation scheduler uses slots to control the degree of parallelism. There is also a system property that limits the scheduler's use of network bandwidth by setting the maximum speed at which a block may be moved from one datanode to another.

The CoLocation master and the datanodes exchange only control flow; there is no data flow between them. The CoLocation master uses instructions to fetch block information from the namenode and asks datanodes to copy data. In Figure 6, datanode2 does not copy the block from datanode4 but from an agent (datanode3) that holds a copy of the block, avoiding cross-rack traffic. The CoLocation master also packages a delete instruction to datanode2; datanode2 forwards the delete instruction to the namenode, which finally uses it to delete the block on datanode4 and update its metadata. The block is thereby transferred from datanode4 to datanode2.
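One way to organize the iteration loop is sketched below; NameNodeClient, BlockMove, the batch size, and the slot count are assumptions made up for illustration, not the TDW interfaces.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CoLocationScheduler {
    // One planned move: copy `block` to `destination` from `agent` (a nearby
    // replica holder), then delete the replica on `source`. Illustrative type.
    static class BlockMove {
        final String block, source, agent, destination;
        BlockMove(String block, String source, String agent, String destination) {
            this.block = block; this.source = source;
            this.agent = agent; this.destination = destination;
        }
    }

    // Hypothetical control-flow interface to the namenode.
    interface NameNodeClient {
        List<BlockMove> misplacedBlocks(int batchLimit); // bounded per-node queues
        void deleteReplica(String block, String datanode);
    }

    void run(NameNodeClient nn) throws InterruptedException {
        while (true) {
            // Each iteration fetches updated block information from the namenode
            // and schedules a bounded batch of moves.
            List<BlockMove> batch = nn.misplacedBlocks(100);
            if (batch.isEmpty()) break; // all blocks achieve CoLocation
            ExecutorService slots = Executors.newFixedThreadPool(10); // parallelism "slots"
            for (BlockMove m : batch) {
                slots.submit(() -> {
                    copyFromAgent(m);                    // data flows datanode-to-datanode
                    nn.deleteReplica(m.block, m.source); // stale replica is then dropped
                });
            }
            slots.shutdown();
            slots.awaitTermination(1, TimeUnit.HOURS); // wait, then try the next iteration
        }
    }

    void copyFromAgent(BlockMove m) { /* bandwidth-limited block transfer */ }
}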

D. Delay Scheduling

We should try our best to assign each map task to a node that holds its input data, to reduce the burden on network I/O. We therefore introduce the delay scheduling strategy[19] into our job scheduler to raise data locality.

When a node requests a task and the head-of-line job cannot launch a local task, we skip that job and look at subsequent jobs; if a job has been waiting long enough, delay scheduling allows it to launch a non-local task to avoid starvation. We also take memory into account, so that a node lacking memory does not accept new tasks. The algorithm is as follows.

Process of the delay scheduling strategy:

when a heartbeat is received from node n:
    if n has a free slot then
        sort the jobs in TDW scheduler order
        for j in jobs do
            if n has enough memory for job j then
                if j has an unlaunched task t with data on n then
                    launch t on n
                    set j.skipCount = 0
                    break
                else if j has an unlaunched task t then
                    if j.skipCount >= D then
                        launch t on n
                        break
                    else
                        set j.skipCount = j.skipCount + 1
                    end if
                end if
            end if
        end for
    end if

When a node requests a task and the candidate job cannot launch a local task, the delay strategy skips it and looks at subsequent jobs. However, if a job has been skipped long enough, we start allowing it to launch non-local tasks to avoid starvation. Assume an M-node cluster in which each node has L slots, so the total number of slots is S = M·L. Let J denote the number of nodes on which the job still has data to process, let D denote the skip count, and let λ denote the probability of data locality we want to achieve.

The probability that a slot offer does not find a local task is $(1 - J/M)^D$. Therefore

$$\lambda = 1 - \left(1 - \frac{J}{M}\right)^{D},$$

and we get

$$D = \log_{\,1 - J/M}(1 - \lambda).$$

Let us assume the replication factor is R. When the job has K tasks left to launch,

$$\frac{J}{M} = 1 - \left(1 - \frac{R}{M}\right)^{K}.$$

Therefore, the probability that the job can launch a local task at this point is

$$\lambda = 1 - \left(1 - \frac{J}{M}\right)^{D} = 1 - \left(1 - \frac{R}{M}\right)^{KD} \ge 1 - e^{-RDK/M}.$$

Averaging this quantity over K = 1 to N, the expected locality for a job, given a skip count D, is at least

$$l(D) = \frac{1}{N}\sum_{K=1}^{N}\left(1 - e^{-RDK/M}\right)
      = 1 - \frac{1}{N}\sum_{K=1}^{N} e^{-RDK/M}
      \ge 1 - \frac{1}{N}\sum_{K=1}^{\infty} e^{-RDK/M}
      = 1 - \frac{e^{-RD/M}}{N\left(1 - e^{-RD/M}\right)}.$$

Solving for $l(D) \ge \lambda$, we find that we need to set

$$D \ge -\frac{M}{R}\,\ln\!\left(\frac{(1-\lambda)N}{1 + (1-\lambda)N}\right).$$
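For intuition, plugging in the parameters of our test cluster below (M = 28 datanodes, R = 3 replicas, N = 500 map tasks) together with an illustrative locality target of λ = 0.95 (our choice here, not a value fixed by the system), the bound gives

$$D \ge -\frac{28}{3}\,\ln\!\left(\frac{0.05 \times 500}{1 + 0.05 \times 500}\right) = -\frac{28}{3}\,\ln\frac{25}{26} \approx 0.37,$$

so skipping each job even once (D = 1) already suffices for 95% expected locality on this configuration.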

IV. EVALUATION

In this section, we evaluate the performance of the CHMJ optimization algorithm. We run our experiments and compare the results with Hive's SortMergeReduceJoin (Reduce Join) to demonstrate the effectiveness of CHMJ. Hive is a data warehouse infrastructure that is also based on the Hadoop MapReduce model.

A. Experimental Setup

We run experiments on a Tencent test cluster with 30 nodes, each equipped with eight processing cores and 16 GB of RAM. One node serves as NameNode and JobTracker, one as the client, and the other 28 as DataNodes and TaskTrackers. The operating system is Linux (kernel version 2.6).

In our experiments, we use 500 hash partitions, the data block size is set to 128 MB, the network bandwidth of each node is 100 Mb/s, and the number of replicas is 3. Table I details the dataset used in the experiments.


TABLE I. DETAILS OF THE DATASET USED IN THE EXPERIMENTS

Data table     Replicas   Data size   Data occupied size   Record number
userprofile    3          15.4 GB     46.2 GB              0.6 billion
qqfriend       3          520 GB      1560 GB              12.7 billion

B. Results and Analysis

1. Full-table join query performance experiment

The SQL statements are shown in Table II. Reduce Join uses H-SQL; HashMapJoin and CHMJ use Tencent-SQL.

TABLE II. SQL STATEMENTS USED IN THE EXPERIMENT

Reduce Join:
insert overwrite table tmp select a.STATIS_MONTH, b.qq1, a.AGE, a.GENDER, b.qq2 from userprofile a join qqfriend b on a.QQ_NUM=b.qq1

HashMapJoin:
insert overwrite table tmp select /*+ hashmapjoin(a)*/ a.STATIS_MONTH, b.qq1, a.AGE, a.GENDER, b.qq2 from userprofile a join qqfriend b on a.QQ_NUM=b.qq1

CHMJ:
(same statement as HashMapJoin, run over CoLocation-distributed data)

The experiment result is presented in Figure 7.

Figure 7. Execution time of Reduce Join, HashMapJoin, CHMJ

As shown above, HashMapJoin performs better than Hive's SortMergeReduceJoin (Reduce Join). However, HashMapJoin alone does not consider data locality, and its map tasks must pull a large amount of data from other nodes. By comparison, CHMJ is HashMapJoin combined with the CoLocation data distribution strategy, so relevant data is stored on the same set of nodes. Therefore, the performance of CHMJ is much better: in our experiment, the execution time of CHMJ is only 26.5% of Reduce Join's.

2. Range join query performance experiment

We limit the query range with userprofile.QQ_NUM < 200000000 in this experiment. Table III shows the SQL statements.

TABLE III. SQL STATEMENTS USED IN THE EXPERIMENT

Reduce Join:
insert overwrite table tmp select a.STATIS_MONTH, b.qq1, a.AGE, a.GENDER, b.qq2 from userprofile a join qqfriend b on a.QQ_NUM=b.qq1 where a.QQ_Num < 200000000

HashMapJoin:
insert overwrite table tmp select /*+ hashmapjoin(a)*/ a.STATIS_MONTH, b.qq1, a.AGE, a.GENDER, b.qq2 from userprofile a join qqfriend b on a.QQ_NUM=b.qq1 where a.QQ_Num < 200000000

CHMJ:
(same statement as HashMapJoin, run over CoLocation-distributed data)

Figure 8. Execution time of Reduce Join, HashMapJoin, CHMJ

As shown in Figure 8, the execution time of HashMapJoin is only about 48.6% of Reduce Join's (a speedup of 2.06), and the execution time of CHMJ is only about 21.6% of Reduce Join's (a speedup of 4.63). This experiment is characterized by a large amount of input data but a smaller amount of output than the full-table join, so network conditions have a much smaller effect on the experiment and the query performance gains are more apparent.

3. Low-selectivity join query performance experiment

We further limit userprofile.QQ_NUM to 126357 in this experiment. Table IV shows the SQL statements.

TABLE IV. SQL STATEMENTS USED IN THE EXPERIMENT

Reduce Join:
insert overwrite table tmp select a.STATIS_MONTH, b.qq1, a.AGE, a.GENDER, b.qq2 from userprofile a join qqfriend b on a.QQ_NUM=b.qq1 where a.QQ_Num = 126357

HashMapJoin:
insert overwrite table tmp select /*+ hashmapjoin(a)*/ a.STATIS_MONTH, b.qq1, a.AGE, a.GENDER, b.qq2 from userprofile a join qqfriend b on a.QQ_NUM=b.qq1 where a.QQ_Num = 126357

CHMJ:
(same statement as HashMapJoin, run over CoLocation-distributed data)

The experiment results are shown in Figure 9.

Figure 9. Execution time of Reduce Join, HashMapJoin, CHMJ

As shown in Figure 9, the execution time of CHMJ is only about 19% of Reduce Join's, a speedup of 5.2. This experiment is characterized by a large amount of input data but a very small amount of output; nearly all input data is read locally, and very few results are written. In such circumstances, network conditions have little effect on the experiment. Compared to the full-table join and range join queries, the performance advantage of CHMJ is even more apparent in the low-selectivity join query.

Figure 10. Average Data locality of HashMapJoin and CHMJ

Another feature of CHMJ is delay scheduling; Figure 10 shows the results. With the help of the delay scheduling strategy, MapReduce jobs easily achieve nearly optimal data locality, which helps improve the performance of CHMJ.

In summary, CHMJ takes advantage of data locality to accelerate computation and significantly improves join query performance. The results show that CHMJ can improve the efficiency of join queries by up to five times compared with Hive.

V. CONCLUSION

Large-scale data storage and analysis is a current research focus, and Hadoop and Hive have become the mainstream technologies. The join query is frequently used in data warehouse systems. When a join query runs on the Hive system, its job is divided into a map phase and a reduce phase and requires transferring large amounts of intermediate results over the network, which is inefficient.

This paper proposed a new join algorithm, CHMJ. First, we designed a new data distribution algorithm called CoLocation, which distributes table data over the cluster according to the hash values of the join attribute, improving data locality and ensuring data availability. Second, on the basis of this data distribution, we proposed a parallel join query processing algorithm called HashMapJoin to improve join query efficiency. Third, we proposed CoLocation scheduling for fault tolerance. Finally, a delay scheduling strategy is introduced to raise data locality. CHMJ has been adopted in the Tencent data warehouse and plays an important role in Tencent's daily operations. The results show that CHMJ can improve the efficiency of join queries by up to five times compared with Hive.

VI. ACKNOWLEDGMENT

We would like to thank the reviewers of this paper for their constructive comments. This research is supported by Tencent Inc. as part of the work on the Tencent Distributed Data Warehouse (TDW), which is intended to replace the current Oracle RAC system at Tencent. The research is also supported by the National Science Foundation of China under Grant No. 60903047 and the Chinese 863 Program under Grant No. 2011AA01A203.

REFERENCES

[1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. OSDI '04, 2004.
[2] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker. Map-reduce-merge: simplified relational data processing on large clusters. SIGMOD '07, 2007.
[3] Apache Hive. http://hadoop.apache.org/hive/.
[4] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy. Hive: a warehousing solution over a map-reduce framework. VLDB, 2009.
[5] A. Thusoo, J. S. Sarma, N. Jain, Zheng Shao, P. Chakka, Ning Zhang, S. Antony, Hao Liu, R. Murthy. Hive: a petabyte scale data warehouse using Hadoop. ICDE, 2010.
[6] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. SIGMOD, 2008.
[7] Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver, Jingren Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 2008, 1(2):1265-1276.
[8] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. OSDI, 2008.
[9] Michael Isard, Mihai Budiu. Dryad: distributed data-parallel programs from sequential building blocks. EuroSys '07, 2007.
[10] P. Boncz, S. Manegold, M. Kersten. Database architecture optimized for the new bottleneck: memory access. VLDB, 1999.
[11] J. Rao, K. A. Ross. Cache conscious indexing for decision-support in main memory. VLDB, 1999.
[12] A. Shatdal, C. Kant, J. F. Naughton. Cache conscious algorithms for relational query processing. VLDB, 1994.
[13] D. DeWitt, J. Gray. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6), June 1992.
[14] D. A. Schneider, D. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. SIGMOD, 1989.
[15] B. Liu, E. Rundensteiner. Revisiting pipelined parallelism in multi-join query processing. VLDB, 2005.
[16] Tom White. Hadoop: The Definitive Guide. O'Reilly, June 2009.
[17] David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, Rina Panigrahy. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. STOC '97, 1997.
[18] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels. Dynamo: Amazon's highly available key-value store. SOSP '07, 2007.
[19] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. EuroSys, 2010.
