MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR) published in Mining Software Repositories 2009 Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed E. Hassan Presenter: Jihun Park SELab 2013.06.07 LAB Seminar


MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)

published in Mining Software Repositories 2009

Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed E. Hassan

Presenter: Jihun Park

SELab

2013.06.07

LAB Seminar

Mining Software Repositories

Version control system: source code, patch (for a revision), author info.

Bug tracking system: cause of bugs, severity of bugs, status of bugs.

Research Questions

• What kinds of patches are likely to be buggy? (LOC, number of methods, ...)

• Can we use XX information about code entities to predict additional change locations? (co-change, structure, ...)

• How does software evolve? (When does refactoring occur?, ...)

• How can we predict buggy files / patches? (code change complexity, machine learning algorithms)

Mining Software Repositories

[Figure: a revision history (Revision 324, Revision 332, Revision 352, Revision 370) contrasting a revision with a patch for one file and a revision with patches for many files; label: Suggestion]

Mining Software Repositories

• Figure out modified methods

• Identify the modified lines of code

• Identify the relationship between changed methods and others

• Connect fix revision to bug report

• …

Repositories keep getting bigger!

Commit log: Fix bug 12345
=========================
void foo(){
+++ int a = 0;
----- int a = 1;
}
void bar(){
}
…
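One of the tasks listed on this slide, connecting a fix revision to a bug report, can be sketched by scanning commit logs for bug IDs. This is a minimal Python sketch: the regular expression, the function name `extract_bug_ids`, and the sample messages are illustrative assumptions, not the tooling used in the paper.

```python
import re

# Hypothetical pattern: matches "bug 12345", "Bug #12345", or "bug: 12345".
BUG_ID_PATTERN = re.compile(r"\bbug\s*[#:]?\s*(\d+)", re.IGNORECASE)


def extract_bug_ids(commit_log: str) -> list[str]:
    """Return the bug IDs mentioned in a commit log message, in order."""
    return BUG_ID_PATTERN.findall(commit_log)


print(extract_bug_ids("Commit log: Fix bug 12345"))
```

Real projects follow different commit message conventions, so the pattern would need to be adapted per repository.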


Motivation and Findings

• Motivation

– Mining software repositories is one of the main research areas in software engineering.

– Software repositories are getting bigger.

– Big data analysis techniques (e.g., MapReduce) can facilitate the analysis of large repositories.

• Findings

– It is easy to migrate an existing algorithm to a distributed system.

– MapReduce can improve the analysis speed.


Outline

• Mining Software Repositories

• Motivation

• Big Data Analysis and MapReduce

• Approach

• Experimental Setup

• Evaluation

• My Research Area


Big Data Analysis

• Big data requires exceptional technologies to efficiently process large quantities of data.

• Big data techniques include

– Association rule mining

– Machine learning

– Genetic algorithm

– Pattern recognition

– …

• These methodologies already exist; the problem is scalability.


Scaling out (distributed systems) is always better than scaling up (bigger and more powerful machines)


MapReduce

• A well-known programming model for running parallel, distributed algorithms on a cluster.
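To make the model concrete, here is a minimal single-process Python sketch of the map, shuffle, and reduce steps. It only imitates the programming model in memory and is not Hadoop code; the example task (summing changed files per author) and all names are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (author, number of files changed) per revision.
revisions = [("alice", 3), ("bob", 1), ("alice", 2)]


def mapper(revision):
    author, files_changed = revision
    return [(author, files_changed)]


def reducer(author, counts):
    return sum(counts)


result = reduce_phase(shuffle(map_phase(revisions, mapper)), reducer)
print(result)
```

On a cluster, the framework runs many mapper and reducer instances on different nodes and performs the shuffle over the network; the user still writes only the two functions.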


Approach Overview

• Can MapReduce support MSR research?

1. Adaptability - is it easy to migrate to a MapReduce approach?

2. Efficiency - is it faster than a non-distributed approach?

3. Scalability - does it scale with the size of the input data?

4. Flexibility - does it run on different types of machines?

[Figure: J-REX and its three distributed variants (DJ-REX1, DJ-REX2, DJ-REX3), each consisting of the phases Extraction, Parsing, and Analysis]


J-REX

Extract a series of snapshots for each file

Parse each snapshot into XML format using the JDT AST parser

Analyze the parsed snapshots to get evolutionary change data such as change type, message, time, etc.
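The three phases can be sketched as a small pipeline. This is a toy Python illustration, not J-REX itself: the in-memory repository, the regular expression standing in for the JDT AST parser, and the choice of "added/removed methods" as the change data are all assumptions.

```python
import re

# Hypothetical toy repository: file name -> list of (revision, source text).
REPOSITORY = {
    "Foo.java": [
        (324, "void foo(){ }"),
        (332, "void foo(){ } void bar(){ }"),
    ],
}

# Stand-in for a real parser: finds names of void methods only.
METHOD_PATTERN = re.compile(r"\bvoid\s+(\w+)\s*\(")


def extract(repository, file_name):
    """Extraction: return the series of snapshots for one file."""
    return repository[file_name]


def parse(snapshot):
    """Parsing: reduce a snapshot to its revision and set of method names."""
    revision, source = snapshot
    return revision, set(METHOD_PATTERN.findall(source))


def analyze(parsed_snapshots):
    """Analysis: compare consecutive snapshots and report method changes."""
    changes = []
    for (_, old), (revision, new) in zip(parsed_snapshots, parsed_snapshots[1:]):
        changes.append({
            "revision": revision,
            "added": sorted(new - old),
            "removed": sorted(old - new),
        })
    return changes


snapshots = extract(REPOSITORY, "Foo.java")
parsed = [parse(snapshot) for snapshot in snapshots]
changes = analyze(parsed)
print(changes)
```

Because each file's snapshots are processed independently, the work divides naturally by file, which is what makes the pipeline a candidate for distribution.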


MapReduce Strategy for DJ-REX

• DJ-REX1: Extraction and parsing are done by one machine, then distributed machines do the analysis.

• DJ-REX2: Extraction is done by one machine, then the remaining phases are done by distributed machines.

• DJ-REX3: Every phase is done by distributed machines.
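The difference between the three strategies is only which phases run in parallel. The Python sketch below illustrates that idea under loud assumptions: a local thread pool stands in for the Hadoop cluster, and `tag` is a hypothetical unit of work that merely records which phases touched an item.

```python
from concurrent.futures import ThreadPoolExecutor

PHASES = ("extraction", "parsing", "analysis")

# Which phases each flavor distributes, following the list on this slide.
DISTRIBUTED_PHASES = {
    "J-REX": set(),
    "DJ-REX1": {"analysis"},
    "DJ-REX2": {"parsing", "analysis"},
    "DJ-REX3": {"extraction", "parsing", "analysis"},
}


def run_phase(phase, items, work, distributed, workers=4):
    """Run one phase over all items, in parallel if it is distributed."""
    if distributed:
        # Threads are only a stand-in for cluster nodes in this sketch.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda item: work(phase, item), items))
    return [work(phase, item) for item in items]


def run_pipeline(flavor, files, work):
    """Run the three phases in order; each consumes the previous output."""
    items = list(files)
    for phase in PHASES:
        distributed = phase in DISTRIBUTED_PHASES[flavor]
        items = run_phase(phase, items, work, distributed)
    return items


def tag(phase, item):
    """Hypothetical unit of work: records which phase touched the item."""
    return f"{item}>{phase}"


print(run_pipeline("DJ-REX2", ["A.java", "B.java"], tag))
```

All flavors produce the same result; they differ only in how much of the work can be spread across machines.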


Experimental setup

• Two server machines and two desktop machines.

• The server machines have SSDs, which are much faster than regular hard disks.

• Experiments were run on three open source projects.

Table: Characteristics of Eclipse, BIRT, and Datatools

Project     Repository size   # Source code files   Length of history   # Revisions
Datatools   394 MB            10,552                2 years             2,398
BIRT        810 MB            13,002                4 years             19,583
Eclipse     4.2 GB            56,851                8 years             82,682


1. Adaptability

• Does migrating to Hadoop take a long time?

• Migration is very easy.

– Hadoop provides built-in input split classes such as “MultiFileSplit” and “DBInputSplit”.

– Hadoop has well-defined and simple APIs.

– Several code examples are available.

J-REX logic                      No change
MapReduce strategy for DJ-REX1   400 LOC, 2 hours
MapReduce strategy for DJ-REX2   400 LOC, 2 hours
MapReduce strategy for DJ-REX3   300 LOC, 1 hour
Deployment configuration         1 hour
Reconfiguration                  1 minute


2. Efficiency

Table: Experimental results for DJ-REX in Hadoop (running times in h:mm:ss)

Repository   Desktop   Server     Strategy   2 nodes   3 nodes   4 nodes
Datatools    0:35:50   0:34:14    DJ-REX3    0:19:52   0:14:32   0:16:40
BIRT         2:44:09   2:05:55    DJ-REX1    2:03:51   2:05:02   2:16:03
                                  DJ-REX2    1:40:22   1:40:32   1:47:26
                                  DJ-REX3    1:08:36   0:50:33   0:45:16
Eclipse      -         12:35:34   DJ-REX3    -         -         3:49:05

• The experiment yields two main sub-conclusions on efficiency.

– A faster machine can speed up the mining process.

– All DJ-REX solutions outperform the non-distributed approach.
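The size of the improvement can be worked out from the running times in the results table. The Python sketch below computes speedups for three rows; the helper names `to_seconds` and `speedup` are assumptions, and the choice of baseline (desktop where available, server for Eclipse) is ours rather than the paper's.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(baseline: str, distributed: str) -> float:
    """Ratio of baseline running time to distributed running time."""
    return to_seconds(baseline) / to_seconds(distributed)


# Running times taken from the results table on this slide.
birt = speedup("2:44:09", "0:45:16")       # desktop vs. DJ-REX3 on 4 nodes
eclipse = speedup("12:35:34", "3:49:05")   # server vs. DJ-REX3 on 4 nodes
datatools = speedup("0:35:50", "0:14:32")  # desktop vs. DJ-REX3 on 3 nodes

print(f"BIRT: {birt:.2f}x, Eclipse: {eclipse:.2f}x, Datatools: {datatools:.2f}x")
```

With these inputs the speedups come out at roughly 3.6x for BIRT, 3.3x for Eclipse, and 2.5x for Datatools.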


2. Efficiency (cont’d)

• Preprocessing time is the time needed for the non-distributed phases.

• The time to copy data increases as nodes are added.

• The fully distributed DJ-REX3 is the most efficient.

Figure: Comparison of the running time of the 3 flavors of DJ-REX for BIRT


3. Scalability

• The bigger the repository is, the more time can be saved by Hadoop.

• Hadoop scales well across different numbers of nodes.

• The overhead of copying input data to an additional node can outweigh the benefit of parallelizing tasks on that node.

Figure: Running time comparison for BIRT and Datatools with DJ-REX3
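The trade-off between parallel work and copy overhead can be illustrated with a toy cost model. This is purely an assumption for illustration and is not fitted to the experiments: the work is taken to divide evenly across nodes, while the copy overhead is taken to grow linearly with the number of nodes.

```python
def running_time(nodes: int, work: float, copy_cost: float) -> float:
    """Toy model: parallel work shrinks with nodes, copy overhead grows."""
    return work / nodes + copy_cost * nodes


def best_node_count(work: float, copy_cost: float, max_nodes: int) -> int:
    """Return the node count in 1..max_nodes with the lowest modeled time."""
    return min(range(1, max_nodes + 1),
               key=lambda n: running_time(n, work, copy_cost))


# Hypothetical numbers: a small job and a job ten times larger.
small = best_node_count(work=3600, copy_cost=400, max_nodes=4)
large = best_node_count(work=36000, copy_cost=400, max_nodes=4)
print(small, large)
```

Under this model a small job stops benefiting from extra nodes earlier than a large one, which matches the qualitative observation that bigger repositories save more time.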


4. Flexibility

• Hadoop runs on many different platforms (Windows, Mac, Unix, etc.)

• In this experiment, several different machines (two desktops and two servers) were used.

• Load balancing in Hadoop ensures a fair distribution of work.


Conclusion

• It is easy to migrate an existing mining algorithm to a distributed system.

• MapReduce, a big data analysis technique, can improve the analysis speed.

• Data distribution incurs overhead, which determines the optimal number of nodes.

• Adding a machine is very easy, which means the approach is scalable.


Discussion

• The MapReduce approach can also be used for other MSR tasks, such as splitting mailing lists and mapping bug reports.

• Copying data into HDFS can add overhead; finding the optimal Hadoop configuration is future work.


My Research Area

[Figure: development history linking bug reports to fix commits. Bug 22 (Type I bug) has a single fix commit (Fix #22); Bug 31 (Type II bug) has an initial patch followed by supplementary patches (three Fix #31 commits in total).]

Type I bugs: the bug IDs that were mentioned in only one commit.

Type II bugs: the bug IDs that were mentioned in multiple fix revisions.
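The two definitions can be sketched as a simple count over fix commits. This Python sketch assumes, as the figure suggests, that bugs mentioned in exactly one commit are the first type (labeled "Type I" here) and bugs mentioned in several fix revisions are the second ("Type II"); the revision numbers and the function name `classify_bugs` are hypothetical.

```python
from collections import Counter

# Hypothetical fix commits: (revision, bug ID mentioned in the commit log).
fix_commits = [(101, 22), (102, 31), (105, 31), (109, 31)]


def classify_bugs(commits):
    """Label each bug ID by how many fix commits mention it."""
    mentions = Counter(bug_id for _, bug_id in commits)
    return {bug_id: "Type I" if count == 1 else "Type II"
            for bug_id, count in mentions.items()}


print(classify_bugs(fix_commits))
```

For a Type II bug, the earliest fix commit would be the initial patch and the later ones the supplementary patches.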


My Research Area

• Empirical results

– A considerable portion of bugs require supplementary patches.

– Type II bugs are more severe.

– Type II bugs take a long time to fix.

– Incomplete patches are larger in size and more scattered.

• Can previous prediction approaches be used to predict supplementary change locations?

– Code Clone

– Co-Change

– Structural relationship (Inheritance, method calls, etc.)
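Of the approaches listed, co-change is the simplest to illustrate: files that often changed together in the past are suggested as additional change locations. The Python sketch below is a minimal version of that idea; the history data, the function names, and the `min_support` threshold are assumptions, not the technique evaluated in this research.

```python
from collections import Counter
from itertools import combinations

# Hypothetical history: the set of files changed in each revision.
history = [
    {"A.java", "B.java"},
    {"A.java", "B.java", "C.java"},
    {"A.java", "C.java"},
]


def co_change_counts(revisions):
    """Count how often each pair of files changed in the same revision."""
    counts = Counter()
    for files in revisions:
        counts.update(combinations(sorted(files), 2))
    return counts


def suggest(changed_file, counts, min_support=2):
    """Suggest files that co-changed with changed_file often enough."""
    suggestions = []
    for (first, second), count in counts.items():
        if count >= min_support and changed_file in (first, second):
            suggestions.append(second if first == changed_file else first)
    return sorted(suggestions)


print(suggest("A.java", co_change_counts(history)))
```

A threshold like `min_support` trades recall for precision: lowering it suggests more files but makes more of the suggestions wrong.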


My Research Area

• Existing approaches are not enough to predict supplementary change locations!

– Only a small portion of supplementary patches are code clones of the initial patch.

– A considerable portion of supplementary patches cannot be predicted using structural dependencies (e.g., method calls, inheritance).

– Historical co-change has low prediction precision.

• How can we predict additional change locations?

• If you are interested in the MSR area, feel free to contact me.


Thank you for listening


C-REX Change schema


Three DJ-REX approaches

[Figure: J-REX and its three distributed variants (DJ-REX1, DJ-REX2, DJ-REX3), each consisting of the phases Extraction, Parsing, and Analysis]

• DJ-REX1: Use MapReduce for the analysis phase.

• DJ-REX2: Use MapReduce for the parsing and analysis phases.

• DJ-REX3: Use MapReduce for all phases.


Running time of the basic J-REX on a desktop and server machine, and of DJ-REX3 on 3 virtual machines on the same server machine