revisiting the applicability of the pareto principle to core development teams in open source...

Post on 12-Apr-2017


Is the Pareto Principle Applicable to the Core Teams of GitHub Projects?

Kazuhiro Yamashita

Yasutaka Kamei

Shane McIntosh

Naoyasu Ubayashi

Ahmed E. Hassan

Core developers play a critical role in software development

Core developers are responsible for guiding and coordinating the development of an OSS project. (Nakakoji)

Core developers are the most productive developers, who have made roughly 80% of the total contributions. (Mockus)

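In fact, some argue that core developers in OSS projects follow the Pareto Principle

[Figure: Effort vs. Result. 20% of the effort yields 80% of the result, and the remaining 80% of the effort yields 20% of the result.]

Pareto Principle in Software Development

[Figure: Project Developers vs. Artifacts. 20% of the developers produce 80% of the artifacts, and the remaining 80% of the developers produce 20%.]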

Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle

[Figure: prior studies grouped as Pareto, Non-Pareto, or Other: Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), Dinh-Trong (TSE).]

The results depend on a small number of case study systems

Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle

[Figure: the same studies grouped by reported core team size, "< 10 or 15" or Other: Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), Dinh-Trong (TSE).]

Overview of our study of core teams on GitHub

Applicability of the Pareto Principle

Number of Core Developers

Core and Non-Core Developers Activities

Collecting and analyzing GitHub data to study core team activity

[Figure: study pipeline with the stages Projects, Filter, Heuristics (Core / Non-Core), Classify Commits, and Calc Prop, producing two outputs: Core Team Size and Activity.]

Preprocessing GitHub data to handle forks and duplicates and to remove immature projects

8,510,504 repositories -> 2,496 repositories

Using heuristics to identify core team members

Three heuristics, each producing its own set of core developers: Commit-based, LOC-based, Access-based

Our commit-based core contributor heuristic

Step 1: Sort contributors by their number of commits

[Figure: contributors A, B, C, D ordered by number of commits.]

Step 2: Compute the proportion of commits that each contributor made

Commit ratio: A 60%, B 20%, C 10%, D 10%

Step 3: Core contributors are those developers who fall within the 0.8 cumulative contribution cutoff

Cumulative ratio: A 0.6, A+B 0.8, A+B+C+D 1.0

Num CoreDev: 2

Pct. CoreDev: 2/4*100 = 50%
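The three steps above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `core_developers`, the tie-breaking by name, and the rule that developers are added in sorted order until the cumulative share reaches the cutoff are assumptions chosen so that the A/B/C/D example gives the same answer as the slide.

```python
from fractions import Fraction


def core_developers(commits_by_author, cutoff=Fraction(4, 5)):
    """Return the top contributors who together reach the cumulative cutoff."""
    total = sum(commits_by_author.values())
    if total == 0:
        return []
    # Step 1: sort by number of commits, descending.
    # Ties are broken by name (an assumption, not stated in the slides).
    ranked = sorted(commits_by_author.items(), key=lambda kv: (-kv[1], kv[0]))
    core, cumulative = [], 0
    for author, count in ranked:
        # Steps 2 and 3: stop once the cumulative share has reached the cutoff.
        # Fractions avoid floating-point surprises at exactly 0.8.
        if Fraction(cumulative, total) >= cutoff:
            break
        core.append(author)
        cumulative += count
    return core


example = {"A": 60, "B": 20, "C": 10, "D": 10}
core = core_developers(example)
num_core = len(core)                      # Num CoreDev
pct_core = 100 * num_core / len(example)  # Pct. CoreDev
print(core, num_core, pct_core)  # ['A', 'B'] 2 50.0
```

Under this rule a project whose top contributor alone exceeds 80% of the commits has a core team of one.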

Our approach to studying core team size

Compliance with the Pareto Principle

[Figure: percentage of core developers per project (axis ticks at 10%, 20%, 30%). The example project does not follow the Pareto Principle.]

Stratify projects along the confounding factors: LOC, Total Author, and Age, each divided into Small, Medium, and Large
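The slides do not say how the Small, Medium, and Large groups are formed. As one possible illustration, the sketch below splits projects into three equally sized groups by rank for a single factor. The function name `stratify`, the tertile rule, and the sample LOC values are all assumptions made for this example.

```python
def stratify(values_by_project):
    """Label each project Small, Medium, or Large by rank (tertiles, assumed)."""
    ranked = sorted(values_by_project, key=values_by_project.get)
    n = len(ranked)
    labels = {}
    for i, project in enumerate(ranked):
        # 3 * i // n maps the lowest third to 0, the middle third to 1,
        # and the highest third to 2.
        labels[project] = ("Small", "Medium", "Large")[min(3 * i // n, 2)]
    return labels


# Hypothetical LOC values for six projects.
loc = {"p1": 1200, "p2": 50000, "p3": 8000, "p4": 300000, "p5": 2500, "p6": 90000}
strata = stratify(loc)
print(strata)
```

The same function can be applied separately to the number of authors and to project age, so that core team proportions can be compared within each stratum.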

Core team proportions vary widely

[Figure: distribution of core team proportions, Commit-based, divided by LOC.]

Often, there are 15 or fewer core developers in a project

Number of core developers in projects

Commit-Based: 88%, LOC-Based: 98%, Access-Based: 96%

Overview of our study of core teams on GitHub

Applicability of the Pareto Principle: more than half of the projects do not follow the Pareto Principle

Number of Core Developers: most projects have 15 or fewer core developers

Core and Non-Core Developers Activities

Our approach to studying activity

We classify the commits by using keywords.

Activity      Type                    Keywords
Development   Forward Engineering     implement, add, request
Maintenance   Reengineering           optimiz, adjust
Maintenance   Corrective Engineering  bug, fix, issue, error
Maintenance   Management              license, formatting, TODO
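A keyword classifier of this kind might look like the sketch below. The keyword lists come from the table above. Everything else is an assumption made for illustration: that the keywords are matched against the commit message, that matching is a case-insensitive substring test (which lets the stem "optimiz" match "optimize" and "optimization"), and that one commit may receive several types.

```python
from collections import Counter

KEYWORDS = {
    "Forward Engineering": ["implement", "add", "request"],
    "Reengineering": ["optimiz", "adjust"],
    "Corrective Engineering": ["bug", "fix", "issue", "error"],
    "Management": ["license", "formatting", "todo"],
}


def classify_commit(message):
    """Return every activity type whose keywords occur in the message."""
    text = message.lower()
    return [kind for kind, words in KEYWORDS.items()
            if any(word in text for word in words)]


def activity_proportions(messages):
    """Share of each activity type among all labels assigned to the messages."""
    counts = Counter(kind for m in messages for kind in classify_commit(m))
    total = sum(counts.values())
    return {kind: counts[kind] / total for kind in counts} if total else {}


print(classify_commit("Fix null pointer bug"))    # ['Corrective Engineering']
print(classify_commit("Optimize query planner"))  # ['Reengineering']
print(classify_commit("Add license header"))      # ['Forward Engineering', 'Management']
```

Plain substring matching is crude: "add" also matches "address" and "padding", so a real study would likely need word boundaries or stemming. Computing `activity_proportions` separately for core and non-core commits gives the proportions that the two groups are compared on.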

No big differences in proportions of development activities

[Figure: proportions of development activities under the Commit-Based, LOC-Based, and Access-Based heuristics.]

Overview of our study of core teams on GitHub

Applicability of the Pareto Principle: more than half of the projects do not follow the Pareto Principle

Number of Core Developers: most projects have 15 or fewer core developers

Core and Non-Core Developers Activities: there are no big differences between core and non-core activities

Extremely large core teams may be interesting

Number of projects by core team size:

Heuristic      <=15    16-20   21-50   51-100   101+
Commit-Based   2,197   98      137     17       47
LOC-Based      2,454   15      13      4        10
Access-Based   1,164   24      24      0        0
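As a quick arithmetic check, the counts in the table above can be turned into the share of projects with 15 or fewer core developers. The numbers are copied from the table; the variable and function names are only illustrative.

```python
# Projects per core team size bucket: <=15, 16-20, 21-50, 51-100, 101+
counts = {
    "Commit-Based": [2197, 98, 137, 17, 47],
    "LOC-Based": [2454, 15, 13, 4, 10],
    "Access-Based": [1164, 24, 24, 0, 0],
}


def share_small_core(row):
    """Percentage of projects in the first bucket (15 or fewer core developers)."""
    return 100 * row[0] / sum(row)


for heuristic, row in counts.items():
    print(f"{heuristic}: {sum(row)} projects, {share_small_core(row):.0f}% with <=15 core developers")
# Commit-Based: 2496 projects, 88% with <=15 core developers
# LOC-Based: 2496 projects, 98% with <=15 core developers
# Access-Based: 1212 projects, 96% with <=15 core developers
```

The results match the 88%, 98%, and 96% reported for the three heuristics. The Access-Based row sums to 1,212 projects rather than 2,496.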

Many projects face a bus factor risk

Commit-Based: 43% (Core=1: 8%)

LOC-Based: 81% (Core=1: 24%)

Access-Based: 54% (Core=1: 21%)

In fact, most projects have fewer than 5 core developers

Conclusion

Core Developer: additional slides

Additional description of our definition

[Figure: cumulative ratio (cutoff 0.8, maximum 1.0) for contributors A, B, C, D, E, annotated "Depend on Name".]

Commit-based

[Figure: panels for Age and Total Author.]

LOC-based

[Figure: panels for Age, Total Author, and LOC.]

Access-based

[Figure: panels for Age, Total Author, and LOC.]

Data Extraction

(1) Filter projects using GHTorrent

Filter forked repositories.

Fork

Forking is one of the features of GitHub. A fork is a clone of an original repository, and changes made in the fork repository can be sent back to the original repository as a pull request.

Filter repositories with fewer than 10 developers.

Filter repositories that are developed outside of GitHub.

8,510,504 repositories -> 4,618 repositories

(2) Clone repositories

Repositories are cloned to a local server.

4,618 repositories -> 4,154 repositories

(3) Filter duplicate projects

[Figure: Project A has a fork ("Fork of A") and a clone that is registered as a separate Project B ("Clone of A").]

4,618 repositories -> 3,533 repositories

Compare SHAs of Project A and Project B:

c87cce1e1a7260f40ccb5455e44c8b67f28651fa5e

655b8be757dd93a4cf3718145880cf484e34e63bde
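The slides say only that SHAs are compared, so the exact matching rule is not known. As a hedged sketch, the code below treats a repository as a duplicate when it shares at least one commit SHA with a repository seen earlier. The function name `find_duplicates`, the "share any SHA" rule, and the sample data are assumptions. In practice the SHA sets could be collected from each local clone, for example with `git rev-list --all`.

```python
def find_duplicates(shas_by_repo):
    """Map each duplicate repository to the earlier repository it overlaps with."""
    owner_of_sha = {}  # SHA -> first non-duplicate repository containing it
    duplicates = {}
    for repo, shas in shas_by_repo.items():
        # sorted() makes the choice deterministic if several SHAs match.
        match = next((owner_of_sha[s] for s in sorted(shas) if s in owner_of_sha), None)
        if match is not None:
            duplicates[repo] = match
        else:
            for sha in shas:
                owner_of_sha[sha] = repo
    return duplicates


# Hypothetical, shortened SHAs for three repositories.
repos = {
    "project-a": {"aaa111", "bbb222"},
    "project-b": {"aaa111", "bbb222", "ccc333"},  # clone of project-a with one extra commit
    "project-c": {"ddd444"},
}
dupes = find_duplicates(repos)
print(dupes)  # {'project-b': 'project-a'}
```

Which copy counts as the original depends on iteration order here. A real pipeline would need an explicit rule for that choice, which the slides do not state.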

(4) Calculate metrics

For each repository: LOC, Total Commits, Total Authors, Age

(5) Filter projects by metrics

Filter repositories with fewer than 10 developers.

Filter repositories with fewer than 1,000 LOC.

4,618 repositories -> 2,496 repositories
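The two metric thresholds can be expressed as a simple predicate. This is a minimal sketch: the `RepoMetrics` record, its field names, and the sample repositories are hypothetical, and only the two thresholds (10 developers, 1,000 LOC) come from the slides.

```python
from dataclasses import dataclass


@dataclass
class RepoMetrics:
    name: str
    loc: int
    total_authors: int


def keep_repository(metrics, min_authors=10, min_loc=1000):
    """Keep a repository only if it meets both size thresholds."""
    return metrics.total_authors >= min_authors and metrics.loc >= min_loc


# Hypothetical repositories used only to exercise the filter.
candidates = [
    RepoMetrics("large-lib", loc=250_000, total_authors=120),
    RepoMetrics("tiny-script", loc=400, total_authors=15),
    RepoMetrics("solo-tool", loc=30_000, total_authors=3),
]
kept = [m.name for m in candidates if keep_repository(m)]
print(kept)  # ['large-lib']
```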
