a data set for social diversity studies of github teams@b_vasilescu @aserebrenik @vlfilkov...

5
MSR’15, Florence, Italy May 16, 2015 @b_vasilescu @aserebrenik @ vlfilkov A Data Set for Social Diversity Studies of GitHub Teams Bogdan Vasilescu Alexander Serebrenik Vladimir Filkov

Upload: others

Post on 09-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Data Set for Social Diversity Studies of GitHub Teams@b_vasilescu @aserebrenik @vlfilkov MSR’15, Florence, Italy May 16, 2015 Gender and Tenure Diversity in GitHub Teams Bogdan

MSR’15, Florence, Italy May 16, 2015

@b_vasilescu @aserebrenik @vlfilkov

A Data Set for Social Diversity Studies of GitHub Teams

Bogdan Vasilescu Alexander Serebrenik Vladimir Filkov

Page 2: A Data Set for Social Diversity Studies of GitHub Teams@b_vasilescu @aserebrenik @vlfilkov MSR’15, Florence, Italy May 16, 2015 Gender and Tenure Diversity in GitHub Teams Bogdan

@b_vasilescu @aserebrenik @vlfilkov

MSR’15, Florence, Italy May 16, 2015

Gender and Tenure Diversity in GitHub Teams

Bogdan Vasilescu†§⇤, Daryl Posnett†, Baishakhi Ray†, Mark G.J. van den Brand§,Alexander Serebrenik§, Premkumar Devanbu†, Vladimir Filkov†⇤

†University of California, Davis and §Eindhoven University of Technology⇤[email protected], [email protected]

ABSTRACTSoftware development is usually a collaborative venture.Open Source Software (OSS) projects are no exception; in-deed, by design, the OSS approach can accommodate teamsthat are more open, geographically distributed, and dynamicthan commercial teams. This, we find, leads to OSS teamsthat are quite diverse. Team diversity, predominantly in of-fline groups, is known to correlate with team output, mostlywith positive effects. How about in OSS?

Using GITHUB, the largest publicly available collection ofOSS projects, we studied how gender and tenure diversityrelate to team productivity and turnover. Using regressionmodeling of GITHUB data and the results of a survey, weshow that both gender and tenure diversity are positive andsignificant predictors of productivity, together explaining asizable fraction of the data variability. These results caninform decision making on all levels, leading to better out-comes in recruiting and performance.

Author KeywordsOpen source; gender; diversity; productivity; GitHub.

ACM Classification KeywordsH.5.3. [Information Interfaces and Presentation (e.g. HCI)]:Computer-supported cooperative work

INTRODUCTIONBecause of the world-wide demand for talented and skilledlabor, hiring in STEM (Science, Technology, Engineering,and Math) fields has become increasingly almost entirelymeritocratic, and largely blind to demographic factors. Thisis certainly true for software engineering; as a result, bothcommercial and open source software teams can be verydiverse. What are the effects of this on the project as awhole? Indeed, demographic similarity enhances mutualtrust (and thus, arguably, team effectiveness), while demo-graphic diversity may lead to stereotyping, cliquishness, andconflict [20,43]. However, a team’s social diversity seems toimprove its technical performance [24].

Permission to make digital or hard copies of all or part of this work for per-sonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for componentsof this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, or republish, to post on servers or toredistribute to lists, requires prior specific permission and/or a fee. Requestpermissions from [email protected] 2015, April 18 - 23, 2015, Seoul, Republic of Korea.Copyright held by authors. Publication rights licensed to ACM. 978-1-4503-3145-6/15/04$15.00. http://dx.doi.org/10.1145/2702123.2702549

Software development teams can be diverse in various ways,e.g., w.r.t. gender, experience, nationality, and coding lan-guage preference; some teams can be more diverse in oneattribute and less so in others. Diversity attributes may alsointeract (e.g., in some nations, female professionals mayface more obstacles), which complicates analysis and study.Team diversity has been studied in physical (“meat-space”)settings; however, data is hard-won in such settings. Smallersample sizes make it difficult to effectively control for con-founds. Data requirements for such effective controls, how-ever, increase exponentially with the number of dimensionsstudied (one aspect of the “curse of dimensionality” [22]).Thus, studies of effects of diversity in teams (given the in-eluctable confounds) require data on a great many teams,with sufficient variance along all co-variates of concern.

GITHUB, a social coding platform, has attracted millions ofdevelopers and thousands of Open Source Software projects.1All commits, issues, code changes, pull-requests etc. arearchived and publicly available. GITHUB has become thenew standard for comprehensive studies of social and tech-nical organization and achievement [16, 37, 39, 41, 60]. Evi-dently, this is an attractive setting in which to study the rela-tionship of diversity to performance. The scale of GITHUBis especially relevant when considering the role of women,who are very underrepresented in programming.2 With alarge enough dataset, however, the effect of increased gen-der diversity becomes noticeable. Additionally, since alldata in GITHUB is historical (i.e., archived), it is possi-ble to study the effects of tenure, or one’s length of timewith a project and with GITHUB. However, the relianceon volunteers in OSS projects complicates matters; volun-teers come and go, leading to team turn-over. Team turn-over can certainly influence performance, and will confoundthe effects of diversity. The constructs of “team” and “teamturnover” clearly also depend on the observation time-scale.In a healthy project, some rate of turnover is in fact desir-able, as “new blood” brings in new abilities and ideas [21].Arguably, turnover will affect observed diversity in GITHUBOSS teams, and must be considered carefully.

In this paper, using GITHUB data, we explore several ques-tions: How diverse are online teams with respect to genderand tenure? Does gender diversity depend on tenure? On1OSS depend on distributed volunteers’ efforts whereas commer-cial software is much more centralized, and depends more on paidgroups of programmers [23]; in both, the quality can be high [8].2Especially so, it seems, in OSS projects: A 2013 FLOSS Sur-vey [49] indicates 10% females; all earlier surveys [19] agreeon merely 1–5%. Industry reports slightly higher numbers, e.g.,Google with 17% female technology employees.

CHI’15Which is more effective?

Page 3: A Data Set for Social Diversity Studies of GitHub Teams@b_vasilescu @aserebrenik @vlfilkov MSR’15, Florence, Italy May 16, 2015 Gender and Tenure Diversity in GitHub Teams Bogdan

@b_vasilescu @aserebrenik @vlfilkov

MSR’15, Florence, Italy May 16, 2015

Gender and Tenure Diversity in GitHub Teams

Bogdan Vasilescu†§⇤, Daryl Posnett†, Baishakhi Ray†, Mark G.J. van den Brand§,Alexander Serebrenik§, Premkumar Devanbu†, Vladimir Filkov†⇤

†University of California, Davis and §Eindhoven University of Technology⇤[email protected], [email protected]

ABSTRACTSoftware development is usually a collaborative venture.Open Source Software (OSS) projects are no exception; in-deed, by design, the OSS approach can accommodate teamsthat are more open, geographically distributed, and dynamicthan commercial teams. This, we find, leads to OSS teamsthat are quite diverse. Team diversity, predominantly in of-fline groups, is known to correlate with team output, mostlywith positive effects. How about in OSS?

Using GITHUB, the largest publicly available collection ofOSS projects, we studied how gender and tenure diversityrelate to team productivity and turnover. Using regressionmodeling of GITHUB data and the results of a survey, weshow that both gender and tenure diversity are positive andsignificant predictors of productivity, together explaining asizable fraction of the data variability. These results caninform decision making on all levels, leading to better out-comes in recruiting and performance.

Author KeywordsOpen source; gender; diversity; productivity; GitHub.

ACM Classification KeywordsH.5.3. [Information Interfaces and Presentation (e.g. HCI)]:Computer-supported cooperative work

INTRODUCTIONBecause of the world-wide demand for talented and skilledlabor, hiring in STEM (Science, Technology, Engineering,and Math) fields has become increasingly almost entirelymeritocratic, and largely blind to demographic factors. Thisis certainly true for software engineering; as a result, bothcommercial and open source software teams can be verydiverse. What are the effects of this on the project as awhole? Indeed, demographic similarity enhances mutualtrust (and thus, arguably, team effectiveness), while demo-graphic diversity may lead to stereotyping, cliquishness, andconflict [20,43]. However, a team’s social diversity seems toimprove its technical performance [24].

Permission to make digital or hard copies of all or part of this work for per-sonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for componentsof this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, or republish, to post on servers or toredistribute to lists, requires prior specific permission and/or a fee. Requestpermissions from [email protected] 2015, April 18 - 23, 2015, Seoul, Republic of Korea.Copyright held by authors. Publication rights licensed to ACM. 978-1-4503-3145-6/15/04$15.00. http://dx.doi.org/10.1145/2702123.2702549

Software development teams can be diverse in various ways,e.g., w.r.t. gender, experience, nationality, and coding lan-guage preference; some teams can be more diverse in oneattribute and less so in others. Diversity attributes may alsointeract (e.g., in some nations, female professionals mayface more obstacles), which complicates analysis and study.Team diversity has been studied in physical (“meat-space”)settings; however, data is hard-won in such settings. Smallersample sizes make it difficult to effectively control for con-founds. Data requirements for such effective controls, how-ever, increase exponentially with the number of dimensionsstudied (one aspect of the “curse of dimensionality” [22]).Thus, studies of effects of diversity in teams (given the in-eluctable confounds) require data on a great many teams,with sufficient variance along all co-variates of concern.

GITHUB, a social coding platform, has attracted millions ofdevelopers and thousands of Open Source Software projects.1All commits, issues, code changes, pull-requests etc. arearchived and publicly available. GITHUB has become thenew standard for comprehensive studies of social and tech-nical organization and achievement [16, 37, 39, 41, 60]. Evi-dently, this is an attractive setting in which to study the rela-tionship of diversity to performance. The scale of GITHUBis especially relevant when considering the role of women,who are very underrepresented in programming.2 With alarge enough dataset, however, the effect of increased gen-der diversity becomes noticeable. Additionally, since alldata in GITHUB is historical (i.e., archived), it is possi-ble to study the effects of tenure, or one’s length of timewith a project and with GITHUB. However, the relianceon volunteers in OSS projects complicates matters; volun-teers come and go, leading to team turn-over. Team turn-over can certainly influence performance, and will confoundthe effects of diversity. The constructs of “team” and “teamturnover” clearly also depend on the observation time-scale.In a healthy project, some rate of turnover is in fact desir-able, as “new blood” brings in new abilities and ideas [21].Arguably, turnover will affect observed diversity in GITHUBOSS teams, and must be considered carefully.

In this paper, using GITHUB data, we explore several ques-tions: How diverse are online teams with respect to genderand tenure? Does gender diversity depend on tenure? On1OSS depend on distributed volunteers’ efforts whereas commer-cial software is much more centralized, and depends more on paidgroups of programmers [23]; in both, the quality can be high [8].2Especially so, it seems, in OSS projects: A 2013 FLOSS Sur-vey [49] indicates 10% females; all earlier surveys [19] agreeon merely 1–5%. Industry reports slightly higher numbers, e.g.,Google with 17% female technology employees.

Which is more effective?CHI’15

Page 4: A Data Set for Social Diversity Studies of GitHub Teams@b_vasilescu @aserebrenik @vlfilkov MSR’15, Florence, Italy May 16, 2015 Gender and Tenure Diversity in GitHub Teams Bogdan

@b_vasilescu @aserebrenik @vlfilkov

MSR’15, Florence, Italy May 16, 2015

Social Diversity in GitHub Teams

Gender= mix women/men

Tenure (experience) = mix junior/senior

Culture= mix countries

Page 5: A Data Set for Social Diversity Studies of GitHub Teams@b_vasilescu @aserebrenik @vlfilkov MSR’15, Florence, Italy May 16, 2015 Gender and Tenure Diversity in GitHub Teams Bogdan

@b_vasilescu @aserebrenik @vlfilkov

MSR’15, Florence, Italy May 16, 2015

https://github.com/bvasiles/diversity

! Explore Gist Blog Help

HTTPS clone URL

You can clone with HTTPS, SSH,or Subversion.

diversity /

latest&commit&a1d6263472

A data set for social diversity studies of GitHub teams — Edit

Updated to match camera-ready

bvasiles authored 21 days ago

" LICENSE Initial commit 2 months ago

" README.md Updated readme 2 months ago

" diversity_data.csv Updated to match camera-ready 21 days ago

Search bvasiles + # $ %

0 01& Unwatch ⋆ Star ( Forkbvasiles / diversity)

* Code

+ Issues 0

, Pull requests 0

- Wiki

. Pulse

/ Graphs

0 Settings

https://github.com/bvasiles/diversity.git1

?

2 Clone in Desktop

3 Download ZIP

4 commits 1 branch 0 releases 1 contributor7 6 8 9

455 6 master branch: +

1

- README.md

A data set for social diversity studies of GitHub teams

The data is presented in CSV format and can be directly imported in R. It contains a number ofstandard measures of (GitHub) activity, including number of committers, team size (committers, pullrequest submitters, commenters, etc.), number of commits (the most encompassing form of codingcontribution to a GitHub project and a representative facet of developer productivity in open source),number of comments (on commits, pull requests, and issues; a measure of the project’s socialactivity), number of issues opened, number of forks, and number of watchers.

Then, for each quarter (at least 4 quarters of data per project, by construction), we compute theproject age (in quarters), the number of female and male contributors, the genders and countriesof team members (at least 75% resolved, by construction), their GitHub tenures (in days; capturingglobal GitHub presence, based on account creation date), commit tenures (in days; capturing globalcoding experience, based on participation in any GitHub repository), and project tenures (inquarters; local project experience, not restricted to coding), the numbers of contributors leaving(i.e., active in the previous quarter but inactive now), joining (defined analogously), and staying in theteam (i.e., in common between w.r.t. previous quarter), as well as the turnover ratio (i.e., the fractionof the team in a given quarter that is different with respect to previous quarter).

Finally, we compute Blau indices of team gender and country diversity, a well-established diversitymeasure for categorical variables, and coefficients of variation for GitHub, commit, and projecttenure, as measures of team tenure diversity.

diversity

Status API Training Shop Blog About© 2015 GitHub, Inc. Terms Privacy Security Contact !

This repository

http://ghtorrent.org

+ =

The Data Set