Post on 18-Jan-2016

Demonstrating Programming Language Feature Mining using Boa

Robert Dyer, Tien N. Nguyen, Hridesh Rajan, Hoan Anh Nguyen

These research activities were supported in part by the US National Science Foundation (NSF) grants CNS-15-13263, CNS-15-12947, CCF-15-18897, CCF-15-18776, CCF-14-23370, CCF-13-49153, CCF-13-20578, TWC-12-23828, CCF-11-17937, CCF-10-17334, and CCF-10-18600.

Today’s talk is about mining software repositories at an ultra-large scale

What do I mean by a software repository?

What features do they have?

What do I mean by mining software repositories (MSR)?

What are some examples of software repository mining?

What is the most used programming language?

How many words are in commit messages?

Words[] = update, 30715
Words[] = cleanup, 19073
Words[] = updated, 18737
Words[] = refactoring, 11981
Words[] = fix, 11705
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295

How has unit testing been adopted over time?

[Chart omitted; annotation: JUnit 4 release]

What makes this ultra-large-scale mining?

Previous examples queried...

Projects 699,331

Code Repositories 494,158

Revisions 15,063,073

Unique Files 69,863,970

File Snapshots 147,074,540

AST Nodes 18,651,043,23

Over 250GB of pre-processed data from SourceForge

Most recent dataset (Sep 2015)

Projects 7,830,023

Code Repositories 380,125

Revisions 23,229,406

Unique Files 146,398,339

File Snapshots 484,947,086

AST Nodes 71,810,106,868

Over 270GB of pre-processed data from GitHub (focusing on Java projects)

What am I interested in?

Language Studies

What languages do programmers choose?
[Meyerovich & Rabkin SPLASH'13]

Reflection
[Livshits et al. APLAS'05] [Callaú et al. MSR'11]

JavaScript / eval
[Yue & Wang WWW'09] [Richards et al. PLDI'10] [Ratanaworabhan et al. WEBAPPS'10] [Richards et al. ECOOP'11]

Generics
[Basit et al. SEKE'05] [Parnin et al. MSR'11] [Hoppe & Hanenberg SPLASH'13]

Object-oriented Features
[Tempero et al. ECOOP'08] [Muschevici et al. OOPSLA'08] [Tempero ASWEC'09] [Grechanik et al. ESEM'10] [Gorschek et al. ICSE'10]

Finding use of assert

• Requires use of a parser (e.g. JDT)

• Requires knowledge of several APIs
  – SF.net / GitHub API
  – SVNKit / JGit / etc.

• Must be manually parallelized

Finding use of assert

ASSERTS: output sum of int;

visit(input, visitor {
    before node: CodeRepository -> {
        snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
        foreach (i: int; def(snapshot[i]))
            visit(snapshot[i]);
        stop;
    }
    before node: Statement ->
        if (node.kind == StatementKind.ASSERT)
            ASSERTS << 1;
});

Automatically parallelized

Analyzes 18 billion AST nodes in minutes

Only 12 lines of code

No external libraries

Boa

http://boa.cs.iastate.edu/

[TOSEM] (to appear) [ICSE'14] [GPCE'13] [ICSE'13]

Boa's Architecture

Boa's Data Infrastructure
• Replicate and Transform
• cache
• Stored on cluster

Boa's Computing Infrastructure
• User submits query
• Compiled into Hadoop program
• Deployed and executed on cluster
• Query result returned via web

Automatic Parallelization

ASSERTS: output sum of int;

visit(input, visitor {
    before node: CodeRepository -> {
        snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
        foreach (i: int; def(snapshot[i]))
            visit(snapshot[i]);
        stop;
    }
    before node: Statement ->
        if (node.kind == StatementKind.ASSERT)
            ASSERTS << 1;
});

Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.

Compiler generates Hadoop MapReduce code
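
The execution model behind an output variable can be sketched outside Boa. The following is a minimal Python sketch under stated assumptions, not Boa's generated Hadoop code: each project is reduced to a flat list of statement kinds (a hypothetical stand-in for its AST), the same query function runs independently on every project (the map step), and a sum aggregator adds up whatever was emitted (the reduce step). The names `run_query`, `aggregate_sum`, and the sample projects are illustrative only.

```python
from collections import defaultdict

def run_query(project):
    """Map step: run the query on one project and emit (output_name, value) pairs."""
    emitted = []
    for statement_kind in project["statements"]:
        if statement_kind == "ASSERT":
            emitted.append(("ASSERTS", 1))  # analogous to: ASSERTS << 1;
    return emitted

def aggregate_sum(emitted_per_project):
    """Reduce step: a sum aggregator adds up every value emitted to each output."""
    totals = defaultdict(int)
    for emitted in emitted_per_project:
        for name, value in emitted:
            totals[name] += value
    return dict(totals)

# Hypothetical dataset: each project is a flat list of statement kinds.
projects = [
    {"name": "project1", "statements": ["ASSERT", "IF", "ASSERT"]},
    {"name": "project2", "statements": ["RETURN"]},
    {"name": "project3", "statements": ["ASSERT"]},
]

# Each project is processed independently, so the map step could run in parallel.
result = aggregate_sum(run_query(p) for p in projects)
print(result)  # {'ASSERTS': 3}
```

Because the query function only sees one project and only communicates through emitted values, no per-project state has to be shared, which is what makes this style of program straightforward to parallelize.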

Abstracting MSR with Types

ASSERTS: output sum of int;

visit(input, visitor {
    before node: CodeRepository -> {
        snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
        foreach (i: int; def(snapshot[i]))
            visit(snapshot[i]);
        stop;
    }
    before node: Statement ->
        if (node.kind == StatementKind.ASSERT)
            ASSERTS << 1;
});

Custom domain-specific types for mining software repositories: 5 base types and 9 types for source code

No need to understand multiple data formats or APIs

Abstracting MSR with Types

Project (1) -> (1..*) CodeRepository
CodeRepository (1) -> (*) Revision
Revision (1) -> (*) ChangedFile
ChangedFile (1) -> (0..1) ASTRoot

Abstracting MSR with Types

[Diagram: source-code types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, and Expression, linked by containment relationships with multiplicities 1, *, and 1..*]

Challenge: How can we make mining source code easier?

Answer: Declarative Visitors

Easing Source Code Mining with Visitors

id := visitor {
    before T -> statement;
    after T -> statement;
};

visit(node, id);

Easing Source Code Mining with Visitors

id := visitor {
    before id : T1 -> statement;
    before T2, T3 -> statement;
    before _ -> statement;
};

Easing Source Code Mining with Visitors

[Diagram, shown twice: tree of node types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]

Easing Source Code Mining with Visitors

[Diagram: tree of node types ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]

before n: Declaration -> {
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
    stop;
}
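
The effect of `stop` in the last version can be illustrated with a minimal Python sketch of a depth-first visitor. This is not Boa's implementation: the `Node` class, the `visit` function, and the convention that a `before` handler returns `False` to suppress the default descent are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    name: str = ""
    fields: list = field(default_factory=list)    # child Variable nodes
    children: list = field(default_factory=list)  # all other child nodes

def visit(node, before, visited):
    """Depth-first traversal; before[kind] may return False to skip the default descent."""
    visited.append((node.kind, node.name))
    handler = before.get(node.kind)
    if handler is not None and handler(node, before, visited) is False:
        return  # equivalent of `stop`: do not visit the children by default
    for child in node.fields + node.children:
        visit(child, before, visited)

def fields_only(node, before, visited):
    """Custom traversal: manually visit only the fields, then stop."""
    for f in node.fields:
        visit(f, before, visited)
    return False

decl = Node("Declaration", "C",
            fields=[Node("Variable", "x")],
            children=[Node("Method", "m", children=[Node("Statement", "s")])])

default_order = []
visit(decl, {}, default_order)

custom_order = []
visit(decl, {"Declaration": fields_only}, custom_order)

print(default_order)  # Declaration, Variable, Method, Statement
print(custom_order)   # Declaration, Variable
```

In this sketch, a handler that visited the fields but did not return `False` would cause them to be visited twice, once by the handler and once by the default descent, which mirrors why the final visitor clause ends with `stop`.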

Let’s revisit the assert use example.

Finding use of assert

ASSERTS: output sum of int;

Finding use of assert

ASSERTS: output sum of int;

visit(input, visitor {

});

Finding use of assert

ASSERTS: output sum of int;

visit(input, visitor {

before node: Statement ->

});

Finding use of assert

ASSERTS: output sum of int;

visit(input, visitor {

    before node: Statement ->
        if (node.kind == StatementKind.ASSERT)

});

Finding use of assert

ASSERTS: output sum of int;

visit(input, visitor {

    before node: Statement ->
        if (node.kind == StatementKind.ASSERT)
            ASSERTS << 1;
});

Finding use of assert

ASSERTS: output sum of int;

visit(input, visitor {
    before node: CodeRepository -> {
        snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
        foreach (i: int; def(snapshot[i]))
            visit(snapshot[i]);
        stop;
    }
    before node: Statement ->
        if (node.kind == StatementKind.ASSERT)
            ASSERTS << 1;
});

Let’s see that query in action!

[Diagram: the dataset is split into inputs (input = project1, input = project2, input = project3, ..., input = projectn). A separate process runs the Boa program on each input and emits values with "Assert << 1;". The output process aggregates the emitted 1s into the result: Assert = 538372]

ASSERTS: output sum of int;

visit(input, visitor {
    before node: CodeRepository -> {
        snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
        foreach (i: int; def(snapshot[i]))
            visit(snapshot[i]);
        stop;
    }
    before node: Statement ->
        if (node.kind == StatementKind.ASSERT)
            ASSERTS << 1;
});

Back to our feature study…

What is our study about?

How have new Java language features been adopted over time?

Assume Java

Corpus of 30k+ projects

Study 18 new features from 3 language editions

Over 10 years of history

Research Questions

RQ2: How frequently is each feature used?

RQ4: Could features have been used more?

RQ5: Was old code converted to use new features?

Research Question 2

How frequently was each language feature used?

Project Histogram: Annotation Use

Project Density: Annotation Use

Some features popular

Some features popular. Why?

List      ArrayList
Map       HashMap
Set       Collection
Vector    Class
Iterator  HashSet

(confirms [Parnin et al. MSR'11])

Research Question 4

Could features have been used more?

Opportunity: Assert

void m(..) {
    if (!cond) throw new IllegalArgumentException();
    ...
}

void m(..) {
    assert cond;
    ...
}

Find methods that throw IllegalArgumentException.

Simpler

Machine-checkable

Easily disabled for production

Opportunity: Binary Literals

int x = 1 << 5;

Find where literal 1 is shifted left.

short[] phases = {
    0x7,
    0xE,
    0xD,
    0xB
};

short[] phases = {
    0b0111,
    0b1110,
    0b1101,
    0b1011
};

Opportunity: Underscore Literals

int x = 1000000;

int x = 1_000_000;

Find integers with 7 or more digits and no underscores.
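
As a rough illustration of this heuristic, the following Python sketch scans raw Java source text with a regular expression. It is not the study's Boa query, which works on parsed ASTs. The pattern, the function name `underscore_opportunities`, and the sample source are assumptions, and the sketch deliberately ignores hex, octal, and binary literals as well as comments and string contents.

```python
import re

# Decimal integer literals with 7 or more digits and no underscores,
# with an optional long suffix. The lookarounds reject digits that are
# part of an identifier, a floating-point literal, or an underscored literal.
LONG_INT = re.compile(r"(?<![\w.])[1-9][0-9]{6,}[lL]?(?![\w.])")

def underscore_opportunities(java_source):
    """Return decimal integer literals that could be rewritten with underscores."""
    return LONG_INT.findall(java_source)

source = """
int a = 1000000;
int b = 1_000_000;
long c = 12345678901L;
int d = 999999;
double e = 1234567.5;
"""
print(underscore_opportunities(source))  # ['1000000', '12345678901L']
```

A text-based scan like this can report false positives (for example, digits inside string literals), which is one reason an AST-based query is the more reliable way to count such opportunities.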

Opportunity: Diamond

List<String> l = new ArrayList<String>();

List<String> l = new ArrayList<>();

Find instantiations of generic types that do not use diamond.

Opportunity: MultiCatch

try { .. }
catch (T1 e) { b1 }
catch (T2 e) { b1 }

try { .. }
catch (T1 | T2 e) { b1 }

Find try statements with multiple, identical catch blocks.
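
One way to picture this check is to group the catch clauses of a single try statement by their body. The Python sketch below assumes the clauses have already been extracted as (exception type, body text) pairs and compares bodies after collapsing whitespace. The function name `multicatch_opportunities` and this representation are hypothetical, and the study's own query may compare blocks differently.

```python
from collections import defaultdict

def multicatch_opportunities(catch_clauses):
    """Group exception types whose catch bodies are identical (ignoring whitespace)."""
    groups = defaultdict(list)
    for exception_type, body in catch_clauses:
        normalized = " ".join(body.split())  # collapse whitespace differences
        groups[normalized].append(exception_type)
    # Only bodies shared by two or more clauses could be merged into a multi-catch.
    return [types for types in groups.values() if len(types) > 1]

# Hypothetical catch clauses of one try statement.
clauses = [
    ("IOException", "log(e);"),
    ("SQLException", "log(e);  "),
    ("RuntimeException", "throw e;"),
]
print(multicatch_opportunities(clauses))  # [['IOException', 'SQLException']]
```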

Opportunity: Try w/ Resources

try {
    ..
} finally {
    var.close();
}

try (var = ..) {
    ..
}

Find try statements that call close() in the finally block.

     Assert  Varargs  Binary Literals  Diamond  MultiCatch  Try w/ Resources  Underscore Literals
Old  89K     612K     56K              3.3M     341K        489K              5.3M
New  291K    1.6M     5K               414K     24K         33K               507K

Millions of opportunities!

Potential Uses vs. Actual Uses (percentage of projects)

                Assert  Varargs  Binary Literals  Diamond  MultiCatch  Try w/ Resources  Underscore Literals
Potential Uses  18.18%  88.78%   5.9%             59.08%   49.75%      37.27%            51.15%
Actual Uses     12.72%  15.43%   0.02%            0.4%     0.27%       0.21%             0.02%

Millions of opportunities!

Research Question 5

Was old code converted to use new features?

Detecting Conversions

File.java (Revision N): uses(N), potential(N)
File.java (Revision N+1): uses(N+1), potential(N+1)

A conversion is detected when both conditions hold:
uses(N) < uses(N+1)
potential(N) > potential(N+1)
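
The two conditions translate directly into a comparison over consecutive revisions of a file. The following Python sketch assumes the per-revision counts of actual and potential uses have already been computed. The function names and the sample history are hypothetical and only illustrate the rule stated on the slide.

```python
def is_conversion(uses_n, potential_n, uses_n1, potential_n1):
    """Flag a revision pair where feature uses rose and potential uses fell."""
    return uses_n < uses_n1 and potential_n > potential_n1

def find_conversions(history):
    """history: list of (revision_id, uses, potential) for one file, oldest first."""
    conversions = []
    for (rev_a, uses_a, pot_a), (rev_b, uses_b, pot_b) in zip(history, history[1:]):
        if is_conversion(uses_a, pot_a, uses_b, pot_b):
            conversions.append((rev_a, rev_b))
    return conversions

# Hypothetical counts for File.java across four revisions.
history = [
    ("r1", 0, 3),
    ("r2", 2, 1),  # uses went up and potential went down: a conversion
    ("r3", 3, 1),  # uses went up but potential is unchanged: new code, not a conversion
    ("r4", 3, 0),  # potential went down but uses are unchanged: code removed
]
print(find_conversions(history))  # [('r1', 'r2')]
```

Requiring both conditions at once is what separates a conversion from newly written code that uses the feature or from old code that was simply deleted.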

Detected lots of conversions!

Manual, systematic sampling confirms: 2602 conversions, 13 not conversions

          Assert  Varargs  Diamond  MultiCatch  Try w/ Resources  Underscore Literals
Count     180     2.1K     8.5K     162         154               2
Files     105     1.6K     3.8K     125         99                1
Projects  37      488      72       23          17                1

To summarize...

Only a few features see high use

Despite (missed) potential for use

          Assert  Varargs  Binary Literals  Diamond  MultiCatch  Try w/ Resources  Underscore Literals
Old       89K     612K     56K              3.3M     341K        489K              5.3M
New       291K    1.6M     5K               414K     24K         33K               507K
All       380K    2.2M     61K              3.7M     365K        522K              5.8M
Files     1.39%   12.74%   0.11%            12.25%   2.28%       1.85%             5.86%
Projects  18.18%  88.78%   5.9%             59.08%   49.75%      37.27%            51.15%

Old code converted to use new features

          Assert  Varargs  Diamond  MultiCatch  Try w/ Resources  Underscore Literals
Count     180     2.1K     8.5K     162         154               2
Files     105     1.6K     3.8K     125         99                1
Projects  37      488      72       23          17                1

Feature adoption by individuals

Similar usage patterns

Summary

Ultra-large-scale language feature studies pose several challenges

Boa provides abstractions to address these challenges
• Ultra-large-scale dataset with millions of projects
• Domain-specific language, types, and functions to make mining software repositories easier
• Automatically parallelizes queries

Boa's Global Impact

370+ users from over 20 countries!

http://boa.cs.iastate.edu/

Participate in the MSR 2016 Mining Challenge

http://2016.msrconf.org/#/challenge

deadline: Feb 19