linsen an efficient approach to split identifiers and expand abbreviations
DESCRIPTION
"LINSEN: an efficient approach to split identifiers and expand abbreviations". Slides presented at the International Conference on Software Maintenance (ICSM) 2012, Riva del Garda (TN), Italy.
TRANSCRIPT
![Page 1: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/1.jpg)
LINSEN AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS
Anna Corazza, Sergio Di Martino, Valerio Maggio
Università di Napoli “Federico II”
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
![Page 2: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/2.jpg)
MOTIVATIONS IR FOR SE
![Page 4: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/4.jpg)
IR FOR NATURAL LANGUAGE
1. Tokenization
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
2. Normalization
   1. Change to lower case
   draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...
   2. Remove stop words
   3. Apply stemming
   draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...
   4. ...
Implicit assumption: the “same” words are used whenever a particular concept is described
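The natural-language pipeline on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the stop-word list and the `toy_stem` suffix rules are assumptions chosen only to reproduce the kind of tokens shown on the slide, and a real system would use a proper stemmer.

```python
import re

# Hypothetical stop-word list, just large enough for this example.
STOP_WORDS = {"the", "are", "a", "an", "of"}


def tokenize(text):
    """Step 1: keep every run of letters as a token."""
    return re.findall(r"[A-Za-z]+", text)


def toy_stem(token):
    """Crude suffix stripping (an assumption, not a real stemming algorithm)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    if len(token) > 3 and token.endswith("e"):
        token = token[:-1]
    return token


def normalize(tokens):
    """Step 2: lower-case, drop stop words, then stem."""
    lowered = [t.lower() for t in tokens]
    kept = [t for t in lowered if t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]


tokens = tokenize("Draws the NullHandle box, Rectangle r, Graphics g, displayBox")
print(normalize(tokens))
```

Note that `nullhandl` and `displaybox` survive as single terms: nothing in this pipeline knows that an identifier is made of several words.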
![Page 13: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/13.jpg)
IR FOR SOURCE CODE
1. Tokenization
1.5 Identifier Splitting
• snake_case splitter: r'(?<=\w)_'
• display_box ==> display | box
• camelCase/PascalCase splitter: r'(?<!^)([A-Z][a-z]+)'
• displayBox ==> display | Box
2. Normalization
draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
![Page 15: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/15.jpg)
IR FOR SOURCE CODE
• camelCase splitter: r'(?<!^)([A-Z][a-z]+)'
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
Splitting algorithms based on naming conventions are not robust enough
• Heavy use of abbreviations in the source code
• rect stands for Rectangle
• r stands for Rectangle
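The convention-based splitters quoted on these slides can be tried directly with Python's `re` module. The sketch below is only an illustration: the function name `split_identifier` is hypothetical, and applying the camelCase pattern through `re.sub` is an assumption about how the pattern is meant to be used.

```python
import re

# Patterns as given on the slides (with plain quotes).
SNAKE_CASE = re.compile(r"(?<=\w)_")
CAMEL_CASE = re.compile(r"(?<!^)([A-Z][a-z]+)")


def split_identifier(identifier):
    """Split on underscores first, then on lower-to-Capitalized boundaries."""
    parts = []
    for chunk in SNAKE_CASE.split(identifier):
        # Put a space in front of every capitalized word that is not at the start.
        marked = CAMEL_CASE.sub(r" \1", chunk)
        parts.extend(marked.split())
    return parts


for name in ["display_box", "displayBox", "drawXORRect", "drawxorrect"]:
    print(name, "==>", " | ".join(split_identifier(name)))
```

The last two identifiers show the weakness named on the slide: the acronym `XOR` stays glued to `draw`, and an identifier without any case or underscore marker is not split at all.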
![Page 22: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/22.jpg)
IDENTIFIER MAPPING
1. Tokenization
1.5 Identifier Mapping
• SAMURAI (Enslen et al., 2011)
• TIDIER (Guerrouj et al., 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• AMAP (Hill and Pollock, 2008)
• ...
• LINSEN
2. Normalization
draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
![Page 23: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/23.jpg)
LINSEN ALGORITHM
CONTRIBUTION
• Novel technique for Identifier Mapping
• Based on an efficient string matching technique: the Baeza-Yates and Perleberg (BYP) algorithm
• Applied on a graph-based model
• Able to both split identifiers and expand any abbreviations occurring in them
![Page 27: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/27.jpg)
THE AMBIGUITY PROBLEM
There could be multiple, equally correct splitting or expansion solutions
• r stands for Rectangle OR red
• nsISupports ==> ns|IS|up|ports OR
              ==> ns|I|Supports
![Page 30: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/30.jpg)
DICTIONARIES
Application-aware Dictionaries
(22,940 Entries)
(108,315 Entries)
(588 Entries)
![Page 36: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/36.jpg)
GRAPH MODEL
Model: Weighted Directed Graph
Example: drawXORRect identifier
• NODES correspond to characters of the current identifier
• ARCS correspond to matchings between identifier substrings and dictionary words
• Application of the String Matching Algorithm (BYP)
• Padding arcs ensure that the graph is always connected
• Every arc is labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• The cost function c(“word”) favors longer words and words coming from the application-aware dictionaries
• The final mapping solution corresponds to the sequence of labels on the path with the minimum cost (Dijkstra's algorithm)
[Figure: graph for drawXORRect with nodes 0 to 11; padding arcs labelled with the single characters “d”, “r”, “a”, “w”, “X”, “O”, “R”, “R”, “e”, “c”, “t” at cost C-MAX; word arcs “draw”, “raw”, “xor”, “or”, “rectangle”, each at cost c(word)]
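The graph model for `drawXORRect` can be sketched as follows. The slides do not give the cost function, so the value of `C_MAX` and the definition `cost(word) = 1 / len(word)` are assumptions chosen only so that longer words are cheaper than padding arcs. The arc positions in `MATCHES` are likewise hand-written for this example rather than produced by BYP.

```python
import heapq

C_MAX = 10.0  # assumed cost of a padding arc


def cost(word):
    """Hypothetical cost function: longer dictionary words are cheaper."""
    return 1.0 / len(word)


def build_graph(identifier, matches):
    """Nodes are positions 0..len(identifier); arcs are (target, label, weight)."""
    graph = {i: [] for i in range(len(identifier) + 1)}
    for i, ch in enumerate(identifier):
        graph[i].append((i + 1, ch, C_MAX))  # padding arcs keep the graph connected
    for start, end, word in matches:
        graph[start].append((end, word, cost(word)))
    return graph


def best_mapping(graph, source, target):
    """Dijkstra's algorithm; returns the labels along the cheapest path."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, label, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = (u, label)
                heapq.heappush(heap, (d + w, v))
    labels, node = [], target
    while node != source:
        node, label = prev[node]
        labels.append(label)
    return labels[::-1]


# Word arcs from the slide: (start node, end node, dictionary word).
MATCHES = [(0, 4, "draw"), (1, 4, "raw"), (4, 7, "xor"), (5, 7, "or"), (7, 11, "rectangle")]
graph = build_graph("drawXORRect", MATCHES)
print(best_mapping(graph, 0, 11))
```

Because every arc points forward, the graph is acyclic and a simple dynamic program would also work. Dijkstra's algorithm is used here only because the slide names it.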
![Page 43: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/43.jpg)
STRING MATCHING
• Application of the Baeza-Yates and Perleberg (BYP) algorithm
• Signature: BYP(identifier, word, φ(word))
• identifier: target string
• word: string to match
• φ(·): Tolerance (Error) function
• Bounds the length of acceptable matchings
Advantage: the same algorithm is used for both the splitting and the expansion steps, with a different input tolerance function
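To make the role of the tolerance function concrete, here is a naive stand-in with the same three-argument shape. It is not the BYP algorithm: it simply slides a window over the identifier and counts mismatching characters, and reading φ(word) as a maximum number of mismatches is an assumption made for this sketch. The names `approx_find`, `phi_split` and `phi_one_error` are hypothetical.

```python
def approx_find(identifier, word, tolerance):
    """Return the start offsets where `word` occurs in `identifier`
    with at most tolerance(word) mismatching characters."""
    text = identifier.lower()
    max_errors = tolerance(word)
    hits = []
    for start in range(len(text) - len(word) + 1):
        window = text[start:start + len(word)]
        mismatches = sum(a != b for a, b in zip(window, word))
        if mismatches <= max_errors:
            hits.append(start)
    return hits


def phi_split(word):
    """Exact matching: no errors allowed."""
    return 0


def phi_one_error(word):
    """A hypothetical looser tolerance, for illustration only."""
    return 1


print(approx_find("drawXORRect", "xor", phi_split))
print(approx_find("drawXORRect", "rat", phi_one_error))
```

Swapping the tolerance function is the only change between the two calls, which mirrors the advantage stated on the slide.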
![Page 44: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/44.jpg)
BYP FOR SPLITTING
• BYP(identifier, word, φSplit(word))
• φSplit: Exact Matching (i.e., no errors allowed)
Identifier: drawXORRect
Dictionaries:
• DFile: ... draw, the, are, null, handle, box, red, rectangle, ...
• DCOMPUTER-SCIENCE: ... echo, testing, threading, xpm, xor, ...
• DEnglish: ... abort absolute abstract ... or ... raw ...
[Figure: graph for drawXORRect (nodes 0 to 11) built step by step; starting from the padding arcs at cost C-MAX, exact matches add the arc “draw” from DFile, then “xor” from DCOMPUTER-SCIENCE, then “raw” and “or” from DEnglish]
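The splitting step can be sketched as a scan over the three dictionaries that collects one arc per exact occurrence. Plain `str.find` is used here in place of BYP with φSplit, since both amount to exact matching. The dictionary contents are only the excerpts visible on the slides, and `exact_match_arcs` is a hypothetical name.

```python
# Dictionary excerpts as shown on the slides, in the order they are scanned.
DICTIONARIES = [
    ("DFile", ["draw", "the", "are", "null", "handle", "box", "red", "rectangle"]),
    ("DCOMPUTER-SCIENCE", ["echo", "testing", "threading", "xpm", "xor"]),
    ("DEnglish", ["abort", "absolute", "abstract", "or", "raw"]),
]


def exact_match_arcs(identifier, dictionaries):
    """Return (start, end, word, dictionary) for every exact occurrence."""
    text = identifier.lower()
    arcs = []
    for name, words in dictionaries:
        for word in words:
            start = text.find(word)
            while start != -1:
                arcs.append((start, start + len(word), word, name))
                start = text.find(word, start + 1)
    return arcs


for arc in exact_match_arcs("drawXORRect", DICTIONARIES):
    print(arc)
```

Neither `red` nor `rectangle` matches exactly, so the `Rect` suffix is covered only by padding arcs until the expansion step runs.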
![Page 52: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/52.jpg)
BYP FOR EXPANSION
• BYP(identifier, word, φExp(word))
• φExp: Approximate Matching
Identifier: drawXORRect
Dictionaries:
• DFile: ... draw, the, are, null, handle, box, red, rectangle, ...
• DCOMPUTER-SCIENCE: ... echo, testing, threading, xpm, xor, ...
• DEnglish: ... abort absolute abstract ... or ... raw ...
[Figure: the same graph for drawXORRect (nodes 0 to 11) during the expansion step; alongside the padding arcs at cost C-MAX and the word arcs “draw” and “xor”, approximate matches against DFile add the arcs “red” and then “rectangle”]
![Page 57: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/57.jpg)
EMPIRICAL EVALUATION
RESEARCH QUESTIONS
• RQ1: How does LINSEN compare with state-of-the-art approaches in splitting identifiers?
• RQ2: How does LINSEN compare with state-of-the-art approaches in mapping identifiers to dictionary words?
• RQ3: How well does LINSEN deal with different types of abbreviations?
![Page 60: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/60.jpg)
EMPIRICAL EVALUATION
CASE STUDIES
• DTW (Madani et al., 2010): RQ1 and RQ2
• GenTest+Normalize (Lawrie and Binkley, 2011): RQ1 and RQ2
• LUDISO Dataset (2012): RQ1 only; 15 out of 750 software systems, covering 58% of the total identifiers
• AMAP (Hill and Pollock, 2008): RQ3 only
![Page 61: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/61.jpg)
EMPIRICAL EVALUATION
EVALUATION METRICS
Comparability of Results: Accuracy rate
• Identifier Level evaluation: each mapping result must be completely correct
• Soft-word Level evaluation: “partial credit” is given for each word correctly mapped
Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj et al., 2011]
As for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011)
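The difference between the two accuracy levels can be illustrated with a toy computation. The slides do not spell out the exact definitions, so the soft-word rule below (one credit per oracle word matched at the same position) is an assumption. The `oracle` and `predicted` mappings are made-up examples, not data from the evaluation.

```python
def identifier_level_accuracy(predicted, oracle):
    """Fraction of identifiers whose whole mapping equals the oracle."""
    correct = sum(predicted[name] == words for name, words in oracle.items())
    return correct / len(oracle)


def soft_word_level_accuracy(predicted, oracle):
    """Fraction of oracle words matched position by position (assumed rule)."""
    total = sum(len(words) for words in oracle.values())
    correct = sum(
        p == o
        for name, words in oracle.items()
        for p, o in zip(predicted[name], words)
    )
    return correct / total


# Hypothetical oracle and tool output.
oracle = {
    "drawXORRect": ["draw", "xor", "rectangle"],
    "displayBox": ["display", "box"],
}
predicted = {
    "drawXORRect": ["draw", "xor", "rect"],  # abbreviation left unexpanded
    "displayBox": ["display", "box"],
}

print(identifier_level_accuracy(predicted, oracle))
print(soft_word_level_accuracy(predicted, oracle))
```

One unexpanded abbreviation costs the whole identifier at the identifier level, but only one word out of five at the soft-word level.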
![Page 64: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/64.jpg)
RESULTS
RQ1: SPLITTING
[Figure: bar chart of accuracy rates (axis 0 to 1) for the comparison with DTW (Madani et al., 2010); bars for DTW and LINSEN on JHotDraw 5.1 and Lynx 2.8.5]
![Page 65: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/65.jpg)
RESULTS
RQ1: SPLITTING
[Figure: bar charts of accuracy rates for the comparison with GenTest (Lawrie and Binkley, 2011); bars for GenTest and LINSEN on which 2.20 and a2ps 4.14; panels: Identifier Level (axis 0 to 0.7) and Soft-word Level (axis 0 to 0.8)]
![Page 66: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/66.jpg)
RESULTS
RQ2: MAPPING
[Figure: bar chart of accuracy rates (axis 0 to 1) for the comparison with DTW (Madani et al., 2010); bars for DTW and LINSEN on JHotDraw 5.1 and Lynx 2.8.5]
![Page 67: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/67.jpg)
RESULTS
RQ2: MAPPING
[Figure: bar charts of accuracy rates for the comparison with Normalize (Lawrie and Binkley, 2011); bars for Normalize and LINSEN on which 2.20 and a2ps 4.14; panels: Identifier Level (axis 0 to 0.6) and Soft-word Level (axis 0 to 0.9)]
![Page 68: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/68.jpg)
RESULTS
RQ3: EXPANSION
[Figure: bar chart of accuracy rates (axis 0 to 0.9) for the comparison with AMAP (Hill and Pollock, 2008); bars for AMAP and LINSEN per abbreviation type: CW, DL, OO, AC, PR, SL]
CW: Combination Words; DL: Dropped Letters; OO: Others; AC: Acronyms; PR: Prefix; SL: Single Letters
![Page 69: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/69.jpg)
CONCLUSIONS
![Page 74: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/74.jpg)
FUTURE WORKS
• Evaluate the impact of each adopted dictionary on the performance
• Improve, change, or add dictionaries
• Improve the implementation of the prototype to speed up the computation
• Make use of parallel computation, processing each identifier in isolation
![Page 75: LINSEN an efficient approach to split identifiers and expand abbreviations](https://reader033.vdocuments.site/reader033/viewer/2022060119/5590afa71a28ab536a8b4625/html5/thumbnails/75.jpg)
THANK YOU
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
Valerio Maggio
Ph.D. Student, University of Naples “Federico II”