

Supporting Domain Analysis through Mining and Recommending Features from Online Product Listings

Negar Hariri, Carlos Castro-Herrera, Mehdi Mirakhorli, Student Member, IEEE, Jane Cleland-Huang, Member, IEEE, Bamshad Mobasher, Member, IEEE

Abstract—Domain analysis is a labor-intensive task in which related software systems are analyzed to discover their common and variable parts. Many software projects include extensive domain analysis activities, intended to jumpstart the requirements process by identifying potential features. In this paper, we present a recommender system designed to reduce the human effort of performing domain analysis. Our approach relies on data mining techniques to discover common features across products as well as relationships among those features. We use a novel incremental diffusive algorithm to extract features from online product descriptions, and then employ association rule mining and the k-Nearest Neighbor machine learning method to make feature recommendations during the domain analysis process. Our feature mining and feature recommendation algorithms are quantitatively evaluated and the results are presented. The performance of the recommender system is also illustrated and evaluated within the context of a case study of an enterprise-level collaborative software suite. The results highlight the benefits and limitations of our approach, as well as the necessary preconditions for its success.

Index Terms—domain analysis, recommender systems, clustering, association rule mining, k-nearest neighbor.


1 INTRODUCTION

Domain analysis is the process of identifying, organizing, analyzing, and modeling features common to a particular domain [1], [2]. It is conducted in the early phases of the software development life-cycle in order to generate ideas for a product, discover commonalities and variants within a domain, help project stakeholders configure product lines, and identify opportunities for reuse. As such, it is an important component of the upfront software engineering process. In current practice, most domain analysis techniques, such as Feature-Oriented Domain Analysis (FODA) [2] or the Feature-Oriented Reuse Method (FORM) [3], rely upon analysts manually reviewing existing requirement specifications or competitors' product brochures and Web sites, and are therefore quite labor intensive. The success of these approaches depends upon the availability of relevant documents and/or access to existing project repositories, as well as the knowledge and expertise of the domain analyst. Other approaches, such as the Domain Analysis and Reuse Environment (DARE) [4], utilize data mining and information retrieval methods to provide automated support for feature identification and extraction, but tend to focus their efforts on only a small handful

• N. Hariri, M. Mirakhorli, J. Cleland-Huang and B. Mobasher are with the School of Computing, DePaul University, IL, 60604. C. Castro-Herrera is with Google Inc. E-mail: [email protected], [email protected]

of requirements specifications. The extracted features are therefore limited by the scope of the available specifications.

In this paper, we address these limitations by presenting a novel approach for identifying a broader set of candidate features. In contrast to previous methods that extract features from proprietary project repositories, our approach mines raw feature descriptions, referred to from now on as descriptors, from thousands of partial product specifications found on Web sites which host or market software packages. It then analyzes the relationships between features and products and utilizes this information to recommend features for a specific project. Our approach takes as input an initial product description, analyzes this description, and then generates related feature recommendations utilizing two different algorithms. The first algorithm uses the unsupervised association rule mining technique [5], [6] to discover affinities among features across products and to augment and enrich the initial product profile. The second algorithm acts upon the augmented product profile and uses a k-nearest neighbor approach, common in collaborative filtering recommender systems [7], to analyze neighborhoods of similar products in order to identify new features. The end result is a set of recommended features which can be fed into the requirements engineering process in order to help project stakeholders define the features for a specific software product or product line. Figure 1 illustrates a feature recommendation scenario for an anti-virus


[Figure: a three-step recommendation dialogue.
Step 1: Enter initial product description — "The product will protect the computer from viruses and email spam. It will maintain a database of known viruses and will retrieve updated descriptions from a server. It will scan the computer on demand for viruses."
Step 2: Confirm features — "We have identified the following features from your initial product description. Please confirm: Email spam detection; Virus definition update and automatic update supported; Disk scan for finding malware; Internal database to detect known viruses."
Step 3: Recommend features — "Based on the features you have already selected, we recommend the following three features. Please confirm: Network intrusion detection (Why?); Real-time file monitoring (Why?); Web history and cookies management (Why?). Click here for more recommendations."
The dialogue also offers: "We notice that you appear to be developing an anti-virus software system. Would you like to browse the feature model? [View Feature Model]"]

Fig. 1. Example of Feature Recommendations

product. An initial product description is mapped to four features in the recommender's knowledge base related to spam detection, disk scans, virus definitions, and virus databases. This initial profile is then augmented to include the additional features of network intrusion, file monitoring, and web history and cookies management. In our approach, the user can interact with the system in order to provide feedback on candidate features. In this example scenario, the user rejects network intrusion; however, file monitoring and web history are added to the initial product profile. Additional recommendations (not shown) are then generated based on this profile.

In our previous work [8] we introduced our feature recommender system and evaluated it in a restricted domain with relatively small software products such as anti-virus software, multimedia iPod applications, and photography applications. In this paper, we extend this prior work by making several additional contributions. First, we provide a more detailed analysis of our technique for extracting features from online listings utilizing our Incremental Diffusive Clustering (IDC) algorithm. Previously, we had shown anecdotally that IDC performed well for clustering requirements and features. In this paper we present a quantitative evaluation of IDC by comparing it against well-established algorithms including K-Means [9], Fuzzy K-Means [10], Spherical K-Means [11], and Latent Dirichlet Allocation (LDA) [12], and we demonstrate that it outperforms them for purposes of clustering feature descriptors. Second, we demonstrate and evaluate our approach across a broader set of domains including applications for the Internet, security, system solutions, office tools, and programming. Finally, as our recommender system is intended to support domain analysis in large projects, we evaluate it within the context of a project designed to deliver

[Figure, two panels.
(a) Constructing the recommender system: a screen scraper ① mines product descriptions and raw feature descriptions; Incremental Diffusive Clustering ② groups descriptors into features and assigns feature names; the resulting Product × Feature matrix ③ is used to derive association rules stored in a Frequent Itemset Graph (FIG) ④; these feed a recommender system ⑤ based on association rules and a content-based (kNN) recommender.
(b) Using the recommender system to recommend features: a natural language processor and feature matcher build an initial profile from the product idea; the association rule recommender augments the initial profile; the content-based (kNN) recommender then generates further recommendations, guided by domain analyst feedback and accepted recommendations.]

Fig. 2. Feature Recommendations

a software suite for supporting remote collaboration. This project incorporates features from a broad set of domains and demonstrates the usefulness of our recommender system in a more realistic development scenario.

Our Feature Recommender System includes several different components, as illustrated in Figure 2(a). The first component is the screen scraper ①, responsible for mining descriptors from online product specifications. The second component is the incremental diffusive clustering algorithm ②, which clusters descriptors into features and extracts meaningful names for each cluster. The third component is a product-by-feature matrix ③, which is generated as a by-product of the clustering process and is then used to construct a frequent itemset graph (FIG) ④ [13], [14] from which feature association


rules can be generated. The final component is the recommender system itself ⑤. This system utilizes both the product-by-feature matrix and the FIG to generate on-demand feature recommendations. Figure 2(b) depicts how feature recommendations are made through a series of interactions with the user (steps ⑥ to ⑨).

The remainder of this paper is laid out as follows. Section 2 describes our approach for mining features from online product descriptions, provides a qualitative evaluation of the mined features, and reports the results of a comparative evaluation of IDC versus other well-established clustering techniques. Section 3 describes our feature recommender system, which is then evaluated in Section 4. Section 5 describes the case study we conducted in which the recommender system was used to support domain analysis activities for a collaborative online software suite. Section 6 discusses threats to validity for our work, Section 7 describes related work, and finally Section 8 summarizes our main findings and suggests areas of future work.

2 FEATURE EXTRACTION

Many different websites, such as SoftPedia, Google Apps Marketplace, and Softonic, provide descriptions of software products. For purposes of our study, we utilized Softpedia.com, which indexes hundreds of thousands of software products. Most products in Softpedia include a product summary and a bulleted list of features. The Screen-Scraper utility was used to extract both the product summary and the list of features. The summary data was split into sentences, which were added to the features extracted from the bulleted list, creating a combined list which we refer to as feature descriptors. In this study, 493,347 feature descriptors were retrieved from 117,265 products, categorized under 21 categories and 159 subcategories. Table 1 depicts a sample of the feature descriptors mined from two typical SoftPedia products.

Table 2 illustrates a small subset of the features manually identified for the Softpedia anti-virus software products. It shows that most of the analyzed anti-virus systems include core features such as Automatic Updates or Malware Scan, as well as variants such as File Backup and Data Encryption which are found in only a small subset of products. Furthermore, certain features, such as Anti-rootkit Scan, are found primarily in anti-virus software applications, while others, such as Password Manager, are common across many different product categories. Various associations can be observed among features. For example, 80% of products containing parental controls also track web history, while 77% of anti-virus products that provide integration with 3rd party software do NOT support protected files.

TABLE 1
Sample Descriptors from Two SoftPedia Products

FAR 3.0:
• Drag and drop facility for copy and move operations
• Easy configurable options: internal/external file viewer and text editor, file operation associations for certain file types, panel view and file sorting modes
• Long file name support
• NTFS "compressed" and "encrypted" (Win2K) attribute and hard/symbolic link support
• Plugin modules and commands: the default plugin set includes an archive management plugin, FTP client, network browser, print manager and temporary panel, but you may write your own plugins
• Tunable configuration, color scheme customization
• Clipboard functions

Windows Phone Device Manager:
• Applications management: view, install/uninstall homebrew applications
• File management: explore device, exchange files with your phone
• Send to Windows Phone (to send files, apps, ringtones, web links in one click)
• Add and manage custom ringtones
• Send SMS, E-mail, notes directly from PC (without needing cloud services)
• Take screenshots of the phone
• Programming sends notifications to phone
• Shared clipboard with PC and phone
• Integration with Windows Explorer (drag & drop, copy/paste, view apps icons and details)

2.1 Incremental Diffusive Clustering

The centerpiece of the feature discovery component of the system is our Incremental Diffusive Clustering algorithm [8], [15], [16]. The goal of this component is to group extracted product feature descriptors into an aggregate representation of a candidate feature. Thus, each discovered cluster represents a feature that captures common characteristics across products. We originally developed IDC to address perceived cohesiveness problems observed when clustering short documents, such as requirements, features, and fault descriptions, using existing clustering algorithms [17].

Although anecdotally IDC outperformed other techniques for clustering features, we did not previously evaluate it in a rigorous manner. In this section, we first describe the IDC algorithm, and then report on a new quantitative and qualitative analysis that we conducted.

IDC is an incremental clustering approach, in which clustering is achieved through a series of iterations. The best cluster is retained from each iteration, and its dominant terms are identified and then removed from all feature descriptors. This allows cross-cutting, yet non-dominant, topics to emerge in subsequent clustering iterations, resulting in fuzzy clusters in which each feature descriptor can be assigned to more than one cluster [8], [18]. The iterative diffusion process is important in obtaining clusters that both identify unique


TABLE 2
A Small Sample of Features for a Selection of Softpedia Anti-Virus Products

[Product-by-feature matrix: the columns list roughly 70 anti-virus products (e.g., a-squared Free, Avast! Pro AV, AVG AV + Firewall, Avira AntiVir Premium, BitDefender Total Sec. '10, COMODO Cloud Scanner, Dr.Web, Kaspersky Ultra-Portables, McAfee VirusScan, Norton AV, ZoneAlarm Sec. Suite), and a bullet marks each product that provides a given feature. The feature rows include: Active detection of downloaded files; Active detection of instant messaging; Active detection of removable media; Active detection of search results; Anti-rootkit scan; Automatic scan; Automatic scan of all files on startup; Automatic updates; Behavioral detection; Command line scan; Contain viruses in specific quarantine; Customized firewall settings; Data encryption; Detection of suspicious/injected DLLs; Disk scan for finding malware; Email detection; File backup; and others.]

Note: This table was manually compiled by researchers at DePaul University through inspecting anti-virus product listings at http://www.SoftPedia.com. It therefore only includes features listed on SoftPedia.

characteristics that contribute to a given feature, as well as retain those characteristics that are common across distinct features. Our goal in designing the IDC algorithm was to generate overlapping clusters that correspond to product features, yet at the same time maintain the clustering cohesiveness that is typically achieved by crisp clustering techniques, such as standard variations of the k-means algorithm. Our experimental evaluation, presented later in this section, shows that the IDC algorithm does indeed achieve this goal.

2.1.1 Preprocessing
Initial feature descriptors are first preprocessed by tokenizing them into a set of keywords, removing commonly occurring words (stop words), and stemming the remaining keywords to their root form (i.e., terms). Given the dictionary of all such terms T = {t_1, t_2, ..., t_W}, each feature descriptor f_i is represented as a vector of terms, v_i = (f_{i,1}, f_{i,2}, ..., f_{i,W}), where f_{i,j} is a term weight representing the number of occurrences of term t_j in the feature descriptor f_i. These term weights are then transformed using a standard term-frequency, inverse document frequency (tf-idf) approach [9] such that tf\text{-}idf(f_{i,j}) = f_{i,j} \cdot \log_2(D/df_j), where D represents the total number of feature descriptors, and df_j represents the number of feature descriptors containing term t_j. Finally, the transformed vector (with tf-idf weights) is normalized to a unit vector, resulting in the vector V_i = (F_{i,1}, F_{i,2}, ..., F_{i,W}). The following steps are then executed to identify features.
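To make this step concrete, the following is a minimal Python sketch of the descriptor-to-vector transformation. It assumes a simple whitespace tokenizer and an illustrative stop-word list, and omits stemming; the log_2(D/df_j) weighting and unit normalization follow the definitions above.

import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "for", "with", "before"}

def tokenize(descriptor: str) -> list[str]:
    # toy tokenizer: lowercase, split on whitespace, drop stop words
    return [t for t in descriptor.lower().split() if t not in STOP_WORDS]

def tfidf_vectors(descriptors: list[str]) -> list[dict[str, float]]:
    docs = [Counter(tokenize(d)) for d in descriptors]
    D = len(docs)
    df = Counter()                       # document frequency df_j of each term
    for doc in docs:
        df.update(doc.keys())
    vectors = []
    for doc in docs:
        # tf-idf(f_ij) = f_ij * log2(D / df_j)
        v = {t: f * math.log2(D / df[t]) for t, f in doc.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})   # unit vector V_i
    return vectors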

2.1.2 Granularity
To determine how many features (clusters) to generate for a given product category, IDC uses a modified version of Can's metric [19], which considers the degree to which each feature descriptor differentiates itself from other feature descriptors. The ideal number of clusters K is computed as follows:

K = \sum_{v_i \in F} \sum_{j=1}^{W} \frac{f_{i,j}}{|v_i|} \cdot \frac{f_{i,j}}{N_j} = \sum_{v_i \in F} \frac{1}{|v_i|} \sum_{j=1}^{W} \frac{f_{i,j}^2}{N_j}    (1)

where N_j represents the total number of occurrences of term t_j. Through observing the results of preliminary clustering experiments, we modified the algorithm to exclude terms for which N_j fell below a threshold determined empirically to be 0.0075D. Introducing this threshold improved cluster quality by removing low-frequency terms that were not relevant or useful for identifying potential features.
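A small sketch of Equation 1 with the frequency cutoff described above; it assumes docs holds one raw term-frequency Counter per descriptor, with |v_i| taken as the descriptor's total term count.

from collections import Counter

def estimate_k(docs: list[Counter], min_freq_ratio: float = 0.0075) -> int:
    D = len(docs)
    N = Counter()                        # N_j: total occurrences of each term
    for doc in docs:
        N.update(doc)
    cutoff = min_freq_ratio * D          # exclude terms with N_j below 0.0075*D
    k = 0.0
    for doc in docs:
        length = sum(doc.values())       # |v_i|
        if length:
            k += sum(f * f / N[t] for t, f in doc.items() if N[t] >= cutoff) / length
    return round(k)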

2.1.3 Clustering

The feature descriptor vectors extracted in the preprocessing stage are automatically clustered using the incremental diffusive clustering algorithm (IDC) [8], [15] in order to extract a meaningful and cohesive set of final features. In each iteration, IDC first automatically clusters the feature descriptors, and then identifies and retains the highest quality cluster based on the cohesiveness and size of the cluster. Once this cluster has been selected, its dominant terms are identified and removed from all clusters. This allows other cross-cutting, but non-dominant, topics to emerge in the subsequent clustering iterations. In our initial rendition of IDC, a consensus-based Spherical K-Means clustering approach was used in each iteration in order to increase the cohesiveness of intermediate clusters. However, to address scalability issues associated with clustering 493,347 feature descriptors, we replaced consensus clustering with standard Spherical K-Means clustering. Furthermore, to reduce runtime, we clustered feature descriptors in


each of the Softpedia categories separately, and then merged similar clusters across the categories.

2.1.4 Selecting the best cluster
In each iteration IDC promotes the "best" cluster to the status of a feature. Based on initial observations, the best cluster from a set L = {C_1, C_2, ..., C_K} is one that has high levels of cohesion with broad coverage of the topics. To measure cohesion for a cluster C_i = {V_{i,1}, V_{i,2}, ..., V_{i,r}} with centroid \bar{C}_i, the similarity between the feature descriptors and their associated centroid is computed and averaged as (\sum_{j=1}^{r} sim(V_{i,j}, \bar{C}_i))/r, while topic coverage is computed as \sum_{j=1}^{r} sim(V_{i,j}, \bar{C}_i). Cohesion and topic coverage scores are added together, and the cluster with the highest combined score is selected as the "best" cluster and promoted to the status of a feature. Note that if cohesion alone had been used to select the "best" cluster, the algorithm would have been biased towards small clusters; in the extreme case, a cluster with only one feature descriptor would always be selected. Topic coverage was therefore used as a second criterion to balance cohesion and cluster size.

2.1.5 Removing dominant terms
To remove dominant terms, terms exhibiting weights above a predefined threshold (0.15) in the centroid vector of the "best" cluster are selected. Since the centroids are also normalized, this threshold is held constant. These terms are then removed from all descriptors in the dataset. For example, if the dominant terms are identified as instant, email, and encrypt, then a descriptor originally specified as "Encrypt email messages before transmission" is reduced to "messages transmission" (with the word before also removed as a stop word). Future rounds of clustering are then performed on the reduced versions of the descriptors.

This process is repeated until the targeted number of features, determined previously in the granularity step (Section 2.1.2), has been identified.
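The following condensed sketch ties Sections 2.1.3 through 2.1.5 together: cluster, promote the best cluster by cohesion plus coverage, then diffuse by zeroing the promoted cluster's dominant terms. It assumes dense, unit-normalized tf-idf rows, and substitutes scikit-learn's standard KMeans on unit vectors for Spherical K-Means; the 0.15 dominant-term threshold follows the text, while n_per_round is an illustrative parameter.

import numpy as np
from sklearn.cluster import KMeans

def idc(X: np.ndarray, n_features: int, n_per_round: int = 10,
        dominant_threshold: float = 0.15) -> list[np.ndarray]:
    X = X.copy()                                   # unit-normalized tf-idf rows
    promoted = []                                  # member indices per feature
    while len(promoted) < n_features:
        labels = KMeans(n_clusters=n_per_round, n_init=5).fit_predict(X)
        best, best_score = None, -np.inf
        for c in range(n_per_round):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            centroid = X[members].mean(axis=0)
            centroid /= np.linalg.norm(centroid) + 1e-12
            sims = X[members] @ centroid           # cosine similarities
            score = sims.mean() + sims.sum()       # cohesion + topic coverage
            if score > best_score:
                best, best_score = members, score
        promoted.append(best)
        # diffuse: remove the promoted cluster's dominant terms everywhere
        centroid = X[best].mean(axis=0)
        centroid /= np.linalg.norm(centroid) + 1e-12
        X[:, centroid > dominant_threshold] = 0.0
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        X = np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)
    return promoted

Because every iteration reclusters all descriptors after the dominant terms are removed, the promoted member sets naturally overlap, which is the fuzzy behavior the diffusion step is designed to produce.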

2.1.6 Post processing
A post-processing step is executed to optimize the identified features. This step involves the four tasks of (i) removing misfits, (ii) recomputing centroids, (iii) checking for missing members, and finally (iv) merging similar features. Misfits are removed by computing the similarity score between each centroid and each feature descriptor in the corresponding cluster. Descriptors whose similarity to the centroid falls below a given threshold value (set to 0.35 based on empirical observations) are removed from the cluster. Centroids are then repositioned according to the remaining feature descriptors using the same SPK algorithm adopted in the initial clustering step. Missing feature descriptors are identified by recomputing the cosine similarity between every feature descriptor not currently assigned to a cluster and the centroid of that cluster. Any feature descriptor exhibiting a similarity score higher than a given threshold (set to 0.35) is added to that cluster. Finally, similar clusters are merged by computing the cosine similarity between each pair of centroids. Any pair of clusters exhibiting a score above a given threshold (set to 0.55) is merged into a single feature.

2.1.7 Feature naming
Each feature is named by identifying the medoid, defined as the descriptor that is most representative of the feature's theme. The medoid is identified by first computing the cosine similarity between each feature descriptor and the centroid of the cluster, and then averaging all term weights in the feature descriptor vector above a certain threshold (0.1). The cosine similarity and the average of the dominant term weights are then added together to produce the score of each feature descriptor. The feature descriptor scoring the highest value is selected as the medoid, and the corresponding original feature descriptor (from Softpedia) is selected as the name for the feature. This approach produces quite meaningful names. As an example, a feature based on the theme of updat, databas, automat, viru was subsequently named Virus definition update and automatic update supported.
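A sketch of this medoid-based naming rule, assuming each cluster is given as a list of (original descriptor text, unit vector) pairs; the 0.1 dominant-weight threshold follows the text.

import numpy as np

def name_feature(cluster: list[tuple[str, np.ndarray]],
                 weight_threshold: float = 0.1) -> str:
    vectors = np.stack([v for _, v in cluster])
    centroid = vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-12
    best_text, best_score = "", -np.inf
    for text, v in cluster:
        cos = float(v @ centroid)                 # similarity to the centroid
        heavy = v[v > weight_threshold]           # the descriptor's dominant terms
        score = cos + (float(heavy.mean()) if heavy.size else 0.0)
        if score > best_score:
            best_text, best_score = text, score
    return best_text      # the medoid's original descriptor becomes the name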

2.1.8 Merging Category Clusters
Our approach involves processing each product category separately and then merging the results into a single product-by-feature matrix. This directly mitigates the scalability issues of dealing with large numbers of features, and has the additional extensibility benefit of allowing new product categories to be added incrementally. Merging is accomplished by computing the cosine similarity between each pair of features, and then merging features exhibiting scores of 0.6 or higher.

2.2 Quantitative Cluster Evaluation
Because of the importance of the feature mining process in our system, we performed a quantitative analysis of the generated features. Quantitative cluster evaluation measures are generally classified into two types: unsupervised and supervised. Unsupervised measures, such as cluster cohesion and cluster isolation, evaluate the clustering algorithm without reference to external information, and therefore fail to evaluate how well the generated clusters support specific tasks [9]. They also tend to measure localized cluster quality rather than optimizing for a global utility. In contrast, supervised measures, often referred to as external measures, provide a more general assessment of quality by evaluating the extent to which generated clusters match an external ideal structure or ground truth (usually provided in an answer set). For our evaluations, the ideal clusters, also called classes,


were produced by human experts who manually clustered the feature descriptors.

IDC was evaluated in the context of three different datasets: (i) antivirus product descriptions extracted from SoftPedia, (ii) multimedia video product descriptions extracted from SoftPedia, and (iii) a set of 809 feature requests for an Airport Kiosk, collected from approximately 30 people in a requirements management forum. For the first two datasets, clusters were formed by members of our research team. To accomplish this, feature descriptors in each category were first pruned to remove redundancies. After pruning the antivirus dataset, 830 of 1075 features remained, which were then grouped into 27 clusters. Similarly, the set of 3710 features in the multimedia dataset was reduced to 3176 features, which were clustered into 25 groups by experts. For the airport dataset, an ideal clustering was produced by a team of four college students who conducted a card sorting exercise [20], producing 40 groups.

For evaluation purposes, four competing approaches were selected: Latent Dirichlet Allocation (LDA) [12], Spherical K-Means [11], K-Means, and Fuzzy K-Means [10]. The K-Means algorithm is a common baseline for comparing and evaluating clustering algorithms, and is used in our experiments as one of the competing methods. To determine how much the variation of Spherical K-Means clustering used in the IDC approach improves upon the standard algorithm, we also compared the results of our algorithm with the Spherical K-Means method.

The clusters produced by both the K-Means and Spherical K-Means algorithms are crisp clusters that do not contain overlaps. In our experiments, Fuzzy K-Means was selected as a competing clustering technique which produces soft clusters, where each item can have a degree of membership in each cluster.

Instead of using vector space models to represent feature descriptors, as in K-Means and IDC, other techniques find the latent topics in each descriptor. Some well-known techniques in this category are Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA), and the Mixture of Dirichlet Compound Multinomials (DCM). Among these models, we selected LDA for our comparisons, as it has been shown to perform well for semi-structured and text-oriented data.

2.2.1 Purity
Purity is measured by comparing generated clusters to the answer set clusters (also called classes). Each generated cluster is matched with the answer set cluster with which it shares the most descriptors. Purity is then computed by counting the number of correctly assigned descriptors and dividing by the total number of descriptors, denoted N. In other words, let W = {w_1, w_2, ..., w_n} be the set of clusters found by the clustering algorithm and C = {c_1, ..., c_l} be the set of classes. Purity can be computed as follows [9]:

purity(W, C) = \frac{1}{N} \sum_{k} \max_{j} |w_k \cap c_j|    (2)

Purity takes values in the range 0 to 1. A perfect clustering algorithm has a purity close to 1, while a poorly performing clustering method has a purity closer to 0.
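Equation 2 translates directly into code; this sketch assumes clusters and classes are given as sets of descriptor ids.

def purity(clusters: list[set], classes: list[set]) -> float:
    # Eq. (2): each cluster is credited with its best-matching class
    N = sum(len(w) for w in clusters)
    return sum(max(len(w & c) for c in classes) for w in clusters) / N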

Fuzzy K-Means and LDA are soft clustering methods and produce membership probabilities for each descriptor. There are two approaches for comparing these methods against crisp clustering algorithms like K-Means and Spherical K-Means: one way is to place a threshold on the membership probabilities and then, for each descriptor, choose all clusters for which it has probabilities above the threshold; the second approach is to select only the highest-probability cluster for each descriptor. The quality of the clustering method also depends on the parameters and threshold values used in each of the methods. For experimental purposes, we ran each technique with different parameters and then used the best result for comparison against the other techniques. All four algorithms were configured to produce the same number of clusters as the answer set.

As noted earlier, purity is a localized metric and does not measure the global quality of the clustering as a whole. Generally, for larger numbers of clusters, each cluster will exhibit higher degrees of purity. In the extreme case where the number of clusters is equal to the number of items, purity is maximized at 1.0. To mitigate this problem, we also computed normalized mutual information (NMI).

2.2.2 Normalized Mutual Information (NMI)
NMI is calculated as follows:

NMI(W, C) = \frac{I(W; C)}{[H(W) + H(C)]/2}    (3)

In the above formula, I(W; C) represents mutual information and is computed as in Equation 4. This value is a measure of the amount of information that clusters contain about classes. Mutual information is zero if the clusters are totally unrelated to the actual features. H represents entropy, which measures the degree to which each cluster consists of objects of a single category:

I(W; C) = \sum_{k} \sum_{j} \frac{|w_k \cap c_j|}{N} \log \frac{N |w_k \cap c_j|}{|w_k| \, |c_j|}    (4)

H(W) = -\sum_{k} \frac{|w_k|}{N} \log \frac{|w_k|}{N}    (5)
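Equations 3 through 5 transcribe directly as well; the guards below skip empty intersections, whose mutual-information contribution is zero.

import math

def nmi(clusters: list[set], classes: list[set]) -> float:
    N = sum(len(w) for w in clusters)
    def entropy(parts):                                    # Eq. (5)
        return -sum((len(p) / N) * math.log(len(p) / N) for p in parts if p)
    mi = sum((len(w & c) / N) * math.log(N * len(w & c) / (len(w) * len(c)))
             for w in clusters for c in classes if w & c)  # Eq. (4)
    return mi / ((entropy(clusters) + entropy(classes)) / 2)   # Eq. (3)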


2.2.3 Analysis of the results

As discussed in Section 2, feature descriptors are generated by splitting the product summary data into sentences and taking each sentence as a descriptor. There can be situations where a sentence describes more than one feature. The problem with using crisp clustering techniques, such as Spherical K-Means and K-Means, is that some descriptors, common across multiple features, may be assigned to only one feature. The IDC clustering algorithm, however, produces overlapping clusters in which each feature descriptor can be assigned to more than one cluster. Although this aspect of the clustering algorithm is not captured by the common evaluation metrics, including purity and NMI, it is particularly important for our feature recommendation system. As an example, consider Table 3, which shows two overlapping clusters generated by the IDC algorithm. In this example, one of the clusters relates to the Joining multiple video files feature while the other cluster represents the Splitting a large video file feature. A representative set of descriptors for each of the two clusters, as well as a sample of common feature descriptors, is presented. To illustrate the problem addressed here, consider the feature descriptor Split and join screen recordings, which describes two separate features (split and join). Crisp clustering methods assign this descriptor to one, and only one, of the two related features, while the IDC method assigns it to both clusters.

As noted earlier, our goal in designing the IDC algorithm was to generate the type of overlapping clusters described above, and yet maintain the clustering effectiveness (as measured by the NMI and purity metrics) that is typically achieved by crisp clustering techniques. Although the LDA and Fuzzy K-Means methods can also produce soft clusters, IDC generally showed better performance in terms of purity and NMI in comparison to these two methods, and was therefore used in our approach.

Figure 3 compares the purity of IDC with the other clustering methods. For the antivirus dataset, Spherical K-Means achieved slightly better purity than IDC (a 3% improvement), while the other competing methods returned lower purity than IDC. For the multimedia video dataset, IDC had the best performance, improving on the Spherical K-Means and standard K-Means algorithms by 11% and 16% respectively. For the airport dataset, Spherical K-Means, K-Means, and IDC all had very similar purity.

Figure 4 depicts the NMI of the different clustering methods for each of the three datasets. The IDC and Spherical K-Means algorithms both returned higher NMI than the other competing methods over all three datasets. Similar to the results for purity, IDC and Spherical K-Means had very similar performance in terms of NMI. More specifically, Spherical K-Means

[Bar chart: purity of IDC, K-Means, LDA, Fuzzy K-Means, and SPK-Means on the Antivirus, Multimedia Video, and Airport datasets; y-axis: Purity (0 to 1).]

Fig. 3. Comparison of Purity for Different Clustering Algorithms

[Bar chart: NMI of IDC, K-Means, LDA, Fuzzy K-Means, and SPK-Means on the Antivirus, Multimedia Video, and Airport datasets; y-axis: Normalized Mutual Information (0 to 0.8).]

Fig. 4. Comparison of Normalized Mutual Information for Different Clustering Algorithms

returned about 1.5% higher NMI for the antivirus dataset, while IDC returned 1% higher NMI for the multimedia video dataset. The two methods had the same NMI for the airport dataset.

2.2.4 Recall and Precision
As explained in Section 2.2.3, in some cases a feature descriptor describes more than one feature. If crisp clustering methods are used, valid features can be missed for a descriptor, and as a result, the generated product-by-feature matrix is unnecessarily sparse. This is undesirable, as sparsity is a major issue known to limit the quality of recommendations [21], [22].

Based on our evaluations in the previous sections, IDC and SPK-Means had very similar performance in terms of purity and NMI (while performing better than the other baselines). In this section, we compare the precision and recall of assigning descriptors to features for these two clustering methods. This evaluation was performed by three students who were familiar with concepts of software engineering


TABLE 3
An Example of Two Overlapping Clusters

Cluster A - Join Multiple Files into One Large File:
• Join multiple video files into one large AVI, MPEG, WMV file
• Agile MPEG Video Joiner is a powerful easy-to-use tool to join multiple video files
• It supports joining multiple MPEG, AVI-DivX, WMV-ASF and other video file formats
• Join/merge MPEG-1 and MPEG-2 video

Cluster B - Split a Large File into Smaller Files:
• Split, cut a large video file into smaller video, audio clips
• VidSplitter is a complete video splitting tool that lets you split large MPEG, ASF files into smaller video clips in various formats
• Split video clips out from MPEG4, MP4, MPEG easily with only 3 steps
• Split file between all supported formats

Sample of Common Feature Descriptors:
• MP3 MPEG Joiner/Splitter is a software that joins/splits files of the type .mp3 or .mpg
• Split Join Convert MOV is an efficient program designed to split, join or convert Quicktime MOV or QT files to AVI, MPEG or WMV formats
• Split Join Convert Video is a versatile tool to split, join or convert video files
• Split and join screen recordings
• Video Splitter Joiner and Converter is an all-in-one tool to split, join or convert video files
• A handy video file splitting and joining discount pack that includes Ultra Video Splitter and Ultra Video Joiner

and domain analysis. However, the evaluators did not have any prior knowledge of our data or the specifics of our project. For this evaluation, a sample of 32 feature descriptors was selected from the antivirus and multimedia software categories of Softpedia. The sample was selected such that half of the descriptors were assigned to more than one cluster by IDC while the remaining descriptors belonged to only one cluster. The reason for this selection was to evaluate false positive mappings (a descriptor is assigned to a feature when it should not be) and false negative mappings (a descriptor should be assigned to a feature but is not).

The users were presented with the union of the features discovered by IDC and SPK-Means. For each descriptor, the participants were asked two questions: (1) Name the set of features that should be assigned to this descriptor. (2) If any, name the set of features that should be assigned to this descriptor but are not in the set of presented features.

For each descriptor, recall and precision were computed based on the answers from each participant and were then averaged to compute the final recall and precision for that descriptor. As the number of test items is greater than 30, we can assume that the sample is approximately normal. We used Student's t-test to evaluate the following two null hypotheses:

Hp: The mean precision achieved using IDC is equal to the mean precision achieved using SPK-Means.

Hr: The mean recall achieved using SPK-Means is greater than or equal to the mean recall achieved using IDC.

Table 4 presents the average precision and recall for IDC and SPK-Means, as well as the p-value for the two tests. The results show that while the mean precision for IDC and SPK-Means is not significantly different, the mean recall for IDC is significantly higher than for SPK-Means.

TABLE 4
Student's t-test results for comparison of mean precision and mean recall for IDC and SPK-Means

      Average for SPK-Means   Average for IDC   p-value
Hp    0.85                    0.82525           0.632484247
Hr    0.532142857             0.680059524      0.000906678
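For reproducibility, the tests behind Table 4 can be run as below. This sketch assumes per-descriptor precision and recall lists for the two methods and uses SciPy's paired t-test (SciPy >= 1.6 for the one-sided alternative); the paper does not state whether a paired or unpaired test was used, so the pairing is an assumption.

from scipy import stats

def table4_tests(prec_idc, prec_spk, rec_idc, rec_spk):
    # Hp (two-sided): mean precision of IDC equals that of SPK-Means
    _, p_precision = stats.ttest_rel(prec_idc, prec_spk)
    # Hr (one-sided): rejected when IDC's mean recall is significantly
    # greater than SPK-Means' mean recall
    _, p_recall = stats.ttest_rel(rec_idc, rec_spk, alternative="greater")
    return p_precision, p_recall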

2.3 Qualitative Feature Evaluation

We also performed a qualitative evaluation of the clusters generated by each technique. This was particularly important because our recommender system will be used to support domain analysis tasks, and so must generate clusters which are meaningful and understandable to human users. To assess the quality of the generated features, the three product categories of antivirus software, multimedia iPod applications, and photography applications were evaluated. All descriptors from each product in these categories were retrieved from the SoftPedia Web site. A card-sorting activity was then conducted to group feature descriptors into features. This task was performed and validated by three students who were familiar with these domains, but otherwise previously unrelated to this project. These manually identified features served as the answer set for the remainder of the evaluation. Each answer set took between 18 and 25 hours to construct, and resulted in the identification of 27 distinct features for antivirus, 30 for iPod, and 38 for multimedia photography.

In the next step of the evaluation, the IDC algorithm was used to automatically identify features. These features were then evaluated against the answer sets by three additional students at DePaul. Each assessor judged whether each automatically generated feature represented (i) a clearly defined feature that matched a feature in the answer set, (ii) a meaningful feature that was not included in the answer set, (iii) a non-unique feature that overlapped with a previously generated feature, or (iv) an ill-formed feature that was non-cohesive or otherwise did not make sense.


They were also asked to rate the feature name as (i) Excellent, (ii) Good, (iii) Poor, or (iv) Completely inadequate, where an excellent feature name was described as realistic and professional sounding, and an inadequate feature name exhibited significant problems such as being too lengthy or grammatically incomplete. Finally, each evaluator was asked to either match each feature in the manually created answer set with a similar feature generated by IDC, or to mark it as missing.

Results from the evaluation are shown in Table 5. While our approach discovered only about 56% of the features in the answer set, it also discovered about 30% new unique features that were missed in the manually constructed answer set. This occurred despite the fact that analysts spent 18 to 25 hours constructing each answer set. Our initial observations suggest that additional features could have been detected by expanding the scope of the screen scraper to mine broader product descriptors beyond items listed as features. It is also worth noting that our naming algorithm performed very well: almost 85% of the feature names were categorized as Good or Excellent, and only 2% were deemed inadequate.

TABLE 5
User Evaluation of Feature Quality and Completeness

                                   Virus   iPod   Photo   Overall
Total features in answer set        27      30     38      95
In both auto gen. and answer set    18      15     20      53
Answer set only                      9      15     18      42
Auto generated only                  9      15     19      43
Redundant features                   3       3      1       7
Ill-formed features                  4       2      2       8
Excellent name                      0.45    0.50   0.69    0.57
Good name                           0.20    0.36   0.27    0.28
Poor name                           0.25    0.11   0.04    0.13
Inadequate name                     0.05    0.03   0.00    0.02

3 FEATURE RECOMMENDATION

Based upon the clusters generated using IDC, we create a binary product-by-feature matrix, M := (m_{i,j})_{P×F}, where P represents the number of products (117,265), F is the number of identified features (1,135), and m_{i,j} is 1 if and only if feature j includes a descriptor originally mined from product i. This matrix, which contains the complete set of recommendable features, referred to from now on as the feature pool, is used to generate feature recommendations in the following steps.

3.1 Creating initial product profile
First, an initial product profile is constructed in a format compatible with the feature model. To accomplish this, the domain analyst creates a short textual description of the product, which is then processed in

order to match elements of the description to features in the feature pool. This matching is accomplished by first performing basic preprocessing, such as tokenization, stemming, and removal of stop words, and then converting the product description p to a term vector p = (w_{1,p}, w_{2,p}, ..., w_{n,p}), where each dimension corresponds to a separate term. We used the standard term frequency-inverse document frequency (tf-idf) approach [9], which computes the weight for each term based on the normalized term frequency and the inverse document frequency.

Once the product description is converted to term vector form, it is compared to the term vector representation of each feature in the feature pool using a standard information retrieval metric such as cosine similarity. Features are then ranked according to their similarity to the product description and presented to the analyst for confirmation. In certain cases, the product description might describe a feature which is not matched to the feature pool. This may occur because the feature pool does not provide coverage of the intended functionality; in this case, the recommender system will be unable to incorporate this information into the recommendation process. In other cases, a match may not be found because the analyst did not describe the product in sufficient detail or used wording not represented in the model; in this case, the analyst can expand the description in an effort to make a match. To implement this phase of the process we utilized Poirot [23], a tool originally developed by our research group to support automated trace retrieval. Its search capabilities were ideally suited to the feature matching task.

As a result of this step of the process, an initial product profile is appended to the product-by-feature matrix.
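A minimal sketch of the matching step above: cosine-ranking the description vector against each feature's vector. (The paper used the Poirot trace-retrieval tool for this; a plain cosine ranking stands in for it here, and the function and parameter names are illustrative.)

import numpy as np

def build_initial_profile(description_vec: np.ndarray,
                          feature_vecs: np.ndarray,
                          top_n: int = 10) -> list[int]:
    # rows are assumed unit-normalized, so the dot product is cosine similarity
    sims = feature_vecs @ description_vec
    return np.argsort(-sims)[:top_n].tolist()   # candidate features shown to the analyst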

3.2 Feature Recommendations using Association Rule Mining
As previously explained, our approach utilizes two different recommendation algorithms. The first algorithm addresses the potential problem that the initial profile is relatively sparse and contains only a few features. To rectify this problem, we utilize association rule mining [5], which is known to return relatively accurate, albeit incomplete, recommendations. It is therefore ideal for the first step of augmenting the product profile.

Association rule mining was originally developed to support "market basket analysis", i.e., to analyze products that buyers tend to purchase at the same time. For example, the association rule milk, butter ⟹ bread means that customers who purchase milk and butter are also likely to purchase bread. Association rule mining has been used to build recommender systems in several different domains including e-commerce and intelligent Web applications [24], [13], [25]. Association rules can be learned


from the product-by-feature matrix and then used to make recommendations. For example, given a rule stating that email spam detector, virus definition update ⟹ web history and cookies management, a product profile containing an email spam detector and a virus definition update could result in a recommendation to add a feature to support web history and cookies management. In general, when a partial profile is matched against the antecedent of a discovered rule, the items on the right-hand side of the matching rules are sorted according to the confidence values of the rules, and the top-ranked items from this list form the recommendation set.

More formally, given a set of product profiles P, a set of features F = {F_1, F_2, ..., F_k}, and a feature set f ⊆ F, let P_f ⊆ P be the set of products that have all the features in f. The support of the feature set f is defined as σ(f) := |P_f| / |P|. Feature sets that satisfy a predefined support threshold are referred to as frequent itemsets (in the following we will refer to these as frequent feature sets).

An association rule r is expressed in the form X ⟹ Y (σ_r, α_r), where X ⊆ F and Y ⊆ F are feature sets, σ_r is the support of the rule r, defined as σ_r := σ(X ∪ Y), and α_r is the confidence of the rule r, given by α_r := σ(X ∪ Y)/σ(X). The discovery of association rules involves two main parts: the discovery of frequent feature sets (i.e., itemsets which satisfy a minimum support threshold) and the discovery, from these frequent feature sets, of association rules which satisfy a minimum confidence threshold. Various algorithms exist for discovering frequent itemsets and association rules, among which we chose the FP-Growth algorithm [26], as it has been shown to be quite memory-efficient and hence suitable for the size of our data set.

After creating the association rules, new features can be recommended for an initial product profile by finding all the matching rules. In order to reduce the search time, the frequent feature sets are stored in a directed acyclic graph, called a Frequent Itemset Graph (FIG) [13], [14]. The graph is organized into levels from 0 to k, where k is the maximum size among all discovered frequent feature sets. Each node at depth d in the graph corresponds to a feature set I of size d and is linked to the feature sets of size d + 1 that contain I at the next level. The root node at level 0 corresponds to the empty feature set. Each node also stores the support value of the corresponding frequent feature set.

Given a product profile with feature set f, a depth-first search is performed on the graph to level |f|. If a match is found, then the children of the matching node are used to generate candidate recommendations; each child of this node corresponds to a frequent itemset f ∪ {r}. A feature r is added to the recommendation list if the support ratio σ(f ∪ {r})/σ(f) exceeds a pre-specified minimum threshold. The same procedure is repeated for all the subsets of the feature set f.
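A hedged sketch of this lookup, with the FIG flattened into a dictionary mapping each frequent feature set (a frozenset) to its support; a real implementation would walk the graph depth-first instead of scanning all itemsets, and the 0.5 threshold is illustrative.

from itertools import combinations

def fig_recommend(profile: frozenset, support: dict[frozenset, float],
                  min_ratio: float = 0.5) -> dict:
    recs = {}
    # match the profile and every subset of it, as described above
    subsets = [frozenset(s) for n in range(1, len(profile) + 1)
               for s in combinations(profile, n)]
    for f in subsets:
        if f not in support:
            continue
        for itemset, sup in support.items():
            if len(itemset) == len(f) + 1 and f < itemset:
                (r,) = itemset - f                  # the candidate feature
                ratio = sup / support[f]            # sigma(f U {r}) / sigma(f)
                if ratio >= min_ratio and r not in profile:
                    recs[r] = max(recs.get(r, 0.0), ratio)
    return recs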

Figure 5 depicts a section of the graph for features related to audio/video format conversion. Each node represents a frequent feature set, and the number associated with it shows the support value of that feature set. For example, the three features (i) Supports conversion between popular audio files, (ii) Fast conversion speed and excellent output quality, and (iii) Supports batch conversion occur together in 0.005 of all products and are recognized as a frequent feature set.

Features recommended by the association rule mining approach are then presented to the user. The user selects the correct recommendations, and these features are used to augment the initial product profile. The augmented profile is then given as input to the kNN recommendation module to produce more recommendations.

Fig. 5. Part of a Frequent Itemset Graph for audio/video format conversion

3.3 Feature Recommendations using Binary kNN

In our prior work, the product-based k-Nearest Neighbor (kNN) algorithm has been shown to be an efficient method for recommending features and requirements [27], [28]. For the purpose of feature recommendation, the similarity between the new product and each of the existing products in the product-by-feature matrix is computed, and the top k (25) most similar products are selected as neighbors of the new product (note that products with fewer than 6 features were ignored, as their profiles are too sparse to create good neighborhoods). Then the binary equivalent of cosine similarity is used to compute the similarity of the new product p and an existing product n, productSim(p, n), as follows:


productSim(p, n) = |F_p ∩ F_n| / √(|F_p| · |F_n|)    (6)

where F_p denotes the set of features of product p [29]. After forming the neighborhoods, features can be recommended to the new product using an approach similar to that of [7]:

pred(p, f) = ( Σ_{n ∈ nbr(p)} productSim(p, n) · m_{n,f} ) / ( Σ_{n ∈ nbr(p)} productSim(p, n) )    (7)

where n ∈ nbr(p) denotes that n is a neighbor of p, and m_{n,f} is an entry in the matrix M indicating whether product n contains feature f. In general, prediction scores are computed for each candidate feature, and the top R features with the highest predictions are selected.
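Equations (6) and (7) translate directly into code. Below is a minimal sketch of the binary-cosine neighborhood formation and prediction computation, assuming the existing products are represented as a dictionary from product id to feature set (a hypothetical stand-in for the product-by-feature matrix M; the catalog data is invented):

```python
import math

def product_sim(fp, fn):
    """Binary cosine similarity of two feature sets (Eq. 6)."""
    if not fp or not fn:
        return 0.0
    return len(fp & fn) / math.sqrt(len(fp) * len(fn))

def predict_scores(profile, catalog, k=25):
    """Score every candidate feature for a new product profile (Eq. 7).
    catalog maps product id -> feature set; m[n][f] is implicit as f in catalog[n]."""
    # The top-k most similar existing products form the neighborhood.
    nbrs = sorted(catalog, key=lambda n: product_sim(profile, catalog[n]),
                  reverse=True)[:k]
    denom = sum(product_sim(profile, catalog[n]) for n in nbrs) or 1.0
    candidates = set().union(*(catalog[n] for n in nbrs)) - profile
    return {
        f: sum(product_sim(profile, catalog[n]) for n in nbrs
               if f in catalog[n]) / denom
        for f in candidates
    }

# Hypothetical catalog of existing product profiles.
catalog = {
    "p1": {"calendar", "reminders", "tasks"},
    "p2": {"calendar", "reminders", "export"},
    "p3": {"video", "chat"},
}
scores = predict_scores({"calendar", "tasks"}, catalog, k=2)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```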

To further reduce noise in the recommendations, a post-filtering step is added to re-rank the recommendations in favor of domain-specific features. To accomplish this, a cross-cutting factor is computed for every feature by counting the number of product types covered by the corresponding cluster. The prediction score given by the kNN method for each feature is then divided by its cross-cutting factor to produce a new score. The features are ordered in descending order of these new scores, and the top N recommendations are presented to the users.
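A sketch of this post-filtering step, assuming the cross-cutting factors have already been computed per feature (the inputs here are hypothetical dictionaries, not the paper's data structures):

```python
def rerank(pred_scores, cross_cutting, top_n=10):
    """Post-filter kNN predictions: divide each score by the feature's
    cross-cutting factor (number of product types covered by its cluster),
    then keep the top N."""
    adjusted = {f: s / cross_cutting.get(f, 1) for f, s in pred_scores.items()}
    return sorted(adjusted.items(), key=lambda kv: -kv[1])[:top_n]

# A generic "reporting" feature spans many product types, so it is demoted.
print(rerank({"reporting": 0.9, "meeting_reminder": 0.6},
             {"reporting": 6, "meeting_reminder": 1}))
```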

4 FEATURE RECOMMENDER EVALUATION

Evaluating a recommender system is a non-trivial task. The ideal approach would involve conducting a study of several software projects to observe whether any of the recommended features were ultimately included in the final product. However, this type of evaluation would need to be conducted over a long period of time and would require multiple software projects. Fortunately, there are several commonly used statistical methods for evaluating recommender systems [30].

We adopted a standard experimental design based on m-fold cross validation, in which the data was divided into m sets. In each of the m runs, one of the sets is used as the test set and the union of the remaining (m − 1) sets is used for training the recommender system. After m runs of the experiment, each of the m sets has been used exactly once as a test set. The performance of the system was determined by averaging its performance over the m runs.
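A minimal sketch of the m-fold partitioning, under the assumption that products can simply be shuffled and split into m disjoint folds:

```python
import random

def m_fold_splits(products, m=5, seed=0):
    """Partition the products into m folds; each fold serves once as the
    test set while the union of the other folds trains the recommender."""
    items = list(products)
    random.Random(seed).shuffle(items)
    folds = [items[i::m] for i in range(m)]
    for i in range(m):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test

for train, test in m_fold_splits(range(10), m=5):
    print(len(train), len(test))
```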

The goal of these experiments is to determine the performance of the system in making correct feature recommendations for a product with a small set of known features. Therefore, for each product in the test set, L features were randomly selected and used to represent the initial product profile. One of the remaining features was then randomly selected as a target item, and the recommender system was evaluated with respect to whether it was able to recommend back this targeted item. If the feature was recommended, its position in the overall recommendation list for the run was recorded. The rationale for recording the rank of the recommendation is that top recommendations are more valuable to users, who are less likely to scroll to the end of a long list of recommendations.

Results for leave-one-out cross validation recommendation experiments can be evaluated by computing the Hit Ratio, which estimates the probability that a given feature is recommended as part of the top N recommendations produced by the system. Specifically, for each product p in the test set, a feature f_p is randomly removed from the product profile, and a recommendation set R_N(p), comprised of the top-ranked N recommended features, is produced using the remaining features of p. If f_p ∈ R_N(p), a hit is recorded for that product. Let P be the set of products used for evaluation. The hit ratio of the recommendation algorithm, relative to a recommendation set of size N, is computed as hr(N) = |{p ∈ P : f_p ∈ R_N(p)}| / |P|. Typically, hit ratio values are plotted against different values of N. A hit ratio of 1.0 indicates that the algorithm was always able to recommend the hidden feature, whereas a hit ratio of 0.0 indicates that the algorithm was never able to recommend a hidden feature. Note that the hit ratio increases as the recommendation set size N increases; in particular, if N is the total number of possible features, the hit ratio will be 1. Ideally, therefore, a recommender system should achieve relatively high hit ratios even when the recommendation set is small.
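The hit-ratio computation itself is straightforward; the sketch below assumes that, for each test product, we have recorded the rank at which the hidden feature was recommended (or None if it never appeared). The example ranks are invented:

```python
def hit_ratio(results, n):
    """hr(N) = fraction of test products whose held-out feature appeared
    within the top N recommendations. `results` maps product -> 1-based rank
    of the hidden feature, or None if it was never recommended."""
    hits = sum(1 for rank in results.values() if rank is not None and rank <= n)
    return hits / len(results)

# Hypothetical ranks of the hidden feature for four test products.
results = {"p1": 3, "p2": 17, "p3": None, "p4": 42}
print([round(hit_ratio(results, n), 2) for n in (10, 20, 50)])  # [0.25, 0.5, 0.75]
```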

In practice, the usefulness of the recommendations degrades at lower ranks, as users do not usually scroll down to see all the recommendations. Therefore, to compare the performance of two competing algorithms, differences at higher ranks should be given more weight. Although hit-ratio graphs provide a good illustration of the performance of the algorithms at different ranks, comparing mean hit-level values does not capture this point.

In the case of our leave-one-out cross validation experiment, if the target feature f_p for test product p is recommended at rank X, then the precision of the recommendations for p is 1/X. We therefore evaluate the significance of the differences between mean precision values in our comparisons (instead of mean hit-level), mainly because average precision is less sensitive to differences at lower ranks than to differences at higher ranks.
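Under this definition, mean precision can be computed from the same recorded ranks, under the assumption that a product whose target feature is never recommended contributes a precision of 0:

```python
def mean_precision(ranks):
    """Precision for a test product is 1/X when the target feature is
    recommended at rank X; a miss is assumed to contribute 0."""
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)

# A rank-3 hit contributes far more than a rank-42 hit (hypothetical ranks).
print(round(mean_precision([3, 17, None, 42]), 3))
```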

4.0.1 Experiment 1: Comparison of Different Recommendation Algorithms

The first experiment was designed to evaluate the performance of the product-based kNN used in our recommendation module against two well-known recommendation methods, namely feature-based kNN [31] and matrix factorization using Bayesian personalized ranking (BPRMF) [32].


Similar to product-based kNN, the feature-based kNN method is also a neighborhood model, but one which makes predictions based on feature neighborhoods. Matrix factorization models are an alternative approach that maps both products and features to a joint latent factor space of dimensionality q, such that product-feature interactions are modeled as inner products in that space. BPRMF, which was used in our experiments, is a form of matrix factorization designed to be most useful for binary data.

A five-fold cross validation approach was followed to evaluate the performance of the recommendation methods. The experiment was performed with an initial profile size of L = 3. The number of neighbors for product-based kNN and feature-based kNN was set to K = 20. The BPRMF method was run with 80 factors and 3000 iterations, with a learning rate of α = 0.05. The hit-ratio results of this experiment for a sample of Softpedia software categories are shown in Figure 6. For each category, the hit ratio is shown for different recommendation set sizes.

In all categories, the product-based kNN algorithm performed better than the feature-based kNN method at all recommendation set sizes. The graphs also show that product-based kNN outperformed BPRMF for the top 25 recommendations, while BPRMF performed much better for larger recommendation sets. Assuming that it is not practical to show more than 20 results to the user at one time, the product-based kNN algorithm has the best performance.

We also computed the average precision of the competing algorithms and tested the significance of the differences between product-based kNN and the other two methods using Student's t-test. For each of the categories we tested the following two null hypotheses:

H1: The mean precision for product-based kNN is less than or equal to the mean precision for feature-based kNN.

H2: The mean precision for product-based kNN is less than or equal to the mean precision for BPRMF.

The t-test requires a normally distributed sample. By the central limit theorem, sample means of moderately large samples (more than 30) are often well approximated by a normal distribution even if the data are not normally distributed. As the number of test samples is larger than 30, we can assume that the sample means are approximately normal. The average precision (AP) for the recommendation algorithms, as well as the p-values for the t-tests, are included in Table 6. For all categories except Multimedia, the two null hypotheses H1 and H2 can be rejected with high confidence, showing that product-based kNN achieves a higher mean precision than the other two methods. For the Multimedia category, although H2 can be rejected, there is insufficient evidence to reject H1.
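As an illustration of this kind of test, the sketch below runs SciPy's one-tailed two-sample t-test on hypothetical per-product precision samples; the numbers are invented and do not reproduce the paper's results, and whether the paper used a paired or independent-samples variant is not specified (an independent-samples test is shown):

```python
from scipy import stats

# Hypothetical per-product precision samples for two methods (invented
# values, repeated to exceed the 30-sample threshold discussed above).
prec_product_knn = [0.05, 0.10, 0.02, 0.08, 0.06, 0.04] * 6
prec_feature_knn = [0.03, 0.07, 0.02, 0.05, 0.04, 0.03] * 6

# One-tailed test of H1 ("mean precision of product-based kNN is less than
# or equal to that of feature-based kNN"); a small p-value rejects H1.
t_stat, p_value = stats.ttest_ind(prec_product_knn, prec_feature_knn,
                                  alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
```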

4.0.2 Experiment 2: Association Rules

To determine the effect of association rule mining on the performance of the system, the previously described five-fold approach was adopted. In each of the five runs of the experiment, a frequent itemset graph was constructed from the products in the training set. For each product in the test set, L = 3 features were selected and the remaining features were removed from the profile. The frequent itemset graph was then used to generate recommendations, and those recommendations with confidence scores of 0.2 or higher were presented to the user. Figure 7 shows the precision and recall for different confidence levels. For example, at a confidence level of 0.2, the precision of the generated recommendations is 25% and the recall is 37%. The association rule approach can therefore achieve high precision in recommendation, but at the cost of lower recall. These observations support our earlier claim that association rule mining can be useful for identifying a small set of previously unused features with a high degree of precision.

To simulate the step in which the user evaluates the initial recommendations, we automatically accept the correct recommendations and reject the incorrect ones based on the known data stored in the product-by-feature matrix. The accepted recommendations are then used to augment the initial product profile of size L = 3. The augmented profile is then given as input to the kNN recommender to generate more recommendations. In our evaluations, this hybrid method is denoted kNN+.
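The kNN+ pipeline can be summarized in a few lines. The recommender functions below are hypothetical stand-ins; in the actual evaluation, `arm_recommend` would be the FIG-based recommender and `knn_recommend` the product-based kNN described above:

```python
def knn_plus(initial_profile, arm_recommend, knn_recommend, known_features):
    """kNN+ sketch: ARM recommendations confirmed by the (simulated) user
    augment the profile before the kNN recommender runs. The user is
    simulated by accepting exactly the recommendations that appear in the
    product's known feature list."""
    accepted = {f for f in arm_recommend(initial_profile) if f in known_features}
    augmented = set(initial_profile) | accepted
    return knn_recommend(augmented)

# Hypothetical stand-in recommenders for illustration.
arm = lambda profile: {"reminders", "video_chat"}
knn = lambda profile: sorted(profile) + ["calendar_export"]
print(knn_plus({"calendar"}, arm, knn,
               known_features={"reminders", "calendar_export"}))
# ['calendar', 'reminders', 'calendar_export']
```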

The leave-one-out cross validation experiment described in Experiment 1 was then conducted on the whole set of products, covering all of the categories, with the small modification that the left-out item was selected from the set of features that were not part of the augmented profile. Figure 8 compares the hit ratio results of the product-based kNN approach with those of the kNN+ method. As can be seen, the quality of recommendations improves when association rules are used to augment the product profile before running the kNN algorithm. This difference represents a 0.1 improvement in hit ratio at a rank of 20.

We also computed the average precision for these two methods: the average precision of kNN was 0.051, while the average precision of kNN+ was 0.060. As in our evaluations in Section 4.0.1, we used Student's t-test to test the following null hypothesis:

H0: The mean precision for kNN+ is less than or equal to the mean precision for kNN.

The computed p-value for this test was 4.951E-49, which is far below the commonly used threshold of 0.05. We therefore reject the null hypothesis.


TABLE 6
Student's t-test results comparing the mean precision (AP) of product-based kNN against feature-based kNN and BPRMF for various categories

Category      AP, Product-based kNN   AP, Feature-based kNN   AP, BPRMF     p-value for H1    p-value for H2
Internet      0.044385257             0.029598626             0.038987167   1.22251E-16 *     0.003991202 *
Multimedia    0.042550062             0.040130303             0.037031697   0.287419169       0.026067424 *
Security      0.044682807             0.034942399             0.041098306   4.51717E-05 *     0.064780269 *
System        0.042300756             0.031335661             0.03758579    0.000200299 *     0.026554663 *
Office        0.049005664             0.031160901             0.035160479   3.7573E-17 *      8.83999E-06 *
Programming   0.049108903             0.036690594             0.04054354    8.07944E-06 *     0.002525262 *

Fig. 6. Comparison of Recommendation Methods for Different Software Categories. [Six hit-ratio plots, one per category: (a) Internet, (b) Multimedia, (c) Security, (d) System, (e) Office-tools, (f) Programming. Each plots Hit Ratio against the Number of Recommendations (1 to 41+) for Product-based kNN, BPRMF, and Feature-based kNN.]


Fig. 7. Recall and Precision at Different Levels of Confidence

Fig. 8. Hit Ratio Comparison of kNN and kNN+. [Hit Ratio plotted against the Number of Recommendations (1 to 46+) for Product-based kNN and kNN+.]

5 CASE STUDY: COLLABORATIVE SOFTWARE SUITE

The previous experiments demonstrate the viability of using our recommender system to recommend useful features to a domain analyst.


However, to further evaluate its usefulness, we applied it within the context of a larger and more realistic project. We therefore present a case study based around the Collaborative Software Suite (CoSS), which is intended to support project members collaborating on a distributed project. CoSS will be developed as a web-based solution for managing projects and supporting collaboration across an enterprise-level organization. It will provide support for various activities including project management, virtual team management, advanced reporting, dashboards, online calendars, workflow creation, document management, and communication and conferencing.

5.1 Case Study Creation Methodology

Although different domains were considered, CoSS was selected because several researchers on our team were familiar with using enterprise-level collaborative tools, and because the domain contained a relatively large number of interrelated features that were well represented in the Softpedia data. The case study was developed by a member of our research team, from now on referred to as the Domain Analyst, who had extensive experience working on software development projects, and whose sole responsibility in this project was to build the case study and help in the qualitative evaluation of recommendations. The Domain Analyst followed the FODA method [2] and developed the domain model based on a combination of expert knowledge and publicly available product descriptions for over 100 related commercial products. To avoid introducing bias into the experiment, the analyst did not use Softpedia.com, the Web site from which we mined features to train the recommender system. The Domain Analyst invested over 30 hours to identify approximately 120 coarse-grained features. The domain analysis did not include identifying composition rules such as optionality, variability, or mutual exclusion between features.

5.2 Feature Model Description

The CoSS feature model, depicted in Figure 9, shows primary features such as project management and communication, together with a selection of their associated sub-features. For example, project management is decomposed into features such as project setup, task setup, and project notification. Each feature is further defined through a textual description, as illustrated for Task Setup and its sub-features "Create new tasks and to-do lists for each team member", "Create scheduled and recurring tasks", and "Create project/task workflow rules".

5.3 Qualitative Analysis

In addition to the quantitative study described in Section 4, we also conducted a qualitative evaluation which relied on expert human judgment. For this experiment, 5 tasks were defined, where each task consists of a subset of 5 features randomly selected from the CoSS feature model and simulates a situation in which an expert has selected 5 features for a product and is using the recommender system to discover more. Table 7 shows one of the tasks used in our experiment.

As described in Section 3.1, each of the feature descriptors is automatically mapped to the set of 1,135 features in our system to create an initial product profile for each task. The highest-ranked matches are presented to the user for feedback. If no good matches are found, the user can modify the initial product description and reissue the recommendation request. Assuming that the feature the user is looking for exists in our system, this process can continue until the user is satisfied that the initial product description has been successfully matched to features. In our evaluations, it took less than 5 minutes on average to map each task to the set of corresponding features.

In the next step, association rule mining was performed on the set of selected features, and the results were shown to the user for feedback. The confirmed recommendations were added to the profile. In the last step, the kNN algorithm was run and the top R = 50 recommendations were selected. These recommendations were then re-ranked based on the calculated cross-cutting factors, and the top N = 10 features were shown to the users. Table 8 shows the set of recommendations produced by the association rule mining and kNN algorithms for the sample task shown in Table 7. For each recommendation, the feature name and a sample of feature descriptors are provided.

An analysis of the recommended items was performed independently by three users who had not previously been exposed to the datasets or to the CoSS recommendations made by our approach. In order to remove bias from the experiment, for each of the tasks the ranked list of N recommended features was mixed with N features randomly taken from the set of all features, so that a total of 2N recommendations were presented for each task. The users were not informed of the number of random features inserted into the recommendation list, and no differentiation was made between features generated by our method and the randomly selected ones. Study participants were asked to analyze each presented feature in terms of the following criteria: (Q1) Is the recommended feature cohesive? (Q2) Was this feature mentioned in the original profile? (Q3) Is this feature related to the overall goals of the Collaborative Software Suite? (Q4) Is this recommendation a possible feature for the Collaborative Software Suite? and (Q5) Was this recommendation useful? It took less than 2 minutes on average for each user to answer these questions for each of the recommended features.


Fig. 9. Collaborative Software Suite: Domain Model. [Feature diagram rooted at CSS, showing primary features including Project Management (with sub-features such as Project Set up, Task Set up, Project & Task Tracking, and Project Notification), Team Management, Meeting Management, Video Conferencing, Document Management, Enterprise Messaging, Evaluation, Communication, and Address Book, together with further sub-features such as Online Calendar, Meeting Reminder, Application Sharing, Version Control, Back-Up & Restore, Instant Messaging, and Voting.]

TABLE 7
An Example of an Initial Product Description. For Experimental Purposes the Description Was Composed of a Set of Features Randomly Selected from the Manually Created Domain Model.

- Provides different permission levels for adding and sharing calendars, events and so on.
- Assign tasks to one or more of the team
- Supports results reporting and charting
- Provides data export/import to excel, csv, spss file
- Manage projects and teams and improve inter- and intra-group communications

For each task and for every user, the precision on each question from Q1 to Q5 was computed for the feature recommender and for the random baseline. Given that we had 5 tasks and 3 evaluators, 15 paired samples were collected for each of the questions. Based on normality tests, including the Shapiro-Wilk test, and on frequency histograms, we can assume that the samples were drawn from a normal population.

For questions Q2 to Q5, a one-tailed paired Student's t-test was used to evaluate the statistical significance of the improvement in precision achieved by our method. For question Q1, a two-tailed paired Student's t-test was used to evaluate whether there was any difference in cohesion between the randomly selected features and those produced by our recommender. Table 9 presents the null hypothesis for each question. In this table, µ_rec represents the mean precision for our recommendation approach and µ_rand the mean precision for the random baseline. The average precision over all tasks and users is also shown for our recommender and for the random baseline.

According to Table 9, the p-values of the tests for questions Q2 to Q5 are all much smaller than 0.05 (the commonly used threshold), showing that the null hypotheses for these questions can be rejected and that the mean precision of our method is significantly higher than that of the random baseline.

TABLE 8
Top Recommendations Showing Feature Titles and a Subset of Related Feature Descriptors

1. (ARM) Extensive reporting, customized reports: Users can share customized reports easily; Reporting: save reports in Word, Adobe, HTML format; Export reports in Excel files; Custom reports generator.
2. (ARM) Task manager: Event and task list; Plan, manage and schedule your important dates and times; Date reminder for nonrecurring and recurring events.
3. (kNN) Features related to 'reminders': Flexible reminder frequency; Built-in day planner to remind you of scheduled events; Set tasks and reminders with 1 click.
4. (kNN) User defined filters: Ability to filter report content by WorkItem type; Filters for sorting items according to many factors; Information is readily accessible in a variety of ways.
5. (kNN) Templates: Ready-to-use templates; Quickly enter your records using record templates; Quickly enter movie records using record templates.
6. (kNN) Features related to 'MS Outlook': Data can be synchronized with Microsoft Outlook; Duplicate remover for Outlook calendar; Outlook Express: mail folders; Outlook Express: message rules.
7. (kNN) Activity tree pane features: Provides a hierarchical arrangement of WorkItems; Ability to create your own WorkItem types; Ability to track time spent on WorkItems.
8. (kNN) Customizable categories for events, documents, tasks: Filter items by category and type; Unlimited user-defined categories; Customizable event categories.
9. (kNN) Features related to notes: Export notes; Add a note to the event; It keeps track of your notes and tasks.
10. (kNN) Daily/weekly/monthly reporting or viewing activities: The print options provide daily, weekly and monthly summaries; Weekly calendar view of your activities; Weekly view: display events in a week.
11. (kNN) Exporting to the Web: Display record (text and graphic) in the form of the Web page; Publish your database to the Web; Web calendars.
12. (kNN) Recording: Screen Shot Capture is a handy and easy-to-use graphic capture tool; Screen snapshot recording; Video recorder.

The p-value for the first question shows that the cohesion of the random features is not significantly different from that of the features recommended by our method.

Results showed that, on average, 91% of the recommended features were assessed as related to the overall goals of CoSS. While only 73% of the features were deemed cohesive, the reviewers classified 86% of them as useful, suggesting that some lack of cohesion did not undermine the utility of the recommendations.


TABLE 9
Qualitative Analysis Results

Question   Null hypothesis     Avg. precision, recommender   Avg. precision, random baseline   p-value
Q1         µ_rand = µ_rec      0.73030303                    0.807575758                       0.145440189
Q2         µ_rand ≥ µ_rec      0.674116162                   0.17739899                        5.43955E-06
Q3         µ_rand ≥ µ_rec      0.909343434                   0.327020202                       4.9452E-08
Q4         µ_rand ≥ µ_rec      0.768813131                   0.327651515                       3.90332E-07
Q5         µ_rand ≥ µ_rec      0.865782828                   0.350505051                       1.34156E-06


6 THREATS TO VALIDITY

Threats to validity can be classified as construct validity, internal validity, external validity, and reliability. External validity evaluates the generalizability of the approach. The training data consists of product feature descriptors extracted from Softpedia; our feature model therefore only includes features which appear in Softpedia listings. However, as Softpedia covers a wide range of product types, we expect the technique to generalize to other, as yet unevaluated, domains. Due to this limitation, the CoSS case study was deliberately chosen to be compatible with the domains covered by Softpedia; had a non-compatible case study been chosen, it would have been difficult to generate useful recommendations. These limitations are clearly described in the text. In future work we intend to extend the training data to include product descriptions mined from additional sites such as the Google Apps Marketplace.

Construct validity evaluates the extent to which the construct that was intended to be measured was actually measured. In the evaluation of the IDC algorithm, we used four different answer sets. Each of these answer sets was created manually using card-sorting techniques and represents one viable clustering of feature descriptors. It is unclear whether the various clustering algorithms would have performed equally well against all valid clusterings. Unfortunately, the time and effort needed to manually create such clusterings made it infeasible to repeat the exercise multiple times. On the other hand, the clusterings were created by people otherwise entirely unrelated to our study, whose only instructions were to cluster feature descriptors into meaningful and distinct features; we did not tell them how many features to identify. Furthermore, the results from the IDC study were fairly consistent across all three of the studied datasets.

Another construct threat to validity lies in our choice of leave-one-out evaluation for quantitatively evaluating the recommender system. While this is a standard technique for evaluating recommender systems, it suffers from the problem that it can only evaluate known information. For example, if a known feature is removed from a product and the recommender system successfully recommends it back at a certain position in the recommended features list, it counts as a successful recommendation at that ranking. On the other hand, our recommender system may recommend a feature which does not appear in the feature list of that product; this result will be recorded as an incorrect recommendation even though the recommended feature might be a viable recommendation for that product. This problem is particularly evident in our dataset, as the lists of features associated with each product are certainly incomplete. However, this problem can only depress our reported results, and so does not bias them in favor of our approach. Despite this problem, our recommender system was able to recommend back a significant number of known features at high positions in the ranked list. Furthermore, we augmented our findings from this quantitative study with a qualitative case study in which the generated recommendations were evaluated by human analysts.

Internal validity reflects the extent to which a study minimizes systematic error or bias, so that a causal conclusion can be drawn. We designed the CoSS evaluation to reduce this bias. The evaluation was performed by members of our research team who had not otherwise been exposed to the datasets or to the results of the recommendation system prior to this study. We performed a blind study in which (1) the evaluators had not previously been exposed to the CoSS recommendations made by our approach; (2) the recommendation lists included 50% random recommendations; (3) the evaluators were not informed which features were produced by our recommender system and which were random; and (4) the evaluators were not informed of the number of random features in the recommendation set.

7 RELATED WORK

This work bridges the gap between automated feature detection and recommender systems. We therefore provide a brief background survey on each of these areas.

Several domain analysis methods, such as FODA (Feature-Oriented Domain Analysis) [2], FORM (Feature-Oriented Reuse Method) [3], FAST (Family-Oriented Abstraction, Specification, and Translation) [5], FeatureRSEB [33], and ODM (Organization Domain Modeling) [34], have been proposed to support the discovery, analysis, and documentation of commonalities and variabilities within a domain.


These methods rely on human-intensive activities and are more appropriate for small domains. Tools that have been constructed to support these methods [35], [36], [37], [4] primarily focus upon modeling variability and managing it across different products. Significant upfront manual effort is therefore still required to perform a domain analysis, identify features, and create their composition rules. There has recently been considerable research on using data mining and machine learning techniques to construct domain models by organizing and discovering relationships between software requirements. For example, the Domain Analysis and Reuse Environment (DARE) [4] uses text processing tools to extract domain vocabulary from text sources, and then identifies common domain entities, functions, and objects by clustering related words and phrases. These clusters are then used by the analyst to create a feature table. Chen et al. [38] manually constructed requirements relationship graphs (RRGs) from several different requirements specifications and then used hierarchical clustering techniques to merge them into a single domain tree. However, this approach does not easily scale beyond trivially sized domains because the domain user is required to manually create each requirements relationship graph. Although these graphs are constructed from one application and can be reused in other applications in the same domain, considerable user intervention is required to modify and adapt them. Vander Alves et al. [39] used the Vector Space Model (VSM) and Latent Semantic Analysis (LSA) to determine the similarity between requirements and then generated an association matrix, which is clustered using the Agglomerative Hierarchical Clustering (AHC) algorithm. The identified clusters are then merged using a depth-first algorithm to compare nodes and create a feature model for the domain. Noppen et al. [40] extended this work by including fuzzy sets in the framework, allowing individual requirements to be associated with multiple features. Niu and Easterbrook [41], [42] developed an on-demand clustering framework that provides semi-automatic support for analyzing functional requirements in a product line. They proposed an information retrieval (IR) and natural language processing (NLP) approach to identify important entities and functions in the requirements as a functional requirements profile (FRP). FRPs are identified according to the linguistic characterization of a domain and tend to capture the domain's action themes from the requirements documents. FRPs are used as clustering objects, and information retrieval techniques are used to identify (model) the variability among FRPs. In this framework, human intervention is needed to validate FRPs and their attributes.

Several other researchers have used association rule mining to automate the construction of domain feature models from a set of already existing application features. Lora-Michiels [43] applied association rule analysis to identify mandatory and optional relationships, then performed a chi-square statistical test in conjunction with the association rules to identify 'requires' relationships. Similarly, She [44] used a set of individual product configurations, each consisting of a list of features, and applied conjunctive and disjunctive association rule mining to construct a probabilistic feature model. In other work, Acher et al. [45] extracted the feature models of several applications from tabular documentation of product features. They used a name-based merging technique to merge the overlapping parts of each feature model, synthesize the variability information, and build a single feature model representing the product family.

However, all of these feature configuration approaches assume that the list of features for each individual application is already available for synthesizing and constructing the final domain model.

The field of recommender systems has also been studied extensively, but mostly within the context of e-commerce systems, where numerous algorithms have been developed to model user preferences and create predictions. These algorithms vary greatly depending on the type of data they use as input to create the recommendations. For example, some use content information about the items [46], collaborative data of other users' ratings [7], knowledge rules of the domain [47], or hybrid approaches [48]. Substantial work has been done in the area of evaluating recommender systems [30] and on newer, highly efficient algorithms such as those based on matrix factorization [49]. Despite the growth of both of these fields, there has been very little work combining recommender systems and requirements engineering. Work in this area has focused on recommending topics of interest in large-scale online requirements forums [28], and on a high-level overview of possible usages and applications of recommender systems in this domain [50]. Nevertheless, our use of recommender systems in this paper builds on the substantial background of research in this area.

8 CONCLUSION

This paper has presented a novel feature recommender system to support the domain analysis process, a critical early-phase part of the software development life-cycle that is essential in both application development and product line development [51], [52]. Our system mines feature descriptors for hundreds of products from publicly available software repositories of product descriptions and uses this data to discover features and their associations. For feature discovery, we proposed a novel Incremental Diffusive Clustering algorithm. The evaluation results presented in this paper reveal that our clustering method is more suitable for this task than several other well-known clustering methods.


We expect that the results from our clustering analysis can be applied to other clustering problems in the requirements engineering domain. Another novel contribution of this work is the hybrid approach, which uses association rule mining to augment an initial profile and then uses the standard kNN approach to make additional recommendations. This has the advantage of augmenting an initially sparse product description before making a more extensive set of recommendations.

The performance of the recommender system was evaluated through both quantitative experiments and qualitative analysis. The results showed that our recommender system is capable of making viable feature recommendations to help domain analysts build domain models for a variety of software applications. In addition, the examples shown throughout the paper have illustrated the process, its benefits, and its limitations.

Future work will focus on mining additional software repositories and directories to broaden the knowledge of the system, improving the interaction between the user and the system, exploring techniques to enhance the navigation and visualization of the recommended features, and conducting more extensive qualitative evaluations with a wider group of project stakeholders.

ACKNOWLEDGMENTS

The work described in this paper was partially funded by grant III:0916852 from the National Science Foundation.

REFERENCES

[1] G. Arango and R. Prieto-Diaz, Domain Analysis: Acquisition of Reusable Information for Software Construction. IEEE Computer Society Press, May 1989.
[2] K. Kang, S. Cohen, J. Hess, W. Nowak, and S. Peterson, "Feature-oriented domain analysis (FODA) feasibility study," 1990.
[3] K. C. Kang, S. Kim, J. Lee, K. Kim, G. J. Kim, and E. Shin, "FORM: A feature-oriented reuse method with domain-specific reference architectures," Annals of Software Engineering, vol. 5, pp. 143–168, 1998.
[4] W. Frakes, R. Prieto-Diaz, and C. Fox, "DARE: Domain analysis and reuse environment," Annals of Software Engineering, vol. 5, 1998.
[5] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. of the 20th Int'l Conf. on Very Large Data Bases, 1994.
[6] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. of the 20th Int'l Conf. on Very Large Data Bases (VLDB'94), Santiago, Chile, September 1994, pp. 487–499.
[7] J. Schafer, D. Frankowski, J. Herlocker, and S. Sen, "Collaborative filtering recommender systems," The Adaptive Web, p. 291, 2007.
[8] H. Dumitru, M. Gibiec, N. Hariri, J. Cleland-Huang, B. Mobasher, C. Castro-Herrera, and M. Mirakhorli, "On-demand feature recommendations derived from mining public software repositories," in 33rd International Conference on Software Engineering, Honolulu, Hawaii, USA: ACM, May 2011, p. 10.
[9] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.
[10] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. MA, USA: Kluwer Academic Publishers, 1981.
[11] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1/2, pp. 143–175, 2001.
[12] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[13] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa, "Effective personalization based on association rule discovery from web usage data," in 3rd ACM Workshop on Web Information and Data Management (WIDM'01), Atlanta, November 2001.
[14] J. Sandvig, B. Mobasher, and R. Burke, "Robustness of collaborative recommendation based on association rule mining," Proc. of Recommender Systems, 2007.
[15] C. Duan, J. Cleland-Huang, and B. Mobasher, "A consensus based approach to constrained clustering of software requirements," ACM Conf. on Information and Knowledge Management, pp. 1073–1082, 2008.
[16] J. Cleland-Huang and B. Mobasher, "Automated detection of recurring faults in problem anomaly reports," SERC Report to Lockheed Martin, 2009.
[17] C. Duan, "Improving requirements clustering in an interactive and dynamic environment," DePaul University, Tech. Rep.
[18] H. Dumitru, A. Czauderna, and J. Cleland-Huang, "Automated mining of cross-cutting concerns from problem reports and requirements specifications," Argonne Undergraduate Symp., 2009.
[19] F. Can and E. Ozkarahan, "Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases," ACM Trans. Database Syst., pp. 483–517, 1990.
[20] K. M. Lewis and P. Hepburn, "Open card sorting and factor analysis: a usability case study," The Electronic Library, vol. 28, no. 3, pp. 401–416, 2010.
[21] M. Papagelis, D. Plexousakis, and T. Kutsuras, "Alleviating the sparsity problem of collaborative filtering using trust inferences," in iTrust, ser. Lecture Notes in Computer Science, P. Herrmann, V. Issarny, and S. Shiu, Eds., vol. 3477. Springer, 2005, pp. 224–239.
[22] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Analysis of recommendation algorithms for e-commerce," in Proc. ACM Conference on Electronic Commerce, 2000, pp. 158–167.
[23] J. Cleland-Huang, B. Berenbach, S. Clark, R. Settimi, and E. Romanova, "Best practices for automated traceability," IEEE Computer, vol. 40, no. 6, pp. 27–35, 2007.
[24] W. Lin, S. A. Alvarez, and C. Ruiz, "Efficient adaptive-support association rule mining for recommender systems," Data Mining and Knowledge Discovery, vol. 6, pp. 83–105, 2002.
[25] B. M. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Analysis of recommender algorithms for e-commerce," in Proc. of the 2nd ACM E-Commerce Conf. (EC'00), Minneapolis, MN, October 2000, pp. 158–167.
[26] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), New York, NY, June 2000.
[27] C. Castro-Herrera, J. Cleland-Huang, and B. Mobasher, "Enhancing stakeholder profiles to improve recommendations in online requirements elicitation," in IEEE Int'l Requirements Engineering Conf., Los Alamitos, CA, USA: IEEE Computer Society, 2009, pp. 37–46.
[28] C. Castro-Herrera, C. Duan, J. Cleland-Huang, and B. Mobasher, "A recommender system for requirements elicitation in large-scale software projects," Proc. of the 2009 ACM Symp. on Applied Computing, pp. 1419–1426, 2009.
[29] E. Spertus, M. Sahami, and O. Buyukkokten, "Evaluating similarity measures: a large-scale study in the Orkut social network," Chicago, Illinois, USA: ACM, 2005, pp. 678–684. [Online]. Available: http://portal.acm.org/citation.cfm?id=1081956


[30] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl, "Evaluating collaborative filtering recommender systems," ACM Trans. on Information Systems, vol. 22, pp. 5–23, 2004.
[31] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," 2001, pp. 285–295.
[32] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, "BPR: Bayesian personalized ranking from implicit feedback," in Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
[33] M. Simos, D. Creps, C. Klingler, L. Levine, and D. Allemang, "Software technology for adaptable reliable systems (STARS) organization domain modeling (ODM) guidebook version 2.0," Manassas, VA: Lockheed Martin Tactical Defense Systems, Tech. Rep. STARS-VC-A025/001/00, 1996. [Online]. Available: http://citeseer.ist.psu.edu/700189.html
[34] J. Coplien, D. Hoffman, and D. Weiss, "Commonality and variability in software engineering," IEEE Software, vol. 15, no. 6, pp. 37–45, Nov. 1998. [Online]. Available: http://dx.doi.org/10.1109/52.730836
[35] R. Krut, "Integrating 001 tool support into the feature-oriented domain analysis methodology," Software Engineering Institute, Pittsburgh, Tech. Rep. CMU/SEI-93-TR-11, 1993.
[36] K. Czarnecki, M. Antkiewicz, C. H. P. Kim, S. Lau, and K. Pietroszek, "fmp and fmp2rsm: Eclipse plug-ins for modeling features using model templates," in Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), San Diego, CA, USA: ACM, 2005, pp. 200–201.
[37] T. von der Massen and H. Lichter, "RequiLine: A requirements engineering tool for software product lines," in International Workshop on Product Family Engineering (PFE), Siena, Italy: Springer Verlag, 2003.
[38] K. Chen, W. Zhang, H. Zhao, and H. Mei, "An approach to constructing feature models based on requirements clustering," in IEEE International Requirements Engineering Conference, pp. 31–40, 2005.
[39] V. Alves, C. Schwanninger, L. Barbosa, A. Rashid, P. Sawyer, P. Rayson, K. Pohl, and A. Rummler, "An exploratory study of information retrieval techniques in domain analysis," Proc. of the 12th Int'l Software Product Line Conf., pp. 67–76, 2008.
[40] J. Noppen, P. van den Broek, N. Weston, and A. Rashid, "Modelling imperfect product line requirements with fuzzy feature diagrams," VaMoS, pp. 93–102, 2009.
[41] N. Niu and S. Easterbrook, "Extracting and modeling product line functional requirements," Proc. of the 16th Int'l Requirements Eng. Conf., 2008.
[42] N. Niu and S. Easterbrook, "On-demand cluster analysis for product line functional requirements," Proc. of the 12th Int'l Software Product Line Conf., pp. 87–96, 2008.
[43] A. Lora-Michiels, C. Salinesi, and R. Mazo, "A method based on association rules to construct product line models," in VaMoS, ser. ICB-Research Report, D. Benavides, D. S. Batory, and P. Grünbacher, Eds., vol. 37. Universität Duisburg-Essen, 2010, pp. 147–150. [Online]. Available: http://dblp.uni-trier.de/db/conf/vamos/vamos2010.html
[44] S. She, "Feature model mining," Master's thesis, University of Waterloo, Waterloo, 2008. [Online]. Available: http://hdl.handle.net/10012/3915
[45] M. Acher, A. Cleve, G. Perrouin, P. Heymans, C. Vanbeneden, P. Collet, and P. Lahire, "On extracting feature models from product descriptions," in Proceedings of the Sixth International Workshop on Variability Modeling of Software-Intensive Systems (VaMoS '12), New York, NY, USA: ACM, 2012, pp. 45–54. [Online]. Available: http://doi.acm.org/10.1145/2110147.2110153
[46] M. Pazzani and D. Billsus, "Content-based recommendation systems," The Adaptive Web, pp. 325–341, 2007.
[47] R. Burke, "Knowledge-based recommender systems," Encyclopedia of Library and Information Systems, vol. 69, 2000.
[48] R. Burke, "Hybrid recommender systems: Survey and experiments," User Modeling and User-Adapted Interaction, vol. 12, pp. 331–370, 2002.
[49] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, pp. 30–37, 2009.
[50] W. Maalej and A. Thurimella, "Towards a research agenda for recommendation systems in requirements engineering," Proc. of the 2nd Int'l Workshop on Managing Requirements Knowledge (MARK), pp. 32–39, 2009.
[51] S. Apel and C. Kästner, "An overview of feature-oriented software development," Journal of Object Technology (JOT), vol. 8, no. 5, pp. 49–84, July/August 2009.
[52] A. Classen, P. Heymans, and P.-Y. Schobbens, "What's in a feature: A requirements engineering perspective," in FASE, 2008, pp. 16–30.