

Resilient Identity Crime Detection

Clifton Phua, Member, IEEE, Kate Smith-Miles, Senior Member, IEEE, Vincent Lee, and Ross Gayler

Abstract—Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity crime. The existing non-data mining detection systems of business rules and scorecards, and known fraud matching have limitations. To address these limitations and combat identity crime in real-time, this paper proposes a new multi-layered detection system complemented with two additional layers: Communal Detection (CD) and Spike Detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper-resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing legal behaviour, and remove the redundant attributes. Experiments were carried out on CD and SD with several million real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud detection, the concept of resilience, together with adaptivity and quality data discussed in the paper, are general to the design, implementation, and evaluation of all detection systems.

Index Terms—data mining-based fraud detection, security, data stream mining, anomaly detection.

1 INTRODUCTION

IDENTITY CRIME is defined as broadly as possible in this paper. At one extreme, synthetic identity fraud refers to the use of plausible but fictitious identities. These are effortless to create but more difficult to apply successfully. At the other extreme, real identity theft refers to the illegal use of innocent people's complete identity details. These can be harder to obtain (although large volumes of some identity data are widely available) but easier to apply successfully. In reality, identity crime can be committed with a mix of both synthetic and real identity details.

Identity crime has become prominent because there is so much real identity data available on the Web, and confidential data accessible through unsecured mailboxes. It has also become easy for perpetrators to hide their true identities. This can happen in a myriad of insurance, credit, and telecommunications fraud, as well as other more serious crimes. In addition, identity crime is prevalent and costly in developed countries that do not have nationally registered identity numbers.

Data breaches which involve lost or stolen consumers' identity information can lead to other frauds such as tax return, home equity, and payment card fraud. Consumers can incur thousands of dollars in out-of-pocket expenses. US law requires offending organisations to notify consumers, so that consumers can mitigate the harm. As a result, these organisations incur economic damage, such as notification costs, fines, and lost business [24].

Credit applications are Internet or paper-based forms

• C. Phua is with the Data Mining Department, Institute for Infocomm Research (I2R), Singapore. E-mail: see https://sites.google.com/site/cliftonphua/

• K. Smith-Miles and V. Lee are with Monash University, and R. Gayler is with Veda Advantage.

with written requests by potential customers for credit cards, mortgage loans, and personal loans. Credit application fraud is a specific case of identity crime, involving synthetic identity fraud and real identity theft.

As in identity crime, credit application fraud has reached a critical mass of fraudsters who are highly experienced, organised, and sophisticated [10]. Their visible patterns can differ from each other and change constantly. They are persistent, because the financial rewards are high while the risk and effort involved are minimal. Based on anecdotal observations of experienced credit application investigators, fraudsters can use software automation to manipulate particular values within an application and increase the frequency of successful values.

Duplicates (or matches) refer to applications which share common values. There are two types of duplicates: exact (or identical) duplicates have all the same values; near (or approximate) duplicates have some same values (or characters), some similar values with slightly altered spellings, or both. This paper argues that each successful credit application fraud pattern is represented by a sudden and sharp spike in duplicates within a short time, relative to the established baseline level.
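To make the two duplicate types concrete, the following sketch classifies a pair of applications as an exact duplicate, a near duplicate, or neither. It is illustrative only: Python's difflib similarity ratio stands in for the string matching used later in the paper (Jaro-Winkler), and the attribute names and 0.8 threshold are assumptions, not values from the paper.

```python
from difflib import SequenceMatcher

def classify_duplicate(app_a, app_b, sim_threshold=0.8):
    """Classify a pair of applications as an 'exact' duplicate (all values
    the same), a 'near' duplicate (some same values and/or some values
    with slightly altered spellings), or 'none'."""
    n = len(app_a)
    same = sum(app_a[k] == app_b[k] for k in app_a)
    similar = sum(
        app_a[k] != app_b[k]
        and SequenceMatcher(None, app_a[k], app_b[k]).ratio() >= sim_threshold
        for k in app_a
    )
    if same == n:
        return "exact"
    return "near" if same + similar > 0 else "none"

# Hypothetical example: one-character difference in the first name.
a = {"name": "John Smith", "phone": "5550199", "dob": "1980-01-02"}
b = {"name": "Joan Smith", "phone": "5550199", "dob": "1980-01-02"}
print(classify_duplicate(a, b))  # near
```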

Duplicates are hard to avoid from the fraudsters' point of view because duplicates increase their success rate. The synthetic identity fraudster has a low success rate, and is likely to reuse fictitious identities which have been successful before. The identity thief has limited time, because innocent people can discover the fraud early and take action, and will quickly use the same real identities at different places.

It will be shown later in this paper that many fraudsters operate this way with these applications, and that their characteristic pattern of behaviour can be detected by the methods reported. In short, the new methods are based on white-listing and detecting spikes of similar applications. White-listing uses real social relationships on a fixed set of attributes. This reduces false positives

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING VOL.24 NO.3 YEAR 2012


by lowering some suspicion scores. Detecting spikes in duplicates, on a variable set of attributes, increases true positives by adjusting suspicion scores appropriately.

Throughout this paper, data mining is defined as the real-time search for patterns in a principled (or systematic) fashion. These patterns can be highly indicative of early symptoms in identity crime, especially synthetic identity fraud [22].

1.1 Main challenges for detection systems

Resilience is the ability to degrade gracefully when under most real attacks. The basic question asked by all detection systems is whether they can achieve resilience. To do so, the detection system trades off a small degree of efficiency (degraded processing speed) for a much larger degree of effectiveness (improved security by detecting most real attacks). In fact, any form of security involves trade-offs [26].

The detection system needs “defence-in-depth”: multiple, sequential, and independent layers of defence [25] to cover different types of attacks. These layers are needed to reduce false negatives. In other words, any successful attack has to pass every layer of defence without being detected.

The two greatest challenges for the data mining-based layers of defence are adaptivity and use of quality data. These challenges need to be addressed in order to reduce false positives.

Adaptivity accounts for morphing fraud behaviour, as the attempt to observe fraud changes its behaviour. What is less obvious, yet equally important, is the need to also account for changing legal (or legitimate) behaviour within a changing environment. In the credit application domain, changing legal behaviour is exhibited by communal relationships (such as rising or falling numbers of siblings) and can be caused by external events (such as the introduction of organisational marketing campaigns). This means legal behaviour can be hard to distinguish from fraud behaviour, but it will be shown later in this paper that they are indeed distinguishable from each other.

The detection system needs to exercise caution with applications which reflect communal relationships. It also needs to make allowance for certain external events.

Quality data is highly desirable for data mining, and data quality can be improved through the real-time removal of data errors (or noise). The detection system has to filter duplicates which have been re-entered due to human error or for other reasons. It also needs to ignore redundant attributes which have many missing values, among other issues.

1.2 Existing identity crime detection systems

There are non-data mining layers of defence to protect against credit application fraud, each with its unique strengths and weaknesses.

The first existing defence is made up of business rules and scorecards. In Australia, one business rule is the hundred-point physical identity check test, which requires the applicant to provide sufficient point-weighted identity documents face-to-face. They must add up to at least one hundred points, where a passport is worth seventy points. Another business rule is to contact (or investigate) the applicant over the telephone or Internet. The above two business rules are highly effective, but human resource intensive. To rely less on human resources, a common business rule is to match an application's identity number, address, or phone number against external databases. This is convenient, but the public telephone and address directories, semi-public voters' register, and credit history data can have data quality issues of accuracy, completeness, and timeliness. In addition, scorecards for credit scoring can catch the small percentage of fraud which does not look creditworthy; but they also remove outlier applications which have a higher probability of being fraudulent.

The second existing defence is known fraud matching. Here, known frauds are complete applications which were confirmed to have the intent to defraud, and are usually periodically recorded into a blacklist. Subsequently, current applications are matched against the blacklist. This has the benefit and clarity of hindsight, because patterns often repeat themselves. However, there are two main problems in using known frauds. First, they are untimely: there are long delays, in days or months, for fraud to reveal itself, and then to be reported and recorded. This provides a window of opportunity for fraudsters. Second, the recording of frauds is highly manual. This means known frauds can be incorrect [11], expensive, difficult to obtain [21], [3], and have the potential of breaching privacy.

In the real-time credit application fraud detection domain, this paper argues against the use of classification (or supervised) algorithms which use class labels. In addition to the problems of using known frauds, these algorithms, such as logistic regression, neural networks, or Support Vector Machines (SVM), cannot achieve scalability or handle the extremely imbalanced classes [11] in credit application data streams. As fraud and legal behaviour change frequently, the classifiers will deteriorate rapidly and the supervised classification algorithms will need to be retrained on the new data. But the training time is too long for real-time credit application fraud detection, because the new training data has too many derived numerical attributes (converted from the original, sparse string attributes) and too few known frauds.

This paper acknowledges that another domain, real-time credit card transactional fraud detection, has the same issues of scalability, extremely imbalanced classes, and changing behaviour. For example, Fair Isaac, a company renowned for its predictive fraud analytics, has been successfully applying supervised classification algorithms, including neural networks and SVM.

1.3 New data mining-based layers of defence

The main objective of this research is to achieve resilience by adding two new, real-time, data mining-based layers


to complement the two existing non-data mining layers discussed in the previous subsection. These new layers will improve detection of fraudulent applications because the detection system can detect more types of attacks, better account for changing legal behaviour, and remove the redundant attributes.

These new layers are not human resource intensive. They represent patterns in a score, where the higher the score for an application, the higher the suspicion of fraud (or anomaly). In this way, only the highest scores require human intervention. These two new layers, communal and spike detection, do not use external databases, but only the credit application database itself. And crucially, these two layers are unsupervised algorithms which are not completely dependent on known frauds, but use them only for evaluation.

The main contribution of this paper is the demonstration of resilience, with adaptivity and quality data, in real-time data mining-based detection algorithms. The first new layer is Communal Detection (CD): the whitelist-oriented approach on a fixed set of attributes. To complement and strengthen CD, the second new layer is Spike Detection (SD): the attribute-oriented approach on a variable-size set of attributes.

The second contribution is the significant extension of knowledge in credit application fraud detection, because publications in this area are rare. In addition, this research uses key ideas from other related domains to design the credit application fraud detection algorithms.

Finally, the last contribution is the recommendation of credit application fraud detection as one of the many solutions to identity crime. Being at the first stage of the credit life cycle, credit application fraud detection also prevents some credit transactional fraud.

Section 2 gives an overview of related work in credit application fraud detection and other domains. Section 3 presents the justifications and anatomy of the CD algorithm, followed by the SD algorithm. Before the analysis and interpretation of CD and SD results, Section 4 considers the legal and ethical responsibility of handling application data, and describes the data, evaluation measures, and experimental design. Section 5 concludes the paper.

2 BACKGROUND

Many individual data mining algorithms have been designed, implemented, and evaluated in fraud detection. Yet until now, to the best of the researchers' knowledge, the resilience of data mining algorithms in a complete detection system has not been explicitly addressed.

Much work in credit application fraud detection remains proprietary and exact performance figures unpublished, so there is no way to compare the CD and SD algorithms against these leading industry methods and techniques. For example, [14] has ID Score-Risk, which gives a combined view of each credit application's characteristics and their similarity to other industry-provided or Web identity characteristics. In another example, [7] has Detect, which provides four categories of policy rules to signal fraud, one of which is checking a new credit application against historical application data for consistency.

Case-Based Reasoning (CBR) is the only known prior publication on the screening of credit applications [29]. CBR analyses the hardest cases, which have been misclassified by existing methods and techniques. Retrieval uses thresholded nearest-neighbour matching. Diagnosis utilises multiple selection criteria (probabilistic curve, best match, negative selection, density selection, and default) and resolution strategies (sequential resolution-default, best guess, and combined confidence) to analyse the retrieved cases. CBR achieves twenty percent higher true positive and true negative rates than common algorithms on credit applications.

The CD and SD algorithms, which monitor the significant increase or decrease in amount of something important (Section 3), are similar in concept to credit transactional fraud detection and bio-terrorism detection. Peer Group Analysis [2] monitors inter-account behaviour over time. It compares the cumulative mean weekly amount between a target account and other similar accounts (the peer group) at subsequent time points. The suspicion score is a t-statistic which determines the standardised distance from the centroid of the peer group. On credit card accounts, the time window to calculate a peer group is thirteen weeks, and the future time window is four weeks. Break Point Analysis [2] monitors intra-account behaviour over time. It detects rapid spending or sharp increases in weekly spending within a single account. Accounts are ranked by the t-test. The fixed-length moving transaction window contains twenty-four transactions: the first twenty for training and the next four for evaluation on credit card accounts. Bayesian networks [33] uncover simulated anthrax attacks from real emergency department data. [32] surveys algorithms for finding suspicious activity in time for disease outbreaks. [9] uses time series analysis to track early symptoms of synthetic anthrax outbreaks from daily sales of retail medication (throat, cough, and nasal) and some grocery items (facial tissues, orange juice, and soup). Control-chart-based statistics, exponentially weighted moving averages, and generalised linear models were tested on the same bio-terrorism detection data and alert rate [15].
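The Break Point Analysis idea above can be sketched as a two-window comparison over the twenty-four-transaction moving window (twenty for training, four for evaluation). This is a simple t-style statistic, not necessarily the exact formulation of [2]:

```python
import statistics

def break_point_t(amounts, train=20, evaluate=4):
    """Two-window t-style statistic over a fixed-length moving window of
    train + evaluate transactions; large positive values suggest a sharp
    spending increase (a break point) within a single account."""
    window = amounts[-(train + evaluate):]
    old, recent = window[:train], window[train:]
    return (statistics.mean(recent) - statistics.mean(old)) / (
        statistics.stdev(old) / len(recent) ** 0.5)

# Steady spending around 100, then a sudden jump to 400:
steady = [100 + (10 if i % 2 else -10) for i in range(20)]
print(break_point_t(steady + [400] * 4) > 10)  # True
```

Ranking accounts by this statistic surfaces those whose recent spending deviates most sharply from their own history.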

The SD algorithm, which specifies how much the current prediction is influenced by past observations (subsection 3.3), is related to the Exponentially Weighted Moving Average (EWMA) in statistical process control research [23]. In particular, like EWMA, the SD algorithm performs linear forecasting on the smoothed time series, and their advantages include low implementation and computational complexity. In addition, the SD algorithm is similar to change point detection in bio-surveillance research, which maintains a cumulative sum (CUSUM) of positive deviations from the mean [13]. Like CUSUM, the SD algorithm raises an alert when the score/CUSUM exceeds a threshold, and both detect change points


faster because they are sensitive to small shifts from the mean. Unlike CUSUM, the SD algorithm weighs and chooses string attributes, not numerical ones.
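For concreteness, here are minimal textbook-style sketches of EWMA smoothing and a one-sided CUSUM alert; the parameter values are illustrative, and neither function is the SD algorithm itself.

```python
def ewma(series, alpha=0.3):
    """Exponentially weighted moving average: the final smoothed value,
    with older observations geometrically discounted, serves as a
    one-step-ahead linear forecast."""
    s = series[0]
    for x in series[1:]:
        s = alpha * x + (1 - alpha) * s
    return s

def cusum_alert(series, mean, slack=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate positive deviations from the mean
    (minus a slack term) and alert when the sum exceeds the threshold.
    Sensitive to small but sustained shifts from the mean."""
    c = 0.0
    for x in series:
        c = max(0.0, c + (x - mean - slack))
        if c > threshold:
            return True
    return False

print(cusum_alert([0.0] * 5 + [3.0] * 5, mean=0.0))  # True: sustained shift
print(cusum_alert([0.0] * 10, mean=0.0))             # False: no shift
```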

3 THE METHODS

This section is divided into four subsections to systematically explain the CD algorithm (first two subsections) and the SD algorithm (last two subsections). Each subsection commences with a discussion of its purpose.

3.1 Communal Detection (CD)

This subsection motivates the need for CD and its adaptive approach.

Suppose there were two credit card applications that provided the same postal address, home phone number, and date of birth, but one stated the applicant's name to be John Smith, and the other stated the applicant's name to be Joan Smith. These applications could be interpreted in three ways:

1) it is a fraudster attempting to obtain multiple credit cards using near-duplicated data;
2) there are twins living in the same house who are both applying for a credit card;
3) it is the same person applying twice, with a typographical error of one character in the first name.

With the CD layer, any two similar applications could easily be interpreted as (1), because this paper's detection methods use the similarity of the current application to all prior applications (not just known frauds) as the suspicion score. However, for this particular scenario, CD would also recognise these two applications as either (2) or (3) by lowering the suspicion score due to the higher possibility that they are legitimate.

To account for legal behaviour and data errors, Communal Detection (CD) is the whitelist-oriented approach on a fixed set of attributes. The whitelist, a list of communal and self relationships between applications, is crucial because it reduces the scores of these legal behaviours and the false positives. Communal relationships are near duplicates which reflect the social relationships from tight familial bonds to casual acquaintances: family members, housemates, colleagues, neighbours, or friends [17]. The family member relationship can be further broken down into more detailed relationships such as husband-wife, parent-child, brother-sister, male-female cousin (or both male, or both female), as well as uncle-niece (or uncle-nephew, auntie-niece, auntie-nephew). Self-relationships highlight the same applicant as a result of legitimate behaviour (for simplicity, self-relationships are regarded as communal relationships).

Broadly speaking, the whitelist is constructed by ranking link-types between applicants by volume. The larger the volume for a link-type, the higher the probability of a communal relationship. For when and how the whitelist is constructed, refer to Section 3.2, Step 6 of the CD algorithm.
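A minimal sketch of this ranking idea, assuming link-types have already been extracted as labels (the labels below are hypothetical):

```python
from collections import Counter

def build_whitelist(link_types, top_m=None):
    """Rank observed link-types by volume; higher-volume link-types are
    assumed more likely to reflect real communal relationships."""
    return [lt for lt, _ in Counter(link_types).most_common(top_m)]

observed = (["same address+phone"] * 3
            + ["same surname+address"] * 2
            + ["same dob+phone"])
print(build_whitelist(observed, top_m=2))
# ['same address+phone', 'same surname+address']
```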

However, there are two problems with the whitelist. First, there can be focused attacks on the whitelist by fraudsters when they submit applications with synthetic communal relationships. Although it is difficult to make definitive statements that fraudsters will attempt this, it is also wrong to assume that this will not happen. The solution proposed in this paper is to make the contents of the whitelist less predictable. The values of some parameters (different from an application's identity values) are automatically changed such that the whitelist's link-types also change. In general, tampering is not limited to hardware, but can also refer to manipulating software such as code. For our domain, tamper-resistance refers to making it more difficult for fraudsters to manipulate or circumvent data mining by providing false data.

Second, the volume and ranks of the whitelist's real communal relationships change over time. To make the whitelist exercise caution with (or be more adaptive to) changing legal behaviour, the whitelist is continually reconstructed.

3.2 CD algorithm design

This subsection explains how the CD algorithm works in real-time, giving scores when there are exact or similar matches between categorical data; and describes it in terms of its nine inputs, three outputs, and six steps.

This research focuses on one rapid and continuous data stream [19] of applications. For clarity, let G represent the overall stream, which contains multiple and consecutive {. . ., gx−2, gx−1, gx, gx+1, gx+2, . . .} Mini-discrete streams.

• gx: current Mini-discrete stream, which contains multiple and consecutive {ux,1, ux,2, . . ., ux,p} micro-discrete streams.
• x: fixed interval of the current month, fortnight, or week in the year.
• p: variable number of micro-discrete streams in a Mini-discrete stream.

Also, let ux,y represent the current micro-discrete stream, which contains multiple and consecutive {vx,y,1, vx,y,2, . . ., vx,y,q} applications. The current application's links are restricted to previous applications within a moving window, and this window can be larger than the number of applications within the current micro-discrete stream.

• y: fixed interval of the current day, hour, minute, or second.
• q: variable number of applications in a micro-discrete stream.

Here, it is necessary to describe a single and continuous stream of applications as being made up of separate chunks: a Mini-discrete stream is long-term (for example, a month of applications), while a micro-discrete stream is short-term (for example, a day of applications). They help to specify precisely when and how the detection system will automatically change its configurations. For example, the CD algorithm reconstructs its whitelist at the end of the month and resets its parameter values


at the end of the day; the SD algorithm does attribute selection and updates CD attribute weights at the end of the month. Also, for example, the long-term previous average score, long-term previous average links, and average density of each attribute are calculated from data in a Mini-discrete stream; the short-term current average score and short-term current average links are calculated from data in a micro-discrete stream.

With this data stream perspective in mind, the CD algorithm matches the current application against a moving window of previous applications. It accounts for attribute weights which reflect the degree of importance of attributes. The CD algorithm matches all links against the whitelist to find communal relationships and reduce their link scores. It then calculates the current application's score using every link score and previous application score. At the end of the current micro-discrete data stream, the CD algorithm determines the State-of-Alert (SoA) and updates one random parameter's value such that it trades off effectiveness against efficiency, or vice versa. At the end of the current Mini-discrete data stream, it constructs the new whitelist.

Inputs:
• vi (current application)
• W number of vj (moving window)
• link-types in the current whitelist
• Tsimilarity (string similarity threshold)
• Tattribute (attribute threshold)
• η (exact duplicate filter)
• α (exponential smoothing factor)
• Tinput (input size threshold)
• SoA (State-of-Alert)

Outputs:
• S(vi) (suspicion score)
• Same or new parameter value
• New whitelist

CD algorithm
Step 1: Multi-attribute link [match vi against W number of vj to determine if a single attribute exceeds Tsimilarity; and create multi-attribute links if near duplicates' similarity exceeds Tattribute or an exact duplicate's time difference exceeds η]
Step 2: Single-link score [calculate the single-link score by matching Step 1's multi-attribute links against the current whitelist's link-types]
Step 3: Single-link average previous score [calculate average previous scores from Step 1's linked previous applications]
Step 4: Multiple-links score [calculate S(vi) based on a weighted average (using α) of Step 2's link scores and Step 3's average previous scores]
Step 5: Parameter's value change [determine same or new parameter value through SoA (for example, by comparing input size against Tinput) at the end of ux,y]
Step 6: Whitelist change [determine the new whitelist at the end of gx]

TABLE 1
Overview of the Communal Detection (CD) algorithm

Table 1 shows the data input, the six most influential parameters, and the two adaptive parameters.

• vi: unscored current application. N is its number of attributes. ai,k is the value of the kth attribute in application vi.
• W: moving (or sliding) window of previous applications. It determines the short-time search space for the current application. CD utilises an application-based window (such as the previous ten thousand applications). vj is the scored previous application. aj,k is the value of the kth attribute in application vj.
• The whitelist link-type set: unique and sorted link-types (in descending order by number of links) in the link-type attribute of the current whitelist. M is the number of link-types.
• Tsimilarity: string similarity threshold between two values.
• Tattribute: attribute threshold which requires a minimum number of matched attributes to link two applications.
• η: exact duplicate filter at the link level. It removes links of exact duplicates from the same organisation within minutes, which are likely to be data errors by customers or employees.
• α: exponential smoothing factor. In CD, α gradually discounts the effect of average previous scores as the older scores become less relevant.
• Tinput: input size threshold. When the environment evolves significantly over time, Tinput may have to be manually adjusted.
• SoA (State-of-Alert): condition of reduced, same, or heightened watchfulness for each parameter.

Table 1 also shows the three outputs:
• S(vi): CD suspicion score of the current application.
• Same or new parameter values for each parameter.
• New whitelist.

While Table 1 gives an overview of the CD algorithm's six steps, the details in each step are presented below.

Step 1: Multi-attribute link
The first step of the CD algorithm matches every current application's value against a moving window of previous applications' values to find links.

    e_k = { 1 if Jaro-Winkler(a_i,k, a_j,k) ≥ T_similarity       (1)
          { 0 otherwise

where e_k is the single-attribute match between the current value and a previous value. The first case uses Jaro-Winkler(.) [30], is case sensitive, and can also be cross-matched between the current value and previous values from another similar attribute. The second case is a non-match because the values are not similar.

    e_i,j = { e_1 e_2 ... e_N if T_attribute ≤ Σ_{k=1}^{N} e_k ≤ N − 1,
            {                 or [Σ_{k=1}^{N} e_k = N                      (2)
            {                 and Time(a_i,k, a_j,k) ≥ η]
            { ε               otherwise


where e_i,j is the multi-attribute link (or binary string) between the current application and a previous application, and ε is the empty string. The first case uses Time(.), the time difference in minutes. The second case has no link (empty string) because it is not a near duplicate, or it is an exact duplicate within the time filter.
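As an illustration, the matching of Equations (1) and (2) can be sketched in a few lines of Python. This is not the authors' implementation: `difflib.SequenceMatcher` stands in for Jaro-Winkler(.), and the parameter values mirror the example settings used later in this section.

```python
from difflib import SequenceMatcher

T_SIMILARITY = 0.8   # string similarity threshold
T_ATTRIBUTE = 3      # minimum matched attributes to link two applications
ETA = 120            # exact-duplicate time filter, in minutes

def similar(a, b):
    # Stand-in for Jaro-Winkler(.); any similarity measure in [0, 1] fits here.
    return SequenceMatcher(None, a, b).ratio()

def multi_attribute_link(vi, vj, minutes_apart):
    """Return the link string e_ij per Eqs (1)-(2), or '' (epsilon) for no link."""
    e = [1 if similar(a, b) >= T_SIMILARITY else 0 for a, b in zip(vi, vj)]
    n, matched = len(e), sum(e)
    if T_ATTRIBUTE <= matched <= n - 1:           # near duplicate
        return "".join(map(str, e))
    if matched == n and minutes_apart >= ETA:     # exact duplicate outside filter
        return "".join(map(str, e))
    return ""                                     # epsilon: no link

# Applications 1 and 2 from the Table 2 sample.
vi = ("John", "Smith", "1", "Circular road", "91234567", "1/1/1982")
vj = ("Joan", "Smith", "1", "Circular road", "91234567", "1/1/1982")
print(multi_attribute_link(vi, vj, minutes_apart=500))  # → "011111"
```

With this stand-in measure, the pair yields the link 011111, matching the e_1,2 link listed for Table 2; an exact duplicate arriving within η minutes yields the empty string.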

Step 2: Single-link communal detection
The second step of the CD algorithm accounts for attribute weights, and matches every current application's link against the whitelist to find communal relationships and reduce their link score.

    S(e_i,j) = { Σ_{k=1}^{N} (e_k × w_k) × w_z if e_i,j ∈ Λ_x,link-type and e_i,j ≠ ε
               { Σ_{k=1}^{N} (e_k × w_k)       if e_i,j ∉ Λ_x,link-type and e_i,j ≠ ε     (3)
               { 0                             otherwise

where S(e_i,j) is the single-link score. The terminology "single-link score" is adopted over "multi-attribute link score" to focus on a single link between two applications, not on the matching of attributes between them. The first case uses w_k, the attribute weight with a default value of 1/N, and w_z, the weight of the zth link-type in the whitelist. The second case is the greylist (neither in the blacklist nor whitelist) link score. The last case is when there is no multi-attribute link.

Step 3: Single-link average previous score
The third step of the CD algorithm is the calculation of every linked previous application's score for inclusion into the current application's score. The previous scores act as the established baseline level.

    β_j = { S(v_j) / E_O(v_j) if e_i,j ≠ ε and E_O(v_j) > 0       (4)
          { 0                 otherwise

where β_j is the single-link average previous score. Initially there are no linked applications, so β_j = 0 since S(v_j) = 0 and E_O(v_j) = 0. S(v_j) is the suspicion score of a previous application to which the current application links; S(v_j) was computed in the same way as S(v_i), as a previous application was once a current application. E_O(v_j) is the number of outlinks from the previous application. The first case gives the average score of each previous application. The last case is when there is no multi-attribute link.

Step 4: Multiple-links score
The fourth step of the CD algorithm is the calculation of every current application's score using every link and previous application score.

    S(v_i) = Σ_{v_j ∈ K(v_i)} [S(e_i,j) + β_j]       (5)

where S(v_i) is the CD suspicion score of the current application. K(v_i) is the set of previous applications within the moving window to which the current application links. Therefore, a high score is the result of strong links between the current application and the previous applications (represented by S(e_i,j)), high scores from linked previous applications (represented by β_j), and a large number of linked previous applications (represented by Σ_{v_j ∈ K(v_i)} [.]).

    S(v_i) = Σ_{v_j ∈ K(v_i)} [(1 − α) × S(e_i,j) + α × β_j]       (6)

where Equation (6) incorporates α [6] into Equation (5).

Step 5: Parameter's value change
At the end of the current micro-discrete data stream, the adaptive CD algorithm determines the State-of-Alert (SoA) and updates one random parameter's value such that effectiveness is traded off against efficiency, or vice versa. This increases the tamper-resistance of the parameters.

    SoA = { low    if q ≥ T_input and Ω_{x−1} ≥ Ω_{x,y} and δ_{x−1} ≥ δ_{x,y}
          { high   if q < T_input and Ω_{x−1} < Ω_{x,y} and δ_{x−1} < δ_{x,y}     (7)
          { medium otherwise

where SoA is the state-of-alert at the end of every micro-discrete data stream. Ω_{x−1} is the long-term previous average score and Ω_{x,y} is the short-term current average score. δ_{x−1} is the long-term previous average links and δ_{x,y} is the short-term current average links. Collectively, these are termed output suspiciousness.

The first case sets SoA to low when the input size is high and output suspiciousness is low. The adaptive CD algorithm trades off one random parameter's effectiveness (degrades communal relationship security) for efficiency (improves computation speed). For example, a smaller moving window, fewer link-types in the whitelist, or a larger attribute threshold decreases the algorithm's effectiveness but increases its efficiency.

The second case sets SoA to high when its conditions are the opposite of the first case. The adaptive CD algorithm will trade off one random parameter's efficiency (degrades speed) for effectiveness (improves security).

The last case sets SoA to medium. The adaptive CD algorithm will not change any parameter's value.

Step 6: Whitelist change
At the end of the current mini-discrete data stream, the adaptive CD algorithm constructs the new whitelist from the current mini-discrete stream's links. This increases the tamper-resistance of the whitelist.


    i or j  Given name  Family name  Unit no.  Street name    Home phone  Date of birth
    1       John        Smith        1         Circular road  91234567    1/1/1982
    2       Joan        Smith        1         Circular road  91234567    1/1/1982
    3       Jack        Jones        3         Square drive   93535353    3/2/1955
    4       Ella        Jones        3         Square drive   93535353    6/8/1957
    5       Riley       Lee          2         Circular road  91235678    5/3/1983
    6       Liam        Smyth        2         Circular road  91235678    1/1/1982

TABLE 2
Sample of 6 credit applications with 6 attributes

Table 2 provides a sample of 6 credit applications with 6 attributes, to show how communal relationships are extracted from credit applications.

The whitelist is constructed from multi-attribute links generated by Step 1 of the CD algorithm on the training data. In this simple illustration, the CD algorithm is assumed to have the following parameter settings: T_similarity = 0.8, T_attribute = 3, and M = 4. If Table 2 is used as training data, five multi-attribute links will be generated: e_1,2 = 011111, e_1,6 = 010101, e_2,6 = 010101, e_3,4 = 011110, and e_5,6 = 001110. These multi-attribute links capture communal relationships: John and Joan are twins, Jack and Ella are married, Riley and Liam are housemates, John and Joan are neighbours of Riley and Liam; and John, Joan, and Liam share the same birthday.

    z  Link-type  No.  Weight
    1  010101     2    0.25
    2  011111     1    0.5
    3  011110     1    0.75
    4  001110     1    1

TABLE 3
Sample whitelist

Table 3 shows the sample whitelist constructed from the credit applications in Table 2. A whitelist contains three attributes: the link-type, which is a unique link determined from aggregated links in the training data; the corresponding number of links of this type; and the link-type weight. There will be many link-types, so the quantity of link-types is pre-determined by selecting the most frequent ones for the whitelist. Specifically, the link-types in the whitelist are processed in the following manner. The link-types are first sorted in descending order by number of links. For the highest ranked link-type, the link-type weight starts at 1/M. Each subsequent link-type weight is then incrementally increased by 1/M, until the lowest ranked link-type weight is one. In other words, a higher ranked link-type is given a smaller link-type weight and is most likely a communal relationship.
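The whitelist construction just described can be sketched as follows; `Counter.most_common` reproduces the sort by descending link count, and the link-type at rank z receives weight z/M. This is an illustrative reconstruction, not the authors' code.

```python
from collections import Counter

M = 4  # number of link-types kept in the whitelist

def build_whitelist(links, M):
    """Keep the M most frequent link-types; rank z (1-based) gets weight z/M,
    so the most communal (frequent) link-types discount scores the most."""
    top = Counter(links).most_common(M)
    return {link_type: (z + 1) / M for z, (link_type, _count) in enumerate(top)}

# The five links generated by Step 1 on the Table 2 training sample.
links = ["011111", "010101", "010101", "011110", "001110"]
print(build_whitelist(links, M))
```

On the Table 2 sample this reproduces the Table 3 weights: 010101 (two links) gets 0.25, and the three singleton link-types get 0.5, 0.75, and 1.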

3.3 Spike Detection (SD)

This subsection contrasts SD with CD, and presents the need for SD in order to improve resilience and adaptivity.

Before proceeding with a description of Spike Detection (SD), it is necessary to reinforce that CD finds real social relationships to reduce the suspicion score, and is tamper-resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. In contrast, SD finds spikes to increase the suspicion score, and is probe-resistant for attributes. Probe-resistance reduces the chances a fraudster will discover the attributes used in the SD score calculation. It is the attribute-oriented approach on a variable-size set of attributes. As a side note, SD cannot use a whitelist-oriented approach because it was not designed to create multi-attribute links on a fixed-size set of attributes.

CD has a fundamental weakness in its attribute threshold. Specifically, CD must match at least three values for our dataset. With fewer than three matched values, our whitelist does not contain real social relationships because some values, such as given name and unit number, are not unique identifiers. A fraudster can duplicate one or two important values which CD cannot detect.

SD complements CD. The redundant attributes are either too sparse, where no patterns can be detected, or too dense, where no denser values can be found. The redundant attributes are continually filtered; only selected attributes, in the form of not-too-sparse and not-too-dense attributes, are used for the SD suspicion score. In this way, the exposure of the detection system to probing of attributes is reduced because only one or two attributes are adaptively selected.

Suppose a bank ran a marketing campaign giving attractive benefits for its new ladies' platinum credit card. This would cause a spike in the number of legitimate credit card applications by women, which can be erroneously interpreted by the system as a fraudster attack.

To account for the changing legal behaviour caused by external events, SD strengthens CD by providing attribute weights which reflect the degree of importance of attributes. The attributes are adaptive for CD in the sense that their attribute weights are continually determined. This addresses external events, such as the entry of new organisations and exit of existing ones, and organisations' marketing campaigns, which do not contain any patterns and are likely to cause three natural changes in attribute weights. These changes are volume drift, where the overall volume fluctuates; population drift, where the volumes of the fraud and legal classes fluctuate independently of each other; and concept drift, which involves changing legal characteristics that can become similar to fraud characteristics. By tuning attribute weights, the detection system makes allowance for these external events.


In general, SD trades off effectiveness (degrades security because it has more false positives without filtering out communal relationships and some data errors) for efficiency (improves computation speed because it does not match against the whitelist, and can compute each attribute in parallel on multiple workstations). In contrast, CD trades off efficiency (degrades computation speed) for effectiveness (improves security by accounting for communal relationships and more data errors).

3.4 SD algorithm design

This subsection explains how the SD algorithm works in real-time with the CD algorithm, and in terms of its six inputs, two outputs, and five steps.

From the data stream point of view, using a series of window steps, the SD algorithm matches the current application's value against a moving window of previous applications' values. It calculates the current value's score by integrating all steps to find spikes. Then, it calculates the current application's score using all values' scores and attribute weights. Also, at the end of the current mini-discrete data stream, the SD algorithm selects the attributes for the SD suspicion score, and updates the attribute weights for CD.

Inputs:
v_i (current application)
W number of v_j (moving window)
t (current step)
T_similarity (string similarity threshold)
θ (time difference filter)
α (exponential smoothing factor)

Outputs:
S(v_i) (suspicion score)
w_k (attribute weight)

SD algorithm:
Step 1: Single-step scaled counts [match v_i against W number of v_j to determine if a single value exceeds T_similarity and its time difference exceeds θ]
Step 2: Single-value spike detection [calculate current value's score based on weighted average (using α) of t Step 1's scaled matches]
Step 3: Multiple-values score [calculate S(v_i) from Step 2's value scores and Step 4's w_k]
Step 4: SD attributes selection [determine w_k for SD at end of g_x]
Step 5: CD attribute weights change [determine w_k for CD at end of g_x]

TABLE 4
Overview of Spike Detection (SD) algorithm

Table 4 shows the data input and five parameters.

• v_i: unscored current application (previously introduced in subsection 3.1).
• W: in SD, a time-based window (such as the previous ten days).
• t: current step, also the number of steps in W.
• T_similarity: string similarity threshold between two values (previously described in subsection 3.1).
• θ: time difference filter at the link level. It is a simplified version of the exact duplicate filter.
• α: in SD, it gradually discounts the effect of previous steps of each value as the older steps become less relevant.

Table 4 also shows the two outputs.
• S(v_i): SD suspicion score of the current application.
• w_k: in SD, each attribute weight is automatically updated at the end of the current mini-discrete data stream.

While Table 4 gives an overview of the SD algorithm's five steps, the details of each step are presented below.

Step 1: Single-step scaled count
The first step of the SD algorithm matches every current value against a moving window of previous values in steps.

    a_i,j = { 1 if Jaro-Winkler(a_i,k, a_j,k) ≥ T_similarity
            {   and Time(a_i,k, a_j,k) ≥ θ                       (8)
            { 0 otherwise

where a_i,j is the single-attribute match between the current value and a previous value. The first case uses Jaro-Winkler(.) [30], which is case sensitive and can also be cross-matched between the current value and previous values from another similar attribute, and Time(.), the time difference in minutes. The second case is a non-match because the values are not similar, or recur too quickly.

    s_τ(a_i,k) = Σ_{a_j,k ∈ L(a_i,k)} a_i,j / κ       (9)

where s_τ(a_i,k) represents the scaled matches in each step (the moving window is made up of many steps) to remove volume effects. L(a_i,k) is the set of previous values within each step which the current value matches, and κ is the number of values in each step.

Step 2: Single-value spike detection
The second step of the SD algorithm is the calculation of every current value's score by integrating all steps to find spikes. The previous steps act as the established baseline level.

    S(a_i,k) = (1 − α) × s_t(a_i,k) + α × [Σ_{τ=1}^{t−1} s_τ(a_i,k)] / (t − 1)       (10)

where S(a_i,k) is the current value score.

Step 3: Multiple-values score
The third step of the SD algorithm is the calculation of every current application's score using all values' scores and attribute weights.


    S(v_i) = Σ_{k=1}^{N} S(a_i,k) × w_k       (11)

where S(v_i) is the SD suspicion score of the current application.

Step 4: SD attributes selection
At the end of every current mini-discrete data stream, the fourth step of the SD algorithm selects the attributes for the SD suspicion score. This also highlights the probe-reduction of selected attributes.

    w_k = { 1 if 1/(2 × N) ≤ [Σ_{i=1}^{p×q} S(a_i,k)] / [i × Σ_{k=1}^{N} w_k]
          {      ≤ 1/N + √[(1/N) × Σ_{k=1}^{N} (w_k − 1/N)²]                      (12)
          { 0 otherwise

where w_k is the SD attribute weight applied to the SD attributes in Equation (11). The first case is the average density of each attribute, or the sum of all value scores within a mini-discrete stream for one attribute, relative to all other applications and attribute weights. In addition, the first case retains only the best attributes' weights, within the lower bound (half of the default weight) and upper bound (default weight plus one standard deviation), by setting redundant attributes' weights to zero.

Step 5: CD attribute weights change
At the end of every current mini-discrete data stream, the fifth step of the SD algorithm updates the attribute weights for CD.

    w_k = [Σ_{i=1}^{p×q} S(a_i,k)] / [i × Σ_{k=1}^{N} w_k]       (13)

where w_k is the SD attribute weight applied to the CD attributes in Equation (3).

Standalone CD assumes all attributes are of equal importance. The resilient combination CD-SD means that CD is provided attribute weights by SD, and these attribute weights reflect the degree of importance of attributes. This is how the CD and SD scores are combined to give a single score.
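Steps 4 and 5 can be sketched together. This is an interpretation, not the authors' code: the per-attribute average density of Equation (13) is computed first, and the Equation (12) bounds then keep or drop attributes, with the new densities treated as the candidate weights w_k in the standard deviation term.

```python
import math

def attribute_densities(value_scores, old_weights):
    """Average density per attribute (Eqs 12-13): the sum of one attribute's
    value scores over all n applications, relative to the old weight total."""
    n = len(value_scores)
    wsum = sum(old_weights)
    return [sum(app[k] for app in value_scores) / (n * wsum)
            for k in range(len(old_weights))]

def select_sd_attributes(densities, N):
    """Eq. (12): keep attributes whose density lies between half the default
    weight (1/(2N)) and the default weight plus one standard deviation."""
    default = 1.0 / N
    std = math.sqrt(sum((d - default) ** 2 for d in densities) / N)
    lower, upper = default / 2, default + std
    return [1 if lower <= d <= upper else 0 for d in densities]

# Hypothetical value scores for two applications over N = 4 attributes.
apps = [[0.4, 0.1, 0.0, 0.9], [0.4, 0.1, 0.0, 0.9]]
densities = attribute_densities(apps, [0.25] * 4)
print(select_sd_attributes(densities, 4))  # → [1, 0, 0, 0]
```

In this toy example the too-sparse attributes (densities 0.1 and 0.0) and the too-dense attribute (0.9) are set to zero, leaving one not-too-sparse, not-too-dense attribute for the SD score, while the raw densities would be passed to CD as its attribute weights.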

4 EXPERIMENTAL RESULTS

4.1 Identity data - Real Application DataSet (RADS)

Substantial identity crime can be found in private and commercial databases containing information collected about customers, employees, suppliers, and rule violators. The same situation occurs in public and government-regulated databases such as birth, death, patient, and disease registries; and taxpayer, residents' address, bankruptcy, and criminal lists.

To reduce identity crime, the most important textual identity attributes, such as personal name, Social Security Number (SSN), Date-of-Birth (DoB), and address, must be used. The following publications support this argument: [16] ranks SSN as most important, followed by personal name, DoB, and address. [17] assigns the highest weights to permanent attributes (such as SSN and DoB), followed by stable attributes (such as last name and state), and transient (or ever-changing) attributes (such as mobile phone number and email address). [27] states that DoB, gender, and postcode can uniquely identify more than eighty percent of the United States (US) population. [12], [20] regard name, gender, DoB, and address as the most important attributes. The most important identity attributes differ from database to database. They are least likely to be manipulated, and are easiest to collect and investigate. They also have the fewest missing values, the fewest spelling and transcription errors, and no encrypted values.

Extra precaution had to be taken in this project since this is the first time, to the best of the researchers' knowledge, that so much real identity data has been released for original credit application fraud detection research. Issues of privacy, confidentiality, and ethics were of prime concern.

This real dataset was chosen because, at experimentation time, it had the most recent fraud behaviour. Although this real dataset cannot be made available, a synthetic dataset of fifty thousand credit applications is available at https://sites.google.com/site/cliftonphua/communal-fraud-scoring-data.zip.

The specific summaries and basic statistics of the real credit application data are discussed below. For purposes of confidentiality, the application volume and fraud percentage in Figure 1 have been deliberately removed. Also, the average fraud percentage (known fraud percentage in all applications) and specific attributes for application fraud detection cannot be revealed.

There are thirteen months (m1 to m13) with several million applications (Veda Advantage, 2006). Each day (d1 to d31) has more than ten thousand applications. This historical data is unsampled, time-stamped to the millisecond, and modelled as data streams. Figure 1(a) illustrates that the detection system has to handle a more rapid and continuous data stream on weekdays than on weekends.

Fig. 1. Real Application DataSet (RADS): (a) daily application volume for two months; (b) fraud percentage across months

There are about thirty raw attributes, such as personal names, addresses, telephone numbers, driver licence numbers (or SSN), DoB, and other identity attributes (but no link attribute). Only nineteen of the most important identity attributes (I to XIX) are selected. All numerical attributes are treated as string attributes. Some of these identifying attributes, including names, were encrypted to preserve privacy. For our identity crime detection data, the encrypted attributes are limited to exact matching because the particular encryption method was not made known to us. In a real application, however, homomorphic encryption [18] or unencrypted attributes would be used to allow string similarity matching. Another two problems are many missing values in some attributes, and hash collisions in encrypted attributes (different original values encrypted into the same encrypted value), but it is beyond the scope of this paper to present solutions to them.

The class imbalance is extreme, with less than one percent of known frauds among all binary class-labeled (as "fraud" or "legal") applications. Figure 1(b) depicts that known frauds are significantly understated in the provided applications. The main reason for fewer known frauds is having only eight months (m7 to m14) of known frauds linked to thirteen months of applications. Six months (m1 to m6) of known frauds were not provided. This results in m6 to m10 appearing to have the highest fraud percentage, but this is not true. Other reasons include some frauds being unlabeled, having been inadvertently overlooked. Some known frauds are labeled once but not their duplicates, while some organisations do not contribute known frauds.

The impact of fewer known frauds is that algorithms will produce poorer results and lead to incorrect evaluation. To reduce this negative impact and improve scalability, the data has been rebalanced by retaining all known frauds but randomly undersampling unknown applications by ninety percent.

There are multiple sources, consisting of thirty-one organisations (s1 to s31), that provided the applications. The top-5 of these organisations (s1 to s5) can be considered large (with at least ten thousand applications per month), and more important than the others, because they contribute more income to the credit bureau. Each organisation contributes its own number and type of attributes.

The data quality was enhanced through the cleaning of two obvious data errors. First, a few organisations' applications, amounting to slightly more than ten percent of all applications, were filtered out. This was because some important unstructured attributes were encrypted into just one value. Also, several "dummy" organisations' applications, comprising less than two percent of all applications, were filtered out. These were actually test values, particularly common in some months.

After the above data pre-processing activities, the actual experimental data provided significantly improved results. This was observed using the parameter settings in CD and SD (subsection 4.3). These results have been omitted to focus on the results from CD and SD parameter settings and attributes.

Which months, then, are the training and test datasets? The CD, SD, and classification algorithms use eight consecutive months (m6 to m13) out of the thirteen months of data (each month is also known as a mini-discrete stream in this paper), where known frauds are not significantly understated. For creating the whitelist, selecting attributes, or setting attribute weights in the next month, the training set is the previous month's data. For evaluation, the test set is the current month's data. The training and test datasets are separate from each other. For example, in CD, the initial whitelist is constructed from m5 training data and applied to m6 test data; and so on, until the final whitelist is constructed from m12 training data and applied to m13 test data.

4.2 Evaluation measure

                Known frauds  Unknowns
    Alerts      tp            fp
    Non-alerts  fn            tn

TABLE 5
Confusion matrix

Table 5 shows the four main result categories for binary-class data with a given decision threshold. Alerts (or alarms) refer to applications with scores which exceed the decision threshold, and which are subjected to responses such as human investigation or outright rejection. Non-alerts are applications with scores lower than the decision threshold. tp, fp, fn, and tn are the numbers of true positives (or hits), false positives (or false alarms), false negatives (or misses), and true negatives (or normals), respectively.

    Measure                      Description
    precision                    tp / (tp + fp)
    recall, sensitivity          tp / (tp + fn)
    (1 - specificity)            fp / (fp + tn)
    F-measure curve              2 × precision × recall / (precision + recall),
                                 plotted against X thresholds
    Receiver Operating           All scores are ranked in descending order;
    Characteristic (ROC) curve   sensitivity versus (1 - specificity) plotted
                                 against X thresholds

TABLE 6
Evaluation measures

Table 6 briefly summarises useful evaluation measures for scores. This paper uses the F-measure curve [31] and the Receiver Operating Characteristic (ROC) curve [8], with eleven threshold values from zero to one, to compare all experiments. The additional thresholds are needed because F-measure curves seldom dominate one another over all thresholds. The F-measure curve is recommended over other useful measures for the following reasons. First, for confidentiality reasons, precision-recall curves are not used, as they would reveal true positives, false positives, and false negatives. Second, on imbalanced class data, ROC curves, AUC, and accuracy understate the false positive percentage because they use true negatives [5]. Third, being a single-value measure, NTOP-k [4] does not evaluate results for more than one threshold. The ROC curve can be used to complement the F-measure curve because the former allows the reader to directly interpret whether CD and SD are really reducing false positives. Reducing false positives is very important because staff costs for manual investigation, and valuable customers lost due to credit application rejection, are very expensive.

In this paper, the scores have two unique characteristics. First, the CD score distribution is heavily skewed to the left, while the SD score distribution is more skewed to the right. Most scores are zero as values are usually sparse. All the zero scores have been removed since they are not relevant to decision making. This results in more realistic F-measures, although the number of applications in each F-measure will most likely be different. Second, some scores can exceed one since each application can be similar to many others. In contrast, classifier scores from naive Bayes, decision trees, logistic regression, or SVM exhibit a normal distribution, and each score is between zero and one.
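The measures in Table 6 are straightforward to compute from the confusion-matrix counts of Table 5; a small sketch with made-up counts (the real counts are confidential):

```python
def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) from Table 5 counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts at one decision threshold: precision 0.3, recall 0.15.
print(f_measure(tp=30, fp=70, fn=170))
```

Note that true negatives do not appear in the F-measure, which is precisely why it does not understate the false positive percentage on imbalanced class data the way accuracy and ROC-based measures can.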

4.3 CD and SD's experimental design

All experiments were performed on a dedicated server with 2 Xeon Quad Core CPUs (8 cores at 2.0 GHz) and 12 GB RAM, running on the Windows Server 2008 platform. The communal and spike detection algorithms, as well as the evaluation measures, were coded in Java. The application data was stored in a MySQL database. The plan here is to process all real applications from RADS with the most influential parameters and their values. These influential parameters are known to provide the best results based on the experience from hundreds of previous experiments. However, the best results are also dependent on setting the right value for each influential parameter in practice, as some parameters are sensitive to a change in their value.

There are seven experiments which focus on specific claims in this paper: (1) no-whitelist, (2) CD-baseline, (3) CD-adaptive, (4) SD-baseline, (5) SD-adaptive, (6) CD-SD-resilient, and (7) CD-SD-resilient-best.

The first three experiments address how much the CD algorithm reduces false positives. The no-whitelist experiment uses zero link-types (M = 0) to avoid using the whitelist. The CD-baseline experiment has the following parameter values (based on hundreds of previous CD experiments):
• W = set to what is convenient for experimentation (for reasons of confidentiality, the actual W cannot be given)
• M = 100
• T_similarity = 0.8
• T_attribute = 3
• η = 120
• α = 0.8
In other words, the CD-baseline uses a whitelist with the one hundred most frequent link-types, and sets the string similarity threshold, attribute threshold, exact duplicate filter, and the exponential smoothing factor for scores. To validate the usefulness of the adaptive CD algorithm's changing parameter values, the CD-adaptive experiment has three parameters (W, M, T_similarity) whose values can be changed according to the State-of-Alert (SoA).

The fourth and fifth experiments show whether the SD algorithm increases detection power. The SD-baseline experiment has the following parameter values (based on hundreds of previous SD experiments):
• N = 19
• t = 10
• T_similarity = 0.8
• θ = 60
• α = 0.8
In other words, the SD-baseline uses all nineteen attributes, a moving window made up of ten window steps, and sets the string similarity threshold, time difference filter, and the exponential smoothing factor for steps. The SD-adaptive experiment selects the two best attributes for its suspicion score.

The last two experiments highlight how well the CD-SD combination works. The CD-SD-resilient experiment is actually CD-baseline using attribute weights provided by SD-baseline. To empirically evaluate the detection system, the final experiment is the CD-SD-resilient-best experiment, with the best parameter setting below (without the adaptive CD algorithm's changing parameter values):
• W = set to what is expected to be used in practice
• T_similarity = 1
• T_attribute = 4
• SD attribute weights

4.4 CD and SD’s results and discussion

Fig. 2. F-measure curves of CD and SD experiments

The CD F-measure curves skew to the left. The CD-related F-measure curves start from 0.04 to 0.06 at threshold 0, and peak from 0.08 to 0.25 at thresholds 0.2 or 0.3. On the other hand, the SD F-measure curves skew to the right.


Fig. 3. ROC curves of CD and SD experiments

Without the whitelist, the results are inferior. From Figure 2 at threshold 0.2, the no-whitelist experiment (F-measure below 0.09) performs more poorly than the CD-baseline experiment (F-measure above 0.1). From Figure 3, the no-whitelist experiment has about 10% more false positives than the CD-baseline experiment. This verifies the hypothesis that the whitelist is crucial because it reduces the scores of legal behaviour and the number of false positives; also, the larger the volume for a link-type, the higher the probability of a communal relationship.

From Figure 2 at threshold 0.2, the CD-adaptive experiment (F-measure above 0.16) has a significantly higher F-measure than the CD-baseline experiment. From Figure 3, the CD-adaptive experiment has about 5% fewer false positives in the early part of the ROC curve than the CD-baseline experiment. The interpretation is that the most useful parameters are the moving window and the number of link-types. More importantly, the adaptive CD algorithm finds the balance between effectiveness and efficiency to produce significantly better results than the CD-baseline experiment. This empirical evidence suggests that there is tamper-resistance in the parameters and the whitelist, as some parameters' values and the whitelist's link-types are changed in a principled fashion.

From Figure 2 at threshold 0.7, the SD-adaptive experiment (F-measure around 0.1) has a significantly higher F-measure than the SD-baseline experiment. Also, the SD-adaptive experiment has almost the same F-measure as the CD-baseline experiment, but at different thresholds. Since most attributes are redundant, the adaptive SD algorithm only needs to select the two best attributes for calculation of the suspicion score. This means that the adaptive SD algorithm on the two best attributes produces better results than the SD algorithm on all attributes, as well as similar results to the basic CD algorithm on all attributes.

Across thresholds 0.2 to 0.5, the CD-SD-resilient-best experiment (F-measure above 0.23) has an F-measure more than twice that of the CD-baseline experiment. This is the most apparent outcome of all experiments: the CD algorithm, strengthened by the SD algorithm's attribute weights, and with the right parameter setting, delivers superior results, despite an extremely imbalanced class (at least for the given dataset). In addition, results from the CD-SD-resilient experiment support the view that SD attribute weights strengthen the CD algorithm, and resilience (CD-SD-resilient) is shown to be better than adaptivity (CD-adaptive and SD-adaptive).
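One way to picture how SD attribute weights could strengthen a CD-style score is as a weighted match over the attributes two applications share. This is a deliberately simplified sketch, not the paper's actual combination; the attribute names and weights are invented:

```python
# Sketch: a pair of matched applications scored by summing the SD-derived
# weights of the attributes they share. Weights here are hypothetical.

sd_weights = {"phone": 0.4, "address": 0.35, "date_of_birth": 0.25}

def cd_score(matched_attributes, weights):
    """Sum the SD weights of the attributes two applications have in common."""
    return sum(weights[a] for a in matched_attributes if a in weights)

# A pair matching on the two heavily weighted attributes scores higher
# than a pair matching only on a lightly weighted one.
print(round(cd_score({"phone", "address"}, sd_weights), 2))  # prints 0.75
print(round(cd_score({"date_of_birth"}, sd_weights), 2))     # prints 0.25
```

Under this simplification, a match on spiking (highly weighted) attributes raises suspicion more than a match on redundant ones, which is the intuition behind feeding SD weights into CD.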

Fig. 4. F-measure curves of CD-SD-resilient-best parameters

Extending the CD-SD-resilient-best experiment, Figure 4 shows the results of doubling the most influential parameters' values. W and T_attribute produce significant increases in F-measure over most thresholds, and M produces a slight increase at thresholds 0.2 to 0.4.

Results on the data support the argument that successful credit application fraud patterns are characterised by sudden and sharp spikes in duplicates. However, this result is based on some assumptions and conditions shown by the research to be critical for effective detection. A larger moving window and attribute threshold, as well as exact matching and the whitelist, must be used. There must also be tamper-resistance in the parameters and the whitelist. It is also assumed that SD attribute weights are used for SD attribute selection (probe-reduction of attributes) and for changing CD attribute weights. However, the results can be slightly inaccurate because of the encryption of some attributes and the significantly understated number of known frauds. Also, the solutions could not account for the effects of the existing defences - business rules and scorecards, and known fraud matching - on the results.
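The "sudden and sharp spikes in duplicates" pattern can be illustrated with a generic smoothing sketch in the spirit of the geometric-moving-average control charts the paper cites [23]. This is not the paper's actual SD formula; the smoothing constant, threshold, and counts below are all invented for illustration:

```python
# Generic spike-detection sketch over a stream of per-period duplicate
# counts: compare each count against an exponentially weighted moving
# average (EWMA) baseline, and flag large exceedances as spikes.
# alpha and threshold are hypothetical, not the paper's parameters.

def detect_spikes(counts, alpha=0.3, threshold=3.0):
    """Flag periods whose count exceeds `threshold` times the EWMA baseline."""
    baseline, flags = counts[0], []
    for count in counts[1:]:
        flags.append(count > threshold * baseline)
        baseline = alpha * count + (1 - alpha) * baseline  # EWMA update
    return flags

# Duplicates per period: steady legal behaviour, then a sudden fraud-like spike.
stream = [2, 3, 2, 2, 14, 3]
print(detect_spikes(stream))  # prints [False, False, False, True, False]
```

Because the baseline adapts slowly, a sharp jump stands out, while gradual legal growth in duplicates is absorbed into the smoothed average.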

4.5 Drilled-down results and discussion

The CD-SD-resilient-best experiment shows that the CD-SD combination method works best for all thirty-one organisations as a whole. The same method may not work well for every organisation. Figure 5 shows the detailed breakdown of the top-5 organisations' (s1 to s5) results from the CD-SD-resilient-best experiment. Similar to the CD-SD-resilient-best experiment, the top-5 organisations' F-measure curves are skewed to the left.


Fig. 5. F -measure curves of top-5 organisations

Across thresholds 0.2 to 0.5, two organisations, s1 (F-measure above 0.22) and s3 (F-measure above 0.19), have F-measures comparable to the CD-SD-resilient-best experiment (F-measure above 0.23). In contrast, for the same thresholds, three organisations, s4 (F-measure above 0.16), s2 (F-measure above 0.08), and s5 (F-measure above 0.05), have significantly lower F-measures. However, in the CD-baseline experiment, for thresholds 0.2 to 0.5, s5 performs better than s4. This implies that most methods or parameter settings can work well for only some organisations.

4.6 Classifier-comparison experimental design, results, and discussion

Are classification algorithms suitable for the real-time credit application fraud detection domain? To answer this question, four popular classification algorithms with default parameters in WEKA [31] were chosen for the classifier experiments. The algorithms were Naive Bayes (NB), C4.5 Decision Tree (DT), Logistic Regression (LoR), and Support Vector Machines (SVM) - the current state-of-the-art libSVM. A well-known data stream classification algorithm, Very Fast Machine Learner (VFML), which is a Hoeffding decision tree, was also used with default parameters in MOA [1]. They were applied to the same training and test data used by the CD and SD algorithms, with an extra step to convert the string attributes to word-vector ones. The following experiments assume that ground truth is available at training time (see Section 1.2 for a description of the problems in using known frauds).
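The extra preprocessing step mentioned above, converting string attributes to word vectors, amounts to bag-of-words counting. A minimal pure-Python sketch (the attribute values are toy stand-ins, not the credit application data) shows the conversion:

```python
# Sketch of the string-to-word-vector conversion applied before the
# classifier experiments. Records below are toy stand-ins for the
# string attributes of credit applications.
from collections import Counter

def to_word_vector(record, vocabulary):
    """Count occurrences of each vocabulary word in the record's string attributes."""
    words = Counter(" ".join(record).lower().split())
    return [words[w] for w in vocabulary]

records = [
    ("John Smith", "12 Oak St"),
    ("Mary Jones", "9 Elm Ave"),
]
# Vocabulary built from all training records, in a fixed (sorted) order.
vocabulary = sorted({w for r in records for w in " ".join(r).lower().split()})
vectors = [to_word_vector(r, vocabulary) for r in records]
print(vocabulary)
print(vectors)
```

Each record becomes a fixed-length numeric vector, one count per vocabulary word, which is the representation the WEKA and MOA classifiers consume. This blow-up into many sparse word-vector attributes is also why the classifier experiments are slow, as Table 7 later shows.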

Classification algorithms are not the most accurate and scalable for this domain. Figure 6 compares the five classifiers against the CD-SD-resilient-best experiment using F-measure across eleven thresholds. Across thresholds 0.2 to 0.5, the CD-SD-resilient-best experiment's F-measure can be several times higher than those of the five classifiers: NB (F-measure above 0.08), LoR (F-measure above 0.05), VFML (F-measure above 0.04), and SVM and DT (F-measure above 0.03). Also, results did not improve from training the five classifiers on labeled multi-attribute links and applying them to the multi-attribute links in the test data.

Fig. 6. F-measure curves of five classification algorithms

Table 7 measures the relative time of the five classifiers using the CD-SD-resilient-best experiment as the baseline, where time refers to the total system time for the algorithm to complete.

Experiment(s)          Relative time
CD-SD-resilient-best   1
NB                     1.25
DT                     5
VFML                   18
LoR                    60
SVM                    156

TABLE 7
Relative time of five classification algorithms

The CD-SD-resilient-best experiment is orders of magnitude faster than the classifier experiments because it does not need to train on many word-vector attributes with few known frauds.
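The F-measure used throughout these comparisons is the standard harmonic mean of precision and recall. A small helper makes the metric concrete (the counts in the example are invented for illustration, not taken from the experiments):

```python
# F-measure: harmonic mean of precision and recall, the metric compared
# at each threshold in Figures 2-6. Counts below are illustrative only.

def f_measure(true_pos, false_pos, false_neg):
    """Compute F-measure from detection counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# With a heavily imbalanced class, even a detector that catches many
# frauds scores low once false positives and misses are counted.
print(round(f_measure(true_pos=50, false_pos=150, false_neg=250), 2))  # prints 0.2
```

Because the harmonic mean punishes whichever of precision or recall is lower, a low F-measure like those of the five classifiers indicates the detector is weak on at least one side, not necessarily both.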

5 CONCLUSION

The main focus of this paper is Resilient Identity Crime Detection; in other words, the real-time search for patterns in a multi-layered and principled fashion, to safeguard credit applications at the first stage of the credit life cycle.

This paper describes an important domain that has many problems relevant to other data mining research. It has documented the development and evaluation of the data mining layers of defence for a real-time credit application fraud detection system. In doing so, this research produced three concepts (or "force multipliers") which dramatically increase the detection system's effectiveness (at the expense of some efficiency). These concepts are resilience (multi-layer defence), adaptivity (accounting for changing fraud and legal behaviour), and quality data (real-time removal of data errors). These concepts are fundamental to the design, implementation, and evaluation of all fraud detection, adversarial-related detection, and identity crime-related detection systems.


The implementation of the CD and SD algorithms is practical because these algorithms are designed for actual use to complement the existing detection system. Nevertheless, there are limitations. The first limitation is effectiveness, as scalability issues, an extremely imbalanced class, and time constraints dictated the use of rebalanced data in this paper. The counter-argument is that, in practice, the algorithms can search with a significantly larger moving window, number of link-types in the whitelist, and number of attributes. The second limitation is in demonstrating the notion of adaptivity. While in the experiments CD and SD are updated after every period, this is not a true evaluation, because the fraudsters do not get a chance to react and change their strategy in response to CD and SD, as would occur if they were deployed in real life (experiments were performed on historical data).

ACKNOWLEDGMENTS

The authors are grateful to Dr. Warwick Graco and Mr. Kelvin Sim for their insightful comments. This research was supported by the Australian Research Council (ARC) under Linkage Grant Number LP0454077.

REFERENCES

[1] Bifet, A. and Kirkby, R. 2009. Massive Online Analysis, Technical Manual, University of Waikato.
[2] Bolton, R. and Hand, D. 2001. Unsupervised Profiling Methods for Fraud Detection, Proc. of CSCC01.
[3] Brockett, P., Derrig, R., Golden, L., Levine, A. and Alpert, M. 2002. Fraud Classification using Principal Component Analysis of RIDITs, The Journal of Risk and Insurance 69(3): pp. 341-371. DOI: 10.1111/1539-6975.00027.
[4] Caruana, R. and Niculescu-Mizil, A. 2004. Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria, Proc. of SIGKDD04. DOI: 10.1145/1014052.1014063.
[5] Christen, P. and Goiser, K. 2007. Quality and Complexity Measures for Data Linkage and Deduplication, in F. Guillet and H. Hamilton (eds), Quality Measures in Data Mining, Vol. 43, Springer, United States. DOI: 10.1007/978-3-540-44918-8.
[6] Cortes, C., Pregibon, D. and Volinsky, C. 2003. Computational Methods for Dynamic Graphs, Journal of Computational and Graphical Statistics 12(4): pp. 950-970. DOI: 10.1198/1061860032742.
[7] Experian. 2008. Experian Detect: Application Fraud Prevention System. Whitepaper, http://www.experian.com/products/pdf/experian detect.pdf.
[8] Fawcett, T. 2006. An Introduction to ROC Analysis, Pattern Recognition Letters 27: pp. 861-874. DOI: 10.1016/j.patrec.2005.10.010.
[9] Goldenberg, A., Shmueli, G. and Caruana, R. 2002. Using Grocery Sales Data for the Detection of Bio-Terrorist Attacks, Statistical Medicine.
[10] Gordon, G., Rebovich, D., Choo, K. and Gordon, J. 2007. Identity Fraud Trends and Patterns: Building a Data-Based Foundation for Proactive Enforcement, Center for Identity Management and Information Protection, Utica College.
[11] Hand, D. 2006. Classifier Technology and the Illusion of Progress, Statistical Science 21(1): pp. 1-15. DOI: 10.1214/088342306000000060.
[12] Head, B. 2006. Biometrics Gets in the Picture, Information Age August-September: pp. 10-11.
[13] Hutwagner, L., Thompson, W., Seeman, G. and Treadwell, T. 2006. The Bioterrorism Preparedness and Response Early Aberration Reporting System (EARS), Journal of Urban Health 80: pp. 89-96. PMID: 12791783.
[14] IDAnalytics. 2008. ID Score-Risk: Gain Greater Visibility into Individual Identity Risk. Unpublished.
[15] Jackson, M., Baer, A., Painter, I. and Duchin, J. 2007. A Simulation Study Comparing Aberration Detection Algorithms for Syndromic Surveillance, BMC Medical Informatics and Decision Making 7(6). DOI: 10.1186/1472-6947-7-6.
[16] Jonas, J. 2006. Non-Obvious Relationship Awareness (NORA), Proc. of Identity Mashup.
[17] Jost, A. 2004. Identity Fraud Detection and Prevention. Unpublished.
[18] Kantarcioglu, M., Jiang, W. and Malin, B. 2008. A Privacy-Preserving Framework for Integrating Person-Specific Databases, Privacy in Statistical Databases, Lecture Notes in Computer Science, 5262/2008: pp. 298-314. DOI: 10.1007/978-3-540-87471-3_25.
[19] Kleinberg, J. 2005. Temporal Dynamics of On-Line Information Streams, in M. Garofalakis, J. Gehrke and R. Rastogi (eds), Data Stream Management: Processing High-Speed Data Streams, Springer, United States. ISBN: 978-3-540-28607-3.
[20] Kursun, O., Koufakou, A., Chen, B., Georgiopoulos, M., Reynolds, K. and Eaglin, R. 2006. A Dictionary-Based Approach to Fast and Accurate Name Matching in Large Law Enforcement Databases, Proc. of ISI06. DOI: 10.1007/11760146.
[21] Neville, J., Simsek, O., Jensen, D., Komoroske, J., Palmer, K. and Goldberg, H. 2005. Using Relational Knowledge Discovery to Prevent Securities Fraud, Proc. of SIGKDD05. DOI: 10.1145/1081870.1081922.
[22] Oscherwitz, T. 2005. Synthetic Identity Fraud: Unseen Identity Challenge, Bank Security News 3: p. 7.
[23] Roberts, S. 1959. Control Chart Tests Based on Geometric Moving Averages, Technometrics 1: pp. 239-250.
[24] Romanosky, S., Sharp, R. and Acquisti, A. 2010. Data Breaches and Identity Theft: When is Mandatory Disclosure Optimal?, Proc. of WEIS10 Workshop, Harvard University.
[25] Schneier, B. 2003. Beyond Fear: Thinking Sensibly about Security in an Uncertain World, Copernicus, New York. ISBN-10: 0387026207.
[26] Schneier, B. 2008. Schneier on Security, Wiley, Indiana. ISBN-10: 0470395354.
[27] Sweeney, L. 2002. k-Anonymity: A Model for Protecting Privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5): pp. 557-570.
[28] VedaAdvantage. 2006. Zero-Interest Credit Cards Cause Record Growth in Card Applications. Unpublished.
[29] Wheeler, R. and Aitken, S. 2000. Multiple Algorithms for Fraud Detection, Knowledge-Based Systems 13(3): pp. 93-99. DOI: 10.1016/S0950-7051(00)00050-2.
[30] Winkler, W. 2006. Overview of Record Linkage and Current Research Directions, Technical Report RR 2006-2, U.S. Census Bureau.
[31] Witten, I. and Frank, E. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java, Morgan Kaufmann Publishers, San Francisco. ISBN-10: 1558605525.
[32] Wong, W. 2004. Data Mining for Early Disease Outbreak Detection, PhD thesis, Carnegie Mellon University.
[33] Wong, W., Moore, A., Cooper, G. and Wagner, M. 2003. Bayesian Network Anomaly Pattern Detection for Detecting Disease Outbreaks, Proc. of ICML03. ISBN: 1-57735-189-4.

Clifton Phua is a Research Fellow at the Data Mining Department of the Institute for Infocomm Research (I2R), Singapore. His current research interests are in security- and healthcare-related data mining.

Kate Smith-Miles is a Professor and Head of the School of Mathematical Sciences at Monash University, Australia. Her current research interests are in neural networks, intelligent systems, and data mining.

Vincent Lee is an Associate Professor in the Clayton School of Information Technology, Monash University, Australia. His current research interests are in data and text mining for business intelligence.

Ross Gayler is a Senior Research and Development Consultant at Veda Advantage, Australia. His current research interests are in credit scoring.