
DATA MINING

Distributed Data Mining in Credit Card Fraud Detection

Philip K. Chan, Florida Institute of Technology
Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, Columbia University

This scalable black-box approach for building efficient fraud detectors can significantly reduce loss due to illegitimate behavior. In many cases, the authors' methods outperform a well-known, state-of-the-art commercial fraud-detection system.

Credit card transactions continue to grow in number, taking an ever-larger share of the US payment system and leading to a higher rate of stolen account numbers and subsequent losses by banks. Improved fraud detection thus has become essential to maintain the viability of the US payment system. Banks have used early fraud warning systems for some years.

Large-scale data-mining techniques can improve on the state of the art in commercial practice. Developing scalable techniques that analyze massive amounts of transaction data and efficiently compute fraud detectors in a timely manner is an important problem, especially for e-commerce. Besides scalability and efficiency, the fraud-detection task exhibits technical problems that include skewed distributions of training data and nonuniform cost per error, neither of which has been widely studied in the knowledge-discovery and data-mining community.

In this article, we survey and evaluate a number of techniques that address these three main issues concurrently. Our proposed methods of combining multiple learned fraud detectors under a "cost model" are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.

Our approach

In today's increasingly electronic society and with the rapid advances of electronic commerce on the Internet, the use of credit cards for purchases has become convenient and necessary. Credit card transactions have become the de facto standard for Internet and Web-based e-commerce. The US government estimates that credit cards accounted for approximately US $13 billion in Internet sales during 1998. This figure is expected to grow rapidly each year. However, the growing number of credit card transactions provides more opportunity for thieves to steal credit card numbers and subsequently commit fraud. When banks lose money because of credit card fraud, cardholders pay for all of that loss through higher interest rates, higher fees, and reduced benefits. Hence, it is in both the banks' and the cardholders' interest to reduce illegitimate use of credit cards by early fraud detection. For many years, the credit card industry has studied computing models for automated detection systems; recently, these models have been the subject of academic research, especially with respect to e-commerce.

The credit card fraud-detection domain presents a number of challenging issues for data mining:

• There are millions of credit card transactions processed each day. Mining such massive amounts of data requires highly efficient techniques that scale.

• The data are highly skewed—many more transactions are legitimate than fraudulent. Typical accuracy-based mining techniques can generate highly accurate fraud detectors by simply predicting that all transactions are legitimate, although this is equivalent to not detecting fraud at all.

• Each transaction record has a different dollar amount and thus has a variable potential loss, rather than a fixed misclassification cost per error type, as is commonly assumed in cost-based mining techniques.

Our approach addresses the efficiency and scalability issues in several ways. We divide a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets, apply mining techniques to generate classifiers in parallel, and combine the resultant base models by metalearning from the classifiers' behavior to generate a metaclassifier.1 Our approach treats the classifiers as black boxes so that we can employ a variety of learning algorithms. Besides extensibility, combining multiple models computed over all available data produces metaclassifiers that can offset the loss of predictive performance that usually occurs when mining from data subsets or sampling. Furthermore, when we use the learned classifiers (for example, during transaction authorization), the base classifiers can execute in parallel, with the metaclassifier then combining their results. So, our approach is highly efficient in generating these models and also relatively efficient in applying them.
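To make the divide-and-learn step concrete, the sketch below trains base classifiers on disjoint subsets in parallel and collects them for later metalearning. This is a minimal illustration, not the authors' JAM implementation; the scikit-learn learners, the subset count, and the helper names are assumptions.

```python
# Minimal sketch of the divide/learn-in-parallel step (not the JAM system itself).
# Assumes scikit-learn-style learners and an in-memory dataset; names are illustrative.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def split_into_subsets(X, y, n_subsets, seed=0):
    """Randomly partition the labeled transactions into n_subsets pieces."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    return [(X[part], y[part]) for part in np.array_split(idx, n_subsets)]


def train_one(args):
    learner, X_sub, y_sub = args
    return clone(learner).fit(X_sub, y_sub)


def train_base_classifiers(X, y, learners, n_subsets=4):
    """Train every learner on every subset; the jobs can run on separate processors."""
    subsets = split_into_subsets(X, y, n_subsets)
    jobs = [(learner, Xs, ys) for learner in learners for Xs, ys in subsets]
    with ProcessPoolExecutor() as pool:
        return list(pool.map(train_one, jobs))


if __name__ == "__main__":
    # Toy data standing in for labeled transactions (1 = fraud, 0 = legitimate).
    X = np.random.rand(1000, 20)
    y = (np.random.rand(1000) < 0.2).astype(int)
    base_classifiers = train_base_classifiers(
        X, y, learners=[DecisionTreeClassifier(), GaussianNB()])
    print(f"trained {len(base_classifiers)} base classifiers")
```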

Another parallel approach focuses on parallelizing a particular algorithm on a particular parallel architecture.


The AdaCost algorithm

One of the most important results of our experimental work on this domain was the realization that each of the base-learning algorithms employed in our experiments utilized internal heuristics based on training accuracy, not cost. This leads us to investigate new algorithms that employ internal metrics of misclassification cost when computing hypotheses to predict fraud.

Here, we describe AdaCost1 (a variant of AdaBoost2,3), which reduces both fixed and variable misclassification costs more significantly than AdaBoost. We follow the generalized analysis of AdaBoost by Robert Schapire and Yoram Singer.3 Figure A shows the algorithm. Let S = ((x1, c1, y1), …, (xm, cm, ym)) be a sequence of training examples, where each instance xi belongs to a domain X, each cost factor ci belongs to the nonnegative real domain R+, and each label yi belongs to a finite label space Y. We focus only on binary classification problems in which Y = {−1, +1}. h is a weak hypothesis of the form h: X → R. The sign of h(x) is interpreted as the predicted label, and the magnitude |h(x)| is the "confidence" in this prediction. Let t be an index for the round of boosting and Dt(i) be the weight given to (xi, ci, yi) at the tth round; 0 ≤ Dt(i) ≤ 1 and Σi Dt(i) = 1. αt is the parameter chosen as the weight for weak hypothesis ht at the tth round; we assume αt > 0. β(sign(yi ht(xi)), ci) is a cost-adjustment function with two arguments: sign(yi ht(xi)), which shows whether ht(xi) is correct, and the cost factor ci.

The difference between AdaCost and AdaBoost is the additional cost-adjustment function β(sign(yi ht(xi)), ci) in the weight-updating rule.

Where it is clear in context, we use either β(i) or β(ci) as shorthand for β(sign(yi ht(xi)), ci). Furthermore, we use β+ when sign(yi ht(xi)) = +1 and β− when sign(yi ht(xi)) = −1. For an instance with a higher cost factor, β(i) increases its weight "more" if the instance is misclassified, but decreases its weight "less" otherwise. Therefore, we require β−(ci) to be nondecreasing with respect to ci, β+(ci) to be nonincreasing, and both to be nonnegative. We proved that AdaCost reduces cost on the training data.1

Logically, we can assign a cost factor c of tranamt − overhead to frauds and a factor c of overhead to nonfrauds. This reflects how the prediction errors will add to the total cost of a hypothesis. Because the actual overhead is a closely guarded trade secret and is unknown to us, we chose to set overhead ∈ {60, 70, 80, 90} and ran four sets of experiments. We normalized each ci to [0, 1]. The cost-adjustment function β is chosen as β−(c) = 0.5c + 0.5 and β+(c) = −0.5c + 0.5.

As in previous experiments, we use training data from one month and data from two months later for testing. Our data set let us form 10 such pairs of training and test sets. We ran both AdaBoost and AdaCost to the 50th round. We used Ripper as the "weak" learner because it provides an easy way to change the training set's distribution. Because using the training set alone usually overestimates a rule set's accuracy, we used the Laplace estimate to generate the confidence for each rule.

We are interested in comparing Ripper (as a baseline), AdaCost, and AdaBoost along several dimensions. First, for each data set and cost model, we determine which algorithm has achieved the lowest cumulative misclassification cost; we wish to know in how many cases AdaCost is the clear winner. Second, we seek to know, quantitatively, how the cumulative misclassification cost of AdaCost differs from that of AdaBoost and the baseline Ripper. It is interesting to measure the significance of these differences in terms of both the reduction in misclassification loss and the percentage of reduction. Finally, we are interested to know whether AdaCost requires more computing power.

Figure B plots the results from the Chase Bank's credit card data. Figure B1 shows the average reduction over 10 months in percentage cumulative loss (defined as cumulative loss / (maximal loss − least loss) × 100%) for AdaBoost and AdaCost for all 50 rounds with an overhead of $60 (results for other overheads appear in other work1). We can clearly see that there is a consistent reduction. The absolute amount of reduction is around 3%.

We also observe that the speed of reduction by AdaCost is quicker than that of AdaBoost. The speed is highest in the first few rounds. This means that in practice, we might not need to run AdaCost for many rounds. Figure B2 plots the cumulative cost of AdaCost against that of AdaBoost. We have plotted the results of all 10 pairs of training and test months over all rounds and overheads. Most of the points are above the y = x line in Figure B2, implying that AdaCost has lower cumulative loss in an overwhelming number of cases.

References

1. W. Fan et al., "AdaCost: Misclassification Cost-Sensitive Boosting," Proc. 16th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1999, pp. 97–105.
2. Y. Freund and R. Schapire, "Experiments with a New Boosting Algorithm," Proc. 13th Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 148–156.
3. R. Schapire and Y. Singer, "Improved Boosting Algorithms Using Confidence-Rated Predictions," Proc. 11th Conf. Computational Learning Theory, ACM Press, New York, 1998.

Figure A. AdaCost.

Given: (x1, c1, y1), …, (xm, cm, ym), where xi ∈ X, ci ∈ R+, and yi ∈ {−1, +1}.
Initialize D1(i) (such as D1(i) = ci / Σj cj).
For t = 1, …, T:
1. Train the weak learner using distribution Dt.
2. Compute weak hypothesis ht: X → R.
3. Choose αt ∈ R and β(i) ∈ R+.
4. Update
   Dt+1(i) = Dt(i) exp(−αt yi ht(xi) β(sign(yi ht(xi)), ci)) / Zt,
   where β(sign(yi ht(xi)), ci) is a cost-adjustment function and Zt is a normalization factor chosen so that Dt+1 will be a distribution.
Output the final hypothesis:
   H(x) = sign(f(x)), where f(x) = Σt αt ht(x).
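A compact sketch of the weight-update loop in Figure A follows. It is illustrative rather than the authors' implementation: Ripper is replaced by a decision stump via scikit-learn, the β functions are the ones given above, and the choice of αt uses the standard confidence-rated AdaBoost formula, which is an assumption on our part.

```python
# Illustrative AdaCost loop following Figure A; not the authors' original code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def beta(correct, c):
    """Cost-adjustment function: beta+ for correct predictions, beta- for errors."""
    return np.where(correct, -0.5 * c + 0.5, 0.5 * c + 0.5)


def adacost(X, y, c, rounds=50):
    """y in {-1,+1}; c is the normalized cost factor in [0,1] per example."""
    D = c / c.sum()                       # initialize D1(i) proportional to cost
    hypotheses, alphas = [], []
    for _ in range(rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)               # sign of the weak hypothesis
        b = beta(pred == y, c)
        u = y * pred * b                  # y_i * h_t(x_i) * beta(i)
        r = np.clip(np.sum(D * u), -0.999, 0.999)
        alpha = 0.5 * np.log((1 + r) / (1 - r))   # an assumed choice of alpha_t
        if alpha <= 0:                    # no useful weak hypothesis left
            break
        D = D * np.exp(-alpha * u)        # weight-update rule of Figure A
        D /= D.sum()                      # normalize (the Z_t factor)
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas


def predict(hypotheses, alphas, X):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    f = sum(a * h.predict(X) for h, a in zip(hypotheses, alphas))
    return np.sign(f)
```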



However, a new algorithm or architecture requires a substantial amount of parallel-programming work. Although our architecture- and algorithm-independent approach is not as efficient as some fine-grained parallelization approaches, it lets users plug different off-the-shelf learning programs into a parallel and distributed environment with relative ease and eliminates the need for expensive parallel hardware.

Furthermore, because our approach could generate a potentially large number of classifiers from the concurrently processed data subsets, and therefore potentially require more computational resources during detection, we investigate pruning methods that identify redundant classifiers and remove them from the ensemble without significantly degrading the predictive performance. This pruning technique increases the learned detectors' computational performance and throughput.

The issue of skewed distributions has not been studied widely because many of the data sets used in research do not exhibit this characteristic. We address skewness by partitioning the data set into subsets with a desired distribution, applying mining techniques to the subsets, and combining the mined classifiers by metalearning (as we have already discussed). Other researchers attempt to remove unnecessary instances from the majority class—instances that are in the borderline region (noise or redundant exemplars) are candidates for removal. In contrast, our approach keeps all the data for mining and does not change the underlying mining algorithms.

We address the issue of nonuniform cost by developing the appropriate cost model for the credit card fraud domain and biasing our methods toward reducing cost. This cost model determines the desired distribution just mentioned. AdaCost (a cost-sensitive version of AdaBoost) relies on the cost model for updating weights in the training distribution. (For more on AdaCost, see the "AdaCost algorithm" sidebar.) Naturally, this cost model also defines the primary evaluation criterion for our techniques. Furthermore, we investigate techniques to improve the cost performance of a bank's fraud detector by importing remote classifiers from other banks and combining this remotely learned knowledge with locally stored classifiers. The law and competitive concerns restrict banks from sharing information about their customers with other banks. However, they may share black-box fraud-detection models. Our distributed data-mining approach provides a direct and efficient solution to sharing knowledge without sharing data. We also address possible incompatibility of data schemata among different banks.

We designed and developed an agent-based distributed environment to demonstrate our distributed and parallel data-mining techniques. The JAM (Java Agents for Metalearning) system not only provides distributed data-mining capabilities, it also lets users monitor and visualize the various learning agents and derived models in real time. Researchers have studied a variety of algorithms and techniques for combining multiple computed models. The JAM system provides generic features to easily implement any of these combining techniques (as well as a large collection of base-learning algorithms), and it has been broadly available for use. The JAM system is available for download at http://www.cs.columbia.edu/~sal/JAM/PROJECT.2


[Figure B. Cumulative cost versus rounds (1); AdaBoost versus AdaCost in cumulative cost (2). Panel 1 plots percentage cumulative misclassification cost against boosting rounds for AdaBoost and AdaCost with overhead = $60; panel 2 plots AdaBoost cumulative cost against AdaCost cumulative cost with a y = x reference line.]


Credit card data and cost models

Chase Bank and First Union Bank, members of the Financial Services Technology Consortium (FSTC), provided us with real credit card data for this study. The two data sets contain credit card transactions labeled as fraudulent or legitimate. Each bank supplied 500,000 records spanning one year, with a 20% fraud and 80% nonfraud distribution for Chase Bank and 15% versus 85% for First Union Bank. In practice, fraudulent transactions are much less frequent than the 15% to 20% observed in the data given to us. These data might have been cases where the banks have difficulty in determining legitimacy correctly. In some of our experiments, we deliberately create more skewed distributions to evaluate the effectiveness of our techniques under more extreme conditions.

Bank personnel developed the schemata of the databases over years of experience and continuous analysis to capture important information for fraud detection. We cannot reveal the details of the schema beyond what we have described elsewhere.2 The records of one schema have a fixed length of 137 bytes each and about 30 attributes, including the binary class label (fraudulent/legitimate transaction). Some fields are numeric and the rest categorical. Because account identification is not present in the data, we cannot group transactions into accounts. Therefore, instead of learning behavior models of individual customer accounts, we build overall models that try to differentiate legitimate transactions from fraudulent ones. Our models are customer-independent and can serve as a second line of defense, the first being customer-dependent models.

Most machine-learning literature concentrates on model accuracy (either training error or generalization error on hold-out test data, computed as overall accuracy, true-positive or false-positive rates, or return-on-cost analysis). This domain provides a considerably different metric to evaluate the learned models' performance—models are evaluated and rated by a cost model. Due to the different dollar amount of each credit card transaction and other factors, the cost of failing to detect a fraud varies with each transaction. Hence, the cost model for this domain relies on the sum and average of loss caused by fraud. We define

CumulativeCost = Σi Cost(i)  and  AverageCost = CumulativeCost / n,

where Cost(i) is the cost associated with transaction i, and n is the total number of transactions.

After consulting with a bank representative, we jointly settled on a simplified cost model that closely reflects reality. Because it takes time and personnel to investigate a potentially fraudulent transaction, each investigation incurs an overhead. Other related costs—for example, the operational resources needed for the fraud-detection system—are consolidated into overhead. So, if the amount of a transaction is smaller than the overhead, investigating the transaction is not worthwhile even if it is suspicious. For example, if it takes $10 to investigate a potential loss of $1, it is more economical not to investigate it. Therefore, assuming a fixed overhead, we devised the cost model shown in Table 1 for each transaction, where tranamt is the amount of a credit card transaction. The overhead threshold, for obvious reasons, is a closely guarded secret and varies over time. The range of values used here probably serves as a reasonable bound for this data set, although the actual overhead is probably significantly lower. We evaluated all our empirical studies using this cost model.

Skewed distributions

Given a skewed distribution, we would like to generate a training set of labeled transactions with a desired distribution, without removing any data, that maximizes classifier performance. In this domain, we found that determining the desired distribution is an experimental art and requires extensive empirical tests to find the most effective training distribution.

In our approach, we first create data subsets with the desired distribution (determined by extensive sampling experiments). Then we generate classifiers from these subsets and combine them by metalearning from their classification behavior. For example, if the given skewed distribution is 20:80 and the desired distribution for generating the best models is 50:50, we randomly divide the majority instances into four partitions and form four data subsets by merging the minority instances with each of the four partitions containing majority instances. That is, the minority instances replicate across the four data subsets to generate the desired 50:50 distribution for each distributed training set.

For concreteness, let N be the size of the data set with a distribution of x:y (x is the percentage of the minority class) and u:v be the desired distribution. The number of minority instances is N × x, and the desired number of majority instances in a subset is Nx × v/u. The number of subsets is the number of majority instances (N × y) divided by the number of desired majority instances in each subset, which is Ny divided by Nxv/u, or y/x × u/v. So we have y/x × u/v subsets, each of which has Nx minority instances and Nxv/u majority instances.
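As a quick illustration of this arithmetic, the fragment below computes the number of subsets and assembles each one by pairing all minority instances with a different slice of the majority instances; the 20:80 to 50:50 case yields the four subsets mentioned above. It is a sketch of the stated formula with assumed helper names, not the authors' exact code.

```python
# Sketch of forming training subsets with a desired class distribution.
# With x:y = 20:80 and u:v = 50:50 this produces (80/20) * (50/50) = 4 subsets.
import numpy as np


def make_subsets(X, y, minority_label=1, u=0.5, v=0.5, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = rng.permutation(np.where(y != minority_label)[0])
    # Desired number of majority instances per subset: (N * x) * (v / u).
    per_subset = int(round(len(minority) * v / u))
    n_subsets = max(1, len(majority) // per_subset)   # roughly y/x * u/v
    subsets = []
    for part in np.array_split(majority, n_subsets):
        idx = np.concatenate([minority, part])        # minority replicated in each subset
        subsets.append((X[idx], y[idx]))
    return subsets
```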

The next step is to apply a learning algorithm or algorithms to each subset. Because the learning processes on the subsets are independent, the subsets can be distributed to different processors and each learning process can run in parallel. For massive amounts of data, our approach can substantially improve speed for superlinear-time learning algorithms. The generated classifiers are combined by metalearning from their classification behavior. We have described several metalearning strategies elsewhere.1

To simplify our discussion, we only describe the class-combiner (or stacking) strategy.3 This strategy composes a metalevel training set by using the base classifiers' predictions on a validation set as attribute values and the actual classification as the class label. This training set then serves for training a metaclassifier. For integrating subsets, the class-combiner strategy is more effective than voting-based techniques. When the learned models are used during online fraud detection, transactions feed into the learned base classifiers and the metaclassifier then combines their predictions. Again, the base classifiers are independent and can execute in parallel on different processors. In addition, our approach can prune redundant base classifiers without affecting the cost performance, making it relatively efficient in the credit card authorization process.
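The class-combiner strategy can be sketched as follows: each base classifier labels a held-out validation set, those predictions become the metalevel attributes, and a metalearner (naive Bayes here, matching the choice reported in the experiments below) is trained on them. The scikit-learn calls and function names are assumptions for illustration, not the JAM implementation.

```python
# Minimal class-combiner (stacking) sketch; illustrative only.
import numpy as np
from sklearn.naive_bayes import GaussianNB


def train_class_combiner(base_classifiers, X_val, y_val):
    """Metalevel training set: base predictions on a validation set + true labels."""
    meta_X = np.column_stack([clf.predict(X_val) for clf in base_classifiers])
    return GaussianNB().fit(meta_X, y_val)


def combined_predict(base_classifiers, metaclassifier, X):
    """At authorization time the base classifiers could run in parallel."""
    meta_X = np.column_stack([clf.predict(X) for clf in base_classifiers])
    return metaclassifier.predict(meta_X)
```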



Table 1. Cost model assuming a fixed overhead.

OUTCOME                           COST
Miss (false negative, FN)         tranamt
False alarm (false positive, FP)  overhead if tranamt > overhead; 0 if tranamt ≤ overhead
Hit (true positive, TP)           overhead if tranamt > overhead; tranamt if tranamt ≤ overhead
Normal (true negative, TN)        0
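A direct transcription of Table 1 into code looks like the following; tranamt and overhead are in dollars, and the function names are illustrative rather than taken from the authors' system.

```python
# Per-transaction cost from Table 1, given a prediction and the true label.
def transaction_cost(predicted_fraud, actual_fraud, tranamt, overhead):
    if actual_fraud and not predicted_fraud:              # miss (FN)
        return tranamt
    if not actual_fraud and predicted_fraud:              # false alarm (FP)
        return overhead if tranamt > overhead else 0.0
    if actual_fraud and predicted_fraud:                  # hit (TP)
        return overhead if tranamt > overhead else tranamt
    return 0.0                                            # normal (TN)


def cumulative_cost(predictions, labels, amounts, overhead=75.0):
    """Returns (CumulativeCost, AverageCost) as defined in the text."""
    costs = [transaction_cost(p, a, amt, overhead)
             for p, a, amt in zip(predictions, labels, amounts)]
    return sum(costs), sum(costs) / len(costs)
```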


Experiments and results. To evaluate our multiclassifier class-combiner approach to skewed class distributions, we performed a set of experiments using the credit card fraud data from Chase.4 We used transactions from the first eight months (10/95–5/96) for training, the ninth month (6/96) for validating, and the twelfth month (9/96) for testing. (Because credit card transactions have a natural two-month business cycle—the time to bill a customer is one month, followed by a one-month payment period—the true label of a transaction cannot be determined in less than two months' time. Hence, models built from data in one month cannot be rationally applied for fraud detection in the next month. We therefore test our models on data that is at least two months newer.)

Based on the empirical results from the effects of class distributions, the desired distribution is 50:50. Because the given distribution is 20:80, four subsets are generated from each month, for a total of 32 subsets. We applied four learning algorithms (C4.5, CART, Ripper, and Bayes) to each subset and generated 128 base classifiers. Based on our experience with training metaclassifiers, Bayes is generally more effective and efficient, so it is the metalearner for all the experiments reported here.

Furthermore, to investigate whether our approach is indeed fruitful, we ran experiments on the class-combiner strategy directly applied to the original data sets from the first eight months (that is, with the given 20:80 distribution). We also evaluated how individual classifiers generated from each month perform without class-combining.

Table 2 shows the cost and savings from the class-combiner strategy using the 50:50 distribution (128 base classifiers), the average of individual CART classifiers generated using the desired distribution (10 classifiers), class-combiner using the given distribution (32 base classifiers—8 months × 4 learning algorithms), and the average of individual classifiers using the given distribution (the average of 32 classifiers). (We did not perform experiments on simply replicating the minority instances to achieve 50:50 in one single data set because this approach increases the training-set size and is not appropriate in domains with large amounts of data—one of the three primary issues we address here.) Compared to the other three methods, class-combining on subsets with a 50:50 fraud distribution clearly achieves a significant increase in savings—at least $110,000 for the month (6/96). When the overhead is $50, more than half of the losses were prevented.

Table 2. Cost and savings in the credit card fraud domain using class-combiner (cost ± 95% confidence interval).

METHOD                     FRAUD (%)   OVERHEAD = $50             OVERHEAD = $75             OVERHEAD = $100
                                       COST / SAVED (%) / ($K)    COST / SAVED (%) / ($K)    COST / SAVED (%) / ($K)
Class combiner             50          17.96 ± .14 / 51 / 761     20.07 ± .13 / 46 / 676     21.87 ± .12 / 41 / 604
Single CART                50          20.81 ± .75 / 44 / 647     23.64 ± .96 / 36 / 534     26.05 ± 1.25 / 30 / 437
Class combiner             Given       22.61 / 39 / 575           23.99 / 35 / 519           25.20 / 32 / 471
Average single classifier  Given       27.97 ± 1.64 / 24 / 360    29.08 ± 1.60 / 21 / 315    30.02 ± 1.56 / 19 / 278
COTS                       N/A         25.44 / 31 / 461           26.40 / 29 / 423           27.24 / 26 / 389

Surprisingly, we also observe that when the overhead is $50, a classifier (Single CART) trained from one month's data with the desired 50:50 distribution (generated by ignoring some data) achieved significantly more savings than combining classifiers trained from all eight months' data with the given distribution. This reaffirms the importance of employing the appropriate training class distribution in this domain. Class-combiner also contributed to the performance improvement. Consequently, utilizing the desired training distribution and class-combiner provides a synergistic approach to data mining with nonuniform class and cost distributions. Perhaps more importantly, how do our techniques perform compared to the bank's existing fraud-detection system? We label the current system "COTS" (commercial off-the-shelf system) in Table 2. COTS achieved significantly less savings than our techniques for the three overhead amounts we report in this table.

This comparison might not be entirely accurate because COTS has much more training data than we have and it might be optimized to a different cost model (which might even be the simple error rate). Furthermore, unlike COTS, our techniques are general for problems with skewed distributions and do not utilize any domain knowledge in detecting credit card fraud—the only exception is the cost model used for evaluation and search guidance. Nevertheless, COTS' performance on the test data provides some indication of how the existing fraud-detection system behaves in the real world.

We also evaluated our method with more skewed distributions (by downsampling minority instances): 10:90, 1:99, and 1:999. As we discussed earlier, the desired distribution is not necessarily 50:50—for instance, the desired distribution is 30:70 when the given distribution is 10:90. With 10:90 distributions, our method reduced the cost significantly more than COTS. With 1:99 distributions, our method did not outperform COTS. Neither method achieved any savings with 1:999 distributions.

To characterize the conditions when our techniques are effective, we calculate R, the ratio of the overhead amount to the average cost: R = Overhead / AverageCost. Our approach is significantly more effective than the deployed COTS when R < 6. Neither method is effective when R > 24. So, under a reasonable cost model with a fixed overhead cost for challenging transactions as potentially fraudulent, when the number of fraudulent transactions is a very small percentage of the total, it is financially undesirable to detect fraud. The loss due to this fraud is yet another cost of conducting business.

However, filtering out "easy" (or low-risk) transactions (the data we received were possibly filtered by a similar process) can reduce a high overhead-to-loss ratio. The filtering process can use fraud detectors that are built based on individual customer profiles, which are now in use by many credit card companies. These individual profiles characterize the customers' purchasing behavior. For example, if a customer regularly buys groceries at a particular supermarket or has set up a monthly payment for phone bills, these transactions are close to no risk; hence, purchases with similar characteristics can be safely authorized without further checking. Reducing the overhead through streamlining business operations and increased automation will also lower the ratio.
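A profile-based filter of the kind described might look like the sketch below; the profile structure, merchant names, and thresholds are entirely hypothetical and serve only to show how low-risk transactions could be authorized without invoking the fraud detector.

```python
# Hypothetical low-risk filter built from per-customer profiles (illustrative only).
from dataclasses import dataclass, field


@dataclass
class CustomerProfile:
    # Merchants this customer uses routinely, mapped to a typical maximum amount.
    routine_merchants: dict = field(default_factory=dict)


def is_low_risk(profile, merchant, amount, slack=1.2):
    """Authorize without further checks if the purchase matches routine behavior."""
    usual_max = profile.routine_merchants.get(merchant)
    return usual_max is not None and amount <= usual_max * slack


# Example: regular grocery purchases pass straight through; unusual ones are scored.
profile = CustomerProfile(routine_merchants={"local-supermarket": 150.0})
print(is_low_risk(profile, "local-supermarket", 92.50))   # True -> skip fraud scoring
print(is_low_risk(profile, "overseas-electronics", 900))  # False -> send to detector
```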

Knowledge sharing through bridging

Much of the prior work on combining multiple models assumes that all models originate from different (not necessarily distinct) subsets of a single data set as a means to increase accuracy (for example, by imposing probability distributions over the instances of the training set, or by stratified sampling, subsampling, and so forth) and not as a means to integrate distributed information. Although the JAM system addresses the latter problem by employing metalearning techniques, integrating classification models derived from distinct and distributed databases might not always be feasible.

In all cases considered so far, all classification models are assumed to originate from databases of identical schemata. Because classifiers depend directly on the underlying data's format, minor differences in the schemata between databases produce incompatible classifiers—that is, a classifier cannot be applied to data of a different format. Yet these classifiers may target the same concept. We seek to bridge these disparate classifiers in some principled fashion.

The banks seek to be able to exchange their classifiers and thus incorporate useful information in their systems that would otherwise be inaccessible to both. Indeed, for each credit card transaction, both institutions record similar information; however, they also include specific fields containing important information that each has acquired separately and that provides predictive value in determining fraudulent transaction patterns. To facilitate the exchange of knowledge (represented as remotely learned models) and take advantage of incompatible and otherwise useless classifiers, we need to devise methods that bridge the differences imposed by the different schemata.

Database compatibility. The incompatible schema problem impedes JAM from taking advantage of all available databases. Let's consider two data sites A and B with databases DBA and DBB, having similar but not identical schemata. Without loss of generality, we assume that

Schema(DBA) = {A1, A2, …, An, An+1, C}
Schema(DBB) = {B1, B2, …, Bn, Bn+1, C}

where Ai and Bi denote the ith attribute of DBA and DBB, respectively, and C the class label (for example, the fraud/legitimate label in the credit card fraud illustration) of each instance. Without loss of generality, we further assume that Ai = Bi, 1 ≤ i ≤ n. As for the An+1 and Bn+1 attributes, there are two possibilities:

1. An+1 ≠ Bn+1: The two attributes are of entirely different types drawn from distinct domains. The problem can then be reduced to two dual problems where one database has one more attribute than the other; that is,

Schema(DBA) = {A1, A2, …, An, An+1, C}
Schema(DBB) = {B1, B2, …, Bn, C}

where we assume that attribute Bn+1 is not present in DBB. (The dual problem has DBB composed with Bn+1, but An+1 is not available to A.)

2. An+1 ≈ Bn+1: The two attributes are of similar type but slightly different semantics—that is, there might be a map from the domain of one type to the domain of the other. For example, An+1 and Bn+1 are fields with time-dependent information but of different duration (that is, An+1 might denote the number of times an event occurred within a window of half an hour and Bn+1 might denote the number of times the same event occurred but within 10 minutes).

In both cases (attribute An+1 is either not present in DBB or semantically different from the corresponding Bn+1), the classifiers CAj derived from DBA are not compatible with DBB's data and therefore cannot be directly used at DBB's site, and vice versa. But the purpose of using a distributed data-mining system and deploying learning agents and metalearning their classifier agents is to be able to combine information from different sources.

Bridging methods. There are two methods for handling the missing attributes.5

Method I: Learn a local model for the missing attribute and exchange. Database DBB imports, along with the remote classifier agent, a bridging agent from database DBA that is trained to compute values for the missing attribute An+1 in DBB's data. Next, DBB applies the bridging agent on its own data to estimate the missing values. In many cases, however, this might not be possible or desirable by DBA (for example, in case the attribute is proprietary). The alternative for database DBB is to simply add the missing attribute to its data set and fill it with null values. Even though the missing attribute might have high predictive value for DBA, it is of no value to DBB. After all, DBB did not include it in its schema, and presumably other attributes (including the common ones) have predictive value.
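Method I can be sketched as follows: site A trains a regressor on its own data to predict the attribute that site B lacks from the attributes the two schemata share, and site B applies that bridging agent to fill in the column before running A's classifier. The regressor choice, attribute ordering, and function names are assumptions for illustration, not the JAM agents themselves.

```python
# Sketch of a Method I bridging agent (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def train_bridging_agent(X_common_A, missing_attr_A):
    """At site A: learn to predict attribute A_{n+1} from the shared attributes."""
    return DecisionTreeRegressor(max_depth=5).fit(X_common_A, missing_attr_A)


def apply_remote_classifier(remote_clf, bridge, X_common_B):
    """At site B: estimate the missing column, append it, then use A's classifier.

    Assumes remote_clf was trained on [common attributes, A_{n+1}] in that order.
    """
    estimated = bridge.predict(X_common_B).reshape(-1, 1)
    X_bridged = np.hstack([X_common_B, estimated])
    return remote_clf.predict(X_bridged)
```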

Method II: Learn a local model without the missing attribute and exchange. In this approach, database DBA can learn two local models: one with the attribute An+1 that can be used locally by the metalearning agents and one without it that can be subsequently exchanged. Learning a second classifier agent without the missing attribute, or with the attributes that belong to the intersection of the attributes of the two databases' data sets, implies that the second classifier uses only the attributes that are common among the participating sites, so no issue exists for its integration at other databases. But remote classifiers imported by database DBA (and assured not to involve predictions over the missing attributes) can still be locally integrated with the original model that employs An+1. In this case, the remote classifiers simply ignore the local data set's missing attributes.

Both approaches address the incompatible schema problem, and metalearning over these models should proceed in a straightforward manner. Compared to Zbigniew Ras's related work,6 our approach is more general because our techniques support both categorical and continuous attributes and are not limited to a specific syntactic case or purely logical consistency of generated rule models. Instead, these bridging techniques can employ machine- or statistical-learning algorithms to compute arbitrary models for the missing values.

Pruning

An ensemble of classifiers can be unnecessarily complex, meaning that many classifiers might be redundant, wasting resources and reducing system throughput. (Throughput here denotes the rate at which a metaclassifier can pipe through and label a stream of data items.) We study the efficiency of metaclassifiers by investigating the effects of pruning (discarding certain base classifiers) on their performance. Determining the optimal set of classifiers for metalearning is a combinatorial problem. Hence, the objective of pruning is to utilize heuristic methods to search for partially grown metaclassifiers (metaclassifiers with pruned subtrees) that are more efficient and scalable and at the same time achieve comparable or better predictive performance than fully grown (unpruned) metaclassifiers. To this end, we introduced two stages for pruning metaclassifiers: the pretraining and posttraining pruning stages. Both levels are essential and complementary to each other with respect to the improvement of the system's accuracy and efficiency.

Pretraining pruning refers to the filtering of the classifiers before they are combined. Instead of combining classifiers in a brute-force manner, we introduce a preliminary stage for analyzing the available classifiers and qualifying them for inclusion in a combined metaclassifier. Only those classifiers that appear (according to one or more predefined metrics) to be most promising participate in the final metaclassifier. Here, we adopt a black-box approach that evaluates the set of classifiers based only on their input and output behavior, not their internal structure. Conversely, posttraining pruning denotes the evaluation and pruning of constituent base classifiers after a complete metaclassifier has been constructed. We have implemented and experimented with three pretraining and two posttraining pruning algorithms, each with different search heuristics.

The first pretraining pruning algorithm ranks and selects its classifiers by evaluating each candidate classifier independently (metric-based), and the second algorithm decides by examining the classifiers in correlation with each other (diversity-based). The third relies on the independent performance of the classifiers and the manner in which they predict with respect to each other and with respect to the underlying data set (coverage- and specialty-based). The first posttraining pruning algorithm is based on a cost-complexity pruning technique (a technique the CART decision-tree learning algorithm uses that seeks to minimize the cost and size of its tree while reducing the misclassification rate). The second is based on the correlation between the classifiers and the metaclassifier.7
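As one concrete illustration of pretraining pruning, the sketch below performs a simple metric-based selection: base classifiers are ranked by the savings they achieve on a validation set under the cost model of Table 1, and only the top k are kept for metalearning. It is a simplified stand-in for the algorithms just described, not their actual implementations, and it reuses the cumulative_cost helper from the earlier cost-model sketch.

```python
# Metric-based pretraining pruning sketch: keep the k most cost-effective classifiers.
def prune_by_savings(base_classifiers, X_val, y_val, amounts, overhead, k):
    """Rank by savings = maximal possible loss minus cost incurred on the validation set."""
    max_loss = sum(amt for amt, fraud in zip(amounts, y_val) if fraud)
    scored = []
    for clf in base_classifiers:
        preds = clf.predict(X_val)
        total, _ = cumulative_cost(preds, y_val, amounts, overhead)  # from earlier sketch
        scored.append((max_loss - total, clf))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [clf for _, clf in scored[:k]]
```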

Compared to Dragos Margineantu and Thomas Dietterich's approach,8 ours considers the more general setting where ensembles of classifiers can be obtained by applying possibly different learning algorithms over (possibly) distinct databases. Furthermore, instead of voting (such as AdaBoost) over the predictions of classifiers for the final classification, we use metalearning to combine the individual classifiers' predictions.

Evaluation of knowledge sharing and pruning. First, we distributed the data sets across six different data sites (each site stores two months of data) and prepared the set of candidate base classifiers—that is, the original set of base classifiers the pruning algorithm is called to evaluate. We computed these classifiers by applying five learning algorithms (Bayes, C4.5, ID3, CART, and Ripper) to each month of data, creating 60 base classifiers (10 classifiers per data site). Next, we had each data site import the remote base classifiers (50 in total) that we subsequently used in the pruning and metalearning phases, thus ensuring that each classifier would not be tested unfairly on known data. Specifically, we had each site use half of its local data (one month) to test, prune, and metalearn the base classifiers and the other half to evaluate the pruned and unpruned metaclassifier's overall performance.7 In essence, this experiment's setting corresponds to a parallel sixfold cross-validation.

Finally, we had the two simulated banks exchange their classifier agents. In addition to its 10 local and 50 internal classifiers (those imported from its peer data sites), each site also imported 60 external classifiers (from the other bank). Thus, we populated each simulated Chase data site with 60 (10 + 50) Chase classifiers and 60 First Union classifiers, and we populated each First Union site with 60 (10 + 50) First Union classifiers and 60 Chase classifiers. Again, the sites used half of their local data (one month) to test, prune, and metalearn the base classifiers and the other half to evaluate the pruned or unpruned metaclassifier's overall performance. To ensure fairness, we did not use the 10 local classifiers in metalearning.

The two databases, however, had the following schema differences:

• Chase and First Union defined a (nearly identical) feature with different semantics.

• Chase includes two (continuous) features not present in the First Union data.

For the first incompatibility, we mapped the First Union data values to the semantics of the Chase data. For the second incompatibility, we deployed bridging agents to compute the missing values.5 When predicting, the First Union classifiers simply disregarded the real values provided at the Chase data sites, while the Chase classifiers relied on both the common attributes and the predictions of the bridging agents to deliver a prediction at the First Union data sites.

Table 3 summarizes our results for the Chase and First Union banks, displaying the savings for each fraud predictor examined. The column denoted as size indicates the number of base classifiers used in the ensemble classification system. The first row of Table 3 shows the best possible performance of Chase's own COTS authorization and detection system on this data set. The next two rows present the performance of the best base classifiers over the entire set and over a single month's data, while the last four rows detail the performance of the unpruned (size of 50 and 110) and pruned (size of 32 and 63) metaclassifiers. The first two of these metaclassifiers combine only internal (from Chase) base classifiers, while the last two combine both internal and external (from Chase and First Union) base classifiers. We did not use bridging agents in these experiments, because Chase defined all attributes used by the First Union classifier agents.

Table 3 records similar data for the First Union data set, with the exception of First Union's COTS authorization and detection performance (it was not made available to us), and with the additional results obtained when employing special bridging agents from Chase to compute the values of First Union's missing attributes. Most obviously, these experiments show the superior performance of metalearning over the single-model approaches and over the traditional authorization and detection systems (at least for the given data sets). The metaclassifiers outperformed the single base classifiers in every category. Moreover, by bridging the two databases, we managed to further improve the metalearning system's performance.

Table 3. Results on knowledge sharing and pruning.

                                          CHASE                  FIRST UNION
CLASSIFICATION METHOD                     SIZE   SAVINGS ($K)    SIZE   SAVINGS ($K)
COTS scoring system from Chase            N/A    682             N/A    N/A
Best base classifier over entire set      1      762             1      803
Best base classifier over one subset      1      736             1      770
Metaclassifier                            50     818             50     944
Metaclassifier                            32     870             21     943
Metaclassifier (+ First Union)            110    800             N/A    N/A
Metaclassifier (+ First Union)            63     877             N/A    N/A
Metaclassifier (+ Chase − bridging)       N/A    N/A             110    942
Metaclassifier (+ Chase + bridging)       N/A    N/A             110    963
Metaclassifier (+ Chase + bridging)       N/A    N/A             56     962

However, combining classifier agents from the two banks directly (without bridging) is not very effective, no doubt because the attribute missing from the First Union data set is significant in modeling the Chase data set. Hence, the First Union classifiers are not as effective as the Chase classifiers on the Chase data, and the Chase classifiers cannot perform at full strength at the First Union sites without the bridging agents.

This table also shows the invaluable contribution of pruning. In all cases, pruning succeeded in computing metaclassifiers with similar or better fraud-detection capabilities, while reducing their size and thus improving their efficiency. We have provided a detailed description of the pruning methods and a comparative study of predictive performance and metaclassifier throughput elsewhere.7

One limitation of our approach to skewed distributions is the need to run preliminary experiments to determine the desired training distribution based on a defined cost model. This process can be automated, but it is unavoidable because the desired distribution highly depends on the cost model and the learning algorithm. Currently, for simplicity, all the base learners use the same desired distribution; using an individualized training distribution for each base learner could improve performance. Furthermore, because thieves also learn and fraud patterns evolve over time, some classifiers are more relevant than others at a particular time. Therefore, an adaptive classifier-selection method is essential. Unlike a monolithic approach of learning one classifier using incremental learning, our modular multiclassifier approach facilitates adaptation over time and removes out-of-date knowledge.

Our experience in credit card fraud detection has also affected other important applications. Encouraged by our results, we shifted our attention to the growing problem of intrusion detection in network- and host-based computer systems. Here we seek to perform the same sort of task as in the credit card fraud domain: we seek to build models that distinguish between bad (intrusions or attacks) and good (normal) connections or processes. By first applying feature-extraction algorithms, followed by the application of machine-learning algorithms (such as Ripper) to learn and combine multiple models for different types of intrusions, we have achieved remarkably good success in detecting intrusions.9 This work, as well as the results reported in this article, demonstrates convincingly that distributed data-mining techniques that combine multiple models produce effective fraud and intrusion detectors.

Acknowledgments

We thank the past and current participants of this project: Charles Elkan, Wenke Lee, Shelley Tselepis, Alex Tuzhilin, and Junxin Zhang. We also appreciate the data and expertise provided by Chase and First Union for conducting this study. This research was partially supported by grants from DARPA (F30602-96-1-0311) and the NSF (IRI-96-32225 and CDA-96-25374).

References

1. P. Chan and S. Stolfo, "Metalearning for Multistrategy and Parallel Learning," Proc. Second Int'l Workshop Multistrategy Learning, Center for Artificial Intelligence, George Mason Univ., Fairfax, Va., 1993, pp. 150–165.

2. S. Stolfo et al., "JAM: Java Agents for Metalearning over Distributed Databases," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1997, pp. 74–81.

3. D. Wolpert, "Stacked Generalization," Neural Networks, Vol. 5, 1992, pp. 241–259.

4. P. Chan and S. Stolfo, "Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1998, pp. 164–168.

5. A. Prodromidis and S. Stolfo, "Mining Databases with Different Schemas: Integrating Incompatible Classifiers," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1998, pp. 314–318.

6. Z. Ras, "Answering Nonstandard Queries in Distributed Knowledge-Based Systems," Rough Sets in Knowledge Discovery, Studies in Fuzziness and Soft Computing, L. Polkowski and A. Skowron, eds., Physica Verlag, New York, 1998, pp. 98–108.

7. A. Prodromidis and S. Stolfo, "Pruning Metaclassifiers in a Distributed Data Mining System," Proc. First Nat'l Conf. New Information Technologies, Editions of New Tech., Athens, 1998, pp. 151–160.

8. D. Margineantu and T. Dietterich, "Pruning Adaptive Boosting," Proc. 14th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1997, pp. 211–218.

9. W. Lee, S. Stolfo, and K. Mok, "Mining in a Data-Flow Environment: Experience in Network Intrusion Detection," Proc. Fifth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1999, pp. 114–124.

Philip K. Chan is an assistant professor of computer science at the Florida Institute of Technology. His research interests include scalable adaptive methods, machine learning, data mining, distributed and parallel computing, and intelligent systems. He received his PhD, MS, and BS in computer science from Columbia University, Vanderbilt University, and Southwest Texas State University, respectively. Contact him at Computer Science, Florida Tech, Melbourne, FL 32901; [email protected]; www.cs.fit.edu/~pkc.

Wei Fan is a PhD candidate in computer science at Columbia University. His research interests are in machine learning, data mining, distributed systems, information retrieval, and collaborative filtering. He received his MSc and BEng from Tsinghua University and MPhil from Columbia University. Contact him at the Computer Science Dept., Columbia Univ., New York, NY 10027; [email protected]; www.cs.columbia.edu/~wfan.

Andreas Prodromidis is a director in research and development at iPrivacy. His research interests include data mining, machine learning, electronic commerce, and distributed systems. He received a PhD in computer science from Columbia University and a Diploma in electrical engineering from the National Technical University of Athens. He is a member of the IEEE and AAAI. Contact him at iPrivacy, 599 Lexington Ave., #2300, New York, NY 10022; [email protected]; www.cs.columbia.edu/~andreas.

Salvatore J. Stolfo is a professor of computer science at Columbia University and codirector of the USC/ISI and Columbia Center for Applied Research in Digital Government Information Systems. His most recent research has focused on distributed data-mining systems with applications to fraud and intrusion detection in network information systems. He received his PhD from NYU's Courant Institute. Contact him at the Computer Science Dept., Columbia Univ., New York, NY 10027; [email protected]; www.cs.columbia.edu/~sal.
