A Brief Overview of PRIDIT

Review of
Fraud Classification Using Principal Components Analysis of RIDITs
By Louise A. Francis
Francis Analytics and Actuarial Data Mining, Inc.

Objectives
- Address the question: Why use a new method, PRIDIT?
- Introduce other methods used in similar circumstances
- Explain how PRIDIT adds to the methods available
- Explain the limitations of PRIDIT/RIDIT

A Key Problem in Fraud Modeling
- Most data mining methods need a target (dependent) variable
- Y = a + b_1 x_1 + b_2 x_2 + ... + b_n x_n
- Fraud (Yes/No or Fraud Score) = f(predictor variables)
- Need a sample of data where claims have been determined to be fraudulent or legitimate

Dependent variable hard to get
- In a large sample of automobile insurance claims, perhaps 1/3 may have an element of abuse or fraud
- Scarce resources are not expended on such large volumes of claims to determine their legitimacy
- Only a small percentage are referred to SIU investigators or other investigations
- There are time lags in determining the outcome of investigations

Unsupervised learning
- Another approach that does not require a dependent variable
- Two key kinds
  - Cluster Analysis
  - Principal Components/Factor Analysis
    - PRIDIT uses this approach
    - It is applied to ordered categorical variables

Cluster Analysis
- Records are grouped into categories that have similar values on the variables
- Examples
  - Marketing: People with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing
  - Text analysis: Use words that tend to occur together to classify documents
- Note: no dependent variable is used in the analysis

Clustering
- Common methods: k-means, hierarchical
- No dependent variable: records are grouped into classes with similar values on the variables
- Start with a measure of similarity or dissimilarity
- Maximize dissimilarity between members of different clusters
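
The slides name k-means without describing it. As a rough illustration only, the sketch below implements the standard Lloyd iteration: assign each record to its nearest center, then move each center to the mean of its records. The function names, the choice of the first k rows as starting centers, and the toy data are assumptions made for this example, not part of the original analysis.

```python
import numpy as np

def assign(X, centers):
    """Return, for each record, the index of the nearest center."""
    # Squared Euclidean distance from every record to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm; the first k rows are the starting centers (illustrative choice)."""
    X = np.asarray(X, dtype=float)
    centers = X[:k].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Move each center to the mean of its records; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

# Toy records with two numeric variables (made-up values).
toy = [[0, 0], [10, 10], [0, 1], [1, 0], [10, 11], [11, 10]]
labels, centers = kmeans(toy, k=2)
```

No dependent variable appears anywhere in the procedure: the grouping depends only on the distances between records.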

Dissimilarity (Distance) Measure
Continuous Variables
- Euclidean Distance
- Manhattan Distance
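
The formulas for these two measures did not survive in this copy of the slides. The sketch below uses the standard textbook definitions: Euclidean distance is the square root of the sum of squared differences, and Manhattan distance is the sum of absolute differences. The function names are chosen for this example.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Two records that differ by 3 on one variable and 4 on the other.
print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```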

Binary Variables

                        Column Variable
                        1      0      Total
Row Variable    1       a      b      a+b
                0       c      d      c+d
Total                   a+c    b+d

Binary Variables
- Simple Matching
- Rogers and Tanimoto
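
The formulas for the two binary measures are missing from this copy. The sketch below uses the commonly cited dissimilarity forms, written in terms of the cell counts a, b, c, d from the 2x2 table: simple matching as (b + c) / (a + b + c + d), and Rogers and Tanimoto as 2(b + c) / (a + d + 2(b + c)). Whether the original slide showed the similarity or the dissimilarity version is not known, so treat the choice here as an assumption. The function names are also chosen for this example.

```python
def binary_counts(x, y):
    """Count the four cells of the 2x2 agreement table for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both 1
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # first 1, second 0
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # first 0, second 1
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # both 0
    return a, b, c, d

def simple_matching_dissimilarity(x, y):
    """Share of variables on which the two records disagree."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def rogers_tanimoto_dissimilarity(x, y):
    """Like simple matching, but disagreements carry double weight."""
    a, b, c, d = binary_counts(x, y)
    return 2 * (b + c) / (a + d + 2 * (b + c))

# Two made-up claims described by five yes/no indicators.
claim_1 = [1, 1, 0, 0, 1]
claim_2 = [1, 0, 0, 1, 1]
print(simple_matching_dissimilarity(claim_1, claim_2))  # 0.4
```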

Example: Fraud Data
- Data from a 1993 closed claim study conducted by the Automobile Insurers Bureau of Massachusetts
- Claim files often have variables which may be useful in assessing suspicion of fraud, but a dependent variable is often not available
- Variables used for clustering:
  - Legal representation
  - Prior claim
  - SIU investigation
  - At fault
  - Police report
  - Number of providers

Statistics for Clusters
- Based on descriptive statistics, Cluster 2 appears to have a higher likelihood of fraudulent claims (more about this later)

Percentage Yes, by cluster

Cluster   N     Police Report   Medical Audit   At Fault   Legal Rep   SIU Investigation   Number Providers
1         723   46.7%           0.1%            42.2%      6.1%        0.0%                2
2         677   49.8%           5.9%            2.4%       96.0%       6.5%                4

Principal Components Analysis
- A form of dimension (variable) reduction
- Suppose we want to combine all the information related to the financial dimension of fraud
  - Medical provider bill (indicative of padding claim)
  - Hospital bill
  - Number of providers
  - Economic losses
  - Claimed wages
  - Incurred losses

Principal Components
- These variables are correlated but not perfectly correlated
- We replace many variables with a weighted sum of the variables

Correlation Matrix for Variables

Correlations

                   Number Providers   Medical Bill   Provider Paid   Economic Losses   Incurred   Hospital Pymt
Number Providers   1.000              0.387          0.571           0.382             0.382      0.168
Medical Bill       0.387              1.000          0.539           0.952             0.952      0.922
Provider Paid      0.571              0.539          1.000           0.531             0.531      0.327
Economic Losses    0.382              0.952          0.531           1.000             1.000      0.888
Incurred           0.382              0.952          0.531           1.000             1.000      0.888
Hospital Pymt      0.168              0.922          0.327           0.888             0.888      1.000

Finding Factor or Component
- The correlation matrix is used to find the factor that explains the most variance (captures most of the correlation) for the set of variables
- The component or factor extracted will be a weighted average of the variables
- More than one component or factor may result from applying the method
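
One common way to carry out this step is an eigendecomposition of the correlation matrix: the eigenvector with the largest eigenvalue gives the weights, and scaling it by the square root of that eigenvalue gives the loadings. The sketch below applies this to the six-variable correlation matrix reported in this deck. The function name and the sign convention are assumptions for this example, and the statistical package used in the original analysis may differ in detail, so the numbers need not match any loadings reported elsewhere (which were computed on a different variable set).

```python
import numpy as np

names = ["Number Providers", "Medical Bill", "Provider Paid",
         "Economic Losses", "Incurred", "Hospital Pymt"]

# Correlation matrix as reported in the deck (rounded to three decimals).
R = np.array([
    [1.000, 0.387, 0.571, 0.382, 0.382, 0.168],
    [0.387, 1.000, 0.539, 0.952, 0.952, 0.922],
    [0.571, 0.539, 1.000, 0.531, 0.531, 0.327],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.168, 0.922, 0.327, 0.888, 0.888, 1.000],
])

def first_component_loadings(R):
    """Loadings on the first principal component and its share of total variance."""
    eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
    top_value = eigvals[-1]
    weights = eigvecs[:, -1]
    if weights.sum() < 0:                 # the sign of an eigenvector is arbitrary
        weights = -weights
    loadings = weights * np.sqrt(top_value)
    explained = top_value / np.trace(R)
    return loadings, explained

loadings, explained = first_component_loadings(R)
for name, value in zip(names, loadings):
    print(f"{name:18s}{value:7.3f}")
print(f"share of variance explained: {explained:.3f}")
```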

Evaluating Importance of Variables
- Use factor loadings

Component Matrix (extraction method: Principal Component Analysis; 2 components extracted)

Variable            Component 1   Component 2
Number Providers     0.497         0.762
Medical Bill         0.974        -0.137
Provider Paid        0.646         0.582
Economic Losses      0.976        -0.139
Incurred             0.976        -0.139
Hospital Pymt        0.886        -0.393
Report Lag           0.018        -0.107

Correlation of Claimed Wages with the other variables

Number Providers   Medical Bill   Provider Paid   Economic Losses   Incurred   Hospital Pymt
-0.161             -0.020         -0.077          -0.095            -0.095      0.024

Problem: Categorical Variables
- It is not clear how best to perform Principal Components/Factor Analysis on categorical variables
- The categories may be coded as a series of binary dummy variables
- If the categories are ordered, dummy coding may lose important information about their order
- This is the problem that PRIDIT addresses

RIDIT
- Variables are ordered so that the lowest value is associated with the highest probability of fraud
- Use the cumulative distribution of claims at each value, i, to create the RIDIT statistic for claim t, value i

Example: RIDIT for Legal Representation

Legal Representation

Value   Code   Number   Proportion   Proportion Below   Proportion Above   RIDIT
Yes     1      706      0.504        0.000              0.496              -0.496
No      2      694      0.496        0.504              0.000               0.504
Total          1400
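
The Legal Representation example can be reproduced with a few lines of code. In the table, the RIDIT for a category equals the proportion of claims in lower categories minus the proportion in higher categories. The sketch below applies that rule to the category counts in their coded order. The function name is chosen for this example.

```python
def ridit_scores(counts):
    """RIDIT per ordered category: proportion below minus proportion above."""
    total = sum(counts)
    scores = []
    below = 0.0
    for count in counts:
        proportion = count / total
        above = 1.0 - below - proportion
        scores.append(below - above)
        below += proportion
    return scores

# Legal Representation: 706 claims coded 1 (Yes), 694 claims coded 2 (No).
print([round(score, 3) for score in ridit_scores([706, 694])])  # [-0.496, 0.504]
```

The same function works for variables with more than two ordered categories. The middle categories then receive scores between the two extremes.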

PRIDIT
- Use RIDIT statistics in Principal Components Analysis
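
The slide gives only the outline of the method, so what follows is one plausible reading rather than the procedure actually used. It replaces each ordered categorical variable with its RIDIT scores, computes the correlation matrix of those scores, and scores each claim on the first principal component. The function names, the toy data, and the decision to standardize the RIDIT columns before weighting are all assumptions. The sketch also assumes every variable takes at least two distinct values.

```python
import numpy as np

def ridit_column(codes):
    """Replace ordered category codes with RIDIT scores (proportion below minus above)."""
    codes = np.asarray(codes)
    values, counts = np.unique(codes, return_counts=True)  # values come back sorted
    props = counts / counts.sum()
    below = np.concatenate(([0.0], np.cumsum(props)[:-1]))
    above = 1.0 - below - props
    score_by_value = dict(zip(values.tolist(), (below - above).tolist()))
    return np.array([score_by_value[c] for c in codes.tolist()])

def pridit_scores(coded):
    """Score each claim on the first principal component of the RIDIT-coded variables."""
    coded = np.asarray(coded)
    F = np.column_stack([ridit_column(coded[:, j]) for j in range(coded.shape[1])])
    R = np.corrcoef(F, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)          # ascending order: last is largest
    weights = eigvecs[:, -1]                      # sign is arbitrary
    Z = (F - F.mean(axis=0)) / F.std(axis=0)      # standardize to match correlation-based PCA
    return Z @ weights, weights

# Six made-up claims, three ordered indicators coded 1 or 2.
toy = [[1, 1, 1], [1, 1, 2], [2, 2, 2], [2, 2, 1], [1, 1, 1], [2, 2, 2]]
scores, weights = pridit_scores(toy)
```

Because the sign of an eigenvector is arbitrary, check the direction of the score against the coding of the variables before using it to rank claims.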

Scoring
- Assign a score to each claim
- The score can be used to sort claims
  - More effort is expended on claims more likely to be fraudulent or abusive
- In the case of the AIB data, we can use additional information to test how well PRIDIT did, using the PRIDIT score
  - A suspicion score was assigned to each claim by an expert

PRIDIT vs. Suspicion Score

[Chart: Suspicion Score vs PRIDIT Score. Axes: Suspicion Score, PRIDIT Score. Plots the mean PRIDIT score at each suspicion level.]

Component Matrix (extraction method: Principal Component Analysis; 1 component extracted)

Variable   Component 1
Rsiu       -0.248
Rpolce      0.220
Rfault     -0.709
rlegal      0.752
rmed        0.341
rprior      0.406

PRIDIT Score by Suspicion Level

Suspicion Level   Mean    N       Std. Deviation
0                  0.44     798   0.88
1                  0.25      48   0.86
2                 -0.42      60   0.78
3                 -0.42      97   0.79
4                 -0.68      89   0.82
5                 -0.62     131   0.81
6                 -0.72      53   0.67
7                 -0.96      56   0.65
8                 -0.85      50   0.75
9                 -1.00      10   0.70
10                -1.24       8   1.20
Total              0.00   1,400   1.00

Clustering and Suspicion Score

Result
- There appears to be a strong relationship between the PRIDIT score and the suspicion that a claim is fraudulent or abusive
- The clusters resulting from the cluster procedure also appeared to be effective in separating legitimate from fraudulent or abusive claims

Comparison: PRIDIT and Clustering
- PRIDIT gives a score, which may be very useful for sorting claims
- Clustering assigns claims to classes; a claim is either in or out of the assigned class
- Clustering ignores information about the order of values for categorical variables
- Clustering can accommodate both categorical and continuous variables

Comparison
- Unordered categorical variables with many values (e.g., injury type):
  - Clustering has a procedure for measuring dissimilarity for these variables and can use them in clustering
  - If the values of a variable have no meaningful order, PRIDIT will not help in creating variables to use in Principal Components Analysis
