A brief overview of PRIDIT

Review of Fraud Classification Using Principal Components Analysis of RIDITS By Louise A. Francis Francis Analytics and Actuarial Data Mining, Inc.

Post on 25-Nov-2015

DESCRIPTION

PRIDIT is a technique for fraud detection.

TRANSCRIPT

  • Review of Fraud Classification Using Principal Components Analysis of RIDITS
    By Louise A. Francis
    Francis Analytics and Actuarial Data Mining, Inc.

  • Objectives
    Address the question: why use a new method, PRIDIT?
    Introduce other methods used in similar circumstances
    Explain how PRIDIT adds to the methods available
    Explain the limitations of PRIDIT/RIDIT

  • A Key Problem in Fraud Modeling
    Most data mining methods need a target (dependent) variable:
    $Y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
    Fraud (Yes/No or Fraud Score) = f(predictor variables)
    This requires a sample of data in which claims have been determined to be fraudulent or legitimate

  • Dependent variable hard to get
    In a large sample of automobile insurance claims, perhaps one third may have an element of abuse or fraud
    Scarce resources are not expended on such large volumes of claims to determine their legitimacy
    Only a small percentage are referred to Special Investigation Unit (SIU) investigators or other investigations
    There are time lags in determining the outcome of investigations

  • Unsupervised learning
    Another approach, one that does not require a dependent variable
    Two key kinds:
      Cluster Analysis
      Principal Components/Factor Analysis (PRIDIT uses this approach)
    PRIDIT is applied to ordered categorical variables

  • Cluster Analysis
    Records are grouped into categories that have similar values on the variables
    Examples:
      Marketing: people with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing
      Text analysis: use words that tend to occur together to classify documents
    Note: no dependent variable is used in the analysis

  • Clustering
    Common methods: k-means, hierarchical
    No dependent variable: records are grouped into classes with similar values on the variables
    Start with a measure of similarity or dissimilarity
    Maximize dissimilarity between members of different clusters
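The grouping idea on this slide can be sketched with a bare-bones k-means loop. This is only an illustration under stated assumptions: the function names `assign` and `k_means`, the six two-variable records, and the choice of squared Euclidean distance are all hypothetical and do not come from the presentation.

```python
import numpy as np

def assign(X, centers):
    """Label each record with the index of its nearest center."""
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def k_means(X, k, n_iter=20, seed=0):
    """Plain k-means using squared Euclidean distance as the dissimilarity."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct records chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return assign(X, centers), centers

# Hypothetical records with two numeric variables each (not the AIB data).
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(X, k=2)
print(labels)
```

No dependent variable appears anywhere in the loop: the records are grouped only by how close they are to one another.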

  • Dissimilarity (Distance) Measure: Continuous Variables
    Euclidean Distance
    Manhattan Distance
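The formulas on this slide did not survive in the transcript. As a reminder, the standard textbook definitions of the two distances can be sketched as follows; the two claims and their variables are hypothetical.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Two hypothetical claims described by (number of providers, report lag in days).
claim_a = (2, 10)
claim_b = (5, 14)

print(euclidean(claim_a, claim_b))  # 5.0
print(manhattan(claim_a, claim_b))  # 7
```

In practice the variables would usually be standardized first so that no single variable dominates the distance.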

  • Binary Variables
    Cross-tabulation of two binary variables (cell counts a, b, c, d):

                      Column Variable
    Row Variable      1       0       Total
    1                 a       b       a+b
    0                 c       d       c+d
    Total             a+c     b+d

  • Binary Variables
    Simple Matching
    Rogers and Tanimoto
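The formulas for these two measures are missing from the transcript. A sketch using the usual definitions, written in terms of the a, b, c, d cell counts of the 2x2 table, is given here. It shows the dissimilarity form of each measure (1 minus the similarity), which is an assumption about what the slide displayed; the function names and the two example records are hypothetical.

```python
def binary_counts(x, y):
    """Return the 2x2 cell counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching_dissimilarity(a, b, c, d):
    """Share of variables on which the two records disagree."""
    return (b + c) / (a + b + c + d)

def rogers_tanimoto_dissimilarity(a, b, c, d):
    """Like simple matching, but disagreements are counted twice."""
    return 2 * (b + c) / (a + d + 2 * (b + c))

# Two hypothetical claims scored on six yes/no indicators.
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]
cells = binary_counts(x, y)
print(cells)                                  # (2, 1, 1, 2)
print(simple_matching_dissimilarity(*cells))
print(rogers_tanimoto_dissimilarity(*cells))  # 0.5
```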

  • Example: Fraud Data
    Data from a 1993 closed claim study conducted by the Automobile Insurers Bureau of Massachusetts
    Claim files often have variables that may be useful in assessing suspicion of fraud, but a dependent variable is often not available
    Variables used for clustering:
      Legal representation
      Prior claim
      SIU investigation
      At fault
      Police report
      Number of providers

  • Statistics for Clusters
    Based on descriptive statistics, Cluster 2 appears to have a higher likelihood of fraudulent claims (more about this later)

    Percentage "Yes" by cluster (Number Providers is a count, not a percentage):

    Cluster   Claims   Police Report   Medical Audit   At Fault   Legal Rep   SIU Investigation   Number Providers
    1         723      46.7%           0.1%            42.2%      6.1%        0.0%                2
    2         677      49.8%           5.9%            2.4%       96.0%       6.5%                4

  • Principal Components Analysis
    A form of dimension (variable) reduction
    Suppose we want to combine all the information related to the financial dimension of fraud:
      Medical provider bill (indicative of padding claim)
      Hospital bill
      Number of providers
      Economic losses
      Claimed wages
      Incurred losses

  • Principal Components
    These variables are correlated but not perfectly correlated
    We replace many variables with a weighted sum of the variables

  • Correlation Matrix for Variables

                       Number      Medical   Provider   Economic              Hospital
                       Providers   Bill      Paid       Losses     Incurred   Pymt
    Number Providers   1.000       0.387     0.571      0.382      0.382      0.168
    Medical Bill       0.387       1.000     0.539      0.952      0.952      0.922
    Provider Paid      0.571       0.539     1.000      0.531      0.531      0.327
    Economic Losses    0.382       0.952     0.531      1.000      1.000      0.888
    Incurred           0.382       0.952     0.531      1.000      1.000      0.888
    Hospital Pymt      0.168       0.922     0.327      0.888      0.888      1.000

  • Finding Factor or Component
    The correlation matrix is used to find the factor that explains the most variance (captures most of the correlation) for the set of variables
    The component or factor extracted will be a weighted average of the variables
    More than one component or factor may result from applying the method
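To make the extraction step concrete, here is a minimal sketch that takes the six-variable correlation matrix from the preceding slide and pulls out its first principal component with an eigendecomposition. Scaling the eigenvector by the square root of its eigenvalue to obtain loadings is the usual convention and is assumed here. The output need not reproduce the loadings reported in the slides, which appear to come from a run that also included Report Lag.

```python
import numpy as np

names = ["Number Providers", "Medical Bill", "Provider Paid",
         "Economic Losses", "Incurred", "Hospital Pymt"]

# Correlation matrix as shown on the preceding slide.
R = np.array([
    [1.000, 0.387, 0.571, 0.382, 0.382, 0.168],
    [0.387, 1.000, 0.539, 0.952, 0.952, 0.922],
    [0.571, 0.539, 1.000, 0.531, 0.531, 0.327],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.168, 0.922, 0.327, 0.888, 0.888, 1.000],
])

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(R)
top = np.argmax(eigenvalues)
first_vec = eigenvectors[:, top]
if first_vec.sum() < 0:          # the sign of an eigenvector is arbitrary
    first_vec = -first_vec

loadings = first_vec * np.sqrt(eigenvalues[top])
share = eigenvalues[top] / R.shape[0]   # trace of a correlation matrix = number of variables

for name, loading in zip(names, loadings):
    print(f"{name:18s}{loading:7.3f}")
print(f"Share of variance explained by the first component: {share:.2f}")
```

The variables with the largest loadings contribute most to the weighted sum, which is how the loadings are read on the slide that follows.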

  • Evaluating Importance of Variables
    Use factor loadings

    Component Matrix (extraction method: Principal Component Analysis; 2 components extracted)

    Variable           Component 1   Component 2
    Number Providers   0.497         0.762
    Medical Bill       0.974         -0.137
    Provider Paid      0.646         0.582
    Economic Losses    0.976         -0.139
    Incurred           0.976         -0.139
    Hospital Pymt      0.886         -0.393
    Report Lag         0.018         -0.107

  • Problem: Categorical Variables
    It is not clear how best to perform Principal Components/Factor Analysis on categorical variables
    The categories may be coded as a series of binary dummy variables
    If the categories are ordered, you may lose important information
    This is the problem that PRIDIT addresses

  • RIDIT
    Variables are ordered so that the lowest value is associated with the highest probability of fraud
    Use the cumulative distribution of claims at each value, i, to create the RIDIT statistic for claim t, value i

  • Example: RIDIT for Legal Representation

    Value   Code   Number   Proportion   Proportion Below   Proportion Above   RIDIT
    Yes     1      706      0.504        0.000              0.496              -0.496
    No      2      694      0.496        0.504              0.000              0.504
    Total          1400
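In the Legal Representation example, the RIDIT column equals the proportion of claims below a value minus the proportion above it. A short sketch of that arithmetic, using the counts from the example (706 Yes, 694 No), is shown here. The function name `ridit_scores` is hypothetical.

```python
def ridit_scores(counts):
    """RIDIT for each ordered category: proportion below minus proportion above."""
    total = sum(counts)
    proportions = [c / total for c in counts]
    scores = []
    for i in range(len(proportions)):
        below = sum(proportions[:i])
        above = sum(proportions[i + 1:])
        scores.append(below - above)
    return scores

# Categories in order: Yes (code 1), No (code 2).
legal_rep = ridit_scores([706, 694])
print([round(s, 3) for s in legal_rep])  # [-0.496, 0.504]
```

The same function works for variables with more than two ordered categories, which is where the ordering information matters most.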

  • PRIDIT
    Use RIDIT statistics in Principal Components Analysis
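One way to read this slide as a procedure: convert every ordered categorical variable to RIDIT scores, then take the first principal component of the result as the claim score. The sketch below follows that reading under several assumptions: the component is extracted from the correlation matrix of the RIDIT variables, the function names are hypothetical, and the six coded claims are made up. It is not the author's implementation.

```python
import numpy as np

def ridit_transform(column):
    """Replace each ordered category code with its RIDIT score."""
    column = np.asarray(column)
    codes, counts = np.unique(column, return_counts=True)  # codes are sorted
    p = counts / counts.sum()
    below = np.cumsum(p) - p         # proportion in lower categories
    above = 1.0 - np.cumsum(p)       # proportion in higher categories
    score_by_code = dict(zip(codes.tolist(), (below - above).tolist()))
    return np.array([score_by_code[v] for v in column.tolist()])

def pridit_scores(data):
    """data: (claims x variables) array of ordered category codes.

    Assumes every variable takes at least two distinct values.
    """
    ridits = np.column_stack([ridit_transform(col) for col in np.asarray(data).T])
    corr = np.corrcoef(ridits, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    weights = eigenvectors[:, np.argmax(eigenvalues)]  # sign is arbitrary
    standardized = (ridits - ridits.mean(axis=0)) / ridits.std(axis=0)
    return standardized @ weights, weights

# Hypothetical claims with three indicators coded 1 = Yes, 2 = No.
data = [[1, 1, 1], [1, 1, 2], [1, 2, 1], [2, 2, 2], [2, 2, 2], [2, 1, 2]]
scores, weights = pridit_scores(data)
print(np.round(scores, 2))
print(np.round(weights, 2))
```

Because the sign of the component is arbitrary, the direction of the score has to be checked against the variable coding before claims are ranked.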

  • Scoring
    Assign a score to each claim
    The score can be used to sort claims
    More effort is expended on claims more likely to be fraudulent or abusive
    In the case of the AIB data, we can use additional information to test how well PRIDIT did, using the PRIDIT score
    A suspicion score was assigned to each claim by an expert

  • PRIDIT vs. Suspicion Score
    [Chart: Suspicion Score vs PRIDIT Score; axes: Suspicion Score, PRIDIT Score; plotted values are the mean PRIDIT scores tabulated here]

    Component Matrix (extraction method: Principal Component Analysis; 1 component extracted)

    RIDIT Variable   Component 1
    Rsiu             -0.248
    Rpolce           0.220
    Rfault           -0.709
    rlegal           0.752
    rmed             0.341
    rprior           0.406

    PRIDIT score by suspicion level

    Suspicion Level   Mean     N       Std. Deviation
    0                 0.44     798     0.88
    1                 0.25     48      0.86
    2                 -0.42    60      0.78
    3                 -0.42    97      0.79
    4                 -0.68    89      0.82
    5                 -0.62    131     0.81
    6                 -0.72    53      0.67
    7                 -0.96    56      0.65
    8                 -0.85    50      0.75
    9                 -1.00    10      0.70
    10                -1.24    8       1.20
    Total             0.00     1,400   1.00

  • Clustering and Suspicion Score

  • Result
    There appears to be a strong relationship between the PRIDIT score and the suspicion that a claim is fraudulent or abusive
    The clusters resulting from the cluster procedure also appeared to be effective in separating legitimate from fraudulent or abusive claims
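One rough way to put a number on the relationship is to correlate the suspicion level with the mean PRIDIT score reported for each level in the PRIDIT vs. Suspicion Score slide. The sketch below uses only those group means and counts. A correlation of group means overstates the claim-level relationship, so it should be read as a sanity check rather than a model evaluation.

```python
import numpy as np

# Suspicion levels 0..10 with the mean PRIDIT score and claim count for each level,
# as reported in the slides.
suspicion_level = np.arange(11)
mean_pridit = np.array([0.44, 0.25, -0.42, -0.42, -0.68, -0.62,
                        -0.72, -0.96, -0.85, -1.00, -1.24])
n_claims = np.array([798, 48, 60, 97, 89, 131, 53, 56, 50, 10, 8])

# Pearson correlation between suspicion level and the group mean score.
r = np.corrcoef(suspicion_level, mean_pridit)[0, 1]

# The claim-weighted mean should be close to the reported overall mean of 0.00.
overall_mean = (mean_pridit * n_claims).sum() / n_claims.sum()

print(f"correlation of suspicion level with mean PRIDIT score: {r:.2f}")
print(f"claim-weighted overall mean: {overall_mean:.3f}")
```

The correlation is strongly negative, which matches the variable ordering described earlier: lower scores go with higher suspicion.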

  • Comparison: PRIDIT and Clustering
    PRIDIT gives a score, which may be very useful for claims sorting
    Clustering assigns claims to classes; a claim is either in or out of the assigned class
    Clustering ignores information about the order of values for categorical variables
    Clustering can accommodate both categorical and continuous variables

  • Comparison
    Unordered categorical variables with many values (e.g., injury type):
      Clustering has a procedure for measuring dissimilarity for these variables and can use them in clustering
      If the values of the variables have no meaningful order, PRIDIT will not help in creating variables to use in Principal Components Analysis
