Data Quality: Opportunities, Data, and Examples
3
Better and More Data
– Level of analysis
  • Take a quick look at what/why use data
  • Linking data from disparate and third party sources
– Explore data types
– Typical issues & Tricks
  • Cross validation and sourcing
  • Reverse Look-up
  • GIS layering
  • Backfill from text correlated to codes
– Information from operations
  • Text analytics
4
General Organizational Overview
An information business focused on risk taking. Make. Sell. Serve.
Sales and Distribution
• Producer Segmentation
• Market Planning
• Revenue Forecasting
• Cross sell and Up sell
• Retention and Profitability
Underwriting
• Risk Selection and Pricing
• Portfolio Management
• Premium Adequacy
• Billing and Collections Management
Claims
• Payment Accuracy
• Claim Collaboration > Fraud Detection > Subrogation > Risk Transfer > 3rd Party Deductible > Reinsurance Recoverable
5
Same Problems – Different Lines of Business
• Personal – Auto, HO, Umbrella
• Small Commercial – BOP, CPP
• Middle Market Commercial – CPP w/GL, CP, Crime, CIM, B&M, WC, Auto
• Large Commercial Accounts
• Commercial Auto
• Workers Comp
• Umbrella/Excess
• Specialty Lines – D&O, EPL, E&O, Farm, FI
6
Data Types and Forms
• Structured data
• Semi-structured data
• Unstructured data
• Text
• Spatial
• Pictographic
• Graphic
• Voice
• Video
7
Data
Multiple data systems must be pulled together for analysis. This is a great opportunity for cross-validation and sourcing.
Data systems shown:
• Archive, Legacy Systems
• Current System Claim
• Multiple States
• Billing Systems
• Finance Systems
• CRM Systems, other data
• Policy / Multiple Underwriting Systems
• Medical Data - Bill Review - PPO - Case Management - Paradigm
• Vendors/Partners
• External Data
ACTIONS
• Identify Data Systems
• Get right data from right systems
• Overcome internal Organizational Barriers
• Bridge to legacy systems and archived data
• Augment to create rich data mining environment
• Expect the need to negotiate for resources
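The cross-validation opportunity on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the two extracts, the `policy_id` key, and every field name and value below are hypothetical, not taken from any system named in the deck.

```python
# Hypothetical extracts from two systems; all field names and values are illustrative.
policy_system = {
    "P001": {"state": "NY", "insured_zip": "10001"},
    "P002": {"state": "NJ", "insured_zip": "07030"},
    "P003": {"state": "CT", "insured_zip": "06103"},
}
claim_system = [
    {"claim_id": "C1", "policy_id": "P001", "state": "NY"},
    {"claim_id": "C2", "policy_id": "P002", "state": "PA"},  # state disagrees with policy
    {"claim_id": "C3", "policy_id": "P999", "state": "NY"},  # no matching policy (orphan)
]


def cross_validate(policies, claims, field):
    """Compare one shared field across systems; return (orphans, mismatches)."""
    orphans, mismatches = [], []
    for claim in claims:
        policy = policies.get(claim["policy_id"])
        if policy is None:
            # The claim points at a policy the policy system does not know.
            orphans.append(claim["claim_id"])
        elif policy[field] != claim[field]:
            mismatches.append((claim["claim_id"], policy[field], claim[field]))
    return orphans, mismatches


orphans, mismatches = cross_validate(policy_system, claim_system, "state")
print("Orphan claims:", orphans)
print("Mismatched state:", mismatches)
```

Each mismatch keeps both values, so an analyst can decide which system to treat as the source for that field.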
8
Some typical external data sources and vendors
• Dun & Bradstreet
• Experian
• Bureau of Labor Statistics
• Market Stance
• AM Best
• Equifax
• US Census
• Claritas
• Melissa Data
• ISO
• GIS vendors
• U&C Data sets
• Code Sets for ICDs and CPTs…
9
Data Glitches – historical and on-going
Systemic changes to data not process related
– Changes in data layout / data types
– Changes in scale / format
– Temporary reversion to defaults
– Missing and default values
– Gaps in time series
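Two of the glitches listed here, gaps in time series and temporary reversion to defaults, lend themselves to simple automated checks. The sketch below assumes monthly records keyed by a `month` string and a suspected default code of `"0000"`; both are invented for the example.

```python
from collections import Counter


def month_index(ym):
    """Convert 'YYYY-MM' to a running month count so gaps are easy to spot."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month) - 1


def find_gaps(months):
    """Return (last_seen, next_seen) pairs where one or more months are missing."""
    ordered = sorted(set(months), key=month_index)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if month_index(b) - month_index(a) > 1]


def default_share_by_month(records, field, default_value):
    """Share of records per month whose field equals the suspected default."""
    totals, defaults = Counter(), Counter()
    for rec in records:
        totals[rec["month"]] += 1
        if rec[field] == default_value:
            defaults[rec["month"]] += 1
    return {m: defaults[m] / totals[m] for m in totals}


# Illustrative records: 2010-03 is missing, and 2010-04 reverts to a default code.
records = [
    {"month": "2010-01", "class_code": "8810"},
    {"month": "2010-01", "class_code": "5403"},
    {"month": "2010-02", "class_code": "8810"},
    {"month": "2010-02", "class_code": "7219"},
    {"month": "2010-04", "class_code": "0000"},
    {"month": "2010-04", "class_code": "0000"},
]

gaps = find_gaps(r["month"] for r in records)
shares = default_share_by_month(records, "class_code", "0000")
print(gaps)
print(shares)
```

A month where the default share jumps well above its neighbors is a candidate for a temporary reversion to defaults rather than a real change in the business.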
11
Defining Issues - sample
• Constants
• Definition Mismatches
• Filler Containing Data
• Inconsistent Cases
• Inconsistent Data Types
• Inconsistent Null Rules
• Invalid Keys
• Invalid Values
• Miscellaneous
• Missing Values
• Orphans
• Out of Range
• Pattern Exceptions
• Potential Constants
• Potential Defaults
• Potential Duplicates
• Potential Invalids
• Potential Redundant Values
• Potential Unused Fields
• Rule Exceptions
• Unused Fields
Source Data
1 - Define Issues
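A few of the issue types on this slide can be checked one column at a time. The sketch below covers only missing values, constants, inconsistent cases, pattern exceptions, and out-of-range values; the function name `profile_column`, its parameters, and the sample columns are assumptions made for illustration.

```python
import re


def profile_column(values, pattern=None, valid_range=None):
    """Flag a handful of the listed issue types for one column of strings."""
    issues = {}
    present = [v for v in values if v not in (None, "")]
    issues["missing_values"] = len(values) - len(present)
    # A column whose populated values are all identical is a constant.
    issues["constant"] = len(set(present)) == 1
    # The same value appearing in more than one letter case.
    folded = {}
    for v in present:
        folded.setdefault(v.lower(), set()).add(v)
    issues["inconsistent_cases"] = sorted(
        key for key, forms in folded.items() if len(forms) > 1)
    if pattern is not None:
        issues["pattern_exceptions"] = [
            v for v in present if not re.fullmatch(pattern, v)]
    if valid_range is not None:
        low, high = valid_range
        issues["out_of_range"] = [
            v for v in present if v.isdigit() and not low <= int(v) <= high]
    return issues


# Illustrative columns; values are invented.
state_issues = profile_column(["NY", "ny", "NJ", "", None, "N.J."],
                              pattern=r"[A-Za-z]{2}")
age_issues = profile_column(["34", "51", "212", "07"], valid_range=(0, 120))
print(state_issues)
print(age_issues)
```

Checks such as orphans and invalid keys need a second table to compare against, so they are left out of this single-column sketch.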
12
MORE ISSUES… Mapping across sources: Same Fact, Different Terms
Data Element Concept
Name: Country Identifiers
Unique ID: 5769
Other attributes listed: Context, Definition, Conceptual Domain, Maintenance Org., Steward, Classification, Registration Authority, Others
Values: Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe
Data Elements
Unique ID: 4572
Other attributes listed: Name, Context, Definition, Value Domain, Maintenance Org., Steward, Classification, Registration Authority, Others
ISO 3166 English Name   ISO 3166 French Name   ISO 3166 2-Alpha Code   ISO 3166 3-Alpha Code   ISO 3166 3-Numeric Code
Algeria                 L'Algérie              DZ                      DZA                     012
Belgium                 Belgique               BE                      BEL                     056
China                   Chine                  CN                      CHN                     156
Denmark                 Danemark               DK                      DNK                     208
Egypt                   Egypte                 EG                      EGY                     818
France                  La France              FR                      FRA                     250
. . .                   . . .                  . . .                   . . .                   . . .
Zimbabwe                Zimbabwe               ZW                      ZWE                     716
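One way to handle "same fact, different terms" is a crosswalk that maps every representation to a single standard code. The sketch below uses only the sample rows from this slide; choosing the 2-alpha code as the standard, and the helper names `build_lookup` and `to_alpha2`, are assumptions for illustration.

```python
# Crosswalk built from the sample ISO 3166 rows shown on the slide.
ISO_3166_SAMPLE = [
    # (English name, French name, 2-alpha, 3-alpha, 3-numeric)
    ("Algeria", "L'Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]


def build_lookup(rows):
    """Map every known representation (case-insensitive) to the 2-alpha code."""
    lookup = {}
    for row in rows:
        alpha2 = row[2]
        for term in row:
            lookup[term.casefold()] = alpha2
    return lookup


LOOKUP = build_lookup(ISO_3166_SAMPLE)


def to_alpha2(value):
    """Return the 2-alpha code for any known term, or None if it is unmapped."""
    term = str(value).strip()
    if term.isdigit():
        term = term.zfill(3)  # numeric codes are three digits, e.g. 12 -> "012"
    return LOOKUP.get(term.casefold())


for raw in ["DZA", 12, "belgique", "Zimbabwe", "Atlantis"]:
    print(raw, "->", to_alpha2(raw))
```

Returning `None` for unmapped terms keeps the exceptions visible, so they can be reviewed instead of being silently assigned to a country.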
13
Data Filling
• Manual
• Statistical Imputation
• Temporal
• Spatial
• Spatial-temporal
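Two of these filling approaches are easy to sketch. The example below shows one possible temporal fill (carrying the last observed value forward) and one possible statistical imputation (substituting the median); both are common choices, not necessarily the ones the presenter had in mind, and the sample values are invented.

```python
from statistics import median


def fill_temporal(series):
    """Carry the last observed value forward over gaps (None) in a time-ordered series."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # leading gaps stay None: nothing to carry forward yet
    return filled


def fill_statistical(values):
    """Replace missing values (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    center = median(observed)
    return [center if v is None else v for v in values]


# Illustrative data: a weekly wage observed over time, and ages across records.
weekly_wage = fill_temporal([500, None, None, 520, None])
ages = fill_statistical([30, None, 40, 50])
print(weekly_wage)
print(ages)
```

Filled values are estimates, so it is usually worth keeping a flag that marks which entries were filled and by which method.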
15
Deriving Data = Power
• Totals: Household Income
• Trends: Rate of Medical Bill Increases
• Ratios: Claims/Premium, Target/Median
• Friction: Level of inconvenience, ratio of rental to damage
• Sequences: Lawyer-Doctor, Auto-Life Policy
• Circumstances: Minimal Impact Severe Trauma
• Temporal: Loss shortly after adding collision
• Spatial: Distance to Service, proximity of stakeholders
• Logged: Progress Notes, Diaries, Who did it, When, “Why”
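As a small illustration of deriving data, the sketch below computes one ratio (claims/premium) and one temporal feature (days from adding collision coverage to the loss). The record layout, the values, and the 30-day threshold are hypothetical.

```python
from datetime import date


def loss_ratio(claims_paid, premium):
    """Ratios: claims divided by premium; None when premium is zero."""
    return claims_paid / premium if premium else None


def days_from_coverage_to_loss(coverage_added, loss_date):
    """Temporal: days between adding a coverage and the reported loss."""
    return (loss_date - coverage_added).days


# Illustrative policy record; all values are invented.
policy = {
    "premium": 1200.0,
    "claims_paid": 900.0,
    "collision_added": date(2010, 3, 1),
    "loss_date": date(2010, 3, 9),
}

derived = {
    "loss_ratio": loss_ratio(policy["claims_paid"], policy["premium"]),
    "days_to_loss": days_from_coverage_to_loss(
        policy["collision_added"], policy["loss_date"]),
}
# Hypothetical threshold: flag losses within 30 days of adding collision coverage.
derived["loss_soon_after_collision"] = derived["days_to_loss"] <= 30
print(derived)
```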
16
Deriving Data = Power (Cont’d)
• Behavioral: Deviation from past usage, spike buying
• Experience Profiles: Vendor, Doctor, Premium Audit
• Channel: How applied, How reported, Service Chain
• Legal Jurisdiction: Venue Disposition, Rules
• Demographics: Working, Weekly wage, lost income
• Firmographics: Industry Class Code Vs Injuries Claimed
• Inflation: Wage, Medical, Goods, Auto, COLA
• Gov’t Statistics: Crime Rate, Employment, Traffic
• Other Stats: Rents, Occupancy, Zoning, Mgd Care
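The behavioral item, deviation from past usage, can be expressed as a score of the current value against an entity's own history. A z-score is used here as one simple choice; the function name and the sample billing history are assumptions for illustration.

```python
from statistics import mean, stdev


def deviation_from_past(history, current):
    """Z-score of the current value against the entity's own history (2+ points)."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation
    if spread == 0:
        # Flat history: any change at all is an unbounded deviation.
        if current == center:
            return 0.0
        return float("inf") if current > center else float("-inf")
    return (current - center) / spread


# Illustrative history: bills per month from one provider, then a spike.
z = deviation_from_past([10, 12, 11, 9, 13], 25)
print(round(z, 2))
```

What counts as a meaningful deviation depends on the line of business, so any cutoff applied to this score should be tuned against known cases.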
17
“Search” versus “Discover”
                           Search (goal-oriented)   Discover (opportunistic)
Structured Data            Data Retrieval           Data Mining
Unstructured Data (Text)   Information Retrieval    Text Mining
18
Searching
Word Replacement Lists
Input Value: [Jim]
Transformed Input Value: [JAMES]
Word replacement list:
Jimmy → JAMES
Jim → JAMES
James → JAMES
Returns “Similar Matches”
All Records Found:
Jimmy
Jim
James
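The word replacement search on this slide can be sketched as follows: both the input value and the stored values are transformed through the same replacement list before they are compared. The list contents mirror the slide; the function names and the extra record "Joan" are hypothetical.

```python
# Word replacement list: each variant maps to one standard form.
REPLACEMENTS = {"JIM": "JAMES", "JIMMY": "JAMES", "JAMES": "JAMES"}


def standardize(name):
    """Upper-case the name and swap in its standard form when one is listed."""
    key = name.strip().upper()
    return REPLACEMENTS.get(key, key)


def search(records, input_value):
    """Return every record whose standardized form matches the standardized input."""
    target = standardize(input_value)
    return [r for r in records if standardize(r) == target]


records = ["Jimmy", "Jim", "James", "Joan"]
print(standardize("Jim"))      # JAMES
print(search(records, "Jim"))  # ['Jimmy', 'Jim', 'James']
```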
19
Motivation for Text Mining
• Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)
• Information intensive business processes demand that we move beyond simple document retrieval to “knowledge” discovery.
Chart: 90% Unstructured or Semi-structured Information; 10% Structured Numerical or Coded Information
21
Techniques for attacking text data:
• Rules-based
• Statistical Text Analysis and Clustering
• Linguistic and Semantic Clustering
• Support Vector Machines
• Pattern Matching or other statistical algorithms
• Neural Networks
• Combination of methods from above
Text is like a data iceberg
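Of the techniques listed, the rules-based approach is the simplest to show. The sketch below tags a claim note with categories when any of a category's patterns appears; the categories, patterns, and sample notes are hypothetical and far from a complete rule set.

```python
import re

# Hypothetical rules: category -> patterns to look for in a claim note.
RULES = {
    "attorney_involved": [r"\battorney\b", r"\blawyer\b", r"\blegal letter\b"],
    "subrogation": [r"\bsubro(gation)?\b", r"\bthird[- ]party at fault\b"],
    "medical": [r"\bdr\.?\s+\w+", r"\bsurgery\b", r"\bmri\b"],
}


def tag_note(note):
    """Return the sorted categories whose patterns appear in the note."""
    text = note.lower()
    return sorted(category for category, patterns in RULES.items()
                  if any(re.search(p, text) for p in patterns))


print(tag_note("Call Dr Jones next week re: MRI results"))
print(tag_note("Legal letter sent; claimant retained a lawyer. Possible subro."))
print(tag_note("Vehicle towed to shop"))
```

Rules like these are transparent and quick to adjust, which is one reason they are often combined with the statistical methods in the list rather than replaced by them.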
22
Claims processing – Progress notes and Diaries
CLAIMS ADJUSTER

• Medical Management Staff
• Special Investigation Unit
• NICB
• Vendor Management
• Consulting Engineers
• Hearing Representative
• Structured Settlement Unit
• Recovery Staff
• Legal Staff

• Home Office Staff
• Field Office Claim Staff
• Insured Risk Manager
• Agent or Broker

• Diary forward – “call Dr Jones next week”
• Business Rule – large loss review
• System Reminder – update case reserves
• Correspondence Tracking – legal letter sent
Service
23
Semantic processing: Named Entity Extraction
• Identify and type language features
• Examples:
  • People names
  • Company names
  • Geographic location names
  • Dates
  • Monetary amounts
  • Phone #, ZIP codes, SSN, FEIN
  • Others… (domain specific)
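Entities with a regular shape (dates, monetary amounts, phone numbers, SSNs, ZIP codes) can be extracted with pattern matching alone. The sketch below uses simplified, US-centric patterns chosen for illustration; people, company, and location names generally need dictionaries or statistical models and are not covered here.

```python
import re

# Simplified, US-centric patterns; real notes will need broader coverage.
PATTERNS = {
    "DATE": r"\b\d{1,2}/\d{1,2}/\d{4}\b",
    "MONEY": r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "ZIP": r"\b\d{5}(?:-\d{4})?\b",
}


def extract_entities(text):
    """Return (type, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


# Illustrative note; every value is invented.
note = ("Paid $1,250.00 on 03/15/2010; call 555-123-4567. "
        "Claimant SSN 123-45-6789, ZIP 10001.")
entities = extract_entities(note)
for label, value in entities:
    print(label, value)
```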
24
Forklift Hits Ladder
Ladder in Doorway
Forklift Couldn’t Stop
No Barrier Signs
Forklift Brakes Defective
Cooking Oil on Floor
Forklift Going Too Fast
Brake Maintenance Delayed
Housekeeping Inadequate
Speed Limits Not Enforced
Or
Lack of Personnel
No Policy
No Enforcement
No Enforcement
Feedback to UW