TRANSCRIPT
RED/082311
Presenter: Jo Prichard Computerworld, Phoenix AZ, 9/20/11
LexisNexis HPCC Systems Mortgage Fraud Case Study
Three Main Components

HPCC Data Refinery (Thor)
• Massively Parallel Extract, Transform and Load (ETL) engine
– Built from the ground up as a parallel data environment. Leverages inexpensive locally attached storage; does not require a SAN infrastructure.
• Enables data integration on a scale not previously available:
– The current LexisNexis person-data build process generates 350 billion intermediate results at peak
• Suitable for:
– Massive joins/merges
– Massive sorts and transformations
– Any N² problem
– “Identify and catalog all the DNA in the oceans”

HPCC Data Delivery Engine (Roxie)
• A massively parallel, high-throughput, structured query response engine
• Ultra fast due to its read-only nature
• Allows indices to be built on data for efficient multi-user retrieval
• Suitable for:
– Volumes of structured queries
– Full-text ranked Boolean search
– “I want that fish there”

Enterprise Control Language (ECL)
• An easy-to-use, data-centric programming language optimized for large-scale data management and query processing
• Highly efficient; automatically distributes workload across all nodes
– Industry analysts estimate ECL is 80% more efficient than C++, Java and SQL, and reduces programmer time to maintain and enhance existing applications by one third
– Benchmarked against SQL, ECL is five times more efficient at code generation
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
• Large library of efficient modules to handle common data manipulation tasks
Mortgage Fraud Continues to Impact the Economy
• Per FBI, pending investigations increased 12% in the fiscal year ended September 30, 2010, to 3,129 cases – this represents a 90% jump from the previous fiscal year.
• The collapse of the housing boom and the financial crisis have increased foreclosures: 2.5 million foreclosures were initiated in 2010, and 2011 should see a similar number.
• Per the FBI, mortgage origination schemes have decreased because of the depressed housing market.
• But fraud targeting troubled borrowers has increased, including loan modification scams and foreclosure rescue schemes in which perpetrators convince borrowers they can save their homes through deed transfers and upfront fees.
• Mortgage fraud hotspots include California, New York and Florida
• Source: http://www.reuters.com/article/2011/08/15/us-usa-mortgages-fraud-idUSTRE77E3UP20110815
Why is Mortgage Fraud so Difficult to Detect?
• Systems built to manage loan portfolios are not well suited to fraud detection at scale.
• Mortgage fraud is prolific and can be hard to detect, since mortgage data is not consolidated into one database.
• Data is spread across various places: financial services organizations, FinCEN SARs, government agencies, and public records such as property deeds and assessments.
• Government Agencies have limited resources to detect and investigate the bigger mortgage fraud schemes.
• The challenge is to quickly leverage readily available data to help organizations detect, prioritize and investigate large mortgage fraud schemes.
How Can HPCC Systems Detect Mortgage Fraud?
• Leverage publicly available data, such as property deeds and assessments: almost 700 million records
• Public records data has data hygiene challenges:
– Limited information on mortgage participants
– No information on the appraiser or realtor
– Names may be misspelled; there is no SSN and no DOB
• How can HPCC Systems detect mortgage fraud leveraging the big data of public records?
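Linking records despite these hygiene problems comes down to fuzzy identity matching. As a minimal, hypothetical sketch (the names, threshold and helper functions below are illustrative assumptions, not the LexisNexis linking technology), one can normalize each deed-party name and then compare normalized forms with a string-similarity score:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Uppercase, strip punctuation, collapse whitespace."""
    cleaned = "".join(c for c in name.upper() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def same_party(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Treat two deed-party names as a likely match when their
    normalized forms are sufficiently similar."""
    a, b = normalize(name_a), normalize(name_b)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(same_party("John  Q. Smith", "JOHN Q SMITH"))  # exact after normalization
print(same_party("Jon Q Smith", "John Q Smith"))     # near match despite typo
print(same_party("John Q Smith", "Mary R Jones"))    # different party
```

A production system would of course add address, phone and other corroborating fields, since name similarity alone over-matches common names.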
Rules Based Fraud Detection Falls Short
Fraudsters know all the thresholds and game the system.
• Advanced Persistent Threat (APT) is not just cyber.
• Rules-based detection plays a key role in the “Giant Mortgage Fraud Magic Act”.
• The key differentiator is how to leverage BIG DATA to measure the proximity of seemingly low-risk events commonly associated with high-risk activities, in order to detect organized fraud syndicates.
Isolated risk? Lone Individuals vs. Organized Group
Variables that describe the proximity and connectedness of risk through relationships.
• Non-visual rank ordering, prioritizing for investigation and mitigation of risk.
– Suspicious insurance claims by proximity to other suspicious insurance claims, providers and body shop contacts.
– New unsecured accounts by proximity to secured accounts and other newly unsecured accounts.
– Suspicious property transactions by proximity to associated suspicious property transactions.
• Predictive analytics based on variables that contain awareness of proximity through relationships
– Predict risk through associations to keep step with emerging fraud schemes.
– Measure the predictive nature within networks of personal injury claims, suspicious mortgage transactions and potential bust out activities.
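The proximity idea above can be sketched as a small graph traversal: score each identity by how close it sits to already-flagged identities. This is a toy Python illustration on hypothetical data (the graph, flags and weighting are assumptions, not the production scoring):

```python
from collections import deque

# Toy relationship graph: identity -> directly related identities
# (e.g., co-parties on the same deeds). Hypothetical data.
edges = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
}
suspicious = {"D", "E"}  # identities already flagged on their own

def proximity_risk(start: str, max_hops: int = 2) -> int:
    """Count flagged identities reachable within max_hops,
    weighting closer ones more heavily (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    score = 0
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nbr in edges.get(node, ()):
            if nbr in seen:
                continue
            seen.add(nbr)
            if nbr in suspicious:
                score += max_hops - hops  # closer flags score higher
            frontier.append((nbr, hops + 1))
    return score

# B sits one hop from flagged D, so it outranks C,
# which has no flagged identities within two hops.
print(proximity_risk("B"), proximity_risk("C"))
```

The point of the sketch is the differentiator named on this slide: "B" has no suspicious activity of its own, yet its proximity to flagged identities surfaces it for investigation.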
Property Transaction Risk
Three core transaction variables are measured:
• Velocity
• Profit (or not)
• Buyer-to-seller relationship distance (potential for collusion)

[Diagram: Flipping, Profit, Collusion]
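Two of these variables, velocity and profit, can be illustrated on a single parcel's transfer history. A minimal Python sketch over hypothetical deed records (the thresholds and data are illustrative assumptions):

```python
from datetime import date

# Hypothetical deed transfers for one parcel, oldest first:
# (sale_date, price, buyer, seller)
sales = [
    (date(2010, 1, 10), 100_000, "B1", "S0"),
    (date(2010, 2, 25), 180_000, "B2", "B1"),  # resold in 46 days at +80%
    (date(2012, 6, 1),  150_000, "B3", "B2"),
]

def transaction_risk(sales, max_days=90, min_gain=0.3):
    """Flag back-to-back transfers that look like flips:
    short holding period (velocity) plus outsized profit."""
    flags = []
    for prev, cur in zip(sales, sales[1:]):
        held = (cur[0] - prev[0]).days
        gain = (cur[1] - prev[1]) / prev[1]
        if held <= max_days and gain >= min_gain:
            flags.append((cur[0], held, round(gain, 2)))
    return flags

print(transaction_risk(sales))  # only the rapid, high-profit resale is flagged
```

The third variable, buyer-to-seller relationship distance, would come from the relationship graph sketched earlier; a short distance between "B1" and "B2" here would raise the collusion score of the flagged transfer.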
Large Scale Suspicious Cluster Ranking
Overview: a pipeline over ±700 million deeds, with public-data relationships derived from a ±50-terabyte database.
• Data Factory: clean
• Chronological analysis of all property sales
• Historical property sales indicators and counts
• Person / network level indicators and counts
• Collusion graph analytics
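The cluster-ranking step can be sketched with a classic union-find: group parties that appear on the same deeds into clusters, then rank clusters by their count of flagged transactions. A toy Python illustration with hypothetical parties and deals (not the production graph analytics):

```python
# Minimal sketch of suspicious-cluster ranking: union-find groups
# parties connected through shared deeds, then clusters are ranked
# by flagged-deal count. All names and deals are hypothetical.

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# (buyer, seller, flagged_as_suspicious)
deals = [
    ("A", "B", True),
    ("B", "C", True),
    ("D", "E", False),
    ("E", "F", True),
]

for buyer, seller, _ in deals:
    union(buyer, seller)

# Count suspicious deals per cluster and rank clusters.
counts = {}
for buyer, seller, flagged in deals:
    root = find(buyer)
    counts[root] = counts.get(root, 0) + (1 if flagged else 0)

ranking = sorted(counts.values(), reverse=True)
print(ranking)  # the A-B-C cluster outranks D-E-F
```

At the scale on this slide (±700 million deeds), the same grouping runs as a distributed join/sort on Thor rather than an in-memory dictionary, but the ranking logic is the same shape.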
Isolated risk of Mortgage Fraud? Lone Individuals vs. Organized Group
Rank the nature, connectedness and proximity of suspicious property transactions for every identity in the U.S.
• Property History Risk
– Chronological flow of transactions for a property
– Collusion to artificially inflate and strip equity
– Collusion to artificially deflate
– Flipping
– Short-sale flipping (flopping)
• Seller and Buyer Risk
– Seller and buyer property transaction history
– Seller and buyer cluster variables identifying:
– Equity stripping schemes
– Foreclosure-generating clusters
– Flipping schemes (short-sale flipping)
– Straw buyer recruiting
Example: Suspicious Equity Stripping Cluster
Results
Large-scale measurement of influencers strategically placed to potentially direct suspicious transactions.
• All BIG DATA on one supercomputer, measuring over a decade of property transfers nationwide.
• BIG DATA products to turn other BIG DATA into compelling intelligence.
• Large-scale graph analytics allow for identifying known unknowns.
• Florida Proof of Concept
– Highest-ranked influencers identified known ringleaders in flipping and equity stripping schemes.
– Ringleaders are typically not connected directly to suspicious transactions.
– Known ringleaders were not the highest ranking.
• Clusters with high levels of potential collusion.
• Clusters offloading property, generating defaults.
• Agile framework able to keep step with emerging schemes in real estate.
BIG DATA Insights on Complex Real Estate Behavior: Deeds and Flipping
Total Sales vs. Flipping
Contrast sales with flipping and potential collusion:
• Sales decline post-2003.
• The percentage of flipping and potential collusion keeps increasing in spite of declining sales.
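The trend is clearer as a share than as raw counts. A tiny arithmetic illustration with hypothetical yearly figures (the numbers below are invented for the example, not the slide's actual data):

```python
# Hypothetical yearly counts: total sales fall after 2003
# while the flip share keeps rising.
total_sales = {2003: 1000, 2005: 800, 2007: 600}
flip_sales  = {2003: 30,   2005: 40,  2007: 45}

flip_share = {yr: round(100 * flip_sales[yr] / total_sales[yr], 1)
              for yr in total_sales}
print(flip_share)  # {2003: 3.0, 2005: 5.0, 2007: 7.5}
```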
Emerging Fraud Trends – Short-Sale Flipping
Appendix
• Started in 1973 as Mead Data Central and launched the Lexis® service, which pioneered online legal research by allowing attorneys to search a case-law database from their firms via a private telecommunications network. Added more data sources and services over the next thirty years, including public records.
• Acquired by Reed Elsevier in 1994. Reed Elsevier is a world-leading provider of professional information and workflow solutions in the science, medical, legal, risk management and business sectors. Valued at $12 billion in 2010 (NYSE: ENL; NYSE: RUK); 34,000 employees.
• Today, LexisNexis has billions of searchable documents and records from more than 45,000 information sources. Headquarters: New York, NY (Legal and Professional) and Alpharetta, GA (Risk Solutions). Global reach: customers in more than 100 countries with about 15,000 global employees. Revenue was $6 billion in 2010.
• LexisNexis® Risk Solutions is a leader in providing essential information that helps companies across all industries and government predict, assess and manage risk. Formed in 2000, the business unit grew via organic growth and four acquisitions (RiskWise, Dolan, Seisint, ChoicePoint). The core capabilities of the business unit are data, linking and data analytics for customers in enterprise organizations (financial services, insurance carriers, government, law enforcement).
LexisNexis® Over 35 Years of Big Data Experience
Big Data is in the DNA of LexisNexis Risk Solutions
[Diagram: core capabilities]
• Public record and proprietary data on consumers and businesses
• Linking: advanced technology which matches and links files across disparate data sources
• Data Analytics: unique and proven analytic tools based on all data sources
• Advanced Technology: processing power to allow for complex matching, scoring and processing in real time, at the point of need, for Big Data
Big Data Examples
Over 4 petabytes of content (4 thousand terabytes)
• 34 billion records
• 45,000 sources
• 800,000 records added daily
• 4.2 billion names and addresses
• 585 million identities
• 739 million business contacts
• 3.5 billion documents
• Adding 2 million documents per day
• Processing over 100 million documents per day
• High Performance Computing Cluster Platform (HPCC) enables data integration on a scale not previously available and real-time answers to millions of users. Built for big data and proven for 10 years with enterprise customers.
• Offers a single architecture, two data platforms (query and refinery) and a consistent data-intensive programming language (ECL)
• ECL parallel programming language optimized for business-differentiating, data-intensive applications
HPCC Systems Built for Big Data, Proven for 10 Years with Enterprise Customers
Big Data · Open Source Components

INDUSTRY SOLUTIONS
• Industries: Insurance, Financial Services, Cyber Security, Government, Health Care, Retail, Telecommunications, Transportation & Logistics, Online Reservations
• Solutions: Customer Data Integration, Data Fusion, Fraud Detection and Prevention, Know Your Customer, Master Data Management, Weblog Analysis
[Chart: query time vs. TB of storage, from 50 to 300 TB; query time rises only from 10 sec to 13 sec]
Scalability: Little Degradation in Performance
Scalability
• Scales to support 1000+TB (up to petabytes) of data
• Purposely built system to do massive I/O
• Rapidly performs complex queries on structured and unstructured data to link to a variety of data sources
• Suitable for massive joins/merges, beyond the limits of relational DBs
• Scale increases with the addition of low-cost, commodity servers
HPCC – in production
• Current production systems range from 20 to 2000 nodes
• Currently supports over 150,000 customers, millions of end users
• Currently handling over 20 million transactions per day for our online and batch products, with innovation leading to better results
Complex query example (see chart above):
• Transaction latencies increase logarithmically while data sizes grow linearly