Assurance Scoring, PyData London 2016

Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector

Matt Thomson, Natalia Angarita-Jaimes

8/5/2016

Upload: matthew-thomson

Post on 15-Apr-2017


Page 1: Assurance Scoring Pydata London 2016

Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector

Matt Thomson

Natalia Angarita-Jaimes

8/5/2016

Page 2: Assurance Scoring Pydata London 2016

Copyright © Capgemini 2012. All Rights Reserved

Outline

• Introduction
• Traditional Fraud Detection
• Assurance Scoring
• Machine Learning
• Business Rules
• Anomaly Detection
• Graph Links

Page 3: Assurance Scoring Pydata London 2016


Who are we?

Matt Thomson
• Senior Data Scientist at Capgemini
• PhD in Astrophysics (http://arxiv.org/abs/1010.3315)
• Several years' experience in fraud detection

Natalia Angarita-Jaimes
• Data Scientist at Capgemini
• PhD in Optical Engineering
• Several years' experience in signal and image processing

Capgemini Big Data Analytics team
• 30 Data Scientists, 40 Big Data Engineers
• Focus on Open Source and Big Data technologies to solve client problems
• Sponsor the conference!

Page 4: Assurance Scoring Pydata London 2016


Introduction to the Problem

Public sector constantly working in an environment of reduced resources

Want to provide a better service but with greater efficiency

Therefore very important that limited resources are focussed correctly

Assurance Scoring: use ML and other analytical methods to identify the least risky people or applications, so that investigators' resources can be targeted on the most risky

Page 5: Assurance Scoring Pydata London 2016


Hypothetical Example – 2016 Olympics tickets

Running the application process for selling tickets to the 2016 Olympics

• Avoid selling tickets to touts/resellers
• Vast majority of people applying for tickets are genuine
• Fraud detection with big class imbalance problem (<0.1%)
• Avoid approach of investigating each person applying

Let's say we know from the 2012 Olympics which people ended up reselling their tickets – this is our training data

Use ML to identify the least risky 30% (say) of people wanting tickets

Investigators focus on the high risk
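The "least risky 30%" step can be pictured as a simple ranking: once some model has produced a risk score per applicant, the applicants are sorted by score and the lowest-scoring fraction is set aside. The sketch below is only an illustration in plain Python; the scores, the applicant ids and the `split_by_risk` helper are hypothetical and are not taken from the talk.

```python
def split_by_risk(scores, low_risk_fraction=0.30):
    """Return (low_risk_ids, remaining_ids) from {applicant_id: risk_score}.

    Assumed convention: a higher score means a higher predicted risk.
    """
    ranked = sorted(scores, key=scores.get)  # ascending risk
    n_low = round(len(ranked) * low_risk_fraction)
    return ranked[:n_low], ranked[n_low:]


# Hypothetical model outputs for ten applicants.
raw_scores = [0.02, 0.40, 0.01, 0.75, 0.10, 0.05, 0.90, 0.03, 0.20, 0.60]
scores = {f"app{i}": s for i, s in enumerate(raw_scores)}

low_risk, to_review = split_by_risk(scores)
print("Low risk (no investigation):", low_risk)
print("Remaining (investigators focus here):", to_review)
```

The 30% cut-off is a parameter rather than a fixed property of the method; in practice it would presumably be chosen from how many risky cases the organisation can tolerate in the low-risk group.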

Page 6: Assurance Scoring Pydata London 2016


Traditional Fraud Detection

Identify Historical Training Data

Feature Engineering

Model Training and Evaluation

Model Execution

Feedback

Page 7: Assurance Scoring Pydata London 2016


Assurance Scoring

Focus on low-risk

Allows resources to be better focussed

Not limited to Machine Learning

Built using Python! pandas, scikit-learn, etc.

Page 8: Assurance Scoring Pydata London 2016


Assurance Scoring

Page 9: Assurance Scoring Pydata London 2016


POLE ‘Analytical’ Data Layer

Disparate data sources - Atomic Layer

Atomic data is Transformed and Loaded into POLE

POLE Layer

Person, Object, Location, Event

Page 10: Assurance Scoring Pydata London 2016


POLE ‘Analytical’ Data Layer

POLE contains ALL entities from the Atomic Layer, plus their inter-linkages
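A minimal sketch of how a POLE-style layer might be represented in Python: every record becomes an entity of one of four kinds (person, object, location, event), and the inter-linkages are stored as unordered pairs. The class names `Entity` and `PoleLayer`, the ids and the example links are all hypothetical; the talk does not describe the actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str       # "person", "object", "location" or "event"
    entity_id: str  # hypothetical identifier from the atomic layer


@dataclass
class PoleLayer:
    entities: set = field(default_factory=set)
    links: set = field(default_factory=set)  # unordered pairs of entities

    def add_link(self, a, b):
        """Register both entities and the link between them."""
        self.entities.update((a, b))
        self.links.add(frozenset((a, b)))

    def neighbours(self, entity):
        """Entities directly linked to the given one."""
        return {other for link in self.links if entity in link
                for other in link if other != entity}


# Hypothetical example: a person makes an application for tickets from an address.
applicant = Entity("person", "P1")
application = Entity("event", "application-001")
tickets = Entity("object", "tickets-100m-final")
address = Entity("location", "postcode-E20")

pole = PoleLayer()
pole.add_link(applicant, application)
pole.add_link(application, tickets)
pole.add_link(applicant, address)
print(pole.neighbours(applicant))
```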

Page 11: Assurance Scoring Pydata London 2016

11Copyright © Capgemini 2012. All Rights Reserved

Presentation Title | Date

Assurance Scoring

Page 12: Assurance Scoring Pydata London 2016


Machine learning

[Diagram: machine learning framework]

Stages: Input Data; Vector Build; Manipulate, Explore Data; Feature extraction and selection (Transform, Selection); Model Building (Model: Training, Validation, Test)

Variety of output files: logs, graphics, pickle models, etc.

Testing: unit tests, monitoring tests and integration tests

Framework: structure, flexibility, consistency

Page 13: Assurance Scoring Pydata London 2016


Machine learning : Feature Engineering

SQL, Python

Transform

Explore

Select

Ask questions, validate

Refine features

• Feature Extraction

• Data exploration

• Feature selection

Historical Data

Page 14: Assurance Scoring Pydata London 2016


Machine Learning: Model Building

Training

Validation

Test

Split Datasets

Build Models

Hyper-parameter tuning

Selected features

Models

Training results

Validation results

Test results

Compare Models
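The "Split Datasets" step could look something like the sketch below, which splits row indices into training, validation and test sets separately for each class so that the rare positive class appears in all three. This is an illustration only: the 60/20/20 proportions, the `stratified_split` helper and the toy labels are assumptions, and the talk does not say how the split was actually done.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split row indices into (train, validation, test), class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_valid = round(len(indices) * fractions[1])
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # whatever is left over
    return train, valid, test


# Toy labels: 10 known resellers (1) among 1,000 applicants.
labels = [1] * 10 + [0] * 990
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
print("positives in train:", sum(labels[i] for i in train))
```

Hyper-parameters would then be tuned against the validation set, with the test set held back for the final model comparison.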

Page 15: Assurance Scoring Pydata London 2016


Low risk? High risk? Depends on classifier’s threshold

• True positives: applications the model correctly classifies as high risk

• True negatives: applications the model correctly classifies as low risk

• False positives: applications the model scores as high risk but that were in fact low risk

• False negatives: applications the model scores as low risk but that were in fact high risk
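To make the dependence on the threshold concrete, the sketch below counts the four outcomes for the same set of scores at two different thresholds. The scores, the labels and the `confusion_counts` helper are made up for illustration.

```python
def confusion_counts(scores, is_high_risk, threshold):
    """Count TP/TN/FP/FN when 'score >= threshold' means predicted high risk."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for score, actual in zip(scores, is_high_risk):
        predicted = score >= threshold
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1
        elif not predicted and actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical model scores and known outcomes for six applications.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
actual = [True, False, True, False, False, False]

low_threshold = confusion_counts(scores, actual, threshold=0.25)
high_threshold = confusion_counts(scores, actual, threshold=0.5)
print("threshold 0.25:", low_threshold)
print("threshold 0.50:", high_threshold)
```

Raising the threshold here removes a false positive but introduces a false negative, that is, a risky application that would land in the low-risk group.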

Page 16: Assurance Scoring Pydata London 2016


Assurance Scoring

Page 17: Assurance Scoring Pydata London 2016


Business Rules

Identifying fraud has often been done using deterministic rules

Look for transactions near a threshold or at the end of the day

Primarily data queries on your feature vector

Olympics example – anyone applying for more than £10,000 worth of tickets
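Since business rules are essentially queries on the feature vector, they can be sketched as named predicates applied to each application. The field names (`order_value_gbp`, `submitted_at`), the 23:00 cut-off for "end of the day" and the rule names are hypothetical; only the £10,000 example comes from the slide.

```python
from datetime import datetime

# Each rule is a named predicate over one application's features.
RULES = {
    "order_over_10k": lambda app: app["order_value_gbp"] > 10_000,
    "end_of_day": lambda app: app["submitted_at"].hour >= 23,  # assumed cut-off
}


def triggered_rules(application):
    """Names of all rules that fire for this application."""
    return [name for name, rule in RULES.items() if rule(application)]


# Hypothetical applications.
applications = [
    {"id": "A1", "order_value_gbp": 250,
     "submitted_at": datetime(2016, 5, 8, 14, 30)},
    {"id": "A2", "order_value_gbp": 12_500,
     "submitted_at": datetime(2016, 5, 8, 23, 55)},
]

flags = {app["id"]: triggered_rules(app) for app in applications}
print(flags)
```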

Page 18: Assurance Scoring Pydata London 2016


Assurance Scoring

Page 19: Assurance Scoring Pydata London 2016


Anomaly Detection

Use the training data to create a baseline of applications by postcode (say)

If a particular postcode has a larger than expected number of applications, then those cases are pushed into the high-risk bucket
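One simple way to express the postcode baseline, assuming nothing about the method actually used in the talk: count applications per postcode in the training data, scale those counts to the size of the current batch, and flag any postcode whose observed count is well above its expected count. The `ratio` and `min_count` parameters and the postcode data are illustrative assumptions.

```python
from collections import Counter


def anomalous_postcodes(history, current, ratio=2.0, min_count=5):
    """Postcodes whose current volume exceeds `ratio` times the baseline."""
    baseline = Counter(history)
    observed = Counter(current)
    scale = len(current) / len(history)  # adjust for different batch sizes

    flagged = set()
    for postcode, count in observed.items():
        expected = baseline.get(postcode, 0) * scale
        # Postcodes never seen before have expected == 0, so they are
        # flagged as soon as they reach min_count applications.
        if count >= min_count and count > ratio * expected:
            flagged.add(postcode)
    return flagged


# Hypothetical postcode districts: historical vs. current applications.
history = ["SW1A"] * 50 + ["E20"] * 30 + ["M1"] * 20
current = ["SW1A"] * 45 + ["E20"] * 10 + ["M1"] * 45

flagged = anomalous_postcodes(history, current)
print(flagged)
```

A fixed ratio is the crudest possible test; a statistical test on the counts would be a natural refinement, but that goes beyond what the slide states.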

Page 20: Assurance Scoring Pydata London 2016


Assurance Scoring

Page 21: Assurance Scoring Pydata London 2016


Graph Links - Matching

Key part of assurance scoring – bringing data together from disparate sources

Probability of Match: 80%

Attribute       | Data Source 1 | Data Source 2
Name            | Matt Thomson  | Matthew Thosmon
Phone Number    | 07123 456 789 | 07123 456 798
Favourite Sport | Football      | Cricket
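As a rough illustration of fuzzy record matching, the sketch below compares two records attribute by attribute with the standard library's `difflib.SequenceMatcher` and combines the similarities with weights. The weights and the `match_score` helper are assumptions, and the resulting score is a similarity measure, not a calibrated probability such as the 80% quoted on the slide.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each attribute contributes to the score.
WEIGHTS = {"name": 0.5, "phone": 0.4, "favourite_sport": 0.1}


def similarity(a, b):
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec1, rec2, weights=WEIGHTS):
    """Weighted average of per-attribute similarities."""
    return sum(w * similarity(rec1[key], rec2[key])
               for key, w in weights.items())


source_1 = {"name": "Matt Thomson", "phone": "07123 456 789",
            "favourite_sport": "Football"}
source_2 = {"name": "Matthew Thosmon", "phone": "07123 456 798",
            "favourite_sport": "Cricket"}
unrelated = {"name": "Jane Smith", "phone": "07999 111 222",
             "favourite_sport": "Tennis"}  # made-up record for contrast

print(round(match_score(source_1, source_2), 2))
print(round(match_score(source_1, unrelated), 2))
```

Pairs scoring above some chosen cut-off could then become links between entities, which is where the graph analysis picks up.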

Page 22: Assurance Scoring Pydata London 2016


Assurance Scoring

Page 23: Assurance Scoring Pydata London 2016


Jupyter Notebook

Page 24: Assurance Scoring Pydata London 2016


Further Details

[email protected] / @MattGThomson

Assurance Scoring brochure: http://ow.ly/4nbEUI

Blogs:

Introduction: https://www.capgemini.com/node/1380596

Integrating multiple techniques: http://bit.ly/24BmszV

Machine Learning: http://bit.ly/1QTMGnq

More coming soon!

Page 25: Assurance Scoring Pydata London 2016


We’re Hiring!

Data Science: https://www.uk.capgemini.com/careers/jobs/data-scientist-0

Big Data Engineer: https://www.uk.capgemini.com/careers/jobs/big-data-engineer

Data Visualisation Analyst: https://www.uk.capgemini.com/careers/jobs/data-visualisation-analyst

[email protected]

Page 26: Assurance Scoring Pydata London 2016

The information contained in this presentation is proprietary. © 2012 Capgemini. All rights reserved.

www.capgemini.com

About Capgemini

With more than 120,000 people in 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2011 global revenues of EUR 9.7 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model.

Rightshore® is a trademark belonging to Capgemini