exploratory data analysis

Exploratory Data Analysis, Kathirmani Sukumar, Data Scientist @ Gramener

Upload: gramener

Posted on 16-Apr-2017

Category: Data & Analytics


TRANSCRIPT

Page 1: Exploratory data analysis

Exploratory Data Analysis

Kathirmani Sukumar

Data Scientist @ Gramener

Page 2: Exploratory data analysis

How do I start doing analysis?

Page 3: Exploratory data analysis

Exploratory Data Analysis might help you…!!!

Page 4: Exploratory data analysis

CASE STUDIES

Page 5: Exploratory data analysis

DETECTING FRAUD

“We know meter readings are incorrect, for various reasons.

We don’t, however, have the concrete proof we need to start the process of meter reading automation.

Part of our problem is the volume of data that needs to be analysed. The other is our inexperience with the tools or analyses needed to identify such patterns.”

ENERGY UTILITY

Page 6: Exploratory data analysis

AN ENERGY UTILITY DETECTED BILLING FRAUD

An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available.

Most fraud detection software failed to load the data, and sampled data revealed little or no insight.

Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh).

This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries.

Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.

This can happen in one of two ways.

First, people may be monitoring their usage very carefully, and turning off their lights and fans the instant their usage hits the slab boundary.

Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that the reading stays exactly at the slab boundary, giving them the advantage of a lower price.
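The check described on this slide can be sketched in a few lines of Python: build the frequency distribution of readings, then compare the count at each slab boundary with the counts just around it. This is only an illustrative sketch. The function name `boundary_spikes`, the `window` parameter, the boundary at 100 units and the sample readings are all assumptions made for the example, not the utility's data or the method actually used.

```python
from collections import Counter

def boundary_spikes(readings, boundaries, window=5):
    """For each slab boundary, return the ratio of the count at the boundary
    to the average count of the `window` readings on either side of it."""
    freq = Counter(readings)  # frequency distribution: units -> number of customers
    spikes = {}
    for b in boundaries:
        # Counts for the neighbouring usage levels, excluding the boundary itself
        neighbours = [freq[b + d] for d in range(-window, window + 1) if d != 0]
        baseline = sum(neighbours) / len(neighbours)
        if baseline == 0:
            spikes[b] = float("inf") if freq[b] else 0.0
        else:
            # A ratio well above 1 means far more readings at the boundary than nearby
            spikes[b] = freq[b] / baseline
    return spikes

# Made-up readings in units (kWh), with a hypothetical slab boundary at 100
readings = [95, 96, 97, 98, 99, 100, 100, 100, 100, 100, 101, 102, 103, 104, 105]
print(boundary_spikes(readings, boundaries=[100]))
```

A ratio close to 1 would suggest nothing unusual at the boundary. A large ratio flags the kind of spike the histogram makes visible, without saying which of the two explanations is behind it.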

Page 7: Exploratory data analysis

PREDICTING MARKS

“What determines a child’s marks?

Do girls score better than boys?

Does the choice of subject matter?

Does the medium of instruction matter?

Does community or religion matter?

Does their birthday matter?

Does the first letter of their name matter?”

EDUCATION

Page 8: Exploratory data analysis

TN CLASS X: ENGLISH

[Chart: x-axis 0 to 99 (ticks every 3), y-axis 0 to 40,000 (ticks every 5,000)]

Page 9: Exploratory data analysis

TN CLASS X: SOCIAL SCIENCE

[Chart: x-axis 0 to 99 (ticks every 3), y-axis 0 to 40,000 (ticks every 5,000)]

Page 10: Exploratory data analysis

TN CLASS X: LANGUAGE

[Chart: x-axis 0 to 99 (ticks every 3), y-axis 0 to 40,000 (ticks every 5,000)]

Page 11: Exploratory data analysis

TN CLASS X: SCIENCE

[Chart: x-axis 0 to 99 (ticks every 3), y-axis 0 to 40,000 (ticks every 5,000)]

Page 12: Exploratory data analysis

TN CLASS X: MATHEMATICS

[Chart: x-axis 0 to 99 (ticks every 3), y-axis 0 to 40,000 (ticks every 5,000)]

Page 13: Exploratory data analysis

ICSE 2013 CLASS XII: TOTAL MARKS

Page 14: Exploratory data analysis

CBSE 2013 CLASS XII: ENGLISH MARKS

Page 15: Exploratory data analysis

Based on the results of the 20 lakh students who took the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.

June-borns score the lowest

The marks shoot up for Aug-borns

… and peak for Sep-borns

120 marks out of 1,200 explainable by month of birth

An identical pattern was observed in 2009 and 2010…

… and across districts, gender, subjects, and class X & XII.

“It’s simply that in Canada the eligibility cut-off for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.”

-- Malcolm Gladwell, Outliers
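The month-of-birth comparison above boils down to grouping students by birth month and comparing average marks. A minimal sketch follows, assuming each record is a `(birth_month, total_marks)` pair. The function name `mean_marks_by_month` and the sample records are hypothetical and are not the Tamil Nadu results.

```python
from collections import defaultdict
from statistics import mean

def mean_marks_by_month(students):
    """students: iterable of (birth_month, total_marks) pairs, month 1-12.
    Returns {month: average total marks}, ordered by month."""
    by_month = defaultdict(list)
    for month, marks in students:
        by_month[month].append(marks)
    return {month: mean(marks) for month, marks in sorted(by_month.items())}

# Made-up records (totals out of 1,200), for illustration only
students = [(6, 700), (6, 720), (8, 780), (8, 800), (9, 800), (9, 820)]
averages = mean_marks_by_month(students)

# Gap between the best and worst months, the figure the slide reports
gap = max(averages.values()) - min(averages.values())
print(averages)
print(gap)
```

Repeating the same grouping per year, district, gender or subject is how one would check whether the pattern holds across those cuts, as the slide says it does.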

Page 16: Exploratory data analysis

LET’S LOOK AT 15 YEARS OF US BIRTH DATA

This is a dataset (1975 – 1990) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known.

For example,
• Are birthdays uniformly distributed?
• Do doctors or parents exercise the C-section option to move dates?
• Is there any day of the month that has unusually high or low births?
• Are there any months with relatively high or low births?

Very high births in September. But this is fairly well known. Most conceptions happen during the winter holiday season.

Relatively few births during the Christmas and Thanksgiving holidays, as well as New Year and Independence Day.

Most people prefer not to have children on the 13th of any month, given that it’s considered an unlucky day.

Some special days like April Fool’s Day are avoided, but Valentine’s Day is quite popular.

[Chart legend: more births / fewer births … on average, for each day of the year (from 1975 to 1990)]
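The "more births / fewer births on average, for each day of the year" view can be approximated by averaging births per calendar date across years and dividing by the overall daily average. The sketch below assumes records shaped as `(year, month, day, births)`. That layout, the function name `daily_birth_index` and the sample numbers are assumptions for illustration, not the actual dataset.

```python
from collections import defaultdict

def daily_birth_index(records):
    """records: iterable of (year, month, day, births).
    Returns {(month, day): average births on that date / overall daily average}."""
    totals = defaultdict(int)
    years_seen = defaultdict(int)
    for _year, month, day, births in records:
        totals[(month, day)] += births
        years_seen[(month, day)] += 1
    # Average births for each calendar date across the years it appears in
    averages = {date: totals[date] / years_seen[date] for date in totals}
    overall = sum(averages.values()) / len(averages)
    # Index above 1: more births than a typical day; below 1: fewer
    return {date: avg / overall for date, avg in averages.items()}

# Made-up counts for two dates over two years
records = [
    (1975, 9, 12, 110), (1976, 9, 12, 130),
    (1975, 9, 13, 80), (1976, 9, 13, 80),
]
index = daily_birth_index(records)
print(index)
```

Plotting this index on a month-by-day grid gives a calendar heatmap in which dips on holidays or on the 13th would stand out.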

Page 17: Exploratory data analysis

THE PATTERN IN INDIA IS QUITE DIFFERENT

This is a birth date dataset that’s obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns.

For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?

Very few children are born in August and the months that follow. Most births are concentrated in the first half of the year.

We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month – that is, round numbered dates.

Such round numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission.

[Chart legend: more births / fewer births … on average, for each day of the year (from 2007 to 2013)]
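One simple way to quantify the round-numbered-date pattern is to compare the share of recorded birthdays that fall on the 5th, 10th, 15th, 20th and 25th with the share expected if births were spread evenly over the year. The following is a rough sketch under that uniformity assumption. The function name `round_day_excess` and the sample days are hypothetical.

```python
from collections import Counter

ROUND_DAYS = (5, 10, 15, 20, 25)

def round_day_excess(birth_days, round_days=ROUND_DAYS):
    """birth_days: iterable of day-of-month integers (1-31).
    Returns (observed share on round days, share expected under uniform births)."""
    counts = Counter(birth_days)
    total = sum(counts.values())
    observed = sum(counts[d] for d in round_days) / total
    # Rough baseline: each round day occurs once in all 12 months of a 365-day year.
    # This only holds for days up to the 28th and ignores leap years.
    expected = len(round_days) * 12 / 365
    return observed, expected

# Made-up days of the month from ten hypothetical admission records
birth_days = [5, 5, 10, 15, 20, 25, 1, 2, 3, 4]
observed, expected = round_day_excess(birth_days)
print(observed, expected)
```

An observed share far above the baseline is a prompt to investigate, not proof of fraud by itself. The slide's interpretation rests on knowing how school admission cut-offs work.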

Page 18: Exploratory data analysis

EDA PROCESS

UNDERSTAND
• Identify relevant data & sources
• Map context
• Prepare metadata
• Label & clean data

DERIVE
• New metrics from business
• Metrics from patterns (binning, comparison, ratios, attributes, transformation)

QUESTION
• Inputs from stakeholders who would benefit from the analysis
• Based on patterns (top groups by a metric, maximise a metric, bivariate relationships)

INTERACT
• Filter by a group value
• Compare against a value or a derived metric
• Sort by a dimension
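To make the DERIVE and INTERACT steps concrete, here is a small sketch in plain Python: it derives a ratio metric and a binned version of it, then filters, compares and sorts. The rows, the column names (`district`, `students`, `passed`) and the bin cut-offs are all made up for illustration and do not come from the talk.

```python
# Made-up rows for illustration; not data from the talk
rows = [
    {"district": "A", "students": 200, "passed": 150},
    {"district": "B", "students": 100, "passed": 90},
    {"district": "C", "students": 300, "passed": 150},
]

def derive(rows):
    """DERIVE: add a ratio metric (pass_rate) and a binned version of it (band)."""
    derived = []
    for row in rows:
        pass_rate = row["passed"] / row["students"]  # ratio
        # Binning with hypothetical cut-offs
        if pass_rate >= 0.8:
            band = "high"
        elif pass_rate >= 0.6:
            band = "medium"
        else:
            band = "low"
        derived.append({**row, "pass_rate": pass_rate, "band": band})
    return derived

data = derive(rows)

# INTERACT: filter by a group value
low_band = [r["district"] for r in data if r["band"] == "low"]

# INTERACT: compare against a derived metric (the overall pass rate)
overall = sum(r["passed"] for r in rows) / sum(r["students"] for r in rows)
above_overall = [r["district"] for r in data if r["pass_rate"] > overall]

# INTERACT: sort (here by the derived pass rate, highest first)
ranked = [r["district"] for r in sorted(data, key=lambda r: r["pass_rate"], reverse=True)]

print(low_band, above_overall, ranked)
```

The QUESTION step then works on outputs like these, for example asking why one group sits in the low band while similar groups do not.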

Page 19: Exploratory data analysis

LIVE DEMO

Page 20: Exploratory data analysis

THANK YOU

Page 21: Exploratory data analysis

Reaching out…

Kathirmani Sukumar

Email: [email protected]

Twitter: @skathirmani

LinkedIn: https://in.linkedin.com/in/skathirmani