data visualization for social problems
S Anand, Chief Data Scientist, Gramener
Most discussions of decision-making assume that only senior executives make decisions or that only senior executives’ decisions matter. This is a dangerous mistake…
Peter F Drucker
Data generation and analysis are not sufficient.
Consuming it as a team and acting in cohesion is.
THERE ARE MANY WAYS TO AID DATA CONSUMPTION
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
[Figure: the four modes placed on two axes, Creator effort and Consumer effort, each running from low effort to high effort]
EDUCATION
PREDICTING MARKS
What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?
[Chart: TN CLASS X: ENGLISH. Horizontal axis 0 to 100, vertical axis 0 to 40,000]
[Chart: TN CLASS X: SOCIAL SCIENCE. Horizontal axis 0 to 100, vertical axis 0 to 40,000]
[Chart: TN CLASS X: MATHEMATICS. Horizontal axis 0 to 100, vertical axis 0 to 40,000]
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons.
We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
ENERGY UTILITY
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
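A minimal sketch of the frequency count behind such a plot. The readings are two customer rows copied from the readings table in this deck; treating 100 and 200 units as tariff slab boundaries is an assumption made for illustration, since the deck does not list the slabs, and the function names are hypothetical.

```python
from collections import Counter


def reading_frequencies(readings):
    """Count how often each meter reading value occurs."""
    return Counter(readings)


def share_on_boundaries(readings, boundaries):
    """Fraction of readings that fall exactly on a slab boundary."""
    on_boundary = sum(1 for r in readings if r in boundaries)
    return on_boundary / len(readings)


# Two customers' monthly readings (Apr-10 to Mar-11) from the table in this deck
readings = [
    217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200,
    0, 100, 27, 100, 50, 100, 100, 100, 100, 100, 70, 100,
]

# ASSUMPTION: slab boundaries at 100 and 200 units (not stated in the deck)
ASSUMED_BOUNDARIES = {100, 200}

frequencies = reading_frequencies(readings)
boundary_share = share_on_boundaries(readings, ASSUMED_BOUNDARIES)

print(frequencies.most_common(2))
print(f"{boundary_share:.0%} of readings sit exactly on an assumed boundary")
```

With genuine consumption one would expect readings to spread across neighbouring values rather than pile up on a handful of round numbers.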
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
217 219 200 200 200 200 200 200 200 350 200 200
250 200 200 200 201 200 200 200 250 200 200 150
250 150 150 200 200 200 200 200 200 200 200 150
150 200 200 200 200 200 200 200 200 200 200 50
200 200 200 150 180 150 50 100 50 70 100 100
100 100 100 100 100 100 100 100 100 100 110 100
100 150 123 123 50 100 50 100 100 100 100 100
0 111 100 100 100 100 100 100 100 100 50 50
0 100 27 100 50 100 100 100 100 100 70 100
1 1 1 100 99 50 100 100 100 100 100 100
This happens with specific customers, not randomly. Here are such customers’ meter readings.
Section Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%
Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%
Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%
Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%
Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%
Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%
Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%
Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%
Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%
If we define the “extent of fraud” as the percentage excess of the 100-unit meter reading, the value varies considerably across sections and time, with some explainable anomalies.
[Annotations on the table: New section manager arrives … and is transferred out]
Why would these happen?
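One possible reading of the “extent of fraud” definition, sketched under stated assumptions: the count of readings at exactly 100 units is compared with the average count of nearby values, and the excess is expressed as a percentage of that baseline. The baseline window (`width`), the function names and the sample records are all hypothetical; the deck does not say how the expected count was estimated.

```python
from collections import Counter, defaultdict


def extent_of_fraud(readings, slab=100, width=5):
    """Percentage by which the count of readings exactly at `slab` exceeds
    the average count of the nearby values (slab - width .. slab + width,
    excluding the slab itself). Returns None when there is no baseline."""
    counts = Counter(readings)
    nearby = [v for v in range(slab - width, slab + width + 1) if v != slab]
    expected = sum(counts[v] for v in nearby) / len(nearby)
    if expected == 0:
        return None  # no nearby readings, so no baseline to compare against
    return 100 * (counts[slab] - expected) / expected


def extent_by_section(records, slab=100, width=5):
    """Group (section, reading) records by section and score each section."""
    by_section = defaultdict(list)
    for section, reading in records:
        by_section[section].append(reading)
    return {s: extent_of_fraud(r, slab, width) for s, r in by_section.items()}


# Invented records: section A has three readings at 100, section B has one
records = (
    [("A", v) for v in range(95, 106)]
    + [("A", 100)] * 2
    + [("B", v) for v in range(95, 106)]
)
scores = extent_by_section(records)
print(scores)
```

Under this definition a negative value, like some cells in the table above, means fewer readings at 100 units than the neighbouring values would suggest.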
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
… to inform and to entertain
[Figure: chart labelled with student names: Jain, Harini, Shweta, Sneha, Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya]
Based on the results of the 20 lakh students taking the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.
June-borns score the lowest
The marks shoot up for Aug-borns
… and peak for Sep-borns
120 marks out of 1200 explainable by month of birth
An identical pattern was observed in 2009 and 2010…
… and across districts, gender, subjects, and class X & XII.
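The month-of-birth comparison can be sketched as a simple group-and-average. The records below are invented for illustration and are not taken from the Tamil Nadu results; the function name and the `(birth_date, total_marks)` record layout are assumptions.

```python
from collections import defaultdict
from datetime import date


def mean_marks_by_birth_month(students):
    """Average total marks for each birth month (1 = Jan ... 12 = Dec).

    `students` is an iterable of (birth_date, total_marks) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of marks, count]
    for birth_date, marks in students:
        bucket = totals[birth_date.month]
        bucket[0] += marks
        bucket[1] += 1
    return {month: s / n for month, (s, n) in sorted(totals.items())}


# Invented sample records, marks out of 1,200
students = [
    (date(1995, 6, 10), 700),
    (date(1995, 6, 20), 720),
    (date(1995, 8, 1), 790),
    (date(1995, 9, 5), 800),
    (date(1995, 9, 15), 820),
]

means = mean_marks_by_birth_month(students)
gap = max(means.values()) - min(means.values())
print(means, gap)
```

Running the same grouping separately per year, district, gender or subject is one way to check whether a pattern repeats across those cuts.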
“It’s simply that in Canada the eligibility cutoff for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.”
-- Malcolm Gladwell, Outliers
LET’S LOOK AT 15 YEARS OF US BIRTH DATA
This is a dataset (1975 – 1990) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known.
For example,
• Are birthdays uniformly distributed?
• Do doctors or parents exercise the C-section option to move dates?
• Is there any day of the month that has unusually high or low births?
• Are there any months with relatively high or low births?
Very high births in September. But this is fairly well known. Most conceptions happen during the winter holiday season.
Relatively few births during the Christmas and Thanksgiving holidays, as well as New Year and Independence Day.
Most people prefer not to have children on the 13th of any month, given that it is considered an unlucky day.
Some special days like April Fool’s Day are avoided, but Valentine’s Day is quite popular.
Legend: more births to fewer births, on average, for each day of the year (from 1975 to 1990)
THE PATTERN IN INDIA IS QUITE DIFFERENT
This is a birth date dataset that’s obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns.
For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?
Very few children are born in the month of August, and thereafter. Most births are concentrated in the first half of the year.
We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month, that is, round-numbered dates.
Such round-numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission.
Legend: more births to fewer births, on average, for each day of the year (from 2007 to 2013)
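A rough check for the round-numbered-date pattern, assuming a list of recorded birth dates: compare the share of births on the 5th, 10th, 15th, 20th and 25th with the share expected if births were spread evenly over a 365-day year. The sample dates are invented and the names are hypothetical.

```python
from collections import Counter
from datetime import date

ROUND_DAYS = {5, 10, 15, 20, 25}

# Five round-numbered days in each of 12 months, out of 365 days
UNIFORM_SHARE = 5 * 12 / 365


def round_day_share(birth_dates):
    """Fraction of recorded births that fall on a round-numbered day."""
    days = Counter(d.day for d in birth_dates)
    on_round = sum(days[day] for day in ROUND_DAYS)
    return on_round / len(birth_dates)


# Invented sample of recorded birth dates
birth_dates = [
    date(2008, 1, 5), date(2008, 2, 10), date(2008, 3, 15),
    date(2008, 4, 20), date(2008, 5, 25), date(2008, 6, 10),
    date(2008, 7, 3), date(2008, 8, 17), date(2008, 9, 22),
    date(2008, 10, 29),
]

share = round_day_share(birth_dates)
print(f"observed {share:.1%} vs {UNIFORM_SHARE:.1%} under uniform births")
```

A share far above the uniform baseline is only a flag for closer inspection; it does not show on its own why the dates were recorded that way.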
THIS ADVERSELY IMPACTS CHILDREN’S MARKS
It’s a well-established fact that older children tend to do better at school in most activities. Since many children have had their birth dates brought forward, these younger children suffer.
Children “born” on the 1st, 5th, 10th, 15th etc. of the month tend to score lower marks on average.
Legend: higher marks to lower marks, on average, for children born on a given day of the year (from 2007 to 2013)
Children “born” on round-numbered days score lower marks on average, due to a higher proportion of younger children
More contestants did not reduce the winner margin
Karnataka, Assembly Elections 2008
[Chart: winner margin (0% to 60%) against # contestants (0 to 18)]
More contestants did reduce the runner-up margin
Karnataka, Assembly Elections 2004
[Chart: runner-up margin (0% to 60%) against # contestants (0 to 18)]
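One way to put a number on statements like the two above is a Pearson correlation between the number of contestants and the margin in each constituency. This is a sketch on invented figures, not the Karnataka data, and the deck does not say which measure was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented constituency figures (margins in percentage points)
contestants = [2, 4, 6, 8]
winner_margin = [10, 5, 5, 10]      # no linear relationship
runner_up_margin = [20, 15, 10, 5]  # falls steadily as contestants rise

r_winner = pearson(contestants, winner_margin)
r_runner_up = pearson(contestants, runner_up_margin)
print(r_winner, r_runner_up)
```

A coefficient near zero is consistent with "did not reduce", and a clearly negative one with "did reduce", though a scatter plot remains the better guide to the shape of the relationship.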
What topics did parties focus on during questions?
Karnataka, 2008-2012
[Figure: departments raised in questions, marked as BJP focus, JD(S) focus or INC focus. Departments shown: Adult Education; Administrative Reforms; Agricultural Marketing; Agriculture; Animal Husbandry; Cooperative; Excise; Finance; Fisheries; Fisheries & Inland water transport; Food & Civil Supplies; Forest; Fuel; Haz & Wakf; Health and family welfare; Higher Education; Home; Horticulture; Housing; Information & Technology; Kannada & Culture; Labour; Law & Human Rights; Major & Medium Industries; Medical Education; Medium and Large Industries; Mines & Geology; Minor Irrigation; Muzrai; P.W.D.; Parliamentary Affairs and Human Rights; Planning; Planning and Statistics; Primary and Secondary Education; Primary Education; Prison; Public Library; Revenue; Rural Development and Panchayat Raj; Rural Water Supply; Rural Water Supply and Sanitation; Sericulture; Small Scale Industries; Small Industries; Social Welfare; Sugar; Textile; Tourism; Transport; Transportation; Urban Development; Water Resources; Woman & Child Development; Youth and Sports; Youth Service & Sports]
What topics did the young & old focus on during questions?
Karnataka, 2008-2012
[Figure: departments raised in questions (P.W.D., Health and family welfare, Revenue, Rural Development and Panchayat Raj, Social Welfare, Urban Development, Water Resources and others); legend: Young, Old]
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
… to connect the dots for your readers
https://gramener.com/aapdonations
EXPLORING THE MAHABHARATA
How does the Mahabharata, one of the largest epics with 1.8 million words, lend itself to text analytics?
Can this ‘unstructured data’ be processed to extract analytical insights?
What does sentiment analysis of this tome convey?
Is there a better way to explore relations between characters?
How can closeness of characters be analysed & visualized?
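One simple proxy for the closeness of characters is how often two names appear in the same passage. The sketch below counts such co-occurrences; it is an illustrative assumption, not necessarily the method used for this analysis, and the passages are invented sentences rather than text from the epic.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(passages, characters):
    """Count, for every pair of characters, the passages naming both."""
    pairs = Counter()
    for passage in passages:
        words = set(re.findall(r"[a-z]+", passage.lower()))
        present = sorted(c for c in characters if c.lower() in words)
        for pair in combinations(present, 2):
            pairs[pair] += 1
    return pairs


# Invented passages, for illustration only
passages = [
    "Arjuna spoke to Krishna before the battle.",
    "Krishna advised Arjuna and Bhima.",
    "Bhima fought Duryodhana.",
]
characters = ["Arjuna", "Krishna", "Bhima", "Duryodhana"]

pairs = cooccurrence(passages, characters)
print(pairs.most_common(1))
```

The pair counts can then feed a network diagram or a matrix, with heavier links for characters that appear together more often. Matching on whole words keeps the sketch short but misses alternate names and epithets for the same character.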
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
… to allow your users to tell stories
VISUALISATION IS IMPERATIVE FOR DATA → INSIGHTS → ACTION
Spot the unusual. Communicate patterns. Simplify decisions.
We handle terabyte-size data via non-traditional analytics and visualise it in real-time.
A data analytics and visualisation company
gramener.com
for more examples