Data Science: Best Practice & Governance in Analytics Sayara Beg 30th April 2013


Data Science: Best Practice & Governance in Analytics

Sayara Beg

30th April 2013

Operational Research Consultancy 2

Agenda

• Data Science
  – Role of the Data Scientist
  – What does a data scientist do?
• Best Practice & Governance in Data Science
  – Unintentional mistakes, or fraud?
  – Issue 1: Question of Reproducibility
  – Issue 2: Applying the Scientific Method

30/04/2014

What came first?

The Data or The Question


The Role of the Data Scientist

• Sexiest Job of the 21st Century – HBR Oct 2012
• A superman or superwoman?
  – A Mathematician?
  – A Computer Programmer?
  – A Graphic Designer?
  – All rolled into one?


A tedious job?

• HBR 2014 – “A Data Scientist’s job is tedious..”
• Majority of the ‘scientific analysis’ time is spent on:
  – Data Discovery (extraction)
  – Data Wrangling (interpretation)
  – Data Munging (transformation)
  – Data Cleansing
  – Data Profiling

• Less time spent modelling & visualising


What does a Data Scientist do?

– Scientific Experts: Statistics, Mathematics, O.R. Modelling, Physics
  • Identify algorithms, gather insights, discover patterns, clusters
– Tools of the Trade: Data, Hardware, Software, Programming
  • Access, capture, prepare, cleanse large data sets
– Interpersonal Skills: Communication, Presentation
  • Visual communication using colour, shape, size, quantity


Analysis – Structured Coding in SQL

• SQL code:

  SELECT id, name, reg_date, reg_address
  FROM employee
  GROUP BY id, name, reg_date, reg_address;

• Result:

  ID        Name    Reg_date     Reg_address
  32487493  Bloggs  24-Sep-1992  London
  98349435  Doe     07-Aug-1983  Munich


Exploratory Data Analysis (EDA) in ‘R’

# Goal: Toss a coin N times and compute the running proportion of heads.
N = 500  # Specify the total number of flips, denoted N

# Generate a random sample of N flips for a biased coin with P(heads) = 0.8
# (heads=1, tails=0):
set.seed(47405)
flipsequence = sample( x=c(0,1) , prob=c(0.2,0.8) , size=N , replace=TRUE )

# Compute the running proportion of heads:
r = cumsum( flipsequence )
n = 1:N  # n is a vector
runprop = r/n

# Graph the running proportion:
plot( n , runprop , type="o" , log="x" ,
      xlim=c(1,N) , ylim=c(0.0,1.0) , cex.axis=1.5 ,
      xlab="Flip Number" , ylab="Proportion Heads" , cex.lab=1.5 ,
      main="Running Proportion of Heads" , cex.main=1.5 )

# Plot a dotted horizontal line at y=0.8, just as a reference line:
lines( c(1,N) , c(0.8,0.8) , lty=3 )

# Display the beginning of the flip sequence:
flipletters = paste( c("T","H")[ flipsequence[1:10] + 1 ] , collapse="" )
displaystring = paste( "Flip Sequence = " , flipletters , "..." , sep="" )
text( 5 , 0.9 , displaystring , adj=c(0,1) , cex=1.3 )

# Display the relative frequency at the end of the sequence:
text( N , 0.3 , paste("End Proportion = ", runprop[N]) , adj=c(1,0) , cex=1.3 )


Technologies – Structured, Modelled

[Figure: Star and Snowflake schema diagrams, each showing Fact Tables linked to Dimension Tables]


Now - Big Data, Unstructured


Visualisation - before


Now – ‘tell a story’ in 20 secs


Best Practice & Governance

Is it a science?

• The Science Council's definition of science:

  “Science is the pursuit and application of knowledge and understanding of the natural and social world following a systematic methodology based on evidence....”


Worst Practice & Bad Governance

• Reinhart & Rogoff Scandal 2013
  – NBER 2010 paper: debt above a 90% debt-to-GDP threshold slows economic growth
  – Criticised by Herndon, Ash & Pollin, who discovered fundamental coding errors & missing data
  – Correcting these produced a significant change in the real average growth figure
• Potti Scandal 2010 (Duke University)
  – “Abundantly clear” that there was “manipulated data” behind the published research
• Investment Ponzi Scandal 2009 (Madoff Collapse), Subprime Mortgage Scandal 2008 (Lehman Bros Collapse), Accounting Frauds 2001 (Enron & WorldCom Collapse) etc.
• Daily Telegraph – Climategate Blog – Climate Change worst ever scientific scandal

There are lies, damned lies and then, there are statistics!
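To make the "coding errors & missing data" point concrete, here is a minimal Python sketch of how a row-selection slip can shift an average, plus a simple check that would flag it. The country labels, growth figures and function names are all hypothetical and chosen only for illustration; they are not taken from any of the studies named above.

```python
from statistics import mean

# Hypothetical growth rates (percent) for seven made-up countries. Not real data.
growth = {"A": 3.1, "B": 2.4, "C": -0.3, "D": 1.8, "E": 2.9, "F": 0.5, "G": 2.2}


def average_growth(rates, rows=None):
    """Average the rates, optionally restricted to the listed rows."""
    keys = list(rates) if rows is None else list(rows)
    return mean(rates[k] for k in keys)


def omitted_rows(rates, rows):
    """Return the rows present in the data but absent from the selection."""
    return sorted(set(rates) - set(rows))


# Simulate a selection that silently leaves out four of the seven rows.
selected = ["C", "D", "F"]

full_average = average_growth(growth)
partial_average = average_growth(growth, selected)
missing = omitted_rows(growth, selected)

print(f"Average over all rows:      {full_average:.2f}")
print(f"Average over selected rows: {partial_average:.2f}")
print(f"Rows left out of the selection: {missing}")
```

Reporting the omitted rows next to the headline number is one inexpensive way to let a reviewer spot an unintentional exclusion.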


Reproducibility

• Can the results be reproduced?
• What are the challenges?
  – Data is not static; its meaning and value are fluid
  – Analytics is often based on a ‘moment-in-time’
  – Can that ‘moment’ ever truly be reproduced?
  – Based on assumptions considered valid at the moment-in-time
  – Assumption validations must be robust!
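One way to pin down a ‘moment-in-time’ is to record a fingerprint of the data snapshot, the random seed and the stated assumptions alongside the result. The Python sketch below illustrates the idea under those assumptions; the `fingerprint` and `run_analysis` helpers, the manifest layout and the toy data are hypothetical, not part of the original talk.

```python
import hashlib
import json
import random


def fingerprint(records):
    """SHA-256 of a canonical JSON encoding of the data snapshot."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_analysis(records, seed):
    """Toy analysis: mean of a seeded random sample of three records."""
    rng = random.Random(seed)  # local generator, so the seed fully fixes the sample
    sample = rng.sample(records, k=3)
    return sum(r["value"] for r in sample) / len(sample)


# Hypothetical snapshot of the data as it stood when the analysis was run.
snapshot = [{"id": i, "value": i * 1.5} for i in range(10)]

# The manifest captures what is needed to rerun the 'moment'.
manifest = {
    "data_sha256": fingerprint(snapshot),
    "seed": 2014,
    "assumptions": ["values are recorded in the same unit"],
}

result_first = run_analysis(snapshot, manifest["seed"])
result_again = run_analysis(snapshot, manifest["seed"])

# If the data changes later, the fingerprint no longer matches the manifest.
changed = snapshot + [{"id": 10, "value": 99.0}]
data_unchanged = fingerprint(changed) == manifest["data_sha256"]

print(result_first == result_again, data_unchanged)
```

The same snapshot and seed give the same result, while any later change to the data is detectable because the fingerprint differs from the one stored in the manifest.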


The Scientific Method

• State your assumptions, articulate your question, establish your hypothesis.
• Document the steps you will take to validate, analyse and test your assumptions, questions and hypothesis.
• Extrapolate your conclusions: what results are you expecting to discover? What might or might not happen? Why?
• Document the actual results and your observations. Did they differ from your expectations? Why?
• Were your results peer reviewed? Were they reproducible? Auditable?
• Where is the actual data you used?

You may be a Data Analyst, but are you a Data Scientist?
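The checklist above can be kept as a small structured record, so that an analysis carries its own audit trail. The following Python sketch is one possible shape for such a record; the `AnalysisRecord` class, its field names and the example values are hypothetical illustrations, not a prescribed standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnalysisRecord:
    """Hypothetical audit record following the steps of the checklist."""
    question: str
    hypothesis: str
    assumptions: List[str]
    planned_steps: List[str]
    expected_result: float
    actual_result: Optional[float] = None
    data_location: Optional[str] = None
    peer_reviewed: bool = False

    def deviation(self) -> Optional[float]:
        """Actual minus expected result, or None if no result is documented."""
        if self.actual_result is None:
            return None
        return self.actual_result - self.expected_result

    def open_items(self) -> List[str]:
        """List the checklist items that are still unanswered."""
        items = []
        if self.actual_result is None:
            items.append("actual result not documented")
        if self.data_location is None:
            items.append("data location not recorded")
        if not self.peer_reviewed:
            items.append("not peer reviewed")
        return items


# Illustrative values only; the actual result is made up for the example.
record = AnalysisRecord(
    question="What proportion of flips are heads in the long run?",
    hypothesis="The running proportion approaches 0.8",
    assumptions=["flips are independent", "P(heads) is constant"],
    planned_steps=["simulate 500 flips", "plot the running proportion"],
    expected_result=0.8,
    actual_result=0.79,
)

print(record.deviation())
print(record.open_items())
```

An empty `open_items()` list would indicate that the result, the data location and the peer review have all been recorded.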


The O.R. Society - CSci

Recently granted a licence by the Science Council to offer Chartered Scientist registration.

You would be peer assessed and would re-register annually after completing CPD.

Would you be interested?
