Data Analysis Module ILRI Graduate Fellows skills training Nairobi 4th December 2013

Upload: ilri-jmaru

Post on 08-Jul-2015


Page 1: Module 4 data analysis

Data Analysis Module

ILRI Graduate Fellows skills training

Nairobi 4th December 2013

Page 2: Module 4 data analysis

Session Objectives

To be able to:

• Answer the research questions:

What results do you need to show and in what format (tables, graphs, charts etc.)

Selection of data analysis methods (principles and concepts)

• Identify, evaluate and apply data analysis packages – software available, open source

• Plan analysis of own data and carry out exploratory analysis

• Carry out formal data analysis using different tools and methods e.g. R

Page 3: Module 4 data analysis

Research Process

Project development

• Problem definition: definition of the problem domain & how the specific problem fits in

• Literature review: identification of gaps, appropriate methods & theory

• Objective & hypothesis: research will prove or disprove the hypothesis

• Study design: research strategy to be used, sample size, sampling frame

Implementation

• Sampling: sample selection

• Data collection: data collection tools

• Data management: database development and data cleaning

• Formal analysis: exploration, description, modelling & interpretation of statistical outputs

Communicating findings

• Reporting: choice of reporting media & format

• Publication: advice on presentation of results

• Data archiving or publication: data sharing media

Page 4: Module 4 data analysis

Data Analysis – Guiding Principles

Translating Research Questions / Objectives / Hypotheses into an ‘analysis plan’

• We used our research questions / objectives / hypotheses to design the study – experiment or survey, questions to ask & data to collect

• We use them again to plan the analysis – what differences do I need to show, what are my response variables, what types of model may I need to use etc.

• This is often a good time to draft tables and graphs which you think will help answer the questions

Page 5: Module 4 data analysis

Data Analysis – Multi-level Data Structures

• Many studies are designed at different levels and collect data at these multiple levels

• E.g. in experiments this could be animals or plots → blocks, and for surveys this may be animal → household → village → district.

• Another aspect of ‘levels’ is repeated measurements over time.

• These levels must not be forgotten when we reach the analysis stage – which level do we summarise each variable at? What is our ‘unit of analysis’?

• For formal analysis there are advanced methods that allow the data to be analysed in a way that allows for the multi-level data structure (including variation).

• The analysis can often be simplified into fewer dimensions by summarising particular aspects

• e.g. means between two points in time, the slope of the trend between two points, the value reached at the end.
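The idea of summarising repeated measurements down to one value per unit can be sketched in R as follows. This is a minimal illustration on simulated data; the data frame `growth` and its columns are assumptions made for the example, not course data.

```r
# Simulated repeated weights: 4 animals, each weighed at 5 time points.
# All names and values here are illustrative assumptions.
set.seed(1)
growth <- data.frame(
  animal = rep(paste0("A", 1:4), each = 5),
  week   = rep(0:4, times = 4)
)
growth$weight <- 20 + 1.5 * growth$week + rnorm(nrow(growth), sd = 0.5)

# Collapse each animal's series to one row: mean, slope of trend, final value.
per_animal <- do.call(rbind, lapply(split(growth, growth$animal), function(d) {
  data.frame(
    animal       = d$animal[1],
    mean_weight  = mean(d$weight),
    slope        = unname(coef(lm(weight ~ week, data = d))["week"]),
    final_weight = d$weight[which.max(d$week)]
  )
}))
print(per_animal)
```

After this step the unit of analysis is the animal (one row each), so simpler single-level methods can be applied to the summaries.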

Page 6: Module 4 data analysis

Data Analysis – Response & Explanatory variables

Terms are discipline specific:

Response ≡ Dependent variable ≡ y

Explanatory ≡ Independent variable ≡ x’s

• Explanatory variables can be continuous or discrete

• In epidemiology they can be ‘confounding’ or biological ‘interacting’*

• In economics they can be exogenous or endogenous

*Note that statistical confounding and interactions have different interpretations

Page 7: Module 4 data analysis

Data Analysis – Variable Types

• Variables can be:

CONTINUOUS or DISCRETE

• In analysis we may sometimes convert continuous data to discrete

• Both Response and Explanatory variables can be continuous or discrete.
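Converting a continuous variable to a discrete one can be done in R with `cut()`. The sketch below uses made-up weights, and the break points (20 and 30) are arbitrary assumptions chosen only to show the mechanics.

```r
# Illustrative weights in kg (made-up values).
weight <- c(12.5, 18.0, 24.3, 31.7, 40.2, 27.9)

# Convert the continuous variable into three ordered classes.
# Intervals are closed on the right: (-Inf, 20], (20, 30], (30, Inf).
weight_class <- cut(weight,
                    breaks = c(-Inf, 20, 30, Inf),
                    labels = c("light", "medium", "heavy"))

# Frequency table of the new discrete variable.
table(weight_class)
```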

Page 8: Module 4 data analysis

Data Analysis – Aim of Exploratory Analysis

• Data exploration is the first stage of any analysis of the data – people often jump straight to formal analysis and models, but it is at this stage that you will identify patterns and ‘odd’ data

• In some cases this may be 90% of the analysis you do on the data (e.g. Case Study 11)

• The more complicated the data set, the more interesting and necessary the exploratory phase becomes.

• With some expertise in data management we can highlight the important patterns within the data and list the types of statistical models and their forms that need to be fitted at the next stage.

Page 9: Module 4 data analysis

Data Analysis – Activities of Exploratory Analysis

Page 10: Module 4 data analysis

Data Analysis – Exploratory Analysis Methods (Continuous & Discrete Variables)

Tools for exploratory analysis:

• Means & ranges Case Study 2

• 1 & 2-way tables of means Case Study 3

• Frequency tables Case Study 3 & Case Study 8 – use of Excel pivot tables!

• Histograms Case Study 11

• Scatterplots Case Study 1 & Case Study 3

• Boxplots Case Study 3

• Bar charts & pie charts Case Study 11

• Trend graphs, survival curves Case Study 3

Identify patterns and unusual variables – e.g. outliers, zeros

Measures of variation – variance, standard deviation, confidence interval, standard error, CV

Transformation of variables?
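The measures of variation listed above can all be computed with base R. A minimal sketch follows, using made-up packed cell volume readings; the vector `pcv` is an assumption for illustration.

```r
# Illustrative packed cell volume (PCV) readings; the values are made up.
pcv <- c(28, 31, 25, 30, 27, 33, 29, 26)

n        <- length(pcv)
avg      <- mean(pcv)
variance <- var(pcv)              # sample variance (divisor n - 1)
std_dev  <- sd(pcv)               # standard deviation
std_err  <- std_dev / sqrt(n)     # standard error of the mean
cv_pct   <- 100 * std_dev / avg   # coefficient of variation, in percent

# 95% confidence interval for the mean, based on the t distribution.
ci95 <- avg + c(-1, 1) * qt(0.975, df = n - 1) * std_err

round(c(mean = avg, var = variance, sd = std_dev, se = std_err, cv = cv_pct), 2)
ci95
range(pcv)
```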

Page 11: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis

Exploratory analysis was the first part of the analysis of our research data – to gain an initial understanding of patterns that exist in the data and suggest further analysis needs

By the time we reach Confirmatory / Formal analysis we have refined our objectives, and these should clearly define exactly what type of further statistical analysis we need (and which models to use)

In exploratory analysis it is difficult to look at many variables at the same time – formal analysis allows us to do this and to see which variables are more important than others.

Page 12: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis (Some options…)

Explanatory / Independent Variable(s) | Discrete Response / Dependent Variable(s) | Continuous Response / Dependent Variable(s)
Discrete | Chi-square / Regression (Logistic – binary; Poisson – count) | T-Test / Analysis of Variance (/ Regression)
Continuous | Regression (Logistic – binary; Poisson – count) | Correlation / Linear Regression
Both | Regression (Logistic – binary; Poisson – count) | ANOVA / Linear Regression

(Non-parametric equivalents for small samples)
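Two of the options above can be tried in a few lines of R. This is only a sketch on simulated data: the variables `sex`, `infected` and `weight` are hypothetical, and `wilcox.test()` is shown as one common non-parametric counterpart of the t-test.

```r
set.seed(2)
# Simulated animals; all variables are hypothetical.
n        <- 80
sex      <- factor(sample(c("F", "M"), n, replace = TRUE))
infected <- factor(sample(c("no", "yes"), n, replace = TRUE))
weight   <- rnorm(n, mean = ifelse(sex == "M", 32, 30), sd = 3)

# Discrete response, discrete explanatory: chi-square test on a 2-way table.
chi <- chisq.test(table(sex, infected))
chi

# Continuous response, discrete explanatory with 2 levels: t-test.
tt <- t.test(weight ~ sex)
tt

# A non-parametric alternative to the t-test.
wt <- wilcox.test(weight ~ sex)
wt
```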

Page 13: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis (Some advanced options…)

• Mixed-effects models (REML) – incorporate random effects including spatial & temporal repeated measurements; better at managing data with many levels of hierarchy; used a lot in epidemiology and animal/plant genetics.

• Survival models.

• Multivariate (> 1 y) methods – can be both parametric & non-parametric.

• Proportional-odds models when the response is categorical with more than 2 categories.

Page 14: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis Concepts

The underlying concept in formal statistical modeling is:

Data = Pattern + Residual

‘Data’ are the raw data (responses) that you collected, or sometimes may be summaries derived from the raw data*. They could also be transformed values*

‘Pattern’ is all the variables (continuous or discrete) that are in the design of the study (e.g. treatment) or have been selected in your exploratory analysis as explaining some of the differences in the ‘Data’

‘Residual’ is the variation we can’t explain by the ‘Pattern’.

The aim of our formal analysis is to put as much of the variation as possible into the ‘Pattern’ while keeping the model as simple as possible…easy

*See earlier slides
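The decomposition Data = Pattern + Residual can be checked directly on a fitted model in R. The sketch below simulates its own data; the names `WEIGHT` and `PCV` are borrowed from the R examples in these slides, but the values are invented.

```r
set.seed(3)
# Simulated data; values are assumptions made for the illustration.
WEIGHT <- runif(30, min = 20, max = 40)
PCV    <- 15 + 0.4 * WEIGHT + rnorm(30, sd = 2)

fit      <- lm(PCV ~ WEIGHT)
pattern  <- fitted(fit)      # part of the data explained by the model
residual <- residuals(fit)   # part left unexplained

# The two parts add back up to the observed data.
all.equal(unname(pattern + residual), PCV)

# Share of the total variation captured by the Pattern (R-squared).
summary(fit)$r.squared
```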

Page 15: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis: Correlation & simple Linear Regression 1

We will use correlation & fitting a straight line to data to explain the concepts of statistical modeling:

• The simplest way of looking at the relationship between an x-variable and a y-variable is with the CORRELATION.

• An extension of this is to use a LINEAR REGRESSION model to fit a straight line through the points (Case Study 3)

• To look at how well this model is fitting we use an ‘analysis of variance’ – the amount of variation in the Pattern vs. in the Residual

Page 16: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis: Correlation & simple Linear Regression 2

Linear regression and similar models present the analysis in an ‘analysis of variance’ table that looks like (Section 3.2):

Source of variation | d.f. | s.s. | m.s. | v.r. (F-value) | p-value
Slope of line | 1 | | | |
Residual (error) | N-2 | | | |
Total | N-1 | | | |

In this example the p-value will tell you if the slope of the line is significantly different from 0 (i.e. a flat line) (Section 5)

Models such as Logistic for binary data and proportions or Poisson for counts give a similar table, but it is now the ‘analysis of deviance’ with similar interpretations (Section 10.2)
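In R, the analysis of variance table for a straight-line fit comes from `anova()` applied to an `lm` object. A minimal sketch on simulated data (the values are assumptions; N = 25 here):

```r
set.seed(4)
# Simulated data with N = 25 observations.
WEIGHT <- runif(25, min = 20, max = 40)
PCV    <- 15 + 0.4 * WEIGHT + rnorm(25, sd = 2)

fit       <- lm(PCV ~ WEIGHT)
aov_table <- anova(fit)

# Rows: WEIGHT (the slope, 1 d.f.) and Residuals (N - 2 d.f.).
# Columns: Df, Sum Sq, Mean Sq, F value, Pr(>F).
print(aov_table)

# Correlation between the two variables, for comparison.
cor(WEIGHT, PCV)
```

R does not print the Total row; its degrees of freedom are the sum of the two rows shown (N - 1).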

Page 17: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis: Correlation & simple Linear Regression 3

A key aspect of any model is the ‘model checking’ – part of this is done through examination of the residuals.

For all regression models which assume that either the data or the residuals are ‘normal’, we use the same assumptions of independence, randomness and normal distribution
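A simple residual check in R might look like the sketch below. The data are simulated, and `pdf(NULL)` is used only so that the plots can be drawn without a screen or output file; in an interactive session those two device lines can be dropped.

```r
set.seed(5)
# Simulated straight-line data (illustrative values only).
x <- runif(40, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(40)

fit <- lm(y ~ x)
res <- residuals(fit)

pdf(NULL)  # null graphics device: nothing is written to disk

# Residuals vs fitted values: look for curvature or changing spread.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals.
qqnorm(res)
qqline(res)

invisible(dev.off())

# A formal test of normality of the residuals (Shapiro-Wilk).
sw <- shapiro.test(res)
sw
```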

Page 18: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis: Parameter Estimates & Least Square Means

We also look at the parameter estimates and their standard errors – for the linear regression example the parameter estimates are the slope and intercept. For more complex models and those with discrete explanatory variables we will use the parameter estimates to compare levels of the discrete variables (see Case Study 3 for examples & discussion).

For models containing discrete variables as explanatory / independent variables we will often want to present Means and Standard Errors and compare these (with a t-test if comparing 2 means, or with multiple comparison tests if comparing more) (Section 6).
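Parameter estimates, group means with standard errors, and a multiple comparison test can be obtained in R roughly as follows. The herd labels and weights are simulated assumptions, and Tukey's HSD is shown as one possible multiple comparison method, not necessarily the one used in the course.

```r
set.seed(6)
# Simulated weights for three hypothetical herds, 10 animals each.
HERD   <- factor(rep(c("H1", "H2", "H3"), each = 10))
WEIGHT <- rnorm(30, mean = c(28, 30, 33)[as.integer(HERD)], sd = 2)

fit <- lm(WEIGHT ~ HERD)

# Parameter estimates with standard errors. With R's default contrasts the
# intercept is the mean of the first level (H1) and the other rows are
# differences from H1.
est <- coef(summary(fit))
est

# Group means and their (unpooled) standard errors.
means <- tapply(WEIGHT, HERD, mean)
ses   <- tapply(WEIGHT, HERD, function(v) sd(v) / sqrt(length(v)))
cbind(mean = means, se = ses)

# Pairwise comparison of the three herd means.
TukeyHSD(aov(WEIGHT ~ HERD))
```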

Page 19: Module 4 data analysis

Data Analysis – Confirmatory / Formal Analysis: Exercise

Identify what sort of model you may use in your research (check the Statistical Modeling Teaching Guide) – e.g. linear regression (section 3), designed experiment (section 4), response data which are proportions or binary (section 10), count response data (section 11), survival data (section 12).

Which parameters may be included in your model as the Pattern?

Draw a pretend analysis of variance / deviance or parameter estimates table of what you may expect to see in the analysis

Page 20: Module 4 data analysis

Data Analysis – Application (Statistical Software Packages)

Features | Excel | Stata | SPSS | SAS | R
Learning curve | Gradual/flat | Steep/gradual | Gradual/flat | Pretty steep | Pretty steep
User interface | Point-and-click | Programming/point-and-click | Mostly point-and-click | Programming | Programming
Data manipulation | Weak/moderate | Very strong | Moderate | Very strong | Very strong
Data analysis | Modest | Powerful | Powerful | Powerful/versatile | Powerful/versatile
Graphics | Very good | Very good | Very good | Good | Excellent
Cost | Part of MS Office | Affordable (perpetual licenses, renew only when upgrading) | Expensive (with annual license renewal) | Expensive (annual renewal) | Open source (free)

Page 21: Module 4 data analysis

Data Analysis – Application (Using R)

Outline

• Installing R

• R Environment: Command prompt, RStudio, Setting your workspace

• Loading and installing R packages

• Importing data into R - *.csv/*.xls

• Saving R data

• Data exploration – summary statistics

• Graphing in R - boxplot

• Data analysis – T-test/linear & logistic regression

Page 22: Module 4 data analysis

Introduction to R

Installing R

- Download R from http://cran.r-project.org

CRAN – Comprehensive R Archive Network

The R version changes over time; the current one is R-2.15.0

- Installing RStudio

- Setting up the work environment

Page 23: Module 4 data analysis

Introduction to R

R Environment

Command prompt

R is primarily a command-driven software where instructions are typed at the command prompt (> ). R is case sensitive.

RStudio

RStudio has a limited set of commands that can be selected and executed from the menu

Setting your workspace

It is important to set R preferences to suit your work environment; one such setting is the working directory. The working directory is set using the command setwd

setwd("D:/My Documents/R course") or from File->change dir on the menu

Take note of the / in the path: R will not recognize a single \ when specifying subdirectories

Page 24: Module 4 data analysis

Introduction to R

Loading and installing R packages

Modules or sets of functions are referred to as PACKAGES in R. Some packages are part of the base installation while others have to be installed separately.

There are several user-contributed packages.

Type library() to view installed packages

To view functions within a package type

library(help="packagename"), e.g. library(help=stats) (quotes are not required)

Install packages using the menu Packages->Install package(s) ….

Use the find("item") command to identify the package containing an item of interest,

e.g. find("plot"), if you are sure of the exact name; otherwise use

apropos("item")
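The same package tasks can be done from the command prompt. A minimal sketch follows; the `install.packages()` call is left commented out because it needs an internet connection, and RODBC is used only as an example of a package that is not part of the base installation.

```r
# Attach a package that ships with R (no installation needed).
library(stats)

# Packages outside the base installation are installed once from CRAN, e.g.
#   install.packages("RODBC")
# and then attached in each session with library(RODBC).

# Which attached package provides an object called "lm"?
lm_home <- find("lm")
lm_home

# When the exact name is not known, search by pattern instead.
read_functions <- apropos("^read\\.")
read_functions

# Check whether a package is installed without attaching it.
requireNamespace("stats", quietly = TRUE)
```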

Page 25: Module 4 data analysis

Introduction to R

Importing data into R - *.csv/*.xls

Although it is possible to enter data directly into R, importing data in a spreadsheet format is more efficient.

Use:

i. read.table – to import space separated data with column headings (*.txt)

prod1 <- read.table("D:/My Documents/R course/PROD2B.txt", header=TRUE)

ii. read.csv – to import comma separated data with column headings (*.csv)

prod2 <- read.csv("D:/My Documents/R course/PROD2B.csv", header=TRUE)

To save the data from R to a file

write.table(prod2, file="D:/My Documents/R course/proddata2", quote=FALSE)

iii. odbcConnectExcel() – to import an Excel worksheet (requires the RODBC package)

library(RODBC)

prod3 <- "D:/My Documents/R course/PROD2B.xls"

datachannel <- odbcConnectExcel(prod3)

outprod3 <- sqlFetch(channel=datachannel, sqtable="prod3")

write.table(outprod3, file="D:/My Documents/R course/proddata3", quote=FALSE)
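The import and save steps can be tried without the course files by writing a small data frame to a temporary folder and reading it back. Everything in this sketch (the data frame `prod`, its values and the file names) is made up for illustration.

```r
# A small made-up data frame standing in for the course data.
prod <- data.frame(
  HERD   = c("A", "A", "B"),
  SEX    = c("F", "M", "F"),
  WEIGHT = c(25.4, 30.1, 27.8)
)

# Write it out as a comma separated file in R's temporary directory.
csv_path <- file.path(tempdir(), "prod_example.csv")
write.csv(prod, csv_path, row.names = FALSE)

# Import it again; header = TRUE takes column names from the first line.
prod_in <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(prod_in)

# Save and reload in R's own binary format, which keeps column types intact.
rds_path <- file.path(tempdir(), "prod_example.rds")
saveRDS(prod_in, rds_path)
prod_back <- readRDS(rds_path)
identical(prod_back, prod_in)
```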

Page 26: Module 4 data analysis

Introduction to R

Data exploration – summary statistics

One can get summary statistics on all numeric variables in the dataset using summary(datasetname)

e.g. summary(outprod3)

It is also possible to get summary statistics on a particular variable; use $ to refer to a variable within the data table

e.g. summary(outprod3$WEIGHT)

Use aggregate to get summary statistics by group/category

e.g. aggregate(data.frame(WEIGHT), by=list(herd=HERD,sex=SEX), mean)

It is advisable to attach a data file to avoid having to specify the data file all the time, particularly for long summaries such as aggregate

attach(outprod3)
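As an alternative to attaching the data, `aggregate()` also accepts a formula together with a `data` argument. The sketch below uses a small invented data frame (`outprod`) with the same column names as the examples above.

```r
# Made-up data with the column names used in the slides.
outprod <- data.frame(
  HERD   = rep(c("A", "B"), each = 4),
  SEX    = rep(c("F", "M"), times = 4),
  WEIGHT = c(24, 30, 26, 32, 28, 35, 27, 33)
)

# Summary of a single variable.
summary(outprod$WEIGHT)

# Mean weight for each herd-by-sex combination, without attach().
by_group <- aggregate(WEIGHT ~ HERD + SEX, data = outprod, FUN = mean)
by_group

# The same means laid out as a 2-way table.
with(outprod, tapply(WEIGHT, list(herd = HERD, sex = SEX), mean))
```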

Page 27: Module 4 data analysis

Introduction to R

Graphing in R – boxplot

- R has powerful graphing features that can be used in data exploration, such as histograms, boxplots, scatterplots, etc.

library(lattice)  # histogram() and xyplot() come from the lattice package

histogram(~PCV, nint=30, xlab="Packed Cell Volume")

boxplot(PCV, ylab="Packed Cell Volume")

boxplot(PCV~HERD, col="orange", ylab="Packed Cell Volume", xlab="Herd")

xyplot(PCV~WEIGHT, col="orange", ylab="Packed Cell Volume", xlab="Weight")
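The same kinds of plots are also available in base R graphics, with no extra package. This is a sketch on simulated data (the data frame `dat` is an assumption); `pdf(NULL)` only keeps the example from needing a screen or writing a file and can be omitted in an interactive session.

```r
set.seed(7)
# Simulated data using the variable names from the slides.
dat <- data.frame(
  HERD   = factor(rep(c("H1", "H2"), each = 20)),
  WEIGHT = runif(40, min = 20, max = 40)
)
dat$PCV <- 15 + 0.4 * dat$WEIGHT + rnorm(40, sd = 2)

pdf(NULL)  # null graphics device: nothing is written to disk

# Histogram of one continuous variable.
hist(dat$PCV, breaks = 10, xlab = "Packed Cell Volume", main = "")

# Boxplot of a continuous variable by group; the statistics are returned too.
bp <- boxplot(PCV ~ HERD, data = dat, col = "orange",
              ylab = "Packed Cell Volume", xlab = "Herd")

# Scatterplot of two continuous variables.
plot(PCV ~ WEIGHT, data = dat, col = "orange", pch = 19,
     ylab = "Packed Cell Volume", xlab = "Weight")

invisible(dev.off())

# Number of observations behind each box.
bp$n
```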

Page 28: Module 4 data analysis

Introduction to R

Data analysis – T-test

- t.test(prod2$WEIGHT~prod2$SEX)

Data analysis – Linear Regression

- Remember to attach the dataset first so that its variables can be referred to by name

- attach(prod2)

- output1<-lm(PCV~WEIGHT)

Data analysis – Logistic Regression

-
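The slide gives no command for logistic regression. One common way to fit such a model in R is `glm()` with `family = binomial`; the sketch below is an assumption about what might be intended, not the course's own example. The binary outcome `ANAEMIC` and all values are simulated.

```r
set.seed(8)
# Simulated data: a hypothetical binary outcome that becomes less likely
# as weight increases.
n       <- 200
WEIGHT  <- runif(n, min = 20, max = 40)
ANAEMIC <- rbinom(n, size = 1, prob = plogis(6 - 0.2 * WEIGHT))

# Logistic regression: binary response, continuous explanatory variable.
fit_logit <- glm(ANAEMIC ~ WEIGHT, family = binomial)

# Parameter estimates (on the log-odds scale) with standard errors.
coef(summary(fit_logit))

# Analysis of deviance table, the counterpart of the analysis of variance.
anova(fit_logit, test = "Chisq")

# Odds ratio for a one-unit increase in weight.
exp(coef(fit_logit)["WEIGHT"])
```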

Page 29: Module 4 data analysis

References

Research Methods & Biometrics Teaching Resource – Case Study 1 & 4 have R, Case Study 2 has ANOVA; many others are used in this session as examples. The study guides are useful for reference material on Exploratory and Formal Analysis.

Take home:

Analysis the Data & Models Chapters in Green Book

Good Statistical Practice for Natural Resources Research – Part IV

R Intro Course Notes – Nicholas Ndiwa

Reading University SSC – Approaches to Analysis of Survey Data; Confidence & Significance: Key Concepts of Inferential Statistics; Modern methods of analysis; Analysis of Experimental Data