Data visualization and graphic design Introducing R for data visualization Allan Just and Andrew Rundle EPIC Short Course June 21, 2011 Wickham 2008


TRANSCRIPT

Page 1: Data visualization and graphic design Introducing R for data visualization

Data visualization and graphic design
Introducing R for data visualization

Allan Just and Andrew Rundle
EPIC Short Course
June 21, 2011

Wickham 2008

Page 2: Data visualization and graphic design Introducing R for data visualization

Intro to R
Objectives
After this class, participants will be able to:

1. Describe some capabilities and uses of R

2. Search for help within R and use good coding practices for reproducible research in R

3. Read in and summarize a simple dataset with R/JGR/Deducer

4. Make some standard plots with Deducer templates

Page 3: Data visualization and graphic design Introducing R for data visualization

What is R?

nytimes.com

Page 4: Data visualization and graphic design Introducing R for data visualization

R has many uses
• Work with data: subset, merge, and transform datasets with a powerful syntax
• Analysis: use existing statistical functions like regression or write your own
• Graphics: graphs can be made quickly during analysis and polished for publication quality displays

Page 5: Data visualization and graphic design Introducing R for data visualization

Why learn a whole language to look at data versus Excel?

1. Recreate/redo your exact analysis

2. Automate repetitive tasks

3. Access to statistical methods not available in Excel

4. Graphs are more elegant

Page 6: Data visualization and graphic design Introducing R for data visualization

Why R versus SAS, SPSS, or Stata?

1. It's free!

2. It runs on Mac, Windows, and Linux

3. It has state-of-the-art graphics capabilities

4. It contains advanced statistical routines not yet available in other packages – a de facto standard in statistics

5. Can program new statistical methods or automate data manipulation/analysis

adapted from statmethods.net

Page 7: Data visualization and graphic design Introducing R for data visualization

Made in SAS Redone in R

learnr.wordpress.com

Page 8: Data visualization and graphic design Introducing R for data visualization

R plots from my own research

Page 9: Data visualization and graphic design Introducing R for data visualization

Scatterplot matrix: bivariate densities and correlations

Page 10: Data visualization and graphic design Introducing R for data visualization

Forest plot to compare parameter estimates from many models

Page 11: Data visualization and graphic design Introducing R for data visualization

Displaying lots of data: faceted histograms

Page 12: Data visualization and graphic design Introducing R for data visualization

Plotting data with a model

Page 13: Data visualization and graphic design Introducing R for data visualization

Automated report generation

Page 14: Data visualization and graphic design Introducing R for data visualization

Shapefile: CIESIN, Columbia University Asthma data: http://nyc.gov/html/doh/downloads/pdf/asthma/asthma-hospital.pdf

Choropleth map

Page 15: Data visualization and graphic design Introducing R for data visualization

Intro to R: recap
Objectives
After this class, participants will be able to:

1. Describe some capabilities and uses of R

Statistical data analysis

Automation (scripting) of functions to work with data

Elegant graphics to facilitate data visualization

2. Search for help within R and use good coding practices for reproducible research in R

3. Read in and summarize a simple dataset with R/JGR/Deducer

4. Make some standard plots with Deducer templates

Page 16: Data visualization and graphic design Introducing R for data visualization

Learning a new language is difficult

flickr.com/photos/dnorman/3732851541/

Page 17: Data visualization and graphic design Introducing R for data visualization

What makes R difficult to learn

R is designed to be flexible and powerful rather than simple but limited.

R is a fully featured language mainly used from the command line. Learning the commands and the structure of the code takes time and practice.

If I made a a typo you would know what I meant...

Page 18: Data visualization and graphic design Introducing R for data visualization

What makes R difficult to learn

R is designed to be flexible and powerful rather than simple but limited.

The solution: be careful. Build code in simple pieces and test as you go (learn to debug). Reuse code that works. Use helpful resources. Consider an alternative GUI for R.

Page 19: Data visualization and graphic design Introducing R for data visualization

Getting help in R
You can call for help on a function with a leading question mark and leaving off the ():
?functionname

Search online

statmethods.net

An Introduction to R: in Windows, found under Help – Manuals (in PDF)
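
The help tools on this slide can be tried directly at the console. The following is a minimal sketch that uses median() as the example function; apropos() and args() are base R helpers that the slide does not mention and are included here only as an assumption about what a beginner might find useful.

```r
# Open the help page for a function: leading question mark, no parentheses
?median

# help() is the long form of the same request
help("median")

# List objects whose names contain a pattern (handy when you only
# remember part of a function name)
matches <- apropos("median")
print(matches)

# Show the arguments a function accepts without opening the full help page
args(median)
```

How the help page is displayed (a browser tab, a pager, or a GUI window) depends on the platform and the interface in use.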

Page 20: Data visualization and graphic design Introducing R for data visualization

Suggestions for an R workflow
Save the bits of your code that work in a text editor - building a script of clean code that works from start-to-finish.

With clean code instead of transformed data files it is easier to redo analyses if your data are updated or you want to change an earlier step

Leave yourself informative comments
# everything to the right of the pound sign
# is unevaluated

Using spaces and indents can help readability
Use meaningful names for objects

Reproducible research!
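
As an illustration of these suggestions, here is a short sketch of what a clean start-to-finish script might look like. It uses R's built-in airquality data; the file name in the first comment and the object names are made up for this example.

```r
# ozone_summary.R (hypothetical file name)
# Purpose: summarize daily ozone readings from the built-in airquality data.
# Everything to the right of a pound sign is a comment and is not evaluated.

# Step 1: pull out the variable of interest under a meaningful name
ozone_ppb <- airquality$Ozone

# Step 2: count missing readings before summarizing
n_missing <- sum(is.na(ozone_ppb))

# Step 3: compute the summary, ignoring missing values
mean_ozone <- mean(ozone_ppb, na.rm = TRUE)

# Step 4: report the results
cat("Missing readings:", n_missing, "\n")
cat("Mean ozone (ppb):", round(mean_ozone, 1), "\n")
```

Because the script starts from the raw data and never relies on objects created by hand, rerunning it after a data update reproduces every step.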

Page 21: Data visualization and graphic design Introducing R for data visualization

Intro to R: recap
Objectives
After this class, participants will be able to:

1. Describe some capabilities and uses of R

2. Search for help within R and use good coding practices for reproducible research in R

?t.test will bring up R help

Free manuals online: An Introduction to R. Also: statmethods.net

#use comments; save the code that works to reproduce your results

3. Read in and summarize a simple dataset with R/JGR/Deducer

4. Make some standard plots with Deducer templates

Page 22: Data visualization and graphic design Introducing R for data visualization

Learning the language
Many important features

• Arithmetic and logical operators: +, <, …

• Data types: numeric, logical, …

• Data structures: vectors, matrices, …

• Functions – always end with (): median(x)

Page 23: Data visualization and graphic design Introducing R for data visualization

Using R as a calculator

Mathematical operators
+ - / * ^

log()
abs()
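
A few lines typed at the console show the operators and functions from this slide in action. This is only a sketch; the particular numbers are arbitrary.

```r
2 + 3 * 4     # multiplication is done before addition
(2 + 3) * 4   # parentheses change the order of evaluation
7 / 2         # division
2^10          # exponentiation
abs(-3.5)     # absolute value
log(1)        # natural logarithm (base e); log(1) is 0
```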

Page 24: Data visualization and graphic design Introducing R for data visualization

R can evaluate logical expressions
== equal
!= not equal
& and
| or (vertical pipe)

10 < 20
[1] TRUE
pi > 3 & 2^2 == 4
[1] TRUE
"This" != "That"
[1] TRUE

Page 25: Data visualization and graphic design Introducing R for data visualization

Creating new objects
Assignment operator is <- (looks like an arrow)
x <- 10
“Set x to take the value 10”

The symbols in this operator must be adjacent.
x < - 10
What does this do?

You can overwrite old values
x <- x^2
“Set x to take the value x^2”
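
One way to answer the question on this slide is to run both forms and compare. In the sketch below, spaced_result is a name introduced only for this illustration.

```r
x <- 10                    # assignment: x now holds 10

# With a space between < and -, R reads a comparison instead:
# "is x less than -10?"
spaced_result <- x < - 10
print(spaced_result)       # FALSE, and x is unchanged

x <- x^2                   # overwrite x with its square
print(x)                   # 100
```

No error or warning is raised by the spaced version, which is why this slip can be hard to spot.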

Page 26: Data visualization and graphic design Introducing R for data visualization

Indexing and subsetting
Concatenate function is c()
x <- c(10, 20, 30)
x
[1] 10 20 30

Refer to components of objects by a position index which goes between square brackets

x[2]   # return the second position in x
[1] 20
x[c(1, 2)]   # return the first and second position in x
[1] 10 20
x[-3]   # return all except the third position in x
[1] 10 20

What would x[c(3, 2)] return?
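
The question can be checked at the console. A short sketch, where reordered is a name chosen for this example:

```r
x <- c(10, 20, 30)

# Positions are returned in the order they are requested,
# so an index vector can reorder as well as subset
reordered <- x[c(3, 2)]
print(reordered)   # 30 20
```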

Page 27: Data visualization and graphic design Introducing R for data visualization

Data frames
A data frame is a rectangular collection of data

Rows: observations
Columns: variables

diamonds <- data.frame(carat, cut, price)
  carat       cut price
1  0.23     Ideal   326
2  0.21   Premium   326
3  0.23      Good   327
4  0.29   Premium   334
5  0.31      Good   335
6  0.24 Very Good   336

Page 28: Data visualization and graphic design Introducing R for data visualization

Data frames
You can extract the variables as vectors with a $
diamonds$cut
You can also index by position (or name) with square brackets
diamonds[2, 3]   returns the single value in row 2, column 3

An empty index is treated like a wildcard and corresponds to all rows or columns depending on position

diamonds[, "cut"] (same result as diamonds$cut)

How would you return the first three rows and all columns?

row, column
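
The indexing rules on this slide can be tried on the six rows shown earlier. The sketch below rebuilds that small data frame by hand (the three vectors are typed in from the slide's table) and then answers the question about the first three rows; first_three is a name introduced for this example.

```r
# Rebuild the six rows shown on the earlier slide
carat <- c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24)
cut   <- c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good")
price <- c(326, 326, 327, 334, 335, 336)
diamonds <- data.frame(carat, cut, price)

diamonds$cut         # one variable as a vector
diamonds[2, 3]       # row 2, column 3
diamonds[, "cut"]    # empty row index: all rows of the cut column

# Rows 1 to 3, with an empty column index meaning "all columns"
first_three <- diamonds[1:3, ]
print(first_three)
```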

Page 29: Data visualization and graphic design Introducing R for data visualization

R functions
Thousands of functions are built-in:

median()
lm()   linear model
t.test()
chisq.test()

or make your own:

inch.to.cm <- function(x){x * 2.54}

inch.to.cm(74)

[1] 187.96
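
Functions written this way work on whole vectors as well as single numbers, since the arithmetic inside is applied element by element. The sketch below repeats inch.to.cm() from the slide and adds cm.to.inch(), a hypothetical inverse written only for this illustration.

```r
# The conversion function from the slide, spread over several lines
inch.to.cm <- function(x) {
  x * 2.54
}

inch.to.cm(74)              # a single value
inch.to.cm(c(60, 66, 72))   # a vector of values, converted element by element

# Hypothetical inverse function, not part of the original slides
cm.to.inch <- function(x) {
  x / 2.54
}

# Converting there and back should return the starting value
round_trip <- cm.to.inch(inch.to.cm(74))
print(round_trip)
```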

Page 30: Data visualization and graphic design Introducing R for data visualization

Missing values
These take a value of NA
Can be in a data object of any type (logical, numeric, character)

By default operations on NA will return NA
NA == NA
[1] NA

Can check for NA with is.na()
y <- c(2, 10, NA, 12)
is.na(y)
[1] FALSE FALSE  TRUE FALSE

Can often pass na.rm = TRUE option to remove NA values in operations
mean(y)
[1] NA
mean(y, na.rm = TRUE)
[1] 8
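
The pieces above combine into a common pattern: count the missing values first, then summarize without them. A small sketch using the same vector; the names n_missing, y_complete, mean_complete, and mean_narm are chosen for this example.

```r
y <- c(2, 10, NA, 12)

# is.na() returns TRUE/FALSE, and TRUE counts as 1 when summed
n_missing <- sum(is.na(y))

# Keep only the non-missing values by negating is.na() with !
y_complete <- y[!is.na(y)]
mean_complete <- mean(y_complete)

# Same result in one step; writing TRUE in full is safer than T,
# because T is an ordinary variable that can be reassigned
mean_narm <- mean(y, na.rm = TRUE)

print(c(n_missing, mean_complete, mean_narm))
```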

Page 31: Data visualization and graphic design Introducing R for data visualization

R has several thousand additional packages

time series
survival
spatial
machine learning
bioinformatics

Interfaces to Excel, SQL databases, Twitter, google maps…

Page 32: Data visualization and graphic design Introducing R for data visualization

Installing a package

1. Open up R
2. Click in to the console window and type:
install.packages()
3. Select a mirror (anywhere in the US)
4. Find and select "Deducer" and choose OK.
5. This will download Deducer and the other packages which it requires, including ggplot2.
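
The same installation can be done without the selection dialog by naming the package in the call. The sketch below is one possible approach, not the course's own script: it checks whether the package is already present and only installs during an interactive session. Whether Deducer is available from a given mirror for a given R version is an assumption this example cannot guarantee.

```r
pkg <- "Deducer"

# TRUE if the package is already installed on this machine
already_installed <- requireNamespace(pkg, quietly = TRUE)

# Install only when needed, and only when someone is at the console
# (install.packages() may ask which mirror to use)
if (!already_installed && interactive()) {
  install.packages(pkg)   # also downloads the packages Deducer requires
}

print(already_installed)
```

After installation, library(Deducer) loads the package for the current session.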

Page 33: Data visualization and graphic design Introducing R for data visualization

The default R graphical user interface (Windows)

Page 34: Data visualization and graphic design Introducing R for data visualization

JGR

Page 35: Data visualization and graphic design Introducing R for data visualization

Deducer

Page 36: Data visualization and graphic design Introducing R for data visualization

Recap on GUIs

R Default Windows GUI: lacks additional features to make learning or programming easier

JGR: Makes programming easier with syntax highlighting and command argument suggestions. No menus for stats. Looks the same across platforms (Java based)

Deducer: Adds menus for basic stats to JGR. Menu driven graphics options (building with ggplot2).

Page 37: Data visualization and graphic design Introducing R for data visualization

R graphics – 3 main "dialects"
Base: with(airquality, plot(Temp, Ozone))
Lattice: xyplot(Ozone ~ Temp, airquality)
ggplot2: ggplot(airquality, aes(Temp, Ozone)) + geom_point()
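
To compare the three dialects side by side, the calls can be run one after another. This sketch assumes the lattice and ggplot2 packages may or may not be installed, so each is checked first. The pdf(NULL) line sends the drawings to a null device so the code also runs without a screen; remove it and the final dev.off() to see the plots in a window.

```r
pdf(NULL)   # null graphics device: nothing is written to disk

# Base graphics: always available
with(airquality, plot(Temp, Ozone))

# Lattice: formula interface, drawn when the returned object is printed
has_lattice <- requireNamespace("lattice", quietly = TRUE)
if (has_lattice) {
  print(lattice::xyplot(Ozone ~ Temp, data = airquality))
}

# ggplot2: data and aesthetics first, then a layer of points
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
if (has_ggplot2) {
  library(ggplot2)
  print(ggplot(airquality, aes(Temp, Ozone)) + geom_point())
}

dev.off()   # close the device
```

ggplot2 warns that rows with missing Ozone values were removed; base and lattice drop them silently.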

Page 38: Data visualization and graphic design Introducing R for data visualization

Google image search: ggplot2

Page 39: Data visualization and graphic design Introducing R for data visualization

ggplot2 philosophy
Written by Hadley Wickham (Rice Univ.)
Extends The Grammar of Graphics (Wilkinson, 2005)

All graphs can be constructed by combining specifications with data (Wilkinson, 2005).

A specification is a structured way to describe how to build the graph from geometric objects (points, lines, etc.) projected on to scales (x, y, color, size, etc.)

Page 40: Data visualization and graphic design Introducing R for data visualization

ggplot2 philosophy
When you can describe the content of the graph with the grammar, you don’t need to know the name of a particular type of plot…

Dot plot, forest plot, Manhattan plot are just special cases of this formal grammar.

…a plotting system with good defaults for a large set of components that can be combined in flexible and creative ways…

Page 41: Data visualization and graphic design Introducing R for data visualization

Building a plot in ggplot2

data to visualize (a data frame)
map variables to aesthetic attributes
geometric objects – what you see (points, bars, etc.)
scales map values from data to aesthetic space
faceting subsets the data to show multiple plots
statistical transformations – summarize data
coordinate systems put data on plane of graphic

Wickham 2009
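
Each of the components listed above corresponds to a piece of a ggplot2 call. The sketch below is one possible mapping, using the built-in airquality data; the particular choices (a linear smooth, panels by month, the axis limits) are assumptions made for illustration and are not taken from the slides. It runs only if ggplot2 is installed.

```r
# Data: a data frame, here with the rows missing Ozone dropped
aq <- airquality[!is.na(airquality$Ozone), ]

has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
if (has_ggplot2) {
  library(ggplot2)
  p <- ggplot(aq, aes(x = Temp, y = Ozone)) +    # aesthetics: variables to x and y
    geom_point() +                               # geometric object: points
    geom_smooth(method = "lm") +                 # statistical transformation: fitted line
    scale_y_continuous(name = "Ozone (ppb)") +   # scale: how values map to the y axis
    facet_wrap(~ Month) +                        # faceting: one panel per month
    coord_cartesian(xlim = c(50, 100))           # coordinate system
}
```

Typing p at the console (or calling print(p)) draws the plot. Any of the lines after the first can be removed and the rest still forms a valid plot, which is the sense in which the components combine flexibly.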

Page 42: Data visualization and graphic design Introducing R for data visualization

A basic ggplot2 graph
ggplot(airquality) + geom_point(aes(x = Temp, y = Ozone))

Data
Aesthetics map variables to scales

Geometric objects to display

Page 43: Data visualization and graphic design Introducing R for data visualization

A ggplot2 graph is an R object
p <- ggplot(airquality) + geom_point(aes(x = Temp, y = Ozone))
str(p) #structure of p

List of 8
 $ data :'data.frame': 153 obs. of 6 variables:
  ..$ Ozone : int [1:153] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:153] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp : int [1:153] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month : int [1:153] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day : int [1:153] 1 2 3 4 5 6 7 8 9 10 ...
 $ layers :List of 1
  ..$ :proto object
  .. .. $ mapping :List of 2
  .. .. ..$ x: symbol Temp
  .. .. ..$ y: symbol Ozone
  .. .. $ geom_params:List of 1
  .. .. ..$ na.rm: logi FALSE
 ...
 $ plot_env :<environment: R_GlobalEnv>
 - attr(*, "class")= chr "ggplot"

shortened substantially

Note that the internal plot specification includes the data

So if you update the data, update the call to ggplot()
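
The point about the stored data can be checked with a short experiment. This is a sketch under two assumptions: that ggplot2 is installed, and that the plot's data can be read back with p$data, as the str() listing suggests. The subset to May is an arbitrary stand-in for "updating the data".

```r
aq <- airquality
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  library(ggplot2)
  p <- ggplot(aq) + geom_point(aes(x = Temp, y = Ozone))
}

# Update the data after the plot object was built: keep May only
aq <- aq[aq$Month == 5, ]

if (has_ggplot2) {
  # The plot object still holds the data it was built with
  rows_in_plot <- nrow(p$data)

  # Rebuild the plot so that it picks up the updated data
  p <- ggplot(aq) + geom_point(aes(x = Temp, y = Ozone))
  rows_after_rebuild <- nrow(p$data)

  print(c(rows_in_plot, rows_after_rebuild))
}
```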

Page 44: Data visualization and graphic design Introducing R for data visualization

Help with learning ggplot2
Website: had.co.nz/ggplot2/
Thousands of examples!

Book: ggplot2: Elegant Graphics for Data Analysis
Hadley Wickham, 2009

Graphic User Interface: Deducer (R package)
Ian Fellows

Page 45: Data visualization and graphic design Introducing R for data visualization

Intro to R: recap
Objectives
After this workshop participants will be able to:

1. Describe some capabilities and uses of R

2. Search for help within R and use good coding practices for reproducible research in R

3. Read in and summarize a simple dataset with R/JGR/Deducer

Together, let’s explore some data from the WHO - Global School Health Survey.

I will also give you a script containing code which you can run, modify, and take home!

4. Make some standard plots with Deducer templates
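
For objective 3, reading and summarizing a dataset can also be done from the console rather than the menus. The sketch below is not the course script. Because the survey file and its column names are not shown in the slides, it writes a tiny made-up CSV to a temporary file first; the columns age and sex and all the values are hypothetical.

```r
# Hypothetical stand-in for a survey file (made-up columns and values)
example_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(age = c(13, 14, 15, NA), sex = c("F", "M", "F", "M")),
  example_file,
  row.names = FALSE
)

# Read the file into a data frame
survey <- read.csv(example_file)

str(survey)        # variable names, types, and first values
summary(survey)    # per-variable summaries, including counts of NA
table(survey$sex)  # counts for a categorical variable

# Mean of a numeric variable, ignoring missing values
mean_age <- mean(survey$age, na.rm = TRUE)
print(mean_age)
```

With a real file, only the path passed to read.csv() changes.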

Page 46: Data visualization and graphic design Introducing R for data visualization

Open JGR -

Page 47: Data visualization and graphic design Introducing R for data visualization

Load the Deducer package

Page 48: Data visualization and graphic design Introducing R for data visualization

Note additional menus

Page 49: Data visualization and graphic design Introducing R for data visualization
Page 50: Data visualization and graphic design Introducing R for data visualization
Page 51: Data visualization and graphic design Introducing R for data visualization

Intro to R: recap
Objectives
After this workshop participants will be able to:

1. Describe some capabilities and uses of R

2. Search for help within R and use good coding practices for reproducible research in R

3. Read in and summarize a simple dataset with R/JGR/Deducer

4. Make some standard plots with Deducer templates

Using the gshs dataframe – let's make some plots together using templates in:

Deducer → Plots → Plot Builder

Page 52: Data visualization and graphic design Introducing R for data visualization

Since R, JGR, and Deducer are free, you should install them at home or work and play with them!

Page 53: Data visualization and graphic design Introducing R for data visualization

Installing R, JGR, Deducer
Part I: R on Windows (shown), or Mac, or Linux

R is available from a set of mirrors known as The Comprehensive R Archive Network (CRAN)
http://cran.r-project.org/

Closest mirror and link for Windows:
http://software.rc.fas.harvard.edu/mirrors/R/bin/windows/base/

Uses a Windows installer – default options are fine

Page 54: Data visualization and graphic design Introducing R for data visualization

Installing R, JGR, Deducer
Part II: JGR on Windows (shown), or Mac, or Linux

JGR requires a Java Development Kit (JDK)
You probably don't have this*
Available free at:
http://www.oracle.com/technetwork/java/javase/downloads/index.html

*if you did have a JDK (and not just a JRE) you would have a folder named something like …C:\Program Files\Java\jdk1.6.0_20\

Page 55: Data visualization and graphic design Introducing R for data visualization

Installing R, JGR, Deducer
Part II: JGR on Windows (shown), or Mac, or Linux

JGR requires a launcher file on Windows:
http://www.rforge.net/JGR/web-files/jgr-1_62.exe

Leave this as your desktop shortcut

Page 56: Data visualization and graphic design Introducing R for data visualization

Installing R, JGR, Deducer
Part III: Installing Deducer

Deducer is an R package

From within JGR
To install packages: Packages & Data -> Package Installer
To load packages: Packages & Data -> Package Manager

Page 57: Data visualization and graphic design Introducing R for data visualization

A few helpful R links
Download R: http://cran.r-project.org/ available for Windows, Mac OS X, and Linux

Advice – A clearly stated question with a reproducible example is far more likely to get help. You will often find your own solution by restating where you are getting stuck in a clear and concise way.

Writing reproducible examples: https://gist.github.com/270442
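
A reproducible example usually has three parts: a small dataset anyone can recreate, the exact code that misbehaves, and details of the R session. The sketch below shows one way to assemble these with base R tools; the choice of airquality and of mean() as the "problem" is made up for illustration.

```r
# 1. A small dataset: a few rows are usually enough to show the problem
small_example <- head(airquality[, c("Ozone", "Temp")], 5)

# dput() prints R code that recreates the object exactly,
# so it can be pasted into a question
dput(small_example)

# 2. The code that gives the unexpected result, with a note on what was expected
problem_result <- mean(small_example$Ozone)   # NA, because one value is missing
print(problem_result)

# 3. R version, platform, and loaded packages
sessionInfo()
```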

General R links
http://statmethods.net/ Quick-R for SAS/SPSS/Stata Users - An all around excellent reference site
http://www.ats.ucla.edu/stat/R/ Resources for learning R from UCLA with lots of examples
http://www.r-bloggers.com/learning-r-for-researchers-in-psychology/ This is a nice listing of R resources
http://stackoverflow.com/questions/tagged/r Q&A forum for R programming questions - lots of good help!
see also: http://crossvalidated.com for general stats & R
http://rstudio.org Integrated Development Environment for command line programming with R

ggplot2 links
http://had.co.nz/ggplot2/ ggplot2 help & reference – lots of examples
http://groups.google.com/group/ggplot2 ggplot2 user group – great for posting questions
https://github.com/hadley/ggplot2/wiki ggplot2 wiki: answers many FAQs, tips & tricks

http://www.slideshare.net/hadley/presentations Over 100 presentations by Hadley Wickham, author of ggplot2. A four-part video of a ½ day workshop by him starts here: http://had.blip.tv/file/3362248/

Setting up JGR in Windows
JGR requires a JDK – speak to your IT person if this seems daunting (http://www.oracle.com/technetwork/java/javase/downloads/index.html)
On Windows, JGR needs to be started from a launcher. For R version 2.13.0 on Windows with a 32bit R you will likely want to get the file jgr-1_62.exe as a launcher from here: http://www.rforge.net/JGR/
A discussion of the features of JGR can be found in this article (starting on page 9): http://stat-computing.org/newsletter/issues/scgn-16-2.pdf

Deducer - an R package which works best in a working instance of JGR – has drop-down menus for ggplot2 functionality
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual

There are great videos linked here introducing the Deducer package (although the volume is quite low)

This slide last updated 06/19/2011