  • Introduction to Basic Principles of R

    Dimitris Fouskakis
    Associate Professor in Applied Statistics,
    Department of Mathematics, School of Applied Mathematical & Physical Sciences,
    National Technical University of Athens

    DATA SCIENCE AND MACHINE LEARNING


  • What is R?

    R is a dialect of the S language.

  • What is S?

    • S is a language that was developed by John Chambers and others at Bell Labs.
    • S was initiated in 1976 as an internal statistical analysis environment, originally implemented as Fortran libraries.
    • Early versions of the language did not contain functions for statistical modeling.
    • In 1988 the system was rewritten in C and began to resemble the system that we have today (this was Version 3 of the language).
    • Version 4 of the S language was released in 1998 and is the version we use today.

  • Back to R

    • 1991: Created in New Zealand by Ross Ihaka and Robert Gentleman. Their experience developing R is documented in a 1996 JCGS paper.
    • 1993: First announcement of R to the public.
    • 1995: Martin Mächler convinces Ross and Robert to use the GNU General Public License to make R free software.
    • 1996: Public mailing lists are created (R-help and R-devel).
    • 1997: The R Core Group is formed (containing some people associated with S-PLUS). The core group controls the source code for R.
    • 2000: R version 1.0.0 is released.

  • Features of R

    • Syntax is very similar to S, making it easy for S-PLUS users to switch over.
    • Runs on almost any standard computing platform/OS (even on the PlayStation 3).
    • Frequent releases (annual + bugfix releases); active development.
    • Quite lean, as far as software goes; functionality is divided into modular packages.
    • Graphics capabilities are very sophisticated and better than those of most stat packages.
    • Useful for interactive work, but also contains a powerful programming language for developing new tools (user → programmer).
    • Very active and vibrant user community; R-help and R-devel mailing lists.
    • It's free!

  • Drawbacks of R

    • Essentially based on 40-year-old technology.
    • Little built-in support for dynamic or 3-D graphics (but things have improved greatly since the "old days").
    • Functionality is based on consumer demand and user contributions. If no one feels like implementing your favorite method, then it's your job! (Or you need to pay someone to do it.)
    • Objects must generally be stored in physical memory, but there have been advancements to deal with this too.

  • Design of the R System

    • The R system is divided into 2 conceptual parts:
      • The "base" R system that you download from CRAN.
      • Everything else.
    • CRAN is the "Comprehensive R Archive Network". It is a collection of sites which carry identical material, consisting of the R distribution(s), the contributed extensions, documentation for R, and binaries.
    • R functionality is divided into a number of packages.
      • The "base" R system contains, among other things, the base package, which is required to run R and contains the most fundamental functions.
      • The other packages contained in the "base" system include utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.

  • Design of the R System (continued)

    • And there are many other packages available.
    • There are about 4000 packages (!!!) on CRAN that have been developed by users and programmers around the world.
    • People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.

  • Some R Resources

    Available from CRAN (http://cran.r-project.org):

    • An Introduction to R
    • Writing R Extensions
    • R Data Import/Export
    • R Installation and Administration (mostly for building R from sources)

  • Books

    Standard texts:

    • Adler (2010). R in a Nutshell, O'Reilly.
    • Albert (2007). Bayesian Computation with R, Springer.
    • Albert & Rizzo (2011). R by Example, Springer.
    • Chambers (2008). Software for Data Analysis: Programming with R, Springer.
    • Crawley (2007). The R Book, Wiley.
    • Dalgaard (2002). Introductory Statistics with R, Springer-Verlag.
    • Everitt & Hothorn (2006). A Handbook of Statistical Analyses Using R, Chapman & Hall/CRC Press.
    • Venables & Ripley (2002). Modern Applied Statistics with S, Springer.
    • Murrell (2005). R Graphics, Chapman & Hall/CRC Press.
    • Fouskakis (2013). Ανάλυση Δεδομένων με Χρήση της R [Data Analysis Using R], Tsotras Publications.

    Other resources:

    • Springer has a series of books called Use R!.
    • A longer list of books is at http://www.r-project.org/doc/bib/R-books.html.

  • Install R

    The home page for the R project, located at http://r-project.org, is the best starting place for information about the software. It includes links to CRAN, which features precompiled binaries as well as source code for R, add-on packages, documentation (including manuals, frequently asked questions, and the R newsletter), and general background information. Mirrored CRAN sites with identical copies of these files exist all around the world. New versions of R are regularly posted on CRAN; each new version must be downloaded and installed.

  • Install R (continued)

    • Windows: More information on Windows-specific issues can be found in the CRAN R for Windows FAQ (http://cran.r-project.org/bin/windows/base/rw-FAQ.html).
    • Mac OS: More information on Macintosh-specific issues can be found in the CRAN R for Mac OS X FAQ (http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html).
    • Linux: Precompiled distributions of R binaries are available for Debian, Red Hat (Fedora), SUSE and Ubuntu Linux, and detailed information on installation can be found at CRAN.

    In this course the Windows version will be used.

  • R Commander

    If you wish, you can download a more user-friendly graphical interface that works in concert with R, called R Commander, created by John Fox. Users familiar with SPSS and other programs driven by drop-down menus will initially feel more comfortable using R Commander than R. The R environment, however, is run with scripts, and in the long run it can be much more advantageous to have scripts that you can easily bring back up as you perform multiple calculations.

    Here we will use the default graphical interface.

  • Running R

    • Once installation is complete, the recommended next step for a new user is to start R by double-clicking the R program icon.
    • The R console then shows up.
    • The `>` character is the prompt, and commands are executed once the user presses the RETURN key.
    • To terminate the program, either type q(), or close the R main window (not the R console), or choose Exit from the File menu. In all cases you will be asked whether you wish to save the workspace image, i.e. all the objects you have created in your session.

  • R windows

    • R Console: You may write R commands there. This window also displays all the commands R has run, the results, and error reports. This window appears when you launch R. To bring this window to the foreground, click on the window or go to the Windows pull-down menu and choose the R Console. You can type and run commands in this window line by line.
    • R Editor: Alternatively, you can write all your commands there and let R run part or all of the commands at once. You open this window by clicking the New Script option in the File menu. If you want to redo your analysis at a later time, or send your code in a file to someone else, you can save the scripts. You can also print the scripts by clicking the Print option from the File menu.
    • R Graphics: This window opens automatically when you create a graph. To bring a graph to the foreground, click on the graphics window or go to the Windows pull-down menu and choose the R Graphics window. Graphs can be saved in various formats, such as jpg, png, bmp, ps, pdf, emf, pictex, xfig and so on.

  • R Console

    [Screenshot of the R Console with labels: pull-down menus, toolbar, "Type command here".]

  • Toolbar & Menu Bar

    Like most window-based programs, R has a toolbar and a menu bar with pull-down menus that you can use to access many of the features of the program. The toolbar contains buttons for the more commonly used procedures. To see what each button does, hold the mouse over the button for a moment and a description of what the button does will appear. The following is a summary of the main pull-down menus and their functions:

    [Table of the main pull-down menus and their functions; not reproduced in the transcript.]

  • Getting Help

    R provides help files for all its functions. To access a help file, use the Help pull-down menu, or type ? followed by a function name in the R console. For example, to get help on the scan function, type:

    > ?scan

  • Notation & Common R Operators

    >  Indicates the prompt at the start of each new line in the R console.

    #  The comment operator, often called the "hash sign" or "pound sign". Any text following the # on a line is a comment. R will ignore (not execute or attempt to interpret) anything written as a comment. Comments are a good way to indicate what you intend a line to do, or to describe a variable or command. They are useful to you when you read your code later on, and to others with whom you share your code.

  • Arithmetic Operators in R

    Operator   Description
    +          Addition
    -          Subtraction
    *          Multiplication
    /          Division
    ^          Exponentiation
    %/%        Integer Division
    %%         Modulus

  • Example

    > 3+3 # this is my first command
    [1] 6
    > 9-2
    [1] 7
    > 17/3
    [1] 5.666667
    > 4*2
    [1] 8
    > 2^3
    [1] 8
    > 17%/%3
    [1] 5
    > 17%%3
    [1] 2
    > 7/0
    [1] Inf
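As a small sketch of how integer division and modulus fit together (the variable names below are illustrative and not from the slides): the quotient and the remainder always recombine into the original number. With a negative operand, R rounds the quotient down, towards minus infinity.

```r
# Illustrative values; the names a, b, q, r are assumptions for this sketch
a <- 17
b <- 3
q <- a %/% b                     # integer quotient
r <- a %% b                      # remainder
recombined <- (q * b + r == a)   # quotient and remainder rebuild a

# Negative operand: the quotient is rounded down, the remainder stays non-negative
neg_q <- -17 %/% 3
neg_r <- -17 %% 3
```

Here `q` is 5 and `r` is 2, while `neg_q` is -6 and `neg_r` is 1.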

  • Special Numbers

    • R supports 4 special numeric values, Inf, -Inf, NaN and NA, as well as the null object NULL. The first 2 are positive and negative infinity. NaN is short for "not-a-number" and means that our calculation either didn't make mathematical sense or could not be performed properly. NA is short for "not available" and represents a missing value, a problem all too common in data analysis. NULL represents a null object in R.
    • There are functions available to check for these special values (see the examples in the next slides).

  • Special Numbers: Example

    > 7/0
    [1] Inf
    > -10^1000
    [1] -Inf
    > exp(-Inf)
    [1] 0
    > log(-2)
    [1] NaN
    > 0/0
    [1] NaN
    > x <- 5
    > names(x)
    NULL
    > is.null(x)
    [1] FALSE
    > is.null(names(x))
    [1] TRUE
    > x <- c(0, Inf, -Inf, NaN, NA)
    > is.finite(x)
    [1] TRUE FALSE FALSE FALSE FALSE
    > is.infinite(x)
    [1] FALSE TRUE TRUE FALSE FALSE
    > is.nan(x)
    [1] FALSE FALSE FALSE TRUE FALSE
    > is.na(x)
    [1] FALSE FALSE FALSE TRUE TRUE
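A minimal sketch of how missing values behave in practice, using an illustrative vector `v` that is not from the slides: comparisons with NA yield NA rather than TRUE or FALSE, and summary functions return NA unless missing values are dropped with `na.rm = TRUE`.

```r
# Illustrative vector with one missing value (assumption for this sketch)
v <- c(2, NA, 4)

na_compare <- NA == NA               # NA: the comparison itself is unknown
plain_mean <- mean(v)                # NA, because one value is missing
clean_mean <- mean(v, na.rm = TRUE)  # missing value dropped before averaging
n_missing  <- sum(is.na(v))          # count of missing values
```

This is why `is.na()` rather than `== NA` is used to test for missing values.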

  • Logical Operators

    Operator   Description
    >          Greater than
    <          Less than
    >=         Greater than or equal to
    <=         Less than or equal to
    ==         Equal to
    !=         Not equal to

  • Examples

    > x <- 56
    > x
    [1] 56
    > 5*2->y
    > y
    [1] 10
    > x>y
    [1] TRUE
    > y<x
    [1] TRUE
    > x>=56
    [1] TRUE
    > y<=x
    [1] TRUE
    > x==56
    [1] TRUE
    > y!=1
    [1] TRUE
    > !(5>3)
    [1] FALSE

  • Examples

    # An example
    > x <- c(1:10)
    > x[(x>8) | (x<5)]
    1 2 3 4 9 10
    > x <- c(1:10)
    > x
    1 2 3 4 5 6 7 8 9 10
    > x > 8
    F F F F F F F F T T
    > x < 5
    T T T T F F F F F F
    > x > 8 | x < 5
    T T T T F F F F T T
    > x[c(T,T,T,T,F,F,F,F,T,T)]
    1 2 3 4 9 10

  • More on the Assignment Operator

    • R is case sensitive, i.e. x and X are different objects.
    • When you use the assignment operator the result is not shown on screen. To see the value that your object took, just type its name.
    • Objects can also be vectors, matrices, data frames, or lists (more later).
    • An object can appear on both sides of the assignment operator, provided it has already been given a value.

    > x <- 5
    > x <- x+3
    > x
    [1] 8

  • More on Logical Operators

    We usually use the symbols & (AND) and | (OR) to combine two or more logical expressions.

    > (5>3) & (8>10)
    [1] FALSE
    > (5>3) | (8>10)
    [1] TRUE

    Logical operators are frequently used in control-flow statements, e.g. if, while, etc. (more later).
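The following sketch, with made-up vectors `u` and `w`, shows that `&` and `|` work element by element on logical vectors, and that `any()` and `all()` reduce a logical vector to a single value. The `&&` form, which is not covered in the slides, is shown here with single values only, which is how it is normally used inside `if`.

```r
# Illustrative logical vectors (assumptions for this sketch)
u <- c(TRUE, TRUE, FALSE)
w <- c(TRUE, FALSE, FALSE)

both   <- u & w            # element-wise AND
either <- u | w            # element-wise OR

any_both   <- any(both)    # is at least one element TRUE?
all_either <- all(either)  # are all elements TRUE?

# && combines two single logical values
single <- (5 > 3) && (8 > 10)
```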

  • Basic Mathematical Functions in R

    Function   Description               Function      Description
    sqrt()     Square root               asin()        Inverse sine
    abs()      Absolute value            atan()        Inverse tangent
    log()      Natural logarithm (ln)    gamma()       Gamma function
    log2()     Logarithm base 2          lgamma()      Natural logarithm of gamma function
    log10()    Logarithm base 10         beta()        Beta function
    exp()      Exponential function      floor()       Round down to integer
    cos()      Cosine                    ceiling()     Round up to integer
    sin()      Sine                      factorial()   Factorial
    tan()      Tangent                   choose()      Combinations
    acos()     Inverse cosine            lchoose()     Natural logarithm of combinations

  • Examples

    > sqrt(16)
    [1] 4
    > abs(-2)
    [1] 2
    > log(10)
    [1] 2.302585
    > log2(10)
    [1] 3.321928
    > log10(10)
    [1] 1
    > exp(3)
    [1] 20.08554
    > cos(pi)
    [1] -1
    > sin(2*pi)
    [1] -2.449213e-16
    > tan(pi/2)
    [1] 1.633178e+16
    > sin(pi/2)
    [1] 1
    > tan(0)
    [1] 0
    > acos(0.2)
    [1] 1.369438
    > atan(2)
    [1] 1.107149
    > asin(0)
    [1] 0
    > gamma(2)
    [1] 1
    > beta(2,3)
    [1] 0.08333333
    > lgamma(4)
    [1] 1.791759
    > floor(4.9)
    [1] 4
    > ceiling(4.1)
    [1] 5
    > factorial(5)
    [1] 120
    > choose(5,2)
    [1] 10
    > lchoose(5,2)
    [1] 2.302585

  • General Functions in R

    • builtins(): lists the built-in functions in R.
    • cat() or print(): print on screen.
    • ls(): list all objects.
    • rm(): remove objects.
    • getwd(): returns the working directory.
    • setwd(): changes the working directory.
    • list.files(): returns a list of all files in the working directory.
    • date() or Sys.time(): returns the current date & time.

  • Classes of Objects in R

    R has five basic classes of objects:

    • Numeric (real number)
    > x <- 3.5
    • Integer
    > x <- 3L
    • Complex
    > x <- 4+3i
    > x
    [1] 4+3i
    • Logical
    > x <- 3
    > y <- x > 4
    > y
    [1] FALSE
    • Character
    > x <- "DIMITRIS"
    > x
    [1] "DIMITRIS"

    Use the function mode() or class() to check an object's class.
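As an illustration of `class()` and `mode()`, the sketch below applies both to one value of each basic class. The values themselves are arbitrary examples chosen for this sketch.

```r
# One illustrative value per basic class (assumed values)
values <- list(3.5, 3L, 4+3i, TRUE, "DIMITRIS")

classes <- sapply(values, class)  # class of each value
modes   <- sapply(values, mode)   # storage mode of each value
```

`classes` comes out as "numeric", "integer", "complex", "logical", "character". `mode()` reports "numeric" for both the real and the integer value, so `class()` is the finer-grained check.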

  • Data Types in R

    • Vectors.
    • Matrices.
    • Arrays.
    • Data frames.
    • Lists.

    In the first 3 cases all elements must be of the same type.

    A list allows you to gather a variety of (possibly unrelated) objects under one name.

  • Vectors

    Vectors in R are ordered collections of elements of the same type:

    • Numerical Vectors.
    • Character Vectors.
    • Logical Vectors.
    • Factors.

  • Numerical Vectors

    • To create a numerical vector use the function c().
    > x <- c(1,2,3,4,5)
    > x
    [1] 1 2 3 4 5
    • The same function can be used to concatenate already defined vectors.
    > x <- c(1,2,3,4,5)
    > y <- c(6,7)
    > z <- c(10,x,y)
    > z
    [1] 10 1 2 3 4 5 6 7

  • Numerical Vectors

    Functions for numerical vectors:

    • Length
    > x <- c(1,2,3,4,5)
    > length(x)
    [1] 5
    • Min & Max
    > min(x)
    [1] 1
    > max(x)
    [1] 5
    • Sum & Product
    > sum(x)
    [1] 15
    > prod(x)
    [1] 120

  • Numerical Vectors

    • Sort
    > x <- c(3,5,2,1,6)
    > sort(x)
    [1] 1 2 3 5 6
    > sort(x,decreasing=T)
    [1] 6 5 3 2 1
    • Rank
    > x <- c(3,5,2,1,6)
    > rank(x)
    [1] 3 4 2 1 5

  • Numerical Vectors: Rank & Ties

    • Average: assigns each tied element the "average" rank
    > x2 <- c(3,5,2,1,6,3)
    > rank(x2, ties.method="average")
    [1] 3.5 5.0 2.0 1.0 6.0 3.5
    • First: lets the "earlier" entry "win", so the ranks are in numerical order
    > rank(x2, ties.method="first")
    [1] 3 5 2 1 6 4
    • Random: breaks ties randomly
    > rank(x2, ties.method="random")
    [1] 4 5 2 1 6 3
    > rank(x2, ties.method="random")
    [1] 3 5 2 1 6 4
    • Min/Max: assigns every tied element the lowest/highest rank
    > rank(x2, ties.method="min")
    [1] 3 5 2 1 6 3
    > rank(x2, ties.method="max")
    [1] 4 5 2 1 6 4

  • Numerical Vectors

    • Order
    > x <- c(3,5,2,1,6)
    > order(x)
    [1] 4 3 1 2 5

    (thus the smallest value is in the 4th position, the next one in the 3rd, etc.)
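A short sketch of why `order()` is useful: indexing a vector with its own ordering gives the sorted vector, and the same index can rearrange a second, parallel vector. The `labels` vector is hypothetical and only serves the illustration.

```r
x <- c(3, 5, 2, 1, 6)
idx <- order(x)               # positions of the values from smallest to largest

sorted <- x[idx]              # same result as sort(x)

# Hypothetical labels, one per element of x
labels <- c("c", "e", "b", "a", "f")
labels_sorted <- labels[idx]  # labels rearranged to follow the sorted values
```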

  • Numerical Vectors

    • Assign names to vector members
    > weight <- c(70,57,68,82)
    > names(weight)
    NULL
    > names(weight) <- c("Mary","Kelly","Elena","George")
    > names(weight)
    [1] "Mary" "Kelly" "Elena" "George"
    > weight
      Mary  Kelly  Elena George
        70     57     68     82

  • Numerical Vectors

    • Missing Values
    > x3 <- c(1,2,3,NA,9)
    > x3
    [1] 1 2 3 NA 9
    > is.na(x3)
    [1] FALSE FALSE FALSE TRUE FALSE

  • Numerical Vectors

    • Colon Operator

    Using the syntax a:b you can generate a sequence of numbers from a to b with step 1 (or -1).
    > x <- 1:10
    > x
    [1] 1 2 3 4 5 6 7 8 9 10
    > x <- -4:10
    > x
    [1] -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
    > x <- 6:1
    > x
    [1] 6 5 4 3 2 1
    > x <- 3.3:10.3
    > x
    [1] 3.3 4.3 5.3 6.3 7.3 8.3 9.3 10.3

    If (b-a) is not an integer, then R will stop before b.
    > x <- 3.3:7
    > x
    [1] 3.3 4.3 5.3 6.3

  • Numerical Vectors

    If you want to use a different step then you should use the function seq(). Its arguments are the first (from) and last (to) numbers of the sequence, the step (by), the length of the sequence (length), and the name of another vector (along) whose length the generated sequence should match. Of these arguments you need to specify only 3; if you specify only 2, R takes by = 1 by default.

    > seq(from=1,to=9, by=2)
    [1] 1 3 5 7 9
    > seq(from=1,to=9, length=3)
    [1] 1 5 9
    > seq(to=9, length=3)
    [1] 7 8 9
    > seq(from=1,by=2,length=10)
    [1] 1 3 5 7 9 11 13 15 17 19
    > y <- 1:10
    > seq(from=1,by=2,along=y)
    [1] 1 3 5 7 9 11 13 15 17 19

    If the difference between the last and the first number, divided by the step, is not an integer, then R will stop the sequence before the last term.

    > seq(from=1,to=10,by=2)
    [1] 1 3 5 7 9
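A few more `seq()` calls as a sketch, using the full argument name `length.out` (the slides abbreviate it to `length`) and `seq_len()`, a base R helper not mentioned in the slides. The endpoints are arbitrary illustrative values.

```r
s1 <- seq(from = 0, to = 1, by = 0.25)       # fixed step
s2 <- seq(from = 0, to = 1, length.out = 5)  # fixed number of points
s3 <- seq(from = 10, to = 1, by = -3)        # decreasing sequence with a negative step
s4 <- seq_len(4)                             # the integers 1, 2, 3, 4
```

`s1` and `s2` contain the same five values, and `s3` runs 10, 7, 4, 1.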

  • Numerical Vectors

    Replicate values or vectors. With the function rep() you can replicate a value or a vector as many times as you want. The main arguments of the function are: (a) the value or the vector you wish to replicate and (b) the number of times to repeat the value or the whole vector (times), or the number of times to repeat each element of the vector (each).

    > rep(2,5)
    [1] 2 2 2 2 2
    > x <- c(1,2,3)
    > rep(x,5)
    [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
    > rep(x,each=5)
    [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

  • Numerical Vectors

    Arithmetic Operators & Vectors. You can use the numeric operators between numerical vectors (they should be of the same length), or between scalars and vectors. The mathematical functions can also be applied to numerical vectors.

    > x <- c(1,2,3)
    > x*3
    [1] 3 6 9
    > x^2
    [1] 1 4 9
    > y <- c(4,5,6)
    > y/x
    [1] 4.0 2.5 2.0
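As a brief sketch of what happens when the lengths differ (the vectors are illustrative): R does not stop with an error but reuses ("recycles") the shorter vector, and it warns only when the longer length is not a multiple of the shorter one. Keeping vectors the same length, as advised above, avoids surprises.

```r
# Illustrative vectors of different lengths (assumptions for this sketch)
x <- c(1, 2, 3, 4)
y <- c(10, 20)

recycled <- x + y       # y is reused as 10, 20, 10, 20
scaled   <- sqrt(x) * 2 # functions and scalars act element by element
```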

  • Numerical Vectors

    By using [] you can easily choose specific elements of a vector. For example, if you wish to extract the first element of the vector x you use the syntax x[1]. A negative index drops the corresponding elements.

    > x <- c(1,3,5,7,9)
    > x
    [1] 1 3 5 7 9
    > x[2]
    [1] 3
    > x[2:4]
    [1] 3 5 7
    > x[c(1,3)]
    [1] 1 5
    > x[-c(1,3)]
    [1] 3 7 9
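Besides positions, a condition can be placed inside `[]`. The sketch below reuses the vector from the slide and adds `which()`, a base R function not shown in the slides, to recover the positions that satisfy the condition. It also shows that indexing works on the left-hand side of an assignment.

```r
x <- c(1, 3, 5, 7, 9)

big <- x[x > 4]      # elements larger than 4
pos <- which(x > 4)  # their positions in x

x[2] <- 30           # replace the second element in place
```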

  • Character Vectors

    > x <- c("Statistics","Mathematics")
    > x
    [1] "Statistics" "Mathematics"

  • Functions for Character Vectors

    • character()
    > character(length=2)
    [1] "" ""
    • as.character()
    > x <- 1:10
    > x
    [1] 1 2 3 4 5 6 7 8 9 10
    > as.character(x)
    [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
    • as.numeric()
    > x <- c("-0.1","2.7","B")
    > as.numeric(x)
    [1] -0.1 2.7 NA
    Warning message:
    NAs introduced by coercion

  • Functions for Character Vectors

    • is.character()
    > y <- as.character(1:10)
    > is.character(y)
    [1] TRUE
    > x <- c("-0.1","2.7","B")
    > x
    [1] "-0.1" "2.7" "B"
    > is.character(x)
    [1] TRUE
    > x <- as.numeric(x)
    > x
    [1] -0.1 2.7 NA
    > is.character(x)
    [1] FALSE
    • noquote()
    > y <- as.character(1:10)
    > noquote(y)
    [1] 1 2 3 4 5 6 7 8 9 10
    • nchar()
    > y <- as.character(1:10)
    > nchar(y)
    [1] 1 1 1 1 1 1 1 1 1 2

  • Functions for Character Vectors

    • paste()
    > x
    [1] 1 2 3 4 5 6 7 8 9 10
    > paste(x)
    [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
    > paste('Mathematical','Statistics')
    [1] "Mathematical Statistics"
    > paste('3','5','8', sep="+")
    [1] "3+5+8"
    > paste(paste(3,5, sep=' + '), 8, sep=' = ')
    [1] "3 + 5 = 8"
    > paste('Chapter',2, sep=" ")
    [1] "Chapter 2"
    > paste("Today is", date())
    [1] "Today is Mon Jun 03 20:42:41 2013"
    > a <- c("Kwstas","Maria")
    > b <- c("Papadopoulos","Kyriakou")
    > paste(a,b)
    [1] "Kwstas Papadopoulos" "Maria Kyriakou"
    > paste(b,a, sep=', ')
    [1] "Papadopoulos, Kwstas" "Kyriakou, Maria"
    > paste("Chapter", 1:2, sep=" ")
    [1] "Chapter 1" "Chapter 2"

  • Functions for Character Vectors

    • strsplit()
    > x <- c("Statistics","Mathematics")
    > strsplit(x,split="a")
    [[1]]
    [1] "St" "tistics"
    [[2]]
    [1] "M" "them" "tics"
    > strsplit(x, split='')
    [[1]]
    [1] "S" "t" "a" "t" "i" "s" "t" "i" "c" "s"
    [[2]]
    [1] "M" "a" "t" "h" "e" "m" "a" "t" "i" "c" "s"
    > strsplit(x, split="th")
    [[1]]
    [1] "Statistics"
    [[2]]
    [1] "Ma" "ematics"
    • substr()
    > substr("abcdef",2,4)
    [1] "bcd"
    > x <- c("Statistics","Mathematics")
    > substr(x,2,4)
    [1] "tat" "ath"
    • grep()
    # countries is a character vector of country names
    > grep("United", countries)
    [1] 2 3 6
    > grep("United", countries, value=TRUE)
    [1] "United States" "United Kingdom" "United Arab Emirates"
    # data is a data frame with a column named country
    > data[grep("United", data$country), ]
                    country   gdp income continent
    23 United Arab Emirates 21000  24200        AS
    42       United Kingdom 28300  29400        EU
    82        United States 35200  31200        NA

  • Functions for Character Vectors

    • sub() & gsub(): sub() replaces only the first match in each string, gsub() replaces all matches.
    > values <- c("1,700","2,300")
    > as.numeric(values)
    [1] NA NA
    Warning message:
    NAs introduced by coercion
    > as.numeric(gsub(",","",values))
    [1] 1700 2300
    > as.numeric(sub(",","",values))
    [1] 1700 2300
    > sub(",","",values)
    [1] "1700" "2300"
    > gsub(",","",values)
    [1] "1700" "2300"
    > values <- c("1,000,000","2,000,000")
    > sub(",","",values)
    [1] "1000,000" "2000,000"
    > gsub(",","",values)
    [1] "1000000" "2000000"
    • toupper() & tolower()
    > x
    [1] "Statistics" "Mathematics"
    > tolower(x)
    [1] "statistics" "mathematics"
    > toupper(x)
    [1] "STATISTICS" "MATHEMATICS"
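The thousands-separator clean-up above can be wrapped in a small helper. This is only a sketch: the function name `parse_amount` is hypothetical, and `fixed = TRUE` is added so that the comma is treated as a literal string rather than a regular expression.

```r
# Hypothetical helper: strip thousands separators, then convert to numeric
parse_amount <- function(s) {
  as.numeric(gsub(",", "", s, fixed = TRUE))
}

amounts <- parse_amount(c("1,700", "2,000,000"))

# sub() would remove only the first comma and leave a non-numeric string
first_only <- sub(",", "", "2,000,000", fixed = TRUE)
```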

  • Logical Vectors

    > logical(3)
    [1] FALSE FALSE FALSE
    > as.logical(c(0:10))
    [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

  • Factors

    • Categorical Variables:
    > gender <- c("Male","Female","Male","Male","Female")
    > gender
    [1] "Male" "Female" "Male" "Male" "Female"
    > factor(gender)
    [1] Male Female Male Male Female
    Levels: Female Male
    > levels(factor(gender))
    [1] "Female" "Male"
    • Ordinal Variables:
    > opinion <- c("Low","Low","High","High","High","Medium")
    > ordered(opinion, levels=c('Low', 'Medium', 'High'))
    [1] Low Low High High High Medium
    Levels: Low < Medium < High
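A sketch of what factors make possible, reusing the gender data from the slide: `table()` counts the observations per level, `as.integer()` exposes the underlying level codes, and an ordered factor supports comparisons between levels. The short `opinion` vector here is a made-up example.

```r
gender <- factor(c("Male", "Female", "Male", "Male", "Female"))

counts <- table(gender)      # number of observations per level
codes  <- as.integer(gender) # internal codes: Female = 1, Male = 2

# Made-up ordinal data; the level order is given explicitly
opinion <- factor(c("Low", "High", "Medium"),
                  levels = c("Low", "Medium", "High"),
                  ordered = TRUE)
higher <- opinion > "Low"    # comparisons respect the level order
```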

  • Matrices

    > x <- 1:10
    > X <- matrix(x, nrow=5)
    > X
         [,1] [,2]
    [1,]    1    6
    [2,]    2    7
    [3,]    3    8
    [4,]    4    9
    [5,]    5   10
    > X <- matrix(x, ncol=2)
    > X
         [,1] [,2]
    [1,]    1    6
    [2,]    2    7
    [3,]    3    8
    [4,]    4    9
    [5,]    5   10
    > X <- matrix(x, nrow=5, byrow=TRUE)
    > X
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4
    [3,]    5    6
    [4,]    7    8
    [5,]    9   10

  • Matrices

    • Dimension
    > dim(X)
    [1] 5 2
    • Specific elements
    > X[3,2]
    [1] 6
    • Specific rows or columns
    > X[3,]
    [1] 5 6
    > X[,2]
    [1] 2 4 6 8 10

  • Matrices

    Create matrices by combining vectors (cbind & rbind).

    > x1 <- 1:5
    > x2 <- 6:10
    > cbind(x1,x2)
         x1 x2
    [1,]  1  6
    [2,]  2  7
    [3,]  3  8
    [4,]  4  9
    [5,]  5 10
    > rbind(x1,x2)
       [,1] [,2] [,3] [,4] [,5]
    x1    1    2    3    4    5
    x2    6    7    8    9   10

  • Matrices

    • Diagonal Matrices
    > diag(1:5)
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    0    0    0    0
    [2,]    0    2    0    0    0
    [3,]    0    0    3    0    0
    [4,]    0    0    0    4    0
    [5,]    0    0    0    0    5
    • Identity Matrix
    > diag(5)
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    0    0    0    0
    [2,]    0    1    0    0    0
    [3,]    0    0    1    0    0
    [4,]    0    0    0    1    0
    [5,]    0    0    0    0    1

  • Matrices

    Arithmetic Operators & Matrices. You can use the numeric operators between matrices, between vectors & matrices (in both cases be careful with the dimensions), or between scalars & matrices. The mathematical functions can also be applied to matrices.

    Operator   Description
    %*%        Multiplication of Matrices
    t()        Transpose of a Matrix
    solve()    Inverse of a Matrix

  • Matrices

    > x <- matrix(1:6, nrow=3)
    > x
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    > dim(x)
    [1] 3 2
    > y <- matrix(c(0,1,1,1), nrow=2)
    > y
         [,1] [,2]
    [1,]    0    1
    [2,]    1    1
    > x%*%y
         [,1] [,2]
    [1,]    4    5
    [2,]    5    7
    [3,]    6    9
    > t(x)
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    > solve(y)
         [,1] [,2]
    [1,]   -1    1
    [2,]    1    0
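As a quick check of the inverse, the sketch below multiplies `solve(y)` by `y`, which should give the identity matrix. It also uses the two-argument form `solve(a, b)`, a standard base R usage not shown in the slides, to solve a linear system directly; the right-hand side `b` is an arbitrary example.

```r
y <- matrix(c(0, 1, 1, 1), nrow = 2)

y_inv <- solve(y)    # inverse of y
check <- y_inv %*% y # should equal the 2 x 2 identity matrix

# Arbitrary right-hand side: solve y %*% sol = b without forming the inverse
b <- c(2, 3)
sol <- solve(y, b)
```

Here `sol` is (1, 2), since 0*1 + 1*2 = 2 and 1*1 + 1*2 = 3.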

  • Matrices

    The function apply().

    > x <- matrix(1:6, nrow=3)
    > x
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    > apply(x,1,sum)   # compute row sums
    [1] 5 7 9
    > apply(x,2,sum)   # compute column sums
    [1] 6 15

  • Function apply()

    > str(apply)
    function (X, MARGIN, FUN, ...)

    • X is an array.
    • MARGIN is an integer vector indicating which margins should be "retained".
    • FUN is a function to be applied.
    • ... is for other arguments to be passed to FUN.

  • Function apply()

    > x <- matrix(rnorm(200), 20, 10)   # rnorm() simulates from the standard normal (more later)
    > apply(x, 2, mean)
    [1] 0.04868268 0.35743615 -0.09104379
    [4] -0.05381370 -0.16552070 -0.18192493
    [7] 0.10285727 0.36519270 0.14898850
    [10] 0.26767260
    > apply(x, 1, sum)
    [1] -1.94843314 2.60601195 1.51772391
    [4] -2.80386816 3.73728682 -1.69371360
    [7] 0.02359932 3.91874808 -2.39902859
    [10] 0.48685925 -1.77576824 -3.34016277
    [13] 4.04101009 0.46515429 1.83687755
    [16] 4.36744690 2.21993789 2.60983764
    [19] -1.48607630 3.58709251

  • col/row sums & means

    For sums and means of matrix dimensions, we have some shortcuts:

    • rowSums(x) = apply(x, 1, sum)
    • rowMeans(x) = apply(x, 1, mean)
    • colSums(x) = apply(x, 2, sum)
    • colMeans(x) = apply(x, 2, mean)

    The shortcut functions are much faster, but you won't notice unless you're using a large matrix.
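The equivalences can be verified directly. The sketch below simulates a matrix of the same shape as in the earlier example; the seed and the name `m` are arbitrary choices made here for reproducibility.

```r
set.seed(1)                     # arbitrary seed, only for reproducibility
m <- matrix(rnorm(200), 20, 10)

same_row_sums  <- all.equal(rowSums(m),  apply(m, 1, sum))
same_row_means <- all.equal(rowMeans(m), apply(m, 1, mean))
same_col_sums  <- all.equal(colSums(m),  apply(m, 2, sum))
same_col_means <- all.equal(colMeans(m), apply(m, 2, mean))
```

`all.equal()` is used instead of `==` because the two routes may differ by floating-point rounding.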

  • Function apply()

    Quantiles of the rows of a matrix.

    > x <- matrix(rnorm(200), 20, 10)
    > apply(x, 1, quantile, probs = c(0.25, 0.75))
              [,1]        [,2]       [,3]        [,4]
    25% -0.3304284 -0.99812467 -0.9186279 -0.49711686
    75%  0.9258157  0.07065724  0.3050407 -0.06585436
               [,5]       [,6]      [,7]       [,8]
    25% -0.05999553 -0.6588380 -0.653250 0.01749997
    75%  0.52928743  0.3727449  1.255089 0.72318419
              [,9]      [,10]      [,11]      [,12]
    25% -1.2467955 -0.8378429 -1.0488430 -0.7054902
    75%  0.3352377  0.7297176  0.3113434  0.4581150
             [,13]      [,14]      [,15]      [,16]
    25% -0.1895108 -0.5729407 -0.5968578 -0.9517069
    75%  0.5326299  0.5064267  0.4933852  0.8868922
             [,17]      [,18]      [,19]     [,20]
    25% -0.2502935 -0.7488003 -0.7190923 -0.638243
    ...

  • Arrays

    Arrays generalize matrices to 3 or more dimensions. To create them we use the function array(). The dimensions of the array are given by the parameter dim. For example, if dim=c(2,3,4), we will have a 3-dimensional array of dimension 2×3×4.

    > X <- array(c(1:12, 36:47), dim=c(2,3,4))
    > X
    , , 1

         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6

    , , 2

         [,1] [,2] [,3]
    [1,]    7    9   11
    [2,]    8   10   12

    , , 3

         [,1] [,2] [,3]
    [1,]   36   38   40
    [2,]   37   39   41

    , , 4

         [,1] [,2] [,3]
    [1,]   42   44   46
    [2,]   43   45   47

  • Function apply()

    Average matrix in an array.

    > a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
    > apply(a, c(1, 2), mean)
               [,1]        [,2]
    [1,] -0.2353245 -0.03980211
    [2,] -0.3339748  0.04364908
    > rowMeans(a, dims = 2)
               [,1]        [,2]
    [1,] -0.2353245 -0.03980211
    [2,] -0.3339748  0.04364908

  • Data Frames

    • Data frames are used for storing data tables. A data frame is a list of vectors (not necessarily of the same type) of equal length. In a data frame we usually store the observations we have from a sample.
    • To create a data frame we use the function data.frame().

  • Data Frames

    > Gender <- c("Male","Male","Male","Female")
    > Gender <- factor(Gender)
    > Gender
    [1] Male Male Male Female
    Levels: Female Male
    > Smoking <- c(TRUE,TRUE,FALSE,FALSE)
    > Choresterol <- c(200,220,180,172)
    > Choresterol
    [1] 200 220 180 172
    > sample <- data.frame(Gender, Smoking, Choresterol)
    > sample
      Gender Smoking Choresterol
    1   Male    TRUE         200
    2   Male    TRUE         220
    3   Male   FALSE         180
    4 Female   FALSE         172

  • Data Frames

    With the function as.data.frame() you can convert an R object into a data frame. With the function row.names() you can define names for the rows (observations) of the data frame. With the function names() you can give names to the columns (variables) of the data frame.

  • Data Frames

    > x <- matrix(c(1,1,1,0,1,1,0,0,200,220,180,172), ncol=3)
    > x
         [,1] [,2] [,3]
    [1,]    1    1  200
    [2,]    1    1  220
    [3,]    1    0  180
    [4,]    0    0  172
    > x <- as.data.frame(x)
    > x
      V1 V2  V3
    1  1  1 200
    2  1  1 220
    3  1  0 180
    4  0  0 172
    > names(x)
    [1] "V1" "V2" "V3"
    > names(x) <- c("Gender","Smoking","Choresterol")
    > x
      Gender Smoking Choresterol
    1      1       1         200
    2      1       1         220
    3      1       0         180
    4      0       0         172
    > row.names(x) <- c("obs1","obs2","obs3","obs4")
    > x
         Gender Smoking Choresterol
    obs1      1       1         200
    obs2      1       1         220
    obs3      1       0         180
    obs4      0       0         172

  • Data Frames

    Functions similar to those used for matrices can also be used here.

    > x
         Gender Smoking Choresterol
    obs1      1       1         200
    obs2      1       1         220
    obs3      1       0         180
    obs4      0       0         172
    > dim(x)
    [1] 4 3
    > x[1,]
         Gender Smoking Choresterol
    obs1      1       1         200
    > x[1,2]
    [1] 1
    > x$Gender
    [1] 1 1 1 0
    > rbind(1,x)
         Gender Smoking Choresterol
    1         1       1           1
    obs1      1       1         200
    obs2      1       1         220
    obs3      1       0         180
    obs4      0       0         172
    > cbind(1,x)
         1 Gender Smoking Choresterol
    obs1 1      1       1         200
    obs2 1      1       1         220
    obs3 1      1       0         180
    obs4 1      0       0         172
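A common use of the indexing shown above is selecting rows by a condition on a column. The sketch below rebuilds the sample data under the hypothetical name `df`, keeping the column spelling used in the slides, and then filters and summarizes it.

```r
# Same data as the slides' sample, stored under the illustrative name df
df <- data.frame(
  Gender      = factor(c("Male", "Male", "Male", "Female")),
  Smoking     = c(TRUE, TRUE, FALSE, FALSE),
  Choresterol = c(200, 220, 180, 172)
)

smokers   <- df[df$Smoking, ]                     # rows where Smoking is TRUE
mean_chol <- mean(df$Choresterol)                 # mean of one column
high      <- df[df$Choresterol > 190, "Gender"]   # one column for selected rows
```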

  • Lists

    Lists are generalized vectors whose elements are not necessarily of the same type or length. To create a list you can use the function list(), giving as its main arguments the objects (members) you wish the list to contain, together with their names.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 80

    Lists

    > Gender <- c("Male", "Male", "Male", "Female")
    > Gender <- factor(Gender)
    > Gender
    [1] Male   Male   Male   Female
    Levels: Female Male
    > x <- 1:10
    > x
     [1]  1  2  3  4  5  6  7  8  9 10
    > sample
      Gender Smoking Choresterol
    1   Male    TRUE         200
    2   Male    TRUE         220
    3   Male   FALSE         180
    4 Female   FALSE         172
    > y <- list(my_sample = sample, x = x, the_gender = Gender)
    > y
    $my_sample
      Gender Smoking Choresterol
    1   Male    TRUE         200
    2   Male    TRUE         220
    3   Male   FALSE         180
    4 Female   FALSE         172

    $x
     [1]  1  2  3  4  5  6  7  8  9 10

    $the_gender
    [1] Male   Male   Male   Female
    Levels: Female Male

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 81

    Lists

    With the symbol $ or [[ ]], we can access a member of the list.

    > y$x
     [1]  1  2  3  4  5  6  7  8  9 10
    > y[[3]]
    [1] Male   Male   Male   Female
    Levels: Female Male
    > y$x[1:3]
    [1] 1 2 3

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 82

    Reading Data

    There are a few principal functions for reading data into R:
    - read.table, read.csv, for reading tabular data
    - readLines, for reading lines of a text file
    - source, for reading in R code files (inverse of dump)
    - dget, for reading in R code files (inverse of dput)
    - load, for reading in saved workspaces

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 83

    Reading & Writing Data

    There are analogous functions for writing data to files:
    - write.table
    - writeLines
    - dump
    - dput
    - save

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 84

    The function read.table()

    The read.table function is one of the most commonly used functions for reading data. It has a few important arguments:
    - file: the name of a file, or a connection
    - header: logical indicating if the file has a header line
    - sep: a string indicating how the columns are separated
    - colClasses: a character vector indicating the class of each column in the dataset
    - nrows: the number of rows in the dataset
    - comment.char: a character string indicating the comment character
    - skip: the number of lines to skip from the beginning
    - stringsAsFactors: should character variables be coded as factors?

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 85

    Reading & Writing Data

    > Gender <- [...]
    > write(Gender, file="g.txt", ncol=4)
    > x
     [1]  1  2  3  4  5  6  7  8  9 10
    > write(x, file="x.txt", ncol=length(x))
    > Smoking
    [1]  TRUE  TRUE FALSE FALSE
    > write(Smoking, file="smoking.txt", ncol=4)
    > X
         [,1] [,2]
    [1,]    1    7
    [2,]    2    8
    [3,]    3    9
    [4,]    4   10
    [5,]    5   11
    [6,]    6   12
    > write(t(X), "X.txt", ncol=2)
    > sample
      Gender Smoking Cholesterol
    1   Male    TRUE         200
    2   Male    TRUE         220
    3   Male   FALSE         180
    4 Female   FALSE         172
    > write.table(sample, file="sample.txt")
    > x <- [...]
    > x
     [1]  1  2  3  4  5  6  7  8  9 10
    > X <- [...]
    > X
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4
    [3,]    5    6
    [4,]    7    8
    [5,]    9   10
    [6,]   11   12

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 86

    Reading & Writing Data

    > zz <- read.table("sample.txt")
    > zz
      Gender Smoking Cholesterol
    1   Male    TRUE         200
    2   Male    TRUE         220
    3   Male   FALSE         180
    4 Female   FALSE         172
    > zz <- [...]
    > zz
      Gender Smoking Cholesterol
    1   Male    TRUE         200
    2   Male    TRUE         220
    3   Male   FALSE         180
    4 Female   FALSE         172
    > sapply(zz, mode)
         Gender     Smoking Cholesterol
      "numeric"   "logical"   "numeric"
    > zz <- read.table("sample.txt", stringsAsFactors = FALSE)
    > zz
      Gender Smoking Cholesterol
    1   Male    TRUE         200
    2   Male    TRUE         220
    3   Male   FALSE         180
    4 Female   FALSE         172
    > sapply(zz, mode)
         Gender     Smoking Cholesterol
    "character"   "logical"   "numeric"

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 87

    The function read.table()

    For small to moderately sized datasets, you can usually call read.table without specifying any other arguments:

    > data <- read.table("foo.txt")

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 88

    The function read.table()

    With much larger datasets, doing the following things will make your life easier and will prevent R from choking.
    - Read the help page for read.table, which contains many hints.
    - Make a rough calculation of the memory required to store your dataset. If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.
    - Set comment.char = "" if there are no commented lines in your file.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 89

    The function read.table()

    Use the colClasses argument. Specifying this option instead of using the default can make read.table run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are "numeric", for example, then you can just set colClasses = "numeric". A quick and dirty way to figure out the classes of each column is the following:

    initial
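    The quick-and-dirty approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original slide code: the temporary file, its three columns and the choice of nrows = 100 are all hypothetical.

```r
# Hypothetical file, written to a temporary location for the demonstration.
path <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:1000, value = rnorm(1000), label = "a"),
            file = path, row.names = FALSE)

# Read only the first rows to infer the class of each column ...
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# ... then reuse those classes when reading the whole file.
full <- read.table(path, header = TRUE, colClasses = classes)
str(full)
```

    Reading a small prefix is cheap, and passing the inferred classes spares read.table from guessing the type of every column over the full file.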

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 90

    Textual Format

    - dumping and dputing are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable.
    - Unlike writing out a table or csv file, dump and dput preserve the metadata (sacrificing some readability), so that another user doesn't have to specify it all over again.
    - Textual formats can work much better with version control programs like subversion or git, which can only track changes meaningfully in text files.
    - Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem.
    - Textual formats adhere to the "Unix philosophy".
    - Downside: The format is not very space-efficient.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 91

    dput() & dget()

    Another way to pass data around is by deparsing the R object with dput and reading it back in using dget.

    > y <- data.frame(a = 1, b = "a")
    > dput(y)
    structure(list(a = 1,
    b = structure(1L, .Label = "a",
    class = "factor")),
    .Names = c("a", "b"), row.names = c(NA, -1L),
    class = "data.frame")
    > dput(y, file = "y.R")
    > new.y <- dget("y.R")
    > new.y
      a b
    1 1 a

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 92

    dump()

    Multiple objects can be deparsed using the dump function and read back in using source.

    > x <- "foo"
    > y <- data.frame(a = 1, b = "a")
    > dump(c("x", "y"), file = "data.R")
    > rm(x, y)
    > source("data.R")
    > y
      a b
    1 1 a
    > x
    [1] "foo"

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 93

    Package "foreign"

    Source     Function
    SPSS       read.spss()
    S          read.S()
    STATA      read.dta()
    SAS        read.xport()
    Epi Info   read.epiinfo()
    Minitab    read.mtb()
    Octave     read.octave()

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 94

    Example: From Excel to R

    Suppose the first row contains variable names and we have data for 6 columns (variables). One of the variables is "value", where the comma is used as a thousands separator. Missing values are denoted by ":".

    - Convert data.xls to data.csv: from the menu File, choose Save As… and select csv as the type.
    - Open data.csv with WordPad and replace all ";" with ",".
    - Run R from the same directory where data.csv is.
    - Type in R:

    > data data$value class(area$value)
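    The R commands on this slide are only partly legible, so the following is a hedged sketch of one way to carry out the step described, not the original code. The file contents, the column names area and year, and the use of gsub() are assumptions made for illustration; only the "value" column, the thousands separator and the ":" missing-value marker come from the text.

```r
# Hypothetical CSV standing in for data.csv: a header row, "value" uses a
# comma as thousands separator, and ":" marks a missing value.
path <- tempfile(fileext = ".csv")
writeLines(c('area,year,value',
             'North,2010,"1,250"',
             'South,2010,:',
             'West,2010,"12,400"'), path)

data <- read.csv(path, header = TRUE, na.strings = ":",
                 stringsAsFactors = FALSE)

# Drop the thousands separator, then convert the column to numeric.
data$value <- as.numeric(gsub(",", "", data$value))
class(data$value)
```

    The na.strings argument turns ":" into NA while reading, so the conversion to numeric does not produce coercion warnings.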

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 95

    Example: From R to Excel

    > write.csv(data, file="data.csv")

    Convert data.csv to data.xls:
    - Open the file and select the first column with the mouse.
    - From the menu Data choose Text to Columns…
    - Click on "Delimited", and in the next window choose "Comma" as a delimiter.
    - In the next window click on Finish, and from the menu File save the file (Save As…) as .xls.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 96

    Functions in R

    > x
     [1]  46 104  94 114  35  70 120  29  19 135 200 222  89 100  55 214  15  81 118 193
    > range

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 97

    Functions in R

    > calc <- function(a, b = 2) a^b
    > calc(4, 3)
    [1] 64

    In the definition, b = 2 is the default value of b. The call above uses a = 4, b = 3.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 98

    Functions in R

    Syntax               Description
    if (A) B             Check if A is satisfied. If yes, perform B.
    if (A) B1 else B2    Check if A is satisfied. If yes, perform B1, else B2.
    ifelse(A, B1, B2)    Vectorised form of the previous: for each element of A, returns B1 where A is TRUE and B2 otherwise.
    break                Stops the current loop.
    next                 Stops the current iteration and starts the next one.
    return(A)            Terminates a function and returns A.
    while (A) B          Repeats B as long as the condition A holds.
    repeat B             Repeats B indefinitely; the loop is left with break.
    for (index in A) B   Performs B once for each value of index in A.
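    As a small illustration of break and next from the table above, the following sketch collects the odd numbers up to 10 inside a repeat loop. The variable names are hypothetical and not taken from the slides.

```r
i <- 0
odd_values <- c()
repeat {
  i <- i + 1
  if (i > 10) break          # leave the loop entirely
  if (i %% 2 == 0) next      # skip even numbers, go to the next iteration
  odd_values <- c(odd_values, i)
}
odd_values
```

    Because repeat has no condition of its own, the break statement is the only way out of the loop.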

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 99

    if else statement in R

    if(A) {A1} else {A2}

    > x if(x

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 100

    if else statement in R

    if(A) {
      B1
    } else if(C) {
      B2
    } else {
      B3
    }

    > x if(x
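    The example on this slide is cut off in the transcript. As a stand-in, here is a minimal sketch of the if / else if / else pattern; the function sign_label and its messages are hypothetical and not from the original slides.

```r
# Hypothetical helper: describe the sign of a single number.
sign_label <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}
sign_label(-3)
sign_label(0)
sign_label(5)

# ifelse() is the vectorised counterpart, applied to a whole vector at once.
ifelse(c(-1, 2, -3) < 0, "negative", "non-negative")
```

    Note that if() expects a single logical value, whereas ifelse() works element by element.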

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 101

    for loop

    for(index in A) B

    > x n proda summ for(i in 1:n)

    {summ

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 102

    for loop > A summ for(i in 1:1000)

    {for(j in 1:1000){summ system.time({summ
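    The loop examples on these two slides are truncated in the transcript. The sketch below shows the same kind of computation under stated assumptions: the names x, n, summ and proda follow the fragments that survive, while the data, the matrix size and the timing comparison are illustrative choices.

```r
x <- c(2, 4, 6, 8)            # hypothetical data
n <- length(x)
summ <- 0
proda <- 1
for (i in 1:n) {
  summ <- summ + x[i]         # running sum
  proda <- proda * x[i]       # running product
}
c(summ, proda)
c(sum(x), prod(x))            # the built-in functions give the same values

# Timing a double loop over a matrix against the vectorised sum().
A <- matrix(1, nrow = 200, ncol = 200)
loop_time <- system.time({
  s <- 0
  for (i in 1:200) for (j in 1:200) s <- s + A[i, j]
})
vec_time <- system.time(s2 <- sum(A))
loop_time
vec_time
```

    On most machines the explicit double loop is noticeably slower than sum(A), which is the motivation for the vectorised alternatives discussed later.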

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 103

    while & repeat

    while(A) B

    repeat{B; if(A) break}

    Suppose we wish to apply the Newton-Raphson method in order to find the solution of the equation

    $f(x) = x^3 + 2x^2 - 7 = 0$

    The iteration is

    $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$, where $f'(x) = 3x^2 + 4x$.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 104

    while & repeat > x tolerance f f.prime while(abs(f)>tolerance){x
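    The code on this slide is truncated. The following sketch implements the Newton-Raphson iteration for f(x) = x^3 + 2x^2 - 7 with a while loop, reusing the names x, tolerance, f and f.prime that survive in the transcript. The starting value and the tolerance are assumptions.

```r
x <- 1                        # starting value (an assumption)
tolerance <- 1e-8             # stopping threshold (an assumption)
f <- x^3 + 2 * x^2 - 7
f.prime <- 3 * x^2 + 4 * x
while (abs(f) > tolerance) {
  x <- x - f / f.prime        # Newton-Raphson update
  f <- x^3 + 2 * x^2 - 7
  f.prime <- 3 * x^2 + 4 * x
}
x
```

    The same loop can be written with repeat by moving the test to the end of the body: repeat { ...update...; if (abs(f) <= tolerance) break }.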

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 105

    Functions in R

    Example: Suppose we wish to create our own function to calculate x!, where x is a natural number.

    fact1

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 106

    Functions in Rfact2

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 107

    Functions in Rfact3
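    The bodies of fact1, fact2 and fact3 are not legible in the transcript. As an illustration of what such functions might look like, here are two hedged sketches: one with a for loop and one recursive. The names fact_loop and fact_rec and the input check are hypothetical, not the original definitions.

```r
# Loop-based factorial (hypothetical name, not the original fact1).
fact_loop <- function(x) {
  if (x < 0 || x != round(x)) return("Your number is not natural")
  result <- 1
  if (x > 0) {
    for (i in 1:x) result <- result * i
  }
  result
}

# Recursive factorial (hypothetical name).
fact_rec <- function(x) {
  if (x < 0 || x != round(x)) return("Your number is not natural")
  if (x == 0) return(1)
  x * fact_rec(x - 1)
}

fact_loop(5)
fact_rec(5)
```

    The guard if (x > 0) matters: in R, 1:0 is the vector c(1, 0), so an unguarded loop would wrongly return 0 for x = 0.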

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 108

    Functions in R Loops in R could be time consuming.

    Therefore the following loopfor(i in 1:length(y)) {if(y[i]

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 109

    Functions in R

    Additionally, we could use the cumprod() function of R, which calculates the cumulative product, i.e.

    > cumprod(c(1,2,4))
    [1] 1 2 8

    Thus we could use the following function instead:

    fact4 <- [...]
    > fact4(0)
    [1] 1
    > fact4(4)
    [1] 24
    > fact4(2.3)
    [1] "Your number is not natural"
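    The definition of fact4 is lost in the transcript. A function that reproduces the three outputs shown above could look like the sketch below; the name fact_cumprod and its body are a reconstruction under that assumption, not the original code.

```r
# Hypothetical cumprod-based factorial, consistent with the outputs above.
fact_cumprod <- function(x) {
  if (x < 0 || x != round(x)) return("Your number is not natural")
  if (x == 0) return(1)
  cumprod(1:x)[x]             # last element of the cumulative product
}
fact_cumprod(0)
fact_cumprod(4)
fact_cumprod(2.3)
```

    Here cumprod(1:x) holds 1!, 2!, ..., x!, so the last element is the required factorial and no explicit loop is needed.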

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 110

    Functions in R

    Finally, we could use the gamma() function, since for natural numbers we have x! = Γ(x+1), or of course the ready-made function factorial().

    > gamma(4)
    [1] 6
    > gamma(2)
    [1] 1
    > gamma(1)
    [1] 1
    > factorial(3)
    [1] 6
    > factorial(1)
    [1] 1
    > factorial(0)
    [1] 1

    Please note that the above functions also return values for arguments that are not natural numbers.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 111

    Functions in R

    Overflow problems, for example when computing $\binom{200}{100}$:

    > factorial(200)/(factorial(100)*factorial(100))
    [1] NaN
    Warning message:
    In factorial(200) : value out of range in 'gammafn'

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 112

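    Functions in R

    We have

    $\binom{200}{100} = \frac{101 \cdot \ldots \cdot 200}{1 \cdot \ldots \cdot 100}$

    Thus in R

    > prod(101:200)/prod(1:100)
    [1] 9.054851e+58

    or even better

    > x <- 101:200
    > y <- 1:100
    > z <- x/y
    > prod(z)
    [1] 9.054851e+58

    that is, approximately $9.05 \cdot 10^{58}$.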

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 113

    Functions in R

    Another way to avoid overflow problems is by using logs.

    $\binom{200}{100} = \exp\left\{\log\binom{200}{100}\right\} = \exp\left[\log(200!) - 2\log(100!)\right]$

    But $\log(n!) = \sum_{i=1}^{n} \log(i)$, so

    > exp(sum(log(1:200))-2*sum(log(1:100)))
    [1] 9.054851e+58
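    Base R also provides ready-made functions that work on the log scale. The sketch below is an addition to the slide, not part of it: it compares the sum-of-logs computation with lgamma(), lchoose() and choose().

```r
by_logs   <- exp(sum(log(1:200)) - 2 * sum(log(1:100)))  # as on the slide
by_lgamma <- exp(lgamma(201) - 2 * lgamma(101))          # log(n!) = lgamma(n + 1)
by_choose <- choose(200, 100)                            # binomial coefficient
by_lchoose <- exp(lchoose(200, 100))                     # its logarithm, exponentiated
c(by_logs, by_lgamma, by_choose, by_lchoose)
```

    All four expressions agree up to floating-point error, and none of them forms the intermediate value 200!, which is what overflowed earlier.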

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 114

    Loop Functions

    Writing for & while loops is useful when programming but not particularly easy when working interactively on the command line. There are some functions which implement looping to make life easier.

    - lapply: Loop over a list and evaluate a function on each element.
    - sapply: Same as lapply but try to simplify the result.
    - apply: Apply a function over the margins of an array.
    - tapply: Apply a function over subsets of a vector.
    - mapply: Multivariate version of lapply.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 115

    lapply()

    lapply takes three arguments: a list X, a function (or the name of a function) FUN, and other arguments via its ... argument. If X is not a list, it will be coerced to a list using as.list.

    > lapply
    function (X, FUN, ...)
    {
        FUN <- match.fun(FUN)
        if (!is.vector(X) || is.object(X))
            X <- as.list(X)
        .Internal(lapply(X, FUN))
    }

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 116

    lapply()

    lapply always returns a list, regardless of the class of the input.

    > x <- list(a = 1:5, b = rnorm(10))
    > lapply(x, mean)
    $a
    [1] 3
    $b
    [1] 0.0296824

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 117

    lapply()

    > x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
    > lapply(x, mean)
    $a
    [1] 2.5
    $b
    [1] 0.06082667
    $c
    [1] 1.467083
    $d
    [1] 5.074749

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 118

    sapply()

    sapply will try to simplify the result of lapply if possible.
    - If the result is a list where every element is length 1, then a vector is returned.
    - If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
    - If it can't figure things out, a list is returned.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 119

    sapply()

    > x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
    > lapply(x, mean)
    $a
    [1] 2.5
    $b
    [1] 0.06082667
    $c
    [1] 1.467083
    $d
    [1] 5.074749

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 120

    sapply()

    > sapply(x, mean)
             a          b          c          d
    2.50000000 0.06082667 1.46708277 5.07474950
    > mean(x)
    [1] NA
    Warning message:
    In mean.default(x) : argument is not numeric or logical: returning NA

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 121

    tapply()

    tapply is used to apply a function over subsets of a vector.

    > str(tapply)
    function (X, INDEX, FUN = NULL, ..., simplify = TRUE)

    - X is a vector.
    - INDEX is a factor or a list of factors (or else they are coerced to factors).
    - FUN is a function to be applied.
    - ... contains other arguments to be passed to FUN.
    - simplify: should we simplify the result?

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 122

    tapply()

    Take group means.

    > x <- c(rnorm(10), runif(10), rnorm(10, 1))
    > f <- gl(3, 10)
    > f
     [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
    [24] 3 3 3 3 3 3 3
    Levels: 1 2 3
    > tapply(x, f, mean)
            1         2         3
    0.1144464 0.5163468 1.2463678

    gl() generates factors by specifying the pattern of their levels. runif() simulates from the uniform distribution on (0,1) (more later).

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 123

    tapply()

    Take group means without simplification.

    > tapply(x, f, mean, simplify = FALSE)
    $`1`
    [1] 0.1144464
    $`2`
    [1] 0.5163468
    $`3`
    [1] 1.246368

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 124

    tapply()

    Find group ranges.

    > tapply(x, f, range)
    $`1`
    [1] -1.097309  2.694970
    $`2`
    [1] 0.09479023 0.79107293
    $`3`
    [1] 0.4717443 2.5887025

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 125

    split()

    split takes a vector or other object and splits it into groups determined by a factor or list of factors.

    > str(split)
    function (x, f, drop = FALSE, ...)

    - x is a vector (or list) or data frame.
    - f is a factor (or coerced to one) or a list of factors.
    - drop indicates whether empty factor levels should be dropped.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 126

    split()

    > x <- c(rnorm(10), runif(10), rnorm(10, 1))
    > f <- gl(3, 10)
    > split(x, f)
    $`1`
    [1] -0.8493038 -0.5699717 -0.8385255 -0.8842019
    [5]  0.2849881  0.9383361 -1.0973089  2.6949703
    [9]  1.5976789 -0.1321970
    $`2`
    [1] 0.09479023 0.79107293 0.45857419 0.74849293
    [5] 0.34936491 0.35842084 0.78541705 0.57732081
    [9] 0.46817559 0.53183823
    $`3`
    [1] 0.6795651 0.9293171 1.0318103 0.4717443
    ...

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 127

    split()

    A common idiom is split followed by an lapply.

    > lapply(split(x, f), mean)
    $`1`
    [1] 0.1144464
    $`2`
    [1] 0.5163468
    $`3`
    [1] 1.246368

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 128

    Splitting a Data Frame

    > library(datasets)
    > head(airquality)
      Ozone Solar.R Wind Temp Month Day
    1    41     190  7.4   67     5   1
    2    36     118  8.0   72     5   2
    3    12     149 12.6   74     5   3
    4    18     313 11.5   62     5   4
    5    NA      NA 14.3   56     5   5
    6    28      NA 14.9   66     5   6

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 129

    Splitting a Data Frame

    > s <- split(airquality, airquality$Month)
    > lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
    $`5`
       Ozone  Solar.R     Wind
          NA       NA 11.62258
    $`6`
        Ozone   Solar.R      Wind
           NA 190.16667  10.26667
    $`7`
         Ozone    Solar.R       Wind
            NA 216.483871   8.941935
    ...

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 130

    Splitting a Data Frame

    > sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
                   5         6          7        8        9
    Ozone         NA        NA         NA       NA       NA
    Solar.R       NA 190.16667 216.483871       NA 167.4333
    Wind    11.62258  10.26667   8.941935 8.793548  10.1800

    > sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
                    5         6          7          8         9
    Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
    Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
    Wind     11.62258  10.26667   8.941935   8.793548  10.18000

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 131

    Splitting on more than one level

    > x <- rnorm(10)
    > f1 <- gl(2, 5)
    > f2 <- gl(5, 2)
    > f1
     [1] 1 1 1 1 1 2 2 2 2 2
    Levels: 1 2
    > f2
     [1] 1 1 2 2 3 3 4 4 5 5
    Levels: 1 2 3 4 5
    > interaction(f1, f2)
     [1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5
    10 Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4 1.5 2.5

    Note that some of the 10 levels are empty.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 132

    Splitting on more than one level

    Interactions can create empty levels.

    > str(split(x, list(f1, f2)))
    List of 10
    $ 1.1: num [1:2] -0.378 0.445
    $ 2.1: num(0)
    $ 1.2: num [1:2] 1.4066 0.0166
    $ 2.2: num(0)
    $ 1.3: num -0.355
    $ 2.3: num 0.315
    $ 1.4: num(0)
    $ 2.4: num [1:2] -0.907 0.723
    $ 1.5: num(0)
    $ 2.5: num [1:2] 0.732 0.360

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 133

    Splitting on more than one level

    Empty levels can be dropped.

    > str(split(x, list(f1, f2), drop = TRUE))
    List of 6
    $ 1.1: num [1:2] -0.378 0.445
    $ 1.2: num [1:2] 1.4066 0.0166
    $ 1.3: num -0.355
    $ 2.3: num 0.315
    $ 2.4: num [1:2] -0.907 0.723
    $ 2.5: num [1:2] 0.732 0.360

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 134

    mapply()

    mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments.

    > str(mapply)
    function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

    - FUN is a function to apply.
    - ... contains arguments to apply over.
    - MoreArgs is a list of other arguments to FUN.
    - SIMPLIFY indicates whether the result should be simplified.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 135

    mapply()

    The following is tedious to type

    > list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))

    Instead we can do

    > mapply(rep, 1:4, 4:1)
    [[1]]
    [1] 1 1 1 1
    [[2]]
    [1] 2 2 2
    [[3]]
    [1] 3 3
    [[4]]
    [1] 4

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 136

    Vectorizing a Function

    > noise <- function(n, mean, sd) {
    +     rnorm(n, mean, sd)
    + }
    > noise(5, 1, 2)
    [1]  2.4831198  2.4790100  0.4855190 -1.2117759
    [5] -0.2743532
    > noise(1:5, 1:5, 2)
    [1] -4.2128648 -0.3989266  4.2507057  1.1572738
    [5]  3.7413584

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 137

    Vectorizing a Function

    > mapply(noise, 1:5, 1:5, 2)
    [[1]]
    [1] 1.037658
    [[2]]
    [1] 0.7113482 2.7555797
    [[3]]
    [1] 2.769527 1.643568 4.597882
    [[4]]
    [1] 4.476741 5.658653 3.962813 1.204284
    [[5]]
    [1] 4.797123 6.314616 4.969892 6.530432 6.723254

    The above is the same as:

    > list(noise(1, 1, 2), noise(2, 2, 2), noise(3, 3, 2), noise(4, 4, 2), noise(5, 5, 2))

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 138

    Numerical Measures – Quantitative Variables

    Function        Description
    mean(x)         Sample mean
    min(x)          Minimum
    max(x)          Maximum
    median(x)       Median
    var(x)          Variance
    sd(x)           Standard deviation
    quantile(x,p)   Returns the p-th quantile. For p=0.25 and p=0.75 we get the 1st and 3rd quartile. Read the help in R for the different quantile algorithms (argument "type").

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 139

    Graphical Methods – Quantitative Variables

    Histogram:
    - hist(x)
    - hist(x, nclass=10)
    - hist(x, breaks=seq(from=0,to=240,by=30))
    - hist(x, probability=T)

    Boxplot:
    - boxplot(x)
    - boxplot(x,y, names=c("X", "Y"))

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 140

    Graphical Methods – Quantitative Variables

    [Figure: boxplots of X and Y; histogram of x on the density scale]

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 141

    Descriptive Measures - Categorical Variables

    car = C, metro = M, bus = B, walk = F

    males:    C C B M M
              C M M F C
    females:  F B B M M
              C C C M C

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 142

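    Descriptive Measures - Categorical Variables

    > Transportation <- c("C","C","B","M","M","C","M","M","F","C",
    +                     "F","B","B","M","M","C","C","C","M","C")
    > Transportation <- factor(Transportation)
    > Gender <- c(rep("M", 10), rep("F", 10))
    > Gender
     [1] "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F"
    > Gender <- factor(Gender)
    > table(Transportation)
    Transportation
    B C F M
    3 8 2 7
    > prop.table(table(Transportation))
    Transportation
       B    C    F    M
    0.15 0.40 0.10 0.35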

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 143

    Descriptive Measures - Categorical Variables

    > mytable <- table(Transportation, Gender)
    > mytable
                  Gender
    Transportation F M
                 B 2 1
                 C 4 4
                 F 1 1
                 M 3 4

    > margin.table(mytable, 1)    ## frequencies for transportation
    Transportation
    B C F M
    3 8 2 7
    > margin.table(mytable, 2)    ## frequencies for gender
    Gender
     F  M
    10 10

    > prop.table(mytable)         ## relative frequencies for each cell
                  Gender
    Transportation    F    M
                 B 0.10 0.05
                 C 0.20 0.20
                 F 0.05 0.05
                 M 0.15 0.20

    > prop.table(mytable, 1)      ## relative frequencies within rows
                  Gender
    Transportation         F         M
                 B 0.6666667 0.3333333
                 C 0.5000000 0.5000000
                 F 0.5000000 0.5000000
                 M 0.4285714 0.5714286

    > prop.table(mytable, 2)      ## relative frequencies within columns
                  Gender
    Transportation   F   M
                 B 0.2 0.1
                 C 0.4 0.4
                 F 0.1 0.1
                 M 0.3 0.4

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 144

    Descriptive Measures - Categorical Variables

    Barplot

    > AA <- table(Transportation)
    > AA
    Transportation
    B C F M
    3 8 2 7
    > barplot(AA)

    Piechart

    > pie(AA)

    [Figure: barplot and pie chart of the transportation frequencies (B, C, F, M)]

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 145

    Stacked Barplots

    > freq_table <- table(Transportation, Gender)
    > barplot(freq_table, xlim=c(0,3), xlab="Gender", legend=levels(Transportation), col=1:4)

    > freq_table <- table(Gender, Transportation)
    > barplot(freq_table, width=0.85, xlim=c(0,5), xlab="Transportation", legend=levels(Gender), col=1:2)

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 146

    Stacked Barplots

    [Figure: stacked barplot by Gender (F, M) with legend B, C, F, M; stacked barplot by Transportation (B, C, F, M) with legend F, M]

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 147

    Grouped Barplot

    > freq_table <- table(Transportation, Gender)
    > barplot(prop.table(freq_table,1), width=0.25, xlim=c(0,3), ylim=c(0,0.7), xlab="Gender", legend=levels(Transportation), beside=T, col=1:4)

    > freq_table <- table(Gender, Transportation)
    > barplot(prop.table(freq_table,1), width=0.25, xlim=c(0,3.6), xlab="Transportation", legend=levels(Gender), beside=T, col=1:2)

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 148

    Grouped Barplot

    [Figure: grouped barplot by Gender (F, M) with legend B, C, F, M; grouped barplot by Transportation (B, C, F, M) with legend F, M]

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 149

    Distributions in R

    Probability distributions usually have four functions associated with them. The functions are prefixed with a:
    - d for density.
    - r for random number generation.
    - p for cumulative distribution.
    - q for quantile function.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 150

    Distributions in R

    command   distribution     command   distribution
    beta      Beta             hyper     Hypergeometric
    norm      Normal           unif      Uniform
    pois      Poisson          cauchy    Cauchy
    nbinom    Neg. Binomial    weibull   Weibull
    gamma     Gamma            chisq     Chi-squared
    t         Student          exp       Exponential
    binom     Binomial         geom      Geometric
    f         Snedecor         mvnorm    Multivariate Normal

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 151

    Distributions in R

    Examples:

    > pnorm(3,2,2)
    [1] 0.6914625
    Calculates the value of the cumulative distribution function of the Normal distribution with mean 2 and SD (NOT VARIANCE) 2 at x = 3.

    > qgamma(0.3,1,1)
    [1] 0.3566749
    Finds the 0.3 quantile (30th percentile) of the Gamma distribution with parameters 1 and 1.

    > dt(2,3)
    [1] 0.06750966
    Calculates the value of the pdf of the Student distribution with 3 df at x = 2.

    > runif(5,-2,2)
    [1]  1.3448055 -0.4691324  1.2517269  1.5576504  0.9563447
    Generates 5 variates from the uniform distribution on (-2,2).

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 152

    Distributions in R

    Arguments of the previous functions can be vectors.

    > dexp(1:5,2)
    [1] 2.706706e-01 3.663128e-02 4.957504e-03 6.709253e-04 9.079986e-05
    Calculates the value of the pdf of the exponential distribution with parameter 2 for x = 1, 2, 3, 4, 5.

    Default values in the parameters: the command rnorm(50) generates 50 normal variates with μ = 0 and σ = 1.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 153

    Distributions in R

    If X ~ Student(10), then P(X ≤ 2) is equal to

    > pt(2,10)
    [1] 0.963306

    while P(X > 2) is equal to

    > pt(2,10, lower.tail=FALSE)
    [1] 0.03669402

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 154

    Distributions in R

    All functions (except the ones that are prefixed with r) can return results on the log (natural logarithm) scale.

    > pnorm(3,2,2)
    [1] 0.6914625
    > pnorm(3,2,2, log=T)
    [1] -0.3689464
    > dt(2,3)
    [1] 0.06750966
    > dt(2,3, log=T)
    [1] -2.695485

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 155

    Distributions in R

    Working with the Normal distribution requires using these four functions:
    - dnorm(x, mean = 0, sd = 1, log = FALSE)
    - pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
    - qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
    - rnorm(n, mean = 0, sd = 1)

    If $\Phi$ is the cumulative distribution function of a standard Normal distribution, then pnorm(q) = $\Phi(q)$ and qnorm(p) = $\Phi^{-1}(p)$.

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 156

    Distributions in R

    Plot a pdf or pmf:

    > x <- seq(0, 10, length = 100)
    > plot(x, dgamma(x,1,1), type='l')

    [Figure: density of the Gamma distribution with parameters (1,1), dgamma(x, 1, 1) against x from 0 to 10]

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 157

    Distributions in R

    > n <- 6
    > p <- 0.1
    > x <- 0:n
    > pr <- dbinom(x, n, p)
    > plot(x, pr, type="h", xlim=c(0,6), ylim=c(0,1), col="blue", ylab="p")
    > points(x, pr, pch=20, col="dark red")

    [Figure: pmf of the Binomial distribution with n=6 and p=0.1, p against x from 0 to 6]

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 158

    Simulations

    Generating random Normal variates

    > x <- rnorm(10)
    > x
     [1] 1.38380206 0.48772671 0.53403109 0.66721944
     [5] 0.01585029 0.37945986 1.31096736 0.55330472
     [9] 1.22090852 0.45236742
    > x <- rnorm(10, 20, 2)
    > x
     [1] 23.38812 20.16846 21.87999 20.73813 19.59020
     [6] 18.73439 18.31721 22.51748 20.36966 21.04371
    > summary(x)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      18.32   19.73   20.55   20.67   21.67   23.39

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 159

    set.seed

    Setting the random number seed with set.seed ensures reproducibility.

    > set.seed(1)
    > rnorm(5)
    [1] -0.6264538  0.1836433 -0.8356286  1.5952808
    [5]  0.3295078
    > rnorm(5)
    [1] -0.8204684  0.4874291  0.7383247  0.5757814
    [5] -0.3053884
    > set.seed(1)
    > rnorm(5)
    [1] -0.6264538  0.1836433 -0.8356286  1.5952808
    [5]  0.3295078

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 160

    Simulations

    Generating Poisson data

    > rpois(10, 1)
     [1] 3 1 0 1 0 0 1 0 1 1
    > rpois(10, 2)
     [1] 6 2 2 1 3 2 2 1 1 2
    > rpois(10, 20)
     [1] 20 11 21 20 20 21 17 15 24 20
    > ppois(2, 2) ## Cumulative distribution
    [1] 0.6766764 ## Pr(x <= 2)
    > ppois(4, 2)
    [1] 0.947347 ## Pr(x <= 4)
    > ppois(6, 2)
    [1] 0.9954662 ## Pr(x <= 6)

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 161

    Simulate from a Linear Model

    Suppose we want to simulate 100 values from the following linear model: $y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon \sim N(0, 2^2)$. Assume $X \sim N(0, 1^2)$, $\beta_0 = 0.5$ and $\beta_1 = 2$.

    > x <- rnorm(100)
    > e <- rnorm(100, 0, 2)
    > y <- 0.5 + 2 * x + e
    > summary(y)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -6.4080 -1.5400  0.6789  0.6893  2.9300  6.5050
    > plot(x, y)

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 162

    Simulate from a Linear Model

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 163

    Simulate from a Linear Model

    What if X is binary?

    > x <- rbinom(100, 1, 0.5)
    > e <- rnorm(100, 0, 2)
    > y <- 0.5 + 2 * x + e
    > summary(y)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -3.4940 -0.1409  1.5770  1.4320  2.8400  6.9410
    > plot(x, y)

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 164

    Simulate from a Linear Model

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 165

    Simulate from a Poisson Model

    Suppose we want to simulate 100 values from a Poisson model where $Y \sim \text{Poisson}(\mu)$, $\log(\mu) = \beta_0 + \beta_1 x$. Assume $X \sim N(0, 1^2)$, $\beta_0 = 0.5$ and $\beta_1 = 0.3$.

    > set.seed(1)
    > x <- rnorm(100)
    > log.mu <- 0.5 + 0.3 * x
    > y <- rpois(100, exp(log.mu))
    > summary(y)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       0.00    1.00    1.00    1.55    2.00    6.00
    > plot(x, y)

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 166

    Simulate from a Poisson Model

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 167

    Sampling

    The sample function draws randomly from a specified set of (scalar) objects, allowing you to sample from arbitrary distributions.

    > sample(1:10, 4)
    [1] 3 4 5 7
    > sample(1:10, 4)
    [1] 3 9 8 5
    > sample(letters, 5)
    [1] "q" "b" "e" "x" "p"
    > sample(1:10) ## permutation
     [1]  4  7 10  6  9  2  8  3  1  5
    > sample(1:10)
     [1]  2  3  4  1  9  5 10  8  6  7
    > sample(1:10, replace = TRUE) ## Sample w/replacement
     [1] 2 9 7 8 2 8 5 9 7 8
    > sample(1:5, replace = TRUE, prob=c(0.3, 0.3, 0.4, 0, 0)) ## Sample w/replacement and pre-specified probabilities
    [1] 3 1 2 2 1

  • Dimitris FouskakisΠροσομοίωση 168

    Weak Law of Large Numbers

    If $X_1, \ldots, X_n$ are independent and identically distributed random variables with finite mean $\mu$, then

    $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i \to \mu$ in probability as $n \to \infty$.

    Application: Let $X_i \sim \text{Bernoulli}(p)$. Then $\mu = P(X_i = 1) = p$, and therefore

    $\bar{X} \to p$ in probability as $n \to \infty$.

  • Dimitris FouskakisΠροσομοίωση 169

    Weak Law of Large Numbers

    p=0.2, n=500:

    > x <- rbinom(500, 1, 0.2)
    > xbar <- cumsum(x)/(1:500)
    > plot(xbar)
    > abline(h=0.2)

  • Dimitris FouskakisΠροσομοίωση 170

    Weak Law of Large Numbers

    [Figure: xbar against Index, 0 to 500]

  • Dimitris FouskakisΠροσομοίωση 171

    Weak Law of Large Numbers

    Repeat 4 times:> par(mfrow=c(2,2))> for(i in 1:4){x
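    The loop on this slide is truncated in the transcript. A minimal sketch of the repetition it describes, assuming the running mean is computed with cumsum() and using an arbitrary seed, could be:

```r
set.seed(1)                         # arbitrary seed, for reproducibility only
par(mfrow = c(2, 2))                # 2 x 2 grid of panels
for (i in 1:4) {
  x <- rbinom(500, 1, 0.2)          # 500 Bernoulli(0.2) draws
  xbar <- cumsum(x) / (1:500)       # running mean after 1, 2, ..., 500 draws
  plot(xbar, type = "l")
  abline(h = 0.2)                   # the true mean p = 0.2
}
```

    Each panel shows a different sample path of the running mean; all of them approach the line at 0.2 as the index grows.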

  • Dimitris FouskakisΠροσομοίωση 172

    Weak Law of Large Numbers

    [Figure: four panels, each showing xbar against Index, 0 to 500]

  • Dimitris FouskakisΠροσομοίωση 173

    Weak Law of Large Numbers Example of the Cauchy distribution (the mean is

    not finite).> par(mfrow=c(2,2))> for(i in 1:4){x
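    The Cauchy loop is also truncated in the transcript. The sketch below repeats the running-mean experiment with rcauchy(); the seed and the use of cumsum() are assumptions.

```r
set.seed(1)                         # arbitrary seed
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- rcauchy(500)                 # 500 standard Cauchy draws
  xbar <- cumsum(x) / (1:500)       # running mean
  plot(xbar, type = "l")
}
```

    Because the Cauchy distribution has no finite mean, the weak law of large numbers does not apply and the running means need not settle around any value.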

  • Dimitris FouskakisΠροσομοίωση 174

    Weak Law of Large Numbers

    [Figure: four panels, each showing xbar against Index, 0 to 500, for the Cauchy samples]

  • Dimitris FouskakisΠροσομοίωση 175

    Central Limit Theorem

    Let $X_1, \ldots, X_n$ be independent and identically distributed random variables with finite mean $\mu$ and variance $\sigma^2$. Then

    $S_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \to Z \sim N(0,1)$ in law as $n \to \infty$.

    Equivalently

    $\bar{X} \to Y \sim N(\mu, \sigma^2/n)$ in law as $n \to \infty$.

  • Dimitris FouskakisΠροσομοίωση 176

    Central Limit Theorem Example: Let n = 150 from Poisson with λ=2 (then

    λ=μ=σ2=2).> poisson.clt

  • Dimitris FouskakisΠροσομοίωση 177

    Central Limit Theorem We use k=300 replications.

    > run par(mfrow=c(1,2))> hist(run)> qqnorm(run)> qqline(run)
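    The definition of poisson.clt and the assignment to run are not legible in the transcript. The sketch below is one plausible version under stated assumptions: the function name poisson_clt, its arguments and the use of replicate() are hypothetical, and only n = 150, λ = 2 and k = 300 come from the slides.

```r
# Standardised sum of n Poisson(lambda) draws: (sum - n*mu) / (sigma * sqrt(n)),
# with mu = sigma^2 = lambda for the Poisson distribution.
poisson_clt <- function(n, lambda) {
  x <- rpois(n, lambda)
  (sum(x) - n * lambda) / sqrt(n * lambda)
}

set.seed(1)                                   # arbitrary seed
run <- replicate(300, poisson_clt(150, 2))    # k = 300 replications

par(mfrow = c(1, 2))
hist(run)
qqnorm(run)
qqline(run)
```

    If the central limit theorem approximation is good, the histogram looks roughly bell-shaped and the points in the Q-Q plot lie close to the line.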

  • Dimitris FouskakisΠροσομοίωση 178

    Central Limit Theorem

    [Figure: histogram of run (Frequency against run) and Normal Q-Q plot (Sample Quantiles against Theoretical Quantiles)]

  • Dimitris FouskakisIntroduction to asic rinciples of RB P 179

    Plotting

    The plotting and graphics engine in R is encapsulated in a few base and recommended packages:

    graphics: contains plotting functions for the “base” graphing system, including plot, hist, boxplot and many others.
    lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot.
    grid: implements a different graphing system independent of the “base” system; the lattice package builds on top of grid; we seldom call functions from the grid package directly.
    grDevices: contains all the code implementing the various graphics devices, including X11, PDF, PostScript, PNG, etc.

    Base Graphics

    Base graphics are used most commonly and are a very powerful system for creating 2-D graphics.

    Calling plot(x, y) or hist(x) will launch a graphics device (if one is not already open) and draw the plot on the device.
    If the arguments to plot are not of some special class, then the default method for plot is called; this function has many arguments, letting you set the title, x axis label, y axis label, etc.
    The base graphics system has many parameters that can be set and tweaked; these parameters are documented in ?par; it wouldn’t hurt to memorize this help page!

    Main Parameters of a Graph

    main: title of the graph.
    sub: subtitle of the graph.
    xlab & ylab: titles of the axes.
    xlim & ylim: ranges of the axes.
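A minimal sketch of these arguments used together in one call. The vectors `height` and `weight` and all label text are hypothetical, chosen only for illustration.

```r
# Hypothetical data, used only to illustrate the arguments
height <- c(150, 160, 165, 172, 180, 188)
weight <- c(52, 58, 63, 70, 79, 90)

plot(height, weight,
     main = "Weight against height",      # title of the graph
     sub  = "Six hypothetical subjects",  # subtitle, drawn below the x axis title
     xlab = "Height (cm)",                # title of the x axis
     ylab = "Weight (kg)",                # title of the y axis
     xlim = c(140, 200),                  # range of the x axis
     ylim = c(40, 100))                   # range of the y axis

# par("usr") reports the extent of the plotting region as c(x1, x2, y1, y2);
# by default R extends the requested xlim/ylim by 4% at each end
usr <- par("usr")
```

The plotting region reported by `par("usr")` is therefore slightly wider than the requested limits.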

    Base Graphics Parameters

    The par function is used to specify global graphics parameters that affect all plots in an R session. These parameters can often be overridden as arguments to specific plotting functions.

    pch: the plotting symbol (default is open circle).
    lty: the line type (default is solid line); can be dashed, dotted, etc.
    lwd: the line width, specified as a positive multiple of the default width.
    col: the plotting color, specified as a number, string, or hex code; the colors function gives you a vector of colors by name.
    las: the orientation of the axis labels on the plot.
    bg: the background color.
    mar: the margin size.
    oma: the outer margin size (default is 0 for all sides).
    mfrow: numbers of rows and columns of plots, c(nrows, ncols) (plots are filled row-wise).
    mfcol: numbers of rows and columns of plots, c(nrows, ncols) (plots are filled column-wise).
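One common pattern, sketched here with arbitrary example values, is to store the list that par() returns (the previous values of the parameters being changed) and hand it back to par() when finished. The curves plotted are only placeholders.

```r
# par() returns the previous values of the parameters it changes
op <- par(mfrow = c(1, 2), pch = 19, las = 1)

x <- seq(0, 2 * pi, length.out = 50)
plot(x, sin(x))                        # uses pch = 19 and las = 1 set through par()
plot(x, cos(x), pch = 4, col = "red")  # pch and col overridden for this call only

par(op)                                # restore the earlier settings
```

After `par(op)` the device is back to a single plot per page with the default symbol and label orientation.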

    Base Graphics Parameters

    Some default values:

    > par("lty")
    [1] "solid"
    > par("lwd")
    [1] 1
    > par("col")
    [1] "black"
    > par("pch")
    [1] 1
    > par("bg")
    [1] "transparent"
    > par("mar")
    [1] 5.1 4.1 4.1 2.1
    > par("oma")
    [1] 0 0 0 0
    > par("mfrow")
    [1] 1 1
    > par("mfcol")
    [1] 1 1

    Some Important Base Plotting Functions

    plot: make a scatterplot, or other type of plot depending on the class of the object being plotted.
    lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots.
    points: add points to a plot.
    text: add text labels to a plot using specified x, y coordinates.
    title: add annotations: x and y axis labels, title, subtitle, outer margin.
    mtext: add arbitrary text to the margins (inner or outer) of the plot.
    axis: add axis ticks/labels.
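The functions listed above can be combined to build a plot piece by piece. The following is only a sketch: the data are simulated for illustration, and box(), which draws the frame, is an extra base function not named in the list.

```r
set.seed(1)                       # reproducible hypothetical data
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
top <- which.max(y)               # index of the largest y value

plot(x, y, xlab = "", ylab = "", axes = FALSE)        # points only, no axes or axis titles yet
lines(x, 2 + 0.5 * x, lty = 2)                        # connect the points of y = 2 + 0.5x
points(x[top], y[top], pch = 19, col = "red")         # highlight the largest observation
text(x[top], y[top], "max", pos = 1)                  # label it just below the point
title(main = "Building up a plot", xlab = "x", ylab = "y")
mtext("added with mtext()", side = 3, line = 0.3, cex = 0.8)  # text in the top margin
axis(1)                                               # bottom axis with default ticks
axis(2)                                               # left axis with default ticks
box()                                                 # frame around the plotting region
```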

    Text & Symbol Size

    cex: number indicating the amount by which plotting text and symbols should be scaled relative to the default. 1=default, 1.5 is 50% larger, 0.5 is 50% smaller, etc.
    cex.axis: magnification of axis annotation relative to cex.
    cex.lab: magnification of x and y labels relative to cex.
    cex.main: magnification of titles relative to cex.
    cex.sub: magnification of subtitles relative to cex.
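As a short illustration, with arbitrary scaling factors, the cex family can be passed directly to plot():

```r
x <- 1:10
plot(x, x^2,
     main = "Scaled text and symbols",
     cex      = 1.5,   # plotting symbols 50% larger than the default
     cex.main = 2,     # title at twice the base size
     cex.lab  = 1.2,   # axis titles 20% larger
     cex.axis = 0.8)   # tick labels 20% smaller

base_cex <- par("cex") # the global setting is not changed by the call above
```

Values passed as arguments apply to that call only; the global value reported by `par("cex")` stays as it was.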

    Axes

    You can create custom axes using the axis( ) function.

    > axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...)

    side: an integer indicating the side of the graph on which to draw the axis (1=bottom, 2=left, 3=top, 4=right).
    at: a numeric vector indicating where tick marks should be drawn.
    labels: a character vector of labels to be placed at the tick marks (if NULL, the at values will be used).
    pos: the coordinate at which the axis line is to be drawn (i.e., the value on the other axis where it crosses).
    lty: line type.
    col: the line and tick mark color.
    las: labels are parallel (=0) or perpendicular (=2) to the axis.
    tck: length of tick mark as a fraction of the plotting region (a negative number is outside the graph, a positive number is inside, 0 suppresses ticks, 1 creates gridlines); the default, tck=NA, uses tcl=-0.5 instead.

    Axes

    The option axes=FALSE suppresses both x and y axes.
    xaxt="n" and yaxt="n" suppress the x and y axis respectively.
    xlab="" and ylab="" suppress the titles of the two axes.
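A sketch of a custom axis, assuming twelve hypothetical monthly values: the default x axis is suppressed with xaxt="n" and redrawn with axis(). The constant month.abb is built into R.

```r
x <- 1:12
y <- c(3, 5, 4, 6, 8, 9, 11, 10, 8, 6, 4, 3)   # hypothetical monthly values

plot(x, y, type = "b", xaxt = "n", xlab = "", ylab = "Value")  # suppress the x axis only
axis(side = 1, at = x, labels = month.abb,   # one tick per month, labelled Jan ... Dec
     las = 2, tck = -0.02)                   # labels perpendicular to the axis, short outward ticks
axis(side = 4, at = c(3, 7, 11), col = "blue")  # an extra axis on the right-hand side
```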

    Legend

    Add a legend with the legend() function.

    > legend(location, title, legend, ...)

    location: there are several ways to indicate the location of the legend. You can give an x, y coordinate for the upper left-hand corner of the legend. You can use locator(1), in which case you use the mouse to indicate the location of the legend. You can also use the keywords "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", "bottomright", or "center". If you use a keyword, you may want to use inset= to specify an amount to move the legend into the graph (as a fraction of the plot region).
    title: a character string for the legend title (optional).
    legend: a character vector with the labels.
    ...: other options. If the legend labels colored lines, specify col= and a vector of colors. If the legend labels point symbols, specify pch= and a vector of point symbols. If the legend labels line width or line style, use lwd= or lty= and a vector of widths or styles. To create colored boxes for the legend (common in bar, box, or pie charts), use fill= and a vector of colors.
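A minimal sketch of a legend for two lines, placed by keyword. The two curves and the legend text are illustrative choices only.

```r
x <- seq(0, 2 * pi, length.out = 100)
plot(x, sin(x), type = "l", col = "blue", lty = 1, ylab = "f(x)")
lines(x, cos(x), col = "red", lty = 2)

# The legend repeats the colors and line types used when drawing the lines
lg <- legend("topright", inset = 0.02, title = "Function",
             legend = c("sin(x)", "cos(x)"),
             col = c("blue", "red"), lty = c(1, 2))
```

legend() does not read anything from the plot, so col and lty have to be given again to match the lines. It invisibly returns a list describing where the legend box and its text were drawn.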

    Fonts

    font: integer specifying the font to use for text. 1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol.
    font.axis: font for axis annotation.
    font.lab: font for x and y labels.
    font.main: font for titles.
    font.sub: font for subtitles.
    ps: font point size (one point is roughly 1/72 inch); text size = ps*cex.
    family: font family for drawing text. Standard values are "serif", "sans", "mono", "symbol". The mapping is device dependent. In Windows, serif is mapped to "TT Times New Roman", sans to "TT Arial", mono to "TT Courier New", and symbol to "TT Symbol" (TT = TrueType). You can add your own mappings.

    Margins & Graph Size

    mar: numerical vector indicating margin size c(bottom, left, top, right) in lines. Default = c(5, 4, 4, 2) + 0.1.
    mai: numerical vector indicating margin size c(bottom, left, top, right) in inches.
    pin: plot dimensions (width, height) in inches.
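A short sketch of changing the margins, with arbitrary margin values. Setting mar also changes mai, since both describe the same margins in different units.

```r
op <- par(mar = c(4, 4, 2, 1) + 0.1)   # bottom, left, top, right, in lines of text
plot(1:10, main = "Smaller margins")

mar_lines  <- par("mar")               # the margins in lines
mai_inches <- par("mai")               # the same margins in inches
par(op)                                # restore the previous margins
```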

    Saving Graphs

    You can save the graph in a variety of formats from the menu File -> Save As.

    You can also save the graph via code using one of the following functions.

    pdf("mygraph.pdf"): pdf file.
    win.metafile("mygraph.wmf"): windows metafile.
    png("mygraph.png"): png file.
    jpeg("mygraph.jpg"): jpeg file.
    bmp("mygraph.bmp"): bmp file.
    postscript("mygraph.ps"): postscript file.
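Each of these functions opens a file device; the plotting commands that follow draw into the file, and the file is complete only after the device is closed with dev.off(). A minimal sketch, writing to a temporary directory chosen for this example:

```r
out <- file.path(tempdir(), "mygraph.pdf")  # temporary location, chosen for this sketch

pdf(out, width = 6, height = 4)    # open the file device (size in inches)
plot(1:10, main = "Saved to PDF")  # drawing now goes to the file, not the screen
dev.off()                          # close the device so the file is written out
```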

    Useful Graphics Devices

    The list of devices is found in ?Devices; there are also devices created by users on CRAN.

    pdf: useful for line-type graphics; vector format, resizes well, usually portable.
    postscript: older format, also vector format and resizes well; usually portable; can be used to create encapsulated postscript files; Windows systems often don’t have a postscript viewer.
    xfig: good if you use Unix and want to edit a plot by hand.

    Plot

    > plot(x)

    [Figure: x plotted against Index (0 to 100).]

    Different plotting symbols

    > plot(x, pch=11)

    [Figure: x against Index (0 to 100), drawn with plotting symbol 11.]

    Different plotting symbols

    > plot(x, pch='A')

    [Figure: x against Index (0 to 100), each point drawn as the letter A.]

    Different plotting symbols

    > plot(x, pch=c('A', 'B'))

    [Figure: x against Index (0 to 100), points drawn alternately as the letters A and B.]

    Colors

    > colors()
     [1] "white"         "aliceblue"     "antiquewhite"
     [4] "antiquewhite1" "antiquewhite2" "antiquewhite3"
     [7] "antiquewhite4" "aquamarine"    "aquamarine1"
    [10] "aquamarine2"   "aquamarine3"   "aquamarine4"
    [13] "azure"         "azure1"        "azure2"
    [16] "azure3"        "azure4"        "beige"
    [19] "bisque"        "bisque1"       "bisque2"
    [22] "bisque3"       "bisque4"       "black"
    ..............................................................

    Colors

    > plot(1:10, c(5, 4, 3, 2, 1, 2, 3, 4, 3, 2),
    +      col=c("red", "blue", "green", "beige", "goldenrod",
    +            "turquoise", "salmon", "purple", "pink", "seashell"))

    [Figure: ten points, each in its own color; x axis 1:10, y axis c(5, 4, 3, 2, 1, 2, 3, 4, 3, 2).]

    Colors

    > palette()
    [1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta" "yellow"
    [8] "gray"

    col = 1 is equivalent to col = "black", col = 2 to col = "red", etc.
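A small sketch of how integer colors index the palette. Note that in R 4.0.0 and later the default palette uses different shades, so palette() may print hex codes for some entries rather than the names listed above; the point sizes and labels below are arbitrary.

```r
pal <- palette()     # current palette: col = 1 is pal[1], col = 2 is pal[2], ...

plot(1:8, rep(1, 8), col = 1:8, pch = 19, cex = 3,
     xlab = "col", ylab = "", yaxt = "n")  # one large point per palette entry
text(1:8, 1, pal, pos = 3, cex = 0.7)      # label each point with its palette entry
```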

    Type Arguments

    "p": Points
    "l": Lines
    "b": Both
    "c": The lines part alone of "b"
    "o": Both “overplotted”
    "h": Histogram like (or high-density) vertical lines
    "n": No plotting

    Type Argument

    > x


    Type Arguments
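A minimal sketch comparing the values of the type argument side by side, assuming a small hypothetical vector x (not the data used on the slides):

```r
x <- c(2, 5, 3, 8, 6, 9, 4)            # hypothetical values
types <- c("p", "l", "b", "c", "o", "h", "n")

op <- par(mfrow = c(2, 4))             # a 2 x 4 grid has room for the seven panels
for (tp in types) {
  plot(x, type = tp, main = paste0('type = "', tp, '"'))
}
par(op)                                # back to one plot per page
```

The panel for type = "n" shows only the axes, which is why that type is useful for setting up an empty plot that is filled in afterwards.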


    Add to a graph

    points(x,y): add points to an existing graph.
    lines(x,y): add a line to an existing graph.
    abline(a=2, b=4): add to the existing graph the line y = 2+4x.
    abline(v=4): add to the existing graph the line x = 4.
    abline(h=2): add to the existing graph the line y = 2.
    abline(v=1:4): add to the existing graph the lines x = 1, x = 2, x = 3 & x = 4.
    segments(x0, y0, x1, y1): draw line segments between pairs of points. (x0, y0): coordinates of points from which to draw & (x1, y1): coordinates of points to which to draw.
    arrows(x0, y0, x1, y1): draw arrows between pairs of points. (x0, y0): coordinates of points from which to draw & (x1, y1): coordinates of points to which to draw. Additional arguments here are: (a) length: length of the edges of the arrow head (in inches), (b) angle: angle from the shaft of the arrow to the edge of the arrow head, (c) code: integer code, determining the kind of arrows to be drawn.
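These functions all draw on top of an existing plot. A sketch, with arbitrary coordinates, that starts from an empty plot and adds each element in turn:

```r
a <- 2; b <- 4                           # intercept and slope of the line y = 2 + 4x
plot(c(0, 5), c(0, 25), type = "n", xlab = "x", ylab = "y")  # empty plot to draw on

abline(a = a, b = b)                     # the line y = 2 + 4x
abline(v = 1:4, lty = 3)                 # vertical lines x = 1, 2, 3, 4
abline(h = 10, col = "gray")             # horizontal line y = 10
segments(x0 = 0, y0 = 20, x1 = 2, y1 = 25, col = "blue")   # segment from (0, 20) to (2, 25)

tip_y <- a + b * 2                       # the point (2, tip_y) lies on the line
arrows(x0 = 4, y0 = 5, x1 = 2, y1 = tip_y,                 # arrow pointing at that point
       length = 0.1, angle = 20, code = 2, col = "red")    # code = 2: head at (x1, y1)
```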


    Line Types
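The six standard line types can be given to lty either by number or by name. A sketch that draws one line per type; the layout and label placement are arbitrary choices for this example:

```r
lty_names <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

plot(c(0, 1), c(0.5, 6.5), type = "n", axes = FALSE, xlab = "", ylab = "",
     main = "lty = 1 to 6")
for (i in 1:6) {
  abline(h = i, lty = i, lwd = 2)                        # line type given by number
  text(0.5, i, paste0(i, ": ", lty_names[i]), pos = 3)   # number and equivalent name, above the line
}
```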


    Rectangles

    > plot(c(100, 200), c(300, 450), type="n", xlab="", ylab="")
    > rect(100, 300, 125, 350)

    [Figure: empty plotting region (x from 100 to 200, y from 300 to 450) containing one rectangle with corners (100, 300) and (125, 350).]

    Polygons

    > plot(c(100, 300), c(300, 450), type="n", xlab="", ylab="")
    > polygon(c(140, 120, 120, 160, 160), c(440, 400, 320, 320, 400))

    [Figure: one five-sided polygon in a plotting region with x from 100 to 300 and y from 300 to 450.]

    Polygons

    > plot(c(100, 300), c(300, 450), type="n", xlab="", ylab="")
    > polygon(c(140, 120, 120, 160, 160, NA, 240, 220, 220, 260, 260),
    +         c(440, 400, 320, 320, 400, NA, 440, 400, 320, 320, 400))   # NA separates the two polygons

    [Figure: two five-sided polygons side by side in a plotting region with x from 100 to 300 and y from 300 to 450.]

    Curves

    > curve(x^3 - 3*x, -2, 2, ylab="f")
    > curve(x^2 - 2, add = TRUE, col = "blue")

    [Figure: both curves on one plot; x axis x (from -2 to 2), y axis f (from -2 to 2).]

    Text

    > plot(c(100, 300), c(300, 450), type="n", xlab="", ylab="")
    > text(150, 380, "STATISTICS")

    [Figure: the word STATISTICS centered at (150, 380) in an otherwise empty plotting region.]

    Text

    > plot(c(100, 300), c(300, 450), type="n", xlab="", ylab="")
    > text(150, 380, "STATISTICS", srt=30, cex=2, font=4)   # srt: rotation in degrees

    [Figure: the word STATISTICS at (150, 380), enlarged, in bold italic and rotated by 30 degrees.]

    Text

    With weight a numeric vector of ten weights and gender the matching vector of "M"/"F" labels (both defined beforehand):

    > plot(weight, type="n")
    > text(weight, label=gender)

    [Figure: weight (from 55 to 90) against Index (1 to 10); each observation is drawn as its gender label, M or F.]

    Identify Points in a Scatter plot

    > plot(x)
    > identify(x, n=1)
    [1] 53

    [Figure: Identify Points in a Scatter plot. Two copies of the plot of x against Index (0 to 100): one showing a + marker, the other with the selected point labelled 53.]

    Multiple Graphs

    R makes it easy to combine multiple plots into one overall graph, using the par( ) function.

    With the par( ) function, you can include the option mfrow=c(nrows, ncols) to create a matrix of nrows x ncols plots that are filled in by row.
    mfcol=c(nrows, ncols) fills in the matrix by columns.

    Multiple Graphs

    # 4 figures arranged in 2 rows and 2 columns
    > attach(mtcars)   # attaches dataset mtcars
    > par(mfrow=c(2,2))
    > plot(wt,mpg, main="Scatterplot of wt vs. mpg")
    > plot(wt,disp, main="Scatterplot of wt vs disp")
    > hist(wt, main="Histogram of wt")
    > boxplot(wt, main="Boxplot of wt")

    [Figure: Multiple Graphs. 2 x 2 grid: scatterplot of wt vs. mpg, scatterplot of wt vs. disp, histogram of wt (y axis: Frequency), boxplot of wt.]

    Graphs in Higher Dimensions

    Download and install the package ElemStatLearn from the menu, or type:

    > install.packages("ElemStatLearn")
    > library(ElemStatLearn)

    Data set ozone:

    > data(ozone)
    > pairs(ozone)


    Scatterplot Matrices

    ozone