r - get started i - sanaitics

66
GET STARTED IN R Module I 1

Upload: vijith-nair

Post on 15-Feb-2017

334 views

Category:

Education


0 download

TRANSCRIPT

Page 1: R - Get Started I - Sanaitics

1

GET STARTED IN RModule I

Page 2: R - Get Started I - Sanaitics

2

What is R?• A language and environment for statistical

computing and graphics

• Wide variety of statistical & graphical techniques built in

• Free and Open Source software

• Compiles and runs on a wide variety of UNIX platforms, Windows and MacOS

Page 3: R - Get Started I - Sanaitics

3

R Environment• Most functionality through built in functions

• Basic functions available by default

• Other functions contained in packages that can be attached

• All datasets created in the session are in Memory

• Output can be used as input to other functions

Page 4: R - Get Started I - Sanaitics

4

R-CRAN• The Comprehensive R Archive Network

• A network of global web servers storing identical, up-to-date, versions of code and documentation for R

• Use the CRAN mirror nearest to you to download R setup at a faster speed

• http://cran.r-project.org/

Page 5: R - Get Started I - Sanaitics

5

SECTION 1Introduction to R and User Interface (UI)

Page 6: R - Get Started I - Sanaitics

6

The R-UI

The R console-- Type in the command -- Press enter-- See the output

Page 7: R - Get Started I - Sanaitics

7

Create Your Data Setx<-c(12,23,45)y<-c(13,21,6) Create vectors x, y, z on R

Consolez<-c("a","b","c")data1<-data.frame(x,y,z) Combine them in a

tabledata1 Type data name for output

Page 8: R - Get Started I - Sanaitics

8

Your data in Table mode

edit(data1)

Edit your table, add new values & close the window to save your updates

Page 9: R - Get Started I - Sanaitics

9

Export your updated table write.csv (data1,file.choose())

Select the path, name your file and click save

Page 10: R - Get Started I - Sanaitics

10

Data Set typesVector 1 dimension

All elements have the same data types

NumericCharacterLogicFactor

Matrix 2 dimensions

Array 2 or moredimensions

Data frame 2 dimensionsTable-like data object allowing different data types for different columns

ListCollection of data objects, each element of a list is a data object

Page 11: R - Get Started I - Sanaitics

11

Useful In-Built Functionsmin(data1$x)Use $ sign after data set name to use specific columns

max(data1$y) length(data1$x)

mean(data1$y)

levels(data1$z) Gives a list of unique categories in the data

Page 12: R - Get Started I - Sanaitics

12

Save Your ScriptOpen a new script, write commands, execute using F5 key. Save the file at a desired folder. Helps to save the script for a later use.

Page 13: R - Get Started I - Sanaitics

13

R Workspace• Objects that you create during an R session are

hold in memory, the collection of objects that you currently have is called the workspace

• This workspace is not saved on disk unless you tell R to do so

• This means that your objects are lost when you close R and not save the objects, or worse when R or your system crashes on you during a session

Page 14: R - Get Started I - Sanaitics

14

R Workspace• If you have saved a workspace image and

you start R the next time, it will restore the workspace

• So all your previously saved objects are available again

• You can also explicitly load a saved workspace i.e., that could be the workspace image of someone else

• Go the `File' menu and select `Load workspace'

Page 15: R - Get Started I - Sanaitics

15

R WorkspaceDisplay last 25 commandshistory()

Display all previous commandshistory(max.show=Inf)

Save your command history to a file. Default is ".Rhistory"savehistory(file="myfile")

Recall your command history. Default is ".Rhistory"loadhistory(file="myfile")

Page 16: R - Get Started I - Sanaitics

16

Types of Variables• Type: Logical, integer, double, complex,

character, factor or raw

• Type conversion functions have the naming convention

as.xxx

• For example, as.integer(3.2) returns the integer 3, and as.character(3.2) returns the string "3.2".

Page 17: R - Get Started I - Sanaitics

17

Factors• Tell R that a variable is nominal by making it

a factor

• The factor stores the nominal values as a vector of integers in the range [ 1... k ] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers

• An ordered factor is used to represent an ordinal variable

Page 18: R - Get Started I - Sanaitics

18

Factors• R will treat factors as nominal variables and

ordered factors as ordinal variables in statistical procedures and graphical analyses

• You can use options in the factor( ) and ordered( ) functions to control the mapping of integers to strings (overriding the alphabetical ordering)

Page 19: R - Get Started I - Sanaitics

19

Help in R•If you know the topic but not the exact

function

Help.search(“topic”)

•If you know the exact function

Help(functionname)

Page 20: R - Get Started I - Sanaitics

20

In-Memory Computing•Storage of information in the main RAM

of dedicated servers rather than in complicated relational databases operating on comparatively slow disk drives

•Helps business customers, including retailers, banks and utilities, to quickly detect patterns, analyze massive data volumes on the fly, and perform their operations quickly 

Page 21: R - Get Started I - Sanaitics

21

Import .csv Data Filebasic_salary<-read.table(“C:/Users/Documents/R/BASIC SALARY DATA.csv", header=TRUE, sep=",")

• Command requires the file path separated by a /

• header=TRUE if there are row labels

Page 22: R - Get Started I - Sanaitics

22

Data ImportedUse Head function to get an idea about how your data looks

like

head(basic_salary)

Page 23: R - Get Started I - Sanaitics

23

Manual Importbasic_salary<-read.table(file.choose(), header=TRUE , sep=",")

Select file of interest and click open to import

Page 24: R - Get Started I - Sanaitics

24

Checks after Importdim(basic_salary)

str(basic_salary)

names(basic_salary)

Page 25: R - Get Started I - Sanaitics

25

Importing .txt File#importing a text file having dlm = “blank/space”sample1<-read.table(file.choose(),header=T)sample1

#importing a comma separated text file new<-read.table(file.choose(), header=T, sep=",")new

#another mehod for importing a comma separated text file

new1<-read.csv(file.choose(),header=T)new1

Page 26: R - Get Started I - Sanaitics

26

SECTION 2Data Management: Creating Subsets

Page 27: R - Get Started I - Sanaitics

27

Using Indices•Row level sub setting: display some

selected rows •Sub setting by putting condition on

observations•Column level sub setting: display some

selected columns •Sub setting by putting condition on

variable names•Display selected rows for selected

columns•Sub setting by putting conditions on

variables and observations

Page 28: R - Get Started I - Sanaitics

28

Row Sub-Settingbasic_salary[c(5:10), ] Display rows from 5th to 10th

Page 29: R - Get Started I - Sanaitics

29

Row Sub-Settingbasic_salary[c(1,3,5,8), ] Display only selected rows

Page 30: R - Get Started I - Sanaitics

30

Condition on Observationsbasic_salary1 <- basic_salary[basic_salary$Location ==“DELHI” & basic_salary$ba > 20000, ]

head(basic_salary1)

Page 31: R - Get Started I - Sanaitics

31

More than one value of Variablebasic_salary2 <-basic_salary[basic_salary$Location %in% c("MUMBAI"),]

head(basic_salary2)

Page 32: R - Get Started I - Sanaitics

32

Column Sub-Settingbasic_salary3<-basic_salary[ , 1:2]

head(basic_salary3)

Page 33: R - Get Started I - Sanaitics

33

Condition on Variable Namesbasic_salary4 <- basic_salary[ ,c("First_Name","Location", "ba")]

head(basic_salary4)

Page 34: R - Get Started I - Sanaitics

34

Selected Rows for Selected Columns

basic_salary5<-basic_salary[c(1,5,8,4), c(1,2)]

basic_salary5

Page 35: R - Get Started I - Sanaitics

35

Condition on Variables & Observations

basic_salary6<-basic_salary[basic_salary$Grade=="GR1" & basic_salary$ba >15000 , c("First_Name","Location", "ba") ]

head(basic_salary6)

Page 36: R - Get Started I - Sanaitics

36

Subset Function•Condition on observations

•Condition on variable names

•Condition on observations and variable names

Page 37: R - Get Started I - Sanaitics

37

Condition on Observationsbasic_salary7<-subset(basic_salary, Location=="DELHI" & ba > 20000)

head(basic_salary7)

Page 38: R - Get Started I - Sanaitics

38

Condition on Variable Namesbasic_salary8<-subset(basic_salary,select=c(Location,First_Name,ba))

head(basic_salary8)

Page 39: R - Get Started I - Sanaitics

39

Condition on Observations & Variablebasic_salary9<-subset(basic_salary,Grade=="GR1" & ba>15000,select= c(First_Name,Grade, Location))

head(basic_salary9)

Page 40: R - Get Started I - Sanaitics

40

Using Head Functionhead(basic_salary, n=5)

Page 41: R - Get Started I - Sanaitics

41

Using Tail Functiontail(basic_salary, n=5)

Page 42: R - Get Started I - Sanaitics

42

Using Split Functionbasic_salary10<-split(basic_salary,basic_salary$Location)

delhi_salary<-data.frame(basic_salary10[1])

head(delhi_salary)

Page 43: R - Get Started I - Sanaitics

43

Using Split Functionmumbai_salary<-data.frame(basic_salary10[2])

head(mumbai_salary,5)

Page 44: R - Get Started I - Sanaitics

44

Delete Casesbasic_salary11<-basic_salary[!(basic_salary$Grade=="GR1") & !(basic_salary$Location=="MUMBAI"), ]

basic_salary11

Page 45: R - Get Started I - Sanaitics

45

Delete Casesbasic_salary12<-basic_salary[ !(basic_salary$ba>14500), ]

basic_salary12

Page 46: R - Get Started I - Sanaitics

46

Recoding Based on Quartilesbrk1<-with(basic_salary,quantile(basic_salary$ba,probs<-seq(0,1,0.25)))brk1

basic_salary13<-within(basic_salary,salary_quartile<-cut(basic_salary$ba,breaks=brk1,labels=1:4,include.lowest=TRUE))

Page 47: R - Get Started I - Sanaitics

47

Recoding Based on Quartileshead(basic_salary13)

Page 48: R - Get Started I - Sanaitics

48

Recoding Based on Decilesbrk2<-with(basic_salary,quantile(basic_salary$ba, probes<-seq(0,1,0.10)))brk2

basic_salary14<-within(basic_salary,salary_deciles<- cut(basic_salary$ba,breaks=brk2,labels=1:10,include.lowest=TRUE))

Page 49: R - Get Started I - Sanaitics

49

Recoding Based on Decilestail(basic_salary14)

Page 50: R - Get Started I - Sanaitics

50

Recoding Based on User Cut Pointsbrk3<-with(basic_salary,c(min(basic_salary$ba), 15000,18000,max(basic_salary$ba)))

brk3

basic_salary15<-within(basic_salary,salary_partition<-cut(basic_salary$ba,breaks=brk3,labels=1:3,include.lowest=TRUE))

Page 51: R - Get Started I - Sanaitics

51

Recoding Based on User Cut Points basic_salary15[c(20,30),]

Page 52: R - Get Started I - Sanaitics

52

Remove Variables#using remove.vars function (need to load “gdata”

package)

Removing variable 'Name'Removing variable 'Place‘

basic_salary<-remove.vars(basic_salary,c("Last_Name", "Location"))

names(basic_salary)[1] "X" "First_Name" "Grade" "ba"

Page 53: R - Get Started I - Sanaitics

53

SECTION 3Sorting & Merging

Page 54: R - Get Started I - Sanaitics

54

Sorting Dataattach(basic_salary)bs_sorted1<-basic_salary[order(-ba), ]head(bs_sorted1)

Page 55: R - Get Started I - Sanaitics

55

Sorting by Multiple Variablesbs_sorted2<-basic_salary[order( First_Name, ba) , ] head(bs_sorted2)

Page 56: R - Get Started I - Sanaitics

56

Merging by VariablesCreate the below two datasetssal_data

bonus_data

Page 57: R - Get Started I - Sanaitics

57

Outer Joinouterjoin<-

merge(sal_data,bonus_data,by=c("Employee_ID"),all=TRUE)

outerjoin

Page 58: R - Get Started I - Sanaitics

58

Inner Joininnerjoin<-merge(sal_data,bonus_data,by=("Employee_ID"))

innerjoin

Page 59: R - Get Started I - Sanaitics

59

Left Joinleftjoin<-merge(sal_data,bonus_data, by=c("Employee_ID"),all.x=TRUE)

leftjoin

Page 60: R - Get Started I - Sanaitics

60

Right Joinright_join<-merge(sal_data,bonus_data, by="Employee_ID" , all.y=TRUE)

right_join

Page 61: R - Get Started I - Sanaitics

61

Merging Cases (append)# rbind function

Appending two datasets using rbind function requires both the datasets with exactly the same number of variables with exactly the same names

If datasets do not have the same number of variables , variables can be either dropped or created so both match.

or else

# smartbind function can be used #need to donwload gtools package to use this function

Page 62: R - Get Started I - Sanaitics

62

R-String Functionsbasic_salary$Location<-substr(basic_salary$Location,1,1)head(basic_salary)

Page 63: R - Get Started I - Sanaitics

63

Replace Functionbasic_salary$Location<-substr(basic_salary$Location,1,1)head(basic_salary)table(basic_salary$Location)

Page 64: R - Get Started I - Sanaitics

64

Contd…..

replace Functionreplace() function replaces the values in x with indices given in list by thosegiven in values. If necessary, the values in values are recycled.

Page 65: R - Get Started I - Sanaitics

65

Deleting missing cases# complete.cases, na.exclude(), na.omit()are for dealing manually with NAs in a dataset

x<-c(12,NA,13,12)y<-c(25,48,NA,NA) data<-data.frame(x,y) data

data1<-na.omit(data) data1

x y1 12 25

data[complete.cases(data),]

x y1 12 25

na.exclude(data) x y1 12 25

Page 66: R - Get Started I - Sanaitics

66

Thank you !!!!