r - get started i - sanaitics
TRANSCRIPT
1
GET STARTED IN R: Module I
2
What is R?
• A language and environment for statistical computing and graphics
• Wide variety of statistical & graphical techniques built in
• Free and Open Source software
• Compiles and runs on a wide variety of UNIX platforms, Windows and MacOS
3
R Environment
• Most functionality through built-in functions
• Basic functions available by default
• Other functions contained in packages that can be attached
• All datasets created in the session are in Memory
• Output can be used as input to other functions
4
R-CRAN
• The Comprehensive R Archive Network
• A network of global web servers storing identical, up-to-date versions of code and documentation for R
• Use the CRAN mirror nearest to you to download R setup at a faster speed
• http://cran.r-project.org/
5
SECTION 1: Introduction to R and User Interface (UI)
6
The R-UI
The R console
-- Type in the command
-- Press Enter
-- See the output
7
Create Your Data Set
# Create vectors x, y, z on the R Console
x<-c(12,23,45)
y<-c(13,21,6)
z<-c("a","b","c")
# Combine them in a table
data1<-data.frame(x,y,z)
# Type the data name for output
data1
8
Your data in Table mode
data1<-edit(data1)
Edit your table, add new values & close the window. edit() returns the modified copy, so assign the result back to data1 to keep your updates
9
Export your updated table
write.csv(data1,file.choose())
Select the path, name your file and click save
10
Data Set Types
Vector: 1 dimension. All elements have the same data type (Numeric, Character, Logical, Factor)
Matrix: 2 dimensions
Array: 2 or more dimensions
Data frame: 2 dimensions. Table-like data object allowing different data types for different columns
List: Collection of data objects; each element of a list is a data object
11
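Useful In-Built Functions
Use the $ sign after the data set name to refer to a specific column
min(data1$x)
max(data1$y)
length(data1$x)
mean(data1$y)
levels(data1$z)  # Gives a list of unique categories in the data; returns NULL unless z is a factor (since R 4.0, data.frame() keeps strings as character by default)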
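As a minimal sketch of the functions on this slide, the block below rebuilds data1 from the earlier slide and stores each result. The stringsAsFactors = TRUE argument and the result names are assumptions added here so that z is a factor and levels() has something to return.

```r
x <- c(12, 23, 45)
y <- c(13, 21, 6)
z <- c("a", "b", "c")

# stringsAsFactors = TRUE turns z into a factor (assumption for this sketch)
data1 <- data.frame(x, y, z, stringsAsFactors = TRUE)

smallest_x <- min(data1$x)      # smallest value in column x
largest_y  <- max(data1$y)      # largest value in column y
n_x        <- length(data1$x)   # number of elements in column x
mean_y     <- mean(data1$y)     # arithmetic mean of column y
z_levels   <- levels(data1$z)   # unique categories of the factor z
```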
12
Save Your Script
Open a new script, write your commands, and execute them with the F5 key. Save the file in a folder of your choice so the script can be reused later.
13
R Workspace
• Objects that you create during an R session are held in memory; the collection of objects that you currently have is called the workspace
• This workspace is not saved on disk unless you tell R to do so
• This means that your objects are lost when you close R without saving them, or worse, when R or your system crashes during a session
14
R Workspace
• If you have saved a workspace image, R restores the workspace the next time you start it
• So all your previously saved objects are available again
• You can also explicitly load a saved workspace, which could be the workspace image of someone else
• Go to the 'File' menu and select 'Load Workspace'
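The same save and restore cycle can also be done from the console. The sketch below is an illustration only: the object name score and the temporary file are hypothetical, and it saves a single object with save() where save.image() would store the whole workspace.

```r
# Hypothetical object and temporary file, used only for this sketch
score <- c(12, 23, 45)
ws_file <- tempfile(fileext = ".RData")

# Write the object to disk; save.image(ws_file) would store every object instead
save(score, file = ws_file)

# Simulate closing R: the object disappears from the workspace
rm(score)

# load() brings the saved objects back and returns their names
restored <- load(ws_file)
```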
15
R Workspace
# Display last 25 commands
history()
# Display all previous commands
history(max.show=Inf)
# Save your command history to a file. Default is ".Rhistory"
savehistory(file="myfile")
# Recall your command history. Default is ".Rhistory"
loadhistory(file="myfile")
16
Types of Variables
• Type: logical, integer, double, complex, character, factor or raw
• Type conversion functions have the naming convention as.xxx
• For example, as.integer(3.2) returns the integer 3, and as.character(3.2) returns the string "3.2".
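A short sketch of the as.xxx conversions named above; the variable names are hypothetical and the last two conversions are extra illustrations not taken from the slide.

```r
a <- as.integer(3.2)       # fractional part is dropped, giving the integer 3
b <- as.character(3.2)     # number converted to the string "3.2"
c_num <- as.numeric("4.5") # string converted to a double
d <- as.logical("TRUE")    # string converted to a logical value

type_a <- typeof(a)        # "integer"
type_b <- typeof(b)        # "character"
```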
17
Factors
• Tell R that a variable is nominal by making it a factor
• The factor stores the nominal values as a vector of integers in the range [ 1... k ] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers
• An ordered factor is used to represent an ordinal variable
18
Factors
• R will treat factors as nominal variables and ordered factors as ordinal variables in statistical procedures and graphical analyses
• You can use options in the factor( ) and ordered( ) functions to control the mapping of integers to strings (overriding the alphabetical ordering)
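To make the integer-to-string mapping concrete, here is a minimal sketch with hypothetical size values. It compares the default alphabetical mapping with an explicit levels argument, and shows that an ordered factor supports comparisons.

```r
size <- c("small", "large", "medium", "small")   # hypothetical values

# Default: levels are sorted alphabetically (large, medium, small)
f_default <- factor(size)

# Explicit levels override the alphabetical mapping
f_custom <- factor(size, levels = c("small", "medium", "large"))

# Ordered factor for an ordinal variable
f_ordered <- ordered(size, levels = c("small", "medium", "large"))

codes_default <- as.integer(f_default)     # internal integer codes, default mapping
codes_custom  <- as.integer(f_custom)      # internal integer codes, custom mapping
bigger_than_small <- f_ordered > "small"   # comparison is allowed for ordered factors
```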
19
Help in R
• If you know the topic but not the exact function
help.search("topic")
• If you know the exact function
help(functionname)
20
In-Memory Computing
• Storage of information in the main RAM of dedicated servers rather than in complicated relational databases operating on comparatively slow disk drives
•Helps business customers, including retailers, banks and utilities, to quickly detect patterns, analyze massive data volumes on the fly, and perform their operations quickly
21
Import .csv Data File
basic_salary<-read.table("C:/Users/Documents/R/BASIC SALARY DATA.csv", header=TRUE, sep=",")
• The command requires the file path, with folders separated by /
• header=TRUE if the first row contains the column names
22
Data Imported
Use the head() function to get an idea of what your data looks like
head(basic_salary)
23
Manual Import
basic_salary<-read.table(file.choose(), header=TRUE, sep=",")
Select file of interest and click open to import
24
Checks after Import
dim(basic_salary)    # number of rows and columns
str(basic_salary)    # structure: type and first values of each column
names(basic_salary)  # column names
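The import and the checks can be tried without the original file. The sketch below writes a small data frame to a temporary .csv and reads it back; the rows are hypothetical, and only the column names are taken from the slides.

```r
# Hypothetical rows; only the column names come from the slides
basic_salary <- data.frame(
  First_Name = c("Alan", "Agatha", "Rajesh"),
  Last_Name  = c("Brown", "Williams", "Kolte"),
  Grade      = c("GR1", "GR2", "GR1"),
  Location   = c("DELHI", "MUMBAI", "MUMBAI"),
  ba         = c(17990, 12390, 19250)
)

# Write to a temporary file so the sketch does not depend on a local path
csv_path <- tempfile(fileext = ".csv")
write.csv(basic_salary, csv_path, row.names = FALSE)

# Same call pattern as on the slide: header row present, comma separated
imported <- read.table(csv_path, header = TRUE, sep = ",")

dim(imported)     # number of rows and columns
str(imported)     # column types and first values
names(imported)   # column names
```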
25
Importing .txt File
# importing a text file delimited by blanks/spaces
sample1<-read.table(file.choose(),header=TRUE)
sample1
# importing a comma separated text file
new<-read.table(file.choose(), header=TRUE, sep=",")
new
# another method for importing a comma separated text file
new1<-read.csv(file.choose(),header=TRUE)
new1
26
SECTION 2: Data Management: Creating Subsets
27
Using Indices
• Row level subsetting: display some selected rows
  • Subsetting by putting condition on observations
• Column level subsetting: display some selected columns
  • Subsetting by putting condition on variable names
• Display selected rows for selected columns
  • Subsetting by putting conditions on variables and observations
28
Row Sub-Setting
basic_salary[c(5:10), ]  # Display rows from 5th to 10th
29
Row Sub-Setting
basic_salary[c(1,3,5,8), ]  # Display only selected rows
30
Condition on Observations
basic_salary1 <- basic_salary[basic_salary$Location == "DELHI" & basic_salary$ba > 20000, ]
head(basic_salary1)
31
More than one value of Variable
basic_salary2 <- basic_salary[basic_salary$Location %in% c("MUMBAI"), ]
head(basic_salary2)
32
Column Sub-Setting
basic_salary3<-basic_salary[ , 1:2]
head(basic_salary3)
33
Condition on Variable Names
basic_salary4 <- basic_salary[ ,c("First_Name","Location", "ba")]
head(basic_salary4)
34
Selected Rows for Selected Columns
basic_salary5<-basic_salary[c(1,5,8,4), c(1,2)]
basic_salary5
35
Condition on Variables & Observations
basic_salary6<-basic_salary[basic_salary$Grade=="GR1" & basic_salary$ba >15000 , c("First_Name","Location", "ba") ]
head(basic_salary6)
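The index-based patterns from the preceding slides can be run end to end on a small stand-in table. This is a sketch under stated assumptions: the six rows are hypothetical, and only the column names and the conditions come from the slides.

```r
# Hypothetical stand-in for the imported salary data
basic_salary <- data.frame(
  First_Name = c("Alan", "Agatha", "Rajesh", "Ameet", "Neha", "Sagar"),
  Grade      = c("GR1", "GR2", "GR1", "GR2", "GR1", "GR2"),
  Location   = c("DELHI", "MUMBAI", "MUMBAI", "DELHI", "MUMBAI", "MUMBAI"),
  ba         = c(21990, 12390, 19250, 14780, 16500, 13390)
)

# Row subsetting by position: rows 2 to 4, all columns
rows_2_to_4 <- basic_salary[2:4, ]

# Condition on observations: DELHI employees with ba above 20000
delhi_high <- basic_salary[basic_salary$Location == "DELHI" & basic_salary$ba > 20000, ]

# Condition on variable names: keep two named columns
name_ba <- basic_salary[, c("First_Name", "ba")]

# Conditions on both: GR1 rows with ba above 15000, three named columns
gr1_high <- basic_salary[basic_salary$Grade == "GR1" & basic_salary$ba > 15000,
                         c("First_Name", "Location", "ba")]
```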
36
Subset Function
• Condition on observations
•Condition on variable names
•Condition on observations and variable names
37
Condition on Observations
basic_salary7<-subset(basic_salary, Location=="DELHI" & ba > 20000)
head(basic_salary7)
38
Condition on Variable Names
basic_salary8<-subset(basic_salary,select=c(Location,First_Name,ba))
head(basic_salary8)
39
Condition on Observations & Variable
basic_salary9<-subset(basic_salary,Grade=="GR1" & ba>15000,select= c(First_Name,Grade, Location))
head(basic_salary9)
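For comparison with index-based subsetting, here is a minimal sketch of the three subset() patterns on a hypothetical four-row table. Inside subset(), column names are written without the data frame prefix or quotes.

```r
# Hypothetical stand-in for the imported salary data
basic_salary <- data.frame(
  First_Name = c("Alan", "Agatha", "Rajesh", "Ameet"),
  Grade      = c("GR1", "GR2", "GR1", "GR2"),
  Location   = c("DELHI", "MUMBAI", "MUMBAI", "DELHI"),
  ba         = c(21990, 12390, 19250, 14780)
)

# Condition on observations
delhi_high <- subset(basic_salary, Location == "DELHI" & ba > 20000)

# Condition on variable names; columns come back in the order listed
some_cols <- subset(basic_salary, select = c(Location, First_Name, ba))

# Condition on observations and variable names together
gr1_high <- subset(basic_salary, Grade == "GR1" & ba > 15000,
                   select = c(First_Name, Grade, Location))
```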
40
Using Head Function
head(basic_salary, n=5)
41
Using Tail Function
tail(basic_salary, n=5)
42
Using Split Function
basic_salary10<-split(basic_salary,basic_salary$Location)
delhi_salary<-data.frame(basic_salary10[1])
head(delhi_salary)
43
Using Split Function
mumbai_salary<-data.frame(basic_salary10[2])
head(mumbai_salary,5)
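A small sketch of what split() returns, using hypothetical rows. It extracts each piece by name with [[ ]] as an alternative to the positional basic_salary10[1] and basic_salary10[2] used on the slides, which rely on DELHI sorting before MUMBAI.

```r
# Hypothetical stand-in for the imported salary data
basic_salary <- data.frame(
  First_Name = c("Alan", "Agatha", "Rajesh", "Ameet"),
  Location   = c("DELHI", "MUMBAI", "MUMBAI", "DELHI"),
  ba         = c(21990, 12390, 19250, 14780)
)

# split() returns a named list with one data frame per Location value
by_location <- split(basic_salary, basic_salary$Location)

group_names   <- names(by_location)       # names follow the sorted Location values
delhi_salary  <- by_location[["DELHI"]]   # extract by name, not by position
mumbai_salary <- by_location[["MUMBAI"]]
```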
44
Delete Cases
basic_salary11<-basic_salary[!(basic_salary$Grade=="GR1") & !(basic_salary$Location=="MUMBAI"), ]
basic_salary11
45
Delete Cases
basic_salary12<-basic_salary[ !(basic_salary$ba>14500), ]
basic_salary12
46
Recoding Based on Quartiles
brk1<-with(basic_salary,quantile(basic_salary$ba,probs=seq(0,1,0.25)))
brk1
basic_salary13<-within(basic_salary,salary_quartile<-cut(basic_salary$ba,breaks=brk1,labels=1:4,include.lowest=TRUE))
47
Recoding Based on Quartiles
head(basic_salary13)
48
Recoding Based on Deciles
brk2<-with(basic_salary,quantile(basic_salary$ba,probs=seq(0,1,0.10)))
brk2
basic_salary14<-within(basic_salary,salary_deciles<-cut(basic_salary$ba,breaks=brk2,labels=1:10,include.lowest=TRUE))
49
Recoding Based on Deciles
tail(basic_salary14)
50
Recoding Based on User Cut Points
brk3<-with(basic_salary,c(min(basic_salary$ba), 15000,18000,max(basic_salary$ba)))
brk3
basic_salary15<-within(basic_salary,salary_partition<-cut(basic_salary$ba,breaks=brk3,labels=1:3,include.lowest=TRUE))
51
Recoding Based on User Cut Points
basic_salary15[c(20,30),]
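The recoding slides follow one pattern: compute break points, then pass them to cut(). The sketch below applies it to eight hypothetical ba values, once with quartile breaks and once with the user cut points 15000 and 18000 from the slide.

```r
# Hypothetical salary values for illustration
basic_salary <- data.frame(
  ba = c(10000, 12000, 14000, 16000, 18000, 20000, 22000, 24000)
)

# Quartile break points: minimum, 25%, 50%, 75%, maximum
brk1 <- quantile(basic_salary$ba, probs = seq(0, 1, 0.25))

# include.lowest = TRUE keeps the minimum value inside the first interval
basic_salary <- within(basic_salary,
  salary_quartile <- cut(ba, breaks = brk1, labels = 1:4, include.lowest = TRUE))

# User-defined cut points between the minimum and maximum
brk3 <- c(min(basic_salary$ba), 15000, 18000, max(basic_salary$ba))
basic_salary <- within(basic_salary,
  salary_partition <- cut(ba, breaks = brk3, labels = 1:3, include.lowest = TRUE))
```

Intervals are closed on the right by default, so a value equal to a break point (18000 here) falls into the lower group.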
52
Remove Variables
# using remove.vars function (need to load the "gdata" package)
basic_salary<-remove.vars(basic_salary,c("Last_Name", "Location"))
Removing variable 'Name'
Removing variable 'Place'
names(basic_salary)
[1] "X" "First_Name" "Grade" "ba"
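If the gdata package is not available, the same columns can be dropped with base R indexing. This is a swapped-in alternative to remove.vars(), sketched on hypothetical rows; it is not the slide's method.

```r
# Hypothetical stand-in with the column names used in the slides
basic_salary <- data.frame(
  First_Name = c("Alan", "Agatha"),
  Last_Name  = c("Brown", "Williams"),
  Grade      = c("GR1", "GR2"),
  Location   = c("DELHI", "MUMBAI"),
  ba         = c(17990, 12390)
)

# Keep every column whose name is NOT in the list to drop
drop_cols <- c("Last_Name", "Location")
basic_salary <- basic_salary[, !(names(basic_salary) %in% drop_cols)]

remaining <- names(basic_salary)
```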
53
SECTION 3: Sorting & Merging
54
Sorting Data
attach(basic_salary)
bs_sorted1<-basic_salary[order(-ba), ]
head(bs_sorted1)
55
Sorting by Multiple Variables
bs_sorted2<-basic_salary[order(First_Name, ba), ]
head(bs_sorted2)
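A minimal sorting sketch on hypothetical rows. It refers to columns through the data frame with $ instead of attach(), which is a choice made for this example so that it does not depend on the search path.

```r
# Hypothetical rows; two employees share a first name to show tie-breaking
basic_salary <- data.frame(
  First_Name = c("Rajesh", "Alan", "Neha", "Alan"),
  ba         = c(19250, 21990, 16500, 14780)
)

# Descending by ba: the minus sign reverses the numeric order
bs_sorted1 <- basic_salary[order(-basic_salary$ba), ]

# Ascending by First_Name, then by ba within the same name
bs_sorted2 <- basic_salary[order(basic_salary$First_Name, basic_salary$ba), ]
```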
56
Merging by Variables
Create the two datasets below: sal_data and bonus_data
57
Outer Join
outerjoin<-merge(sal_data,bonus_data,by=c("Employee_ID"),all=TRUE)
outerjoin
58
Inner Join
innerjoin<-merge(sal_data,bonus_data,by=("Employee_ID"))
innerjoin
59
Left Join
leftjoin<-merge(sal_data,bonus_data, by=c("Employee_ID"),all.x=TRUE)
leftjoin
60
Right Join
right_join<-merge(sal_data,bonus_data, by="Employee_ID", all.y=TRUE)
right_join
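The four joins differ only in which unmatched rows they keep. The sketch below shows this on two hypothetical tables: the Employee_ID key comes from the slides, while the other column names and all values are assumptions made for illustration.

```r
# Hypothetical tables sharing the key column Employee_ID
sal_data <- data.frame(
  Employee_ID  = c("E-1001", "E-1002", "E-1004"),
  Basic_Salary = c(16070, 14740, 13490)
)
bonus_data <- data.frame(
  Employee_ID = c("E-1001", "E-1003", "E-1004"),
  Bonus       = c(16070, 19110, 15880)
)

# Outer join: every ID from either table, NA where there is no match
outerjoin <- merge(sal_data, bonus_data, by = "Employee_ID", all = TRUE)

# Inner join: only IDs present in both tables
innerjoin <- merge(sal_data, bonus_data, by = "Employee_ID")

# Left join: every ID from sal_data
leftjoin <- merge(sal_data, bonus_data, by = "Employee_ID", all.x = TRUE)

# Right join: every ID from bonus_data
right_join <- merge(sal_data, bonus_data, by = "Employee_ID", all.y = TRUE)
```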
61
Merging Cases (append)
# rbind function
Appending two datasets with rbind requires both datasets to have exactly the same number of variables with exactly the same names.
If the datasets do not have the same number of variables, variables can be either dropped or created so that both match.
Or else:
# smartbind function can be used
# need to download the gtools package to use this function
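A minimal sketch of the "create the missing variable, then rbind" route using base R only. The table names jan and feb and their contents are hypothetical.

```r
# Hypothetical datasets: feb has an extra column that jan lacks
jan <- data.frame(First_Name = c("Alan", "Neha"), ba = c(21990, 16500))
feb <- data.frame(First_Name = "Sagar", ba = 13390, Location = "MUMBAI")

# Create the missing variable in jan so both datasets have the same names
jan$Location <- NA_character_

# rbind matches columns by name and stacks the rows
combined <- rbind(jan, feb)
```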
62
R-String Functions
basic_salary$Location<-substr(basic_salary$Location,1,1)
head(basic_salary)
63
Replace Function
basic_salary$Location<-substr(basic_salary$Location,1,1)
head(basic_salary)
table(basic_salary$Location)
64
Contd.
replace Function
The replace() function replaces the values in x at the indices given in list with those given in values. If necessary, the values in values are recycled.
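A short sketch of replace(x, list, values) on hypothetical vectors. The index can be a position or a logical condition, and replace() returns a new vector while leaving x unchanged.

```r
x <- c(12, NA, 13, 12)   # hypothetical values

# Replace the element at position 2
by_position <- replace(x, 2, 0)

# Replace wherever a condition holds (here: missing values)
by_condition <- replace(x, is.na(x), 0)

# A single value is recycled across several positions
recycled <- replace(1:6, c(2, 4, 6), 0)
```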
65
Deleting missing cases
# complete.cases(), na.exclude() and na.omit() are for dealing manually with NAs in a dataset
x<-c(12,NA,13,12)
y<-c(25,48,NA,NA)
data<-data.frame(x,y)
data
data1<-na.omit(data)
data1
   x  y
1 12 25
data[complete.cases(data),]
   x  y
1 12 25
na.exclude(data)
   x  y
1 12 25
66
Thank you !!!!