![Page 1: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/1.jpg)
USING THE “R” ACTOR IN KEPLER FOR QUALITY CONTROL
John Porter, University of Virginia, [email protected]
![Page 2: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/2.jpg)
R Basics
R is an open source statistical language
“Atomic” types: logical, integer, real, complex, string (or character) and raw
Data in R is stored in one of several types of objects:
Scalar: myVar <- 10
Vectors: myVec <- c(10,20,30)
Lists: myList <- c(10,"E",12.3) (strictly, c() coerces mixed values to a character vector; list(10,"E",12.3) builds a true list)
Matrix: myMat <- cbind(myVec1,myVec2)
Data Frames: myDf <- data.frame(myVec,myList)
Factors: myFac <- as.factor(myList)
![Page 3: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/3.jpg)
R Workspaces
All the variables and functions defined during a session are part of the “Workspace”
R Workspaces can be saved for later use
When you come back, everything is the same as when the workspace was saved
![Page 4: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/4.jpg)
Most Commonly Used Object Types
Vectors – contain a single column of one of the “atomic” types
Often created using the concatenate function: myVec <- c(10,20,30)
Individual elements can be accessed using indexes: myVec[2] is 20
![Page 5: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/5.jpg)
Data Frames
Data Frames – table-style objects that contain named vectors inside them
myDF$RAIN refers to the “RAIN” vector, as does myDF[ ,2]
myDF[135,3] is 121.8
![Page 6: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/6.jpg)
Reading Data into Data Frames
A common way of creating data frames is to read in a comma-separated-value (csv) file
myDf <- read.csv("C:/ft_monro.csv",header=TRUE)
Note: regardless of operating system, R wants “/” in file paths, not a single “\” (a backslash must be doubled, as “\\”)
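As a rough illustration of the read.csv() call above, the sketch below writes a tiny CSV file to a temporary directory and reads it back. The file name, column names and values are invented for the example; only read.csv() and the forward-slash convention come from the slide.

```r
# Build a small, made-up CSV in a temporary location (hypothetical data)
csvPath <- file.path(tempdir(), "example_rain.csv")
writeLines(c("YEAR,RAIN", "1901,40.1", "1902,38.7"), csvPath)

# normalizePath(winslash = "/") rewrites any backslashes as forward slashes
csvPath <- normalizePath(csvPath, winslash = "/")

# header = TRUE takes the column names from the first line of the file
myDf <- read.csv(csvPath, header = TRUE)
str(myDf)
```

str() is used here instead of printing the whole table because it also shows the type R chose for each column, which is worth checking right after a read.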
![Page 7: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/7.jpg)
Sample R Program for QA/QC

    # Select the Data File
    infile1 <- file("C:/downloads/ft_monroe.csv", open="r")
    # Read the data (skip=1 skips the header line, so header=FALSE)
    dataTable1 <- read.csv(infile1, header=FALSE, skip=1, sep=",", quote='"',
        col.names=c("YEAR", "RAIN", "RAIN_CM", "NOTES"), check.names=TRUE)
    attach(dataTable1)
    # Run basic summary statistics
    summary(as.factor(NOTES))
    summary(as.numeric(YEAR))
    summary(as.numeric(RAIN))
    summary(as.numeric(RAIN_CM))
![Page 8: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/8.jpg)
![Page 9: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/9.jpg)
Quick Exercise – Run these in R

    # anything after a # sign on a line is just a COMMENT - it won't do anything
    varA <- 10            # sets up a vector with one element containing a 10
    varA                  # listing an object's name prints out the values
    varB <- c(10,20,30)   # sets up a vector with 3 elements. c() is the concatenation function
    varB
    varB[2]               # now let's display ONLY the second element
    # now let's do some math!
    mySumAB <- varA + varB   # adding them together.
    # Note there is only 1 value in varA
    mySumAB               # note the single value in varA repeated in the addition
![Page 10: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/10.jpg)
R Data Structures
A lot of the “magic” in R is because of the object-oriented approach used
R objects contain a lot more than just the data values
A command that does one thing to a scalar (single value) does something else with a vector (a list of values) – all because R functions “understand” the difference!
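A small sketch of that behaviour, using made-up values. The object names below are hypothetical; sqrt() and summary() are standard R functions chosen only to illustrate the point.

```r
scalarValue <- 10
vectorValue <- c(10, 20, 30)
factorValue <- as.factor(c("wet", "dry", "wet"))

# sqrt() is vectorized: one result per element, whatever the length
sqrt(scalarValue)
sqrt(vectorValue)

# summary() is a generic function: it dispatches on the class of its argument
summary(vectorValue)   # quartiles and mean for numbers
summary(factorValue)   # counts per level for a factor
```

class(vectorValue) and class(factorValue) show the information R uses to decide which version of summary() to run.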
![Page 11: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/11.jpg)
Conversions
Conversions are possible between different modes or types of objects using conversion functions
as.numeric(varA) returns varA converted to a number – if it can! (values that cannot be converted become NA)
as.integer( ) as.character( ) as.factor() as.matrix() as.data.frame()
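A brief sketch of what "if it can" means in practice. The values are made up, and suppressWarnings() is used only to keep the listing quiet; without it R prints a warning about NAs introduced by coercion.

```r
varA <- "12.5"
as.numeric(varA)       # the string "12.5" becomes the number 12.5
as.integer(12.9)       # truncates toward zero

# A value that cannot be parsed becomes NA (R also issues a warning)
mixed <- c("10", "E", "12.3")
converted <- suppressWarnings(as.numeric(mixed))
converted
is.na(converted)       # flags which elements failed to convert
```

Counting the NAs with sum(is.na(converted)) is a quick quality-control check on a column that should be entirely numeric.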
![Page 12: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/12.jpg)
Using Data Frames

    A <- c(10,20,30)
    B <- c(4,6,3)
    C <- c('A','B','C')   # put letters in quotes
    Df <- data.frame(C,A,B)
    Df                    # list whole data frame
    Df$A                  # list the A vector
    Df[,3]                # list the 3rd vector (B)
    Df[1,]                # list all columns for row 1
    Df[Df$A > 10,]        # list rows where A>10
![Page 13: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/13.jpg)
Data Frames
Results of Data Frame manipulations
![Page 14: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/14.jpg)
R Help
R has a number of ways of calling up help
??sqrt – does a “fuzzy” search for functions like “sqrt”
?sqrt – does an exact search for the function sqrt() and displays documentation
There are also manuals and extensive on-line tutorials (but Google is frequently the best way to find help)
![Page 15: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/15.jpg)
R & Kepler
Kepler uses the “RExpression Actor” to run R code from inside Kepler
Typically run with an SDF Director with a single iteration: for most analyses you only need them done once!
Don’t forget to set the iteration count – the default is to loop forever!
![Page 16: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/16.jpg)
The default RExpression has no inputs and two outputs: graphicsFileName & output
![Page 17: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/17.jpg)
Typical connections for basic RExpression Actor
![Page 18: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/18.jpg)
Adding Ports
To make RExpression actors really useful, they need to communicate with other Kepler actors beyond simply listing output or showing graphs
To allow this intercommunication we need to add additional Input and Output ports
The names of the ports will automatically be connected to objects with the same name in the R program
![Page 19: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/19.jpg)
![Page 20: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/20.jpg)
![Page 21: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/21.jpg)
Hook up some input and output actors
![Page 22: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/22.jpg)
R Program to Test
Remember – names of ports translate into names of objects in R
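A minimal sketch of what such a test program might look like. The port names myInValue and myOutValue are assumptions (only “myOutValue” is named on the results slide). Inside Kepler the input object would be created from the port, so the first assignment below only stands in for it when the code is run in plain R.

```r
# Stand-in for the value Kepler would supply through an input port
# named myInValue (hypothetical name; not needed inside the actor)
myInValue <- 5

# An object whose name matches an output port is what that port sends on
myOutValue <- myInValue * 2

# Printed text is expected to show up in the R listing output
print(myOutValue)
```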
![Page 23: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/23.jpg)
Results of Running Workflow
R Listing Output
“myOutValue” displayed
![Page 24: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/24.jpg)
![Page 25: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/25.jpg)
R for Checking EML Data
But there are some TRICKS you should know!
![Page 26: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/26.jpg)
Trick 1 – select the right object type for the EML actor
By default the EML actor sends only the FIRST LINE OF DATA to the output ports (“As Field”).
If you want to have an output port represent the data as a VECTOR you need to select “As Column Vector”
If you want to get a Data Frame instead of individual columns, you need to select “As ColumnBased Record”
![Page 27: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/27.jpg)
Setting Data Output Format in EML actor
![Page 28: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/28.jpg)
Trick 2 – Trap R errors
Normally if there is a problem with your R program you get a cryptic message from Kepler
![Page 29: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/29.jpg)
try() and geterrmessage() in R
Runs the “errorplot()”* function and reports any error messages that occur when you run it
* There is no “errorplot()” function in R
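The code on this slide is not in the transcript, so the following is only a sketch of the pattern, assuming the simplest use of try() and geterrmessage(). The errorplot() call is deliberately invalid, as on the slide.

```r
# errorplot() does not exist, so calling it fails; try() catches the failure
result <- try(errorplot(), silent = TRUE)

if (inherits(result, "try-error")) {
  # geterrmessage() returns the text of the most recent error
  print(geterrmessage())
} else {
  print("command ran without errors")
}
```

Because the error is trapped, the script keeps running and the message ends up in the R listing rather than as a failure of the actor.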
![Page 30: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/30.jpg)
Now we get an informative message
![Page 31: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/31.jpg)
Correct the command and see the output
![Page 32: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/32.jpg)
QA/QC – Quality Assurance and Quality Control
Error types
Errors of Commission – data contains wrong values
Errors of Omission – data that should be there is missing
We will mostly be talking today about errors of commission
![Page 33: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/33.jpg)
Porter’s Rule of Data Quality
There is no non-trivial dataset that does not contain some errors
Goal of QA/QC: reduce errors to the maximum possible extent, or at least to the level that they don’t adversely affect the conclusions reached through analysis of the data
![Page 34: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/34.jpg)
QA/QC – Possible Tests
Identification and removal of duplicates
Correct Domain: Numerical Range (e.g., -20 < Temperature < 50), Correct Codes (e.g., HOGI, not HOG1)
Graphs: Time-series plots, Plots between variables
Detections of “spikes” in time series
Customized criteria (e.g., month specific range checks)
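Several of these tests can be sketched in a few lines of R. Everything below is illustrative: the data frame, its column names, the allowed code list and the thresholds (30 units for a spike, 15 for January) are assumptions made up for the example, not values from the workshop data.

```r
# Made-up observations for illustration only
obs <- data.frame(
  MONTH   = c(1, 1, 1, 7, 7, 7),
  SPECIES = c("HOGI", "HOGI", "HOG1", "HOGI", "HOGI", "HOGI"),
  TEMP    = c(2.0, 18.5, 3.1, 24.5, 61.0, 25.2)
)

# Correct codes: list rows whose code is not in the allowed set
validCodes <- c("HOGI")
obs[!(obs$SPECIES %in% validCodes), ]

# Numerical range: list rows that violate -20 < Temperature < 50
obs[(obs$TEMP <= -20 | obs$TEMP >= 50), ]

# Spikes: flag values that jump more than 30 units from the previous reading
# (this flags both the spike itself and the return from it)
jump <- c(0, abs(diff(obs$TEMP)))
obs[jump > 30, ]

# Month-specific range: January (month 1) readings above 15 are suspect
obs[(obs$MONTH == 1 & obs$TEMP > 15), ]
```

Each check lists the offending rows rather than deleting them, so a person can decide whether a flagged value is really an error.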
![Page 35: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/35.jpg)
Exercise – A succession of workflows for QA
Open your Virtual Machine
Open a Web Browser and go to: http://tinyurl.com/7po5ffb
Open the LocalData.zip file
Extract All Files to directory C:\
You should then have a C:\localData directory containing the files for this exercise
![Page 36: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/36.jpg)
1_Ft_Monroe_simple_summary.kar
A dead-simple workflow
![Page 37: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/37.jpg)
![Page 38: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/38.jpg)
Kepler Stuff to Note
Annotations allow you to add titles and other useful instructions to your workflow display
![Page 39: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/39.jpg)
Kepler Stuff to Note
Parameters let you easily show and change values that will be used elsewhere in the workflow
![Page 40: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/40.jpg)
Kepler Parameters
Customize Name lets you set the NAME of the parameter and what should display on the screen
Remember the name – that is how you will refer to the parameter later.
![Page 41: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/41.jpg)
Using a Parameter Value
Add a $ to the front of a parameter in a Kepler settings box to insert the value of the parameter – so the Data File: is c:/localData/ft_monro.csv
![Page 42: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/42.jpg)
Brief Exercise
Experiment with editing connections in this workflow to display different graphs
Then open the 3_ft_monro_badData.kar workflow – it has a corrupted version of this data
![Page 43: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/43.jpg)
R stuff to Note
This workflow uses both a Data Frame (table) and vectors (single columns)
In the dataFrame you can subset lines using: dataFrame[(dataFrame$RAIN < 0), ]
Be sure to put the trailing comma!
dataFrame$RAIN < 0 generates a logical vector of TRUE and FALSE values – one for each line
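The two steps can be seen separately in the sketch below. The data frame here is a made-up three-line stand-in for the workflow's dataFrame, and isNegative is a hypothetical helper name.

```r
dataFrame <- data.frame(YEAR = c(1901, 1902, 1903),
                        RAIN = c(40.1, -9.9, 38.7))   # made-up values

# The comparison yields one TRUE/FALSE per line of the data frame
isNegative <- dataFrame$RAIN < 0
isNegative

# With the trailing comma: keep the rows where the test is TRUE, all columns
dataFrame[isNegative, ]

# Count how many lines fail the check
sum(isNegative)
```

Without the comma, R treats the index as a column selector instead of a row selector, which is not what is wanted here.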
![Page 44: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/44.jpg)
QA/QC in R

    summary(dataFrame)
    print("Here are Duplicated Data Lines")
    dataFrame[duplicated(dataFrame),]
    print("now list out of range checks")
    dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM < 0),]
    dataFrame[(dataFrame$RAIN > 150 | dataFrame$RAIN_CM > 300),]
    print("now list unit conversion errors")
    dataFrame[(abs((dataFrame$RAIN*2.54) - dataFrame$RAIN_CM) > 0.1),]
![Page 45: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/45.jpg)
Examine the workflow on the bad data and change it!
Try setting different values for the range checks
Try different graphs (as you did for the good data)
Try listing all the data that was NOT duplicated (note in R the “not” operator is “!”)
Use R help and Google as needed
![Page 46: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/46.jpg)
R+Kepler vs. R Alone
Given that “R” runs just fine alone, why use Kepler?
Allows use of OTHER Kepler actors, Data Turbine
E.g., EMLData, editors, graphical tools
Allows code to be segmented for easier editing in the future
Reusability – ability to copy and paste parts of Kepler workflows
Use spatial arrangement to help guide the user
Downsides
Complicates debugging
![Page 47: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/47.jpg)
A more complex and general workflow
4_BasicEMLQA
![Page 48: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/48.jpg)
Workflow Steps
Read an EML metadata file
Convert it using an XSLT stylesheet into an R program
Edit the R program to point to the data
Ingest the data into a data frame
Summarize the data
“Tweak” the data to add a date-time vector for time plots and fix some conversion problems and re-summarize the data
Run some plots
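The “Tweak” step of adding a date-time vector might look roughly like the sketch below. The data frame and its YEAR, MONTH, DAY, HOUR and MAX_T columns are hypothetical stand-ins; the actual columns depend on the EML file being processed.

```r
# Made-up table; the column names are assumptions for this example
dataFrame <- data.frame(YEAR = c(2011, 2011), MONTH = c(6, 6),
                        DAY = c(14, 15), HOUR = c(0, 12),
                        MAX_T = c(27.5, 29.1))

# ISOdatetime() combines the pieces into a POSIXct date-time vector
dataFrame$DATETIME <- ISOdatetime(dataFrame$YEAR, dataFrame$MONTH,
                                  dataFrame$DAY, dataFrame$HOUR, 0, 0,
                                  tz = "UTC")

# Re-summarize now that the new vector has been added
summary(dataFrame)
```

With the new column in place, plot(dataFrame$DATETIME, dataFrame$MAX_T, type = "l") would draw a time-series plot with a proper time axis.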
![Page 49: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/49.jpg)
Passing R Workspaces
This workflow, instead of passing data from actor-to-actor, passes the name of the R Workspace
Subsequent actors re-open the R Workspace without needing to ingest the data again
This is very efficient, but this method only works for connecting R actors
![Page 50: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/50.jpg)
R code for passing on R workspaces
Set Port Variable to the name of the workflow
Remember to save the workspace!
Saving workspace for later use
Loading the Saved Workspace
Name of Port connected to WorkingDir port (above)
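The code shown on this slide is not in the transcript. As a sketch of the idea only, the pattern below saves all objects to a file and reloads them later; the workspaceName variable and the file name are assumptions, standing in for the value that would travel between actors through a port.

```r
# Hypothetical workspace file name; in the workflow this would be the
# value passed from actor to actor through a port
workspaceName <- file.path(tempdir(), "qaWorkspace.RData")

# --- first actor: ingest the data, then save everything ---
dataFrame <- data.frame(RAIN = c(40.1, 38.7))   # made-up values
# save.image(file = workspaceName) does the same for the whole global workspace
save(list = ls(), file = workspaceName)

# --- later actor: reload instead of reading the data again ---
rm(dataFrame)
load(workspaceName)
summary(dataFrame)
```

Only the file name crosses between actors, which is why this approach is efficient and also why it only works between R actors.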
![Page 51: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/51.jpg)
A conversion problem
Temperature and Humidity values have some severe problems reading in!
What happened?
![Page 52: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/52.jpg)
R Factors
Factors are the way R deals with categorical or nominal data (typically non-numeric data)
Internally Factors are made up of two vectors:
Values – the distinct values stored in the factor – often referred to as “levels”
Indexes – an integer vector, one entry per data element, giving the position of that element’s value in the levels
DANGER – sometimes when you read in data from a file, errors or odd characteristics of the data will cause R to read a column of (mostly) numbers as a Factor instead of as a numeric vector!
![Page 53: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/53.jpg)
Factors
This is the mean of the INDEXES, not the VALUES/Levels
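The trap and one common fix can be reproduced with a few made-up values, as in the sketch below. The “n/a” entry stands in for whatever stray text caused the column to be read as a factor. In R 4.0.0 and later, read.csv() reads such a column as character rather than factor by default, but the same as.numeric() conversion still applies.

```r
# A numeric column polluted by one stray text value (made-up data)
rawValues <- c("21.5", "22.0", "n/a", "23.5")
tempFactor <- as.factor(rawValues)

levels(tempFactor)        # the distinct values, stored as text
as.integer(tempFactor)    # the indexes into those levels

# WRONG: converts the indexes, not the values
mean(as.numeric(tempFactor))

# RIGHT: go through character first; unparseable entries become NA
tempNumeric <- suppressWarnings(as.numeric(as.character(tempFactor)))
mean(tempNumeric, na.rm = TRUE)
```

The first mean is an average of level positions and has no physical meaning; the second is the average of the three real temperature readings.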
![Page 54: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/54.jpg)
After conversion data ranges are much better!
But Max_T is still suspicious!
![Page 55: Using the “R” Actor in Kepler for quality control](https://reader035.vdocuments.site/reader035/viewer/2022062410/56815e78550346895dccfacd/html5/thumbnails/55.jpg)
Your Final Challenge
As its name suggests, this data file has some corrupted data (plus the normal errors)
Edit the “Tweaks” actor to add additional checks or add additional plots to identify the problems with the data
If you don’t cause Kepler to abort the workflow due to errors at least once, you aren’t trying hard enough! So make additions in a change-test-repeat cycle