STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology
Yang Li, Lin Liu
Jan 29, 2014
• Unix part slides courtesy: John Brunelle
• You can check out more details at: https://software.rc.fas.harvard.edu/training/intro_unix/latest/#(1)
Sign up on Odyssey
• Very simple: go to http://rc.fas.harvard.edu/, click Account and Access Request Forms (in the Quick Links section at the top right of the website), then click RC Account form and fill it in as shown below. We will take care of the rest!
Basic Unix Command
• Log in:
  • ssh
Basic Unix Command
• Upload or download files:
  Upload:
    scp dir/yourfilename username@host:dir/targetfilename
  Download:
    scp username@host:dir/yourfilename dir/targetfilename
CaSe SeNsItIvE
• In shell commands, abc will be different from ABC
Terminology and notation
• Folders are usually referred to as directories
• Locations in the filesystem, like /n/home00/cfest350, are called paths
• The directory and file names that make up a path are always separated by forward slashes
• The top of the hierarchy is /, i.e. the root directory
Navigating the system: ls
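A minimal sketch of moving around the filesystem with pwd, cd and ls. The directory and file names below (project, data, notes.txt) are made up for illustration and are created in a temporary directory, so nothing in your home directory is touched.

```shell
# Build a small throwaway directory tree (names are hypothetical).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/project/data"
touch "$demo_dir/project/notes.txt"

cd "$demo_dir/project"        # change into a directory
pwd                           # print the current working directory
ls                            # list its contents
ls -l                         # long listing: mode bits, owner, size, date
listing=$(ls)                 # keep the plain listing in a variable

cd ..                         # move up one level
cd "$demo_dir/project/data"   # an absolute path works from anywhere
here=$(pwd)
echo "$here"
```

Relative paths (like ..) are resolved from the current directory, while absolute paths start at the root directory /.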
Download and unzip files
• wget https://software.rc.fas.harvard.edu/training/examples.tar.gz
• tar xvf examples.tar.gz
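To see what tar does without needing the network, the sketch below creates a small .tar.gz archive locally, lists it, and extracts it again. The directory and file names are hypothetical stand-ins for the course examples.

```shell
# Work in a throwaway directory (all names here are made up).
work=$(mktemp -d)
cd "$work"
mkdir examples
echo "hello" > examples/aaa

tar czf examples.tar.gz examples   # c = create, z = gzip, f = archive file name
rm -r examples                     # remove the original so extraction is visible

tar tzf examples.tar.gz            # t = list the archive contents
tar xzf examples.tar.gz            # x = extract (add v to print each file name)
restored=$(cat examples/aaa)
echo "$restored"
```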
What if you get confused
• man ls
  – Use the arrow keys, page up/down keys, or SPACE to navigate
  – To search for a phrase of text, for example the word time, type /time and hit ENTER
    • Hit n to go to the next occurrence
    • Hit N to go to the previous occurrence
    • Hit q to quit
Kill process
• top
• kill
• killall
• Ctrl-c
• Exercise: Run the command ~/examples/bin/ticktock, and kill it once you've had enough
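As a rough illustration of kill, the sketch below uses sleep as a stand-in for a long-running program (it is not the ticktock exercise itself). Ctrl-c does the same job for a program running in the foreground of your terminal.

```shell
sleep 300 &                        # & starts the command in the background
pid=$!                             # $! is the process ID of that background job
echo "started process $pid"

kill "$pid"                        # send the default TERM signal
wait "$pid" 2>/dev/null || true    # collect the terminated job

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    state="running"
else
    state="stopped"
fi
echo "$state"
```

If a process ignores the default signal, kill -9 sends KILL, which cannot be caught; it is usually kept as a last resort.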
Copy files
• mkdir workshop
• cd workshop
• cp ~/examples/aaa .
• cp ~/examples/bbb ~/examples/ccc .
• cp aaa zzz
• rsync: replacement for cp, but can be used to copy files to/from remote computers
  – e.g. rsync -avz --progress mywork username@hostname:~/mywork
Moving and removing
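A short sketch of mv and rm, assuming a scratch directory with made-up file names. Note that rm does not use a trash folder: removed files are gone.

```shell
# Throwaway working directory (file names are hypothetical).
work=$(mktemp -d)
cd "$work"
echo "draft" > report.txt
mkdir archive scratch
touch scratch/tmp1 scratch/tmp2

mv report.txt final.txt    # rename a file
mv final.txt archive/      # move it into a directory
rm scratch/tmp1            # remove a single file
rm -r scratch              # remove a directory and everything in it

remaining=$(ls)
moved=$(cat archive/final.txt)
echo "$remaining"
```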
File permissions
• The -rw-r--r-- string displays the file mode bits
  – The first character is the type (- for files, d for directories, and other letters (b, c, l, s, etc.) for special files)
  – Following that are three groups of three characters, for read, write, and execute permissions for user, group, and others
• r = 4, w = 2, x = 1, rwx = 7
• chmod 755
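The octal arithmetic can be checked directly: each digit is the sum of r = 4, w = 2 and x = 1 for user, group and others in turn. The sketch below applies two modes to a hypothetical script file and reads the mode string back from ls -l.

```shell
work=$(mktemp -d)
script="$work/hello.sh"            # hypothetical file name
printf '#!/bin/sh\necho hello\n' > "$script"

chmod 644 "$script"                # 6 = rw- (4+2), 4 = r--, 4 = r--
before=$(ls -l "$script" | cut -c1-10)

chmod 755 "$script"                # 7 = rwx (4+2+1), 5 = r-x (4+1), 5 = r-x
after=$(ls -l "$script" | cut -c1-10)

echo "$before -> $after"
```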
Hidden files
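Files whose names start with a dot are hidden from a plain ls. A minimal sketch, using made-up file names in a temporary directory:

```shell
work=$(mktemp -d)
cd "$work"
touch visible.txt .hidden_config   # hypothetical file names

plain=$(ls)       # dotfiles are not listed
all=$(ls -A)      # -A includes dotfiles but leaves out . and ..
ls -a             # -a also shows . (this directory) and .. (its parent)

echo "$plain"
echo "$all"
```

Configuration files in your home directory, such as .bashrc, are typically hidden this way.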
File manipulation
• cat ~/examples/gpl-3.0.txt
• less ~/examples/gpl-3.0.txt
• File editors: vim/emacs/nano
More shell commands
Piping your commands
• cat ~/examples/answers.out | awk '{print $3}'
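A pipe sends the output of one command to the input of the next. The sketch below assumes a small whitespace-separated file whose contents are invented for illustration (the real answers.out may look different), and chains several commands together.

```shell
work=$(mktemp -d)
# Hypothetical three-column data: question, name, answer.
printf 'q1 alice 42\nq2 bob 17\nq3 carol 42\n' > "$work/answers.txt"

# Same pattern as on the slide: print only the third column.
third=$(cat "$work/answers.txt" | awk '{print $3}')
echo "$third"

# Longer chain: third column -> sort -> drop duplicates -> count lines.
distinct=$(awk '{print $3}' "$work/answers.txt" | sort | uniq | wc -l | tr -d ' ')
echo "$distinct"
```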
Exercises
• List the last 5 files in /bin by combining the ls and tail commands with a pipe
• Count the number of lines that contain the word free in ~/examples/gpl-3.0.txt
The shell environment
• echo $PATH
• Change $PATH:
  • PATH=$PATH\:/dir/path ; export PATH
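PATH is a colon-separated list of directories that the shell searches for commands. A small sketch of appending a directory, where the bin directory is a hypothetical one created just for the example:

```shell
work=$(mktemp -d)
mkdir "$work/bin"                  # hypothetical directory for your own tools

PATH=$PATH:$work/bin               # append; the existing entries keep priority
export PATH                        # make the change visible to child processes

# Print one directory per line and keep the last entry.
last=$(echo "$PATH" | tr ':' '\n' | tail -n 1)
echo "$last"
```

The change only lasts for the current shell session; putting the same two lines in a startup file such as ~/.bashrc makes it apply to every new session.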
Submit a job
• bsub < yourscript.bsub
• yourscript.bsub:
#!/bin/sh
#BSUB -u linliu@harvard
#BSUB -J hello_world
#BSUB -o hello_world.out
#BSUB -e hello_world.err
#BSUB -q short_serial
python hello_world.py > hello_world.out
Load modules
• module load dir/software
  – http://oldrcwebsite.rc.fas.harvard.edu/faq/modulelist
Final tips
• Google is extremely helpful if you want to write some shell scripts
Getting started
Where the scripts/commands are executed
Where plots/help displayed, and packages installed.
Where the CODE is scripted
Show the variables/functions in memory
Workspace Management
• Before jumping into R, it is important to ask ourselves:
  – Where am I?
    > getwd()
  – I want to be there…
    > setwd("C://")
  – Who am I with?
    > dir() # lists all the files in the working directory
  – Who can I count on?
    > ls() # lists all the variables in the current session
Workspace Management (2)
Saving
> save(x, file="name.RData") # saves specific objects
> save.image("name.RData") # saves the whole workspace
Loading
> load("name.RData")
'?function' and '??function'
> ?function # get the documentation of the function
> ??function # find functions related to the query
R Objects
• Almost all things in R are OBJECTS!
  – Functions, datasets, results, etc… (graphs NO)
• OBJECTS are classified by two criteria
  – MODE: How objects are stored in R
    • Character, numeric, logical, list, function…
    • To obtain the mode of an object
      > mode(object)
  – CLASS: How objects are treated by functions
    • Vector, matrix, array, data.frame, factor,…
    • To obtain the class of an object
      > class(object)
R classes
character
> assembly = "hg19"
> assembly
> class(assembly)
numeric
> expression = 3.456
> expression
> class(expression)
integer
> nbases = 3000000L # no quotes; 3000000000L would exceed R's integer range (max 2147483647)
> nbases
> class(nbases)
logical
> completed = FALSE
> completed
> class(completed)
R classes - Vector
vector
> x=c(10,5,3,6); x[3:4]; x[1]
Computations on a vector are performed on each entry of the vector
> y=c(log(x),x,x^2)
Vectors in an operation do not need to have the same length!
> w=sqrt(x)+2
> z=c(pi,exp(1),sqrt(2))
> x+z # z is recycled to the length of x (R warns because 4 is not a multiple of 3)
– Logical vectors
> aux=x<7
R Classes - List
list
A vector of values of possibly different classes and different lengths.
Creating it.
> x1 = 1:5
> x2 = c(T,T,F,T,F)
> y=list(question.number = x1, question.answer = x2)
Accessing it.
> y; class(y)
> y$question.answer[3]; y[[2]][3]; y[["question.answer"]][3]
> y$question.number[which(y$question.answer == T)]
R classes - Matrix
matrix
> x=1:8
> dim(x)=c(2,4)
> y=matrix(1:8,2,4,byrow=F)
Operations are applied on each element
> x*x; max(x)
> x=matrix(1:28,ncol=4); y=7:10 # so then x*y is…?
> y=matrix(1:8,ncol=2)
> y%*%t(y)
R classes - Matrix
matrix
Extracting info
> y[1,]; y[,1]
Extending matrices
> cbind(y,seq(101,104))
> rbind(y,c(102,109))
Apply is a useful function!
> apply(y,2,mean)
> apply(y,1,log)
R classes – Data Frame
data.frame
Creating it.
> policy.number = c("A00187", "A00300", "A00467", "A01226")
> issue.age = c(74,30,68,74)
> sex=c("F", "M", "M", "F")
> smoke=c("S","N","N","N")
> face.amount = c(420, 1560, 960, 1190)
> ins.df = data.frame(policy.number, issue.age, sex, smoke, face.amount)
Accessing it.
> ins.df[1,]; ins.df[,1] # access first row, access first column
> ins.df$policy.number # access policy number column
> rownames(ins.df); colnames(ins.df);
> index.smokers = which(ins.df$smoke == "S") # row index of smokers
> ins.df[index.smokers,] # access all smokers in the df
> ins.df$policy.number[index.smokers] # policy number for smokers
R classes – Data Frame
data.frame
Manipulating it.
> ins.df = rbind(ins.df, list("A01495", 62, "M", "N", 1330)) # list() keeps numeric columns numeric; c() would turn them into character
> sort.age = sort(ins.df$issue.age, index.return=TRUE)
> ins.df = ins.df[sort.age$ix,]
> ins.df$visits = c(0,4,2,1,1)
> drops = c("sex","visits")
> ins.df[,!(names(ins.df) %in% drops)]
> ins.df[,"visits"] = c(0,4,2,1,1)
> carins.df = data.frame(policy.number = c("A01495","A00232","A00187"), car.accident = c("Y","N","N"))
> ins.merged.df = merge(ins.df, carins.df, by = "policy.number")
> Etc…
R Classes - Factor
factor
Qualitative variables that can be included in models.
> smoke = c("yes","no","yes","no")
> smoke.factor = as.factor(smoke)
> smoke.factor
> class(smoke)
> class(smoke.factor)
Loops and Conditional Statements
if
Example
> a=9
> if(a<0){ print("Negative number") } else{ print("Non-negative number") }
Loops and Conditional Statements
• for
> z=rep(1,10)
> for (i in 2:10) { z[i]=z[i]+exp(1)*z[i-1] }
• while
> n=0
> tmp=0
> while(tmp<100) { tmp=tmp+rbinom(1,10,0.5); n=n+1 }
Functions!
• My own functions
> function.name=function(arg1,arg2,…,argN) { Body of the function }
> fun.plot=function(y,z){ y=log(y)*z-z^3+z^2; plot(z,y) }
> z=seq(-11,10)
> y=seq(11,32)
> fun.plot(y,z)
Functions! (2)
• The '…' argument
  – Can be used to pass arguments from one function to another
    • Without the need to specify arguments in the header
fun.plot=function(y,z,...) { y=log(y)*z-z^3+z^2; plot(z,y,...) }
fun.plot(y,z,type="l",col="red")
fun.plot(y,z,type="l",col="red",lwd=4)
Handling data I/O
Reading from files to a data frame
> read.csv("filename.csv") # reads csv files into a data.frame
> read.table("filename.txt") # reads txt files in a table format to a data.frame
Writing from a data frame to a file
> write(x,filename) # writes the object x to filename
> write.table(x,filename) # writes the object x to filename in a table format
Note: keep in mind additional options such as header = TRUE, row.names = TRUE, col.names = TRUE, quote = TRUE, etc.
Plotting!
> x.data=rnorm(1000)
> y.data=x.data^3-10*x.data^2
> z.data=-0.5*y.data-90
> plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
> points(x.data,z.data,col="red")
> legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
Plotting! (2)
You can export graphs in many formats
– To check the formats that are available in your R installation
> capabilities()
png
> png("Lab2_plot.png",width=520,height=440)
> plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
> points(x.data,z.data,col="red")
> legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
> dev.off()
eps
> postscript("Lab2_plot.eps",width=7,height=6) # postscript() sizes are in inches, not pixels
Simulation
Sampling
> sample(x,replace=TRUE) # put it back into the bag!
Distributions
Libraries!!
Collection of R functions that together perform a specialized analysis.
Install packages from CRAN
> install.packages("PackageName")
Loading libraries
> library(LibraryName)
Getting the documentation of a library
> library(help=LibraryName)
Listing all the available packages
> library()
Libraries!!
• www.bioconductor.org
  – A suite of R packages for Bioinformatics.
  – To use only Core packages
    > source("http://bioconductor.org/biocLite.R")
    > biocLite()
  – To use Core and Other packages
    > source("http://bioconductor.org/biocLite.R")
    > biocLite(c("pkg1", "pkg2",…,"pkgN"))
Exercise 1 – The empire strikes back: GOOG versus BAIDU
Plot historical stock price time series using prices from Yahoo Finance.
(a) Download and install tseries package.
(b) Include tseries package as a library in your code.
(c) Use get.hist.quote to download GOOG and BAIDU historical data.
(d) Plot both time series in the same panel and add a legend to the plot.
Exercise 2 – Challenging Challenger
On January 28, 1986, the Space Shuttle Challenger exploded in the early stages of its flight. Feynman, along with a committee, determined that the explosion was due to low temperatures and the failure of the O-ring seals on the booster rockets. The ambient temperature was 36 degrees on the morning of the launch. The scientists had data (temperature, number of failures) from previous flights.
Exercise 2 – Challenging Challenger
(a) Plot the number of failures versus the temperature for flights with one or more O-ring failures. Is there any evidence that temperature affects O-ring performance?
(b) Plot the number of failures versus temperature for all the flights. Is there any evidence that temperature affects O-ring performance?
(c) What's your conclusion? What do you think the scientists plotted before making the decision to fly that day? As a historical curiosity: who played a central role in discovering the causes of the failure, and how did he announce it?