2015 11-17-programming inr.key

Post on 25-Jan-2017


Programming in R (and some other stuff)

y.wurm@qmul.ac.uk
https://wurmlab.github.io

© Alex Wild & others

© National Geographic

Atta leaf-cutter ants


Oecophylla Weaver ants

© ameisenforum.de


© forestryimages.org
© wynnie@flickr

Tofilski et al 2008

Forelius pusillus hides the nest entrance at night

Before

Workers staying outside die: "preventive self-sacrifice"

Dorylus driver ants: ants with no home

© BBC

Animal biomass (Brazilian rainforest)

from Fittkau & Klinge 1973

Pie chart segments: ants & termites; other insects; spiders; earthworms; soil fauna excluding earthworms, ants & termites; amphibians; reptiles; birds; mammals

We use modern technologies to understand insect societies.
• evolution of social behaviour
• molecules involved in social behaviour
• consequences of environmental change

Big data is invading biology

This changes everything.

Any lab can sequence anything!

http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html

BIG DATA

Big data is invading biology

• Genomics

• Cancer genomics

• Biodiversity assessments

• Stool microbiome sequencing

• Personalized medicine

• Sensor networks - e.g. tracking microclimates, recording sounds

• Huge medical studies

• Aerial surveys (Drones) - e.g. crop productivity; rainforest cover

• Camera traps

Learning to deal with big data takes time

Practicals

• Aim: get relevant data handling skills

• Doing things by hand:
  • impossible?
  • slow,
  • error-prone,

• Automate!

• Basic programming
  • in R
  • no stats!

Why R?😳😟

😴😡😖😥

Practicals: contents

• Done:
  • data accessing/subsetting

• New:
  • search/replace
  • regular expressions

• New:
  • functions
  • loops

• Friday: (Introduction to Unix & High performance computing)

Text search on steroids

Reusable pieces of work

Repeating the same thing many times

• create a variable that contains the number 35

• create a variable that contains the string “I love tofu”

• give me a vector containing the sequence of numbers from 5 to 11

• access the second number

• replace the second number with 42

• add 5 to the second number

• now add 5 to all numbers

• now add an extra number: 1999

• can you sum all the numbers?
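
The warm-up tasks above can be worked through as in the following sketch. The variable names (my_number, my_string, numbers, total) are placeholders chosen for illustration, not names from the slides.

```r
# A variable containing the number 35, and one containing a string
my_number <- 35
my_string <- "I love tofu"

# A vector containing the sequence of numbers from 5 to 11
numbers <- 5:11

# Access the second number
second <- numbers[2]

# Replace the second number with 42
numbers[2] <- 42

# Add 5 to the second number only
numbers[2] <- numbers[2] + 5

# Add 5 to all numbers (arithmetic is applied element by element)
numbers <- numbers + 5

# Append an extra number
numbers <- c(numbers, 1999)

# Sum all the numbers
total <- sum(numbers)
print(numbers)
print(total)
```

Note that arithmetic on a vector applies to every element at once, so no loop is needed for "add 5 to all numbers".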

• creating a vector

> my_vector <- c(5, 6, 7, 8, 9, 10, 11)
> my_vector <- 5:11
> my_vector <- seq(from=5, to=11, by=1)
> my_vector
[1]  5  6  7  8  9 10 11
> length(my_vector)
[1] 7
> (10 > 30)
[1] FALSE
> my_vector > 8
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
> my_vector[my_vector > 8]
[1]  9 10 11
> other_vector <- my_vector[my_vector > 8]
> other_vector
[1]  9 10 11
> other_vector + 3

• give me a vector containing numbers from 5 to 11 (3 variants)

• accessing a subset of a vector

> big_vector <- 150:100
> big_vector
 [1] 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132
[20] 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113
[39] 112 111 110 109 108 107 106 105 104 103 102 101 100
> big_vector[5]
[1] 146
> mysubset <- big_vector[my_vector]
> mysubset
[1] 146 145 144 143 142 141 140
> big_vector > 130
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE
> subset(x = big_vector, subset = big_vector > 140)
 [1] 150 149 148 147 146 145 144 143 142 141
> big_vector[big_vector >= 140]
 [1] 150 149 148 147 146 145 144 143 142 141 140

> my_vector
[1]  5  6  7  8  9 10 11

Regular expressions (regex): Text search on steroids.

who dat?

Regular expressions (regex): Text search on steroids.

Regular expression                          Finds
David                                       David
Dav(e|(id))                                 David, Dave
Dav(e|(id)|(ide)|o)                         David, Dave, Davide, Davo
At{1,2}enborough                            Attenborough, Atenborough
Atte[nm]borough                             Attenborough, Attemborough
At{1,2}[ei][nm]bo{0,1}ro((ugh)|w){0,1}      Atimbro, attenbrough, ateinborow

Easy counting, replacing all with “Sir David Attenborough”
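
Counting and replacing could look like the sketch below. The vector mentions is made-up example data, and the pattern combines two rows of the table above for illustration; it is not taken verbatim from the slides.

```r
# Hypothetical spellings as they might appear in a text (made-up data)
mentions <- c("Attenborough", "Atenborough", "Attemborough", "Darwin")

# "A", one or two "t", "e", then "n" or "m", then "borough"
pattern <- "At{1,2}e[nm]borough"

# Counting: grepl() returns TRUE/FALSE per element, sum() counts the TRUEs
is_match <- grepl(pattern, mentions)
n_matches <- sum(is_match)

# Replacing: sub() swaps the matched text for the canonical form
cleaned <- sub(pattern, "Sir David Attenborough", mentions)

print(n_matches)
print(cleaned)
```

Elements that do not match the pattern are returned unchanged by sub().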

yes: "HATSOMIKTIP"
yes: "HAVSONYYIKTIP"
not: "HAVSQMIKTIP"

Regex special symbols

Regular expression    Finds                                    Example
[aeiou]               any single vowel                         "e"
[aeiou]*              between 0 and infinitely many vowels     "eeooouuu"
[aeoiu]{1,3}          between 1 and 3 vowels                   "oui"
a|i                   one of the 2 characters                  "a"
((win)|(fail))        one of the two words in ()               "fail"


More Regex Special symbols

• Google “Regular expression cheat sheet”
• ?regexp

Symbol       Synonymous with
[:digit:]    [0-9] (used inside brackets, i.e. [[:digit:]])
[A-z]        roughly [A-Za-z] (in ASCII order the range A-z also covers [ \ ] ^ _ `)
\s           whitespace
.            any single character
.+           one to many of anything
b*           between 0 and infinitely many of the letter ‘b’
[^abc]       any character other than a, b or c
\(           a literal (
[:punct:]    any of these: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
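
A few of these symbols can be tried directly in R, as in the sketch below. The input strings are made up for illustration. Two details are worth noting: inside an R string a backslash must itself be escaped (so \s is typed "\\s"), and character classes such as [:digit:] go inside an extra pair of brackets.

```r
# [[:digit:]] : does the string contain a digit?
has_digit <- grepl("[[:digit:]]", c("ant42", "ant"))

# \s : replace whitespace with an underscore
no_space <- gsub("\\s", "_", "navy blue")

# [^abc] : remove every character other than a, b or c
only_abc <- gsub("[^abc]", "", "cabbage")

# \( : match a literal opening parenthesis
has_paren <- grepl("\\(", c("plot()", "plot"))

# .+ : one or more of anything (here the whole string is replaced)
all_replaced <- sub(".+", "X", "abc")

print(has_digit)
print(no_space)
print(only_abc)
print(has_paren)
print(all_replaced)
```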

You want to scan a protein sequence database for a particular binding site. Type a single regular expression that will match the first two of the following peptide sequences, but NOT the last one:

"HATSOMIKTIP"
"HAVSONYYIKTIP"
"HAVSQMIKTIP"

(rubular)
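
One possible answer to the peptide exercise is sketched below. Many other patterns would work equally well; this one simply assumes the binding site is "HA", then T or V, then "SO", then one or more capital letters, then "IKTIP".

```r
peptides <- c("HATSOMIKTIP", "HAVSONYYIKTIP", "HAVSQMIKTIP")

# [TV] allows either residue at position 3; "SO" excludes the "SQ" variant;
# [A-Z]+ absorbs the variable stretch before the conserved "IKTIP" ending
pattern <- "HA[TV]SO[A-Z]+IKTIP"

is_match <- grepl(pattern, peptides)
print(is_match)
```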

Variants of a microsatellite sequence are responsible for differential expression of vasopressin receptor, and in turn for differences in social behaviour in voles & others. Create a regular expression that finds AGAGAGAGAGAGAGAG dinucleotide microsatellite repeats with lengths of 5 to 500
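
A possible sketch for the microsatellite exercise follows. It assumes that "length" means the number of AG repeat units, and the two test sequences are made up. perl = TRUE is used as a precaution, since R's default regex engine may reject large repetition counts such as 500.

```r
# Made-up sequences: 3 repeats (too short) and 8 repeats (long enough)
short_seq <- paste0("TT", strrep("AG", 3), "TT")
long_seq  <- paste0("TT", strrep("AG", 8), "TT")
sequences <- c(short_seq, long_seq)

# The group (AG) repeated between 5 and 500 times
pattern <- "(AG){5,500}"

has_repeat <- grepl(pattern, sequences, perl = TRUE)

# Extract the matched stretch from the sequences that contain one
hits <- regmatches(sequences, regexpr(pattern, sequences, perl = TRUE))

print(has_repeat)
print(nchar(hits))
```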

Again

Make a regular expression

• matching “LMTSOMIKTIP” and “LMVSONYYIKTIP” but not “LMVSQMIKTIP”

• matching all variants of “ok” (e.g., “O.K.”, “Okay”…)
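
For the "ok" exercise, one possible pattern is sketched below. It covers only the spellings listed in candidates, which are illustrative; "all variants" is open-ended, so a real answer may need more alternatives.

```r
# Illustrative spellings, plus two words that should not match
candidates <- c("ok", "OK", "O.K.", "Okay", "okay", "oak", "okra")

# ^ and $ anchor the whole string; \\.? is an optional literal dot;
# (\\.|ay)? is an optional trailing dot or "ay"
pattern <- "^[Oo]\\.?[Kk](\\.|ay)?$"

is_ok <- grepl(pattern, candidates)
print(candidates[is_ok])
```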

Ok… so how do we use this?

• ?grep

• ?gsub
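
As a quick orientation to these help pages, the sketch below contrasts grep(), grepl(), sub() and gsub(). The genus names are the ones shown in the ant data elsewhere in these slides, typed in by hand so the example does not need a download.

```r
genera <- c("Anergates", "Camponotus", "Crematogaster", "Formica")

# grep() returns the positions of matching elements
positions <- grep("^C", genera)

# value = TRUE returns the matching elements themselves
starts_with_c <- grep("^C", genera, value = TRUE)

# grepl() returns one TRUE/FALSE per element
flags <- grepl("^C", genera)

# sub() replaces only the first match in each element, gsub() replaces all
first_only <- sub("a", "A", genera)
every_one  <- gsub("a", "A", genera)

print(positions)
print(starts_with_c)
print(flags)
print(first_only)
print(every_one)
```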

Which species names include ‘y’?

Create a vector with only species names, but replace all ‘y’ with ‘Y’!

ants <- read.table("https://goo.gl/3Ek1dL")
colnames(ants) <- c("genus", "species")

Remove all vowels

Replace all vowels with ‘o’
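
These search/replace exercises might be approached as in the sketch below. To stay self-contained it uses a short hand-typed vector instead of the downloaded table: the first three species names appear in the ant data shown in these slides, while "polyctena" is added here purely so that one name contains a ‘y’.

```r
# Stand-in for ants$species (last entry added for illustration)
species <- c("atratulus", "scutellaris", "aquilonia", "polyctena")

# Which species names include 'y'?
with_y <- grep("y", species, value = TRUE)

# Replace all 'y' with 'Y'
shouted <- gsub("y", "Y", species)

# Remove all vowels
no_vowels <- gsub("[aeiou]", "", species)

# Replace all vowels with 'o'
all_o <- gsub("[aeiou]", "o", species)

print(with_y)
print(shouted)
print(no_vowels)
print(all_o)
```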

Functions

Functions

• R has many. e.g.: plot(), t.test()

• Making your own:

tree_age_estimate <- function(diameter, species) {
  growth_rate <- growth_rates[ species ]
  age_estimate <- diameter / growth_rate
  return(age_estimate)
}

> tree_age_estimate(25, "White Oak")
+ 66
> tree_age_estimate(60, "Carya ovata")
+ 190
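
The function above relies on a growth_rates object that the slides do not show. A self-contained sketch, assuming growth_rates is a named numeric vector, might look like this; the two rate values are hypothetical placeholders, not real forestry data.

```r
# Hypothetical growth rates (diameter units per year), indexed by species name
growth_rates <- c("White Oak" = 0.38, "Carya ovata" = 0.32)

tree_age_estimate <- function(diameter, species) {
  # Look up the growth rate for this species by name
  growth_rate <- growth_rates[species]
  # Age is diameter divided by yearly growth
  age_estimate <- diameter / growth_rate
  return(age_estimate)
}

oak_age <- tree_age_estimate(25, "White Oak")
print(round(oak_age))
```

Because growth_rates is looked up by name, the returned value keeps the species name as a label; unname() removes it if a bare number is wanted.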

Make a function

• That converts Fahrenheit to Celsius (subtract 32, then divide the result by 1.8)
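
One way to write this conversion is sketched below; the function name fahrenheit_to_celsius is a placeholder.

```r
fahrenheit_to_celsius <- function(fahrenheit) {
  # Subtract 32, then divide the result by 1.8
  celsius <- (fahrenheit - 32) / 1.8
  return(celsius)
}

# Works on a single value or on a whole vector at once
freezing_and_boiling <- fahrenheit_to_celsius(c(32, 212))
print(freezing_and_boiling)
```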

Loops

“for” Loop

> possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue', 'steel blue', 'royal blue', 'slate blue', 'light blue', 'dark blue', 'prussian blue', 'indigo', 'baby blue', 'electric blue')

> possible_colours
 [1] "blue"          "cyan"          "sky-blue"      "navy blue"
 [5] "steel blue"    "royal blue"    "slate blue"    "light blue"
 [9] "dark blue"     "prussian blue" "indigo"        "baby blue"
[13] "electric blue"

> for (colour in possible_colours) {
+   print(paste("The sky is so, oh so", colour))
+ }
[1] "The sky is so, oh so blue"
[1] "The sky is so, oh so cyan"
[1] "The sky is so, oh so sky-blue"
[1] "The sky is so, oh so navy blue"
[1] "The sky is so, oh so steel blue"
[1] "The sky is so, oh so royal blue"
[1] "The sky is so, oh so slate blue"
[1] "The sky is so, oh so light blue"
[1] "The sky is so, oh so dark blue"
[1] "The sky is so, oh so prussian blue"
[1] "The sky is so, oh so indigo"
[1] "The sky is so, oh so baby blue"
[1] "The sky is so, oh so electric blue"

What does this loop do?

for (index in 10:1) {
  print(paste(index, "mins before lunch"))
}
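
To inspect what such a countdown loop produces, a variant can collect the text in a vector instead of printing it. This is only a sketch; the name countdown is a placeholder.

```r
# Start with an empty character vector and grow it on each pass
countdown <- character(0)

for (index in 10:1) {
  # 10:1 counts downwards, so index takes the values 10, 9, ..., 1
  countdown <- c(countdown, paste(index, "mins before lunch"))
}

print(countdown)
```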

Again

• What does the following code do (decompose on pen and paper)

for (letter in LETTERS) {
  begins_with <- paste("^", letter, sep="")
  matches <- grep(pattern = begins_with, x = ants$genus)
  print(paste(length(matches), "begin with", letter))
}

> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> ants <- read.table("https://goo.gl/3Ek1dL")
> colnames(ants) <- c("genus", "species")
> head(ants)
          genus     species
1     Anergates   atratulus
2    Camponotus         sp.
3 Crematogaster scutellaris
4       Formica   aquilonia
5       Formica cunicularia
6       Formica     exsecta
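
The same loop can be tried without the download by using the six genus names from head(ants) as a stand-in for ants$genus. In this sketch the counts are stored in a named vector rather than printed, and only four letters are tested to keep the output short; both choices are for illustration.

```r
# First six genus names from the table above, typed in by hand
genus <- c("Anergates", "Camponotus", "Crematogaster",
           "Formica", "Formica", "Formica")

counts <- integer(0)

for (letter in c("A", "C", "F", "Z")) {
  # Build a pattern such as "^A": "^" anchors the match to the start
  begins_with <- paste("^", letter, sep = "")
  # grep() returns the positions of the genus names that match
  matches <- grep(pattern = begins_with, x = genus)
  # Store how many names begin with this letter
  counts[letter] <- length(matches)
}

print(counts)
```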

What does this loop do?

Jasmin Zohren, Bruno Vieira, Rodrigo Pracana, James Wright

Programming in R?

If/else

Logical Operators

going further
