lecture-2 - pennsylvania state university · 2012-08-30 ·...

2012 - BMMB 597D: Analyzing Next Generation Sequencing Data. Week 1, Lecture 2. István Albert, Biochemistry and Molecular Biology and Bioinformatics Consulting Center, Penn State

Post on 12-Jul-2020


Page 1: lecture-2 - Pennsylvania State University · 2012-08-30

2012 - BMMB 597D: Analyzing Next Generation Sequencing Data

Week 1, Lecture 2

István Albert

Biochemistry and Molecular Biology and Bioinformatics Consulting Center

Penn State

Page 2:

Get a good text editor

Desired features: syntax highlighting, line numbering, ability to view white space

•  Komodo Edit
•  Sublime Text
•  TextMate

There are many other options.

Page 3:

Download the data for the lecture

The URL sent out via email (also on the course webpage):

http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff

Page 4:

Biological file formats

Each file format represents

1.   Information – types of knowledge that are stored in the file

2.   Optimization – types of operations that are easy/efficient to perform

The above implies that some information may not be present or cannot be easily extracted from a certain file format.

Page 5:

Tabular formats

•  Many common bioinformatics data formats are column based and tab-separated

•  First format we deal with will be the

GFF3 – Generic Feature Format

(search for GFF3 to see the specification for version 3)

http://www.sequenceontology.org/gff3.shtml

Page 6:

The GFF3 specification

Page 7:

GFF format

Search for GFF3 -> http://www.sequenceontology.org/gff3.shtml

Tab separated with 9 columns. Missing attributes may be replaced with a dot -> .

1.   Seqid (usually chromosome)
2.   Source (where is the data coming from)
3.   Type (usually a term from the sequence ontology)
4.   Start (interval start relative to the seqid)
5.   End (interval end relative to the seqid)
6.   Score (the score of the feature, a floating point number)
7.   Strand (+/-/.)
8.   Phase (used to indicate reading frame for coding sequences)
9.   Attributes (semicolon separated attributes -> Name=ABC;ID=1)

Example attribute specification: name=REB1;id=YP33546
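The column layout above can be explored directly on the command line. The following is a minimal sketch, not part of the original slides: the record is made up for illustration (the `demo` source and the coordinates are assumptions), and only the attribute string is taken from the slide. The `tr` command is also an addition to the commands listed for this lecture.

```shell
# Hypothetical one-line GFF record; the nine fields are separated by TAB characters.
printf 'chrI\tdemo\tgene\t100\t200\t.\t+\t.\tname=REB1;id=YP33546\n' > record.gff

# Column 3 holds the feature type.
cut -f 3 record.gff

# Columns 4 and 5 hold the start and end of the interval.
cut -f 4,5 record.gff

# Column 9 holds the attributes; tr places each key=value pair on its own line.
cut -f 9 record.gff | tr ';' '\n'
```

Because `cut` splits on TAB by default, no delimiter option is needed for tab-separated formats such as GFF.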

Page 8:

Variants of GFF – GTF 2

GTF 2 – Gene Transfer Format, same 9 columns as the GFF

http://mblab.wustl.edu/GTF2.html

Differences

1.  Only a subset of types are allowed in column 3: CDS, start_codon, stop_codon and a few more

2.  Attribute column format change: keys and values are separated by a space and not by an equals sign (=)

3.  Two mandatory attributes at the end of the record:

•  gene_id value;   A globally unique identifier for the genomic source of the transcript

•  transcript_id value;   A globally unique identifier for the predicted transcript.

Example attribute specification: name "REB1"; id "YP33546"
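For comparison with GFF, here is a small sketch of what a GTF record might look like and how its attribute column can be inspected. The record and its identifiers (`G1`, `G1.t1`, the `demo` source) are invented for illustration and do not come from any real annotation.

```shell
# Hypothetical GTF record: the same nine TAB-separated columns as GFF, but the
# attributes are written as: key "value"; (identifiers are made up).
printf 'chrI\tdemo\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; transcript_id "G1.t1";\n' > record.gtf

# Print the attribute column (column 9).
cut -f 9 record.gtf

# Find the records that belong to a given gene and show their feature type.
grep 'gene_id "G1"' record.gtf | cut -f 3
```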

Page 9:

What do the terms mean?

Page 10:

Sequence ontology browser

Page 11:

Searching for

X_element_combinatorial_repeat

Page 12:

Unix commands in this lecture

•  wc, cat, head, tail, sort, cut, grep, more, clear

Handy Tips

CTRL-C -> interrupts any process that may be running

clear -> clears the screen

cursor keys allow you to recall past commands

auto-complete -> write part of the filename then press TAB

Page 13:

Investigate your data

Page 14:

Check head/tail of the file
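A first look at a file might go as follows. This is a sketch on a tiny stand-in file rather than the yeast GFF: `sample.gff` and every record in it are hypothetical, created here only so the commands have something to read.

```shell
# Build a small hypothetical GFF-like file: one comment line plus four features.
{
  printf '##gff-version 3\n'
  printf 'chrI\tdemo\tgene\t100\t200\t.\t+\t.\tName=A\n'
  printf 'chrI\tdemo\tCDS\t100\t200\t.\t+\t0\tName=A_cds\n'
  printf 'chrI\tdemo\tgene\t300\t400\t.\t-\t.\tName=B\n'
  printf 'chrI\tdemo\ttelomere\t1\t50\t.\t-\t.\tName=T\n'
} > sample.gff

wc -l sample.gff       # how many lines does the file have?
head -n 2 sample.gff   # show the first two lines
tail -n 2 sample.gff   # show the last two lines
```

On the real data file, the same three commands give a quick sense of its size and of what the beginning and end look like before any further processing.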

Page 15:

Paging data with: less (more)

•  q -> quits the pager

•  SPACE or f -> go forward, next page

•  b -> go backward

•  /word -> search for a word

•  / then ENTER (or n) -> repeats the search for the last word

Page 16:

Find patterns in the file
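Pattern searches of this kind are typically done with `grep`. The sketch below uses a small hypothetical file (`sample.gff`, with invented records) in place of the yeast GFF.

```shell
# Build a small hypothetical GFF-like file: one comment line plus four features.
{
  printf '##gff-version 3\n'
  printf 'chrI\tdemo\tgene\t100\t200\t.\t+\t.\tName=A\n'
  printf 'chrI\tdemo\tCDS\t100\t200\t.\t+\t0\tName=A_cds\n'
  printf 'chrI\tdemo\tgene\t300\t400\t.\t-\t.\tName=B\n'
  printf 'chrI\tdemo\ttelomere\t1\t50\t.\t-\t.\tName=T\n'
} > sample.gff

grep gene sample.gff       # print the lines that contain "gene"
grep -c gene sample.gff    # count the matching lines instead of printing them
grep -v '^#' sample.gff    # -v inverts the match: print lines NOT starting with #
```

Note that `grep gene` matches the text anywhere on the line, not just in the type column, so on real data it may also pick up lines that mention "gene" in their attributes.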

Page 17:

Connecting streams

•  Input streams: entry from the keyboard or files

•  Output streams: print on screen, into files

Stream redirection: the "arrow" symbols <, >

Input stream redirection from a file: < filename

Output stream redirection to a file: > filename

Page 18:

Redirecting to a file creates/overwrites that file
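The overwrite behavior is easy to see with a throwaway file. A minimal sketch, where `out.txt` is a hypothetical file name chosen for the example:

```shell
# > creates the file if it does not exist, and overwrites it if it does.
echo first > out.txt
echo second > out.txt
cat out.txt          # only "second" is left; "first" was overwritten

# < feeds the contents of a file to a command as its input stream.
wc -l < out.txt
```

If the goal is to add to a file rather than replace it, the shell also provides `>>`, which appends to the end of the file.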

Page 19:

Piping streams

•  The pipe character | channels the output of one command into the other

(located above the ENTER key)

You can pipe multiple commands together
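As an illustration of piping, the sketch below chains two and then three commands. The file `sample.gff` and its records are hypothetical stand-ins for the real data.

```shell
# Build a small hypothetical GFF-like file: one comment line plus four features.
{
  printf '##gff-version 3\n'
  printf 'chrI\tdemo\tgene\t100\t200\t.\t+\t.\tName=A\n'
  printf 'chrI\tdemo\tCDS\t100\t200\t.\t+\t0\tName=A_cds\n'
  printf 'chrI\tdemo\tgene\t300\t400\t.\t-\t.\tName=B\n'
  printf 'chrI\tdemo\ttelomere\t1\t50\t.\t-\t.\tName=T\n'
} > sample.gff

# Two commands: the output of grep becomes the input of wc.
# This counts the feature lines, i.e. everything except comments.
grep -v '^#' sample.gff | wc -l

# Three commands: keep the gene lines, take column 9, show only the first result.
grep gene sample.gff | cut -f 9 | head -n 1
```

Each command in the chain only sees the stream handed to it by the previous one, so no intermediate files are created.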

Page 20:

Piping commands

Page 21:

Isolating relevant parts of our file
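One plausible way to isolate a part of the file is to drop the comment lines and keep a single column. This is a sketch under assumptions: the slide's own commands are not preserved in the transcript, and `sample.gff` and `types.txt` are hypothetical names with invented contents.

```shell
# Build a small hypothetical GFF-like file: one comment line plus four features.
{
  printf '##gff-version 3\n'
  printf 'chrI\tdemo\tgene\t100\t200\t.\t+\t.\tName=A\n'
  printf 'chrI\tdemo\tCDS\t100\t200\t.\t+\t0\tName=A_cds\n'
  printf 'chrI\tdemo\tgene\t300\t400\t.\t-\t.\tName=B\n'
  printf 'chrI\tdemo\ttelomere\t1\t50\t.\t-\t.\tName=T\n'
} > sample.gff

# Remove comment lines, keep only the type column, and save the result to a file.
grep -v '^#' sample.gff | cut -f 3 > types.txt

cat types.txt
```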

Page 22:

How many of each element

Page 23:

Find out how many of each feature
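Counting how often each feature type occurs can be sketched as a single pipeline. The file below is a hypothetical stand-in for the yeast GFF, and `uniq` is an addition to the commands listed for this lecture, so treat this as one possible approach rather than the one shown on the slide.

```shell
# Build a small hypothetical GFF-like file: one comment line plus four features.
{
  printf '##gff-version 3\n'
  printf 'chrI\tdemo\tgene\t100\t200\t.\t+\t.\tName=A\n'
  printf 'chrI\tdemo\tCDS\t100\t200\t.\t+\t0\tName=A_cds\n'
  printf 'chrI\tdemo\tgene\t300\t400\t.\t-\t.\tName=B\n'
  printf 'chrI\tdemo\ttelomere\t1\t50\t.\t-\t.\tName=T\n'
} > sample.gff

# 1. drop comment lines   2. keep the type column
# 3. sort so identical types sit next to each other
# 4. uniq -c collapses each run of identical lines and prefixes it with its count
grep -v '^#' sample.gff | cut -f 3 | sort | uniq -c
```

The `sort` step matters because `uniq` only merges adjacent identical lines; without it, the two `gene` lines in this sample would be counted separately.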

Page 24:

Homework 2

•  Create a file that lists all possible ontology terms that are present in the provided GFF file with a count of how many times this element occurs in the yeast genome. Sort this file by this count in reverse order (hint: man sort)

•  Pick an ontology term that is unfamiliar to you and look it up in the Sequence Ontology, paste the explanation into the homework