11/19/15

2015 - BMMB 852D: Applied Bioinformatics

Week 13, Lecture 26

István Albert

Biochemistry and Molecular Biology and Bioinformatics Consulting Center

Penn State

Interval related tasks

Intervals are not one-dimensional points! Make sure to specify the task more precisely:

•  For each feature, find the intervals from another dataset that are close to or overlapping with it
•  For each interval on one strand, find the closest on the other strand

Tasks phrased like this may not be sufficiently well defined.

Important details

•  What are the anchor points (the locations that represent the intervals)?
•  Which direction does the comparison proceed: upstream or downstream?
•  What gets reported? Often we need to create another, transformed interval dataset that conforms to what we actually need

[Figure labels: midpoint, starts, upstream]

Computing Interval Overlaps

•  Unexpectedly complex task as it needs to account for various types of positioning:
–  full containment of either interval
–  partial overlaps

[Figure labels: X, Y]

Neat and useful formulas (X, Y is the target interval; start, end refer to the query):

•  midpoint = (start + end) // 2 (with integer division)
•  overlap condition: (start < Y) and (end > X)
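The two formulas can be tried out directly. What follows is a minimal sketch, assuming 0-based, half-open coordinates as used by the BED format; under that convention the strict inequalities are exact, while with 1-based closed coordinates they would become `<=` and `>=`. The function names and the example coordinates are illustrative and do not come from the lecture.

```python
def midpoint(start, end):
    """Integer midpoint of the query interval."""
    return (start + end) // 2


def overlaps(start, end, X, Y):
    """True if the query [start, end) overlaps the target [X, Y)."""
    return (start < Y) and (end > X)


X, Y = 100, 200  # made-up target interval

print(midpoint(100, 201))        # 150
print(overlaps(150, 250, X, Y))  # partial overlap: True
print(overlaps(120, 130, X, Y))  # query contained in target: True
print(overlaps(50, 300, X, Y))   # target contained in query: True
print(overlaps(200, 300, X, Y))  # only touching at the boundary: False
```

A single condition covers both containment cases and both partial overlap cases, which is why the formula is worth remembering.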

Overlap/intersect

•  Two features are said to overlap or intersect if they share at least one base in common.

[Figure labels: Feature A, Feature B, Feature C, genome]
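The "at least one base in common" definition can also be expressed as a count of shared bases. This is a minimal sketch, assuming 0-based, half-open coordinates; `shared_bases` is a hypothetical helper name, and the coordinates are made up rather than read off the figure.

```python
def shared_bases(a_start, a_end, b_start, b_end):
    """Number of bases that [a_start, a_end) and [b_start, b_end) share."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


# Made-up coordinates for three features on the same chromosome.
feature_a = (100, 200)
feature_b = (150, 250)
feature_c = (300, 400)

print(shared_bases(*feature_a, *feature_b))  # 50
print(shared_bases(*feature_a, *feature_c))  # 0
```

Two features overlap exactly when this count is at least 1.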

Interval representation

•  binning → redundantly storing data at different zoom levels; originally implemented in the UCSC genome browser (also used in BAM and BedTools)
•  A different option → interval tree, usually supported by programming languages
•  Programming tip: for intervals that are not radically different in size, a sort by start coordinate followed by a binary search will be efficient
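The programming tip can be sketched with Python's standard `bisect` module. This is one possible reading of the tip, not code from the lecture: it assumes 0-based, half-open intervals on a single chromosome, and the class name `SortedIntervals` is hypothetical. The longest stored interval bounds how far to the left of the query a hit can begin, so two binary searches narrow the candidates to a small window.

```python
from bisect import bisect_left, bisect_right


class SortedIntervals:
    """Intervals sorted by start; queries use binary search on the starts."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        self.starts = [s for s, e in self.intervals]
        # The longest interval bounds how far left of the query a hit can begin.
        self.max_len = max((e - s for s, e in self.intervals), default=0)

    def find(self, start, end):
        """Return the stored intervals that overlap the query [start, end)."""
        # A hit must begin after (start - max_len) and before end.
        lo = bisect_right(self.starts, start - self.max_len)
        hi = bisect_left(self.starts, end)
        # Only the candidates in the window need the full overlap check.
        return [(s, e) for s, e in self.intervals[lo:hi] if e > start]


index = SortedIntervals([(10, 20), (15, 40), (50, 60), (5, 8)])
print(index.find(18, 52))  # [(10, 20), (15, 40), (50, 60)]
print(index.find(8, 10))   # []
```

If one interval is far longer than the rest, `max_len` grows and the candidate window widens toward a full scan, which is why the tip is limited to intervals of similar size.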

BedTools

•  High performance software package that operates on multiple interval-oriented data formats: BED, GFF, SAM, BAM and VCF
•  Download and install bedtools: http://bedtools.readthedocs.org/en/latest/

Quinlan AR and Hall IM, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, (2010)

BedTools concepts

•  There are many (25 and growing) tools/actions with different names
•  Most tools write to the standard output
•  The - (minus) character specifies the standard input
•  Can be chained with pipes like all UNIX commands
•  Most tools write their help when invoked, others need the -h flag
•  Flag options can substantially change the output format

Excellent documentation

Basic concepts

•  For any operation that requires two files, the tools will require a file A and a file B
•  Each element in file A is matched against each element in file B
•  File B is loaded into memory, so try to make that the smaller file

(for example, the A file contains the reads and the B file contains the features)
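The division of labor between file A and file B can be illustrated with a short sketch. This only illustrates the idea on the slide and is not how bedtools is implemented: B is held in memory as a dictionary keyed by chromosome, A is consumed one record at a time, and the inner loop is a plain linear scan. The function names and all records are made up.

```python
def load_features(records):
    """Read all of file B into memory, grouped by chromosome."""
    features = {}
    for chrom, start, end, name in records:
        features.setdefault(chrom, []).append((start, end, name))
    return features


def intersect(a_records, b_features):
    """Stream file A one record at a time and report overlapping B features."""
    for chrom, start, end, name in a_records:
        for b_start, b_end, b_name in b_features.get(chrom, []):
            if start < b_end and end > b_start:
                yield name, b_name


# Made-up records: A holds the (many) reads, B holds the (few) features.
reads = iter([
    ("chr1", 100, 150, "read1"),
    ("chr1", 500, 550, "read2"),
    ("chr2", 10, 60, "read3"),
])
genes = load_features([
    ("chr1", 120, 400, "geneA"),
    ("chr2", 0, 20, "geneB"),
])

print(list(intersect(reads, genes)))  # [('read1', 'geneA'), ('read3', 'geneB')]
```

Because A is only iterated over, its size does not affect memory use, while everything in B has to fit in memory at once.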

Bedtools concepts

•  The old style mode contains a different tool for each task (the manual covers these tools):
–  intersectBed
–  windowBed
–  closestBed
•  A new style mode that contains only one tool that takes commands like samtools:
–  bedtools intersect
–  bedtools window
–  bedtools closest

A few BedTools operators

–  slop (extend)
–  flank
–  merge
–  subtract
–  complement

[Figure legend: blue → before, red → after]
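To make the before/after behavior of these operators concrete, here is a minimal sketch of four of them in plain Python. It assumes 0-based, half-open intervals on a single chromosome of known length, and treats touching intervals as mergeable. The functions mimic the idea behind each operator; they are not the bedtools implementation, and the coordinates are made up.

```python
def slop(intervals, size, chrom_len):
    """Extend each interval by `size` on both sides, clipped to the chromosome."""
    return [(max(0, s - size), min(chrom_len, e + size)) for s, e in intervals]


def merge(intervals):
    """Combine overlapping or touching intervals into single intervals."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def complement(intervals, chrom_len):
    """Regions of [0, chrom_len) that no interval covers."""
    gaps, pos = [], 0
    for s, e in merge(intervals):
        if s > pos:
            gaps.append((pos, s))
        pos = max(pos, e)
    if pos < chrom_len:
        gaps.append((pos, chrom_len))
    return gaps


def subtract(intervals, remove):
    """Parts of `intervals` that no interval in `remove` covers."""
    result = []
    blockers = merge(remove)
    for s, e in intervals:
        pos = s
        for r_start, r_end in blockers:
            if r_end <= pos or r_start >= e:
                continue  # this blocker does not touch the remaining piece
            if r_start > pos:
                result.append((pos, r_start))
            pos = max(pos, r_end)
        if pos < e:
            result.append((pos, e))
    return result


genes = [(100, 200), (150, 300), (500, 600)]  # made-up coordinates

print(slop([(100, 200)], 50, 1000))  # [(50, 250)]
print(merge(genes))                  # [(100, 300), (500, 600)]
print(complement(genes, 1000))       # [(0, 100), (300, 500), (600, 1000)]
print(subtract([(0, 1000)], genes))  # [(0, 100), (300, 500), (600, 1000)]
```

Subtracting the features from the whole chromosome gives the same result as the complement, which is a handy sanity check when combining operators.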

Essential feature: Strand Awareness

•  Some tools take a -l (left), -r (right) parameter that will have a different effect if the "stranded" mode is turned on

1.   default mode: left, right are interpreted on the forward strand's coordinate system
2.   stranded mode: left, right are interpreted in the transcriptional direction, 5' to 3'
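The difference between the two modes is easiest to see on a reverse-strand feature. The sketch below models a flank-style operation as described on the slide. It is an illustration under stated assumptions, not bedtools code: coordinates are 0-based and half-open, results are clipped at 0 but not at the chromosome end, and the function name, its parameters and the coordinates are hypothetical.

```python
def flank(start, end, strand, left=0, right=0, stranded=False):
    """Return the flanking intervals of one feature, in genomic order.

    Default mode: left means lower coordinates on the forward strand.
    Stranded mode: left means upstream in the 5' to 3' direction, so the
    two sides swap for features on the reverse strand.
    """
    if stranded and strand == "-":
        left, right = right, left
    flanks = []
    if left > 0:
        flanks.append((max(0, start - left), start))
    if right > 0:
        flanks.append((end, end + right))
    return flanks


# A made-up reverse-strand gene at [1000, 2000).
print(flank(1000, 2000, "-", left=250))                 # [(750, 1000)]
print(flank(1000, 2000, "-", left=250, stranded=True))  # [(2000, 2250)]
print(flank(1000, 2000, "+", left=250, stranded=True))  # [(750, 1000)]
```

For a promoter-like region upstream of each gene, only the stranded result is on the correct side of a reverse-strand feature.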

Interval intersection (find overlaps)

•  The most important functionality of the toolset
•  Other functionality of bedtools could probably be implemented by your own programs
•  Efficiently intersecting intervals is an algorithmically more complex problem

bedtools intersect

•  Different flags can produce richer outputs
•  There are variants such as closest/window that are similar in functionality to intersect
•  Sometimes the solution to getting what you want is to create intervals of length 1 around the feature of interest
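Creating length-1 intervals amounts to picking an anchor point for each feature. A minimal sketch, assuming 0-based, half-open coordinates; the function `anchor` and its `mode` names are hypothetical and only illustrate the three anchor choices discussed in this lecture (midpoint, start, strand-aware 5' end).

```python
def anchor(start, end, strand="+", mode="midpoint"):
    """Collapse [start, end) to a length-1 interval at the chosen anchor."""
    if mode == "midpoint":
        pos = (start + end) // 2
    elif mode == "start":
        pos = start
    elif mode == "five_prime":
        # On the reverse strand the 5' end is the last base of the interval.
        pos = start if strand == "+" else end - 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return pos, pos + 1


print(anchor(100, 200))                          # (150, 151)
print(anchor(100, 200, mode="start"))            # (100, 101)
print(anchor(100, 200, "-", mode="five_prime"))  # (199, 200)
```

Writing the transformed intervals to a new file and intersecting that file makes the overlap test depend only on the anchor, not on the feature's full extent.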

Next: Bedtools Tutorial by Aaron Quinlan

Material taught at Cold Spring Harbor summer workshops: http://quinlanlab.org/tutorials/cshl2014/bedtools.html

Regions not covered by intervals

Merging overlapping intervals

Genome wide coverage

Homework 26

Create an ebola feature file that has only the features annotated as genes. Then, using this file:

1.  Create a new interval file that contains only the genomic regions that are NOT covered by genes (complement)
2.  Create an interval file that contains only the 250 bp long regions that are upstream of each gene (flank). Call these promoter regions.
3.  Create a fasta file that contains the sequences for the promoter regions that you extracted in step 2 (getfasta).

In your homework show the commands and a screenshot of IGV that shows the intervals