Important details · Computing interval overlaps • Unexpectedly complex...

Post on 25-Sep-2020


11/19/15


2015 - BMMB 852D: Applied Bioinformatics

Week 13, Lecture 26

István Albert

Biochemistry and Molecular Biology and Bioinformatics Consulting Center

Penn State

Interval related tasks

Intervals are not one-dimensional points! Make sure to specify the task more precisely.

•  For each feature, find the intervals from another dataset that are close to or overlapping with it
•  For each interval on one strand, find the closest on the other strand

Stated this way, the task may not be sufficiently well defined.

Important details

•  What are the anchor points (the locations that represent the intervals)?
•  Which direction does the comparison proceed: upstream, downstream?
•  What gets reported? Often we need to create another transformed interval data set that conforms to what we actually need

[Figure: anchor points labeled "midpoint", "starts", "upstream"]

Computing Interval Overlaps

•  Unexpectedly complex task as it needs to account for various types of positioning:
   –  full containment of either interval
   –  partial overlaps

[Figure: target interval spanning X to Y]

Neat and useful formulas (X, Y is the target interval; start, end refer to the query):

•  midpoint = (start + end) // 2 (with integer division)
•  overlap condition: (start < Y) and (end > X)
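The two formulas above can be tried out directly. The sketch below is only an illustration: the function names are made up for this example, and it assumes BED-style half-open coordinates, where the end position is not part of the interval. With closed coordinates (as in GFF) the strict comparisons would become `<=` and `>=`.

```python
def midpoint(start, end):
    """Integer midpoint of the query interval (integer division)."""
    return (start + end) // 2


def overlaps(start, end, x, y):
    """True when the query [start, end) overlaps the target [x, y).

    Assumes half-open coordinates, so two intervals that merely
    touch (end == x) share no base and do not overlap.
    """
    return start < y and end > x


# Partial overlap, full containment either way, and a near miss.
print(overlaps(10, 20, 15, 30))   # query overlaps the left edge of the target
print(overlaps(12, 18, 10, 20))   # query fully inside the target
print(overlaps(5, 40, 10, 20))    # target fully inside the query
print(overlaps(10, 20, 20, 30))   # adjacent intervals, no shared base
```

A single condition covers both containment cases and both partial overlaps, which is why it is worth memorizing.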


Overlap/intersect

•  Two features are said to overlap or intersect if they share at least one base in common.

[Figure: Feature A, Feature B and Feature C drawn along the genome]
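The "at least one base in common" definition can be made concrete by counting the shared bases. This is a minimal sketch with a hypothetical helper name, again assuming half-open coordinates; two features overlap exactly when the count is greater than zero.

```python
def shared_bases(a_start, a_end, b_start, b_end):
    """Number of bases that [a_start, a_end) and [b_start, b_end) have in common."""
    # The shared region runs from the later start to the earlier end;
    # a negative length means the features do not touch at all.
    return max(0, min(a_end, b_end) - max(a_start, b_start))


print(shared_bases(10, 20, 15, 30))  # partial overlap
print(shared_bases(10, 20, 12, 18))  # second feature contained in the first
print(shared_bases(10, 20, 20, 30))  # adjacent features
```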


Interval representation

•  Binning → redundantly storing data at different zoom levels; originally implemented in the UCSC genome browser (also used in BAM and BedTools)
•  A different option → interval tree, usually supported by programming languages
•  Programming tip: for intervals that are not radically different in size, a sort by start coordinate followed by a binary search will be efficient
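The programming tip can be sketched with Python's standard `bisect` module. This is one possible reading of the tip, not the lecture's own code: the function names are invented for the example, coordinates are assumed to be half-open, and all intervals are assumed to be on one chromosome. The length of the longest interval bounds how far back the search has to look, which is why the approach works best when sizes are similar.

```python
from bisect import bisect_left


def build_index(intervals):
    """Sort (start, end) tuples by start and remember the longest length."""
    ordered = sorted(intervals)
    starts = [start for start, _ in ordered]
    max_len = max((end - start for start, end in ordered), default=0)
    return ordered, starts, max_len


def find_overlaps(index, q_start, q_end):
    """Return the indexed intervals that overlap the query [q_start, q_end)."""
    ordered, starts, max_len = index
    # An overlapping interval must end after q_start, so it cannot start
    # more than max_len - 1 bases before q_start.
    lo = bisect_left(starts, q_start - max_len + 1)
    # It must also start before q_end.
    hi = bisect_left(starts, q_end)
    # Only the candidates in this window need the exact overlap check.
    return [(s, e) for s, e in ordered[lo:hi] if e > q_start]


index = build_index([(100, 200), (150, 160), (300, 400), (50, 120), (500, 600)])
print(find_overlaps(index, 110, 155))
print(find_overlaps(index, 200, 300))
```

One very long interval would push `max_len` up and widen every search window, at which point binning or an interval tree becomes the better choice.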

BedTools

•  High performance software package that operates on multiple interval oriented data formats: BED, GFF, SAM, BAM and VCF
•  Download and install bedtools: http://bedtools.readthedocs.org/en/latest/

Quinlan AR and Hall IM, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, (2010)


BedTools concepts

•  There are many (25 and growing) tools/actions with different names
•  Most tools write to the standard output
•  The - (minus) character specifies the standard input
•  Can be chained with pipes like all UNIX commands
•  Most tools write their help when invoked, others need the -h flag
•  Flag options can substantially change the output format

Excellent documentation

Basic concepts

•  For any operation that requires two files the tools will require a file A and a file B
•  Each element in file A is matched against each element in file B
•  File B is loaded into memory, so try to make that the smaller file

(for example the A file contains the reads and the B file contains the features)

Bedtools concepts

•  The old style mode contains a different tool for each task (the manual covers these tools):
   –  intersectBed
   –  windowBed
   –  closestBed
•  A new style mode that contains only one tool that takes commands like samtools:
   –  bedtools intersect
   –  bedtools window
   –  bedtools closest


A few BedTools operators

–  slop (extend)
–  flank
–  merge
–  subtract
–  complement

[Figure legend: Blue → before, Red → after]
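To make two of these operators less abstract, here is a small Python sketch of what merge and complement compute. It only illustrates the idea and is not how bedtools implements them: the function names are chosen for the example, intervals are half-open tuples on a single chromosome, and treating intervals that merely touch as mergeable is a choice made in this sketch.

```python
def merge(intervals):
    """Combine overlapping (or touching) intervals into single intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends the interval currently being built.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(item) for item in merged]


def complement(intervals, chrom_length):
    """Return the regions of [0, chrom_length) not covered by any interval."""
    gaps = []
    cursor = 0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


features = [(10, 20), (15, 30), (40, 50), (50, 60), (5, 8)]
print(merge(features))
print(complement(features, 100))
```

Complement needs the chromosome length to know where the last gap ends, which is why the real tool also asks for a genome file.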

Essential feature: Strand Awareness

•  Some tools take a -l (left), -r (right) parameter that will have a different effect if the "stranded" mode is turned on

1.  default mode: left, right are interpreted on the forward strand's coordinate system
2.  stranded mode: left, right are interpreted in the transcriptional direction 5' to 3'
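The difference between the two modes is easiest to see for an "upstream" region. The sketch below is a hypothetical helper, not a bedtools function; it assumes half-open coordinates and clips the result to the chromosome. For a forward-strand feature upstream lies to the left of the start, while for a reverse-strand feature it lies to the right of the end.

```python
def upstream_flank(start, end, strand, size, chrom_length):
    """Return the region of `size` bases upstream of a feature, strand-aware."""
    if strand == "-":
        # Transcription runs right to left, so upstream is past the end.
        lo, hi = end, end + size
    else:
        # Forward strand: upstream is before the start.
        lo, hi = start - size, start
    # Clip to the chromosome; near an edge the flank may come out shorter.
    return max(0, lo), min(chrom_length, hi)


print(upstream_flank(1000, 2000, "+", 250, 10000))
print(upstream_flank(1000, 2000, "-", 250, 10000))
```

Ignoring the strand would place the flank on the left for both features, which for the reverse-strand gene is the downstream side.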


Interval intersection (find overlaps)

•  The most important functionality of the tool set
•  Other functionality of bedtools could probably be implemented by your own programs
•  Efficiently intersecting intervals is an algorithmically more complex problem


bedtools intersect

•  Different flags can produce richer outputs
•  There are variants such as closest/window that are similar in functionality to intersect
•  Sometimes the solution to getting what you want is to create intervals of length 1 around the feature of interest
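Creating a length-1 interval amounts to choosing an anchor point and writing it back out as an interval. The following is a minimal sketch with an invented function name and parameter values, assuming half-open coordinates. The "start" anchor is taken in the transcriptional sense: the first base of a forward-strand feature and the last base of a reverse-strand feature.

```python
def anchor_interval(start, end, strand="+", anchor="start"):
    """Collapse a feature to a 1 bp interval at the chosen anchor point."""
    if anchor == "midpoint":
        pos = (start + end) // 2
    elif anchor == "start":
        # On the reverse strand the feature begins at its rightmost base.
        pos = end - 1 if strand == "-" else start
    else:
        raise ValueError(f"unknown anchor: {anchor}")
    return pos, pos + 1


print(anchor_interval(100, 200, "+"))
print(anchor_interval(100, 200, "-"))
print(anchor_interval(100, 200, anchor="midpoint"))
```

Once features are reduced this way, questions such as "which start site falls inside this region" become ordinary intersect operations.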

Next: Bedtools Tutorial by Aaron Quinlan

Material taught at Cold Spring Harbor summer workshops: http://quinlanlab.org/tutorials/cshl2014/bedtools.html


Regions not covered by intervals

Merging overlapping intervals

Genome wide coverage

Homework 26

Create an Ebola feature file that has only the features annotated as genes. Then, using this file:

1.  Create a new interval file that contains only the genomic regions that are NOT covered by genes (complement).
2.  Create an interval file that contains only the 250 bp long regions that are upstream of each gene (flank). Call these promoter regions.
3.  Create a FASTA file that contains the sequences for the promoter regions that you extracted in step 2 (getfasta).

In your homework show the commands and a screenshot of IGV that shows the intervals.
