CS 581 / BIOE 540: Algorithmic Computational Genomics (tandy.cs.illinois.edu/581-intro-to-course.pdf)

Post on 09-Sep-2020


CS 581 / BIOE 540: Algorithmic Computational Genomics

Tandy Warnow
Departments of Bioengineering and Computer Science
http://tandy.cs.illinois.edu

Course Details

Office hours: Tuesdays 12:30-1:30 (Siebel 3235)
Course webpage: http://tandy.cs.illinois.edu/581-2017.html
Textbook: Computational Phylogenetics, available for download at http://tandy.cs.illinois.edu/textbook.pdf
TA: Pranjal Vachaspati (to be confirmed)

Today

•  Describe some important problems in computational biology, for which students in this course could develop improved methods

•  Explain how the course will be run

•  Answer questions

This Course

Topics: computational and statistical problems in sequence analysis (e.g., multiple sequence alignment, phylogeny estimation, metagenomics, etc.).

Focus: understanding the mathematical foundations, and designing algorithms with outstanding accuracy and speed on large, complex datasets.

This is not a course about how to use the tools.

Prerequisites

No background in biology is needed. However, the course has the following prerequisites:

•  CS 374: computational complexity, algorithm design techniques, and proving theorems about algorithms

•  CS 361: probability and statistics

•  By recursion, CS 225: programming

If you haven't satisfied the pre-reqs:

You need permission to stay in the course.

•  The first homework is due (by email) on Saturday at 1 PM. See the homework webpage http://tandy.cs.illinois.edu/cs581-2017-hw.html

•  Then make an appointment to see me to review the homework.

This course

•  Phylogeny estimation based on stochastic models of sequence evolution and genome evolution

•  Multiple sequence alignment

•  Applications to metagenomics, protein structure prediction, and other biological problems

Species Tree

[Figure: species tree of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Evolution informs about everything in biology

•  Big genome sequencing projects just produce data. So what?

•  Evolutionary history relates all organisms and genes, and helps us understand and predict
   –  interactions between genes (genetic networks)
   –  drug design
   –  predicting functions of genes
   –  influenza vaccine development
   –  origins and spread of disease
   –  origins and migrations of humans

Constructing the Tree of Life: Hard Computational Problems

NP-hard problems

Large datasets
•  100,000+ sequences
•  thousands of genes

“Big data” complexity:
•  model misspecification
•  fragmentary sequences
•  errors in input data
•  streaming data

Phylogenomic pipeline

•  Select taxon set and markers
•  Gather and screen sequence data, possibly identify orthologs
•  Compute multiple sequence alignments for each locus
•  Compute species tree or network:
   –  Compute gene trees on the alignments and combine the estimated gene trees, OR
   –  Estimate a tree from a concatenation of the multiple sequence alignments
•  Get statistical support on each branch (e.g., bootstrapping)
•  Estimate dates on the nodes of the phylogeny
•  Use species tree with branch support and dates to understand biology
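The bootstrapping step in the pipeline can be illustrated with a short sketch. It assumes an alignment stored as a dictionary from taxon name to aligned sequence (the four-sequence alignment used elsewhere in these slides) and shows only the column resampling; the function name `bootstrap_replicate` is hypothetical, and tree estimation on each replicate is left out.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps taxon name -> aligned sequence (all the same length).
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices uniformly at random, with replacement.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in cols) for taxon, seq in alignment.items()}

alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
```

In a typical analysis, a tree would be estimated on every replicate, and the support for a branch would be the fraction of replicate trees that contain it.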


Research Strategies

Improved algorithms through:
•  Divide-and-conquer
•  “Bin-and-conquer”
•  Iteration
•  Bayesian statistics
•  Hidden Markov Models
•  Graph theory
•  Combinatorial optimization

Statistical modelling

Massive Simulations

High Performance Computing

Avian Phylogenomics Project

•  Approx. 50 species, whole genomes
•  14,000 loci

Erich Jarvis, HHMI
G Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Plus many many other people…

Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project

•  Plant Tree of Life based on transcriptomes of ~1200 species
•  More than 13,000 gene families (most not single copy)
•  First paper: PNAS 2014 (~100 species and ~800 loci)

Gene Tree Incongruence

G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, UIUC
S. Mirarab, UT-Austin
N. Nguyen, UT-Austin
Plus many many other people…

Upcoming Challenges (~1200 species, ~400 loci)

DNA Sequence Evolution

[Figure: a tree of DNA sequence evolution. The ancestral sequence AAGACTT (3 million years ago) is followed by TGGACTT and AAGGCCT (2 million years ago), then by AGGGCAT, TAGCCCT, and AGCACTT (1 million years ago), and finally by the present-day sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, and AGGGCAT.]

Phylogeny Problem

U = TAGCCCA
V = TAGACTT
W = TGCACAA
X = TGCGCTT
Y = AGGGCAT

[Figure: tree on the leaves U, V, W, X, Y]

Performance criteria

•  Running time
•  Space
•  Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution
•  “Topological accuracy” with respect to the underlying true tree or true alignment, typically studied in simulation
•  Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data

Quantifying Error

•  FN: false negative (missing edge)
•  FP: false positive (incorrect edge)

[Figure: a true tree and an estimated tree, with the FN and FP edges marked; 50% error rate]
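The FN and FP rates can be computed by comparing the bipartitions (internal edges) of the two trees. The sketch below is one possible way to do this, assuming trees are written as nested tuples of leaf names and have at least one internal edge; the function names and the five-leaf example trees are illustrative and do not come from the slides.

```python
def leaves(tree):
    """Set of leaf names below a node (a leaf is a string, an internal node a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Nontrivial bipartitions induced by the internal edges of the tree."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 1 < len(side) < len(all_leaves) - 1:
            # Keep the side without the anchor leaf so each split has one canonical form.
            splits.add(side if anchor not in side else all_leaves - side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def fn_fp_rates(true_tree, estimated_tree):
    """FN rate = missing true edges / true edges; FP rate = incorrect edges / estimated edges."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    fn = len(true_splits - est_splits) / len(true_splits)
    fp = len(est_splits - true_splits) / len(est_splits)
    return fn, fp

true_tree = (("A", "B"), ("C", ("D", "E")))
estimated_tree = (("A", "C"), ("B", ("D", "E")))
rates = fn_fp_rates(true_tree, estimated_tree)
```

Each example tree has two internal edges and the trees share one of them, so both rates come out to 0.5.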

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Phylogenetic reconstruction methods

1.  Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)

[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum]

2.  Polynomial time distance-based methods: Neighbor Joining, FastME, etc.

3.  Bayesian methods
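To make the Maximum Parsimony criterion concrete, here is a minimal sketch of scoring a single fixed tree with Fitch's algorithm. Scoring one tree is the easy part; the hard optimization problem is the search over trees. The nested-tuple tree encoding, the function names, and the toy sequences are all assumptions made for illustration.

```python
def fitch_site(tree, states):
    """Return (candidate state set, number of changes) for one site on a rooted binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum the Fitch cost over all sites of an alignment (taxon -> sequence)."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )

# Hypothetical two-site alignment on four taxa.
seqs = {"A": "AC", "B": "AC", "C": "GC", "D": "GT"}
score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), seqs)
score_ac_bd = parsimony_score((("A", "C"), ("B", "D")), seqs)
```

On this toy input the tree grouping A with B needs 2 changes and the tree grouping A with C needs 3, so parsimony prefers the first. A hill-climbing heuristic would repeatedly modify the tree and keep modifications that lower this score.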

Solving maximum likelihood (and other hard optimization problems) is… unlikely

# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
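The small entries in this table follow the standard count of unrooted binary trees on n labelled leaves, (2n-5)!! = 1 x 3 x 5 x ... x (2n-5). The slide does not state the formula, so the sketch below is an assumption about how the table was produced; the function name is hypothetical.

```python
def num_unrooted_binary_trees(n_taxa):
    """Number of unrooted binary trees on n_taxa labelled leaves: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the product is empty for n_taxa = 3).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count

table = {n: num_unrooted_binary_trees(n) for n in (4, 5, 6, 7, 8, 9, 10, 20)}
```

This reproduces the rows for 4 through 10 taxa exactly and the 20-taxon row after rounding. For 100 taxa the same product is on the order of 10^182, smaller than the figure in the table, so the last two rows may have been obtained with a different count or rounding.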


Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]

Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!

[Figure: error rate of NJ (vertical axis, 0 to 0.8) against number of taxa (horizontal axis, 0 to 1600)]

Major Challenges

•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

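Phylogeny Problem

[Figure: the sequences U = TAGCCCA, V = TAGACTT, W = TGCACAA, X = TGCGCTT, Y = AGGGCAT and a tree on the leaves U, V, W, X, Y]

The Real Problem!

U = AGAT
V = TAGACTT
W = TGCACAA
X = TGCGCTT
Y = AGGGCATGA

[Figure: tree on the leaves U, V, W, X, Y; here the input sequences have different lengths]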

[Figure: the sequence …ACGGTGCAGTTACCA… evolves into …ACCAGTCACCTA… through a deletion, a substitution, and an insertion]

…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…

The true multiple alignment
–  Reflects historical substitution, insertion, and deletion events
–  Defined using transitive closure of pairwise alignments computed on edges of the true tree

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Alignment

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Phase 2: Construct tree

[Figure: tree on S1, S2, S3, S4 constructed from the alignment]
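As background for Phase 1, the sketch below shows pairwise global alignment by dynamic programming (Needleman-Wunsch) with unit mismatch and gap costs. It is only a building block: the multiple alignment methods named in these slides are considerably more involved, and the cost parameters and function name here are assumptions for illustration.

```python
def global_alignment_cost(a, b, mismatch=1, gap=1):
    """Minimum total cost of a global pairwise alignment of a and b."""
    # prev[j] holds the cost of aligning the first i-1 letters of a with the first j of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            delete = prev[j] + gap       # a[i-1] aligned to a gap
            insert = curr[j - 1] + gap   # b[j-1] aligned to a gap
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]

S2 = "TAGCTATCACGACCGC"
S3 = "TAGCTGACCGC"
cost_s2_s3 = global_alignment_cost(S2, S3)
```

S2 equals S3 with the five letters ATCAC inserted after TAGCT, so with unit costs the optimal alignment uses five gaps and no mismatches.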

Two-phase estimation

Alignment methods
•  Clustal
•  POY (and POY*)
•  Probcons (and Probtree)
•  Probalign
•  MAFFT
•  Muscle
•  Di-align
•  T-Coffee
•  Prank (PNAS 2005, Science 2008)
•  Opal (ISMB and Bioinf. 2007)
•  FSA (PLoS Comp. Bio. 2009)
•  Infernal (Bioinf. 2009)
•  Etc.

Phylogeny methods
•  Bayesian MCMC
•  Maximum parsimony
•  Maximum likelihood
•  Neighbor joining
•  FastME
•  UPGMA
•  Quartet puzzling
•  Etc.

Simulation Studies

True tree and alignment

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

[Figure: true tree on S1, S2, S3, S4]

Unaligned Sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Estimated tree and alignment

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA

[Figure: estimated tree on S1, S2, S3, S4]

Compare the estimated tree and alignment with the true tree and alignment.

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Multiple Sequence Alignment (MSA): another grand challenge¹

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA

•  Novel techniques needed for scalability and accuracy
•  NP-hard problems and large datasets
•  Current methods do not provide good accuracy
•  Few methods can analyze even moderately large datasets
•  Many important applications besides phylogenetic estimation

¹ Frontiers in Massive Data Analysis, National Academies Press, 2013

Major Challenges

•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

•  Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

(Phylogenetic estimation from whole genomes)

[Figure: species tree of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Species Tree Estimation requires multiple genes!

Two basic approaches for species tree estimation

•  Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods

•  Compute trees on individual genes and combine gene trees

Using multiple genes

gene 1
S1    TCTAATGGAA
S2    GCTAAGGGAA
S3    TCTAAGGGAA
S4    TCTAACGGAA
S7    TCTAATGGAC
S8    TATAACGGAA

gene 2
S4    GGTAACCCTC
S5    GCTAAACCTC
S6    GGTGACCATC
S7    GCTAAACCTC

gene 3
S1    TATTGATACA
S3    TCTTGATACC
S4    TAGTGATGCA
S7    CATTCATACC
S8    TAGTGATGCA

Concatenation

      gene 1      gene 2      gene 3
S1    TCTAATGGAA  ??????????  TATTGATACA
S2    GCTAAGGGAA  ??????????  ??????????
S3    TCTAAGGGAA  ??????????  TCTTGATACC
S4    TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5    ??????????  GCTAAACCTC  ??????????
S6    ??????????  GGTGACCATC  ??????????
S7    TCTAATGGAC  GCTAAACCTC  CATTCATACC
S8    TATAACGGAA  ??????????  TAGTGATGCA
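The concatenation step can be sketched in a few lines: per-gene alignments are joined side by side, and a taxon missing from a gene is filled with "?" over that gene's columns. The sketch assumes that the sequences on the slide pair with taxa in the order listed; the function name `concatenate` and the variable names are hypothetical.

```python
def concatenate(gene_alignments, missing="?"):
    """Join per-gene alignments (taxon -> sequence) into one supermatrix."""
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    rows = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            # Taxa absent from this gene get a block of missing-data symbols.
            rows[taxon].append(aln.get(taxon, missing * length))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}

gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
         "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "CATTCATACC", "S8": "TAGTGATGCA"}

supermatrix = concatenate([gene1, gene2, gene3])
```

A phylogeny estimation method would then be run on `supermatrix` as if it were a single 30-column alignment.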

Red gene tree ≠ species tree (green gene tree okay)

Gene Tree Incongruence

Gene trees can differ from the species tree due to:
•  Duplication and loss
•  Horizontal gene transfer
•  Incomplete lineage sorting (ILS)

Incomplete Lineage Sorting (ILS)

Confounds phylogenetic analysis for many groups:
•  Hominids
•  Birds
•  Yeast
•  Animals
•  Toads
•  Fish
•  Fungi

There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

Lineage Sorting

•  Population-level process, also called the “Multi-species coalescent” (Kingman, 1982)
•  Gene trees can differ from species trees due to short times between speciation events or large population size; this is called “Incomplete Lineage Sorting” or “Deep Coalescence”.

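The Coalescent

[Figure: the coalescent, traced from the present back to the past. Courtesy James Degnan]

Gene tree in a species tree

[Figure: a gene tree inside a species tree. Courtesy James Degnan]

Key observation: Under the multi-species coalescent model, the species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees.

Courtesy James Degnan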
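The smallest worked instance of this observation is the three-species case. Under the multi-species coalescent with species tree ((A,B),C) and an internal branch of length t in coalescent units, the matching gene tree has probability 1 - (2/3)e^(-t) and each of the two other topologies has probability (1/3)e^(-t). This standard result is not stated on the slide; the sketch below evaluates it, and the function name is hypothetical.

```python
import math

def rooted_triple_probabilities(t):
    """Gene tree topology probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units (t >= 0).
    """
    p_discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * p_discordant,  # matches the species tree
        "((A,C),B)": p_discordant,
        "((B,C),A)": p_discordant,
    }

short_branch = rooted_triple_probabilities(0.1)
long_branch = rooted_triple_probabilities(3.0)
```

For any t > 0 the matching topology is strictly the most probable one, which is why the species tree on three taxa can be read off from the gene tree distribution; as t shrinks toward 0 the three probabilities approach 1/3 each, which is the regime where incomplete lineage sorting is severe.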

Two competing approaches

[Figure: genes 1 through k are either analyzed separately and combined with a summary method, or concatenated, to obtain a species tree]

[Figure: species tree of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Species tree estimation: difficult, even for small datasets!

Major Challenges: large datasets, fragmentary sequences

•  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.

•  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).

•  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.

•  Phylogenetic Network Estimation: Horizontal gene transfer and hybridization require non-tree models of evolution.

Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.


Avian Phylogenomics Project

•  Approx. 50 species, whole genomes
•  14,000 loci

Challenges:
•  Species tree estimation under the multi-species coalescent model, from 14,000 poorly estimated gene trees, all with different topologies (we used “statistical binning”)
•  Maximum likelihood estimation on a million-site genome-scale alignment: 250 CPU years

Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project

•  Plant Tree of Life based on transcriptomes of ~1200 species
•  More than 13,000 gene families (most not single copy)
•  First paper: PNAS 2014 (~100 species and ~800 loci)

Upcoming Challenges (~1200 species, ~400 loci):
•  Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL: Mirarab et al. 2014, Mirarab & Warnow 2015)
•  Multiple sequence alignment of >100,000 sequences (with lots of fragments!); we will use UPP (Nguyen et al., 2015)


Metagenomics: Venter et al., Exploring the Sargasso Sea:

Scientists Discover One Million New Genes in Ocean Microbes

Metagenomic data analysis

•  NGS data produce fragmentary sequence data
•  Metagenomic analyses include unknown species
•  Taxon identification: given short sequences, identify the species for each fragment
•  Applications: Human Microbiome
•  Issues: accuracy and speed

Mihai Pop, Univ. Maryland

Metagenomic taxon identification

Objective: classify short reads in a metagenomic sample
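To make the objective concrete, here is a deliberately naive sketch of read classification by shared k-mers. It is not one of the methods discussed in the course; it only shows the shape of the problem, including reads that match nothing in the reference set (unknown species). The reference sequences, taxon names, and function names are all made up for illustration.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, references, k=3):
    """Return the reference sharing the most k-mers with the read, or None if none share any."""
    read_kmers = kmers(read, k)
    best_name, best_shared = None, 0
    for name, genome in references.items():
        shared = len(read_kmers & kmers(genome, k))
        if shared > best_shared:
            best_name, best_shared = name, shared
    return best_name

# Hypothetical reference sequences for two taxa.
references = {
    "taxon_a": "ACGTACGTACGT",
    "taxon_b": "TTTTGGGGCCCC",
}
labels = [classify_read(read, references) for read in ("GTACG", "GGGCC", "AAAAA")]
```

The first two reads are assigned to taxon_a and taxon_b respectively, and the third is left unclassified because it shares no 3-mer with either reference. Real reads are longer and noisier, and the references are large, which is where the accuracy and speed issues arise.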

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

[Figure: tree on Anatolian, Albanian, Tocharian, Greek, Germanic, Armenian, Italic, Celtic, Baltic, Slavic, Vedic, and Iranian]

“Perfect Phylogenetic Network” for IE (Nakhleh et al., Language 2005)

[Figure: network on the same twelve language groups]

Grading

•  Homework: 25% (one hw dropped)
•  Midterm: 40% (March 30)
•  Final Project: 25% (due May 6)
•  Course Participation: 10%

No final exam.

Homework Assignments

•  Homework assignments are listed at http://tandy.cs.illinois.edu/cs581-2017-hw.html and are due at 1 PM (in person or via email). Late homeworks have reduced credit and will not be accepted more than 48 hours past the deadline.

•  You are encouraged to work with others on your homework, but you must write up solutions by yourself and indicate who you worked with on each homework.

Course Schedule

A detailed course schedule is at http://tandy.cs.illinois.edu/cs581-2017-detailed-syllabus.html

This schedule includes material you are expected to have looked at before coming to class:
•  assigned reading (from textbook and/or scientific literature)
•  PPT and/or PDF of my lecture

Final Project and Class Presentation

Either research project (can be with another student) or survey paper (done by yourself).

Many interesting and publishable problems to address: see http://tandy.cs.illinois.edu/topics.html

Your class presentation should be related to your final project.

Academic Integrity

Please see course website at http://tandy.cs.illinois.edu/581-2017.html and also http://tandy.cs.illinois.edu/ethics.pdf

For this course:
•  Examine the policy about collaboration
•  Learn and understand what plagiarism is (and then don't do it). This applies to homework, all writing assignments, and the final project.

Course Research Projects

•  Evaluating existing methods on simulated and real (biological or linguistic) datasets

•  Designing a new method, and establishing its performance (using theory and data)

•  Analyzing a biological dataset using several different methods, to address biology

Examples of published course projects

•  Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6):S7.
•  T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6):S11.
•  J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". RECOMB-Comparative Genomics and BMC Genomics, 2015, 16(Suppl 10):S2.
•  P. Vachaspati and T. Warnow (2016). "FastRFS: Fast and Accurate Robinson-Foulds Supertrees using Constrained Exact Optimization". Bioinformatics 2016; doi:10.1093/bioinformatics/btw600. (Special issue for papers from RECOMB-CG)

Research Projects you could join

•  Phylogenomics projects (Avian and the 1KP)
   –  Species tree and network estimation from conflicting genes
   –  Large-scale multiple sequence alignment
   –  Large-scale maximum likelihood tree estimation
   –  Improving gene tree estimation using whole genomes

•  Metagenomics (with Mihai Pop, University of Maryland, and Bill Gropp)
   –  Identifying genes and taxa from short sequences
   –  Metagenomic assembly
   –  Applications to clinical diagnostics

•  Protein Sequence Analysis (with Jian Peng)
   –  What function and structure does this protein have?
   –  How did structure and function evolve?

•  Historical Linguistics (with Donald Ringe, UPenn)
   –  How did Indo-European evolve?
   –  Designing and implementing statistical estimation methods for language phylogenies
