cs 581 / bioe 540: algorithmic computaonal genomicstandy.cs.illinois.edu/581-intro-to-course.pdf ·...

CS581/BIOE540:AlgorithmicComputa<onalGenomics

TandyWarnowDepartmentsofBioengineeringandComputerSciencehHp://tandy.cs.illinois.edu

CourseDetails

Officehours:Tuesdays12:30-1:30(Siebel3235)Coursewebpage: hHp://tandy.cs.illinois.edu/581-2017.htmlTextbook:Computa<onalPhylogene<cs,availablefordownloadathHp://tandy.cs.illinois.edu/textbook.pdfTA:PranjalVachaspa<(tobeconfirmed)

•  Describesomeimportantproblemsincomputa<onalbiology,forwhichstudentsinthiscoursecoulddevelopimprovedmethods

•  Explainhowthecoursewillberun•  Answerques<ons

ThisCourseTopics:computa<onalandsta<s<calproblemsinsequenceanalysis(e.g.,mul<plesequencealignment,phylogenyes<ma<on,metagenomics,etc.).Focus:understandingthemathema<calfounda<ons,anddesigningalgorithmswithoutstandingaccuracyandspeedonlarge,complexdatasets.Thisisnotacourseabouthowtousethetools.

Prerequisites

Nobackgroundinbiologyisneeded.However,thecoursehasthefollowingprerequisites:

•  CS374:computa<onalcomplexity,algorithmdesigntechniques,andprovingtheoremsaboutalgorithms

•  CS361:probabilityandsta<s<cs

•  Byrecursion,CS225:programming

Ifyouhaven’tsa<sfiedthepre-reqs:

Youneedpermissiontostayinthecourse.

•  Thefirsthomeworkisdue(byemail)onSaturdayat1PM.SeethehomeworkwebpagehHp://tandy.cs.illinois.edu/cs581-2017-hw.html

•  Thenmakeanappointmenttoseemetoreviewthehomework.

Thiscourse

•  Phylogenyes<ma<onbasedonstochas<cmodelsofsequenceevolu<onandgenomeevolu<on

•  Mul<plesequencealignment•  Applica<onstometagenomics,protein

structurepredic<on,andotherbiologicalproblems

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Species Tree

Evolu<oninformsabouteverythinginbiology

•  Biggenomesequencingprojectsjustproducedata------sowhat?

•  Evolu<onaryhistoryrelatesallorganismsandgenes,andhelpsusunderstandandpredict–  interac<onsbetweengenes(gene<cnetworks)–  drugdesign–  predic<ngfunc<onsofgenes–  influenzavaccinedevelopment–  originsandspreadofdisease–  originsandmigra<onsofhumans

Constructing the Tree of Life: Hard Computational Problems

NP-hardproblemsLargedatasets

100,000+sequencesthousandsofgenes

“Bigdata”complexity:

modelmisspecifica<onfragmentarysequenceserrorsininputdatastreamingdata

Phylogenomicpipeline

SelecttaxonsetandmarkersGatherandscreensequencedata,possiblyiden<fyorthologsComputemul<plesequencealignmentsforeachlocusComputespeciestreeornetwork:

Computegenetreesonthealignmentsandcombinethees<matedgenetrees,OREs<mateatreefromaconcatena<onofthemul<plesequencealignments

Getsta<s<calsupportoneachbranch(e.g.,bootstrapping)Es<matedatesonthenodesofthephylogenyUsespeciestreewithbranchsupportanddatestounderstandbiology

Phylogenomicpipeline

SelecttaxonsetandmarkersGatherandscreensequencedata,possiblyiden<fyorthologsComputemul<plesequencealignmentsforeachlocusComputespeciestreeornetwork:

Computegenetreesonthealignmentsandcombinethees<matedgenetrees,OREs<mateatreefromaconcatena<onofthemul<plesequencealignments

Getsta<s<calsupportoneachbranch(e.g.,bootstrapping)Es<matedatesonthenodesofthephylogenyUsespeciestreewithbranchsupportanddatestounderstandbiology

Research Strategies

Improvedalgorithmsthrough:•  Divide-and-conquer•  “Bin-and-conquer”•  Iteration•  Bayesiansta<s<cs• HiddenMarkovModels• Graphtheory•  Combinatorialop<miza<on

Sta<s<calmodellingMassiveSimula<onsHighPerformanceCompu<ng

AvianPhylogenomicsProject

GZhang,BGI

• Approx.50species,wholegenomes• 14,000loci

MTPGilbert,Copenhagen

S.MirarabMd.S.Bayzid,UT-Aus<nUT-Aus<n

T.WarnowUT-Aus<n

Plusmanymanyotherpeople…

ErichJarvis,HHMI

Science,December2014(Jarvis,Mirarab,etal.,andMirarabetal.)

1kp:ThousandTranscriptomeProject

l  PlantTreeofLifebasedontranscriptomesof~1200speciesl  Morethan13,000genefamilies(mostnotsinglecopy)l  Firstpaper:PNAS2014(~100speciesand~800loci)GeneTreeIncongruence

G. Ka-Shu Wong U Alberta

N. Wickett Northwestern

J. Leebens-Mack U Georgia

N. Matasci iPlant

T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin

Plus many many other people…

UpcomingChallenges(~1200species,~400loci)

DNA Sequence Evolution

AAGACTT

TGGACTT AAGGCCT

-3 mil yrs

-2 mil yrs

-1 mil yrs

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT

AAGACTT

TGGACTT AAGGCCT

AAGGCCT TGGACTT

AGCGCTT AGCACAA TAGACTT TAGCCCA AGGGCAT

Phylogeny Problem

TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT

U V W X Y

Performance criteria •  Running time •  Space •  Statistical performance issues (e.g., statistical

consistency) with respect to a Markov model of evolution

•  “Topological accuracy” with respect to the underlying true tree or true alignment, typically studied in simulation

•  Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data

Quantifying Error

FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

1  Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)

Local optimum

Global optimum

Phylogenetic trees 2  Polynomial time distance-based methods: Neighbor

Joining, FastME, etc. 3. Bayesian methods

Phylogenetic reconstruction methods

Solving maximum likelihood (and other hard optimization problems) is… unlikely

# of Taxa

# of Unrooted Trees

4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 1020

100 4.5 x 10190

1000 2.7 x 102900

Quantifying Error

FN: false negative (missing edge)

FP: false positive (incorrect edge)

50% error rate

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]

Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!

0 400 800 No. Taxa

1600 1200

0.4 0.2

Major Challenges

•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

Phylogeny Problem

TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT

U V W X Y

AGAT TAGACTT TGCACAA TGCGCTT AGGGCATGA

U V W X Y

TheRealProblem!

…ACGGTGCAGTTACC-A…

…AC----CAGTCACCTA…

The true multiple alignment –  Reflects historical substitution, insertion, and deletion events –  Defined using transitive closure of pairwise alignments computed on

edges of the true tree

…ACGGTGCAGTTACCA… Insertion

Substitution Deletion

…ACCAGTCACCTA…

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Alignment

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

Phase 2: Construct tree

Two-phase estimation Alignment methods •  Clustal •  POY (and POY*) •  Probcons (and Probtree) •  Probalign •  MAFFT •  Muscle •  Di-align •  T-Coffee •  Prank (PNAS 2005, Science 2008) •  Opal (ISMB and Bioinf. 2007) •  FSA (PLoS Comp. Bio. 2009) •  Infernal (Bioinf. 2009) •  Etc.

Phylogeny methods •  Bayesian MCMC•  Maximum

parsimony•  Maximum likelihood•  Neighbor joining•  FastME•  UPGMA•  Quartet puzzling•  Etc.

Simulation Studies

True tree and alignment

Compare

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA

Estimated tree and alignment

Unaligned Sequences

1000taxonmodels,orderedbydifficulty(Liuetal.,2009)

Multiple Sequence Alignment (MSA): another grand challenge1

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- … Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA

Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation

1 Frontiers in Massive Data Analysis, National Academies Press, 2013

Major Challenges

•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

•  Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

(Phylogenetic estimation from whole genomes)

Species Tree Estimation requires multiple genes!

Two basic approaches for species tree estimation

•  Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods

•  Compute trees on individual genes and

combine gene trees

Usingmul<plegenes

gene 1S1

TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA TCTAATGGAC

TATAACGGAA

gene 3TATTGATACA

TCTTGATACC

TAGTGATGCA

CATTCATACC

TAGTGATGCA

gene 2GGTAACCCTCGCTAAACCTC

GGTGACCATC

GCTAAACCTC

Concatena<on gene 1

gene 2 gene 3 TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA

TCTAATGGAC

TATAACGGAA

GGTAACCCTCGCTAAACCTC

GGTGACCATC

GCTAAACCTC

TATTGATACA

TCTTGATACC

TAGTGATGCA

CATTCATACC

TAGTGATGCA ? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

Redgenetree≠speciestree(greengenetreeokay)

GeneTreeIncongruence

Genetreescandifferfromthespeciestreedueto:

Duplica<onandlossHorizontalgenetransferIncompletelineagesor<ng(ILS)

IncompleteLineageSor<ng(ILS)

Confoundsphylogene<canalysisformanygroups:

HominidsBirdsYeastAnimalsToadsFishFungi

Thereissubstan<aldebateabouthowtoanalyzephylogenomicdatasetsinthepresenceofILS.

LineageSor<ng

Popula<on-levelprocess,alsocalledthe “Mul<-speciescoalescent”(Kingman,1982)Genetreescandifferfromspeciestreesduetoshort<mesbetweenspecia<oneventsorlargepopula<onsize;thisiscalled“IncompleteLineageSor<ng”or“DeepCoalescence”.

TheCoalescent

Present

CourtesyJamesDegnan

GenetreeinaspeciestreeCourtesyJamesDegnan

Keyobserva<on:Underthemul<-speciescoalescentmodel,thespeciestreedefinesaprobabilitydistribu4ononthegenetrees,andisiden4fiablefromthedistribu4onongenetrees

CourtesyJamesDegnan

Analyzeseparately

Summary Method

Twocompe<ngapproaches

gene 1 gene 2 . . . gene k

. . . Concatenation

Species

Speciestreees<ma<on:difficult,evenforsmalldatasets!

Major Challenges: large datasets, fragmentary sequences

•  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.

•  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).

•  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.

•  Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution

Both phylogenetic estimation and multiple sequence alignment are also

impacted by fragmentary data.

Major Challenges: large datasets, fragmentary sequences

•  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.

•  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).

•  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.

•  Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution

Both phylogenetic estimation and multiple sequence alignment are also

impacted by fragmentary data.

AvianPhylogenomicsProject

GZhang,BGI

• Approx.50species,wholegenomes• 14,000loci

MTPGilbert,Copenhagen

S.MirarabMd.S.Bayzid,UT-Aus<nUT-Aus<n

T.WarnowUT-Aus<n

Plusmanymanyotherpeople…

ErichJarvis,HHMI

Challenges:•  Speciestreees<ma<onunderthemul<-speciescoalescentmodel,from14,000poores<matedgenetrees,allwithdifferenttopologies(weused“sta<s<calbinning”)•  Maximumlikelihoodes<ma<ononamillion-sitegenome-scalealignment–250CPUyears

Science,December2014(Jarvis,Mirarab,etal.,andMirarabetal.)

1kp:ThousandTranscriptomeProject

l  PlantTreeofLifebasedontranscriptomesof~1200speciesl  Morethan13,000genefamilies(mostnotsinglecopy)l  Firstpaper:PNAS2014(~100speciesand~800loci)GeneTreeIncongruence

G. Ka-Shu Wong U Alberta

N. Wickett Northwestern

J. Leebens-Mack U Georgia

N. Matasci iPlant

T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin

Plus many many other people…

UpcomingChallenges(~1200species,~400loci):•  Speciestreees<ma<onunderthemul<-speciescoalescentfromhundredsofconflic<nggenetreeson>1000species(wewilluseASTRAL–Mirarabetal.2014,Mirarab&Warnow

2015)•  Mul<plesequencealignmentof>100,000sequences(withlotsoffragments!)–wewilluseUPP(Nguyenetal.,2015)

Constructing the Tree of Life: Hard Computational Problems

NP-hardproblemsLargedatasets

100,000+sequencesthousandsofgenes

“Bigdata”complexity:

modelmisspecifica<onfragmentarysequenceserrorsininputdatastreamingdata

Research Strategies

Improvedalgorithmsthrough:•  Divide-and-conquer•  “Bin-and-conquer”•  Iteration•  Bayesiansta<s<cs• HiddenMarkovModels• Graphtheory•  Combinatorialop<miza<on

Sta<s<calmodellingMassiveSimula<onsHighPerformanceCompu<ng

Evolu<oninformsabouteverythinginbiology

•  Biggenomesequencingprojectsjustproducedata------sowhat?

•  Evolu<onaryhistoryrelatesallorganismsandgenes,andhelpsusunderstandandpredict–  interac<onsbetweengenes(gene<cnetworks)–  drugdesign–  predic<ngfunc<onsofgenes–  influenzavaccinedevelopment–  originsandspreadofdisease–  originsandmigra<onsofhumans

Metagenomics: Venter et al., Exploring the Sargasso Sea:

Scientists Discover One Million New Genes in Ocean Microbes

Metagenomic data analysis NGS data produce fragmentary sequence data Metagenomic analyses include unknown

species Taxon identification: given short sequences,

identify the species for each fragment Applications: Human Microbiome Issues: accuracy and speed

MihaiPop,UnivMaryland

Metagenomictaxoniden6fica6on

Objec<ve:classifyshortreadsinametagenomicsample

PossibleIndo-Europeantree(Ringe,WarnowandTaylor2000)

Anatolian

Albanian

Tocharian

Germanic

Armenian

Italic

Celtic

Baltic Slavic

Vedic Iranian

“PerfectPhylogene<cNetwork” forIENakhlehetal.,Language2005

Anatolian

Albanian

Tocharian

Germanic

Armenian

Baltic Slavic

Vedic Iranian

Italic

Celtic

Grading

•  Homework:25%(onehwdropped)•  Midterm:40%(March30)•  FinalProject:25%(dueMay6)•  CoursePar<cipa<on:10%

Nofinalexam.

HomeworkAssignments

•  HomeworkassignmentsarelistedathHp://tandy.cs.illinois.edu/cs581-2017-hw.htmlandaredueat1PM(inpersonorviaemail)–latehomeworkshavereducedcreditandwillnotbeacceptedayer48hourspastthedeadline.

•  Youareencouragedtoworkwithothersonyourhomework,butyoumustwriteupsolu<onsbyyourselfandindicatewhoyouworkedwithoneachhomework.

CourseSchedule

AdetailedcoursescheduleisathHp://tandy.cs.illinois.edu/cs581-2017-detailed-syllabus.htmlThisscheduleincludesmaterialyouareexpectedtohavelookedatbeforecomingtoclass:•  assignedreading(fromtextbookand/or

scien<ficliterature)•  PPTand/orPDFofmylecture

FinalProjectandClassPresenta<on

Eitherresearchproject(canbewithanotherstudent)orsurveypaper(donebyyourself).Manyinteres<ngandpublishableproblemstoaddress:seehHp://tandy.cs.illinois.edu/topics.htmlYourclasspresenta<onshouldberelatedtoyourfinalproject.

AcademicIntegrity

PleaseseecoursewebsiteathHp://tandy.cs.illinois.edu/581-2017.htmlandalsohHp://tandy.cs.illinois.edu/ethics.pdfForthiscourse:•  Examinethepolicyaboutcollabora<on•  Learnandunderstandwhatplagiarismis(and

thendon’tdoit).Thisappliestohomework,allwri<ngassignments,andthefinalproject.

CourseResearchProjects

•  Evalua<ngexis<ngmethodsonsimulatedandreal(biologicalorlinguis<c)datasets

•  Designinganewmethod,andestablishingitsperformance(usingtheoryanddata)

•  Analyzingabiologicaldatasetusingseveraldifferentmethods,toaddressbiology

ExamplesofpublishedcourseprojectsMdS.Bayzid,T.Hunt,andT.Warnow."DiskCoveringMethodsImprovePhylogenomicAnalyses".ProceedingsofRECOMB-CG(Compara<veGenomics),2014,andBMCGenomics2014,15(Suppl6):S7.T.Zimmermann,S.MirarabandT.Warnow."BBCA:Improvingthescalabilityof*BEASTusingrandombinning".ProceedingsofRECOMB-CG(Compara<veGenomics),2014,andBMCGenomics2014,15(Suppl6):S11.J.Chou,A.Gupta,S.Yaduvanshi,R.Davidson,M.Nute,S.MirarabandT.Warnow.“Acompara<vestudyofSVDquartetsandothercoalescent-basedspeciestreees<ma<onmethods”.RECOMB-Compara<veGenomicsandBMCGenomics,2015.,2015,16(Suppl10):S2.P.Vachaspa<andT.Warnow(2016).FastRFS:FastandAccurateRobinson-FouldsSupertreesusingConstrainedExactOp<miza<onBioinforma<cs2016;doi:10.1093/bioinforma<cs/btw600.(SpecialissueforpapersfromRECOMB-CG)

ResearchProjectsyoucouldjoin•  Phylogenomicsprojects(Avianandthe1KP)

•  Speciestreeandnetworkes<ma<onfromconflic<nggenes•  Large-scalemul<plesequencealignment•  Large-scalemaximumlikelihoodtreees<ma<on•  Improvinggenetreees<ma<onusingwholegenomes

•  Metagenomics(withMihaiPop,UniversityofMaryland,andBillGropp)•  Iden<fyinggenesandtaxafromshortsequences•  Metagenomicassembly•  Applica<onstoclinicaldiagnos<cs

•  ProteinSequenceAnalysis(withJianPeng)•  Whatfunc<onandstructuredoesthisproteinhave?•  Howdidstructureandfunc<onevolve?

•  HistoricalLinguis<cs(withDonaldRinge,UPenn)•  HowdidIndo-Europeanevolve?•  Designingandimplemen<ngsta<s<cales<ma<onmethodsfor

languagephylogenies

cs 581 / bioe 540: algorithmic computaonal genomicstandy.cs.illinois.edu/581-intro-to-course.pdf ·...

Documents

basics&of&computaonal& chemistry&

computational sprinting june...

webmo: auserfriendly computaonal&plaorm&for& chemical

bioe 403 lab 5

postgraduate study bioe 1010 professional development

phys20762&...

computaonal+modelling+of+ verbs+in+dene+languages+ ·...

bioe 162 final review 2014 ho

bioe 110 - hgh (1)

download the bioe undergraduate handbook (pdf)

international seventh edition bioe errustrv

computaonal linguiscs - the stanford natural language

cs 598: computaonal topology spring 2013 - github...

proposal to the senate educational policy committee ·...

computaonal challenges and needs for academic and

bioe 4420 – senior design

bioe 505: computational bioengineering

computaonal methods for data integraon...computaonal methods...

bioe 301 lecture twelve

ku bioe program highlights › ~irsurvey › hlc2015 ›...