problem definition



DNA Barcoding: Preliminary Studies with Avian Influenza Virus

J. Duitama1,*, D.M. Kumar2,*, E. Hemphill3, S. Babapoor2, I.I. Mandoiu1, C. Nelson3, and M. Khan2

1 Computer Science & Engineering, 2 Pathobiology & Veterinary Science, 3 Molecular & Cell Biology, U. of Connecticut, * Contributed equally to this work

• Avian influenza belongs to the influenza type A genus of the Orthomyxoviridae family of RNA viruses. It is a highly mutable virus, with the Haemagglutinin (HA) and the Neuraminidase (NA) genes being the most variable. To date, 16 HA and 9 NA subtypes have been identified.

• There has been much recent work on developing rapid methods for detection and subtype identification of avian influenza infections. Nucleic acid based analyses such as the Polymerase Chain Reaction (PCR) are becoming the method of choice, largely replacing the much more labor- and time-consuming serotyping techniques.

• Common primer design packages such as Primer3 [6] are not well suited for designing PCR primers for subtype identification, since they seek to amplify just one known target sequence, not an unknown target from the set of highly variable sequences that comprise a subtype.

• The high sequence variability within subtypes also renders unfeasible “common substring” approaches such as [2, 5].

• Methods for degenerate primer selection such as [3, 8] can be used to ensure amplification of a large fraction of known sequences of a given subtype, but they ignore primer specificity, i.e., preventing amplification of closely related virus subtypes.

• As in [2, 5], our tool takes as input sets of both target and non-target sequences. However, instead of searching for substrings shared by the target sequences as in [2, 5, 8], or for highly conserved regions in a multiple alignment of the target sequences as in [3], PrimerHunter ensures that selected primers amplify all target sequences and none of the non-target sequences by relying on accurate melting temperature computations based on the nearest-neighbor model of [7] and the fractional programming algorithm of [4].

• The open source C++ code, released under the GNU General Public License, as well as a web server for PrimerHunter are available at http://dna.engr.uconn.edu/software/PrimerHunter/

Problem Definition

Notations

• For a sequence s, we denote by |s| its length, and by s(l,i) the subsequence of length l ending at position i, i.e., s(l,i) = s_{i-l+1} … s_{i-1} s_i.

• We denote by T(p,t,i) the melting temperature of the duplex formed by a primer p and the Watson-Crick complement of t(|p|,i). In order to ensure sensitive amplification of target sequences, we require each selected primer p to have at least one position i within each target t such that T(p,t,i) is greater than or equal to a user-specified threshold T_min_target. To avoid non-specific amplification, we further require a melting temperature T(p,t,i) below a user-specified threshold T_max_nontarget at every position i of every non-target sequence t.

• Since mismatches at the 3’ end of the primer can significantly reduce amplification efficiency, we additionally require that the 3’ end of p match t(|p|,i) perfectly at a set of bases specified using a 0-1 perfect match mask M. For example, a mask M = 3’-1101-5’ specifies that the first, second, and fourth 3’-most bases of the primer must be matched exactly. For a primer p and a target sequence t, we denote by I(p,t,M) the set of positions i of t at which the 3’ end of p matches t(|p|,i) according to M. Thus, in order to ensure sensitive PCR amplification of target sequences, we require that a selected primer p have, for every target t, at least one position i ∈ I(p,t,M) for which T(p,t,i) ≥ T_min_target.
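To make the mask notation concrete, the following minimal sketch enumerates I(p,t,M) by direct scanning. It assumes that the mask is given as a string of 0/1 characters read from the 3’ end of the primer (so "1101" stands for M = 3’-1101-5’), that positions are 1-based as in the notation above, and that p is compared base by base against the window t(|p|,i). The function name mask_positions and the toy sequences are hypothetical and are not taken from PrimerHunter.

```python
def mask_positions(p: str, t: str, mask: str) -> list[int]:
    """Return the 1-based end positions i at which the 3' end of p
    matches t(|p|, i) on every base the mask marks with '1'.

    mask[0] refers to the 3'-most base of p, mask[1] to the next one,
    and so on (an assumed encoding of M = 3'-...-5').
    """
    n = len(p)
    if len(mask) > n:
        raise ValueError("mask must not be longer than the primer")
    hits = []
    for i in range(n, len(t) + 1):      # i is the 1-based end position
        window = t[i - n:i]             # this is t(|p|, i)
        ok = all(
            m == "0" or p[n - 1 - k] == window[n - 1 - k]
            for k, m in enumerate(mask)
        )
        if ok:
            hits.append(i)
    return hits


# Hypothetical toy data: under mask 1101 the third 3'-most base is free,
# so the window AAGT (ending at position 10) is accepted as well as ACGT.
print(mask_positions("ACGT", "TTACGTAAGTCC", "1101"))  # prints [6, 10]
```

With the stricter mask "1111" only the exact occurrence ending at position 6 would remain.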

Primer design flowchart (box labels):

• Design forward primers

• Design reverse primers

• Make pairs, filtering by product length, cross-dimerization, and Tm

• Iterate over targets to build a hash table H of occurrences of seed patterns according to mask M

• Build candidates as suitable-length substrings of one or more target sequences

• Test each candidate p:

– Test GC content, GC clamp, single base repeat, and self-complementarity

– For each target t, use H to build I(p,t,M) and test whether there exists i ∈ I(p,t,M) with T(p,t,i) ≥ T_min_target

– For each complement of a non-target t’, test whether T(p,t’) < T_max_nontarget
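The hash table step of the flowchart can be sketched as follows. This is only an illustration under stated assumptions: the mask is a 0/1 string read from the 3’ end, a "seed pattern" is taken to be the bases found at the positions the mask marks with 1, and the table maps each pattern to (target index, 1-based end position) pairs. The names seed_key, build_seed_table, and candidate_positions are hypothetical; the actual data structure used by PrimerHunter may differ.

```python
from collections import defaultdict


def seed_key(window: str, mask: str) -> str:
    """Bases at the 3' end of `window` at the positions the mask marks '1'.
    mask[0] refers to the 3'-most base (assumed encoding)."""
    return "".join(window[-1 - k] for k, m in enumerate(mask) if m == "1")


def build_seed_table(targets: list[str], mask: str) -> dict:
    """Map each seed pattern to the (target index, 1-based end position)
    pairs at which it occurs in the targets."""
    table = defaultdict(list)
    for idx, t in enumerate(targets):
        for i in range(len(mask), len(t) + 1):
            table[seed_key(t[:i], mask)].append((idx, i))
    return table


def candidate_positions(p: str, table: dict, mask: str, target_idx: int) -> list[int]:
    """Recover I(p,t,M) for one target with a single table lookup.
    Positions i < |p| are dropped because p would overhang the target."""
    key = seed_key(p, mask)
    return [i for idx, i in table.get(key, []) if idx == target_idx and i >= len(p)]


# Hypothetical toy targets; the table is built once and reused for every candidate.
H = build_seed_table(["TTACGTAAGTCC", "GGACGT"], "1101")
print(candidate_positions("ACGT", H, "1101", 0))
print(candidate_positions("ACGT", H, "1101", 1))
```

The point of the design is that the table is built once per mask, so each candidate primer costs one lookup per target instead of a full scan.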

Given

• Sets TARGETS and NONTARGETS of 5’ – 3’ DNA sequences, perfect match mask M, melting temperature thresholds T_min_target and T_max_nontarget, and constraints on primer length, GC content, self-complementarity, etc.

Find

• Primers p satisfying the given constraints on primer length, GC content, self-complementarity, etc., such that:

– For every t ∈ TARGETS, there exists i ∈ I(p,t,M) such that T(p,t,i) ≥ T_min_target, and

– For every t ∈ NONTARGETS, T(p,t,i) < T_max_nontarget for every i ∈ {|p|, …, |t|}
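The two conditions in the problem statement translate directly into a predicate over a candidate primer. The sketch below takes the melting temperature function as a parameter. The toy_tm function used in the example is only a stand-in (a rule-of-thumb score of 2 per matching A/T and 4 per matching G/C in an ungapped comparison) and is not the nearest-neighbor computation the poster relies on. All function names, the mask encoding (0/1 string read from the 3’ end), and the toy sequences are assumptions made for illustration.

```python
def mask_positions(p, t, mask):
    """1-based end positions i where the 3' end of p matches t(|p|, i)
    at the bases marked '1' in the mask (mask[0] = 3'-most base)."""
    n = len(p)
    return [
        i for i in range(n, len(t) + 1)
        if all(m == "0" or p[n - 1 - k] == t[i - 1 - k] for k, m in enumerate(mask))
    ]


def toy_tm(p, t, i):
    """Stand-in score, NOT a thermodynamic model: 2 per matching A/T and
    4 per matching G/C between p and the window t(|p|, i)."""
    window = t[i - len(p):i]
    return sum(4 if a in "GC" else 2 for a, b in zip(p, window) if a == b)


def is_valid_primer(p, targets, nontargets, mask, tm, t_min_target, t_max_nontarget):
    # Sensitivity: every target needs a mask-compatible position that is hot enough.
    for t in targets:
        if not any(tm(p, t, i) >= t_min_target for i in mask_positions(p, t, mask)):
            return False
    # Specificity: every position of every non-target must stay below the threshold.
    for t in nontargets:
        if any(tm(p, t, i) >= t_max_nontarget for i in range(len(p), len(t) + 1)):
            return False
    return True


# Hypothetical toy instance; thresholds are in the arbitrary units of toy_tm.
targets = ["GGACGTACGG", "TTACTTACTT"]
print(is_valid_primer("ACGTAC", targets, ["TTTTTTTTTT"], "11", toy_tm, 14, 14))
print(is_valid_primer("ACGTAC", targets, ["GGACGTAGGG"], "11", toy_tm, 14, 14))
```

In this toy instance the first call accepts the primer and the second rejects it, because the second non-target contains a window that differs from the primer only at the 3’-most base and therefore reaches the threshold.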

Input Sequences

• We used all complete Avian Influenza HA sequences from North America available in the NCBI flu database [1] as of March 2008 (a total of 574 sequences spanning the 14 subtypes shown in the adjacent phylogenetic tree).

• For each subtype Hi, we used all available Hi sequences as targets, and all sequences with a different subtype as non-targets.
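The one-versus-rest setup described above can be sketched in a few lines. The record format (a list of (subtype label, sequence) pairs), the function name split_by_subtype, and the toy sequences are all hypothetical; how the real sequences were parsed from the database is not stated in the poster.

```python
def split_by_subtype(records, subtype):
    """Split labeled sequences into targets (the given subtype) and
    non-targets (every other subtype)."""
    targets = [seq for label, seq in records if label == subtype]
    nontargets = [seq for label, seq in records if label != subtype]
    return targets, nontargets


# Hypothetical toy records standing in for labeled HA sequences.
records = [("H1", "ACGT"), ("H3", "GGCC"), ("H1", "ACGA"), ("H5", "TTAA")]
for subtype in sorted({label for label, _ in records}):
    targets, nontargets = split_by_subtype(records, subtype)
    # Targets and non-targets always partition the full input set.
    print(subtype, len(targets), len(nontargets))
```

Because each design run partitions the same input set, the target and non-target counts for any subtype add up to the total number of sequences.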

[Figure: phylogenetic tree of the 14 HA subtypes, H1 to H13 and H16]

Parameters

• Primer length 20 - 25

• Amplicon length 75 - 200

• GC content 25% - 75%

• Maximum mononucleotide repeat 5

• Mask M = 11

• No required 3’ GC clamp

• Primer concentration 0.8 μM

• Salt concentration 50 mM

• T_min_target = T_max_nontarget = 40 °C
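The sequence-level parameters above (length, GC content, mononucleotide repeat) can be checked with a simple filter before any melting temperature is computed. The sketch below uses the listed values as defaults. The function names are hypothetical, and it assumes that "maximum mononucleotide repeat 5" means runs of up to five identical bases are allowed.

```python
def gc_fraction(p: str) -> float:
    """Fraction of G and C bases in the primer."""
    return sum(b in "GC" for b in p) / len(p)


def longest_run(p: str) -> int:
    """Length of the longest run of one repeated base."""
    best = run = 1
    for a, b in zip(p, p[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best


def passes_basic_filters(p, min_len=20, max_len=25,
                         min_gc=0.25, max_gc=0.75, max_repeat=5):
    # The length test comes first, so an empty string never reaches gc_fraction.
    return (min_len <= len(p) <= max_len
            and min_gc <= gc_fraction(p) <= max_gc
            and longest_run(p) <= max_repeat)


# Hypothetical candidates.
print(passes_basic_filters("ACGTACGTACGTACGTACGT"))   # 20 bases, 50% GC, no repeats
print(passes_basic_filters("AAAAAAACGTACGTACGTAC"))   # run of seven A's
```

Cheap filters of this kind are typically applied first so that the more expensive thermodynamic tests run only on candidates that survive them.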

Results

• Suitable pairs for subtype-specific PCR amplification were found for every Hi (Table 1).

• The number of primers and pairs found decreases as the variability in the targets increases.

• Primer design was performed under the same conditions for the nine known NA subtypes, finding suitable pairs for every Ni (Table 2).

Subtype  Targets  Non-targets  Forward primers (FP)  Reverse primers (RP)  Pairs
H1       48       526          51                    52                    70
H2       41       533          42                    43                    187
H3       72       502          41                    61                    135
H4       67       507          265                   225                   3724
H5       69       505          68                    66                    160
H6       100      474          36                    27                    3
H7       55       519          77                    81                    260
H8       9        565          489                   482                   14415
H9       23       551          140                   152                   1222
H10      16       558          243                   302                   3712
H11      45       529          267                   262                   4117
H12      15       559          472                   494                   12895
H13      10       564          41                    33                    98
H16      4        570          367                   352                   7629

Table 1: Number of primer pairs for HA subtypes of Avian Influenza

Subtype  Targets  Non-targets  Forward primers (FP)  Reverse primers (RP)  Pairs
N1       110      578          97                    71                    553
N2       241      447          77                    44                    234
N3       65       623          45                    61                    113
N4       15       673          370                   360                   9665
N5       32       656          355                   353                   8380
N6       77       611          29                    43                    7
N7       22       666          97                    103                   480
N8       84       604          140                   211                   1785
N9       42       646          292                   305                   6310

Table 2: Number of primer pairs for NA subtypes of Avian Influenza

• Thermodynamic Alignment Problem: Given a DNA sequence s1 in 5’ – 3’ orientation and a DNA sequence s2 in 3’ – 5’ orientation, find the pairwise alignment of s1 and s2 that maximizes the melting temperature according to SantaLucia’s model [7].

• Fractional Programming: Given a finite set S and two functions f, g: S → ℝ, if g is positive, t = f(y)/g(y) for some y ∈ S, and max_{x∈S} (f(x) − t·g(x)) = 0, then t = max_{x∈S} (f(x)/g(x)). If f(x) − t·g(x) can be maximized efficiently over S, an iterative process can be applied to find argmax_{x∈S} (f(x)/g(x)) [4].

• Tm Calculation Using Fractional Programming: S is the set of all possible alignments between s1 and s2. PrimerHunter uses the melting temperature formula of [7], which includes salt concentration effects not considered in [4]:

$$T_m(x) = \frac{\Delta H(x)}{\Delta S(x) + 0.368 \cdot \frac{N}{2} \cdot \ln[\mathrm{Na}^+] + R \ln C}$$

where $C = c_1 - c_2/2$ if $c_1 \neq c_2$ and $C = (c_1 + c_2)/4$ if $c_1 = c_2$
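As a worked instance of the formula above, the sketch below evaluates Tm for given ΔH and ΔS values. Several points are assumptions not stated on the poster: ΔH is taken in cal/mol and ΔS in cal/(K·mol), concentrations are in mol/L, R = 1.987 cal/(K·mol), the quotient is in Kelvin and is converted to degrees Celsius, N is passed in as a plain number, and for unequal strand concentrations c1 is taken to be the larger one. The ΔH, ΔS, and N values in the example are hypothetical; only the 0.8 μM primer concentration and the 50 mM salt concentration come from the parameter list.

```python
import math

R = 1.987  # gas constant in cal/(K*mol); assumed unit system


def effective_concentration(c1: float, c2: float) -> float:
    """C as defined under the formula: (c1+c2)/4 for equal strand
    concentrations, otherwise larger minus half of smaller (assumed order)."""
    if c1 == c2:
        return (c1 + c2) / 4
    hi, lo = max(c1, c2), min(c1, c2)
    return hi - lo / 2


def melting_temperature_celsius(dh, ds, n, na, c1, c2):
    """Evaluate Tm = dH / (dS + 0.368 * N/2 * ln[Na+] + R ln C), in Celsius."""
    denom = (ds
             + 0.368 * (n / 2) * math.log(na)
             + R * math.log(effective_concentration(c1, c2)))
    return dh / denom - 273.15


# Hypothetical duplex: dH = -150 kcal/mol, dS = -400 cal/(K*mol), N = 40,
# at 50 mM salt and 0.8 uM of each strand.
tm = melting_temperature_celsius(-150000.0, -400.0, 40, 0.05, 0.8e-6, 0.8e-6)
print(round(tm, 1))
```

Raising the salt concentration toward 1 M makes the salt term vanish and yields a higher Tm for the same duplex, which is the effect the correction is meant to capture.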

• The figure below shows the distribution of differences (in degrees Celsius) between experimental melting temperatures and the predictions obtained by fractional programming, without salt correction [4] and with the salt correction of [7], for 812 duplexes without mismatches.
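The iterative process mentioned in the fractional programming item can be illustrated on a small finite set. In this sketch the inner maximization of f(x) − t·g(x) is done by brute force over a list, which is only feasible for toy inputs. For the set of all alignments, the inner step would presumably be carried out by an alignment algorithm instead, as in [4]. The function name dinkelbach_argmax, the tolerance, and the toy data are assumptions.

```python
def dinkelbach_argmax(items, f, g, tol=1e-9, max_iter=100):
    """Find x in `items` maximizing f(x)/g(x), assuming g(x) > 0 for all x.

    Repeats: t = f(y)/g(y); x = argmax f(x) - t*g(x); stop once that
    maximum is (numerically) zero, otherwise continue from x.
    """
    y = items[0]
    for _ in range(max_iter):
        t = f(y) / g(y)
        x = max(items, key=lambda z: f(z) - t * g(z))
        if f(x) - t * g(x) <= tol:
            return y, t          # t is now the maximum ratio
        y = x                    # strictly better ratio; iterate again
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical toy set: each item is a pair (numerator, denominator).
pairs = [(3, 2), (5, 4), (4, 2), (1, 1)]
best, ratio = dinkelbach_argmax(pairs, f=lambda x: x[0], g=lambda x: x[1])
print(best, ratio)  # prints (4, 2) 2.0
```

On the toy data the first pass starts from ratio 1.5, finds that (4, 2) has a positive value of f − t·g, and the second pass confirms that no item exceeds ratio 2.0.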

• Three primer pairs were selected for H3, H5 and H7 subtypes, and HA segments from isolates of these 3 subtypes were cloned into plasmids.

• Real-time PCR was performed with each combination of primer pair and plasmid type. Amplification curves for an H3-specific primer against H3, H5 and H7 templates, and the minimum, maximum, and average difference in Ct value compared to a no-template control for H3-, H5- and H7-specific primers, are presented below.

• PrimerHunter is a new tool to design primers for subtype identification via PCR experiments.

• Accurate melting temperature estimates allowing for mismatches are obtained using the nearest-neighbor model of [7] and the fractional programming approach of [4]; these are critical for achieving high primer sensitivity and specificity for rapidly mutating viruses.

• Primers have been successfully designed for 14 Avian Influenza subtypes, and have been validated using real-time PCR experiments with virus isolates cloned into plasmids.

• In ongoing work we seek to incorporate group testing techniques for reducing the number of reactions needed for unambiguous subtype identification.

• The open source C++ code, released under the GNU General Public License, as well as a web server for PrimerHunter are available at http://dna.engr.uconn.edu/software/PrimerHunter/

• Triplicate real-time PCR reactions were performed with ten different dilutions of each plasmid type. The plot below gives both on-target and off-target delta Ct values at different plasmid dilutions for H3-, H5- and H7-specific primer pairs.

[Poster panel titles: Introduction, Algorithmic Details, Experimental Validation, Primer Design, Conclusions]