
MUSCLE (Edgar 2004a,b)

[0] k–mer distance estimation for unaligned sequences

[1] distance (UPGMA) guide tree generated

[2] pairwise global alignment down tree

[a] consensus (profile) constructed

[b] insertions propagated up tree

[3] K2P distances calculated

[4] back to [1] (once)

[5] pairwise global alignment down tree (like [2])

=> sum of pairs used to accept/reject realignment

pairwise global alignment
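The sum-of-pairs accept/reject step can be sketched as follows. This is a minimal illustration, not MUSCLE's implementation: the match/mismatch values stand in for a substitution matrix, the gap parameters are arbitrary, and the names `pair_score`, `sp_score` and `accept_realignment` are hypothetical. Columns in which both rows carry an indel are discarded before a pair is scored, and each remaining gap is charged a per-gap cost plus a per-indel extension cost.

```python
from itertools import combinations

def pair_score(a, b, match=1.0, mismatch=-1.0, gap_open=-2.0, gap_extend=-0.5):
    """Score one pair of aligned rows (all scoring values are illustrative)."""
    # Discard columns in which both sequences have an indel.
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    score = 0.0
    prev_side = None  # row (0 or 1) that carried an indel in the previous column
    for x, y in cols:
        if x == "-" or y == "-":
            side = 0 if x == "-" else 1
            if side != prev_side:
                score += gap_open      # a new gap starts here
            score += gap_extend        # every indel in the gap adds to its length cost
            prev_side = side
        else:
            score += match if x == y else mismatch
            prev_side = None
    return score

def sp_score(alignment, **scoring):
    """Sum of pair scores over all pairs of rows in the alignment."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(alignment, 2))

def accept_realignment(old, new, **scoring):
    """Keep the new alignment only if its SP score is higher."""
    return sp_score(new, **scoring) > sp_score(old, **scoring)
```

A real scorer would replace `match`/`mismatch` with a lookup in a substitution matrix such as one of the BLOSUM or PAM series.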

BMC Bioinformatics 2004, 5:113 http://www.biomedcentral.com/1471-2105/5/113


number of changed nodes has not decreased, the process of improving the tree is considered to have converged and iteration terminates.

Progressive alignment
A new progressive alignment is built. The existing alignment is retained of each subtree for which the branching order is unchanged; new alignments are created for the (possibly empty) set of changed nodes. When the alignment at the root is completed, the algorithm may terminate, return to step 2.1 or go to Stage 3.

Stage 3: refinement
The third stage performs iterative refinement using a variant of tree-dependent restricted partitioning [12].

Choice of bipartition
An edge is deleted from the tree, dividing the sequences into two disjoint subsets (a bipartition). Edges are visited in order of decreasing distance from the root.

Profile extraction
The profile (multiple alignment) of each subset is extracted from the current multiple alignment. Columns containing no residues (i.e., indels only) are discarded.

Re-alignment
The two profiles obtained in step 3.2 are re-aligned to each other using profile-profile alignment.

Accept/reject
The SP score of the multiple alignment implied by the new profile-profile alignment is computed. If the score increases, the new alignment is retained, otherwise it is discarded. If all edges have been visited without a change being retained, or if a user-defined maximum number of iterations has been reached, the algorithm is terminated, otherwise it returns to step 3.1. Visiting edges in order of decreasing distance from the root has the effect of first re-aligning individual sequences, then closely related groups

Algorithm elements
In the following, we describe the elements of the MUSCLE algorithm. In several cases, alternative versions of these elements were implemented in order to investigate their relative performance and to offer different trade-offs between accuracy, speed and memory use. Most of these alternatives are made available to the user via command-line options. Four benchmark datasets have been used to evaluate options and parameters in MUSCLE: BAliBASE [10,11], SABmark [15], SMART [16-18] and our own benchmark, PREFAB [2].

Objective score
In its refinement stage, MUSCLE seeks to maximize an objective score, i.e. a function that maps a multiple sequence alignment to a real number which is designed to give larger values to better alignments. MUSCLE uses the sum-of-pairs (SP) score, defined to be the sum over pairs of sequences of their alignment scores. The alignment score of a pair of sequences is computed as the sum of substitution matrix scores for each aligned pair of residues, plus gap penalties. Gaps require special consideration (Figure 3). We use the term indel for the symbol that indicates a gap in a column (typically a dash '-'), reserving the term gap for a maximal contiguous series of indels. The gap penalty contribution to SP for a pair of sequences is computed by discarding all columns in which both sequences have an indel, then applying an affine penalty g + λe for each remaining gap where g is the per-gap penalty, λ is the

Figure 1: Progressive alignment. Sequences are assigned to the leaves of a binary tree. At each internal (i.e., non-leaf) node, the two child profiles are aligned using profile-profile alignment (see Figure 2). Indels introduced at each node are indicated by shaded background.

[Figure 1 example. Leaves: MQTIF, LHIW, LQSW, LSF. Internal nodes: MQTIF / LH-IW and LQSW / L-SF. Root: MQTIF / LH-IW / LQS-W / L-S-F]

Figure 2: Profile-profile alignment. Two profiles (multiple sequence alignments) X and Y are aligned to each other such that columns from X and Y are preserved in the result. Columns of indels (gray background) are inserted as needed in order to align the columns to each other. The score for aligning a pair of columns is determined by the profile function, which should assign a high score to pairs of columns containing similar amino acids.

[Figure 2 example. X: MQTF / LHTW / LQSW. Y: LTIF / MTIW. Result: MQT-F / LHT-W / LQS-W / L-TIF / M-TIW]

Edgar (2004)


Its space complexity is basically O(N^2) + O(L^2) + O(NL). When the sequence length exceeds the threshold (set as 10 000 residues at present), FFT-NS-2 automatically switches the DP algorithm to a memory saving one [54] and the space complexity becomes O(N^2) + O(NL). On a current desktop computer, this method can be applied to an MSA consisting of up to ~10 000 sequences. The maximum length depends on the similarity level: ~10 000 residues for distantly related sequences or ~500 000 residues for closely related sequences with global homology.

The progressive method has a drawback in that once a gap is incorrectly introduced at a step, the gap is never removed in later steps. To overcome this drawback, there are two types of solutions, the iterative refinement method [55–61] and the consistency-based method [25, 62–64]. These two procedures are quite different: the former tries to correct mistakes in the initial alignment, whereas the latter tries to avoid mistakes in advance, but both work well to improve the alignment accuracy.

Iterative refinement method with the WSP score: FFT-NS-i
In the iterative refinement method, an objective function that represents the ‘goodness’ of the MSA is explicitly defined. An initial MSA, calculated by the progressive or another method, is subjected to an iterative process and is gradually modified so that the objective function is maximized, as shown in Figure 1B. Various combinations of objective functions and optimization strategies have been proposed to date [55–61]. Among them, Gotoh’s iterative refinement method, PRRN [16], is the most successful one, and it forms the basis of recent methods, including MAFFT, MUSCLE [23, 65] and PRIME [66]. The iterative alignment option of MAFFT, called FFT–NS–i, uses the weighted sum–of–pairs (WSP) objective function [24]. As shown in Figure 1B, an MSA is partitioned into two groups, which are then realigned using an approximate group-to-group alignment algorithm [20]. The new MSA replaces the old one if it has a higher score. This process is repeated until no more improvements

Figure 1: Calculation procedures of the progressive method (A) and the iterative refinement method (B).

[Panel A labels: unaligned sequences, all-to-all comparisons, distance matrix 1, tree 1, alignment 1, distance matrix 2, tree 2, alignment 2. Panel B labels: initial alignment, tree-dependent partitioning, group-to-group alignment, replace if better]


Katoh and Toh (2008)

MAFFT (Katoh and Toh 2008)

(too) many different algorithms available

uses variants of sum of pairs or COFFEE scoring

can use local or global alignment

can use structural pairwise alignments

good for low similarity sequences

‘program’ is really a large shell script that dispatches to a variety of special purpose programs

restricts access to some algorithms by alignment size

can be overridden by modifying the script

Nearest Alignment Space Termination (NAST)

DeSantis et al. (2006), Caporaso et al. (2010)

builds a multiple sequence alignment from a template

for each new sequence:

BLAST (etc.) to find most similar template sequence

pairwise alignment of template and new sequence

insert into template without introducing insertions

can cause local mis–alignments (or worse)

primarily used for identification (DNA barcoding, etc.)

other better options (i.e. identification algorithms)

translatorX (Abascal et al. 2010)

[1] translates nucleotides to amino acids (standard tables)

[2] aligns amino acids using an external program

can be manually edited

can be aligned using an ‘unsupported’ program

[3] reverse translates back to the original nucleotides

removes incomplete codons from the ends

has difficulty with long strings of ambiguous nucleotides

useful for difficult to align coding regions
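Step [3] can be sketched as threading the original codons back onto the aligned amino acids. This is an illustrative sketch only, not translatorX's code; `back_translate` is a hypothetical name, the gap character is assumed to be '-', and any incomplete codon at the 3′ end is dropped as the slide describes.

```python
def back_translate(aligned_protein, nucleotides):
    """Replace each aligned residue with its original codon and each gap with '---'."""
    usable = len(nucleotides) - len(nucleotides) % 3   # drop a trailing incomplete codon
    codons = [nucleotides[i:i + 3] for i in range(0, usable, 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(codons) != n_residues:
        raise ValueError("protein and nucleotide sequence lengths do not agree")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)
```

For example, `back_translate("M-K", "ATGAAAG")` yields `"ATG---AAA"`: the amino acid gap becomes a three-base gap and the dangling `G` is removed.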

sequence quality

base–by–base error probability for base–calling programs

reflects assay bias (e.g. detection chemistry, algorithms)

allows for more efficient sequence editing and assembly

allows for ‘poorly supervised’ automation

base calling: PHRED...

Ewing et al. (1998), Ewing and Green (1998)

the ‘standard’ open base–caller for ABI BigDye chemistry

works with other chemistries also

more ABI training data => best for ABI

ABI’s KB base–caller is good (better), but closed source

other base–callers for other chemistries (e.g. LifeTrace)

most algorithmic differences among programs are minor

differences are mostly a result of different training data

algorithms are empirically derived (i.e. kluges)

...base calling: PHRED...

[1] calculate ideal peak locations

assumes relatively even spacing

chromatograms converted from log to linear

[2] locate peaks in trace data

[3] compare ideal and actual peaks (align)

merge and split peaks based on ideal peaks

call bases using signal intensity

[4] call ambiguous bases

near equal signal for multiple bases

PHRED: ideal peak locations...

[1a] preliminary peaks for each dye color

the maximum value between a pair of inflection points

midpoint is used if there is no maximum

must be 10% above previous peak (background)

[1b] synthetic trace of preliminary peaks (all dye colors)

height = 1, width = 1/4 local peak–to–peak distance

[1c] sliding window: each peak ± 200 scans

calculate mean scaled standard deviation of peak–to–peak distance

<0.45 == good spacing

...PHRED: ideal peak locations

[1d] select starting point

window of lowest mean scaled standard deviation

work right to end, then left to start

[i] construct a ‘damped’ synthetic trace at the current position

Fourier transform the synthetic trace

i.e. fit to a sine wave function

[ii] if mean scaled standard deviation >0.45 => force average spacing; else: modify fit based on direction (left or right) and other kluges
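The spacing test used in [1c] and [ii] can be sketched as below. The sketch assumes that "mean scaled standard deviation" means the standard deviation of the peak-to-peak distances divided by their mean; that reading, and the function names, are assumptions rather than PHRED's documented definition.

```python
from statistics import mean, pstdev

def spacing_score(peak_positions):
    """Standard deviation of peak-to-peak distances, scaled by their mean."""
    spacings = [b - a for a, b in zip(peak_positions, peak_positions[1:])]
    if len(spacings) < 2:
        raise ValueError("need at least three peaks to judge spacing")
    return pstdev(spacings) / mean(spacings)

def has_good_spacing(peak_positions, threshold=0.45):
    """Apply the slide's threshold: below 0.45 counts as good spacing."""
    return spacing_score(peak_positions) < threshold
```

Evenly spaced peaks score 0; a window with compressed or missing peaks scores higher and would fall back to the average spacing.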

PHRED: locate peaks

[2a] for each dye color search original trace for ‘concave’ regions

sum fluorescence signal for each scan to estimate peak ‘area’ (area under the curve)

[2b] accept peaks that are at least 10% bigger than the average of the 10 previous peaks and 5% larger than the previous peak

peak location == geometric center

PHRED: ideal vs. actual peaks

[3a] align ideal and actual peaks

similar to a sequence alignment algorithm

[3b] call exact matches (highest intensity signal)

[3c] call large shifted peaks (>0.2 relative average area)

[3d] call small shifted peaks (>0.1 relative average area)

[3e] remaining uncalled peaks are either called, or saved as ‘best uncalled peak’

if no signal predominates, called as ‘N’

PHRED: call ambiguous bases

[4] any peak not assignable to a predicted peak is called provided:

[4a] it is the strongest signal at a given scan

[4b] >10% above background

[4c] is unsplit (i.e. is just one peak)

[4d] is flanked by called peaks

[4e] adding the peak improves local peak spacing

error probabilities: PHRED

calculates probabilities using a local window

able to distinguish between ‘good’ and ‘bad’ regions

not able to distinguish overall ‘good’ from ‘bad’

outputs log probabilities

e.g. q = -10 × log10(p) [p = 0.001; q = 30]

predicts quality by measuring peak properties

similar to linear discriminant analysis

without assumption of normality (data are not normal)
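The log-scaled quality value on this slide converts to and from an error probability as follows. The function names are illustrative; only the formula q = -10 × log10(p) comes from the slide.

```python
import math

def error_prob_to_q(p):
    """Quality value for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)

def q_to_error_prob(q):
    """Error probability implied by a quality value q."""
    return 10.0 ** (-q / 10.0)

def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given its quality values."""
    return sum(q_to_error_prob(q) for q in qualities)
```

A read of 100 bases that all have q = 20 is therefore expected to contain about one error.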

sequence error probabilities

PHRED: signs of error

(a) peak spacing (7–peak window)

(b) height of largest uncalled peak relative to smallest called peak (7–peak window)

(c) height of largest uncalled peak relative to smallest called peak (3–peak window)

(d) distance from the nearest unresolved base (•-1)

PHRED: threshold values

need training set (i.e. resequence known regions)

usually calculated from plasmid sequences

not directly comparable to PCR products

produce a lookup table for q = 1–50

compute empirical error rate for each parameter

new sequence versus known sequence

can be generated for any sequencing technology
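Building the lookup table amounts to comparing calls with the known sequence and converting the observed error rate in each predictor bin to a quality value. The sketch below assumes bases have already been binned by their peak parameters; the bin labels, the add-one smoothing (used here only to avoid log10(0) in error-free bins) and the function name are all assumptions, not PHRED's procedure.

```python
import math
from collections import defaultdict

def empirical_quality_table(calls):
    """calls: iterable of (bin_label, called_base, true_base) from a training set."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for bin_label, called, true in calls:
        totals[bin_label] += 1
        errors[bin_label] += called != true   # True counts as 1
    table = {}
    for bin_label, n in totals.items():
        p = (errors[bin_label] + 1) / (n + 1)  # smoothed empirical error rate
        table[bin_label] = -10.0 * math.log10(p)
    return table
```

Because the table reflects the training data, a table built from plasmid reads will not transfer directly to PCR products, as noted above.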

Illumina base calling

model–based: AYB (Massingham and Goldman 2012), Bustard (Illumina default), BayesCall (Kao and Song 2009), naiveBayesCall (Kao and Song 2011), OnlineCall (Das and Vikalo 2012), Rolexa (Ledergerber and Dessimoz 2011), Softy (Das and Vikalo 2013), Swift (Whiteford et al. 2009), etc.

(supervised) machine learning: Alta-Cyclic (Erlich et al. 2008), freeIbis (Renaud et al. 2013), Ibis (Kircher et al. 2009), etc.

important parameters:

cross–talk among dyes

phasing (i.e. secondary signals) as a function of cycle

signal decay as a function of cycle

intensity of the previous cycle

intensity of the current cycle

intensity of the next cycle


thymine retention (Kircher et al., 2009). The reads produced by both versions were aligned back to the φX174 genome, and the number of sequences mapped and average edit distance was computed. We observed that LIBOCAS outperforms the previous SVM library for both metrics.

Because the introduction of incorrectly labelled training examples could influence the quality of the SVM model, we sought to evaluate whether our masking procedure would have an effect on the number of mapped reads. The mapping statistics confirmed that masking divergent bases on the φX genome improves the final sequence accuracy (170572 sequences mapped) compared with not masking any bases (170220) or masking random bases (170225).

We tested freeIbis on a recent paired-end GAIIx run from mid-2011 from our own sequencing centre with 2 × 126 cycles and a single index of seven nucleotides. This multiplexed run had both human DNA as target, and φX174 as control and was basecalled using the previous version, Ibis, and the current one, freeIbis as well as naiveBayesCall (v. 0.3) and All your base (AYB, v2.08). We compared how each performed in terms of sequence accuracy, the number of sequences mapped and edit distance to the reference, as well as runtime (Table 1). We showed that freeIbis provides more high-quality base calls, leading to an increased number of reads being mapped to the reference with a lower edit distance than is the case for other basecallers. The predicted versus observed quality scores were plotted for Bustard and for freeIbis (Fig. 1). The sequences for the two GA runs used for comparison were produced using Bustard Off-Line Basecaller (OLB v.1.9.3). Our results show that freeIbis offers an improved accuracy and calibrated quality scores for these sequencing runs (including one on a HiSeq and another on a MiSeq) and outperforms Bustard on runs with unusually high error rates (see Supplementary Data).

Using the genotype calls from the same sequencing data but using three different basecallers (Ibis, freeIbis and Bustard) to compare with calls from Sanger sequences, we determined that freeIbis offers improved genotyping accuracy (see Supplementary Data).

4 CONCLUSION

FreeIbis provides substantial improvements in sequence accuracy, quality score calibration and genotyping accuracy over Bustard, and is more computationally efficient than equally accurate model-based methods such as AYB.

ACKNOWLEDGEMENTS

We would like to thank the Bioinformatics Group, the Sequencing and the Population Genetics Group at the Max Planck Institute for Evolutionary Anthropology for providing data and feedback. We are also indebted to Vojtech Franc, Yun Song, Hazel Marsden and Tim Massingham who provided support for use of their software.

Funding: Work was funded by the Max Planck Society

Conflict of Interest: none declared.

REFERENCES

Das,S. and Vikalo,H. (2012) Onlinecall: fast online parameter estimation and base calling for illumina’s next-generation sequencing. Bioinformatics, 28, 1677–1683.

Erlich,Y. et al. (2008) Alta-cyclic: a self-optimizing base caller for next-generation sequencing. Nat. Methods, 5, 679–682.

Franc,V. and Sonnenburg,S. (2009) Optimized cutting plane algorithm for large-scale risk minimization. J. Mach. Learn. Res., 10, 2157–2192.

Kao,W. et al. (2009) Bayescall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res., 19, 1884.

Kircher,M. et al. (2009) Improved base calling for the illumina genome analyzer using machine learning strategies. Genome Biol., 10, R83.

Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25, 1754–1760.

Massingham,T. and Goldman,N. (2012) All your base: a fast and accurate probabilistic approach to base calling. Genome Biol., 13, R13.

McKenna,A. et al. (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.

Whiteford,N. et al. (2009) Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics, 25, 2194–2199.

Table 1. Accuracy for each basecaller on an Illumina GAIIx dataset (2 × 126 cycles with 366 135 257 clusters)

Basecaller      Training time  Calling time  Mapped (%)a           Edit distance
Bustard                                      583 348 201 (83.93%)  1.379
naiveBayesCall  591 h          658 h         578 957 145 (83.34%)  1.496
AYB                            394 h         593 183 967 (85.52%)  1.076
Ibis            19.4 h         13.2 h        592 929 953 (85.31%)  1.167
freeIbis        21.3 h         12.2 h        594 095 219 (85.48%)  1.145

The human sequences were mapped to the hg19 version of the human genome. The number of mapped sequences and the average number of mismatches for those were tallied for each method. Time trials were conducted on a machine with 74 GB of RAM and using 8 of the 12 Intel Xeon cores running at 2.27 GHz. a: Percentage relative to sequences assigned to the read group of interest.

Fig. 1. Plot of the predicted versus the observed base quality score for control reads. Ideally the base qualities should follow the diagonal line. The root mean square error (RMSE) shows that quality scores predicted using freeIbis have a greater correlation to their observed error rates


(Renaud et al. 2013)

Das and Vikalo BMC Bioinformatics 2013, 14:129 http://www.biomedcentral.com/1471-2105/14/129

where $s_j$ can take values of unit vectors comprising three zeros and one non-zero entry equal to 1, and $1 \le j \le 4$. Base probabilities $P(S_i = s_j \mid Y, \lambda, \Theta)$ can be calculated from the state probabilities of the trellis that we defined in the parameter estimation section, e.g.,

$$P(S_i^{T} = [1\ 0\ 0\ 0]) = \sum_{j=1}^{4} P(T_i = t_j \mid Y, \lambda, \Theta), \qquad (15)$$

and so on. Note that these probabilities are also the ‘quality score’ assigned to the given basecall (more on quality scores in the next section). Clearly, we need to find posteriori probabilities $P(T_i = t_j \mid Y, \lambda, \Theta)$. For this, we again turn to the soft-output Viterbi and forward-backward algorithms that we described in the previous section.

Note that the value of $\lambda$ used for base calling in window $l$ is approximated by the value of $\lambda$ which maximizes the log-likelihood function formed using $\Theta$ and $S_i$ from the previous window, $l - 1$ (except in window $l = 1$ where we use $S_i$ provided by Bustard). It is straightforward to show that this maximization entails solving the quadratic equation in $\lambda$

$$\sum_{i=(l-1)W+1}^{lW} \left[ 4\lambda^{2} + \frac{(K_i X_i)^{T}\,\Sigma_i^{-1}\,Y_i}{\left(\prod_{j=2}^{i}(1 - d_j)\right)\lVert X_i\rVert^{2}}\,\lambda - \frac{Y_i\,\Sigma_i^{-1}\,Y_i}{\left(\prod_{j=2}^{i}(1 - d_j)\right)^{2}\lVert X_i\rVert^{2}} \right] = 0, \qquad (16)$$

and choosing the positive solution as the value of $\lambda$.

Quality scores
Performance of various base calling algorithms can be compared by evaluating error rates that they achieve when applied to determining the order of nucleotides in a known sequence. In practical applications, where the sequence being analyzed is not known, we need to assess the confidence of a base calling procedure. To this end, quality scores provide information as to how reliable the corresponding base calls are. The quality scores that we assign to base calls are the posterior probabilities of the bases computed by the forward-backward/SOVA schemes. In particular, we use the posteriori probabilities of the bases computed according to (15) as the quality scores. In order to assess the ‘goodness’ of quality scores, we consider their discrimination ability [10,11]. The discrimination ability for a given error rate is obtained by sorting all bases according to their quality scores in descending order and finding the number of bases called before the error rate exceeds the predefined threshold.
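The discrimination ability described in this paragraph can be sketched as a single pass over the bases ranked by quality. The input format (pairs of quality score and an error flag obtained by comparison with a known reference) and the function name are assumptions made for illustration.

```python
def discrimination_ability(bases, max_error_rate):
    """Number of bases accepted, best quality first, before the cumulative
    error rate exceeds max_error_rate.

    bases: iterable of (quality_score, is_error) pairs.
    """
    ranked = sorted(bases, key=lambda b: b[0], reverse=True)
    errors = 0
    accepted = 0
    for n, (_, is_error) in enumerate(ranked, start=1):
        errors += bool(is_error)
        if errors / n > max_error_rate:
            break                     # threshold exceeded; stop counting
        accepted = n
    return accepted
```

A base caller whose quality scores rank errors last will accept more bases at a given error rate, which is why the measure is used to compare quality score schemes.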

Results
GAII
Performance of the forward-backward algorithm and SOVA is verified on a full lane data obtained by sequencing phiX174 (EMBL/NCBI accession number J02482) bacteriophage using Illumina’s Genome Analyzer II which generates reads of length 76. After basecalling the lane by Bustard, naiveBayesCall, Rolexa, Ibis, forward-backward and SOVA, the calls were mapped onto the known reference sequence comprising 5386 bases. The optimal alignment is found using a Hamming distance metric. Reads that map with less than 30% errors are retained while reads having more errors are removed to ensure that there is no ambiguity in the alignment. This results in approximately 7 million reads and 550 million bases which are used to compare the performance of the considered base-calling schemes. Average error rates computed over the entire lane are compared in Table 1. Figure 2 shows the by tile error rates, by cycle error rates and the discrimination abilities of the different basecallers. Forward-backward algorithm and SOVA outperform all other schemes in terms of error rates and discrimination abilities.

HiSeq
Performance of the forward-backward algorithm and SOVA is verified on reads from E. coli (EMBL/NCBI accession number NC007779) using Illumina’s HiSeq2000 comprising of 100 cycle paired end data. The error rates for both pairs of reads are shown as a function of cycle number in Figure 3. Average error rates are compared in Table 2 for both SOVA and FB schemes. As can be seen, we improve on Bustard’s calls by 12.3 and 9.6% for the first and second pair respectively.

Discussion
Computational complexity
For each read, the most computationally expensive Bustard’s step is its correction of phasing effects. For both forward-backward algorithm and SOVA, we need to evaluate 16 objective functions for the states at each stage

Table 1 Comparison of error rates and speed for GAII

Decoding strategy  Error rate  Running time
FB                 0.0128      400 min
SOVA               0.0129      300 min
OnlineCall         0.0137      30 min
naiveBayesCall     0.0139      1500 min
Ibis               0.0147      480 min
Bustard            0.0154      40 min
Rolexa             0.0171      720 min

A comparison of error rates and running times (per lane) for different base callers (note that Bustard’s running time is underestimated since it does not account for the parameter estimation step).

(Das and Vikalo 2013)

Illumina base calling: Ibis

process image file to extract sample data

create SVM models for each cycle

train/test data from known genome sequence

intensity values of current cycle + previous and next

model outputs base call (classification)

PHRED (like) error probability (based on classification probability)

SVM

project data into a space where a hyperplane separates the classes

search for useful hyperplane(s)

related to discriminant functions

https://commons.wikimedia.org/w/index.php?curid=73710028

Illumina error probabilities

model–based:

PA = IA / (IA + IC + IG + IT) (Whiteford et al. 2009)

likelihood of the base call (Das and Vikalo 2012)

(supervised) machine learning:

SVM assignment scores converted to error probabilities using piecewise linear regression (Renaud et al. 2013)
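The intensity-ratio probability listed under the model-based approaches can be sketched as below. The sketch assumes non-negative intensities that have already been corrected for cross-talk and phasing; the function names and the cap on the error probability are illustrative choices, not taken from Swift.

```python
import math

def base_probabilities(intensities):
    """Normalise the four channel intensities so that they sum to 1."""
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("intensities must have a positive sum")
    return {base: value / total for base, value in intensities.items()}

def call_base(intensities):
    """Return the most probable base and a PHRED-like quality for the call."""
    probs = base_probabilities(intensities)
    base = max(probs, key=probs.get)
    p_error = max(1.0 - probs[base], 1e-10)   # cap avoids log10(0)
    return base, -10.0 * math.log10(p_error)
```

With intensities A = 80, C = 10, G = 5, T = 5 the call is A with probability 0.8, which corresponds to a quality of roughly 7.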

sequence contigs

•an assembly of two or more sequencing reads

•usually from different primers or library fragments

•[1] confirm sequence interpretation

•disagreement among reads must be resolved

•ambiguous bases, contradictory bases

•resolved (as best as possible) based on quality

•unresolvable coded as IUPAC polymorphism

•[2] make consensus (compromise)

•usually larger than individual reads

super contigs and scaffolds

•an assembly of two or more contigs

•often produced with a secondary assembler

•consensus

•a major source of error in ‘draft’ genomes

•scaffolds

•contain regions of (approximately) known size, but unknown sequence

•often represented as a uniform size (e.g. 100 Ns)

assembly quality

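•N50: length–weighted median assembly length (the length of the shortest contig among the longest contigs that together contain at least 50% of the total assembly length)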

•longer is generally better

•gene content:

•count the number of reference genes found among the assemblies (BLAST, hmmer, etc.)

•Benchmarking set of Universal Single–Copy Orthologs (BUSCO; Seppey et al. 2019)

•Core Eukaryotic Genes (CEG; Parra et al. 2007)
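N50 is a length-weighted median: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly. A minimal sketch (the function name is illustrative):

```python
def n50(lengths):
    """Length of the contig at which the cumulative length, taken from the
    longest contig downwards, first reaches half of the total."""
    if not lengths:
        raise ValueError("no contigs")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For contig lengths 2, 2, 2, 3, 3, 4, 8 and 8 (total 32) the two longest contigs already cover half of the assembly, so N50 is 8, whereas the ordinary median length is 3.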

sequence trimming

•window size

•usually 20 bp

•allowable ‘error’ threshold

•ambiguous bases

•e.g. no more than 2 bases

•confidence

•e.g. no more than 2 bases with QV < 20

•[1] read from end

•[2] trim at first window error below threshold
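The windowed trimming rule can be sketched for the 3′ end as follows, using the slide's example values (20 bp window, no more than 2 bases with QV < 20). How reads shorter than one window are handled is an assumption here (they are discarded), as is the function name.

```python
def trim_3prime(quals, window=20, min_q=20, max_bad=2):
    """Return the number of bases to keep, counted from the 5' end.

    The window slides inward from the 3' end; the read is cut at the end of
    the first window containing at most max_bad bases below min_q.
    """
    n = len(quals)
    for end in range(n, window - 1, -1):
        bad = sum(q < min_q for q in quals[end - window:end])
        if bad <= max_bad:
            return end
    return 0   # no acceptable window (or the read is shorter than the window)
```

The same function applied to the reversed quality list gives the 5′ trim point. Note that up to `max_bad` low-quality bases can remain at the trimmed end.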

sequence ‘correction’…

•remove ‘systematic’ errors from sequencing reads

•rare bits of sequence (k–mers) that are similar to common bits of sequence (k–mers) are corrected to the more common variant

•often improves the sequence quality

•can improve the assembly size

•decreases the assembly size more often than not

•can introduce errors into the sequence

Heydari et al. BMC Bioinformatics (2017) 18:374

Table 4 NGA50 of respectively contigs (top) and scaffolds (bottom) assembled by SPAdes before and after error correction

Contig NGA50

Tools        D1         D2         D3         D4         D5         D6         D7         D8
Uncorrected  397 392    92 570     119 253    231 409    264 881    8 559      6 429      50 484
ACE          397 392 =  92 570 =   125 608 ↑  231 409 =  264 881 =  8 771 ↑    3 143 ⇓    28 679 ⇓
BayesHammer  397 392 =  92 344 ↓   132 564 ⇑  231 409 =  264 881 =  9 075 ↑    6 540 ↑    53 534 ↑
BFC          397 392 =  92 570 =   132 876 ⇑  231 409 =  264 881 =  9 375 ↑    6 389 ↓    49 185 ↓
BLESS 2      397 392 =  92 570 =   119 265 ↑  231 409 =  264 881 =  7 975 ↓    3 047 ⇓    23 814 ⇓
Blue         397 392 =  92 708 ↑   132 876 ⇑  231 409 =  289 353 ↑  7 628 ⇓    6 191 ↓    50 486 ↑
Fiona        397 392 =  92 611 ↑   119 253 =  231 409 =  264 881 =  9 224 ↑    5 346 ⇓    45 472 ↓
Karect       397 392 =  92 611 ↑   132 876 ⇑  231 409 =  264 881 =  9 865 ⇑    6 392 ↓    54 132 ↑
Lighter      397 392 =  92 570 =   132 564 ⇑  231 409 =  289 353 ↑  9 609 ⇑    6 423 ↓    50 440 ↓
Musket       397 392 =  92 566 ↓   132 876 ⇑  231 409 =  264 881 =  9 293 ↑    6 170 ↓    46 377 ↓
RACER        397 392 =  92 523 ↓   112 393 ↓  231 409 =  264 881 =  7 336 ⇓    3 244 ⇓    21 538 ⇓
SGA-EC       397 392 =  92 344 ↓   119 255 ↑  231 409 =  264 881 =  9 296 ↑    6 435 ↑    52 105 ↑
Trowel       397 392 =  92 344 ↓   119 335 ↑  231 409 =  264 881 =  7 808 ↓    6 389 ↓    48 357 ↓

Scaffold NGA50

Tools        D1         D2         D3         D4         D5         D6         D7         D8
Uncorrected  397 392    97 353     132 876    231 409    289 353    8 829      6 472      60 554
ACE          397 392 =  97 353 =   133 713 ↑  231 409 =  264 881 ↓  9 190 ↑    3 158 ⇓    35 392 ⇓
BayesHammer  397 392 =  97 353 =   133 309 ↑  231 409 =  264 881 ↓  9 443 ↑    6 576 ↑    58 570 ↓
BFC          397 392 =  97 353 =   133 088 ↑  231 409 =  264 881 ↓  9 664 ↑    6 419 ↓    59 613 ↓
BLESS 2      397 392 =  97 353 =   132 876 =  231 409 =  264 881 ↓  8 441 ↓    3 073 ⇓    35 638 ⇓
Blue         397 392 =  97 288 ↓   133 309 ↑  231 409 =  289 353 =  7 841 ⇓    6 183 ↓    61 289 ↑
Fiona        397 392 =  97 353 =   132 876 =  231 409 =  264 881 ↓  9 491 ↑    5 385 ⇓    54 188 ⇓
Karect       397 392 =  97 353 =   133 058 ↑  231 409 =  264 881 ↓  10 302 ⇑   6 446 ↓    62 304 ↑
Lighter      397 392 =  97 353 =   133 309 ↑  231 409 =  289 353 =  9 955 ⇑    6 468 ↓    59 697 ↓
Musket       397 392 =  97 353 =   133 088 ↑  231 409 =  264 881 ↓  9 502 ↑    6 219 ↓    55 842 ↓
RACER        397 392 =  97 353 =   132 876 =  231 409 =  264 881 ↓  7 603 ⇓    3 266 ⇓    23 783 ⇓
SGA-EC       397 392 =  97 353 =   132 876 =  231 409 =  264 881 ↓  9 640 ↑    6 483 ↑    60 636 ↑
Trowel       397 392 =  97 353 =   132 876 =  231 409 =  264 881 ↓  8 107 ↓    6 435 ↓    57 078 ↓

Arrows in the table are based on their value relative to the NGA50 value obtained from uncorrected data as follows: ⇓ < -10% < ↓ < 0% < ↑ < +10% < ⇑

Fig. 3, the breakpoints marked as ‘A’ and ‘B’ each occur in four cases.

In order to identify the mechanisms that cause breakpoints, the k-mer spectrum of both corrected and uncorrected data along the two contigs was examined. In this section, k = 21 is used throughout, as it corresponds to the smallest k-mer size that is used to establish overlap between individual reads by the multi-k SPAdes assembler. In Fig. 3, black bars visualize the locations of ‘lost true 21-mers’, i.e., 21-mers that do exist in the reference sequence (hence ‘true’) and also do exist in the uncorrected data but that are no longer present in the corrected data (hence ‘lost’). Lost true k-mers hence refer to those k-mers that were systematically, but erroneously removed during error correction. In many cases, lost true 21-mers

Fig. 2 SPAdes assemblies. SPAdes assembly results for D. melanogaster for (un)corrected data. Scaffolds with length NGAx or larger contain x% of the genome

[Plot: scaffold length NGAx (Kbp) against x; series: Uncorrected, ACE, BayesHammer, BFC, BLESS 2, Blue, Fiona, Karect, Lighter, Musket, RACER, SGA-EC, Trowel]

(Heydari et al. 2017)

…sequence ‘correction’

•k–mer

•e.g. BLESS (Heo et al. 2014), Hammer (Medvedev et al. 2011), HiTEC (Ilie et al. 2011), Musket (Liu et al. 2013), Quake (Kelley et al. 2010), RACER (Ilie and Molnar 2013)

•multiple sequence alignment (MSA)

•e.g. Coral (Salmela and Schröder 2011), ECHO (Kao et al. 2011), Karect (Allam et al. 2015)

Quake k–mer ‘correction’

•[1] count (or estimate) k–mer frequency

•[2] determine common/rare threshold from the data

•[3] model common k–mers with the Gaussian and/or zeta distribution

•[4] model rare k–mers with the gamma distribution

•[5] for each read remove rare k–mers:

•[a] by trimming from the 3′ end

•[b] change bases with low quality scores to more common bases
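The core idea of k–mer correction (rare k–mers that sit one substitution away from a common k–mer are treated as errors) can be sketched as below. This is a deliberately simplified illustration and not Quake's algorithm: it uses a fixed count threshold instead of fitted distributions, ignores base qualities, and only corrects when exactly one common neighbour exists. The function names are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def correct_kmer(kmer, counts, min_count):
    """Return the single common k-mer one substitution away from a rare k-mer,
    or the k-mer unchanged if it is common or the correction is ambiguous."""
    if counts[kmer] >= min_count:
        return kmer
    candidates = []
    for i, old in enumerate(kmer):
        for new in "ACGT":
            if new != old:
                alt = kmer[:i] + new + kmer[i + 1:]
                if counts[alt] >= min_count:
                    candidates.append(alt)
    return candidates[0] if len(candidates) == 1 else kmer
```

Because genuine low-coverage k–mers look the same as errors under this rule, the sketch also shows how correction can remove true sequence.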

Karect MSA ‘correction’

•[1] for each read:

•[a] global pairwise align reads to references having at least x k–mers (indels are permitted) in common

•[b] store alignment if > y alignment overlap and edit distance < z

•[2] for each read:

•[a] extract the shortest stored alignment

•[b] change bases to the modal base for each alignment position
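Step [2b] can be sketched as a column-wise vote. The sketch assumes substitutions only and candidate reads that have already been aligned and padded to the reference read's coordinates; Karect itself stores normalized alignments with indels in a graph, so this is an illustration of the modal-base idea rather than its implementation. The pad character and function name are assumptions.

```python
from collections import Counter

def modal_correction(reference, aligned_reads, pad="."):
    """Replace each reference base with the most frequent base in its column.

    aligned_reads: strings of the same length as reference, with `pad` at
    positions a read does not cover. Ties keep the reference base.
    """
    corrected = []
    for i, ref_base in enumerate(reference):
        counts = Counter([ref_base])
        counts.update(r[i] for r in aligned_reads if r[i] != pad)
        best = max(counts, key=lambda b: (counts[b], b == ref_base))
        corrected.append(best)
    return "".join(corrected)
```

Counting the reference read itself and breaking ties in its favour means a base is only changed when the overlapping reads outvote it.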

Cri to reads that share with r exactly k-mer ri (Fig. 1a). Type (b): If less than m type (a) reads are found (m is a user-defined constraint [default m = max(30, min(150, 0.6 × estimated coverage))]), Karect adds to Cri reads that may contain up to d mismatches/indels in the l-prefix or l-suffix of k-mer ri (Fig. 1b), where d is a user-defined parameter (default d = 2). To count the mismatches or indels, the Hamming or edit distance is used, respectively. Type (c): If |Cri| < m Karect generates two smaller k′-mers of ri, where k′ = 2l and searches for exact k′-mer matches (Fig. 1c). Type (d): If |Cri| < m, Karect searches for reads that contain up to d mismatches/indels in the l-prefix or l-suffix of the k′-mer (Fig. 1d). To reduce the effect of bias towards specific k-mers, Cri is allowed to include at most m reads sharing the same k-mer or k′-mer. Cri reads are added to Cr, and the process is repeated for other k-mers of r. For more details refer to the Supplementary Document.

2.2 Alignment and normalization of candidate readsOur goal is to correct reference read r. Karect aligns each read c in

the candidate set Cr, against r (line 5 in Algorithm CORRECTERRORS).

The result includes the start and end of c or r (semi-global align-

ment) to allow the alignment of overlaps. We use a variant of the

Needleman and Wunsch (1970) algorithm; refer to the

Supplementary Document for details.

To exclude candidate reads sequenced from different genome regions, an alignment is considered valid only if the overlap exceeds a threshold s_1 (default s_1 = max(min(0.7 × avgReadLen, 35), 0.2 × refReadLen)) and the number of mismatches/indels within the overlap does not exceed a threshold s_2 (default s_2 = 25% of the overlap). This rudimentary filter may still accept some reads from irrelevant genome regions. To further minimize this problem, Karect assigns a weight w_c to each read (refer to Section 2.5).
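The validity filter can be written down directly from the two default thresholds. The sketch below is only an illustration: the function and parameter names are hypothetical, and the weight w_c of Section 2.5 is not modeled.

```python
def default_s1(avg_read_len, ref_read_len):
    # Default stated in the text:
    # s_1 = max(min(0.7 x avgReadLen, 35), 0.2 x refReadLen)
    return max(min(0.7 * avg_read_len, 35), 0.2 * ref_read_len)

def is_valid_alignment(overlap, edits, avg_read_len, ref_read_len,
                       s2_fraction=0.25):
    """Accept an alignment only if the overlap exceeds s_1 and the
    number of mismatches/indels does not exceed s_2 (a fraction of
    the overlap, 25% by default)."""
    s1 = default_s1(avg_read_len, ref_read_len)
    s2 = s2_fraction * overlap
    return overlap > s1 and edits <= s2
```

With 100 bp reads, for instance, s_1 evaluates to 35, so a 40 bp overlap with 10 edits passes while the same overlap with 11 edits does not.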

Consider reference read r = CAA and candidate read c_1 = GAAA. r can be transformed to c_1 by substituting C with G at position 1 and inserting A at position 4. Substitutions are modeled as deletions followed by insertions. Therefore, the alignment corresponds to del(C,1); ins(G,1); ins(A,4). Now consider another candidate read c_2 = AAA. r can be transformed to c_2 by the following operations: del(C,1); ins(A,1). Observe that inserting an A at position 1 generates the same string as inserting A at position 4. Therefore, an equivalent representation for the alignment is del(C,1); ins(A,4). We call this the normalized form of the alignment (line 6 in Algorithm CORRECTERRORS), where normalization means that operations are shifted as far as possible to the right. Normalization allows better grouping of operations of a set of candidate reads, which enables Karect to correct reference reads with high accuracy. In the previous example, after normalization it is revealed that, to correct r, we must insert an A at position 4, with high probability. The concept of normalization is also used in DAGCon (Chin et al., 2013), but the resulting representation is suboptimal; the details are explained in the Supplementary Document. Note that normalization is not required if the sequencing technology generates only substitution errors.
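The right-shifting idea can be reproduced on the CAA example with a small sketch. It assumes an alignment is given as a list of (reference base, read base) columns with `-` for gaps and substitutions already split into a deletion column followed by an insertion column. The position labels (an insertion directly after a deletion shares the deletion's position) are inferred from the example above and are an assumption, as are the function names; this is not Karect's actual procedure.

```python
def normalize(columns):
    """Shift indel columns right past identical match columns."""
    cols = list(columns)
    changed = True
    while changed:
        changed = False
        for i in range(len(cols) - 1):
            ref_base, read_base = cols[i]
            is_indel = (ref_base == "-") != (read_base == "-")
            base = read_base if ref_base == "-" else ref_base
            # Swapping an indel with a following match on the same base
            # leaves both sequences unchanged.
            if is_indel and cols[i + 1] == (base, base):
                cols[i], cols[i + 1] = cols[i + 1], cols[i]
                changed = True
    return cols

def operations(columns):
    """List del/ins operations with 1-based reference positions."""
    ops, ref_pos, prev_was_del = [], 0, False
    for ref_base, read_base in columns:
        if read_base == "-":            # deletion of a reference base
            ref_pos += 1
            ops.append(("del", ref_base, ref_pos))
            prev_was_del = True
        elif ref_base == "-":           # insertion of a read base
            pos = ref_pos if prev_was_del else ref_pos + 1
            ops.append(("ins", read_base, pos))
            prev_was_del = False
        else:                           # match column
            ref_pos += 1
            prev_was_del = False
    return ops

# r = CAA against c_2 = AAA, before normalization
c2 = [("C", "-"), ("-", "A"), ("A", "A"), ("A", "A")]
print(operations(c2))             # del(C,1); ins(A,1)
print(operations(normalize(c2)))  # del(C,1); ins(A,4)
```

The alignment of c_1 = GAAA is already in normalized form under this rule, because the inserted G cannot move past the following A.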

2.3 Storing alignments in the POG

Each normalized alignment is stored in a POG G_r associated with the reference read r (line 7 in Algorithm CORRECTERRORS). Initially, G_r represents only r. The candidate read alignments are then added incrementally in G_r in a manner similar to DAGCon, with the difference that similar out-nodes (i.e. nodes connected by edges coming out from the same node) are merged instantly; this saves time and space. Also, in contrast to DAGCon, similar in-nodes (i.e. nodes connected by edges going to the same node) are not merged, since this is not required by our extraction algorithm; this also saves computational time. Figure 2 illustrates an example of aligning four candidate reads c_1, ..., c_4 to reference read r. The value on each edge corresponds to the number of alignments passing through that edge. We are going to modify these values in Section 2.5.

For sequencing technologies that generate only substitution errors, instead of a POG we use an array of size |r| to accumulate alignment weights.
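For the substitution-only case, the array of size |r| might look like the sketch below. The data layout (one base-to-weight dictionary per position), the candidate tuple format, the reference's own vote `ref_weight` and the final argmax step are all assumptions made for illustration; the text only states that an array accumulates alignment weights.

```python
def accumulate_substitutions(ref, aligned_candidates, ref_weight=1.0):
    """Accumulate per-position base weights for a reference read.

    aligned_candidates: iterable of (offset, sequence, weight), where
    offset is the 0-based reference index of the candidate's first base
    (may be negative for overhanging reads; assumed format).
    """
    votes = [{base: ref_weight} for base in ref]
    for offset, seq, weight in aligned_candidates:
        for j, base in enumerate(seq):
            i = offset + j
            if 0 <= i < len(ref):       # ignore overhanging bases
                votes[i][base] = votes[i].get(base, 0.0) + weight
    return votes

def corrected_read(votes):
    """Pick the highest-weight base at each position."""
    return "".join(max(v, key=v.get) for v in votes)
```

For example, with reference ACGT and three unit-weight candidates that all carry C at the third position, the accumulated weights favor C there and the corrected read becomes ACCT.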

2.4 Extracting corrected read from the POG

Given POG G_r for a reference read r, the corrected read r′ corresponds to a path within G_r (line 8 in Algorithm CORRECTERRORS). There are many ways to select such a path. For instance, it can be the path that maximizes the sum of edge scores, but the quality of error correction is expected to be low, because the heuristic favors longer paths. As another example, DAGCon assigns each node a score based on the weights of the out-edges and local coverage and selects the path that maximizes the sum of node scores.

We propose a novel approach. First, we normalize all edge weights such that the sum of the out-edge weights of any node is 1 (Fig. 3). The rationale is that, after normalization, edge weights will reflect the transition probability between nodes. Then, the problem is mapped to the classic problem of finding the most reliable path in a network (Petrovic and Jovanovic, 1979), which is the path that maximizes the product of edge weights. Since POGs are directed

Fig. 2. Example POG. The first row shows the initial POG for reference read r. In the second row, c_1 introduces an insertion and a substitution. Next, c_2 includes a deletion, an insertion and a substitution and so on. At each row, the newly introduced changes are shown in bold

Fig. 3. Normalized POG of Figure 2. The extracted path is shown in bold
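The extraction idea of Section 2.4 (normalize out-edge weights to sum to 1, then take the path with the largest product of edge weights) can be sketched on a small directed acyclic graph. The graph encoding (a dictionary of node to successor counts), the explicit `source` and `sink` nodes, and the memoized recursion are assumptions for illustration, not Karect's implementation; very long reads would need an iterative traversal to avoid Python's recursion limit.

```python
def normalize_out_edges(graph):
    """Scale each node's out-edge weights so that they sum to 1."""
    normalized = {}
    for node, out_edges in graph.items():
        total = sum(out_edges.values())
        normalized[node] = (
            {succ: w / total for succ, w in out_edges.items()} if total else {}
        )
    return normalized

def most_reliable_path(graph, source, sink):
    """Return (probability, path) maximizing the product of edge weights."""
    probs = normalize_out_edges(graph)
    memo = {}

    def best(node):
        if node == sink:
            return 1.0, [sink]
        if node not in memo:
            result = (0.0, None)        # no path to the sink found yet
            for succ, p in probs.get(node, {}).items():
                score, path = best(succ)
                if path is not None and p * score > result[0]:
                    result = (p * score, [node] + path)
            memo[node] = result
        return memo[node]

    return best(source)

# Hypothetical edge counts (number of alignments through each edge).
toy = {
    "S": {"A": 3, "B": 1},
    "A": {"T": 2, "C": 1},
    "B": {"T": 1},
    "C": {"T": 1},
}
print(most_reliable_path(toy, "S", "T"))
```

In this toy graph the path S, A, T has probability 0.75 × 2/3, which is higher than the longer path through C (0.75 × 1/3 × 1); a sum-of-weights criterion on the raw counts would instead tie the two (3 + 2 versus 3 + 1 + 1).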



(Allam et al. 2015)
