
Supplementary Note 1. MiXCR analysis pipeline overview

MiXCR analysis flow

The pipeline consists of three main processing steps:

● building alignments of sequencing reads to reference V, D, J and C genes of T- or B-cell receptors (MiXCR align command)

● assembling clonotypes (MiXCR assemble command) from the alignments obtained in the previous step, in order to extract specific gene regions (e.g. CDR3)

● exporting alignments (MiXCR exportAlignments command) or clones (MiXCR exportClones command) to a human-readable text file

MiXCR supports the following sequencing data formats: fasta, fastq, fastq.gz, and paired-end fastq and fastq.gz. As the output of each processing stage, MiXCR produces a compressed binary file with comprehensive information about the entries produced by that stage (a “.vdjca” file with alignments and a “.clns” file with clones). Each binary file can be converted to a human-readable, parsable tab-delimited text file using the exportAlignments and exportClones commands.
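The three stages can be chained as in the following minimal Python sketch, which only builds the command lines and does not execute anything. The command names come from the text above; the argument order (input files first, output file last), the executable name `mixcr`, and the output file names are assumptions made for illustration and should be checked against the user manual.

```python
def mixcr_pipeline_commands(reads, prefix, mixcr="mixcr"):
    """Return the three pipeline stages as argument lists (nothing is run).

    reads  -- list with one fastq/fasta path, or two paths for paired-end data
    prefix -- hypothetical prefix used to name the intermediate files
    """
    alignments = f"{prefix}.vdjca"   # binary file with alignments
    clones = f"{prefix}.clns"        # binary file with clones
    table = f"{prefix}.clones.txt"   # tab-delimited export (assumed name)
    return [
        [mixcr, "align", *reads, alignments],       # step 1: align reads
        [mixcr, "assemble", alignments, clones],    # step 2: assemble clonotypes
        [mixcr, "exportClones", clones, table],     # step 3: export to text
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the stages in order.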

Comprehensive documentation for all analysis steps can be found in the MiXCR user manual. Below we briefly describe the key algorithms employed in MiXCR.

Nature Methods: doi:10.1038/nmeth.3364

Supplementary Note 2. Alignment algorithm

To align V, J, and C genes to sequencing reads, we developed and implemented a specialized version of the k-mer chaining algorithm proposed by Liao et al. (Ref. 1). The implementation is called KAligner and can be found in our MiLib library hosted on GitHub (https://github.com/milaboratory/milib/tree/develop/src/main/java/com/milaboratory/core/alignment). The algorithm was modified to handle short alignments and to determine alignment boundaries with high accuracy. We changed the seed size and implemented more robust handling of some special cases that become important with short seeds, such as repetitive sequences consisting of single nucleotides or dinucleotides (e.g. homopolymers). The scoring scheme for the seed-and-vote step was also changed.

Briefly, the algorithm consists of two steps: (1) seed-and-vote and (2) alignment of the regions between seeds and extension of the alignments beyond the bounding seeds. The algorithm works with a precalculated index that stores the positions of all seeds in the reference sequences. In the seeding step, seeds are picked at random from the target read, with distances between seed origins uniformly distributed between two fixed numbers (from 4 to 10 nucleotides by default). The default seed size is 5 nucleotides, so in some cases seeds can overlap. The voting step performs a heuristic search for the optimal reference sequence and its offset within a certain region of the query sequence (the sequencing read). It maximizes a scoring function based on the number of matching and absent seeds and on the offsets of seeds relative to the optimal position (such offsets arise from deletions and insertions and in some rare exceptional situations). After one or several candidates are selected, alignments are built for the spaces between seeds and for the regions outside the boundary seeds. The sequence between seeds is aligned with the classical Needleman–Wunsch algorithm using a banded matrix, and the boundary sequences are aligned with a modified Smith–Waterman algorithm. Once the alignments are built, a classical alignment score (scoring matrix and indel penalties) is calculated for all candidates, and additional filtering is performed based on these values.
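The seed-and-vote step can be illustrated with the following simplified Python sketch. It is not the KAligner implementation: it only counts matching seeds per (reference, offset) pair, ignores absent seeds and indel-induced offset shifts, and omits the subsequent alignment and filtering. All function names are hypothetical; the seed size and seed spacing follow the defaults quoted above.

```python
import random
from collections import Counter, defaultdict

SEED_SIZE = 5                # default seed size stated in the text
MIN_STEP, MAX_STEP = 4, 10   # default distance between seed origins

def build_index(references, k=SEED_SIZE):
    """Precalculate where every k-mer occurs: k-mer -> [(reference, position)]."""
    index = defaultdict(list)
    for name, seq in references.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def pick_seeds(read, rng, k=SEED_SIZE):
    """Yield (position, k-mer) pairs whose origins are a random 4-10 nt apart."""
    pos = 0
    while pos + k <= len(read):
        yield pos, read[pos:pos + k]
        pos += rng.randint(MIN_STEP, MAX_STEP)

def vote(read, index, rng):
    """Count seed hits per (reference, offset), best candidates first.

    offset = position in reference - position in read, so all seeds of an
    indel-free read vote for the same offset of the correct reference.
    """
    votes = Counter()
    for read_pos, kmer in pick_seeds(read, rng):
        for name, ref_pos in index.get(kmer, ()):
            votes[(name, ref_pos - read_pos)] += 1
    return votes.most_common()
```

Calling `vote(read, build_index(references), random.Random())` returns the candidate (reference, offset) pairs ranked by the number of supporting seeds; in the full algorithm the top candidates would then be aligned and rescored.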



Supplementary Note 3. Assembling algorithm

MiXCR assembler flow

The assembling algorithm consists of the following steps:

1. The assembler sequentially processes records (aligned reads) from the input. In the first step, the assembler tries to extract from each aligned read the sequence of the gene feature specified by the user (called the clonal sequence; CDR3 by default); clonotypes are assembled with respect to this clonal sequence. If an aligned read does not contain the clonal sequence (e.g. the CDR3 region), it is dropped.

2. If the clonal sequence contains at least one low-quality nucleotide (the quality threshold can be changed by the user; the default value is 20), the corresponding record is deferred for further processing by the read mapper. If the percentage of low-quality nucleotides in a deferred record exceeds a certain threshold, the record is dropped completely. Records whose clonal sequence contains only high-quality nucleotides are used to build core clonotypes by grouping records with equal clonal sequences (e.g. CDR3). Each clonotype has two key properties: the clonal sequence and the abundance, i.e. the number of records aggregated by this clonotype.

3. After the core clonotypes are built, MiXCR performs mapping, which processes the records deferred in the previous step. Mapping aims to rescue quantitative information from low-quality reads. To this end, each deferred record is mapped onto the already assembled clonotypes: if a fuzzy match with a small number of mismatches (adjustable by the user) in low-quality positions is found, the record is aggregated by the corresponding clonotype; if several clonotypes match, a single clonotype is chosen at random with weights equal to the clonotype abundances. If no match is found, the record is finally dropped.

4. After clonotypes are assembled by the initial assembler and the mapper, MiXCR proceeds to clustering. The clustering algorithm tries to find fuzzy matches between clonotypes and to organize the matched clonotypes into a hierarchical tree (cluster), in which each child layer is highly similar to its parent but has a significantly smaller abundance. Thus, clonotypes with small abundances are attached to highly similar "parent" clonotypes with significantly greater abundance. A typical cluster is shown in the following figure:

Typical structure of an assembled cluster

5. After the clusters are built, only their heads are considered final clones. The maximal depth of a cluster, the fuzzy matching criteria, the relative abundances of parents and children, and other parameters can be customized by passing additional parameters to the MiXCR assemble command (see the user manual).

6. The final step is to align the clonal sequences to the reference V, D, J and C genes. Since the gene features used to build clones (e.g. CDR3) differ from those used in the aligner (e.g. VRegion, JRegion etc.), the alignments must be rebuilt for the clonal sequences. Since all hits are known in advance, these alignments are built using the slower but more accurate Smith–Waterman algorithm (with minor modifications).
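Steps 1 to 5 above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the MiXCR assembler: a nucleotide is treated as low quality when its score is below the threshold, only substitutions between equal-length sequences are considered, and the parameters `max_bad_fraction`, `max_mismatches` and `max_ratio` are hypothetical stand-ins for thresholds whose defaults are not given in the text.

```python
import random
from collections import Counter

def assemble_core(records, min_quality=20, max_bad_fraction=0.5):
    """Steps 1-2: build core clonotypes and collect deferred records.

    records -- iterable of (clonal_sequence, qualities); the sequence is None
               when the read does not cover the clonal region (e.g. CDR3).
    """
    clones, deferred = Counter(), []
    for seq, quals in records:
        if seq is None:
            continue                              # no clonal sequence: dropped
        bad = {i for i, q in enumerate(quals) if q < min_quality}
        if not bad:
            clones[seq] += 1                      # only high-quality nucleotides
        elif len(bad) / len(seq) <= max_bad_fraction:
            deferred.append((seq, bad))           # kept for the mapper
    return clones, deferred

def map_deferred(clones, deferred, max_mismatches=2, rng=random):
    """Step 3: add deferred records to clonotypes that differ from them
    only at low-quality positions; ties are broken by abundance weights."""
    mapped = Counter(clones)
    for seq, bad in deferred:
        candidates = []
        for clone_seq, abundance in clones.items():
            if len(clone_seq) != len(seq):
                continue
            diff = {i for i, (a, b) in enumerate(zip(seq, clone_seq)) if a != b}
            if diff <= bad and len(diff) <= max_mismatches:
                candidates.append((clone_seq, abundance))
        if candidates:                            # otherwise the record is dropped
            seqs, weights = zip(*candidates)
            mapped[rng.choices(seqs, weights=weights)[0]] += 1
    return mapped

def cluster_heads(clones, max_mismatches=1, max_ratio=0.1):
    """Steps 4-5: attach each clonotype to a similar, much more abundant
    parent and return only the cluster heads."""
    ordered = sorted(clones, key=clones.get, reverse=True)
    parent = {}
    for i, child in enumerate(ordered):
        for cand in ordered[:i]:                  # more abundant clonotypes only
            if (len(cand) == len(child)
                    and sum(a != b for a, b in zip(cand, child)) <= max_mismatches
                    and clones[child] <= max_ratio * clones[cand]):
                parent[child] = cand
                break
    return [c for c in ordered if c not in parent]
```

A typical call sequence is `clones, deferred = assemble_core(records)`, then `mapped = map_deferred(clones, deferred)`, then `cluster_heads(mapped)`. The final realignment of clonal sequences (step 6) is not covered by this sketch.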


Supplementary Note 4. Synthetic data generation

To verify the performance of MiXCR on datasets with known clonotypes and to compare it with existing software tools, we developed a pipeline that generates synthetic data reproducing real distributions of sequencing and PCR error rates.

The pipeline consists of the following steps:

1. Generation of clonotypes

A. Generation of clonal sequence. Each clonal sequence is generated in three steps:

a. Random generation of a combination of V, J, and D genes (D genes only for loci that have them)

b. Random generation of a combination of the numbers of nucleotides trimmed from the 3’ end of the V gene, the 3’ and 5’ ends of the D gene, and the 5’ end of the J gene. If the sum of the numbers of nucleotides trimmed from both sides of the D gene is less than zero, the D gene is considered absent.

c. Generation of random inserts (N regions).

All probabilities for these steps were learned from several real datasets from our collection.

B. Generation of fraction values (abundances) for each generated clonotype

2. Generation of reads

A. Random picking of clonotype with probabilities equal to their relative abundances

B. Incorporation of PCR / Sequencing-quality-independent errors, including indels, into

target sequence

C. Generation of quality values for each nucleotide in the sequence

D. Incorporation of additional sequencing mismatches with probabilities equal to $10^{-q/10}$, where q is the quality value.

Error rates of PCR / Sequencing quality independent errors were learned from our data.
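The read-generation steps 2A and 2D can be sketched in Python as follows. This is an illustrative sketch rather than the code of the synthetic-data pipeline: the quality-independent PCR errors and indels (step 2B) and the generation of quality values (step 2C) are omitted, a substitution is assumed to pick one of the three other nucleotides uniformly, and all function names are hypothetical.

```python
import random

def phred_to_error_probability(q):
    """Probability that a base with quality value q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

def pick_clonotype(clonotypes, rng):
    """Step 2A: pick a clonal sequence with probability proportional to its
    relative abundance. clonotypes maps sequence -> abundance."""
    seqs = list(clonotypes)
    return rng.choices(seqs, weights=[clonotypes[s] for s in seqs])[0]

def add_sequencing_errors(sequence, qualities, rng):
    """Step 2D: substitute each base with probability 10^(-q/10)."""
    out = []
    for base, q in zip(sequence, qualities):
        if rng.random() < phred_to_error_probability(q):
            # assumption: the erroneous base is uniform over the other three
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)
```

For example, a quality value of 20 corresponds to an error probability of 0.01 and a quality value of 30 to 0.001, so reads with lower quality values accumulate proportionally more mismatches.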

The MiXCR-Test suite with its user manual can be downloaded from the MiXCR website (http://mixcr.milaboratory.com).



Supplementary Note 5. MiXCR performance.

We ran the full MiXCR analysis pipeline on an input file truncated to different numbers of sequences and measured the time and memory consumption of each MiXCR routine (alignment, assembling). As seen in the figure below, the overall execution time of MiXCR grows linearly with the number of input sequences, demonstrating linear scalability. The amount of memory used becomes nearly constant for large input datasets and grows rapidly for small ones. The main reason for this behavior is Java garbage collection and the Java Memory Model. The only MiXCR routine that really consumes a large amount of memory is the assembling of final clonotypes, since it stores all clonotypes in memory. Thus, the memory consumed by assemble grows linearly with the number of clonotypes, while the amount of memory needed for align is constant even for huge input fastq files. The input file used can be downloaded under accession SRR1200517 (Ref. 2). To limit the input to different numbers of sequences we used the built-in `-n` option of MiXCR, which limits the number of analyzed sequences (i.e. takes only the first n sequences from the input file), e.g. `mixcr align input.fastq input.vdjca -n 1000000`.

Dependence of time and RAM used by MiXCR on the number of sequences in input file.

We further ran the MiXCR align command with different numbers of threads (using the -t option) on the same input fastq file with one million sequences. We used an Intel(R) Xeon(R) CPU E5-1620 @ 3.60GHz with 4 physical cores and 8 hardware threads. As the figure below shows, the throughput grows linearly and reaches a maximum at 6 threads; further increasing the number of threads does not improve performance. The throughput of the assembling procedure is not shown, since it mostly uses a single thread. However, specifying more threads can significantly increase the performance of the assembler when the input file contains a small number of clonotypes.
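A measurement of this kind could be scripted as in the following Python sketch. It is not the benchmarking code used for the figures: the helper names are hypothetical, the executable name `mixcr` is an assumption, and the option placement simply follows the example command given earlier in this note.

```python
import subprocess
import time

def align_command(reads, output, n_reads=None, threads=None, mixcr="mixcr"):
    """Build a `mixcr align` command line with the optional -n and -t options."""
    cmd = [mixcr, "align", reads, output]
    if n_reads is not None:
        cmd += ["-n", str(n_reads)]      # analyze only the first n sequences
    if threads is not None:
        cmd += ["-t", str(threads)]      # number of processing threads
    return cmd

def timed_run(cmd):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def throughput(n_reads, seconds):
    """Processed sequences per second."""
    return n_reads / seconds
```

Looping over thread counts, e.g. `throughput(1000000, timed_run(align_command("input.fastq", "input.vdjca", 1000000, t)))` for `t` from 1 to 8, would yield one throughput value per thread count.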


MiXCR throughput (processed sequences per second) versus number of compute threads.


Supplementary Table 1. Comparison of existing software packages for immune repertoire

analysis using high-throughput sequencing.

| Feature | MiXCR | MiTCR | IMGT/High-V-Quest | Decombinator | IgBlast |
|---|---|---|---|---|---|
| Analysis of T-cell receptor sequences | ✓ | ✓ | ✓ | ✓ | ✓ |
| Analysis of Immunoglobulin sequences | ✓ | X | ✓ | X | ✓ |
| Supported input formats | FASTA, FASTQ, FASTQ[.gz], Paired-end FASTQ[.gz] | FASTQ, FASTQ.gz | FASTA | FASTQ | FASTA |
| Output alignments with germline sequences | ✓ | X | ✓ (<100,000 reads) | X | ✓ |
| Builds clonotypes by CDR3 | ✓ | ✓ | X | X | X |
| Builds clonotypes by full-length sequence | ✓ | X | X | X | X |
| Builds clonotypes by user-defined regions | ✓ | X | X | X | X |
| Error correction | ✓ | ✓ | X | X | X |
| Accounts for sequence quality within clonal sequence | ✓ | ✓ | X | X | X |
| Rescues low quality reads by mapping | ✓ | ✓ | X | X | X |
| Analysis of large datasets (> 500,000 reads) | ✓ | ✓ | X | ✓ | ✓ |
| Adjustment of analysis parameters | ✓ | ✓ | ✓ | X | ✓ |
| Combines information from paired-end reads | ✓ | X | X | X | X |
| CDR3 extraction | ✓ | ✓ | ✓ | X | X |
| Extracts sequences for FRs, CDR1, CDR2 | ✓ | X | ✓ | X | ✓ |
| Extracts sequences for 5’UTR, leader sequence, V and J-C introns | ✓ | X | X | X | X |
| Provides quality of extracted sequences | ✓ | X | X | X | X |
| Provides aggregated quality for built clonotypes | ✓ | ✓ | X | X | X |
| Supports translocated genes out of the box (V and J from different loci) | ✓ | X | ✓ | X | X |
| Provides candidates for germline segment in uncertain situations | ✓ | ✓ (without score) | ✓ (without score) | X | ✓ |
| Alignment of C gene (for immunoglobulin isotype identification) | ✓ | X | X | X | X |
| Multithreaded processing | ✓ | ✓ | X | X | ✓ |
| Reference | this paper | 3 | 4 | 5 | 6 |
| Distribution | Standalone | Standalone | Online | Standalone | Standalone/Online |


Supplementary Table 2. Sample data alignment performance and efficiency. We used two datasets with Homo sapiens sequencing data for T-cell receptor beta chain*** and immunoglobulin heavy chain repertoires****. We also generated two sets of synthetic human IGH and TRB sequences for which CDR3 and V, D, J genes were known in advance, and high level of PCR-like and sequencing-like errors and indels was introduced in silico. Each data set contains 100,000 sequencing reads.

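| Dataset | Software | Timing in seconds (100,000 reads) | % of reads where V or J genes were not detected |
|---|---|---|---|
| Real TCR beta data | MiXCR | 7* | 0.41 |
| Real TCR beta data | MiTCR | 2 | 2.23 |
| Real TCR beta data | IMGT | 350 000** | 0.50 |
| Real TCR beta data | Decombinator | 45 | 5.51 |
| Real TCR beta data | IgBlast | 1 070 | 0.42 |
| Real IGH data | MiXCR | 33* | 7.66 |
| Real IGH data | MiTCR | NA | NA |
| Real IGH data | IMGT | 350 000** | 9.02 |
| Real IGH data | Decombinator | NA | NA |
| Real IGH data | IgBlast | 2 983 | 9.10 |
| Synthetic TCR beta data | MiXCR | 10* | 1.36 |
| Synthetic TCR beta data | MiTCR | 5 | 14.92 |
| Synthetic TCR beta data | IMGT | 350 000** | 3.54 |
| Synthetic TCR beta data | Decombinator | 47 | 23.96 |
| Synthetic TCR beta data | IgBlast | 1 413 | 0.00 |
| Synthetic IGH data | MiXCR | 21* | 0.24 |
| Synthetic IGH data | MiTCR | NA | NA |
| Synthetic IGH data | IMGT | 350 000** | 3.03 |
| Synthetic IGH data | Decombinator | NA | NA |
| Synthetic IGH data | IgBlast | 1 938 | 0.00 |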

* MiXCR was executed with only one processing thread to make a fair comparison; in general it can use all available processor threads, significantly reducing execution time.

** IMGT needs up to one week to process 100,000 reads; the time may vary depending on the internal IMGT analysis process (in our case it took four days).

*** The TCR beta dataset was taken from Ref. 7, published in SRA under accession number SRP028752.

**** The IGH data used for these tests are not yet published. The exact data subset is deposited in SRA (SRR1842411). Briefly, RNA was extracted from healthy donor PBMC, cDNA synthesis with template switch was performed as described (Ref. 8), and the library was analyzed using 300+300 nt MiSeq sequencing.


Supplementary Table 3. Accuracy of gene segments identification. Comparison of

accuracy of gene segments identification and alignment for selected platforms on synthetic

human IGH, TRB sequences with known V, D, J genes and CDR3. Each dataset contains

100,000 reads.

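| Data | Platform | % of wrong V genes | % of wrong D genes | % of wrong J genes | % of wrong CDR3 |
|---|---|---|---|---|---|
| Synthetic TRB | MiXCR | 0.0 | 35.3 | 0.2 | 0.4 |
| Synthetic TRB | IMGT | 0.6 | 21.6 | 9.3 | 19.0 |
| Synthetic TRB | Decombinator | 3.8 | N/A | 2.3 | N/A |
| Synthetic TRB | IgBlast | 0.0 | 28.5 | 0.0 | N/A |
| Synthetic IGH | MiXCR | 0.0 | 27.8 | 0.2 | 1.6 |
| Synthetic IGH | IMGT | 1.3 | 54.4 | 11.6 | 14.1 |
| Synthetic IGH | IgBlast | 0.0 | 20.3 | 0.0 | N/A |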


Supplementary Table 4. Accuracy of C-gene segments identification. Accuracy of C-

genes identification, evaluated on synthetic data with C-gene. Each dataset contains

100,000 reads.

| Dataset | % of wrong C genes |
|---|---|
| Synthetic TRB with C gene in sequenced region | 0.15 |
| Synthetic IGH with C gene in sequenced region | 0.36 |


Supplementary Table 5. Analysis of sequencing information losses. A more detailed analysis of the alignments produced by different platforms on the datasets from Supplementary Table 2 shows that most of the unaligned reads are the same for all platforms. Nearly the same situation is found for other datasets.

| Sequences fate | Real TCR beta, % | Real IGH, % |
|---|---|---|
| Aligned by MiXCR & IgBlast & IMGT | 99.44 | 90.1 |
| Not aligned by any platform | 0.37 | 6.7 |
| Aligned by MiXCR and IgBlast (but not by IMGT) | 0.12 | 0.2 |
| By MiXCR and IMGT (but not by IgBlast) | 0.02 | 0.1 |
| By IgBlast and IMGT (but not by MiXCR) | 0.02 | 0.5 |
| Aligned only by MiXCR | 0.01 | 2.0 |
| Aligned only by IgBlast | 0.00 | 0.1 |
| Aligned only by IMGT | 0.02 | 0.3 |


Supplementary Table 6. Comparison of V, J genes identified for real sequencing reads. For each read we compared the sets of V and J hits calculated by each platform: a read was counted if the V hit sets of the platforms under consideration share at least one V gene and their J hit sets share at least one J gene. In most cases all platforms give the same V and J hits for real sequences. Some discrepancy in the IGH data analysis is probably caused by the presence of hypermutations in germline sequences and by differences in the alignment algorithms of the platforms under consideration. As Supplementary Table 3 shows, IMGT has a relatively high rate of J gene misidentification, which probably results in the greater difference between IMGT and the other systems in this test.

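| Sequences fate | Real TCR beta, % | Real IGH, % |
|---|---|---|
| Intersecting sets of V, J hits for MiXCR & IgBlast & IMGT | 96.6 | 76.1 |
| Intersecting sets of V, J hits for MiXCR & IgBlast | 96.7 | 88.3 |
| Intersecting sets of V, J hits for MiXCR & IMGT | 99.4 | 77.5 |
| Intersecting sets of V, J hits for IgBlast & IMGT | 96.6 | 83.0 |
| V, J hits extracted by MiXCR | 99.6 | 92.3 |
| V, J hits extracted by IgBlast | 99.6 | 90.9 |
| V, J hits extracted by IMGT | 99.5 | 91.0 |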
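The counting criterion described in the caption of Supplementary Table 6 can be written as a short Python sketch. The function names are hypothetical, and the choice of all reads as the denominator of the percentage is an assumption.

```python
def hits_agree(*platform_hits):
    """Return True if all platforms share at least one V and one J gene.

    platform_hits -- one (v_genes, j_genes) pair of collections per platform.
    """
    common_v = set.intersection(*(set(v) for v, _ in platform_hits))
    common_j = set.intersection(*(set(j) for _, j in platform_hits))
    return bool(common_v) and bool(common_j)

def percent_agreeing(reads):
    """Percentage of reads for which the platforms' hit sets intersect.

    reads -- list where each item holds the per-platform hits of one read.
    """
    counted = sum(hits_agree(*hits) for hits in reads)
    return 100.0 * counted / len(reads)
```

A read with hits `({"TRBV5-1", "TRBV5-4"}, {"TRBJ2-7"})` from one platform and `({"TRBV5-1"}, {"TRBJ2-7"})` from another is counted, because both the V sets and the J sets intersect; a read for which either the V sets or the J sets are disjoint is not.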


References:

1. Liao, Y., Smyth, G.K. & Shi, W. Nucleic Acids Res 41, e108 (2013).

2. Shugay, M. et al. Nat Methods 11, 653-655 (2014).

3. Bolotin, D.A. et al. Nat Methods 10, 813-814 (2013).

4. Alamyar, E., Giudicelli, V., Li, S., Duroux, P. & Lefranc, M.P. Immunome research 8, 26 (2012).

5. Thomas, N., Heather, J., Ndifon, W., Shawe-Taylor, J. & Chain, B. Bioinformatics (2013).

6. Ye, J., Ma, N., Madden, T.L. & Ostell, J.M. Nucleic Acids Res 41, W34-40 (2013).

7. Zvyagin, I.V. et al. Proc Natl Acad Sci U S A 111, 5980-5985 (2014).

8. Mamedov, I.Z. et al. Frontiers in immunology 4, 456 (2013).

