Supplementary Note 1. MiXCR analysis pipeline overview
MiXCR analysis flow
The pipeline consists of three main processing steps:
● building alignments of sequencing reads to reference V, D, J and C genes of T- or B-cell
receptors (MiXCR align command)
● assembling clonotypes (MiXCR assemble command) from the alignments obtained in the previous
step (in order to extract specific gene regions, e.g. CDR3)
● exporting alignments (MiXCR exportAlignments command) or clones (MiXCR
exportClones command) to a human-readable text file
MiXCR supports the following formats of sequencing data: fasta, fastq, fastq.gz, and paired-end fastq
and fastq.gz. As the output of each processing stage, MiXCR produces a binary compressed file with
comprehensive information about the entries produced by that stage (a “.vdjca” file with alignments and
a “.clns” file with clones). Each binary file can be converted to a human-readable, parsable
tab-delimited text file using the exportAlignments and exportClones commands.
Comprehensive documentation for all analysis steps can be found in the MiXCR user manual. Below we
briefly describe a set of key algorithms employed in MiXCR.
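The three stages described above can be sketched as a small Python helper that only builds the command lines and does not run them. The `input output` argument order for align follows the example given in Supplementary Note 5; using the same order for assemble and exportClones, as well as the helper name and the output file names, are assumptions made for illustration and should be checked against the user manual.

```python
import shlex

def pipeline_commands(reads, prefix):
    """Return the three MiXCR invocations as argument lists (nothing is executed)."""
    vdjca = f"{prefix}.vdjca"  # binary file with alignments
    clns = f"{prefix}.clns"    # binary file with clones
    return [
        ["mixcr", "align", reads, vdjca],
        # Argument order below is assumed to mirror the align command.
        ["mixcr", "assemble", vdjca, clns],
        ["mixcr", "exportClones", clns, f"{prefix}.clones.txt"],
    ]

for cmd in pipeline_commands("input.fastq", "sample"):
    print(shlex.join(cmd))
```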
Nature Methods: doi:10.1038/nmeth.3364
Supplementary Note 2. Alignment algorithm
To align V, J, and C genes to sequencing reads, we developed and implemented a specialized
version of the k-mer chaining algorithm proposed by Liao et al. (Ref. 1). The implementation is called
KAligner and can be found in our MiLib library hosted on GitHub
(https://github.com/milaboratory/milib/tree/develop/src/main/java/com/milaboratory/core/alignment).
The algorithm was modified to handle short alignments and to determine alignment boundaries
with high accuracy. We changed the seed size and implemented more robust handling of
some special cases, such as repetitive sequences (mono- and dinucleotide repeats), which
become important when seeds are short. The scoring scheme for the seed-and-vote step was also
changed.
Briefly, the algorithm consists of two steps: (1) seed-and-vote and (2) alignment of the regions between
seeds and extension of the alignments beyond the bounding seeds. The algorithm works with a
precalculated index that stores the positions of all seeds in the reference sequences. The seeding step
picks seeds at random from the target read, with distances between seed origins
uniformly distributed between two fixed numbers (from 4 to 10 nucleotides with default parameters).
The default seed size is 5 nucleotides, so in some cases seeds can overlap. The voting step
performs a heuristic search for the optimal reference sequence and its offset within a certain region
of the query sequence (sequencing read). The search maximizes a scoring function based on the
number of matching and absent seeds and on the offsets of seeds relative to the optimal position
(such offsets arise from deletions and insertions and in some rare exceptional situations). After
one or several candidates are selected, alignments are built for the spaces between seeds and for
the regions outside the boundary seeds. Alignments are built using the classical Needleman–Wunsch
algorithm with a banded matrix for the sequence between seeds, and using a modified Smith-Waterman
algorithm for the boundary sequences. After the alignments are built, a classical alignment score
(scoring matrix and penalties for indels) is calculated for all candidates, and additional filtering is
performed based on these values.
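The seed-and-vote step can be illustrated with a short Python sketch. It is a deliberately simplified model and not the KAligner implementation: it counts exact seed matches only, ignores absent seeds and indel-induced offset shifts in the score, and returns a single best candidate. The function names and the two toy reference sequences are hypothetical; only the seed size (5) and the spacing range (4 to 10) come from the text.

```python
import random
from collections import Counter, defaultdict

SEED_SIZE = 5               # default seed size quoted in the text
MIN_STEP, MAX_STEP = 4, 10  # default range of distances between seed origins

def build_index(references):
    """Precalculated index: k-mer -> list of (reference name, position)."""
    index = defaultdict(list)
    for name, seq in references.items():
        for pos in range(len(seq) - SEED_SIZE + 1):
            index[seq[pos:pos + SEED_SIZE]].append((name, pos))
    return index

def pick_seeds(read, rng):
    """Seed origins along the read with uniformly distributed spacing."""
    origins, pos = [], 0
    while pos + SEED_SIZE <= len(read):
        origins.append(pos)
        pos += rng.randint(MIN_STEP, MAX_STEP)
    return origins

def vote(read, index, rng):
    """Return ((reference, offset), votes) for the best-supported candidate."""
    votes = Counter()
    for origin in pick_seeds(read, rng):
        seed = read[origin:origin + SEED_SIZE]
        for name, ref_pos in index.get(seed, ()):
            # Seeds from one ungapped alignment share the same offset.
            votes[(name, ref_pos - origin)] += 1
    return votes.most_common(1)[0] if votes else None

# Toy reference sequences (hypothetical, not real V genes).
references = {
    "V1": "GATTACAGGCTTAACGTCCATGGAACTGCTAGTCCGATAGCATCG",
    "V2": "TTGCAACCGGTATCGAGTCTAAGGCATCCTGAGATCGGTTACAAC",
}
index = build_index(references)
read = references["V1"][7:37]  # error-free fragment starting at position 7
print(vote(read, index, random.Random(0)))
```

Because the read is an exact fragment of V1, every seed votes for reference V1 at offset 7. In a fuller implementation the winning candidates would then be passed to the banded alignment and boundary extension described above.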
Supplementary Note 3. Assembling algorithm
MiXCR assembler flow
The assembling algorithm consists of the following steps:
1. The assembler sequentially processes records (aligned reads) from the input. In the first step, the
assembler tries to extract from each aligned read the sequence of the gene feature specified by
the user (by default CDR3), called the clonal sequence; clonotypes are assembled with
respect to the clonal sequence. If an aligned read does not contain the clonal sequence (e.g. the CDR3
region), it is dropped.
2. If the clonal sequence contains at least one nucleotide with low quality (the threshold can be
changed by the user; the default value is 20), the corresponding record is deferred for further
processing by the read mapper. If the percentage of low-quality nucleotides in a deferred record is
greater than some threshold, the record is dropped completely. Records whose clonal
sequence contains only high-quality nucleotides are used to build core clonotypes by
grouping records with equal clonal sequences (e.g. CDR3). Each clonotype has two key
properties: the clonal sequence and the abundance, i.e. the number of records aggregated by this
clonotype.
3. After the core clonotypes are built, MiXCR performs mapping, which processes the records
deferred in the previous step. Mapping aims to rescue quantitative information from
low-quality reads. For this, each deferred record is mapped onto the already assembled
clonotypes: if a fuzzy match with a small number of mismatches (adjustable by the user) in
low-quality positions is found, the record is aggregated by the corresponding
clonotype; if several clonotypes match, a single clonotype is chosen randomly
with weights equal to the clonotype abundances. If no match is found, the record is
dropped.
4. After clonotypes are assembled by the initial assembler and the mapper, MiXCR proceeds to
clustering. The clustering algorithm tries to find fuzzy matches between clonotypes and
organizes matched clonotypes into a hierarchical tree (cluster), where each child layer is highly
similar to its parent but has significantly smaller abundance. Thus, clonotypes with small
abundances are attached to highly similar "parent" clonotypes with significantly greater
abundance. A typical cluster is shown in the following figure:
Typical structure of assembled cluster
5. After the clusters are built, only their heads are considered final clones. The maximal depth
of a cluster, the fuzzy matching criteria, the relative abundances of parents and children, and other
parameters can be customized by passing additional parameters to the MiXCR assemble
command (see the user manual).
6. The final step is to align the clonal sequences to reference V, D, J and C genes. Since the gene
features used to build clones (e.g. CDR3) differ from those used in the aligner (e.g.
VRegion, JRegion etc.), it is necessary to rebuild the alignments for the clonal sequences. Since all
hits are known in advance, these alignments are built using the slower but more accurate
Smith-Waterman algorithm (with minor modifications).
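Steps 2 to 5 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not MiXCR's assembler: fuzzy matching is reduced to substitutions between sequences of equal length, the cut-off on the percentage of low-quality nucleotides is omitted, and the parameters `max_mismatches` and `max_ratio` (and their values) are hypothetical stand-ins for the user-adjustable settings mentioned in the text. Only the quality threshold of 20 is taken from the text.

```python
import random
from collections import Counter

QUALITY_THRESHOLD = 20  # default low-quality threshold quoted in the text

def mismatches(a, b):
    """Positions where two equal-length sequences differ (None if lengths differ)."""
    if len(a) != len(b):
        return None
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

def build_core_and_map(records, max_mismatches=1, rng=None):
    """records: iterable of (clonal sequence, list of per-base quality values)."""
    rng = rng or random.Random(0)
    core, deferred = Counter(), []
    for seq, quals in records:
        if min(quals) >= QUALITY_THRESHOLD:
            core[seq] += 1                 # only high-quality bases: exact grouping
        else:
            deferred.append((seq, quals))  # left for the mapper
    clones = Counter(core)
    for seq, quals in deferred:
        hits = []
        for clone in core:
            diff = mismatches(seq, clone)
            # Accept only mismatches that fall on low-quality positions.
            if (diff is not None and len(diff) <= max_mismatches
                    and all(quals[i] < QUALITY_THRESHOLD for i in diff)):
                hits.append(clone)
        if hits:  # several hits: random choice weighted by core abundance
            chosen = rng.choices(hits, weights=[core[h] for h in hits])[0]
            clones[chosen] += 1
        # No hit: the record is dropped.
    return clones

def cluster(clones, max_mismatches=1, max_ratio=0.1):
    """Attach each low-abundance clonotype to a similar, much larger parent."""
    parent = {}
    ordered = sorted(clones, key=clones.get, reverse=True)
    for i, child in enumerate(ordered):
        for candidate in ordered[:i]:      # more abundant clonotypes only
            diff = mismatches(child, candidate)
            if (diff is not None and 0 < len(diff) <= max_mismatches
                    and clones[child] <= max_ratio * clones[candidate]):
                parent[child] = candidate
                break
    heads = [c for c in ordered if c not in parent]
    return heads, parent
```

A record such as `("TGTGACAGC", [30, 30, 30, 30, 10, 30, 30, 30, 30])` would be deferred because of the low-quality base at position 4 and then mapped onto a core clonotype `TGTGCCAGC`, since the only mismatch falls on that low-quality position. The heads returned by `cluster` correspond to the final clones of step 5.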
Supplementary Note 4. Synthetic data generation
To compare MiXCR with existing software tools and to evaluate its performance on datasets with
known clonotypes, we developed a pipeline that generates synthetic data reproducing real
distributions of sequencing and PCR error rates.
The pipeline consists of the following steps:
1. Generation of clonotypes
A. Generation of the clonal sequence. Each clonal sequence is generated in three steps:
a. Random generation of a combination of V, J, and D genes (for loci that have D
genes)
b. Random generation of a combination of the numbers of nucleotides trimmed from the 3’
end of the V gene, the 3’ and 5’ ends of the D gene and the 5’ end of the J gene. If the sum of
the numbers of trimmed nucleotides from both sides of the D gene is less than zero, the D gene is
considered absent.
c. Generation of random inserts (N regions).
All probabilities for these steps were learned from several real datasets from our
collection.
B. Generation of fraction values (abundances) for each generated clonotype
2. Generation of reads
A. Random picking of a clonotype with probabilities equal to the relative abundances
B. Incorporation of PCR errors and quality-independent sequencing errors, including indels, into
the target sequence
C. Generation of quality values for each nucleotide in the sequence
D. Incorporation of additional sequencing mismatches with probabilities equal to $10^{-q/10}$,
where q is the quality value.
Rates of PCR errors and quality-independent sequencing errors were learned from our data.
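Steps 2A and 2D can be sketched in Python as follows. This is an illustrative sketch and not the generator used for the paper: the function names are hypothetical, the replacement base is assumed to be drawn uniformly from the three other nucleotides, and PCR errors, indels and the generation of quality values (steps 2B and 2C) are left out.

```python
import random

def error_probability(q):
    """Probability that a base with quality value q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

def sample_clonotype(abundances, rng):
    """Step 2A: pick a clonotype with probability equal to its relative abundance."""
    sequences = list(abundances)
    weights = [abundances[s] for s in sequences]
    return rng.choices(sequences, weights=weights)[0]

def add_sequencing_errors(sequence, qualities, rng):
    """Step 2D: substitute each base with probability 10^(-q/10)."""
    out = []
    for base, q in zip(sequence, qualities):
        if rng.random() < error_probability(q):
            # Assumption: the wrong base is uniform over the other three.
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(0)
clonotypes = {"TGTGCCAGCAGT": 0.7, "TGTGCTTGGAGC": 0.3}  # toy abundances
template = sample_clonotype(clonotypes, rng)
read = add_sequencing_errors(template, [30] * len(template), rng)
print(template, read)
```

For example, a quality value of 20 corresponds to an error probability of 0.01 and a quality value of 30 to 0.001.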
The MiXCR-Test suite with its User Manual can be downloaded from the MiXCR website
(http://mixcr.milaboratory.com).
Supplementary Note 5. MiXCR performance
We ran the full MiXCR analysis pipeline on an input file truncated to different numbers of sequences and
measured the time and memory consumption of each MiXCR routine (alignment, assembling). As
seen from the figure below, the overall execution time of MiXCR grows linearly with the number of
input sequences, i.e. MiXCR scales linearly. The amount of memory used becomes nearly
constant for large input datasets and grows rapidly for small ones. The main reason for this
behavior is Java garbage collection and the Java Memory Model. The only MiXCR routine that
consumes a large amount of memory is the assembling of final clonotypes, since it stores all clonotypes
in memory. Thus, the memory consumed by assemble grows linearly with the number of clonotypes,
while the amount of memory needed for align is constant even for huge input fastq files. The input file
used is available under accession SRR1200517 (Ref. 2). In order to limit the input to different numbers
of sequences we used the built-in `-n` option of MiXCR, which limits the number of analyzed sequences
(i.e. takes only the first n sequences from the input file), e.g.
`mixcr align input.fastq input.vdjca -n 1000000`.
Dependence of time and RAM used by MiXCR on the number of sequences in input file.
We then ran the MiXCR align command with different numbers of threads (using the `-t` option) on
the same input fastq file with one million sequences. We used an Intel(R) Xeon(R) CPU E5-1620 @
3.60GHz with 4 physical cores and 8 hardware threads. As one can see in the figure below, the
throughput grows linearly and reaches a maximum at 6 threads; further increasing the number
of threads does not improve performance. The throughput of the assembling procedure is not shown,
since it mostly uses a single thread. However, specifying more threads
can significantly increase the performance of the assembler when the input file contains a small
number of clonotypes.
MiXCR throughput (processed sequences per second) versus number of compute threads.
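The command lines for the two experiments in this note can be generated with a short Python sketch. Only the `-n` and `-t` options and the `mixcr align input.fastq input.vdjca -n 1000000` form are taken from the text; the helper names, the specific input sizes and thread counts, and the combination of `-n` with `-t` in one call are assumptions made for illustration. The sketch prints the commands and does not execute them.

```python
import shlex

def align_command(reads, output, n_reads=None, threads=None):
    """Build a `mixcr align` argument list using the -n and -t options."""
    cmd = ["mixcr", "align", reads, output]
    if n_reads is not None:
        cmd += ["-n", str(n_reads)]  # analyze only the first n sequences
    if threads is not None:
        cmd += ["-t", str(threads)]  # number of processing threads
    return cmd

def sweep(reads, output, sizes=(), thread_counts=()):
    """Commands for varying the input size, then varying the thread count."""
    by_size = [align_command(reads, output, n_reads=n) for n in sizes]
    by_threads = [align_command(reads, output, n_reads=1000000, threads=t)
                  for t in thread_counts]
    return by_size + by_threads

# Hypothetical grid of input sizes and thread counts.
for cmd in sweep("input.fastq", "input.vdjca",
                 sizes=(100000, 1000000), thread_counts=(1, 2, 4, 6, 8)):
    print(shlex.join(cmd))
```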
Supplementary Table 1. Comparison of existing software packages for immune repertoire analysis using high-throughput sequencing.
| Feature | MiXCR | MiTCR | IMGT/High-V-Quest | Decombinator | IgBlast |
| --- | --- | --- | --- | --- | --- |
| Analysis of T-cell receptor sequences | ✓ | ✓ | ✓ | ✓ | ✓ |
| Analysis of Immunoglobulin sequences | ✓ | X | ✓ | X | ✓ |
| Supported input formats | FASTA, FASTQ, FASTQ[.gz], paired-end FASTQ[.gz] | FASTQ, FASTQ.gz | FASTA | FASTQ | FASTA |
| Output alignments with germline sequences | ✓ | X | ✓ (<100,000 reads) | X | ✓ |
| Builds clonotypes by CDR3 | ✓ | ✓ | X | X | X |
| Builds clonotypes by full-length sequence | ✓ | X | X | X | X |
| Builds clonotypes by user-defined regions | ✓ | X | X | X | X |
| Error correction | ✓ | ✓ | X | X | X |
| Accounts for sequence quality within clonal sequence | ✓ | ✓ | X | X | X |
| Rescues low quality reads by mapping | ✓ | ✓ | X | X | X |
| Analysis of large datasets (> 500,000 reads) | ✓ | ✓ | X | ✓ | ✓ |
| Adjustment of analysis parameters | ✓ | ✓ | ✓ | X | ✓ |
| Combines information from paired-end reads | ✓ | X | X | X | X |
| CDR3 extraction | ✓ | ✓ | ✓ | X | X |
| Extracts sequences for FRs, CDR1, CDR2 | ✓ | X | ✓ | X | ✓ |
| Extracts sequences for 5’UTR, leader sequence, V and J-C introns | ✓ | X | X | X | X |
| Provides quality of extracted sequences | ✓ | X | X | X | X |
| Provides aggregated quality for built clonotypes | ✓ | ✓ | X | X | X |
| Supports translocated genes out of the box (V and J from different loci) | ✓ | X | ✓ | X | X |
| Provides candidates for germline segment in uncertain situations | ✓ | ✓ (without score) | ✓ (without score) | X | ✓ |
| Alignment of C gene (for immunoglobulin isotype identification) | ✓ | X | X | X | X |
| Multithreaded processing | ✓ | ✓ | X | X | ✓ |
| Reference | this paper | 3 | 4 | 5 | 6 |
| Distribution | Standalone | Standalone | Online | Standalone | Standalone/Online |
Supplementary Table 2. Sample data alignment performance and efficiency. We used two datasets with Homo sapiens sequencing data for T-cell receptor beta chain*** and immunoglobulin heavy chain repertoires****. We also generated two sets of synthetic human IGH and TRB sequences for which CDR3 and V, D, J genes were known in advance and into which a high level of PCR-like and sequencing-like errors and indels was introduced in silico. Each dataset contains 100,000 sequencing reads.
| Dataset | Software | Timing in seconds (100,000 reads) | % of reads where V or J genes were not detected |
| --- | --- | --- | --- |
| Real TCR beta data | MiXCR | 7* | 0.41 |
| Real TCR beta data | MiTCR | 2 | 2.23 |
| Real TCR beta data | IMGT | 350,000** | 0.50 |
| Real TCR beta data | Decombinator | 45 | 5.51 |
| Real TCR beta data | IgBlast | 1,070 | 0.42 |
| Real IGH data | MiXCR | 33* | 7.66 |
| Real IGH data | MiTCR | NA | NA |
| Real IGH data | IMGT | 350,000** | 9.02 |
| Real IGH data | Decombinator | NA | NA |
| Real IGH data | IgBlast | 2,983 | 9.10 |
| Synthetic TCR beta data | MiXCR | 10* | 1.36 |
| Synthetic TCR beta data | MiTCR | 5 | 14.92 |
| Synthetic TCR beta data | IMGT | 350,000** | 3.54 |
| Synthetic TCR beta data | Decombinator | 47 | 23.96 |
| Synthetic TCR beta data | IgBlast | 1,413 | 0.00 |
| Synthetic IGH data | MiXCR | 21* | 0.24 |
| Synthetic IGH data | MiTCR | NA | NA |
| Synthetic IGH data | IMGT | 350,000** | 3.03 |
| Synthetic IGH data | Decombinator | NA | NA |
| Synthetic IGH data | IgBlast | 1,938 | 0.00 |
* MiXCR was executed with only one processing thread in order to make a fair comparison, but in general it can use all available processor threads, significantly reducing execution time.
** IMGT needs up to one week to process 100,000 reads; the time may vary depending on the internal IMGT analysis process; in our case it took four days.
*** The TCR beta dataset was taken from ref. 7, published in SRA under accession number SRP028752.
**** The IGH data used for these tests are not yet published. The exact data subset is deposited in SRA (SRR1842411). Briefly, RNA was extracted from healthy donor PBMC, cDNA synthesis with template switch was performed as described (ref. 8), and the library was analyzed using 300+300 nt MiSeq sequencing.
Supplementary Table 3. Accuracy of gene segments identification. Comparison of accuracy of gene segment identification and alignment for selected platforms on synthetic human IGH and TRB sequences with known V, D, J genes and CDR3. Each dataset contains 100,000 reads.
| Data | Platform | % of wrong V genes | % of wrong D genes | % of wrong J genes | % of wrong CDR3 |
| --- | --- | --- | --- | --- | --- |
| Synthetic TRB | MiXCR | 0.0 | 35.3 | 0.2 | 0.4 |
| Synthetic TRB | IMGT | 0.6 | 21.6 | 9.3 | 19.0 |
| Synthetic TRB | Decombinator | 3.8 | N/A | 2.3 | N/A |
| Synthetic TRB | IgBlast | 0.0 | 28.5 | 0.0 | N/A |
| Synthetic IGH | MiXCR | 0.0 | 27.8 | 0.2 | 1.6 |
| Synthetic IGH | IMGT | 1.3 | 54.4 | 11.6 | 14.1 |
| Synthetic IGH | IgBlast | 0.0 | 20.3 | 0.0 | N/A |
Supplementary Table 4. Accuracy of C-gene segments identification. Accuracy of C-gene identification, evaluated on synthetic data with C-gene. Each dataset contains 100,000 reads.
| Dataset | % of wrong C genes |
| --- | --- |
| Synthetic TRB with C gene in sequenced region | 0.15 |
| Synthetic IGH with C gene in sequenced region | 0.36 |
Supplementary Table 5. Analysis of sequencing information losses. A more detailed analysis of the alignments produced by different platforms on the datasets from Supplementary Table 2 shows that most unaligned reads are the same for all platforms. Nearly the same situation is also found for other datasets.
| Sequence fate | Real TCR beta, % | Real IGH, % |
| --- | --- | --- |
| Aligned by MiXCR & IgBlast & IMGT | 99.44 | 90.1 |
| Not aligned by any platform | 0.37 | 6.7 |
| Aligned by MiXCR and IgBlast (but not by IMGT) | 0.12 | 0.2 |
| Aligned by MiXCR and IMGT (but not by IgBlast) | 0.02 | 0.1 |
| Aligned by IgBlast and IMGT (but not by MiXCR) | 0.02 | 0.5 |
| Aligned only by MiXCR | 0.01 | 2.0 |
| Aligned only by IgBlast | 0.00 | 0.1 |
| Aligned only by IMGT | 0.02 | 0.3 |
Supplementary Table 6. Comparison of V, J genes identified for real sequencing reads. For each read we compared the sets of V and J hits calculated by each platform: a read was counted if the V hit sets of all platforms under consideration share at least one V gene and their J hit sets share at least one J gene. In most cases all platforms give the same V and J hits for real sequences. Some discrepancy in the IGH data analysis is probably caused by the presence of hypermutations in germline sequences and by differences in the alignment algorithms of the platforms under consideration. As one can see from Supplementary Table 3, IMGT has a relatively high rate of J gene misidentification, which probably results in the greater difference between IMGT and the other systems in this test.
| Sequence fate | Real TCR beta, % | Real IGH, % |
| --- | --- | --- |
| Intersecting sets of V, J hits for MiXCR & IgBlast & IMGT | 96.6 | 76.1 |
| Intersecting sets of V, J hits for MiXCR & IgBlast | 96.7 | 88.3 |
| Intersecting sets of V, J hits for MiXCR & IMGT | 99.4 | 77.5 |
| Intersecting sets of V, J hits for IgBlast & IMGT | 96.6 | 83.0 |
| V, J hits extracted by MiXCR | 99.6 | 92.3 |
| V, J hits extracted by IgBlast | 99.6 | 90.9 |
| V, J hits extracted by IMGT | 99.5 | 91.0 |
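The counting rule behind this table can be sketched in Python as follows. It is an illustration and not the script used for the paper: the data layout (one dictionary per read, mapping a platform name to its V and J hit sets) and the function names are hypothetical, the gene calls are made up, and the percentage is assumed to be taken over all reads, with a read counted as non-intersecting if any platform produced no hits for it.

```python
def hits_intersect(calls):
    """calls: list of (V hit set, J hit set), one pair per platform."""
    v_common = set.intersection(*(v for v, _ in calls))
    j_common = set.intersection(*(j for _, j in calls))
    return bool(v_common) and bool(j_common)

def percent_intersecting(reads, platforms):
    """Share of reads whose V and J hit sets overlap across all given platforms."""
    counted = 0
    for read in reads:
        if all(p in read for p in platforms) and hits_intersect(
                [read[p] for p in platforms]):
            counted += 1
    return 100.0 * counted / len(reads)

# Made-up calls for two reads.
reads = [
    {"MiXCR": ({"TRBV12-3", "TRBV12-4"}, {"TRBJ1-2"}),
     "IgBlast": ({"TRBV12-3"}, {"TRBJ1-2"}),
     "IMGT": ({"TRBV12-4"}, {"TRBJ1-2"})},
    {"MiXCR": ({"TRBV5-1"}, {"TRBJ2-7"}),
     "IgBlast": ({"TRBV5-1"}, {"TRBJ2-7"})},  # no IMGT hits for this read
]
print(percent_intersecting(reads, ["MiXCR", "IgBlast"]))
print(percent_intersecting(reads, ["MiXCR", "IgBlast", "IMGT"]))
```

In the first read each pair of platforms shares a V gene, but no single V gene is common to all three, so the read counts for the pairwise rows and not for the three-way row.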
References:
1. Liao, Y., Smyth, G.K. & Shi, W. Nucleic Acids Res 41, e108 (2013).
2. Shugay, M. et al. Nat Methods 11, 653-655 (2014).
3. Bolotin, D.A. et al. Nat Methods 10, 813-814 (2013).
4. Alamyar, E., Giudicelli, V., Li, S., Duroux, P. & Lefranc, M.P. Immunome Res 8, 26 (2012).
5. Thomas, N., Heather, J., Ndifon, W., Shawe-Taylor, J. & Chain, B. Bioinformatics (2013).
6. Ye, J., Ma, N., Madden, T.L. & Ostell, J.M. Nucleic Acids Res 41, W34-40 (2013).
7. Zvyagin, I.V. et al. Proc Natl Acad Sci U S A 111, 5980-5985 (2014).
8. Mamedov, I.Z. et al. Front Immunol 4, 456 (2013).