

nanotyper: an HLA genotyping algorithm for nanopore sequencing data

Steffen Klasberg¹, Kathrin Putke¹, Vineeth Surendranath¹, Alexander H Schmidt¹,², Vinzenz Lange¹, Gerhard Schöfl¹

¹ DKMS Life Science Lab, St. Petersburger Str. 2, 01069 Dresden, Germany; ² DKMS, Kressbach 1, 72072 Tübingen, Germany

Introduction

Nanopore sequencing may put sequence-based HLA genotyping within reach of even the smallest immunogenetic laboratories. Without requiring major capital investment, it delivers very long reads that easily cover both HLA class I and class II genes. In addition, results can be obtained rapidly and at reasonable cost. However, these advantages come at the price of a per-read accuracy of only ~85 to 90 %, well below the high accuracy of NGS reads. Here, we present an algorithmic approach for obtaining accurate genotyping results from noisy third-generation sequencing data. Nanotyper is a two-step pipeline: first, allele-specific mappings are created using the long-read path of DR2S; second, hierarchical probabilistic typing is performed with Typlomat.


Results

Our still-maturing nanotyper pipeline consists of the freely available, open-source R packages DR2S and Typlomat, both developed in-house. It infers the genotype of HLA class I and class II genes using only data from ONT's nanopore sequencers. Nanotyper delivers ultra-high 4-field resolution while keeping ambiguities low. DR2S and Typlomat can be used in an interactive R session or from the Linux command line on a standard workstation. Both modules benefit from parallel computing frameworks. DR2S takes about 10 min per sample on 8 cores, depending on target size and read coverage. Typlomat can process a batch of samples simultaneously and takes about 15 min + 1 min per sample per core. The accuracy and performance of the resulting genotypes are discussed on poster P184.

Figure 1: First step of the nanotyper pipeline. The nanopore reads of heterozygous samples are filtered and, after an initial mapping, assigned to individual alleles based on true polymorphic positions. The best results are obtained by iteratively mapping the reads to the consensus sequences from the previous mapping step. The resulting allele-specific mappings serve as input for the next step of nanotyper.
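The read-assignment idea in Figure 1 can be illustrated with a minimal Python sketch. This is not the DR2S implementation (DR2S is an R package and its internals are not described here). It assumes reads that are already aligned to a common coordinate system and have equal length, and it treats a column as truly polymorphic when the second most common base reaches a hypothetical frequency threshold `min_minor_freq`. All function names are illustrative.

```python
from collections import Counter


def polymorphic_positions(reads, min_minor_freq=0.25):
    """Columns whose second most common base is frequent enough to be a
    heterozygous site rather than sequencing noise (threshold is assumed)."""
    positions = []
    for col in range(len(reads[0])):
        top_two = Counter(read[col] for read in reads).most_common(2)
        if len(top_two) == 2 and top_two[1][1] / len(reads) >= min_minor_freq:
            positions.append(col)
    return positions


def group_consensus(group, positions):
    """Majority base of a read group at each selected column."""
    return [Counter(read[p] for read in group).most_common(1)[0][0]
            for p in positions]


def assign_reads(reads, min_minor_freq=0.25, max_iter=10):
    """Split reads into two allele groups (labels 0 and 1)."""
    positions = polymorphic_positions(reads, min_minor_freq)
    if not positions:
        return [0] * len(reads)  # no heterozygous site found: one group

    # Seed the split with the base observed at the first polymorphic column.
    seed = positions[0]
    major = Counter(read[seed] for read in reads).most_common(1)[0][0]
    labels = [0 if read[seed] == major else 1 for read in reads]

    # Iterate: build a consensus per group, then reassign each read to the
    # consensus it mismatches least at the polymorphic columns.
    for _ in range(max_iter):
        groups = [[r for r, lab in zip(reads, labels) if lab == g]
                  for g in (0, 1)]
        if not groups[0] or not groups[1]:
            break
        cons = [group_consensus(g, positions) for g in groups]
        new_labels = []
        for read, old in zip(reads, labels):
            d0, d1 = (sum(read[p] != c for p, c in zip(positions, con))
                      for con in cons)
            new_labels.append(old if d0 == d1 else (0 if d0 < d1 else 1))
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

Using several polymorphic columns at once makes the assignment tolerant of an isolated read error at one of them, which is the reason for the consensus-based reassignment loop in this sketch.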

Mind the gap! Long homopolymer stretches are a special challenge for third-generation sequencing. Current platforms cannot accurately capture the number of nucleotides in a homopolymer. Therefore, the genotyping algorithm cannot distinguish alleles that differ in homopolymer length alone.
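The consequence of this limitation can be shown with a small Python sketch. It is an illustration only, not part of nanotyper: it assumes that runs of at least `min_len` identical bases (a hypothetical threshold) carry no reliable length information and caps them at that length before two allele sequences are compared.

```python
from itertools import groupby


def collapse_homopolymers(seq, min_len=5):
    """Cap every run of identical bases at min_len copies, so that length
    differences in long homopolymers no longer show up in the sequence."""
    return "".join(base * min(len(list(run)), min_len)
                   for base, run in groupby(seq))


def distinguishable(allele_a, allele_b, min_len=5):
    """True if two alleles still differ once long homopolymers are capped."""
    return (collapse_homopolymers(allele_a, min_len)
            != collapse_homopolymers(allele_b, min_len))
```

Two alleles that differ only in the length of a long homopolymer collapse to the same string and are reported as indistinguishable, while any substitution outside the run keeps them apart.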

Figure 2: The hierarchical mapping approach of Typlomat. (A) In the first round, only the exons of the antigen-recognition domain are scored. The best-scoring 10 % of target alleles are propagated into the next round, which scores (B) the remaining exons, followed by (C) the non-coding features, to (D) finally arrive at the genotyping result.
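The round-based filtering in Figure 2 can be sketched in Python as follows. This is a simplified illustration, not the Typlomat code. It assumes that the 10 % cut is applied after every round except the last and that each round ranks the candidates by that round's summed feature scores; the caption states neither explicitly. `score_feature` is a hypothetical callback supplied by the caller.

```python
import math


def hierarchical_typing(alleles, rounds, score_feature, keep_fraction=0.10):
    """Rank candidate alleles round by round.

    alleles       -- iterable of candidate allele names
    rounds        -- list of feature lists, e.g. ARD exons, other exons,
                     non-coding features
    score_feature -- callable (allele, feature) -> numeric score
    keep_fraction -- share of candidates propagated to the next round
    """
    survivors = list(alleles)
    for i, features in enumerate(rounds):
        # Rank the current candidates by the summed score of this round.
        survivors.sort(
            key=lambda allele: sum(score_feature(allele, f) for f in features),
            reverse=True,
        )
        # Every round but the last passes on only the best candidates.
        if i < len(rounds) - 1:
            keep = max(1, math.ceil(len(survivors) * keep_fraction))
            survivors = survivors[:keep]
    return survivors  # best candidate first
```

Scoring the most informative features first keeps the expensive later rounds small: with 100 candidates, only 10 reach the second round in this sketch.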

Figure 3: Probabilistic scoring in Typlomat. The allele-specific long-read mapping is transformed into a position-specific weight matrix (PWM). The consensus sequence of the mapping is added to a feature-wise, pre-computed multiple sequence alignment (MSA) of the target alleles. Terminal borders in the MSA denote the feature boundaries of the allele. Insertions and deletions of the reference relative to the target allele database are inferred from the MSA and injected into the PWM. The score for a feature is obtained by summing the nucleotide weights. The scores of all features in a round are summed to rank the target alleles. Missing information, i.e. features absent from incomplete target alleles, is substituted by the best occurring score.
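A minimal Python sketch of this kind of scoring is given below. It is not the Typlomat implementation and makes several assumptions: the weights are log frequencies with a pseudocount (the caption only speaks of "nucleotide weights"), the reads and the target alleles are already aligned to the same columns, and the injection of insertions and deletions from the MSA is left out. All names and data structures are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"  # '-' marks an alignment gap


def build_pwm(reads, pseudocount=1.0):
    """One dict of log-frequency weights per alignment column."""
    total = len(reads) + pseudocount * len(ALPHABET)
    pwm = []
    for col in range(len(reads[0])):
        counts = Counter(read[col] for read in reads)
        pwm.append({base: math.log((counts[base] + pseudocount) / total)
                    for base in ALPHABET})
    return pwm


def rank_alleles(pwm, alleles, features):
    """Rank target alleles by their summed feature scores.

    alleles  -- {allele name: {feature name: aligned sequence or None}}
    features -- {feature name: (start column, end column)}
    A feature that is missing for an allele (None) receives the best score
    any other allele achieved for that feature.
    """
    totals = {name: 0.0 for name in alleles}
    for feature, (start, _end) in features.items():
        scores = {}
        for name, seqs in alleles.items():
            seq = seqs.get(feature)
            if seq is not None:
                # Feature score: sum of the weights of the allele's bases.
                scores[name] = sum(pwm[start + i][base]
                                   for i, base in enumerate(seq))
        best = max(scores.values(), default=0.0)
        for name in alleles:
            totals[name] += scores.get(name, best)
    return sorted(totals, key=totals.get, reverse=True)
```

Substituting the best occurring score means an incomplete allele is never ranked down for sequence it simply lacks in the database, so it stays in contention with fully characterised alleles.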

DKMS Life Science Lab

St. Petersburger Str. 2 • 01069 Dresden • www.dkms-lab.de

Download from GitHub: https://github.com/DKMS-LSL