Supporting information:
A. Materials and Methods
1. Ethics statement
2. Bronchoalveolar lavage fluid specimens
3. Enrichment in P. jirovecii cells by immuno-precipitation
4. DNA extraction and amplification
5. BALF specimens screening for high proportion of P. jirovecii DNA
6. Investigation of the presence or absence of another Pneumocystis species
7. Investigation of the presence or absence of another fungal species
8. High throughput sequencing
9. Filtering and P. jirovecii genome assembly
10. Transcriptome assembly and annotation
11. Gene predictions and functional annotations
12. Phylogeny
13. Search for missing genes
14. Recovery, assembly, and annotation of the ribosomal RNA unit
15. Recovery, assembly, and annotation of mitochondrial genome
B. Data access
C. References for supplementary information
D. Supplementary figure
E. Supplementary tables
A. Materials and Methods
1. Ethics statement
The study protocol was approved by the institutional review board (Commission cantonale
[VAUD] d'éthique de la recherche sur l'être humain). The project was considered part of
research for improving diagnosis. All patients provided informed oral consent, which was
part of the hospital admission procedure. This oral consent was documented by the
absence in their chart of any request that their samples not be used for research.
The institutional review board approved the protocol for oral consent and documentation. The
samples were treated anonymously.
2. Bronchoalveolar lavage fluid specimens
Fresh bronchoalveolar lavage fluids (BALFs) positive for P. jirovecii by Grocott’s
methenamine silver staining were supplemented with 15% v/v glycerol, frozen in liquid
nitrogen, and stored at -80°C. Only those with a sufficient volume (more than 1 ml) and a
heavy load were stored. Four specimens were stored between 2005 and 2008, and were used
for the selection procedure described below.
3. Enrichment in P. jirovecii cells by immuno-precipitation
The BALF was centrifuged at 10’000 rpm for 10 min. After removal of the supernatant, the
pellet was resuspended in 200 µl of 1X PBS, mixed with 50 µl of antibody solution
(Pneumo-Cel IF, Cellabs, Australia), and incubated overnight at 4°C without shaking. The
solution was then added to 50 ml of Dynabeads® Protein G pretreated with 0.1 M Na acetate
pH 5/0.01% Tween-20 and PBS pH 7.4 according to the manufacturer’s instructions
(Invitrogen, Switzerland), and the whole mixture was incubated overnight at 4°C. Beads were
washed three times with 1X PBS by concentration for two minutes using an immunomagnetic
separation rack (Invitrogen, Switzerland). P. jirovecii cells were eluted with 200 µl of elution
buffer (0.1 M citrate buffer pH 2), 100 µl of 1 M Tris pH 7.5 was added, and the enriched
BALF was stored at -20°C.
4. DNA extraction and amplification
Genomic DNA was extracted from BALF specimens using QIAamp® DNA Mini kit
(QIAGEN, Germany), and resuspended in 50 µl of elution buffer. For each BALF, 1 µl of
DNA was randomly amplified in a 20 µl reaction using the Illustra GenomiPhi V2 DNA
Amplification Kit (GE Healthcare, Switzerland) according to the manufacturer’s instructions.
To obtain a sufficient amount of DNA for high throughput sequencing, each DNA sample was
amplified in ten separate 20 µl reactions, which were pooled and purified using the QIAamp® DNA
Blood Mini kit (QIAGEN, Germany). Amplified DNA fragments (size ≥ 10 kb) were visualized
by ethidium bromide agarose gel electrophoresis. The quantity and purity were estimated
using Quant-iT™ DNA Assays (Invitrogen, Switzerland).
5. BALF specimens screening for high proportion of P. jirovecii DNA
The proportion of P. jirovecii DNA in the four BALFs was estimated by Roche 454 low level
pyrosequencing of amplified DNA (1/8 plate). The resulting reads were assigned to various
organisms using the following simplified bioinformatics classification pipeline. Shotgun 454
reads were mapped onto the human genome using Roche’s gsMapper (Newbler v. 2.6). Human
genomic DNA was identified exclusively by this method and was not further investigated.
Unmapped reads were filtered to discriminate P. jirovecii reads from those of other organisms
using an all-against-all blast comparison. Because the aim was only to estimate the proportion
of P. jirovecii DNA within each BALF, we considered reads as being from P. jirovecii only if
they exhibited significant homology with P. carinii (best blast hit, Blastx and Blastn, e-value
≤ 10⁻⁵). We did not take into account the reads that had their best blast hit against other fungal
proteomes. To verify the specificity and sensitivity of the blast searches, randomly simulated
reads were generated from the Mycobacterium leprae, S. pombe, Kluyveromyces lactis, and human
adenovirus genomes using custom Perl scripts, and were used as controls. The CG
odds ratios were not computed because of the small number of reads. The results are shown in
Table S1.
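As an illustration of the best-blast-hit criterion described in this section, the following minimal Python sketch selects one best hit per read and estimates the proportion of reads assigned to a target organism. It is not the authors' pipeline (which relied on custom Perl scripts). It assumes the 12-column tabular BLAST output (BLAST+ -outfmt 6 or legacy -m 8), and the function names, the subject identifier prefix "Pcar_", and the example hits are hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {read_id: subject_id}, keeping the highest-bitscore hit per read.

    Assumes 12 tab-separated columns: query, subject, % identity, length,
    mismatches, gap opens, qstart, qend, sstart, send, e-value, bit score.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is not significant at the chosen cut-off
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def proportion_assigned(hits, total_reads, is_target):
    """Fraction of all reads whose best hit satisfies is_target(subject_id)."""
    assigned = sum(1 for subject in hits.values() if is_target(subject))
    return assigned / total_reads


# Hypothetical example: four reads, one of which has no hit at all.
example = [
    "read1\tPcar_0001\t82.0\t100\t18\t0\t1\t100\t1\t100\t1e-20\t95.0",
    "read1\tScer_0042\t70.0\t90\t27\t0\t1\t90\t1\t90\t1e-8\t55.0",
    "read2\tScer_0007\t75.0\t80\t20\t0\t1\t80\t1\t80\t1e-3\t30.0",
    "read3\tScer_0100\t88.0\t100\t12\t0\t1\t100\t1\t100\t1e-30\t120.0",
]
hits = best_hits(example)
fraction = proportion_assigned(
    hits, total_reads=4, is_target=lambda s: s.startswith("Pcar_")
)
print(hits, fraction)
```

Reads without any significant hit stay in the denominator, so the estimate is conservative in the same sense as the criterion above.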
6. Investigation of the presence or absence of another Pneumocystis species
The data obtained from the preliminary 454 sequencings of the four BALFs (section A5) were
used to determine whether they contained a single or multiple species of Pneumocystis. Seven genes,
whose sequences are available for several Pneumocystis species, S. pombe, and S. cerevisiae,
were used as markers. These genes were those encoding heat shock protein 70, dihydrofolate reductase,
dihydropteroate synthase, beta-tubulin, superoxide dismutase, cyclin-dependent kinase, and
guanosine nucleotide binding protein (see Table S6 for accession numbers). Profiles and hidden
Markov models were built using MAFFT [1] and Pftools [2], and used to screen the raw 454
reads. Spurious alignments, paralogs, and non-discriminative sequences (e.g. regions
conserved between human and Pneumocystis spp.) were removed by manual inspection using
Jalview [3]. Pairwise identities were computed using custom Perl scripts. The results were
verified using a PCR that amplifies the mitochondrial large subunit 26S rRNA gene of all
Pneumocystis species, followed by sequencing of the PCR product, as well as using PCRs
specific for P. carinii or P. wakefieldiae, another Pneumocystis species that specifically infects
rats [4].
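The pairwise identity computation mentioned above can be sketched in Python as follows. This is an illustrative stand-in for the custom Perl scripts, not their actual code. It assumes that identity is the percentage of identical residues among alignment columns where neither sequence has a gap; other definitions (e.g. dividing by the full alignment length) are equally common, and the one used by the authors is not stated.

```python
def pairwise_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Columns in which either sequence carries a gap are ignored
    (an assumption; the original scripts may have used another definition).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Five ungapped columns, four of them identical.
identity = pairwise_identity("ATG-CA", "ATGGCT")
print(identity)
```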
7. Investigation of the presence or absence of another fungal species
To investigate whether a fungal species other than Pneumocystis was present in the four BALFs, the
454 reads were taxonomically classified using Blastx (e-value ≤ 10⁻⁵⁰ against the UniProtKB
database with filter "m S") and MEGAN (min score 150) [5]. Reads that could be assigned at
the species level were collected, as well as the complete proteome of the species against which
they had their hits. As Pneumocystis predicted peptides are not included in UniProtKB, the
P. carinii partial genome was re-annotated (ca. 4’591 predicted peptides, section A11) and added to
a custom database containing the complete proteomes of the other relevant fungal species. The
assigned reads were then compared to this database using Blastx (e-value ≤ 10⁻⁶), and were
classified into three groups: group 1 included those with their best blast hit against P. carinii,
group 2 included those with their best blast hit against another fungal species and without a
homolog in P. carinii, and group 3 included those with their best blast hit against another fungal
species but also a significant hit against P. carinii. Group 1 was considered to be P. jirovecii
sequences homologous to sequences of P. carinii. Group 2 was not investigated further because it
was not possible to determine whether these sequences were missing from the P. carinii assembly, specific
to P. jirovecii, or truly belonged to another fungus. Each sequence of group 3 was translated in
the six reading frames using the 6ft program [2], and aligned with its homologs from other
fungi using MAFFT [1]. Alignments were visualized using Jalview [3]. Pairwise sequence
identities were computed using custom Perl scripts. In all cases examined, the reads were
closer to P. carinii than to any other fungus, but still divergent from P. carinii (data not
shown). Thus, no evidence of the presence of a fungal species other than P. jirovecii was
detected.
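The three-group classification described in this section can be expressed as a short Python sketch. It is only an illustration under stated assumptions: each read is represented by a list of (species, e-value, bit score) tuples, the best hit is taken as the significant hit with the highest bit score, and the function name and example values are hypothetical.

```python
def classify_read(hits, evalue_cutoff=1e-6, target="P. carinii"):
    """Assign a read to group 1, 2, or 3, or None if it has no significant hit.

    hits: list of (species, evalue, bitscore) tuples for one read.
    Group 1: best hit is the target species.
    Group 2: best hit is another species, no significant hit to the target.
    Group 3: best hit is another species, but the target is also hit.
    """
    significant = [h for h in hits if h[1] <= evalue_cutoff]
    if not significant:
        return None
    best_species = max(significant, key=lambda h: h[2])[0]
    if best_species == target:
        return 1
    has_target_hit = any(h[0] == target for h in significant)
    return 3 if has_target_hit else 2


# Hypothetical reads illustrating each outcome.
group_1 = classify_read([("P. carinii", 1e-30, 150.0), ("S. pombe", 1e-20, 100.0)])
group_2 = classify_read([("S. pombe", 1e-20, 100.0)])
group_3 = classify_read([("S. pombe", 1e-40, 200.0), ("P. carinii", 1e-10, 60.0)])
print(group_1, group_2, group_3)
```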
8. High throughput sequencing
Ten micrograms of amplified DNA from BALF E8 were used to build three Roche 454 XL+
shotgun libraries and a single Illumina HiSeq 2000 paired end library (insert size 500 bp).
Sequencing produced 2’889’665 Roche 454 single end reads (1.3 Gb, average length 700 nt)
and 316’713’248 Illumina paired end reads (30 Gb, average length 100 nt). Low quality sequences and
adapters were removed from Illumina paired end reads using FastQC (v. 0.9) and cutadapt [6].
9. Filtering and P. jirovecii genome assembly
The flow chart of the filtering and assembly procedure is shown in Figure S1 and the details
for each step are described below:
Step 1: P. jirovecii highly repetitive telomeric sequences and mitochondrial sequences were
identified using mreps [7] and Blastx (e-value ≤ 10⁻⁵) against published Pneumocystis sequences
[8,9,10], removed from the 454 reads, and set aside.
Step 2: The 454 reads were mapped onto the human genome (GRCh37/hg19) and the
complete NCBI human genomic resources
(ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/human_genomic.gz), using Roche’s gsMapper
(Newbler v. 2.6; parameters: -mi 85% -ml 100). Fully mapped reads were removed from the
dataset. Partially mapped reads were kept at this step only if they had their best blast hit
against P. carinii or another fungal proteome.
Step 3: The reads were assembled using gsAssembler software with stringent parameters (-mi
99 -ml 100 -rip), which were deduced from in silico simulations to avoid the creation of
chimeric contigs. Reads were trimmed internally with Newbler using -vs and -vt options. The
trimming database included 510 complete phage genomes
(http://phage.sdsu.edu/~rob/phage/), 3’074 complete viral genomes
(http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html), UniVec database
(ftp://ftp.ncbi.nih.gov/pub/UniVec/), and the complete genomes of relevant bacteria. The
purpose of this assembly step was to facilitate the attribution of every contig to a taxonomic
group.
Step 4: Contigs and unassembled reads were simultaneously compared to the incomplete P.
carinii genome, including 4’278 contigs (http://pgp.cchmc.org/) and 1’042 ESTs [11], using
Blastn with relaxed parameters (reward 5, penalty -4, word size 7, cost to open a gap 10, cost
to extend a gap 6, e-value 10⁻⁶). They were also compared to custom protein databases
including P. carinii predicted peptides [12] and fungal, eukaryotic, bacterial, archaeal, and viral
proteomes using Blastx. Sequences having their best blast hit against P. carinii were
considered as P. jirovecii. Reads having their best blast hit against a eukaryote other than
Homo sapiens were also considered as P. jirovecii. Most of them corresponded to conserved
sequences found in a broad spectrum of fungal species but missing or fragmented in the P.
carinii assembly.
Step 5: Residual contamination was assessed using RepeatMasker v. 3.2.8 with the primspec and
rodspec options (http://repeatmasker.org/), and relative dinucleotide abundance analysis
[13,14]. To identify chimeric reads possibly introduced by the phi29 DNA polymerase,
pre-filtered 454 reads were assembled using Roche’s gsAssembler (Newbler v. 2.6; parameters:
-mi 95 -ml 100 -rip). Single end 454 reads were then remapped to the contigs using Roche’s
gsMapper (Newbler v. 2.6; -mi 98 -ml 100 -rip), and filtered using custom Perl scripts. The
whole process of stringent assembly (step 3) followed by filtration (steps 4 and 5) was
repeated several times.
Step 6: A fragmented P. jirovecii genome assembly was obtained using Roche’s
gsAssembler (Newbler v. 2.6; -mi 98 -ml 100 -rip).
Step 7: The trimmed Illumina paired end reads that mapped onto human sequences using
Bowtie 2 [15] were removed.
Step 8: The remaining Illumina paired end reads that mapped onto a collection of bacterial
and viral genomes using Bowtie 2 [15] were removed.
Step 9: The Illumina reads were used to extend and scaffold the 454 contigs using SSAKE (v.
3-8, -p 1 -m 16 -o 2 -r 0.6 -p 0) [16]. The last remaining contaminants (mainly bacteria)
were filtered out as described in step 4. Extended 454 contigs, singlets, and unassembled 454
reads were assembled using Phrap to yield the final assembly [17,18].
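The relative dinucleotide abundance analysis used in step 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It computes the odds ratio rho(XY) = f(XY) / (f(X) f(Y)) on a single strand, whereas Karlin-style analyses such as [13] usually symmetrize the counts with the reverse complement; that refinement is omitted here for brevity. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def dinucleotide_odds_ratios(sequence):
    """Return {dinucleotide: observed / expected frequency} for one strand.

    Expected frequency is the product of the two mononucleotide frequencies.
    Positions containing characters other than A, C, G, T are ignored.
    """
    seq = sequence.upper()
    bases = "ACGT"
    mono = Counter(b for b in seq if b in bases)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in bases and seq[i + 1] in bases
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    if n_mono == 0 or n_di == 0:
        return {}
    ratios = {}
    for x in bases:
        for y in bases:
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            if expected > 0:
                ratios[x + y] = (di[x + y] / n_di) / expected
    return ratios


# Toy sequence: every base has frequency 0.25, AC occurs in 2 of 7 dinucleotides.
ratios = dinucleotide_odds_ratios("ACGTACGT")
print(round(ratios["AC"], 3), ratios["AA"])
```

Contigs whose odds-ratio profile departs strongly from that of the bulk of the assembly are candidates for contamination; the distance measure and threshold are choices left to the analyst.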
10. Transcriptome assembly and annotation
Total RNA was extracted from the BALF of a patient with Pneumocystis pneumonia (PCP) other than
the one whose BALF was used for genome sequencing, using the RiboPure Yeast Kit (Ambion). RNA was
randomly amplified using the NuGEN Ovation® RNA-seq system and sequenced using Illumina
HiSeq 2000 technology. A total of 246’196’388 non-strand-specific paired end reads
(insert size 250 bp, average read length 100 nt) were obtained. Adapter sequences and low
quality sequences were removed using cutadapt [6]. Human reads (62.5%) were first removed
by mapping using TopHat [19]. Additional human contaminant reads (31%) were removed by
realignment using GSNAP [20]. Bacterial contaminants (3.8%) were removed by mapping to a
collection of bacterial species using Bowtie 2 [15]. After these filtering steps, we
obtained 6’708’547 paired end reads that were assembled using TopHat and Cufflinks [19,21].
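The read counts and percentages reported in this section can be cross-checked with a few lines of Python. The sketch assumes that each percentage is expressed relative to the initial number of reads, which the text does not state explicitly; variable names are illustrative.

```python
total_reads = 246_196_388      # reads obtained after sequencing
retained_reads = 6_708_547     # reads left after all filtering steps

# Percentages of reads removed at each step, as reported in the text.
removed_pct = {
    "human (TopHat)": 62.5,
    "human (GSNAP)": 31.0,
    "bacterial (Bowtie 2)": 3.8,
}

# Retained fraction computed two ways: from the counts and from the steps.
retained_pct_from_counts = 100.0 * retained_reads / total_reads
retained_pct_from_steps = 100.0 - sum(removed_pct.values())

print(round(retained_pct_from_counts, 1), round(retained_pct_from_steps, 1))
```

Under this assumption both routes give about 2.7% of reads retained, so the reported figures are mutually consistent within rounding.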
11. Gene predictions and functional annotations
The repeat-masked P. jirovecii genome was annotated using Maker (v. 2.10) [22], integrating
de novo, homology, and transcriptome evidence for genes. Augustus (v. 2.3.1) [23] and
SNAP (v. 2006-07-28) [24] were trained on 483 transcripts expected to be complete (section
A10). GeneMark-ES (v. 2.5) [25] was trained directly on the genome assembly. Homology
based evidence was obtained from the alignments of RNAseq transcripts, 43’310 proteins
corresponding to the complete proteomes of Schizosaccharomyces pombe,
Schizosaccharomyces japonicus, Saccharomyces cerevisiae, Neosartorya fischeri, and Neurospora
crassa, and 393 publicly available Pneumocystis proteins (downloaded from UniProtKB
release 2012_06). Protein coding genes were annotated by comparison to a collection of
fungal proteomes using Blastp with an e-value of 10⁻⁶ as cut off. Mapping to KEGG metabolic
pathways was performed using our previously described mapping pipeline [12] and Priam [26].
Dedicated software was used to predict secreted proteins, GPI-anchored proteins,
carbohydrate-active enzymes, peptidases, transmembrane proteins, kinases, G-protein-coupled receptors
(GPCRs), transporters, GTPases, phosphatases, transfer RNAs, other non-coding RNAs, and
transposable elements. The ribosomal operon unit as well as the mitochondrial genome were
assembled and annotated separately (see sections A14 and A15). For the sake of a fair
comparison with P. jirovecii, the P. carinii genome was re-annotated using the same methods,
including specific P. carinii gene models as well as collections of 1’042 [11] and 48’229
ESTs (C. Aliouat-Denis, unpublished data).
12. Phylogeny
The proteomes of Ashbya gossypii, Debaryomyces hansenii, N. crassa, Neosartorya fumigata,
S. japonicus, S. pombe, Ustilago maydis, S. cerevisiae, Rhizopus delemar, and Yarrowia
lipolytica were downloaded from the UniProtKB website (release 2012_06;
http://www.uniprot.org/), and those of Schizosaccharomyces cryophilus and Schizosaccharomyces
octosporus from the Broad Institute website (http://www.broadinstitute.org/). The proteomes
of P. jirovecii and P. carinii were from this study, whereas that of Taphrina deformans is
unpublished data (Cissé et al., manuscript in preparation). Single copy orthologs were
identified using the Orthologous MAtrix project software (OMA v. 0.99) [27], concatenated, and aligned
using MAFFT [1] with the L-INS-i method. Misaligned regions were removed by GBLOCKS
[28]. The maximum likelihood and maximum parsimony phylogenies were inferred using
PhyML (v. 3.0) [29] and RAxML (v. 7.2.8) [30], respectively, with 100 bootstrap replicates and
BLOSUM62 as the substitution model.
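The concatenation of single copy orthologs into one matrix can be sketched in Python as below. Note the swapped order: the text states that sequences were concatenated and then aligned, whereas this sketch shows the common variant in which each ortholog group is aligned first and the alignments are then joined. It is therefore an illustration of the general technique, not the authors' procedure; the function name and toy sequences are hypothetical.

```python
def concatenate_alignments(alignments, gap="-"):
    """Join per-gene alignments into one sequence per species.

    alignments: list of dicts {species: aligned_sequence}; within each dict
    all sequences must have the same length. A species missing from a gene
    is padded with gap characters for that gene.
    """
    species = sorted(set().union(*(aln.keys() for aln in alignments)))
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = lengths.pop()
        for sp in species:
            parts[sp].append(aln.get(sp, gap * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Two toy ortholog groups; species B is absent from the second one.
supermatrix = concatenate_alignments([
    {"A": "MK-", "B": "MKL"},
    {"A": "GG"},
])
print(supermatrix)
```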
13. Search for missing genes
The glyoxylate cycle hallmark genes (i.e. isocitrate lyase and malate synthase) and enzymes
dedicated to the synthesis of amino acids were searched for using hmmer3 (http://hmmer.org/)
with the corresponding Pfam hidden Markov models (http://pfam.sanger.ac.uk/), our previously
described KEGG mapping pipeline [12], and Priam [26]. Secondary metabolite clusters were
searched for using the Secondary Metabolite Unique Regions Finder (SMURF;
http://www.jcvi.org/smurf/index.php) and Blast searches against the UniProtKB database
(http://www.uniprot.org/).
14. Recovery, assembly, and annotation of the ribosomal RNA unit
We searched for homologs of P. jirovecii ribosomal sequences in the raw Roche 454
sequences using Blastn (e-value ≤ 10⁻⁵). The reference P. jirovecii sequences used for
screening were: 18S rRNA gene (NCBI accession number AB266392), ITS1-5.8S rRNA-ITS2
(AF013954, AY330724, AY328067 - AY328078, AB469815, AB469816, EU709722 -
EU709727, AB469817, AB481404, AB481405 - AB481414, FJ164067, FJ164068), the
full-length P. carinii ribosomal operon (M86760), and the intron of P. jirovecii nuclear 26S rRNA
(L13615). The 242 reads recovered were purged of contaminants using Blast against NCBI
nr/nt (i.e. removal of human or bacterial ribosomal genes), assembled using Roche’s
gsAssembler (Newbler v. 2.6), and annotated using Artemis [31]. Contig 357 contained
the full-length 18S, ITS1, 5.8S, ITS2, and 26S rDNA sequences, whereas contig 358
contained the full-length IGS1, 5S, and IGS2 sequences.
15. Recovery, assembly, and annotation of mitochondrial genome
The Roche 454 reads were compared to the published P. carinii mitochondrial genome [10]
using Blastn with the parameters -r 5 -q -4 -W 7 -G 10 -E 6 -e 1e-05 -v 1 -b 1 -F "m D". We
retrieved 111’276 reads (3.8% of total) that were purged of contaminants using Blast against
NCBI nr/nt, and assembled using gsAssembler (Newbler v. 2.6, parameters -mi 95 -ml 100
-ace -rip). Illumina paired end reads were used to correct contigs by remapping using Bowtie 2
[15] with default parameters and manual inspection. Protein coding genes and ribosomal
genes were annotated by comparison to the available P. jirovecii and P. carinii mitochondrial
genes using Artemis [31] and NCBI ORF Finder (http://www.ncbi.nlm.nih.gov/projects/gorf/)
with translation table 4. Transfer RNAs were predicted de novo using tRNAscan-SE with the -c
option [32].
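The choice of translation table 4 matters because, in that genetic code, the codon UGA encodes tryptophan instead of a stop. The minimal Python sketch below illustrates the effect on a toy sequence. It builds the standard code from the usual NCBI amino acid string and overrides the single codon that differs; it covers only tables 1 and 4, ignores alternative start codons, and is unrelated to the tools cited above. Function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(table_id=1):
    """Return a codon -> amino acid dict for NCBI translation table 1 or 4."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    table = dict(zip(codons, STANDARD_AA))
    if table_id == 4:
        table["TGA"] = "W"  # UGA codes for tryptophan in table 4
    elif table_id != 1:
        raise ValueError("only tables 1 and 4 are supported in this sketch")
    return table


def translate(dna, table_id=1):
    """Translate a DNA sequence in frame 1; trailing partial codons are dropped."""
    table = codon_table(table_id)
    dna = dna.upper().replace("U", "T")
    return "".join(
        table.get(dna[i:i + 3], "X")  # X marks codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


standard = translate("ATGTGATAA", table_id=1)
mito = translate("ATGTGATAA", table_id=4)
print(standard, mito)
```

With the standard code the toy open reading frame would appear to end at the second codon, whereas table 4 reads through it; this is why the annotation of the mitochondrial genes depends on the table chosen.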
B. Data access
The Whole Genome Shotgun project has been registered at EMBL-Bank under
identification number 68827 (http://www.ncbi.nlm.nih.gov/bioproject/68827). The raw sequences
were deposited at the European Sequence Read Archive (SRA) under accession number
ERP000939. The P. jirovecii transcriptome project has been registered at EMBL-Bank under
the identifier PRJEB400. Raw RNAseq data were deposited at the European Sequence Read Archive
(SRA) under accession number ERP001479. The P. jirovecii and P. carinii annotated
genomes are temporarily available before public release at:
http://myhits.isb-sib.ch/wwwtmp/weekly/Pneumocystis_jirovecii_genome.tar.gz
http://myhits.isb-sib.ch/wwwtmp/weekly/Pneumocystis_carinii_genome.tar.gz
C. References for supplementary information
1. Katoh K, Asimenos G, Toh H. 2009. Multiple alignment of DNA sequences with
MAFFT. Methods Mol Biol 537: 39-64.
2. Bucher P, Karplus K, Moeri N, Hofmann K. 1996. A flexible motif search technique
based on generalized profiles. Comput Chem 20: 3-23.
3. Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. 2009. Jalview Version
2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:
1189-1191.
4. Palmer RJ, Cushion MT, Wakefield AE. 1999. Discrimination of rat-derived
Pneumocystis carinii f. sp. carinii and Pneumocystis carinii f. sp. ratti using the
polymerase chain reaction. Mol Cell Probes 13: 147-155.
5. Huson DH, Auch AF, Qi J, Schuster SC. 2007. MEGAN analysis of metagenomic data.
Genome Res 17: 377-386.
6. Martin M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing
reads. EMBnet.journal 17: 10-12.
7. Kolpakov R, Bana G, Kucherov G. 2003. mreps: Efficient and flexible detection of
tandem repeats in DNA. Nucleic Acids Res 31: 3672-3678.
8. Underwood AP, Louis EJ, Borts RH, Stringer JR, Wakefield AE. 1996. Pneumocystis
carinii telomere repeats are composed of TTAGGG and the subtelomeric sequence
contains a gene encoding the major surface glycoprotein. Mol Microbiol 19: 273-281.
9. Kutty G, Ma L, Kovacs JA. 2001. Characterization of the expression site of the major
surface glycoprotein of human-derived Pneumocystis carinii. Mol Microbiol 42: 183-193.
10. Sesterhenn TM, Slaven BE, Keely SP, Smulian AG, Lang BF, et al. 2010. Sequence
and structure of the linear mitochondrial genome of Pneumocystis carinii. Mol Genet
Genomics 283: 63-72.
11. Cushion MT, Smulian AG, Slaven BE, Sesterhenn T, Arnold J, et al. 2007.
Transcriptome of Pneumocystis carinii during fulminate infection: carbohydrate
metabolism and the concept of a compatible parasite. PLoS One 2: e423.
12. Hauser PM, Burdet FX, Cisse OH, Keller L, Taffe P, et al. 2010. Comparative
genomics suggests that the fungal pathogen Pneumocystis is an obligate parasite
scavenging amino acids from its host's lungs. PLoS One 5: e15152.
13. Gentles AJ, Karlin S. 2001. Genome-scale compositional comparisons in eukaryotes.
Genome Res 11: 540-546.
14. Willner D, Thurber RV, Rohwer F. 2009. Metagenomic signatures of 86 microbial and
viral metagenomes. Environ Microbiol 11: 1752-1766.
15. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat
Methods 9: 357-359.
16. Warren RL, Sutton GG, Jones SJ, Holt RA. 2007. Assembling millions of short DNA
sequences using SSAKE. Bioinformatics 23: 500-501.
17. Green P. 2009. Phrap, version 1.090518. http://phrap.org.
18. Green P, Ewing B. 2002. Phred, version 0.020425c. http://phrap.org.
19. Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics 25: 1105-1111.
20. Wu TD, Nacu S. 2010. Fast and SNP-tolerant detection of complex variants and splicing
in short reads. Bioinformatics 26: 873-881.
21. Roberts A, Pimentel H, Trapnell C, Pachter L. 2011. Identification of novel transcripts
in annotated genomes using RNA-Seq. Bioinformatics 27: 2325-2329.
22. Cantarel BL, Korf I, Robb SM, Parra G, Ross E, et al. 2008. MAKER: an easy-to-use
annotation pipeline designed for emerging model organism genomes. Genome Res 18:
188-196.
23. Stanke M, Schoffmann O, Morgenstern B, Waack S. 2006. Gene prediction in
eukaryotes with a generalized hidden Markov model that uses hints from external
sources. BMC Bioinformatics 7: 62.
24. Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5: 59.
25. Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. 2008. Gene
prediction in novel fungal genomes using an ab initio algorithm with unsupervised
training. Genome Res 18: 1979-1990.
26. Claudel-Renard C, Chevalet C, Faraut T, Kahn D. 2003. Enzyme-specific profiles for
genome annotation: PRIAM. Nucleic Acids Res 31: 6633-6639.
27. Roth AC, Gonnet GH, Dessimoz C. 2008. Algorithm of OMA for large-scale orthology
inference. BMC Bioinformatics 9: 518.
28. Talavera G, Castresana J. 2007. Improvement of phylogenies after removing divergent
and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56: 564-
577.
29. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, et al. 2010. New
algorithms and methods to estimate maximum-likelihood phylogenies: assessing the
performance of PhyML 3.0. Syst Biol 59: 307-321.
30. Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic
analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.
31. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, et al. 2000. Artemis: sequence
visualization and annotation. Bioinformatics 16: 944-945.
32. Lowe TM, Eddy SR. 1997. tRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Nucleic Acids Res 25: 955-964.
D. Supplementary figure
Figure S1. Bioinformatics strategy used for the filtration, classification, and assembly of
the reads from P. jirovecii. The whole DNA extracted from the bronchoalveolar fluid and
randomly amplified was sequenced using Roche 454 shotgun single end (SE) and Illumina
paired end (PE) sequencing technologies. The 454 reads were filtered and assembled into a
partial assembly (steps 1 to 6). In parallel, low quality Illumina reads and those mapped to
454 contaminant reads were eliminated (steps 7 and 8). After removal of contaminant reads,
Illumina reads were used to complete the 454 assembly into one final genome (step 9).