
Human transposon insertion profiling: Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer

Zuojian Tang(a,b), Jared P. Steranka(c,d), Sisi Ma(a,2), Mark Grivainis(a,b), Nemanja Rodić(c,3), Cheng Ran Lisa Huang(d,4), Ie-Ming Shih(c,e), Tian-Li Wang(c), Jef D. Boeke(b,1), David Fenyö(a,b,1), and Kathleen H. Burns(c,d,1)

(a) Center for Health Informatics and Bioinformatics, NYU Langone Medical Center, New York, NY 10016; (b) Institute for Systems Genetics, NYU Langone Medical Center, New York, NY 10016; (c) Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205; (d) McKusick–Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205; and (e) Department of Gynecology and Obstetrics, Johns Hopkins University School of Medicine, Baltimore, MD 21205

Contributed by Jef D. Boeke, December 20, 2016 (sent for review June 18, 2016; reviewed by Prescott Deininger and David J. Witherspoon)

Mammalian genomes are replete with interspersed repeats reflecting the activity of transposable elements. These mobile DNAs are self-propagating, and their continued transposition is a source of both heritable structural variation as well as somatic mutation in human genomes. Tailored approaches to map these sequences are useful to identify insertion alleles. Here, we describe in detail a strategy to amplify and sequence long interspersed element-1 (LINE-1, L1) retrotransposon insertions selectively in the human genome, transposon insertion profiling by next-generation sequencing (TIPseq). We also report the development of a machine-learning–based computational pipeline, TIPseqHunter, to identify insertion sites with high precision and reliability. We demonstrate the utility of this approach to detect somatic retrotransposition events in high-grade ovarian serous carcinoma.

retrotransposon | TIPseq | human | LINE-1 | ovarian cancer

Significance

Much of our genome is repetitive sequence. This property poses challenges for investigators because differences in repetitive sequences are difficult to detect. With hundreds of thousands of similar repeats, it has been difficult to discern how one person’s genome differs from another person’s genome or how tumor DNA differs from normal DNA. To solve this issue, we developed methods to target next-generation sequencing to the insertion sites of the most variable repeats. Computational pipelines to make these studies scalable and more widely accessible were needed, however. Here, we report a pipeline that accomplishes this goal. We use it to demonstrate insertions of the long interspersed element-1 (LINE-1) acquired in ovarian cancer that may contribute to the development of these tumors.

Author contributions: Z.T., J.D.B., D.F., and K.H.B. designed research; Z.T., J.P.S., N.R., C.R.L.H., and T.-L.W. performed research; Z.T., S.M., M.G., C.R.L.H., and I.-M.S. contributed new reagents/analytic tools; Z.T. and C.R.L.H. analyzed data; and Z.T., J.D.B., D.F., and K.H.B. wrote the paper.

Reviewers: P.D., Tulane Cancer Center; and D.J.W., University of Utah.

The authors declare no conflict of interest.

Data deposition: The sequences reported in this paper have been deposited in the NCBI Sequence Read Archive (SRA) database (accession nos. SRP074110 and SRP074316).

(1) To whom correspondence may be addressed. Email: [email protected], [email protected], or [email protected].

(2) Present address: Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455.

(3) Present address: Department of Pathology, Yale University School of Medicine, New Haven, CT 06520.

(4) Present address: Atlas Venture, Cambridge, MA 02139.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1619797114/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1619797114 PNAS | Published online January 16, 2017 | E733–E740

GENETICS | PNAS PLUS

Downloaded by guest on January 7, 2021

Much of our genome consists of interspersed repeats, sequences that evidence the long-standing activities of mobile DNAs (1, 2). One of the most abundant and successful mobile DNAs in human genomes is long interspersed element-1 (LINE-1). LINE-1 sequences are common to many mammals, and sequence comparisons indicate that LINE-1 has accumulated throughout primate lineages as a singular succession of subfamilies (3). There are more than 1 million LINE-1 fragments in our genome today, accounting for nearly 540 million base pairs, or about 18% of our DNA (1). The oldest LINE-1 insertions have themselves been interrupted by younger transposable elements, and their fragments can be merged such that we recognize a smaller number of LINE-1 insertion instances, totaling around 500,000.

The vast majority of these sequences are so-called “fixed present” insertions, meaning they are found in a homozygous state in all humans. These insertions represent ancestral insertions frequently shared with other extant species. A much smaller set of roughly 500 full-length and truncated LINE-1 insertions in the average human genome corresponds to the L1PA1 subfamily or the “Ta subset” of the Homo sapiens-specific insertions of LINE-1 (L1Hs) sequences. These insertions are transcriptionally and transpositionally active (4, 5). Although this particularly interesting “active” subset of LINE-1 includes fixed present elements, it also encompasses polymorphic elements. Each constitutes a biallelic structural variant, such that an individual may not have the LINE-1 insertion (i.e., may carry only the preinsertion allele) or may have inherited the LINE-1 insertion (i.e., may be either homozygous or heterozygous for the insertion polymorphism).

Active L1Hs retrotransposons not only are responsible for L1Hs sequence but also drive retrotransposition of other mobile DNAs, namely, Alu short interspersed elements (SINEs) and SVA (SINE/VNTR/Alu) transposons (6, 7) and even processed pseudogenes (8). Collectively, these insertions represent a major source of structural variants in human populations.

LINE-1 mobilization occurs through a mechanism termed target primed reverse transcription (TPRT) (9, 10). There are several hallmarks of a sequence inserted by the TPRT mechanism (Fig. 1A), including the following: (i) target site duplications (TSDs) surrounding the insertion, (ii) a poly(A) (pA) tail at the 3′ end of the sequence, and (iii) a 5′ end truncation and/or inversion (11). Sequences may also demonstrate 3′ (or, less commonly, 5′) transduction events, which result from the formation of RNA intermediates more encompassing than the templating LINE-1. When these RNA intermediates are reverse-transcribed, unique sequences adjacent to the LINE-1 from the templating locus are incorporated into the transposed segment (12–14).

The high copy number of LINE-1 repeats and these sequence features together present substantial challenges to the accurate and sensitive detection of new insertions. These challenges are exacerbated when insertion alleles are at a low frequency in a sample. Such is the case for somatically acquired insertions in primary tissue samples, which consist of admixtures of cells of different lineages. Because of these barriers, and assumptions that interspersed repeats are not functional, characterizations of these sequences have been incomplete. Recently, however, strategies for mapping these elements have been developed based on selective PCR amplification (15–17), hybridization-based enrichment (18, 19), and read selection from whole-genome sequencing data (20–22). These studies underscore the continued activity of LINE-1 in modern humans, and have demonstrated somatic insertions in several types of human malignancy (15, 19, 22–27).

Here, we describe a strategy for selectively amplifying genomic DNA 3′ of insertions of active subfamilies of human LINE-1 for Illumina sequencing (17, 27, 28). We also describe a machine-learning–based computational pipeline, transposon insertion profiling by next-generation sequencing (TIPseq) Hunter (TIPseqHunter; https://github.com/fenyolab/TIPseqHunter, with supporting files available at openslice.fenyolab.org/data/tipseqhunter/), and a visualization tool, TranspoScope, to identify and display L1Hs in the resulting reads.

This combination of tools is useful for identifying inherited retrotransposon insertion polymorphisms, as well as insertions that occur somatically. We show the utility of these tools by applying TIPseq to one of the first surveys of LINE-1 activity in ovarian cancer.

Results

Vectorette PCR. Vectorette PCR is a method that can be used to amplify DNA fragments in which a portion (i.e., the 5′ end of an amplicon) is a known sequence, but the sequence at the opposing end is unknown (29, 30). Vectorette PCR is a type of ligation-mediated PCR; alternatives include linear amplification strategies and other ligation-mediated PCRs, such as inverse PCR. In our hands, vectorette PCR is an effective strategy to amplify LINE-1 insertion sites and insertion sites of similarly complex populations of mobile DNAs in the human genome, including AluYa and AluYb subfamilies, as well as human endogenous retrovirus-K (HERV-K) elements (17). Steps of the PCR are diagrammed in Fig. 1B. One advantage of TIPseq over other methods of sequencing transposon-flanking amplicons is that it depends on a single round of PCR before library construction. Most other methods depend on nested PCR-based methods, which tend to amplify the bias in outcomes greatly when a variety of amplicons of different lengths and sequences are amplified in parallel.

Having high-quality, high-molecular-weight genomic DNA as starting material is important. For the method described, we typically isolate DNA from fresh-frozen tissues using phenol-chloroform extraction and ethanol precipitation. We use 10 μg of genomic DNA and divide this genomic DNA into aliquots that are independently digested with one of a panel of restriction enzymes. One of the first considerations in designing the vectorette PCR is to select restriction enzymes that will maximally represent portions of the genome in amplifiable fragment lengths; it is desirable to ensure that as close as possible to 100% of genomic insertions will be represented as at least one fragment of less than 3 kb. For TIPseq amplifications in the human genome, we use five or six different 5- or 6-bp restriction enzyme cutters to ensure that a large majority of the genome (>95%) is represented in fragments that are 1–3 kb in size in at least one of the parallel digests. Shorter fragments amplify but are less informative, particularly for insertions into genomic locations that are themselves repetitive, or those insertions representing significant 3′ transduction events. In contrast, longer fragments are less well amplified relative to shorter fragments in this highly multiplexed PCR.

The restriction enzymes chosen conform to the following characteristics: (i) they should cut efficiently and independently of genomic methylation; (ii) they should leave overhanging “sticky ends” and demonstrate high efficiencies in serial cut, ligation, and recut experiments; (iii) the restriction enzyme recognition site should occur at the right frequency in the genome (typically, 6-bp cutters); and (iv) we avoid using combinations of enzymes that would impose multiple requirements for CG dinucleotide-containing cut sites because these sites are underrepresented in the human genome. Enzymes that can be heat-inactivated before vectorette oligonucleotide ligation are advantageous. Finally, it is critical that the cut sites corresponding to chosen enzymes not be represented within the transposable element at any position 3′ of the forward primer. This requirement ensures that amplicons will extend from the retroelement insertion into unique flanking DNA. Because of this last point, we use different restriction enzymes for mapping different types of transposable elements. For human LINE-1, we use six enzymes: AseI, BspHI, BstYI, HindIII, NcoI, and PstI (New England Biolabs).

Next, a pair of vectorette oligonucleotides is designed to correspond to each restriction enzyme used. The annealed oligonucleotide pair will form the vectorette adapter. The two strands create a double-stranded end for the adapter that is compatible with the sticky end created by the restriction enzyme. This structure allows for efficient ligation. The annealed vectorette sequences do not complement one another perfectly throughout their length, however. Vectorette adapters have a central, partly mismatched sequence. This mismatched interval is where one of the primers for the vectorette PCR is positioned (Fig. 1B, leftward facing arrow). This amplification primer is complementary to the first strand synthesized in the PCR assay. Its design forces first strand synthesis from the transposable element sequence; no binding site complementary to the amplification primer exists unless this strand extension occurs. After this extension, in subsequent cycles of the PCR, exponential amplification of sequences flanking the transposable element can proceed.

[Fig. 1 diagram. (A) LINE-1 (L1) element, 6 kb: ORF1, ORF2, pA tail, flanking TSDs. (B) L1 gDNA, PCR template, first strand, dsPCR amplicon; read pairs: L1/junction, L1/genome, junction/genome, genome/genome.]

Fig. 1. LINE-1 insertions and vectorette PCR. (A) Full-length LINE-1 (L1) insertion is diagrammed; the LINE-1 spans 6 kb and includes two ORFs (ORF1 and ORF2). The element ends with a 3′ series of adenine nucleobases and the pA tail of its RNA precursor, and it is flanked by TSDs of the preinsertion genomic sequence (red boxes). (B, Top to Bottom) Vectorette PCR work flow. Genomic DNA (parallel lines) is cut with restriction enzymes, leaving sticky ends (downward-facing arrows); vectorette adapters (blue) are ligated to these ends. The annealed vectorette sequences are not perfectly complementary, and no binding site exists for the amplification primer at the outset of the PCR assay. First-strand extensions (sea green) occur from a forward primer specific for L1Hs LINE-1 (black, rightward-facing top arrow), and in subsequent iterations of the PCR, the reverse amplification primer has its complement from these strands (black, leftward-facing bottom arrow). The structure of the resulting amplicons, along with possibilities for corresponding paired end sequencing reads, is shown. Informative reads can be grouped into categories depending on their positions in the TIPseq amplicons. These positions include L1/junction read pairs, L1/genome read pairs, junction/genome, and genome/genome read pairs. L1/L1 concordant read pairs are not informative.

The amplification primer responsible for the first strand extension is designed to be complementary to the transposable element (Fig. 1B, rightward facing arrow). This primer is placed to take advantage of so-called “diagnostic nucleotide” substitutions that define relatively recently active subfamilies of mobile DNAs. This procedure minimizes unwanted amplification of insertion sites from older, exclusively “fixed present” transposons and greatly enriches for amplification of insertions, such as polymorphic insertions or acquired somatic events. In the case of L1Hs Ta subset, this specificity is possible because of a consecutive trinucleotide signature “ACA” in the 3′ UTR of the element. The vectorette PCR uses the 3′ end of the element, which is advantageous to detect 5′ truncated retroelements. On the other hand, our dependence on the 3′ end can create difficulties owing to sequencing across the pA homopolymer. We have found that the greatest specificity is conferred when ACA nucleotides are located at the 3′ end of the amplification primer (L1 amplification primer sequence, 5′-AGATATACCTAATGCTAGATGACACA-3′). Locked nucleic acids can be placed at these three 3′-most positions to increase binding specificity but are not required.

PCR conditions must strike a balance between imposing this specificity and achieving a high yield. Use of proofreading polymerases is necessary, given the 1- to 3-kb targeted optimal length of the PCR amplicons, and we use ExTaq DNA Polymerase (TaKaRa Clontech). We use touchdown cycling conditions with annealing temperatures lowering progressively from 72 °C to 60 °C. The vectorette PCR produces a complex mixture of amplicons. These amplicons are sheared to 300 bp and prepared for Illumina sequencing. We use a Covaris E210 instrument to shear the DNA, with the following settings: 75 s, 200 cycles per burst, intensity of 4, and 10% duty cycle. The Illumina TruSeq DNA Sample Preparation Kit v2, Illumina TruSeq DNA PCR-Free Library Preparation Kit, or Kapa Biosystems KAPA DNA Library Preparation Kit can be used for library preparation.

Indexing allows for sequencing runs to be multiplexed. We have generated high-quality LINE-1 insertion profiles for somatic retrotransposition assays running 10–12 samples in a single Illumina HiSeq v4 sequencing lane for an average of between 20 and 40 million reads per sample. The TIPseqHunter pipeline requires paired-end sequencing reads.
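The restriction enzyme selection criterion described in this section (each insertion should be represented by at least one 1- to 3-kb fragment in one of the parallel digests) can be checked in silico. The following is a minimal sketch, not the authors' procedure. The recognition sequences are the standard published sites for the six enzymes and are not given in the text; the function names, the use of the distance to the first downstream site as a proxy for fragment length, and the default length bounds are assumptions made for illustration.

```python
import re

# Standard recognition sequences for the six enzymes named above (assumed here;
# the text lists only the enzyme names). R = A or G, Y = C or T (IUPAC codes).
ENZYME_SITES = {
    "AseI": "ATTAAT",
    "BspHI": "TCATGA",
    "BstYI": "RGATCY",
    "HindIII": "AAGCTT",
    "NcoI": "CCATGG",
    "PstI": "CTGCAG",
}
IUPAC = {"R": "[AG]", "Y": "[CT]"}


def site_regex(site):
    """Translate a recognition sequence with IUPAC codes into a compiled regex."""
    return re.compile("".join(IUPAC.get(base, base) for base in site))


def first_site_downstream(sequence, start, site):
    """Distance from `start` to the first `site` at or after it, or None if absent."""
    match = site_regex(site).search(sequence, start)
    return None if match is None else match.start() - start


def amplifiable_enzymes(sequence, insertion_3prime_end, min_len=1000, max_len=3000):
    """Return {enzyme: distance} for enzymes whose nearest downstream site
    leaves a flank between min_len and max_len bp (hypothetical bounds)."""
    usable = {}
    for name, site in ENZYME_SITES.items():
        distance = first_site_downstream(sequence, insertion_3prime_end, site)
        if distance is not None and min_len <= distance <= max_len:
            usable[name] = distance
    return usable
```

An insertion for which `amplifiable_enzymes` returns an empty dictionary would not be expected to yield a fragment of suitable length in any of the six digests under these assumptions. All six sites are palindromic, so searching one strand is sufficient.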

Data Analysis. The data analysis pipeline, TIPseqHunter, comprises five major steps: (i) sequence read preparation and quality control, (ii) sequence alignment, (iii) candidate LINE-1 insertion site identification, (iv) LINE-1 insertion site modeling, and (v) LINE-1 insertion site prediction (Fig. 2). Each is described further in this section. Because the efficacy of the PCR assay and the type of background associated with each TIPseq dataset can vary, the machine-learning algorithm is run on each sample:

i) The paired-end reads are trimmed of low-quality base pairs as identified by Illumina Phred or Q score using Trimmomatic (31). Illumina adaptors, vectorette oligonucleotides, and primer sequences are also trimmed.

ii) The processed reads are aligned using Bowtie2 version 2.2.3 (32) to the human reference genome assembly [February 2009 (hg19, GRCh37) release] in which 1,544 RepeatMasker (33) annotated L1Hs fragments have been masked using BEDTools (34). We chose to mask L1Hs insertions incorporated in the reference genome so that characteristics of alignments at these loci could be used in a machine-learning algorithm applied at loci without reference L1Hs. Only by aligning to a masked genome can we be assured that both reference and nonreference L1 insertions will behave similarly. Sequence alignments are also made against an L1Hs consensus reference sequence (35).

iii) Sequence reads indicative of LINE-1 insertions are detected (Fig. 1): (a) genome/genome (G/G) read pairs, where both reads align to genome sequence with an intervening distance consistent with the DNA fragment length distribution. The coverage distribution of these reads forms a wide peak next to a LINE-1 insertion site; (b) L1/genome (L1/G) read pairs, where one read is aligned to the L1Hs sequence and the other to the reference genome sequence, and the coverage distribution of these reads extends about 500 bp from the LINE-1 insertion site; and (c) junction reads (J) that span the insertion [both L1/junction (L1/J) and junction/genome (J/G) pairs]. Reads containing sequences from both the 3′ pA end of L1 and the flanking genomic sequence can be used to pinpoint the precise insertion site. TRLocator, an in-house–developed peak-finding algorithm (36), is modified to find enriched target regions. Target regions with at least one junction read pair are considered candidate LINE-1 insertion sites.

iv) A model of candidate insertion sites is built for each sample using the logistic regression module of the R package “caret” (Fig. 3). The model is trained and tested on known LINE-1 insertions (positive instances) and candidate insertion sites missing the first 5′-most position of the LINE-1 primer sequence (negative instances). These insertion sites are randomly divided into a training set (70% of insertion sites) and a test set (30% of insertion sites). Ten-fold cross-validation on the training set is used to ensure model generalizability, and model accuracy is evaluated using the test set.

[Fig. 2 flow chart. Steps: Preparation and Quality Control (vectorette sequence removal); Alignment (L1Hs-masked human genome, L1Hs reference sequence); Identification (target regions via TRLocator, insertion sites); Modeling (logistic regression module); Prediction (predict results). Other labels: vectorette sequences, L1Hs annotation, junction reads, label set: five features, unlabeled set.]

Fig. 2. Schematic of the TIPseqHunter pipeline. There are five steps in the pipeline: (i) low-quality sequences, base pairs, and vectorette sequences are trimmed using Trimmomatic software; (ii) qualified read pairs are aligned to an L1Hs masked reference genome (hg19) and the L1Hs consensus sequence using Bowtie2 software; (iii) candidate insertion sites are identified using the enriched target sites with at least one junction-containing read pair; (iv) a machine-learning model is built using five features (width, depth, variant index, pA tail purity, and number of junction reads); and (v) the trained model is used to predict probabilities of the candidate insertions being the true insertion sites.
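The partitioning described in step iv (a random 70%/30% split of the labeled sites, followed by 10-fold cross-validation on the training portion) can be sketched as follows. This is an illustrative stand-in written in plain Python, not the caret code used by TIPseqHunter; the function names and the fixed seed are hypothetical, and the split is a simple shuffle with no stratification by label.

```python
import random


def split_labeled_set(instances, train_fraction=0.7, seed=0):
    """Shuffle labeled instances and split them into (training, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def k_folds(training_set, k=10):
    """Yield (fit, held_out) pairs; each instance is held out exactly once."""
    folds = [training_set[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fit, held_out
```

In use, a model would be fitted on each `fit` subset and scored on the matching `held_out` subset to gauge generalizability, with the 30% test set reserved for the final accuracy estimate.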

The five parameters used for the training set model are as follows:

a) Width: the width of the enriched target region (log2-based value)

b) Depth: the average coverage of the enriched target region (log2-based value)

c) Variant index: alignment mismatches and indels within the peak interval (sum of the number of mismatches and indels divided by the sum of the number of base pairs and the number of reads)

d) pA tail purity: the pA purity of each consensus sequence at the predicted 3′ end of the insertion site within a segment of specified length (%)

e) Number of junction reads (J) supporting the predicted site
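The five features listed above reduce to simple arithmetic on per-candidate summary counts. The sketch below shows one way to compute them; the function and argument names are hypothetical, the inputs are assumed to have been tallied already from the alignments, and pA purity is taken here to be the percentage of adenines in the examined segment, which is an interpretation rather than a definition given in the text.

```python
import math


def variant_index(mismatches, indels, aligned_bases, reads):
    """(mismatches + indels) / (aligned base pairs + reads) within the peak interval."""
    return (mismatches + indels) / (aligned_bases + reads)


def pa_purity(tail_segment):
    """Percentage of adenines in the segment at the predicted 3' end (assumed definition)."""
    if not tail_segment:
        return 0.0
    return 100.0 * tail_segment.upper().count("A") / len(tail_segment)


def feature_vector(width_bp, mean_coverage, mismatches, indels,
                   aligned_bases, reads, tail_segment, junction_reads):
    """Assemble the five model features for one candidate insertion site."""
    return {
        "width": math.log2(width_bp),        # log2 of enriched region width
        "depth": math.log2(mean_coverage),   # log2 of average coverage
        "variant_index": variant_index(mismatches, indels, aligned_bases, reads),
        "pa_purity": pa_purity(tail_segment),
        "junction_reads": junction_reads,    # raw count of supporting junction reads
    }
```

For example, a 1,024-bp region at 64-fold average coverage with 3 mismatches and 1 indel over 9,900 aligned bases and 100 reads has a width of 10, a depth of 6, and a variant index of 0.0004.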

We previously described a study of somatic retrotransposition of LINE-1 in patients with lethal metastatic pancreatic ductal adenocarcinoma (PDAC) (27), in which we used TIPseq to map LINE-1 insertions in a series of primary and metastatic lesions, as well as matched normal genomic DNA (germ line) from the same individuals. Here, we applied TIPseqHunter to analyze data from 36 of these TIPseq experiments. Each included 6–30 million paired-end 100-bp reads per sample. All sequences were aligned to both the repeat-masked human genome sequence and the L1Hs consensus sequence (as described in ii above). Overall alignment rates to hg19 were 98.7–99.4%, and the alignment rates to L1Hs were 32–42% (Table S1).

Depending on the sample, we identified between 10^4 and 10^5 candidate insertion sites (Table S2) defined by an enriched target region and at least one junction read. About one-quarter are spurious, with no restriction enzyme cut sites near the enriched target regions. These candidate insertion sites are excluded before modeling. The remaining enriched target regions lacking the first 5′-most position of LINE-1 amplification primer are used as negative instances for training the model.

We used two sets of positive instances for training and testing the model: a set of L1PA1 elements present in all humans, so-called “fixed present insertions,” and a set of L1Hs that includes fixed present L1PA1 insertions, polymorphic L1PA1, and fixed L1PA2. Each set of known insertions has strengths and weaknesses for training the model. The fixed present L1PA1 elements are present in all humans, and will be homozygous in a sample when they occur on an autosome (i.e., two copies per genome equivalent). Using this set promotes specificity but lowers sensitivity. Detection of somatically acquired insertions present in a sample at less than one copy per genome equivalent is decreased. In contrast, when older LINE-1 and known insertion variants are included, only a subset is expected to be amplified in any given sample. Using this set promotes sensitivity with some reduction of specificity.

The fixed present insertions used here include 200 L1Hs annotated in the human reference genome build by RepeatMasker and also detected in TIPseq runs on 108 germ-line samples from diverse humans. L1PA2 is excluded from this set; all are L1PA1 or L1(Ta). We observe strong evidence for their presence in all samples, and there is a clear boundary between the positive and negative instances in the training set based on the five features (Fig. 4A and Table S3). Training the model on these 200 fixed present L1Hs insertions yields a small set of high-confidence insertion sites (Fig. 4B).

RepeatMasker annotates 1,544 L1Hs elements in the reference genome; of these annotations, about 600–800 are classified as candidate insertions in each sample. These 600–800 candidate insertion sites contain all of the 200 fixed present L1PA1 elements, but also encompass other, older elements and a number of known LINE-1 insertion variants. There is less discrimination between positive and negative instances (Fig. 4C and Table S4). Training the model on this larger set of positive instances that also contains sites supported by weaker evidence yields a larger set of predicted insertion sites (Fig. 4D).

The accuracy of these models was estimated using the test set and resulted in accuracies of >0.99 and >0.90 for models trained on the fixed present L1PA1 and larger, RepeatMasker L1Hs sets, respectively (Fig. 5A). Training on the smaller and higher quality set of fixed present L1PA1 results in higher accuracy and produces a shorter candidate list (Fig. 5B and TranspoScope web site at openslice.fenyolab.org/transposcope/home.html). In contrast, for a majority of samples, training on a larger set of RepeatMasker annotated insertions results in retrieval of a larger number of insertions (Fig. 5B and TranspoScope web site). Even when a LINE-1 insertion occurs in a region with a low proportion of uniquely mapping reads, the percentage of concordant aligned reads is not compromised and the insertion can be reliably detected (Fig. S1).

Fig. 3. Training and evaluation of the model. (1) To identify if a candidate insertion site is a true insertion site, a dataset labeled with true and false insertion sites (the labeled set) is constructed for each sequencing sample. Positive instances (true insertion sites) in the labeled set are identified by matching to one of the two annotated LINE-1 lists (fixed present and RepeatMasker). (2) Negative instances (false insertion sites) in the labeled set are defined as candidates missing the first 5′-most position of the L1 amplification primer. (3) Labeled set comprises positive instances and negative instances, with “1” representing a positive instance and “0” representing a negative instance. Five selected features extracted from the sequencing data are obtained for each instance and will be used to construct the predictive model. (4) Labeled set is split into a training set (70%) and test set (30%) randomly. (5) Predictive model is built with logistic regression on the training set to establish the relationship between the characteristics in sequencing data and the instance type (i.e., whether a candidate insertion site is a true insertion site). (6) Resulting predictive model is applied to the test set to predict the instance type. (7) Instance type predicted by the model is compared with the true instance type of the test set to evaluate the performance of the predictive model (measured by accuracy). If the model performance is not satisfactory on the test set, applying the predictive model on the unlabeled dataset for novel insertion set discovery is not recommended. (8) Predictive model is applied to the unlabeled set to predict the probability of a candidate insertion site being a true transposon insertion site.
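The pre-modeling filter described above, which discards candidate sites with no restriction enzyme cut site near the enriched target region, can be sketched as a nearest-neighbor lookup. The text does not state what distance counts as "near"; the 3,000-bp default below is an assumption borrowed from the upper fragment-length bound, and the function names and the (chromosome, position) representation are hypothetical.

```python
from bisect import bisect_left


def nearest_cut_site_distance(position, sorted_cut_sites):
    """Distance in bp from `position` to the closest cut site, or None if there are none."""
    if not sorted_cut_sites:
        return None
    i = bisect_left(sorted_cut_sites, position)
    # Only the sites immediately left and right of `position` can be the nearest.
    neighbours = sorted_cut_sites[max(i - 1, 0):i + 1]
    return min(abs(position - site) for site in neighbours)


def drop_candidates_without_cut_site(candidates, cut_sites_by_chrom, max_distance=3000):
    """Keep (chrom, position) candidates with a cut site within max_distance bp."""
    kept = []
    for chrom, position in candidates:
        sites = cut_sites_by_chrom.get(chrom, [])
        distance = nearest_cut_site_distance(position, sites)
        if distance is not None and distance <= max_distance:
            kept.append((chrom, position))
    return kept
```

The per-chromosome cut-site lists must be sorted in ascending order for the binary search to be valid.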

Identification of Tumor-Specific Insertion Sites and PCR Validation.To test the TIPseqHunter pipeline, we mapped LINE-1 inser-tions in paired tumor and normal DNA samples from patientswith PDAC and patients with ovarian carcinoma (OC). As pre-viously reported (27), PDAC samples were acquired through arapid autopsy protocol; we had available matched normal pri-mary tumor and metastatic tumor samples from 10 individuals,and matched normal and primary tumor samples from threeindividuals. Using TIPseqHunter, we identified 88 so-called“progenitor L1” insertions, somatically acquired LINE-1 inser-tions shared by a primary tumor and a metastatic site of diseasein the case, and not found in normal genomic DNA (gDNA)from the same patient. We also identified 127 other (unshared)somatic insertions in either primary or metastatic tumor sampleswhen comparing these samples with normal samples: 63 in pri-mary tumors and 64 in metastatic sites of disease. Information onthese somatic insertions is available on the TranspoScope website. We are able to detect 76 of the 80 previously reported,PCR-validated L1 insertions (27) using TIPseqHunter, giving asensitivity of 95%. Of 20 additional candidates identified byTIPseqHunter, we tested four high-quality insertions and suc-cessfully PCR-validated all four of these insertions. Using fewersequencing reads (<10 million per sample) compromises our

ability to detect these insertions (Fig. S2). We note that in theprevious study, an insertion-finding algorithm was used but asignificant component of manual postreview was also required;TIPseqHunter dispenses with this need.For a TIPseq study of OC reported here, we compared paired

tumor and normal gDNA from eight individuals with type II OC, seven with high-grade serous carcinoma (HGSC) and one with carcinosarcoma. These cases are representative of one of the most lethal malignant diseases in women, and we have previously reported high levels of LINE-1–encoded protein expression in these malignancies (37). Using TIPseq and TIPseqHunter, we found a total of 36 somatically acquired, tumor-specific insertions in five of the eight individuals. We had sufficient gDNA available for PCR validations for two of the five samples, both HGSC. We successfully validated all 21 insertions attempted in one sample (Fig. 6A and TranspoScope web site) and five of the six insertions in the second sample. Thus, the overall PCR validation rate for this ovarian cancer dataset was 96.3% (26 of 27 samples), and our overall validation rate was 95.5% (106 of 111 samples).

Fig. 4. Model parameters. The distribution of five model parameters (width and depth in base pairs of the enriched target region; variant index, alignment mismatches and indels; pA tail purity at each predicted LINE-1 3′ end; and the number of junction reads supporting the predicted site). Width, depth, and junction reads are all log2-based values. Width and depth determine the placement of each point on the x and y axes. The variant index is shown as the color of the data point fill. The pA purity is shown as the color of the data point outline. The number of junction reads is depicted as the size of the data point. (A) Negative instances (Left), positive instances (Center), and unlabeled instances (Right) when a fixed present set of L1PA1 is used to train the model. (B) Predicted probabilities that candidate insertions are true LINE-1 insertions in five increments of P = 0.02, and then P < 0.9 (rightmost). (C) Instances when the RepeatMasker set is used to train the model. More insertions are included as positive instances for training compared with A. (D) Predicted probabilities associated with C.

Notably, one of the somatically acquired insertions predicted by TIPseqHunter falls within intron 5 of the well-known tumor suppressor gene, breast cancer 1 gene (BRCA1). Inherited mutations in BRCA1 can create strong predispositions to both ovarian and breast cancers, and somatically acquired mutations or loss of BRCA1 expression through copy number alterations or epigenetic silencing occurs in a substantial proportion of ovarian HGSC cases. The 593-bp, 5′ truncated L1Hs insertion is in an antisense orientation with respect to BRCA1, and it has a 12- to 19-bp TSD (the exact TSD length is uncertain due to T7 microhomology with the pA tail). The intron sequence is rich in Alu insertions, and the somatically acquired LINE-1 interrupts an AluSx. The insertion was validated with a series of PCR reactions, and Sanger-sequenced (Fig. 6A). The patient is not a known carrier of a germ-line BRCA1 mutation; we are currently assessing the functional consequences of this intronic insertion.
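The 12- to 19-bp TSD range can be read as a consequence of the seven-base homopolymer run at the junction: those bases could belong either to the pA tail or to the duplicated target site. The sketch below illustrates that reading only. It is an assumption about how the range arises, not the authors' stated procedure; the function names and the toy sequence are hypothetical, and the sequence is not the real BRCA1 junction.

```python
from typing import Tuple


def leading_run(seq: str, base: str) -> int:
    """Count how many consecutive copies of `base` start `seq`."""
    count = 0
    for ch in seq:
        if ch != base:
            break
        count += 1
    return count


def tsd_length_range(duplicated_flank: str, tail_base: str = "T") -> Tuple[int, int]:
    """Return (min, max) TSD length for a duplicated flank.

    Bases at the start of the flank that match the tail homopolymer are
    ambiguous: they may be tail or TSD. The minimum excludes them and the
    maximum includes them.
    """
    ambiguous = leading_run(duplicated_flank, tail_base)
    maximum = len(duplicated_flank)
    return maximum - ambiguous, maximum


# Hypothetical flank: 7 ambiguous T's followed by 12 unambiguous bases.
toy_flank = "TTTTTTT" + "GACCTAGGATCC"
print(tsd_length_range(toy_flank))  # (12, 19)
```

Here `tail_base` is "T" because the text describes a T7 run; viewed on the opposite strand the same ambiguity would appear as a run of A's.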

Data Visualization: TranspoScope. We have developed a biologist-friendly display for viewing TIPseq results that we call TranspoScope. TranspoScope is meant to provide the following: (i) a graphical view of read pairs supporting each insertion; (ii) a display of the quality and quantity of junction reads, which we perceive as critical for confidently calling an insertion; (iii) a zoomable format that allows a range from multiple kilobases to individual nucleotides; (iv) a pane reporting on the nearest gene(s) and, if intragenic, the position within the gene; and (v) a link to the UCSC Genome Browser, with a pointer to the exact position of the insertion. In addition, instances of restriction enzyme sites used in the TIPseq experiment that lie near the insertion point are indicated as red vertical lines in the screen shots shown in Fig. 6 B–E, and a supporting movie (https://www.youtube.com/watch?v=exVAnoMRLSM) shows a TranspoScope session.
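As an illustration of item (v), a link into the UCSC Genome Browser can be built from an insertion coordinate using the browser's `hgTracks` endpoint with its `db` and `position` parameters. This is a minimal sketch, not TranspoScope's own code: the helper name `ucsc_link` is hypothetical, and the assembly `hg19` is an assumption, since the assembly is not stated in this section. The coordinate is the BRCA1 insertion position given in the Fig. 6 legend (chr17:41,250,393 ± 1).

```python
from urllib.parse import urlencode


def ucsc_link(chrom: str, pos: int, flank: int = 0, db: str = "hg19") -> str:
    """Build a UCSC Genome Browser URL centered on an insertion site.

    `flank` widens the window on both sides of `pos`. `db` is the genome
    assembly name and is an assumption here, not taken from the paper.
    """
    start = max(1, pos - flank)
    end = pos + flank
    # Keep ':' unescaped so the position stays readable in the URL.
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"}, safe=":")
    return f"https://genome.ucsc.edu/cgi-bin/hgTracks?{query}"


url = ucsc_link("chr17", 41250393, flank=1)
print(url)
```

Passing a larger `flank` gives a zoomed-out view of the same locus, which mirrors the multiple-kilobase to single-nucleotide range described for TranspoScope.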

Discussion

Much of our DNA is the result of transposable element activities. Over the past several million years, essentially all of this activity has resided with a subfamily of LINE-1, the H. sapiens-specific L1Hs and other RNA transposons dependent on LINE-1–encoded reverse transcriptase for their retrotransposition (2).

The highly repetitive nature of these sequences has made them especially challenging to study. The reference genome assembly captures fixed alleles and high-frequency alleles, but does not encompass many common variants (17). Targeted methods for recovering LINE-1 insertions for next-generation sequencing are new technologies, and include approaches developed by Rahbari and Badge (38), Devine and co-workers (15), Faulkner and co-workers (19, 39, 40), Kazazian and co-workers (16, 41), Deininger and co-workers (42), and Gage and co-workers (43), as well as our own approaches. Collectively, their application has reinforced that LINE-1 is an important source of structural variation in humans and contributed to a growing database of LINE-1 insertion polymorphisms (44). This conclusion is even more true for Alu insertion variants (45, 46). These studies have also demonstrated that several types of epithelial cancers acquire somatic insertions of LINE-1 as they develop (15, 19, 24, 26). Recent projects mining whole-genome sequencing data have extended our understanding of the scope of heritable LINE-1 insertions (20, 21, 47) and somatic retrotransposition (22, 23, 48) greatly.

In the coming years, experimentalists will have many reasons to map LINE-1 insertions in individual samples. The ability to develop comprehensive catalogs of insertions in a particular sample is an important prerequisite to recognizing recurrent insertion patterns, associating insertions with phenotypes, manipulating these sequences to test their effects, identifying which LINE-1s are active in a sample (47), and discerning what makes a cell context permissive for LINE-1 activity. At present, we have focused on TIPseq using large quantities of high-molecular-weight genomic DNA as starting material. In the future, adapting the protocol for single-cell applications or for use with fragmented gDNA from fixed tissues would broaden the scope of questions that can be addressed by targeted LINE-1 insertion site mapping.

Targeted LINE-1 mapping methods have recently been developed and have not been extensively shared among independent laboratories or even among multiple users within their resident laboratories. There have been limited publications detailing wet bench considerations or providing accompanying informatics suites. There have been limited efforts to compare approaches to establish best practices, and more challenging applications, such as single-cell LINE-1 mapping, have generated estimates of LINE-1 activities that are difficult to reconcile (18, 40, 49).

Here, we present a user’s guide to a targeted LINE-1 insertion site sequencing strategy: TIPseq. We describe a ligation-mediated PCR method for the selective amplification of LINE-1 insertions and provide code for a machine-learning–based algorithm to identify insertions based on the resulting reads. The PCR approach is a modification of vectorette PCR, which selectively amplifies fragments of genomic sequence 3′ of an interspersed repeat sequence, in this application, the L1PA1 element. The analytical portion uses information from read alignments to unique sequences downstream, as well as the so-called “junction reads” that span the 3′ pA end of the LINE-1 and the adjacent DNA. This methodology provides resolution of the 3′ end of the insertion to the base pair, and gives the orientation of the insertion. Although the design of the PCR allows us to amplify 5′ truncated and 5′ inverted LINE-1 insertions, those severely truncated insertions (<100 bp), pA-only insertions, and LINE-1 insertions with extensive 3′ transduction events are missed or pose problems for TIPseq/TIPseqHunter.

We apply TIPseq and TIPseqHunter here to detect somatic retrotransposition in two types of malignancies, PDAC and type II OC. High proportions of both types of tumors aberrantly express LINE-1 encoded ORF1p protein (37), an RNA-binding protein critical for LINE-1 retrotransposition. Both have previously been shown to permit somatic retrotransposition events (22, 25, 27). We demonstrate that TIPseq and TIPseqHunter allow for the precise detection of somatic LINE-1 integrations when LINE-1 insertion profiles of matched tumor and normal DNA are compared. Insertions found by this approach have an excellent validation rate, with well over 90% of insertions being validated by a traditional gDNA PCR. Our discovery of a somatically acquired insertion at the BRCA1 locus in a case of ovarian cancer underscores the types of potentially functional alterations that can be revealed by this approach.
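The matched tumor/normal comparison and the validation-rate arithmetic can be sketched as follows. This is a simplified illustration under stated assumptions, not the TIPseqHunter implementation: the `Call` type, the function names, the 10-bp matching `tolerance`, and the toy coordinates are all hypothetical. Only the validation counts (21 of 21 and 5 of 6 insertions in the two ovarian samples, and 106 of 111 overall) come from the text.

```python
from typing import List, NamedTuple


class Call(NamedTuple):
    """A predicted insertion site (hypothetical minimal record)."""
    chrom: str
    pos: int


def somatic_candidates(tumor: List[Call], normal: List[Call],
                       tolerance: int = 10) -> List[Call]:
    """Keep tumor calls with no normal call on the same chromosome
    within `tolerance` bp (the tolerance value is an assumption)."""
    return [
        t for t in tumor
        if not any(n.chrom == t.chrom and abs(n.pos - t.pos) <= tolerance
                   for n in normal)
    ]


def validation_rate(validated: int, attempted: int) -> float:
    """Percentage of attempted PCR validations that succeeded."""
    return round(100.0 * validated / attempted, 1)


# Toy calls, not real data: the chr1 site is shared with the normal
# sample, so only the chr6 and chr17 sites remain as somatic candidates.
tumor = [Call("chr1", 1000), Call("chr6", 5000), Call("chr17", 9000)]
normal = [Call("chr1", 1003), Call("chr6", 7000)]
print(somatic_candidates(tumor, normal))

# Counts reported in the text.
print(validation_rate(21 + 5, 21 + 6))  # 96.3
print(validation_rate(106, 111))        # 95.5
```

A positional tolerance is used rather than exact matching because independent calls of the same insertion may differ by a few base pairs, as the ± values in the Fig. 6 legend suggest.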

Fig. 5. Model performance. (A) Accuracies for a set of matched germ-line, primary pancreatic tumor, and metastatic tumor samples. Accuracies are the highest when the fixed present L1PA1 set is used to define positive instances for training models (circles). Accuracies for the RepeatMasker trained models range from 0.90 to 0.98 (squares). (B) Number of insertions detected (labeled positive plus unlabeled instances) with P > 0.99 as predicted by the models when training on the fixed present and RepeatMasker sets.




Fig. 6. Somatic LINE-1 insertions in ovarian cancer. (A) Positions of somatic insertions observed in HGSC shown on a chromosomal ideogram (red marks). To the lower right, a schematic shows the structure of the BRCA1 gene at 17q21.31 and the location of a somatically acquired, intronic LINE-1 insertion. The 593-bp LINE-1 is 5′ truncated and includes a portion of the ORF2p ORF, the LINE-1 3′ UTR, and a pA tail (red); it is flanked by TSDs (white boxes). (B–E) TranspoScope view of the evidence for two insertions. (B and C) L1(Ta) at chr6:136,712,694 ± 3 at two different magnifications. (D and E) LINE-1 insertion at chr17:41,250,393 ± 1 in BRCA1. (B and D) Distribution of genome/genome read pairs (gray); genome/L1 read pairs (purple/blue), junction reads (orange), and all reads overlaid. (C and E) Sequence of the junction reads.



ACKNOWLEDGMENTS. This work was supported by the Sol Goldman Pancreatic Cancer Research Center and the Health, Empowerment, Research, and Awareness Women’s Cancer Foundation (N.R.); a Burroughs Wellcome Fund Career Award for Biomedical Scientists Program (K.H.B.); US NIH Awards R01CA161210 (to J.D.B.), R01CA163705 (to K.H.B.), and R01GM103999 (to K.H.B.); as well as National Institute of General Medical Sciences Center for Systems Biology of Retrotransposition Grant P50GM107632 (to K.H.B. and J.D.B.).

1. Lander ES, et al.; International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409(6822):860–921.
2. Burns KH, Boeke JD (2012) Human transposon tectonics. Cell 149(4):740–752.
3. Boissinot S, Chevret P, Furano AV (2000) L1 (LINE-1) retrotransposon evolution and amplification in recent human history. Mol Biol Evol 17(6):915–928.
4. Skowronski J, Fanning TG, Singer MF (1988) Unit-length line-1 transcripts in human teratocarcinoma cells. Mol Cell Biol 8(4):1385–1397.
5. Sheen FM, et al. (2000) Reading between the LINEs: Human genomic variation induced by LINE-1 retrotransposition. Genome Res 10(10):1496–1508.
6. Dewannieux M, Esnault C, Heidmann T (2003) LINE-mediated retrotransposition of marked Alu sequences. Nat Genet 35(1):41–48.
7. Hancks DC, Goodier JL, Mandal PK, Cheung LE, Kazazian HH, Jr (2011) Retrotransposition of marked SVA elements by human L1s in cultured cells. Hum Mol Genet 20(17):3386–3400.
8. Esnault C, Maestre J, Heidmann T (2000) Human LINE retrotransposons generate processed pseudogenes. Nat Genet 24(4):363–367.
9. Luan DD, Korman MH, Jakubczak JL, Eickbush TH (1993) Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: A mechanism for non-LTR retrotransposition. Cell 72(4):595–605.
10. Cost GJ, Feng Q, Jacquier A, Boeke JD (2002) Human L1 element target-primed reverse transcription in vitro. EMBO J 21(21):5899–5910.
11. Ostertag EM, Kazazian HH, Jr (2001) Twin priming: A proposed mechanism for the creation of inversions in L1 retrotransposition. Genome Res 11(12):2059–2065.
12. Goodier JL, Ostertag EM, Kazazian HH, Jr (2000) Transduction of 3′-flanking sequences is common in L1 retrotransposition. Hum Mol Genet 9(4):653–657.
13. Pickeral OK, Makałowski W, Boguski MS, Boeke JD (2000) Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res 10(4):411–415.
14. Symer DE, et al. (2002) Human l1 retrotransposition is associated with genetic instability in vivo. Cell 110(3):327–338.
15. Iskow RC, et al. (2010) Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141(7):1253–1261.
16. Ewing AD, Kazazian HH, Jr (2010) High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20(9):1262–1270.
17. Huang CR, et al. (2010) Mobile interspersed repeats are major structural variants in the human genome. Cell 141(7):1171–1182.
18. Evrony GD, Lee E, Park PJ, Walsh CA (2016) Resolving rates of mutation in the brain using single-neuron genomics. eLife 5:e12966.
19. Shukla R, et al. (2013) Endogenous retrotransposition activates oncogenic pathways in hepatocellular carcinoma. Cell 153(1):101–111.
20. Ewing AD, Kazazian HH, Jr (2011) Whole-genome resequencing allows detection of many rare LINE-1 insertion alleles in humans. Genome Res 21(6):985–990.
21. Stewart C, et al.; 1000 Genomes Project (2011) A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet 7(8):e1002236.
22. Lee E, et al.; Cancer Genome Atlas Research Network (2012) Landscape of somatic retrotransposition in human cancers. Science 337(6097):967–971.
23. Tubio JM, et al.; ICGC Breast Cancer Group; ICGC Bone Cancer Group; ICGC Prostate Cancer Group (2014) Mobile DNA in cancer. Extensive transduction of nonrepetitive DNA mediated by L1 retrotransposition in cancer genomes. Science 345(6196):1251343.
24. Solyom S, et al. (2012) Extensive somatic L1 retrotransposition in colorectal tumors. Genome Res 22(12):2328–2338.
25. Ewing AD, et al. (2015) Widespread somatic L1 retrotransposition occurs early during gastrointestinal cancer evolution. Genome Res 25(10):1536–1545.
26. Doucet-O’Hare TT, et al. (2015) LINE-1 expression and retrotransposition in Barrett’s esophagus and esophageal carcinoma. Proc Natl Acad Sci USA 112(35):E4894–E4900.
27. Rodić N, et al. (2015) Retrotransposon insertions in the clonal evolution of pancreatic ductal adenocarcinoma. Nat Med 21(9):1060–1064.
28. Wheelan SJ, Scheifele LZ, Martínez-Murillo F, Irizarry RA, Boeke JD (2006) Transposon insertion site profiling chip (TIP-chip). Proc Natl Acad Sci USA 103(47):17632–17637.
29. Arnold C, Hodgson IJ (1991) Vectorette PCR: A novel approach to genomic walking. PCR Methods Appl 1(1):39–42.
30. Eggert H, Bergemann K, Saumweber H (1998) Molecular screening for P-element insertions in a large genomic region of Drosophila melanogaster using polymerase chain reaction mediated by the vectorette. Genetics 149(3):1427–1434.
31. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120.
32. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25.
33. Smit AFA, Hubley R, Green P (2010) RepeatMasker Open-3.0. Available at www.repeatmasker.org. Accessed January 9, 2017.
34. Quinlan AR, Hall IM (2010) BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842.
35. Jurka J (2000) Repbase update: A database and an electronic journal of repetitive elements. Trends Genet 16(9):418–420.
36. Schweikert C, Brown S, Tang Z, Smith PR, Hsu DF (2012) Combining multiple ChIP-seq peak detection systems using combinatorial fusion. BMC Genomics 13(Suppl 8):S12.
37. Rodić N, et al. (2014) Long interspersed element-1 protein expression is a hallmark of many human cancers. Am J Pathol 184(5):1280–1286.
38. Rahbari R, Badge RM (2016) Combining Amplification Typing of L1 Active Subfamilies (ATLAS) with high-throughput sequencing. Methods Mol Biol 1400:95–106.
39. Sanchez-Luque FJ, Richardson SR, Faulkner GJ (2016) Retrotransposon Capture Sequencing (RC-Seq): A targeted, high-throughput approach to resolve somatic L1 retrotransposition in humans. Methods Mol Biol 1400:47–77.
40. Upton KR, et al. (2015) Ubiquitous L1 mosaicism in hippocampal neurons. Cell 161(2):228–239.
41. Doucet TT, Kazazian HH, Jr (2016) Long interspersed element sequencing (L1-Seq): A method to identify somatic LINE-1 insertions in the human genome. Methods Mol Biol 1400:79–93.
42. Streva VA, et al. (2015) Sequencing, identification and mapping of primed L1 elements (SIMPLE) reveals significant variation in full length L1 elements between individuals. BMC Genomics 16:220.
43. Erwin JA, et al. (2016) L1-associated genomic regions are deleted in somatic cells of the healthy human brain. Nat Neurosci 19(12):1583–1591.
44. Mir AA, Philippe C, Cristofari G (2015) euL1db: The European database of L1HS retrotransposon insertions in humans. Nucleic Acids Res 43(Database issue):D43–D47.
45. Witherspoon DJ, et al. (2010) Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics 11:410.
46. Witherspoon DJ, et al. (2013) Mobile element scanning (ME-Scan) identifies thousands of novel Alu insertions in diverse human populations. Genome Res 23(7):1170–1181.
47. Sudmant PH, et al.; 1000 Genomes Project Consortium (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526(7571):75–81.
48. Helman E, et al. (2014) Somatic retrotransposition in human cancer revealed by whole-genome and exome sequencing. Genome Res 24(7):1053–1063.
49. Evrony GD, et al. (2012) Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 151(3):483–496.



