
Use of Computer-designed Group II Introns to Disrupt Escherichia coli DExH/D-box Protein and DNA Helicase Genes

Jiri Perutka, Wenjun Wang, David Goerlitz and Alan M. Lambowitz*

Institute for Cellular and Molecular Biology, Department of Chemistry and Biochemistry, and Section of Molecular Genetics and Microbiology, School of Biological Sciences, University of Texas at Austin, Austin, TX 78712, USA

Mobile group II introns are site-specific retroelements that use a novel mobility mechanism in which the excised intron RNA inserts directly into a DNA target site and is then reverse transcribed by the associated intron-encoded protein. Because the DNA target site is recognized primarily by base-pairing of the intron RNA with only a small number of positions recognized by the protein, it has been possible to develop group II introns into a new type of gene targeting vector (“targetron”), which can be reprogrammed to insert into desired DNA targets simply by modifying the intron RNA. Here, we used databases of retargeted Lactococcus lactis Ll.LtrB group II introns and a compilation of nucleotide frequencies at active target sites to develop an algorithm that predicts optimal Ll.LtrB intron-insertion sites and designs primers for modifying the intron to insert into those sites. In a test of the algorithm, we designed one or two targetrons to disrupt each of 28 Escherichia coli genes encoding DExH/D-box and DNA helicase-related proteins and tested for the desired disruptants by PCR screening of 100 colonies. In 21 cases, we obtained disruptions at frequencies of 1–80% without selection, and in six other cases, where disruptants were not identified in the initial PCR screen, we readily obtained specific disruptions by using the same targetrons with a retrotransposition-activated selectable marker. Only one DExH/D-box protein gene, secA, which was known to be essential, did not give viable disruptants. The apparent dispensability of DExH/D-box proteins in E. coli contrasts with the situation in yeast, where the majority of such proteins are essential. The methods developed here should permit the rapid and efficient disruption of any bacterial gene, the computational analysis provides new insight into group II intron target site recognition, and the set of E. coli DExH/D-box protein and DNA helicase disruptants should be useful for analyzing the function of these proteins.

© 2003 Elsevier Ltd. All rights reserved.

Keywords: functional genomics; gene targeting; reverse transcriptase; ribozyme; RNA helicase

*Corresponding author

Introduction

Mobile group II introns, found in bacterial and organelle genomes, are retrotransposable elements that insert into specific DNA target sites at high frequency by a process termed retrohoming.1,2

These mobile introns encode reverse transcriptases (RTs) that function in intron mobility and in RNA splicing by helping the intron RNA fold into the catalytically active structure. The mobility reactions are carried out by a ribonucleoprotein (RNP) complex that forms during RNA splicing and contains the intron-encoded protein (IEP) and the


E-mail address of the corresponding author:[email protected]

Abbreviations used: EBS, exon-binding site; IBS, intron-binding site; IEP, intron-encoded protein; IPTG, isopropyl-β-D-thiogalactoside; ORF, open reading frame; SF1 and 2, helicase superfamilies 1 and 2; RAM, retrotransposition-activated marker; RNP, ribonucleoprotein; RT, reverse transcriptase.

doi:10.1016/j.jmb.2003.12.009 J. Mol. Biol. (2004) 336, 421–439

excised intron lariat RNA. RNPs initiate mobility by recognizing a relatively long DNA target site (30–45 bp), with both the IEP and base-pairing of the intron RNA contributing to DNA target-site recognition. The intron RNA then inserts directly into one strand of the DNA target site by reverse splicing, while the IEP cleaves the opposite strand and uses the cleaved 3′ end to prime reverse transcription of the inserted intron RNA. The resulting intron cDNA is integrated into genomic DNA by cellular recombination or repair mechanisms. Retrohoming is highly efficient, with insertion frequencies that can approach 100% in both bacteria and organelles.

The Lactococcus lactis Ll.LtrB intron used in the present work is shown in Figure 1(a), and its DNA target site interactions are summarized in Figure 1(b). The DNA target site corresponds to the ligated-exon sequence of the L. lactis ltrB gene, with the intron inserting at the ligated-exon junction. The region recognized by the intron RNPs extends from position −26 upstream of the intron-insertion site in the 5′-exon to +10 downstream in the 3′-exon. Intron RNPs bind DNA non-specifically and then search for target sites, with the IEP thought to first recognize a small number of specific bases in the distal 5′-exon region, including T−23, G−21, and A−20, via major groove interactions.3–5 These base interactions, bolstered by phosphate backbone and possibly minor groove interactions along one face of the helix, trigger local DNA unwinding, enabling intron RNA sequences denoted exon-binding sites 1 and 2 (EBS1 and EBS2) and δ to base-pair to DNA target site sequences denoted intron-binding sites 1 and 2 (IBS1 and IBS2) and δ′ (positions −12 to +3). The same base-pairing interactions between the intron RNA and 5′ and 3′-exon sequences occur in the precursor RNA and are required for RNA splicing (Figure 1(a)). By using the same base-pairing interactions for both DNA target site recognition and RNA splicing, the intron ensures that it will insert only at target sites from which it can subsequently splice, thereby minimizing effects on host gene expression. After base-pairing to the DNA target site, the intron RNA inserts itself into the top strand by reverse splicing between the IBS1 and δ′

sequences. Second-strand cleavage occurs after a lag and requires additional interactions between

Figure 1. The L. lactis Ll.LtrB intron: secondary structure model, DNA target site interactions, and donor plasmid for expressing retargeted group II introns. (a) Secondary structure model. The predicted secondary structure consists of six double-helical domains (I–VI). The EBS2/IBS2, EBS1/IBS1 and δ–δ′ interactions between the intron and flanking exons in unspliced precursor RNA are indicated by broken lines. The ORF encoding the LtrA protein is located in intron domain IV. (b) DNA target site interactions. The DNA target site for the Ll.LtrB intron is shown from position −26 to +10. DNA sequences IBS2, IBS1, and δ′ between positions −12 and +3 are recognized by base-pairing with the complementary intron RNA sequences EBS2, EBS1, and δ located in stem-loops Id1 and Id3(ii). Critical bases recognized by the IEP in the 5′ and 3′ exons are circled.3 The intron-insertion site in the top strand and the IEP-cleavage site in the bottom strand are indicated by arrowheads. (c) Intron donor plasmid pACD3. The plasmid contains a 0.9 kb Ll.LtrB-ΔORF intron with short flanking exons cloned downstream of a T7lac promoter in a pACYC184-based vector carrying a camR gene. The LtrA protein is expressed from a position just downstream of the 3′ exon. The locations of the IBS1, IBS2, EBS1, EBS2, δ, and δ′ sequences are indicated. E1 and E2 denote the 5′ and 3′ exons, respectively.


the IEP and the 3′-exon, the most critical being recognition of T+5. Other group II introns employ the same basic mechanism for DNA target site recognition, but with different target-site sequences recognized by both the IEP and the intron RNA.6–8

Because the DNA target site is recognized mainly by base-pairing of the intron RNA, it is possible to retarget group II introns to insert into desired DNA targets simply by modifying the intron RNA. This feature has enabled us to develop mobile group II introns into controllable gene targeting vectors, dubbed “targetrons”. In addition to E. coli, a targetron based on the L. lactis Ll.LtrB intron has been used for targeted gene disruption and site-specific DNA insertion in both Gram-negative and Gram-positive bacteria, including Salmonella typhimurium, Shigella flexneri, L. lactis, and Staphylococcus aureus (J. Zhong, H. Cui, R. Novick & A.M.L., unpublished results).9–11

For bacterial gene disruption, the targetron is typically expressed from a donor plasmid, such as pACD3 (Figure 1(c)).9,11 pACD3 contains a 0.9 kb ΔORF-derivative of the Ll.LtrB intron and short flanking exons, with the IEP expressed from a position just downstream of the 3′-exon. The protein expressed from this position splices the intron to generate RNPs, which then insert into the DNA target site. Once inserted, however, the ΔORF intron cannot splice in the absence of the IEP, leading to a gene disruption. In some cases, splicing can be restored by supplying the IEP in trans, making it possible to obtain a conditional disruption.9,10 In addition, new genes can be inserted into intron domain IV, and the targetron can then serve as a vector to integrate them at desired DNA locations.12,13 This approach was employed recently to engineer lactobacteria by integrating a commercially important phage-resistance gene at a specific, regulatable chromosomal location without selection.10

The basic strategy for retargeting group II introns to insert at new sites is to first identify the best match to the positions recognized by the IEP and then modify the intron RNA’s EBS1/2 and δ sequences to base-pair to the IBS1/2 and δ′ sequences of the target site. If the targetron is expressed from a donor plasmid, the IBS sequences in the plasmid’s 5′-exon also must be made complementary to the retargeted EBS sequences for efficient precursor RNA splicing, which is required to generate active intron RNPs. The necessary modifications of the IBS, EBS, and δ sequences in the donor plasmid are introduced by a two-step PCR, using three unique primers and one constant primer.9,11,14 Because efficient splicing of the wild-type Ll.LtrB intron in E. coli does not require the δ–δ′ interaction in unspliced precursor RNA, the δ′ sequence in the donor plasmid’s 3′-exon had been left unmodified, saving additional primers that would be required to introduce this modification.

In practice, it has been difficult to use simple rules to design efficiently retargeted group II introns, in part because the RNPs recognize nucleotide residues in the DNA target site with different stringencies, and none is absolutely required.11,14,15 Initial attempts to identify Ll.LtrB intron target sites based on a consensus target sequence were inefficient. If all the positions that contribute to DNA target site recognition were fixed, then there were too few potential target sites in any gene, whereas if nucleotide residues at some positions were allowed to vary, the number of target sites increased, but the retargeted introns were often inefficient because the consensus sequence did not provide adequate information about the variability allowed at these positions. In the region recognized by base-pairing of the intron RNA, there are potential thermodynamic constraints for DNA unwinding and the base-pairing interactions themselves, different base-pairs may contribute differently to both forward and reverse splicing, and some nucleotide substitutions may adversely affect the structure of the intron RNA. For these reasons, the most reliable method for obtaining group II introns that insert at high frequency into desired DNA target sites had been to use an E. coli genetic assay to select them from combinatorial intron libraries that have randomized target site recognition sequences.14 However, the size of the intron library that can be sampled is limited by the E. coli transformation efficiency, and the approach is time-consuming because the selectants must be screened and then base-pairing interactions optimized to obtain the most efficient intron.

Here, we used a database of efficiently retargeted Ll.LtrB introns and a compilation of nucleotide frequencies at active target sites to develop an algorithm that predicts optimal intron-insertion sites in any desired DNA target and then designs oligonucleotides for modifying the intron to insert into those sites. By using this algorithm, we designed Ll.LtrB targetrons to disrupt a set of E. coli genes encoding DExH/D-box and DNA helicase-related proteins, with only one of the 28 genes tested found to be essential. The methods developed here should permit the rapid and efficient disruption of any bacterial gene, the computational analysis provides new insight into group II intron target site recognition, and the set of DExH/D-box protein and DNA helicase disruptants should be useful for analyzing the function of these proteins.

Results

Information content of DNA target site and intron RNA sequences

As a first step in the development of the algorithm, we used databases of retargeted Ll.LtrB intron/target site combinations to determine the information content I(p) of each position p in the


DNA target site and the intron RNA’s target site recognition (EBS and δ) sequences. The information content was calculated according to the equation:

$$I(p) = H_{\max} - H(p) \qquad (1)$$

where H(p) is the uncertainty (Shannon entropy) for position p, and H_max is the maximal uncertainty for a DNA or RNA sequence, which is equal to 2 bits. The Shannon entropy for position p was calculated according to the equation:

$$H(p) = -\sum_{i} f_i(p)\,\log_2 f_i(p) \qquad (2)$$

where f_i(p) is the frequency of nucleotide i at position p in the database. A totally random sequence would have zero information content, while more conserved sequences have higher information contents.
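As an illustration of equations (1) and (2), the following is a minimal sketch, assuming the target sites are supplied as equal-length, pre-aligned strings. The function names and the toy alignment are hypothetical and are not taken from the databases described here; no small-sample correction is applied, since none is described in the text.

```python
import math
from collections import Counter

H_MAX = 2.0  # maximal uncertainty for a four-letter alphabet, in bits


def shannon_entropy(column):
    """H(p) = -sum_i f_i(p) * log2 f_i(p) for one aligned column."""
    counts = Counter(column)
    total = sum(counts.values())
    # Nucleotides that never occur contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_content(sequences):
    """Return I(p) = H_max - H(p) for every column of the alignment."""
    return [H_MAX - shannon_entropy(column) for column in zip(*sequences)]


# Hypothetical toy alignment (illustrative only, not data from the paper).
toy_sites = ["GATC", "GTTA", "GCTG", "GGTT"]
print(information_content(toy_sites))
```

In this toy alignment the first and third columns are invariant and the second and fourth contain each nucleotide once, so the sketch reproduces the two limiting cases mentioned above: fully conserved positions carry 2 bits and fully random positions carry none.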

Results of the above computations for three different databases are summarized by bar graphs in Figure 2, with open bars indicating the information content of DNA target site positions, and filled bars indicating the information content of intron RNA positions. The first database is based on 125 retargeted introns that had either been designed by previous computational approaches (32 introns) or selected from combinatorial intron libraries (93 introns). Mobility frequencies for 101 of these introns had been determined by using an E. coli two-plasmid assay (unpublished data).9,14

The second database is based on a subset of 92 retargeted introns from the first database that have mobility frequencies ≥0.1%. The third database consists of 88 random intron insertions in the E. coli genome obtained by using an intron library with randomized EBS and δ sequences.11 All three databases gave similar results, as did different subsets of the first database cut off at higher mobility frequencies (0.5% or 1.0%) or excluding designed introns (not shown). Nucleotide frequencies and base-pairing preferences for the second database (database II), which was selected below for use as a training set, are summarized in Tables 1 and 2, respectively.

First, in regions of the DNA target site recognized by the IEP (−30 to −14 and +5 to +15), all three databases show the highest information content for positions −23, −21, −20, −19, −17, −15, and −14 in the 5′-exon and +5 in the 3′-exon, with the two most critical positions being −21 and +5, in agreement with previous mutagenesis and selection experiments.11,14,15 In the region recognized by base-pairing of the intron RNA (−12 to −8 and −6 to +3), all three databases show high information content for DNA target site positions −12 and −8 in IBS2, and −6 in IBS1, although the third database shows substantially higher information content at these positions, as well as other IBS and δ′ positions (e.g. −11, −5, −3, and +1). These quantitative differences for positions recognized by base-pairing likely reflect (i) that the first two databases, where the information content is lower, include 32 designed introns that were modified to base-pair to each of these positions, effectively equalizing their contribution, and (ii) that the third database, where the information content is higher, is based on randomly selected introns from the combinatorial library that are not perfectly base-paired to their target sites, thus increasing the stringency of selection for the most important positions.

For the intron RNA positions potentially involved in DNA target site recognition, all three databases show high information content at positions −12 and −8 in EBS2, −6 in EBS1, and +3 in δ, with the third database also showing high information content at position −3 (filled bars in Figure 2). All of these positions except +3 also have high information content in the DNA sequence, suggesting a preference for specific base-pairs for DNA target site recognition. The base-pairing preferences, deduced from compilations of base-pairing interactions at active target sites, are similar in all three databases (Table 2; Figure 6 of Zhong et al.;11 and data not shown). The preferred base-pairs at positions −12 and −6 are GC or CG, suggesting that a strong base-pair may be required to anchor the ends of the EBS2/IBS2 and EBS1/IBS1 duplexes. A GC or CG base-pair is also preferred at position −3, while the preferred base-pairs at position −8 are AT or GC. The latter preference may reflect that a purine is required at position −8 in the intron RNA in order to form an optimal structure of the Id1 stem-loop containing EBS2 (Figure 2(a), top). The remaining position, +3, which has high information content only in the intron RNA, shows a strong preference for retention of the wild-type A residue within the intron, regardless of the potential base-pairing partner in the DNA target site (Table 2 and Figure 6 of Zhong et al.11). Since the intron RNA is expressed from a donor plasmid in which the complementary T residue at 3′-exon position +3 is fixed, this preference could reflect either that the δ–δ′ +3 pairing in the precursor RNA is required for efficient RNA splicing or that changing the A residue at this position of the intron RNA has a deleterious effect on RNA structure.

Covariation analysis

We next tested whether there is covariation between different positions in the DNA target site. Although limited by the small number of available sequences, we used the χ² test to detect possible interactions between any two positions in the DNA target sequence. χ² was calculated according to the formula:

$$\chi^2 = \sum_{i,j} \frac{\left(O_{i(p),j(q)} - E_{i(p),j(q)}\right)^2}{E_{i(p),j(q)}} \qquad (3)$$

where O_{i(p),j(q)} is the observed count of a pair of nucleotide residues i and j at positions p and q, respectively, and E_{i(p),j(q)} is the expected count for the same pair calculated under the null hypothesis that the positions in the DNA sequence are independent. The latter was calculated using the equation:

$$E_{i(p),j(q)} = f_i(p)\,f_j(q)\,n \qquad (4)$$

where f_i(p) is the frequency of nucleotide i at the position p, f_j(q) is the frequency of nucleotide j at the position q, and n is the total number of DNA sequences in the database.
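Equations (3) and (4) can be sketched as follows. This is an illustrative implementation only: the function name is hypothetical, and p and q are 0-based string indices into pre-aligned sequences rather than target-site coordinates. Cells with an expected count of zero are skipped, which is an assumption of this sketch and not a rule stated in the text.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def chi_square(sequences, p, q):
    """Chi-square statistic for covariation between columns p and q."""
    n = len(sequences)
    counts_p = Counter(seq[p] for seq in sequences)
    counts_q = Counter(seq[q] for seq in sequences)
    observed = Counter((seq[p], seq[q]) for seq in sequences)

    chi2 = 0.0
    for i in NUCLEOTIDES:
        for j in NUCLEOTIDES:
            # Equation (4): expected count under independence.
            expected = (counts_p[i] / n) * (counts_q[j] / n) * n
            if expected > 0:  # assumption: ignore cells that cannot occur
                # Equation (3): one term of the sum over all pairs (i, j).
                chi2 += (observed[(i, j)] - expected) ** 2 / expected
    return chi2


# Hypothetical toy data: the two columns covary perfectly.
print(chi_square(["AA", "CC", "GG", "TT"], 0, 1))
```

With four nucleotides at each position the test has (4 − 1) × (4 − 1) = 9 degrees of freedom, which matches the nine degrees of freedom used for the threshold in the text.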

Using the probability level P = 0.001 and nine degrees of freedom, we found two potential interactions with χ² > 27.9 in database I (−18/−6 and −9/−8), but different ones in database II

Figure 2. Information content of DNA target site and intron RNA positions in databases of retargeted Ll.LtrB group II intron/DNA target site combinations. The information content (bits) of DNA target site positions −30 to +15 (open bars) and intron RNA EBS2, EBS1, and δ positions −12 to +3 (black bars) was calculated according to equation (1) for three different databases: (a) Database of 125 retargeted group II introns with corresponding DNA target sites, including 32 designed introns and 93 introns selected from a combinatorial library (unpublished results).9,14 (b) Subset of the first database consisting of 92 retargeted intron/target site combinations with mobility frequencies ≥0.1%. This database was used as the training set for the algorithm to identify potential Ll.LtrB intron-insertion sites. (c) Database based on nucleotide frequencies at 88 random intron-insertion sites in the E. coli genome obtained with a targetron library with randomized EBS and δ sequences.11 Base-pairing interactions between the intron RNA and the top strand of the DNA target site are shown above the bar graphs in (a).


(−17/−16), and database III (−12/+9 and −2/−1). A combined database consisting of all available sequences showed five potential interactions (−20/−19, −17/−16, −9/−8, −3/−2, and −2/−1). Three of the covariations (−18/−6, −17/−16, and −12/+9) are unlikely to be real due to the small values of expected counts (<6). Further, examination of individual sequences showed that the covariation between positions −20 and −19 reflects the fact that the database contains 32 introns that had been designed according to previous DNA targeting rules, which dictated specific combinations of nucleotide residues at these positions. The remaining three potential covariations, −9/−8, −3/−2, and −2/−1, were tested below, but did not improve the performance of the algorithm.

Calculated DNA duplex stability in efficient and inefficient target sites

DNA duplex stability could potentially influence group II intron insertion frequency by affecting the thermodynamics of DNA unwinding or base-pairing interactions with the intron RNA.15 To test this possibility, we compared average ΔG° values at 37 °C calculated for sliding windows of three to ten nucleotides for collections of 30 target sites having either “high” (17–100%) or “low” (0.0001–0.06%) insertion frequencies with retargeted introns.16 Representative results for a sliding window of four nucleotides are shown in Figure 3(a). The plot shows that the average ΔG° for the more efficient target sites is higher (less negative) both in the distal 5′-exon region recognized by the IEP (positions −29 to −18) and in the IBS1/2 region recognized by base-pairing of the intron RNA (positions −12 to −1), as well as slightly higher around the second-strand cleavage site at position +9 (Figure 3(a)). These findings could reflect that lower duplex stability in these regions facilitates DNA unwinding or protein binding. However, the preferences break down for individual target sites, as illustrated in Figure 3(b) by a similar plot comparing the wild-type Ll.LtrB target site with an efficient target site in the E. coli phoH gene (25%

Table 1. Nucleotide frequencies at target site positions in the training set

The training set is based on target sites of 92 retargeted group II introns having insertion frequencies ≥0.1%. Nucleotide residues in the wild-type ltrB target site are boxed.

Table 2. RNA/DNA base-pairing interactions for retargeted introns in the training set

Base-pair combinations for the interaction of the wild-type Ll.LtrB intron with its DNA target site are boxed.


insertion frequency; see below). In fact, the plot for the wild-type target site differs markedly from that for the phoH target site, and from the norm for efficient target sites in Figure 3(a). These findings likely reflect that other features of the DNA target site can override any ΔG° preferences in specific regions. It may ultimately be possible to weight preferences for local ΔG° values, as well as other DNA structural features, but they are not now incorporated into the algorithm.

A probabilistic model of the Ll.LtrB target sequence

The simplest model of the Ll.LtrB target site is a weight matrix model based on the calculation of a position-specific score from observed nucleotide frequencies at all positions in a training set.17 This type of model, which is effectively equivalent to a zero-order Markov model, makes the assumption that a nucleotide residue in the DNA target sequence occurs at random and the likelihood of its occurrence is independent of all other residues in the DNA sequence.18

The algorithm developed here is based on such a weight matrix with the position-specific score calculated from nucleotide frequencies at selected positions in the training set. The probability that a DNA sequence (seq) is an Ll.LtrB intron target site, P(M|seq), is defined by the equation:

$$P(M \mid \mathrm{seq}) = \frac{P(\mathrm{seq} \mid M)\,P(M)}{P(\mathrm{seq})} \qquad (5)$$

where M represents our model of the Ll.LtrB target sequence, P(seq|M) is the probability of the DNA sequence given our model, P(seq) is the probability of the DNA sequence, and P(M) is the probability of our model M. P(M) is a constant for purposes of comparison and thus may be neglected. The calculation of P(seq) was based on nucleotide frequencies for each base in the E. coli genome (f_G = 0.2536, f_A = 0.2462, f_T = 0.2459, and f_C = 0.2542). DNA sequences were scored using a log-odds score, which measures whether the probability that the sequence was generated by the model is greater than the probability that the sequence was generated by chance (null model), using the equation:

$$S = \log_2 \frac{P(\mathrm{seq} \mid M)}{P(\mathrm{seq})} = \sum_{i(p)} \log_2 \frac{f_i(p)}{f_i} \qquad (6)$$

where i(p) is a nucleotide i at position p in the DNA sequence, f_i(p) is the frequency of nucleotide i at position p in the training set, and f_i is the frequency of nucleotide i in the E. coli genome. A positive score means that the model fits the potential group II intron target sequence better than the null model. The significance of the log-odds ratio was estimated by the E-value calculated using the equation:

$$E = N\,p(x \ge S) \qquad (7)$$

where p(x ≥ S) is the probability that a random variable x (the log-odds score of a random potential group II intron target site in the E. coli genome) is greater than or equal to S. p(x ≥ S) was calculated from the extreme value distribution of log-odds scores (the best 1% according to our model) of all 45 bp sequences in the E. coli genome. N represents the number of samples in the distribution, which in our case is equal to 1. The E-value estimates the expected number of false positives at or above the score. In practice, for gene targeting in E. coli, we obtained reliable predictions of Ll.LtrB intron target sequences by the model at about E ≤ 0.1, but we considered “hits” up to E ≤ 0.6.
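The scoring in equation (6) can be sketched as below, using the E. coli background frequencies quoted in the text. The weight matrix, the function names, and the 0-based indexing of the candidate site are hypothetical. Two simplifications should be noted: the sketch assumes every frequency in the matrix is non-zero (the handling of unobserved nucleotides is not described here), and the E-value helper substitutes a plain empirical tail fraction for the fitted extreme value distribution used in the text.

```python
import math

# Background nucleotide frequencies in the E. coli genome, as given in the text.
BACKGROUND = {"G": 0.2536, "A": 0.2462, "T": 0.2459, "C": 0.2542}


def log_odds_score(site, weight_matrix):
    """Equation (6): sum of log2(f_i(p) / f_i) over the selected positions.

    weight_matrix maps a 0-based index in `site` to a dict of nucleotide
    frequencies at that position in the training set (all assumed > 0).
    """
    score = 0.0
    for index, freqs in weight_matrix.items():
        base = site[index]
        score += math.log2(freqs[base] / BACKGROUND[base])
    return score


def e_value(score, random_scores, n=1):
    """Equation (7) with p(x >= S) replaced by an empirical tail fraction."""
    tail = sum(1 for x in random_scores if x >= score) / len(random_scores)
    return n * tail


# Hypothetical two-position matrix; position 1 of the site is not scored.
toy_matrix = {
    0: {"A": 0.5, "C": 0.1, "G": 0.3, "T": 0.1},
    2: {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
}
print(log_odds_score("AGT", toy_matrix))
```

Scoring only the positions present in the matrix mirrors the use of selected positions rather than every position in the target site.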

The composition of the training set is critical for the performance of the model. To identify the best possible training set, we evaluated the three databases described above, as well as a combined database with all available sequences (213

Figure 3. Calculated DNA duplex stability of Ll.LtrB target sites. (a) Average calculated ΔG° at 37 °C for a 4 nt sliding window in databases of 30 target sites having either “high” (17–100%) or “low” (0.0001–0.06%) intron-insertion frequencies. (b) Calculated ΔG° at 37 °C for a 4 nt sliding window for the wild-type Ll.LtrB target site and a phoH target site, which has a chromosomal insertion frequency of 25% (see Figure 8). ΔG° at 37 °C was calculated as described.16


sequences) and different subsets of database I target sites having progressively higher intron-insertion frequencies: 0.01% (95 sequences), 0.5% (85 sequences), and 1.0% (78 sequences). The selection of positions used in the model was controlled by setting a suitable threshold value of the information content, which filtered out the poorly discriminating positions in the target site. Each model was tested on a set of 27 trusted DNA target sequences with high intron-insertion frequencies (20–100%). The model that performed best was derived by selecting 35 positions with information content ≥0.0182 bit from database II (92 target sites with intron-insertion frequencies ≥0.1%). The selected positions are −30, −29, −27, −26, −24, −23, −22, −21, −20, −19, −18, −17, −16, −15, −14, −12, −10, −9, −8, −6, −5, −4, −3, −2, −1, +1, +3, +4, +5, +6, +7, +8, +9, +12 and +13. These positions extend somewhat beyond the previously determined target site region (−26 to +10) on both sides, suggesting that the flanking sequences may directly or indirectly influence mobility frequency.

We attempted to create more complex models that consider different combinations of the top three candidates for potential covariations between two positions in the target site (−9/−8, −3/−2, and −2/−1) based on the χ² test (see above). However, these models performed less well than the simple weight matrix model, presumably because the conditional probabilities estimated from the training set consisting of only 213 available sequences had much greater parameter estimation errors (coefficients of variation) than the probabilities estimated as f_i(p) used in the weight matrix model. Generally, a training set consisting of 700–1200 sequences, depending on the f_i(p), is needed for a satisfactory coefficient of variation for a first-order Markov model.19

We also considered other types of models. Classification with the back-propagation neural network was not possible because of the small data set, while classification with a Kohonen network was limited by the size of the resulting two-dimensional topological map.20 A large map had a number of “empty units”, which did not allow classification, while a small map classified many target sites in the same unit. A generic hidden Markov model built with freeware HMMER software could not accommodate the strict spacing of key positions, which is characteristic of group II intron target sites†.15 It may ultimately be possible to develop a more complex model of the Ll.LtrB intron target sequence, but this will likely require a larger database of retargeted group II introns than is currently available.

Rationale for design of base-pairing interactions

After the prediction of potential Ll.LtrB insertion sites, the intron RNA is retargeted by modifying its EBS1, EBS2, and δ sequences to base-pair with the DNA target site’s IBS1, IBS2, and δ′ sequences. If the retargeted intron is introduced on a donor plasmid, the IBS1 and IBS2 sequences in the 5′-exon of the donor plasmid must be modified to base-pair to the retargeted EBS1 and EBS2 sequences for efficient RNA splicing.9,14

The intron RNA’s EBS2 and EBS1/δ sequences are present in two stem-loop structures (Id1 and Id3(ii), respectively) located in different parts of intron domain I (Figure 1(a) and (b)). Examination of the sequences shows that the structures of the stem-loops and their base-pairing interactions with the DNA target site are ambiguous. Thus, for EBS2, both positions −13 and −12 could potentially base-pair with either the DNA target site or with a nucleotide residue in the intron RNA to form the top of the Id1 stem. Similarly for EBS1/δ, positions −7 and +4 could potentially base-pair with the DNA target site or with each other to form the top of the stem. In addition, the A residue at δ +3 can potentially base-pair with a T in the DNA target site, but we found selection for this A residue in the intron RNA regardless of its potential base-pairing partner in the DNA target site (Figure 2 and Zhong et al.11).

For EBS2, previous work showed selection for base-pairing with the DNA target site at position −12, but not at position −13.11,14 Consequently, the Id1 stem-loop is drawn with the U residue at position −13 base-paired with the opposite G residue in the intron RNA to form the top of the stem. For EBS1/δ, previous work showed no selection for base-pairing with DNA target site position +4 and selection against base-pairing with the DNA target site at position −7.14 Further, an additional experiment in which EBS1/δ positions −7 to +4 were partially randomized confirmed selection for base-pairing between positions −7 and +4 in the intron RNA (M. Karberg & A.M.L., unpublished results). Consequently, the Id3(ii) stem-loop is drawn with intron positions −7 and +4 base-paired to form the top of the stem.

To test the effect of base-pairing between the δ +3 position in the intron RNA and δ′ +3 position in the DNA target site, we constructed donor introns in which the A at δ +3 was either fixed or modified to base-pair with different nucleotide residues at the δ′ +3 position in the DNA target site (Figure 4). The different combinations were then assayed quantitatively to determine their effect on mobility frequency in an E. coli plasmid assay (see Materials and Methods). The results showed that introns that retain A +3 have uniformly high mobility frequencies, regardless of whether this nucleotide residue can base-pair with the DNA target site (71–100%). Only three other intron RNA/DNA target site combinations had similarly high mobility frequencies (U/A, U/T, and C/T; 78–100%), while the lowest mobility frequencies were found for C/G and G/C (32% and 40%, respectively), suggesting that strong base-pairs between the intron RNA and DNA target site at position +3 are in fact deleterious, possibly because they distort the structure of the Id3(ii) loop.

† http://hmmer.wustl.edu/

Based on the above results, the Ll.LtrB intron is retargeted by modifying intron RNA positions EBS2 −12 to −8, EBS1 −6 to −1, and δ +1 and +2 to form Watson–Crick base-pairs with the corresponding positions of the DNA target site, leaving the A residue at δ +3 fixed. If the intron is introduced via a donor plasmid, IBS1 −6 to −1 and IBS2 −12 to −8 in the 5′-exon of the plasmid are modified to form Watson–Crick base-pairs with the retargeted EBS sequences, so that the intron can splice efficiently. For the DNA target site interactions, tests at positions −12 and +2 showed that replacement of a Watson–Crick base-pair with a wobble GT or UG base-pair resulted in significantly lower mobility frequency (data not shown).
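The retargeting rule stated above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' Perl program: the function names, the 0-based `insertion_index` convention (index of target-site position +1), and the choice to report the result as a map from target-site position to pairing intron base are all assumptions made here for clarity.

```python
# Hypothetical helper names; only the position ranges and the fixed A at
# delta +3 are taken from the text.
RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}

EBS2_POSITIONS = range(-12, -7)   # pairs with IBS2, positions -12 .. -8
EBS1_POSITIONS = range(-6, 0)     # pairs with IBS1, positions -6 .. -1
DELTA_POSITIONS = (1, 2)          # pairs with delta' +1 and +2


def target_base(site: str, insertion_index: int, position: int) -> str:
    """Return the top-strand DNA base at a signed target-site position.

    insertion_index is the 0-based index of position +1 (assumed convention);
    target-site numbering has no position 0.
    """
    if position == 0:
        raise ValueError("target-site numbering has no position 0")
    offset = position if position < 0 else position - 1
    index = insertion_index + offset
    if not 0 <= index < len(site):
        raise ValueError(f"position {position} falls outside the sequence")
    return site[index].upper()


def retarget(site: str, insertion_index: int) -> dict[int, str]:
    """Map each paired target-site position to the complementary intron RNA base."""
    pairing = {}
    for position in (*EBS2_POSITIONS, *EBS1_POSITIONS, *DELTA_POSITIONS):
        base = target_base(site, insertion_index, position)
        pairing[position] = RNA_COMPLEMENT[base]
    pairing[3] = "A"  # delta +3 is left as A whatever the target base is
    return pairing
```

Position −7 is deliberately absent from the map, because the text reports selection against pairing with the target at that position.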

Modifications of the intron donor plasmid and intron RNA that increase the insertion frequency of retargeted introns

We found several additional modifications of the intron donor plasmid and intron RNA that increase the mobility frequency of retargeted introns. The effect of these modifications is illustrated in Figure 5 for an intron targeted to the mouse HPRT gene. First, we modified the donor plasmid to reduce self-targeting. In the donor plasmids pACD2 or pACD3, the IBS1 and IBS2 sequences are preceded by the distal 5′-exon sequence derived from the L. lactis ltrB gene. Because these distal 5′-exon sequences are recognized by the IEP and the adjacent IBS sequences have been changed to match the retargeted intron, the 5′-exon in the donor plasmid is a potential target site for intron insertion. This problem is most pronounced in cases like the anti-mouse HPRT intron in which the δ +1 position of the retargeted intron RNA has been changed to a C residue. This C residue at δ +1 can base-pair to the invariant G residue at the 5′ end of the intron sequence in the donor plasmid, thereby enabling the retargeted intron to form continuous base-pairs across the 5′ splice site (Figure 5(c)). Such self-targeting of the donor plasmid has been confirmed experimentally for other introns (J. Zhong & A.M.L., unpublished results).

To alleviate this problem, we changed the distal 5′-exon sequence in the donor plasmid (positions −25 to −14) to the sequence 5′-ATAATTATCCTT, which is not recognized by the LtrA protein. The original HPRT targetron inserts into its target site at a frequency of 0.12(±0.08)% in the plasmid assay, and this increased to 0.35(±0.08)% when the intron was expressed from a donor plasmid with the modified distal 5′-exon sequence (Figure 5(d)).

Second, we found that the mobility frequency could be increased further to 1.35(±0.07)% by changing the δ′ +1 position in the donor plasmid’s 3′-exon to a G residue that is complementary to the C at δ +1 of the retargeted intron, thereby restoring the δ–δ′ interaction in unspliced precursor RNA (Figure 5(e)). This finding implies that, unlike the wild-type intron, the splicing efficiency of this and likely other retargeted introns may be compromised to an extent that restoration of the δ–δ′ interaction increases the mobility frequency. When the changes in the distal 5′-exon and δ′ sequences were combined, the mobility frequency of the targetron increased to 5%, a 42-fold increase over the original mobility frequency (Figure 5(f)).

Figure 4. Effect of the nucleotide residue at Ll.LtrB intron position δ +3 on intron-insertion frequencies. (a) Predicted structure of intron RNA stem-loops Id1 and Id3(ii), containing the EBS2 and EBS1/δ sequences, respectively, and base-pairing interactions with the DNA target site. The arrowhead indicates the intron-insertion site. (b) Test of different combinations of nucleotide residues at intron RNA position δ +3 and the corresponding δ′ +3 position of the DNA target site. The nucleotide residue N at position δ +3 in the intron RNA was changed to the indicated nucleotide residues with or without compensatory changes in nucleotide residue N′ at the δ′ +3 in the DNA target site. The modified nucleotide residues are highlighted by gray shading in (a). Mobility frequencies were determined by using an E. coli two-plasmid assay in which a derivative of the Ll.LtrB intron with a phage T7 promoter inserted near its 3′ end integrates into a target site cloned upstream of a promoterless tetR gene, thereby activating that gene (see Materials and Methods). Values are the mean ± the standard deviation for three independent determinations in each case.

Finally, we noticed that the retargeted HPRT intron has two C residues at positions −12 and −11 in EBS2, which could potentially distort the Id1 loop by base-pairing with the opposite G residues at the end of the Id1 stem (Figure 5(g)). To alleviate this problem, we replaced the terminal UG pair of the Id1 stem with a UA pair. This modification combined with the others described above increased the insertion frequency of the HPRT targetron to 74%, a 617-fold increase over the original frequency. This example illustrates that the insertion frequency of an inefficient intron can be improved dramatically by careful attention to intron design to avoid potential mispairings and other deleterious combinations of nucleotide residues (e.g. runs of G or A residues) that may affect adversely the structures of the Id1 and Id3(ii) loops. It illustrates also that intron design can sometimes override target site selection as the major determinant of insertion frequency.
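The fold increases quoted for the HPRT targetron follow directly from the reported mean mobility frequencies. The short sketch below reproduces that arithmetic; the dictionary labels are informal descriptions chosen here, not names used in the paper.

```python
# Mean mobility frequencies (%) reported for the HPRT targetron.
ORIGINAL_FREQUENCY = 0.12

REPORTED_FREQUENCIES = {
    "modified distal 5' exon": 0.35,
    "restored delta-delta' pairing": 1.35,
    "both donor plasmid changes": 5.0,
    "both changes plus UA pair at top of Id1 stem": 74.0,
}


def fold_increase(frequency: float, baseline: float = ORIGINAL_FREQUENCY) -> float:
    """Ratio of a mobility frequency to the baseline frequency."""
    return frequency / baseline


if __name__ == "__main__":
    for label, frequency in REPORTED_FREQUENCIES.items():
        print(f"{label}: {fold_increase(frequency):.1f}-fold")
```

Rounded to the nearest integer, the last two ratios give the 42-fold and 617-fold increases cited in the text.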

To generalize these modifications, we constructed four new intron donor plasmids, pACD4A, pACD4C, pACD4G and pACD4T, which have the modified distal 5′-exon sequences, and four different nucleotide residues at the δ′ +1 position in the 3′ exon to complement any nucleotide residue at the δ +1 position of the retargeted intron RNA, thereby eliminating the need for additional primers to change this position. For now, we neglect pairing at the δ–δ′ +2 position in the precursor RNA, which appears to be less critical (data not shown). Any necessary modifications of the Id1 or Id3(ii) stem-loops based on inspection of the sequence of the retargeted introns are introduced readily via the primers used to modify the EBS1/δ and EBS2 sequences.9

Figure 5. Donor plasmid and intron modifications that improve intron-insertion frequency for an intron targeted to the mouse HPRT gene. (a) Predicted base-pairing interactions and mobility frequency for the wild-type Ll.LtrB intron with its normal ltrB target site. (b) and (c) Predicted base-pairing interactions between the anti-HPRT intron and the HPRT target site and intron donor plasmid, respectively, and mobility frequencies with the HPRT target site. (d) Effect of 5′-exon modifications that minimize self-targeting (distal 5′-exon sequence 5′-CGTCGATCGTGA changed to 5′-ATAATTATCCTT). (e) Effect of modifying the δ′ +1 sequence in the 3′ exon to restore the δ–δ′ interaction in precursor RNA. (f) Combination of modified distal 5′-exon sequence and restoration of δ–δ′ interaction in unspliced precursor RNA. (g) Replacement of the UG base-pair at the top of the Id1 stem with a UA base-pair. Nucleotide residues that match the wild-type sequence are boxed, and those changed from the original donor plasmid sequences are shaded gray. The intron sequence in the donor plasmid is shown in lower case italic letters. Mobility frequencies were determined by using the E. coli two-plasmid assay described for Figure 4 and in Materials and Methods. Values are the mean ± the standard deviation for three independent determinations in each case.

Disruption of E. coli DExH/D-box protein and DNA helicase genes

Based on the above results, we developed an algorithm that predicts potential Ll.LtrB intron-insertion sites and then designs primers for modifying the intron’s EBS and δ sequences to insert into those sites. A flow-chart of the algorithm is shown in Figure 6. As a test of the algorithm, we designed group II introns to disrupt a set of 28 E. coli genes encoding DExH/D-box and DNA helicase-related proteins. DExH/D proteins, which include RNA helicases, function in a variety of processes involving cellular RNAs,21,22 and are of particular interest in our laboratory because of their potential involvement in group I and group II intron RNA splicing, both of which are readily assayed using E. coli genetic systems (X. Cui, M. Matsuura, Q. Wang, H. Ma, & A.M.L., unpublished results).23,24

E. coli DExH/D-box protein genes were identified by searching the E. coli proteome using a hidden Markov model based on a “DEAD-like helicases superfamily” alignment (DEXDc; Smart accession number SM0487).25,26 The top 20 hits in this search were deaD, rhlE, srmB, dbpA, rhlB, lhr, mfd, recG, recQ, uvrB, hrpA, hrpB, hepA, yejH, priA, hsdR, ygcB, dinG, yfjK, and yoaA, with scores and E-values ranging from 232, 5.9 × 10^-67 (deaD) to 24, 1.4 × 10^-5 (yoaA). All of these proteins belong to the DEAD, DEAH, or DExH subfamilies of helicase superfamily 2 (SF2) and are therefore classified as DExH/D-box proteins. The next nine hits, which had negative scores but E-values <0.1, potentially reflecting significant similarity, were searched for homology to the conserved DExH/D-box helicase domains by RPS-BLAST using the CDD (v1.62) database†.27 This search found significant alignments for five additional genes: recD, ybeZ, secA, phoH, and helD with scores and E-values ranging from 46.8, 6 × 10^-6 (recD) to 31.7, 0.26 (helD). PhoH and YbeZ are short proteins, which contain helicase core motifs I and II, while RecD and HelD are superfamily I (SF1) UvrD/Rep-like DNA helicases. Since the sequence-based classification does not exclude the possibility that proteins classified as DNA helicase also function on RNA, we decided to include three additional UvrD/Rep-like DNA-helicase genes (recB, rep, and uvrD), as well as recC, whose gene product is part of exodeoxyribonuclease V along with RecB and RecD. Two of the E. coli DExH/D-box protein genes, hsdR and dbpA, are either inactive or not present in our E. coli host strain HMS174(DE3) (genotype hsdR2 and Southern hybridization for dbpA; not shown).

Figure 6. Flow-chart of the computer algorithm. The algorithm is based on a weight matrix model in which a position-specific score is calculated from the observed nucleotide frequencies at 35 DNA target site positions having the highest information content in a training set of trusted target sites. The algorithm assumes that each position in the DNA target site contributes independently to target recognition and that there is a fixed spacing between different target site elements. The algorithm also designs primers for modifying the intron’s EBS1, EBS2, and δ sequences to form Watson–Crick base-pairs with the DNA target site’s IBS1, IBS2, and δ′ sequences, and for modifying the IBS1 and IBS2 sequences in the 5′ exon of the donor plasmid to form Watson–Crick base-pairs with the retargeted EBS1 and EBS2 sequences for efficient RNA splicing.

Because DbpA is a well-characterized protein involved in ribosome assembly,28,29 we disrupted this gene in E. coli BL21(DE3). Thus, the final set of targets included 28 genes. The E. coli genome project web site‡ checked on 9 September, 2003 listed only 11 of these 28 genes as having been disrupted by transposon mutagenesis.

We used the algorithm to design one or two Ll.LtrB group II introns targeted to each of the 28 genes. The algorithm readily identified multiple target sites in each gene, and then designed primers for modifying the intron donor’s IBS1/2, EBS1/δ, and EBS2 sequences to insert into those sites. Sample output for the rhlE gene is shown in Figure 7. To maximize the possibility of disrupting gene function, we selected target sites preferentially in the upstream or central regions of the gene or within the conserved “helicase core”, even if the algorithm did not classify those as the best site. The 0.9 kb intron contains multiple stop codons in all reading frames, resulting in the synthesis of truncated proteins as well as susceptibility to mRNA degradation.30 Thus, most if not all of the insertions should totally ablate gene function. The donor plasmid containing the retargeted intron was transformed into the E. coli (DE3) host strain, which contains an isopropyl-β-D-thiogalactoside (IPTG)-inducible RNA polymerase. After induction of donor plasmid transcription with IPTG, 100 colonies were screened for the desired disruption by PCR, and the correct insertion was confirmed by DNA sequencing of the PCR product.

Figure 7. Output of the algorithm for targeting the E. coli rhlE gene. (a) Predicted Ll.LtrB target sites in the rhlE gene (GenBank accession number NC_000913, region 830095..831459). Target sites are indicated by nucleotide residues bracketing the intron-insertion site, with “s” and “a” denoting the sense-strand and antisense-strand, respectively. Log-odds scores were calculated according to equation (6), and E-values were determined from the extreme value distribution of log-odds scores (the best 1% according to our model) of all 45 bp sequences in the E. coli genome. (b) DNA oligonucleotides designed to modify the intron donor plasmid’s IBS1/2, EBS1/δ, and EBS2 sequences for insertion of the intron into the target sites shown in (a).

† http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

‡ http://www.genome.wisc.edu/functional/tnmutagenesis.htm

Figure 8. Disruption of E. coli DExH/D-box protein and DNA helicase-related genes using computer-designed Ll.LtrB group II introns. Gene names are indicated above, with GenBank annotations (http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NC_00091) and parentheses indicating classification as helicase superfamily 1 or 2 (SF1 or SF2, respectively),39 where possible, followed by the motif II sequence. The numbers above the linear maps indicate nucleotide positions, and the red bars below indicate the helicase core region containing conserved amino acid sequence motifs I–VI plus Q for DEAD-box proteins or motifs I and II for phoH and ybeZ.39,40 The different sizes of the core region reflect variable-length unconserved regions between the conserved helicase motifs. All genes were disrupted in E. coli HMS174(DE3), except for dbpA, which is not present in HMS174(DE3) and was disrupted in E. coli BL21(DE3). Donor plasmids were: pACD3 (dbpA, deaD, helD, hepA, recC, recQ, rhlE, rhlB, srmB, and yejH); pACD3-RAM (dinG, lhr, priA, recD, rep, and uvrB); and pACD4A, pACD4C, pACD4G, or pACD4T with the appropriate δ′ +1 nucleotide residue (hrpA, hrpB, mfd, phoH, recB, recG, uvrD, ybeZ, yfjK, ygcB, and yoaA). Group II intron-insertion sites are indicated by the nucleotide position 5′ to the intron-insertion site, with “s” or “a” indicating the sense or antisense strand, respectively, followed by parentheses with numbers indicating insertion frequency based on screening 100 colonies by PCR, or RAM, indicating disruptants obtained by using a retargeted intron with a retrotransposition-activated selectable marker.


As summarized in Figure 8, for 21 of the 28 genes, we obtained at least one intron that gave the desired disruption at a frequency of 1–80% without selection. For the remaining seven genes (uvrB, priA, recD, lhr, secA, rep, and dinG), where disruptants were not obtained by screening 100 colonies, the retargeted introns were recloned into the vector pACD3-RAM, which contains a retrotransposition-activated trimethoprim-resistance (TpR) marker within the intron.11 In six cases (uvrB, priA, recD, lhr, rep, and dinG), disruptants were then obtained readily by simple selection for TpR colonies (see Materials and Methods).11 The only DExH/D-box protein gene that did not yield viable disruptants was secA, a component of the protein translocation ATPase, which was already known to be essential.31 It is possible that some group II introns with insertion frequencies near 1% were missed by the limited PCR sampling. In addition, the performance of some introns could presumably have been improved by using the newly constructed intron-donor plasmids described above, which increase mobility frequencies by minimizing self-targeting and restoring the δ–δ′ interaction in unspliced precursor RNA.
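The remark about limited PCR sampling can be made quantitative with a simple binomial argument. The sketch below assumes, as a simplification not stated in the text, that each screened colony is independently a disruptant with probability equal to the insertion frequency; the function names are illustrative.

```python
import math


def probability_no_disruptant(frequency: float, colonies: int) -> float:
    """Chance that none of the screened colonies carries the disruption.

    Assumes independent colonies, each a disruptant with probability `frequency`.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be a fraction between 0 and 1")
    return (1.0 - frequency) ** colonies


def colonies_needed(frequency: float, detection_probability: float = 0.95) -> int:
    """Smallest screen size that finds at least one disruptant with the given probability."""
    if not 0.0 < frequency < 1.0:
        raise ValueError("frequency must be strictly between 0 and 1")
    allowed_miss = 1.0 - detection_probability
    return math.ceil(math.log(allowed_miss) / math.log(1.0 - frequency))


if __name__ == "__main__":
    for frequency in (0.01, 0.05, 0.20):
        miss = probability_no_disruptant(frequency, 100)
        print(f"insertion frequency {frequency:.0%}: P(no hit in 100 colonies) = {miss:.3f}")
```

Under this assumption, an intron inserting at 1% would be missed in roughly a third of 100-colony screens, which is consistent with the caution expressed above.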

For further characterization, we obtained a set of disruptants cured of the intron donor plasmid encoding the LtrA protein, thereby eliminating the possibility that the inserted Ll.LtrB-ΔORF introns could splice to a significant extent, even if inserted in the sense orientation (X. Cui, M. Matsuura, Q. Wang, H. Ma & A.M.L., unpublished results).12

To test the specificity of intron insertion, we randomly selected one to four disruptants and carried out Southern hybridizations using a 32P-labeled intron probe (Figure 9). For 22 genes, we found a single intron insertion at the desired site. In five cases, in which only one disruptant was tested, we found two (dbpA, ybeZ, dinG, and priA) or three (hepA) insertions in the initial test; in each case, however, the desired single insertion was identified by analyzing less than five additional transformants.

Plating assays and growth measurements in liquid culture showed that nine of the disruptants (dbpA, priA, recB, recC, recD, rep, srmB, uvrB, and uvrD) have different degrees of cold-sensitivity at 20 °C or 25 °C. recC is encoded in an operon, and we have not excluded the possibility that a polarity effect on downstream genes contributes to its cold-sensitive phenotype. Charollais et al. reported recently that deletion of the srmB gene results in cold-sensitive growth with a deficiency of 50 S ribosomal subunits.32 Further analysis of the complete set of disruptants is in progress.

Discussion

Figure 9. Southern hybridizations showing the specificity of group II intron insertion into E. coli DExH/D-box protein and DNA helicase-related genes. DNA was isolated from disruptants obtained with targeted introns indicated above each lane, digested with restriction enzymes, blotted to a nylon membrane, and hybridized with a 32P-labeled probe specific for the Ll.LtrB intron. Restriction enzymes were: dbpA, Aat II + Fsp I; deaD, Fsp I + Hde I; dinG, Nde I + Mfe I; helD, Aat II + Hpa I; hepA, Nde I; hrpA, Age I; hrpB, Bsp HI; lhr, Aat II; mfd, Hae II; phoH, Nsp I; priA, AluN I; recB, Hae II; recC, Aat II + Fsp I; recD, Ara I; recG, Bsi EI; recQ, Nde I + Hpa I; rep, Nde I; rhlE, Aat II + Fsp I; rhlB, Nde I + Hpa I; srmB, Aat II; uvrB, Hpa I; uvrD, Nde I; ybeZ, Bbs I + Bpm I; yejH, Hpa I + Aat II; yfjK, Hpa I + Mfe I; yoaA, Aat II + Hpa I; ygcB, Fsp I. The blot was dried and scanned with a PhosphorImager.

In the present work, we developed an algorithm that predicts optimal insertion sites for the L. lactis Ll.LtrB group II intron and designs primers for modifying the intron to insert into those sites. In a test of the algorithm, we designed group II introns to disrupt 28 E. coli genes encoding DExH/D-box and DNA helicase-related proteins. For 21 genes, we obtained at least one intron that gave the desired disruptant at a frequency of 1–80% based on PCR screening without selection, and for six genes, where the retargeted introns presumably have lower insertion frequency, disruptants were obtained readily by using the same introns with a retrotransposition-activated selectable marker. Only one DExH/D-box protein gene, secA, which was known to be essential, did not give viable disruptants.31 Southern hybridizations showed that the retargeted introns are highly specific, with 22/27 giving just a single intron insertion at the desired site, and the remainder yielding single insertions after screening less than five additional transformants. This minor inconvenience could presumably be avoided by computationally checking target sites for close matches to other sequences in the E. coli genome. By combining the algorithm with previously developed methods for generating the retargeted introns by PCR without cloning, it should be possible to obtain essentially “overnight” chromosomal gene disruptions in a wide variety of Gram-negative and Gram-positive bacteria in which the Ll.LtrB intron can function efficiently using native promoters (see Introduction). Moreover, because the insertion frequencies are, in most cases, high enough to detect by PCR screening without a selectable marker, group II introns should be particularly useful for obtaining multiple gene disruptions, a feature that will facilitate the analysis of gene families, such as the DExH/D-box protein and DNA helicase families, in which individual members may have related or overlapping functions.
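The computational check suggested above, looking for close matches to a target site elsewhere in the genome, was not part of the work described here. One simple way it might be done is a mismatch-tolerant scan of both strands, sketched below. The function names and the use of a plain Hamming-distance threshold are assumptions for illustration; a real screen would more plausibly weight positions by their contribution to target recognition.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


def near_matches(genome: str, site: str, max_mismatches: int) -> list[tuple[int, str, int]]:
    """Find windows of the genome within max_mismatches of the site.

    Returns (start index on the top strand, strand, mismatch count), with
    strand "s" for the given strand and "a" for the opposite strand.
    """
    hits = []
    length = len(site)
    for strand, query in (("s", site), ("a", reverse_complement(site))):
        for start in range(len(genome) - length + 1):
            window = genome[start:start + length]
            mismatches = sum(a != b for a, b in zip(window, query))
            if mismatches <= max_mismatches:
                hits.append((start, strand, mismatches))
    return hits
```

This brute-force scan costs time proportional to genome length times site length for each site, which is tolerable for a handful of candidate sites in a bacterial genome but would call for an indexed search in larger screens.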

The algorithm developed here is based on a probabilistic model in which 35 positions in the DNA target site are assigned weighted values corresponding to the nucleotide frequencies at that position in a training set of trusted group II intron target sites. The suitability of a new target site is determined as the product of these frequencies, treating each position as independent. The performance of the algorithm is critically dependent on the definition of the training set. Among the currently available databases, we found that a training set consisting of 92 retargeted Ll.LtrB intron-insertion sites with intron-mobility frequencies ≥0.1% gave the best performance of the algorithm. Those databases restricted to introns having higher insertion frequencies perform less well because they have fewer sequences, leading to a greater coefficient of variation.19 This situation will presumably improve as the number of efficient target sites in the database increases. It is possible that more complex models of the group II intron target sequence, which consider dependencies between specific positions, would perform better when larger databases are available.
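A weight matrix of the kind described above can be sketched as follows. This is a generic illustration rather than the authors' implementation: the pseudocount, the uniform background of 0.25 per base, and the function names are assumptions, and the exact scoring equation used in the paper is not reproduced in this section. Summing log ratios is equivalent to taking the logarithm of the product of per-position frequencies relative to background.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequencies(training_sites: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Per-position nucleotide frequencies from aligned, equal-length target sites."""
    length = len(training_sites[0])
    if any(len(site) != length for site in training_sites):
        raise ValueError("all training sites must have the same length")
    total = len(training_sites) + len(BASES) * pseudocount
    frequencies = []
    for i in range(length):
        counts = Counter(site[i] for site in training_sites)
        frequencies.append({base: (counts[base] + pseudocount) / total for base in BASES})
    return frequencies


def log_odds_score(site: str, frequencies: list[dict[str, float]], background: float = 0.25) -> float:
    """Sum of per-position log2 ratios, treating every position as independent."""
    if len(site) != len(frequencies):
        raise ValueError("site length does not match the weight matrix")
    return sum(math.log2(frequencies[i][base] / background) for i, base in enumerate(site))
```

With a small training set, the pseudocount keeps unseen bases from producing a score of minus infinity, which is one practical reason the size of the training set matters so much to performance.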

The information content analysis quantifies the contribution of each position in the DNA target site. In agreement with previous biochemical, mutagenesis, and selection experiments, this analysis showed that most critical nucleotide residues in the regions recognized by the IEP are G −21, A −20, and T −19 in the distal 5′ exon, and T +5 in the 3′ exon.3,11,14,15 T −23, which was identified previously as a critical nucleotide residue in the 5′ exon, makes a significant contribution but has a lower information content than these other positions. The four nucleotide residues in the distal 5′-exon region are recognized in double-stranded DNA via major groove interactions and are critical for initial DNA target site recognition, DNA unwinding, and reverse splicing, while T +5 in the 3′ exon is required only for second-strand cleavage.3,5,15 In addition to these positions, we found relatively high information content for positions −17, −15, and −14, located just upstream of IBS2, a region that could be important for the initiation of local DNA unwinding. We showed previously that the IEP crosses the minor groove in this region and appears to make critical contacts with the 5′ phosphate groups at bottom-strand positions −19 to −15 and top-strand position −13, as judged by ethylnitrosourea-interference experiments.3 In the region recognized by base-pairing of the intron RNA, we found a strong preference for GC or CG base-pairs at the beginning of the EBS2/IBS2 and EBS1/IBS1 duplexes, again in agreement with previous results.11,14
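For readers unfamiliar with the measure, the information content of one aligned position is commonly computed as the maximum entropy of a four-letter alphabet (2 bits) minus the observed Shannon entropy. The sketch below uses that standard definition with a uniform background and no small-sample correction; whether the analysis in this paper applied a background model or a correction is not stated in this section, so both choices are assumptions.

```python
import math
from collections import Counter


def information_content(column: str) -> float:
    """Information content in bits of one aligned column of DNA bases.

    Uses 2 - H, where H is the Shannon entropy of the observed base frequencies.
    """
    if not column:
        raise ValueError("column must contain at least one base")
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in Counter(column).values())
    return 2.0 - entropy


def column_profile(sites: list[str]) -> list[float]:
    """Information content at every position of a set of aligned target sites."""
    return [information_content("".join(column)) for column in zip(*sites)]
```

An invariant position scores 2 bits and a position with all four bases equally represented scores 0, so positions such as G −21 would be expected to lie near the top of this scale.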

The algorithm assumes that each position in the DNA target site contributes independently to target site recognition and that there is a fixed spacing between different target site elements. These assumptions are in accord with previous biochemical experiments, which showed that the critical nucleotide residues in the distal 5′-exon region are recognized independently of sequence context and that spacing changes of a single nucleotide residue between different elements in the DNA target site strongly inhibit reverse splicing and second-strand cleavage.15 Nevertheless, the assumption that each position contributes independently is likely true only to a first approximation. For example, the strong selection against a G residue at position −20 of the DNA target site (Table 1) may reflect interference with major groove recognition of the critical G −21 by the IEP, and it is easy to envision similar interference for other positions recognized by the protein.

The algorithm improves over previous computational approaches by using the most critical positions in the DNA target site, as well as positions that make significant, albeit smaller, contributions to DNA target site recognition. In general, these additional positions could be recognized directly by the RNP, or they could contribute to DNA target site recognition indirectly by leading to a favorable DNA conformation or propensity for local DNA unwinding, which is required for base-pairing of the intron RNA. An important consequence of considering the cumulative contribution of all positions is that the group II intron targeting rules are more flexible than previously thought. For example, we obtained a number of highly efficient introns (i.e. near wild-type insertion frequencies) that recognize target sites with an A residue substituted for T −23 or G −21, even though the wild-type bases are recognized directly by the IEP and the corresponding nucleotide substitutions in the context of the wild-type target site decrease mobility frequency by >50% (J. San Filippo & A.M.L., unpublished results).3 These findings likely reflect that the wild-type Ll.LtrB target site itself does not have an optimal combination of nucleotide residues, and that a single disfavored nucleotide residue can be compensated by the cumulative contribution of optimally recognized nucleotide residues at other positions.

A number of other factors may contribute to group II intron insertion efficiency, but have not been incorporated into the algorithm. These include duplex stability in different regions of the target site (Figure 3), higher-order DNA structure, proximity to the DNA replication origin,11 methylation or other DNA modifications within the target site sequence, and occlusion of the DNA target site by protein binding or chromosome structure. We anticipate that our ability to efficiently target group II introns will increase as we progressively learn how to deal with these factors. We note, however, that the present benchmark of generally ≥1% integration efficiency without selection is already at the high end of other bacterial gene targeting methods, even in E. coli, where such methods are well-developed.33–37

As a first application of the algorithm, we designed group II introns to disrupt a set of E. coli DExH/D-box protein and DNA helicase genes. The DExH/D-box proteins, which include RNA helicases, are involved in diverse cellular functions, particularly those requiring RNA structural rearrangements, such as ribosome assembly, protein synthesis, RNA splicing, and RNA degradation.21,22 We showed recently that the Neurospora crassa DEAD-box protein CYT-19 functions as an RNA chaperone to resolve non-native RNA structures in mitochondrial group I intron splicing,24 and we reasoned that it might be possible to identify DExH/D-box proteins that function similarly or in other ways in group I and group II intron splicing in bacteria, enabling us to employ powerful bacterial genetic approaches for their analysis.

By using targeted group II introns, we obtained viable disruptants of 27/28 E. coli DExH/D-box protein and DNA helicase-related genes, the only exception being secA, a component of the protein translocation ATPase, which was already known to be essential.31 The apparent dispensability of DExH/D-box protein genes in E. coli contrasts sharply with yeast, where 23/34 DExH/D-box proteins have been found to be essential.22 In part, this difference reflects that yeast DExH/D-box proteins have evolved to perform functions that do not exist in bacteria, including nuclear pre-mRNA splicing, nuclear RNA export, and displacement of small nucleolar RNAs (snoRNAs) involved in rRNA modification. In addition, as many as four yeast DExH/D-box proteins function in translation initiation, which arguably could have a more stringent requirement for RNA unwinding in eukaryotes, where after 5′-cap recognition, ribosomes must traverse sometimes lengthy and structured 5′-leader sequences to reach the initiation codon.22,38 On the other hand, both prokaryotes and eukaryotes are likely to have common use for DExH/D-box proteins to unwind RNA or DNA duplexes, to promote specific RNA structural transitions, and to function as RNA chaperones to disrupt non-native structures that constitute kinetic traps in RNA folding.24 The apparent dispensability of bacterial DExH/D-box proteins may reflect that some of these functions are essential only under certain environmental conditions, such as low temperature, and indeed we found that nine of the disruptants are cold-sensitive (see Results). In addition, lack of compartmentalization in bacteria may make it easier for other DExH/D-box proteins to complement missing functions, a possibility that can now be addressed by analyzing different combinations of disruptants obtained using the targeted group II introns constructed here.

We anticipate that the mobile group II introns designed by our algorithm will be useful for many applications involving genetic engineering and functional genomics in bacteria and eventually we hope in eukaryotic systems. We are currently automating these methods with the idea of rapidly obtaining whole-genome knockout libraries in different bacteria (A. Ellington, E. Marcotte & A.M.L., unpublished results).

Materials and Methods

Computational methods

The algorithm for group II intron gene targeting was written in the Perl programming language. Access to the algorithm can be obtained from InGex (St. Louis, MO) at InGex.com. Neural networks were created with the Trajan Neural Network Simulator†. Hidden Markov models were built with the HMMER software package (see above).

Bacterial strains

E. coli HMS174(DE3) F− recA1 hsdR (rK12− mK12+) RifR (DE3) (Novagen, Madison, WI) and BL21(DE3) F− ompT hsdSB (rB− mB−) gal dcm (DE3) (Stratagene, La Jolla, CA) were used for chromosomal gene disruptions and intron mobility assays, and DH5α was used for cloning. Bacteria were grown in LB medium with antibiotics added at the following concentrations: ampicillin (100 μg/ml), trimethoprim (10 μg/ml), chloramphenicol (25 μg/ml), and tetracycline (25 μg/ml).

† http://www.trajan-software.demon.co.uk/

Recombinant plasmids

The intron donor plasmids pACD2 and pACD3 contain a 0.9 kb ΔORF-derivative of the Ll.LtrB intron, cloned downstream of the T7lac promoter in a CamR pACYC184-based vector.9,14 The open reading frame (ORF) encoding the LtrA protein is expressed from a position just downstream of the 3′ exon. In pACD2, the donor intron has an additional phage T7 promoter inserted in intron domain IV for use in plasmid-to-plasmid mobility assays.

The recipient plasmid pBRR3, used for mobility assays, is an AmpR derivative of pBRR1 (previously denoted pBRR-tet)14 with a short segment containing Ll.LtrB target site positions −30 to +15 cloned upstream of a promoterless tetR gene.9

pACD3-RAM is a derivative of pACD3 that contains a small (313 bp) TpR marker gene with its own promoter inserted in intron domain IV in the orientation opposite intron transcription. The TpR marker is disrupted by a self-splicing td group I intron inserted in the forward orientation. During retrotransposition via an RNA intermediate, the td intron is spliced, activating the TpR marker, which is then selected after the intron has integrated into a DNA target site.11

Four new intron donor plasmids, pACD4A, pACD4C,pACD4G, and pACD4T, incorporating improvementsdescribed in this work, were constructed by PCR-basedmutagenesis. These plasmids have a modified distal50-exon sequence (50-CGTCGATCGTGA replaced by 50-ATAATTATCCTT) to minimize self-targeting, and differ-ent nucleotide residues at 30-exon position þ1 to increasethe splicing efficiency of retargeted introns by restoringthe d–d0 interaction in precursor RNA. The distal 50-exon sequence was modified via PCR with the same setof primers used to modify the IBS1/2 and EBS1sequences.9 The PCR product was digested with HindIIIand BsrGI, gel-purified, and cloned between the corre-sponding sites of pACD3. Modifications of the 30 exond0 þ1 position were introduced via two complementaryDNA oligonucleotides:

5′-GTACCTCCCTACTTCACNATATCATTTTCTGCA and
5′-GAAAATGATATN′GTGAAGTAGGGAG

where N is the desired nucleotide residue and N′ is its complement. The oligonucleotides were phosphorylated with phage T4 polynucleotide kinase, annealed, and cloned in place of the wild-type fragment between the Acc65I and PstI sites of pACD3.
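As an illustration only (not part of the original protocol), the four oligonucleotide pairs can be written out from the two sequences given above. The sketch assumes that the residue paired with N in the second oligonucleotide is its Watson-Crick complement; the function and constant names are hypothetical. It also checks that the second oligonucleotide is the reverse complement of the first once the four terminal residues at each end of the first (5′-GTAC and TGCA-3′) are set aside, which is consistent with single-stranded ends for cloning between the Acc65I and PstI sites.

```python
# Hypothetical helper names; sequences are taken from the text above.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

TOP_TEMPLATE = "GTACCTCCCTACTTCAC{n}ATATCATTTTCTGCA"
BOTTOM_TEMPLATE = "GAAAATGATAT{n_comp}GTGAAGTAGGGAG"


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def delta_prime_oligos(n):
    """Return the (top, bottom) oligonucleotide pair for residue n at +1."""
    if n not in COMPLEMENT:
        raise ValueError("n must be one of A, C, G, T")
    top = TOP_TEMPLATE.format(n=n)
    # Assumption: the bottom strand carries the complement of n.
    bottom = BOTTOM_TEMPLATE.format(n_comp=COMPLEMENT[n])
    return top, bottom


def anneals_with_overhangs(top, bottom, overhang=4):
    """Check that bottom pairs with top, leaving `overhang` residues
    of top single-stranded at each end."""
    return top[overhang:-overhang] == reverse_complement(bottom)


if __name__ == "__main__":
    for residue in "ACGT":
        top, bottom = delta_prime_oligos(residue)
        print(residue, top, bottom, anneals_with_overhangs(top, bottom))
```

Each of the four pairs passes the annealing check, as expected from the complementarity of the two sequences.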

Group II intron gene targeting

Group II introns were expressed from intron donor plasmid pACD2 for plasmid assays and from pACD3 or one of the pACD4 plasmids for chromosomal gene targeting. Donor plasmids containing the retargeted introns were generated by PCR, as described.9,11 The initial two-step PCR used two pairs of partially overlapping primers, three of which introduce modifications into the IBS1/2, EBS2, and EBS1/δ sequences, respectively. The gel-purified 361 bp PCR product can then either be cloned into the donor plasmid backbone or used as a primer in a second PCR with the vector backbone to produce a non-covalently closed, circular DNA, corresponding to the donor plasmid.11

Plasmid assays for quantitative determination of mobility frequency were carried out as described by cotransforming the CamR donor plasmid pACD2 and the AmpR recipient plasmid pBRR3 into E. coli HMS174(DE3), which contains an IPTG-inducible phage T7 RNA polymerase.9,14 After induction with 100 µM IPTG for one hour at 37 °C, the intron carrying a T7 promoter near its 3′ end inserts into the target site in the recipient plasmid just upstream of a promoterless tetR gene, thereby activating that gene. Colonies in which the intron has integrated into the recipient plasmid were selected on LB plates containing ampicillin and tetracycline. Mobility frequencies are defined as the ratio of (TetR + AmpR)/AmpR colonies.14
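The ratio defined above can be sketched as a short calculation. This is a minimal illustration, assuming that colonies are counted on plates made from different dilutions of the same culture and that equal volumes are plated; the function names, the dilution-factor parameters, and the example counts are hypothetical and are not taken from the paper.

```python
def colony_forming_units(colonies, fold_dilution):
    """Scale a plate count by the fold dilution of the plated sample."""
    if colonies < 0 or fold_dilution <= 0:
        raise ValueError("counts must be >= 0 and dilutions > 0")
    return colonies * fold_dilution


def mobility_frequency(tet_amp_colonies, tet_amp_dilution,
                       amp_colonies, amp_dilution):
    """Ratio of (TetR + AmpR) to AmpR colonies, corrected for dilution."""
    tet_amp = colony_forming_units(tet_amp_colonies, tet_amp_dilution)
    amp = colony_forming_units(amp_colonies, amp_dilution)
    if amp == 0:
        raise ValueError("no AmpR colonies: frequency is undefined")
    return tet_amp / amp


if __name__ == "__main__":
    # Made-up counts: 42 colonies on ampicillin + tetracycline at a
    # 100-fold dilution, 210 colonies on ampicillin at a 100,000-fold dilution.
    print(mobility_frequency(42, 100, 210, 100_000))
```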

For chromosomal gene disruptions, the pACD3 donor plasmid containing the retargeted intron was transformed into E. coli HMS174(DE3) or BL21(DE3) (dbpA targeting), induced with 100 µM IPTG for one hour at 37 °C, and grown overnight on LB plates containing chloramphenicol. Chromosomal integration frequencies were determined by PCR of 100 colonies, using primers flanking the intron-insertion site. For gene disruptions with a RAM-targetron, cells were transformed with a pACD3-RAM-based donor plasmid and induced with 500 µM IPTG for two hours at 30 °C. Trimethoprim-resistant colonies were then selected either on plates or by growth in liquid culture, as described.11 In all cases, correct integration of the group II intron was confirmed by PCR, using primers flanking the intron-insertion site, followed by sequencing of the PCR product. After curing the CamR donor plasmid by displacement with the AmpR plasmid pACYC177, which has the same replication origin, the specificity of intron integration was assessed by Southern hybridization of randomly selected disruptants for each gene.
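The following sketch illustrates how a screen of 100 colonies relates to the underlying integration frequency. It is not part of the original methods and assumes that colonies are independent and that each carries the insertion with the same probability; the function names are hypothetical. Under that assumption, the chance of finding at least one disruptant among n colonies at a true frequency f is 1 − (1 − f)^n, so a targetron inserting at 1% would give at least one positive colony in roughly 63% of 100-colony screens.

```python
def integration_frequency(positive_colonies, screened_colonies=100):
    """Fraction of PCR-screened colonies that carry the intron insertion."""
    if screened_colonies <= 0:
        raise ValueError("screened_colonies must be positive")
    if not 0 <= positive_colonies <= screened_colonies:
        raise ValueError("positive_colonies must lie in [0, screened_colonies]")
    return positive_colonies / screened_colonies


def detection_probability(true_frequency, screened_colonies=100):
    """Probability of at least one disruptant among the screened colonies,
    assuming independent colonies with identical insertion probability."""
    if not 0.0 <= true_frequency <= 1.0:
        raise ValueError("true_frequency must lie in [0, 1]")
    return 1.0 - (1.0 - true_frequency) ** screened_colonies


if __name__ == "__main__":
    print(integration_frequency(21))              # 21 positives out of 100
    print(round(detection_probability(0.01), 3))  # true frequency of 1%
```

This simple model is one way to see why disruptants occurring at low frequency may be missed in an unselected PCR screen and recovered only with a selectable marker.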

DNA analysis

E. coli genomic DNA was prepared by using a Genomic DNA Isolation Kit (Qiagen, Valencia, CA). For Southern hybridization, DNAs were digested with restriction enzymes, run in a 0.7% agarose gel, blotted to a nylon transfer membrane (Magna, 0.45 µm; Osmonics Inc., Minnetonka, MN), and hybridized with a 32P-labeled probe corresponding to intron positions 294–935. The probe was generated by PCR of pACD2 using primers 5′-GTTATCACCACATTTGTACAA and 5′-GTAGGGAGGTACCGCCTTGTTC, followed by labeling with [α-32P]dCTP (3000 Ci/mmol; NEN Life Science Products, Boston, MA), using a High Prime DNA labeling kit (Roche, Indianapolis, IN). The blots were scanned with a PhosphorImager 445SI (Molecular Dynamics, Sunnyvale, CA).

Acknowledgements

We thank Edward Marcotte for comments on the manuscript, and Georg & Sabine Mohr for help with protein alignments. This work was supported by NIH grants GM37949 and GM37951.

References

1. Lambowitz, A. M., Caprara, M. G., Zimmerly, S. & Perlman, P. S. (1999). Group I and group II ribozymes as RNPs: clues to the past and guides to the future. In The RNA World (Gesteland, R. F., Cech, T. R. & Atkins, J. F., eds), 2nd edit., pp. 451–485, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.

2. Belfort, M., Derbyshire, V., Parker, M. M., Cousineau, B. & Lambowitz, A. M. (2002). Mobile introns: pathways and proteins. In Mobile DNA II (Craig, N. L., Craigie, R., Gellert, M. & Lambowitz, A. M., eds), pp. 761–783, ASM Press, Washington, DC.

3. Singh, N. N. & Lambowitz, A. M. (2001). Interaction of a group II intron ribonucleoprotein endonuclease with its DNA target site investigated by DNA footprinting and modification interference. J. Mol. Biol. 309, 361–386.

4. Aizawa, Y., Xiang, Q., Lambowitz, A. M. & Pyle, A. M. (2003). The pathway for DNA recognition and RNA integration by a group II intron retrotransposon. Mol. Cell, 11, 795–805.

5. Zhong, J. & Lambowitz, A. M. (2003). Group II intron mobility using nascent strands at DNA replication forks to prime reverse transcription. EMBO J. 22, 4555–4565.

6. Guo, H., Zimmerly, S., Perlman, P. S. & Lambowitz, A. M. (1997). Group II intron endonucleases use both RNA and protein subunits for recognition of specific sequences in double-stranded DNA. EMBO J. 16, 6835–6848.

7. Yang, J., Mohr, G., Perlman, P. S. & Lambowitz, A. M. (1998). Group II intron mobility in yeast mitochondria: target DNA-primed reverse transcription activity of aI1 and reverse splicing into DNA transposition sites in vitro. J. Mol. Biol. 282, 505–523.

8. Jiménez-Zurdo, J. I., García-Rodríguez, F. M., Barrientos-Durán, A. & Toro, N. (2003). DNA target site requirements for homing in vivo of a bacterial group II intron encoding a protein lacking the DNA endonuclease domain. J. Mol. Biol. 326, 413–423.

9. Karberg, M., Guo, H., Zhong, J., Coon, R., Perutka, J. & Lambowitz, A. M. (2001). Group II introns as controllable gene targeting vectors for genetic manipulation of bacteria. Nature Biotechnol. 19, 1162–1167.

10. Frazier, C. L., San Filippo, J., Lambowitz, A. M. & Mills, D. A. (2003). Genetic manipulation of Lactococcus lactis by using targeted group II introns: generation of stable insertions without selection. Appl. Environ. Microbiol. 69, 1121–1128.

11. Zhong, J., Karberg, M. & Lambowitz, A. M. (2003). Targeted and random bacterial gene disruption using a group II intron (targetron) vector containing a retrotransposition-activated selectable marker. Nucl. Acids Res. 31, 1656–1664.

12. Matsuura, M., Saldanha, R., Ma, H., Wank, H., Yang, J., Mohr, G. et al. (1997). A bacterial group II intron encoding reverse transcriptase, maturase, and DNA endonuclease activities: biochemical demonstration of maturase activity and insertion of new genetic information within the intron. Genes Dev. 11, 2910–2924.

13. Cousineau, B., Smith, D., Lawrence-Cavanagh, S., Mueller, J. E., Yang, J., Mills, D. et al. (1998). Retrohoming of a bacterial group II intron: mobility via complete reverse splicing, independent of homologous DNA recombination. Cell, 94, 451–462.

14. Guo, H., Karberg, M., Long, M., Jones, J. P., III, Sullenger, B. & Lambowitz, A. M. (2000). Group II introns designed to insert into therapeutically relevant DNA target sites in human cells. Science, 289, 452–457.

15. Mohr, G., Smith, D., Belfort, M. & Lambowitz, A. M. (2000). Rules for DNA target site recognition by a lactococcal group II intron enable retargeting of the intron to specific DNA sequences. Genes Dev. 14, 559–573.

16. Breslauer, K. J., Frank, R., Blocker, H. & Marky, L. A. (1986). Predicting DNA duplex stability from the base sequence. Proc. Natl Acad. Sci. USA, 83, 3746–3750.

17. Stormo, G. D., Schneider, T. D., Gold, L. & Ehrenfeucht, A. (1982). Use of the “Perceptron” algorithm to distinguish translational initiation sites in E. coli. Nucl. Acids Res. 10, 2997–3011.

18. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

19. Burge, C. B. (1998). Modeling dependencies in pre-mRNA splicing signals. In Computational Methods in Molecular Biology (Salzberg, S. L., Searls, D. B. & Kasif, S., eds), pp. 129–164, Elsevier, New York.

20. Zupan, J. & Gasteiger, J. (1999). Neural Networks in Chemistry and Drug Design: An Introduction, 2nd edit., Wiley, New York.

21. Tanner, N. K. & Linder, P. (2001). DExD/H box RNA helicases: from generic motors to specific dissociation functions. Mol. Cell, 8, 251–262.

22. de la Cruz, J. & Kressler, D. (1999). Unwinding RNA in Saccharomyces cerevisiae: DEAD-box proteins and related families. Trends Biochem. Sci. 24, 192–198.

23. Mohr, G., Zhang, A. X., Gianelos, J. A., Belfort, M. & Lambowitz, A. M. (1992). The Neurospora CYT-18 protein suppresses defects in the phage T4 td intron by stabilizing the catalytically active structure of the intron core. Cell, 69, 483–494.

24. Mohr, S., Stryker, J. M. & Lambowitz, A. M. (2002). A DEAD-box protein functions as an ATP-dependent RNA chaperone in group I intron splicing. Cell, 109, 769–779.

25. Schultz, J., Milpetz, F., Bork, P. & Ponting, C. P. (1998). SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl Acad. Sci. USA, 95, 5857–5864.

26. Letunic, I., Goodstadt, L., Dickens, N. J., Doerks, T., Schultz, J., Mott, R. et al. (2002). Recent improvements to the SMART domain-based sequence annotation resource. Nucl. Acids Res. 30, 242–244.

27. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389–3402.

28. Fuller-Pace, F. V., Nicol, S. M., Reid, A. D. & Lane, D. P. (1993). DbpA: a DEAD box protein specifically activated by 23 S rRNA. EMBO J. 12, 3619–3626.

29. Diges, C. M. & Uhlenbeck, O. C. (2001). Escherichia coli DbpA is an RNA helicase that requires hairpin 92 of 23 S rRNA. EMBO J. 20, 5503–5512.

30. Nilsson, G., Belasco, J. G., Cohen, S. N. & von Gabain, A. (1987). Effect of premature termination of translation on mRNA stability depends on the site of ribosome release. Proc. Natl Acad. Sci. USA, 84, 4890–4894.

31. Lill, R., Cunningham, K., Brundage, L., Ito, K., Oliver, D. & Wickner, W. (1989). The SecA protein hydrolyzes ATP and is an essential component of the protein translocation ATPase of Escherichia coli. EMBO J. 8, 961–966.

32. Charollais, J., Pflieger, D., Vinh, J., Dreyfus, M. & Iost, I. (2003). The DEAD-box RNA helicase SrmB is involved in the assembly of 50 S ribosomal subunits in Escherichia coli. Mol. Microbiol. 48, 1253–1265.

33. Murphy, K. C. (1998). Use of bacteriophage lambda recombination functions to promote gene replacement in Escherichia coli. J. Bacteriol. 180, 2063–2071.

34. Zhang, Y. M., Buchholz, F., Muyrers, J. P. P. & Stewart, A. F. (1998). A new logic for DNA engineering using recombination in Escherichia coli. Nature Genet. 20, 123–128.

35. Datsenko, K. A. & Wanner, B. L. (2000). One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl Acad. Sci. USA, 97, 6640–6645.

36. Yu, D., Ellis, H. M., Lee, E. C., Jenkins, N. A., Copeland, N. G. & Court, D. L. (2000). An efficient recombination system for chromosome engineering in Escherichia coli. Proc. Natl Acad. Sci. USA, 97, 5978–5983.

37. Ellis, H. M., Yu, D. G., DiTizio, T. & Court, D. L. (2001). High efficiency mutagenesis, repair, and engineering of chromosomal DNA using single-stranded oligonucleotides. Proc. Natl Acad. Sci. USA, 98, 6742–6746.

38. Linder, P. (2003). Yeast RNA helicases of the DEAD-box family involved in translation initiation. Biol. Cell, 95, 157–167.

39. Gorbalenya, A. E. & Koonin, E. V. (1993). Helicases: amino acid sequence comparisons and structure–function relationships. Curr. Opin. Struct. Biol. 3, 419–429.

40. Tanner, N. K., Cordin, O., Banroques, J., Doere, M. & Linder, P. (2003). The Q motif: a newly identified motif in DEAD box helicases may regulate ATP binding and hydrolysis. Mol. Cell, 11, 127–138.

Edited by D. E. Draper

(Received 17 September 2003; received in revised form 20 November 2003; accepted 2 December 2003)

