A shortcut to interesting human genes: peptide sequence tags, expressed-sequence tags and computers



COMPUTER CORNER TIBS 21 - DECEMBER 1996


We have developed a method that can dramatically reduce the time and effort involved in cloning a human gene, given a minimal amount of the corresponding protein isolated on a polyacrylamide gel. The method consists of rapidly and sensitively determining some amino acid sequence information by mass spectrometry and using this information to retrieve an expressed-sequence tag (EST) from public databases with special software algorithms. Obtaining the full sequence from the partial clone containing the EST sequence is then usually straightforward. More than 15 novel proteins have already been identified by this method, most prominently FLICE, the 'missing link' in the signal cascade of receptor-mediated apoptosis [1].

Expressed-sequence tags (ESTs)

ESTs are short nucleotide sequences resulting from one-pass sequencing of complementary DNA (that is, one run of the reverse-transcribed message on an automated sequencing machine). One-pass sequencing lends itself to high throughput, but its downsides are high error rates and short sequence length. Nevertheless, an EST sequence of 300 base pairs, even with errors, still contains much information about the gene product. The original impetus for starting large-scale EST work was the realization by Craig Venter and others that many of the benefits of full genomic sequencing could be obtained with less effort by massive one-pass sequencing of cDNA libraries (and that cDNA sequencing would be necessary in any case to define the gene products) [2]. The strategic value of ESTs is now seen in producing drug targets rather than in wholesale patenting of human genes (which so far has not been allowed by the courts). To prevent private companies from unduly exploiting this pivotal information through exclusive collaborations, Merck & Co., Inc. has sponsored a public EST effort [3], carried out at Washington University, St Louis, USA and by the I.M.A.G.E. consortium [4], the fruits of which are hundreds of thousands of short DNA sequences that are publicly available over the internet in dbEST [5]. Cluster analysis of these data suggests that they already cover more than half of the human genes (C. Auffray, CNRS, pers. commun.). Estimates for the more extensive databases of private companies are in excess of 80%.

Several years ago, in the course of our work on the database identification of proteins by mass spectrometric information, it occurred to us [6] (and later to others [7]) that ESTs might also provide a shortcut to the cloning of human genes.

Searching ESTs by peptide sequence tags

Proteins isolated from polyacrylamide gels can now be sequenced by nanoelectrospray mass spectrometry at the low nanogram level [8,9]. This represents an increase in sensitivity of up to 100-fold over conventional methods for protein sequencing. However, while this facilitates the determination of protein sequence information, cloning of the corresponding genes is still hampered by the degeneracy of the genetic code, which requires the use of degenerate oligonucleotide probes to screen libraries, and is also limited by the completeness of the cDNA library being screened.

Therefore, we have developed software algorithms to directly match the information generated when fragmenting peptide ions in the mass spectrometer against the EST databases. A series of peaks in such a fragmentation spectrum, separated by amino acid molecular weights, defines not only a partial sequence [8], but also the position of that partial sequence in the peptide. Peptide sequence tags (so named because they were to be used to find ESTs) then consist of a short stretch of sequence (typically two to three amino acids) combined with the masses (in daltons) from that stretch to the amino and carboxyl termini of the peptide [10]. Together with the sequence specificity of the protease used to cleave the protein (often trypsin), a peptide sequence tag has about 10^6 times higher search specificity than the short sequence alone. The technique is inherently error-tolerant, as the peptide is divided into three parts, not all of which have to match the database entry to produce a hit. The search engine of PeptideSearch, a program that correlates mass spectrometric information with sequence databases [11] (http://www.mann.embl-heidelberg.de/Services/PeptideSearch/), incorporates a recently developed, flexible string-searching algorithm [12,13], which is also used in the popular UNIX tool 'agrep'. The short search sequence is reverse-translated into a degenerate nucleotide pattern, which allows direct searching at the nucleotide level, without translation into the six possible reading frames (ESTs are usually too short to determine the reading frame).

© 1996, Elsevier Science Ltd
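The three-part structure of a peptide sequence tag can be illustrated with a minimal Python sketch. This is not the PeptideSearch implementation: the function names, the example peptide `LGEVSFDAK`, the 0.5 Da tolerance and the `min_regions` parameter are all assumptions made for illustration, and the two flanking masses are simplified to plain sums of residue masses rather than the fragment-ion masses a real spectrum would give.

```python
# Monoisotopic amino acid residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_mass_sum(seq):
    """Sum of residue masses for a stretch of amino acids."""
    return sum(RESIDUE_MASS[aa] for aa in seq)


def match_tag(peptide, n_mass, tag, c_mass, tol=0.5, min_regions=3):
    """Compare a tag (n_mass, tag, c_mass) with every position in a peptide.

    Region 1 is the mass N-terminal to the tag, region 2 the tag sequence,
    region 3 the mass C-terminal to the tag. A position is reported when at
    least `min_regions` of the three regions agree (hypothetical parameter).
    """
    hits = []
    for start in range(len(peptide) - len(tag) + 1):
        end = start + len(tag)
        regions = (
            abs(residue_mass_sum(peptide[:start]) - n_mass) <= tol,
            peptide[start:end] == tag,
            abs(residue_mass_sum(peptide[end:]) - c_mass) <= tol,
        )
        if sum(regions) >= min_regions:
            hits.append((start, regions))
    return hits


# Hypothetical tryptic peptide containing the tripeptide VSF.
n_mass = residue_mass_sum("LGE")
c_mass = residue_mass_sum("DAK")
exact = match_tag("LGEVSFDAK", n_mass, "VSF", c_mass)

# A database entry with an error in region 3 (A read as G) is still found
# when only two of the three regions are required to match.
tolerant = match_tag("LGEVSFDGK", n_mass, "VSF", c_mass, min_regions=2)
```

Requiring only two of the three regions is one simple way to express the idea that not every part of the tag has to agree with the database entry; the actual scoring used by the search engine may differ.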

All three types of sequence error (base miscalls or undetermined bases, insertions and deletions) can be tolerated in a search. This is especially important for searching EST databases, because their error rate of up to a few percent often leads to a wrong prediction of the corresponding peptide molecular weights. Searches can be performed with mass spectrometric data or with short amino acid sequence data from any source.
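The following Python sketch shows how a short amino acid sequence might be reverse-translated into a degenerate nucleotide pattern and searched on both strands of an EST while tolerating miscalls, insertions and deletions. It is only an illustration under stated assumptions: it uses a classic dynamic-programming approximate-matching scheme instead of the bit-parallel algorithms of refs 12 and 13, it merges the codons of each amino acid position by position (which is slightly more permissive than the true codon set for Leu, Ser and Arg), it treats an undetermined base `N` as matching anything, and the example EST string and all function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def degenerate_pattern(peptide):
    """Reverse-translate a peptide into a list of allowed-base sets."""
    pattern = []
    for aa in peptide:
        codons = [c for c, a in zip(CODONS, AMINO) if a == aa]
        for pos in range(3):
            pattern.append(frozenset(c[pos] for c in codons))
    return pattern


def approx_find(pattern, text, max_errors):
    """Return 1-based end positions where pattern matches text with at most
    max_errors substitutions, insertions or deletions."""
    m = len(pattern)
    prev = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, base in enumerate(text, start=1):
        cur = [0]  # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if (base in pattern[i - 1] or base == "N") else 1
            cur.append(min(prev[i - 1] + cost,  # match or miscall
                           prev[i] + 1,         # extra base in the EST
                           cur[i - 1] + 1))     # base missing from the EST
        if cur[m] <= max_errors:
            ends.append(j)
        prev = cur
    return ends


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def search_both_strands(peptide, est, max_errors=0):
    """Search the EST as given and as its reverse complement."""
    pattern = degenerate_pattern(peptide)
    return {
        "forward": approx_find(pattern, est, max_errors),
        "reverse": approx_find(pattern, reverse_complement(est), max_errors),
    }


# Hypothetical EST fragment encoding V-S-F (GTG TCA TTC) with flanking bases.
est = "AAGTGTCATTCCC"
clean = search_both_strands("VSF", est)

# The same fragment with one base deleted is only found if one error is allowed.
damaged = "AAGTGTCTTCCC"
strict = approx_find(degenerate_pattern("VSF"), damaged, 0)
relaxed = approx_find(degenerate_pattern("VSF"), damaged, 1)
```

Because the pattern is matched directly against the nucleotide string, no reading frame has to be chosen in advance; the reverse-strand search covers ESTs that were sequenced from the opposite end of the clone.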

Owing to the error-tolerant nature of the searching algorithm, incomplete peptide sequence data obtained by Edman degradation can also be matched. For example, even a short and highly degenerate amino acid sequence pattern such as (A/B)(G/K/N)(D/R/W)(F/E)(F/R)(L/P)(R/K) matches no sequence in the EST database. Figure 1 shows an example of an EST search with a peptide sequence tag.
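A degenerate amino acid pattern written in this slash notation can be turned into a regular expression, as in the minimal Python sketch below. It is an assumption-laden illustration, not the authors' software: it searches an already translated protein string rather than the nucleotide database, it expands the ambiguity codes B and Z to their standard meanings, and the function name and the example sequence `MSNKWEQ` are hypothetical. Only the first four positions of the pattern quoted above are used.

```python
import re

# Standard ambiguity codes: B = Asp or Asn, Z = Glu or Gln.
AMBIGUOUS = {"B": "DN", "Z": "EQ"}


def pattern_to_regex(pattern):
    """Convert a pattern such as '(A/B)(G/K/N)' into a compiled regex."""
    groups = re.findall(r"\(([A-Z](?:/[A-Z])*)\)", pattern)
    classes = []
    for group in groups:
        letters = "".join(AMBIGUOUS.get(aa, aa) for aa in group.split("/"))
        classes.append("[" + "".join(sorted(set(letters))) + "]")
    return re.compile("".join(classes))


regex = pattern_to_regex("(A/B)(G/K/N)(D/R/W)(F/E)")
hit = regex.search("MSNKWEQ")  # hypothetical translated sequence
```

Each position becomes one character class, so the number of concrete peptides covered by a pattern is simply the product of the class sizes.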

After one or more clones have been identified in the EST library, assembly software can be used to get a number of clones that overlap the originally found sequence. At the EMBL, in a collaboration with Wilhelm Ansorge's group, we are setting up a routine system for long-reading-length double-strand sequencing of the clones obtained from the I.M.A.G.E. consortium. The full-insert-length sequence is often a few hundred amino acids long, which means that several more partially sequenced peptides are usually found in the readout. The sequence can then be used for low-stringency homology searching, for defining peptide sequences for immunization and, of course, as a probe for obtaining the full-length sequence.

PII: S0968-0004(96)30042-X

Application of the strategy

The strategy described here has been used to rapidly clone FLICE, the elusive signalling molecule between the apoptosis receptor Apo1/Fas and the ICE protease family [1]. Upon activation of the receptor by ligand binding, several intracellular factors associate with it and could be visualized on a two-dimensional gel. Nanoelectrospray sequencing followed by database searching quickly revealed a matching EST. The information about this EST (obtained at http://www.ncbi.nlm.nih.gov/) listed a significant homology to the ICE proteases. The full-length clone was rapidly obtained by the collaborating groups and the whole project was completed in a matter of weeks.

If the protein is small or the (assembled) EST sequence is long, a complete clone might be represented in the cDNA library. This was the case for a novel protein, p20-CGGBP, which specifically binds to the (CGG) nucleotide repeat associated with fragile X syndrome [14]. Three peptides were sequenced from the material purified by DNA affinity chromatography, one of which matched an EST in the database (H. Deissler et al., unpublished). Full-length, double-stranded sequencing revealed the complete sequence of the protein and corrected the EST sequence. After the correction, all three peptides fitted the sequence. It is notable that no cloning steps were necessary in the isolation of the cDNA sequence of this novel protein.

We believe that this combination of mass spectrometry, searching algorithms and EST databases could be of general use in the characterization of human proteins. We thank Merck & Co., Inc. and the I.M.A.G.E. consortium for making EST sequences publicly available and we would like to encourage similar efforts at obtaining longer-read-length (ideally full-length) cDNA sequences and clones.

[Figure 1: tandem mass spectrum with the derived peptide sequence tag (upper part) and a matching dbEST entry for an I.M.A.G.E. Consortium/LLNL cDNA clone, annotated with similarity to human RNA-binding protein FUS/TLS and to M. musculus EWS mRNA, gb:X79233 (lower part).]

Figure 1 Peptide sequence tag searching and matching in expressed-sequence tag (EST) databases. The upper part of the figure shows the tandem mass spectrum and the tripeptide sequence VSF derived from the mass differences in the nested set of peptide fragments. This information is converted into the peptide sequence tag shown and the partial sequence is reverse-translated into the degenerate oligonucleotide code. Searching this tag in the EST database produced three clones containing the same peptide sequence (two by a forward search and one by a reverse search). The lower part shows one of the corresponding database entries in dbEST. The matching sequence - which also defines the reading frame - is underlined. (Data obtained by G. Neubauer.)

References

1 Muzio, M. et al. (1996) Cell 85, 817-827

2 Adams, M. D. et al. (1991) Science 252, 1651-1656

3 Williamson, A. R. and Elliston, K. O. (1994) Nature 372, 10

4 Lennon, G. G., Auffray, C., Polymeropoulos, M. and Soares, M. B. (1996) Genomics 33, 151-152

5 Boguski, M. S., Lowe, T. M. J. and Tolstoshev, C. M. (1993) Nat. Genet. 4, 332-333

6 Mann, M., Hojrup, P. and Roepstorff, P. (1993) Biol. Mass Spec. 22, 338-345

7 James, P., Quadroni, M., Carafoli, E. and Gonnet, G. (1994) Protein Sci. 3, 1347-1350

8 Mann, M. and Wilm, M. (1995) Trends Biochem. Sci. 20, 219-223

9 Wilm, M. et al. (1996) Nature 379, 466-469

10 Mann, M. and Wilm, M. S. (1994) Anal. Chem. 66, 4390-4399

11 Mann, M. (1994) in Microcharacterization of Proteins (Kellner, R., Lottspeich, F. and Meyer, H. E., eds), pp. 223-245, VCH

12 Baeza-Yates, R. A. and Gonnet, G. H. (1992) Commun. ACM 35, 74-82

13 Wu, S. and Manber, U. (1992) Commun. ACM 35, 83-91

14 Deissler, H., Behn-Krappa, A. and Doerfler, W. (1996) J. Biol. Chem. 271, 4327-4334

MATTHIAS MANN

Group Leader Proteins & Peptides, EMBL, Heidelberg, Germany.
