

I like fish. My favorite is Zebrafish. It’s called like that because, from a fish point of view, it looks like a Zebra. But still, it’s a fish, so it’s a Zebrafish. Of course, they have fins and eyes so that they can see and quickly hide from the starving ugly big fish.

It’s so nice to look at them. At the beginning, it’s only an egg, and then it becomes a fish! With fins, mouth and eyes! I heard that it’s all done by the genes.

For example, dad told me that there’s a gene called six 3 that has to do with the eyes. He didn’t say much. So I thought that I could get more information about six 3 on the Internet. That’s when problems started.

I typed six 3 in the little box and I started to read the articles. Many were not about my gene. Then, when it was about six 3, it wasn’t about Zebrafish (I don’t care about Chicken or Elephant!). So, I went to see dad. Dad said that it’s because the Internet is about too many different things. He said I have to be more precise. Ah! I thought I just have to ask. Also, he said I forgot to put Sine oculis blabla 3 something because I should also look for the synonyms. From that moment I decided to go to see mum. Dad wasn’t funny any more.

Mum said that I shouldn’t listen to dad. That wasn’t the first time she said that. She said that I should forget all about these strange names and just use the UniProt ID (whatever it is). She just said it’s O73708 for six3 in Zebrafish and that’s enough to find all the publications and that I don’t have to worry about the synonyms. Mum is fun. The UniProt Index too.

1) Why is it so difficult to find publications related to a protein?

Fact: Protein names are highly ambiguous.
Numbers: More than 600 protein names from Swiss-Prot are also English words, such as ’Had’, ’Great’ and ’This’. In addition, around 6 000 names from Swiss-Prot are abbreviations with several potential expansions. For instance, ADM abbreviates the gene name ’adrenomedullin’ as well as the drug name ’adriamycin’.
Consequence: Search engine results can be unrelated to the protein of interest.

~~~
Fact: Protein names are not species-specific.
Numbers: Around 90 000 protein names from UniProt are shared across several species.
Consequence: When a protein name is mentioned in a text, it is not obvious which species is meant.

~~~
Fact: Proteins have several names.
Numbers: Around 84% of UniProt entries reference more than one name per protein. Half of Swiss-Prot proteins have at least three names.
Consequence: Search engine results are incomplete.

Indexing with UniProt

What: Acronym disambiguation
How: Acronyms can be resolved through their long forms. Either the long form of the abbreviation is contained in the document, or the context of the document allows the long form to be inferred. Once the acronym is resolved, its long form can be classified as a protein name or not.
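As a minimal sketch of the first case only (the long form is contained in the document), the lookup could work as follows. The sense inventory `SENSES` and the function `resolve_abbreviation` are hypothetical names introduced for illustration; only the ADM example comes from this poster, and the context-based case is not covered.

```python
import re

# Hypothetical sense inventory: abbreviation -> {long form: is it a protein/gene name?}
# Only the ADM example is taken from the poster.
SENSES = {
    "ADM": {"adrenomedullin": True, "adriamycin": False},
}


def resolve_abbreviation(abbreviation, text, senses=SENSES):
    """Return (long_form, is_protein) for the first known long form found in text.

    Returns None when no known long form occurs in the document; resolving
    such cases from context is outside the scope of this sketch.
    """
    lowered = text.lower()
    for long_form, is_protein in senses.get(abbreviation, {}).items():
        # Whole-word match so that a long form is not found inside another word.
        if re.search(r"\b" + re.escape(long_form) + r"\b", lowered):
            return long_form, is_protein
    return None


if __name__ == "__main__":
    print(resolve_abbreviation("ADM", "Adrenomedullin (ADM) is a peptide."))
    print(resolve_abbreviation("ADM", "Patients received adriamycin (ADM)."))
```

If several long forms occur in the same document, this sketch simply returns the first one in the inventory; a real system would need a better tie-breaking rule.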

~~~
What: Solving the species
How: Publications mentioning protein names often contain information about the studied organism. It can be the name of the organism itself, of an ancestor, or even of a descendant. Using the NCBI taxonomy, the most probable species is selected given the organisms cited in the document.
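One way to picture this selection step is a simple vote: each organism cited in the document supports every candidate species that is the same taxon, one of its ancestors, or one of its descendants. The sketch below assumes a tiny, heavily abbreviated stand-in for the NCBI taxonomy (`PARENT`) and a hypothetical scoring rule; the poster does not specify the actual scoring used.

```python
# Toy, heavily abbreviated taxonomy (child -> parent). Intermediate ranks are
# omitted; a real system would load the NCBI taxonomy instead.
PARENT = {
    "Danio rerio": "Danio",
    "Danio": "Actinopterygii",
    "Actinopterygii": "Vertebrata",
    "Mus musculus": "Mus",
    "Mus": "Mammalia",
    "Homo sapiens": "Homo",
    "Homo": "Mammalia",
    "Mammalia": "Vertebrata",
}


def ancestors(taxon):
    """Return the chain of ancestors of a taxon, nearest first."""
    chain = []
    while taxon in PARENT:
        taxon = PARENT[taxon]
        chain.append(taxon)
    return chain


def related(a, b):
    """True if the taxa are identical or one is an ancestor of the other."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def most_probable_species(candidates, mentioned):
    """Pick the candidate supported by the most cited organisms (assumed rule).

    Returns None when no cited organism supports any candidate.
    """
    best, best_score = None, 0
    for species in candidates:
        score = sum(1 for organism in mentioned if related(species, organism))
        if score > best_score:
            best, best_score = species, score
    return best


if __name__ == "__main__":
    candidates = ["Danio rerio", "Mus musculus", "Homo sapiens"]
    print(most_probable_species(candidates, ["Danio", "Vertebrata"]))
```

Ties are broken by candidate order here, which is an arbitrary choice made to keep the sketch short.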

~~~
What: Including synonyms in the search
How: Swiss-Prot is the most comprehensive and accurate source of names and synonyms for proteins. All the protein names, once disambiguated, are indexed under their names as well as under the unique form that represents the protein in the correct organism: the UniProt PANs.
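The double indexing can be sketched as a small inverted index in which each resolved mention is filed both under the name as written and under the PAN of the protein in the resolved species. The names and the PAN O08663 come from the example query on this poster; the lookup table `PAN_LOOKUP`, the function `build_index` and the document identifiers are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (name, species) -> PAN table. The synonyms and the PAN are taken
# from the poster's example query; the table layout itself is an assumption.
PAN_LOOKUP = {
    ("methionine aminopeptidase 2", "Mus musculus"): "O08663",
    ("peptidase m2", "Mus musculus"): "O08663",
    ("map2", "Mus musculus"): "O08663",
    ("metap2", "Mus musculus"): "O08663",
}


def build_index(mentions):
    """Index documents under each protein name and under the matching PAN.

    mentions: iterable of (doc_id, name, species) tuples, assumed to be the
    output of the disambiguation and species-resolution steps.
    """
    index = defaultdict(set)
    for doc_id, name, species in mentions:
        key = name.lower()
        index[key].add(doc_id)  # searchable under the name as written
        pan = PAN_LOOKUP.get((key, species))
        if pan is not None:
            index[pan].add(doc_id)  # and under the species-specific PAN
    return index


if __name__ == "__main__":
    mentions = [
        ("doc1", "MetAP2", "Mus musculus"),
        ("doc2", "peptidase M2", "Mus musculus"),
        ("doc3", "MAP2", "Homo sapiens"),  # no PAN in the toy table
    ]
    index = build_index(mentions)
    print(sorted(index["O08663"]))
```

A single lookup on the PAN then returns documents that used different synonyms, while the mention resolved to another species stays out of the result.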

Mum, Dad, the fish and other species.

(1)


Wolfson College

Cambridge University

EMBL EBI

1) S. Gaudan *, H. Kirsch and D. Rebholz-Schuhmann: Resolving abbreviations to their senses in Medline. Bioinformatics, September 2005.
2) EbiMed: http://www.ebi.ac.uk/Rebholz-srv/ebimed/ and http://www.ebi.ac.uk/Rebholz-srv/whatizit/
3) Sylvain Gaudan is supported by an “E-STAR” fellowship funded by the EC’s FP6 Marie Curie Host fellowship for Early Stage Research Training under contract number MEST-CT-2004-504640.

ADMR: ADrenoMedullin Receptor (gene) / Average Daily Metabolic Rate
AES: Amino-terminal Enhancer of Split (gene) / Anterior Ectosylvian Sulcus
AMFR: Autocrine Motility Factor Receptor (gene) / Amplitude-Modulation Following Response

3) So, how does it work?
When using EbiMed to retrieve publications related to a protein, simply use the protein’s UniProt PAN instead of one of the protein’s names in conjunction with an organism name. For instance, instead of the query:

(“methionine aminopeptidase 2” OR “peptidase M2” OR MAP2 OR MetAP2) and (mouse or mice) and “tooth germ”

simply use the query:

O08663 AND “tooth germ”
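The contrast between the two queries above can be sketched in a few lines of Python. The helpers `quote`, `name_query` and `pan_query` are hypothetical, and the uppercase Boolean syntax with straight quotes is an assumption made for illustration, not EbiMed’s documented query grammar.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def name_query(synonyms, organism_terms, topic):
    """Build the long query: every synonym, every organism term, plus the topic."""
    names = " OR ".join(quote(s) for s in synonyms)
    organisms = " OR ".join(quote(o) for o in organism_terms)
    return f"({names}) AND ({organisms}) AND {quote(topic)}"


def pan_query(pan, topic):
    """Build the short query: the PAN already encodes protein and species."""
    return f"{pan} AND {quote(topic)}"


if __name__ == "__main__":
    synonyms = ["methionine aminopeptidase 2", "peptidase M2", "MAP2", "MetAP2"]
    print(name_query(synonyms, ["mouse", "mice"], "tooth germ"))
    print(pan_query("O08663", "tooth germ"))
```

The name-based query grows with every synonym and organism spelling the user happens to know, whereas the PAN-based query stays the same length.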

4) Sounds good. Where can I use it?
For biologists: The Protein Index is available on EbiMed, a Web portal developed at the European Bioinformatics Institute. EbiMed retrieves abstracts from Medline and also builds a condensed view of the biomedical terminology contained in the results (e.g. protein names, GO terms, ...).
For bioinformaticians: Access the Protein Index via the EBI’s Web Services (SOAP/HTTP). (2)

2) What can be done?
What: Name disambiguation
How: Protein names that are also English words can be identified by analyzing their frequencies in general English text, such as the British National Corpus (BNC).

~~~

[Figure: Frequencies of protein names in the BNC (log scale, from 10 to 10 000), with a cut-off line. Names plotted: Bnac2, Aciculin, Oxitocin, Insulin, April, Task, Light.]
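The frequency cut-off can be sketched as a simple filter: a protein name whose count in a general English corpus reaches the cut-off is flagged as a likely ordinary word. The function `ambiguous_names`, the cut-off value and all counts below are invented for illustration; they are not the BNC figures behind the plot.

```python
def ambiguous_names(protein_names, corpus_frequency, cutoff):
    """Return the protein names whose general-English frequency reaches the cut-off.

    corpus_frequency maps lowercased words to their counts in a general
    English corpus; names missing from it are treated as frequency 0.
    """
    return sorted(
        name
        for name in protein_names
        if corpus_frequency.get(name.lower(), 0) >= cutoff
    )


if __name__ == "__main__":
    # Invented counts, for illustration only.
    counts = {"april": 5000, "task": 8000, "light": 9000, "insulin": 60}
    names = ["Bnac2", "Aciculin", "Insulin", "April", "Task", "Light"]
    print(ambiguous_names(names, counts, cutoff=100))
```

Names above the cut-off would then need context before a mention is accepted as a protein, while rare names such as Bnac2 can be indexed directly.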

Sylvain Gaudan * Miguel Arregui * Harald Kirsch * Vivian Lee * Dietrich Rebholz-Schuhmann (3)

http://www.ebi.ac.uk/Rebholz/