fission yeast computing workshop -1- searching, querying, browsing downloading and analysing data...

Fission Yeast Computing Workshop

-1-

Searching, querying, browsing, downloading and analysing data

using GeneDB and the Gene Ontology annotation

• Basic searching and browsing

• Anatomy of the GeneDB Genepage (overview of page contents)

• Simple data mining and analysis

Create user defined gene sets and Download gene sets in various formats

Combine (union, intersect and subtract) to make and refine user defined lists

• “GO slimming”

• “GO enrichment” exercises


-2-

Basic Search /Browse tips

1. This searches ONLY the gene name and product line.

2. This searches the full text of the page. It is advisable to use quotes for compound terms, e.g. “mitotic cyclin”, as mitotic cyclin will search “mitotic and cyclin”. In addition, PMID:19250904 will not work but “PMID:19250904” will. This isn’t a good way to retrieve gene sets (we will look at better ways), although it is useful for quickly getting to a single gene page. You can also use this search to quickly locate the pombe ortholog of a cerevisiae gene, but you need the systematic ID, e.g. YPR070W.

3. Register gene names pre-publication here.

4. Mailing lists (pombelist) and curated S. cerevisiae orthologs.

5. Browse catalogues.


-3-

Anatomy of a gene page

Location

1. Chromosome, coordinates
2. Context map
3. GBrowse
4. Artemis (EMBL format or Artemis applet)

General information

• Gene names
• Product (unique)
• Access to protein and DNA sequence
• Access to various Blast


-4-

Curation

Includes:

• Viability (if available; will soon be genome wide)
• Species distribution
• Phenotype (new), not comprehensive
• Name derivations
• Disease associations
• Post-translational modifications
• S. cerevisiae orthologs
• Domain and family information (but only when there are more members than identified by Pfam)
• Targets
• Information about expression and regulation
• Protein feature information, e.g. coiled coil, cleavage site

By using a controlled vocabulary we can group “like” features. Eventually these will be captured by more formal ontologies (phenotypes -> PATO).

Curation “terms” are listed and grouped in the Curation browsable list (e.g. below).

Curation terms can be used in the Boolean query tool (later exercise).


-5-

Gene Ontology (GO) Annotation

We now have good breadth of annotation, especially to high level terms (demonstrated in a later practical); depth (specificity) could be improved.

Annotations are supported by an evidence code and a source (publication).

Sometimes a qualifier is used to provide extra information about an annotation.

A gene product annotated to a term is automatically annotated to all of that term’s parents, so if you want to find other genes which might be related to this process you can go up the graph.

The GO term on the Gene Page is linked to the AmiGO GO browser “term page”. The term information includes the definition and synonyms. Scroll down the page for the term lineage graph. This shows parents; for example, you may wish to go up the graph (or tree view) to access the 42 gene products annotated to the parent “spindle organisation”, which will also include these 4 genes.

NOTE: “spindle organisation” does not appear on the mto1 gene page (even though mto1 is annotated to this term), so the “full text search” would not necessarily retrieve all genes annotated to a GO term.

The complete list of genes annotated to a term can be accessed:

i) Through AmiGO (by going to the term page and accessing the product list)
ii) From a page with a direct annotation
iii) Through the Boolean query interface (later)

From GeneDB this view is filtered to show only pombe annotations, but you can change to other species filters to access the results in other organisms.
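The rule that an annotation to a term also counts for every parent of that term can be sketched as a walk up a small graph. This is only an illustration: the graph edges and the names "direct term X" and "broader term Y" are hypothetical placeholders, and only the gene name mto1 and the parent "spindle organisation" come from the text above.

```python
# Toy fragment of a GO-like graph: child term -> set of parent terms.
# The edges are illustrative placeholders, not copied from GO.
PARENTS = {
    "direct term X": {"spindle organisation"},
    "spindle organisation": {"broader term Y"},
    "broader term Y": set(),
}

# Direct annotations only (what is listed on the gene page).
DIRECT = {"mto1": {"direct term X"}}


def ancestors(term, parents):
    """Return every term reachable by walking up from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def all_annotations(gene, direct, parents):
    """Direct terms plus all of their ancestors."""
    terms = set(direct.get(gene, set()))
    for term in list(terms):
        terms |= ancestors(term, parents)
    return terms


full = all_annotations("mto1", DIRECT, PARENTS)
print(sorted(full))
# A lookup restricted to direct annotations misses the parent term.
print("spindle organisation" in DIRECT["mto1"])  # False
print("spindle organisation" in full)            # True
```

This is why a text search of the gene page can miss genes that a term-based query returns.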


-6-

External Links:

http://128.40.79.33

Pfam

Pfam is a database of protein domains and protein families, each represented by non-overlapping multiple sequence alignments. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high quality, manually curated families. These cover a large proportion of sequences in the sequence databases (83% for pombe, the highest coverage of any eukaryote). In order to give further coverage, these are supplemented by automatically generated entries called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions with no Pfam-A entries.


-7-

AmiGO is the official browser of the GO consortium. It allows users to:

• Browse the GO ontology
• Search the GO ontology
• View annotations to terms in different species

AmiGO is the GO browser we use for GeneDB. This is a separate installation of AmiGO from the one at the GO site. An important difference is that the GeneDB implementation includes IEA annotations and will give better coverage (although the GO site should also soon support IEAs in AmiGO). You can tell which version you are using from the URL.

Go to the GeneDB version

http://www.genedb.org/amigo-cgi/search.cgi?

Simple Searching and browsing GO with AmiGO

You can search for gene names or identifiers OR GO terms

Search “GO terms” for “DNA repair”

The most relevant results should be near the top of your results. Click on the term name to take you to the “term details” page.


-8-

You can access broader and narrower terms (move up and down the tree) from the term lineage. Broader terms can be useful to identify lists of related genes.

Scroll down the page to see the term lineage and the numbers of genes annotated to this term in ALL organisms. Filter on “data source GeneDB S. pombe” to retrieve only fission yeast annotations (now 152).

Links to the associations (gene product annotations)

Note that you can also access lists of annotations to a term from the Gene page of any annotated gene product in GeneDB (both direct and indirect) provided there is at least one direct annotation to this term in the genome.


-9-

Return to the gene product search (front page) using your “back button”. Search for a fission yeast gene. This search will return any gene products where the gene product name matches the search term. If you still have the filter on, you will retrieve only pombe genes which match this name. You can disable the filter under the search option.

From the “Gene product search” you can access all GO annotations to individual gene products.


-10-

Is_a relationship

Part_of relationship

Leaf node or no children

Node has been opened, can be clicked to close

Node has children, can be clicked to view children

Browsing

Browse the high level biological process terms by opening the nodes (“+”) for “biological process” and “cellular process”. Set the filter for data source GeneDB_Spombe; you will notice that almost all fission yeast annotated gene products are annotated to “cellular process”.

Browsing can be used to identify sets of genes of interest, or to locate a term if you can’t find it by searching, or to see how high level terms relate to each other (as when building a slim later). From wherever you are in AmiGO, click “Browse” in the menu bar to take you to this view:


-11-

Using the “Boolean query interface” to select and download some user defined gene sets

http://www.genedb.org/genedb/pombe

The Boolean query interface entry point is on the S. pombe GeneDB front page. First, we will retrieve a complete list of protein coding genes, their identifiers and products.

You can construct queries (AND (intersect) / OR (union)) directly in this interface, but it is much simpler to perform single queries and combine them in the query history. In addition, you can subtract queries from each other in the history, but you can’t in the query builder.

Select “genes of a certain type” and “proceed to next step” from the pull down menu

Select “protein coding” and “submit” from the next view.

The results page will provide you with a list of all protein coding genes which you can “page through” at 20 items per page.

The link “visit the history page” takes you to your query history from where you can refine queries and download results sets in various formats.

Exercise 1: Download a protein set


-12-

5025 is actually slightly higher than the actual protein coding gene total. This is because transposons contain a protein coding open reading frame and are annotated as CDS (coding sequence).

Use the “back button” on your browser to return to the “Boolean query selector interface”. Select the data type “annotation status”. Select the data type “transposable element” and “Submit”.

Go to the history page, you will now have the results of both queries in your query history manager.

Select (using the checkboxes) both results sets and subtract query 2 from query 1 to give the current set of protein coding genes. Note: to do a subtraction query, you need to ensure that the query you wish to subtract is below the query you want to subtract from so in this case you need to make sure that you perform the queries in the correct order. The order does not matter for Intersection and Union queries.
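The three query-history operations behave like ordinary set operations, which makes it easy to see why only subtraction depends on order. A minimal sketch, using made-up identifiers rather than real systematic IDs:

```python
# Hypothetical result sets; the IDs are invented for illustration.
protein_coding = {"GENE_A", "GENE_B", "GENE_C", "TN_1", "TN_2"}  # query 1
transposons = {"TN_1", "TN_2"}                                   # query 2

# Subtraction is order-dependent: query 1 minus query 2.
current_proteins = protein_coding - transposons
wrong_order = transposons - protein_coding

# Union and intersection give the same answer in either order.
same_union = (protein_coding | transposons) == (transposons | protein_coding)
same_intersection = (protein_coding & transposons) == (transposons & protein_coding)

print(sorted(current_proteins))  # ['GENE_A', 'GENE_B', 'GENE_C']
print(sorted(wrong_order))       # []
print(same_union, same_intersection)  # True True
```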

Click the link “Download” next to your final results set.

Tip: At this page you can also supply your own lists to perform the following download operations

Query history


-13-

Download Options

Scroll down the page to see the download options.

We want a tab delimited file with sequence ID and product. Tab delimited is the default, and ID should be pre-selected, so you should only need to select “product”.

The sequence options are for Fasta files Only, and are not applicable to the “tab” delimited file

Submit the query with output destination “normal page”

The output will be a list of IDs and products in tab-delimited format.

Go back and change the output destination to “Save as”

This will allow you to save to disk. The file will download with the name IdListFormHandler, so you will need to rename it to something sensible. We will use this file later.
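If you want to work with the saved file outside a spreadsheet, a two-column tab-delimited file can be read with a few lines of Python. This is a sketch under assumptions: the column order (ID, then product) follows the options chosen above, while the sample rows and the file name in the final comment are invented.

```python
import csv
import io


def read_id_product(handle):
    """Read 'ID<TAB>product' rows into a dict, skipping blank lines."""
    table = {}
    # QUOTE_NONE keeps any quote characters in product names as plain text.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or not row[0].strip():
            continue
        gene_id = row[0].strip()
        product = row[1].strip() if len(row) > 1 else ""
        table[gene_id] = product
    return table


# Stand-in for the downloaded file; IDs and products are invented.
sample = io.StringIO("GENE_A\tproduct one\nGENE_B\tproduct two\n\n")
genes = read_id_product(sample)
print(len(genes))
print(genes["GENE_A"])

# For a real download (the file name here is an assumption):
# with open("pombe_proteins.tsv", newline="") as handle:
#     genes = read_id_product(handle)
```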

Tip: From this interface, for any results set, or user defined list, you can also download

i) Fasta format protein sequence file
ii) Fasta format DNA sequence file
iii) 5’ or 3’ DNA sequence of user specified length for each CDS
iv) Each CDS with the 5’ AND 3’ regions of specified length


-14-

Exercise 2: Boolean Queries, protein status query

You can recreate this data in the Boolean query interface. Select queries for annotation status “conserved hypothetical”, “role inferred from homology”, “experimentally characterised”, “sequence orphan” and “S. pombe specific families”. Go to the query history and union these datasets. Although the numbers may have changed slightly, the total should be close to 4947. Why is this different from the protein coding total in the previous exercise? Subtract query 2 from query 1 to see what the difference is. You may wish to exclude these “genes” from future queries.

You can recreate this data in the Boolean query interface. Select queries for “proteins with curation containing a specific word or phrase”, then run queries for the phrases (keywords) “conserved in Metazoa” and “conserved in fungi only”. Intersect both of these queries with the conserved hypothetical query (i.e. conserved unknowns).

If you have time, you can query for genes with a specific GO component “nucleus” and “curation” “predominantly uniformly single copy” to get those which are single copy in most organisms.


-15-

What makes a good “slim”?

This depends a lot on what you want your slim to show, but there are some general considerations:

1. If you are trying to make a slim for the entire genome you should try to ensure that it covers as many annotated terms as possible, but you might want to avoid terms with excessively large or small numbers of annotations (to avoid extreme distributions in your histogram). You should be aware of how many terms are annotated but not in your slim, and how many terms are “unknown” (i.e. annotated only to the root node).

2. You may want to keep the number of terms as small as possible to convey your results (for display purposes). However, you still need to include the “biologically relevant” terms. Many terms, e.g. metabolic process (2915 annotations) and cellular process (4083 annotations), are too “general” for the purpose of most “slims”.

4. The slim should probably exclude sibling terms with a large overlap between their annotations. If you choose two siblings with 200 genes annotated to each, and the majority of the annotations overlap, it may be better to select the parent node (i.e. replace 2 terms by one single term). Conversely, if the child terms of a node fall into distinct non-overlapping subsets, it might be more informative to include both child terms in your slim (for example the term transport, see below).

5. For most purposes you need to include a representative term for all biologically relevant processes, by including terms which are meaningful, especially if you are defining a slim for a specific purpose.

6. If you are using your slim for data analysis (and not just for visualization) you need to include terms which will allow you to distinguish genes based on their biological properties. For example, it is not good to lump all genes involved in transport under transport, because the genes annotated to the distinct child terms (vesicle-mediated transport, protein targeting, transmembrane transport) are VERY different in terms of their i) viability ii) species distribution iii) number of interaction partners iv) copy number v) expression pattern, so it may not make sense to lump them together in your slim set. This is important if you are using a slim to display the results of an enrichment, for example.
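The sibling-overlap consideration can be checked numerically once you have downloaded the gene list for each candidate term. The sketch below is illustrative only: the gene sets are synthetic, and both the overlap measure (shared genes as a fraction of the smaller set) and the 0.5 cut-off are arbitrary choices made for this example.

```python
def overlap_fraction(genes_a, genes_b):
    """Share of the smaller gene set that is also in the other set."""
    smaller = min(len(genes_a), len(genes_b))
    if smaller == 0:
        return 0.0
    return len(genes_a & genes_b) / smaller


# Synthetic gene sets for two pairs of sibling terms.
sibling_1 = {f"g{i}" for i in range(0, 200)}
sibling_2 = {f"g{i}" for i in range(20, 220)}  # shares 180 genes with sibling_1
vesicle = {f"v{i}" for i in range(100)}
transmembrane = {f"t{i}" for i in range(100)}  # shares no genes with vesicle

THRESHOLD = 0.5  # arbitrary cut-off chosen for this sketch

for name, a, b in [("pair 1", sibling_1, sibling_2),
                   ("pair 2", vesicle, transmembrane)]:
    frac = overlap_fraction(a, b)
    advice = "consider the parent term" if frac > THRESHOLD else "keep both terms"
    print(name, round(frac, 2), advice)
```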

GO slimming

• High level view of GO (genes annotated to granular terms are mapped to higher level terms)
• Allows users to group genes into broader categories to assess their distribution, for genome wide analyses or smaller gene sets
• Different annotation groups have created specific GO slims, which are available at GO’s FTP site (pombe now has an “official GO slim” which gives good coverage of high level processes)
• You can create and use your own GO slim with high level terms of interest
• CARE: this is not a gene product count. As gene products have multiple annotations, it doesn’t make sense to display this information as a pie chart
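The counting behind a slim, and the reason the totals are not gene counts, can be sketched in a few lines. Everything here is assumed for illustration: the gene names are invented, the three slim terms are a tiny stand-in for a real slim, and each gene's term set is taken to already include parent terms.

```python
from collections import Counter

# Hypothetical input: gene -> all process terms it is annotated to,
# with parent terms already included. Names are illustrative only.
ANNOTATIONS = {
    "gene1": {"transport", "response to stress"},
    "gene2": {"transport"},
    "gene3": {"DNA repair", "response to stress"},
    "gene4": {"some term outside the slim"},
    "gene5": set(),  # annotated only to the root, i.e. process unknown
}
SLIM = {"transport", "response to stress", "DNA repair"}


def slim_counts(annotations, slim):
    """Count genes per slim term, plus 'other' and 'unknown' bins."""
    counts = Counter()
    for terms in annotations.values():
        hits = terms & slim
        if hits:
            counts.update(hits)
        elif terms:
            counts["other"] += 1
        else:
            counts["unknown"] += 1
    return counts


counts = slim_counts(ANNOTATIONS, SLIM)
for term, n in sorted(counts.items()):
    print(f"{term}\t{n}")

# The slim-term counts add up to more than the number of genes they cover,
# which is why a pie chart of these numbers would mislead.
in_slim = sum(n for t, n in counts.items() if t in SLIM)
genes_in_slim = sum(1 for terms in ANNOTATIONS.values() if terms & SLIM)
print(in_slim, genes_in_slim)
```

In this toy data three genes produce five slim annotations, so the bars of a histogram can be read as absolute counts but not as shares of the genome.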


-16-

You can cut and paste these terms from here: http://www.sanger.ac.uk/Projects/S_pombe/GO_slim

pombe biological process GO slim

GO:0006810 transport (819)
    GO:0055085 transmembrane transport (305)
    GO:0006913 nucleocytoplasmic transport (116)
    GO:0016192 vesicle-mediated transport (277)
    GO:0006605 protein targeting (164)
GO:0006259 DNA metabolic process (310)
    GO:0006310 DNA recombination (100)
    GO:0006281 DNA repair (155)
    GO:0006260 DNA replication (154)
GO:0006486 protein amino acid glycosylation (68)
GO:0030163 protein catabolic process (229)
GO:0006412 translation (594 includes RNA)
GO:0006457 protein folding (86)
GO:0032446 protein modification by small protein conjugation or removal (155)
GO:0016070 RNA metabolic process (914)
    GO:0006399 tRNA metabolic process (127)
    GO:0016071 mRNA metabolic process (214)
GO:0032569 transcription (447)
    GO:0032569 specific transcription from RNA polymerase II promoter (139)
GO:0006996 organelle organization (791)
    GO:0007005 mitochondrion organization (230)
GO:0042254 ribosome biogenesis (232)
GO:0007165 signal transduction (386)
GO:0000747 conjugation with cellular fusion (106)
GO:0030437 ascospore formation (96)
GO:0007010 cytoskeleton organization (215)
GO:0006950 response to stress (694)
GO:0051186 cofactor metabolic process (137)
GO:0006629 lipid metabolic process (203)
GO:0006766 vitamin metabolic process (59)
GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (131)
GO:0005975 carbohydrate metabolic process (226)
GO:0006725 cellular nitrogen compound metabolic process (202)
GO:0006091 generation of precursor metabolites and energy (128)
GO:0006520 amino acid metabolic process (191)
GO:0000910 cytokinesis (141)
GO:0007059 chromosome segregation (189)
GO:0007346 regulation of mitotic cell cycle (162)
GO:0007047 cell wall organization (63)
GO:0042546 cell wall biogenesis (72)
GO:0006461 protein complex assembly (111)
GO:0007126 meiosis (173)
GO:0007163 establishment or maintenance of cell polarity (60)
GO:0019725 cellular homeostasis (101)
GO:0016568 chromatin modification (209)

(Children of broader terms are indented)

Other processes not in the slim (under 100, work in progress)

Process unknown, i.e. annotated only to the root node “biological process” (897)

GO slimming, here’s one I made earlier........


-17-

Exercise 3a “GO Slimming” create a “GO slim”

This exercise uses the generic “GO slim mapper” at Princeton to create a GO slim distribution from our gene set of interest. Go to http://go.princeton.edu/cgi-bin/GOTermMapper (or Google “Princeton generic GO term mapper”).

1. Upload the protein coding gene list from Exercise 1. Select GeneDB S pombe (Generic GO Slim).

2. User defined GO slim: in the advanced options (use the pombe GO slim as a starting point and add your own terms of interest).


-18-

This exercise uses the GeneDB AmiGO “GO slimmer” to create a GO slim distribution from our gene set of interest. Go to http://www.genedb.org/amigo-cgi/slimmer (or Google “AmiGO GeneDB GO slimmer”).

1. Upload a gene list from the datamining exercise or the complete gene list. Select “Fission yeast GO slim”.

2. User defined GO slim: in the advanced options (use the pombe GO slim as a starting point and add your own terms of interest).

Exercise 3b “GO Slimming” create a GO slim


-19-

For most purposes this slim would be inadequate, but it does show:

“unknown” (unannotated)

“other” (annotated to some other term not in the slim)

(AmiGO and the Princeton Term mapper should show these soon)

There are usually many more annotations than genes (e.g. 8454 here, and this will increase as you add more terms). Many genes are annotated to multiple high level terms (i.e. there are intersections between many terms). A pie chart does not show the percentage of the genome involved in a particular process, although it is often used and interpreted that way. Histograms with absolute numbers on the axis rather than percentages are much more meaningful.

To research a user defined slim to ensure you have good coverage, and to check intersections between your chosen terms, use the entire protein set in the Boolean query history, and subtract:

Genes with no GO process annotation

Followed by your GO terms of interest

(your remainder will be genes annotated to terms which are not covered by your GO slim)

If you are slimming and comparing to the complete annotation, be aware that this includes annotations to tRNAs and rRNAs etc, not just proteins

“GO slimming, important considerations”


-20-

Exercise 4: Create a user defined gene set

Return to the Boolean Query Interface. The set you create will be used as an input set to the GO “enrichment” tool. Use a combination of searches, but try to make your set contain between 200 and 500 gene products.

Things you can include in your Boolean query are:

• Curation (already used)
• Genes of a certain type (protein coding etc.), already used
• Annotation status (characterised, orphan etc.)
• Specific GO function, process and component
• Any GO annotation, or any GO annotation to a specific aspect
• A specific Pfam domain, or any Pfam domain
• Any range of exon number, molecular mass or protein length
• Presence of signal peptides, GPI anchors or transmembrane domains

Note: you can select for the absence of features using a subtraction query

All GO terms and Pfam domain names are listed alphabetically, so it helps if you know how the term you are looking for is worded before you start.

Remember when you search GO that:

i) A gene product annotated to a term is automatically annotated to ALL of its parents
ii) A search on a GO term returns annotations to ALL children of that term

The list of GO process terms on page 16 may be a useful starting point. If you would like to use more granular terms you can browse for children of these in AmiGO

Download your results set to use as input to the enrichment exercise (see Exercise 1 for the download instructions).


-21-

Exercise 5 “GO Term Enrichment”

Using the generic “GO term finder” tool at Princeton to provide an enrichment analysis (significant shared terms) in a gene set of interest. Go to http://go.princeton.edu/cgi-bin/GOTermFinder

1. Upload your gene list from Exercise 4.

2. Select the process ontology

3. Choose the pombe association file (annotations)


The results will show the most significant terms in your gene set, in order of significance. The % in your gene set compared to the % in the genome as a whole is provided, in addition to the P-value.
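To see where such a P-value can come from, here is a minimal sketch of a one-sided hypergeometric test, a common choice for term enrichment. Treating this as the exact calculation performed by the Princeton tool is an assumption, and all the numbers in the example are invented.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Invented numbers: a background of 5000 genes, 150 annotated to the term;
# a gene set of 300 genes, 30 of them annotated to the term.
N, K, n, k = 5000, 150, 300, 30

pct_set = 100 * k / n      # % of the gene set with the term
pct_genome = 100 * K / N   # % of the background with the term
p_value = hypergeom_pvalue(k, n, K, N)

print(f"{pct_set:.1f}% in set vs {pct_genome:.1f}% in background")
print(f"P-value: {p_value:.3g}")
```

The sketch tests a single term. It says nothing about how a tool handles testing many terms at once.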


-22-

Results are provided online as html tables and can be downloaded locally.

Results are also presented as a DAG, which allows users to browse the results set in the context of the GO hierarchy.

“GO Term Enrichment”, important considerations

In the advanced options is the option to upload the list of genes for your background population. This is especially important, as the significance needs to be calculated from the set of genes in your experiment, not the genome as a whole. Even if you have used the entire genome in your experiment, you should still upload the gene list in case the gene set has changed. Also, the complete set of annotations includes tRNAs, rRNAs etc., and their GO annotations. If your experiment does not include these, and you do not upload your own lists, your significance for some terms (e.g. translation) could be very distorted.

For other important considerations for enrichment (and slimming), see: Rhee SY, Wood V, Dolinski K, Draghici S. Use and misuse of the gene ontology annotations. Nat Rev Genet. 2008 Jul;9(7):509-15.
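The effect of the background population can be illustrated with simple frequencies. The numbers below are invented purely to show the direction of the distortion: the same gene set looks enriched for a term against a protein-only background, but not once RNA genes carrying that term are added to the background.

```python
def fold_enrichment(in_set_with_term, set_size, background_with_term, background_size):
    """Ratio of the term's frequency in the gene set to its frequency in the background."""
    set_freq = in_set_with_term / set_size
    background_freq = background_with_term / background_size
    return set_freq / background_freq


# Invented numbers: a 400-gene set in which 60 genes carry the term.
# Background A: protein coding genes only (5000 genes, 500 with the term).
# Background B: as A, plus 300 RNA genes that all carry the term.
protein_only = fold_enrichment(60, 400, 500, 5000)
with_rna_genes = fold_enrichment(60, 400, 800, 5300)

print(round(protein_only, 2))
print(round(with_rna_genes, 2))
```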


-23-

Exercise 6: GO coverage query

You can recreate this by doing Boolean queries for specific components (F, P, C) and combining them in the query history to generate the overlaps.

Or, you can download the 3 gene lists (any Component, any Function and any Process) and import them into the online Venn diagram generator at the URL below.
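If you prefer to compute the overlaps yourself from the three downloaded lists, the seven regions of a three-set Venn diagram follow directly from set operations. A minimal sketch with invented gene identifiers:

```python
def venn3(component, function, process):
    """Return the size of each of the seven regions of a three-set Venn diagram."""
    c, f, p = set(component), set(function), set(process)
    return {
        "C only": len(c - f - p),
        "F only": len(f - c - p),
        "P only": len(p - c - f),
        "C and F only": len((c & f) - p),
        "C and P only": len((c & p) - f),
        "F and P only": len((f & p) - c),
        "C, F and P": len(c & f & p),
    }


# Invented gene lists standing in for the three downloads.
component = {"g1", "g2", "g3", "g4"}
function = {"g2", "g3", "g5"}
process = {"g3", "g4", "g5", "g6"}

regions = venn3(component, function, process)
for name, size in regions.items():
    print(f"{name}\t{size}")

# Every gene with any annotation falls into exactly one region.
print(sum(regions.values()) == len(component | function | process))
```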


-24-

http://www.sanger.ac.uk/Projects/S_pombe/download.shtml

The contigs or chromosomes in EMBL format are the files you can use to browse the data with the Artemis sequence viewer.

Each ftp directory contains a README file describing the file content and format. Make sure you consult this before downloading the data.


-25-

http://www.sanger.ac.uk/Projects/S_pombe/genome_stats.shtml

These data are regularly updated. Where possible, links are provided to the data described.
