basic genomic characteristic aim: to collect as much general information as possible about your...

INTRODUCTION TO BIOLOGICAL DATABASES

Basic Genomic Characteristic

AIM: to collect as much general information as possible about your gene

Nucleotide sequence Databases

○ NCBI GenBank
○ EMBL Nucleotide Sequence Database
○ DDBJ

For Protein sequences

○ UniProtKB
○ NCBI Reference Sequence (RefSeq)

Nucleotide sequence DB

The 3 databases form an international collaboration. Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.

You do not need to check all of them!

NCBI Entrez

Presents all the information available at NCBI for a gene. Entrez is an integrated search tool across all the NCBI databases.
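Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only builds an ESearch request URL and does not contact NCBI. The function name `build_esearch_url` and the example query are illustrative assumptions, not part of the original slides.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Return an ESearch URL that looks up `term` in one Entrez database."""
    params = {
        "db": database,          # Entrez database to search, e.g. "gene"
        "term": term,            # query text, optionally with field tags
        "retmax": max_results,   # maximum number of record IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical query: the human beta globin gene in the Gene database.
example_url = build_esearch_url("gene", "HBB[Gene Name] AND Homo sapiens[Organism]")
print(example_url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the matching record IDs, which can then be looked up in the individual databases.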

Genome Browsers

NCBI Sequence Viewer

UCSC Genome Browser

ENSEMBL

NCBI Sequence Viewer

This is an example view of the human beta globin region on chr11

UCSC Genome Browser

ENSEMBL

ENSEMBL – genome view

ENSEMBL – Gene tree

NCBI OMIM database

Nucleotide databases and genome browsers provide information on the gene nucleotide sequence (exons, introns, alternative splicing sites…) but give you very little information on gene function.

The OMIM database provides a summary of the literature concerning a gene.

Protein Databases

Protein databases provide useful information about the function of a gene, e.g. conserved protein domains.

○ UniProt is the reference database
○ InterPro offers automatic protein annotation based on conserved domains
○ RefSeq

Protein databases - UniProt

Similarity search

If your gene has no protein information:

○ Protein sequence available: BLASTP against a non-redundant protein database
○ Protein sequence unavailable: BLASTX against a non-redundant protein database
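The choice above can be sketched in a few lines. BLASTX differs from BLASTP in that it translates the nucleotide query in all six reading frames before comparing it with the protein database, so the sketch also shows a six-frame translation using the standard genetic code. All function names here are illustrative assumptions, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def choose_blast_program(has_protein_sequence):
    """Pick the BLAST program suggested above for a protein database search."""
    return "blastp" if has_protein_sequence else "blastx"


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(choose_blast_program(has_protein_sequence=False))
print(six_frame_translations("ATGGCCTAA"))
```

Each of the six translated frames is what a BLASTX-style search would compare against the protein database when no protein sequence is available for the gene.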

Protein 3D structure

Many proteins have had their 3D structure determined. The biggest databases are:

○ PDB
○ NCBI Structure Group
○ Dali

They offer tools for visualization.

PDB database

The visualization tools allow you to see the structure and the ligands (if present), rotate the image and zoom in.
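To check which ligands a structure contains without opening a viewer, the HETATM records of a PDB-format file can be scanned. The following is a minimal sketch that relies on the fixed-column layout of the legacy PDB format (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26). The function name `list_ligands` is an assumption, and treating every non-water HETATM residue as a ligand is a simplification.

```python
def list_ligands(pdb_text):
    """Return sorted (residue_name, chain_id, residue_number) for HETATM groups."""
    ligands = set()
    for line in pdb_text.splitlines():
        # Only heteroatom records that are long enough to hold the fields.
        if not line.startswith("HETATM") or len(line) < 26:
            continue
        residue_name = line[17:20].strip()
        if residue_name == "HOH":  # skip water molecules
            continue
        chain_id = line[21]
        residue_number = int(line[22:26])
        # A set collapses the many atoms of one ligand into a single entry.
        ligands.add((residue_name, chain_id, residue_number))
    return sorted(ligands)
```

Calling `list_ligands(open("structure.pdb").read())` on a downloaded file (the file name is hypothetical) returns one entry per non-water heteroatom group.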

3D structure prediction

Structures are still available for only a limited number of proteins

Efforts are under way to predict protein structures based on sequence similarity

Still not very accurate!

○ SwissModel
○ PSIPRED
○ PredictProtein

Swiss-Model

Protein interaction databases

AIM: find proteins that interact with your target

IntAct: EBI resource to find interactors

BioGRID: a freely available interaction database covering model organisms and humans.
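Interaction databases essentially store pairs of interacting proteins, so finding the interactors of a target amounts to collecting its partners from those pairs. A minimal sketch follows, assuming the interactions have already been exported as a list of two-element tuples. The function name and the placeholder protein names in the usage note are hypothetical and do not come from IntAct or BioGRID.

```python
from collections import defaultdict


def find_interactors(pairs, target):
    """Return the sorted partners of `target` in a list of binary interactions."""
    partners = defaultdict(set)
    for protein_a, protein_b in pairs:
        # Interactions are treated as undirected: record both directions.
        partners[protein_a].add(protein_b)
        partners[protein_b].add(protein_a)
    # Drop self-interactions so the target is not listed as its own partner.
    return sorted(partners[target] - {target})
```

For example, with the placeholder pairs `[("P1", "P2"), ("P3", "P1"), ("P2", "P3")]`, the interactors of `"P1"` are `"P2"` and `"P3"`.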

IntAct

Regulatory and metabolic pathways

the classic “KEGG”:

http://www.genome.ad.jp/kegg/

miRNA specific resources

Databases:

miRNAMap: presents useful information such as secondary structure, tissue-specific expression and predicted target genes

HMDD: specific for disease-miRNA associations

miRBase: a searchable database of published miRNA sequences and annotation.

Target Prediction tools:

miRecords: a good repository that shows confirmed target genes and predictions from several other software tools

C. elegans specific tools

WormBase: the main resource of information on C. elegans.

Expression pattern databases:

○ Hope lab Expression Pattern Database
○ The Nematode Expression Pattern DataBase

Caenorhabditis elegans Genetics and Genomics: provides links to many useful resources for C. elegans

Expression databases

Allow exploratory analyses of multiple experiments

Experiments need to be linked

Require much information about how experiments were conducted = sources of variation

Very different to genomic databases

MIAME standard

MIAME

○ Experimental design
○ Microarray design
○ Extraction, preparation and labelling
○ Hybridisation conditions
○ Measurements: images, quantifications, parameters
○ Systematic error adjustments and transformations
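The six MIAME sections listed above can be read as a checklist. As a minimal sketch, the function below reports which sections are missing or empty in a submission record. The dictionary keys are hypothetical names chosen for this example and are not defined by the MIAME standard itself.

```python
# Hypothetical keys, one per MIAME section listed above.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "extraction_preparation_labelling",
    "hybridisation_conditions",
    "measurements",
    "normalisation",
)


def missing_miame_sections(record):
    """Return the MIAME sections that are absent or empty in `record`."""
    return [section for section in MIAME_SECTIONS if not record.get(section)]


# Example record that describes the design and array but nothing else.
example_record = {
    "experimental_design": "dose response, 3 biological replicates",
    "array_design": "commercial array, catalog number on file",
    "measurements": "",
}
print(missing_miame_sections(example_record))
```

An empty result means every section carries some description; it does not check whether that description is actually sufficient.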

Gene Expression Omnibus

○ NCBI administered
○ ~280,000 samples
○ >100 organisms
○ >1,000,000,000 measurements

ArrayExpress

○ EBI administered
○ >7000 experiments
○ Provides p-values
○ Bioconductor package

GEO and ArrayExpress Databases provide:

The raw data for each hybridization (e.g., CEL or GPR files)

The final processed (normalized) data for the set of hybridizations in the experiment (study) (e.g., the gene expression data matrix used to draw the conclusions from the study)

The essential sample annotation including experimental factors and their values (e.g., compound and dose in a dose response experiment)

The experimental design including sample data relationships (e.g., which raw data file relates to which sample, which hybridizations are technical, which are biological replicates)

Sufficient annotation of the array (e.g., gene identifiers, genomic coordinates, probe oligonucleotide sequences or reference commercial array catalog number)

The essential laboratory and data processing protocols (e.g., what normalization method has been used to obtain the final processed data)

Problems:

○ Difficult to compare experiments
○ Significant genes not highlighted
○ Poor visualization of results

ArrayExpress is trying to solve these problems with its Atlas

Genevestigator

It is a Java visualization tool that summarizes results from thousands of high-quality transcriptomic experiments

Much easier to compare samples

Open access to only some of the data and 1 probeset/gene

ONCOMINE
