Data Mining in Ensembl with BioMart


Upload: joshwa

Post on 27-Jan-2016

Page 1: Data Mining in Ensembl with BioMart

Data Mining in Ensembl with BioMart

Nov, 2009

www.ensembl.org/biomart/martview

www.biomart.org/biomart/martview

Page 2: Data Mining in Ensembl with BioMart

BioMart - Data mining

• BioMart is a search engine that can find multiple terms and put them into a table format.

• For example: mouse gene IDs, chromosome, and base-pair position

• No programming required!

Page 3: Data Mining in Ensembl with BioMart

General or Specific Data-Tables

• All the genes for one species

• Or… only genes on one specific region of a chromosome

• Or… genes on one region of a chromosome associated with an InterPro domain

Page 4: Data Mining in Ensembl with BioMart

The First Step: Choose the Dataset

Dataset: Current Ensembl, Human genes

Page 5: Data Mining in Ensembl with BioMart

The Second Step: Filters

Filters: Define a gene set

Page 6: Data Mining in Ensembl with BioMart

Attributes attach information

Attributes: Determine output columns

Page 7: Data Mining in Ensembl with BioMart

Results

Tables or sequences

Page 8: Data Mining in Ensembl with BioMart

Query:

• For the human CFTR gene, can I export the EntrezGene ID, and also, probes with this gene sequence from the “Affy HG U133 Plus 2” microarray platform?

• In the query:

Filters: what we know

Attributes: what we want to know.

Page 10: Data Mining in Ensembl with BioMart

Query:

• For the human CFTR gene, can I export the EntrezGene ID, and also, probes with this gene sequence from the “Affy HG U133 Plus 2” microarray platform?

• In the query:

Filters: what we know

Attributes: what we want to know (columns in the result table)
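The web interface needs no programming, but the same question can also be written down as a query document. The sketch below is illustrative only: it builds a BioMart-style XML query for the CFTR example. The dataset, filter, and attribute names (`hsapiens_gene_ensembl`, `hgnc_symbol`, `ensembl_gene_id`, `entrezgene`, `affy_hg_u133_plus_2`) and the `martservice` address are assumptions that do not appear in these slides and may differ between Ensembl releases. The helper `build_biomart_query` is a hypothetical name. Nothing is sent over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart-style XML query string (hypothetical helper).

    dataset    -- dataset name (what to search)
    filters    -- dict of filter name -> value (what we know)
    attributes -- list of attribute names (what we want to know)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated result table
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names; check them against the release you are using.
cftr_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": "CFTR"},
    attributes=["ensembl_gene_id", "entrezgene", "affy_hg_u133_plus_2"],
)

# Assumed service address; the URL is only constructed here, not requested.
request_url = "http://www.ensembl.org/biomart/martservice?" + urlencode({"query": cftr_query})

print(cftr_query)
```

The filter carries what we know (the gene symbol) and each attribute becomes one column of the result table.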

Page 11: Data Mining in Ensembl with BioMart

A Brief Example

Select Homo sapiens

Use the current Ensembl (archives are also available)

Page 12: Data Mining in Ensembl with BioMart

Select the genes with Filters

Click ‘Filters’.

Expand the ‘REGION’ panel.

Expand the ‘GENE’ panel to enter the gene ID(s).

Page 13: Data Mining in Ensembl with BioMart

Filters

Change this to HGNC symbol. Enter “CFTR” in the box.

Click “Count” to check how many genes passed your filters.

Page 14: Data Mining in Ensembl with BioMart

Attributes (Output Options)

Click on ‘Attributes’.

Expand the “GENE” section.

Page 15: Data Mining in Ensembl with BioMart

Attributes (Output Options)

Select “Description” and “Associated Gene Name”.

Expand the ‘EXTERNAL’ panel for non-Ensembl IDs.

Page 16: Data Mining in Ensembl with BioMart

Attributes (Output)

External IDs include EntrezGene IDs and also Microarray probe IDs.

Page 17: Data Mining in Ensembl with BioMart

The Results Table - Preview

“Results” show Description, Name, EntrezGene and Probe matches from the Affy HG U133 Plus 2 platform.

For the full result table: click “Go” or View “ALL” rows.

Page 18: Data Mining in Ensembl with BioMart

Full Result Table

Columns: Ensembl Gene and Transcript IDs, Description, Gene Name, EntrezGene ID, Affy HG probe

Page 19: Data Mining in Ensembl with BioMart

Other Export Options (Attributes)

Sequences: UTRs, flanking sequences, cDNA and peptides, etc.

Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)

Microarray data

Protein functions/descriptions (InterPro, GO)

Orthologous gene sets

SNP/Variation data

Page 20: Data Mining in Ensembl with BioMart

BioMart Data Sets

• Ensembl genes

• Vega genes

• Variations

Page 21: Data Mining in Ensembl with BioMart

BioMart around the world…

BioMart started at Ensembl…

To where has it travelled?

Page 22: Data Mining in Ensembl with BioMart

Central Portal

www.biomart.org

Page 23: Data Mining in Ensembl with BioMart

WormBase

Page 24: Data Mining in Ensembl with BioMart

HapMap

Population frequencies

Inter-population comparisons

Gene annotation

Page 25: Data Mining in Ensembl with BioMart

DictyBase

Page 26: Data Mining in Ensembl with BioMart

GRAMENE

www.gramene.org

Page 27: Data Mining in Ensembl with BioMart

The Potato Center

Page 28: Data Mining in Ensembl with BioMart

How to Get There

http://www.biomart.org/biomart/martview

http://www.ensembl.org/biomart/martview

• Or click on ‘BioMart’ from Ensembl

Page 29: Data Mining in Ensembl with BioMart

The Flow

• Choose Dataset (All genes for a species)

• Choose Filters (narrows the gene set)

• Choose Attributes (output options)

Now Try the Worked Example on Page 23!
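The three steps of the flow can be pictured as a tiny in-memory model. This is a teaching sketch, not how BioMart is implemented: the dataset is a list of rows, filters keep only matching rows, and attributes pick the output columns. The rows are copied from the de-normalised SMAD table shown later in this deck. The function `run_query` and the column key `symbol` are hypothetical names.

```python
# Step 1: choose the dataset (toy rows taken from the de-normalised schema slide).
DATASET = [
    {"gene_id": "ENSG00000170365", "transcript_id": "ENST00000302085", "symbol": "SMAD1"},
    {"gene_id": "ENSG00000175387", "transcript_id": "ENST00000262160", "symbol": "SMAD2"},
    {"gene_id": "ENSG00000175387", "transcript_id": "ENST00000356825", "symbol": "SMAD2"},
    {"gene_id": "ENSG00000166949", "transcript_id": "ENST00000327367", "symbol": "SMAD3"},
    {"gene_id": "ENSG00000141646", "transcript_id": "ENST00000342988", "symbol": "SMAD4"},
]


def run_query(dataset, filters, attributes):
    """Keep rows that match every filter, then keep only the requested columns."""
    # Step 2: filters narrow the gene set.
    selected = [
        row for row in dataset
        if all(row[key] == value for key, value in filters.items())
    ]
    # Step 3: attributes decide the output columns.
    return [tuple(row[name] for name in attributes) for row in selected]


result = run_query(
    DATASET,
    filters={"symbol": "SMAD2"},
    attributes=["gene_id", "transcript_id"],
)

for line in result:
    print("\t".join(line))
```

With the filter set to SMAD2, the result has two rows, one per transcript of that gene.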

Page 30: Data Mining in Ensembl with BioMart

Ensembl Core Databases

Relational Database

• Normalised

• Each data point stored only once

Therefore:

• Quick updates

• Minimal storage requirements

But:

• Many tables

• Many joins for complicated queries

• Slow for data mining applications

Page 31: Data Mining in Ensembl with BioMart

Normalised Schema

gene_id   gene.symbol
9970      SMAD1
1712      SMAD2
8240      SMAD3
1967      SMAD4
…         …

gene_id   transcript
9970      ENST00000302085
1712      ENST00000262160
1712      ENST00000356825
8240      ENST00000327367
1967      ENST00000342988
…         …

gene_id   stable_id
9970      ENSG00000170365
1712      ENSG00000175387
8240      ENSG00000166949
1967      ENSG00000141646
…         …

Page 32: Data Mining in Ensembl with BioMart

BioMart Database

Data warehouse

• De-normalised

• Query-optimised

Therefore:

• Fast and flexible

• Ideal for data mining

But:

• Tables with apparent “redundancy”

• Needs rebuilding from scratch from normalised core databases for every release

Page 33: Data Mining in Ensembl with BioMart

De-Normalised Schema

gene_id           transcript_id     gene.symbol
ENSG00000170365   ENST00000302085   SMAD1
ENSG00000175387   ENST00000262160   SMAD2
ENSG00000175387   ENST00000356825   SMAD2
ENSG00000166949   ENST00000327367   SMAD3
ENSG00000141646   ENST00000342988   SMAD4
…                 …                 …
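To see how the de-normalised table relates to the three normalised tables on the earlier slide, here is a small sketch using SQLite from Python. The row values come from the slides. The table names (`gene_symbol`, `gene_transcript`, `gene_stable_id`, `mart`) are hypothetical and are not the real Ensembl core or BioMart schema. The joins are paid once, when the flat table is built, so later queries read a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalised side: three small tables linked by an internal gene_id.
conn.executescript("""
    CREATE TABLE gene_symbol     (gene_id INTEGER, symbol TEXT);
    CREATE TABLE gene_transcript (gene_id INTEGER, transcript TEXT);
    CREATE TABLE gene_stable_id  (gene_id INTEGER, stable_id TEXT);
""")
conn.executemany("INSERT INTO gene_symbol VALUES (?, ?)", [
    (9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4"),
])
conn.executemany("INSERT INTO gene_transcript VALUES (?, ?)", [
    (9970, "ENST00000302085"), (1712, "ENST00000262160"),
    (1712, "ENST00000356825"), (8240, "ENST00000327367"),
    (1967, "ENST00000342988"),
])
conn.executemany("INSERT INTO gene_stable_id VALUES (?, ?)", [
    (9970, "ENSG00000170365"), (1712, "ENSG00000175387"),
    (8240, "ENSG00000166949"), (1967, "ENSG00000141646"),
])

# De-normalised side: do the joins once and store the flat result.
conn.execute("""
    CREATE TABLE mart AS
    SELECT s.stable_id  AS gene_id,
           t.transcript AS transcript_id,
           g.symbol     AS symbol
    FROM gene_transcript AS t
    JOIN gene_stable_id  AS s ON s.gene_id = t.gene_id
    JOIN gene_symbol     AS g ON g.gene_id = t.gene_id
""")

# Querying the flat table needs no joins.
rows = conn.execute(
    "SELECT gene_id, transcript_id, symbol FROM mart ORDER BY symbol, transcript_id"
).fetchall()

for row in rows:
    print("\t".join(row))
```

SMAD2 appears twice in the flat table, once per transcript. This is the apparent “redundancy” mentioned on the BioMart Database slide.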

Page 34: Data Mining in Ensembl with BioMart

Information Flow

[Diagram. Column headings: DATASET, FILTER, ATTRIBUTES. Dataset labels: SPECIES, FOCUS. Filter and attribute labels: REGION, SNP, PROTEIN, HOMOLOGY, GENE, EXPRESSION. External source labels: REFSEQ, INTERPRO, GO, SWISSPROT, EMBL, AFFYMETRIX. Output labels: FASTA, FILE, EXCEL, TEXT, GTF, HTML.]