Data Mining with BioMart

www.ensembl.org/biomart/martview

www.biomart.org/biomart/martview

Upload: jefferson-mothershead

Post on 02-Apr-2015



What is BioMart?

• A data export tool

• A quick table generator

• A web interface to mine Ensembl data

BioMart: Data Mining

• BioMart is a search tool that retrieves several kinds of data at once and returns them as a table.

• For example: mouse gene IDs, chromosome, and base-pair position

• No programming required!

General or Specific Data Tables

• All the genes for one species

• Or… only genes on one specific region of a chromosome

• Or… let BioMart select the genes (e.g. all transcripts that match a microarray probe set, GO term, or InterPro domain).

Results

Tables or sequences

The First Step: Choose the Dataset

Dataset: Current Ensembl, Human genes

The Second Step: Filters

Filters: Define a gene set

Attributes Attach Information

Attributes: Determine output columns

Query

For the human CFTR gene, export the Entrez Gene ID(s) and matching Affy HG U133-PLUS-2 probeset(s)


• In the query:

Filters: what we know

Attributes: what we want to know.
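The split between filters and attributes can be sketched in a few lines of Python. This is only an illustration of the idea, not how BioMart itself is implemented: the `run_query` helper and the `symbol` key are hypothetical, and the rows are borrowed from the de-normalised SMAD table shown later in this deck.

```python
# Illustrative sketch only: `run_query` is a hypothetical helper, not a BioMart API.
# Rows are copied from the de-normalised SMAD table later in this deck.
ROWS = [
    {"gene_id": "ENSG00000170365", "transcript_id": "ENST00000302085", "symbol": "SMAD1"},
    {"gene_id": "ENSG00000175387", "transcript_id": "ENST00000262160", "symbol": "SMAD2"},
    {"gene_id": "ENSG00000175387", "transcript_id": "ENST00000356825", "symbol": "SMAD2"},
    {"gene_id": "ENSG00000166949", "transcript_id": "ENST00000327367", "symbol": "SMAD3"},
    {"gene_id": "ENSG00000141646", "transcript_id": "ENST00000342988", "symbol": "SMAD4"},
]


def run_query(rows, filters, attributes):
    """Keep the rows that match every filter (what we know), then
    return only the requested columns (what we want to know)."""
    selected = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    return [tuple(r[a] for a in attributes) for r in selected]


# Filter: the gene symbol we already know.
# Attributes: the IDs we want in the output table.
result = run_query(
    ROWS,
    filters={"symbol": "SMAD2"},
    attributes=["gene_id", "transcript_id"],
)
for row in result:
    print("\t".join(row))
```

Filters narrow the rows and attributes choose the columns, so the two can be changed independently of each other.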


A Brief Example

Use the current Ensembl release (archives are also available).

Select Homo sapiens genes.

Select the Genes with Filters

Click ‘Filters’.

Expand the ‘GENE’ panel to enter the gene ID(s).

Filters (and Count)

Change the ID type to HGNC curated name. Enter “CFTR” in the box.

Click “Count” to see how many genes pass your filters.

Attributes (Output Options)

Click on ‘Attributes’.

‘Attributes’ determine which columns appear in the output.


Select ‘EntrezGene ID’


Select the Affy Platform ‘HG U133-PLUS-2’ in the ‘Microarray’ section

The Results Table: Preview

For the full result table, click “Go” or choose to view “ALL” rows.

Full Result Table

Ensembl Gene ID for CFTR

Ensembl Transcript IDs

EntrezGene ID

Affy HG probeset
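The walkthrough above uses the web interface, but the same dataset, filter and attribute choices can also be written down as a BioMart XML query. The sketch below only builds the query text and a request URL; it does not contact any server. The internal names (`hsapiens_gene_ensembl`, `hgnc_symbol`, `entrezgene_id`, `affy_hg_u133_plus_2`) and the `martservice` path are assumptions that are not given in these slides and differ between releases, so check them against the lists shown in martview before relying on them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed internal names; verify them in martview for the release you use.
DATASET = "hsapiens_gene_ensembl"            # Homo sapiens genes
FILTERS = {"hgnc_symbol": "CFTR"}            # what we know
ATTRIBUTES = [                               # what we want to know
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "entrezgene_id",
    "affy_hg_u133_plus_2",
]
# Assumed service path next to the martview page named in the slides.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query_xml(dataset, filters, attributes):
    """Return a BioMart XML query as text: one Dataset element holding
    Filter elements (row selection) and Attribute elements (output columns)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def build_query_url(xml_text, base=MARTSERVICE):
    """URL-encode the query text into a GET request URL (nothing is sent)."""
    return base + "?" + urlencode({"query": xml_text})


query_xml = build_query_xml(DATASET, FILTERS, ATTRIBUTES)
query_url = build_query_url(query_xml)
print(query_xml)
```

Keeping the filters and attributes as plain data makes it easy to swap in a different gene or a different microarray platform without touching the query-building code.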

Other Export Options (Attributes)

Sequences: UTRs, flanking sequences, cDNA, peptides, etc.

Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)

Microarray data

Protein functions/descriptions (InterPro, GO)

Orthologous gene sets

SNP/variation data

BioMart around the world…

BioMart started at Ensembl…

To where has it travelled?

Central Portal

www.biomart.org

WormBase

HapMap

Population frequencies

Inter-population comparisons

Gene annotation

DictyBase

GRAMENE

www.gramene.org

The Potato Center

How to Get There

http://www.biomart.org/biomart/martview

http://www.ensembl.org/biomart/martview

• Or click on ‘BioMart’ from Ensembl

Worked Example

• Follow the worked example on pg 26

• Then, do the exercises on pg 34 (answers on pg 37)

This module should do the following:

• Show you how to export multiple data types from Ensembl for gene IDs or chromosomal regions.

Ensembl Core Databases

Relational database
• Normalised
• Each data point stored only once

Therefore:
• Quick updates
• Minimal storage requirements

But:
• Many tables
• Many joins for complicated queries
• Slow for data mining applications

Normalised Schema

gene_id  gene.symbol
9970     SMAD1
1712     SMAD2
8240     SMAD3
1967     SMAD4
…        …

gene_id  transcript
9970     ENST00000302085
1712     ENST00000262160
1712     ENST00000356825
8240     ENST00000327367
1967     ENST00000342988
…        …

gene_id  stable_id
9970     ENSG00000170365
1712     ENSG00000175387
8240     ENSG00000166949
1967     ENSG00000141646
…        …
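To see why a normalised layout needs joins, the three tables above can be loaded into an in-memory SQLite database. This is a minimal sketch, not the real Ensembl core schema: the table names `gene_symbol`, `gene_transcript` and `gene_stable_id` are hypothetical, and only the rows visible on the slide are used.

```python
import sqlite3

# Hypothetical table names; the data rows are the ones shown on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_symbol     (gene_id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE gene_transcript (gene_id INTEGER, transcript TEXT);
CREATE TABLE gene_stable_id  (gene_id INTEGER PRIMARY KEY, stable_id TEXT);
""")
conn.executemany("INSERT INTO gene_symbol VALUES (?, ?)", [
    (9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4"),
])
conn.executemany("INSERT INTO gene_transcript VALUES (?, ?)", [
    (9970, "ENST00000302085"),
    (1712, "ENST00000262160"),
    (1712, "ENST00000356825"),
    (8240, "ENST00000327367"),
    (1967, "ENST00000342988"),
])
conn.executemany("INSERT INTO gene_stable_id VALUES (?, ?)", [
    (9970, "ENSG00000170365"), (1712, "ENSG00000175387"),
    (8240, "ENSG00000166949"), (1967, "ENSG00000141646"),
])

# Each value is stored once, so one simple question already needs two joins.
rows = conn.execute("""
    SELECT s.stable_id, t.transcript, g.symbol
    FROM gene_symbol AS g
    JOIN gene_stable_id  AS s ON s.gene_id = g.gene_id
    JOIN gene_transcript AS t ON t.gene_id = g.gene_id
    WHERE g.symbol = ?
    ORDER BY t.transcript
""", ("SMAD2",)).fetchall()

for row in rows:
    print("\t".join(row))
conn.close()
```

With only three tables the joins are cheap; the slide's point is that the number of joins grows as more kinds of data are combined in a single query.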

BioMart Database

Data warehouse
• De-normalised
• Query-optimised

Therefore:
• Fast and flexible
• Ideal for data mining

But:
• Tables with apparent “redundancy”
• Needs rebuilding from scratch for every release from normalised core databases

De-Normalised Schema

gene_id          transcript_id    gene.symbol
ENSG00000170365  ENST00000302085  SMAD1
ENSG00000175387  ENST00000262160  SMAD2
ENSG00000175387  ENST00000356825  SMAD2
ENSG00000166949  ENST00000327367  SMAD3
ENSG00000141646  ENST00000342988  SMAD4
…                …                …
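The trade-off in the wide table above can be checked with a short Python sketch. It assumes nothing beyond the five rows shown on the slide; the variable names are illustrative.

```python
from collections import Counter

# (gene_id, transcript_id, gene.symbol) rows copied from the slide.
MART_ROWS = [
    ("ENSG00000170365", "ENST00000302085", "SMAD1"),
    ("ENSG00000175387", "ENST00000262160", "SMAD2"),
    ("ENSG00000175387", "ENST00000356825", "SMAD2"),
    ("ENSG00000166949", "ENST00000327367", "SMAD3"),
    ("ENSG00000141646", "ENST00000342988", "SMAD4"),
]

# Benefit: filtering needs no joins, because every row already carries
# the gene ID, the transcript ID and the symbol together.
smad2_transcripts = [t for _, t, symbol in MART_ROWS if symbol == "SMAD2"]

# Cost: apparent redundancy. A gene's ID and symbol are repeated once
# for each of its transcripts.
rows_per_gene = Counter(gene_id for gene_id, _, _ in MART_ROWS)
repeated = {gene_id: n for gene_id, n in rows_per_gene.items() if n > 1}

print(smad2_transcripts)
print(repeated)
```

Here only SMAD2 is repeated, because it is the only gene on the slide with two transcripts.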

Information Flow

[Figure: information flow from DATASET through FILTER to ATTRIBUTES. Labels shown: SPECIES, FOCUS; REGION, SNP, PROTEIN, HOMOLOGY, GENE, EXPRESSION; REFSEQ, INTERPRO, GO, SWISSPROT, EMBL, AFFYMETRIX; FASTA, FILE, EXCEL, TEXT, GTF, HTML.]