Data Mining in GEO and Beyond Steven Chen Pengcheng Lu Andrew Yi

Post on 28-Jan-2017

TRANSCRIPT

Page 1: Data Mining in GEO and Beyond

Data Mining in GEO and Beyond

Steven Chen, Pengcheng Lu, Andrew Yi

Page 2: Data Mining in GEO and Beyond

Outline

  General introduction of GEO database and basic search functions (Steven)

  Introduction of a powerful search engine for GEO database and real example illustration (Pengcheng)

  Bioinformatics meta analysis tool introduction (Andrew)

Page 3: Data Mining in GEO and Beyond

Gene Expression Omnibus (GEO): Gene Expression and Molecular Abundance Data Repository

  A public repository for the archiving and distribution of gene expression data submitted by the scientific community.

  MIAME-compliant data (MIAME: Minimum Information About a Microarray Experiment): http://www.mged.org/Workgroups/MIAME/miame.html

  Convenient for deposition of gene expression data, as required by funding agencies and journals.

  Curated, online resource for gene expression data browsing, query, analysis and retrieval.

Page 4: Data Mining in GEO and Beyond

GEO Architecture

GEO has four kinds of data records:

  Platform (GPL): the technology used and the features detected.

  Sample (GSM): preparation and description of the sample.

  Series (GSE): a set of samples and how they are related.

  DataSets (GDS): sample data collections assembled by GEO staff.

Submitters may provide raw data:

  Original microarray scans

  Raw quantification data
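The four record types can be told apart by their accession prefix. A minimal sketch in Python, assuming accessions always have the form prefix plus digits (as in GSE5325); the names `Accession` and `parse_accession` are illustrative and not part of any GEO tool.

```python
import re
from typing import NamedTuple

# Prefixes and record names as listed on the slide.
RECORD_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}


class Accession(NamedTuple):
    prefix: str
    number: int
    record_type: str


# Assumption: an accession is one of the four prefixes followed by digits only.
_ACCESSION_RE = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def parse_accession(text: str) -> Accession:
    """Split a GEO accession such as 'GSE5325' into prefix, number and record type."""
    match = _ACCESSION_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO accession: {text!r}")
    prefix, digits = match.groups()
    return Accession(prefix, int(digits), RECORD_TYPES[prefix])
```

For example, `parse_accession("GSE5325").record_type` gives `"Series"`, while a string without a known prefix raises `ValueError`.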

Page 5: Data Mining in GEO and Beyond

GEO Architecture

  GPL: Platform descriptions. Submitted by Manufacturer*.

  GSM: Raw/processed spot intensities from a single slide/chip. Submitted by Experimentalists.

  GSE: Grouping of slide/chip data, “a single experiment”. Submitted by Experimentalists.

  GDS: Grouping of experiments. Curated by NCBI.

Page 6: Data Mining in GEO and Beyond

Data Relationship at GEO

[Diagram: Platform [GPL], Sample [GSM GSM GSM GSM ... GSM] and Series [GSE] are the submitted records; Dataset [GDS] and Gene Profiles are the curated records.]
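The relationship among the submitted records can be sketched as a small data model: a Series groups Samples, and each Sample points to the Platform it was run on. This is an illustrative Python sketch only; the class names and the platform IDs in the usage note are hypothetical, and real GEO records carry many more fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    gsm: str  # Sample accession
    gpl: str  # accession of the Platform this sample was measured on


@dataclass
class Series:
    gse: str  # Series accession
    samples: List[Sample] = field(default_factory=list)

    def platforms(self) -> List[str]:
        """Platforms used by the samples in this series, in first-seen order."""
        seen: Dict[str, None] = {}
        for sample in self.samples:
            seen.setdefault(sample.gpl, None)
        return list(seen)
```

A Series built from three samples on two hypothetical platforms, `GPL_A` and `GPL_B`, reports both platforms once each, which mirrors the "Platform (total)" and "Samples (total)" counts on a Series accession page.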

Page 7: Data Mining in GEO and Beyond

GEO Home Page

Simple interface to:

  show status

  find documentation

  query data

  browse data

  submit data

Page 8: Data Mining in GEO and Beyond

Basic Search: Repository Browser

Selecting the total public data or Repository Browser links on the GEO home page takes you to the Repository Browser, which lists:

  the number of each type of submitted file, both public and unreleased

  the total number of each technology type under Platforms

  the total number of each Sample type

Page 9: Data Mining in GEO and Beyond

Basic Search: Browse Platforms

  All GEO submissions need to be associated with a platform file. These describe the features on a given platform, required to understand the data.

  A platform file must be submitted if one is not already present in GEO.

  Commercial array platform files are submitted to GEO by the manufacturer.

Page 10: Data Mining in GEO and Beyond

Basic Search: Browse Platforms

Accession: GEO ID

Title: brief description of platform

Contact: submitter

Samples: number of samples in GEO associated with platform ID

Technology: platform type

Release date: when file is publicly accessible

The table can be sorted on any field except organism by clicking on the header. Specific platform files can be found using the ‘Find Platform’ option.

Page 11: Data Mining in GEO and Beyond

Basic Search: Find Platforms

  Select ‘Find Platform’

  Select company

  Select distribution

  Select species

  Enter title keyword

Page 12: Data Mining in GEO and Beyond

Basic Search: Find Platforms (continued)

  Start the platform search

  Select the accession for the U133 plus 2.0 array

  Scroll down to find data table information

Page 13: Data Mining in GEO and Beyond

Data Retrieval: Browse Series

  Data is submitted to GEO as a Series, which represents the experiment design.

  Selecting Browse>Series brings up a list sorted by release date.

  Selecting a Series ID brings up the Series file summary.

Page 14: Data Mining in GEO and Beyond

Data Retrieval: Series Accession Page

Biological sample summary

Design summary

Publication information

Platform (total)

Samples (total)

Page 15: Data Mining in GEO and Beyond

Data Retrieval: Sample File Summary

Sample preparation

Hybridization and data processing

Platform

Series

Page 16: Data Mining in GEO and Beyond

Data Retrieval: Sample File Data Table

Data table field descriptions

Truncated data table from Quick view

Total data rows and file size

Supplementary raw data file

Page 17: Data Mining in GEO and Beyond

Querying GEO with IDs from Papers

  A common way to access GEO data is through accessions from papers.

  Online journals include hyperlinks to the GEO accession page.

  Or, at the GEO home page, enter the accession into the Query>GEO accession text box.
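An accession taken from a paper can also be turned into a link to its GEO accession page directly. A minimal Python sketch, assuming the public `acc.cgi` accession viewer on the NCBI site; the helper name `accession_url` is illustrative.

```python
from urllib.parse import urlencode

# Assumption: the NCBI GEO accession viewer, addressed by the "acc" parameter.
GEO_ACCESSION_PAGE = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def accession_url(accession: str) -> str:
    """Build the accession-page URL for a GPL, GSM, GSE or GDS accession."""
    query = urlencode({"acc": accession.strip().upper()})
    return f"{GEO_ACCESSION_PAGE}?{query}"
```

Calling `accession_url("GSE5325")` yields the same kind of link that online journals embed next to a cited accession.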

Page 18: Data Mining in GEO and Beyond

GEO Links in PubMed Search Results

  One option for displaying PubMed search results is GEO DataSet links.

  When present, the results page is actually from Entrez GEO DataSets.

Page 19: Data Mining in GEO and Beyond

Advanced Searches

GEO data can be queried as:

  Datasets: experiment-centric view using Entrez GEO DataSets

  Gene profiles: gene-centric view using Entrez GEO Profiles

Selecting either takes you to a similar Entrez introduction page
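Both views can also be reached programmatically through the NCBI E-utilities search endpoint, where GEO DataSets is the Entrez database `gds` and GEO Profiles is `geoprofiles`. The Python sketch below only builds the request URL and sends nothing; the helper name `esearch_url` and the default `retmax` are illustrative choices.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Entrez database names for the two query views on the slide.
ENTREZ_DB = {
    "datasets": "gds",          # experiment-centric view
    "profiles": "geoprofiles",  # gene-centric view
}


def esearch_url(view: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for the chosen GEO view and search term."""
    params = {"db": ENTREZ_DB[view], "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"
```

For instance, `esearch_url("profiles", "MYC", retmax=5)` targets the gene-centric database, while `esearch_url("datasets", "lung cancer")` targets the experiment-centric one.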

Page 20: Data Mining in GEO and Beyond

Querying GEO DataSets

  Start a GEO DataSets search with the Query>DataSets text box.

  This brings up an Entrez GEO DataSets results form.

Total results

Number of DataSets

Number of Platforms

Number of Series

Page 21: Data Mining in GEO and Beyond

DataSet Search Result

DataSet ID

Description

Platform

Reference Series

Supplementary files

Number of Samples and truncated list

Cluster image

Select the DataSet ID or click on the cluster image to go to the DataSet record.

Page 22: Data Mining in GEO and Beyond

GEO DataSet Record

Experiment design and DataSet information.

Sample and analysis information.

Data retrieval.

Selecting analysis takes you to the data clustering interface.

Selecting the cluster image takes you to the clustering page

Page 23: Data Mining in GEO and Beyond

GEO Gene Profiles

GEO DataSet ID

Platform ID, Platform Feature ID

Gene description

Target sequence accession

Expression profile

  GEO Gene Profiles use gene IDs from Platform files to show the expression of a gene across DataSets.

  Entering a gene ID into the Query>Gene profiles text box takes you to the Entrez results page.

Page 24: Data Mining in GEO and Beyond

Advanced Search outside of GEO

  GEOmetadb applications

    GEOmetadb was developed in an attempt to make querying the GEO metadata both easier and more powerful.

    Web-based query engine, supported by a MySQL database backend.

    All GEO metadata records, as well as the relationships between them, are parsed and stored in a local MySQL database.

    URL: http://gbnci.abcc.ncifcrf.gov/geo/

  BioConductor packages:

    GEOquery

    GEOmetadb

Page 25: Data Mining in GEO and Beyond

Live Demo at GEOmetadb

Page 26: Data Mining in GEO and Beyond

Live Demo at GEOmetadb

Page 27: Data Mining in GEO and Beyond

BioConductor Packages

  GEOquery package

  GEOquery effectively establishes a bridge between GEO microarray data and BioConductor and R facilities

  GEOquery is used when the GEO accession number is known. The data can then be manipulated in R to create expression data and experimental design files in the format that other BioConductor packages expect.

Page 28: Data Mining in GEO and Beyond

An Example of Using GEOquery

  R Script:

library("GEOquery")

# Download the full GSE record (not the series matrix) for GSE5325
gse5325 <- getGEO("GSE5325", GSEMatrix = FALSE)

# Gene symbols from the annotation table of the first platform
gene <- Table(GPLList(gse5325)[[1]])$GeneSymbol

# Second column of each sample table, bound into one matrix (one column per sample)
gse5325.matrix <- do.call("cbind", lapply(GSMList(gse5325)[1:105], function(x) {
  tab <- Table(x)[, 2]
  return(tab)
}))

# Keep only the rows annotated as MYC or PTEN
data <- data.frame(gene, gse5325.matrix)
data2 <- data[data$gene %in% "MYC" | data$gene %in% "PTEN", ]
data2[, 1:6]

> data2[, 1:6]
      gene GSM120468   GSM120469 GSM120470  GSM120475 GSM120477 GSM120478
6168   MYC -1.635421  0.80790556 -3.100395 -1.0790637 -1.107209 -1.580962
7166  PTEN        NA          NA        NA         NA        NA        NA
8763  PTEN  1.297435 -0.09709502  1.065820  0.9462094  1.536822  1.180152
13327 PTEN        NA          NA        NA         NA        NA        NA
18409 PTEN        NA          NA        NA         NA        NA        NA
21865 PTEN        NA          NA        NA         NA        NA        NA
24305  MYC -1.809964  0.62828170 -4.038665 -1.2147221 -1.202033 -1.672025
24532  MYC -1.048360  0.75648633 -3.485740 -1.2723427 -1.145341 -1.751741

Page 29: Data Mining in GEO and Beyond

BioConductor Packages (Cont.)

  GEOmetadb package

  GEOmetadb BioConductor package is simply a thin wrapper around the GEOmetadb SQLite database.

  The RSQLite package (James, 2008) includes an embedded SQLite database engine and can interact with any SQLite database.

  The function getSQLiteFile is the standard method for downloading and unzipping the most recent GEOmetadb SQLite file from the server.

Page 30: Data Mining in GEO and Beyond

An Example of Using GEOmetadb

  R Script:

library(GEOmetadb)

# Download and unzip the most recent GEOmetadb SQLite file
getSQLiteFile()
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

# Series whose summary mentions lung cancer and that use a human platform
sql <- paste("SELECT DISTINCT gse.gse, gse.pubmed_id, gse.title",
             "FROM",
             " gsm JOIN gse_gsm ON gsm.gsm=gse_gsm.gsm",
             " JOIN gse ON gse_gsm.gse=gse.gse",
             " JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             " JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             " gse.summary LIKE '%lung%cancer%' AND",
             " gpl.organism LIKE '%Homo sapiens%'",
             sep = " ")
rs <- dbGetQuery(con, sql)
print(rs[, 1])

> rs[,1]
 [1] "GSE10025" "GSE10072" "GSE10089" "GSE10096" "GSE10309" "GSE1037"
 [7] "GSE10431" "GSE10445" "GSE10799" "GSE1081"  "GSE10841" "GSE10847"
[13] "GSE10957" "GSE11078" "GSE11117" "GSE11559" "GSE11969" "GSE12027"
[19] "GSE12236" "GSE12280" "GSE1650"  "GSE2052"  "GSE2189"  "GSE3202"
[25] "GSE3268"  "GSE3521"  "GSE3598"  "GSE3707"  "GSE3708"  "GSE3754"
[31] "GSE4115"  "GSE4127"  "GSE4705"  "GSE4716"  "GSE4869"  "GSE4882"
[37] "GSE5056"  "GSE5057"  "GSE5058"  "GSE5059"  "GSE5123"  "GSE5816"
[43] "GSE5843"  "GSE6013"  "GSE6044"  "GSE6253"  "GSE6400"  "GSE6413"
[49] "GSE6474"  "GSE6695"  "GSE6883"  "GSE6960"  "GSE6962"  "GSE7339"
[55] "GSE7670"  "GSE7878"  "GSE7880"  "GSE7930"  "GSE8045"  "GSE8332"
[61] "GSE8569"  "GSE8837"  "GSE8894"  "GSE8987"  "GSE9008"  "GSE9212"
[67] "GSE9315"  "GSE9994"  "GSE9995"
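To see what the join in this query does without downloading the full GEOmetadb file, the same SQL can be run against a toy in-memory SQLite database. The Python sketch below is an illustration only: the table and column names are copied from the query, but the tables are reduced to the columns the query touches and every row is invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gse (gse TEXT, pubmed_id INTEGER, title TEXT, summary TEXT);
CREATE TABLE gsm (gsm TEXT);
CREATE TABLE gpl (gpl TEXT, organism TEXT);
CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);
CREATE TABLE gse_gpl (gse TEXT, gpl TEXT);

-- Invented rows, for illustration only.
INSERT INTO gse VALUES ('GSE_A', 1, 'toy series A', 'A lung cancer study'),
                       ('GSE_B', 2, 'toy series B', 'A liver study');
INSERT INTO gsm VALUES ('GSM_1'), ('GSM_2'), ('GSM_3');
INSERT INTO gpl VALUES ('GPL_X', 'Homo sapiens'), ('GPL_Y', 'Mus musculus');
INSERT INTO gse_gsm VALUES ('GSE_A', 'GSM_1'), ('GSE_A', 'GSM_2'), ('GSE_B', 'GSM_3');
INSERT INTO gse_gpl VALUES ('GSE_A', 'GPL_X'), ('GSE_B', 'GPL_X');
""")

# Same query text as in the R script on this slide.
sql = """
SELECT DISTINCT gse.gse, gse.pubmed_id, gse.title
FROM gsm JOIN gse_gsm ON gsm.gsm = gse_gsm.gsm
         JOIN gse ON gse_gsm.gse = gse.gse
         JOIN gse_gpl ON gse_gpl.gse = gse.gse
         JOIN gpl ON gse_gpl.gpl = gpl.gpl
WHERE gse.summary LIKE '%lung%cancer%'
  AND gpl.organism LIKE '%Homo sapiens%'
"""
rows = con.execute(sql).fetchall()
con.close()
```

Joining through `gsm` produces one row per sample, so `GSE_A` would appear twice without `DISTINCT`; with it, `rows` holds a single entry for that series.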

Page 31: Data Mining in GEO and Beyond

Layers of Microarray Data Sets

A typical experiment design: A (Normal) group vs B (Tumor) group

Layers, from raw to processed:

  Image

  Raw data

  Normalized data: comparable N vs T

  Significant differential gene list: e.g. MYC, PTEN

  Signature, profile, or set

Resources shown alongside the layers: Microarray Core, GEO, EXALT

Signature definition: a set of statistically significant differentially expressed genes and their expression directions.

Essential pieces of information: N vs T, sigGene-Direction-Confidence

Page 32: Data Mining in GEO and Beyond

GEO Online Functionalities

Data set level: Good

  Data set management in raw, GSE, and GDS

  Search for data sets by gene, sample, and experiment key word

  Browsing, download

Within a data set:

  clustering analysis, gene searching

  two-group comparison by t-test in real time, slow

  signature profile among all samples

Across data sets:

  supports only a single gene search

  shows gene expression levels within a data set

NEITHER a data set comparison NOR a signature search among GEO data sets can be performed

Limited in the number of cancer-related data sets
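A two-group comparison of one gene between a Normal group and a Tumor group can be sketched as follows. The slide does not say which t-test variant GEO applies, so Welch's unequal-variance form is used here as an assumption, and the function name `welch_t` is illustrative. The Python sketch returns the test statistic and degrees of freedom only; turning these into a p-value needs a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t, df) for mean(group_b) - mean(group_a) using Welch's t-test."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error contributed by each group (sample variance / n)
    va = variance(group_a) / na
    vb = variance(group_b) / nb
    t = (mean(group_b) - mean(group_a)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

With `group_a` as the Normal samples and `group_b` as the Tumor samples, a positive `t` means higher expression in Tumor.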

Page 33: Data Mining in GEO and Beyond

Signature Encoding Format in EXALT

Triplets format: Gene-DirCode-Score

A PRCa signature in triplets (MYC-U-62, PTEN-D-38, …)

Signature name: normal prostate samples vs primary prostate tumors

Stanford_Pollack_CaP_PNAS_V101_P811

[Diagram labels: Prostate Cancer; PR Studies; PR Datasets; N Samples, PR Samples; Normal group VS PRCa; GEO SigDB; HuCa SigDB]
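The Gene-DirCode-Score triplet format can be read with a few lines of code. In the Python sketch below, the direction codes are taken to be U and D (read as up and down) and the score is taken to be an integer; both are assumptions based only on the two triplets shown on the slide, and the names `Triplet`, `parse_triplet` and `parse_signature` are illustrative.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    gene: str
    direction: str  # "U" or "D"; interpreting these as up/down is an assumption
    score: int      # assumed to be an integer, as in the slide's examples


def parse_triplet(text: str) -> Triplet:
    """Parse one 'Gene-DirCode-Score' item, e.g. 'MYC-U-62'."""
    # Split from the right so gene symbols that contain hyphens stay intact.
    gene, direction, score = text.strip().rsplit("-", 2)
    if direction not in ("U", "D"):
        raise ValueError(f"unknown direction code in {text!r}")
    return Triplet(gene, direction, int(score))


def parse_signature(text: str) -> List[Triplet]:
    """Parse a comma-separated list of triplets into a signature."""
    return [parse_triplet(item) for item in text.split(",") if item.strip()]
```

Parsing `"MYC-U-62, PTEN-D-38"` gives two triplets, one for MYC with direction U and one for PTEN with direction D.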

Page 34: Data Mining in GEO and Beyond

EXALT SigDB Statistics

EXALT Daily Web Statistics

  The number of EXALT page hits: 74,591

  Total number of EXALT users: 1,611

Signature Database (08212008) Statistics

EXALT          Datasets   Signatures   Arrays
Geo            1,445      25,613       28,862
Human Cancer   446        2,069        37,644

Page 35: Data Mining in GEO and Beyond

EXALT Online Features

• Data set browsing and search

• Query data set uploading and comparison

• Signature search

• Signature comparison

Page 36: Data Mining in GEO and Beyond

Upload a Query Data set

Page 37: Data Mining in GEO and Beyond

Signature Based Dataset Searching

Example: identify a cluster of gene expression signatures related to a query data set of prostate cancer (Pollack PNAS 2004).

The QUERY dataset has generated 3 signatures (PNAS 101(3):811-6. 2004)

Normal, Cancer, Metastasis

EXALT Search in Human Cancer Signature Database ...

Total SEVEN matched Subject Data Sets

GDS2545 NCBI_Geo_GDS2545 has 3 hits (100%) with Average pValue=1.29E-9

Analysis of metastatic prostate tumors and primary prostate tumors

J Clin Oncol 2004 Jul 15;22(14):2790-9.
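The slides do not describe how EXALT scores a match between two signatures. One generic way to get a p-value for signature overlap is a hypergeometric test on the genes shared with the same direction; the Python sketch below shows that technique as an assumption, not as EXALT's actual method. The function names and the gene `GENE_X` in the usage note are hypothetical.

```python
from math import comb


def concordant_genes(sig_a, sig_b):
    """Genes present in both signatures with the same direction code.

    Each signature is a dict mapping gene symbol to direction code.
    """
    return {gene for gene, direction in sig_a.items() if sig_b.get(gene) == direction}


def overlap_p_value(universe, size_a, size_b, shared):
    """P(X >= shared) when size_b genes are drawn at random from `universe`
    genes, of which size_a belong to the first signature (hypergeometric tail)."""
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / total
```

For two signatures `{"MYC": "U", "PTEN": "D", "GENE_X": "D"}` and `{"MYC": "U", "PTEN": "U"}`, only MYC is concordant; the count of concordant genes would then be passed as `shared` together with the signature sizes and the number of genes measured.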

Page 38: Data Mining in GEO and Beyond

Multiple Search Entries

A

B

Page 39: Data Mining in GEO and Beyond

Homologous Signatures

Search and select a query signature, and then launch a homology analysis

Launch a QUERY by selecting a signature from studies of estrogen receptor status (ER- vs ER+, Herschkowitz et al., 2008)

Page 40: Data Mining in GEO and Beyond

Co-expression Analysis

Input: MYC, PTEN
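The slide does not state how EXALT measures co-expression. A common generic choice is the Pearson correlation of two genes across the same samples, sketched below in Python as an assumption. The input values are the six MYC (row 6168) and PTEN (row 8763) measurements from the GEOquery output shown earlier in this deck; the function name `pearson` is illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Values for GSM120468, GSM120469, GSM120470, GSM120475, GSM120477, GSM120478
myc = [-1.635421, 0.80790556, -3.100395, -1.0790637, -1.107209, -1.580962]
pten = [1.297435, -0.09709502, 1.065820, 0.9462094, 1.536822, 1.180152]

r = pearson(myc, pten)
```

A value of `r` near +1 or -1 indicates strongly co-varying or anti-correlated expression; six samples are far too few to draw a conclusion, so this only shows the mechanics.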

Page 41: Data Mining in GEO and Beyond

EXALT Online Distinctive Features

• Two large signature databases that together hold >27,000 signatures

• The signature annotations are integrated with NCBI GEO and PubMed

• Multiple mining strategies allow direct access and analysis

• The interfaces are friendly to biologists

• All signatures are pre-computed and directly meaningful to research scientists

http://seq.mc.vanderbilt.edu/exalt/

Page 42: Data Mining in GEO and Beyond

Acknowledgements

Andrew Yi

Mentors: Yu Shyr, Matusik, Al George

Postdoc: QingChao Qiu

Developers: Guangzu Zhang

Collaborators: Wu Jun, Xie Lu at SBIT

Funding: NCI, IG, VICTR, VICC biostatistics

Steven Chen and Pengcheng Lu

Thank Yu Shyr, Ming Li, Heidi Chen and Aixiang Jiang for suggestions.