Data Mining in GEO and Beyond

Steven Chen, Pengcheng Lu, Andrew Yi

Outline

  General introduction of GEO database and basic search functions (Steven)

  Introduction of a powerful search engine for GEO database and real example illustration (Pengcheng)

  Bioinformatics meta analysis tool introduction (Andrew)

Gene Expression Omnibus (GEO): Gene Expression and Molecular Abundance Data Repository

  A public repository for archiving and distributing gene expression data submitted by the scientific community.

  MIAME-compliant data (MIAME = Minimum Information About a Microarray Experiment): http://www.mged.org/Workgroups/MIAME/miame.html

  Convenient for depositing gene expression data, as required by funding agencies and journals.

  Curated online resource for browsing, querying, analyzing and retrieving gene expression data.

GEO Architecture

  Platform (GPL) = the technology used and the features detected.

  Sample (GSM) = preparation and description of the sample.

  Series (GSE) = a set of samples and how they are related.

  DataSet (GDS) = a collection of sample data assembled by GEO staff.

GEO has four kinds of data records
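The four record types can be told apart by their accession prefix. The following is a minimal sketch in base R; `geo_record_type` is a hypothetical helper written for illustration and is not part of GEO or of any BioConductor package. The accessions passed to it are only examples.

```r
# Hypothetical helper (not a GEO or GEOquery function): map the three-letter
# accession prefix to the record type listed on this slide.
geo_record_type <- function(accession) {
  types <- c(GPL = "Platform", GSM = "Sample", GSE = "Series", GDS = "DataSet")
  prefix <- toupper(substr(accession, 1, 3))
  unname(types[prefix])  # NA for prefixes that are not one of the four types
}

geo_record_type(c("GPL570", "GSM120468", "GSE5325", "GDS2545"))
```

An unrecognized prefix yields NA rather than an error, which makes the helper convenient for screening a mixed list of identifiers.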

GEO Architecture

  GPL: platform descriptions (submitted by the manufacturer*)

  GSM: raw/processed spot intensities from a single slide/chip (submitted by experimentalists)

  GSE: grouping of slide/chip data, "a single experiment" (submitted by experimentalists)

  GDS: grouping of experiments (curated by NCBI)

Submitters may provide raw data:

  Original microarray scans

  Raw quantification data

Data Relationship at GEO

  Submitted: Platform [GPL], Sample [GSM GSM GSM GSM ... GSM], Series [GSE]

  Curated: DataSet [GDS], Gene Profiles

GEO Home Page

Simple interface to:

  show status

  find documentation

  query data

  browse data

  submit data

Basic Search: Repository Browser

Selecting the total public data or Repository Browser links on the GEO home page takes you to the Repository Browser, which lists:

  the number of each type of submitted file, both public and unreleased

  the total number of each technology type under Platforms

  the total number of each Sample type

Basic Search: Browse Platforms

  All GEO submissions need to be associated with a platform file, which describes the features on a given platform and is required to understand the data.

  A platform file must be submitted if one is not already present in GEO.

  Commercial array platform files are submitted to GEO by the manufacturer.

Basic Search: Browse Platforms

Accession: GEO ID

Title: brief description of platform

Contact: submitter

Samples: number of samples in GEO associated with platform ID

Technology: platform type

Release date: when file is publicly accessible

The table can be sorted on any field except organism by clicking on the header. Specific platform files can be found using the ‘Find Platform’ option.

Basic Search: Find Platforms

  Select ‘Find Platform’

  Select company

  Select distribution

  Select species

  Enter title keyword

Basic Search: Find Platforms (continued)

  Start the platform search

  Select the accession for the U133 plus 2.0 array

  Scroll down to find data table information

Data Retrieval: Browse Series

  Data is submitted to GEO as a Series, which represents the experiment design.

  Selecting Browse>Series brings up a list sorted by release date.

  Selecting a Series ID brings up the Series file summary.

Data Retrieval: Series Accession Page

Biological sample summary

Design summary

Publication information

Platform (total)

Samples (total)

Data Retrieval: Sample File Summary

Sample preparation

Hybridization and data processing

Platform

Series

Data Retrieval: Sample File Data Table

Data table field descriptions

Truncated data table from Quick view

Total data rows and file size

Supplementary raw data file

Querying GEO with IDs from Papers

  A common way to access GEO data is through accessions from papers.

  Online journals include hyperlinks to the GEO accession page.

  Or, at the GEO home page, enter the accession into the Query>GEO accession text box.
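An accession taken from a paper can also be turned into a link to its GEO accession page directly. The sketch below assumes the standard `acc.cgi` accession viewer on the NCBI site; `geo_accession_url` is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: build the URL of the GEO accession page for one accession.
# Assumes the NCBI accession viewer at /geo/query/acc.cgi.
geo_accession_url <- function(accession) {
  # Accept only the four GEO record prefixes followed by digits
  stopifnot(grepl("^(GPL|GSM|GSE|GDS)[0-9]+$", accession))
  paste0("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=", accession)
}

geo_accession_url("GSE5325")
```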

GEO Links in PubMed Search Results

  One option for displaying PubMed search results is GEO DataSet links.

  When present, the results page is actually from Entrez GEO DataSets.

Advanced Searches

GEO data can be queried as:

  Datasets: experiment-centric view using Entrez GEO DataSets

  Gene profiles: gene-centric view using Entrez GEO Profiles

Selecting either takes you to a similar Entrez introduction page

Querying GEO DataSets

  Start a GEO DataSets search with the Query>DataSets text box.

  This brings up an Entrez GEO DataSets results form.

Total results

Number of DataSets

Number of Platforms

Number of Series

DataSet Search Result

DataSet ID

Description

Platform

Reference Series

Supplementary files

Number of Samples and truncated list

Cluster image

Select the DataSet ID or click on the cluster image to go to the DataSet record.

GEO DataSet Record

Experiment design and DataSet information.

Sample and analysis information.

Data retrieval.

Selecting analysis takes you to the data clustering interface.

Selecting the cluster image takes you to the clustering page

GEO Gene Profiles

GEO DataSet ID

Platform ID, Platform Feature ID

Gene description

Target sequence accession

Expression profile

  GEO Gene Profiles use gene IDs from Platform files to show the expression of a gene across DataSets.

  Entering a gene ID into the Query>Gene profiles text box takes you to the Entrez results page.

Advanced Search outside of GEO

  GEOmetadb applications

  GEOmetadb was developed to make querying the GEO metadata both easier and more powerful.

  Web-based query engine, supported by a MySQL database backend.

  All GEO metadata records, as well as the relationships between them, are parsed and stored in a local MySQL database.

  URL: http://gbnci.abcc.ncifcrf.gov/geo/

  BioConductor packages: GEOquery, GEOmetadb

Live Demo at GEOmetadb

BioConductor Packages

  GEOquery package

  GEOquery effectively establishes a bridge between GEO microarray data and BioConductor and R facilities

  GEOquery is used when the GEO accession number is known. The data can then be manipulated in R to create expression data and experimental design files in the format that other BioConductor packages expect.

An Example of Using GEOquery

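  R Script:

    library("GEOquery")

    # Download the full Series record (GSEMatrix = FALSE keeps the GPL and GSM records)
    gse5325 <- getGEO("GSE5325", GSEMatrix = FALSE)

    # Gene symbols from the annotation table of the first platform
    gene <- Table(GPLList(gse5325)[[1]])$GeneSymbol

    # One column per sample: the second column of each sample's data table
    gse5325.matrix <- do.call("cbind", lapply(GSMList(gse5325)[1:105], function(x) {
      Table(x)[, 2]
    }))

    data <- data.frame(gene, gse5325.matrix)

    # Keep only the rows annotated as MYC or PTEN
    data2 <- data[data$gene %in% c("MYC", "PTEN"), ]
    data2[, 1:6]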

    > data2[, 1:6]
          gene GSM120468   GSM120469 GSM120470  GSM120475 GSM120477 GSM120478
    6168   MYC -1.635421  0.80790556 -3.100395 -1.0790637 -1.107209 -1.580962
    7166  PTEN        NA          NA        NA         NA        NA        NA
    8763  PTEN  1.297435 -0.09709502  1.065820  0.9462094  1.536822  1.180152
    13327 PTEN        NA          NA        NA         NA        NA        NA
    18409 PTEN        NA          NA        NA         NA        NA        NA
    21865 PTEN        NA          NA        NA         NA        NA        NA
    24305  MYC -1.809964  0.62828170 -4.038665 -1.2147221 -1.202033 -1.672025
    24532  MYC -1.048360  0.75648633 -3.485740 -1.2723427 -1.145341 -1.751741

BioConductor Packages (Cont.)

  GEOmetadb package

  The GEOmetadb BioConductor package is simply a thin wrapper around the GEOmetadb SQLite database.

  The RSQLite package (James, 2008) includes an embedded SQLite database engine and can interact with any SQLite database.

  The function getSQLiteFile is the standard method for downloading and unzipping the most recent GEOmetadb SQLite file from the server.

An Example of Using GEOmetadb

  R Script:

    library(GEOmetadb)

    # Download and unzip the most recent GEOmetadb SQLite file, then connect to it
    getSQLiteFile()
    con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

    # Series measured on human platforms whose summary mentions lung cancer
    sql <- paste("SELECT DISTINCT gse.gse, gse.pubmed_id, gse.title",
                 "FROM",
                 " gsm JOIN gse_gsm ON gsm.gsm=gse_gsm.gsm",
                 " JOIN gse ON gse_gsm.gse=gse.gse",
                 " JOIN gse_gpl ON gse_gpl.gse=gse.gse",
                 " JOIN gpl ON gse_gpl.gpl=gpl.gpl",
                 "WHERE",
                 " gse.summary LIKE '%lung%cancer%' AND",
                 " gpl.organism LIKE '%Homo sapiens%'",
                 sep = " ")
    rs <- dbGetQuery(con, sql)
    print(rs[, 1])

    # Close the connection when finished
    dbDisconnect(con)

    > rs[,1]
     [1] "GSE10025" "GSE10072" "GSE10089" "GSE10096" "GSE10309" "GSE1037"
     [7] "GSE10431" "GSE10445" "GSE10799" "GSE1081"  "GSE10841" "GSE10847"
    [13] "GSE10957" "GSE11078" "GSE11117" "GSE11559" "GSE11969" "GSE12027"
    [19] "GSE12236" "GSE12280" "GSE1650"  "GSE2052"  "GSE2189"  "GSE3202"
    [25] "GSE3268"  "GSE3521"  "GSE3598"  "GSE3707"  "GSE3708"  "GSE3754"
    [31] "GSE4115"  "GSE4127"  "GSE4705"  "GSE4716"  "GSE4869"  "GSE4882"
    [37] "GSE5056"  "GSE5057"  "GSE5058"  "GSE5059"  "GSE5123"  "GSE5816"
    [43] "GSE5843"  "GSE6013"  "GSE6044"  "GSE6253"  "GSE6400"  "GSE6413"
    [49] "GSE6474"  "GSE6695"  "GSE6883"  "GSE6960"  "GSE6962"  "GSE7339"
    [55] "GSE7670"  "GSE7878"  "GSE7880"  "GSE7930"  "GSE8045"  "GSE8332"
    [61] "GSE8569"  "GSE8837"  "GSE8894"  "GSE8987"  "GSE9008"  "GSE9212"
    [67] "GSE9315"  "GSE9994"  "GSE9995"
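The join-and-filter logic of this query can be tried without downloading the full database. The sketch below builds a tiny in-memory SQLite database with invented rows and only the columns the filter needs. The table and column names mirror those used in the slide's query, while the accessions `GSE_A`, `GPL_1` and so on are placeholders. The join through `gsm` is left out here for brevity, which is a simplification of the original query.

```r
library(DBI)
library(RSQLite)

# In-memory database with invented toy rows (not real GEO records)
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "gse", data.frame(
  gse     = c("GSE_A", "GSE_B", "GSE_C"),
  summary = c("Expression profiling of lung cancer cell lines",
              "Mouse liver time course",
              "Lung tumors in a mouse cancer model"),
  stringsAsFactors = FALSE))
dbWriteTable(con, "gpl", data.frame(
  gpl      = c("GPL_1", "GPL_2"),
  organism = c("Homo sapiens", "Mus musculus"),
  stringsAsFactors = FALSE))
dbWriteTable(con, "gse_gpl", data.frame(
  gse = c("GSE_A", "GSE_B", "GSE_C"),
  gpl = c("GPL_1", "GPL_2", "GPL_2"),
  stringsAsFactors = FALSE))

# Same WHERE clause as the slide: summary text filter plus organism filter
sql <- paste("SELECT DISTINCT gse.gse",
             "FROM gse",
             " JOIN gse_gpl ON gse_gpl.gse = gse.gse",
             " JOIN gpl ON gse_gpl.gpl = gpl.gpl",
             "WHERE gse.summary LIKE '%lung%cancer%'",
             " AND gpl.organism LIKE '%Homo sapiens%'")
rs <- dbGetQuery(con, sql)
dbDisconnect(con)

print(rs$gse)
```

Only `GSE_A` passes both filters: `GSE_C` also matches the summary pattern (SQLite's LIKE is case-insensitive for ASCII text by default) but sits on the mouse platform.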

Layers of Microarray Data Sets

A typical experiment design: A (Normal) group vs B (Tumor) group

Layers, from least to most processed:

  Image

  Raw data

  Normalized data: comparable N vs T

  Significant differential gene list: e.g. MYC, PTEN

  Signature, profile, or set

Definition of a signature: a set of statistically significant differentially expressed genes and their expression directions.

Essential pieces of information: N vs T, sigGene-Direction-Confidence

Resources shown on the slide: Microarray Core, GEO, EXALT
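As an illustration of how a normalized N vs T comparison can be reduced to a signature, the sketch below uses base R on invented toy values. It is not EXALT's procedure: the Welch t-test, the 0.05 cutoff, and the use of the p-value as the confidence measure are all assumptions made for this example, and the gene names are placeholders.

```r
# Invented toy data: 3 genes x (5 normal + 5 tumor) samples
group <- rep(c("N", "T"), each = 5)
expr <- rbind(
  GENE_UP   = c(1.0, 1.2, 0.9, 1.1, 1.0,  3.0, 3.2, 2.9, 3.1, 3.0),
  GENE_DOWN = c(3.0, 3.1, 2.9, 3.2, 3.0,  1.0, 0.9, 1.1, 1.0, 1.2),
  GENE_FLAT = c(2.0, 2.1, 1.9, 2.2, 1.8,  2.0, 1.9, 2.1, 1.8, 2.2)
)

# For each gene: direction of change in tumor (U = up, D = down) and a p-value
signature <- do.call(rbind, lapply(rownames(expr), function(g) {
  normal <- expr[g, group == "N"]
  tumor  <- expr[g, group == "T"]
  data.frame(gene      = g,
             direction = if (mean(tumor) > mean(normal)) "U" else "D",
             p_value   = t.test(tumor, normal)$p.value,
             stringsAsFactors = FALSE)
}))

# Keep only the significant genes (assumed cutoff of 0.05)
signature <- signature[signature$p_value < 0.05, ]
print(signature)
```

The gene with no difference between groups drops out, leaving a list of gene, direction and confidence, which is the information the slide calls essential.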

GEO Online Functionalities

Data set level:

  Good data set management in raw, GSE, and GDS form

  Search for data sets by gene, sample, and experiment keyword

  Browsing and download

Within a data set:

  Clustering analysis

  Gene searching

  Two-group comparison by t-test in real time (slow)

  Signature profile among all samples

Across data sets:

  Supports only a single-gene search, which shows the gene's expression levels within a data set

  NEITHER a data set comparison NOR a signature search among GEO data sets can be performed

  Limited in the number of cancer-related data sets

Signature Encoding Format in EXALT

  Triplets format: Gene-DirCode-Score

  Signature databases: GEO SigDB, HuCa SigDB

  Example study: Stanford_Pollack_CaP_PNAS_V101_P811

  Signature name: normal prostate samples vs primary prostate tumors

  Prostate cancer (PR) hierarchy: PR Studies, PR Datasets, PR Samples; comparison of Normal group (N Samples) vs PRCa (PR Samples)

  A PRCa signature in triplets (MYC-U-62, PTEN-D-38, …)
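A signature written in the Gene-DirCode-Score triplet format can be split back into its three fields with base R. This is a minimal sketch: `parse_triplets` is a hypothetical helper, the assumption that the score is a whole number comes only from the two example triplets on the slide, and the meaning of the score is not specified here.

```r
# Hypothetical helper: parse "Gene-DirCode-Score" triplets separated by commas.
# Fields are matched from the right so gene symbols containing "-" still parse.
parse_triplets <- function(x) {
  parts <- trimws(strsplit(x, ",")[[1]])
  pattern <- "^(.+)-([A-Za-z]+)-([0-9]+)$"
  stopifnot(all(grepl(pattern, parts)))
  data.frame(
    gene     = sub(pattern, "\\1", parts),
    dir_code = sub(pattern, "\\2", parts),
    score    = as.numeric(sub(pattern, "\\3", parts)),
    stringsAsFactors = FALSE
  )
}

sig <- parse_triplets("MYC-U-62, PTEN-D-38")
print(sig)
```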

EXALT SigDB Statistics

EXALT Daily Web Statistics

  The number of EXALT page hits: 74,591

  Total number of EXALT users: 1,611

Signature Database (08212008) Statistics

  EXALT          Datasets   Signatures   Arrays
  GEO            1,445      25,613       28,862
  Human Cancer   446        2,069        37,644

EXALT Online Features

• Data set browsing and search

• Query data set uploading and comparison

• Signature search

• Signature comparison

Upload a Query Data set

Signature Based Dataset Searching

Example: identify a cluster of gene expression signatures related to a query data set of prostate cancer (Pollack PNAS 2004).

The QUERY dataset (PNAS 101(3):811-6, 2004) has generated 3 signatures: Normal, Cancer, Metastasis

EXALT Search in Human Cancer Signature Database ...

Total SEVEN matched Subject Data Sets

  GDS2545: NCBI_Geo_GDS2545 has 3 hit (100%) with Average pValue=1.29E-9

  Analysis of metastatic prostate tumors and primary prostate tumors. J Clin Oncol 2004 Jul 15;22(14):2790-9.

Multiple Search Entries (panels A and B)

Homologous Signatures

Search and select a query signature, and then launch a homology analysis.

Launch a QUERY by selecting a signature from studies of estrogen receptor status (ER- vs ER+, Herschkowitz et al., 2008)

Co-expression Analysis

Input: MYC, PTEN

EXALT Online Distinctive Features

• Two large signature databases that together hold more than 27,000 signatures

• The signature annotations are integrated with NCBI GEO and PubMed

• Multiple mining strategies allow direct access and analysis

• The interfaces are friendly to biologists

• All signatures are pre-computed and directly meaningful to research scientists

http://seq.mc.vanderbilt.edu/exalt/

Acknowledgements

Andrew Yi

  Mentors: Yu Shyr, Matusik, Al George

  Postdoc: QingChao Qiu

  Developers: Guangzu Zhang

  Collaborators: Wu Jun, Xie Lu at SBIT

  Funding: NCI, IG, VICTR, VICC biostatistics

Steven Chen and Pengcheng Lu

  Thanks to Yu Shyr, Ming Li, Heidi Chen and Aixiang Jiang for suggestions.
