In silico study of cancer-related genes and microRNAs
Using microarrays to screen cancer genes and investigate their upstream regulatory microRNAs

Ka-Lok Ng (吳家樂)

Department of Biomedical Informatics, Asia University

Contents

Motivation
- Predict cancer genes based on microarray mRNA expression levels
- microRNA (miRNA) can act as an oncogene (OCG) or tumor suppressor gene (TSG)
- Identify cancer-related miRNAs, their target genes, and downstream protein-protein interactions (prediction of novel cancerous proteins)

(1) Introduction – microarray, cancer, microRNA
(2) Methods – input data
(3) Results
    (a) cancer gene prediction (Bioconductor), i.e. prostate/breast cancer
    (b) correlation study of miRNA and mRNA expression levels
    (c) ncRNAppi – a platform for studying microRNAs and their target genes' protein-protein interactions
(4) Summary

Central dogma of molecular biology

Post-transcription regulation – microRNA targets mRNA

transcriptome

Types of RNAs

RNA
- mRNA
- ncRNA: non-coding RNA. Transcribed RNA with a structural, functional or catalytic role
  - rRNA (ribosomal RNA): participates in protein synthesis
  - tRNA (transfer RNA): interface between mRNA and amino acids
  - snRNA (small nuclear RNA): includes RNA that forms part of the spliceosome
  - snoRNA (small nucleolar RNA): found in the nucleolus, involved in modification of rRNA
  - miRNA (microRNA): small RNA involved in regulation of expression
  - siRNA (small interfering RNA): active molecules in RNA interference
  - stRNA (small temporal RNA): RNA with a role in developmental timing
  - Other: including large RNA with roles in chromatin structure and imprinting

Introduction

Cancer formation and summary of the ten leading causes of cancer death in Taiwan, 2008 (ROC year 97)

Rank  Cause of death              Deaths   Percentage
      Cancer type (total)         38,913   100%
1     Lung cancer                  7,777   20.0%
2     Hepatocellular carcinoma     7,651   19.7%
3     Colorectal cancer            4,266   11.0%
4     Female breast cancer         1,541    4.0%
5     Gastric cancer               2,292    5.9%
6     Oral cavity cancer           2,218    5.7%
7     Prostate cancer                892    2.3%
8     Cervical cancer                710    1.8%
9     Esophageal cancer            1,433    3.7%
10    Pancreatic cancer            1,364    3.5%

By Hanne Jarmer, BioCentrum-DTU, Technical University of Denmark

cDNA labeled by Cy3 (Green)

cDNA labeled by Cy5 (Red)

Probe genes

Target

Microarray – overview

Microarrays are used to measure gene expression levels under two different conditions, with a green label (Cy3) for the control sample and a red label (Cy5) for the experimental sample.

Hybridization is DNA-cDNA or DNA-mRNA.

The hybridised microarray is excited by a laser and scanned at the appropriate wavelengths for the red and green dyes.

The amount of fluorescence emitted (intensity) upon laser excitation is approximately proportional to the amount of mRNA bound to each spot.

If the transcript is more abundant in the control sample the spot appears green, and if it is more abundant in the experimental sample the spot appears red. The colour therefore indicates the relative amount of transcript for that mRNA (EST) in the two samples.

If both are equally abundant, the spot appears yellow.

If neither is present, the spot appears black.
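The colour rule above can be sketched in R as a log2 ratio of the two channels. This is only an illustration under stated assumptions: the intensities, gene names and the variable names `cy3`, `cy5` and `spot_colour` are made up and are not data or code from the study.

```r
# Hypothetical spot intensities (arbitrary units); not data from the study.
cy5 <- c(geneA = 4000, geneB = 500,  geneC = 1000, geneD = 0)  # red, experimental
cy3 <- c(geneA = 1000, geneB = 2000, geneC = 1000, geneD = 0)  # green, control

# Spots with no signal in either channel appear black and have no ratio.
present <- (cy5 + cy3) > 0

# log2(red / green): positive means red, negative means green, zero means yellow.
log_ratio <- ifelse(present, log2(cy5 / cy3), NA_real_)

spot_colour <- ifelse(!present, "black",
               ifelse(log_ratio > 0, "red",
               ifelse(log_ratio < 0, "green", "yellow")))
print(data.frame(log_ratio, spot_colour))
```

Real scans rarely give an exact zero ratio, so in practice a tolerance band around zero would probably replace the strict comparisons used here.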

cDNA microarrays

Microarray data generation, processing and analysis

Information processing

Image quantitation – locating the spots and measuring their fluorescence intensities

Data normalization and integration – construction of the gene expression matrix from sets of spot measurements

Gene expression data analysis and mining – finding differentially expressed genes (DEGs) or clusters of similarly expressed genes

These analyses generate new hypotheses about the underlying biological processes, which in turn should be tested in follow-up experiments

http://www.mathworks.com/company/pressroom/image_library/biotech.html

Image analysis

Data analysis: clustering

miRNA gene → pri-miRNA (stem-loop structure) → processed by Drosha → pre-miRNA (65~90 bp) → carried by Exportin 5 to the cytoplasm → mature miRNA (20~25 bp) generated by the RNase III type enzyme Dicer → directed by RISC to the miRNA target → mRNA cleavage or impeded translation into protein

Introduction – biogenesis of microRNA

When a miRNA plays an oncogenic role, it targets TSGs or genes that control cell differentiation or apoptosis, which leads to tumor formation.

When a miRNA plays a tumor suppressor role, it targets OCGs or genes that control cell differentiation or apoptosis, so it can suppress tumor formation.

In both cases a negative correlation is expected between the miRNA and mRNA expression profiles.

We integrate human miRNA-targeted (or siRNA-targeted) mRNA data, protein-protein interaction (PPI) records, and tissue, pathway and disease information to establish a disease-related miRNA (or siRNA) pathway database.

Introduction – miRNAs can play the role of an OCG or a TSG

Introduction – cancer-related miRNAs

Cancer-related miRNA | Cancer type | References
miR-17-92 cluster, let-7 | Lung cancer | Martin et al., 2006; Yanaihara et al., 2006; Takamizawa et al., 2004
miR-10b, miR-21, miR-125b, miR-145, miR-155 | Breast cancer | Iorio et al., 2005; Si et al., 2007
miR-18, miR-122a, miR-224, miR-199a, miR-199a* | Liver cancer | Murakami et al., 2006; Meng et al., 2007; Gramantieri et al., 2007
miR-195, miR-125a, miR-200a, miR-15, miR-16 | B-CLL | Calin et al., 2004; Calin et al., 2002

A platform for studying miRNAs and cancerous target genes

[Figure: platform workflow. Inputs: NCI-60 cancer data (expression profiles of miRNA and mRNA) and TarBase data (experimentally verified miRNA-mRNA pairs). Output: miRNA-mRNA anti-correlation pairs.]

mRNA annotation: TAG (known OCG, TSG or CRG), OMIM disease genes, KEGG cancer pathways

miRNA annotation: miR2Disease (disease-related miRNAs), chromosomal fragile sites, miRNA cluster information, CpG island proximal miRNAs

Number of cell lines for the nine cancer types in the NCI-60 data sets

Cancer type        Breast  CNS  Colon  Lung  Leukemia  Melanoma  Ovarian  Prostate  Renal
No. of cell lines  5       6    7      9     6         10        7        2         8

miRNA, target gene, protein-protein interaction (PPI)

Tissue-specific miRNA or siRNA targets, and their PPI partners up to the second level.

If the upstream miRNA (or siRNA) is defective, its effect could be amplified downstream.

As an illustration, suppose that a miRNA (or siRNA) targets gene TG, which has two successive PPI partners, proteins L1 and L2. If genes TG and L2 are involved in the same disease, then it is highly probable that gene L1 is also related to that disease. This is quantified by enrichment analysis.

[Figure: miRNA or siRNA → protein (mRNA is suppressed) → protein → protein (TF) → protein. TG, L1 and L2 have x, y and z BP/MF annotations; the overlapping BP/MF counts are n1 (TG and L1) and n2 (L1 and L2).]

Input data and Methods

Databases:
- ArrayExpress: raw data files (U95Av2) for 64 prostate cancer tissue samples and 18 normal prostate tissue samples
- TAG (Tumor Associated Gene)
- NCI-60: miRNA and mRNA gene expression profiles for 9 cancer types
- TarBase: miRNA targets (experimentally verified)
- miR2Disease: a comprehensive resource of miRNA deregulation in various human diseases
- OMIM: human disease information
- KEGG: cancer pathway information
- ncRNAppi: a useful tool for identifying ncRNA target pathways
- PPI data (BioIR): seven databases are integrated: HPRD, DIP, BIND, IntAct, MIPS, MINT and BioGRID
- Gene Ontology (GO): biological process and molecular function annotations

Tool: Bioconductor

Research Protocol

Predict DEGs using R and Bioconductor commands (entered in the R environment):

library("affy")
library("limma")
# Read all CEL files in the working directory and normalise them with RMA
eset <- justRMA()
# Design matrix; assumes the 18 normal samples are read before the 64 tumour samples
design <- cbind(normal = c(rep(1, 18), rep(0, 64)), DM = c(rep(0, 18), rep(1, 64)))
fit <- lmFit(eset, design)
cont.matrix <- makeContrasts(DMvsNo = DM - normal, levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
# Top 100 probe sets with Benjamini-Hochberg adjusted p-values
top <- topTable(fit2, number = 100, adjust.method = "BH")
# Older limma versions return probe IDs in an ID column, newer ones in the row names
genenames <- if (!is.null(top$ID)) as.character(top$ID) else rownames(top)
adj.P_Val <- signif(top$adj.P.Val, digits = 3)
logFC <- signif(top$logFC, digits = 3)
library("XML")
annotation(eset)
library("annotate")
library("hgu95av2.db")
absts <- pm.getabst(genenames, "hgu95av2.db")
library("annaffy")
atab <- aafTableAnn(genenames, "hgu95av2.db", aaf.handler())
stattable <- aafTable("logFC" = logFC, "adj_P.Val" = adj.P_Val)
table <- merge(atab, stattable)
saveHTML(table, file = "report.html", title = "Significant gene list and its annotation information")

Results – DEGs predicted by Bioconductor

- The top 100 DEGs (either up- or down-regulated) are reported.
- After eliminating duplicated genes, the predicted total number of DEGs is 85, and the adjusted p-values of all DEGs are less than 1.9 × 10^-5.
- TAG ∩ DEGs: 14 known cancer genes among the 85 predicted DEGs (16.5%).
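The de-duplication and TAG overlap steps can be sketched with base R set operations. The gene symbols and the names `top_hits` and `tag_genes` are placeholders assumed for illustration; they are not the study's gene lists.

```r
# Hypothetical symbols for illustration only; not the study's gene lists.
top_hits  <- c("GENE1", "GENE2", "GENE2", "GENE3", "GENE4", "GENE4", "GENE5")
tag_genes <- c("GENE2", "GENE5", "GENE9")

# Several probe sets can map to the same gene, so duplicates are removed first.
degs <- unique(top_hits)

# DEGs that are already annotated as cancer genes in the reference list.
known <- intersect(degs, tag_genes)
fraction_known <- length(known) / length(degs)

# The percentage quoted on the slide follows from the reported counts alone.
slide_percentage <- round(100 * 14 / 85, 1)
print(c(n_degs = length(degs), n_known = length(known)))
```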

Results – miRNAs, DEGs and cancer types

Other DEGs

Results - The relationship among miR-20a, TGFBR2 and human prostate cancer

16461460

http://ppi.bioinfo.asia.edu.tw/R_cancer/



For a given cancer tissue type, we calculated both the Pearson correlation coefficient (PCC) and the Spearman rank correlation coefficient (SRC) between the expression profiles of a miRNA and its target gene. The PCC is given by

$$\mathrm{PCC} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $x_i$ and $y_i$ denote the expression intensities of the miRNA and the miRNA's target gene respectively, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of samples.

One drawback of quantifying the strength of correlation with the PCC is that it is susceptible to being skewed by outliers. A single outlying data point can make two genes appear correlated even when all the other data points are not. The SRC is a non-parametric statistical method that is robust to outliers.

The PCC and SRC are calculated for:

- Three Affymetrix chips: U95(A-E), U133A, U133B
- Normalization methods: GCRMA, MAS5, RMA
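The sensitivity of the PCC to a single outlier can be sketched in R with the built-in `cor()` function. The expression values are invented for illustration and are not NCI-60 data; `pcc_manual` is a hypothetical helper that spells out the PCC formula.

```r
# Hypothetical expression values across nine cell lines; not NCI-60 data.
mirna  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
target <- c(9, 8, 7, 6, 5, 4, 3, 2, 1)

# PCC written out from its definition, for comparison with cor().
pcc_manual <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}

pcc <- cor(mirna, target, method = "pearson")
src <- cor(mirna, target, method = "spearman")

# Replace one measurement with an extreme outlier and recompute both.
target_out <- target
target_out[9] <- 100
pcc_out <- cor(mirna, target_out, method = "pearson")
src_out <- cor(mirna, target_out, method = "spearman")

print(round(c(pcc = pcc, src = src, pcc_out = pcc_out, src_out = src_out), 3))
```

In this toy case the single outlier is enough to turn the PCC positive, while the SRC, which only uses ranks, stays negative.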

Test of hypothesis of PCC and SRC

The Pearson product-moment table is used to test the significance of a PCC result. The hypothesis test is one-tailed. A different test is applied for the SRC results.

Critical values for the one-tailed test using Pearson and Spearman correlation at significance levels α of 0.05 and 0.10.
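As an alternative to looking up critical values in a table, the same one-tailed tests can be sketched with `cor.test()` from base R, which returns a p-value directly. This swaps table lookup for a computed p-value and is not necessarily how the original analysis was run; the six expression values are invented.

```r
# Hypothetical expression values for one miRNA:target pair in six cell lines.
mirna  <- c(2.1, 3.4, 1.8, 4.0, 2.9, 3.7)
target <- c(5.2, 3.9, 5.8, 3.1, 4.6, 4.4)

# One-tailed alternative: the true correlation is less than zero.
pearson_test  <- cor.test(mirna, target, alternative = "less", method = "pearson")
spearman_test <- cor.test(mirna, target, alternative = "less", method = "spearman")

alpha <- 0.05
print(c(pcc = unname(pearson_test$estimate),  p = pearson_test$p.value))
print(c(src = unname(spearman_test$estimate), p = spearman_test$p.value))
significant <- pearson_test$p.value < alpha && spearman_test$p.value < alpha
```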

Results – hsa-miR-1:AXL, PCC and SRC calculations

Cases where both PCC and SRC are less than or equal to -0.5.
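The selection rule can be sketched as a simple filter over a table of pairs. The pair names and coefficient values below are placeholders, not results from the study.

```r
# Hypothetical miRNA:target pairs with precomputed coefficients.
pairs <- data.frame(
  mirna  = c("miR-A", "miR-B", "miR-C", "miR-D"),
  target = c("GENE1", "GENE2", "GENE3", "GENE4"),
  pcc    = c(-0.72, -0.55, 0.10, -0.61),
  src    = c(-0.65, -0.30, 0.05, -0.50),
  stringsAsFactors = FALSE
)

# Keep only pairs where both coefficients are less than or equal to -0.5.
cutoff <- -0.5
anti_correlated <- pairs[pairs$pcc <= cutoff & pairs$src <= cutoff, ]
print(anti_correlated)
```

Requiring both coefficients to pass guards against pairs whose PCC is driven by an outlier, as with the hypothetical miR-B:GENE2 pair here.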

Results – hsa-miR-10b:HOXD10

miR2Disease – diseases involving hsa-mir-10b, i.e. leukemia and breast, colon, ovarian and prostate cancers.

Other examples:
hsa-miR-21:PTEN (TSG)
hsa-miR-15b:BCL2 (TSG)
hsa-miR-16:BCL2 (TSG)

Extension - works in progress

- Validate how good the correlation-based prediction is
- Add further information:
  – CpG islands: miRNAs located around CpG islands (i.e., miR-34b, miR-137, miR-193a, and miR-203) are silenced by DNA hypermethylation in oral cancer
  – miRNA clusters, fragile sites
- Positively correlated miRNA:mRNA pairs may involve TFs

ncRNAppi – miRNA, target genes, PPI, and the protocol of enrichment analysis

There is a tendency for two directly interacting proteins to participate in the same biological process or share the same molecular function. Let a miRNA targeting pathway be denoted by miRNA – TG – L1 – L2. We propose to rank the pathways according to the number of overlapping biological processes (or molecular functions) between TG and L1, and between L1 and L2. The Jaccard coefficient (JC) is used to rank the significance of a pathway. The JC of sets A and B is defined by

$$JC(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

where $|A \cap B|$ and $|A \cup B|$ denote the cardinality of $A \cap B$ and $A \cup B$ respectively.
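A minimal sketch of the Jaccard coefficient in R, using base set operations. The function name `jaccard` and the GO-style identifiers are assumptions made for illustration; returning NA for two empty sets is a choice made here, not something the slides specify.

```r
# Jaccard coefficient of two annotation sets: |A intersect B| / |A union B|.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(NA_real_)  # undefined when both sets are empty
  length(intersect(a, b)) / union_size
}

# Hypothetical annotation sets for a target gene and its level-1 partner.
tg_terms <- c("GO:A", "GO:B", "GO:C")
l1_terms <- c("GO:B", "GO:C", "GO:D", "GO:E")

jc_tg_l1 <- jaccard(tg_terms, l1_terms)  # 2 shared terms out of 5 distinct terms
print(jc_tg_l1)
```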

[Figure: miRNA or siRNA → protein (mRNA is suppressed) → protein → protein (TF) → protein, with JC(TG,L1) and JC(L1,L2) marked on the TG – L1 and L1 – L2 segments.]

ncRNAppi – The protocol of enrichment analysis

The biological process (BP) and molecular function (MF) annotations are taken from Gene Ontology and are used to characterize the path TG – L1 – L2. The JC for the pathway is given by

$$JC_{BP}^{ave}(TG,L1,L2) = \frac{1}{2}\left[JC_{BP}(TG,L1) + JC_{BP}(L1,L2)\right]$$

where $JC_{BP}(TG,L1)$ and $JC_{BP}^{ave}(TG,L1,L2)$ denote the JC score of the biological process for segment TG – L1 and for the TG – L1 – L2 pathway respectively.
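The pathway score can be sketched as the mean of the two segment scores. The helper names `jaccard` and `pathway_jc` and the annotation sets are hypothetical; this only illustrates the averaging step and is not the ncRNAppi implementation.

```r
# Jaccard coefficient of two annotation sets.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Average JC over the two segments of a TG - L1 - L2 pathway.
pathway_jc <- function(tg, l1, l2) {
  0.5 * (jaccard(tg, l1) + jaccard(l1, l2))
}

# Hypothetical BP annotation sets for the three nodes.
tg_terms <- c("GO:A", "GO:B", "GO:C")
l1_terms <- c("GO:B", "GO:C", "GO:D", "GO:E")
l2_terms <- c("GO:C", "GO:D", "GO:E", "GO:F")

jc_ave <- pathway_jc(tg_terms, l1_terms, l2_terms)  # mean of 2/5 and 3/5
print(jc_ave)
```

Pathways would then be sorted by this value in decreasing order, so that paths whose consecutive nodes share more annotations come first.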

ncRNAppi – The protocol of enrichment analysis, p-value

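We assign a p-value to every JC calculation, which provides a measure of statistical significance. The p-value is estimated as follows. Let N be the total number of BP terms found in GO. Assume that TG, L1 and L2 have x, y and z BP annotations respectively. Also, let n1 and n2 be the number of identical BP terms for TG – L1 and L1 – L2 respectively. Let p1 and p2 be the probabilities that TG – L1 and L1 – L2 have n1 and n2 common BP (or MF) terms respectively, which are defined as

$$p_1 = \frac{C^{N}_{n_1}\, C^{N-n_1}_{x-n_1}\, C^{N-x}_{y-n_1}}{C^{N}_{x}\, C^{N}_{y}}$$

and

$$p_2 = \frac{C^{N}_{n_2}\, C^{N-n_2}_{y-n_2}\, C^{N-y}_{z-n_2}}{C^{N}_{y}\, C^{N}_{z}}$$

where $C^{a}_{b}$ denotes the number of ways of choosing b items from a.

[Figure: Venn diagram of the TG and L1 annotation sets within the N GO terms, with x − n1 terms unique to TG, n1 shared terms and y − n1 terms unique to L1.]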
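The probability p1 can be sketched in R with `choose()`. Algebraically the expression reduces to the hypergeometric probability of exactly n1 shared terms, so it can be cross-checked against `dhyper()`. The function name `overlap_prob` and the numbers used are assumptions for illustration only.

```r
# Probability that two annotation sets of sizes x and y, drawn from N terms,
# share exactly n terms (the combinatorial form of p1 and p2).
overlap_prob <- function(N, x, y, n) {
  choose(N, n) * choose(N - n, x - n) * choose(N - x, y - n) /
    (choose(N, x) * choose(N, y))
}

# Hypothetical sizes: 100 GO terms, TG has 10, L1 has 8, 3 are shared.
N <- 100; x <- 10; y <- 8; n1 <- 3
p1 <- overlap_prob(N, x, y, n1)

# Cross-check: exactly n1 of the y terms of L1 fall among the x terms of TG.
p1_hyper <- dhyper(n1, x, N - x, y)

# The probabilities over all possible overlaps sum to one.
total <- sum(sapply(0:min(x, y), function(n) overlap_prob(N, x, y, n)))
print(c(p1 = p1, p1_hyper = p1_hyper, total = total))
```

As defined on the slide, p1 is the probability of exactly n1 shared terms. Summing over overlaps of n1 or more would give a tail probability instead; that is a possible variant and not the definition used here.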

ncRNAppi – Extension of TarBase targets

Limitations of miRNA target prediction tools

Many tools are available for miRNA target gene prediction, such as miRanda, TargetScan and RNAhybrid. A major problem is that the prediction accuracy remains uncertain: one report indicated that the false positive rate could be as high as 24-39% for miRanda and 22-31% for TargetScan. If the miRNA:mRNA targeting step is uncertain, then the 'Level 1' and 'Level 2' protein-protein interaction pathways derived from the PPI databases are doubtful.

ncRNAppi – Extension of TarBase targets

miRNA target prediction tool – miRanda

Mature human miRNA FASTA sequences are downloaded from miRBase (the latest version is 13).

We then predict the possibility of miRNA binding to OCGs and TSGs. The target prediction tool miRanda allows fine-tuning of certain parameters, i.e. the MFE threshold, score, shuffle statistics, and gap open and gap extension scores. We set the MFE threshold to -25 kcal/mol and switch the shuffle statistics ON. The rest of the parameters are left at their default values. Once the binding lists for OCGs and TSGs are obtained, their PPI pathways can be retrieved from the BioIR database.

ncRNAppi provides web-based data access and allows disease assignment for a specific node along miRNA (siRNA) targeting pathways. For example:
- Select the miRNA ID hsa-let-7
- Check the 'OMIM Disease type for individual node' boxes labeled 'Target' and 'Level-2'
- Choose the item 'lung tumor' under the 'TUMOR TYPE' pull-down menu (OMIM)
- Select 'Yes' under 'Common expression of target, Level-1 and level-2 nodes in KEGG'
- Pathways are ranked according to the Jaccard index and p-value for BP or MF

Results - ncRNAppi

Example
1) hsa-let7
2) Unigene: liver
3) Target, L1 and L2 are OCG
4) Submit

Summary

R and Bioconductor are used to predict DEGs from prostate cancer microarray data. By integrating the Tumor Associated Gene (TAG), ncRNAppi and miR2Disease databases, it is found that certain DEGs are regulated by microRNAs.

A platform for studying miRNAs and cancer target genes
(1) PCC and SRC results are used to quantify the correlation between the expression profiles of a miRNA and its target. The predicted results are annotated with reference to the TAG, OMIM, miR2Disease and KEGG data sets.
(2) The main advantage of the two platforms with respect to miRNA-mRNA targeting information is that all the target gene information and disease records are experimentally verified.

ncRNAppi platform
ncRNAppi provides a powerful tool for identifying cancer-related miRNAs or siRNAs. For instance, the tool makes it possible to predict novel cancer genes through tissue- or disease-specific searches. This platform is useful for investigating the regulatory role of miRNAs and siRNAs in cancer studies.

Acknowledgement

National Science Foundation

Professor S.C. Lee (李尚熾 ) - Chung Shan Medical University

Mr. Liu Hsueh-Chuan (劉學銓 ) – former graduate student at Asia University

Mr. C.W. Weng (翁嘉偉 ) – former graduate student at Asia University

Mr. Kevin Lo (羅琮傑 ) – MSc. graduate student at Asia University

Thank you for your attention.
