In silico study of cancer-related genes and microRNAs
(Using microarrays to screen for cancer genes and to investigate their upstream regulatory microRNAs)
Ka-Lok Ng (吳家樂)
Department of Biomedical Informatics (生物與醫學資訊學系), Asia University
Contents
Motivation
  - Predict cancer genes based on microarray mRNA expression levels
  - microRNA (miRNA) can act as an oncogene (OCG) or tumor suppressor gene (TSG)
  - Identify cancer-related miRNAs, their target genes, and downstream protein-protein interactions (prediction of novel cancerous proteins)
(1) Introduction – microarray, cancer, microRNA
(2) Methods – input data
(3) Results
  (a) cancer genes prediction (Bioconductor), i.e. prostate/breast cancer
  (b) correlation study of miRNA and mRNA expression levels
  (c) ncRNAppi – a platform for studying microRNAs and their target genes’ protein-protein interactions
(4) Summary
Central dogma of molecular biology
Post-transcriptional regulation – microRNA targets mRNA
transcriptome
Types of RNAs
RNA
  mRNA
  ncRNA: Non-coding RNA. Transcribed RNA with a structural, functional or catalytic role
    rRNA: Ribosomal RNA. Participates in protein synthesis
    tRNA: Transfer RNA. Interface between mRNA and amino acids
    snRNA: Small nuclear RNA. Includes RNA that forms part of the spliceosome
    snoRNA: Small nucleolar RNA. Found in the nucleolus, involved in modification of rRNA
    miRNA: MicroRNA. Small RNA involved in regulation of expression
    siRNA: Small interfering RNA. Active molecules in RNA interference
    stRNA: Small temporal RNA. RNA with a role in developmental timing
    Other: Including large RNA with roles in chromatin structure and imprinting
Introduction
Cancer formation, and summary of the ten leading causes of cancer death in Taiwan in 2008 (ROC year 97)
Rank  Cause of death              Deaths   Percentage
-     All cancer types            38,913   100%
1     Lung cancer                 7,777    20.0%
2     Hepatocellular carcinoma    7,651    19.7%
3     Colorectal cancer           4,266    11.0%
4     Female breast cancer        1,541    4.0%
5     Gastric cancer              2,292    5.9%
6     Oral cavity cancer          2,218    5.7%
7     Prostate cancer             892      2.3%
8     Cervical cancer             710      1.8%
9     Esophageal cancer           1,433    3.7%
10    Pancreatic cancer           1,364    3.5%
Microarray – overview
[Figure: two-color microarray. Labels: cDNA labeled by Cy3 (green); cDNA labeled by Cy5 (red); probe genes; target. By Hanne Jarmer, BioCentrum-DTU, Technical University of Denmark]
Microarrays are used to measure gene expression levels in two different conditions. A green label is used for the control sample and a red label for the experimental sample.
DNA-cDNA or DNA-mRNA hybridization.
The hybridised microarray is excited by a laser and scanned at the appropriate wavelengths for the red and green dyes.
Amount of fluorescence emitted (intensity) upon laser excitation ~ amount of mRNA bound to each spot.
If the transcript is more abundant in the control (experimental) sample, the spot appears green (red), which indicates the relative amount of transcript for that mRNA (EST) in the two samples.
If both are equally abundant, the spot appears yellow.
If neither is present, the spot appears black.
cDNA microarrays
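The colour reading described above can be sketched in R as a log2 ratio of the two channel intensities. This is only an illustration: the intensities, the function name spot_colour, the minimum-signal cut-off and the two-fold threshold are all assumptions made for the example and are not taken from the presentation.

```r
# Hypothetical background-corrected spot intensities (not real data)
cy3 <- c(1200, 300, 800, 20)   # green channel, control sample
cy5 <- c(300, 1200, 800, 25)   # red channel, experimental sample

# Classify each spot by its log2(red/green) ratio.
# min_signal and fold are illustrative thresholds, not values from the slides.
spot_colour <- function(cy3, cy5, min_signal = 100, fold = 2) {
  m <- log2(cy5 / cy3)
  ifelse(pmax(cy3, cy5) < min_signal, "black",      # neither transcript present
    ifelse(m >= log2(fold), "red",                  # more abundant in experimental sample
      ifelse(m <= -log2(fold), "green", "yellow"))) # more abundant in control / roughly equal
}

log_ratio <- log2(cy5 / cy3)
colours <- spot_colour(cy3, cy5)
print(data.frame(cy3, cy5, log_ratio, colours))
```

A log ratio is used rather than a raw ratio so that two-fold up- and down-regulation are symmetric around zero.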
Microarray data generation, processing and analysis
Information processing
  - Image quantitation – locating the spots and measuring their fluorescence intensities
  - Data normalization and integration – construction of the gene expression matrix from sets of spots
  - Gene expression data analysis and mining – finding differentially expressed genes (DEGs) or clusters of similarly expressed genes
  - These analyses generate new hypotheses about the underlying biological processes, which in turn should be tested in follow-up experiments
[Figure: image analysis; data analysis (clustering). Source: http://www.mathworks.com/company/pressroom/image_library/biotech.html]
Introduction – biogenesis of microRNA
miRNA gene → pri-miRNA (stem-loop structure) → processed by Drosha → pre-miRNA (65~90 nt) → carried by Exportin 5 to the cytoplasm → mature miRNA (20~25 nt) is generated by the RNase III type enzyme Dicer → directed by RISC to the miRNA target → mRNA is cleaved or its translation into protein is impeded
Introduction – miRNAs can play the role of an OCG or a TSG
When a miRNA plays an oncogenic role, it targets TSGs or genes that control cell differentiation or apoptosis, which leads to tumor formation.
When a miRNA plays a tumor suppressor role, it targets OCGs or genes that control cell differentiation or apoptosis, so it can suppress tumor formation.
We expect a negative correlation between miRNA and mRNA expression profiles.
We integrate the human miRNA-targeted (or siRNA-targeted) mRNA data, protein-protein interaction (PPI) records, tissue, pathway, and disease information to establish a disease-related miRNA (or siRNA) pathway database.
Introduction – cancer-related miRNAs
Cancer-related miRNA | Cancer type | References
miR-17-92 cluster, let-7 | Lung cancer | Martin et al., 2006; Yanaihara et al., 2006; Takamizawa et al., 2004
miR-10b, miR-21, miR-125b, miR-145, miR-155 | Breast cancer | Iorio et al., 2005; Si et al., 2007
miR-18, miR-122a, miR-224, miR-199a, miR-199a* | Liver cancer | Murakami et al., 2006; Meng et al., 2007; Gramantieri et al., 2007
miR-195, miR-125a, miR-200a, miR-15, miR-16 | B-CLL | Calin et al., 2004; Calin et al., 2002
A platform for studying miRNAs and cancerous target genes
[Figure: platform workflow. Labels: miRNA; mRNA; miRNA-mRNA anti-correlation pairs]
NCI-60 cancer data: expression profiles of miRNA and mRNA
TarBase data: experimentally verified miRNA-mRNA pairs
Annotation: TAG known OCG, TSG or CRG; OMIM disease genes; KEGG cancer pathways
Annotation: miR2Disease (disease-related miRNA); chromosomal fragile sites; miRNA cluster information; CpG island proximal miRNA
Number of cell lines for the nine cancer types in the NCI-60 data sets
Cancer type        Breast  CNS  Colon  Lung  Leukemia  Melanoma  Ovarian  Prostate  Renal
No. of cell lines  5       6    7      9     6         10        7        2         8
miRNA, target gene, protein-protein interaction (PPI)
Tissue-specific miRNA or siRNA targets and their PPI partners, up to the second level.
If the upstream miRNA (or siRNA) is defective, its effect could be amplified downstream.
As an illustration, suppose that a miRNA (or siRNA) targets gene TG, which has two successive PPI partners, proteins L1 and L2. If genes TG and L2 are involved in the same disease, then it is highly probable that gene L1 is also related to that disease. This is quantified by enrichment analysis.
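[Figure labels: miRNA or siRNA; protein (mRNA is suppressed); protein; protein (TF); protein]
Number of BP/MF annotations: TG has x, L1 has y, L2 has z
Number of overlapping BP/MF annotations: n1 for TG and L1, n2 for L1 and L2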
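The idea of following a target gene to its first- and second-level PPI partners can be sketched in R as below. The edge list and the protein names (TG, P1 to P4) are hypothetical, and the helper functions partners and second_level_paths are illustrative names; this is not the ncRNAppi implementation.

```r
# Hypothetical undirected PPI edge list (not taken from BioIR)
ppi <- data.frame(
  a = c("TG", "TG", "P1", "P2", "P2"),
  b = c("P1", "P2", "P3", "P3", "P4"),
  stringsAsFactors = FALSE
)

# All direct interaction partners of one protein
partners <- function(ppi, protein) {
  unique(c(ppi$b[ppi$a == protein], ppi$a[ppi$b == protein]))
}

# Enumerate TG - L1 - L2 paths: L1 interacts with TG, L2 interacts with L1
second_level_paths <- function(ppi, tg) {
  out <- list()
  for (l1 in partners(ppi, tg)) {
    for (l2 in setdiff(partners(ppi, l1), c(tg, l1))) {
      out[[length(out) + 1]] <- c(TG = tg, L1 = l1, L2 = l2)
    }
  }
  do.call(rbind, out)  # NULL when no path exists
}

paths <- second_level_paths(ppi, "TG")
print(paths)
```

Each row of the result is one candidate pathway whose TG, L1 and L2 annotations can then be compared.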
Input data and Methods
Databases:
  - ArrayExpress: raw data files (U95Av2) of 64 prostate cancer tissue samples and 18 normal prostate tissue samples
  - TAG (Tumor Associated Gene)
  - NCI-60: miRNA and mRNA gene expression profiles for 9 cancer types
  - TarBase: miRNA targets (experimentally verified)
  - miR2Disease: a comprehensive resource of miRNA deregulation in various human diseases
  - OMIM: human disease information
  - KEGG: cancer pathway information
  - ncRNAppi: a useful tool for identifying ncRNA target pathways
  - PPI data (BioIR): seven databases are integrated: HPRD, DIP, BIND, IntAct, MIPS, MINT and BioGRID
  - Gene Ontology (GO): Biological Process and Molecular Function annotations
Tool: Bioconductor
Research Protocol
Commands entered in the R environment:
# Load the packages for Affymetrix preprocessing and linear modelling
library("affy")
library("limma")
# Read all CEL files in the working directory and normalize them with RMA
eset <- justRMA()
# Design matrix; assumes the 18 normal samples are read before the 64 prostate cancer (DM) samples
design <- cbind(normal = c(rep(1, 18), rep(0, 64)), DM = c(rep(0, 18), rep(1, 64)))
fit <- lmFit(eset, design)
# Contrast: cancer versus normal
cont.matrix <- makeContrasts(DMvsNo = DM - normal, levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
# Top 100 probe sets with Benjamini-Hochberg adjusted p-values
top <- topTable(fit2, number = 100, adjust.method = "BH")
top
# Probe set IDs: an ID column in older limma releases, the row names otherwise
genenames <- if (is.null(top$ID)) rownames(top) else as.character(top$ID)
adj.P_Val <- signif(top$adj.P.Val, digits = 3)
logFC <- signif(top$logFC, digits = 3)
# Annotate the probe sets and write an HTML report
library("XML")
annotation(eset)
library("annotate")
library("hgu95av2.db")
absts <- pm.getabst(genenames, "hgu95av2.db")
library("annaffy")
atab <- aafTableAnn(genenames, "hgu95av2.db", aaf.handler())
stattable <- aafTable("logFC" = logFC, "adj_P.Val" = adj.P_Val)
report <- merge(atab, stattable)
saveHTML(report, "report.html", title = "Significant gene list and its annotation information")
Predict DEGs using R and Bioconductor commands
Results – DEGs predicted by Bioconductor
The top 100 DEGs (either up- or down-regulated) were obtained.
After eliminating duplicated genes, the predicted total number of DEGs is 85, and the adjusted p-values of all DEGs are less than 1.9 × 10^-5.
TAG ∩ DEGs: 14 known cancer genes among the 85 predicted DEGs (16.5%)
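The de-duplication and overlap step can be sketched in R as follows. The gene symbols are placeholders (GENE1, GENE2, ...) rather than real results, and the vector names are assumptions for the example; only the 14-of-85 arithmetic comes from the slide.

```r
# Hypothetical gene symbols for the top probe sets; several probe sets can map to one gene
deg_symbols <- c("GENE1", "GENE2", "GENE2", "GENE3", "GENE4", "GENE4")
# Hypothetical list of known cancer genes (standing in for the TAG database)
tag_genes <- c("GENE2", "GENE4", "GENE9")

degs <- unique(deg_symbols)             # eliminate duplicated genes
overlap <- intersect(degs, tag_genes)   # TAG ∩ DEGs
pct <- 100 * length(overlap) / length(degs)
cat(length(overlap), "of", length(degs), "DEGs are known cancer genes:", pct, "%\n")

# The percentage reported on the slide: 14 known cancer genes among 85 DEGs
slide_pct <- round(100 * 14 / 85, 1)
```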
Results – miRNAs, DEGs and cancer types
Other DEGs
Results - The relationship among miR-20a, TGFBR2 and human prostate cancer
16461460
http://ppi.bioinfo.asia.edu.tw/R_cancer/
A platform for studying miRNAs and cancerous target genes
For a given cancer tissue type, we calculated both the Pearson correlation coefficient (PCC) and the Spearman rank correlation coefficient (SRC) between the expression profiles of a miRNA and its target gene. The PCC is given by

\[ \mathrm{PCC} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

where x_i and y_i denote the expression intensity of the miRNA and of the miRNA's target gene respectively, and \bar{x} and \bar{y} are their means over the n samples.
One drawback of quantifying the strength of correlation with the PCC is that it is easily skewed by outliers. A single outlying data point can make two genes appear to be correlated even when all the other data points are not. The SRC is a non-parametric statistical method that is robust to outliers.
The PCC and SRC are calculated for:
Three Affymetrix chips: U95(A-E), U133A, U133B
Normalization methods: GCRMA, MAS5, RMA
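The PCC formula and its sensitivity to outliers can be illustrated with a short R sketch. The expression values are made up for the example, and pcc_manual is a hypothetical helper name; base R's cor() is used for the SRC and as a cross-check.

```r
# PCC written out term by term, following the formula above
pcc_manual <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# Hypothetical expression values: five essentially uncorrelated points
x <- c(1, 2, 3, 4, 5)
y <- c(2, 5, 1, 4, 3)
pcc_base <- pcc_manual(x, y)

# Add a single outlying data point to both profiles
x_out <- c(x, 30)
y_out <- c(y, 30)
pcc_out <- pcc_manual(x_out, y_out)                 # inflated by the outlier
src_out <- cor(x_out, y_out, method = "spearman")   # rank-based, much less affected

cat("PCC without outlier:", round(pcc_base, 3), "\n")
cat("PCC with outlier:   ", round(pcc_out, 3), "\n")
cat("SRC with outlier:   ", round(src_out, 3), "\n")
```

In this toy example one extra point moves the PCC from about 0.1 to above 0.95, while the SRC stays below 0.5, which is the behaviour that motivates reporting both coefficients.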
Test of hypothesis of PCC and SRC
The Pearson product-moment table is used to test the significance of a PCC result. The hypothesis being tested is one-tailed. A different test is applied to the SRC results.
Critical values for the one-tailed test using Pearson and Spearman correlation at a significance level α equal to 0.05 and 0.10.
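As a sketch of where such critical values come from, the Pearson critical value for a one-tailed test can be derived from the t distribution with n - 2 degrees of freedom. Note that this substitutes a computed value and base R's cor.test() for the printed tables used in the presentation; the function name pearson_critical and the expression values are assumptions, while the cell-line counts are those of the NCI-60 table.

```r
# One-tailed critical value of the PCC: r = t / sqrt(t^2 + n - 2), t from Student's t with n - 2 df
pearson_critical <- function(n, alpha = 0.05) {
  if (n < 3) return(NA_real_)   # undefined with fewer than 3 samples
  t_crit <- qt(1 - alpha, df = n - 2)
  t_crit / sqrt(t_crit^2 + n - 2)
}

# Number of cell lines per cancer type in the NCI-60 data
n_cell_lines <- c(Breast = 5, CNS = 6, Colon = 7, Lung = 9, Leukemia = 6,
                  Melanoma = 10, Ovarian = 7, Prostate = 2, Renal = 8)
crit_005 <- sapply(n_cell_lines, pearson_critical, alpha = 0.05)
crit_010 <- sapply(n_cell_lines, pearson_critical, alpha = 0.10)
print(round(rbind(alpha_0.05 = crit_005, alpha_0.10 = crit_010), 3))

# Hypothetical miRNA and target expression values; test for negative correlation
mirna  <- c(8.1, 7.4, 6.9, 6.2, 5.8, 5.1)
target <- c(4.9, 5.6, 6.0, 6.8, 7.1, 7.9)
pearson_test  <- cor.test(mirna, target, alternative = "less", method = "pearson")
spearman_test <- cor.test(mirna, target, alternative = "less", method = "spearman")
```

For an anti-correlated pair the result is significant when the PCC is at or below the negative of the critical value. With only two prostate cell lines no test is possible, which the sketch reports as NA.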
Results – hsa-miR-1:AXL, PCC and SRC calculations
Cases where both PCC and SRC are less than or equal to -0.5.
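The screening rule (both PCC and SRC less than or equal to -0.5) can be sketched in R as below. The expression matrices and the names miR_A, miR_B, gene_X and gene_Y are hypothetical and do not correspond to NCI-60 measurements.

```r
# Hypothetical expression matrices: rows are molecules, columns are cell lines
mirna_expr <- rbind(miR_A = c(9, 8, 7, 6, 5, 4),
                    miR_B = c(5, 6, 5, 6, 5, 6))
mrna_expr  <- rbind(gene_X = c(2, 3, 4, 6, 7, 9),
                    gene_Y = c(1, 5, 2, 6, 3, 7))

# Hypothetical list of verified miRNA:target pairs to be screened
pairs <- data.frame(mirna = c("miR_A", "miR_B"),
                    target = c("gene_X", "gene_Y"),
                    stringsAsFactors = FALSE)

# Correlate each miRNA profile with its target profile
pair_cor <- function(m, g, method) cor(mirna_expr[m, ], mrna_expr[g, ], method = method)
pairs$pcc <- unname(mapply(pair_cor, pairs$mirna, pairs$target, MoreArgs = list(method = "pearson")))
pairs$src <- unname(mapply(pair_cor, pairs$mirna, pairs$target, MoreArgs = list(method = "spearman")))

# Keep only pairs that are anti-correlated by both measures
hits <- pairs[pairs$pcc <= -0.5 & pairs$src <= -0.5, ]
print(hits)
```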
Results – hsa-miR-10b:HOXD10
miR2Disease: diseases initiated by hsa-miR-10b, i.e. leukemia and breast, colon, ovarian and prostate cancers.
Other examples:
hsa-miR-21:PTEN (TSG)
hsa-miR-15b:BCL2 (TSG)
hsa-miR-16:BCL2 (TSG)
Extension - works in progress
Validate how good the correlation prediction is
Add further information
– CpG islands: miRNAs located around CpG islands (i.e., miR-34b, miR-137, miR-193a, and miR-203) are silenced by DNA hypermethylation in oral cancer
– miRNA clusters, fragile sites
Positively correlated miRNA:mRNA pairs may involve TFs
ncRNAppi – miRNA, target genes, PPI, and the protocol of enrichment analysis
Two directly interacting proteins tend to participate in the same biological process or share the same molecular function. Let a miRNA targeting pathway be denoted by miRNA – TG – L1 – L2. We propose to rank the pathways according to the number of overlapping biological processes (or molecular functions) between TG and L1, and between L1 and L2. The Jaccard coefficient (JC) is used to rank the significance of a pathway. The JC of sets A and B is defined by

\[ JC(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

where |A ∩ B| and |A ∪ B| denote the cardinality of A ∩ B and A ∪ B respectively.
[Figure labels: miRNA or siRNA; protein (mRNA is suppressed); protein; protein (TF); protein; JC(TG,L1); JC(L1,L2)]
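A minimal R sketch of the Jaccard coefficient for two annotation sets follows. The term labels (BP_a, BP_b, ...) are placeholders rather than real GO identifiers, and returning 0 for two empty sets is a convention chosen for this example.

```r
# Jaccard coefficient of two annotation sets: |A ∩ B| / |A ∪ B|
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  u <- union(a, b)
  if (length(u) == 0) return(0)   # assumption: two empty sets score 0
  length(intersect(a, b)) / length(u)
}

# Hypothetical biological-process annotations for the target gene and its partner
tg_bp <- c("BP_a", "BP_b", "BP_c")
l1_bp <- c("BP_a", "BP_c", "BP_d", "BP_e")

jc_tg_l1 <- jaccard(tg_bp, l1_bp)   # 2 shared terms out of 5 distinct terms
print(jc_tg_l1)
```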
ncRNAppi – The protocol of enrichment analysis
The biological process (BP) and molecular function (MF) annotations are taken from Gene Ontology and are used to characterize the path TG – L1 – L2. The JC for the pathway is given by

\[ JC_{BP}^{ave}(TG, L1, L2) = \frac{1}{2} \left[ JC_{BP}(TG, L1) + JC_{BP}(L1, L2) \right] \]

where JC_{BP}(TG, L1) and JC_{BP}^{ave}(TG, L1, L2) denote the JC score of the biological process for the segment TG – L1 and for the TG – L1 – L2 pathway respectively.
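The averaged pathway score can be sketched in R as below. The annotation sets are hypothetical placeholders and the function names jaccard and jc_ave are illustrative.

```r
# Jaccard coefficient of two annotation sets
jaccard <- function(a, b) {
  u <- union(a, b)
  if (length(u) == 0) return(0)
  length(intersect(a, b)) / length(u)
}

# Average JC over the two segments TG - L1 and L1 - L2
jc_ave <- function(tg, l1, l2) {
  0.5 * (jaccard(tg, l1) + jaccard(l1, l2))
}

# Hypothetical BP annotation sets for one pathway
tg_bp <- c("BP_a", "BP_b", "BP_c")
l1_bp <- c("BP_a", "BP_c", "BP_d", "BP_e")
l2_bp <- c("BP_d", "BP_e")

score <- jc_ave(tg_bp, l1_bp, l2_bp)   # 0.5 * (2/5 + 2/4)
print(score)
```

Pathways can then be sorted by this score in decreasing order to obtain the ranking.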
ncRNAppi – The protocol of enrichment analysis, p-value
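We assign a p-value to every JC calculation, which provides a measure of its statistical significance. The p-value is estimated as follows. Let N be the total number of BP terms found in GO. Assume that TG, L1 and L2 have x, y and z BP annotations respectively. Also, let n1 and n2 be the number of identical BP terms for TG – L1 and L1 – L2 respectively. Let p1 and p2 be the probabilities that TG – L1 and L1 – L2 have n1 and n2 common BP (or MF) terms respectively, which are defined as

\[ p_1 = \frac{\binom{N}{n_1} \binom{N - n_1}{x - n_1} \binom{N - x}{y - n_1}}{\binom{N}{x} \binom{N}{y}} \]

and

\[ p_2 = \frac{\binom{N}{n_2} \binom{N - n_2}{y - n_2} \binom{N - y}{z - n_2}}{\binom{N}{y} \binom{N}{z}} \]

[Figure: Venn diagram of the TG and L1 annotation sets within the N GO terms; x - n1 terms belong to TG only, n1 terms are shared, and y - n1 terms belong to L1 only]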
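The probability p1 can be evaluated in R directly from the binomial coefficients, as sketched below with small made-up numbers. The function name overlap_prob is hypothetical. The expression is algebraically the hypergeometric probability, so base R's dhyper() gives the same value and is the numerically safer choice when N is in the thousands.

```r
# Probability that two annotation sets of sizes x and y, drawn from N terms,
# share exactly n terms (the p1 formula written with choose())
overlap_prob <- function(N, x, y, n) {
  choose(N, n) * choose(N - n, x - n) * choose(N - x, y - n) /
    (choose(N, x) * choose(N, y))
}

# Hypothetical example: N = 10 terms, TG has x = 4, L1 has y = 3, n1 = 2 shared
p1 <- overlap_prob(N = 10, x = 4, y = 3, n = 2)

# Same quantity as a hypergeometric probability: n successes when drawing y terms
# from a pool containing x "TG terms" and N - x other terms
p1_hyper <- dhyper(2, m = 4, n = 10 - 4, k = 3)

# The probabilities over all possible overlaps sum to one
total <- sum(overlap_prob(10, 4, 3, 0:3))
print(c(p1 = p1, p1_hyper = p1_hyper, total = total))
```

p2 is obtained in the same way with (y, z, n2) in place of (x, y, n1).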
ncRNAppi – Extension of TarBase targets
Limitations of miRNA target prediction tools
Many tools are available for predicting miRNA target genes, such as miRanda, TargetScan, and RNAhybrid. A major problem is that the prediction accuracy remains uncertain: one report indicated that the false positive rate could be as high as 24-39% for miRanda and 22-31% for TargetScan. If the miRNA:mRNA targeting step is uncertain, then the ‘Level 1’ and ‘Level 2’ protein-protein interaction pathways derived from the PPI database are doubtful.
ncRNAppi – Extension of TarBase targets
miRNA target prediction tool – miRanda
Mature human miRNA FASTA sequences are downloaded from miRBase (the latest version is 13).
Then, we predict the possibility of miRNA binding to OCGs and TSGs. The target prediction tool, miRanda, allows fine tuning of certain parameters, i.e. the MFE threshold, score, shuffle statistics, and gap open and gap extension scores. We set the MFE threshold to -25 kcal/mol and the shuffle statistics to ON. The rest of the parameters are set to their default values. Once the binding lists of OCGs and TSGs are obtained, their PPI pathways can be retrieved from the BioIR database.
ncRNAppi provides web-based data access and allows disease assignment for a specific node along miRNA (siRNA) targeting pathways. For example:
  - Select the miRNA ID hsa-let-7
  - Check the ‘OMIM Disease type for individual node’ boxes labeled ‘Target’ and ‘Level-2’
  - Choose the item ‘lung tumor’ under the ‘TUMOR TYPE’ pull-down menu (OMIM)
  - Select ‘Yes’ under ‘Common expression of target, Level-1 and level-2 nodes in KEGG’
Pathways are ranked according to the Jaccard index and p-value for BP or MF.
Results - ncRNAppi
Example
1) hsa-let-7
2) Unigene: liver
3) Target, L1 and L2 are OCG
4) submit
Summary
R and Bioconductor are used to predict DEGs from prostate cancer microarray data. By integrating the Tumor Associated Gene (TAG), ncRNAppi and miR2Disease databases, we found that certain DEGs are regulated by microRNAs.
A platform for studying miRNAs and cancer target genes
(1) PCC and SRC results are used to quantify the correlation between the expression profiles of a miRNA and its target. The predicted results are annotated with reference to the TAG, OMIM, miR2Disease and KEGG data sets.
(2) The main advantage of the two platforms is that all the miRNA-mRNA target gene information and disease records are experimentally verified.
ncRNAppi platform
ncRNAppi provides a powerful tool for identifying cancer-related miRNAs or siRNAs. For instance, the tool allows novel cancer genes to be predicted through tissue- or disease-specific searches. This platform is useful for investigating the regulatory role of miRNAs and siRNAs in cancer studies.
Acknowledgement
National Science Foundation
Professor S.C. Lee (李尚熾 ) - Chung Shan Medical University
Mr. Liu Hsueh-Chuan (劉學銓 ) – former graduate student at Asia University
Mr. C.W. Weng (翁嘉偉 ) – former graduate student at Asia University
Mr. Kevin Lo (羅琮傑 ) – MSc. graduate student at Asia University
Thank you for your attention.