![Page 1: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/1.jpg)
Mining Public Data for Insights into Human Disease
11/16/2009
Baliga Lab Meeting
Chris Plaisier
![Page 2: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/2.jpg)
Utility of Gene Expression for Human Disease
![Page 3: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/3.jpg)
Microarray Technology
![Page 4: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/4.jpg)
Big Picture
![Page 5: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/5.jpg)
Data Access
![Page 6: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/6.jpg)
Gene Expression Microarray Repositories
• Gene Expression Omnibus (GEO)
  Hosted by: NCBI
  Platform: All accepted
  Normalization: Experiment-by-experiment basis
  Access: R (GEOquery), EUtils
  Meta-Information: GEOmetadb
• ArrayExpress
  Hosted by: EMBL
  Platform: All accepted
  Normalization: Experiment-by-experiment basis
  Access: Web interface, EMBL API
  Meta-Information: ? (API)
• Many smaller repositories have more phenotypic information for specific diseases
  Phenotypic information may be hard to access
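As a rough illustration of the EUtils access route listed for GEO, the sketch below builds an esearch query URL against the GEO DataSets database (`db=gds`). The search term and the helper name `geo_search_url` are assumptions made for this example, and nothing is actually downloaded.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term, retmax=20):
    """Build an esearch URL that queries the GEO DataSets database (db=gds)."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Hypothetical query: human glioma records in GEO.
url = geo_search_url("glioma AND Homo sapiens[Organism]")
print(url)
```

The returned URL can be fetched with any HTTP client; the JSON response lists matching record IDs.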
![Page 7: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/7.jpg)
Gene Expression Omnibus
![Page 8: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/8.jpg)
Samples Per Platform in GEO
HGU133 Plus 2.0
HGU133A
Latest 3’ Affymetrix Array
Affymetrix arrays account for ~67% of human gene expression data in public repositories.
![Page 9: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/9.jpg)
Affymetrix Probesets
[Figure labels: Probe (25 nucleotides); Probe Pair (Perfect Match and Mismatch); Probeset (11 Probe Pairs); GeneChip U133 Plus 2.0 Array (image stored as CEL file), >54,000 Probesets]
![Page 10: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/10.jpg)
Pre-Processing 101
![Page 11: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/11.jpg)
Pre-Processing Gene Expression Data
![Page 12: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/12.jpg)
Removing Mis-Targeted and Non-Specific Probes
CEL File + CDF File → Intensities
Normally the CDF file comes from Affymetrix
Zhang, et al. 2005
CEL File + Alt CDF File → Intensities
Alternative CDF file thoroughly cleaned
![Page 13: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/13.jpg)
Pre-Processing Gene Expression Data
![Page 14: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/14.jpg)
What Makes Cells Different?
![Page 15: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/15.jpg)
PANP: Presence/Absence Filtering
• Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution
  NSMP probesets are designed to hybridize to the opposite strand from the expressed strand
• Use the background distribution from these NSMPs to threshold the entire dataset
• Output is a call for each array for each gene
  Calls are:
  • P = Presence
  • M = Marginal
  • A = Absence
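The thresholding idea can be sketched as follows. This is not the PANP package itself (which is an R package); it is a minimal stand-in that assumes an empirical p-value computed against the NSMP intensities, and the two cutoffs (0.01 and 0.02) and the function name `pa_calls` are assumptions chosen for illustration.

```python
from bisect import bisect_left


def pa_calls(intensities, nsmp_intensities, tight=0.01, loose=0.02):
    """Return a P/M/A call per probeset from an empirical NSMP background."""
    background = sorted(nsmp_intensities)
    n = len(background)
    calls = {}
    for probeset, value in intensities.items():
        # Empirical p-value: fraction of NSMP intensities >= observed value.
        p = (n - bisect_left(background, value)) / n
        if p < tight:
            calls[probeset] = "P"   # clearly above background
        elif p < loose:
            calls[probeset] = "M"   # marginal
        else:
            calls[probeset] = "A"   # indistinguishable from background
    return calls


# Toy background of 100 NSMP intensities (1.0 .. 100.0), made up for the example.
nsmp = [float(i) for i in range(1, 101)]
calls = pa_calls({"geneA": 150.0, "geneB": 99.5, "geneC": 50.0}, nsmp)
print(calls)
```

Running this once per array gives the per-array, per-gene call matrix described above.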
![Page 16: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/16.jpg)
Identifying Present Genes
• Filter out genes that are absent in ≥ 50% of arrays
  Whole dataset
  Subsets
• Only present genes are utilized in future analyses
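A minimal sketch of the presence filter, assuming the calls are stored as a list of "P"/"M"/"A" strings per gene and that marginal calls do not count as absent (the slide does not say how "M" is treated). The name `present_genes` is hypothetical.

```python
def present_genes(calls_by_gene, max_absent_fraction=0.5):
    """Keep genes whose fraction of 'A' calls is below the cutoff."""
    kept = []
    for gene, calls in calls_by_gene.items():
        absent = sum(1 for c in calls if c == "A") / len(calls)
        # Genes that are >= 50% absent are filtered out.
        if absent < max_absent_fraction:
            kept.append(gene)
    return kept


# Toy calls for four arrays, made up for the example.
toy_calls = {
    "g1": ["P", "P", "A", "M"],   # 25% absent, kept
    "g2": ["A", "A", "P", "P"],   # 50% absent, removed
    "g3": ["A", "A", "A", "P"],   # 75% absent, removed
}
kept = present_genes(toy_calls)
print(kept)
```

To filter within a subset (for example one tumor type), pass only the calls from the arrays in that subset.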
![Page 17: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/17.jpg)
Pre-Processing Gene Expression Data
![Page 18: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/18.jpg)
Removing Redundancy
![Page 19: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/19.jpg)
Reason for Removing Redundancy Before Running
![Page 20: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/20.jpg)
Removing Redundancy
• Collapse Affymetrix Probeset IDs to EntrezIDs
• Test for correlation between probesets
  If correlation is ≥ 0.8 then combine probesets
  If not then leave them separate
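The collapsing step could look roughly like the sketch below. The slides do not say how probesets are combined or how more than two probesets per gene are handled, so averaging and the greedy grouping rule are assumptions; `collapse_probesets` and the toy Entrez ID are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collapse_probesets(expr, probeset_to_gene, min_r=0.8):
    """Merge probesets of the same gene when their correlation is >= min_r."""
    by_gene = {}
    for ps, gene in probeset_to_gene.items():
        by_gene.setdefault(gene, []).append(ps)
    collapsed = {}
    for gene, probesets in by_gene.items():
        groups = []
        for ps in sorted(probesets):
            for group in groups:
                # Greedy rule (assumption): join the first group whose
                # members all correlate with this probeset at >= min_r.
                if all(pearson(expr[ps], expr[m]) >= min_r for m in group):
                    group.append(ps)
                    break
            else:
                groups.append([ps])
        for group in groups:
            # Combine by averaging across probesets, sample by sample.
            cols = zip(*(expr[m] for m in group))
            collapsed[(gene, tuple(group))] = [sum(c) / len(group) for c in cols]
    return collapsed


# Toy data: three probesets mapped to one (hypothetical) Entrez ID.
expr = {"ps1": [1, 2, 3, 4], "ps2": [2, 4, 6, 8.5], "ps3": [4, 1, 3, 2]}
mapping = {"ps1": "1234", "ps2": "1234", "ps3": "1234"}
collapsed = collapse_probesets(expr, mapping)
print(collapsed)
```

Here ps1 and ps2 are merged into one averaged profile, while the poorly correlated ps3 stays separate.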
![Page 21: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/21.jpg)
Pre-Processing Gene Expression Data
![Page 22: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/22.jpg)
Pre-Processing Pipeline
[Legend: each pipeline step is marked as implemented in R or implemented in Python]
![Page 23: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/23.jpg)
Big Picture
![Page 24: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/24.jpg)
Glioma: A Deadly Brain Cancer
Wikimedia commons
![Page 25: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/25.jpg)
Brain Anatomy
Wikimedia commons
![Page 26: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/26.jpg)
What do they do?
![Page 27: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/27.jpg)
Neurophysiology
![Page 28: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/28.jpg)
Hierarchy of Nervous Tissue Tumors
![Page 29: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/29.jpg)
Glioma
| WHO Grade | Tumor Type | Percentage of CNS Tumors |
|---|---|---|
| I | Pilocytic Astrocytoma | 9.8% (Grades I-III combined) |
| II | Diffuse or Low-Grade Astrocytoma | |
| III | Anaplastic Astrocytoma | |
| IV | Glioblastoma Multiforme | 20.3% |
Gliomas account for 40% of all CNS tumors and 78% of malignant CNS tumors.
Buckner et al., 2007
![Page 30: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/30.jpg)
Glioma Survival
Source: http://www.neurooncology.ucla.edu/
[Figure: survival curves, marked at 5 years and 10 years]
![Page 31: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/31.jpg)
![Page 32: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/32.jpg)
Repository of Molecular Brain Neoplasia Data (REMBRANDT)
• REMBRANDT (Madhavan et al., 2009)
  Currently 257 individual specimens
  • Glioblastoma multiforme (GBM) = 110
  • Astrocytoma = 50
  • Oligodendroglioma = 55
  • Mixed = 21
  • Non-Tumor = 21
• Phenotypes
  • Tumor type: GBM, Astrocytoma, etc.
  • WHO Grade: 176 individuals
  • Age: 253 individuals
  • Sex: 250 individuals (partially inferred using Y chromosome genes)
  • Survival (days post diagnosis): 169 individuals
![Page 33: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/33.jpg)
REMBRANDT: Chromosome Y Expression
[Figure: sex-specific gene expression, Female vs. Male]
Conversions of male to female should be more common than the other way, because it is difficult for females to express Y chromosome genes.
4 females cluster with males
8 males cluster with females
![Page 34: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/34.jpg)
REMBRANDT: Chr. Y Expression – Intelligent Reassignment
[Figure: sex-specific gene expression, Female vs. Male]
Intelligent Reassignment – If the previous call of sex is for the other group, then the call is turned into an NA. All unknowns are given a call.
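The reassignment rule can be written down in a few lines. This sketch assumes an expression-based call ("M" or "F") has already been made for every sample from the chromosome Y gene clustering; the function name `reassign_sex` and the sample IDs are hypothetical.

```python
def reassign_sex(recorded, expression_call):
    """Apply the reassignment rule to one sample.

    recorded: "M", "F", or None when the annotation is missing.
    expression_call: "M" or "F", inferred from chromosome Y gene expression.
    """
    if recorded is None:
        return expression_call      # unknowns are given a call
    if recorded != expression_call:
        return None                 # conflicting annotation becomes NA
    return recorded                 # annotation and expression agree


# Toy samples: (recorded sex, expression-based call), made up for the example.
samples = {"s1": ("M", "M"), "s2": ("F", "M"), "s3": (None, "F")}
final = {sid: reassign_sex(rec, call) for sid, (rec, call) in samples.items()}
print(final)
```

Samples that end up as `None` are treated as missing for any analysis that uses sex.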
![Page 35: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/35.jpg)
Progression of Astrocytic Glioma
Furnari, et al. (2007)
![Page 36: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/36.jpg)
Modeling Glioma
• Increasing metastatic potential and severity of glioma could be modeled using this simple schema
• Correlation of model to survival post diagnosis is -0.68
[Figure: progression schema with stages coded 0, 1, 2]
![Page 37: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/37.jpg)
Exploring Meta-Information
• Age explains 31% of survival post diagnosis
• Age explains 25% of the progression model
• Sex does not have a significant effect on either survival or the progression model
  Yet it is known that glioblastoma is slightly more common in men than in women
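One way to read "explains X%" is as the squared Pearson correlation (the R² of a single-predictor linear regression); that reading is an assumption, since the slides do not state how the percentages were computed. The sketch below shows the calculation on made-up toy values, not on REMBRANDT data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def variance_explained(x, y):
    """Fraction of variance in y explained by a linear fit on x (r squared)."""
    return pearson(x, y) ** 2


# Made-up toy values, not REMBRANDT data.
age = [30, 45, 50, 62, 70]
survival_days = [900, 700, 750, 300, 200]
r2 = variance_explained(age, survival_days)
print(f"age explains {100 * r2:.0f}% of survival in the toy data")
```

Under the same reading, a correlation of -0.68 between the progression model and survival corresponds to about 46% of variance explained.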
![Page 38: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/38.jpg)
Summary
• Large dataset with a good amount of meta-information
• Ready for dimensionality reduction and network inference!
![Page 39: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/39.jpg)
Big Picture
![Page 40: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/40.jpg)
Clustering as Dimensionality Reduction
![Page 41: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/41.jpg)
Big Picture
![Page 42: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/42.jpg)
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
![Page 43: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/43.jpg)
Relative Genome Sizes
![Page 44: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/44.jpg)
Solutions
• Pre-process genomic sequences
• Reduce data complexity by collapsing redundancies
• Utilize filters that select for only the most variant genes
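A variance filter of the kind mentioned in the last bullet might look like this. The slides do not specify the variability measure or the number of genes kept, so sample variance and the `top_n` parameter are assumptions.

```python
from statistics import variance


def most_variant_genes(expr, top_n):
    """Return the top_n gene IDs ranked by sample variance across arrays."""
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:top_n]


# Toy expression profiles across four arrays, made up for the example.
expr = {
    "g1": [5, 5, 5, 5],   # no variation
    "g2": [1, 9, 1, 9],   # highly variable
    "g3": [4, 5, 6, 5],   # mildly variable
}
top = most_variant_genes(expr, top_n=2)
print(top)
```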
![Page 45: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/45.jpg)
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
![Page 46: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/46.jpg)
Eukaryotic Gene Structure
![Page 47: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/47.jpg)
Eukaryotic Gene Structure
[Figure labels: Transcriptional Start Site, Start Codon, Untranslated Regions]
![Page 48: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/48.jpg)
Eukaryotic Gene Structure
Exons
![Page 49: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/49.jpg)
Eukaryotic Gene Structure
Introns
![Page 50: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/50.jpg)
Regulatory Regions
3’ UTR
• miRNA binding sites (4-9bp motifs)
• Median 3’ UTR length is 831bp
Promoter
• Transcription Factor Binding Sites (6-12bp motifs)
• No set length for promoters in eukaryotes.
• Grabbing 2Kbp, so we can use 2Kbp or smaller.
![Page 51: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/51.jpg)
Three Examples After Capture
85% (n = 36,177) of probesets are associated with a sequence
![Page 52: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/52.jpg)
Solution
• Do motif detection on both promoter and 3’ UTR sequences
• Incorporate both of these regulatory regions into the cMonkey bi-cluster scoring matrix
![Page 53: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/53.jpg)
Promoter Sequences
• Looking for transcription factor binding sites (TFBS)
  Using MEME with 6-12bp motif widths
• Utilized RefSeq gene mapping to identify putative promoter regions
  2Kbp of sequence upstream of the transcriptional start site (TSS) was grabbed
• If two RefSeq gene mappings did not overlap then the longest transcript's promoter was taken
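Grabbing sequence upstream of the TSS has to respect the strand of the gene. The sketch below computes the genomic window for a 2Kbp promoter and shows the reverse complement needed for minus-strand genes. The coordinate convention (0-based, half-open, with `tss` being the first transcribed base) and the function names are assumptions for this example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def promoter_window(tss, strand, length=2000):
    """Return (start, end), 0-based half-open, of the region upstream of the TSS."""
    if strand == "+":
        # Upstream means lower coordinates; clamp at the chromosome start.
        return max(0, tss - length), tss
    if strand == "-":
        # Upstream means higher coordinates on the reference strand.
        return tss + 1, tss + 1 + length
    raise ValueError("strand must be '+' or '-'")


def reverse_complement(seq):
    """Reverse-complement a DNA string (needed for minus-strand promoters)."""
    return seq.translate(_COMPLEMENT)[::-1]


print(promoter_window(10000, "+"))   # window on the plus strand
print(promoter_window(10000, "-"))   # window on the minus strand
print(reverse_complement("AACG"))
```

For minus-strand genes the extracted window should be reverse-complemented so that every promoter reads 5' to 3' toward its TSS.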
![Page 54: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/54.jpg)
3’ UTR Sequences
• Looking for miRNA binding sites
  miRNAs are ~21 nt RNA molecules that bind to mRNA and alter expression
  Using MEME with 4-9bp motif widths
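The two motif searches differ mainly in the allowed motif widths. As a hedged sketch, the helper below assembles MEME command lines with the widths quoted on the slides (6-12bp for promoters, 4-9bp for 3’ UTRs). The file names, output directories, and the choice to search both strands only for promoters are assumptions, not settings taken from the talk.

```python
def meme_command(fasta, out_dir, min_width, max_width, both_strands):
    """Assemble a MEME command line as an argument list (not executed here)."""
    cmd = ["meme", fasta, "-dna", "-oc", out_dir,
           "-minw", str(min_width), "-maxw", str(max_width)]
    if both_strands:
        # Also consider the reverse-complement strand of each sequence.
        cmd.append("-revcomp")
    return cmd


# Hypothetical input files and output directories.
promoter_cmd = meme_command("promoters.fa", "meme_promoter", 6, 12, True)
utr_cmd = meme_command("utr3.fa", "meme_utr3", 4, 9, False)
print(" ".join(promoter_cmd))
print(" ".join(utr_cmd))
```

Each list can be handed to `subprocess.run` on a machine where MEME is installed.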
![Page 55: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/55.jpg)
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
![Page 56: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/56.jpg)
Complexity of Mammalian Systems
![Page 57: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/57.jpg)
Cellular Heterogeneity in Tissues
![Page 58: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/58.jpg)
What Makes Cells Different?
![Page 59: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/59.jpg)
Solution
• Filter out genes that are not expressed for each tissue, leaving only those that are expressed
• Enhance the capability of the software to handle missing data
![Page 60: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/60.jpg)
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
![Page 61: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/61.jpg)
Intelligent Sample Collection
• Genetic and environmental heterogeneity are real world issues
• Can try to match for certain confounders
• Or stratify analyses based on particular confounders
![Page 62: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/62.jpg)
Running cMonkey
• Running cMonkey on AEGIR cluster
  10 nodes with 8 cores per node
  1 node has 24GB RAM
  2 others have 16GB RAM
• Completion time depends heavily on the size of the run
![Page 63: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/63.jpg)
Beautiful New Result Interface
![Page 64: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/64.jpg)
Looking at a Cluster
![Page 65: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/65.jpg)
Chris’s Graphics Mods
![Page 66: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/66.jpg)
Original cMonkey Output
![Page 67: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/67.jpg)
Sorted cMonkey Output
![Page 68: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/68.jpg)
Boxplot For All Samples
![Page 69: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/69.jpg)
Boxplot for In Samples
![Page 70: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/70.jpg)
Integrating Phenotypes
![Page 71: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/71.jpg)
What to do when you find a cluster?
![Page 72: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/72.jpg)
Checking Out PSSM #1
![Page 73: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/73.jpg)
Known Motif?
![Page 74: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/74.jpg)
Motif Known?
![Page 75: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/75.jpg)
What do the genes do?
![Page 76: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/76.jpg)
Functional Enrichment?
![Page 77: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/77.jpg)
Functional Enrichment
![Page 78: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/78.jpg)
Genes?
![Page 79: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/79.jpg)
Interesting Cluster
![Page 80: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/80.jpg)
Phenotype Correlations
• Survival – Correlation coefficient = -0.48, P-value = 3.2 × 10⁻¹¹
• Progression Model – Correlation coefficient = 0.55, P-value = 6.7 × 10⁻¹⁶
• Age – Correlation coefficient = 0.32, P-value = 2.2 × 10⁻⁷
• Sex – Correlation coefficient = -0.27, P-value = 0.0012
Bonferroni-corrected significance threshold: p-value ≤ 0.05 / (585 × 4) ≈ 2.1 × 10⁻⁵
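The Bonferroni arithmetic can be checked directly. The sketch below recomputes the threshold and compares the four reported p-values against it; reading 585 × 4 as the number of clusters times the four phenotypes is an assumption, since the slide gives only the product.

```python
alpha = 0.05
n_tests = 585 * 4              # assumed: 585 clusters x 4 phenotypes
threshold = alpha / n_tests    # Bonferroni-corrected cutoff

# P-values as reported on the slide.
p_values = {
    "Survival": 3.2e-11,
    "Progression Model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}
significant = {name: p <= threshold for name, p in p_values.items()}

print(f"threshold = {threshold:.2e}")
for name, passed in significant.items():
    print(name, "passes" if passed else "does not pass")
```

With this threshold, survival, the progression model, and age pass the correction, while sex does not.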
![Page 81: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/81.jpg)
Genes from Cluster
| AFFY_ID | Gene Symbol | Gene Name |
|---|---|---|
| 212067_S_AT | C1R | complement component 1, r subcomponent |
| 208747_S_AT | C1S | complement component 1, s subcomponent |
| 201743_AT | CD14 | CD14 antigen |
| 215049_X_AT | CD163 | CD163 antigen |
| 203854_AT | CFI | complement factor I |
| 213060_S_AT | CHI3L2 | chitinase 3-like 2 |
| 208146_S_AT | CPVL | carboxypeptidase, vitellogenic-like |
| 201798_S_AT | FER1L3 | fer-1-like 3, myoferlin (C. elegans) |
| 206584_AT | LY96 | lymphocyte antigen 96 |
| 202180_S_AT | MVP | major vault protein |
| 204150_AT | STAB1 | stabilin 1 |
| 204924_AT | TLR2 | toll-like receptor 2 |
[Legend: highlighted genes were previously known to be differentially expressed in GBM.]
![Page 82: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/82.jpg)
Motif Matches
PSSM #2
PSSM #1
![Page 83: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/83.jpg)
Summary
• Very promising results
• Need to further develop certain aspects of cMonkey to better utilize the human data
• Then need to build network inference component
![Page 84: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/84.jpg)
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?
![Page 85: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/85.jpg)
Cluster Samples, or Not?
• Bi-clustering clusters not only on genes but also by experimental conditions (samples)
• Because we are using just one experiment it may not be necessary to cluster samples
• Although it may be useful again once other experiments are included
![Page 86: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/86.jpg)
Bi-clustering or Not?
[Figure panels: Bi-clustering; Gene Clustering Only]
![Page 87: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/87.jpg)
Brief Glance
• Looks like for this dataset it may make more sense to only cluster genes
  More clusters with significant motifs
• Although this is likely to change once we add more experiments to the mix
• Need a method to quantify this
![Page 88: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/88.jpg)
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?
![Page 89: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/89.jpg)
Maxing Out cMonkey
• Can cMonkey handle running all genes?
  Yes, without doing motif finding
  With motif finding this will take a long time (weeks?), and tends to crash
• Essentially need to balance sequence length for motif finding with cluster size and number of clusters
• Need a method to quantify this
![Page 90: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/90.jpg)
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?
![Page 91: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/91.jpg)
Length for Promoters?
• MEME suggests 1Kbp or less for sequences as input
• Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp, and 5Kbp
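Preparing the different promoter lengths amounts to trimming one long upstream sequence down to the bases nearest the TSS. The sketch below assumes each upstream sequence is written 5' to 3' and ends right at the TSS; the helper name `tss_proximal` and the toy sequence are made up for the example.

```python
def tss_proximal(upstream_seq, length):
    """Keep the `length` bases closest to the TSS.

    Assumes upstream_seq is written 5'->3' and ends right at the TSS.
    Sequences shorter than `length` are returned unchanged.
    """
    if length <= 0:
        return ""
    return upstream_seq[-length:]


# The promoter lengths tried in the talk, in base pairs.
lengths_bp = [500, 1000, 2000, 2500, 5000]

upstream = "ACGT" * 1250   # toy 5 Kbp sequence
trimmed = {n: tss_proximal(upstream, n) for n in lengths_bp}
print({n: len(seq) for n, seq in trimmed.items()})
```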
![Page 92: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/92.jpg)
Brief Glance
• So far it looks like 500bp gives the most clusters with motifs
• Need a method to quantify this
![Page 93: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/93.jpg)
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?
![Page 94: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/94.jpg)
Breast Cancer Metastasis
Bos et al., 2009
![Page 95: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/95.jpg)
cMonkey for Eukaryotes
Future Modifications to cMonkey for eukaryotes:
Preprocess sequence data
Add 3’ UTR miRNA motif detection
Integrate 3’ UTR miRNA motif scores with promoter motif scores
![Page 96: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/96.jpg)
Network Inference
• cMonkey software is utilized to produce the bi-clusters
• Inferelator can then be used to identify regulatory factors
• Simple correlation with phenotypes can relate bi-clusters to disease
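The phenotype correlation in the last bullet could be sketched as follows: summarize a bi-cluster by its mean expression per sample, then correlate that profile with a phenotype, skipping samples where the phenotype is missing. Using the mean as the cluster summary is an assumption, and the values are made-up toy data, not REMBRANDT results.

```python
from math import sqrt


def cluster_profile(expr, genes):
    """Mean expression across the cluster's genes, one value per sample."""
    rows = [expr[g] for g in genes]
    return [sum(col) / len(col) for col in zip(*rows)]


def pearson_complete(x, y):
    """Pearson correlation over samples where the phenotype is not None."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


# Made-up toy values for four samples.
expr = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [3.0, 4.0, 5.0, 6.0]}
survival_days = [800, None, 400, 200]   # None marks a missing phenotype
profile = cluster_profile(expr, ["geneA", "geneB"])
r = pearson_complete(profile, survival_days)
print(profile, round(r, 3))
```

Handling missing values matters here because not every specimen has every phenotype recorded.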
![Page 97: Mining Public Data for Insights into Human Disease](https://reader035.vdocuments.site/reader035/viewer/2022081603/568145ae550346895db2ad51/html5/thumbnails/97.jpg)
Acknowledgements
Baliga Lab
• Nitin
• David
• Chris
• Dan
Hood Lab
• Burak Kutlu
• Luxembourg Project
• REMBRANDT