Using Functional Genomic Units to Corroborate User Experiments with the Rosetta Compendium
Duke Bioinformatics Shared Resource, Duke University Medical Center
Simon M. Lin*, Patrick McConnell*
Department of Electronic Engineering, Duke University
Xuejun Liao*, Lawrence Carin
Department of Cardiology, Duke University Medical Center
Korkut Vita*, Pascal Goldschmidt
(* Authors contributed equally to the work)
Contributions
• Can we use biological knowledge in exploratory data analysis?
  - Context-sensitive Clustering
  - Designed a Java Application
• Can we computationally find the coordinated gene groups? Can we use them to simplify our analysis?
  - Functional Genomic Units (will be available to academic groups)
  - Utilized an ICA Implementation in MATLAB
• Can we use Rosetta data to explain our own experiments?
  - Conducted an Affymetrix measurement of RacC/A yeast strain
  - Explained results from different Labs/Instrumentation setups
Knowledge Should not be Ignored in the Microarray Analysis Process
[Figure: diagram with the labels Scientist, Experiment, Data, Knowledge, and Publication]
Context-driven Clustering
• Clustering is unsupervised learning. No previous knowledge is necessary.
• Even with its exploratory nature, it still depends on your point of view.
• Your previous knowledge will help you with feature selection.
Why clustering should be done in a given context (a Toy Example)
Rows are objects; columns are features.

Objects      Calorie  Salt    Fiber   Blood     # of cars in   Auto insurance  # of claims:   …
             intake   intake  intake  Pressure  the household  premium         auto accident
Person1      2000     2       4       90        2              400             3              …
Person2      2900     3       4       80        1              300             0              …
…            …        …       …       …         …              …               …              …
Person10000  2500     1       7       80        3              500             0              …
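The effect of context on grouping can be sketched in a few lines. The example below is only an illustration under stated assumptions: the four people and their values are hypothetical (not taken from the slide), features are min-max scaled, and "grouping" is reduced to finding the nearest neighbour of one person using only the features of a chosen context.

```python
import math

# Hypothetical people (illustrative values, not taken from the slide).
PEOPLE = {
    "A": {"calories": 2000, "salt": 2, "cars": 1, "premium": 300},
    "B": {"calories": 2100, "salt": 2, "cars": 3, "premium": 900},
    "C": {"calories": 3000, "salt": 6, "cars": 1, "premium": 320},
    "D": {"calories": 3100, "salt": 6, "cars": 3, "premium": 880},
}

# Two contexts, each defined by the features a scientist considers relevant.
HEALTH = ["calories", "salt"]
INSURANCE = ["cars", "premium"]


def scaled(people, features):
    """Min-max scale the chosen features so each one lies in [0, 1]."""
    out = {name: [] for name in people}
    for f in features:
        values = [p[f] for p in people.values()]
        lo = min(values)
        span = (max(values) - lo) or 1.0
        for name, p in people.items():
            out[name].append((p[f] - lo) / span)
    return out


def nearest_neighbour(people, features, who):
    """Return the person closest to `who` using only the given features."""
    pts = scaled(people, features)
    others = [n for n in pts if n != who]
    return min(others, key=lambda n: math.dist(pts[who], pts[n]))


print(nearest_neighbour(PEOPLE, HEALTH, "A"))      # neighbour in the diet context
print(nearest_neighbour(PEOPLE, INSURANCE, "A"))   # neighbour in the insurance context
```

The same person ends up next to different neighbours depending on which features are selected, which is the point of choosing the context before clustering.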
Same is true for genomics
Rows are objects (experiments); columns are features (genes).

Objects         Gene1  Gene2  Gene3  …  Gene 10000
Experiment 1    2000   2      4      …  2
Experiment 2    2900   3      4      …  1
…               …      …      …      …  …
Experiment 300  2500   1      7      …  3
“Kitchen-sink” Clustering
Genomic Knowledge Organized in a Tree Structure
Integrated with the Expression Browser for Clustering
Clustering in the lipid-metabolism Context
Independent Component Analysis (ICA) of the Gene Expression Profiles
• Statement of the ICA problem
y - the observed random vector of N components
x - a random vector with M independent components (IC)
A - mixing matrix
Q - separating matrix
• ICA Signal Model
$y = A x$ or $x = Q y$
• Objective
Find Q and A such that the components in x are as independent as possible
• ICA Solution of the Blind Source Separation Problem: An Illustration
• ICA Model of the Microphone Array Signals
[Figure: audio signals of two independent speakers (Speaker 1, Speaker 2) versus time indices]
[Figure: mixed audio signals received at the microphone array versus time indices]
• Extraction of the two speakers’ audio signals via ICA and PCA
[Figure: original signals, ICA signals, and PCA signals versus time indices]
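The two-speaker illustration can be reproduced in miniature. The sketch below is not the MATLAB implementation mentioned in the slides; it is a minimal two-signal example under stated assumptions: the sources are synthetic uniform noise, the mixing matrix is made up, whitening plays the role of PCA, and independence is approximated by a grid search over rotations that maximizes the sum of squared fourth-order cumulants. All function names are hypothetical.

```python
import math
import random


def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def kurtosis(v):
    """Fourth-order cumulant of a zero-mean, unit-variance signal."""
    return sum(a ** 4 for a in v) / len(v) - 3.0


def corr(a, b):
    a, b = center(a), center(b)
    num = sum(p * q for p, q in zip(a, b))
    return num / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))


def whiten(y1, y2):
    """PCA step: rotate onto the principal axes, then scale to unit variance."""
    y1, y2 = center(y1), center(y2)
    n = len(y1)
    a = sum(p * p for p in y1) / n
    d = sum(q * q for q in y2) / n
    b = sum(p * q for p, q in zip(y1, y2)) / n
    t = 0.5 * math.atan2(2 * b, a - d)  # angle that diagonalizes the covariance
    c, s = math.cos(t), math.sin(t)
    p1 = [c * p + s * q for p, q in zip(y1, y2)]
    p2 = [-s * p + c * q for p, q in zip(y1, y2)]
    v1 = sum(p * p for p in p1) / n
    v2 = sum(q * q for q in p2) / n
    return [p / math.sqrt(v1) for p in p1], [q / math.sqrt(v2) for q in p2]


def rotate(z1, z2, phi):
    c, s = math.cos(phi), math.sin(phi)
    return ([c * p + s * q for p, q in zip(z1, z2)],
            [-s * p + c * q for p, q in zip(z1, z2)])


def ica_2d(y1, y2, steps=180):
    """Whiten, then pick the rotation with the largest sum of squared kurtoses."""
    z1, z2 = whiten(y1, y2)

    def contrast(phi):
        u1, u2 = rotate(z1, z2, phi)
        return kurtosis(u1) ** 2 + kurtosis(u2) ** 2

    angles = [k * (math.pi / 2) / steps for k in range(steps)]
    return rotate(z1, z2, max(angles, key=contrast))


rng = random.Random(0)
s1 = [rng.uniform(-1, 1) for _ in range(2000)]   # synthetic "speaker 1"
s2 = [rng.uniform(-1, 1) for _ in range(2000)]   # synthetic "speaker 2"
y1 = [p + 0.6 * q for p, q in zip(s1, s2)]       # assumed mixing matrix
y2 = [0.4 * p + q for p, q in zip(s1, s2)]

u1, u2 = ica_2d(y1, y2)
z1, z2 = whiten(y1, y2)
ica_match = max(abs(corr(s1, u1)), abs(corr(s1, u2)))
pca_match = max(abs(corr(s1, z1)), abs(corr(s1, z2)))
print(ica_match, pca_match)
```

ICA recovers each source only up to order, sign, and scale, which is why the comparison uses the absolute correlation with the best-matching output. The PCA outputs are uncorrelated but remain mixtures of both sources.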
• ICA Model of the DNA Microarray (Gene Expression) Profiles
[Figure: expression versus experiment (condition) measurements of two mutually independent Functional Events: Functional Event 1 (e.g., Cell Proliferation) and Functional Event 2 (e.g., Detoxification)]
[Figure: gene expression profiles versus experiment (condition) received at the DNA microarray; axes: genes, experiments (conditions)]
• Extraction of the two mutually independent Functional Events via ICA and PCA
[Figure: original events, functional events recovered from ICA, and functional events recovered from PCA, versus experiments (conditions)]
• ICA Model of the DNA Microarray (Gene Expression) Profiles
ICA Model: $y = A \cdot x$, or
Expression Profiles of Genes = Memberships of Genes Belonging to Functional Units × Expression Profiles of Functional Units
• Expression Profiles of Genes (genes × experiment index): represent the impacts of experiments in terms of genes
• Memberships of Genes Belonging to Functional Units (genes × Genomic Functional Units): element (i, j) denotes the fuzzy membership of gene i belonging to functional unit j
• Expression Profiles of Functional Units, i.e. the Independent Components (functional units × experiment index): represent the impacts of experiments in terms of Genomic Functional Units
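The matrix form of the model can be made concrete with a toy example. All numbers below are hypothetical (4 genes, 2 functional units, 3 experiments), and `matmul` is a small helper written for this sketch.

```python
# Hypothetical toy sizes: 4 genes, 2 functional units, 3 experiments.
A = [  # A[i][j]: membership of gene i in functional unit j
    [0.9, 0.0],
    [0.8, 0.1],
    [0.0, 0.7],
    [0.1, 0.9],
]
X = [  # X[j][k]: expression of functional unit j in experiment k
    [1.0, 0.0, -1.0],
    [0.0, 2.0, 0.5],
]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Y[i][k]: expression of gene i in experiment k, i.e. Y = A * X.
Y = matmul(A, X)
for row in Y:
    print(row)
```

In this toy setup the second experiment activates only the second functional unit, so only the genes with a nonzero membership in that unit respond. Describing the experiment by two unit activities rather than four gene values is the simplification the model offers.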
• Definition of a Genomic Functional Unit
[Figure: fuzzy membership function of Genomic Functional Unit #69, which is responsible for oxidative stress response; labeled genes: YFL026W, YFL053W, YML007W, YLR307W, YAL067C, DR218C, YIL037C, YLR296W, YPL121C, YPR116W]
A Genomic Functional Unit is a fuzzy set defined on the genes in consideration.
It generally contains genes that work together to accomplish a certain biological function.
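One standard way to read a fuzzy set is through an alpha-cut: the crisp set of elements whose membership is at least some threshold. The slides do not say whether the authors thresholded memberships this way, so the sketch below is only an assumption-laden illustration. The gene names, membership values, and threshold are hypothetical.

```python
# Hypothetical membership values for one functional unit (placeholder gene names).
membership = {
    "geneA": 0.11, "geneB": 0.09, "geneC": 0.07, "geneD": 0.06,
    "geneE": 0.05, "geneF": 0.01, "geneG": 0.004, "geneH": 0.002,
}


def alpha_cut(mu, alpha):
    """Genes whose membership is at least alpha, strongest first."""
    kept = [g for g, m in mu.items() if m >= alpha]
    return sorted(kept, key=lambda g: mu[g], reverse=True)


core = alpha_cut(membership, 0.05)
print(core)
```

Lowering the threshold admits weaker members, so the same unit can be viewed as a small core or as a wider group of genes.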
• Principles of the Independent Component Analysis Algorithm
1. Measure of statistical independence
• Original definition
A random vector x has independent components $x_i$ if the joint pdf equals the product of the marginal pdfs:
$$p_x(u) = \prod_{i=1}^{N} p_{x_i}(u_i)$$
• Mutual information
Kullback-Leibler distance between the joint pdf and the product of the marginal pdfs:
$$I(p_x) = \mathrm{kld}\Big(p_x, \prod_{i=1}^{N} p_{x_i}\Big) = \int p_x(u) \ln \frac{p_x(u)}{\prod_{i=1}^{N} p_{x_i}(u_i)} \, du$$
Differential entropy of x:
$$S(p_x) = -\int p_x(u) \ln p_x(u) \, du$$
Negentropy of x:
$$J(p_x) = S(\phi_x) - S(p_x)$$
$\phi_x(u)$ - Gaussian distribution with the same covariance matrix as $p_x(u)$
$J(p_x) \ge 0$, with equality iff $p_x(u) = \phi_x(u)$. This is so because the Gaussian distribution has the largest entropy among the pdfs having a given covariance matrix.
IMPORTANT: $J(p_x)$ is invariant under general invertible linear transforms, because
$$S(p_{Ax}) = S(p_x) + \ln|\det A|$$
and the same shift applies to $S(\phi_{Ax})$, so the $\ln|\det A|$ terms cancel out in $J(p_x)$.
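Both properties of the negentropy can be checked on a case with closed-form entropies. The sketch below assumes a one-dimensional uniform density on [-a, a], whose differential entropy is ln(2a) and whose variance is a²/3. The function names are made up for this example.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * variance)."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)


def uniform_entropy(a):
    """Differential entropy of the uniform density on [-a, a]: ln(2a)."""
    return math.log(2 * a)


def uniform_negentropy(a):
    """J = S(Gaussian with the same variance) - S(uniform)."""
    variance = a * a / 3.0  # variance of the uniform density on [-a, a]
    return gaussian_entropy(variance) - uniform_entropy(a)


j1 = uniform_negentropy(1.0)
j5 = uniform_negentropy(5.0)  # x -> 5x is an invertible linear transform
print(j1, j5)
```

The value is positive, as expected for a non-Gaussian density, and it does not change when the variable is rescaled, because the ln(a) terms in the two entropies cancel.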
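• Representation of mutual information using the negentropy
$$I(p_x) = J(p_x) - \sum_{i=1}^{N} J(p_{x_i}) + \frac{1}{2} \ln \frac{\prod_{i=1}^{N} V_{ii}}{\det V}$$
where V is the covariance matrix of x.
Proof.
$$\begin{aligned}
J(p_x) - \sum_{i=1}^{N} J(p_{x_i}) &= [S(\phi_x) - S(p_x)] - \sum_{i} [S(\phi_{x_i}) - S(p_{x_i})] \\
&= \int p_x(u) \ln p_x(u) \, du - \sum_{i} \int p_{x_i}(u_i) \ln p_{x_i}(u_i) \, du_i + \Big[S(\phi_x) - \sum_{i} S(\phi_{x_i})\Big] \\
&= \int p_x(u) \ln \frac{p_x(u)}{\prod_{i} p_{x_i}(u_i)} \, du + \Big[S(\phi_x) - \sum_{i} S(\phi_{x_i})\Big] \\
&= I(p_x) - \frac{1}{2} \ln \frac{\prod_{i} V_{ii}}{\det V}
\end{aligned}$$
The last step uses the entropy of the Gaussian distribution:
$$S(\phi_x) = \frac{1}{2} \ln\big((2\pi e)^{N} \det V\big)$$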
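A quick sanity check of the covariance term: for a Gaussian x every negentropy is zero, so the mutual information reduces to the term built from V alone. For a bivariate Gaussian with unit variances and correlation ρ this should agree with the textbook closed form -½ ln(1 - ρ²). The function name and the value of ρ below are chosen for illustration.

```python
import math


def gaussian_mutual_information(V):
    """0.5 * ln(prod_i V_ii / det V) for a 2x2 covariance matrix V."""
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    return 0.5 * math.log(V[0][0] * V[1][1] / det)


rho = 0.6
V = [[1.0, rho], [rho, 1.0]]
mi = gaussian_mutual_information(V)
closed_form = -0.5 * math.log(1 - rho ** 2)
print(mi, closed_form)

# With a diagonal covariance matrix the components are independent.
mi_diag = gaussian_mutual_information([[2.0, 0.0], [0.0, 0.5]])
print(mi_diag)
```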
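2. Basic Principles of the ICA Algorithms
$$I(p_x) = J(p_x) - \sum_{i=1}^{N} J(p_{x_i}) + \frac{1}{2} \ln \frac{\prod_{i=1}^{N} V_{ii}}{\det V}$$
• $J(p_x)$: invariant under general invertible linear transforms
• $\sum_{i=1}^{N} J(p_{x_i})$: to be maximized
• $\frac{1}{2} \ln \frac{\prod_{i} V_{ii}}{\det V}$: cancels out via standardization, which transforms x to $\tilde{x}$ with an identity covariance matrix
Minimizing the mutual information over linear transforms therefore amounts to maximizing the sum of the marginal negentropies of the standardized data.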
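3. Examples of practical criteria of statistical independence
3.1 Criterion based on approximation of negentropy
$$I(p_x) \approx J(p_x) - \sum_{i=1}^{N} \Big( \frac{1}{12} \kappa_{iii}^{2} + \frac{1}{48} \kappa_{iiii}^{2} + \frac{7}{48} \kappa_{iii}^{4} - \frac{1}{8} \kappa_{iii}^{2} \kappa_{iiii} \Big) + \frac{1}{2} \ln \frac{\prod_{i=1}^{N} V_{ii}}{\det V}$$
where $\kappa_{iiii} = \mathrm{cum}(x_i, x_i, x_i, x_i)$, and $\kappa_{iii}$ is the corresponding third-order cumulant.
The sum over i is the term to be maximized.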
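The per-component term of the negentropy approximation in 3.1 can be evaluated from sample cumulants. The sketch below assumes standardized third- and fourth-order sample cumulants (skewness and excess kurtosis). The function names and test signals are made up for this illustration.

```python
import random


def standardized_cumulants(samples):
    """Third- and fourth-order cumulants of the standardized samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    z = [(s - mean) / var ** 0.5 for s in samples]
    k3 = sum(v ** 3 for v in z) / n
    k4 = sum(v ** 4 for v in z) / n - 3.0
    return k3, k4


def negentropy_approx(samples):
    """Cumulant-based approximation of the negentropy of one component."""
    k3, k4 = standardized_cumulants(samples)
    return k3 ** 2 / 12 + k4 ** 2 / 48 + 7 * k3 ** 4 / 48 - k3 ** 2 * k4 / 8


# Evenly spaced points on [-1, 1] stand in for a uniform density.
uniform_grid = [i / 500.0 - 1.0 for i in range(1001)]
rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]

j_uniform = negentropy_approx(uniform_grid)
j_gauss = negentropy_approx(gaussian)
print(j_uniform, j_gauss)
```

The approximation is close to zero for Gaussian samples and clearly positive for the uniform ones. Because it is a truncated expansion, it is only a rough estimate of the exact negentropy for strongly non-Gaussian densities.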
3.2 Simple criteria based on cumulants
$$\psi(Q) = \sum_{i=1}^{N} K_{iiii}^{2}, \quad \text{with } K \text{ the cumulants of } z = Qy$$
• ICA Results of the Rosetta Compendium Data Set
Rosetta Data Set: expression profiles of genes in 300 experiments
ICA results I: Expression profiles of Independent Components (functional units) in 300 experiments
[Figure: heat map; x-axis: experiment indices (50 to 300); y-axis: independent component indices (50 to 250)]
PCA results for comparison: Expression profiles of the Principal Components in 300 experiments
[Figure: heat map; x-axis: experiment indices (50 to 300); y-axis: principal component indices (50 to 250); color scale from -40 to 10]
ICA results II: Discovery or corroboration of genes’ functional unit
• Functional Genomic Unit #6:
  - Six of these genes are coding for isoforms of α-glucosidase (MAL62, MAL32, MAL12, FSP2, YIL172c, and YJL216c).
  - Four of the genes are directly associated with cell-wall synthesis and sporulation: sporulation-specific homolog of csd4 (YER096w); sporulation-specific cell wall maturation protein (YHR139c); first enzyme in dityrosine synthesis in the outer layer of the spore wall pathway, converting L-tyrosine to N-formyl-L-tyrosine (YDR403w); and cell wall mannoprotein (YJR150c).
  - Five genes are involved in glucose metabolism: glucose repression regulatory protein that exhibits similarity to beta subunits of G proteins (TUP1); high-affinity hexose transporter (YDL245c); high-affinity hexose transporter (YEL069c); hexose transporter (YNR072w); and hexose transporter (YJR158w).
Common 5’-UTR
CCAGAAGTTGA   1   319   1
CAAAAAGGTGT   1   647   0
CCTGAAGTTGT   3    47   1
CAAAAAGGTCA   3   362   1
CCGGAAGGGGT   3   440   0
CAGGAAGGTGA   4    81   1
CAGGAAGTTGA   4   121   1
CACAAAGGTGA   6    69   0
CCTGAAGGTCA   7   169   0
CCTGAAGGTTT   7   188   1
** ********
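How strongly the aligned sites agree can be summarized column by column. The sketch below takes the ten site sequences from the alignment above and counts the nucleotides at each position. The variable and function names are made up for this illustration.

```python
from collections import Counter

# The ten aligned sites listed in the alignment above.
SITES = [
    "CCAGAAGTTGA", "CAAAAAGGTGT", "CCTGAAGTTGT", "CAAAAAGGTCA", "CCGGAAGGGGT",
    "CAGGAAGGTGA", "CAGGAAGTTGA", "CACAAAGGTGA", "CCTGAAGGTCA", "CCTGAAGGTTT",
]


def column_counts(sites):
    """Count the nucleotides seen at each aligned position."""
    return [Counter(column) for column in zip(*sites)]


counts = column_counts(SITES)

# Positions (0-based) where all ten sites carry the same nucleotide.
conserved = [i for i, c in enumerate(counts) if len(c) == 1]

# Position whose most frequent nucleotide is the least frequent overall.
least_conserved = min(range(len(counts)),
                      key=lambda i: counts[i].most_common(1)[0][1])

print(conserved, least_conserved)
```

The least conserved position falls on the gap in the row of asterisks under the alignment.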
Concept of “Functional Genomic Unit”
• The set of genes found here is different from the “Pathways” in the traditional sense.
• Mathematical Point of View: Latent Variables constructed by Independent Component Analysis
• Biological Point of View: Coordinated genes to achieve a certain goal
Using FGU to explain user experiments
[Figure: Experiments versus FGUs]
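The slides do not state how a user experiment is compared with the compendium. One simple possibility, shown here purely as an assumption, is to describe every experiment by its FGU activity profile and look for the compendium experiment with the most similar profile. The profiles, experiment names, and the choice of cosine similarity are all hypothetical.

```python
import math

# Hypothetical FGU activity profiles of compendium experiments (3 FGUs each).
COMPENDIUM = {
    "exp_1": [2.0, 0.1, -0.3],
    "exp_2": [-0.2, 1.5, 0.0],
    "exp_3": [0.1, -0.1, 1.8],
}


def cosine(a, b):
    """Cosine similarity between two FGU activity profiles."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def most_similar(user_profile, compendium):
    """Name of the compendium experiment closest to the user profile."""
    return max(compendium, key=lambda name: cosine(user_profile, compendium[name]))


user = [0.0, 1.2, 0.2]  # hypothetical FGU profile of a user experiment
match = most_similar(user, COMPENDIUM)
print(match)
```

Comparing a handful of FGU activities instead of thousands of gene values is one way the units could make results from different labs and instrumentation setups easier to relate.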
Putative signal transduction pathways
[Figure: pathway diagram with the labels UV, Cytokine Receptors, Growth Factor receptors, Ras, Rac / CDC42, Raf, MEK, ERK, MKK4, JNK, MKK3/6, P38, NADPH Oxidase, ROS, c-fos, c-jun, ATF2, Jun1/jun2, AP-1 RE, and C-fos promoter]
Summary of Findings
• Incorporated biological knowledge in exploratory data analysis
• Utilized ICA to model the yeast functional genomics behavior
• Proposed Functional Genomics Units
• Demonstrated the potential of the “Compendium” approach