Big Data in Omics and Imaging
Integrated Analysis and Causal Inference
CHAPMAN & HALL/CRC Mathematical and Computational Biology Series
Aims and scope: This series aims to capture new developments and summarize what is known
over the entire spectrum of mathematical and computational biology and
medicine. It seeks to encourage the integration of mathematical, statistical,
and computational methods into biology by publishing a broad range of
textbooks, reference works, and handbooks. The titles included in the
series are meant to appeal to students, researchers, and professionals in the
mathematical, statistical and computational sciences, fundamental biology
and bioengineering, as well as interdisciplinary researchers involved in the
field. The inclusion of concrete examples and applications, and programming
techniques and examples, is highly encouraged.
Series Editors
N. F. Britton, Department of Mathematical Sciences, University of Bath
Xihong Lin, Department of Biostatistics, Harvard University
Nicola Mulder, University of Cape Town, South Africa
Maria Victoria Schneider, European Bioinformatics Institute
Mona Singh, Department of Computer Science, Princeton University
Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
3 Park Square, Milton Park
Abingdon, Oxfordshire OX14 4RN
UK
Published Titles
An Introduction to Systems Biology: Design Principles of Biological Circuits
Uri Alon
Glycome Informatics: Methods and Applications
Kiyoko F. Aoki-Kinoshita
Computational Systems Biology of Cancer
Emmanuel Barillot, Laurence Calzone, Philippe Hupé, Jean-Philippe Vert, and Andrei Zinovyev
Python for Bioinformatics, Second Edition
Sebastian Bassi
Quantitative Biology: From Molecular to Cellular Systems
Sebastian Bassi
Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby
Jules J. Berman
Chromatin: Structure, Dynamics, Regulation
Ralf Blossey
Computational Biology: A Statistical Mechanics Perspective
Ralf Blossey
Game-Theoretical Models in Biology
Mark Broom and Jan Rychtár
Computational and Visualization Techniques for Structural Bioinformatics Using Chimera
Forbes J. Burkowski
Structural Bioinformatics: An Algorithmic Approach
Forbes J. Burkowski
Spatial Ecology
Stephen Cantrell, Chris Cosner, and Shigui Ruan
Cell Mechanics: From Single Scale-Based Models to Multiscale Modeling
Arnaud Chauvière, Luigi Preziosi, and Claude Verdier
Bayesian Phylogenetics: Methods, Algorithms, and Applications
Ming-Hui Chen, Lynn Kuo, and Paul O. Lewis
Statistical Methods for QTL Mapping
Zehua Chen
An Introduction to Physical Oncology: How Mechanistic Mathematical Modeling Can Improve Cancer Therapy Outcomes
Vittorio Cristini, Eugene J. Koay, and Zhihui Wang
Normal Mode Analysis: Theory and Applications to Biological and Chemical Systems
Qiang Cui and Ivet Bahar
Kinetic Modelling in Systems Biology
Oleg Demin and Igor Goryanin
Data Analysis Tools for DNA Microarrays
Sorin Draghici
Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Edition
Sorin Draghici
Computational Neuroscience: A Comprehensive Approach
Jianfeng Feng
Mathematical Models of Plant-Herbivore Interactions
Zhilan Feng and Donald L. DeAngelis
Biological Sequence Analysis Using the SeqAn C++ Library
Andreas Gogol-Döring and Knut Reinert
Gene Expression Studies Using Affymetrix Microarrays
Hinrich Göhlmann and Willem Talloen
Handbook of Hidden Markov Models in Bioinformatics
Martin Gollery
Meta-Analysis and Combining Information in Genetics and Genomics
Rudy Guerra and Darlene R. Goldstein
Differential Equations and Mathematical Biology, Second Edition
D.S. Jones, M.J. Plank, and B.D. Sleeman
Knowledge Discovery in Proteomics
Igor Jurisica and Dennis Wigle
Introduction to Proteins: Structure, Function, and Motion
Amit Kessel and Nir Ben-Tal
RNA-seq Data Analysis: A Practical Approach
Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, and Garry Wong
Introduction to Mathematical Oncology
Yang Kuang, John D. Nagy, and Steffen E. Eikenberry
Biological Computation
Ehud Lamm and Ron Unger
Optimal Control Applied to Biological Models
Suzanne Lenhart and John T. Workman
Clustering in Bioinformatics and Drug Discovery
John D. MacCuish and Norah E. MacCuish
Spatiotemporal Patterns in Ecology and Epidemiology: Theory, Models, and Simulation
Horst Malchow, Sergei V. Petrovskii, and Ezio Venturino
Stochastic Dynamics for Systems Biology
Christian Mazza and Michel Benaïm
Statistical Modeling and Machine Learning for Molecular Biology
Alan M. Moses
Engineering Genetic Circuits
Chris J. Myers
Pattern Discovery in Bioinformatics: Theory & Algorithms
Laxmi Parida
Exactly Solvable Models of Biological Invasion
Sergei V. Petrovskii and Bai-Lian Li
Computational Hydrodynamics of Capsules and Biological Cells
C. Pozrikidis
Modeling and Simulation of Capsules and Biological Cells
C. Pozrikidis
Cancer Modelling and Simulation
Luigi Preziosi
Computational Exome and Genome Analysis
Peter N. Robinson, Rosario M. Piro, and Marten Jäger
Introduction to Bio-Ontologies
Peter N. Robinson and Sebastian Bauer
Dynamics of Biological Systems
Michael Small
Genome Annotation
Jung Soh, Paul M.K. Gordon, and Christoph W. Sensen
Niche Modeling: Predictions from Statistical Distributions
David Stockwell
Algorithms for Next-Generation Sequencing
Wing-Kin Sung
Algorithms in Bioinformatics: A Practical Introduction
Wing-Kin Sung
Introduction to Bioinformatics
Anna Tramontano
The Ten Most Wanted Solutions in Protein Bioinformatics
Anna Tramontano
Combinatorial Pattern Matching Algorithms in Computational Biology Using Perl and R
Gabriel Valiente
Managing Your Biological Data with Python
Allegra Via, Kristian Rother, and Anna Tramontano
Cancer Systems Biology
Edwin Wang
Stochastic Modelling for Systems Biology, Second Edition
Darren J. Wilkinson
Big Data in Omics and Imaging: Association Analysis
Momiao Xiong
Big Data Analysis for Bioinformatics and Biomedical Discoveries
Shui Qing Ye
Bioinformatics: A Practical Approach
Shui Qing Ye
Introduction to Computational Proteomics
Golan Yona
Big Data in Omics and Imaging: Integrated Analysis and Causal Inference
Momiao Xiong
http://taylorandfrancis.com
Big Data in Omics and Imaging
Integrated Analysis and Causal Inference
Momiao Xiong
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
International Standard Book Number-13: 978-0-8153-8710-7 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
To Ping
Contents
Preface
Author
1. Genotype–Phenotype Network Analysis
1.1 Undirected Graphs for Genotype Network
1.1.1 Gaussian Graphical Model
1.1.2 Alternating Direction Method of Multipliers for Estimation of Gaussian Graphical Model
1.1.3 Coordinate Descent Algorithm and Graphical Lasso
1.1.4 Multiple Graphical Models
1.1.4.1 Edge-Based Joint Estimation of Multiple Graphical Models
1.1.4.2 Node-Based Joint Estimation of Multiple Graphical Models
1.2 Directed Graphs and Structural Equation Models for Networks
1.2.1 Directed Acyclic Graphs
1.2.2 Linear Structural Equation Models
1.2.3 Estimation Methods
1.2.3.1 Maximum Likelihood (ML) Estimation
1.2.3.2 Two-Stage Least Squares Method
1.2.3.3 Three-Stage Least Squares Method
1.3 Sparse Linear Structural Equations
1.3.1 L1-Penalized Maximum Likelihood Estimation
1.3.2 L1-Penalized Two-Stage Least Square Estimation
1.3.3 L1-Penalized Three-Stage Least Square Estimation
1.4 Functional Structural Equation Models for Genotype–Phenotype Networks
1.4.1 Functional Structural Equation Models
1.4.2 Group Lasso and ADMM for Parameter Estimation in the Functional Structural Equation Models
1.5 Causal Calculus
1.5.1 Effect Decomposition and Estimation
1.5.2 Graphical Tools for Causal Inference in Linear SEMs
1.5.2.1 Basics
1.5.2.2 Wright’s Rules of Tracing and Path Analysis
1.5.2.3 Partial Correlation, Regression, and Path Analysis
1.5.2.4 Conditional Independence and D-Separation
1.5.3 Identification and Single-Door Criterion
1.5.4 Instrument Variables
1.5.5 Total Effects and Backdoor Criterion
1.5.6 Counterfactuals and Linear SEMs
1.6 Simulations and Real Data Analysis
1.6.1 Simulations for Model Evaluation
1.6.2 Application to Real Data Examples
Appendix 1.A
Appendix 1.B
Exercises
2. Causal Analysis and Network Biology
2.1 Bayesian Networks as a General Framework for Causal Inference
2.2 Parameter Estimation and Bayesian Dirichlet Equivalent Uniform Score for Discrete Bayesian Networks
2.3 Structural Equations and Score Metrics for Continuous Causal Networks
2.3.1 Multivariate SEMs for Generating Node Core Metrics
2.3.2 Mixed SEMs for Pedigree-Based Causal Inference
2.3.2.1 Mixed SEMs
2.3.2.2 Two-Stage Estimate for the Fixed Effects in the Mixed SEMs
2.3.2.3 Three-Stage Estimate for the Fixed Effects in the Mixed SEMs
2.3.2.4 The Full Information Maximum Likelihood Method
2.3.2.5 Reduced Form Representation of the Mixed SEMs
2.4 Bayesian Networks with Discrete and Continuous Variables
2.4.1 Two-Class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks
2.4.2 Multiple Network Penalized Functional Logistic Regression Models for NGS Data
2.4.3 Multi-Class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks
2.5 Other Statistical Models for Quantifying Node Score Function
2.5.1 Nonlinear Structural Equation Models
2.5.1.1 Nonlinear Additive Noise Models for Bivariate Causal Discovery
2.5.1.2 Nonlinear Structural Equations for Causal Network Discovery
2.5.2 Mixed Linear and Nonlinear Structural Equation Models
2.5.3 Jointly Interventional and Observational Data for Causal Inference
2.5.3.1 Structural Equation Model for Interventional and Observational Data
2.5.3.2 Maximum Likelihood Estimation of Structural Equation Models from Interventional and Observational Data
2.5.3.3 Sparse Structural Equation Models with Joint Interventional and Observational Data
2.6 Integer Programming for Causal Structure Learning
2.6.1 Introduction
2.6.2 Integer Linear Programming Formulation of DAG Learning
2.6.3 Cutting Plane for Integer Linear Programming
2.6.4 Branch-and-Cut Algorithm for Integer Linear Programming
2.6.5 Sink Finding Primal Heuristic Algorithm
2.7 Simulations and Real Data Analysis
2.7.1 Simulations
2.7.2 Real Data Analysis
Software Package
Appendix 2.A Introduction to Smoothing Splines
Appendix 2.B Penalized Likelihood Function for Jointly Observational and Interventional Data
Exercises
3. Wearable Computing and Genetic Analysis of Function-Valued Traits
3.1 Classification of Wearable Biosensor Data
3.1.1 Introduction
3.1.2 Functional Data Analysis for Classification of Time Course Wearable Biosensor Data
3.1.3 Differential Equations for Extracting Features of the Dynamic Process and for Classification of Time Course Data
3.1.3.1 Differential Equations with Constant and Time-Varying Parameters for Modeling a Dynamic System
3.1.3.2 Principal Differential Analysis for Estimation of Parameters in Differential Equations
3.1.3.3 QRS Complex Example
3.1.4 Deep Learning for Physiological Time Series Data Analysis
3.1.4.1 Procedures of Convolutional Neural Networks for Time Course Data Analysis
3.1.4.2 Convolution is a Powerful Tool for Linear Filter and Signal Processing
3.1.4.3 Architecture of CNNs
3.1.4.4 Convolutional Layer
3.1.4.5 Parameter Estimation
3.2 Association Studies of Function-Valued Traits
3.2.1 Introduction
3.2.2 Functional Linear Models with Both Functional Response and Predictors for Association Analysis of Function-Valued Traits
3.2.3 Test Statistics
3.2.4 Null Distribution of Test Statistics
3.2.5 Power
3.2.6 Real Data Analysis
3.2.7 Association Analysis of Multiple Function-Valued Traits
3.3 Gene–Gene Interaction Analysis of Function-Valued Traits
3.3.1 Introduction
3.3.2 Functional Regression Models
3.3.3 Estimation of Interaction Effect Function
3.3.4 Test Statistics
3.3.5 Simulations
3.3.5.1 Type 1 Error Rates
3.3.5.2 Power
3.3.6 Real Data Analysis
Appendix 3.A Gradient Methods for Parameter Estimation in the Convolutional Neural Networks
Exercises
4. RNA-Seq Data Analysis
4.1 Normalization Methods on RNA-Seq Data Analysis
4.1.1 Gene Expression
4.1.2 RNA Sequencing Expression Profiling
4.1.3 Methods for Normalization
4.1.3.1 Total Read Count Normalization
4.1.3.2 Upper Quantile Normalization
4.1.3.3 Relative Log Expression (RLE)
4.1.3.4 Trimmed Mean of M-Values (TMM)
4.1.3.5 RPKM, FPKM, and TPM
4.1.3.6 Isoform Expression Quantification
4.1.3.7 Allele-Specific Expression Estimation from RNA-Seq Data with Diploid Genomes
4.2 Differential Expression Analysis for RNA-Seq Data
4.2.1 Distribution-Based Approach to Differential Expression Analysis
4.2.1.1 Poisson Distribution
4.2.1.2 Negative Binomial Distribution
4.2.2 Functional Expansion Approach to Differential Expression Analysis of RNA-Seq Data
4.2.2.1 Functional Principal Component Expansion of RNA-Seq Data
4.2.3 Differential Analysis of Allele Specific Expressions with RNA-Seq Data
4.2.3.1 Single-Variate FPCA for Testing ASE or Differential Expression
4.2.3.2 Allele-Specific Differential Expression by Bivariate Functional Principal Component Analysis
4.2.3.3 Real Data Application
4.3 eQTL and eQTL Epistasis Analysis with RNA-Seq Data
4.3.1 Matrix Factorization
4.3.2 Quadratically Regularized Matrix Factorization and Canonical Correlation Analysis
4.3.3 QRFCCA for eQTL and eQTL Epistasis Analysis of RNA-Seq Data
4.3.3.1 QRFCCA for eQTL Analysis
4.3.3.2 Data Structure for Interaction Analysis
4.3.3.3 Multivariate Regression
4.3.3.4 CCA for Epistasis Analysis
4.3.4 Real Data Analysis
4.3.4.1 RNA-Seq Data and NGS Data
4.3.4.2 Cis-Trans Interactions
4.4 Gene Co-Expression Network and Gene Regulatory Networks
4.4.1 Co-Expression Network Construction with RNA-Seq Data by CCA and FCCA
4.4.1.1 CCA Methods for Construction of Gene Co-Expression Networks
4.4.1.2 Bivariate CCA for Construction of Co-Expression Networks with ASE Data
4.4.2 Graphical Gaussian Models
4.4.3 Real Data Applications
4.5 Directed Graph and Gene Regulatory Networks
4.5.1 General Procedures for Inferring Genome-Wide Regulatory Networks
4.5.2 Hierarchical Bayesian Networks for Whole Genome Regulatory Networks
4.5.2.1 Summary Statistics for Representation of Groups of Gene Expressions
4.5.2.2 Low Rank Representation Induced Causal Network
4.5.3 Linear Regulatory Networks
4.5.4 Nonlinear Regulatory Networks
4.6 Dynamic Bayesian Network and Longitudinal Expression Data Analysis
4.6.1 Dynamic Structural Equation Models with Time-Varying Structures and Parameters
4.6.2 Estimation and Inference for Dynamic Structural Equation Models with Time-Varying Structures and Parameters
4.6.2.1 Maximum Likelihood (ML) Estimation
4.6.2.2 Generalized Least Square Estimation
4.6.3 Sparse Dynamic Structural Equation Models
4.6.3.1 L1-Penalized Maximum Likelihood Estimation
4.6.3.2 L1-Penalized Generalized Least Square Estimator
4.7 Single Cell RNA-Seq Data Analysis, Gene Expression Deconvolution, and Genetic Screening
4.7.1 Cell Type Identification
4.7.2 Gene Expression Deconvolution and Cell Type-Specific Expression
4.7.2.1 Gene Expression Deconvolution Formulation
4.7.2.2 Loss Functions and Regularization
4.7.2.3 Algorithms for Fitting Generalized Low Rank Models
Software Package
Appendix 4.A Variational Bayesian Theory for Parameter Estimation and RNA-Seq Normalization
Appendix 4.B Log-linear Model for Differential Expression Analysis of the RNA-Seq Data with Negative Binomial Distribution
Appendix 4.C Derivation of ADMM Algorithm
Appendix 4.D Low Rank Representation Induced Sparse Structural Equation Models
Appendix 4.E Maximum Likelihood (ML) Estimation of Parameters for Dynamic Structural Equation Models
Appendix 4.F Generalized Least Squares Estimator of the Parameters in Dynamic Structural Equation Models
Appendix 4.G Proximal Algorithm for L1-Penalized Maximum Likelihood Estimation of Dynamic Structural Equation Model
Appendix 4.H Proximal Algorithm for L1-Penalized Generalized Least Square Estimation of Parameters in the Dynamic Structural Equation Models
Appendix 4.I Multikernel Learning and Spectral Clustering for Cell Type Identification
Exercises
5. Methylation Data Analysis
5.1 DNA Methylation Analysis
5.2 Epigenome-Wide Association Studies (EWAS)
5.2.1 Single-Locus Test
5.2.2 Set-Based Methods
5.2.2.1 Logistic Regression Model
5.2.2.2 Generalized T2 Test Statistic
5.2.2.3 PCA
5.2.2.4 Sequencing Kernel Association Test (SKAT)
5.2.2.5 Canonical Correlation Analysis
5.3 Epigenome-Wide Causal Studies
5.3.1 Introduction
5.3.2 Additive Functional Model for EWCS
5.3.2.1 Mathematic Formulation of EACS
5.3.2.2 Parameter Estimation
5.3.2.3 Test for Independence
5.3.2.4 Test Statistics for Epigenome-Wide Causal Studies
5.4 Genome-Wide DNA Methylation Quantitative Trait Locus (mQTL) Analysis
5.4.1 Simple Regression Model
5.4.2 Multiple Regression Model
5.4.3 Multivariate Regression Model
5.4.4 Multivariate Multiple Regression Model
5.4.5 Functional Linear Models for mQTL Analysis with Whole Genome Sequencing (WGS) Data
5.4.6 Functional Linear Models with Both Functional Response and Predictors for mQTL Analysis with Both WGBS and WGS Data
5.5 Causal Networks for Genetic-Methylation Analysis
5.5.1 Structural Equation Models with Scalar Endogenous Variables and Functional Exogenous Variables
5.5.1.1 Models
5.5.1.2 The Two-Stage Least Squares Estimator
5.5.1.3 Sparse FSEMs
5.5.2 Functional Structural Equation Models with Functional Endogenous Variables and Scalar Exogenous Variables (FSEMs)
5.5.2.1 Models
5.5.2.2 The Two-Stage Least Squares Estimator
5.5.2.3 Sparse FSEMs
5.5.3 Functional Structural Equation Models with Both Functional Endogenous Variables and Exogenous Variables (FSEMF)
5.5.3.1 Model
5.5.3.2 Sparse FSEMF for the Estimation of Genotype-Methylation Networks with Sequencing Data
Software Package
Appendix 5.A Biased and Unbiased Estimators of the HSIC
Appendix 5.B Asymptotic Null Distribution of Block-Based HSIC
Exercises
6. Imaging and Genomics..............................................................................4956.1 Introduction........................................................................................4956.2 Image Segmentation..........................................................................496
6.2.1 Unsupervised Learning Methods for ImageSegmentation........................................................................4966.2.1.1 Nonnegative Matrix Factorization....................4966.2.1.2 Autoencoders.......................................................5026.2.1.3 Parameter Estimation of Autoencoders...........5076.2.1.4 Convolutional Neural Networks.......................516
6.2.2 Supervised Deep Learning Methods for ImageSegmentation........................................................................5306.2.2.1 Pixel-Level Image Segmentation.......................5306.2.2.2 Deconvolution Network for Semantic
Segmentation........................................................5366.3 Two- or Three-Dimensional Functional Principal Component
Analysis for Image Data Reduction................................................5386.3.1 Formulation..........................................................................5396.3.2 Integral Equation and Eigenfunctions..............................5406.3.3 Computations for the Function Principal Component
Function and the Function PrincipalComponent Score.................................................................541
6.4 Association Analysis of Imaging-Genomic Data
6.4.1 Multivariate Functional Regression Models for Imaging-Genomic Data Analysis
6.4.1.1 Model
6.4.1.2 Estimation of Additive Effects
6.4.1.3 Test Statistics
6.4.2 Multivariate Functional Regression Models for Longitudinal Imaging Genetics Analysis
6.4.3 Quadratically Regularized Functional Canonical Correlation Analysis for Gene–Gene Interaction Detection in Imaging Genetic Studies
6.4.3.1 Single Image Summary Measure
6.4.3.2 Multiple Image Summary Measures
6.4.3.3 CCA and Functional CCA for Interaction Analysis
6.5 Causal Analysis of Imaging-Genomic Data
6.5.1 Sparse SEMs for Joint Causal Analysis of Structural Imaging and Genomic Data
6.5.2 Sparse Functional Structural Equation Models for Phenotype and Genotype Networks
6.5.3 Conditional Gaussian Graphical Models (CGGMs) for Structural Imaging and Genomic Data Analysis
6.6 Time Series SEMs for Integrated Causal Analysis of fMRI and Genomic Data
6.6.1 Models
6.6.2 Reduced Form Equations
6.6.3 Single Equation and Generalized Least Square Estimator
6.6.4 Sparse SEMs and Alternating Direction Method of Multipliers
6.7 Causal Machine Learning
Software Package
Appendix 6.A Factor Graphs and Mean Field Methods for Prediction of Marginal Distribution
Exercises
7. From Association Analysis to Integrated Causal Inference
7.1 Genome-Wide Causal Studies
7.1.1 Mathematical Formulation of Causal Analysis
7.1.2 Basic Causal Assumptions
7.1.3 Linear Additive SEMs with Non-Gaussian Noise
7.1.4 Information Geometry Approach
7.1.4.1 Basics of Information Geometry
7.1.4.2 Formulation of Causal Inference in Information Geometry
7.1.4.3 Generalization
7.1.4.4 Information Geometry for Causal Inference
7.1.4.5 Information Geometry-Based Causal Inference Methods
7.1.5 Causal Inference on Discrete Data
7.1.5.1 Distance Correlation
7.1.5.2 Properties of Distance Correlation and Test Statistics
7.1.5.3 Distance Correlation for Causal Inference
7.1.5.4 Additive Noise Models for Causal Inference on Discrete Data
7.2 Multivariate Causal Inference and Causal Networks
7.2.1 Markov Condition, Markov Equivalence, Faithfulness, and Minimality
7.2.2 Multilevel Causal Networks for Integrative Omics and Imaging Data Analysis
7.2.2.1 Introduction
7.2.2.2 Additive Noise Models for Multiple Causal Networks
7.2.2.3 Integer Programming as a General Framework for Joint Estimation of Multiple Causal Networks
7.3 Causal Inference with Confounders
7.3.1 Causal Sufficiency
7.3.2 Instrumental Variables
7.3.3 Confounders with Additive Noise Models
7.3.3.1 Models
7.3.3.2 Methods for Searching Common Confounder
7.3.3.3 Gaussian Process Regression
7.3.3.4 Algorithm for Confounder Identification Using Additive Noise Models for Confounder
Software Package
Appendix 7.A Approximation of Log-Likelihood Ratio for the LiNGAM
Appendix 7.B Orthogonality Conditions and Covariance
Appendix 7.C Equivalent Formulations Orthogonality Conditions
Appendix 7.D M–L Distance in Backward Direction
Appendix 7.E Multiplicativity of Traces
Appendix 7.F Anisotropy and K–L Distance
Appendix 7.G Trace Method for Noise Linear Model
Appendix 7.H Characterization of Association
Appendix 7.I Algorithm for Sparse Trace Method
Appendix 7.J Derivation of the Distribution of the Prediction in the Bayesian Linear Models
Exercises
References
Index
http://taylorandfrancis.com
Preface
Despite significant progress in dissecting the genetic architecture of complex diseases by association analysis, understanding the etiology and mechanism of complex diseases remains elusive. Significant findings of association analysis have often lacked consistency and proved to be controversial. The current approach to genomic analysis lacks breadth (the number of variables analyzed at a time) and depth (the number of steps taken by the genetic variants to reach the clinical outcomes across genomic and molecular levels), and its paradigm is association and correlation analysis. Next-generation genomic, epigenomic, sensing, and imaging technologies are producing ever deeper multiple omic, physiological, imaging, environmental, and phenotypic data, the causal inference of which is a cornerstone of scientific discovery and an essential component for discovering the mechanisms of disease. It is time to shift the current paradigm of genetic analysis from shallow association analysis to deep causal inference, and from genetic analysis alone to integrated genomic, epigenomic, imaging, and phenotypic data analysis for unraveling the mechanism of psychiatric disorders.

This book is a natural extension of the book Big Data in Omics and Imaging: Association Analysis. The focus of this book is integrated genomic, epigenomic, and imaging data analysis and causal inference. To make the paradigm shift feasible, this book will (1) develop novel or apply existing causal inference methods for genome-wide and epigenome-wide causal studies of complex diseases; (2) develop unified frameworks for systematic causal analysis of integrated genomic, epigenomic, image, and clinical phenotype data, and for inferring multilevel omic and image causal networks, which lead to discovery of paths from genetic variants to the disease via multiple omic and image causal networks; (3) develop novel and apply existing methods for gene expression and methylation deconvolution, and develop novel methods for inferring cell-specific multiple omic causal networks; and (4) introduce deep learning for genomic, epigenomic, and imaging data analysis and develop methods for combining deep learning with causal inference.

This book is organized into seven chapters. The following is a description of each chapter. Chapter 1, “Genotype–Phenotype Network Analysis,” studies directed and undirected genotype–phenotype networks, which are major topics of causal inference. Efficient genetic analysis consists of two major parts: (1) breadth (the number of phenotypes which the genetic variants affect) and (2) depth (the number of steps which are taken by the genetic variants to reach the clinical outcomes). Causal inference theory and chain graph models provide an innovative analytic platform for deep and precise multilevel hybrid causal genotype–disease network analysis. Very few
genetic and epigenetic textbooks cover causal inference theory in depth; therefore, Chapter 1 and Chapter 2 will provide solid knowledge and efficient tools for causal inference in genomic and epigenomic analysis. Chapter 1 includes (1) undirected graphs for genotype networks, (2) the alternating direction method of multipliers for estimation of Gaussian graphical models, (3) the coordinate descent algorithm and graphical Lasso, (4) multiple graphical models, (5) directed graphs and structural equation models for networks, (6) sparse linear structural equations, (7) functional structural equation models for genotype–phenotype networks with next-generation sequencing data, and (8) effect decomposition and estimation.

Chapter 2, “Causal Analysis and Network Biology,” covers (1) Bayesian networks as a general framework for causal inference, (2) structural equations and score metrics for continuous causal networks, (3) network penalized logistic regression for learning hybrid Bayesian networks, (4) statistical methods for pedigree-based causal inference, (5) nonlinear structural equation models, (6) mixed linear and nonlinear structural equation models, (7) jointly interventional and observational data for causal inference, and (8) integer programming for causal structure learning.

Chapter 3, “Wearable Computing and Genetic Analysis of Function-Valued Traits,” studies the genetics of function-valued traits. Early detection of diseases and health monitoring are primary goals of health care and disease management. Physiological traits such as ECG, EEG, SCG, EMG, MEG, and oxygen saturation levels provide important information on the health status of humans and can be used to monitor and diagnose diseases. Wearable sensors with a capacity for noninvasive and continuous personal health monitoring will not only measure health parameters of individuals at rest, but also generate signals of transient events that may be of profound prognostic or therapeutic importance. These physiological traits are function-valued traits. Analysis of genomic and spatiotemporal physiological data can provide the holistic genetic structure of disease, but also poses great methodological and computational challenges. There is a lack of statistical methods for genetic analysis of function-valued traits in the literature. In this chapter, we propose novel statistical methods for genetic analysis of physiological traits. Chapter 3 covers wearable computing for automated disease diagnosis and real-time health care monitoring, deep learning for physiological time series data analysis, functional linear models with both functional response and functional predictors for association analysis of physiological traits with next-generation sequencing data, mixed functional linear models with functional response for family-based genetic analysis of physiological traits, functional regression models with both functional response and functional predictors for gene–gene interaction analysis, and functional canonical correlation analysis for association studies of physiological traits.

Chapter 4, “RNA-Seq Data Analysis,” covers (1) data normalization and preprocessing, (2) functional principal component analysis test for differential expression analysis with RNA-seq or miRNA-seq data, (3) multivariate
functional principal component analysis for allele-specific expression analysis, (4) eQTL and eQTL epistasis analysis with RNA-seq data, (5) co-expression networks, (6) linear and nonlinear regulatory networks, (7) gene expression imputation, (8) genotype–expression regulatory networks, (9) dynamic Bayesian networks and longitudinal expression data analysis, and (10) single cell RNA-seq data analysis, gene expression deconvolution, and genetic screening.

Chapter 5, “Methylation Data Analysis,” discusses methylation data analysis. The statistical methods for differential gene expression, eQTL analysis, and genotype–expression regulatory networks can be easily extended to methylation data analysis. Epigenome-wide causal studies, a new concept for epigenetic analysis, are first introduced in this chapter. In addition to these analyses, Chapter 5 puts emphasis on inference of whole-genome methylation and expression causal networks. Since both gene expression and methylation data involve more than 20,000 genes, it is impossible to construct a causal network with more than 40,000 nodes. Therefore, multiple level methylation-expression networks should be designed. Chapter 5 addresses three essential issues in the estimation of multiple level methylation-expression networks: (1) a low-rank model for representation of either gene expression or methylation in a pathway or a cluster, (2) construction of methylation and expression networks using the low-rank model representation of methylation and gene expression in the pathways or clusters, and (3) construction of methylation and gene expression causal networks using original methylation and gene expression values in the locally connected pathways or clusters. Chapter 5 also investigates in which cells methylation regulates gene expression in which cells. This chapter presents several novel approaches to methylation and gene expression analysis.

Chapter 6, “Imaging and Genomics,” focuses on imaging signal processing, automatic image diagnosis, and genetic-imaging data analysis. There is increasing interest in statistical methods and computational algorithms to analyze high-dimensional, space-correlated, and complex imaging data, together with clinical and genetic data, for disease diagnosis, management, and disease mechanism research. This chapter covers (1) deep learning for medical image semantic segmentation, (2) three-dimensional functional principal component analysis for imaging signal extraction, (3) imaging network construction and connectivity analysis, (4) causal machine learning for automated imaging diagnosis of disease, (5) multiple functional linear models for imaging genetics analysis with next-generation sequencing data, (6) quadratically regularized functional canonical correlation analysis for imaging genetics or imaging RNA-seq data analysis, (7) causal analysis for imaging genetics and imaging RNA-seq data analysis, (8) time series structural equation models for integrated causal analysis of fMRI and genomic data, and (9) causal machine learning.

Chapter 7, “From Association Analysis to Integrated Causal Inference,” will develop novel statistical methods for genome-wide causal studies and investigate integrated genomic, epigenomic, imaging, and multiple phenotype
data analysis. Chapter 7 presents the mathematical formulation of causal analysis and discusses principles underlying causation. The criteria for distinguishing causation tests from association tests are also introduced in Chapter 7. In genomic and epigenomic data analysis, we usually consider four types of associations: association of discrete variables with continuous variables, continuous variables with continuous variables, discrete variables with a binary trait, and continuous variables with a binary trait (disease status). These four types of association analyses are extended to four types of causation analyses in this chapter. Chapter 7 also covers several powerful tools for causal inference, including additive noise models, information geometry, trace methods, Haar measure, and distance correlation. There are multiple steps between genes and phenotypes. Only broadly and deeply searching the enormous path space connecting genetic variants to the clinical outcomes allows us to uncover the mechanism of diseases. Precision medicine demands deep, systematic, comprehensive, and precise analysis of genotype–phenotype relationships: “the deeper you go, the more you know.” Chapter 7 proposes to use causal inference theory to develop an innovative analytic platform for deep and precise multilevel hybrid causal genotype–disease network analysis, which integrates gene association subnetworks, environment subnetworks, gene regulatory subnetworks, causal genetic-methylation subnetworks, methylation-gene expression networks, genotype–gene expression-imaging subnetworks, the intermediate phenotype subnetworks, and multiple disease subnetworks into a single connected multilevel genotype–disease network to reveal the deep causal chain of mechanisms underlying the disease. In addition, Chapter 7 covers causal inference with confounders.

Overall, this book introduces state-of-the-art studies and practical achievements in causal inference, deep learning, and genomic, epigenomic, imaging, and multiple phenotype data analysis. This book sets the basis and analytic platforms for further research in this challenging and rapidly changing field. The expectation is that the concepts, statistical methods, computational algorithms, and analytic platforms presented in the book will facilitate training next-generation statistical geneticists, bioinformaticians, and computational biologists.

I would like to thank Sara A. Barton for editing the book. I am deeply grateful to my colleagues and collaborators Li Jin, Eric Boerwinkle, and others whom I have worked with for many years. I would especially like to thank my former and current students and postdoctoral fellows for their strong dedication to the research and scientific contributions to the book: Jinying Zhao, Li Luo, Shenying Fang, Nan Lin, Rong Jiao, Zixin Hu, Panpan Wang, Kelin Xu, Dan Xie, Xiangzhong Fang, Jun Li, Shicheng Guo, Shengjun Hong, Pengfei Hu, Tao Xu, Wenjia Peng, Xuesen Wu, Yun Zhu, Dung-Yang Lee, Lerong Li, Getie A. Zewdie, Long Ma, Hua Dong, Futao Zhang, and Hoicheong Siu. Finally, I must thank my editor, David Grubbs, for his encouragement and patience during the process of creating this book.
MATLAB® is a registered trademark of The MathWorks, Inc. For product information, please contact:

The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
Email: [email protected]
Web: www.mathworks.com
Author
Momiao Xiong is a professor in the Department of Biostatistics and Data Science, University of Texas School of Public Health; a regular member in the Genetics & Epigenetics (G&E) Graduate Program at The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences; and a distinguished professor in the School of Life Science, Fudan University, China.