integrating multi-omics for type 2...

66
ACTA UNIVERSITATIS UPSALIENSIS UPPSALA 2019 Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1860 Integrating multi-omics for type 2 diabetes Data science and big data towards personalized medicine KLEV DIAMANTI ISSN 1651-6214 ISBN 978-91-513-0758-9 urn:nbn:se:uu:diva-393440

Upload: others

Post on 14-Jul-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

ACTAUNIVERSITATIS

UPSALIENSISUPPSALA

2019

Digital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology 1860

Integrating multi-omics for type 2diabetes

Data science and big data towards personalizedmedicine

KLEV DIAMANTI

ISSN 1651-6214ISBN 978-91-513-0758-9urn:nbn:se:uu:diva-393440

Page 2: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

Dissertation presented at Uppsala University to be publicly examined in C2:305, BiomedicalCentrum (BMC), Husargatan 3, Uppsala, Monday, 11 November 2019 at 09:00 for thedegree of Doctor of Philosophy. The examination will be conducted in English. Facultyexaminer: Associate Professor Peter Spégel (Centre for Analysis and Synthesis, Departmentof Chemistry, Lund University).

AbstractDiamanti, K. 2019. Integrating multi-omics for type 2 diabetes. Data science and big datatowards personalized medicine. Digital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology 1860. 65 pp. Uppsala: Acta UniversitatisUpsaliensis. ISBN 978-91-513-0758-9.

Type 2 diabetes (T2D) is a complex metabolic disease characterized by multi-tissue insulinresistance and failure of the pancreatic β-cells to secrete sufficient amounts of insulin. Cellsrecruit transcription factors (TF) to specific genomic loci to regulate gene expression thatconsequently affects the protein and metabolite abundancies. Here we investigated the interplayof transcriptional and translational regulation, and its impact on metabolome and phenome forseveral insulin-resistant tissues from T2D donors. We implemented computational tools andmulti-omics integrative approaches that can facilitate the selection of candidate combinatorialmarkers for T2D.

We developed a data-driven approach to identify putative regulatory regions and TF-interaction complexes. The cell-specific sets of regulatory regions were enriched for disease-related single nucleotide polymorphisms (SNPs), highlighting the importance of such locitowards the genomic stability and the regulation of gene expression. We employed asimilar principle in a second study where we integrated single nucleus ribonucleic acidsequencing (snRNA-seq) with bulk targeted chromosome-conformation-capture (HiCap) andmass spectrometry (MS) proteomics from liver. We identified a putatively polymorphic site thatmay contribute to variation in the pharmacogenetics of fluoropyrimidines toxicity for the DPYDgene. Additionally, we found a complex regulatory network between a group of 16 enhancersand the SLC2A2 gene that has been linked to increased risk for hepatocellular carcinoma (HCC).Moreover, three enhancers harbored motif-breaking mutations located in regulatory regions ofa cohort of 314 HCC cases, and were candidate contributors to malignancy.

In a cohort of 43 multi-organ donors we explored the alternating pattern of metabolitesamong visceral adipose tissue (VAT), pancreatic islets, skeletal muscle, liver and blood serumsamples. A large fraction of lysophosphatidylcholines (LPC) decreased in muscle and serum ofT2D donors, while a large number of carnitines increased in liver and blood of T2D donors,confirming that changes in metabolites occur in primary tissues, while their alterations inserum consist a secondary event. Next, we associated metabolite abundancies from 42 subjectsto glucose uptake, fat content and volume of various organs measured by positron emissiontomography/magnetic resonance imaging (PET/MRI). The fat content of the liver was positivelyassociated with the amino acid tyrosine, and negatively associated with LPC(P-16:0). Theinsulin sensitivity of VAT and subcutaneous adipose tissue was positively associated withseveral LPCs, while the opposite applied to branch-chained amino acids. Finally, we presentedthe network visualization of a rule-based machine learning model that predicted non-diabetesand T2D in an “unseen” dataset with 78% accuracy.

Keywords: type 2 diabetes, multi-omics, genomics, metabolomics, data science, machinelearning, personalized medicine

Klev Diamanti, Department of Cell and Molecular Biology, Computational Biology andBioinformatics, Box 596, Uppsala University, SE-751 24 Uppsala, Sweden.

© Klev Diamanti 2019

ISSN 1651-6214ISBN 978-91-513-0758-9urn:nbn:se:uu:diva-393440 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-393440)

Page 3: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

To my father Kristaq, my mother Engjellushe

and my wife Mila

Page 4: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

Gjuha jonë sa e mirë, Sa e ëmbël, sa e gjerë, Sa e lehtë, sa e lirë, Sa e bukur, sa e vlerë.

Naim Frashëri – «Korça»

Σα βγεις στον πηγαιμό για την Ιθάκη, να εύχεσαι να 'ναι μακρύς ο δρόμος,

γεμάτος περιπέτειες, γεμάτος γνώσεις. …

Η Ιθάκη σ' έδωσε τ' ωραίο ταξίδι. Χωρίς αυτήν δεν θα 'βγαινες στον δρόμο.

Άλλα δεν έχει να σε δώσει πια. Κι αν πτωχική την βρεις, η Ιθάκη δεν σε γέλασε.

Έτσι σοφός που έγινες, με τόση πείρα, ήδη θα το κατάλαβες οι Ιθάκες τι σημαίνουν.

Κ. Π. Καβάφης – «Ιθάκη»

Page 5: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

List of Publications

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Diamanti K.*, Umer H.M.*, Kruczyk M., Dąbrowski M.J.,

Cavalli M., Wadelius C., Komorowski J. (2016). Maps of context-dependent putative regulatory regions and genomic signal interactions. Nucleic Acids Research, 44(19):9110-9120.

II Cavalli M.*, Diamanti K.*, Pan G., Spalinskas R., Kumar C., Deshmukh A.S., Mann M., Sahlén P., Komorowski J. and Wadelius C. (2019). Single nuclei transcriptome analysis of human liver with integration of proteomics and capture Hi-C bulk tissue data. Genomics, Proteomics and Bioinformatics (under review).

III Diamanti K., Cavalli M., Pan G., Pereira M.J., Kumar C., Skrtic S., Grabherr M., Risérus U., Eriksson J.W., Komorowski J. and Wadelius C. (2019). Intra- and inter-individual metabolic profiling highlights carnitine and lysophosphatidylcholine pathways as key molecular defects in type 2 diabetes. Scientific Reports 9, 9653.

IV Diamanti K.*, Visvanathar R.*, Pereira M.J., Cavalli M., Pan G., Kumar C., Skrtic S., Ingelsson M., Fall T., Lind L., Eriksson J.W., Kullberg J., Wadelius C., Ahlström H. and Komorowski J. (2019). Integration of whole-body PET/MRI with non-targeted metabolomics provides new insights into insulin sensitivity of various tissues. Manuscript.

* These authors contributed equally to this work as first authors

Reprints were made with permission from the respective publishers.

Page 6: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

List of Additional Publications

Papers not included in the thesis:

1. Stępniak K., Machnicka M., Mieczkowski J., Macioszek A., Wojtaś B., Gielniewski B., Król S., Guzik R., Jardanowska M., Dziedzic A., Grabowicz I., Kranas H., Sienkiewicz K., Dramiński M., Dąbrowski M.J., Diamanti K., Kotulska K., Grajkowska W., Roszkowski M., Czernicki T., Marchel A., Komorowski J., Kaminska B., Wilczyński B. (2019). Mapping chromatin accessibility and active regulatory elements reveals new pathological mechanisms in human gliomas. Cancer Cell (submitted).

2. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Network*** (2019). Pan-cancer analysis of whole genomes. Nature (under review).

3. Rheinbay E., Nielsen M.M., …, Diamanti K., …, Komorowski J., …, Umer H.M., …, Wadelius C., …, Lopez-Bigas N., Martincorena I., Pedersen J.S., Getz G. (2019). Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes. Nature (under review).

4. Garbulowski M., Diamanti K.**, Smolińska K.**, Stoll P., Bornelöv S., Øhrn A., Komorowski J. (2019). R.ROSETTA: a package for analysis of rule-based classification models. bioRxiv, 625905.

5. Atienza-Párraga A.*, Diamanti K.*, Nylund P., Skaftason A., Ma A., Jin J., Martín-Subero J.I., Öberg F., Komorowski J., Jernberg-Wiklund H., Kalushkova A. (2019). Epigenomic re-configuration of primary multiple myeloma underlies the synergistic effect of combined DNMT and EZH2 inhibition. Manuscript.

6. Dabrowski M.J., Draminski M., Diamanti K., Stepniak K., Mozolewska M.A., Teisseyre P., Koronacki J., Komorowski J., Kaminska B., Wojtas B. (2018). Unveiling new interdependencies between significant DNA methylation sites, gene expression profiles and glioma patients survival. Scientific Reports, 8(1):4390.

Page 7: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

7. Umer H.M., Cavalli M., Dabrowski M.J., Diamanti K., Kruczyk M., Pan G., Komorowski J., Wadelius C. (2016). A significant regulatory mutation burden at a high-affinity position of the CTCF motif in gastrointestinal cancers. Human Mutation, 37(9):904-913.

8. Dramiński M., Dabrowski M.J., Diamanti K., J Koronacki J. and Komorowski J. (2016). Discovering networks of interdependent features in high-dimensional problems. Big Data Analysis: New Algorithms for a New Society, pp. 285-304.

9. Diamanti K., Kanavos A., Makris C., Tokis T. (2014). Handling weighted sequences employing inverted files and suffix trees. Proceedings of the 10th International Conference on Web Information Systems and Technologies, Vol. 10, pp. 231-238.

* These authors contributed equally to this work as first authors ** These authors contributed equally to this work as second authors *** Klev Diamanti is a member of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Network

Page 8: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,
Page 9: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

Contents

Introduction ............................................................................................... 13Diabetes ................................................................................................ 13Type 2 diabetes ..................................................................................... 13

The “ominous octet” of T2D pathogenesis ...................................... 14Pre-diabetes and treatment strategies .............................................. 16Quantifiable markers for T2D and IR .............................................. 16Genetic, protein and metabolic factors in T2D ................................ 17

Omics technologies and experimental data .......................................... 19ChIP-seq and DNaseI-seq ................................................................ 19Transcriptomics ............................................................................... 20Proteomics ....................................................................................... 21Metabolomics .................................................................................. 23Other omics ...................................................................................... 25Data deposition in the public domain .............................................. 26Experimental biases and sources of error ........................................ 26

Bioinformatics ...................................................................................... 30Data correction and normalization .................................................. 30Statistical analysis ............................................................................ 30Machine learning ............................................................................. 32Multi-omics integration ................................................................... 36Data science and software development in life sciences ................. 38

Aims .......................................................................................................... 39

Experimental design .................................................................................. 40Cohort I ............................................................................................ 41Cohort II .......................................................................................... 41Cohort III ......................................................................................... 41

Methods and results .................................................................................. 42Paper I .................................................................................................. 42

Aims ................................................................................................. 42Methods ........................................................................................... 42Results ............................................................................................. 43

Paper II ................................................................................................. 45Aims ................................................................................................. 45Methods ........................................................................................... 45

Page 10: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

Results ............................................................................................. 46Paper III ................................................................................................ 47

Aims ................................................................................................. 47Methods ........................................................................................... 47Results ............................................................................................. 48

Paper IV ................................................................................................ 49Aims ................................................................................................. 49Methods ........................................................................................... 50Results ............................................................................................. 51

Discussion and concluding remarks .......................................................... 53

Sammanfattning på svenska ...................................................................... 56

Acknowledgements ................................................................................... 59

References ................................................................................................. 61

Page 11: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

Abbreviations

AA Amino acid

AAA Aromatic amino acid

AUC Area under the curve

BCAA Brached-chain amino acid

BMI Body-mass index

bp Base pairs

ChIP-seq Chromatin immunoprecipitation followed by high-throughput sequencing

ChromHMM Chromatin state discovery and characterization

CV Cross validation

DNA Deoxyribonucleic acid

DNaseI-seq Deoxyribonuclease I sequencing

FFA Free fatty acid

GC-MS Gas chromatography mass spectrometry

HbA1c Glycosylated hemoglobin A1c

HiCap Targeted chromosome conformation capture

HOMA-IR Homeostasis model assessment - insulin resistance

IFG Impaired fasting glucose

IGT Impaired glucose tolerance

IR Insulin resistance

IS Internal standard

Ki Influx rate of [18F]FDG

LC-MS Liquid chromatography mass spectrometry

LPC Lysophosphatidylcholine

Page 12: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

M-value Whole-body insulin sensitivity index

MCFS Monte Carlo feature selection

MS Mass spectrometry

MS/MS Tandem mass spectrometry

ND Non-diabetes

OGTT Oral glucose tolerance test

PCA Principal component analysis

PET/MRI Positron emission tomography/magnetic resonance imaging

ROC Receiver operating characteristic

RT Retention time

SAT Subcutaneous adipose tissue

SNP Single nucleotide polymorphism

snRNA-seq Single nucleus ribonucleic acid sequencing

t-SNE t-Distributed stochastic neighbor embedding

T2D Type 2 diabetes

TF Transcription factor

TFBS Transcription factor binding site

UMAP Uniform manifold approximation and projection

UMI Unique molecular identifier

VAT Visceral adipose tissue

WHR Waist-hip ratio

Page 13: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

13

Introduction

Diabetes Diabetes is a complex metabolic disease presented in the form of hyperglycemia that is caused by deficient insulin secretion, impaired insulin sensitivity or both1. There exist evidence from ancient Egypt, India, Persia and Greece referring to common diabetes symptoms such as polydipsia, polyphagia, excessive weight loss and glycosuria2. The most common types of diabetes are type 1 diabetes (T1D) and type 2 diabetes (T2D)1. T1D is caused by an autoimmune reaction of the T-cells attacking the cell-surface antigens of pancreatic β-cells, and it accounts for 5-10% of all diabetes incidents3,4. On the other hand, T2D, that accounts for the vast majority of diabetes incidents, is characterized by multi-organ insulin resistance (IR) and impaired glucose tolerance (IGT), and it has been linked to older age, higher body-mass index (BMI), lack of physical activity, high-fat diet and a limited number of genetic factors5. However, it would be more accurate to refer to diabetes as a spectrum of metabolic diseases characterized by multi-organ IR and the inability of the body to maintain glucose homeostasis6. Another subtype of diabetes, latent autoimmune diabetes of the adult (LADA), is more similar to T1D due to the autoantibodies that target the pancreatic islets. However, LADA cases exhibit fewer antibodies that primarily target a different type of antigens7. Further diabetes sub-classification includes gestational diabetes, neonatal diabetes, maturity-onset-diabetes of the young (MODY), and maternally inherited diabetes and deafness (MIDD)6. In the following sections of this thesis we will focus on T2D.

Type 2 diabetes T2D accounts for 80-90% of the diabetes incidents worldwide5,8. The prevalence of T2D has increased by more than 100% since 1980 and is predicted to reach levels of a pandemic by 2030, with a projected number of more than 430 million recorded cases (Figure 1)9,10. In a 2016 report, the world health organization (WHO) estimated a total of 422 million diagnosed and undiagnosed adults living with diabetes in 2014, and 3.7 million deaths caused by diabetes itself or hyperglycemia8. Taken together, this evidence confirm T2D as a public threat to human health, and justify the decision of world leaders to classify it as one of the four priority noncommunicable diseases

Page 14: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

14

targeted for action8. T2D is strongly linked to the lifestyle of the modern society, characterized by a high-energy diet that is accompanied with limited or no exercise. Developing economies in Asia and Africa are predicted to demonstrate the most dramatic increases in T2D incidents by 2030 (Figure 1)5,11,12. On the other hand, the rise in T2D cases in Europe is predicted to be lower than in Asia and Africa due to specific genetic factors, and the increasing awareness on its complications and risks (Figure 1)13,14.

Figure 1. Prevalence of diabetes in 2010 (top value in millions), the projected number of diabetes cases in 2030 (middle value in millions) and the percentage increase (bottom value) in seven color-coded regions of the world. Figure adapted with permission from Macmillian Publishers Ltd: Nat. Rev. Endocrinol., (Chen L. et al., 8, 228-236) copyright 2012.

The “ominous octet” of T2D pathogenesis Pancreatic β-cells produce insulin that is the main anabolic hormone of the human body. Insulin mainly regulates the metabolism of carbohydrates and promotes the absorption of glucose from the muscles, the liver and the adipose tissue15. The main tissues that become resistant to the effect of insulin in T2D are the liver and the skeletal muscle. IR in liver and muscle leads to elevated levels of glucose in blood, that is a condition known as hyperglycemia. In response to hyperglycemia the pancreatic β-cells produce increasing amounts of insulin, leading to hyperinsulinemia16. Under normal conditions, insulin inhibits hepatic glucose production (HGP), also known as gluconeogenesis. IR in liver is demonstrated by an increase in HGP, suggesting deficiencies in the mechanism for halting it17–19. In the muscle, IR is manifested by impaired glucose uptake20 that is linked to deficient glucose transport and phosphorylation21, decreased glucose oxidation22 and lower glycogen synthesis23–25. While excessive insulin secretion maintains normoglycemic levels for a certain period, constant stress on pancreatic β-cells16, in combination with lipotoxicity26 and hypersecretion of islet amyloid

Page 15: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

15

polypeptide27,28 lead to failure of β-cells. Other contributing factors are glucotoxicity29,30, ageing31 and specific genetic factors6,32. IR in liver and muscle together with the progressive failure of the pancreatic β-cells form the T2D “triumvirate” (Figure 2)33.

Figure 2. Schematic overview of the “ominous octet” of T2D pathogenesis. From top-left clockwise: liver, pancreatic β-cells, skeletal muscle, gastrointestinal tissue, brain, kidney, pancreatic α-cells and adipose tissue. Figure adapted with permission from the American Diabetes Association: Diabetes, (DeFronzo, 58(4), 773-795) copyright 2009; and with the assistance of Marek Mazurkiewicz, Cancer Research UK, Olek Remesz, Wilfredor, Yayamamo, ColnKurtz and Adert (Wikimedia Commons).

An extensive amount of literature has suggested that the adipose tissue, the gastrointestinal tissues, the pancreatic α-cells, the kidney and the brain should be considered to extend the triumvirate to an “ominous octet”. IR in adipose tissue is mainly demonstrated by resistance to the antilipolytic effect of insulin34,35, that consequently leads to increased levels of free fatty acids (FFA) in blood, and lipotoxicity36,37. Secretion of glucagon-like peptide-1 (GLP-1) and gastric inhibitory polypeptide (GIP) from the gastrointestinal tissue triggers the production of insulin from the pancreatic β-cells. In early and overt T2D, β-cells are resistant to the insulin-stimulatory effect of GIP38,39, while reduced levels of GLP-1 have been associated with impaired glycemic control40. The pancreatic α-cells secrete glucagon that, as the main catabolic hormone of the body, increases the concentration of fatty acids and glucose in the blood41. While glucagon and insulin generally maintain glucose homeostasis, in T2D this balance is heavily disturbed with elevated glucose production stimulated by glucagon and steadily decreasing insulin secretion

Page 16: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

16

occurring due to failure of β-cells42,43. In the kidneys, the renal sodium/glucose transporter proteins SGLT1 and SGLT2, that are responsible for the reabsorption of glucose from the body, have been suggested as drug targets towards improved glycemic control in T2D44–46. Finally, the glucose uptake from the brain has been shown to be higher in subjects with IGT47,48, while the hypothalamic areas linked to appetite regulation demonstrated a slower inhibitory response in T2D and IGT subjects49,50. Six different tissues, and the pancreatic α- and β-cells form the “ominous octet” that is involved in the pathogenesis of T2D (Figure 2)16. This underlines the potential of significant discoveries from exploratory multi-tissue studies.

Pre-diabetes and treatment strategies Classification of subjects with hyperglycemia as T2D is usually decided upon insufficient evidence of inclusion to other diabetes subtypes6. Commonly, pre-diabetes is a classification of subjects that cannot be assigned as normoglycemic or diabetic, but are considered to be at high risk to develop T2D51. Pre-diabetes subjects are categorized based on impaired fasting glucose (IFG) (100 ≥ fasting plasma glucose (FPG) ≥ 125 mg/dL), IGT (140 ≥ glucose ≥ 199 mg/dL 2-hours post 75g oral glucose tolerance test (OGTT)) and/or elevated levels of the glycosylated hemoglobin protein A1c (HbA1c) 5.7-6.4%52. Subjects with pre-diabetes are deprived from ~80% of the functionality of their β-cells, while approximately 10% of them are expected to develop diabetic retinopathy16. Some researchers have suggested that subjects with IGT and/or IFG should be treated similarly to those with T2D16.

In a seminal paper, DeFronzo supported the application of a T2D treatment strategy that targets its pathophysiological disturbances through preservation of β-cell functionality and enhancement of insulin sensitivity16. In a recent groundbreaking study, Groop and colleagues identified five subgroups of patients with diabetes based on unsupervised machine learning on six clinical variables. The five groups showed varying risks with regard to diabetes complications in different stages of diabetes progression, and, consequently, suggested targeted treatment strategies53. In addition, multiple studies have identified genes and, common and rare genetic variants to be associated with T2D6. Taken together, these studies support the idea of adapting the diabetes treatment to a personalized level in order to tackle the pathophysiological consequences of specific diabetes subgroups.

Quantifiable markers for T2D and IR Markers that can be quantified from biological samples enable the estimation of the current state of a disease, the observation of the response to treatment or the prediction of future events or complications54. Such markers are abundant in the case of T2D55. Scientific studies and clinical applications rely

Page 17: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

17

largely on accurate indicators for insulin resistance/sensitivity or glucose tolerance56.

The commonly accepted standard for quantifying whole-body insulin sensitivity is the hyperinsulinemic euglycemic clamp57. During the clamp, insulin is infused or perfused to the patient to induce hyperinsulinemia, while steady-state glucose levels are maintained through glucose infusion55. The glucose infusion rate after steady-state levels are achieved reflects the glucose-uptake ability of the body. The M-value is the whole-body insulin sensitivity index that is calculated based on the amount of infused glucose in combination with the lean body mass and time47.

However, carrying out hyperinsulinemic euglycemic clamps on large cohorts is arduous and costly. The development of various surrogate indices has enabled the estimation of IR and insulin sensitivity, while drastically decreasing the cost and labor. OGTT is one of the surrogate indices that is used to measure glucose tolerance and detect T2D52. During an OGTT, the patient, after a minimum of 8 hours of fasting, is provided with an oral dose of 75g of glucose followed by blood sampling in fixed time points over a period of two hours52,58. Glucose or insulin measurements from blood samples of OGTTs are used to calculate the area under the curve (AUC) as a single index of glucose intolerance or insulin secretion, respectively59,60. An additional surrogate marker is the homeostasis model assessment (HOMA) that quantifies β-cell function or IR from basal insulin and fasting glucose concentrations55,61. The insulin resistance is quantified with HOMA-IR, while the β-cell function with HOMA-B61,62. Another popular marker that reflects the average glycemic control over a period of two to three months is HbA1c

63,64. A drawback of HbA1c is its limited clinical relevance to IR for non-diabetes (ND) subjects55. Nevertheless, HbA1c is easy to measure from one blood sample and it has been proven to accurately represent postprandial and fasting glycemic states, hence it consists a robust complementary diagnostic tool for T2D65,66.

Of the aforementioned markers, M-value is considered as the “golden standard” that allows quantification of peripheral IR on baseline glucose conditions. On the other hand, HOMA indicates fasting IR of the liver or functionality of the β-cells, OGTT AUC is a reliable index of insulin response upon glucose stimulation and HbA1c reflects the average glucose uptake over a period of a few months. To sum up, every marker carries a specific set of advantages and disadvantages that should be taken into consideration during the interpretation of the results56.

Genetic, protein and metabolic factors in T2D Research and review articles have presented a large collection of genetic, protein and metabolic factors associated with T2D. The polymorphic site rs7903146, located in an intron of the gene TCF7L2 was among the first ones to be associated with T2D, insulin secretion and proliferation of pancreatic β-cells32,67. Other studies discovered T2D-risk exonic mutations in the

Page 18: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

18

PPAR-γ transcription factor68 and the KCNJ11 gene encoding subunits of the ATP-sensitive potassium channel69,70. Association of gene expression with genetic variation identified various candidate expression quantitative trait loci (eQTL) to affect various genetic mechanisms in T2D6. In a recent genome-wide association study (GWAS), Yang, Zeng and colleagues identified 143 variants associated with T2D, of which 42 were novel71. Other genetic approaches have suggested sets of single nucleotide polymorphisms (SNPs) to affect enhancers involved in the pathogenesis of T2D72. Deoxyribonucleic acid (DNA) methylation studies have reported alterations on the methylation profile73, while others have gone as far as shortlisting genes with differential DNA methylation levels as candidates for clinical applications74. Further epigenetic studies have focused on histone modifications as a mechanism of metabolic memory for glucotoxicity in endothelial cells75,76, and others have used histone modifications, open chromatin and long-range genomic interactions to propose regulatory regions that might contribute to T2D77–79. A series of articles demonstrated a systems biology approach that combines GWAS with metabolomics or proteomics to propose metabolite quantitative trait loci (mQTL) or protein quantitative trait loci (pQTL), respectively80–86. Mendelian randomization is yet another approach that has been employed to identify causal effects between metabolomics or proteomics and IR87–90. Other exploratory studies have described differences between T2D and ND in gene expression73,91 and protein abundancies92. Metabolomics studies have reported a plethora of associations between branched-chain amino acids (BCAA), aromatic amino acids, (AAA), carnitines and lysophosphatidylcholines (LPC), and T2D markers93,94.

The aforementioned collection of discoveries is a sample of the bulk of studies that have investigated various aspects of T2D. Nevertheless, a small fraction of these studies has systematically integrated various types of data of the complex interplay among the genome, the transcriptome, the metabolome and the phenome, to investigate IR and T2D.

Page 19: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

19

Omics technologies and experimental data Recent advances in experimental and data-analysis methodologies have enabled substantial steps forward in high-throughput approaches towards the parallel quantification of an increasing number of biological variables from numerous biological samples. These omics technologies include assessment of the expression levels of ribonucleic acid (transcriptomics), identification of transcription factor (TF) binding sites (TFBS) that govern gene expression (genomics), measurement of the abundancies of proteins (proteomics) and metabolites (metabolomics)95. The development of such a plethora of technologies poses a major challenge on the computational investigation of the interplay of multi-omics96. Exploring multi-omics in the context of complex diseases, such as T2D, where environmental factors play a critical role, largely increases the level of difficulty in the interpretation of the results97.

Omics technologies are commonly divided into targeted and untargeted, while some belong to both. Targeted omics include techniques that measure a pre-defined set of molecules. For example, ribonucleic acid (RNA) microarrays in transcriptomics quantify the expression levels of an a priori defined set of genes. On the other hand, untargeted metabolomics is able to capture the entire metabolome that, downstream, is annotated to a set of known metabolites based on computational analytical approaches. At the same time, for various untargeted omics there exist targeted technologies that are usually more affordable. In the field of metabolomics and proteomics, targeted approaches offer higher sensitivity, while the untargeted a broader spectrum of detectable molecules.

In this thesis, we used a large collection of omics including genomics, transcriptomics, proteomics, metabolomics and imaging-omics (imiomics). We used chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) to locate TFBSs, single nucleus RNA sequencing (snRNA-seq) to quantify gene expression in transcriptomics, untargeted mass spectrometry (MS) for proteomics and metabolomics, positron emission tomography/magnetic resonance imaging (PET/MRI) for imiomics and targeted chromosome conformation capture technique (HiCap) to identify long-range genomic interactions.

ChIP-seq and DNaseI-seq TFs are proteins that bind to DNA and regulate gene expression in the cells. Histone modifications (HM) are biochemical modifications such as methylation or acetylation, that are located on the N-terminal tails of the histone octamers. ChIP-seq is the most popular method to identify TFBSs or HMs across the whole genome. The genome-wide landscape of TF-DNA interactions or HMs facilitates the exploration of gene expression regulation.

Page 20: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

20

The initial step of ChIP-seq involves the crosslinking of the target-factor to DNA, followed by fragmentation of the DNA by sonication or nucleases. Next, a carefully selected antibody that identifies specific TFs or HMs is employed. This step, also known as immunoprecipitation, is used to select the genomic loci harboring the targeted TF or HM. The selected loci are purified, and the DNA fragments are amplified and sequenced. The sequenced fragments are aligned to a reference genome, and a specific software is used to select the DNA fragments that were enriched for sequencing reads, when compared to the background98. The final dataset consists of the coordinates of short chromosomal stretches for each enriched site.

Deoxyribonuclease I (DNaseI) hypersensitive sites sequencing (DNaseI-seq) is an experimental method that identifies open chromatin regions. Chromatin “unwraps” at specific loci to allow TF binding. DNaseI is an endonuclease that cleaves open chromatin sites. In a DNaseI-seq experiment, the regions of target for cleavage by DNaseI are identified through DNA fragmentation, DNaseI digestion and sequencing. Finally, a specific software is used to identify enriched sites that are finally reported in a similar format to the ChIP-seq experiments98.

Collections of ChIP-seq and DNaseI-seq experiments are often used for characterizing the functionality of DNA alterations or assisting into the identification of regulatory regions for gene expression. Large international consortia such as the Encyclopedia for DNA Elements (ENCODE)99, the Roadmap epigenomics project100 and the Blueprint epigenome project101, have undertaken the task of performing and reporting hundreds of ChIP-seq and DNaseI-seq experiments.

Transcriptomics In the early 00’s the first attempt to sequence the entire DNA sequence of a human was completed102. Nowadays, technological advancements have largely decreased the sequencing cost while achieving better resolution. Genotyping on GWAS arrays is a faster and cheaper way to capture common variants and associate them to phenotypes. Such associations are usually accompanied by quantification of gene expression through RNA sequencing (RNA-seq). During RNA-seq, the transcripts are isolated, fragmented and reverse transcribed to create the complementary DNA (cDNA) sequences. Next, barcode DNA adapters are added to the cDNA libraries, followed by amplification and sequencing103. Raw reads from sequencing undergo extensive quality control and the remaining reads are mapped to a reference genome. Reads corresponding to genes are counted and normalized to account for biases such as sequencing depth104. Normalization contributes greatly to the robustness of the differential analysis that identifies up- or down-regulated genes between cases and controls.

RNA sequencing quantifies the average expression of genes in bulk samples resulting in a comprehensive, but yet incomplete, illustration of the

Page 21: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

21

transcriptome. Recent advancements in transcriptomics have enabled quantification of the expression level of genes and transcripts from single cells or nuclei. In a revolutionary study in 2009, Lao, Surani and colleagues, achieved to explore the transcriptome of single cells105. Currently, droplet-based single cell RNA-seq (scRNA-seq) or snRNA-seq technologies are among the most widespread for transcriptome profiling. A droplet-based technology commercially available by 10X Genomics has been used in this thesis. This approach allows robust identification of cells and strandness, offers a fast library preparation and achieves low costs. On the other hand, it requires a custom droplet separation device and is limited to messenger RNA (mRNA) sequences106,107.

In the 10X Genomics droplet-based approach, single cells or nuclei are encapsulated in droplets that contain barcoded primers attached to microbeads and a cell-lysis buffer. The primer consists of an identical sequence for all the beads, a 12 base-pair (bp) cell identifier, a 8bp mRNA identifier and a 30bp oligo(dT) primer used for obtaining cDNA from mRNA. The barcoded primers serve as unique molecular identifiers (UMI) to assign transcripts to cells or nuclei. The cellular or nuclear membrane is lysed and the primers from the microbeads are attached to the mRNAs. In a next step, beads are released and the fused sequences of primers and mRNA are reverse transcribed, after droplets are pooled and broken. Finally, the cDNA from the reverse transcription is amplified using polymerase chain reaction (PCR) and sequenced106. The most convenient approach to quantify gene expression from such experiments is to run the mkfastq and count options of the cellranger tool available by 10X Genomics. These two options perform demultiplexing on the raw files, conversion to the established fastq file format, alignment to a reference genome and count of the UMIs for each gene. The final data format is a two-dimensional matrix that contains UMI counts of genes (rows) in cells (columns). A following important process involves extensive multi-step quality control. First, empty droplets, dying cells/nuclei and cells/nuclei with too few or too many transcripts are removed. Next, technical noise and batch effect is removed, followed by normalization of the UMI counts. The normalized dataset undergoes dimensionality reduction to reduce computation times from subsequent steps, clustering to identify cell types, visualization of the clusters in a reduced dimensional space and manual exploration of the markers from the differential expression analysis among clusters for accurate cell type annotation108.

Proteomics The field of proteomics has undertaken the task to develop and apply methods that would map the entire proteome of biological systems. Two large projects attempted to construct a map of the human proteome based on MS technologies, and they succeeded to cover ~70-80%109,110. However, direct protein quantification is a cumbersome task111.

Page 22: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

22

Initially, proteins are extracted from samples. Next, proteases are used to digest proteins to peptides spanning 6-50 amino acids. Proteases introduce a certain error rate by occasionally cutting at unexpected sites or missing cleavage sites. MS in proteomics is commonly preceded by liquid chromatography (LC-MS) that separates the peptides based on their physicochemical properties such as charge or hydrophobicity. Specifically, while samples are injected, the liquid solvent is gradually altered to allow different types of peptides in the column (Figure 3b). Next, peptides are slowed down before eluting to the mass spectrometer (Figure 3b). Eluting peptides are ionized into precursor ions, accelerated from an electric or magnetic field and, finally, their mass-over-charge ratio (m/z), intensity and retention time (RT) is recorded. The triplet of m/z, intensity and RT consists the MS1 spectrum.

The MS1 spectrum offers satisfactory identification quality, but often this is not sufficient. Tandem MS (MS/MS) is commonly used in proteomics to increase the certainty of the identification. MS/MS involves (at least) one additional round of fragmentation and detection to provide the MS2 spectrum, with two different available approaches. The first is called data-dependent acquisition (DDA), and relies on the selection of a narrow window from the high-intensity m/z peaks of the MS1 spectrum112. This functions as a filter that allows only one peptide species at a time to be detected in the MS2 spectrum. Usually, several such windows are selected in order to facilitate the detection of multiple peptides. The second approach, called data-independent acquisition (DIA), allows selection of broader m/z windows from the MS1 spectrum. As a result, multiple peptide species proceed to further fragmentation and detection113, and consequently, a more complex MS2 spectrum emerges. Quantification of proteins from MS2 spectra that include a plethora of peptide species is a complicated computational task. However, this issue is circumvented by taking advantage of the computational power and by employing the vast amounts of existing spectral libraries. Overall, DDA has a limited reproducibility due to the selection of precursor ions, while DIA offers a larger and more representative snapshot of the proteome than DDA, and fewer missing values114.

The next step involves computational identification and quantification of proteins based on the MS1 and MS2 spectra. This process includes feature detection, peptide identification, protein identification and protein quantification, that are usually performed from versatile and powerful platforms such as MaxQuant115. The final dataset is a two-dimensional matrix that can be subject to downstream analyses for further normalization/correction, biomarker identification and multivariate analysis with custom scripts or widely accepted tools tailored to proteomics analysis116.

The construction of the draft map of the human proteome took an extensive amount of time to complete and proved the major challenges that the field of proteomics is facing. A major obstacle is the sensitivity of the methods, that despite the recent developments, is unable to identify proteins with fewer than

Page 23: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

23

100 copies and has an upper limit on the range of detectable protein abundances117.

Figure 3. Basic schematic representation of a) GC-MS and b) LC-MS. Figure adapted from K. Murray and Dubaj (Wikimedia Commons).

Metabolomics The entire set of small molecules of a biological system that are products or substrates of biological pathways comprise the metabolome. Metabolomics is a collection of high-throughput methods utilized to quantify the metabolome. Metabolomics facilitates detection and evaluation of a very large number of small compounds (<1,500 Da) from biological samples. The specifications and setup of the underlying metabolomics technology define the detectable spectrum of molecules that usually includes amino acids (AA) and lipids. Nuclear magnetic resonance (NMR) spectroscopy, and LC or gas chromatography (GC) followed by MS (LC-MS or GC-MS) are the two most popular methods for untargeted compound-quantification that bear distinct drawbacks and advantages. Even though NMR is superior to MS for not requiring sample separation and derivatization, its sensitivity is poorer. On the other hand, MS technologies, that we focus in this thesis, offer a broader selection of platforms that can be adjusted with regard to the sensitivity and the spectrum of detectable molecules118,119.

Liquid and gas chromatographic methods aid at providing information about the retention time and the physicochemical properties of the compounds119. In GC, a sample of derivatized compounds is transported into a column through a carrier gas (e.g. He), after a heated injection is used to make it volatile. A scheduled gradual elevation of the temperature of the oven that encapsulates the column, stimulates the molecules of the sample that elute

Page 24: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

24

at varying time-points due to differences in polarity and boiling temperatures. Eluting molecules from the GC column are subsequently transferred to the MS detector120 (Figure 3a). In LC, the sample is transported by a liquid mobile to a column that separates the molecules based on their polarity or charge. Next, molecules elute from the column towards to MS detector is performed (Figure 3b).

MS consists of ionization from an ion source, ion separation in the mass analyzer and recording of m/z. Ionization techniques differ between GC- and LC-MS. The ionization that follows GC is usually electron ionization that is a hard ionization technique, while the one that follows LC is usually electrospray ionization that runs on a positive or negative mode, and it is a soft ionization approach. Next, the mass analyzer accelerates the ionized compounds, while the detector records the abundance of ions and the RT. Similarly to proteomics, in metabolomics, tandem MS is also commonly used to improve the identification and quantification certainty of compounds. Overall, LC-MS has a wider range of detectable compounds, while GC-MS achieves higher sensitivity, despite the additional derivatization step119,121.

m/z spectra obtained from MS undergo rigorous quality control to remove background noise and to maintain the relevant signals that can be utilized for annotation to known metabolites122. Reference in-house or commercial empirical libraries are popular approaches to annotate spectra of detected compounds to identities of metabolites. However, such reference libraries contain a rather limited number of metabolites that, consequently, reduce the overall potential of the study119. On the other hand, the availability of public resources that provide large libraries for annotation of metabolites is limited but constantly growing123. A common analysis-ready data format for metabolomics is a two-dimensional table with samples in the rows and metabolite quantifications in the columns. Popular information that accompany metabolite quantifications include internal standards (IS), blank sample measurements and metabolite-specific database identifiers. The latter consists of identifiers for large databases such as the human metabolome database (HMDB)124, the chemical entities of biological interest (ChEBI)125, the Kyoto encyclopedia of genes and genomes (KEGG)126 and the database of U.S. national center for biotechnology information (PubChem)127. The steps to follow in the downstream analysis involve data analytics such as biomarker identification, pathway enrichment analysis and computational model development.

The field of metabolomics has contributed to the identification of potential markers for cardiovascular diseases128 and T2D129, and with single-cell metabolomics being “around the corner”130, it is expected to assist in multiple novel discoveries in the near future. However, additionally to limitations posed by reference libraries, the overall lack of method-standardizations and high variability among measurements need to be addressed119.

Page 25: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

25

Other omics

Imiomics Recent improvements in imaging techniques have undergone various developments that allow researchers to extract tissue-specific information from whole-body scans131. The availability of integrated imaging platforms such as PET coupled with MRI132,133 or PET coupled with computerized tomography (CT)134 allow simultaneous quantification and analysis of phenotypic characteristics from multiple tissues in human subjects. An example of such phenotypic characteristics is the quantification of the volume or the fat content of various organs of the human body. Specifically, a series of consecutive scans is performed with PET/MRI or PET/CT after a tracer substance is ingested by the subject. Next, the scans are corrected and normalized, and the signal intensities are stored in voxels in a technique called imiomics135. Finally, based on manual segmentation of a human reference scan that assigns voxels to their corresponding organs, a summarized map of information regarding the volume and the fat content of various organs is obtained. The final dataset is a two-dimensional matrix with the imiomics quantifications in the columns and the samples in the rows.

HiCap Somatic cells of different organs contain identical copies of the same genetic information; however, their phenotype and functionality differ largely. This is because the underlying gene expression regulation mechanism allows a specific set of genes to be transcribed. Among the key components of the gene regulation mechanism are the TFs, the HMs, the DNA methylation sites and specific genomic loci also known as regulatory regions. Examples of regulatory regions are the promoters, that are DNA sequences spanning a few hundreds of bps upstream the transcription start site, and the enhancers that are regulatory sequences spanning between a few hundreds to a few thousand bps, which are commonly located in introns of genes or in intergenic sequences. The most common example of gene expression activation occurs when promoters and enhancers are brought to close proximity via protein-protein interactions of TFs.

In recent years, various methodologies have been developed to capture such distal interactions in both targeted and untargeted manners. Untargeted approaches capture the bulk of the cases when DNA sequences are brought to close proximity through TFs, at the expense of limited resolution. Two widespread approaches that record long-range genomic interactions in an untargeted manner are Hi-C136,137 and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET)138,139. On the other hand, targeted technologies use predesigned probes to identify interactions to distal regions or other probes. They offer higher resolution at the cost of missing interactions occurring among “non-probes”. Such an example is HiCap140, that we used in this thesis.

Page 26: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

26

Long-range interaction profiling methods use high throughput sequencing followed by mapping of the interacting loci to a reference genome. The initial list of detected interactions undergoes extensive quality control and a set of reliable interactions is reported using specialized software136,141, that usually allows for additional genomic information to assist the process (e.g. tissue-specific ChIP-seq experiments). The data format of the long-range genomic interactions is a tab-separated file with one interaction reported in each row, and additional information representing its quality.

Data deposition in the public domain In the recent years, large consortia have deposited petabytes for multiple types of omics data in the public domain. Additionally, the national center for biotechnology information (NCBI) in the USA and the European bioinformatics laboratory (EBI) provide a vital infrastructure for researchers to deposit experimental data in their databases, while making them available to the research community for reproducibility of the findings and potential reanalysis. Datasets produced from omics technologies, in a smaller or larger scale, are usually required by the scientific journals to be uploaded to the public domain142,143. However, there are certain limitations in sharing clinical variables of the subjects that provided the original biological samples. Such information is usually prohibited to be accessible from the public, but only for researchers or governmental institutions that have obtained ethical permissions. The same applies to biological samples that are obtained for analysis from donors or biobanks, that store organs of deceased individuals for research purposes. This assists on maintaining the anonymity of the donors and ensuring that sensitive data is used exclusively for research. Overall, data deposition on publicly accessible databases enhances the quality of science and enables meta-analyses to a great extent.

Experimental biases and sources of error Experimental data are heavily influenced by technical or biological sources of bias. Sample collection performed by different practitioners or even from the same practitioner on different days might introduce biases. Biases originating from samples collected in different facilities that followed the same protocol have been well-documented, and various pipelines have been developed to account for144. Experiments performed on different instruments, or on the same instrument on different days are capable of introducing additional biases (Figure 4a). Decay of instruments over time is always reflected on the quality of the data. Contamination or corruption of samples over time, even when they are stored in -80oC should not be crossed out of the list as well.

Page 27: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

27

Figure 4. Examples of batch and replicate effect from MS metabolomics and snRNA-seq, respectively. a) PCA showing batch effect from GC- and LC-MS metabolomics in raw data (left) and after correction (right) from paper IV. b) t-SNE illustrating uncorrected (left) and corrected (right) technical noise from the two snRNA-seq replicates from paper II.

The majority of the scientific investigations are performed in bulk samples taking a snapshot of the ongoing processes, while alterations of the landscape of biological interactions over time are usually ignored or impossible to explore. On the computational end, there is little agreement among scientific groups on handling missing values. The same applies to the methods used for data exploration, while it is not uncommon for incorrect methods to be applied or for methods to be applied incorrectly. In the following sections we will present a selection of sources that introduce systematic biases in omics technologies.

Human experimental cohorts In multiple studies, including this thesis, a large collection of biological samples originating from living or deceased donors is used. Such samples allow scientists to run experiments resulting in important or groundbreaking discoveries. A common approach for scientific groups to obtain sets of samples is to organize own recruitments or to use biobanks that store and maintain samples from deceased organ donors. Biological samples should be accompanied by clinical data that occasionally are incorrect or incomplete, leading to errors or misinterpretations of the findings.

A complete record of clinical information allows correcting the quantifications of biological variables for various factors that might influence the interpretation of the results. For example, a complete list of medications or lifestyle habits, such as smoking or alcohol consumption, of the participating subjects would allow researchers to control for them while building computational models or performing statistical analyses. Additional important factors for samples received from living donors include dietary habits, record of the sampling practitioner and the time the sample remained in room temperature. Biobanks should report the time the patient spent in the

Page 28: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

28

intensive care unit, the time frame between death and sample collection, and the specifications of the intravenous serum. All these factors consist a short list of potential sources of biases and errors in scientific research of human cohorts.

ChIP-seq and DNaseI-seq There are multiple potential sources of bias in next generation sequencing (NGS) technologies that include ChIP-seq and DNaseI-seq. DNA fragmentation performed with sonication or nuclease digestion introduces certain biases. At the same time, large scientific consortia have reported multiple antibodies that lacked the necessary specificity99,100. Biological/technical replicates, control experiments and inclusion of spike-ins improve largely the identification of variation due to technical flaws and the characterization of true biological variation. During spike-in experiments a fixed amount of DNA and antibodies from a different species is included in the experiment in order to create a normalization factor that will facilitate observation of subtle biological effects. However, a very limited number of ChIP-seq experiments has employed spike-ins in their studies. Finally, performance of pilot studies on low read sequencing depth, followed by experiments of uniform sequencing depth consists the recommended order of events that is not always necessarily accomplished145.

Single cell or single nucleus RNA sequencing Sample preparation for snRNA-seq or scRNA-seq might lead to loss of various polyadenylated (polyA) RNAs146, while amplification introduces biases for lowly expressed genes147. Batch effects in snRNA-seq or scRNA-seq experiments occur even among replicates of the same sample (Figure 4b). Additionally, computational approaches that are used to model and remove technical variation among cells and genes, usually make parametric assumptions that are not necessarily fulfilled from the data148. Finally, the assignment of groups of cells or nuclei to specific cell types is an intensive non-standardized process that might introduce various systematic biases.

MS metabolomics and proteomics MS-based omics technologies are analytical processes with various steps in which biases might occur. First, the complexity of the sample preparation plays a key role, while the separation of the compounds or the peptides in the column might introduce additional biases. Specifically, the sensitivity of the instrument declines over time, usually, due to decay of the separation performance of the column. Additionally, detection of compounds or peptides might fail due to co-elution or absence from the reference library. Reference libraries in MS technologies are limited. For example, a common problem in metabolomics is the bias of the identified metabolites in favor of amino acids and lipids149. On the other hand, in proteomics, even though the reference

Page 29: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

29

libraries are richer and more accurate, the complexity of the MS2 spectrum of the DIA approach introduces uncertainties114.

Other omics In imiomics approaches the resolution of the PET/MRI or the PET/CT scanner in combination with the selection of a suitable tracer substance contribute largely to the interpretation of the results. Moreover, the random selection of a single scan as reference rather than a consensus image introduces certain biases. Finally, the segmentation of the reference scan is performed manually by an experienced medical practitioner, that increases the possibility of a random human error due to lack of a systematic approach.

As mentioned earlier approaches that capture long range genomic interactions in an untargeted manner provide a rough approximation of the span of the genomic loci due to limited resolution. On the other hand, targeted approaches, are limited to detect only interactions of potential regulatory regions for which probes are available. Additionally, the cutoffs for selecting reliable interactions are rather arbitrary. Moreover, experiments that identify long range genomic interactions include DNA fragmentation and antibodies that we described earlier to introduce specific biases. Finally, taking under consideration the highly dynamic and adaptive behavior of the genome, such approaches are unable to present maps of genomic interactions, but rather a snapshot of a distinct state.

Page 30: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

30

Bioinformatics

Data correction and normalization In the previous section we discussed some of the potential sources that might introduce systematic biases in studies of the biomedical field. One of the key goals of bioinformatics analysis is their identification and correction.

In MS omics technologies technical biases such as batch effect or minor instrumental decay are usually corrected using IS, information from blank samples or the running order of the samples. A study by Ingelsson and colleagues suggested an analysis of variance (ANOVA) type of correction for sources that introduce technical variation to metabolomics data122. An additional common problem in MS-based omics is the imputation of missing values. A recent publication reviewed a large number of imputation approaches for missing values in MS metabolomics, and showed that a k-nearest neighbors machine learning approach performed best150. In single cell or single nucleus RNA sequencing technologies, except from identifying empty droplets or droplets containing dying cells/nuclei, one should take care of technical variation introduced by other factors, such as amplification bias or batch effect. The majority of the computational tools tailored to single cell or single nucleus experiments offer various approaches such as modeling the technical noise with a Poisson distribution108 or using canonical correlation analysis (CCA) to remove batch effects151.

A subset of the variables measured in a biological experiment might be associated with anthropometric measurements such as BMI, waist-hip ratio (WHR), age or sex. It is common practice to control for such unwanted biological variation when building statistical models or running statistical tests, while some computational tools provide insights of the variation that is explained by such factors152. Quantifications of proteins and metabolites from MS methods are in the ranges of thousands or millions, hence a logarithmic transformation is recommended to reduce the range of values while providing a better approximation of a normal distribution122. In transcriptomics, both bulk and single cell/nucleus, it is important to normalize the counts or UMIs for sequencing depth and gene length to obtain comparable quantifications153.

Statistical analysis High throughput omics technologies generate large numbers of variables that require statistical analysis to assist in the selection of those that are of interest to the underlying hypothesis. Computational biology tools often apply statistical tests to provide metrics of levels of importance of the variables that lead to further assessment and conclusions. In bioinformatics we mainly use hypothesis testing to select events that occur due to biological variance rather than random factors or technical noise.

Page 31: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

31

In hypothesis testing, a null (H0) and an alternative (H1) hypothesis are compared against each other. We are usually interested in testing whether H1, that models the observed data, is largely different than H0, that models the random events. When the hypothesis distributions are compared, a test-statistic is calculated to represent their difference. The test-statistic is then used to compute a probability value (p-value), that is subsequently compared to a preselected significance threshold α, that is usually 0.05. If the p-value is smaller than the threshold, then we accept the result as significant, reject the null hypothesis and proceed to the interpretation of the observed values. In the opposite case we cannot reject the null hypothesis and the test is considered non-significant.

The commonly accepted threshold α implies that we can reject H0 with 95% certainty, or vice versa, that we tolerate one out of 20 times to falsely reject H0. Falsely significant statistical tests occur when some data points from H1 are much larger or smaller than H0 because of random chance, hence the p-values will be falsely smaller than α. At the same time, it is common to perform tens of thousands of statistical tests in one experiment, increasing the probability of falsely significant tests. For example, when performing a differential gene expression analysis, tens of thousands of statistical tests are conducted at once; hence the number of these random events grows largely and might play an important role towards misinterpretations or false conclusions. Several methods have been proposed to account for multiple testing problems, with false discovery rate (FDR) currently being widely accepted.

Hypothesis testing belongs to the broader group of inference statistics that often requires parametric assumptions that are challenging to meet, or population sizes that are not large enough. Alternative approaches to inference statistics are resampling statistical methods such as bootstrapping and Monte Carlo permutation tests, that generate empirical distributions for statistical measurements. In bootstrapping the original set is sampled with replacement to generate sets where the test-statistic is computed to build the bootstrapping distribution. Next, the test statistic from the original set is compared to the one from the bootstrapping distribution to estimate the statistical significance or confidence interval. In a permutation test the distribution of the test statistic that is used as H0 is computed from all the possible rearranged sets of the original set, and the rejection of the null hypothesis is decided upon the comparison of the original test statistic to H0. Usually calculating a full permutation is infeasible or extremely time-consuming, hence Monte Carlo methods, that run a number of random rearrangements (e.g. 10,000) that is sufficiently large to accurately approximate the full permutation distribution, are preferred. Overall, bootstrapping is more suitable to estimate confidence intervals while Monte Carlo permutation tests are a preferable approach for hypothesis testing.

Page 32: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

32

Machine learning Machine learning refers to a large collection of techniques that exploit the increasing processing power of computers to identify shared patterns of variables among subsets of samples from input datasets. The input datasets, that are known as training sets, commonly consist of quantified variables or features represented in the columns, and samples or objects represented in the rows of a two-dimensional table. Machine learning algorithms are further divided in to two large subgroups of supervised and unsupervised learning. Supervised learning contains methods that identify combinations of variables that optimally predict a predefined attribute, also known as decision. On the other hand, unsupervised learning algorithms identify groups or clusters of objects based on shared or similar patterns of features.

Figure 5. Sample a) UMAP and b) t-SNE embeddings visualization from scRNA-seq in blood and bone marrow samples. Color-coded groups represent cell types annotated manually based on differentially expressed sets of genes. Figure adapted with permission from Macmillian Publishers Ltd: Nature Biotechnology, (Becht et al., Volume 37(1), pp.38-44) copyright 2019.

Unsupervised learning Dimensionality reduction is a subcategory of unsupervised learning that includes methods that aim at representing large sets of variables in a lower-dimensional space of variables that summarize the variability of the original set. Principal component analysis (PCA) is the most commonly used method for dimensionality reduction. PCA transforms the set of input variables to a smaller set of orthogonal variables, called principal components (PCs), that captures a large fraction of the variation from the original data. In other words, PCs are linear combinations of the variables from the original set. When PCA is computed the set of PCs is ordered based on the total amount of variance they explain. The basic structure of the input data for a phenotype can be explored by plotting the first two or three PCs in a two- or three-dimensional space, respectively (Figure 4a). PCA is the preferred

Page 33: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

33

method for dimensionality reduction in single cell and single nucleus transcriptomics. However, other dimensionality reduction methods such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are preferred to PCA for visualization purposes, due to their ability to capture complex non-linear associations in high-dimensional space (Figures 4b and 5). Specifically, the input to t-SNE or UMAP is the PCA-reduced space, where the number of PCs that covers the majority of the variation of the initial dataset is decided based on specific algorithms (e.g. jackknife)154. Finally, the cells or nuclei are represented in a two-dimensional UMAP or t-SNE space based on their similarity in the number-of-PCs-dimensional space. Overall, UMAP is a better approach than t-SNE due to a more robust mathematical background and a more representative visualization structure, while t-SNE can be characterized as a visualization heuristic155 (Figure 5). Nevertheless, both are stochastic and less mature than PCA156.

Clustering is a subcategory of unsupervised learning that aims at identifying groups of objects from the training set that share patterns of variables. Hierarchical and k-means clustering are classic examples of clustering methods. Hierarchical clustering measures the similarity of all pairs of objects from the training set and then uses a dendrogram to visualize the distances placing the most similar pairs of objects nearby. In k-means, given a training set and the expected number of clusters k, an iterative computation of cluster-representative centroids with minimal distances from the cluster members is performed. Even though such classic clustering methods are used in single cell or single nucleus transcriptomics to identify different cell types from gene expression patterns, graph-based clustering is preferred. In this approach a reduced dimensionality representation, usually by PCA, is used to embed the cells or nuclei in an n-dimensional space graph, where n is the number of selected PCs. The weights of the edges of the graph are calculated based on a similarity metric (e.g. Jaccard distance), and continuously updated so that specific communities are identified. The resulting clusters are finally overlaid on t-SNE or UMAP plots, and the cell types can be identified based on cluster-specific markers108.

Feature selection Feature or variable selection consists a subgroup of the supervised machine learning algorithms. In bioinformatics we often process large training sets with hundreds or even thousands of variables, which, most likely, will be a mixture of non-informative, redundant and predictive attributes. Feature selection methods identify and discard redundant and non-informative variables, while suggesting predictive ones for further analysis. This largely reduces the manpower and computational resources dedicated to tasks for data analytics. The simplest feature selection approaches use univariate statistical tests to examine associations between variables and decision. Other approaches use multivariate linear methods such as stepwise regression, least

Page 34: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

34

absolute shrinkage and selection operator (lasso), ridge regression, or combinations of them157. More advanced methods use multivariate algorithms that are able to capture non-linear relationship among variables, such as Boruta158 or Monte Carlo Feature Selection (MCFS)159,160, that use random forests (RF) and decision trees, respectively.

In MCFS, the feature-space of an input dataset with n objects and d features is sampled using a Monte Carlo approach to create s subsets of n objects and m features, with m << d. Each subset is split t times into a training set and a testing set. The training set will be used to train a decision-tree machine learning model and the testing set will be used to assess the model performance. The classification impact of every feature is computed based on its distance from the root of the tree and the overall performance of the model. Relative importance (RI) is the cumulative score that summarizes the overall impact of each feature to classification. Finally, MCFS suggests a RI cutoff threshold that allows a straightforward selection of the set of informative features.

Overall, linear multivariate methods are usually preferred due to speed and simplicity, while non-linear ones are advantageous due to capturing complex non-linear interdependencies of variables161. Commonly, the selected set of features is subsequently explored for biologically relevant signals or used to build supervised machine learning models.

Supervised learning Supervised learning consists of regression and classification methods. One of the simplest and most widely used regression methods is linear regression. In linear regression, a dependent variable is predicted from a linear combination of weights with a predefined set of features (predictors) and a constant (intercept). The goal is to compute the intercept and the weights so that the approximation of the dependent variable is optimized and the distance (residual) between the predicted and experimental value is minimized. An important feature of linear regression is that one can estimate the dependent variable from a predictor while accounting for additional variables that might play a confounding role. For example, in paper IV, one of the goals was to predict insulin sensitivity in the visceral adipose tissue (VAT) from levels of LPC (P-16:0) while accounting for the potential effect of anthropometric measurements such as BMI, WHR, sex and age. This is used to observe only the relevant association of the predictor to the dependent variable while ignoring the effect of confounders.

Classification consists of a large number of methods including decision trees, RF, support vector machines (SVM) and neural networks (NN). The choice of a classification method depends on the input data and the goal of the study. RF, SVM or NN might be a better choice towards building a highly predictive classification model. On the other hand, rule-based classification models consist of sets of IF-THEN rules that offer transparency and interpretation of the decision making. Rule-based classification relies on the

Page 35: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

35

theory of rough sets162. A decision system is a common input format that classification methods require. It consists of objects or samples in the rows and features or variables in the columns, followed by an additional decision column. The decision column is a group label or a phenotype that we are aiming to study. In rough sets the features and the decision are expected to be discrete; however, a discussion on discretization approaches and their impact to classification is beyond the scope of this work. Rough sets aim at identifying minimal subsets of features that can discern between objects.

A first step in rough-set theory is to compare every object to all others. Objects that have identical values of features are considered indiscernible and are grouped in equivalence classes. Next, the decision attribute is overlaid on the members of the equivalence classes. If all the members of each equivalence class agree on the decision then the system is called crisp, otherwise it is called rough. In other words, the theory of rough sets presumes that there are at least two members of an equivalence class with different decision values, which is a realistic assumption for biological datasets. Next, minimal sets of features (reducts) need to be computed to maintain the indiscernibility across equivalence classes. Reduct computation is a NP-hard problem, that can be circumvented by computing approximate reducts. Minimal sets of features in approximate reducts are capable of assigning a sufficiently large fraction of objects to their corresponding decision classes. Computation of approximate reducts is performed based on a Boolean discernibility function. The inclusion or exclusion of each feature is represented by a binary variable in the Boolean function that is simplified using Boolean algebra. Finally, combinations of variables and their corresponding levels from all the reducts are translated into readable sets of IF-THEN rules based on the decision system. The set of rules constitutes the rule-based classification model. Here we have used R.ROSETTA, a R wrapper of the ROSETTA implementation for rough-sets163,164 and VisuNet that is a visualization tool for rule-based classifiers165.

Performance metrics Common performance indices for linear regressions are the R2, the β coefficients and the p-values. R2 notes the proportion of variation of the dependent variable explained by the linear regression model. The β coefficient quantifies the magnitude of the change on the dependent variable for each unit of change in the predictor, and the p-value measures the statistical significance of this effect.

The performance of a rule-based classification model is commonly quantified from the accuracy and the receiver operating characteristics (ROC) AUC of the model from a 10-times cross validation (CV) process. During 10-times CV the original training set is split into 10 equal sets without replacement, of which 9 are used to train the model an one is used to test the performance. This process is repeated 10 times so that every CV-set is used as a training set once. In every repetition four values are calculated based on

Page 36: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

36

a predefined (background) decision class: i) objects classified correctly to the decision class are called true positive (TP); ii) objects classified incorrectly to the decision class are called false positive (FP); objects classified correctly to the other decision class(es) are called true negative (TN); and objects classified incorrectly to the other decision class(es) are called false negative (FN). Various combinations of these values are used to compute a large collection of metrics that facilitate the assessment of the quality of the classifier. For example, the accuracy of a machine learning model is estimated by the average ratio of the correctly classified objects

. Additional metrics include the true positive rate (TPR) or sensitivity, that represents the fraction of the correctly classified objects from the background class calculated from ; and the specificity that is the fraction of the correctly classified objects of the other decision classes . 1-Specificity is also known as the false positive rate (FPR). By plotting the FPR in the x-axis against the TPR in the y-axis, one can visualize the ROC curve that illustrates the ability of the model to discern between the background class and the other classes. ROC AUC is calculated as the integral of the area between the x-axis [0, 1] and the ROC curve. The average ROC AUC of all the CVs is usually reported. In paper IV, we trained a rule-based classifier on a large dataset with more than 900 examples and we used a small external testing set to assess its performance. The latter, additionally to demonstrating the overall versatility of the model, exhibited the power of machine learning to classify sets of “unseen” data.

Multi-omics integration The development of omics technologies has provided large datasets, while the increase in computational power and resources that allows massive processing has established the research field of multi-omics integration. In a recent study, Tagkopoulos and colleagues provided an effective machine learning approach for integrating multiple types of omics that relied on a thoroughly-designed database constructed from public sources of biological knowledge, multi-omics and experimental metadata166. Other studies have employed deep learning approaches that have highly predictive power and are able to capture complex data dependencies167. On the other hand, deep learning methods often require large training sets, while lacking interpretability despite the considerable advancements168,169. The limited interpretability on the prediction-making of deep learning models limits their potential towards clinical applications. An additional layer of complexity in deep learning is the structure of the hidden layers of the underlying neural network, that ultimately defines the deep learning schema. In a recent review, Telenti and colleagues briefly discussed a few of the most common types, while in another review, Greene and colleagues showed that the underlying structure of the hidden

Page 37: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

37

layers should be representative of the interdependencies in the input data168–

170.

Figure 6. Overview of the genome first and phenotype first approaches discussed in Hasin et al., 2017. Colored panels refer to various omics technologies and circles to measurements. Arrows among layers imply inter-layer correlations. Figure adapted with CC BY 4.0 from BioMed Central Ltd (Part of Springer Nature): Genome Biology, (Hasin et al., 18, 83) copyright 2017.

Various computational tools that were initially developed to analyze specific types of omics for biomarker identification, have been modified to handle additional types of omics data171. These tools would apply standard multi-omics integrative approaches to select correlated signals that were involved in related or synergistic biological processes. A set of conceptually different integrative methodologies were presented in a recent review by Lusis and colleagues, where two main approaches termed phenotype first and genome first were discussed (Figure 6)96. They avoided exploring approaches where the data was pooled and processed in a single computational model, while focusing on step-wise strategies. The phenotype first approach utilizes omics layers other than genomics to discover promising signals, and then uses follow-up experiments to identify the responsible genetic markers. On the other hand, the genome first approach uses genomic signals as a primary target that would allow identification of the affected biological signals in other omics layers. Yet another approach discussed in the same review, named environment first, supports the identification of environmental factors leading to disease and follow-up omics approaches to discover the affected molecular factors. Other studies, including papers II and IV of this thesis, have presented integrative approaches towards identifying markers and affected mechanisms in cases and controls from bulk172, single cell/nucleus154 and mixed experiments173. Nevertheless, the biomedical field of multi-omics and data

Page 38: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

38

analytics should adopt various standardization methods mainly for sample collection, data preprocessing, data storage and data availability. Other important actions to be considered might include exchange of computational expertise across laboratories, and implementation of open-source software and algorithms.

Data science and software development in life sciences Data science is a research field that applies statistics and computer science analytical methods on big data. Bioinformatics applies a large fraction of the data science tools and approaches spanning from statistical analysis and machine learning, to software development and algorithms. Earlier we discussed a collection of basic and advanced statistical and machine learning methods. Development of novel algorithms and implementation of software tools is a cornerstone of bioinformatics. However, these tasks require detailed documentation, testing, reproducibility and versatility, that is often ignored in scientific research due to limited interest and time constraints. In computer science, documentation and public availability of computational tools is a routine that increases the chances of reproducibility of the results and enhances scientific progress. In recent years, GitHub, that is a repository for software version-control hosting, has become the most successful platform for sharing open-source software among scientists. Fortunately, bioinformaticians and computational biologists are following this trend that apart from sharing software, operates as a forum to report bugs, to request new features for existing tools or to discuss software modifications. Additionally, large online communities such as Biostars (www.biostars.org) provide opportunities to ask questions and discuss various approaches on biological data analysis.

In this thesis we have used a combination of programming languages to perform the analysis of the data. An algorithm was developed and implemented in C# for the detection of genome-wide putative regulatory regions in paper I. An additional computational tool was implemented in C# to analyze the metabolomics datasets for papers III and IV. A pipeline was implemented in R to perform the quality control and analysis of the snRNA-seq data from paper II, while various custom, mainly R, scripts were implemented to analyze various data types and visualize the results.

Page 39: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

39

Aims

The main goal of this thesis was to globally explore the interplay between transcriptional and translational regulation in T2D, and its consequences in the metabolome and the phenome. More specifically we aimed at:

• Investigating various types of omics with novel computational approaches to identify multi-tissue and combinatorial markers for T2D. (papers I, III and IV)

• Introducing and applying novel integrative multi-omics approaches, and exploring their results. (papers II and IV)

Page 40: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

40

Experimental design

In this thesis we used a large number of publicly available and in-house-generated datasets to explore the alterations of various types of omics and their associations to T2D. Specifically, in paper I we investigated the sets of regulatory elements that control gene expression based on the presence of TFs upon response to the environment (Figure 7a). For that purpose, we employed ChIP-seq and DNaseI-seq experiments for five human cell lines from the ENCODE consortium. In paper II we used cohort I to identify polymorphic sites that potentially disrupt the binding of TFs on regulatory elements that consequently alter the expression of important genes for liver biology (Figure 7). In paper III we used cohort II to investigate alterations of metabolites in various tissues involved in T2D pathogenesis. Metabolites interact directly with enzymes in biological pathways and reflect the physiological status of the organism (Figure 7). Finally, in paper IV we used cohort III to explore associations of various metabolites to characteristic phenotypes of tissues identified by PET/MRI (Figure 7).

Figure 7. Overview of the experimental design of this thesis. a) Correspondence of papers and cohorts to the omics of the central dogma of molecular biology. b) Sampling and distribution of experimental cohorts. Figures adapted from Pander, setreset, Strogoff and Mikael Häggström (Wikimedia Commons).

Page 41: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

41

Cohort I A single liver sample was obtained from the healthy tissue surrounding a liver tumor removed with partial hepatectomy. The sample was obtained from Prof. Per Artursson at Uppsala University following the approval by the Uppsala Regional Ethics Committee (Dnr: 2014/433). Results from snRNA-seq, HiCap and MS proteomics performed in this sample are summarized in paper II.

Cohort II VAT, liver, pancreatic islets, skeletal muscle and blood serum was obtained for a cohort of 43 multi-organ donors from the biobank of excellence for diabetes (EXODIAB) in Sweden following the Uppsala Regional Ethics Committee approval (Dnr: 2014/391). Samples did not show any statistical differences for age and BMI for the phenotypes of control, pre-diabetes and T2D, while the ratio of females to males was similar. Results from MS metabolomics performed in this cohort are presented in paper III.

Cohort III A cohort of 42 subjects was recruited and, subcutaneous adipose tissue (SAT) and blood plasma samples were collected following the approval by the Uppsala Regional Ethics Committee (Dnr: 2014/313). Anthropometric markers of BMI and WHR were associated with HOMA-IR and M-value, while age and sex-ratio did not show any biases. Results from MS metabolomics and PET/MRI performed in this cohort are presented in paper IV.

Page 42: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

42

Methods and results

Paper I “Maps of context-dependent putative regulatory regions and genomic signal interactions”. Diamanti K., Umer H.M. et al., 2016 Nucleic Acids Research 44(19):9110-9120.

Aims We aimed at designing a data-driven algorithm that would identify cell-type-specific clusters of TFs that constitute sets of putative regulatory regions. Additionally, we aimed at providing a publicly available open source computational tool that would allow researchers to reproduce our results, and generate custom sets of putative regulatory regions and TF interaction networks. Finally, this approach allowed us to suggest networks of recurrent TF interactions.

Methods We downloaded data for ChIP-seq experiments from ENCODE for five different human cell lines: H1 human embryonic stem cells (H1-hESC), lymphoblastoid GM12878, myelogenous leukemia K562, human liver cancer HepG2 and cervical cancer HeLa-S3. The selected cell lines contained ChIP-seq data for a minimum of 50 distinct TFs. We complemented the ChIP-seq datasets with data from DNaseI-seq experiments for the same cell lines.

The algorithm was designed to suggest sets of putative regulatory regions based on the observation that genomic loci that harbor clusters of TFs are more likely to play an active role in the regulation of transcription174. The algorithm identified clusters of cell-line-specific peaks located within 300bp from each other. Putative regulatory regions were considered clusters containing two different TFs or a TF and a DNaseI signal (Figure 8). Finally, sets of putative regulatory regions were overlaid to cell-line-specific chromatin annotations from the ChromHMM chromatin state discovery and characterization approach175.

We also used the distance between peaks to identify significant pairs of interacting TFs in regulatory regions. We constructed three types of TF-interaction networks: i) co-occurring, for pairs of TFs located in the same

Page 43: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

43

regulatory regions; ii) neighboring, for TF peak summits located between 20 and 60bp apart; iii) overlapping, for TF peak summits located within 20bp. In each network we counted the instances of TF pairs. We used a hypergeometric test to identify significant TF pairs in the overlapping and neighboring networks, and a binomial test to identify overrepresented TF-pairs in the co-occurring network. The Bonferroni-corrected p-values were used for the construction of the statistically significant TF-TF interaction networks.

Figure 8. Graphical representation of the regulatory region detection from tfNet. Checking marks in the figure adapted from VasilievVV and Penubag (Wikimedia Commons).

We named our algorithm tfNet and we provided an open-source cross-platform implementation that allowed reproducibility of the results, and generation of sets of putative regulatory regions and TF-interaction networks from custom ChIP-seq experiments.

Results On average we identified ~144K putative regulatory regions across the five different cell lines. The average length of the regulatory regions was 300bps and approximately 76% overlapped with open chromatin. We randomly generated synthetic ChIP-seq datasets to confirm the robustness of our approach. Putative regulatory regions were also enriched for disease-related variants identified in GWAS. Additionally, putative regulatory regions overlapped cell-line-specific distal interacting loci from Hi-C experiments that harbored potentially functional SNPs. For example, we identified multiple putative regulatory regions harboring a number of SNPs overlapping the fatty acid desaturase 1 (FADS1) and 2 (FADS2) genes (Figure 9a). Furthermore, we generated sets of putative regulatory regions for colorectal and prostate cancer in humans, and five different cell types for Mus musculus. A very large

Page 44: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

44

fraction (83%) of the putative regulatory regions for Drosophila melanogaster overlapped with experimentally validated cis-regulatory modules from the REDfly database.

Figure 9. Analysis of putative regulatory regions from tfNet. a) Blue sections in the first track show TFBS clustering in regulatory regions and curved lines show long-range genomic interactions identified by Hi-C. The two following tracks show SNPs and genes overlapping the putative regulatory regions. SNPs marked in red overlap putative regulatory regions. b-d) Selected TF-interaction networks in regulatory regions overlapping heterochromatin ChromHMM annotations. Grouped rows represent cell lines and grouped columns the type of network.

ChromHMM annotations characterized an unexpectedly large fraction (13%) of putative regulatory regions as silent or heterochromatic. We explored different types of TF interaction networks in the presumably silent regulatory regions. We identified complexes of CTCF and the cohesin that have been shown to establish long-range genomic interactions (Figure 9b), complexes of USF1 and USF2 that have been associated with hyperlipidemia and cardiovascular disease (Figure 9c), and complexes of HNF4α and HNF4γ that are known to govern gene expression (Figure 9d).

Page 45: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

45

Paper II “Single nuclei transcriptome analysis of human liver with integration of proteomics and capture Hi-C bulk tissue data”. Cavalli M., Diamanti K. et al., 2019 Genomics, Proteomics and Bioinformatics (under review).

Aims Here we aimed at integrating snRNA-seq with bulk MS proteomics and HiCap data for a liver sample. A subsequent goal was to explore variants that might affect the expression of important genes to liver biology.

Figure 10. Visualization of t-SNE embeddings for the 4,282 nuclei clustered by graph-based clustering and manually assigned to cell types based on gene-markers from the literature.

Methods We performed snRNA-seq using the 10X Genomics droplet-based technology for two replicates of a liver sample, one of which had undergone fluorescence-activated cell sorting (FACS). The files from the sequencer were demultiplexed and aligned against the GRCh38 human genome using the 10X Genomics software cellranger. The resulting UMI datasets were merged and an extensive quality control was performed. Quality control included removing empty droplets, removing damaged or dead nuclei, removing nuclei containing large portions of mitochondrial DNA, normalizing UMI counts,

Page 46: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

46

modelling technical noise, correcting batch effect and removing outliers. Next, we used graph-based clustering to identify clusters of nuclei and t-SNE for visualization purposes. Finally, we identified cluster-specific gene-markers in order to assign them to known cell types.

On the same sample we performed bulk MS proteomics and HiCap. Protein quantifications were computed through MaxQuant and were subsequently log2-transformed to approximate a normal distribution. HiCap interactions were filtered based on statistical significance (p-value < 0.05).

Results The 4,282 nuclei were clustered in 8 groups based on graph-based

clustering. We characterized seven major cell types including hepatocytes, liver sinusoidal endothelial cells, Kupffer cells, cholangiocytes, Nk/T/B cells, hepatic stellate cells and nervous/arterial cells (Figure 10). As expected, hepatocytes were the biggest class consisting of 2,155 nuclei. Nuclei of hepatic stellate cells grouped into two distinct clusters based on genes marking active or inactive cell-states. The cluster of Nk/T/B cells carried markers of immune cells different from those of Kupffer-cell macrophages and similar to natural killer, T and B cells. Finally, a smaller cluster was enriched for genes marking cells of arterial and nervous origin (Figure 10).

In a next step, we focused on genes of interest to the liver biology. Specifically, we utilized the liver-specific set of genes characterized as “Pharmacogenomic biomarkers in Drug Labeling” by the food and drug administration (FDA), and the ones associated with the survival of patients with hepatocellular carcinoma (HCC).

We used only genes of pharmacogenomic interest that were expressed in snRNA-seq, that were identified by MS proteomics and that contained HiCap interactions. We identified DPYD, that is a gene involved in the pharmacogenetics of fluoropyrimidines toxicity, to be highly expressed in Kupffer cells. The promoter of the DPYD gene interacted with six distal elements characterized as enhancers by ChromHMM. One of the six enhancers, that was located in the first intron of DPYD, overlapped various TF binding sites and the single nucleotide polymorphic site rs74450569 (Figure 11a).

Following a similar approach, we identified 46 genes associated with patient survival in HCC interacting with 49 distal active enhancers. The promoter of the SLC2A2 gene interacted with 16 enhancers, and one of them was overlapping with the intron 25 of the gene TNIK. The active enhancer located in the TNIK gene was reported from the PanCancer Analysis of Whole Genomes consortium in 314 HCC cases to overlap a motif braking mutation in the binding site of the transcription factor VSX1 (Figure 11b).

Page 47: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

47

Figure 11. The regulatory landscape for the candidate functional genetic variants identified for a) DPYD and b) SLC2A2. The first track shows the chromosomal location, the second the ChIP-seq signal from two HMs and the annotations from ChromHMM, and the third the overlapping genes. The last track represents TFBSs from ENCODE (left) and the corresponding tissue (right). Liver is marked with “L”. The green stretch on TFBSs marks the TF motifs. The blue pin and the vertical dotted blue line show the location of the mutation.

Paper III “Intra- and inter-individual metabolic profiling highlights carnitine and lysophosphatidylcholine pathways as key molecular defects in type 2 diabetes”. Diamanti et al., 2019 Scientific Reports 9, 9653.

Aims Our primary goal was to explore the landscape of metabolic alterations in T2D across five tissues strongly connected to the pathogenesis of T2D. Additionally to providing a highly valuable resource to the scientific community, we aimed at suggesting the primary tissues where specific significant alterations of metabolites occurred.

MethodsFrom the biobank network of EXODIAB we obtained tissue samples for 43 multi-organ donors for VAT, pancreatic islets, liver, skeletal muscle and blood serum. Randomly selected subjects from the biobank (n > 200) were

Page 48: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

48

characterized as normoglycemic (n=17), pre-diabetes (n=13) and T2D (n=13). We performed LC- and GC-MS metabolomics profiling for all samples in all tissues. Compound quantifications were log2-transformed and scaled to μ=0 and σ=1.

Pair-wise comparisons of the profiles of controls to pre-diabetes, and pre-diabetes to T2D were performed using a Mann Whitney U test based on a Monte Carlo permutation approach. In a subsequent analysis, pre-diabetes and control subjects were combined into a ND group, and a differential analysis between ND and T2D was conducted using the aforementioned statistical approach. Permutation-based linear regression models were built to explore the associations of metabolites to HbA1c. The bulk of the analysis was performed with an in-house implemented computational tool freely available at https://github.com/klevdiamanti/MS_targeted.

Figure 12. Fold-change and statistical significance of LPC (left), free fatty acids (FFA) (middle) and carnitines (right) across five tissues in T2D. Carbon-chain length of lipids is shown in the rows.

Results The study provided a unique dataset of multi-tissue metabolic screening in a cohort of 43 organ donors. The differential analysis between ND and T2D showed that a large group of LPCs was significantly decreased in skeletal muscle and blood serum of T2D, while a similarly large group of carnitines was higher in liver of T2D, with multiple carnitines from serum following the same trend (Figure 12). Overall, the class of lipid and lipid-like molecules represented more than half of metabolites that were significantly altered in at least one tissue.

Similarly to earlier studies, we observed BCAAs, AAAs and other AAs to be elevated in T2D176,177, even though not crossing the threshold of statistical significance. However, significant associations of BCAA, AAA and AA with

Page 49: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

49

HbA1c across the majority of tissues confirmed their established role as T2D markers. Additionally, we observed a subset of bile acids to be significantly elevated and associated with HbA1c in liver of T2D.

Comparison of the levels of a subset of metabolites among controls, pre-diabetes and T2D, in combination with the associations of the same metabolites with HbA1c, pointed towards acute alterations in early onset of T2D. For example, significantly lower levels of the LPC species 14:0, 16:0 and 17:0 in skeletal muscle of pre-diabetes, followed by moderate decrease in T2D, might be a mark of pre-diabetes178. A similar event occurred in liver, where glycodeoxycholic acid was significantly higher in pre-diabetes, while further increase in T2D did not cross the threshold for statistical significance.

Paper IV “Integration of whole-body PET/MRI with non-targeted metabolomics provides new insights into insulin sensitivity of various tissues”. Diamanti, Visvanathar et al., 2019 Manuscript.

Aims We aimed at integrating data from MS metabolomics with whole-body PET/MRI imiomics measurements, and exploring their associations. A secondary goal was to develop a rule-based machine learning model from a large collection of metabolomics measurements to predict ND or T2D.

Figure 13. Schematic overview of the rule-based classification schema for training in the metabolomics dataset from ULSAM and testing in plasma LC-MS metabolomics from cohort III.

Page 50: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

50

Methods SAT and blood plasma samples were obtained from a cohort of 42 subjects matched for BMI, WHR, age and sex. The subjects underwent OGTT, hyperinsulinemic euglycemic clamp and [18F]FDG-PET/MRI. Influx rate (Ki) of [18F]FDG was calculated179, and an imiomics approach135 was used to normalize the signals from the study-subjects to a reference coordinate system.

SAT and blood plasma samples were used for MS metabolomics screening coupled with GC and LC, followed by computational identification of compounds. A log2 transformation was applied and an ANOVA-type correction was performed to correct for IS, running order and batch effect. We performed a differential analysis for metabolites in each tissue using Mann Whitney U test based on a Monte Carlo permutation approach. We additionally used a permutation approach to develop linear regression models to associate metabolite quantifications with T2D markers and imiomics measurements. The statistical tests for the differential analysis and the linear regression models were corrected for BMI, WHR, sex and age.

Figure 14. Pearson correlation of LPC(P-16:0) with whole-body voxel quantifications from PET/MRI on glucose uptake (Ki) (left), fat fraction (FF) (middle), and volume (right).

Page 51: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

51

We used the Uppsala Longitudinal Study on Adult men (ULSAM) that contained 180 blood metabolites for 963 samples as a training set, and the metabolomics from the current LC-MS analysis as a test set. We ran MCFS on the training set for ND and T2D and identified 19 metabolites to be important for classification (Figure 13). Next, the rule-based classification model was built and was visualized using VisuNet. Finally, the test set was classified with the rule-based model.

Results Differential analysis and associations to various T2D markers confirmed what has been shown in earlier studies from our group and others. There was an overall decrease of LPCs, while FFAs, carnitines and AAs, including BCAAs and AAAs increased in plasma of T2D. Despite the overall depletion of significant individual associations for SAT metabolites to T2D markers, we identified two groups (modules) of predominantly SAT metabolites significantly associated with the M-value.

Various interesting associations among metabolomics in plasma and imiomics measurements were identified. Leucine and its gamma-glutamyl dipeptide conjunct were inversely associated with the glucose uptake in SAT and VAT. Phenylalanine and tyrosine were positively associated with VAT volume and liver fat content, respectively. Various species of the LPCs were positively associated with VAT and SAT Ki, including LPC(P-16:0) that showed a strong negative association to the hepatic fat content (Figure 14). In addition, the SAT pool of FFAs showed a positive association to liver and VAT glucose uptake, and a negative association to the fat fraction in VAT. Valeryl and tetradecadienyl carnitines from SAT showed negative associations to VAT Ki and positive associations to VAT fat fraction, respectively. Glycodeoxycholic acid was negatively associated with VAT Ki, and positively associated with the VAT volume and fat fraction in SAT.

Finally, we trained a rule-based classification model on the ULSAM cohort using R.ROSETTA for ND and T2D, preceded by feature selection using MCFS. The CV accuracy of the model was 0.67 and it consisted of 450 rules that were visualized using VisuNet (Figure 15a and 15b). In ND, alpha-linolenic acid was strongly interdependent to two bile acids, deoxycholic and glycodeoxycholic acid, and to acetylcarnitine and creatinine. In T2D we observed weaker interactions overall, with interdependencies between LPC(0:0/18:2) and glycocholic acid, and creatinine and eicosatrienoic acid, standing out. Surprisingly, the accuracy of the rule-based model on the “unseen” test set was 0.78, while only one of the correctly classified objects was likely to be predicted by chance (Figure 15c).

Page 52: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

52

Figure 15. Rule-based classifiers and predictions for plasma LC-MS. a-b) Visualization of the rule-based classifier for decision classes a) ND, and b) T2D. c) Predictions of the rule-based classifier on the test set (top row) and the FDR for each prediction (bottom row). NS stands for non-significant.

Page 53: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

53

Discussion and concluding remarks

Cells respond to the environment by activating or repressing the expression of genes through TFs. The latter in combination with genetic variation lead to alterations in protein and metabolite levels. In this thesis we explored the interplay between transcriptional and translational regulation in T2D, and the consequent alterations on the metabolome and the phenome.

In paper I we explored various cell lines that recruit specific sets of TFs to regulate gene expression. We developed a methodology to detect and analyze putative regulatory regions and TF interactions in a genome-wide scale (Figure 8). We explored a subset of the identified regulatory regions located in the heterochromatin and identified various complexes of TFs that have been associated with the three-dimensional genome organization and regulation of gene expression. These results suggest that many heterochromatic regions are likely to be functional, and require further investigation (Figure 9b-9d). Additionally, our analysis demonstrated a systematic approach to select DNA polymorphic sites based on putative regulatory regions.

In paper II we identified putatively functional polymorphic sites by integrating snRNA-seq with bulk MS proteomics and HiCap data from a human liver sample. After identifying 7 clusters of cells from snRNA-seq, we focused on two groups of genes relevant to liver biology (Figure 10). DPYD is a gene of pharmacogenomic interest, whose promoter exhibited several distal interactions with active enhancers. One of the interactions overlapped various TFBSs and TF motifs, and harbored a SNP that was in linkage disequilibrium with one of the DPYD haplotypes associated with lower enzymatic activity, suggesting an active role in the variation of fluoropyrimidine toxicity (Figure 11). We also focused on the SLC2A2 gene that has been associated with HCC survival. Malfunctioning of SLC2A2 has been linked with lack of efflux of glucose from cells leading to glycogen accumulation, glucose intolerance due to glucotoxicity and hypoglycemia180. The promoter of SLC2A2 was interacting with several active enhancers, one of which was located in an intron of the gene TNIK that has been earlier associated with poor HCC prognosis181. More importantly, the same enhancer harbored a mutation in the binding site of the transcription factor VSX1 that has been characterized as motif-breaking by the PanCancer Analysis of Whole Genomes (PCAWG) consortium from 314 HCC cases (Figure 11). Identification and characterization of motif-breaking mutations in PCAWG

Page 54: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

54

follows a similar approach to paper I, while highlighting the importance of systematic integrative approaches.

T2D is characterized by multi-tissue IR in the human body. IR is commonly identified through various biomarkers measured mainly in biofluids149,182,183. In papers III and IV we investigated multi-tissue alterations of metabolites in T2D and their associations with several phenotypes. In paper III we explored levels of metabolites from MS metabolomics in a collection of five tissues involved in IR, including liver, skeletal muscle, VAT, pancreatic islets and blood serum. LPCs were significantly lower in skeletal muscle and serum of T2D subjects, and negatively associated with the levels of glycosylated hemoglobin (Figure 12). Similar associations were confirmed for blood plasma in paper IV, in addition to the negative associations with OGTT AUCglucose. In paper III some species of LPCs were significantly lower in subjects with pre-diabetes phenotype, while their further decrease in T2D did not cross the threshold for statistical significance, suggesting a stronger association with pre-diabetes. An earlier study by Bruce and colleagues observed plasma LPC levels in mice to be lower from the first week of high-fat diet178. Furthermore, in paper IV several LPCs were positively associated with the glucose uptake in both visceral and subcutaneous adipose tissue, and LPC (P-16:0) was negatively associated to the fat content in liver (Figure 14).

In paper III a large fraction of carnitines in liver was positively correlated with HbA1c and significantly higher in T2D, while following an increasing trend in serum (Figure 12). In paper IV the levels of a set of carnitines from VAT and plasma were positively associated with OGTT AUCglucose. Earlier studies have associated increased levels of carnitines to pre-diabetes and T2D, and specifically to the incomplete mitochondrial β-oxidation183. The latter was further supported by a significant increase in the levels of 3-hydroxybutyric acid in liver and serum in paper III184. Additionally, in paper IV BCAAs and their dipeptides were negatively associated with glucose uptake in VAT and SAT, while the aromatic amino acid tyrosine, in contrast to LPC (P-16:0), was positively associated with the fat fraction in liver. Earlier studies have linked increased levels of tyrosine with chronic hepatitis, liver cirrhosis and non-alcoholic fatty liver disease (NAFLD)185,186.

In paper IV we also constructed a transparent rule-based classification model from MS metabolomics to explore non-linear interdependencies of metabolites in ND and T2D, that we visualized in a rule network (Figure 15). In addition to the fact that the network consisted of metabolites known to be significantly altered in T2D, it showed novel multidimensional associations of metabolite levels. Furthermore, the rule-based model was able to predict ND and T2D in an independent dataset with 78% accuracy.

Taken together our methods suggest various strategies of multi-omics integration in the study of complex diseases and our results demonstrate a broad spectrum of findings. Specifically, we presented a collection of powerful methodologies for systematic identification of metabolites of interest and putative genetic regulatory regions. Next, we complemented these

Page 55: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

55

approaches with imiomics, proteomics and other types of genomics to suggest interesting associations to phenotypes and novel genomic alterations related to complex diseases.

Page 56: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

56

Sammanfattning på svenska

Världshälsoorganisationen (WHO) estimerade nyligen att det finns totalt 422 miljoner diagnostiserade och odiagnostiserade fall av typ 2 diabetes (T2D) i världen, och förutspådde att de identifierade fallen skulle öka till 430 miljoner år 2030, vilket gör T2D till en allvarlig potentiell pandemi. T2D är en komplex metabol sjukdom karaktäriserad av att flera organ blir resistenta mot insulin, hormonet som hjälper celler att ta upp glukos för energi. Andra karaktäriseringar av T2D inkluderar förhöjda blodsockernivåer och en oförmåga hos β-cellerna i bukspottskörteln att utsöndra tillräckliga mängder insulin. Låg fysisk aktivitet, fetma och hypertoni har alla länkats till insulinresistens (IR), och därigenom T2D. En serie av studier har associerat diverse genetiska faktorer, metaboliter och proteiner med T2D och IR.

Den centrala dogmen i molekylärbiologi illustrerar flödet av den genetiska informationen i celler (Figur 7a). Deoxiribonukleinsyran (DNA) är basen för den centrala dogmen som innehåller den genetiska informationen för reproduktion, tillväxt och funktion i organismen. Den genetiska informationen är kodad i funktionella sträckor inom DNA kallade gener vilka är vidare uttryckta via ribonukleinsyra (RNA). Översättningen av gener till RNA är vanligtvis känt som genuttryck. RNA molekyler är transkriberade till proteiner, vilka är de huvudsakliga strukturella enheterna i cellerna och utför den stora majoriteten av biologiska funktioner i organismen. Proteiner är hörnstenen i biokemiska reaktioner i biologiska processer, där metaboliterna fungerar som intermediära substrat eller slutprodukter.

När stimulerade av omgivningen så bidrar celler med instruktioner för att reglera genuttryck. I detta syfte rekryterar de specifika proteiner, kallade transkriptionsfaktorer (TFs), som binder till specifika genomiska lokus, kallade regulativa regioner, för att aktivera eller inhibera uttrycket av gener. Vanligtvis så samlas många TFs i en regulativ region där de formar proteinkomplex genom protein-protein interaktioner. Vi utvecklade en förutsättningslös algoritm, kallad tfNet, som upptäckte regulativa region-kandidater i genomet baserat på TF-komplex. Vi demonstrerade även att vår modell var data-driven och specifik för olika typer av celler och organismer. Totalt identifierade vi ungefär 144,000 kandidater för regulativa regioner över fem olika celltyper. Vi visade även att en delmängd av dessa regioner kan innehålla ändringar kapabla att avreglera uttrycket av gener på andra sätt en de ursprungliga instruktioner som cellen bidragit med.

Page 57: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

57

Cellkärnor innehåller majoriteten av den genetiska informationen i en cell. I en annan studie isolerade vi 4,282 cellkärnor från ett fruset leverprov. Vi använde uttrycksnivåerna för gener från individuella kärnor för att identifiera sju olika levertyper: hepatocyter, sinusoidala endotelialceller, cholangiocyter, Kupffer-celler, Nk/T/B-celler, celler från nerv/arteriell källa och intressant nog båda aktiva, och flera grupper av inaktiva lever stellatceller. En extra funktion hos TFs är att interagera med andra TF komplex lokaliserade i distal-regulativa regioner och forma genomiska loopar genom att binda ihop två distal-regioner ungefär som en knut skapar en snara. På vårt leverprov använde vi en teknik kallad HiCap för att identifiera sådana regioner medans en annan teknik användes för att mäta nivåerna av protein. Vi integrerade de olika typerna av data och fokuserade på gener med etablerad relevans inom leverbiologi. Vi identifierade en DNA-ändring i en regulativ region som potentiellt kan påverka uttrycket av genen DPYD. Denna gen kodar för ett protein som ansvarar för att bryta ned en specifik grupp av cancerbehandlingsläkemedel kallade fluoropyrimidiner. Trasig reglering eller uttryck av DPYD kan potentiellt leda till livshotande toxicitet hos patienter som undergår kemoterapi. En ytterligare DNA-mutation upptäcktes i TNIK genen som potentiellt kan påverka uttrycket av genen SLC2A2. Båda dessa har associerats med låg överlevnad i patienter som lider av levercancer. Vikten av mutationen upptäcktes i en stor studie på fler än 350 fall av levercancer. De tidigare nämnda studierna demonstrerade diverse strategier och den övergripande potentialen av att integrera olika typer av data via avancerade bioinformatikmodeller.

I en annan studie använde vi en kohort på 43 multiorgan-donatorer kategoriserade som hälsosamma, pre-diabetiska, och typ 2 diabetiska baserat på deras blodsockernivåer. Vi mätte nivåerna på 286 metaboliter från invärtes fetmaprover (VAT), sköldkörtelholmar, skelettmuskulatur, lever och blodserum i kohorten. Detta data set var en värdefull källa för T2D fältet som tillät oss att jämföra varierande mönster av metaboliter i vävnaden. En stor mängd metaboliter tillhörandes gruppen karnitiner, som är essentiella metaboliter för energiproduktion, var signifikant förhöjda i levern för T2D patienter. En annan stor grupp metaboliter, kallade lysofosfatidylcholiner (LPCs), som ingår i cellulära membran, var signifikant färre i muskulaturen hos T2D patienter. Delmängder av karnitiner och LPCs i blodserumet följde ett liknande mönster. Detta konfirmerade att förändringarna av nivåer av specifika metaboliter sker i primära vävnader, så som levern och skelettmuskulaturen, medans deras nivåer i serum är en sekundär händelse.

I vår sista studie fokuserade vi på samverkan mellan metaboliterna och fenotyperna i T2D (Figur 7a). Vi rekryterade 42 personer med olika blodsockernivåer och mätte förekomsten av 280 metaboliter från underhudsfett (SAT) och blodplasma. Vi tog dessutom en helkroppsbild av personerna för att mäta fenotypiska attribut så som glukosupptagning, fettmängd och volymen av diverse organ. Vi identifierade flera starka associationer mellan metaboliter och organfenotyper. Fettmängden i levern,

Page 58: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

58

och glukosupptagningen i VAT/SAT var starkt associerade med LPCs. Å andra sidan var en undergrupp av aminosyror, essentiella byggstenar för proteiner, negativt associerade med glukosupptagningen i VAT/SAT. Slutligen så upptäckte vi diverse komplexa associationer mellan metaboliter som kunde förutspå T2D personer gentemot hälsosamma personer med 78% träffsäkerhet. Dessa resultat konfirmerade en stark association mellan metabolom och fenotyp, och demonstrerade kombinationer av metaboliter som indikerar T2D.

I denna studie presentarede vi en samling strategier för att integrera olika typer av data i komplexa sjukdomar och ett brett spektrum av viktiga fynd. Specifikt så presenterade en samling kraftfulla metodologier för systematisk identifikation av associationer mellan biomarkörer och fenotypiska karaktäriseringar av T2D. En delmängd av dessa markörer har potentialen att fortgå till kliniska studier.

Page 59: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

59

Acknowledgements

It has been a long path towards this local maximum of a researcher’s life with many contributors. A few of them lie at the heart of this PhD thesis. I would like to express my sincere gratitude to: My main advisor, Prof. Jan Komorowski, who showed me the mountain and encouraged me to climb it. You believed in me from the very early days of my research training back in 2013, and you provided me the opportunity of this fascinating journey. I am extremely grateful for your advice in science and life, for your constant support and for paving the way for me to become an independent researcher. My other advisor, Prof. Claes Wadelius, who was always there to discuss and teach biology, and not only. I am greatly indebted to you for your kindness, patience and novel scientific ideas. My co-advisor, Docent Manfred Grabherr, for the discussions on computational approaches. Former and current members of the Komorowski group: Marcin for introducing me to bioinformatics and for sharing the Polish spirits. Michał for the exchange of scientific ideas and the brainstorming coffee-breaks. Husen for the countless discussions on programming languages and the fruitful collaborations. Nicholas for the teaching times, for sharing the unique experiences of our administration, for hosting me in Πήλιο and for the special sense of humor. Mateusz for the brainstorming and the SciLife discussions. Fredrik for the morning office chats. Susanne, Karolina, Sara, Zeeshan, and Behrooz for the group meetings, the lunch discussions and the cakes. Members of the Wadelius group: Gang for the attention to detailed biological mechanisms, and Marco for the countless meetings on our common manuscripts and the fun chats. Other collaborators who helped in this research: Maria, Jan E, Chanchal, Joel, Patrik, Håkan and Stanko for the collaborations in publications. Robin for the extremely efficient teamwork in paper IV. Alba for agreeing that we disagree, and Antonia for the countless plannings on the MM project.

Page 60: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

60

The friends that we spent hours eating, drinking, dancing and chatting: Laora and Antony for always being there; for the chilling dinners, the wine and cheese. Aleksandra and Esli for the Albanian dinners, and the arguments on life and politics. Sofia and Sebastian for the relaxing times, the advice and the amazing wedding. Showgy, Juan and Max for the ultimate parties and the fun times. Uppsala volleyball for the matches and the trainings. Serkan, Umut and Alex for the Turkish dinners and the great BBQs. Παναγιώτης (κουμπάρος) και Λεωνίδας (παιδικός φίλος) για το ότι ήταν πάντα εκεί για να τα πούμε και να τα πιούμε όποτε γυρνούσα για λίγες μέρες στην Πάτρα. Ceidόφιλοι Πάνος και Κώστας για τα καμένα αστεία, τις μπύρες και τη φιλοξενία. (Extended) family members: Berti, Silva, Arbi, Jokli, Evis dhe Nineta që më kanë mbështetur, më kanë ndihmuar me gjithçka dhe më kanë dhënë gjithmonë më shumë se sa meritoj. Halla, që u bë nënë e dytë për mua dhe që më ndihmoi pa fund. My parents Kristaq and Engjellushe: Mami që është larguar nga jeta më shumë se 13 vjet më parë, dhe që nuk do ketë mundësine të krenohet për këtë arritje. Babi që më ka mbështetur dhe besuar gjithmonë, akoma dhe kur unë e kam tepruar. Nuk do kisha arritur kurrë këtu pa sakrificat e pafundme nga ana juaj. Shpresoj se ky është një pjesë e shpërblimit nga ana ime! To my wife, best friend and partner in crime, Mila. My dear, you were not here from the beginning of this journey, but when you joined it became colorful and meaningful. I am so grateful you are part of my life, and I am the happiest man on earth for you carrying our son! I will be eternally indebted to you for agreeing to start our common life in Uppsala.

I thank you all from the very bottom of my heart!

Klev Diamanti Uppsala, November 2019

Page 61: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

61

References

1. Association, A.D. Diabetes Care 33, S62–S69 (2010). 2. Lakhtakia, R. Sultan Qaboos Univ. Med. J. 13, 368–370 (2013). 3. Katsarou, A. et al. Nat. Rev. Dis. Primer 3, 17016 (2017). 4. Paschou, S.A., Papadopoulou-Marketou, N., Chrousos, G.P. & Kanaka-

Gantenbein, C. Endocr. Connect. 7, R38–R46 (2018). 5. Chen, L., Magliano, D.J. & Zimmet, P.Z. Nat. Rev. Endocrinol. 8, 228–236

(2012). 6. Prasad, R.B. & Groop, L. Genes 6, 87–123 (2015). 7. Tuomi, T. et al. The Lancet 383, 1084–1094 (2014). 8. World Health Organization (World Health Organization: Geneva,

Switzerland, 2016). 9. Danaei, G. et al. The Lancet 378, 31–40 (2011). 10. Shaw, J.E., Sicree, R.A. & Zimmet, P.Z. Diabetes Res. Clin. Pract. 87, 4–14

(2010). 11. Simmons, D., Williams, D.R. & Powell, M.J. Br. Med. J. 298, 18–21 (1989). 12. Yang, W. et al. N. Engl. J. Med. (2010).doi:10.1056/NEJMoa0908292 13. Wändell, P.E. et al. Diabetes Metab. 34, 328–333 (2008). 14. McNeely, M.J. & Boyko, E.J. Diabetes Care 27, 66–69 (2004). 15. Sonksen, P. & Sonksen, J. Br. J. Anaesth. 85, 69–79 (2000). 16. DeFronzo, R.A. Diabetes 58, 773–795 (2009). 17. Consoli, A., Nurjhan, N., Reilly, J.J., Bier, D.M. & Gerich, J.E. J. Clin. Invest.

86, 2038–2045 (1990). 18. DeFronzo, R.A., Ferrannini, E. & Simonson, D.C. Metab. - Clin. Exp. 38,

387–395 (1989). 19. Groop, L.C. et al. J. Clin. Invest. 84, 205–213 (1989). 20. Ferrannini, E. et al. Metabolism 37, 79–85 (1988). 21. Bonadonna, R.C. et al. Diabetes 45, 915–925 (1996). 22. Groop, L.C. et al. J. Clin. Endocrinol. Metab. 72, 96–107 (1991). 23. Shulman, G.I. et al. J. Clin. Invest. 76, 1229–1236 (1985). 24. Shulman, G.I. et al. N. Engl. J. Med.

(1990).doi:10.1056/NEJM199001253220403 25. Mandarino, L.J. et al. J. Clin. Invest. 80, 655–663 (1987). 26. Lupi, R. et al. Diabetes 51, 1437–1442 (2002). 27. Huang, C. et al. Diabetes 56, 2016–2027 (2007). 28. Ritzel, R.A., Meier, J.J., Lin, C.-Y., Veldhuis, J.D. & Butler, P.C. Diabetes

56, 65–71 (2007). 29. Andreozzi, F. et al. Endocrinology 145, 2845–2857 (2004). 30. Patanè, G. et al. Diabetes 51, 2749–2756 (2002). 31. Chang, A.M. & Halter, J.B. Am. J. Physiol.-Endocrinol. Metab. 284, E7–E12

(2003).

Page 62: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

62

32. Lyssenko, V. et al. J. Clin. Invest. 117, 2155–2163 (2007). 33. DeFronzo, R.A. Diabetes 37, 667–687 (1988). 34. Bays, H.E. et al. Expert Rev. Cardiovasc. Ther. 6, 343–368 (2008). 35. Bays, H., Mandarino, L. & DeFronzo, R.A. J. Clin. Endocrinol. Metab. 89,

463–478 (2004). 36. Richardson, D.K. et al. J. Biol. Chem. 280, 10290–10297 (2005). 37. Bajaj, M. et al. Diabetes 51, 3043–3048 (2002). 38. Yamada, Y. & Seino, Y. Horm. Metab. Res. 36, 771–774 (2004). 39. Škrha, J., Hilgertová, J., Jarolímková, M., Kunešová, M. & Hill, M.59, 7

(2010). 40. Calanna, S. et al. Diabetologia 56, 965–972 (2013). 41. Hædersdal, S., Lund, A., Knop, F.K. & Vilsbøll, T. Mayo Clin. Proc. 93, 217–

239 (2018). 42. Baron, A.D., Schaeffer, L., Shragg, P. & Kolterman, O.G. Diabetes 36, 274–

283 (1987). 43. Dai, X. et al. Diabetes 67, 315-LB (2018). 44. Abdul-Ghani, M. & DeFronzo, R. Endocr. Pract. 14, 782–790 (2008). 45. Norton, L. et al. Diabetes Obes. Metab. 19, 1322–1326 (2017). 46. Spatola, L., Finazzi, S., Angelini, C., Dauriz, M. & Badalamenti, S. Diabetes

Ther. 9, 427–430 (2018). 47. Boersma, G.J. et al. Horm. Metab. Res. 50, 627–639 (2018). 48. Hirvonen, J. et al. Diabetes 60, 443–447 (2011). 49. Timper, K. & Brüning, J.C. Dis. Model. Mech. 10, 679–689 (2017). 50. Könner, A.C. & Brüning, J.C. Cell Metab. 16, 144–152 (2012). 51. Perreault, L. Endotext (2000).at

<http://www.ncbi.nlm.nih.gov/books/NBK538537/> 52. Association, A.D. Diabetes Care 41, S13–S27 (2018). 53. Ahlqvist, E. et al. Lancet Diabetes Endocrinol. 6, 361–369 (2018). 54. Dzau Victor J. Circulation 109, IV–1 (2004). 55. Singh, B. & Saxena, A. World J. Diabetes 1, 36–47 (2010). 56. Muniyappa, R., Lee, S., Chen, H. & Quon, M.J. Am. J. Physiol.-Endocrinol.

Metab. 294, E15–E26 (2008). 57. DeFronzo, R.A., Tobin, J.D. & Andres, R. Am. J. Physiol.-Endocrinol. Metab.

237, E214 (1979). 58. Man, C.D. et al. Diabetes 54, 3265–3273 (2005). 59. Terra, S.G. et al. Exp. Clin. Endocrinol. Diabetes 119, 401–407 (2011). 60. Sakaguchi, K. et al. Diabetol. Int. 7, 53–58 (2016). 61. Matthews, D.R. et al. Diabetologia 28, 412–419 (1985). 62. Levy, J.C., Matthews, D.R. & Hermans, M.P. Diabetes Care 21, 2191–2192

(1998). 63. Stratton, I.M. et al. BMJ 321, 405–412 (2000). 64. Diabetes Control and Complications Trial Research Group et al. N. Engl. J.

Med. 329, 977–986 (1993). 65. Rohlfing, C.L. et al. Diabetes Care 25, 275–278 (2002). 66. Monnier, L., Lapinski, H. & Colette, C. Diabetes Care 26, 881–885 (2003). 67. Welters, H.J. & Kulkarni, R.N. Trends Endocrinol. Metab. 19, 349–355

(2008). 68. Deeb, S.S. et al. Nat. Genet. 20, 284–287 (1998). 69. Hani, E.H. et al. Diabetologia 41, 1511–1515 (1998).

Page 63: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

63

70. Gloyn, A.L. et al. Diabetes 52, 568–572 (2003). 71. Xue, A. et al. Nat. Commun. 9, 2941 (2018). 72. Sun, W. et al. PLOS ONE 13, e0192105 (2018). 73. Kirchner, H. et al. Mol. Metab. 5, 171–183 (2016). 74. Willmer, T., Johnson, R., Louw, J. & Pheiffer, C. Front. Endocrinol. 9,

(2018). 75. Kato, M. & Natarajan, R. Nat. Rev. Nephrol. 10, 517–530 (2014). 76. Brasacchio, D. et al. Diabetes 58, 1229–1236 (2009). 77. Cavalli, M. et al. Hum. Genomics 13, 20 (2019). 78. Bysani, M. et al. Sci. Rep. 9, 1–12 (2019). 79. Greenwald, W.W. et al. Nat. Commun. 10, 1–12 (2019). 80. Gieger, C. et al. PLOS Genet. 4, e1000282 (2008). 81. Suhre, K. et al. Nature 477, 54–60 (2011). 82. Suhre, K. et al. PLOS ONE 5, e13953 (2010). 83. Long, T. et al. Nat. Genet. 49, 568–578 (2017). 84. Shin, S.-Y. et al. Nat. Genet. 46, 543–550 (2014). 85. Suhre, K. et al. Nat. Commun. 8, 14357 (2017). 86. Stunnenberg, H.G. & Hubner, N.C. Hum. Genet. 133, 689–700 (2014). 87. Nowak, C. et al. PLOS Genet. 12, e1006379 (2016). 88. Nowak, C. et al. Diabetes 65, 276–284 (2016). 89. Lotta, L.A. et al. PLOS Med. 13, e1002179 (2016). 90. Nowak, C. et al. Sci. Rep. 8, 8691 (2018). 91. Xin, Y. et al. Cell Metab. 24, 608–615 (2016). 92. Maris, M., Overbergh, L. & Mathieu, C. PROTEOMICS – Clin. Appl. 2, 312–

326 (2008). 93. Guasch-Ferré, M. et al. Diabetes Care 39, 833–846 (2016). 94. Roberts, L.D., Koulman, A. & Griffin, J.L. Lancet Diabetes Endocrinol. 2,

65–75 (2014). 95. Schneider, M.V. & Orchard, S. Bioinforma. Omics Data Methods Protoc. 3–

30 (2011).doi:10.1007/978-1-61779-027-0_1 96. Hasin, Y., Seldin, M. & Lusis, A. Genome Biol. 18, 83 (2017). 97. Fuhrer, T. & Zamboni, N. Curr. Opin. Biotechnol. 31, 73–78 (2015). 98. Furey, T.S. Nat. Rev. Genet. 13, 840–852 (2012). 99. The ENCODE Project, C. Nature 489, 57–74 (2012). 100. Bernstein, B.E. et al. Nat. Biotechnol. 28, 1045–1048 (2010). 101. The cell editorial team Cell 167, 1139 (2016). 102. Venter, J.C. et al. Science 291, 1304–1351 (2001). 103. Chatterjee, A., Ahn, A., Rodger, E.J., Stockwell, P.A. & Eccles, M.R. Gene

Expr. Anal. 1783, 35–80 (2018). 104. Conesa, A. et al. Genome Biol. 17, 13 (2016). 105. Tang, F. et al. Nat. Methods 6, 377–382 (2009). 106. Macosko, E.Z. et al. Cell 161, 1202–1214 (2015). 107. Ziegenhain, C. et al. Mol. Cell 65, 631-643.e4 (2017). 108. Lun, A.T.L., McCarthy, D.J. & Marioni, J.C. F1000Research 5, 2122 (2016). 109. Kim, M.-S. et al. Nature 509, 575–581 (2014). 110. Wilhelm, M. et al. Nature 509, 582–587 (2014). 111. Catherman, A.D., Skinner, O.S. & Kelleher, N.L. Biochem. Biophys. Res.

Commun. 445, 683–693 (2014).

Page 64: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

64

112. Kelstrup, C.D., Young, C., Lavallee, R., Nielsen, M.L. & Olsen, J.V. J. Proteome Res. 11, 3487–3497 (2012).

113. Gillet, L.C. et al. Mol. Cell. Proteomics 11, O111.016717 (2012). 114. Koopmans, F., Ho, J.T.C., Smit, A.B. & Li, K.W. PROTEOMICS 18,

1700304 (2018). 115. Tyanova, S., Temu, T. & Cox, J. Nat. Protoc. 11, 2301–2319 (2016). 116. Tyanova, S. et al. Nat. Methods 13, 731–740 (2016). 117. Zubarev, R.A. PROTEOMICS 13, 723–726 (2013). 118. Emwas, A.-H.M. Metabonomics Methods Protoc. 161–193

(2015).doi:10.1007/978-1-4939-2377-9_13 119. Riekeberg, E. & Powers, R. F1000Research 6, 1148 (2017). 120. Sparkman, O.D., Penton, Z.E. & Kitson, F.G. Gas Chromatogr. Mass

Spectrom. Second Ed. 15–83 (2011).doi:10.1016/B978-0-12-373628-4.00002-2

121. Hoffmann, E. de & Stroobant, V. (Wiley: Chichester, 2009). 122. Ganna, A. et al. Metabolomics 12, 4 (2015). 123. Smith, C.A. Ther. Drug Monit. 27, 747–751 (2005). 124. Wishart, D.S. et al. Nucleic Acids Res. 46, D608–D617 (2018). 125. Hastings, J. et al. Nucleic Acids Res. 44, D1214–D1219 (2016). 126. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. Nucleic

Acids Res. 45, D353–D361 (2017). 127. Kim, S. et al. Nucleic Acids Res. 47, D1102–D1109 (2019). 128. Würtz Peter et al. Circulation 131, 774–785 (2015). 129. Wang, T.J. et al. Nat. Med. 17, 448–453 (2011). 130. Fessenden, M. Nature 540, 153–155 (2016). 131. Gálisová, A. et al. Mol. Imaging Biol. 21, 454–464 (2019). 132. Quick, H.H. J. Magn. Reson. Imaging 39, 243–258 (2014). 133. Schlemmer, H.-P.W. et al. Radiology 248, 1028–1035 (2008). 134. von Schulthess, G.K., Steinert, H.C. & Hany, T.F. Radiology 238, 405–422

(2006). 135. Strand, R. et al. PLOS ONE 12, e0169966 (2017). 136. Rao, S.S.P. et al. Cell 159, 1665–1680 (2014). 137. Lieberman-Aiden, E. et al. Science 326, 289 (2009). 138. Fullwood, M.J. & Ruan, Y. J. Cell. Biochem. 107, 30–39 (2009). 139. Zhang, J. et al. Methods 58, 289–299 (2012). 140. Sahlén, P. et al. Genome Biol. 16, 156 (2015). 141. Anil, A., Spalinskas, R., Åkerborg, Ö. & Sahlén, P. Bioinformatics 34, 675–

677 (2018). 142. National Research Council (US) Committee on Responsibilities of Authorship

in the Biological Sciences (National Academies Press (US): Washington (DC), 2003).

143. Toronto International Data Release Workshop Authors Nature 461, 168–170 (2009).

144. Wang, Q. et al. Sci. Data 5, 180061 (2018). 145. Meyer, C.A. & Liu, X.S. Nat. Rev. Genet. 15, 709–721 (2014). 146. Marinov, G.K. et al. Genome Res. 24, 496–510 (2014). 147. Brennecke, P. et al. Nat. Methods 10, 1093–1095 (2013). 148. Kim, J.K., Kolodziejczyk, A.A., Ilicic, T., Teichmann, S.A. & Marioni, J.C.

Nat. Commun. 6, 8687 (2015).

Page 65: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

65

149. Fall, T. et al. Diabetologia 59, 2114–2124 (2016). 150. Do, K.T. et al. Metabolomics 14, 128 (2018). 151. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Nat. Biotechnol.

36, 411–420 (2018). 152. Hoffman, G.E. & Schadt, E.E. BMC Bioinformatics 17, 483 (2016). 153. McCarthy, D.J., Campbell, K.R., Lun, A.T.L. & Wills, Q.F. Bioinformatics

33, 1179–1186 (2017). 154. Stuart, T. et al. bioRxiv 460147 (2018).doi:10.1101/460147 155. Becht, E. et al. Nat. Biotechnol. 37, 38–44 (2019). 156. McInnes, L., Healy, J. & Melville, J. ArXiv180203426 Cs Stat (2018).at

<http://arxiv.org/abs/1802.03426> 157. Friedman, J., Hastie, T. & Tibshirani, R. J. Stat. Softw. 33, 1–22 (2010). 158. Kursa, M.B. & Rudnicki, W.R. J. Stat. Softw. 36, 1–13 (2010). 159. Dramiński, M. & Koronacki, J. J. Stat. Softw. 85, 1–28 (2018). 160. Dramiński, M. et al. Bioinformatics 24, 110–117 (2008). 161. Saeys, Y., Inza, I. & Larrañaga, P. Bioinformatics 23, 2507–2517 (2007). 162. Pawlak, Z. Int. J. Comput. Inf. Sci. 11, 341–356 (1982). 163. Garbulowski, M. et al. bioRxiv 625905 (2019).doi:10.1101/625905 164. Øhrn, A. & Komorowski, J. Proc Third Int. Jt. Conf. Inf. Sci. 403–407 (1997). 165. Smolinska, K., Garbulowski, M., Komorowski, J. & Diamanti, K. Prep.

(2018). 166. Kim, M., Rai, N., Zorraquino, V. & Tagkopoulos, I. Nat. Commun. 7, 13090

(2016). 167. Chaudhary, K., Poirion, O.B., Lu, L. & Garmire, L.X. Clin. Cancer Res. 24,

1248–1259 (2018). 168. Ching Travers et al. J. R. Soc. Interface 15, 20170387 (2018). 169. Grapov, D., Fahrmann, J., Wanichthanarak, K. & Khoomrung, S. OMICS J.

Integr. Biol. 22, 630–636 (2018). 170. Zou, J. et al. Nat. Genet. 51, 12 (2019). 171. Pinu, F.R. et al. Metabolites 9, 76 (2019). 172. Kindt, A. et al. Nat. Commun. 9, 3760 (2018). 173. Angelidis, I. et al. Nat. Commun. 10, 963 (2019). 174. Yan, J. et al. Cell 154, 801–813 (2013). 175. Ernst, J. et al. Nature 473, 43–49 (2011). 176. Giesbertz, P. & Daniel, H. Curr. Opin. Clin. Nutr. Metab. Care 19, 48–54

(2016). 177. Gar, C. et al. Crit. Rev. Clin. Lab. Sci. 55, 21–32 (2018). 178. Barber, M.N. et al. PLOS ONE 7, e41456 (2012). 179. Johansson, E. et al. Radiology 286, 271–278 (2017). 180. Santer, R., Schneppenheim, R., Suter, D., Schaub, J. & Steinmann, B. Eur. J.

Pediatr. 157, 783–797 (1998). 181. Jin, J. et al. Pathol. - Res. Pract. 210, 621–627 (2014). 182. Floegel, A. et al. Diabetes 62, 639–648 (2013). 183. Adams, S.H. et al. J. Nutr. 139, 1073–1081 (2009). 184. Pereira, M.J. et al. Metabolism 65, 1768–1780 (2016). 185. Michitaka, K. et al. Hepatol. Res. 40, 393–398 (2010). 186. Kawanaka, M. et al. Hepatic Med. Evid. Res. 7, 29–35 (2015).

Page 66: Integrating multi-omics for type 2 diabetesuu.diva-portal.org/smash/get/diva2:1353269/FULLTEXT01.pdfDissertation presented at Uppsala University to be publicly examined in C2:305,

Acta Universitatis UpsaliensisDigital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology 1860

Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science andTechnology, Uppsala University, is usually a summary of anumber of papers. A few copies of the complete dissertationare kept at major Swedish research libraries, while thesummary alone is distributed internationally throughthe series Digital Comprehensive Summaries of UppsalaDissertations from the Faculty of Science and Technology.(Prior to January, 2005, the series was published under thetitle “Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology”.)

Distribution: publications.uu.seurn:nbn:se:uu:diva-393440

ACTAUNIVERSITATIS

UPSALIENSISUPPSALA

2019