1
Cheminformatics approaches for metabolomics research
ChemAxon User Group Meeting 2009, San Diego, CA
Tobias Kind, UC Davis Genome Center, FiehnLab - Metabolomics
2
Outline
1) Very Short Introduction to Metabolomics
2) Seven Real Life Approaches with ChemAxon Tools
3) Outlook and Conclusions
3
Metabolomics as part of modern life sciences
Phenotype (temporal x spatial resolution)
Genotype x Environment
Genomics
Transcriptomics: mRNA expression
Proteomics: Protein expression
Metabolomics: Metabolite expression
Techniques and tools @ FiehnLab
Gas Chromatography: GC-TOF-MS, GCxGC-TOF-MS, Quadrupole-GC-MS, Pyrolysis-GC-MS
Liquid Chromatography: LC-MS, UPLC-MS, monolithic LC, HILIC, RP, NP
FT-ICR-MS: LTQ-FT-MS via CoreLab
BioInformatics and ChemInformatics: BinBase and SetupX, statistics and machine learning, open source + commercial software
5
Approach No. 1: Data sharing in chemistry
Tools used: MView, Molconvert, Instant-JChem, MSketch, IUPAC naming
Topic: Missing spectral repositories and semantics annotation hinder research
Results:
• Use InChIKey and PubChem for structure annotations, do not use SMILES
• Submit structures directly to journals, do not use OCR
• Submit spectra directly to journals/repositories, do not use OCR
• Annotate older publications with structure-to-name algorithms
Ideas for ChemAxon:
• Can MSketch code a CML into a chemical reaction picture for journals?
• Can Chemicalize automatically annotate my new paper with InChIKeys?
Kind T, Scholz M, Fiehn O. How Large Is the Metabolome? A Critical Analysis of Data Exchange Practices in Chemistry. PLoS ONE 4(5): e5440. (2009); doi:10.1371/journal.pone.0005440
DQBQWWSFRPLIAX-UHFFFAOYAG
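To illustrate the annotation idea above, here is a minimal Python sketch that pulls InChIKey-like strings out of free text. The two patterns are assumptions based on the published key layouts: the current standard key has 27 characters (blocks of 14, 10 and 1 letters), while the key shown on this slide uses a shorter 25-character layout (blocks of 14 and 10 letters). The function name find_inchikeys is hypothetical, and a pattern match does not prove that a key belongs to a real structure.

```python
import re

# Standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 chars).
STANDARD_KEY = r"[A-Z]{14}-[A-Z]{10}-[A-Z]"
# Shorter 25-character layout, as in the key shown on this slide (assumption).
LEGACY_KEY = r"[A-Z]{14}-[A-Z]{10}"

# The standard form is tried first so that it is not cut down to the legacy form.
KEY_PATTERN = re.compile(r"\b(?:%s|%s)\b" % (STANDARD_KEY, LEGACY_KEY))


def find_inchikeys(text):
    """Return all InChIKey-like strings found in text, in order of appearance."""
    return KEY_PATTERN.findall(text)
```

Calling find_inchikeys on the text of a manuscript would give a list of candidate keys that could then be checked against PubChem.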
6
Approach No. 1: Data sharing in chemistry
Hamburger to Cow algorithm or "Wishful Thinking": requires Jurassic Park technology
Digital structures and spectra
Data reduction and loss: remove noise and uninteresting data
Analog paper publication
Extreme data loss: OCR and text mining conversion errors
Digital database from OCR data
7
Approach No. 2: From mass to molecular formula
Tools used: MView, Molconvert, MSketch, Cxcalc, Instant-JChem
Topic: Determine correct elemental formulas from mass spectrometry data and query compound databases for possible structures (metabolite rediscovery process)
Results:
• Converted PubChem, DNP, Drugbank, TSCA into formula test set
• Developed heuristic rules for correct elemental composition determination
• Determined size of molecular formula space
Ideas for ChemAxon:
• Can we use JKlustor or LibMCS for creating a set of natural product fragments?
• Can we use the Synthesizer to create matching natural-product-like compounds?
Kind, T; Fiehn, O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics, 8: Art. No. 105, Mar 27 2007; http://www.biomedcentral.com/1471-2105/7/234
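The step from an accurate mass to candidate formulas can be sketched as follows. This is not the Seven Golden Rules implementation: it is a simplified illustration restricted to C, H, N and O that applies only two checks, a non-negative integer ring-and-double-bond equivalent and an H/C ratio window. The function names, the 5 ppm tolerance and the default H/C window are assumptions chosen for the example.

```python
# Monoisotopic masses of the most abundant isotopes (Da).
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}


def formula_string(counts):
    """Format element counts in C, H, N, O order, omitting zero counts."""
    parts = []
    for element in ("C", "H", "N", "O"):
        n = counts.get(element, 0)
        if n:
            parts.append(element + (str(n) if n > 1 else ""))
    return "".join(parts)


def candidate_formulas(target_mass, ppm=5.0, hc_range=(0.2, 3.1)):
    """Enumerate neutral CHNO formulas within ppm of target_mass.

    Illustrative filters only (assumed thresholds, not the published rule set).
    """
    tolerance = target_mass * ppm * 1e-6
    hits = []
    for c in range(1, int(target_mass // MONO["C"]) + 1):
        left_c = target_mass - c * MONO["C"]
        for n in range(int(left_c // MONO["N"]) + 1):
            left_n = left_c - n * MONO["N"]
            for o in range(int(left_n // MONO["O"]) + 1):
                left_o = left_n - o * MONO["O"]
                # The tolerance is far below 1 Da, so only the nearest
                # hydrogen count can possibly fit.
                h = round(left_o / MONO["H"])
                mass = (c * MONO["C"] + h * MONO["H"]
                        + n * MONO["N"] + o * MONO["O"])
                if abs(mass - target_mass) > tolerance:
                    continue
                # Twice the ring-and-double-bond equivalent, C - H/2 + N/2 + 1.
                twice_rdbe = 2 * c - h + n + 2
                if twice_rdbe < 0 or twice_rdbe % 2:
                    continue
                if not hc_range[0] <= h / c <= hc_range[1]:
                    continue
                counts = {"C": c, "H": h, "N": n, "O": o}
                hits.append((formula_string(counts), mass))
    return hits
```

For a measured neutral mass of 180.06339 Da the returned candidates include C6H12O6. Every formula that survives would then be looked up in the compound databases to retrieve possible structures.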
8
Approach No. 2: From exact mass to structures
More filters to come: MS/MS matching, retention time refinements...
9
Approach No. 2: The molecular formula space of small molecules calculated by the Seven Golden Rules
Each molecular formula can expand to billions of structural isomers. Molecular Formula ≠ Molecular Isomer
8,000,000,000 possible elemental compositions (< 2000 Da, CHNSOP, Lewis + Senior)
600,000,000 highly probable formulas using Seven Golden Rules
The molecular formula space below 2000 Dalton (grey box)
700,000 formulae in PubChem covering 10,000,000 isomers
50,000 elemental compositions: Naturals, Drugs, Toxicants
10
Approach No. 3: Organic reactions in-silico and in-vitro
Tools used: MSketch, Reactor
Topic: Create expected structures in silico, detect structures with GC-MS
Results: Reactor used in reaction planning for metabolic profiling
Ideas for users:
• Share organic reaction libraries for later use with Reactor• Use Reactor for organic synthesis teaching at universities
Example protocol (although Reactor was not directly used for this application):
Fiehn O, Wohlgemuth G, Scholz M, Kind T, Lee DY, Lu Y, Moon S, Nikolau BJ. Quality control for plant metabolomics: Reporting MSI-compliant studies. Plant Journal (2008) 53, 691-704
Source: Blackwell
[Figure: Example GC-MS chromatogram with labeled regions: small acids, alcohols; free fatty acids; sterols, hydroxy acids, amino acids; mono-saccharides; di- & tri-saccharides; axis mark: 1300s]
11
Approach No. 3: Organic reactions in-silico and in-vitro
Gas chromatography requires volatile compounds (two-step derivatization in vial)
A) Methoximation of aldehyde and keto groups (primarily for opening reducing ring sugars)
B) Silylation of polar hydroxy, thiol, carboxy and amino groups with silylation agent MSTFA
Gas chromatography-mass spectrometry (GC-MS) can distinguish between stereoisomers
[Figure: mass spectra of the two isomers; x-axis m/z (80 to 500), y-axis abundance (0 to 100)]
Z/E isomers have the same mass spectrum but differ by 2 seconds in retention time
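The mass of the expected derivative after the two derivatization steps can be worked out with simple arithmetic. The sketch below assumes that each methoximated carbonyl gains a net CH3N (oxygen is replaced by N-OCH3, with loss of water) and that each silylated active hydrogen is replaced by a trimethylsilyl group, a net gain of C3H8Si. The function name is hypothetical. This only counts atoms; it is not a replacement for a reaction tool such as Reactor.

```python
# Net monoisotopic mass changes (Da), derived from elemental compositions.
# Methoximation: C=O + H2N-OCH3 -> C=N-OCH3 + H2O, a net gain of CH3N.
METHOXIME_SHIFT = 12.0 + 3 * 1.00782503207 + 14.0030740048
# Silylation: one active hydrogen is replaced by Si(CH3)3, a net gain of C3H8Si.
TMS_SHIFT = 3 * 12.0 + 8 * 1.00782503207 + 27.9769265325


def derivatized_mass(native_mass, n_carbonyl, n_active_h):
    """Monoisotopic mass after methoximation of n_carbonyl groups
    and trimethylsilylation of n_active_h active hydrogens."""
    return native_mass + n_carbonyl * METHOXIME_SHIFT + n_active_h * TMS_SHIFT
```

For glucose (monoisotopic mass 180.0634 Da, one carbonyl in the open-chain form, five hydroxy groups), derivatized_mass(180.0633881, 1, 5) gives about 569.29 Da. The Z and E methoxime isomers share this mass, which is consistent with the identical spectra noted on the slide.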
12
Approach No. 4: Gas chromatography-mass spectrometry mass spectral and retention library
Tools used: JChem API, MSketch, Instant-JChem
Topic: Developed GC-MS library for metabolic profiling and calculated structural overlap with existing metabolite databases
Results:
• 17,475 animal, human, plant and microbial samples from 55 different species from 248 metabolomic studies
• Metabolic profiling with FiehnLib identifies around 150 compounds per run
Ideas for ChemAxon:
• Provide PCA or PLS output for statistical analysis of library overlaps
• Automated Venn diagrams for DB overlap within Instant-JChem
Tobias Kind, Mine Palazoglu, Do Yup Lee, Yun Lu, Gert Wohlgemuth, Martin Scholz, Oliver Fiehn. FiehnLib - a mass spectral and retention index library for comprehensive metabolic profiling. http://fiehnlab.ucdavis.edu/projects/FiehnLib/
13
Approach No. 4: Gas chromatography-mass spectrometry mass spectral and retention library
The GC-MS library contains a diverse set of compounds important for metabolic profiling and machine learning purposes.
Mass spectra + retention indices: 1200
Unique compounds: 701
ID | Functional group | FiehnLib Hits
S5 | Alkenes | 96
S6 | Alkynes | 0
S12 | Alcohols | 276
S23 | Amines | 130
S48 | Aldehydes | 20
S49 | Ketones | 106
S86 | Lactones | 14
S98 | Amides | 48
S277 | Sugar pattern (multiple rings) | 46
S278 | Sugar pattern reducing sugars | 16
SA-1 | Aromatic steroids | 7
SA-2 | General steroids | 16
SA-3 | Phosphate group containing | 41
SA-4 | Chlorine containing (non salt) | 1
SA-5 | Nitrogen (n>0) in aromatic 6-ring | 58
SA-6 | Carboxylic acids | 321
SA-7 | Carboxyl (acid, ester, salt) with aliphatic carbon chain (n>6) | 53
SA-8 | Purines | 22
SA-T | Any (total number of structures) | 701
Table (SMARTS) and hashed fingerprints calculated with ChemAxon JAVA API
JAVA API example for SMARTS matching
14
Approach No. 4: Gas chromatography-mass spectrometry mass spectral and retention library
Workflow: KEGG molecules → detect 1024 substructures → create 1024-bit fingerprints → Tanimoto similarity score → multivariate compression (HCA, PCA)
Mol1 010011001001101101101100...
Mol2 010011001001101101101100...
Mol3 010011001001101101101100...
...
Moln 010011001001101101101100...
Tanimoto: T = C/(A+B+C)
[Figure: PCA score plot, t1 vs t2 (both axes -6 to 4); legend: FiehnLib, BioMeta/KEGG]
Diversity visualization using PCA; overlapping dots refer to the same compound
Table (SMARTS) and hashed fingerprints calculated with ChemAxon JAVA API; fingerprints are also available from PubChem Score Matrix
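The Tanimoto score on this slide can be computed directly from two bit fingerprints. The slide does not define its symbols, so the sketch assumes the usual reading: C is the number of bits set in both fingerprints, and A and B are the numbers of bits set in only the first or only the second. Fingerprints are stored as Python integers; the function name is hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient T = C / (A + B + C) for two bit fingerprints
    stored as non-negative integers."""
    common = bin(fp_a & fp_b).count("1")    # C: bits set in both
    only_a = bin(fp_a & ~fp_b).count("1")   # A: bits set only in fp_a
    only_b = bin(fp_b & ~fp_a).count("1")   # B: bits set only in fp_b
    total = only_a + only_b + common
    # Two empty fingerprints share nothing; report 0.0 instead of dividing by zero.
    return common / total if total else 0.0
```

A bit string such as the ones shown for Mol1 to Moln can be converted with int(bits, 2) before calling tanimoto. The pairwise scores then form the similarity matrix that feeds HCA or PCA.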
15
Approach No. 5: Retention time prediction for liquid chromatography
Tools used: Marvin, JChem API, MSketch, Instant-JChem, Kier & Hall SMARTS
Topic: Use retention time as a filter for structure refinement in the structure elucidation process
Results:
• LC retention time prediction is currently not accurate enough
• LC RT prediction relies on accurate pKa and logD predictions
• Good QSPR models require >500, or better >1000, diverse compounds
Ideas for ChemAxon:
• Provide more validation sets of pKa, logD for skeptical users ☺
Ideas for Users:
• Share more pKa, logD and solubility data for better model development ☺
16
Approach No. 5: Retention time prediction for liquid chromatography
Calibration using logP concept for reversed phase liquid chromatography data
[Figure: chromatogram, x-axis retention time [min] (0 to 45), with logP (lipophilicity) marks at logP=2, logP=4, logP=8]
[Figure: Deoxyguanosine, % species vs pH]
• very simplistic and coarse filter for RP only
• problematic with multi ionizable compounds
• logD (includes pKa) better than logP
• possible use as time segment filter
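The remark that logD includes pKa can be made concrete with the textbook relation for a compound with a single ionizable group, assuming that only the neutral species partitions into the organic phase. The function below is a sketch under that assumption and is not the method used by the Marvin calculator plugins. As the slide notes, compounds with several ionizable groups need a fuller treatment.

```python
import math


def log_d(log_p, pka, ph, kind="acid"):
    """logD of a monoprotic acid or base at a given pH.

    Assumes only the neutral form partitions (simplified model).
    """
    if kind == "acid":
        exponent = ph - pka      # acids ionize above their pKa
    elif kind == "base":
        exponent = pka - ph      # bases ionize below their pKa
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_p - math.log10(1.0 + 10.0 ** exponent)
```

For an acid with logP 2 and pKa 4, logD falls from about 1.70 at pH 4 to about -1.00 at pH 7. That shift is why a filter based on logP alone is coarse for ionizable metabolites.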
17
Approach No. 5: Retention time prediction for liquid chromatography
[Figure: predicted RT [min] vs experimental RT [min]; linear fit y = 1.0191x + 0.5298, R2 = 0.8744; labeled compounds: Riboflavin, Deoxyguanosine monophosphate (dGMP), Arginine]
• Based on logD, pKa, logP and Kier & Hall atomic descriptors
• 90 compounds (ndev = 48, ntest = 32); std error 3.7 min
• Good models need development set n>500
• Prediction power is most important
QSRR Model: Tobias Kind (FiehnLab) using ChemAxon Marvin and WEKA
Data Source: Lu W, Kimball E, Rabinowitz JD. J Am Soc Mass Spectrom. 2006 Jan;17(1):37-50; LC method using 90 nitrogen metabolites on RP-18
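A regression line and R2 value of the kind reported for the predicted versus experimental retention times can be computed with ordinary least squares. The sketch below uses only the standard library. It is not the WEKA model from the slide, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R2 = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Here xs would hold the experimental retention times and ys the predicted ones. For judging prediction power, the fit should be evaluated on compounds that were not used to build the model.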
18
Approach No. 6: Cheminformatics tools in teaching
Tools used: Marvin, MSketch, Instant-JChem, Calculator plugins
Topic: Spectra and structures must be handled as a unity
Generation of stereoisomers, resonance species for mass spectrometry
Ideas for university teachers and students:
• Use the free ChemAxon teaching license
Free teaching slides: http://fiehnlab.ucdavis.edu/staff/kind/Teaching/
19
Approach No. 7: Lipid Analysis
Tools used: MSketch, Instant-JChem, Calculator plugins, Reactor
Topic:
• Analysis of polar lipids with tandem mass spectrometry (MS/MS)
Results:
• lipid compounds were created with LipidMaps tools
• structure handling provided by Instant-JChem + EXCEL export
• spectral fragments data can be calculated from structures
• match in-silico spectra with experimental spectra
Ideas for Users or ChemAxon:
• Use the PubChem, LipidMaps, ChemSpider APIs to obtain database contents
Data presented at ASMS 2008 and Metabolomics 2009 conferences
Table downloads: http://fiehnlab.ucdavis.edu/staff/kind/Metabolomics/LipidAnalysis/
20
Approach No. 7: Lipid analysis
Ion trap MS/MS spectra creation
NanoMate nanoESI chip-based infusion (nanoESI chip with 400 nozzles)
Low-resolution LTQ Ion Trap
High-resolution LTQ-FT
Lipid structure: head group; sn1 = alkyl or acyl residue; sn2 = alkyl or acyl residue
[Figure: MS spectrum, PCs_Pos_ID_CE45_01 #21-151, RT: 0.04-0.28, AV: 131, NL: 2.19E4, T: ITMS + p ESI Full ms [300.00-1100.00]; relative abundance (0 to 100) vs m/z (700 to 840); labeled peaks: 760.64, 782.64, 788.64, 734.64, 776.64, 810.64, 756.64, 798.64, 732.64, 746.64, 774.64, 706.64, 840.45, 814.64, 728.55, 720.55, 826.64, 694.55]
[Figure: MS/MS spectrum, PCs_Pos_ID_CE45_01 #163-214, RT: 0.31-0.97, AV: 2, NL: 1.51E1, T: Average spectrum MS2 760.50 (163-214); relative abundance (0 to 100) vs m/z (200 to 750); labeled peaks: 504.36, 478.36, 701.45, 577.45, 742.73, 522.45, 301.18, 658.55, 616.82, 404.09, 293.18, 433.18, 335.91, 256.27, 396.91, 761.00]
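The idea of calculating fragments from structures and matching them against experimental spectra can be sketched for a diacyl phosphatidylcholine (PC). The code builds the [M+H]+ precursor from a glycerophosphocholine backbone plus two fatty acids and predicts two commonly reported fragment types, loss of a fatty acid as the free acid and as a ketene. The function names, the restriction to acyl chains and the 0.5 Da ion trap tolerance are assumptions. The slides do not identify the lipid behind the MS2 spectrum of 760.50, so PC 16:0/18:1 is used purely as an illustration.

```python
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048,
        "O": 15.99491461956, "P": 30.97376163}
PROTON = 1.00727646688
WATER = 2 * MONO["H"] + MONO["O"]
# sn-glycero-3-phosphocholine backbone, C8H20NO6P.
GPC = (8 * MONO["C"] + 20 * MONO["H"] + MONO["N"]
       + 6 * MONO["O"] + MONO["P"])


def fatty_acid_mass(carbons, double_bonds):
    """Monoisotopic mass of a free fatty acid C(n)H(2n-2d)O2."""
    hydrogens = 2 * carbons - 2 * double_bonds
    return carbons * MONO["C"] + hydrogens * MONO["H"] + 2 * MONO["O"]


def pc_fragments(sn1, sn2):
    """Predict [M+H]+ and fatty acid losses for a diacyl PC.

    sn1 and sn2 are (carbons, double_bonds) tuples.
    """
    acids = [fatty_acid_mass(*sn1), fatty_acid_mass(*sn2)]
    # Two ester bonds are formed, so two water molecules are lost.
    precursor = GPC + sum(acids) - 2 * WATER + PROTON
    fragments = {"[M+H]+": precursor}
    for label, acid in zip(("sn1", "sn2"), acids):
        fragments["loss of %s acid" % label] = precursor - acid
        fragments["loss of %s ketene" % label] = precursor - (acid - WATER)
    return fragments


def match_peaks(predicted, observed, tolerance=0.5):
    """Map each predicted fragment to the closest observed peak within tolerance."""
    matches = {}
    for label, mz in predicted.items():
        near = [p for p in observed if abs(p - mz) <= tolerance]
        if near:
            matches[label] = min(near, key=lambda p: abs(p - mz))
    return matches
```

For pc_fragments((16, 0), (18, 1)) the predicted precursor is about 760.585, and the predicted fatty acid losses can be compared with peak labels copied from the MS/MS figure, for example match_peaks(fragments, [504.36, 478.36, 701.45, 577.45, 742.73, 522.45]). A real workflow would also predict head group fragments and score the whole spectrum.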
21
Approach No. 7: Lipid analysis
Export of structures from Instant-JChem into EXCEL
Structures created with LipidMaps tools
Lipid database of 44,000 glycerophospholipids, 444,080 diacylglycerols and mostly triacylglycerols
22
Conclusions – Metabolomics @ FiehnLab
Structure elucidation techniques for GC-MS and LC-MS
• require deep interaction between structure and spectra handling
• require algorithms for spectra interpretation and retention index prediction
• integration of metabolite and small molecule databases (PubChem/KEGG) needed
ChemAxon tools are technology enablers for metabolomics
• used for daily structure handling of small molecule structures and databases
• used for metabolomics method development
• used for development of new structure elucidation algorithms
23
Thank you!
Fiehn Lab
Dr. Oliver Fiehn (Principal Investigator)
Mine Palazoglu (Library, GC-MS, GCT)
Dr. Tobias Kind (Cheminformatics)
Dinesh Kumar Barupal (Bioinformatics)
Dr. Do Yup Lee (Biology, Proteins)
Gert Wohlgemuth (BinBase)
Kirsten Skogerson (NMR, GCxGC)
Dr. Kwang-Hyeon Liu (LC, Pharma)
Dr. Yun Gyong Ahn (GCT, GC-MS)
Sevini Shahbaz (Library)
Sponsors Fiehn Lab
NIH R01 ES013932
NIH GM078233
NIH R01 DK078328
UC Discovery itl07-10167
NSF MCB 0520140
EU FP7 Health-2007-2.1.4.1/Dupont
Agilent, LECO, Waters
Thanks to ChemAxon for free research and teaching licenses and great support in the ChemAxon Forum!