big data in chemistry - aboutbigchem.eu/sites/default/files/school1_tetko.pdf · data types...
Big Data in Chemistry
Institute of Structural Biology (STB), Helmholtz Zentrum München (HMGU)
Dr. Igor V. Tetko
Neuherberg, 18 October 2016
Outline
• Sources
• Example of Big Data
• Data quality and complexity
• Annotation of large virtual sets
• Deep learning
• Secure data sharing
• Outlook
Big Data Sources
Do we really have Big Data in chemistry? What kind of large data do we have?
Big Data definition
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate (Wikipedia)
Large Chemical Database
[Figure: chart on a logarithmic axis from 100,000 to 1,000,000,000; series: Compounds, Exp. Facts; one value marked "?"]
Data Types
| Database | Main data types |
|---|---|
| ChEMBL v. 21 (1) | Data mined from literature and PubChem HTS assays |
| BindingDB (2) | Experimental protein-small molecule interaction data |
| PubChem (3) | Bioactivity data from HTS assays |
| Reaxys (4) | Literature mined property, activity and reaction data |
| SciFinder (CAS) (5) | Experimental properties, 13C and 1H NMR spectra, reaction data |
| GOSTAR (6) | Target-linked data from patents and articles |
| AZ IBIS (7) | AZ in-house SAR data points |
| OCHEM (8) | Mainly ADMET data collected from literature |
1) Papadatos G, et al. J Comput Aided Mol Des 2015; 29(9): 885-96.
2) Gilson MK, et al. Nucleic Acids Res 2016; 44(D1): D1045-53.
3) Kim S, et al. Nucleic Acids Res 2016; 44(D1): D1202-13.
4) http://www.elsevier.com/solutions/reaxys
5) http://www.cas.org/products/scifinder
6) http://www.gostardb.com
7) Muresan S, et al. Drug Discov Today 2011; 16(23-24): 1019-30.
8) Sushko I, et al. J Comput Aided Mol Des 2011; 25(6): 533-54.
Big Data sizes
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate (Wikipedia)
CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=29452425
1 exabyte: 10^18 bytes
Large Chemical Database
[Figure: chart on a logarithmic axis from 100,000 to 1,000,000,000,000,000,000 (10^18); series: Compounds, Exp. Facts]
Big Data are relative to a field
• Methods to analyze such data do not exist
• We may not have sufficient technical resources (speed, memory) to use the existing methods
• We may not have the knowledge to use the existing methods
Thus Big Data can appear due to:
• Physical challenges (hardware)
• Knowledge challenges (informatics, software)
Example of Big Data
Which data are really big ones?
What data sizes are “big” ones?
“General melting point prediction based on a diverse compound data set and artificial neural networks” Karthikeyan et al. J. Chem. Inf. Model. 2005, 45(3), 681-90. N = 4173
→ Large dataset ~50k
→ Big dataset ~250k
Melting Point Datasets
• Bergström 277
• Bradley 2886
• OCHEM 22404
• Enamine 21883
[Figure: plot of the data for the Bergström, Bradley, OCHEM and Enamine sets]
Tetko et al. J. Chem. Inf. Model. 2014, 54(12): 3320-9.
275k Melting Point Datasets
• Bergström 277
• Bradley 2886
• OCHEM 22404
• Enamine 21883
• PATENTS 228079
[Figure: plot of the data for the Bergström, Bradley, OCHEM, Enamine and Patents sets]
Tetko et al. J. Cheminformatics 2016, 8, 2.
COMBINED: OCHEM + Enamine + Bradley + Bergström
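The set sizes listed above can be added up to see where the "~50k" literature set and the "275k" total come from. This is a minimal sketch; it assumes the sets do not overlap, which the slide does not state, and the variable names are illustrative only.

```python
# Melting point dataset sizes as listed on the slide.
dataset_sizes = {
    "Bergström": 277,
    "Bradley": 2886,
    "OCHEM": 22404,
    "Enamine": 21883,
    "PATENTS": 228079,
}

# COMBINED = OCHEM + Enamine + Bradley + Bergström (the literature-derived sets).
combined_names = ["OCHEM", "Enamine", "Bradley", "Bergström"]
combined_total = sum(dataset_sizes[name] for name in combined_names)

# All five sets together (assumes no duplicates between sets).
grand_total = sum(dataset_sizes.values())

print(f"COMBINED: {combined_total}")  # 47450
print(f"All sets: {grand_total}")     # 275529
```

The COMBINED set therefore corresponds to the "~50k" large dataset, and adding the patent data gives the "~250k" scale called big earlier in the deck.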
Extraction of MP information from patents
• [0835] To a solution of 2-amino-4,6-dimethoxybenzamide (0.195 g, 0.99 mmol) and 5-(2-(tert-butyldimethylsilyloxy)ethoxy)-6-phenylpicolinaldehyde (0.355 g, 0.99 mmol) in N,N-dimethyl acetamide (10 ml), was added NaHSO3 (0.264 g, 1.49 mmol) and p-toluenesulfonic acid monohydrate (0.038 g, 0.198 mmol). The reaction mixture was heated at 120° C. for 16 h. After that time the reaction was cooled to rt and the solvent was removed under reduced pressure. The reaction mixture was then diluted with water (150 mL) and neutralized with NaHCO3. The precipitated solids were collected by filtration, washed with water and dried to give 2-(5-(2-(tert-butyldimethylsilyloxy)ethoxy)-6-phenylpyridin-2-yl)-5,7-dimethoxyquinazolin-4(3H)-one (0.500 g, 94%) as an off-white solid: 1H NMR (400 MHz, DMSO-d6) δ 11.08 (s, 1H), 8.35 (d, J=8.98 Hz, 1H), 8.21 (d, J=2.34 Hz, 2H), 7.82 (d, J=8.59 Hz, 1H), 7.44-7.52 (m, 3H), 6.81 (d, J=2.34 Hz, 1H), 6.58 (d, J=2.34 Hz, 1H), 4.24-4.32 (m, 2H), 3.94-4.00 (m, 2H), 3.92 (s, 3H), 3.86 (s, 3H), 0.85 (s, 9H), 0.08 (s, 6H); ESI MS m/z 534 [M+H]+.
• http://www.google.com/patents/US20140140956
Extraction of melting points from patents
Workflow
NextMove Ltd, UK
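The workflow itself survives only as a figure. As a rough illustration of the text-mining step, the sketch below pulls melting point values out of free text with a regular expression. It is not the NextMove workflow; the pattern, the function name and the accepted notations ("mp", "m.p.", "melting point", values in °C) are all assumptions made for this example.

```python
import re

# Hypothetical pattern: "mp", "m.p." or "melting point", followed by a single
# value or a range in degrees Celsius, e.g. "mp 120-122° C." or "m.p.: 85 °C".
MP_PATTERN = re.compile(
    r"(?:\bmelting\s+point|\bm\.?\s?p\.?)\s*[:=]?\s*"
    r"(?P<low>-?\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|–|to)\s*(?P<high>-?\d+(?:\.\d+)?))?"
    r"\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text):
    """Return (low, high) tuples in °C; high equals low for single values."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else low
        results.append((low, high))
    return results


sample = "The product was isolated as a white solid, mp 120-122° C."
print(extract_melting_points(sample))
```

A production pipeline would also have to link each value to the right compound name and handle decomposition notes, other units and OCR errors, none of which this sketch attempts.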
Extraction of MP information from patents
Modeling of MP data
| Package name | Type of descriptors | Number of descriptors | Matrix size, billions | Non-zero values, millions | Sparseness |
|---|---|---|---|---|---|
| Functional Groups | integer | 595 | 0.18 | 3.1 | 33 |
| QNPR | integer | 1502 | 0.45 | 6.3 | 49 |
| MolPrint | binary | 688634 | 205 | 8.1 | 7200 |
| Estate count | float | 631 | 0.19 | 10 | 14 |
| Inductive | float | 54 | 0.02 | 11 | 1 |
| ECFP4 | binary | 1024 | 0.31 | 12 | 25 |
| Isida | integer | 5886 | 1.75 | 18 | 37 |
| ChemAxon | float | 498 | 0.15 | 23 | 1.5 |
| GSFrag | integer | 1138 | 0.34 | 24 | 5.7 |
| CDK | float | 239 | 0.07 | 27 | 2 |
| Adriana | float | 200 | 0.06 | 32 | 1.3 |
| Mera, Mersy | float | 571 | 0.17 | 61 | 1.1 |
| Dragon | float | 1647 | 0.49 | 183 | 1.5 |
Large → Big
• Neural networks were too slow (ensemble training!) → SVM was used
• Support of parallel calculations (48 cores)
• Support of grid analysis (>1000 CPUs)
• Storage of full data matrix → sparse data matrix
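The last point, replacing a full data matrix with a sparse one, can be illustrated with a toy example. The dictionary-per-row layout below is only a sketch chosen for readability; it is not the storage format of the author's software, and the function names are hypothetical.

```python
def to_sparse_rows(dense_rows):
    """Keep only non-zero entries: one {column_index: value} dict per row."""
    return [
        {col: value for col, value in enumerate(row) if value != 0}
        for row in dense_rows
    ]


def count_cells(dense_rows):
    """Number of cells a full (dense) matrix has to store."""
    return sum(len(row) for row in dense_rows)


def count_nonzero(sparse_rows):
    """Number of values the sparse representation has to store."""
    return sum(len(row) for row in sparse_rows)


# Toy binary fingerprint matrix: 3 compounds x 8 descriptor bits.
dense = [
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
sparse = to_sparse_rows(dense)

print(sparse)                                     # [{2: 1, 7: 1}, {}, {0: 1}]
print(count_cells(dense), count_nonzero(sparse))  # 24 3
```

For a descriptor set such as MolPrint in the table above, where the full matrix has about 205 billion cells but only about 8.1 million non-zero values, storing just the non-zero entries is what makes the matrix fit in memory at all.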
Prediction errors for Bergström drug like compounds using models developed with different training sets
[Figure: bar chart of prediction error in °C (axis 0 to 60) versus training set size: 277 (Bergström), 4k (Karthikeyan), 50k (Literature), 275k (Patents)]
Prediction of Huuskonen set using ALOGPS logP and MP based on 50k measurements
$\log S = 0.5 - 0.01\,(\mathrm{MP} - 25) - \log K_{ow}$
Prediction of Huuskonen set using ALOGPS logP and MP based on 230k measurements
$\log S = 0.5 - 0.01\,(\mathrm{MP} - 25) - \log K_{ow}$
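The equation on these two slides (often referred to as the general solubility equation) combines a predicted melting point with a predicted logP. A minimal sketch of the calculation follows, assuming MP is given in °C and logKow is the octanol-water partition coefficient; the function name and the input values are made up for illustration.

```python
def general_solubility(mp_celsius, log_kow):
    """logS = 0.5 - 0.01 * (MP - 25) - logKow, exactly as written on the slide."""
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_kow


# Hypothetical compound: predicted MP of 125 °C and predicted logKow of 2.0.
log_s = general_solubility(125.0, 2.0)
print(f"logS = {log_s:.2f}")
```

Because MP enters with a coefficient of 0.01 per degree, an error of 30 °C in the predicted melting point shifts logS by 0.3 log units, which is why the melting point model matters for the solubility prediction.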
Big Data Quality and Complexity
Why is it very important? How could domain-specific analysis help?
Susceptibility of CPM-based HTS to screening compound-based interference. (A) Assay schematic for the CPM-based HTS used in this study. The assay measures the HAT activity of the Rtt109–Vps75 complex, which catalyzes the transfer of an acetyl moiety from acetyl-CoA to specific lysine residues on the Asf1–dH3–H4 substrate complex to produce acetylated histone residues and coenzyme A (CoA). Addition of the thiol-scavenging probe CPM leads to a highly fluorescent adduct by reacting with the CoA byproduct, which is used to quantify HAT activity via fluorescence intensity measurement. (B) Representative assay interference chemotypes identified during post-HTS triage.
Dahlin et al. J. Med. Chem. 2015, 58, 2091-2113.
99.8% frequent hitters (FHs)!
Promiscuous compounds filters
Promiscuous compounds filters
Pan Assay INterference compoundS (PAINS) Filters
• color quenching
• singlet oxygen quenching
• auto-fluorescence
• covalent binding
• inherently “sticking” compounds
• disrupt the interaction between the tag of the protein and binding site of the detection system
Baell and Holloway, J. Med. Chem. 2010, 53: 2719-40.
~500 filters based on N = 93212 compounds
Structural & Toxic Alerts at http://ochem.eu
• Screening of compounds against published toxicity alerts, groups, frequent hitters
• Filter alerts by endpoints or publications
• Create or upload custom SMARTS rules
Sushko et al. J. Chem. Inf. Model. 2012, 52(8): 2310-6.
>500 functional groups
>2.3k alerts in total
Identification of AlphaScreen-HIS Frequent Hitters
TrueHits™: Streptavidin Donor bead, Biotinylated Acceptor beads
Schorpp et al. J. Biomol. Screen. 2014, 19, 715-726.
Mode Of Action of AlphaScreen-HIS Frequent Hitters
Schorpp et al. J. Biomol. Screen. 2014, 19, 715-726.
Bio Assays Ontology relationships
Abeyruwan, U. et al. “Evolving BioAssay Ontology (BAO): Modularization, Integration and Applications,” Journal of Biomedical Semantics, vol. 5, no. 1: S5, 2014.
Annotation of large chemical spaces
Big Data, which have always been in chemistry.
Virtual chemical spaces
[Figure: chart of virtual chemical space sizes on a logarithmic axis from 1,000 to 1,000,000,000,000]
Synthesizable ~10^24 and total space is ~10^60
Virtual chemical spaces
[Figure: chart of virtual chemical space sizes on a logarithmic axis from 1.00E+03 to 1.00E+18]
GDB-N: all possible chemicals with ≤ N atoms
Virtual chemical spaces
[Figure: chart of virtual chemical space sizes on a logarithmic axis from 1.00E+03 to 1.00E+23]
Synthesizable ~10^24
Virtual chemical spaces
[Figure: chart of virtual chemical space sizes on a logarithmic axis from 1.00E+03 to 1.00E+60]
Synthesizable ~10^24 and total “drug-like” space is ~10^60
Annotation of compounds
• ALOGPS 2.1* (prediction of logP and water solubility of chemical compounds)
• ~100,000 molecules per minute
• Annotation of GDB-17 will take ~3 years of calculations using one core
• ~10 minutes on Leibniz Supercomputing Centre with 241,000 cores
*Tetko, I.V. J. Chem. Inf. Comput. Sci. 2001, 41, 1407-1421.
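The two timing estimates can be checked with a back-of-the-envelope calculation. The sketch below assumes that GDB-17 contains about 166.4 billion molecules (a figure not given on the slide) and that the work scales perfectly linearly across cores; the throughput and core count are taken from the slide.

```python
# Assumption (not on the slide): GDB-17 contains about 166.4 billion molecules.
GDB17_MOLECULES = 166.4e9
MOLECULES_PER_MINUTE_PER_CORE = 100_000  # ALOGPS throughput from the slide
CLUSTER_CORES = 241_000                  # core count from the slide

single_core_minutes = GDB17_MOLECULES / MOLECULES_PER_MINUTE_PER_CORE
single_core_years = single_core_minutes / (60 * 24 * 365)

# Assumes ideal linear scaling, with no I/O or scheduling overhead.
cluster_minutes = single_core_minutes / CLUSTER_CORES

print(f"One core: {single_core_years:.1f} years")
print(f"{CLUSTER_CORES} cores: {cluster_minutes:.1f} minutes")
```

Under these assumptions the single-core estimate comes out at a little over 3 years and the ideal cluster time at roughly 7 minutes, the same order of magnitude as the ~10 minutes quoted once real-world overhead is allowed for.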
We can’t predict unpredictable!
[Figure: chemical structures contrasting a new measurement with a new series to predict]
New machine learning approaches
Which methods can help us with Big Data?
Courtesy of Prof. J. Bajorath
Multi-task learning
[Figure: bar chart of Mean Absolute Error (axis 0 to 0.6) for models predicting one versus several properties; human data for Fat, Brain, Liver, Kidney, Muscle]
Problem:
• prediction of tissue-air partition coefficients
• small datasets, 30-100 molecules (human & rat data)
Results: simultaneous prediction of several properties increased the accuracy of models
Varnek, A. et al. J. Chem. Inf. Model. 2009, 49, 133-44.
Renaissance of neural networks
Deep learning
– Massive neural networks with thousands of neurons and layers
– New learning methods (dropout technique)
Examples of the use of deep learning technology:
– Recognition of Chinese characters with human accuracy
– Victory in a Go tournament
– Diagnostics of breast cancer
Baskin, I.I.; Winkler, D.; Tetko, I.V. A renaissance of neural networks in drug discovery. Expert Opinion on Drug Discovery 2016, 11(8): 785-95.
http://adsabs.harvard.edu/abs/2015arXiv150202072R
259 datasets
• 128 PubChem
• 17 MUV
• 102 DUD-E
• 12 Tox21
Total ~40M data points for 1.6M compounds
Descriptors: ECFP4, RDKit
Multitask Networks Learning Results
• Massively multitask networks obtain predictive accuracies significantly better than single-task methods.
• The predictive power of multitask networks improves as additional tasks and data are added.
• The total amount of data and the total number of tasks both contribute significantly to multitask improvement.
• Multitask networks afford limited transferability to tasks not in the training set.
http://adsabs.harvard.edu/abs/2015arXiv150202072R
Multitask networks benefit from increasing tasks and data independently.
http://adsabs.harvard.edu/abs/2015arXiv150202072R
Secure Information Sharing
How can we share information but not data? How can we enable cooperation between industries?
Secure Sharing of information
• A CINF/COMP workshop was organized during ACS in 2005 by Prof. Oprea
• Various structure representations (descriptors) were proposed
• Several methods for secure sharing were introduced
• But in the theoretical limit*
– SMILES representation of molecules: CCC, CNCCC, c1ccccc1
– Zipping of structures requires < 1 bit per atom
– A structure with 32 atoms requires < 32 bits
– Any descriptor, or combination of descriptors, with > 32 bits could be used to decode a molecule (in theory)
*Tetko, I.V.; Abagyan, R.; Oprea, T.I. J. Comput. Aided Mol. Des. 2005, 19, 749-764.
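The counting argument behind the theoretical limit can be written down in a few lines. This is only an information-capacity sketch, assuming the slide's bound of 1 bit per atom; the function name is hypothetical, and the check says nothing about whether a practical decoder exists.

```python
def descriptor_could_identify(descriptor_bits, n_atoms, bits_per_atom=1.0):
    """True if a descriptor carries more bits than the compressed structure
    needs, so that decoding the molecule is possible in principle."""
    structure_bits = n_atoms * bits_per_atom  # upper bound from the slide
    return descriptor_bits > structure_bits


# A 32-atom structure compressed to under 32 bits is one of at most 2**32
# distinguishable strings.
print(2 ** 32)                              # 4294967296
print(descriptor_could_identify(1024, 32))  # True
print(descriptor_could_identify(16, 32))    # False
```

A 1024-bit fingerprint therefore has far more capacity than is needed to single out a 32-atom molecule, which is why sharing descriptors alone does not guarantee that structures stay confidential.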
Currently used technologies
“Honest broker”
– Receives descriptors (or structures)
– Develops models and does not reveal the underlying data
Sharing relationships between structures
– Matched Molecular Pairs (changes in property due to change of groups)
Sharing developed models
– Structural alerts
– Computational prediction models
Sharing reliable predictions (surrogate data)*
*Tetko, I.V.; Abagyan, R.; Oprea, T.I. J. Comput. Aided Mol. Des. 2005, 19, 749-764.
Multi-party secure computation
Secure summation
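The slide presents secure summation only as a figure, so the following is a sketch of the textbook ring-based scheme and not necessarily the variant shown. The initiating party hides its value behind a random mask, every other party adds its own value to the running total it receives, and the initiator removes the mask at the end. All names here are hypothetical, and the whole ring is simulated inside one function.

```python
import secrets


def secure_sum(private_values, modulus=2**64):
    """Simulate ring-based secure summation over non-negative integers."""
    if not private_values:
        raise ValueError("at least one party is required")
    # Sanity check for the simulation only; no real party could compute this.
    if any(v < 0 for v in private_values) or sum(private_values) >= modulus:
        raise ValueError("values must be non-negative and sum below the modulus")

    mask = secrets.randbelow(modulus)  # known only to the initiator
    running = (mask + private_values[0]) % modulus

    # Each further party sees only the masked running total and adds its value.
    for value in private_values[1:]:
        running = (running + value) % modulus

    # The initiator removes the mask to recover the true total.
    return (running - mask) % modulus


# Three companies contribute counts without revealing them to each other.
print(secure_sum([120, 45, 300]))  # 465
```

In this basic form the protocol protects individual values only if parties do not collude: the two neighbours of a party in the ring can subtract the totals they saw and recover that party's input.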
Conclusions
Expectations
✓ Improved prediction of properties and activities
✓ Improved poly-pharmacology
✓ Search for new chemistry (top-down exploration and de novo design)
✓ Prediction of in vivo endpoints

Challenges
✓ New machine learning approaches (deep learning)
✓ Integration of diverse data and a priori knowledge (ontology, pathways, in vitro, in vivo, simulation results, different errors, etc.)
✓ Applicability domain
✓ Secure data sharing
✓ Data visualization
✓ De novo design
Further reading
• Tetko, I.V.; Engkvist, O.; Koch, U.; Reymond, J.L.; Chen, H. BIGCHEM: Challenges and Opportunities for Big Data Analysis in Chemistry. Mol Inform 2016, 35(11-12): 615-621 (Open Access).
• Tetko, I.V.; Engkvist, O.; Chen, H. Does 'Big Data' exist in medicinal chemistry, and if so, how can it be harnessed? Future Med Chem. 2016, 8(15): 1801-1806 (Open Access).
Acknowledgements
Dr. Y. Sushko
Dr. S. Novotarskyi
Mr. R. Körner
Mrs. E. Salmina
Dr. K. Hadian (TOX, HMGU)
Dr. A. Williams (USA)
Dr. D. Lowe (UK)
Dr. O. Engkvist (AZ)
Dr. H. Chen (AZ)
Dr. U. Koch (LDC)
Prof. J.-L. Reymond (Uni Bern)
Dr. I. Baskin (MSU)
Dr. D. Winkler (CSIRO)
And BIGCHEM partners