doi.org/10.26434/chemrxiv.9524219.v1
Discovery of Highly Polymorphic Organic Materials: A New Machine Learning Approach
Zied Hosni, Annalisa Riccardi, Stephanie Yerdelen, Alan R. G. Martin, Deborah Bowering, Alastair Florence
Submitted: 12/08/2019 • Posted: 13/08/2019 • Licence: CC BY-NC-ND 4.0
Citation: Hosni, Zied; Riccardi, Annalisa; Yerdelen, Stephanie; Martin, Alan R. G.; Bowering, Deborah; Florence, Alastair (2019): Discovery of Highly Polymorphic Organic Materials: A New Machine Learning Approach. ChemRxiv. Preprint.
Discovery of highly polymorphic organic materials: a new machine learning approach Zied Hosni1,3*, Annalisa Riccardi2, Stephanie Yerdelen1, Alan R. G. Martin1, Deborah
Bowering1, and Alastair J. Florence1*
Polymorphism is the capacity of a molecule to adopt different conformations or molecular packing arrangements in the
solid state. This is a key property to control during pharmaceutical manufacturing because it can impact a range of
properties including stability and solubility. In this study, a novel approach based on machine learning classification
methods is used to predict the likelihood for an organic compound to crystallise in multiple forms. A training dataset of
drug-like molecules was curated from the Cambridge Structural Database (CSD) and filtered according to entries in the
Drug Bank database. The number of separate forms in the CSD for each molecule was recorded. A metaclassifier was
trained using this dataset to predict the expected number of crystalline forms from the compound descriptors. This
approach was used to estimate the number of crystallographic forms for an external validation dataset. These results
suggest this novel methodology can be used to predict the extent of polymorphism of new drugs or not-yet experimentally
screened molecules. This promising method complements expensive ab initio methods for crystal structure prediction and,
as an integral part of experimental physical form screening, may identify systems with unexplored potential.
Machine Learning (ML) methods are ubiquitous in many areas
of modern science and have become a crucial tool where large
amounts of data from different sources are available. There is
a diverse range of ML algorithms available that have been applied to the
modelling and prediction of complex systems and problems. Various
factors have an impact on the suitability of ML approaches for different
applications. Among those are the size and distribution of the training
data in the features space, the correlation of the descriptors, the nature
of the problem and its degree of non-linearity. The non-linearity of the
problem considered in this study is one of the main drivers for the choice
of the ML approach used. Support Vector Machine (SVM) and Random
Forest (RF) are ML methods that have already been successfully used
for classification and prediction of non-linear chemical processes (i.e.
the features and the response are not correlated with a linear
relationship) and they are suitable for large dimensional problems (i.e.
many factors affect the response of the phenomenon)1,2. The k-Nearest Neighbours (k-NN) algorithm combines simplicity and
intuitiveness. In the Mitchell group, this method was applied to predict
the melting point for 4119 structurally diverse organic molecules and
277 drug-like molecules. The performance of this algorithm was
compared with that of neural networks, revealing the strengths and
weaknesses of each predictive model. Cross-validation and
y-randomisation both proved to be good strategies for prediction
validation3. Tropsha et al. highlighted the importance of validation
techniques in Quantitative Structure-Property Relationship (QSPR)
models before applying them on real world problems. They enumerated
several examples of predictive failure when the validation step was not
considered carefully4.
1 EPSRC Future Continuous Manufacturing and Advanced Crystallisation Hub, University of Strathclyde, Technology and Innovation Centre, 99 George Street, Glasgow, U.K. G1 1RD. 2 Department of Mechanical and Aerospace Engineering, University of Strathclyde, Glasgow, UK. G1 1XJ. 3 Centre of Computational Chemistry, School of Chemistry, Cantock's Close, Bristol, U.K. BS8 1TS. email: zied.hosni@bristol.ac.uk; alastair.florence@strath.ac.uk
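The y-randomisation check mentioned above can be sketched with scikit-learn; this is a hypothetical illustration on synthetic data, not the pipeline of ref. 3. A model that has learned real structure should collapse toward chance accuracy once the labels are shuffled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Toy descriptor matrix standing in for molecular descriptors.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validated accuracy on the true labels.
cv_acc = cross_val_score(knn, X, y, cv=5).mean()

# y-randomisation: shuffle the labels and re-evaluate. A model with real
# signal should collapse toward chance accuracy (~1/3 for three classes).
y_shuffled = rng.permutation(y)
rand_acc = cross_val_score(knn, X, y_shuffled, cv=5).mean()

print(f"CV accuracy (true labels):     {cv_acc:.2f}")
print(f"CV accuracy (shuffled labels): {rand_acc:.2f}")
```

A large gap between the two numbers is evidence that the model has captured a genuine structure-property relationship rather than chance correlation.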
QSPR modelling is by definition based on the assumption that
changes in molecular structure are reflected in variation in the observed
macroscopic properties of materials2. This approach does not require
access to expensive, high-performance computing power and has been
shown to deliver scalability, efficiency, robustness and predictability5.
QSPR has been applied to predict a wide range of material properties
such as physicochemical and biological properties of nanomaterials6,
catalytic activity in homogeneous and heterogeneous catalysts7,8,
protein adsorption, cell attachment, cellular proliferation on biomaterial
surface9, glass transition temperature for polymers10, melting points for
ionic liquids and others11.
Crystal structure prediction is a challenging area and one of the
promising applications of QSPR and ML. Phillips et al. identified new
types of crystalline structures from large data sets of coordinates. They
deployed a hierarchy of pattern analysis techniques and applied ML
with shape matching algorithms to extract and classify crystals into
categories12. Clustering and the identification of intrinsic structural
features in particle tracking data were also investigated using the
Neighborhood Graph Analysis (NGA) method13. In inorganic
chemistry, the Cluster Resolution Feature Selection (CR-FS) and
support vector machine (SVM) classification were applied to predict the
crystal structures of ternary equiatomic compositions based only on the
constituent elements14. Moreover, Principal Component Analysis
(PCA) was exploited to render structure maps of spinel nitrides
(AB2N4)15.
Figure 1 | The 3-dimensional packing and crystallographic information (the distances, the angles and the volume of the unit cell) of two biologically active molecules (Chlorpropamide and Thiazolidinone), showing the five characterised polymorphic forms of the first and the single reported form of the second.
The Random Forest algorithm was able to predict full-Heusler
structures and discriminate between Heusler and inverse Heusler
structures16.
Raccuglia et al. exploited a database of failed experiments to
highlight the factors that control the reaction outcome of organically
templated metal oxides. They applied different algorithms and found that
SVM was the most robust at mining the chemical information
rendered from historical reactions17. The identification of crystal
structure using optimisation of relevant thermodynamic potential in the
space of atomic coordinates is gaining significant interest18. Various
algorithms such as genetic algorithms or simulated annealing were
exploited to determine the global energy configurations of crystal
structures19. A ML model was trained on a dataset of ab initio
calculation results for 7000 organic molecules. Various molecular
descriptors such as nuclear charges and cartesian coordinates were
exploited as features for a deep multi-task artificial neural network
capable of predicting atomisation energy, ionisation potential and
electron affinity simultaneously20.
Polymorphism is defined as “the existence of a solid crystalline
phase of a given compound resulting from the possibility of at least two
different arrangements of the molecules of that compound in the solid
state”21,22. It is difficult to predict ab initio whether a specific molecule
will adopt more than one crystal structure, how many polymorphs are
likely to be observed, or the specific crystal packing arrangements and
associated physical properties each polymorph will display23,24. Organic
crystals are of paramount importance in different industrial sectors
including agrochemicals, food, paint, energetic materials, and
pharmaceuticals. The polymorphism of these entities dictates their
flowability, stability, colour, solubility and mechanical strength25. In
addition to the challenges related to the production control of a specific
solid form, polymorphism is still posing intellectual property conflicts
and prolonged legal battles26.
Considerable progress has been made in the field of crystal energy
landscaping (i.e. calculating the thermodynamically feasible crystal
structures within an energy landscape of possible polymorphs)27. This
procedure can guide expensive, and time consuming experimental
screening approaches for solid forms if thermodynamic and kinetic
factors are both taken into consideration28,29. Although numerous ab
initio predictive methodologies have been developed to deal with
increasingly complex challenges (flexible conformers, multicomponent
crystals), it is not yet possible to rely on such approaches without
carrying out experimental investigations. Figure 1 shows the contrast
between two biologically active molecules that present completely
different behaviours in terms of experimental polymorphism. Indeed,
while Chlorpropamide is reported to form at least 5 different crystalline
forms30, the cytotoxic Thiazolidinone is only reported to have a single
polymorph in the Cambridge Structural Database (CSD)31.
The CSD does not contain a complete record of the full extent of
polymorphism for all chemical entries;32 rather, it includes all the entries
that have been reported in the literature or reported directly by
researchers to the Cambridge Crystallographic Data Centre (CCDC).
Thus, it is possible that some molecules, with a single crystal structure
entry, are highly polymorphic once studied in an extensive experimental
polymorphic screening.
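As a crude illustration of how the number of reported forms per molecule can be tallied from CSD entries, refcodes can be grouped by their six-letter family, since entries for the same compound share a family and numeric suffixes denote further determinations. The refcode list below is a hypothetical example, and note that suffixed entries may be redeterminations rather than genuine polymorphs.

```python
from collections import Counter

# Hypothetical refcode list; entries sharing a six-letter family refer to
# the same compound, with numeric suffixes for additional determinations.
refcodes = ["BEDMIG", "BEDMIG01", "BEDMIG02", "BEDMIG03", "BEDMIG04",
            "CURHOM"]

# Count entries per six-letter refcode family.
forms_per_family = Counter(code[:6] for code in refcodes)
print(forms_per_family)  # Counter({'BEDMIG': 5, 'CURHOM': 1})
```

A real analysis would additionally filter out redeterminations of the same form (e.g. variable-temperature studies) before taking the count as the number of polymorphs.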
The Random Forest method, in particular, has previously been
successfully applied to assessing the completeness of experimental
screens for solvate formation33, the prediction of packing types from
solvent properties34 and the crystallisability of organic molecules35,36.
However, it has not been assessed as a predictive tool for the extent of
polymorphism expected from a target molecule. In this work, we exploit
curated data from the intersection of the CSD and Drugbank37 and
implement a metaclassifier that enables the discovery of the true extent
of polymorphism in organic molecules and their potential to crystallise
in new solid-state forms.
This metaclassifier is the combination of various machine learning
algorithms. Four types of datasets were exploited to build predictive
models. The most robust model was selected to identify an organic
molecule likely to exist in several solid forms. The experimental
validation was conducted by crystallisation screening of this
compound in 60 different solvents.
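The metaclassifier described above can be sketched as follows. This is a minimal illustration on synthetic data that assumes the fusion weights each base model's class probabilities by its hold-out accuracy; the authors' exact weighting scheme may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic six-class problem standing in for the polymorph-count classes.
X, y = make_classification(n_samples=600, n_features=30, n_informative=12,
                           n_classes=6, n_clusters_per_class=1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Base classifiers; SVC needs probability=True to expose predict_proba.
models = [KNeighborsClassifier(5), RandomForestClassifier(random_state=1),
          SVC(probability=True, random_state=1)]

weights, probas = [], []
for m in models:
    m.fit(X_tr, y_tr)
    weights.append(m.score(X_te, y_te))   # accuracy used as fusion weight
    probas.append(m.predict_proba(X_te))  # per-class probability estimates

# Prediction fusion: accuracy-weighted average of class probabilities.
fused = np.average(probas, axis=0, weights=weights)
fused_acc = (fused.argmax(axis=1) == y_te).mean()
print(f"fused accuracy: {fused_acc:.2f}, base accuracies: {np.round(weights, 2)}")
```

Averaging probabilities rather than hard votes lets a confident correct model outvote several hesitant wrong ones, which is one plausible source of the synergistic effect reported below.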
Case study 1: Polymorphism prediction with 2D descriptors and with dimensionality reduction
The nine statistical models (i.e. the eight machine learning models and
the Prediction Fusion model) generated for the dataset of 2D structures
with dimensionality reduction are summarised in Figure 2.A. The
comparison between the different models showed that all the algorithms
reached acceptable prediction accuracy (>60% for a six-class
classification problem, where random guessing yields 100/6 ≈ 16.7%,
i.e. the probability of predicting the correct number of polymorphs by
chance), except for the Naïve Bayes Multinomial and Multilayer
Perceptron algorithms. k-Nearest
Neighbours and Random Forest were the best methods with an accuracy
of 86% and 85%, respectively. The Prediction Fusion improved the
predictive capacity further, demonstrating a synergistic effect of
combining the probabilities generated by the different algorithms in
a single model. The Prediction Fusion method rendered an accuracy
of 91% and a Cohen's kappa of 90%.
The confusion matrix depicted in Table 1 explains the high
accuracy of the Prediction Fusion model: the diagonal, which records
the correct predictions, is strongly populated.
Table 1. Confusion matrix from the fused predictive model of case study 1 (rows: predicted number of polymorphs; columns: experimental number of polymorphs)
        1    2    3    4    5    6
1      55   17    7    6    6    3
2       3   88    0    3    0    0
3       0    0   94    0    0    0
4       1    1    1   91    0    0
5       0    0    0    0   94    0
6       0    0    0    0    0   94
Table 2. Confusion matrix from the fused predictive model of case study 2 (rows: predicted number of polymorphs; columns: experimental number of polymorphs)
        1    2    3    4    5    6
1      90    3    0    1    0    0
2       1   88    4    0    1    0
3       5   23   53    8    4    1
4       0    0    1   93    0    0
5       0    0    0    0   94    0
6       0    0    0    0    0   94
Table 3. Confusion matrix from the fused predictive model of case study 3 (rows: predicted number of polymorphs; columns: experimental number of polymorphs)
        1    2    3    4    5    6
1     334   18   20    3   14    3
2       8  266   52   31   16   19
3       5   13  343   11   11    9
4       2   11    9  359    3    8
5       0    9   12    5  359    7
6       3    6   20    5   10  348
Table 4. Confusion matrix from the fused predictive model of case study 4 (rows: predicted number of polymorphs; columns: experimental number of polymorphs)
        1    2    3    4    5    6
1     366    4    6   11    2    3
2      13  346    8   19    2    4
3      17   13  297   35   20   10
4      12    2   26  345    5    2
5       4    0    5    7  376    0
6       4    0   13   17    6  352
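The headline metrics quoted for the Prediction Fusion model can be recomputed directly from the confusion matrix in Table 1: accuracy is the diagonal fraction, and Cohen's kappa corrects it for the agreement expected by chance.

```python
import numpy as np

# Confusion matrix from Table 1 (rows: predicted class, cols: experimental).
cm = np.array([[55, 17,  7,  6,  6,  3],
               [ 3, 88,  0,  3,  0,  0],
               [ 0,  0, 94,  0,  0,  0],
               [ 1,  1,  1, 91,  0,  0],
               [ 0,  0,  0,  0, 94,  0],
               [ 0,  0,  0,  0,  0, 94]])

n = cm.sum()
accuracy = np.trace(cm) / n          # observed agreement p_o

# Cohen's kappa: agreement corrected for chance agreement p_e,
# computed from the row and column marginals.
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
kappa = (accuracy - p_e) / (1 - p_e)
print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")  # 0.91 and 0.90
```

This reproduces the reported 91% accuracy and Cohen's kappa of 90% for case study 1; the same calculation applies to Tables 2-4.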
Case study 2: Polymorphism prediction with 2D descriptors and without dimensionality reduction
The nine models generated from the dataset of 2D structures using their
corresponding molecular descriptors are plotted in Figure 2.B. They
showed very similar performance to case study 1, indicating that the
dimensionality reduction applied there did not markedly improve the
models. The best models from case 1, where principal components were
used instead of the original molecular descriptors, gave very similar
results, while the two weakest models (Naïve Bayes Multinomial and
Multilayer Perceptron) improved dramatically when the original
descriptors were used: their accuracies increased from 9% and 18% to
55% and 52%, respectively. This indicates that reducing the number of
dimensions did not help the already robust models and even degraded
the relatively weak ones. The comparison of the confusion matrices
between the two cases also shows improved prediction for class 1: in
case 2 only 4 single-form samples were misclassified, compared with
39 in case 1. The confusion matrix depicted in Table 2 explains the good
performance of the Prediction Fusion model, again showing a
well-populated diagonal.
Case study 3: Polymorphism prediction with crystallographic descriptors and with dimensionality reduction
Case 3 exploits the information from the dataset of 3D structures.
Instead of using the crystallographic descriptors directly, a Principal
Component Analysis was conducted to reduce the dimensionality of
the system to 9 components. Compared to the two previous cases, all
the models obtained in case 3, as illustrated in Figure 2.C, were less
robust in predicting the polymorphism of the molecules of interest.
k-NN, RF and PF nevertheless remained very robust, with accuracies of
84%, 83% and 89%, respectively. This robustness can be explained by
the low number of independent variables relative to the number of
samples in the dataset. The corresponding confusion matrix for the
Prediction Fusion model, shown in Table 3, reflects the high accuracy
achieved in estimating polymorphism from 3D structures with
dimensionality reduction.
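The dimensionality reduction step used in this case study can be sketched as follows; this is a generic scikit-learn illustration on random data, and the descriptor names and the standardisation choice are assumptions, not the authors' exact pre-processing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy matrix standing in for crystallographic descriptors
# (cell lengths, angles, volume, etc.) of 500 structures.
X = rng.normal(size=(500, 20))

# Standardise first so each descriptor contributes on the same scale,
# then project onto 9 principal components as in case study 3.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=9)
X_red = pca.fit_transform(X_std)

print(X_red.shape)  # (500, 9)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.2f}")
```

`X_red` then replaces the original descriptor matrix as the model input; the retained-variance figure indicates how much information the truncation discards.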
Case study 4: Polymorphism prediction with
crystallographic descriptors and without dimensionality
reduction
The last case uses the 3D structures from the intersection of the Drug
Bank and the CSD databases and their corresponding crystallographic
descriptors such as the unit cell parameters. The performance of the
different models is plotted in Figure 2.D. The utilisation of the
original crystallographic descriptors slightly improved the performance
of all the models, without exception. Naïve Bayes Multinomial,
Simple Logistic and the Multilayer Perceptron were the weakest
predictive models. As before, k-NN, RF and PF were the best at
estimating polymorphism, while Support Vector Machine, the Ordinal
Classic Classifier and Gradient Boosted Trees achieved acceptable
accuracies between 60% and 75%. The Prediction Fusion
model exploited the probabilities from all the generated models with
respect to their individual accuracies. Table 4 explains the robustness of
this model through the sample-rich diagonal.
Figure 2 | Performance of the 8 independent machine learning algorithms and the Prediction Fusion metaclassifier in: A- Case study 1, B- Case study 2, C- Case study 3, D- Case study 4.
Descriptor importance
When Principal Component Analysis was not applied to the
datasets of 2D and 3D structures (i.e. cases 2 and 4), it was possible to
assess the importance of the independent variables and to interpret their
contribution to the accuracy of the designed model. In the case
of the 2D structures dataset, molecular descriptors were generated with
the MOE and RDKit software packages.
After the pre-processing and filtration step, 169 molecular
descriptors were employed to build the different models. Backward
selection was used in a loop with the k-NN algorithm as the assessor of
accuracy, because it had already demonstrated good prediction
performance and, unlike the Prediction Fusion, it does not require other
predictive models. At each iteration of the loop one descriptor was
deleted and the accuracy was measured. The most
important variables were those whose removal significantly degraded the
accuracy of the model. From the best 2D structure model, the most
influential variables were: Q-VSA-NEG (total negative van der Waals
surface area) and Q-VSA-POL (total polar van der Waals surface area),
both partial-charge descriptors; the molecular quantum numbers MQN3
(number of chlorine atoms) and MQN26 (number of acyclic
single-valent nodes); and a_ICM (the entropy of the element
distribution in the molecule). A detailed explanation of all these
descriptors is included in the manuals of the MOE and RDKit software
packages38,39. In the case of the crystallographic descriptors, the most
influential descriptors were the "a" and "b" parameters of the reduced
cell. This was expected because these two parameters define most of the
crystal geometry.
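The leave-one-descriptor-out loop described above can be sketched as follows, with synthetic data standing in for the 169 descriptors; this is an illustrative reconstruction, not the authors' exact code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the descriptor matrix (12 columns instead of 169).
X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)
knn = KNeighborsClassifier(5)
baseline = cross_val_score(knn, X, y, cv=5).mean()

# Drop each descriptor in turn; the larger the accuracy loss relative to
# the baseline, the more important that descriptor is to the model.
importance = []
for j in range(X.shape[1]):
    X_drop = np.delete(X, j, axis=1)
    acc = cross_val_score(knn, X_drop, y, cv=5).mean()
    importance.append(baseline - acc)

ranking = np.argsort(importance)[::-1]  # most important descriptors first
print("descriptor ranking:", ranking[:5])
```

Ranking descriptors by the accuracy drop on removal is what surfaces variables such as Q-VSA-NEG or the reduced-cell parameters as the most influential.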
In silico discovery of hidden polymorphism
The ultimate goal of this work was to discover polymorphism in
neglected or new molecules. This can be done from 2D or 3D
structures by applying the suitable model (i.e. one of the 4 cases
explained above). Knowing the real potential of a molecule to give a
number of polymorphs has many benefits and can be exploited in
several stages of solid-state research. For instance, this information can
be useful for an initial screening of large databases like ZINC40 and
ChEMBL41. It is also practical for already investigated polymorphs, to
see whether other solid forms are missing and whether further
experimental screening could produce them.
We exploited our predictive models to estimate the polymorphism in
a subset of 100 different 2D structures from the CSD that had not been
involved in any stage of designing the predictive models. This subset
represents the occurrence of polymorphism for molecules that, like the
majority of molecules in the CSD, were not included in the Drug Bank.
The distribution of the polymorphs in this subset according to their
number of reported forms is summarised in Figure 3. k-NN, Random
Forest and Prediction Fusion were selected as the most effective
predictive models for estimating polymorphism. The current
experimental observations are dominated by structures possessing two
different polymorphs: 54% of the subset structures are classified as
possessing just two solid forms, and all the models likewise estimated
that over 40% of the molecules have 2 polymorphs. With the exception
of k-NN, the models estimated that the proportion of structures with a
single solid form is currently overestimated; indeed, RF estimated that
only 4% of the structures have a unique crystalline form. Only 3% of
the subset is currently reported to have 4 different polymorphs, yet,
interestingly, all the statistical models estimated a higher occurrence of
4-form structures than the current experimental estimate; for example,
k-NN and PF predicted a 4-form occurrence of 7% and 13%,
respectively.
Figure 3 | Comparison of the distribution of the number of polymorphs of molecules in the CSD with and without applying the predictive models of polymorphism. The bar chart presents the occurrence of polymorphism in each of the 6 classes. The blue and black dashed curves show the approximate trend of this abundance before and after applying the predictive models, respectively.
The overall shape of the polymorphism occurrence distribution was
preserved across the different predictive models, as illustrated in
Figure 3 by the black and blue dashed curves for the predicted and the
current experimental abundance of polymorphism, respectively.
Interestingly, we observed that the predicted region of high polymorph
counts (i.e. 4, 5 or 6 forms per 2D structure) is wider than what has
been observed experimentally thus far: the current statistics of the
100-sample subset contain no structure with six polymorphs, and the
same holds for structures with five polymorphs. This leads us to
conclude that building predictive models based on carefully selected
molecules (i.e. structures that were heavily screened for polymorphism
in the pharmaceutical industry) enabled the discovery of a hidden area
of the chemical space.
Experimental screening
The X-ray powder diffraction patterns of the crystallised samples,
depicted in Figure 4.C, provided evidence of the presence of new solid
forms of pentoxifylline in addition to the already characterised form in
the CSD. The comparison of the Pearson correlation between the
patterns identified 4 clusters, as depicted in the dendrogram and the
clusters plot below. Comparison of the X-ray diffraction patterns of the
samples crystallised in tetrahydrofuran, diethylene glycol, acetic acid
and benzylamine shows the presence of additional Bragg reflections
which cannot be explained by the reference pattern. This is indicative
of the presence of new solid forms, but the exact nature of these new
forms is still not known. The DSC/TGA analysis, represented in Figures
4.D and 4.E, confirms these results through thermal events which
do not correspond to the crystallisation solvent or the thermal transition
of the reference form within the temperature range investigated. ATR-
IR spectra were collected as a fingerprint for the new forms and
compiled in Figure 4.F. Minor differences in the spectra can be
rationalised by the different orientations of pentoxifylline molecules in
space, which affect non-covalent interactions such as hydrogen bonds.
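The Pearson-correlation clustering of diffraction patterns can be sketched as follows, with random vectors standing in for measured XRPD patterns; the actual analysis used the measured intensity profiles and cut the dendrogram into 4 clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Toy stand-ins for XRPD patterns (intensity vs 2-theta): three noisy
# copies each of two distinct underlying "forms".
base_a, base_b = rng.random(600), rng.random(600)
patterns = np.array([base_a + 0.05 * rng.random(600) for _ in range(3)]
                    + [base_b + 0.05 * rng.random(600) for _ in range(3)])

# Distance between patterns: 1 - Pearson correlation coefficient.
dist = 1 - np.corrcoef(patterns)
# Condensed upper-triangle distances for hierarchical clustering.
condensed = dist[np.triu_indices_from(dist, k=1)]
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # one cluster label per underlying form
```

Cutting the same linkage at `t=4` with `criterion="maxclust"` would yield the four clusters reported for the pentoxifylline screen.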
Two different datasets were extracted from the Drug Bank database
and the CSD. They contain the 2D and 3D structures of organic
molecules and their corresponding number of polymorphs. Molecular
descriptors were employed as independent variables for the 2D
structures dataset and crystallographic descriptors for the 3D structures
dataset. PCA was exploited to reduce the dimensions of each of the two
datasets, giving 4 different datasets in total. 8 different machine
learning algorithms were applied to each dataset, and a metaclassifier
was built from the probabilities estimated by each algorithm. 9
statistical models were thus rendered for each dataset, with varying
capability to predict the real number of experimentally achievable
polymorphs. k-Nearest Neighbours and Random Forest were reliably
the most robust individual models, and a synergistic effect was also
obtained with the metaclassifier, called the Prediction Fusion, which
gave higher accuracy than either the RF or the k-NN models. It is also
noteworthy that applying dimensionality reduction to these systems did
not improve the results but slightly deteriorated them.
In addition, the most robust models were exploited to detect the most
influential descriptors on the polymorphism capability of each structure.
As expected, the reduced unit cell parameters were the most important
features in the 3D structures approach. A number of molecular
descriptors, such as the total negative and polar van der Waals
surface areas and the number of chlorine atoms in the molecule, were
among the most influential molecular descriptors in the models built
from 2D structures.
Figure 4 | Analysis of the crystallised pentoxifylline in the different solvents. A- Dendrogram of the most relevant samples screened experimentally and found in the
literature, showing the similarity between the solid forms. B- Clustering plot of the selected samples distinguishing between the new form and the mixtures of the existing forms. C- X-Ray powder diffraction patterns of the reference material extracted from the CSD and the selected forms from the experimental screening. D- DSC traces of the selected forms presenting the thermal events occurring during the heating of the samples. E- ATR-IR spectra of the selected new polymorphs
The comparison between the distribution of the abundance of the
number of polymorphs in the current CSD with what was predicted from
the best-designed models reveals a hidden area of chemical space that
was potentially underestimated and under-screened for polymorphism.
In other words, these models show the real potential of any known or
unknown structure to give a certain number of crystalline forms. In the
present work, the most robust model has successfully predicted the
number of solid forms that are missing. This was validated
experimentally by conducting a solvents screening that revealed the
hidden forms. This has paramount practical importance for
crystallographers and materials engineers because, to the best of our
knowledge, this is the first computational tool based on
data mining and machine learning that gives experimentalists an initial
guideline on the hidden potential of organic molecules to render extra
solid forms not yet discovered and isolated experimentally.
Acknowledgments
The authors would like to acknowledge that this work was
carried out in the CMAC National Facility, housed within the
University of Strathclyde’s Technology and Innovation Centre,
and funded with a UKRPIF (UK Research Partnership Institute
Fund) capital award, SFC ref. H13054, from the Higher
Education Funding Council for England (HEFCE).
Keywords: Machine learning, metaclassification, polymorphism,
materials discovery, crystal structure, Artificial Intelligence
References and Notes
1. Lowe, R., Glen, R. C. & Mitchell, J. B. O. Predicting phospholipidosis using machine learning. Mol. Pharm. (2010).
doi:10.1021/mp100103e
2. Hughes, L. D., Palmer, D. S., Nigsch, F. & Mitchell, J. B. O. Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR Models of Solubility, Melting Point, and Log P. J. Chem. Inf. Model. 48, 220–232 (2008).
3. Nigsch, F. et al. Melting Point Prediction Employing k -Nearest Neighbor Algorithms and Genetic Parameter Optimization. J. Chem. Inf. Model. 46, 2412–2422 (2006).
4. Tropsha, A., Gramatica, P. & Gombar, V. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb. Sci. 22, 69–77 (2003).
5. Le, T., Epa, V. C., Burden, F. R. & Winkler, D. A. Quantitative Structure–Property Relationship Modeling of Diverse Materials Properties. Chem. Rev. 112, 2889–2919 (2012).
6. Lubinski, L. et al. Evaluation criteria for the quality of published experimental data on nanomaterials and their usefulness for QSAR modelling. SAR QSAR Environ. Res. 24, 995–1008 (2013).
7. Yao, S., Shoji, T., Iwamoto, Y. & Kamei, E. Consideration of an activity of the metallocene catalyst by using molecular mechanics, molecular dynamics and QSAR. Comput. Theor. Polym. Sci. 9, 41–46 (1999).
8. Cruz, V. L. et al. 3D-QSAR study of ansa-metallocene catalytic behavior in ethylene polymerization. Polymer (Guildf). (2007). doi:10.1016/j.polymer.2007.05.081
9. Weber, N., Bolikal, D., Bourke, S. L. & Kohn, J. Small changes in the polymer structure influence the adsorption behavior of fibrinogen on polymer surfaces: Validation of a new rapid screening technique. J. Biomed. Mater. Res. Part A 68A, 496–503
10. K., T. Makromol. Chem. Macromol. Symp. vol. 69 1993. 4th European Polymer Federation Symposia on Polymeric Materials. Symposium Editors C. Bubeck, H. W Spiess. Acta Polym. 45, 57
11. Greaves, T. L. & Drummond, C. J. Protic Ionic Liquids: Properties and Applications. Chem. Rev. 108, 206–237 (2008).
12. Phillips, C. L. & Voth, G. A. Discovering crystals using shape matching and machine learning. Soft Matter 9, 8552 (2013).
13. Reinhart, W. F., Long, A. W., Howard, M. P., Ferguson, A. L. & Panagiotopoulos, A. Z. Machine learning for autonomous crystal structure identification. Soft Matter 13, 4733–4745 (2017).
14. Oliynyk, A. O. et al. Disentangling Structural Confusion through Machine Learning: Structure Prediction and Polymorphism of Equiatomic Ternary Phases ABC. J. Am. Chem. Soc. 139, 17870–17881 (2017).
15. Balachandran, P. V., Broderick, S. R. & Rajan, K. Identifying the 'inorganic gene' for high-temperature piezoelectric perovskites through statistical learning. Proc. R. Soc. London A Math. Phys. Eng. Sci. 467, 2271–2290 (2011).
16. Oliynyk, A. O. et al. High-Throughput Machine-Learning-Driven Synthesis of Full-Heusler Compounds. Chem. Mater. 28, 7324–7331 (2016).
17. Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
18. Schön, J. C. & Jansen, M. First Step Towards Planning of Syntheses in Solid-State Chemistry: Determination of Promising Structure Candidates by Global Optimization. Angew. Chem. Int. Ed. Engl. 35, 1286–1304
19. Abraham, N. L. & Probert, M. I. J. A periodic genetic algorithm with real-space representation for crystal structure and polymorph prediction. Phys. Rev. B 73, 224104 (2006).
20. Montavon, G. et al. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. (2013). doi:10.1088/1367-2630/15/9/095003
21. Haleblian, J. & McCrone, W. Pharmaceutical applications of polymorphism. Journal of Pharmaceutical Sciences (1969). doi:10.1002/jps.2600580802
22. Nangia, A. Conformational polymorphism in organic crystals. Acc. Chem. Res. (2008). doi:10.1021/ar700203k
23. Brittain, H. G. Polymorphism and solvatomorphism 2010. Journal of Pharmaceutical Sciences (2012). doi:10.1002/jps.22788
24. Hilfiker, R. Polymorphism: In the Pharmaceutical Industry. Polymorphism: In the Pharmaceutical Industry (2006). doi:10.1002/3527607889
25. Le, T., Epa, V. C., Burden, F. R. & Winkler, D. A. Quantitative Structure–Property Relationship Modeling of Diverse Materials Properties. Chem. Rev. 112, 2889–2919 (2012).
26. Cabri, W., Ghetti, P., Pozzi, G. & Alpegiani, M. Polymorphisms and patent, market, and legal battles: Cefdinir case study. Organic Process Research and Development (2007). doi:10.1021/op0601060
27. Braun, D. E., McMahon, J. A., Koztecki, L. H., Price, S. L. & Reutzel-Edens, S. M. Contrasting Polymorphism of Related Small Molecule Drugs Correlated and Guided by the Computed Crystal Energy Landscape. Cryst. Growth Des. 14, 2056–2072 (2014).
28. Hulme, A. T. et al. Search for a Predicted Hydrogen Bonding Motif − A Multidisciplinary Investigation into the Polymorphism of 3-Azabicyclo[3.3.1]nonane-2,4-dione. J. Am. Chem. Soc. 129, 3649–3657 (2007).
29. Vasileiadis, M., Pantelides, C. C. & Adjiman, C. S. Prediction of the crystal structures of axitinib, a polymorphic pharmaceutical molecule. Chem. Eng. Sci. 121, 60–76 (2015).
30. Drebushchak, V. A., Drebushchak, T. N., Chukanov, N. V. & Boldyreva, E. V. Transitions among five polymorphs of chlorpropamide near the melting point. J. Therm. Anal. Calorim. (2008). doi:10.1007/s10973-007-8822-0
31. Weng, J. Q., Shen, D. L., Tan, C. X. & Liu, H. J. 2-Oxo-N-phenylthiazolidine-3-carboxamide. Acta Crystallogr. Sect. E Struct. Reports Online (2004). doi:10.1107/S1600536804008621
32. Allen, F. H. & Motherwell, W. D. S. Applications of the Cambridge Structural Database in organic chemistry and crystal chemistry. Acta Crystallogr. Sect. B Struct. Sci. 58, 407–422 (2002).
33. Johnston, A., Johnston, B. F., Kennedy, A. R. & Florence, A. J. Targeted crystallisation of novel carbamazepine solvates based on a retrospective Random Forest classification. CrystEngComm (2008). doi:10.1039/b713373a
34. Bhardwaj, R. M., Reutzel-Edens, S. M., Johnston, B. F. & Florence, A. J. A random forest model for predicting crystal packing of olanzapine solvates. CrystEngComm 20, 3947–3950 (2018).
35. Bhardwaj, R. M., Johnston, A., Johnston, B. F. & Florence, A. J. A random forest model for predicting the crystallisability of organic molecules. CrystEngComm (2015). doi:10.1039/c4ce02403f
36. Wicker, J. G. P. & Cooper, R. I. Will it crystallise? Predicting crystallinity of molecular materials. CrystEngComm (2015). doi:10.1039/c4ce01912a
37. Wishart, D. S. et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. (2018). doi:10.1093/nar/gkx1037
38. Chemical Computing Group Inc. Molecular Operating Environment (MOE), 2016.08. 1010 Sherbrooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7 (2016).
39. Landrum, G. RDKit: Open-source Cheminformatics. Http://Www.Rdkit.Org/ (2006). doi:10.2307/3592822
40. Irwin, J. J. & Shoichet, B. K. ZINC - A free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. (2005). doi:10.1021/ci049714+
41. Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. (2017). doi:10.1093/nar/gkw1074
42. Allen, F. H. The Cambridge Structural Database: A quarter of a million crystal structures and rising. Acta Crystallogr. Sect. B Struct. Sci. (2002). doi:10.1107/S0108768102003890
43. López-Mejías, V., Kampf, J. W. & Matzger, A. J. Nonamorphism in Flufenamic Acid and a New Record for a Polymorphic Compound with Solved Structures. J. Am. Chem. Soc. 134, 9872–9875 (2012).
44. Peterson, M. L. et al. Iterative High-Throughput Polymorphism Studies on Acetaminophen and an Experimentally Derived Structure for Form III. J. Am. Chem. Soc. 124, 10958–10959 (2002).
45. P. Mazanetz, M., J. Marmon, R., B. T. Reisser, C. & Morao, I. Drug Discovery Applications for KNIME: An Open Source Data Mining Platform. Curr. Top. Med. Chem. 12, 1965–1979 (2012).
46. Jagla, B., Wiswedel, B. & Coppée, J.-Y. Extending KNIME for next-generation sequencing data analysis. Bioinformatics 27, 2907–2909 (2011).
47. Beisken, S. et al. KNIME-CDK: Workflow-driven cheminformatics. BMC Bioinformatics (2013). doi:10.1186/1471-2105-14-257
48. ChemicalComputingGroupInc. Molecular Operating Environment (MOE). Sci. Comput. Instrum. (2004). doi:10.1017/CBO9781107415324.004
49. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. (2002). doi:10.1613/jair.953
50. McCallum, A. & Nigam, K. A Comparison of Event Models for Naive Bayes Text Classification. AAAI/ICML-98 Work. Learn. Text Categ. (1998). doi:10.1.1.46.1529
51. Schapire, R. E., Stone, P., McAllester, D., Littman, M. L. & Csirik, J. A. Modeling auction price uncertainty using boosting-based conditional density estimation. Mach. Learn. Work. Then Conf. (2002).
52. Landwehr, N., Hall, M. & Frank, E. Logistic model trees. Mach. Learn. (2005). doi:10.1007/s10994-005-0466-3
53. Pal, S. K. & Mitra, S. Multilayer Perceptron, Fuzzy Sets, and Classification. IEEE Trans. Neural Networks (1992). doi:10.1109/72.159058
54. Ben-Hur, A. & Weston, J. A user’s guide to support vector machines. Methods Mol. Biol. (2010). doi:10.1007/978-1-60327-241-4_13
55. Aha, D. W., Kibler, D. & Albert, M. K. Instance-Based Learning Algorithms. Mach. Learn. (1991). doi:10.1023/A:1022689900470
56. Breiman, L. Random forests. Mach. Learn. (2001). doi:10.1023/A:1010933404324
57. Si, S. et al. Gradient Boosted Decision Trees for High Dimensional Sparse Output. Icml (2017).
58. Hall, M. et al. The WEKA data mining software. SIGKDD Explor. Newsl. (2009). doi:10.1145/1656274.1656278
59. Song, L., Smola, A., Gretton, A., Bedo, J. & Borgwardt, K. Feature selection via dependence maximization. J. Mach. Learn. Res. (2012). doi:10.1145/1273496.1273600
Methods
Database curation
The datasets used to build predictive models for the polymorphism of organic molecules were curated from two main databases: the Drug Bank37 (DB) and the Cambridge Structural Database42 (CSD). These two databases were chosen for the complementarity of the information they provide. In the CSD, which is approaching one million crystal structure entries, each molecule entry may carry a metadata field describing its polymorphism and whether it is available in the DB, with the corresponding reference. The DB currently has 9292 entries comprising 3189 small-molecule drugs, 926 approved biotech (protein/peptide) drugs, 108 nutraceuticals and over 5069 experimental drugs. To evaluate the role of ML in predicting polymorphism in drug-like molecules, a subset database was created. Filters were applied to the full CSD to identify entries that were organic molecules with recorded polymorphs (i.e. >1 distinct crystal structure with the same formula unit) and that also appeared in the DB. The intersection of the two databases, depicted in Figure 5, was chosen because commercially available drugs are more likely to have been screened experimentally for polymorphism, and therefore to have a complete record of the experimentally achievable crystallographic forms, than molecules investigated in academia or other industries. While it remains possible that not all forms have been reported in the public domain, or structurally characterised so that they could be included in the CSD, this database still provides a useful training set to assess the potential application of ML. From the initial CSD dataset of 15202 3D structures, only 883 structures remained after the filtration process. These three-dimensional structures can be further grouped into 178 two-dimensional structures (SMILES). The 3D structures also include redeterminations, variable-temperature studies, conformational polymorphs, polymorphic forms, co-crystals, salts, hydrates and solvates. Therefore, a manual filter was used to discard co-crystals and redeterminations of each molecule. Although they have chemically different compositions, salts, hydrates and solvates were kept in the datasets because they carry highly relevant information on the diversity of polymorphic forms, especially for pharmaceutical applications.
Figure 5 | The intersection between two different databases: CSD and Drug
Bank giving drug like molecules that have likely been screened for
polymorphism with entries in the CSD.
Confirmation of polymorphism was provided by checking the original
published source article for each selected REFCODE entry. Although some papers report molecules with up to nine different polymorphs, such as flufenamic acid43, these molecules were excluded from the datasets because they are an extreme minority that risks heavily unbalancing the data. At the other end of the polymorphism spectrum, only molecules that are commercially available as drugs were treated as "true negatives", since any marketed drug is expected to have been screened for polymorphism before it becomes available on the market. Several works claim the existence of 60 different solid forms of atorvastatin, yet surprisingly none is reported in the CSD because no single crystal was identified and isolated.44 The final dataset used in this study comprised six classes describing compounds with 1, 2, 3, 4, 5 or 6 discrete, crystallographically distinct polymorphs, including solvates and salts but not co-crystals.
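The labelling step above (counting distinct solid forms per molecule and capping the class at six) can be sketched in a few lines. The input tuples and the function name below are illustrative, not the authors' actual CSD workflow:

```python
from collections import Counter

def label_polymorph_classes(entries, max_class=6):
    """Collapse per-structure records into per-molecule class labels.

    `entries` is a list of (smiles, refcode) pairs, one per 3D structure
    (hypothetical toy input; the study curates these from the CSD).
    The class label is the number of distinct structures observed for a
    molecule, capped at `max_class` as in the final six-class dataset.
    """
    counts = Counter(smiles for smiles, _refcode in entries)
    return {smiles: min(n, max_class) for smiles, n in counts.items()}

# Toy example: three structures of one molecule, one of another.
entries = [("CCO", "AAA001"), ("CCO", "AAA002"), ("CCO", "AAA003"),
           ("c1ccccc1", "BBB001")]
print(label_polymorph_classes(entries))  # {'CCO': 3, 'c1ccccc1': 1}
```

Capping at `max_class` mirrors the exclusion of extreme cases such as flufenamic acid from the six-class scheme.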
3 www.knime.com
Data workflow
The pre-processing of the 833 3D structures dataset, "condensed" into the 178 2D structures dataset, was carried out through a series of data transformation and cleaning steps available in the Knime3 software and described in Figure 645–47. The first transformation was the generation of molecular descriptors for the 2D structures dataset. This step was not performed for the 3D structures dataset, because crystallographic descriptors were used instead of most of the molecular descriptors generated with the MOE48 and RDKit39 nodes available in Knime. In a second step, structures having descriptors with missing or highly correlated values (correlation threshold set to 0.9 as a lower limit) were eliminated from both datasets. Descriptors with low variance (variance upper bound set to 0.01) were also discarded. A normalisation between 0 and 1 was then applied to unify the scale of the different descriptors, and the order of samples was randomised via the "Shuffle" node. The Synthetic Minority Over-sampling Technique (SMOTE)49 was used to balance the classes of each dataset; this step was crucial to obtain balanced classes. Figure 7 illustrates in pie charts the occurrence of the different classes (i.e. number of polymorphs) before the application of the balancing process through the SMOTE algorithm. This technique oversamples only the minority classes, taking into account the 5 nearest neighbours. In the final pre-processing stage, Principal Component Analysis (PCA) was applied to reduce the dimensionality of the dataset, especially where hundreds of descriptors were included. Starting from the two datasets (the 2D structures list and the 3D structures list), two new datasets were therefore generated in which principal components replaced the original molecular or crystallographic descriptors. The number of principal components was chosen so that 95% and 90% of the variance was preserved for the 2D and 3D structures datasets, respectively. These percentages were chosen to keep the maximum of the variance in the data while generating a reasonable number of principal components for cases 2 and 4, described later. At the end of the pre-processing phase, four datasets were compiled and ready for training with machine learning algorithms (the two original datasets and the two obtained with PCA).
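The pre-processing chain (normalisation to [0, 1], low-variance filtering, SMOTE balancing with 5 neighbours, PCA retaining 95% of the variance) could be reproduced outside Knime roughly as follows. This is a minimal sketch on synthetic data with a hand-rolled two-class SMOTE, not the authors' workflow:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def smote_oversample(X, y, k=5):
    """Minimal SMOTE sketch: synthesise minority samples by interpolating
    between a minority point and one of its k nearest minority neighbours
    (the study uses the standard 5-neighbour SMOTE; this is simplified)."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_out, y_out = [X], [y]
    for cls, n in zip(classes, counts):
        if n == target:
            continue
        Xc = X[y == cls]
        for _ in range(target - n):
            i = rng.integers(len(Xc))
            d = np.linalg.norm(Xc - Xc[i], axis=1)
            nbrs = np.argsort(d)[1:k + 1]        # nearest minority neighbours
            j = rng.choice(nbrs)
            lam = rng.random()                   # interpolation factor in [0, 1)
            X_out.append((Xc[i] + lam * (Xc[j] - Xc[i]))[None, :])
            y_out.append(np.array([cls]))
    return np.vstack(X_out), np.concatenate(y_out)

# Toy descriptor matrix: 60 samples, 20 features, imbalanced 2-class labels.
X = rng.normal(size=(60, 20))
y = np.array([0] * 45 + [1] * 15)

X = MinMaxScaler().fit_transform(X)              # normalise to [0, 1]
X = VarianceThreshold(0.01).fit_transform(X)     # drop low-variance features
Xb, yb = smote_oversample(X, y)                  # balance the classes
Xp = PCA(n_components=0.95).fit_transform(Xb)    # keep 95% of the variance
print(Xb.shape, np.bincount(yb))
```

Passing a float to `n_components` makes scikit-learn's PCA keep just enough components to explain that fraction of the variance, matching the 95%/90% criterion described above.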
Crystallographic and molecular descriptors
The four datasets have different numbers and types of descriptors, as shown in Table 5. In the first dataset, based on the 2D structures, only molecular descriptors were included. These descriptors capture properties of the whole molecule and can be categorised into constitutional, geometrical, topological and electronic descriptors; examples include the molecular mass, the number of carbon atoms, the polar surface area, the charge, and the count of halogen or hydrogen atoms in the molecule of interest. In the second dataset, these descriptors were replaced by the principal components generated by the dimensionality reduction. It is noteworthy that these components have no chemical meaning but were defined to reduce the dimensionality of the feature space and ease the task of the machine learning algorithms. In the third dataset, crystallographic descriptors were employed, including the unit cell parameters and the cell volume. Finally, the fourth dataset is the dimensionally reduced dataset obtained from the application of PCA to the crystallographic descriptors (i.e. the third dataset). Table 5 summarises the number and type of descriptors used for each dataset.
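As a toy illustration of constitutional descriptors (the study uses the full MOE and RDKit descriptor sets), counts such as the number of carbon and halogen atoms can be read directly from a SMILES string. The regex-based parser below is a deliberate simplification and handles only common organic elements:

```python
import re

def constitutional_descriptors(smiles):
    """Toy constitutional descriptors from a SMILES string (illustrative
    only, not the MOE/RDKit descriptor generation used in the study).
    Two-letter elements are matched first so 'Cl' is not counted as
    a carbon followed by something else."""
    atoms = re.findall(r"Cl|Br|Si|[BCNOPSFI]|c|n|o|s", smiles)
    return {
        "n_atoms": len(atoms),
        "n_carbon": sum(a in ("C", "c") for a in atoms),
        "n_halogen": sum(a in ("F", "Cl", "Br", "I") for a in atoms),
    }

print(constitutional_descriptors("Clc1ccccc1C(=O)O"))  # 2-chlorobenzoic acid
```

In practice a cheminformatics toolkit such as RDKit would be used to compute the full geometrical, topological and electronic descriptor sets from a parsed molecule rather than from the raw string.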
Figure 6 | Data flow and the different steps followed from the data collection, passing through the pre-processing and generation of machine learning models and
ending with a meta classifier design to predict the polymorphism of unknown structures.
Figure 7 | The occurrence of polymorphism in the datasets and the role of the
SMOTE algorithm to balance the classes.
Choice of machine learning classifiers
Eight algorithms were selected from Knime nodes to train models over the four datasets: Naive Bayes Multinomial50, Ordinal Classic classifier51, Simple Logistic52, Multilayer Perceptron53, Support Vector Machine54, k-Nearest Neighbours55, Random Forest56 and Gradient Boosted Trees57. Figure 8 shows how the curated data were split into 2D and 3D structures and how these two datasets were treated differently by applying either the original features (i.e. molecular descriptors or crystallographic parameters) or their principal components; four datasets were finally obtained according to the pre-processing method. All classifiers were implemented in the Weka nodes available within the Knime software.58 All algorithms were used with their default settings except the following: in SVM, the "Hyper Tangent" kernel was chosen with kappa equal to 0.1 and delta equal to 0.5; in the k-Nearest Neighbours classifier, the number of neighbours was set to 6, matching the number of classes in the datasets; and in Random Forest, the number of trees was set to 500. 10-fold cross-validation was applied for each classifier. Accuracy and Cohen's kappa were used as metrics to evaluate the performance of the models. In addition, confusion matrices, recall, precision, sensitivity, specificity and F-measure were provided for each class of each classifier.
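A scikit-learn analogue of this evaluation protocol (10-fold cross-validation, accuracy and Cohen's kappa, a 500-tree Random Forest and a 6-nearest-neighbour classifier) might look as follows. The synthetic six-class dataset stands in for the descriptor datasets; this is a sketch, not the Weka/Knime setup itself:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Synthetic stand-in for the descriptor datasets (6 classes, as in the paper).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)

classifiers = {
    "Random Forest (500 trees)": RandomForestClassifier(n_estimators=500,
                                                        random_state=0),
    "6-NN": KNeighborsClassifier(n_neighbors=6),  # 6 neighbours = no. of classes
}

scores = {}
for name, clf in classifiers.items():
    y_pred = cross_val_predict(clf, X, y, cv=10)  # 10-fold cross-validation
    scores[name] = (accuracy_score(y, y_pred), cohen_kappa_score(y, y_pred))
    print(f"{name}: accuracy={scores[name][0]:.3f}, kappa={scores[name][1]:.3f}")
```

Cohen's kappa is reported alongside accuracy because it corrects for chance agreement, which matters in a six-class problem even after balancing.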
Meta-classifier design
Once the eight classifiers were trained over the four datasets, classification probabilities (i.e. the probability of each of the six response classes) were generated. The "Prediction fusion" node was employed to combine the probabilities from the different classifiers and weight them according to the robustness of the corresponding model. These weights Wi correspond to the accuracy obtained for each ML model. This is expressed by the formula below:

Pfusion = ( Σi=1..8 Pi × Wi ) / 8    (1)

where Pfusion is the overall probability of that particular class, Pi is the probability predicted by model Mi among the eight selected models, and Wi is the scaling factor equal to the accuracy of model Mi. As for the individual classifiers, the same metrics were employed to characterise the overall model from the prediction fusion.
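Equation (1) amounts to an accuracy-weighted average of the per-model class probabilities, which can be written directly. The sketch below generalises the divisor from 8 to the number of models supplied; the function name and toy inputs are illustrative:

```python
def fuse_predictions(probs, accuracies):
    """Accuracy-weighted prediction fusion in the spirit of Eq. (1):
    for each class, average the per-model probabilities Pi scaled by
    the model accuracies Wi.

    probs: list of per-model probability vectors over the classes.
    accuracies: list of model accuracies Wi (one per model).
    """
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] * w for p, w in zip(probs, accuracies)) / n_models
            for c in range(n_classes)]

# Two toy models disagreeing on the class; the more accurate model wins.
probs = [[0.7, 0.3], [0.2, 0.8]]
accuracies = [0.9, 0.5]
fused = fuse_predictions(probs, accuracies)
print(fused, "-> class", fused.index(max(fused)) + 1)
```

Note that, because the weights are not renormalised, the fused values are relative scores rather than probabilities summing to one; only their ranking is used to pick the predicted class.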
Curation of data
• Extract the organic molecules from the CSD
• Extract the molecules from the DrugBank
• Select the overlapped compounds
• Generate 2D and 3D structures
Preprocessing
• Generate molecular descriptors (MOE and RDKit)
• Filter missing values and highly correlated or low-variance descriptors
• Normalisation
• Shuffle: randomisation of the samples
• SMOTE: balance the dataset
• Dimensionality reduction: PCA
Machine learning modelling: classification & cross-validation
• Naive Bayes Multinomial
• Ordinal Classic classifier
• Simple Logistic
• Multilayer Perceptron
• Support Vector Machine
• k-Nearest Neighbours
• Random Forest
• Gradient Boosted Trees
Meta-classifier design
• Weight-scaling of classification algorithms
• Prediction fusion
[Figure 7 data: class occurrence before balancing — class 1: 12%, class 2: 53%, class 3: 19%, class 4: 8%, class 5: 6%, class 6: 2%; after SMOTE balancing — approximately 16–17% per class.]
Figure 8 | Partition of the curated datasets into 4 categories, in which 2D and 3D structures are generated; 8 different machine learning algorithms were then applied to generate statistical models from the molecular/crystallographic descriptors or from the PCA principal components.
Table 5. Number and nature of the descriptors used as independent variables to build the predictive models

                       2D structures         3D structures
                       Case 1 a   Case 2 b   Case 3 c   Case 4 d
Number of dimensions      44        169          9         12
Number of samples        564        564       2352       2352

a 2D structures treated without PCA; b 2D structures treated with PCA; c 3D structures treated without PCA (crystallographic descriptors); d 3D structures treated with PCA.
Feature selection
Backward feature selection59 was carried out in the last stage of the data analysis and machine learning model design. The classifier giving the best predictive model was incorporated into the feature-selection loop. In each iteration, every remaining feature is tentatively discarded in turn and a model is rebuilt with the classifier of choice; the feature with the highest impact on accuracy is then removed before the next iteration, and the loop continues until all features have been eliminated. The models are ranked according to their accuracy, and the most important features are those that most affect the robustness and accuracy of the model: in particular, the most important feature is the one whose elimination before training gives the lowest model accuracy.
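One common reading of this procedure is a standard backward-elimination loop, in which the feature whose removal hurts accuracy the least is dropped each round so that the most important features survive longest. This may differ in detail from the Knime implementation; the feature names, weights and scoring function below are purely illustrative:

```python
def backward_elimination(features, score):
    """Backward feature selection sketch: at each step, tentatively drop
    each remaining feature, remove the one whose absence scores best, and
    record the order of elimination. `score` maps a frozenset of features
    to a model accuracy (here a toy function; in the study it would be the
    cross-validated accuracy of the best classifier)."""
    remaining = set(features)
    eliminated = []
    while len(remaining) > 1:
        # Drop the feature whose removal hurts accuracy the least.
        best = max(remaining, key=lambda f: score(frozenset(remaining - {f})))
        remaining.remove(best)
        eliminated.append(best)
    eliminated.extend(remaining)
    return eliminated  # least important first, most important last

# Toy score: each feature contributes a fixed weight to "accuracy".
weights = {"mass": 0.30, "tpsa": 0.25, "charge": 0.05, "n_halogen": 0.10}
order = backward_elimination(weights, lambda s: sum(weights[f] for f in s))
print(order)  # least -> most important
```

With the toy weights, the low-contribution features ("charge", then "n_halogen") are eliminated first and "mass" survives to the end, mirroring the ranking logic described above.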
Solvent screening
Pentoxifylline (CAS number 6493-05-6) was purchased from Sigma Aldrich. Samples of 100 mg each were recrystallised from 66 different solvents, listed in the supporting information with their available physical properties. Each solution was heated to near the boiling point of the corresponding solvent to ensure full dissolution of the starting material. Once clear solutions were obtained, they were left to cool to ambient temperature and then kept in atmospheric conditions to evaporate the solvent. The pure polymorphic forms and the percentage compositions of all the sample mixtures were characterised by differential scanning calorimetry–thermogravimetry (DSC-TG), XRPD and ATR-IR spectroscopy.
Powder X-Ray Diffraction
For crystalline form identification, a small quantity (10-50 mg) of each sample was analysed using transmission XRPD data collected on a Bruker AXS D8 Advance transmission diffractometer with θ/θ geometry, primary monochromated radiation (Cu Kα1, λ = 1.54056 Å), a Vantec PSD and an automated multiposition x-y sample stage. Samples were mounted on a 28-position sample plate supported on a polyimide (Kapton, 7.5 µm thickness) film. Data were collected from each sample in the range 4-35° 2θ with a 0.015° 2θ step size and a 1 s per step count time. Samples were oscillated in the x-y plane at a speed of 0.3 mm s-1 throughout data collection to maximise particle sampling and minimise preferred orientation effects.
Thermogravimetry analysis
A Netzsch STA 449 F1 Jupiter® instrument performed DSC and TGA simultaneously, allowing the evaporation of remaining solvent to be monitored.
Differential scanning calorimetry analysis
Differential scanning calorimetry experiments were performed on a Netzsch DSC214 Polyma differential scanning calorimeter. The heating rate was kept constant at 20 °C/min for all polymorphs, and all runs were carried out from 25 °C to 250 °C. The measurements were performed in aluminium crucibles, nitrogen was used as the purge gas in ambient mode, and calibration was performed using indium metal. All samples were cooled after a temperature plateau at 250 °C.
Attenuated Total Reflectance–Infrared Spectroscopy
Attenuated total reflectance–infrared spectra were collected on a Bruker TENSOR II FT-IR spectrometer with Opus v7.5 software. The spectrometer is fitted with a KBr beamsplitter, which operates in the range 8000–10 cm-1, a universal ATR accessory ('PLATINUM' diamond ATR accessory) and a HYPERION IR microscope. Spectra were collected in the 4000–650 cm-1 range with a resolution of 4.00 cm-1 and 4 scans.
GRAPHICAL ABSTRACT
Polymorphism is a physical feature that characterises solid crystalline compounds. Its regulation is crucial for the control of other properties such as solubility or mechanical resistance. Machine learning is a modern statistical tool that learns from existing data to generate models predicting how many solid forms a molecule can adopt. This computational tool provides a guideline for experimentalists to spot molecules with potentially underestimated polymorphism and to ease the discovery of novel materials.
GRAPHICAL ABSTRACT FIGURE
Data curation → Machine learning → Classification → Discovery
Supporting information
Machine learning-based approach to predict the "polymorphism" of organic compounds
Zied Hosni1,3*, Annalisa Riccardi2, Stephanie Yerdelen1, Alan R. G. Martin1, Deborah Bowering1, and Alastair J. Florence1*
1. 2D and PCA
1.1. Naive Bayes Multinomial
1.1.1. Confusion matrix
1 2 3 4 5 6
1 15 38 24 12 5 0
2 21 1 7 17 36 12
3 21 25 9 2 34 3
4 12 30 0 9 38 5
5 0 42 19 25 8 0
6 0 65 0 19 3 7
1.1.2. Accuracy statistics

Class  TP   FP   TN   FN   Recall     Precision  Sensitivity  Specificity  F-measure
1      15   54   416  79   0.159574   0.217391   0.159574     0.885106     0.184049
2       1  200   270  93   0.010638   0.004975   0.010638     0.574468     0.006780
3       9   50   420  85   0.095745   0.152542   0.095745     0.893617     0.117647
4       9   75   395  85   0.095745   0.107143   0.095745     0.840426     0.101124
5       8  116   354  86   0.085106   0.064516   0.085106     0.753191     0.073394
6       7   20   450  87   0.074468   0.259259   0.074468     0.957447     0.115702

Overall accuracy: 0.086879; Cohen's kappa: -0.09574
1 EPSRC Future Continuous Manufacturing and Advanced Crystallisation Hub, University of Strathclyde, Technology and Innovation Centre, 99 George Street, Glasgow, U.K. G1 1RD.
2 Department of Mechanical and Aerospace Engineering, University of Strathclyde, Glasgow, U.K. G1 1XJ.
3 Centre of Computational Chemistry, School of Chemistry, Cantock's Close, Bristol, U.K. BS8 1TS.
email: zied.hosni@bristol.ac.uk; alastair.florence@strath.ac.uk
1.2. Ordinal Classic classifier
1.2.1. Confusion matrix
1 2 3 4 5 6
1 66 14 6 5 1 2
2 9 33 24 12 9 7
3 3 19 67 4 1 0
4 6 5 11 64 7 1
5 1 5 13 18 57 0
6 0 3 0 0 15 76
1.2.2. Accuracy statistics

Class  TP   FP   TN   FN   Recall     Precision  Sensitivity  Specificity  F-measure
1      66   19   451  28   0.702128   0.776471   0.702128     0.959574     0.737430
2      33   46   424  61   0.351064   0.417722   0.351064     0.902128     0.381503
3      67   54   416  27   0.712766   0.553719   0.712766     0.885106     0.623256
4      64   39   431  30   0.680851   0.621359   0.680851     0.917021     0.649746
5      57   33   437  37   0.606383   0.633333   0.606383     0.929787     0.619565
6      76   10   460  18   0.808511   0.883721   0.808511     0.978723     0.844444

Overall accuracy: 0.643617; Cohen's kappa: 0.572340
1.3. Simple logistic
1.3.1. Confusion matrix
1 2 3 4 5 6
1 74 6 11 1 1 1
2 12 30 22 11 12 7
3 10 16 55 0 13 0
4 0 0 0 90 4 0
5 0 4 5 0 85 0
6 0 0 0 0 0 94
1.3.2. Accuracy statistics

Class  TP   FP   TN   FN   Recall     Precision  Sensitivity  Specificity  F-measure
1      74   22   448  20   0.787234   0.770833   0.787234     0.953191     0.778947
2      30   26   444  64   0.319149   0.535714   0.319149     0.944681     0.400000
3      55   38   432  39   0.585106   0.591398   0.585106     0.919149     0.588235
4      90   12   458   4   0.957447   0.882353   0.957447     0.974468     0.918367
5      85   30   440   9   0.904255   0.739130   0.904255     0.936170     0.813397
6      94    8   462   0   1.000000   0.921569   1.000000     0.982979     0.959184

Overall accuracy: 0.758865; Cohen's kappa: 0.710638
1.4. Multilayer Perceptron
1.4.1. Confusion matrix
1 2 3 4 5 6
1 3 4 0 87 0 0
2 5 2 0 84 0 3
3 9 0 0 84 0 1
4 0 0 0 94 0 0
5 0 3 0 90 1 0
6 0 0 0 93 0 1
1.4.2. Accuracy statistics

Class  TP   FP   TN   FN   Recall     Precision  Sensitivity  Specificity  F-measure
1       3   14   456  91   0.031915   0.176471   0.031915     0.970213     0.054054
2       2    7   463  92   0.021277   0.222222   0.021277     0.985106     0.038835
3       0    0   470  94   0.000000   ?          0.000000     1.000000     ?
4      94  438    32   0   1.000000   0.176692   1.000000     0.068085     0.300319
5       1    0   470  93   0.010638   1.000000   0.010638     1.000000     0.021053
6       1    4   466  93   0.010638   0.200000   0.010638     0.991489     0.020202

Overall accuracy: 0.179078; Cohen's kappa: 0.014894
1.5. Support Vector Machine
1.5.1. Confusion matrix
1 2 3 4 5 6
1 51 20 2 21 0 0
2 0 91 3 0 0 0
3 0 32 50 12 0 0
4 0 13 1 80 0 0
5 5 11 0 0 57 21
6 0 3 0 0 0 91
1.5.2. Accuracy statistics

Class  TP   FP   TN   FN   Recall     Precision  Sensitivity  Specificity  F-measure
1      51    5   465  43   0.542553   0.910714   0.542553     0.989362     0.680000
2      91   79   391   3   0.968085   0.535294   0.968085     0.831915     0.689394
3      50    6   464  44   0.531915   0.892857   0.531915     0.987234     0.666667
4      80   33   437  14   0.851064   0.707965   0.851064     0.929787     0.772947
5      57    0   470  37   0.606383   1.000000   0.606383     1.000000     0.754967
6      91   21   449   3   0.968085   0.812500   0.968085     0.955319     0.883495

Overall accuracy: 0.744681; Cohen's kappa: 0.693617
1.6. k-Nearest Neighbours
1.6.1. Confusion matrix
1 2 3 4 5 6
1 92 0 1 1 0 0
2 11 24 24 14 17 4
3 2 1 91 0 0 0
4 0 4 0 90 0 0
5 0 0 0 0 94 0
6 0 0 0 0 0 94
1.6.2. Accuracy statistics

Class  TP   FP   TN   FN   Recall     Precision  Sensitivity  Specificity  F-measure
1      92   13   457   2   0.978723   0.876190   0.978723     0.972340     0.924623
2      24    5   465  70   0.255319   0.827586   0.255319     0.989362     0.390244
3      91   25   445   3   0.968085   0.784483   0.968085     0.946809     0.866667
4      90   15   455   4   0.957447   0.857143   0.957447     0.968085     0.904523
5      94   17   453   0   1.000000   0.846847   1.000000     0.963830     0.917073
6      94    4   466   0   1.000000   0.959184   1.000000     0.991489     0.979167

Overall accuracy: 0.859929; Cohen's kappa: 0.831915
1.7. Random Forest
1.7.1. Confusion matrix
1 2 3 4 5 6
1 88 0 5 1 0 0
2 9 43 26 12 4 0
3 4 5 81 0 4 0
4 1 8 0 83 2 0
5 0 1 1 0 92 0
6 0 0 0 0 0 94
1.7.2. Accuracy statistics

Class  TP   FP   TN   FN   Recall     Precision  Sensitivity  Specificity  F-measure
1      88   14   456   6   0.936170   0.862745   0.936170     0.970213     0.897959
2      43   14   456  51   0.457447   0.754386   0.457447     0.970213     0.569536
3      81   32   438  13   0.861702   0.716814   0.861702     0.931915     0.782609
4      83   13   457  11   0.882979   0.864583   0.882979     0.972340     0.873684
5      92   10   460   2   0.978723   0.901961   0.978723     0.978723     0.938776
6      94    0   470   0   1.000000   1.000000   1.000000     1.000000     1.000000

Overall accuracy: 0.852837; Cohen's kappa: 0.823404
1.8. Gradient Boosted Trees
1.8.1. Confusion matrix
1 2 3 4 5 6
1 75 12 3 3 1 0
2 5 53 19 9 4 4
3 5 12 73 1 1 2
4 0 13 1 77 3 0
5 0 8 9 0 77 0
6 0 4 0 0 0 90
1.8.2. Accuracy statistics

Class  TP   FP   TN   FN   Recall     Precision  Sensitivity  Specificity  F-measure
1      75   10   460  19   0.797872   0.882353   0.797872     0.978723     0.837989
2      53   49   421  41   0.563830   0.519608   0.563830     0.895745     0.540816
3      73   32   438  21   0.776596   0.695238   0.776596     0.931915     0.733668
4      77   13   457  17   0.819149   0.855556   0.819149     0.972340     0.836957
5      77    9   461  17   0.819149   0.895349   0.819149     0.980851     0.855556
6      90    6   464   4   0.957447   0.937500   0.957447     0.987234     0.947368

Overall accuracy: 0.789007; Cohen's kappa: 0.746809
1.9. Prediction fusion
1.9.1. Confusion matrix
1 2 3 4 5 6
1 55 17 7 6 6 3
2 3 88 0 3 0 0
3 0 0 94 0 0 0
4 1 1 1 91 0 0
5 0 0 0 0 94 0
6 0 0 0 0 0 94
1.9.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 55 | 4 | 466 | 39 | 0.585106 | 0.932203 | 0.585106 | 0.991489 | 0.718954 | ? | ?
2 | 88 | 18 | 452 | 6 | 0.936170 | 0.830189 | 0.936170 | 0.961702 | 0.880000 | ? | ?
3 | 94 | 8 | 462 | 0 | 1.000000 | 0.921569 | 1.000000 | 0.982979 | 0.959184 | ? | ?
4 | 91 | 9 | 461 | 3 | 0.968085 | 0.910000 | 0.968085 | 0.980851 | 0.938144 | ? | ?
5 | 94 | 6 | 464 | 0 | 1.000000 | 0.940000 | 1.000000 | 0.987234 | 0.969072 | ? | ?
6 | 94 | 3 | 467 | 0 | 1.000000 | 0.969072 | 1.000000 | 0.993617 | 0.984293 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.914894 | 0.897872
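The "Overall" accuracy and Cohen's kappa reported under each table also follow from the confusion matrix alone: accuracy is the diagonal fraction, and kappa corrects it for the agreement expected by chance from the row and column marginals. A minimal sketch, using the 1.9 prediction-fusion matrix above (rows taken as true classes):

```python
# Sketch: overall accuracy and Cohen's kappa from the 1.9
# prediction-fusion confusion matrix (rows = true, columns = predicted).
cm = [
    [55, 17, 7, 6, 6, 3],
    [3, 88, 0, 3, 0, 0],
    [0, 0, 94, 0, 0, 0],
    [1, 1, 1, 91, 0, 0],
    [0, 0, 0, 0, 94, 0],
    [0, 0, 0, 0, 0, 94],
]

n = sum(sum(row) for row in cm)
p_o = sum(cm[i][i] for i in range(6)) / n      # observed agreement = accuracy
# Expected agreement under chance, from row and column marginals:
p_e = sum(sum(cm[i]) * sum(row[i] for row in cm) for i in range(6)) / n**2
kappa = (p_o - p_e) / (1 - p_e)
# p_o = 0.914894 and kappa = 0.897872, matching the Overall row above.
```

Because the test set here is balanced (94 molecules per class), p_e is exactly 1/6, so kappa is simply (accuracy − 1/6)/(5/6) for every table in this section.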
2. 2D
2.1. Naive Bayes Multinomial
2.1.1. Confusion matrix
1 2 3 4 5 6
1 38 8 28 9 11 0
2 16 16 22 15 19 6
3 14 9 35 3 29 4
4 4 16 0 35 39 0
5 0 11 10 17 56 0
6 0 0 0 12 13 69
2.1.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 38 | 34 | 436 | 56 | 0.404255 | 0.527778 | 0.404255 | 0.92766 | 0.457831 | ? | ?
2 | 16 | 44 | 426 | 78 | 0.170213 | 0.266667 | 0.170213 | 0.906383 | 0.207792 | ? | ?
3 | 35 | 60 | 410 | 59 | 0.37234 | 0.368421 | 0.37234 | 0.87234 | 0.37037 | ? | ?
4 | 35 | 56 | 414 | 59 | 0.37234 | 0.384615 | 0.37234 | 0.880851 | 0.378378 | ? | ?
5 | 56 | 111 | 359 | 38 | 0.595745 | 0.335329 | 0.595745 | 0.76383 | 0.429119 | ? | ?
6 | 69 | 10 | 460 | 25 | 0.734043 | 0.873418 | 0.734043 | 0.978723 | 0.797688 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.441489 | 0.329787
2.2. Ordinal Classic classifier
2.2.1. Confusion matrix
1 2 3 4 5 6
1 58 17 12 6 1 0
2 8 37 26 14 8 1
3 11 19 50 8 6 0
4 5 10 19 55 3 2
5 1 1 15 12 65 0
6 0 1 0 0 0 93
2.2.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 58 | 25 | 445 | 36 | 0.617021 | 0.698795 | 0.617021 | 0.946809 | 0.655367 | ? | ?
2 | 37 | 48 | 422 | 57 | 0.393617 | 0.435294 | 0.393617 | 0.897872 | 0.413408 | ? | ?
3 | 50 | 72 | 398 | 44 | 0.531915 | 0.409836 | 0.531915 | 0.846809 | 0.462963 | ? | ?
4 | 55 | 40 | 430 | 39 | 0.585106 | 0.578947 | 0.585106 | 0.914894 | 0.582011 | ? | ?
5 | 65 | 18 | 452 | 29 | 0.691489 | 0.783133 | 0.691489 | 0.961702 | 0.734463 | ? | ?
6 | 93 | 3 | 467 | 1 | 0.989362 | 0.96875 | 0.989362 | 0.993617 | 0.978947 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.634752 | 0.561702
2.3. Simple logistic
2.3.1. Confusion matrix
1 2 3 4 5 6
1 91 0 3 0 0 0
2 15 31 27 8 10 3
3 2 4 88 0 0 0
4 0 0 0 94 0 0
5 0 0 0 0 94 0
6 0 0 0 0 0 94
2.3.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 91 | 17 | 453 | 3 | 0.968085 | 0.842593 | 0.968085 | 0.96383 | 0.90099 | ? | ?
2 | 31 | 4 | 466 | 63 | 0.329787 | 0.885714 | 0.329787 | 0.991489 | 0.48062 | ? | ?
3 | 88 | 30 | 440 | 6 | 0.93617 | 0.745763 | 0.93617 | 0.93617 | 0.830189 | ? | ?
4 | 94 | 8 | 462 | 0 | 1 | 0.921569 | 1 | 0.982979 | 0.959184 | ? | ?
5 | 94 | 10 | 460 | 0 | 1 | 0.903846 | 1 | 0.978723 | 0.949495 | ? | ?
6 | 94 | 3 | 467 | 0 | 1 | 0.969072 | 1 | 0.993617 | 0.984293 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.87234 | 0.846809
2.4. Multilayer Perceptron
2.4.1. Confusion matrix
1 2 3 4 5 6
1 52 39 2 1 0 0
2 15 53 16 6 2 2
3 3 45 46 0 0 0
4 0 47 0 47 0 0
5 3 66 24 0 0 1
6 0 0 0 0 0 94
2.4.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 52 | 21 | 449 | 42 | 0.553191 | 0.712329 | 0.553191 | 0.955319 | 0.622754 | ? | ?
2 | 53 | 197 | 273 | 41 | 0.56383 | 0.212 | 0.56383 | 0.580851 | 0.30814 | ? | ?
3 | 46 | 42 | 428 | 48 | 0.489362 | 0.522727 | 0.489362 | 0.910638 | 0.505495 | ? | ?
4 | 47 | 7 | 463 | 47 | 0.5 | 0.87037 | 0.5 | 0.985106 | 0.635135 | ? | ?
5 | 0 | 2 | 468 | 94 | 0 | 0 | 0 | 0.995745 | NaN | ? | ?
6 | 94 | 3 | 467 | 0 | 1 | 0.969072 | 1 | 0.993617 | 0.984293 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.51773 | 0.421277
2.5. Support Vector Machine
2.5.1. Confusion matrix
1 2 3 4 5 6
1 50 21 23 0 0 0
2 4 90 0 0 0 0
3 1 32 33 28 0 0
4 0 14 0 80 0 0
5 0 11 0 5 57 21
6 1 2 0 0 0 91
2.5.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 50 | 6 | 464 | 44 | 0.531915 | 0.892857 | 0.531915 | 0.987234 | 0.666667 | ? | ?
2 | 90 | 80 | 390 | 4 | 0.957447 | 0.529412 | 0.957447 | 0.829787 | 0.681818 | ? | ?
3 | 33 | 23 | 447 | 61 | 0.351064 | 0.589286 | 0.351064 | 0.951064 | 0.44 | ? | ?
4 | 80 | 33 | 437 | 14 | 0.851064 | 0.707965 | 0.851064 | 0.929787 | 0.772947 | ? | ?
5 | 57 | 0 | 470 | 37 | 0.606383 | 1 | 0.606383 | 1 | 0.754967 | ? | ?
6 | 91 | 21 | 449 | 3 | 0.968085 | 0.8125 | 0.968085 | 0.955319 | 0.883495 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.710993 | 0.653191
2.6. k-nearest neighbors
2.6.1. Confusion matrix
1 2 3 4 5 6
1 90 0 3 0 1 0
2 9 27 30 13 10 5
3 0 0 92 0 2 0
4 1 0 0 89 4 0
5 2 0 3 0 89 0
6 0 0 0 0 0 94
2.6.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 90 | 12 | 458 | 4 | 0.957447 | 0.882353 | 0.957447 | 0.974468 | 0.918367 | ? | ?
2 | 27 | 0 | 470 | 67 | 0.287234 | 1 | 0.287234 | 1 | 0.446281 | ? | ?
3 | 92 | 36 | 434 | 2 | 0.978723 | 0.71875 | 0.978723 | 0.923404 | 0.828829 | ? | ?
4 | 89 | 13 | 457 | 5 | 0.946809 | 0.872549 | 0.946809 | 0.97234 | 0.908163 | ? | ?
5 | 89 | 17 | 453 | 5 | 0.946809 | 0.839623 | 0.946809 | 0.96383 | 0.89 | ? | ?
6 | 94 | 5 | 465 | 0 | 1 | 0.949495 | 1 | 0.989362 | 0.974093 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.852837 | 0.823404
2.7. Random Forest
2.7.1. Confusion matrix
1 2 3 4 5 6
1 79 2 7 6 0 0
2 12 48 21 9 3 1
3 2 7 81 1 3 0
4 1 2 0 89 2 0
5 0 3 1 1 89 0
6 0 1 0 0 1 92
2.7.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 79 | 15 | 455 | 15 | 0.840426 | 0.840426 | 0.840426 | 0.968085 | 0.840426 | ? | ?
2 | 48 | 15 | 455 | 46 | 0.510638 | 0.761905 | 0.510638 | 0.968085 | 0.611465 | ? | ?
3 | 81 | 29 | 441 | 13 | 0.861702 | 0.736364 | 0.861702 | 0.938298 | 0.794118 | ? | ?
4 | 89 | 17 | 453 | 5 | 0.946809 | 0.839623 | 0.946809 | 0.96383 | 0.89 | ? | ?
5 | 89 | 9 | 461 | 5 | 0.946809 | 0.908163 | 0.946809 | 0.980851 | 0.927083 | ? | ?
6 | 92 | 1 | 469 | 2 | 0.978723 | 0.989247 | 0.978723 | 0.997872 | 0.983957 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.847518 | 0.817021
2.8. Gradient Boosted trees
2.8.1. Confusion matrix
1 2 3 4 5 6
1 68 9 11 6 0 0
2 7 48 27 6 4 2
3 6 9 76 0 3 0
4 2 5 5 82 0 0
5 0 5 12 0 77 0
6 0 0 1 0 0 93
2.8.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 68 | 15 | 455 | 26 | 0.723404 | 0.819277 | 0.723404 | 0.968085 | 0.768362 | ? | ?
2 | 48 | 28 | 442 | 46 | 0.510638 | 0.631579 | 0.510638 | 0.940426 | 0.564706 | ? | ?
3 | 76 | 56 | 414 | 18 | 0.808511 | 0.575758 | 0.808511 | 0.880851 | 0.672566 | ? | ?
4 | 82 | 12 | 458 | 12 | 0.87234 | 0.87234 | 0.87234 | 0.974468 | 0.87234 | ? | ?
5 | 77 | 7 | 463 | 17 | 0.819149 | 0.916667 | 0.819149 | 0.985106 | 0.865169 | ? | ?
6 | 93 | 2 | 468 | 1 | 0.989362 | 0.978947 | 0.989362 | 0.995745 | 0.984127 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.787234 | 0.744681
2.9. Prediction fusion
2.9.1. Confusion matrix
1 2 3 4 5 6
1 90 3 0 1 0 0
2 1 88 4 0 1 0
3 5 23 53 8 4 1
4 0 0 1 93 0 0
5 0 0 0 0 94 0
6 0 0 0 0 0 94
2.9.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 90 | 6 | 464 | 4 | 0.957447 | 0.9375 | 0.957447 | 0.987234 | 0.947368 | ? | ?
2 | 88 | 26 | 444 | 6 | 0.93617 | 0.77193 | 0.93617 | 0.944681 | 0.846154 | ? | ?
3 | 53 | 5 | 465 | 41 | 0.56383 | 0.913793 | 0.56383 | 0.989362 | 0.697368 | ? | ?
4 | 93 | 9 | 461 | 1 | 0.989362 | 0.911765 | 0.989362 | 0.980851 | 0.94898 | ? | ?
5 | 94 | 5 | 465 | 0 | 1 | 0.949495 | 1 | 0.989362 | 0.974093 | ? | ?
6 | 94 | 1 | 469 | 0 | 1 | 0.989474 | 1 | 0.997872 | 0.994709 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.907801 | 0.889362
3. 3D and PCA
3.1. Naive Bayes Multinomial
3.1.1. Confusion matrix
1 2 3 4 5 6
1 52 17 53 58 100 112
2 50 0 1 69 0 272
3 41 26 0 45 3 277
4 12 194 38 17 60 71
5 62 45 17 80 0 188
6 2 76 54 68 164 28
3.1.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 52 | 167 | 1793 | 340 | 0.132653 | 0.237443 | 0.132653 | 0.914796 | 0.170213 | ? | ?
2 | 0 | 358 | 1602 | 392 | 0 | 0 | 0 | 0.817347 | NaN | ? | ?
3 | 0 | 163 | 1797 | 392 | 0 | 0 | 0 | 0.916837 | NaN | ? | ?
4 | 17 | 320 | 1640 | 375 | 0.043367 | 0.050445 | 0.043367 | 0.836735 | 0.046639 | ? | ?
5 | 0 | 327 | 1633 | 392 | 0 | 0 | 0 | 0.833163 | NaN | ? | ?
6 | 28 | 920 | 1040 | 364 | 0.071429 | 0.029536 | 0.071429 | 0.530612 | 0.041791 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.041241 | -0.15051
3.2. Ordinal Classic classifier
3.2.1. Confusion matrix
1 2 3 4 5 6
1 215 79 47 24 20 7
2 20 154 93 50 56 19
3 15 37 227 62 35 16
4 3 31 73 249 30 6
5 7 21 35 68 246 15
6 14 9 24 50 63 232
3.2.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 215 | 59 | 1901 | 177 | 0.548469 | 0.784672 | 0.548469 | 0.969898 | 0.645646 | ? | ?
2 | 154 | 177 | 1783 | 238 | 0.392857 | 0.465257 | 0.392857 | 0.909694 | 0.426003 | ? | ?
3 | 227 | 272 | 1688 | 165 | 0.579082 | 0.45491 | 0.579082 | 0.861224 | 0.50954 | ? | ?
4 | 249 | 254 | 1706 | 143 | 0.635204 | 0.49503 | 0.635204 | 0.870408 | 0.556425 | ? | ?
5 | 246 | 204 | 1756 | 146 | 0.627551 | 0.546667 | 0.627551 | 0.895918 | 0.584323 | ? | ?
6 | 232 | 63 | 1897 | 160 | 0.591837 | 0.786441 | 0.591837 | 0.967857 | 0.6754 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.5625 | 0.475
3.3. Simple logistic
3.3.1. Confusion matrix
1 2 3 4 5 6
1 127 19 75 58 46 67
2 23 23 42 163 51 90
3 28 34 68 89 71 102
4 16 129 91 45 75 36
5 47 25 35 79 89 117
6 0 59 38 84 171 40
3.3.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 127 | 114 | 1846 | 265 | 0.32398 | 0.526971 | 0.32398 | 0.941837 | 0.401264 | ? | ?
2 | 23 | 266 | 1694 | 369 | 0.058673 | 0.079585 | 0.058673 | 0.864286 | 0.067548 | ? | ?
3 | 68 | 281 | 1679 | 324 | 0.173469 | 0.194842 | 0.173469 | 0.856633 | 0.183536 | ? | ?
4 | 45 | 473 | 1487 | 347 | 0.114796 | 0.086873 | 0.114796 | 0.758673 | 0.098901 | ? | ?
5 | 89 | 414 | 1546 | 303 | 0.227041 | 0.176938 | 0.227041 | 0.788776 | 0.198883 | ? | ?
6 | 40 | 412 | 1548 | 352 | 0.102041 | 0.088496 | 0.102041 | 0.789796 | 0.094787 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.166667 | 0
3.4. Multilayer Perceptron
3.4.1. Confusion matrix
1 2 3 4 5 6
1 154 46 91 38 62 1
2 108 31 12 77 156 8
3 52 69 20 128 121 2
4 284 8 10 21 66 3
5 119 9 2 197 63 2
6 330 4 18 2 38 0
3.4.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 154 | 893 | 1067 | 238 | 0.392857 | 0.147087 | 0.392857 | 0.544388 | 0.214038 | ? | ?
2 | 31 | 136 | 1824 | 361 | 0.079082 | 0.185629 | 0.079082 | 0.930612 | 0.110912 | ? | ?
3 | 20 | 133 | 1827 | 372 | 0.05102 | 0.130719 | 0.05102 | 0.932143 | 0.073394 | ? | ?
4 | 21 | 442 | 1518 | 371 | 0.053571 | 0.045356 | 0.053571 | 0.77449 | 0.049123 | ? | ?
5 | 63 | 443 | 1517 | 329 | 0.160714 | 0.124506 | 0.160714 | 0.77398 | 0.140312 | ? | ?
6 | 0 | 16 | 1944 | 392 | 0 | 0 | 0 | 0.991837 | NaN | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.122874 | -0.05255
3.5. Support Vector Machine
3.5.1. Confusion matrix
1 2 3 4 5 6
1 235 37 5 27 88 0
2 0 332 60 0 0 0
3 0 154 137 0 101 0
4 0 58 15 208 0 111
5 0 94 16 0 282 0
6 0 31 2 0 0 359
3.5.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 235 | 0 | 1960 | 157 | 0.59949 | 1 | 0.59949 | 1 | 0.749601 | ? | ?
2 | 332 | 374 | 1586 | 60 | 0.846939 | 0.470255 | 0.846939 | 0.809184 | 0.604736 | ? | ?
3 | 137 | 98 | 1862 | 255 | 0.34949 | 0.582979 | 0.34949 | 0.95 | 0.437002 | ? | ?
4 | 208 | 27 | 1933 | 184 | 0.530612 | 0.885106 | 0.530612 | 0.986224 | 0.663477 | ? | ?
5 | 282 | 189 | 1771 | 110 | 0.719388 | 0.598726 | 0.719388 | 0.903571 | 0.653534 | ? | ?
6 | 359 | 111 | 1849 | 33 | 0.915816 | 0.76383 | 0.915816 | 0.943367 | 0.832947 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.660289 | 0.592347
3.6. k-nearest neighbors
3.6.1. Confusion matrix
1 2 3 4 5 6
1 339 19 15 2 10 7
2 16 226 60 39 29 22
3 7 13 342 11 9 10
4 2 15 7 356 6 6
5 4 7 12 5 352 12
6 7 8 25 6 15 331
3.6.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 339 | 36 | 1924 | 53 | 0.864796 | 0.904 | 0.864796 | 0.981633 | 0.883963 | ? | ?
2 | 226 | 62 | 1898 | 166 | 0.576531 | 0.784722 | 0.576531 | 0.968367 | 0.664706 | ? | ?
3 | 342 | 119 | 1841 | 50 | 0.872449 | 0.741866 | 0.872449 | 0.939286 | 0.801876 | ? | ?
4 | 356 | 63 | 1897 | 36 | 0.908163 | 0.849642 | 0.908163 | 0.967857 | 0.877928 | ? | ?
5 | 352 | 69 | 1891 | 40 | 0.897959 | 0.836105 | 0.897959 | 0.964796 | 0.865929 | ? | ?
6 | 331 | 57 | 1903 | 61 | 0.844388 | 0.853093 | 0.844388 | 0.970918 | 0.848718 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.827381 | 0.792857
3.7. Random Forest
3.7.1. Confusion matrix
1 2 3 4 5 6
1 311 30 26 7 13 5
2 16 236 45 40 36 19
3 14 13 323 11 14 17
4 0 17 18 338 9 10
5 2 19 19 6 332 14
6 3 22 28 9 20 310
3.7.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 311 | 35 | 1925 | 81 | 0.793367 | 0.898844 | 0.793367 | 0.982143 | 0.842818 | ? | ?
2 | 236 | 101 | 1859 | 156 | 0.602041 | 0.700297 | 0.602041 | 0.948469 | 0.647462 | ? | ?
3 | 323 | 136 | 1824 | 69 | 0.82398 | 0.703704 | 0.82398 | 0.930612 | 0.759107 | ? | ?
4 | 338 | 73 | 1887 | 54 | 0.862245 | 0.822384 | 0.862245 | 0.962755 | 0.841843 | ? | ?
5 | 332 | 92 | 1868 | 60 | 0.846939 | 0.783019 | 0.846939 | 0.953061 | 0.813725 | ? | ?
6 | 310 | 65 | 1895 | 82 | 0.790816 | 0.826667 | 0.790816 | 0.966837 | 0.808344 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.786565 | 0.743878
3.8. Gradient Boosted trees
3.8.1. Confusion matrix
1 2 3 4 5 6
1 250 44 62 8 18 10
2 17 213 55 49 45 13
3 18 33 269 19 31 22
4 6 51 25 279 19 12
5 5 25 30 9 303 20
6 9 36 34 13 27 273
3.8.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 250 | 55 | 1905 | 142 | 0.637755 | 0.819672 | 0.637755 | 0.971939 | 0.71736 | ? | ?
2 | 213 | 189 | 1771 | 179 | 0.543367 | 0.529851 | 0.543367 | 0.903571 | 0.536524 | ? | ?
3 | 269 | 206 | 1754 | 123 | 0.686224 | 0.566316 | 0.686224 | 0.894898 | 0.620531 | ? | ?
4 | 279 | 98 | 1862 | 113 | 0.711735 | 0.740053 | 0.711735 | 0.95 | 0.725618 | ? | ?
5 | 303 | 140 | 1820 | 89 | 0.772959 | 0.683973 | 0.772959 | 0.928571 | 0.725749 | ? | ?
6 | 273 | 77 | 1883 | 119 | 0.696429 | 0.78 | 0.696429 | 0.960714 | 0.735849 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.674745 | 0.609694
3.9. Prediction fusion
3.9.1. Confusion matrix
1 2 3 4 5 6
1 334 18 20 3 14 3
2 8 266 52 31 16 19
3 5 13 343 11 11 9
4 2 11 9 359 3 8
5 0 9 12 5 359 7
6 3 6 20 5 10 348
3.9.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 334 | 18 | 1942 | 58 | 0.852041 | 0.948864 | 0.852041 | 0.990816 | 0.897849 | ? | ?
2 | 266 | 57 | 1903 | 126 | 0.678571 | 0.823529 | 0.678571 | 0.970918 | 0.744056 | ? | ?
3 | 343 | 113 | 1847 | 49 | 0.875 | 0.752193 | 0.875 | 0.942347 | 0.808962 | ? | ?
4 | 359 | 55 | 1905 | 33 | 0.915816 | 0.86715 | 0.915816 | 0.971939 | 0.890819 | ? | ?
5 | 359 | 54 | 1906 | 33 | 0.915816 | 0.869249 | 0.915816 | 0.972449 | 0.891925 | ? | ?
6 | 348 | 46 | 1914 | 44 | 0.887755 | 0.883249 | 0.887755 | 0.976531 | 0.885496 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.854167 | 0.825
4. 3D
4.1. Naive Bayes Multinomial
4.1.1. Confusion matrix
1 2 3 4 5 6
1 19 0 127 0 132 114
2 45 0 7 15 9 316
3 45 42 9 0 148 148
4 42 73 41 51 79 106
5 89 6 133 0 6 158
6 62 3 136 0 166 25
4.1.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 19 | 283 | 1677 | 373 | 0.048469 | 0.062914 | 0.048469 | 0.855612 | 0.054755 | ? | ?
2 | 0 | 124 | 1836 | 392 | 0 | 0 | 0 | 0.936735 | NaN | ? | ?
3 | 9 | 444 | 1516 | 383 | 0.022959 | 0.019868 | 0.022959 | 0.773469 | 0.021302 | ? | ?
4 | 51 | 15 | 1945 | 341 | 0.130102 | 0.772727 | 0.130102 | 0.992347 | 0.222707 | ? | ?
5 | 6 | 534 | 1426 | 386 | 0.015306 | 0.011111 | 0.015306 | 0.727551 | 0.012876 | ? | ?
6 | 25 | 842 | 1118 | 367 | 0.063776 | 0.028835 | 0.063776 | 0.570408 | 0.039714 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.046769 | -0.14388
4.2. Ordinal Classic classifier
4.2.1. Confusion matrix
1 2 3 4 5 6
1 218 94 45 17 9 9
2 23 215 68 38 35 13
3 9 45 262 34 33 9
4 10 15 62 272 31 2
5 8 23 49 59 240 13
6 10 23 21 23 91 224
4.2.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 218 | 60 | 1900 | 174 | 0.556122 | 0.784173 | 0.556122 | 0.969388 | 0.650746 | ? | ?
2 | 215 | 200 | 1760 | 177 | 0.548469 | 0.518072 | 0.548469 | 0.897959 | 0.532838 | ? | ?
3 | 262 | 245 | 1715 | 130 | 0.668367 | 0.516765 | 0.668367 | 0.875 | 0.58287 | ? | ?
4 | 272 | 171 | 1789 | 120 | 0.693878 | 0.613995 | 0.693878 | 0.912755 | 0.651497 | ? | ?
5 | 240 | 199 | 1761 | 152 | 0.612245 | 0.546697 | 0.612245 | 0.898469 | 0.577617 | ? | ?
6 | 224 | 46 | 1914 | 168 | 0.571429 | 0.82963 | 0.571429 | 0.976531 | 0.676737 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.608418 | 0.530102
4.3. Simple logistic
4.3.1. Confusion matrix
1 2 3 4 5 6
1 138 26 73 43 36 76
2 28 29 80 122 39 94
3 19 98 65 44 82 84
4 16 127 43 98 71 37
5 34 25 72 49 80 132
6 31 48 58 48 170 37
4.3.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 138 | 128 | 1832 | 254 | 0.352041 | 0.518797 | 0.352041 | 0.934694 | 0.419453 | ? | ?
2 | 29 | 324 | 1636 | 363 | 0.07398 | 0.082153 | 0.07398 | 0.834694 | 0.077852 | ? | ?
3 | 65 | 326 | 1634 | 327 | 0.165816 | 0.16624 | 0.165816 | 0.833673 | 0.166028 | ? | ?
4 | 98 | 306 | 1654 | 294 | 0.25 | 0.242574 | 0.25 | 0.843878 | 0.246231 | ? | ?
5 | 80 | 398 | 1562 | 312 | 0.204082 | 0.167364 | 0.204082 | 0.796939 | 0.183908 | ? | ?
6 | 37 | 423 | 1537 | 355 | 0.094388 | 0.080435 | 0.094388 | 0.784184 | 0.086854 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.190051 | 0.028061
4.4. Multilayer Perceptron
4.4.1. Confusion matrix
1 2 3 4 5 6
1 76 148 151 14 1 2
2 1 74 190 118 9 0
3 1 140 172 72 4 3
4 0 141 190 51 5 5
5 1 216 110 54 9 2
6 0 164 166 61 0 1
4.4.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 76 | 3 | 1957 | 316 | 0.193878 | 0.962025 | 0.193878 | 0.998469 | 0.322718 | ? | ?
2 | 74 | 809 | 1151 | 318 | 0.188776 | 0.083805 | 0.188776 | 0.587245 | 0.116078 | ? | ?
3 | 172 | 807 | 1153 | 220 | 0.438776 | 0.175689 | 0.438776 | 0.588265 | 0.250912 | ? | ?
4 | 51 | 319 | 1641 | 341 | 0.130102 | 0.137838 | 0.130102 | 0.837245 | 0.133858 | ? | ?
5 | 9 | 19 | 1941 | 383 | 0.022959 | 0.321429 | 0.022959 | 0.990306 | 0.042857 | ? | ?
6 | 1 | 12 | 1948 | 391 | 0.002551 | 0.076923 | 0.002551 | 0.993878 | 0.004938 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.16284 | -0.00459
4.5. Support Vector Machine
4.5.1. Confusion matrix
1 2 3 4 5 6
1 235 32 0 115 10 0
2 0 337 0 0 55 0
3 0 150 178 36 28 0
4 0 62 0 319 11 0
5 0 99 0 0 293 0
6 0 26 57 0 73 236
4.5.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 235 | 0 | 1960 | 157 | 0.59949 | 1 | 0.59949 | 1 | 0.749601 | ? | ?
2 | 337 | 369 | 1591 | 55 | 0.859694 | 0.477337 | 0.859694 | 0.811735 | 0.613843 | ? | ?
3 | 178 | 57 | 1903 | 214 | 0.454082 | 0.757447 | 0.454082 | 0.970918 | 0.567783 | ? | ?
4 | 319 | 151 | 1809 | 73 | 0.813776 | 0.678723 | 0.813776 | 0.922959 | 0.740139 | ? | ?
5 | 293 | 177 | 1783 | 99 | 0.747449 | 0.623404 | 0.747449 | 0.909694 | 0.679814 | ? | ?
6 | 236 | 0 | 1960 | 156 | 0.602041 | 1 | 0.602041 | 1 | 0.751592 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.679422 | 0.615306
4.6. k-nearest neighbors
4.6.1. Confusion matrix
1 2 3 4 5 6
1 358 12 8 2 12 0
2 14 221 58 40 33 26
3 5 13 340 6 19 9
4 0 8 7 368 7 2
5 4 4 11 3 359 11
6 3 12 14 3 19 341
4.6.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 358 | 26 | 1934 | 34 | 0.913265 | 0.932292 | 0.913265 | 0.986735 | 0.92268 | ? | ?
2 | 221 | 49 | 1911 | 171 | 0.563776 | 0.818519 | 0.563776 | 0.975 | 0.667674 | ? | ?
3 | 340 | 98 | 1862 | 52 | 0.867347 | 0.776256 | 0.867347 | 0.95 | 0.819277 | ? | ?
4 | 368 | 54 | 1906 | 24 | 0.938776 | 0.872038 | 0.938776 | 0.972449 | 0.904177 | ? | ?
5 | 359 | 90 | 1870 | 33 | 0.915816 | 0.799555 | 0.915816 | 0.954082 | 0.853746 | ? | ?
6 | 341 | 48 | 1912 | 51 | 0.869898 | 0.876607 | 0.869898 | 0.97551 | 0.873239 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.844813 | 0.813776
4.7. Random Forest
4.7.1. Confusion matrix
1 2 3 4 5 6
1 321 32 18 4 12 5
2 14 295 38 16 19 10
3 3 24 337 4 20 4
4 5 20 11 348 7 1
5 9 11 26 6 333 7
6 5 20 28 4 19 316
4.7.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 321 | 36 | 1924 | 71 | 0.818878 | 0.89916 | 0.818878 | 0.981633 | 0.857143 | ? | ?
2 | 295 | 107 | 1853 | 97 | 0.752551 | 0.733831 | 0.752551 | 0.945408 | 0.743073 | ? | ?
3 | 337 | 121 | 1839 | 55 | 0.859694 | 0.735808 | 0.859694 | 0.938265 | 0.792941 | ? | ?
4 | 348 | 34 | 1926 | 44 | 0.887755 | 0.910995 | 0.887755 | 0.982653 | 0.899225 | ? | ?
5 | 333 | 77 | 1883 | 59 | 0.84949 | 0.812195 | 0.84949 | 0.960714 | 0.830424 | ? | ?
6 | 316 | 27 | 1933 | 76 | 0.806122 | 0.921283 | 0.806122 | 0.986224 | 0.859864 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.829082 | 0.794898
4.8. Gradient Boosted trees
4.8.1. Confusion matrix
1 2 3 4 5 6
1 264 42 37 11 23 15
2 16 266 48 22 23 17
3 15 39 281 10 36 11
4 10 30 22 308 18 4
5 10 20 33 8 317 4
6 6 25 30 7 23 301
4.8.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 264 | 57 | 1903 | 128 | 0.673469 | 0.82243 | 0.673469 | 0.970918 | 0.740533 | ? | ?
2 | 266 | 156 | 1804 | 126 | 0.678571 | 0.630332 | 0.678571 | 0.920408 | 0.653563 | ? | ?
3 | 281 | 170 | 1790 | 111 | 0.716837 | 0.62306 | 0.716837 | 0.913265 | 0.666667 | ? | ?
4 | 308 | 58 | 1902 | 84 | 0.785714 | 0.84153 | 0.785714 | 0.970408 | 0.812665 | ? | ?
5 | 317 | 123 | 1837 | 75 | 0.808673 | 0.720455 | 0.808673 | 0.937245 | 0.762019 | ? | ?
6 | 301 | 51 | 1909 | 91 | 0.767857 | 0.855114 | 0.767857 | 0.97398 | 0.80914 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.73852 | 0.686224
4.9. Prediction fusion
4.9.1. Confusion matrix
1 2 3 4 5 6
1 366 4 6 11 2 3
2 13 346 8 19 2 4
3 17 13 297 35 20 10
4 12 2 26 345 5 2
5 4 0 5 7 376 0
6 4 0 13 17 6 352
4.9.2. Accuracy statistics
Class | True Positive | False Positive | True Negative | False Negative | Recall | Precision | Sensitivity | Specificity | F-measure | Accuracy | Cohen's kappa
1 | 366 | 50 | 1910 | 26 | 0.933673 | 0.879808 | 0.933673 | 0.97449 | 0.905941 | ? | ?
2 | 346 | 19 | 1941 | 46 | 0.882653 | 0.947945 | 0.882653 | 0.990306 | 0.914135 | ? | ?
3 | 297 | 58 | 1902 | 95 | 0.757653 | 0.83662 | 0.757653 | 0.970408 | 0.795181 | ? | ?
4 | 345 | 89 | 1871 | 47 | 0.880102 | 0.794931 | 0.880102 | 0.954592 | 0.835351 | ? | ?
5 | 376 | 35 | 1925 | 16 | 0.959184 | 0.914842 | 0.959184 | 0.982143 | 0.936488 | ? | ?
6 | 352 | 19 | 1941 | 40 | 0.897959 | 0.948787 | 0.897959 | 0.990306 | 0.922674 | ? | ?
Overall | ? | ? | ? | ? | ? | ? | ? | ? | ? | 0.885204 | 0.862245
SI1- Predictive models of polymorphism.docx
L_solub(i) = w_solub * Solvent_density

Solvent | Formula | MW | Boiling point (°C) | Melting point (°C) | L_solub compound 1 | L_solub compound 2 | Screening markers
acetic acid | C2H4O2 | 60.052 | 118 | 16.6 | 148.5599062 | 760.1772 | m x
acetone | C3H6O | 58.079 | 56.05 | -94.7 | | | y l x
acetonitrile | C2H3N | 41.052 | 81.65 | -43.8 | 162.0894225 | 217.3548 | y m x
anisole | | | 155.5 | -37.3 | 59.5966139 | 284.77141 | y h x
benzol | | | 205 | -15 | | | h x
benzylamine | | | 185 | 10 | 150.8489011 | 149.5782 | h x
bromobenzene | | | 156 | -30.8 | 33.32326767 | 58.88731 | h x
bromobutane | | | 102 | -112.4 | 19.0753745 | 630.75315 | y m x
1-butanol | C4H10O | 74.12 | 117.7 | -88.6 | 41.13795169 | 32.82109 | y m x
2-butanol | C4H10O | 74.12 | 99.5 | -88.5 | 39.3002199 | 32.98498 | y x
2-butanone | C4H8O | 72.11 | 79.6 | -86.6 | 136.219786 | 125.7603 | y x
chloroform | CHCl3 | 119.38 | 61.2 | -63.4 | | | y x
cyclopentane | | | 49 | -94 | 0.4579773 | 11.247981 | y - - - - - - - - - - -
cyclopentyl methyl ether | | | 106 | -140 | | | y x
diethylene glycol + | C4H10O3 | 106.12 | 246 | -10 | 44.74650612 | 45.00735 | x
diethyl ether | C4H10O | 74.12 | 34.5 | -116.2 | 23.1550383 | 626.31825 | y - - - - - - - - - - -
diglyme (diethylene glycol dimethyl ether) | C6H14O3 | 134.17 | 162 | -68 | | | y
diiodomethane | | | 182.1 | 6.2 | | | x
dimethoxypropane | | | 83 | -47 | | | y x
dimethylformamide (DMF) | C3H7NO | 73.09 | 153 | -60.48 | 316.3930535 | 256.9693 | y x
dimethyl sulfoxide (DMSO) | C2H6OS | 78.13 | 189 | 18.4 | 324.1045727 | 248.3549 | x
1,4-dioxane | C4H8O2 | 88.11 | 101.1 | 11.8 | 165.182237 | 153.6324 | x
dodecane | | | 214 | -10 | 0.12007766 | 90.394447 | x
ethanol | C2H6O | 46.07 | 78.5 | -114.1 | 82.1520739 | 62.81078 | y x
ethoxyethanol + | | | 135 | -70 | 55.0927549 | 50.1593 | y x
ethyl acetate | C4H8O2 | 88.11 | 77 | -83.6 | 110.7112646 | 109.9329 | y x
ethylene glycol | C2H6O2 | 62.07 | 195 | -13 | | | x x
heptane | C7H16 | 100.2 | 98 | -90.6 | 0.208640266 | 0.642268 | y x
hexanone | | | 127.6 | -55 | 46.1061757 | 48.9032 | y x
iodobenzene | | | 188 | -29 | 27.0992589 | 345.97503 | x
2-iodobutane | | | 127 | -103.5 | 11.6991215 | 219.14337 | y x
iodomethane | | | 42.4 | -66.5 | | | y - - - - - - - - - - -
isoamyl alcohol | | | 131.1 | -117 | | | y x
isobutyl acetate | | | 118 | -99 | | | y x
isopropyl acetate | | | 89 | -73 | 57.72860428 | 64.49613 | y x
methanol | CH4O | 32.04 | 64.6 | -98 | 121.320011 | 492.61165 | y x
3-methylbutanol | | | 132 | -117 | | | y x
methyl t-butyl ether (MTBE) | C5H12O | 88.15 | 55.2 | -109 | 1.622897611 | 3.043001 | y x
methylene chloride | CH2Cl2 | 84.93 | 39.8 | -96.7 | | | y - - - - - - - - - - -
methylcyclohexane | | | 101 | -126 | 0.2619268 | 0.771017 | y x
1-methylnaphthalene | | | 244.4 | -30 | | | x
2-methyl-1-propanol | | | 107 | -108 | | | y x
4-methyl-2-pentanone | | | 117 | -84 | 41.53932525 | 45.26553 | y x
methyl tetrahydrofuran | | | 78 | -136 | | | y x
nitrobenzene | | | 210.9 | 5.7 | 112.8083749 | 144.0984 | x
nitromethane | CH3NO2 | 61.04 | 101.2 | -29 | 425.6196295 | 470.6408 | x
1-octanol | | | 195 | -16 | 11.72360395 | 11.18539 | x
2-octanol | | | 178.5 | -38 | | | x
2-pentanol | | | 119 | -73 | 28.85935717 | 24.81189 | y x
2-phenoxyethanol | | | 242 | 14 | | | x
pentadecane | | | 270 | 16 | | | x
3-pentanol | | | 115.3 | -63 | 28.37206384 | 25.02772 | y x
2-phenoxyethanol | | | 247 | -2 | | | x
2-phenylethanol | | | 219 | -27 | | | x
propane-1,2-diol | | | 188 | -59 | | | y x
1-propanol | C3H8O | 88.15 | 97 | -126 | 63.39754569 | 47.84395 | y x
2-propanol | C3H8O | 88.15 | 82.4 | -88.5 | 50.75476517 | 42.20535 | y x
propyl acetate | | | 101.6 | -92 | 60.0559115 | 62.36033 | y x
propylene carbonate | | | 242 | -48 | | | y x
salicylaldehyde | | | 197 | -7 | | | x
tert-butyl methyl ether | | | 55.05 | -108 | | | y x
tetrachloroethylene | | | 121 | -22 | | | x
tetrahydrofuran (THF) | C4H8O | 72.106 | 65 | -108.4 | | | y x
toluene | C7H8 | 92.14 | 110.6 | -93 | 22.92177053 | 38.19096 | y x
tridecane | | | 233 | -5 | 0.11126587 | 30.368231 |
triethylamine | C6H15N | 101.19 | 88.9 | -114.7 | | | y x
2,2,2-trifluoroethanol + | | | 77 | -44 | | | y x
2,2,4-trimethylpentane + | | | 99.23 | -107 | | | y x
water | H2O | 18.02 | 100 | 0 | 14.27359073 | 22.61968 | x
o-xylene | C8H10 | 106.17 | 144 | -25.2 | | | x
m-xylene | C8H10 | 106.17 | 139.1 | -47.8 | | | y x
p-xylene | C8H10 | 106.17 | 138.4 | 13.3 | | | x
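The header of this sheet states the conversion L_solub(i) = w_solub * Solvent_density, i.e. a weight-based solubility scaled by the solvent's density to give a volume-based loading. A minimal sketch of that arithmetic, where the function name, units, and the density value are illustrative assumptions and not taken from the sheet:

```python
# Sketch of the sheet's stated conversion: L_solub = w_solub * density.
# Units and the example density are assumptions for illustration only,
# e.g. w_solub in mg per g of solvent and density in g/mL gives mg/mL.

def l_solub(w_solub, solvent_density):
    """Volume-based solubility loading from a weight-based solubility."""
    return w_solub * solvent_density

# A hypothetical w_solub of 100 mg/g in a solvent of density 0.79 g/mL
# corresponds to a loading of 79 mg/mL.
loading = l_solub(100, 0.79)
```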
Solvent.docx
Other files: output of classifiers (Autosaved).xlsx; SI2- Experimental solvents screening.xlsx