Discovery of Highly Polymorphic Organic Materials: A New Machine Learning Approach

doi.org/10.26434/chemrxiv.9524219.v1

Zied Hosni, Annalisa Riccardi, Stephanie Yerdelen, Alan R. G. Martin, Deborah Bowering, Alastair Florence

Submitted date: 12/08/2019. Posted date: 13/08/2019. Licence: CC BY-NC-ND 4.0.

Citation information: Hosni, Zied; Riccardi, Annalisa; Yerdelen, Stephanie; Martin, Alan R. G.; Bowering, Deborah; Florence, Alastair (2019): Discovery of Highly Polymorphic Organic Materials: A New Machine Learning Approach. ChemRxiv. Preprint.

Polymorphism is the capacity of a molecule to adopt different conformations or molecular packing arrangements in the solid state. This is a key property to control during pharmaceutical manufacturing because it can impact a range of properties including stability and solubility. In this study, a novel approach based on machine learning classification methods is used to predict the likelihood for an organic compound to crystallise in multiple forms. A training dataset of drug-like molecules was curated from the Cambridge Structural Database (CSD) and filtered according to entries in the Drug Bank database. The number of separate forms in the CSD for each molecule was recorded. A metaclassifier was trained using this dataset to predict the expected number of crystalline forms from the compound descriptors. This approach was used to estimate the number of crystallographic forms for an external validation dataset. These results suggest this novel methodology can be used to predict the extent of polymorphism of new drugs or not-yet experimentally screened molecules. This promising method complements expensive ab initio methods for crystal structure prediction and, as an integral part of experimental physical form screening, may identify systems with unexplored potential.

File list (5):
- Machine learning-based approach to predict the polymorp... (1.19 MiB)
- SI1- Predictive models of polymorphism.docx (75.45 KiB)
- Solvent.docx (25.73 KiB)
- output of classifiers (Autosaved).xlsx (93.17 KiB)
- SI2- Experimental solvents screening.xlsx (34.76 KiB)




Discovery of highly polymorphic organic materials: a new machine learning approach

Zied Hosni1,3*, Annalisa Riccardi2, Stephanie Yerdelen1, Alan R. G. Martin1, Deborah Bowering1, and Alastair J. Florence1*

Polymorphism is the capacity of a molecule to adopt different conformations or molecular packing arrangements in the

solid state. This is a key property to control during pharmaceutical manufacturing because it can impact a range of

properties including stability and solubility. In this study, a novel approach based on machine learning classification

methods is used to predict the likelihood for an organic compound to crystallise in multiple forms. A training dataset of

drug-like molecules was curated from the Cambridge Structural Database (CSD) and filtered according to entries in the

Drug Bank database. The number of separate forms in the CSD for each molecule was recorded. A metaclassifier was

trained using this dataset to predict the expected number of crystalline forms from the compound descriptors. This

approach was used to estimate the number of crystallographic forms for an external validation dataset. These results

suggest this novel methodology can be used to predict the extent of polymorphism of new drugs or not-yet experimentally

screened molecules. This promising method complements expensive ab initio methods for crystal structure prediction and

as an integral part of experimental physical form screening, may identify systems with unexplored potential.

Machine Learning (ML) methods are ubiquitous in many areas

of modern science and have become a crucial tool where large

amounts of data from different sources are available. There is

a diverse range of ML algorithms available that have been applied to the

modelling and prediction of complex systems and problems. Various

factors have an impact on the suitability of ML approaches for different

applications. Among those are the size and distribution of the training

data in the features space, the correlation of the descriptors, the nature

of the problem and its degree of non-linearity. The non-linearity of the

problem considered in this study is one of the main drivers for the choice

of the ML approach used. Support Vector Machine (SVM) and Random

Forest (RF) are ML methods that have already been successfully used

for classification and prediction of non-linear chemical processes (i.e.

the features and the response are not correlated with a linear

relationship), and they are suitable for high-dimensional problems (i.e.

many factors affect the response)1,2. The k-Nearest Neighbours (k-NN) algorithm combines simplicity and

intuitiveness. In the Mitchell group, this method was applied to predict

the melting point for 4119 structurally diverse organic molecules and

277 drug-like molecules. The performance of this algorithm was

compared with that of neural networks, revealing the strengths and

weaknesses of each predictive model. Cross-validation and

y-randomisation both proved to be good strategies for prediction

validation3. Tropsha et al. highlighted the importance of validation

techniques in Quantitative Structure-Property Relationship (QSPR)

models before applying them on real-world problems. They enumerated several examples of predictive failure when the validation step was not considered carefully4.

1 EPSRC Future Continuous Manufacturing and Advanced Crystallisation Hub, University of Strathclyde, Technology and Innovation Centre, 99 George Street, Glasgow, U.K. G1 1RD. 2 Department of Mechanical and Aerospace Engineering, University of Strathclyde, Glasgow, U.K. G1 1XJ. 3 Centre for Computational Chemistry, School of Chemistry, Cantock's Close, Bristol, U.K. BS8 1TS. email: [email protected]; [email protected]

QSPR modelling is by definition based on the assumption that

changes in molecular structure are reflected in variation in the observed

macroscopic properties of materials2. This approach does not require

access to expensive, high-performance computing power and has been

shown to deliver scalability, efficiency, robustness and predictability5.

QSPR has been applied to predict a wide range of material properties

such as physicochemical and biological properties of nanomaterials6,

catalytic activity in homogeneous and heterogeneous catalysts7,8,

protein adsorption, cell attachment, cellular proliferation on biomaterial

surface9, glass transition temperature for polymers10, melting points for

ionic liquids and others11.

Crystal structure prediction is a challenging area and one of the

promising applications of QSPR and ML. Philipps et al. identified new

types of crystalline structures from large data sets of coordinates. They

deployed a hierarchy of pattern analysis techniques and applied ML

with shape matching algorithms to extract and classify crystals into

categories12. Clustering and the identification of intrinsic structural

features in particle tracking data were also investigated using the

Neighborhood Graph Analysis (NGA) method13. In inorganic

chemistry, the Cluster Resolution Feature Selection (CR-FS) and

support vector machine (SVM) classification were applied to predict the

crystal structures of ternary equiatomic compositions based only on the

constituent elements14. Moreover, Principal Component Analysis

(PCA) was exploited to render structure maps of spinel nitrides

(AB2N4)15.



Figure 1 | The 3-dimensional packing and crystallographic information (the distances, the angles and the volume of the unit cell) of two biologically active molecules (Chlorpropamide and Thiazolidinone), showing the five characterised polymorphic forms of the first and the single reported form of the second.

The Random Forest algorithm was able to predict full-Heusler

structures and discriminate between Heusler and inverse Heusler

structures16.

Raccuglia et al. successfully exploited a database of failed experiments to

highlight the factors that control the reaction outcomes of organically

templated metal oxides. They applied different algorithms and found that

SVM was the most robust at mining the chemical information

rendered from historical reactions17. The identification of crystal

structure using optimisation of relevant thermodynamic potential in the

space of atomic coordinates is gaining significant interest18. Various

algorithms such as genetic algorithms or simulated annealing were

exploited to determine the global energy configurations of crystal

structures19. An ML model was trained on a dataset of ab initio

calculation results for 7000 organic molecules. Various molecular

descriptors such as nuclear charges and cartesian coordinates were

exploited as features for a deep multi-task artificial neural network

capable of predicting atomisation energy, ionisation potential and

electron affinity simultaneously20.

Polymorphism is defined as “the existence of a solid crystalline

phase of a given compound resulting from the possibility of at least two

different arrangements of the molecules of that compound in the solid

state”21,22. It is difficult to predict ab initio whether a specific molecule

will adopt more than one crystal structure, how many polymorphs are

likely to be observed, or the specific crystal packing arrangements and

associated physical properties each polymorph will display23,24. Organic

crystals are of paramount importance in different industrial sectors

including agrochemicals, food, paint, energetic materials, and

pharmaceuticals. The polymorphism of these entities dictates their

flowability, stability, colour, solubility and mechanical strength25. In

addition to the challenges related to the production control of a specific

solid form, polymorphism continues to generate intellectual property conflicts

and prolonged legal battles26.

Considerable progress has been made in the field of crystal energy

landscaping (i.e. calculating the thermodynamically feasible crystal

structures within an energy landscape of possible polymorphs)27. This

procedure can guide expensive and time-consuming experimental

screening approaches for solid forms if thermodynamic and kinetic

factors are both taken into consideration28,29. Although numerous ab

initio predictive methodologies have been developed to deal with

increasingly complex challenges (flexible conformers, multicomponent

crystals), it is not yet possible to rely on such approaches without

carrying out experimental investigations. Figure 1 shows the contrast

between two biologically active molecules that present completely

different behaviours in terms of experimental polymorphism. Indeed,

while Chlorpropamide is reported to form at least 5 different crystalline



forms30, the cytotoxic Thiazolidinone is only reported to have a single

polymorph in the Cambridge Structural Database (CSD)31.

The CSD does not contain a complete record of the full extent of polymorphism for all chemical entries;32 rather, it includes the entries that have been reported in the literature or deposited directly by researchers with the Cambridge Crystallographic Data Centre (CCDC). Thus, it is possible that some molecules with a single crystal structure entry would prove highly polymorphic if subjected to extensive experimental polymorph screening.

The Random Forest method, in particular, has previously been successfully applied to the design of experimental screens: assessing the completeness of experimental screens for solvate formation33, predicting packing types from solvent properties34 and predicting the crystallisability of organic molecules35,36. However, it has not been assessed as a predictive tool for the extent of polymorphism expected from a target molecule. In this work, we exploit curated data from the intersection of the CSD and Drugbank37 and implement a metaclassifier that enables the discovery of the true extent of polymorphism in organic molecules and their potential to crystallise in new solid-state forms.

This metaclassifier is the combination of various machine learning

algorithms. Four types of datasets were exploited to build predictive

models. The most robust model was selected to identify an organic

molecule likely to exist in several solid-state forms. Experimental

validation was conducted by crystallisation screening of this

compound in 60 different solvents.
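The curation step above labels each molecule with its number of deposited forms. CSD refcodes share a six-letter family stem, with two-digit suffixes distinguishing additional determinations and forms, so counting entries per stem recovers a per-molecule form count. A minimal sketch of this labelling, using illustrative placeholder refcodes rather than the actual curated dataset:

```python
from collections import Counter

# Hypothetical sketch of the dataset-labelling step: CSD refcodes share a
# six-letter "family" stem (e.g. ABCDEF, ABCDEF01, ABCDEF02), so counting
# entries per stem gives the number of deposited forms for each molecule.
# The refcodes below are placeholders, not the curated CSD/Drug Bank data.

refcodes = ["ABCDEF", "ABCDEF01", "ABCDEF02", "GHIJKL", "MNOPQR", "MNOPQR01"]

# Group by the six-letter family stem and count entries per family.
forms_per_family = Counter(r[:6] for r in refcodes)

print(sorted(forms_per_family.items()))
# [('ABCDEF', 3), ('GHIJKL', 1), ('MNOPQR', 2)]
```

Note that this count reflects deposited entries rather than true polymorph counts; redeterminations of the same form would need to be filtered out in practice.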

Case study 1: Polymorphism prediction with 2D descriptors and with dimensionality reduction

The nine statistical models (i.e. the eight machine learning models and

the Prediction Fusion model) generated for the dataset of 2D structures

with dimensionality reduction are summarized in Figure 2.A. The

comparison between the different models showed that all the algorithms

were successful in reaching acceptable prediction accuracy (>60% for a six-class classification problem, where random guessing gives 100/6 ≈ 16.7%, the probability of predicting the correct number of polymorphs by chance), except for the Naïve Bayes

Multinomial and the multilayer Perceptron algorithm. k-Nearest

Neighbours and Random Forest were the best methods with an accuracy

of 86% and 85%, respectively. The Prediction Fusion improved the

predictive capacity, demonstrating a synergistic effect of combining the

probabilities generated by the different algorithms in a single model.

The Prediction Fusion

method rendered an accuracy of 91% and a Cohen’s kappa of 90%.
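The exact fusion rule is not specified in the text; a minimal sketch assuming accuracy-weighted averaging of the per-model class probabilities (the function, weights and probability vectors below are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical sketch of a "Prediction Fusion" metaclassifier: each base model
# outputs a probability vector over the six polymorphism classes; the vectors
# are averaged, weighted by each model's validation accuracy, and the fused
# class is the argmax. Accuracy-weighted averaging is one common choice; the
# paper does not state the precise rule.

def fuse_predictions(prob_vectors, accuracies):
    """prob_vectors: list of per-model probability lists over the 6 classes."""
    total = sum(accuracies)
    n_classes = len(prob_vectors[0])
    fused = [
        sum(w * p[c] for w, p in zip(accuracies, prob_vectors)) / total
        for c in range(n_classes)
    ]
    return fused.index(max(fused)) + 1  # classes are 1..6 polymorphs

# Two models disagree; the more accurate one (k-NN at 86%) dominates.
knn_probs = [0.10, 0.70, 0.10, 0.05, 0.03, 0.02]
nb_probs = [0.60, 0.20, 0.10, 0.05, 0.03, 0.02]
print(fuse_predictions([knn_probs, nb_probs], [0.86, 0.09]))  # → 2
```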

The confusion matrix depicted in Table 1 explains the high accuracy of the Prediction Fusion model: the diagonal of the matrix, which corresponds to correct predictions, is heavily populated.

Table 1. Confusion matrix from the fused predictive model of case study 1 (rows: predicted number of polymorphs; columns: experimental number of polymorphs)

                  1    2    3    4    5    6
Predicted 1      55   17    7    6    6    3
Predicted 2       3   88    0    3    0    0
Predicted 3       0    0   94    0    0    0
Predicted 4       1    1    1   91    0    0
Predicted 5       0    0    0    0   94    0
Predicted 6       0    0    0    0    0   94
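The quoted 91% accuracy and 90% Cohen's kappa can be reproduced directly from the Table 1 counts; a short sketch of the standard calculation:

```python
# Compute accuracy and Cohen's kappa from a 6x6 confusion matrix.
# Counts taken from Table 1 (rows: predicted class, columns: experimental).

TABLE_1 = [
    [55, 17, 7, 6, 6, 3],
    [3, 88, 0, 3, 0, 0],
    [0, 0, 94, 0, 0, 0],
    [1, 1, 1, 91, 0, 0],
    [0, 0, 0, 0, 94, 0],
    [0, 0, 0, 0, 0, 94],
]

def accuracy_and_kappa(cm):
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / n  # p_o: diagonal fraction
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(len(cm))) for j in range(len(cm))]
    # p_e: agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

acc, kappa = accuracy_and_kappa(TABLE_1)
print(f"accuracy = {acc:.0%}, Cohen's kappa = {kappa:.0%}")
# accuracy = 91%, Cohen's kappa = 90%
```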

Table 2. Confusion matrix from the fused predictive model of case study 2 (rows: predicted number of polymorphs; columns: experimental number of polymorphs)

                  1    2    3    4    5    6
Predicted 1      90    3    0    1    0    0
Predicted 2       1   88    4    0    1    0
Predicted 3       5   23   53    8    4    1
Predicted 4       0    0    1   93    0    0
Predicted 5       0    0    0    0   94    0
Predicted 6       0    0    0    0    0   94

Table 3. Confusion matrix from the fused predictive model of case study 3 (rows: predicted number of polymorphs; columns: experimental number of polymorphs)

                  1    2    3    4    5    6
Predicted 1     334   18   20    3   14    3
Predicted 2       8  266   52   31   16   19
Predicted 3       5   13  343   11   11    9
Predicted 4       2   11    9  359    3    8
Predicted 5       0    9   12    5  359    7
Predicted 6       3    6   20    5   10  348

Table 4. Confusion matrix from the fused predictive model of case study 4 (rows: predicted number of polymorphs; columns: experimental number of polymorphs)

                  1    2    3    4    5    6
Predicted 1     366    4    6   11    2    3
Predicted 2      13  346    8   19    2    4
Predicted 3      17   13  297   35   20   10
Predicted 4      12    2   26  345    5    2
Predicted 5       4    0    5    7  376    0
Predicted 6       4    0   13   17    6  352

Case study 2: Polymorphism prediction with 2D descriptors and without dimensionality reduction

The nine models generated from the datasets of 2D structures using their

corresponding molecular descriptors were plotted in Figure 2.B. They

showed very similar performance to case study 1, indicating that dimensionality reduction did not dramatically improve model performance. The weakest models (Naïve Bayes Multinomial and Multilayer Perceptron), however, performed much better without it: their accuracies increased from 9% and 18% in case 1 to 55% and 52% in case 2, respectively. Comparison of the confusion matrices between the two cases shows the enhanced prediction for class 1: in case 2, only 4 single-form samples were misclassified, compared with 39 in case 1. It is noteworthy that the best

models from case 1, where principal components were used instead of the original molecular descriptors, gave very similar results, while the two worst models improved dramatically without them. For instance, the Multilayer Perceptron achieved 52% accuracy, against 18% when dimensionality reduction was applied. This indicates that reducing the number of dimensions did not help the already robust models and even deteriorated relatively weak models such as the Naïve Bayes Multinomial and the

Multilayer Perceptron. The confusion matrix, as depicted in Table 2,

explained the good performance of the Prediction Fusion model and

showed an enrichment of the matrix’s diagonal in samples.

Case study 3: Polymorphism prediction with crystallographic descriptors and with dimensionality reduction

Case 3 exploits the information from the dataset of 3D structures. Instead of using the crystallographic descriptors directly, Principal Component Analysis was conducted to reduce the dimensionality of the system to 9 components. Compared with the two previous cases, all


the obtained models in case 3, as illustrated in Figure 2.C, were less

robust in predicting the polymorphism of the molecules of interest than

the previous models built on the 2D structures datasets. k-NN, RF and PF remained robust predictors of polymorphism, with accuracies of 84%, 83% and 89%, respectively. This robustness can be explained by the low number of independent variables relative to the number of samples in the dataset. The corresponding confusion matrix for the Prediction Fusion model, illustrated in Table 3, explains the accuracy achieved in estimating polymorphism from 3D structures with dimensionality reduction.
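The dimensionality-reduction step in cases 1 and 3 can be sketched as standard PCA via SVD of the centred descriptor matrix; the synthetic matrix below stands in for the real crystallographic descriptors:

```python
import numpy as np

# Minimal PCA via SVD, reducing a descriptor matrix to 9 components as in
# case study 3. The matrix here is a random placeholder; the actual
# crystallographic descriptors come from the curated CSD dataset.

def pca_reduce(X, n_components=9):
    Xc = X - X.mean(axis=0)                 # centre each descriptor column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T         # project onto leading components

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))              # 500 molecules x 40 descriptors
scores = pca_reduce(X, 9)
print(scores.shape)  # (500, 9)
```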

Case study 4: Polymorphism prediction with

crystallographic descriptors and without dimensionality

reduction

The last case uses the 3D structures from the intersection of the Drug

Bank and the CSD databases and their corresponding crystallographic

descriptors such as the unit cell parameters. The performance of the

different models is plotted in Figure 2.D. Using the original crystallographic descriptors slightly improved the performance of all the models, without exception. Naïve Bayes Multinomial, Simple Logistic and the Multilayer Perceptron were the weakest predictive models. As before, k-NN, RF and PF were the best performers at estimating polymorphism. Support Vector Machine, the Ordinal Classic Classifier and Gradient Boosted Trees achieved acceptable accuracies between 60% and 75%. The Prediction Fusion

model exploited the probabilities from all the generated models with

respect to their individual accuracies. Table 4 explains the robustness of

this model through the sample-rich diagonal.

Figure 2 | Performance of the 8 independent machine learning algorithms to generate statistical models in: A- Case study 1, B- Case study 2, C- Case study 3, D-

Case study 4.

Descriptor importance

When Principal Component Analysis was not applied to the datasets of 2D and 3D structures (i.e. cases 2 and 4), it was possible to assess the importance of the independent variables and to interpret their contribution to the accuracy of the designed model. In the case of the 2D structures dataset, molecular descriptors were generated with the MOE and RDKit software packages.

After the pre-processing and filtration step, 169 molecular

descriptors were employed to build the different models. Backward selection was used in a loop with the k-NN algorithm as the accuracy assessor, because k-NN had already demonstrated good predictive performance and, unlike the Prediction Fusion, does not require other predictive models. At each iteration of the loop one descriptor was deleted, and the accuracy was measured. The most

important variables were those which significantly deteriorated the

accuracy of the model. From the best 2D structure model, the most

influential variables on the performance of the predictive models were:

Q-VSA-NEG (Total negative van der Waals surface area), Q-VSA-Pol

(Total positive van der Waals surface area). These two descriptors

belong to the partial charge descriptor class. Also influential were the molecular quantum numbers MQN3 (number of chlorines) and MQN26 (number of acyclic single valent nodes), and a_ICM (the entropy of the element distribution in the molecule). A detailed explanation of all the

previous descriptors is included in the manual of MOE and RDKit

software38,39. In the case of the crystallographic descriptors, the most

influential descriptors were the “a” and “b” parameters of the reduced

cell. This was expected because these two parameters define most of the

crystal geometry.
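The deletion loop described above can be sketched as follows, with a hand-rolled leave-one-out 1-NN standing in for the k-NN assessor and synthetic data in place of the 169 real descriptors:

```python
import numpy as np

# Sketch of the descriptor-importance loop: drop one descriptor at a time,
# re-evaluate a nearest-neighbour classifier, and rank descriptors by how
# much accuracy falls when each is removed. A tiny leave-one-out 1-NN stands
# in for the k-NN used in the paper; the data are synthetic placeholders.

def one_nn_accuracy(X, y):
    hits = 0
    for i in range(len(X)):                       # leave-one-out evaluation
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                             # exclude the query itself
        hits += y[int(np.argmin(d))] == y[i]
    return hits / len(X)

def importance_by_deletion(X, y):
    base = one_nn_accuracy(X, y)
    drops = {}
    for j in range(X.shape[1]):
        Xj = np.delete(X, j, axis=1)              # remove descriptor j
        drops[j] = base - one_nn_accuracy(Xj, y)  # big drop => important
    return drops

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=80)
X = np.column_stack([y + 0.1 * rng.normal(size=80),  # informative descriptor 0
                     rng.normal(size=80)])           # pure-noise descriptor 1
drops = importance_by_deletion(X, y)
print(drops[0] > drops[1])  # True: removing the informative descriptor hurts most
```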

In silico discovery of hidden polymorphism

The ultimate goal of this work was to discover polymorphism in

neglected or new molecules. This can be conducted from 2D or 3D

structures by applying the suitable model (i.e. one of the 4 cases

explained above). Finding the real potential of a molecule to give a

number of polymorphs has many benefits and can be exploited in several stages of solid-state research.

[Figure 2 data] Accuracy (%) of each algorithm across the four case studies (Cohen's kappa values were also plotted alongside each accuracy):

Algorithm                     A (case 1)   B (case 2)   C (case 3)   D (case 4)
Naïve Bayes Multinomial            9            44            4            5
Ordinal Classic classifier        64            63           56           61
Simple Logistic                   76            87           17           19
Multilayer Perceptron             18            52           12           16
Support Vector Machine            74            71           66           68
k-Nearest Neighbours              86            85           83           84
Random Forest                     85            85           79           83
Gradient Boosted Trees            79            79           67           74
Prediction Fusion                 91            91           85           89

For instance, this information can

be useful for an initial screening from large databases like ZINC40 and

ChEMBL41. It is also useful for compounds whose polymorphs have already been investigated, to assess whether solid forms are missing and whether further experimental screening could produce them.

We exploited our predictive models to estimate the polymorphism in a subset of 100 different 2D structures from the CSD that had not been involved in any stage of designing the predictive models. This subset represents the occurrence of polymorphism for molecules that, like the majority of molecules in the CSD, were not included in the Drug Bank. The comparison of the distribution of polymorphs in this subset according to their corresponding number of possible forms is summarised in Figure 3. k-NN, Random Forest, and Prediction Fusion

were selected as they are the most effective predictive models for

estimating polymorphism. From the current experimental observations,

it was clear that there was a dominance of the structures possessing two

different polymorphs. All the models estimated that over 40% of

the molecules in the CSD have 2 polymorphs. Currently, 54% of the

database structures were classified as possessing just two solid forms.

With the exception of k-NN, the models (RF and PF) suggested that the abundance of structures with a single solid form is currently overestimated; indeed, RF estimated that only 4% of the structures have a unique crystalline form. At present, only 3% of the database is recorded as having 4 different polymorphs, yet all the statistical models estimated a higher occurrence of 4-form structures than the current experimental record: for example, k-NN and PF predicted a 4-form occurrence of 7% and 13%, respectively.

Figure 3 | Comparison of the distribution of the number of polymorphs of molecules in the CSD database with and without applying the predictive models of polymorphism. The bar chart presents the occurrence of polymorphism in each

of the 6 classes. The blue and black dashed curves show the approximate trend of this abundance before and after applying the predictive models, respectively.

The overall shape of the polymorphism occurrence distribution was preserved across the different predictive models. This is illustrated in Figure 3 with black and blue dashed curves for the predicted and the current experimental abundance of polymorphism, respectively. Interestingly, we observed that the predicted region of high polymorph counts (i.e. 4, 5 or 6 forms per 2D structure) is wider than what has been

achieved experimentally, thus far. The current statistics of the 100-

sample subset showed that there is no structure with six polymorphs.

The same trend applies to the structures with five polymorphs. This

leads us to conclude that building predictive models based on carefully

selected molecules (i.e. structures that were heavily screened for

polymorphism in the pharmaceutical industry) enabled the discovery of

a hidden area of the chemical space.

Experimental screening

The X-ray powder diffraction patterns of the crystallised samples, depicted in Figure 4.C, provided evidence of the presence of new solid forms of pentoxifylline in addition to the already characterised form in

the CSD database. The comparison of the Pearson correlation between

the patterns identified 4 clusters as depicted in the dendrogram and the

clusters plot below. Comparison of the X-Ray diffraction patterns of the

samples crystallised in tetrahydrofuran, diethylene glycol, acetic acid

and benzylamine shows the presence of additional Bragg reflections,

which cannot be explained by the reference pattern. This is indicative

of the presence of new solid forms, but the exact nature of these new

forms is still not known. The DSC/TGA analysis, represented in Figure

4.D and 4.E, confirms these results by thermal events identified which

do not correspond to the crystallisation solvent or the thermal transition

of the reference form within the temperature range investigated. ATR-IR spectra were collected as fingerprints for the new forms and are compiled in Figure 4.F. Minor differences in the spectra can be rationalised by different orientations of the pentoxifylline molecules in space, which affect non-covalent interactions such as the

hydrogen bonds.
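The pattern comparison described above can be sketched as correlation-based distances, assuming 1 − r as the dissimilarity fed to the hierarchical clustering; the synthetic traces below stand in for measured diffractograms:

```python
import numpy as np

# Sketch of the pattern-clustering step: XRPD patterns are compared by
# Pearson correlation, and 1 - r serves as a distance suitable for
# hierarchical clustering. The "patterns" are synthetic stand-ins for
# measured diffractograms of two distinct solid forms.

rng = np.random.default_rng(2)
base_a, base_b = rng.random(300), rng.random(300)   # two distinct "forms"
patterns = np.array([
    base_a + 0.01 * rng.normal(size=300),           # replicate of form A
    base_a + 0.01 * rng.normal(size=300),           # replicate of form A
    base_b + 0.01 * rng.normal(size=300),           # form B
])

r = np.corrcoef(patterns)          # pairwise Pearson correlation matrix
distance = 1.0 - r                 # similar patterns -> distance near 0
print(distance[0, 1] < distance[0, 2])  # → True: form-A replicates cluster
```

In practice the resulting distance matrix would be passed to a linkage routine to draw the dendrogram of Figure 4.A.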

Two different datasets were extracted from the Drug Bank database and the CSD. They contain 2D and 3D structures of organic molecules and their corresponding number of polymorphs. Molecular descriptors were employed as independent variables for the 2D structures dataset, and crystallographic descriptors for the 3D structures dataset. PCA was exploited to reduce the dimensions of each of the two datasets, giving 4 different datasets in total. 8 different machine learning algorithms were applied to each dataset, and a metaclassifier was built from the probabilities estimated by each algorithm. 9 statistical models were thus generated for each dataset, with varying capability to predict the real number of experimentally achievable polymorphs. k-Nearest Neighbours and Random Forest were reliably the most robust individual models. A synergistic effect was also obtained with the metaclassifier, called the Prediction Fusion, which gave higher accuracy than the RF or k-NN models. It is also noteworthy that applying dimensionality reduction to these systems did not improve the results but slightly deteriorated them.

In addition, the most robust models were exploited to detect the most

influential descriptors on the polymorphism capability of each structure.

As expected, the reduced unit cell parameters were the most important features in the 3D structures approach. Molecular descriptors such as the total negative and positive van der Waals surface areas and the number of chlorine atoms in the molecule were among the most influential for the models built from 2D structures.

[Figure 3 data] Abundance (%) of each polymorphism class (1–6 forms):

                                     1    2    3    4    5    6
k-NN                                34   43   13    7    1    3
Random Forest                        4   90    1    4    0    1
Prediction Fusion                   21   58    3   13    1    5
Current experimental observations   33   54   10    3    0    0

Page 7: Discovery of Highly Polymorphic Organic Materials: A New


Figure 4 | Analysis of the crystallised pentoxifylline in the different solvents. A- Dendrogram of the most relevant samples screened experimentally and found in the literature, showing the similarity between the solid forms. B- Clustering plot of the selected samples distinguishing between the new form and the mixtures of the existing forms. C- X-ray powder diffraction patterns of the reference material extracted from the CSD and the selected forms from the experimental screening. D- DSC traces of the selected forms presenting the thermal events occurring during the heating of the samples. E- ATR-IR spectra of the selected new polymorphs.

The comparison between the distribution of the abundance of the number of polymorphs currently in the CSD and that predicted by the best-designed models reveals a hidden area of chemical space that was potentially underestimated and under-screened for polymorphism. In other words, these models show the real potential of any known or unknown structure to give a certain number of crystalline forms. In the present work, the most robust model successfully predicted the number of missing solid forms. This was validated experimentally by conducting a solvent screen that revealed the hidden forms. This is of paramount practical importance for crystallographers and materials engineers because, to the best of our knowledge, this is the first computational tool based on data mining and machine learning that gives experimentalists an initial guideline about the hidden potential of organic molecules to render extra solid forms not yet discovered and isolated experimentally.

Acknowledgments

The authors would like to acknowledge that this work was

carried out in the CMAC National Facility, housed within the

University of Strathclyde’s Technology and Innovation Centre,

and funded with a UKRPIF (UK Research Partnership Institute

Fund) capital award, SFC ref. H13054, from the Higher

Education Funding Council for England (HEFCE).

Keywords: Machine learning, metaclassification, polymorphism,

materials discovery, crystal structure, Artificial Intelligence

References and Notes

1. Lowe, R., Glen, R. C. & Mitchell, J. B. O. Predicting phospholipidosis using machine learning. Mol. Pharm. (2010). doi:10.1021/mp100103e

2. Hughes, L. D., Palmer, D. S., Nigsch, F. & Mitchell, J. B. O. Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR Models of Solubility, Melting Point, and Log P. J. Chem. Inf. Model. 48, 220–232 (2008).

3. Nigsch, F. et al. Melting Point Prediction Employing k -Nearest Neighbor Algorithms and Genetic Parameter Optimization. J. Chem. Inf. Model. 46, 2412–2422 (2006).

4. Tropsha, A., Gramatica, P. & Gombar, V. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb. Sci. 22, 69–77 (2003).

5. Le, T., Epa, V. C., Burden, F. R. & Winkler, D. A. Quantitative Structure–Property Relationship Modeling of Diverse Materials Properties. Chem. Rev. 112, 2889–2919 (2012).

6. Lubinski, L. et al. Evaluation criteria for the quality of published experimental data on nanomaterials and their usefulness for QSAR modelling. SAR QSAR Environ. Res. 24, 995–1008 (2013).

7. Yao, S., Shoji, T., Iwamoto, Y. & Kamei, E. Consideration of an activity of the metallocene catalyst by using molecular mechanics, molecular dynamics and QSAR. Comput. Theor. Polym. Sci. 9, 41–46 (1999).

8. Cruz, V. L. et al. 3D-QSAR study of ansa-metallocene catalytic behavior in ethylene polymerization. Polymer (Guildf). (2007). doi:10.1016/j.polymer.2007.05.081

9. Norbert, W., Durgadas, B., L., B. S. & Joachim, K. Small changes in the polymer structure influence the adsorption behavior of fibrinogen on polymer surfaces: Validation of a new rapid screening technique. J. Biomed. Mater. Res. Part A 68A, 496–503

10. K., T. Makromol. Chem. Macromol. Symp. vol. 69 1993. 4th European Polymer Federation Symposia on Polymeric Materials. Symposium Editors C. Bubeck, H. W Spiess. Acta Polym. 45, 57

11. Greaves, T. L. & Drummond, C. J. Protic Ionic Liquids: Properties and Applications. Chem. Rev. 108, 206–237 (2008).

12. Phillips, C. L. & Voth, G. A. Discovering crystals using shape matching and machine learning. Soft Matter 9, 8552 (2013).

13. Reinhart, W. F., Long, A. W., Howard, M. P., Ferguson, A. L. & Panagiotopoulos, A. Z. Machine learning for autonomous crystal structure identification. Soft Matter 13, 4733–4745 (2017).

14. Oliynyk, A. O. et al. Disentangling Structural Confusion through Machine Learning: Structure Prediction and Polymorphism of Equiatomic Ternary Phases ABC. J. Am. Chem. Soc. 139, 17870–17881 (2017).

15. Balachandran, P. V., Broderick, S. R. & Rajan, K. Identifying the ‘inorganic gene’ for high-temperature piezoelectric perovskites through statistical learning. Proc. R. Soc. London A Math. Phys. Eng. Sci. 467, 2271–2290 (2011).

16. Oliynyk, A. O. et al. High-Throughput Machine-Learning-Driven Synthesis of Full-Heusler Compounds. Chem. Mater. 28, 7324–7331 (2016).

17. Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).

18. Schön, J. C. & Jansen, M. First Step Towards Planning of Syntheses in Solid-State Chemistry: Determination of Promising Structure Candidates by Global Optimization. Angew. Chemie Int. Ed. English 35, 1286–1304 (1996).

19. Abraham, N. L. & Probert, M. I. J. A periodic genetic algorithm with real-space representation for crystal structure and polymorph prediction. Phys. Rev. B 73, 224104 (2006).

20. Montavon, G. et al. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. (2013). doi:10.1088/1367-2630/15/9/095003

21. Haleblian, J. & McCrone, W. Pharmaceutical applications of polymorphism. Journal of Pharmaceutical Sciences (1969). doi:10.1002/jps.2600580802

22. Nangia, A. Conformational polymorphism in organic crystals. Acc. Chem. Res. (2008). doi:10.1021/ar700203k

23. Brittain, H. G. Polymorphism and solvatomorphism 2010. Journal of Pharmaceutical Sciences (2012). doi:10.1002/jps.22788

24. Hilfiker, R. Polymorphism: In the Pharmaceutical Industry. Polymorphism: In the Pharmaceutical Industry (2006). doi:10.1002/3527607889

25. Le, T., Epa, V. C., Burden, F. R. & Winkler, D. A. Quantitative Structure–Property Relationship Modeling of Diverse Materials Properties. Chem. Rev. 112, 2889–2919 (2012).

26. Cabri, W., Ghetti, P., Pozzi, G. & Alpegiani, M. Polymorphisms and patent, market, and legal battles: Cefdinir case study. Organic Process Research and Development (2007). doi:10.1021/op0601060

27. Braun, D. E., McMahon, J. A., Koztecki, L. H., Price, S. L. & Reutzel-Edens, S. M. Contrasting Polymorphism of Related Small Molecule Drugs Correlated and Guided by the Computed Crystal Energy Landscape. Cryst. Growth Des. 14, 2056–2072 (2014).

28. Hulme, A. T. et al. Search for a Predicted Hydrogen Bonding Motif − A Multidisciplinary Investigation into the Polymorphism of 3-Azabicyclo[3.3.1]nonane-2,4-dione. J. Am. Chem. Soc. 129, 3649–3657 (2007).

29. Vasileiadis, M., Pantelides, C. C. & Adjiman, C. S. Prediction of the crystal structures of axitinib, a polymorphic pharmaceutical molecule. Chem. Eng. Sci. 121, 60–76 (2015).

30. Drebushchak, V. A., Drebushchak, T. N., Chukanov, N. V. & Boldyreva, E. V. Transitions among five polymorphs of chlorpropamide near the melting point. J. Therm. Anal. Calorim. (2008). doi:10.1007/s10973-007-8822-0

31. Weng, J. Q., Shen, D. L., Tan, C. X. & Liu, H. J. 2-Oxo-N-phenylthiazolidine-3-carboxamide. Acta Crystallogr. Sect. E Struct. Reports Online (2004). doi:10.1107/S1600536804008621

32. Allen, F. H. & Motherwell, W. D. S. Applications of the Cambridge Structural Database in organic chemistry and crystal chemistry. Acta Crystallogr. Sect. B Struct. Sci. 58, 407–422 (2002).

33. Johnston, A., Johnston, B. F., Kennedy, A. R. & Florence, A. J. Targeted crystallisation of novel carbamazepine solvates based on a retrospective Random Forest classification. CrystEngComm (2008). doi:10.1039/b713373a

34. Bhardwaj, R. M., Reutzel-Edens, S. M., Johnston, B. F. & Florence, A. J. A random forest model for predicting crystal packing of olanzapine solvates. CrystEngComm 20, 3947–3950 (2018).

35. Bhardwaj, R. M., Johnston, A., Johnston, B. F. & Florence, A. J. A random forest model for predicting the crystallisability of organic molecules. CrystEngComm (2015). doi:10.1039/c4ce02403f

36. Wicker, J. G. P. & Cooper, R. I. Will it crystallise? Predicting crystallinity of molecular materials. CrystEngComm (2015). doi:10.1039/c4ce01912a

37. Wishart, D. S. et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. (2018). doi:10.1093/nar/gkx1037

38. Chemical Computing Group Inc. Molecular Operating Environment (MOE), 2016.08. 1010 Sherbrooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7 (2016).

39. Landrum, G. RDKit: Open-source Cheminformatics. http://www.rdkit.org/ (2006).

40. Irwin, J. J. & Shoichet, B. K. ZINC - A free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. (2005). doi:10.1021/ci049714+

41. Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. (2017). doi:10.1093/nar/gkw1074

42. Allen, F. H. The Cambridge Structural Database: A quarter of a million crystal structures and rising. Acta Crystallogr. Sect. B Struct. Sci. (2002). doi:10.1107/S0108768102003890


43. López-Mejías, V., Kampf, J. W. & Matzger, A. J. Nonamorphism in Flufenamic Acid and a New Record for a Polymorphic Compound with Solved Structures. J. Am. Chem. Soc. 134, 9872–9875 (2012).

44. Peterson, M. L. et al. Iterative High-Throughput Polymorphism Studies on Acetaminophen and an Experimentally Derived Structure for Form III. J. Am. Chem. Soc. 124, 10958–10959 (2002).

45. Mazanetz, M. P., Marmon, R. J., Reisser, C. B. T. & Morao, I. Drug Discovery Applications for KNIME: An Open Source Data Mining Platform. Curr. Top. Med. Chem. 12, 1965–1979 (2012).

46. Jagla, B., Wiswedel, B. & Coppée, J.-Y. Extending KNIME for next-generation sequencing data analysis. Bioinformatics 27, 2907–2909 (2011).

47. Beisken, S. et al. KNIME-CDK: Workflow-driven cheminformatics. BMC Bioinformatics (2013). doi:10.1186/1471-2105-14-257

48. Chemical Computing Group Inc. Molecular Operating Environment (MOE). Sci. Comput. Instrum. (2004). doi:10.1017/CBO9781107415324.004

49. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. (2002). doi:10.1613/jair.953

50. McCallum, A. & Nigam, K. A Comparison of Event Models for Naive Bayes Text Classification. AAAI/ICML-98 Work. Learn. Text Categ. (1998). doi:10.1.1.46.1529

51. Schapire, R. E., Stone, P., McAllester, D., Littman, M. L. & Csirik, J. A. Modeling auction price uncertainty using boosting-based conditional density estimation. Mach. Learn. Work. Then Conf. (2002).

52. Landwehr, N., Hall, M. & Frank, E. Logistic model trees. Mach. Learn. (2005). doi:10.1007/s10994-005-0466-3

53. Pal, S. K. & Mitra, S. Multilayer Perceptron, Fuzzy Sets, and Classification. IEEE Trans. Neural Networks (1992). doi:10.1109/72.159058

54. Ben-Hur, A. & Weston, J. A user’s guide to support vector machines. Methods Mol. Biol. (2010). doi:10.1007/978-1-60327-241-4_13

55. Aha, D. W., Kibler, D. & Albert, M. K. Instance-Based Learning Algorithms. Mach. Learn. (1991). doi:10.1023/A:1022689900470

56. Breiman, L. Random forests. Mach. Learn. (2001). doi:10.1023/A:1010933404324

57. Si, S. et al. Gradient Boosted Decision Trees for High Dimensional Sparse Output. Icml (2017).

58. Hall, M. et al. The WEKA data mining software. SIGKDD Explor. Newsl. (2009). doi:10.1145/1656274.1656278

59. Song, L., Smola, A., Gretton, A., Bedo, J. & Borgwardt, K. Feature selection via dependence maximization. J. Mach. Learn. Res. (2012). doi:10.1145/1273496.1273600


Methods

Database curation
The datasets generated to build predictive models for the polymorphism of

organic molecules were curated from two main databases: The Drug Bank 37 (DB) and the Cambridge Structural Database42 (CSD). The choice of these two

databases was based on the complementarity of information that they provide.

In the CSD, which is approaching one million crystal structure entries, each molecule entry may have a metadata field describing its polymorphism and

whether it is available in the DB with the corresponding reference. The DB

currently has 9292 entries comprising 3189 small molecule drugs, 926 approved biotech (protein/peptide) drugs, 108 nutraceuticals and 5,069

experimental drugs. To evaluate the role of ML in predicting polymorphism in

the drug-like molecules, a subset database was created. Filters were applied to the full CSD database to identify entries that were organic molecules with

recorded polymorphs (i.e. >1 distinct crystal structure with the same formula

unit) and which also appeared within the DB. The choice of the intersection between the two databases, depicted in Figure 5, was based on the increased

likelihood that commercially available drugs will have been screened

experimentally for polymorphism and are more likely to have a complete record of the number of experimentally achievable crystallographic forms compared with

molecules investigated in academia or other industries. While it remains possible that not all forms have been reported in the public domain or structurally characterised so as to be included in the CSD, this

database still provides a useful training set to assess the potential application of

ML. From the initial CSD dataset of 15202 3D structures, only 883 structures remained after the filtration process. These three-dimensional structures can be

further categorised into 178 two-dimensional structures (SMILES). The 15202 3D

structures for the 883 molecules also include redeterminations, variable temperature studies, conformational polymorphs, polymorphic forms, co-

crystals, salts, hydrates and solvates. Therefore, a manual filter was used to

discard co-crystals and redeterminations of each molecule. Although they have chemically different compositions, salts, hydrates and solvates were kept in our datasets because they carry highly relevant information that determines the

diversity of the polymorphic forms, especially for pharmaceutical applications.
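The curation step of counting recorded crystal forms per parent molecule can be sketched in a few lines. This is an illustrative reconstruction only (the actual curation used CSD metadata and manual filtering in Knime); the `entries` refcode/SMILES pairs and both helper names are hypothetical, and SMILES are assumed to be already canonicalised (e.g. with RDKit) so all forms of one molecule share one string:

```python
from collections import Counter

def count_forms(entries):
    """Count distinct CSD structures per parent molecule.

    entries: iterable of (refcode, smiles) pairs; the SMILES are assumed
    to be pre-canonicalised so equal molecules compare equal as strings."""
    counts = Counter()
    for _refcode, smiles in entries:
        counts[smiles] += 1
    return counts

def polymorphism_class(n_forms, max_class=6):
    """Map a form count to one of the six classes used here; molecules
    with more forms (an extreme minority) are discarded (None)."""
    return n_forms if n_forms <= max_class else None

# Hypothetical refcodes: two determinations of one molecule, one of another
forms = count_forms([("AAAAAA", "CCO"), ("AAAAAA01", "CCO"), ("BBBBBB", "CCN")])
# forms["CCO"] == 2, forms["CCN"] == 1; polymorphism_class(9) is None
```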

Figure 5 | The intersection between two different databases: CSD and Drug

Bank, giving drug-like molecules that have likely been screened for

polymorphism with entries in the CSD.

Confirmation of polymorphism was provided by checking the original

published source article for each selected REFCODE entry. Although some papers claim the existence of molecules with up to 9 different polymorphs, such as flufenamic acid43, the datasets ignored these molecules as they are an extreme minority that risks heavily unbalancing the datasets. At the other extreme of the polymorphism spectrum, only molecules that are commercially available as drugs were considered "true negatives", as any drug is expected to be screened for polymorphism before it reaches the market. Several works claim the presence of 60 different solid forms for atorvastatin, but surprisingly none is reported in the CSD because no single crystal was identified and isolated.44 The final dataset used in this study considered six classes that describe chemicals in the database with 1, 2, 3, 4, 5 or 6 discrete, crystallographically different polymorphs, including solvates and salts but not co-crystals.

3 www.knime.com

Data workflow
The pre-processing of the 833 3D structures dataset, "condensed" into the 178 2D structures dataset, was carried out through a series of data transformation and cleaning steps available in the Knime3 software and described in Figure 6 (refs 45–47). The initial

data transformation was the generation of molecular descriptors for the 2D

structures dataset. This step was not performed for the 3D structures dataset because crystallographic descriptors were used instead of most of the molecular descriptors generated with the MOE48 and RDKit47 nodes available in Knime. In a second step, entries with missing descriptor values and highly correlated descriptors (i.e. correlation threshold set to 0.9) were eliminated from both datasets. Descriptors with low variance (i.e. variance upper bound set to 0.01) were discarded too. A normalisation between 0 and 1 was additionally applied to unify the scale

employed for the different descriptors. The order of samples was randomised via the "Shuffle" node. The Synthetic Minority Over-sampling Technique (SMOTE)49 was used to balance the classes of each dataset. Figure 7 illustrates in a pie chart the occurrence of the different classes (i.e. number of polymorphs) before the application of the balancing process through the SMOTE algorithm. This technique oversamples only the minority classes, taking into account the 5 nearest neighbours. In the final pre-processing stage, Principal Component

Analysis was applied to reduce the dimensionality of the dataset especially

when hundreds of descriptors were included. Therefore, starting from the two datasets (the 2D structures list and the 3D structures list), two new datasets

were generated where the principal components replaced the original molecular

or crystallographic descriptors. The number of principal components was chosen so that 95% and 90% of the variance were preserved for the 2D and the 3D structures datasets, respectively. These percentages were chosen to keep as much of the variance in the data as possible while generating a reasonable number of principal components for cases 2 and 4, described later. At the

end of the pre-processing phase, four different datasets were compiled and ready to be trained over machine learning algorithms (i.e. the two original

datasets and the two obtained with PCA).
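The descriptor-level pre-processing steps described above (low-variance filter, correlation filter, 0–1 normalisation, PCA retaining 95% of the variance) can be sketched with scikit-learn on a placeholder matrix. This is not the Knime workflow itself, and the shuffling and SMOTE steps (which require class labels) are omitted:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((60, 20))                      # placeholder descriptor matrix

# 1. Discard low-variance descriptors (variance upper bound 0.01)
X = VarianceThreshold(threshold=0.01).fit_transform(X)

# 2. Discard one of each pair of descriptors correlated above 0.9
corr = np.abs(np.corrcoef(X, rowvar=False))
keep = [i for i in range(corr.shape[0])
        if not any(corr[i, j] > 0.9 for j in range(i))]
X = X[:, keep]

# 3. Normalise every surviving descriptor to the [0, 1] range
X = MinMaxScaler().fit_transform(X)

# 4. Keep enough principal components to preserve 95% of the variance
X = PCA(n_components=0.95, svd_solver="full").fit_transform(X)
```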

Crystallographic and molecular descriptors
The four datasets have different numbers and types of descriptors, as depicted in Table 5. In the first dataset, using the 2D structures, only molecular descriptors were included. These descriptors highlight the properties of the whole

molecule, and they can be categorised into constitutional, geometrical,

topological and electronic descriptors; examples include the molecular mass, the number of carbon atoms, the polar surface area, the charge, and the count of halogen or hydrogen atoms in the molecule of interest. In the second

dataset, these descriptors were replaced by the principal components generated from the dimensionality reduction. It is noteworthy that these descriptors have

no chemical meaning but were defined to reduce the dimensionality of the

feature space and ease the task of the machine learning algorithms. In the third dataset, crystallographic descriptors were employed, including the unit cell parameters and the volume of the cell. Finally, the fourth dataset is the

dimensionally reduced dataset obtained from the application of PCA on the crystallographic descriptors (i.e. the third dataset). The table below summarises

the number and the type of descriptors used for each dataset.

[Figure 5 labels: 9292 Drug Bank entries selected; 15202 organic CSD molecules highlighting polymorphism data; 833 3D structures belonging to 178 2D structures.]
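As an illustration of the constitutional descriptors named above, a toy calculator over an element-count dictionary; this is a stand-in for the MOE/RDKit descriptor nodes actually used, and the function and dictionary names are ours:

```python
# Atomic masses for a handful of common elements (g/mol)
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
               "F": 18.998, "Cl": 35.45, "Br": 79.904, "I": 126.904}
HALOGENS = {"F", "Cl", "Br", "I"}

def constitutional_descriptors(formula):
    """formula: element -> atom count, e.g. {"C": 2, "H": 6, "O": 1}."""
    return {
        "molecular_mass": sum(ATOMIC_MASS[el] * n for el, n in formula.items()),
        "n_carbon": formula.get("C", 0),
        "n_hydrogen": formula.get("H", 0),
        "n_halogen": sum(n for el, n in formula.items() if el in HALOGENS),
        "n_chlorine": formula.get("Cl", 0),
    }

desc = constitutional_descriptors({"C": 2, "H": 6, "O": 1})   # ethanol
# desc["molecular_mass"] ≈ 46.069, desc["n_carbon"] == 2
```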


Figure 6 | Data flow and the different steps followed from data collection, through pre-processing and the generation of machine learning models, to the design of a metaclassifier to predict the polymorphism of unknown structures.

Figure 7 | The occurrence of polymorphism in the datasets and the role of the

SMOTE algorithm to balance the classes.

Choice of machine learning classifiers
Eight different algorithms were selected from Knime nodes to train the models over the 4 datasets: Naive Bayes Multinomial50, Ordinal Classic classifier51, Simple logistic52, Multilayer Perceptron53, Support Vector Machine54, k-Nearest Neighbours55, Gradient Boosted Trees57 and Random Forests56. Figure 8 demonstrates how the curated data were classified into 2D and 3D structures and how these two datasets were treated differently by applying the original features (i.e. molecular descriptors or crystallographic parameters); 4 datasets were finally obtained according to the pre-processing method. All these classifiers were implemented in Weka nodes available within the Knime software.58 All the algorithms were used with their default settings except the following. In the SVM, the "Hyper Tangent"

was chosen as the kernel, with kappa equal to 0.1 and delta equal to 0.5. In the k-Nearest Neighbours classifier, the number of neighbours was set to 6, matching the number of classes in the datasets. In the Random Forests classifier, the number of trees was set to 500. 10-fold cross-validation was applied for each

classifier. The accuracy and Cohen's kappa were used as metrics to evaluate the performance of the models. In addition, confusion matrices, recall, precision, sensitivity, specificity and F-measure were provided for each class of each

classifier.
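For readers outside Knime/Weka, the two strongest classifiers with the settings above (k = 6 neighbours, 500 trees, 10-fold cross-validation, accuracy and Cohen's kappa) can be sketched equivalently in scikit-learn; the dataset here is a synthetic stand-in, not the curated one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Synthetic stand-in for the 564-sample, 44-descriptor, 6-class dataset
X, y = make_classification(n_samples=564, n_features=44, n_informative=20,
                           n_classes=6, random_state=0)

models = {
    "k-NN (k=6)": KNeighborsClassifier(n_neighbors=6),
    "Random Forest (500 trees)": RandomForestClassifier(n_estimators=500,
                                                        random_state=0),
}
scores = {}
for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=10)   # 10-fold cross-validation
    scores[name] = {"accuracy": accuracy_score(y, pred),
                    "kappa": cohen_kappa_score(y, pred)}
```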

Meta-classifier design
Once the eight classifiers were trained over the four datasets, probabilities of classification (i.e. the probability for each of the 6 classes of the response) were generated. The "Prediction Fusion" node was employed to combine the probabilities from the different classifiers and weigh them according to the robustness of the obtained models. The weights Wi correspond to the accuracy obtained for each ML model. This is translated by the formula below:

Pfusion = (1/8) Σ_{i=1}^{8} Pi × Wi   (1)

where Pfusion is the overall probability of that particular class over the eight selected models Mi, Pi is the individual probability per model, and Wi is the scaling factor, equal to the accuracy of model Mi. As for the individual classifiers, the same metrics were employed to characterise the

overall model from the prediction fusion.
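A minimal sketch of the fusion rule of Eq. (1); the probability matrix and accuracies below are hypothetical inputs, shown for two models and three classes rather than the paper's eight models and six classes:

```python
import numpy as np

def prediction_fusion(probs, accuracies):
    """Weight each model's class probabilities by its accuracy, average
    over the models, and return the fused probabilities plus the winning
    class label (classes are labelled 1..n). The paper fixes the divisor
    at 8 models; here we divide by however many models are supplied."""
    probs = np.asarray(probs, dtype=float)    # shape (n_models, n_classes)
    w = np.asarray(accuracies, dtype=float)   # one accuracy per model
    fused = (w[:, None] * probs).sum(axis=0) / len(w)
    return fused, int(np.argmax(fused)) + 1

# Two hypothetical models disagreeing over three classes
fused, winner = prediction_fusion([[0.6, 0.3, 0.1],
                                   [0.2, 0.7, 0.1]],
                                  accuracies=[0.9, 0.4])
# fused ≈ [0.310, 0.275, 0.065] → winner == 1
```

Note that the accuracy-weighting lets a strong model (0.9) outvote a weak one (0.4) even when the weak model is more confident.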

Curation of data

•Extract the organic molecules from the CSD

•Extract the molecules from the DrugBank

•Select the overlapped compounds

•Generate 2D and 3D structures

Preprocessing

•Generate molecular descriptors (MOE and RDKit)

•Filter missing values and highly correlated or with low variance descriptors

•Normalization

•Shuffle: Randomization of the samples

•SMOTE: Balance the dataset

•Dimensionality reduction: PCA

Machine learning modeling: classification & cross-validation

•Naive Bayes Multinomial

•Ordinal Classic classifier

•Simple logistic

•Multilayer Perceptron

•Support Vector Machine

•k-Nearest Neighbours

•Random Forest

•Gradient Boosted trees

Metaclassifier design

•Weight-scaling of classification algorithms

•Prediction fusion

[Figure 7 data — before balancing: class 1: 12%, class 2: 53%, class 3: 19%, class 4: 8%, class 5: 6%, class 6: 2%; after SMOTE: classes 1–6 each 16–17%.]


Figure 8 | The partition of the curated datasets into 4 categories: 2D and 3D structures are generated, then 8 different machine learning algorithms are applied to generate statistical models from molecular/crystallographic descriptors or from the principal components of PCA.

Table 5. Number and nature of the descriptors used as independent variables to build the predictive models

                       2D structures          3D structures
                       Case 1 a   Case 2 b    Case 3 c   Case 4 d
Number of dimensions      44        169           9         12
Number of samples        564        564        2352       2352

a 2D structures treated without PCA; b 2D structures treated with PCA; c 3D structures treated without PCA (crystallographic descriptors); d 3D structures treated with PCA.

Feature selection
Backward feature selection59 was carried out in the last stage of the data analysis and machine learning model design. The classifier that rendered the best predictive model was incorporated in the feature selection loop. This

technique eliminates descriptors consecutively and builds a predictive model with the classifier of choice (i.e. one feature is removed in each iteration of the elimination loop until all features have been eliminated at the end of the loop). In each iteration, every remaining feature is tested by discarding it once; the feature whose removal degrades the accuracy the least is then deleted for the next iteration. The models are ranked according to their accuracy, and the most important features are the ones that most affect the robustness and accuracy of the model. Therefore, the most important feature is the one that gives the lowest model accuracy when it is eliminated before the training step.
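This backward elimination loop can be sketched as follows, under the standard interpretation (at each pass, permanently drop the feature whose removal costs the least cross-validated accuracy, so the features surviving longest rank as the most important); the k-NN classifier and synthetic data are stand-ins for the best model and curated dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def backward_elimination(X, y, model, n_keep):
    """Repeatedly drop the feature whose removal costs the least
    cross-validated accuracy; features that survive longest are
    ranked as the most important."""
    remaining = list(range(X.shape[1]))
    dropped = []                                # least -> more important
    while len(remaining) > n_keep:
        accs = {}
        for f in remaining:                     # try removing each feature
            cols = [g for g in remaining if g != f]
            accs[f] = cross_val_score(model, X[:, cols], y, cv=5).mean()
        least = max(accs, key=accs.get)         # its removal hurts least
        remaining.remove(least)
        dropped.append(least)
    return remaining, dropped

X, y = make_classification(n_samples=120, n_features=8, n_informative=4,
                           random_state=0)
kept, dropped = backward_elimination(X, y, KNeighborsClassifier(n_neighbors=6),
                                     n_keep=4)
```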

Solvent screening

Pentoxifylline (CAS Number 6493-05-6) was purchased from Sigma Aldrich.

Samples of 100 mg each were recrystallised in 66 different solvents, listed in the

supporting information with their available physical properties. Each solution was heated near the boiling point of the corresponding solvent to ensure the full

dissolution of the starting material. Once clear solutions were obtained, they were

left to cool to ambient temperature. They were then kept in atmospheric conditions to evaporate the solvents. The pure polymorphic forms and the

percentage compositions of all the sample mixtures were characterized by

differential scanning calorimetry–thermogravimetry (DSC-TG), XRPD, and ATR-IR spectroscopy.

Powder X-Ray Diffraction

For crystalline form identification, a small quantity (10-50 mg) of the sample

was analysed using transmission XRPD data collected on a Bruker AXS D8 Advance transmission diffractometer equipped with θ/θ geometry, with primary

monochromated radiation (Cu Kα1 λ= 1.54056 Å), a Vantec PSD and an

automated multiposition x-y sample stage. Samples were mounted on a 28-position sample plate supported on a polyimide (Kapton, 7.5 µm thickness)

film. Data were collected from each sample in the range 4-35° 2θ with a

0.015° 2θ step size and a 1 s per step count time. Samples were oscillated in the x-y plane at a speed of 0.3 mm s-1 throughout data collection to maximise particle sampling and minimise preferred orientation effects.

Thermogravimetry analysis

A Netzsch STA 449 F1 Jupiter® performed DSC and TGA simultaneously, allowing the evaporation of residual solvent to be monitored.

Differential scanning calorimetry analysis

Differential scanning calorimetry–thermogravimetric experiments were

performed on a Netzsch DSC214 Polyma differential scanning calorimeter. The

heating rate for all polymorphs was kept constant at 20°C/min and all runs were carried out from 25 °C to 250 °C. The measurements were performed in

aluminium crucibles, nitrogen was used as the purge gas in ambient mode, and

calibration was performed using indium metal. The cooling of the samples was conducted for all the samples after a temperature plateau at 250 °C.

Attenuated Total Reflectance–Infrared Spectroscopy

Attenuated total reflectance–infrared spectra were collected on a Bruker

TENSOR II FT-IR spectrometer with Opus v7.5 software. The spectrometer is

fitted with a KBr beamsplitter, which operates in the range 8000–10 cm-1 with a universal ATR accessory (‘PLATINUM’ diamond ATR-accessory), and

HYPERION (IR microscope). Spectra were collected in the 4000–650 cm-1

range with a resolution of 4.00 cm-1 and 4 scans per spectrum.


GRAPHICAL ABSTRACT

Polymorphism is a physical feature that characterises solid crystalline compounds. Its regulation is crucial for the control of other properties such as solubility or mechanical resistance. Machine learning is a modern statistical tool exploited to learn from existing data and to generate models that predict the number of solid forms a molecule is able to give. This computational tool provides experimentalists with a guideline to spot molecules with potentially underestimated polymorphism and to ease the discovery of novel materials.

GRAPHICAL ABSTRACT FIGURE

[Figure: Data curation → Machine learning → Classification → Discovery]


Supporting information

Machine learning-based approach to predict the "polymorphism" of organic compounds

Zied Hosni1,3*, Annalisa Riccardi2, Stephanie Yerdelen1, Alan R. G. Martin1, Deborah Bowering1, and Alastair J. Florence1*

1. 2D and PCA

1.1. Naive Bayes Multinomial

1.1.1. Confusion matrix

1 2 3 4 5 6

1 15 38 24 12 5 0

2 21 1 7 17 36 12

3 21 25 9 2 34 3

4 12 30 0 9 38 5

5 0 42 19 25 8 0

6 0 65 0 19 3 7

1.1.2. Accuracy statistics

Class  TP   FP   TN   FN  Recall    Precision  Sensitivity  Specificity  F-measure
1      15   54  416   79  0.159574  0.217391   0.159574     0.885106     0.184049
2       1  200  270   93  0.010638  0.004975   0.010638     0.574468     0.006780
3       9   50  420   85  0.095745  0.152542   0.095745     0.893617     0.117647
4       9   75  395   85  0.095745  0.107143   0.095745     0.840426     0.101124
5       8  116  354   86  0.085106  0.064516   0.085106     0.753191     0.073394
6       7   20  450   87  0.074468  0.259259   0.074468     0.957447     0.115702
Overall: Accuracy 0.086879, Cohen's kappa -0.09574
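The per-class counts and statistics in these tables follow mechanically from each confusion matrix (rows = true class, columns = predicted class); a minimal NumPy sketch of the derivation, checked against the Naive Bayes Multinomial matrix above:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class TP/FP/TN/FN and derived statistics from a square
    confusion matrix (rows = true class, columns = predicted class)."""
    cm = np.asarray(cm)
    total = cm.sum()
    out = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fn = cm[k].sum() - tp                  # rest of the true-class row
        fp = cm[:, k].sum() - tp               # rest of the predicted column
        tn = total - tp - fn - fp
        out.append({
            "TP": int(tp), "FP": int(fp), "TN": int(tn), "FN": int(fn),
            "recall": tp / (tp + fn) if tp + fn else None,   # == sensitivity
            "precision": tp / (tp + fp) if tp + fp else None,
            "specificity": tn / (tn + fp) if tn + fp else None,
        })
    return out

# Confusion matrix 1.1.1 above (Naive Bayes Multinomial, 2D + PCA)
nbm = [[15, 38, 24, 12,  5,  0],
       [21,  1,  7, 17, 36, 12],
       [21, 25,  9,  2, 34,  3],
       [12, 30,  0,  9, 38,  5],
       [ 0, 42, 19, 25,  8,  0],
       [ 0, 65,  0, 19,  3,  7]]
stats = per_class_metrics(nbm)
# stats[0] -> TP 15, FP 54, TN 416, FN 79, recall ≈ 0.159574
```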

1 EPSRC Future Continuous Manufacturing and Advanced Crystallisation Hub, University of Strathclyde, Technology and Innovation Centre, 99 George Street, Glasgow, U.K. G1 1RD. 2 Department of Mechanical and Aerospace Engineering, University of Strathclyde, Glasgow, U.K. G1 1XJ. 3 Centre for Computational Chemistry, School of Chemistry, Cantock's Close, Bristol, U.K. BS8 1TS. email: [email protected]; [email protected]


1.2. Ordinal Classic classifier

1.2.1. Confusion matrix

1 2 3 4 5 6

1 66 14 6 5 1 2

2 9 33 24 12 9 7

3 3 19 67 4 1 0

4 6 5 11 64 7 1

5 1 5 13 18 57 0

6 0 3 0 0 15 76

1.2.2. Accuracy statistics

Class  TP  FP   TN   FN  Recall    Precision  Sensitivity  Specificity  F-measure
1      66  19  451   28  0.702128  0.776471   0.702128     0.959574     0.737430
2      33  46  424   61  0.351064  0.417722   0.351064     0.902128     0.381503
3      67  54  416   27  0.712766  0.553719   0.712766     0.885106     0.623256
4      64  39  431   30  0.680851  0.621359   0.680851     0.917021     0.649746
5      57  33  437   37  0.606383  0.633333   0.606383     0.929787     0.619565
6      76  10  460   18  0.808511  0.883721   0.808511     0.978723     0.844444
Overall: Accuracy 0.643617, Cohen's kappa 0.572340

1.3. Simple logistic

1.3.1. Confusion matrix

1 2 3 4 5 6

1 74 6 11 1 1 1

2 12 30 22 11 12 7

3 10 16 55 0 13 0

4 0 0 0 90 4 0

5 0 4 5 0 85 0

6 0 0 0 0 0 94

1.3.2. Accuracy statistics

Class  TP  FP   TN   FN  Recall    Precision  Sensitivity  Specificity  F-measure
1      74  22  448   20  0.787234  0.770833   0.787234     0.953191     0.778947
2      30  26  444   64  0.319149  0.535714   0.319149     0.944681     0.400000
3      55  38  432   39  0.585106  0.591398   0.585106     0.919149     0.588235
4      90  12  458    4  0.957447  0.882353   0.957447     0.974468     0.918367
5      85  30  440    9  0.904255  0.739130   0.904255     0.936170     0.813397
6      94   8  462    0  1.000000  0.921569   1.000000     0.982979     0.959184
Overall: Accuracy 0.758865, Cohen's kappa 0.710638

1.4. Multilayer Perceptron

1.4.1. Confusion matrix

1 2 3 4 5 6

1 3 4 0 87 0 0

2 5 2 0 84 0 3

3 9 0 0 84 0 1

4 0 0 0 94 0 0

5 0 3 0 90 1 0

6 0 0 0 93 0 1

1.4.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 3 14 456 91 0.031915 0.176471 0.031915 0.970213 0.054054 ? ?
2 2 7 463 92 0.021277 0.222222 0.021277 0.985106 0.038835 ? ?
3 0 0 470 94 0.000000 ? 0.000000 1.000000 ? ? ?
4 94 438 32 0 1.000000 0.176692 1.000000 0.068085 0.300319 ? ?
5 1 0 470 93 0.010638 1.000000 0.010638 1.000000 0.021053 ? ?
6 1 4 466 93 0.010638 0.200000 0.010638 0.991489 0.020202 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.179078 0.014894

1.5. Support Vector Machine

1.5.1. Confusion matrix

1 2 3 4 5 6

1 51 20 2 21 0 0

2 0 91 3 0 0 0

3 0 32 50 12 0 0

4 0 13 1 80 0 0

5 5 11 0 0 57 21

6 0 3 0 0 0 91

1.5.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 51 5 465 43 0.542553 0.910714 0.542553 0.989362 0.680000 ? ?
2 91 79 391 3 0.968085 0.535294 0.968085 0.831915 0.689394 ? ?
3 50 6 464 44 0.531915 0.892857 0.531915 0.987234 0.666667 ? ?
4 80 33 437 14 0.851064 0.707965 0.851064 0.929787 0.772947 ? ?
5 57 0 470 37 0.606383 1.000000 0.606383 1.000000 0.754967 ? ?
6 91 21 449 3 0.968085 0.812500 0.968085 0.955319 0.883495 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.744681 0.693617

1.6. k-nearest neighbors

1.6.1. Confusion matrix

1 2 3 4 5 6

1 92 0 1 1 0 0

2 11 24 24 14 17 4

3 2 1 91 0 0 0

4 0 4 0 90 0 0

5 0 0 0 0 94 0

6 0 0 0 0 0 94

1.6.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 92 13 457 2 0.978723 0.876190 0.978723 0.972340 0.924623 ? ?
2 24 5 465 70 0.255319 0.827586 0.255319 0.989362 0.390244 ? ?
3 91 25 445 3 0.968085 0.784483 0.968085 0.946809 0.866667 ? ?
4 90 15 455 4 0.957447 0.857143 0.957447 0.968085 0.904523 ? ?
5 94 17 453 0 1.000000 0.846847 1.000000 0.963830 0.917073 ? ?
6 94 4 466 0 1.000000 0.959184 1.000000 0.991489 0.979167 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.859929 0.831915

1.7. Random Forest

1.7.1. Confusion matrix

1 2 3 4 5 6

1 88 0 5 1 0 0

2 9 43 26 12 4 0

3 4 5 81 0 4 0

4 1 8 0 83 2 0

5 0 1 1 0 92 0

6 0 0 0 0 0 94


1.7.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 88 14 456 6 0.936170 0.862745 0.936170 0.970213 0.897959 ? ?
2 43 14 456 51 0.457447 0.754386 0.457447 0.970213 0.569536 ? ?
3 81 32 438 13 0.861702 0.716814 0.861702 0.931915 0.782609 ? ?
4 83 13 457 11 0.882979 0.864583 0.882979 0.972340 0.873684 ? ?
5 92 10 460 2 0.978723 0.901961 0.978723 0.978723 0.938776 ? ?
6 94 0 470 0 1.000000 1.000000 1.000000 1.000000 1.000000 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.852837 0.823404

1.8. Gradient Boosted trees

1.8.1. Confusion matrix

1 2 3 4 5 6

1 75 12 3 3 1 0

2 5 53 19 9 4 4

3 5 12 73 1 1 2

4 0 13 1 77 3 0

5 0 8 9 0 77 0

6 0 4 0 0 0 90

1.8.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 75 10 460 19 0.797872 0.882353 0.797872 0.978723 0.837989 ? ?
2 53 49 421 41 0.563830 0.519608 0.563830 0.895745 0.540816 ? ?
3 73 32 438 21 0.776596 0.695238 0.776596 0.931915 0.733668 ? ?
4 77 13 457 17 0.819149 0.855556 0.819149 0.972340 0.836957 ? ?
5 77 9 461 17 0.819149 0.895349 0.819149 0.980851 0.855556 ? ?
6 90 6 464 4 0.957447 0.937500 0.957447 0.987234 0.947368 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.789007 0.746809


1.9. Prediction fusion

1.9.1. Confusion matrix

1 2 3 4 5 6

1 55 17 7 6 6 3

2 3 88 0 3 0 0

3 0 0 94 0 0 0

4 1 1 1 91 0 0

5 0 0 0 0 94 0

6 0 0 0 0 0 94

1.9.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 55 4 466 39 0.585106 0.932203 0.585106 0.991489 0.718954 ? ?
2 88 18 452 6 0.936170 0.830189 0.936170 0.961702 0.880000 ? ?
3 94 8 462 0 1.000000 0.921569 1.000000 0.982979 0.959184 ? ?
4 91 9 461 3 0.968085 0.910000 0.968085 0.980851 0.938144 ? ?
5 94 6 464 0 1.000000 0.940000 1.000000 0.987234 0.969072 ? ?
6 94 3 467 0 1.000000 0.969072 1.000000 0.993617 0.984293 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.914894 0.897872
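The Overall row of each table reports accuracy and Cohen's kappa computed over the full confusion matrix. A minimal sketch (not the authors' code) that reproduces the overall row of the prediction-fusion table above from its confusion matrix (rows = true class, columns = predicted class):

```python
# Overall accuracy and Cohen's kappa from a confusion matrix.
def accuracy_and_kappa(cm):
    n = sum(sum(row) for row in cm)
    po = sum(cm[i][i] for i in range(len(cm))) / n   # observed agreement (accuracy)
    col = [sum(row[j] for row in cm) for j in range(len(cm))]
    # chance agreement: sum over classes of (row total * column total) / n^2
    pe = sum(sum(cm[i]) * col[i] for i in range(len(cm))) / n**2
    return po, (po - pe) / (1 - pe)

# Prediction-fusion confusion matrix, 1D descriptors (section 1.9.1)
cm = [[55, 17, 7, 6, 6, 3],
      [3, 88, 0, 3, 0, 0],
      [0, 0, 94, 0, 0, 0],
      [1, 1, 1, 91, 0, 0],
      [0, 0, 0, 0, 94, 0],
      [0, 0, 0, 0, 0, 94]]
acc, kappa = accuracy_and_kappa(cm)
print(round(acc, 6), round(kappa, 6))
```

Rounded to six decimals these match the tabulated 0.914894 and 0.897872.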

2. 2D

2.1. Naive Bayes Multinomial

2.1.1. Confusion matrix

1 2 3 4 5 6

1 38 8 28 9 11 0

2 16 16 22 15 19 6

3 14 9 35 3 29 4

4 4 16 0 35 39 0

5 0 11 10 17 56 0

6 0 0 0 12 13 69

2.1.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 38 34 436 56 0.404255 0.527778 0.404255 0.927660 0.457831 ? ?
2 16 44 426 78 0.170213 0.266667 0.170213 0.906383 0.207792 ? ?
3 35 60 410 59 0.372340 0.368421 0.372340 0.872340 0.370370 ? ?
4 35 56 414 59 0.372340 0.384615 0.372340 0.880851 0.378378 ? ?
5 56 111 359 38 0.595745 0.335329 0.595745 0.763830 0.429119 ? ?
6 69 10 460 25 0.734043 0.873418 0.734043 0.978723 0.797688 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.441489 0.329787

2.2. Ordinal Classic classifier

2.2.1. Confusion matrix

1 2 3 4 5 6

1 58 17 12 6 1 0

2 8 37 26 14 8 1

3 11 19 50 8 6 0

4 5 10 19 55 3 2

5 1 1 15 12 65 0

6 0 1 0 0 0 93

2.2.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 58 25 445 36 0.617021 0.698795 0.617021 0.946809 0.655367 ? ?
2 37 48 422 57 0.393617 0.435294 0.393617 0.897872 0.413408 ? ?
3 50 72 398 44 0.531915 0.409836 0.531915 0.846809 0.462963 ? ?
4 55 40 430 39 0.585106 0.578947 0.585106 0.914894 0.582011 ? ?
5 65 18 452 29 0.691489 0.783133 0.691489 0.961702 0.734463 ? ?
6 93 3 467 1 0.989362 0.968750 0.989362 0.993617 0.978947 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.634752 0.561702

2.3. Simple logistic

2.3.1. Confusion matrix

1 2 3 4 5 6

1 91 0 3 0 0 0

2 15 31 27 8 10 3

3 2 4 88 0 0 0

4 0 0 0 94 0 0

5 0 0 0 0 94 0

6 0 0 0 0 0 94


2.3.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 91 17 453 3 0.968085 0.842593 0.968085 0.963830 0.900990 ? ?
2 31 4 466 63 0.329787 0.885714 0.329787 0.991489 0.480620 ? ?
3 88 30 440 6 0.936170 0.745763 0.936170 0.936170 0.830189 ? ?
4 94 8 462 0 1.000000 0.921569 1.000000 0.982979 0.959184 ? ?
5 94 10 460 0 1.000000 0.903846 1.000000 0.978723 0.949495 ? ?
6 94 3 467 0 1.000000 0.969072 1.000000 0.993617 0.984293 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.872340 0.846809

2.4. Multilayer Perceptron

2.4.1. Confusion matrix

1 2 3 4 5 6

1 52 39 2 1 0 0

2 15 53 16 6 2 2

3 3 45 46 0 0 0

4 0 47 0 47 0 0

5 3 66 24 0 0 1

6 0 0 0 0 0 94

2.4.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 52 21 449 42 0.553191 0.712329 0.553191 0.955319 0.622754 ? ?
2 53 197 273 41 0.563830 0.212000 0.563830 0.580851 0.308140 ? ?
3 46 42 428 48 0.489362 0.522727 0.489362 0.910638 0.505495 ? ?
4 47 7 463 47 0.500000 0.870370 0.500000 0.985106 0.635135 ? ?
5 0 2 468 94 0.000000 0.000000 0.000000 0.995745 NaN ? ?
6 94 3 467 0 1.000000 0.969072 1.000000 0.993617 0.984293 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.517730 0.421277

2.5. Support Vector Machine

2.5.1. Confusion matrix

1 2 3 4 5 6


1 50 21 23 0 0 0

2 4 90 0 0 0 0

3 1 32 33 28 0 0

4 0 14 0 80 0 0

5 0 11 0 5 57 21

6 1 2 0 0 0 91

2.5.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 50 6 464 44 0.531915 0.892857 0.531915 0.987234 0.666667 ? ?
2 90 80 390 4 0.957447 0.529412 0.957447 0.829787 0.681818 ? ?
3 33 23 447 61 0.351064 0.589286 0.351064 0.951064 0.440000 ? ?
4 80 33 437 14 0.851064 0.707965 0.851064 0.929787 0.772947 ? ?
5 57 0 470 37 0.606383 1.000000 0.606383 1.000000 0.754967 ? ?
6 91 21 449 3 0.968085 0.812500 0.968085 0.955319 0.883495 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.710993 0.653191

2.6. k-nearest neighbors

2.6.1. Confusion matrix

1 2 3 4 5 6

1 90 0 3 0 1 0

2 9 27 30 13 10 5

3 0 0 92 0 2 0

4 1 0 0 89 4 0

5 2 0 3 0 89 0

6 0 0 0 0 0 94

2.6.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 90 12 458 4 0.957447 0.882353 0.957447 0.974468 0.918367 ? ?
2 27 0 470 67 0.287234 1.000000 0.287234 1.000000 0.446281 ? ?
3 92 36 434 2 0.978723 0.718750 0.978723 0.923404 0.828829 ? ?
4 89 13 457 5 0.946809 0.872549 0.946809 0.972340 0.908163 ? ?
5 89 17 453 5 0.946809 0.839623 0.946809 0.963830 0.890000 ? ?
6 94 5 465 0 1.000000 0.949495 1.000000 0.989362 0.974093 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.852837 0.823404

2.7. Random Forest

2.7.1. Confusion matrix

1 2 3 4 5 6

1 79 2 7 6 0 0

2 12 48 21 9 3 1

3 2 7 81 1 3 0

4 1 2 0 89 2 0

5 0 3 1 1 89 0

6 0 1 0 0 1 92

2.7.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 79 15 455 15 0.840426 0.840426 0.840426 0.968085 0.840426 ? ?
2 48 15 455 46 0.510638 0.761905 0.510638 0.968085 0.611465 ? ?
3 81 29 441 13 0.861702 0.736364 0.861702 0.938298 0.794118 ? ?
4 89 17 453 5 0.946809 0.839623 0.946809 0.963830 0.890000 ? ?
5 89 9 461 5 0.946809 0.908163 0.946809 0.980851 0.927083 ? ?
6 92 1 469 2 0.978723 0.989247 0.978723 0.997872 0.983957 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.847518 0.817021

2.8. Gradient Boosted trees

2.8.1. Confusion matrix

1 2 3 4 5 6

1 68 9 11 6 0 0

2 7 48 27 6 4 2

3 6 9 76 0 3 0

4 2 5 5 82 0 0

5 0 5 12 0 77 0

6 0 0 1 0 0 93

2.8.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 68 15 455 26 0.723404 0.819277 0.723404 0.968085 0.768362 ? ?
2 48 28 442 46 0.510638 0.631579 0.510638 0.940426 0.564706 ? ?
3 76 56 414 18 0.808511 0.575758 0.808511 0.880851 0.672566 ? ?
4 82 12 458 12 0.872340 0.872340 0.872340 0.974468 0.872340 ? ?
5 77 7 463 17 0.819149 0.916667 0.819149 0.985106 0.865169 ? ?
6 93 2 468 1 0.989362 0.978947 0.989362 0.995745 0.984127 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.787234 0.744681

2.9. Prediction fusion

2.9.1. Confusion matrix

1 2 3 4 5 6

1 90 3 0 1 0 0

2 1 88 4 0 1 0

3 5 23 53 8 4 1

4 0 0 1 93 0 0

5 0 0 0 0 94 0

6 0 0 0 0 0 94

2.9.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 90 6 464 4 0.957447 0.937500 0.957447 0.987234 0.947368 ? ?
2 88 26 444 6 0.936170 0.771930 0.936170 0.944681 0.846154 ? ?
3 53 5 465 41 0.563830 0.913793 0.563830 0.989362 0.697368 ? ?
4 93 9 461 1 0.989362 0.911765 0.989362 0.980851 0.948980 ? ?
5 94 5 465 0 1.000000 0.949495 1.000000 0.989362 0.974093 ? ?
6 94 1 469 0 1.000000 0.989474 1.000000 0.997872 0.994709 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.907801 0.889362

3. 3D and PCA

3.1. Naive Bayes Multinomial

3.1.1. Confusion matrix

1 2 3 4 5 6

1 52 17 53 58 100 112

2 50 0 1 69 0 272


3 41 26 0 45 3 277

4 12 194 38 17 60 71

5 62 45 17 80 0 188

6 2 76 54 68 164 28

3.1.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 52 167 1793 340 0.132653 0.237443 0.132653 0.914796 0.170213 ? ?
2 0 358 1602 392 0.000000 0.000000 0.000000 0.817347 NaN ? ?
3 0 163 1797 392 0.000000 0.000000 0.000000 0.916837 NaN ? ?
4 17 320 1640 375 0.043367 0.050445 0.043367 0.836735 0.046639 ? ?
5 0 327 1633 392 0.000000 0.000000 0.000000 0.833163 NaN ? ?
6 28 920 1040 364 0.071429 0.029536 0.071429 0.530612 0.041791 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.041241 -0.15051

3.2. Ordinal Classic classifier

3.2.1. Confusion matrix

1 2 3 4 5 6

1 215 79 47 24 20 7

2 20 154 93 50 56 19

3 15 37 227 62 35 16

4 3 31 73 249 30 6

5 7 21 35 68 246 15

6 14 9 24 50 63 232

3.2.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 215 59 1901 177 0.548469 0.784672 0.548469 0.969898 0.645646 ? ?
2 154 177 1783 238 0.392857 0.465257 0.392857 0.909694 0.426003 ? ?
3 227 272 1688 165 0.579082 0.454910 0.579082 0.861224 0.509540 ? ?
4 249 254 1706 143 0.635204 0.495030 0.635204 0.870408 0.556425 ? ?
5 246 204 1756 146 0.627551 0.546667 0.627551 0.895918 0.584323 ? ?
6 232 63 1897 160 0.591837 0.786441 0.591837 0.967857 0.675400 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.562500 0.475000

3.3. Simple logistic

3.3.1. Confusion matrix


1 2 3 4 5 6

1 127 19 75 58 46 67

2 23 23 42 163 51 90

3 28 34 68 89 71 102

4 16 129 91 45 75 36

5 47 25 35 79 89 117

6 0 59 38 84 171 40

3.3.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 127 114 1846 265 0.323980 0.526971 0.323980 0.941837 0.401264 ? ?
2 23 266 1694 369 0.058673 0.079585 0.058673 0.864286 0.067548 ? ?
3 68 281 1679 324 0.173469 0.194842 0.173469 0.856633 0.183536 ? ?
4 45 473 1487 347 0.114796 0.086873 0.114796 0.758673 0.098901 ? ?
5 89 414 1546 303 0.227041 0.176938 0.227041 0.788776 0.198883 ? ?
6 40 412 1548 352 0.102041 0.088496 0.102041 0.789796 0.094787 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.166667 0

3.4. Multilayer Perceptron

3.4.1. Confusion matrix

1 2 3 4 5 6

1 154 46 91 38 62 1

2 108 31 12 77 156 8

3 52 69 20 128 121 2

4 284 8 10 21 66 3

5 119 9 2 197 63 2

6 330 4 18 2 38 0

3.4.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 154 893 1067 238 0.392857 0.147087 0.392857 0.544388 0.214038 ? ?
2 31 136 1824 361 0.079082 0.185629 0.079082 0.930612 0.110912 ? ?
3 20 133 1827 372 0.051020 0.130719 0.051020 0.932143 0.073394 ? ?
4 21 442 1518 371 0.053571 0.045356 0.053571 0.774490 0.049123 ? ?
5 63 443 1517 329 0.160714 0.124506 0.160714 0.773980 0.140312 ? ?
6 0 16 1944 392 0.000000 0.000000 0.000000 0.991837 NaN ? ?
Overall ? ? ? ? ? ? ? ? ? 0.122874 -0.05255

3.5. Support Vector Machine

3.5.1. Confusion matrix

1 2 3 4 5 6

1 235 37 5 27 88 0

2 0 332 60 0 0 0

3 0 154 137 0 101 0

4 0 58 15 208 0 111

5 0 94 16 0 282 0

6 0 31 2 0 0 359

3.5.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 235 0 1960 157 0.599490 1.000000 0.599490 1.000000 0.749601 ? ?
2 332 374 1586 60 0.846939 0.470255 0.846939 0.809184 0.604736 ? ?
3 137 98 1862 255 0.349490 0.582979 0.349490 0.950000 0.437002 ? ?
4 208 27 1933 184 0.530612 0.885106 0.530612 0.986224 0.663477 ? ?
5 282 189 1771 110 0.719388 0.598726 0.719388 0.903571 0.653534 ? ?
6 359 111 1849 33 0.915816 0.763830 0.915816 0.943367 0.832947 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.660289 0.592347

3.6. k-nearest neighbors

3.6.1. Confusion matrix

1 2 3 4 5 6

1 339 19 15 2 10 7

2 16 226 60 39 29 22

3 7 13 342 11 9 10

4 2 15 7 356 6 6

5 4 7 12 5 352 12

6 7 8 25 6 15 331

3.6.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 339 36 1924 53 0.864796 0.904000 0.864796 0.981633 0.883963 ? ?
2 226 62 1898 166 0.576531 0.784722 0.576531 0.968367 0.664706 ? ?
3 342 119 1841 50 0.872449 0.741866 0.872449 0.939286 0.801876 ? ?
4 356 63 1897 36 0.908163 0.849642 0.908163 0.967857 0.877928 ? ?
5 352 69 1891 40 0.897959 0.836105 0.897959 0.964796 0.865929 ? ?
6 331 57 1903 61 0.844388 0.853093 0.844388 0.970918 0.848718 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.827381 0.792857

3.7. Random Forest

3.7.1. Confusion matrix

1 2 3 4 5 6

1 311 30 26 7 13 5

2 16 236 45 40 36 19

3 14 13 323 11 14 17

4 0 17 18 338 9 10

5 2 19 19 6 332 14

6 3 22 28 9 20 310

3.7.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 311 35 1925 81 0.793367 0.898844 0.793367 0.982143 0.842818 ? ?
2 236 101 1859 156 0.602041 0.700297 0.602041 0.948469 0.647462 ? ?
3 323 136 1824 69 0.823980 0.703704 0.823980 0.930612 0.759107 ? ?
4 338 73 1887 54 0.862245 0.822384 0.862245 0.962755 0.841843 ? ?
5 332 92 1868 60 0.846939 0.783019 0.846939 0.953061 0.813725 ? ?
6 310 65 1895 82 0.790816 0.826667 0.790816 0.966837 0.808344 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.786565 0.743878

3.8. Gradient Boosted trees

3.8.1. Confusion matrix

1 2 3 4 5 6

1 250 44 62 8 18 10

2 17 213 55 49 45 13

3 18 33 269 19 31 22

4 6 51 25 279 19 12

5 5 25 30 9 303 20


6 9 36 34 13 27 273

3.8.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 250 55 1905 142 0.637755 0.819672 0.637755 0.971939 0.717360 ? ?
2 213 189 1771 179 0.543367 0.529851 0.543367 0.903571 0.536524 ? ?
3 269 206 1754 123 0.686224 0.566316 0.686224 0.894898 0.620531 ? ?
4 279 98 1862 113 0.711735 0.740053 0.711735 0.950000 0.725618 ? ?
5 303 140 1820 89 0.772959 0.683973 0.772959 0.928571 0.725749 ? ?
6 273 77 1883 119 0.696429 0.780000 0.696429 0.960714 0.735849 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.674745 0.609694

3.9. Prediction fusion

3.9.1. Confusion matrix

1 2 3 4 5 6

1 334 18 20 3 14 3

2 8 266 52 31 16 19

3 5 13 343 11 11 9

4 2 11 9 359 3 8

5 0 9 12 5 359 7

6 3 6 20 5 10 348

3.9.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 334 18 1942 58 0.852041 0.948864 0.852041 0.990816 0.897849 ? ?
2 266 57 1903 126 0.678571 0.823529 0.678571 0.970918 0.744056 ? ?
3 343 113 1847 49 0.875000 0.752193 0.875000 0.942347 0.808962 ? ?
4 359 55 1905 33 0.915816 0.867150 0.915816 0.971939 0.890819 ? ?
5 359 54 1906 33 0.915816 0.869249 0.915816 0.972449 0.891925 ? ?
6 348 46 1914 44 0.887755 0.883249 0.887755 0.976531 0.885496 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.854167 0.825000


4. 3D

4.1. Naive Bayes Multinomial

4.1.1. Confusion matrix

1 2 3 4 5 6

1 19 0 127 0 132 114

2 45 0 7 15 9 316

3 45 42 9 0 148 148

4 42 73 41 51 79 106

5 89 6 133 0 6 158

6 62 3 136 0 166 25

4.1.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 19 283 1677 373 0.048469 0.062914 0.048469 0.855612 0.054755 ? ?
2 0 124 1836 392 0.000000 0.000000 0.000000 0.936735 NaN ? ?
3 9 444 1516 383 0.022959 0.019868 0.022959 0.773469 0.021302 ? ?
4 51 15 1945 341 0.130102 0.772727 0.130102 0.992347 0.222707 ? ?
5 6 534 1426 386 0.015306 0.011111 0.015306 0.727551 0.012876 ? ?
6 25 842 1118 367 0.063776 0.028835 0.063776 0.570408 0.039714 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.046769 -0.14388

4.2. Ordinal Classic classifier

4.2.1. Confusion matrix

1 2 3 4 5 6

1 218 94 45 17 9 9

2 23 215 68 38 35 13

3 9 45 262 34 33 9

4 10 15 62 272 31 2

5 8 23 49 59 240 13

6 10 23 21 23 91 224

4.2.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 218 60 1900 174 0.556122 0.784173 0.556122 0.969388 0.650746 ? ?
2 215 200 1760 177 0.548469 0.518072 0.548469 0.897959 0.532838 ? ?
3 262 245 1715 130 0.668367 0.516765 0.668367 0.875000 0.582870 ? ?
4 272 171 1789 120 0.693878 0.613995 0.693878 0.912755 0.651497 ? ?
5 240 199 1761 152 0.612245 0.546697 0.612245 0.898469 0.577617 ? ?
6 224 46 1914 168 0.571429 0.829630 0.571429 0.976531 0.676737 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.608418 0.530102

4.3. Simple logistic

4.3.1. Confusion matrix

1 2 3 4 5 6

1 138 26 73 43 36 76

2 28 29 80 122 39 94

3 19 98 65 44 82 84

4 16 127 43 98 71 37

5 34 25 72 49 80 132

6 31 48 58 48 170 37

4.3.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 138 128 1832 254 0.352041 0.518797 0.352041 0.934694 0.419453 ? ?
2 29 324 1636 363 0.073980 0.082153 0.073980 0.834694 0.077852 ? ?
3 65 326 1634 327 0.165816 0.166240 0.165816 0.833673 0.166028 ? ?
4 98 306 1654 294 0.250000 0.242574 0.250000 0.843878 0.246231 ? ?
5 80 398 1562 312 0.204082 0.167364 0.204082 0.796939 0.183908 ? ?
6 37 423 1537 355 0.094388 0.080435 0.094388 0.784184 0.086854 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.190051 0.028061

4.4. Multilayer Perceptron

4.4.1. Confusion matrix

1 2 3 4 5 6

1 76 148 151 14 1 2

2 1 74 190 118 9 0

3 1 140 172 72 4 3

4 0 141 190 51 5 5


5 1 216 110 54 9 2

6 0 164 166 61 0 1

4.4.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 76 3 1957 316 0.193878 0.962025 0.193878 0.998469 0.322718 ? ?
2 74 809 1151 318 0.188776 0.083805 0.188776 0.587245 0.116078 ? ?
3 172 807 1153 220 0.438776 0.175689 0.438776 0.588265 0.250912 ? ?
4 51 319 1641 341 0.130102 0.137838 0.130102 0.837245 0.133858 ? ?
5 9 19 1941 383 0.022959 0.321429 0.022959 0.990306 0.042857 ? ?
6 1 12 1948 391 0.002551 0.076923 0.002551 0.993878 0.004938 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.162840 -0.00459

4.5. Support Vector Machine

4.5.1. Confusion matrix

1 2 3 4 5 6

1 235 32 0 115 10 0

2 0 337 0 0 55 0

3 0 150 178 36 28 0

4 0 62 0 319 11 0

5 0 99 0 0 293 0

6 0 26 57 0 73 236

4.5.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 235 0 1960 157 0.599490 1.000000 0.599490 1.000000 0.749601 ? ?
2 337 369 1591 55 0.859694 0.477337 0.859694 0.811735 0.613843 ? ?
3 178 57 1903 214 0.454082 0.757447 0.454082 0.970918 0.567783 ? ?
4 319 151 1809 73 0.813776 0.678723 0.813776 0.922959 0.740139 ? ?
5 293 177 1783 99 0.747449 0.623404 0.747449 0.909694 0.679814 ? ?
6 236 0 1960 156 0.602041 1.000000 0.602041 1.000000 0.751592 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.679422 0.615306


4.6. k-nearest neighbors

4.6.1. Confusion matrix

1 2 3 4 5 6

1 358 12 8 2 12 0

2 14 221 58 40 33 26

3 5 13 340 6 19 9

4 0 8 7 368 7 2

5 4 4 11 3 359 11

6 3 12 14 3 19 341

4.6.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 358 26 1934 34 0.913265 0.932292 0.913265 0.986735 0.922680 ? ?
2 221 49 1911 171 0.563776 0.818519 0.563776 0.975000 0.667674 ? ?
3 340 98 1862 52 0.867347 0.776256 0.867347 0.950000 0.819277 ? ?
4 368 54 1906 24 0.938776 0.872038 0.938776 0.972449 0.904177 ? ?
5 359 90 1870 33 0.915816 0.799555 0.915816 0.954082 0.853746 ? ?
6 341 48 1912 51 0.869898 0.876607 0.869898 0.975510 0.873239 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.844813 0.813776

4.7. Random Forest

4.7.1. Confusion matrix

1 2 3 4 5 6

1 321 32 18 4 12 5

2 14 295 38 16 19 10

3 3 24 337 4 20 4

4 5 20 11 348 7 1

5 9 11 26 6 333 7

6 5 20 28 4 19 316

4.7.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 321 36 1924 71 0.818878 0.899160 0.818878 0.981633 0.857143 ? ?
2 295 107 1853 97 0.752551 0.733831 0.752551 0.945408 0.743073 ? ?
3 337 121 1839 55 0.859694 0.735808 0.859694 0.938265 0.792941 ? ?
4 348 34 1926 44 0.887755 0.910995 0.887755 0.982653 0.899225 ? ?
5 333 77 1883 59 0.849490 0.812195 0.849490 0.960714 0.830424 ? ?
6 316 27 1933 76 0.806122 0.921283 0.806122 0.986224 0.859864 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.829082 0.794898

4.8. Gradient Boosted trees

4.8.1. Confusion matrix

1 2 3 4 5 6

1 264 42 37 11 23 15

2 16 266 48 22 23 17

3 15 39 281 10 36 11

4 10 30 22 308 18 4

5 10 20 33 8 317 4

6 6 25 30 7 23 301

4.8.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 264 57 1903 128 0.673469 0.822430 0.673469 0.970918 0.740533 ? ?
2 266 156 1804 126 0.678571 0.630332 0.678571 0.920408 0.653563 ? ?
3 281 170 1790 111 0.716837 0.623060 0.716837 0.913265 0.666667 ? ?
4 308 58 1902 84 0.785714 0.841530 0.785714 0.970408 0.812665 ? ?
5 317 123 1837 75 0.808673 0.720455 0.808673 0.937245 0.762019 ? ?
6 301 51 1909 91 0.767857 0.855114 0.767857 0.973980 0.809140 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.738520 0.686224

4.9. Prediction fusion

4.9.1. Confusion matrix

1 2 3 4 5 6

1 366 4 6 11 2 3

2 13 346 8 19 2 4

3 17 13 297 35 20 10

4 12 2 26 345 5 2

5 4 0 5 7 376 0

6 4 0 13 17 6 352

4.9.2. Accuracy statistics

Class True Positive False Positive True Negative False Negative Recall Precision Sensitivity Specificity F-measure Accuracy Cohen's kappa
1 366 50 1910 26 0.933673 0.879808 0.933673 0.974490 0.905941 ? ?
2 346 19 1941 46 0.882653 0.947945 0.882653 0.990306 0.914135 ? ?
3 297 58 1902 95 0.757653 0.836620 0.757653 0.970408 0.795181 ? ?
4 345 89 1871 47 0.880102 0.794931 0.880102 0.954592 0.835351 ? ?
5 376 35 1925 16 0.959184 0.914842 0.959184 0.982143 0.936488 ? ?
6 352 19 1941 40 0.897959 0.948787 0.897959 0.990306 0.922674 ? ?
Overall ? ? ? ? ? ? ? ? ? 0.885204 0.862245


Solvent Formula MW Boiling point (°C) Melting point (°C) Solubility compound 1 Solubility compound 2 Flags
(solubility computed as L_solub(i) = w_solub * Solvent_density; the original sheet also carried freezing, Tb and 50-150 °C grid columns)

acetic acid C2H4O2 60.052 118 16.6 148.5599062 760.1772 m x
acetone C3H6O 58.079 56.05 -94.7 y l x
acetonitrile C2H3N 41.052 81.65 -43.8 162.0894225 217.3548 y m x
Anisol 155.5 -37.3 59.59661392 84.77141 y h x
benzol 205 -15 h x
benzylamine 185 10 150.8489011 149.5782 h x
bromobenzene 156 -30.8 33.32326767 58.88731 h x
bromobutane 102 -112.4 19.07537456 30.75315 y m x
1-butanol C4H10O 74.12 117.7 -88.6 41.13795169 32.82109 y m x
2-butanol C4H10O 74.12 99.5 -88.5 39.3002199 32.98498 y x
2-butanone C4H8O 72.11 79.6 -86.6 136.219786 125.7603 y x
chloroform CHCl3 119.38 61.2 -63.4 y x
Cyclopentane 49 -94 0.45797731 1.247981 y - - - - - - - - - - -
Cyclopentyl methyl ether 106 -140 y x
diethylene glycol+ C4H10O3 106.12 246 -10 44.74650612 45.00735 x
diethyl ether C4H10O 74.12 34.5 -116.2 23.15503836 26.31825 y - - - - - - - - - - -
diglyme (diethylene glycol dimethyl ether) C6H14O3 134.17 162 -68 y
Diiodomethane 182.1 6.2 x
Dimethoxypropane 83 -47 y x
dimethylformamide (DMF) C3H7NO 73.09 153 -60.48 316.3930535 256.9693 y x
dimethyl sulfoxide (DMSO) C2H6OS 78.13 189 18.4 324.1045727 248.3549 x
1,4-dioxane C4H8O2 88.11 101.1 11.8 165.182237 153.6324 x
Dodecane 214 -10 0.120077669 0.394447 x
ethanol C2H6O 46.07 78.5 -114.1 82.1520739 62.81078 y x
Ethoxy ethanol + 135 -70 55.0927549 50.1593 y x
ethyl acetate C4H8O2 88.11 77 -83.6 110.7112646 109.9329 y x
ethylene glycol C2H6O2 62.07 195 -13 x x
heptane C7H16 100.2 98 -90.6 0.208640266 0.642268 y x
Hexanone 127.6 -55 46.1061757 48.9032 y x
Iodobenzene 188 -29 27.09925893 45.97503 x
2-iodobutane 127 -103.5 11.69912152 19.14337 y x
Iodomethane 42.4 -66.5 y - - - - - - - - - - -
Isoamyl alcohol 131.1 -117 y x
Isobutylacetate 118 -99 y x
Isopropyl acetate 89 -73 57.72860428 64.49613 y x
methanol CH4O 32.04 64.6 -98 121.3200114 92.61165 y x
3-methylbutanol 132 -117 y x
methyl t-butyl ether (MTBE) C5H12O 88.15 55.2 -109 1.622897611 3.043001 y x
methylene chloride CH2Cl2 84.93 39.8 -96.7 y - - - - - - - - - - -
methylcyclohexane 101 -126 0.2619268 0.771017 y x
1-methylnaphtalene 244.4 -30 x
2-methyl-1-propanol 107 -108 y x
4-methyl-2-pentanone 117 -84 41.53932525 45.26553 y x
methyl tetrahydrofurane 78 -136 y x
nitrobenzene 210.9 5.7 112.8083749 144.0984 x
nitromethane CH3NO2 61.04 101.2 -29 425.6196295 470.6408 x
1-octanol 195 -16 11.72360395 11.18539 x
2-octanol 178.5 -38 x
2-pentanol 119 -73 28.85935717 24.81189 y x
2-phenoxyethanol 242 14 x
pentadecane 270 16 x
3-pentanol 115.3 -63 28.37206384 25.02772 y x
2-phenoxyethanol 247 -2 x
2-phenylethanol 219 -27 x
propane-1,2-diol 188 -59 y x
1-propanol C3H8O 88.15 97 -126 63.39754569 47.84395 y x
2-propanol C3H8O 88.15 82.4 -88.5 50.75476517 42.20535 y x
propyl acetate 101.6 -92 60.0559115 62.36033 y x
propylene carbonate 242 -48 y x
salicylaldehyde 197 -7 x
terbutylmethylether 55.05 -108 y x
tetrachloroethylene 121 -22 x
tetrahydrofuran (THF) C4H8O 72.106 65 -108.4 y x
toluene C7H8 92.14 110.6 -93 22.92177053 38.19096 y x
tridecane 233 -5 0.111265873 0.368231
triethylamine C6H15N 101.19 88.9 -114.7 y x
2,2,2-trifluoroethanol + 77 -44 y x
2,2,4-trimethylpentane + 99.23 -107 y x
water H2O 18.02 100 0 14.27359073 22.61968 x
o-xylene C8H10 106.17 144 -25.2 x
m-xylene C8H10 106.17 139.1 -47.8 y x
p-xylene C8H10 106.17 138.4 13.3 x