identifying pb-free perovskites for solar cells by machine

8
ARTICLE OPEN Identifying Pb-free perovskites for solar cells by machine learning Jino Im 1 , Seongwon Lee 2 , Tae-Wook Ko 2 , Hyun Woo Kim 1 , YunKyong Hyon 2 and Hyunju Chang 1 Recent advances in computing power have enabled the generation of large datasets for materials, enabling data-driven approaches to problem-solving in materials science, including materials discovery. Machine learning is a primary tool for manipulating such large datasets, predicting unknown material properties and uncovering relationships between structure and property. Among state-of-the-art machine learning algorithms, gradient-boosted regression trees (GBRT) are known to provide highly accurate predictions, as well as interpretable analysis based on the importance of features. Here, in a search for lead-free perovskites for use in solar cells, we applied the GBRT algorithm to a dataset of electronic structures for candidate halide double perovskites to predict heat of formation and bandgap. Statistical analysis of the selected features identies design guidelines for the discovery of new lead-free perovskites. npj Computational Materials (2019)5:37 ; https://doi.org/10.1038/s41524-019-0177-0 INTRODUCTION Identifying optimal materials in applications research is a time- consuming step due to the vast scope of possible materials composed of three-dimensional (3D) networks of elements selected from the periodic table. Data-driven research has recently received attention as a new route to accelerating this step. 19 This approach uses a pre-computed materials database and statistical tools that efciently screen candidates in a search for optimal materials. The availability of open-access databases of material properties, 1014 along with machine learning (ML) techniques, has rapidly advanced research in this area. Over the last decade, ML has been applied to materials science problems in a variety of directions, such as prediction and classication of crystal structures, 1,1517 development of interatomic potentials, 1820 nding of optimal density functionals for density functional theory, 2123 and building of predictive models of material properties. 2427 The use of ML in materials science, however, has been hindered by the accuracy and interpretability of predictive models. Complex interaction among compositions of materials leads to highly nonlinear relationships between material features and target properties. To accurately describe such relationships, nonlinear ML algorithms have been utilized due to their exible forms. However, lack of interpretability of most nonlinear ML predictive models prevents further mechanistic understanding such as nding key ingredients for target properties. Thus, nding ML algorithms that can achieve both accurate prediction and interpretability is crucial to the further advance of data-driven materials research. Tree-based learning algorithms can be one candidate due to their advantages in both accuracy and interpretability. 28 Utilizing tree-based algorithms, we here focus on nding optimal candidates for double perovskite solar cells. While recent solar cell technology has been prompted by the development of hybrid lead perovskites having an increase of power conversion efciency and low-cost manufacturing, the inclusion of lead ion raises environmental and health issues preventing commercializa- tion. 29,30 Alternatively, a new strategy using mixed mono- and tri-valent cations, in the form of the double perovskite A 2 B 1+ B 3+ X 6 , has been introduced to replace lead-based perovskite solar cell materials. 3134 In this approach, sizable combinations of double perovskites can be possible, and thus a combination of high- throughput computations and the ML technique can be a powerful tool to explore the large combinatorial space. Here, employing the gradient-boosted regression tree (GBRT) algorithm and a dataset of calculated electronic structures of A 2 B 1+ B 3+ X 6 , we present an ML-based investigation, which can be ultimately used to identify Pb-free double perovskite solar cell materials. The GBRT method allows us to obtain highly accurate predictive models for the heat of formation (ΔH F ) and bandgap (E g ), with importance scores for each feature of materials. Based on the scores, we extract crucial features to determine the values ΔH F and E g of halide double perovskites, enabling an overall understanding of the relationships between features and proper- ties. Finally, we discuss the relevance of extracted features to the chemical and physical aspects of ΔH F and E g , and practical approaches of the ML model toward nding optimal candidates of Pb-free halide double perovskites solar cell materials. RESULTS Dataset of Pb-free halide double perovskites For the ML investigation, we rst generated a dataset of the electronic structures of halide double perovskites. Figure 1a presents the crystal structure of the double perovskite A 2 B 1+ B 3+ X 6 . Compared to the original perovskite, this structure incorporates two different types of cations, B 1+ and B 3+ , instead of a single B Received: 4 July 2018 Accepted: 11 March 2019 1 Korea Research Institute of Chemical Technology (KRICT), Daejeon 34114, Republic of Korea and 2 National Institute for Mathematical Sciences (NIMS), Daejeon 34047, Republic of Korea Correspondence: YunKyong Hyon ([email protected]) or Hyunju Chang ([email protected]) These authors contributed equally: Jino Im, Seongwon Lee www.nature.com/npjcompumats Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences

Upload: others

Post on 23-May-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Identifying Pb-free perovskites for solar cells by machine

ARTICLE OPEN

Identifying Pb-free perovskites for solar cells by machinelearningJino Im 1, Seongwon Lee2, Tae-Wook Ko 2, Hyun Woo Kim1, YunKyong Hyon2 and Hyunju Chang 1

Recent advances in computing power have enabled the generation of large datasets for materials, enabling data-driven approachesto problem-solving in materials science, including materials discovery. Machine learning is a primary tool for manipulating suchlarge datasets, predicting unknown material properties and uncovering relationships between structure and property. Amongstate-of-the-art machine learning algorithms, gradient-boosted regression trees (GBRT) are known to provide highly accuratepredictions, as well as interpretable analysis based on the importance of features. Here, in a search for lead-free perovskites for usein solar cells, we applied the GBRT algorithm to a dataset of electronic structures for candidate halide double perovskites to predictheat of formation and bandgap. Statistical analysis of the selected features identifies design guidelines for the discovery of newlead-free perovskites.

npj Computational Materials (2019) 5:37 ; https://doi.org/10.1038/s41524-019-0177-0

INTRODUCTIONIdentifying optimal materials in applications research is a time-consuming step due to the vast scope of possible materialscomposed of three-dimensional (3D) networks of elementsselected from the periodic table. Data-driven research has recentlyreceived attention as a new route to accelerating this step.1–9 Thisapproach uses a pre-computed materials database and statisticaltools that efficiently screen candidates in a search for optimalmaterials. The availability of open-access databases of materialproperties,10–14 along with machine learning (ML) techniques, hasrapidly advanced research in this area. Over the last decade, MLhas been applied to materials science problems in a variety ofdirections, such as prediction and classification of crystalstructures,1,15–17 development of interatomic potentials,18–20

finding of optimal density functionals for density functionaltheory,21–23 and building of predictive models of materialproperties.24–27

The use of ML in materials science, however, has been hinderedby the accuracy and interpretability of predictive models. Complexinteraction among compositions of materials leads to highlynonlinear relationships between material features and targetproperties. To accurately describe such relationships, nonlinear MLalgorithms have been utilized due to their flexible forms. However,lack of interpretability of most nonlinear ML predictive modelsprevents further mechanistic understanding such as finding keyingredients for target properties. Thus, finding ML algorithms thatcan achieve both accurate prediction and interpretability is crucialto the further advance of data-driven materials research.Tree-based learning algorithms can be one candidate due to

their advantages in both accuracy and interpretability.28 Utilizingtree-based algorithms, we here focus on finding optimalcandidates for double perovskite solar cells. While recent solarcell technology has been prompted by the development of hybrid

lead perovskites having an increase of power conversion efficiencyand low-cost manufacturing, the inclusion of lead ion raisesenvironmental and health issues preventing commercializa-tion.29,30 Alternatively, a new strategy using mixed mono- andtri-valent cations, in the form of the double perovskite A2B

1+B3+X6,has been introduced to replace lead-based perovskite solar cellmaterials.31–34 In this approach, sizable combinations of doubleperovskites can be possible, and thus a combination of high-throughput computations and the ML technique can be apowerful tool to explore the large combinatorial space.Here, employing the gradient-boosted regression tree (GBRT)

algorithm and a dataset of calculated electronic structures ofA2B

1+B3+X6, we present an ML-based investigation, which can beultimately used to identify Pb-free double perovskite solar cellmaterials. The GBRT method allows us to obtain highly accuratepredictive models for the heat of formation (ΔHF) and bandgap(Eg), with importance scores for each feature of materials. Basedon the scores, we extract crucial features to determine the valuesΔHF and Eg of halide double perovskites, enabling an overallunderstanding of the relationships between features and proper-ties. Finally, we discuss the relevance of extracted features to thechemical and physical aspects of ΔHF and Eg, and practicalapproaches of the ML model toward finding optimal candidates ofPb-free halide double perovskites solar cell materials.

RESULTSDataset of Pb-free halide double perovskitesFor the ML investigation, we first generated a dataset of theelectronic structures of halide double perovskites. Figure 1apresents the crystal structure of the double perovskite A2B

1+B3+X6.Compared to the original perovskite, this structure incorporatestwo different types of cations, B1+ and B3+, instead of a single B

Received: 4 July 2018 Accepted: 11 March 2019

1Korea Research Institute of Chemical Technology (KRICT), Daejeon 34114, Republic of Korea and 2National Institute for Mathematical Sciences (NIMS), Daejeon 34047, Republicof KoreaCorrespondence: YunKyong Hyon ([email protected]) or Hyunju Chang ([email protected])These authors contributed equally: Jino Im, Seongwon Lee

www.nature.com/npjcompumats

Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences

Page 2: Identifying Pb-free perovskites for solar cells by machine

cation. With anion X, both B1+ and B3+ cations form octahedralunits. Usually, the perovskite has a structural phase transitionupon tilting and rotating of the octahedral unit. As shown in Fig.1b, two possible crystal structures can be considered: one has acubic space group; the other has an orthorhombic space group.In this study, we considered coinage elements and lower group

XIII elements for B1+, and upper group XIII and lower group XVelements for B3+. Then, substitutable combinations for di-valentlead ion could span over 30 combinations. Furthermore, weconsidered a series of alkali metals from K to Cs for the A-site (Liand Na were not considered because of their small size). Here, forsimplicity of calculation, organic molecules, such as methylammonium, were not included. Halogen ions were assigned forthe X-site. Figure 1c summarizes all combinations of chemicalconstituents. In sum, along with the two space groups of thecrystal structure, 540 hypothetical compounds of A2B

1+B3+X6 wereconsidered.Using first-principles density functional theory (DFT), we

generated a dataset, including values of ΔHF and Eg for the 540compounds. ΔHF indicates the stability of a compound comparedto those of the elemental phases of its chemical constituents.Generally, a more negative value of ΔHF indicates a more stablecompound. On the other hand, Eg can represent the capability toabsorb solar energy, which is critical to achieving high perfor-mance of a solar cell. The optimal value for solar absorption isreported to range from 1.1 eV to 1.8 eV.35 However, note that Eg isseverely underestimated in this work due to the limitation of thestandard DFT.36 Previous studies have shown that a DFT bandgapfrom 0.3 eV to 0.8 eV can recover to an optimal value of a solar cellmaterial if more accurate computation methods such as hybridDFT or GW are used.37,38 Further computational details for thesetwo quantities can be found in the method section.We can detect a few notable characteristics of ΔHF and Eg

without relying on ML analysis. In the case of ΔHF, all candidatematerials have negative values, indicating that all can be stablysynthesized (see Fig. 1d). Another prominent observation aboutΔHF is its dependence on halogen anion, which contrasts to itsweak dependence on other elements. As halogen atoms changefrom iodine to chlorine, ΔH decreases. On the other hand, therelationship between Eg and atomic species and space group ismore complicated, two remarkable characteristics of which wesummarize in the following. First, in most cases, Eg increases by

changing the space group (SG) from cubic to orthorhombic (Fig.1e). Second, it is found that Eg mostly increases as halogen atomschange from iodine to chlorine. However, in mapping between Egand materials, no other dependencies are observed.

Machine learning and featuresWe next apply the machine learning algorithms to the dataset ofhalide double perovskites. In general, machine learning studyrequires the appropriate selection of a learning algorithm and anoptimized set of input features. In this study, we employed theGradient-Boosted Regression Tree (GBRT), which is one of the tree-based machine learning algorithms. Decision tree learning is amachine learning method that uses a tree-like diagram, usually abinary tree, to predict a target variable. The goal is to create a treein which each node represents a split based on one of the inputfeatures, and each leaf represents the prediction of the targetvariable. The prediction can be nonlinear because the partitioningof the input variable space is repeated recursively.39 Compared toother ML methods, the decision tree is advantageous for itsaccuracy and speed, although it is prone to overfitting.40 Usingensemble methods such as bagging and boosting can preventoverfitting, and thus can improve the accuracy.41–43 GBRT adoptsthe gradient boosting method, which combines weak learners intoone strong learner using the gradient descent algorithm.Furthermore, the predictive model can be used to record the

improvement of prediction results for a specific feature as eachnode corresponding to a single feature is added to the trees. Inthis manner, one can measure the feature importance auto-matically.44 This feature importance fosters interpretability ofpredicted results and it leads to the extraction of key-features,which we will show later in the results. In the present work, weadopted a gradient boost method to generate a regression treeensemble that is implemented in the XGBoost library.28 Seefurther details in the method section.Another critical step for achieving good prediction performance

is the selection of appropriate input features, referred to as featureengineering. For the dataset of materials, features should clearlydescribe a single given material and, also, discriminate separatematerials. In this study, we selected 32 features, includingchemical information of atomic constituents and geometric

(a) (b)

Double perovksite structure

ABB

X

A B B X2 6

Cubic

Orthorhombic

1+ 3+

LiNaKRbCs

Be

CaSrBa

BAlGaInTl

CSiGeSnPb

NPAsSbBi

OSSeTePo

FClBrI

At

NeArKrXeRn

ScYLa

TiZrHf

VNbTa

CrMoW

MnTcRe

FeRuOs

CoRhIr

NiPdPt

CuAgAu

ZnCdHg

MgKRbCs

CuAgAu

InTl

AlGaInIn

AsSbBi

ClBrI

(c)A-siteB -site

B -siteX-site

(d)

(e)1+ 3+

K Rb Cs Cl Br ITl In Ag Au Cu Al Ga Bi In Sb As Cub. Ortho.

-1.75

-1.50

-1.25

-1.00

-0.75

-0.50

ΔH

(eV

/ato

m)

A-site B -site B -site X-site space group

K Rb Cs Cl Br ITlInAg AuCu AlGaBi InSbAs Cub. Ortho.A-site B -site B -site X-site space group

-1

0

1

2

3

4

E

(eV

)

Fig. 1 a Crystal structure of double perovskite with A, B1+, B3+, and X-sites denoted by light blue, blue, red, and gray spheres, respectively. bStructural deformation by tilting of octahedral unit presents. c List of chemical elements considered in a dataset of halide double perovskites.Distribution of calculated d heat of formation (ΔHF) and e bandgap (Eg). In each panel, average values are depicted as a white point

J. Im et al.

2

npj Computational Materials (2019) 37 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences

1234567890():,;

Page 3: Identifying Pb-free perovskites for solar cells by machine

information such as bond length and crystal symmetry. The totalof 32 features include the following:Pauling’s electronegativity (χ), ionization potential (IP), highest

occupied atomic level (h), lowest unoccupied atomic level (l), and s-,p-, and d- valence orbital radii rs, rp, and rd of isolated neutral atomsA, B1+, B3+, and X; atomic distance (D) between cations and thenearest halogen atom; space group of crystal (SG). Unlike the otherfeatures, SG is considered a categorical variable for cubic andorthorhombic symmetry.Here, we note that the GBRT algorithm cannot appropriately

evaluate the importance scores of two strongly correlatedfeatures. The reason is that two strongly correlated featurescannot be distinguished in the learning process. Thus, reducingthe dimensions of the feature space can improve the quality ofprediction while simultaneously decreasing the computing cost.Here, we implemented a dimensionality reduction based on thesquare of the Pearson correlation matrix among features. For eachpair of features x and y, the square of the correlation coefficientR2xy is defined as

R2xy ¼Pn

i¼1 ðxi � xÞðyi � yÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPni¼1 ðxi � xÞ2 Pn

i¼1 ðyi � yÞ2q ;

where x and y are the sample means of features xi and yi of the i-th material over a total of n compounds. As shown in Fig. 2, strongcorrelations are found in several pairs of features: (1) all atomicfeatures of halogen atoms at the X-site, (2) rs and rp for A-, B

1+-, orB3+-site atoms, and (3) IE and h for A-, B1+-, or B3+-site atoms. Alack of atomic variation at the X-site (only three atoms: Cl, Br, and I)led to strong correlations among all pairs of atomic features. Thesame trend was observed at the A-site, although this trend wasnot as strong as it was at the X-site. The correlation matrices wereused to downselect features of the halide double perovskitedataset. We selected χ as a representative feature of all atomicfeatures of the X-site atoms. Furthermore, we selected only rs andexcluded rp for the A-, B1+-, and B3+-site atoms. For IE and h of theA-, B1+-, or B3+-site atoms, we considered the only h. We foundthat machine learning performance is almost the same under thedimensional reduction from the Pearson correlation matrix amongfeatures.

Predictive model and feature importanceWe performed regression using GBRT to predict values of Eg andΔHF of the halide double perovskites A2B

1+B3+X6. Figure 3apresents the prediction of ΔHF. The results show that averagedroot-mean-square-error (RMSE) of test sets for ΔHF is 0.021 eV/atom. Even though the number of the current dataset was limited,it is noteworthy that the accuracy of the predictive model of ΔHF

from GBRT is comparable to the fundamental error of 0.024 eV/atom that results from differences between the experimental andDFT-based values of ΔHF of ternary oxides.45 In the case of Eg, theaveraged RMSE of the test sets is equal to 0.223 eV (Fig. 3b), whichis worse than the error of ΔHF. However, as the effective range(1.1–1.8 eV) of the bandgap of solar cell material is much largerthan the RMSE, such low accuracy of Eg could be acceptable forsolar cell applications.The origin of the high accuracy of the predictive models can be

attributed to several things. One is the nonlinear nature of theGBRT algorithm, in contrast to that of most linear algorithms (SeeSupplementary Information 1). Another reason for the highaccuracy could be the structural similarity of the materialsconsidered in the dataset. In the present study, crystal structuresof all materials are perovskite. Given the same structure, materialshaving similar chemical constituents might have similar proper-ties, which allows for more feasible interpolation of properties inthe predictive models. Discussing structural similarity is beyondthe scope of our study, but it is becoming a significant topic in MLstudies of materials science.46,47

On top of providing highly accurate predictive models, theGBRT method provides interpretation of the results via featureimportance scores, which is the main advantage of this method.Figure 3c, d show the importance score of all features for ΔHF andEg, respectively. For ΔHF, it is revealed that the type of halogenanions, represented by χX, is the most important feature (Fig. 3c).Remarkably, the importance score of χX is more than two timeshigher than that of the secondly-ranked feature, DB3þ . Beyond thefirst two features, importance scores steeply decrease, indicatingthat ΔHF strongly depends on only a few features of materials. On

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Feature IndexA B X SG1+ B3+

Feat

ure

Inde

x

A

B

XSG

1+

B3+

A, B , B

X

D χ IP h l rs rpp rd

χ IP h l rs rpp rd

1+ 3+

DχIPhlrsrprd

χIPhlrsrprd

Fig. 2 Square of Pearson correlation coefficient matrix halidedouble perovskites. Size and color of circles vary with values

-1.5 -1.0

-1.50

-1.25

-1.00

-0.75

(b)

Pre

dict

ed (e

V)

Given (eV)

Heat of formation

0 20

1

2

3

(a)

Pre

dict

ed (e

V)

Given (eV)

Band gap

Training setTest set

Impo

rtanc

e

(c)

20

40

0

(ΔH )F (E )g

Heat of formation (ΔH )F

0.4

0.8

0.0Impo

rtanc

e

(d)

SGχ h lD r r rl χD hD χr χh rrl

SG χ h l D r r r l χ D h D χ r χ h r r l

Band gap (E )g

Fig. 3 Prediction of a heat of formation and b bandgap for halidedouble perovskites. Orange filled circles correspond to trainingdataset and blue circles to test dataset. Red solid lines indicate thereference line corresponding to the perfect fit. Feature importancefrom GBRT for c heat of formation and d bandgap of halide doubleperovskite

J. Im et al.

3

Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2019) 37

Page 4: Identifying Pb-free perovskites for solar cells by machine

the other hand, the feature importance used to predict Eg is moredispersed (Fig. 3d). This implies a more complex relationshipbetween Eg and the material features, which is consistent with thetendency observed without ML techniques (see Fig. 1e). We foundthat although SG is the most important feature, the followingfeatures, such as χX, hB1þ , lB1þ , DB3þ , rdB3þ , or rdB1þ , are notnegligible.

Extraction of key-features and applicationIn this section, we suggest possible utilization of the featureimportance scores in an efficient search for target materials. First,we show that the feature importance scores can be utilized toextract key-features determining target properties. To this end, werecursively excluded the least important feature and built apredictive model using only the remaining set of features andlearning new decision trees. This process was repeated until thetop three features remained for each target property. Figure 4a, bshow how the error in the prediction of ΔHF and E changes underthe process of extracting key-features. Remarkably, we found thatRMSEs increase almost monotonically in both cases, whichindicates that a feature with a higher feature score has strongerpredictive power.Through this process, we selected the most important features

for each target properties. For ΔHF, RMSE abruptly increases whenthe number of features is smaller than five (Fig. 4a). This meansthat at least five features are required to predict ΔHF within RMSEof 0.036 eV. In the case of Eg, seven features comprise minimal setto predict Eg within an RMSE of 0.322 eV (Fig. 4b). The list ofselected features includes χX, DB3þ , DA, hB1þ , and DB1þ for ΔHF, andSG, χX, hB1þ , lB1þ , DB3þ , rdB3þ , and rdB1þ for Eg.Key-features extracted from the feature importance score of

GBRT method can have various implications for a fast search fornew material. For instance, selected key-features can be utilized asan optimal set of features in another type of ML algorithms suchas classification. We considered a binary classification with class 1

for Eg in the range of 0.3–0.8 eV and class 2 for otherwise, tosearch for an optimal candidate of solar cell materials. Figure 5presents the accuracy of classification tasks, as well as correspond-ing confusion matrices where the classification task wasperformed with a different number of features. With a smallernumber of features, the prediction accuracy is lower than thatwith all features. However, it is notable that classification with thetop seven key-features selected for Eg provides a goodapproximation to that with the full features. As the classificationcan be accelerated with the reduced feature space, materialsscreening before a further investigation can be feasible in amassive dataset.The mechanistic understanding of the relationship between

features and target properties can provide a practical guide in thesearch for optimal double perovskite solar cell materials. Forexample, Cs2InAgCl6 is one of newly synthesized Pb-free halidedouble perovskite, but its bandgap energy is too large to be usedas a solar cell material.34 Knowing key-features determining Eg,one may have several plausible remedies to decrease Eg, such asanion-mixing from the electronegativity of halogen anions andpartial-substitution/alloying for In and Ag from the highestoccupied and lowest unoccupied orbitals of B1+ ions, the distancebetween B3+ ion and anion, radii of d-orbitals of B1+ and B3+.However, we note that an explicit relationship between features

and properties is still difficult to obtain directly from the featureimportance score, and additional steps would be needed. To thisend, one may still utilize the process of feature selection describedabove and fit the properties using basis functions of the reducedfeatures. Such an explicit relationship can be used to furthermechanistic understanding of target properties such as revealinginteraction between important features, but it is out of the scopeof our paper.

DISCUSSIONIn the importance-score-based selection process of the GBRTmethod, scientific knowledge is not reflected. Here, we considerwhether the roles of the features selected using the GBRT methodin determining the target properties are consistent with previousknown scientific knowledge.First, we investigated the role of the features selected for ΔHF

for the bonding mechanism. The top five selected features for ΔHF

were χX, DB3þ , DA, hB1þ , and DB1þ (Fig. 4), among which the mostimportant feature was χX. Interestingly, it is well-known thatdifferences between electronegativities of bonding partners aregood indicators of the bonding character and bonding strength,which strongly affect ΔHF. Compared to other groups of elements,the halogen group shows relatively large variation in electro-negativity. This could be attributed to a strong dependency of ΔHF

on χX. Along with electronegativity, the distances between cationand anion (DA, DB1þ , and DB3þ ) are also good indicators of bondingstrength. Usually, strong bonding is accompanied by a shorterbond length. In the case of hB1þ , IPB1þ is strongly correlated withhB1þ , (see Fig. 2), and IP is also an important chemical quantity toexplain ionic bonding. In this way, the top five selected featuresfor ΔHF are all relevant to the bonding mechanism, which isimportant to determine ΔHF.In the case of Eg, more complicated theories are required to

understand the role of the selected features. For this purpose, weperformed DFT analysis of the band structure, explicitly focusingon the selected features. Generally, symmetry-lowering by tiltingof the octahedral unit of a halide perovskite increases Eg becauseof bandwidth shrinkage.48 In the case of halide double perovskite,a similar transformation of band structure is found. Figure 6a–dshow the band structures of cubic and orthorhombic phases ofthe representative compounds. In all cases, the bandwidths of theconduction and valence bands were reduced by the tiltedoctahedral units, and this led to increases of the bandgap in

(a)

(b)

RMSE

(eV/

atom

)

0.01

0.03

0.05

0.07

3 5 7 9 11 13 15 17 19Number of Features

3 5 7 9 11 13 15 17 19Number of Features

RMSE

(eV

)

0.1

0.3

0.5

0.6

0.2

0.4

Fig. 4 Root-mean-square-error (RMSE) of a heat of formation andthe b bandgap of halide double perovskite as a function of thenumber of features. In each panel, the blue curve corresponds to thetraining set and the orange curve to the test set

J. Im et al.

4

npj Computational Materials (2019) 37 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences

Page 5: Identifying Pb-free perovskites for solar cells by machine

orthorhombic phases. For other compounds, the same trendoccurs, as shown in Fig. 1e. Thus, SG is relevant in determining Eg.To check the validity of features χX, hB1þ , lB1þ , DB3þ , rdB3þ , and

rdB1þ , we plotted a schematic diagram of the orbital hybridizationbetween cations and anions in halide double perovskites based onthe DFT band structure calculations (see Fig. 6e). The diagramshows that Eg originates from energy differences between twohybridized states, the valence band maximum (VBM), composedof an anti-bonding state involving B1+-site and X-site atoms, andthe conduction band minimum (CBM), composed of an anti-bonding state involving B3+-site and X-site atoms. Even thoughcomplicated interaction exists, the energy levels of the valenceelectrons of B1+, B3+, and X- play important roles in determiningthe value of Eg of a given compound. The highest occupied andlowest unoccupied atomic levels can be good indicators of theenergy levels of the valence electrons. In addition to the atomiclevels, the electronegativity can play a crucial role in determiningEg by controlling energy splitting between the bonding and anti-bonding states, denoted by ΔE in Fig. 6e. This splitting indicatesthe strength of the hybridization among the orbitals. The highelectronegativity of the compounds leads to tightly boundelectronic distribution around the atoms and reflects stronghybridization via small bonding length, such as DB3þ .In this study, we used the GBRT method to investigate the ML

prediction of the values ΔHF and Eg of halide double perovskites.The GBRT method provided outstanding prediction performancefor those properties, as well as providing an interpretable featureimportance score. Notably, for ΔHF, GBRT worked accurately, withan error comparable to the fundamental error associated with thedifference between experimental and DFT values. The predictionof Eg was also acceptable for use in the search for solar cellmaterials. Key-features extracted based on the importance score

provide a better mechanistic understanding of ΔHF and Eg. On topof the accurate and interpretable predictive models, we furtherverified that the key-feature was relevant to scientific knowledge.

METHODSDensity functional theory (DFT)Structural optimization, total energy, and electronic band structure of 540halide double perovskites were performed within density functional theory(DFT) formalism. We utilized a plane-wave basis set (cutoff energy =350 eV), and the projector augmented wave method,49 implemented inthe Vienna Ab-initio simulation package (VASP).50,51 For the exchangecorrelational functional, the generalized gradient approximation wasadopted within Perdue-Ernzerhof-Burke formalism.52 A 5 x 5 x 5 regulargrid was employed for momentum space sampling. The heat of formation,ΔHF, was calculated using the following formula:

ΔHF ¼ Etot A2B1þB3þX6

� �� 2Eref Að Þ þ Erel B1þ� �þ Erel B

3þ� �þ 6Erel Xð Þ� �;

where Etot is the total energy and Eref is the reference energy. For the bandstructures, spin–orbit interaction was considered.

Atomic featuresThe highest occupied atomic level (h) and the lowest unoccupied atomiclevel (l) were taken from the atomic parameters of the VASP pseudopo-tential.50,51 We set h as the highest orbital energy of the partially or fullyoccupied orbitals and l as the lowest orbital energy of the unoccupiedones. If the orbital energies of the unoccupied orbitals were not availablein the parameter files, the highest orbital energy was set at l. Indetermining h and l, we considered degenerated atomic orbitals to be thesame orbital.

75

80

85

90

3 7 12

Acc

urac

y (%

)

# of features

0

1

0 1Predicted

Giv

en

0

1

0 1Predicted

Giv

en

0

1

0 1Predicted

Giv

en

3 features

7 features 12 features

(a) (b)

(c) (d)

Fig. 5 a The accuracy of classification as a function of the number of features. Confusion matrices of classification results for the fivefold testset with b 3 features, c 7 features, and d 12 features. Class “1” means set of materials with a bandgap in a range of [0.3 eV, 0.8 eV] and class “0”of everyone else

J. Im et al.

5

Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2019) 37

Page 6: Identifying Pb-free perovskites for solar cells by machine

Gradient-boosted regression treeThe supervised learning model has a loss function to be minimized. InXGBoost the loss function of the model (ensemble of trees fk) is

L ¼Xi

lðyi ; yiÞ þXk

ΩðfkÞ

where l is a function that measures the difference between the predictionand the target and Ω is a regularization term (complexity of the tree) toprevent overfitting. This loss function cannot be optimized usingtraditional optimization methods: the model is trained in an additivemanner by adding a tree at a time that most improves the model (mostdecreases the loss) to the existing set of trees. Let xi be an i-th sample andyðt�1Þi be its prediction with the current set of t− 1 trees. The model needsto add the t-th tree ft to minimize the loss function

LðtÞðftÞ ¼Xni¼1

lðyi ; yðt�1Þi þ ftðxiÞÞ þ ΩðftÞ:

It is impossible, however, to check all possible tree structures ft to beadded. So, the algorithm starts from a root (single leaf) and greedily addsbranches to the tree. For each step, the model finds a leaf to split, and afeature and its value for the split that maximize loss reduction after the

split. If the current tree structure is f ðcurrentÞt and the structure after the split

is f ðsplitÞt , then the loss reduction by branching can be calculated by the

difference of the loss, DscoreðftÞ ¼ LðtÞ f ðcurrentÞt

� �� LðtÞ f ðsplitÞt

� �. This means

that for each branch of the trees, the model knows, which feature is usedfor the split and its loss reduction.28 For each split of the trees duringmodel training, the algorithm finds (approximately) the feature andsplitting point that provide the largest cost decrease. It is possible tocalculate the number of times each feature is selected for a split in thetrees, or the average of the gain of the splits that use each feature. Thelibrary offers these numerical values as an F-score of type weight or typegain, respectively. In this study, we used gain.The hyperparameters of each model were optimized using a cross-

validated grid search or a randomized search over the parameter settings.The RMSE of the target variable was used as a cost function for all models.For all ML models and each model’s training, randomly chosen 80%samples of the data are used for training; the remaining 20% are used for atest set. We averaged 200 evaluations, 40 sets of fivefold training/test setsplitting with random shuffling (which also have 80% of the data astraining set) and calculated the importance scores of the figures. In thegradient-boosted regression tree, several hyperparameters exist. Aparameter subsample for bagging (subsample ratio of the training

instance) and colsample_bytree (subsample ratio of features) for therandom forest were optimized to prevent overfitting. The regularizationparameters were also optimized. The values of the hyperparameters usedhere are:

max depth ¼ 6;min child weight ¼ 1; colsample bytree ¼ 0:5;

subsample ¼ 0:7; reg alpha ¼ 0:1; learning rate ¼ 0:03:

The regularization parameters were also optimized. The parametermax_depth is for maximum depth of a tree, and min_child_weight is forthe minimum number of instances needed in each node. The smallermax_depth and the larger min_child_weight are, the less the training islikely to overfit. The parameter reg_alpha is an L1 regularization term onweights.

Gradient-boosted classification treesThe classifications were performed for halide double perovskite datasetwith gradient booted classification trees (GBCT). For these classificationsand predictions for all data, the halide double perovskite dataset wasseparated into five disjoint sub-datasets, and then train data consisted offour sub-datasets and test data of the other sub-dataset. This means thatthere were five predictions for the full halide double perovskite dataset. Inestablishing the predictive model, typically given hyperparameters ofGBCT were adopted for the classifications. The given parameter values forthe classifications used here are:

max depth ¼ 4; min child weight ¼ 4; colsample bytree ¼ 0:8;

subsample ¼ 0:8; reg alpha ¼ 0:1; learning rate ¼ 0:1:

DATA AVAILABILITYA dataset on halide double perovskites is provided in the Supplementary Information(See the separate Excel File and corresponding explanation in part 2 ofSupplementary Information). Other electronic data in this study are available fromthe corresponding authors upon reasonable request. A dataset on oxide doubleperovskites is available via the Computational Materials Repository.[https://cmr.fysik.dtu.dk/low_symmetry_perovskites/low_symmetry_perovskites.html#low-symmetry-perovskites].

-2

-1

0

1

2

3

4

E-E

(eV

)

-2

-1

0

1

2

3

4

E-E

(eV

)

-2

-1

0

1

2

3

4

E-E

(eV

)

-2

-1

0

1

2

3

4

E-E

(eV

)

(a) (b)

(c) (d)

Rb AgBiBr2 6 Rb AgInBr2 6

Cs TlBiBr2 6 Rb TlGaBr2 6

(e)

B1+

X

B3+B -X hybridized1+

B -X hybridized3+

anti-bonding

bonding

anti-bonding

bonding

EF

ΔE

ΔE

Fig. 6 a–d Present cubic and orthorhombic crystal structures and corresponding band structures of Rb2AgBiBr6, Rb2AgInBr6, Cs2TlBiBr6, andRb2TlGaBr6, respectively. e Schematic illustration of a band diagram for halide double perovskite

J. Im et al.

6

npj Computational Materials (2019) 37 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences

Page 7: Identifying Pb-free perovskites for solar cells by machine

ACKNOWLEDGEMENTSThis research was supported by the Nano·Material Technology DevelopmentProgram through the National Research Foundation of Korea (NRF), funded by theMinistry of Science and ICT (NRF-2016M3A7B4025408 and NRF-2017M3A7B4049366).We especially thank Castelli53 for data sharing and appreciate S. Lim for valuablediscussions.

AUTHOR CONTRIBUTIONSH.C. and Y.H. designed the project. J.I. performed the high-throughput DFTcalculations, and S.L. carried out machine learning calculations. H.W.K. and T.-W.K.developed the features to analyze the data. J.I. and S.L. wrote the manuscript. Allauthors discussed data and feature analysis and reviewed the manuscript.

ADDITIONAL INFORMATIONSupplementary information accompanies the paper on the npj ComputationalMaterials website (https://doi.org/10.1038/s41524-019-0177-0).

Competing interests: The authors declare no competing interests.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claimsin published maps and institutional affiliations.

REFERENCES1. Fischer, C. C. et al. Predicting crystal structure by merging data mining with

quantum mechanics. Nat. Mater. 5, 641 (2006).2. Armiento, R. et al. Screening for high-performance piezoelectrics using high-

throughput density functional theory. Phys. Rev. B 84, 014103 (2011).3. Hautier, G. et al. Identification and design principles of low hole effective mass p-

type transparent conducting oxides. Nat. Commun. 4, 2292 (2013).4. Carrete, J. et al. Nanograined half-heusler semiconductors as advanced thermo-

electrics: An ab initio high-throughput statistical study. Adv. Funct. Maters. 24,7427 (2014).

5. Yim, K. et al. Novel high-κ dielectrics for next-generation electronic devicesscreened by automated ab initio calculations. NPG Asia Mater. 7, e190 (2015).

6. Agrawala, A. & Choudhary, A. Perspective: materials informatics and big data:Realization of the “fourth paradigm” of science in materials science. APL Mater. 4,053208 (2016).

7. Jain, A. et al. New opportunities for materials informatics: Resources and datamining techniques for uncovering hidden relationships. J. Mater. Res. 31, 977(2016).

8. Kalidindi, S. et al. Role of materials data science and informatics in acceleratedmaterials innovation. Mrs. Bull. 41, 596 (2016).

9. Boyd, P. G., Lee, Y. & Smit, B. Computational development of the nanoporousmaterials genome. Nat. Rev. Mater. 2, 17037 (2017).

10. Jain, A. et al. The materials project: A materials genome approach to acceleratingmaterials innovation. APL Mater. 1, 011002 (2013).

11. Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design anddiscovery with high-throughput density functional theory: The Open QuantumMaterials Database (OQMD). JOM 65, 1501 (2013).

12. Kirklin, S. et al. The Open Quantum Materials Database (OQMD): assessing theaccuracy of DFT formation energies. npj Comput. Mater. 1, 15010 (2015).

13. Draxl, C. & Scheffler, M. NOMAD: The FAIR concept for big data-driven materialsscience. Mrs. Bull. 43, 676 (2018).

14. Curtarolo, S. et al. AFLOW: An automatic framework for high-throughput mate-rials discovery. Comp. Mater. Sci. 58, 218 (2012).

15. Carr, D. A. et al. Machine learning approach for structure-based zeolite classifi-cation. Micro Meso. Mater. 117, 339 (2009).

16. Pilania, G., Gubernatis, J. E. & Lookman, T. Structure classification and meltingtemperature prediction in octet AB solids via machine learning. Phys. Rev. B 91,214302 (2015).

17. Yamashita, T. et al. Crystal structure prediction accelerated by Bayesian optimi-zation. Phys. Rev. Mater. 2, 013803 (2018).

18. Hobday, S. et al. Applications of neural networks to fitting interatomic potentialfunctions. Model. Simul. Mater. Sci. Eng. 7, 397 (1999).

19. Handkey, C. M. & Popelier, L. A. Potential energy surfaces fitted by artificial neuralnetworks. J. Phys. Chem. A 114, 3371 (2010).

20. Schneider, E. et al. Stochastic neural network approach for learning high-dimensional free energy surfaces. Phys. Rev. Lett. 119, 150601 (2017).

21. Wellendorff, J. et al. Density functionals for surface science: Exchange-correlationmodel development with Bayesian error estimation. Phys. Rev. B 85, 235149(2012).

22. Snyder, J. C. et al. Finding density functionals with machine learning. Phys. Rev.Lett. 108, 253002 (2012).

23. Brockherde, F. et al. Bypassing the Kohn-Sham equations with machine learning.Nat. Commun. 8, 872 (2017).

24. Pilania, G. et al. Accelerating materials property predictions using machinelearning. Sci. Rep. 3, 2810 (2013).

25. Lee, J. et al. Prediction model of band gap for inorganic compounds by combi-nation of density functional theory calculations and machine learning techni-ques. Phys. Rev. B 93, 115104 (2016).

26. Pilania, G. et al. Machine learning bandgaps of double perovskites. Sci. Rep. 6,19375 (2016).

27. Bartok, A. P. et al. Machine learning unifies the modeling of materials andmolecules. Sci. Adv. 3, e170816 (2017).

28. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system.arXiv:1603.02754 (2016).

29. Chen, W. et al. Efficient and stable large-area perovskite solar cells with inorganiccharge extraction layers. Science 350, 944–948 (2015).

30. Shin, S. et al. Colloidally prepared La-doped BaSnO3 electrodes for efficient,photostable perovskite solar cells. Science 356, 167–171 (2017).

31. Volonakis, G. et al. Lead-free halide double perovskites via heterovalent sub-stitution of noble metals. J. Phys. Chem. Lett. 7, 1254 (2016).

32. Philip, M. R. et al. Band gaps of the lead-free halide double perovskites Cs2BiAgCl6and Cs2BiAgBr6 from theory and experiment. J. Phys. Chem. Lett. 7, 2579 (2016).

33. Wei, F. et al. Synthesis and properties of a lead-free hybrid double perovskite:(CH3NH3)2AgBiBr6. Chem. Mater. 29, 1089 (2017).

34. Volonakis, G. et al. Cs2InAgCl6: A new lead-free halide double perovskite withdirect band gap. J. Phys. Chem. Lett. 8, 772 (2017).

35. Shockley, W. & Queisser, H. J. Detailed balance limit of efficiency of p-n junctionsolar cells. J. Appl. Phys. 32, 510–519 (1961).

36. Jin, H., Im, J. & Freeman, A. J. Topological insulator phase in halide perovskitestructure. Phys. Rev. B 86, 121102 (2012).

37. Menéndez-Proupin, E., Palacios, P., Wahnón, P. & Conesa, J. C. Self-consistentrelativistic band structure of the CH3NH3PbI3 perovskite. Phys. Rev. B 90, 045207(2014).

38. Umari, P., Moscon, E. & Angelis, F. D. Relativistic GW calculations on CH3NH3PbI3and CH3NH3SnI3 perovskites for solar cell applications. Sci. Report. 4, 4467 (2014).

39. Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification andRegression Trees (The Wadsworth statistics/probability series). (Chapman and Hall,New York, 1984).

40. Strobl, C., Malley, J. & Tutz, G. An introduction to recursive partitioning: rationale,application and characteristics of classification and regression trees, bagging andrandom forests. Psycol. Methods 14, 323 (2009).

41. Breiman, L. Bagging predictors. Mach. Learn. 24, 123 (1996).42. Ho, T. K. The random subspace method for constructing decision forests. IEEE T.

Pattern Anal. 20, 832 (1998).43. Friedman, J. H. Greedy function approximation: A gradient boosting machine.

Ann. Stat. 29, 1189 (2001).44. Hastie, R. & Tibshirani, J. F. The elements of statistical learning: Data mining,

inference, and prediction (Springer Series in Statistics, Springer, 2001).45. Hautier, G. et al. Accuracy of density functional theory in predicting formation

energies of ternary oxides from binary oxides and its implication on phase sta-bility. Phys. Rev. B 85, 155208 (2012).

46. Schutt, K. T. et al. How to represent crystal structures for machine learning:Towards fast prediction of electronic properties. Phys. Rev. B 89, 205118 (2014).

47. Ghiringhelli, L. M. et al. Big data of materials science: Critical role of the descriptor.Phys. Rev. Lett. 114, 105503 (2015).

48. Stoumpos, C. C., Malliakas, C. D. & Kanatzidis, M. G. Semiconducting tin and leadiodide perovskites with organic cations: Phase transitions, high mobilities, andnear-infrared photoluminescent properties. Inorg. Chem. 52, 9019 (2013).

49. Blöchl, P. E. Phys. Rev. B 50, 17953 (1994).50. Kresse, G. & Furthmüller Efficiency of ab-initio total energy calculations for metals

and semiconductors using a plane-wave basis set. J. Comput. Mat. Sci. 6, 15(1996).

51. Kresse, G. & Furthmüller Efficient iterative schemes for ab initio total-energycalculations using a plane-wave basis set. J. Phys. Rev. B 54, 11169 (1996).

52. Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation madesimple. Phys. Rev. Lett. 77(18), 3865 (1996).

53. Castelli, I. E., Thygesen, K. S. & Jacobsen, K. W. Bandgap engineering of doubleperovskites for one- and two-photon water splitting. Mater. Res. Soc. Symp. Proc.1523, https://doi.org/10.1557/opl.2013.450 (2013).

J. Im et al.

7

Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2019) 37

Page 8: Identifying Pb-free perovskites for solar cells by machine

Open Access This article is licensed under a Creative CommonsAttribution 4.0 International License, which permits use, sharing,

adaptation, distribution and reproduction in anymedium or format, as long as you giveappropriate credit to the original author(s) and the source, provide a link to the CreativeCommons license, and indicate if changes were made. The images or other third partymaterial in this article are included in the article’s Creative Commons license, unlessindicated otherwise in a credit line to the material. If material is not included in the

article’s Creative Commons license and your intended use is not permitted by statutoryregulation or exceeds the permitted use, you will need to obtain permission directlyfrom the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2019

J. Im et al.

8

npj Computational Materials (2019) 37 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences