
LATENT TREE MODELS FOR MULTIVARIATE DENSITY ESTIMATION: ALGORITHMS AND

APPLICATIONS

by

YI WANG

A Thesis Submitted to The Hong Kong University of Science and Technology

in Partial Fulfillment of the Requirements for

the Degree of Doctor of Philosophy

in Computer Science and Engineering

August 2009, Hong Kong

Copyright © by Yi Wang 2009


Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis

to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce

the thesis by photocopying or by other means, in total or in part, at the request of other

institutions or individuals for the purpose of scholarly research.

YI WANG


LATENT TREE MODELS FOR MULTIVARIATE DENSITY ESTIMATION: ALGORITHMS AND

APPLICATIONS

by

YI WANG

This is to certify that I have examined the above Ph.D. thesis

and have found that it is complete and satisfactory in all respects,

and that any and all revisions required by

the thesis examination committee have been made.

PROF. NEVIN LIANWEN ZHANG, THESIS SUPERVISOR

PROF. MOUNIR HAMDI, HEAD OF DEPARTMENT

Department of Computer Science and Engineering

18 August 2009


ACKNOWLEDGMENTS

First of all, I would like to express my gratitude to my supervisor Prof. Nevin Lianwen

Zhang. He guided me through all the phases of this thesis. In the past six years, I have

learned a lot from him, not only useful skills for research, but also a positive attitude toward

research and life. It is my good fortune to be his student.

I thank the members of my proposal and thesis examination committees: Prof. Ning

Cai, Prof. Lei Chen, Prof. Marek J. Druzdzel, Prof. Brian Mak, and Prof. Dit-Yan

Yeung. They took time to read the earlier versions of this manuscript and provided

valuable comments. Those comments helped significantly improve the quality of this thesis.

I am grateful to my colleagues at HKUST: Tao Chen, Haipeng Guo, and Kin Man

Poon. We collaborated closely in many projects. Discussions with them inspired me a lot.

My thanks also go to my friends: James Cheng, Sheng Gao, Bingsheng He, Dan Hong,

Xing Jin, Wujun Li, An Lu, Jialin Pan, Pingzhong Tang, Ya Tang, Gang Wang, Yajun

Wang, Qiuyan Xia, Mingxuan Yuan, Jun Zhang, Li Zhao, Wenchen Zheng, Weihua Zhou,

and the list goes on. I had so much fun with you guys, which made my PhD life much

more enjoyable.

Finally, I want to thank my beloved family. I am deeply indebted to my parents Jie’er

Wang and Xueqiang Peng. They gave me a good education, encouraged me to pursue my

dreams, and have always supported my big decisions. I also owe a lot to my wife Yiping Ke. She

accompanied me through the long journey towards this thesis, cheered me up when I was

down, and gave me courage to move forward. Without her love and tolerance, I could

never have made it this far.

I dedicate this thesis to my family.


TABLE OF CONTENTS

Title Page
Authorization Page
Signature Page
Acknowledgments
Table of Contents
List of Figures
List of Tables
Abstract

Chapter 1 Introduction
1.1 Approaches to Density Estimation
1.2 Latent Tree Models for Density Estimation
1.3 Learning Latent Tree Models
1.4 Contributions
1.4.1 New Algorithms for Learning LTMs
1.4.2 Two Applications
1.5 Organization

Chapter 2 Background
2.1 Notations
2.2 Bayesian Networks
2.3 Latent Tree Models
2.3.1 Learning Latent Tree Models for Density Estimation
2.3.2 Model Inclusion and Equivalence
2.3.3 Root Walking and Unrooted LTMs
2.3.4 Regular LTMs


Chapter 3 Algorithm 1: EAST
3.1 Search Operators and Search Procedure
3.1.1 Search Operators
3.1.2 Brute-Force Search
3.1.3 EAST Search
3.2 Efficient Model Evaluation
3.2.1 Parameter Sharing among Models
3.2.2 Restricted Likelihood
3.2.3 Local EM
3.2.4 Avoiding Local Maxima
3.2.5 Two-Stage Model Evaluation
3.2.6 The PickModel Subroutine
3.3 Operation Granularity
3.4 Summary

Chapter 4 Algorithm 2: Hierarchical Clustering Learning
4.1 Heuristic Construction of Model Structure
4.1.1 Basic Ideas
4.1.2 MI Between A Latent Variable and Another Variable
4.2 Cardinalities of Latent Variables
4.2.1 Larger C for Better Estimation
4.2.2 Maximum Value of C Under Complexity Constraint
4.3 Model Simplification
4.3.1 Model Regularization
4.3.2 Redundant Variable Absorption
4.4 Parameter Optimization
4.5 The HCL Algorithm
4.6 Summary

Chapter 5 Algorithm 3: Pyramid
5.1 Basic Ideas
5.1.1 Bottom-up Construction of Model Structure
5.1.2 Unidimensionality Test
5.1.3 Subset Growing Termination
5.1.4 Sibling Cluster Determination
5.2 The Pyramid Algorithm


5.3 Mutual Information
5.3.1 MI Between Manifest Variables
5.3.2 MI Between a Latent Variable and a Manifest Variable
5.3.3 MI Between Two Latent Variables
5.4 Simple Model Learning
5.4.1 Exhaustive Search
5.4.2 Restricted Expansion
5.4.3 When S Contains Latent Variables
5.5 Cardinality and Parameter Refinement
5.6 Summary

Chapter 6 Empirical Evaluation
6.1 Data Sets
6.1.1 Synthetic Data Sets
6.1.2 Real-World Data Sets
6.2 Measures of Model Quality
6.3 Impact of Algorithmic Parameters
6.3.1 Experimental Settings
6.3.2 EAST
6.3.3 HCL
6.3.4 Pyramid
6.4 Comparison of EAST, HCL and Pyramid
6.4.1 Model Quality
6.4.2 Computational Efficiency
6.4.3 Latent Structure Discovery
6.5 Summary

Chapter 7 Application 1: Approximate Probabilistic Inference
7.1 Probabilistic Inference in Bayesian Networks
7.2 Basic Idea
7.2.1 User Specified Bound on Inferential Complexity
7.3 Approximating Bayesian Networks with Latent Tree Models
7.3.1 Two Computational Difficulties
7.3.2 Optimization via Density Estimation
7.3.3 Impact of Imax
7.4 LTM-based Approximate Inference


7.5 Empirical Results
7.5.1 Experimental Settings
7.5.2 Impact of N and Imax
7.5.3 Comparison with CTP
7.5.4 Comparison with LBP
7.5.5 Comparison with CL-based Method
7.5.6 Comparison with LCM-based Method
7.6 Related Work
7.7 Summary

Chapter 8 Application 2: Classification
8.1 Background
8.2 Build Classifiers via Density Estimation
8.2.1 The Generative Approach to Classification
8.2.2 Generative Classifiers Based on Latent Tree Models
8.3 Latent Tree Classifier
8.4 A Learning Algorithm for Latent Tree Classifier
8.4.1 Parameter Smoothing
8.5 Empirical Evaluation
8.5.1 Data Sets
8.5.2 Experimental Settings
8.5.3 Effect of Parameter Smoothing
8.5.4 LTC-E versus LTC-P
8.5.5 Comparison with the Other Algorithms
8.5.6 Appreciating Learned Models
8.6 Related Work
8.7 Summary

Chapter 9 Conclusions and Future Work
9.1 Summary of Contributions
9.2 Future Work
9.2.1 Other Applications
9.2.2 Handling Continuous Data
9.2.3 Generalization to Partially Observed Trees

Bibliography


LIST OF FIGURES

1.1 Example latent tree model and latent class model. X’s denote manifest variables, Y’s denote latent variables.
2.1 The Asia network.
2.2 Rooted latent tree models, latent tree model obtained by root walking, and unrooted latent tree model. The X’s are manifest variables and the Y’s are latent variables.
3.1 The NI and NR operators. The model m2 is obtained from m1 by introducing a new latent node Y3 to mediate between Y1 and two of its neighbors X1 and X2. The cardinality of Y3 is set to be the same as that of Y1. The model m3 is obtained from m2 by relocating X3 from Y1 to Y3.
3.2 A candidate model obtained by modifying the model in Figure 2.2. The two models share the parameters for describing the distributions P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), P(X3|Y2), P(X4|Y1), P(Y3|Y1), and P(X5|Y3). On the other hand, the parameters for describing P(Y4|Y3), P(X6|Y4), and P(X7|Y4) are peculiar to the candidate model.
4.1 An illustrative example. The numbers within the parentheses denote the cardinalities of the variables.
4.2 Redundant variable absorption. (a) A part of a model that contains two adjacent and saturated latent nodes Y1 and Y2, with Y2 subsuming Y1. (b) Simplified model with Y1 absorbed by Y2.
5.1 An example subset growing process. The numbers within the parentheses denote the cardinalities of the latent variables.
5.2 The model structure after adding one latent variable Y1.
5.3 An example for evaluating simple models over latent variables. The colors of the nodes in Figure 5.3c indicate where the parameters of the nodes come from.
6.1 The structures of the generative models of the 3 synthetic data sets.
6.2 The quality of the models produced by the three learning algorithms under various settings.
6.3 The running time of the three learning algorithms under various settings.
6.4 The structures of the best models learned by EAST from the 3 synthetic data sets.
6.5 The structures of the best models learned by Pyramid from the 3 synthetic data sets.
6.6 The structures of the best models learned by HCL from the 3 synthetic data sets.
7.1 Running time of HCL under different settings. Settings for which EM did not converge are indicated by arrows.


7.2 Approximation accuracy of the LTM-based method under different settings.
7.3 Running time of the online phase of the LTM-based method under different settings.
7.4 Approximation accuracy of various inference methods.
7.5 Running time of various inference methods.
8.1 NB, TAN, and LTC. C is the class variable, X1, X2, X3, and X4 are four attributes, Y1 and Y2 are latent variables.
8.2 The training time of LTC-E and LTC-P.
8.3 The classification time of different classifiers.
8.4 The structures of the LTMs for Corral data.
8.5 The attribute distributions in each latent class and the corresponding concept.


LIST OF TABLES

1.1 A taxonomy of representative density estimation approaches.
2.1 The parameters of the Asia network. Abbreviations: V — VisitAsia; S — Smoking; T — Tuberculosis; C — Cancer; B — Bronchitis; TC — TbOrCa; X — XRay; D — Dyspnea.
4.1 The empirical MI between the manifest variables.
4.2 The estimated MI between each latent variable and other variables.
6.1 The 3 settings on the algorithmic parameters of EAST and Pyramid that have been tested.
7.1 The networks used in the experiments and their characteristics.
8.1 The 37 data sets used in the experiments.
8.2 The classification accuracy of LTC-E and LTC-P with/without parameter smoothing. Boldface numbers denote higher accuracy. Small circles indicate significant wins.
8.3 Comparison of classification accuracy between LTC-E and LTC-P.
8.4 The classification accuracy of the tested algorithms. The 3 entries indicated by small circles become the best after taking out C4.5.
8.5 The number of times that LTC significantly won, tied with, and lost to the other algorithms.


LATENT TREE MODELS FOR MULTIVARIATE DENSITY ESTIMATION: ALGORITHMS AND

APPLICATIONS

by

YI WANG

Department of Computer Science and Engineering

The Hong Kong University of Science and Technology

ABSTRACT

Multivariate density estimation is a fundamental problem in Applied Statistics and

Machine Learning. Given a collection of data sampled from an unknown distribution, the

task is to approximately reconstruct the generative distribution. There are two different

approaches to the problem, the parametric approach and the non-parametric approach.

In the parametric approach, the approximate distribution is represented by a model from

a predetermined family.

In this thesis, we adopt the parametric approach and investigate the use of a model

family called latent tree models for the task of density estimation. Latent tree models

are tree-structured Bayesian networks in which leaf nodes represent observed variables,

while internal nodes represent hidden variables. Such models can represent complex

relationships among observed variables and, at the same time, admit efficient inference

among them. Consequently, they are a desirable tool for density estimation.

While latent tree models are studied for the first time in this thesis for the purpose of

density estimation, they have been investigated earlier for clustering and latent structure

discovery. Several algorithms for learning latent tree models have been proposed. The

state-of-the-art is an algorithm called EAST. EAST determines model structures through

principled and systematic search, and determines model parameters using the EM algorithm. It has been shown to be capable of achieving a good trade-off between fit to data

and model complexity. It is also capable of discovering latent structures behind data.

Unfortunately, it has a high computational complexity, which limits its applicability to


density estimation problems.

In this thesis, we propose two latent tree model learning algorithms specifically for

density estimation. The two algorithms have distinct characteristics and are suitable for

different applications. The first algorithm is called HCL. HCL assumes a predetermined

bound on model complexity and restricts to binary model structures. It first builds a

binary tree structure based on mutual information and then runs the EM algorithm once

on the resulting structure to determine the parameters. As such, it is efficient and can

deal with large applications. The second algorithm is called Pyramid. Pyramid does not

assume predetermined bounds on model complexity and does not restrict to binary tree

structures. It builds model structures using heuristics based on mutual information and

local search. It is slower than HCL. However, it is faster than EAST and is only slightly

inferior to EAST in terms of the quality of the resulting models.

In this thesis, we also study two applications of the density estimation techniques that

we develop. The first application is to approximate probabilistic inference in Bayesian

networks. A Bayesian network represents a joint distribution over a set of random variables. It often happens that the network structure is very complex and making inference directly on the network is computationally intractable. We propose to approximate the

joint distribution using a latent tree model and exploit the latent tree model for faster

inference. The idea is to sample data from the Bayesian network, learn a latent tree

model from the data offline, and when online, make inference with the latent tree model

instead of the original Bayesian network. HCL is used here because the sample size needs

to be large to produce an accurate approximation and it is possible to predetermine a bound on the online running time. Empirical evidence shows that this method can achieve good

approximation accuracy at low online computational cost.

The second application is classification. A common approach to this task is to for-

mulate it as a density estimation problem: One constructs the class-conditional density

for each class and then uses the Bayes rule for classification. We propose to estimate those class-conditional densities using either EAST or Pyramid. Empirical evidence

shows that this method yields good classification performance. Moreover, the latent tree models built for the class-conditional densities are often meaningful, which is conducive to user confidence. A comparison between EAST and Pyramid reveals that Pyramid is significantly more efficient than EAST, while it results in more or less the same classification

performance as the latter.


CHAPTER 1

INTRODUCTION

Multivariate density estimation is a fundamental problem in Applied Statistics and Ma-

chine Learning. Suppose there is a collection of data that was drawn from an unknown

probability distribution. The task is to construct an estimate of the generative distribu-

tion from the data (Silverman, 1986). The estimate can help domain experts understand

the properties of the population. It can also be used to make predictions: to calculate the likelihood of new data cases, classify new data cases, and compute the

posterior distributions of some variables after observing others.

1.1 Approaches to Density Estimation

Density estimation approaches can be categorized along two dimensions: (1) whether or

not they are based on some parametric models and (2) whether they deal with continuous data or discrete data. Parametric approaches assume that the generative model is

from a given parametric family, and pick one model from the family to approximate the

generative model. For continuous data, commonly used model families include Gaussian

distributions (Duda & Hart, 1973), mixtures of Gaussian (Fraley & Raftery, 2002), and

factor models (Bartholomew & Knott, 1999). For discrete data, Markov random fields

(Kindermanna & Snell, 1980) and Bayesian networks (Pearl, 1988) are often used. Non-

parametric approaches do not restrict the form of the generative distribution. Examples

of non-parametric methods include histogram (Scott, 1992), nearest neighbor (Loftsgaarden & Quesenberry, 1965), and Parzen windows (Parzen, 1962). Most non-parametric

approaches deal with only continuous data. Table 1.1 shows a taxonomy of representative

density estimation approaches. In this thesis, we are concerned with density estimation

problems with discrete variables and we focus on parametric approaches.

In the case of discrete variables, a natural class of models for density estimation is

Bayesian networks (Pearl, 1988; Heckerman, 1995; Jordan, 1998). A Bayesian network

(BN) is an annotated directed acyclic graph. Each node in the graph represents a random

variable, and is attached with a conditional probability distribution of the node given its

parent nodes. The Bayesian network as a whole represents a joint distribution over all

the variables. In theory, Bayesian networks can represent any joint distributions exactly.

Therefore, one can often obtain, from observed data, Bayesian networks that approximate


                 Continuous Variable     Discrete Variable
Parametric       Gaussian distribution   Markov random field
                 Mixture of Gaussians    Bayesian network
                 Factor model            Latent tree model
Non-parametric   Histogram               –
                 Nearest neighbor
                 Parzen window

Table 1.1: A taxonomy of representative density estimation approaches.

the generative distribution well (Heckerman, 1995). However, the resulting Bayesian

networks might be complex and hence computationally hard to deal with. As a matter

of fact, making inference in general Bayesian networks is NP-hard (Cooper, 1990; Dagum

& Luby, 1993).

Another class of models that one can use for density estimation of discrete data is tree-

structured Bayesian networks. This method was first proposed by Chow and Liu (1968)

and is commonly known as Chow-Liu trees. Probabilistic inference in such models only

takes time linear in the number of nodes (Pearl, 1988). However, Chow-Liu trees capture

only second-order dependencies among the variables and ignore higher-order dependencies.

As such, the method might not yield a good approximation to the generative model when

the model contains complex relationships among the variables.
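To make the Chow-Liu construction concrete, here is a minimal sketch (ours, not from the thesis) that estimates pairwise mutual information from complete discrete data and then connects the variables with a maximum-weight spanning tree; the data format and variable indexing are assumptions made for illustration.

import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    # Empirical MI between columns i and j of a list of complete discrete tuples.
    n = len(data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    pij = Counter((row[i], row[j]) for row in data)
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

def chow_liu_edges(data, num_vars):
    # Maximum-weight spanning tree (Prim's algorithm) over pairwise MI.
    weight = {(i, j): mutual_information(data, i, j)
              for i, j in combinations(range(num_vars), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < num_vars:
        i, j = max(((u, v) for u in in_tree for v in range(num_vars)
                    if v not in in_tree),
                   key=lambda e: weight[(min(e), max(e))])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Toy usage: four binary variables, rows are complete observations.
data = [(0, 0, 1, 1), (1, 1, 0, 0), (0, 0, 0, 1), (1, 1, 1, 0)]
print(chow_liu_edges(data, 4))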

1.2 Latent Tree Models for Density Estimation

In this thesis, we study the use of a new class of models, called latent tree models, for

density estimation of discrete data. Like Chow-Liu trees, latent tree models (LTMs) are

also tree-structured BNs. Unlike Chow-Liu trees, LTMs contain latent variables that

are not observed, in addition to manifest variables which are observed. Specifically, the

internal nodes in an LTM represent latent variables and the leaf nodes represent manifest

variables. An example LTM is shown in Figure 1.1a.

LTMs were first systematically studied by Zhang (2004) and were then called hier-

archical latent class models. The models were identified as a class of potentially useful

models much earlier by Pearl (1988). There are two reasons for this. First, LTMs are computationally simple to work with because they are tree-structured. Second, if viewed as models for the manifest variables, they can represent complex relationships among those

variables. As a matter of fact, no conditional independence relationships hold among the

manifest variables in an LTM. Using LTMs, one can approximate any distribution over


(a) LTM (b) LCM

Figure 1.1: Example latent tree model and latent class model. X’s denote manifest variables, Y’s denote latent variables.

the manifest variables arbitrarily well.

Learning LTMs from data is a challenging task. One needs to determine the number

of latent variables, the cardinality (i.e., number of states) of each latent variable, the tree

topology that connects the latent variables and manifest variables, and all the conditional

probability distributions. Those factors jointly lead to a huge model space. This is the

reason why the potential of LTMs has not been explored until recently.

Researchers have previously studied a subclass of LTMs, namely LTMs with one single

latent variable. Such models are called latent class models (LCMs). Lazarsfeld and Henry

(1968) use them for cluster analysis. Lowd and Domingos (2005) use them for density

estimation and call them naïve Bayes models with latent variables. An example LCM is

shown in Figure 1.1b. LCMs constitute a much smaller model space than LTMs. This

makes the learning of LCMs easier than that of LTMs. On the other hand, LCMs might not be

able to approximate a generative model as well as LTMs.

1.3 Learning Latent Tree Models

Only a few algorithms for constructing LTMs have been developed so far. The first algorithm was proposed by Pearl (1988). The objective is to construct an LTM to represent a

generative distribution assuming that it can be represented exactly using an LTM. Sarkar

(1995) extended the work to the case where the generative distribution cannot be represented exactly by an LTM. In both cases, one is given the generative distribution instead

of data and all the manifest variables are binary.

The first algorithm that learns LTMs from data was proposed by Zhang (2004). The

algorithm is called double hill climbing (DHC). DHC is a search-based algorithm that

uses the BIC score for model selection. It uses two nested hill climbing subroutines to

seek high-scoring LTMs. At the first level, DHC hill climbs in the space of latent


tree structures. For each candidate model structure, DHC invokes a second hill climb

subroutine to optimize the cardinalities of latent variables. This strategy leads to a large

number of candidate models. To calculate the BIC score of each candidate model, DHC

runs EM to optimize the model parameters. EM is known to be time consuming. As a

result, DHC is computationally expensive to use. It can only handle data sets with half a dozen to a dozen manifest variables.

Zhang and Kocka (2004) proposed a more efficient algorithm called heuristic single

hill climbing (HSHC). HSHC improves DHC in two ways. First, it uses one single hill

climbing routine and directly searches in the space of LTMs. At each step, modifications

to both structures and cardinalities of latent variables are considered at the same time.

Second, HSHC uses easy-to-compute heuristics to evaluate candidate models instead of

calculating the BIC scores exactly. HSHC was the first LTM learning algorithm that can

analyze non-trivial data sets.

The current state-of-the-art for learning LTMs is the EAST algorithm (Chen, 2008).

EAST differs from HSHC in two ways. First, EAST adopts the grow-restructure-thin

search strategy and divides the search process into three stages. At each stage, EAST

searches with only a subset of operators rather than all of them. This reduces the number

of candidate models at each search step. Second, instead of the heuristics of HSHC,

EAST adopts a more principled approach for efficient evaluation of candidate models.

The approach is based on an approximation to maximized likelihood, called restricted

maximized likelihood. We will discuss EAST in detail in Chapter 3.

1.4 Contributions

EAST is a search-based algorithm that aims at finding the LTM with the highest BIC

score. It was designed for latent structure discovery and clustering (Chen, 2008). It can

also be used for density estimation. In this context, the AIC score should be used for

model selection instead of the BIC score (see Section 2.3.1). We empirically evaluate

EAST and find that it can yield good solutions to density estimation problems when

coupled with the AIC score. However, it has a serious drawback. Its computational

complexity is high. It typically takes days to process data sets with dozens of manifest

variables and thousands of samples. As such, it is not suitable for large density estimation

problems.

Our contribution in this thesis is two-fold. First, we develop two new algorithms for

learning LTMs that are significantly more efficient than EAST. Second, we apply the

density estimation techniques that we develop to two problems, approximate inference in

complex Bayesian networks and model-based classification.


1.4.1 New Algorithms for Learning LTMs

Consider a given LTM. Suppose we remove from it all the conditional distributions and

all the information about the cardinalities of the latent variables. What remains is called

an LTM structure. In an LTM structure, we know the number of latent variables and

how they are connected with the manifest variables, but we do not know how many states

each latent variable takes.

Suppose there is a given LTM structure. Theoretically, one can represent any distribution over the manifest variables exactly by setting the cardinalities of the latent variables

large enough (see Proposition 4.1). If the cardinalities are not large enough, what we

can get is an approximation of the distribution. If the structure is ‘good’, we can achieve

good approximation with low cardinalities. If the structure is ‘not good’, we need the

cardinalities to be large in order to achieve good approximation.

Finding a ‘good’ LTM structure takes time. This is why EAST is so inefficient. In

this thesis, we develop two new algorithms for learning LTMs. The idea is to spend less

time on finding LTM structure and compensate for sub-optimality in model structure by

increasing the cardinalities of latent variables. In this way, good approximation to the

generative model can still be achieved. The two new algorithms differ in how much effort they spend on optimizing model structure. Hence they have different characteristics and

are suitable for different applications.

Our first new algorithm for learning LTM is called hierarchical clustering learning

(HCL). It spends the least effort on determining model structure among all the algorithms. It builds model structures based on the heuristic that variables exhibiting strong

correlation in data should share a common latent parent. More specifically, it constructs

a binary latent tree structure through hierarchical clustering of manifest variables. At

each step, two closely correlated sets of manifest variables are grouped, and a new latent

variable is introduced to account for the relationship between them. The cardinalities of

the latent variables are determined according to a predetermined threshold on the complexity of the resulting model. Finally, model parameters are optimized using the EM

algorithm. HCL is drastically more efficient than EAST. It is suitable for applications

where there are a large number of manifest variables, a large number of data samples, 1

and a pre-specified constraint on the complexity of the resulting model. It can yield

a good approximation to the generative distribution under the complexity constraint.
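As an illustration of the structure-building step only, the following sketch (ours, under simplifying assumptions rather than the thesis implementation) greedily merges the two most strongly correlated clusters of manifest variables and introduces a new latent parent at each merge, producing a binary latent tree skeleton. It reuses the mutual_information helper from the Chow-Liu sketch in Section 1.1 and takes cluster-to-cluster affinity to be the maximum pairwise MI; cardinality selection and EM are omitted.

def hcl_skeleton(data, num_vars):
    # Greedy agglomerative construction of a binary latent tree structure.
    # Each cluster is (root node name, [manifest column indices it covers]).
    clusters = [(f"X{i}", [i]) for i in range(num_vars)]
    edges, next_latent = [], 1
    while len(clusters) > 1:
        # Pick the pair of clusters with the strongest pairwise correlation.
        best, best_mi = None, -1.0
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                mi = max(mutual_information(data, i, j)
                         for i in clusters[a][1] for j in clusters[b][1])
                if mi > best_mi:
                    best, best_mi = (a, b), mi
        a, b = best
        latent = f"Y{next_latent}"
        next_latent += 1
        # The new latent variable becomes the parent of both cluster roots.
        edges.append((latent, clusters[a][0]))
        edges.append((latent, clusters[b][0]))
        merged = (latent, clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return edges

data = [(0, 0, 1, 1), (1, 1, 0, 0), (0, 0, 0, 1), (1, 1, 1, 0)]  # toy observations
print(hcl_skeleton(data, 4))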

The second new algorithm for learning LTMs is called Pyramid. It spends more effort on determining model structure than HCL and less effort than EAST. Like HCL,

it builds model structure in a bottom-up fashion. At each step, it selects a subset of

1 Because the model structure is not very ‘good’, one might need to set the cardinalities of the latent variables high. This implies a large number of model parameters, which in turn requires a large sample size.


strongly correlated variables and introduces a latent parent for them. To make the selection, it starts with the subset consisting of the two most closely correlated variables. It

then adds other closely correlated variables to the subset one by one, until the so-called

unidimensionality test fails. Several variables in the final subset are selected according to

the outcomes of the test, and a new common latent parent is introduced for them. The

cardinality of the new latent variable is determined by considering the AIC score of a local

model that contains the variable. Pyramid is significantly more efficient than EAST, but

significantly less efficient than HCL. It can find ‘good’ model structures (almost as good

as those found by EAST) and hence can achieve good approximation of the generative

model without using high cardinalities for the latent variables.

1.4.2 Two Applications

In this thesis, we also apply the density estimation techniques that we develop to two important problems in Artificial Intelligence and Machine Learning and propose interesting

new solutions for those problems.

The first problem is probabilistic inference in Bayesian networks (BN). When the

network structure of a BN is complex, exact inference is infeasible. We propose a novel

approximate inference method for such cases. Suppose there is a BN over a set X of

variables. It represents a joint distribution P (X) over the variables X. Our idea is to

sample data from the BN, and learn from the data an LTM with X as manifest variables.

The learning is done offline. The resultant LTM is viewed also as a model for X and

represents another distribution P ′(X). When online, we make inference using the LTM

instead of the original BN. Obviously, inference in the LTM can be much more efficient

than in the original BN. Meanwhile, because LTMs can represent complex relationships

among manifest variables, P ′(X) can be a good approximation of P (X). Hence, inference

results obtained in the LTM can be accurate approximations of those in the original BN.
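The offline/online division can be summarized by the following sketch; the functions sample_from_bn, learn_ltm_with_hcl, and ltm_posterior are hypothetical placeholders for the steps described above, not an actual API.

def build_offline_approximation(bn, sample_size, complexity_bound):
    # Offline phase: draw samples from the (possibly intractable) BN and
    # fit an LTM over the same variables under a complexity constraint.
    data = sample_from_bn(bn, sample_size)            # hypothetical sampler
    ltm = learn_ltm_with_hcl(data, complexity_bound)  # hypothetical learner
    return ltm

def answer_query(ltm, query_vars, evidence):
    # Online phase: answer P(query_vars | evidence) in the LTM instead of
    # the original BN; tree-structured inference is linear in the tree size.
    return ltm_posterior(ltm, query_vars, evidence)   # hypothetical inference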

Among the aforementioned three LTM learning algorithms, HCL was designed specifically for this application. This application has three characteristics. First, the

number of manifest variables, i.e., the number of variables in X, can be large. Second,

the sample size needs to be large for accurate approximation. In practice, one wants to

set the sample size as large as possible. Third, it is desirable to impose a bound on the

complexity of the resulting LTM. In many real-world applications, such as embedded systems, one would want to guarantee that inference can be done within a certain time limit.

HCL is ideal for such situations.

Empirical results on an array of example BNs show that our approximate inference method

can achieve high approximation accuracy with low online computational time. It is often


orders of magnitude faster than exact inference. It also consistently outperforms loopy

belief propagation (Pearl, 1988), a previous approximate inference algorithm that is widely

used in many domains (Frey & MacKay, 1997; Murphy et al., 1999).

The second application we consider is classification. Let C and X be the class variable

and the set of attributes respectively. From the probabilistic perspective, the task is to

determine P (C|X). The Bayes rule states that

P(C|\mathbf{X}) = \frac{P(C)\, P(\mathbf{X}|C)}{P(\mathbf{X})}.

So, one approach to classification is to first estimate the class-conditional distribution

P (X|C = c) for each class C = c and then use the Bayes rule for classification.
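In log space, the resulting decision rule is an argmax over classes of the log prior plus the class-conditional log-likelihood. The sketch below is ours; it assumes each class-conditional density is available as a callable that returns log P(x | C = c), which is what a learned density model would provide.

import math

def classify(x, priors, class_conditional_loglik):
    # priors: dict class -> P(C=c); class_conditional_loglik: dict
    # class -> function returning log P(x | C=c), e.g. a model's log-density.
    scores = {c: math.log(priors[c]) + class_conditional_loglik[c](x)
              for c in priors}
    return max(scores, key=scores.get)

# Toy usage with hand-made densities over a single binary attribute.
priors = {"pos": 0.4, "neg": 0.6}
loglik = {"pos": lambda x: math.log(0.9 if x[0] == 1 else 0.1),
          "neg": lambda x: math.log(0.2 if x[0] == 1 else 0.8)}
print(classify((1,), priors, loglik))   # -> "pos"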

We propose to estimate the class-conditional distributions P (X|C = c) using LTMs.

This idea is interesting for three reasons. First, because LTMs can represent complex

relationships among manifest variables, the estimation can be accurate. This implies that

the classification accuracy can be high. Second, LTMs have low computational complexity.

This means that the online classification time would be short. Third, the LTMs might

reveal interesting latent structures behind the data. This is conducive to user confidence

in the classifiers.

In this application, quality of model structures is an important consideration. Good

model structures not only help to boost user confidence, but also allow us to find an

appropriate balance between fit to data and model complexity. In other words, they help

us to avoid overfitting. As such, the HCL algorithm is not suitable here. We consider

only the use of EAST and Pyramid.

Empirical results on a large number of data sets show that our method compares

favorably, in terms of classification accuracy, with related alternative methods such as

the naïve Bayes classifier, tree-augmented naïve Bayes (Friedman et al., 1997), averaged

one-dependence estimators (Webb et al., 2005), and decision tree (Quinlan, 1993). It

is fast in terms of online classification time. And it is unique in that it can discover

interesting latent structures. The empirical results also suggest that Pyramid achieves a

better tradeoff between classification accuracy and training time than EAST. When one

moves from EAST to Pyramid, the classification accuracy drops only slightly, while the

training time decreases drastically.

1.5 Organization

The rest of this thesis is structured as follows. In the next chapter, we review basic

concepts and facts about BNs and LTMs. In Chapters 3–5, we present the EAST, HCL,


and Pyramid algorithms for learning LTMs. In Chapter 6, we evaluate them empirically

on synthetic density estimation problems. In Chapter 7, we describe the use of LTMs

for approximate inference in BNs. In Chapter 8, we describe the application of LTMs to

classification. Finally, we conclude this thesis in Chapter 9 and discuss possible future

directions.


CHAPTER 2

BACKGROUND

In this chapter, we review basic concepts and facts about Bayesian networks and latent

tree models.

2.1 Notations

In this thesis, we deal only with categorical variables, i.e., variables that take a finite number of values. We use capital letters X, Y, Z to denote random variables, and use lower-case

letters x, y, z to denote specific values that X, Y , Z can take. We use bold-face letters

X, Y, Z to represent sets of random variables, and x, y, z to represent their values.

2.2 Bayesian Networks

Bayesian networks (BNs) are a class of probabilistic models that use graphs to model conditional independence among random variables. They provide a compact representation

for joint probability distributions by exploiting conditional independence relationships.

Formally, a BN N over a set of random variables X consists of two components: a

directed acyclic graph (DAG) G and a collection of conditional probability tables (CPTs)

θG. We will refer to the first component as the structure of the BN, and the second

component as the parameters of the BN.

The structure G encodes the conditional independence relationships among the random

variables X. Each node in G represents a random variable in X. In this thesis, we use

the terms ‘node’ and ‘variable’ interchangeably. Each edge in G represents the direct

dependency between two nodes, while the lack of an edge between two nodes implies that they are conditionally independent given some other variables. In particular, a node is

conditionally independent of all its non-descendant nodes given its parent nodes.

The parameters θG quantify the strength of the dependencies along the edges in G. They consist of a CPT P(X|pa(X)) for each node X given its parent nodes pa(X) in G. If

X is a root node, the set pa(X) is empty, and the conditional distribution P (X|pa(X))

reduces to the marginal distribution P (X). Semantically, the BN N represents a joint


Figure 2.1: The Asia network.

(a) Rooted LTM (b) LTM after root walking (c) Unrooted LTM

Figure 2.2: Rooted latent tree models, latent tree model obtained by root walking, and unrooted latent tree model. The X’s are manifest variables and the Y’s are latent variables.

probability distribution PN (X) over X that decomposes as follows:

P_N(\mathbf{X}) = \prod_{X \in \mathbf{X}} P(X \mid pa(X), \theta_G).

Figure 2.1 shows an example BN called Asia. It models the relationships among the

profile of a patient (whether he visited Asia and whether he smokes), the possibility that

the patient has the diseases of tuberculosis, cancer, and bronchitis, as well as the symptoms

(positive outcome of X-Ray and dyspnea) that the patient may have. The parameters of

this network are given in Table 2.1. Note that each row of the tables represents a conditional

distribution and sums up to one.
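As a concrete instance of this factorization (a worked example of ours using the CPT entries of Table 2.1), the joint probability of one complete assignment is the product of one entry from each CPT:

# P(V=y, S=y, T=n, C=n, B=y, TC=n, X=n, D=y) for the Asia network,
# read off directly from the CPTs in Table 2.1.
p = (0.01      # P(V=y)
     * 0.5     # P(S=y)
     * 0.95    # P(T=n | V=y)
     * 0.9     # P(C=n | S=y)
     * 0.6     # P(B=y | S=y)
     * 1.0     # P(TC=n | T=n, C=n)
     * 0.95    # P(X=n | TC=n)
     * 0.8)    # P(D=y | TC=n, B=y)
print(p)       # approximately 0.00195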

2.3 Latent Tree Models

A latent tree model (LTM) is a special Bayesian network. Its structure is a rooted tree.

The leaf nodes represent manifest variables that are observed, while the internal nodes

represent latent variables that are hidden. Figure 2.2a shows an example LTM. In this

model, X1–X7 represent manifest variables, while Y1–Y3 represent latent variables.


V = y   V = n
0.01    0.99
(a) P(V)

S = y   S = n
0.5     0.5
(b) P(S)

        T = y   T = n
V = y   0.05    0.95
V = n   0.01    0.99
(c) P(T|V)

        C = y   C = n
S = y   0.1     0.9
S = n   0.01    0.99
(d) P(C|S)

        B = y   B = n
S = y   0.6     0.4
S = n   0.3     0.7
(e) P(B|S)

         X = y   X = n
TC = y   0.98    0.02
TC = n   0.05    0.95
(f) P(X|TC)

                TC = y   TC = n
T = y, C = y    1        0
T = y, C = n    1        0
T = n, C = y    1        0
T = n, C = n    0        1
(g) P(TC|T, C)

                D = y   D = n
TC = y, B = y   0.9     0.1
TC = y, B = n   0.7     0.3
TC = n, B = y   0.8     0.2
TC = n, B = n   0.1     0.9
(h) P(D|TC, B)

Table 2.1: The parameters of the Asia network. Abbreviations: V — VisitAsia; S — Smoking; T — Tuberculosis; C — Cancer; B — Bronchitis; TC — TbOrCa; X — XRay; D — Dyspnea.

Following the notation for Bayesian networks, we write an LTM as a pair M = (m, θm).

The second component θm is the same as in BN. It is the collection of parameters of the

LTM and consists of a conditional distribution P (Z|pa(Z)) for each node Z.1 The first

component m denotes the rest of the LTM. It consists of the variables, the cardinalities

of the variables, and the topology of the rooted tree. We sometimes refer to m also as an

LTM.

2.3.1 Learning Latent Tree Models for Density Estimation

Let P (X) be an unknown distribution over a set X of variables. Suppose D is a collection

of data drawn from P (X). There are infinitely many possible LTMs with X as manifest

nodes. Each LTM M represents a distribution PM(X) over X. In this thesis, we are

concerned with the problem of learning LTMs for density estimation, i.e., to find the

1 In an LTM, each node has at most one parent node. We thus use pa(Z) to denote the single parent of a node Z, rather than a set of parents.


LTM that is as close to the generative distribution P (X) as possible.

A commonly used quantity for measuring discrepancy between two distributions is

KL divergence (Cover & Thomas, 1991). The KL divergence of an LTM m from the

generative distribution P (X) is defined as follows:

D(P \,\|\, P_m) = \sum_{\mathbf{X}} P(\mathbf{X}) \log \frac{P(\mathbf{X})}{P_m(\mathbf{X})}.    (2.1)

The smaller the KL divergence, the closer Pm(X) is to P(X), and the better the LTM m. Conceptually, our goal is to find the LTM m⋆ that minimizes the KL divergence.

How do we find m⋆? One way is to view D(P‖Pm) as a scoring function on model m and hill-climb in the space of LTMs to minimize the score. The problem with this method

is that the score D(P‖Pm) cannot be computed because the generative distribution P (X)

is unknown. Fortunately, we do have a data set D that was sampled from P (X). We can

obtain an approximation of D(P‖Pm) based on D. As a matter of fact, D(P‖Pm) can

be approximated using the AIC score (Akaike, 1974) when the sample size is large. The

AIC score of an LTM m is defined as follows:

AIC(m|D) = -2 \Big[ \max_{\theta_m} \log P(D \mid m, \theta_m) - d(m) \Big],    (2.2)

where d(m) is the dimension of model m, i.e., the number of independent parameters of the

model. In the machine learning community, however, researchers usually use the negation of Equation 2.2 (dropping the constant factor of 2), i.e.,

AIC(m|D) = \max_{\theta_m} \log P(D \mid m, \theta_m) - d(m).

The first term is known as the maximized log-likelihood of m. It measures how well model

m fits the data D. The second term is a penalty term for model complexity. The density

estimation problem is thus transformed into the problem of finding the LTM with the

highest AIC score.
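The selection step itself is mechanical. A minimal sketch (ours, using the machine-learning sign convention above, with made-up numbers): given the maximized log-likelihood and the number of independent parameters of each candidate, the model with the highest score is kept.

def aic(max_loglik, dimension):
    # AIC in the machine-learning convention: higher is better.
    return max_loglik - dimension

# Hypothetical candidates: (maximized log-likelihood on D, d(m)).
candidates = {"m1": (-1250.0, 40), "m2": (-1238.0, 65), "m3": (-1246.0, 48)}
scores = {name: aic(ll, d) for name, (ll, d) in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])   # m1 offers the best fit/complexity trade-off here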

There exist other approximations to the KL divergence. For example, leave-one-out

cross-validation (LOOCV) is asymptotically equivalent to the AIC score (Shao, 1997).

However, to use LOOCV is computationally much more demanding than to use the AIC

score. Calculating LOOCV for an LTM needs to optimize the model parameters |D| times,

where |D| denotes the sample size of D. In contrast, calculating AIC score only requires

one pass of parameter optimization. Researchers have also proposed variants to the AIC

score, e.g., the AICc score with a correction for handling small sample sizes. However,

the original AIC score is still the most widely used approximation.

Previous learning algorithms for LTMs all use the BIC score (Schwarz, 1978) for model

selection (Zhang, 2004; Zhang & Kocka, 2004; Chen, 2008). In contrast to AIC, BIC is


based on a different philosophy. The underlying assumption is that the model space

contains the generative model. Given training data, the objective is thus to find the most probable model from the space. The probability of a model is known as the marginal

likelihood. The BIC score is a large sample approximation to the marginal likelihood.

Clearly, the assumption does not hold in our settings. Therefore, the BIC is not suitable

for density estimation problems.

2.3.2 Model Inclusion and Equivalence

Consider two LTMs m and m′ that share the same set of manifest variables X. We say

that m includes m′ if for any parameter value θm′ of m′, there exists a parameter value

θm of m such that

P (X|m, θm) = P (X|m′, θm′).

When this is the case, m can represent any distributions over the manifest variables that

m′ can. As such, the maximized log-likelihood of m is larger than or equal to that of m′:

\max_{\theta_m} \log P(D \mid m, \theta_m) \ge \max_{\theta_{m'}} \log P(D \mid m', \theta_{m'}).

If m includes m′ and vice versa, we say that m and m′ are marginally equivalent.

Marginally equivalent models are equivalent if they have the same number of independent

parameters. It is impossible to distinguish between equivalent models based on data if

the AIC score, or any other penalized likelihood score (Green, 1999), is used for model

selection.

2.3.3 Root Walking and Unrooted LTMs

Let Y1 be the root of a latent tree model m. Suppose Y2 is a child of Y1 and it is also

a latent node. Define another latent tree model m′ by reversing the arrow Y1 → Y2.

Variable Y2 becomes the root in the new model. The operation is called root walking: The

root has walked from Y1 to Y2. The model m′ in Figure 2.2b is the model obtained by

walking the root from Y1 to Y2 in model m.

It has been shown that root walking leads to equivalent models (Zhang, 2004). Therefore, the root and edge orientations of an LTM cannot be determined from data. We can

only learn unrooted LTMs, which are LTMs with all directions on the edges dropped. An

example of an unrooted LTM is given in Figure 2.2c.

An unrooted LTM represents an equivalence class of LTMs. Members of the class are

obtained by rooting the model at various nodes. Semantically it is a Markov random field

over an undirected tree. The leaf nodes are observed while the interior nodes are latent.


Model inclusion and equivalence can be defined for unrooted LTMs in the same way as

for rooted models. In the rest of this thesis, LTMs always mean unrooted LTMs unless it

is explicitly stated otherwise.

2.3.4 Regular LTMs

For a latent variable Y in an LTM, enumerate its neighbors as Z1, Z2, . . . , Zk. An LTM

is regular if for any latent variable Y ,

|Y| \le \frac{\prod_{i=1}^{k} |Z_i|}{\max_{i=1}^{k} |Z_i|},    (2.3)

and when Y has only two neighbors, strict inequality holds and one of the neighbors must

be a latent node.

For any irregular model m, there always exists a regular model m′ that is marginally

equivalent to m and has fewer independent parameters (Zhang, 2004). The model m′ can

be obtained from m through the following regularization process:

1. For each latent variable Y in m,

(a) If it violates inequality (2.3), reduce the cardinality of Y to \prod_{i=1}^{k} |Z_i| / \max_{i=1}^{k} |Z_i|.

(b) If it has only two neighbors with one being a latent node and it violates the

strict version of inequality (2.3), remove Y from m and connect the two neighbors of Y.

2. Repeat Step 1 until no further changes.

The regular model m′ has a higher AIC score than m itself. Therefore, we can restrict

our attention to the space of regular models when searching for the LTM with the highest

AIC score. For a given set of manifest variables, there are only finitely many regular

LTMs (Zhang, 2004).
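Step 1(a) of the regularization process can be checked mechanically. The sketch below (ours) caps a latent variable's cardinality at the bound of inequality (2.3) given the cardinalities of its neighbors.

from math import prod

def regular_cardinality(card_y, neighbor_cards):
    # Enforce |Y| <= prod(|Z_i|) / max(|Z_i|) by reducing |Y| if necessary.
    bound = prod(neighbor_cards) // max(neighbor_cards)
    return min(card_y, bound)

# A latent variable with 10 states and neighbors of cardinalities 2, 3, 2
# is capped at 2*3*2 / 3 = 4 states.
print(regular_cardinality(10, [2, 3, 2]))   # -> 4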


CHAPTER 3

ALGORITHM 1: EAST

In this chapter, we present a search-based algorithm, called EAST, for learning LTMs.

We start with the operators and the search procedure (Section 3.1). Then, we discuss two

issues that are critical to the performance of EAST, namely efficient model evaluation

(Section 3.2) and operation granularity (Section 3.3).

3.1 Search Operators and Search Procedure

EAST hill-climbs in the space of regular LTMs under the guidance of the AIC score. It

uses 5 search operators borrowed from Zhang and Kocka (2004) with minor adaptations. It also adopts a search strategy known as grow-restructure-thin, which originated from

the literature on learning Bayesian networks without latent variables (e.g., Chickering,

2002).

3.1.1 Search Operators

The search operators are: state introduction (SI), node introduction (NI), node relocation

(NR), state deletion (SD), and node deletion (ND). We describe them one by one in the

following.

Given an LTM and a latent variable in the model, the state introduction (SI) operator

creates a new model by adding a state to the domain of the variable. The state deletion

(SD) operator does the opposite. Applying SI on a model m results in another model

that includes m. Applying SD on a model m results in another model that is included by

m.

Node introduction (NI) involves one latent node Y and two of its neighbors. It creates

a new model by introducing a new latent node Z to mediate between Y and the two

neighbors. The cardinality of Z is set to be the same as that of Y . In the model m1 of

Figure 3.1, introducing a new latent node Y3 to mediate Y1 and its neighbors X1 and X2

results in m2. Applying NI on a model m results in another model that includes m. For the

sake of computational efficiency, we do not consider introducing a new node to mediate

Y and more than two of its neighbors. This restriction will be compensated for in the search control.


(a) m1 (b) m2 (c) m3

Figure 3.1: The NI and NR operators. The model m2 is obtained from m1 by introducing a new latent node Y3 to mediate between Y1 and two of its neighbors X1 and X2. The cardinality of Y3 is set to be the same as that of Y1. The model m3 is obtained from m2 by relocating X3 from Y1 to Y3.

Node deletion (ND) is the opposite of NI. It involves two neighboring latent nodes Y

and Z. It creates a new model by deleting Z and making all neighbors of Z other than

Y neighbors of Y . We refer to Y as the anchor variable of the deletion and say that Z

is deleted with respect to Y . In the model m2 of Figure 3.1, deleting Y3 with respect to

Y1 leads us back to the model m1. Applying ND on a model m results in another model

that is included by m if the deleted node has at least as many states as the anchor node.

Node relocation (NR) involves a node W , one of its latent neighbors Y , and another

latent node Z. It creates a new model by relocating W to Z, i.e., removing the link

between W and Y and adding a link between W and Z. In m2 of Figure 3.1, relocating

X3 from Y1 to Y3 results in m3. Unlike in (Zhang & Kocka, 2004), we now do not require

the two latent nodes Y and Z to be neighbors.

There are some boundary conditions on the search operators. The SD operator cannot

be applied to latent variables with only two possible states. The NI and NR operators

cannot be applied if they make some latent nodes leaves. To ensure regularity, a regularization step is applied to every candidate model right after its creation.
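To illustrate how such a structural operator can be realized, here is a sketch of NI (ours) on an assumed adjacency-set representation of an unrooted model; the example structure only loosely echoes m1 of Figure 3.1 and is not taken from the thesis.

def node_introduction(adj, cards, y, n1, n2, new_name):
    # adj: dict node -> set of neighbors; cards: dict node -> cardinality.
    # Introduce new_name between y and its neighbors n1, n2 (NI operator).
    assert n1 in adj[y] and n2 in adj[y]
    for n in (n1, n2):
        adj[y].discard(n)
        adj[n].discard(y)
        adj[n].add(new_name)
    adj[new_name] = {y, n1, n2}
    adj[y].add(new_name)
    cards[new_name] = cards[y]   # the new latent node copies the cardinality of Y
    return adj, cards

# Hypothetical model in which Y1 neighbors X1, X2, X3, and Y2:
adj = {"Y1": {"X1", "X2", "X3", "Y2"}, "Y2": {"Y1", "X4", "X5"},
       "X1": {"Y1"}, "X2": {"Y1"}, "X3": {"Y1"}, "X4": {"Y2"}, "X5": {"Y2"}}
cards = {v: 2 for v in adj}
node_introduction(adj, cards, "Y1", "X1", "X2", "Y3")
print(sorted(adj["Y3"]))   # ['X1', 'X2', 'Y1']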

3.1.2 Brute-Force Search

Let m be an LTM. In the following we use NI(m), SI(m), NR(m), ND(m), and SD(m)

to respectively denote the sets of candidate models that one can obtain by applying the

five search operators on m. The models are sometimes referred to as NI, SI, NR, ND, SD

candidate models, respectively. The union of the five sets is denoted by ALL(m).

Suppose we are given a data set D and an initial model m. Algorithm 3.1 gives a

brute-force search algorithm for learning an LTM.


Algorithm 3.1 BruteForce(m, D)
1: while true do
2:   m1 ← arg max_{m′∈ALL(m)} AIC(m′|D)
3:   if AIC(m1|D) ≤ AIC(m|D) then
4:     return m
5:   else
6:     m ← m1
7:   end if
8: end while

Brute-force search is inefficient for two reasons. First, it evaluates a large number

of candidate models at each step. Let n, l, and r be the number of manifest nodes,

the number of latent nodes, and the maximum number of neighbors that any latent

node has in the current model, respectively. The numbers of candidate models that the

five operators SI, SD, NI, ND and NR generate are O(l), O(l), O(lr(r − 1)/2), O(lr)

and O(l(l + n)) respectively. So the brute-force algorithm evaluates a total of O(l(2 + r/2 + r^2/2 + l + n)) candidate models at each step. Most of the candidate models

are generated by the NI and NR operators.
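These counts are easy to tabulate. The small sketch below (ours) treats the big-O expressions above as exact per-operator counts for illustration and evaluates them for given l, r, and n.

def candidate_counts(l, r, n):
    # Candidate models generated at one search step (see the counts above).
    counts = {
        "SI": l,
        "SD": l,
        "NI": l * r * (r - 1) // 2,
        "ND": l * r,
        "NR": l * (l + n),
    }
    counts["total"] = sum(counts.values())
    return counts

# E.g. a model with 4 latent nodes, at most 5 neighbors each, 12 manifest nodes.
print(candidate_counts(4, 5, 12))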

Second, one needs to compute the maximized log-likelihood of each candidate model

m′ in order to calculate its AIC score. This requires the expectation-maximization (EM)

algorithm due to the presence of latent variables. EM is known to be time-consuming.

We will next describe a search procedure that generates fewer candidate models than

brute-force search. In Section 3.2, we will present an efficient way to evaluate candidate

models.

3.1.3 EAST Search

The five operators can be classified into three groups. The NI and SI operators produce

candidate models that include the current model. They are hence expansion operators.

The ND and SD operators produce candidate models that are included by the current

model. They are hence simplification operators. NR does not alter nodes in the current

model. It only changes the connections between the nodes. Hence we call it an adjustment

operator.

The EAST algorithm is given in Algorithm 3.2. At each step of search, EAST uses

only a subset of the search operators instead of all of them. More specifically, it divides
the search into three stages: expansion, adjustment, and simplification. At each stage, it uses

only the operators from the corresponding group. For example, it searches only with the

expansion operators at the expansion stage. If the model score is improved in any of the

three stages, the algorithm continues search by repeating the loop. This is why it is called


Algorithm 3.2 EAST(m, D)
1: while true do
2:   m1 ← Expand(m, D)
3:   m2 ← Adjust(m1, D)
4:   m3 ← Simplify(m2, D)
5:   if AIC(m3|D) ≤ AIC(m|D) then
6:     return m
7:   else
8:     m ← m3
9:   end if
10: end while

Algorithm 3.3 Expand(m, D)
1: while true do
2:   m1 ← PickModel-IR(NI(m) ∪ SI(m), m)
3:   if AIC(m1|D) ≤ AIC(m|D) then
4:     return m
5:   else if m1 ∈ NI(m) then
6:     m ← EnhanceNI(m1, m, D)
7:   else
8:     m ← m1
9:   end if
10: end while

‘EAST’ — Expansion, Adjustment, Simplification until Termination.

At the expansion stage, EAST searches with the expansion operators until the AIC

score ceases to increase. See Algorithm 3.3. To understand the intuition, recall that the

AIC score consists of a term that measures model fit and another term that penalizes for

model complexity. If we start with a model that fits data poorly, which is usually the

case, then improving model fit is the first priority. Model fit can be improved by searching

with the expansion operators. This is exactly what EAST does at the expansion stage.

The pseudo code for the expansion stage contains two subroutines. The subroutine

PickModel-IR(NI(m) ∪ SI(m), m) selects one model from all the candidate models gen-

erated from m by the NI and SI operators. It will be discussed in detail in the next two

sections.

The second subroutine EnhanceNI is called after each application of the NI operator.

This is to compensate for the constraint imposed on NI. Consider the model m1 in Figure

3.1. We can introduce a new latent node Y3 to mediate Y1 and two of its neighbors, say

X1 and X2, and thereby obtain the model m2. However, we are not allowed to introduce

a latent node to mediate Y1 and more than two of its neighbors, say X1, X2, and X3, and

thereby obtain m3. As a remedy we consider, after each application of the NI operator,

enhancements to the operation. As an example, suppose we have just applied NI to m1

and have obtained m2. What we do next is to consider relocating the other neighbors of


Algorithm 3.4 EnhanceNI(m′, m, D)
1: while L ≠ ∅ do
2:   m′_{W1→Z} ← PickModel({m′_{W→Z} | W ∈ L}, m)
3:   if AIC(m′_{W1→Z}|D) ≤ AIC(m′|D) then
4:     return m′
5:   else
6:     m′ ← m′_{W1→Z}
7:     L ← L \ {W1}
8:   end if
9: end while

Algorithm 3.5 Adjust(m, D)
1: while true do
2:   m1 ← PickModel(NR(m), m)
3:   if AIC(m1|D) ≤ AIC(m|D) then
4:     return m
5:   else
6:     m ← m1
7:   end if
8: end while

Y1 in m1, i.e., X3, X4, X5 and Y2, to the new latent variable Y3. If it turns out to be

beneficial to relocate X3 but not the other three nodes, then we obtain the model m3.

In general, suppose we have just introduced a new node Z into the current model m to

mediate a latent node Y and two of its neighbors, and obtained a candidate model m′. Let

L be the list of all the other neighbors of Y in m. For any W ∈ L, use m′W→Z to denote

the model obtained from m′ by relocating W to Z. What we do next is to enhance the NI

operation using the subroutine described in Algorithm 3.4. The subroutine PickModel

selects and returns one model from a list of candidate models. It will be given in the next

section.

After model expansion ceases to increase the AIC score, EAST enters the adjustment

stage (Algorithm 3.5). At this stage, EAST repeatedly relocates nodes in the current

model until it is no longer beneficial to do so, and there is no restriction on how far away

a node can be relocated. Node relocation is necessary because multiple latent nodes are

usually introduced during model expansion and two nodes that should be together might

end up at different parts of the model at the end of the expansion process.

The adjustment stage is followed by the simplification stage (Algorithm 3.6). At this

stage EAST first repeatedly applies ND to the current model until the AIC score ceases

to increase and then it does the same with SD. We choose not to consider ND and SD

simultaneously because that would be computationally more expensive and it is not clear

whether that would be helpful in avoiding local maxima.


Algorithm 3.6 Simplify(m, D)
1: while true do
2:   m1 ← PickModel(ND(m), m)
3:   if AIC(m1|D) ≤ AIC(m|D) then
4:     break
5:   else
6:     m ← m1
7:   end if
8: end while
9: while true do
10:   m1 ← PickModel(SD(m), m)
11:   if AIC(m1|D) ≤ AIC(m|D) then
12:     return m
13:   else
14:     m ← m1
15:   end if
16: end while

At each step in the expansion stage, EAST generates O(l + lr(r − 1)/2) candidate

models. At each step in the adjustment stage, EAST generates O(l(l + n)) candidate

models. The simplification stage consists of two sub-stages. At the first sub-stage, EAST

searches with the ND operator and generates O(lr) candidate models at each step. At

the second sub-stage, EAST searches with the SD operator and generates O(l) candidate

models at each step. So EAST generates fewer candidate models than the brute-force

algorithm at each step of search.

3.2 Efficient Model Evaluation

The PickModel subroutine is supposed to find, from a list of candidate models, the model

with the highest AIC score. A straightforward way to do so is to calculate the AIC

score of each candidate model and then pick the best one. Calculating the AIC scores

of a large number of models exactly is computationally prohibitive. So, we propose to

use approximations of the AIC score for model selection. In this section, we present

one approximation of the AIC score that is easy to compute. The idea is to replace the

likelihood term with what we call restricted likelihood. We begin by discussing parameter

sharing between a candidate model and the current model.

3.2.1 Parameter Sharing among Models

Conceptually we work with unrooted LT models. In implementation, however, we repre-

sent unrooted models as rooted models. Rooted LTMs are Bayesian networks and their


Figure 3.2: A candidate model obtained by modifying the model in Figure 2.2: (a) m′ (rooted); (b) m′ (unrooted). The two models share the parameters for describing the distributions P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), P(X3|Y2), P(X4|Y1), P(Y3|Y1), and P(X5|Y3). On the other hand, the parameters for describing P(Y4|Y3), P(X6|Y4), and P(X7|Y4) are peculiar to the candidate model.

parameters are defined without ambiguity. This makes it easy to see how the parameter

composition of a candidate model is related to that of the current model.

Consider the model m in Figure 2.2. Let m′ be the model obtained from m by

introducing a new latent node Y4 to mediate Y3 and two of its neighbors X6 and X7, as

shown in Figure 3.2. If both m and m′ are represented as rooted models, their parameter

compositions are clear. The two models share parameters for describing the distributions

P (Y1), P (Y2|Y1), P (X1|Y2), P (X2|Y2), P (X3|Y2), P (X4|Y1), P (Y3|Y1), and P (X5|Y3). On

the other hand, the parameters for describing P (Y4|Y3), P (X6|Y4), and P (X7|Y4) are

peculiar to m′ while those for describing P (X6|Y3) and P (X7|Y3) are peculiar to m.

We write the parameters of a candidate model m′ as a pair (θ′1, θ′2), where θ′1 is the

collection of parameters that m′ shares with the current model m. The other parameters

θ′2 are peculiar to m′ and are called new parameters of m′. Similarly we write the param-

eters of the current model m as a pair (θ1, θ2), where θ1 is the collection of parameters

that m shares with m′.

One unrooted LTM can be represented by multiple rooted LTMs. In the aforemen-

tioned example, if the representation of m′ is rooted at Y3 instead of Y1, then we would

have P (Y3) and P (Y1|Y3) instead of P (Y1) and P (Y3|Y1). The parameters describing P (Y3)

and P (Y1|Y3) would be peculiar to m′. However, this is due to the choice of representation

rather than search operation. Hence those parameters are not genuinely new parameters.

In implementation, one needs to coordinate the representations of the current model and

the candidate models so as to avoid such fake new parameters.

3.2.2 Restricted Likelihood

Suppose we have computed the MLE (θ⋆1, θ⋆2) of the parameters of the current model m. For a given value of θ′2, (m′, θ⋆1, θ′2) is a fully specified Bayesian network. In this network, we can compute

    P(D|m′, θ⋆1, θ′2) = ∏_{d∈D} P(d|m′, θ⋆1, θ′2).

As a function of θ′2, this is referred to as the restricted likelihood function of m′. The maximum restricted log-likelihood, or simply the maximum RL, of the candidate model m′ is defined to be

    max_{θ′2} log P(D|m′, θ⋆1, θ′2).

Replacing the likelihood term in the AIC score of m′ with its maximum RL, we get the following approximate score:

    AICRL(m′|D) = max_{θ′2} log P(D|m′, θ⋆1, θ′2) − d(m′).    (3.1)

We propose that PickModel uses the AICRL score for model selection instead of the

AIC score. It should be noted that the idea of optimizing only some parameters of a

model while freezing the others is used in, among others, phylogenetic tree reconstruction

(Guindon & Gascuel, 2003) and learning of continuous Bayesian networks (Nachmana

et al., 2004).

Next we describe an efficient method for approximately calculating the AICRL score.

The method is called local EM.

3.2.3 Local EM

Local EM works in the same way as EM except with the value of θ′1 fixed at θ⋆1. It starts with an initial value δ2^{(0)} for θ′2 and iterates. After t − 1 iterations, it obtains δ2^{(t−1)}. At iteration t, it completes the data D using the Bayesian network (m′, θ⋆1, δ2^{(t−1)}), calculates some sufficient statistics, and therefrom obtains δ2^{(t)}. Suppose the parameters θ′2 of m′ describe distributions P(Zj|Wj) (j = 1, . . . , ρ).¹ The distributions P(Zj|Wj, δ2^{(t)}) that make up δ2^{(t)} can be obtained in two steps:

• E-Step: For each data case d ∈ D, make inference in the Bayesian network (m′, θ⋆1, δ2^{(t−1)}) to compute

    P(Zj, Wj|d, m′, θ⋆1, δ2^{(t−1)})    (j = 1, . . . , ρ).

¹ When Zj is the root, Wj is to be regarded as a vacuous variable and P(Zj|Wj) is simply P(Zj).


• M-Step: Obtain

    P(Zj|Wj, δ2^{(t)}) = f(Zj, Wj) / ∑_{Zj} f(Zj, Wj)    (j = 1, . . . , ρ),

where the sufficient statistic f(Zj, Wj) = ∑_{d∈D} P(Zj, Wj|d, m′, θ⋆1, δ2^{(t−1)}).

Local EM converges. That is, the series of log-likelihoods {log P(D|m′, θ⋆1, δ2^{(t)}) | t = 0, 1, . . .} increases monotonically with t and is upper-bounded by 0.
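The update just described can be sketched as follows. The sketch assumes a tabular representation in which each new CPT j is an array whose rows are indexed by the states of Wj and columns by the states of Zj, and it delegates the E-step inference to a caller-supplied posterior routine; the function names, the array layout, and the small smoothing constant are illustrative assumptions rather than part of the thesis implementation.

```python
import numpy as np
from typing import Callable, List

def local_em(data: List[dict],
             new_cpts: List[np.ndarray],
             posterior: Callable[[dict, List[np.ndarray]], List[np.ndarray]],
             num_iterations: int = 50,
             smoothing: float = 1e-6) -> List[np.ndarray]:
    """One run of local EM: only the new parameters delta2 are updated, while the
    shared parameters theta1* stay frozen inside the supplied `posterior` routine,
    which returns, for one data case, P(Z_j, W_j | d) for every j as a
    (|W_j|, |Z_j|)-shaped array."""
    delta2 = [cpt.copy() for cpt in new_cpts]
    for _ in range(num_iterations):
        # E-step: accumulate expected counts f(Z_j, W_j) over all data cases
        counts = [np.full(cpt.shape, smoothing) for cpt in delta2]  # smoothing avoids empty rows
        for d in data:
            for j, joint in enumerate(posterior(d, delta2)):
                counts[j] += joint
        # M-step: renormalize each row to obtain P(Z_j | W_j, delta2^(t))
        delta2 = [c / c.sum(axis=1, keepdims=True) for c in counts]
    return delta2
```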

Unlike local EM, standard EM optimizes all parameters. To avoid potential confusion,

we call it full EM. The M-step of a local EM is computationally much cheaper than that of

a full EM because a local EM updates fewer parameters. For the candidate model shown

in Figure 3.2, we need to update only the parameters that describe P (Y4|Y3), P (X6|Y4)

and P (X7|Y4). Besides reduction in computation, this fact also implies that a local EM

takes fewer steps to converge than a full EM.

3.2.4 Avoiding Local Maxima

Like full EM, local EM might get stuck at local maxima. To avoid the local maxima,

we adopt the scheme proposed by Chickering and Heckerman (1997a) and call it the

pyramid scheme. The idea is to randomly generate a number µ of initial values for the

new parameters θ′2, resulting in µ initial models. One local EM iteration is run on all the

models and afterwards the bottom µ/2 models with the lowest log-likelihood are discarded.

Then two local EM iterations are run on the remaining models and afterwards the bottom

µ/4 models are discarded. Then four local EM iterations are run on the remaining models,

and so on. The process continues until there is only one model. After that, some more

local EM iterations are run on the remaining model, until the total number of iterations

reaches a predetermined number ν. Hence, there are two algorithmic parameters µ and

ν.
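A sketch of this successive-halving schedule is given below. The helper names em_steps (which runs a given number of local EM iterations on one parameter set) and loglik are hypothetical, and the reading that the iteration budget ν is counted per surviving parameter set is our interpretation of the description above.

```python
from typing import Callable, List, TypeVar

Params = TypeVar("Params")  # placeholder for one set of new-parameter values

def pyramid_local_em(initial_params: List[Params],
                     em_steps: Callable[[Params, int], Params],
                     loglik: Callable[[Params], float],
                     total_iterations: int) -> Params:
    """Successive halving: run 1 local EM iteration on all mu random starts,
    drop the worse half, run 2 more iterations, drop half again, and so on;
    finish the single survivor with the remaining iteration budget."""
    survivors = list(initial_params)
    used, steps = 0, 1
    while len(survivors) > 1:
        survivors = [em_steps(p, steps) for p in survivors]
        used += steps
        survivors.sort(key=loglik, reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
        steps *= 2
    best = survivors[0]
    if total_iterations > used:
        best = em_steps(best, total_iterations - used)
    return best
```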

Suppose m′ is a candidate model obtained from the current model m. Use LocalEM(m, m′, µ, ν) to denote the procedure described in the previous paragraph. The output is an estimate of the new parameters θ′2. Denote it by θ2. The PickModel subroutine evaluates m′ using the following quantity:

    AIC(m′, θ⋆1, θ2|D) = log P(D|m′, θ⋆1, θ2) − d(m′).    (3.2)

Note that the AIC score given here is for a model m′ and a set of parameter values θ⋆1 and θ2 for the model. In contrast, the AIC score given by Equation (2.2) is for a model only.


3.2.5 Two-Stage Model Evaluation

Local EM is faster than full EM. To achieve further speedup, we propose to divide model

evaluation into two stages, a screening stage and an evaluation stage. In the screening

stage, we screen out most of the candidate models by running local EM at a low setting,

while in the evaluation stage we evaluate the remaining models by running local EM at a

high setting. We call this approach two-stage model evaluation.

In local EM, the parameter µ controls the number of initial points and the parameter

ν controls the number of iterations. For the screening stage, we fix the first parameter at 1

and we allow only the second parameter to vary. To distinguish it from the corresponding

parameter at the evaluation stage, we denote it by νs.

Because local EM starts from only one initial point at the screening stage, there is no effort to avoid local maxima at all. We argue that this does not cause serious

problems because there is an implicit local-maximum-avoidance mechanism built in. A

particular application of a search operator is called a search operation. It corresponds to

one candidate model. So, evaluation of candidate models can also be viewed as evaluation

of search operations. Suppose local EM picks a poor initial point at one step when

evaluating an operation and consequently the operation is screened out. Chances are

that the same operation is also applicable at the next few steps. In that case local EM

would be called to evaluate the operation again and again, each time from a different

starting point. So in the end local EM is run from multiple starting points to evaluate

the operation. If the operation is a good one, there is high probability for it to be picked

at one of those steps.

3.2.6 The PickModel Subroutine

Finally, the pseudo code for PickModel is given in Algorithm 3.7. It has four algorithmic

parameters. The parameters νs and k control the screening stage, while µ and ν control

the evaluation stage.

PickModel involves only local EM. In EAST (Algorithm 3.2), full EM is run on the

model m1 returned by PickModel in order to compute its AIC score. This also facilitates

the next call to PickModel. In the next step, m1 will be the current model. So, PickModel

will need the MLE of the parameters of m1 at the next step. To kick start the whole

search process, full EM needs to be run on the initial model. So, EAST runs full EM once

at each step of search.


Algorithm 3.7 PickModel(L, m)
1: for each m′ ∈ L do
2:   Run LocalEM(m, m′, 1, νs) to estimate the parameters of m′
3: end for
4: L′ ← Top k models from L with the highest AIC scores as given by Equation 3.2
5: for each m′ ∈ L′ do
6:   Run LocalEM(m, m′, µ, ν) to estimate the parameters of m′
7: end for
8: return the model in L′ with the highest AIC score as given by Equation 3.2
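Stripped of the EM details, the two-stage selection in Algorithm 3.7 amounts to the following sketch, where cheap_score and full_score are hypothetical callables wrapping local EM at the low setting (1, νs) and at the high setting (µ, ν), respectively, followed by the score of Equation 3.2.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")  # placeholder for a candidate LTM

def pick_model(candidates: List[Model],
               cheap_score: Callable[[Model], float],
               full_score: Callable[[Model], float],
               k: int) -> Model:
    """Two-stage evaluation: screen every candidate with a cheap approximate
    score, keep the top k, then re-evaluate only those with the expensive
    score and return the winner."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:k]
    return max(shortlist, key=full_score)
```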

3.3 Operation Granularity

At the expansion stage, EAST does not select models using the subroutine PickModel.

Rather it uses another subroutine called PickModel-IR. This is to deal with the issue of

operation granularity.

Operation granularity refers to the phenomenon where some operations might increase

the complexity of the current model much more than other operations. As an example,

consider the situation where there are 100 binary manifest variables. Suppose the search

starts with the LC model with one binary latent node Y . Applying the SI operator to the

model would introduce 101 additional model parameters, while applying the NI operator

to the model would increase the number of model parameters by only 2. The latter

operation is clearly of much finer-grain than the former.

It has been observed that operation granularity often leads to local maxima when BIC

score is used to guide the search (Zhang & Kocka, 2004). The reason is that, at the early

stage of search, SI operations are usually of larger grain than NI operations and often

have higher BIC scores. So, SI operations tend to be applied early, which sometimes leads

to fat latent variables, i.e., latent variables that have excessive numbers of states. Fat

latent variables tend to attract excessive numbers of neighbors. This makes it difficult to

thin fat variables despite the SD operator. Local maxima are consequently produced

(Chen, 2008). Though the phenomenon is observed when BIC score is used to guide the

search, we believe that it will also happen in the case of AIC score.

One might suggest that we deal with fat latent variables by introducing an additional

search operator that simultaneously reduces the number of states and the number of

neighbors of a latent variable. However, this would complicate algorithm design and

would increase the complexity of the search process. We adopt a simple and effective

strategy called the cost-effectiveness principle (Zhang & Kocka, 2004).

Algorithm 3.8 PickModel-IR(L, m)
1: for each m′ ∈ L do
2:   Run LocalEM(m, m′, 1, νs) to estimate the parameters of m′
3: end for
4: L′ ← Top k models from L with the highest IR scores as given by Equation 3.4
5: for each m′ ∈ L′ do
6:   Run LocalEM(m, m′, µ, ν) to estimate the parameters of m′
7: end for
8: return the model in L′ with the highest IR score as given by Equation 3.4

Let m be the current model and m′ be a candidate model. Define the improvement ratio of m′ over m given data D to be

    IR(m′, m|D) = [AIC(m′|D) − AIC(m|D)] / [d(m′) − d(m)].    (3.3)

It is the increase in model score per unit increase in model complexity. The cost-

effectiveness principle stipulates that one chooses, among a list of candidate models, the

one with the highest improvement ratio.

The principle is applied only to candidate models generated by the SI and NI operators. The other operators do not, or do not necessarily, increase model complexity. Hence the term d(m′) − d(m) is, or might be, negative.

Like PickModel, PickModel-IR does not run full EM to optimize the parameters of the

candidate models. Instead, it inherits the values of the old parameters from the current

model and runs local EM to optimize only the new parameters. Let m be the current

model and m′ be a candidate model obtained from m. Suppose the MLE (θ⋆1, θ⋆2) of the parameters of m has been computed. Let θ2 be the estimate of the new parameters of m′ obtained by local EM. The subroutine PickModel-IR evaluates the candidate model m′ using the following IR score:

    IR(m′, m, θ⋆1, θ⋆2, θ2|D) = [AIC(m′, θ⋆1, θ2|D) − AIC(m, θ⋆1, θ⋆2|D)] / [d(m′) − d(m)].    (3.4)

The pseudo code of PickModel-IR is given in Algorithm 3.8. It is the same as

PickModel except that IR scores, rather than AIC scores, are used to evaluate candidate

models.

3.4 Summary

EAST is the state-of-the-art algorithm for learning LTMs. It systematically searches for

LTMs with the highest AIC score in a principled manner. It adopts the grow-restructure-

thin strategy and structures the search into three stages. It has an efficient method for


model evaluation that is based on an approximation to the AIC score. EAST can induce high

quality models from data and can discover underlying latent structures. We will provide

empirical evidence for those points in Chapter 6.


CHAPTER 4

ALGORITHM 2: HIERARCHICAL CLUSTERING LEARNING

EAST is a principled search-based learning algorithm. It can induce high quality LTMs,

and hence yield good solutions to density estimation problems. Moreover, it is capable

of revealing interesting latent structures. All those merits come at a cost, namely high

computational complexity. This drawback limits the applicability of EAST to large-scale

problems.

In this chapter, we develop a heuristic learning algorithm that is much more efficient

than EAST. The new algorithm is called HCL, a shorthand for Hierarchical Clustering

Learning of LTMs. It targets density estimation problems with large training samples,

and assumes that there is a predetermined constraint on the inferential complexity of

the resulting LTM. By inferential complexity, we mean the computational cost of making

inference in the model. In Chapter 7, the reader will see an application that suits HCL.

4.1 Heuristic Construction of Model Structure

We first present the heuristic that HCL uses to determine the model structure.

4.1.1 Basic Ideas

We start with a definition.

Definition 4.1 In an LTM, two manifest variables are called siblings if they share the

same parent.

Our heuristic is based on three ideas:

1. In an LTM M, siblings are generally more closely correlated than variables that are located far apart.

2. If M is a good estimate of the generative distribution P(X), then two variables X and X′ are closely correlated in M if and only if they are closely correlated in P(X).


          X1      X2      X3      X4      X5
X1        -       -       -       -       -
X2      0.0000    -       -       -       -
X3      0.0003  0.0971    -       -       -
X4      0.0015  0.0654  0.0196    -       -
X5      0.0017  0.0311  0.0086  0.1264    -
X6      0.0102  0.0252  0.0080  0.1817  0.0486

Table 4.1: The empirical MI between the manifest variables.

3. Given a large data set D drawn from the generative distribution P (X), if two vari-

ables X and X ′ are closely correlated in P (X), then they should also reflect strong

interactions in D.

Therefore, we can examine each pair of variables in data D, pick the two variables that

are most closely correlated, and introduce a latent variable as their parent in M.

Denote by P(X) the empirical distribution over X induced by the data D. We

measure the strength of correlation between a pair of variables X and X ′ by the empirical

mutual information (MI) (Cover & Thomas, 1991)

    I(X; X′) = ∑_{X,X′} P(X, X′) log [ P(X, X′) / (P(X) P(X′)) ].
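For discrete data, the empirical MI can be computed directly from co-occurrence counts. A minimal sketch (with hypothetical argument names, assuming each data case is a tuple of variable values) is:

```python
from collections import Counter
from math import log

def empirical_mi(data, i, j):
    """Empirical mutual information between the i-th and j-th variables of a
    list of discrete data cases, using natural logarithms."""
    n = len(data)
    joint = Counter((d[i], d[j]) for d in data)
    px = Counter(d[i] for d in data)
    py = Counter(d[j] for d in data)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())
```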

Example 4.1 Consider a data set D containing 6 binary variables X1, X2, . . ., X6.

Suppose the empirical MI based on D is as presented in Table 4.1. We find that X4 and

X6 are the pair with the largest MI. Therefore, we create a latent variable Y1 and make it

the parent of X4 and X6. See the lower-left corner of the model m in Figure 4.1a.

The next step is to find, among Y1, X1, X2, X3, and X5, the pair of variables with the

largest MI. The next subsection explains how.

4.1.2 MI Between A Latent Variable and Another Variable

To pick the best pair among Y1, X1, X2, X3, and X5, we need to calculate the MI between

all possible pairs. There is one difficulty: Y1 is not in the generative distribution and thus

not observed in the data set. Consequently, the MI between Y1 and the other variables

cannot be computed directly. We hence seek an approximation.

In the final model M, Y1 would d-separate X4 and X6 from the other variables.

Therefore, for any X ∈ {X1, X2, X3, X5}, we have

IM(Y1; X) ≥ IM(X4; X), IM(Y1; X) ≥ IM(X6; X).


Figure 4.1: An illustrative example: (a) latent tree model m; (b) regularized model m′; (c) simplified model m′′. The numbers within the parentheses denote the cardinalities of the variables.

We hence approximate IM(Y1; X) using the lower bound

max{IM(X4; X), IM(X6; X)}.

In general, suppose we need to calculate the MI between a latent variable Y and

another variable Z in M. Denote the two children of Y by W and W′. We estimate the

MI as follows:

IM(Y ; Z) ≈ max{IM(W ; Z), IM(W ′; Z)}.

Example 4.2 Back to our running example, the estimated MI between Y1 and X1, X2,

X3, X5 is as presented in Table 4.2a. We see that the next pair to pick is Y1 and X5. We

introduce a latent variable Y2 as the parent of Y1 and X5. The process continues.

The next step is to find the best pair among Y2, X1, X2, and X3. The estimated MI

between Y2 and X1, X2, X3 is as shown in Table 4.2b. It is clear that the pair X2 and X3

have the largest MI. We hence introduce a latent variable Y3 as the parent of X2 and X3.

The process moves on.

Next, we pick the best pair among Y3, X1, and Y2. The estimated MI for Y3 is given


(a) Y1:
          X1      X2      X3      X5
Y1      0.0102  0.0654  0.0196  0.1264

(b) Y2:
          X1      X2      X3
Y2      0.0102  0.0654  0.0196

(c) Y3:
          X1      Y2
Y3      0.0003  0.0654

Table 4.2: The estimated MI between each latent variable and the other variables.

in Table 4.2c. The pair Y2 and Y3 has the largest MI. We thus add a latent variable Y4

and make it the parent of Y2 and Y3.

Lastly, we introduce a latent variable Y5 for the only two remaining variables Y4 and

X1. The final model structure is a binary tree as shown in Figure 4.1a.
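The whole bottom-up construction of the model structure can be sketched as follows. The representation (a dictionary of pairwise MI values keyed by frozensets of names, a child-to-parent map, and latent names Y1, Y2, . . .) is an illustrative choice rather than the thesis implementation; the MI between a newly created latent variable and any other variable is approximated by the maximum of its two children's MI values, as described above.

```python
def build_binary_structure(variables, mi):
    """Bottom-up HCL structure construction.  `variables` lists the manifest
    variable names; `mi` maps frozenset({u, v}) -> empirical MI for every pair
    of manifest variables.  Returns a child -> parent map; the last latent
    created is the root."""
    work = list(variables)
    pairwise = dict(mi)          # copied so entries for latents can be added
    parent, count = {}, 0
    while len(work) > 1:
        # pick the two most closely correlated variables in the working list
        a, b = max(((u, v) for u in work for v in work if u < v),
                   key=lambda p: pairwise[frozenset(p)])
        count += 1
        latent = f"Y{count}"
        parent[a], parent[b] = latent, latent
        work.remove(a)
        work.remove(b)
        # approximate MI between the new latent and every remaining variable
        for w in work:
            pairwise[frozenset((latent, w))] = max(pairwise[frozenset((a, w))],
                                                   pairwise[frozenset((b, w))])
        work.append(latent)
    return parent
```

On the MI values of Table 4.1, this sketch reproduces the merge order of Examples 4.1 and 4.2: it first joins X4 and X6 under Y1, then Y1 and X5 under Y2, then X2 and X3 under Y3, then Y2 and Y3 under Y4, and finally Y4 and X1 under Y5.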

4.2 Cardinalities of Latent Variables

After obtaining a model structure, the next step is to determine the cardinalities of the

latent variables. We set the cardinalities of all the latent variables at a certain value C.

In the following we show that, under the assumption of large training samples, one can

achieve better estimation by choosing larger C. We then derive the maximum possible

value of C subject to the presumed constraint on the inferential complexity of the resultant

model.

4.2.1 Larger C for Better Estimation

We first discuss the impact of the value of C on the estimation quality. We start by

considering the case where C equals ∏_{X∈X} |X|, i.e., the product of the cardinalities

of all the manifest variables. In this case, each latent variable can be viewed as a joint

variable of all the manifest variables. Intuitively, for an arbitrary distribution P (X), we

can set the parameters θm so that the obtained LTM represents P (X) exactly. That is, m

can capture all the interactions among the manifest variables. This intuition is justified

by the following proposition.

Proposition 4.1 Let m be an LTM over a set X of manifest variables. If the cardinalities of the latent variables in m are all equal to ∏_{X∈X} |X|, then for any joint distribution P(X), there exists a parameter value θm of m such that

    P(X|m, θm) = P(X).    (4.1)

Proof: We enumerate the variables in X as X1, X2, . . . , Xn. Let Y be a latent variable in m. Note that the cardinality of Y equals ∏_{X∈X} |X|. Hence, there are two ways to represent a state of Y. The first is to index a state by a natural number i ∈ {1, 2, . . . , ∏_{X∈X} |X|}. The second is to write a state as a vector <x1, x2, . . . , xn>, where xi denotes a state of Xi for all i = 1, 2, . . . , n. We will use both representations.

Given a joint distribution P (X), we define a parameter value θm of m as follows:

• The prior distribution for the root node W of m:

    P(W = <x1, x2, . . . , xn> | m, θm) = P(X1 = x1, X2 = x2, . . . , Xn = xn).

• The conditional distribution for each non-root latent node Y:

    P(Y = i | pa(Y) = j, m, θm) = 1 if i = j, and 0 otherwise.

• The conditional distribution for each manifest node Xi, i = 1, 2, . . . , n:

    P(Xi = x′i | pa(Xi) = <x1, x2, . . . , xi, . . . , xn>, m, θm) = 1 if xi = x′i, and 0 otherwise.

It is easy to verify that Equation 4.1 holds for the parameter value θm defined as

above.

Q.E.D.

What happens if we decrease C? It can be shown that the estimation quality will

degrade. Let m be a model obtained with value C and m′ be another model obtained

with a smaller value C ′. It is easy to see that m includes m′. The following lemma states

that the estimation quality of m′ is no better than that of m.

Lemma 4.1 Let P (X) be the generative distribution underlying the training data. Let m

and m′ be two models with manifest variables X. If m includes m′, then

    min_{θm} D[P(X)‖P(X|m, θm)] ≤ min_{θm′} D[P(X)‖P(X|m′, θm′)].

Proof: Define

    θ⋆m′ = arg min_{θm′} D[P(X)‖P(X|m′, θm′)].

Because m includes m′, there must be parameters θ⋆m of m such that

    P(X|m, θ⋆m) = P(X|m′, θ⋆m′).

Therefore,

    min_{θm} D[P(X)‖P(X|m, θm)] ≤ D[P(X)‖P(X|m, θ⋆m)]
                                = D[P(X)‖P(X|m′, θ⋆m′)]
                                = min_{θm′} D[P(X)‖P(X|m′, θm′)].

Q.E.D.

As mentioned earlier, when C is large enough, model m can capture all the interactions

among the manifest variables and hence can represent the generative distribution P (X)

exactly. If C is not large enough, we can only represent P (X) approximately. According

to the previous discussion, as C decreases, the estimation accuracy (in terms of KL

divergence) will gradually degrade, indicating that model m can capture fewer and fewer

interactions among the manifest variables. The worst case occurs when C = 1. In this

case, all the interactions are lost. The estimation is the poorest.

4.2.2 Maximum Value of C Under Complexity Constraint

HCL assumes that there is a predetermined bound on the inferential complexity of the

resultant model. So we need to define the inferential complexity of a model first. We

use the clique tree propagation (CTP) algorithm for inference. Therefore, we define the

inferential complexity to be the sum of the clique sizes in the clique tree of m.

Let m be a model obtained by using the technique described in Section 4.1 and setting

the cardinalities of latent variables at C. The following proposition states the relationship

between the inferential complexity of m and the value of C.

Proposition 4.2 The inferential complexity of m is

    (|X| − 2) · C² + ∑_{X∈X} |X| · C.    (4.2)

Note that |X| is the number of manifest variables, while |X| is the cardinality of a manifest

variable X.

Proof: m is a tree model. Therefore, the cliques in the clique tree of m correspond one-to-one to the edges in m. Moreover, the clique corresponding to an edge Z—Z′


consists of {Z, Z′}, and its size equals |Z| · |Z′|. So, in order to calculate the inferential

complexity of m, we only need to enumerate the edges in m.

Recall that m has a binary tree structure over manifest variables X. Therefore, it

contains (2|X| − 2) edges. Among all the edges, |X| edges connect manifest variables to

latent variables, one for each manifest variable. The other (|X| − 2) edges are between

latent variables. Note that all the latent variables have cardinality C. Therefore, the

inferential complexity of m is

    (|X| − 2) · C² + ∑_{X∈X} |X| · C.

Q.E.D.

As a corollary of Proposition 4.2, the maximum achievable value of C subject to a

predetermined constraint on the inferential complexity is given as follows.

Corollary 4.1 Given a bound Imax on the inferential complexity, the maximum possible

value of C is

    Cmax = ⌊(√(b² + 4aImax) − b) / (2a)⌋,    (4.3)

where a = |X| − 2 and b = ∑_{X∈X} |X|.

Proof: The solution can be easily obtained by solving the inequality

    (|X| − 2) · C² + ∑_{X∈X} |X| · C ≤ Imax

and applying the condition that C takes values from the natural numbers.

Q.E.D.

In HCL, we set the cardinalities of the latent variables at Cmax.
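The corollary translates directly into a small helper. The function name is ours, and it assumes at least three manifest variables so that the quadratic coefficient a = |X| − 2 is positive.

```python
import math

def max_latent_cardinality(manifest_cards, i_max):
    """Largest C with (n - 2)*C^2 + (sum of manifest cardinalities)*C <= i_max,
    following Corollary 4.1 (n = number of manifest variables, assumed >= 3)."""
    a = len(manifest_cards) - 2
    b = sum(manifest_cards)
    c = int((math.sqrt(b * b + 4 * a * i_max) - b) / (2 * a))
    # small guards against floating-point rounding at the boundary
    while a * (c + 1) ** 2 + b * (c + 1) <= i_max:
        c += 1
    while c > 1 and a * c * c + b * c > i_max:
        c -= 1
    return c
```

For example, with six binary manifest variables and Imax = 200 the helper returns C = 5, since 4 · 5² + 12 · 5 = 160 ≤ 200 while 4 · 6² + 12 · 6 = 216 > 200.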

4.3 Model Simplification

Suppose we have obtained a model m using the techniques described in Sections 4.1 and

4.2. In the following two subsections, we will show that it is sometimes possible to simplify

m without compromising the approximation quality.

4.3.1 Model Regularization

We first notice that m could be irregular. As an example, let us consider the model

in Figure 4.1a. Its structure was constructed in Example 4.2 and the cardinalities of


its latent variables were set at 8 to satisfy a certain inferential complexity constraint.

By checking the latent variables, we find that Y5 violates the regularity condition. It

has only two neighbors and |Y5| ≥ |X1|·|Y4|/ max{|X1|, |Y4|}. Y1 and Y3 also violate

the regularity condition because |Y1| > |X4|·|X6|·|Y2|/ max{|X4|, |X6|, |Y2|} and |Y3| >

|X2|·|X3|·|Y4|/ max{|X2|, |X3|, |Y4|}. The following proposition suggests that irregular

models should always be simplified until they become regular.

Proposition 4.3 If m is an irregular model, then there must exist a model m′ with lower

inferential complexity such that

AIC(m′|D) ≥ AIC(m|D). (4.4)

Proof: Let Y be a latent variable in m which violates the regularity condition. Denote

its neighbors by Z1, Z2, . . . , Zk. We obtain another model m′ from m as follows:

1. If Y has only two neighbors, then remove Y from m and connect Z1 with Z2.

2. Otherwise, replace Y with a new latent variable Y′, where

    |Y′| = ∏_{i=1}^{k} |Zi| / max_{i=1}^{k} |Zi|.

As shown by Zhang (2004), m′ is marginally equivalent to m and has fewer independent

parameters. As a direct corollary, inequality 4.4 holds.

To show that the inferential complexity of m′ is lower than that of m, we compare the

clique trees of m and m′. Consider the aforementioned two cases:

1. Y has only two neighbors. In this case, cliques {Y, Z1} and {Y, Z2} in the clique tree of m are replaced with {Z1, Z2} in the clique tree of m′. Assume |Z2| ≥ |Z1|. The difference in the sum of clique sizes is

    sum(m) − sum(m′) = |Y||Z1| + |Y||Z2| − |Z1||Z2|
                     ≥ |Z1||Z1| + |Z1||Z2| − |Z1||Z2|
                     = |Z1||Z1|
                     > 0.

2. Y has more than two neighbors. In this case, for all i = 1, 2, . . . , k, clique {Y, Zi} in the clique tree of m is replaced with a smaller clique {Y′, Zi} in the clique tree of m′.


In both cases, the inferential complexity of m′ is lower than that of m.

Q.E.D.

To simplify an irregular model, we put it through the regularization process described

in Section 2.3.4. In particular, we handle the latent variables in the order in which they were created in Section 4.1. In the following, we use the irregular model m in Figure 4.1a

to demonstrate the process.

Example 4.3 We begin with latent variable Y1. It has three neighbors and violates the reg-

ularity condition. So we decrease its cardinality to |X4|·|X6|·|Y2|/ max{|X4|, |X6|, |Y2|} =

4. Then we consider Y2. It satisfies the regularity condition and hence no changes are

made. The next latent variable to examine is Y3. It violates the regularity condition. So

we decrease its cardinality to |X2|·|X3|·|Y4|/ max{|X2|, |X3|, |Y4|} = 4. We do not change

Y4 because it does not cause irregularity. At last, we remove Y5, which has only two neigh-

bors and violates the regularity condition, and connect Y4 with X1. There is no latent

variable violating the regularity condition any more. So we end up with the regular model

m′ as shown in Figure 4.1b.

4.3.2 Redundant Variable Absorption

After regularization, there are sometimes still opportunities for further model simpli-

fication. To facilitate our discussion, we first introduce the notions of saturation and

subsumption.

Definition 4.2 Let Y be a latent variable in a regular model m. Enumerate its neighbors

as Z1, Z2, . . ., Zk. Then Y is saturated if

    |Y| = ∏_{i=1}^{k} |Zi| / max_{i=1}^{k} |Zi|.

Definition 4.3 If a latent variable is saturated, we say it subsumes all its neighbors

except the one with the largest cardinality.
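Definitions 4.2 and 4.3 are easy to operationalize. The following sketch uses hypothetical argument conventions (a cardinality for Y and a name-to-cardinality map for its neighbors); it is an illustration, not the thesis code.

```python
from math import prod

def is_saturated(card_y, neighbor_cards):
    """Definition 4.2: |Y| equals the product of its neighbors' cardinalities
    divided by the largest one (exact, since the maximum is one of the factors)."""
    return card_y == prod(neighbor_cards) // max(neighbor_cards)

def subsumed_neighbors(card_y, neighbors):
    """Definition 4.3: a saturated Y subsumes every neighbor except one with
    the largest cardinality (ties broken arbitrarily).  `neighbors` maps
    neighbor name -> cardinality; returns the subsumed neighbor names."""
    if not is_saturated(card_y, list(neighbors.values())):
        return []
    largest = max(neighbors, key=neighbors.get)
    return [z for z in neighbors if z != largest]
```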

Saturated latent variables are potentially redundant. Take the model m′ in Figure

4.1b as an example. It contains two adjacent latent variables Y1 and Y2. Both variables

are saturated. Y1 subsumes X4 and X6, and Y2 subsumes Y1 and X5. Conceptually, Y2

can be viewed as a joint variable of Y1 and X5, while Y1 can be in turn viewed as a joint

variable of X4 and X6. Intuitively, we can eliminate Y1 and directly make Y2 the joint

variable of X4, X5, and X6. This intuition is formalized by the following proposition.


(a) (b)

Figure 4.2: Redundant variable absorption. (a) A part of a model that contains twoadjacent and saturated latent nodes Y1 and Y2, with Y2 subsuming Y1. (b) Simplifiedmodel with Y1 absorbed by Y2.

Proposition 4.4 Let m be a model with more than one latent node. Let Y1 and Y2 be two

adjacent latent nodes. If both Y1 and Y2 are saturated while Y2 subsumes Y1, then there

exists another model m′ with lower inferential complexity such that

AIC(m′|D) ≥ AIC(m|D). (4.5)

Proof: We enumerate the neighbors of Y1 as Y2, Z11, Z12, . . ., Z1k, and the neighbors of Y2

as Y1, Z21, Z22, . . ., Z2l. Define another model m′ by removing Y1 from m and connecting

Z11, Z12, . . . , Z1k to Y2. We refer to this operation as variable absorption: Latent variable

Y1 is absorbed by latent variable Y2. See Figure 4.2.

We first prove that inequality 4.5 holds for models m and m′ obtained above. It

is sufficient to show that m′ is marginally equivalent to m and has fewer independent

parameters.

We start by proving the marginal equivalence. For technical convenience, we will work

with unrooted models and show that the unrooted versions of m and m′ are marginally

equivalent. For simplicity, we abuse notation and use m and m′ to denote the unrooted models.

Recall that an unrooted model is semantically a Markov random field over an undi-

rected tree. Its parameters include a potential for each edge in the model. The potential

is a non-negative function of the two variables that are connected by the edge. We use

f(·) to denote a potential in the parameters θm of m, and g(·) to denote a potential in

the parameters θm′ of m′.

Note that Y1 and Y2 are saturated, while Y2 subsumes Y1. When all variables have no

less than two states, this implies that:

1. Y1 subsumes Z11, Z12, . . . , Z1k.


2. Suppose that |Z2l| = max_{j=1}^{l} |Z2j|. Then Y2 subsumes Z21, Z22, . . . , Z2l−1.

Therefore, a state of Y1 can be written as y1 = <z11, z12, . . . , z1k>, while a state of Y2 can be written as y2 = <y1, z21, z22, . . . , z2l−1>. The latter can be further expanded as y2 = <z11, z12, . . . , z1k, z21, z22, . . . , z2l−1>.

We first show that m′ includes m. Let θm be parameters of m. We define parameters

θm′ of m′ as follows:

• Potential for edge Y2 — Z2l:

    g(Y2 = <z11, z12, . . . , z1k, z21, z22, . . . , z2l−1>, Z2l = z2l)
        = ∑_{Y1,Y2} f(Y1, Y2) ∏_{i=1}^{k} f(Y1, Z1i = z1i) ∏_{j=1}^{l} f(Y2, Z2j = z2j).

• Potential for edge Y2 — Z1i, for all i = 1, 2, . . . , k:

    g(Y2 = <z11, z12, . . . , z1k, z21, z22, . . . , z2l−1>, Z1i = z′1i) = 1 if z1i = z′1i, and 0 otherwise.

• Potential for edge Y2 — Z2j, for all j = 1, 2, . . . , l − 1:

    g(Y2 = <z11, z12, . . . , z1k, z21, z22, . . . , z2l−1>, Z2j = z′2j) = 1 if z2j = z′2j, and 0 otherwise.

• Set the other potentials in θm′ the same as those in θm.

It is easy to verify that

    ∑_{Y1,Y2} f(Y1, Y2) ∏_{i=1}^{k} f(Y1, Z1i) ∏_{j=1}^{l} f(Y2, Z2j) = ∑_{Y2} ∏_{i=1}^{k} g(Y2, Z1i) ∏_{j=1}^{l} g(Y2, Z2j).    (4.6)

Therefore,

P (X|m, θm) = P (X|m′, θm′). (4.7)

Next, we prove that m includes m′. Given parameters θm′ of m′, we define parameters

θm of m as follows:

• Potential for edge Y1 — Y2:

    f(Y1 = <z11, z12, . . . , z1k>, Y2 = y2) = ∏_{i=1}^{k} g(Y2 = y2, Z1i = z1i).


• Potential for edge Y1 — Z1i, for all i = 1, 2, . . . , k:

    f(Y1 = <z11, z12, . . . , z1k>, Z1i = z′1i) = 1 if z1i = z′1i, and 0 otherwise.

• Set the other potentials in θm the same as those in θm′ .

It can be verified that Equations 4.6 and 4.7 also hold. Therefore, m and m′ are marginally

equivalent.

We now prove that m′ contains fewer independent parameters than m. To show this

point, we return to rooted models. The parameters θm and θm′ consist of a CPT for each

node in models m and m′. There are only two differences between θm and θm′ :

1. The CPTs P (Z1i|Y1) for all i = 1, 2, . . . , k and P (Y1|Y2) are peculiar to θm;

2. The CPTs P (Z1i|Y2) for all i = 1, 2, . . . , k are peculiar to θm′ .

Therefore, the difference between the numbers of free parameters in m and m′ is

    d(m) − d(m′) = ∑_i (|Z1i| − 1)|Y1| + (|Y1| − 1)|Y2| − ∑_i (|Z1i| − 1)|Y2|
                 = ∑_i (|Z1i| − 1)|Y1| + (∏_i |Z1i| − 1)|Y2| − ∑_i (|Z1i| − 1)|Y2|
                 = ∑_i (|Z1i| − 1)|Y1| + (∏_i |Z1i| − ∑_i |Z1i| + k − 1)|Y2|
                 ≥ ∑_i (|Z1i| − 1)|Y1|.

The second equation holds because Y1 is saturated and subsumes Z11, Z12, . . . , Z1k. The

last inequality holds because ∏_i |Z1i| ≥ ∑_i |Z1i| when |Z1i| ≥ 2 for all i = 1, 2, . . . , k.

It is clear that the difference is strictly positive when |Y1| ≥ 2. The model m hence has

more parameters than m′ when Y1 and Z1i are non-trivial.

Finally, we compare the inferential complexity of m and m′. According to the con-

struction of m′, the clique tree of m′ differs from that of m in that it does not contain the clique {Y1, Y2} and replaces clique {Y1, Z1i} with {Y2, Z1i} for all i = 1, 2, . . . , k.


Algorithm 4.1 RedundantVariableAbsorption(m)
1: repeat
2:   for each pair of adjacent latent variables Y and Y′ in m do
3:     if Y and Y′ are saturated and Y subsumes Y′ then
4:       Denote the set of neighbors of Y′ by nb(Y′)
5:       Connect nb(Y′) \ {Y} to Y
6:       Remove Y′ from m
7:     end if
8:   end for
9: until no further changes

Therefore, the difference between the sums of clique sizes is

    sum(m) − sum(m′) = |Y1||Y2| + ∑_i |Y1||Z1i| − ∑_i |Y2||Z1i|
                     = |Y2| ∏_i |Z1i| + ∑_i |Y1||Z1i| − ∑_i |Y2||Z1i|
                     = |Y2| (∏_i |Z1i| − ∑_i |Z1i|) + ∑_i |Y1||Z1i|
                     ≥ ∑_i |Y1||Z1i|.

The last inequality holds when |Z1i| ≥ 2 for all i = 1, 2, . . . , k. Therefore, the inferential

complexity of m′ is always lower than that of m when Z1i is nontrivial.

Q.E.D.

Based on Proposition 4.4, we design a procedure called RedundantVariableAbsorption

for simplifying regular models. Algorithm 4.1 shows the pseudo code of this procedure. It

checks each pair of adjacent latent variables in the input model m. If one latent variable

of the pair is redundant, it is absorbed by the other. This process repeats until no further

changes. We use an example to demonstrate this procedure.

Example 4.4 Consider simplifying the regular model m′ in Figure 4.1b. We first check

the pair Y1 and Y2. Both of them are saturated while Y2 subsumes Y1. We thus remove Y1

and connect Y2 to X4 and X6. We then check Y3 and Y4. It turns out that Y3 is redundant.

Therefore, we remove it and connect Y4 to X2 and X3. The last pair to check are Y2 and

Y4. They are both saturated, but neither of them subsumes the other. Hence, they cannot

be removed. The final model m′′ is shown in Figure 4.1c.
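Algorithm 4.1 can be sketched concretely on an adjacency-map representation of the model. The data structures (neighbor sets, a cardinality map, and a set of latent names) and the in-place updates are illustrative assumptions; the saturation and subsumption tests are those of Definitions 4.2 and 4.3.

```python
def absorb_redundant_latents(neighbors, card, latents):
    """Redundant variable absorption in the spirit of Algorithm 4.1.
    neighbors: dict node -> set of adjacent nodes (an undirected tree),
    card: dict node -> cardinality, latents: set of latent node names.
    The structures are modified in place; the pruned adjacency map is returned."""

    def saturated(y):
        cards = [card[z] for z in neighbors[y]]
        product = 1
        for c in cards:
            product *= c
        return card[y] == product // max(cards)

    def subsumes(y, z):
        # Definition 4.3: a saturated y subsumes every neighbor except one
        # with the largest cardinality (ties broken arbitrarily here).
        largest = max(neighbors[y], key=lambda w: card[w])
        return saturated(y) and z != largest

    changed = True
    while changed:
        changed = False
        for y in list(latents):
            if y not in latents:                 # y may have been absorbed already
                continue
            for z in list(neighbors[y]):
                if z in latents and saturated(z) and subsumes(y, z):
                    for w in neighbors[z] - {y}:  # reconnect z's other neighbors to y
                        neighbors[w].remove(z)
                        neighbors[w].add(y)
                        neighbors[y].add(w)
                    neighbors[y].remove(z)
                    del neighbors[z], card[z]
                    latents.remove(z)
                    changed = True
                    break
    return neighbors
```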

4.4 Parameter Optimization

After we obtain a simplified model m, we run EM to optimize its parameters. Throughout

the learning process, we run EM only once. Therefore, we can afford a relatively high setting for EM (e.g., in the number of starting points and iterations). This helps avoid poor local maxima in the parameter space and usually leads to a better estimate of the generative distribution.

4.5 The HCL Algorithm

To wrap up, we now outline the complete HCL algorithm for learning LTMs. HCL has two inputs: a

data set D and a predetermined bound Imax on the inferential complexity of the resultant

model. The output of HCL is an LTM that approximates the generative distribution

underlying D and satisfies the inferential complexity constraint. HCL is briefly described

as follows.

1. Obtain an LTM structure by performing hierarchical clustering of variables, using

empirical MI based on D as the similarity measure. (Section 4.1)

2. Set cardinalities of latent variables at Cmax according to Equation 4.3. (Section 4.2)

3. Simplify the model through regularization and redundant variable absorption. (Sec-

tion 4.3)

4. Optimize the parameters by running EM. (Section 4.4)

5. Return the resultant LTM.

4.6 Summary

We have presented a heuristic algorithm called HCL for learning LTMs. HCL deals with

density estimation problems with large samples, and assumes a predetermined bound

on the complexity of the resultant LTM. It constructs model structure by hierarchically

clustering manifest variables, and determines the cardinalities of latent variables based

on the given complexity constraint. In Chapter 6, we will empirically compare HCL with

EAST. In particular, we will show that HCL is about two to three orders of magnitude

faster than EAST. In Chapter 7, we will present an application of HCL in approximate

inference in Bayesian networks. We will use HCL to learn an LTM to approximate a given

Bayesian network, and utilize the approximate LTM to answer probabilistic queries.


CHAPTER 5

ALGORITHM 3: PYRAMID

We have so far presented two algorithms for learning LTMs, namely EAST and HCL.

EAST is a general-purpose algorithm. It finds high quality models and hence can yield

good solutions to density estimation problems. Moreover, it can discover interesting latent

structures behind data. However, it has a high computational cost. HCL is much faster

than EAST. However, it is a special-purpose algorithm, designed only for density estima-

tion problems where there is a predetermined bound on model complexity. Moreover, it

always produces binary latent tree structures which are typically not meaningful.

In this chapter, we present a third algorithm called Pyramid. It is also a general-

purpose algorithm and is much faster than EAST. It usually finds high quality models

and discovers interesting latent structures as EAST does. As will be seen in Chapter 8,

Pyramid offers a better tradeoff between model quality and computational efficiency in

classification problems than EAST.

5.1 Basic Ideas

In this section, we explain the basic ideas of the Pyramid algorithm.

5.1.1 Bottom-up Construction of Model Structure

The Pyramid algorithm is an extension to HCL. The two algorithms both exploit the

same intuition, i.e., sibling variables in LTMs are in general more closely correlated than

variables that are located far apart. To utilize this intuition in LTM construction, one can

first somehow identify variables that are closely correlated and then make them siblings

in the LTM under construction.

Here is a general bottom-up procedure for constructing LTMs that adopts the above

idea. The procedure keeps a working list of variables that initially consists of all manifest

variables. It operates as follows:

1. Choose from the working list a closely correlated subset of variables;

2. Introduce a new latent variable as the parent of each variable in the subset;


3. Remove from the working list the variables in the subset and add to it the new

latent variable.

4. Repeat until there is only one variable in the working list.

Clearly, step 1 is the critical step in the procedure. The subset of variables that it

chooses become siblings after step 2. So, we say that the task of step 1 is sibling cluster

determination.

HCL takes a simple approach to sibling cluster determination. It simply chooses,

from the working list, the two variables with the highest mutual information as the mem-

bers of the sibling cluster. So, at each iteration, it introduces a new latent variable as the

parent of two existing variables. The final model structure is hence a binary tree.

To overcome this limitation of HCL, we allow Pyramid to create sibling clusters with

more than two variables. Pyramid starts with a subset consisting of the two variables

with the highest mutual information. The other variables from the working list are then

added to the subset one by one until a certain condition is met. A sibling cluster is then

chosen from the subset.

There are three questions. First, at each step, which variable do we choose to add

to the current subset? Second, when do we stop growing the subset? Third, how do we

obtain a sibling cluster from the final subset?

Here is the answer to the first question: At each step, we add to the current subset the

variable from the working list that has the highest mutual information with the current

subset. A definition of mutual information between a variable and a set of variables will

be given in Section 5.2.

The other two questions will be answered in Sections 5.1.3 and 5.1.4, respectively. In

preparation, we will next introduce the unidimensionality test.

5.1.2 Unidimensionality Test

We begin with some definitions.

Definition 5.1 An LTM is called a single-latent tree (1-LT) model if it contains only

one latent variable.

Definition 5.2 An LTM is called a twin-latent tree (2-LT) model if it contains exactly

two latent variables.

Definition 5.3 An LTM is simple if it is either a 1-LT or a 2-LT model.


Suppose we are given a data set D over a set of variables X. Let S be a subset of

X with more than 2 variables and D↓S be the projection of D onto S. Among all simple

models over S, denote the model with the highest AIC score with respect to D↓S by M⋆S.

Definition 5.4 A subset S of variables is unidimensional if M⋆S is a 1-LT model.

The unidimensionality test checks whether a subset S of variables is unidimensional.

This is done in two steps:

1. Find the optimal simple LTM M⋆S over S;

2. Return true if M⋆S is a 1-LT model; return false otherwise.

The first step is a non-trivial task. We will address it in Section 5.4.

When it is necessary to distinguish between simple models over S and the LTM over

X under construction, we will refer to the former as local models and the latter as the

global model.

5.1.3 Subset Growing Termination

Let us return to the second question raised in Section 5.1.1, i.e., when to stop growing

the subset. Our answer is to conduct a unidimensionality test. After adding a new variable

to the current subset S, we test S for unidimensionality. If S passes the test, we continue

to grow it. Otherwise, we stop.

Let M⋆ be the optimal LTM over X with respect to D, i.e.,

    M⋆ = arg max_M AIC(M|D).

The intuition is that, if a subset S of X is not unidimensional, then the variables in S are

probably not from the same sibling cluster in M⋆. Growing it further would not change

the fact. So, we stop.
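The stopping rule can be sketched as follows. Both helpers are hypothetical placeholders: mi_with_set stands for the mutual information between a variable and a set of variables (defined in Section 5.2), and is_unidimensional stands for the test of Section 5.1.2.

```python
from typing import Callable, Iterable, List, Set

def grow_subset(seed: List[str],
                working_list: Iterable[str],
                mi_with_set: Callable[[str, Set[str]], float],
                is_unidimensional: Callable[[Set[str]], bool]) -> Set[str]:
    """Subset growing: starting from the two seed variables, repeatedly add the
    working-list variable with the highest MI with the current subset, and stop
    as soon as the subset fails the unidimensionality test (or nothing is left)."""
    subset = set(seed)
    remaining = [v for v in working_list if v not in subset]
    while remaining:
        best = max(remaining, key=lambda v: mi_with_set(v, subset))
        subset.add(best)
        remaining.remove(best)
        if not is_unidimensional(subset):
            break
    return subset
```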

Example 5.1 Consider m manifest variables X1, X2, . . ., Xm. Assume that the pair X1

and X2 has the highest MI among all possible pairs; X3 has the highest MI with {X1, X2} among X3, . . ., Xm; X4 has the highest MI with {X1, X2, X3} among X4, . . ., Xm; etc.

To determine a sibling cluster, Pyramid starts with the subset S = {X1, X2} and

grows it gradually. First, it adds X3 to S and runs a unidimensionality test. Suppose the

optimal simple model over X1–X3 is a 1-LT, as shown in Figure 5.1a. Then the new

subset S = {X1, X2, X3} passes the test and the subset growing process continues.


Figure 5.1: An example subset growing process: (a) Step 1; (b) Step 2; (c) Step 3. The numbers within the parentheses denote the cardinalities of the latent variables.

Pyramid next adds X4 to S. Suppose the optimal model over X1–X4 is still a 1-LT,

as shown in Figure 5.1b. Then the new subset again passes the unidimensionality test and

the subset growing process moves on.

Pyramid then adds X5 to S. Suppose the optimal simple model over X1–X5 is a 2-LT,

as shown in Figure 5.1c. Then S fails the unidimensionality test and the subset growing

process terminates.

Pyramid can now obtain a sibling cluster from S. The next subsection explains how.

5.1.4 Sibling Cluster Determination

When the subset growing process stops, we have a subset S of variables from the working

list. We have put S through the unidimensionality test. We use the intermediate results

computed during the test to determine a sibling cluster.

Let M⋆S

be the optimal simple model over S that was found during the unidimension-

ality test. It is a 2-LT model. So it has two latent variables and hence two sibling clusters.

These clusters should not be confused with sibling clusters in the global model under con-

struction. Denote the two sibling clusters in M⋆S

by S1 and S2, where S = S1 ∪ S2. Let

X1 and X2 be the pair of variables with which we started the subset growing process.

Recall that, among all possible pairs of variables in the working list, X1 and X2 have the

highest MI.

Pyramid chooses one of S1 and S2 as a sibling cluster for the global model. Specifically,

it picks the one that contains both X1 and X2. In the case that X1 and X2 lie in different

subsets, Pyramid chooses between S1 and S2 arbitrarily. However, this seldom happens

in practice.

Without loss of generality, suppose S1 is chosen as a sibling cluster. The next step is

to introduce a new latent variable Y to the global model and make it the parent of each

variable in S1. An immediate question is what the cardinality of Y should be. In M⋆S, the


Figure 5.2: The model structure after adding one latent variable Y1.

variables in S1 share one common parent, which we denote by Z. We set the cardinality

of Y to be the same as that of Z.

Example 5.2 Consider the subset S that we obtained in Example 5.1. It consists of X1–

X5. The optimal simple model over S is a 2-LT, as shown in Figure 5.1c. It contains two

sibling clusters {X1, X2, X3} and {X4, X5}.

Since {X1, X2, X3} contains X1 and X2, it is picked as a sibling cluster for the global

model under construction. A new latent variable Y1 is introduced as the parent of the three

variables. This gives the model structure shown in Figure 5.2. The cardinality of Y1 is set

to be the same as that of the latent variable Y in Figure 5.1c.

Going back to the general procedure outlined in Section 5.1.1, we have illustrated

steps 1–3. In step 4, the variables X1, X2, X3 are removed from the working list and

Y1 is added to the working list. The process then repeats itself. In the next iteration,

a new latent variable might be introduced for several other manifest variables, or for Y1

and some other manifest variables.

5.2 The Pyramid Algorithm

Algorithm 5.1 shows the pseudo code for Pyramid. It takes a data set D over a set of

variables X as the input, and outputs an LTM M with manifest variables X.

The algorithm begins with a BN M that consists of the nodes from X and that

contains no edges (line 2). Latent nodes will be added to M one by one. By the end, M will become an LTM.

To decide what latent nodes to add to M and how to add them, the algorithm main-

tains a working list W. Initially, it consists of a copy of all the variables from X (line

3). Variables will then be added to and removed from W. The newly added variables

will be a copy of the latent nodes to be introduced to M. By the end, W will become a

singleton set, containing only a copy of the root of the final model.

In each pass through the main while loop (lines 4–19), the algorithm first picks the

pair of variables from W that has the highest MI among all possible pairs (line 5). The


Algorithm 5.1 Pyramid(D)

1: Let X be the set of variables in D
2: M ← BN with nodes X and no edges
3: W ← A copy of X
4: while |W| > 1 do
5:   {W1, W2} ← arg max_{W,W′ ∈ W} I(W; W′)
6:   S ← {W1, W2}
7:   while |S| < |W| do
8:     {W3, W4} ← arg max_{S ∈ S, W ∈ W\S} I(S; W)
9:     S ← S ∪ {W4}
10:    MS ← RestrictedExpand(S, W3, W4, M, D)
11:    if MS is a 2-LT model then
12:      break
13:    end if
14:  end while
15:  MS ← PickSiblingCluster(MS, W1)
16:  M ← AddNode(M, MS, D)
17:  Let Y be the new node just added to M
18:  W ← W \ {The children of Y in M} ∪ {A copy of Y}
19: end while
20: return Refine(M, D)

Algorithm 5.2 PickSiblingCluster(MS, W1)

1: Let Y be the latent node in MS that is not the parent of W1
2: Remove Y and all its manifest children from MS
3: return MS

MI between two manifest variables is estimated based on data. The estimation of the MI

between two latent variables or between a latent variable and a manifest variable is not

trivial. This issue will be discussed in Section 5.3.

In lines 6–14, the algorithm determines a sibling cluster. It does so by iteratively

growing a set S that initially contains the pair of variables picked at line 5. In each pass

through the inner while loop, the algorithm first computes the MI between each variable

W from W \S and the subset S. In particular, we adopt the minimum linkage definition

(Duda & Hart, 1973). That is, the MI between a variable W and the subset S of variables

is defined to be the maximum of the MI values between W and each variable S in S,

\[ I(W; \mathbf{S}) = \max_{S \in \mathbf{S}} I(W; S). \]

The variable W4 from W \ S that is the most closely related to S is then picked at line

8. It is added to S at line 9. A unidimensionality test is then carried out on S in lines 10–13.

The key step is the search for the optimal simple model over S. This is done using a

subroutine called RestrictedExpand. This subroutine uses the variable W3 in S that has

the highest MI with W4. Section 5.4 will explain how the subroutine works.
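As an illustration of lines 5 and 8, the following sketch picks the next variable to add to S under the maximum-of-pairwise-MI linkage just defined. It assumes a precomputed dictionary pairwise_mi mapping unordered pairs of variables to their estimated MI values; this data structure is illustrative and not part of the thesis.

def mi_to_subset(W, S, pairwise_mi):
    # I(W; S) = max over S' in S of I(W; S'), the linkage used at line 8.
    return max(pairwise_mi[frozenset({W, X})] for X in S)

def pick_next_variable(S, working_list, pairwise_mi):
    """Return (W3, W4): W4 is the variable outside S most closely related
    to S, and W3 is the member of S that W4 has the highest MI with."""
    candidates = [W for W in working_list if W not in S]
    W4 = max(candidates, key=lambda W: mi_to_subset(W, S, pairwise_mi))
    W3 = max(S, key=lambda X: pairwise_mi[frozenset({W4, X})])
    return W3, W4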

When S fails the unidimensionality test, the algorithm proceeds to obtain a sibling


Algorithm 5.3 AddNode(M, MS, D)

1: Let L be the set of nodes in M that correspond to the leaf nodes of MS
2: Create a new latent node Y and set its cardinality to be that of the root of MS
3: Add Y to M and make it the parent of each node in L
4: Let MY be the submodel of M rooted at Y
5: return EM(MY, D)

cluster using a subroutine called PickSiblingCluster (line 15). The pseudo code of

the subroutine is given in Algorithm 5.2. Because S fails the unidimensionality test, the

simple model MS contains two latent nodes. PickSiblingCluster prunes from MS all

the nodes that are not the parent or siblings of W1 and returns the resulting 1-LT model.

The sibling cluster obtained is the set of leaf nodes remaining in MS after the pruning.

Note that the subroutine returns the 1-LT model rather than just the sibling cluster itself.

This is because another aspect of the model will also be needed in the next step.

After line 15 of Pyramid, MS is a 1-LT model. Let L be the set of the nodes in M that correspond to the leaf nodes of MS. It is the sibling cluster obtained. At line 16,

the subroutine AddNode (see Algorithm 5.3) is called to create a new latent node Y and

add it to the BN M. It is made the parent of each node in L. The cardinality of Y is set

to be the same as that of the root ofMS. The parameters pertaining to the new node Y

and in the subtree rooted at Y are optimized using EM, so that future MI estimates will

be accurate.

At line 18 of Pyramid, the nodes in L are removed from the working list W because

they already have a parent Y . A copy of Y is then added to W so that Y can be connected

to future latent nodes.

After line 19, M becomes an LTM. At line 20, a subroutine Refine is called to

optimize the cardinalities of the latent variables inM and the model parameters as well.

This subroutine will be discussed in Section 5.5. The output of the subroutine is the

output of the entire algorithm.

In the next three sections, we present more implementation details about the Pyramid

algorithm.

5.3 Mutual Information

At lines 5 and 8, Pyramid needs to compute MI between variables. In this section, we

discuss the issues related to the calculation.


5.3.1 MI Between Manifest Variables

Theoretically, exact computation of MI is possible only when there is a probabilistic model

over the variables. However, in density estimation problems, we only have a data set at

hand. Therefore, we can only estimate MI based on data.

Suppose D is a data set over a set of manifest variables X. Let P (X) be the empirical

distribution over X induced by D. The empirical MI between two manifest variables X

and X ′ is defined as

\[ I(X; X') = \sum_{X, X'} P(X, X') \log \frac{P(X, X')}{P(X)\,P(X')}. \]

We use the empirical MI I(X; X′) to estimate the true MI I(X; X′).
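A straightforward sketch of this estimate, assuming the data set is given as a list of dicts mapping variable names to values:

from collections import Counter
from math import log

def empirical_mi(data, X, Xp):
    """Empirical mutual information I(X; X') computed from the joint and
    marginal frequencies induced by the data set."""
    n = len(data)
    joint = Counter((case[X], case[Xp]) for case in data)
    marg_x = Counter(case[X] for case in data)
    marg_xp = Counter(case[Xp] for case in data)
    mi = 0.0
    for (x, xp), count in joint.items():
        p_xy = count / n
        mi += p_xy * log(p_xy / ((marg_x[x] / n) * (marg_xp[xp] / n)))
    return mi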

5.3.2 MI Between a Latent Variable and a Manifest Variable

Pyramid sometimes needs the MI between two latent variables or between a latent variable

and a manifest variable. As an example, consider the situation shown in Figure 5.2. The

latent variable Y1 has just been added to the global model. Consequently, X1, X2 and X3

have just been removed from the working list W, and Y1 has just been added to it. In the

next iteration of the main while loop, Pyramid will need the MI between Y1 and X4–Xm

at lines 5 and 8.

MI cannot be estimated solely from the empirical distribution when latent variables

are involved. This is because values of latent variables are not available in data. We

address this issue in the remainder of this section. We begin with the task of estimating

the MI I(Y ; X) between a latent variable Y and a manifest variable X.

Pyramid builds an LTM by adding latent nodes to a BN M that initially contains no

latent nodes and no edges. At the time when it needs the MI I(Y ; X), there is already

a sub-model in M rooted at Y . We denote it by MY . In Figure 5.2, for example, when

Pyramid needs to calculate the MI between latent variable Y1 and manifest variables

X4–Xm, there is a sub-model rooted at Y1 and it happens to be a 1-LT model.

The values of Y are not available in data. Conceptually, Pyramid exploits the sub-

model MY to complete the values of Y in data, and then estimates the MI I(Y ; X) based

on the completed data. Let XY be the set of manifest variables in MY . Technically,

Pyramid first obtains a joint distribution over Y and X from the data D and the sub-

model MY as follows:

\[ P_{\mathcal{D},\mathcal{M}}(Y, X) = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} P(Y \mid d^{\downarrow \mathbf{X}_Y}, \mathcal{M}_Y)\, \mathbf{1}_X(d^{\downarrow X}), \]


where |D| is the sample size of D, 1X(·) is the indicator function, while d↓XY and d↓X

denote the values of XY and X in data case d, respectively. Then, Pyramid estimates

the MI I(Y ; X) using the following quantity:

\[ I_{\mathcal{D},\mathcal{M}}(Y; X) = \sum_{Y, X} P_{\mathcal{D},\mathcal{M}}(Y, X) \log \frac{P_{\mathcal{D},\mathcal{M}}(Y, X)}{P_{\mathcal{D},\mathcal{M}}(Y)\, P_{\mathcal{D},\mathcal{M}}(X)}. \]

In order for the estimation to be accurate, the parameters in the sub-model MY should

be optimized. This is done at the end of the subroutine AddNode after the introduction

of each latent node.
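The following sketch mirrors the two formulas above. The helper posterior_over_latent(case), standing in for inference in the sub-model MY, is assumed to return a dict mapping each value of Y to P(Y | d↓XY, MY); it is not part of the thesis's code.

from collections import defaultdict
from math import log

def estimate_latent_manifest_mi(data, X, posterior_over_latent):
    """Estimate I(Y; X) from the completed-data distribution P_{D,M}(Y, X)."""
    n = len(data)
    joint = defaultdict(float)                      # P_{D,M}(Y, X)
    for case in data:
        for y, p in posterior_over_latent(case).items():
            joint[(y, case[X])] += p / n
    p_y, p_x = defaultdict(float), defaultdict(float)
    for (y, x), p in joint.items():
        p_y[y] += p
        p_x[x] += p
    return sum(p * log(p / (p_y[y] * p_x[x]))
               for (y, x), p in joint.items() if p > 0)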

5.3.3 MI Between Two Latent Variables

We next consider the MI I(Y ; Y ′) between two latent variables Y and Y ′. When Pyramid

needs the MI, there is a sub-model rooted at Y and another sub-model rooted at Y ′

in M. We denote them by MY and MY ′ , respectively. Let XY and XY ′ be the sets of

manifest variables in the two sub-models. Pyramid first obtains a joint distribution over

Y and Y ′ from the data D and the two sub-models MY and MY ′ as follows:

\[ P_{\mathcal{D},\mathcal{M}}(Y, Y') = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} P(Y \mid d^{\downarrow \mathbf{X}_Y}, \mathcal{M}_Y)\, P(Y' \mid d^{\downarrow \mathbf{X}_{Y'}}, \mathcal{M}_{Y'}). \]

Then, it estimates the MI I(Y ; Y ′) using the following quantity:

\[ I_{\mathcal{D},\mathcal{M}}(Y; Y') = \sum_{Y, Y'} P_{\mathcal{D},\mathcal{M}}(Y, Y') \log \frac{P_{\mathcal{D},\mathcal{M}}(Y, Y')}{P_{\mathcal{D},\mathcal{M}}(Y)\, P_{\mathcal{D},\mathcal{M}}(Y')}. \]
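The latent–latent case can be sketched in the same way; only the construction of the joint changes. Again, posterior_Y and posterior_Yp stand in for inference in the two sub-models and are assumptions of the sketch, not the thesis's implementation.

from collections import defaultdict
from math import log

def estimate_latent_latent_mi(data, posterior_Y, posterior_Yp):
    """Estimate I(Y; Y') from P_{D,M}(Y, Y'), the per-case product of the
    two posteriors averaged over the data."""
    n = len(data)
    joint = defaultdict(float)
    for case in data:
        for y, p in posterior_Y(case).items():
            for yp, q in posterior_Yp(case).items():
                joint[(y, yp)] += p * q / n
    p_y, p_yp = defaultdict(float), defaultdict(float)
    for (y, yp), p in joint.items():
        p_y[y] += p
        p_yp[yp] += p
    return sum(p * log(p / (p_y[y] * p_yp[yp]))
               for (y, yp), p in joint.items() if p > 0)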

5.4 Simple Model Learning

At line 10, Pyramid tries to find the optimal simple LTM over a set S. It does so by calling

a subroutine named RestrictedExpand. In this section, we explain how the subroutine

works. Because it is called many times, efficiency was an important consideration when

designing the subroutine.

The set S consists of variables from the working list W. It might contain only manifest

variables, or it might contain latent variables as well as manifest variables. We will focus

on the first case in Sections 5.4.1 and 5.4.2, and deal with the second case in Section 5.4.3.

5.4.1 Exhaustive Search

Suppose S consists of only manifest variables. A conceptually straightforward way to find

the optimal simple model over S is exhaustive search. That is, we enumerate all simple


Algorithm 5.4 RestrictedExpand(S, W, W′, M, D)

1: MS ← 1-LT model over S with a binary latent variable Y
2: while true do
3:   if MS is a 1-LT model then
4:     M1 ← SI(MS, Y)
5:     M2 ← NI(MS, Y, W, W′)
6:     M′S ← PickModel1-IR({M1, M2}, MS)
7:   else
8:     Let Y and Y′ be the two latent variables in MS
9:     M1 ← SI(MS, Y)
10:    M2 ← SI(MS, Y′)
11:    M′S ← PickModel1({M1, M2}, MS)
12:  end if
13:  if AIC(M′S | M, D) ≤ AIC(MS | M, D) then
14:    return MS
15:  else if M′S is a 2-LT model then
16:    MS ← EnhanceNI(M′S, MS, D)
17:  else
18:    MS ← M′S
19:  end if
20: end while

regular LTMs over S, calculate their AIC scores, and return the one with the highest

score. This solution is computationally intractable because there are exponentially many

regular 2-LT structures over S, and hence exponentially many simple regular LTMs.

Proposition 5.1 There are 2m−1 − 1 different regular 2-LT structures over m manifest

variables.

Proof: A regular 2-LT structure can be uniquely determined by partitioning the m

manifest variables into two non-empty sibling clusters. Therefore, the number of regular

2-LT structures is equal to the number of different ways of partitioning the m manifest variables into two non-empty clusters, i.e.,

\[ \frac{1}{2} \sum_{k=1}^{m-1} \binom{m}{k} = 2^{m-1} - 1. \]

Q.E.D.
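The count can be checked by brute force for small m; the sketch below enumerates all bipartitions of m manifest variables into two non-empty clusters.

from itertools import combinations

def count_regular_2lt_structures(m):
    """Count the distinct ways to split m manifest variables into two
    non-empty sibling clusters (each split defines one regular 2-LT structure)."""
    variables = range(m)
    partitions = set()
    for k in range(1, m):
        for cluster in combinations(variables, k):
            rest = tuple(v for v in variables if v not in cluster)
            partitions.add(frozenset({cluster, rest}))
    return len(partitions)

# Agrees with Proposition 5.1 for small m.
assert all(count_regular_2lt_structures(m) == 2 ** (m - 1) - 1 for m in range(2, 11))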

5.4.2 Restricted Expansion

Instead of exhaustive search, RestrictedExpand finds a simple LTM over S through hill-

climbing. The pseudo code of the subroutine is given in Algorithm 5.4. It is similar to the

expansion phase of the EAST algorithm discussed in Chapter 3, except that a number of

differences are introduced to make it as efficient as possible. As mentioned earlier, the

subroutine is called many times, and hence efficiency is critical.

Like EAST, RestrictedExpand starts the search from the 1-LT model over S that

has a single binary latent variable (line 1).


At each step of the search, RestrictedExpand generates a number of candidate models

(lines 4, 5, 9, 10). Here, it differs from EAST in two aspects. First, it no longer considers

applying the NI operator once the current model MS has two latent nodes (lines 9 and

10). This is because the objective here is to find the optimal simple LTM, i.e., the best

model among LTMs with 1 or 2 latent nodes. There is hence no need to consider models

with more than 2 latent nodes.

The second difference is that RestrictedExpand does not consider all possible ways

to apply the NI operator when there is only one latent node. Rather, it considers only

the NI operation that introduces a new latent node to mediate the existing latent node

Y and two of its children W and W ′ (line 5). The two nodes W and W ′ are provided as

arguments by the call to the subroutine.

This second modification clearly reduces the number of candidate models. However,

does it significantly reduce the chance to find a good simple model, and hence compromise

the quality of unidimensionality test? To answer this question, we need to go back to line

10 of the Pyramid algorithm where the subroutine is called. There, a new variable W4

has just been added to S. Let S′ = S \ {W4}. The variable W3 is the one in S′ that has

the highest MI with W4. In the subroutine call, W is W3 and W ′ is W4.

In EAST, |S|(|S|−1)/2 possible ways to apply the NI operator are considered, one for

each pair of nodes in S. Divide those operations into two groups: (1) Those for pairs of

nodes in S′, and (2) those for pairs that consist of W4 and a node in S′. The operations

in the first group have already been considered in previous unidimensionality tests on the

subsets created during the process of growing the initial subset {W1, W2} up to S′. They

were found to be inferior to corresponding SI operations on the previous simple models or

not to improve the AIC scores of those models. As a heuristic, we believe that the same

would be true if they are applied to the current simple model. So, we do not consider

them.

Among the operations in the second group, we believe that, as another heuristic, the

operation that introduces a new latent variable for the pair W3 and W4 would bring about

the most (if any) improvement over the current model. This is because W3 is the variable

in S′ that has the highest MI with W4. So, we choose to consider this operation only.

After generating the candidate models, RestrictedExpand evaluates them one by one

and picks the best model from the list. This is done by invoking subroutines PickModel1-IR

and PickModel1 at lines 6 and 11. Here comes the third difference between

RestrictedExpand and EAST. EAST needs to deal with a large number of candidate

models. Therefore, it uses PickModel and PickModel-IR (Algorithms 3.7 and 3.8) for

efficient model evaluation, which adopt a two-stage strategy. In the first screening stage,

PickModel and PickModel-IR optimize the parameters of the candidate models by running


Algorithm 5.5 PickModel1(L, M)

1: for each M′ ∈ L do
2:   Run LocalEM(M, M′, µ, ν) to estimate the parameters of M′
3: end for
4: return the model in L with the highest AIC score as given by Equation 3.2

Algorithm 5.6 PickModel1-IR(L, M)

1: for each M′ ∈ L do
2:   Run LocalEM(M, M′, µ, ν) to estimate the parameters of M′
3: end for
4: return the model in L with the highest IR score as given by Equation 3.4

local EM at a low setting, calculate the AIC/IR scores of the candidate models based

on the estimated parameters, and prune most candidate models with low AIC/IR scores

from consideration. Then, in the second evaluation stage, they refine the parameters of

remaining candidate models by running local EM at a relatively high setting, and pick

the best one based on the parameter estimations obtained.

In contrast, RestrictedExpand only needs to handle a small number of candidate

models. Therefore, PickModel1 and PickModel1-IR accomplish model evaluation in

one stage, as shown in Algorithms 5.5 and 5.6.

PickModel1 and PickModel1-IR have two algorithmic parameters, µ and ν. They

control the local EM in the same way as in PickModel and PickModel-IR. In practice, we

use a high setting for µ and ν to obtain accurate parameter estimations. Since PickModel1

and PickModel1-IR only deal with two candidate models, which are usually small, we

believe that running local EM at a high setting will not significantly slow down the model

evaluation process.

After the model evaluation/selection phase, one candidate model is chosen.

RestrictedExpand compares the AIC score of the picked candidate model with that

of the current model (line 13). If the score does not increase, the search terminates and

the current model is returned as the output of RestrictedExpand (line 14). Otherwise,

the search continues (line 15–19). If a new latent node has just been introduced to the

best candidate model, the subroutine EnhanceNI (Algorithm 3.4) is called to adjust con-

nections among nodes so that potentially more manifest nodes can be connected with the

new latent node (line 16). This is the same as in EAST.

5.4.3 When S Contains Latent Variables

We now deal with the case when S contains latent variables. We begin with an example.

Suppose the current global model M is as shown in Figure 5.2. To introduce the next


(a) Candidate model MY  (b) Sub-model MY1  (c) Concatenated model MY Y1

Figure 5.3: An example for evaluating simple models over latent variables. The colors of the nodes in Figure 5.3c indicate where the parameters of the nodes come from.

latent node, Pyramid might call RestrictedExpand to find a simple model over Y1 and

some manifest variables, say X4–X6. In that process, it might need to evaluate candidate

models such as the model MY shown in Figure 5.3a. An issue with the evaluation of the

model is that the node Y1 is latent and its values are absent from the data D. Therefore,

we cannot directly calculate the AIC score of MY.

To solve this problem, we exploit the same idea that was used earlier in Section 5.3 to

estimate the MI between latent variables. We notice that, in the current global model M,

there is a sub-model rooted at Y1. We refer to this sub-model as MY1. For convenience,

we reproduce it here in Figure 5.3b. The idea is to use the model MY1 to complete the

values for Y1 in the data, and then use the completed data to evaluate the candidate

model MY.

Technically, to evaluate MY, we first concatenate it with MY1. The result is a two-layer LTM MY Y1, as shown in Figure 5.3c. In particular, MY Y1 borrows the parameters of Y, Y′, Y1, X4–X6 from MY, and the parameters of X1–X3 from MY1. Note that the leaf nodes in MY Y1 are all manifest variables. We then calculate the AIC score of MY Y1, and return it as the AIC score of MY.

In general, suppose we need to evaluate a candidate model MS over S, while S contains a set Y of latent variables. For each Y ∈ Y, there is a sub-model rooted at Y in the current global model M. Denote it by MY.¹ To estimate the AIC score of MS, we first concatenate MS with the sub-models MY for all Y ∈ Y. Refer to the resultant model as MSY. We then calculate the AIC score of MSY, and return it as the AIC score of MS.

This estimation is formalized as follows:

\[ \mathrm{AIC}(\mathcal{M}_S \mid \mathcal{D}) \approx \sum_{d \in \mathcal{D}} \log \sum_{\mathbf{Y}} P(d^{\downarrow \mathbf{S} \setminus \mathbf{Y}}, \mathbf{Y} \mid \mathcal{M}_S) \prod_{Y \in \mathbf{Y}} P(d^{\downarrow \mathbf{X}_Y} \mid Y, \mathcal{M}_Y) \;-\; d(\mathcal{M}_{S\mathbf{Y}}), \]

where d(MSY) is the dimension of the concatenated model MSY.

¹This is not to be confused with the model MY as shown in Figure 5.3a.


Algorithm 5.7 Refine(M, D)

1: while true do
2:   M′ ← arg max_{M′ ∈ SI(M)} AIC(M′ | D)
3:   if AIC(M′ | D) < AIC(M | D) then
4:     break
5:   else
6:     M ← M′
7:   end if
8: end while
9: M ← EM(M, D)
10: return M

5.5 Cardinality and Parameter Refinement

Pyramid starts with a BN M that consists of only manifest nodes and adds latent nodes

to it one by one. Latent nodes are added by the subroutine AddNode. When a latent

node is added, its cardinality is also determined (line 2 of AddNode). The cardinality is

determined locally using information from a subset of manifest variables. In the final

model, however, the latent node is connected to all manifest variables via other latent

nodes. As such, the cardinality might need to be increased so as to achieve better model

fit.

Consider the latent variable Y1 that was introduced in Example 5.2. As shown in

Figure 5.1c, the cardinality of Y1 was determined based on the interactions among manifest

variables X1–X5. Specifically, Y1 was introduced to capture the interactions between

two subsets of manifest variables, namely {X1, X2, X3} and {X4, X5}. The tighter the

interactions, the higher the cardinality of Y1 needs to be. In the final model, on the other

hand, Y1 needs to capture not only the interactions between {X1, X2, X3} and {X4, X5}, but also those between {X1, X2, X3} and all the other manifest variables. Hence, its

cardinality might need to be increased.

After the completion of model structure construction at line 19, Pyramid invokes the

subroutine Refine to determine whether the cardinalities of latent variables should be

increased and to increase them if necessary. The pseudo code for Refine is shown in

Algorithm 5.7. It is a hill-climbing process. At each step, the possibility of increasing the

cardinality of each latent variable by one is considered. This results in a list of candidate

models, with one corresponding to each latent variable. The candidate model with the

highest AIC score is then picked. If the chosen candidate model improves over the current

model, the search moves on to the next step. Otherwise, the search process terminates.

During the hill-climbing process, local EM instead of full EM is run on the candidate

models for the sake of efficiency. At the end of the search, full EM is run on the final

model to optimize the parameters.


5.6 Summary

In this chapter, we have developed a third algorithm, called Pyramid, for learning LTMs.

It is designed to be a general-purpose algorithm (1) that is much faster than EAST and (2)

that can find good quality models and discover interesting latent structures. In the next

chapter, we will provide empirical evidence that Pyramid indeed meets the objectives.

Furthermore, in Chapter 8, we will apply both Pyramid and EAST to classification. The

reader will see that Pyramid represents a better tradeoff between computational time and

model quality/classification performance than EAST.


CHAPTER 6

EMPIRICAL EVALUATION

We have presented three algorithms for learning LTMs for the task of density estima-

tion, namely EAST, HCL, and Pyramid. In this chapter, we empirically evaluate the

performance of those algorithms on both synthetic and real-world data.

6.1 Data Sets

We used 3 synthetic data sets and 2 real-world data sets in our experiments. They are

detailed in the following 2 subsections.

6.1.1 Synthetic Data Sets

The synthetic data sets were generated from 3 LTMs. The generative models contain 7,

12, and 18 manifest variables. We denote them by MG7, MG12, and MG18, respectively.

The structures of the models are shown in Figure 6.1. All the manifest variables in the

models take 3 values. The cardinalities of the latent variables are denoted by the numbers

within the parentheses in the latent nodes. The parameters of the models were randomly

generated such that correlations along the edges are strong.

From each model, we sampled a training set with 5k instances over the manifest

variables. We also sampled a separate 5k test set. In the experiments, we ran the learning

algorithms on the training set and evaluated the quality of the resulting LTMs using the

test set. Henceforth, we use D7, D12, and D18 to denote the 3 training/test pairs sampled

from MG7, MG12, and MG18, respectively.

6.1.2 Real-World Data Sets

CoIL Challenge 2000 Data

The first real-world data set comes from the contest of CoIL Challenge 2000 (van der

Putten & van Someren, 2000). We refer to it as the CoIL data for short. This data

contains information on customers of a Dutch insurance company. Its training set and

test set consist of 5,822 and 4,000 customer records, respectively. Each record consists of

86 attributes, containing socio-demographic information (Attributes 1–43) and insurance


(a) MG7

(b) MG12

(c) MG18

Figure 6.1: The structures of the generative models of the 3 synthetic data sets.

product ownerships (Attributes 44–86). The socio-demographic data are derived from

zip codes. In previous analyses, these variables were found to be more or less useless. In

our experiments, we included only three of them, namely Attributes 4 (average age),

5 (customer main type), and 43 (purchasing power class). All the product ownership

attributes were included in the experiments.

The data was preprocessed as follows: First, similar attribute values were merged so

that there are at least 30 records in the training set for each value. In the resulting

training set, there are fewer than 10 records where Attributes 50, 60, 71, and 81 take

nonzero values. Those attributes were excluded from the experiments. The final data set

consists of 42 attributes, each with 2 to 9 values.

Kidney Deficiency Data

The second real-world data set is survey data from the domain of traditional Chinese

medicine (TCM) (Zhang et al., 2008a, 2008b). The data set contains symptom information

on seniors at or above the age of 60 years from several regions in China. It consists

of 35 symptom variables and 2,600 patient records. The symptom variables are the

most important factors that a TCM doctor would consider when determining whether a

patient has an illness condition called Kidney Deficiency and if so, which subtype. Hence,

we refer to the data set as the Kidney data. Each symptom variable has four possible

values, namely ‘no’, ‘light’, ‘medium’, and ‘severe’, representing four severity levels. In

our experiments, we split the data into a training set with 2,000 cases and a test set with


600 cases.

6.2 Measures of Model Quality

In density estimation problems, an estimate is of high quality if it is close to the generative

distribution. Hence, for LTMs learned from the synthetic data, we measure their quality

using the empirical KL divergence from the generative models. Given a generative model

MG and test data D, the empirical KL divergence of a model M from MG is defined as

\[ D(\mathcal{M}^G \,\|\, \mathcal{M}) = \frac{\log P(\mathcal{D} \mid \mathcal{M}^G) - \log P(\mathcal{D} \mid \mathcal{M})}{|\mathcal{D}|}, \tag{6.1} \]

where |D| denotes the sample size of D. It is an approximation to the true KL divergence

\[ D(\mathcal{M}^G \,\|\, \mathcal{M}) = \sum_{\mathbf{X}} P(\mathbf{X} \mid \mathcal{M}^G) \log \frac{P(\mathbf{X} \mid \mathcal{M}^G)}{P(\mathbf{X} \mid \mathcal{M})}, \]

where X denotes the set of manifest variables. The smaller the empirical KL divergence,

the better the model M.

For real-world data, the generative models are unknown. Therefore, we cannot calcu-

late empirical KL divergence. We use the log-loss to measure the model quality. Given

test data D, the log-loss of a modelM is defined as follows,

\[ \text{log-loss}(\mathcal{M} \mid \mathcal{D}) = -\frac{\log P(\mathcal{D} \mid \mathcal{M})}{|\mathcal{D}|}. \]

It is different from the empirical KL divergence (Equation 6.1) in that it leaves out the

first term, which is independent of the model M. Hence, the smaller the log-loss, the

closer M is to the unknown generative model, and the higher the quality of M.
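Both measures reduce to simple arithmetic once the test log-likelihoods are available. A minimal sketch, assuming log P(D | M) for each model has been computed elsewhere (e.g., by propagation in the model):

def log_loss(loglik_model, test_size):
    """Log-loss: negative average test log-likelihood (natural logarithm)."""
    return -loglik_model / test_size

def empirical_kl(loglik_generative, loglik_model, test_size):
    """Empirical KL divergence (Equation 6.1): the per-case gap between the
    test log-likelihoods under the generative and the learned model."""
    return (loglik_generative - loglik_model) / test_size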

6.3 Impact of Algorithmic Parameters

Each of the three learning algorithms has a set of algorithmic parameters for the users

to set. In this section, we examine the impact of those parameters on the performance of

the algorithms. The empirical results provide a clue for choosing appropriate parameter

values in practice.

6.3.1 Experimental Settings

We begin with some experimental settings that are common to the three algorithms.


Taking a training data as input, each of the three algorithms outputs an LTM. The

parameters of the output models are optimized using full EM. To avoid local maxima,

we adopted the pyramid scheme proposed by Chickering and Heckerman (1997a) and set

the number of starting points of EM at 32. Moreover, we ran EM until the difference in

log-likelihood between two consecutive iterations fell below 0.1.

In EM, initial parameter values are randomly picked. Hence, there is inherent ran-

domness in the three algorithms. Consequently, we ran each algorithm 10 times on each

training set. We report the average performance, along with the standard deviation.

All the experiments in this chapter were run on a Linux server with an Intel Core2

Duo CPU at clock rate 2.4GHz and 4GB main memory.

6.3.2 EAST

As described in Chapter 3, EAST has six algorithmic parameters. Among those parame-

ters, four are used in the subroutines PickModel and PickModel-IR (Algorithms 3.7 and

3.8), i.e.,

1. νs, the number of iterations of local EM in the screening phase;

2. k, the number of candidate models that are selected to enter the evaluation phase;

3. µ, the number of starting points of local EM in the evaluation phase;

4. ν, the number of iterations of local EM in the evaluation phase.

The other two parameters are used by full EM, which is run on the models returned

by PickModel and PickModel-IR in order to compute their AIC scores. Those two

parameters are

1. µf , the number of starting points of full EM;

2. νf , the maximum number of iterations of full EM.

In our experiments, we fixed the values of k and νs at 50 and 10, respectively. We

then tested three different settings on the other four parameters. The details are given

in Table 6.1. In the following, we examine the performances of EAST under different

settings.

We first look at the impact of the algorithmic parameters on model quality. The

quality of the models induced by EAST are indicated by the red curves in Figure 6.2. In

general, the model quality remains relatively stable as the setting changes. The quality


Setting   µ    ν    µf   νf
Coarse    4   10     8    20
Mild      8   20    16    50
Fine     16   50    32   100

Table 6.1: The 3 settings on the algorithmic parameters of EAST and Pyramid that have been tested.

at the mild and fine settings is roughly the same, and is slightly higher than that at the

coarse setting.

We then inspect how the computational efficiency of EAST changes along with the

algorithmic parameters. The running time of EAST is shown in Figure 6.3. Again, we

focus on the red curves for now. It is clear that the running time almost always increases

as the parameter setting becomes finer. However, the increase from the coarse setting to

the mild setting is not significant.

In summary, by increasing the algorithmic parameters of EAST, one can earn a slight

improvement in model quality. However, the running time will also increase. In practice,

users can tune the parameters to achieve an appropriate tradeoff between model quality

and computational efficiency. In Chapter 8, we will apply EAST to build classifiers. In

this application, we found the mild setting a good choice.

6.3.3 HCL

The HCL algorithm has one parameter for users to set, i.e., the bound Imax on the

inferential complexity of the resulting LTM. Or equivalently, one can set the bound Cmax

on the cardinalities of the latent variables according to Equation 4.3. In this subsection,

we test three different values of Cmax, namely 2, 4, and 16, and examine the performance of

HCL under those settings.

The quality of the models produced by HCL is indicated by the green curves in Figure

6.2. It is clear that HCL achieved better model fit with larger Cmax values. This confirms

our analysis in Section 4.2: The larger the cardinalities of the latent variables, the stronger

the expressive power of the resulting model, and thus the closer the resulting model to

the generative distribution.

On the other hand, a larger Cmax value also leads to longer training time. This can be

seen from Figure 6.3.


6.3.4 Pyramid

As described in Chapter 5, Pyramid has 4 algorithmic parameters: (1) µ and ν for local

EM used in PickModel1 and PickModel1-IR, and (2) µf and νf for full EM used at line 5

of AddNode and line 13 of RestrictedExpand. In the experiments, we ran Pyramid using

the same three parameter settings as in the case of EAST (Table 6.1). We now examine

the impact of the parameters on the performance of Pyramid.

The quality of the models learned by Pyramid is denoted by the blue curves in Figure

6.2. In general, when we change to finer settings, the average model quality becomes

better. The only exception occurs for D7, where the empirical KL slightly increases when

we move from the mild setting to the fine setting. Moreover, the variance of model quality

decreases as we move from the coarse setting to the fine setting. This implies that Pyramid

with a higher setting is more robust.

We then look into the impact of the algorithmic parameters on the time efficiency

of Pyramid. By examining Figure 6.3, we can see that the running time of Pyramid

consistently increases as we move from the coarse setting to the fine setting.

In summary, one can achieve a tradeoff between the model quality and the time effi-

ciency of Pyramid by tuning the algorithmic parameters. In practice, the users can set the

parameters according to the requirements of the application. In our classification work

reported in Chapter 8, we found the mild setting appropriate.

6.4 Comparison of EAST, HCL and Pyramid

We now compare the performances of the three learning algorithms. We focus the com-

parisons on three aspects, namely model quality, computational efficiency, and latent

structure discovery capability. The comparisons will give us information on when to use

which algorithm.

6.4.1 Model Quality

We first compare the quality of the models induced by the 3 learning algorithms. By

examining Figure 6.2, we can see that EAST is the clear winner among the three algo-

rithms. It produced the best models in almost all cases. The only two exceptions occur in

the coarse and mild settings on D7, where EAST performed slightly worse than Pyramid.

Pyramid comes in the second place. On D7, the empirical KL for Pyramid is com-

parable with that for EAST. On the other four data sets, Pyramid is not as good as

EAST. This is expected. A number of heuristics are introduced in Pyramid for the sake


[Figure 6.2 (plots omitted): panels (a) D7, (b) D12, and (c) D18 report empirical KL divergence, and panels (d) CoIL and (e) Kidney report log-loss, for HCL, EAST, and Pyramid under the coarse, mild, and fine settings.]

Figure 6.2: The quality of the models produced by the three learning algorithms under various settings.


[Figure 6.3 (plots omitted): panels (a) D7, (b) D12, (c) D18, (d) CoIL, and (e) Kidney report running time in seconds, on a logarithmic scale, for EAST, HCL, and Pyramid under the coarse, mild, and fine settings.]

Figure 6.3: The running time of the three learning algorithms under various settings.


of efficiency. Some decrease in the quality of the models it can find is therefore to be expected. On the

other hand, Pyramid also achieved good model fit. As we move from the coarse setting

to the fine setting, the performance of Pyramid gradually approaches that of EAST. In

particular, at the fine setting, the difference between Pyramid and EAST in empirical KL

divergence or log-loss is only 0.02–0.04.

HCL produced the worst models among the three algorithms. In comparison with the

quality of the EAST and Pyramid models, the quality of the HCL models is so poor that we

have to plot the curves for HCL in a separate figure. We argue, however, that HCL can

yield solutions good enough for some density estimation problems in practice, especially

those with large sample sizes. The reader will see one such example in Chapter 7, where

we show that the LTMs induced by HCL provide accurate approximations to complex

Bayesian networks.

6.4.2 Computational Efficiency

In terms of computational efficiency, the ordering of the three algorithms is reversed.

EAST is the slowest among the three algorithms. As shown in Figure 6.3, on the smallest

synthetic data D7, EAST took several minutes. On the other two synthetic data D12 and

D18, it took several hours to produce a model. On the two real-world data sets, it took

up to one day to finish. Therefore, EAST is only suitable for small scale problems.

Pyramid is much more efficient than EAST. On the smallest data D7, Pyramid was

already 4–5 times faster. On the other 4 data sets, the difference was even larger. Pyramid

was more efficient than EAST by at least an order of magnitude. However, for real-world

data like CoIL and Kidney, Pyramid still ran for hours. Hence, it can be used to deal

with moderate size problems.

HCL is clearly the most efficient among the three algorithms. It is significantly faster

than Pyramid (and hence even faster than EAST). The gap can be up to two orders of

magnitude. See Figure 6.3d for example. Furthermore, on all the 5 data sets that we

tested, the learning process took at most several minutes. Therefore, HCL is the right

algorithm to choose when handling large problems.

6.4.3 Latent Structure Discovery

Besides model quality and computational efficiency, we are also concerned with the ca-

pability of discovering latent structures behind data. The latent structures can reveal

underlying regularities and give us insights into the domain. In this subsection, we com-

pare the performance of the three algorithms on this aspect.


(a) ME7  (b) ME12  (c) ME18

Figure 6.4: The structures of the best models learned by EAST from the 3 synthetic data sets.

To evaluate the performance of an algorithm, we examine the structures of the models

that it learned from the 3 synthetic data sets, and compare them with the structures of

the generative models. On each synthetic data set, we have run the algorithm 10 times using

3 different settings. This results in 30 models in total. Among those models, we only

pick the one with the highest quality and compare it with the generative model. We have

examined the other 29 models though and found their structures more or less the same

as that of the best model.

We start with EAST. Denote by ME7, ME12, ME18 the best models that EAST induced from D7, D12, D18, respectively. Their structures are shown in Figure 6.4. We now compare them with the generative models MG7, MG12, MG18. We first notice that EAST perfectly recovered the structure of the generative model MG12. The structure of ME12 is exactly the same as that of MG12. We also find that EAST almost perfectly recovered the structures of the generative models MG7 and MG18. There are only two differences between ME7 and MG7: the latent variable Y1 in MG7 is missing from ME7, and X4 is wrongly connected to latent variable Y2. The model ME18 is only different from MG18 in that it wrongly connects X5 to Y2 rather than Y6.

EAST also performed well in determining the cardinalities of the latent variables. By comparing with MG7, MG12, MG18, we can see that the latent variable cardinalities in ME7, ME12, ME18 are always correct or close to the true values.

We next examine the 3 best models learned by Pyramid. Denote them by MP7, MP12, MP18, respectively. Their structures are given in Figure 6.5. In general, Pyramid was


(a) MP7  (b) MP12  (c) MP18

Figure 6.5: The structures of the best models learned by Pyramid from the 3 synthetic data sets.

slightly inferior to EAST, but still did well in latent structure discovery. For D7, it

yielded the same model structure as EAST. See Figures 6.4a and 6.5a. For D12 and D18,

Pyramid only made a few more minor mistakes than EAST did. In MP12, Pyramid wrongly

connected X11 to Y5 rather than Y6. In MP18, Pyramid wrongly connected X11 to Y7 and

X14 to Y1, and missed the latent variable Y3, which leads to the incorrect connections

between Y2, Y6, Y7, and Y8. Due to those structural errors, Pyramid had to increase the

cardinalities of the latent variables in MP12 and MP18 in order to achieve good model fit.

This leads to over-estimation of the cardinalities.

Finally, we examine the models produced by HCL. Their structures are shown in

Figure 6.6. The structures are all binary trees and very different from the structures of

the generative models MG7, MG12, MG18. In fact, we can hardly map any latent variables in MH7, MH12, MH18 to the latent variables in MG7, MG12, MG18. This partly explains why

the quality of the HCL models is significantly inferior to the quality of the EAST and

Pyramid models on synthetic data.

6.5 Summary

We have empirically compared EAST, HCL, and Pyramid. Among the three algorithms,

EAST yields the best models. It also outperforms the other two algorithms in terms

of latent structure discovery. In fact, on the 3 synthetic data we tested, EAST almost


(a) MH7  (b) MH12  (c) MH18

Figure 6.6: The structures of the best models learned by HCL from the 3 synthetic data sets.

perfectly recovered the generative latent structures. However, EAST is computationally

expensive to use, which limits its applicability to only small scale problems. Pyramid is

more efficient than EAST. It produces models that are slightly worse than those produced

by EAST. It also makes a few more mistakes in discovering latent structures than EAST

does.

HCL is the most efficient among the three algorithms. It is faster than the other two by

orders of magnitude and can be used on large scale problems. However, the models induced

by HCL are significantly inferior to those induced by EAST and Pyramid. Moreover, HCL

always yields binary latent structures, which are very different from the ground truth and

not very meaningful.

In the next two chapters, we apply the three algorithms to two applications of different

natures. In the first application, the training data is usually large. Therefore, we use the


HCL algorithm in this case. In the second application, the training data is of moderate

size. Learning good models and discovering interesting latent structures are more critical.

Therefore, we apply EAST and Pyramid.


CHAPTER 7

APPLICATION 1: APPROXIMATEPROBABILISTIC INFERENCE

In the previous chapters, we have been focusing on developing algorithms for density

estimation using LTMs. Henceforth, we will turn to the applications of those techniques.

In this chapter, we study the problem of probabilistic inference in Bayesian networks and

propose an approximate method by utilizing the HCL algorithm. In the next chapter, we

will apply EAST and Pyramid to classification.

7.1 Probabilistic Inference in Bayesian Networks

We start by introducing the problem of probabilistic inference. Let N be a BN over a

set of nodes X. Denote by PN (X) the joint distribution that N represents. Given a set

of querying nodes Q and a piece of evidence E = e, the task of probabilistic inference is

to calculate the posterior distribution PN (Q|E = e). The set E can be empty. In this

case, there is no evidence observed and the quantity of interest reduces to the marginal

distribution PN (Q). For simplicity, we assume that Q contains only a single node Q.

However, the following discussions apply to the general case as well.

Probabilistic inference in general Bayesian networks is computationally intractable. As

shown by Cooper (1990), it is an NP-hard problem. As a matter of fact, all exact inference

algorithms, including clique tree propagation (Lauritzen & Spiegelhalter, 1988; Jensen

et al., 1990; Shenoy & Shafer, 1990), recursive conditioning (Darwiche, 2001), and variable

elimination (Zhang & Poole, 1994; Dechter, 1996), share an exponential complexity in the

network treewidth, which is defined to be one less than the size of the largest clique in

the optimal triangulation of the moral graph (Robertson & Seymour, 1984). Densely

connected Bayesian networks usually have large treewidths. To speed up inference in

such networks, researchers often resort to approximations. However, one cannot expect

a universally accurate yet efficient approximate method because approximate inference

with guaranteed accuracy is also NP-hard (Dagum & Luby, 1993).

Despite the negative complexity results, there is a special class of Bayesian net-

works amenable to fast inference, namely tree-structured Bayesian networks. This class

of networks has treewidth 1. Exact inference in them only takes time linear in the num-

ber of nodes (Pearl, 1988). In the following, we exploit this fact to develop an efficient

approximate inference method for general Bayesian networks.


7.2 Basic Idea

Our approximate method is based on LTMs. The idea is as follows:

• Offline: Construct an LTM M that approximates a BN N in the sense that the

joint distribution of the manifest variables in M approximately equals the joint

distribution of the variables in N .

• Online: Use M instead of N to compute answers to probabilistic queries.

Intuitively, our method can produce high quality approximations at a low online cost.

On one hand, LTMs are tree-structured. Hence, the online phase only takes time linear

in the number of nodes. On the other hand, due to the introduction of latent variables,

LTMs can capture complicated relationships among manifest variables. Therefore, they

can approximate complex BNs well and thus provide accurate answers to probabilistic

queries.

7.2.1 User Specified Bound on Inferential Complexity

The cardinalities of the latent variables play a crucial role in the approximation scheme.

They determine inferential complexity and influence approximation accuracy. At one

extreme, we can represent a BN exactly using an LTM by setting the cardinalities of the

latent variables large enough (Proposition 4.1). In this case, the inferential complexity

is very high. At the other extreme, we can set the cardinalities of the latent variables

at 1. In this case, the manifest variables become mutually independent. The inferential

complexity is the lowest and the approximation quality is the poorest.

In our approximation scheme, we seek an appropriate middle point between those two

extremes. In particular, we require the user to specify a bound Imax on the inferential

complexity. In the next section, we discuss how to construct an LTM approximation that

satisfies this bound.

7.3 Approximating Bayesian Networks with Latent

Tree Models

Given a BN N and an inferential complexity constraint Imax, we now study the problem

of approximating N with an LTM M. Let X be the set of variables in N and PN (X)

be the joint distribution represented by N . For an LTM M to be an approximation of

N , it should use X as its manifest variables. Moreover, an approximation M is of high


quality if PM(X) is close to PN (X). We measure the quality of the approximation by the

KL divergence (Cover & Thomas, 1991)

\[ D[P_{\mathcal{N}}(\mathbf{X}) \,\|\, P_{\mathcal{M}}(\mathbf{X})] = \sum_{\mathbf{X}} P_{\mathcal{N}}(\mathbf{X}) \log \frac{P_{\mathcal{N}}(\mathbf{X})}{P_{\mathcal{M}}(\mathbf{X})}. \]

Our objective is thus to find an LTM that minimizes the KL divergence

\[ \mathcal{M}^\star = \arg\min_{\mathcal{M}} D[P_{\mathcal{N}}(\mathbf{X}) \,\|\, P_{\mathcal{M}}(\mathbf{X})] \]

subject to the complexity constraint Imax.

7.3.1 Two Computational Difficulties

The optimization problem is computationally intractable. There are two difficulties. The

first is that, given an LTMM, it is hard to compute the KL divergence D[PN (X)‖PM(X)]

due to the presence of latent variables in M. This can be seen by expanding the KL

divergence as follows,

\begin{align*}
D[P_{\mathcal{N}}(\mathbf{X}) \,\|\, P_{\mathcal{M}}(\mathbf{X})]
&= \sum_{\mathbf{X}} P_{\mathcal{N}}(\mathbf{X}) \log \frac{P_{\mathcal{N}}(\mathbf{X})}{P_{\mathcal{M}}(\mathbf{X})} \\
&= \sum_{\mathbf{X}} P_{\mathcal{N}}(\mathbf{X}) \log P_{\mathcal{N}}(\mathbf{X}) - \sum_{\mathbf{X}} P_{\mathcal{N}}(\mathbf{X}) \log P_{\mathcal{M}}(\mathbf{X}) \\
&= \sum_{\mathbf{X}} P_{\mathcal{N}}(\mathbf{X}) \log P_{\mathcal{N}}(\mathbf{X}) - \sum_{\mathbf{X}} P_{\mathcal{N}}(\mathbf{X}) \log \sum_{\mathbf{Y}} P_{\mathcal{M}}(\mathbf{X}, \mathbf{Y}).
\end{align*}

The first term on the last line can be neglected because it is independent of M. The

difficulty lies in computing the second term. The summation over latent variables Y

appearing inside the logarithm makes this term indecomposable. Therefore, one has to

sum over all possible values of X in the outer summation. This takes time exponential in

|X|, i.e., the number of variables in X.

The second difficulty is how to find a good LTM efficiently. Given a set of manifest

variables X, there are super-exponentially many LTMs over X. Exhaustive search over

such a large model space is apparently prohibitive. Even greedy search is infeasible given

the fact that evaluating the quality of a single LTM is already computationally hard.

7.3.2 Optimization via Density Estimation

Instead of directly tackling the optimization problem, we transform it into a density

estimation problem. The idea is as follows:


1. Sample a data set D with N i.i.d. cases from PN (X).

2. Learn an LTM M′⋆ from D that maximizes the AIC score and satisfies the com-

plexity constraint Imax.

It is well known that the KL divergence (and thus the approximation quality) of M′⋆

converges almost surely to that of M⋆ as the sample size N approaches infinity (Akaike,

1974). In practice, we use large N to achieve good approximation.

We now discuss the implementation of this solution. We start by generating D from

PN (X). Since PN (X) is represented by BN N , we use forward sampling (Henrion, 1988)

for this task. Specifically, to generate a sample from PN (X), we process the nodes

in a topological ordering¹. When handling node X, we sample its value according to the

conditional distribution P (X|pa(X) = j), where pa(X) denotes the set of parents of X

and j denotes their values, which have been sampled earlier. To obtain D, we repeat the

procedure N times.
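A minimal sketch of forward sampling, assuming the BN is given as a topologically ordered list of nodes, a parents map, and CPTs indexed by tuples of parent values; these data structures are stand-ins for the thesis's BN representation.

import random

def forward_sample(nodes, parents, cpt):
    """Draw one case from P_N(X) by sampling nodes in topological order.
    cpt[X][parent_values] is a dict {value: probability}."""
    case = {}
    for X in nodes:
        parent_values = tuple(case[p] for p in parents[X])
        dist = cpt[X][parent_values]
        values, probs = zip(*dist.items())
        case[X] = random.choices(values, weights=probs, k=1)[0]
    return case

def sample_data(nodes, parents, cpt, N):
    """Generate the data set D by repeating the procedure N times."""
    return [forward_sample(nodes, parents, cpt) for _ in range(N)]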

In the second step of this solution, we need to learn from D an LTM that has a high

AIC score and that satisfies the complexity constraint. We consider the three algorithms

developed in Chapters 3–5, namely EAST, HCL, and Pyramid, for this task. Recall that,

in order to achieve good approximation, we set the sample size N at a large value. Hence,

it would be computationally unaffordable to run EAST and Pyramid on D. So we choose

to run HCL to learn an LTM from D with inferential complexity constraint Imax.

7.3.3 Impact of Imax

Given the inferential complexity constraint Imax, HCL sets the cardinalities of latent

variables at Cmax, as given by Equation 4.3. It is clear that Cmax is a monotonically increasing

function of Imax. The larger the value of Imax, the larger the value of Cmax, and according

to the discussion in Section 4.2, the better the approximation that our method can achieve.

Therefore, the user can obtain more accurate approximation at the cost of longer online

running time.

7.4 LTM-based Approximate Inference

The focus of this chapter is approximate inference in Bayesian networks. We propose the

following two-phase method:


1. Offline: Given a BN N and a bound Imax on the inferential complexity, sample a

data set with N samples from N and use HCL to learn an approximate LTM M from the data. The sample size N should be set as large as possible.

2. Online: Make inference in M instead of N . More specifically, given a piece of

evidence E = e and a querying variable Q, return PM(Q|E = e) as an approximation

to PN (Q|E = e).
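For reference, the quantity being approximated in the online phase is P(Q|E = e). The brute-force sketch below computes it by direct marginalization of a full joint table; it takes time exponential in |X|, which is precisely why both phases rely on clique tree propagation instead. The table-based interface is an assumption made only for illustration.

```python
import numpy as np

def posterior(joint, var_names, query, evidence):
    # joint     : numpy array with one axis per variable, in var_names order
    # var_names : list of variable names matching the axes of joint
    # query     : name of the query variable Q (must not be observed)
    # evidence  : dict {variable name: observed state index}
    index = tuple(evidence.get(v, slice(None)) for v in var_names)
    table = joint[index]                                  # condition on E = e (unnormalized)
    remaining = [v for v in var_names if v not in evidence]
    q_axis = remaining.index(query)
    other_axes = tuple(i for i in range(len(remaining)) if i != q_axis)
    marginal = table.sum(axis=other_axes) if other_axes else table
    return marginal / marginal.sum()                      # normalize to obtain P(Q | E = e)
```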

7.5 Empirical Results

In this section, we empirically evaluate our approximate inference method. We first ex-

amine the impact of sample size N and inferential complexity constraint Imax on the

performance of our method. Then we compare our method with clique tree propagation

(CTP), which is the state-of-the-art exact inference algorithm, and loopy belief propaga-

tion (LBP), which is a standard approximate inference method that has been successfully

used in many real world domains (Frey & MacKay, 1997; Murphy et al., 1999). We also

compare our method with two other approximate methods that perform exact inference in approximate models, one based on the Chow-Liu (CL) tree and the other based on the latent class model (LCM).

7.5.1 Experimental Settings

We used 8 networks in our experiments. They are listed in Table 7.1. Cpcs54 is a subset

of the Cpcs network (Pradhan, Provan, Middleton, & Henrion, 1994). The other net-

works are available at http://www.cs.huji.ac.il/labs/compbio/Repository/. Table

7.1 also reports the characteristics of the networks, including the number of nodes, the

average/max indegree and cardinality of the nodes, and the inferential complexity (i.e.,

the sum of the clique sizes in the clique tree). The networks are sorted in ascending order

with respect to the inferential complexity.

For each network, we simulated 500 pieces of evidence. Each piece of evidence was set

on all the leaf nodes by sampling based on the joint probability distribution. Then we

used the CTP algorithm and the approximate inference methods to compute the posterior

distribution of each non-leaf node conditioned on each piece of evidence. The accuracy

of an approximate method is measured by the average KL divergence between the exact

and the approximate posterior distributions over all the query nodes and evidence.
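A minimal sketch of this accuracy metric follows: it averages D[P_exact || P_approx] over all (query node, evidence) pairs. The small eps guard against zero probabilities is our own convenience and not part of the thesis' definition.

```python
import math

def average_kl(exact, approx, eps=1e-12):
    # exact, approx: lists of discrete posterior distributions (lists of floats),
    # one pair per (query node, evidence) combination.
    total = 0.0
    for p, q in zip(exact, approx):
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(exact)
```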

All the algorithms in the experiments were implemented in Java and run on a machine

with an Intel Pentium IV 3.2GHz CPU and 1GB RAM.

Network        Number of Nodes   Average/Max Indegree   Average/Max Cardinality   Inferential Complexity
Alarm                 37               1.24/4                  2.84/4                       1,038
Win95pts              76               1.47/7                  2/2                          2,684
Hailfinder            56               1.18/4                  3.98/11                      9,706
Insurance             27               1.93/3                  3.3/5                       29,352
Cpcs54                54               2/9                     2/2                        109,208
Water                 32               2.06/5                  3.62/4                   3,028,305
Mildew                35               1.31/3                  17.6/100                 3,400,464
Barley                48               1.75/4                  8.77/67                 17,140,796

Table 7.1: The networks used in the experiments and their characteristics.

7.5.2 Impact of N and Imax

We discussed the impact of N and Imax on the performance of our method in Sections

7.2 and 7.3.3. This subsection empirically verifies the claims.

Three sample sizes were chosen in the experiments: 1k, 10k, and 100k. For each

network, we also chose a set of Imax values. LTMs were then learned using HCL with different combinations of the values of N and Imax. For parameter learning, we terminated EM either when the improvement in log-likelihood was smaller than 0.1, or when the algorithm

ran for two months. The pyramid scheme by Chickering and Heckerman (1997b) was used

to avoid local maxima. The number of starting points was set at 16.

The running time of HCL is plotted in Figure 7.1. The y-axes denote the time in hours,

while the x-axes denote the value of Cmax in HCL for different choices of Imax. Recall

that Cmax is a monotonically increasing function of Imax; a larger Cmax value implies a larger Imax value. The three curves correspond to different values of N. In general, the running

time increases with N and Cmax, ranging from seconds to weeks. For some settings, EM

failed to converge in two months. Those settings are indicated by arrows in the plots. We

emphasize that HCL is executed offline and its running time should not be confused with

the time for online inference, which will be reported next.

After obtaining the LTMs, we used clique tree propagation to make inference. The

approximation accuracy is shown in Figure 7.2. The y-axes denote the average KL

divergence, while the x-axes still denote the value of Cmax for HCL. The three curves in

each plot correspond to the three sample sizes we used.

We first examine the impact of sample size by comparing the corresponding curves in

each plot. We find that, in general, the curves for larger samples are located below those

for smaller ones. This shows that the approximation accuracy increases with the size of

the training data.

(Figure 7.1: Running time of HCL under different settings. Panels (a)–(h): Alarm, Win95pts, Hailfinder, Insurance, Cpcs54, Water, Mildew, Barley. Each panel plots time in hours (log scale) against Cmax, with one curve per sample size N = 1k, 10k, 100k. Settings for which EM did not converge are indicated by arrows.)

(Figure 7.2: Approximation accuracy of the LTM-based method under different settings. Panels (a)–(h): Alarm, Win95pts, Hailfinder, Insurance, Cpcs54, Water, Mildew, Barley. Each panel plots average KL divergence (log scale) against Cmax, with one curve per sample size N = 1k, 10k, 100k.)

To see the impact of Cmax, we examine each individual curve from left to right. Ac-

cording to our discussion, the curve is expected to drop monotonically as Cmax increases.

This is generally true for the results with sample size 100k. For sample sizes 1k and 10k,

however, there are cases in which the approximation becomes poorer as Cmax increases.

See Figure 7.2e and 7.2f. This phenomenon does not conflict with our claims. As Cmax

increases, the expressive power of the learned LTM increases. So it tends to overfit the

data. On the other hand, the empirical distribution of a small set of data may signifi-

cantly deviate from the joint distribution of the BN. This also suggests that the sample

size should be set as large as possible.

Finally, let us examine the impact of N and Cmax on the inferential complexity. Figure

7.3 plots the running time for calculating answers to all the queries using the learned

LTMs. It can be seen that the three curves for different sample sizes overlap in all

plots. This implies that the running time is independent of the sample size N . On the

other hand, all the curves are monotonically increasing. This confirms our claim that the

inferential complexity is positively dependent on Cmax.

In the following subsections, we will only consider the results for N = 100k. Under this

setting, our method achieves the highest accuracy. For clarity, we reproduce the average

KL divergence and the online running time of our method with N = 100k in Figures 7.4

and 7.5, respectively. See the blue curve in each plot.

7.5.3 Comparison with CTP

We now compare our method with CTP, the state-of-the-art exact inference algorithm.

The first concern is how accurate our method is. By examining Figure 7.4, we argue that our method always achieves good approximation accuracy: for Hailfinder, Cpcs54, and Water, the average KL divergence of our method is around or less than 10^{-3}; for the other networks, it is around or less than 10^{-2}.

We next compare the inferential efficiency of our method and the CTP algorithm. The

running time of CTP is denoted by dashed horizontal lines in the plots of Figure 7.5. It

can be seen that our method is more efficient than the CTP algorithm. In particular, for

the five networks with the highest inferential complexity, our method is faster than CTP

by two to three orders of magnitude.

To summarize, the results suggest that our method can achieve good approximation

accuracy at low computational cost.

(Figure 7.3: Running time of the online phase of the LTM-based method under different settings. Panels (a)–(h): Alarm, Win95pts, Hailfinder, Insurance, Cpcs54, Water, Mildew, Barley. Each panel plots time in seconds (log scale) against Cmax, with one curve per sample size N = 1k, 10k, 100k.)

(Figure 7.4: Approximation accuracy of various inference methods. Panels (a)–(h): Alarm, Win95pts, Hailfinder, Insurance, Cpcs54, Water, Mildew, Barley. Each panel plots average KL divergence (log scale) against Cmax for the LTM, LBP, CL, and LCM methods.)

(Figure 7.5: Running time of various inference methods. Panels (a)–(h): Alarm, Win95pts, Hailfinder, Insurance, Cpcs54, Water, Mildew, Barley. Each panel plots time in seconds (log scale) against Cmax for the LTM, CTP, LBP, CL, and LCM methods.)

7.5.4 Comparison with LBP

We now compare our method with LBP. The latter is an iterative algorithm. It can be used

as an anytime inference method by running a specific number of iterations. In our first set

of experiments, we let LBP run as long as our method and compare their approximation

accuracy. We did this for each network and each value of Cmax. The accuracy of LBP is denoted by the curves labeled LBP in Figure 7.4. By comparing those curves with the

LTM curves for N = 100k, we see that our method achieves significantly higher accuracy

than LBP in most cases: for Water, the difference in average KL divergence is up to three orders of magnitude; for the other networks, the difference is up to one order of magnitude. For Hailfinder with Cmax = 32, LBP is two times more accurate than our method. However, our method also achieves good approximation accuracy in this case: the average KL divergence is smaller than 10^{-3}. Finally, we noticed that the LBP curves

are horizontal lines for Cpcs54, Mildew, and Barley. Further investigation on those cases

shows that LBP finished only one iteration in the given time period.

We next examine how much time it takes for LBP to achieve the same level of accuracy

as our method. For each piece of evidence, we ran LBP until its average KL divergence

is comparable with that of our method or the number of iterations exceeds 100. The

running time of LBP is denoted by the curves labeled LBP in Figure 7.5. Comparing

those curves with the LTM curves, we found that LBP takes much more time than our

method: For Mildew, LBP is slower than our method by three orders of magnitude; For

the other networks except Hailfinder, LBP is slower by one to two orders of magnitude;

For Hailfinder with Cmax = 32, the running times of the two methods are similar. The results show that our method compares favorably with LBP on the networks that we

examined.

7.5.5 Comparison with CL-based Method

Our inference method is fast because LTM is tree-structured. One can also construct a

Chow-Liu tree (Chow & Liu, 1968) to approximate the original BN and use it for inference.

We refer to this approach as the CL-based method. In this subsection, we empirically

compare our method with the CL-based method. More specifically, for each network, we

learn a tree model from the 100k samples using the maximum spanning tree algorithm

developed by Chow and Liu (1968). We then use the learned tree model to answer the

queries.

The approximation accuracy of the CL-based method is shown as solid horizontal lines in the plots in Figure 7.4. Compared with the CL-based method, our method

achieves higher accuracy in all the networks except for Mildew. For Insurance, Water,


and Barley, the differences are significant. For Mildew, our method is competitive with

the CL-based method. In the meantime, we notice that the CL-based method achieves

good approximations in all the networks except for Barley. The average KL divergence

is around or less than 10^{-2}.

An obvious advantage of the CL-based method is its high efficiency. This can be seen from the plots in Figure 7.5. In most of the plots, the CL line lies below the second data point on the LTM curve. The exception is Mildew, for which the running time of the CL-based method is as long as that of our method with Cmax = 16.

In summary, the results suggest that the CL-based method is a good choice for ap-

proximate inference if the online inference time is very limited. Otherwise, our method

is more attractive because it is able to produce more accurate results when more time is

allowed.

7.5.6 Comparison with LCM-based Method

Lowd and Domingos (2005) have previously investigated the use of LCM for density

estimation. Given a data set, they determine the cardinality of the latent variable using

hold-out validation, and optimize the parameters using EM. It is shown that the learned

LCM achieves good model fit on a separate testing set. The LCM was also used to answer

simulated probabilistic queries, and the results turned out to be good.

Inspired by their work, we also learned a set of LCMs from the 100k samples and

compared them with LTMs on the approximate inference task. Our learning strategy

is slightly different. Since LCM is a special case of LTM, its inferential complexity can

also be controlled by changing the cardinality of the latent variable. In our experiments,

we set the cardinality such that the sum of the clique sizes in the clique tree of the

LCM is roughly the same as that for the LTM learned with a chosen Cmax. In this way,

the inferential complexity of the two models are comparable. This can be verified by

examining the LCM curves in Figure 7.5. We then optimize the parameters of the LCM

using EM with the same setting as in the case of LTM.

As shown in Figure 7.4, for Alarm, Win95pts, Cpcs54, Water, and Barley, the LCM

curves are located above the LTM curves. That is, our method consistently outperforms

the LCM-based method for all Cmax. For Hailfinder and Mildew, our method is worse

than the LCM-based method when Cmax is small. But when Cmax becomes large, our

method begins to win. For Insurance, the performance of the two methods are very

close. The results suggest that unrestricted LTMs are more suitable for approximate inference than LCMs.


7.6 Related Work

The idea of approximating complex BNs by simple models and using the latter to make

inference has been investigated previously. The existing work mainly falls into two cat-

egories. The work in the first category approximates the joint distributions of the BNs

and uses the approximation to answer all probabilistic queries. In contrast, the work

in the second category is query-specific. It assumes the evidence is known and directly

approximates the posterior distribution of the querying nodes.

Our method falls in the first category. We investigate the use of LTMs under this

framework. This possibility has also been studied by Pearl (1988) and Sarkar (1995).

Pearl (1988) develops an algorithm for constructing an LTM that is marginally equivalent

to a joint distribution P (X), assuming such an LTM exists. Sarkar (1995) studies how to

build good LTMs when only an approximation is attainable. Their methods, however, can only deal with binary variables.

Researchers have also explored the use of other models. Chow and Liu (1968) consider

tree-structured BNs without latent variables. They develop a maximum spanning tree

algorithm to efficiently construct the tree model that is closest to the original BN in

terms of KL divergence. Lowd and Domingos (2005) learn an LCM to summarize a data

set. The cardinality of the latent variable is determined so that the logscore on a hold-out

set is maximized. They show that the learned model achieves good model fit on a separate

testing set, and can provide accurate answers to simulated probabilistic queries. In both works, the approximation quality and the inferential complexity of the learned model are

fixed. Our method, on the other hand, provides a parameter Imax to let users make the

tradeoff between approximation quality and inferential complexity.

Our method and the methods mentioned above build approximate models from

scratch. Alternatively, one can start with the given Bayesian network and simplify it

to obtain an approximation. This idea is realized by van Engelen (1997). The author

proposes to simplify a Bayesian network by removing a set of edges from it. The selection

of edges is made so as to achieve a tradeoff between the computational efficiency and the

accuracy of the resultant approximate network.

Rather than simplifying a Bayesian network itself, researchers have also considered simplifying its clique tree when clique tree propagation is used for inference. Jensen and

Andersen (1990) propose to reduce clique sizes by annihilating configurations with small

probabilities from potential functions. Kjærulff (1994) develops a complementary method

that removes weak dependencies among variables within cliques. Removal of dependencies

causes cliques to split into smaller ones and thus reduces the computational cost of running

propagation on the resultant clique tree.


The work in the second category is mainly carried out under the variational framework.

The mean field method (Saul, Jaakkola, & Jordan, 1996) assumes that the querying

nodes are mutually independent. It constructs an independent model that is close to

the posterior distribution. As an improvement to the mean field method, the structured

mean field method (Saul & Jordan, 1996) preserves a tractable substructure among the

querying nodes, rather than neglecting all interactions. Bishop et al. (1997) consider

another improvement, i.e., mixtures of mean field distributions. It essentially fits an

LCM to the posterior distribution.

As a different variational method, Choi and Darwiche (2006) simplify the given Bayesian

network to obtain an approximation of the posterior distribution. The idea is to remove

a set of edges from the original network, and optimize the parameters of the simplified

network such that the divergence between the approximate and the true posterior distri-

butions is minimized. The more edges deleted, the faster the inference, and the worse

the approximation accuracy. At one extreme, when enough edges are removed to yield a

polytree, the proposed method reduces to LBP.

All those methods from the second category directly approximate posterior distribu-

tions. Therefore, they might be more accurate than our method when used to make

inference. However, these methods are evidence-specific and construct approximations

online. Moreover, they involve an iterative process for optimizing the variational pa-

rameters. Consequently, the online running time is unpredictable. With our method, in

contrast, one can determine the inferential complexity beforehand.

7.7 Summary

We have shown one application of the HCL algorithm described in Chapter 4 and used it to develop a novel scheme for approximate inference in BNs. With our scheme, one can make a tradeoff between the approximation accuracy and the inferential complexity. Our scheme

achieves good accuracy at low costs in all the networks that we examined. In particular,

it outperforms LBP. Given the same amount of time, our method achieves significantly

higher accuracy than LBP in most cases. To achieve the same accuracy, LBP needs one

to three orders of magnitude more time than our method. We also show that LTMs are

superior to Chow-Liu trees and LCMs when used for approximate inference.


CHAPTER 8

APPLICATION 2: CLASSIFICATION

In this chapter, we describe an application of LTMs in classification. The idea is to

estimate the class-conditional distributions of each class using LTMs and then use the

Bayes rule for classification. The resulting classifiers are called latent tree classifiers (LTCs).

Both EAST and Pyramid are considered for the density estimation problem. Empirical

results are provided to compare EAST and Pyramid in this setting and to compare LTC

with a number of related alternative methods.

8.1 Background

Classification is one of the subjects that have received the most attention in the machine

learning literature. It has numerous applications in many areas such as computer vision,

speech recognition, and text categorization, among others. Given training data D =

{(x1, c1), (x2, c2), . . . , (xn, cn)} where each instance is described by a set X of attributes

and has a class label c, the problem is to build a classifier f : X→ C that can accurately

predict the class labels of future instances based on their attribute values.

Assume there is a generative distribution P (X, C) underlying the data. Given a

classifier f, we measure its classification accuracy using the probability of error:

    \mathrm{err}(f) = \sum_{X} P(X) \left( 1 - P(f(X) \mid X) \right).

The lower the error, the better the classifier.

Bayes decision theory (Duda & Hart, 1973) states that the minimum achievable probability of error is

    \min_{f} \mathrm{err}(f) = \sum_{X} P(X) \left( 1 - \max_{C} P(C \mid X) \right).

This quantity is well known as the Bayes error rate. It is realized by the optimal classifier f^\star(X), where

    f^\star(X) = \arg\max_{C} P(C \mid X).    (8.1)
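As an illustration of these two quantities, the sketch below evaluates err(f) and the Bayes error rate from an explicit joint table P(X, C); the flattened table representation (one row per configuration of X) is an assumption made for illustration, and it presumes every configuration of X has positive probability.

```python
import numpy as np

def error_rate(joint, f):
    # joint : array of shape (#configurations of X, #classes), entries P(X = x, C = c)
    # f     : classifier mapping a configuration index of X to a predicted class index
    p_x = joint.sum(axis=1)                               # P(X)
    p_c_given_x = joint / p_x[:, None]                    # P(C | X)
    chosen = np.array([p_c_given_x[i, f(i)] for i in range(joint.shape[0])])
    return float(np.sum(p_x * (1.0 - chosen)))            # err(f)

def bayes_error(joint):
    # Minimum achievable probability of error: choose argmax_C P(C | X) for every x.
    p_x = joint.sum(axis=1)
    p_c_given_x = joint / p_x[:, None]
    return float(np.sum(p_x * (1.0 - p_c_given_x.max(axis=1))))
```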


8.2 Build Classifiers via Density Estimation

If we knew the posterior class distribution P (C|X), we could easily obtain the optimal

classifier f ⋆(X) according to Equation 8.1. However, in practice, we only have a data

set D that was drawn from P(X, C). A common approach in machine learning is to construct an estimate P̂(C|X) of P(C|X) from D, and then build a classifier based on P̂(C|X). In general, the more accurate the estimate, the better the resulting classifier.

8.2.1 The Generative Approach to Classification

There are two different ways to estimate P(C|X). The generative approach first constructs an estimate P̂(X, C) of P(X, C), and then computes the posterior distribution P̂(C|X) based on P̂(X, C) using Bayes rule. In contrast, the discriminative approach directly estimates P(C|X) from data.

The generative and discriminative approaches each have their own advantages and draw-

backs. On one hand, the generative approach requires an intermediate step to model the

joint distribution P (X, C) over both X and C. However, what we need for classification

is only the conditional distribution P (C|X). In this sense, the discriminative approach is

more direct than the generative approach. It usually yields more accurate classifiers than

the latter. On the other hand, modeling the joint distribution enables the generative ap-

proach to handle missing values in a principled manner and reveal interesting structures

underlying the data. Learning is also computationally more efficient in the generative

approach than in the discriminative approach. For detailed comparisons between the

generative and discriminative approaches, we refer the readers to (Rubinstein & Hastie,

1997; Ng & Jordan, 2001; Jebara, 2004).

8.2.2 Generative Classifiers Based on Latent Tree Models

In this chapter, we focus on the generative approach. Different generative methods make

different assumptions about the form of the true distribution P (X, C). The simplest one

is the naïve Bayes (NB) classifier (Duda & Hart, 1973). It assumes that all the attributes in

a data set are mutually independent given the class label. Under this assumption, the

generative distribution decomposes as follows:

    P(X, C) = P(C) \prod_{X \in \mathbf{X}} P(X \mid C).

All dependencies among attributes are ignored. An example NB is shown in Figure 8.1a.

Despite its simplicity, NB has been shown to be surprisingly accurate in a number of

domains (Domingos & Pazzani, 1997).

(Figure 8.1: NB, TAN, and LTC. Panels: (a) NB, (b) TAN, (c) LTC. C is the class variable; X1, X2, X3, and X4 are four attributes; Y1 and Y2 are latent variables.)

The conditional independence assumption underlying NB is rarely true in practice.

Violating this assumption could lead to poor approximation to P (X, C) and thus er-

roneous classification. The past decade has seen a large body of work on relaxing this

assumption. One such work is tree-augmented naïve Bayes (TAN) (Friedman et al., 1997).

It builds a Chow-Liu tree (Chow & Liu, 1968) to model the attribute dependencies (Fig-

ure 8.1b). Another work is averaged one-dependence estimators (AODE) (Webb et al.,

2005). It constructs a set of tree models over the attributes and averages them to make

classification.

We propose a novel generative approach based on LTMs. In our approach, we treat

attributes as manifest variables and build LTMs to model the relationships among them.

The relationships could be different for different classes. Therefore, we build one LTM

for each class. The LTM for class c is an estimate of the class-conditional distribution

P (X|C = c). We refer to the collection of LTMs plus the prior class distribution as a latent tree classifier (LTC). Figure 8.1c shows an example LTC. Each rectangle in the figure contains the LTM for a class. Since the LTMs can model complex relationships among

attributes, we expect LTC to approximate the true distribution P (X, C) well and thus to

achieve good classification accuracy. Moreover, the latent structure induced for each class

may reveal underlying generative mechanism. We empirically verify those hypotheses in


the experiments.

As mentioned in Chapter 1, researchers have considered a special class of LTMs called

latent class models (LCMs) for density estimation and have obtained some promising

results (Lowd & Domingos, 2005). One can also build classifiers using LCMs instead of

LTMs. By restricting to LCMs, we obtain latent class classifier (LCC). In our experi-

ments, we empirically compare LTC with LCC. The results show that LTC is superior to

LCC. We attribute this to the flexibility of LTMs.

In the following two sections, we formally define latent tree classifier and present a

learning algorithm.

8.3 Latent Tree Classifier

We consider the classification problem where each instance is described using n attributes

X = {X1, X2, . . . , Xn}, and belongs to one of the r classes C = 1, 2, . . . , r. A latent

tree classifier (LTC) consists of a prior distribution P (C) on C and a collection of r

LTMs over the attributes X. We denote the c-th LTM by Mc = (mc, θc) and the set of

latent variables in Mc by Yc. The LTC represents a joint distribution over C and X,

∀c = 1, 2, . . . , r,

    P(C = c, \mathbf{X}) = P(C = c) P_{M_c}(\mathbf{X}) = P(C = c) \sum_{\mathbf{Y}_c} P_{M_c}(\mathbf{X}, \mathbf{Y}_c).    (8.2)

Given an LTC, we classify an instance X = x to the class c⋆, where

    c^\star = \arg\max_{C} P(C \mid \mathbf{X} = x) = \arg\max_{C} P(C, \mathbf{X} = x).    (8.3)

Making a prediction with an LTC requires marginalizing out all the latent variables from each LTM. Since an LTM is tree-structured, the marginalization can be done in time linear in the number of latent variables. Moreover, regular LTMs contain strictly fewer latent variables than manifest variables. Therefore, the time complexity of making a prediction

with LTC is O(|X| · r).
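As a concrete illustration of Equations 8.2 and 8.3, the following sketch classifies an instance given the class prior and, for each class, a routine that returns log P_Mc(x). The dictionary-of-callables interface is an assumption made for illustration (in an LTC the log-likelihood would be obtained by propagation in the tree model of each class), not the thesis code.

```python
import math

def ltc_predict(x, class_prior, class_loglik):
    # class_prior  : dict {class label c: P(C = c)}
    # class_loglik : dict {class label c: callable x -> log P_Mc(x)}, i.e. a routine
    #                that sums out the latent variables Y_c of the LTM for class c
    scores = {c: math.log(class_prior[c]) + class_loglik[c](x) for c in class_prior}
    return max(scores, key=scores.get)   # c* = argmax_c P(C = c, X = x), Equation 8.3
```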

8.4 A Learning Algorithm for Latent Tree Classifier

Given a labeled training set D, we now outline an algorithm for learning an LTC from D.

The algorithm consists of four steps:


1. Calculate the maximum likelihood estimate (MLE) P̂(C) of P(C) from D.

2. Split data D according to class label into r subsets, one for each class. Denote the

subset for class c by Dc.

3. For each class c = 1, 2, . . . , r, learn an LTM Mc from Dc to estimate the class-

conditional density P (X|C = c).

4. Smooth the obtained estimate P̂(C) and the parameters of each LTM Mc.

In the first step, we calculate the MLE P̂(C). This can be easily done by counting the number of instances belonging to each class in D. More specifically, we calculate

    \hat{P}(C = c) = \frac{|D_c|}{|D|}.

In the third step, we learn an LTM Mc for each class c. One can use either EAST

or Pyramid for this task. This results in two variants of the learning algorithm. We

refer to them as LTC-E and LTC-P, respectively. We will evaluate both variants in the

experiments.

In the last step, we smooth the parameters of the obtained LTC. This is detailed in

the next subsection.
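The four steps above can be outlined as follows. This is a sketch under stated assumptions: learn_ltm stands in for EAST or Pyramid and is assumed rather than implemented here, and smoothing (Step 4) is deferred to Section 8.4.1.

```python
from collections import Counter, defaultdict

def learn_ltc(data, labels, learn_ltm):
    # Step 1: maximum likelihood estimate of the class prior P(C).
    counts = Counter(labels)
    n = len(labels)
    prior = {c: counts[c] / n for c in counts}
    # Step 2: split the data by class label.
    by_class = defaultdict(list)
    for x, c in zip(data, labels):
        by_class[c].append(x)
    # Step 3: learn one LTM per class as an estimate of P(X | C = c).
    models = {c: learn_ltm(instances) for c, instances in by_class.items()}
    # Step 4 (parameter smoothing) is applied afterwards; see Section 8.4.1.
    return prior, models
```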

8.4.1 Parameter Smoothing

We estimate the prior P (C) and the LTM parameters θc using maximum likelihood es-

timation. As noticed in previous work (e.g., Friedman et al., 1997), when the size of the

training data is small, this could lead to unreliable estimation and thus deficient classifi-

cation accuracy. One common way to address this issue is to smooth the parameters using

Laplace correction (Niblett, 1987). Let N be the total number of instances in training

data and N_c be the number of instances labeled as class c. Let α be a predefined smoothing factor. We smooth the class prior distribution as follows:

    \hat{P}(C = c) = \frac{N_c + \alpha}{N + r\alpha}.

We also smooth the parameters for each LTM M_c. Let \theta_{cijk} = P(Z_i = j \mid \pi(Z_i) = k, m_c) be the parameter estimate produced by EM. We calculate the smoothed parameter \theta^{s}_{cijk} for all i, j, k as

    \theta^{s}_{cijk} = \frac{N_c \theta_{cijk} + \alpha}{N_c + |Z_i| \alpha}.

In Section 8.5.3, we will empirically show that the smoothing technique leads to

significant improvement in classification accuracy.
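A minimal sketch of the two smoothing formulas follows; the function names are ours, and the routines operate on plain Python lists and dicts rather than the thesis' Java data structures.

```python
def smooth_prior(class_counts, alpha=1.0):
    # Laplace-corrected class prior: (N_c + alpha) / (N + r * alpha).
    n, r = sum(class_counts.values()), len(class_counts)
    return {c: (n_c + alpha) / (n + r * alpha) for c, n_c in class_counts.items()}

def smooth_conditional(theta, n_c, cardinality, alpha=1.0):
    # theta: one conditional distribution [theta_cijk for j = 1, ..., |Z_i|]
    # estimated by EM from the N_c instances of class c; cardinality = |Z_i|.
    return [(n_c * t + alpha) / (n_c + cardinality * alpha) for t in theta]
```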


8.5 Empirical Evaluation

In this section, we empirically evaluate LTC on an array of data. We first demonstrate the

necessity of parameter smoothing. We then contrast LTC-E with LTC-P, and compare

the latter with several mainstream classification algorithms including NB, TAN, AODE,

and C4.5 (Quinlan, 1993). We also include a restricted version of LTC, namely LCC, in

the comparison. Finally, we use an example to show that one can reveal meaningful latent

structures with LTC.

8.5.1 Data Sets

To get data for our experiments, we started with all the 47 data sets that are used by

Friedman et al. (1997) and recommended by Weka (Witten & Frank, 2005). Most of the

data sets are from the UCI machine learning repository (Asuncion & Newman, 2007).

We preprocessed the data as follows. The learning algorithms of TAN and AODE

proposed by Friedman et al. (1997) and Webb et al. (2005), as well as Pyramid, do

not handle missing values. Thus, we removed incomplete instances from the data sets.

Among the 47 data sets, there are 10 data sets in which every instance contains missing

values. Those data sets were excluded from our experiments. TAN, AODE, LTC, and

LCC require discrete attributes. Therefore, we discretized the remaining data sets using

the supervised discretization method proposed by Fayyad and Irani (1993).

After preprocessing, we got 37 data sets. The data are from various domains such

as medical diagnosis, handwriting recognition, Biology, Chemistry, etc. The number of

attributes ranges from 4 to 61; the number of classes ranges from 2 to 26; and the sample

size ranges from 80 to 20,000. Table 8.1 summarizes the characteristics of the data.

Name            # Attributes   # Classes   Sample Size
Anneal                38            6            898
Australian            14            2            690
Autos                 25            7            159
Balance-scale          4            3            625
Breast-cancer          9            2            277
Breast-w               9            2            683
Corral                 6            2            128
Credit-a              15            2            653
Credit-g              20            2          1,000
Diabetes               8            2            768
Flare                 10            2          1,066
Glass                  9            7            214
Glass2                 9            2            163
Heart-c               13            5            296
Heart-statlog         13            2            270
Hepatitis             19            2             80
Ionosphere            34            2            351
Iris                   4            3            150
Kr-vs-kp              36            2          3,196
Letter                16           26         20,000
Lymph                 18            4            148
Mofn-3-7-10           10            2          1,324
Mushroom              22            2          5,644
Pima                   8            2            768
Primary-tumor         17           22            132
Satimage              36            6          6,435
Segment               19            7          2,310
Shuttle-small          9            7          5,800
Sonar                 60            2            208
Soybean               35           19            562
Splice                61            3          3,190
Vehicle               18            4            846
Vote                  16            2            232
Vowel                 13           11            990
Waveform-21           21            3          5,000
Waveform-5000         40            3          5,000
Zoo                   17            7            101

Table 8.1: The 37 data sets used in the experiments.

8.5.2 Experimental Settings

We implemented LTC-E, LTC-P, NB, TAN, AODE, and LCC in Java. For C4.5, we used the J48 implementation in WEKA. The detailed settings are as follows.

• LTC: For EAST and Pyramid, we used the mild setting as described in Section 6.3. For parameter smoothing, we set the smoothing factor α = 1.

• NB: We smoothed the parameters in the same way as for LTC. This strategy is also recommended by WEKA.

• TAN: We followed the parameter smoothing strategy suggested by Friedman et al. (1997). In particular, we set the smoothing factor at 5.


• AODE: As suggested by Webb et al. (2005), we set the frequency limit on super

parent at 30. We also smoothed the parameters in the same way as for LTC.

• C4.5: We used the default setting suggested by WEKA.

• LCC: Similar to LTC, we partitioned the data by class and learned a latent class

model for each class. Starting from the latent class model whose latent variable has cardinality 2, we greedily increased the cardinality until the AIC score

of the model ceased to increase. The EM was tuned at the same setting as EAST

and Pyramid for learning LTC. The parameters were smoothed in the same way as

for LTC.

Given a data set, we estimated the classification accuracy of an algorithm using strat-

ified 10-fold cross validation (Kohavi, 1995). All the algorithms were run on the same

training/test splits.

All the classifiers were trained on a server with two dual core AMD Opteron 2.4GHz

CPUs and tested on a machine with an Intel Pentium IV 3.4GHz CPU.

8.5.3 Effect of Parameter Smoothing

We start by investigating the effect of parameter smoothing. In the experiments, we first

ran LTC-E and LTC-P on all the data sets. We then turned off the parameter smoothing

module and re-ran the two algorithms. In the following, we compare the performance of

the smoothed and unsmoothed versions of each algorithm.

Table 8.2 shows the mean and the standard deviation of the classification accuracy of

LTC-E and LTC-P with/without parameter smoothing. We first look at the two columns

for LTC-E. For each data set, the entry with the higher accuracy is highlighted in boldface.

For each variant of LTC-E, Table 8.2 also reports the average accuracy over all the data

sets and the number of wins, i.e., the number of data sets on which it achieved higher

accuracy. Note that the sum of the number of wins of the two variants is larger than 37,

the total number of data sets. This is because the two variants achieved the same level

of accuracy on 3 data sets.

By examining Table 8.2, we can see that parameter smoothing leads to better perfor-

mance in general. The smoothed version achieved higher overall classification accuracy

and won on 35 of the 37 data sets.

To compare the two variants of LTC-E, we also conducted a one-tailed paired t-test with p = 0.05. As shown in the last row of Table 8.2, the smoothed version significantly outperformed the unsmoothed version on 16 data sets. Those data sets are indicated by small circles

in the table. Moreover, the smoothed version never significantly lost to the unsmoothed version.

Data Set          LTC-E Smoothed    LTC-E Non-smoothed    LTC-P Smoothed    LTC-P Non-smoothed
Anneal              98.44±1.41 ◦        96.66±1.57           98.66±1.02          97.77±1.17
Australian          85.51±2.90 ◦        84.20±3.51           85.07±3.21          84.20±4.07
Autos               84.83±10.89 ◦       67.96±7.33           82.29±10.86 ◦       68.54±9.79
Balance-scale       70.88±4.09          70.88±4.09           71.03±3.85          71.03±3.85
Breast-cancer       74.01±7.33 ◦        70.77±5.15           74.37±5.71 ◦        70.04±5.04
Breast-w            97.37±1.50          96.78±1.66           97.37±2.04 ◦        96.93±2.52
Corral             100.00±0.00         100.00±0.00          100.00±0.00         100.00±0.00
Credit-a            86.53±4.09 ◦        85.01±4.47           85.61±3.67 ◦        84.23±3.77
Credit-g            73.30±4.57          72.80±4.92           73.00±4.92 ◦        72.30±4.85
Diabetes            77.48±3.82          77.35±3.83           76.96±3.05          76.83±2.97
Flare               82.65±2.58          82.37±2.39           83.49±1.79          83.40±1.68
Glass               75.69±6.96          74.72±6.93           75.69±8.28          74.24±7.67
Glass2              85.18±9.44          84.60±9.82           84.56±9.48          83.97±9.81
Heart-c             82.79±4.84          81.41±4.33           82.45±5.71          81.08±5.84
Heart-statlog       82.59±9.57          82.22±9.85           83.33±8.05          83.33±8.05
Hepatitis           91.25±11.86         88.75±9.22           90.00±12.91         91.25±8.44
Ionosphere          93.17±2.74          92.31±2.34           93.45±2.69          92.32±2.66
Iris                96.00±4.66          95.33±4.50           94.67±5.26          94.67±4.22
Kr-vs-kp            96.90±1.22          97.00±1.19           95.49±1.24          95.62±1.17
Letter              92.30±0.44 ◦        91.10±0.62           91.80±0.58 ◦        91.04±0.83
Lymph               85.76±6.78          81.24±11.63          87.19±4.88 ◦        77.71±10.93
Mofn-3-7-10         93.73±2.00          93.80±2.15           92.60±1.29          92.60±1.29
Mushroom           100.00±0.00         100.00±0.00          100.00±0.00         100.00±0.00
Pima                76.57±4.32          76.44±4.30           77.09±4.06          77.09±3.72
Primary-tumor       40.99±11.09 ◦       28.85±12.10          45.55±13.48 ◦       30.44±11.42
Satimage            89.65±1.69 ◦        87.74±1.42           89.87±1.07 ◦        88.31±1.16
Segment             95.58±1.55 ◦        93.94±1.40           96.06±1.28 ◦        94.50±1.34
Shuttle-small       99.88±0.12 ◦        99.43±0.26           99.86±0.14 ◦        99.47±0.19
Sonar               85.64±8.34          84.67±7.63           83.74±8.68          82.76±8.98
Soybean             93.42±2.50 ◦        74.04±4.17           93.78±2.82 ◦        76.87±4.09
Splice              94.67±1.33          94.17±2.21           92.29±1.42 ◦        90.85±1.73
Vehicle             73.63±4.53 ◦        71.62±5.51           75.17±5.26 ◦        72.21±6.60
Vote                95.86±3.22          94.25±3.27           94.94±3.57          93.57±3.02
Vowel               79.60±3.15 ◦        78.28±2.79           79.80±4.74          80.00±3.22
Waveform-21         85.90±1.71 ◦        85.72±1.67           85.96±1.61 ◦        85.76±1.70
Waveform-5000       86.16±1.51 ◦        85.92±1.43           86.06±1.47          85.98±1.52
Zoo                 94.09±5.09 ◦        82.18±6.29           94.09±5.09 ◦        82.18±6.29

Mean                86.43±4.16          83.91±4.21           86.31±4.19          83.87±4.21
# Wins                  35                   5                    34                  10
# Sig. Wins             16                   0                    16                   0

Table 8.2: The classification accuracy of LTC-E and LTC-P with/without parameter smoothing. Boldface numbers denote higher accuracy. Small circles indicate significant wins.

The conclusions for LTC-E carry over to LTC-P. With parameter smoothing turned on,

LTC-P achieved higher overall accuracy and won on 34 out of the 37 data sets. The

smoothed version of LTC-P also significantly outperformed the unsmoothed version (16 wins / 0 losses).

In summary, the smoothing technique often leads to improvement in classification

accuracy and never significantly degrades the accuracy. In the following experiments, we

conduct parameter smoothing by default.

8.5.4 LTC-E versus LTC-P

We next compare LTC-E with LTC-P. We consider their performance on two aspects,

namely classification accuracy and computational efficiency.

We first compare the classification accuracy. For clarity, we reproduce the accuracy

of LTC-E and LTC-P in Table 8.3. Again, boldface numbers denote higher accuracy,

while small circles indicate significant wins according to t-test with p = 0.05. From Table

8.3, we can see that LTC-E performs slightly better than LTC-P. In particular, LTC-E

achieved slightly higher overall classification accuracy. It beat LTC-P on 21 data sets but lost on 20 data sets. However, the difference between the two algorithms on most of the

data sets was statistically insignificant. See the last line in Table 8.3 for the number of

significant wins.

In terms of computational efficiency, LTC-P compares more favorably to LTC-E. Fig-

ure 8.2 plots the training time of LTC-E and LTC-P on the 37 data sets. Note that the

data sets are sorted in an ascending order with respect to the time that LTC-E spent on

them. From this figure, we can see that LTC-P ran consistently faster than LTC-E. On

10 of the 37 data sets, LTC-P was faster than LTC-E by at least an order of magnitude.

The largest difference occurred on Kr-vs-kp where LTC-P was more than 30 times faster

than LTC-E.

Taking both sides into consideration, we argue that LTC-P achieves a better trade-

off between classification performance and computational efficiency. Therefore, we only

consider this algorithm in the following text. We will refer to it as LTC for short.

Data Set          LTC-E            LTC-P
Anneal             98.44±1.41       98.66±1.02
Australian         85.51±2.90       85.07±3.21
Autos              84.83±10.89      82.29±10.86
Balance-scale      70.88±4.09       71.03±3.85
Breast-cancer      74.01±7.33       74.37±5.71
Breast-w           97.37±1.50       97.37±2.04
Corral            100.00±0.00      100.00±0.00
Credit-a           86.53±4.09       85.61±3.67
Credit-g           73.30±4.57       73.00±4.92
Diabetes           77.48±3.82       76.96±3.05
Flare              82.65±2.58       83.49±1.79
Glass              75.69±6.96       75.69±8.28
Glass2             85.18±9.44       84.56±9.48
Heart-c            82.79±4.84       82.45±5.71
Heart-statlog      82.59±9.57       83.33±8.05
Hepatitis          91.25±11.86      90.00±12.91
Ionosphere         93.17±2.74       93.45±2.69
Iris               96.00±4.66       94.67±5.26
Kr-vs-kp           96.90±1.22 ◦     95.49±1.24
Letter             92.30±0.44 ◦     91.80±0.58
Lymph              85.76±6.78       87.19±4.88
Mofn-3-7-10        93.73±2.00 ◦     92.60±1.29
Mushroom          100.00±0.00      100.00±0.00
Pima               76.57±4.32       77.09±4.06
Primary-tumor      40.99±11.09      45.55±13.48 ◦
Satimage           89.65±1.69       89.87±1.07
Segment            95.58±1.55       96.06±1.28 ◦
Shuttle-small      99.88±0.12       99.86±0.14
Sonar              85.64±8.34       83.74±8.68
Soybean            93.42±2.50       93.78±2.82
Splice             94.67±1.33 ◦     92.29±1.42
Vehicle            73.63±4.53       75.17±5.26
Vote               95.86±3.22       94.94±3.57
Vowel              79.60±3.15       79.80±4.74
Waveform-21        85.90±1.71       85.96±1.61
Waveform-5000      86.16±1.51       86.06±1.47
Zoo                94.09±5.09       94.09±5.09

Mean               86.43±4.16       86.31±4.19
# Wins                 21               20
# Sig. Wins             4                2

Table 8.3: Comparison of classification accuracy between LTC-E and LTC-P.

(Figure 8.2: The training time of LTC-E and LTC-P on the 37 data sets, sorted in ascending order of LTC-E training time; the y-axis shows time in seconds on a log scale.)

8.5.5 Comparison with the Other Algorithms

We now compare LTC with NB, TAN, AODE, C4.5, and LCC. The classification accuracy of these algorithms is shown in Table 8.4. From this table, we can see that LTC achieved the best overall accuracy, followed by AODE, TAN, LCC, C4.5, and NB, in that order. In

terms of the number of wins, LTC was also the best (13 wins), with AODE (10 wins) and

TAN (7 wins) being the two runners-up. If we consider only generative approaches and

ignore C4.5, the difference is even larger. LTC achieves 3 additional wins on Kr-vs-kp,

Mofn-3-7-10, and Vote, i.e., 16 wins in total. The other algorithms remain unaffected.

We also conducted one-tailed paired t-tests to compare LTC with each of the other algorithms. Again, we set the p-value at 0.05. The number of significant wins, ties, and losses is given in Table 8.5. It shows that LTC significantly outperformed NB (17 wins / 3 losses) and C4.5 (12/3). LTC was also better than TAN (8/4), AODE (7/5), and LCC

(7/2).

Besides accuracy, classification efficiency is also a concern. As mentioned in Section

8.3, theoretically, making prediction with LTC takes time O(|X| · r), where |X| is the

number of attributes and r is the number of classes. Figure 8.3 reports the average time

that it takes for LTC to classify an instance in each data set used in our experiments. The

data sets are sorted in ascending order with respect to the running time. The running

time ranges from 1 ms to 38 ms. We argue that LTC is efficient enough for practical use.

Figure 8.3 also shows the running time of TAN, AODE, and LCC. LTC was consistently

slower than TAN. Depending on the number of attributes and the cardinalities of latent

variables, LTC was slower than AODE on some data sets and faster on the others. In

most cases, the difference was small. We also see that LTC was almost as efficient as

LCC.

Domain           LTC              NB               TAN              AODE             C4.5             LCC
Anneal            98.66±1.02       96.10±2.54       98.22±1.68       98.44±1.41       98.77±0.98       98.78±1.43
Australian        85.07±3.21       85.51±2.65       85.22±5.19       86.09±3.50       85.65±4.07       84.93±3.82
Autos             82.29±10.86      72.88±10.12      83.67±7.29       80.42±8.40       78.58±8.54       77.92±13.76
Balance-scale     71.03±3.85       70.71±4.08       70.39±4.17       69.59±4.01       69.59±4.27       69.28±4.10
Breast-cancer     74.37±5.71       75.41±6.44       67.53±5.27       76.85±8.48       74.39±7.34       71.51±9.51
Breast-w          97.37±2.04       97.51±2.19       96.78±2.16       97.36±2.04       95.76±2.61       97.52±1.53
Corral           100.00±0.00       85.96±7.05      100.00±0.00       89.10±8.98       94.62±8.92       97.63±3.82
Credit-a          85.61±3.67       87.29±3.53       86.38±3.27       87.59±3.51       86.99±4.48       86.68±3.30
Credit-g          73.00±4.92       75.80±4.32       73.50±3.63       77.10±4.38       72.10±4.46       72.90±3.90
Diabetes          76.96±3.05       77.87±3.50       78.77±3.32       78.52±4.11       78.26±3.97       76.44±4.29
Flare             83.49±1.79       80.30±3.42       82.84±2.26       82.46±2.31       82.09±1.80       83.31±2.53
Glass             75.69±8.28       74.37±8.97       77.60±7.85       76.19±7.41       73.94±9.76       75.71±8.50
Glass2            84.56±9.48       83.97±8.99       87.06±8.03       83.97±9.91       84.01±7.32       85.18±9.44
Heart-c           82.45±5.71       84.11±7.85       82.79±5.54       83.77±6.80       74.66±6.49       80.38±3.27
Heart-statlog     83.33±8.05       83.33±6.36       81.85±7.08       81.85±6.86       81.85±5.91       79.63±8.24
Hepatitis         90.00±12.91      85.00±15.37      88.75±13.76      85.00±12.91      90.00±14.19      86.25±12.43
Ionosphere        93.45±2.69       90.60±3.83       92.60±4.27       92.31±2.34       89.17±5.35       92.31±2.70
Iris              94.67±5.26       94.00±5.84       94.00±5.84       93.33±5.44       94.00±4.92       94.00±5.84
Kr-vs-kp          95.49±1.24 ◦     87.89±1.81       92.24±2.24       91.18±0.83       99.44±0.48       92.49±1.72
Lymph             87.19±4.88       83.67±6.91       82.38±7.41       85.62±8.66       78.33±10.44      82.95±9.07
Mofn-3-7-10       92.60±1.29 ◦     85.35±1.53       91.31±2.00       88.97±2.60      100.00±0.00       92.38±1.85
Mushroom         100.00±0.00       97.41±0.72       99.98±0.06      100.00±0.00      100.00±0.00      100.00±0.00
Pima              77.09±4.06       78.13±4.24       78.78±4.50       78.65±3.81       78.38±2.90       76.83±3.53
Primary-tumor     45.55±13.48      47.14±11.59      42.53±11.57      47.14±11.00      43.24±10.55      45.60±11.04
Segment           96.06±1.28       91.52±1.60       95.89±1.37       95.63±1.23       95.32±1.63       95.41±1.73
Shuttle-small     99.86±0.14       99.34±0.27       99.79±0.11       99.83±0.11       99.59±0.19       99.86±0.14
Sonar             83.74±8.68       85.62±5.41       85.64±8.70       87.07±6.31       79.81±8.14       84.14±11.24
Soybean           93.78±2.82       91.64±4.44       93.06±3.31       91.99±4.22       91.82±3.75       91.28±3.70
Splice            92.29±1.42       95.36±1.00       95.33±1.39       96.21±1.07       94.36±1.58       96.11±1.60
Vehicle           75.17±5.26       62.65±4.15       73.04±4.52       73.06±4.65       71.99±3.45       73.76±4.64
Vote              94.94±3.57 ◦     89.91±4.45       93.36±4.84       94.03±4.07       95.18±4.48       94.71±2.67
Vowel             79.80±4.74       67.07±6.14       88.59±2.61       81.92±4.11       80.91±2.31       79.49±3.44
Waveform-21       85.96±1.61       81.76±1.49       82.92±1.45       86.60±1.26       75.44±2.10       86.06±1.71
Waveform-5000     86.06±1.47       80.74±1.38       82.04±1.25       86.36±1.65       76.48±1.47       86.10±1.79
Zoo               94.09±5.09       93.18±7.93       94.09±6.94       94.18±6.60       92.18±8.94       94.18±6.60
Letter            91.80±0.58       74.04±1.04       86.43±0.67       88.91±0.50       78.63±0.62       92.47±0.65
Satimage          89.87±1.07       82.42±1.51       88.16±0.99       89.26±0.59       84.37±1.34       89.20±1.22

Mean              86.31±4.19       83.12±4.72       85.77±4.23       85.85±4.49       84.32±4.59       85.50±4.62
# Best            13 (16)               3                7               10                5                4

Table 8.4: The classification accuracy of the tested algorithms. The 3 entries indicated by small circles become the best after taking out C4.5.

            NB    TAN    AODE    C4.5    LCC
# Wins      17     8      7      12      7
# Ties      17    25     25      22     28
# Losses     3     4      5       3      2

Table 8.5: The number of times that LTC significantly won, tied with, and lost to the other algorithms.

(Figure 8.3: The classification time of different classifiers. The x-axis indexes the data sets, sorted in ascending order of LTC classification time; the y-axis shows time in seconds on a log scale; curves are shown for LTC, TAN, AODE, and LCC.)

8.5.6 Appreciating Learned Models

One advantage of LTC is that it can capture concepts underlying domains and automati-

cally discover interesting subgroups within each class. In this section, the readers will see

one such example.

The example involves the Corral data set. It contains two classes, true and

false, and six boolean attributes A0, A1, B0, B1, Irrelevant, and Correlated. The

target concept is (A0∧A1)∨ (B0∧B1). Irrelevant is an irrelevant random attribute, and

Correlated is loosely correlated to the class variable.
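Stated directly in code, the target concept and its satisfying and violating configurations are the following; this tiny enumeration is our own illustration and is not part of the Corral data-generation procedure.

```python
from itertools import product

def corral_label(a0, a1, b0, b1):
    # Target concept of the Corral data: (A0 and A1) or (B0 and B1).
    return (a0 and a1) or (b0 and b1)

# The four relevant attributes admit 16 configurations; 'Irrelevant' and
# 'Correlated' (omitted here) do not enter the concept at all.
positive = [cfg for cfg in product([False, True], repeat=4) if corral_label(*cfg)]
negative = [cfg for cfg in product([False, True], repeat=4) if not corral_label(*cfg)]
```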

We learned an LTC from the Corral data and obtained two LTMs, one for each class.

We denote the LTMs by Mt and Mf, respectively. Their structures are shown in Figure

8.4. The numbers in parentheses denote the cardinalities of latent variables. The width

of an edge denotes the mutual information between the incident nodes.

(Figure 8.4: The structures of the LTMs for the Corral data: (a) Mt, (b) Mf.)

The model Mt contains one binary latent variable Yt. Yt partitions the samples of

the true class into two groups, each corresponding to one of its two states. We call those

groups latent classes. Similarly, the model Mf contains two binary latent variables Yf1

and Yf2. Each latent variable partitions the samples of the false class into two latent

classes in a peculiar way. We look into each latent class and obtain some interesting

findings: (1) the latent classes Yt = 1 and Yt = 2 correspond to the two components of

the concept, A0 ∧A1 and B0∧B1, respectively; (2) the latent classes Yf1 = 1 and Yf1 = 2

correspond to ¬A0 and ¬A1, while the latent classes Yf2 = 1 and Yf2 = 2 correspond to

¬B0 and ¬B1; (3) the latent variables Yf1 and Yf2 jointly enumerate the four cases when

the target concept (A0 ∧ A1) ∨ (B0 ∧ B1) does not satisfy. The details are presented in

the following.

We first notice that, in both models, the four attributes A0, A1, B0, and B1 are closely correlated with their parents. In contrast, Irrelevant and Correlated are almost independent of their parents. This is interesting, as both models correctly picked out the four attributes relevant to the target concept. Henceforth, we focus only on those four attributes.

To understand the characteristics of each latent class, we examine the conditional distribution of each attribute, i.e., P(X|Y = 1) and P(X|Y = 2) for all X ∈ {A0, A1, B0, B1} and Y ∈ {Yt, Yf1, Yf2}. Those distributions are plotted in Figure 8.5. The height of a bar indicates the corresponding probability value.
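For readers who want to reproduce this kind of inspection, the following is a minimal sketch of how per-attribute profiles P(X = true | Y = k) could be tabulated from hard latent-class assignments. It is only an illustration: in the thesis the bar heights come from the learned LTM parameters, and the array layout and function name below are hypothetical.

    import numpy as np

    def attribute_profiles(data, assignments):
        """Empirical P(X = true | Y = k) for each boolean attribute.

        data: (n_samples, n_attributes) array of 0/1 values, e.g. the columns
              A0, A1, B0, B1 of the Corral data (hypothetical layout).
        assignments: length-n_samples array of latent-class labels, e.g. the
              most probable state of Yt for each sample.
        """
        profiles = {}
        for k in np.unique(assignments):
            members = data[assignments == k]
            # Column means of 0/1 data give the fraction of samples with X = true.
            profiles[k] = members.mean(axis=0)
        return profiles

    # Tiny illustrative call with made-up data:
    data = np.array([[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1]])
    print(attribute_profiles(data, np.array([1, 1, 2])))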

We start with the latent classes associated with Yt. In latent class Yt = 1, A0 and A1 always take value true, while B0 and B1 vary at random. Clearly, this group of instances belongs to class true because its members satisfy A0 ∧ A1. In contrast, in latent class Yt = 2, B0 and B1 always take value true, while A0 and A1 vary at random. Clearly, this group corresponds to the concept B0 ∧ B1.

We next examine the two latent variables in Mf. It is clear that A0 never occurs in latent class Yf1 = 1, while A1 never occurs in latent class Yf1 = 2. Therefore, the two latent classes correspond to ¬A0 and ¬A1, respectively. Yf1 thus enumerates the two cases in which A0 ∧ A1 is not satisfied. Similarly, we find that B0 never occurs in latent class Yf2 = 1, while B1 never occurs in latent class Yf2 = 2. Therefore, the two latent classes correspond to ¬B0 and ¬B1, respectively. Yf2 thus enumerates the two cases in which B0 ∧ B1 is not satisfied. Consequently, Yf1 and Yf2 jointly represent the four cases in which the target concept (A0 ∧ A1) ∨ (B0 ∧ B1) is not satisfied.

[Figure: bar charts of the conditional probability of each attribute A0, A1, B0, B1 (vertical axis: probability, 0 to 1) in each latent class, with panels (a) Yt = 1: A0 ∧ A1; (b) Yt = 2: B0 ∧ B1; (c) Yf1 = 1: ¬A0; (d) Yf1 = 2: ¬A1; (e) Yf2 = 1: ¬B0; (f) Yf2 = 2: ¬B1.]

Figure 8.5: The attribute distributions in each latent class and the corresponding concept.

8.6 Related Work

There is a large body of literature on generative classifiers that attempt to improve classification accuracy by modeling attribute dependencies. They mainly divide into two categories: those that directly model the relationships among attributes, and those that attribute such relationships to latent variables. Besides TAN and AODE, examples from the first category include general Bayesian network classifiers and Bayesian multinets (Friedman et al., 1997). The latter learn a Bayesian network for each class and use them jointly to make predictions. Our method is based on a similar idea, but we learn an LTM to represent the joint distribution of each class.

Our method falls into the second category. In this category, various latent variable models have been tested for continuous data. To give two examples, Monti and Cooper (1995) combine a finite mixture model with the naive Bayes classifier; the resulting model is a continuous counterpart of LCC. Langseth and Nielsen (2005) propose the latent classification model, which uses a mixture of factor analyzers to represent attribute dependencies.

In contrast, we are aware of much less work on categorical data. The work most closely related to ours is the hierarchical naive Bayes model (HNB) proposed by Zhang et al. (2004). HNB also exploits LTMs to model the relationships among attributes. It differs from LTC in two aspects. First, HNB assumes that the LTM structures are identical for all classes, while LTC describes different classes using different LTMs. Therefore, HNB cannot reveal diversity across classes for domains like Corral. Second, HNB models the attribute dependencies using a forest of LTMs, each over a disjoint subset of attributes. In contrast, LTC connects all attributes using one single tree. Recently, Hinton et al. (2006) propose the notion of deep belief nets (DBNs). A DBN models attribute dependencies using multiple layers of densely connected latent variables. It is designed for image recognition problems and handles only binary attributes.

8.7 Summary

We propose a novel generative classifier, namely, the latent tree classifier. It exploits LTMs to estimate the class-conditional distribution of each class, and applies Bayes' rule to make predictions. We considered both EAST and Pyramid for learning LTCs. Empirical results suggest that Pyramid makes a better tradeoff between classification accuracy and computational efficiency than EAST.
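As a minimal sketch of the decision rule just described, the snippet below classifies an instance by Bayes' rule, assuming the per-class priors and LTM likelihoods are available as plain Python callables; the class_densities interface is a hypothetical stand-in, not the thesis implementation.

    import numpy as np

    def classify(x, class_priors, class_densities):
        """Pick the class c maximizing P(C = c) * P(X = x | C = c).

        class_priors: dict mapping class label c to its prior P(C = c).
        class_densities: dict mapping c to a callable returning P(X = x | C = c),
            e.g. the likelihood under the LTM learned for class c.
        """
        # Work in log space to avoid underflow when x has many attributes.
        log_posterior = {c: np.log(class_priors[c]) + np.log(class_densities[c](x))
                         for c in class_priors}
        return max(log_posterior, key=log_posterior.get)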

LTMs can capture complex relationships among attributes within each class. Therefore, LTCs usually approximate generative distributions well and often yield good classification accuracy. In particular, we empirically show that LTC compares favorably to mainstream classification algorithms including NB, TAN, AODE, and C4.5. In terms of classification efficiency, LTC is good enough for real-world applications. Though it is slower than NB and TAN, it is as efficient as AODE. We further demonstrate an advantage of LTCs, i.e., they can reveal underlying concepts and discover interesting subgroups within each class. As far as we know, this advantage is unique to our method.


CHAPTER 9

CONCLUSIONS AND FUTURE WORK

We conclude this thesis by providing a summary of contributions and discussing several possible future directions.

9.1 Summary of Contributions

We study the use of latent tree models for density estimation of discrete random variables. LTMs can represent complex relationships among manifest variables and yet are computationally simple to work with. They were recognized as potentially good models for density estimation two decades ago. However, this potential had not been realized before this thesis due to the lack of efficient algorithms for LTMs. Only special LTMs, such as latent class models, were previously used for density estimation.

This thesis is the first to investigate the practical use of unrestricted LTMs for density estimation. Our contributions lie in two aspects, one in developing efficient learning algorithms for LTMs, the other in exploring novel applications of the density estimation techniques that we develop.

On the algorithm side, we first test EAST, the previous state-of-the-art algorithm for learning LTMs, on the task of density estimation. The results show that EAST can yield good estimates and reveal interesting latent structures. However, it is computationally expensive and can only be applied to small-scale problems.

We then develop two algorithms that are more efficient than EAST. The first algorithm is HCL. It is a special-purpose algorithm that requires a predetermined bound on the complexity of the resulting LTM. It is faster than EAST by orders of magnitude but yields significantly poorer estimates than EAST does. The second algorithm that we develop is Pyramid. In contrast to HCL, it is a general-purpose algorithm. It is slower than HCL but more efficient than EAST. The quality of the estimates produced by Pyramid is only slightly lower than that of the estimates produced by EAST. As such, Pyramid provides a better tradeoff than EAST between estimation quality and computational efficiency.

On the application side, we study the use of LTMs for approximate inference in BNs and for statistical classification. In the first application, we propose a bounded-complexity approximate method for BN inference. The idea is to build an LTM that approximates the distribution represented by a BN using the HCL algorithm, and to perform inference in the LTM instead of the original BN. With our scheme, one can trade off between approximation accuracy and inferential complexity. Our scheme achieves good accuracy at low cost in all the networks that we examined. In particular, it consistently outperforms LBP.

In the second application, we use LTMs to estimate class-conditional distributions, and apply Bayes' rule to make predictions. This leads to a novel method for classification called the latent tree classifier (LTC). We empirically show that LTC achieves comparable or higher classification accuracy than mainstream algorithms including NB, TAN, AODE, and C4.5. In terms of the speed of online classification, LTC is slower than NB and TAN, and is as efficient as AODE. It is fast enough for real-world applications. We also demonstrate that LTC can reveal underlying concepts and discover interesting subgroups within each class. As far as we know, this second feature is unique to our method.

9.2 Future Work

Possible future work falls into three directions. We discuss them in detail in the following subsections.

9.2.1 Other Applications

The first direction is to explore other applications of the developed density estimation techniques. One ongoing line of research is to investigate their usefulness in the ranking problem. Given a set of labeled training data, the problem is to build a ranker that sorts the test data so that the positive samples appear before the negative samples. It has extensive applications in marketing research, social science, information retrieval, etc.

One approach to the ranking problem is to construct an estimate of the generative distribution P(X, C), and then sort the test data in descending order of the posterior probability P(C = +|X = x) that a sample x belongs to the positive class. Intuitively, the more accurate the estimate, the better the ranking. A natural idea is thus to apply LTC to this problem. Since LTC can approximate the true distribution underlying the data well, we expect it to be a good ranker as well. We are testing this idea on direct marketing data such as the CoIL Challenge 2000 (van der Putten et al., 2000), where the task is to rank new customers so that those who would buy a particular product appear before those who would not. We are also working with marketing researchers to see whether LTC can discover interesting subgroups of customers.
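A minimal sketch of this ranking recipe, under the same hypothetical per-class density interface as before: samples are simply sorted by the posterior probability of the positive class.

    def rank_by_posterior(samples, prior_pos, prior_neg, density_pos, density_neg):
        """Sort samples by P(C = + | X = x) in descending order.

        density_pos / density_neg return P(X = x | C = +) and P(X = x | C = -),
        e.g. from the two class-conditional LTMs (hypothetical interface).
        """
        def posterior_pos(x):
            joint_pos = prior_pos * density_pos(x)
            joint_neg = prior_neg * density_neg(x)
            return joint_pos / (joint_pos + joint_neg)

        return sorted(samples, key=posterior_pos, reverse=True)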


9.2.2 Handling Continuous Data

LTMs assume categorical random variables and deal only with discrete data. It would be interesting to adapt LTMs to handle continuous data. One possible solution is as follows. We use leaf nodes to represent continuous manifest variables, and still use internal nodes to represent discrete latent variables. We still model the direct dependencies between latent variables using conditional probability tables. In contrast, we now model the direct dependencies between manifest variables and latent variables using conditional Gaussian distributions. Formally, let Y be a latent variable and X be the set of manifest variables connected to Y. The conditional distribution of X given Y is defined as a Gaussian distribution

P(X | Y = j) = N(µj, Σj), j = 1, 2, ..., |Y|,

with mean µj and covariance matrix Σj. One can make certain assumptions about the structure of the covariance matrix Σj, e.g., identity or diagonal matrices. One can also consider inferring such structures from data automatically.
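A minimal sketch of such a conditional Gaussian leaf, using SciPy; the parameter values are made up for illustration, and one state is given a diagonal covariance as suggested above.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Hypothetical parameters for a latent variable Y with |Y| = 2 states and
    # three continuous manifest variables attached to it.
    means = [np.zeros(3), np.array([1.0, -1.0, 0.5])]   # mu_j for j = 1, 2
    covs = [np.eye(3), np.diag([0.5, 2.0, 1.0])]        # Sigma_j (diagonal for j = 2)

    def leaf_density(x, j):
        """P(X = x | Y = j) under the conditional Gaussian assumption."""
        return multivariate_normal.pdf(x, mean=means[j - 1], cov=covs[j - 1])

    print(leaf_density(np.array([0.2, -0.3, 0.1]), 1))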

We refer to models defined in this way as Gaussian latent tree models (GLTMs). They can be treated as a generalization of finite Gaussian mixture models (GMMs) (McLachlan & Peel, 2000). The latter contain only a single latent variable and are commonly used in the literature to model density functions over continuous variables. GLTM is to GMM what LTM is to LCM. Given that LTM consistently outperforms LCM in density estimation tasks with discrete data, we expect GLTM to improve over GMM in the continuous case as well.

As in the case of LTMs, to learn a GLTM from data we need to determine the latent variables, the tree structure that connects the latent variables and manifest variables, and the parameters, including the CPTs for the latent variables and the means µj and covariance matrices Σj for the manifest variables. We are currently considering adapting the EAST and Pyramid algorithms for this purpose.

9.2.3 Generalization to Partially Observed Trees

A restriction imposed on LTMs is that manifest variables must reside at leaf nodes. As a consequence, interactions between manifest variables are not modeled directly but via the latent variables. On one hand, this property enables LTMs to capture high-order dependencies among manifest variables using a tree structure. On the other hand, however, it makes LTMs inefficient at modeling low-order dependencies.

Consider the case where the generative distribution can be well approximated using a tree model without latent variables. In this case, the random variables exhibit only second-order dependencies. Now imagine constructing an LTM to estimate the generative distribution. Since the dependencies between manifest variables are modeled indirectly via latent variables, we usually need to set the cardinalities of the latent variables to large values in order to achieve good estimation. The resulting LTM can be much more complex than the plain tree model. To give a concrete example, consider the results for the Mildew network in Section 7.5. In this example, the Chow-Liu tree yields higher approximation accuracy than the LTM, while the former is much simpler than the latter.
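For reference, the Chow-Liu construction referred to here amounts to a maximum spanning tree over pairwise empirical mutual information (Chow & Liu, 1968). Below is a rough sketch using networkx; estimating the conditional probability tables along the resulting tree is omitted.

    import itertools
    import numpy as np
    import networkx as nx

    def mutual_information(xi, xj):
        """Empirical mutual information (in nats) between two discrete columns."""
        n = len(xi)
        joint = {}
        for a, b in zip(xi, xj):
            joint[(a, b)] = joint.get((a, b), 0) + 1
        p_i = {a: np.mean(xi == a) for a in set(xi)}
        p_j = {b: np.mean(xj == b) for b in set(xj)}
        return sum((c / n) * np.log((c / n) / (p_i[a] * p_j[b]))
                   for (a, b), c in joint.items())

    def chow_liu_tree(data):
        """Maximum spanning tree over pairwise mutual information.

        data: (n_samples, n_variables) array of discrete values.
        """
        n_vars = data.shape[1]
        g = nx.Graph()
        for i, j in itertools.combinations(range(n_vars), 2):
            g.add_edge(i, j, weight=mutual_information(data[:, i], data[:, j]))
        return nx.maximum_spanning_tree(g)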

One way to relax the restriction imposed on LTMs is to allow manifest variables to enter the interior of models. This results in a class of models that we refer to as partially observed trees, or POTs for short. POTs are more flexible than LTMs. As a matter of fact, POTs subsume both LTMs and Chow-Liu trees as special cases. Therefore, POTs have stronger expressive power than either and could lead to better solutions to the density estimation problem.

The model space of POTs is larger than that of LTMs. Therefore, the problem of learning POTs should be at least as hard as that of learning LTMs. One possible solution is to adapt the EAST algorithm. We can enhance the EAST search by introducing additional search operators that (1) push manifest variables at leaf nodes into the interior of the current model, and (2) pull manifest variables at interior nodes out. Since more search operators are involved, we expect the resulting algorithm to be slower than EAST.

A potentially more efficient way to learn POTs is to start with the optimal Chow-Liu tree and then introduce latent variables into the interior of the model. To realize this solution, one needs to develop heuristics to determine whether it is beneficial to introduce new latent variables and how the latent variables should be added to the current model.


Bibliography

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19 (6), 716–723.
Asuncion, A., & Newman, D. J. (2007). UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bartholomew, D., & Knott, M. (1999). Latent Variable Models and Factor Analysis (2nd edition). Arnold, London.
Bishop, C. M., Lawrence, N., Jaakkola, T., & Jordan, M. I. (1997). Approximating posterior distributions in belief networks using mixtures. In Advances in Neural Information Processing Systems 10, pp. 416–422.
Chen, T. (2008). Search-Based Learning of Latent Tree Models. Ph.D. thesis, The Hong Kong University of Science and Technology.
Chen, T., Zhang, N. L., & Wang, Y. (2008). Efficient model evaluation in the search-based approach to latent structure discovery. In Proceedings of the 4th European Workshop on Probabilistic Graphical Models, pp. 57–64.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507–554.
Chickering, D. M., & Heckerman, D. (1997a). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29 (2-3), 181–212.
Chickering, D. M., & Heckerman, D. (1997b). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29, 181–212.
Choi, A., & Darwiche, A. (2006). A variational approach for approximating Bayesian networks by edge deletion. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pp. 80–89.
Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14 (3), 462–467.
Cooper, G. F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393–405.
Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience, New York.


Dagum, P., & Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60, 141–153.
Darwiche, A. (2001). Recursive conditioning. Artificial Intelligence, 125 (1–2), 5–41.
Dechter, R. (1996). Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp. 211–219.
Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.
Duda, R. O., & Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022–1027.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97 (458), 611–631.
Frey, B. J., & MacKay, D. J. C. (1997). A revolution: Belief propagation in graphs with cycles. In Advances in Neural Information Processing Systems 10.
Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29 (2-3), 131–163.
Green, P. (1999). Penalized likelihood. Encyclopaedia of Statistical Science, Update Volume 3, 578–586.
Guindon, S., & Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52 (5), 696–704.
Heckerman, D. (1995). A tutorial on learning with Bayesian networks. Tech. rep. MSR-TR-95-06, Microsoft Research.
Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Uncertainty in Artificial Intelligence 2, pp. 317–324.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18 (7), 1527–1554.
Jebara, T. (2004). Machine Learning: Discriminative and Generative. Kluwer Academic Publishers, Boston.
Jensen, F., & Andersen, S. K. (1990). Approximations in Bayesian belief universes for knowledge based systems. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, pp. 162–169.


Jensen, F. V., Lauritzen, S., & Olesen, K. (1990). Bayesian updating in recursive graphical models by local computation. Computational Statistics Quarterly, 4, 269–282.
Jordan, M. I. (Ed.). (1998). Learning in Graphical Models. Kluwer Academic Publishers, Boston.
Kindermann, R., & Snell, J. L. (1980). Markov Random Fields and Their Applications. American Mathematical Society, Providence, RI.
Kjærulff, U. (1994). Reduction of computational complexity in Bayesian networks through removal of weak dependences. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pp. 374–382.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1137–1145.
Langseth, H., & Nielsen, T. D. (2005). Latent classification models. Machine Learning, 59 (3), 237–265.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society (Series B), 50 (2), 157–224.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent Structure Analysis. Houghton Mifflin, Boston.
Loftsgaarden, D. O., & Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. Annals of Mathematical Statistics, 36, 1049–1051.
Lowd, D., & Domingos, P. (2005). Naive Bayes models for probability estimation. In Proceedings of the 22nd International Conference on Machine Learning, pp. 529–536.
McLachlan, G., & Peel, D. (2000). Finite Mixture Models. John Wiley and Sons.
Monti, S., & Cooper, G. F. (1995). A Bayesian network classifier that combines a finite mixture model and a naive Bayes model. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 447–456.
Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pp. 467–475.
Nachman, I., Elidan, G., & Friedman, N. (2004). "Ideal parent" structural learning for continuous variable networks. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp. 400–409.


Ng, A., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, Vol. 14.
Niblett, T. (1987). Constructing decision trees in noisy domains. In Proceedings of the Second European Working Session on Learning, pp. 67–78.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33 (3), 1065–1076.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, CA.
Pradhan, M., Provan, G., Middleton, B., & Henrion, M. (1994). Knowledge engineering for large belief networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pp. 484–490.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Robertson, N., & Seymour, P. D. (1984). Graph minors III: Planar tree-width. Journal of Combinatorial Theory (Series B), 36, 49–64.
Rubinstein, Y. D., & Hastie, T. (1997). Discriminative vs informative learning. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 49–53.
Sarkar, S. (1995). Modeling uncertainty using enhanced tree structures in expert systems. IEEE Transactions on Systems, Man, and Cybernetics, 25 (4), 592–604.
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In Advances in Neural Information Processing Systems 8, pp. 486–492.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6 (2), 461–464.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221–264.
Shenoy, P., & Shafer, G. (1990). Axioms for probability and belief-function propagation. In Uncertainty in AI, Vol. 4, pp. 169–198.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York.


van der Putten, P., & van Someren, M. (Eds.). (2000). CoIL Challenge 2000: The Insurance Company Case. Sentient Machine Research, Amsterdam.
van der Putten, P., de Ruiter, M., & van Someren, M. (2000). CoIL challenge 2000 tasks and results: Predicting and explaining caravan policy ownership.
van Engelen, R. A. (1997). Approximating Bayesian belief networks by arc removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19 (8), 916–920.
Wang, Y., & Zhang, N. L. (2006). Severity of local maxima for the EM algorithm: Experiences with hierarchical latent class models. In Proceedings of the 3rd European Workshop on Probabilistic Graphical Models, pp. 301–308.
Wang, Y., Zhang, N. L., & Chen, T. (2008a). Latent tree models and approximate inference in Bayesian networks. In Proceedings of the 23rd National Conference on Artificial Intelligence, pp. 1112–1118.
Wang, Y., Zhang, N. L., & Chen, T. (2008b). Latent tree models and approximate inference in Bayesian networks. Journal of Artificial Intelligence Research, 32, 879–900.
Webb, G. I., Boughton, J. R., & Wang, Z. (2005). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, 58, 5–24.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd edition). Morgan Kaufmann, San Francisco.
Zhang, N. L. (2004). Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5 (6), 697–723.
Zhang, N. L., & Poole, D. (1994). A simple approach to Bayesian network computations. In Proceedings of the 10th Canadian Conference on Artificial Intelligence, pp. 171–178.
Zhang, N. L., Wang, Y., & Chen, T. (2008). Discovery of latent structures: Experience with the CoIL challenge 2000 data set. Journal of Systems Science and Complexity, 21 (2), 172–183.
Zhang, N. L., & Kocka, T. (2004). Efficient learning of hierarchical latent class models. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, pp. 585–593.
Zhang, N. L., Nielsen, T. D., & Jensen, F. V. (2004). Latent variable discovery in classification models. Artificial Intelligence in Medicine, 30 (3), 283–299.
Zhang, N. L., Yuan, S., Chen, T., & Wang, Y. (2007). Hierarchical latent class models and statistical foundation for traditional Chinese medicine. In Proceedings of the 11th Conference on Artificial Intelligence in Medicine, pp. 139–143.


Zhang, N. L., Yuan, S., Chen, T., & Wang, Y. (2008a). Latent tree models and diagnosis in traditional Chinese medicine. Artificial Intelligence in Medicine, 42 (3), 229–245.
Zhang, N. L., Yuan, S., Chen, T., & Wang, Y. (2008b). Statistical validation of traditional Chinese medicine theories. Journal of Alternative and Complementary Medicine, 14 (5), 583–587.
