Neural and Decision Trees



Intelligent Heart Disease Prediction System Using Naïve Bayes

Existing Systems

Clinical decisions are often made based on doctors' intuition and experience rather than on the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients.

There are many ways that a medical misdiagnosis can present itself. Whether a doctor or hospital staff is at fault, a misdiagnosis of a serious illness can have very extreme and harmful effects.

The National Patient Safety Foundation cites that 42% of medical patients feel they have experienced a medical error or missed diagnosis. Patient safety is sometimes negligently given the back seat to other concerns, such as the cost of medical tests, drugs, and operations.

Medical misdiagnoses are a serious risk to our healthcare profession. If they continue, then people will fear going to the hospital for treatment. We can put an end to medical misdiagnosis by informing the public and filing claims and suits against the medical practitioners at fault.


Proposed Systems

This practice leads to unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients.

Thus we proposed that integration of clinical decision support with computer-based patient records could reduce medical errors, enhance patient safety, decrease unwanted practice variation, and improve patient outcome.

This suggestion is promising as data modeling and analysis tools, e.g., data mining, have the potential to generate a knowledge-rich environment which can help to significantly improve the quality of clinical decisions.

The main objective of this research is to develop a prototype Intelligent Heart Disease Prediction System (IHDPS) using three data mining modeling techniques, namely, Decision Trees, Naïve Bayes and Neural Network.

Besides providing effective treatments, the system also helps to reduce treatment costs and to enhance visualization and ease of interpretation.


Neural Networks

Introduction

The power and speed of modern digital computers is truly astounding. No human can ever hope to compute a million operations a second. However, there are some tasks for which even the most powerful computers cannot compete with the human brain, perhaps not even with the intelligence of an earthworm.

Imagine the power of the machine which has the abilities of both computers and humans. It would be the most remarkable thing ever. And all humans can live happily ever after (or will they?). This is the aim of artificial intelligence in general.

Neural networks approach this problem by trying to mimic the structure and function of our nervous system. Many researchers believe that AI (Artificial Intelligence) and neural networks are completely opposite in their approach. Conventional AI is based on the symbol system hypothesis. Loosely speaking, a symbol system consists of indivisible entities called symbols, which can form more complex entities by simple rules. The hypothesis then states that such a system is capable of and is necessary for intelligence.

The general belief is that Neural Networks is a sub-symbolic science. Before symbols themselves are recognized, something must be done so that conventional AI can then manipulate those symbols. To make this point clear, consider symbols such as cow, grass, house etc. Once these symbols and the "simple rules" which govern them are known, conventional AI can perform miracles. But to discover that something is a cow is not trivial. It can perhaps be done using conventional AI and symbols such as - white, legs, etc. But it would be tedious and certainly, when you see a cow, you instantly recognize it to be so, without counting its legs.

But this belief - that AI and Neural Networks are completely opposite, is not valid because, even when you recognize a cow, it is because of certain properties which you observe, that you conclude that it is a cow. This happens instantly because various parts of the brain function in parallel. All the properties which you observe are "summed up". Certainly there are symbols here and rules - "summing up". The only difference is that in AI, symbols are strictly indivisible, whereas here, the symbols (properties) may occur with varying degrees or intensities.

Progress in this area can be made only by breaking this line of distinction between AI and Neural Networks, and combining the results obtained in both, towards a unified framework.


Neural Network Architectures and Learning Algorithms

To cope with the difficulties mentioned in the previous section, the problem has been divided into sub-problems. Different types of neural networks have been proposed. Each type restricts the kind of connections that are possible. For example, it may specify that if one neuron is connected to another, then the second neuron cannot have another connection towards the first. The type of connections possible is generally referred to as the architecture of the neural network.

Whenever the neural network makes a mistake, some weights and thresholds have to be changed to compensate for this error. The rules which govern how exactly these changes are to take place are called the learning algorithm. Different types of neural networks may have different learning algorithms.

The term `architecture' has been much abused in the history of mankind. It has many meanings depending on whether you are talking about buildings, the inside of computers, or neural networks, among other things. Even in neural networks, the term architecture and what we have been referring to as the `type' of neural network are used interchangeably. So when we refer to such and such an architecture, it means the set of possible interconnections (also called the topology of the network) and the learning algorithm defined for it.

Each type of neural network has been designed to tackle a certain class of problems. Hopefully, at some stage we will be able to combine all the types of neural networks into a uniform framework. Hopefully, then we will reach our goal of combining brains and computers.

Different types of neural networks.

The Perceptron

This is a very simple model and consists of a single `trainable' neuron. Trainable means that its threshold and input weights are modifiable. Inputs are presented to the neuron and each input has a desired output (determined by us). If the neuron doesn't give the desired output, then it has made a mistake. To rectify this, its threshold and/or input weights must be changed. How this change is to be calculated is determined by the learning algorithm.

The output of the perceptron is constrained to Boolean values - (true, false), (1,0), (1,-1) or whatever. This is not a limitation because if the output of the perceptron were to be the input for something else, then the output edge could be made to have a weight. Then the output would be dependent on this weight.

The perceptron looks like -


x1, x2, ..., xn are inputs. These could be real numbers or Boolean values depending on the problem.

y is the output and is Boolean. w1, w2, ..., wn are weights of the edges and are real valued. T is the threshold and is real valued.

The output y is 1 if the net input which is

w1 x1 + w2 x2 + ... + wn xn

is greater than or equal to the threshold T. Otherwise the output is zero.

The idea is that we should be able to train this perceptron to respond to certain inputs with certain desired outputs. After the training period, it should be able to give reasonable outputs for any kind of input. If it wasn't trained for that input, then it should try to find the best possible output depending on how it was trained.

So during the training period we will present the perceptron with inputs one at a time and see what output it gives. If the output is wrong, we will tell it that it has made a mistake. It should then change its weights and/or threshold properly to avoid making the same mistake later.

Note that the model of the perceptron normally given is slightly different from the one pictured here. Usually, the inputs are not directly fed to the trainable neuron but are modified by some "preprocessing units". These units could be arbitrarily complex, meaning that they could modify the inputs in any way. These units have been deliberately eliminated from our picture, because it would be helpful to know what can be achieved by just a single trainable neuron, without all its "powerful friends".

To understand the kinds of things that can be done using a perceptron, we shall see a rather simple example of its use - Compute the logical operations "and", "or", "not" of some given boolean variables.


Computing "and": There are n inputs, each either a 0 or 1. To compute the logical "and" of these n inputs, the output should be 1 if and only if all the inputs are 1. This can easily be achieved by setting the threshold of the perceptron to n. The weights of all edges are 1. The net input can be n only if all the inputs are active.

Computing "or": It is also simple to see that if the threshold is set to 1, then the output will be 1 if at least one input is active. The perceptron in this case acts as the logical "or".

Computing "not": The logical "not" is a little tricky, but can be done. In this case, there is only one Boolean input. Let the weight of the edge be -1, so that the input which is either 0 or 1 becomes 0 or -1. Set the threshold to 0. If the input is 0, the threshold is reached and the output is 1. If the input is -1, the threshold is not reached and the output is 0.

The XOR Problem

There are problems which cannot be solved by any perceptron. In fact, there are more such problems than problems which can be solved using perceptrons. The most often quoted example is the XOR problem - build a perceptron which takes 2 Boolean inputs and outputs the XOR of them. What we want is a perceptron which will output 1 if the two inputs are different and 0 otherwise.

Input | Desired Output
0 0   | 0
0 1   | 1
1 0   | 1
1 1   | 0

Consider the following perceptron as an attempt to solve the problem -

If the inputs are both 0, then net input is 0 which is less than the threshold (0.5). So the output is 0 - desired output.


If one of the inputs is 0 and the other is 1, then the net input is 1. This is above threshold, and so the output 1 is obtained.

But the given perceptron fails for the last case. To see that no perceptron can be built to solve the problem, try to build one yourself.
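As a concrete check, here is the attempted perceptron described above (weights of 1 on both edges and a threshold of 0.5, as implied by the net inputs quoted in the text) evaluated on all four XOR patterns; it necessarily fails on (1, 1):

```python
# The attempted XOR perceptron from the text: weights 1 and 1, threshold 0.5.
# It handles the first three cases but fails on (1, 1), where the net input
# (2) is above the threshold even though the desired output is 0.
def perceptron(inputs, weights, threshold):
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0

for x1, x2, desired in [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]:
    actual = perceptron([x1, x2], [1, 1], 0.5)
    status = "ok" if actual == desired else "WRONG"
    print(f"input=({x1},{x2}) desired={desired} actual={actual} {status}")
```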

Pattern Recognition Terminology

The inputs that we have been referring to, of the form (x1, x2, ..., xn), are also called patterns. If a perceptron gives the correct, desired output for some pattern, then we say that the perceptron recognizes that pattern. We also say that the perceptron correctly classifies that pattern.

Since a pattern by our definition is just a sequence of numbers, it could represent anything -- a picture, a song, a poem... anything that you can have in a computer file. We could then have a perceptron which could learn such inputs and classify them, e.g. a neat picture or a scribbling, a good or a bad song, etc. All we have to do is to present the perceptron with some examples -- give it some songs and tell it whether each one is good or bad. (It could then go all over the internet, searching for songs which you may like.) Sounds incredible? At least that's the way it is supposed to work. But it may not. The problem is that the set of patterns which you want the perceptron to learn might be something like the XOR problem. Then no perceptron can be made to recognize your taste. However, there may be some other kind of neural network which can.

Linearly Separable Patterns and Some Linear Algebra

If a set of patterns can be correctly classified by some perceptron, then such a set of patterns is said to be linearly separable. The term "linear" is used because the perceptron is a linear device. The net input is a linear function of the individual inputs and the output is a linear function of the net input. Linear means that there are no square (x^2) or cube (x^3), etc. terms in the formulas.

A pattern (x1, x2, ..., xn) is a point in an n-dimensional space. (Stop imagining things.) This is an extension of the idea that (x,y) is a point in 2 dimensions and (x,y,z) is a point in 3 dimensions. The utility of such a weird notion of an n-dimensional space is that there are many concepts which are independent of dimension. Such concepts carry over to higher dimensions even though we can think only of their 2- or 3-dimensional counterparts. For example, if the distance to a point (x,y) in 2 dimensions is r, then

r^2 = x^2 + y^2

Since the distance to a point (x,y,z) in 3 dimensions is also defined similarly, it is natural to define the distance to a point (x1,x2, ..., xn) in n dimensions as

r^2 = x1^2 + x2^2 + ... + xn^2

r is called the norm (actually the Euclidean norm) of the point (x1, x2, ..., xn).

Similarly, a straight line in 2D is given by -

ax + by = c

In 3D, a plane is given by -

ax + by + cz = d

When we generalize this, we get an object called a hyperplane -

w1x1 + w2x2 + ... + wnxn = T

Notice something familiar? This is the net input to a perceptron. All points (patterns) for which the net input is greater than T belong to one class (they give the same output). All the other points belong to the other class.

We now have a lovely geometrical interpretation of the perceptron. A perceptron with weights w1,w2, ..., wn and threshold T can be represented by the above hyperplane. All points on one side of the hyperplane belong to one class. The hyperplane (perceptron) divides the set of all points (patterns) into 2 classes.

Now we can see why the XOR problem cannot have a solution. Here there are 2 inputs. Hence there are 2 dimensions (luckily). The points that we want to classify are (0,0), (1,1) - in one class and (0,1), (1,0) in the other class.

Clearly we cannot classify the points (crosses on one side, circles on other) using a straight line. Hence no perceptron exists which can solve the XOR problem.
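The same conclusion can be reached algebraically from the threshold rule. Using the convention from the perceptron section (output 1 when the net input reaches the threshold):

For (0,0) the desired output is 0, so 0 < T, which means T > 0.
For (1,0) and (0,1) the desired output is 1, so w1 >= T and w2 >= T.
For (1,1) the desired output is 0, so w1 + w2 < T.

But the middle two conditions give w1 + w2 >= 2T > T (since T > 0), contradicting the last one. So no choice of weights and threshold works.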


Perceptron Learning Algorithms

During the training period, a series of inputs are presented to the perceptron - each of the form (x1, x2, ..., xn). For each such input, there is a desired output - either 0 or 1. The actual output is determined by the net input, which is w1 x1 + w2 x2 + ... + wn xn. If the net input is less than the threshold then the output is 0, otherwise the output is 1. If the perceptron gives a wrong (undesirable) output, then one of two things could have happened -

1. The desired output is 0, but the net input is above threshold. So the actual output becomes 1.

In such a case we should decrease the weights. But by how much? The perceptron learning algorithm says that the decrease in weight of an edge should be directly proportional to the input through that edge. So,

new weight of an edge i = old weight - c * xi

There are several algorithms depending on what c is. For now, think that it is a constant.

The idea here is that if the input through some edge was very high, then that edge must have contributed to most of the error. So we reduce the weight of that edge more (i.e. proportional to the input along that edge).

2. The other case when the perceptron makes a mistake is when the desired output is 1, but the net input is below threshold.

Now we should increase the weights. Using the same intuition, the increase in weight of an edge should be proportional to the input through that edge. So,

new weight of an edge i = old weight + c * xi

What about c? If c is actually a constant, then the algorithm is called the "fixed increment rule". Note that in this case, the perceptron may not correct its mistake immediately. That is, when we change the weights because of a mistake, the new weights don't guarantee that the same mistake will not be repeated. This could happen if c is very small. However, by repeated application of the same input, the weights will change slowly each time, until that mistake is avoided.

We could also choose c in such a way that it will certainly avoid the most recent mistake, next time it is presented the same input. This is called the "absolute correction rule". The problem with this approach is that by learning one input, it might "forget" a previously learnt input. For example, if one input leads to an increase in some weight and another input decreases it, then such a problem may arise.
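A minimal sketch of the fixed increment rule, under the assumptions that c is a small constant and that the threshold is adjusted by the same rule as the weights (the text says the perceptron may change "its weights and/or threshold"). The training set is the logical "and" of two inputs, which is linearly separable, so the procedure converges:

```python
# Fixed increment rule sketch: when the output is wrong, move each weight by
# c times the input along its edge, in the direction that reduces the error.
def train_perceptron(patterns, c=0.5, epochs=20):
    weights = [0.0, 0.0]
    threshold = 0.0
    for _ in range(epochs):
        for inputs, desired in patterns:
            net = sum(w * x for w, x in zip(weights, inputs))
            actual = 1 if net >= threshold else 0
            if actual == desired:
                continue
            if actual == 1:      # output too high: decrease weights, raise threshold
                weights = [w - c * x for w, x in zip(weights, inputs)]
                threshold += c
            else:                # output too low: increase weights, lower threshold
                weights = [w + c * x for w, x in zip(weights, inputs)]
                threshold -= c
    return weights, threshold

and_patterns = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, threshold = train_perceptron(and_patterns)
print("weights:", weights, "threshold:", threshold)
for inputs, desired in and_patterns:
    net = sum(w * x for w, x in zip(weights, inputs))
    print(inputs, "desired", desired, "actual", 1 if net >= threshold else 0)
```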


Decision Trees

About Decision Tree

The Decision Tree algorithm, like Naive Bayes, is based on conditional probabilities. Unlike Naive Bayes, decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and easily used within a database to identify a set of records.

In some applications of data mining, the accuracy of a prediction is the only thing that really matters. It may not be important to know how the model works. In others, the ability to explain the reason for a decision can be crucial. For example, a Marketing professional would need complete descriptions of customer segments in order to launch a successful marketing campaign. The Decision Tree algorithm is ideal for this type of application.

Decision Tree Rules

Oracle Data Mining supports several algorithms that provide rules. In addition to decision trees, clustering algorithms (described in Chapter 7) provide rules that describe the conditions shared by the members of a cluster, and association rules (described in Chapter 8) provide rules that describe associations between attributes.

Rules provide model transparency, a window on the inner workings of the model. Rules show the basis for the model's predictions. Oracle Data Mining supports a high level of model transparency. While some algorithms provide rules, all algorithms provide model details. You can examine model details to determine how the algorithm handles the attributes internally, including transformations and reverse transformations. Transparency is discussed in the context of data preparation in Chapter 19 and in the context of model building in the Oracle Data Mining Application Developer's Guide.

Confidence and Support

Confidence and support are properties of rules. These statistical measures can be used to rank the rules and hence the predictions.

Support: The number of records in the training data set that satisfy the rule.

Confidence: The likelihood of the predicted outcome, given that the rule has been satisfied.

For example, consider a list of 1000 customers (1000 cases). Out of all the customers, 100 satisfy a given rule. Of these 100, 75 are likely to increase spending, and 25 are not likely to increase spending. The support of the rule is 100/1000 (10%). The confidence of the prediction (likely to increase spending) for the cases that satisfy the rule is 75/100 (75%).
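In code, the arithmetic of that example looks like this (the numbers are the ones from the paragraph above):

```python
# The support/confidence arithmetic from the example: 1000 cases, 100 of
# which satisfy the rule, 75 of those with the predicted outcome.
total_cases = 1000
cases_satisfying_rule = 100
cases_satisfying_rule_with_outcome = 75

support = cases_satisfying_rule / total_cases                            # 0.10 -> 10%
confidence = cases_satisfying_rule_with_outcome / cases_satisfying_rule  # 0.75 -> 75%

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```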


Advantages of Decision Trees

The Decision Tree algorithm produces accurate and interpretable models with relatively little user intervention. The algorithm can be used for both binary and multiclass classification problems.

The algorithm is fast, both at build time and apply time. The build process for Decision Tree is parallelized. (Scoring can be parallelized irrespective of the algorithm.)

Decision tree scoring is especially fast. The tree structure, created in the model build, is used for a series of simple tests (typically 2-7). Each test is based on a single predictor. It is a membership test: either IN or NOT IN a list of values (categorical predictor); or LESS THAN or EQUAL TO some value (numeric predictor).

Growing a Decision Tree

A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.

Figure 11-2 Sample Decision Tree


Splitting

During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. Oracle Data Mining offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.

Homogeneity metrics assess the quality of alternative split conditions and select the one that results in the most homogeneous child nodes. Homogeneity is also called purity; it refers to the degree to which the resulting child nodes are made up of cases with the same target value. The objective is to maximize the purity in the child nodes. For example, if the target can be either yes or no (will or will not increase spending), the objective is to produce nodes where most of the cases will increase spending or most of the cases will not increase spending.
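A small sketch of how these two purity metrics could be computed for a candidate split, assuming each node is summarized by the counts of its target values (this illustrates the idea, not Oracle's implementation):

```python
import math

# Impurity of one node from the counts of each target value in it.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Quality of a split: size-weighted impurity of the two child nodes
# (lower is better, i.e. the children are more homogeneous).
def split_impurity(left_counts, right_counts, metric=gini):
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * metric(left_counts) + (n_right / n) * metric(right_counts)

# A pure node scores 0 on both metrics; a 50/50 node scores worst.
print(gini([10, 0]), entropy([10, 0]))   # 0.0 0.0
print(gini([5, 5]), entropy([5, 5]))     # 0.5 1.0
print(split_impurity([9, 1], [2, 8]))    # gini of a fairly clean split
```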

Cost Matrix

All classification algorithms, including Decision Tree, support a cost-benefit matrix at apply time. You can use the same cost matrix for building and scoring a Decision Tree model, or you can specify a different cost/benefit matrix for scoring.

See "Costs" and "Priors".

Preventing Over-Fitting

In principle, Decision Tree algorithms can grow each branch of the tree just deeply enough to perfectly classify the training examples. While this is sometimes a reasonable strategy, in fact it can lead to difficulties when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function. In either of these cases, this simple algorithm can produce trees that over-fit the training examples. Over-fit is a condition where a model is able to accurately predict the data used to create the model, but does poorly on new data presented to it.

To prevent over-fitting, Oracle Data Mining supports automatic pruning and configurable limit conditions that control tree growth. Limit conditions prevent further splits once the conditions have been satisfied. Pruning removes branches that have insignificant predictive power.

XML for Decision Tree Models

You can generate XML representing a decision tree model; the generated XML satisfies the definition specified in the Data Mining Group Predictive Model Markup Language (PMML) version 2.1 specification. The specification is available at http://www.dmg.org.


Tuning the Decision Tree Algorithm

The Decision Tree algorithm is implemented with reasonable defaults for splitting and termination criteria. It is unlikely that you will need to use any of the build settings that are supported for Decision Tree. The settings are described as follows.

Settings to specify the homogeneity metric for finding the optimal split condition:

TREE_IMPURITY_METRIC can be either gini or entropy. The default is gini.

Settings to control the growth of the tree:

TREE_TERM_MAX_DEPTH specifies the maximum depth of the tree, from root to leaf inclusive. The default is 7.

TREE_TERM_MINPCT_MODE specifies the minimum number of cases required in a child node, expressed as a percentage of the rows in the training data. The default is .05%.

TREE_TERM_MINPCT_SPLIT specifies the minimum number of cases required in a node in order for a further split to be possible. Expressed as a percentage of all the rows in the training data. The default is 1%.

TREE_TERM_MINREC_MODE specifies the minimum number of cases required in a child node. Default is 10.

TREE_TERM_MINREC_SPLIT specifies the minimum number of cases required in a node in order for a further split to be possible. Default is 20.
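These settings are specific to Oracle Data Mining. As a rough illustration of what such growth controls do, here is an analogous (not equivalent) configuration using scikit-learn's DecisionTreeClassifier; note that scikit-learn expresses the minimum-record limits shown here as absolute sample counts rather than percentages:

```python
# Not Oracle's API: a rough scikit-learn analogue of the growth controls above,
# shown only to illustrate what this kind of setting does.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",        # or "entropy", like TREE_IMPURITY_METRIC
    max_depth=7,             # like TREE_TERM_MAX_DEPTH
    min_samples_split=20,    # like TREE_TERM_MINREC_SPLIT
    min_samples_leaf=10,     # like TREE_TERM_MINREC_MODE
)
# clf.fit(X_train, y_train) would then build a tree no deeper than 7 levels,
# never splitting nodes with fewer than 20 cases or creating leaves with
# fewer than 10 cases.
```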

Data Preparation for Decision Tree

The Decision Tree algorithm manages its own data preparation internally. It does not require pretreatment of the data. Decision Tree is not affected by Automatic Data Preparation.

Decision Tree interprets missing values as missing at random. The algorithm does not support nested tables and thus does not support sparse data.


Naive Bayes Algorithm

The Microsoft Naive Bayes algorithm is a classification algorithm provided by Microsoft SQL Server Analysis Services for use in predictive modeling. The name Naive Bayes derives from the fact that the algorithm uses Bayes theorem but does not take into account dependencies that may exist, and therefore its assumptions are said to be naive.

This algorithm is less computationally intense than other Microsoft algorithms, and therefore is useful for quickly generating mining models to discover relationships between input columns and predictable columns. You can use this algorithm to do initial explorations of data, and then later you can apply the results to create additional mining models with other algorithms that are more computationally intense and more accurate.
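For reference, the naive assumption the paragraph refers to is the standard one: Bayes theorem gives the probability of a class from the probabilities of the inputs given that class, and the "naive" step treats the inputs as independent of one another, so that

P(class | x1, ..., xn) is proportional to P(class) * P(x1 | class) * P(x2 | class) * ... * P(xn | class).

This general form is not specific to the Microsoft implementation.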

Example

As an ongoing promotional strategy, the marketing department for the Adventure Works Cycle company has decided to target potential customers by mailing out fliers. To reduce costs, they want to send fliers only to those customers who are likely to respond. The company stores information in a database about demographics and response to a previous mailing. They want to use this data to see how demographics such as age and location can help predict response to a promotion, by comparing potential customers to customers who have similar characteristics and who have purchased from the company in the past. Specifically, they want to see the differences between those customers who bought a bicycle and those customers who did not.

By using the Microsoft Naive Bayes algorithm, the marketing department can quickly predict an outcome for a particular customer profile, and can therefore determine which customers are most likely to respond to the fliers. By using the Microsoft Naive Bayes Viewer in Business Intelligence Development Studio, they can also visually investigate specifically which input columns contribute to positive responses to fliers.

How the Algorithm Works

The Microsoft Naive Bayes algorithm calculates the probability of every state of each input column, given each possible state of the predictable column. You can use the Microsoft Naive Bayes Viewer in Business Intelligence Development Studio to see a visual representation of how the algorithm distributes states, as shown in the following graphic.

The Microsoft Naive Bayes Viewer lists each input column in the dataset, and shows how the states of each column are distributed, given each state of the predictable column. You can use this view to identify the input columns that are important for differentiating between states of the predictable column. For example, in the Commute Distance column shown here, if the customer commutes from one to two miles to work, the probability that the customer will buy a bike is 0.387, and the probability that the customer will not buy a bike is 0.287. In this example, the algorithm uses the numeric information, derived from customer characteristics such as commute distance, to predict whether a customer will buy a bike. For more information about using the Microsoft Naive Bayes Viewer, see Viewing a Mining Model with the Microsoft Naive Bayes Viewer.
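A small sketch of the counting this section describes: estimate, for every state of each input column, its probability given each state of the predictable column, then combine them with the naive product to score a new case. The records and column values below are made up for illustration; they are not the Adventure Works data:

```python
from collections import Counter, defaultdict

# Hypothetical training records with discrete columns and a predictable column.
records = [
    {"commute": "1-2 miles", "age": "young", "buys_bike": "yes"},
    {"commute": "1-2 miles", "age": "older", "buys_bike": "yes"},
    {"commute": "10+ miles", "age": "older", "buys_bike": "no"},
    {"commute": "5-10 miles", "age": "young", "buys_bike": "yes"},
    {"commute": "10+ miles", "age": "young", "buys_bike": "no"},
]
target = "buys_bike"

# Count class frequencies and, for each (column, class), the frequency of every state.
class_counts = Counter(r[target] for r in records)
state_counts = defaultdict(Counter)
for r in records:
    for col, value in r.items():
        if col != target:
            state_counts[(col, r[target])][value] += 1

def score(case):
    """Unnormalized P(class) * prod_i P(state_i | class) for each class."""
    scores = {}
    for cls, n_cls in class_counts.items():
        p = n_cls / len(records)
        for col, value in case.items():
            # Simple add-one smoothing so unseen states do not zero out the product.
            p *= (state_counts[(col, cls)][value] + 1) / (n_cls + 1)
        scores[cls] = p
    return scores

print(score({"commute": "1-2 miles", "age": "older"}))
```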

Data Required for Naive Bayes Models

When you prepare data for use in training a Naive Bayes model, you should understand the requirements for the algorithm, including how much data is needed, and how the data is used.

The requirements for a Naive Bayes model are as follows:

A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.

Input columns: In a Naive Bayes model, all columns must be either discrete or discretized columns. For information about discretizing columns, see Discretization Methods (Data Mining). For a Naive Bayes model, it is important to ensure that the input attributes are independent of each other.

At least one predictable column: The predictable attribute must contain discrete or discretized values. The values of the predictable column can be treated as input and frequently are, to find relationships among the columns.

Viewing the Model

To explore the model, you can use the Microsoft Naive Bayes Viewer. The viewer shows you how the input attributes relate to the predictable attribute. The viewer also provides a detailed profile of each cluster, a list of the attributes that distinguish each cluster from the others, and the characteristics of the entire training data set. For more information, see Viewing a Mining Model with the Microsoft Naive Bayes Viewer.

If you want to know more detail, you can browse the model in the Microsoft Generic Content Tree Viewer (Data Mining Designer). For more information about the type of information stored in the model, see Mining Model Content for Naive Bayes Models (Analysis Services - Data Mining).

Making Predictions

After the model has been trained, the results are stored as a set of patterns, which you can explore or use to make predictions.

You can create queries to return predictions about how new data relates to the predictable attribute, or you can retrieve statistics that describe the correlations found by the model.

For information about how to create queries against a data mining model, see Querying Data Mining Models (Analysis Services - Data Mining). For examples of how to use queries with a Naive Bayes model, see Querying a Naive Bayes Model (Analysis Services - Data Mining).

Remarks

Supports the use of Predictive Model Markup Language (PMML) to create mining models.

Supports drillthrough.

Does not support the creation of data mining dimensions.

Supports the use of OLAP mining models.


Neural Networks Prediction

Introduction

Artificial neural networks are relatively crude electronic networks of "neurons" based on the neural structure of the brain. They process records one at a time, and "learn" by comparing their prediction of the record (which, at the outset, is largely arbitrary) with the known actual record. The errors from the initial prediction of the first record are fed back into the network and used to modify the network's algorithm the second time around, and so on for many iterations.

Roughly speaking, a neuron in an artificial neural network is:

1. A set of input values (xi) and associated weights (wi)
2. A function (g) that sums the weighted inputs and maps the result to an output (y)

Neurons are organized into layers. A small sketch of such a neuron follows.
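A minimal sketch of such a neuron, using the logistic (sigmoid) function as one common choice for g (the specific inputs, weights, and bias are arbitrary illustrative values):

```python
import math

# One artificial neuron: weighted sum of the inputs, passed through g.
def neuron(inputs, weights, bias=0.0):
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))   # g(net) = sigmoid(net)

print(neuron([0.5, 0.2, 0.9], [0.4, -0.6, 0.3]))
```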


The input layer is composed not of full neurons, but rather consists simply of the values in a data record, that constitute inputs to the next layer of neurons.  The next layer is called a hidden layer; there may be several hidden layers.  The final layer is the output layer, where there is one node for each class.  A single sweep forward through the network results in the assignment of a value to each output node, and the record is assigned to whichever class's node had the highest value.

Training an Artificial Neural Network

In the training phase, the correct class for each record is known (this is termed supervised training), and the output nodes can therefore be assigned "correct" values -- "1" for the node corresponding to the correct class, and "0" for the others.  (In practice it has been found better to use values of 0.9 and 0.1, respectively.)  It is thus possible to compare the network's calculated values for the output nodes to these "correct" values, and calculate an error term for each node (the "Delta" rule).  These error terms are then used to adjust the weights in the hidden layers so that, hopefully, the next time around the output values will be closer to the "correct" values. 

The Iterative Learning Process

A key feature of neural networks is an iterative learning process in which data cases (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time. After all cases are presented, the process often starts over again. During this learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of input samples. Neural network learning is also referred to as "connectionist learning," due to the connections between the units. Advantages of neural networks include their high tolerance to noisy data, as well as their ability to classify patterns on which they have not been trained. The most popular neural network algorithm is the back-propagation algorithm, proposed in the 1980s.

Once a network has been structured for a particular application, that network is ready to be trained. To start this process, the initial weights (described in the next section) are chosen randomly. Then the training, or learning, begins. 

The network processes the records in the training data one at a time, using the weights and functions in the hidden layers, then compares the resulting outputs against the desired outputs. Errors are then propagated back through the system, causing the system to adjust the weights for application to the next record to be processed.  This process occurs over and over as the weights are continually tweaked.  During the training of a network the same set of data is processed many times as the connection weights are continually refined. 

Note that some networks never learn. This could be because the input data do not contain the specific information from which the desired output is derived. Networks also don't converge if there is not enough data to enable complete learning. Ideally, there should be enough data so that part of the data can be held back as a validation set. 

Feedforward, Back-Propagation


The feedforward, back-propagation architecture was developed in the early 1970s by several independent sources (Werbos; Parker; Rumelhart, Hinton and Williams). This independent co-development was the result of a proliferation of articles and talks at various conferences which stimulated the entire industry. Currently, this synergistically developed back-propagation architecture is the most popular, effective, and easy-to-learn model for complex, multi-layered networks. Its greatest strength is in non-linear solutions to ill-defined problems. The typical back-propagation network has an input layer, an output layer, and at least one hidden layer. There is no theoretical limit on the number of hidden layers, but typically there are just one or two. Some work has been done which indicates that a maximum of five layers (one input layer, three hidden layers and an output layer) are required to solve problems of any complexity. Each layer is fully connected to the succeeding layer.

As noted above, the training process normally uses some variant of the Delta Rule, which starts with the calculated difference between the actual outputs and the desired outputs. Using this error, connection weights are increased in proportion to the error times a scaling factor for global accuracy. Doing this for an individual node means that the inputs, the output, and the desired output all have to be present at the same processing element. The complex part of this learning mechanism is for the system to determine which input contributed the most to an incorrect output and how that element should be changed to correct the error. An inactive node would not contribute to the error and would have no need to change its weights. To solve this problem, training inputs are applied to the input layer of the network, and desired outputs are compared at the output layer. During the learning process, a forward sweep is made through the network, and the output of each element is computed layer by layer. The difference between the output of the final layer and the desired output is back-propagated to the previous layer(s), usually modified by the derivative of the transfer function, and the connection weights are normally adjusted using the Delta Rule. This process proceeds for the previous layer(s) until the input layer is reached.
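A minimal sketch of this forward sweep / back-propagation loop for a tiny one-hidden-layer network, trained on the XOR patterns that a single perceptron could not learn. The layer sizes, learning rate, and number of passes are illustrative choices, not prescribed values:

```python
import math
import random

random.seed(1)

# Tiny feedforward network with one hidden layer, trained by repeated forward
# sweeps and back-propagation of the output error, scaled by the derivative of
# the sigmoid transfer function.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

n_in, n_hidden, n_out = 2, 3, 1
# Each row holds the incoming weights of one unit; the last entry is a bias.
w_hidden = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
w_out = [[random.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
rate = 0.5
patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

for _ in range(20000):
    for x, target in patterns:
        # Forward sweep, layer by layer.
        h = [sigmoid(sum(w[i] * x[i] for i in range(n_in)) + w[-1]) for w in w_hidden]
        o = [sigmoid(sum(w[i] * h[i] for i in range(n_hidden)) + w[-1]) for w in w_out]
        # Error terms: output layer first, then back-propagated to the hidden layer.
        delta_o = [(target[k] - o[k]) * o[k] * (1 - o[k]) for k in range(n_out)]
        delta_h = [h[j] * (1 - h[j]) * sum(delta_o[k] * w_out[k][j] for k in range(n_out))
                   for j in range(n_hidden)]
        # Weight changes proportional to the error times the input along each edge.
        for k in range(n_out):
            for j in range(n_hidden):
                w_out[k][j] += rate * delta_o[k] * h[j]
            w_out[k][-1] += rate * delta_o[k]
        for j in range(n_hidden):
            for i in range(n_in):
                w_hidden[j][i] += rate * delta_h[j] * x[i]
            w_hidden[j][-1] += rate * delta_h[j]

# After training, the outputs should move toward the targets; exact values
# depend on the random initial weights.
for x, target in patterns:
    h = [sigmoid(sum(w[i] * x[i] for i in range(n_in)) + w[-1]) for w in w_hidden]
    o = sigmoid(sum(w_out[0][j] * h[j] for j in range(n_hidden)) + w_out[0][-1])
    print(x, target[0], round(o, 3))
```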

Structuring the Network

The number of layers and the number of processing elements per layer are important decisions. These parameters to a feedforward, back-propagation topology are also the most ethereal - they are the "art" of the network designer. There is no quantifiable, best answer to the layout of the network for any particular application. There are only general rules picked up over time and followed by most researchers and engineers applying this architecture to their problems. 

Rule One: As the complexity in the relationship between the input data and the desired output increases, the number of the processing elements in the hidden layer should also increase.

Rule Two: If the process being modeled is separable into multiple stages, then additional hidden layer(s) may be required. If the process is not separable into stages, then additional layers may simply enable memorization of the training set, and not a true general solution effective with other data.

Rule Three: The amount of training data available sets an upper bound for the number of processing elements in the hidden layer(s). To calculate this upper bound, use the number of cases in the training data set and divide that number by the sum of the number of nodes in the input and output layers in the network. Then divide that result again by a scaling factor between five and ten. Larger scaling factors are used for relatively less noisy data. If you use too many artificial neurons the training set will be memorized. If that happens, generalization of the data will not occur, making the network useless on new data sets. A small worked example is shown below.
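With made-up numbers (5000 training cases, 10 input nodes, 2 output nodes), Rule Three works out as follows:

```python
# Worked example of Rule Three; the counts here are hypothetical.
cases = 5000
input_nodes, output_nodes = 10, 2

for scale in (5, 10):
    upper_bound = cases / (input_nodes + output_nodes) / scale
    print(f"scaling factor {scale}: at most about {upper_bound:.0f} hidden elements")
# -> about 83 hidden elements with a scaling factor of 5,
#    about 42 with a scaling factor of 10
```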

See also

Using Neural Network in XLMiner™

Example - Neural Network


Alternating decision tree


An Alternating Decision Tree (ADTree) is a machine learning method for classification. The ADTree data structure and algorithm are a generalization of decision trees and have connections to boosting. ADTrees were introduced by Yoav Freund and Llew Mason[1].


Motivation

Original boosting algorithms typically combined either decision stumps or decision trees. Boosting decision stumps creates a set of T weighted weak hypotheses (where T is the number of boosting iterations), which can be visualized as a set for reasonable values of T. Boosting decision trees could result in a final combined classifier with thousands (or millions) of nodes for modest values of T. Both of these scenarios produce final classifiers in which it is either difficult to visualize correlations or difficult to visualize at all. Alternating decision trees provide a method for visualizing decision stumps in an ordered and logical way to demonstrate correlations. In doing so, they simultaneously generalize decision trees and can be used to essentially grow boosted decision trees in parallel.

Description of the structure

The alternating decision tree structure consists of two components: decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes specify a value to add to the score based on the result of the decision node. Each decision node can be seen as a conjunction between a precondition (the decision node was reached) and the condition specified in the decision node.

Perhaps the easiest way to understand the interaction of decision and prediction nodes is through an example. The following example is taken from JBoost performing boosting for 6 iterations on the spambase dataset (available from the UCI Machine Learning Repository). Positive examples indicate that the message is spam and negative examples are not spam. During each iteration, a single node is added to the ADTree. The ADTree determined by the learning algorithm implemented in JBoost is:

The tree construction algorithm is described below in the Description of the algorithm section. We now show how to interpret the tree once it has been constructed. We focus on one specific instance:

An instance to be classified

Feature                    | Value
char_freq_bang             | 0.08
word_freq_hp               | 0.4
capital_run_length_longest | 4
char_freq_dollar           | 0
word_freq_remove           | 0.9
word_freq_george           | 0
Other features             | ...

For this instance, we obtain a score that determines the classification of the instance. This score not only acts as a classification, but also as a measure of confidence. The actual order in which the ADTree nodes are evaluated will likely be different from the order in which they were created. That is, the node from iteration 4 can be evaluated before the node from iteration 1. There are constraints to this (e.g. the node from iteration 2 must be evaluated before the node from iteration 5). In general, either breadth-first or depth-first evaluation will yield the correct interpretation.

The following table shows how the score is created (progressive score) for our above example instance:

Score for the above instance

Iteration         | 0      | 1              | 2             | 3           | 4             | 5      | 6
Instance values   | N/A    | .08 < .052 = n | .4 < .195 = n | 0 < .01 = y | 0 < 0.005 = y | N/A    | .9 < .225 = n
Prediction        | -0.093 | 0.74           | -1.446        | -0.38       | 0.176         | 0      | 1.66
Progressive score | -0.093 | 0.647          | -0.799        | -1.179      | -1.003        | -1.003 | 0.657

There are a few observations that we should make:

The final classification of the example is positive (0.657), meaning that the example is considered to be spam.

All nodes at depth 1 have their predicate evaluated and one of their prediction nodes contributes to the score. Thus a tree with depth 1 is the equivalent of boosted decision stumps.

If a decision node is not reached (the node from iteration 5 in the above example) then the node's predicate and subsequent prediction nodes will not be evaluated. A short sketch of this scoring process in code follows.
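To make the scoring procedure concrete: start from the root prediction value and, for every decision node whose precondition is reached, add the prediction value of the branch the instance satisfies; unreached nodes contribute nothing. The decision nodes and prediction values below are made up for illustration and are not the actual six-node spambase tree from the example:

```python
# Sketch of ADTree scoring with illustrative (not actual) nodes and values.
def adtree_score(instance, root_value, nodes):
    score = root_value
    for node in nodes:
        if node["precondition"](instance):          # node is reached
            branch = "yes" if node["condition"](instance) else "no"
            score += node["predictions"][branch]
    return score

nodes = [
    {   # depth-1 node: always reached
        "precondition": lambda x: True,
        "condition": lambda x: x["char_freq_bang"] < 0.052,
        "predictions": {"yes": -1.0, "no": 0.7},
    },
    {   # deeper node: only reached when the first condition came out "no"
        "precondition": lambda x: not (x["char_freq_bang"] < 0.052),
        "condition": lambda x: x["word_freq_remove"] < 0.225,
        "predictions": {"yes": -0.5, "no": 1.2},
    },
]

instance = {"char_freq_bang": 0.08, "word_freq_remove": 0.9}
score = adtree_score(instance, root_value=-0.093, nodes=nodes)
print(score, "spam" if score > 0 else "not spam")
```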

Description of the algorithm

The alternating decision tree learning algorithm is described in the original paper[1]. The general idea involves a few main concepts:

The root decision node is always TRUE or FALSE.

The tree is grown iteratively. The total number of iterations is generally decided prior to starting the algorithm.

Each decision node (c2) is selected by the algorithm based on how well it discriminates between positive and negative examples.

Once a decision node is created, the prediction node is determined by how well the decision node discriminates.

Before the algorithm, we first define some notation. Let c be a predicate; then:

W+(c) is the weight of all positively labeled examples that satisfy c
W-(c) is the weight of all negatively labeled examples that satisfy c
W(c) is the weight of all examples that satisfy c

We call c a precondition when it is a conjunction of previous base conditions and negations of previous base conditions.

The exact algorithm is:

INPUT: m examples and labels

Set the weight of all examples to W1(xj) = 1 / m
Set the margin of all examples to r1(xj) = 0
The root decision node is always c = TRUE, with a single prediction node

For each boosting iteration t, do:

Let c1 be a precondition (that is, the node being created can be reached via c1) and c2 be a condition (the new node). Each decision node (c2) is selected by the algorithm based on how well it discriminates between positive and negative examples; the original ADTree algorithm chooses the pair (c1, c2) that minimizes a criterion computed from W+(c1 and c2), W-(c1 and c2), W+(c1 and not c2) and W-(c1 and not c2) (see the original paper[1]).

Once a decision node is created, its prediction nodes are determined from the same weights.

Add the conditions c1 and c2, and c1 and not c2, to the set of possible preconditions Pt+1.

Update the weights.

Empirical Results

Figure 6 in the original paper[1] demonstrates that ADTrees are typically as robust as boosted decision trees and boosted decision stumps.

References

1. Yoav Freund and Llew Mason. The Alternating Decision Tree Algorithm. Proceedings of the 16th International Conference on Machine Learning, pages 124-133 (1999).

External links

An introduction to Boosting and ADTrees (has many graphical examples of alternating decision trees in practice)