AI Briefly
What is AI?

Is it...

• A type of applied computer science? (But what about theoretical and applied AI?)

• A branch of cognitive science?

• A science at all?

What is intelligence?

• What distinguishes human behavior from the behavior of everything else?

• Whatever behaviors we don't understand?

• Whatever "works", that is, results in survival (or flourishing) in complex

environments

Possible goals

• To understand intelligence independent of any particular "implementation"

• To model human or animal behavior 

• To model human thought

• To implement rational thought

• To implement rational action

How is the topic divided up?

• By domain

o Vision

o  Natural language

o Diagnosis (medicine, etc.)

o Mathematics

o Robotics

o Education

• By domain-independent task type

o Search

o Representation

o Inference

o Learning

o Evolution

o Perception

o Planning and action

• By theoretical framework 

o Symbolic AI

Logicist AI


Case-based reasoning

Soar 

ACT-R

o "Biomorphic" AI, Non-symbolic AI, Subsymbolic AI

 Neural networks, connectionist AI, dynamical systems

Evolutionary computation

What are some basic differences in approach?

• The role of learning

o If you want an intelligent machine, program in the intelligence.

o If you want an intelligent machine, make it a good learner and send it out

into the world.

• The role of the hardware

o Intelligence is a software problem. An intelligent program can be run on a

 brain or a computer.

o The hardware is relevant. We should look for intelligence that is based on the properties of nervous systems, because the smartest systems we know about are smart because of their nervous systems. [neural networks, connectionism]

• The role of domain-independent methods

o Neat AI 

Theories should be elegant and parsimonious.

We should understand precisely what our theories can do and how they behave.

Most (or all) of intelligence is governed by general principles.

o Scruffy AI 

The mind is a kludge. To make things efficient, inelegant shortcuts are often appropriate.

It may be impossible to come to a precise understanding of our theories.

There are only a few general principles that apply across domains. Intelligence comes from domain knowledge.

• Two frameworks: symbolic and subsymbolic (connectionist) AI

o Symbolic models

Physical Symbol Systems (Newell, Pylyshyn, Fodor; summarized

 by Harnad)

A set of arbitrary physical tokens (scratches on paper, holes on a tape, events in a digital computer, etc.) that are manipulated on the basis of explicit rules that are likewise physical tokens and strings of tokens. The rule-governed symbol-token manipulation is based purely on the shape of the symbol tokens (not their "meaning"), i.e., it is purely syntactic, and consists of rulefully combining and recombining symbol tokens. There are primitive atomic symbol tokens and composite symbol-token strings. The entire cognitive system and all its parts--the atomic tokens, the composite tokens, the syntactic manipulations (both actual and possible) and the rules--are all semantically interpretable: The syntax can be systematically assigned a meaning (e.g., as standing for objects, as describing states of affairs).

Processes happen sequentially.

There is a central controller which coordinates the activities of the modules of the cognitive system and selects among candidate processes at each point in time.

The cognitive system interacts with the world through interfaces to perception and action, which operate very differently from the

internal (cognitive) system.

Time is often mapped onto space; that is, the cognitive system has simultaneous access to all of a pattern of some length (word, sentence, etc.). Inputs may also be presented sequentially, but the problem of temporal short-term memory is side-stepped because the inputs are preprocessed.

Knowledge is usually programmed into the cognitive system by someone who has a theory of how knowledge is organized. Learning is also possible, but models that learn usually start intelligent.

o Subsymbolic (connectionist, dynamical) models

Control is distributed. There is just the illusion of someone being in charge because the behavior seems purposeful, and it seems to be possible to write a centralized program to make it happen.

The basic processes involve very simple interactions among primitive elements arranged in a network. Usually the interaction amounts to the spread of activation.

Many of the processes happen in parallel.

The cognitive system may interact with the world through perception and action components which are similar to the internal (cognitive) parts of the creature. In some models, the environment and the creature itself constitute one large dynamical system.

Knowledge is distributed, usually in the form of patterns of connectivity among the primitive elements. The knowledge in such systems is implicit; it often cannot be simply read off.

Knowledge gets into the cognitive system through learning as the

system discovers the statistical properties of the world around it or through evolution as generations of creatures are forced to survive

in the world.


The problem of temporal short-term memory is often addressed,

though the continuous interaction of components of the cognitive

system with each other and the world may not be.

Introduction to representation

Why representation?

• Tasks: going from inputs (stimuli) to outputs (responses)

• The most primitive solution: a lookup table which specifies an output for every

input

• The problem with lookup tables:

o There may be too many inputs to store (the world is continuous, after all).

o There is a need for the system to be able to respond appropriately to novel 

inputs.

• The alternative solution: a function from inputs to outputs

o AI is about these functions: what they might look like for tasks requiring "intelligence".

o The functions may be very complex, requiring one or more

transformations of the input on the way to the output: internal

representations.

What are representations like?

• Are they explicit (directly interpretable), or are they in a form that looks like garbage to an outside observer (even though they serve their function for the system)?

• Are they localized (in one place), or are they distributed throughout the system?

• Are they propositional (language-like), or are they in some other form, for 

example, more like images?

• Are they static or dynamic? Do they just sit there or do they "happen"?

• Are there different kinds of representations for different domains that have little in

common with each other?

What do they need to have?

• Distinction between objects and relations 

• Wholes consisting of parts, which are in turn wholes consisting of parts:

recursive structure 

• (Maybe) slots (roles) and fillers (values)

o In an object, the SHAPE slot may have filler CYLINDER. In a sentence,

the VERB slot may have filler "PUT".


o In an event, the AGENT slot may have filler ROBOT and the PATIENT

slot may have filler STICK1.

• (Maybe) truth and falsehood

• Generality (abstraction)

o The generalization (concept) BLOCK is an abstraction over all blocks, including, for example, BLOCK4.

o The generalization (concept) PUT is an abstraction over all instances of putting, including, for example, PUT8, in which the Robot puts a block on another block.

Two basic kinds of representations

• Symbolic 

o The primitives are symbols, e.g., PUT, BLOCK4.

o More complex expressions are built up by concatenating symbols

together into symbol structures, e.g., PUT(ROBOT, BLOCK4, TABLE).

o Similarity is all-or-none: eq?.

• Connectionist (subsymbolic) 

o The primitives are vectors of numbers.

o There are no more complex expressions; combinations are produced

through addition or some other form of superposition.

o Similarity is (potentially) continuous: the distance between the vectors.
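A minimal sketch of the two notions of similarity in standard Scheme (the procedure names are illustrative, not from the notes):

; Symbolic similarity is all-or-none: two primitives either are
; the same symbol or they are not.
(define symbolic-similar?
  (lambda (s1 s2)
    (eq? s1 s2)))

; Connectionist similarity is continuous: the Euclidean distance
; between two vectors, represented here as lists of numbers.
(define vector-distance
  (lambda (v1 v2)
    (sqrt (apply + (map (lambda (x y) (expt (- x y) 2)) v1 v2)))))

(symbolic-similar? 'put 'put)              ; => #t
(symbolic-similar? 'put 'block4)           ; => #f
(vector-distance '(1 0 0.5) '(1 0.2 0.4))  ; => ~.224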

Some other questions concerning representation

• What sort of evidence do we have for what the internal representations of cognition look like?

• What about the input and output themselves? What form do they take?

• How do representations relate to the world, to perception, and to action?

Memory and learning (and representation)

• There is usually a basic distinction between long-term, general knowledge of a particular type (in long-term memory) and the short-term characterization of the current situation.

• The system needs to be able to take items in STM and use them to access knowledge in LTM (a kind of search). In symbolic systems this is usually accomplished through some form of pattern matching.

• LTM normally consists of categories (concepts) and rules specifying what sort of 

action to take or inference to make in a particular sort of situation. Application of 

a rule can result in external behavior or a change in the contents of STM.

• Knowledge may get into LTM through hard-wiring by the programmer, through

learning, or through evolution.


• Learning is usually induction: in response to a set of examples, the system creates

new categories or rules (a kind of search). Ease of learning depends on the quality

of the feedback (if any) provided by the environment.

Kinds of processing: what we need to do with representations

• Categorization: given a representation of an object or situation, assign it to one

of a finite set of labels.

Categorizing an input image as a BLOCK.

Categorizing an input word as a NOUN.

• Parsing: given a pattern, assign some structure to it.

Segmenting an input image into constituent objects.

Segmenting an input sentence into constituent phrases.

• Compression: given a pattern, represent it in a more compact way, taking

advantage of the regularity in it.

Representing an input image in terms of a small number of dimensions.

• Deduction, inference: given some facts, infer one or more other facts that follow

from them.

Given that BLOCK1 is ON BLOCK2, infer that BLOCK2 is UNDER BLOCK1.

Given that X is a BLOCK, infer that it has flat faces.

• Induction: given some examples, create a general rule or category that covers

them all.

Given multiple instances of scenes, create the rule relating ON and UNDER.

Given multiple instances of blocks, create the category BLOCK and use it to categorize new instances.

• Action: given a situation (or some facts), take an appropriate action.

Given a command to put a block in a box, perform the sequence of actions

involved in doing it.

Predicate calculus

What should a good representational format do for us?


• It should be correct. It should represent what we think it represents, permitting the inferences that we would like the system to make and failing to make inferences that we would not like (together with an inference mechanism).

• It should be expressive; it should allow us to distinguish all situations that we need to distinguish (it should be unambiguous). It should allow us to point to entities that need to be pointed to.

• It should treat situations which we believe are similar as similar.

• It should be flexible, allowing us to represent the same situation in different ways,

reflecting different construals.

Some spatial examples

• Some "facts" about objects

o BLOCKS

They have square corners.

They have straight edges.

They have six faces.

They don't roll.

They are a kind of prism.

The thing I'm looking at now is one of them.

o BALLS

They're round.

They have no edges or faces.

They roll.

• Some facts about relations

o IN

The contained object is surrounded by the container at some point.

The contained object is smaller than the container in some dimension.

The container has some empty space within it.

In order to be freely moved, the contained object needs to be taken out of the container.

It's a kind of spatial relation.

The scene I'm looking at now has one in it. It relates a cup (the container) and a stick (the contained thing).

o BEHIND (viewer-centered)

At least part of the object that is behind is obscured by the object

that is in front.

The object that is behind is further from the viewer than the object that is in front.

To touch the object that is behind, the viewer has to reach over or around the object that is in front.

o ON

The bottom of the supported object is in contact with the top of the

supporter.

• Some facts about functions


o TOP-OF

The top of an object is a surface or a corner.

We can find the top of an object by looking for the part of it that is furthest from the earth (assuming we're on the earth).

When an object is turned over, whatever was the top of it is now the bottom of it.

o INSIDE-OF

The inside of an object is a region in space.

If an object is solid, the inside of it is part of it.

o SUPPORTER

Given two objects, one on the other, the supporter of the two (or of the situation) is the one that, if taken away, would cause the other to fall.

Relations, objects, predicates, functions

• Unlike objects, relations take arguments.

• A relation predicated of one or more objects is either true or not.

• In predicate calculus

o Objects and relations are represented by explicit constant symbols:

block23, in 

o Predicates are represented by expressions consisting of relation constants followed by object constants in a fixed order:
(in block23 cup4)

o An alternate way of representing predicates: the arguments are paired with role constants, and their order is unspecified:
(in (container cup4) (contained block23))

o Note that nouns on the one hand and verbs, pre/postpositions, and adjectives on the other hand do not map neatly onto object and relation constants.

o Functions are represented by function constants, and function expressions, like predicates, by a function constant followed by one or more arguments:
(top-of block8)

o Function expressions return objects, so they can replace object symbols in predicates:
(above (bottom-of block8) (top-of block4))

Categories and categorization

• Categorization is about going from lower-level to higher-level representations, for 

example, from a specific object instance to an object category such as BLOCK.

• We need to be able to distinguish instances from categories and to represent their relationship.

• We need to be able to distinguish categories at different levels of abstractness (different taxonomic levels) and represent their relationship.

• Predicate calculus


o Represents object instances and object categories with explicit symbols and their relationship as a predicate: (block obj23).

o Represents the relationship between categories at different levels using variables, universal quantification, and implication:
(forall (?x) (if (block ?x) (prism ?x)))

o An alternative way of representing the relationship between instances and categories and between categories at different taxonomic levels: AKO (a kind of) and ISA relations:
(isa obj23 block)
(ako block prism)

• Representing relation instances

o Not explicit in ordinary predicate calculus; no way to directly express that a particular event belongs to the event category SING.

o Alternative: relation instance (and relation category) symbols.

(sing event23)

(= (singer-of event23) terry) 

Miscellaneous considerations

• Connectives: conjunction (and), disjunction (or), implication (if), equivalence (equiv), negation (not): the truth of each is defined in terms of the truth of its operand(s)

• Existential quantification

(exists (?x) (and (b551-student ?x) (not (know ?x scheme))))

(not (exists (?x) (and (b551-student ?x) (not (know ?x arithmetic)))))

• Sentences, well-formed formulas

• Equivalences of particular expressions (examples)

(equiv (if p q)
       (if (not q) (not p)))

(equiv (if p q)
       (or (not p) q))

(equiv (or p (and q r))
       (and (or p q) (or p r)))

(equiv (not (exists (?x) (p ?x)))
       (forall (?x) (not (p ?x))))

(equiv (forall (?x) (and (p ?x) (q ?x)))
       (and (forall (?x) (p ?x)) (forall (?x) (q ?x))))


• Primitives

o If we opt for a symbolic representation such as predicate calculus, we need an alphabet of basic symbols.

practical, rather than theoretical, considerations.

o We must make decisions about representational granularity.

• Limitations of first-order predicate calculus

o Representing belief and knowledge

(believes al (loves mary al))

(believes al (exists-life mars))

Representing predicate calculus in Scheme

• One possibility: relation constants are procedures, predicates are procedure calls,

returning either #t or #f

• A better option: all predicate calculus expressions take the form of lists, and procedures are ways of manipulating and searching through them, such as infer and true?.

• The second option requires that we build in the definitions of conjunction,

negation, etc.
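A minimal sketch of the second option, for ground (variable-free) sentences only; real versions of infer and true? would also need pattern matching and rule handling. assert and the database layout follow the next section; everything else is an illustrative assumption:

; A database is just a list of sentences, each sentence a list.
(define assert
  (lambda (sentence database)
    (cons sentence database)))

; true? for ground atomic sentences: yes if the sentence is in the
; database, dont-know otherwise (absence is not proof of falsehood).
(define true?
  (lambda (database sentence)
    (if (member sentence database) 'yes 'dont-know)))

(define db (assert '(block b1) '()))
(true? db '(block b1))   ; => yes
(true? db '(ball b2))    ; => dont-know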

What will we do with our representations?

• assert takes a sentence and adds it to a database, which is a conjunction of facts

taken as true.

(assert '(block b1) database) → modified database

• infer takes a database and returns a list of sentences that can be inferred from the

database

(infer '((block b1)

(forall (?b) (if (block ?b) (nfaces ?b six)))))
→ ((nfaces b1 six))

• true? takes a database and a sentence and tells whether the sentence is true, given the database. dont-know is a possibility.

(true? '((block b1)
         (forall (?b) (if (block ?b) (nfaces ?b six))))

'(nfaces b1 six))

→ yes


• fill-in takes a database and a sentence containing one or more variables and

returns bindings for the variables.

(fill-in '((block b1)
           (forall (?b) (if (block ?b) (nfaces ?b six))))
         '(nfaces b1 ?nf))
→ ((?nf six))

• categorize takes a conjunction of facts about an individual and returns a

category for the individual.

(categorize '((nfaces b1 six)
              (height b1 5cm)
              (square-corners b1)
              (forall (?b) (if (and (nfaces ?b six)
                                    (square-corners ?b)
                                    (less-than (height ?b 20cm))
                                    (greater-than (height ?b 2cm)))
                               (block ?b))))
            'b1)
→ (block b1)

• learn takes a conjunction of facts about instances of a category and a category

symbol and returns a generalization about the category.

(learn '((block b1) (nfaces b1 six) (square-corners b1) (color b1 red)
         (block b2) (square-corners b2) (color b2 black)
         ...)
       'block)
→ (forall (?b) (if (and (nfaces ?b six) (square-corners ?b))
                   (block ?b)))

Predicate calculus: practice

Using predicate calculus representation (Scheme format), show how to represent the

knowledge embodied in the following English sentences. As far as possible, show the relationships among the different elements.

When a moving ball strikes a stationary ball, the moving ball is deflected in a direction which is roughly opposite to its original direction, and the originally stationary ball starts moving roughly in the direction of the originally moving ball. Ball 3 struck Ball 2, which was stationary. Ball 3 was moving south when this happened.

(Ignore details like velocity, the effect of the mass of the balls, and the angle at which the moving ball strikes the stationary ball (unless you want to be really ambitious :-). This is very naive physics.)


Search 1

Problem solving as search

• A solution as a state in the space of possible solutions

• Problem solving as search through this state space

Some basic questions concerning problem solving as search

• How can solving the problem be treated as the execution of a sequence of steps?

• Do the steps result in more and more complete solutions, or are candidate

complete solutions available at every step?

• Is it easier (or cheaper) to start at the goal and work backwards?

• Is the consequence of a step predictable? Is there an adversary (another agent who can constrain the possible steps)?

• Is the solution the way a goal is reached or the nature of the goal itself (what's

found at the end)?

• Is it important to know more than one way to reach the goal?

• Is it important that the goal is reached in the most efficient way, the way requiring

the least cost to the agent?

• Is it important that the solution is found quickly?

• Is there a way to estimate the "distance" to the goal?

• How easy is it to reconsider and try a completely different set of steps?

• Is the step size adjustable?

• Is it possible to consider a number of different options in parallel?

• Is the number of potential ways to the goal finite?

Formalizing search

• Problem state: a particular configuration of the objects that are relevant for a

 problem

Figuring out how to represent the problem states is no simple matter.

• State space: the space of all possible problem states

 Normally only a partial state space is considered.

• Initial state: where the search starts

• Goal states: where the search should end

There may be any number of these, and we may be interested in finding only one

or all of them.

• State space search: search through the space of problem states for a goal state

• Search trees: nodes are states (or paths), links are one-step transitions from one

state to a successor state

• Expanding a node (extending a state/path)


• Testing for goal states (nodes)

• The queue of untried states (paths); adding new states (paths) to the queue (stack)

• Branching factor of the search: average number of successor states each state has

• Depth of the search: how far down the tree is extended during the search

Basic schema for search

(Assuming that the path makes a difference, we maintain a queue of paths rather than just

states.)

The algorithm makes use of two procedures specific to the problem

1. A predicate goal?, which takes a state and returns #t if the state is a goal state

2. A procedure expand, which takes a state and returns all of the successor states to

the state

• Form a one-element queue consisting of a zero-length path that contains only the

root node.

• Until the first path in the queue terminates at a goal node (satisfies goal?) or the queue is empty,

o Remove the first path from the queue; create new paths by expanding the first path to all the neighbors of the terminal node.

o Reject all new paths with loops.

o ...

o  Add the new paths, if any, to the queue.

o ...

• If the goal node is found, announce success and return the path to the goal state

found; otherwise, announce failure.
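A sketch of this schema in Scheme (a path is a list of states, most recent state first; goal? and expand are the problem-specific procedures above, and merge-queue anticipates the argument introduced in the next section for choosing where new paths go):

(define search
  (lambda (initial-state goal? expand merge-queue)
    ;; a path is a list of states, most recent state first
    (let loop ((queue (list (list initial-state))))
      (cond ((null? queue) 'failure)
            ((goal? (car (car queue)))
             (reverse (car queue)))   ; return the path, start state first
            (else
             (let* ((path (car queue))
                    ;; extend the first path to every successor of its
                    ;; terminal node, rejecting paths with loops
                    (new-paths
                     (let keep ((states (expand (car path))))
                       (cond ((null? states) '())
                             ((member (car states) path) (keep (cdr states)))
                             (else (cons (cons (car states) path)
                                         (keep (cdr states))))))))
               (loop (merge-queue (cdr queue) new-paths))))))))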

Depth-first, breadth-first, nondeterministic search


• Depth-first search

 Add the new paths to the front of the queue. (That is, use a stack.)

o Backtracks when a dead-end or "futility" limit is reached

o Appropriate when there is a high branching factor 

o May be advantageous when many solutions exist but only one needs to be

foundo May fail to find a solution

• Breadth-first search

Add the new paths to the back of the queue.

o Appropriate when there are long useless branches but not when there is a

high branching factor (time and space complexity is exponential)

o Guaranteed to find a solution (if there is one) and to find the one with the

least steps (though not necessarily the least cost) first

• Nondeterministic search

Add the new paths at random places in the queue.

o When unable to choose between depth-first and breadth-first

To implement the type of search, we can add a merge-queue argument to our search

 procedure. There is a different one of these for each of the ways we will add new paths to

the queue.
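Sketches of these merge-queue procedures (the names are illustrative; the random placement assumes a Scheme with (random n), such as MIT Scheme or Racket):

; Depth-first: new paths go on the front (the queue acts as a stack).
(define depth-first-merge
  (lambda (queue new-paths)
    (append new-paths queue)))

; Breadth-first: new paths go on the back.
(define breadth-first-merge
  (lambda (queue new-paths)
    (append queue new-paths)))

; Nondeterministic: insert each new path at a random position.
(define nondeterministic-merge
  (lambda (queue new-paths)
    (if (null? new-paths)
        queue
        (nondeterministic-merge
         (insert-at (car new-paths) (random (+ 1 (length queue))) queue)
         (cdr new-paths)))))

(define insert-at
  (lambda (item pos lst)
    (if (= pos 0)
        (cons item lst)
        (cons (car lst) (insert-at item (- pos 1) (cdr lst))))))

Each can be passed directly as the merge-queue argument of the search sketch above, e.g. (search start goal? expand depth-first-merge).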

Tower of Hanoi

• To do best-first search using your basic search procedure, you only need to define

a new merge-queue procedure. best-first-merge takes the queue, the new

 paths, and the problem-specific estimate procedure, adds the new paths to the

queue, and then sorts the new queue by the values returned by estimate when

applied to the first state on each path. You can use the Scheme procedure sort to

do this. sort takes a comparison predicate and a list and sorts the list using the

predicate to compare items. (A sketch of best-first-merge appears at the end of this section.)

• Figure out how you will represent states for your particular problem. Here's one

way for Tower of Hanoi.o For the Tower of Hanoi Puzzle, states are lists consisting of a sublist for 

the disks on each peg. Numbers represent the diameters of the disks, and

they are arranged from top to bottom. Thus this is the initial state for the 3-disk puzzle:

((1 2 3) () ()) 

• Write the goal?, expand, estimate, and print-state procedures for your 

 particular problem.

o For the Tower of Hanoi Puzzle, one possibility is sketched at the end of this section.

• Call the search procedure on the problem-specific states and procedures.

o Best-first search on the Tower of Hanoi Puzzle: a trace indicating states

which are searched and the queue at each point in the search
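One possible sketch of best-first-merge and the Tower of Hanoi procedures, following the notes' convention that sort takes the comparison predicate first and then the list (some Schemes reverse this order); the helper names moves-from and move-disk and the estimate used are illustrative:

(define best-first-merge
  (lambda (queue new-paths estimate)
    ;; add the new paths, then sort the whole queue by the estimated
    ;; distance from each path's most recent state to a goal
    (sort (lambda (path1 path2)
            (< (estimate (car path1)) (estimate (car path2))))
          (append new-paths queue))))

; goal?: all disks stacked on the third peg (3-disk puzzle)
(define goal?
  (lambda (state)
    (equal? state '(() () (1 2 3)))))

; estimate: the number of disks not yet on the third peg
(define estimate
  (lambda (state)
    (+ (length (car state)) (length (cadr state)))))

; expand: all states reachable by one legal move (a disk may only
; be placed on an empty peg or on a larger disk)
(define expand
  (lambda (state)
    (let loop ((from 0) (moves '()))
      (if (= from 3)
          moves
          (loop (+ from 1) (append (moves-from state from) moves))))))

(define moves-from
  (lambda (state from)
    (let ((source (list-ref state from)))
      (if (null? source)
          '()
          (let loop ((to 0) (moves '()))
            (cond ((= to 3) moves)
                  ((or (= to from)
                       (and (not (null? (list-ref state to)))
                            (< (car (list-ref state to)) (car source))))
                   (loop (+ to 1) moves))
                  (else
                   (loop (+ to 1) (cons (move-disk state from to) moves)))))))))

(define move-disk
  (lambda (state from to)
    (let ((disk (car (list-ref state from))))
      (let rebuild ((pegs state) (i 0))
        (if (null? pegs)
            '()
            (cons (cond ((= i from) (cdr (car pegs)))
                        ((= i to) (cons disk (car pegs)))
                        (else (car pegs)))
                  (rebuild (cdr pegs) (+ i 1))))))))

With the search sketch given earlier, a best-first run is then (search '((1 2 3) () ()) goal? expand (lambda (q n) (best-first-merge q n estimate))).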


Heuristic search: using estimated distance remaining

• Another procedure specific to the problem: estimate, which takes a state and

estimates the distance from it to a goal state

• Hill climbing

Sort the new paths by the estimated distance left to the goal (using estimate) and add them to the front of the queue.

o Parameter-oriented hill climbing: each problem state is a setting for a set

of parameters

o Problems for hill-climbing: foothills (local maxima), plateaus, ridges

o  Nondeterministic search as a way of escaping from local maxima

o Gradient ascent

For a parameter x and "goodness" g which is a smooth function of x, the change in x should be proportional to the speed with which g changes as a function of x, that is, ∂g/∂x.
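A sketch of one gradient-ascent step, approximating ∂g/∂x with a finite difference (the names, the step size eta, and the probe width dx are illustrative):

; One gradient-ascent step: move x in proportion to the local slope
; of the goodness function g, approximated by a central difference.
(define gradient-ascent-step
  (lambda (g x eta dx)
    (let ((slope (/ (- (g (+ x dx)) (g (- x dx)))
                    (* 2 dx))))
      (+ x (* eta slope)))))

; Example: climbing g(x) = -(x - 3)^2 starting from x = 0
(gradient-ascent-step (lambda (x) (- (expt (- x 3) 2))) 0 0.1 0.001)
; => ~.6 (the slope at 0 is 6, so x moves to 0 + 0.1 * 6)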


• Best-first search

 After adding new paths to the queue, sort all the paths in the queue by the

estimated distance left to the goal (using estimate).

Unlike hill climbing, can jump around in the search space.

Heuristic search: optimal search

• British Museum procedure: blindly find all paths, selecting best

• Branch-and-bound search 

 After adding new paths to the queue, sort all the paths in the queue by the current 

 path length.

Note: When a goal state is found, it is still necessary to extend partial paths which are shorter than the complete one because they may end up shorter overall.

• Using underestimates of distance remaining

 After adding new paths to the queue, sort all the paths in the queue by an

underestimate of total path length (using estimate ).

Underestimates allow you to stop when partial path estimates are longer than the shortest complete path.

• Eliminating redundant paths

 After adding new paths to the queue, if there are two or more paths reaching a

common node, keep only the one with the shortest path length. 

• A*: branch-and-bound with underestimates and redundant paths eliminated
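A sketch of the corresponding A*-style queue merge (same sort convention as the earlier merges; with unit step costs the actual cost of a path is its number of steps, and redundant-path elimination is omitted for brevity):

(define a-star-merge
  (lambda (queue new-paths estimate)
    ;; sort the whole queue by cost so far plus an underestimate
    ;; of the distance remaining to the goal
    (sort (lambda (path1 path2)
            (< (+ (path-length path1) (estimate (car path1)))
               (+ (path-length path2) (estimate (car path2)))))
          (append new-paths queue))))

; with unit step costs, the cost of a path is its number of steps
(define path-length
  (lambda (path)
    (- (length path) 1)))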

Genetic search

• Parameter adjustment, function optimization problems

o Calculus methods

o Blind search and random methods

o Parallel search

• Evolutionary computation 

o What's needed for evolutionary computation (abstract or real) to work 

"Creatures" which1. Give birth to other creatures, passing on their traits to them

2. Die

A way for traits to be passed on: an inherited genotype, in interaction with the environment, results in a phenotype

A way of evaluating the creatures' traits: some aid in survival or reproduction, others don't ("survival of the fittest")

A way of generating new traits: mutation

o How it works

Each creature is born with some combination of traits; it may not be possible to simply figure out what combination works best for 

the environment of the creatures.


Creatures live their lives. Some survive long enough to reproduce

and have offspring.

If a particular trait or combination of traits helps creatures reproduce or live longer, creatures with that trait (those traits) will tend to have more offspring.

Creatures pass on (at least some of) their traits to their offspring. In sexual reproduction, they pass on a combination of the parents' traits.

The percentage of creatures with the good traits should increase on each generation.

There is a small probability that a new creature will end up with

some random traits which it did not inherit from its parent(s). In

this way new traits or combinations of traits can be tried out in theworld.

o Genetic algorithms 

What makes them special

Work from a coding of the parameter set, not the

 parameters themselves

Search from a population of points, not a single point

Use fitness information only, no auxiliary knowledge

Use probabilistic transition rules

Used for 

Parameter-oriented search for problems in which partial solutions are not evaluated (paths are not sought)

Modeling biological evolution

Designing a suitable initial architecture for cognition (say, a neural network)

The basic GA

Operators

Selection: Select individuals in the population for mating on the basis of the individuals' fitness. A common choice is fitness-proportionate selection: the probability of selecting an individual is its relative fitness (implemented through "roulette-wheel sampling")

Crossover: Combine the genomes of two parents to produce the genome of the child. The most common choice: select a position and exchange the substrings before and after that locus between two genomes to create two offspring.

Mutation: Make random changes in the genome of an individual. The most common choice: with a small probability, flip each bit in the genome.

Parameters

n individuals in the population


l loci in each genome

Probability of crossover: pc (often something like .7)

Probability of mutation: pm (often something like .001)

Environment

Fitness function f which evaluates each individual, assigning a quantity to it

Or (less commonly) a "world" which permits some individuals to produce more offspring than others

Algorithm

For each run 

Start with a population of n randomly generated genomes

For each generation 

(Realize the genomes as phenomes (individuals).)

Evaluate each of the individuals with the fitness function.

Until n offspring have been created,

Using the selection operator, choose a pair of parents from the population.

Produce two offspring. Use the crossover operator with probability pc; otherwise produce copies of the parents.

For each locus in each new offspring, apply the mutation operator with probability pm.

Place the resulting offspring in the new population.

Replace the old population with the new.
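A minimal sketch of this loop (assuming a Scheme whose (random n) accepts both integers and reals, as in MIT Scheme; the procedure names and the even population size are illustrative, and fitness decodes a genome as a binary number x and returns x^2, matching the example that follows):

(define random-genome
  (lambda (l)
    (if (= l 0) '() (cons (random 2) (random-genome (- l 1))))))

(define make-population
  (lambda (n l)
    (if (= n 0) '() (cons (random-genome l) (make-population (- n 1) l)))))

; Fitness-proportionate ("roulette-wheel") selection
(define select
  (lambda (population fitness)
    (let* ((scores (map fitness population))
           (spin (random (exact->inexact (apply + scores)))))
      (let loop ((pop population) (scores scores) (spin spin))
        (if (or (null? (cdr pop)) (< spin (car scores)))
            (car pop)
            (loop (cdr pop) (cdr scores) (- spin (car scores))))))))

; One-point crossover: exchange the substrings after a random locus
(define crossover
  (lambda (g1 g2)
    (let ((point (random (length g1))))
      (list (append (prefix g1 point) (suffix g2 point))
            (append (prefix g2 point) (suffix g1 point))))))

(define prefix
  (lambda (lst n) (if (= n 0) '() (cons (car lst) (prefix (cdr lst) (- n 1))))))

(define suffix
  (lambda (lst n) (if (= n 0) lst (suffix (cdr lst) (- n 1)))))

; Mutation: flip each bit with probability pm
(define mutate
  (lambda (genome pm)
    (map (lambda (bit) (if (< (random 1.0) pm) (- 1 bit) bit)) genome)))

; One generation: create n offspring by selection, crossover (with
; probability pc), and mutation (assumes an even population size)
(define next-generation
  (lambda (population fitness pc pm)
    (let loop ((new '()))
      (if (>= (length new) (length population))
          new
          (let* ((p1 (select population fitness))
                 (p2 (select population fitness))
                 (pair (if (< (random 1.0) pc) (crossover p1 p2) (list p1 p2))))
            (loop (cons (mutate (car pair) pm)
                        (cons (mutate (cadr pair) pm) new))))))))

; Fitness for the x^2 example: decode the bits as a binary number x
(define fitness
  (lambda (genome)
    (let loop ((bits genome) (x 0))
      (if (null? bits) (* x x) (loop (cdr bits) (+ (* 2 x) (car bits)))))))

(next-generation (make-population 4 5) fitness .7 .001)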

An example: maximizing a function (x^2)

Generation  Genome     Fitness (relative)  Individuals selected for mating
1           0 1 0 1 0  100 (.06)           1 1 1|1 0
            1 1 1 1 0  900 (.57)           1 1 0|0 0
            1 1 0 0 0  576 (.36)           1|1 1 1 0
            0 0 1 0 0   16 (.01)           0|1 0 1 0
2           1 1 1 0 0  784 (.34)           1 1 1 0|0
            1 1 0 1 0  676 (.29)           1 1 0 1|0
            1 1 0 1 0  676 (.29)           1 1 1|0 0
            0 1 1 1 0  196 (.08)           1 1 0|1 0

(The vertical bars mark the crossover points in the selected individuals.)


Forward and backward chaining

The basic elements

Given a set of assertions (facts), a set of rules, and a goal, prove the goal.

• Assertions

Each a predicate calculus sentence with no variables and no connectives other 

than not.

• Goal

Either a sentence with no variables and no connectives, in which case the goal is to prove that it is true (inferable from the facts and the rules), or an existentially quantified sentence whose variables are to be assigned values (if possible) given the facts and the rules.

• Rules

Each a universally quantified implication with a conjunction of sentences as antecedent and a single sentence as consequent.

An example

• Assertions

o A ball is on a block. (on ball1 block1)

o  A pyramid is above the ball. (above pyramid1 ball1)

• Goals

o  Is the pyramid above the block? (above pyramid1 block1)

o What's above the block? (above ?x block1) 

• Rules

1. If something is on something, it's also above it.
(((on ?x ?y)) (above ?x ?y))

2. If a is above b, b is below a.
(((above ?a ?b)) (below ?b ?a))

3. If b is above c and a is above b, then a is above c.

(((above ?b ?c) (above ?a ?b)) (above ?a ?c)) 

(In Homework 2 you will be doing a restricted version of forward chaining. Because each

rule has only one conjunct in its antecedent, you can iterate through the assertions rather 

than the rules to attempt to find new assertions (extend the current state).)

Forward chaining


Attempt to match the antecedents of rules with the assertions, adding new assertions

 based on the consequents if this is possible, until an assertion matching the goal is added.

To prove: (above pyramid1 block1) 

• The antecedent of the first rule matches (on ball1 block1), so we can assert (above ball1 block1).

• The antecedent of the third rule matches (above ball1 block1) and (above

pyramid1 ball1), so we can assert (above pyramid1 block1). This matches

the goal.

Backward chaining

Attempt to match the consequents of rules with a goal, replacing the goal with new goals based on the antecedent of the rule if this is possible, until all of these goals match

assertions.

To prove: (below block1 ?x) 

• The consequent of the second rule matches the goal, so we can replace the goal

with (above ?x block1)

• The consequent of the first rule matches the goal, so we can replace the goal with

(on ?x block1). This goal matches the assertion (on ball1 block1) with the

variable binding ?x = ball1 
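Both chaining directions rest on matching a pattern that may contain ?variables against an assertion. A sketch of such a matcher (the names are illustrative; bindings are (variable value) pairs, and #f signals failure):

(define match
  (lambda (pattern assertion bindings)
    (cond ((variable? pattern)
           ;; a variable matches anything consistent with its binding
           (let ((bound (assq pattern bindings)))
             (cond ((not bound) (cons (list pattern assertion) bindings))
                   ((equal? (cadr bound) assertion) bindings)
                   (else #f))))
          ((and (pair? pattern) (pair? assertion))
           ;; match element by element, threading the bindings through
           (let ((b (match (car pattern) (car assertion) bindings)))
             (and b (match (cdr pattern) (cdr assertion) b))))
          ((equal? pattern assertion) bindings)
          (else #f))))

; a variable is a symbol whose name begins with ?
(define variable?
  (lambda (x)
    (and (symbol? x)
         (char=? (string-ref (symbol->string x) 0) #\?))))

(match '(above ?x block1) '(above ball1 block1) '())  ; => ((?x ball1))
(match '(on ?x ?y) '(above ball1 block1) '())         ; => #f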

More on representation

Frames

Much of human knowledge seems to be organized in chunks representing types of events:

frames or schemas.

Objects

Consider what we know about CABBAGE: what it looks like, how it tastes, how it's

 prepared, how nutritious it is, how much it costs, what other vegetables and plants it's

related to. Within the CABBAGE frame, there is knowledge about what CABBAGE has. Since CABBAGE is (probably) a basic-level category, there is a lot of knowledge in its

frame.

But some knowledge about CABBAGE is shared with other vegetables, and some knowledge about vegetables is shared with other food items. Also there are subtypes of

CABBAGE such as RED-CABBAGE. Knowledge seems to be organized in an


inheritance (is-a) hierarchy, one sort of ontology. Categories also have default

 properties that can be overridden by subcategories or instances.

Events

Consider what we know about instances of GOING and GIVING.

When something is given, there is a GIVER, a RECEIVER, and a GIVEN-OBJECT.

 Before the giving, the GIVER controls the OBJECT, and RECEIVER doesn't. After the giving, the RECEIVER controls the OBJECT, and the GIVER doesn't. The giving is

consciously initiated by the GIVER, who wants the RECEIVER to control the OBJECT.  

(forall (?g ?r ?o ?t0)
  (if (give ?g ?r ?o ?t0)
      (and (exists (?t1)
             (and (before ?t1 ?t0)
                  (control ?g ?o ?t1)
                  (not (control ?r ?o ?t1))
                  (exists (?t2)
                    (and (before ?t1 ?t2)
                         (goal ?g (control ?r ?o ?t2) ?t1)))))
           (exists (?t3)
             (and (before ?t0 ?t3)
                  (control ?r ?o ?t3)
                  (not (control ?g ?o ?t3)))))))

Though we don't have an English verb for it, there is a more abstract event category that

includes GIVE, STEAL, TAKE, and RECEIVE. And there are subtypes of GIVE such as

DONATE and LEND.

Using frames

• How is knowledge within a frame instantiated when an instance of the category is

created?

• How can we efficiently access inherited knowledge for an instance of a category?

• How can we answer questions about properties using an inheritance hierarchy?

Frame representation

(PHYS-OBJ

(is-a THING)

(color)

(weight)
(shape)

(edibility)

...)

(FOOD-ITEM

(is-a PHYS-OBJ)

(nutritional-value)

(fat-content)

(starch-content)


(vitamin-content)

(source)

(taste)

(preparation

(processing)

(cooking)

(accompanying-ingredients)

(serving))

(availability-form)

(edibility YES)

(english-lex

(neutral "FOOD")

(informal "GRUB"))

(japanese-lex "TABEMONO")

...)

(VEGETABLE

(is-a FOOD-ITEM)

(plant-part)

(plant-type)

(source PLANT)

(nutritional-value HIGH)
(fat-content LOW)

(vitamin-content HIGH)

(english-lex

(neutral "VEGETABLE")

(informal "VEGGIE"))

...)

(LEAF-VEGETABLE

(is-a VEGETABLE)

(plant-part LEAF)

(color GREEN)

(taste BITTER)

...)

(COLE-VEGETABLE

(is-a VEGETABLE)

...)

(CABBAGE

(is-a LEAF-VEGETABLE)

(is-a COLE-VEGETABLE)

(plant-type CABBAGE-PLANT)

(taste CABBAGE-TASTE)

(availability-form VEGETABLE-HEAD)

(shape SPHERICAL)

...)

(RED-CABBAGE

(is-a CABBAGE)

(color PURPLE)

(english-lex "RED CABBAGE")...)

(RED-CABBAGE23

(is-a RED-CABBAGE)

(preparation

(accompanying-ingredients

{MAYONNAISE, MUSTARD, TARRAGON})

(cooking NIL)))

(ABSTRACT-TRANSFER

(source ?s)


(1 i-m-beef)

(2 i-m-green-beans)

(3 i-m-spices))))))

(define instantiate

(lambda (frame bindings)

...))

(define inherit

(lambda (instance role)

...))
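A sketch of inherit, under the assumption that the frames above are collected in a global list *frames*, each of the form (NAME (is-a PARENT) (role filler) ...); an empty slot such as (color) counts as having no local filler, and multiple is-a links are tried in order:

(define lookup-frame
  (lambda (name)
    (assq name *frames*)))   ; *frames* is an assumed global frame list

(define inherit
  (lambda (frame-name role)
    (let ((frame (lookup-frame frame-name)))
      (and frame
           (let ((local (assq role (cdr frame))))
             (if (and local (not (null? (cdr local))))
                 (cadr local)   ; a local filler overrides any default
                 (try-parents (cdr frame) role)))))))

; follow each is-a link in turn until some ancestor supplies a filler
(define try-parents
  (lambda (slots role)
    (if (null? slots)
        #f
        (let ((slot (car slots)))
          (or (and (eq? (car slot) 'is-a)
                   (inherit (cadr slot) role))
              (try-parents (cdr slots) role))))))

(inherit 'RED-CABBAGE 'color)  ; => PURPLE (local value overrides)
(inherit 'RED-CABBAGE 'taste)  ; => CABBAGE-TASTE (inherited from CABBAGE)
(inherit 'CABBAGE 'source)     ; => PLANT (inherited from VEGETABLE)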

Semantic networks

Knowledge in the form of a graph. A single node corresponds to a frame symbol. Two

"styles":

1. labeled links represent many relations and roles

2. all relations and roles are represented by nodes; there are only a small number of 

very general link types

Examples in a particular formalism of type 2 (NETL: Fahlman, 1979)

Types, roles and value, is-a

Individuals, sets

 Insects have six legs. Each leg is jointed. Ladybugs are insects. Sam is a ladybug. Sam's

right rear leg is broken.


Events


An example of a type 1 formalism


Inheritance in a semantic network

Distributed connectionist representation

Elements

A fixed network of nodes (units) that can be active or not (or active to different degrees).

The unlabeled, weighted, modifiable links (connections) between the units represent the


tendency for pairs of units to be co-activated. Any representation is a pattern of 

activation across the network.

A representation is distributed if each element is involved in the representation of multiple concepts and each concept is represented by multiple elements. Units may

represent primitive semantic features or (for example, in holographic representations) may not be directly interpretable.

Instantiation, inheritance

Instantiation does not involve the creation of any new hardware (because the size of the network is fixed) but rather the activation of certain units and the strengthening or weakening of some weights. There is no distinction in the network between individuals and types.

Different types are not represented by separate units in the network.

Abstraction/generality corresponds to uncertainty about the value of units.

Inheritance is in a sense automatic. When we activate a type, for example, on the basis of 

an input word such as cabbage, the pattern represents features of the instance as well as

the type.

Representing commonsense knowledge


Primitives

• Acts (Schank, etc.)

• Semantic roles

Scripts, plans, goals

Knowledge of stereotypical situations helps in understanding language (Schank, etc.).

Mary wanted that camera really bad, so she went and bought a gun. 

Phil went into a restaurant and sat down at a table. A waiter came over after a few minutes, but he said he wasn't ready to order.

Big Projects (CYC, etc.)

The Grounding Problem

• Will the knowledge be in a usable form if it's not tied to perception and action?

• Is it possible to build in commonsense knowledge, or will it have to be learned?

Machine learning: overview

Introduction

• Internal changes in biological systems recorded in memory of one kind or 

another. Change may be more or less permanent, resulting in different (usually

improved) behavior following the change.

Kinds of change and kinds of memory:

o Evolutionary change; genetic memory, genotype

o Development; phenotype

o Learning; long-term memory

o Cultural change; cultural memory

o Processing (temporary change); short-term (working) memory

• Why learn (and develop), rather than evolve? (Miller & Todd)

o Learning (development) allows an organism to build a more complex

phenotype than it could otherwise, given a genotype of a certain size. Environmental regularities can do much of the work of wiring up adaptive behavior-generators.

o Learning allows an organism to make use of the past as well as the here-

and-now. This sort of learning consists in the creation of episodic memories and their retrieval later on.


o Learning allows an organism to adjust its behavior faster than natural

selection would allow. This is advantageous because there may be changes

in the organism's body

the organism's family

the organism's environment

This function dominates thinking about learning but may be less importantthan the other two for most animals.

Some basic concepts

• Availability of feedback 

o Available

Supervised learning: there is access to the correct output

Reinforcement learning: there is access to the goodness of the

actual output The credit assignment problem in reinforcement (and sometimes

supervised) learning: when the behavior of the system is wrong,

what aspect of the system's internals led to the error?

o Unavailable: unsupervised learning, no information about the correctness of outputs

Example: A speech system is to be trained to recognize English words. During an initial phase, the system is simply exposed to samples of

English speech without being told what the content of the speech is. The

hope is that the system will pick up on some of the systematic phonetic properties that characterize English.

• Prior knowledge

o Learning from scratch

o Building on prior knowledge

o Martin's law: You cannot learn anything unless you almost know it

already.

• What is learned

o Stimulus-response behavior 

o Concepts

o Regularities in the environment: cooccurrences, clustering, prediction

o Utility information concerning possible states of the world

o Results of possible actions

o Ways of organizing knowledge internally to maximize performance

Induction

(supervised or reinforcement)

• Learning the representation of a function f  


• Given a collection of examples of  f , find a function h (the hypothesis) which

approximates f  

• Positive and negative examples of the function

• Generalization of the current hypothesis through positive examples

• Specialization of the current hypothesis through negative examples; value of near misses

Example of a bad negative example: A robot is being trained to recognize a soda can. It is shown a can from two different angles. Then in order that it doesn't

 produce too general a concept, it is shown a chair and told that that is not an

example of a soda can. The robot does not seem to improve.

• In general induction is not sound: a hypothesis is not usually a logical conclusion

of the data; numerous hypotheses may be consistent with the data

• Incremental learning: hypothesis is updated whenever an example arrives

Example: A robot is being trained using reinforcement learning to find all of the empty soda cans in the lab and throw them in the recycling bin. This proves too

difficult, so the robot is first trained only to recognize soda cans. Then it is trained

to approach soda cans. ...

Introduction to neural networks

Overview

• Elements

o Units: simple processing elements that respond to the behavior of other 

units via input connections, produce an activation, and send it along output

connections to other units. The activations of all of the units in the network represent the system's short-term memory.

o Connections: Weighted (unlabeled) links between units, multiplying the

activation from the source unit on its way to the destination unit. The

weights along all of the connections in the network represent the system's

long-term memory.

• Formalization

o State 


Vector of activations x(t )

Matrix of weights W

o Task  

Set of input vectors I(t ), possibly infinite

(Sometimes) an associated set of target vectors T(t )

o Dynamics

Discrete (difference equations) or continuous (differential equations)

Activation

x(t+1) = g(h(x(t), W(t), I(t)))

g the activation function, h the input function 

Weight

W(t+1) = f(x(t), W(t), I(t), T(t))

f the learning rule 

• Some differences between models

o What sort of connectivity (reflected in where there are gaps in the weight

matrix)?

o Are there targets (a supervised network)?

o Is this is a feedforward network, a partially recurrent network, or a

completely recurrent (settling, attractor, constraint satisfaction) network?

o Is the activation function threshold or continuous?

o Does the network handle sequences of inputs or just static patterns?

o Are there separate input and output units?

o Does the network have hidden units?

• Running a network 

o Update units

For attractor (feedback) networks, update units (usually randomly

selected) until the network has settled (no further changes in activations occur)

For feedforward (and simple recurrent networks), update each unit

once in a (mostly) fixed sequence

Update a unit

Calculate the input (h) to the unit, the weighted sum of the

activations of units feeding into the unit

Calculate the activation (x) of the unit, a function (g) of the input

Example: activation of a unit with a threshold activation function:

if h_i > θ_i, g(h_i) = 1
else g(h_i) = 0


o Update connections

Following the presentation of a single training pattern or the whole

set of training patterns

Weight changes are usually small changes in a given direction,

determined by a learning rate (η)

The direction and magnitude of the change is usually proportional

to the activation of the source unit and either the activation of the destination unit or some error measure.
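A sketch of a single unit update along these lines (the names are illustrative; activations and weights are lists of numbers):

; h: the weighted sum of the activations feeding into the unit
(define unit-input
  (lambda (activations weights)
    (apply + (map * activations weights))))

; g: a threshold activation function with threshold theta
(define threshold-activation
  (lambda (h theta)
    (if (> h theta) 1 0)))

(define update-unit
  (lambda (activations weights theta)
    (threshold-activation (unit-input activations weights) theta)))

(update-unit '(1 0 1) '(0.5 -0.3 0.4) 0.6)  ; h = 0.9 > 0.6, so => 1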

Supervised learning in neural networks

• Patterns

o Training set: pairs of inputs and targets for training the network 

o Test set: pairs of inputs and targets for testing the network for generalization

• Training

o Training phase: weights are adjusted in response to training set

o Test phase: weights are not adjusted as test set is presented

Feedforward networks

• Appropriate for problems with no interacting bottom-up and top-down effects;

pattern association 

• Usually trained with error-driven learning, a form of supervised learning in

which the change in weights depends on the error, ultimately the difference between the target and the actual output for each output unit

•  Networks with no hidden units, for example, perceptrons 

•  Networks with hidden layers, usually trained with backpropagation 

Perceptrons


Pattern association problems

• Given a representation of one kind of entity (the input), generate a representation

of another (the output).

• Both representations often take the form of patterns, that is, vectors of numbers.

In this case, the inputs and outputs are represented in a distributed way.

• Training (supervised): expose the learner to a number of input patterns and their correct associated output patterns.

• Generalization: given an unfamiliar input pattern, respond with an appropriate

output pattern.

• Examples

o Perceptual input, category output (pattern classification)

o State (perceptual) input, action Q-value output

o Word input, meaning output

o Meaning input, word output

o English input, Spanish output

• Implementation in feedforward neural networks

Feedforward networks and pattern association

• In a feedforward neural network, the units can be thought of as arranged in

separate groups, or layers. The connections joining units in two layers all have the

same direction.

• Some of the units are designated input units. These are clamped to particular 

activations when the network is presented an input pattern. The activation of a

clamped unit does not change.

• Some of the units are designated output units; their activations represent the network's response to the current input, the pattern that the network "believes"

should be associated with the input pattern.

• Each output unit repeatedly updates its activation while the network is "running".

In a feedforward network, each output unit updates its activation once in response to each input pattern.


• Also the possibility of one or more layers of hidden units between the input and

output layers

Perceptrons: how they work 

• The simplest architecture for supervised pattern classification

• Architecture

o Input units + bias (threshold) unit

o Binary output unit; each output unit a separate perceptron

• Input and activation rules

y = δ(∑_{j=1}^{N} w_j x_j + b)

δ(x) = 1 if x > 0, 0 otherwise

• That is, the activation function is a simple linear threshold function.

• Learning

o For each input pattern p, there are three cases:

The pattern is classified correctly. In this case, no changes are

made to the weights.

The target is 1, but the network yielded 0. In this case, we need to

change each weight so that the output will be higher. We can achieve this by adding the input vector (or a fraction of the input

vector, the learning rate η) to the weight vector (the superscript

represents the particular pattern).

Δw = η x^p

The target is 0, but the network yielded 1. In this case, we need to

change each weight so that the output will be lower. We can

achieve this by subtracting the input vector (or a fraction of the input vector) from the weight vector.

Δw = -η x^p

o The three cases can all be expressed with this general rule:

Δw = η x^p (t^p - y^p)

(A Scheme sketch of this rule appears at the end of this section.)

• The Perceptron Convergence Theorem (Rosenblatt)


o If there is a set of weights that solves the problem, then there is a weight vector w* that never yields a sum lying in a region around 0 of width 2ε. That is, for inputs that are supposed to yield a positive output, w* yields an output greater than ε, and for inputs that are supposed to give a negative output, w* yields an output less than -ε.

o The theorem proves that the angle between the current weight vector and w* is bounded for each training pattern by an envelope that decreases with

each presentation of the pattern.

o The theorem does not guarantee that the angle between the current and

final weight vectors will decrease monotonically, only that the envelope

within which this angle is found decreases with the number of updates.

That is, the error may sometimes rise for a given run through the training patterns.

o The theorem only guarantees convergence if there is a set of weights that

solves the problem.
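A sketch of the perceptron rule from this section (the names are illustrative; the bias is folded in as an always-on extra input whose weight plays the role of b; one epoch applies Δw = η x^p (t^p - y^p) to each pattern in turn, and several epochs are usually needed for convergence):

(define perceptron-output
  (lambda (weights inputs)
    (if (> (apply + (map * weights inputs)) 0) 1 0)))

; apply Δw = η x^p (t^p - y^p) to every weight for one pattern
(define train-pattern
  (lambda (weights inputs target eta)
    (let ((y (perceptron-output weights inputs)))
      (map (lambda (w x) (+ w (* eta x (- target y))))
           weights
           inputs))))

; one pass through the training set; patterns are (inputs target)
(define train-epoch
  (lambda (weights patterns eta)
    (if (null? patterns)
        weights
        (train-epoch (train-pattern weights
                                    (car (car patterns))
                                    (cadr (car patterns)))
                     (cdr patterns)
                     eta))))

; Learning logical AND, with the bias input fixed at 1:
(train-epoch '(0 0 0)
             '(((0 0 1) 0) ((0 1 1) 0) ((1 0 1) 0) ((1 1 1) 1))
             0.1)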

Perceptrons: what they can and can't do

• One input

o The input patterns fall along a line; all points on one side of a given value are in the category, those on the other side are outside it. The trainable bias establishes the threshold, and the sign of the single weight determines the direction of category membership on either side of the threshold.

o If the points in the category are broken up by points not in the category,

there is no way for the network to solve the problem.

• Two inputs

o The two weights and the bias define a line


w_1 x_1 + w_2 x_2 + b = 0

with slope -w_1/w_2 and y-intercept -b/w_2. Points on one side of the line turn

on the output unit; points on the other side turn it off. Because the perceptron defines an inequality (two possibilities for each line), three

values are required rather than the two required to define a line.

o The line defined by the weights and bias divides the input space into two

regions. A perceptron in 2-space can only learn to separate two sets of 

 points that are on either side of a line.


• In general, a perceptron can only solve a pattern classification problem if the sets

of patterns are linearly separable; that is, if in N-space, there is a hyperplane of 

N-1 dimensions which separates the two sets of points. N values (N-1 weights and the trainable bias) are needed to specify the desired behavior because in addition

to the hyperplane, we need to say on which side of the hyperplane points are in

the category.• Problems which perceptrons can't solve

o Examples of non-linearly separable sets of patterns

Exclusive OR: (0, 0), (1, 1); (1, 0), (0, 1)

Connectivity

o Solving the problems with additional input dimensions, for example, for 

exclusive OR an input unit that codes for whether the other two inputs are

the same

o Solving the problems with hidden units

 Networks with hidden units with linear activation functions are

equivalent to networks without hidden units

Hidden units must have non-linear activation functions, for

example, a simple threshold function (like perceptron output units)

or sigmoidal or gaussian functions.

One possibility: provide "enough" hidden units connected by

random, hard-wired weights to the input units.

Another possibility: train the input-to-hidden weights. But how?

The delta rule and backpropagation

Activation functions

• The simplest activation function is the identity function: the activation is just the

input.

• Another possibility is a threshold function, which converts inputs above some threshold to a maximum value (usually 1.0) and inputs below the threshold to a minimum value (usually -1.0 or 0.0).


• A third possibility is a "soft" threshold function which smooths out the region near the threshold. The following function, the sigmoid, has a minimum of 0.0 and a maximum of 1.0 (note that these are never reached):

g(h) = 1 / (1 + e^(-h))

(h is the input to the unit.)

The delta rule

• Supervised learning (pattern association) in feedforward networks with multiple output units and continuous activation functions
• The delta rule (least mean squares rule) for supervised learning (when the activation function of the output unit is the identity function):

Δw_ji = η (t_j - x_j) x_i

(The x's represent activations, t represents a target, and η is a learning rate between 0.0 and 1.0.)

• A formal way to derive the delta rule for the more general case
o Gradient descent learning: learning by moving in the direction which looks locally to be the best
o For supervised neural network learning, the best direction to move in "weight space": for each weight, how the global error changes with respect to that weight
o A global error function: for each pattern, the sum of the errors over all of the output units

E = ∑_j ½ (t_j - x_j)^2

o We want to move in "weight space" in a direction opposite to the slope of the error function with respect to each weight, because this will move us toward a region with lower error. The size of the move should be proportional to the magnitude of the slope.


1. To find the slope, we take the partial derivative of the error with respect to the weight. But the only element in the sum of error terms that depends on the weight is the one for the output unit where that weight ends (j in what follows).

∂E/∂w_ji = ∂[½ (t_j - x_j)^2] / ∂w_ji

2. Using the chain rule, we can decompose this derivative into two that are easier to calculate:

(∂[½ (t_j - x_j)^2]/∂x_j) (∂x_j/∂w_ji)

3. The first derivative is easy to figure; it's just

-(t_j - x_j)

4. The second derivative can be decomposed using the chain rule again if we remember that the activation of unit j is a function of the input to the unit, h_j, which is in turn a function of the weights into the unit.

∂x_j/∂w_ji = (∂x_j/∂h_j) (∂h_j/∂w_ji)

5. Since the activation of an output unit is the activation function g applied to the input h, the first derivative on the right-hand side of (4) is just

g′(h_j)

that is, the derivative of whatever the activation function is at the value of the current input to unit j.

6. The second derivative on the right-hand side of (4) can be derived as follows:

∂h_j/∂w_ji = ∂(∑_k w_jk x_k)/∂w_ji = x_i

because none of the other weights or input activations depend on w_ji.

7. Putting all of the parts together, we get

∂E/∂w_ji = -(t_j - x_j) g′(h_j) x_i

8. Remember that we want the weight change to be proportional to the negative of the derivative with respect to the weight. So, with a learning rate to control the step size for weight changes, we get the more general delta (least mean squares) learning rule

Δw_ji = η (t_j - x_j) g′(h_j) x_i
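As an illustration, here is a hedged sketch of this general delta rule for a single layer of sigmoid output units (for the sigmoid, g′(h) = g(h)(1 - g(h)); the OR task and all names are my own):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def delta_rule_epoch(W, b, X, T, eta=0.5):
    """One pass of the general delta rule for one layer of sigmoid output units."""
    for x, t in zip(X, T):
        h = W @ x + b                   # net input to each output unit
        y = sigmoid(h)
        gprime = y * (1.0 - y)          # sigmoid derivative at h
        delta = (t - y) * gprime
        W += eta * np.outer(delta, x)   # Δw_ji = η δ_j x_i
        b += eta * delta
    return W, b

# Hypothetical example: learn OR with one output unit.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [1.]])
W, b = np.zeros((1, 2)), np.zeros(1)
for _ in range(2000):
    W, b = delta_rule_epoch(W, b, X, T)
print(np.round(sigmoid(X @ W.T + b), 2))   # near [0, 1, 1, 1]
```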

Backpropagation

• Problems that are not linearly separable can't be solved by a perceptron (or a network learning with the delta rule).
• Hidden units can solve non-linearly separable problems.
• Backpropagation: a gradient descent algorithm for learning the weights into hidden units as well as output units
• A network with hidden units with linear activation functions is equivalent to a network with no hidden units, so the hidden units must have non-linear activation functions, and these must be differentiable for backpropagation to apply: usually the sigmoid function.


• The learning rule (implemented in the sketch at the end of this section):

Δw_ji = η δ_j x_i

o For output units:

δ_j = (t_j - x_j) g′(h_j)

o For hidden units (k indexes the units in the next higher layer):

δ_j = [∑_k δ_k w_kj] g′(h_j)

• A famous example: NETTALK, the text-to-speech problem


• Some questions and concerns
o Does BP get stuck in local minima?
o Does it take forever to learn the weights?
  Faster as the number of hidden units increases (assuming parallel update)
  Faster with a higher learning rate, within limits
o How does the network solve the problem? What sort of hidden-layer representations does it build? Statistical techniques can be used to analyze hidden-layer representations.
o Does it generalize? Does the network behave appropriately on patterns which it has not been trained on?
  More local vs. more distributed (greater generalization) hidden-layer patterns
  Effect of too many trainable connections: overfitting; the network "memorizes" individual patterns rather than generalizing over them
• Optimization: setting the learning rate and other parameters
• Incremental training: learning a simpler task which enables the learning of a more complex task
• Multiple tasks in a single network
o Catastrophic forgetting: does the network unlearn one set of patterns when trained on a second?
o Does the network fail to learn two interfering tasks which it is trained on simultaneously? Example: the what-where vision problem
o Modularity as a solution to problems of interference
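A minimal backpropagation sketch on exclusive OR, assuming the two-layer architecture and sigmoid units described above (the sizes, learning rate, and random seed are my choices). Depending on the initialization, training can land in a local minimum, the first concern listed above:

```python
import numpy as np
rng = np.random.default_rng(0)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])                 # XOR targets
W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros(2)        # input -> hidden
W2 = rng.normal(0, 1, (1, 2)); b2 = np.zeros(1)        # hidden -> output
eta = 0.5

for _ in range(10000):
    for x, t in zip(X, T):
        hid = sigmoid(W1 @ x + b1)                     # forward pass
        out = sigmoid(W2 @ hid + b2)
        d_out = (t - out) * out * (1 - out)            # output-layer delta
        d_hid = (W2.T @ d_out) * hid * (1 - hid)       # delta propagated back
        W2 += eta * np.outer(d_out, hid); b2 += eta * d_out
        W1 += eta * np.outer(d_hid, x);   b1 += eta * d_hid

for x in X:
    print(x, sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))  # usually near 0,1,1,0
```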

Sequential problems and simple recurrent networks

• Sequence processing: inputs consist of sequences of patterns; the network's output depends on previous patterns as well as the current one
o Prediction: given a partial sequence, predict the next element (pattern)
o Sequence classification
o Parsing
• Sequence processing requires some form of short-term memory (in addition to unit activations)
• Simple recurrent network (Elman net): recurrent connections on the hidden layer with a time delay of one sequence event, usually implemented with a context layer that maintains a copy of the hidden-unit activations from the last time step


• Training an SRN on prediction (sketched below)
o The input and output layers represent a single sequence event.
o During training, sequences of inputs are presented repeatedly.
o On a single training trial, an event is presented to the input layer, and the network is run in the usual fashion, with the context layer treated as another input layer.
o The target is the next event in the sequence. Error is back-propagated, and weights are updated using the backpropagation rule, with context-to-hidden weights treated exactly as input-to-hidden weights.
o Finally, the activations on the hidden layer are copied to the context layer.
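A compact sketch of this training loop (biases are omitted for brevity; the layer sizes and one-hot coding are mine). The context layer enters the forward pass as a second input layer and is overwritten with a copy of the hidden activations after each event:

```python
import numpy as np
rng = np.random.default_rng(1)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

n_in, n_hid = 4, 8
W_in  = rng.normal(0, .5, (n_hid, n_in))    # input -> hidden
W_ctx = rng.normal(0, .5, (n_hid, n_hid))   # context -> hidden (trained like W_in)
W_out = rng.normal(0, .5, (n_in, n_hid))    # hidden -> output
eta = 0.2

def train_on_sequence(seq):
    """seq: list of one-hot vectors; the target for each event is the next event."""
    global W_in, W_ctx, W_out
    context = np.zeros(n_hid)                      # context starts empty
    for x, target in zip(seq[:-1], seq[1:]):
        hid = sigmoid(W_in @ x + W_ctx @ context)  # context acts as extra input
        out = sigmoid(W_out @ hid)
        d_out = (target - out) * out * (1 - out)
        d_hid = (W_out.T @ d_out) * hid * (1 - hid)
        W_out += eta * np.outer(d_out, hid)
        W_in  += eta * np.outer(d_hid, x)
        W_ctx += eta * np.outer(d_hid, context)    # treated like input weights
        context = hid.copy()                       # copy hidden layer to context

seq = [np.eye(4)[i] for i in (0, 1, 2, 3)]         # a fixed toy sequence
for _ in range(500):
    train_on_sequence(seq)
```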

Unsupervised learning

Auto-association and content-addressable memories

• Auto-association: one form of unsupervised learning

o Patterns are associated with themselves
o Purposes: dimensionality reduction, data compression, pattern completion
o Implementation: Hopfield nets (pattern completion only), other constraint satisfaction (settling) networks with hidden layers, feedforward nets (can be trained with backpropagation)
o Content-addressable memories
  Desired behavior:
  When part of a familiar pattern enters the memory system, the system fills in the missing parts (recall).
  When a familiar pattern enters the memory system, the response is a stronger version of the input (recognition).
  When an unfamiliar pattern enters the memory system, it is dampened (unfamiliarity).


  When a pattern similar to a stored pattern enters the memory system, the response is a version of the input distorted toward the stored pattern (assimilation).
  When a number of similar patterns have been stored, the system will respond to the central tendency of the stored patterns, even if the central tendency itself never appeared (prototype effects).
o Discrete Hopfield networks

Basic properties
  CAM (content-addressable memory)
  Potentially completely recurrent
  Symmetric weights
  Activation rule (θ is a threshold; sgn() is 1 if its argument is positive, -1 otherwise):

x_i(t + 1) = sgn(∑_j w_ij x_j(t) - θ_i)

  Settling: asynchronous, random update
  Training: single presentation of each pattern
  Each training pattern should yield a (fixed-point) attractor.
Stability
  Lyapunov stability: if there is a function of the network state which decreases or stays the same as the network is updated, then the network is asymptotically stable.
  Energy of the network (a Lyapunov function):

E = -½ ∑_i ∑_j w_ij x_i x_j

  For symmetric weights, this can be rewritten as

E = C - ∑_(ij) w_ij x_i x_j

where (ij) ranges over distinct pairs of indices, and C is a constant.
  The activation rule minimizes energy. Assuming no thresholds, for a given updated unit i, either its activation is unchanged, in which case the energy is unchanged, or it is negated, in which case x_i and ∑_j w_ij x_j have opposite signs, and x′_i = -x_i, where x′_i is the activation of unit i following the update. Then the difference between the energy after and before the update of unit i is

E′ - E = -∑_{j≠i} w_ij x′_i x_j + ∑_{j≠i} w_ij x_i x_j
      = 2 ∑_{j≠i} w_ij x_i x_j


      = 2 x_i ∑_{j≠i} w_ij x_j
      = 2 x_i ∑_j w_ij x_j - 2 w_ii

But both of these terms are non-positive (the first because, after a flip, x_i and ∑_j w_ij x_j have opposite signs; the second as long as w_ii ≥ 0), so, for asynchronous updates, the energy always either remains the same or decreases.
Learning
  Storing Q memories in a Hopfield net
  Hebbian learning: the weight on the connection joining two units is proportional to the correlation between their activations; with one common (unnormalized) convention,

w_ij = ∑_p x_i^p x_j^p

  For one pattern p, we get stability if, for all i:

x_i^p = sgn(∑_j w_ij x_j^p)

  With Q stored patterns, the input to unit i for pattern p is

∑_j w_ij x_j^p = N x_i^p + ∑_j ∑_{q≠p} x_i^q x_j^q x_j^p

  If the magnitude of the second term, the crosstalk term, is less than N, then pattern p is stable.
Capacity of a network
  Crosstalk between patterns
  The number of random patterns storable is proportional to N if a small percentage of errors is tolerated (roughly 0.14 N), but it is quite small. (A sketch of a Hopfield memory follows.)
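A minimal discrete Hopfield sketch under the conventions above (±1 units, unnormalized Hebbian weights, zeroed diagonal, asynchronous random updates); the pattern count and size are arbitrary:

```python
import numpy as np
rng = np.random.default_rng(2)

def store(patterns):
    """Hebbian storage: w_ij = sum_p x_i^p x_j^p, with w_ii = 0."""
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0)
    return W

def recall(W, x, steps=500):
    """Asynchronous, random-order settling from a probe pattern."""
    x = x.copy()
    for _ in range(steps):
        i = rng.integers(len(x))
        x[i] = 1 if W[i] @ x > 0 else -1
    return x

patterns = rng.choice([-1, 1], size=(3, 50))   # 3 random 50-unit patterns
W = store(patterns)
probe = patterns[0].copy()
probe[:10] *= -1                               # corrupt part of pattern 0
print(np.array_equal(recall(W, probe), patterns[0]))   # usually True (recall)
```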

Competitive learning

• What it is

o Winner-take-all output units compete to classify input patterns, only one

(roughly) coming on at a time

o Clustering unlabelled data

o Categorization, vector quantization

• Simple, single-layer competitive learning


o Binary output units fully connected to (usually binary) input units by non-negative weights
o Only one output unit fires at a time: the one whose input weight vector is closest to the input vector (i* is the winning unit):

|w_i* - x| ≤ |w_i - x|   (for all i)

For normalized weights, the winner is always the one with the highest input (the dot product of the input pattern and the weight vector):

w_i* ⋅ x ≥ w_i ⋅ x   (for all i)

o The winner-take-all process can be implemented by simply picking the unit with the highest activation or through lateral inhibitory connections.
o Learning (see the sketch after this list)
  Weights are initially random.
  For each input pattern, update the weights into the winning unit only.
  The standard rule moves the winning weight vector directly toward the input pattern:

Δw_i*j = η (x_j - w_i*j)

  Because losers are not activated, the rule is equivalent to (y_i is the activation of the ith category unit, 1 for the winner and 0 otherwise):

Δw_ij = η y_i (x_j - w_ij)

  Geometric analogy (figure not reproduced here)


o Problem of "dead units", units which start out far away from input patterns

and never win
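A sketch of the single-layer rule above on three hypothetical clusters (winner chosen by Euclidean distance, no weight normalization; dead units can still occur with unlucky initial weights):

```python
import numpy as np
rng = np.random.default_rng(3)

def competitive_learning(X, n_units=3, eta=0.1, epochs=50):
    W = rng.random((n_units, X.shape[1]))              # random initial weights
    for _ in range(epochs):
        for x in X:
            winner = np.argmin(np.linalg.norm(W - x, axis=1))  # closest weight vector
            W[winner] += eta * (x - W[winner])         # move winner toward input
    return W

# Three toy clusters in the unit square (my own data).
X = np.vstack([rng.normal(c, 0.05, (20, 2)) for c in ([.2, .2], [.8, .2], [.5, .8])])
print(np.round(competitive_learning(X), 2))            # weights near cluster centers
```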

• Feature maps
o Networks in which the location of an output unit conveys information
o Output units have fixed positions in one-, two-, or three-dimensional grids
o Topology-preserving map from the space of possible inputs to the line, plane, or cube of the output units
  A mapping that preserves neighborhood relations
  As two input patterns get closer in input space, the winning output units get closer in output space.
o Self-organizing feature maps (Kohonen nets), one type of feature map architecture
  The neighborhood relations in the output array are built into the learning rule. Weights into many (sometimes all) units are changed on each update (there may be no dead units).
  Winning output unit i*:

|w_i* - x| ≤ |w_i - x|   (for all i)


Learning rule:

Δw_ij = η Λ(i, i*) (x_j - w_ij)

Λ(i, i*) = 1, for i = i*

  The neighborhood function falls off with the distance |r_i - r_i*| in the output array, where the r vectors are the coordinates of the units in the output space.
  The network behaves as an elastic net in which the weight vector of the winner is dragged toward the input vector and the weight vectors of neighboring units are pulled along with it. Nearby units come to respond to nearby input patterns.
  A typical neighborhood function is a Gaussian:

Λ(i, i*) = exp(-|r_i - r_i*|^2 / (2σ^2))

  Both σ and η start large and are decreased during training.
  The result is sensitive to the probability of the inputs as well as their location in input space: more output units are associated with regions of higher probability.
  1-to-1, 2-to-1, and 2-to-2 mappings
  Convergence
  Usually in two stages: (1) untangling, (2) detailed adapting
  Kinds of tangles: twists (2 dimensions), kinks (1 dimension)
  Example applications (see the sketch below)
  Robot joint angles, rather than actual positions, as input
  Phoneme similarity
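A one-dimensional Kohonen map sketch using the Gaussian neighborhood above, with η and σ both decayed during training (all constants are my choices):

```python
import numpy as np
rng = np.random.default_rng(4)

n_units, eta, sigma = 10, 0.5, 3.0
W = rng.random((n_units, 1))                   # weight vectors for 1-D inputs
r = np.arange(n_units, dtype=float)            # unit coordinates in the output array

for step in range(2000):
    x = rng.random(1)                          # input drawn uniformly from [0, 1)
    winner = np.argmin(np.abs(W - x).ravel())
    Lam = np.exp(-(r - winner) ** 2 / (2 * sigma ** 2))   # Gaussian neighborhood
    W += eta * Lam[:, None] * (x - W)          # drag winner and neighbors toward x
    eta *= 0.999; sigma *= 0.999               # both decrease during training
print(np.round(np.sort(W.ravel()), 2))         # roughly evenly spaced over [0, 1]
```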

Reinforcement learning

Markov decision processes

• The agent and the environment (the world)
• Discrete time
• States
At each time step, the agent's sensory/perceptual system returns a state, x_t, a representation of its current situation in the environment, which may be errorful and normally misses many aspects of the "real" situation.
• Actions
At each time step, the agent has the option of executing one of a finite set of possible actions, u_t, each of which potentially puts it in a new state.


• Reinforcements: rewards and punishments
In response to the agent's action in a particular state, the world provides a reinforcement.
• The reinforcement function in the world: r(x, u)
• The next-state function in the world: s(x, u)

An example

Simple reinforcement learning

• The goal: to learn a value for each state-action pair
• One possibility:

Q(x_t, u_t) = r(x_t, u_t)

• But this bases too much on a single instance of reinforcement. We need to learn in smaller steps:

Q_new(x_t, u_t) = (1 - η) Q_old(x_t, u_t) + η r(x_t, u_t)

where η is a learning rate between 0 and 1.
• But so far the algorithm can only learn in response to immediate reinforcement. What about delayed reinforcement?

Q learning

• The real value of an action in a state (optimal Q) depends not only on immediate reinforcement but also on reinforcements that can be received later as a result of the next state the agent gets to.
• The value (estimated Q) that an agent stores for each state-action pair should reflect how much reinforcement it will receive immediately and in the future if it takes that action in that state.
• Policy: a way of using the stored Q values to select actions.
• More precisely, an optimal Q value for a given state and action is the sum of all reinforcements received if that action is taken in the state and the agent then follows the optimal policy specified by the other Q values. A first definition:

Q_opt(x_t, u_t) = r(x_t, u_t) + max_{u_{t+1}} [Q_opt(x_{t+1}, u_{t+1})]

• But this causes problems because there may be many, even an infinite number of, future reinforcements. We need to weight the future by a discount rate (γ) between 0 and 1:

Q_opt(x_t, u_t) = r(x_t, u_t) + γ max_{u_{t+1}} [Q_opt(x_{t+1}, u_{t+1})]

• To approach optimal Q values, the learner starts with 0 or random values for each state-action pair, then updates the values gradually using the reinforcement received and what it thinks is the best Q value for the next state:

Q_new(x_t, u_t) = (1 - η) Q_old(x_t, u_t) + η {r(x_t, u_t) + γ max_{u_{t+1}} [Q_old(x_{t+1}, u_{t+1})]}

An example

γ = 0.8, η = 0.5, and all Q values are initialized at 0. In the chart, "new" means the reinforcement received plus the discounted maximum value of the next state. The "new" value is combined with the "old" using the learning rate to give the updated Q value appearing in the next line of the chart. (Note: in this example, in order to illustrate how the agent can learn to "look ahead", the agent is effectively picked up after it reaches the goal state and dropped back in state 1. There is no "natural" way of reaching state 1 from state 4.)

x    Q(1,r)  Q(2,r)  Q(2,l)  Q(3,r)  Q(3,l)  Q(4,l)   new   u
1      0       0       0       0       0       0       0    r
2      0       0       0       0       0       0       0    r
3      0       0       0       0       0       0       1    r
4      0       0       0      .5       0       0       0    l
1      0       0       0      .5       0       0       0    r
2      0       0       0      .5       0       0      .4    r
3      0      .2       0      .5       0       0       1    r
4      0      .2       0     .75       0       0       0    l
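The chart can be reproduced with a short sketch. The corridor environment below is inferred from the chart (moving right from state 3 reaches the goal, state 4, with reinforcement 1; acting in state 4 is treated as terminal, after which the agent is dropped back in state 1):

```python
gamma, eta = 0.8, 0.5
Q = {(x, u): 0.0 for x in (1, 2, 3, 4) for u in ('l', 'r')}

def step(x, u):
    """Deterministic corridor: reward 1 for taking 'r' in state 3 (reaching the goal)."""
    r = 1.0 if (x, u) == (3, 'r') else 0.0
    nxt = 1 if x == 4 else (x + 1 if u == 'r' else max(1, x - 1))
    return r, nxt

# Follow the chart's action sequence for two passes through the corridor.
x = 1
for u in ['r', 'r', 'r', 'l', 'r', 'r', 'r', 'l']:
    r, nxt = step(x, u)
    # "new" = reinforcement plus discounted max of next state (0 at the terminal state)
    new = 0.0 if x == 4 else r + gamma * max(Q[(nxt, v)] for v in ('l', 'r'))
    Q[(x, u)] = (1 - eta) * Q[(x, u)] + eta * new
    x = 1 if x == 4 else nxt

print({k: v for k, v in Q.items() if v})   # {(3,'r'): 0.75, (2,'r'): 0.2}
```

The printed values match the last line of the chart: Q(3, r) = 0.75 and Q(2, r) = 0.2.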

Making decisions

o How is the agent to pick an action? One possibility is exploitation: pick the action that has the highest Q value for the current state.
o But the agent can only learn about the value of actions that it tries. Thus it should try a variety of actions. This may involve ignoring what it thinks is best some of the time: exploration.
o Exploration makes more sense early in learning, when the agent doesn't know much.


o One possibility for selecting an action: pick the "best" action with probability

P = 1 - e^(-Ea)

where a is the number of training samples (the "age" of the agent). With E = 0.1 this probability approaches 1 quickly as the agent ages; with E = 0.01 it approaches 1 much more slowly (the original plots of P against age are not reproduced here).
o A smarter possibility would be to have the probability of picking an action depend on how high its value is relative to the values of all of the other possible actions. One standard way is a Boltzmann (softmax) distribution (a sketch follows):

P(u_t) = e^(Q(x_t, u_t)/T) / ∑_v e^(Q(x_t, v)/T)

where the v's range over all of the possible actions in state x_t, and T is a temperature that controls how greedy the choice is.
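A sketch of that idea (the temperature parameter T is my addition; lower T makes the choice greedier):

```python
import numpy as np
rng = np.random.default_rng(5)

def softmax_action(q_values, T=1.0):
    """Boltzmann action selection over the Q values of one state."""
    prefs = np.exp((q_values - np.max(q_values)) / T)   # subtract max for stability
    probs = prefs / prefs.sum()
    return rng.choice(len(q_values), p=probs)

q = np.array([0.0, 0.2, 0.75])   # Q values for three actions in some state
print([softmax_action(q, T=1.0) for _ in range(10)])   # action 2 chosen most often
```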

Implementing Q learning

• A lookup table: a Q value for each state-action pair
• But in the real world the number of states may be very large, even infinite. Distributed representations of states permit
o More efficient coding
o Generalization to novel states
• A neural network
o Inputs are distributed representations of states.
o Outputs are Q values for each action (represented locally).
o Weights represent associations of state features with actions.
o Error-driven learning: for the selected action, the target is the "new" Q value from the Q learning rule.

Concept learning

Concepts (categories)

• Features

o Sufficient features and decision trees
o Necessary features

o Typical features

o Prohibited features

• Which features are relevant for concepts (of particular types)?

• Where does the set of features come from?

• One-shot learning

• Quine's problem

Positive and negative examples, generalization and

specialization

• Evolving hypotheses

o Current best guess

o Current set consistent with examples

• Generalization

o In response to a positive example that, according to the current hypothesis, should not be an example: a false negative


o Variabilization

o Assigning a more general type to an element

o Dropping features from a conjunction of features

o Making a disjunction of features

• Specialization
o In response to a negative example that, according to the current hypothesis, should be an example: a false positive

o The value of near misses

o Adding features to a conjunction of features

o Assigning a more specific type to an element

o Prohibiting a feature

• Generalization (left) and specialization (right) illustrated (figure not reproduced here)


• An example of current-best-guess learning (that fails): SAME-COLOR
o A positive example
  Object 1 is to the left of object 2.
  Object 1 is a square.
  Object 2 is a square.
  Object 1 is red.
  Object 2 is red.
o Another positive example
  Object 1 is to the left of object 2.
  Object 1 is a rectangle.
o A negative example
  Object 1 is to the left of object 2.
  Object 1 is a rectangle.
  Object 2 must not be yellow.
o The learner fails because of the inadequacy of the representation; learning relies in part on perception.

Version Space learning (Mitchell)

• Incremental learning vs. batch learning
• Maintaining the set of all hypotheses that are consistent with the set of positive and negative examples seen so far
• Version Space learning: an incremental, least-commitment algorithm (makes no arbitrary choices)
• Version space: the set of all hypotheses consistent with the examples seen so far
• Version graph: a directed acyclic graph in which the nodes are elements of the version space and there is an arc from node p to node q iff p is less general than q and there is no node r that is more general than p and less general than q.
• Positive examples lead to the elimination of hypotheses that are too specific.
• Negative examples lead to the elimination of hypotheses that are too general.
• Problem: the size of the version graph
• Solution: use of boundary sets, sets of hypotheses defining boundaries on which hypotheses are consistent with the examples
o Most general boundary set (G-set): every member is consistent with the examples, and there are no more general consistent hypotheses
o Most specific boundary set (S-set): every member is consistent with the examples, and there are no more specific consistent hypotheses
• Updating the boundary sets, given a new example (the rules for the G-set and S-set are symmetric; a sketch of the algorithm follows)
o G-set
  If the example is positive, exclude any hypotheses that do not cover the example.
  If the example is negative, find the most general set within or below the old G-set and above the S-set which fails to cover the example.
o S-set
  If the example is positive, find the most specific set within or above the old S-set and below the G-set which covers the example.
  If the example is negative, exclude any hypotheses that cover the example.
• Update until
o there is exactly one concept left in the version space, or
o the version space collapses (either the S-set or the G-set becomes empty), or
o there are no more examples.
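A compact candidate-elimination sketch for conjunctive hypotheses over discrete attributes ('?' means "any value"; the two-attribute toy domain and examples are mine). Positive examples minimally generalize S and prune G; negative examples prune S and minimally specialize G:

```python
domains = [('red', 'blue'), ('square', 'circle')]      # hypothetical attribute values

def covers(h, x):
    return all(a == '?' or a == v for a, v in zip(h, x))

def more_general(h1, h2):                              # h1 at least as general as h2?
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def update(S, G, x, positive):
    if positive:
        G = [g for g in G if covers(g, x)]             # drop G members missing x
        S = [tuple(a if a == v else '?' for a, v in zip(s, x)) for s in S]
        S = [s for s in S if any(more_general(g, s) for g in G)]
    else:
        S = [s for s in S if not covers(s, x)]         # drop S members covering x
        newG = []
        for g in G:
            if not covers(g, x):
                newG.append(g); continue
            for i, vals in enumerate(domains):         # specialize one '?' at a time
                if g[i] == '?':
                    for v in vals:
                        if v != x[i]:
                            sp = g[:i] + (v,) + g[i + 1:]
                            if any(more_general(sp, s) for s in S):
                                newG.append(sp)
        G = [g for g in newG
             if not any(more_general(h, g) and h != g for h in newG)]
    return S, G

S, G = [('red', 'square')], [('?', '?')]               # seeded with a first positive
S, G = update(S, G, ('red', 'circle'), positive=True)  # another positive example
S, G = update(S, G, ('blue', 'circle'), positive=False)  # a negative example
print(S, G)   # S = G = [('red', '?')]: the version space has converged
```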

• Example (a figure-based walkthrough, not reproduced here)


• Another detailed example, from a class by Julian Francis Miller when he was at

the University of Birmingham

ID3 (decision tree) learning (Quinlan)

• ID3: batch learning of concepts using decision trees

• A set of classified examples:

#   class  body covering  habitat  flies?  breathes air?
1   m      hair           land     no      yes
2   m      hair           land     yes     yes
3   m      other          water    no      yes
4   b      feathers       land     yes     yes
5   b      feathers       land     no      yes
6   f      scales         water    no      no
7   f      scales         water    yes     no
8   f      other          water    no      no
9   f      scales         water    no      yes
10  r      scales         land     no      yes

• Alternate decision trees that successfully classify the data, differing in the number of decisions (figures not reproduced here)
• Creating a decision tree
For each decision point,
o If all remaining examples are classified, we're done.


o Else if there are some unclassified examples left and attributes left, pick the remaining attribute which is the "most important", the one which tends to divide the remaining examples into homogeneous sets.
o Else if there are no examples left, no such example has been observed; return a default.
o Else if there are no attributes left, examples with the same description have different classifications: noise, insufficient attributes, or a nondeterministic domain.

• Selection of the "most important" attribute
o For possible answers v_1, ..., v_n to a question with probabilities P(v_i), the information content of the answer is

I(P(v_1), ..., P(v_n)) = ∑_i -P(v_i) log2 P(v_i)

o In our example the information content of a tree that classifies the data is

I(P(m), P(b), P(f), P(r)) = -P(m) log2 P(m) - P(b) log2 P(b) - P(f) log2 P(f) - P(r) log2 P(r)
= -0.3 log2 0.3 - 0.2 log2 0.2 - 0.4 log2 0.4 - 0.1 log2 0.1
= (0.3)(1.73) + (0.2)(2.32) + (0.4)(1.32) + (0.1)(3.32) = 1.843

o Each decision point adds to the information content. To determine the information gain, we subtract the information content left after the decision from the total. The information content left after the decision is the weighted sum of the information content in each of the groups resulting from the decision. At each decision point, the algorithm picks the attribute of those left that maximizes the information gain.
o For the first decision point, the four attributes classify the examples in this way:
o body covering: hair(M: 2), feathers(B: 2), scales(F: 3, R: 1), other(M: 1, F: 1)
o habitat: land(M: 2, B: 2, R: 1), water(M: 1, F: 4)
o flies?: yes(M: 1, B: 1, F: 1), no(M: 2, B: 1, F: 3, R: 1)
o breathes air?: yes(M: 3, B: 2, F: 1, R: 1), no(F: 3)
o For each of these, the information content remaining after the decision is:
o body covering: (0.2)(0) + (0.2)(0) + (0.4)(0.811) + (0.2)(1.0) = 0.524
o habitat: (0.5)(1.524) + (0.5)(0.722) = 1.123
o flies?: (0.3)(1.586) + (0.7)(1.845) = 1.767
o breathes air?: (0.7)(1.845) + (0.3)(0) = 1.291
Therefore the appropriate first choice is the "body covering" attribute.
o For the "scales" branch below this decision point, the three remaining attributes classify the four examples as follows (the weights used below are fractions of the full ten-example set; since they are proportional to fractions of the four remaining examples, the ranking is unaffected):
o habitat: land(R: 1), water(F: 3)
o flies?: yes(F: 1), no(F: 2, R: 1)
o breathes air?: yes(F: 1, R: 1), no(F: 2)


o For each of these, the information content remaining after the decision is:
o habitat: (0.1)(0) + (0.3)(0) = 0
o flies?: (0.1)(0) + (0.3)(0.918) = 0.275
o breathes air?: (0.2)(1) + (0.2)(0) = 0.2
Therefore "habitat" is the right choice for this branch. (A sketch of these computations follows.)

Some implications of AI

Philosophy and cognitive science

• Is intelligence something that can be defined and studied abstractly, independently of the details of the intelligent agent? Does intelligence require a body?
• Are there different intelligences (logics?), each with its own advantages for a given environment, that could collaborate in solving problems?
• Where does the mind stop and the "outside world" begin?
• Are there fundamental differences between human and "animal" intelligence, or are all of the differences just a matter of degree?
• How much of human intelligence is cultural, as opposed to genetic?

Society

• How can AI give certain groups of people (more) power over other groups of people?
• Who benefits from applied AI (medicine, law, design, education, information retrieval, natural language processing, the stock market, marketing, transportation, entertainment, agriculture, the military)?
• Who funds AI research?
• How might AI be used to further
o free, independent media; equal access to information; a better informed public?
o a public capable of making rational economic and political decisions?
o international (intercultural) understanding (tolerance)?
o equal access to (or equitable distribution of) the world's resources?
o protection of the environment?

Artificial neural network

An artificial neural network (ANN) or commonly just neural network (NN) is an interconnected group of artificial neurons that uses a mathematical model or computational model for information processing based on a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network.


(The term "neural network" can also mean biological-type systems.)

In more practical terms, neural networks are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.

A neural network is an interconnected group of nodes, akin to the vast network of neurons in the human brain.

More complex neural networks are often used in Parallel Distributed Processing.

Background

There is no precise agreed definition among researchers as to what a neural network is, but most would agree that it involves a network of simple processing elements (neurons) which can exhibit complex global behavior, determined by the connections between the processing elements and element parameters. The original inspiration for the technique came from examination of the central nervous system and its neurons (with their axons, dendrites and synapses), which constitute one of its most significant information processing elements (see Neuroscience). In a neural network model, simple nodes (called variously "neurons", "neurodes", "PEs" ("processing elements") or "units") are connected together to form a network of nodes — hence the term "neural network." While a neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.

desired signal flow.

These networks are also similar to the biological neural networks in the sense that

functions are performed collectively and in parallel by the units, rather than there being aclear delineation of subtasks to which various units are assigned (see also

connectionism). Currently, the term Artificial Neural Network (ANN) tends to refer 

mostly to neural network models employed in statistics, cognitive psychology andartificial intelligence.  Neural network models designed with emulation of the central

nervous system (CNS) in mind are a subject of  theoretical neuroscience.

In modern software implementations of artificial neural networks the approach inspired

 by biology has more or less been abandoned for a more practical approach based on

statistics and signal processing. In some of these systems neural networks, or parts of neural networks (such as artificial neurons) are used as components in larger systems that

combine both adaptive and non-adaptive elements. While the more general approach of such adaptive systems is more suitable for real-world problem solving, it has far less to

do with the traditional artificial intelligence connectionist models. What they do however 

have in common is the principle of non-linear, distributed, parallel and local processingand adaptation.

Models

Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are essentially simple mathematical models defining a function f : X → Y. Each type of ANN model corresponds to a class of such functions.

The network in artificial neural network

The word network in the term 'artificial neural network' arises because the function f(x) is defined as a composition of other functions g_i(x), which can further be defined as a composition of other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables. A widely used type of composition is the nonlinear weighted sum,

f(x) = K(∑_i w_i g_i(x)),

where K is some predefined function, such as the hyperbolic tangent. It will be convenient for the following to refer to a collection of functions g_i as simply a vector g = (g_1, g_2, ..., g_n).


(Figure: ANN dependency graph)

This figure depicts such a decomposition of f, with dependencies between variables indicated by arrows. These can be interpreted in two ways.

The first view is the functional view: the input x is transformed into a 3-dimensional vector h, which is then transformed into a 2-dimensional vector g, which is finally transformed into f. This view is most commonly encountered in the context of optimization.

The second view is the probabilistic view: the random variable F = f(G) depends upon the random variable G = g(H), which depends upon H = h(X), which depends upon the random variable X. This view is most commonly encountered in the context of graphical models.

The two views are largely equivalent. In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of g are independent of each other given their input h). This naturally enables a degree of parallelism in the implementation.

(Figure: Recurrent ANN dependency graph)

Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted in the manner shown at the top of the figure, where f is shown as being dependent upon itself. However, there is an implied temporal dependence which is not shown. What this actually means in practice is that the value of f at some point in time t depends upon the values of f at zero or more other points in time. The graphical model at the bottom of the figure illustrates the case where the value of f at time t only depends upon its last value. Models such as these, which have no dependencies on the future, are called causal models.


See also: graphical models

Learning

However interesting such functions may be in themselves, what has attracted the most interest in neural networks is the possibility of learning, which in practice means the following:

Given a specific task to solve and a class of functions F, learning means using a set of observations in order to find f* ∈ F which solves the task in an optimal sense.

This entails defining a cost function C : F → R such that, for the optimal solution f*, C(f*) ≤ C(f) for all f ∈ F (no solution has a cost less than the cost of the optimal solution).

The cost function C is an important concept in learning, as it is a measure of how far away we are from an optimal solution to the problem that we want to solve. Learning algorithms search through the solution space in order to find a function that has the smallest possible cost.

For applications where the solution depends on some data, the cost must necessarily be a function of the observations; otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model f which minimizes the expected squared error C = E[(f(x) − y)^2] for data pairs (x, y) drawn from some distribution D. In practical situations we would only have N samples from D and thus, for the above example, we would only minimize the empirical cost C = (1/N) ∑_i (f(x_i) − y_i)^2. Thus, the cost is minimized over a sample of the data rather than the true data distribution.
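A tiny sketch of this distinction, minimizing over N samples rather than the true distribution (the linear toy data are mine):

```python
import numpy as np
rng = np.random.default_rng(6)

def empirical_cost(f, xs, ys):
    """Mean squared error over N samples: (1/N) sum (f(x_i) - y_i)^2."""
    return np.mean((f(xs) - ys) ** 2)

xs = rng.uniform(-1, 1, 1000)
ys = 2 * xs + rng.normal(0, 0.1, 1000)            # hypothetical data: y = 2x + noise
print(empirical_cost(lambda x: 2 * x, xs, ys))    # near the noise variance, 0.01
print(empirical_cost(lambda x: 0 * x, xs, ys))    # much larger for a bad model
```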

When N → ∞, some form of online learning must be used, where the cost is partially minimized as each new example is seen. While online learning is often used when the distribution D is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online learning is frequently also used for finite datasets.

See also: Optimization (mathematics), Statistical Estimation, Machine Learning


Choosing a cost function

While it is possible to arbitrarily define some ad hoc cost function, frequently a particular cost will be used either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the cost function will depend on the task we wish to perform. The three main categories of learning tasks are overviewed below.

Learning paradigms

There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning and reinforcement learning. Usually any given type of network architecture can be employed in any of these tasks.

Supervised learning

In supervised learning, we are given a set of example pairs (x, y), x ∈ X, y ∈ Y, and the aim is to find a function f : X → Y in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data, and it implicitly contains prior knowledge about the problem domain.

A commonly used cost is the mean-squared error, which tries to minimise the average error between the network's output, f(x), and the target value y over all the example pairs. When one tries to minimise this cost using gradient descent for the class of neural networks called multi-layer perceptrons, one obtains the well-known backpropagation algorithm for training neural networks.

Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.

Unsupervised learning

In unsupervised learning, we are given some data x, and the cost function to be minimised can be any function of the data x and the network's output, f.

The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).


As a trivial example, consider the model f(x) = a, where a is a constant, and the cost C = (E[x] − f(x))^2. Minimising this cost will give us a value of a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and y. In statistical modelling, it could be related to the posterior probability of the model given the data. (Note that in both of those examples those quantities would be maximised rather than minimised.)

Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.

Reinforcement learning

In reinforcement learning, data x is usually not given, but generated by an agent's interactions with the environment. At each point in time t, the agent performs an action y_t and the environment generates an observation x_t and an instantaneous cost c_t, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimises some measure of long-term cost, i.e. the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.

More formally, the environment is modelled as a Markov decision process (MDP) with states s_1, ..., s_n ∈ S and actions a_1, ..., a_m ∈ A and the following probability distributions: the instantaneous cost distribution P(c_t | s_t), the observation distribution P(x_t | s_t) and the transition P(s_{t+1} | s_t, a_t); a policy is defined as a conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimises the cost, i.e. the MC for which the cost is minimal.

ANNs are frequently used in reinforcement learning as part of the overall algorithm.

Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision-making tasks.

See also: dynamic programming, stochastic control

Learning algorithms

Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimises the cost criterion. There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.


Most of the algorithms used in training artificial neural networks employ some form of gradient descent. This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction.

Evolutionary methods, simulated annealing, expectation-maximization and non-parametric methods are among other commonly used methods for training neural networks. See also machine learning.

Employing artificial neural networks

Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary function approximation mechanism which 'learns' from observed data. However, using them is not so straightforward, and a relatively good understanding of the underlying theory is essential.

• Choice of model: This will depend on the data representation and the application. Overly complex models tend to lead to problems with learning.
• Learning algorithm: There are numerous tradeoffs between learning algorithms. Almost any algorithm will work well with the correct hyperparameters for training on a particular fixed dataset. However, selecting and tuning an algorithm for training on unseen data requires a significant amount of experimentation.
• Robustness: If the model, cost function and learning algorithm are selected appropriately, the resulting ANN can be extremely robust.

With the correct implementation, ANNs can be used naturally in online learning and large dataset applications. Their simple implementation and the existence of mostly local dependencies exhibited in the structure allows for fast, parallel implementations in hardware.

Applications

The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations. This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical.

Real life applications

The tasks to which artificial neural networks are applied tend to fall within the following broad categories:

• Function approximation, or regression analysis, including time series prediction and modeling.
• Classification, including pattern and sequence recognition, novelty detection and sequential decision making.


• Data processing, including filtering, clustering, blind source separation and compression.

Application areas include system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition and more), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications, data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering.

Neural network software

Main article:  Neural network software

Neural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and, in some cases, a wider array of adaptive systems.

Types of neural networks

Feedforward neural network

The feedforward neural network is the first and arguably simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.

Single-layer perceptron

The earliest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. In this way it can be considered the simplest kind of feed-forward network. The sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold (typically 0) the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (typically -1). Neurons with this kind of activation function are also called McCulloch-Pitts neurons or threshold neurons. In the literature the term perceptron often refers to networks consisting of just one of these units. They were described by Warren McCulloch and Walter Pitts in the 1940s.

A perceptron can be created using any values for the activated and deactivated states as long as the threshold value lies between the two. Most perceptrons have outputs of 1 or -1 with a threshold of 0, and there is some evidence that such networks can be trained more quickly than networks created from nodes with different activation and deactivation values.


Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It calculates the errors between the calculated output and the sample output data, and uses this to create an adjustment to the weights, thus implementing a form of gradient descent.

Single-unit perceptrons are only capable of learning linearly separable patterns; in 1969, in a famous monograph entitled Perceptrons, Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function. They conjectured (incorrectly) that a similar result would hold for a multi-layer perceptron network. Although a single threshold unit is quite limited in its computational power, it has been shown that networks of parallel threshold units can approximate any continuous function from a compact interval of the real numbers into the interval [-1,1] [Auer, Burgsteiner, Maass: The p-delta learning rule for parallel perceptrons, 2001].

A single-layer neural network can compute a continuous output instead of a step function. A common choice is the so-called logistic function:

y = 1 / (1 + e^(-x))

With this choice, the single-layer network is identical to the logistic regression model, widely used in statistical modelling. The logistic function is also known as the sigmoid function. It has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

y′ = y(1 − y)

Multi-layer perceptron

A two-layer neural network capable of calculating XOR. The numbers within the neurons represent each neuron's explicit threshold (which can be factored out so that all neurons have the same threshold, usually 1). The numbers that annotate arrows represent the weight of the inputs. This net assumes that if the threshold is not reached, zero (not -1) is output. Note that the bottom layer of inputs is not always considered a real neural network layer.

This class of networks consists of multiple layers of computational units, usually interconnected in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function.

The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions.

Multi-layer networks use a variety of learning techniques, the most popular being back-propagation. Here the output values are compared with the correct answer to compute the value of some predefined error function. By various techniques the error is then fed back through the network. Using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount. After repeating this process for a sufficiently large number of training cycles, the network will usually converge to some state where the error of the calculations is small. In this case one says that the network has learned a certain target function. To adjust weights properly, one applies a general method for non-linear optimization called gradient descent. For this, the derivative of the error function with respect to the network weights is calculated, and the weights are then changed such that the error decreases (thus going downhill on the surface of the error function). For this reason back-propagation can only be applied on networks with differentiable activation functions.

In general, the problem of teaching a network to perform well even on samples that were not used as training samples is a quite subtle issue that requires additional techniques. This is especially important for cases where only very limited numbers of training samples are available. The danger is that the network overfits the training data and fails to capture the true statistical process generating the data. Computational learning theory is concerned with training classifiers on a limited amount of data. In the context of neural networks a simple heuristic, called early stopping, often ensures that the network will generalize well to examples not in the training set.

Other typical problems of the back-propagation algorithm are the speed of convergence and the possibility of ending up in a local minimum of the error function. Today there are practical solutions that make back-propagation in multi-layer perceptrons the solution of choice for many machine learning tasks.

ADALINE

Adaptive Linear Neuron, later called Adaptive Linear Element. It was developed by Professor Bernard Widrow and his graduate student Ted Hoff at Stanford University in 1960. It is based on the McCulloch-Pitts model. It consists of a weight, a bias and a summation function.

Operation: y_i = w x_i + b

Its adaptation is defined through a cost function (error metric) of the residual e = d_i − (b + w x_i), where d_i is the desired output. Minimizing the MSE error metric yields the familiar least-squares estimates: w = cov(x, d)/var(x) and b = mean(d) − w mean(x).

While the Adaline is thereby capable of simple linear regression, it has limited practical use.

There is an extension of the Adaline, called the Multiple Adaline (MADALINE), that consists of two or more adalines serially connected.

Radial basis function (RBF) network

Main article: Radial basis function network 

Radial basis functions are powerful techniques for interpolation in multidimensional space. An RBF is a function which has a built-in distance criterion with respect to a centre. Radial basis functions have been applied in the area of neural networks, where they may be used as a replacement for the sigmoidal hidden-layer transfer characteristic in multi-layer perceptrons. RBF networks have two layers of processing: in the first, input is mapped onto each RBF in the 'hidden' layer. The RBF chosen is usually a Gaussian. In regression problems the output layer is then a linear combination of hidden-layer values representing the mean predicted output. The interpretation of this output-layer value is the same as a regression model in statistics. In classification problems the output layer is typically a sigmoid function of a linear combination of hidden-layer values, representing a posterior probability. Performance in both cases is often improved by shrinkage techniques, known as ridge regression in classical statistics and known to correspond to a prior belief in small parameter values (and therefore smooth output functions) in a Bayesian framework.

RBF networks have the advantage of not suffering from local minima in the same way as multi-layer perceptrons. This is because the only parameters that are adjusted in the learning process are the linear mapping from hidden layer to output layer. Linearity ensures that the error surface is quadratic and therefore has a single, easily found minimum. In regression problems this can be found in one matrix operation. In classification problems the fixed non-linearity introduced by the sigmoid output function is most efficiently dealt with using iteratively reweighted least squares.


RBF networks have the disadvantage of requiring good coverage of the input space by radial basis functions. RBF centres are determined with reference to the distribution of the input data, but without reference to the prediction task. As a result, representational resources may be wasted on areas of the input space that are irrelevant to the learning task. A common solution is to associate each data point with its own centre, although this can make the linear system to be solved in the final layer rather large, and requires shrinkage techniques to avoid overfitting.

Associating each input datum with an RBF leads naturally to kernel methods such as Support Vector Machines and Gaussian Processes (the RBF is the kernel function). All three approaches use a non-linear kernel function to project the input data into a space where the learning problem can be solved using a linear model. Like Gaussian Processes, and unlike SVMs, RBF networks are typically trained in a maximum likelihood framework by maximizing the probability (minimizing the error) of the data under the model. SVMs take a different approach to avoiding overfitting, by maximizing instead a margin. RBF networks are outperformed in most classification applications by SVMs. In regression applications they can be competitive when the dimensionality of the input space is relatively small. (A sketch of an RBF regression follows.)
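A minimal RBF-regression sketch in the spirit of this section: a Gaussian hidden layer with fixed centres and a linear output layer solved in one least-squares matrix operation (the data, centres and width are my choices):

```python
import numpy as np
rng = np.random.default_rng(7)

def rbf_design(X, centres, width=0.3):
    """One Gaussian column per centre: exp(-|x - c|^2 / (2 width^2))."""
    d2 = (X[:, None] - centres[None, :]) ** 2
    return np.exp(-d2 / (2 * width ** 2))

X = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, 80)   # hypothetical noisy target
centres = np.linspace(0, 1, 10)                      # fixed grid of RBF centres

Phi = rbf_design(X, centres)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # linear output weights
pred = rbf_design(np.array([0.25]), centres) @ w
print(pred)   # close to sin(pi/2) = 1
```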

Kohonen self-organizing network

The self-organizing map (SOM), invented by Teuvo Kohonen, uses a form of unsupervised learning. A set of artificial neurons learns to map points in an input space to coordinates in an output space. The input space can have different dimensions and topology from the output space, and the SOM will attempt to preserve these.

Recurrent network

Contrary to feedforward networks, recurrent neural networks (RNs) are models with bi-directional data flow. While a feedforward network propagates data linearly from input to output, RNs also propagate data from later processing stages to earlier stages.

Simple recurrent network

A simple recurrent network (SRN) is a variation on the multi-layer perceptron, sometimes called an "Elman network" due to its invention by Jeff Elman. A three-layer network is used, with the addition of a set of "context units" in the input layer. There are connections from the middle (hidden) layer to these context units, fixed with a weight of one. At each time step, the input is propagated in a standard feed-forward fashion, and then a learning rule (usually back-propagation) is applied. The fixed back connections result in the context units always maintaining a copy of the previous values of the hidden units (since they propagate over the connections before the learning rule is applied). Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence prediction that are beyond the power of a standard multi-layer perceptron.
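A sketch of just the forward dynamics in Python/NumPy, leaving out the learning rule; the tanh non-linearity and the weight shapes are assumptions for illustration, but the copy-back of hidden values into the context units is the defining feature described above:

import numpy as np

def srn_forward(inputs, W_in, W_ctx, W_out):
    # Context units start at zero and always hold a verbatim copy
    # (the fixed weight-one connections) of the previous hidden values.
    hidden = np.zeros(W_ctx.shape[0])
    outputs = []
    for x in inputs:
        context = hidden.copy()          # the fixed back connections
        hidden = np.tanh(W_in @ x + W_ctx @ context)
        outputs.append(W_out @ hidden)
    return outputs

# Toy usage: 3 input units, 5 hidden/context units, 2 output units.
rng = np.random.default_rng(0)
W_in = rng.standard_normal((5, 3))
W_ctx = rng.standard_normal((5, 5))
W_out = rng.standard_normal((2, 5))
seq = [rng.standard_normal(3) for _ in range(4)]
print(srn_forward(seq, W_in, W_ctx, W_out))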


In a fully recurrent network, every neuron receives inputs from every other neuron in the network. These networks are not arranged in layers. Usually only a subset of the neurons receive external inputs in addition to the inputs from all the other neurons, and another disjoint subset of neurons report their output externally as well as sending it to all the neurons. These distinctive inputs and outputs perform the function of the input and output layers of a feed-forward or simple recurrent network, and also join all the other neurons in the recurrent processing.

Hopfield network

The Hopfield network is a recurrent neural network in which all connections are symmetric. Invented by John Hopfield in 1982, this network guarantees that its dynamics will converge. If the connections are trained using Hebbian learning then the Hopfield network can perform as a robust content-addressable memory, resistant to connection alteration.
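A tiny sketch of that in Python/NumPy, storing a single +/-1 pattern by the Hebbian outer-product rule and recalling it from a corrupted probe (the pattern, its length, and the update count are arbitrary illustrations):

import numpy as np

def hopfield_store(patterns):
    # Hebbian learning: W[i, j] grows with the correlation of units i and j.
    W = patterns.T @ patterns / len(patterns)
    np.fill_diagonal(W, 0)               # no self-connections; W is symmetric
    return W

def hopfield_recall(W, probe, sweeps=5, seed=0):
    # Asynchronous +/-1 updates; symmetric weights guarantee convergence.
    rng = np.random.default_rng(seed)
    s = probe.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(s)):
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s

p = np.array([1., -1., 1., 1., -1., -1., 1., -1.])
W = hopfield_store(p[None, :])
noisy = p.copy()
noisy[0] = -noisy[0]                     # flip one bit
print(hopfield_recall(W, noisy))         # recovers the stored pattern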

Echo State Network

The Echo State Network (ESN) is a recurrent neural network with a sparsely connected random hidden layer. The weights of the output neurons are the only part of the network that can change and be learned. ESNs are good at (re)producing temporal patterns.

Stochastic neural networks

A stochastic neural network differs from a regular neural network in that it introduces random variations into the network. In a probabilistic view of neural networks,

such random variations can be viewed as a form of statistical sampling, such as Monte

Carlo sampling.

Boltzmann machine

The Boltzmann machine can be thought of as a noisy Hopfield network. Invented by

Geoff Hinton and Terry Sejnowski in 1985, the Boltzmann machine is important because

it is one of the first neural networks to demonstrate learning of latent variables (hidden units). Boltzmann machine learning was at first slow to simulate, but the contrastive

divergence algorithm of Geoff Hinton (circa 2000) allows models such as Boltzmann

machines and products of experts to be trained much faster.

Modular neural networks

Biological studies have shown that the human brain functions not as a single massive network but as a collection of small networks. This realisation gave birth to the concept of modular neural networks, in which several small networks cooperate or compete to solve problems.


Committee of machines

A committee of machines (CoM) is a collection of different neural networks that together "vote" on a given example. This generally gives a much better result than other neural network models. In fact, in many cases, starting with the same architecture and training but using different initial random weights gives vastly different networks. A CoM tends to stabilize the result.

The CoM is similar to the general machine learning bagging method, except that the necessary variety of machines in the committee is obtained by training from different random starting weights rather than training on different randomly selected subsets of the training data.

Associative Neural Network (ASNN)

The ASNN is an extension of the committee of machines that goes beyond a simple/weighted average of different models. ASNN represents a combination of an ensemble of feed-forward neural networks and the k-nearest neighbour technique (kNN). It uses the correlation between ensemble responses as a measure of distance among the analysed cases for the kNN. This corrects the bias of the neural network ensemble. An associative neural network has a memory that can coincide with the training set. If new data become available, the network instantly improves its predictive ability and provides data approximation (self-learns the data) without a need to retrain the ensemble. Another important feature of ASNN is the possibility to interpret neural network results by analysis of correlations between data cases in the space of models. The method is demonstrated at www.vcclab.org, where you can either use it online or download it.

Other types of networks

These special networks do not fit in any of the previous categories.

Holographic associative memory

Holographic associative memory represents a family of analog, correlation-based, associative, stimulus-response memories, where information is mapped onto the phase orientation of complex numbers.

Instantaneously trained networks

 Instantaneously trained neural networks (ITNNs) were inspired by the phenomenon of short-term learning that seems to occur instantaneously. In these networks the weights of 

the hidden and the output layers are mapped directly from the training vector data.

Ordinarily, they work on binary data, but versions for continuous data that require small additional processing are also available.


Spiking neural networks

Spiking neural networks (SNNs) are models which explicitly take into account the timing

of inputs. The network input and output are usually represented as series of spikes (delta functions or more complex shapes). SNNs have the advantage of being able to

continuously process information. They are often implemented as recurrent networks.
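To make the timing idea concrete, here is a minimal leaky integrate-and-fire neuron in Python; the time constant, threshold, and constant input drive are illustrative parameters, not those of any particular published model:

def lif_neuron(input_current, dt=1.0, tau=10.0, v_thresh=1.0, v_reset=0.0):
    # The membrane potential leaks toward rest, integrates its input,
    # and emits a spike (1) whenever it crosses the threshold.
    v, spikes = 0.0, []
    for current in input_current:
        v += dt * (current - v) / tau    # leaky integration
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset                  # reset after the spike
        else:
            spikes.append(0)
    return spikes

print(lif_neuron([1.5] * 50))            # constant drive -> a regular spike train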

 Networks of spiking neurons -- and the temporal correlations of neural assemblies in such

networks -- have been used to model figure/ground separation and region linking in the visual system (see e.g. Reitboeck et al. in Haken and Stadler: Synergetics of the Brain.

Berlin, 1989).

Gerstner and Kistler have a freely-available online textbook on Spiking Neuron Models.

Spiking neural networks with axonal conduction delays exhibit polychronisation, and

hence could have a potentially unlimited memory capacity.

In June 2005 IBM announced construction of a Blue Gene supercomputer dedicated to

the simulation of a large recurrent spiking neural network [1].

Dynamic neural networks

Dynamic neural networks not only deal with nonlinear multivariate behaviour, but also include (learning of) time-dependent behaviour such as various transient phenomena and delay effects. Meijer has a Ph.D. thesis online in which regular feedforward perceptron networks are generalized with differential equations, using variable time-step algorithms for learning in the time domain and including algorithms for learning in the frequency domain (in that case linearized around a set of static bias points).

Cascading neural networks

Cascade-Correlation is an architecture and supervised learning algorithm developed by

Scott Fahlman and Christian Lebiere. Instead of just adjusting the weights in a network of fixed topology, Cascade-Correlation begins with a minimal network, then automatically

trains and adds new hidden units one by one, creating a multi-layer structure. Once a new

hidden unit has been added to the network, its input-side weights are frozen. This unit then becomes a permanent feature detector in the network, available for producing outputs or for creating other, more complex feature detectors. The Cascade-Correlation architecture has several advantages over existing algorithms: it learns very quickly, the network determines its own size and topology, it retains the structures it has built even if the training set changes, and it requires no back-propagation of error signals through the

connections of the network.


Neuro-fuzzy networks

A neuro-fuzzy network is a fuzzy inference system in the body of an artificial neural

network. Depending on the FIS type, there are several layers that simulate the processes involved in a fuzzy inference, such as fuzzification, inference, aggregation and defuzzification. Embedding an FIS in the general structure of an ANN has the benefit of using available ANN training methods to find the parameters of the fuzzy system.

Theoretical properties

Capacity

Artificial neural network models have a property called 'capacity', which roughly

corresponds to their ability to model any given function. It is related to the amount of 

information that can be stored in the network and to the notion of complexity.

Convergence

 Nothing can be said in general about convergence since it depends on a number of 

factors. Firstly, there may exist many local minima. This depends on the cost function and the model. Secondly, the optimization method used might not be guaranteed to converge

when far away from a local minimum. Thirdly, for a very large amount of data or 

 parameters, some methods become impractical. In general, it has been found that

theoretical guarantees regarding convergence are not always a very reliable guide to practical application.

Generalisation and statistics

In applications where the goal is to create a system that generalises well to unseen examples, the problem of overtraining has emerged. This arises in overcomplex or overspecified systems when the capacity of the network significantly exceeds the needed free parameters. There are two schools of thought for avoiding this problem: The first is to use cross-validation and similar techniques to check for the presence of overtraining and to select hyperparameters that minimise the generalisation error. The second is to use some form of regularisation. This is a concept that emerges naturally in a probabilistic (Bayesian) framework, where the regularisation can be performed by putting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimise two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error on unseen data due to overfitting.


Confidence analysis of a neural network 

Supervised neural networks that use an MSE cost function can use formal statistical

methods to determine the confidence of the trained model. The MSE on a validation set

can be used as an estimate for variance. This value can then be used to calculate the confidence interval of the output of the network, assuming a normal distribution. A

confidence analysis made this way is statistically valid as long as the output probability

distribution stays the same and the network is not modified.
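For example, a sketch of that calculation in Python, with made-up numbers and the usual 1.96 multiplier for a 95% interval under the normality assumption:

import numpy as np

val_mse = 0.04          # hypothetical validation-set MSE
pred = 3.2              # hypothetical network output for a new input
std = np.sqrt(val_mse)  # MSE used as an estimate of the error variance
lo, hi = pred - 1.96 * std, pred + 1.96 * std
print(f"95% confidence interval: [{lo:.2f}, {hi:.2f}]")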

By assigning a softmax activation function on the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target

variables, the outputs can be interpreted as posterior probabilities. This is very useful in

classification as it gives a certainty measure on classifications.

The softmax activation function converts the raw outputs x_1, ..., x_n of the network into values that can be read as probabilities:

y_i = exp(x_i) / (exp(x_1) + ... + exp(x_n))

Each y_i lies between 0 and 1, and together they sum to 1.
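In code (the subtraction of the maximum is a common numerical-stability trick, not part of the definition):

import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j); shifting by max(z)
    # leaves the result unchanged but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # three class probabilities summing to 1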

Dynamical properties

Various techniques originally developed for studying disordered magnetic systems (spin glasses) have been successfully applied to simple neural network architectures, such as the Hopfield network. Influential work by E. Gardner and B. Derrida has revealed many interesting properties about perceptrons with real-valued synaptic weights, while later work by W. Krauth and M. Mezard has extended these principles to binary-valued synapses.

About Fuzzy Logic

What is fuzzy logic? What's the difference between fuzzy logic and Boolean logic? What are connectives in fuzzy logic?

Boolean or "two-valued" logic is traditional logic with all statements

either being true or false. Symbolic logic is something that you can

master. The hardest thing about symbolic logic is learning how to work

with the symbols. Once you know what all the symbols stand for, the

logic should come more easily.


Philosophers and Logicians

First, I'd like to do a bit of philosophizing as a way to lead into the logic. Philosophers and logicians have a lot of overlap in what they do. Many logicians are also philosophers, and all philosophers are logicians to some extent (some much more so than others).

Given that there is such a connection between philosophers and logicians, I find it striking just how radically different the fields are. Philosophers are interested in finding deep truths about the world, be they epistemological, metaphysical, ethical, etc. Logicians (qua logicians) are only interested in using a set of rules to manipulate arbitrary symbols that have no relevance to the world.

The (sometimes difficult) marriage between philosophy and logic comes from the fact that everyone in the world (except, I would argue, people who are commonly called "crazy") accepts the truths proven by logic to be universally true and unquestionable. Philosophy needs logic because in order to establish that a philosophical doctrine is true, one needs to show that the doctrine is universally and unquestionably true. One needs to, in other words, make a demonstration that everyone would accept as proof that the proposition is true. To do that, the philosopher needs logic.


Logic Takes Small Steps

Logic accomplishes this magical universal acceptance because it makes only little tiny

steps. It does silly little things like:

ASSUMING: The dog is brown.

AND ASSUMING: The dog weighs 15 lbs.

I CONCLUDE: The dog is brown and the dog weighs 15 lbs.

which anyone who understands what the word "and" means would agree with.


 Shorthand Sentences

Logicians, though, are very lazy people. They don't like to write long derivations in

English because English sentences can be fairly long. So what they do instead is a kind of shorthand. If you give a logician a sentence like

The dog is brown.

he will pick a letter and assign it to that sentence. He now knows that the letter is just

shorthand for the sentence. The way I learned logic, capital letters are used for sentences (with the exception of U, V, W, X, Y, and Z; I'll get to those later).


So let's just start at the beginning of the alphabet and use the letter "A" to represent the

sentence "The dog is brown." While we're at it, let's use the letter "B" to represent the

sentence "The dog weighs 15 lbs."

In addition to saving time and ink, this practice of using capital letters to represent whole

sentences has a couple of other advantages. The first is that to a logician, not every word is as interesting as every other. Logicians are extremely interested in the following list of

words:

and

or 

if...then

if and only if 

not

They call these words "connectives." This is because you can use them to connect

sentences that you already have together to make new sentences.

When you write in English, those words don't stand out; they just get lost in the middle of 

sentences. Logicians want to make sure the words look special, so they take the whole

rest of the sentence (the part they don't care about) and use a single letter to represent

that. Then their favorite words stand out. Let's rewrite our earlier example about the dog using our logician's shorthand:

ASSUMING: A

AND ASSUMING: B

I CONCLUDE: A and B

The other advantage of using capital letters to represent sentences is that you ignore all the information that isn't relevant to what you're trying to do. For the derivation I did above, it didn't matter that the sentences were both about some dog. It didn't matter that they were about weight or color. They could have just as easily been sentences about how tall the dog is or about a cat or a person or a war or whatever. And if we can do the derivation for A and B, then we can do the same exact derivation for C and D or E and N or any other sentences we like.


Connectives


 Now, as I said before, logicians are lazy. They really don't want to have anything to do

with English. So instead of using the English words:

and

or 

if...then

if and only if 

not

they make up their own symbols for these:

For these words      Logicians use this symbol
and                  ^
or                   v
if ... then          ->
if and only if       <->
not                  ~

(Sometimes they also use a triple equals sign for '<->', but I can't type that.)

Here are some examples:

You and I would write this                               A logician writes this
The dog is brown and the dog weighs 15 lbs.              (A ^ B)
The dog is brown or the dog weighs 15 lbs.               (A v B)
If the dog is brown, then the dog weighs 15 lbs.         (A -> B)
The dog is brown if and only if the dog weighs 15 lbs.   (A <-> B)
The dog is not brown.                                    ~A

There are a few things to notice here:

1. The symbols ^, v, ->, and <-> are called "two-place connectives." This is because

they connect two sentences together into a more complicated sentence.


2. The symbol: ~ is called a "one-place connective" because you only add it to one

sentence. (You cannot join multiple sentences together with it.) To negate a

sentence, all you have to do is stick a ~ on the front.

3. When you join two sentences with a two-place connective, you ALWAYS put

 parentheses around it. So it is NOT appropriate to write this:

A ^ B

That makes as much sense in symbolic logic as writing:

 Nn7&% mm)]mm (


 Parentheses

I know that a lot of books and instructors claim that it is okay to drop the outermost

parentheses in a sentence. I've done it myself many times. And 95% of the time it won't cause you trouble if you're careful. But let's say we started with this 'sentence'

A ^ B

and decided to negate it. Well, the way to negate a sentence is to stick a ~ on the front, so

let's do that:

~A ^ B

But wait! What we did there was just negate the A. We wanted to negate the whole

 sentence. If we were really sharp, then we might notice that somebody had given us an

illegitimate sentence that was missing parentheses, and so we would add the parentheses

 before adding the ~, to get:

~(A ^ B)

which is what we wanted.

It seems silly to make such a big deal about parentheses when we're dealing with simple

sentences, but when you're doing a 30-line derivation and you're tired, it's easy to make a mistake just like that on line 17 and get yourself into real trouble. It's better to just

remember the simple rule and always add parentheses when you have a two-place

connective.

. . . 


Let's take a deep breath and then go quickly over what we have so far.


Using Connectives

Connectives are logical terms,

^ (and)

v (or)

-> (if...then)

<-> (if and only if)

~ (not)

which you can add to a sentence.

A simple sentence is one that has no connectives. For example: A (the dog is brown).

A complex sentence is a sentence which is made up of one or more simple sentences and

one or more connectives. Some examples are:

(A ^ B)

(A v B)

(A -> B)

(A <-> B)

~A

You can use connectives on complex sentences just as you can on simple sentences. Let's

introduce a new simple sentence "it is raining," and let's call our new sentence C. We now have a lot more sentences that we can make. (Keep in mind, we have no idea yet which of

these sentences are true or false; we also don't yet know how these sentences relate.) For 

example:

(C ^ B)

~C

(B v C)

(C v B)


(B -> C)

(A -> C)

(C -> A)

(B <-> C)

(~B <-> C)

~(B <-> C)

C

~~C

((A ^ B) v C)

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

These can get a little complicated. That last sentence is especially scary looking; we'll come back to it in a little while. For now, here is a quick run-down of how to use

connectives to make complex sentences from simple ones.

To make this complex sentence    Do this                      We say
~C                               Stick a ~ on C.              ~C is the negation of C.
(B v C)                          Use a v to join B and C.     (B v C) is the disjunction of B and C.
(B ^ C)                          Use a ^ to join B and C.     (B ^ C) is the conjunction of B and C.
(B -> C)                         Use a -> to join B and C.    B implies C. (B -> C) is a conditional or implication.
(B <-> C)                        Use a <-> to join B and C.   B implies C and C implies B. (B <-> C) is a biconditional.


Subsentences

Some of our sentences had more than one connective:


(~B <-> C)

~(B <-> C)

~~C

((A ^ B) v C)

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

The sentence

(~B <-> C)

is made by joining the sentences

~B

C

with

<->

The complex sentence

~B

is called a "subsentence" of the larger sentence

(~B <-> C)

because it is a smaller sentence inside the large one.

The simple sentence

C

is also a subsentence of the larger sentence

(~B <-> C)

The simple sentence

B

is a subsentence of the subsentence

~B

and so

B

is also a subsentence of

(~B <-> C)

There are two connectives used in the larger sentence

<->
~

but they are not equally important. In this case the

<->

is much more important than the

~

Remember how the sentence was made by taking the two smaller sentences


~B

C

and connecting them with a

<->

The <-> is therefore called the "main connective" of the sentence. Main connectives are,

without a doubt, absolutely the most important idea in logic. The hardest skill to learn in logic is to identify the main connective of a sentence. Make sure you understand what

main connectives are.

Compare the sentence we've been looking at,

(~B <-> C)

with one that looks similar,

~(B <-> C)

This new sentence is very different. It was made by negating

(B <-> C)

The main connective of

~(B <-> C)

is therefore

~

and

(B <-> C)

is just a subsentence of

~(B <-> C)


Complicated sentences

 Now let's take a closer look at the most complicated sentence on our list and see if we can

make it more manageable. The way to analyze a complicated sentence is to start at the

outside and work your way in.

The outermost parentheses on this ugly sentence

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

are used to connect these two sentences

((A ^ ~B) v ~C)

(~(A v B) <-> C)

with a

->

So the way to build our ugly sentence is to start with these two less ugly sentences:


((A ^ ~B) v ~C)

(~(A v B) <-> C)

and connect them with the main connective

->

We can then analyze each subsentence if we like.
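Since identifying the main connective is so important, it is worth noting that it can even be done mechanically. Here is a small Python sketch for fully parenthesised sentences in the notation above; the depth-scanning strategy is just one convenient way to do it:

def main_connective(s):
    # A leading ~ is the main connective of an unparenthesised negation.
    if s.startswith('~'):
        return '~', 0
    # Otherwise the main connective is the two-place symbol that sits at
    # parenthesis depth 1, i.e. just inside the outermost parentheses.
    depth, i = 0, 0
    while i < len(s):
        if s[i] == '(':
            depth += 1
        elif s[i] == ')':
            depth -= 1
        elif depth == 1:
            for op in ('<->', '->', '^', 'v'):
                if s.startswith(op, i):
                    return op, i
        i += 1
    return None  # a simple sentence such as 'A' has no connective

print(main_connective('(((A ^ ~B) v ~C) -> (~(A v B) <-> C))'))  # ('->', 17)
print(main_connective('~(B <-> C)'))                             # ('~', 0)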

I told you before that simple sentences are represented by the capital letters A through T, and that U, W, X, Y, and Z are saved for something else. (I rarely use V because it looks too much like the symbol for 'or'.) U, W, X, Y, and Z are used as shorthand for other sentences in logic (some books use italic letters and others use Greek letters, but since I only have plain text to work with, I use the end of the alphabet). I call these "sentence

variables."

So, just as we can take the English sentence

There is nothing on TV.

and use the capital letter D to represent it, we can take the sentence in logic

((A ^ ~B) v ~D)

and use the capital letter U to represent it.

[Note: it is also legal to use capital variables to stand for simple sentences. So you can

take the simple sentence

B

and use the letter Z to stand for it.]

This can be useful in analyzing complicated sentences. For example, if we have the scary-looking sentence

((((A ^ ~B) v (B <-> C)) -> (~(C v D) ^ ~(~A -> ~~D))) v A)

we can start using sentence variables to stand for subsentences. So if U stands for

(A ^ ~B)

then we have

(((U v (B <-> C)) -> (~(C v D) ^ ~(~A -> ~~D))) v A)

and if V stands for

(U v (B <-> C))

then we have

((V -> (~(C v D) ^ ~(~A -> ~~D))) v A)

If W stands for

~(C v D)

we have

((V -> (W ^ ~(~A -> ~~D))) v A)

and if X stands for

~(~A -> ~~D)

we have

((V -> (W ^ X)) v A)

and if Y stands for

(V -> (W ^ X))


we have

(Y v A)

So we know where our main connective is. And by substituting back in for the sentence variables, we can recreate our sentence in manageable chunks.

It is very important to keep track of what sentence variables stand for when you're doing

this kind of substitution. This can be a major source of error if you're not keeping close

track of what every letter stands for.

. . . 

 Now that we know all the details of the language of symbolic logic, it's time to actually

do symbolic logic.

The first step with every sentence is to identify the main connective. The reason is

simple:

In symbolic logic, the main connective of a sentence is the only thing that you can work with.

Let's look at our complicated sentence from earlier. The sentence

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

is fundamentally an implication between these two subsentences:

((A ^ ~B) v ~C)

(~(A v B) <-> C)

There is no

->

in either subsentence, but the sentence as a whole is still first and foremost an implication

because of what its main connective is. So when you're trying to figure out how in the heck you can work with this ugly sentence

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

you need to remember that it is an implication and treat it just as one.


What the Connectives Mean

Here's a quick course on what the connectives mean. (I assume you have some familiarity

with them.)

The sentence    is TRUE whenever                                      is FALSE whenever
A               A is true                                             A is false
~A              A is false                                            A is true
(A ^ B)         A is true and B is true                               A is false; or B is false; or both A and B are false
(A v B)         A is true; or B is true; or both A and B are true     A is false and B is false
(A <-> B)       A and B are both true; or A and B are both false      A is false and B is true, or A is true and B is false
(A -> B)        A is false; or B is true; or A is false and B is true A is true and B is false

This last one is a little weird, so let's think about it. If we translate it back into English,

we get

If the dog is brown then the dog weighs 15 lbs.

How would we go about proving that this sentence is false?

Let's say that the dog is brown and the dog weighs 15 lbs. Does that disprove the if...then

statement? Certainly not!

What if the dog is brown but the dog weighs 25 lbs.? That does disprove the statement.

What if the dog turns out to be white? Then we cannot disprove the inference because it only makes a prediction about a brown dog. If the dog isn't brown, then we can't test the

 prediction.

So the only way to make the sentence

(A -> B)

false is to make A true and B false at the same time. Given any other values of A and B, the sentence comes out true.
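You can convince yourself of this whole table by enumerating it mechanically. A short Python sketch (the dictionary of functions is just one convenient encoding of the connectives):

from itertools import product

OPS = {
    '^':   lambda a, b: a and b,
    'v':   lambda a, b: a or b,
    '->':  lambda a, b: (not a) or b,   # false only when a is true and b is false
    '<->': lambda a, b: a == b,
}

# Print the full truth table for (A -> B).
for A, B in product([True, False], repeat=2):
    print(A, B, OPS['->'](A, B))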


The Rules of Logic

Now we're finally ready to learn the rules of logic. There are exactly 12 - no more, no less. Each connective has two rules associated with it, and there are two special rules.

Let's start with one of the special rules first.



1. Assumptions

The first special rule is the rule of assumptions. It is deceptively easy. The rule is:

You are allowed to assume anything you want at any time.

But there is a catch:

You have to keep track of what assumptions you have made.

Well that makes sense. Let's say you and I are detectives trying to solve a mystery. I could

say something like "let's assume for the time being that the dog is brown." Once I said

that, we could discuss what that would mean. Anything we conclude from that assumption is perfectly okay, as long as we remember that it was under the assumption

that the dog is brown. In other words, whatever we do prove under the assumption that

the dog is brown must be followed by a disclaimer "assuming that the dog is brown."

Eventually, we would want to prove something about the case that doesn't depend on the

dog being brown. Logicians call this "discharging" the assumption. Fortunately, some of our other rules tell us how to discharge assumptions.

1. When I do derivations, I number each new line. I start new assumptions using curly brackets {
2. and then I indent everything after a new assumption;
3. When I discharge an assumption I close the curly brackets }
4. And then I stop indenting.

One other very important thing to keep in mind:

Once you close off an assumption, you can no longer use any lines between the

curly brackets. So since I've closed the curly brackets above, I would no longer be

able to use either of the two lines between them: they are gone forever. So lines (2) and (3) above are illegal.

However, line (1) is legal because it is outside the curly brackets, and so is line (4).

This can get complicated if you have assumptions inside of assumptions.

And finally, and perhaps central to logic:

A logical truth is something that you can write with all your assumptions

discharged.

Before we can do some short derivations, we need to learn two other rules. Let's start with one of the two rules that we get from the

->


connective.


2. -> Introduction

The rule is called "-> introduction." The way it works is:

If you assume

X

and then you derive

Y

then you are entitled to discharge the assumption and write

(X -> Y)

That makes sense. Let's just say that we assumed

A
[The dog is brown]

And then we did some logic and out of that we proved

E
[The killer is a man]

If we did that, we would be entitled to say to a jury

(A -> E)
[If the dog is brown then the killer is a man]

The sentence

(A -> E)

is true.


3. ^ Elimination

Let's learn one more rule for now. This one is called "^ elimination."

If you have

(X ^ Y)

then you are entitled to

X

and you are also entitled to

Y

That makes sense too. Let's say we knew for a fact that

(A ^ B)

[The dog is brown and the dog weighs 15 lbs]

Then we would certainly be entitled to conclude


A

[The dog is brown]

and we would also certainly be entitled to conclude

B

[The dog weighs 15 lbs]

 A Derivation

 Now let's take an example of a derivation. Suppose I wanted to prove that this is a logical

truth

((A ^ B) -> A)

I would start by identifying the main connective, which is a ->. I know how to introduce a new ->: I assume the left and then derive the right. Let's try it:

{

1) (A ^ B) [assumption]

2) A [^elim on 1]

}

3) ((A ^ B) -> A) [->intro on 1-2]

We just used our 3 rules to derive

((A ^ B) -> A)

[If (the dog is brown and the dog weighs 15 lbs) then the dog is brown]


4. Repetition

There's one other special rule. It's called "repetition." The rule simply says that if you

have

X

Then you are entitled to write

X

Provided that it was not inside a closed curly bracket.


5. ^ Introduction

The other rule with ^ is called "^ introduction." It says, if you have

X

and you also have

Y


then you are entitled to

(X ^ Y)

That makes sense too. Let's say that I have already proven

E
[The killer is a man]

And I have also proven

F

[The killer is tall]

Then I am certainly allowed to say to the jury

(E ^ F)

[The killer is a man and the killer is tall]

Continuing the Derivation

Let's continue our derivation using our new rules.

{

1) (A ^ B) [assumption]

2) A [^elim on 1]

}

3) ((A ^ B) -> A) [->intro on 1-2]

{

4) C [assumption]

5) ((A ^ B) -> A) [repetition of 3]

6) (((A ^ B) -> A) ^ C) [^intro on 4 and 5]

}

7) (C -> (((A ^ B) -> A) ^ C)) [->intro on 4-6]


6. -> Elimination

The other rule for -> is called "-> elimination." It says that if you have

X

and you have

(X -> Y)

then you are entitled to

Y

That makes sense too. If I know


A

[The dog is brown]

and I know

(A -> E)

[If the dog is brown then the killer is a man]

then I am certainly entitled to conclude

E

[The killer is a man]

 Adding to the Derivation

Let's add a little more to our derivation:

{

1) (A ^ B) [assumption]

2) A [^elim on 1]

}

3) ((A ^ B) -> A) [->intro on 1-2]

{

4) C [assumption]

5) ((A ^ B) -> A) [repetition of 3]

6) (((A ^ B) -> A) ^ C) [^intro on 4 and 5]

}

7) (C -> (((A ^ B) -> A) ^ C)) [->intro on 4-6]

{

8) (A ^ B) [assumption]

9) ((A ^ B) -> A) [repetition of 3]

 

[Note: Line (9) is NOT a repetition of (5) because

(5) is inside closed curly brackets. (3) is not,

so it is okay to repeat it here.]

10) A [->elim on 8 and 9]

[Note: I did not discharge the assumption I made on line (8). So

A

is not a logical truth; it is true only on the assumption that


(A ^ B)

is true.]


7. <-> Introduction

The next two rules have to do with <->. The first is called "<-> introduction." It states

that if you have

(X -> Y)

and you have

(Y -> X)

then you are entitled to

(X <-> Y)

This one is a little tricky to explain, and the best way (I'm sorry to say) is truth tables. So you should try all the possible combinations for X and Y and convince yourself that if

(X -> Y)

and

(Y -> X)

are both true, then

(X <-> Y)

must be true too.


8. <-> Elimination

The next rule is called "<-> elimination." This one says that if you have

(X <-> Y)

And you have

X

Then you are entitled to

Y

OR

If you have

(X <-> Y)

And you have

Y

Then you are entitled to

X

This makes sense because if you know

(X <-> Y)

then you know that X and Y have the same truth value. So if you know one of them is

true, then the other must also be true.

 A New Derivation


Let's start a new derivation.

{

1) (A ^ B) [assumption]

2) A [^elim on 1]

3) B [^elim on 1]

4) (B ^ A) [^intro on 2 and 3]

}

5) ((A ^ B) -> (B ^ A)) [->intro on 1-4]

{

6) (B ^ A) [assumption]

7) B [^elim on 6]

8) A [^elim on 6]

9) (A ^ B) [^intro on 7 and 8]

}

10) ((B ^ A) -> (A ^ B)) [->intro on 6-9]

11) ((A ^ B) -> (B ^ A)) [repetition of 5]

12) ((A ^ B) <-> (B ^ A)) [<->intro on 10 and 11]


9. ~ Introduction

 Next we have "~ introduction." It says that if you assume

X

And then you derive a contradiction, you are entitled to discharge the assumption and write

~X

A contradiction is any sentence

Y

followed on the next line by the negation of that sentence

~Y

This rule is the familiar "reductio ad absurdum." An easy way to think of it is this. If we assume

~F

[The killer does not have red hair]


And we prove from that

A

[The dog is brown]

and

~A

[The dog is not brown]

then something is wrong with our assumption.


10. ~ Elimination

"~ elimination" is almost identical. It says that if you assume

~X

and derive a contradiction, then you are entitled to discharge the assumption and write

X

 A Quick Derivation

{

1) (A ^ ~A) [assumption]

2) A [^elim on 1]

3) ~A [^elim on 1]

}

4) ~(A ^ ~A) [~intro on 1-3]

Lastly, let's look at the rules for v.


11. v Introduction

The first is "v introduction." It says that if you have

X

then you are entitled to write

(X v Y)

no matter what Y is.

That seems a little strange. Normally you wouldn't think you can just go throwing any old

sentence into a derivation. But remember 

(X v Y)


is true as long as X is true OR Y is true OR both are true. So if you already know that X

is true, then the disjunction of X and anything else will be true.

 A Short Derivation

{

1) A [assumption]

2) A [repetition of 1]

}

3) (A -> A) [->intro on 1-2]

4) ((A -> A) v B) [vintro on 3]


12. v Elimination

The last rule is a little tricky. It's "v elimination." It says if you have

(X v Y)

and you have

(X -> Z)

and you have

(Y -> Z)

Then you are entitled to

Z

[Most of the time this means that when you have a disjunction that you don't know what to do with, you have to derive an implication for each side of the disjunction before you

can go on.]

The rule is hard to do with derivations, but it is actually not too hard to understand if you

take an example.

Let's say we know

(A v B)
[The dog is brown or the dog weighs 15 lbs]

And we know

(A -> E)

[If the dog is brown, then the killer is a man]

And we know


(B -> E)

[If the dog weighs 15 lbs, the killer is a man]

Then we don't have to bother figuring out whether A is true or B is true; either way we

are entitled to

E

[The killer is a man]

Continuing Our Last Derivation

Let's continue our last derivation to get a demonstration of "velim."

{

1) A [assumption]

2) A [repetition of 1]

}

3) (A -> A) [->intro on 1-2]

4) ((A -> A) v B) [vintro on 3]

{

5) (A -> A) [assumption]

{

6) C [assumption]

7) C [repetition of 6]

}

9) (C -> C) [->intro on 6-7]

}

10) ((A -> A) -> (C -> C)) [->intro on 5-9]

{

11) B [assumption]

{

12) C [assumption]

13) C [repetition of 12]

}


14) (C -> C) [->intro on 12-13]

}

15) (B -> (C -> C)) [->intro on 11-14]

16) ((A -> A) v B) [repetition of 4]

17) ((A -> A) -> (C -> C)) [repetition of 10]

18) (B -> (C -> C)) [repetition of 15]

19) (C -> C) [velim on 16, 17, 18]

[Not the most efficient way to prove (C -> C), but it is valid.]

There are a lot of other rules people try to tell you, but anything you can do with those, you can do with these 12 rules.


Why These 12 Rules? A Review

The reason I like these rules is that with these rules you can do any derivation using the same five steps:

Step 1: Find the main connective of the sentence you are trying to derive.

Step 2: Apply the rule for introducing that main connective.

Step 3: When you're in the middle of a derivation and you don't know what to do,

find the main connective of the sentence you have and eliminate it.

Step 4: Along the way you may have to derive subsentences using steps 1 through

3.

Step 5: If all else fails, you may have to do a "~ elimination" [I'll explain this step

a little later].

If you use those five steps, you should always know which rule to use. The reason is that there are ONLY four things you are ever allowed to do in a derivation:

1. Eliminate the main connective of the sentence you are on.

2. Use the sentence you are on to eliminate the main connective of another sentence

(AS LONG AS THAT OTHER SENTENCE ISN'T CLOSED OFF IN CURLY

BRACKETS).

3. Repeat an earlier line that isn't closed off in curly brackets.


4. Make a new assumption.


 Mundane Rules: What Do You Have?

 Now that we have the steps for doing derivations, let me try to explain that confusing business about discharging assumptions. I'm going to approach this from a slightly

different angle this time.

Of the 12 rules I gave you, 8 are pretty straightforward. They are what I would call the

"Mundane Rules." The way Mundane Rules work is: they say "if you have X and Y andZ, then you are entitled to U."

The tricky thing with Mundane Rules is knowing what you "have."

You "have" any sentence that is written down on a line of the derivation except those

which are closed off in curly brackets (which are gone forever once the brackets close).

Being "entitled" to something just means that you can legally write it down as the nextline of the derivation.

The easiest Mundane Rule is repetition:

If you have

X

then you are entitled to

X

Another Mundane Rule is ^ introduction:

If you have

X

and you have

Y

then you are entitled to

(X ^ Y)

Simple enough. (I went into more detail on _why_ this is a sound rule in the last e-mail.)

Another pretty easy Mundane Rule is ^ elimination:

If you have

(X ^ Y)

then you are entitled to

X


Or, if you prefer, you are also entitled to

Y

So far so good.

Another Mundane Rule is -> elimination:

If you have

X

and you have

(X -> Y)

then you are entitled to

Y

This is actually the same thing as Modus Ponens, so you can call it that if you prefer.

Since I don't speak Latin, I prefer calling it "-> elimination" because that is more

descriptive of what the rule is doing.

Another Mundane Rule is <-> introduction:

If you have

(X -> Y)

and you have

(Y -> X)

then you are entitled to

(X <-> Y)

This one is a little tricky to explain. Let's assume somehow we have

(X -> Y)

Under what conditions could that be true? There are 3 possibilities:

X is true and Y is true

X is false and Y is true

X is false and Y is false

Also, we have

(Y -> X)

That can only be true under these conditions:

X is true and Y is true

X is true and Y is false

X is false and Y is false

Since we "have" both of these sentences, they must both be true. So under what

conditions are they both true? Well, only these two:


X is true and Y is true

X is false and Y is false

Which are exactly the conditions for:

(X <-> Y)

Which means we are entitled to write that.

Another Mundane rule is <-> elimination:

If we have

(X <-> Y)

and we have

X

then we are entitled to

Y

OR:

If we have

(X <-> Y)

and we have

Y

then we are entitled to

X

Another Mundane Rule is v introduction:

If you have

X

then you are entitled to

(X v Y)

and you are also entitled to

(Y v X)

This is a little tricky too. We "have" X, which means X must be true. Now we can just, out of the blue, pick any sentence we like and put it into a disjunction with X. Why can

we do that? Well, let's say we pick a FALSE sentence. Is that still okay?

Yes, it is! Even if Y is false, the disjunction with X is still true, so we haven't written a false sentence, and we are still okay.

The last Mundane Rule is v elimination:

If you have

(X v Y)

and you have


(X -> Z)

and you have

(Y -> Z)

then you are entitled to

Z

So much for the Mundane Rules. Mundane Rules are useful in derivations because they let you move from one step to the next. They tell you what you can do with the sentences

you have. They also can give you a hint as to what you need to do next. For example, if 

you have

(X v Y)

and you want to eliminate the v, but you don't have

(X -> Z)

(Y -> Z)

yet, then you'd better go get those two sentences.

The problem with the Mundane Rules is that they only let you play around with sentences you already HAVE. You can't get anything NEW out of them.

So far we've gone over the 8 Mundane Rules. There are 12 rules in total. Of the 4

remaining, one is a 'Special Rule' and three are 'Fun Rules'.


 Special Rule

The Special Rule is the rule of assumptions:

You are free to assume anything you like at any time as long as you do these things:

1. Use curly brackets and indentation to keep track of what you have assumed.

2. Only discharge the assumption using one of the Fun Rules.

[Discharging an assumption just means you close the curly brackets and stop indenting. So you can forget about the assumption.]


Three 'Fun' Rules

The three Fun Rules all have this form:

If you assume

X

and then, on that assumption, you derive

Y


You can discharge the assumption you made at X and then you are entitled to

Z

The first Fun Rule is -> introduction:

If you assume

X

And then, on that assumption, you derive

Y

You can discharge the assumption you made at X and then you are entitled to

(X -> Y)

This is, in my opinion, the most important and fundamental rule in logic. It is the

foundation of all logic. [It's also really important for derivations. If you look up at the Mundane Rules, a lot of them require you to have sentences of the form (X -> Y) to apply

them.]

The justification is that if you assume (but don't prove)

A

[The dog is brown]

and then, on that assumption, you derive

P

[The floor is wet]

then you HAVE NOT proven that the floor is wet, but you have PROVEN (no assumptions required) that

(A -> P)

[If the dog is brown, then the floor is wet.]

The last two Fun Rules are closely related. One is ~ introduction:

If you assume

X

And then, on that assumption, you derive

Y

and

~Y

you can discharge the assumption you made at X; then you are entitled to

~X

[You may have been taught this rule as a reductio ad absurdum. The idea is that if assuming X leads you to a contradiction, then there must've been something contradictory ABOUT X ITSELF. So X must be false. If X is false, then by definition ~X is true: no ifs, ands, buts, or assumptions about it.]


The last Fun Rule is ~ elimination:

If you assume

~X

and then, on that assumption, you derive

Y

and

~Y

you can discharge the assumption you made at ~X and then you are entitled to:

X

The idea here is basically the same. ~X is contradictory and therefore false, so X is

 proven true.

I have another nickname for ~ elimination. It is what I call the "Fallback Rule." With

every other rule, the way it works is by getting what you want by introducing the main

connective or by using what you have by eliminating the main connective. But take a look back at what I just did with ~ elimination. I just proved X is true. There's no way to

tell by looking at X that you can prove it by eliminating a ~, but you can. [Actually, ANY

sentence you like can be proven with ~ elimination, but it is sometimes hard to do.]

So this is where Step 5 of the derivations comes from. If you are trying to prove some sentence X, the first thing to try is to introduce the main connective of X. But if you

run into a dead end doing that, then assume ~X and try to derive a contradiction.

Those are the rules reviewed and better organized so that they make sense, and the

difficult bit about discharging assumptions is (I hope) a little clearer.

One other thing to watch out for. Some logic problems ask you to prove that a certain sentence is a logical truth. On those problems, you have to discharge all your assumptions

and prove that the sentence is true with no assumptions (that is, write it without indenting

and outside of all the curly brackets). I'll do an example of a derivation like that in a minute.


Deriving a Conclusion

Other logic problems give you a list of "givens" or "hypotheses" and ask you to derive a conclusion from them. In those problems, what they are saying is that you need to assume

the hypotheses, but not discharge those assumptions. Let me give you an example:

Given:

A

(A -> B)


(~B v C)

Prove:

C

Here we go:

{

1) A [assumption, given]

{

2) (A -> B) [assumption, given]

{

3) (~B v C) [assumption, given]

4) A [repetition of 1]

5) (A -> B) [repetition of 2]

6) B [->elim on 4&5]

 Now what do we do? We have a disjunction on 3 that we don't know what to do with, so

we need to eliminate it. But in order to eliminate it, we need to get (~B -> something) and

(C -> something).

Let's first work on getting (~B -> something).

{ [Note: this assumption isn't given,

so we're going to have to discharge it]

7) ~B [assumption]

Let's see if we can derive C. If we derive (~B -> C) then we'll be most of the way to

finishing the problem. How can we derive C? Well, we should try to introduce the main

connective. But wait! There is no main connective. C is just a simple sentence. So what can we do? I guess we have to do step 5, try ~elimination.

{ [another assumption we'll have to discharge]

8) ~C [assumption]

9) B [repetition of 6]

10) ~B [repetition of 7]

} [Closes off lines 8-10]

11) C [~elim on 8-10]


Up to this point we haven't closed off any assumptions. That means that all of the lines up

to this point were sentences that we "have" and can use. But now we just closed off lines 8-10 by discharging the assumption at 8. That means that lines 8-10 are gone; they are off-limits and illegal forever. The good news, though, is that we derived C, so now we can

} [Closes off 7-11]

12) (~B -> C) [->intro on 7-11]

This may seem like a bit of sleight of hand, like I'm trying to pull the wool over your 

eyes. How can I use the assumption I made at line 7 as part of the contradiction? I just did

a ~elimination to prove C, but there was nothing contradictory about ~C itself; the contradiction was that I had B and then I assumed ~B. This is the familiar refrain

"anything can be proven from a contradiction." Once I assumed ~B, I could've proven

(~B -> anything-I-want); I chose to prove (~B -> C) because I eventually want to get C.

Now we have made some progress on eliminating the disjunction we had on line 3: (~B v C). We have (~B -> C); now we need (C -> C), so let's go get it.

{

13) C [assumption]

14) C [repetition of 13]

 Notice that I can repeat 13 because I have not yet discharged that assumption. However, I

cannot repeat the C on line 11 because I closed off line 11 already, so it is gone forever.

} [Closes off 13-14]

15) (C -> C) [->intro on 13-14]

16) (~B v C) [repetition of 3]

17) (~B -> C) [repetition of 12]

18) (C -> C) [repetition of 15]

19) C [velim on 16,17,&18]

 Not all of those repetitions were necessary, since we "had" those lines already (they

hadn't been closed off), but I added them for clarity.

You should go back and double-check the derivation to make sure that I never broke the

rules by using a line that was closed off and that I didn't break any other rules. Also, make sure that I discharged all the assumptions except the three I was given at the start.



Deriving a Sentence

As I said before, the other type of problem is where you are handed a sentence and told to

derive it. In this problem, you can make any assumptions you need, but you have to discharge all of them and end up with the sentence you're looking for at the end. This usually involves finding the main connective of the sentence you're supposed to prove and then introducing it (sometimes you have to find other sentences too: for example, if the main connective is ^, you need to prove each half of the sentence and then do an ^intro).

Sometimes trying to do this will get you to a dead end, and then you may try to assume

the negation of the sentence you're trying to get and see if you can find a contradiction.

Let's do an example. Let's try to prove

(((~A ^ ~C) v (~C <-> B)) -> (B -> ~C))

The main connective is a ->, so let's introduce it. To do that we need to assume the left and derive the right.

{

1) ((~A ^ ~C) v (~C <-> B)) [assumption]

Here we have a disjunction, so we need to eliminate it. That means we need to find two entailments. It would be great if we had ((~A ^ ~C) -> (B -> ~C)) and ((~C <-> B) -> (B -> ~C)), so let's try to get those. First, let's work on ((~A ^ ~C) -> (B -> ~C)).

{

2) (~A ^ ~C) [assumption]

3) ~A [^elim on 2]

4) ~C [^elim on 2]

We actually don't need line 3, but it's good practice to get both sides of an ^ while you can, just in case you might need them later. Now we have ~C, but what we want is (B -> ~C), so let's work toward getting that.

~C), so let's work toward getting that.

{

5) B [assumption]

6) ~C [repetition of 4]

} [closes 5-6]

7) (B -> ~C) [->intro on 5-6]


} [closes 2-7]

8) ((~A ^ ~C) -> (B -> ~C)) [->intro on 2-7]

So now what we need to finish the velimination on line 1 is to derive ((~C <-> B) -> (B -> ~C)).

{

9) (~C <-> B) [assumption]

So now we need (B -> ~C).

{

10) B [assumption]

11) ~C [<->elim on 9-10]

If you wanted to, you could make this a little more clear by repeating (~C <-> B) and then doing the <->elim, but it's not necessary: since we have both (~C <-> B) and B, we are entitled to ~C.

} [closes 10-11]

12) (B -> ~C) [->intro on 10-11]

} [closes 9-12]

13) ((~C <-> B) -> (B -> ~C)) [->intro on 9-12]

 Now we can do our velimination on line 1 because we have ((~C <-> B) -> (B -> ~C))

and ((~A ^ ~C) -> (B -> ~C)). If you want, for clarity, you can repeat line 1 and line 8,

 but it's not necessary.

14) (B -> ~C) [velim on 1,8,13]

}

15) (((~A ^ ~C) v (~C <-> B)) -> (B -> ~C)) [->intro on 1-14]

As always, you should go back and double-check the derivation once you're done.

The hardest thing about doing derivations is figuring out what to do next. When you have

a lot of random rules with Latin names to choose from, it's difficult. This set of rules helps you to know what to do by either introducing what you're trying to get or

eliminating what you have.

My advice to you is to try to do some of the derivations in your book or that you had for 

your class using these rules. Any derivation is possible with them. It takes a lot of time to


learn logic and have it sink in, but if you take it slowly enough and practice, it will

 become easier. Get comfortable with these 12 basic rules and the 5-step method for doing

derivations.


Rules with Latin Names

In many (probably most) places, you don't learn logic with these 12 rules. Even though I think these rules make the most sense and allow a straightforward approach to solving any problem in symbolic logic, more advanced students may want to study rules such as Modus Tollens and DeMorgan's Law.

The twelve rules I've presented here are systematic and straightforward, and all of them

move by baby steps. The way I think of these rules (in some cases this is not historically accurate) is that logicians noticed that when doing derivations, they often repeated the same steps over and over. Eventually, someone decided that rather than repeating the same five or ten steps, you could take shortcuts.

Modus Tollens serves as an instructive example. Let's say I have:

(A -> B)

And I have:

~B

And I am trying to get:

~A

Here's what I'd have to do. Since I'm trying to find ~A, I'll do a ~introduction on A:

{

1) (A -> B) [assumption, given]

{

2) ~B [assumption, given]

{

3) A [new assumption]

4) B [->elim on 1 and 3]

5) ~B [repetition of 2]

} [closes off 3-5]

6) ~A [~intro 3-5]

We have to do this so often that we just call these five steps "Modus Tollens." Modus Tollens is a shortcut rule. There are several others too, some more involved.


The following is a list of the major rules, together with a justification of why each of 

them is valid and a short example of how you might use some of the more challenging

ones.


1. Modus Ponens

This is one of the most straightforward laws in logic. It states that if you have

(X -> Y)

and you have

X

then you are entitled to:

Y

This is just what we've been calling "-> elimination."

The reason it works is that we are given (X -> Y), which means that X cannot be true at

the same time Y is false. So if X is true (which is the other given), then Y must be true as

well, so we are free to conclude that Y is true.

Example: "If it is raining, then there are clouds" and "it is raining" together imply "there

are clouds."
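
A brute-force truth-table check in Python (my own illustration, not part of the original text) confirms the rule is valid: in every row where both premises hold, the conclusion holds.

    from itertools import product

    # Modus Ponens: wherever (X -> Y) and X both hold, Y must hold.
    for x, y in product([True, False], repeat=2):
        if ((not x) or y) and x:
            assert y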


2. Modus Tollens

This law is just the flip side of modus ponens. It states that if you have

(X -> Y)

and you have

~Y

then you are entitled to

~X


The reason this works is that we are again given (X -> Y). This means that X cannot be

true at the same time Y is false. So if Y is false (which is the other given), then X must be

false as well. So we are free to conclude X is false (or ~X is true).

Example: "If it is raining, then there are clouds" and "there are no clouds" together imply

"it is not raining."


3. DeMorgan's Law (I)

DeMorgan came up with a couple of sets of equivalences. The first is that if you have

~(X ^ Y)

then you can conclude

(~X v ~Y)

and if you have

(~X v ~Y)

then you can conclude

~(X ^ Y)

The reason this works is that our starting point is ~(X ^ Y), which is the negation of (X ^ Y).

Now, (X ^ Y) can only be true if X is true and Y is also true. So (X ^ Y) will be false

if X is false or if Y is false. That is, (X ^ Y) will be false if (~X v ~Y) is true. So ~(X ^ Y)

is equivalent to (~X v ~Y).

Example: "My dog is fat, or my cat is fat" is equivalent to "It is not true that both my dog

and cat are thin."
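
Since this is an equivalence rather than a one-way inference, a Python check (my own sketch) can compare the two sides directly:

    from itertools import product

    # DeMorgan (I): ~(X ^ Y) and (~X v ~Y) agree in every row of the truth table.
    for x, y in product([True, False], repeat=2):
        assert (not (x and y)) == ((not x) or (not y))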


4. DeMorgan's Law (II)

The second equivalence which bears DeMorgan's name is that

~(X v Y)

is interchangeable with

(~X ^ ~Y)


The only way in which ~(X v Y) can be true is if X and Y are both false. So the two

expressions can be interchanged just like in the first law.

Example: "My dog is fat and my cat is fat" is equivalent to "It is not true that my dog or cat is thin."


5. Hypothetical Syllogism

The rule here is that if you have

(X -> Y)

and you have

(Y -> Z)

then you can conclude

(X -> Z)

Here's why:

We know that "if X is true, then Y is true." And we know that "if Y is true, then Z is true."

But we don't know anything about whether any of the letters are actually true or not.

Let's assume (or hypothesize) for a second that X is true. Then, by modus ponens, Y is

true. And then, by modus ponens again, Z is true. So: if we assume X is true, then we

conclude Z is true. Since we didn't know X was true, we cannot take Z home with us, but

we can say "if X were true, then Z would be true." This is equivalent to saying "If X,

then Z," or (X -> Z).

Example: "If it is raining, then there are clouds" together with "if there are clouds, then

the sun will be blocked" imply "if it is raining, then the sun will be blocked."
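
This one involves three letters, so the brute-force check (my own illustration) runs over eight rows instead of four:

    from itertools import product

    # Hypothetical Syllogism: wherever (X -> Y) and (Y -> Z) hold, (X -> Z) holds.
    for x, y, z in product([True, False], repeat=3):
        if ((not x) or y) and ((not y) or z):
            assert (not x) or z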


6. Disjunctive Syllogism

The rule here is that if you have

(X v Y)

and you have

~X


then you can conclude

Y

Here's why:

We know first of all that "X or Y is true." We also know that X is false. If X or Y is true,

and X is false, then Y has no choice but to be true. So we can conclude that Y is true.

Example: "My dog is fat or my cat is fat" together with "my dog is thin" imply "my cat is

fat."


7. Reductio Ad Absurdum (Proof by Contradiction)

This rule states that if you assume

X

and, from that, you conclude a contradiction, such as

(Y ^ ~Y)

then you can conclude that your assumption was false, and

~X

must be true. You can find a more complete explanation of this at

Proof by Contradiction: http://mathforum.org/library/drmath/view/62852.html
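
One way to see the rule's validity (a sketch of mine, recasting the rule as a tautology) is to check that "X implies a contradiction, therefore ~X" holds in every row:

    from itertools import product

    # Reductio: ((X -> (Y ^ ~Y)) -> ~X) holds in every row of the truth table.
    for x, y in product([True, False], repeat=2):
        x_implies_contradiction = (not x) or (y and (not y))
        assert (not x_implies_contradiction) or (not x)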


8. Double Negation

This rule simply states that if you have

~~X

then you can interchange that with

X

which should be apparent from what the ~ means.



9. Switcheroo

(I've heard that this was actually named after a person, but I don't know that for certain.)

This is a shortcut rule which states that if you have

(X v Y)

then you can interchange that with

(~X -> Y)

To understand why, let's think about (~X -> Y). This says that ~X cannot be true at the

same time that Y is false. Or, to put that another way, X cannot be false at the same time

Y is false.

So (~X -> Y) can only be false when X and Y are both false. Similarly, the only way for

(X v Y) to be false is to have X and Y both false. So the two expressions are true unless X

and Y are both false, so they have the same "truth conditions" and are therefore

equivalent (i.e. interchangeable).

Example: "My dog is fat, or my cat is fat" is equivalent to "If my dog is thin, then my cat

is fat."

(This one is hard to wrap your mind around, but think about what must be true or false

about the world in order to make each statement true or false, and it should eventually

become clear.)
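
A truth-table comparison in Python (my own sketch) may also help:

    from itertools import product

    # Switcheroo: (X v Y) and (~X -> Y) have the same truth conditions.
    for x, y in product([True, False], repeat=2):
        assert (x or y) == ((not (not x)) or y)  # right side encodes (~X -> Y)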


10. Disjunctive Addition

This is just what we've been calling "v introduction."


11. Simplification

This is just what we've been calling "^ elimination."


12. Rule of Joining 

This is just what we've been calling "^ introduction."


The problem with shortcut rules is that they're easy to misuse. In my opinion, the best

way to learn them is to practice with the twelve systematic rules; if you find yourself

doing the same steps over and over, you may have found a shortcut rule.

If there's a rule you don't understand, try to use the twelve systematic rules to figure out

how the rule works. Once you see the steps in deriving the rule and you know why it is a

valid shortcut, you won't have any trouble using it. And remember, if you get stuck and don't know what to do, you can always fall back on the twelve systematic rules.

Fuzzy or "multi-valued" logic is a variation of traditional logic in

which there are many (sometimes infinitely many) possible truth values

for a statement. True is considered equal to a truth value of 1, false

is a truth value of 0, and the real numbers between 1 and 0 are

intermediate values.

What is Fuzzy Logic?

The easy definition is that fuzzy logic is a kind of logic in which

propositions don't have to be either true or false.

In normal binary logic, the answer to a question like "Is Joe tall?"

would have to be either "yes" or "no" - either 1 or 0. In terms of

attributes, Joe would either have "tallness," or he wouldn't.

This is one of the things that makes binary logic break down so easily

when you try to apply it to the real world, where people are "sort of

tall," food is "mostly cooked," cars are "pretty fast," jewelry is

"very expensive," patients are "barely conscious," and so on.

To paraphrase Einstein: to the extent that binary logic applies to reality, it is not

certain; and to the extent that it is certain, it doesn't apply to reality.

In fuzzy logic, Joe can have a tallness value of, say, 0.9, which can

combine with values for other attributes to produce "conclusions" that

look more like "The clothes have a dryness value of 0.91" than "The

clothes are dry" or "The clothes are not dry." It is frequently used

to control physical processes - washing or drying clothes, toasting

bread, bringing trains to smooth stops at the right places, keeping

planes on course, and so on. It's also used to support decisions -

whether to buy or sell stock, whether to support or oppose a

particular political position, and so on.

In short, fuzzy logic provides an alternative to binary logic that is useful

whenever you want to be able to express attributes in shades of gray,

rather than as black or white.

As for the more advanced question of connectives in fuzzy logic,

here are what I believe are the generally accepted rules.

You start with the basic connectives in symbolic logic:


^ (and)

v (or)

-> (if, then)

<-> (if and only if)

~ (not)

And extend them.

Let's start with ^ (and). In Boolean logic (A ^ B) is true if and only

if both A is true and B is true. We can say that in a different way:

(A ^ B) has a value of 1 if A has a value of 1 and B has a value

of 1; if either A or B has a value of 0, then the conjunction

has a value of 0.

So the way this is traditionally extended to fuzzy logic is to say

that the conjunction (A ^ B) carries the minimum of the truth values

of A and B.

For example, if A has a truth value of 1 and B has a truth value of

0.8, then the minimum of these two is 0.8, and the conjunction (A ^ B)

will carry the truth value of 0.8.
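
As a one-line sketch in Python (the name fuzzy_and is my own, not a standard library function):

    def fuzzy_and(a, b):
        """Truth value of (A ^ B): the minimum of the two truth values."""
        return min(a, b)

    print(fuzzy_and(1.0, 0.8))  # 0.8, matching the example above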

Next, let's talk about v (or). In Boolean logic (A v B) is true if A

is true or if B is true or if both are true. We can say that in a

different way:

(A v B) has a value of 1 if A has a value of 1 or B has a value

of 1; if both A and B have a value of 0, then the disjunction has

a value of 0.

So the way this is traditionally extended to fuzzy logic is to say

that the disjunction (A v B) carries the maximum of the truth values

of A and B.

For example, if A has a truth value of 0.2 and B has a truth value of

0.6, then the maximum of these is 0.6 and the disjunction (A v B) will

carry the truth value of 0.6.
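
The matching sketch for disjunction (again, the name is mine):

    def fuzzy_or(a, b):
        """Truth value of (A v B): the maximum of the two truth values."""
        return max(a, b)

    print(fuzzy_or(0.2, 0.6))  # 0.6, matching the example above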

Next, let's talk about ~ (not). Unlike the other connectives, ~

doesn't join two sentences; it is only applied to a single sentence.

In Boolean logic, ~A carries the opposite truth value of A. So it is

false if A is true and true if A is false.

Or: ~A has a value of 0 if A has a value of 1 and a value of 1 if A

has a value of 0.

The way this is extended to fuzzy logic is to say that ~A has a truth

value equal to 1 minus the truth value of A.

So, for example, if A has a value of 0.3, then 1 - 0.3 = 0.7, so ~A

has a value of 0.7.
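
In the same sketched style:

    def fuzzy_not(a):
        """Truth value of ~A: 1 minus the truth value of A."""
        return 1 - a

    print(fuzzy_not(0.3))  # 0.7, matching the example above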

Next, let's talk about -> (if, then). In Boolean logic, (A -> B) is

ONLY false if A is true AND B is false; in all other cases it is true.


Or, (A -> B) carries a value of 1 UNLESS A has a value of 1 and B has

a value of 0. So, if A has a value of 0, then (A -> B) will definitely

have a value of 1; or if B has a value of 1 then (A -> B) will

definitely have a value of 1.

So (A -> B) is at least as true as the opposite of A, and it is also

at least as true as B. So in Boolean logic, (A -> B) has a truth value

equal to the maximum of ~A and B.

Fuzzy logic uses this definition. The truth value of (A -> B) is equal

to the maximum of the truth value of ~A and the truth value of B.

For example, if B has a truth value of 0.5 and A has a truth value of

0.4, then ~A has a value of 0.6. The maximum of B (0.5) and ~A (0.6)

is 0.6, so (A -> B) will have a value of 0.6.
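
And as a sketch (the function name is mine):

    def fuzzy_implies(a, b):
        """Truth value of (A -> B): the maximum of (1 - a) and b."""
        return max(1 - a, b)

    print(fuzzy_implies(0.4, 0.5))  # 0.6, matching the example above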

The last connective, <-> (if and only if), is the hardest to extend to

fuzzy logic.

In Boolean logic, (A <-> B) is true if both A and B have the same

truth value, and false otherwise. Or you can say:

(A <-> B) has a value of 1 if A and B both have a value of 1 or

if A and B both have a value of 0, and it has a value of 0

otherwise (i.e. if A has a value of 0 and B has 1 or vice versa).

My first guess for how to apply this to fuzzy logic was that it is

simply an equals relation. So if A and B have equal values, then

(A <-> B) will have a value of 1; otherwise it will have a value of 0.

This is okay as a first guess, but the problem is that now (A <-> B)

can only have 1 or 0 as a value, which is not really very fuzzy at

all.

Let's go back to Boolean logic for a minute and think a bit more about

(A <-> B). This is read "A if and only if B." It means that if we know

A is true, then we can conclude that B is true, AND if we know B is

true, then we can conclude that A is true. In other words, (A <-> B)

is equivalent to the conjunction of (A -> B) and (B -> A) or to put it

more formally:

(A <-> B) = ((A -> B) ^ (B -> A))

But we've already figured out how to do fuzzy logic on these

connectives. So let's just apply that to our <-> connective.

Specifically:

(A <-> B) has a value equal to the minimum (conjunction) of:

(A -> B)

and

(B -> A)

So first you figure out the value of (A -> B) (which is the maximum

of ~A and B), then you figure out the value of (B -> A) (which is the

maximum of ~B and A), and then you take the minimum of those.
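
Putting that together as a sketch (names mine, building on the min/max rules above):

    def fuzzy_iff(a, b):
        """(A <-> B) as the minimum of the two implications."""
        a_implies_b = max(1 - a, b)  # value of (A -> B)
        b_implies_a = max(1 - b, a)  # value of (B -> A)
        return min(a_implies_b, b_implies_a)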

That's pretty complicated, so before we do an example calculation with


fuzzy logic, let's make sure it works with two-valued (Boolean) logic.

Let's say that A and B both have truth values of 0. What value does

(A <-> B) have in this case?

First, we take the maximum of ~A and B. ~A will have a value of 1 and

B will have a value of 0. So the maximum is 1.

Second, we take the maximum of ~B and A. ~B will have a value of 1 and

A will have a value of 0. So the maximum is 1.

Finally, we take the minimum of the first two steps above. Step one

gave us 1 and step two gave us 1, so the minimum is 1.

What if A has a truth value of 1 and B has a truth value of 0?

First, we take the maximum of ~A and B. ~A will have a value of 0 and

B will have a value of 0. So the maximum is 0.

Second, we take the maximum of ~B and A. ~B will have a value of 1 and

A will have a value of 1. So the maximum is 1.

Finally, we take the minimum of the first two steps above. Step one

gave us 0 and step two gave us 1, so the minimum is 0.

For a fuzzy logic example, let's say that A has a value of 0.7 and B

has a value of 0.5. What is the value of (A <-> B)?

First, we take the maximum of ~A and B. ~A will have a value of 0.3

and B will have a value of 0.5. So the maximum is 0.5.

Second, we take the maximum of ~B and A. ~B will have a value of 0.5

and A will have a value of 0.7. So the maximum is 0.7.

Finally, we take the minimum of the first two steps above. Step one

gave us 0.5 and step two gave us 0.7, so the minimum is 0.5. So in

this example, (A <-> B) has a truth value of 0.5.
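
All three walkthroughs can be re-checked with the fuzzy_iff sketch (repeated here so the snippet stands alone):

    def fuzzy_iff(a, b):
        return min(max(1 - a, b), max(1 - b, a))

    assert fuzzy_iff(0, 0) == 1        # Boolean case: both false
    assert fuzzy_iff(1, 0) == 0        # Boolean case: A true, B false
    assert fuzzy_iff(0.7, 0.5) == 0.5  # the fuzzy example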