
INSTITUTE OF PHYSICS PUBLISHING NETWORK: COMPUTATION IN NEURAL SYSTEMS

Network: Comput. Neural Syst. 12 (2001) 441–472 PII: S0954-898X(01)25412-8

Neural coding and decoding: communication channels and quantization

Alexander G Dimitrov and John P Miller

Center for Computational Biology, Montana State University, Bozeman, MT 59717-3148, USA

E-mail: [email protected] and [email protected]

Received 19 July 2000, in final form 15 March 2001
Published 15 August 2001
Online at stacks.iop.org/Network/12/441

Abstract
We present a novel analytical approach for studying neural encoding. As a first step we model a neural sensory system as a communication channel. Using the method of typical sequences in this context, we show that a coding scheme is an almost bijective relation between equivalence classes of stimulus/response pairs. The analysis allows a quantitative determination of the type of information encoded in neural activity patterns and, at the same time, identification of the code with which that information is represented. Due to the high dimensionality of the sets involved, such a relation is extremely difficult to quantify. To circumvent this problem, and to use whatever limited data set is available most efficiently, we use another technique from information theory: quantization. We quantize the neural responses to a reproduction set of small finite size. Among many possible quantizations, we choose one which preserves as much of the informativeness of the original stimulus/response relation as possible, through the use of an information-based distortion function. This method allows us to study coarse but highly informative approximations of a coding scheme model, and then to refine them automatically when more data become available.

1. Introduction

One of the steps toward understanding the neural basis of an animal's behaviour is characterizing the code with which its nervous system represents information. All computations underlying an animal's behavioural decisions are carried out within the context of this code. A determination of the neural coding schemes is an extremely important goal, due not only to our interest in the nature of the code itself, but also to the constraints that this knowledge places on the development of theories for the biophysical mechanisms underlying neural computation [19].

0954-898X/01/040441+32$30.00 © 2001 IOP Publishing Ltd Printed in the UK 441


Deciphering the neural code of a sensory system is often reduced to a few interconnected tasks. One task is to determine the correspondence between neural activity and sensory signals. That task can be reduced further to two interrelated problems: determining the specific stimulus parameters or features encoded in the neural ensemble activity and determining the nature of the neural symbols with which that information is encoded. An ancillary task is to define quantitative measures of significance with which the sensory information and associated neural symbols are correlated. Considerable progress has been made by approaching these tasks as independent problems. Approaches that we and others have taken include stimulus reconstruction [26, 35] and the use of impoverished stimulus sets to characterize stimulus/response properties [15, 36]. Recent work also provided direct estimates of information-theoretic measures of correlations between stimulus and response [5, 33], while completely bypassing the problem of identifying the stimulus. However, independent treatment of these interconnected tasks often introduces multiple assumptions that prevent their complete solution. Some of these approaches start with an assumption of the relevant codewords (e.g. a single spike in the first-order stimulus reconstruction method, or the mean spike rate over a defined interval) and proceed by calculating the expected stimulus features that are correlated with these codewords. Other approaches make an assumption about the relevant stimulus features (e.g. moving bars and gratings in the investigation of parts of the visual cortex) and proceed to identify the pattern of spikes that follow the presentation of these features.

We have developed an analytical approach that takes all three tasks into consideration simultaneously. Specifically, the aim of this approach is to allow a quantitative determination of the type of information encoded in neural activity patterns and, at the same time, identify the code with which that information is represented.

There are two specific goals of this paper. The first is to formulate a model of a neural sensory system as a communication channel (section 2). In this context we show that a coding scheme consists of classes of stimulus/response pairs which form a structure akin to a dictionary: each class consists of a stimulus set and a response set, which are synonymous. The classes themselves are almost independent, with few intersecting members. The second goal is to design a method for discovering this dictionary-like structure (section 3). To do this, we quantize the neural responses to a reproduction set of small finite size. Among the many possible quantizations, we choose one which preserves as much of the informativeness of the original stimulus/response relation as possible, through the use of an information-based distortion function.

We start with the observation that the neural code must satisfy two conflicting demands. On one hand, the organism must recognize the same natural object as identical in repeated exposures. In this sense, the signal processing operations of the organism need to be deterministic at this 'behavioural' level. On the other hand, the neural coding scheme must deal with projections of the sensory environment to a smaller stimulus space and uncertainties introduced by external and internal noise sources. Therefore, the neural signal processing must, by necessity, be stochastic on a finer scale. In this light, the functional issues that confront the early stages of any biological sensory system are very similar to the issues encountered by communication engineers in their work of transmitting messages across noisy media.

With this in mind we model the input/output relationship present in a biological sensory system as a communication channel [31]. Although this approach has been suggested before [2, 3], to our knowledge not all the properties that information theory assigns to this object have been fully appreciated in the neural research literature. The principal method that we present is based on the identification of jointly typical sequences ([6], appendix A.4) in the stimulus/response sets. Joint typicality is a rigorously defined concept used extensively in information theory [6] and described in more detail in section 2.3. We use this technique to elucidate the structure of a neural stimulus/response relation. Although our coding model is stochastic, we demonstrate how an almost deterministic relation emerges naturally on the level of clusters of stimulus/response pairs.

To investigate such a model we consider the possibility that temporal patterns of spikes across a small ensemble of cells are the basic elements of information transmission in such a system. Due to the high dimensionality of the sets involved, such a relation is extremely difficult to quantify [17, 33, 38]. To circumvent this problem, and to use whatever limited data set is available most efficiently, we quantize the neural responses to a reproduction set of small finite size (section 3.1). Quantization is another standard technique from information theory ([6, 14], appendix A.5.2). By quantizing to a reproduction space, the size of which is sufficiently small, we can ensure that the data size requirements are much diminished.

Among many possible quantizations, we choose one which preserves as much of the informativeness of the original stimulus/response relation as possible, through the use of an information-based distortion function (section 3.2). The relationship between stimulus and reproduction will be an approximation of the coding scheme described above. This method allows us to study coarse but highly informative models of a coding scheme, and then to automatically refine them when more data become available. Ultimately, a simple description of the stimulus/response relation can be recovered (section 3.4).

2. Neural information processing

2.1. Neural systems as communication channels

Communication channels characterize a relation between two random variables: an input and an output. For neural systems, the output space is usually the set of activities of a group of neurons. The input space can be sensory stimuli from the environment or the set of activities of another group of neurons. We would like to recover the correspondence between stimuli and responses, which we call a coding scheme [34]. For any problem that has inputs and outputs, a coding scheme is a map (correspondence) from the input space to the output space. A decoding scheme is the inverse map. In general, both maps can be probabilistic and many to one. We will provide a more precise definition, which describes the situation when they can be considered almost deterministic and almost bijective.

The early stages of neural sensory processing encode information about sensory stimuli into a representation that is common to the whole nervous system. We will consider this encoding process within a probabilistic framework [3, 8, 18, 26]: the signal X is produced by a source with probability p(x). For example this may be a sensory stimulus from the animal's environment, or the activity of a set of neurons. The encoder q(y|x) is a stochastic map from one stochastic signal to another. This will model the operations of a neuronal layer. The output signal Y is produced by q with probability p(y). This is the temporal pattern of activity across a set of cells. This description is naturally cast in the formalism of information theory. If we consider a neuron or a group of neurons as a communication channel, we can apply the theory almost directly for insights into the operation of a neural sensory system, including a detailed picture of the correspondence between sensory stimuli and their neural representations.
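This probabilistic setup is easy to make concrete. The sketch below is a minimal model of our own devising (the stimulus and response labels are invented): a two-symbol source p(x), a stochastic encoder q(y|x), the response marginal p(y), and the resulting mutual information I(X;Y).

```python
import math

# Illustrative discrete channel: stimulus X, stochastic encoder q(y|x),
# response Y.  All names and numbers here are invented for illustration.

p_x = {'s0': 0.5, 's1': 0.5}                 # stimulus prior p(x)
q = {                                        # encoder q(y|x)
    's0': {'r0': 0.9, 'r1': 0.1},
    's1': {'r0': 0.2, 'r1': 0.8},
}

# Response marginal p(y) = sum_x p(x) q(y|x)
p_y = {}
for x, px in p_x.items():
    for y, qyx in q[x].items():
        p_y[y] = p_y.get(y, 0.0) + px * qyx

# Mutual information I(X;Y) = sum_{x,y} p(x) q(y|x) log2( q(y|x) / p(y) )
I = sum(px * qyx * math.log2(qyx / p_y[y])
        for x, px in p_x.items()
        for y, qyx in q[x].items())
```

For this particular encoder the channel conveys roughly 0.4 bits per use; a noiseless encoder would convey the full 1 bit of the binary stimulus.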

Our basic assumption is that sensory stimuli are represented by patterns of spikes in one or several neurons. The number of neurons comprising an information channel, the temporal extent of the patterns and the temporal precision of spikes are parameters which can be determined from data [9]. We can formulate hypotheses about particular coding schemes by fixing the values of these parameters. Our initial assumption is that all distinct patterns of fixed length and precision may be important. Some ideas about information transmission in the presence of noise allow us to group patterns into larger classes and consider coding with these equivalence classes. This scheme is general, and encompasses most of the existing hypotheses about the nature of neural coding as special cases. Once equivalence classes are uncovered, they can be further analysed with regard to formal rules and physiological mechanisms which can describe them more succinctly. In certain cases these may turn out to be one of the above commonly considered coding schemes.

Sensory stimuli and their neural representation can be quite complex. Information theory suggests ways for dealing with this complexity by extracting the essential parts of the signal while maintaining most of its information content. To achieve this, we use the method of typical sequences. What follows is an informal discussion of their properties in relation to coding and our model of a sensory system. Many terms here are preceded by 'almost', 'nearly', 'close to' etc, which describe events that occur with high probability. The following three sections (2.2-2.4) discuss standard topics from information theory. See appendix A.3 for formal definitions and properties and [6] for a more complete treatment.

2.2. Typical sequences

The basic concepts of information theory are the entropy H(X) and the mutual information I(X;Y) of random sources (X, Y) ([6, 31], appendix A.2). When used in the analysis of communication systems, they have very specific meanings [31]. Consider a random source X. All the sequences of length n form the nth extension X^n of X. X^n can be described well by using about 2^{nH(X)} distinct messages. These are the typical sequences of X^n. Typical sequences (events) comprise the typical set and are nearly equiprobable. The typical set of X has probability near unity, and the number of elements in it is nearly 2^{nH(X)} (theorem, appendix A.2). We can summarize this by saying 'Almost all events are almost equally surprising' [6]. This enables us to divide the set of all sequences into two sets: the typical set, where the entropy of a sequence is close to the entropy of the whole set, and the nontypical set, which contains the rest of the sequences. Any property that is proved for the typical sequences will be true with high probability and will determine the average behaviour of a large sample of sequences. The functional property of the entropy H(X) here is to give the approximate size of the 'relevant' set of events in X^n.

To illustrate, consider x ∈ X ≡ {0, 1}, p(x) = [p, q], p + q = 1 (the Bernoulli source). An example of a particular sequence of length 12 is (010001011010). The n-typical sequences are all the sequences of length n with approximately np zeros and nq ones. Notice that, despite its simplicity, this process can be used as a model of neural activity, where the presence of a spike is marked by one and its absence by zero. Spikes in a sequence are usually not independent of each other [22], so this should be considered at most a zeroth-order approximation.
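The Bernoulli example can be checked by brute force for small n. The sketch below (parameters are illustrative, not from the paper) enumerates all length-12 binary sequences, keeps those whose fraction of ones is within ε of q, and measures the probability mass this typical set captures.

```python
import itertools
import math

# Bernoulli source with P(0) = p, P(1) = q = 1 - p.  Count the
# n-typical sequences (empirical frequency of ones within eps of q)
# and the probability mass they capture.  Parameters are illustrative.

p, n, eps = 0.7, 12, 0.1
q = 1 - p
H = -p * math.log2(p) - q * math.log2(q)     # entropy in bits

typical = [s for s in itertools.product([0, 1], repeat=n)
           if abs(sum(s) / n - q) <= eps]

# Probability of a sequence with k ones is p^(n-k) * q^k
prob = sum(p ** (n - sum(s)) * q ** sum(s) for s in typical)
```

For n = 12 this typical set captures just under half the probability mass; the theorems quoted above guarantee that the mass approaches unity as n grows, while the count of typical sequences scales as 2^{nH}.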

Typical sequences are applicable to continuous random variables as well. Let x ∈ N(0, σ) (x is drawn from a normally distributed random variable with mean zero and variance σ²). An n-sequence of these variables (x_1, ..., x_n) ≡ x^n ∈ N(0^n, σ I_n) is obviously drawn from an n-dimensional normal distribution. The n-typical sequences here are all x^n with ‖x^n‖_2 ≤ √(n(σ² + ε)), that is, all the points contained in an n-ball with the given radius. This property can be extended to general normal distributions, with a full covariance matrix, in which case the typical set lies within the n-ellipsoid determined by the principal directions and eigenvalues of the covariance matrix.
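A quick Monte Carlo check of the Gaussian case (all parameters below are illustrative choices of ours): draw n-sequences from N(0, σ²) and count how often they fall inside the typical n-ball of radius √(n(σ² + ε)).

```python
import math
import random

# Draw n-sequences from N(0, sigma^2) and check membership in the
# typical n-ball of radius sqrt(n (sigma^2 + eps)).  Illustrative only.

random.seed(0)
sigma, n, eps, trials = 1.0, 500, 0.2, 200
radius = math.sqrt(n * (sigma ** 2 + eps))

inside = 0
for _ in range(trials):
    x = [random.gauss(0.0, sigma) for _ in range(n)]
    if math.sqrt(sum(v * v for v in x)) <= radius:
        inside += 1

frac = inside / trials      # close to 1 for large n
```

With n = 500 nearly every sample lands inside the ball, in line with the high-probability statements above.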

Page 5: Neural coding and decoding: communication channels and ...robotics.caltech.edu/~zoran/Reading/dimitrov01.pdf · Neural coding and decoding 443 to elucidate the structure of a neural

Neural coding and decoding 445

[Figure 1 (schematic): a stimulus S enters the early sensory areas (visual, auditory, mechanosensory), which act as the encoder producing u(S, O); this passes through an output channel p(O|u), corrupted by noise N_O, to yield the output O; higher neural areas act as the decoder, leading to behaviour.]

Figure 1. Neural sensory systems as information channels.

2.3. Jointly typical sequences

When analysing information channels, we deal with two sets of random sequences: input and output. In this case it is necessary to consider the combined behaviour of the pair (X, Y). We approach this by using jointly typical sequences ([6], appendix A.4). For a pair of sources (X, Y), there are about 2^{nH(X,Y)} jointly typical sequences of pairs (typical in the product space (X^n, Y^n)). As with typical sequences, all the elements are nearly equiprobable and the set has probability close to unity (appendix A.4). The correspondence between input and output sequences is not necessarily one to one, due to noise or mismatch between the stimulus and response sizes.

Not all pairs of typical x^n and typical y^n are also jointly typical. For each typical sequence in X^n, there are about 2^{nH(Y|X)} sequences in Y^n which are jointly typical with it, and vice versa. The probability that a randomly chosen pair is jointly typical is about 2^{-nI(X;Y)} [6]. Hence, for a fixed y^n, we can consider about 2^{nI(X;Y)} x^n before we are likely to come across a jointly typical pair. This suggests there are about 2^{nI(X;Y)} distinguishable messages (codewords) in X^n that can be communicated through Y^n (figure 2). Thus the knowledge of I(X;Y) places important bounds on the performance of any coding scheme.

The structure of a coding scheme in this framework can be seen intuitively by the following argument [6].

For each (typical) input n-sequence x^n, there are about 2^{nH(Y|X)} possible Y (jointly typical) sequences, all of which are approximately equally likely. The total number of possible (typical) Y sequences is about 2^{nH(Y)}. In order to ensure that no two X sequences produce the same Y output sequence, we need to divide the output set into chunks of size about 2^{nH(Y|X)}, corresponding to the different input X sequences. The total number of disjoint sets after this procedure is about 2^{n(H(Y)-H(Y|X))} = 2^{nI(X;Y)}. Hence we can transmit about 2^{nI(X;Y)} distinguishable X sequences of length n.
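The counting identity at the heart of this argument, I(X;Y) = H(Y) − H(Y|X), can be verified numerically for a small channel. The channel below is an illustrative choice of ours, not one from the paper.

```python
import math

# Numeric check of I(X;Y) = H(Y) - H(Y|X) and of the
# 2^{n I(X;Y)} count of distinguishable classes.  Illustrative values.

p_x = {'a': 0.5, 'b': 0.5}
q = {'a': {'0': 0.9, '1': 0.1},      # channel q(y|x)
     'b': {'0': 0.1, '1': 0.9}}

p_y = {}
for x, px in p_x.items():
    for y, qyx in q[x].items():
        p_y[y] = p_y.get(y, 0.0) + px * qyx

H_Y = -sum(py * math.log2(py) for py in p_y.values())
H_Y_given_X = -sum(px * qyx * math.log2(qyx)
                   for x, px in p_x.items()
                   for y, qyx in q[x].items())
I = H_Y - H_Y_given_X                # here 1 - H(0.1), about 0.53 bits

n = 100
num_classes = 2 ** (n * I)           # about 2^{n I(X;Y)} classes
```

For this symmetric channel H(Y) is exactly one bit and H(Y|X) is the binary entropy of the flip probability, so roughly 2^{53} length-100 sequences are distinguishable.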

2.4. Coding and decoding with jointly typical sequences

Let the jointly typical pairs (x^n, y^n) represent related stimulus/response signals. Since there are 2^{nI(X;Y)} distinguishable codewords and 2^{nH(X,Y)} signals, some of the signals represent the same codeword. Signals representing the same codewords can be combined in equivalence classes, which we call codeword classes. The codeword classes will represent the distinguishable messages that can be transmitted in this communication system. Within each class, a stimulus in X^n invokes a corresponding jointly typical response in Y^n with high probability (about 1 − 2^{-nI(X;Y)}).

Figure 2. The structure of the jointly typical set. There are about 2^{nH(X)} typical stimulus (x) sequences and 2^{nH(Y)} typical response (y) sequences but only 2^{nH(X,Y)} jointly typical sequences. This suggests there are about 2^{nI(X;Y)} distinguishable equivalence classes C_i of stimulus/response pairs. Each class has about 2^{nH(X|Y)} input sequences and 2^{nH(Y|X)} output sequences. The number of stimulus/response pairs in each class is about |C_i| ≈ 2^{n(H(Y|X)+H(X|Y))}.

We define the codebook of this system as the map F: x^n → y^n. The codebook is stochastic on individual elements, so it is represented through the association probabilities q(y^n|x^n). However, when considered on codeword classes, the map is almost bijective. That is, with probability close to unity, elements of Y^n from one codeword class are associated with elements of X^n in the same codeword class. We shall decode an output y^n as (any of) the inputs that belong to the same codeword class. Similarly, we shall consider the representation of an input x^n to be any of the outputs in the same codeword class. Stimuli from the same equivalence class are considered indistinguishable from each other, as are responses from the same class. If the stimulus or response has additional structure (e.g. resides in a metric space with a natural distance), the classes can be represented more succinctly using constraints from within this structure (e.g. by an exemplar with a minimum reconstruction error according to this distance).

An example of coding with jointly typical sequences. We shall use the binary source X: x ∈ {0, 1}, p(x) = [r, 1 − r] and sequences from its finite extensions X^n. Here H(r) = −r log r − (1 − r) log(1 − r). The communication system we consider in this example is the binary symmetric channel (BSC) (figure 3(a)). The BSC is a model of a communication system with noise. The input and output are binary digits (zero or one; x ∈ Z_2). The channel is described by the transition probabilities Pr(a → a) = p for correct transmittal and Pr(a → b) = q for an error (flipped bit); p + q = 1, a, b ∈ {0, 1}.

The BSC has a nice representation in terms of typical sequences. It can be considered as a random binary source with probability measure [p, q] which generates sequences of ones and zeros. These sequences are then summed modulo 2 (xor-ed) with the original source to produce the output. From the theorems on typicality (appendix A.3), the channel will typically produce sequences of nq ones (errors) and np zeros (correct transmission). There are about 2^{nH(p)} possible errors per transmitted input sequence; i.e., H(Y^n|X^n) ≈ nH(p).

[Figure 3 (schematic): (a) the BSC transition diagram, with each input bit transmitted correctly with probability p and flipped with probability q; (b) the eight vertices of the binary cube X^3, from 000 to 111, with arrows showing the decoding of each vertex.]

Figure 3. (a) The BSC. (b) A three-dimensional binary space X^3 and a particular coding scheme that allows the correction of one flipped bit. In this example, the codewords are 000 and 111 and the arrows mark the decoding process. Any pair of opposing vertices provides the same functionality.

A natural error measure for the BSC is the Hamming distance, which returns the number of bits in which two sequences differ. Figure 3(b) illustrates an example in X^3 which can correct one error in a block of three. The two distinct sets of arrows denote the two binary balls of unit radius in which this code can correct an error. We can use the two opposing vertices as codewords to transmit one bit of information. There are two equivalence classes with four elements each. Each element is decoded as the centre of the binary ball, with Hamming distance zero from the original codeword.
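The figure 3(b) scheme can be written out directly as a minimal sketch (the helper names are ours): codewords 000 and 111, with each received triplet decoded to the nearer codeword in Hamming distance.

```python
import itertools

# Repetition code over X^3, as in figure 3(b): codewords 000 and 111,
# decoding by minimum Hamming distance.  Corrects any single flipped bit.

codewords = [(0, 0, 0), (1, 1, 1)]

def hamming(a, b):
    """Number of positions in which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(received):
    """Map a received triplet to the nearest codeword."""
    return min(codewords, key=lambda c: hamming(c, received))

# Every word within Hamming distance 1 of a codeword decodes correctly,
# so the two equivalence classes (binary balls) have four elements each.
for c in codewords:
    for r in itertools.product([0, 1], repeat=3):
        if hamming(c, r) <= 1:
            assert decode(r) == c
```

Since the two codewords are at distance three, the unit balls around them partition all eight vertices into the two four-element classes described in the text.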

For X^n, let us consider only the uniform input distribution p(x) = [1/2, 1/2]. Then all sequences are typical. There are 2^{nH(X)} = 2^n of them. A BSC with error probability q will produce about 2^{nH(q)} typical sequences of about nq flips. Thus each possible sequence in X will be received as one of about 2^{nH(q)} possible neighbours in Y. In order to transmit with a small probability of error we need to pick as codewords sequences in X which are at least nH(q) bits apart in Hamming distance. There are about 2^n/2^{nH(q)} = 2^{n(1−H(q))} such sequences.

2.5. Comments on the continuous case

The picture that we present is essentially one of discrete finite sets and relations between them. Some readers may raise the issue of how this whole scheme would work when both input and output processes are continuous, in space and/or time. We argue that most such cases can be handled by the current scheme. First, notice that any continuous coding scheme will contain uncertainties due to channel noise and measurement error at the receptor level. The scheme can be approximated to an arbitrary level of detail by a discrete coding scheme through the process of quantization (appendix A.5.2), which comes with another set of uncertainties (quantization noise). The two schemes will be indistinguishable in practice if their functionalities lie within each other's error bars. If they are distinguishable, there will be an experiment that can resolve the issue.

Even if we insist on considering continuous stimuli and responses, there are some arguments which again point to the benefit of discrete coding schemes. In the case of object and feature recognition, it was noted [24] that the continuity of stimuli is usually due to symmetries in the signal, which do not contribute to (and even interfere with) the recognition of features. The solution suggested in [24] is to pre-process the stimulus to remove as much of the continuous symmetries as possible and then to continue with the recognition algorithm.

Another argument comes from recent work in rate distortion theory. It was shown [27] that the optimal reproduction space of a continuous source is continuous only in a few special cases (e.g. if the source is Gaussian). In any other case, compressing and transmitting a continuous source is done best through discrete representatives.

We do not argue that this analytical approach covers every conceivable case. Our intention is to illustrate how coding with discrete sets can be used in a much wider context than may initially be perceived.

2.6. Neural coding with jointly typical sequences

The picture of coding and decoding with jointly typical sequences gives us a framework within which to think about the problem of neural coding and decoding. The description above (figure 2) is an idealization used to prove properties of optimal communication coding schemes, but the basic structure is valid for any coding scheme.

• A bijective or almost bijective coding scheme is needed for robust decoding; otherwise the animal will confuse distinct stimuli.

• In any real system, there is noise which compromises the amount of information that can be represented. To achieve robustness, some of the capabilities of the system must be devoted to combating noise.

• Noise reduction can be achieved by combining signals that are likely to be confused into equivalence classes.

• A coding scheme that is almost bijective on equivalence classes will present a consistent representation of a sensory modality.

Unlike the idealized case of coding with jointly typical sequences, the codeword classes in a neural system need not be similar in size. In our model they still have the property of being almost self-consistent. Stimuli belonging to a certain codeword class will invoke a response within the same codeword class with high probability and only rarely produce a response outside this codeword class (in another class, or nontypical). In such a case the stimulus will be considered incorrectly represented.

It should be noted that in the idealized picture there are many coding schemes which are optimal in the sense that the error rate asymptotically approaches zero [6]. This can be achieved by splitting the input and output into clusters of size appropriate for the channel (H(X|Y) for the input, H(Y|X) for the output), and then connecting them at random. When applied to the neural case, this means that a neural coding scheme for similar sensory modalities could be unique for species or even individuals.

Many neural coding schemes currently in use can be seen as special cases of the general coding scheme we describe here. For example, in rate coding schemes [1], all sequences in a short interval that have the same number of spikes are considered to be in the same equivalence class. The codeword classes are labelled by the number of spikes. All stimuli that precede a sequence in the same codeword class are considered jointly typical with it and hence decoded as being represented by this class. There is no guarantee that the classes are nonoverlapping or that the decoding error is small. Similarly, in population vector coding schemes [12, 29] the output sequences are assigned to classes based on the value of a linear functional of the number of spikes in each neuron (population mean). Decoding is done as before and labelled by the class identity, which is interpreted in [12] as the direction of planned hand movement. A spike latency code is an example of a continuous coding scheme. It fits in this formalism through the process of quantization (section 2.5). Classes are determined by the mean latency and jitter (variance) of the spike timing. The classes may be partially overlapping. A stimulus feature is decoded as in the rate code case, based on the latency class into which a spike falls.
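A rate code viewed as equivalence classes can be sketched in a few lines (toy parameters of our choosing): binary spike words of length n are grouped by spike count, so the class label is simply the number of spikes.

```python
from itertools import product

# Rate code as equivalence classes: all length-n spike words with the
# same spike count fall into one codeword class.  Illustrative only.

n = 4
classes = {}
for word in product([0, 1], repeat=n):
    classes.setdefault(sum(word), []).append(word)

def decode(word):
    """A rate decoder returns only the class label (spike count)."""
    return sum(word)
```

Class k holds C(n, k) words, so the 2^n responses collapse to n + 1 labels; as the text notes, nothing in this construction guarantees that the classes separate stimuli well.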

Given this model of a neural subsystem, our task is to recover the codebook. There are two basic ways to approach this. Since the codebook is almost deterministic on equivalence classes, we can approximate it with a deterministic map. Then each equivalence class will have identified fixed members. All errors will be caused by disregarding certain regions of the event space. A more general approach is to represent F by the conditional probability of class membership q(y|x), in which case we explicitly model the probability of misclassification, but lose some of the simplicity offered by jointly typical sequences. We shall use both approaches as warranted.

3. Finding the codebook

With communication systems, the usual use of information theory is to design a coding scheme given the structure of the communication channel. Our application differs, since now we are analysing a neural coding scheme already implemented within an animal's nervous system. Our goal is to uncover the structure described so far (section 2.4, figure 2) based on observations of the stimulus and response properties of the neural system under investigation.

3.1. Quantizing the response

As we mentioned before, the information quantities H and I depend only on the underlying probability function and not on the structure of the event space. This allows us to estimate them in cases where more traditional statistical measures (e.g. variance, correlations etc) simply do not exist. There is a drawback, though, since now we have to model the necessary probabilities, or else use large numbers of data to estimate the probabilities nonparametrically. The problem is compounded by several factors. Recent research suggests that nervous systems represent sensory stimuli by using relatively long temporal windows (tens to a few hundred ms in diverse preparations) [9, 17, 23, 25] and through the coordinated activity of multiple neurons [17, 21, 30, 39]. Unlike research which is geared towards finding better unbiased estimates of H and I [33, 38], our goal is to recover the complete coding scheme used in a sensory system. As pointed out in [17], the number of data points needed for nonparametric analysis of neural responses which are correlated across long time periods (length T) and multiple neurons (K) grows exponentially with T and K. It is conceivable that for some systems the required data recording time may well exceed the expected lifespan of the system.

To resolve this issue we choose to sacrifice some detail in the description of the coding scheme in order to obtain robust estimates of a coarser description. This can be achieved through quantization (appendix A.5.2, [6, 14]) of the neural representation Y into a coarser representation in a smaller event space YN. YN is referred to as the reproduction of Y. By controlling the size of the reproduction, we ensure that the data requirements to describe such a relation are much diminished. Instead of growing exponentially with T and K, the number of needed data is now proportional to N, the size of the reproduction, which is chosen by the researcher to be small. Most of the results in this section are valid for the general case when the sources are continuous, ergodic random variables [14]. However, the formulation for the most general case requires special attention to details. For clarity of the presentation, we shall assume that all random variables are finite (though possibly large) and discrete.


Quantizers are maps from one probability space to another. They can be deterministic (functions) or stochastic (given through a conditional probability) (appendix A.5.2). The size of the reproduction space is smaller than the size of the quantized space. We shall consider the most general case of a stochastic quantizer q(yN|y)—the probability of a response y belonging to an abstract class yN. A deterministic quantizer f : Y → YN is a special case in which q takes values zero or one only. In both cases, stimulus, response and reproduction form a Markov chain X → Y → YN. The quality of a quantization is characterized by a distortion function (appendix A.5.3). We shall look for a minimum distortion quantization using an information distortion function and discuss its relationship to the codebook estimation problem.

The quantization idea has been used implicitly in neuroscience for some time now. For example, the rate coding scheme effectively uses a deterministic quantizer to assign the neural response to classes based on the number of spikes that each pattern has. The metric space approach [40] uses an explicit cost (distortion) function to make different sequences identical if their difference according to the cost function is below a certain threshold. The cost function and identification threshold induce a deterministic quantization of the input space to a smaller output space. We decided to state the problem explicitly in the language of information theory, so that we could use the powerful methods developed in this context for putting all these ideas in a unified framework.
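The rate-code quantizer mentioned above can be written down directly. The sketch below is our own construction (the pattern length T and the binary spike-pattern response space are illustrative assumptions, not taken from the paper); it builds the deterministic quantizer as a 0/1 conditional probability q(yN|y):

```python
import itertools

# Sketch of a rate-code quantizer: every binary spike pattern of
# length T is assigned deterministically to the class labelled by its
# spike count.  q(yN|y) is then a 0/1 conditional probability matrix
# with exactly one nonzero entry per response y.

T = 5  # hypothetical pattern length (an assumption for illustration)
patterns = list(itertools.product([0, 1], repeat=T))  # the response space Y

def rate_class(y):
    """Deterministic quantizer f: Y -> YN; yN = number of spikes in y."""
    return sum(y)

# Stochastic form q(yN|y): one row per response, one column per class 0..T.
q = [[1.0 if rate_class(y) == n else 0.0 for n in range(T + 1)]
     for y in patterns]

# Each row is a proper conditional probability (sums to one), and the
# reproduction has only T+1 elements instead of 2**T responses.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in q)
```

The reduction from 2^T responses to T + 1 classes is exactly the kind of data-requirement saving that motivates quantization here.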

3.2. A distortion measure based on the mutual information

In engineering applications, the distortion function is usually chosen in a fairly arbitrary fashion [6, 13], and it is the distortion function that determines which structures of the original space are preserved by the quantization. We can avoid this arbitrariness, since we expect that the neural system already reflects pertinent structures of the sensory stimuli, which we would like to preserve in the reproduction. Thus our choice of distortion function is determined by the informativeness of the quantization. The mutual information I(X;Y) tells us how many different states on average can be distinguished in X by observing Y. If we quantize Y to YN (a reproduction with N elements), we can estimate I(X;YN)—the mutual information between X and the reproduction YN. Our information preservation criterion will then require that we choose a quantizer that preserves as much of the mutual information as possible, i.e. choose the quantizer q(YN|Y) which minimizes the difference

DI(Y, YN) = I (X;Y ) − I (X;YN) (1)

(note that DI ≥ 0). We use the functional DI as a measure of the average distortion of a quantization. It can be interpreted as an information distortion measure (appendix A.5.3, appendix B), hence the symbol DI. The only term that depends on the quantization is I(X;YN), so we can reformulate the problem as the maximization of the effective functional Deff = I(X;YN). A closely related method using this cost function was recently presented in [37].
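These quantities can be computed directly for a small example. The following sketch uses a toy joint distribution of our own (not data from the paper) and a hand-picked deterministic quantizer, and evaluates I(X;Y), I(X;YN) and the information distortion DI:

```python
import numpy as np

# Toy evaluation of the information distortion DI = I(X;Y) - I(X;YN)
# for a given joint distribution p(x,y) and quantizer q(yN|y).

def mutual_information(pxy):
    """I(X;Y) in bits for a joint probability matrix pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

# Toy joint distribution with two clusters (4 stimuli x 4 responses).
pxy = np.array([[0.20, 0.05, 0.00, 0.00],
                [0.05, 0.20, 0.00, 0.00],
                [0.00, 0.00, 0.20, 0.05],
                [0.00, 0.00, 0.05, 0.20]])

# A deterministic quantizer q(yN|y) collapsing the four responses into
# N = 2 classes (rows indexed by y, columns by yN).
q = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

pxyN = pxy @ q              # p(x, yN) = sum_y p(x,y) q(yN|y)
I_xy = mutual_information(pxy)
I_xyN = mutual_information(pxyN)
DI = I_xy - I_xyN           # information lost by the quantization

assert DI >= -1e-12         # DI is nonnegative, as noted in the text
```

Because X → Y → YN is a Markov chain, the data processing inequality guarantees I(X;YN) ≤ I(X;Y), so DI is indeed nonnegative.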

For several reasons it is useful to consider the full functional. First, we may choose to quantize the stimulus, in which case the quantizer is q(xN|x), or quantize both stimulus and response, in which case there are two quantizers. In these versions other parts of the information distortion are relevant. A second reason is that the average distortion can be rewritten as the expectation of a pointwise distortion function of a rather interesting form. Using the definition of the mutual information and the Markov relation X → Y → YN between the spaces, we can express DI (appendix B) as the expectation

DI = Ep(y,yN) d(y, yN)   (2)

where

d(y, yN) ≡ KL(q(x|y) || q(x|yN))   (3)

is the Kullback–Leibler (KL) directed divergence of the input stimulus conditioned on a response y relative to the stimulus conditioned on a reproduction yN. Intuitively, this measures how similar the stimulus partition induced by the quantization is to the partition induced by the sensory system. This expression is very appealing from a theoretical standpoint, due to various properties of the KL divergence. For example it allows us to describe the cases where the solution to the quantization problem will be an exact representation of the coding scheme. Indeed, because of the properties of KL [6], a reproduction with zero distortion is achieved if and only if q(x|y) = q(x|yN) almost everywhere, which is exactly the case with the coding scheme described in section 2.4.
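The identity between DI and the expected pointwise KL distortion can be checked numerically. The sketch below (random toy distributions of our own) constructs an arbitrary joint p(x,y) and stochastic quantizer q(yN|y) and verifies that I(X;Y) − I(X;YN) equals the expectation over p(y,yN) of KL(q(x|y)||q(x|yN)):

```python
import numpy as np

# Numerical check of equations (2)-(3): the information distortion
# DI = I(X;Y) - I(X;YN) equals E_{p(y,yN)} KL( q(x|y) || q(x|yN) ).

rng = np.random.default_rng(0)

pxy = rng.random((5, 6))
pxy /= pxy.sum()                      # random joint p(x, y)

q = rng.random((6, 3))
q /= q.sum(axis=1, keepdims=True)     # random stochastic quantizer q(yN|y)

def mi(p):
    a = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log2(p[m] / a[m])).sum())

pxyN = pxy @ q                        # Markov chain X -> Y -> YN
DI = mi(pxy) - mi(pxyN)

# Right-hand side: E_{p(y,yN)} KL( q(x|y) || q(x|yN) ), in bits.
py = pxy.sum(axis=0)                  # p(y)
pyN = pxyN.sum(axis=0)                # p(yN)
qx_y = pxy / py                       # q(x|y), one column per y
qx_yN = pxyN / pyN                    # q(x|yN), one column per yN
p_y_yN = py[:, None] * q              # p(y, yN) = p(y) q(yN|y)

rhs = 0.0
for j in range(pxy.shape[1]):         # over responses y
    for nu in range(q.shape[1]):      # over classes yN
        kl = float((qx_y[:, j] * np.log2(qx_y[:, j] / qx_yN[:, nu])).sum())
        rhs += p_y_yN[j, nu] * kl

assert abs(DI - rhs) < 1e-9           # the two expressions agree
```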

3.3. Implementations

Using a quantization (deterministic or stochastic) of the output space (appendix A.5.2, [14]) allows us to control the exponential growth of required data. With this approach we estimate a quantity which is known to be a lower bound of the actual mutual information. We obtain a biased estimate but control the precision with which it can be estimated. Theorems from quantization theory (appendix A.5.2, [14]) ensure that estimates of the quantized information quantities are always bounded by the original quantities and that a refinement of the quantization does not lower these estimates. In such an environment it is beneficial to fix the coarseness of the quantization (the size of the reproduction, N) and look for a quantization that minimizes the information distortion measure DI = I(X;Y) − I(X;YN) described previously.

3.3.1. Maximum entropy with nonlinear information distortion constraint. The problem of optimal quantization was formulated for a class of linear distortion functions [28] as a maximum-entropy problem [16]. We cannot use this analysis directly, since in our case the distortion function depends nonlinearly on the quantizer. However, we can still use the maximum-entropy formulation. The reasoning behind this is that, among all quantizers that satisfy a given set of constraints, the maximum-entropy quantizer does not implicitly introduce further restrictions in the problem. In this framework, the minimum-distortion problem is posed as a maximum-quantization-entropy problem with a distortion constraint:

max_q(yN|y) H(YN|Y)   constrained by

DI(q(yN|y)) ≤ Do   and   ∑yN q(yN|y) = 1 ∀y ∈ Y.   (4)

This is an ordinary constrained optimization problem that can be solved numerically with standard optimization tools. The cost function H(YN|Y) is concave in q(yN|y), and the probability constraints ∑yN q(yN|y) = 1 are linear in q(yN|y) [6]. The constraint DI is also concave in q(yN|y) (theorem appendix B.1), which makes the whole problem one of concave maximization.

The problem with this formulation is that it relies on knowing DI, which depends on the mutual information between X and Y. We can easily avoid the need for that by using the effective distortion Deff ≡ I(X;YN). In this case, the optimization problem is

max_q(yN|y) H(YN|Y)   constrained by

Deff ≡ I(q(yN|y)) ≥ Io   and   ∑yN q(yN|y) = 1 ∀y ∈ Y.   (5)


The solution to the optimization problem (5) depends on a single parameter Io, which can be interpreted as the informativeness of the quantization. If Io ≤ 0, the distortion constraint is always satisfied and we obtain only the unconstrained maximum entropy solution q(yN|y) = 1/N for all pairs (y, yN). For Io > 0 the distortion constraint becomes active and the uniform quantizer is no longer a solution to the optimization problem. Because of the convexity of Deff, the optimal solution will lie on the boundary of the constraint and thus carry I(X;YN) = Io bits of information. Thus this formulation has a nice intuitive interpretation: 'find the maximum-entropy quantizer of Y which carries at least Io bits of information about X'.

We can continue pushing Io up for more informative solutions (with lower distortion) until we reach a point where the problem has no solutions. The value Io^max at this point is the best lower bound of I(X;Y) for the N-element reproduction. Since the solution is continuous with respect to Io, values near Io^max are also good lower bounds of I(X;Y). By choosing one of the optimal quantizers near Io^max, we can achieve the minimum distortion quantization.

3.3.2. Maximum cost with linear probability constraint. A standard approach to constrained optimization problems is through the use of Lagrange multipliers. The system (4) can be solved as the unconstrained optimization of

max_q(yN|y) ( H(YN|Y) − β DI(q(yN|y)) + ∑y λy ∑yN q(yN|y) ).

The solution depends on the parameters (β, {λy}), which can be found from the constraints

DI(q(yN|y)) ≤ Do   and   ∑yN q(yN|y) = 1 ∀y ∈ Y.

Since β is a function of Do, which is a free parameter, we can discard Do and reformulate the optimization problem as finding the maximum of the cost function

max_q(yN|y) F(q(yN|y)) ≡ max_q(yN|y) ( H(YN|Y) − β DI(q(yN|y)) )

constrained by   ∑yN q(yN|y) = 1 ∀y ∈ Y.   (6)

As in equation (5), we shall continue the discussion using the effective distortion Deff. In this case,

max_q(yN|y) F(q(yN|y)) ≡ max_q(yN|y) ( H(YN|Y) + β Deff(q(yN|y)) )

constrained by   ∑yN q(yN|y) = 1 ∀y ∈ Y.   (7)

Even though this formulation is identical to (4), by transferring the nonlinear constraint into the cost function we can analyse the problem further. Following [28], we consider the behaviour of the cost function F at two limiting cases of β. When β → 0, F → H(YN|Y) and the optimal solution is the unconstrained maximum-entropy solution q(yN|y) = 1/N. This corresponds to the case Io ≤ 0 in section 3.3.1. At the other limit, when β → ∞, F → βDeff and the solution to the optimization problem approaches a maximum-Deff (minimal-information-distortion DI) solution. This is identical to the case where Io → Io^max above. In order to avoid the divergence of the cost function with β, we rescale F to F/(β + 1), which has the same extrema, but is bounded. Intermediate values of β produce intermediate solutions which connect the two limiting cases through a series of bifurcations. In [28] the parameter β was given the meaning of an annealing parameter and the whole procedure for a general class of distortion functions was named 'deterministic annealing', drawing an analogy from a physical annealing process.

3.3.3. An implicit solution for the optimal quantizer. Further analysis of the problem uses the simplicity of the linear constraint in (7). Extrema of F can be found by setting its derivatives with respect to the quantizer q(yN|y) to zero. In the subsequent steps we shall explicitly use the assumption that all spaces are finite and discrete. The results for continuous random variables can easily be adapted from this using analogous methods from the calculus of variations. We use Latin indices (i, j, k) to denote members of the original spaces X, Y and Greek indices (µ, ν, η) for elements of the reproduction YN. With this in mind (appendix B.3), we continue to solve the Lagrange multiplier problem

0 = ( ∇( F + ∑j λj ∑ν q(yν|yj) ) )νk
  = (∇H)νk + β(∇Deff)νk + λk
  = −p(yk)( ln q(yν|yk) + 1 ) + β(∇Deff)νk + λk

⇔ 0 = ln q(yν|yk) − β(∇Deff)νk/p(yk) − µk   (8)

where µk = λk/p(yk) − 1. Using this,

ln q(yν|yk) = β(∇Deff)νk/p(yk) + µk
⇔ q(yν|yk) = e^(µk) e^(β(∇Deff)νk/p(yk)).   (9)

The constraint on q requires that

1 = ∑ν q(yν|yk)
⇒ 1 = e^(µk) ∑ν e^(β(∇Deff)νk/p(yk))
⇔ e^(µk) = 1 / ∑ν e^(β(∇Deff)νk/p(yk)).   (10)

We can substitute this in equation (9) and obtain an implicit expression for the optimal q(yν|yk),

q(yν|yk) = e^(β(∇Deff)νk/p(yk)) / ∑ν e^(β(∇Deff)νk/p(yk)).   (11)

Note that ∇Deff is a function of the quantizer q(yN|y). As ∇Deff = −∇DI, the implicit solution in terms of the information distortion DI is

q(yN|y) = e^(−β∇DI/p(y)) / ∑yN e^(−β∇DI/p(y)).   (12)

In practice, the expression (11) can be iterated for a fixed value of β to obtain a solution for the optimization problem, starting from a particular initial state. For small β, before the first bifurcation described in [28] occurs, the obvious initial condition is the uniform solution q(yN|y) = 1/N. The solution for one value of β can be used as the initial condition for a subsequent value of β because solutions are continuous with respect to β.
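This annealing iteration can be sketched in a few lines. The implementation below is entirely our own illustration (the toy clustered joint distribution, the β schedule and the reproduction size N = 2 are assumptions, not values from the paper). One can check that the part of (∇Deff)νk/p(yk) which does not depend on the class index ν is absorbed by the normalization in (11), so the update reduces to q(yν|yk) ∝ exp(−β KL(p(x|yk) || p(x|yν))):

```python
import numpy as np

def mi(p):
    """Mutual information in bits of a joint probability matrix."""
    a = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log2(p[m] / a[m])).sum())

rng = np.random.default_rng(1)

# Toy joint p(x, y): two well separated clusters of stimulus/response pairs.
pxy = np.zeros((8, 8))
pxy[:4, :4] = rng.random((4, 4))
pxy[4:, 4:] = rng.random((4, 4))
pxy = (pxy + 1e-6) / (pxy + 1e-6).sum()   # smooth so conditionals exist

py = pxy.sum(axis=0)
px_y = pxy / py                       # p(x|y), one column per response

N = 2                                 # size of the reproduction YN
q = rng.random((8, N))
q /= q.sum(axis=1, keepdims=True)     # random initial quantizer q(yN|y)

for beta in np.linspace(0.5, 100.0, 200):    # slowly increase beta
    pxyN = pxy @ q + 1e-12            # p(x, yN); keep columns positive
    px_yN = pxyN / pxyN.sum(axis=0)   # p(x|yN)
    # KL(p(x|y_k) || p(x|y_nu)) for every response/class pair.
    kl = np.array([[float((px_y[:, k] *
                           np.log(px_y[:, k] / px_yN[:, nu])).sum())
                    for nu in range(N)] for k in range(8)])
    # Update (11), with the class-independent terms normalized away;
    # subtracting the row minimum avoids numerical underflow.
    q = np.exp(-beta * (kl - kl.min(axis=1, keepdims=True)))
    q /= q.sum(axis=1, keepdims=True)

# The quantizer remains a proper conditional probability, and the
# quantized information never exceeds the original (data processing).
assert np.allclose(q.sum(axis=1), 1.0)
assert mi(pxy @ q) <= mi(pxy) + 1e-9
```

At high β the quantizer in this example typically becomes effectively deterministic and groups the responses of each cluster into a single class.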

3.4. Approximations to the codebook

The optimal information distortion procedure can help us resolve the neural decoding problem we outlined in section 2.4. Indeed, in the limit of no distortion (no information loss), an identity transformation preserves all the structure of the output, but usually is unavailable due to lack of data. When we quantize the neural response by fixing the size of the reproduction to N, this bounds our estimate of Deff to be no more than log N bits. In the ideal case max Deff ≡ max I(X;YN) ≈ log N, but in general it will be lower. On the other hand we have a bound I(X;YN) ≤ I(X;Y). Since log N increases with N and I(X;Y) is a constant, these two independent bounds intersect for some N = Nc, at which point adding more elements to YN does not improve the distortion measure. If I(X;YN) increases with N until N = Nc and then levels off, we can identify the correct Nc by examining the behaviour of the expected distortion (or, equivalently, Deff ≡ I(X;YN)) as a function of N, given sufficient data. We take the elements of YN as the labels of the equivalence classes which we wanted to find. The quantizer q(yN|y) gives the probability of a response y belonging to an equivalence class yN. Rose [28] conjectured that the optimal quantizer for low distortions (high β) is deterministic (or effectively deterministic, in the case of duplicate classes). In this case the responses associated with class yN are YN = {y | q(yN|y) ≈ 1}. The optimal quantizer also induces a coding scheme from X → YN by p(yN|x) = ∑y q(yN|y) p(y|x). This is the most informative approximation of the original relation p(x|y). It induces the quantization X → XN by associating xN with the stimulus set XN = {x | p(yN|x) ≈ 1} of all x which correspond to the same output class yN. The resulting relation p(yN|xN) is almost bijective and so we recover an almost complete reproduction of the model described in section 2.3.

If there are not enough data to support a complete recovery even under the reduced data requirements, the algorithm has to stop earlier. The criterion we use in such a case is that the estimate of Deff does not change with N within its error bounds (obtained analytically or by statistical re-estimation methods like bootstrap or jack-knife). Then N < Nc and the quantized mutual information is at most log N. We can recover at most N classes and some of the original classes will be combined. The quantizer may also not be deterministic due to lack of enough data to resolve uncertainties. Thus we can recover a somewhat impoverished picture of the actual input/output relationship, which can be refined automatically as more data become available, by increasing N and repeating the optimization procedure.

4. Results

We shall discuss the application of the method described so far to a few synthetic test cases. Applying it to physiological data from a sensory system involves additional difficulties associated with the estimates of DI for complex input stimuli, which are dealt with elsewhere [10, 11].

4.1. Random clusters

We present the analysis of the probability distribution shown in figure 4(a). In this model we assume that X represents a range of possible stimulus properties and Y represents a range of possible spike train patterns. We have constructed four clusters of pairs in the stimulus/response space. Each cluster corresponds to a range of responses elicited by a range of stimuli. This model was chosen to resemble the model of coding and decoding with jointly typical sequences (section 2.4). The mutual information between the two sequences is about 1.8 bits, which is comparable to the mutual information conveyed by single neurons about stimulus parameters in several unrelated biological sensory systems [9, 18, 25, 33]. For this analysis we assume the original relation between X and Y is known (the joint probability p(x, y) is used explicitly).

Figure 4. (a) A joint probability for the relation between two random variables X and Y with 52 elements each. (b)–(e) The optimal quantizers q(yN|y) for different numbers of classes. These panels represent the conditional probability q(yN|y) of a pattern y from (a) (horizontal axis) belonging to class yN (vertical axis). White represents zero, black represents one and intermediate values are represented by levels of grey. The behaviour of the mutual information with increasing N can be seen in the log–linear plot (f). The dashed curve is I(X;Y), which is the least upper bound of I(X;YN).

The results of the application of our approach are shown in panels (b)–(f) of figure 4. The grey-scale map in these, and later, representations of the quantizer depicts zero with white, one with black and intermediate values with levels of grey. When a two-class reproduction is forced (b), the algorithm recovers an incomplete representation. The representation is improved for the three-class refinement (c). The next refinement (d) separates all the classes correctly and recovers most of the mutual information. Further refinements (e) fail to split the classes and are effectively identical to (d) (note that classes 1 and 2 in (e) are almost evenly populated and the class membership there is close to a uniform one-half). The quantized mutual information (f) increases with the number of classes approximately as log N until it recovers about 90% of the original mutual information (N = 4), at which point it levels off.

Further details of the course of the optimization procedure that lead to the optimal quantizer in panel (d) are presented in figure 5. The behaviour of Deff as a function of the annealing parameter β can be seen in the top panel. Snapshots of the optimal quantizers for different values of β are presented on the bottom row (panels 1–6). We can observe the bifurcations of the optimal solution (1–5) and the corresponding transitions of the effective distortion. The abrupt transitions (1 → 2, 2 → 3) are similar to the ones described in [28] for a linear distortion function. We also observe transitions (4 → 5) which appear to be smooth in Deff even though the solution for the optimal quantizer undergoes a bifurcation.

A random permutation of the rows and columns of the joint probability in figure 4(a) has the same channel structure. The quantization is identical to the case presented in figure 4 after applying the inverse permutation and fully recovers the permuted classes (i.e., the quantization is contravariant with respect to the action of the permutation group).


Figure 5. Behaviour of Deff (top) and the optimal quantizer q(yN|y) (bottom) as a function of the annealing parameter β.

4.2. Hamming code

There exist noise-correcting codes that transform blocks of symbols at the input to larger blocks of binary symbols in such a way that the original block can be recovered even after some noise perturbs the codewords [6]. For this example we are going to use a simple noise-correcting code—the Hamming (7, 4) code (H74). It operates on binary blocks of size four and expands each block to seven binary symbols by adding three parity bits. Blocks can be considered as vectors in a Boolean vector space. The input x ∈ Z_2^4 and the output y ∈ Z_2^7. The Hamming code then can be described as a linear operator on these spaces:

y = H^T · x

where

H = [ 1 0 0 0 1 0 1
      0 1 0 0 1 1 0
      0 0 1 0 1 1 1
      0 0 0 1 0 1 1 ]

and any addition is modulo 2. This code can detect and correct one error (a flipped bit) in a block of seven.
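The construction above is easy to verify directly. The sketch below encodes all 16 input blocks with the generator matrix H given in the text and checks the property that underlies single-error correction, namely that the minimum distance of this linear code is 3; the noise model (flip one random bit) follows the description in this section:

```python
import itertools
import numpy as np

# The generator matrix H from the text (4 data bits -> 7 code bits).
H = np.array([[1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1, 1, 1],
              [0, 0, 0, 1, 0, 1, 1]])

def encode(x):
    """y = H^T . x with addition modulo 2, for x in Z_2^4."""
    return tuple(np.array(x) @ H % 2)

codewords = [encode(x) for x in itertools.product([0, 1], repeat=4)]

# The code is linear, so its minimum distance equals the minimum weight
# of a nonzero codeword; minimum distance 3 is what allows one flipped
# bit per block of seven to be detected and corrected.
weights = [sum(c) for c in codewords if any(c)]
assert len(set(codewords)) == 16
assert min(weights) == 3

# Noise as in the text: flip one random bit of a codeword.
rng = np.random.default_rng(0)
y = np.array(codewords[5])
y[rng.integers(7)] ^= 1
assert sum(y != codewords[5]) == 1    # exactly one bit perturbed
```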

The properties of H74 are known very well. We are going to use this system as another testbed to evaluate the performance of different implementations of the algorithms described earlier. In this case we used data sampled from the joint probability p(X, Y) to perform all estimates. The two variables X and Y are related in the following manner: we generated a sequence of points from a uniform source in Z_2^4 (about 10^4 for this example) and applied H74 to it. Two independent copies of the result were perturbed by noise, which flipped a random bit of each sample. The two perturbed outputs are Xo and Yo. The relation between these is shown in figure 6. There are four bits of mutual information between Xo and Yo. A permutation of the rows and columns, which orders equivalent sequences of H74, makes the relation much easier to comprehend (figure 7(a)). The variables X and Y are permuted versions of Xo and Yo and have the same relation between them. As mentioned in section 4.1, any permutation of variables leaves the information distortion invariant. Thus the results for (X, Y) and (Xo, Yo) are equivalent.

Figure 6. The joint probability between Xo and Yo in the original H74 variables.

Figure 7. The joint probability between X and Y after permuting rows and columns (a) and optimal quantizations for different number of classes (b)–(e). The behaviour of the mutual information with increasing N can be seen in the log–linear plot (f). The dashed curve is I(X;Y).

The results can be seen in figure 7. The reproduction classes were permuted to follow roughly the relation in 7(a). As in section 4.1, the algorithm recovers several incomplete representations (b)–(e), each one a refinement of the previous. Refinements beyond the correct number of clusters (16) fail to improve the distortion by much (e), and some of the classes are effectively copies of each other (e.g. 1 and 2). The quantized mutual information (f) increases with the number of classes approximately as log N until it recovers almost all of the original mutual information (N = 16), at which point it levels off.


Figure 8. A joint probability for a linear relation between two random variables X and Y (a) with optimal quantization (b)–(e) for different number of classes. The behaviour of the mutual information with increasing N can be seen in (f). The dashed curve is I(X;Y).

4.3. Linear encoding

We also applied the algorithm to a case which, unlike the previous cases, does not have clearly defined clusters. This model tries to simulate the process of a physical measurement where X represents a physical system and Y is the set of possible results from a measurement. In this example we model a linear relation between X and Y and Gaussian measurement noise, that is

Y = kX + η

where η ∼ N(0, σ) is drawn from a normal distribution with zero mean and variance σ^2. The particular relation we used (figure 8(a)) contains about two bits of mutual information.
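A joint distribution of this type is straightforward to simulate. In the sketch below, the values of k and σ and the 50-element ranges are our own choices (the paper does not specify them); we sample the noisy linear relation, histogram it into an empirical p(x, y), and confirm that it carries a few bits of mutual information:

```python
import numpy as np

# Simulate Y = kX + eta with Gaussian noise and build an empirical
# joint probability (parameter values are illustrative assumptions).

rng = np.random.default_rng(2)
k, sigma = 1.0, 2.0

x = rng.integers(0, 50, size=100_000)        # discrete "physical" states
y = np.rint(k * x + rng.normal(0.0, sigma, size=x.size)).astype(int)
y = np.clip(y, 0, 49)                        # keep measurements in range

# Empirical joint probability p(x, y) on a 50 x 50 grid.
pxy = np.zeros((50, 50))
np.add.at(pxy, (x, y), 1.0)
pxy /= pxy.sum()

def mi(p):
    a = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log2(p[m] / a[m])).sum())

# The noisy linear relation carries a few bits of information about X,
# strictly less than the log2(50) bits a noiseless relation would give.
assert 1.0 < mi(pxy) < np.log2(50)
```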

The results can be seen in figure 8. The algorithm recovers a series of representations (b)–(e), where each quantizer is a refinement of the previous one. The reproduction classes were permuted to roughly follow the original linear relation. There is no natural stopping point and so the quantized mutual information I(X;YN) approaches the original mutual information I(X;Y) gradually. This is in contrast to the previous two cases, where the rate of change of I(X;YN) abruptly decreased after some N.

The course of the bifurcations here also differs from the previous cases (figure 9). Again, the reproduction classes were permuted to roughly follow the original linear relation. There are no obvious abrupt transitions and the uncertainty of the quantizer is resolved smoothly with β.

Figure 9. Behaviour of Deff (top) and the optimal quantizer q(yN|y) (bottom) as a function of the annealing parameter for the linear encoding case.

5. Conclusions and discussion

This paper has two goals. The first goal is to formulate a precise model of an early stage of a sensory system as a communication channel, and describe the properties of this model. In general, a communication channel is fully described by the conditional probability of response given a stimulus. Using ideas from information theory on optimal information transmission in the presence of noise, and the method of jointly typical sequences, we have demonstrated the existence of equivalence classes of stimulus/response pairs, which we have called codeword classes. A coding scheme in this system can be described by an almost deterministic map when restricted to the codeword classes. The number of codeword classes is related to the mutual information I(X;Y) between stimulus and response.

The second goal of this paper is to provide a method for recovering the structure of such a model from observations. Characterizing the relation between individual elements in the stimulus and response spaces has been shown to require large numbers of data points, increasing exponentially with the length of spike sequences (T) and the number of neurons (N) considered. We choose to recover an impoverished description of the coding scheme by quantizing the responses to a reproduction set of a few elements. To assess the quality of the reproduction, we define the information distortion DI = I(X;Y) − I(X;YN), which measures how much information is lost in the quantization process. For a fixed reproduction size N we pose the optimization problem of finding the quantization with smallest distortion, as the one which preserves most of the information present in the original relation between X and Y. Refining the reproduction by increasing N was shown to decrease the distortion. We demonstrate empirically on a set of synthetic problems that, if the original relation contains almost disjoint clusters, a sufficiently fine optimal quantization recovers them completely. If the quantization is too coarse, then some of the clusters will be combined, but in such a way that a large fraction of the original information is still preserved.

In realistic cases of physiological recordings, there are not usually enough data to support a sufficiently fine quantization. In such cases, we are forced to accept a coarse quantization which does not recover all the structure of a particular neural coding scheme. The criterion we have adopted for stopping the refinement process is when the estimate of the information distortion does not change within its error bounds, which may be obtained analytically or by statistical re-estimation procedures.

Many of the coding schemes currently in use can be seen as special cases of the methodwe present here. A rate code can be described as a deterministic quantization to the set ofintegers. The quantizer assigns all spike patterns with the same number of spikes to the same


460 A G Dimitrov and J P Miller

equivalence class. A spike latency code can be seen as a quantization to classes determined by the latency and jitter of the spike's timing. A stimulus feature is decoded as in the rate code case, based on which latency 'class' a spike falls in. The metric space approach [40] uses an explicit cost (distortion) function to determine which different sequences are identical: they are equivalent if, according to the cost function, their difference is below a certain threshold. The cost function and identification threshold induce a deterministic quantization of the input space to a smaller output space.
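The rate-code quantizer can be sketched in a few lines (a toy illustration of ours, not from the paper): every binary spike pattern of length T is mapped to the equivalence class given by its spike count, and the induced classes partition the pattern space:

```python
from itertools import product

T = 4  # pattern length in bins; 2^T = 16 possible binary spike patterns
patterns = list(product((0, 1), repeat=T))

# the rate-code quantizer: pattern -> number of spikes
classes = {}
for pat in patterns:
    classes.setdefault(sum(pat), []).append(pat)

# the induced equivalence classes form a partition of the pattern space
assert sum(len(c) for c in classes.values()) == len(patterns)
print({count: len(pats) for count, pats in classes.items()})
# {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
```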

Most current approaches to studying neural coding rely on formulating a hypothesis about the coding scheme a neural system may use and then using observations to estimate parameters of the hypothesis. The complexity of the hypothesis determines the amount of data needed for reliable estimates of the necessary parameters. The method presented in this report offers a means for data-driven hypothesis formulation. When we stop the refinement of the reproduction due to lack of data, we effectively formulate a hypothesis about the most informative coding scheme that can be supported with the available amount of data. When more observations become available, the hypothesis can be refined automatically to include them for a better approximation.

Recent applications of information theory to problems of neural coding [5, 7, 33] concentrate on estimating information-theoretic quantities without regard to the actual stimulus/response relationship that produced them. The strong point of that approach—that it does not require a detailed understanding of the coding process—is also a weak point from another perspective. As we mentioned, without the model of a coding scheme we present here, many candidate hypotheses must be investigated one by one. For example, in [5] the basic elements are a pair of events; in [4], a combination of rate and latency codes. Using the approach described here, we can characterize coding schemes without reference to the underlying physical processes that produce them. It could be that the codeword classes that emerge after the analysis are well described by a simple mechanism, but we also have the ability to analyse and describe quite succinctly coding schemes which have not been considered so far.

We developed the information distortion method from a purely practical necessity—the need to describe a neural coding scheme on large input and output spaces. Our earlier research [9, 35] and that of others [23, 25, 33, 41] estimates just a few bits of mutual information per neuron in several distinct sensory systems. In view of our model of a coding scheme, this suggests that there are just a few codeword classes that need to be identified, regardless of the size of the response space. Thus it was natural to devise a method that clusters the neural representation in a few large sets, while preserving most of the mutual information. We were therefore quite excited to find another recently developed approach [37], which suggested a very similar method ('the information bottleneck') for completely abstract reasons (the authors attempt to extract 'meaningful' or 'relevant' information from a pair of interacting systems). The different motivations are obvious. We explicitly concentrate on the case of finite (albeit large) spaces, so that the method is applicable to computer-recorded data and numerical analysis. The 'information bottleneck' method [37] uses a variational approach to continuous random variables, which is better suited for abstract analysis. We cannot assess its applicability to actual stimulus/response datasets, since no examples of its performance are presented in [37]. The only application we found cited there [32] suffers from two unfortunate choices. First, it applies the method to a problem which, albeit real, is not well understood, so we could not distinguish the limitations of the algorithm from the constraints of the problem itself. Second, it chooses to forgo the simplicity offered by the small reproduction space and probabilistic clustering [28] and instead uses an ad hoc deterministic clustering method to find an approximation of the solution to their problem, which makes assessing the properties of


Neural coding and decoding 461

the method even more difficult. We hope that further developments in both approaches will eventually converge to a unified method for data analysis.

The information distortion DI seems to be extremely well suited for uncovering the structure of information channels. While appendix B addresses some of the properties of this object, more research is needed to elucidate them further. Since DI is concave and its first and second derivatives are continuous except at the boundaries of its domain, it is amenable to various analytical techniques, which can help clarify the structure of the optimal solution and bifurcations at different values of the distortion. However, this is beyond the scope of this paper.

It is interesting to note that, although we had neural coding in mind while developing the information distortion method, the ensuing analysis is in no way limited to nervous systems. Indeed, the constraints on the pair of signals we analyse are so general that they can represent almost any pair of interacting physical systems. In this case, finding a minimal information distortion reproduction allows us to recover certain aspects of the interaction between the two physical systems, which may improve considerably any subsequent analysis performed on them. It is also possible to analyse parts of the structure of a single physical system Y, if X

is a system with known properties (e.g. a signal generator, controlled by a researcher) and is used to perturb Y. These cases point to the exciting possibility of obtaining a more automated approach for succinct descriptions of arbitrary physical systems through the use of minimal information distortion quantizers.

Acknowledgments

This research was supported in part by NIH grants MH12159 (AGD) and MH57179 (JPM). The authors are indebted to Penio Penev (Rockefeller University), Tomas Gedeon, Zane Aldworth, Zuzana Gedeon, Kay Kirkpatrick (Montana State University) and the two anonymous reviewers for numerous discussions and suggestions which helped shape this paper.

Appendix A. Information theory

The goal of this section is to summarize results from information theory pertinent to the body of the paper. Statements of standard results are made with some precision so that the interested reader can trace the foundation of our discussion about coding and quantization earlier in the paper. An extended treatment of these topics and more can be found in [6]. A more formal treatment can be found in [14], where the subject is approached with greater care and mathematical precision.

Appendix A.1. Source space, probability measure and random variables

An information source is a mathematical model for a physical system that produces a succession of symbols in a manner which is unknown to us and is treated as random. The set X, containing all possible output symbols, is called the alphabet of the source. Let A be a σ-field of subsets of X. The treatment of events (sets of sequences of symbols) is achieved through assigning a probability measure p(x) = Pr{X = x}, x ∈ X, to (X, A). The triplet (X, A, p) is often called a random variable, X, when the context of the alphabet and measure is clear.

In cases with two or more alphabets, one can define a probability measure p(x, y) on the product space (x, y) ∈ X × Y. In this context the induced probability measures on the individual spaces are called marginal measures or marginals. An important measure which


emerges from this setup is the conditional probability, defined as

p(x|y) ≡ p(x, y)/p(y), ∀(x, y) ∈ X × Y.

Two random variables are called independent if

p(x|y) = p(x) ∀y ∈ Y.

A measurable function defined on (X, A) and taking values in another measurable space (Y, B) is a mapping f : X → Y with the property that

F ∈ B ⇒ f⁻¹(F) = {x : f(x) ∈ F} ∈ A.

Whenever it is well defined (usually when the measurable function is in a linear vector space, e.g. R^n), one can perform integration, which is a functional on the space of measurable functions. In probability theory this is called the expectation of the measurable function g(X) and is defined as

E_p g(X) ≡ ∑_{x∈X} g(x)p(x)

where the sum is replaced by an integral for continuous alphabets.

Appendix A.2. Information-theoretic quantities

The basic concepts of information theory are entropy and mutual information. The notion of mutual information is introduced as an integral measure of the degree of dependence between a pair of variables. The concept of entropy can then be understood as the self-information of a random variable. They are both special cases of a more general integral quantity called relative entropy (or KL divergence [20]), which is an integral measure of the difference between two probability distributions. While mutual information is well defined for both discrete and continuous alphabets, entropy for continuous alphabets is a problematic quantity, undefined for many measures of interest. Fortunately many identities and bounds on mutual information are still valid if one uses another integral measure from probability—the KL divergence, or relative entropy, between two probability measures on the same event space:

KL(p‖q) = E_p log( p(x)/q(x) )   (A.1)

where p and q are two different probability measures on X. KL(p‖q) quantifies the difference between two probability measures on the same sample space and is extensively used in probabilistic decision theory. Note that, as with most of the quantities here, KL depends only on probability measures and not on the elements of the space, which can be non-numeric. Thus expectations are always well defined here irrespective of the structure of the event space. The usual definitions of these quantities use a base two logarithm, so any further references to the log function implicitly assume base two. On rare occasions we shall use the natural logarithm, denoted by the symbol ln.
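A minimal numerical sketch of (A.1), with toy distributions of our own choosing: the KL divergence is non-negative, zero only for identical measures, and not symmetric in its arguments:

```python
from math import log2

def kl(p, q):
    """KL(p || q) in bits for two probability vectors on the same event space."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
assert kl(p, p) == 0                    # zero iff the measures coincide
assert kl(p, q) > 0 and kl(q, p) > 0    # non-negativity
assert abs(kl(p, q) - kl(q, p)) > 0.1   # KL is not symmetric, hence not a metric
```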

Appendix A.2.1. Mutual information. In order to measure the statistical independence between two random variables X and Y it is useful to introduce the notion of mutual information. It is defined as the KL distance between the joint probability p(x, y) on the product alphabet X × Y and the product of marginal probabilities p(x) and p(y). I(X;Y) is equal to zero if and only if X and Y are independent:

I(X;Y) ≡ KL( p(x, y) ‖ p(x)p(y) ) = E_{p(x,y)} log[ p(x, y)/(p(x)p(y)) ].   (A.2)

The mutual information is symmetric with respect to its arguments: I (X;Y ) = I (Y ;X).
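Equation (A.2) is easy to check numerically (the joint distributions below are our own toy examples): I(X;Y) vanishes for an independent pair and is symmetric in X and Y:

```python
from math import log2

def mi(pxy):
    """I(X;Y) in bits, computed as in (A.2) from a joint distribution pxy[x][y]."""
    px = [sum(r) for r in pxy]
    py = [sum(c) for c in zip(*pxy)]
    return sum(p * log2(p / (px[i] * py[j]))
               for i, r in enumerate(pxy) for j, p in enumerate(r) if p > 0)

indep = [[0.5 * 0.3, 0.5 * 0.7],
         [0.5 * 0.3, 0.5 * 0.7]]        # p(x, y) = p(x) p(y)
coupled = [[0.4, 0.1],
           [0.1, 0.4]]

assert abs(mi(indep)) < 1e-12           # independent => I(X;Y) = 0
assert mi(coupled) > 0
transposed = [list(c) for c in zip(*coupled)]
assert abs(mi(coupled) - mi(transposed)) < 1e-12   # I(X;Y) = I(Y;X)
```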


Appendix A.2.2. Entropy. A measure of self-information of a probability distribution is given by the entropy H(X):

H(X) ≡ E_p log( 1/p(x) ) = I(X;X).   (A.3)

In communication theory H measures the average information that each symbol carries when sampled.

The joint entropy of a pair of random variables is defined as

H(X, Y) ≡ E_{p(x,y)} log( 1/p(x, y) ).   (A.4)

We also define the conditional entropy of a random variable given another as the expected value of the entropies of the conditional distributions:

H(Y|X) ≡ E_{p(x)} H(Y|X = x) = E_{p(x,y)} log( 1/p(y|x) ).   (A.5)

Appendix A.2.3. Information identities. There are various identities connecting I and H [6]. Here is a short list of the most frequently used:

H(X) = I(X;X)

I(X;Y) = H(X) − H(X|Y)

I(X;Y) = H(Y) − H(Y|X)

I(X;Y) = H(X) + H(Y) − H(X, Y)

H(X_1, . . . , X_n) = ∑_{i=1}^n H(X_i | X_{i−1}, . . . , X_1)

I(X_1, . . . , X_n; Y) = ∑_{i=1}^n I(X_i; Y | X_{i−1}, . . . , X_1).   (A.6)
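The first four identities in (A.6) can be verified numerically on any finite joint distribution (the distribution below is a toy example of our own choosing):

```python
from math import log2

pxy = [[0.3, 0.2],
       [0.1, 0.4]]                       # an arbitrary joint p(x, y)
px = [sum(r) for r in pxy]
py = [sum(c) for c in zip(*pxy)]

def H(p):                                # entropy of a probability vector, in bits
    return -sum(v * log2(v) for v in p if v > 0)

Hxy = -sum(v * log2(v) for r in pxy for v in r if v > 0)       # H(X,Y)
I = sum(v * log2(v / (px[i] * py[j]))                          # I(X;Y)
        for i, r in enumerate(pxy) for j, v in enumerate(r) if v > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
assert abs(I - (H(px) + H(py) - Hxy)) < 1e-12
# I(X;Y) = H(X) - H(X|Y), using H(X|Y) = H(X,Y) - H(Y)
assert abs(I - (H(px) - (Hxy - H(py)))) < 1e-12
# I(X;Y) = H(Y) - H(Y|X), using H(Y|X) = H(X,Y) - H(X)
assert abs(I - (H(py) - (Hxy - H(px)))) < 1e-12
```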

Appendix A.3. Typical sequences

Theorem appendix A.1 (Asymptotic equipartition property (AEP)). If X_1, X_2, . . . , X_n are independent and identically distributed (i.i.d.) random variables with probability measure p(x), then

n^{-1} log p(X_1, X_2, . . . , X_n)^{-1} → H(X)   (A.7)

in probability.

The theorem is a simple consequence of the weak law of large numbers. For its proof and that of most other theorems consult [6]. Its extension to arbitrary ergodic finite-valued processes is known as the Shannon–McMillan–Breiman theorem. All definitions and theorems in this section will be presented in their i.i.d. form, but in general they are correct for ergodic sources. The AEP allows us to define a structure in the set of events X.

Definition. The typical set A_ε^n with respect to p(x) is the set of sequences (x_1, x_2, . . . , x_n) ∈ X^n with the following property:

2^{−n(H(X)+ε)} ≤ p(x_1, x_2, . . . , x_n) ≤ 2^{−n(H(X)−ε)}.   (A.8)

The typical set has the following properties.


Theorem appendix A.2 (Properties of typical sequences).

(i) If (x_1, x_2, . . . , x_n) ∈ A_ε^n, then |n^{-1} log p(x_1, x_2, . . . , x_n)^{-1} − H(X)| ≤ ε.
(ii) Pr{A_ε^n} > 1 − ε for n sufficiently large.
(iii) (1 − ε)2^{n(H(X)−ε)} ≤ |A_ε^n| ≤ 2^{n(H(X)+ε)} for n sufficiently large. Here |A| is the number of elements in set A.
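The AEP is easy to observe in simulation (a sketch with an arbitrary Bernoulli source of our own choosing): the per-symbol log-probability of a long i.i.d. sequence concentrates near H(X):

```python
from math import log2
import random

random.seed(0)
p_spike = 0.2
H = -(p_spike * log2(p_spike) + (1 - p_spike) * log2(1 - p_spike))  # ~0.72 bits

def per_symbol_rate(n):
    """-n^{-1} log2 p(X_1, ..., X_n) for one i.i.d. Bernoulli(p_spike) sample."""
    seq = [random.random() < p_spike for _ in range(n)]
    return -sum(log2(p_spike if x else 1 - p_spike) for x in seq) / n

rates = [per_symbol_rate(20000) for _ in range(5)]
assert all(abs(r - H) < 0.05 for r in rates)   # concentration around H(X)
print(H, rates)
```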

Appendix A.4. Jointly typical sequences

When analysing information channels, we deal with two sets of random sequences—input and output. In this case it is necessary to consider the combined behaviour of the pair (X, Y).

Definition. The set A_ε^n of jointly typical sequences {(x^n, y^n)} with respect to the distribution p(x, y) is the set

A_ε^n = { (x^n, y^n) ∈ X^n × Y^n : |−n^{-1} log p(x^n) − H(X)| < ε,
|−n^{-1} log p(y^n) − H(Y)| < ε, |−n^{-1} log p(x^n, y^n) − H(X, Y)| < ε },   (A.9)

where p(x^n, y^n) = ∏_{i=1}^n p(x_i, y_i).

Theorem appendix A.3 (Properties of jointly typical sequences). Let (X^n, Y^n) be i.i.d. pair sequences of length n. Then the following hold.

(i) Pr(A_ε^n) > 1 − ε for n sufficiently large.
(ii) |A_ε^n| ≤ 2^{n(H(X,Y)+ε)} for n sufficiently large.
(iii) If (X̃^n, Ỹ^n) is a pair of random variables with joint probability p(x̃^n, ỹ^n) = p(x^n)p(y^n) (i.e. X̃^n and Ỹ^n are independent with the same marginal distributions as X^n and Y^n), then for sufficiently large n

(1 − ε) 2^{−n(I(X;Y)+3ε)} ≤ Pr( (X̃^n, Ỹ^n) ∈ A_ε^n ) ≤ 2^{−n(I(X;Y)−3ε)}.

The properties of jointly typical sequences can be used to prove one of the principal theorems in information theory: the channel coding theorem. For lack of space we cannot do this here. Instead we just state this important result. A complete proof using these techniques can be found in [6].

Definition. A discrete channel is a system consisting of an input source X, an output source Y and a transition probability p(y|x) of observing the output symbol y given that x was sent. The channel is memoryless if the transition probability is conditionally independent of previous channel inputs and outputs.

Definition. The channel capacity of a discrete memoryless channel is

C = max_{p(x)} I(X;Y).   (A.10)

Theorem appendix A.4 (The channel coding theorem). For any channel with capacity C and every rate R < C there exists a sequence of codes with this rate and maximum probability of error λ(n) → 0.

Conversely, any sequence of codes with λ(n) → 0 must have R ≤ C.


Appendix A.5. Distortion theory

Appendix A.5.1. The data processing inequality. The data processing inequality is used in the proofs of most of the statements later, so we shall take some time to give proper definitions and proofs.

Definition. Random variables X, Y, Z form a Markov chain (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. In this case, the joint probability can be written as

p(x, y, z) = p(z|y)p(y|x)p(x).   (A.11)

In particular, if z = f(y), where f is a deterministic function of y, then X → Y → Z. Also note that if X → Y → Z then Z → Y → X as well.

Theorem appendix A.5 (Data processing inequality). If X → Y → Z, then I(X;Z) ≤ I(X;Y).

Proof. By the chain rule (the last identity in (A.6)) we can expand the mutual information in two different ways:

I(X;Y, Z) = I(X;Z) + I(X;Y|Z)
          = I(X;Y) + I(X;Z|Y).

Since X and Z are independent given Y, I(X;Z|Y) = 0. On the other hand, I(X;Y|Z) ≥ 0, so

I(X;Z) ≤ I(X;Y).

In particular, if Z = f(Y) then I(X;Z) ≤ I(X;Y). □

One interpretation of this inequality is that any physical measurement, deterministic or not, will usually decrease the information carried by one random variable about another.
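A quick numerical instance of theorem appendix A.5 (the joint distribution and the merging function f below are toy choices of our own):

```python
from math import log2

def mi(pab):
    """I(A;B) in bits from a joint distribution pab[a][b]."""
    pa = [sum(r) for r in pab]
    pb = [sum(c) for c in zip(*pab)]
    return sum(p * log2(p / (pa[i] * pb[j]))
               for i, r in enumerate(pab) for j, p in enumerate(r) if p > 0)

# X -> Y -> Z with Z = f(Y): a deterministic merge of the last two values of Y
pxy = [[0.25, 0.10, 0.05],
       [0.05, 0.20, 0.35]]
f = [0, 1, 1]
pxz = [[0.0, 0.0] for _ in pxy]
for i, row in enumerate(pxy):
    for j, p in enumerate(row):
        pxz[i][f[j]] += p

assert mi(pxz) <= mi(pxy) + 1e-12   # I(X;Z) <= I(X;Y)
```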

Appendix A.5.2. Quantization. A random variable Y can be related to another random variable Y_N through the process of quantization [6, 14]. Y_N is referred to as the reproduction of Y. Quantizers can be deterministic (functions) or stochastic (given through a conditional probability). The size of the reproduction space is smaller than the size of the quantized space.

A deterministic quantizer (or just quantizer, also referred to as hard clustering [28]) is any simple measurable function f : X → X_f from X to a reproduction space X_f with finitely many elements x_f^i. The quantizer f induces a partition {Q_f^i} such that

(i) Q_f^i ⊂ X,
(ii) Q_f^i ∩ Q_f^j = ∅ if i ≠ j,
(iii) ∪_i Q_f^i = X.

The information quantities in the reproduction X_f are

H(X_f) = E_{p(x_f)} log( 1/p(x_f) ) = E_{p(Q_f)} log( 1/p(Q_f) )   (A.12)

I(X_f, Y_g) = E_{p(x_f,y_g)} log[ p(x_f, y_g)/(p(x_f)p(y_g)) ] = E_{p(Q_f,Q_g)} log[ p(Q_f, Q_g)/(p(Q_f)p(Q_g)) ].   (A.13)

The quantizer h refines f (h > f) if the partition Q_h of X induced by h refines the partition Q_f induced by f. Q_h refines Q_f (Q_h > Q_f) if any Q_h^i ∈ Q_h is a subset of some


Q_f^j ∈ Q_f. Note that in this case X → X_h → X_f; i.e., the reproductions form a certain Markov chain with the original space.

A stochastic quantizer (soft clustering [28]) is a mapping q from one probability measure space X to another X_q. The mapping is given by the conditional probability q(x_q|x), which can be interpreted as the probability of x belonging to the reproduction class x_q. The source space induces a probability measure on the reproduction space by p(x_q) = ∑_x q(x_q|x)p(x). A stochastic quantizer can also be considered as a communication channel. The deterministic quantizer discussed above is a special case of a stochastic quantizer with q(x_q|x) = 1 if x ∈ x_q and zero otherwise.

The information quantities in X_q are

H(X_q) = E_{p(x_q)} log( 1/p(x_q) )   (A.14)

I(X_q, Y_g) = E_{p(x_q,y_g)} log[ p(x_q, y_g)/(p(x_q)p(y_g)) ].   (A.15)

Unlike the deterministic case, here we do not have a nice inverse image of the quantizer map to define refinements in the usual way. Thus we choose a different property of the quantizer as a definition of refinement: the quantizer h refines f (h > f) if X → X_h → X_f (that is, X_f is downstream of X_h in a Markov chain).

There are some properties of quantizers (deterministic or stochastic) that are useful for discussing information transmission.

If h > f , then from the Markov relation and theorem appendix A.5 it follows that

H(Y|X_h) ≤ H(Y|X_f)

I(Y, X_h) ≥ I(Y, X_f).   (A.16)

For any X (discrete or continuous), and any quantizer f,

I(Y, X) ≥ I(Y, X_f)   (A.17)

that is, estimates in the quantized spaces are always lower bounds of the actual information quantities.

When X is continuous, H(X_f) diverges with refinements [14]. I(X, Y), on the other hand, can always be obtained as the least upper bound over all refinements. Without further constraints, this is achieved by a deterministic quantizer [28].

The statements above suggest that quantizations could provide lower-bound estimates of H and I with controlled precision, since the size of the pattern set is fixed by the size of the quantized space and could be potentially much lower than the size of the original pattern space. This allows us to obtain more precise estimates of the quantities in question.
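A small sketch of (A.17) with a randomly drawn soft clustering (the joint distribution and quantizer below are toy values of our own choosing):

```python
from math import log2
import random

def mi(pab):
    """I(A;B) in bits from a joint distribution pab[a][b]."""
    pa = [sum(r) for r in pab]
    pb = [sum(c) for c in zip(*pab)]
    return sum(p * log2(p / (pa[i] * pb[j]))
               for i, r in enumerate(pab) for j, p in enumerate(r) if p > 0)

def quantize_x(pxy, q):
    """Joint p(x_q, y) induced by a stochastic quantizer q[x][x_q] on X."""
    return [[sum(pxy[x][y] * q[x][k] for x in range(len(q)))
             for y in range(len(pxy[0]))] for k in range(len(q[0]))]

random.seed(1)
pxy = [[0.15, 0.05], [0.05, 0.15], [0.10, 0.20], [0.20, 0.10]]
# a random soft clustering of the four X values into two classes
q = [[r, 1 - r] for r in (random.random() for _ in range(4))]

assert mi(quantize_x(pxy, q)) <= mi(pxy) + 1e-12   # (A.17): a lower bound
```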

Appendix A.5.3. Distortion theory. The quality of a quantization is the topic of distortion theory [6]. It is characterized by a distortion function d : X × X_q → R+. The distortion function d(x, x_q) measures how well x is represented by x_q. In general it can be arbitrary. The expected distortion

D(q(x_q|x)) = E_{q(x_q|x)p(x)} d(x, x_q)

measures the quality of quantization by the quantizer q(x_q|x).

Definition. The rate distortion function R(D) for a source X, reproduction X_q and distortion d(x, x_q) is defined [6] as

R(D) = min_{q(x_q|x) : D(q(x_q|x)) ≤ D} I(X;X_q).   (A.18)


The minimization here is over all stochastic quantizers q(x_q|x) which satisfy the distortion constraint.

The solution of (A.18) is a standard minimization problem of the convex function I(X, X_q) over the convex set {q(x_q|x) : q(x_q|x) ≥ 0; ∑_{x_q} q(x_q|x) = 1; D(q(x_q|x)) ≡ ∑_{x,x_q} q(x_q|x)p(x)d(x, x_q) ≤ D}. Analytically it can be solved by using Lagrange multipliers to minimize

J(q(x_q|x)) = I(q(x_q|x)p(x)) + βD(q(x_q|x)) + ∑_x µ(x) ∑_{x_q} q(x_q|x)

and find β and µ(x) from the constraints.

Appendix B. Properties of the information distortion function

Appendix B.1. Definition

The (average) information distortion DI between a random variable Y and its reproduction Y_N is defined through another random variable, X, related to Y so that X → Y → Y_N forms a Markov chain. In this case, we define the information distortion between Y and Y_N as

DI(Y, YN) = I (X, Y ) − I (X, YN). (B.1)

The only part that depends on the quantizer q(y_N|y) is I(X, Y_N), so we can concentrate on the properties of D_eff = I(X, Y_N).

Appendix B.2. Properties

DI is a bounded function of the quantizer. Indeed, since DI = I(X, Y) − I(X, Y_N) and 0 ≤ I(X, Y_N) ≤ I(X, Y), it follows that 0 ≤ DI ≤ I(X, Y).

DI is an expected distortion—an integral characteristic of the relation between the whole sets Y and Y_N. We can write it in a form that explicitly includes the expectation of a pointwise distortion function. Indeed, using p(x, y) = ∑_{y_N} p(x, y, y_N) and p(x, y_N) = ∑_y p(x, y, y_N), we have

DI = I(X, Y) − I(X, Y_N)

= ∑_{x,y} p(x, y) log[ p(x, y)/(p(x)p(y)) ] − ∑_{x,y_N} p(x, y_N) log[ p(x, y_N)/(p(x)p(y_N)) ]   (B.2)

= ∑_{x,y,y_N} p(x, y, y_N) ( log p(x|y) − log p(x|y_N) )   (B.3)

= ∑_{y,y_N} p(y, y_N) ∑_x p(x|y) log[ p(x|y)/p(x|y_N) ]

= ∑_{y,y_N} p(y, y_N) KL( p(x|y) ‖ p(x|y_N) ).   (B.4)

Here (B.3) uses the Bayes property p(x, y)/p(y) = p(x|y); the term log p(x) is common to the two parts and cancels. Step (B.4) uses the Markov property p(x, y, y_N) = p(x|y)p(y, y_N). This shows that the information distortion

DI = E_{p(y,y_N)} KL( p(x|y) ‖ p(x|y_N) )   (B.5)

is the expectation of the KL divergence (A.1) of p(x|y) with respect to p(x|y_N). Unlike the pointwise distortion functions usually investigated in information theory [6, 28], this one depends on the quantizer q(y_N|y), through p(x|y_N).
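The equality of (B.1) and (B.5) can be checked numerically for the Markov chain X → Y → Y_N (the joint distribution and quantizer below are toy values of our own choosing):

```python
from math import log2

def mi(pab):
    """I(A;B) in bits from a joint distribution pab[a][b]."""
    pa = [sum(r) for r in pab]
    pb = [sum(c) for c in zip(*pab)]
    return sum(p * log2(p / (pa[i] * pb[j]))
               for i, r in enumerate(pab) for j, p in enumerate(r) if p > 0)

pxy = [[0.20, 0.05, 0.10], [0.05, 0.30, 0.30]]          # p(x, y)
q = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]                # q(y_N|y), stochastic

# induced joints: p(x, y_N) and p(y, y_N), using the Markov property
py = [sum(c) for c in zip(*pxy)]
pxyn = [[sum(r[y] * q[y][n] for y in range(3)) for n in range(2)] for r in pxy]
pyyn = [[py[y] * q[y][n] for n in range(2)] for y in range(3)]
pyn = [sum(c) for c in zip(*pyyn)]

d1 = mi(pxy) - mi(pxyn)                                  # D_I via (B.1)
d2 = sum(pyyn[y][n] * sum((pxy[x][y] / py[y]) *          # D_I via (B.5)
         log2((pxy[x][y] / py[y]) / (pxyn[x][n] / pyn[n]))
         for x in range(2))
         for y in range(3) for n in range(2))
assert abs(d1 - d2) < 1e-9
```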


In the process of investigating the information distortion function we may have to evaluate it at values of q(y_N|y) which are not conditional probabilities. The only requirement we want to impose on q is that it be non-negative: q(y_N|y) ≥ 0. For this reason we need to define certain relations which are automatically true when q(y_N|y) is a conditional probability. We shall use the fact that p(x, y) is a probability function. With that in mind,

p(x) = ∑_y p(x, y)

p(y) = ∑_x p(x, y).   (B.6)

We define

p(x, y_N) ≡ ∑_y q(y_N|y)p(x, y)   (B.7)

p(y_N) ≡ ∑_x p(x, y_N) = ∑_y q(y_N|y) ∑_x p(x, y)

q(y_N|x) ≡ ∑_y q(y_N|y)p(y|x).   (B.8)

All these are automatically true when q(y_N|y) is a conditional probability. If q(y_N|y) is not a conditional probability, none of the above are probabilities but all are non-negative. Also p(x) ≠ ∑_{y_N} p(x, y_N) unless q(y_N|y) is a conditional probability. It is still the case that

p(x, y_N) = q(y_N|x)p(x)   (B.9)

(the Bayes property) because of (B.6) and the definitions (B.7). With these definitions, the natural extensions of the information quantities (A.2), (A.5) for arbitrary non-negative q(y_N|y) are

I(X, Y_N) ≡ ∑_{x,y_N} p(x, y_N) log[ p(x, y_N)/(p(x)p(y_N)) ]

H(Y_N|Y) ≡ −∑_{y,y_N} q(y_N|y)p(y) log q(y_N|y).   (B.10)

Lemma appendix B.1 (Concavity of DI). The information distortion function DI is a concave function of any q(y_N|y) ≥ 0.

Proof. Consider DI = I(X, Y) − I(X, Y_N). The first term is constant with respect to the quantizer q(y_N|y), so it is sufficient to consider only the second term. Using the definition of I above (B.10) and (B.9), we can see that I(X, Y_N) is a function of q(y_N|x)p(x). For a fixed p(x), I(q(y_N|x)) is a convex function of q(y_N|x) [6] (shown there for q(y_N|x) a probability, but easily extensible to any q(y_N|x) ≥ 0). Hence,

I(λq_1(y_N|x) + (1 − λ)q_2(y_N|x)) ≤ λI(q_1(y_N|x)) + (1 − λ)I(q_2(y_N|x))   (B.11)

for any q_1, q_2 ≥ 0. We need to show that I is also a convex function of q(y_N|y). Because of (B.7),

q(y_N|x) = ∑_y q(y_N|y)p(y|x).   (B.12)

Consider I(q(y_N|y)) ≡ I( ∑_y q(y_N|y)p(y|x) ). For any q_1(y_N|y), q_2(y_N|y) ≥ 0,

I(λq_1(y_N|y) + (1 − λ)q_2(y_N|y)) = I( ∑_y (λq_1(y_N|y) + (1 − λ)q_2(y_N|y)) p(y|x) )


= I( λ ∑_y q_1(y_N|y)p(y|x) + (1 − λ) ∑_y q_2(y_N|y)p(y|x) )

= I( λq_1(y_N|x) + (1 − λ)q_2(y_N|x) )   (B.13)

≤ λI(q_1(y_N|x)) + (1 − λ)I(q_2(y_N|x))

= λI( ∑_y q_1(y_N|y)p(y|x) ) + (1 − λ)I( ∑_y q_2(y_N|y)p(y|x) )

= λI(q_1(y_N|y)) + (1 − λ)I(q_2(y_N|y))   (B.14)

where (B.13) follows from (B.12) and (B.14) is a consequence of the convexity of I (B.11). This finishes the proof that I(X, Y_N) is a convex function of q(y_N|y). Since DI ∝ −I(X, Y_N), DI is a concave function of the quantizer q(y_N|y). □

Appendix B.3. Derivatives

DI is a function of the quantizer q(y_N|y). The effective distortion D_eff = I(X, Y_N) has the same derivatives with respect to the quantizer as DI, with opposite sign, so we shall consider only the derivatives of D_eff. In this section we shall explicitly use the assumption that all spaces are finite and discrete. We do not use anywhere the fact that q(y_N|y) is a conditional probability. The final results are thus ready for programming in a computer or for further finite-dimensional analysis. The results for continuous random variables can easily be adapted from here using analogous methods from the calculus of variations. We use Latin indices (i, j, k) to denote members of the original spaces X, Y and Greek indices (µ, ν, η) for elements of the reproduction Y_N. While differentiating, we shall use the natural logarithm (ln) in all definitions and rescale the results to the base two logarithm (log) at the end.

In terms of the quantizer,

D_eff = ∑_{i,µ} p(x_i, y_µ) ln[ p(x_i, y_µ)/(p(x_i)p(y_µ)) ]   (B.15)

where, following (B.7),

p(x_i, y_µ) ≡ ∑_j q(y_µ|y_j)p(x_i, y_j)

p(y_µ) ≡ ∑_j q(y_µ|y_j)p(y_j).   (B.16)

It is beneficial to calculate the derivatives of (B.16) before attempting this for D_eff. We use the fact that ∂q(y_µ|y_j)/∂q(y_ν|y_k) = δ_{µν}δ_{jk}. The derivatives are

∂p(x_i, y_µ)/∂q(y_ν|y_k) = δ_{µν} p(x_i, y_k)

∂p(y_µ)/∂q(y_ν|y_k) = δ_{µν} p(y_k).   (B.17)

Using (B.17),

(∇D_eff)_{νk} ≡ ∂D_eff/∂q(y_ν|y_k)

= ∂/∂q(y_ν|y_k) ∑_{i,µ} p(x_i, y_µ) ln[ p(x_i, y_µ)/(p(x_i)p(y_µ)) ]

= ∑_{i,µ} [ ∂p(x_i, y_µ)/∂q(y_ν|y_k) ] ln[ p(x_i, y_µ)/(p(x_i)p(y_µ)) ] + p(x_i, y_µ) ∂/∂q(y_ν|y_k) ( ln p(x_i, y_µ) − ln p(y_µ) )

= ∑_{i,µ} δ_{µν} p(x_i, y_k) ln[ p(x_i, y_µ)/(p(x_i)p(y_µ)) ] + δ_{µν} p(x_i, y_µ) ( p(x_i, y_k)/p(x_i, y_µ) − p(y_k)/p(y_µ) )

= ∑_i p(x_i, y_k) ln[ p(x_i, y_ν)/(p(x_i)p(y_ν)) ] + ∑_i p(x_i, y_k) − [ p(y_k)/p(y_ν) ] ∑_i p(x_i, y_ν)   (B.18)

where the last two terms cancel, since ∑_i p(x_i, y_k) ≡ p(y_k) and ∑_i p(x_i, y_ν) ≡ p(y_ν). Hence

(∇D_eff)_{νk} = ∑_i p(x_i, y_k) ln[ p(x_i, y_ν)/(p(x_i)p(y_ν)) ].   (B.19)

The second derivatives are

∂²D_eff / ∂q(y_η|y_l)∂q(y_ν|y_k) = ∂/∂q(y_η|y_l) ∑_i p(x_i, y_k) ln[ p(x_i, y_ν)/(p(x_i)p(y_ν)) ]

= ∑_i p(x_i, y_k) ∂/∂q(y_η|y_l) ( ln p(x_i, y_ν) − ln p(y_ν) )

= ∑_i p(x_i, y_k) δ_{νη} ( p(x_i, y_l)/p(x_i, y_ν) − p(y_l)/p(y_ν) )   (B.20)

⇒ ∂²D_eff / ∂q(y_η|y_l)∂q(y_ν|y_k) = δ_{νη} ( ∑_i p(x_i, y_k)p(x_i, y_l)/p(x_i, y_ν) − p(y_k)p(y_l)/p(y_ν) ).   (B.21)

In all cases we assume that all relevant quantities are absolutely continuous with respect to one another, so that all divisions can be performed. In practice this means that any optimization has to be restricted away from zero, since the gradients diverge there. When a deterministic quantizer (hard clustering) is required, the optimization can be brought close to a boundary (q(y_N|y) ≈ 0 for some (y_N, y)) and then thresholded to obtain the deterministic map.
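The gradient formula (B.19) can be checked against a finite difference (the random toy distributions and helper names below are our own):

```python
from math import log
import random

random.seed(2)
nx, ny, nn = 3, 4, 2
raw = [[random.random() for _ in range(ny)] for _ in range(nx)]
s = sum(map(sum, raw))
pxy = [[v / s for v in row] for row in raw]   # random joint p(x, y)
px = [sum(row) for row in pxy]
q = [[0.4 + 0.2 * random.random() for _ in range(nn)] for _ in range(ny)]

def deff(q):
    """D_eff = I(X, Y_N) in nats, as in (B.15), for arbitrary q >= 0."""
    pxyn = [[sum(pxy[i][j] * q[j][n] for j in range(ny)) for n in range(nn)]
            for i in range(nx)]
    pyn = [sum(pxyn[i][n] for i in range(nx)) for n in range(nn)]
    return sum(pxyn[i][n] * log(pxyn[i][n] / (px[i] * pyn[n]))
               for i in range(nx) for n in range(nn))

def grad(q, n, k):
    """(grad D_eff)_{nk} from (B.19), in nats."""
    pxyn = [[sum(pxy[i][j] * q[j][m] for j in range(ny)) for m in range(nn)]
            for i in range(nx)]
    pyn = [sum(pxyn[i][m] for i in range(nx)) for m in range(nn)]
    return sum(pxy[i][k] * log(pxyn[i][n] / (px[i] * pyn[n])) for i in range(nx))

# compare the analytic gradient with a central finite difference in q(y_n|y_k)
n, k, h = 1, 2, 1e-6
qp = [row[:] for row in q]; qp[k][n] += h
qm = [row[:] for row in q]; qm[k][n] -= h
assert abs((deff(qp) - deff(qm)) / (2 * h) - grad(q, n, k)) < 1e-6
```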

When posing the optimization problem (section 3.3), we encounter the functional F = H(Y_N|Y) − βDI. To analyse it further we also need the derivatives of H(Y_N|Y):

(∇H)_{νk} ≡ ∂H(Y_N|Y)/∂q(y_ν|y_k)

= −∂/∂q(y_ν|y_k) ∑_{j,µ} q(y_µ|y_j)p(y_j) ln q(y_µ|y_j)

= −∑_{j,µ} p(y_j) δ_{µν} δ_{jk} ( ln q(y_µ|y_j) + 1 )

⇒ (∇H)_{νk} = −p(y_k)( ln q(y_ν|y_k) + 1 ).   (B.22)

The second derivatives are

∂²H(Y_N|Y) / ∂q(y_η|y_l)∂q(y_ν|y_k) = −∂/∂q(y_η|y_l) [ p(y_k)( ln q(y_ν|y_k) + 1 ) ]

= −[ p(y_k)/q(y_ν|y_k) ] δ_{νη} δ_{kl}.   (B.23)

To obtain the results when the information quantities are measured in bits, all of the derivatives above should be divided by ln 2 ≈ 0.6931.
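Similarly, (B.22) can be verified by a central-difference check (the marginal and quantizer values below are toy choices of our own):

```python
from math import log

py = [0.2, 0.3, 0.5]                      # p(y)
q = [[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]]  # q(y_N|y)

def Hcond(q):
    """H(Y_N|Y) in nats: -sum_{y,y_N} q(y_N|y) p(y) ln q(y_N|y), as in (B.10)."""
    return -sum(q[j][n] * py[j] * log(q[j][n])
                for j in range(3) for n in range(2))

# (B.22): (grad H)_{nk} = -p(y_k)(ln q(y_n|y_k) + 1)
n, k, h = 0, 1, 1e-7
qp = [row[:] for row in q]; qp[k][n] += h
qm = [row[:] for row in q]; qm[k][n] -= h
fd = (Hcond(qp) - Hcond(qm)) / (2 * h)
assert abs(fd - (-py[k] * (log(q[k][n]) + 1))) < 1e-6
```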


References

[1] Adrian E D 1928 The Basis of Sensation: the Action of the Sense Organs (New York: Norton)
[2] Atick J J 1992 Could information theory provide an ecological theory of sensory processing? Network: Comput. Neural Syst. 3 213–51
[3] Barlow H B 1961 Possible principles underlying the transformation of sensory messages Sensory Communications ed W A Rosenblith (Cambridge, MA: MIT Press)
[4] Berry M J and Meister M 2001 Firing events: fundamental symbols in the neural code of retinal ganglion cells? Computational Neuroscience: Trends in Research ed J Bower (Amsterdam: Elsevier)
[5] Brenner N, Strong S P, Koberle R, Bialek W and de Ruyter van Steveninck R R 2000 Synergy in a neural code Neural Comput. 12 1531–52
[6] Cover T and Thomas J 1991 Elements of Information Theory (Wiley Series in Communication) (New York: Wiley)
[7] de Ruyter van Steveninck R R, Lewen G D, Strong S P, Koberle R and Bialek W 1997 Reproducibility and variability in neural spike trains Science 275 1805–8
[8] Dimitrov A G 1998 Aspects of cortical information processing PhD Thesis The University of Chicago
[9] Dimitrov A G and Miller J P 2000 Natural time scales for neural encoding Neurocomputing 32–3 1027–34
[10] Dimitrov A G, Miller J P and Aldworth Z 2002 Spike pattern-based coding schemes in the cricket cercal sensory system Computational Neuroscience: Trends in Research ed J Bower to be published
[11] Dimitrov A G, Miller J P, Aldworth Z and Gedeon T 2001 Non-uniform quantization of neural spike sequences through an information distortion measure Computational Neuroscience: Trends in Research ed J Bower (Amsterdam: Elsevier)
[12] Georgopoulos A P, Schwartz A B and Kettner R E 1986 Neuronal population coding of movement direction Science 233 1416–9
[13] Gersho A and Gray R M 1992 Vector Quantization and Signal Compression (Dordrecht: Kluwer)
[14] Gray R M 1990 Entropy and Information Theory (Berlin: Springer)
[15] Hubel D and Wiesel T 1961 Receptive fields, binocular interaction and functional architecture in the cat's visual cortex J. Physiol. (Lond.) 195 215–43
[16] Jaynes E T 1982 On the rationale of maximum-entropy methods Proc. IEEE 70 939–52
[17] Johnson D H, Gruner C M, Baggerly K and Seshagiri C 2001 Information-theoretic analysis of the neural code J. Comput. Neurosci. 10 47–70
[18] Kjaer T W, Hertz J A and Richmond B J 1994 Decoding cortical neuronal signals: network models, information estimation and spatial tuning J. Comput. Neurosci. 1 109–39
[19] Koch C and Segev I (ed) 1992 Methods in Neuronal Modeling (Cambridge, MA: MIT Press)
[20] Kullback S 1959 Information Theory and Statistics (New York: Wiley)
[21] Meister M and Berry M J 1999 The neural code of the retina Neuron 22 435–50
[22] Oram M W, Wiener M C, Lestienne R and Richmond B J 1999 The stochastic nature of precisely timed spike patterns in visual system neural responses J. Neurophysiol. 81 3021–33
[23] Panzeri S, Schultz S R, Treves A and Rolls E T 1999 Correlations and the encoding of information in the nervous system Proc. R. Soc. B 266 1001–12
[24] Penev P S and Sirovich L 2000 The global dimensionality of face space Proc. 4th Int. Conf. on Automatic Face and Gesture Recognition (Grenoble, 2000) (Los Alamitos, CA: IEEE Computer Society Press) pp 264–70
[25] Reinagel P and Reid R 2000 Temporal coding of visual information in the thalamus J. Neurosci. 20 5392–400
[26] Rieke F, Warland D, de Ruyter van Steveninck R R and Bialek W 1997 Spikes: Exploring the Neural Code (Cambridge, MA: MIT Press)
[27] Rose K 1994 A mapping approach to rate distortion computation and analysis IEEE Trans. Inform. Theory 40 1939–52
[28] Rose K 1998 Deterministic annealing for clustering, compression, classification, regression, and related optimization problems Proc. IEEE 86 2210–39
[29] Salinas E and Abbott L F 1994 Vector reconstruction from firing rates J. Comput. Neurosci. 1 89–107
[30] Schultz S R and Panzeri S 2001 Temporal correlations and neural spike train entropy Phys. Rev. Lett. 86 5823–6
[31] Shannon C E 1948 A mathematical theory of communication Bell Syst. Tech. J. 27 623–56
[32] Slonim N and Tishby N 2000 Agglomerative information bottleneck Advances in Neural Information Processing Systems vol 12, ed S A Solla, T K Leen and K-R Muller (Cambridge, MA: MIT Press) pp 617–23
[33] Strong S P, Koberle R, de Ruyter van Steveninck R R and Bialek W 1998 Entropy and information in neural spike trains Phys. Rev. Lett. 80 197–200
[34] Theunissen F and Miller J P 1995 Temporal encoding in nervous systems: a rigorous definition J. Comput. Neurosci. 2 149–62


[35] Theunissen F, Roddey J C, Stufflebeam S, Clague H and Miller J P 1996 Information theoretic analysis of dynamical encoding by four primary sensory interneurons in the cricket cercal system J. Neurophysiol. 75 1345–59
[36] Theunissen F E and Miller J P 1991 Representation of sensory information in the cricket cercal sensory system. II. Information theoretic calculation of system accuracy and optimal tuning curve widths of four primary interneurons J. Neurophysiol. 66 1690–703
[37] Tishby N, Pereira F C and Bialek W 2000 The information bottleneck method LANL Preprint http://arXiv.org/abs/physics/0004057
[38] Treves A and Panzeri S 1995 The upward bias in measures of information derived from limited data samples Neural Comput. 7 399–407
[39] Victor J D 2000 How the brain uses time to represent and process visual information Brain Res. 886 33–46
[40] Victor J D and Purpura K 1997 Metric-space analysis of spike trains: theory, algorithms, and application Network: Comput. Neural Syst. 8 127–64
[41] Warland D 1991 Reading between the spikes: real-time processing in neural systems PhD Thesis University of California at Berkeley