
Clustering Financial Time Series and Evidences of Memory Effects

Facoltà di Scienze Matematiche, Fisiche e Naturali
Corso di Laurea Magistrale in Fisica

Candidate: Gabriele Pompa
ID number 1146901

Thesis Advisor: Prof. Luciano Pietronero

Academic Year 2011/2012

Clustering Financial Time Series and Evidences of Memory Effects. Master thesis, Sapienza University of Rome.

© 2012 Gabriele Pompa. All rights reserved.

This thesis has been typeset with LaTeX and the Sapthesis class.

Author's email: [email protected]

Non scholae, sed vitae discimus.

dedicated to my mother, for teaching me commitment,
to my father, for making me love it,
and to Lilla, for putting up with all this with love.

Contents

Introduction

1 Financial Markets
  1.1 Efficient Market Hypothesis
  1.2 Random-Walk Models
  1.3 Stylized Facts
  1.4 Technical trading
    1.4.1 Main assumptions and Skepticism
    1.4.2 Feedback Effect and Common Figures

2 Pattern Recognition
  2.1 From the Iris Dataset to Economic Taxonomy
  2.2 Supervised and Unsupervised Learning and Classification
  2.3 Bayesian learning
    2.3.1 Bayes Rule
    2.3.2 Bayesian Model Selection
  2.4 Clustering
    2.4.1 Definition and Distinctions
    2.4.2 Time Series Clustering
  2.5 Distance and Similarity
    2.5.1 Information Theoretic Interpretation

3 Monte Carlo Framework
  3.1 Motivations
  3.2 Principles
  3.3 Static Methods
    3.3.1 Rejection Sampling
    3.3.2 Hit-or-Miss Sampling: a numerical experiment
  3.4 Dynamic Methods
    3.4.1 Markov Chains
    3.4.2 MCMC and Metropolis-Hastings Algorithm

4 Memory Effects: Bounce Analysis
  4.1 The Data
    4.1.1 T-Seconds Rescaling
  4.2 Bounce: Critical Discussion About Definition
  4.3 Consistent Random Walks
  4.4 Memory Effects in Bounce Probability
  4.5 Window Analysis
    4.5.1 Recurrence Time
    4.5.2 Window Size
    4.5.3 Fluctuations within Window

5 The Clustering Model
  5.1 Structure of the Algorithm
  5.2 Toy Model
  5.3 Real Series
  5.4 Best Partition: Bayesian Characterization
    5.4.1 Gaussian Cost Prior
    5.4.2 Gaussian Likelihood
  5.5 MCMC Step
  5.6 Splitting Step
    5.6.1 RANDOM of SPLITTING
  5.7 Merging Step
    5.7.1 RANDOM of MERGING

6 The Clustering Results
  6.1 Role of Parameters
    6.1.1 Noise Dependency of RANDOM Thresholds
  6.2 Toy Model Clustering
    6.2.1 Insights of Convergence
    6.2.2 Prior Analysis and Sub-Optimal Partitions
    6.2.3 Results of the Entire 3-Steps Procedure
  6.3 Real Series Clustering
    6.3.1 Missteps: Granularity and Short Series Effects
    6.3.2 Correct Clustering Results
    6.3.3 Attempt of Cause-Effect Analysis
  6.4 Conclusions and Further Analysis

A Noise Dependency of Merging Threshold - List of Plots
B Clustering Results - List of Plots
C Cause-Effect Clustering - List of Plots
  C.1 Half-Series Clusters
  C.2 Cause-Effect Relations

Bibliography


    Introduction

Among the community of investors, technical trading is a growing school of thought: every day more and more investors rely on technical indicators.

Standard economic theory considers the "efficiency" of the market (Fama, 1970) as the cornerstone of the whole theory. This assumption postulates the impossibility of riskless investment strategies, and it would endorse a simplistic stochastic modeling of the market in terms of a random walk of the price. There are, however, various empirical evidences, known as "stylized facts", which seriously restrict the validity of totally random models in explaining the behavior of the market.

Recently the problem was approached from a different point of view, namely whether the behavior of the market shows evidences of investment strategies familiar to, and simultaneously adopted by, a large community of investors [57]. Focusing on "technical" indicators of investment, such as Supports and Resistances, the probability of correct prediction of such indicators was studied conditional on the number of times they had been previously exploited, and evidences of memory effects were found.

In this thesis work a new method to investigate regularity in the market was developed. The problem was approached from the point of view of the presence of "similarity" in the behavior of the price around the points where a Support or Resistance level had already been identified. The procedure, defined and refined over months, led to the design of an original algorithm for the clustering of time series.

The thesis consists of 6 chapters and 3 appendices. The core of the results is presented in chapters 4 to 6, whereas the first three chapters provide the necessary background, and the appendices list all the plots omitted in the text for clarity but included for completeness. The structure of the chapters is the following:

Chapter 1: a rapid review of the standard economic theory, mainly focused around the Efficient Market Hypothesis, the main random-walk models designed to explain market behavior and their corresponding drawbacks. The chapter ends with an introduction to the philosophy of technical trading and provides examples of common technical indicators, such as Supports and Resistances.

Chapter 2: provides the essential background of the theory of Pattern Recognition, together with examples of applications from the early and the modern literature, then specializes to the statistical framework developed starting from the Bayes rule, and finally introduces the central concept of Clustering, stressing the aspects concerning time series clustering.

Chapter 3: introduces the numerical instruments adopted, namely those of Monte Carlo sampling theory. The MC theory is briefly revised, considering static and dynamic methods separately. The chapter ends with the essential features of the theory of Markov Chains, necessary in order to contextualize the Markov Chain Monte Carlo (MCMC) methods widely adopted in the numerical simulations performed.

Chapter 4: critically reviews the results previously obtained on the analysis of the rebounds on the Support and Resistance levels. The bounce analysis is then extended to the statistical properties of the characteristic times describing bounces and of the typical fluctuations of the price around those events.

Chapter 5: this completely original chapter introduces the Bayesian algorithm adopted in the subsequent clustering analysis. After stating the 3-step structure of the procedure, each step is analyzed in detail in order to provide a mathematical basis and to make the results more easily reproducible.

Chapter 6: reports all the results obtained via the adopted clustering procedure, both for the toy model used to test the algorithm and for the real financial time series, with the aim of reporting as objectively as possible the weaknesses as well as the positive aspects of the algorithm designed for clustering purposes.

Although this thesis work represents, in my opinion, only the beginning of this original and fascinating analysis, evidences of structural regularities among the analyzed time series are effectively found even at this early stage.


    Chapter 1

    Financial Markets

1.1 Efficient Market Hypothesis

Since 1970, with Fama's work [1], the dominant assumption on capital markets has been the Efficient Market Hypothesis (EMH). Under this hypothesis the market is viewed as an open system instantly processing all the information available. The concept of available information, or information set θt, describes the corpus of knowledge on which traders base their investment decisions. These could be based on public or private information, such as the expected profit of a company, interest rates and the expectation of dividends [2, Chap. 8].

Jensen (1978): "A market is efficient with respect to information set θt if it is impossible to make economic profits by trading on the basis of information set θt" [3].

The efficiency of the market is expressed by the absence of arbitrage, namely the impossibility of realizing riskless strategies relying only on the time needed by the price to return to its fundamental value after an operation, the fundamental value being the one expected on the basis of θt. This automatic self-organization of the market yields prices that always fully reflect the available information, so that price increments are substantially random.

In finance, the variable related to the price increment over a lapse of time τ, from price pt to price pt+τ, is called the return rτ(t) and is defined in various ways.


Let pt be the price of a financial asset at time t. Then possible definitions of the returns adopted are:

Linear Returns

    rτ(t) = pt+τ − pt    (1.1)

which has the advantage of being linear, but directly depends on the currency.

Relative Returns

    rτ(t) = (pt+τ − pt) / pt    (1.2)

which takes into account only the percentage changes. However, with this definition two consecutive and opposite variations are not equivalent to a null variation.¹

¹ Two consecutive increments, such as a gain of +1% followed by a loss of −1% starting from pt = 100 $, will not bring the price back to its initial value.

Logarithmic Returns

    rτ(t) = log(pt+τ) − log(pt) ≈ (pt+τ − pt) / pt    (1.3)

where the approximation is valid in the case of high-frequency data (more details in section 4.1), in which the absolute variation |pt+τ − pt| of the price is much smaller than the value pt.
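As a purely illustrative aid (not part of the thesis), the three definitions (1.1)-(1.3) can be computed with a few lines of NumPy; the price array, the horizon tau and the function name below are arbitrary choices:

```python
import numpy as np

def returns(prices, tau=1, kind="log"):
    """Return series r_tau(t) for a 1-D price array and a horizon of tau samples."""
    p_t, p_tau = prices[:-tau], prices[tau:]
    if kind == "linear":                      # eq. (1.1): depends on the currency unit
        return p_tau - p_t
    if kind == "relative":                    # eq. (1.2): percentage changes
        return (p_tau - p_t) / p_t
    return np.log(p_tau) - np.log(p_t)        # eq. (1.3): logarithmic returns

prices = np.array([100.0, 100.5, 99.8, 100.2, 101.0])
print(returns(prices, tau=1, kind="log"))
```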

Market efficiency can be mathematically formalized by the martingale property:

    E[pt+1 | p0, p1, ..., pt] = pt    (1.4)

to be satisfied by the price time series, which states that, given the whole price history, the expected future price is just the present price. This condition corresponds exactly to what is defined as a perfect market [2]. In reality, an efficient market needs a finite time to self-organize, a time quantified in practice via the two-point autocorrelation function:

    ρτ(t, t′) = ( E[rτ(t) rτ(t′)] − E[rτ(t)] E[rτ(t′)] ) / ( E[rτ²(t)] − E[rτ(t)]² )    (1.5)

which is indeed always zero, except at very short scales, up to a few minutes (figures (1.1) and (1.2)).
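A minimal sketch of how (1.5) could be estimated in practice, assuming stationarity so that the expectations are replaced by time averages; the i.i.d. Gaussian series used as input is only a sanity check, not market data:

```python
import numpy as np

def autocorrelation(r, max_lag):
    """Sample estimate of eq. (1.5) at lags 0..max_lag, using time averages."""
    r = np.asarray(r, dtype=float) - np.mean(r)
    var = np.var(r)
    acf = [1.0]
    for lag in range(1, max_lag + 1):
        acf.append(np.mean(r[:-lag] * r[lag:]) / var)
    return np.array(acf)

rng = np.random.default_rng(0)
# For i.i.d. Gaussian returns the autocorrelation should be close to zero at every nonzero lag
print(autocorrelation(rng.normal(size=10_000), max_lag=5).round(3))
```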

1.2 Random-Walk Models

The efficient market hypothesis led to a random-walk modelization of price time series. The first attempt to formalize the dependence between the price variation Δx and the time Δt was made by Bachelier in his doctoral thesis [4]. He proposed a Gaussian form for the distribution of the price change Δx over the time Δt:

    Δx ∼ N(0, σ²Δt),    Δx = pt+Δt − pt


Figure 1.1. Autocorrelation function (1.5) of BNP Paribas (BNPP.PA) logarithmic returns (1.3), over periods τ = 1 and 5 minutes, as a function of the lag t − t′. Source: Chakraborti et al., Econophysics: Empirical Facts and Agent-Based Models [13].


Figure 1.2. Autocorrelation function (1.5) of the return time series of a stock of the New York Stock Exchange (NYSE) from 1966 to 1998 (main plot) and of a stock of the London Stock Exchange (LSE) during one trading day (inset). The lag-time unit of the inset is the event time, or tick, i.e. the number of transactions (more details on the meaning of this choice can be found in section (4.1)). Note that the exact definition of the returns is not relevant here because the graphs refer to high-frequency data. Source: Cristelli M., Pietronero L. and Zaccaria A. (2011): Critical Overview of Agent Based Models for Economics [12].

The expected value of a common stock's price change is always zero,

    E[Δx] = 0

thus reflecting the martingale property (1.4), but Bachelier's model assigns a finite probability to negative values of the stock price, increasing with time, since ⟨Δx²⟩ ∝ Δt.


    Citing Samuelson (1973) [5]:

    "Seminal as the Bachelier model is, it leads to ridiculous results.[. . . ] An ordinary random walk of price, even if it is unbiased, will resultin price becoming negative with a probability that goes to 1/2 as t .This contradicts the limited liability feature of modern stocks and bonds.

    The General Motors stock I buy for 100 $ today can at most drop invalue to zero, at which point I tear up my certificate and never look back.

    [. . . ] The absolute-Brownian motion or absolute random-walk modelmust be abandoned as absurd."

The random-walk paradigm was actually introduced into the economic community by Samuelson's work as the geometric Brownian motion model, providing a geometric random-walk dynamics of the price, log-normally distributed, and a normal distribution of returns:

    rτ(t) ∼ N(μ, σ)

    rτ(t) = log(pt+τ) − log(pt)    (1.6)

In order to provide evidence that the price behavior is substantially not predictable, corroborating the random-walk hypothesis, we present in figure (4.5) a comparison between the performance of a real stock and the simulation of a suitable random walk.


Figure 1.3. On the left: price time series of the Vodafone (VOD) stock in the 110th trading day of the year 2002. On the right: comparison with the consistent random walk pt+1 = pt + N(μ, σ), where μ = 1.2 · 10⁻⁵ is the mean linear return (1.1) of Vodafone in the case considered and σ = 0.02 is the corresponding dispersion. The detailed definition of the consistent random walk and its meaning can be found in section (4.3).
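A consistent random walk of this kind is straightforward to simulate; the sketch below is only an illustration (the function name and starting price are arbitrary, while mu and sigma are the values quoted in the caption of figure (1.3)):

```python
import numpy as np

def consistent_random_walk(p0, mu, sigma, n_steps, seed=0):
    """Additive random walk p_{t+1} = p_t + N(mu, sigma)."""
    rng = np.random.default_rng(seed)
    increments = rng.normal(mu, sigma, size=n_steps)
    return p0 + np.concatenate(([0.0], np.cumsum(increments)))

# Illustrative values taken from the caption of figure (1.3)
path = consistent_random_walk(p0=100.0, mu=1.2e-5, sigma=0.02, n_steps=500)
print(path[:5])
```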


1.3 Stylized Facts

The geometric Brownian motion model circumvents the difficulties of the absolute random-walk model but still has several drawbacks, summarized in empirical evidences known as Stylized Facts (SF) [2, Chap. 5] [12]:

Fat-tailed empirical distribution of returns: very large price fluctuations are more likely than under a Gaussian distribution (figure (1.4)).


Figure 1.4. Empirical probability density function of BNP Paribas (BNPP.PA) unnormalized logarithmic returns (1.3) over a period of time τ = 5 minutes. The graph is computed by sampling a set of tick-by-tick data from 9:05 am till 5:20 pm between January 1st, 2007 and May 30th, 2008, i.e. 356 days of trading. Continuous and dashed lines are respectively Gaussian and Student-t fits. Source: Chakraborti et al., Econophysics: Empirical Facts and Agent-Based Models [13].
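To make the fat-tail statement concrete, one can compare the weight of the tails of a Gaussian with that of a Student-t distribution (one of the fat-tailed forms used as a fit in figure (1.4)); the sample size, the 3 degrees of freedom and the 4-sigma threshold below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
gauss = rng.standard_normal(n)
student = rng.standard_t(df=3, size=n)
student /= student.std()                      # rescale to unit variance, like the Gaussian sample

# Probability of a fluctuation larger than 4 standard deviations
for name, x in [("Gaussian", gauss), ("Student-t (3 dof)", student)]:
    print(f"{name:>18}: P(|r| > 4 sigma) = {np.mean(np.abs(x) > 4):.1e}")
```

The fat-tailed sample assigns orders of magnitude more probability to such extreme fluctuations than the Gaussian one.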

Absence of simple arbitrage: the sign of the next price variation is unpredictable on average, namely ⟨rτ(t) rτ(t + T)⟩ is substantially zero. In figure (1.5) the autocorrelation function of the returns of the DAX index² on the scale τ = 15 minutes is reported. It is noteworthy that up to a lag-time of 53 the correlation is positive, whereas until 9.4 there is anti-correlation; nevertheless it is really weak.

² The DAX index is a blue-chip stock market index consisting of the 30 major German companies trading on the Frankfurt Stock Exchange. According to the New York Stock Exchange (NYSE), a blue chip is stock in a corporation with a national reputation for quality, reliability and the ability to operate profitably in good times and bad [60].



Figure 1.5. Autocorrelation of the return time series of the DAX index. Returns are evaluated on a period of τ = 15 minutes and the autocorrelation function (1.5) is plotted against the lag time T = t − t′. Error bars correspond to a 3σ confidence interval and the continuous line is the fit. Source: S. Dresdel (2001), "Modellierung von Aktienmärkten durch stochastische Prozesse", Diplomarbeit, Universität Bayreuth [9].

Volatility Clustering: intermittent behavior of the price fluctuations, regardless of their sign.


Figure 1.6. Return time series of a stock of the New York Stock Exchange (NYSE) from 1966 to 1998. At the top linear returns (1.1) rτ(t) = pt+τ − pt are reported, whereas at the bottom logarithmic returns (1.3) are plotted. Mainly from the lower plot it is evident that price changes tend to be clustered, i.e. to move coherently regardless of the sign. Source: Cristelli M., Pietronero L. and Zaccaria A. (2011): Critical Overview of Agent Based Models for Economics [12].


Even if the efficiency condition ⟨rτ(t) rτ(t + T)⟩ ≈ 0 is substantially satisfied, non-linear correlations of absolute, ⟨|rτ(t)| |rτ(t + T)|⟩, and squared, ⟨rτ²(t) rτ²(t + T)⟩, returns are still present due to volatility clustering [13]:

    Mandelbrot (1963) [6]: "Large changes tend to be followed bylarge changes of either sign and small changes tend to be followedby small changes" (figure (1.6)).


Figure 1.7. Autocorrelation function (1.5) of BNP Paribas (BNPP.PA) absolute logarithmic returns (1.3), over periods τ = 1 and 5 minutes, as a function of the lag-time. Source: Chakraborti et al., Econophysics: Empirical Facts and Agent-Based Models [13].

Absolute or squared returns exhibit a long-range, slowly decaying autocorrelation, compatible with a stochastic process whose increments are uncorrelated but not independent (figures (1.7) and (1.8)).

Figure 1.8. Autocorrelation function of τ = 1 minute returns, squared returns and absolute returns of the Vodafone (VOD) stock in the 110th trading day of the year 2002 (the price time series is shown on the left in figure (4.5)). The lag-time T is in 1-minute units.
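The distinction between uncorrelated and independent increments can be illustrated with a toy volatility-clustered series (a slowly varying log-volatility multiplying i.i.d. noise; the parameters are arbitrary and this is not the thesis data): the autocorrelation of the returns is negligible at every lag, while that of their absolute values decays slowly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Toy stochastic-volatility series: persistent log-volatility times i.i.d. Gaussian noise
log_vol = np.zeros(n)
for t in range(1, n):
    log_vol[t] = 0.98 * log_vol[t - 1] + 0.1 * rng.standard_normal()
r = np.exp(log_vol) * rng.standard_normal(n)

def acf(x, lag):
    x = x - x.mean()
    return np.mean(x[:-lag] * x[lag:]) / x.var()

for lag in (1, 10, 50):
    print(f"lag {lag:3d}:  ACF(r) = {acf(r, lag):+.3f}   ACF(|r|) = {acf(np.abs(r), lag):+.3f}")
```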


1.4 Technical trading

Technical analysis is a method of forecasting price movements using past prices. A leading technical trader defines his field:

    Pring (2002) [14]: "The technical approach to investment is essentiallya reflection of the idea that prices move in trends that are determinedby the changing attitudes of investors toward a variety of real andpsychological forces" .

Here real and psychological forces should be considered as exogenous and endogenous information, namely economic and political news on one side, and past price series, interpreted as technical patterns, on the other.

1.4.1 Main assumptions and Skepticism

Among traders the knowledge of random-walk theory is rather widespread, so the motivations underpinning technical analysis seem even more peculiar:

1. the market discounts everything: technical traders believe that the price is itself the only information set θt needed to make decisions. The price at the present time reflects all the possible causes of its future movements.

2. price moves in trends: the trend is the "behaviour" of the price time series; it can be bullish, bearish or sideways, and it is more likely to persist than to end.

3. history repeats itself: if particular figures or patterns have anticipated a given bullish or bearish behavior in the past, technical analysis argues they will do so again: investors react in the same way to similar conditions.

These assumptions sound in open contrast with the EMH, as they rely directly on the price time series as it appears, instead of on the unknown stochastic process generating the price movements.

Despite the widespread use of technical instruments among traders, the academic community tends to be skeptical about technical analysis, mainly for these reasons:

    Acceptance of EMH

    Linguistic and methodological barriers

Assuming the EMH and the full rationality of agents, no speculative opportunities should be present in the market. Operators should base their investment decisions only on the market information set θt, i.e. only "fundamentalists" would be present. However, especially in the case of bubbles and crashes [58], the influence of fundamental data is not so strong as to rule out the possibility of other, possibly endogenous, influences, i.e. the presence of investors who rely on past price histories in order to make their investment decisions: "technical traders" (or simply "chartists") [2, Chap. 8].


On the other side, the linguistic barrier can be illustrated by contrasting this technical-jargon statement [15]:

    The presence of clearly identified Supports and Resistance levels,coupled with a one-third retracement parameter when prices lie betweenthem, suggest the presence of strong buying and selling opportunities inthe near term.

    with this one:

    The magnitude and decay patterns of the first twelve autocorrelationsand the statistical significance of the Box-Pierce Q-Statistic suggest thepresence of a high frequency predictable component in stock returns.

    The last barrier I quote is of methodological nature: technical analysis is primarilyvisual, employing the tools of geometry and pattern recognition, whereas quantitativefinance is mainly algebraic and numerical [16].

1.4.2 Feedback Effect and Common Figures

One of the main reasons why technical analysis should work is that a huge number of investors rely on it:

Gehrig and Menkhoff (2003) [17]: "[...] technical analysis dominates foreign exchange and most foreign exchange traders seem to be chartist now".

This mass behavior results in a feedback effect of investors' own decisions: if a common figure is known in the literature to anticipate a specific trend, then even if the future price movement would have been in the opposite direction, the huge amount of capital moved affects the price history, making it fulfill investors' expectations. In this perspective, I briefly review two of the main technical figures.³

Moving Average: PM(t) is the average of the past n price values, defined as

    PM(t) = Σ_{k=0}^{n} wk P(t − k)

where the wk are the weights. The moving average is an example of a technical indicator, as it signals an inversion of the trend when it crosses the price graph. In technical words: PM(t) acts as a dynamical support during a bullish trend and as a resistance during a bearish one (figure (1.9)); a minimal numerical sketch is given at the end of this section.

³ There are more technical figures to be mentioned, such as Head-and-Shoulders, Inverse Head-and-Shoulders, Broadening Tops and Bottoms, and so on [18]. Their definitions are slightly more involved than the two mentioned in the text and will not be discussed in this thesis.



Figure 1.9. Moving average of the price time series of the Unilever (ULVR) stock in the 66th trading day of the year 2002. The average is computed with constant weights wk over the last 5 minutes of trading. The two squared points represent (author's opinion) respectively selling and buying signals.

Supports and Resistances: these levels represent local minima and maxima, where the price is more likely to bounce than to cross them. In investors' psychology, at a Support level an investor relying on this indicator believes that demand is strong enough to overcome supply, preventing a further decrease of the price; vice versa for a Resistance level (figure (1.10)).


Figure 1.10. Example of support and resistance from the Lilly Eli & Co. (LLY) stock, traded at the New York Stock Exchange (NYSE) on February 2nd, 2000. Source: http://stockcharts.com [61].
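The sketch announced in the Moving Average paragraph is given below: a minimal, illustrative implementation of PM(t) with constant weights (as in figure (1.9)) and of the price/average crossing read as a signal; the price array and the sign-based rule are arbitrary choices, not the thesis code:

```python
import numpy as np

def moving_average(prices, n, weights=None):
    """P_M(t) = sum_{k=0..n} w_k P(t-k); constant weights by default."""
    w = np.ones(n + 1) / (n + 1) if weights is None else np.asarray(weights, dtype=float)
    pm = np.full(len(prices), np.nan)
    for t in range(n, len(prices)):
        pm[t] = np.dot(w, prices[t - n:t + 1][::-1])   # k = 0 is the most recent price
    return pm

prices = np.array([10.0, 10.2, 10.1, 10.4, 10.3, 10.6, 10.5, 10.2, 10.0, 9.9])
pm = moving_average(prices, n=3)
# A change of sign of (price - P_M) marks a crossing, read as a possible trend inversion
print(np.sign(prices - pm))
```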



    Chapter 2

    Pattern Recognition

In this chapter the main features of the theory of Pattern Recognition will be reviewed, in order to provide a wide context for the algorithm that will be adopted to cope with the clustering of financial time series.

2.1 From the Iris Dataset to Economic Taxonomy

The term Pattern Recognition (PR) refers to the task of assigning some object to the correct class based on measurements about the object [19]. Objects to be recognized, measurements and possible classes can be almost anything, so there are very different PR tasks. A spam (junk-mail) filter, a recycling machine, a speech recognizer or an Optical Character Recognition (OCR) system are all Pattern Recognition systems, and they play a central role in everyday life.

The beginning of the modern discipline probably dates back to the paper of R.A. Fisher, "The use of multiple measurements in taxonomic problems" [20, 1936]. He considered as an example the Iris dataset, collected by E. Anderson in 1935, which contains sepal width, sepal length, petal width and petal length measurements of 150 irises belonging to three different sub-species (figure (2.1)).


Figure 2.1. Correct classification of the Iris dataset: 150 observations of sepal and petal width and length. Irises belong to three different subspecies: Setosa, Versicolor and Virginica. Classification is performed through a 5-Nearest Neighbor algorithm with 30 training observations (see section (2.2)), assigning each new Iris observation to the subspecies most common among its 5 nearest neighbors.
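For illustration, a classification of the kind shown in figure (2.1) can be reproduced with scikit-learn's bundled Iris data and a 5-nearest-neighbor classifier; the 30-observation training set follows the caption, but the exact split used there is not known, so the one below is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=30, stratify=iris.target, random_state=0)

# Assign each new observation to the subspecies most common among its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", round(knn.score(X_test, y_test), 3))
```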

In order to provide a basis for comparison with current developments and applications of the theory, we can consider the Economic Taxonomy, which is a system of classification of economic activity, including products, companies and industries [25] [64].

In 1999 Mantegna [21] considered the portfolio of stocks used to compute the Standard and Poor's 500 (S&P 500) index and the 30 stocks considered in the Dow Jones Industrial Average (DJIA) over the time period from July 1989 to October 1995, and introduced a method for finding a hierarchical arrangement of stocks traded in a financial market. He compared the synchronous time evolution of pairs (i, j) of daily stock prices by the correlation coefficient ρij of the logarithmic returns, defined in (1.3), evaluated over a period of τ = 1 trading day

    r_{τ=1 day}^{stock=i}(day = t) ≡ ri

and then, defining an appropriate metric¹ dij = dij(ρij), he quantified the degree of similarity between stocks as:

    ρij = ( ⟨ri rj⟩ − ⟨ri⟩⟨rj⟩ ) / √( (⟨ri²⟩ − ⟨ri⟩²) (⟨rj²⟩ − ⟨rj⟩²) )    (2.1)

    dij = √( 2 (1 − ρij) )    (2.2)
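A minimal sketch of equations (2.1)-(2.2) for a panel of return series (the synthetic returns below are only a stand-in for real daily log-returns):

```python
import numpy as np

def correlation_distance(returns):
    """Mantegna's distance d_ij = sqrt(2 (1 - rho_ij)), eqs. (2.1)-(2.2).
    `returns` has shape (n_stocks, n_days)."""
    rho = np.corrcoef(returns)            # matrix of correlation coefficients rho_ij
    return np.sqrt(2.0 * (1.0 - rho))

rng = np.random.default_rng(3)
r = rng.standard_normal((4, 250))         # 4 synthetic stocks, about one year of daily returns
print(correlation_distance(r).round(3))
```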

Having defined the distance matrix² dij, he used it to determine the minimal spanning tree (MST) [22] of the stocks in the portfolio. To provide a simple description of the constructive procedure defining the MST, I will refer to figure (2.2), from Mantegna's work, which describes the MST of the portfolio of stocks considered in computing the DJIA index [62]. Stocks are labelled by their official ticker symbol, whose coding can be found at www.forbes.com; recent updates of the labels can be found at http://en.wikipedia.org/wiki/List_of_S%26P_500_companies. The MST of a set of n elements is a graph with n − 1 links [23]. Here the nodes are represented by the stocks and the links between them are weighted by the dij's. In building the MST, one first has to fill a list, sorted in ascending order, with the non-diagonal elements of the distance matrix dij:

    {d1st, d2nd, d3rd, ..., d(n(n−1)/2)th}    where    d1st < d2nd < d3rd < ... < d(n(n−1)/2)th

then the two nearest stocks, here Chevron (CHV) and Texaco (TX), have to be added as a start:

    d1st = dCHV−TX = 0.949    (2.3)

The growth of the tree proceeds following the aforementioned list: d2nd = dTX−XON = 0.962, so the Exxon company (XON) is added to the MST and linked to the TX stock.

¹ In section (2.5) I will stress further the concept of metric/distance and its role in the theory.
² By definition dij is a symmetric matrix with d11 = ... = dnn = 0, so that only n(n−1)/2 elements are relevant and need to be computed.



Figure 2.2. Minimal spanning tree connecting the 30 stocks used to compute the Dow Jones Industrial Average (DJIA) in the period July 1989 to October 1995. The 30 stocks are labeled by their ticker symbols. See text for details. The red oval (author's graphical modification) encloses the Chevron and Texaco stocks, the nearest stocks. Texaco was acquired by Chevron on October 9, 2001. Original source: R.N. Mantegna (1999), Hierarchical structure in financial markets.

At the next step d3rd = dKO−PG = 1.040, and these two stocks (Coca-Cola and Procter & Gamble) are both added to the tree because neither of them has been counted already. The MST keeps growing in this way, discarding dkth = duv if and only if both the u and v stocks are already nodes of the tree. Looking back at (2.3) and at the red oval in figure (2.2), the following news sounds less surprising:

NEW YORK, October 3, 2001 /PRNewswire/ Standard & Poor's [...] will make the following changes in the S&P 500, [...] after the close of trading on Tuesday, October 9, 2001 [...] Equity Office Properties (NYSE: EOP) will replace Texaco Inc. (NYSE: TX) in the S&P 500 Index. Texaco is being acquired by S&P 500 Index component Chevron (NYSE: CHV) in a transaction expected to close on that date. [63]

The logic behind the minimal spanning tree is that of an arrangement of elements which selects the most relevant connections of each element of the set, based on the distance matrix dij. With this MST Mantegna provided a taxonomy of the well-defined group of stocks considered.
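The constructive rule sketched above can be realized with Kruskal's algorithm: candidate links are examined in ascending order of dij and a link is kept only if its two endpoints are not yet connected through the tree (a slightly stronger condition than the one stated in the text, which is what guarantees that no cycles are formed). The following sketch, on synthetic data, is illustrative and not the thesis code:

```python
import numpy as np

def minimal_spanning_tree(dist):
    """Kruskal's algorithm on a symmetric distance matrix; returns a list of (i, j, d) links."""
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):                          # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # i and j not yet connected: keep the link
            parent[ri] = rj
            tree.append((i, j, round(float(d), 3)))
        if len(tree) == n - 1:
            break
    return tree

rng = np.random.default_rng(4)
rho = np.corrcoef(rng.standard_normal((5, 250)))
print(minimal_spanning_tree(np.sqrt(2 * (1 - rho))))
```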


The link between Mantegna's work and Fisher's pioneering work is based on the study of the meaningfulness of this taxonomy. Mantegna compared his results with the reference grouping of stocks provided by Forbes [65], which assigns each stock to one of 12 business sectors and 51 industries (figure (2.3)).



Figure 2.3. Minimal spanning tree of the returns of 116 S&P 500 stocks. Data extend from the beginning of 1982 to the end of 2000. Links are weighted according to Mantegna's correlation-based distance (2.2). Business sectors are indicated according to Forbes [65]. Source: Onnela et al. (2003), Dynamics of market correlations: Taxonomy and portfolio analysis [26].

In assessing the meaningfulness of the taxonomy provided by Mantegna's method, the classification of Forbes represents a reference in the same way in which the three subspecies, Setosa, Virginica and Versicolor, were the benchmark for the classification of the irises. Of course, while there may be common agreement on how many and which the iris sub-species are, in coping with economic taxonomy one could alternatively have used different kinds of classifications, e.g. the Global Industry Classification Standard (GICS) [24]: when dealing with the classification of real data there is often neither a unique nor a proper way to classify. The classification system makes the difference.


2.2 Supervised and Unsupervised Learning and Classification

Usually most of the data describing objects are useless for classification, so the classification task is preceded by a feature extraction that selects only those features that best represent the data for classification. The classifier takes as input the feature vector x of the object to be classified and assigns the feature vector (i.e. the object) to the most appropriate class.

    xiris = (sepal length = 5.9, sepal width = 4.2, petal width = 3.0, petal length = 1.5)

    xiris → Versicolor subspecies

In this thesis we shall deal with statistical pattern recognition, in which the classes and the objects within the classes are modeled statistically. Formally, feature vectors such as xiris belong to a feature space F, and classes are denoted by {ω1, ω2, ..., ωc}. The classifier can be thought of as a mapping from the feature space to the set of possible classes:

    F → {ω1, ω2, ..., ωc},    x ↦ ωk

The classifier is usually a machine or a computer; for this reason the theory is also known as machine learning. Depending on the task and on the data available, we can broadly distinguish two kinds of learning: supervised and unsupervised (or clustering).

In supervised classification, examples of correct classification are presented to the classifier in the form of a training set

    Di = {x1, ..., xni} of class ωi

where ni is the number of training samples from the class ωi. Based on these prototypes, the goal of the classifier is to deduce the class of a never-seen object.

In clustering, there is neither an explicit teacher nor a training set. The classification of the feature vectors must be based only on the similarity between them; whether any two feature vectors are similar depends on the application.


2.3 Bayesian learning

2.3.1 Bayes Rule

Let us begin with a general description of Bayesian reasoning [29]: consider the universe of events Ω, the measured event E and a complete class of hypotheses Hi that "explain" E. By definition, these hypotheses must be exhaustive and mutually exclusive:

    Hi ∩ Hj = ∅  (i ≠ j),        ∪i=1..n Hi = Ω

    The conditional probability of the hypothesis Hi given the measurement of E is:

    P(Hi|E) = P(E|Hi) P(Hi) / P(E)

and, by the complete-class property satisfied by the Hi's, the probability of the event E can be decomposed over the entire set {Hi}, i = 1, ..., n, of classes:

    P(Hi|E) = P(E|Hi) P(Hi) / Σj P(E|Hj) P(Hj)    (2.4)

This Bayes Rule, introduced by Thomas Bayes (1702-1761) to provide a solution to a problem of inverse probability,³ was presented in "An Essay towards solving a Problem in the Doctrine of Chances" and was read at the Royal Society in 1763, after Bayes's death [27]. His definition of probability was stated as follows:

    T.Bayes - "The probability of any event is the ratio between the valueat which an expectation depending on the happening of the event oughtto be computed, and the value of the thing expected upon its happening"[28].

Let us analyze the terms in Bayes Rule (2.4) [29]:

P(Hi) is the initial or prior probability, namely the probability of the hypothesis Hi conditioned on all preliminary hypotheses, with the exclusion of the occurrence or non-occurrence of E.

P(E|Hi) is called the likelihood and is a measure of how likely E is in the light of Hi. In terms of cause and effect, it measures how easily the cause Hi may produce the effect E.

P(Hi|E) is simply the final or posterior probability, namely the probability of Hi re-evaluated in the light of the fact that E occurred.

    3The "inverse" probability is the approach which endeavours to reason from observed events tothe probabilities of the hypotheses which may explain them, as distinct from "direct" probability,which reasons deductively from given probabilities to the probabilities of contingent events [66].


Bayesian probability theory can be used to represent degrees of belief in uncertain propositions. Cox (1946) [30] and Jaynes (2003) [31] show that if one has to represent numerically the strength of one's beliefs about the world, then the only reasonable and coherent way of manipulating these beliefs is to have them satisfy the rules of probability, such as the Kolmogorov axioms [32], together with Bayes Rule. In order to motivate the use of the rules of probability to encode degrees of belief, there is also a game-theoretic result in the form of the:

Dutch Book Theorem [33]: if you are willing to accept bets with odds based on your degrees of confidence, then unless your beliefs are coherent, in the sense that they satisfy the rules of probability, there exists a set of simultaneous bets (called a "Dutch Book") which you will accept and which is guaranteed to lose you money, no matter what the outcome. The only way to ensure that Dutch Books do not exist against you is to have degrees of belief that satisfy Bayes rule and the other rules of probability theory.

2.3.2 Bayesian Model Selection

From Bayes' rule a framework for machine learning may be derived. Here we adopt the term model as a synonym of class, because in this way the statistical interpretation of the classification problem can be stressed clearly: to find the statistical model m that best describes the data.
Consider a set of N data points, D = {x_1, x_2, ..., x_N}, belonging to some model m ∈ M, the universe of models. The machine (i.e. the classifier) starts with some prior beliefs over models, such that:

$$ \sum_{m \in \mathcal{M}} P(m) = 1 $$

In our setting a model m is represented by a probability distribution over data points, i.e. P(D|m). The classification/model selection goal is achieved by computing the posterior distribution over "all" m ∈ M:

$$ P(m|D) = \frac{P(D|m)\,P(m)}{P(D)} \qquad (2.5) $$

In almost all cases the space M is a huge, infinite-dimensional space, so some kind of approximation is required; this point will be considered later when discussing Monte Carlo methods. However, the principle is simple: the model m which results in the highest posterior P(m|D) over our dataset D is selected as the best model for our data. Often models are defined by a parametric distribution, so that P(D|m_k) can actually be written in terms of P(D|θ_k, m_k), θ_k being the set of parameters defining the model m_k. In the case of Gaussian models, for example:

$$ P(D|\theta_k, m_k) = \mathcal{N}(D|\mu_k, \Sigma_k) $$

Given the prior preferences P(m_k) over the models, the only term necessary to compare models is the marginal likelihood term:

$$ P(D|m_k) = \int P(D|\theta_k, m_k)\,P(\theta_k|m_k)\,d\theta_k \qquad (2.6) $$


    where:

P(θ_k|m_k) is the prior over the parameters of m_k, namely the probability that its parameters take the values θ_k.

P(D|θ_k, m_k) is the likelihood term, depending only on a single setting of the parameters of the model m_k.

the integral extends over all possible values of the parameters of m_k.

The interpretation of the marginal likelihood, sometimes called the evidence for the model m_k, is very interesting: it is the probability of generating the data set D from parameters that are randomly sampled from the prior P(θ_k|m_k). Usually, in order to evaluate the marginal likelihood term (2.6), a point estimate is chosen to select only one setting θ_k of the parameters of the model m_k. There are two natural choices:

$$ \theta_k^{MAP} = \arg\max_{\theta_k} \left\{ P(D|\theta_k, m_k)\,P(\theta_k|m_k) \right\} $$

$$ \theta_k^{ML} = \arg\max_{\theta_k} \left\{ P(D|\theta_k, m_k) \right\} $$

The θ_k^{MAP} is known as the Maximum A Posteriori estimate for the marginal likelihood term, while θ_k^{ML} is the frequentist Maximum Likelihood estimate. There is a deep difference between these two approaches: the ML estimation often results in overfitting, namely the preference for models more complex than necessary, due to the fact that a model m_k with more parameters will have a higher maximum of the likelihood P(D|θ_k^{ML}, m_k). Instead, the complete marginal likelihood term P(D|m_k) (2.6) may decrease as the model becomes more complicated: in a more complicated model, i.e. one with more parameters, sampling random parameter values θ_k may generate a wider range of possible N-dimensional datasets D_N, but since the probability over data sets has to integrate to 1:

$$ \int_{\{D_N\}} P(D_N|m_k)\,dD_N = 1 $$

spreading the density to allow for more complicated data sets necessarily results in some simpler data sets having lower density under the model m_k [34]. The decrease of the marginal likelihood as additional parameters are added has been called the automatic Occam's Razor [35] (figure (2.4)).
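A rough numerical sketch of this automatic Occam's Razor, under assumptions of my own choosing (a one-parameter Gaussian likelihood with unit variance, and two hypothetical models differing only in the width of the prior over the mean), might look as follows; the evidence (2.6) is computed by brute-force grid integration.

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.normal(0.0, 1.0, size=20)          # hypothetical dataset of N = 20 points

    def log_lik(mu):
        # log P(D | mu, m): Gaussian likelihood with unit variance
        return np.sum(-0.5 * (D - mu) ** 2 - 0.5 * np.log(2 * np.pi))

    def log_evidence(prior_std):
        # marginal likelihood (2.6) by grid integration over the single parameter mu
        mu_grid = np.linspace(-50, 50, 20001)
        log_prior = -0.5 * (mu_grid / prior_std) ** 2 - 0.5 * np.log(2 * np.pi * prior_std ** 2)
        log_integrand = np.array([log_lik(m) for m in mu_grid]) + log_prior
        shift = log_integrand.max()            # for numerical stability
        return shift + np.log(np.trapz(np.exp(log_integrand - shift), mu_grid))

    # "simple" model: narrow prior on mu; "complex" model: very broad prior on mu
    print(log_evidence(prior_std=1.0))     # higher evidence
    print(log_evidence(prior_std=100.0))   # lower evidence: the automatic Occam's Razor at work
    # the maximized likelihood is identical for both models, so it cannot distinguish them
    print(log_lik(D.mean()))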

    2.4 Clustering

Coping with the problem of finding patterns in a time series data set, I have dealt with a Clustering problem. In this section the main features of this theory will be outlined, together with the most relevant methods and techniques.



Figure 2.4. Pictorial one-dimensional representation of the marginal likelihood, or evidence, distribution over data sets D of a given size N (2.6). The normalization to 1 acts as a penalization, in the sense that very complex models, which can account for many datasets, only achieve modest evidence, whereas simple models can reach high evidences, but only for a limited set of data. Source: Zoubin Ghahramani, Unsupervised Learning.

2.4.1 Definition and Distinctions

Cluster analysis groups data objects based only on information found in the data that describe the objects and their relationships [36]. Namely, the goal of clustering is to identify structures in an unlabeled data set by objectively organizing the data so that objects belonging to the same cluster are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. However, it derives these labels only from the data.
Clustering methods have been classified into five major categories [37]:

    Partitioning methods

    Hierarchical methods

    Density-based methods

    Grid-based methods

    Model-based methods

Let's review the main features and distinctions among these methods: the first distinction is whether the set of clusters is nested or unnested, in other words hierarchical or partitional.
A partitional clustering is a division of objects into subsets such that each data


object is in at least one subset. The partition is crisp, or hard, if each object belongs to exactly one cluster, or fuzzy if an object is allowed to be in more than one cluster to a different degree.
A hierarchical clustering method, instead, works by grouping data objects into a tree of clusters. A hierarchical clustering method can be agglomerative, if it starts by placing each object in its own cluster and then merges clusters into larger and larger clusters until a termination condition is satisfied, or divisive, if it acts by splitting clusters. A pure hierarchical clustering suffers from its inability to perform adjustments once a merge or split decision has been executed.
The general idea of density-based methods is to continue growing a cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold.
Grid-based methods quantize the object space into a finite number of cells that form a grid structure on which all of the clustering operations are performed. The procedure is iterative and acts by varying the size of the cells, corresponding to some kind of resolution of the objects.
Finally, model-based methods assume a model for each cluster and attempt to best fit the data to the assumed model.
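As a small illustration of the agglomerative hierarchical scheme just described (a sketch of my own, using SciPy's generic routines on hypothetical data, not any algorithm developed in this thesis):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    # hypothetical feature vectors: three loose groups of points in the plane
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 2))
                   for c in ([0, 0], [3, 0], [0, 3])])

    # agglomerative clustering: start with one cluster per object,
    # then repeatedly merge the two closest clusters (average linkage)
    Z = linkage(X, method="average", metric="euclidean")

    # cutting the resulting tree into 3 clusters yields a crisp (hard) partition
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels)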

2.4.2 Time Series Clustering

Time series are dynamic data, namely they can be thought of as objects whose features comprise values changing with time.
In this thesis we will deal with financial stock-price time series, and in this respect clustering can be an econometric tool for studying dependencies between variables [38]. It finds applications, for example, in:

identifying areas or sectors for policy-making purposes,

identifying structural similarities in economic processes for economic forecasting,

identifying stable dependencies for risk management and investment management.

Coming back to the clustering methods, time series clustering algorithms usually try either to modify the existing algorithms for clustering static data in such a way that time series can be handled, or to convert time series data into the form of static data. We can broadly distinguish three main approaches:

Raw-data-based approaches: these include methods working either in the time or in the frequency domain; the major modification with respect to static-data algorithms lies in replacing the distance/similarity measure for static data with an appropriate one for time series.

Feature-based approaches: this choice is usually adopted when dealing with noisy time series or data sampled at fast sampling rates.

Model-based approaches: each time series is considered to be generated by some kind of model or by a mixture of underlying statistical distributions.


The similarity here relies on the structure of the models or on the remaining residuals after fitting the model.

2.5 Distance and Similarity

Given two time series, the correlation coefficient ρ_ij can be interpreted as a measure of the existence of some causal dependency between the variables, or of their dependence on some common exogenous factor, so it can be exploited for clustering purposes. But correlation can be invoked only if the variables follow similar and concurrent time patterns.
In order to consider not only common causation of random variables and co-movements of time series, but also similarities in their structure (similar patterns may evolve at different speeds and at different time scales), the broader concept of similarity is needed [36].
The function used in cluster analysis to measure the similarity between the two data being compared can take various forms, for example:

Euclidean distance: let x^i and x^j be two T-dimensional time series; d_E is defined by

$$ d_E = \sqrt{\sum_{k=1}^{T} \left(x_k^i - x_k^j\right)^2} $$

Pearson's correlation coefficient and related distances:

$$ \rho_{ij} = \frac{\sum_{k=1}^{T} (x_k^i - \mu_i)(x_k^j - \mu_j)}{S_i S_j} $$

where

$$ \mu_i = \frac{1}{T}\sum_{k=1}^{T} x_k^i $$

is the mean and

$$ S_i = \sqrt{\sum_{k=1}^{T} (x_k^i - \mu_i)^2} $$

is the scatter.

Two correlation-based distances can be derived [39]:

$$ d_\rho^1 = \left(\frac{1-\rho}{1+\rho}\right)^{\beta}\,, \qquad \beta > 0 $$

$$ d_\rho^2 = \sqrt{2(1-\rho)} $$

The last one is the distance adopted by Mantegna [21] in the work cited in section (2.1).

These distances are computed directly in the space of time series, possibly after preprocessing. Often, projecting time series into a space of distance-preserving transforms avoids part of the computational cost. Many projection techniques have been proposed, such as the Fast Fourier Transform (FFT) [40], which is distance-preserving due to Parseval's theorem, or piece-wise linearization [41], which provides a linear smoothing of the time series.
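A minimal sketch (my own, on hypothetical random-walk series) of the correlation-based distance $d_\rho^2 = \sqrt{2(1-\rho)}$ adopted by Mantegna:

    import numpy as np

    def correlation_distance_matrix(returns):
        # d2_rho = sqrt(2 * (1 - rho_ij)) for every pair of series (rows = time, cols = series)
        rho = np.corrcoef(returns, rowvar=False)   # Pearson correlation matrix
        return np.sqrt(2.0 * (1.0 - rho))

    # hypothetical example: 4 random-walk "prices", distances computed on their increments
    rng = np.random.default_rng(2)
    prices = np.cumsum(rng.normal(size=(500, 4)), axis=0)
    increments = np.diff(prices, axis=0)
    D = correlation_distance_matrix(increments)
    print(np.round(D, 3))   # symmetric, zero diagonal; values near sqrt(2) for uncorrelated series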


2.5.1 Information Theoretic Interpretation

When time series are modeled within a probabilistic framework, their values are thought to be sampled from an underlying probability distribution. Another approach to quantifying their similarity is based on the projection of the time series into the space of their probability distributions.
This is a more abstract notion of similarity, as it allows one to compare series of different lengths and with shapes that, though similar in their distributions, cannot be directly matched.
Let x_1(t) and x_2(t) be two time series and assume at least the weak stationarity condition, namely: the first two moments of their distributions must not depend on the temporal index t, and their auto-covariance R(t, t + k) depends only on the time lag k:

$$ E[x(t)] = \mu_x $$

$$ \mathrm{Var}[x(t)] = E[(x(t) - E[x(t)])^2] = E[x^2(t)] - E[x(t)]^2 = \sigma_x^2 $$

$$ R(t, t+k) = E[x(t)\,x(t+k)] = R(k) $$

    empirically estimated by [42]:

$$ E[x(t)] = \frac{1}{T}\sum_{t=1}^{T} x(t) $$

$$ \mathrm{Var}[x(t)] = E[x^2(t)] - E[x(t)]^2 = \frac{1}{T}\sum_{t=1}^{T} x(t)^2 - \frac{1}{T^2}\left(\sum_{t=1}^{T} x(t)\right)^2 $$

$$ R(t, t+k) = E(x(t)\,x(t+k)) = \frac{1}{T-k}\sum_{t=1}^{T-k} \left(x(t) - E[x(t)]\right)\left(x(t+k) - E[x(t+k)]\right) $$
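These estimators translate directly into code; the sketch below (illustrative only, on a hypothetical white-noise series) computes them with NumPy.

    import numpy as np

    def empirical_moments(x, k):
        # sample mean, variance and lag-k autocovariance, as in the estimators above
        T = len(x)
        mean = x.sum() / T
        var = (x ** 2).sum() / T - mean ** 2
        autocov = np.sum((x[: T - k] - mean) * (x[k:] - mean)) / (T - k)
        return mean, var, autocov

    rng = np.random.default_rng(3)
    x = rng.normal(size=10_000)        # hypothetical weakly stationary series (white noise)
    print(empirical_moments(x, k=5))   # mean ~ 0, variance ~ 1, autocovariance ~ 0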

Let p(x) and q(x) be, respectively, the distributions of values from x_1(t) and x_2(t); a natural distance function is the Kullback-Leibler divergence, defined as:

$$ KL(p\,\|\,q) = E_p\!\left[\log_2\frac{p(x)}{q(x)}\right] = \sum_x p(x)\,\log_2\frac{p(x)}{q(x)} $$

    which has a symmetric version:

$$ d_{pq} = \frac{1}{2}\left[KL(p\,\|\,q) + KL(q\,\|\,p)\right] $$
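A short sketch of the KL divergence and of its symmetrized version, computed from normalized histograms of two hypothetical samples (the small smoothing constant eps is my own addition, used only to avoid division by zero in empty bins):

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        # KL(p||q) = sum_x p(x) log2(p(x)/q(x)) for discrete/histogram distributions
        p = p / p.sum()
        q = q / q.sum()
        return np.sum(p * np.log2((p + eps) / (q + eps)))

    def symmetric_kl_distance(p, q):
        # d_pq = 0.5 * [KL(p||q) + KL(q||p)]
        return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

    # hypothetical example: histograms of two Gaussian samples with different means
    rng = np.random.default_rng(4)
    bins = np.linspace(-5, 5, 41)
    p, _ = np.histogram(rng.normal(0.0, 1.0, 10_000), bins=bins)
    q, _ = np.histogram(rng.normal(0.5, 1.0, 10_000), bins=bins)
    print(symmetric_kl_distance(p.astype(float), q.astype(float)))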

The information-theoretical interpretation of the KL-divergence is very interesting. Let x(t) be a time series whose values are distributed according to an unknown distribution p(x).
In order to transmit x(t) to a receiver we should encode it in some way, with the intuitive criterion of encoding with fewer bits (quanta of information) those values which occur more frequently in x(t).
Shannon's source coding theorem quantifies the optimal number of bits to be used to encode a symbol which occurs with probability p(x) as −log_2(p(x)).
Using this number of bits per value, the expected coding cost is the entropy of the distribution p(x), and it is the minimum cost [43]:

$$ H(p) = -\sum_x p(x)\,\log_2 p(x) $$


In a machine learning perspective we could theorize a model for the values of x(t): let it be denoted as the q(x) distribution. In this case the expected coding cost will not be optimal:

$$ H(q) = -\sum_x p(x)\,\log_2 q(x) $$

with the coding inefficiency quantified precisely by the KL(p||q) divergence. This example allows us to appreciate the prominent role of machine learning in achieving efficient communication and data compression:

the better our model of the data, the more efficiently we can compress and communicate new data [44]

I stress that there is no "natural" distance function: each distance implements a specific concept of similarity and is very problem dependent.


    Chapter 3

    Monte Carlo Framework

Monte Carlo methods are a standard and often extremely efficient way of computing complicated integrals over high-dimensional or poorly-behaved functions [45]. The idea of Monte Carlo calculation, however, is a lot older than the computer. The name "Monte Carlo" is relatively recent: it was coined by Nicholas Metropolis in 1949, the method having previously gone under the older name of "statistical sampling". The history of the method dates back to 1946:

While convalescing from an illness in 1946, Stan Ulam was playing solitaire. It then occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully. After attempting exhaustive combinatorial calculations, he decided to go for the more practical approach of laying out several solitaires at random and then observing and counting the number of successful plays [46].

This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation. In 1949, the young physicist Nick Metropolis published the first paper on Monte Carlo simulation with Stan Ulam [47], and a few years later he proposed the famous Metropolis algorithm [48].

3.1 Motivations

Monte Carlo techniques are often applied to solve integration and optimization problems; here are some examples:

Bayesian inference and learning: intractable integration or summation problems occur typically in Bayesian statistics.

Normalization problems, i.e. computing the normalizing factor in Bayes' theorem

$$ P(m|D) = \frac{P(D|m)\,P(m)}{P(D)} $$

where

$$ P(D) = \sum_{m \in \mathcal{M}} P(D|m)\,P(m) \qquad (3.1) $$


Computing the Marginal Likelihood, which is the integral of the likelihood under the prior:

$$ P(D|m_k) = \int P(D|\theta_k, m_k)\,P(\theta_k|m_k)\,d\theta_k $$

Statistical Mechanics: here one usually needs to compute the partition function Z of a system with states s and Hamiltonian H[s]:

$$ Z = \sum_{\{s\}} \exp\!\left(-\frac{H[s]}{k_B T}\right) $$

where k_B is Boltzmann's constant and T denotes the temperature of the system. Summing over all the configurations {s} is often prohibitive, so Monte Carlo sampling is necessary.

Optimization: the goal is to extract, from a large set of feasible solutions, the solution that minimizes some objective function. In the clustering algorithm that will be introduced in the next chapters, I faced this kind of problem when looking for the optimal splitting or merging of clusters made of time series, basing those operations on the maximization of a likelihood function over the data.

3.2 Principles

All Monte Carlo methods have the same general structure: given some probability measure p(x) on some configuration space X, one wishes to generate many random samples {x^(i)}_{i=1}^N from p [49]. These N samples can be used to approximate the target density with the following empirical point-mass function:

$$ p_N(x) = \frac{1}{N}\sum_{i=1}^{N} \delta\!\left(x - x^{(i)}\right) \qquad (3.2) $$

Aiming to evaluate the mean I(f) of some function f(x) of the configurations, one can build the Monte Carlo estimate I_N(f) simply as the sample mean of f, with the configurations sampled from p_N(x):

$$ I(f) = \langle f \rangle_X = \int_X f(x)\,p(x)\,dx $$

$$ I_N(f) = \frac{1}{N}\sum_{i=1}^{N} f\!\left(x^{(i)}\right) $$

The MC estimate converges almost certainly¹ (ac-lim) [7] to I(f):

$$ \operatorname{ac-lim}_{N \to \infty} I_N(f) = I(f) $$

¹A sequence X_n of random variables defined on elementary events ω ∈ Ω converges almost certainly to X if:

$$ \lim_{n \to +\infty} X_n(\omega) = X(\omega) \qquad \forall\,\omega \in \Omega \setminus \{\text{set of zero measure}\} $$


If the variance σ_f² of f (in the univariate case for simplicity) is finite, then the Central Limit Theorem (CLT) states the asymptotic behavior of the values of I_N(f) when the sample size N approaches infinity:

$$ I_N(f) \sim \mathcal{N}\!\left(I(f), \sigma_I^2\right) $$

where the dispersion σ_I of the estimate I_N(f) behaves as the ordinary error of an averaged variable:

$$ \sigma_I = \frac{\sigma_f}{\sqrt{N}} $$

This strong form of convergence guarantees the effectiveness of Monte Carlo integration.
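The two facts above, the sample-mean estimator I_N(f) and its σ_f/√N dispersion, can be checked with a few lines of code; the integrand f(x) = x² under p = N(0,1), whose exact mean is 1, is a hypothetical choice made only for illustration.

    import numpy as np

    rng = np.random.default_rng(5)

    def mc_estimate(N):
        x = rng.normal(size=N)            # samples x^(i) ~ p(x) = N(0,1)
        f = x ** 2                        # f(x) = x^2, exact mean I(f) = 1
        I_N = f.mean()                    # Monte Carlo estimate I_N(f)
        err = f.std(ddof=1) / np.sqrt(N)  # sigma_f / sqrt(N), the CLT dispersion of the estimate
        return I_N, err

    for N in (10**2, 10**4, 10**6):
        I_N, err = mc_estimate(N)
        print(N, round(I_N, 4), round(err, 4))   # the error shrinks roughly as 1/sqrt(N)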

3.3 Static Methods

Static methods are those that generate a sequence of statistically independent samples from the desired probability distribution π. Coming back to equation (3.2), we see that the central problem is a sampling problem.

3.3.1 Rejection Sampling

If we know the target distribution p(x) up to a constant, we can sample from it by sampling from another, easy-to-sample proposal distribution q(x) that satisfies:

$$ p(x) \leq M q(x)\,, \qquad M < \infty $$

    using the accept/reject procedure described in the pseudo-code below:

Rejection Sampling:

    i = 1
    repeat:
        sample x^(i) ~ q(x)
        sample u ~ U(0,1)
        if u < p(x^(i)) / (M q(x^(i))) then: accept x^(i) and i <- i + 1
        else: reject
        end if
    until i = N



Figure 3.1. Rejection sampling: sample a candidate x^(i) and a uniform variable u. Accept the candidate sample if u M q(x^(i)) < p(x^(i)), otherwise reject it. Source: Andrieu et al. (2003), An Introduction to MCMC for Machine Learning.

A candidate x^(i) is accepted only if the ratio between p(x^(i)) and M q(x^(i)) lies above the u-threshold (figure (3.1)). This procedure reshapes q(x) to the target distribution p(x) and does not depend on the absolute normalization of p(x) (any normalization constant can be absorbed in M). Those x^(i) which are accepted turn out to be sampled with probability p(x) [50].
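A runnable sketch of the accept/reject procedure (my own code; the half-normal target, the exponential proposal and the bound M = e^{1/2} are hypothetical choices made only for illustration):

    import numpy as np

    def rejection_sampling(p, q_sample, q_pdf, M, n_samples, rng):
        # draw n_samples from p(x), known up to a constant, using proposal q with p(x) <= M q(x)
        samples = []
        while len(samples) < n_samples:
            x = q_sample(rng)                  # x ~ q(x)
            u = rng.uniform(0.0, 1.0)          # u ~ U(0,1)
            if u < p(x) / (M * q_pdf(x)):      # accept with probability p(x) / (M q(x))
                samples.append(x)
        return np.array(samples)

    # hypothetical example: target proportional to a half-normal on [0, inf),
    # proposal q = Exp(1); M chosen so that p(x) <= M q(x) everywhere
    p = lambda x: np.exp(-0.5 * x ** 2)        # unnormalized target
    q_pdf = lambda x: np.exp(-x)
    q_sample = lambda rng: rng.exponential(1.0)
    M = np.exp(0.5)                            # max_x p(x)/q(x), attained at x = 1

    rng = np.random.default_rng(6)
    s = rejection_sampling(p, q_sample, q_pdf, M, 10_000, rng)
    print(s.mean())                            # ~ sqrt(2/pi) = 0.798 for the half-normal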

3.3.2 Hit-or-Miss Sampling: a numerical experiment

I will now focus on a particular form of Rejection sampling, with reference to a specific situation I encountered during the analysis of bounces on Support and Resistance levels.
Consider a target p(x) whose support ranges over the interval [0, Δ].
Suppose you don't know the analytic form of p(x) but, for accepting/rejecting purposes, you can rely on a sufficiently dense histogram depicting its profile.
Now suppose you are also able to find a bound for its values:

you know M such that p(x) < M, ∀x ∈ [0, Δ], for example from a visual inspection of the histogram.
In figure (3.2) a computational experiment of Hit-or-Miss sampling is presented. I defined a data set of 10^5 random walk time series (length 10^3):

$$ p_{t+1} = p_t + \mathcal{N}(0, 1)\,, \qquad p_0 = 0 \qquad (3.3) $$

    The expected mean absolute return of each random walk series can be computed as:

$$ \Delta = \int |x|\,\mathcal{N}_x(0,1)\,dx = \sqrt{\frac{2}{\pi}} \approx 0.8 $$

where by N_x(0, 1) I mean the distribution of the increments of these random walks, according to (3.3). This value has been adopted as the width of a strip [0, Δ].


I let the random walks evolve, getting out from under the strip, and then considered the event of reaching the strip again. I measured the position of the first point reaching the strip again, considering all the events of this kind occurring along the path of each random walk, and computed the histogram over the whole data set.
The resulting histogram is an approximation of the target distribution p(x), so the interval [0, Δ] has been taken as the domain of definition of p(x).

Figure 3.2. Example of Hit-or-Miss sampling. Histogram bars refer to the distribution of the position of the first points reaching the [0, Δ] strip from below. The simulation has been carried out on a sample of N = 10^5 random walk histories, whose length was L = 10^3. The pdf was reconstructed via the procedure described in the text through the extraction of 10^6 (x, y) pairs of points.

With the aim of sampling points x^(i) from this unknown p(x), I adopted the so-called Hit-or-Miss Sampling:

    i = 1
    repeat:
        sample x^(i) ~ U(0, Δ)
        sample y ~ U(0, M)
        if y < p(x^(i)) then: accept x^(i) and i <- i + 1
        else: reject
        end if
    until i = N

where by p(x^(i)) I denote the histogram estimate of p at the point x^(i). Red points in figure (3.2) represent the profile of the distribution reconstructed by the method just described, through the extraction of a sample of 10^6 (x, y) pairs of points.
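A sketch of the Hit-or-Miss procedure driven by a histogram estimate of p(x) (my own code; the Beta-distributed "first-entry" data and the strip width 0.8 are hypothetical stand-ins for the quantities measured in the text):

    import numpy as np

    def hit_or_miss(hist, edges, M, n_samples, rng):
        # sample from the pdf profile described by a histogram, bounded above by M
        lo, hi = edges[0], edges[-1]
        samples = []
        while len(samples) < n_samples:
            x = rng.uniform(lo, hi)                     # x ~ U(0, Delta)
            y = rng.uniform(0.0, M)                     # y ~ U(0, M)
            idx = min(np.searchsorted(edges, x, side="right") - 1, len(hist) - 1)
            if y < hist[idx]:                           # "hit": keep x
                samples.append(x)
        return np.array(samples)

    # hypothetical target on [0, 0.8]: a profile known only through a histogram
    rng = np.random.default_rng(7)
    data = 0.8 * rng.beta(2.0, 5.0, size=100_000)       # stand-in for the measured first-entry points
    hist, edges = np.histogram(data, bins=50, range=(0.0, 0.8), density=True)
    M = hist.max() * 1.05                               # bound read off the histogram
    s = hit_or_miss(hist, edges, M, 10_000, rng)
    print(s.mean(), data.mean())                        # the Hit-or-Miss sample reproduces the profile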


3.4 Dynamic Methods

The idea of dynamic Monte Carlo methods is to invent some kind of stochastic process with state space X and p(x) as its unique equilibrium distribution.

3.4.1 Markov Chains

A Markov Chain, named after the Russian mathematician A.A. Markov, is one of the simplest examples of a nontrivial, discrete-time and discrete-state stochastic process [43].
Consider a random variable x^(i) which, at any discrete time/step i, may assume S possible values x^(i) ∈ X = {x_1, x_2, ..., x_S}.
Indicate with j the state x_j and suppose that the generating process verifies the Markov property:

$$ \mathrm{Prob}\!\left(x^{(i)} = j_i \mid x^{(i-1)} = j_{i-1}, x^{(i-2)} = j_{i-2}, \ldots, x^{(i-k)} = j_{i-k}, \ldots\right) = \mathrm{Prob}\!\left(x^{(i)} = j_i \mid x^{(i-1)} = j_{i-1}\right) \qquad (3.4) $$

namely every future state is conditionally independent of every prior state but the present one; in other words, the process has memory "1". Now let us restrict ourselves to time-homogeneous Markov Chains:

$$ \mathrm{Prob}\!\left(x^{(i)} = j \mid x^{(i-1)} = k\right) = p(j|k) \stackrel{\mathrm{def}}{=} W_{jk} $$

which are characterized exclusively by their Transition Matrix W_ij, with properties:

non-negativity: $W_{ij} \geq 0$

normalization: $\sum_{i=1}^{S} W_{ij} = 1$

In order to introduce the concept of invariant distribution, consider an ensemble of random variables all evolving with the same transition matrix. The probability

$$ \mathrm{Prob}\!\left(x^{(i)} = j\right) \stackrel{\mathrm{def}}{=} P_j(i) $$

of finding the random variable in state j at time i is given, due to the Markov property (3.4), by:

$$ P_j(i) = \sum_{k=1}^{S} W_{jk}\,P_k(i-1) \qquad (3.5) $$

i.e. the probability to be in j at time i is equal to the probability to have been in k at i − 1, times the probability to jump from k to j, summed over all the possible previous states k. Using matrix notation

$$ \mathbf{P}(i) = \left(P_1(i), P_2(i), \ldots, P_S(i)\right) $$

we can write (3.5) as:

$$ \mathbf{P}(i) = W\,\mathbf{P}(i-1) \;\Longrightarrow\; \mathbf{P}(i) = W^i\,\mathbf{P}(0) $$


The relevant question, at this point, concerns the convergence of P(i) to some limit and whether such a limit is unique. Clearly, if the limit lim_{i→∞} P(i) exists, it will be the invariant or equilibrium probability P^inv that satisfies the eigenvalue equation:

$$ \mathbf{P}^{\mathrm{inv}} = W\,\mathbf{P}^{\mathrm{inv}} $$

    Two more definitions to complete the theory (figure (3.3)):


Figure 3.3. Three examples of Markov Chains with 4 states. (a) A reducible MC where state 1 is transient (never reached again, once left) whereas states 2, 3, 4 are recurrent (there exists a finite probability to come back) and periodic with period 2: the probability to come back in k steps is null unless k is a multiple of 2. (b) Period-3 irreducible MC. (c) Ergodic and irreducible MC. In all examples p and q are supposed to be different from 0 and 1. Source: Vulpiani et al., Chaos: From Simple Models To Complex Systems.

Irreducible chain: a chain whose states are all accessible from any other. Formally, this means that there exists a k > 0 such that W^k_{ij} > 0 ∀ i, j. The chain is called reducible if this does not happen.

Ergodic chain: a chain that is irreducible and whose states are ergodic, namely each of them, once visited, will be visited again by the chain, with a finite mean recurrence time.

For this special class of Markov Chains, a Fundamental Theorem asserts the existence and uniqueness of P^inv:

Fundamental Theorem of Markov Chains. For an irreducible ergodic Markov Chain, the limit

$$ \mathbf{P}(i) = W^i\,\mathbf{P}(0) \to \mathbf{P}(\infty) \quad \text{for } i \to \infty $$

exists, is unique and is independent of the initial distribution P(0). Moreover:

$$ \mathbf{P}(\infty) = \mathbf{P}^{\mathrm{inv}}\,, \qquad \mathbf{P}^{\mathrm{inv}} = W\,\mathbf{P}^{\mathrm{inv}} \qquad (3.6) $$

meaning that the limit distribution is invariant.
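A small numerical check of this theorem (my own sketch, with a hypothetical 3-state ergodic chain): iterating P(i) = W P(i−1) converges to the eigenvector of W with eigenvalue 1, independently of P(0).

    import numpy as np

    # hypothetical ergodic, irreducible transition matrix (columns sum to 1: W[j, k] = P(j | k))
    W = np.array([[0.5, 0.2, 0.3],
                  [0.3, 0.6, 0.3],
                  [0.2, 0.2, 0.4]])

    # iterate P(i) = W P(i-1) from an arbitrary initial distribution
    P = np.array([1.0, 0.0, 0.0])
    for _ in range(200):
        P = W @ P
    print(P)                                   # the limit distribution P(infinity)

    # compare with the eigenvector of W with eigenvalue 1 (the invariant distribution P_inv)
    vals, vecs = np.linalg.eig(W)
    P_inv = np.real(vecs[:, np.argmax(np.real(vals))])
    P_inv = P_inv / P_inv.sum()
    print(P_inv)                               # matches P(infinity), and W @ P_inv = P_inv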


3.4.2 MCMC and Metropolis-Hastings Algorithm

Dealing with sampling problems, we are interested in constructing Markov Chains for which the distribution we wish to sample from, given by p(x), x ∈ X, is invariant, i.e. once reached, it never changes [51].
We restrict ourselves to homogeneous Markov Chains. A sufficient but not necessary condition to ensure that a particular p(x) is the desired invariant distribution is the following detailed balance condition, which is a reversibility condition:

$$ p\!\left(x^{(i)}\right) W\!\left(x^{(i-1)}|x^{(i)}\right) = p\!\left(x^{(i-1)}\right) W\!\left(x^{(i)}|x^{(i-1)}\right) \qquad (3.7) $$

where x^(i) is the state of the chain at time i and W(x^(i-1)|x^(i)) is the jump probability. This condition obviously implies that p(x) is the invariant distribution of the chain; indeed, summing both sides over the possible states at time i − 1:

$$ \sum_{\{x^{(i-1)}\}} p\!\left(x^{(i-1)}\right) W\!\left(x^{(i)}|x^{(i-1)}\right) = \sum_{\{x^{(i-1)}\}} p\!\left(x^{(i)}\right) W\!\left(x^{(i-1)}|x^{(i)}\right) = p\!\left(x^{(i)}\right) \sum_{\{x^{(i-1)}\}} W\!\left(x^{(i-1)}|x^{(i)}\right) = p\!\left(x^{(i)}\right) $$

where the last equality is based on the normalization condition. These kinds of Markov Chains are called Monte Carlo Markov Chain (MCMC) samplers: irreducible and aperiodic Markov chains that have the target distribution as their invariant distribution [52]. The Metropolis-Hasti