1 / 35Siegmund, F. (2014). Tutorial correspondence analysis
Tutorial for archaeologists:
How to perform a correspondence analysis ‐
a practitioner’s guide to success and reliability
Frank Siegmund
( Heinrich Heine University Düsseldorf )
( draft, version 0.9; March 20th, 2014 ) *
(1) Introduction
Correspondence analysis (CA) is a widely known and well‐founded multivariate statistical
method for bringing data (cases as well as variables) into a sequence when the data follow
a unimodal model. The term “multivariate” refers to statistical procedures that take more
than two variables into account at once, instead of describing only one phenomenon or
looking for the relation between two variables, such as stature and body weight. A
well‐known graph of a “unimodal model” is a bell‐shaped curve (fig. 1). The term unimodal
model is meant as a contrast to linear (or similar, like quadratic, exponential, ...) models.
Let us keep things simple and give examples known to everyone: the relation between
stature and body weight, or the relation between the speed and weight of a car and
its gasoline consumption, follows a linear model more or less. In general, taller
people are heavier and shorter people are lighter. The heavier a car is and the faster
it is driven, the more gasoline it consumes. A good example of unimodal
behaviour is the usual relation between age and body weight: newborn babies get
heavier as they grow up, young adults get heavier still, but most people lose
weight when they get old. The short history of data storage devices for computers
is another example of a unimodal model, and it is already close to the archaeological
applications of CA. Mechanical solutions for storing information, like punched tapes and
punched cards, were replaced by magnetic devices in the 1960s. After some years the
huge Winchester devices were followed by 8‐inch floppy disks in the 1980s, then by
5.25‐inch floppy disks, then by 3.5‐inch disks, then by re‐writable CDs, up to today’s
USB sticks. This is exactly the way archaeologists think about time and artefacts: as long
as a special kind of object ‐ often called a “type” by archaeologists ‐ has not yet been
invented, its frequency in the world is zero. When it is invented it occurs in low frequencies
first, and when it becomes fashionable it is produced and used in high frequencies. After some time
the frequencies become lower, because other useful alternatives are invented and come
into common use, and eventually the type is no longer in use, which results in
frequencies close to zero again. For a CA to be statistically valid, the data need not
follow the bell‐shaped curve (known in statistics as the normal distribution) exactly;
they are only expected to show one maximum somewhere in the middle and
minima at both ends.
Unimodal models aren’t exotic, nor are they restricted to archaeology and to questions of
chronology. Another example is plants and animals that prefer certain environmental
conditions. They grow best within a certain range of temperature, humidity, acidity, ... and
avoid deviations from this optimum in either direction. Along such environmental
gradients they show unimodal behaviour.
Another characteristic of CA is its robustness: it can be applied to various kinds of data.
CA is able to work with names (presence / absence information), with counted frequencies
(abundance information) as well as with measured observations, while many other
multivariate statistical methods handle measured observations only. This is another
reason for the frequent use of CA in archaeology. The application of CA isn’t limited to
archaeology. On the contrary, it is widely used, e.g. in the social sciences, in biology and
in ecology. The book “Distinction: A Social Critique of the Judgment of Taste” (1984) by
the famous French sociologist Pierre Bourdieu gives an early example of its application
in sociology.
If you want to analyse a multivariate data set whose observations follow a linear model
rather than a unimodal one, the suitable and comparable linear method is Principal
Component Analysis (PCA), a method belonging to the wide family of factor analyses.
Applying PCA to unimodal data leads to erroneous results, as does applying CA to data
that follow the linear model. We will look at an example later on. In practice, PCA is
more sensitive to slight violations of its model assumptions, while CA is more robust
against them.
(2) The theory and aims of CA
Judging by the current (March 2014) Wikipedia article “correspondence analysis”,
CA seems to be a highly complicated method. In truth, it isn’t. It is easy to understand,
and easy to calculate and to perform. However, we need not explain the theory here,
because there are good books that do it better (see chapter 4), and we need not calculate
a CA by hand, because there are computers to do so (chapter 3). We only have to recognize
its aim: to re‐arrange the sequence of the rows and columns of a table in such a way that
the table is diagonalized at the end. Usually these tables consist of many empty cells or
cells with zero, and all the other cells, e.g. those with frequencies, should be concentrated
in a diagonal cloud in the middle of the table, starting at its upper left corner and ending
at its lower right corner, or vice versa (fig. 1). Once this aim is achieved, the observations
are arranged according to a unimodal model: each row and each column of the table starts
with a minimum (empty cells or zero cells), followed by a maximum, followed by a minimum
again. The input to a CA is unsorted information (a disordered table); the result
of a CA is a new sequence of the rows and columns of that table.
Fig. 1. Example of a well diagonalized matrix, where the rows and columns follow the unimodal
model.
In the early days of this method it was performed by repeated mechanical re‐arranging of the
sequence of the rows and the columns of a table. First, the sequence of the rows was optimised, then
the sequence of the columns, then the sequence of the rows again, and so on, until a final stable
solution was reached. Therefore this method was also called sequencing, or sequence dating, and
sometimes seriation or ordination in the sense of sorting and re‐sorting. A picture of a helpful mechanical
device can be found in Périn (1980, fig. 23). The first computer programs merely automated
this repeated re‐sorting of rows and columns. Nowadays the mathematical solution is much more
sophisticated and is obtained by calculation: at the end the CA proposes a new sequence of the rows and
columns, and the table has to be re‐arranged only once. In the English as well as the German literature
this changing of the sequence of rows and columns was called “to order” and “ordination”; the
term is still in use.
PÉRIN, P. (1980). La datation des tombes mérovingiennes : historique, méthodes, applications. Genève:
Droz.
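The repeated re‐sorting of rows and columns described above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration, not a reconstruction of any historical program: the function name `resort_until_stable`, the sorting criterion (the count‐weighted mean position of the entries of a row or column) and the small example table are assumptions made for this sketch. The table must not contain empty rows or columns, and such a simple procedure can stop at an order that is stable but not optimal.

```python
import numpy as np

def resort_until_stable(table, max_rounds=100):
    """Alternately re-sort rows and columns by their mean position.

    Returns the row order and column order (index arrays) at which
    a further round of sorting changes nothing any more.
    """
    t = np.asarray(table, dtype=float)
    rows = np.arange(t.shape[0])
    cols = np.arange(t.shape[1])
    for _ in range(max_rounds):
        # Sort rows by the count-weighted mean column position of their entries.
        current = t[np.ix_(rows, cols)]
        row_pos = current @ np.arange(len(cols)) / current.sum(axis=1)
        new_rows = rows[np.argsort(row_pos, kind="stable")]
        # Sort columns by the count-weighted mean row position of their entries.
        current = t[np.ix_(new_rows, cols)]
        col_pos = np.arange(len(rows)) @ current / current.sum(axis=0)
        new_cols = cols[np.argsort(col_pos, kind="stable")]
        if np.array_equal(new_rows, rows) and np.array_equal(new_cols, cols):
            break  # stable solution reached
        rows, cols = new_rows, new_cols
    return rows, cols

# Hypothetical 4 x 4 table: a diagonal band whose first two rows were swapped.
disordered = np.array([
    [1, 2, 1, 0],
    [2, 1, 0, 0],
    [0, 1, 2, 1],
    [0, 0, 1, 2],
])
row_order, col_order = resort_until_stable(disordered)
print(disordered[np.ix_(row_order, col_order)])
```

In this small example one round of sorting is enough to restore the diagonal band; the second round only confirms that nothing changes any more.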
(3) Software to perform a CA
Most of the widely used programs for CA are freeware or open source software, available
free of charge. You only have to learn how they work. The following list
is a personal selection by the author, who himself has extensive practice with WinBASP
and PAST. All of these programs are distributed with documentation. It is really useful to
read their instructions.
‐ WinBASP: The Bonn Archaeological Software Package, Version 5.43 (by Scollar, Irwin
et al.)
source: http://www.uni‐koeln.de/~al001/
[ If there are any problems with this or other links, please search for the program with Google ]
Look for BaspPast too, which helps to exchange data sets between WinBASP and PAST
(and spreadsheets).
Be cautious, there is a software error in BaspPast: when transferring data from WinBASP to PAST, the last
type in the WinBASP list vanishes. Once you know this, you can prevent it with a simple trick:
before transferring the data to PAST, add a new last type (without any meaning) to the data set in
WinBASP and then use BaspPast to convert the data set to PAST. ;‐)
‐ PAST 3.0: PAleontological STatistics Version 3.0 (by Hammer, Øyvind)
source: http://folk.uio.no/ohammer/past/index.html
Version 3.0 of PAST is under construction in the years 2013‐14. If you have problems
with it, take the older, stable version 2.7.
Hammer, Ø., Harper, D. A. T. & Ryan, P. D. (2001). PAST: Paleontological statistics software package for
education and data analysis. Palaeontologia Electronica 4(1): 9pp.
http://palaeo‐electronica.org/2001_1/past/issue1_01.htm
‐ CAPCA 2.2 (by Madsen, Torsten), an add‐in for MS‐Excel 2003 or 2007
source: http://archaeoinfo.dk/
Other software solutions are available at:
‐ WinSERION 3.1 (by Stadler, Peter)
source: http://www.winserion.org/
‐ There are several packages for CA in “R”, see: http://www.r‐project.org/
‐ CANOCO 5 (by ter Braak, Cajo, Univ. Wageningen), a powerful package, but it has to
be paid for. More information:
http://www.wageningenur.nl/en/Expertise‐Services/Research‐Institutes/plant‐research‐international/show/Canoco‐for‐visualization‐of‐multivariate‐data.htm
This tutorial is mainly for computers running MS‐Windows. Most of these special programs
run only under Windows, so we are bound to this operating system. The only exception
known to me is the powerful statistical package “R”, which is distributed for Linux and
Mac, too. There are several packages within “R” to calculate a CA. “R” is free of charge, it is
really powerful and often used by people working with statistics professionally. But all
my experience with introducing beginners to statistics shows that “R” seems
highly complicated to them.
I have been told that WinBASP can be used on a Mac by emulating Windows, but I
have no experience with this. A Mac version of PAST was announced in October 2013 as
forthcoming; a test version for Mac OS X can be downloaded (since March 2013) at:
http://folk.uio.no/ohammer/past/Past3.dmg I don’t have any experience with it.
(4) Literature
CA was invented several times by different researchers and therefore went by different
names in its early days. In Germany, e.g., it was called “Seriation” when it was introduced
mathematically by Klaus Goldmann and Ernst Kammerer in 1972. They took the term
from Sir William Matthew Flinders Petrie (1853‐1942), who in the late 19th century gave
the name “seriation” to a similar procedure (Petrie 1899), which at that time was carried
out without mathematics and computers. Nowadays the scientific community follows the name
proposed by the first to publish the actual theory and statistical procedure,
the French statistician Jean‐Paul Benzécri (1973‐1976, L’analyse des données. 2
vols). The statistical literature about CA is vast and easy to find. If
you want to read as much as necessary and as little as possible, take the book by Michael
J. Greenacre (2007), which includes much practical advice, while Greenacre (1984) is
frequently cited as the current standard introduction to the theory of CA.
GREENACRE, M. J. (1984). Theory and application of correspondence analysis. London:
Academic Press.
GREENACRE, M. J. (2007). Correspondence analysis in practice. 2nd ed. Boca Raton: Chap‐
man & Hall.
There are numerous applications of correspondence analysis in archaeology, too many
to give a short list here. They are easy to find with the usual search engines or library
catalogues. At the end of this text you will find a selected list (“Further readings”).
Everyone planning an archaeological research project with a CA should take one or a few
of them as an example. It is useful to choose examples which are close to the topic of
one’s own research project, which is another reason to avoid a fixed list of literature
here. As an exception I may be allowed to mention my own book (Siegmund
1998), in which I developed a chronology for a large sample of Early Medieval graves
from Western Germany, and a book on Early Medieval chronology in English
(Bayliss et al. 2013), which gives a very detailed insight into the methodology and the
actual working process of an analysis.
SIEGMUND, F. (1998). Merowingerzeit am Niederrhein. Rheinische Ausgrabungen 34. Köln:
Rheinland‐Verlag.
BAYLISS, A., HINES, J., HØILUND NIELSEN, K., MCCORMAC, G. & SCULL, CHR. (2013). Anglo‐Saxon
graves and grave goods of the 6th and 7th centuries AD: a chronological framework.
Edited by J. Hines & A. Bayliss. The Society for Medieval Archaeology Monograph 33.
London: The Society for Medieval Archaeology.
In Italian language: ALBERTI, G. (forthcoming). [title still unknown]. Archeologia e Calcolato‐
ri 2013 (in press). See: http://soi.cnr.it/archcalc/
(5) Starting with the practical part: the choice of software
CA means working with tables. It is not just a matter of putting data into a table, calculating
the CA and having a final result. On the contrary, it means working intensively with such a
table for some time, changing things and experimenting with several ideas to get good
and stable results at the end. If there are many data, these tables can get really large
and hard to keep track of, and adding or changing information in these tables is
error‐prone. Therefore I like the program WinBASP, although it looks old‐fashioned now.
WinBASP has powerful tools to support the input and the management of data, especially
when these data are originally based on lists instead of tables (like: type xyz occurs
in graves a, b, c, d, ...). I don’t have practical experience with WinSERION, but its data
management seems to follow a similar concept. Everyone planning a project with
numerous graves (or features) and types should have a look at these two programs; they
could be very useful. Whenever there is only a small table, the program PAST is my
preference. It is free and really well done, and it offers many more statistical procedures
than CA alone. This short tutorial is therefore based on PAST. If you want to follow the
practical part of this tutorial, please download PAST from its website (see chapter 3) and
install it on your computer. Additionally, this tutorial uses some sample data. You can
type them into your computer yourself, or copy them from the shared devices.
The next pages are easier to follow when you print them. You don’t have to, but it
makes it easier to follow the instructions step by step.
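Data recorded as lists (“type xyz occurs in graves a, b, c, ...”) have to be turned into a table before a CA can be calculated. A minimal Python sketch of this conversion follows. The type and grave names are invented for illustration, and the function `lists_to_table` is a hypothetical helper, not a part of WinBASP, WinSERION or PAST.

```python
def lists_to_table(occurrences):
    """Turn {type: [graves...]} lists into a graves x types count table.

    A grave that is listed twice for the same type is counted twice,
    so the cells hold frequencies, not just presence/absence.
    """
    types = sorted(occurrences)
    graves = sorted({g for gs in occurrences.values() for g in gs})
    table = [[occurrences[t].count(g) for t in types] for g in graves]
    return graves, types, table

# Invented example data in list form.
occurrences = {
    "brooch": ["grave a", "grave b"],
    "bead": ["grave b", "grave c", "grave c"],
    "knife": ["grave c"],
}
graves, types, table = lists_to_table(occurrences)

# Print the table with graves as rows and types as columns.
print("".ljust(10) + "".join(t.ljust(8) for t in types))
for grave, row in zip(graves, table):
    print(grave.ljust(10) + "".join(str(n).ljust(8) for n in row))
```

If only presence / absence information is wanted, every count greater than zero can simply be replaced by 1.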
(6) Starting with PAST
Start PAST now by double‐clicking on its icon. Go to “File”, then to “Open”, and open the
file “1a_ideal‐matrix‐unordered” (fig. 2). You should now see a table similar to an Excel
sheet (or to a sheet made e.g. with Open‐/Libre‐Office Calc). The table contains ten types
named from A to K and ten features named grave 1 to 10. In general, each type is
present in three different graves, and each grave contains three different types. The
types are recorded as counted frequencies; therefore you will see numbers in the cells, or
zeros when a combination of grave and type isn’t present. If you want, click on
“Bands” to tick it, which makes the table easier to read.
Fig. 2. Screen shot of our first practical example, the data matrix loaded into PAST.
(6.1) PAST step 1: calculating a CA and reading the scatter plot of axis 1 and axis 2
Let’s do a CA now. Click with your pointer (mouse / mouse & Control) on the upper
left corner of the table (grave 6, type H) and drag to the lower right corner (grave 5,
type C) to mark and highlight the whole table. It should be in a light blue tone now. This
is an important step when using PAST: you must always select the data set to be
analysed by marking it. Now look at the menu bar at the top with “File”, “Edit”,
“Transform”, ..., go to “Multivariate”, then “Ordination”, then “Correspondence (CA)”
and click on it (fig. 3). Immediately a new window “Correspondence
analysis” should appear on your screen. The CA is already done, let’s look at the results.
The new window shows four tabs at the top: “Summary”, “Scatter plot”, “Row scores”,
“Column scores”. Click on “Scatter plot” (fig. 4). You can enlarge the window a little
with your mouse. This scatter plot shows (by default) the first two axes calculated by the
CA and places the graves/rows (black) and the types/columns (blue) according to the
results of the CA. Taken together, the dots should form a kind of parabola or horse shoe
now. On the right side of the window you can change the default settings according to
your needs. If you click e.g. on “Plot columns” to untick it, the columns/types will
disappear from the scatter plot. Now the graph is easier to read.
Fig. 3. Screen shot of PAST immediately before calculating the CA.
Fig. 4. Screen shot of the new, second window of PAST, showing the scatter plot of axis 1 (hori‐
zontally) with axis 2 (vertically).
Our example tables use “graves” and “types” as categories. But CA isn’t restricted
to the analysis of grave assemblages. It has also been applied successfully to features or
layers of settlements and their finds. Another useful application of CA is the
analysis of objects themselves, with each object taken as a “grave/feature” and its
attributes as “types”.
(6.2) PAST step 2: changing the sequence of graves and types according to the results
of CA and analysing the new table
The scatter plot shows “Axis 1” as the x‐axis, i.e. horizontally from left to right (fig. 4).
This is the first, the dominant dimension in the data set calculated by the CA. The second
axis, “Axis 2”, is shown vertically. We should read the sequence of the
graves along the horizontal axis only. It runs grave 1, 2, 3, ... to grave 10, which is
different from the order in our original table “1a_ideal‐matrix‐unordered”. Let us
look at the types now: go to the right side, click on “Plot columns” to tick the columns
(types) and on “Plot rows” to untick the rows (graves). Now we can see the new
sequence of the types from left to right, which is type A, B, C, ... to type K. Once again,
this is the sequence along the first, dominant axis of our material as calculated by the
CA.
Now we will take a look at the other tabs at the top (fig. 5). “Row scores” and “Column
scores” list the graves and the types respectively, with columns “Axis 1”, 2, 3, ... and
numbers in the cells. These are the numerical results of the CA that were displayed in the
scatter plot examined before. The meaning of these axes needs some explanation and will be
given later on. The tab at the left end, “Summary”, shows Axis 1, 2, 3, ... and their
“Eigenvalue”, “% of total” and “Cumulative”. These terms will be explained in a moment.
First, let’s try to re‐arrange our original table. To do so, we go back to the new window
“Correspondence analysis” and there to the tab “Scatter plot” to display it again. Let’s
start with the types, displayed in blue.
Fig. 5. The tab “Summary” activated, showing the numerical results of our CA. For the
explanation see chapter 7.1.
We return to our first (main) window with the table “1a_ideal‐matrix‐unordered”. Below
the menu bar there is a panel with several boxes. Look at the (second) box
“Click mode”, where “Select” is chosen by default. Click on “Drag rows/columns,
sort”. Now you are able to drag and move the columns and the rows of the table to
change their position (not their content!). And this is what we do now: we re‐arrange the
position of all the columns according to the sequence given in the scatter plot in our
window “Correspondence analysis”. Go with the mouse pointer to the title of a
column and drag it to the position you want. The column of type A should now be the
first column of the table. Then re‐arrange type B, and so on. The first step is done. Now we
have to re‐arrange the sequence of the graves. Go to the window “Correspondence
analysis”, change from “Plot columns” to “Plot rows” to see the graves, and read
their order from left to right. Now we re‐arrange the sequence of the graves according
to the same rule. Move grave 1 to the top first, grave 2 one row below, and so on. At the
end, the sequence of the graves should be 1, 2, 3, ... to 10, and the sequence of the types
A, B, C, ... to K (fig. 6). Both sequences are now in exactly the order proposed by axis 1
of the CA. Please study the final table to see the ‐ very simple ‐ model I
have chosen for this tutorial: in the columns the types are non‐existent (0), invented
(1), fashionable (2), becoming outdated (1) and vanished (0) afterwards, and in the rows
the graves contain three types, two of them with one piece and one of them with two
pieces. This is my “ideal table”. Sorry for giving such simple names to graves and types,
but it is easier this way to gain our first experience with CA.
Fig. 6. Screen shot of our matrix, with rows and columns re‐arranged now according to the
results of the CA.
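The manual re‐arranging shown in fig. 6 can also be sketched in Python with NumPy: calculate the scores of axis 1 and sort rows and columns by them. This is a minimal sketch under stated assumptions. The function `ca_scores` is a plain textbook calculation of a CA via a singular value decomposition, not the code used inside PAST, and the 10 x 10 band table with the pattern 1, 2, 1 is an assumed stand‐in for the sample file, whose exact content is not reproduced here.

```python
import numpy as np

def ca_scores(table):
    """Row and column scores of a simple CA, computed via SVD."""
    n = np.asarray(table, dtype=float)
    p = n / n.sum()                      # relative frequencies
    r, c = p.sum(axis=1), p.sum(axis=0)  # row and column masses
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    u, sv, vt = np.linalg.svd(s, full_matrices=False)
    rows = u * sv / np.sqrt(r)[:, None]     # row scores, one column per axis
    cols = vt.T * sv / np.sqrt(c)[:, None]  # column scores, one column per axis
    return rows, cols

# Assumed stand-in for the tutorial's ideal table: a 10 x 10 diagonal band
# with the pattern 1, 2, 1 (not the actual content of the PAST sample file).
ideal = 2 * np.eye(10, dtype=int) + np.eye(10, k=1, dtype=int) + np.eye(10, k=-1, dtype=int)

# Disorder rows and columns independently, as in an unsorted find list.
rng = np.random.default_rng(0)
shuffled = ideal[np.ix_(rng.permutation(10), rng.permutation(10))]

# Re-arrange rows and columns by their position on axis 1.
row_scores, col_scores = ca_scores(shuffled)
row_order = np.argsort(row_scores[:, 0])
col_order = np.argsort(col_scores[:, 0])
reordered = shuffled[np.ix_(row_order, col_order)]
print(reordered)
```

The printed table shows the diagonal band again. Whether the sequence is read from one end or from the other makes no difference for this symmetric example table.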
(7) Some further explanations of the statistical values
First of all: the most important work when performing a CA is the archaeological part.
We need to define types or attributes well, and we have to select the right ones, those
which are able to answer our question. If the question is about chronology ‐ which is
assumed here ‐ the types should be sensitive to time; if one is asking e.g. about gender,
the types should be sensitive to gender. We have to choose the graves (or features, or
objects with attributes) carefully. When asking about chronology, “closed features” like
well documented graves are necessary, while features which accumulated material
over centuries before being buried are less useful. The focus of your work
should be archaeology; a deeper understanding of the statistics is useful, but not really
necessary. The following short explanation of some of the statistical terms and their
background could be handy.
(7.1) “Axis”, “Eigenvalue”, “Inertia”
CA is multi‐dimensional. CA first tries to find the main, the dominant dimension (or
sequence) in the data, called axis 1. After calculating the first sequence, it looks for
a second dimension, which is independent of the first one, and goes on with a third
dimension, and so on. This is a purely statistical process, similar to Principal
Component Analysis (a member of the factor analysis family), which extracts several
independent factors from a set of data. It is possible that these second, third or higher
dimensions of the material have some archaeological meaning. But in the practice of
archaeological research it is uncommon to use the second or third dimension. Here is an
example of what could theoretically be achieved when working with graves: axis 1 means
gender, axis 2 means time, axis 3 means social status. But this is purely idealistic theory,
not achieved by any study known to me. In practice we should take care to understand
axis 1, and possibly axis 2 too.
All those axes together should give an optimal “explanation” for all the variation
embedded in the whole table. This total variation within the data set is called
“inertia”. A part of this whole variation is captured by the first axis (axis 1). The
importance ‐ in a statistical sense ‐ of each axis is given by its
“eigenvalue”. The higher the eigenvalue of an axis, the more important it is. Axis 1 forms
the first eigenvector of the data set. In our example its eigenvalue is about 0.95 (see
column “Eigenvalue” in the tab “Summary” of the window “Correspondence analysis”;
fig. 5), or 30.56 percent of the total inertia of the whole table (column “% of total”). Axis 2 in
our example shows an eigenvalue of 0.80, which is 25.86 percent of the total inertia. The
column on the right adds up the percentages axis by axis (“Cumulative”). In
general, the first axis should have a high eigenvalue, i.e. it should account for a large
share of the total inertia of the table. But in practice these numbers should not be taken
as too important. I have seen tables analysed by CA with very good‐looking eigenvalues
but without archaeological sense, and on the contrary, I have seen very good and valuable
archaeological analyses with poor numbers from a statistical point of view. The proof of
the success and value of a CA is not to be found in these statistical numbers, but in
archaeological arguments.
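The three columns of the “Summary” are connected by simple arithmetic: the percentage of an axis is its eigenvalue divided by the total inertia (the sum of all eigenvalues), and “Cumulative” is the running sum of these percentages. With the numbers quoted above, 0.95 being 30.56 percent implies a total inertia of about 0.95 / 0.3056, i.e. roughly 3.1. The following Python sketch shows how such a summary can be calculated. The function `ca_summary` is a hypothetical helper based on a textbook calculation via singular values, and the band table is an assumed stand‐in, so its numbers need not match those of the sample file.

```python
import numpy as np

def ca_summary(table):
    """Eigenvalue, % of total inertia and cumulative % for each CA axis."""
    n = np.asarray(table, dtype=float)
    p = n / n.sum()
    r, c = p.sum(axis=1), p.sum(axis=0)
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    eigenvalues = np.linalg.svd(s, compute_uv=False) ** 2
    eigenvalues = eigenvalues[eigenvalues > 1e-12]   # drop numerically empty axes
    percent = 100 * eigenvalues / eigenvalues.sum()  # share of the total inertia
    return eigenvalues, percent, np.cumsum(percent)

# Assumed stand-in table (10 x 10 band with pattern 1, 2, 1); its numbers
# need not match those of the tutorial's sample file.
table = 2 * np.eye(10) + np.eye(10, k=1) + np.eye(10, k=-1)

eigenvalues, percent, cumulative = ca_summary(table)
print("Axis  Eigenvalue  % of total  Cumulative")
for axis, (ev, pc, cu) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(f"{axis:>4}  {ev:10.4f}  {pc:10.2f}  {cu:10.2f}")
```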
(7.2) Row and column scores
Now we can understand the row and column scores. Each axis reflects a dimension of
its own, and the score shows the position of a grave / feature or a type / attribute within
this dimension. The row and column scores are the values displayed in the scatter plot.
There are more than two axes, and therefore we can look at more than one
two‐dimensional scatter plot. We will see this later in another exercise.
It is important to know that each axis shows a well‐defined sequence and the scores
give the position of an individual type or grave within this sequence, but there is no
defined direction of any sequence. The direction of any sequence can be freely reversed,
e.g. by multiplying all values by −1. Changing the direction changes neither
the sequence itself nor the distances between the individual cases,
which are given by the scores. To give an easier explanation for archaeologists:
when a sequence presumably means time, the position of each case (grave, type)
within this time sequence is statistically well calculated, but there is no answer to the
question of which end of the sequence is the old one and which is the young one. This
answer can’t be given by CA; it has to be derived from external archaeological arguments
like stratigraphy or radiocarbon dates.
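That reversing an axis changes neither the sequence nor the distances can be checked with a few lines of Python. The scores below are invented numbers for five hypothetical graves, chosen only for illustration.

```python
import numpy as np

# Hypothetical axis-1 scores of five graves (invented numbers).
graves = ["grave 1", "grave 2", "grave 3", "grave 4", "grave 5"]
scores = np.array([-1.30, -0.55, 0.10, 0.70, 1.05])
flipped = -1 * scores  # reverse the direction of the axis

def sequence(values):
    """Grave names ordered from the lowest to the highest score."""
    return [graves[i] for i in np.argsort(values)]

def distances(values):
    """Absolute distance between every pair of cases along the axis."""
    return np.abs(values[:, None] - values[None, :])

print(sequence(scores))
print(sequence(flipped))   # the same sequence, read from the other end
print(np.allclose(distances(scores), distances(flipped)))
```

Both printed sequences contain the same neighbours in the same succession; only the end from which the sequence is read has changed.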
(7.3) Beyond the first dimension (axis) of a CA
As we have seen above, CA usually calculates more than one axis, up to a statistical point
where a further extraction of axes isn’t justified any longer. The number of extracted
axes depends on the data; it isn’t fixed. In most archaeological applications of CA
only the first axis or the first few axes have meaning, and all the rest can’t be interpreted.
The parabola or horseshoe formed by the scatter plot of axis 1 with axis 2 (fig. 4) is
a purely statistical result. When the data follow the assumed unimodal model ideally, axes
1 and 2 show such a parabola. In these cases there is also a characteristic curve when
displaying axis 1 with axis 3 and axis 2 with axis 3. The typical shape of these curves
should be known to everybody, and therefore we will study it now.
Open the table “1b_ideal‐matrix‐ordered” with PAST, calculate a CA and display the
scatter plot, just as we did before. The scatter plot of axis 1 with axis 2 should now be
visible, with the graves and types forming a parabola. This well‐shaped parabola indicates
that our data follow the unimodal model well. On the right side of the window
“Correspondence analysis” there are some controls we haven’t used yet. At the top
there are two selection boxes labelled “X axis” and “Y axis”. Go to the second one (“Y
axis”), click on axis 2 and select axis 3. The scatter plot changes immediately: it now
displays axis 1 horizontally (as before) and axis 3 vertically (fig. 7). The dots now follow
an S‐shaped curve lying on its side, which is typical and once again close to the
mathematically expected ideal.
Fig. 7. Scatter Plot of axis 1 (horizontally) with axis 3 (vertically).
Now go to the upper box labelled “X axis”, click on “Axis 1” and select “Axis 2”.
Once again the graph changes, now displaying axis 2 horizontally and axis 3
vertically (fig. 8). You should now be looking at a curve like the sketch of a fish. Follow
the points from grave 1, 2, ... to 10 with your eyes to recognise the course of this special
curve, which is ‐ once again ‐ typical and close to the mathematically expected ideal.
Fig. 8. Scatter plot of axis 2 (horizontally) with axis 3 (vertically).
These three displays ‐ axes 1 with 2, 1 with 3 and 2 with 3 ‐ are three views of a
three‐dimensional cube with a three‐dimensional cloud of points inside. We looked at it from
three different sides, seeing only flat two‐dimensional views. If you are really
interested in this complicated curve, you could try building a physical model of it,
but this isn’t really necessary. For our purposes the two‐dimensional scatter plots
are sufficient; the point was to see these ideal curves once. You should try to remember
them as a pattern which may also appear in your future analyses with less ideal
data.
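The parabola of axis 1 with axis 2 can also be read from the numbers themselves. The following Python sketch lists the scores of the graves on the first two axes for an ideal table. As before, this is only an illustration under stated assumptions: `ca_row_scores` is a textbook calculation via a singular value decomposition, not the code of PAST, and the 10 x 10 band table is an assumed stand‐in for the sample file.

```python
import numpy as np

def ca_row_scores(table):
    """Row scores of a simple CA (one column per axis), computed via SVD."""
    n = np.asarray(table, dtype=float)
    p = n / n.sum()
    r, c = p.sum(axis=1), p.sum(axis=0)
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    u, sv, _ = np.linalg.svd(s, full_matrices=False)
    return u * sv / np.sqrt(r)[:, None]

# Assumed stand-in for an already ordered ideal table
# (10 x 10 diagonal band with the pattern 1, 2, 1).
ideal = 2 * np.eye(10) + np.eye(10, k=1) + np.eye(10, k=-1)

scores = ca_row_scores(ideal)
axis1, axis2 = scores[:, 0], scores[:, 1]

print("grave   axis 1   axis 2")
for grave, (x, y) in enumerate(zip(axis1, axis2), start=1):
    print(f"{grave:>5}  {x:7.3f}  {y:7.3f}")
```

Read from grave 1 to grave 10, the scores on axis 1 change steadily in one direction. The scores on axis 2 start at one extreme, pass through the opposite sign in the middle of the sequence and return to the starting side, and graves at mirrored positions share the same score. Plotted against each other, this gives the parabola or horse shoe.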
(7.4) CA and seriation
Now we are ready to answer the question: what is the difference between a CA and a
“seriation”? Well, as we have seen, CA is multi‐dimensional and in many cases offers
several axes. Seriation isn’t; it gives one sequence, i.e. a one‐dimensional solution. The
sequence proposed by a seriation ‐ when properly done ‐ is equivalent to the first axis
of a CA. Therefore older studies which analysed archaeological problems with the help
of a seriation are not incorrect; the result of a CA should be identical or very similar.
(7.5) What is relevant, the curve or the axes?
After seeing these curves and their typical course, a common question is which sequence is
the relevant one: the position of a point along one of these typical curves, or
the position of a point along axis 1? The latter is correct. The position of
the points has to be read along the axes, not along the curves. To get a better idea of the
“real” distances along axis 1 (and 2, 3, ...), statisticians have invented a variant of CA
called Detrended Correspondence Analysis (DCA). Here a usual CA is computed first, but
afterwards the curves are straightened into a line. The idea is that the distances
between single points given by a DCA are more accurate than the distances along the axis
of a CA. Perhaps, but the difference is small and has ‐ from my point of view ‐ no
relevance for archaeological practice.
If you are interested in performing a DCA, you can easily calculate it with PAST. Go
back to our table, mark (highlight) the relevant rows and columns and go in the
menu bar to “Multivariate”, then “Ordination”, and then “Detrended
correspondence (DCA)” instead of “Correspondence (CA)”. That’s it. The new window
displays axis 1 and axis 2 as re‐calculated by the DCA. The sequence of the types and
graves along the first axis hasn’t changed, but the scale and the distances are slightly
different now.
(7.6) The “parabola test”
Sometimes in archaeological applications of CA you will read that a parabola test was
done. It may be that you don’t know this test and consult your textbook on statistics,
where topics like the chi‐square test, the Mann‐Whitney U‐test or the Kruskal‐Wallis H‐test
are introduced and explained in detail. But you won’t find any “parabola test” there. This is
not the fault of your textbook. The parabola test is a myth. The term doesn’t denote a
serious statistical test. It means that someone had a look at the display of the results
of his CA, the scatter plot of axis 1 with axis 2, and visually compared the distribution of
the points with the expected shape of a parabola, just as we have seen it
above (fig. 4). That’s it, the famous parabola test.
It can really be worth checking whether the results of a CA show a parabola when
displaying axis 1 with axis 2. It shows, that the data set is close to the unimodal model.
But it is never something like a serious statistical test. So, please never talk about a
parabola test.
Horseshoe or parabola? ‐ that’s the question. As we have learnt already, the axes
have no inherent direction; they can be flipped freely.
Sometimes the display of axis 1 against axis 2 forms the shape of a parabola with its two
ends up, sometimes it forms a horseshoe with its two ends down. Whether the final
mathematical solution has its ends up or down depends on the sequence of the data input
and on a random process; there is no difference in meaning. In a real analysis it is
useful to make the displays comparable to each other, all of them showing a
parabola, or all of them showing a horseshoe ‐ just as you prefer. European
archaeologists often prefer a parabola, American archaeologists often prefer a
horseshoe. It is important to know that there is no real difference between them;
it is only a different convention.
(8) Gaining more experience with CA
Obviously, we need some introduction to interpreting the results of a CA, and some
ways of getting a better feeling for how to work with it.
Before that, I would like to study some more artificial tables with
simulated data. Exercises of this kind will give you more experience with these tables
and pictures before working on a real archaeological problem.
(8.1) Case study with an unspecific type
Matrix “2_ideal‐matrix_with‐one‐unsensible‐type” shows a common case in real
applications: the table is dominated by well‐defined, closed features (graves) and by
time‐sensitive types (fig. 9). But there is one type which occurs throughout the whole
period. Archaeologists often call such a type a “long runner”. To keep things simple, this
table is already put into the ideal sequence, so you can see its structure at a glance.
Fig. 9. The model table with the additional type “unsensible” (right side).
Please start PAST, load this table and look at it, then perform a CA and look at the
window “Correspondence analysis” for the scatter plot of axis 1 against axis 2. The scatter
plot of the graves (i.e. rows) looks very similar to the plot derived from our first (ideal)
table, but the scatter plot of the types (columns) is different now (fig. 10). While the
types A to K are sorted in the same way as before, the type named
“unsensible” lies in the middle of our parabola or horseshoe. This is the typical picture.
When a table shows a generally good and stable sequence, the one or few types
which are not sensitive to the underlying dimension ‐ time in our assumption ‐ are
collected in the middle of the open parabola. You can use this phenomenon to get hints
as to which graves or types are not really useful for the analysis.
Fig. 10. Scatter plot of axis 1 (horizontally) with axis 2 (vertically) of the CA of the table in fig. 9
with type “unsensible”.
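The behaviour of such a long runner can also be reproduced outside PAST. The following Python lines are only a minimal sketch under stated assumptions: the small band table is an invented stand‐in for the model table (not the actual file), the function name `axis1_scores` is made up for this illustration, and the first axis is approximated by reciprocal averaging, a classical iterative route to the first axis of a CA, not by the algorithm PAST uses.

```python
def axis1_scores(table, iterations=500):
    """Approximate the type (column) scores on CA axis 1 by reciprocal averaging."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    grand = sum(row_tot)
    col = [float(j) for j in range(n_cols)]      # arbitrary, non-constant start values
    for _ in range(iterations):
        # grave score = frequency-weighted mean of the scores of its types
        row = [sum(table[i][j] * col[j] for j in range(n_cols)) / row_tot[i]
               for i in range(n_rows)]
        # type score = frequency-weighted mean of the scores of its graves
        col = [sum(table[i][j] * row[i] for i in range(n_rows)) / col_tot[j]
               for j in range(n_cols)]
        # centre and rescale (weighted by type totals) so the scores do not collapse
        mean = sum(c * w for c, w in zip(col, col_tot)) / grand
        col = [c - mean for c in col]
        spread = (sum(w * c * c for c, w in zip(col, col_tot)) / grand) ** 0.5
        col = [c / spread for c in col]
    return col


# Invented model table: 10 graves x 10 time-sensitive types (A..K, no J);
# every grave holds the types neighbouring its own position in the sequence.
types = list("ABCDEFGHIK")
table = [[1 if abs(g - t) <= 1 else 0 for t in range(10)] for g in range(10)]
# Add one unspecific "long runner" type that occurs in every grave.
types.append("unspecific")
table = [row + [1] for row in table]

scores = dict(zip(types, axis1_scores(table)))
for name in sorted(scores, key=scores.get):
    print(f"{name:>10s} {scores[name]: .3f}")
```

In this stand‐in table the long runner ends up with a score very close to zero on axis 1, i.e. in the middle of the sequence, while the time‐sensitive types spread towards both ends.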
(8.2) Case study with mixed graves
The same can happen to features (graves), as shown in
“3_ideal‐matrix_with‐unspecific‐grave”. Please load the table into PAST and perform a CA.
The new grave named “collector” contains one piece of each type. Just like the
“unsensible” type in the example before, this grave now lies in the middle of the parabola,
while the other graves and types are arranged as expected.
The next table, “4_ideal‐matrix_with‐mixed‐grave”, gives a more dramatic version of an
unsuitable feature (fig. 11). The new grave “mixed” is a real mixture of the types
from graves 2 and 9, with additional types in the middle. After calculating the CA we can see
that the general sequence of the graves and types is still similar to that of the first,
perfect table. But the scatter plot shows distortions now (fig. 12). It is no longer
symmetric, and the sequence is slightly changed, especially from type H to type I. While the
model tables number 2 and 3, with an insensitive type or grave, still kept to the unimodal
rule, our mixed grave shows a bimodal collection of types, combining a very old and
a very young assemblage. This violation of the unimodal model has a bigger influence on the
resulting sequence of our table.
Fig. 11. Model table with an additional mixed grave (bottom line).
Fig. 12. Scatter plot of axis 1 (horizontally) with axis 2 (vertically) of the CA of the table in fig. 11
with a mixed grave.
Our table is small in comparison to a real data set and therefore more sensitive to single
changes. You can easily check this by making our model tables larger, i.e. by adding more
artificial types and graves. Real, larger tables can’t be altered so easily by a single case.
But our observation shows how and where these violations work. Types or graves
which are less sensitive to the underlying dimension (time in our model assumption)
than the other types and graves can’t be placed very well in the resulting sequence,
but their general influence on the quality of the final sequence is small. Truly mixed
features (graves), combining typical assemblages of different times, or types which
re‐occur after having already become outdated, are more influential. When those disturbing
cases are rare, you will detect them with the CA, but when they make up a larger
share of the data set, the CA won’t give suitable results.
(8.3) Case study with weak connection
CA analyses assemblages and the combinations of types within them. A type occurring in
only one grave, or a grave containing only one type, doesn’t show a combination and
is therefore of no value for the analytical process. Such cases should be excluded when
you try to find and establish a sequence. The minimal requirement for our table is: each
grave contains at least two types, and each of those types has to be represented in at
least two graves. But even when this rule is respected, parts of a table can be only sparsely
filled. To see this in practice, load “5_ideal‐matrix_with‐weak‐connection” (fig. 13),
look at the table and calculate a CA. In comparison to the matrices analysed before, this
matrix shows higher frequencies of types in the graves at both ends of the table, but
the middle part has been thinned out; look especially at graves 5 and 6 and at types E and F.
The aforementioned minimal conditions are still met.
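The minimal requirement (at least two types per grave, at least two graves per type) usually has to be applied repeatedly, because removing a rare type can leave a grave with a single type, and vice versa. A minimal Python sketch, assuming the table is held as a dictionary of graves; the function name `prune_table`, its parameters and the small example are invented for this illustration and are not part of PAST:

```python
def prune_table(table, min_types=2, min_graves=2):
    """Drop whole types and whole graves until the minimal requirement holds.

    table: dict mapping grave name -> dict mapping type name -> count.
    """
    # work on a copy that only keeps types actually present in a grave
    table = {g: {t: n for t, n in types.items() if n > 0} for g, types in table.items()}
    while True:
        # in how many graves does each type occur?
        graves_per_type = {}
        for types in table.values():
            for t in types:
                graves_per_type[t] = graves_per_type.get(t, 0) + 1
        rare = {t for t, k in graves_per_type.items() if k < min_graves}
        # remove rare types as a whole, then graves that became too poor
        pruned = {}
        for g, types in table.items():
            kept = {t: n for t, n in types.items() if t not in rare}
            if len(kept) >= min_types:
                pruned[g] = kept
        if pruned == table:          # nothing changed any more: finished
            return pruned
        table = pruned


# Invented example: type D and type E occur only once.
example = {
    "grave 1": {"A": 1, "B": 2},
    "grave 2": {"A": 1, "B": 1, "C": 1},
    "grave 3": {"C": 1, "D": 1},
    "grave 4": {"E": 3},
}
result = prune_table(example)
print(result)
```

Here the removal of types D and E makes graves 3 and 4 unusable, which in turn leaves type C in a single grave, so C is dropped in a second round. The two thresholds are parameters, so a stricter rule can be tried by raising them.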
Fig. 13. Modified model table with graves and types more connected to each other and a mid
part with very few combinations only.
The scatter plot of the CA mirrors these changes and the resulting structure (fig. 14):
types A, B, C and D are plotted close to each other at one end, and types G, H, I and K are
distributed as usual at the other end. While the former are drawn together by type
frequencies of up to 4, the latter show type frequencies of only up to 3. The middle of our
parabola is thinned out, with the biggest distance between types E and F. This is just the
structure to be seen in the table. What does this example show? Graves and types can
be more intensively combined with each other, and they can lie further apart from
each other than usual. The scatter plot of a CA shows these denser and thinner zones,
which can be used as an instrument for defining phases in chronologically ordered
material. It is not necessary to follow the results of a CA when there are other good
reasons for a phasing. When there are no other or better arguments, pictures like this
can be used for phasing the sequence. In our case we could use the gap between type E
and type F to draw a borderline between two phases, maybe with grave 6 assigned to the
side of graves 7 ff., because it lies a little closer to them.
Fig. 14. Scatter plot of axis 1 (horizontally) with axis 2 (vertically) of the CA of the table in fig. 13
with a mid part with few combinations only.
(8.4) Don’t overemphasise the scatter plot, look at the table!
As we have learnt from these experiments with our artificial data sets, the scatter plot of
axis 1 against axis 2 can give valuable insights into the characteristics and structure of
your data. But the scatter plot should not be used on its own, and it should not be
overrated. In the end you have to look closely at the final table,
where you can see the real combinations of the types and the graves. Sometimes ‐
especially at the beginning of a real project ‐ there are some errors in the table, often
simple typos. You won’t find them by looking at the scatter plot, but you can find them by
checking the table.
Another typical “error” may occur while preparing and defining the typology. Often a
large group of objects is organized by an archaeologist into several well‐defined types.
Usually, close to the end of this process of classifying all the objects, one or a few pieces
are left over. They don’t fit well into any of these types. An archaeologist often feels a
certain need not to leave any objects unclassified, so those single pieces are added to the most
plausible category. When performing a CA later on, many of these decisions about the
unusual objects don’t come up again, because they were classified well. But sometimes
errors in these decisions are detected by the CA later on, as unusual combinations
disturbing the sequence. When a sorted table shows a well‐shaped, elongated diagonal
cloud of frequencies in the middle, you should read it carefully row by row and column by
column. You can often detect outliers then, i.e. single combinations lying far away from
the rest of the cloud. Such combinations can be genuine; outliers are a possible
phenomenon! On the other hand, sometimes you will find your problematic typological
decisions here, and you should re‐think them now.
(9) A look at two tables with real archaeological data
So far we have collected some experience with artificial tables. It’s time to look at
some real archaeological data sets. Two examples are given here:
“6_Langweiler‐2_Stehli‐1973‐p91‐fig49” and “7_beads_Koch‐U‐1977‐table‐4”. The first data set
derives from the analysis of the settlement “Langweiler 2” of the Linear Pottery culture
(ca. 5500‐4900 BC) in Western Germany (Stehli 1973, 91 fig. 49). It shows single features
from this settlement (rows) and types of the main decorations of the pottery (columns). This was
an early study of this problem, superseded by more recent studies now, but it was to my
knowledge the first time that the frequency of types in combinations was taken into account
and calculated, whereas up to then seriation had been based on the presence and absence of
types only. In the original publication the features were divided into three phases; our
table shows these phases 1 to 3 as the first character of the label of each feature.
Our data set used as input here is in the sequence proposed by Stehli (1973). When
performing a CA with this table you will see that the three phases proposed by
Stehli are reproduced very well, but the order proposed by the CA differs in some details
from the one published by Stehli (1973). The parabola isn’t formed as well as in our
artificial tables. But this is normal, especially when analysing assemblages of finds
from settlements. The scatter plot of axis 1 against axis 2 could be read as a hint that the
features 1‐0485 and 2‐0821 incorporate types from different times, and that the
decoration type a12 is not very sensitive to time but a “long runner”. These are hints only,
which should be substantiated in detail on archaeological grounds. The sequence gets better
‐ in a technical sense ‐ when those two features and type a12 are excluded from the CA.
Our next example is taken from the book by U. Koch (1977), in which she studied the
beads and the strings of beads from the Early Medieval cemetery near Schretzheim,
Southern Germany (ca. 530‐665 AD). The decorated beads were analysed and classified
into distinct types. The table shows these types (columns) and their representation in
the strings of beads (rows), which were worn by Early Medieval women as necklaces. A
copy of Koch’s printed table (Koch 1977, table 4) is enclosed here at the end (fig. 18). The
original sequence of this table was made by hand by U. Koch and obviously follows a
different concept: the latest type dates the complex. This is a common approach, e.g. among
numismatists when dealing with coin hoards. We will discuss the methodological
aspect later (see chapter 10). The rows in our table show the strings of beads, which are
labelled in a special way: the leading number gives the chronological phase of the
grave according to the current chronology of the cemetery of Schretzheim (Koch 2004),
followed by a hyphen, followed by the grave number as in the original table (Koch 1977,
table 4). Undated graves are marked by two leading hyphens before their number. With
the help of this coding technique we can read the results of the CA more easily, because one
can see immediately whether and how far the sequence of the beads proposed by the
CA agrees with the current chronology of the cemetery.
When a CA of the table of beads from Schretzheim is computed, its order shows good
agreement with the overall chronology of the cemetery of Schretzheim. However, the
results differ from the sequence of the originally published table. The display of axis 1
against axis 2 shows a well‐formed parabola, but there seem to be some outliers: graves
6‐258 and 7‐420 and bead type 33,15‐16. Because we can’t go into the archaeological
details here to analyse the reasons, we take the simple solution and remove them ‐ as an
experiment ‐ from the data set (highlight the row or column respectively, then >> “Edit”
>> “Remove”). Re‐calculate the CA and compare the results, ...
Stehli, P. (1973). Keramik. In Farrugia, J.‐P., Kuper, R., Lüning, J. & Stehli, P. (eds.).
Der bandkeramische Siedlungsplatz Langweiler 2, Gemeinde Aldenhoven, Kreis
Düren. Rheinische Ausgrabungen 13 (pp. 57‐100). Bonn: Rheinland‐Verlag.
Koch, U. (1977). Das Reihengräberfeld bei Schretzheim. Germanische Denkmäler
der Völkerwanderungszeit A 13. Berlin: Gebr. Mann.
Koch, U. (2004). Schretzheim §2 Archäologisches. Reallexikon der Germanischen
Altertumskunde vol. 27 (pp. 294‐302). Berlin: de Gruyter.
(10) “The latest type dates the complex”? ‐ Or: how does CA date?
As mentioned above, many numismatists, especially when dealing with hoards, follow
the concept that the latest piece in an assemblage dates it. When this model is converted
into an ordered table, the table should look like the one cited above (Koch 1977, table
4): a rectangular table with one empty triangle and another triangle filled with
frequencies, most densely accumulated along the diagonal border line between both areas. This
picture differs from the tables generated by a CA, which show a symmetrical
accumulation of frequencies along the diagonal. The table derived by a CA mirrors the unimodal
model, which arranges types and graves around a mean position. Thereby the CA estimates the
most probable mean (!) time of an assemblage and the mean time of a type, not the time of
the last piece. Grave goods are a collection: some pieces are recent, some pieces can be
old. They all were deposited in the earth when the corpse was buried, but some of them
might have been acquired by the deceased in their early years, some in the last days of their
life, and some may have been produced for the occasion of the burial. The CA draws all this
together into a mean estimate for the assemblage. If you think this concept isn’t suitable
for your finds, don’t use CA.
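The difference between the two dating concepts can be made concrete with a small calculation. In a CA the score of a grave on an axis is, up to a constant rescaling factor, the frequency‐weighted mean of the scores of the types it contains. The positions and the grave in the following Python sketch are invented numbers for illustration only:

```python
# Invented positions of four types on axis 1 (small = old, large = young).
type_position = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 6.0}
# Invented grave: number of objects of each type it contains.
grave = {"A": 2, "B": 1, "D": 1}

n_objects = sum(grave.values())
# CA-like estimate: frequency-weighted mean position of all types in the grave.
mean_position = sum(type_position[t] * n for t, n in grave.items()) / n_objects
# "The latest type dates the complex": only the youngest type counts.
latest_position = max(type_position[t] for t in grave)

print(mean_position, latest_position)   # prints: 2.5 6.0
```

The single young piece of type D pulls the mean position only slightly, whereas under the other concept it alone determines the date of the whole grave.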
Now you might ask whether there is another statistical solution instead of CA,
more suitable for the model “the last piece dates the complex”. I have to
disappoint you, because there is no suitable and statistically valid solution for this
different concept. If you try, e. g., to analyse this table following the linear model
by a PCA (see chapter 11.5), you will easily see that the deviation from the
original results of Koch (1977) and the chronology of the cemetery is larger
than the deviation of the results derived by our CA. From my point of view
the model “the latest type dates the complex” is not suitable for archaeological
problems ‐ but this is my personal opinion only.
(11) It’s time to start with your own projects now
Now you are ready to carry out your own projects. It is best to use your own,
real data. The following part of this tutorial gives some practical advice for
your first steps into CA.
(11.1) Data preparation, or: What does a suitable table look like?
This question isn’t as silly as it looks at first glance. Usually the archaeological
information is prepared in a structure like this: grave 1 contains a sword type 1 and a
shield type 44; grave 2 contains a sword type 2 and a shield type 55. One could
transform this information into a table like this (fig. 15):
          sword    shield
grave 1   type 1   type 44
grave 2   type 2   type 55
Fig. 15. Simple table showing the types found in the graves.
But this isn’t a table suitable for CA. Instead, you have to transform your data into a
table of the following kind (fig. 16):
          sword    sword    shield    shield
          type 1   type 2   type 44   type 55
grave 1   1        0        1         0
grave 2   0        1        0         1
Fig. 16. Modified table with the same information as fig. 15, but ready now to be analysed
by a CA.
Each row represents a single grave (or feature), each column represents a single type (or
attribute) now, with the numbers in the cells representing presence or absence of this
type in this grave, or the frequency. It is important to recognise the difference between
the two tables and to prepare the input correctly.
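If your finds are recorded as a simple list, this transformation can be done by a few lines of code instead of by hand. A minimal Python sketch, assuming one (grave, type) pair per object; the function name `to_abundance_table` is made up for this illustration, and the tab‐separated printout is just one convenient way of passing the result on to a spreadsheet:

```python
from collections import Counter


def to_abundance_table(finds):
    """Turn a list of (grave, type) pairs, one pair per object, into a graves x types table."""
    counts = Counter(finds)
    graves = sorted({grave for grave, _ in finds})
    types = sorted({typ for _, typ in finds})
    rows = [[counts[(grave, typ)] for typ in types] for grave in graves]
    return graves, types, rows


# The two graves from fig. 15, one entry per object.
finds = [
    ("grave 1", "sword type 1"), ("grave 1", "shield type 44"),
    ("grave 2", "sword type 2"), ("grave 2", "shield type 55"),
]
graves, types, rows = to_abundance_table(finds)

# Print as tab-separated text: header line with the types, then one line per grave.
print("\t".join([""] + types))
for grave, row in zip(graves, rows):
    print("\t".join([grave] + [str(n) for n in row]))
```

Because the objects are counted, a grave containing two pieces of the same type receives a 2 in the respective cell, so the same sketch serves for presence/absence and for frequencies. Note that the columns come out in alphabetical order of the type names.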
There is another kind of table often used in older archaeological literature, but not
suitable for being analysed with a CA: square symmetric tables, where the
rows as well as the columns show types, and the cells show how often a type is
combined with another type. These tables are symmetric about the main diagonal, which shows
the combination of a type with itself, while the two triangles show ‐ symmetrically
mirrored ‐ the number of combinations of each type with the other types. Nowadays
these tables are named “Burt table” in the statistical literature. In our small collection of
examples I have added a table named “8_burt‐table_from‐ideal‐matrix‐1”, where I have
transformed the information in our table 1 into a Burt matrix (fig. 17). As far as I know,
in archaeology a Burt table was first used by Heinz Gatermann (1942, p. 11 fig. 1) when
analysing the decoration of beaker pottery in western Germany. His study inspired David
L. Clarke (1970, p. 429, 469) to use those matrices in his book about the beaker pottery
in Great Britain and Ireland.
Fig. 17. Our “ideal matrix” transformed into a “Burt table”, which is not suitable to be analysed
by a CA.
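How such a type‐by‐type table arises from a graves‐by‐types table, and what is lost on the way, can be seen in a minimal Python sketch. The three graves are invented, and the function name `cooccurrence_table` is hypothetical; each cell simply counts the graves in which both types occur together, which is the sense in which the term “Burt table” is used in this tutorial:

```python
def cooccurrence_table(rows):
    """Count, for every pair of types, the graves in which both are present."""
    n_types = len(rows[0])
    return [[sum(1 for row in rows if row[a] and row[b]) for b in range(n_types)]
            for a in range(n_types)]


# Invented presence/absence table: 3 graves (rows) x 3 types (columns).
rows = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
burt = cooccurrence_table(rows)
for line in burt:
    print(line)
```

The result is symmetric, with the number of graves containing each type on the diagonal. The individual graves no longer appear in it, which is why later corrections to single assemblages are so cumbersome with this kind of table.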
Such tables should not be analysed by a usual CA; although this is technically possible,
the results are not correct. Greenacre (2007, pp. 137‐152) explains the statistical
problems of such a procedure and sketches out a possible solution, named Joint Correspondence
Analysis (JCA). But a JCA needs a different way of calculation. From an archaeological
point of view these tables are also difficult to work with, because the original
archaeological information ‐ the combination of types in graves ‐ cannot be seen any more.
When you are working with such a table and have to change something, like modifying a
typological decision or deleting an unsuitable type or a mixed grave, this isn’t an easy
procedure. So, even though applying a JCA instead of a CA to a Burt table gives a valid
statistical solution, don’t use such tables.
GATERMANN, H. (1942). Die Becherkulturen der Rheinprovinz. Würzburg: Triltsch.
CLARKE, D. L. (1970). Beaker pottery of Great Britain and Ireland. Cambridge: University
Press.
NEUFFER, E. M. (1965). Eine statistische Bearbeitung von Kollektivfunden. Bonner Jahr‐
bücher 165, pp. 28‐56.
GEBÜHR, M. (1970). Beigabenvergesellschaftung in mecklenburgischen Gräberfeldern der
älteren römischen Kaiserzeit. Neue Ausgrabungen und Forschungen in Niedersachsen 6,
pp. 93‐116.
(11.2) You need good material, good questions and a suitable benchmark
To achieve a good chronology you need a large amount of material and a well‐made
typology ‐ stated Oscar Montelius in the introduction to his famous book on
archaeological methods in 1903. This simple truth is still valid. The typology must be suitable for
your specific questions. If you are interested in chronology, the types have to be sensitive
to the dimension of time. If you are interested in questions of social status, the types
have to be sensitive to this specific question. Therefore there is no single optimal
typology for a given material; several are possible. A brooch, e. g., will be
classified by its style for chronological questions, but may be classified by its material
(gold, silver, bronze) or its weight for a social analysis, or by its position in a grave for
the analysis of costume.
Clearly it is not possible to achieve your goals with too few finds. It is difficult, but not
impossible, to answer the question: how many will be enough? CA helps to find the
answer. How? By observing the stability of your results. Whenever a real project is done,
there is a time of trial and error, when you are trying to get better results step by step.
This is an important part of the research process. After some time you will recognize (I
hope) that further attempts to improve your sequence don’t change it any more. Don’t be
disappointed, but be happy instead: the stage of stability has been reached.
Whenever you add a new find to your table now, the addition should be integrated into the
sequence well, but without having much influence on the sequence of the table in
general, compared with the results before, without these additional finds (graves or types).
When an analysis has reached this point, it is stable, and your material is large enough.
If you can’t add any new finds to your table to test its stability, you can try the
opposite: delete one type or one grave and look at what happens. When the effect on the order
of the whole table is small, stability has been achieved. This concept may seem a little
homespun, but it isn’t. Statistical theory knows such procedures as “jackknifing” and
“bootstrapping” (Efron & Tibshirani 1993; Chernick 1999; Good 2013). Jackknifing means
systematically deleting single cases from a data set and observing the results after each
step. Delete case 1 from the data set, perform your
analysis and save the results. Put the deleted case 1 back into your data set, delete case 2
now, perform your analysis and save the results, and so on. Bootstrapping follows the same
logic, but instead of deleting single cases it repeatedly draws random samples with
replacement from the data set. Both belong to the family of resampling methods.
At the end you can analyse all these results, in our case the scores of the
types and the features. The results should be similar to each other, and it would be
interesting to identify those single cases which are responsible for the most
deviant results. Analyse them from an archaeological point of view. They could indicate
the weaknesses of your table, e.g. badly defined types or mixed graves, or they could
simply be very influential, without any error. Analysing your results type/grave by
type/grave could take weeks of your precious time! But there are ways to do this
systematically with the help of a computer. If you plan a project of this kind, have a look at
the statistical package “R”, where you could write a script to do these deletions,
additions and comparisons automatically (Good 2013).
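The leave‐one‐out idea can be sketched in a few lines in other languages as well. The following Python sketch is an illustration under stated assumptions, not a finished tool: the model table is invented, all function names are hypothetical, the first CA axis is approximated by reciprocal averaging rather than by a statistics package, and stability is summarised by Spearman's rank correlation between the type order of the full table and that of each reduced table.

```python
def axis1_scores(table, iterations=500):
    """Approximate the type (column) scores on CA axis 1 by reciprocal averaging."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    grand = sum(row_tot)
    col = [float(j) for j in range(n_cols)]      # arbitrary, non-constant start values
    for _ in range(iterations):
        row = [sum(table[i][j] * col[j] for j in range(n_cols)) / row_tot[i]
               for i in range(n_rows)]
        col = [sum(table[i][j] * row[i] for i in range(n_rows)) / col_tot[j]
               for j in range(n_cols)]
        mean = sum(c * w for c, w in zip(col, col_tot)) / grand
        col = [c - mean for c in col]
        spread = (sum(w * c * c for c, w in zip(col, col_tot)) / grand) ** 0.5
        col = [c / spread for c in col]
    return col


def ranks(values):
    """Rank positions (0 = smallest) of the values; ties are not treated specially."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for position, k in enumerate(order):
        result[k] = position
    return result


def spearman(a, b):
    """Spearman rank correlation of two equally long score lists (no ties assumed)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def jackknife_graves(table):
    """Delete each grave in turn and compare the type order with the full analysis."""
    full = axis1_scores(table)
    stability = []
    for i in range(len(table)):
        reduced = table[:i] + table[i + 1:]          # table without grave i
        rho = spearman(full, axis1_scores(reduced))
        stability.append(abs(rho))                   # axes may flip, so ignore the sign
    return stability


# Invented model table: 10 graves x 10 types in ideal sequence.
# Assumption: every type still occurs in at least one grave after a deletion.
table = [[1 if abs(g - t) <= 1 else 0 for t in range(10)] for g in range(10)]
stability = jackknife_graves(table)
for grave, rho in enumerate(stability, start=1):
    print(f"without grave {grave:2d}: rank correlation of type order = {rho:.3f}")
```

Values close to 1 mean that the deletion of the respective grave hardly changes the order of the types; graves with clearly lower values are the ones worth a closer archaeological look. The same loop can be run over the columns to delete types instead of graves.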
CHERNICK, M. R. (1999). Bootstrap Methods. A practitioner's guide. Wiley Series in probabil‐
ity and statistics. New York: John Wiley & Sons.
EFRON, B. & TIBSHIRANI, R. J. (1993). An Introduction to the Bootstrap. Monographs on
Statistics and Applied Probability 57. New York: Chapmann & Hall.
GOOD, PH. I. (2013). Introduction to statistics through resampling methods and R. 2nd ed.
Hoboken NY: Wiley.
MONTELIUS, O. (1903). Die typologische Methode. Stockholm: Selbstverlag des Verfassers.
Once again: a statistically validated result is nice, but the archaeological validation is
more important. When starting an analysis, a benchmark is needed, a hypothesis against
which the results of your current CA can be compared. In the case of a chronological
question this could be the standard archaeological chronology used so far, it could
be stratigraphic information, or some radiocarbon dates, or dated coins in some of
the assemblages, or the chorological / topo‐chronological analysis of a cemetery which
grew systematically. It is not necessary to have information of that kind for all of
your grave assemblages, but for some of them you should have it. It is best to
write a short chapter for your later publication right at the beginning of your project,
before starting with the CA, in which you explicitly describe and justify these benchmarks
or test hypotheses of your study, for yourself as well as for your readers. After this step
you can clearly judge each change of your table in comparison to your benchmark:
whether the results are better than before or not.
When working with the table and the scatter plots of the CA, it is important in
practice to have your benchmark(s) visible. I propose to embed this
information into the labels of the types and features, e. g. by adding special signs to
the type or grave names, just as I have done in the examples
“6_Langweiler‐2_Stehli‐1973‐p91‐fig49” and “7_beads_Koch‐U‐1977‐table‐4”. One can
immediately see the conventional dating of the graves, and thus read and understand the
results of the CA more easily.
(11.3) What is allowed, and what shouldn’t be done? Some practical advice
No table and no CA is ready right away. In most cases the final result is the outcome
of a long process of trial and error. What are your possibilities? You can't change
assemblages individually, e. g. delete a single "disturbing" type within a grave. But you can
delete unsuitable graves as a whole, whenever there are arguments to do so, like graves
mixed up by errors during the excavation or during storage in an unprofessional storeroom.
You can delete unsuitable types as a whole, for example when they are too unspecific in
relation to your question. Selecting useful features and types or deleting unsuitable
finds is an important part of improving the table. Before starting the
working process, you should develop some rules and criteria for these operations, and
these rules should be explained and should be a part of your publication.
It is often difficult, and needs some time of trial and error, to select the set of types you
want to use. If they are all very specific and finely graded, you might have only a few
combinations. If you integrate (too) many unspecific, roughly defined types, you increase the
number of your combinations, but you won’t get a detailed chronology. There is no
fixed rule for solving these questions; you have to try to find a good solution, and a good
explanation for your decisions.
What can be done when parts of your table show too few combinations and too few
connections to the rest of the material? Maybe you could look for some additional
material, e.g. from comparable sites nearby. Re‐think your typology: it could
be too coarse, or too detailed. Sometimes it is useful to split some of your types into
attributes. Instead of entering types of belt buckles as a whole it could make sense to
divide them: to enter the same group of objects into your table once as a belt buckle type
defined by shape, and a second time as a belt buckle type decorated in style xyz. Such an
approach could help you to bridge or strengthen weak zones in a table.
Sometimes it is useful to tighten the rule “each grave contains two types at least, each
type has to be represented in two graves at least”. Especially when analysing
archaeological assemblages from settlements it can be useful to raise this minimum from
two to three or four, which will exclude singularities more efficiently.
On the other hand, an overly large assemblage can ruin the quality of a sequence as well. When
a single feature comprises many more finds than all the other features, it will
dominate the order, which is often (but not always) inappropriate. From an archaeological
point of view it is likely that this feature collected its material over a longer time than the
others, which reduces its value for chronological studies. Therefore one could try to form
another rule: exclude overly frequent types and assemblages with too many finds.
In a real research process there is a long period of working with the tables and making
decisions about graves and types to be included in or excluded from the CA. Where to start?
Should you start with all the material and successively eliminate assemblages and types
which seem to disturb the sequence? Or should you start with a core of well‐known,
suitable types and graves? After having a first good and stable result on that
basis you could add further material to this core in a process of trial and error. Simple
answer: there is no single and easy way to Heaven. My professional experience shows
that it is better to start with a well‐reasoned core if you are a beginner. A very helpful
technique for finding your own way: write it down before you start with the CA. In the end,
the whole research design has to be published. You can often test a certain solution by
trying to write down your arguments for the final publication. You will then immediately
see which solution is accompanied by weaker and which by stronger arguments.
You shouldn’t start discussing single cases and decisions here, but introduce the general
rules your study is following. The question you should answer can therefore be specified
as: are there transparent arguments and rules for deleting cases from the study when
starting with all the material? And conversely: are there transparent rules for selecting
a core of material for the start, and for adding further complexes and types
successively? The answers to these questions could help you to find your way.
(11.4) The “edge effect” and how to work with it
The sequence of a table is usually unsatisfactory at both edges. It is common that at the
beginning and at the end of a chronological sequence the archaeological information is
limited in relation to the more central parts of the table. This is a typical cause of the
unsatisfactory sequence at the edges. Another reason is the lack of combinations beyond
the edges of the current table. In reality, there were combinations beyond the
edges as well. But you can’t see them ‐ and the statistical process can’t calculate them ‐
because they are not represented in the current table. Therefore types and finds which should
be situated at the borders of your table show only combinations pulling them
towards the central part of the table and none pulling them towards the edge. It is often wise
to accept the fact that the sequence is not optimal at the edges. But what to do when
these outer parts of the table are important for your analysis? Simple answer: Enrich
your table with material beyond the edges. This is often possible by looking for some
additional archaeological material slightly older and/or slightly younger than the
material under study. In this way, the former edges are no longer edges but are shifted
towards the more central parts of the table, while the newly added findings now form the
edges of your table ‐ with some edge effect, of course.
(11.5) On de‐trending, weighting and canonical correspondence analysis
There are several variants of CA which can be applied in special situations. I will give
a short explanation here, ending with the clear advice not to apply them in most cases. The
meaning of the term de‐trending has already been explained above; the procedure is
named Detrended Correspondence Analysis (DCA). De‐trending means removing
the parabola ‐ a quadratic function ‐ from axes 1 and 2 by re‐calculation. This can be
useful when one of these axes is to be related to another, linear scale, for example to
estimate real calendar time from the scores of axis 1. The other purpose of
de‐trending is to get better results for the second axis. Whenever you want to interpret the
order of the second axis in more detail, a DCA can be useful. So, if there is a problem
which really calls for these ideas, de‐trending is a perfectly sound approach, but in
our standard applications it should not be used.
Archaeologists often think about weighting in order to express their idea of more or
less important things. Some types seem to be more important than others for achieving a
suitable sequence of the table. Such weighting isn’t forbidden, and the possibility of
weighting is well implemented in the tools of WinBASP. But weighting should not be
too complicated; it should follow simple, clear and precise rules which are explicitly
listed at the beginning of the study. This could be, for example: decorated beads or
decorated potsherds are counted double in relation to undecorated beads and
undecorated potsherds, because they seem to be more sensitive to chronology. My
personal experience with weighting is: it can be useful, but it can also make a table more
complicated to read, and its effect on the final result is often less intense than expected.
Therefore my advice is: keep things as simple as possible.
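A simple weighting rule of this kind can be sketched in a few lines of Python with NumPy. This is only an illustration and not the WinBASP implementation: the type names, the counts and the function apply_weights are hypothetical, and the rule "decorated types count double" is taken from the example above.

```python
import numpy as np

# Hypothetical example data: rows are graves, columns are types.
types = ["bead_plain", "bead_decorated", "sherd_plain", "sherd_decorated"]
counts = np.array([
    [3, 1, 0, 0],   # grave 1
    [1, 2, 2, 1],   # grave 2
    [0, 0, 4, 2],   # grave 3
])

# The weighting rule, stated once and explicitly: decorated types count double.
weights = {"bead_plain": 1, "bead_decorated": 2,
           "sherd_plain": 1, "sherd_decorated": 2}

def apply_weights(counts, types, weights):
    """Multiply each type column by its weight; types without a rule keep weight 1."""
    w = np.array([weights.get(t, 1) for t in types])
    return counts * w

weighted = apply_weights(counts, types, weights)
print(weighted)
```

Keeping the rule in one small table, separate from the data, makes it easy to list in the publication and to re‐run the analysis without weighting for comparison.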
Canonical Correspondence Analysis (CCA) is a technique distinct from CA. Its aim is also to
re‐arrange a table with information following the unimodal model into a new sequence,
but along a given “canonical” axis first. A CCA has a specified canonical variable, which
gives the first, “canonical” sequence, which is then followed by further axes (dimensions)
that are freely ordered as in the usual CA. If there is a fixed first dimension for all or most
cases in your study, CCA could be a good idea. An example? Grave assemblages often
show strong differences by gender. The usual approach of chronological studies is to
perform two different analyses, one for the male and one for the female graves.
Theoretically, you could analyse them together in one table and define gender as the
canonical axis to get a combined chronology as the second axis, that is, the first free axis.
Well, I have tried this several times and my results were not satisfying. CCA is not a
standard procedure; it should be applied only when there are good reasons. For some
examples and further details see e. g. Müller & Zimmermann (1997).
You should always be aware of the question whether you assume the unimodal or the
linear model. There is a concept similar to CCA, but for linear models only, which is called
Redundancy Analysis (RDA; see: Jongman, ter Braak & van Tongeren 1995). I have
applied it once to an archaeological problem, where findings from a short
stratigraphical sequence had to be analysed (Siegmund 1994). The reason for choosing a linear
model in this special case was the short time span covered by the sequence. When
types and assemblages in general follow the unimodal model, but the time span
represented in the archaeological sample is very short, your sample shows ‐ or could show ‐
only one half of the bell‐shaped life curves of the types. In such a case a linear model is
more appropriate.
If you want to see what happens when a wrong model is applied to a data set, you can
get a visualisation by performing a PCA (instead of a CA) with our
“1_ideal‐matrix‐ordered”: Go to PAST, then “Multivariate” >> “Ordination” >> “Principal
components (PCA)”, and analyse the resulting scatter plot. Time has been “folded” now: the
beginning and the end of our table are drawn together towards the middle of axis 1.
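For readers who would like to reproduce this effect outside PAST, the following is a minimal Python sketch with NumPy. It does not use the file “1_ideal‐matrix‐ordered”; as a stand‐in it builds a small hypothetical ideal matrix (20 graves, each containing exactly two chronologically adjacent types). All function names are assumptions made for this illustration, and CA and PCA are computed in their textbook form via a singular value decomposition, so sign and scaling need not match the output of PAST.

```python
import numpy as np

def ideal_matrix(n_graves=20):
    """Hypothetical ideal table: grave i contains only types i and i+1."""
    X = np.zeros((n_graves, n_graves + 1))
    for i in range(n_graves):
        X[i, i] = 1
        X[i, i + 1] = 1
    return X

def ca_row_scores(X):
    """First-axis CA scores of the rows (graves), via SVD of standardised residuals."""
    P = X / X.sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, 0] * s[0] / np.sqrt(r)

def pca_row_scores(X):
    """First-axis PCA scores of the rows, via SVD of the column-centred table."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * s[0]

def is_monotonic(values):
    """True if the values only rise or only fall along the given order."""
    d = np.diff(values)
    return bool(np.all(d > 0) or np.all(d < 0))

X = ideal_matrix(20)
ca = ca_row_scores(X)
pca = pca_row_scores(X)

print("CA  axis 1 keeps the order of the graves:", is_monotonic(ca))
print("PCA axis 1 keeps the order of the graves:", is_monotonic(pca))
print("PCA axis 1 scores:", np.round(pca, 2))
```

On this stand‐in matrix the CA scores run in one direction from the first to the last grave, while the PCA scores rise, fall and return, so that both ends of the sequence lie closer to the middle of axis 1 than the graves at one quarter and three quarters of the sequence.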
MÜLLER, J. & ZIMMERMANN, A. (EDS.) (1997). Archäologie und Korrespondenzanalyse:
Beispiele, Fragen, Perspektiven. Internationale Archäologie 23. Espelkamp: Marie Leidorf.
SIEGMUND, F. (1994). Jülich. Scherben und Schichten zu den Feuersbrünsten des 15. und 16.
Jahrhunderts. Jülicher Geschichtsblätter = Jahrbuch des Jülicher Geschichtsvereins 62, pp.
131‐184.
(12) Applying the results of a given CA
Sometimes there is a well‐reasoned and established chronology based on a large amount
of material and on a CA, and you want to embed your few findings into this given order.
How to do it? There are three different solutions, all of them acceptable.
(a) Keep things simple and don’t use statistics. Read the reference study you are
using, analyse the phasing of the relevant types there, and put your material into these
phases without any statistics. This is the usual way and not bad at all.
(b) Re‐calculate the given CA with your new, additional data. This approach will (or
should) embed your material into the already established sequence. It is a fine way to
approach the problem, but it is possible (or even very likely, depending on the number of
additions) that your material will change the order of the types and features of the
given study. If you want to avoid this, you can choose solution (c).
(c) Apply the scores of the given CA to your new data. The position of each grave
(feature) in a given CA can be calculated from the scores of the types, and vice versa;
thereby new features (and types) can be included in a given CA very accurately,
without changing the original order. In statistical terms, the new features and types are
called “supplementary points” (Greenacre 2007, pp. 89‐96). The calculation can be done in
the following way, assuming that we are calculating the position of a new grave
along axis 1 of a given CA: Take the score of each type along axis 1 of the given CA and
multiply it by the observed frequency of that type in the new grave you want to
integrate. This will often mean multiplying by zero, which gives zero. Then
calculate the sum of these products, and divide this sum by the number of objects (not the
number of types) represented in this new feature. The result is the score of the new
grave along axis 1.
What I want to underline with this remark is that the scores of the relevant axes of
a CA are important pieces of information, and therefore they should be published.
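The calculation described under (c) can be sketched in Python as a weighted average. This is a minimal illustration only: the type names, the axis 1 scores and the function supplementary_score are hypothetical, and the sketch assumes that the published type scores are in the scaling for which this weighted average applies (which scaling a study reports has to be checked in that study). Types of the new grave that have no score in the reference CA are rejected here; how to treat them is a decision the sketch deliberately leaves to the user.

```python
def supplementary_score(type_scores, grave_counts):
    """Score of a new grave on one axis: frequency-weighted mean of its type scores.

    type_scores  -- published scores of the types on the chosen axis
    grave_counts -- number of objects of each type found in the new grave
    """
    unknown = [t for t in grave_counts if t not in type_scores]
    if unknown:
        raise ValueError(f"no published score for: {unknown}")
    n_objects = sum(grave_counts.values())   # number of objects, not of types
    if n_objects == 0:
        raise ValueError("the new grave contains no objects")
    # Types absent from the grave contribute score * 0 and can simply be left out.
    weighted_sum = sum(type_scores[t] * n for t, n in grave_counts.items())
    return weighted_sum / n_objects

# Hypothetical axis 1 scores of four types from a published CA.
published_scores = {"type_A": -1.2, "type_B": -0.4, "type_C": 0.3, "type_D": 1.1}

# Hypothetical new grave: two objects of type B and one of type C.
new_grave = {"type_B": 2, "type_C": 1}

score = supplementary_score(published_scores, new_grave)
print(round(score, 3))
```

For this invented grave the sum of the products is 2 × (‐0.4) + 1 × 0.3 = ‐0.5, divided by 3 objects, which gives a score of about ‐0.167 on axis 1.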
(13) Final remark
At first glance, the theory of CA and the practical calculations seem to be complicated.
I wanted to show you that they are easy to understand in a general way, and that the
calculations can be learned quickly. The core of your work should be the
archaeological part of such an analysis. It is useful to look for an example close to your
specific problem, and to follow this example like a cookbook during your first steps. It helps
to have an experienced colleague who can be asked for assistance and discussion
from time to time. Be brave and start to gain your own experience with CA: it is a mighty
and useful method, and you will often need it.
(14) Some further readings
GOLDMANN, K. (1972). Zwei Methoden chronologischer Gruppierung. Acta Praehistorica et
Archaeologica 3, pp. 1‐34.
GOLDMANN, K. (1979). Die Seriation chronologischer Leitfunde der Bronzezeit Europas. Berliner
Beiträge zur Vor‐ und Frühgeschichte NF Bd. 1. Berlin: Spiess.
HAIR, J. F., BLACK, W. C., BABIN, B. J. & ANDERSON, R. E. (2010). Multivariate data analysis. 7th ed.
Upper Saddle River: Pearson Prentice Hall.
HAMMER, Ø., HARPER, D. A. T. & RYAN, P. D. (2001). PAST: Paleontological Statistics Software Package
for Education and Data Analysis. Palaeontologia Electronica 4(1): 9 pp.
IHM, P. (1983). Korrespondenzanalyse und Seriation. Archäologische Informationen 6, pp. 8‐21.
IHM, P. & VAN GROENEWOUD, H. (1984). Correspondence Analysis and Gaussian Ordination.
COMPSTAT Lectures 3, pp. 5‐60.
JONGMAN, R. H. G., TER BRAAK, C. J. F. & VAN TONGEREN, O. F. R. (1995). Data analysis in community
and landscape ecology. Cambridge: Cambridge Univ. Press.
KENDALL, D. G. (1963). A statistical approach to Flinders Petrie's sequence dating. Bulletin of the
International Statistical Institute 40, pp. 657‐680.
MÜLLER, J. & ZIMMERMANN, A. (Eds.) (1997). Archäologie und Korrespondenzanalyse: Beispiele,
Fragen, Perspektiven. Internationale Archäologie 23. Espelkamp: Marie Leidorf.
PETRIE, F. W. M. (1899). Sequences in prehistoric remains. Journal of the Anthropological Institute
29, pp. 295‐301.
SOKAL, R. R. & ROHLF, F. J. (2012). Biometry: The principles and practice of statistics in biological
research. New York: Freeman.
TER BRAAK, C. J. F. (1987). Unimodal models to relate species to environment. Wageningen:
Agricultural Mathematics Group.
WILKINSON, E. M. (1974). Techniques of Data Analysis. Seriation Theorie. Archaeo‐Physika 5. Köln:
Rheinland‐Verlag.
A proposal for an interesting test and training project: Perform a CA of the data set
published by Oscar Montelius (1885), which forms the foundation of his chronology of
the north European Bronze Age. Montelius’ book includes tables of his material which are
easy to transfer (pp. 270‐311). The book is now available online, and the text (but without
these tables) is available in English, too (Montelius 1996).
MONTELIUS, O. (1885). Om tidsbestämning inom bronsåldern. Stockholm: På Akademiens
Förlag.
(https://openlibrary.org/books/OL22888482M/Om_tidsbest%C3%A4mning_inom_brons%C3%A5ldern).
‐ (Incomplete) English translation: MONTELIUS, O. (1996). Dating in the
Bronze Age. Stockholm: Kungl. Vitterhets Historie och Antikvitets akademien.
Author
Priv. Doz. Dr. phil. Frank Siegmund
mail@frank‐siegmund.de
www.frank‐siegmund.de
http://uni‐duesseldorf.academia.edu/FrankSiegmund
* (Very) Extended version of my presentation “Archaeological chronologies based on
correspondence analysis: a practitioner's guide to success and reliability”, University of
Bologna, March 31st, 2014.
Fig. 18. Copy of Koch 1977, table 4.