Chapter 5 INFORMATION EXTRACTION

Peng Gong, University of California, Berkeley

After images are processed, they are subsequently analyzed for information extraction. Image interpretation is the general term for information extraction from remotely sensed data. It is a process in which image elements such as tone, color, shape, size, texture, pattern, shadows, site, and association are used for detection, identification, delineation, and enumeration (e.g., Lillesand and Kiefer, 1997). From a computer analysis point of view, image interpretation can be divided into three components: image analysis, understanding, and assessment. They are carried out based on the results of image processing; therefore, image interpretation usually follows image processing. The results of image analysis are features (parametric descriptors) extracted from an image, such as edgeness, contrast, local maxima and minima, texture, spectral properties, object size and shape, clustering, enumeration, distance between objects, and density. Image understanding refers to procedures involving pattern recognition based on spatial and spectral features. Image assessment involves more general and abstractive (intelligent) operations leading to purpose-driven decision-making processes such as environmental quality assessment and planning. In applications of remotely sensed data, one makes no strict distinction among these components. The goal of image interpretation is to extract information from remotely sensed data using a combination of procedures from those components. This is particularly true for human-based image interpretation.

While image interpretation techniques based on human brains have been well developed (Philipson, 1997), image interpretation by computers in an automated or semi-automated manner is still under intensive investigation. Two obvious restrictions limit the wide use of human interpretation of remotely sensed imagery. Firstly, it is hard to achieve consistent results, even when the same procedure is followed, because different human interpreters are subjective. Secondly, the human brain cannot detect subtle changes and cannot handle large volumes of data. Human interpretation is usually labor intensive and cost ineffective. For over 40 years, a large number of efforts have been devoted towards automating human image interpretation by digital computers. In this chapter, we focus on introducing the use of computers to derive meaningful information from remotely sensed data – computer based information extraction. Data not from remote sensing sources, but helpful for improving the results of information extraction, are considered ancillary data.

Strategies for computer based information extraction with remote sensing can be summarized into six categories: image classification, statistical regression, morphological measurement, shape analysis, inversion of radiative transfer models, and change detection and hypertemporal analysis. Image classification is actually a mapping process that groups image pixels into categories of information that are more general than the data, e.g., land cover or land use. It requires a classification scheme specifically designed according to an application purpose, a classification algorithm optimal for deriving the desired land categories, and a minimum mapping unit; patches of classified pixels smaller than the minimum mapping unit are merged into neighboring larger patches.
As in cartography, image classification is a generalization and abstraction process that simplifies original image data to a level that can be better used in decision making or to improve our understanding of an area of interest. In Section 5.1, we will further elaborate various image classification methods.

Statistical regression with remotely sensed data can be done in several ways. The dependent variable is the parameter to be estimated from remotely sensed data. The independent variable(s) can be brightness data from a single spectral band, a combination of bands, ratio or difference data from a number of bands, or other transformed values of brightness data such as various texture measures. The regression function can take many forms. The dependent variable ranges from physical parameters such as climatological, hydrological, biophysical, and biogeochemical variables to socio-economic parameters such as population density, family income, and environmental quality. This strategy will be further explained in Chapter 7. Statistical models are empirically built and are flexible to use. However, they need to be adjusted in time and space because they are calibrated with data collected from a specific location at a given time. For a different location, a new set of model coefficients needs to be calculated; sometimes a new model may be necessary.

Although generality and accuracy in modeling are usually two horses pulling in opposite directions, in practice, for a given accuracy, we often expect to build a model that is applicable to a larger spatial extent. To meet such an expectation, radiative transfer theory, based on quantum mechanics and modern physics, is perhaps more helpful. Chandrasekhar (1960) published the classic treatment on radiative transfer. Earlier applications of radiative transfer theory in remote sensing were made to the study of other planets in the solar system using reflectance spectroscopy and to studies of the atmosphere and water bodies on Earth (Hapke, 1993). During the past 30 years, its application has expanded to the remote sensing of snow and ice (Bohren and Barkstrom, 1974), crops (Suits, 1972), and forests (Li and Strahler, 1985). The inversion of such models results in desirable physical and biochemical parameters (Gong et al., 1999). However, little research has been done to apply these methods to human settlements.

Morphological measurement is the derivation of 3D geometrical information about objects of interest in the image. Traditionally this has been done through digital photogrammetry. Topographic elevation and the coordinates of typical landmarks are the most essential 3D geometrical information and are the primary information derived from aerial photogrammetry. Limited by spatial resolution and the cross-track (whisk-broom) scanning mechanism, early non-photographic remote sensing imagery was rarely used for deriving 3D information. Since the advent of the SPOT HRV sensors, which image the earth surface using pushbroom scanners at 10 m spatial resolution, more photogrammetric research has been done to derive digital elevation models (DEMs) from satellite data. Elevation data provided by lidars and interferometric radars have equivalent or even better accuracies when compared with those obtained with traditional aerial photogrammetric methods (Armour et al., 1998; Sun and Ranson, 2000; Naesset, 2002; Drake et al., 2002). Morphological information, particularly DEMs, can be used in radiometric correction of images, creation of 3D views for visualization purposes, and image orthorectification for mapping (Jensen, 2000). Their applications in shape analysis and image classification are rare.

Shape analysis is the analysis of linear and curvilinear features in remote sensing. Thus far, shape analysis is primarily limited to linear feature extraction such as building edges, roads, faults, and river streams. We will introduce some of the basic methods of linear feature extraction in Section 5.3.

Change detection is the examination of differences in land cover, land use, morphology, and vegetation phenology on images acquired at different times. Hypertemporal analysis refers to the analysis of a large temporal set of images acquired at high-frequency intervals, such as daily NOAA AVHRR or Terra MODIS imagery. Due to the limited resolution of high-frequency remote sensing data, hypertemporal analysis techniques are rarely used in remote sensing of human settlements. Change detection strategies for human settlement change monitoring are presented in Chapter 6.

In the context of human settlement remote sensing, the following six types of information extraction methods are particularly relevant: classification, spectral unmixing, linear feature extraction, change detection, morphological information extraction, and statistical regression.

Image classification is the most commonly used thematic information extraction approach. It is an information generalization and abstraction process. It converts radiometric data at the ratio measurement scale down to thematic classes at the nominal measurement scale (see Robinson et al., 1995, for a discussion of scales of measurement). It partitions the remotely sensed data into various classes according to their similarity in spectral and spatial aspects. In particular, land cover and land use types are mapped with image classification techniques. Land cover is the physical material on the earth surface, while land use is a cultural concept related to human activities on the land. Because the radiometric properties of earth surface materials are directly recorded by remotely sensed data, it is relatively easy to classify land cover classes such as various vegetation types, paved surfaces, and water bodies (Gong and Howarth, 1992b).
However, it is more difficult to map land use types because they appear spatially heterogeneous in physical properties (e.g., a residential area is characterized by a mixture of roof tops, paved surfaces, gardens, and trees). Because land use classes such as residential, commercial, and industrial lands are important in human settlement remote sensing, we will introduce some advanced classification techniques in this section in addition to standard classification methods such as the minimum-distance and maximum likelihood classifiers. The advanced classification algorithms use not only the spectral information in the image but also spatial information indirectly characterized by spatial measures that can be calculated from pixel neighborhoods. Texture measures introduced in the previous section can be used to characterize spatial properties of various land cover and land use types.

Decision making in image classification is traditionally done at the pixel level, i.e., no matter how similar the spectral and spatial properties of a pixel are to several classes, this particular pixel can only be classified into one class. This kind of decision is usually unfavorable to categories whose spatial extents are smaller than a pixel, resulting in underestimation of areas for such classes. Moreover, information at the subpixel level is sometimes desirable. Therefore, a number of algorithms have been developed to estimate the partial membership of a pixel in multiple classes. The algorithms are divided into two general groups: those based on statistical similarity and those based on spectral similarity. Methods based on statistical similarity include fuzzy classification algorithms (Bezdek et al., 1984). They are a natural extension of conventional classification, which is considered "hard" or "crisp" classification. Although such methods are applicable to both abstractive categories such as land use classes and less abstractive categories such as land cover types, they are statistically based and most of them are based on "guessing."

By "guessing" the subpixel proportions of various classes in a pixel, these algorithms are usually under-determined, i.e., the number of unknowns is greater than the number of equations. On the other hand, methods based on spectral similarity are usually applied to physically existing categories that are directly observable by the sensors. Therefore, spectral similarity-based methods are usually limited to land cover classes. This group of algorithms is known as spectral unmixing. They can be considered a simplified version of radiative transfer modeling. As will be seen later, these algorithms usually solve an over-determined group of equations. "Guessing" algorithms are extensions of standard classification algorithms.

Linear feature extraction refers to a different group of algorithms that uses spatial contrasts in images to derive information related to such features as field boundaries, streams, geological structures, and road networks. This group of algorithms is based on brightness contrasts at edges between adjacent features that are indicators of significant environmental objects. As relevant to remote sensing of human settlements, a road network extraction algorithm will be introduced.

Change detection is an important aspect of remote sensing of human settlements. Actually, most remote sensing tasks have some relevance to the monitoring of changes over time. This topic is discussed in Chapter 6.

Morphological information extraction refers to the derivation of 3D coordinates of surface features from remotely sensed images. This is an important area in human settlement remote sensing because 3D data on landscape features such as buildings and forests, in addition to topography, are critical to city planners and utility engineers. This information has traditionally been obtained through field survey or aerial photogrammetry. Recently, SAR interferometry and lidar techniques have been developed to derive morphological information. However, SAR interferometry has problems with horizontal positioning, and lidar technology does not produce wall-to-wall fully sampled coverage. In addition, the cost of those technologies is high. Therefore, digital photogrammetry is a viable alternative. Semi-automatic digital photogrammetry has been developed for applying photogrammetric techniques to stereopairs of aerial photographs to generate 3D models of buildings (Fraser et al., 2002) and forest landscapes (Gong, Biging, et al., 1999; Gong et al., 2002; Sheng et al., 2001). A detailed introduction of such techniques is out of scope.

Statistical regression is an important tool for estimating physical and socio-economic parameters from remotely sensed data or their derivatives, and it is widely used. Examples can be found in Chapter 6 for population estimation.

5.1 Image Classification

First we will discuss some general procedures regarding image classification. In image space I, a classification unit is defined as the image segment based on which a classification decision is made. A classification unit could be a pixel, a group of neighboring pixels or the entire image. Conventional multispectral classification techniques perform class assignments based only on the spectral signatures of a classification unit. Contextual classification refers to the use of spatial, temporal, and other related information, in addition to the spectral information of a classification unit, in the classification of an image. Usually, it is the pixel that is used as the classification unit.

General image classification procedures include (Gong and Howarth 1990b):

* Designing an image classification scheme: the classes are usually information classes such as urban, agriculture, and forest areas.

* Conducting field studies and collecting ground information and other ancillary data of the study area.

* Preprocessing images: radiometric correction, geometric and topographic corrections, image enhancement, and sometimes initial image clustering.

* Selecting representative areas on the image, analyzing the initial clustering results, or generating training signatures.

* Conducting image classification in one of the following modes. Supervised mode: using training signatures and a classification algorithm to classify an image into information classes. Unsupervised mode: image clustering, cluster grouping, and analysis of clusters for information class labels.

* Post-processing classification or clustering results: completing geometric correction, filtering, and classification decorating.

* Accuracy assessment: comparing classification results with more accurate data sources such as ground truth data obtained from field studies.

The difference between supervised and unsupervised classification lies in the analyst’s role relative to the computer’s. In supervised classification, the analyst first develops the classification scheme and selects representative areas with the aid of various reference sources such as more precise images or field notes. The computer is then used to “learn” the statistical patterns from the representative areas and classify the rest of the image of interest. This “learning” is known as supervised training. This procedure is more subjective and can be regarded as a “goal-driven” approach. It is suitable for situations where the image analyst knows the study area well and knows what is to be extracted. It is possible that certain classes may be omitted even though they are well distinguishable in the image under study. In unsupervised classification, algorithms are used by the computer to make an initial exploration of the image. The computer comes up with initial clusters corresponding to groups of pixels with similar spectral or spatial properties; the analyst then investigates each individual cluster and labels it as a meaningful information class. This procedure starts from the data and can therefore be considered a “data-driven” approach. In unsupervised classification, the risk of omitting important land cover classes is lower than with the supervised approach. It is more suitable when the image analyst is less familiar with the study area. Sometimes an image analyst may begin with unsupervised classification and, as knowledge about the image area increases, choose supervised classification to complete the mapping project. Therefore, there are hybrid classification strategies involving both unsupervised and supervised classification. The major steps involved in the supervised and unsupervised classification strategies are shown in the following:

Supervised:

Image → Supervised Training → Pixel Labeling → Accuracy Assessment

Unsupervised:

Image → Clustering & Cluster Analysis → Cluster Grouping & Labeling → Accuracy Assessment

Richards and Lee (1984) distinguish two types of classes:

Information class: a class specified by an image analyst, referring to the information to be extracted.

Spectral class: a class which includes similar gray-level vectors in the multispectral space.

From the above definitions we can see that spectral classes are relatively simple and less abstractive than information classes. Spectral classes can be easily associated with land cover classes, while information classes may be associated with land use classes. In an ideal information extraction task, we can directly associate a spectral class in the multispectral space with an information class. For example, suppose we have three classes in a 2D space: water, vegetation, and concrete surface (Figure 5-1). By defining boundaries among the three groups of gray-level vectors in the 2D space, we can separate the three classes.

Figure 5-1. Spectral classes in the NIR and R space that can be directly associated with information classes to be extracted. [B/W]

For supervised classification, an image analyst begins by specifying an information class on the image. A supervised training algorithm is then used to summarize multispectral information from the specified areas to form class signatures. For the unsupervised case, an algorithm is first applied to the image and some spectral classes (clusters) are formed. The image analyst then tries to assign each spectral class to a desired information class. Clearly, it is easier to classify an image into spectral classes. In most classification cases for human settlement studies, information classes and spectral classes do not have a one-to-one relationship. This makes it difficult to derive information classes by simply grouping spectral classes. As will be shown later, some contextual classification algorithms can be used to convert spectral classes into information classes based on the spatial occurrence of spectral classes in each information class (Gong and Howarth, 1992a; 1992b).

5.1.1 Conventional Multispectral Classification Methods

We will introduce six conventional methods of image classification, followed by a discussion of some special considerations for conventional methods, including accuracy assessment. In conventional pixel-labeling algorithms, an image pixel is assigned to an information class according to certain criteria. This can be illustrated using the previous diagram with a pixel of unknown class (Figure 5-2).

Figure 5-2. A pixel with a gray-level vector is positioned among three classes. The purpose of pixel labeling is to determine the class membership of the pixel. [B/W]

There are two obvious ways to classify this pixel: multidimensional thresholding and minimum-distance classification. However, the most commonly used traditional classification algorithm is the maximum likelihood classification method. These three lead the list of six methods that follows:

(1) Multidimensional thresholding

We define two threshold values along each axis for each class (Figure 5-3). A gray-level vector is classified into a class only if it falls between the thresholds of that class along each axis. The advantage of this method is its simplicity. The drawback is that the specified class thresholds do not cover all possible gray-level vectors, so some pixels remain unclassified. There are different ways of selecting thresholds. As shown in Figure 5-3, the borders of the minimum bounding rectangle of each class can be chosen as the thresholds. Alternatively, the standard deviation of each class along an axis can be used to calculate the thresholds: the thresholds along each axis are determined by adding and subtracting a certain number of standard deviations from the class center.

Figure 5-3. Multidimensional thresholding as a simple method of classification. Tw1 and Tw2, Tc1 and Tc2, and Tv1 and Tv2 are the threshold ranges along each axis for the water body, concrete, and vegetated land classes, respectively. [B/W]
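To make the thresholding rule concrete, here is a minimal Python sketch of a box classifier in the spirit of Figure 5-3. The class names, the synthetic band values, and the k-standard-deviation rule for setting thresholds are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def fit_thresholds(samples, k=2.0):
    """Derive per-band thresholds as mean +/- k standard deviations
    from the training samples of one class (n_pixels x n_bands)."""
    center = samples.mean(axis=0)
    spread = k * samples.std(axis=0)
    return center - spread, center + spread

def box_classify(pixel, thresholds):
    """Assign the first class whose box contains the pixel;
    return None (unclassified) if no box contains it."""
    for label, (low, high) in thresholds.items():
        if np.all(pixel >= low) and np.all(pixel <= high):
            return label
    return None  # gray-level vector falls outside every class box

# Hypothetical two-band (R, NIR) training data for three classes.
rng = np.random.default_rng(0)
training = {
    "water":      rng.normal([20, 10], 3, (50, 2)),
    "vegetation": rng.normal([30, 90], 5, (50, 2)),
    "concrete":   rng.normal([80, 70], 6, (50, 2)),
}
thresholds = {c: fit_thresholds(s) for c, s in training.items()}
print(box_classify(np.array([31, 88]), thresholds))  # -> vegetation (with these samples)
```

Returning None for pixels outside every box illustrates the stated drawback: the thresholds do not cover all possible gray-level vectors.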

(2) Minimum-distance classification

Distance can be used as a similarity measure for classification: the closer two points are in the multispectral space, the more likely they are to belong to the same class. One can use various types of distance as similarity measures to develop a classifier, i.e., a minimum-distance classifier.

In a minimum-distance classifier, suppose we have nc known class centers obtained from supervised training, C = {c1, c2, ..., cnc}, where ci (i = 1, 2, ..., nc) is the center gray-level vector for class i:

ci = (DNi1, DNi2, ..., DNinb)T in digital number form, or
ci = (ri1, ri2, ..., rinb)T in spectral reflectance form. (5-1)

As an example, we show a special case in Figure 5-4, where we have 3 classes (nc = 3) and two spectral bands (nb = 2). Given a pixel with a gray-level vector located at A in the B1-B2 space, the distances between A and each of the centers can be calculated. A is assigned to the class whose center has the shortest distance to A.

Figure 5-4. Minimum-distance classification of A (an empty dot). The three solid lines partition the multispectral space into territories, each belonging to a class. A point falling into the territory of a class belongs to that class.

In general form, an arbitrary pixel has a gray-level vector g = (g1, g2, ..., gnb)T.

This pixel is classified as ci if

d(ci, g) = min (d(c1,g), d(c2,g), ..., d(cnc,g)) (5-2)

The most popularly used distance is the Euclidean distance

de(ci, g) = [(ci – g)T(ci – g)]0.5 (5-3)

However, the Euclidean distance is sensitive to the scale of measurement. If gray-level values in one spectral band are significantly greater than those in other bands, the large gray-level values will dominate the classification decision. Another popularly used distance is the Mahalanobis distance

dm(ci,g) = [(g - ci)T V-1(g - ci)]0.5 (5-4)

where V-1 is the inverse of the covariance matrix of the data. Through V-1, the relative differences of gray-level values in different bands are normalized. Though more complicated than the Euclidean distance, it is scale independent. If the Mahalanobis distance is used, the classifier is also called a Mahalanobis classifier.

The simplest distance measure is the city-block distance, also called the Manhattan distance:

dc(ci, g) = Σ{j=1..nb} |cij - gj| (5-5)

Class centers ci and the data covariance matrix V are usually determined from training samples if a supervised classification procedure is followed. They can also be obtained from clustering. Their calculation methods can be found in Section 4.4.3.
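A minimal sketch of equations (5-2) to (5-5) follows. The class centers are invented for illustration; the Mahalanobis variant would take the inverse covariance matrix estimated from training, as described above.

```python
import numpy as np

def euclidean(c, g):
    # Equation (5-3): de = [(c - g)^T (c - g)]^0.5
    return np.sqrt((c - g) @ (c - g))

def mahalanobis(c, g, V_inv):
    # Equation (5-4): dm = [(g - c)^T V^-1 (g - c)]^0.5
    d = g - c
    return np.sqrt(d @ V_inv @ d)

def city_block(c, g):
    # Equation (5-5): dc = sum over bands of |c_j - g_j|
    return np.abs(c - g).sum()

def min_distance_label(g, centers, dist):
    # Equation (5-2): assign g to the class whose center is nearest
    return min(centers, key=lambda label: dist(centers[label], g))

# Hypothetical class centers in a two-band (R, NIR) space.
centers = {"water": np.array([20.0, 10.0]),
           "vegetation": np.array([30.0, 90.0]),
           "concrete": np.array([80.0, 70.0])}
g = np.array([35.0, 80.0])
print(min_distance_label(g, centers, euclidean))  # -> vegetation
```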

(3) Maximum likelihood classification (MLC)

MLC is based on Bayes' rule. For a given pixel with gray-level vector x, the probability that x belongs to class ci is P(ci|x), i = 1, 2, ..., nc. If P(ci|x) is known for every class, then x is classified into the class with the greatest probability, which can be found by comparing P(ci|x), i = 1, 2, ..., nc:

x belongs to ci, if P(ci|x) ≥ P(cj|x) for all j ≠ i

However, P(ci|x) is not known directly. Bayes' theorem provides a way to assess P(ci|x):

P(ci|x) = p(x|ci) • P(ci)/P(x) (5-6)

where P(ci) is the probability that ci occurs in the image, called the a priori probability, and P(x) is the probability of x occurring over all classes. P(x) is not needed for classification because it cancels from both sides of a comparison. The conditional probability densities p(x|ci), i = 1, 2, ..., nc, must be determined; they can be obtained from the training samples. At this point, either a parametric or a non-parametric approach can be taken. With the parametric approach, the conditional probability is estimated through statistical modeling, by assuming that the conditional probability distribution function (PDF) is normal (also called a Gaussian distribution). If the PDF and the a priori probability can be defined for each class, the classification problem can be solved. For the one-dimensional case, the probability distributions can be obtained by generating training statistics from training samples. What is required is the class mean vector, mi, and the class covariance matrix, Vi. The one-dimensional Gaussian distribution is:

p(x|ci) = [1 / ((2π)1/2 δi)] exp{-(x - mi)2 / (2δi2)} (5-7)

where only two parameters, mi and δi, are required for each class i = 1, 2, ..., nc; δi is the standard deviation of ci. For higher dimensions, the following equation can be used:

p(x|ci) = [1 / ((2π)nb/2 |Vi|1/2)] exp{-(1/2)(x - mi)T Vi-1 (x - mi)} (5-8)

For the non-parametric case, p(x|ci) can be approximated by the occurrence frequency of x in class ci, f(x|ci), which can be obtained by enumerating the training samples (Gong and Dunlop, 1991).

P(ci) can also be determined with knowledge about an area. If the priors are not known, an equal chance of occurrence is assumed, i.e., P(c1) = P(c2) = ... = P(cnc). With knowledge of p(x|ci) and P(ci), maximum likelihood classification can be done: p(x|ci) · P(ci), i = 1, 2, ..., nc, can be compared for classification instead of P(ci|x).

Figure 5-5. Classification decision boundary between two classes in maximum likelihood classification. [B/W]

The interpretation of the maximum likelihood classifier is illustrated in Figure 5-5. An x is classified according to the maximum p(x|ci) · P(ci): x1 is classified into c1, and x2 is classified into c2. The class boundary is determined by the point of equal probability. In two and higher dimensions, the classification decision is made by comparing p(x|ci) · P(ci). When p(x|ci) · P(ci) are compared, the logarithm of p(x|ci) · P(ci) is taken:

log{p(x|ci) · P(ci)} = -(nb/2) log 2π - (1/2) log|Vi| - (1/2)(x - mi)T Vi-1 (x - mi) + log P(ci) (5-9)

Since -(nb/2) log 2π is a constant and P(ci) is often the same for all classes, the right-hand side can be simplified to gi(x):

gi(x) = -log|Vi| - (x - mi)T Vi-1 (x - mi) (5-10)

gi(x) is referred to as the discriminant function. Due to the lack of knowledge of the a priori probabilities P(ci), MLC is often reduced to comparing gi(x). If all the Vi are the same, gi(x) degrades to the Mahalanobis classifier. With MLC, if p(x|ci) is normally distributed and P(ci) is known, the error of misclassification is guaranteed to be minimal. Unfortunately, the normal distribution cannot always be achieved. MLC is relatively robust, but it is limited when handling data at nominal or ordinal scales, and its computational cost increases considerably as the image dimensionality increases.

In order to make the best use of MLC, one has to make sure that classes are close to normally distributed. While this may be achievable for spectral classes (or land cover classes), it does not work for land use classes, as they tend to have multi-modal distributions. In image classification, it is critical to have representative training samples, and the number of training samples in each class is also important. In the era of 4-band Landsat MSS imagery, the number of parameters to be estimated per class was 10 for the variance-covariance matrix and 4 for the mean vector. Swain (1978, p. 151) suggested that a training size of 10 to 100 times nb would be suitable in order to generate sufficiently accurate estimates for each class. Nowadays, we have remote sensing data with as many as 200 to 300 spectral bands. With a 300-band image, if all the bands are used in classification, the number of parameters to be estimated would be 45,450 (300×(300+1)/2 + 300). Therefore, the number of training samples usually needs to be rather large if all bands are used. Since many bands in imaging spectrometers are correlated, it is possible to select bands for a particular classification task.
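The following sketch implements the discriminant function gi(x) of equation (5-10), with an optional prior term taken from equation (5-9) (scaled by 2, since gi is twice the log-likelihood up to constants). The two-band training statistics are invented for illustration.

```python
import numpy as np

def discriminant(x, mean, cov, prior=None):
    """g_i(x) of equation (5-10): -log|V_i| - (x - m_i)^T V_i^-1 (x - m_i),
    optionally adding 2 log P(c_i) when the prior is known."""
    d = x - mean
    g = -np.log(np.linalg.det(cov)) - d @ np.linalg.solve(cov, d)
    if prior is not None:
        g += 2.0 * np.log(prior)
    return g

def mlc_label(x, stats, priors=None):
    """Assign x to the class with the greatest discriminant value.
    stats maps class -> (mean vector, covariance matrix)."""
    scores = {c: discriminant(x, m, V, priors.get(c) if priors else None)
              for c, (m, V) in stats.items()}
    return max(scores, key=scores.get)

# Hypothetical two-band training statistics for two classes.
stats = {"water": (np.array([20.0, 10.0]), np.eye(2) * 9.0),
         "vegetation": (np.array([30.0, 90.0]), np.eye(2) * 25.0)}
print(mlc_label(np.array([28.0, 70.0]), stats))  # -> vegetation
```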

(4) Non-parametric Bayesian classifier

As mentioned earlier, the difference between a non-parametric classifier and MLC is the estimation of the probability density function p(x|ci) by the occurrence frequency f(x|ci). For class ci, the frequency in the m-dimensional space can be estimated from the training samples:

f(x|ci) = ni(x)/Ni, i = 1, 2, ..., nc (5-11)

where ni(x) is the number of samples with vector x and Ni is the total number of training pixels for class i. Following Bayes' theorem, the a posteriori probability of class i given x, P(ci|x), is evaluated by:

P(ci|x) = [ni(x)/Ni] · P(ci) / P(x) (5-12)

Similar to MLC, P(x) can be omitted in the classification. For each class, a look-up table can be constructed using training samples, with an entry for every possible vector x giving the conditional a posteriori probability P(ci|x). For any x, all the a posteriori probabilities can then be taken from the look-up tables and compared; x is assigned to the class with the greatest probability. Vectors in the m-dimensional space that were not sampled during training of any class are labeled "unclassified".
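A minimal sketch of the look-up-table approach of equations (5-11) and (5-12), assuming quantized gray-level vectors and, unless priors are supplied, an equal a priori probability for every class:

```python
from collections import Counter

def build_lut(training):
    """One frequency table per class: f(x|c_i) = n_i(x) / N_i
    (equation 5-11). Gray-level vectors serve as dictionary keys."""
    lut = {}
    for label, samples in training.items():
        counts = Counter(tuple(v) for v in samples)
        total = len(samples)
        lut[label] = {x: n / total for x, n in counts.items()}
    return lut

def np_bayes_label(x, lut, priors=None):
    """Compare f(x|c_i) * P(c_i) across classes (P(x) cancels);
    vectors never seen in training are labeled 'unclassified'."""
    x = tuple(x)
    best, best_score = "unclassified", 0.0
    for label, table in lut.items():
        p = table.get(x, 0.0) * (priors[label] if priors else 1.0)
        if p > best_score:
            best, best_score = label, p
    return best

# Hypothetical quantized two-band training vectors.
training = {"water": [(2, 1), (2, 1), (2, 2)],
            "vegetation": [(3, 9), (3, 8), (2, 1)]}
lut = build_lut(training)
print(np_bayes_label((2, 1), lut))  # water: f = 2/3 beats vegetation: f = 1/3
```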

The advantage of this algorithm lies in the fact that it does not require any assumption about the probability density function of a class.

The class probability density function is exactly the density of the class training sample. Since the density can take any form, the algorithm is not limited to the Gaussian probability density, as is the case with MLC. One problem associated with this algorithm is that it requires large look-up tables when m is large. To employ it, one must either reduce the variability within each data source or use fewer sources of data (Duda et al., 2001). A second problem is that the training sample must be large in order to be representative. In this sense, it is less robust than the parametric MLC.

(5) Classification based on evidential theory

Let a vector X = (x1, x2, ..., xn) denote a set of observations or measurements made at a particular location. X is a set of features or n pieces of evidence. Classification can be considered as a multivalued mapping, Γ: E → 2C, that associates each element X in E with a set of elements in 2C. E is the feature space, also called the observation space or evidence space; C = {C1, C2, ..., Ck} is the class space, whose elements are mutually exclusive; and 2C is the universe of discourse or the frame of discernment, i.e., the set that contains all possible sets consisting of elements in C and the empty set φ.

To realize the mapping Γ: E → 2C, in classification it is often simplified to Γ: E → C. We use C instead of 2C because our interest is focused on the individual elements in C, i.e., the case of singleton hypotheses. For example, our purpose usually is to find the probabilities of each individual class in C = {Urban, Agriculture, Forest, Water}: P(U), P(A), P(F), and P(W); we are not interested in knowing P({U, A}), P({A, F, W}), etc.

In evidential theory (Shafer, 1976), a basic probability assignment (bpa) of C, denoted by m: C → [0, 1], is defined as:

m(A) = Σ{i: f(xi) = A} p(xi) (5-13)

where f is the mapping function from a subspace of E to C, A is a subset of C called a focal element, and p(xi) is the probability density of xi in a subspace of E. The "bpa" is also referred to as a mass function to distinguish it from the probability distribution. A mass function has the following properties:

Σ{A⊂C} m(A) = 1, m(φ) = 0

The probability distribution of C can be estimated by the mass function. Since the precise probability distribution of C may not be known exactly, in evidential theory bounds on the probability distribution are defined. The lower and upper probabilities of a subset B of C are denoted as B's belief measure Belm(B) and plausibility measure Plsm(B), respectively. They can be determined from the mass function as follows:

Belm(B) = Σ{A⊂B} m(A) (5-14)

Plsm(B) = Σ{A∩B≠φ} m(A) (5-15)

Generally, Belm(B) ≤ Plsm(B) and, therefore, somewhere in the belief interval [Belm(B), Plsm(B)] lies the true probability of B. In evidential theory, Belm(B) indicates the amount of belief committed to B based on the given piece of evidence, while Plsm(B) represents the maximum extent to which the current evidence allows one to believe B.

If m1 and m2 are two mass functions of C induced by two mappings, Γ1: E1 → C and Γ2: E2 → C, where E1 and E2 are independent sources of evidence, then the combined mass function, denoted by m1 ⊕ m2, can be calculated using Dempster's rule of combination:

m1 ⊕ m2 (D) = [Σ{Ak∩Bl=D} m1(Ak) · m2(Bl)] / [1 - Σ{Ak∩Bl=φ} m1(Ak) · m2(Bl)] (5-16)

where the combination operator "⊕" is called the "orthogonal sum", D ⊂ C and D ≠ φ, and m1 ⊕ m2(φ) = 0. Using the orthogonal sum, one can update the beliefs and plausibilities in space C with additional sources of evidence. If a piece of evidence from a third source is given, we can treat m1 ⊕ m2 as m1 or m2 and apply m3 in the same manner as we combined m1 and m2. Since the operator "⊕" is commutative and associative, the order of applying the orthogonal sum does not affect the final results. A number of applications of evidential theory can be found in expert system development (e.g., Gordon and Shortliffe, 1985; Shafer and Logan, 1987; Kruse et al., 1991) and in knowledge-based image analysis (Goldberg et al., 1985; Srinivasan and Richards, 1990; Kontoes et al., 1993).

To illustrate the orthogonal sum, consider data from two spectral bands as two independent sources for classification. The "bpa" values from the mass functions are listed in Table 5-1. None of the row-wise sums in Table 5-1 equals 1; the residual, 1 - m(U) - m(A) - m(F) - m(W), treated as the degree of ignorance, is denoted by I. To calculate m1 ⊕ m2(D), where D ⊂ C = {U, A, F, W}, we illustrate the procedure using Table 5-2. Table 5-2 is divided into two parts: the top part is used for calculating the mass products, and the bottom part is devoted to the orthogonal sum and the beliefs and plausibilities.

Table 5-1. Basic probability assignment values for a set of evidence from two spectral bands for the classification of four classes

         Urban (U)   Agriculture (A)   Forest (F)   Water (W)
Band 1     0.2           0.3              0.3          0.1
Band 2     0.1           0.3              0.4          0.0

A requirement for the use of evidential theory is that evidence sources are independent of each other (Shafer, 1976). This independence is, however, not solely a statistical one. Two highly correlated sources of data may seem redundant in a statistical sense but can improve our confidence in the classification results obtained from evidential reasoning. For the sake of simplicity, and for lack of a way of verifying evidential independence, researchers often use all sources available (Lee et al., 1987). Alternatives for reducing statistical dependencies are (1) to treat each individual source, Ei, as a component in a subspace of E rather than an independent source, and (2) to decorrelate multisource data through principal component analysis or factor analysis (Durrand and Kerr, 1989). To apply evidential theory to a classification problem, the following steps are needed:

1. Determine the probability distribution pij(xi) of Cj for each evidential source Ei (note that this is different from algorithms based on Bayes' theory, in which construction of a mapping between the entire feature space E and an individual class Cj is usually required). Each band of an image or a digital map can be considered an individual evidential source. For example, the histogram of the ith image can be used to approximate a probability distribution denoted by pij(DN), where DN ∈ {0, 1, ..., 255} for an 8-bit image.

2. Determine the mass function of each class Cj, j = 1, ..., k, for each evidential source Ei (Yen, 1989):

mi(Cj) = Σ{DN: f(DN) = Cj} pij(DN) (5-17)

where f(DN) = Cj defines a mapping between value DN in evidential source Ei and class Cj. If Cj is a singleton class, mi(Cj) = pij(DN).

3. Combine the mass functions from two evidential sources using equation (5-16). This formula can be used iteratively, one source at a time until "bpa" values from all sources are combined.

4. Determine the belief interval for each class Cj. Assuming the combined mass function, m, has been obtained from step 3, the belief interval can be calculated using equations (5-14) and (5-15).

5. Classification of a set of evidence or observations and measurements, X = (x1, x2, ..., xn), can be based on either the total belief or the total plausibility (Lee et al., 1987).

Unlike classification algorithms based on Bayes' theory (see Gong and Dunlop, 1991), algorithms based on evidential theory are restricted neither by the dimensionality of the spatial data nor by the number of data sources. Therefore, a non-parametric model can be employed with no dimensionality limitation. Since the lack of certain data sources only reduces the number of times that equation (5-16) is employed, the evidential reasoning algorithm is, to some extent, tolerant of incomplete data coverage. Moreover, the evidential reasoning algorithm does not differentiate between prior and posterior probabilities; it does not require any prior probabilities to be known explicitly. Therefore, the evidential reasoning algorithm has fewer limitations than the Bayesian algorithms.

Table 5-2. Components for calculating Dempster's orthogonal sum. The top part lists the mass products m1(G)·m2(H), each labeled by the intersection G∩H; the bottom part gives the orthogonal sum and the resulting beliefs and plausibilities.

                                Band 1
                 U 0.2     A 0.3     F 0.3     W 0.1     I 0.1
Band 2  U 0.1   U 0.02    φ 0.03    φ 0.03    φ 0.01    U 0.01
        A 0.3   φ 0.06    A 0.09    φ 0.09    φ 0.03    A 0.03
        F 0.4   φ 0.08    φ 0.12    F 0.12    φ 0.04    F 0.04
        I 0.2   U 0.04    A 0.06    F 0.06    W 0.02    I 0.02

                               U          A          F          W          I
Σ{G∩H=D} m1(G)m2(H)          0.07       0.18       0.22       0.02       0.02

1 - Σ{G∩H=φ} m1(G)m2(H) = 1 - 0.49 = 0.51

m1 ⊕ m2                  0.07/0.51  0.18/0.51  0.22/0.51  0.02/0.51  0.02/0.51
                           = 0.14     = 0.35     = 0.43     = 0.04     = 0.04
Bel(m1 ⊕ m2)                0.14       0.35       0.43       0.04       0.04
Pls(m1 ⊕ m2)                0.18       0.39       0.47       0.08       0.04

(The Band 2 row for W, with mass 0.0, contributes only zero products and is omitted.)
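The numbers in Table 5-2 can be reproduced with a short routine implementing Dempster's rule (equation 5-16) for singleton classes plus an ignorance term I standing for the whole frame. This is a sketch of the worked example only, not a general implementation over all of 2C.

```python
def dempster(m1, m2):
    """Combine two mass functions over singleton classes plus
    ignorance 'I' (the whole frame); equation (5-16)."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            if a == b or b == "I":
                d = a                  # intersection is the singleton a (or I)
            elif a == "I":
                d = b
            else:
                conflict += ma * mb    # disjoint singletons -> empty set
                continue
            combined[d] = combined.get(d, 0.0) + ma * mb
    return {d: v / (1.0 - conflict) for d, v in combined.items()}

band1 = {"U": 0.2, "A": 0.3, "F": 0.3, "W": 0.1, "I": 0.1}
band2 = {"U": 0.1, "A": 0.3, "F": 0.4, "W": 0.0, "I": 0.2}
m12 = dempster(band1, band2)
# Belief of a singleton equals its combined mass; plausibility adds m(I).
for c in ("U", "A", "F", "W"):
    print(c, round(m12[c], 2), round(m12[c] + m12["I"], 2))
# Reproduces the Bel/Pls rows of Table 5-2: 0.14/0.18, 0.35/0.39, 0.43/0.47, 0.04/0.08
```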

Training in the Evidential Reasoning Algorithm: Moon (1990) intuitively assigned probabilities based on expert knowledge and suggested that a more systematic and quantitative approach be established. A combination of parametric and non-parametric modeling can be used in the construction of mass functions, based on the following methods:

• occurrence-frequency table. Occurrence frequency densities can be estimated for data obtained at any measurement scale. For data at nominal or ordinal measurement scales, this is the only method that can be used. The frequency table is built with its row entries being the values a data source may take and each of its columns corresponding to a class;

• normal distribution model. Similar to MLC, a normal distribution model can be built for each class for those data sources that are acquired at the interval and ratio scales.

If the occurrence-frequency table method is applied to data at interval and/or ratio measurement scales, interpolation or extrapolation methods can be used to adjust the occurrence table to fill some of the gaps in the feature space caused by under-sampling in the training data. When training samples are too few or do not exist, one may use fuzzy set theory to establish mass functions. With fuzzy set theory (Zadeh, 1965), expert knowledge can be encoded using fuzzy membership functions (Mulder and Corns, 1993; Zhu and Band, 1994). To do so, one needs to determine the fuzzy membership function on each source Ei, i = 1, 2, ..., n, for each Cj, j = 1, 2, ..., k; thus, a total of n × k membership functions need to be found. Fuzzy membership functions can then be normalized or transformed using other methods so as to meet the requirements of mass functions. It should be noted that one of the advantages of evidential theory is that it allows training to be conducted in hierarchical classification problems. This facet, however, will not be explored here.

(6) Artificial neural network classification

A network of elemental processors arranged and connected in a feed-forward manner reminiscent of biological neural nets can be used to classify a set of observations, X = [x1, x2, ..., xn]T, from n different sources and label it with a class cj ∈ C = {c1, c2, ..., ck}. Rumelhart et al. (1986) developed a generalized delta rule (GDR) for supervised training of a neural network based on error back propagation.

The architecture of a layered net with feed-forward capability is shown in Figure 5-6. The basic elements are nodes and links. Nodes are arranged in layers, and each node is a processing element.

Figure 5-6. The structure of a feed-forward error back-propagation artificial neural network, with an input layer, a hidden layer, and an output layer. [B/W]

Each input node accepts a single value corresponding to an element in X. Each node generates an output value; depending on the layer in which a node is located, its output may be used as the input for all nodes in the next layer. The links between nodes in successive layers carry weight coefficients; for example, ωji is the weight of the link between a node in layer i and a node in the successive layer j. The number of hidden layers can be greater than 1. In the output layer, each node corresponds to a single class in C and generates the membership value vk of that class. Each node, except those in the input layer, is an arithmetic unit: it takes the inputs from all the nodes of its previous layer and uses the linear combination of those input values as its net input. For a node in layer j, the net input is

uj = Σi ωji · xi (5-18)

The output of the node in layer j is

Oj = f(uj) (5-19)

where f is an activation function that often takes the form of a sigmoidal function,

Oj = 1 / (1 + exp(-(uj + θj))) (5-20)

where θj serves as a threshold or bias. This function allows each node to react to an input differently: some nodes may be easily activated and generate a high output value when θj is large, while a node with a small θj responds more slowly to the input uj. This is considered analogous to the human neural system, where neurons are activated by different levels of stimuli. Such a feed-forward network requires a single set of weights and biases that satisfies all the input-output pairs presented to it. The input is a set of observations and the outputs are the desired class membership values Vp = {vp1, vp2, ..., vpk}. The process of obtaining the weights and biases is network learning, which is essentially the same as supervised training. During network training, the elements of a set of observations Xp = {xp1, xp2, ..., xpm} correspond to the nodes in the input layer. For the given input Xp, we require the network to adjust the weights of all the connecting links and the thresholds in the nodes such that the desired outputs are obtained. Once this adjustment has been accomplished, another pair of Xp and Vp is presented, and the network is asked to learn that association as well. In general, the output from the net, Op = {opq}, will not be the same as the desired values Vp. For each Xp, the squared error is

εp = Σ{q=1..k} (vpq - opq)2 (5-21)

where k is the number of classes and the average system error is

ε = (1/nt) Σ{p=1..nt} Σ{q=1..k} (vpq - opq)2 (5-22)

where nt is the number of training pairs. The adjustment of weights and thresholds is accomplished by repetitively feeding the network the X and V pairs in sequence and constantly modifying the weights and thresholds using the generalized delta rule (GDR). With GDR, the correct set of weights is obtained by varying the weights in a manner calculated to reduce the error εp as rapidly as possible. In general, different results will be obtained depending on whether εp or ε is used during training based on error back propagation (Pao, 1989). The convergence of εp with improved values of weights and thresholds is achieved by taking incremental changes proportional to the partial derivatives of (5-21). For weight adjustment, this is done by modifying weights with an increment proportional to -∂εp/∂ωji, i.e., with an adjustment of µ(-∂εp/∂ωji). Starting at the output layer, GDR propagates the "error" backward to earlier layers. Thresholds θj are learned in the same manner as the weights. The parameter µ is a small positive number determined experimentally and usually fixed during each training process. Our experience suggests that large differences in range from one data source to another make it harder to select µ. When the input data are converted to the range [0, 1], it is easier to find an appropriate µ and make the network learn faster. Data range conversion can be achieved by finding the maximum and minimum in each channel and applying the following linear transformation to the original data:

new data value = (original data - minimum) / (maximum - minimum) (5-23)

This is similar to the data normalization suggested in Azimi-Sadjadi et al. (1993) and Freeman and Skapura (1991). Details on the learning algorithm can be found in various texts (e.g., Pao, 1989; Freeman and Skapura, 1991). Although a three-layer network can form arbitrarily complex decision regions, difficult learning tasks can sometimes be simplified by increasing the number of internal layers (Pao, 1989). On the other hand, if too many layers, or too many nodes in a layer, were used, the network would require much more computation and might lose the ability to generalize. Since the nodes in the same layer of a feed-forward network are independent of each other, they can be implemented in parallel. Algorithms similar to the one explained above have been applied to land cover classification of remote sensing data (Civco, 1993; Dreyer, 1993). Other neural network algorithms have also been developed and applied to remote sensing image classification (e.g., Benediktsson et al., 1993; Salu and Tilton, 1993).
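A minimal sketch of equations (5-18) to (5-23) for a one-hidden-layer network trained with the generalized delta rule; the layer sizes, the learning rate µ, and the training pairs are illustrative assumptions.

```python
import numpy as np

def sigmoid(u, theta):
    # Equation (5-20): node output with bias/threshold theta
    return 1.0 / (1.0 + np.exp(-(u + theta)))

rng = np.random.default_rng(0)
n_in, n_hid, n_out, mu = 2, 4, 2, 0.5       # assumed sizes and learning rate
W1, t1 = rng.normal(0, 0.5, (n_hid, n_in)), np.zeros(n_hid)
W2, t2 = rng.normal(0, 0.5, (n_out, n_hid)), np.zeros(n_out)

# Hypothetical training pairs: inputs normalized to [0, 1] (equation 5-23)
# and desired class membership vectors V_p.
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]])
V = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

for _ in range(2000):                        # repeated presentation of the pairs
    for x, v in zip(X, V):
        h = sigmoid(W1 @ x, t1)              # equations (5-18)-(5-19), hidden layer
        o = sigmoid(W2 @ h, t2)              # output layer
        # Generalized delta rule: propagate the error of (5-21) backward.
        d_out = (v - o) * o * (1 - o)
        d_hid = (W2.T @ d_out) * h * (1 - h)
        W2 += mu * np.outer(d_out, h);  t2 += mu * d_out
        W1 += mu * np.outer(d_hid, x);  t1 += mu * d_hid

o = sigmoid(W2 @ sigmoid(W1 @ X[0], t1), t2)
print(o.round(2))  # membership values close to [1, 0] after training
```

Note that the thresholds are learned with the same update rule as the weights, exactly as the text describes.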


In classification, expert knowledge of the spatial location of classes is used to train the algorithms. To do so, we transfer this type of knowledge into a computer system. The processes of collecting and encoding expert knowledge are referred to as knowledge acquisition and knowledge representation, respectively. While various complex computer structures for knowledge representation may be used, relatively simple procedures are often employed, such as parametric statistical models (Swain, 1978; Richards and Jia, 1999) or non-parametric look-up tables (Duda et al., 2001).

Training in the Neural Network Algorithm

In contrast to the evidential theory based algorithm, the GDR training process in the neural network algorithm encodes knowledge through the weights and thresholds associated with each node of the net. Explicit modeling of data sources is not required in the neural network method; in addition, there is no need to treat the data sources independently (Benediktsson et al., 1993). The training process is computationally intensive, however: it requires many training samples and many iterations, and it is usually terminated when the system error calculated from equation (5-22) falls below a preset value. The progress of training can be monitored periodically by applying the feed-forward calculation through the network to classify the training samples as well as some independent test samples. The system errors calculated for the training and test samples can be plotted against the number of iterations to check the performance of the network.

Uncertainty Measures

An advantage of evidential theory over probability is its ability to express ignorance. The commitment of belief to a subset B does not force the remaining

belief to be committed to its complement, i.e., Bel_m(B) + Bel_m(B̄) ≤ 1. The amount of belief committed to neither B nor B's complement is the degree of ignorance.

5.1.2 Special considerations in conventional image processing

Subpixel image classification. This section, in combination with Section 5.2, covers some of the methods for estimating the area proportions of the various surface cover types within each pixel. As introduced at the beginning of the information extraction section, there are two types of subpixel image classification methods: the statistically based, under-determined approach and the physically based, over-determined approach. We introduce the under-determined approach here. These methods are extensions of image classification in which, instead of one class being stored or mapped for each pixel, the proportions f = (f1, f2, ..., fnc) of the different surface cover types or components are stored. The proportions satisfy

f_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{n_c} f_i = 1        (5-24)

This type of analysis is also called unmixing or pixel decomposition. The vector f can be estimated by a number of methods. To estimate f from the distance d(c_i, g) between a pixel gray-level vector g and the class centers, as in minimum distance classification, the following can be used:

f_i = \frac{1 / (1 + d(c_i, g))}{\sum_{j=1}^{n_c} 1 / (1 + d(c_j, g))}        (5-25)

This is an inverse-distance approach; the 1 added to the distance prevents division by zero when the distance is 0. The a posteriori probabilities P(c_i|g) in MLC can be regarded as the f_i. The probability densities p(g|c_i) must be normalized to sum to one over the classes in order to serve as a surrogate for P(c_i|g). Similarly, the outputs o_i, i = 1, 2, ..., nc, from a neural network can be normalized to estimate f:

f_i = \frac{o_i}{\sum_{j=1}^{n_c} o_j}        (5-26)
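Both estimators are straightforward to compute. The following is a minimal sketch (NumPy; `centers` and `g` are illustrative names for the class mean vectors and a pixel gray-level vector):

```python
import numpy as np

def fractions_from_distances(g, centers):
    """Inverse-distance proportion estimates, equation (5-25).
    g: pixel gray-level vector (m,); centers: class mean vectors (nc, m)."""
    d = np.linalg.norm(centers - g, axis=1)   # d(c_i, g)
    w = 1.0 / (1.0 + d)                       # the added 1 avoids division by zero
    return w / w.sum()

def fractions_from_outputs(o):
    """Normalized neural-network outputs, equation (5-26)."""
    o = np.asarray(o, dtype=float)
    return o / o.sum()
```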


Similarly, f can be estimated from belief functions or from any fuzzy membership function derived in a classification process. Although it is not difficult to develop methods to estimate f, the accuracy of the various methods has rarely been assessed; care must therefore be taken in applying any of them to estimate the area proportions of surface cover types.

Clustering algorithms. For images for which the user has little knowledge of the number and spectral properties of the spectral classes, clustering is a useful tool for determining the inherent data structure. Clustering is the automatic grouping of pixels with similar spectral characteristics. The similarity measure in a clustering algorithm is usually a simple one, such as the Euclidean distance or, less often, the city-block distance:

(1) Euclidean distance: d_E(x_1, x_2) = \sqrt{\sum_i (x_{1i} - x_{2i})^2}
(2) City-block distance: d_c(x_1, x_2) = \sum_i |x_{1i} - x_{2i}|

The clustering criterion determines how good the clustering results are. The sum of squared error (SSE) is often used to evaluate a clustering result: for a given number of clusters, the smaller the SSE, the better the clustering result.

SSE = \sum_{i=1}^{n_c} \sum_{x \in c_i} (x - m_i)^T (x - m_i)        (5-27)

where the c_i, i = 1, 2, ..., nc, are the clusters and m_i is the mean vector of cluster i. Commonly used clustering algorithms include K-means and ISODATA (Iterative Self-Organizing Data Analysis Technique A); ISODATA is an extension of K-means clustering. A brief introduction to these algorithms follows.

K-means algorithm. K-means clustering (also called c-means clustering) is a simple moving-means algorithm. It derives spectrally similar clusters in four steps applied in an iterative fashion (a minimal implementation sketch follows the steps below):

a. Select K points in the multispectral space as candidate cluster centers. Let these points be m_i^o, i = 1, 2, ..., K. Although the m_i^o can be arbitrarily selected, it is suggested that they be spread evenly in the multispectral space. The area occupied by the scatterplot in the multispectral space can be used to guide the initial placement of the centers; centers falling outside the scatterplot area should be avoided.

b. Assign each pixel x in the image to the closest cluster center m_i^o.

c. Generate a new set of cluster centers, m_i^n, based on the assignment in step b.

d. If |m_i^n - m_i^o| < ε (where ε > 0 is a small tolerance), the procedure terminates; otherwise let m_i^o = m_i^n, return to step b, and continue.
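The following is a minimal NumPy sketch of the four steps; the initialization here spreads random centers over the data range, which is one reasonable reading of step a rather than the only choice:

```python
import numpy as np

def kmeans(pixels, K, eps=1e-4, max_iter=100, seed=0):
    """Moving-means clustering following steps a-d above.
    pixels: (n, bands) array of gray-level vectors."""
    rng = np.random.default_rng(seed)
    # a. candidate centers spread over the range of the scatterplot
    lo, hi = pixels.min(axis=0), pixels.max(axis=0)
    m_old = lo + (hi - lo) * rng.random((K, pixels.shape[1]))
    for _ in range(max_iter):
        # b. assign each pixel to its closest cluster center
        d = np.linalg.norm(pixels[:, None, :] - m_old[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # c. generate new centers from the assignment
        m_new = np.array([pixels[labels == i].mean(axis=0)
                          if np.any(labels == i) else m_old[i]
                          for i in range(K)])
        # d. stop when no center moves more than the tolerance
        if np.linalg.norm(m_new - m_old, axis=1).max() < eps:
            break
        m_old = m_new
    return labels, m_new
```

The `max_iter` cap reflects the caution below that moving-means clustering may not converge and should be terminated after a maximum number of iterations.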

ISODATA. Based on the K-means algorithm, ISODATA adds two further steps to optimize the clustering process.


a. Merging and deletion of clusters. At a suitable stage, e.g., after a number of iterations of steps b-d of the K-means algorithm, all the clusters m_i^n, i = 1, 2, ..., nc, are examined. If the number of pixels in a particular cluster is too small, that cluster is deleted. If two clusters are too close to each other, they are merged into one cluster.

b. Splitting a cluster. If the variance of a cluster is too large, that cluster can be divided into two clusters.

These two steps increase the adaptivity of the algorithm but also increase the complexity of the computation. Compared to K-means, ISODATA requires more parameters to be specified: thresholds for deletion and merging, and a variance limit for splitting. The variance must be calculated for each cluster. In both of these moving-means algorithms, clustering may not converge; a maximum number of iterations should therefore be specified to terminate the clustering process. K-means and ISODATA reach optimality via iteration and therefore require more computation. They can be considered dynamic clustering algorithms, because the centers move from one iteration to the next. There are also static clustering techniques. For example, histogram-based clustering can be considered static because it only requires the establishment of a multidimensional histogram. Once the histogram is established, histogram-based clustering searches for the peaks of frequency in the multispectral space. In a 2-D histogram (when only two bands are used), frequency peaks are searched in a 3 x 3 gray-level vector neighborhood; the search grows exponentially as the number of bands increases, requiring a search of 3^n gray-level vectors to find a local maximum in an n-band histogram space. Furthermore, the memory requirement increases exponentially with the number of bands: if every band is encoded in 8 bits, an n-band histogram requires 2^(8n) integer or real numbers to store. To reduce the memory requirement, one must either reduce the number of bits per band or limit the number of bands used. The computational efficiency of histogram-based clustering is therefore limited by the number of bands. Nevertheless, this limitation can partly be overcome by the gray-level vector reduction algorithm introduced later (Gong and Howarth, 1992a).

Feature selection and reduction. As the above introduction shows, a number of classification algorithms, such as non-parametric classification and histogram-based clustering, are directly constrained by the number of bands. The MLC is also limited by its requirement for a large number of training samples as the number of bands grows. Band selection is a special type of feature selection in remote sensing; it serves to find a set of effective bands for the final classification. Traditionally, band selection is done using separability measures such as divergence and the J-M distance before a selected set of bands is used in MLC. Given a pair of classes, c1{m1, V1} and c2{m2, V2}, the Jeffries-Matusita distance between the two classes, J(c1, c2) or J12, can be calculated from

J_{12} = 2 (1 - e^{-\alpha})        (5-28)

\alpha = 0.125 (m_1 - m_2)^T \left[ \frac{V_1 + V_2}{2} \right]^{-1} (m_1 - m_2) + 0.5 \log \frac{\left| (V_1 + V_2)/2 \right|}{\sqrt{|V_1| \, |V_2|}}

The J-M distance is considered to be closely related to classification error. It has a range between 0 and 2: the greater the J-M distance, the better the separability between the two classes. The J-M distance can be calculated for each pair of classes, and the average over all class pairs can be used to evaluate a set of spectral bands for classification; the set of bands with the greatest average J-M distance is chosen for MLC. Because computer time is no longer a major concern, MLC can also be applied directly to candidate band sets, followed by a test of classification accuracy. A similar measure of feature separability is the transformed divergence (TD) (see, e.g., Jensen, 1996). For the same data set, TD usually gives slightly higher values.
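Equation (5-28) is easy to evaluate for a pair of classes. A minimal sketch follows (NumPy; `m1`, `V1`, `m2`, `V2` are the class mean vectors and covariance matrices estimated from training samples):

```python
import numpy as np

def jm_distance(m1, V1, m2, V2):
    """Jeffries-Matusita distance between two Gaussian classes, equation (5-28)."""
    dm = m1 - m2
    V = 0.5 * (V1 + V2)
    alpha = (0.125 * dm @ np.linalg.solve(V, dm)
             + 0.5 * np.log(np.linalg.det(V)
                            / np.sqrt(np.linalg.det(V1) * np.linalg.det(V2))))
    return 2.0 * (1.0 - np.exp(-alpha))    # ranges from 0 to 2
```

For band selection, this function would be applied to every class pair for each candidate band subset, and the subset with the greatest average J-M distance retained.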

Some of the multispectral transformation methods, such as the PCA and KT transforms, can be applied for feature reduction. While the KT transform is limited to the MSS and TM sensors, the GST can be applied to any type of multispectral sensor. Particularly when the number of bands is large (10-300), PCA can dramatically reduce the number of spectral bands while preserving most of the data variability.


5.1.3 Accuracy assessment

Accuracy assessment is involved at many stages of remote sensing. Ground truth data are collected to calibrate remotely sensed data, as discussed in the earlier sections. Such data include field spectral measurements, on-site GPS measurements of location, types of surface cover, and various physical properties such as the height of buildings, the area occupied by homogeneous surface covers, the crown closure of trees, forest leaf area, and the proportions of different surface covers at a site of a particular size. Ground truth data collection needs to be carefully designed to make sure that field-measured data are 1) representative of the area and 2) capable of being scaled up for comparison with pixel data in remotely sensed imagery. Sample representativeness is usually achieved by a proper distribution of field sample sites; a sampling scheme involves the number of field sites and the pattern of their distribution. The simplest way of scaling up field data is to select relatively homogeneous areas for field sampling. If this is not possible, some data collected in the field, such as land cover classes and forest crown closure, may be generalized over a broader spatial area. Sometimes detailed physical properties are measured in an area to model the radiance distribution via spectral mixing or radiative transfer algorithms. Data at an intermediate scale are always helpful in bridging the gap between field measurements and broad-scale satellite data. The process from remotely sensed data to thematic classes, and finally to a cartographic product, is summarized in Figure 5-7. In many applications other than image classification, the accuracy of processed data or of a final map of quantitative physical properties, such as building heights, surface temperature, albedo, and forest leaf area, is assessed against ground-based measurements, usually leading to estimates of root mean squared error (RMSE), a measure for checking the accuracy of data at the ratio and interval measurement scales. Such assessments are usually more objective than the assessment of classification results.

In many image classification projects, calibration is not done and the classification results are checked at the final stage using either ground truth data or more accurate data derived from other sources. For example, classification results obtained from airphoto analysis are often used as the reference data to verify satellite-based results. However, one needs to be careful, because results derived from the analysis of airphotos may also contain error. With the spatial resolution of satellite data approaching the 0.5-1.0 m level, the need for airphoto analysis is minimal. Sometimes a map obtained through visual interpretation of the same data source is compared with the classification results; the quality of the interpretation results should then also be assessed. Ground truth data should therefore not be lightly substituted by other sources of information for use as reference data in accuracy assessment.

In the assessment of image classification results, the reference data are compared with the classification pixel by pixel to produce a contingency table, sometimes called a confusion matrix or an error matrix (Table 5-3). The matrix breaks down the total number of reference samples for each class and arranges them into columns: the sum of each column is the total number of reference samples in a class, while each row lists the numbers of samples assigned by the classifier to that class. The diagonal elements of the matrix list the correctly classified samples class by class. An off-diagonal element in column i and row j (i ≠ j) is the number of samples that belong to class i but were classified into class j. An example of a confusion matrix is presented in Table 5-3.


[Figure 5-7 diagram: Data Acquisition (spatial resolution, radiometric resolution) → Calibration, Enhancement & Correction → Information Extraction → Maps. Intermediate data are compared against ground truth (positional accuracy, radiometric accuracy); at this stage the comparison is objective. Final maps are compared with maps produced manually (positional accuracy, categorical accuracy); this comparison is somewhat subjective.]

Figure 5-7. Accuracy assessments at different stages of information extraction. [B/W]

Table 5-3 An Example Confusion Matrix from a Simple Classification

                          Reference
Classification        F      W      U    Row Total    Commission Error
      F              28     14     15        57             51%
      W               1     15      5        21             29%
      U               1      1     20        22              9%
Column Total         30     30     40       100
Omission Error       7%    50%    50%

F - Forest, W - Water, U - Urban

The matrix in Table 5-3 contains the complete information on categorical accuracy. Some classification accuracy assessment algorithms can be found in Rosenfield and Fitzpatrick-Lins (1986), Story and Congalton (1986), and Jensen (1996). The off-diagonal elements in each row are the samples misclassified by the classifier, i.e., the classifier is committing a label to samples that actually belong to other classes; this type of misclassification error is called commission error. The off-diagonal elements in each column are the samples omitted by the classifier from their true class; this type of misclassification error is called omission error.

In order to summarize the classification results, the most commonly used accuracy measure is the overall accuracy:

\omega = \frac{1}{N_T} \sum_{i=1}^{n_c} e_{ii}, \qquad N_T = \sum_{i=1}^{n_c} \sum_{j=1}^{n_c} e_{ij}        (5-29)


where e_ij is the number of samples in row i, column j of the confusion matrix, and N_T is the total number of samples. From the example matrix, we obtain ω = (28 + 15 + 20)/100 = 63%. More specific measures are needed, because the overall accuracy does not indicate how the accuracy is distributed across the individual categories: the categories can, and frequently do, exhibit considerably different accuracies yet combine to give equivalent or similar overall accuracies.

By examining the confusion matrix, it can be seen that at least two methods can be used to determine individual category accuracies:

(1) The ratio between the number of correctly classified samples and the row total (the complement of this ratio is the commission error);
(2) The ratio between the number of correctly classified samples and the column total (the complement of this ratio is the omission error).

(1) may be called the user's accuracy: the user is concerned with the percentage of the classified items that have been correctly classified. (2) may be called the producer's accuracy: the producer is interested in how correctly the reference items have been classified.

Kappa coefficient. The kappa coefficient (K̂) is an accuracy measure that removes chance agreement, p_c, from the overall accuracy ω (Cohen, 1960). As shown in Figure 5-8, the confusion matrix is converted into a joint confusion probability matrix by dividing each element by the total number of samples in the entire matrix, i.e., p_ij = e_ij/N_T. In the joint probability matrix, the row sums p_{i+} and the column sums p_{+i} are called marginal probabilities. The expected chance agreement p_c is calculated by summing the products of the marginal probabilities of the same class. K̂ is therefore calculated from

\hat{K} = \frac{\omega - p_c}{1 - p_c}, \qquad p_c = \sum_{i} p_{i+} p_{+i}        (5-30)

where p_{i+} is the row subtotal of p_ij for row i and p_{+i} is the column subtotal of p_ij for column i.

p_ij:    0.28   0.14   0.15    p_1+ = 0.57
         0.01   0.15   0.05    p_2+ = 0.21
         0.01   0.01   0.20    p_3+ = 0.22
p_+i:    0.30   0.30   0.40

p_c = 0.57 × 0.30 + 0.21 × 0.30 + 0.22 × 0.40 = 0.171 + 0.063 + 0.088 = 0.322

Figure 5-8. Kappa coefficient derivation from a joint confusion probability matrix. [B/W]

Since ω = 0.63 and p_c = 0.322, we get

\hat{K} = (0.63 - 0.322)/(1 - 0.322) = 0.308/0.678 ≈ 0.454

K̂ ranges from -1 to 1 (Campbell, 1987). Clearly, total agreement, when all off-diagonal elements are zero, results in K̂ = 1. The cases K̂ = 0 and K̂ = -1 are less obvious (Figure 5-9).

20

Page 21: Chapter 5 INFORMATION EXTRACTIONnature.berkeley.edu/~penggong/271/Chp5P1.pdfThus far, shape analysis is primarily limited to linear feature extraction such as building edges, roads,


Figure 5-9 Some examples of kappa coefficients. [B/W]
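The derivation above is easy to verify with a short computation. The following minimal Python sketch (NumPy; the matrix `E` is the example from Table 5-3) computes the overall accuracy of equation (5-29), the chance agreement and kappa coefficient of equation (5-30), and the per-class user's and producer's accuracies:

```python
import numpy as np

# Confusion matrix of Table 5-3 (rows: classification; columns: reference)
E = np.array([[28, 14, 15],
              [ 1, 15,  5],
              [ 1,  1, 20]], dtype=float)

N_T = E.sum()                              # 100 samples
P = E / N_T                                # joint probability matrix p_ij
omega = np.trace(P)                        # overall accuracy, eq (5-29): 0.63
p_c = P.sum(axis=1) @ P.sum(axis=0)        # chance agreement: 0.322
kappa = (omega - p_c) / (1.0 - p_c)        # eq (5-30): about 0.454
users = np.diag(E) / E.sum(axis=1)         # user's accuracies (row ratios)
producers = np.diag(E) / E.sum(axis=0)     # producer's accuracies (column ratios)
```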

One of the advantages of using K̂ is that two classification products can be compared statistically. For example, suppose two classification maps are made using different algorithms and the same reference data are used to verify both. Let K̂1 and K̂2 be the two kappa coefficients to be compared. For each K̂, the variance V̂ can also be calculated (Fleiss et al., 1969):

\hat{V} = \frac{1}{N_T (1 - p_c)^4} \Big\{ \sum_{i=1}^{n_c} p_{ii} \big[ (1 - p_c) - (p_{i+} + p_{+i})(1 - \omega) \big]^2 + (1 - \omega)^2 \sum_{i \neq j} p_{ij} (p_{+i} + p_{j+})^2 - (\omega p_c - 2 p_c + \omega)^2 \Big\}        (5-31)

It has been suggested that a z-score be calculated

Z = \frac{\hat{K}_1 - \hat{K}_2}{\sqrt{\hat{V}_1 + \hat{V}_2}}        (5-32)

A normal distribution table can be used to determine from Z whether the two K̂s are significantly different. For instance, if |Z| ≥ 1.96, the difference is significant at the 0.95 probability level.
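As a minimal sketch of the test in equation (5-32) (the kappa and variance values below are illustrative numbers only, not results from this chapter):

```python
import math

def kappa_z(k1, v1, k2, v2):
    """Z-score for comparing two independent kappa estimates, equation (5-32)."""
    return (k1 - k2) / math.sqrt(v1 + v2)

# Illustrative values only: two kappas with their estimated variances
z = kappa_z(0.454, 0.002, 0.378, 0.002)
significant = abs(z) >= 1.96   # significant at the 0.95 probability level?
```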

Sometimes a kappa coefficient K̂_i for each class is also needed. This is given by

\hat{K}_i = \frac{p_{ii} - p_{i+} p_{+i}}{p_{+i} - p_{i+} p_{+i}}        (5-33)

(with p_{i+} in place of p_{+i} in the denominator as the alternative form).

The choice of p_{+i} or p_{i+} causes some differences, just as in the case of user's and producer's accuracy. We recommend the use of the former marginal probability, as it is normalized according to the reference. As with the overall kappa coefficient, a significance test between two K̂_i values can also be carried out (Fleiss et al., 1969), but such tests are rarely applied in remote sensing.

Number of samples and sample distribution. The collection of test samples is determined not only by the representativeness of the samples but also by cost. The cost issue may not be as critical in human settlement remote sensing as in natural resources remote sensing, where access to a particular site may be difficult. The general considerations for the size of the test sample are:

(1) The larger the sample size, the more representative the estimate obtained and, therefore, the greater the confidence that can be achieved.
(2) In order to give each class a proper evaluation, a minimum sample size should apply to every class.
(3) A smaller number of samples can be allocated to a class that has less variability or is less important.

Congalton (1991) suggests that a minimum of 50 samples be collected. When a class occupies a large area and the number of classes is greater than 12, 75-100 sample pixels should be selected. Statistically, the sample size, N, for


each class can be determined from the binomial distribution (correct vs. incorrect) (Fitzpatrick-Lins, 1981):

N = \frac{Z^2 \, p \, (100 - p)}{e^2}        (5-34)

where p is the expected percent accuracy, e is the allowable error in percent, and Z is the confidence coefficient from a two-sided normal distribution, i.e., Z = 1.96 at the 0.95 confidence level. The higher the expected accuracy, the fewer samples are required; similarly, the larger the allowable error, the smaller the number of samples. Once the number of samples to be collected is determined, the next step is to decide the distribution of the samples. Four major sampling strategies have been proposed (Jensen, 1983; Campbell, 1987): random sampling, stratified random sampling, systematic sampling, and stratified systematic unaligned sampling. While these are generally designed for the entire area of an image, we believe samples should be allocated on a class-by-class basis; this avoids undersampling small but important classes. A random sample stratified by individual class is therefore a natural choice. If there is no systematic pattern in the image classification results, a systematic sampling scheme is also applicable. The choice among sampling strategies is often influenced by the convenience of access to ground truth sampling sites.
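For illustration, equation (5-34) can be computed as follows (a minimal sketch; the values of p and e are arbitrary examples):

```python
import math

def sample_size(p, e, z=1.96):
    """Per-class sample size from the binomial model, equation (5-34).
    p: expected percent accuracy; e: allowable error in percent."""
    return math.ceil(z ** 2 * p * (100.0 - p) / e ** 2)

# Illustrative values: 85% expected accuracy, 5% allowable error -> 196 samples
n = sample_size(85, 5)
```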

5.1.4 Spatial contextual classification algorithms

The problem with multispectral classification is that it makes no use of the spatial information in the image. Visual interpretation, on the other hand, always involves spatial features such as texture, shape, shade, size, site, and association. Computer techniques are extremely efficient at handling the gray-level values in an image, but in making use of spatial information they lag far behind. A great deal of effort has therefore been devoted to developing algorithms that take advantage of the spatial features in remotely sensed data; the ability to characterize spatial structural differences in remotely sensed images is an important component of human settlement remote sensing. For image classification, three types of algorithms make use of spatial features at different stages:

(1) the preprocessing approach;
(2) the postprocessing approach;
(3) the use of a contextual classifier.

Figure 5-10 shows the procedures involved in the preprocessing and postprocessing methods. The indispensable part of a preprocessing classification method is the spatial-feature extraction procedure. Once spatial features are extracted, any classification algorithm mentioned earlier can be used in this type of classification. Spatial features can be extracted using the texture analysis algorithms introduced in Section 4.4.6; essentially any type of spatial feature that can be derived from the original image can be used. For example, an edge-density image can be used to separate the vegetation in residential and industrial areas from that in rural/agricultural areas (Gong and Howarth, 1990a). Because many spatial features can be extracted from one band of the original image, the feature reduction methods described earlier can be applied to select features and reduce redundancy. Gong et al. (1992) assessed a number of texture measures in urban land-use classification and found that most spatial measures act by highlighting either the low-frequency or the high-frequency component of the image; thus most texture analyses applied to an image achieve a smoothing or an edge-enhancing effect. Another distinctive measure is entropy, which measures the level of diversity in a pixel neighborhood.


Figure 5-10 Incorporation of spatial features into image classification through a preprocessing procedure (left) or a postprocessing procedure (right). [B/W]

The postprocessing method requires some spatial analysis of the intermediate classification results obtained from a multispectral classification of the original imagery. The postprocessing analysis can be as simple as applying a mode filter to the intermediate results or as complicated as reclassifying the intermediate results into information classes. The mode filter modifies the classification results based on a majority rule and is therefore also called a majority filter. Simple postprocessing requires a careful initial classification. Some postprocessing algorithms modify the probability values of each pixel based on the probability compatibility among pixels in a local neighborhood; this type of technique is known as probabilistic relaxation (Rosenfeld et al., 1976; Richards et al., 1982). Gong and Howarth (1989) applied probabilistic relaxation to urban land-cover classification and found that, while the algorithm can improve classification accuracy to a certain extent (about 5%), its computational requirement is high. Other postprocessing starts from rough classification or clustering results, leaving the postprocessing analysis to regroup the clusters into the final classes (Wharton, 1983; Zhang et al., 1988; Gong and Howarth, 1992b). Wharton used histogram-based clustering to first transform an airborne multispectral image into clusters; the occurrence frequencies of the various clusters were then used to classify land-use categories. Zhang et al. (1988) applied land cover classification to a selected section of Landsat TM data and then used the occurrence frequencies to characterize land-use types. Gong and Howarth (1992b) applied a similar approach, first classifying a SPOT multispectral image of an urban and rural-urban fringe area near Toronto, Canada, into 12 land-cover classes and then using the cover-frequency approach to re-map the land-cover types into 14 land-use classes. In these postprocessing algorithms, land-cover classification or clustering results become the intermediate results for further image classification. Compared to the direct use of MLC for urban land-use classification, the indirect postprocessing techniques can improve the final land-use classification accuracy considerably, by 15-25%. This level of improvement over the MLC method is hardly matched by other contextual classification algorithms; we therefore describe the frequency-based contextual classifier below.

Frequency-based contextual classifier. The occurrence frequency f(i, j, v) is defined as the number of times a pixel value v occurs in a pixel window centered at (i, j). Because the information in a pixel window is used to classify a single pixel, the central pixel, we consider this type of classification a contextual classification. For a single band of imagery, v represents a gray level; for a multispectral image, v represents a gray-level vector. Within each pixel window, one can obtain an occurrence frequency table containing all possible values of v.

When a pixel window of a given size is moved over an image, one can generate a frequency table for each pixel in the image, except for pixels close to the image boundary. Pixels within a distance of half the lateral length, l, of the pixel window from the image boundaries are called boundary pixels. To keep the proportion of boundary pixels small, the pixel window sizes used must be considerably smaller than the image size.
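A minimal sketch of frequency-table extraction for one non-boundary pixel follows; it assumes the image has already been reduced to integer gray-level vector labels (the function and parameter names are illustrative):

```python
import numpy as np

def frequency_table(img, i, j, half, n_vectors):
    """Occurrence frequencies f(i, j, v) in a (2*half + 1)-square pixel window.
    img: 2-D array of integer gray-level vector labels in [0, n_vectors);
    (i, j) must lie at least `half` pixels from the image boundary."""
    window = img[i - half:i + half + 1, j - half:j + half + 1]
    return np.bincount(window.ravel(), minlength=n_vectors)
```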


The number of occurrence frequencies in a frequency table increases linearly with the number of gray levels in an image and exponentially with the number (or dimensionality) of spectral bands. For a single band quantized into n gray levels, one can produce gray-level occurrence frequency tables with a maximum of n frequencies per table. As discussed earlier, the maximum number of frequencies in a table increases to n^m when m spectral bands with the same number of gray levels are used, and handling n^m frequencies requires a large amount of random access memory (RAM). For this reason, efficient gray-level vector reduction algorithms are needed; one such algorithm is introduced in the following section. Frequency tables can be generated from gray-level vector-reduced images or clustered images.

There are several advantages to using frequency tables rather than the spatial statistical measures used in spatial-feature methods. First, a frequency table contains more spatial information than many statistical measures: the most commonly used measures, such as the mean, standard deviation, skewness, kurtosis, range, and entropy, can all be derived from a gray-level frequency table (Gong and Howarth, 1992a). Second, obtaining statistical parameters requires additional computation after the frequency tables are produced; it is therefore unnecessary to use statistical measures, because the frequency tables themselves can be quickly computed, directly compared, and analyzed. Third, a feature selection procedure for evaluating statistical parameters is no longer needed, because the frequency tables contain the spatial information required for the classification. Fourth, frequency tables require no disk storage, because they can be created in real time.

The success of the frequency table method in land-use classification depends largely on selecting an appropriate pixel window size for frequency table generation. If the window size is too small, insufficient spatial information is extracted into a frequency table to characterize a land-use type; if the window size is too large, much spatial information from other land-use types may be included. There seems to be no effective criterion for selecting pixel window sizes. Parametric feature selection criteria, such as the methods discussed earlier and the probability trend curve (Gong and Howarth, 1990c), do not work for frequency tables, simply because frequency tables are not parametric. Driscoll (1985) examined frequency means obtained from training samples using pixel windows over a range of successive sizes; the minimum window size was selected at which the frequency means begin to stabilize in comparison with those extracted from larger windows. Comparisons were made visually. This method is, however, based only on within-class variances. Thus, as in discriminant analysis, a separability measure was proposed by Gong and Howarth (1992a) to select optimal window sizes, but they did not find it particularly useful. Other measures, such as semivariograms, may also be applicable (Woodcock and Strahler, 1987), but our experience suggests that none of these methods is particularly effective. Optimal window sizes must therefore be determined empirically at this point.

The minimum-distance classifier with the city-block metric (Gonzalez and Wintz, 1987) is used in the frequency-based classifier. The city-block distance between two vectors is calculated by taking the difference between every pair of corresponding vector elements and summing the absolute values of these differences.

Given the mean histograms of all c land-use classes, h_u = (f_u(1), f_u(2), ..., f_u(N_v)), u = 1, 2, ..., c, the city-block distance between a new histogram h(i, j) and h_u is calculated from:

d_u = \sum_{v=0}^{N_v} \left| f_u(v) - f(i, j, v) \right|        (5-35)

The classifier compares all c distances and assigns pixel (i, j) to the class with the minimum distance to h(i, j).
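Combined with the frequency-table extraction sketched earlier, the classification step can be written as follows (a minimal sketch; `class_histograms` would come from supervised training on sample windows):

```python
import numpy as np

def classify_window(freq, class_histograms):
    """Assign the central pixel to the class whose mean histogram is closest
    in city-block distance, equation (5-35).
    freq: frequency table from a pixel window (Nv,);
    class_histograms: (c, Nv) mean histograms h_u from supervised training."""
    d = np.abs(class_histograms - freq).sum(axis=1)   # d_u for every class
    return d.argmin()
```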

A major barrier to improving classification accuracy in contextual image classification involving spatial features extracted from local neighbourhoods (pixel windows) is the lack of methods for reducing the misclassification that occurs at the boundaries between classes. This type of misclassification is caused by the use of pixel neighbourhoods (e.g., Gong and Howarth, 1992a; b; Eyton, 1993). An example illustrating this phenomenon is shown in Figure 5-11.


Figure 5-11 A false color composite of a SPOT multispectral image of northeast Toronto, Canada (left), and land-use classification results using the frequency-based contextual classifier with a 13 x 13 pixel window (right). Most of the classification errors are at the borders between land-use classes. [C]

The boundary effect can be illustrated with a simple example. In Figure 5-12 there exist, presumably, only two land-use classes: Class A and Class B. As a pixel window moves from the area of Class A across the boundary into the area of Class B, the occurrence frequencies extracted at each position of the window change. Assume Class A is industrial/commercial, with concrete surface dominant, and Class B is a golf course, with grass dominant. As the window moves from Class A to Class B, the frequencies extracted from it change from concrete-dominated, to similar proportions of concrete and grass, to a high grass proportion and low concrete proportion, and finally to grass-dominated. The two central concrete-and-grass configurations are transitional from Class A to Class B and, depending on the other classes included in the classification scheme, their frequencies may be more similar to those of new residential or old residential classes. Thus, as a pixel window moves across the boundary between two classes, four or even more land-use classes may be obtained; the transitional classes are errors.

The level of boundary effect changes with the configuration of image resolution and class definition. For a given class, the size and shape of the ground components affect the level of boundary effect. If those ground components are important features for discriminating the class from others, the image resolution needs to be sufficiently high for the components to be observable in the image, and the pixel window needs to be large enough to cover them in one window, so that the extracted frequencies are representative of the class. Generally, boundary effects tend to be more serious as image spatial resolution improves, because coarser resolution smooths out the boundary effects; they also increase as the size of the pixel window increases.


Figure 5-12 An illustration of the pixel-window effect on classification results at the boundary between two distinct land-use classes. The two patterns between A and B are the transitional classes that are misclassified by a frequency-based classifier. (From Gong, 1994.) [B/W]

Threshold-controlled classification. Since the boundary effect is a spatial problem, it should be corrected spatially. In the frequency-based classifier (FBC) (Gong and Howarth, 1992a), a city-block distance d_s(i, j) is calculated as the basis for classification:

d_s(i, j) = \sum_{k=1}^{n} \left| f_k(i, j) - c_{sk} \right|        (5-36)

where f_k(i, j) is the extracted frequency of gray-level vector k from a pixel window centered at pixel location (i, j), and C_s = {c_s1, c_s2, ..., c_sn}^T is the average gray-level vector frequency for land-use class s, obtainable from supervised training on the gray-level vector-reduced image. The pixel window size is m x m, and n is the total number of gray-level vectors.

Once d_s(i, j) is obtained, instead of directly comparing the shortest distance among all the land-use classes, it is compared with a threshold β·m² (0 < β < 2). If d_s(i, j) ≤ β·m², pixel (i, j) is a candidate for land-use class s; otherwise pixel (i, j) is rejected for class s. If more than one land-use class is a candidate, pixel (i, j) is assigned to the closest one. β = 2 is equivalent to applying no threshold, whereas β = 0 implies that only pixels whose occurrence frequencies match those of a particular class exactly will be classified. Therefore, by adjusting the threshold β between 0 and 2, transitional pixels at the boundaries between two classes may be left unclassified.

Region growing by majority filtering. A simple region-growing procedure can then be applied iteratively to fill the gaps (unclassified pixels) between classes (Figure 5-13). In this procedure, only unclassified pixels may be affected. An unclassified pixel is first located, and its eight neighbours are checked to see whether any of them has been classified. If not, the algorithm moves to the next unclassified pixel and repeats the neighbourhood check. If so, the majority rule is applied to label the unclassified pixel. The number of unclassified pixels is usually small (less than 10% of an image), and the region-growing procedure is computationally simple, so it requires a very small amount of computation.
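A minimal sketch of this procedure follows, assuming the thresholded class map marks rejected pixels with a sentinel label (the names `UNCLASSIFIED` and `grow_regions` are illustrative):

```python
import numpy as np

UNCLASSIFIED = -1   # label given to pixels rejected by the threshold test

def grow_regions(labels, n_iter):
    """Iterative majority filling of unclassified pixels.
    labels: 2-D integer class map in which UNCLASSIFIED marks the gaps."""
    for _ in range(n_iter):
        out = labels.copy()
        rows, cols = np.nonzero(labels == UNCLASSIFIED)
        for r, c in zip(rows, cols):
            # gather the classified pixels among the eight neighbours
            nb = labels[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2].ravel()
            nb = nb[nb != UNCLASSIFIED]
            if nb.size:                    # majority rule among the neighbours
                vals, counts = np.unique(nb, return_counts=True)
                out[r, c] = vals[counts.argmax()]
        labels = out
    return labels
```

Because each pass uses only the labels from the previous pass, every iteration fills one rim of pixels from each side of a gap, consistent with the m/2 + 1 iteration count suggested below.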


Figure 5-13 An illustration of the thresholding and region-growing procedures. Thresholding prevents the area along the boundary between two classes from being classified; the region-growing algorithm is then used to fill the gap between classes A and B. (From Gong, 1994.) [B/W]


The iterative region-growing procedure is terminated according to either (1) a user-determined number of iterations or (2) the condition that every unclassified pixel has been assigned a class. If the classification scheme is not complete for an area, some pixels should remain unclassified at the end of the classification task; in this case, one should use the number of iterations as the criterion to control the region growing. A suggested number of iterations is m/2 + 1, where m is the lateral length of the pixel window used, because the maximum width of any unclassified gap is m and each iteration fills a two-pixel-wide strip of the gap. If, on the other hand, the classification scheme is perfectly suited to an area, the region growing can be terminated when all pixels are classified. The second criterion was used to obtain an optimal land-use classification of a portion of the City of Calgary using an 8-band Compact Airborne Spectrographic Imager (CASI) image at a resolution of 7.5 m (Figure 5-14).

Figure 5-14. A false color composite of CASI imagery obtained over the City of Calgary (left), and the frequency-based contextual classification result (right) with a thresholding and region growing adjustment to reduce the boundary effect. [C]

Eigen-based gray-level vector reduction. To make better use of the frequency-based classification technique, the number of gray-level vectors in multispectral space has to be reduced. The simplest approach is to compress the number of gray levels in each band of the image; however, gray-level vector reduction in multispectral space is less optimal than reduction in eigenvector space.

As discussed in Section 4.4.3, principal component transformations (PCT) can effectively reduce data redundancy by transforming the data from multispectral space to the eigenvector space of the data. Given


the covariance matrix and the mean gray-level vector M = (m_1, m_2, ..., m_k)^T calculated from the k multispectral bands of the image, the coordinates in multispectral space S_k can be rotated into eigen coordinates in eigen space E_k. Let V_1, V_2, ..., V_k represent the eigenvectors. A gray-level vector G = (g_1, g_2, ..., g_k)^T in multispectral space is then transformed into a gray-level vector G_e = (v_1, v_2, ..., v_k)^T in eigen space:

(v_1, v_2, ..., v_k)^T = (V_1, V_2, ..., V_k)^T (g_1, g_2, ..., g_k)^T        (5-37)

Let s_{e1}^2, s_{e2}^2, ..., s_{ek}^2 represent the eigenvalues corresponding to the eigenvectors; these eigenvalues are the variances along the eigenvector directions in eigen space. In order to keep the same signal-to-noise level between eigen axes (e.g., to make square cells on the eigen plane in Figure 5-15), the partition of the eigen space is designed so that the number of gray levels along each eigenvector is proportional to the square root of its corresponding eigenvalue. That is:

N_{e1}/S_{e1} = N_{e2}/S_{e2} = ... = N_{ek}/S_{ek}        (5-38)

where N_{e1}, N_{e2}, ..., N_{ek} are the numbers of gray levels used for the corresponding eigenvectors. As can be seen, these are only k - 1 equations in the k unknowns N_{e1}, N_{e2}, ..., N_{ek}. To determine all k unknowns, one condition is added:

N_{e1} · N_{e2} · ... · N_{ek} = N_E        (5-39)

where NE is the total number of gray-level vectors to be expected for the partition of the eigen space.



Figure 5-15 The partition of an eigen plane into equal-sized cells using the gray-level vector reduction algorithm. Because traditional PCA-based algorithms do not partition the eigen space into equal-sized cells, they tend to overweight the eigen axes with low eigenvalues. (From Gong and Howarth, 1992a.) [B/W]

To implement the eigen-space partition, the origin of the eigen space (the same as the origin of multispectral space) needs to be shifted to a new origin, the mean gray-level vector M of multispectral space. The new origin E_o = (e_1, e_2, ..., e_k)^T is obtained from:

(e_1, e_2, ..., e_k)^T = (V_1, V_2, ..., V_k)^T (m_1, m_2, ..., m_k)^T        (5-40)

Where the partition starts and how far apart the gray-level intervals are along each eigen axis can now be determined. The points 2.1S_{ei} on each side of the origin along eigen axis i were selected as the starting and ending points for the gray-level partition. This was determined from the normal distribution curve by assuming the data are normally distributed along each eigen axis; under this assumption, the use of 2.1 guarantees that approximately 97 percent of the gray-level vectors in multispectral space fall into the range [-2.1S_{ei}, 2.1S_{ei}] on eigen axis i, with the remainder (less than 3 percent) falling outside it. Depending on the actual data distribution, the value 2.1 can be adjusted slightly to keep the majority of gray-level vectors within the specified range. The gray levels along each eigen axis are numbered in ascending order from 0, in increments of 1, up to N_{ei} - 1. Figure 5-16 illustrates the division of the ith eigen axis into N_{ei} gray levels.


By dividing each eigen axis into the number of gray levels obtained above, the original multispectral space is partitioned into N_E cells, or gray-level vectors, in eigen space, which achieves the purpose of reducing the large number of gray-level vectors in multispectral space. From the transformed gray-level vector of a pixel, G_e = (v_1, v_2, ..., v_k)^T, the reduced gray-level vector G_r = (r_1, r_2, ..., r_k)^T can be obtained according to the division along each eigen axis, as described above. For example, r_1 is reduced from v_1 according to the following rule:

a = (v_1 - e_1 + 2.1 S_{e1}) (N_{e1} - 2) / (4.2 S_{e1}) + 1

r_1 = 0 if a ≤ 1;  r_1 = N_{e1} - 1 if a > N_{e1} - 2;  r_1 = ⌊a⌋ otherwise.        (5-41)


Figure 5-16 Division of the eigen-axis into Nei pieces. [B/W]

To allow the frequency-based classification algorithms easy access to the data after gray-level vector reduction, only one image is used to store the data. A labelling system assigns a number to each gray-level vector created in eigen space. The number n_e for a particular gray-level vector G_r = (r_1, r_2, ..., r_k)^T is calculated according to the following equation:

n_e = r_k · N_{e1} · N_{e2} · ... · N_{e(k-1)} + r_{k-1} · N_{e1} · N_{e2} · ... · N_{e(k-2)} + ... + r_1        (5-42)

After this labelling, all the partitioned gray-level vectors in eigen space will range from 0 to NE-1.

In summary, it takes four steps to obtain reduced gray-level vectors using this algorithm. First, the covariance matrix and mean gray-level vector are generated from the original multispectral image, using either samples or the entire image. Second, the eigenvalues and their corresponding eigenvectors are derived from the covariance matrix. Third, the eigen space is partitioned into the expected number (N_E) of cells. Finally, the gray-level values of every pixel in the multispectral image are transformed into eigen space, and each pixel is assigned a new gray-level vector number (n_e) according to the cell of the eigen-space partition into which its transformed coordinates fall.
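The four steps can be sketched compactly in Python. This is a minimal sketch under the stated assumptions (the whole image is used to estimate the statistics, equation (5-41) is applied with the same 2.1 factor on every axis, and the function name is illustrative); rounding the N_ei to integers makes their product only approximately N_E:

```python
import numpy as np

def reduce_gray_level_vectors(img, NE=64):
    """Eigen-space gray-level vector reduction, following steps 1-4 above.
    img: (rows, cols, k) multispectral image; NE: target number of vectors."""
    rows, cols, k = img.shape
    pixels = img.reshape(-1, k).astype(float)
    # Step 1: mean gray-level vector and covariance matrix
    M = pixels.mean(axis=0)
    C = np.cov(pixels, rowvar=False)
    # Step 2: eigenvalues and eigenvectors of the covariance matrix
    eigvals, V = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    S = np.sqrt(eigvals[order])          # standard deviations S_ei per axis
    V = V[:, order]
    # Step 3: gray levels per axis proportional to S_ei with product NE,
    # solving equations (5-38) and (5-39); rounding makes the product approximate
    Ne = np.maximum(np.round(S * (NE / S.prod()) ** (1.0 / k)).astype(int), 2)
    # Step 4: rotate into eigen space (centring on M here is equivalent to
    # subtracting the rotated origin E_o of equation (5-40))
    G = (pixels - M) @ V
    a = (G + 2.1 * S) * (Ne - 2) / (4.2 * S) + 1     # equation (5-41)
    r = np.floor(a).astype(int)
    r = np.where(a <= 1, 0, r)
    r = np.where(a > Ne - 2, Ne - 1, r)
    # Label each reduced vector with a single number, equation (5-42)
    weights = np.concatenate(([1], np.cumprod(Ne[:-1])))
    ne = (r * weights).sum(axis=1)
    return ne.reshape(rows, cols)
```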

The gray-level vector reduction described above can be applied to prepare an intermediate image for the frequency-based contextual classifier to derive land-use classes. The algorithm can be applied to images with any number of spectral bands. Usually it works best with N_E ≥ 50; when the number of bands is large, a larger N_E (60-100) is recommended. In a sense, the gray-level vector reduction algorithm performs like a clustering


algorithm: it reduces the data complexity to a manageable number of vectors without much interaction from the image analyst. The algorithm therefore increases the efficiency of the frequency-based classification without losing much accuracy.

Direct contextual classifiers and other algorithms. Instead of extracting contextual information and storing it for subsequent classification, as in the preprocessing and postprocessing contextual algorithms discussed above, a direct contextual classifier makes direct use of the contextual information from a pixel window in pixel labeling. In this sense, the frequency-based classifier can be considered a contextual classifier, since it can be applied directly to an image with a small number of gray-level vectors. Similarly, the artificial neural network method can be applied to gray-level vectors obtained directly from an image: for instance, the gray-level vectors of the pixels in a pixel window can be organized into a composite gray-level vector and used as input to a neural network (Hepner et al., 1990). As a natural extension of the Bayesian rule, the a posteriori class probability of a pixel can be characterized by the joint conditional probabilities of its neighborhood pixels, so that the classification decision is governed by the compound information from a pixel neighborhood. This type of classification is known as the compound decision method (Welch and Slater, 1971; Landgrebe, 1980; Swain et al., 1981). However, due to its computational complexity and unrealistic assumptions, this type of direct contextual classification has rarely been applied to land cover and land use classification.

Non-spectral data such as digital elevation model data, climate data, and population and socio-economic data can be used as ancillary data along with the spectral data in land-use classification to improve classification accuracy (e.g., Hutchinson, 1982; Cibula and Nyquist, 1987; Mason et al., 1988; Corr et al., 1989). While the minimum classification unit is usually a pixel in most classification algorithms, field-based (also known as region-based or object-based) classification methods have been developed since 1976 (Kettig and Landgrebe, 1976) for the classification of agricultural lands with remote sensing. In object-based image classification, an image must first be divided into homogeneous regions (objects); the classification decision is then made object by object. Knowledge-based approaches have been developed to process aerial photographs based on homogeneous image segments, using rules about the neighborhood, shape, and size features of individual image objects (segments) (McKeown, 1987). In computer vision, initial image segmentation is usually achieved by image thresholding, region growing, or clustering; the resulting segmented image can then be passed to a region extraction procedure, where each segment is treated as a whole object for successive processing. Gong and Howarth (1990d) employed a land-cover classification to initially segment the original image and applied a knowledge-based algorithm to group land-cover polygons into land-use types in the rural-urban fringe of Toronto using a SPOT multispectral image. The classification decision is based on "if-then" rules stored in a knowledge base, triggered by object properties such as texture, average gray-level vector, object shape and size, and object neighborhoods.

Object-based classification has been made easier to implement by the availability of eCognition (Definiens Imaging, 2001). Objects are first obtained through image segmentation; with eCognition, the segmentation can be done at different spatial scales to produce objects of different sizes. The objects at different scales are organized hierarchically, so that each object at a higher level exclusively contains a number of objects at the lower level. Figure 5-17 illustrates image segmentation at two scales using an IKONOS image acquired over China Camp, California. The segmentation was produced using both the panchromatic and the multispectral bands. With the objects extracted, it is convenient to develop rule-based inference mechanisms using fuzzy set theory. Object properties such as shape, size, and belonging relationships, as well as average gray level and texture, can be used in image classification.


a. IKONOS original image b. Objects with average gray level at a low scale


c. Objects with average gray levels at a high scale d. Object contours in (c) overlaid on the panchromatic image

Figure 5-17. Object extraction with image segmentation applied to an IKONOS image, China Camp, California. (a) Original multispectral image displaying bands 4, 3, and 2 through the red, green, and blue color guns. (b) Image segmentation results with average gray-level values displayed for each object, using a low scale. (c) Image segmentation with a larger scale than in (b); each object in this diagram corresponds to one or more objects in (b). (d) Contours of the objects in (c) overlaid on the panchromatic image.
