APPROVED:
Pinliang Dong, Major Professor
Paul Hudak, Committee Member and Chair of the Department of Geography
Chetan Tiwari, Committee Member
James D. Meernik, Acting Dean of the Robert B. Toulouse School of Graduate Studies
COUNTY LEVEL POPULATION ESTIMATION USING KNOWLEDGE-BASED
IMAGE CLASSIFICATION AND REGRESSION MODELS
Anjeev Nepali, B.S.
Thesis Prepared for the Degree of
MASTER OF SCIENCE
UNIVERSITY OF NORTH TEXAS
August 2010
Nepali, Anjeev. County level population estimation using knowledge-
based image classification and regression models. Master of Science (Applied
Geography), August 2010, 65 pp., 11 tables, 24 illustrations, references, 38 titles.
This paper presents methods and results of county-level population
estimation using Landsat Thematic Mapper (TM) images of Denton County and
Collin County in Texas. Landsat TM images acquired in March 2000 were
classified into residential and non-residential classes using maximum likelihood
classification and knowledge-based classification methods. Accuracy
assessment results from the classified image produced using knowledge-based
classification and traditional supervised classification (maximum likelihood
classification) methods suggest that knowledge-based classification is more
effective than traditional supervised classification methods. Furthermore, using
randomly selected samples of census block groups, ordinary least squares (OLS)
and geographically weighted regression (GWR) models were created for total
population estimation. The overall accuracy of the models is over 96% at the
county level. The results also suggest that underestimation normally occurs in
block groups with high population density, whereas overestimation occurs
in block groups with low population density.
Copyright 2010
by
Anjeev Nepali
ACKNOWLEDGEMENTS
I would like to take this opportunity to express my appreciation to Dr.
Pinliang Dong for his full support and supervision throughout this project. I would
also like to acknowledge my committee members, Dr. Paul Hudak and Dr. Chetan
Tiwari, for their support and suggestions in preparing this thesis, and Dr. Bruce
Hunter for providing the software application support used in this thesis.
I would also like to thank my friends Naresh Kanaujiya, Sanjay Gurung,
Nick Enwright, and Aldo Avina for their comments and suggestions for improving
my thesis.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
Why Estimate Population?
Why Use Remote Sensing for Population Estimation?
Current Practice of Population Estimation Using Remote Sensing
Research Objectives
CHAPTER 2 STUDY AREA AND DATA
Study Area
Datasets
CHAPTER 3 METHODOLOGY
Image Calibration
Impervious Dataset
Calculation of Indices
Knowledge-Based Classification
Development of the Knowledge-Based Classification Model for Denton County
Accuracy Assessment of Classified Images
Input Data for Regression Models
Regression Modeling
Geographically Weighted Regression Model
Accuracy Assessment of Population Estimation
Relative Error (RE)
CHAPTER 4 RESULTS AND DISCUSSION
Results from Maximum Likelihood Classification (MLC)
Results from Knowledge-Based Classification of Landsat TM and Impervious Surface Data
Results from Knowledge-Based Classification Using Landsat TM Data Alone
Regression Models
Linear Regression Models
Geographically Weighted Regression
Discussion
CHAPTER 5 CONCLUSION
REFERENCES
LIST OF FIGURES
1. Study area.
2. Flowchart of methodology.
3. Flowchart of knowledge-based classification model.
4. Index images generated from the TM image.
5. Classification indices value of various land use type vs. residential land use type.
6. Classification indices value of various land use type vs. residential land use type.
7. Spectral response of band 4 and band 7 in residential built-up area.
8. Knowledge-based classification model developed for land use classification using Landsat 7 ETM+ image classification rules and conditions.
9. Classified image of Landsat TM produced from MLC classification.
10. Classified image of Landsat TM after processing using impervious surface data.
11. Results from knowledge-based classification using Landsat TM data alone.
12. Linear regression models derived from sampling Denton County block-group level.
13. Linear regression models derived from sampling Denton County block-group level.
14. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Denton County.
15. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Denton County.
16. Scatter diagrams of relative population estimation error vs. population density at the census block-group level generated from general regression for Denton County (03/04).
17. Linear regression models derived from sampling Collin County block-group level.
18. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Collin County.
19. Scatter diagrams of relative population estimation error vs. population density at the census block-group level generated by general linear regression for Collin County (03/04/2000).
20. Scatter diagram of relative population estimation error vs. population density for GWR model (Denton County).
21. Scatter diagram of relative population estimation error vs. population density for GWR model (Denton and Collin County).
22. Scatter diagram of relative population estimation error vs. population density for GWR model (Denton and Collin County).
23. Sparsely populated region on aerial, Landsat TM and classified image.
24. Lake shore (sandy beach) and residential built-up on aerial, TM, and classified image.
LIST OF TABLES
1. Population estimates from July 1, 2006 to July 1, 2007
2. Error matrix for maximum likelihood classification (MLC)
3. Error matrix for impervious surface data
4. Error matrix for spectral response alone
5. Summary of linear regression model results - Denton County
6. Summary of population estimates produced by general regression model for Denton County
7. Summary of linear regression model results - Collin County
8. Summary of population estimates produced by general regression model for Collin County
9. Summary of linear regression model results - Denton and Collin County combined
10. Summary of population estimates produced by general regression model (Denton & Collin County combined)
11. Summary of geographically weighted regression model results
CHAPTER 1
INTRODUCTION
Why Estimate Population?
Half of the world’s human population now lives in urban settlements, which
are growing rapidly (UNCHS, 2001). According to the United Nations Human
Settlements Programme (UN-Habitat), nearly 60% of the world population will be
urban dwellers by 2030. This rapid population growth has a direct impact on all
aspects of human development, such as social behavior, health, education,
gender equality, economic development, job opportunities, and the environment.
Population growth affects the sustainability of natural resources through
processes of environmental deterioration, including deforestation and loss of
biodiversity. Therefore, there is an urgent need to develop methods that can
accurately estimate the spatial distribution of populations. Such methods will
allow decision makers and environmental planners to develop a better understanding
of the complex relationships between population growth, social and economic impacts,
environmental conditions, and decision-making processes (Lu, Weng, & Li, 2006).
Traditional population estimation is based on the census, which provides extensive
information on demographic parameters but is also
labor intensive, time consuming, and costly. Furthermore, in the United States,
census data are collected only once every ten years, which is inadequate for
modeling the population dynamics of rapidly changing urban environments
(Lee & Goldsmith, 1982). As a result, decennial census data
become less representative of urban settlements that are
developing rapidly. For example, Denton County and Collin County were both listed
among the top 25 U.S. counties with the largest numeric increase in
population in one year (July 1, 2006 to July 1, 2007) in a 2008 U.S. Census Bureau
report (www.census.gov). Recent demographic data (population estimates and
projections) have become a more important source of information for
various applications, including decision making in marketing,
planning, government, and business. However, the resources available to
collect up-to-date demographic information for rapidly growing regions are still
inadequate.
Why Use Remote Sensing for Population Estimation?
Many researchers have used remotely sensed data such as aerial
photographs to estimate population in urban settlements since the 1950s. For
example, Green (1956) used aerial images to count the number of dwelling units
and dwelling types for his demographic analysis of Birmingham, Alabama.
Similarly, Collins and El-Beik (1971) used high spatial resolution images of
Leeds, England to investigate the correlation between dwelling type and resident
population. Lo (1986a) successfully applied the dwelling method (dwelling count and
average household size derived from 1:20,000 scale aerial images) to estimate the
population of Athens, Georgia. The outcomes of Collins and El-Beik
(1971) and Lo (1986b) indicate that remote sensing techniques can be used to
estimate the population of small areas with high accuracy.
Early methods based on manual interpretation of remotely sensed data
were time consuming, tedious, labor intensive, and not feasible for
large areas (e.g., metropolitan or county level). They also require high spatial
resolution images and hence are not suitable for images with resolution coarser than
1 meter. In addition, consistency of the results might be an issue because the
results depend heavily on the judgment of the image analyst (Zha, Gao, & Ni, 2003).
As a result, image classification methods were employed to overcome the
shortcomings of manual methods. Image classification can provide researchers
with additional information (such as land use type, transportation network, and
impervious surface) related to urban settlements that plays a crucial role in
population estimation (Hardin, Jackson, & Shumway, 2007; J. T. Harvey, 2002a;
S. Wu, Qiu, & Wang, 2005). Furthermore, readily available images from different
space-borne and airborne sensors with various spatial resolutions make it
feasible to derive ancillary information to assist population estimation. To sum
up, remote sensing methods have been developed for population estimation
because (1) remotely sensed images can provide spatial and spectral information
for residential areas; (2) remotely sensed images can cover large geographic
areas to support population estimation at different scales at lower cost; and (3)
computer-based digital image analysis methods greatly facilitate information
extraction from remotely sensed images.
Current Practice of Population Estimation Using Remote Sensing
In the United States, “small area” generally refers to counties and their
subdivisions. However, some also apply the term “small area” to areas
composed of census tracts, block groups, and blocks. Remotely sensed
images with various spatial resolutions (high, medium, low) have been used for
estimating small area population. For example, high spatial resolution aerial
images were used by Lo and Welch (1977) and Lo (1986a).
Harvey (2002b; 2003), Lo (2003), and Li and Weng (2005) used medium spatial
resolution Landsat Thematic Mapper (TM) images, and Sutton et al. (1997; 2001)
used low spatial resolution data from the Defense Meteorological
Satellite Program Operational Linescan System (DMSP OLS). However, each
resolution has its own complications. For example, images with very
high spatial resolution, such as aerial photographs and IKONOS images, can
create processing problems because of their massive data volume and possible
spatial distortions when working on large areas; likewise, low spatial resolution
data, such as DMSP OLS, cannot provide sufficiently detailed information for
population estimation at the regional and local levels. Because of their relatively
rich spectral information for land cover mapping and intermediate spatial
resolution to cover large areas, medium spatial resolution images, such as
Landsat TM/Enhanced TM (ETM+) images, have become the main image source
for population estimation (J. T. Harvey, 2002b; 2003; Li & Weng, 2005; Lo, 1995;
2003).
Different methods have been used for residential population estimation
based on remotely sensed data. Lo (1986b) summarized four distinct
approaches that are mainly used in remote sensing literature. They are based
on:
1. Counting individual dwelling units on high spatial resolution imagery
2. Extracting the size of the urban settlement from medium or high spatial
resolution images
3. Using land-use type classification for extracting urban settlement
4. Using automated digital image classification based on spectral features
of satellite imagery
The first three approaches have traditionally relied on visual
interpretation and analysis. The fourth technique, however, has emerged as a
different methodology; it can be applied to any remotely sensed data using
the particular spectral information and spatial resolution provided by the image
(J. T. Harvey, 2002b). Under Lo’s fourth approach, researchers have invested
considerable effort in developing automated digital image analysis techniques
using various computer-assisted methods. Among the various digital image
analysis techniques used for population estimation, supervised
maximum likelihood classification (MLC) is the most commonly used in the remote
sensing literature.
The basic MLC principle relies on decision rules that assign image pixels
to particular classes based on probabilities. This classification method is faster
and less labor intensive than traditional census approaches. However,
automatic classification of remotely sensed data for extracting urban settlements
is difficult to achieve at high levels of accuracy because of the diverse
range of land cover types associated with the urban environment (Zha et al.,
2003). Most automated image classification methods (including MLC)
require, to some extent, training samples to run the algorithms. The size, location,
and representativeness of the training samples play a pivotal role in the
reliability of the output of these classification methods. As a result, conventional
supervised classification is fairly time consuming and labor intensive
(Zha et al., 2003).
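The MLC decision rule described above can be illustrated with a minimal sketch. It assumes two spectral bands, Gaussian class distributions, and equal prior probabilities; the function names (fit_class, log_likelihood, classify) and all training reflectance values are hypothetical and are not taken from this study or from ERDAS IMAGINE.

```python
import math

def fit_class(samples):
    """Estimate the mean vector and 2x2 covariance of two-band training pixels."""
    n = len(samples)
    m0 = sum(s[0] for s in samples) / n
    m1 = sum(s[1] for s in samples) / n
    c00 = sum((s[0] - m0) ** 2 for s in samples) / (n - 1)
    c11 = sum((s[1] - m1) ** 2 for s in samples) / (n - 1)
    c01 = sum((s[0] - m0) * (s[1] - m1) for s in samples) / (n - 1)
    return (m0, m1), (c00, c01, c11)

def log_likelihood(pixel, mean, cov):
    """Log of the bivariate normal density evaluated at `pixel`."""
    c00, c01, c11 = cov
    det = c00 * c11 - c01 * c01
    d0, d1 = pixel[0] - mean[0], pixel[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (c11 * d0 * d0 - 2 * c01 * d0 * d1 + c00 * d1 * d1) / det
    return -0.5 * (maha + math.log(det)) - math.log(2 * math.pi)

def classify(pixel, models):
    """Assign the pixel to the class with the highest likelihood (equal priors assumed)."""
    return max(models, key=lambda name: log_likelihood(pixel, *models[name]))

# Hypothetical training pixels: (band A reflectance, band B reflectance).
training = {
    "residential": [(0.20, 0.25), (0.22, 0.27), (0.19, 0.24), (0.23, 0.26)],
    "non_residential": [(0.05, 0.40), (0.06, 0.45), (0.04, 0.42), (0.07, 0.41)],
}
models = {name: fit_class(pixels) for name, pixels in training.items()}
label = classify((0.21, 0.26), models)
```

The sketch also shows why training samples matter: the class means and covariances come entirely from the samples, so unrepresentative samples shift every decision boundary.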
Langford et al. (1991) used supervised land use classification to estimate the
population of northern Leicestershire. They
used regression analysis with the number of pixels in each land use
category as explanatory variables. Likewise, Lo (1995) used a regression
model to estimate population and the number of dwelling units with reflectance
and pixel counts as explanatory variables. Qiu et al. (2003) tested the regression
analysis approach using geographic information system (GIS) derived
transportation networks (road networks) to estimate population.
Similarly, the dasymetric model is another well-known technique that uses
ancillary information from satellite imagery to estimate population.
Harvey (2000) adopted the dasymetric model in his study and argued that the
method significantly improved the efficiency of land use classification in
determining residential population estimates. His method was later
supported by Wu et al. (2005), who argued that the dasymetric method
produces more accurate estimates with remotely sensed ancillary
information than without it. In addition, more remote
sensing attributes, such as texture and temperature, have been included as ancillary
information in remote sensing population research.
Wu and Murray (2005) and Lu et al. (2006) explored the possibility of
using impervious surface (any surface that water cannot infiltrate)
as a remote sensing attribute for population estimation.
Impervious surfaces are important ancillary information because they are associated
with roads, buildings, and other built-up areas that are relatively stable. Lu et al.
(2006) also used a regression model to estimate population. Their
approach produced an overall population estimation error of -0.97% for the study
area.
Scientists are seeking ways to extract residential features more quickly and
precisely in order to develop a base for understanding the complexity of urban
ecosystems. To overcome the shortcomings of the available
approaches, Ridd (1995) proposed a pixel-based classification method that
uses the spectral properties of green vegetation, impervious surface material, and
surface soil to delineate urban pixels, as these attributes are major components
of the urban ecosystem. Ridd (1995) argued that the resulting vegetation-impervious
surface-soil (V-I-S) model can discriminate urban built-up areas with high accuracy.
Qiao et al. (2009) also developed a pixel-based “unified conceptual model” for
discriminating urban areas more precisely. This model was based on Ridd’s V-I-S
model and uses spectral information such as spectral indices and texture of the
remotely sensed data to perform image classification. Qiao et al. (2009) used a
hierarchical classification method that defines specific rules for classifying
land use classes based on the spectral properties of the features.
Regression modeling techniques applied to remotely sensed data have been
widely used for population estimation. Wu et al. (2005) argue that regression analysis
is widely used in the remote sensing literature on population estimation because
its accuracy can be tested in an unbiased way through statistical significance.
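As an illustration of the regression approach discussed above, the following minimal sketch fits an ordinary least squares line relating population to a single remotely sensed predictor. The function name fit_ols, the choice of residential pixel count as the predictor, and all data values are hypothetical assumptions for illustration only; they are not the models or data used in this study.

```python
def fit_ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical block groups: residential pixel counts and census populations.
residential_pixels = [100, 200, 300, 400]
population = [350, 650, 950, 1250]

intercept, slope = fit_ols(residential_pixels, population)
# Estimated population for an unsampled block group with 250 residential pixels.
estimate = intercept + slope * 250
```

In practice the fitted coefficients would come from a sample of census units and would then be applied to the remaining units, whose census counts serve as a check on the estimates.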
To simplify the delineation of different land cover
classes using automated image analysis, researchers have developed
various indices derived from remotely sensed data.
The normalized difference vegetation index (NDVI) is one of the most commonly used
indices for delineating vegetation. The normalized difference water index
(NDWI) is used for mapping open water bodies from remotely sensed data.
Similarly, the normalized difference built-up index (NDBI) and the normalized difference
blue band built-up index (NDBBBI) are two other indices developed for mapping
urban settlements using satellite image data (Baraldi et al., 2006; Zha et al.,
2003).
Research Objectives
The objectives of the research are (1) to develop automated knowledge-based
classification models for extracting residential areas from Landsat
Thematic Mapper (TM) imagery; and (2) to develop linear regression and
geographically weighted regression (GWR) models using classified images and
census data to estimate population for Denton County and Collin County, Texas,
United States.
CHAPTER 2
STUDY AREA AND DATA
Study Area
Denton County and Collin County in north Texas were selected as the
study area for this research. According to the U.S. Census Bureau, both counties
were listed among the top 25 counties with the largest numeric population
increase within a year (July 1, 2006 to July 1, 2007). According to Census 2000,
Denton County had 189 block groups, and Collin County had 282 block groups.
Figure 1. Study area.
Table 1
Population Estimates from July 1, 2006 to July 1, 2007
      Geographic Area                        Population Estimates          Change, 2006 to 2007
Rank  County                 State           July 1, 2007  July 1, 2006    Number   Percent
1     Maricopa County        Arizona         3,880,181     3,778,598       101,583  2.7
2     Riverside County       California      2,073,571     2,007,206       66,365   3.3
3     Harris County          Texas           3,935,855     3,876,306       59,549   1.5
4     Clark County           Nevada          1,836,333     1,777,168       59,165   3.3
5     Tarrant County         Texas           1,717,435     1,668,541       48,894   2.9
6     Bexar County           Texas           1,594,493     1,555,192       39,301   2.5
7     Wake County            North Carolina  832,970       794,129         38,841   4.9
8     Collin County          Texas           730,690       696,383         34,307   4.9
9     Travis County          Texas           974,365       941,577         32,788   3.5
10    Mecklenburg County     North Carolina  867,067       835,328         31,739   3.8
11    Pinal County           Arizona         299,246       268,316         30,930   11.5
12    Orleans Parish         Louisiana       239,124       210,198         28,926   13.8
13    Dallas County          Texas           2,366,511     2,337,956       28,555   1.2
14    Santa Clara County     California      1,748,976     1,720,839       28,137   1.6
15    Fulton County          Georgia         992,137       964,649         27,488   2.8
16    Gwinnett County        Georgia         776,380       749,836         26,544   3.5
17    San Diego County       California      2,974,859     2,948,362       26,497   0.9
18    Denton County          Texas           612,357       586,582         25,775   4.4
19    King County            Washington      1,859,284     1,834,194       25,090   1.4
20    Fort Bend County       Texas           509,822       485,482         24,340   5.0
21    Williamson County      Texas           373,363       350,879         22,484   6.4
22    Hidalgo County         Texas           710,514       689,494         21,020   3.0
23    Lee County             Florida         590,564       570,089         20,475   3.6
24    San Bernardino County  California      2,007,800     1,987,505       20,295   1.0
25    Montgomery County      Texas           412,638       393,233         19,405   4.9
Source: Population Division, U.S. Census Bureau
Release Date: March 20 2008
Datasets
The datasets acquired for this project are:
1. Landsat TM images: Three TM images acquired in 2000 (2000/03/04 and
2000/03/12) were downloaded from the United States Geological Survey
(USGS) for this particular research purpose. They were used for retrieving
ancillary information (residential area) of the study area.
2. Impervious surface image: The impervious surface image (2001) used in
this research was downloaded from the United States Geological Survey
(USGS).
3. Census shapefiles: Census shapefiles created by the U.S. Census Bureau for
2000 were used to build regression models to estimate population.
4. Census population data: U.S. Census Bureau population data (2000) were
used to build regression models and test population estimation accuracy.
5. Aerial image: An aerial image of Denton County acquired in 2000 was used
as reference for image classification accuracy assessment.
Software packages used in this research include ERDAS IMAGINE 9.3 for image
processing and ArcGIS 9.3 for GIS data analysis.
CHAPTER 3
METHODOLOGY
Image Calibration
Chander and Markham (2003) developed methods and parameters to
correct radiometric calibration errors caused by degradation of the sensor’s
internal calibrator over long-term use. According to Chander and Markham
(2003), the calibration process improves the attributes derived from remotely
sensed data, such as spectral radiance, reflectance, and temperature estimates,
providing a better basis for comparing images acquired on different dates and/or by
different sensors. The methods and procedures suggested by Chander and
Markham (2003) for post-calibration of images are:
a. Conversion from digital number (DN) to radiance:
\[ L_{\lambda} = \left( \frac{LMAX_{\lambda} - LMIN_{\lambda}}{Q_{calmax} - Q_{calmin}} \right) (Q_{cal} - Q_{calmin}) + LMIN_{\lambda} \]
Where,
Lλ = spectral radiance at sensor’s aperture in W/(m2*sr*μm);
Qcal = quantized calibrated pixel value in DNs;
Qcalmin = minimum quantized calibrated pixel value (DN = 0)
corresponding to LMINλ;
Qcalmax = maximum quantized calibrated pixel value (DN = 255)
corresponding to LMAXλ;
LMAXλ = spectral radiance that is scaled to Qcalmax in W/(m2*sr*μm);
LMINλ = spectral radiance that is scaled to Qcalmin in W/(m2*sr*μm).
b. Radiance to reflectance:
\[ \rho_{P} = \frac{\pi \cdot L_{\lambda} \cdot d^{2}}{ESUN_{\lambda} \cdot \cos\theta_{s}} \]
Where,
ρP = unitless planetary reflectance;
Lλ = spectral radiance at sensor’s aperture;
d = earth-sun distance in astronomical units;
ESUNλ = mean solar exoatmospheric irradiance;
θs = solar zenith angle in degrees.
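The two calibration steps above can be sketched in a few lines of code. This is a minimal illustration that assumes the standard linear DN-to-radiance scaling and the planetary reflectance formula defined by the symbols listed above; the function names and the LMIN, LMAX, ESUN, distance, and zenith values in the example are hypothetical placeholders, not the calibration parameters used for the images in this study.

```python
import math

def dn_to_radiance(qcal, lmin, lmax, qcalmin=0, qcalmax=255):
    """Convert a quantized pixel value (DN) to at-sensor spectral radiance."""
    gain = (lmax - lmin) / (qcalmax - qcalmin)
    return gain * (qcal - qcalmin) + lmin

def radiance_to_reflectance(radiance, esun, d, solar_zenith_deg):
    """Convert spectral radiance to unitless planetary reflectance."""
    cos_zenith = math.cos(math.radians(solar_zenith_deg))
    return (math.pi * radiance * d ** 2) / (esun * cos_zenith)

# Hypothetical band parameters, for illustration only.
radiance = dn_to_radiance(qcal=128, lmin=-1.5, lmax=193.0)
reflectance = radiance_to_reflectance(radiance, esun=1957.0, d=0.992,
                                      solar_zenith_deg=45.0)
```

The same two functions would be applied band by band, since LMIN, LMAX, and ESUN differ for each spectral band.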
Impervious Dataset
Surfaces or features that prevent water from infiltrating into the soil are
defined as impervious surfaces. They are a major component of urban
infrastructure and are therefore considered an important indicator of urban
settlement in remotely sensed datasets (C. Wu & Murray, 2005). Impervious images represent
the percentage of impervious surface in a pixel. They are produced using
remotely sensed data such as ETM+ and Terra ASTER, and tend to preserve
more spectral information that can be useful for urban land use classification (Ji
& Jensen, 1999; Li & Weng, 2005). The impervious image used in this research
was prepared by USGS using spectral information from ETM+ dataset.
Calculation of Indices
Normalized difference vegetation index (NDVI):
NDVI is widely used for predicting vegetation characteristics from remote
sensing images. Vegetation has low reflectance in the red (R) band and high
reflectance in the near-infrared (NIR) band. NDVI uses the contrast
between these two bands to detect vegetation.
\[ NDVI = \frac{NIR - R}{NIR + R} \]
Where, NIR = reflectance value of near-infrared band
R = reflectance value of red band
Modified normalized difference water index (MNDWI):
This index is generally used to identify water bodies in
satellite imagery. MNDWI uses the relation between the green (G) band and the
mid-infrared (MIR) band to separate water pixels from other pixels because water
has higher reflectance in the green band and lower reflectance in the MIR band.
\[ MNDWI = \frac{G - MIR}{G + MIR} \]
Where, MIR = reflectance value of mid-infrared band
G = reflectance value of green band
Normalized difference built-up index (NDBI):
This index was proposed by Zha et al. (2003) for mapping urban built-up
areas instead of using the NDVI and MNDWI indices. They concluded that urban
settlements have higher reflectance in the mid-infrared band, and that using the
mid-infrared and near-infrared bands to define the index is more appropriate.
\[ NDBI = \frac{MIR - NIR}{MIR + NIR} \]
Where, MIR = reflectance value of mid-infrared band
NIR = reflectance value of near-infrared band
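NDVI, MNDWI, and NDBI share the same normalized difference form, (A - B) / (A + B), which can be sketched as follows. The sketch assumes the standard band pairings for these three indices; the function names, the handling of a zero denominator, and the pixel reflectance values are illustrative assumptions.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both inputs sum to zero (an assumed convention)."""
    total = a + b
    if total == 0:
        return 0.0
    return (a - b) / total

def ndvi(nir, red):
    """Vegetation index: high when NIR reflectance exceeds red reflectance."""
    return normalized_difference(nir, red)

def mndwi(green, mir):
    """Water index: high when green reflectance exceeds mid-infrared reflectance."""
    return normalized_difference(green, mir)

def ndbi(mir, nir):
    """Built-up index: high when mid-infrared reflectance exceeds NIR reflectance."""
    return normalized_difference(mir, nir)

# Hypothetical reflectance values for a single vegetated pixel.
pixel = {"green": 0.08, "red": 0.05, "nir": 0.45, "mir": 0.20}
vegetation_score = ndvi(pixel["nir"], pixel["red"])
water_score = mndwi(pixel["green"], pixel["mir"])
built_up_score = ndbi(pixel["mir"], pixel["nir"])
```

Because each index is bounded between -1 and 1, the resulting images can be compared against fixed thresholds regardless of overall scene brightness.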
Normalized difference blue band built-up index (NDBBBI):
This index was discussed by Baraldi et al. (2006) as a suitable index for
detecting urban pixels. It exploits the relation between the mid-infrared and
blue (B) bands to delineate urban pixels in remotely sensed data.
Where, MIR = reflectance value of mid-infrared band
B = reflectance value of blue band
Wetness index (WI):
Spectral properties of soil depend on various soil attributes such as soil
type, texture, moisture, and organic matter content. Satellite
imagery can therefore show varying spectral
behavior depending on soil properties and class, which makes determining
the wetness index of soil from satellite imagery a challenge. However, moist soil
generally shows similar spectral properties, which is useful for delineating it from
other classes such as vegetation, residential, and commercial. In general, wet
soil exhibits low reflectance values in all TM bands, and the soil wetness index
(WI) obtained from the tasseled cap indices (explained by Crist et
al. (1986)) can assist in delineating it from other land cover types such as
vegetation, industrial, and residential classes in remotely sensed data (Crist et
al., 1986).
The wetness index (WI) can be defined as (Todd, Hoffer, & Milchunas, 1998):
Knowledge-Based Classification
This classification method is also called expert classification or rule-based
classification. It integrates and processes
information available in multiple knowledge layers (e.g., spectral, temporal) from
remotely sensed data to produce a single classified image. For land use
classification, a knowledge-based classifier lets the user specify
the required attributes or classes based on the user’s knowledge and land use
characteristics, as well as classification rules based on multi-spectral and
multi-temporal remotely sensed data. This classification approach follows a
hierarchical expert decision tree method to classify defined variables.
Different knowledge-based layers such as impervious surface, NDVI, MNDWI,
NDBI, and NDBBBI were used within a hierarchical classification tree to produce
the classified image required for this research. All processing was performed using the
“knowledge engineer” tool available in ERDAS IMAGINE 9.3 (ERDAS, 2008).
Figure 2. Flowchart of methodology.
Figure 3. Flowchart of knowledge-based classification model.
Development of the Knowledge-Based Classification Model for Denton County
The images produced by the indices (NDVI, MNDWI, NDBI, NDBBBI, and WI) from
the TM image showed that specific land-use types have notably different
index values compared with other land-use types. For example, the NDVI image
showed notably high index values for vegetated areas; water is represented
by the highest index values on the MNDWI image; and the NDBI image showed
higher index values in residential built-up areas than in vegetated areas.
These results helped define threshold values in the hierarchical
knowledge-based classification model for extracting residential pixels from
the TM image.
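The hierarchical rules can be pictured as a sequence of threshold tests on the index layers. The sketch below only illustrates that idea: every threshold value, the rule order, and the class codes are hypothetical placeholders, not the rules of the thesis model.

```python
import numpy as np

# Hypothetical class codes and thresholds (placeholders, not thesis values).
OTHER, WATER, VEGETATION, RESIDENTIAL = 0, 1, 2, 3
T_MNDWI, T_NDVI, T_NDBI = 0.3, 0.5, 0.0

def classify(ndvi, mndwi, ndbi):
    """Apply rules in order; the first matching rule assigns the class."""
    ndvi, mndwi, ndbi = (np.asarray(v, dtype=float) for v in (ndvi, mndwi, ndbi))
    conditions = [
        mndwi > T_MNDWI,  # rule 1: high MNDWI -> water
        ndvi > T_NDVI,    # rule 2: high NDVI -> vegetation
        ndbi > T_NDBI,    # rule 3: high NDBI -> residential built-up
    ]
    choices = [WATER, VEGETATION, RESIDENTIAL]
    # np.select takes the first true condition per pixel, else the default.
    return np.select(conditions, choices, default=OTHER)

labels = classify(ndvi=[0.7, 0.1, 0.2, 0.1],
                  mndwi=[-0.2, 0.6, -0.1, -0.3],
                  ndbi=[-0.3, -0.4, 0.1, -0.1])
```

Ordering the rules from the most separable class to the least separable mirrors the hierarchical structure: a pixel claimed by an early rule is never reconsidered by a later one.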
Figure 4. Index images generated from the TM image.
(a) Landsat image; (b) NDVI; (c) MNDWI; (d) NDBI; (e) Wetness index; and (f) NDBBBI.
Figure 5. Classification indices value of various land use type vs. residential land
use type.
(a) Vegetation vs. residential (NDVI); (b) Water vs. residential (MNDWI);
(c) Commercial/Industrial/Transportation (CIT) vs. residential (NDVI).
Horizontal axis: samples 1 to 30.
Figure 6. Classification indices value of various land use type vs. residential land
use type.
(a) Soils vs. residential (NDBI); (b) Soils/construction sites vs. residential (NDBBBI);
(c) Wetland vs. residential (wetness index, WI).
Horizontal axis: samples 1 to 30.
Figure 7. Spectral response of band 4 and band 7 in residential built-up area.
Axes: reflectance vs. samples 1 to 30, for band 4 and band 7.
The graphs in Figures 5, 6, and 7 clearly show the usefulness of various
spectral indices in discriminating residential land use from other land use
types. For example, the normalized difference vegetation index (NDVI) graphs
show that vegetation, commercial, and industrial pixels can be distinguished
from urban pixels (Figures 5a, 5c). Similarly, the normalized difference
built-up index (NDBI) and the normalized difference blue band built-up index
(NDBBBI) can be used for delineating urban built-up areas from surrounding
features such as bare soil and construction materials/sites (Figures 6a,
6b). Figure 7 shows that the spectral responses of band 4 (NIR) and band 7
(TM7) in the residential land use type can be used to define additional
rules in the knowledge-based model for extracting residential built-up
pixels.
Figure 8. Knowledge-based classification model developed for land use
classification using Landsat 7 ETM+ image classification rules and conditions.
Accuracy Assessment of Classified Images
Accuracy assessment for remote sensing image classification is the process
of evaluating the quality of a classified image by identifying and
quantifying classification errors. It is important because it defines the
reliability of the information provided by the image, which in turn can be
used in various decision-making processes (Congalton & Green, 1999).
Since the study concentrates on classifying urban settlements, the
classified image produced from the knowledge-based model was reclassified
into two groups: “residential” and “non-residential”. In the next step, the
“Accuracy Assessment” tool available in ERDAS IMAGINE 9.3 was used to test
the accuracy of the reclassified images. Random sample plots were allocated
across the study area (Denton County), and a 3×3 pixel cluster was selected
as the minimal area for defining pixel classification. An aerial image of
Denton County acquired in 2000 was used as the reference for manually
assigning the class of each allocated sample.
Input Data for Regression Models
The census block group level is used to generate the regression model for
producing population estimates. In this step, the total number of
residential pixels that lie within each census block group is calculated
using a zonal statistics tool in ArcGIS. This tool calculates statistics
(e.g. sum, mean, median, standard deviation) on the values of a raster
within the zones of another dataset (raster or vector) and reports the
results to a table (ArcGIS, ESRI).
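The zonal step amounts to counting residential pixels per block group and converting the count to an area. A minimal sketch, assuming the classified image and a rasterized block-group layer are aligned NumPy arrays of the same shape, and assuming 30 m × 30 m TM pixels (900 sq. m each). The array contents and the function name are hypothetical, and this is not the ArcGIS tool itself.

```python
import numpy as np

PIXEL_AREA_SQ_M = 30 * 30  # one Landsat TM pixel (assumed 30 m resolution)

def residential_area_by_zone(residential_mask, zone_ids):
    """Return {zone id: residential area in sq. m} for aligned rasters."""
    mask = np.asarray(residential_mask, dtype=bool)
    zones = np.asarray(zone_ids)
    # bincount sums the 0/1 mask values separately for every zone id.
    counts = np.bincount(zones.ravel(), weights=mask.ravel().astype(float))
    return {int(z): float(counts[z]) * PIXEL_AREA_SQ_M for z in np.unique(zones)}

# Hypothetical 3 x 3 rasters: 1 = residential pixel; zone ids 1 to 3.
residential = np.array([[1, 1, 0],
                        [0, 1, 0],
                        [0, 0, 1]])
block_groups = np.array([[1, 1, 2],
                         [1, 2, 2],
                         [3, 3, 3]])
areas = residential_area_by_zone(residential, block_groups)
```

The resulting per-zone area is the quantity used as the independent variable x in the regression models that follow.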
Regression Modeling
Regression is a statistical technique used for investigating relationships
between a given (i.e. dependent) variable and one or more other (i.e.
independent) variables. Previous work on population estimation has used
regression methods to determine correlations between pixel spectral
reflectance values and population density. For example, Iisaka and Hegedus
(1982) used the spectral reflectance of bands 4, 5 and 7 as input to develop
a regression model to estimate population distribution in Tokyo, Japan.
Similarly, Harvey (2002a) used different spectral characteristics such as
indices, band sums, and band differences as variables for regression
analysis to estimate population.
Two regression analysis methods, linear regression and geographically
weighted regression (GWR), are used in this study.
Linear regression model:
The linear regression model can be defined as follows:
$$P_e = a + b x$$
Where,
Pe= population estimates
a = regression intercept
b = slope
x = area of residential pixels in a block group
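As a sketch of how such a model can be fitted and applied, the following uses ordinary least squares via NumPy. The sample areas and populations are made-up numbers for illustration; only the model form (population as intercept plus slope times residential area) comes from the text.

```python
import numpy as np

# Hypothetical sample of block groups: residential area (sq. m), population.
area = np.array([200_000.0, 500_000.0, 900_000.0, 1_400_000.0, 2_000_000.0])
population = np.array([1100.0, 1900.0, 2900.0, 4400.0, 5900.0])

# Fit Pe = a + b * x by ordinary least squares (degree-1 polynomial).
# np.polyfit returns the highest-degree coefficient first: slope, intercept.
b, a = np.polyfit(area, population, deg=1)

def estimate_population(x):
    """Apply the fitted model to residential area x (sq. m)."""
    return a + b * np.asarray(x, dtype=float)

# Coefficient of determination (R^2) of the fit on the sample.
residuals = population - estimate_population(area)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((population - population.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
```

A county-level estimate is then the sum of `estimate_population` over all block groups, which is how a fitted block-group model can be compared against the census total.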
Geographically Weighted Regression Model
In spatial datasets, the relationships between the dependent and independent
variables can differ across geographic space; i.e. the same attribute can
have a different effect on the model in different parts of the study region
(Fotheringham, Brunsdon, & Charlton, 2002). However, global regression
models such as linear regression examine the relationship between the
dependent and independent variables without explicitly considering the
variations that may occur due to their spatial context. Therefore, there is
a need for modeling techniques that define the relationship between
variables locally, i.e. the results produced from such models should be
location dependent (Charlton, Fotheringham, & Brunsdon, 2009).
Geographically weighted regression (GWR) is a local spatial statistical
technique that defines and analyzes relationships between attributes that
vary across geographic space (Fotheringham et al., 2002). Unlike a
traditional global model, the GWR model allows the model parameters to vary
with location, thus providing detailed information for understanding and
analyzing geographic data. A GWR model also uses a spatial weighting
function, which defines the influence of neighboring observations according
to spatial variation throughout the study area (Fotheringham et al., 2002).
Because of its ability to incorporate spatial attributes, GWR is widely used
in fields such as geography, remote sensing, and environmental science
(Mennis, 2006). A GWR model can be expressed as (Lo, 2008):
$$Y_i = a_{i0} + \sum_{k=1}^{n} a_{ik} x_{ik} + e_i \quad (i = 1, 2, \ldots, n) \qquad (9)$$
where $Y_i$ and $x_{ik}$ are the dependent and independent variables at
location i, k = 1, 2, …, n, $e_i$ are normally distributed error terms (with
zero mean and constant variance at point i), and $a_{ik}$ is the value of
the k-th parameter at location i.
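A GWR model of this form is fitted separately at each location by weighted least squares, with weights that decay with distance from that location. The sketch below illustrates this with a Gaussian kernel and a fixed bandwidth and a single independent variable. The kernel choice, bandwidth value, coordinates, and data are all illustrative assumptions, and this is not the software implementation used in the thesis.

```python
import numpy as np

def gwr_fit(coords, x, y, bandwidth):
    """Return an (n, 2) array of local [intercept, slope] estimates."""
    coords = np.asarray(coords, dtype=float)
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    params = np.empty((len(y), 2))
    for i, point in enumerate(coords):
        dist = np.linalg.norm(coords - point, axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)  # Gaussian kernel weights
        XtW = X.T * w                               # weight each observation
        # Solve the weighted normal equations (X'WX) a_i = X'Wy.
        params[i] = np.linalg.solve(XtW @ X, XtW @ y)
    return params

# Hypothetical data: the true slope increases from west to east.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
x = rng.uniform(1, 5, size=200)
true_slope = 1.0 + 0.2 * coords[:, 0]
y = 2.0 + true_slope * x
local = gwr_fit(coords, x, y, bandwidth=1.5)
```

A global regression would return one slope for the whole region; here each row of `local` holds the parameters estimated for one location, so the west-to-east trend in the slope is recovered.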
Accuracy Assessment of Population Estimation
Accuracy assessment is an important procedure for testing the developed
regression model in population estimation research. The three error measures
for accuracy assessment suggested by Lu et al. (2006) are:
Relative Error (RE):
Relative error compares the result produced by the developed model with
the census measurement to test the goodness of the model.
$$RE = \frac{P_e - P_r}{P_r} \times 100$$
Where,
Pe= estimated population calculated from the regression model
Pr= reference population (for this research: block group population)
Mean relative error (MRE):
In the same way, MRE can be used to test the overall performance of the
model over the study area.
Where,
RE = relative error; n= total number of census block groups used in the study area
Median relative error (MdRE):
This measure is used to reduce the effect of extreme values on the overall
result.
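The three measures can be computed directly from estimated and reference populations. In this sketch the relative error is taken as (Pe - Pr)/Pr × 100, and the mean and median are taken over absolute relative errors; treating MRE and MdRE as absolute-value statistics is an assumption made here. The block-group values are hypothetical.

```python
import numpy as np

def relative_error(estimated, reference):
    """Signed relative error in percent for each block group."""
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return (estimated - reference) / reference * 100.0

# Hypothetical estimated and reference populations for four block groups.
p_est = np.array([25.0, 900.0, 1500.0, 2400.0])
p_ref = np.array([5.0, 1000.0, 1500.0, 3000.0])

re = relative_error(p_est, p_ref)   # approximately 400, -10, 0, -20
mre = np.mean(np.abs(re))           # mean of |RE| (assumed form)
mdre = np.median(np.abs(re))        # median of |RE| (assumed form)

# County-style total error: compare summed estimates with the summed reference.
total_error = (p_est.sum() - p_ref.sum()) / p_ref.sum() * 100.0
```

In this made-up sample a single small block group with a 400% error pulls the mean to about 107.5%, while the median stays at about 15%, which is the motivation for reporting MdRE alongside MRE.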
CHAPTER 4
RESULTS AND DISCUSSION
Results from Maximum Likelihood Classification (MLC)
The Landsat Thematic Mapper (TM) image of Denton County was classified into
several land use classes, namely vegetation, water,
commercial-industrial-transportation area (CIT), soils, and residential,
based on training samples defined using an aerial image of Denton County.
The classified image was then reclassified into two major classes,
residential and non-residential, for extracting residential built-up areas.
Figure 9. Classified image of Landsat TM produced from MLC classification.
(a) TM classified image (MLC); (b) Reclassified image from (a). Black pixels are classified residential areas.
Table 2
Error Matrix for Maximum Likelihood Classification (MLC)
                        True Data
TM Image                Non-residential   Residential   Total   User Accuracy (%)
Non-residential         229               5             234     97.86
Residential             9                 19            28      67.85
Total                   238               24            262
Producer Accuracy (%)   96.22             79.16
Overall Accuracy (%)    94.65
Accuracy assessment was performed by visual interpretation, using a high
resolution aerial image as reference data. Random samples were selected
throughout Denton County. Although an overall accuracy of nearly 95% was
achieved for the classification as a whole, the classified image achieved
only 67.85% user's accuracy and 79.16% producer's accuracy in extracting
residential areas.
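User's, producer's, and overall accuracy follow directly from the error matrix. The sketch below recomputes them from the counts in Table 2, with rows as the classified (TM image) classes and columns as the reference classes. The function name is illustrative.

```python
import numpy as np

def accuracy_from_error_matrix(matrix):
    """Rows = classified classes, columns = reference (true) classes."""
    m = np.asarray(matrix, dtype=float)
    diag = np.diag(m)                        # correctly classified samples
    user = diag / m.sum(axis=1) * 100.0      # correct / row total
    producer = diag / m.sum(axis=0) * 100.0  # correct / column total
    overall = diag.sum() / m.sum() * 100.0
    return user, producer, overall

# Counts from Table 2, class order: non-residential, residential.
table2 = [[229, 5],
          [9, 19]]
user, producer, overall = accuracy_from_error_matrix(table2)
```

The recomputed values agree with the table to within the last reported digit; the small differences suggest the table values may have been truncated rather than rounded.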
Results from Knowledge-Based Classification of
Landsat TM and Impervious Surface Data
A knowledge-based classification model was developed using rules based on
spectral indices and impervious surface layer characteristics. The TM image
was classified using the developed knowledge-based model and then
reclassified into two major classes: residential and non-residential.
Figure 10. Classified image of Landsat TM after processing using impervious
surface data.
(a) Classified image using Landsat TM and impervious surface layer.
(b) Reclassified residential and non-residential areas based on results in (a).
Table 3
Error Matrix for Impervious Surface Data
                        True Data
TM Image                Non-residential   Residential   Total   User Accuracy (%)
Non-residential         354               5             359     98.60
Residential             6                 31            37      83.78
Total                   360               36            396
Producer Accuracy (%)   98.33             86.11
Overall Accuracy (%)    97.22
Table 3 summarizes the accuracy assessment results for the classified image
produced from the knowledge-based model. This classification produced an
overall accuracy of over 97%, with improved user's and producer's accuracy
for residential built-up areas compared with the MLC result.
Results from Knowledge-Based Classification Using Landsat TM Data Alone
In the same way, a knowledge-based model was developed using Landsat TM
spectral properties alone. The image was classified into a number of land
use classes (vegetation, water, CIT, soils, and residential) based on the
spectral properties described by the spectral indices. The image was then
reclassified into two major classes, residential and non-residential, and
subjected to an accuracy test.
Figure 11. Results from knowledge-based classification using Landsat TM data alone.
(a) TM classified image using spectral attributes.
(b) TM reclassified image using spectral attributes.
Table 4
Error Matrix for Spectral Response Alone
                        True Data
TM Image                Non-residential   Residential   Total   User Accuracy (%)
Non-residential         675               14            689     97.97
Residential             22                133           155     85.81
Total                   697               147           844
Producer Accuracy (%)   96.84             90.48
Overall Accuracy (%)    95.73
Table 4 summarizes the accuracy assessment results for the classified image.
An overall accuracy of over 95% was achieved. The error matrix in Table 4
also shows improved user's and producer's accuracy in delineating
residential pixels.
Compared with the MLC and the knowledge-based model using the impervious
surface layer, the knowledge-based classification model built using only TM
spectral attributes was the most effective method for delineating
residential areas, with the highest user's and producer's accuracy for the
residential class (85.81% and 90.48%). In addition, this model does not need
impervious surface data for a study area, which facilitates its use in other
study areas.
Regression Models
Based on the image classification results described in the previous section,
the classified images produced from the knowledge-based model using Landsat
data alone were used as base images for regression modeling. Random samples
from the census block-group dataset were selected to generate linear
regression models. The developed models were then applied to the entire
study area to estimate the population of the region. Relative errors at the
block-group level were calculated to assess model accuracy in estimating
population.
Linear Regression Models
Denton County (03/04/2000 Image)
Table 5 summarizes 12 linear regression models produced from various census
block-group samples, using block-group population as the dependent variable
and block-group residential pixel area as the explanatory variable. In
addition, a general regression model is derived by combining the models
generated from the different sample groups. A total of 42 random block
groups were selected for generating each linear regression model. The high
R2 values for the sample block groups suggest that the size of the
residential area is highly correlated with population at the block-group
level.
Table 5
Summary of Linear Regression Model Results - Denton County
Linear Regression (Denton County)
Samples   R2       Total Error (%)   Mean Error (%)   Median Error (%)   Model
1         0.7088   6.0               41.40            27.50              y = 0.002804x + 595.18
2         0.7077   -3.68             37.61            26.32              y = 0.002385x + 652.90
3         0.7203   1.38              38.35            24.84              y = 0.002750x + 531.41
4         0.6312   3.7               43.92            31.50              y = 0.002347x + 847.87
5         0.6707   -1.88             35.91            22.92              y = 0.002806x + 419.96
6         0.8093   -1.54             33.38            22.46              y = 0.002896x + 368.90
7         0.6707   0.11              38.84            25.14              y = 0.002568x + 620.61
8         0.7671   2.60              37.97            24.85              y = 0.002952x + 427.86
9         0.6999   4.17              38.18            25.64              y = 0.00311x + 360.29
10        0.8075   -0.66             37.86            24.32              y = 0.002620x + 568.96
11        0.7918   -1.86             38.55            26.30              y = 0.002430x + 665.10
12        0.7451   -5.35             34.63            23.27              y = 0.002676x + 425.01
General Linear Equation Model: Total Error (%) = -1.858; Y=0.002695X+492.92
Selected regression models are shown in Figures 12 and 13. Scatter diagrams
were produced to analyze the relationship between the relative population
estimation errors and population density for each census block group of
Denton County (Figures 14 and 15).
Figure 12. Linear regression models derived from sampling Denton County block-group level.
(a) y = 0.002896x + 368.9, R² = 0.8094; (b) y = 0.002568x + 620.62, R² = 0.6631; (c) y = 0.002952x + 427.87, R² = 0.7672.
Axes: population vs. area (sq. m).
Figure 13. Linear regression models derived from sampling Denton County block-group level.
(a) y = 0.00311x + 360.3, R² = 0.6999; (b) y = 0.002620x + 568.96, R² = 0.8075; (c) y = 0.002430x + 665.1, R² = 0.7919.
Axes: population vs. area (sq. m).
Figure 14. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Denton County.
Panels (a), (b), (c). Axes: relative error (%) vs. population density (persons per sq. km).
Figure 15. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Denton County.
Panels (a), (b), (c). Axes: relative error (%) vs. population density (persons per sq. km).
Denton County (03/12/2000 image)
The classified image for the 03/12/2000 Landsat image was generated using
the developed knowledge-based classification model. In the next step, this
image was used to estimate population using the general regression model
produced for Denton County in Table 5. The total population estimates and
total error produced by the linear regression are summarized in Table 6.
Figure 16 summarizes the relationship between relative population estimation
error and block-group population density for the general regression model in
Table 5.
Table 6
Summary of Population Estimates Produced by General Regression Model for
Denton County
General linear equation (Denton County): Y=0.002695X+492.92
Image        Total population   Est. population   Total Error (%)
03/04/2000   432976             424929            -1.86
03/12/2000   432976             458115            5.80
Figure 16. Scatter diagrams of relative population estimation error vs. population density at the census block-group level generated from general regression for Denton County (03/04/2000).
Axes: relative error (%) vs. population density (persons per sq. km).
Collin County (03/04/2000)
In the same way, Table 7 summarizes 12 linear regression models produced by
selecting random census block-group samples of Collin County. As before,
block-group population was selected as the dependent variable and
block-group residential pixel area as the explanatory variable. A general
regression model is derived by combining the models generated from the
different sample groups.
Table 7
Summary of Linear Regression Model Results - Collin County
Linear Regression (Collin County)
Samples   R2       Total Error (%)   Mean Error (%)   Median Error (%)   Model
1         0.8176   2.01              239.82           21.44              y = 0.002502x + 258.17
2         0.7986   -7.88             283.52           20.85              y = 0.001704x + 570.56
3         0.8137   -6.00             282.86           20.76              y = 0.001799x + 545.99
4         0.6893   4.54              243.94           20.76              y = 0.002581x + 254.61
5         0.7123   1.75              233.13           21.27              y = 0.002548x + 225.86
6         0.8180   -4.09             279.74           20.47              y = 0.001899x + 518.35
7         0.7918   -1.86             299.27           26.30              y = 0.002430x + 665.10
8         0.8504   3.12              314.91           22.18              y = 0.001934x + 622.57
9         0.6472   7.01              231.02           21.91              y = 0.002804x + 161.99
10        0.8034   3.95              309.42           21.50              y = 0.002016x + 587.64
11        0.7918   0.24              229.09           21.82              y = 0.002515x + 219.22
12        0.7595   -2.27             275.24           20.81              y = 0.002015x + 479.09
General Linear Equation Model: Total Error (%) = 2.10; Y=0.00223X+425.75
A total of 60 random block groups were selected as the sample for generating
each linear regression model presented in Table 7. Figure 17 shows graphs of
selected regression models from Table 7. Because of some anomalies in the
relative population estimation error, the scatter diagram was compressed by
those extreme values and became less representative. For better graphical
representation, the extreme values, which account for only 1% of the data,
were removed from the scatter plot (Figure 18).
Figure 17. Linear regression models derived from sampling Collin County block-group level.
(a) y = 0.002016x + 587.65, R² = 0.8034; (b) y = 0.002515x + 219.23, R² = 0.7451; (c) y = 0.002015x + 479.09, R² = 0.7596.
Axes: population vs. area (sq. m).
Figure 18. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Collin County.
Panels (a), (b), (c). Axes: relative error (%) vs. population density (persons per sq. km).
Collin County (03/12/2000 image)
Again, the developed knowledge-based classification model was used to
generate a classified image of Collin County dated 03/12/2000. This image
was used to produce population estimates using the general regression model
from Table 7. A summary of the results is presented in Table 8.
Table 8
Summary of Population Estimates Produced by General Regression Model for
Collin County
General linear equation (Collin County): Y= 0.002229X + 425.75
Image        Total population   Est. population   Total Error (%)
03/04/2000   491675             502015            2.10
03/12/2000   491675             476297            -3.12
Figure 19. Scatter diagrams of relative population estimation error vs. population density at the census block-group level generated by general linear regression for Collin County (03/04/2000).
Axes: relative error vs. population density (persons per sq. km).
Denton & Collin County (03/04/2000 and 03/12/2000 image)
Similarly, a combination of the classified images of Denton and Collin
Counties was used to generate a general regression model for producing
population estimates for both counties. Table 9 summarizes the results of
the regression models produced from different sample groups selected from
both counties.
Table 9
Summary of Linear Regression Model Results - Denton and Collin County
Combined
Linear regression (Denton and Collin County Combined)
Samples   R2       Total Error (%)   Mean Error (%)   Median Error (%)   Model
1         0.7520   8.90              206.87           20.81              y = 0.002653X + 479.54
2         0.6701   10.07             212.25           27.99              y = 0.002646X + 506.57
3         0.7744   -0.80             206.54           24.94              y = 0.002165X + 594.11
4         0.7457   4.26              224.74           27.29              y = 0.002195X + 674.27
5         0.7691   2.72              212.80           25.84              y = 0.002264X + 601.30
6         0.5611   1.99              191.58           23.54              y = 0.002491X + 444.36
7         0.5454   3.29              185.56           23.28              y = 0.002635X + 380.37
8         0.5688   -3.98             217.99           24.79              y = 0.001868X + 717.12
9         0.7251   3.21              239.13           28.54              y = 0.001970X + 794.74
10        0.7083   1.36              238.71           28.27              y = 0.001883X + 812.90
General linear equation model: Total Error (%) = 3.10; Y=0.002277X+600.52
A total of 100 random block groups were selected from Denton and Collin
Counties for generating each linear regression model. In the next step, the
general regression model from Table 9 was used to estimate the population of
Denton County, of Collin County, and of both counties combined. Table 10
summarizes the results of the analysis.
Table 10
Summary of Population Estimates Produced by General Regression Model
(Denton & Collin County Combined)
General linear equation (Denton & Collin County Combined): Y=0.002277X+600.52
Image                 Total population   Est. population   Total Error (%)
03/04/2000 Denton     432976             393807            -9.05
03/04/2000 Collin     491675             559539            13.80
03/04/2000 Combined   924651             953351            3.10
03/12/2000 Denton     432976             421846            -2.57
03/12/2000 Collin     491675             533253            8.45
03/12/2000 Combined   924651             965565            4.42
Geographically Weighted Regression
Table 11 shows the errors calculated from the population estimates when the
geographically weighted regression (GWR) model is applied to the entire
study area. GWR models were developed separately from the 03/04/2000
classified data for Denton County, Collin County, and the two counties
combined. The GWR model based on the 03/04/2000 image was then used to
estimate population from the 03/12/2000 classified image. In this analysis,
adaptive kernels are used, and the bandwidth is determined using cross
validation (CV). Figures 20 to 22 show scatter diagrams of the relationship
between the relative population estimation error obtained from the GWR
models and population density. For better graphical representation, extreme
values, which account for only one percent of the data, were removed from
the scatter plots (Figures 21 and 22).
Table 11
Summary of Geographically Weighted Regression Model Results
Geographically weighted regression
Study area               R2             Total Error (%)   Mean Error   Median Error
Denton (03/04)           Local Models   0.04              38.23        25.00
Denton (03/12)           Local Models   7.1               44.37        29.84
Collin (03/04)           Local Models   0.61              176.11       18.10
Collin (03/12)           Local Models   -5.53             133.16       20.17
Combine (03/04)          Local Models   -0.46             133.10       21.18
Combine (03/12)          Local Models   -0.77             111.47       23.45
Denton (Combine-03/04)   Local Models   -2.0              37.01        26.24
Collin (Combine-03/04)   Local Models   1.16              197.49       19.32
Denton (Combine-03/12)   Local Models   3.61              42.05        26.60
Collin (Combine-03/12)   Local Models   -4.62             157.99       20.91
Figure 20. Scatter diagrams of relative population estimation error vs. population density for GWR model (Denton County).
(a) GWR Denton County (03/04); (b) GWR Denton County (03/12).
Axes: relative error (%) vs. population density (persons per sq. km).
Figure 21. Scatter diagrams of relative population estimation error vs. population density for GWR model (Collin County).
(a) GWR Collin County (03/04); (b) GWR Collin County (03/12).
Axes: relative error (%) vs. population density (persons per sq. km).
Figure 22. Scatter diagrams of relative population estimation error vs. population density for GWR model (Denton and Collin County).
(a) GWR Denton and Collin County (03/04); (b) GWR Denton and Collin County (03/04).
Axes: relative error (%) vs. population density (persons per sq. km).
Discussion
A few observations and limitations faced while producing the above results are
discussed below.
1. Knowledge-based vs. MLC:
Both the MLC and the knowledge-based models produced high overall
accuracy for their respective images. However, closer examination of
the error matrices (Tables 2, 3, 4) showed that the knowledge-based
models discriminate residential pixels more accurately than MLC. In
addition, MLC requires training samples for image classification,
whereas the knowledge-based classification approach classifies an image
without training samples; as a result, knowledge-based models simplify
image classification compared with MLC.
Impervious surface knowledge-based model:
Impervious surface is an important attribute that is closely
associated with urban ecosystems, and it can be very useful for
developing a knowledge-based model. However, an impervious surface
layer was not readily available for the period of study (the
impervious layer available through USGS for this research is from 2001).
On the other hand, the knowledge-based model developed using only TM
spectral attributes produced similar accuracy results. The error
matrices (Tables 3, 4) showed that the knowledge-based model derived from
spectral indices discriminates residential pixels more precisely than
the knowledge-based model derived from the original image bands and
impervious surface data.
2. Spectral knowledge-based model:
The model discriminated the residential land use type with high
accuracy from the Landsat TM image. However, it struggled to extract
residential pixels accurately in sparsely populated neighborhoods.
Since the classification model is based exclusively on the spectral
response of TM bands, in thinly populated regions the spectral
response of the surrounding features dominates that of the residential
pixels and causes classification errors (Figure 23). Another principal
factor affecting classification accuracy in sparsely populated areas is
the medium spatial resolution (30m × 30m) of Landsat TM, which makes it
difficult to extract residential areas where population density is low.
(a) Aerial image (b) TM image (c) Classified TM image
Figure 23. Sparsely populated region on aerial, Landsat TM and classified image.
3. The spectral response of TM band 4 (NIR) is affected by the moisture
content of the surroundings, which may affect the threshold values defined
for indices used in the models, such as NDVI and NDBI. Likewise, Zha et
al. (2003) argue that the consistency of the NDBI for extracting
residential built-up areas might be indirectly affected by the presence of
other land use types that exhibit a seasonal spectral response in TM bands,
such as forests and soils. However, this setback may be overcome by
selecting remotely sensed data captured at a time when the spectral
discrimination between the surrounding features and residential built-up
areas is higher.
4. The developed classification model uses only TM spectral properties to
delineate land use types from each other. One limitation of this approach
is that it is difficult to distinguish residential built-up areas from lake
shores or sandy beaches (where sand and silt concentration is high), because
those features exhibit a similar spectral response in all TM bands (Figure
24).
(a) Aerial image (b) TM image (c) Classified TM image
Figure 24. Lake shore (sandy beach) and residential built-up on aerial, TM, and classified image.
5. The majority of the old residential neighborhoods of the study area are
surrounded by tree canopies. Therefore, images captured during the leaf-off
season are more likely to produce good classification results, because
residential areas can be delineated more precisely when fewer features are
covered by tree canopies. In addition, the image acquired on 03/12/2000
was affected by jet trails in the sky and was corrected by removing the
errors manually.
6. The high R2 values for both the linear and GWR models suggest that
population is strongly correlated with residential built-up area.
7. Comparison of the mean relative error (MRE) and median relative error
(MdRE) in Tables 7, 9, and 11 indicates that the mean is highly affected by
extreme values, especially in the Collin County region.
8. Scatter diagrams produced to analyze the relationship between relative
error and population density (Figures 14, 15, 16, 18, 19, 20, 21, and 22)
indicate that population is often overestimated when the population
density of a census block group is less than approximately 300 persons per
square kilometer, whereas the population count is always underestimated
when population density is greater than approximately 3000 persons per
square kilometer, regardless of the independent variable. The magnitude of
error is very high where population is underestimated, which occurs mainly
in block groups with high population density. Tables 6, 8, 10, and 11
summarize the performance of the linear and GWR models for Denton County,
Collin County, and both counties combined.
9. Similarly, for census block groups with a low population density (for
example, less than 100 persons per square kilometer), the relative error
of population estimation expressed as a percentage can be misleading. For
example, for a census block group with an actual population of 5 and an
estimated population of 25, the relative error is 400%, but the actual
error of 20 persons may be insignificant compared with the total
population of the county.
10. The knowledge-based model uses the spectral reflectance of the TM
bands, and can potentially be applied to extracting urban land use types
in developing countries where alternative sources such as census and other
demographic records (e.g., birth, death, and migration records) are not
readily available or not reliable. In addition, the developed model is
very flexible, because new explicit rules can be defined and threshold
values can be adjusted based on the characteristics of the urban
environment.
Since census or other sources of demographic information may not be
readily available or reliable in many developing countries, sampling
regions covering approximately 15 to 20 percent of the study area could be
defined, and population data could be collected within them using random
sampling techniques. Linear and GWR models can then be created from these
samples to perform population estimation in the study area.
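The behavior described in items 7 and 9 can be illustrated with a short sketch. Only the first pair of values (actual population 5, estimated population 25) comes from the text; the other block groups are made-up illustration values, and taking absolute values before computing the mean and median is an assumption about how MRE and MdRE are defined.

```python
import numpy as np

def relative_errors(actual, estimated):
    """Signed relative error in percent: (estimated - actual) / actual * 100."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return (estimated - actual) / actual * 100.0

# First pair (5, 25) is the example from the text; the remaining block
# groups are hypothetical values chosen only for illustration.
actual = np.array([5, 1200, 1500, 2000, 900])
estimated = np.array([25, 1150, 1560, 1900, 945])

rel = relative_errors(actual, estimated)

# Assumed definitions: mean and median of the absolute relative errors.
mre = np.mean(np.abs(rel))
mdre = np.median(np.abs(rel))

# Relative error of the summed (county-style) total.
total_error = (estimated.sum() - actual.sum()) / actual.sum() * 100.0

print(f"Relative errors (%): {np.round(rel, 1)}")
print(f"MRE = {mre:.1f}%, MdRE = {mdre:.1f}%, total error = {total_error:.2f}%")
```

In this sketch the single 400% value pulls the mean far above the median, while the error of the summed total stays below 1%, which is the pattern the list above describes.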
CHAPTER 5
CONCLUSION
The knowledge-based classification model successfully extracted
residential land-use areas from remotely sensed data by applying rules and
conditions defined by spectral response features over the study area. For
each rule, threshold values were defined, and conditions were set to
extract classification information from the input images. When defining
the rules, several spectral indices were combined to overcome the
limitations of any single index and to achieve higher classification
accuracy than a single index can provide. Compared with other traditional
land use classification approaches, the knowledge-based classification
approach is more efficient and can produce more accurate results. It is
completely automated once the threshold values for the conditions are
defined, and it classifies remotely sensed images without using training
samples, thereby making the results more consistent. Therefore, this model
can be a very useful alternative for researchers and planners who need to
map residential built-up areas quickly.
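A minimal sketch of how such threshold rules might be combined is given below. The rule structure and all threshold values are hypothetical placeholders chosen for illustration; they are not the rules or thresholds of the model developed in this study.

```python
import numpy as np

def classify_residential(ndvi, ndbi, mndwi,
                         ndvi_max=0.2, ndbi_min=0.0, mndwi_max=0.0):
    """Return a boolean mask of pixels flagged as residential built-up.

    All thresholds are hypothetical defaults and would need to be tuned
    for a given scene and study area.
    """
    ndvi = np.asarray(ndvi, dtype=float)
    ndbi = np.asarray(ndbi, dtype=float)
    mndwi = np.asarray(mndwi, dtype=float)

    not_vegetation = ndvi < ndvi_max   # rule 1: exclude vegetated pixels
    not_water = mndwi < mndwi_max      # rule 2: exclude water pixels
    built_up = ndbi > ndbi_min         # rule 3: keep built-up-like pixels

    # A pixel is labeled residential only if every rule is satisfied.
    return not_vegetation & not_water & built_up

# Three illustrative pixels: built-up-like, vegetation-like, water-like.
mask = classify_residential(ndvi=[0.05, 0.60, -0.10],
                            ndbi=[0.15, -0.30, -0.20],
                            mndwi=[-0.25, -0.40, 0.50])
print(mask)
```

Because no training samples are involved, the same thresholds give the same result each time the rules are applied to a given image.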
Landsat TM images contain rich spectral information at medium spatial
resolution, which is used to define various spectral indices such as NDVI,
MNDWI, NDBI, NDBBI, and WI. These indices are useful in delineating the
major land-use types of the urban environment, such as soil, water,
vegetation, and residential area. The total classification accuracy of
over 95% is better than the accuracy produced by general classification
methods such as MLC and other supervised and unsupervised classification
methods.
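As an illustration, three of these indices can be computed from TM band arrays as sketched below. The formulas are the commonly used normalized-difference definitions (NDBI following Zha et al., 2003) and are assumed here rather than taken from this chapter. NDBBI and WI are omitted because their definitions are not restated in this chapter. The function and argument names are hypothetical.

```python
import numpy as np

def normalized_difference(a, b):
    """Compute (a - b) / (a + b), returning 0 where the denominator is 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.zeros_like(denom)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

def tm_indices(tm2, tm3, tm4, tm5):
    """Indices from TM bands 2 (green), 3 (red), 4 (NIR), and 5 (MIR)."""
    return {
        "NDVI": normalized_difference(tm4, tm3),   # (NIR - red) / (NIR + red)
        "NDBI": normalized_difference(tm5, tm4),   # (MIR - NIR) / (MIR + NIR)
        "MNDWI": normalized_difference(tm2, tm5),  # (green - MIR) / (green + MIR)
    }

# Two illustrative pixels; the second has all-zero values to show that
# a zero denominator is handled without a division warning.
indices = tm_indices(tm2=[10, 0], tm3=[20, 0], tm4=[60, 0], tm5=[40, 0])
for name, values in indices.items():
    print(name, values)
```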
A major limitation of this model is the difficulty in distinguishing bare
earth (e.g., sandy beach) from residential built-up areas because of their
similar spectral responses in all TM bands. In fact, this is a major
challenge for Landsat TM data in general. Additional remotely sensed data
such as light detection and ranging (LiDAR) could help to overcome this
challenge, as it can provide building height information to separate sandy
beaches from residential land use. Soils have a wide range of spectral
responses in the TM bands; furthermore, the moisture content of the soil
strongly influences the spectral response in all TM bands and thus can
affect the threshold values defined for the classification model. The
performance of this model in extracting residential pixels from sparsely
populated areas is also limited by the spatial resolution of Landsat TM
images and by the similarity of the spectral response of these pixels to
that of background features.
The population estimation results show that the total accuracy of
population estimation in the study area is controlled by the sign and
magnitude of relative errors at the census block group level. Furthermore,
the median absolute relative errors calculated for the models suggest that
the GWR models outperformed the linear regression models because GWR
better incorporates spatial heterogeneity. However, both the linear
regression and GWR methods underestimated the population count in census
block groups with high population density, and overestimated the
population count in census block groups with low population density.
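The difference between the two model types can be sketched as follows. This is a minimal illustration under stated assumptions, not the software implementation used in this study: it assumes a single predictor, a Gaussian kernel, and a fixed bandwidth, and the function names are hypothetical.

```python
import numpy as np

def ols_fit(x, y):
    """Global linear model y = b0 + b1 * x fitted by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def gwr_predict(coords, x, y, bandwidth):
    """Fitted values from a basic GWR: one weighted regression per location.

    Assumes a Gaussian kernel with a fixed bandwidth (in coordinate units).
    """
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])
    fitted = np.empty(len(y))
    for i in range(len(y)):
        # Distance from location i to every observation.
        d = np.linalg.norm(coords - coords[i], axis=1)
        # Nearby observations receive larger weights.
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        # Weighted least squares via square-root weights.
        sw = np.sqrt(w)
        beta_i, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        fitted[i] = X[i] @ beta_i
    return fitted

# Synthetic data for illustration only (not study data).
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(30, 2))   # block group centroids
area = rng.uniform(1, 5, size=30)           # residential area per block group
population = 2.0 + 3.0 * area + rng.normal(0, 0.5, size=30)

beta = ols_fit(area, population)
ols_fitted = beta[0] + beta[1] * area
gwr_fitted = gwr_predict(coords, area, population, bandwidth=3.0)
```

With a very large bandwidth all weights approach 1 and the local fits converge to the global OLS fit; smaller bandwidths let the coefficients vary over space, which is how GWR accommodates spatial heterogeneity.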
The recommendation for future research is to improve the knowledge-based
model by incorporating other available spatial datasets to increase the
accuracy of the classification process, and to better explore the
performance of linear regression and GWR models. Additional measures may
be required to better examine the spatial dependence and spatial
heterogeneity issues in population estimation using remotely sensed data.
REFERENCES
ArcGIS (Version 9.3) [Computer software]. Redlands, CA: ESRI.
Baraldi, A., Puzzolo, V., Blonda, P., Bruzzone, L., & Tarantino, C. (2006).
Automatic spectral rule-based preliminary mapping of calibrated Landsat TM
and ETM+ images. IEEE Transactions on Geoscience and Remote Sensing,
44(9), 2563-2586.
Chander, G., & Markham, B. (2003). Revised Landsat-5 TM radiometric
calibration procedures and postcalibration dynamic ranges. IEEE
Transactions on Geoscience and Remote Sensing, 41(11), 2674-2677.
Charlton, M., Fotheringham, S., & Brunsdon, C. (2009, February 2).
Geographically weighted regression (NCRM Methods Review Papers,
NCRM/006). Discussion paper. Unpublished manuscript.
Collins, W. G., & El-Beik, A. H. A. (1971). Population census with the aid of aerial
photograph: An experiment in the city of Leeds. Photogrammetric Record,
7(37), 16-26.
Congalton, R. G., & Green, K. (1999). Assessing the accuracy of remotely
sensed data: Principles and practices. Boca Raton, FL: Lewis Publishers.
Crist, E. P., Laurin, R., & Cicone, R. C. (1986). Vegetation and soil information
contained in transformed Thematic Mapper data. International Geoscience
and Remote Sensing Symposium, 2, 1465-1470.
ERDAS IMAGINE (Version 9.3) [Computer software]. Norcross, GA: ERDAS.
Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002).
Geographically weighted regression: The analysis of spatially varying
relationships. Chichester, UK: John Wiley & Sons.
Green, N. A. (1956). Aerial photographic analysis of residential neighborhoods:
An evaluation of data accuracy. Social Forces, 35(2), 142-147.
Hardin, P. J., Jackson, M. W., & Shumway, J. M. (2007). Intraurban population
estimation using remotely sensed imagery. Geo-Spatial Technologies in
Urban Environments: Policy, Practice and Pixels, 47-92.
Harvey, J. T. (2000). Small area population estimation using satellite imagery.
Statistics in Transition, 4(4), 611-633.
Harvey, J. T. (2002a). Estimating census district populations from satellite
imagery: Some approaches and limitations. International Journal of Remote
Sensing, 23(10), 2071-2095.
Harvey, J. T. (2002b). Population estimation models based on individual TM
pixels. Photogrammetric Engineering and Remote Sensing, 68(11), 1181-
1192.
Harvey, J. T. (2003). Population estimation at the pixel level: Developing the
expectation maximization technique. In V. Mesev (Ed.), Remotely sensed
cities (pp. 181-205). London, UK and New York, NY: Taylor and Francis.
Ji, M., & Jensen, J. R. (1999). Effectiveness of subpixel analysis in detecting and
quantifying urban imperviousness from Landsat Thematic Mapper imagery.
Geocarto International, 14(4), 31-39.
Langford, M., Maguire, D. J., & Unwin, D. J. (1991). The areal interpolation
problem: Estimating population using remote sensing in a GIS framework. In
I. Masser & M. Blakemore (Eds.), Handling geographical information:
Methodology and potential applications (pp. 55-77). New York: Wiley.
Lee, E. S., & Goldsmith, H. F. (1982). Population estimates: Methods for small
area analysis. Beverly Hills, CA: SAGE Publications, Inc.
Li, G., & Weng, Q. (2005). Using Landsat ETM+ imagery to measure population
density in Indianapolis, Indiana, USA. Photogrammetric Engineering and
Remote Sensing, 71(8), 947-958.
Lisaka, J., & Hegedus, E. (1982). Population estimation from Landsat imagery.
Remote Sensing of Environment, 12(4), 259-272.
Lo, C. P. (1986a). Accuracy of population estimation from medium-scale aerial
photography. Photogrammetric Engineering and Remote Sensing, 52(12),
1859-1869.
Lo, C. P. (1986b). Applied remote sensing. New York: Longman.
Lo, C. P. (1995). Automated population and dwelling unit estimation from high
resolution satellite images: A GIS approach. International Journal of Remote
Sensing, 16(1), 17-34.
Lo, C. P. (2003). Zone-based estimation of population and housing units from
satellite generated land use/land cover maps. In V. Mesev (Ed.), Remotely
sensed cities (pp. 157). London and New York: Taylor & Francis.
Lo, C. P. (2008). Population estimation using geographically weighted
regression. GIScience & Remote Sensing, 45(2), 131-148.
Lo, C. P., & Welch, R. (1977). Chinese urban population estimation. Annals of
the Association of American Geographers, 67(2), 246-253.
Lu, D., Weng, Q., & Li, G. (2006). Residential population estimation using a
remote sensing derived impervious surface approach. International Journal
of Remote Sensing, 27(16), 3553-3570.
Mennis, J. (2006). Mapping the results of geographically weighted regression.
Cartographic Journal, 43(2), 171-179.
Qiao, Y., Liu, H. P., Bai, M., Wang, X. D., & Zhou, X. Z. (2009). The decision tree
algorithm of urban extraction from multi-source image data. Virtual Changing
Globe for Visualisation and Analysis, Wuhan, China, 38.
Qiu, F., Woller, K. L., & Briggs, R. (2003). Modeling urban population growth using
remotely sensed imagery and TIGER GIS road data. Photogrammetric Engineering
and Remote Sensing, 69(9), 1031-1042.
Ridd, M. K. (1995). Exploring a V-I-S (vegetation-impervious surface-soil) model for
urban ecosystem analysis through remote sensing: Comparative anatomy for cities.
International Journal of Remote Sensing, 16(12), 2165-2185.
Sutton, P., Roberts, D., Elvidge, C. D., & Baugh, K. (2001). Census from heaven: An
estimate of the global human population using night-time satellite imagery.
International Journal of Remote Sensing, 22(16), 3061-3076.
Sutton, P., Roberts, D., Elvidge, C. D., & Meij, H. (1997). A comparison of nighttime
satellite imagery and population density for the continental United States.
Photogrammetric Engineering and Remote Sensing, 63(11), 1303-1313.
Todd, S. W., Hoffer, R. M., & Milchunas, D. G. (1998). Biomass estimation on
grazed and ungrazed rangelands using spectral indices. International
Journal of Remote Sensing, 19(3), 427-438.
UNCHS. (2001). The state of the world’s cities. Nairobi, Kenya: United Nations
Center for Human Settlements (Habitat).
Wu, C., & Murray, A. T. (2005). A cokriging method for estimating population
density in urban areas. Computers, Environment and Urban Systems, 29(5),
558-579.
Wu, S., Qiu, X., & Wang, L. (2005). Population estimation methods in GIS and
remote sensing: A review. GIScience and Remote Sensing, 42(1), 58-74.
Zha, Y., Gao, J., & Ni, S. (2003). Use of normalized difference built-up index in
automatically mapping urban areas from TM imagery. International Journal
of Remote Sensing, 24(3), 583-594.