Data Quality and Error Analysis in GIS
Joshua Greenfeld, PhD, LS Professor emeritus, NJIT
Professor, Israel Institute of Technology
Data Quality and Error Analysis in GIS (c) Dr. J. Greenfeld 1
ABSTRACT
One of the major challenges of GIS is dealing with the
uncertainty and the assessment of the quality of spatial
information.
The challenge is to assess the quality of spatial
information not just the quality of spatial data.
Many professionals are involved in providing GIS
services. Surveying is only one of them.
For surveying to make a mark on the GIS industry and
become a prominent stakeholder in GIS, it has to offer
some expertise that most other professionals cannot.
Unfortunately, the ability to collect spatial data is becoming
a common skill and the surveyor's positioning expertise is
not as unique as it used to be.
ABSTRACT
One area in which surveyors have an advantage over
other GIS professionals is their propensity and ability to
understand and quantify spatial errors and accuracies.
In surveying, the uncertainty and quality assessment is
mostly confined to positioning or positional accuracies.
The quality of surveying results is typically assessed on the
basis of measurement accuracy and the propagation of
these accuracies into other computed quantities.
In GIS, uncertainty and quality issues are much
broader. In addition to positional accuracy there are:
attribute accuracy, completeness of the data, sources and
lineage of the data, logical consistency, fuzziness of the
spatial phenomenon, currency of the data and other
uncertainty issues.
Objective
The objective of this seminar is to enable surveyors to
understand the broader issues of accuracy assessment
beyond positional accuracies.
It will outline the extended definition of uncertainty and
quality as it applies to GIS.
It will include an overview on the errors and uncertainties
that could impact the quality of spatial data.
This will be followed by discussing the impact of errors in
spatial data on spatial information.
The ISO geospatial standards will be reviewed as well.
Finally, some practical tools and examples of numerical
and statistical assessment of uncertainty and quality of
spatial information will be discussed and demonstrated.
Importance of Quality
Gain confidence in geodata
Reduce users' complaints
Achieve customer satisfaction
Minimize consequential costs caused by decisions
or actions based on erroneous data
No unified definition of data quality
1. Data Quality refers to the degree of excellence
exhibited by the data in relation to the portrayal of the
actual phenomena. GIS Glossary
2. The state of completeness, validity, consistency,
timeliness and accuracy that makes data appropriate
for a specific use. Government of British Columbia
3. The totality of features and characteristics of data
that bears on their ability to satisfy a given purpose; the
sum of the degrees of excellence for factors related to
data. Glossary of Quality Assurance Terms
No unified definition of data quality
4. Information Quality : the fitness for use of
information; information that meets the requirements of
its authors, users, and administrators. (Martin Eppler)
5. Data quality: The processes and technologies
involved in ensuring the conformance of data values to
business requirements and acceptance criteria
6. ISO/PAS 26183:2006 defines product data quality as
a measure of the accuracy and appropriateness of
product data, combined with the timeliness with which
those data are provided to all the people who need
them.
And more...
Error and Uncertainty in GIS
• One of the major problems currently existing within GIS is
the aura of accuracy surrounding digital geographic data
• Often hardcopy map sources include a map reliability rating
or confidence rating in the map legend
• This rating helps the user in determining the fitness for use
for the map
• However, rarely is this information encoded in the digital
conversion process
• Because GIS data are in digital form and can be
represented with high precision, they are often considered to be
totally accurate
Error and Uncertainty in GIS
• In reality, a buffer of positional uncertainty exists around each
feature, and the actual location of the feature lies somewhere within it
• For example, data captured at the 1:20,000 scale
commonly has a positional accuracy of ± 20 metres
• This means the actual location of features may vary 20
metres in either direction from the identified position of the
feature on the map
• Considering that the use of GIS commonly involves the
integration of several data sets, usually at different scales
and quality, one can easily see how errors can be
propagated during processing
Error and Uncertainty in GIS
• The ease with which geographic data in a GIS can be
used at any scale highlights the importance of
detailed data quality information.
• Although a data set may not have a specific scale
once it is loaded into the GIS database, it was
produced with levels of accuracy and resolution that
make it appropriate for use only at certain scales, and
in combination with data of similar scales.
Error and Uncertainty in GIS
• Error - Two sources of error:
Inherent and Operational
• Inherent error is the error present in source
documents and data
• Operational error is the amount of error produced through the data capture and manipulation functions of a GIS
• Both contribute to the reduction in quality of the
products that are generated by GIS.
Error and Uncertainty in GIS
• Possible sources of operational errors include:
– Mislabelling of areas on thematic maps
– Misplacement of horizontal (positional) boundaries
– Human error in digitizing
– Classification error
– GIS algorithm inaccuracies
– Human bias
• While error will always exist in any scientific process,
the aim within GIS processing should be to identify existing error in data sources and minimize the amount of error added during processing
Errors in Database Creation
Errors are introduced at almost every step of database
creation
Completeness concerns the degree to which the data exhaust the
universe of possible items
Are all possible objects included within the
database?
Affected by rules of selection, generalization and
scale
Error and Uncertainty in GIS
Error induced by data cleaning, Longley et al., chapter 6, pages 132-133
Merging. Longley et al., chapter 6, pages 132-133
Classification error: the difference in pixel class between the map and a
reference.
[Figure: map panels for 1939, 1956, 1971, and 1995.]
Error and Uncertainty in GIS
• Because of cost constraints it is often more appropriate to
manage error than attempt to eliminate it!
• There is a trade-off between reducing the level of error in a
data base and the cost to create and maintain the
database
• An awareness of the error status of different data sets will
allow the user to make a subjective statement on the quality
and reliability of a product derived from GIS processing
• The validity of any decisions based on a GIS product is
directly related to the quality and reliability rating of the
product
Error and Uncertainty in GIS
• Depending upon the level of error inherent in the source
data, and the error operationally produced through data
capture and manipulation, GIS products may possess
significant amounts of error
Error and Uncertainty in GIS
Tools to get a handle on uncertainty
Models of uncertainty: methods for assessing and
describing error
Error propagation (during analysis)
Fuzzy approaches (membership of classes)
Sensitivity analysis (effect of errors)
Error and Uncertainty in GIS
Error assessment, reporting, interpretation - more difficult
Quality of data: standards and metadata
But: No professional GIS currently in use can present the
user with information about the confidence limits that
should be associated with the results of an analysis.
Classification of Errors in GIS [Hun ‘92]
Source of Error: Data Collection and Compilation; Data Processing; Data Usage
    resulting in
Forms of Error:
    Primary: Positional Error; Attribute Error
    Secondary: Logical Error; Completeness
    resulting in
Final Product Errors
Uncertainty
Uncertainties in geographic information originate from
different sources:
Uncertainty due to the inherent nature of geography:
different interpretations can be equally valid;
Cartographic uncertainty resulting in positional and
attribute errors;
Conceptual uncertainty as a result of differences in
“what it is that is being mapped”.
Uncertainty (Definition of a Forest)
[Figure: tree height (m) plotted against canopy coverage (%) for the definitions of a forest used by Portugal, Mexico, the U.S., Israel, Belgium, Malaysia, the UN, Turkey, Estonia, Switzerland, Somalia, New Zealand, UNESCO, Australia, Japan, Denmark, Morocco, Kenya, Zimbabwe, Sudan, Tanzania, Ethiopia, and South Africa.]
Internal and External Data Quality
internal quality - Corresponds to the level of similarity that
exists between “perfect” data to be produced (what is
called “nominal ground”) and the data actually produced
external quality - Corresponds to the similarity between
the data produced and user needs
[Diagram: internal quality relates the data that should have been produced to the data produced; external quality relates the data produced to each set of user needs (user needs 1, 2, ..., n).]
Characteristics to define the internal quality
– Completeness: presence and absence of features,
their attributes and relationships.
– Logical consistency: degree of adherence to logical
rules of data structure, attribution, and relationships (data
structure can be conceptual, logical or physical).
– Positional accuracy: accuracy of the position of
features.
– Temporal accuracy: accuracy of the temporal
attributes and temporal relationships of features.
– Thematic accuracy: accuracy of quantitative attributes
and the correctness of non-quantitative attributes and of
the classifications of features and their relationships.
Six characteristics to define the
external quality (Bédard and Vallière)
– Definition: to evaluate whether the exact nature of a
data and the object that it describes, that is, the “what”,
corresponds to user needs (semantic, spatial and
temporal definitions).
– Coverage: to evaluate whether the territory and the
period for which the data exists, that is, the “where” and
the “when”, meet user needs.
– Lineage: to find out where data come from, their
acquisition objectives, the methods used to obtain them,
that is, the “how” and the “why”, and to see whether the
data meet user needs.
Six characteristics to define the
external quality (Bédard and Vallière), cont.
– Precision: to evaluate what data is worth and whether
it is acceptable for an expressed need (semantic,
temporal, and spatial precision of the object and its
attributes).
– Legitimacy: to evaluate the official recognition and the
legal scope of data and whether they meet the needs of
de facto standards, respect recognized standards, have
legal or administrative recognition by an official body, or
legal guarantee by a supplier, etc.;
– Accessibility: to evaluate the ease with which the user
can obtain the data analyzed (cost, time frame, format,
confidentiality, respect of recognized standards,
copyright, etc.).
Conceptual model of uncertainty in spatial data
Uncertainty
    Well Defined Objects
        Error: Probability
    Poorly Defined Objects
        Vagueness: Fuzzy Set Theory
        Ambiguity
            Discord: Expert Opinion, Dempster-Shafer
            Non-Specificity: Endorsement Theory, Fuzzy Set Theory
Definitions of geographic objects
An example of a well-defined geographical object is
land ownership. The boundary between land parcels is
commonly marked on the ground, and shows an abrupt
and total change in ownership
Examples of poorly defined geographical objects are
the rule in natural resource mapping. The
conceptualization of mappable phenomena and the
spaces they occupy is rarely clear-cut
There are rarely sharp transitions from one vegetation
type to another
In a region there could be several types of vegetation
Five dimensions of objects A and B
[Diagram: objects A and B compared along five dimensions: relation, scale, space, time, and attribute.]
Error
Ideally, if an object is conceptualized as being definable
in both attribute and spatial dimensions, then it has a
Boolean occurrence; any location is either part of the
object, or it is not.
Within GIS, for a number of reasons, a location or the
assignment of an object to a location or to a class
may be expressed as a probability.
Common reasons for a
database being in error
Type of Error: Cause of error
Measurement: Measurement of a property is erroneous.
Assignment: The object is assigned to the wrong class because of measurement error by the scientist in either the field or laboratory or by the surveyor.
Class Generalization: Following observation in the field, and for reasons of simplicity, the object is grouped with objects possessing somewhat dissimilar properties.
Spatial Generalization: Generalization of the cartographic representation of the object before digitizing, including displacement, simplification, etc.
Common reasons for a
database being in error
Type of Error: Cause of error
Entry: Data are miscoded during (electronic or manual) entry in a GIS.
Temporal: The object changes character between the time of data collection and the time of database use.
Processing: In the course of data transformations an error arises because of rounding or algorithm error.
Vagueness
Sorites paradox (is a bald man with one additional
hair still bald?)
When, exactly, is a house a house; a settlement, a
settlement; a city a city; an oak woodland, an oak
woodland?
The questions always revolve around the threshold
value of some measurable parameter or the opinion
of some individual, expert or otherwise.
Vagueness
Fuzzy-set theory is an alternative to Boolean sets.
Membership of an object in a Boolean set is
absolute, and defined by one of two integer values
{0,1}.
Membership of a fuzzy set is defined by a real
number in the range [0,1]. Membership or non-
membership of the set is identified by the terminal
values, while all intervening values define an
intermediate degree of belonging to the set (a
membership of 0.25 reflects a smaller degree of
belonging to the set than a membership of 0.5.)
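The contrast between Boolean and fuzzy membership can be sketched in a few lines. This is only an illustration: the linear ramp, the function names, and the canopy-coverage thresholds (10%, 30%, 50%) are assumptions chosen for the example, not values from the seminar.

```python
def boolean_membership(value, threshold):
    """Boolean set: membership is one of the two integer values {0, 1}."""
    return 1 if value >= threshold else 0


def fuzzy_membership(value, lower, upper):
    """Fuzzy set with a linear ramp (an assumed shape): 0 at or below
    `lower`, 1 at or above `upper`, and a real number in between."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    if value <= lower:
        return 0.0
    if value >= upper:
        return 1.0
    return (value - lower) / (upper - lower)


# Hypothetical use: canopy coverage (%) as the measure of "forest-ness".
for canopy in (5, 20, 30, 50):
    print(canopy,
          boolean_membership(canopy, 30),
          fuzzy_membership(canopy, 10, 50))
```

With these assumed thresholds, a cell with 20% canopy has a fuzzy membership of 0.25 and a cell with 30% canopy has 0.5, while the Boolean set switches abruptly from 0 to 1 at the 30% threshold.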
Ambiguity
Ambiguity occurs when there is doubt as to how a
phenomenon should be classified because of differing
perceptions of that phenomenon.
There are two types of ambiguity:
Discord – different definitions and interpretations of the
same piece of land (not a problem of a single
classification but of multiple mappings of the same area).
In the defining of soil, for example, many countries
have slightly different definitions of what constitutes a
soil, names for soils and the spatial and attribute
boundaries between soil types.
Definition of a Forest
[Figure: the same plot of tree height (m) against canopy coverage (%) for national and international definitions of a forest, shown here as an example of discord.]
Discord Ambiguity
a) There is more class1 than
class2
b) The “zone of transition” between
classes 1 and 2 is represented
by a mosaic of class1-&-class2
Discord Ambiguity
c) the whole area is allocated
into a class1-&-class2
mosaic
d) the two distinct areas of class1
and class2 are separated by two
mosaics of class1-&-class2 and
class2-&-class1
Discord Ambiguity
Some solutions for the problem of discord include:
Use of expert look-up tables and producer-supplied
metadata to compare classifications. This is an artificial
intelligence based solution.
Use personal (expert) judgment to compare
classification and phenomenon changes over a longer
period. This solution makes extensive use of rough and
fuzzy sets to accommodate the uncertainty in the
correspondence of classes.
Non-specificity Ambiguity
Ambiguity through non-specificity can be illustrated by
geographical relationships.
The relation “A is north of B” is itself non-specific
because it can mean:
A lies on exactly the same line of longitude and
towards the north pole from B;
A lies somewhere to the north of a line running east to
west through B
A lies between perhaps north-east and north-west, but
is most likely to lie in the sector between north-north-
east and north-north-west of B.
Non-specificity Ambiguity
The first two definitions are precise and specific, the third
is the natural language concept, which is itself vague.
Any lack of definition as to which should be used means
that uncertainty arises in the interpretation of “north of”.
Uncertainty
Attribute uncertainty (Forest vs. Ag)
Positional uncertainty
Definitional uncertainty
Measurement uncertainty
The Necessity of “Fuzziness”
“Not only is it easy to lie with maps, it’s essential...to present
a useful and truthful picture, an accurate map must tell
white lies.” -- Mark Monmonier
distort 3-D world into 2-D abstraction
characterize most important aspects of spatial reality
portray abstractions (e.g., gradients, contours) as
distinct spatial objects
Fuzziness (cont.)
All GIS subject to uncertainty
What the data tell us about the real world
Range of possible “truths”
Uncertainty affects results of analysis
Confidence limits - “plus or minus”
Difficult to determine
“If it comes from a computer it must be right”
“If it has lots of decimal places, it must be accurate”
Method for determining conformance quality levels
Assumptions
The more errors a dataset contains, the higher the likelihood of applying erroneous data for decisions or actions
Each false decision or action leads to consequential costs
• costs of finding the right answer
• costs due to damages caused by false information, e.g. by hitting a pipeline which was documented at a different location
• hidden costs from losing the confidence of the user community
• hidden costs due to loss of image with the customer
A dataset is never completely free of errors
The effort to gain a certain quality level costs time and money
Data quality
    Lineage
    Accuracy: Positional; Attribute
    Completeness
    Logical Consistency
    Semantic Accuracy
    Currency
Positional accuracy (2D example)
We distinguish between point objects, line objects, and area
objects.
For a point object, with (x ±σx, y ±σy) coordinates
The values of σx and σy may be known from:
– previous studies
– specifications
– derived from the collected data
The point positional accuracy is then $PA_P = \sqrt{\sigma_x^2 + \sigma_y^2}$
[Figure: a point (x,y) with its error components σx and σy.]
Positional accuracy (2D example)
A simple approximation for line and area objects with n
points is: $PA_{L,A} = \sqrt{n(\sigma_x^2 + \sigma_y^2)}$
[Figure: a line through points (x1,y1), (x2,y2), (x3,y3) and a polygon with vertices (x1,y1) to (x4,y4), each point shown with its own σx and σy.]
Note: The size of each error could be different
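A minimal sketch of the point and line/area approximations follows. It assumes the slide formulas are root-sum-of-squares expressions (the square root signs did not survive extraction), and it handles vertices with different error sizes by summing each vertex's own variances, which is an assumed generalization of the n-point formula. The function names are hypothetical.

```python
import math


def point_accuracy(sigma_x, sigma_y):
    """Positional accuracy of a single point: sqrt(sx^2 + sy^2)."""
    return math.hypot(sigma_x, sigma_y)


def object_accuracy(vertex_sigmas):
    """Approximate accuracy of a line or polygon.

    vertex_sigmas: list of (sigma_x, sigma_y), one pair per vertex.
    With identical errors at all n vertices this reduces to
    sqrt(n * (sx^2 + sy^2)).
    """
    return math.sqrt(sum(sx ** 2 + sy ** 2 for sx, sy in vertex_sigmas))


# Hypothetical values: every vertex known to 0.3 m in x and 0.4 m in y.
print(point_accuracy(0.3, 0.4))
print(object_accuracy([(0.3, 0.4)] * 4))
```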
Error Propagation or
Propagation of Random Errors
Definition:
Given independent variables each with an
uncertainty, error propagation is the method
of determining an uncertainty in a function of
these variables.
Computed errors: Measurement or given errors
E x, E y: Angular and distance
E area: Coordinates
E vol: Distance and elevation
Error Propagation
Error propagation is a way of combining two
or more random errors together to get a third.
It can be used when you need to measure
more than one quantity to get at your final
result. For example, an angle and a distance
to compute coordinates
Error propagation can also be used to
combine several independent sources of
random error on the same measurement.
Error Propagation
In general, as a matrix equation: $\Sigma_{zz} = A \Sigma_{xx} A^T$
Derivation of formulas.
Suppose that x is a measured quantity and y is computed
from
y = ax + b
If we knew x_t, the true value of x, we could compute y_t:
y_t = a x_t + b
The measured value of x has an error of dx, or
x = x_t + dx.
Thus y = a(x_t + dx) + b = a x_t + b + a dx
y = y_t + a dx
dy = a dx, where $a = \partial y / \partial x$
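Error Propagation
A general formula (assuming independence
or no correlation)
$\sigma_y^2 = \left(\frac{\partial y}{\partial x_1}\right)^2 \sigma_{x_1}^2 + \left(\frac{\partial y}{\partial x_2}\right)^2 \sigma_{x_2}^2 + \left(\frac{\partial y}{\partial x_3}\right)^2 \sigma_{x_3}^2 + \dots + \left(\frac{\partial y}{\partial x_n}\right)^2 \sigma_{x_n}^2$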
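The general formula for independent variables can be sketched numerically. The function below is a hypothetical helper, not part of the seminar material: it estimates each partial derivative with a central difference instead of deriving it analytically, which is an assumed shortcut that works for smooth functions.

```python
import math


def propagate(func, values, sigmas, rel_step=1e-6):
    """Standard deviation of func(*values) for independent inputs.

    Implements sigma_y^2 = sum((dy/dx_i)^2 * sigma_xi^2), with each
    partial derivative approximated by a central difference.
    """
    variance = 0.0
    for i, (v, s) in enumerate(zip(values, sigmas)):
        h = rel_step * max(abs(v), 1.0)
        up = list(values)
        down = list(values)
        up[i] = v + h
        down[i] = v - h
        partial = (func(*up) - func(*down)) / (2 * h)
        variance += (partial * s) ** 2
    return math.sqrt(variance)


# Linear case y = a*x + b with a = 3: sigma_y should equal 3 * sigma_x.
print(propagate(lambda x: 3 * x + 2, [5.0], [0.1]))
```

For a linear function the result matches dy = a dx; for a non-linear function it is the usual first-order approximation.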
Error Propagation Examples:
Random error of a sum
If y = x1 + x2 + x3 + . . . + xn
Then
$\sigma_y = \sqrt{\sigma_{x_1}^2 + \sigma_{x_2}^2 + \sigma_{x_3}^2 + \dots + \sigma_{x_n}^2}$
A leveling loop was measured with the following accuracies:
DH1 = 12.34 ±0.01
DH2 = -8.72 ±0.02
DH3 = 4.93 ±0.005
DH4 = -8.53 ±0.01
The closure is 0.02
The accuracy of the loop is:
$\sqrt{0.01^2 + 0.02^2 + 0.005^2 + 0.01^2} = 0.025$
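The leveling loop example can be checked with a short script. The variable names are illustrative; the height differences and their accuracies are the four values given in the example.

```python
import math

# Height differences and their standard deviations from the leveling loop.
height_differences = [12.34, -8.72, 4.93, -8.53]
sigmas = [0.01, 0.02, 0.005, 0.01]

# Loop closure: the sum of the height differences.
closure = sum(height_differences)

# Random error of a sum: root of the summed variances.
sigma_loop = math.sqrt(sum(s ** 2 for s in sigmas))

print(round(closure, 2), round(sigma_loop, 3))  # 0.02 0.025
```

The closure of 0.02 is smaller than the propagated accuracy of 0.025, so it is consistent with the stated measurement accuracies.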
Error Propagation Examples:
Random error of a series
If y = x1 + x2 + x3 + . . . + xn and
$\sigma_{x_1} = \sigma_{x_2} = \sigma_{x_3} = \dots = \sigma_{x_n} = \sigma_x$
Then
$\sigma_y = \sigma_x \sqrt{n}$
Example
$\sqrt{0.01^2 + 0.01^2 + 0.01^2 + 0.01^2} = \sqrt{4} \times 0.01 = 0.02$
Error Propagation Examples:
Random error of area
A = a b
$\sigma_A^2 = a^2 \sigma_b^2 + b^2 \sigma_a^2$
Example
The sides of an 80’ x 100’ rectangular lot were measured with an accuracy of ±0.02’. What is the accuracy of the area of the lot?
$\sigma_A = \sqrt{80^2 \times 0.02^2 + 100^2 \times 0.02^2} = 2.56$ sq ft
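The rectangular lot example can be reproduced with the analytic formula for A = a b. The function name is hypothetical; the side lengths and accuracies are those of the example.

```python
import math


def rectangle_area_sigma(a, b, sigma_a, sigma_b):
    """Standard deviation of the area A = a * b for independent sides."""
    return math.sqrt((b * sigma_a) ** 2 + (a * sigma_b) ** 2)


sigma_area = rectangle_area_sigma(80.0, 100.0, 0.02, 0.02)
print(round(sigma_area, 2))  # 2.56 (square feet)
```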
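Error Propagation of Azimuth and
Distance to coordinates (x,y)
$X_B = X_A + D_{AB} \sin AZ_{AB}$
$Y_B = Y_A + D_{AB} \cos AZ_{AB}$
$\sigma_{X_B}^2 = \sigma_{X_A}^2 + (\sin AZ_{AB})^2 \sigma_D^2 + (D_{AB} \cos AZ_{AB})^2 \left(\frac{\sigma_{AZ}}{206265}\right)^2$
$\sigma_{Y_B}^2 = \sigma_{Y_A}^2 + (\cos AZ_{AB})^2 \sigma_D^2 + (D_{AB} \sin AZ_{AB})^2 \left(\frac{\sigma_{AZ}}{206265}\right)^2$
[Figure: points A and B connected by distance D at azimuth AZ.]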
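A sketch of the azimuth-and-distance case is given below. It assumes the azimuth error is supplied in arc-seconds (hence the division by 206265, the number of arc-seconds in a radian) and that the errors in the starting coordinates, the distance, and the azimuth are independent. The function name and the sample numbers are hypothetical.

```python
import math

ARCSEC_PER_RADIAN = 206265.0


def forward_with_errors(xa, ya, dist, az_deg,
                        sigma_xa, sigma_ya, sigma_d, sigma_az_arcsec):
    """Coordinates of B from A, a distance and an azimuth, with the
    propagated standard deviations of X_B and Y_B."""
    az = math.radians(az_deg)
    sigma_az = sigma_az_arcsec / ARCSEC_PER_RADIAN  # arc-seconds to radians

    xb = xa + dist * math.sin(az)
    yb = ya + dist * math.cos(az)

    var_xb = (sigma_xa ** 2
              + (math.sin(az) * sigma_d) ** 2
              + (dist * math.cos(az) * sigma_az) ** 2)
    var_yb = (sigma_ya ** 2
              + (math.cos(az) * sigma_d) ** 2
              + (dist * math.sin(az) * sigma_az) ** 2)
    return xb, yb, math.sqrt(var_xb), math.sqrt(var_yb)


# Hypothetical observation: 100 m due north, distance known to 0.01 m,
# azimuth known to about 20.6 arc-seconds (0.0001 rad), A error-free.
print(forward_with_errors(0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.01, 20.6265))
```

In this due-north case the distance error goes entirely into Y_B and the azimuth error entirely into X_B, which is a convenient sanity check on the two formulas.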
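Error Propagation of coordinates (x,y) to Azimuth and Distance
$D = \sqrt{(X_B - X_A)^2 + (Y_B - Y_A)^2} = \sqrt{\Delta X^2 + \Delta Y^2}$
$AZ = \tan^{-1} \frac{X_B - X_A}{Y_B - Y_A} = \tan^{-1} \frac{\Delta X}{\Delta Y}$
$\sigma_D^2 = \frac{1}{D^2} \left[ \Delta X^2 (\sigma_{X_A}^2 + \sigma_{X_B}^2) + \Delta Y^2 (\sigma_{Y_A}^2 + \sigma_{Y_B}^2) \right]$
$\sigma_{AZ}^2 = \frac{1}{D^4} \left[ \Delta Y^2 (\sigma_{X_A}^2 + \sigma_{X_B}^2) + \Delta X^2 (\sigma_{Y_A}^2 + \sigma_{Y_B}^2) \right]$ (in radians squared)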
Error Propagation of coordinates
to area of a closed polygon
$A = \frac{1}{2} \sum x_i (y_{i-1} - y_{i+1})$
$\sigma_A = \frac{1}{2} \sqrt{\sum [(y_{i-1} - y_{i+1}) \sigma_{x_i}]^2 + \sum [(x_{i+1} - x_{i-1}) \sigma_{y_i}]^2}$
Error Propagation of coordinates
to area of a closed polygon
$A = \frac{1}{2} \sum x_i (y_{i-1} - y_{i+1})$
$\sigma_A = \frac{1}{2} \sqrt{\sum [(y_{i-1} - y_{i+1}) \sigma_{x_i}]^2 + \sum [(x_{i+1} - x_{i-1}) \sigma_{y_i}]^2}$

Point  X         Y         Xi+1 - Xi-1  Yi-1 - Yi+1  Xi (Yi-1 - Yi+1)  [sx (Yi-1 - Yi+1)]^2  [sy (Xi+1 - Xi-1)]^2
A      10000.00  10000.00
B      9600.04   10599.96  -1300.01     -799.95      -7679521.06       78.76                 468.01
C      8699.99   10799.95  -500.06      649.94       5654467.30        216.58                116.81
D      9099.98   9950.02   1300.01      799.95       7279498.42        570.06                1457.26
A      10000.00  10000.00  500.06       -649.94      -6499391.57       545.43                344.47
B      9600.04   10599.96
Σ                                                    -1244946.91       1410.83               2386.55

AREA = 622473.45
sA = 30.81
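The polygon area formulas can be sketched as a single loop over the vertices. The per-point standard deviations used in the worked table are not listed on the slide, so the value of 0.03 below is purely an assumption for illustration; the coordinates are the printed ones, and results computed from them may differ slightly from the tabulated values if the printed coordinates are rounded.

```python
import math


def polygon_area_and_sigma(points):
    """Area of a closed polygon and its propagated standard deviation.

    points: list of (x, y, sigma_x, sigma_y) in order around the polygon,
    without repeating the first point at the end.
    """
    n = len(points)
    total = 0.0
    variance = 0.0
    for i in range(n):
        x, _, sx, sy = points[i]
        x_prev, y_prev = points[i - 1][0], points[i - 1][1]
        x_next, y_next = points[(i + 1) % n][0], points[(i + 1) % n][1]
        dy = y_prev - y_next      # (y_{i-1} - y_{i+1})
        dx = x_next - x_prev      # (x_{i+1} - x_{i-1})
        total += x * dy
        variance += (sx * dy) ** 2 + (sy * dx) ** 2
    return abs(total) / 2.0, math.sqrt(variance) / 2.0


# Coordinates A, B, C, D from the table; sigma = 0.03 is a hypothetical value.
lot = [
    (10000.00, 10000.00, 0.03, 0.03),
    (9600.04, 10599.96, 0.03, 0.03),
    (8699.99, 10799.95, 0.03, 0.03),
    (9099.98, 9950.02, 0.03, 0.03),
]
area, sigma_area = polygon_area_and_sigma(lot)
print(round(area, 2), round(sigma_area, 2))
```

The absolute value is taken because the sign of the sum depends only on whether the vertices are listed clockwise or counterclockwise.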
POSITIONAL ACCURACY
defined as the closeness of locational information
(usually coordinates) to the true position
How to test positional accuracy?
use an independent source of higher accuracy (e.g. GPS
or raw survey data)
use internal evidence
unclosed polygons, lines which overshoot or
undershoot junctions, are indications of inaccuracy -
the sizes of gaps, overshoots and undershoots may
be used as a measure of positional accuracy
compute accuracy from knowledge of the errors
introduced by different sources using error propagation
The National Standard for
Spatial Data Accuracy (NSSDA)
A well-defined statistic and testing methodology for positional accuracy of spatial data.
Applicable to digital and graphic forms (aerial photographs, satellite imagery, and maps)
The standard does not define “pass-fail” accuracy values. (agencies are to set criteria)
Accuracy report
http://www.fgdc.gov/standards/projects/FGDC-standards-projects/accuracy/
Spatial Accuracy (Horizontal
Accuracy)
Circular error is based on the sample
standard deviation of di, the difference
between the data set coordinate value and
the coordinate value determined by an
independent check survey of higher accuracy
for the same point.
The standard deviation for the horizontal coordinate r is:
$s_r = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$
Where:
$r_i = \sqrt{x_i^2 + y_i^2}$
$d_i = r_{data_i} - r_{check_i}$
$\bar{d} = \frac{\sum d_i}{n}$ is the mean discrepancy
n = total number of points checked
NSSDA horizontal accuracy is:
$Accuracy_r = 2.4477 \cdot s_r$ (95% confidence level)
The standard deviation for the z coordinate direction is:
$s_z = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$
where:
$d_i = z_{data_i} - z_{check_i}$
$\bar{d} = \frac{\sum d_i}{n}$ is the mean discrepancy
n = total number of points checked
NSSDA vertical accuracy is: $Accuracy_z = 1.96 \cdot s_z$ (95% confidence level)
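The two accuracy statistics, as formulated on these slides, can be sketched as follows. The function names and the sample elevations are hypothetical. Note that the code follows the slides' sample-standard-deviation form; the published FGDC standard states its statistics in terms of root-mean-square error, so this is a sketch of the seminar's formulation, not of the standard's text.

```python
import math


def sample_std(discrepancies):
    """Sample standard deviation about the mean discrepancy (n - 1)."""
    n = len(discrepancies)
    mean = sum(discrepancies) / n
    return math.sqrt(sum((d - mean) ** 2 for d in discrepancies) / (n - 1))


def vertical_accuracy_95(z_data, z_check):
    """1.96 * s_z, with d_i = z_data_i - z_check_i."""
    d = [zd - zc for zd, zc in zip(z_data, z_check)]
    return 1.96 * sample_std(d)


def horizontal_accuracy_95(xy_data, xy_check):
    """2.4477 * s_r, with r_i = sqrt(x_i^2 + y_i^2) as on the slide."""
    d = [math.hypot(xd, yd) - math.hypot(xc, yc)
         for (xd, yd), (xc, yc) in zip(xy_data, xy_check)]
    return 2.4477 * sample_std(d)


# Hypothetical check survey: four elevations compared with data set values.
print(vertical_accuracy_95([10.1, 9.9, 10.1, 9.9], [10.0, 10.0, 10.0, 10.0]))
```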
Well-Defined Points
Acceptable features
Small scale: Road/rail intersections; Small isolated shrubs; Corners of structures
Large scale: Center of utility access cover; Sidewalk/curb/gutter intersections; Monuments
Features that can be identified within 1/3 of the
maximum expected uncertainty for the data set.
Check survey points should have accuracies within one-third of the data set's intended accuracy (95% CL)
Check Point Location (assuming rectangle area)
Spaced at intervals of at least 10% of the diagonal.
At least 20% of the points are located in each quadrant.
Check points may be distributed more densely in the vicinity
of important features
When data exist for only a portion of the data set, confine
test points to that area.
When the distribution of error is likely to be nonrandom, it
may be desirable to locate check points to correspond to
the error distribution.
Positional Accuracy evaluation
of Orthophotos in New Jersey
Point  Accuracy (ft)
1      4.25
2      4.07
3      2.28
4      3.98
5      4.18
ATTRIBUTE ACCURACY
Defined as the closeness of attribute values to their true value
Note that while location does not change with time, attributes often do
Attribute accuracy must be analyzed in different ways depending on the nature of the data
For continuous attributes (surfaces) such as on a DEM or TIN:
accuracy is expressed as measurement error (e.g. ±1m)
ATTRIBUTE ACCURACY
For categorical attributes such as classified polygons:
Are the categories appropriate, sufficiently detailed and defined?
Is a polygon classified as A really A, or should it be B?
How heterogeneous are the polygons (e.g. 70% A and 30% B)?
How well are A and B defined (e.g. soils classifications)?
The center area may be definitely A, but more like B at the edges
ATTRIBUTE ACCURACY
How to test attribute accuracy?
prepare a misclassification matrix and calculate the degree of correctness
Examples:
The Kappa coefficient
Map Producer’s accuracy
Map User’s accuracy
The Kappa coefficient
$P_0 = P_{AA} + P_{BB}$
$P_e = P_{Ac} P_{Ar} + P_{Bc} P_{Br}$
$Kappa = \frac{P_0 - P_e}{1 - P_e}$
Comparing Dataset A to Dataset B
Proportions (P):
       A    B    Total
A      PAA  PAB  PAr
B      PBA  PBB  PBr
Total  PAc  PBc  1
Observed counts (O):
       A    B    Total
A      OAA  OAB  OAr
B      OBA  OBB  OBr
Total  OAc  OBc  Σ
O – Observed
P – Percentage
The Kappa coefficient
$P_0 = 0.586 + 0.283 = 0.869$
$P_e = 0.657 \times 0.646 + 0.343 \times 0.354 = 0.546$
$Kappa = \frac{0.869 - 0.546}{1 - 0.546} = 0.711$
Comparing Dataset A to Dataset B
Proportions (P):
       R      B      Total
R      0.586  0.061  0.646
B      0.071  0.283  0.354
Total  0.657  0.343  1
Observed counts (O):
       R   B   Total
R      58  6   64
B      7   28  35
Total  65  34  99
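The Kappa calculation for the observed counts in this example can be sketched as below. The function name is hypothetical; the matrix is the 2 x 2 table of observed counts (58, 6, 7, 28).

```python
def kappa_from_counts(matrix):
    """Return (P0, Pe, Kappa) for a square matrix of observed counts."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: proportion on the diagonal.
    p0 = sum(matrix[i][i] for i in range(k)) / total
    # Chance agreement: sum of products of row and column proportions.
    row_p = [sum(row) / total for row in matrix]
    col_p = [sum(matrix[r][c] for r in range(k)) / total for c in range(k)]
    pe = sum(r * c for r, c in zip(row_p, col_p))
    return p0, pe, (p0 - pe) / (1 - pe)


p0, pe, kappa = kappa_from_counts([[58, 6], [7, 28]])
print(round(p0, 3), round(pe, 3), round(kappa, 3))  # 0.869 0.546 0.711
```

The same function works for more than two classes, since the chance agreement is simply summed over all classes.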
How to interpret Kappa
Kappa is always less than or equal to 1.
A value of 1 implies perfect agreement and values less
than 1 imply less than perfect agreement.
In rare situations, Kappa can be negative. This is a sign
that the two observers agreed less than would be
expected just by chance.
A possible interpretation of Kappa. The agreement is:
Kappa 0.0 to 0.2: Poor
Kappa 0.2 to 0.4: Fair
Kappa 0.4 to 0.6: Moderate
Kappa 0.6 to 0.8: Good
Kappa 0.8 to 1.0: Very good
Other Accuracy Assessment
Assume we have a 9-cell land cover map, one from 1980 and one from 2000, with three categories: A, B, and C.
The cross tabulation can be quantified into a matrix, often called a confusion matrix.
1980 LC    2000 LC    Cross Tabulated Grid
A B A      B B A      BA BB AA
B C C      B B C      BB BC CC
A A B      B A C      BA AA CB

   A  B  C
A  2  2  0
B  0  2  1
C  0  1  1
The matrix shows the agreements
between the 1980 and 2000 maps. As
an example, 2 cells remained A (AA),
1 cell was C and is now B (CB), etc.
Other Accuracy Assessment
Sum up the rows and columns. But
what do these numbers tell us?
       A  B  C  Total
A      2  2  0  4
B      0  2  1  3
C      0  1  1  2
Total  2  5  2  9
The bottom row tells us that there
were two cells that were A, five B,
and two C.
The rightmost column tells us that we mapped 4 cells as A, 3 as B, and 2 as C.
Adding up the diagonal cells says that 5 cells were right.
The overall agreement between maps is:
Σdii /n = 5/9 = 0.56 (56%)
User and Producer Accuracy
The total correspondence of our example is about 56%. But that only tells us part of the story. What if we were really interested in classification B? Were there changes in classification B? Even here, there are two different ways of interpreting that question:
If I were interested in mapping all the areas of B, how well did I get them all? This is called the Map Producer’s Accuracy. That is, how well did we produce a map of classification B?
If I were to use the map to find B, how successful would I be? This is called the Map User’s Accuracy. That is, how much confidence should a user of the map have for a given classification?
User and Producer Accuracy
Map user’s accuracy = the total number correct within a row divided by the total number in the whole row.
Map producer’s accuracy = the total number correct within a column divided by the total number in the whole column.
Example of classification B
Map user’s accuracy = 2/3 = 67%
Map producer’s accuracy = 2/5 = 40%

        A   B   C   Total
A       2   2   0   4
B       0   2   1   3
C       0   1   1   2
Total   2   5   2   9
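The two definitions translate directly into row-wise and column-wise ratios. The sketch below uses the matrix whose row totals are 4, 3, 2 and column totals 2, 5, 2, as stated in the slides. The function names are illustrative.

```python
# Rows: class on the map; columns: reference class (order A, B, C).
matrix = [[2, 2, 0],
          [0, 2, 1],
          [0, 1, 1]]
classes = ["A", "B", "C"]


def users_accuracy(m, i):
    """Correct cells in row i divided by the row total."""
    return m[i][i] / sum(m[i])


def producers_accuracy(m, i):
    """Correct cells in column i divided by the column total."""
    return m[i][i] / sum(row[i] for row in m)


b = classes.index("B")
print(round(users_accuracy(matrix, b), 2))      # 0.67
print(round(producers_accuracy(matrix, b), 2))  # 0.4
```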
User and Producer Accuracy
How can we use the above results?
A user’s accuracy of 67% means that if we were to use this map and look for the classification of B, we would be correct 67% of the time.
A producer’s accuracy of 40% means that the map produced only 40% of all the B’s that were out there.
This also gives us some indication of the nature of the errors. For instance, it appears that we confused classification A with classification B (we said on two occasions that B was A). By understanding the nature of the errors, perhaps we can go back, look over our process and correct for that mistake.
LOGICAL CONSISTENCY
Refers to the degree of adherence to logical rules of data structures (conceptual, logical or physical), attribution and relationships. It includes:
Conceptual consistency: adherence to rules of the conceptual schema
Domain consistency: adherence of values to the value domain
Format consistency: degree to which data is stored in accordance with the physical structure of the dataset
LOGICAL CONSISTENCY
Topological consistency; correctness of the explicitly
encoded topological characteristics of a dataset. For
example:
• If there are polygons, do they close?
• Is there exactly one label within each polygon?
• Are there nodes wherever arcs cross, or do arcs
sometimes cross without forming nodes?
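The first question, whether a polygon closes, is easy to automate. The sketch below assumes a ring is stored as a list of (x, y) tuples with the first vertex repeated as the last, a common convention that the slides do not prescribe. `is_closed` and the `tol` parameter are hypothetical names.

```python
def is_closed(ring, tol=0.0):
    """Return True if the ring's last vertex coincides with its first.

    A closed ring needs at least three distinct vertices plus the repeated
    start vertex, hence the minimum length of four.
    """
    if len(ring) < 4:
        return False
    (x0, y0), (xn, yn) = ring[0], ring[-1]
    return abs(x0 - xn) <= tol and abs(y0 - yn) <= tol


closed = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
gap = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0.02)]
print(is_closed(closed), is_closed(gap), is_closed(gap, tol=0.05))
# True False True
```

A non-zero tolerance treats small digitizing gaps as closed. Whether that is acceptable depends on the product specification.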
COMPLETENESS
Refers to the presence and absence of features, their attributes and
relationships in spatial data, compared with what is
defined in the data model or what is in the real world.
Error of commission – data presented in a data set that
is not present in the data model or the real world
Error of omission – data that is present in the data
model or the real world is absent in the dataset.
Affected by rules of selection, generalization and scale
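When features in the dataset and in the reference (data model or real world) share identifiers, commission and omission reduce to two set differences. This is a sketch under that assumption. The feature IDs and the function name are made up for illustration.

```python
def completeness_errors(dataset_ids, reference_ids):
    """Split mismatches into commission and omission errors."""
    dataset, reference = set(dataset_ids), set(reference_ids)
    return {
        # In the dataset but not in the reference.
        "commission": sorted(dataset - reference),
        # In the reference but missing from the dataset.
        "omission": sorted(reference - dataset),
    }


# Hypothetical building identifiers.
errors = completeness_errors(
    dataset_ids=["B1", "B2", "B3", "B7"],
    reference_ids=["B1", "B2", "B3", "B4", "B5"],
)
print(errors)  # {'commission': ['B7'], 'omission': ['B4', 'B5']}
```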
LINEAGE
A record of the data sources and of the operations
which created the database
How was it digitized, from what documents?
When was the data collected?
What agency collected the data?
What steps were used to process the data?
• precision of computational results
Lineage is often a useful indicator of accuracy
An Example of Data Quality Elements
and Sub-elements for Buildings
Quality element | Quality sub-element | Description by examples
Completeness | Commission error | Buildings with area less than 4 m² are present in the Building Polygon layer of a 1:1000 data set.
Completeness | Omission error | Buildings with area equal to or larger than 4 m² are absent from the Building Polygon layer.
An Example of Data Quality Elements
and Sub-elements for Buildings
Quality element | Quality sub-element | Description by examples
Positional accuracy | Horizontal accuracy | RMSE of a building polygon, based on a comparison of the horizontal coordinates of all the nodes of the building's footprint in the GIS with the corresponding reference values.
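A horizontal RMSE of this kind can be computed from paired node coordinates. The sketch below assumes the common radial definition, the square root of the mean of dx² + dy² over all nodes, and that the GIS nodes and reference nodes are listed in the same order. The function name and the sample coordinates are hypothetical.

```python
import math


def horizontal_rmse(measured, reference):
    """Radial RMSE between paired (x, y) nodes, in the coordinate units."""
    if not measured or len(measured) != len(reference):
        raise ValueError("need two non-empty node lists of equal length")
    squared = [
        (xm - xr) ** 2 + (ym - yr) ** 2
        for (xm, ym), (xr, yr) in zip(measured, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical footprint whose nodes are all shifted by (3, 4) units.
gis_nodes = [(3.0, 4.0), (13.0, 4.0), (13.0, 14.0), (3.0, 14.0)]
ref_nodes = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(horizontal_rmse(gis_nodes, ref_nodes))  # 5.0
```

The vertical case follows the same pattern with a single height difference per node in place of dx² + dy².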
An Example of Data Quality Elements
and Sub-elements for Buildings
Quality element | Quality sub-element | Description by examples
Positional accuracy | Vertical accuracy | RMSE of a building polygon, based on a comparison of the vertical coordinates of all the nodes of the building's footprint in the GIS with the corresponding reference values.
An Example of Data Quality Elements
and Sub-elements for Buildings
Quality element | Quality sub-element | Description by examples
Attribute accuracy | Classification correctness | Whether a building or related feature is correctly classified as one (or more) building-related features.
Attribute accuracy | Non-quantitative attribute correctness | The Name of a building polygon may be correct or wrong in a GIS.
Attribute accuracy | Quantitative attribute correctness | The value of the field "Building Top Level" of a Building Polygon may be correct or wrong.
An Example of Data Quality Elements
and Sub-elements for Buildings
Quality element | Quality sub-element | Description by examples
Logical consistency | Conceptual consistency | A tower is described to be under its podium.
Logical consistency | Domain consistency | The classification of feature code for a building polygon is beyond any of the following given classes: BR BAR BUP, IBP, OSP, PWP, TSP.
An Example of Data Quality Elements
and Sub-elements for Buildings
Quality element | Quality sub-element | Description by examples
Logical consistency | Format consistency | Building names in title case, such as "Hong Kong Airport", are consistent, while a name such as "HONG KONG Airport" is not consistent in format.
Logical consistency | Topological consistency | When the outline of a building polygon is closed, the topology is consistent; when the outline is not closed, the topology is not consistent.
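The title-case rule from the format consistency example can be checked with Python's built-in `str.istitle()`. This is only a sketch: `istitle()` is a purely typographic test and will reject legitimate names containing lowercase particles or acronyms, so a production rule would need an exception list. The sample names come from the example above. The function name is illustrative.

```python
def format_inconsistent(names):
    """Return the names that are not written in title case."""
    return [name for name in names if not name.istitle()]


building_names = ["Hong Kong Airport", "HONG KONG Airport"]
print(format_inconsistent(building_names))  # ['HONG KONG Airport']
```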
Uncertainties Measured Based on
Various Mathematical Theories
[Figure: types of uncertainty, their measures, and the underlying theories]
Uncertainty: Imprecision; Ambiguity; Vagueness
Measures: Confidence region model (Shi 1994); Entropy (Shannon 1948); Hartley’s measure (1928); Discord measure, Confusion measure and non-specificity measure; U-uncertainty, Fuzzy measure; Fuzzy topology measure
Theories: Probability and statistical theory; Evidence theory; Fuzzy sets, Probability and Fuzzy topology
A framework for modeling uncertainties
in spatial data and analysis
[Figure: flow diagram. Recoverable labels are listed below.]
Stages: Real World; Data type; Classification of spatial data; Description of Uncertainty; Uncertainty modeling in spatial analysis and query; Control of Uncertainties; Visualization of Uncertainties
Real world and data types: Object (Point, Line, Polygon, 3D objects); Field (Raster Image, DEM surface)
Uncertainty labels: Positional Uncertainty; Uncertain Topology; Uncertainty of Remote Sensing data; Errors in DEM; Uncertainty from Multi-data source; Uncertainty in spatial analysis; Uncertain spatial Query; Interpolation; Hybrid DEM; Geometric Correction and image fusion
Side labels: Processing and the uncertain control of Spatial data; Visualization and the distribution of Uncertainty Information
Error model of point – Error Ellipse
The transformation equation between U,V and X,Y is:
$$\begin{bmatrix} U \\ V \end{bmatrix} = \begin{bmatrix} \sin t & \cos t \\ -\cos t & \sin t \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix}$$
t is the rotation angle from the Y axis to the axis of largest error (U).
Su is the semi-major axis of the ellipse (largest error).
Sv is the semi-minor axis of the ellipse (least error).
Sx is the standard deviation in X of the coordinate.
Sy is the standard deviation in Y of the coordinate.
[Figure: error ellipse showing the X, Y, U and V axes, the angle t, and Sx, Sy, Su, Sv]
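Error model of point – Error Ellipse
[Figure: error ellipse showing the X, U and V axes, the angle t, and Sx, Sy, Su, Sv]
$$\tan 2t = \frac{2 S_{XY}}{S_Y^2 - S_X^2}$$
$$K = \sqrt{(S_Y^2 - S_X^2)^2 + 4 S_{XY}^2}$$
$$S_u^2 = \frac{S_X^2 + S_Y^2 + K}{2}, \qquad S_v^2 = \frac{S_X^2 + S_Y^2 - K}{2}$$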
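The ellipse parameters can be computed from Sx, Sy and the covariance. The sketch below assumes the standard relations tan 2t = 2·Sxy / (Sy² − Sx²), K = sqrt((Sy² − Sx²)² + 4·Sxy²), Su² = (Sx² + Sy² + K)/2 and Sv² = (Sx² + Sy² − K)/2, with t measured from the Y axis. The function name is illustrative, and `atan2` is used so that t lands in the correct quadrant.

```python
import math


def error_ellipse(sx, sy, sxy=0.0):
    """Return (Su, Sv, t) from the standard deviations and the covariance.

    sx, sy : standard deviations in X and Y
    sxy    : covariance of X and Y (units squared)
    t      : rotation angle in radians, from the Y axis to the U axis
    """
    k = math.sqrt((sy**2 - sx**2) ** 2 + 4 * sxy**2)
    su = math.sqrt((sx**2 + sy**2 + k) / 2)
    # max() guards against a tiny negative value caused by rounding.
    sv = math.sqrt(max((sx**2 + sy**2 - k) / 2, 0.0))
    t = 0.5 * math.atan2(2 * sxy, sy**2 - sx**2)
    return su, sv, t


su, sv, t = error_ellipse(0.3, 0.4)
print(round(su, 3), round(sv, 3), round(math.degrees(t), 1))  # 0.4 0.3 0.0
```

With no covariance the ellipse axes coincide with the coordinate axes, so the result simply returns the larger and smaller of Sx and Sy. That makes a convenient sanity check.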
Error model of line - Epsilon band
Assumptions:
1. each error effect relevant to a particular digital line in a
GIS can be treated as a random variable, perturbing the
true line to obtain the observed line.
2. the processes of generating a digital line in a GIS can be
treated as being independent.
The bandwidth is determined from a statistical function of
those positional errors on the line accumulated from the
first stage to the final stage of data capture.
[Figure: epsilon band showing the measured line and the true line]
Error model of a polygon
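The area S of the polygon is computed from:
$$S = \frac{1}{2}\sum_{i=1}^{n}\left[x_i\,(y_{i+1} - y_{i-1})\right] = \frac{1}{2}\sum_{i=1}^{n}\left[x_i\,\Delta y_{i-1,i+1}\right]$$
The differential of the area is given as:
$$dS = \frac{1}{2}\sum_{i=1}^{n}\left[\Delta y_{i-1,i+1}\,dx_i - \Delta x_{i-1,i+1}\,dy_i\right]$$
where $\Delta y_{i-1,i+1} = y_{i+1} - y_{i-1}$ and $\Delta x_{i-1,i+1} = x_{i+1} - x_{i-1}$.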
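The area formula S = ½ Σ x_i (y_{i+1} − y_{i−1}) can be evaluated with a short loop. This sketch assumes the vertices are given once each, without repeating the first vertex, and that indices wrap around the polygon. The function name is illustrative.

```python
def polygon_area(vertices):
    """Signed area: positive for counter-clockwise vertex order."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        x_i = vertices[i][0]
        y_next = vertices[(i + 1) % n][1]  # y of P(i+1), wrapping at the end
        y_prev = vertices[i - 1][1]        # y of P(i-1); index -1 wraps to the last vertex
        total += x_i * (y_next - y_prev)
    return 0.5 * total


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(polygon_area(square))        # 1.0
print(polygon_area(square[::-1]))  # -1.0
```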
Error model of a polygon
For simplicity, assume all coordinate accuracies are equal to σo and the covariance is 0. We get:
$$\sigma_S^2 = \frac{1}{4}\sum_{i=1}^{n}\left[\Delta y_{i-1,i+1}^2 + \Delta x_{i-1,i+1}^2\right]\sigma_o^2 = \frac{1}{4}\sum_{i=1}^{n}\left[l_{i-1,i+1}^2\right]\sigma_o^2$$
Where: li-1,i+1 is the distance between points Pi-1 and Pi+1
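Under the equal-accuracy, zero-covariance assumption, the area variance is σ_S² = ¼ σo² Σ l², where l is the distance between the two neighbours P(i−1) and P(i+1) of each vertex. A minimal sketch, with an illustrative function name and a hypothetical 10 m square digitized with σo = 0.1 m:

```python
import math


def area_std(vertices, sigma_0):
    """Standard deviation of the polygon area for equal, uncorrelated
    coordinate standard deviations sigma_0."""
    n = len(vertices)
    sum_l2 = 0.0
    for i in range(n):
        x_prev, y_prev = vertices[i - 1]
        x_next, y_next = vertices[(i + 1) % n]
        # Squared distance between the two neighbours of vertex i.
        sum_l2 += (x_next - x_prev) ** 2 + (y_next - y_prev) ** 2
    return 0.5 * sigma_0 * math.sqrt(sum_l2)


square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(round(area_std(square, 0.1), 3))  # 1.414
```

For the square each neighbour distance is a diagonal, so σ_S = σo · side · √2, about 1.4 m² on an area of 100 m².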
What is a standard?
Standards are documented agreements containing technical specifications or other precise criteria to be used consistently as rules, guidelines, or definitions of characteristics, to ensure that materials, products, processes and services are fit for their purpose.
(as defined by ISO)
Examples of Everyday Standards
Traffic Signals – Road Signs
VISA / Mastercard: standards allow people to use a single card to obtain cash in the local currency around the world
Commerce/Manufacturing/Industry
World War II - Allied supplies and facilities were severely strained due to the incompatibility of tools, replacement parts, and equipment. The establishment of international standards helped to increase compatibility.
The Importance of Standards (when standards do not exist)
Disasters (fire, flood, …)
Great Baltimore Fire of 1904 - fire engines from different regions arrived to help put out the fire, but their hose couplings were of different sizes and did not fit the Baltimore hydrants. The fire burned for over 30 hours and destroyed 1526 buildings covering 17 city blocks.
Metric System vs US Customary System
The Need for Standards in Geographic Information
To ensure common understanding through a common set of
terminology
To promote/enable interoperability
To support the establishment of geospatial infrastructures at
local, regional, and global levels
To promote data and information sharing/exchange
Types of geospatial standards
Data Classification
e.g., Vegetation Classification
Data Content
e.g., Digital Geospatial Metadata, Spatial Schema
Data Symbology or Presentation
e.g., Digital Geologic Map Symbolization
Data Transfer
Data Usability
e.g., Geospatial Positioning Accuracy
Evaluating and Reporting Quality Evaluation
Results [ISO 19114]
5 step process on quality evaluation
1. Identify an applicable data quality element, data quality subelement, and data quality scope
2. Identify a data quality measure
3. Select and apply a data quality evaluation method
4. Determine the data quality result
5. Determine conformance
Inputs: Dataset as specified by the scope; Product specification or user requirements; Conformance quality level
Outputs: Report data quality result (quantitative); Report data quality result (pass / fail)
Standards referenced in the figure: ISO 19113; ISO 19114; work item 19131
Metadata Example
Without…
With…
Metadata need Example
WQPW-ID   DIN   Pb
PB-31     .34   .012
HK-14     .12   .023
PB-12     .35   .034
WA-3      .28   .001
PB-4      .23   .022
PB-5      .21   .013
HUH?
The Standard
Metadata has four major roles:
Availability: information needed to determine the sets of data that exist for a geographic location.
Fitness for use: information needed to determine if a set of data meets a specific need.
Access: information needed to acquire an identified set of data.
Transfer: information needed to process and use a set of data.
Information that can be found in Metadata
• Title, Abstract, Publication Date (Section 1: Identification Information)
• Data Accuracy and Completeness (Section 2: Data Quality Information)
• Data Form: Vector or Raster? (Section 3: Spatial Data Organization Information)
• Projection or Geographic Reference System (Section 4: Spatial Reference Information)
• What Values Are Associated with Geodata? (Section 5: Entity and Attribute Information)
• How Do You Get It? Cost? (Section 6: Distribution Information)
• How Current Is the Documentation? (Section 7: Metadata Reference Information)
The Value of Metadata
Organize and maintain an organization’s investment in data
Provide information to data catalogs and clearinghouses
Provide information to aid data transfer
Food for thought... Nothing happens overnight: get used to thinking of the long term benefits
of metadata. $$$
Documentation = defense
The Standard: don't judge a book by its cover
Metadata resources
The Federal Geographic Data Committee (FGDC): interagency committee that
coordinates federal geo-data activities.
The Content Standard for Digital Geospatial Metadata (CSDGM)
•The current US Federal Metadata standard
•Often referred to as the 'FGDC Metadata Standard'
•Has been implemented in federal, state, and local governments
The International Organization for Standardization (ISO) has developed and
approved an international metadata standard, ISO 19115 – Geographic
Information Metadata
Metadata resources
• The objective of this International Standard is to provide a clear
procedure for the description of digital geographic datasets so that users
will be able to determine whether the data in a holding will be of use to
them and how to access the data. By establishing a common set of
metadata terminology, definitions and extension procedures, this
standard will promote the proper use and effective retrieval of geographic
data.
• Supplementary benefits of this standard for metadata are to facilitate the
organization and management of geographic data and to provide
information about an organization’s database to others.
• This standard for the implementation and documentation of metadata
furnishes those unfamiliar with geographic data the appropriate
information to characterize their geographic data and it makes possible
dataset cataloguing enabling data discovery, retrieval and reuse.
Graphical Representation of the:
US Geological Survey Biological Resources Division
DRAFT Content Standard for Biological Metadata
Based on: The Federal Geographic Data Committee’s Content Standard for Digital Geospatial Metadata, June 8, 1994, version 1.0
Prepared by Susan Stitt, Center for Biological Informatics
Metadata sections:
1. Identification Information
2. Data Quality Information
3. Spatial Data Organization Information
4. Spatial Reference Information
5. Entity and Attribute Information
6. Distribution Information
7. Metadata Reference Information
Legend: Mandatory; Mandatory if Applicable; Optional; Biological Items Added
Best Practices for Writing Quality Metadata
Writing Principles
Write simply but completely
Document for a general audience
Adopt a consistent style
Avoid using jargon
Define technical terms
Best Practices for Writing Quality Metadata
In Practice
State clearly what your data are not
Find, evaluate, and reuse good examples
See examples from FGDC workbook
Mine the Clearinghouse for other examples
Use keywords as indicators of the contents of a dataset
Use a thesaurus or controlled vocabulary when possible
Best Practices for Writing Quality Metadata
In Practice (continued)
Use subtitles to define and clarify long passages
Quantify assessments wherever possible
Use “None” and “Unknown” carefully
Format date: YYYYMMDD
Avoid using confusing symbols & conventions:
! @ # % { } | / \ < > ~
Unnecessary carriage returns, tabs, indents, etc.
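The date rule is easy to enforce before a metadata record is published. The sketch below assumes the intended format is the eight-digit YYYYMMDD form. The explicit length and digit check is there because `datetime.strptime` on its own also accepts single-digit months and days. The function name is hypothetical.

```python
from datetime import datetime


def is_yyyymmdd(text):
    """Return True if text is a valid calendar date written as YYYYMMDD."""
    # Require exactly eight ASCII digits before parsing.
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True


for value in ["19940608", "1994068", "19940230", "1994-06-08"]:
    print(value, is_yyyymmdd(value))
# 19940608 True
# 1994068 False
# 19940230 False
# 1994-06-08 False
```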