
Thinking Critically about Geospatial Data Quality

Michael F. Goodchild
University of California, Santa Barbara


Starting points
• All geospatial data leave the user to some extent uncertain about the state of the real world
  – missing data
  – positional and attribute errors
  – uncertain definitions of classes and terms
  – missing metadata
    • projection unspecified
    • horizontal or vertical datum
  – cartographic license


Uncertainty is endemic
• All geographic information leaves some degree of uncertainty about conditions in the real world
• x (location)
  – the Greenwich Meridian
  – the Equator
  – standard time
• z (attributes)
  – definitions of terms
  – errors of measurement


Starting points
• Some applications will be impacted, some will not
  – for any given data set at least one application can be found where uncertainty matters
  – knowing whether data are fit for use
• Our perspective on these issues has changed in the past two decades
  – from error to uncertainty
  – from top-down to bottom-up data production


[Image source: http://www.thesalmons.org/lynn/wh-greenwich.html]


Definitions of terms
• Precision:
  – the number of significant digits used to report a measurement
  – should never be more than is justified by the accuracy of the measuring device
  – internal precision of the computer (see the sketch below)
    • single precision arithmetic: 1 part in 10^7
    • double precision: 1 part in 10^14
    • relative to the Earth's dimensions, single precision is about a meter resolution, double is about the size of an atom
    • no GIS should ever need more than single precision
    • a GIS's internal precision is effectively infinite
    • hard to persuade designers to drop those spurious digits
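To put numbers on the precision bullets above, here is a minimal Python sketch; the Earth circumference figure is an approximation, and the relative precisions are the round values quoted on the slide rather than exact machine epsilons.

```python
# Minimal sketch: relative floating-point precision expressed as a ground
# resolution at the scale of the Earth (round values, as quoted on the slide).
EARTH_CIRCUMFERENCE_M = 40_000_000  # roughly 40,000 km around the Equator

cases = {
    "single precision (1 part in 10^7)": 1e-7,
    "double precision (1 part in 10^14)": 1e-14,
}

for label, relative_precision in cases.items():
    resolution_m = EARTH_CIRCUMFERENCE_M * relative_precision
    print(f"{label}: about {resolution_m:.0e} m on the ground")

# single precision -> ~4e+00 m (of the order of a metre)
# double precision -> ~4e-07 m (far finer than any measurement could justify)
```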


Scale
• Relationship between measurement of distance on a map and measurement on the ground
  – 1:24,000 is larger than 1:250,000 (1 cm on a 1:24,000 map represents 240 m on the ground; on a 1:250,000 map it represents 2.5 km)
• A map used for digitizing or scanning always has a scale
  – a geospatial database never has scale
  – but we have complex conventions
• Scale as a useful surrogate for map contents, resolution, accuracy
  – scale of original map a useful item of geospatial metadata


Resolution
• The minimum distance over which change is recorded
  – 0.5 mm for a paper map
• Positional accuracy
  – set by national map accuracy standards to roughly 0.5 mm at map scale


Scale          Positional accuracy and resolution
1:1,000        0.5 m
1:10,000       5 m
1:24,000       12 m
1:100,000      50 m
1:250,000      125 m
1:1,000,000    500 m
1:2,000,000    1 km
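The table follows directly from the 0.5 mm rule of thumb on the previous slide; a minimal Python sketch that reproduces it (the 0.5 mm tolerance is the slide's figure, the list of scales is the table's):

```python
# Minimal sketch: ground accuracy implied by the rule of thumb that positional
# accuracy and resolution correspond to roughly 0.5 mm at map scale.
MAP_TOLERANCE_M = 0.0005  # 0.5 mm on the map, expressed in metres

for denominator in (1_000, 10_000, 24_000, 100_000, 250_000, 1_000_000, 2_000_000):
    ground_m = MAP_TOLERANCE_M * denominator
    print(f"1:{denominator:<9,} -> {ground_m:g} m on the ground")
```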


Accuracy
• The difference between a measurement and the truth
  – problems of defining truth for some geospatial variables, e.g. soil class
  – if two people classified the same site would they agree?
• Uncertainty of definition is a form of inaccuracy, along with:
  – variation between observers or measuring instruments
  – temporal change
  – loss of information on e.g. projection, datum
  – transformation of datum
  – map registration
  – digitizing error
  – imperfect fit of the data model, e.g. heterogeneous polygons, transition zones instead of boundaries
  – fuzziness of many geographic concepts
  – transformation of coordinate system, projection, data model, e.g. raster/vector conversion


Truth
• Often a source of higher accuracy
  – circularity: accuracy is the difference between a measurement and a source of higher accuracy?
• What identifies a source as having higher accuracy?
  – larger scale
  – more recent
  – cost more, took longer to make, more careful
  – more accurate measuring instrument
  – certified by an expert
  – earlier in the chain reality–map–database (less processing)


The problem
• Uncertainty is endemic in geospatial data
  – even a 1:1 mapping would not create a perfect representation of reality
• All GIS products are therefore subject to uncertainty
  – what is the plus or minus on estimates of length, area, counts of objects, positions, attributes, viewsheds, buffer zones, ...
• GIS products are often used in decision-making by people who do not have intimate knowledge of the methods used to collect, digitize or process the data
  – results are often presented and used visually rather than numerically
• Computer (GIS) output carries a false sense of credibility


Topology vs geometry
• Which property does this pole lie in? (see the sketch below)
• Which side of the street is this house?
• Do these two streets connect?
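The first question is, at bottom, a point-in-polygon query, and it shows why coordinates alone are fragile: a positional error well within a typical data set's accuracy can flip the answer. A minimal Python sketch; the parcel boundary, the pole coordinates, and the roughly one-metre shift are all invented for illustration.

```python
# Minimal sketch: a point-in-polygon query flips under a small positional error.
# The parcel boundary, pole coordinates, and ~1 m shift are invented values.
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is point (x, y) inside the polygon (list of vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

parcel = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]  # hypothetical parcel (m)
pole_recorded = (99.4, 25.0)   # recorded position, just inside the east boundary
pole_shifted = (100.6, 25.0)   # the same pole after a ~1 m positional error

print(point_in_polygon(*pole_recorded, parcel))  # True:  this parcel
print(point_in_polygon(*pole_shifted, parcel))   # False: apparently the neighbour's
```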


Acres        1     2     3     4    5   1+2+5  1+2+3+5  1+2+3+4+5
0-1          0     0     0     1    2    2640    27566      77346
1-5          0   165   182   131   31    2195     7521       7330
5-10         5   498   515   408   10    1421     2108       2201
10-25        1   784   775   688   38    1590     2106       2129
25-50        4   353   373   382   61     801      853        827
50-100       9   238   249   232   64     462      462        413
100-200     12   155   152   158   72     248      208        197
200-500     21    71    83    89   92     133      105         99
500-1000     9    32    31    33   56      39       34         34
1000-5000   19    25    27    21   50      27       24         22
>5000        8     6     7     6   11       2        1          1
Totals      88  2327  2394  2149  487    9558    39188      90599


Data types
• The area-class map
  – soil maps
  – vegetation cover type
  – land use
• Boundaries surrounding areas of uniform type
  – what's the accuracy issue?


The area-class map
• Assigns every location x to a class
  – Mark and Csillag term
  – c = f(x)
  – a nominal field (or perhaps ordinal)
  – classified scene
  – soil map, vegetation cover map, land use map
• Need to model uncertainty in this type of map


Uncertainty modeling
• Area-class maps are made by a long and complex process involving many stages, some partially subjective
• Maps of the same theme for the same area will not be the same
  – level of detail, generalization
  – vague definitions of classes
  – variation among observers
  – measuring instrument error
  – different classifiers, training sites
  – different sensors


Error and uncertainty
• Error: true map plus distortion
  – systematic measurements disturbed by stochastic effects
  – accuracy (deviation from true value)
  – precision (deviation from mean value)
  – variation ascribed to error
• Uncertainty: differences reflect uncertainty about the real world
  – no true map
  – possible consensus map
  – combining maps can improve estimates


Models of uncertainty
• Determine effects of uncertainty/variation/error on results of analysis
  – if there is known variation, the results of a single analysis cannot be claimed to be correct
  – uncertainty analysis an essential part of GIS
  – error model the preferred term


Traditional error analysis
• Measurements subject to distortion
  – z' = z + δz
• Propagate through transformations
  – r = f(z)
  – r + δr = f(z + δz)
• But f is rarely known
  – complex compilation and interpretation
  – complex spatial dependencies between elements of the resulting data set (a Monte Carlo sketch follows below)
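When f cannot be written down analytically, the propagation above is done numerically: perturb the inputs according to the error model, recompute the product, and summarize the spread. A minimal Monte Carlo sketch in Python, using line length as the derived product; the vertex coordinates and the 5 m per-coordinate error are invented, and the errors are treated as independent purely for simplicity.

```python
import math
import random

# Minimal Monte Carlo sketch: propagate positional error into a derived length.
# Vertex coordinates and the 5 m per-coordinate error are invented values;
# errors are drawn independently here purely for simplicity.
random.seed(42)

vertices = [(0.0, 0.0), (400.0, 300.0), (900.0, 300.0), (1200.0, 700.0)]  # metres
SIGMA = 5.0  # standard error of each coordinate, metres

def length(points):
    return sum(math.dist(points[i], points[i + 1]) for i in range(len(points) - 1))

lengths = []
for _ in range(10_000):
    perturbed = [(x + random.gauss(0.0, SIGMA), y + random.gauss(0.0, SIGMA))
                 for x, y in vertices]
    lengths.append(length(perturbed))

mean = sum(lengths) / len(lengths)
sd = (sum((v - mean) ** 2 for v in lengths) / len(lengths)) ** 0.5
print(f"length = {mean:.0f} m +/- {sd:.0f} m  (unperturbed vertices give {length(vertices):.0f} m)")
```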


Spatial dependence
• In true values z
• In errors e
• cov(e_i, e_j) a decreasing positive function of distance (see the sketch below)
  – geostatistical framework
• Scale effects, generalization as convolutions of z
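A minimal sketch of the geostatistical idea: model the covariance of errors at two locations as a positive function that decays with the distance between them. The exponential form, the 10 m error level, and the 500 m range parameter are illustrative choices, not values from the slides.

```python
import math

# Minimal sketch: an exponential covariance model for errors,
#   cov(e_i, e_j) = sigma^2 * exp(-d_ij / range),
# a positive function that decreases with distance, as on the slide.
# The (10 m)^2 variance and 500 m range are illustrative values only.
SIGMA2 = 10.0 ** 2    # error variance, m^2
RANGE_M = 500.0       # distance over which correlation decays, m

def error_covariance(distance_m):
    return SIGMA2 * math.exp(-distance_m / RANGE_M)

for d in (0, 100, 500, 2000):
    print(f"distance {d:>5} m: covariance {error_covariance(d):7.1f} m^2")
```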


If this were not true
• If Tobler's First Law of Geography did not apply to errors in maps
• If errors were statistically independent
• If relative errors were as large as absolute errors
• Errors in derived products would be impossibly large
  – e.g. slope (see the sketch below)
  – e.g. length
• Shapes would be unrecognizable
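The slope case can be made concrete: slope between two DEM cells is (z2 − z1)/d, so the variance of its error is (var(e1) + var(e2) − 2·cov(e1, e2))/d². A minimal sketch comparing independent errors with strongly correlated ones; the 2 m elevation error, 30 m spacing, and 0.9 correlation are illustrative values.

```python
# Minimal sketch: error in slope between adjacent DEM cells, slope = (z2 - z1)/d.
# For equal variances and correlation rho, the variance of the error in the
# difference is 2*sigma^2*(1 - rho). The 2 m error, 30 m spacing, and rho = 0.9
# are illustrative values, not figures from the slides.
SIGMA = 2.0   # elevation error (standard deviation), metres
D = 30.0      # distance between cell centres, metres

for label, rho in (("independent errors (rho = 0.0)", 0.0),
                   ("correlated errors  (rho = 0.9)", 0.9)):
    slope_error = (2 * SIGMA**2 * (1 - rho)) ** 0.5 / D
    print(f"{label}: slope error ~ {slope_error:.3f} ({100 * slope_error:.1f}%)")
```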


Realization
• A single instance from an error model
  – an error model must be stochastic
  – Monte Carlo simulation
• The Gaussian distribution metaphor
  – scalar realizations
  – a Gaussian distribution for maps
    • an entire map as a realization


Model
• {p1, p2, …, pn}
• correlation in neighboring cell outcomes
• posterior probabilities equal to priors
• 80% sand, 20% inclusions of clay
• no knowledge of correlations
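A minimal sketch of the model as stated: with posterior probabilities equal to the priors and no knowledge of correlations, each realization assigns every cell to sand with probability 0.8 and to a clay inclusion with probability 0.2, independently. The grid size and number of realizations are arbitrary.

```python
import random

# Minimal sketch: equally likely realizations of an area-class map in which each
# cell is sand (S) with probability 0.8 and clay (C) with probability 0.2.
# With no knowledge of correlations, cell outcomes are drawn independently.
random.seed(1)
ROWS, COLS, P_SAND = 10, 10, 0.8

def realization():
    return [["S" if random.random() < P_SAND else "C" for _ in range(COLS)]
            for _ in range(ROWS)]

for k in range(2):                         # two equally likely realizations
    grid = realization()
    clay_cells = sum(row.count("C") for row in grid)
    print(f"realization {k + 1}: {clay_cells} clay cells out of {ROWS * COLS}")
    print("\n".join(" ".join(row) for row in grid))
    print()
```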


Topographic data
• Definition problems
  – sand dunes
  – trees
  – buildings
• Classic measurement error model
  – measured elevation = truth + error
  – error spatially autocorrelated


z'(x) = z(x) + δz(x)
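A minimal sketch of one realization of this model along a single elevation profile: the error term δz is made spatially autocorrelated here by smoothing white noise with a moving-average window (one simple way to induce autocorrelation), then added to the true elevations. The profile, the 1 m error level, and the window width are invented illustrative values.

```python
import random

# Minimal sketch: one realization of z'(x) = z(x) + dz(x) along a profile.
# dz is made spatially autocorrelated by smoothing white noise with a
# moving-average window; the profile, 1 m error level, and window width
# are invented illustrative values.
random.seed(7)
N, SIGMA, WINDOW = 50, 1.0, 5

z_true = [100.0 + 0.5 * i for i in range(N)]              # a toy sloping profile (m)
white = [random.gauss(0.0, SIGMA) for _ in range(N + WINDOW - 1)]

# averaging neighbouring noise values makes nearby errors positively correlated
dz = [sum(white[i:i + WINDOW]) / WINDOW for i in range(N)]
z_measured = [z + e for z, e in zip(z_true, dz)]

print("first five cells (true, error, measured):")
for zt, e, zm in list(zip(z_true, dz, z_measured))[:5]:
    print(f"{zt:7.2f}  {e:+6.2f}  {zm:7.2f}")
```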


Glyphs indicating wind direction, magnitude, and uncertainty (Pang, 2001)


Representation of estimated water balance surplus/deficit (using a mesh surface) and uncertainty in the estimates (using bars above and below the surface). The bars depict the range of a set of model predictions with those predictions above the mean in purple and those below the mean in orange.

Fauerbach, Edsall, Barnes, MacEachren animation


Options for uncertainty
• Ignore
• Present parameters
• Present simulations
  – confidence intervals


Point symbol sets depicting uncertainty with variation in (a) saturation (colors vary from saturated green, bottom, to unsaturated, top); (b) crispness of symbol edge – middle; and (c) transparency of symbol – right.


Alternative depictions of data (inorganic nitrogen in Chesapeake Bay) and uncertainty of data interpolated from sparse point samples. Left view shows bivariate depiction in which dark=more nitrogen and certainty is depicted with a diverging color scheme (blue = most certain and red = most uncertain). Right view depicts data in both panels (dark = more nitrogen), with the right panel showing the results of interactive focusing on the most certain data.


Communication of uncertainty
• Producer to user
  – abilities
• Metadata standards
  – parameters of complex models
• Assertion:
  – all knowledge of uncertainty can be expressed in a suitable simulation model
  – equally likely realizations


The "five-fold way"• Positional accuracy• Attribute accuracy• Logical consistency• Completeness• Lineage• Federal Geographic Data Committee

– Spatial Data Transfer Standard– Content Standard for Digital Geospatial Metadata– www.fgdc.gov


Metadata
• Data about data
  – handling instructions
  – catalog entry
  – fitness for use
• What is known about data quality
  – a measure of the success of spatial data quality research
  – much progress has been made
  – FGDC CSDGM, 1994
  – ISO 19115, 2003


Web 2.0
• Dominated by user-generated content
• Bottom-up supply of geospatial data
• What are the issues?


CSDGM, ISO 19115
• Do they match the state of research?
  – early 1990s
  – SDTS discussions of the 1980s
  – the five-fold way
    • positional accuracy
    • attribute accuracy
    • logical consistency
    • completeness
    • lineage
• Do they represent a user perspective?
  – committees staffed by data producers
  – production control mechanisms?


Producer or user?
• Producer-centric
  – details of the production process: the measurement and compilation systems used
  – tests of data quality conducted under carefully controlled conditions
  – formal specifications of data set contents
• User-centric
  – effects of uncertainties on specific uses of the data, from simple queries to complex analyses
  – simple descriptions of quality that are readily understood by non-expert users
  – tools to enable the user to determine the effects of quality on results


Increasing complexity
• Self-documentation
  – notes to oneself
• A colleague
  – brief description
• Another discipline, language, culture
  – ideal metadata/data ratio?


[Figure: complexity of metadata plotted against social distance]


References
• My CV and papers
• Book list