2005 geog090 portrayal

David Tenenbaum – GEOG 090 – UNC-CH Spring 2005

Data Portrayal and Special Considerations for Spatial Data

•Portraying data drawn from the various scales of measurement requires approaches that are appropriate to those scales
•Histograms are a useful method of portraying data from the higher scales of measurement, and can be used with absolute, relative, and cumulative frequencies
•Spatial data presents some special challenges and opportunities for quantitative analyses


Data Portrayal
•Before choosing descriptive statistics to summarize data, it is often useful to portray it in some fashion that allows you to get a sense of the dataset.
•Many portrayal approaches still involve reducing the volume of data (and information content), but if applied properly, they can help you see the interesting characteristics of data
•For the various scales of measurement, there are different approaches that are applicable


Nominal Data
•From one of my dissertation transect samples, the frequencies of the types of segments are nominal data:

Class        Frequency   % of Total
Woody           105        32.92
Herbaceous      151        47.34
Water             1         0.31
Ground            6         1.88
Road             23         7.21
Pavement         22         6.90
Structures       11         3.45

Normalizing the data means expressing it relative to the total (some caveats here)
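The normalization step can be sketched in a few lines of Python. The counts are the ones from the transect table; the function name `percent_of_total` is only an illustrative choice.

```python
# Segment-type counts copied from the transect table.
counts = {
    "Woody": 105, "Herbaceous": 151, "Water": 1, "Ground": 6,
    "Road": 23, "Pavement": 22, "Structures": 11,
}

def percent_of_total(freqs):
    """Return each category's frequency as a percentage of the total."""
    total = sum(freqs.values())
    return {name: 100.0 * n / total for name, n in freqs.items()}

percents = percent_of_total(counts)
for name, pct in percents.items():
    print(f"{name:<11}{counts[name]:>5}{pct:>8.2f}")
```

Rounded to two decimals, these values reproduce the "% of Total" column.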


Nominal Data
•The table above is a tabular presentation of the data: it has the advantage of giving the exact quantities, but can be ‘busy’, especially for larger tables


Nominal Data
•The frequency of nominal data can be well displayed by a bar graph

[Figure: bar graph "Segment Type Frequency". x-axis: Segment Types (Woody, Herbaceous, Water, Ground, Road, Pavement, Structures); y-axis: Frequency, 0 to 160]


Nominal Data
•Once normalized, the values are well displayed in a pie chart, which emphasizes each category’s portion of the whole

[Figure: pie chart "Segment Types", with slices labeled Woody 33%, Herbaceous 48%, Water 0%, Ground 2%, Road 7%, Pavement 7%, Structures 3%]


Ordinal, Interval, & Ratio Data
•From my dissertation, the set of all topographic moisture index values drawn from a raster data layer is an example of an interval dataset:


Ordinal, Interval, & Ratio Data
•Pond Branch is a 37.55 hectare watershed, which is equivalent to 375,500 m² (1 hectare = 10,000 m²)
•Using 11.25 m x 11.25 m pixels (126.5625 m²), there are ~2966 pixels from which we can draw TMI values
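The pixel count can be checked with a short calculation. This is only a sketch of the arithmetic; the variable names are illustrative. The quotient comes out a little under 2967, which the slide reports as ~2966.

```python
HECTARE_M2 = 10_000          # 1 hectare = 10,000 m^2
watershed_ha = 37.55         # Pond Branch area from the slide
cell_size_m = 11.25          # pixel edge length from the slide

watershed_m2 = watershed_ha * HECTARE_M2
cell_area_m2 = cell_size_m ** 2
n_cells = watershed_m2 / cell_area_m2

print(f"watershed area: {watershed_m2:.0f} m^2")
print(f"cell area: {cell_area_m2} m^2")
print(f"cells: {n_cells:.2f}")
```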


Ordinal, Interval, & Ratio Data
•It would clearly be impractical to try and get a sense of the distribution of TMI values in Pond Branch by looking at a table of 2966 values
•We need a data reduction approach by which we can reduce the number of values to a manageable amount, which in turn lends itself to some sort of graphical display
•For ordinal, interval, and ratio scale data, we can make use of histograms for this purpose, and building a histogram involves following a multi-step procedure …


Building a Histogram
1. Developing an ungrouped frequency table
• That is, we build a table that counts the number of occurrences of each variable value from lowest to highest:

TMI Value   Ungrouped Freq.
4.16        2
4.17        4
4.18        0
…           …
13.71       1

•We could attempt to construct a bar chart from this table, but it would have too many bars to really be useful
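A minimal sketch of building an ungrouped frequency table follows. The input values are hypothetical (not the Pond Branch data), and the 0.01 step is an assumption based on the two-decimal TMI values shown in the table.

```python
from collections import Counter

# Hypothetical TMI values for illustration only (not the Pond Branch data).
tmi_values = [4.16, 4.17, 4.16, 4.17, 4.19, 4.17, 4.17, 4.19]

def ungrouped_frequency_table(values, step=0.01):
    """Count occurrences of each value, listing every step from min to max."""
    # Work in integer multiples of `step` to avoid floating-point key mismatches.
    keys = [round(v / step) for v in values]
    counts = Counter(keys)
    return [(k * step, counts.get(k, 0)) for k in range(min(keys), max(keys) + 1)]

table = ungrouped_frequency_table(tmi_values)
for value, freq in table:
    print(f"{value:.2f}  {freq}")
```

Values that never occur still get a row with a frequency of 0, as with 4.18 in the slide's table.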


Building a Histogram
2. Construct a grouped frequency table
• This table has classes of values (in a sense we are reducing our data back to the ordinal scale for display purposes)

• The decision on how to perform the grouping is a subjective one, but there are some common guidelines:

• Use class intervals with simple bounds and a common width (i.e. categories have same range)

• Adjacent intervals should not overlap (each datum should fit into one class)


Building a Histogram
3. Select an appropriate number of classes
• There are formulae available to make this decision objectively, but in reality it is a somewhat subjective decision

• If you have more observations, you usually need more classes, because when you put observations together in a class, you are considering them to have the same value for display purposes. There is a trade-off here between simplicity and loss of information (e.g. Pond Branch TMI: 2966 observations grouped into 10 classes)


Building a Histogram
3. Select an appropriate number of classes cont.

Class            Frequency
 4.00 -  4.99       120
 5.00 -  5.99       807
 6.00 -  6.99      1411
 7.00 -  7.99       407
 8.00 -  8.99        87
 9.00 -  9.99        33
10.00 - 10.99        17
11.00 - 11.99        22
12.00 - 12.99        43
13.00 - 13.99        19
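As a rough check on the grouped table, the class counts can be converted to percentages of all cells. The counts are copied from the slide; the text bars are just an illustrative way to preview the shape.

```python
# Grouped frequency table copied from the slide: (class lower bound, count).
grouped = [(4, 120), (5, 807), (6, 1411), (7, 407), (8, 87),
           (9, 33), (10, 17), (11, 22), (12, 43), (13, 19)]

total = sum(count for _, count in grouped)
percent = [(lower, 100.0 * count / total) for lower, count in grouped]

for lower, pct in percent:
    # One '#' per percentage point gives a rough text histogram.
    print(f"{lower:>2}.00 - {lower:>2}.99  {pct:5.1f}%  {'#' * round(pct)}")
```

The counts sum to the 2966 observations mentioned earlier, and the largest class (6.00 - 6.99) holds just under half of the cells.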


Building a Histogram
4. Plot the frequencies of each class
• All that remains is to create the plot:

[Figure: "Pond Branch TMI Histogram". x-axis: Topographic Moisture Index (4 to 16); y-axis: Percent of cells in catchment (0 to 48)]


Some Guidelines for Grouping
1. Generally we want 6 – 12 classes for the observations
2. Each class should be the same width: uneven classes lead to misleading displays
3. Classes must be mutually exclusive and collectively exhaustive (each and every observation must fit into one AND only one class)
4. Try to make use of a class interval (class size/width) that lets us see a pattern in the data (if there is one)
5. Watch out for outliers (radically different observations) as their inclusion can result in misleading plots


Basic Procedure for Histograms
1. Compute the range of observations (min. & max. value)
2. Choose an initial # of classes (most likely based on the range of values; try to find a number of classes that divides evenly into the range of values and is still within the 6 – 12 class guideline)
3. Compute the class interval = range / number of classes, rounding the result to the nearest convenient number (preferably an integer, adjusting as necessary)
4. Select a starting value for the classes that is less than or equal to the lowest value in the observations


Basic Procedure for Histograms
5. Adjust the range, width, and starting point if necessary
6. Compute the midpoint of each class (this is particularly useful if plotting a line plot histogram rather than the bar chart variety)
7. The ‘actual bounds’ will depend on the precision and accuracy of the data (e.g. class limits of 1-2, 3-4, etc. might have actual limits of 0.5-2.5, 2.5-4.5 etc. because we have rounded)
8. Plot the data
• At this point, you’re not likely to be required to create a histogram by hand very often (as it is easily done using software), but it’s good to know the theory
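The procedure can be sketched in Python as follows. This is one possible reading of the steps, not a definitive implementation: the observations are hypothetical, the function name `build_histogram` is illustrative, and rounding the interval up to the next integer is an assumption made to keep the class bounds simple.

```python
import math

def build_histogram(values, n_classes=8):
    """Follow the slide's steps: range, class count, interval, start, midpoints, counts."""
    lo, hi = min(values), max(values)
    # Step 3: class interval = range / number of classes, rounded up to a
    # convenient number (here the next integer, an assumption for simplicity).
    width = max(1, math.ceil((hi - lo) / n_classes))
    # Step 4: start at a convenient value <= the minimum observation.
    start = math.floor(lo / width) * width
    # Step 5: adjust the number of classes so the maximum is covered.
    n = math.floor((hi - start) / width) + 1
    classes = []
    for i in range(n):
        lower = start + i * width
        upper = lower + width            # class covers lower <= x < upper
        count = sum(1 for v in values if lower <= v < upper)
        classes.append({"lower": lower, "upper": upper,
                        "midpoint": (lower + upper) / 2, "count": count})
    return classes

# Hypothetical observations, for illustration only.
observations = [3, 7, 8, 12, 13, 14, 18, 21, 22, 22, 25, 29, 31, 38, 44, 47]
for c in build_histogram(observations, n_classes=8):
    print(f"{c['lower']:>3} to <{c['upper']:<3} mid={c['midpoint']:<5} n={c['count']}")
```

Using half-open classes (lower <= x < upper) is one way to guarantee that every observation falls into one and only one class.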


Frequencies & Distributions
•A histogram is one way to depict a frequency distribution. A loose definition of a frequency:
•The number of times a variable takes on a particular value (note that any variable has a frequency distribution)
•E.g. roll a pair of dice several times and record the resulting values (constrained to being between 2 and 12), counting the number of times any given value occurs (the frequency of that value occurring), and take these all together to form a frequency distribution
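The dice example can be simulated in a few lines. This is a sketch only: the number of rolls (360) and the fixed seed are arbitrary choices made so the run is repeatable.

```python
import random
from collections import Counter

def roll_pair_frequencies(n_rolls, seed=0):
    """Roll two six-sided dice n_rolls times and count each total (2 to 12)."""
    rng = random.Random(seed)          # seeded so the sketch is repeatable
    totals = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]
    counts = Counter(totals)
    return {total: counts.get(total, 0) for total in range(2, 13)}

freqs = roll_pair_frequencies(360)
for total, n in freqs.items():
    print(f"{total:>2}  {n:>3}  {'*' * (n // 4)}")
```

Taken together, the eleven counts form the frequency distribution of the dice totals.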


Frequencies & Distributions
•Frequencies can be absolute (when the frequency provided is the actual count of the occurrences of that particular value) or they can be relative (when they are normalized by dividing the absolute frequency by the total number of observations to yield a relative frequency between 0 and 1)
•Relative frequencies are particularly useful if you want to compare distributions drawn from two different sources, i.e. while the number of observations from each source may be different, once normalized they can be reasonably compared


Segment Length Distributions

[Figure: two panels, Glyndon and Upper Baismans Run. x-axis: Segment length (meters), 0 to 100; y-axis: Percent of all segments in class, 0 to 100; series: Woody, Herbaceous, Pavement, Roads, Structures]


Frequencies & Distributions
•In addition to the conventional frequencies described thus far, there is another type of frequency known as a cumulative frequency.
•Cumulative frequencies are calculated by starting with the lowest class of an observed variable and its frequency, and then adding the frequency of each successive class to the preceding sum.
•Cumulative frequencies are desirable when we want to know what proportion of observations have a value less than some threshold


Frequencies & Distributions
•For example, here’s some frequency data for the distance from streams of woody vegetation class segments in Upper Baisman’s Run:

CLASS   MIN. VALUE   FREQ.   CUM. FREQ.
1         0.00000     9.30      9.30
2        23.31757     7.73     17.03
3        46.63514     7.08     24.11
4        69.95271     5.71     29.82
5        93.27028     4.70     34.52
6       116.58785     3.67     38.19
7       139.90542     3.17     41.36
8       163.22300     2.73     44.09
9       186.54057     5.36     49.45
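The cumulative column is just a running sum of the frequency column, which can be sketched with `itertools.accumulate`. The frequencies are copied from the table; rounding to two decimals is only there to suppress floating-point noise.

```python
from itertools import accumulate

# FREQ. column copied from the slide's table (classes 1 to 9).
freq = [9.30, 7.73, 7.08, 5.71, 4.70, 3.67, 3.17, 2.73, 5.36]

# Running total: each entry adds the current class to the preceding sum.
cum_freq = [round(total, 2) for total in accumulate(freq)]

for cls, (f, cf) in enumerate(zip(freq, cum_freq), start=1):
    print(f"{cls}  {f:5.2f}  {cf:5.2f}")
```

Reading the result at class 9, for instance, says that 49.45 of the total falls in the first nine distance classes.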


Baismans Run Primary Class Distance from Stream Distributions

[Figure: two panels, Conventional and Cumulative (a.k.a. Ogive). x-axis: Distance to stream along D8 flow paths (meters), 0 to 1200; y-axis: Percent of all cells in class (0 to 100 in one panel, 0 to 30 in the other); series: Woody, Herbaceous, Pavement and Road, Structures, Ground]


Frequencies & Distributions
• By examining the shape of freq. distribution curves we can gain some sense of the distribution through some general characteristics:
1. Modality – Most distributions are unimodal, but we might also see bimodal or multi-modal dists. (if unimodal, we can also consider):
2. Symmetry – a.k.a. skewness of the distribution – Is it positively or negatively skewed?
3. Kurtosis – Describes the degree of peakedness or flatness of the curve
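Skewness and kurtosis can also be put into numbers. The sketch below uses moment-based formulas, which are one common convention and are not given in the slides; the data are hypothetical. A positive skewness indicates a longer right tail, and an excess kurtosis below 0 indicates a flatter curve than the normal distribution.

```python
def moments_shape(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0   # 0 for a normal curve
    return skewness, excess_kurtosis

# Hypothetical samples, for illustration only.
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 2, 2, 3, 10]
print(moments_shape(symmetric))
print(moments_shape(long_right_tail))
```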


Frequencies & Distributions
• Some other useful descriptive terms which we apply to curves:
• The tail of the curve
• Inflection point(s)
• The peak of the curve
• Outliers
• Concave-up or concave-down


Statistics, Space, and Independence

•Geography offers some special challenges to applying statistics, as well as some special opportunities – Tobler’s Law:

•“Everything is related to everything else, but near things are more related than distant things”

•This notion tells us something useful about relationships in space (which we can examine with spatial autocorrelation), but also presents a problem for parametric inferential statistics:

•Are samples truly independent?


Special Considerations with Spatial Data

• Geography: Acknowledges the relationships between nearby features’ characteristics based on their spatial proximity

• Statistics: Requires the independence of individual observations

• We have a special set of (spatial) issues:
1. The modifiable areal unit problem
2. Boundary Problems
3. Spatial Sampling Procedures
4. Spatial Autocorrelation


1. The Modifiable Areal Unit Problem

• Previous lecture: Individual vs. Grouped Data

• Grouped data was presented in the sense of aggregating observations in a ‘conventional’ fashion

• We are also concerned with the aggregation of observations in a spatial sense, because we often collect and report data based on geographic zones

• BUT what logic do we use when delineating the zones for collection and analysis?



• Rogerson presents a figure (Fig 1.7, p. 14) from Fotheringham and Rogerson (1993) that illustrates an example that assesses migration data using two zoning schemes:



• The previous example emphasized one aspect of the MAUP, which is concerned with the placement of zonal boundaries, when we have some idea of the desirable size of the areas of analysis

• A related aspect of the problem concerns the scale of analysis: What size areas do we want to use (or should we use)? Recall the nature of the ecological fallacy, which we discussed in the previous lecture.
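Both aspects of the MAUP can be illustrated with a toy example. The values below are hypothetical observations along a line, and `zone_means` is an illustrative helper; the point is only that the same data yield different zonal averages depending on where the boundaries fall and how large the zones are.

```python
# Hypothetical values observed at 8 equally spaced locations along a line.
values = [2, 2, 8, 8, 8, 2, 2, 8]

def zone_means(values, boundaries):
    """Average the values inside each zone; boundaries are slice indices."""
    edges = [0] + list(boundaries) + [len(values)]
    return [sum(values[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]

# Zoning aspect: same number of zones, different boundary placement.
print(zone_means(values, [2, 5]))     # zones of 2, 3, and 3 locations
print(zone_means(values, [3, 6]))     # zones of 3, 3, and 2 locations
# Scale aspect: fewer, larger zones.
print(zone_means(values, [4]))        # two zones of 4 locations
```

With two large zones, the contrast between low and high values disappears entirely from the zonal means.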


Comparing Soil Moisture and TMI

[Figure: "Pond Branch - 6/26/02 - Average". x-axis: TMI (4 to 13); y-axis: Vol. Soil Moisture (V/V), 0.0 to 0.6]
[Diagram labels: Sites, Theta, TMI, Compare]


Digital Elevation Models Resolutions

• Interpolate DEMs from photogrammetric and LIDAR spot elevations at a range of resolutions:

• 0.5 m to 5 m DEMs in 0.5 m increments (e.g. 0.5m, 1m, 1.5m, 2m, 2.5m, 3m, 3.5m, 4m etc.)

• 5 m to 30 m DEMs in 1.25 m increments (e.g. 5m, 6.25m, 7.5m, 8.75m, 10m, 11.25m etc.)

• Derive Topographic Moisture Index Values at the full range of scales

• Perform the correlation comparisons at a range of scales in order to detect critical scales
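The full set of DEM resolutions described above can be generated as a quick sketch; the variable names are illustrative.

```python
# Fine series: 0.5 m to 5 m in 0.5 m increments.
fine = [0.5 * i for i in range(1, 11)]
# Coarse series: 5 m to 30 m in 1.25 m increments.
coarse = [5 + 1.25 * i for i in range(21)]

# Merge the two series; 5 m appears in both, so remove the duplicate.
resolutions = sorted(set(fine + coarse))
print(len(resolutions), resolutions)
```

The 11.25 m cell size used for the Pond Branch TMI values is one member of the coarse series.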


Cell Size

[Figure: four panels of correlation versus cell size, arranged by catchment (Glyndon, Pond Branch) and DEM source (LIDAR, Photogram.). x-axis: Cell Size (metres), 0 to 30; y-axis: Correlation, -0.4 to 1.0]


2. Boundary Problems
• Anytime we define a study area for the collection of data, we are making some assumptions about the extent of that area being relevant to the question at hand, and anything outside of it being irrelevant

• BUT what if some factor that is influencing the phenomenon of interest is located just outside the study area, OR perhaps our chosen area is not homogeneous and includes something other than our populations of interest?


2. Boundary Problems
• The basis of boundaries can be rational and less vague (e.g. watershed boundaries … although multiple methods can lead to differing boundaries) or rather arbitrary (e.g. based on political units that don’t necessarily organize the phenomena of interest).

• One suggestion that Rogerson provides is buffering around a study area in order to provide some ‘margin of error’ for the inclusion of relevant information


Glyndon Catchment – Urbanizing
Color Infrared Digital Orthophotography


Glyndon Sampling

• 3 Samples, 100 meters/ha, 100 meter long transects

[Figure: map of Glyndon Sample 1, Glyndon Sample 2, and Glyndon Sample 3, with north arrow and a scale bar in kilometers]


Study Climate Divisions

[Figure: map showing North Carolina Climate Division 3 and Maryland Climate Division 6; labeled features include Pennsylvania, Maryland, West Virginia, Virginia, North Carolina, South Carolina, and the Atlantic Ocean; north arrow and scale bar in kilometers]


MODIS LULC In Climate Divisions

[Figure: two panels, Maryland CD6 and North Carolina CD3]


3. Spatial Sampling Procedures
• Conventionally in statistics, samples are drawn randomly from a larger population
• If we’re interested in sampling some phenomenon in terms of its location, we need some scheme by which to select locations
• There are any number of these sorts of schemes, for example:
• Simple random (e.g. transect start points)
• Stratified spatial (e.g. soil moisture points)
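A stratified scheme can be sketched as below. This assumes, for illustration only, that the strata are integer TMI classes and that a fixed number of cells is drawn at random from each; the raster cells are hypothetical, and the actual procedure used for the soil moisture points may well have differed.

```python
import random

def stratified_sample(cells, n_per_class, seed=0):
    """Pick up to n_per_class cells at random from each integer TMI class."""
    rng = random.Random(seed)
    strata = {}
    for cell in cells:
        strata.setdefault(int(cell["tmi"]), []).append(cell)
    sample = []
    for tmi_class in sorted(strata):
        members = strata[tmi_class]
        sample.extend(rng.sample(members, min(n_per_class, len(members))))
    return sample

# Hypothetical raster cells: row, column, and TMI value.
rng = random.Random(1)
cells = [{"row": r, "col": c, "tmi": rng.uniform(4.0, 14.0)}
         for r in range(20) for c in range(20)]
picked = stratified_sample(cells, n_per_class=3)
print(len(picked), sorted({int(p["tmi"]) for p in picked}))
```

Compared with simple random sampling, this guarantees that rare classes are represented in the sample.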


Transect Placement

Software selects a random starting position for each transect, applying criteria

Software assigns a random direction to each transect
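Random transect placement might look something like the following sketch. The rectangular extent, the function name `place_transect`, and the acceptance criterion (the whole transect must stay inside the extent) are all assumptions for illustration; the slides do not say which criteria the software applied. The 100 m length matches the transects described for Glyndon.

```python
import math
import random

def place_transect(xmin, ymin, xmax, ymax, length_m=100.0, rng=None):
    """Return (start, end) of a transect with random start and direction."""
    rng = rng or random.Random()
    while True:
        x0, y0 = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        bearing = rng.uniform(0.0, 2.0 * math.pi)   # radians, clockwise from north
        x1 = x0 + length_m * math.sin(bearing)
        y1 = y0 + length_m * math.cos(bearing)
        # Hypothetical criterion: the whole transect must stay inside the extent.
        if xmin <= x1 <= xmax and ymin <= y1 <= ymax:
            return (x0, y0), (x1, y1)

rng = random.Random(42)
transects = [place_transect(0, 0, 1000, 1000, rng=rng) for _ in range(3)]
for start, end in transects:
    print(start, end)
```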


Pond Branch Catchment – Control
Color Infrared Digital Orthophotography


Pond Branch Catchment
Stratified TMI Sampling

[Figure: "Pond Branch TMI Histogram". x-axis ticks 4 to 16; y-axis: Percent of cells in catchment (0 to 48)]
[Figure: "TMI Values at Soil Moisture Sampling Locations using 11.25m PG DEM". x-axis ticks 4 to 14]


Pond Branch Glyndon


4. Spatial Autocorrelation
• This refers to the fact that the value of a variable at one point in space is often related to the value of that same variable at a nearby location (i.e. Tobler’s Law in action!)

• On the one hand, this causes some difficulties when it comes to asserting the independence of samples taken nearby one another

• On the other hand, it allows us to assess the degree of organization in spatial patterns by measuring their autocorrelation


4. Spatial Autocorrelation
• There are a number of measures of spatial autocorrelation (e.g. the Moran Coefficient, the Geary Ratio) that express the spatial structure of a pattern as a single value. The Moran Coefficient ranges from roughly –1 to 1:
• 1 ~ similar values tend to cluster
• -1 ~ dissimilar values tend to cluster
• 0 ~ a random scattering of values
• The Geary Ratio uses a different scale, from 0 to roughly 2, where 1 indicates no autocorrelation and values below 1 indicate that similar values cluster

• These express the relationship between values of a single variable due to the geographical arrangement of the sampled data
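To make the Moran Coefficient concrete, here is a minimal sketch for a small raster. It assumes rook adjacency (cells sharing an edge are neighbors, each link weighted 1), which is only one of several possible neighbor definitions; the two grids are hypothetical patterns.

```python
def morans_i(grid):
    """Moran coefficient for a 2-D grid using rook (edge-sharing) neighbors."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    cross = 0.0      # sum over neighbor links of (x_i - mean) * (x_j - mean)
    weights = 0      # total number of (ordered) neighbor links, W
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weights += 1
    variance_sum = sum((v - mean) ** 2 for v in values)
    return (n / weights) * (cross / variance_sum)

# Hypothetical 4 x 4 patterns: two solid halves versus a checkerboard.
clustered = [[1, 1, 0, 0]] * 4
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
print(morans_i(clustered), morans_i(checkerboard))
```

The clustered grid gives a positive coefficient, while the checkerboard, where every neighbor differs, sits at the negative extreme.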
