2005 geog090 portrayal
David Tenenbaum – GEOG 090 – UNC-CH Spring 2005
Data Portrayal and Special Considerations for Spatial Data
• Portraying data drawn from the various scales of measurement requires different approaches that are appropriate to each scale
• Histograms are a useful method of portraying data from the higher scales of measurement, and can be used with absolute, relative, and cumulative frequencies
• Spatial data presents some special challenges and opportunities for quantitative analyses

Data Portrayal
• Before choosing descriptive statistics to summarize data, it is often useful to portray it in some fashion that allows you to get a sense of the dataset.
• Many portrayal approaches still involve reducing the volume of data (and information content), but if applied properly, they can help you see the interesting characteristics of the data
• For the various scales of measurement, there are different approaches that are applicable

Nominal Data
• From one of my dissertation transect samples, the frequencies of the types of segments are nominal data:

Class        Frequency   % of Total
Woody          105         32.92
Herbaceous     151         47.34
Water            1          0.31
Ground           6          1.88
Road            23          7.21
Pavement        22          6.90
Structures      11          3.45

The “% of Total” column normalizes the data, expressing it relative to the total (some caveats here)
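
The normalization in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic (count divided by the total, times 100); the function name `percent_of_total` is made up for illustration and is not from the lecture.

```python
# Segment-type counts copied from the transect sample table.
counts = {
    "Woody": 105,
    "Herbaceous": 151,
    "Water": 1,
    "Ground": 6,
    "Road": 23,
    "Pavement": 22,
    "Structures": 11,
}


def percent_of_total(freqs):
    """Express each absolute frequency as a percentage of the total count."""
    total = sum(freqs.values())
    return {name: 100.0 * n / total for name, n in freqs.items()}


percents = percent_of_total(counts)
for name, pct in percents.items():
    # Class name, absolute frequency, and percent of total to 2 decimals.
    print(f"{name:<11}{counts[name]:>5}{pct:>8.2f}")
```

Rounded to two decimals, the computed percentages agree with the “% of Total” column.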

Nominal Data
• The table above is a tabular presentation of the data. It has the advantage of giving the exact quantities, but can be ‘busy’, especially with larger tables

Nominal Data
• The frequency of nominal data can be well displayed by a bar graph

Class        Frequency
Woody          105
Herbaceous     151
Water            1
Ground           6
Road            23
Pavement        22
Structures      11

[Figure: bar graph “Segment Type Frequency”; x-axis: Segment Types (Woody, Herbaceous, Water, Ground, Road, Pavement, Structures); y-axis: Frequency (0 to 160)]

Nominal Data
• Once normalized, the values are well displayed in a pie chart, which emphasizes each category’s portion of the whole

Class        % of Total
Woody          32.92
Herbaceous     47.34
Water           0.31
Ground          1.88
Road            7.21
Pavement        6.90
Structures      3.45

[Figure: pie chart “Segment Types”; slice labels: Herbaceous 48%, Woody 33%, Road 7%, Pavement 7%, Structures 3%, Ground 2%, Water 0%]

Ordinal, Interval, & Ratio Data
• From my dissertation, the set of all topographic moisture index (TMI) values drawn from a raster data layer is an example of an interval dataset:

Ordinal, Interval, & Ratio Data
• Pond Branch is a 37.55 hectare watershed, which is equivalent to 375,500 m² (1 hectare = 10,000 m²)
• Using 11.25 m x 11.25 m pixels (126.5625 m²), there are ~2966 pixels from which we can draw TMI values
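
The pixel count follows directly from the two areas. A quick check of the arithmetic, as a sketch (variable names are illustrative only):

```python
HECTARE_M2 = 10_000  # 1 hectare = 10,000 square meters

watershed_area_m2 = 37.55 * HECTARE_M2   # Pond Branch watershed area
pixel_side_m = 11.25                     # raster cell size
pixel_area_m2 = pixel_side_m ** 2        # area covered by one pixel

# Number of whole pixels that fit in the watershed area.
n_pixels = watershed_area_m2 / pixel_area_m2
print(pixel_area_m2, int(n_pixels))
```

The division gives a little under 2967, so the slide's figure of about 2966 corresponds to counting whole pixels.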

Ordinal, Interval, & Ratio Data
• It would clearly be impractical to try to get a sense of the distribution of TMI values in Pond Branch by looking at a table of 2966 values
• We need a data reduction approach by which we can reduce the number of values to a manageable amount, which in turn lends itself to some sort of graphical display
• For ordinal, interval, and ratio scale data, we can make use of histograms for this purpose, and building a histogram involves following a multi-step procedure …

Building a Histogram
1. Develop an ungrouped frequency table
• That is, we build a table that counts the number of occurrences of each variable value from lowest to highest:

TMI Value   Ungrouped Freq.
4.16          2
4.17          4
4.18          0
…             …
13.71         1

• We could attempt to construct a bar chart from this table, but it would have too many bars to really be useful
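
A minimal sketch of step 1, assuming the values are recorded to two decimal places. The function name `ungrouped_frequency` and the short sample list are hypothetical; the sample is chosen only so that its first rows echo the table (4.16 twice, 4.17 four times, 4.18 never).

```python
from collections import Counter


def ungrouped_frequency(values, decimals=2):
    """Count occurrences of each distinct value from lowest to highest,
    including values inside the range that never occur (frequency 0)."""
    step = 10 ** -decimals
    counts = Counter(round(v, decimals) for v in values)
    lowest, highest = min(counts), max(counts)
    n_steps = round((highest - lowest) / step)
    table = []
    for i in range(n_steps + 1):
        value = round(lowest + i * step, decimals)
        table.append((value, counts.get(value, 0)))
    return table


# Hypothetical sample, not the actual Pond Branch data.
sample = [4.16, 4.16, 4.17, 4.17, 4.17, 4.17, 4.19]
for value, freq in ungrouped_frequency(sample):
    print(f"{value:.2f}  {freq}")
```

With thousands of observations spread between 4.16 and 13.71, a table like this has on the order of a thousand rows, which is why the next step groups the values.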

Building a Histogram
2. Construct a grouped frequency table
• This table has classes of values (in a sense we are reducing our data back to the ordinal scale for display purposes)
• The decision on how to perform the grouping is a subjective one, but there are some common guidelines:
  • Use class intervals with simple bounds and a common width (i.e. categories have the same range)
  • Adjacent intervals should not overlap (each datum should fit into one class)

Building a Histogram
3. Select an appropriate number of classes
• There are formulae available to make this decision objectively, but in reality it is a somewhat subjective decision
• If you have more observations, you usually need more classes, because when you put observations together in a class, you are treating them as having the same value for display purposes. There is a trade-off here between simplicity and loss of information (e.g. Pond Branch TMI: 2966 observations grouped into 10 classes)

Building a Histogram
3. Select an appropriate number of classes cont.

Class            Frequency
 4.00 -  4.99       120
 5.00 -  5.99       807
 6.00 -  6.99      1411
 7.00 -  7.99       407
 8.00 -  8.99        87
 9.00 -  9.99        33
10.00 - 10.99        17
11.00 - 11.99        22
12.00 - 12.99        43
13.00 - 13.99        19
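
As a rough illustration of how the grouped table feeds a plot, the sketch below converts the class frequencies to percentages of all cells and prints a crude text histogram. The text-bar rendering is an assumption for illustration, not how the lecture's figure was produced.

```python
# Grouped frequency table for Pond Branch TMI (classes and counts from the table).
class_freqs = [
    ("4.00 - 4.99", 120),
    ("5.00 - 5.99", 807),
    ("6.00 - 6.99", 1411),
    ("7.00 - 7.99", 407),
    ("8.00 - 8.99", 87),
    ("9.00 - 9.99", 33),
    ("10.00 - 10.99", 17),
    ("11.00 - 11.99", 22),
    ("12.00 - 12.99", 43),
    ("13.00 - 13.99", 19),
]

total = sum(freq for _, freq in class_freqs)

# Percent of cells in each class, plus a text bar with one '#' per percent.
class_percents = [(label, 100.0 * freq / total) for label, freq in class_freqs]
for label, pct in class_percents:
    print(f"{label:<15}{pct:6.2f}%  {'#' * round(pct)}")
```

The ten class counts sum to 2966, the number of pixels quoted for the watershed.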

Building a Histogram
4. Plot the frequencies of each class
• All that remains is to create the plot:

[Figure: “Pond Branch TMI Histogram”; x-axis: Topographic Moisture Index (4 to 16); y-axis: Percent of cells in catchment (0 to 48)]

Some Guidelines for Grouping
1. Generally we want 6 – 12 classes for the observations
2. Each class should be the same width: uneven classes lead to misleading displays
3. Classes must be mutually exclusive and collectively exhaustive (each and every observation must fit into one AND only one class)
4. Try to make use of a class interval (class size/width) that lets us see a pattern in the data (if there is one)
5. Watch out for outliers (radically different observations) as their inclusion can result in misleading plots

Basic Procedure for Histograms
1. Compute the range of observations (min. & max. value)
2. Choose an initial # of classes (most likely based on the range of values; try to find a number of classes that divides evenly into the range of values and is still within the 6 – 12 class guideline)
3. Compute the class interval = range / number of classes, rounding the result to the nearest convenient number (preferably an integer, adjusting as necessary)
4. Select a starting value for the classes that is less than or equal to the lowest value in the observations

Basic Procedure for Histograms
5. Adjust the range, width, and starting point if necessary
6. Compute the midpoint of each class (this is particularly useful if plotting a line plot histogram rather than the bar chart variety)
7. The ‘actual bounds’ will depend on the precision and accuracy of the data (e.g. class limits of 1-2, 3-4, etc. might have actual limits of 0.5-2.5, 2.5-4.5 etc. because we have rounded)
8. Plot the data
• At this point, you’re not likely to be required to create a histogram by hand very often (as it is easily done using software), but it’s good to know the theory
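
The numbered procedure can be sketched in code. This is one possible reading of the steps, not the lecture's own implementation: it assumes an integer class width (rounded up), a starting value at the integer at or below the minimum, and half-open classes [lower, upper) so that every observation lands in exactly one class. The function name `build_classes` and the sample values are hypothetical (only the minimum 4.16 and maximum 13.71 come from the slides).

```python
import math


def build_classes(values, n_classes=10):
    """Return (edges, midpoints, frequencies) for equal-width classes."""
    lowest, highest = min(values), max(values)       # step 1: range
    data_range = highest - lowest
    # Step 3: class interval, rounded up to a convenient integer width.
    width = max(1, math.ceil(data_range / n_classes))
    # Step 4: starting value at or below the lowest observation.
    start = math.floor(lowest)
    # Step 5: add classes if the highest value is not yet covered.
    while start + n_classes * width <= highest:
        n_classes += 1
    edges = [start + i * width for i in range(n_classes + 1)]
    # Step 6: midpoint of each class.
    midpoints = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]
    # Count observations per half-open class [lower, upper).
    freqs = [0] * n_classes
    for v in values:
        freqs[int((v - start) // width)] += 1
    return edges, midpoints, freqs


# Hypothetical sample spanning the TMI range quoted in the slides.
sample = [4.16, 4.17, 5.5, 6.2, 6.8, 6.9, 7.4, 9.0, 13.71]
edges, midpoints, freqs = build_classes(sample, n_classes=10)
print(edges)
print(midpoints)
print(freqs)
```

For this range the sketch arrives at ten classes of width 1 starting at 4, the same class structure as the Pond Branch grouped table.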

Frequencies & Distributions
• A histogram is one way to depict a frequency distribution. A loose definition of a frequency:
• The number of times a variable takes on a particular value (note that any variable has a frequency distribution)
• E.g. roll a pair of dice several times and record the resulting totals (constrained to being between 2 and 12), count the number of times any given total occurs (the frequency of that total), and take these all together to form a frequency distribution
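
The dice example is easy to simulate. The sketch below is illustrative only: the function name, the number of rolls, and the fixed random seed are arbitrary choices, so the particular counts it prints carry no significance.

```python
import random
from collections import Counter


def roll_pair_frequencies(n_rolls, seed=0):
    """Roll two dice n_rolls times and count how often each total occurs."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same counts
    totals = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))
    # Report every possible total from 2 to 12, including any that never occurred.
    return {total: totals.get(total, 0) for total in range(2, 13)}


freqs = roll_pair_frequencies(100)
for total, freq in freqs.items():
    print(f"{total:>2}  {freq}")
```

Taken together, the eleven counts are the frequency distribution of the dice totals for that set of rolls.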

Frequencies & Distributions
• Frequencies can be absolute (when the frequency provided is the actual count of the occurrences of a particular value) or they can be relative (when they are normalized by dividing the absolute frequency by the total number of observations to yield a relative frequency between 0 and 1)
• Relative frequencies are particularly useful if you want to compare distributions drawn from two different sources: while the number of observations from each source may be different, normalizing them allows them to be reasonably compared
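
A small sketch of why normalizing helps when comparing sources of different sizes. The two sets of counts are invented for illustration (100 observations versus 20); they are not data from the lecture.

```python
def relative_frequencies(abs_freqs):
    """Divide each absolute frequency by the total number of observations."""
    total = sum(abs_freqs.values())
    return {key: count / total for key, count in abs_freqs.items()}


# Hypothetical absolute frequencies from two sources of different size.
site_a = {"short": 30, "medium": 50, "long": 20}   # 100 observations
site_b = {"short": 6, "medium": 10, "long": 4}     # 20 observations

rel_a = relative_frequencies(site_a)
rel_b = relative_frequencies(site_b)
for key in site_a:
    print(f"{key:<7}{rel_a[key]:6.2f}{rel_b[key]:6.2f}")
```

The absolute counts differ by a factor of five, yet the relative frequencies show that the two hypothetical distributions have the same shape.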

Segment Length Distributions
[Figure: two panels, Glyndon and Upper Baismans Run; x-axis: Segment length (meters), 0 to 100; y-axis: Percent of all segments in class, 0 to 100; series: Woody, Herbaceous, Pavement, Roads, Structures]

Frequencies & Distributions
• In addition to the conventional frequencies described thus far, there is another type of frequency known as a cumulative frequency.
• Cumulative frequencies are calculated by starting with the lowest class of an observed variable and its frequency, and then adding the frequency of each successive class to the preceding sum.
• Cumulative frequencies are desirable when we want to know what proportion of observations have a value less than some threshold

Frequencies & Distributions
• For example, here’s some frequency data for the distance from streams of the woody vegetation class segments in Upper Baisman’s Run:

CLASS   MIN. VALUE   FREQ.   CUM FREQ.
1         0.00000    9.30      9.30
2        23.31757    7.73     17.03
3        46.63514    7.08     24.11
4        69.95271    5.71     29.82
5        93.27028    4.70     34.52
6       116.58785    3.67     38.19
7       139.90542    3.17     41.36
8       163.22300    2.73     44.09
9       186.54057    5.36     49.45
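
The CUM FREQ. column is a running sum of the FREQ. column. A short sketch that reproduces it, using only the values from the table (the FREQ. values are assumed here to be percentages, which the table itself does not state):

```python
from itertools import accumulate

# FREQ. column for classes 1 through 9, copied from the table.
freqs = [9.30, 7.73, 7.08, 5.71, 4.70, 3.67, 3.17, 2.73, 5.36]

# Running sum: each entry adds the current class frequency to the preceding sum.
# Rounding to 2 decimals removes floating-point noise from the additions.
cum_freqs = [round(c, 2) for c in accumulate(freqs)]

for class_id, (freq, cum) in enumerate(zip(freqs, cum_freqs), start=1):
    print(f"{class_id}  {freq:5.2f}  {cum:6.2f}")
```

Reading off a cumulative value answers the threshold question directly: for instance, the entry for class 4 is the share of observations falling in classes 1 through 4.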

Baismans Run Primary Class Distance from Stream Distributions
[Figure: two panels, Conventional and Cumulative (a.k.a. Ogive); x-axis: Distance to stream along D8 flow paths (meters), 0 to 1200; y-axis: Percent of all cells in class; series: Woody, Herbaceous, Pavement and Road, Structures, Ground]

Frequencies & Distributions
• By examining the shape of frequency distribution curves we can gain some sense of the distribution through some general characteristics:
1. Modality – Most distributions are unimodal, but we might also see bimodal or multi-modal distributions (if unimodal, we can also consider):
2. Symmetry – a.k.a. skewness of the distribution – Is it positively or negatively skewed?
3. Kurtosis – Describes the degree of peakedness or flatness of the curve
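
Skewness and kurtosis can also be put into numbers. The slides do not give formulas, so the sketch below assumes the common moment-based definitions (third and fourth central moments scaled by powers of the variance, with 3 subtracted so that a normal distribution has excess kurtosis near 0). The function name and both samples are hypothetical.

```python
def moment_skewness_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments of the data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]        # mirror image about its mean
right_tailed = [1, 1, 1, 2, 10]    # one large value stretches the upper tail

skew_sym, kurt_sym = moment_skewness_kurtosis(symmetric)
skew_right, kurt_right = moment_skewness_kurtosis(right_tailed)
print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```

Under these definitions the symmetric sample has skewness of zero, while the sample with a long upper tail is positively skewed.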

Frequencies & Distributions
• Some other useful descriptive terms which we apply to curves:
  • The tail of the curve
  • Inflection point(s)
  • The peak of the curve
  • Outliers
  • Concave-up or concave-down

Statistics, Space, and Independence
• Geography offers some special challenges to applying statistics, as well as some special opportunities – Tobler’s Law:
• “Everything is related to everything else, but near things are more related than distant things”
• This notion tells us something useful about relationships in space (which we can examine with spatial autocorrelation), but also presents a problem for parametric inferential statistics:
• Are samples truly independent?

Special Considerations with Spatial Data
• Geography: Acknowledges the relationships between nearby features’ characteristics based on their spatial proximity
• Statistics: Requires the independence of individual observations
• We have a special set of (spatial) issues:
1. The Modifiable Areal Unit Problem
2. Boundary Problems
3. Spatial Sampling Procedures
4. Spatial Autocorrelation

1. The Modifiable Areal Unit Problem
• Previous lecture: Individual vs. Grouped Data
• Grouped data was presented in the sense of aggregating observations in a ‘conventional’ fashion
• We are also concerned with the aggregation of observations in a spatial sense, because we often collect and report data based on geographic zones
• BUT what logic do we use when delineating the zones for collection and analysis?

1. The Modifiable Areal Unit Problem
• Rogerson presents a figure (Fig 1.7, p. 14) from Fotheringham and Rogerson (1993) that illustrates an example that assesses migration data using two zoning schemes:

1. The Modifiable Areal Unit Problem
• The previous example emphasized one aspect of the MAUP, which is concerned with the placement of zonal boundaries when we have some idea of the desirable size of the areas of analysis
• A related aspect of the problem concerns the scale of analysis: What size areas do we want to use (or should we use)? Recall the nature of the ecological fallacy, which we discussed in the previous lecture.

Comparing Soil Moisture and TMI
[Figure: plot “Pond Branch - 6/26/02 - Average”; x-axis: TMI (4 to 13); y-axis: Vol. Soil Moisture (V/V), 0.0 to 0.6; diagram labels: Sites, Theta, TMI, Compare]

Digital Elevation Models Resolutions
• Interpolate DEMs from photogrammetric and LIDAR spot elevations at a range of resolutions:
  • 0.5 m to 5 m DEMs in 0.5 m increments (e.g. 0.5m, 1m, 1.5m, 2m, 2.5m, 3m, 3.5m, 4m etc.)
  • 5 m to 30 m DEMs in 1.25 m increments (e.g. 5m, 6.25m, 7.5m, 8.75m, 10m, 11.25m etc.)
• Derive Topographic Moisture Index Values at the full range of scales
• Perform the correlation comparisons at a range of scales in order to detect critical scales
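
The list of cell sizes and the correlation step can be sketched as follows. The slides do not say which correlation measure was used, so Pearson's r is an assumption here, as are the function names. Interpolating DEMs and deriving TMI at each cell size requires terrain-analysis software and is not shown.

```python
import math


def dem_cell_sizes():
    """Cell sizes in meters: 0.5 to 5 in 0.5 steps, then 5 to 30 in 1.25 steps."""
    fine = [0.5 * i for i in range(1, 11)]         # 0.5, 1.0, ..., 5.0
    coarse = [5.0 + 1.25 * i for i in range(21)]   # 5.0, 6.25, ..., 30.0
    return sorted(set(fine + coarse))              # 5.0 appears in both lists


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


sizes = dem_cell_sizes()
print(len(sizes), sizes[:4], sizes[-3:])

# At each cell size, one would correlate the TMI values at the sampling sites
# with the measured soil moisture; here only the function itself is exercised.
print(pearson_r([1, 2, 3], [2, 4, 6]))
```

Plotting the resulting correlation against cell size is what allows critical scales to be spotted.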

[Figure: four panels of Correlation (-0.4 to 1.0) versus Cell Size (metres, 0 to 30); columns: Glyndon and Pond Branch; rows: LIDAR and Photogram.]

2. Boundary Problems
• Anytime we define a study area for the collection of data, we are making some assumptions about the extent of that area being relevant to the question at hand, and anything outside of it being irrelevant
• BUT what if some factor that is influencing the phenomenon of interest is located just outside the study area, OR perhaps our chosen area is not homogeneous and includes something other than our populations of interest?

2. Boundary Problems
• The basis of boundaries can be rational and less vague (e.g. watershed boundaries … although multiple methods can lead to differing boundaries) or rather arbitrary (e.g. based on political units that don’t necessarily organize the phenomena of interest).
• One suggestion that Rogerson provides is buffering around a study area in order to provide some ‘margin of error’ for the inclusion of relevant information

Glyndon Catchment – Urbanizing
[Figure: Color Infrared Digital Orthophotography]

Glyndon Sampling
• 3 Samples, 100 meters/ha, 100 meter long transects

[Figure: maps of Glyndon Sample 1, Glyndon Sample 2, and Glyndon Sample 3, with north arrow and scale bar in kilometers]

Study Climate Divisions
[Figure: map showing North Carolina Climate Division 3 and Maryland Climate Division 6; labeled: Pennsylvania, Maryland, West Virginia, Virginia, North Carolina, South Carolina, Atlantic Ocean; north arrow and scale bar in kilometers]

MODIS LULC In Climate Divisions
[Figure: two map panels, Maryland CD6 and North Carolina CD3]

3. Spatial Sampling Procedures
• Conventionally in statistics, samples are drawn randomly from a larger population
• If we’re interested in sampling some phenomenon in terms of its location, we need some scheme by which to select locations
• There are any number of these sorts of schemes, for example:
  • Simple random (e.g. transect start points)
  • Stratified spatial (e.g. soil moisture points)
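
The two schemes named above can be sketched generically. This is not the software used in the dissertation: the function names, the rectangular study area, the made-up TMI values, and the choice of integer TMI classes as strata are all assumptions for illustration.

```python
import random


def simple_random_points(n, xmin, ymin, xmax, ymax, seed=0):
    """Simple random sampling: n locations drawn uniformly within a bounding box."""
    rng = random.Random(seed)
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n)]


def stratified_sample(cells, n_per_stratum, stratum_of, seed=0):
    """Stratified sampling: group cells by stratum, then draw randomly within each."""
    rng = random.Random(seed)
    strata = {}
    for cell in cells:
        strata.setdefault(stratum_of(cell), []).append(cell)
    sample = {}
    for key in sorted(strata):
        members = strata[key]
        # Never ask for more cells than the stratum contains.
        sample[key] = rng.sample(members, min(n_per_stratum, len(members)))
    return sample


# Hypothetical transect start points in a 100 m x 100 m box.
points = simple_random_points(5, 0.0, 0.0, 100.0, 100.0)

# Hypothetical grid cells as (x, y, tmi); the TMI values here are invented.
cells = [(x, y, 4.0 + (x + y) % 6) for x in range(10) for y in range(10)]
picked = stratified_sample(cells, 2, stratum_of=lambda cell: int(cell[2]))

print(points)
for tmi_class, chosen in picked.items():
    print(tmi_class, chosen)
```

Stratifying guarantees that every class is represented in the sample, including classes that cover only a small share of the area and that a simple random draw could easily miss.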

Transect Placement
• Software selects a random starting position for each transect, applying criteria
• Software assigns a random direction to each transect

Pond Branch Catchment – Control
[Figure: Color Infrared Digital Orthophotography]

Pond Branch Catchment
Stratified TMI Sampling

[Figure: “Pond Branch TMI Histogram”; x-axis: Topographic Moisture Index (4 to 16); y-axis: Percent of cells in catchment (0 to 48)]
[Figure: “TMI Values at Soil Moisture Sampling Locations using 11.25m PG DEM”; x-axis: Topographic Moisture Index (4 to 14); series: Pond Branch, Glyndon]

4. Spatial Autocorrelation
• This refers to the fact that the value of a variable at one point in space is often related to the value of that same variable at a nearby location (i.e. Tobler’s Law in action!)
• On the one hand, this causes some difficulties when it comes to asserting the independence of samples taken nearby one another
• On the other hand, it allows us to assess the degree of organization in spatial patterns by measuring their autocorrelation

4. Spatial Autocorrelation
• There are a number of measures of spatial autocorrelation (e.g. the Moran Coefficient, the Geary Ratio) that express the spatial structure of a pattern as a single value. The Moran Coefficient ranges from roughly –1 to 1:
  • 1 ~ similar values tend to cluster
  • -1 ~ dissimilar values tend to cluster
  • 0 ~ a random scattering of values
• The Geary Ratio is scaled differently: it ranges from 0 to roughly 2, with values near 1 indicating a random scattering and values below 1 indicating that similar values tend to cluster
• These express the relationship between values of a single variable due to the geographical arrangement of the sampled data
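
A minimal sketch of the Moran Coefficient on a small raster. The slides give no formula, so this assumes the standard form of Moran's I with binary weights between cells that share an edge (rook adjacency). The function name `morans_i` and both test grids are illustrative.

```python
def morans_i(grid):
    """Moran's I for a 2-D grid, with weight 1 between edge-sharing cells."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    dev = [[value - mean for value in row] for row in grid]

    cross_products = 0.0   # sum over neighbor pairs of dev_i * dev_j
    weight_sum = 0         # total number of (directed) neighbor pairs
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross_products += dev[r][c] * dev[rr][cc]
                    weight_sum += 1

    sum_sq_dev = sum(d * d for row in dev for d in row)
    return (n / weight_sum) * (cross_products / sum_sq_dev)


# Dissimilar values always adjacent: a checkerboard.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
# Similar values clustered: left half 0, right half 1.
clustered = [[0, 0, 1, 1] for _ in range(4)]

i_checker = morans_i(checkerboard)
i_clustered = morans_i(clustered)
print(i_checker, i_clustered)
```

The checkerboard sits at the negative extreme and the two-block pattern comes out clearly positive, in line with the interpretation of the Moran Coefficient given above.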