2005 geog090 portrayal

Page 1: 2005 GEOG090 Portrayal

David Tenenbaum – GEOG 090 – UNC-CH Spring 2005

Data Portrayal and Special Considerations for Spatial Data

• Portraying data drawn from the various scales of measurement requires different approaches that are appropriate to their scales
• Histograms are a useful method of portraying data from the higher scales of measurement, and can be used with absolute, relative, and cumulative frequencies
• Spatial data presents some special challenges and opportunities for quantitative analyses

Page 2: 2005 GEOG090 Portrayal

Data Portrayal
• Before choosing descriptive statistics to summarize data, it is often useful to portray it in some fashion that allows you to get a sense of the dataset.
• Many portrayal approaches still involve reducing the volume of data (and information content), but if applied properly, they can help you see the interesting characteristics of data
• For the various scales of measurement, there are different approaches that are applicable

Page 3: 2005 GEOG090 Portrayal

Nominal Data
• From one of my dissertation transect samples, the frequencies of the segment types are nominal data:

Class         Frequency   % of Total
Woody               105        32.92
Herbaceous          151        47.34
Water                 1         0.31
Ground                6         1.88
Road                 23         7.21
Pavement             22         6.90
Structures           11         3.45

% of Total: normalizing the data, expressing it relative to the total (some caveats here)
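The normalization in the "% of Total" column can be sketched as follows. This is a minimal illustration using the counts from the table; the function name `percent_of_total` is an assumption made for the example, not something from the lecture.

```python
# Segment-type counts copied from the nominal data table.
frequency = {
    "Woody": 105, "Herbaceous": 151, "Water": 1, "Ground": 6,
    "Road": 23, "Pavement": 22, "Structures": 11,
}

def percent_of_total(counts):
    """Normalize absolute counts to percentages of the overall total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

percent = percent_of_total(frequency)
for name, value in percent.items():
    print(f"{name:<12}{frequency[name]:>5}{value:>8.2f}")
```

Rounded to two decimals, the computed percentages should agree with the table (the counts sum to 319 segments).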

Page 4: 2005 GEOG090 Portrayal

Nominal Data

Class         Frequency   % of Total
Woody               105        32.92
Herbaceous          151        47.34
Water                 1         0.31
Ground                6         1.88
Road                 23         7.21
Pavement             22         6.90
Structures           11         3.45

• This is a tabular presentation of data: it has the advantage of giving the exact quantities, but can be ‘busy’, especially with larger tables

Page 5: 2005 GEOG090 Portrayal

Nominal Data

Class         Frequency
Woody               105
Herbaceous          151
Water                 1
Ground                6
Road                 23
Pavement             22
Structures           11

• The frequency of nominal data can be well displayed by a bar graph

[Figure: bar graph "Segment Type Frequency"; x-axis: Segment Types; y-axis: Frequency (0 to 160)]

Page 6: 2005 GEOG090 Portrayal

Nominal Data

Class         % of Total
Woody              32.92
Herbaceous         47.34
Water               0.31
Ground              1.88
Road                7.21
Pavement            6.90
Structures          3.45

• Once normalized, the values are well displayed in a pie chart, which emphasizes each category’s portion of the whole

[Figure: pie chart "Segment Types", one slice per class labeled with its share of the total]

Page 7: 2005 GEOG090 Portrayal

Ordinal, Interval, & Ratio Data
• From my dissertation, the set of all topographic moisture index values drawn from a raster data layer is an example of an interval dataset:

Page 8: 2005 GEOG090 Portrayal

Ordinal, Interval, & Ratio Data
• Pond Branch is a 37.55 hectare watershed, which is equivalent to 375,500 m² (1 hectare = 10,000 m²)
• Using 11.25 m x 11.25 m pixels (126.5625 m²), there are ~2966 pixels from which we can draw TMI values
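The arithmetic behind the pixel count can be checked with a short sketch. The area and cell size come from the slide; counting only whole cells (flooring) is an assumption about how the ~2966 figure arises, since the plain division gives roughly 2966.9.

```python
import math

HECTARE_M2 = 10_000          # 1 hectare = 10,000 m²
watershed_ha = 37.55         # Pond Branch area from the slide
cell_size_m = 11.25          # raster cell edge length from the slide

watershed_m2 = watershed_ha * HECTARE_M2
cell_area_m2 = cell_size_m ** 2
# Assumption: only whole cells are counted.
whole_cells = math.floor(watershed_m2 / cell_area_m2)

print(f"{watershed_m2:.0f} m² / {cell_area_m2} m² per cell = {whole_cells} whole cells")
```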

Page 9: 2005 GEOG090 Portrayal

Ordinal, Interval, & Ratio Data
• It would clearly be impractical to try to get a sense of the distribution of TMI values in Pond Branch by looking at a table of 2966 values
• We need a data reduction approach by which we can reduce the number of values to a manageable amount, which in turn lends itself to some sort of graphical display
• For ordinal, interval, and ratio scale data, we can make use of histograms for this purpose. Building a histogram involves following a multi-step procedure …

Page 10: 2005 GEOG090 Portrayal

Building a Histogram
1. Developing an ungrouped frequency table
• That is, we build a table that counts the number of occurrences of each variable value from lowest to highest:

TMI Value   Ungrouped Freq.
4.16        2
4.17        4
4.18        0
…           …
13.71       1

• We could attempt to construct a bar chart from this table, but it would have too many bars to really be useful
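An ungrouped frequency table of this kind can be sketched with a counter. The readings below are hypothetical stand-ins, not the Pond Branch data, and the function name is an assumption for the example.

```python
from collections import Counter

# Hypothetical TMI readings (not the Pond Branch data), rounded to 2 decimals.
tmi_values = [4.16, 4.17, 4.16, 4.17, 4.19, 4.17, 4.17, 13.71]

def ungrouped_frequency(values):
    """Return (value, count) pairs sorted from lowest to highest value."""
    counts = Counter(values)
    return sorted(counts.items())

for value, count in ungrouped_frequency(tmi_values):
    print(f"{value:>6.2f}  {count}")
```

Note that this sketch only lists values that actually occur; a row with a zero count, such as 4.18 in the table, would have to be added by enumerating every possible value.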

Page 11: 2005 GEOG090 Portrayal

Building a Histogram
2. Construct a grouped frequency table
• This table has classes of values (in a sense we are reducing our data back to the ordinal scale for display purposes)
• The decision on how to perform the grouping is a subjective one, but there are some common guidelines:
  • Use class intervals with simple bounds and a common width (i.e. categories have the same range)
  • Adjacent intervals should not overlap (each datum should fit into one class)

Page 12: 2005 GEOG090 Portrayal

Building a Histogram
3. Select an appropriate number of classes
• There are formulae available to make this decision objectively, but in reality it is a somewhat subjective decision
• If you have more observations, you usually need more classes, because when you put observations together in a class, you are considering them to have the same value for display purposes. There is a trade-off here between simplicity and loss of information (e.g. Pond Branch TMI: 2966 observations grouped into 10 classes)
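As one illustration of such a formula, Sturges' rule suggests ceil(log2(n)) + 1 classes. The slide does not name a specific formula, so the choice of Sturges' rule here is an assumption made purely for the example.

```python
import math

def sturges_classes(n_observations):
    """Suggested number of classes by Sturges' rule: ceil(log2(n)) + 1."""
    if n_observations < 1:
        raise ValueError("need at least one observation")
    return math.ceil(math.log2(n_observations)) + 1

print(sturges_classes(2966))
```

For 2966 observations this rule suggests 13 classes, while the Pond Branch example uses 10, which shows how much room is left for judgment.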

Page 13: 2005 GEOG090 Portrayal

Building a Histogram
3. Select an appropriate number of classes cont.

Class            Frequency
 4.00 -  4.99          120
 5.00 -  5.99          807
 6.00 -  6.99         1411
 7.00 -  7.99          407
 8.00 -  8.99           87
 9.00 -  9.99           33
10.00 - 10.99           17
11.00 - 11.99           22
12.00 - 12.99           43
13.00 - 13.99           19
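Grouping values into equal-width classes like these can be sketched as below. The sample readings are hypothetical, and the function name and parameters (`start`, `width`, `n_classes`) are assumptions made for the illustration.

```python
import math

def grouped_frequency(values, start, width, n_classes):
    """Count values in n_classes equal-width classes [lower, lower + width)."""
    counts = [0] * n_classes
    for v in values:
        index = math.floor((v - start) / width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the class range")
        counts[index] += 1
    return counts

# Hypothetical TMI readings, not the Pond Branch data.
sample = [4.16, 4.17, 5.02, 5.99, 6.00, 6.50, 13.71]
counts = grouped_frequency(sample, start=4.0, width=1.0, n_classes=10)
for i, count in enumerate(counts):
    lower = 4.0 + i
    print(f"{lower:5.2f} - {lower + 0.99:5.2f}  {count}")
```

Because each value maps to exactly one index, the classes are mutually exclusive, and the range check flags any value the classes fail to cover.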

Page 14: 2005 GEOG090 Portrayal

Building a Histogram
4. Plot the frequencies of each class
• All that remains is to create the plot:

[Figure: "Pond Branch TMI Histogram"; x-axis: Topographic Moisture Index (4 to 16); y-axis: Percent of cells in catchment (0 to 48)]
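A rough text version of this plot can be produced from the grouped frequency table. This is only a sketch: the frequencies are copied from the table, and converting them to percent of cells is an assumption based on the plot's y-axis label.

```python
# Grouped frequencies copied from the Pond Branch TMI table.
class_lower = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
frequency = [120, 807, 1411, 407, 87, 33, 17, 22, 43, 19]

total = sum(frequency)
percent = [100.0 * f / total for f in frequency]

# One '#' per whole percent gives a crude bar for each class.
for lower, pct in zip(class_lower, percent):
    print(f"{lower:>2}-{lower + 1:<2} {pct:5.1f}% {'#' * int(pct)}")
```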

Page 15: 2005 GEOG090 Portrayal

Some Guidelines for Grouping
1. Generally we want 6 to 12 classes for the observations
2. Each class should be the same width: uneven classes lead to misleading displays
3. Classes must be mutually exclusive and collectively exhaustive (each and every observation must fit into one AND only one class)
4. Try to make use of a class interval (class size/width) that lets us see a pattern in the data (if there is one)
5. Watch out for outliers (radically different observations) as their inclusion can result in misleading plots

Page 16: 2005 GEOG090 Portrayal

Basic Procedure for Histograms
1. Compute the range of observations (min. & max. value)
2. Choose an initial # of classes (most likely based on the range of values; try to find a number of classes that divides evenly into the range of values and is still within the 6 to 12 class guideline)
3. Compute the class interval = range / number of classes, then round the precise result to the nearest convenient number (preferably an integer, adjusting as necessary)
4. Select a starting value for the classes that is less than or equal to the lowest value in the observations

Page 17: 2005 GEOG090 Portrayal

Basic Procedure for Histograms
5. Adjust the range, width, and starting point if necessary
6. Compute the midpoint of each class (this is particularly useful if plotting a line plot histogram rather than the bar chart variety)
7. The ‘actual bounds’ will depend on the precision and accuracy of the data (e.g. class limits of 1-2, 3-4, etc. might have actual limits of 0.5-2.5, 2.5-4.5, etc. because we have rounded)
8. Plot the data
• At this point, you’re not likely to be required to create a histogram by hand very often (as it is easily done using software), but it’s good to know the theory
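The first six steps of the procedure can be sketched in a few lines. This is a minimal illustration, assuming a whole-number class width and using the lowest and highest TMI values from the ungrouped table (4.16 and 13.71); the function name is hypothetical.

```python
def class_limits(minimum, maximum, n_classes):
    """Steps 1-6: range, interval, starting value, then bounds and midpoints."""
    value_range = maximum - minimum                      # step 1
    interval = max(1, round(value_range / n_classes))    # step 3, whole-number width
    start = (minimum // interval) * interval             # step 4, start <= minimum
    classes = []
    for i in range(n_classes):
        lower = start + i * interval
        upper = lower + interval
        classes.append((lower, upper, (lower + upper) / 2))  # step 6 midpoint
    if classes[-1][1] <= maximum:                        # step 5 would be needed
        raise ValueError("classes do not cover the maximum; adjust and retry")
    return interval, classes

interval, classes = class_limits(4.16, 13.71, n_classes=10)
for lower, upper, midpoint in classes:
    print(f"{lower:5.2f} to {upper:5.2f}  midpoint {midpoint:5.2f}")
```

With these inputs the precise interval of 0.955 rounds to 1 and the classes start at 4, matching the grouping used for the Pond Branch histogram.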

Page 18: 2005 GEOG090 Portrayal

Frequencies & Distributions
• A histogram is one way to depict a frequency distribution. A loose definition of a frequency:
• The number of times a variable takes on a particular value (note that any variable has a frequency distribution)
• E.g. roll a pair of dice several times and record the resulting values (constrained to being between 2 and 12), count the number of times any given value occurs (the frequency of that value), and take these all together to form a frequency distribution
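The dice example can be simulated directly. This is a sketch only: the number of rolls (360) and the fixed seed are arbitrary assumptions chosen so the run is repeatable.

```python
import random
from collections import Counter

def roll_pair_frequencies(n_rolls, seed=0):
    """Roll two six-sided dice n_rolls times and count each total (2 to 12)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    totals = (rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))
    counts = Counter(totals)
    # Report every possible total, including any that never occurred.
    return {total: counts.get(total, 0) for total in range(2, 13)}

frequencies = roll_pair_frequencies(360)
for total, count in frequencies.items():
    print(f"{total:>2}  {count:>3}  {count / 360:.3f}")
```

The second column is the count of each total and the third is that count divided by the number of rolls.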

Page 19: 2005 GEOG090 Portrayal

Frequencies & Distributions
• Frequencies can be absolute (when the frequency provided is the actual count of the occurrences of that particular value) or they can be relative (when they are normalized by dividing the absolute frequency by the total number of observations to yield a relative frequency between 0 and 1)
• Relative frequencies are particularly useful if you want to compare distributions drawn from two different sources: while the number of observations from each source may be different, normalizing them allows them to be reasonably compared

Page 20: 2005 GEOG090 Portrayal

Segment Length Distributions

[Figure: two panels, Glyndon and Upper Baismans Run; x-axis: Segment length (meters), 0 to 100; y-axis: Percent of all segments in class, 0 to 100; series: Woody, Herbaceous, Pavement, Roads, Structures]

Page 21: 2005 GEOG090 Portrayal

Frequencies & Distributions
• In addition to the conventional frequencies described thus far, there is another type of frequency known as a cumulative frequency.
• Cumulative frequencies are calculated by starting with the lowest class of an observed variable and its frequency, and then adding the frequency of each successive class to the preceding sum.
• Cumulative frequencies are desirable when we want to know what proportion of observations have a value less than some threshold

Page 22: 2005 GEOG090 Portrayal

Frequencies & Distributions
• For example, here’s some frequency data for the distance from streams of woody vegetation class segments in Upper Baisman’s Run:

CLASS   MIN. VALUE   FREQ.   CUM. FREQ.
1          0.00000    9.30         9.30
2         23.31757    7.73        17.03
3         46.63514    7.08        24.11
4         69.95271    5.71        29.82
5         93.27028    4.70        34.52
6        116.58785    3.67        38.19
7        139.90542    3.17        41.36
8        163.22300    2.73        44.09
9        186.54057    5.36        49.45
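The cumulative column is a running sum of the frequency column, which can be sketched as follows. The values are copied from the table; rounding to two decimals is an assumption made to match the table's precision.

```python
from itertools import accumulate

# FREQ. column copied from the table for classes 1 to 9.
freq = [9.30, 7.73, 7.08, 5.71, 4.70, 3.67, 3.17, 2.73, 5.36]

# Running total: each entry adds the current class to the preceding sum.
cum_freq = [round(total, 2) for total in accumulate(freq)]

for class_number, (f, cf) in enumerate(zip(freq, cum_freq), start=1):
    print(f"{class_number}  {f:5.2f}  {cf:6.2f}")
```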

Page 23: 2005 GEOG090 Portrayal

Baismans Run Primary Class Distance from Stream Distributions

[Figure: two panels, Conventional and Cumulative (a.k.a. Ogive); x-axis: Distance to stream along D8 flow paths (meters), 0 to 1200; y-axis: Percent of all cells in class; series: Woody, Herbaceous, Pavement and Road, Structures, Ground]

Page 24: 2005 GEOG090 Portrayal

Frequencies & Distributions
• By examining the shape of freq. distribution curves we can gain some sense of the distribution through some general characteristics:
1. Modality – Most distributions are unimodal, but we might also see bimodal or multi-modal dists. (if unimodal, we can also consider):
2. Symmetry – a.k.a. skewness of the distribution – Is it positively or negatively skewed?
3. Kurtosis – Describes the degree of peakedness or flatness of the curve
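Skewness and kurtosis can also be put into numbers. The sketch below uses the simple moment-based definitions (population form, no bias correction), which is one common convention and an assumption here; the two small datasets are made up for illustration.

```python
def shape_statistics(values):
    """Moment-based skewness and kurtosis (population form, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2   # a normal distribution gives about 3
    return skewness, kurtosis

symmetric = [1, 2, 3, 4, 5]        # made-up, symmetric about 3
right_tailed = [1, 1, 1, 2, 10]    # made-up, long tail to the right
print(shape_statistics(symmetric))
print(shape_statistics(right_tailed))
```

A skewness of zero indicates symmetry, a positive value indicates a longer tail toward high values (positive skew), and a negative value the opposite.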

Page 25: 2005 GEOG090 Portrayal

Frequencies & Distributions
• Some other useful descriptive terms which we apply to curves:
  • The tail of the curve
  • Inflection point(s)
  • The peak of the curve
  • Outliers
  • Concave-up or concave-down

Page 26: 2005 GEOG090 Portrayal

Statistics, Space, and Independence

•Geography offers some special challenges to applying statistics, as well as some special opportunities – Tobler’s Law:

•“Everything in space is related, but near things are more related”

•This notion tells us something useful about relationships in space (which we can examine with spatial autocorrelation), but also presents a problem for parametric inferential statistics:

•Are samples truly independent?

Page 27: 2005 GEOG090 Portrayal

Special Considerations with Spatial Data

• Geography: Acknowledges the relationships between nearby features’ characteristics based on their spatial proximity

• Statistics: Requires the independence of individual observations

• We have a special set of (spatial) issues:
1. The Modifiable Areal Unit Problem
2. Boundary Problems
3. Spatial Sampling Procedures
4. Spatial Autocorrelation

Page 28: 2005 GEOG090 Portrayal

1. The Modifiable Areal Unit Problem

• Previous lecture: Individual vs. Grouped Data

• Grouped data was presented in the sense of aggregating observations in a ‘conventional’ fashion

• We are also concerned with the aggregation of observations in a spatial sense, because we often collect and report data based on geographic zones

• BUT what logic do we use when delineating the zones for collection and analysis?

Page 29: 2005 GEOG090 Portrayal

1. The Modifiable Areal Unit Problem

• Rogerson presents a figure (Fig 1.7, p. 14) from Fotheringham and Rogerson (1993) that illustrates an example that assesses migration data using two zoning schemes:

Page 30: 2005 GEOG090 Portrayal

1. The Modifiable Areal Unit Problem

• The previous example emphasized one aspect of the MAUP, which is concerned with the placement of zonal boundaries, when we have some idea of the desirable size of the areas of analysis

• A related aspect of the problem concerns the scale of analysis: What size areas do we want to use (or should we use)? Recall the nature of the ecological fallacy, which we discussed in the previous lecture.

Page 31: 2005 GEOG090 Portrayal

Comparing Soil Moisture and TMI

[Figure: "Pond Branch - 6/26/02 - Average"; x-axis: TMI (4 to 13); y-axis: Vol. Soil Moisture (V/V), 0.0 to 0.6]

[Diagram labels: Sites, Theta, TMI, Compare]

Page 32: 2005 GEOG090 Portrayal

Digital Elevation Model Resolutions

• Interpolate DEMs from photogrammetric and LIDAR spot elevations at a range of resolutions:
  • 0.5 m to 5 m DEMs in 0.5 m increments (e.g. 0.5m, 1m, 1.5m, 2m, 2.5m, 3m, 3.5m, 4m etc.)
  • 5 m to 30 m DEMs in 1.25 m increments (e.g. 5m, 6.25m, 7.5m, 8.75m, 10m, 11.25m etc.)

• Derive Topographic Moisture Index values at the full range of scales

• Perform the correlation comparisons at a range of scales in order to detect critical scales
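The full list of cell sizes implied by the two increments can be generated with a short sketch. Treating 5 m as a single shared resolution between the two ranges is an assumption.

```python
# 0.5 m to 5 m in 0.5 m steps, then 5 m to 30 m in 1.25 m steps.
fine = [0.5 * i for i in range(1, 11)]
coarse = [5.0 + 1.25 * i for i in range(21)]

# 5 m appears in both ranges, so keep a single copy.
resolutions = sorted(set(fine + coarse))
print(len(resolutions), resolutions[:4], resolutions[-3:])
```

The 11.25 m cell size used for the Pond Branch pixel count is one of the resolutions in the coarser range.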

Page 33: 2005 GEOG090 Portrayal

Cell Size

[Figure: four panels of Correlation (-0.4 to 1.0) against Cell Size (metres, 0 to 30); columns: Glyndon, Pond Branch; rows: LIDAR, Photogram.]

Page 34: 2005 GEOG090 Portrayal

2. Boundary Problems
• Anytime we define a study area for the collection of data, we are making some assumptions about the extent of that area being relevant to the question at hand, and anything outside of it being irrelevant
• BUT what if some factor that is influencing the phenomenon of interest is located just outside the study area, OR perhaps our chosen area is not homogeneous and includes something other than our populations of interest?

Page 35: 2005 GEOG090 Portrayal

2. Boundary Problems
• The basis of boundaries can be rational and less vague (e.g. watershed boundaries … although multiple methods can lead to differing boundaries) or rather arbitrary (e.g. based on political units that don’t necessarily organize the phenomena of interest).
• One suggestion that Rogerson provides is buffering around a study area in order to provide some ‘margin of error’ for the inclusion of relevant information

Page 36: 2005 GEOG090 Portrayal

Glyndon Catchment – Urbanizing
Color Infrared Digital Orthophotography

Page 37: 2005 GEOG090 Portrayal

Glyndon Sampling

[Map: transects for Glyndon Sample 1, Glyndon Sample 2, and Glyndon Sample 3; north arrow; scale bar in kilometers]

• 3 Samples, 100 meters/ha, 100 meter long transects

Page 38: 2005 GEOG090 Portrayal

Study Climate Divisions

[Map: North Carolina Climate Division 3 and Maryland Climate Division 6; labeled: West Virginia, Virginia, North Carolina, South Carolina, Maryland, Pennsylvania, Atlantic Ocean; north arrow; scale bar in kilometers]

Page 39: 2005 GEOG090 Portrayal

MODIS LULC In Climate Divisions

Maryland CD6

North Carolina CD3

Page 40: 2005 GEOG090 Portrayal

3. Spatial Sampling Procedures
• Conventionally in statistics, samples are drawn randomly from a larger population
• If we’re interested in sampling some phenomenon in terms of its location, we need some scheme by which to select locations
• There are any number of these sorts of schemes, for example:
  • Simple random (e.g. transect start points)
  • Stratified spatial (e.g. soil moisture points)
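The two schemes can be contrasted with a small sketch. Everything here is an assumption made for illustration: a rectangular 1000 m x 1000 m study area, strata defined as a regular 4 x 4 grid with one point per cell, and the function names. The soil moisture points in this lecture appear to be stratified by TMI rather than by a grid.

```python
import random

def simple_random_points(n, width, height, rng):
    """Draw n locations uniformly at random over the whole study area."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

def stratified_points(rows, cols, width, height, rng):
    """Draw one random location inside each cell of a rows x cols grid."""
    cell_w, cell_h = width / cols, height / rows
    points = []
    for r in range(rows):
        for c in range(cols):
            x = rng.uniform(c * cell_w, (c + 1) * cell_w)
            y = rng.uniform(r * cell_h, (r + 1) * cell_h)
            points.append((x, y))
    return points

rng = random.Random(42)  # seeded for repeatability
random_pts = simple_random_points(16, 1000.0, 1000.0, rng)
strat_pts = stratified_points(4, 4, 1000.0, 1000.0, rng)
print(random_pts[:2], strat_pts[:2])
```

Simple random points can bunch together by chance, while the stratified scheme guarantees that every part of the area receives a sample.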

Page 41: 2005 GEOG090 Portrayal

Transect Placement

Software selects a random starting position for each transect, applying criteria

Software assigns a random direction to each transect

Page 42: 2005 GEOG090 Portrayal

Pond Branch Catchment – Control
Color Infrared Digital Orthophotography

Page 43: 2005 GEOG090 Portrayal

Pond Branch Catchment
Stratified TMI Sampling

[Figure: "Pond Branch TMI Histogram"; x-axis: Topographic Moisture Index (4 to 16); y-axis: Percent of cells in catchment (0 to 48)]

[Figure: "TMI Values at Soil Moisture Sampling Locations using 11.25m PG DEM"; x-axis: Topographic Moisture Index (4 to 14); series: Pond Branch, Glyndon]

Page 44: 2005 GEOG090 Portrayal

4. Spatial Autocorrelation
• This refers to the fact that the value of a variable at one point in space is often related to the value of that same variable at a nearby location (i.e. Tobler’s Law in action!)
• On the one hand, this causes some difficulties when it comes to asserting the independence of samples taken nearby one another
• On the other hand, it allows us to assess the degree of organization in spatial patterns by measuring their autocorrelation

Page 45: 2005 GEOG090 Portrayal

4. Spatial Autocorrelation
• There are a number of measures of spatial autocorrelation (e.g. the Moran Coefficient, the Geary Ratio) that express the spatial structure of a pattern as a single value. The Moran Coefficient ranges approximately from -1 to 1 (the Geary Ratio uses a different scale):
  • 1 ~ similar values tend to cluster
  • -1 ~ dissimilar values tend to cluster
  • 0 ~ a random scattering of values
• These express the relationship between values of a single variable due to the geographical arrangement of the sampled data
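The Moran Coefficient can be sketched for a small raster. This is a minimal illustration under stated assumptions: cells count as neighbours only when they share an edge (rook adjacency), every neighbour pair has weight 1, and the two 4 x 4 test grids are made up.

```python
def morans_i(grid):
    """Moran's I for a rectangular grid using rook (edge-sharing) neighbours."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    dev = [[v - mean for v in row] for row in grid]

    cross_products = 0.0   # sum of w_ij * dev_i * dev_j, with w_ij = 1 for neighbours
    weight_total = 0       # sum of all w_ij
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross_products += dev[r][c] * dev[rr][cc]
                    weight_total += 1

    variance_sum = sum(d * d for row in dev for d in row)
    return (n / weight_total) * (cross_products / variance_sum)

checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
clustered = [[0, 0, 1, 1] for _ in range(4)]
print(morans_i(checkerboard))   # every neighbour differs
print(morans_i(clustered))      # most neighbours match
```

The checkerboard, where every neighbour differs, sits at the negative end of the scale, while the grid split into two blocks gives a clearly positive value. A grid of identical values has no variance, so the coefficient is undefined for it.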