

DATA MINING RELATIONSHIPS AMONG URBAN SOCIOECONOMIC, LAND COVER, AND REMOTELY SENSED ECOLOGICAL DATA

Jeremy Mennis*, Carol Wessman, and Nancy Golubiewski**
*Department of Geography and **Department of Ecology and Evolutionary Biology, University of Colorado

Contact: Jeremy Mennis, Department of Geography, UCB 260, University of Colorado, Boulder, CO 80309, Phone: (303) 492-4794, Fax: (303) 492-7501, Email: [email protected]

Map legends (class ranges, lowest to highest):
• Residential Density (mean density of residential land in residential land, cells): 66 to 1510, 1511 to 1969, 1973 to 2315, 2318 to 2927
• Housing Year (median year structures built): 1939 to 1957, 1958 to 1971, 1972 to 1979, 1980 to 1997
• Population Density (people/m^2 in residential land): 322 to 2669, 2673 to 3348, 3389 to 4525, 4527 to 19810
• Elevation (mean elevation in residential land, m): 1506 to 1610, 1611 to 1637, 1638 to 1668, 1669 to 1817
• Distance to Highway (mean distance to limited access highways in residential land, m): 270 to 1551, 1556 to 3002, 3035 to 5716, 5732 to 26237
• Commercial Density (mean density of commercial land in residential land, cells): 10 to 167, 168 to 308, 309 to 520, 523 to 2191
• NDVI, tract level (mean NDVI in residential land): -0.12 to 0.15, 0.16 to 0.19, 0.20 to 0.22, 0.23 to 0.33
• NDVI, image: -0.55 to 0.00, 0.01 to 0.20, 0.21 to 0.40, 0.40 to 0.78
• Education (% with a high school diploma): 37 to 77, 78 to 89, 90 to 95, 96 to 99
• Land Use (classes): natural veg., commercial, residential, agriculture, water, other

Data: Sources and Preprocessing

• Vegetation: NDVI from July 27, 1999 Landsat 7 ETM+ image
• Land use: USGS (from aerial photography)
• Socioeconomic Status: 2000 U.S. Census
• Residential and Commercial Density: calculated by generating grids of the number of residential and commercial grid cells within 1 km of each cell, then calculating the tract mean
• Elevation: USGS
• Highways: ESRI
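The density calculation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' GIS workflow: the tiny grids, the circular window that includes the center cell, the radius of one cell (standing in for 1 km), and the function names focal_count and tract_mean are all assumptions made for the example. The tract mean is taken over residential cells only, following the note that the data represent only residential land.

```python
from collections import defaultdict

def focal_count(grid, radius_cells):
    """Count cells equal to 1 within radius_cells (Euclidean) of each cell."""
    rows, cols = len(grid), len(grid[0])
    # Offsets inside a circular window; the center cell is included (assumption).
    offsets = [(dr, dc)
               for dr in range(-radius_cells, radius_cells + 1)
               for dc in range(-radius_cells, radius_cells + 1)
               if dr * dr + dc * dc <= radius_cells * radius_cells]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = sum(
                grid[r + dr][c + dc]
                for dr, dc in offsets
                if 0 <= r + dr < rows and 0 <= c + dc < cols)
    return out

def tract_mean(density, tracts, mask):
    """Mean density per tract, using only cells where mask == 1."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r, row in enumerate(tracts):
        for c, tract_id in enumerate(row):
            if mask[r][c] == 1:
                sums[tract_id] += density[r][c]
                counts[tract_id] += 1
    return {t: sums[t] / counts[t] for t in sums}

# Hypothetical 4 x 4 land use grid: 1 = residential, 0 = anything else.
residential = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
# Hypothetical tract IDs for the same cells.
tracts = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
]

density = focal_count(residential, radius_cells=1)
means = tract_mean(density, tracts, residential)
print(means)
```

With real data the radius would be 1 km divided by the cell size, and the same two steps would be repeated with a commercial land use grid to obtain Commercial Density.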

Note that although colors are mapped to entire tracts, data represents only the residential land within each tract.

[Map labels: Denver, Boulder]

Methods

Spatial data mining techniques are exploratory methods for detecting patterns in very large spatial databases. We use spatial association rule mining and spatial on-line analytical processing (OLAP), as well as mapping and statistics.

Spatial Association Rule Mining seeks to discover associations among transactions encoded in a spatial database. An association rule takes the form A → B, where A and B are sets of predicates and either A or B contains a spatial relationship. Interesting rules are found by using metrics such as lift, which indicates how much more often than expected B occurs when paired with A.
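As a rough illustration of the lift metric, the sketch below uses the standard definition: lift(A → B) is the confidence of the rule, P(B given A), divided by the overall frequency P(B). A value above 1 means B occurs with A more often than expected if the two were independent. The eight toy "tracts" and their item labels are invented for the example and are not the poster's data; the function is not the Magnum Opus implementation.

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of item sets."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_b = sum(1 for t in transactions if consequent <= t)
    count_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if count_a == 0 or count_b == 0:
        return float("nan")  # rule is undefined on this data
    confidence = count_ab / count_a      # P(B given A)
    return confidence / (count_b / n)    # divided by P(B)

# Hypothetical tracts, each encoded as a set of categorical predicates.
toy_tracts = [
    {"HousingYear=low", "ResDen=low", "NDVI=low"},
    {"HousingYear=low", "ResDen=low", "NDVI=low"},
    {"HousingYear=low", "ResDen=high", "NDVI=high"},
    {"HousingYear=high", "ResDen=low", "NDVI=high"},
    {"HousingYear=high", "ResDen=high", "NDVI=high"},
    {"HousingYear=high", "ResDen=high", "NDVI=low"},
    {"HousingYear=low", "ResDen=high", "NDVI=high"},
    {"HousingYear=high", "ResDen=low", "NDVI=high"},
]

rule_lift = lift(toy_tracts,
                 antecedent={"HousingYear=low", "ResDen=low"},
                 consequent={"NDVI=low"})
print(round(rule_lift, 3))
```

In the toy data, NDVI is low in 3 of 8 tracts overall but in 2 of 2 tracts matching the antecedent, so the rule's lift is 1 / (3/8), or about 2.67.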

Magnum Opus Association Rule Mining Software

Microsoft SQL Server Relational Star Schema

Spatial On-Line Analytical Processing is an extension to the SQL GroupBy operation that exhaustively summarizes the value of a measurement variable contained in the fact table by all unique combinations of a set of categorical dimension variables contained in dimension tables. Here, we summarize NDVI by categorizations of the other variables, and export the results to GIS for mapping.
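The summarization described above resembles a data cube: a group-by repeated over every subset of the dimension variables. The sketch below is one way to express that idea in plain Python, assuming the dimension codes have already been joined onto the fact rows. The rows, the HousingYear_D and ResDen_D column names, and the cube_mean helper are hypothetical; the poster's own implementation used Microsoft SQL Server.

```python
from collections import defaultdict
from itertools import combinations

def cube_mean(rows, dimensions, measure):
    """Mean of `measure` for every combination of values of every subset
    of `dimensions` (the empty subset gives the grand mean)."""
    result = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            groups = defaultdict(list)
            for row in rows:
                key = tuple(row[d] for d in dims)
                groups[key].append(row[measure])
            for key, values in groups.items():
                result[(dims, key)] = sum(values) / len(values)
    return result

# Hypothetical fact rows with class codes 0 (lowest) to 3 (highest).
fact_table = [
    {"Tract_ID": 1, "ResDen_D": 0, "HousingYear_D": 0, "NDVI": 0.10},
    {"Tract_ID": 2, "ResDen_D": 0, "HousingYear_D": 0, "NDVI": 0.14},
    {"Tract_ID": 3, "ResDen_D": 3, "HousingYear_D": 0, "NDVI": 0.30},
    {"Tract_ID": 4, "ResDen_D": 3, "HousingYear_D": 3, "NDVI": 0.18},
]

cube = cube_mean(fact_table, ["ResDen_D", "HousingYear_D"], "NDVI")
for (dims, key), mean_ndvi in sorted(cube.items()):
    print(dims, key, round(mean_ndvi, 3))
```

Each cell of the result, keyed by a dimension subset and a value combination, could then be joined back to the tracts in that category for mapping in GIS.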

[Figure: relational star schema, with one fact table linked to four dimension tables]

Fact Table
Tract_ID   Education   Education_D   ….
1          73          2             …
2          58          1             …
3          82          3             …
…          …           …             …

Dimension Tables (Education_D, PopDen_D, Minority_D, and NDVI_D share the same layout; Education_D is shown)
Education_D   Level_2
0             0
1             0
2             1
3             1

Regression coefficients, Model 1 (a)

Variable     B            Std. Error   Beta    t        Sig.   Zero-order   Partial   Part    Tolerance   VIF
(Constant)   2.457        .344                 7.141    .000
PEDU         8.860E-04    .000         .209    5.159    .000   .391         .253      .162    .601        1.665
MYR          -1.32E-03    .000         -.331   -7.334   .000   -.128        -.348     -.230   .483        2.070
PDENRMU      -8.20E-06    .000         -.293   -7.977   .000   -.513        -.375     -.250   .726        1.378
MELEV        1.470E-04    .000         .124    3.273    .001   .275         .164      .103    .688        1.453
DLIMIT       2.404E-06    .000         .225    6.405    .000   .263         .309      .201    .797        1.255
MCMUDEN      -4.44E-05    .000         -.268   -5.784   .000   -.513        -.281     -.181   .459        2.178
RESDEN       2.297E-05    .000         .227    5.040    .000   .492         .247      .158    .482        2.073

a. Dependent Variable: MNDVI

Results

Statistics. Correlations indicate that the variables with the strongest relationships with NDVI are Population Density (negative relationship), Commercial Density (negative relationship), and Residential Density (positive relationship). In a multivariate context, Housing Year exerts the most influence when the influence of the other explanatory variables is accounted for, although its zero-order correlation is much lower than those of all of the other explanatory variables.
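The contrast drawn above can be checked directly against the regression table by ranking the explanatory variables twice: once by the magnitude of the standardized coefficient (Beta) and once by the magnitude of the zero-order correlation. The values below are copied from the table; pairing the variable codes with readable labels by row order is an assumption, and using Beta as the measure of "influence" is one reasonable reading rather than a statement of the authors' criterion.

```python
# (standardized Beta, zero-order correlation with NDVI), copied from the table.
coefficients = {
    "Education": (0.209, 0.391),
    "Housing Year": (-0.331, -0.128),
    "Population Den.": (-0.293, -0.513),
    "Elevation": (0.124, 0.275),
    "Dist. to Highway": (0.225, 0.263),
    "Commercial Den.": (-0.268, -0.513),
    "Residential Den.": (0.227, 0.492),
}

# Rank by absolute size, largest first.
by_beta = sorted(coefficients,
                 key=lambda name: abs(coefficients[name][0]), reverse=True)
by_zero_order = sorted(coefficients,
                       key=lambda name: abs(coefficients[name][1]), reverse=True)

print("Ranked by |Beta|:      ", by_beta)
print("Ranked by |zero-order|:", by_zero_order)
```

Housing Year ranks first by Beta and last by zero-order correlation, which is the pattern the paragraph describes.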

Spatial Association Rule Mining. Results suggest that residential NDVI is lowest in older, socioeconomically disadvantaged neighborhoods near commercial centers. Residential NDVI is highest in older neighborhoods with higher socioeconomic status. Residential NDVI is also highest in areas of residential concentration but sparse population, i.e., planned developments with large lots. Note that low Housing Year (older housing) appears in rules predicting both low and high residential NDVI, which explains why its zero-order correlation with NDVI is weak even though its multivariate influence is strong.

Spatial On-Line Analytical Processing. The maps at right show one OLAP result where mean NDVI is calculated for dimensions of Residential Density and Housing Year. Each tract is categorized as belonging to a unique combination of the dimensions (e.g. low Residential Density and high Housing Year). The mean for all tracts within each category is then calculated. Maps use the HSV color model to display the multidimensional data. Hue is mapped to Housing Year where yellow, orange, red, and purple map from lowest (oldest) to highest (most recent). Saturation is mapped to Residential Density where low (high) saturation represents low (high) Residential Density. Value maps to the NDVI value using a linear stretch between values of 105 and 255.
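A color assignment of this kind might be sketched as follows using Python's standard colorsys module. Several details are assumptions made for illustration only: the hue angles chosen for yellow, orange, red, and purple, the four saturation levels, the NDVI range used for the stretch, and the direction of the stretch (high NDVI mapped to the darker end, 105, based on the statement that darker areas have relatively high NDVI).

```python
import colorsys

# Hypothetical hue angles (degrees) for Housing Year classes 0-3:
# yellow, orange, red, purple, from oldest to most recent.
HUES_DEG = [60, 30, 0, 280]
# Hypothetical saturation levels for Residential Density classes 0-3.
SATURATIONS = [0.25, 0.5, 0.75, 1.0]

def ndvi_to_value(ndvi, ndvi_min, ndvi_max, out_lo=105, out_hi=255):
    """Linear stretch of NDVI onto [out_lo, out_hi]; high NDVI is darker."""
    if ndvi_max == ndvi_min:
        return float(out_hi)
    frac = (ndvi - ndvi_min) / (ndvi_max - ndvi_min)
    frac = min(1.0, max(0.0, frac))  # clamp values outside the range
    return out_hi - frac * (out_hi - out_lo)

def tract_color(housing_year_class, res_density_class, ndvi, ndvi_min, ndvi_max):
    """Return an (R, G, B) tuple on a 0-255 scale for one tract category."""
    h = HUES_DEG[housing_year_class] / 360.0
    s = SATURATIONS[res_density_class]
    v = ndvi_to_value(ndvi, ndvi_min, ndvi_max) / 255.0
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(255 * channel) for channel in (r, g, b))

# Most saturated red class at the high and low ends of a hypothetical NDVI range.
dark_red = tract_color(2, 3, 0.33, ndvi_min=0.12, ndvi_max=0.33)
bright_red = tract_color(2, 3, 0.12, ndvi_min=0.12, ndvi_max=0.33)
print(dark_red, bright_red)
```

Holding Value constant at its maximum instead of stretching it by NDVI would give the comparison map in which color reflects only the two dimensions.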

The map on the left shows the NDVI data mapped to tracts categorized by Residential Density and Housing Year. The map on the right maps the color value to the NDVI mean for the entire data set. Areas that are darker (lighter) in the map on the left have a relatively high (low) NDVI. Older, densely residential areas have high NDVI. Comparison of the color cubes shows that Residential Density distinguishes between high and low NDVI, but only between the areas of lowest Residential Density and the other classes. Likewise, Housing Year is important only in distinguishing the most recent residential development from other areas.

[Map legend: NDVI, from Low to High]

Sample of the Mined Rule Set

Rules predicting low NDVI:
• If Housing Year is low and Residential Density is low then NDVI is low (Lift = 4.8)
• If % Minority is high and Residential Density is low then NDVI is low (Lift = 4.4)
• If Elevation is low and Income is low then NDVI is low (Lift = 4.1)
• If Education is low and Distance to Commercial is low then NDVI is low (Lift = 3.3)

Rules predicting high NDVI:
• If Housing Year is low and % Minority is low then NDVI is high (Lift = 5.4)
• If Housing Year is low and Distance to Highway is high then NDVI is high (Lift = 5.0)
• If Population Den. is low and Residential Density is high then NDVI is high (Lift = 4.8)
• If Housing Value is high and Distance to Commercial is low then NDVI is high (Lift = 3.9)

Variable labels, in the row order of the regression table: (Constant), Education, Housing Year, Population Den., Elevation, Dist. to Highway, Commercial Den., Residential Den.

[Figure: two HSV color cubes, one with Value mapped to NDVI data and one without Value mapped to NDVI data. Axes: Res. Den. (Saturation) and Hous. Yr. (Hue); Value = NDVI.]

Conclusions

This research demonstrates that vegetation greenness in residential areas is a function of the age and type of development as well as socioeconomic status. Vegetation tends to be concentrated in older, densely residential developments that are far from commercial centers and highways and that contain primarily non-minority households with high educational attainment and income.

Spatial data mining and visualization, in combination with multivariate statistics, have proven to be useful tools for identifying land cover, socioeconomic, and ecological relationships that are complex and non-linear. GIS serves a key function as a data pre-processor and map display device.

Future research will address using more sophisticated metrics of ecological character and the application of similar techniques to identify patterns and relationships in time series data.

Objective and Motivation

Analyzing socioeconomic-vegetation relations in the context of urban growth contributes to an understanding of the role of urban regions in carbon cycling and global environmental change. This project investigates the relationships among socioeconomic character, land use, and vegetation in residential land in the Front Range of Colorado, a rapidly urbanizing region.