Spatial Data Analysis in Ecology and Agriculture Using R, by Richard E. Plant


  • K11009


    Assuming no prior knowledge of R, Spatial Data Analysis in Ecology and Agriculture Using R provides practical instruction on the use of the R programming language to analyze spatial data arising from research in ecology and agriculture. Written in terms of four data sets easily accessible online, this book guides the reader through the analysis of each data set, including setting research objectives, designing the sampling plan, data quality control, exploratory and confirmatory data analysis, and drawing scientific conclusions.

    Features
      • Describes the analysis of real data sets from start to conclusion
      • Includes an extensive set of examples of the use of R to construct graphs and maps
      • Provides background information on exploratory and graphical data analysis
      • Discusses the use of classification and regression trees in spatial data analysis
      • Covers alternative methods of data analysis including autoregression, mixed models, and Bayesian analysis

    Based on the author's spatial data analysis course at the University of California, Davis, the book is intended for classroom use or self-study by graduate students and researchers in ecology, geography, and agricultural science with an interest in the analysis of spatial data.

    Statistics


  • CRC Press
    Taylor & Francis Group
    6000 Broken Sound Parkway NW, Suite 300
    Boca Raton, FL 33487-2742

    © 2012 by Taylor & Francis Group, LLC
    CRC Press is an imprint of Taylor & Francis Group, an Informa business

    No claim to original U.S. Government works
    Version Date: 20120208

    International Standard Book Number-13: 978-1-4398-1914-2 (eBook - PDF)

    This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

    Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

    For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

    Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

    Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

  • To Kathie, Carolyn, and Suzanne, with all my love

  • Contents

    Preface
    Acknowledgments
    Author

    1. Working with Spatial Data
      1.1 Introduction
      1.2 Analysis of Spatial Data
        1.2.1 Types of Spatial Data
        1.2.2 Components of Spatial Data
        1.2.3 Topics Covered in the Text
      1.3 Data Sets Analyzed in This Book
        1.3.1 Data Set 1: Yellow-Billed Cuckoo Habitat
        1.3.2 Data Set 2: Environmental Characteristics of Oak Woodlands
        1.3.3 Data Set 3: Uruguayan Rice Farmers
        1.3.4 Data Set 4: Factors Underlying Yield in Two Fields
        1.3.5 Comparing the Data Sets
      1.4 Further Reading

    2. R Programming Environment
      2.1 Introduction
        2.1.1 Introduction to R
        2.1.2 Setting Up Yourself to Use This Book
      2.2 R Basics
      2.3 Programming Concepts
        2.3.1 Looping and Branching
        2.3.2 Functional Programming
      2.4 Handling Data in R
        2.4.1 Data Structures in R
        2.4.2 Basic Data Input and Output
        2.4.3 Spatial Data Structures
      2.5 Writing Functions in R
      2.6 Graphics in R
        2.6.1 Traditional Graphics in R: Attribute Data
        2.6.2 Traditional Graphics in R: Spatial Data
        2.6.3 Trellis Graphics in R: Attribute Data
        2.6.4 Trellis Graphics in R: Spatial Data
        2.6.5 Using Color in R
      2.7 Other Software Packages
      2.8 Further Reading
      Exercises

    3. Statistical Properties of Spatially Autocorrelated Data
      3.1 Introduction
      3.2 Components of a Spatial Random Process
        3.2.1 Spatial Trends in Data
        3.2.2 Stationarity
      3.3 Monte Carlo Simulation
      3.4 Review of Hypothesis and Significance Testing
      3.5 Modeling Spatial Autocorrelation
        3.5.1 Monte Carlo Simulation of Time Series
        3.5.2 Modeling Spatial Contiguity
        3.5.3 Modeling Spatial Association in R
      3.6 Application to Field Data
      3.7 Further Reading
      Exercises

    4. Measures of Spatial Autocorrelation
      4.1 Introduction
      4.2 Preliminary Considerations
        4.2.1 Measurement Scale
          4.2.1.1 Nominal Scale Data
          4.2.1.2 Ordinal Scale Data
          4.2.1.3 Interval Scale Data
          4.2.1.4 Ratio Scale Data
        4.2.2 Resampling and Randomization Assumptions
          4.2.2.1 Resampling Assumption
          4.2.2.2 Randomization Assumption
        4.2.3 Testing the Null Hypothesis
      4.3 Join-Count Statistics
      4.4 Moran's I and Geary's c
      4.5 Measures of Autocorrelation Structure
        4.5.1 Moran Correlogram
        4.5.2 Moran Scatterplot
        4.5.3 Local Measures of Autocorrelation
        4.5.4 Geographically Weighted Regression
      4.6 Measuring Autocorrelation of Spatially Continuous Data
        4.6.1 Variogram
        4.6.2 Covariogram and Correlogram
      4.7 Further Reading
      Exercises

    5. Sampling and Data Collection
      5.1 Introduction
      5.2 Preliminary Considerations
        5.2.1 Artificial Population
        5.2.2 Accuracy, Bias, Precision, and Variance
        5.2.3 Comparison Procedures
      5.3 Developing the Sampling Patterns
        5.3.1 Random Sampling
        5.3.2 Geographically Stratified Sampling
        5.3.3 Sampling on a Regular Grid
        5.3.4 Stratification Based on a Covariate
        5.3.5 Cluster Sampling
      5.4 Methods for Variogram Estimation
      5.5 Estimating the Sample Size
      5.6 Sampling for Thematic Mapping
      5.7 Design-Based and Model-Based Sampling
      5.8 Further Reading
      Exercises

    6. Preparing Spatial Data for Analysis
      6.1 Introduction
      6.2 Quality of Attribute Data
        6.2.1 Dealing with Outliers and Contaminants
        6.2.2 Quality of Ecological Survey Data
        6.2.3 Quality of Automatically Recorded Data
      6.3 Spatial Interpolation Procedures
        6.3.1 Inverse Weighted Distance Interpolation
        6.3.2 Kriging Interpolation
        6.3.3 Cokriging Interpolation
      6.4 Spatial Rectification and Alignment of Data
        6.4.1 Definitions of Scale-Related Processes
        6.4.2 Change of Coverage
        6.4.3 Change of Support
      6.5 Further Reading
      Exercises

    7. Preliminary Exploration of Spatial Data
      7.1 Introduction
      7.2 Data Set 1
      7.3 Data Set 2
      7.4 Data Set 3
      7.5 Data Set 4
      7.6 Further Reading
      Exercises

    8. Multivariate Methods for Spatial Data Exploration
      8.1 Introduction
      8.2 Principal Components Analysis
        8.2.1 Introduction to Principal Components Analysis
        8.2.2 Principal Components Analysis of Data Set 2
        8.2.3 Application to Data Set 4, Field 1
      8.3 Classification and Regression Trees (a.k.a. Recursive Partitioning)
        8.3.1 Introduction to the Method
        8.3.2 Mathematics of Recursive Partitioning
        8.3.3 Exploratory Analysis of Data Set 2 with Regression Trees
        8.3.4 Exploratory Analysis of Data Set 3 with Recursive Partitioning
        8.3.5 Exploratory Analysis of Field 4.1 with Recursive Partitioning
      8.4 Random Forest
        8.4.1 Introduction to Random Forest
        8.4.2 Application to Data Set 2
      8.5 Further Reading
      Exercises

    9. Spatial Data Exploration via Multiple Regression
      9.1 Introduction
      9.2 Multiple Linear Regression
        9.2.1 Many Perils of Model Selection
        9.2.2 Multicollinearity, Added Variable Plots, and Partial Residual Plots
          9.2.2.1 Added Variable Plots
          9.2.2.2 Partial Residual Plots
        9.2.3 Cautious Approach to Model Selection as an Exploratory Tool
      9.3 Building a Multiple Regression Model for Field 4.1
      9.4 Generalized Linear Models
        9.4.1 Introduction to Generalized Linear Models
        9.4.2 Multiple Logistic Regression Model for Data Set 2
        9.4.3 Logistic Regression Model of Count Data for Data Set 1
        9.4.4 Analysis of the Counts of Data Set 1: Zero-Inflated Poisson Data
      9.5 Further Reading
      Exercises

    10. Variance Estimation, the Effective Sample Size, and the Bootstrap
      10.1 Introduction
      10.2 Bootstrap Estimation of the Standard Error
      10.3 Bootstrapping Time Series Data
        10.3.1 Problem with Correlated Data
        10.3.2 Block Bootstrap
        10.3.3 Parametric Bootstrap
      10.4 Bootstrapping Spatial Data
        10.4.1 Spatial Block Bootstrap
        10.4.2 Parametric Spatial Bootstrap
        10.4.3 Power of the Tests
      10.5 Application to the EM38 Data
      10.6 Further Reading
      Exercises

    11. Measures of Bivariate Association between Two Spatial Variables
      11.1 Introduction
      11.2 Estimating and Testing the Correlation Coefficient
        11.2.1 Correlation Coefficient
        11.2.2 Clifford et al. Correction
        11.2.3 Bootstrap Variance Estimate
        11.2.4 Application to the Example Problem
      11.3 Contingency Tables
        11.3.1 Large Sample Size Contingency Tables
        11.3.2 Small Sample Size Contingency Tables
      11.4 Mantel and Partial Mantel Statistics
        11.4.1 Mantel Statistic
        11.4.2 Partial Mantel Test
      11.5 Modifiable Areal Unit Problem and Ecological Fallacy
        11.5.1 Modifiable Areal Unit Problem
        11.5.2 Ecological Fallacy
      11.6 Further Reading
      Exercises

    12. Mixed Model
      12.1 Introduction
      12.2 Basic Properties of the Mixed Model
      12.3 Application to Data Set 3
      12.4 Incorporating Spatial Autocorrelation
      12.5 Generalized Least Squares
      12.6 Spatial Logistic Regression
        12.6.1 Upscaling Data Set 2 in the Coast Range
        12.6.2 Incorporation of Spatial Autocorrelation
      12.7 Further Reading
      Exercises

    13. Regression Models for Spatially Autocorrelated Data
      13.1 Introduction
      13.2 Detecting Spatial Autocorrelation in a Regression Model
      13.3 Models for Spatial Processes
        13.3.1 Spatial Lag Model
        13.3.2 Spatial Error Model
      13.4 Determining the Appropriate Regression Model
        13.4.1 Formulation of the Problem
        13.4.2 Lagrange Multiplier Test
      13.5 Fitting the Spatial Lag and Spatial Error Models
      13.6 Conditional Autoregressive Model
      13.7 Application of SAR and CAR Models to Field Data
        13.7.1 Fitting the Data
        13.7.2 Comparison of the Mixed and Autoregressive Models
      13.8 Autologistic Model for Binary Data
      13.9 Further Reading
      Exercises

    14. Bayesian Analysis of Spatially Autocorrelated Data
      14.1 Introduction
      14.2 Markov Chain Monte Carlo Methods
      14.3 Introduction to WinBUGS
        14.3.1 WinBUGS Basics
        14.3.2 WinBUGS Diagnostics
        14.3.3 Introduction to R2WinBUGS
        14.3.4 Generalized Linear Models in WinBUGS
      14.4 Hierarchical Models
      14.5 Incorporation of Spatial Effects
        14.5.1 Spatial Effects in the Linear Model
        14.5.2 Application to Data Set 3
      14.6 Further Reading
      Exercises

    15. Analysis of Spatiotemporal Data
      15.1 Introduction
      15.2 Spatiotemporal Cluster Analysis
      15.3 Factors Underlying Spatiotemporal Yield Clusters
      15.4 Bayesian Spatiotemporal Analysis
        15.4.1 Introduction to Bayesian Updating
        15.4.2 Application of Bayesian Updating to Data Set 3
      15.5 Other Approaches to Spatiotemporal Modeling
        15.5.1 Models for Dispersing Populations
        15.5.2 State and Transition Models
      15.6 Further Reading
      Exercises

    16. Analysis of Data from Controlled Experiments
      16.1 Introduction
      16.2 Classical Analysis of Variance
      16.3 Comparison of Methods
        16.3.1 Comparison Statistics
        16.3.2 Papadakis Nearest Neighbor Method
        16.3.3 Trend Method
        16.3.4 Correlated Errors Method
        16.3.5 Published Comparisons of the Methods
      16.4 Pseudoreplicated Data and the Effective Sample Size
        16.4.1 Pseudoreplicated Comparisons
        16.4.2 Calculation of the Effective Sample Size
        16.4.3 Application to Field Data
      16.5 Further Reading
      Exercises

    17. Assembling Conclusions
      17.1 Introduction
      17.2 Data Set 1
      17.3 Data Set 2
      17.4 Data Set 3
      17.5 Data Set 4
      17.6 Conclusions

    Appendix A: Review of Mathematical Concepts ................................................................ 555

    Appendix B: The Data Sets ...................................................................................................... 581

    Appendix C: An R Thesaurus ................................................................................................. 589

    References ................................................................................................................................... 599


    Preface

    This book is intended for classroom use or self-study by graduate students and researchers in ecology, geography, and agricultural science who wish to learn about the analysis of spatial data. It originated in a course entitled "Spatial Data Analysis in Applied Ecology" that I taught for several years at the University of California, Davis. Although most of the students were enrolled in Ecology, Horticulture and Agronomy, or Geography, there was a smattering of students from other programs such as Entomology, Soil Science, and Agricultural and Resource Economics. The book assumes that the reader has a background in statistics at the level of an upper division undergraduate applied linear models course. This is also, in my experience, the level at which ecologists and agronomists teach graduate applied statistics courses to their own students. To be specific, the book assumes a statistical background at the level of Kutner et al. (2005). I do not expect the reader to have had exposure to the general linear model or modern mixed model analysis.

    The book is aimed at those who want to make use of these methods in their research, not at statistical or geographical specialists. It is always wise to seek out a specialist's help when such help is available, and I strongly encourage this practice. Nevertheless, the more one knows about a specialist's area of knowledge, the better one is able to make use of that knowledge. Because this is a user's book, I have elected on some occasions to take small liberties with technical points and details of the terminology. To dot every i and cross every t would drag the presentation down without adding anything useful.

    The book does not assume any prior knowledge of the R programming environment. All of the R code for all of the analyses carried out in this book is available on the book's companion website, http://www.plantsciences.ucdavis.edu/plant/sda.htm. One of the best features of R is also its most challenging: the vast number of functions and contributed packages that provide a multitude of ways to solve any given problem. This presents a special challenge for a textbook, namely, how to find the best compromise between exposition via manual coding and the use of contributed package functions, and which functions to choose. I have tried to use manual coding when it is easy or there is a point to be made, and save contributed functions for more complex operations. As a result, I sometimes have provided homemade code for operations that can also be carried out by a function from a contributed package.

    The book focuses on data from four case studies, two from uncultivated ecosystems and two from cultivated ecosystems. The data sets are also available on the book's companion website. Each of the four data sets is drawn from my own research. My reason for this approach is a conviction that if one wants to really get the most out of a data set, one has to live with it for a while and get to know it from many different perspectives, and I want to give the reader a sense of that process as I have experienced it. I make no claim of uniqueness for this idea; Griffith and Layne (1999), for example, have done it before me. I used data from projects in which I participated not because I think they are in any way special, but because they were available and I already knew them well when I started writing the book.

    For most of my career, I have had a joint appointment in the Department of Biological and Agricultural Engineering and a department that, until it was gobbled up in a fit of academic consolidation, bore the name Agronomy and Range Science. Faculty in this second department were fortunate in that, as the department name implies, we were able to work in both cultivated and uncultivated ecosystems, and this enabled me to include two of each in the book.

    I was originally motivated to write the book based on my experiences working in precision agriculture. We in California entered this arena considerably later than our colleagues in the Midwest, but I was in at the beginning in California. As was typical of academic researchers, I developed methods for site-specific crop management, presented them to farmers at innumerable field days, and watched with bemusement as they were not adopted very often. After a while, I came to the realization that farmers can figure out how best to use this new technology in the ways that suit their needs, and indeed they are beginning to do so. This led to the further realization that we researchers should be using this technology for what we do best: research. This requires learning how to analyze the data that the technology provides, and that is what this book is about.


    Acknowledgments

    I have been very fortunate to have had some truly outstanding students work with me, and their work has contributed powerfully to my own knowledge of the subject. I particularly want to acknowledge those students who contributed to the research that resulted in this book, including (in alphabetical order) Steven Greco, Peggy Hauselt, Randy Horney, Claudia Marchesi, Ali Mermer, Jorge Perez-Quezada, Alvaro Roel, and Marc Vayssières. In particular, Steven Greco provided Data Set 1, Marc Vayssières provided Data Set 2, and Alvaro Roel provided Data Set 3. I also want to thank the students in my course, who read through several versions of this book and made many valuable suggestions. In particular, thanks go to Kimberley Miller, who read every chapter of the final draft and made many valuable comments. I have benefited from the interaction with a number of colleagues, too many to name, but I particularly want to thank Hugo Firpo, who collected the data on Uruguayan rice farmers for Data Set 3; Tony Turkovich, who let us collect the data for Data Set 4 in his fields; Stuart Pettygrove, who managed the large-scale effort that led to that data set and made the notes and data freely available; and Robert Hijmans, who introduced me to his raster package and provided me with many valuable comments about the book in general. Finally, I want to thank Roger Bivand, who, without my asking him to do so, took the trouble to read one of the chapters and made several valuable suggestions. Naturally, these acknowledgments in no way imply that the persons acknowledged or anyone else endorses what I have written. Indeed, some of the methods presented in this book are very ad hoc and may turn out to be inappropriate. For that, as well as for the mistakes in the book and bugs in the code that I am sure are lurking there, I take full responsibility.


    Author

    Richard Plant is a professor emeritus of biological and agricultural engineering and plant sciences at the University of California in Davis, California. He is the coauthor of Knowledge-Based Systems in Agriculture and is a former editor in chief of Computers and Electronics in Agriculture. He has also published extensively on applications of crop modeling, expert systems, spatial statistics, geographic information systems, and remote sensing to problems in crop production and natural resource management.

    1 Working with Spatial Data

    1.1 Introduction

    The last decades of the twentieth century witnessed the introduction of revolutionary new tools for collecting and analyzing ecological data at the landscape level. The best known of these are the global positioning system (GPS) and the various remote sensing technologies such as multispectral, hyperspectral, thermal, radar, and light detection and ranging (LiDAR) imaging. Remote sensing permits data to be gathered over a wide area at relatively low cost, and the GPS permits locations to be easily determined on the surface of the earth with submeter accuracy. Other new technologies have had a similar impact on some areas of agricultural, ecological, or environmental research. For example, measurement of apparent soil electrical conductivity (ECa) permits the rapid estimation of soil properties such as salinity and clay content for use in agricultural (Corwin and Lesch, 2005) and environmental (Kaya and Fang, 1997) research. The electronic nose is in its infancy, but in principle, it can be used to detect the presence of organic pollutants (Fenner and Stuetz, 1999) and plant diseases (Wilson et al., 2004). Sensor technologies such as these have not yet found their way into fundamental ecological research, but it is only a matter of time until they do.

    The impact of these technologies has been particularly strong in agricultural crop production, where they are joined by another device of strictly agricultural utility: the yield monitor. This device permits "on-the-go" measurement of the flow of harvested material. If one has spatially precise measurements of yield and of the factors that affect yield, it is natural to attempt to adjust management practices in a spatially precise manner to improve yield or reduce costs. This practice is called site-specific crop management, which has been defined as the management of a crop at a spatial and temporal scale appropriate to its natural variation (Lowenberg-DeBoer and Erickson, 2000).

    Technologies such as remote sensing, soil electrical conductivity measurement, electronic nose, and yield monitor are precision measurement instruments at the landscape scale, and they allow scientists to obtain landscape-level data at a previously unheard-of spatial resolution. These data, however, differ in their statistical properties from the data collected in traditional plot-based ecological and agronomic experiments. The data often consist of thousands of individual values, each obtained from a spatial location either contiguous with or very near to those of neighboring data. The advent of these massive spatial data sets has been part of the motivation for the further development of the theory of spatial statistics, particularly as it applies to ecology and agriculture.

    Although some of the development of the theory of spatial statistics has been motivated by large data sets produced by modern measurement technologies, there are many older data sets to which the theory may be profitably applied. Figure 1.1 shows the locations of a subset of about one-tenth of a set of 4101 sites taken from the Wieslander survey (Wieslander, 1935). This was a project led by Albert Wieslander of the U.S. Forest Service that documented the vegetation in about 155,000 km² of California, or roughly 35% of its area. Allen et al. (1991) used a subset of the Wieslander survey to develop a classification system for California's oak woodlands, and, subsequently, Evett (1994) and Vayssières et al. (2000) used these data to compare methods for predicting oak woodland species distributions based on environmental properties.

    FIGURE 1.1 Locations of survey sites taken from the Wieslander vegetation survey. The white dots represent one-tenth of the 4101 survey locations used to characterize the ecological relationships of oak species in California's oak woodlands. The hillshade map was downloaded from the University of California, Santa Barbara Biogeography Lab, http://www.biogeog.ucsb.edu/projects/gap/gap_data_state.html, and reprojected in ArcGIS. (Map axes: longitude from 126°W to 114°W and latitude from 32°N to 42°N.)

    Large spatial data sets like those arising from modern precision landscape measuring devices or from large-scale surveys such as the Wieslander survey present the analyst with two major problems. The first is that, even in cases where the data satisfy the assumptions of traditional statistics, meaningful statistical inference is almost impossible with data sets having thousands of records because the null hypothesis is almost always rejected (Matloff, 1991). Real data never have exactly the hypothesized property, and with so large a data set even a trivially small departure from that property is detected. The second problem is that the data frequently do not satisfy the assumptions of traditional statistics. The data records from geographically nearby sites are often more closely related to each other than they are to data records from sites further away. Moreover, if one were to record some data attribute at a location in between two neighboring sample points, the resulting attribute data value would likely be similar to those of both of its neighbors and thus would be limited in the amount of information it added. The data in this case are, to use the statistical term, spatially autocorrelated. We will define this term more precisely and discuss its consequences in greater detail in Chapter 3, but it is clear that these data violate the assumption of traditional statistical analysis that data values or errors are mutually independent. The theory of spatial statistics concerns itself with these sorts of data.
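    The first of these two problems can be illustrated with a small simulation. The following is a minimal sketch in base R and is not taken from the book's code; the shift of 0.05 standard deviations and the two sample sizes are arbitrary assumptions chosen only to make the point visible.

```r
# Hypothetical illustration: a practically negligible departure from the
# null hypothesis (true mean 0.05 instead of 0, standard deviation 1)
# is reliably detected once the sample is large enough.
set.seed(123)
true.shift <- 0.05                          # assumed tiny effect size
small <- rnorm(50, mean = true.shift)       # a small-sample study
large <- rnorm(100000, mean = true.shift)   # a very large data set

# One-sample t tests of the null hypothesis that the mean is zero
p.small <- t.test(small, mu = 0)$p.value
p.large <- t.test(large, mu = 0)$p.value
c(p.small = p.small, p.large = p.large)
```

    With the large sample, the p value is essentially zero even though a shift of this size would rarely be of practical interest, which is the sense in which the test result stops being informative.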

    Data sets generated by modern precision landscape measurement tools frequently have a second, more ecological, characteristic that we will call low ecological resolution. This is illustrated in the maps in Figure 1.2, which are based on data from an agricultural field. These three thematic maps represent measurements associated with soil properties, vegetation density, and reproductive material, respectively. The figures are arranged in a rough sequence of influence. That is, one can roughly say that soil properties influence vegetative growth and vegetative growth influences reproductive growth. These influences, however, are very complex and nonlinear. Three quantities are displayed in Figure 1.2: ECa in Figure 1.2a, infrared reflectance (IR digital number) in Figure 1.2b, and grain yield in Figure 1.2c. The complexity of the relationship between soil, vegetation, and reproductive material is reflected in the relationship between the patterns in Figure 1.2a through c, which is clearly not a simple one (Ball and Frazier, 1993; Benedetti and Rossini, 1993). In addition to the complexity of the relationships among these three quantities, each quantity itself has a complex relationship with more fundamental properties like soil clay content, leaf nitrogen content, and so forth. To say that the three data sets in Figure 1.2 have a very high spatial resolution but a low ecological resolution means that there is a complex relationship between these data and the fundamental quantities that are used to gain an understanding of the ecological processes in play. The low ecological resolution often makes it necessary to supplement these high spatial resolution data with other, more traditional data sets that result from the direct collection of soil and plant data.

    FIGURE 1.2 Spatial plots of three data sets from Field 2 of Data Set 4, a wheat field in Central California: (a) gray-scale representation of soil apparent electrical conductivity, ECa (mS m⁻¹), (b) digital number value of the infrared band of an image of the field taken in May 1996, and (c) grain yield map (kg ha⁻¹). (Axes in each panel: easting and northing.)

    The three data sets in Figure 1.2, like many spatial data sets, are filled with suggestive visual patterns. The two areas of low grain yield in the southern half of the field in Figure 1.2c clearly seem to align with patterns of low infrared reflectance in Figure 1.2b. However, there are other, more subtle patterns that may or may not be related between the maps. These include an area of high ECa on the west side of the field that may correspond to a somewhat similarly shaped area of high infrared reflectance, an area of high apparent ECa in the southern half of the field centered near 592600E that may possibly correspond to an area of slightly lower infrared reflectance in the same location, and a possible correspondence between low apparent ECa, low infrared reflectance, and low grain yield at the east end of the field. Scrutinizing the maps carefully might reveal other possibilities. The problem with searching for matching patterns like these, however, is that the human eye is famously capable of seeing patterns that are not really there. Also, the eye may miss subtle patterns that do exist. For this reason, a more objective means of detecting true patterns must be used, and statistical analysis can often provide this means.

    Data sets like the ones discussed in this book are often unreplicated, relying on observations rather than controlled experiments. Every student in every introductory applied statistics class is taught that the three fundamental concepts of experimental design, introduced by Sir Ronald Fisher (1935), are randomization, replication, and blocking. The primary purpose of replication is to provide an estimate of the variance (Kempthorne, 1952, p. 177). In addition, replication helps to ensure that experimental treatments are interspersed, which may reduce the impact of confounding factors (Hurlbert, 1984). Some statisticians might go so far as to say that data that are not randomized and replicated are not worthy of statistical analysis.

    In summary, we are dealing with data sets that may not satisfy the assumptions of classical statistics, that often involve relations between measured quantities for which there is no easy interpretation, that tempt the analyst with suggestive but potentially misleading visual patterns, and that frequently result from unreplicated observations. In the face of all these issues, one might ask why anyone would even consider working with data like these, when traditional replicated small plot experiments have been used so effectively for almost 100 years. A short answer is that replicated small plot experiments are indeed valuable and can tell us many things, but they cannot tell us everything. Often phenomena that occur at one spatial scale are not observable at another, and sometimes complex interactions that are important at the landscape scale lose their importance at the plot scale. Richter et al. (2009) argue that laboratory experiments conducted under highly standardized conditions can produce results with little validity beyond the specific environment in which the experiment is conducted, and this same argument can also be applied to small plot field experiments. Moreover, econometricians and other scientists for whom replicated experiments are often impossible have developed powerful statistical methods for analyzing and drawing inferences from observational data. These methods have much to recommend them for ecological and agricultural use as well. The fundamental theme of this book is that methods of spatial statistics, many of them originally developed with econometric and other applications in mind, can prove useful to ecologists and crop and soil scientists. These methods are certainly not intended to replace traditional replicated experiments, but, when properly applied, they can serve as a useful complement, providing additional insights, increasing the scope, and serving as a bridge from the small plot to the landscape. Even in cases where the analysis of data from an observational study does not by itself lead to publishable results, such an analysis may provide a much improved focus to the research project and enable the investigator to pose much more productive hypotheses.

    1.2 Analysis of Spatial Data

    1.2.1 Types of Spatial Data

    Cressie (1991, p. 8) provides a very convenient classification of spatial data into three categories, which we will call (1) geostatistical data, (2) areal data, and (3) point pattern data. These can be explained as follows. Suppose we denote by D a site (the domain) in which the data are collected. Let the locations in D be specified by position coordinates (x, y), and let Y(x, y) be a quantity or vector of quantities measured at location (x, y). Suppose, for example, the domain D is the field in Figure 1.2. The components of Y would then be the three quantities mapped in the figure together with any other measured quantities. The first of Cressie's (1991) categories, geostatistical data, consists of data in which the components of Y(x, y) vary continuously in the spatial variables (x, y). The value of Y is measured at a set of points with coordinates (x_i, y_i), i = 1, ..., n. A common goal in the analysis of geostatistical data is to interpolate Y at points where it is not measured.
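    The interpolation goal can be made concrete with a short sketch. The code below is not the book's method: it uses inverse distance weighting, a deliberately simple stand-in for the kriging approach the book takes up later, and the function name idw, the coordinates, and the attribute values are all hypothetical.

```r
# Inverse distance weighted (IDW) interpolation in base R.
# The prediction at (x0, y0) is a weighted mean of the measured values,
# with weights that decay with distance from the prediction point.
idw <- function(x, y, z, x0, y0, power = 2) {
  d <- sqrt((x - x0)^2 + (y - y0)^2)
  if (any(d == 0)) return(mean(z[d == 0]))  # prediction point is a sample point
  w <- 1 / d^power
  sum(w * z) / sum(w)
}

# Synthetic measurements Y(x_i, y_i), i = 1, ..., n, in a 100 x 100 domain D
set.seed(1)
n <- 30
x <- runif(n, 0, 100)
y <- runif(n, 0, 100)
Y <- 30 + 0.1 * x + rnorm(n, sd = 1)   # hypothetical attribute values

# Predict Y at an unmeasured location in the middle of the domain
Y.hat <- idw(x, y, Y, x0 = 50, y0 = 50)
Y.hat
```

    Because the prediction is a weighted mean with positive weights, it always lies between the smallest and largest measured values, which is one of the known limitations of this simple interpolator.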

    The second of Cressie's (1991) categories consists of data that are defined only at a set of locations, which may be points or polygons, in the domain D. If the locations are polygons, then the data are assumed to be uniform within each polygon. If the data are measured at a set of points, then one can arrange a mosaic or lattice of polygons that each contains one or more measurement points and whose areal aggregate covers the entire domain D. For this reason, these data are called areal data. In referring to the spatial elements of areal data, we will use the terms cell and polygon synonymously. A mosaic is an irregular arrangement of cells, and a lattice is a regular rectangular arrangement.
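    As a rough illustration of the areal view, point measurements can be assigned to the cells of a regular lattice and summarized by cell. This is a minimal base R sketch with synthetic data; the 25-unit cell size, the column names, and the use of the cell mean as the summary are assumptions made for the example.

```r
# Aggregate synthetic point measurements into a 4 x 4 lattice of square cells.
set.seed(2)
pts <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100))
pts$Y <- 30 + 0.1 * pts$x + rnorm(200)   # hypothetical attribute values

cell.size <- 25                               # assumed cell width and height
pts$col <- floor(pts$x / cell.size) + 1       # lattice column index, 1 to 4
pts$row <- floor(pts$y / cell.size) + 1       # lattice row index, 1 to 4

# One record per occupied cell; the value is treated as uniform within the cell
areal <- aggregate(Y ~ row + col, data = pts, FUN = mean)
head(areal)
```

    Each row of areal then plays the role of one polygon, with a single attribute value that stands for every location inside it.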

    It is often the case that a biophysical process represented by areal data varies continuously over the landscape (Schabenberger and Gotway, 2005, p. 9). Thus, the same data set may be treated in one analysis as geostatistical data and in another as areal data. Figure 1.3 illustrates this concept. The figure shows the predicted values of clay content in the same field as that shown in Figure 1.2. Clay content was measured in this field by taking soil cores on a rectangular grid of square cells 61 m apart. Figure 1.3a shows an interpolation of the values obtained by kriging (Section 6.4.3). This is an implementation of the geostatistical model. Figure 1.3b shows the same data modeled as a lattice of discrete polygons. In both figures, the actual data collection locations are shown as black dots. It happens that each polygon in Figure 1.3b contains exactly one data location, but, in other applications, data from more than one location may be aggregated into a single polygon, and the polygons may not form a regular lattice.

    The appropriate data model, geostatistical or areal, may depend on the objective of the analysis. If the objective is to accurately predict the value of clay content at unmeasured locations, then generally the geostatistical model would be used. If the objective is to determine how clay content and other quantities influence some response variable such as yield, then the areal model may be more appropriate. In many cases, it is wisest not to focus on the classification of the data into one or another category but rather on the objective of the analysis and the statistical tools that may be best used to achieve this objective.

    The third category of spatial data, for which the definition of the domain D must be modified, consists of points in a plane that are defined by their location, and for which the primary questions of interest involve the pattern of these locations. Data falling in this category, such as the locations of members of a species of tree, are called point patterns. For example, Figure 1.6 shows a photograph of oak trees in a California oak woodland. If each tree is considered as a point, then point pattern analysis would be concerned with the pattern of the spatial distribution of the trees. This book does not make any attempt to cover point pattern analysis. The primary focus of this book is on the analysis of spatial data with the objective of explaining the biophysical processes and interactions that generated these data.

    FIGURE 1.3 Predicted values of soil clay content in Field 2 of Data Set 4 using two different spatial models. (a) Interpolated values obtained using kriging based on the geostatistical model. (b) Polygons based on the areal model. The soil sample locations are shown as black dots. (Panel title in both maps: Field 4.2 clay content (%); axes: easting and northing.)


    1.2.2 Components of Spatial Data

    Spatial data, also known as georeferenced data, have two components. The first is the attribute component, which is the set of all the measurements describing the phenomenon being observed. The second is the spatial component, which consists of that part of the data that at a minimum describes the location at which the data were measured and may also include other properties such as spatial relationships with other data. The central problem addressed in this book is the incorporation into the analysis of attribute data of the effect of spatial relationships. As a simple example, consider the problem of developing a linear regression model relating a response variable to a single environmental explanatory variable. Suppose that there are two measured quantities. The first is leaf area index (LAI), the ratio between total leaf surface area and ground surface area, at a set of locations. We treat this as a response variable and represent it with the symbol Y. Let Y_i denote the LAI measured at location i. Suppose that there is one explanatory variable, say, soil clay content. Explanatory variables will be represented using the symbol X, possibly with subscripts if there are more than one of them. We can postulate a simple linear regression relation of the form

    Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n. \qquad (1.1)
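    For readers new to R, the regression relation in Equation 1.1 could be fit along the following lines. This is a minimal sketch with simulated data, assuming independent errors; the intercept of 1.0, the slope of 0.08, the error standard deviation, and the range of clay contents are invented for the example and have no connection to the book's data sets.

```r
# Simulate a response Y (e.g., LAI) that depends linearly on one
# explanatory variable X (e.g., soil clay content, %), then fit the
# simple linear regression by ordinary least squares.
set.seed(3)
n <- 100
X <- runif(n, 20, 45)                       # hypothetical clay content (%)
Y <- 1.0 + 0.08 * X + rnorm(n, sd = 0.3)    # hypothetical intercept and slope

fit <- lm(Y ~ X)   # estimates the intercept and the slope
coef(fit)
```

    Calling summary(fit) would add standard errors and p values, but those rest on the assumption of independent errors, which is exactly the assumption that spatial data tend to violate.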

    There are several ways in which spatial effects can influence this model. One very commonly occurring one is a reduction of the effective sample size. Intuition for this concept can be gained as follows. Suppose that the n data values are collected on a square grid and that n = m², where m is the number of measurement locations on a side. Suppose now that we consider sampling on a 2m × 2m grid subtending the same geographic area. This would give us 4n data values and thus, if these values were independent, provide a much larger sample and greater statistical power. One could consider sampling on a 4m × 4m grid for a still larger sample size, and so on ad infinitum. Thus, it would seem that there is no limit to the power attainable if only we are willing to sample enough. The problem with this reasoning is that the data values are not independent: values sampled very close to each other will tend to be similar. If the values are similar, then the errors ε_i will also be similar. This property of similarity among nearby error values violates a fundamental assumption on which the statistical analysis of regression data is based, namely, the independence of errors. This loss of independence of errors is one of the most important properties of spatial data and will play a major role in the development of methods described in this book.
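    The grid-densification argument can be checked numerically. The sketch below is an illustration under stated assumptions and not necessarily the definition of effective sample size developed later in the book: it takes the effective sample size to be the number of independent values that would give the same variance of the sample mean, and it assumes an exponential correlation function with an arbitrary range of 20 distance units on a 100 × 100 area.

```r
# Effective sample size for the sample mean on an m x m grid covering a
# fixed area, assuming corr(d) = exp(-d / range) between values a
# distance d apart. Since Var(mean) = sigma^2 * sum(R) / n^2, the
# equivalent number of independent values is n^2 / sum(R).
n.eff <- function(m, extent = 100, range = 20) {
  s <- seq(0, extent, length.out = m)        # m locations per side
  xy <- expand.grid(x = s, y = s)            # all n = m^2 grid locations
  R <- exp(-as.matrix(dist(xy)) / range)     # n x n correlation matrix
  m^4 / sum(R)
}

sides <- c(5, 10, 20)   # m, 2m, and 4m locations per side
result <- data.frame(n = sides^2, n.eff = sapply(sides, n.eff))
result
```

    Under these assumptions the effective sample size stays well below n and grows far more slowly than n as the grid is made denser, which is the intuition described above.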

    1.2.3 Topics Covered in the Text

    The topics covered in this book are ordered as follows. The statistical software used for almost all of the analysis in the book is R (R Development Core Team, 2011). No familiarity with R is assumed, and Chapter 2 contains an introduction to the R package. The property of spatial data described in the previous paragraph, that attribute values measured at nearby locations tend to be similar, is called positive spatial autocorrelation. In order to develop a statistical theory that allows for spatial autocorrelation one must define it precisely. This is the primary topic of Chapter 3. That chapter also describes the principal effects of spatial autocorrelation on the statistical properties of data. Having developed a precise definition of autocorrelation, one can measure its magnitude in a particular data set. Chapter 4 describes some of the most widely used measures of spatial autocorrelation and introduces some of their uses in data analysis.

    Beginning with Chapter 5, the chapters of this book are organized in the same general order as that in which the phases of a study are carried out. The first phase is the actual collection of the data. Chapter 5 contains a discussion of sampling plans for spatial data. The second phase of a study is the preparation of the data for analysis, which is covered in Chapter 6. This is a much more involved process with spatial data than with nonspatial data: first, because the data sets must often be converted to a file format compatible with the analytical software; second, because the data sets must be georegistered (i.e., their locations must be aligned with the points on the earth at which they were measured); third, because the data are often not measured at the same location or on the same scale (this is the problem of misalignment); and fourth, because in many cases, the high volume and data collection processes make it necessary to remove outliers and other discordant data records. After data preparation is complete, the analyses can begin.

    Statistical analysis of a particular data set is often divided into exploratory and confirmatory components. Chapters 7 through 9 deal with the exploratory components. Chapter 7 discusses some of the common graphical methods for exploring spatial data. Chapter 8 describes nonspatial multivariate methods that can contribute to an understanding of spatial data. Chapter 9 introduces multiple linear regression, in the context of exploration since spatial autocorrelation is not taken into account. The discussion of regression analysis also serves as a review on which to base much of the material in subsequent chapters.

    Chapter 10 serves as a link between the exploratory and confirmatory components of the analysis. The focus of the chapter is the effect of spatial autocorrelation on the sample variance. The classical estimates based on independence of the random variables are no longer appropriate, and, in many cases, an analytical estimate is impossible. Although approximations can be developed, in many cases, bootstrap estimation provides a simple and powerful tool for generating variance estimates. Chapter 11 enters fully into confirmatory data analysis. One of the first things one often does at this stage is to quantify, and test hypotheses about, association between measured quantities. It is at this point that the effects of spatial autocorrelation of the data begin to make themselves strongly felt. Chapter 11 describes these effects and discusses some methods for addressing them.

    In Chapter 12, we begin to address the incorporation of spatial effects into models for the relationship between a response variable and its environment with the introduction of mixed model theory. Chapter 13 describes autoregressive models, which have become one of the most popular tools for statistical modeling.

    Chapter 14 provides an introduction to Bayesian methods of spatial data analysis. Chapter 15 provides a brief introduction to the study of spatiotemporal data. This area of spatial data analysis is still in its infancy, and the methods presented are very ad hoc. Chapter 16 describes methods for the explicit incorporation of spatial autocorrelation into the analysis of replicated experiments. Although the primary focus of this book is on data from observational studies, if one has the opportunity to incorporate replicated data, one certainly should not pass this up. Chapter 17 describes the summarization of the results of data analysis into conclusions about the processes that the data describe. This is the goal of every ecological or agronomic study, and the methods for carrying out the summarization are very much case dependent. All we can hope to do is to try to offer some examples.


    1.3 Data Sets Analyzed in This Book

    Every example in this book is based on one of four data sets. There are a few reasons for this arrangement. Probably, the most important is that one of the objectives of the book is to give the reader a complete picture, from beginning to end, of the collection and analysis of spatial data characterizing an ecological process. A second is that these data sets are not ideal "textbook" data. They include a few warts. Real data sets almost always do, and one should know how to confront them. Two of the data sets involve unmanaged ecosystems and two involve cultivated ecosystems. All four come from studies in which I have participated. This is not because I feel that these studies are somehow superior to others that I might have chosen, but rather because, at one time or another, my students and collaborators and I have lived with these data sets for an extended period, and I believe one must live with a data set for a long time to really come to know it and analyze it effectively. The book is not a research monograph, and none of the analytical objectives given here is an actual current research objective. Their purpose is strictly expository. The present section contains a brief description of each of the four data sets. A more thorough, "Materials and Methods" style description is given in Appendix B.

    The data sets, along with the R code to execute the analyses described in this book, are available on the book's website, http://www.plantsciences.ucdavis.edu/plant/sda.htm. The data are housed in four subfolders of the folder data, called Set1, Set2, Set3, and Set4. Two other data folders, which are not provided on the website, appear in the R code: auxiliary and created. The folder created contains files that are created as a result of doing the analyses and exercises in the book. The folder auxiliary contains data downloaded from the Internet that are used primarily in the construction of some of the figures. In addition to a description of the data sets, Appendix B also contains the R statements used to load them into the computer. These statements are therefore not repeated in the body of the text.
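The folder arrangement described above can be set up ahead of time. The following is a minimal sketch, written in Python purely for illustration (the book's own code is in R). The folder names come from the text; the function name `make_layout`, the choice of root directory, and the placement of auxiliary and created inside data next to the Set folders are assumptions, since the text does not say exactly where those two folders sit.

```python
from pathlib import Path
import tempfile

# Folder names come from the text; their exact nesting is an assumption.
WEBSITE_FOLDERS = ["Set1", "Set2", "Set3", "Set4"]  # provided on the book's website
LOCAL_FOLDERS = ["auxiliary", "created"]            # used by the R code, not provided

def make_layout(root):
    """Create the folder skeleton under root and return the list of created paths."""
    data_dir = Path(root) / "data"
    paths = [data_dir / name for name in WEBSITE_FOLDERS + LOCAL_FOLDERS]
    for path in paths:
        # parents=True also creates data/ itself; exist_ok=True makes reruns harmless
        path.mkdir(parents=True, exist_ok=True)
    return paths

# Usage: build the skeleton in a throwaway directory and list it.
with tempfile.TemporaryDirectory() as tmp:
    for path in make_layout(tmp):
        print(path.relative_to(tmp))
```

Creating the empty created folder in advance means that code which writes intermediate files into it does not fail on a missing directory.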

    1.3.1 Data Set 1: Yellow-Billed Cuckoo Habitat

    This data set was collected as a part of a study of the change over 50 years in the extent of suitable habitat for the western yellow-billed cuckoo (Coccyzus americanus occidentalis), a California state listed endangered species that is one of California's rarest birds (Gaines and Laymon, 1984). The yellow-billed cuckoo, whose habitat is the interior of riparian forests, once bred throughout the Pacific Coast states. Land use conversion to agriculture, urbanization, and flood control have, however, restricted its current habitat to the upper reaches of the Sacramento River. This originally meandering river has over most of its length been constrained by the construction of flood control levees. One of the few remaining areas in which the river is unrestrained is in its north central portion (Figure 1.4).

    The data set consists of five explanatory variables and a response variable. The explanatory variables are defined over a polygonal map of the riparian area of the Sacramento River between river-mile 196, located at Pine Creek Bend, and river-mile 219, located near the Woodson Bridge State Recreation Area. This map was created by a series of geographic information system (GIS) operations as described in Appendix B.1. Figure 1.5 shows the land cover mosaic in the northern end of the study area as measured in 1997. The white areas are open water, and the light gray areas are either managed (lightest gray) or unmanaged (medium gray) areas of habitat unsuitable for the yellow-billed cuckoo. Only the darkest areas contain suitable vegetation cover. These areas are primarily riparian forest, which consists of dense stands of willow and cottonwood. The response variable is based on cuckoo encounters at each of 21 locations situated along the river.

    Our objective in the analysis of Data Set 1 is to test a model for habitat suitability for the yellow-billed cuckoo. The model is based on a modification of the California Wildlife Habitat Relationships (CWHR) classification system (Mayer and Laudenslayer, 1988). This is a classification system that contains life history, geographic range, habitat relationships, and management information on 694 species of amphibians, reptiles, birds, and mammals known to occur in the state (http://www.dfg.ca.gov/biogeodata/cwhr). The modified CWHR model is expressed in terms of habitat patches, where a patch is defined as a geographic area of contiguous land cover. Photographs of examples of the habitat types used in the model are available on the CWHR website given earlier. Our analysis will test a habitat suitability index based on a CWHR model published by Laymon and Halterman (1989). The habitat suitability model incorporates the following explanatory variables: (1) patch area, (2) patch width, (3) patch distance to water, (4) within-patch ratio of high vegetation to medium and low vegetation, and (5) patch vegetation species. The original discussion of this analysis is given by Greco et al. (2002).

    FIGURE 1.4 Location of study area of yellow-billed cuckoo habitat in California in the north central portion of the Sacramento River. (Map title: Data Set 1 site location; scale bar: 0 to 200 km.)

    FIGURE 1.5 Land use in 1997 at the northern end of the region along the Sacramento River from which Data Set 1 was collected. The white areas are open water (riverine = flowing, lacustrine = standing). The light gray areas are unsuitable; only the charcoal-colored area is suitable riparian forest. (Map title: Data Set 1 land cover, northern end, 1997; scale bar: 0 to 1 km; legend classes: grassland, cropland, developed, freshwater wetland, gravel bar, lacustrine, orchard, riverine, riparian forest, oak woodland.)

    The scope of inference of this study is restricted to the upper Sacramento River, since this is the only remaining cuckoo habitat. Data Set 1 is the only one of the four data sets for which the analysis objective is predictive rather than explanatory. Moreover, the primary objective is not to develop the best (by some measure) predictive model but rather to test a specific model that has already been proposed. As a part of the analysis, however, we will try to determine whether alternative models have greater predictive power or, conversely, whether any simpler subsets of the proposed model are its equal in predictive power.

    1.3.2 Data Set 2: Environmental Characteristics of Oak Woodlands

    Hardwood rangelands cover almost four million hectares in California, or over 10% of the state's total land area (Allen et al., 1991; Waddel and Barrett, 2005). About three-quarters of hardwood rangeland is classified as oak woodland, defined as areas dominated by oak species and not productive enough to be considered timberland (Waddel and Barrett, 2005, p. 12). Figure 1.6 shows typical oak woodland located in the foothills of the Coast Range. Blue oak (Quercus douglasii) woodlands are the most common hardwood rangeland type, occupying about 500,000 ha. These woodlands occupy an intermediate elevation between the grasslands of the valleys and the mixed coniferous forests of the higher elevations. Their climate is Mediterranean, with hot, dry summers and cool, wet winters. Rangeland ecologists have expressed increasing concern over the apparent failure of certain oak species, particularly blue oak, to reproduce at a rate sufficient to maintain their population (Standiford et al., 1997; Tyler et al., 2006; Zavaleta et al., 2007). The appearance in California in the mid-1990s of the pathogen Phytophthora ramorum, associated with the condition known as sudden oak death, has served to heighten this concern (Waddel and Barrett, 2005).

    FIGURE 1.6 Oak woodlands located in the foothills of the Coast Range of California. (Photograph by the author.)


    The reason or reasons for the decline in oak recruitment rate are not well understood. Among the possibilities are climate change, habitat fragmentation, increased seedling predation resulting from changes in herbivore populations, changes in fire regimes, exotic plant and animal invasions, grazing, and changes in soil conditions (Tyler et al., 2006). One component of an increased understanding of blue oak ecology is an improved characterization of the ecological factors associated with the presence of mature blue oaks. Although changes in blue oak recruitment probably began occurring coincidentally with the arrival of the first European settlers in the early nineteenth century (Messing, 1992), it is possible that an analysis of areas populated by blue oaks during the first half of the twentieth century would provide some indication of environmental conditions favorable to mature oaks. That is the general objective of the analysis of Data Set 2. This data set consists of records from 4101 locations surveyed as a part of the Vegetation Type Map (VTM) survey in California, carried out during the 1920s and 1930s (Wieslander, 1935; Jensen, 1947; Allen et al., 1991; Allen-Diaz and Holzman, 1991), depicted schematically in Figure 1.1. Allen-Diaz and Holzman (1991) entered these data into a database, and Evett (1994) added geographic coordinates and climatic data to this database. The primary goal of the analysis of Data Set 2 is to characterize the environmental factors associated with the presence of blue oaks in the region sampled by the VTM survey. The original analysis was carried out by Vayssières et al. (2000).

    The intended scope of inference of the analysis is that portion of the state of California where blue oak recruitment is possible. As a practical matter, this area can be divided roughly into four regions (Appendix B.2 and Figure 1.1): the Coast Range on the west side of the Great Central Valley, the Klamath Range to the north, the Sierra Nevada on the east side of the valley, and the Transverse Ranges to the south. The analysis in this book will be restricted to the Coast Range and the Sierra Nevada, with the Klamath Range and the Transverse Ranges used as validation data.

    1.3.3 Data Set 3: Uruguayan Rice Farmers

    Rice (Oryza sativa L.) is a major Uruguayan export crop. It is grown under flooded conditions. Low dykes called taipas are often used to promote even water levels (Figure 1.7). Rice is generally grown in rotation with pasture, with a common rotation being 2 years of rice followed by 3 years of pasture. Data Set 3 contains measurements taken in 16 rice fields in eastern Uruguay (Figure 1.8) farmed by 12 different farmers over a period of three growing seasons. Several locations in each field were flagged, and measurements were made at each location of soil and management factors that might influence yield. Hand harvests of rice yield were collected at these locations. Regional climatic data were also recorded.

    Some of the farmers in the study obtained consistently higher yields than others. The overall goal of the analysis is to try to understand the management factors primarily responsible for variation in yields across the agricultural landscape. A key component of the analysis is to distinguish between management factors and environmental factors with respect to their influence on crop yield. We must take into account the fact that a skillful farmer might manage a field not having conditions conducive to high yield in a different way from that in which the same farmer would manage a field with conditions favorable to high yield. For example, a skillful farmer growing a crop on infertile soil might apply fertilizer at an appropriate rate, while a less skillful farmer growing a crop on a fertile soil might apply fertilizer at an inappropriate rate. The less skillful farmer might obtain a better yield, but this would be due to the environmental conditions rather than the farmer's skill. The original analysis was carried out by Roel et al. (2007). The scope of this study is intended to include those rice farmers in eastern Uruguay who employ management practices similar to those of the study participants.

    1.3.4 Data Set 4: Factors Underlying Yield in Two Fields

    One of the greatest challenges in site-specific crop management is to determine the factors underlying observed yield variability. Data Set 4 was collected as a part of a 4-year study initiated in two fields in Winters, California, which is located in the Central Valley. This was the first major precision agriculture study in California, and the original objective was simply to determine whether or not the same sort of spatial variability in yield existed in California as had been observed in other states. We will pursue a different objective in this book, namely, to determine whether we can establish which of the measured explanatory variables influenced the observed yield variability. Data from one of the fields (designated Field 2) is shown in Figure 1.2.

    FIGURE 1.7 Typical Uruguayan rice field. The low dykes, called taipas, are used to promote more even flooding. (Photograph courtesy of Alvaro Roel.)

    FIGURE 1.8 Map of Uruguay, showing locations of the 16 rice fields measured in Data Set 3. Insets are not drawn to scale. (Field labels: 1 through 16; scale bar: 0 to 200 km.)

    In the first year of the experiment, both fields were planted to spring wheat (Triticum aestivum L.). In the second year, both fields were planted to tomato (Solanum lycopersicum L.). In the third year, Field 1 was planted to bean (Phaseolus vulgaris L.) and Field 2 was planted to sunflower (Helianthus annuus L.), and, in the fourth year, Field 1 was planted to sunflower and Field 2 was planted to corn (Zea mays L.). Each crop was harvested with a yield monitor equipped harvester so that yield maps of the fields were available. False color infrared aerial photographs (i.e., images where infrared is displayed as red, red as green, and green as blue) were taken of each field at various points in the growing season. Only those from the first season are used in this book. During the first year, soil and plant data were collected by sampling on a square grid 61 m in size, or about two sample points per hectare. In addition, a more densely sampled soil electrical conductivity map of Field 2 was made during the first year of the experiment.
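The band assignment that defines a false color infrared image (infrared displayed as red, red as green, green as blue) can be illustrated with a small sketch. It is written in Python for illustration only and is not the book's R code; the function name `false_color_image` and the pixel values are hypothetical, and the bands are represented as plain nested lists rather than any particular image format.

```python
def false_color_image(nir, red, green):
    """Return rows of (R, G, B) display triples from three equally sized band grids.

    Display red   <- near-infrared band
    Display green <- red band
    Display blue  <- green band
    """
    if not (len(nir) == len(red) == len(green)):
        raise ValueError("bands must have the same number of rows")
    image = []
    for nir_row, red_row, green_row in zip(nir, red, green):
        if not (len(nir_row) == len(red_row) == len(green_row)):
            raise ValueError("bands must have the same number of columns")
        # Each display pixel is simply the three measured bands in shifted order.
        image.append(list(zip(nir_row, red_row, green_row)))
    return image

# Hypothetical 1 x 2 scene: a vegetated pixel (high infrared) next to a bare-soil pixel.
nir_band = [[200, 90]]
red_band = [[30, 80]]
green_band = [[60, 70]]
print(false_color_image(nir_band, red_band, green_band))
```

In this made-up scene the vegetated pixel has a large value in the display red channel and small values in the other two, which is why dense, healthy vegetation typically appears bright red in such photographs.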

    As was mentioned in the discussion of Data Set 2, the climate in California's Central Valley is Mediterranean, with hot, essentially rain-free summers and cool, wet winters. Spring wheat is planted in the late fall and grown in a supplementally irrigated cropping system, while the other crops, all of which are grown in the summer, are fully irrigated. Both fields were managed by the same cooperator, a highly skilled farmer who used best practices as recommended by the University of California. Thus, while the objective of the analysis of Data Set 3 is to distinguish the most effective management practices among a group of farmers, the objective of the analysis of Data Set 4 is to determine the factors underlying yield variability in a pair of fields in which there is no known variation in management practices.

    The original analysis of the data set was carried out by Plant et al. (1999). The scope of this study, strictly speaking, extends only to the two fields involved. Obviously, however, the real interest lies in extending the results to other fields. Success in achieving the objective would not be to establish that the same factors that influence yield in these fields were also influential in other fields, which is not the case anyway. The objective is rather to develop a methodology that would permit the influential factors in other fields to be identified.

    1.3.5 Comparing the Data Sets

    The four data sets present a variety of spatial data structures, geographical structures, attribute data structures, and study objectives. Let us take a moment to compare them, with the idea that by doing so you might get an idea as to how they relate to your own data set and objectives. Data Sets 2 and 3 have the simplest spatial structure. Both data sets consist of a single set of data records, each of which is the aggregation of a collection of measurements made at a location modeled as a point in space. Data Set 2 contains quantities that are measured at different spatial scales. The spatial structure of Data Set 1 is more complex in that it consists of data records that are modeled as properties of a mosaic of polygons. Data Set 1 is the only data set that includes this polygonal data structure in the raw data. Data Set 4 does not include any polygons, but it does include raster data sets (Lo and Yeung, 2007, p. 83). In addition, Data Set 4 includes point data sets measured at different locations. Moreover, the spatial points that represent the data locations in different point data sets actually represent the aggregation of data collected over different areas. This combination of types of data, different locations of measurement, and different aggregation areas leads to the problem of misalignment of the data, which will be discussed frequently in this book but primarily in Chapter 6.

    Moving on to geographic structure, Data Set 4 is the simplest. The data are recorded in two fields, each of size less than 100 ha. We will use Universal Transverse Mercator (UTM) coordinates (Lo and Yeung, 2007, p. 43) to represent spatial location. If you are not familiar with UTM coordinates, you should learn about them. In brief, they divide the surface of the earth into 120 zones, based on location in the Northern Hemisphere or Southern Hemisphere. Within each hemisphere, UTM coordinates are based on location within 60 bands running north and south, each of width 6° of longitude. Within each zone, location is represented by an Easting (x coordinate) and a Northing (y coordinate), each measured in meters. Locations in Data Sets 1 and 3 are also represented in UTM coordinates, although both of these data sets cover a larger geographic area than that of Data Set 4. Depending on the application, locations in Data Set 2 may be represented in either degrees of longitude and latitude or UTM coordinates. This data set covers the largest geographic area. Moreover, the locations in Data Set 2 are in two different UTM zones.
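The zone arithmetic summarized above (60 north-south bands per hemisphere, 120 zones in all) can be sketched in a few lines. The sketch is in Python for illustration rather than the book's R, and the function names `utm_zone` and `central_meridian` are hypothetical. It assumes the standard convention that bands are numbered eastward from 180° W, and it ignores both the latitude limits of the UTM system and the few irregular zones at high northern latitudes. The example coordinates are only an approximate location for Winters, California.

```python
import math

ZONE_WIDTH_DEG = 6                 # each band spans 6 degrees of longitude
BANDS = 360 // ZONE_WIDTH_DEG      # 60 north-south bands
TOTAL_ZONES = 2 * BANDS            # 120 when each hemisphere is counted separately

def utm_zone(lon_deg, lat_deg):
    """Return (band number 1..60, hemisphere 'N' or 'S') for a longitude/latitude pair."""
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must lie in [-180, 180]")
    if not -90.0 <= lat_deg <= 90.0:
        raise ValueError("latitude must lie in [-90, 90]")
    # Bands are numbered eastward from 180 degrees W; the modulo folds +180 into band 1.
    number = math.floor((lon_deg + 180.0) / ZONE_WIDTH_DEG) % BANDS + 1
    hemisphere = "N" if lat_deg >= 0 else "S"
    return number, hemisphere

def central_meridian(number):
    """Longitude in degrees of the central meridian of band `number`."""
    return number * ZONE_WIDTH_DEG - 183

# Approximate coordinates of Winters, California (an assumption for illustration).
print(utm_zone(-122.0, 38.5))
print(central_meridian(10))
```

Because the band boundary at 120° W runs through California, a statewide data set can easily fall in two bands, which is consistent with the remark that the locations in Data Set 2 lie in two different UTM zones.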

    1.4 Further Reading

    Spatial statistics is a rapidly growing field, and the body of literature associated with it is growing rapidly as well. The seminal books in the field are Cliff and Ord (1973) and, even more importantly, Cliff and Ord (1981). This latter monograph set the stage for much that was to follow, providing a firm mathematical foundation for measures of spatial autocorrelation and for spatial regression models. Ripley (1981) contains coverage of many topics not found in Cliff and Ord (1981). It is set at a slightly higher level mathematically but is nevertheless a valuable text even for the practitioner. Besides Cliff and Ord (1981), the other seminal reference is Cressie (1991). This practically encyclopedic treatment of spatial statistics is at a mathematical level that some may find difficult, but working one's way through it is well worth the effort. Griffith (1987) provides an excellent introductory discussion of the phenomenon of spatial autocorrelation. Upton and Fingleton (1985) provide a wide array of examples for application-oriented readers. Their book is highly readable while still providing considerable mathematical detail.

    Among more recent works, the two monographs by Haining (1990, 2003) provide a broad range of coverage and are very accessible to readers with a wide range of mathematical sophistication. Legendre and Legendre (1998) touch on many aspects of spatial data analysis. Griffith and Layne (1999) present an in-depth discussion of spatial data analysis through the study of a collection of example data sets. Waller and Gotway (2004) discuss many aspects of spatial data analysis directly applicable to ecological problems.

    There are also a number of collections of papers on the subject of spatial data analysis. Two early volumes that contain a great deal of relevant information are those edited by Bartels and Ketellapper (1979) and Griffith (1989). The collection of papers edited by Arlinghaus (1995) has a very application-oriented focus. The volume edited by Longley and Batty (1996), although it is primarily concerned with GIS modeling, contains a number of papers with a statistical emphasis. Similarly, the collection of papers edited by Stein et al. (1999), although it focuses on remote sensing, is broadly applicable. The volume edited by Scheiner and Gurevitch (2001) contains many very useful chapters on spatial data analysis.

    The standard introductory reference for geostatistics is Isaaks and Srivastava (1989). Webster and Oliver (1990, 2001) provide coverage of geostatistical data analysis that is remarkable for its ability to combine accessibility with depth of mathematical treatment. Another excellent source is Goovaerts (1997). The reference Journel and Huijbregts (1978) is especially valuable for its historical and applied perspective. Pielou (1977) is an excel-lent early reference on point pattern analysis. Fortin and Dale (2005) devote a chapter to the topic. Diggle (1983) provides a mathematically more advanced treatment of this topic.

    A good discussion of observational versus replicated data is provided by Kempthorne (1952). Sprent (1998) discusses hypothesis testing for observational data. Gelman and Hill (2007, chapters 9 an