Transfer Learning from Deep Features for Remote Sensing and Poverty Mapping

Michael Xie and Neal Jean and Marshall Burke and David Lobell and Stefano Ermon
Department of Computer Science, Stanford University
{xie, nealjean, ermon}@cs.stanford.edu

Department of Earth System Science, Stanford University
{mburke, dlobell}@stanford.edu

Abstract

The lack of reliable data in developing countries is a major obstacle to sustainable development, food security, and disaster relief. Poverty data, for example, is typically scarce, sparse in coverage, and labor-intensive to obtain. Remote sensing data such as high-resolution satellite imagery, on the other hand, is becoming increasingly available and inexpensive. Unfortunately, such data is highly unstructured and currently no techniques exist to automatically extract useful insights to inform policy decisions and help direct humanitarian efforts. We propose a novel machine learning approach to extract large-scale socioeconomic indicators from high-resolution satellite imagery. The main challenge is that training data is very scarce, making it difficult to apply modern techniques such as Convolutional Neural Networks (CNN). We therefore propose a transfer learning approach where nighttime light intensities are used as a data-rich proxy. We train a fully convolutional CNN model to predict nighttime lights from daytime imagery, simultaneously learning features that are useful for poverty prediction. The model learns filters identifying different terrains and man-made structures, including roads, buildings, and farmlands, without any supervision beyond nighttime lights. We demonstrate that these learned features are highly informative for poverty mapping, even approaching the predictive performance of survey data collected in the field.

Introduction

New technologies fueling the Big Data revolution are creating unprecedented opportunities for designing, monitoring, and evaluating policy decisions and for directing humanitarian efforts (Abelson, Varshney, and Sun 2014; Varshney et al. 2015). However, while rich countries are being flooded with data, developing countries are suffering from data drought. A new data divide is emerging, with huge differences in the quantity and quality of data available. For example, some countries have not taken a census in decades, and in the past five years an estimated 230 million births have gone unrecorded (Independent Expert Advisory Group Secretariat 2014). Even high-profile initiatives such as the Millennium Development Goals (MDGs) are affected (United Nations 2015). Progress based on poverty and infant mortality rate targets can be difficult to track. Often, poverty measures must be inferred from small-scale and expensive household surveys, effectively rendering many of the poorest people invisible.

Remote sensing, particularly satellite imagery, is perhaps the only cost-effective technology able to provide data at a global scale. Within ten years, commercial services are expected to provide sub-meter resolution images everywhere at a fraction of current costs (Murthy et al. 2014). This level of temporal and spatial resolution could provide a wealth of data towards sustainable development. Unfortunately, this raw data is also highly unstructured, making it difficult to extract actionable insights at scale.

In this paper, we propose a machine learning approach for extracting socioeconomic indicators from raw satellite imagery. In the past five years, deep learning approaches applied to large-scale datasets such as ImageNet have revolutionized the field of computer vision, leading to dramatic improvements in fundamental tasks such as object recognition (Russakovsky et al. 2014). However, the use of contemporary techniques for the analysis of remote sensing imagery is still largely unexplored. Modern approaches such as Convolutional Neural Networks (CNN) can, in principle, be directly applied to extract socioeconomic factors, but the primary challenge is a lack of training data. While such data is readily available in the United States and other developed nations, it is extremely scarce in Africa, where these techniques would be most useful.

We overcome this lack of training data by using a sequence of transfer learning steps and a convolutional neural network model. The idea is to leverage available datasets such as ImageNet to extract features and high-level representations that are useful for the task of interest, i.e., extracting socioeconomic data for poverty mapping. Similar strategies have proven quite successful in the past. For example, image features from the Overfeat network trained on ImageNet for object classification achieved state-of-the-art results on tasks such as fine-grained recognition, image retrieval, and attribute detection (Razavian et al. 2014).

Pre-training on ImageNet is useful for learning low-level features such as edges. However, ImageNet consists only of object-centric images, while satellite imagery is captured from an aerial, bird's-eye view. We therefore employ a second transfer learning step, where nighttime light intensities are used as a proxy for economic activity. Specifically, we start with a CNN model pre-trained for object classification on ImageNet and learn a modified network that predicts nighttime light intensities from daytime imagery. To address the trade-off between fixed image size and information loss from image scaling, we use a fully convolutional model that takes advantage of the full satellite image. We show that transfer learning succeeds in learning features relevant not only for nighttime light prediction but also for poverty mapping. For instance, the model learns filters identifying man-made structures such as roads, urban areas, and fields without any supervision beyond nighttime lights, i.e., without any labeled examples of roads or urban areas (Figure 2). We demonstrate that these features are highly informative for poverty mapping and capable of approaching the predictive performance of survey data collected in the field.

Problem Setup

We begin by reviewing transfer learning and convolutional neural networks, the building blocks of our approach.

Transfer Learning

We formalize transfer learning as in (Pan and Yang 2010): A domain D = {X, P(X)} consists of a feature space X and a marginal probability distribution P(X). Given a domain, a task T = {Y, f(·)} consists of a label space Y and a predictive function f(·) which models P(y|x) for y ∈ Y and x ∈ X. Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims to improve the learning of the target predictive function f_T(·) in T_T using the knowledge from D_S and T_S, where D_S ≠ D_T, T_S ≠ T_T, or both. Transfer learning is particularly relevant when, given labeled source domain data D_S and target domain data D_T, we find that |D_T| ≪ |D_S|.

In our setting, we are interested in more than two related learning tasks. We generalize the formalism by representing the multiple source-target relationships as a transfer learning graph. First, we define a transfer learning problem P = (D, T) as a domain-task pair. The transfer learning graph is then defined as follows: A transfer learning graph G = (V, E) is a directed acyclic graph where the vertices V = {P_1, ..., P_v} are transfer learning problems and E = {(P_{i_1}, P_{j_1}), ..., (P_{i_e}, P_{j_e})} is an edge set. For each transfer learning problem P_i = (D_i, T_i) ∈ V, the aim is to improve the learning of the target predictive function f_i(·) in T_i using the knowledge in ∪_{(j,i)∈E} P_j.
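To make the graph formalism concrete, the following minimal Python sketch (not from the paper's code; all names are illustrative) represents a transfer learning graph as a set of domain-task vertices and directed edges.

```python
from dataclasses import dataclass, field

# Minimal sketch of a transfer learning graph: vertices are domain-task pairs
# and directed edges record which problems feed knowledge into which.
@dataclass(frozen=True)
class Problem:
    domain: str   # stands in for D = {X, P(X)}
    task: str     # stands in for T = {Y, f(.)}

@dataclass
class TransferGraph:
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (source, target) pairs

    def sources_of(self, target):
        """Problems whose knowledge is used when learning `target`."""
        return [s for (s, t) in self.edges if t == target]

# The linear chain used later in the paper: ImageNet -> nighttime lights -> poverty.
p1 = Problem("ImageNet images", "object recognition")
p2 = Problem("daytime satellite images", "nighttime light intensity")
p3 = Problem("daytime satellite images (Uganda)", "poverty classification")
graph = TransferGraph(vertices=[p1, p2, p3], edges=[(p1, p2), (p2, p3)])
print(graph.sources_of(p3))   # knowledge for the poverty problem comes from the lights problem
```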

Convolutional Neural Networks

Deep learning approaches are based on automatically learning nested, hierarchical representations of data. Deep feed-forward neural networks are the typical example of deep learning models. Convolutional Neural Networks (CNN) include convolutional operations over the input and are designed specifically for vision tasks. Convolutional filters are useful for encoding translation invariance, a key concept for discovering useful features in images (Bouvrie 2006).

A CNN is a general function approximator defined by a set of convolutional and fully connected layers ordered such that the output of one layer is the input of the next. For image data, the first layers of the network typically learn low-level features such as edges and corners, and further layers learn high-level features such as textures and objects (Zeiler and Fergus 2013). Taken as a whole, a CNN is a mapping from tensors to feature vectors, which become the input for a final classifier. A typical convolutional layer maps a tensor x ∈ R^{h×w×d} to g_i ∈ R^{h×w×d} such that

g_i = p_i(f_i(W_i ∗ x + b_i)),

where for the i-th convolutional layer, W_i ∈ R^{l×l×d} is a tensor of d convolutional filter weights of size l × l, (∗) is the 2-dimensional convolution operator over the last two dimensions of the inputs, b_i is a bias term, f_i is an element-wise nonlinearity function (e.g., a rectified linear unit or ReLU), and p_i is a pooling function. The output dimensions h and w depend on the stride and zero-padding parameters of the layer, which control how the convolutional filters slide across the input. For the first convolutional layer, the input dimensions h, w, and d can be interpreted as the height, width, and number of color channels of an input image, respectively.
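As a concrete illustration of the layer equation above, the following PyTorch sketch builds one convolutional block (convolution, ReLU, pooling); the filter count, kernel size, and stride are illustrative choices, not the VGG F settings.

```python
import torch
import torch.nn as nn

# One convolutional block g_i = p_i(f_i(W_i * x + b_i)); sizes are illustrative.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=11, stride=4),  # W_i * x + b_i
    nn.ReLU(),                               # f_i: element-wise nonlinearity
    nn.MaxPool2d(kernel_size=3, stride=2),   # p_i: pooling
)

x = torch.randn(1, 3, 224, 224)   # one image with d = 3 color channels
g = conv_block(x)
print(g.shape)                    # torch.Size([1, 64, 26, 26]); output h, w depend on stride/padding
```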

In addition to convolutional layers, most CNN models have fully connected layers in the final layers of the network. Fully connected layers map an unrolled version of the input x ∈ R^{hwd}, which is a one-dimensional vector of the elements of a tensor x ∈ R^{h×w×d}, to an output g_i ∈ R^k such that

g_i = f_i(W_i x + b_i),

where W_i ∈ R^{k×hwd} is a weight matrix, b_i is a bias term, and f_i is typically a ReLU nonlinearity function. The fully connected layers encode the input examples as feature vectors, which are used as inputs to a final classifier. Since the fully connected layer looks at the entire input at once, these feature vectors "summarize" the input into a feature vector for classification. The model is trained end-to-end using minibatch gradient descent and backpropagation.
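A corresponding sketch of a fully connected layer acting on the unrolled input; again, the sizes are illustrative placeholders rather than the paper's layer dimensions.

```python
import torch
import torch.nn as nn

# Fully connected layer g_i = f_i(W_i x + b_i) on an unrolled h x w x d tensor.
h, w, d, k = 6, 6, 256, 4096              # illustrative sizes
fc = nn.Sequential(nn.Flatten(), nn.Linear(h * w * d, k), nn.ReLU())

x = torch.randn(1, d, h, w)               # a feature tensor from the convolutional stack
g = fc(x)                                 # shape (1, k): the feature vector fed to the classifier
```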

After training, the output of the final fully connected layer can be interpreted as an encoding of the input as a feature vector that facilitates classification. These features often represent complex compositions of the lower-level features extracted by the previous layers (e.g., edges and corners) and can range from grid patterns to animal faces (Zeiler and Fergus 2013; Le et al. 2012).

Combining Transfer Learning and Deep Learning

The low-level and high-level features learned by a CNN on a source domain can often be transferred to augment learning in a different but related target domain. For target problems with abundant data, we can transfer low-level features, such as edges and corners, and learn new high-level features specific to the target problem. For target problems with limited amounts of data, learning new high-level features is difficult. However, if the source and target domain are sufficiently similar, the feature representation learned by the CNN on the source task can also be used for the target problem. Deep features extracted from CNNs trained on large annotated datasets of images have been used as generic features very effectively for a wide range of vision tasks (Donahue et al. 2013; Oquab et al. 2014).

Transfer Learning for Poverty Mapping

In our approach to poverty mapping using satellite imagery, we construct a linear chain transfer learning graph with V = {P_1, P_2, P_3} and E = {(P_1, P_2), (P_2, P_3)}. The first transfer learning problem P_1 is object recognition on ImageNet (Russakovsky et al. 2014); the second problem P_2 is predicting nighttime light intensity from daytime satellite imagery; the third problem P_3 is predicting poverty from daytime satellite imagery. Recognizing the differences between ImageNet data and satellite imagery, we use the intermediate problem P_2 to learn about the bird's-eye viewpoint of satellite imagery and extract features relevant to socioeconomic development.

ImageNet to Nighttime Lights

ImageNet is an object classification image dataset of over 14 million images with 1000 class labels that, along with CNN models, have fueled major breakthroughs in many vision tasks (Russakovsky et al. 2014). CNN models trained on the ImageNet dataset are recognized as good generic feature extractors, with low-level and mid-level features such as edges and corners that are able to generalize to many new tasks (Donahue et al. 2013; Oquab et al. 2014). Our goal is to transfer knowledge from the ImageNet object recognition challenge (P_1) to the target problem of predicting nighttime light intensity from daytime satellite imagery (P_2).

In P_1, we have an object classification problem with source domain data D_1 = {(x_1i, y_1i)} from ImageNet that consists of natural images x_1i ∈ X_1 and object class labels. In P_2, we have a nighttime light intensity prediction problem with target domain data D_2 = {(x_2i, y_2i)} that consists of daytime satellite images x_2i ∈ X_2 and nighttime light intensity labels. Although satellite data is still in the space of image data, satellite imagery presents information from a bird's-eye view and at a much different scale than the object-centric ImageNet dataset (P(X_1) ≠ P(X_2)). Previous work in domains with images fundamentally different from normal "human-eye view" images typically resorts to curating a new, specific dataset such as Places205 (Zhou et al. 2014). In contrast, our transfer learning approach does not require human annotation and is much more scalable. Additionally, unsupervised approaches such as autoencoders may waste representational capacity on irrelevant features, while the nighttime light labels guide learning towards features relevant to wealth and economic development.

The National Oceanic and Atmospheric Administration (NOAA) provides annual nighttime images of the world with 30 arc-second resolution, or about 1 square kilometer (NOAA National Geophysical Data Center 2014). The light intensity values are averaged and denoised for each year to ensure that ephemeral light sources do not affect the data.

Figure 1: Locations (in white) of 330,000 sampled daytime images near DHS survey locations for the nighttime light intensity prediction problem.

The nighttime light dataset D_2 is constructed as follows. The Demographic Health Survey (DHS) Program conducts nationally representative surveys in Africa that focus mainly on health outcomes (ICF International 2015). Predicting health outcomes is beyond the scope of this paper; however, the DHS surveys offer the most comprehensive data available for Africa. Thus, we use DHS survey locations as guidelines for sampling training images (see Figure 1). Images in D_2 are daytime satellite images randomly sampled near DHS survey locations in Africa. Satellite images are downloaded using the Google Static Maps API, each with 400 × 400 pixels at zoom level 16, resulting in images similar in size to pixels in the NOAA nighttime lights data. The aggregate dataset D_2 consists of over 330,000 images, each labeled with an integer nighttime light intensity value ranging from 0 to 63.¹ We further subsample and bin the data using a Gaussian mixture model, as detailed in the companion technical report (Xie et al. 2015).

¹ Nighttime light intensities are from 2013, while the daytime satellite images are from 2015. We assume that the areas under study have not changed significantly in this two-year period, but this temporal mismatch is a potential source of error.
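The sketch below shows how one such daytime tile could be fetched with the Google Static Maps API at the stated 400 × 400 size and zoom level 16; the exact request options used for D_2 are not specified in the paper, so the parameters and the API key placeholder are assumptions.

```python
import requests

STATIC_MAPS_URL = "https://maps.googleapis.com/maps/api/staticmap"

def fetch_tile(lat, lon, api_key, out_path):
    """Download one 400x400 satellite tile at zoom 16 centered at (lat, lon)."""
    params = {
        "center": f"{lat},{lon}",
        "zoom": 16,
        "size": "400x400",
        "maptype": "satellite",
        "key": api_key,        # placeholder; supply your own key
    }
    r = requests.get(STATIC_MAPS_URL, params=params, timeout=30)
    r.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(r.content)
```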

Nighttime Lights to Poverty Estimation

The final and most important learning task P_3 is that of predicting poverty from satellite imagery, for which we have very limited training data. Our goal is to transfer knowledge from P_2, a data-rich problem, to P_3.

The target domain data D_3 = {(x_3i, y_3i)} consists of satellite images x_3i ∈ X_3 from the feature space of satellite images of Uganda and a limited number of poverty labels y_3i ∈ Y_3, detailed below. The source data is D_2, the nighttime lights data. Here, the input feature space of images is similar in both the source and target domains, drawn from a similar distribution of images (satellite images) from related areas (Africa and Uganda), implying that X_2 = X_3, P(X_2) ≈ P(X_3). The source (lights) and target (poverty) tasks both have economic elements, but are quite different.

The poverty training data D_3 relies on the Living Standards Measurement Study (LSMS) survey conducted in Uganda by the Uganda Bureau of Statistics between 2011 and 2012 (Uganda Bureau of Statistics 2012). The LSMS survey consists of data from 2,716 households in Uganda, which are grouped into 643 unique location groups. The average latitude and longitude location of the households within each group is given, with added noise of up to 5km in each direction. Individual household locations are withheld to preserve anonymity. In addition, each household has a binary poverty label based on expenditure data from the survey. We use the majority poverty classification of households in each group as the overall location group poverty label. For a given group, we sample approximately 100 1km × 1km images tiling a 10km × 10km area centered at the average household location as input. This defines the probability distribution P(X_3) of the input images for the poverty classification problem P_3.
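A sketch of how one LSMS group example could be assembled: a majority-vote poverty label and a 10 × 10 grid of roughly 1km-spaced tile centers around the (noised) average household location. The 1 km ≈ 0.009° conversion is an approximation near the equator, and the grid layout is an assumption rather than the authors' exact sampling code.

```python
import numpy as np

def group_label(household_poverty_flags):
    """Majority poverty classification of the households in a group (1 = poor)."""
    return int(np.mean(household_poverty_flags) >= 0.5)

def tile_centers(avg_lat, avg_lon, grid=10, step_deg=0.009):
    """Centers of a grid x grid block of ~1km tiles covering roughly 10km x 10km."""
    offsets = (np.arange(grid) - (grid - 1) / 2.0) * step_deg
    return [(avg_lat + dy, avg_lon + dx) for dy in offsets for dx in offsets]

centers = tile_centers(0.35, 32.58)   # hypothetical group location in Uganda
assert len(centers) == 100            # ~100 images per group
```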

Predicting Nighttime Light Intensity

Our first goal is to transfer knowledge from the ImageNet object recognition task to the nighttime light intensity prediction problem. We start with a CNN model with parameters trained on ImageNet, then modify the network to adapt it to the new task (i.e., change the classifier on the last layer to reflect the new nighttime light prediction task). We train on the new task using SGD with momentum, using ImageNet parameters as initialization to achieve knowledge transfer.

We choose the VGG F model trained on ImageNet as the starting CNN model (Chatfield et al. 2014). The VGG F model has 8 convolutional and fully connected layers. Like many other ImageNet models, the VGG F model accepts a fixed input image size of 224 × 224 pixels. Input images in D_2, however, are 400 × 400 pixels, corresponding to the resolution of the nighttime lights data.

We consider two ways of adapting the original VGG F network. The first approach is to keep the structure of the network (except for the final classifier) and crop the input to 224 × 224 pixels (random cropping). This is a reasonable approach, as the original model was trained by cropping 224 × 224 images from a larger 256 × 256 image (Chatfield et al. 2014). Ideally, we would evaluate the network at multiple crops of the 400 × 400 input and average the predictions to leverage the context of the entire input image. However, doing this explicitly with one forward pass for each crop would be too costly. Alternatively, if we allow the multiple crops of the image to overlap, we can use a convolution to compute scores for each crop simultaneously, gaining speed by reusing filter outputs at all layers. We therefore propose a fully convolutional architecture (fully convolutional).

Fully Convolutional Model

Fully convolutional models have been used successfully for spatial analysis of arbitrary size inputs (Wolf and Platt 1994; Long, Shelhamer, and Darrell 2014). We construct the fully convolutional model by converting the fully connected layers of the VGG F network to convolutional layers. This allows the network to efficiently "slide" across a larger input image and make multiple evaluations of different parts of the image, incorporating all available contextual information.

Given an unrolled h×w×d-dimensional input x ∈ R^{hwd}, fully connected layers perform a matrix-vector product

x̂ = f(Wx + b),

where W ∈ R^{k×hwd} is a weight matrix, b is a bias term, f is a nonlinearity function, and x̂ ∈ R^k is the output. In the fully connected layer, we take k inner products with the unrolled x vector. Thus, given a differently sized input, it is unclear how to evaluate the dot products.

We replace a fully connected layer by a convolutional layer with k convolutional filters of size h × w, the same size as the input. The filter weights are shared across all channels, which means that the convolutional layer actually uses fewer parameters than the fully connected layer. Since the filter size is matched with the input size, we can take an element-wise product and add, which is equivalent to an inner product. This results in a scalar output for each filter, creating an output x̂ ∈ R^{1×1×k}. Further fully connected layers are converted to convolutional layers with filter size 1 × 1, matching the new input x̂ ∈ R^{1×1×k}. Fully connected layers are usually the last layers of the network, while all previous layers are typically convolutional. After converting fully connected layers to convolutional layers, the entire network becomes convolutional, allowing the outputs of each layer to be reused as the convolution slides the network over a larger input. Instead of a scalar output, the new output is a 2-dimensional map of filter activations.
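The following PyTorch sketch shows the standard fully-connected-to-convolution rewrite described above (here keeping the full weight tensor rather than the cross-channel sharing mentioned in the text); the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

h, w, d, k = 6, 6, 256, 4096              # illustrative sizes
fc = nn.Linear(h * w * d, k)              # original fully connected layer

conv = nn.Conv2d(d, k, kernel_size=(h, w))           # k filters matching the input size
conv.weight.data = fc.weight.data.view(k, d, h, w)   # reuse the trained FC weights
conv.bias.data = fc.bias.data

x = torch.randn(1, d, h, w)
out_fc = fc(x.flatten(1))                 # shape (1, k)
out_conv = conv(x).flatten(1)             # shape (1, k): identical scores
print(torch.allclose(out_fc, out_conv, atol=1e-5))   # True
```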

In our fully convolutional model, the 400 × 400 input produces an output of size 2 × 2 × 4096, which represents the scores of four (overlapping) quadrants of the image for 4096 features. The regional scores are then averaged to obtain a 4096-dimensional feature vector that becomes the final input to the classifier predicting nighttime light intensity.
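For example, averaging the 2 × 2 map of quadrant scores into a single feature vector amounts to a mean over the spatial dimensions (a sketch; a batch × channels × height × width tensor layout is assumed):

```python
import torch

quadrant_scores = torch.randn(1, 4096, 2, 2)        # placeholder for the network output
feature_vector = quadrant_scores.mean(dim=(2, 3))   # shape (1, 4096): input to the classifier
```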

Training and Performance Evaluation

Both CNN models are trained using minibatched gradient descent with momentum. Random mirroring is used for data augmentation, along with 50% dropout on the convolutional layers replacing fully connected layers. The learning rate begins at 1e-6, a hundredth of the ending learning rate of the VGG model. All other hyperparameters are the same as in the VGG model, as described in (Chatfield et al. 2014). The VGG model parameters are obtained from the Caffe Model Zoo, and all networks are trained with Caffe (Jia et al. 2014). The fully convolutional model is fine-tuned from the pre-trained parameters of the VGG F model, but it randomly initializes the convolutional layers that replace fully connected layers.

In the process of cropping, the random cropping model throws away over 68% of the input image when predicting the class scores, losing much of the spatial context. The random cropping model achieved a validation accuracy of 70.04% after 400,200 SGD iterations. In comparison, the fully convolutional model achieved 71.58% validation accuracy after only 223,500 iterations. Both models were trained in roughly three days. Despite reinitializing the final convolutional layers from scratch, the fully convolutional model exhibits faster learning and better performance. The final fully convolutional model achieves a validation accuracy of 71.71%, trained over 345,000 iterations.

Figure 2: Left: Each row shows five maximally activating images for a different filter in the fifth convolutional layer of the CNN trained on the nighttime light intensity prediction problem. The first filter (first row) activates for urban areas. The second filter activates for farmland and grid-like patterns. The third filter activates for roads. The fourth filter activates for water, plains, and forests, terrains contributing similarly to nighttime light intensity. The only supervision used is nighttime light intensity, i.e., no labeled examples of roads or farmlands are provided. Right: Filter activations for the corresponding images on the left. Filters mostly activate on the relevant portions of the image. For example, in the third row, the strongest activations coincide with the road segments. Best seen in color. See the companion technical report for more visualizations (Xie et al. 2015). Images from Google Static Maps.

Visualizing the Extracted Features

Nighttime lights are used as a data-rich proxy, so absolute performance on this task is not directly relevant for poverty mapping. The goal is to learn high-level features that are indicative of economic development and can be used for poverty mapping in the spirit of transfer learning.

We visualize the filters learned by the fully convolutional network by inspecting the 25 maximally activating images for each filter (Figure 2, left; see the companion technical report for more visualizations (Xie et al. 2015)). Activation levels for filters in the middle of the network are obtained by passing the images forward through the filter, applying the ReLU nonlinearity, and then averaging the map of activation values. We find that many filters learn to identify semantically meaningful features such as urban areas, water, roads, barren land, forests, and farmland. Amazingly, these features are learned without direct supervision, in contrast to previous efforts to extract features from aerial imagery, which have relied heavily on large amounts of expert-labeled data, e.g., labeled examples of roads (Mnih and Hinton 2010; 2012). To confirm the semantics of the filters, we visualize their activations for the same set of images (Figure 2, right). These maps confirm our interpretation by identifying the image parts that are most responsible for activating the filter. For example, the filter in the third row mostly activates on road segments. These features are extremely useful socioeconomic indicators and suggest that transfer learning to the poverty task is possible.
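A sketch of the ranking procedure described above: score each validation image by the mean ReLU activation of one filter's map and keep the top images. Here `feature_maps` is a hypothetical callable that runs the truncated network up to the fifth convolutional layer.

```python
import torch

def filter_score(image, filter_idx, feature_maps):
    """Mean activation of one filter for a single image (forward pass + ReLU + average)."""
    with torch.no_grad():
        fmap = feature_maps(image)                   # shape (1, n_filters, h, w)
        return torch.relu(fmap[0, filter_idx]).mean().item()

def maximally_activating(images, filter_idx, feature_maps, top_k=25):
    """Indices of the top_k images that most strongly activate the chosen filter."""
    scores = [(filter_score(im, filter_idx, feature_maps), i) for i, im in enumerate(images)]
    return [i for _, i in sorted(scores, reverse=True)[:top_k]]
```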

Poverty Estimation and Mapping

The first target task we consider is to predict whether the majority of households are above or below the poverty threshold for 643 groups of households in Uganda.

Given the limited amount of training data, we do not attempt to learn new feature representations for the target task. Instead, we directly use the feature representation learned by the CNN on the nighttime lights task (P_2). Specifically, we evaluate the CNN model on new input images and feed the feature vector produced in the last layer as input to a logistic regression classifier, which is trained on the poverty task (transfer model). Approximately 100 images in a 10km × 10km area around the average household location of each group are used as input. We compare against the performance of a classifier with features from the VGG F model trained on ImageNet only (ImageNet model), i.e., without transfer learning from nighttime lights. In both the ImageNet model and the transfer model, the feature vectors are averaged over the input images for each group.
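A sketch of the transfer model pipeline: average the CNN feature vectors over the ~100 images of each group, then fit a logistic regression classifier. `extract_features` stands in for a forward pass through the nighttime-lights network, and the regularization settings here are placeholders (the cross-validated L1 setup is described below).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def group_feature(images, extract_features):
    """Average the 4096-dim CNN features over all images of one location group."""
    return np.mean([extract_features(im) for im in images], axis=0)

def fit_transfer_model(group_images, poverty_labels, extract_features):
    X = np.stack([group_feature(imgs, extract_features) for imgs in group_images])
    y = np.asarray(poverty_labels)
    clf = LogisticRegression(penalty="l1", solver="liblinear")
    return clf.fit(X, y)
```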

The Uganda LSMS survey also includes household-specific data. We extract the features that could feasibly be detected with remote sensing techniques, including roof material, number of rooms, house type, distances to various infrastructure points, urban or rural classification, annual temperature, and annual precipitation. These survey features are then averaged over each household group. The performance of the classifier trained with survey features (survey model) represents the gold standard for remote sensing techniques. We also compare with a classifier trained using the nighttime light intensities themselves as features (lights model). The nighttime light features consist of the average light intensity, summary statistics, and histogram-based features for each area. Finally, we compare with a classifier trained using a concatenation of ImageNet features and nighttime light features (ImageNet + lights model), an explicit way of combining information from both source problems.

All models are trained as logistic regression classifiers with L1 regularization, using a nested 10-fold cross-validation (CV) scheme in which the inner CV is used to tune a new regularization parameter for each outer CV iteration.

Figure 3: Left: Predicted poverty probabilities at a fine-grained 10km × 10km block level. Middle: Predicted poverty probabilities aggregated at the district level. Right: 2005 survey results for comparison (World Resources Institute 2009).

            Survey   ImgNet   Lights   ImgNet+Lights   Transfer
Accuracy    0.754    0.686    0.526    0.683           0.716
F1 Score    0.552    0.398    0.448    0.400           0.489
Precision   0.450    0.340    0.298    0.338           0.394
Recall      0.722    0.492    0.914    0.506           0.658
AUC         0.776    0.690    0.719    0.700           0.761

Table 1: Cross validation test performance for predicting aggregate-level poverty measures. Survey is trained on survey data collected in the field. All other models are based on satellite imagery. Our transfer learning approach outperforms all non-survey classifiers significantly in every measure except recall, and approaches the survey model.

The regularization parameter is found by a two-stage approach: a coarse linearly spaced search is followed by a finer linearly spaced search around the best value found in the coarse search. The tuned regularization parameter is then validated on the test set of the outer CV loop, which remained unseen as the parameter was tuned. All performance metrics are averaged over the outer 10 folds and reported in Table 1.
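A hedged sketch of this nested cross-validation scheme: for each outer fold, an inner 10-fold CV chooses the L1 regularization strength with a coarse linear grid followed by a finer grid around the best coarse value. The grid ranges are illustrative assumptions, not the values used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def tune_C(X, y, grid):
    """Inner CV: pick the inverse regularization strength C with the best mean accuracy."""
    scores = [np.mean(cross_val_score(
        LogisticRegression(penalty="l1", solver="liblinear", C=C), X, y, cv=10)) for C in grid]
    return grid[int(np.argmax(scores))]

def nested_cv_accuracy(X, y):
    accs = []
    for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        coarse = tune_C(X[train], y[train], np.linspace(0.01, 10.0, 10))              # coarse search
        best_C = tune_C(X[train], y[train], np.linspace(coarse / 2, coarse * 2, 10))  # finer search
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=best_C).fit(X[train], y[train])
        accs.append(clf.score(X[test], y[test]))   # outer test fold, unseen during tuning
    return float(np.mean(accs))
```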

Our transfer model significantly outperforms every model except the survey model in every measure except recall. Notably, the transfer model outperforms all combinations of features from the source problems, implying that transfer learning was successful in learning novel and useful features. Remarkably, our transfer model based on remotely sensed data approaches the performance of the survey model based on data expensively collected in the field. As a sanity check, we find that using simple traditional computer vision features such as HOG and color histograms only achieves slightly better performance than random guessing. This further affirms that the transfer learning features are nontrivial and contain information more complex than just edges and colors.

To understand the high recall of the lights model, we analyze the conditional probability of predicting "poverty" given that the average light intensity is zero: the lights model predicts "poverty" almost 100% of the time, though only 51% of groups with zero average intensity are actually below the poverty line. Furthermore, only 6% of groups with nonzero average light intensity are below the poverty line, explaining the high recall of the lights model. In contrast, the transfer model predicts "poverty" in 52% of groups where the average nighttime light intensity is 0, more accurately reflecting the actual probability. The transfer model features (visualized in Figure 2) clearly contain additional, meaningful information beyond what nighttime lights can provide. The fact that the transfer model outperforms the lights model indicates that transfer learning has succeeded.
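This check amounts to comparing two conditional frequencies on the zero-light groups, for example:

```python
import numpy as np

def zero_light_rates(predictions, true_labels, avg_light_intensity):
    """Fraction of zero-light groups predicted poor vs. fraction actually poor."""
    zero = np.asarray(avg_light_intensity) == 0
    predicted_poor = np.mean(np.asarray(predictions)[zero])
    actually_poor = np.mean(np.asarray(true_labels)[zero])
    return predicted_poor, actually_poor
```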

Mapping Poverty Distribution

Using our transfer model, we can scalably and inexpensively construct fine-grained poverty maps at the country or even continent level. We evaluate this capability by estimating a country-level poverty map for Uganda. We download over 370,000 satellite images covering Uganda and estimate poverty probabilities at 1km × 1km resolution with the transfer model. Areas where the model assigns a low probability of being impoverished are colored green, while areas assigned a high risk of poverty are colored red. A 10km × 10km resolution map is shown in Figure 3 (left), smoothed at a 0.5 degree radius for easy identification of dominant spatial patterns. Notably, poverty reduction in northern Uganda is lagging (Ministry of Finance 2014). Figure 3 (middle) shows poverty estimates aggregated at the district level. As a validity check, we qualitatively compare this map against the most recent map of poverty rates available (Figure 3, right), which is based on 2005 survey data (World Resources Institute 2009). This data is now a decade old, but it loosely corroborates the major patterns in our predicted distribution. Whereas current maps are coarse and outdated, our method offers much finer temporal and spatial resolution and an inexpensive way to evaluate poverty at a global scale.

Conclusion

We introduce a new transfer learning approach for analyzing satellite imagery that leverages recent deep learning advances and multiple data-rich proxy tasks to learn high-level feature representations of satellite images. This knowledge is then transferred to data-poor tasks of interest in the spirit of transfer learning. We demonstrate an application of this idea in the context of poverty mapping and introduce a fully convolutional CNN model that, without explicit supervision, learns to identify complex features such as roads, urban areas, and various terrains. Using these features, we are able to approach the performance of data collected in the field for poverty estimation. Remarkably, our approach outperforms models based directly on the data-rich proxies used in our transfer learning pipeline. Our approach can easily be generalized to other remote sensing tasks and has great potential to help solve global sustainability challenges.

Acknowledgements

We acknowledge the support of the Department of Defense through the National Defense Science and Engineering Graduate Fellowship Program. We would also like to thank NVIDIA Corporation for their contribution to this project through an NVIDIA Academic Hardware Grant.

References

Abelson, B.; Varshney, K.; and Sun, J. 2014. Targeting direct cash transfers to the extremely poor. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1563–1572. ACM.
Bouvrie, J. 2006. Notes on convolutional neural networks.
Chatfield, K.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2013. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR abs/1310.1531.
ICF International. 2015. Demographic and health surveys (various) [datasets].
Independent Expert Advisory Group Secretariat. 2014. A world that counts: Mobilising the data revolution for sustainable development. Technical report.
Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R. B.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. CoRR abs/1408.5093.
Le, Q. V.; Ranzato, M.; Monga, R.; Devin, M.; Chen, K.; Corrado, G. S.; Dean, J.; and Ng, A. Y. 2012. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning.
Long, J.; Shelhamer, E.; and Darrell, T. 2014. Fully convolutional networks for semantic segmentation. CoRR abs/1411.4038.
Ministry of Finance. 2014. Poverty status report 2014: Structural change and poverty reduction in Uganda.
Mnih, V., and Hinton, G. E. 2010. Learning to detect roads in high-resolution aerial images. In Computer Vision–ECCV 2010. Springer. 210–223.
Mnih, V., and Hinton, G. E. 2012. Learning to label aerial images from noisy data. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 567–574.
Murthy, K.; Shearn, M.; Smiley, B. D.; Chau, A. H.; Levine, J.; and Robinson, D. 2014. Skysat-1: Very high-resolution imagery from a small satellite. In SPIE Remote Sensing, 92411E–92411E. International Society for Optics and Photonics.
NOAA National Geophysical Data Center. 2014. F18 2013 nighttime lights composite.
Oquab, M.; Bottou, L.; Laptev, I.; and Sivic, J. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, 1717–1724. Washington, DC, USA: IEEE Computer Society.
Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.
Razavian, A. S.; Azizpour, H.; Sullivan, J.; and Carlsson, S. 2014. CNN features off-the-shelf: An astounding baseline for recognition. CoRR abs/1403.6382.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2014. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 1–42.
Uganda Bureau of Statistics. 2012. Uganda national panel survey 2011/2012.
United Nations. 2015. The Millennium Development Goals report 2015.
Varshney, K. R.; Chen, G. H.; Abelson, B.; Nowocin, K.; Sakhrani, V.; Xu, L.; and Spatocco, B. L. 2015. Targeting villages for rural development using satellite image analysis. Big Data 3(1):41–53.
Wolf, R., and Platt, J. C. 1994. Postal address block location using a convolutional locator network. In Advances in Neural Information Processing Systems, 745–752. Morgan Kaufmann Publishers.
World Resources Institute. 2009. Mapping a better future: How spatial analysis can benefit wetlands and reduce poverty in Uganda.
Xie, M.; Jean, N.; Burke, M.; Lobell, D.; and Ermon, S. 2015. Transfer learning from deep features for remote sensing and poverty mapping. CoRR abs/1510.00098.
Zeiler, M. D., and Fergus, R. 2013. Visualizing and understanding convolutional networks. CoRR abs/1311.2901.
Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; and Oliva, A. 2014. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, 487–495.

Appendix: Data Preparation

Of the over 330,000 images in D_2, 58% of the images are labeled with zero nighttime light intensity. This is a very unbalanced dataset, which is difficult to learn from. We thus alter P(X_2) by choosing to upsample images with higher intensities and downsample images with zero intensity until the least frequently occurring intensity has at least half the occurrences of the most frequent class. Observing that examples at similar intensity levels are hard to distinguish, a relabeled dataset with three integer label bins ranging from 0 to 2 was created by clustering the intensity levels by frequency using a 3-component Gaussian mixture model and then balancing the dataset for three classes. The 3-class balanced dataset consists of 150,000 training images and 8,000 validation images. We find that binning by frequency separates the intensity classes more intuitively. Because predicting absolute nighttime light intensity is not the final target task, it is more important for the model to extract features that are semantically meaningful than to predict light intensity with the highest accuracy.
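A sketch of the relabeling step with scikit-learn: fit a 3-component Gaussian mixture to the integer intensities and map each image to an ordered bin in {0, 1, 2}. The exact preprocessing (e.g., how frequencies enter the clustering) is not fully specified in the text, so this is an approximation rather than the authors' procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bin_intensities(intensities, n_bins=3, seed=0):
    """Cluster nighttime light intensities (0-63) into n_bins ordered classes."""
    values = np.asarray(intensities, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_bins, random_state=seed).fit(values)
    labels = gmm.predict(values)
    order = np.argsort(gmm.means_.ravel())            # re-index so bins increase with intensity
    remap = {int(old): new for new, old in enumerate(order)}
    return np.array([remap[int(l)] for l in labels])
```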

Filter Visualizations

We provide 25 maximally activating images in the validation set and their activation maps for four filters in our CNN model (Figures 4, 5, 6, 7). The activation maps indicate the locations where the filter activated the most. These filters seem to activate for different terrain types, man-made structures, and roads, all of which can be useful socioeconomic indicators.

Figure 4: A set of 25 maximally activating images and their corresponding activation maps for a filter in the fifth convolutional layer of the network trained on the 3-class nighttime light intensity prediction task. This filter seems to activate for urban areas, which indicate economic development.

Figure 5: A set of 25 maximally activating images and their corresponding activation maps for a filter in the fifth convolutional layer of the network trained on the 3-class nighttime light intensity prediction task. This filter seems to activate for roads, which are indicative of infrastructure and economic development.

Figure 6: A set of 25 maximally activating images and their corresponding activation maps for a filter in the fifth convolutional layer of the network trained on the 3-class nighttime light intensity prediction task. This filter seems to activate for water, barren, and forested lands, which it appears to group together as contributing similarly to nighttime light intensity.

Figure 7: A set of 25 maximally activating images and their corresponding activation maps for a filter in the fifth convolutional layer of the network trained on the 3-class nighttime light intensity prediction task. This filter seems to activate for farmland and for grid-like patterns, which are common in human-made structures.