
Can big data help in the production of reliable local area statistics?

Partha Lahiri

Joint Program in Survey Methodology
University of Maryland, College Park, USA

SDAL, Virginia Tech.

January 28, 2015

SDAL January 28, 2015 1 / 27

Ref: http://farmdocdaily.illinois.edu/2013/07/concentration-corn-soybean-production.html

Based on average yearly production of corn during 2010-12 using NASS/USDA data


Remote Sensing for Crop Acreage

The NASS-USDA has been publishing county estimates of crop acreage, crop production, crop yield and livestock inventories since 1917.

Uses: local agricultural decision making, and payments to farmers when crop yields fall below certain levels.

Can earth resources satellite data provide a useful ancillary data source for county estimates of crop acreage?

Satellite information is recorded for pixels (a term for picture elements). A pixel covers about 0.45 hectares.

Based on satellite readings in early fall, it is possible to classify the crop cover of all pixels. This generates big data.
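As a rough illustration of how classified pixels turn into an area figure, the sketch below counts the pixels assigned to each class and multiplies by the approximate pixel size of 0.45 hectares quoted above. The function name `hectares_by_class` and the toy label list are hypothetical, and this is only the naive pixel-count calculation, not an estimator proposed in the talk.

```python
from collections import Counter

HECTARES_PER_PIXEL = 0.45  # approximate pixel size quoted in the text


def hectares_by_class(pixel_labels):
    """Naive area per class: pixel count times hectares per pixel."""
    counts = Counter(pixel_labels)
    return {label: n * HECTARES_PER_PIXEL for label, n in counts.items()}


# Hypothetical classified pixels for one small piece of land.
labels = ["corn"] * 200 + ["soybeans"] * 100 + ["other"] * 20
areas = hectares_by_class(labels)
print(areas)
```

Pixels can be misclassified, so counts like these are better treated as auxiliary information for a survey-based estimate than as final acreage figures.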

Ref: http://www.nass.usda.gov/Statistics-by-State/Iowa/Publications/Cropland-Data-Layer/2011/index.asp

[Figure: 2011 Cropland Data Layer map of Hardin County, Iowa, with a scale bar in miles (0 to 4.24).]

Land cover categories (by decreasing acreage*)

AGRICULTURE: Corn, Soybeans, Grassland Herbaceous, Alfalfa, Other Hay/Non Alfalfa, Oats, Winter Wheat, Rye, Fallow/Idle Cropland, Sod/Grass Seed

NON-AGRICULTURE: Developed/Open Space, Deciduous Forest, Developed/Low Intensity, Woody Wetlands, Open Water, Developed/Medium Intensity

Produced by CropScape - http://nassgeodata.gmu.edu/CropScape
* Only the top 6 non-agriculture categories are listed.

Remote Sensing for Crop Acreage

Bellow et al.

NASS has been a user of remote sensing products since the 1950’s when it began using midaltitude aerial photography to construct area sampling frames (ASF’s) for the 48 states of the continental United States. A new era in remote sensing began in 1972 with the launch of the Landsat I earth-resource monitoring satellite. Four additional Landsats have been launched since 1972, with Landsat IV and V still in operation in 1993. The polar-orbiting Landsat satellites contain a multi-spectral scanner (MSS) that measures reflected energy in four bands of the electromagnetic spectrum for an area of just under one acre. The spectral bands were selected to be responsive to vegetation characteristics.

Remote Sensing for Crop Acreage

In addition to the MSS sensor, Landsats IV and V have a Thematic Mapper (TM) sensor which measures seven energy bands and has increased spatial resolution. The large area (185 by 170 km) and repeat (16 day per satellite) coverage of these satellites opened new areas of remote sensing research: large area crop inventories, crop yields, land cover mapping, area frame stratification, and small area crop cover estimation.

Ref: Battese, Harter and Fuller (1988 JASA)


Unit Level Model

$y_{ij}$: value of the study variable for the $j$th unit of the $i$th small area population ($i = 1, \cdots, m$; $j = 1, \cdots, N_i$)

We are interested in estimating the finite population means:

$$\bar{Y}_i = N_i^{-1} \sum_{j=1}^{N_i} y_{ij}.$$

Nested Error Regression Model

$$y_{ij} = x_{ij}'\beta + v_i + e_{ij},$$

where $x_{ij}$ is a $p \times 1$ column vector of known auxiliary variables; $\{v_i\}$ and $\{e_{ij}\}$ are all independent with $v_i \overset{iid}{\sim} N(0, \sigma_v^2)$ and $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$.
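To make the structure of the nested error regression model concrete, here is a minimal simulation sketch. It draws one area effect per area and one unit-level error per unit, then computes the finite population mean of each area. All numbers, and the names `simulate_nested_error` and `area_means`, are hypothetical choices for illustration only.

```python
import random


def simulate_nested_error(x_by_area, beta, sigma_v, sigma_e, seed=0):
    """Draw y_ij = x_ij' beta + v_i + e_ij for every unit j in every area i."""
    rng = random.Random(seed)
    y_by_area = []
    for x_units in x_by_area:
        v_i = rng.gauss(0.0, sigma_v)  # area effect, shared by all units in area i
        y_units = []
        for x_ij in x_units:
            fixed_part = sum(b * x for b, x in zip(beta, x_ij))
            y_units.append(fixed_part + v_i + rng.gauss(0.0, sigma_e))
        y_by_area.append(y_units)
    return y_by_area


def area_means(y_by_area):
    """Finite population mean of each area: (1 / N_i) * sum_j y_ij."""
    return [sum(y_units) / len(y_units) for y_units in y_by_area]


# Hypothetical population: 2 areas, covariate vectors of the form (1, x1, x2).
x_by_area = [
    [(1, 100, 50), (1, 120, 40), (1, 90, 60)],
    [(1, 200, 10), (1, 180, 30)],
]
y = simulate_nested_error(x_by_area, beta=(5.0, 0.3, -0.1), sigma_v=10.0, sigma_e=12.0)
means = area_means(y)
```

Because the same draw of the area effect enters every unit of an area, units within an area are correlated, which is what lets the model borrow strength across areas.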

An Example

Estimation of the number of hectares of corn for 12 Iowa counties based on the 1978 June Enumerative Survey and satellite data.

$y_{ij}$: the number of hectares of corn in the $j$th segment of the $i$th county as reported in the June Enumerative Survey.

$x_{ij}' = (1, x_{1ij}, x_{2ij})$, where $x_{1ij}$ ($x_{2ij}$) is the number of pixels classified as corn (soybean) in the $j$th segment of the $i$th county.

$\bar{X}_i' = (1, \bar{X}_{1i}, \bar{X}_{2i})$, where $\bar{X}_{1i}$ ($\bar{X}_{2i}$) is the mean number of pixels per segment classified as corn (soybean) for county $i$.

EBLUP

EBLUP (EB) estimators of $\bar{Y}_i$:

$$y_i^{EB} = f_i \hat{\bar{Y}}_i^{Reg} + (1 - f_i)\{(1 - B_i)\hat{\bar{Y}}_i^{Reg} + B_i \hat{\bar{Y}}_i^{Syn}\},$$

where

$$B_i = \frac{\sigma_e^2/n_i}{\sigma_v^2 + \sigma_e^2/n_i}, \qquad \hat{\bar{Y}}_i^{Reg} = \bar{y}_i + (\bar{X}_i - \bar{x}_i)'\beta, \qquad \hat{\bar{Y}}_i^{Syn} = \bar{X}_i'\beta.$$

Any standard variance component estimation method (e.g., REML) can be used to estimate $\sigma_v^2$ and $\sigma_e^2$.

$\beta$: the weighted least squares estimator with estimated variance components
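The EBLUP formula can be sketched for a single area as follows. This is only an illustration under stated assumptions: the regression coefficients and variance components are plugged in as given numbers rather than estimated, $f_i$ is taken to be the sampling fraction $n_i / N_i$ (the slide does not define it), and every input value is hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def eblup_area_mean(y_bar, x_bar, X_bar, beta, sigma_v2, sigma_e2, n_i, f_i=0.0):
    """EBLUP of one area mean, with beta and variance components plugged in.

    y_bar, x_bar: sample means of y and x in the area.
    X_bar: population mean of x in the area.
    f_i: assumed here to be the sampling fraction n_i / N_i.
    """
    B_i = (sigma_e2 / n_i) / (sigma_v2 + sigma_e2 / n_i)
    # Survey regression estimator: sample mean adjusted for the covariate gap.
    reg = y_bar + dot([X - x for X, x in zip(X_bar, x_bar)], beta)
    # Regression synthetic estimator: uses the covariates only.
    syn = dot(X_bar, beta)
    return f_i * reg + (1.0 - f_i) * ((1.0 - B_i) * reg + B_i * syn)


# Hypothetical inputs for one county with n_i = 3 sampled segments.
estimate = eblup_area_mean(
    y_bar=120.0, x_bar=(1, 300, 200), X_bar=(1, 295, 205),
    beta=(5.0, 0.3, -0.1), sigma_v2=100.0, sigma_e2=300.0, n_i=3,
)
```

With these inputs $B_i = 0.5$, so the result sits halfway between the survey regression and synthetic components. A small sample size or a small $\sigma_v^2$ pushes $B_i$ toward 1 and the estimate toward the synthetic part.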

[Figure: Plots of Survey-Weighted Poverty Rates and SAE for a Small County (drawn by Sam Hawala)]

[Figure: Plots of Estimated SE of Survey-Weighted Poverty Rates and SAE for a Small County (drawn by Sam Hawala)]

A Cross-Sectional Model

Ref: Fay and Herriot (JASA 1979)

For $i = 1, \cdots, m$,

Level 1 (Sampling Distribution): $y_i = \theta_i + e_i$;

Level 2 (Linking Distribution): $\theta_i = x_i'\beta + v_i$,

where

$y_i$: direct survey estimate of the true small area mean $\theta_i$ for area $i$;

$x_i$: $p \times 1$ vector of known auxiliary variables coming from big data;

$\{e_i\}$ and $\{v_i\}$ are independent with $e_i \sim N(0, \psi_i)$ and $v_i \sim N(0, \sigma_v^2)$; the $\psi_i$'s are assumed to be known.

The $p \times 1$ vector of regression coefficients $\beta$ and the model variance $\sigma_v^2$ are unknown.
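Under this two-level model, the usual empirical Bayes estimate of $\theta_i$ shrinks the direct estimate toward the regression fit: $(1 - B_i) y_i + B_i x_i'\hat\beta$ with $B_i = \psi_i / (\psi_i + \sigma_v^2)$. The sketch below illustrates that form for the special case of an intercept plus one covariate. It assumes $\sigma_v^2$ is supplied (in practice it would be estimated), and the function name and data are hypothetical.

```python
def fay_herriot_eb(y, x, psi, sigma_v2):
    """EB estimates under a Fay-Herriot model with an intercept and one covariate.

    y: direct estimates; x: covariate values; psi: known sampling variances.
    sigma_v2: model variance, taken as given in this sketch.
    """
    w = [1.0 / (sigma_v2 + p) for p in psi]  # WLS weights: inverse total variance
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = y_w - b1 * x_w
    estimates = []
    for yi, xi, p in zip(y, x, psi):
        B_i = p / (p + sigma_v2)  # more shrinkage when the sampling variance is large
        synthetic = b0 + b1 * xi
        estimates.append((1.0 - B_i) * yi + B_i * synthetic)
    return estimates, (b0, b1)


# Hypothetical direct poverty rates, one covariate, and sampling variances.
y = [0.20, 0.31, 0.15, 0.27]
x = [0.10, 0.22, 0.08, 0.15]
psi = [0.004, 0.0005, 0.006, 0.001]
eb, (b0, b1) = fay_herriot_eb(y, x, psi, sigma_v2=0.001)
```

Areas with large sampling variance lean mostly on the regression fit, while areas with precise direct estimates stay close to them.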

Auxiliary Variables from big data

The proportion of child exemptions reported by families in poverty on their tax returns.

The proportion of people under 65 who did not file income tax returns.

The proportion of people receiving food stamps.

A Time Series Cross-Sectional Model

Ref: Datta, Lahiri, Maiti and Lu (1999); Datta, Lahiri and Maiti (2002)

For $i = 1, \cdots, m$; $t = 1, \cdots, T$,

Level 1: $y_{it} = \theta_{it} + e_{it}$;

Level 2: $\theta_{it} = x_{it}'\beta + v_i + u_{it}$;

Level 3: $u_{it} = u_{i,t-1} + \varepsilon_{it}$,

where

$y_{it}$: direct survey estimate of the median income of four-person families for state $i$, year $t$

$e_{it}$: sampling error

$x_{it}$: auxiliary variables coming from big data (previous census and administrative records)

$v_i$: state-specific random effects

$u_{it}$: state- and year-specific random effects
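The three levels can be read as a recipe for generating data, which the following sketch simulates. It is an illustration only: the starting value $u_{i0} = 0$, the common sampling standard deviation, the normal draws for every error term, and all numbers are assumptions not stated on the slide, and the function name is hypothetical.

```python
import random


def simulate_time_series_model(x, beta, sigma_v, sigma_eps, sampling_sd, seed=0):
    """Simulate (theta_it, y_it) from the three-level model; x[i][t] is x_it."""
    rng = random.Random(seed)
    theta, y = [], []
    for x_i in x:
        v_i = rng.gauss(0.0, sigma_v)  # state-specific effect
        u_it = 0.0                     # assumed starting value u_i0 = 0
        theta_i, y_i = [], []
        for x_it in x_i:
            u_it += rng.gauss(0.0, sigma_eps)  # Level 3: random walk step
            theta_it = sum(b * c for b, c in zip(beta, x_it)) + v_i + u_it  # Level 2
            theta_i.append(theta_it)
            y_i.append(theta_it + rng.gauss(0.0, sampling_sd))  # Level 1
        theta.append(theta_i)
        y.append(y_i)
    return theta, y


# Hypothetical inputs: 2 states, 3 years, covariate vectors of the form (1, x1).
x = [
    [(1, 30.0), (1, 31.0), (1, 32.5)],
    [(1, 40.0), (1, 41.5), (1, 43.0)],
]
theta, y = simulate_time_series_model(
    x, beta=(2.0, 1.0), sigma_v=1.5, sigma_eps=0.5, sampling_sd=2.0
)
```

The random walk in Level 3 is what links a state's years together, so earlier years carry information about the current one.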

Estimates of Coefficients of Variation of CPS Direct Estimates of Median Income of 4-person Families in the US States: Year 1989

[Figure: U.S. state level CV, CPS; scale marked at 2.5, 5.0, 7.5, 10.0 and 12.5.]

Estimates of Coefficients of Variation of EB Estimates of Median Income of 4-person Families in the US States: Year 1989

[Figure: U.S. state level CV, EB; scale marked at 2.5, 5.0, 7.5, 10.0 and 12.5.]
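For readers who want the quantity behind these two figures spelled out: a coefficient of variation is the standard error of an estimate divided by the estimate itself. The sketch below expresses it in percent, which is an assumption about the scale used in the figures. The income and standard error values are hypothetical.

```python
def cv_percent(estimate, standard_error):
    """Coefficient of variation in percent: 100 * SE / |estimate|."""
    return 100.0 * standard_error / abs(estimate)


# Hypothetical state: same median income estimate, two different standard errors.
cv_direct = cv_percent(40000.0, 3000.0)
cv_model = cv_percent(40000.0, 1000.0)
```

A smaller CV for the same estimate means a more precise estimator, which is the comparison the two figures make state by state.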

A Plot of Absolute Residuals From a Simple Linear Regression

Dep. Variable: 1989 Median Income Estimates from 1990 Census
Indep. Variable: CPS or EB Estimates for 1989

[Figure: Plot of absolute residual versus state. x-axis: State (0 to 50); y-axis: Absolute residual (0 to 10000); series: CPS, EB.]
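The comparison in this plot can be sketched as two simple linear regressions of the census figures, one on each set of estimates, followed by a look at the absolute residuals. The code below is a minimal illustration with hypothetical numbers for four states. The function name is an assumption and the values are not taken from the talk.

```python
def abs_residuals(y, x):
    """Absolute residuals from the least squares line of y on x."""
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [abs(yi - (intercept + slope * xi)) for xi, yi in zip(x, y)]


# Hypothetical median incomes (dollars) for four states.
census = [36000.0, 41000.0, 33000.0, 45000.0]
cps = [34000.0, 44000.0, 30000.0, 43000.0]
eb = [35500.0, 41800.0, 32500.0, 44600.0]

resid_cps = abs_residuals(census, cps)
resid_eb = abs_residuals(census, eb)
```

Plotting `resid_cps` and `resid_eb` against the state index gives a picture of the same kind as the one described above, with smaller residuals indicating estimates that track the census values more closely.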

Poverty mapping: the Chilean Case

High poverty rates can work in favor of a Chilean municipality in terms of securing more funds from the Chilean central government.

Consider the following situation. For a given small municipality, the poverty rate for the current year turns out to be high under the standard design-based method.

How do we convince the mayor of that municipality to go for a statistically efficient SAE method that yields a lower poverty rate?

Plots of Survey-Weighted Poverty Rates and SAE for Selected Comunas (drawn by Carolina Casas-Cordero)

[Figure: Estimates of poverty rates for comunas, Chile. Panels: concón, hualpén, lolol, santiago. x-axis: Year (2000 to 2012); y-axis: Poverty Rate (0 to .4); series: Direct, SAE. Source: Casen Survey 2000 to 2011.]

Initial set of auxiliary variables

| Number and name of the auxiliary variable | Institution responsible for data collection | Frequency of publication of the data |
| --- | --- | --- |
| #1. Subsidio Familiar (family subsidy) | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | Monthly and yearly |
| #2. Subsidio al Pago del Consumo de Agua Potable y Servicio de Alcantarillado de Aguas Servidas (drinking water and sewerage subsidy) | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | Monthly and yearly |
| #3. Bono Chile Solidario | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | Monthly and yearly |
| #4. Subsidio de Discapacidad Mental (mental disability subsidy) | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | Monthly and yearly |
| #5. Pensión Básica Solidaria (vejez e invalidez) (basic solidarity pension, old age and disability) | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | December |
| #6. Aporte Previsional Solidario (vejez e invalidez) (solidarity pension contribution, old age and disability) | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | December |
| #7. Bonificación al Ingreso Ético Familiar | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | Monthly and yearly |
| #8. Beca de Apoyo a la Retención Escolar, BARE (school retention scholarship) | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | Monthly and yearly |
| #9. Afiliados Sistema de Capitalización Individual (members of the individual capitalization pension system) | Superintendencia de Pensiones | Monthly and yearly |
| #10. Matrícula (school enrollment) | Ministerio de Educación | Yearly |
| #11. Rendimiento (school performance) | Ministerio de Educación | Yearly |
| #12. SIMCE | Ministerio de Educación | Yearly or every two years |
| #13. Titulados Educación Superior (higher education graduates) | Ministerio de Educación | Yearly |
| #14. Índice de Vulnerabilidad del Establecimiento (IVE-SINAE) (school vulnerability index) | Junta Nacional Escolar y Becas (Junaeb) | Yearly |
| #15. Situación Nutricional estudiantes básica y media (nutritional status of primary and secondary students) | Junta Nacional Escolar y Becas (Junaeb) | Yearly |
| #16. Población beneficiaria Fonasa (Fonasa beneficiary population) | Ministerio de Salud | Yearly |
| #17. Atenciones sector privado (private-sector health care visits) | Ministerio de Salud | Yearly |
| #18. Razón de analfabetos respecto a la población de 10 y más años en la comuna (ratio of illiterate persons to the population aged 10 and over in the comuna) | CENSO, INE | Every 10 years |
| #19. Porcentaje de Población Rural (percentage of rural population) | CENSO, INE | Every 10 years |
| #20. Porcentaje de Asistencia Escolar Comunal (comuna school attendance percentage) | SINIM | Monthly |
| #21. Tamaño promedio del hogar (average household size) | CENSO, INE | Every 10 years |
| #22. Tasa de pobreza histórica (historical poverty rate) | CASEN | Every 2 or 3 years |
| #23. Contribuciones de Vivienda (housing property taxes) | SII (http://www.sii.cl/avaluaciones/estadisticas/estadisticas_bbrr.htm#2) | Yearly |
| #24. Remuneraciones promedio de los trabajadores dependientes (average wages of dependent workers) | | Yearly |

Source: Ministerio de Desarrollo Social (2013a).

Regression Analysis

| Independent variables | Regression coefficient estimate (original comuna weights) | (t-statistic) |
| --- | --- | --- |
| Average wage of dependent workers (log) | -0.09575646 | (3.52**) |
| Average of the poverty rate from Casen 2000, 2003 and 2006 (arcsin) | 0.49548266 | (7.92**) |
| % of population in rural areas (arcsin) | -0.13409847 | (4.96**) |
| % of illiterate population (arcsin) | 0.40349163 | (2.57*) |
| % of population attending school (arcsin) | -0.21883535 | (2.23*) |
| Dummy for region 7 (=1) | 0.03442978 | (2.11*) |
| Dummy for region 8 (=1) | 0.03882056 | (2.67**) |
| Dummy for region 9 (=1) | 0.105632 | (6.04**) |
| Constant | 1.61477028 | (4.24**) |

Number of observations: 235

Adjusted $R^2$: 0.67

[Figure: Length of the direct and parametric bootstrap confidence intervals of the comuna-level poverty rates, for comunas sorted by the limited translation empirical Bayes estimates of the poverty rate.]

"...D.J. Finney once wrote about the statistician whose client comes in and says, 'Here is my mountain of trash. Find the gems that lie therein.' Finney's advice was to not throw him out of the office but to attempt to find out what he considers 'gems'. After all, if the trained statistician does not help, he will find someone who will...." David Salsburg, ASA Connect Discussion

First Latin American ISI Satellite Meeting on Small Area Estimation

August 3-5, 2015, Santiago, Chile
International Statistical Institute (ISI) Satellite Meeting

At Pontificia Universidad Católica de Chile

Invited Talks:

Malay Ghosh: “Small Area Estimation with Health Applications”
Wayne Fuller: “Bootstrap Methods for Small Area Predictions”
Partha Lahiri: “Recent Advances in Poverty Mapping Methodology”
Angela Luna, Nikos Tzavidis and LiChun Zhang: “From start to finish: Specify – Adapt – Evaluate (SAE)”
Danny Pfeffermann and Richard Tiller: “Small Area Labor Force Statistics using Time Series Models”
J.N.K. Rao: “Measuring Uncertainty of Small Area Estimators”

Special Topics, Contributed & Poster Sessions: Submit abstracts by April 15, 2015 at [email protected]. Accepted on a first-come basis.

Language of the conference: English

Website: http://www.encuestas.uc.cl/sae2015/

Main Organizer: Centro de Encuestas y Estudios Longitudinales, Universidad Católica de Chile.

Co-organizers: International Statistical Institute (ISI), International Association of Survey Statisticians (IASS), Sociedad Chilena de Estadística (SOCHE), Instituto Nacional de Estadísticas (INE), Ministerio de Desarrollo Social (MDS), and the Departamento de Estadística, Departamento de Salud Pública and Instituto de Sociología of the Universidad Católica de Chile.

Purpose:

We hope that this meeting will serve as a bridge between mathematical statisticians and practitioners working on small area estimation in academia, private and government agencies.

This meeting in Santiago will give researchers an opportunity to learn about state-of-the-art small area estimation techniques from the experts in the field.

Journal of the Royal Statistical Society (JRSS) Series A: Special Issue on SAE!

THANK YOU!
