A Synthetic Population Generator that Matches Both Household and Person Attribute Distributions

Xin Ye, Ram M. Pendyala, Karthik C. Konduri, Bhargava Sana

Department of Civil and Environmental Engineering

Outline
1. Introduction
2. Iterative Proportional Fitting (IPF) Algorithm
   • Example to Illustrate the Algorithm
3. Iterative Proportional Updating (IPU) Algorithm
   • Example to Illustrate the Algorithm
   • Geometric Interpretation
4. Population Synthesis for Small Geographies
   • Zero-cell Problem
   • Zero-marginal Problem
5. Case Study
   • Estimating Weights
   • Creating Synthetic Households
   • Performance of the Algorithm
6. Flowchart

Introduction
• Emergence of activity-based microsimulation approaches in travel demand analysis
• Microsimulation models simulate activity-travel patterns subject to spatio-temporal constraints and various agent interactions
  - Examples: AMOS, FAMOS, CEMDAP, ALBATROSS, TASHA, etc.
• Tour-based models have been implemented in some cities and regions, including San Francisco, New York, and Puget Sound

Introduction
• Activity-based models operate at the level of the individual traveler
• Calibration, validation, and application of these models require household and person attribute data for the entire population in a region
• Disaggregate data for the complete population are generally not available
• Data available
  - Disaggregate data for a sample of the population from PUMS or household travel surveys
  - Aggregate distributions of household and person attributes for the population from Census Summary Files or agency forecasts
• Challenge: how to obtain household and person attribute data for the population in a region from the available data?
  - Create a synthetic population
  - Select households and persons from the sample to match joint distributions of key population characteristics

Iterative Proportional Fitting
• Joint distributions of population characteristics are not readily available
• They can be estimated using the Iterative Proportional Fitting (IPF) procedure
• The IPF procedure takes frequency tables constructed from PUMS or household travel surveys as priors
• Marginal distributions from the Census Summary Files (base year) or population forecasts (future year) are used as controls

Iterative Proportional Fitting (IPF)
• Deming and Stephan (1941) presented the method to adjust sample frequency tables to match known marginal distributions using a least squares approach
• Wong (1992) showed that IPF yields maximum entropy estimates

Iterative Proportional Fitting
• Synthetic Baseline Populations (Beckman 1996)
  - Proposed a method to create a synthetic population based on IPF
  - The joint distribution of household attributes was estimated using IPF
  - Synthetic households were generated by randomly selecting households from the sample based on the estimated joint distributions
  - The synthetic population comprised the persons from the selected households
• This method has been widely adopted in travel demand models (TDMs) based on activity-based approaches

Iterative Proportional Fitting
• Limitations of the Beckman (1996) procedure
  - The procedure controls only for household attributes, not person attributes
  - As a result, synthetic populations fail to match given distributions of person characteristics
  - The method assumes that all sample households contributing to a particular household type have the same structure (i.e., a similar composition of individuals)
  - However, household structures generally differ even within the same household type, hence the need for different weights based on household structure
• Guo and Bhat (2007) and Arentze (2007) constitute initial attempts to control household- and person-level attributes simultaneously
• The proposed Iterative Proportional Updating (IPU) algorithm
  - Simultaneously controls for both household and person attributes of interest
  - Reallocates the weights of households within the same household type to account for differences in their household structures

IPF Example

Initial frequencies are from PUMS or household travel surveys; marginals are from Census Summary Files or agency forecasts.

Household Size Category           High Income   Low Income   Total   Size Marginal   Size Adjustment
1                                 3.0           1.0          4.0     30.0            --
2                                 2.0           4.0          6.0     40.0            --
3 or more                         2.0           1.0          3.0     30.0            --
Total                             7.0           6.0
Household Income Marginals        60.0          40.0
Adjustment for Household Income   --            --

IPF Example

Iter 1: Adjust for Hhld Income (adjustment factors, adjusted frequencies, adjusted totals)

Household Size Category           High Income   Low Income   Total   Size Marginal   Size Adjustment
1                                 25.7          6.7          32.4    30.0            --
2                                 17.1          26.7         43.8    40.0            --
3 or more                         17.1          6.7          23.8    30.0            --
Total                             60.0          40.0
Household Income Marginals        60.0          40.0
Adjustment for Household Income   8.57          6.67

Iter 1: Adjust for Hhld Size (adjustment factors, adjusted frequencies, adjusted totals)

Household Size Category           High Income   Low Income   Total   Size Marginal   Size Adjustment
1                                 23.8          6.2          30.0    30.0            0.93
2                                 15.7          24.3         40.0    40.0            0.91
3 or more                         21.6          8.4          30.0    30.0            1.26
Total                             61.1          38.9
Household Income Marginals        60.0          40.0
Adjustment for Household Income   --            --

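IPF Example

Iter 2: Adjust for Hhld Income

Household Size Category           High Income   Low Income   Total   Size Marginal   Size Adjustment
1                                 23.4          6.3          29.8    30.0            --
2                                 15.4          25.0         40.4    40.0            --
3 or more                         21.2          8.6          29.9    30.0            --
Total                             60.0          40.0
Household Income Marginals        60.0          40.0
Adjustment for Household Income   0.98          1.03

Iter 2: Adjust for Hhld Size

Household Size Category           High Income   Low Income   Total   Size Marginal   Size Adjustment
1                                 23.6          6.4          30.0    30.0            1.01
2                                 15.2          24.8         40.0    40.0            0.99
3 or more                         21.3          8.7          30.0    30.0            1.00
Total                             60.2          39.8
Household Income Marginals        60.0          40.0
Adjustment for Household Income   --            --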
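IPF Example

Iter 3: Adjust for Hhld Income

Household Size Category           High Income   Low Income   Total   Size Marginal   Size Adjustment
1                                 23.5          6.4          30.0    30.0            --
2                                 15.2          24.9         40.1    40.0            --
3 or more                         21.3          8.7          30.0    30.0            --
Total                             60.0          40.0
Household Income Marginals        60.0          40.0
Adjustment for Household Income   1.00          1.00

Iter 3: Adjust for Hhld Size (Convergence Reached; the cells give the Hhld Type Frequencies)

Household Size Category           High Income   Low Income   Total   Size Marginal   Size Adjustment
1                                 23.6          6.4          30.0    30.0            1.00
2                                 15.2          24.8         40.0    40.0            1.00
3 or more                         21.3          8.7          30.0    30.0            1.00
Total                             60.0          40.0
Household Income Marginals        60.0          40.0
Adjustment for Household Income   --            --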
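The IPF example can be reproduced with a few lines of code. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `ipf` and its parameters `tol` and `max_iter` are illustrative assumptions. It alternates the income (column) and size (row) adjustments until the column totals stop changing.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Fit a 2-D seed table to row and column marginals (illustrative sketch)."""
    table = [[float(x) for x in row] for row in seed]
    for _ in range(max_iter):
        # Adjust for the column marginals (household income in the example).
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / total
        # Adjust for the row marginals (household size in the example).
        for row, target in zip(table, row_targets):
            total = sum(row)
            for j in range(len(row)):
                row[j] *= target / total
        # Stop once the column totals still match after the row adjustment.
        col_error = max(abs(sum(row[j] for row in table) - target)
                        for j, target in enumerate(col_targets))
        if col_error < tol:
            break
    return table


# Seed frequencies and marginals taken from the IPF example slides.
seed = [[3.0, 1.0], [2.0, 4.0], [2.0, 1.0]]
fitted = ipf(seed, row_targets=[30.0, 40.0, 30.0], col_targets=[60.0, 40.0])
for row in fitted:
    print([round(x, 1) for x in row])
```

The printed cells should agree with the converged table in the example up to rounding. The sketch assumes that every row and column of the seed has a non-zero total; zero cells and zero marginals need the corrections discussed later in the deck.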

IPU: Example

• Frequency matrix: from PUMS or household travel surveys
• Household constraints: from IPF using household attributes
• Person constraints: from IPF using person attributes

HH ID                  Init Wts   HH Type 1   HH Type 2   Person Type 1   Person Type 2   Person Type 3
1                      1          1           0           1               1               1
2                      1          1           0           1               0               1
3                      1          1           0           2               1               0
4                      1          0           1           1               0               2
5                      1          0           1           0               2               1
6                      1          0           1           1               1               0
7                      1          0           1           2               1               2
8                      1          0           1           1               1               0
Constraints                       35.00       65.00       91.00           65.00           104.00
Initial Weighted Sum              3.00        5.00        9.00            7.00            7.00
δ1                                0.89        0.94        0.89            0.88            0.92

IPU: Example

Adjustment for HH Type 1

HH ID                  Init Wts   HH Type 1   HH Type 2   Person Type 1   Person Type 2   Person Type 3   Wts 1
1                      1          1           0           1               1               1               11.67
2                      1          1           0           1               0               1               11.67
3                      1          1           0           2               1               0               11.67
4                      1          0           1           1               0               2               1.00
5                      1          0           1           0               2               1               1.00
6                      1          0           1           1               1               0               1.00
7                      1          0           1           2               1               2               1.00
8                      1          0           1           1               1               0               1.00
Constraints                       35.00       65.00       91.00           65.00           104.00
Initial Weighted Sum              3.00        5.00        9.00            7.00            7.00
Weighted Sum 1                    35.00       5.00        51.67           28.33           28.33

IPU: Example

Adjustment for HH Type 2

HH ID                  Init Wts   HH Type 1   HH Type 2   Person Type 1   Person Type 2   Person Type 3   Wts 1   Wts 2
1                      1          1           0           1               1               1               11.67   11.67
2                      1          1           0           1               0               1               11.67   11.67
3                      1          1           0           2               1               0               11.67   11.67
4                      1          0           1           1               0               2               1.00    13.00
5                      1          0           1           0               2               1               1.00    13.00
6                      1          0           1           1               1               0               1.00    13.00
7                      1          0           1           2               1               2               1.00    13.00
8                      1          0           1           1               1               0               1.00    13.00
Constraints                       35.00       65.00       91.00           65.00           104.00
Initial Weighted Sum              3.00        5.00        9.00            7.00            7.00
Weighted Sum 1                    35.00       5.00        51.67           28.33           28.33
Weighted Sum 2                    35.00       65.00       111.67          88.33           88.33

IPU: Example

Adjustment for Person Type 1

HH ID                  Init Wts   HH Type 1   HH Type 2   Person Type 1   Person Type 2   Person Type 3   Wts 1   Wts 2   Wts 3
1                      1          1           0           1               1               1               11.67   11.67   9.51
2                      1          1           0           1               0               1               11.67   11.67   9.51
3                      1          1           0           2               1               0               11.67   11.67   9.51
4                      1          0           1           1               0               2               1.00    13.00   10.59
5                      1          0           1           0               2               1               1.00    13.00   13.00
6                      1          0           1           1               1               0               1.00    13.00   10.59
7                      1          0           1           2               1               2               1.00    13.00   10.59
8                      1          0           1           1               1               0               1.00    13.00   10.59
Constraints                       35.00       65.00       91.00           65.00           104.00
Initial Weighted Sum              3.00        5.00        9.00            7.00            7.00
Weighted Sum 1                    35.00       5.00        51.67           28.33           28.33
Weighted Sum 2                    35.00       65.00       111.67          88.33           88.33
Weighted Sum 3                    28.52       55.38       91.00           76.80           74.39

IPU: Example

Adjustment for Person Type 2

HH ID                  Init Wts   HH Type 1   HH Type 2   Person Type 1   Person Type 2   Person Type 3   Wts 1   Wts 2   Wts 3   Wts 4
1                      1          1           0           1               1               1               11.67   11.67   9.51    8.05
2                      1          1           0           1               0               1               11.67   11.67   9.51    9.51
3                      1          1           0           2               1               0               11.67   11.67   9.51    8.05
4                      1          0           1           1               0               2               1.00    13.00   10.59   10.59
5                      1          0           1           0               2               1               1.00    13.00   13.00   11.00
6                      1          0           1           1               1               0               1.00    13.00   10.59   8.97
7                      1          0           1           2               1               2               1.00    13.00   10.59   8.97
8                      1          0           1           1               1               0               1.00    13.00   10.59   8.97
Constraints                       35.00       65.00       91.00           65.00           104.00
Initial Weighted Sum              3.00        5.00        9.00            7.00            7.00
Weighted Sum 1                    35.00       5.00        51.67           28.33           28.33
Weighted Sum 2                    35.00       65.00       111.67          88.33           88.33
Weighted Sum 3                    28.52       55.38       91.00           76.80           74.39
Weighted Sum 4                    25.60       48.50       80.11           65.00           67.68

IPU: Example

Adjustment for Person Type 3

HH ID                  Init Wts   HH Type 1   HH Type 2   Person Type 1   Person Type 2   Person Type 3   Wts 1   Wts 2   Wts 3   Wts 4   Wts 5
1                      1          1           0           1               1               1               11.67   11.67   9.51    8.05    12.37
2                      1          1           0           1               0               1               11.67   11.67   9.51    9.51    14.61
3                      1          1           0           2               1               0               11.67   11.67   9.51    8.05    8.05
4                      1          0           1           1               0               2               1.00    13.00   10.59   10.59   16.28
5                      1          0           1           0               2               1               1.00    13.00   13.00   11.00   16.91
6                      1          0           1           1               1               0               1.00    13.00   10.59   8.97    8.97
7                      1          0           1           2               1               2               1.00    13.00   10.59   8.97    13.78
8                      1          0           1           1               1               0               1.00    13.00   10.59   8.97    8.97
Constraints                       35.00       65.00       91.00           65.00           104.00
Initial Weighted Sum              3.00        5.00        9.00            7.00            7.00
Weighted Sum 1                    35.00       5.00        51.67           28.33           28.33
Weighted Sum 2                    35.00       65.00       111.67          88.33           88.33
Weighted Sum 3                    28.52       55.38       91.00           76.80           74.39
Weighted Sum 4                    25.60       48.50       80.11           65.00           67.68
Weighted Sum 5                    35.02       64.90       104.84          85.94           104.00
δ2                                0.00        0.00        0.15            0.32            0.00

IPU: Example

Final Estimated Weights

HH ID                  Init Wts   HH Type 1   HH Type 2   Person Type 1   Person Type 2   Person Type 3   Final IPU Wts   IPF Wts
1                      1          1           0           1               1               1               1.36            11.67
2                      1          1           0           1               0               1               25.66           11.67
3                      1          1           0           2               1               0               7.98            11.67
4                      1          0           1           1               0               2               27.79           13.00
5                      1          0           1           0               2               1               18.45           13.00
6                      1          0           1           1               1               0               8.64            13.00
7                      1          0           1           2               1               2               1.47            13.00
8                      1          0           1           1               1               0               8.64            13.00
Constraints                       35.00       65.00       91.00           65.00           104.00
Initial Weighted Sum              3.00        5.00        9.00            7.00            7.00
Final Weighted Sum                35.00       65.00       91.00           65.00           104.00
δ401                              0.00        0.00        0.00            0.00            0.00
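The weight updates in the IPU example follow one simple rule: for each household or person type in turn, scale the weights of the households that contribute to that type by the ratio of the constraint to the current weighted sum. Below is a minimal sketch of that rule in plain Python using the example's frequency matrix and constraints. The names `ipu`, `weighted_sums`, and `delta` are illustrative assumptions, and δ is taken here to be the average absolute relative difference across all five columns.

```python
# Frequency matrix from the example: columns are HH Type 1, HH Type 2,
# Person Type 1, Person Type 2, Person Type 3; one row per household.
FREQ = [
    [1, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 2, 1, 0],
    [0, 1, 1, 0, 2],
    [0, 1, 0, 2, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 2, 1, 2],
    [0, 1, 1, 1, 0],
]
CONSTRAINTS = [35.0, 65.0, 91.0, 65.0, 104.0]


def weighted_sums(freq, weights):
    """Weighted column totals of the frequency matrix."""
    return [sum(w * row[j] for w, row in zip(weights, freq))
            for j in range(len(freq[0]))]


def delta(freq, weights, constraints):
    """Average absolute relative difference between sums and constraints."""
    sums = weighted_sums(freq, weights)
    return sum(abs(s - c) / c for s, c in zip(sums, constraints)) / len(constraints)


def ipu(freq, constraints, iterations):
    """Run a fixed number of IPU iterations starting from unit weights."""
    weights = [1.0] * len(freq)
    for _ in range(iterations):
        for j, target in enumerate(constraints):
            current = sum(w * row[j] for w, row in zip(weights, freq))
            factor = target / current
            # Only households that contribute to type j are rescaled.
            weights = [w * factor if row[j] else w
                       for w, row in zip(weights, freq)]
    return weights


after_one = ipu(FREQ, CONSTRAINTS, 1)        # compare with the "Wts 5" column
after_many = ipu(FREQ, CONSTRAINTS, 1000)
print([round(w, 2) for w in after_one])
print(round(delta(FREQ, after_many, CONSTRAINTS), 6))
```

One full iteration should reproduce the "Wts 5" column of the example. The sketch runs a fixed number of iterations for simplicity; a practical version would stop when the change in δ between successive iterations falls below a tolerance ε, as in the flowchart.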

IPU Example

• Improvement in Measure of Fit with Iterations

[Figure: δ value (log scale, 0.000001 to 1) plotted against number of iterations (0 to 700)]

IPU: Geometric Interpretation

• Sample Household Structure and Population Constraints

HH ID         HH Type   Person Type   Weights
1             1         0             w1
2             1         1             w2
Constraints   4         3

• Weights can be estimated by solving the following system of linear equations

w1 + w2 = 4
w2 = 3

IPU: Geometric Interpretation

• When the solution is within the feasible region

[Figure: the lines w1 + w2 = 4 and w2 = 3 plotted on w1 and w2 axes with origin O; labeled points I, S, A, B, C, D, E]

IPU: Geometric Interpretation

• When the solution is outside the feasible region

[Figure: the lines w1 + w2 = 4 and w2 = 5 plotted on w1 and w2 axes with origin O; labeled points I, I1, I2, S, A, B, C, D, E]
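The two geometric cases can be checked numerically. The sketch below (plain Python; the function name `ipu_two_households` and the fixed iteration count are illustrative assumptions) applies the IPU updates to the two-household example: both households count toward the household constraint, and only household 2 contains the person type.

```python
def ipu_two_households(hh_constraint, person_constraint, iterations=200):
    """IPU updates for the two-household geometric example (sketch)."""
    w1, w2 = 1.0, 1.0
    for _ in range(iterations):
        # Household constraint: w1 + w2 must equal hh_constraint.
        factor = hh_constraint / (w1 + w2)
        w1, w2 = w1 * factor, w2 * factor
        # Person constraint: only household 2 contains the person type.
        w2 = w2 * (person_constraint / w2)
    return w1, w2


# Solution inside the feasible region: w1 + w2 = 4 and w2 = 3.
inside = ipu_two_households(4.0, 3.0)
# Solution outside the feasible region: w1 + w2 = 4 and w2 = 5.
outside = ipu_two_households(4.0, 5.0)
print(inside, outside)
```

In the first case the weights settle at the intersection of the two lines. In the second case the exact solution would need a negative w1, so w1 shrinks toward zero and the updates end up alternating between the two constraints along the w2 axis. This behavior is what the case study later calls a corner solution.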

Population Synthesis for Small Geographies

Zero-cell Problem
• Problem
  - The disaggregate sample for the sub-region (PUMA) to which the small geography belongs does not capture infrequent household types
  - IPF for the geography fails to converge
• Earlier solution
  - Add a small arbitrary number to the zero cells (Beckman 1996)
  - This procedure introduces an arbitrary bias (Guo and Bhat, 2006)
• Proposed solution
  - Borrow the prior information for the zero cells from the PUMS data for the entire region, subject to an upper limit on the probabilities


[Figure: PUMS for the Region, containing the Subsample for PUMA 1, which covers blockgroups BG 1, BG 2, BG 3, and BG 4]

• The subsample provides priors for the BGs during IPF
• The subsample may not contain all household/person types (zero cells)

Priors from PUMA to which BG belongs

Household Size Category   High Income   Low Income
1                         3             0
2                         2             4
3 or more                 2             1
Total                     12

Priors from PUMS

Household Size Category   High Income   Low Income
1                         7             2
2                         8             10
3 or more                 3             3
Total                     33

Probabilities for PUMA

Household Size Category   High Income   Low Income
1                         0.25          0.00
2                         0.17          0.33
3 or more                 0.17          0.08

Probabilities for PUMS

Household Size Category   High Income   Low Income
1                         0.21          0.06
2                         0.24          0.30
3 or more                 0.09          0.09

Threshold Probability = 1/12 = 0.083

Zero-cell adjusted probabilities for PUMA

Household Size Category   High Income   Low Income
1                         0.25          0.06
2                         0.17          0.33
3 or more                 0.17          0.08

Probabilities from PUMS

Household Size Category   High Income   Low Income
1                         0.21          0.06
2                         0.24          0.30
3 or more                 0.09          0.09

• The probability sum now adds up to more than 1 (1.06), so the probabilities for the other cells are adjusted

Adjusted priors from PUMA

Household Size Category   High Income   Low Income
1                         0.23          0.06
2                         0.16          0.31
3 or more                 0.16          0.08
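The zero-cell adjustment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function name `adjust_zero_cells` is hypothetical, the upper limit is taken to be one over the PUMA sample total (1/12 in the example), a borrowed probability is assumed to be capped at that limit, and the remaining cells are rescaled so that everything sums to one. Cells are listed row by row (size 1 high, size 1 low, size 2 high, and so on).

```python
def adjust_zero_cells(puma_counts, pums_counts):
    """Borrow priors for zero cells from the regional PUMS (sketch)."""
    puma_total = sum(puma_counts)
    pums_total = sum(pums_counts)
    threshold = 1.0 / puma_total  # assumed upper limit on borrowed priors
    puma_probs = [c / puma_total for c in puma_counts]
    pums_probs = [c / pums_total for c in pums_counts]
    # Zero cells take the regional probability, capped at the threshold.
    borrowed = [min(q, threshold) if p == 0 else 0.0
                for p, q in zip(puma_probs, pums_probs)]
    # The non-zero cells are scaled down so the total is one again.
    scale = 1.0 - sum(borrowed)
    return [b if p == 0 else p * scale
            for p, b in zip(puma_probs, borrowed)]


puma = [3, 0, 2, 4, 2, 1]    # priors from the PUMA subsample
pums = [7, 2, 8, 10, 3, 3]   # priors from the regional PUMS
adjusted = adjust_zero_cells(puma, pums)
print([round(p, 2) for p in adjusted])
```

With the example counts, the rounded output should match the adjusted priors table. The regional probability for the zero cell (about 0.06) is below the threshold of 0.083, so the cap is not binding here.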


Zero-Marginal Problem
• Problem
  - The marginal values for certain categories of an attribute take a zero value
  - The IPF procedure will assign a zero to all household/person type constraints that are formed by that zero-marginal category
  - As a result, the IPU algorithm may fail to proceed
• Proposed solution
  - Add a small value (0.001) to the zero-marginal categories
  - IPU now proceeds as expected
  - The effect of this adjustment on the results is negligible


HH ID                  Init Wts   HH Type 1   HH Type 2   Person Type 1   Person Type 2   Person Type 3   Iter 1 wrt Person Type 1
1                      1          1           0           1               1               1               0.0
2                      1          1           0           1               0               1               0.0
3                      1          1           0           2               1               0               0.0
4                      1          0           1           1               0               2               0.0
5                      1          0           1           0               2               1               w5
6                      1          0           1           1               1               0               0.0
7                      1          0           1           2               1               2               0.0
8                      1          0           1           1               1               0               0.0
Constraints                       35.00       65.00       91.00           65.00           104.00
Initial Weighted Sum              3.00        5.00        9.00            7.00            7.00
δ1                                0.89        0.94        0.89            0.88            0.92

- If the Person Type 1 constraint were zero, the weights of all households except HH ID 5 would be adjusted to 0

- The algorithm then fails to proceed in the second iteration, when the weights are adjusted with respect to Household Type 1: households 1 to 3 all have zero weight, so the weighted sum for Household Type 1 is zero and the adjustment ratio is undefined
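The failure and the fix can be demonstrated on the example frequency matrix. The following is a minimal sketch in plain Python; the helper name `adjust` and the variable names are illustrative assumptions. It applies a single IPU adjustment for one column, first with a zero Person Type 1 constraint and then with the small value 0.001.

```python
# Frequency matrix from the IPU example: columns are HH Type 1, HH Type 2,
# Person Type 1, Person Type 2, Person Type 3.
FREQ = [
    [1, 0, 1, 1, 1], [1, 0, 1, 0, 1], [1, 0, 2, 1, 0], [0, 1, 1, 0, 2],
    [0, 1, 0, 2, 1], [0, 1, 1, 1, 0], [0, 1, 2, 1, 2], [0, 1, 1, 1, 0],
]
HH_TYPE_1, PERSON_TYPE_1 = 0, 2


def adjust(weights, column, constraint):
    """One IPU adjustment: scale contributing households to hit the constraint."""
    weighted_sum = sum(w * row[column] for w, row in zip(weights, FREQ))
    factor = constraint / weighted_sum
    return [w * factor if row[column] else w for w, row in zip(weights, FREQ)]


initial = [1.0] * len(FREQ)

# With a zero constraint, every household containing Person Type 1 gets weight 0.
zeroed = adjust(initial, PERSON_TYPE_1, 0.0)
try:
    adjust(zeroed, HH_TYPE_1, 35.0)   # weighted sum for HH Type 1 is now 0
    failed = False
except ZeroDivisionError:
    failed = True

# With the small value 0.001 instead of zero, the next adjustment works.
nudged = adjust(initial, PERSON_TYPE_1, 0.001)
recovered = adjust(nudged, HH_TYPE_1, 35.0)
print(failed, round(sum(recovered[:3]), 2))
```

The first attempt fails with a division by zero, while the second restores the Household Type 1 total to its constraint of 35.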

Case Study: Estimating Weights
• In year 2000, in the Maricopa County region
  - 3,071,219 individuals resided in 1,133,048 households across 2,088 blockgroups (25 other blockgroups had 0 households)
• The 5 percent 2000 PUMS was used as the household sample; it consists of 254,205 individuals residing in 95,066 households
• Marginal distributions of attributes were obtained from 2000 Census Summary Files
• Two random blockgroups were chosen for the case study

Case Study: Estimating Weights
• Household attributes chosen
  - Household Type (5 cat.), Household Size (7 cat.), Household Income (8 cat.)
  - 280 different household types
• Person attributes chosen
  - Gender (2 cat.), Age (10 cat.), Ethnicity (7 cat.)
  - 140 different person types
• Household and person type constraints were estimated using IPF

Case Study: Estimating Weights
• Reduction in Average Absolute Relative Difference (δ) with the IPU algorithm
  - Blockgroup A: δ reduced from 2.471 to 0.041 in 20 iterations; corner solution reached
  - Blockgroup B: δ reduced from 0.8151 to 0.00064 in 500 iterations; near-perfect solution obtained

Case Study: Drawing Households
• The joint household distribution from IPF gives the frequencies of the different household types to be drawn
• Proposed method of drawing households
  - IPF frequencies are rounded
  - The difference between the rounded frequency sum and the actual household total is adjusted
  - Households are drawn probabilistically based on IPU-estimated weights for each household type
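The drawing step can be sketched as follows in Python. The slides do not say how the rounding difference is corrected, so the rule used here (add or remove one household at a time from the types with the largest rounding error) is an assumption, as are the function names `round_frequencies` and `draw_households`. Sampling with replacement within a household type is also assumed.

```python
import random


def round_frequencies(frequencies, total):
    """Round IPF frequencies, then fix the rounding difference (assumed rule)."""
    rounded = [int(round(f)) for f in frequencies]
    diff = total - sum(rounded)
    # Types rounded down the most come first when households must be added;
    # types rounded up the most come first when households must be removed.
    order = sorted(range(len(frequencies)),
                   key=lambda k: frequencies[k] - rounded[k],
                   reverse=diff > 0)
    for k in order[:abs(diff)]:
        rounded[k] += 1 if diff > 0 else -1
    return rounded


def draw_households(household_ids, ipu_weights, count, rng):
    """Draw `count` households of one type with probability proportional to weight."""
    return rng.choices(household_ids, weights=ipu_weights, k=count)


# HH Type 1 from the IPU example: households 1 to 3 with their final IPU weights.
rng = random.Random(0)
frequencies = round_frequencies([35.0, 65.0], total=100)
draws = draw_households([1, 2, 3], [1.36, 25.66, 7.98], frequencies[0], rng)
print(frequencies, len(draws))
```

Because household 2 carries most of the weight within HH Type 1, it is expected to dominate the draw, which is how the IPU weights steer the person-level distribution of the synthetic population.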

Case Study: Algorithm Performance
• Average Absolute Relative Difference
  - Used for monitoring convergence of IPU
  - It masks the difference in magnitude between estimated and expected values
  - Cannot be used to measure the fit of the synthetic population
• Chi-squared Statistic (χ2)
  - Provides a statistical procedure for comparing distributions
  - The p-value of the χ2 statistic with J-1 degrees of freedom gives the level of confidence
  - A confidence level very close to one is desired for the synthetic household draw
  - This was used to compare the joint distribution of the synthesized individuals with the IPF-generated person joint distribution
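As a rough illustration of the fit measure, the χ2 statistic compares the person-type counts in the synthetic population with the IPF-generated frequencies. The sketch below is plain Python with hypothetical counts; the function name `chi_squared` is an assumption, and skipping cells with zero expected frequency is a choice made here rather than something stated in the slides. Converting the statistic to a p-value requires the χ2 distribution function, which is not part of this sketch.

```python
def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over cells with expected > 0."""
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected) if e > 0)


# Hypothetical person-type counts: synthetic population vs. IPF frequencies.
observed = [10, 20, 30]
expected = [12.0, 18.0, 30.0]
statistic = chi_squared(observed, expected)
print(round(statistic, 4))
```

A small statistic relative to the degrees of freedom indicates that the synthesized person distribution is close to the IPF distribution.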

Case Study: Algorithm Performance
• Blockgroup A: χ2 = 74.77, dof = 119, p-value = 0.999
• Blockgroup B: χ2 = 52.01, dof = 99, p-value = 1.000

Computational Performance
• A synthetic population was also generated for the entire Maricopa County
• Population synthesized for 2,088 blockgroups
• A Dell Precision Workstation with a Quad Core Intel Xeon processor was used
• Coded in Python, with a MySQL database
• Code was parallelized using the Parallel Python module
• Run time was ~4 hours, or ~7 seconds per geography
• Note that the actual processing time is ~28 seconds per geography, i.e., a run on a single-core system would take approximately 28 seconds per geography

Population Synthesis: Flowchart

Step 1: Obtain Household and Person Level Constraints
• Marginals from Census Summary Files (SF): marginals are corrected to account for the Zero-Marginal Problem
• Household and Person 5% PUMS Data: priors for a particular PUMA are corrected to account for the Zero-cell Problem
• Run IPF procedure to obtain Household and Person level joint distributions
• Go to Step 2

Step 2: Estimate Weights to satisfy the Household and Person level joint distributions from Step 1 using IPU
• Input: Household and Person 5% PUMS Data
• Create Frequency Matrix D (N x m), where d_i,j in the matrix gives the contribution of a PUMS Household to the particular Household/Person type
• Column constraints for Household/Person types are obtained from Step 1
• Iteration: for all Household/Person types, the weights of PUMS Households contributing to a particular Household/Person type are adjusted to match the corresponding constraint
• Compute Goodness of Fit δ
• If difference in δ for successive iterations < ε
  - Yes: go to Step 3
  - No: perform another iteration

Step 3: Drawing Households
• Round the Household level joint distributions from Step 1 and correct them for rounding errors; this gives the frequency of Household types to be selected
• For each Household type, estimate the Household selection probability distribution using the IPU adjusted weights
• Create synthetic population by randomly selecting Households based on the probability distributions computed for each Household type
• Compute a χ2 statistic, comparing the Person joint distribution of the synthetic population with the Person joint distributions from Step 1
• If the P-value corresponding to χ2 statistic > 0.9999
  - Yes: store synthetic population for the geography
  - No: perform another iteration

In the Near Future
• Build a GUI
• Port the results to the geography's polygon shape file
• Use PostgreSQL for databases
• Test the code on ASU's High Performance Cluster
• Document the algorithm/program on a wiki

Thank You!

Questions & Comments…

Website: http://www.ined.fr
