A Synthetic Population Generator that Matches Both Household
and Person Attribute Distributions
Xin Ye, Ram M. Pendyala, Karthik C. Konduri, Bhargava Sana
Department of Civil and Environmental Engineering
Outline
1. Introduction
2. Iterative Proportional Fitting (IPF) Algorithm
   • Example to Illustrate the Algorithm
3. Iterative Proportional Updating (IPU) Algorithm
   • Example to Illustrate the Algorithm
   • Geometric Interpretation
4. Population Synthesis for Small Geographies
   • Zero-cell Problem
   • Zero-marginal Problem
5. Case Study
   • Estimating Weights
   • Creating Synthetic Households
   • Performance of the Algorithm
6. Flowchart
Introduction
• Emergence of Activity-based microsimulation approaches in Travel Demand Analysis
• Microsimulation models simulate activity-travel patterns subject to spatio-temporal constraints and various agent interactions
• Examples: AMOS, FAMOS, CEMDAP, ALBATROSS, TASHA, etc.
• Tour-based models have been implemented in some cities, including San Francisco, New York, Puget Sound, etc.
Introduction
• Activity-based models operate at the level of the individual traveler
• Calibration, Validation, and Application of these models require Household and Person attribute data for the entire population in a region
• Disaggregate data for the complete population are generally not available
• Data Available
  - Disaggregate data for a sample of the population from PUMS or Household Travel Surveys
  - Aggregate distributions of Household and Person attributes for the population from Census Summary Files or Agency Forecasts
• Challenge: How to obtain Household and Person attribute data for the population in a region from the available data?
• Create a Synthetic Population: select Households and Persons from the sample to match joint distributions of key population characteristics
Iterative Proportional Fitting
• Joint distributions of population characteristics are not readily available
• They can be estimated using the Iterative Proportional Fitting (IPF) procedure
• The IPF procedure takes frequency tables constructed from PUMS or Household travel surveys as priors
• Marginal distributions from the Census Summary Files (Base Year) or Population Forecasts (Future Year) are used as controls
Iterative Proportional Fitting (IPF)
• Deming and Stephan (1941) presented the method to adjust sample frequency tables to match known marginal distributions using a least squares approach
• Wong (1992) showed that IPF yields maximum entropy estimates
Iterative Proportional Fitting
Synthetic Baseline Populations (Beckman 1996)
• Proposed a method to create a synthetic population based on IPF
• The joint distribution of Household attributes was estimated using IPF
• Synthetic Households were generated by randomly selecting Households from the sample based on the estimated joint distributions
• The Synthetic Population comprised the persons from the selected households
• This method has been adopted widely in travel demand models (TDMs) based on activity-based approaches
Iterative Proportional Fitting
Limitation of the Beckman (1996) procedure
• The procedure controls only for household attributes and not for person attributes
• As a result, synthetic populations fail to match given distributions of person characteristics
• The method assumes that all households in the sample contributing to a particular household type have the same structure (i.e. a similar composition of individuals)
• However, the structures of households within the same household type generally differ, hence the need for different weights based on household structure
• Guo and Bhat (2007) and Arentze (2007) constitute initial attempts to control household and person level attributes simultaneously
• The proposed Iterative Proportional Updating (IPU) algorithm
  - simultaneously controls for both household and person attributes of interest
  - reallocates the weights of the households within the same household type to account for the differences in their household structures
IPF Example
Initial table. Cell frequencies are from PUMS or Household Travel Surveys; marginals are from Census Summary Files or Agency Forecasts.

                                   Household Income Category
Household Size Category            High    Low     Total   Household Size Marginals   Adjustment for Household Size
1                                  3.0     1.0     4.0     30.0                       --
2                                  2.0     4.0     6.0     40.0                       --
3 or more                          2.0     1.0     3.0     30.0                       --
Total                              7.0     6.0
Household Income Marginals         60.0    40.0
Adjustment for Household Income    --      --
IPF Example
Each table shows the adjustment factors, the adjusted frequencies, and the adjusted totals.

Iter 1: Adjust for Hhld Income
                                   Household Income Category
Household Size Category            High    Low     Total   Household Size Marginals   Adjustment for Household Size
1                                  25.7    6.7     32.4    30.0                       --
2                                  17.1    26.7    43.8    40.0                       --
3 or more                          17.1    6.7     23.8    30.0                       --
Total                              60.0    40.0
Household Income Marginals         60.0    40.0
Adjustment for Household Income    8.57    6.67

Iter 1: Adjust for Hhld Size
                                   Household Income Category
Household Size Category            High    Low     Total   Household Size Marginals   Adjustment for Household Size
1                                  23.8    6.2     30.0    30.0                       0.93
2                                  15.7    24.3    40.0    40.0                       0.91
3 or more                          21.6    8.4     30.0    30.0                       1.26
Total                              61.1    38.9
Household Income Marginals         60.0    40.0
Adjustment for Household Income    --      --
IPF Example

Iter 2: Adjust for Hhld Income
                                   Household Income Category
Household Size Category            High    Low     Total   Household Size Marginals   Adjustment for Household Size
1                                  23.4    6.3     29.8    30.0                       --
2                                  15.4    25.0    40.4    40.0                       --
3 or more                          21.2    8.6     29.9    30.0                       --
Total                              60.0    40.0
Household Income Marginals         60.0    40.0
Adjustment for Household Income    0.98    1.03

Iter 2: Adjust for Hhld Size
                                   Household Income Category
Household Size Category            High    Low     Total   Household Size Marginals   Adjustment for Household Size
1                                  23.6    6.4     30.0    30.0                       1.01
2                                  15.2    24.8    40.0    40.0                       0.99
3 or more                          21.3    8.7     30.0    30.0                       1.00
Total                              60.2    39.8
Household Income Marginals         60.0    40.0
Adjustment for Household Income    --      --
IPF Example

Iter 3: Adjust for Hhld Income
                                   Household Income Category
Household Size Category            High    Low     Total   Household Size Marginals   Adjustment for Household Size
1                                  23.5    6.4     30.0    30.0                       --
2                                  15.2    24.9    40.1    40.0                       --
3 or more                          21.3    8.7     30.0    30.0                       --
Total                              60.0    40.0
Household Income Marginals         60.0    40.0
Adjustment for Household Income    1.00    1.00

Iter 3: Adjust for Hhld Size (Convergence Reached)
                                   Household Income Category
Household Size Category            High    Low     Total   Household Size Marginals   Adjustment for Household Size
1                                  23.6    6.4     30.0    30.0                       1.00
2                                  15.2    24.8    40.0    40.0                       1.00
3 or more                          21.3    8.7     30.0    30.0                       1.00
Total                              60.0    40.0
Household Income Marginals         60.0    40.0
Adjustment for Household Income    --      --

The cell values of the converged table are the Hhld Type Frequencies.
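
The IPF example can be reproduced with a short script. This is a minimal sketch, assuming a two-dimensional table and the same order of adjustments as the example (income columns first, then household-size rows). The function name `ipf`, the iteration cap, and the tolerance are illustrative choices and do not come from the presentation.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-6):
    """Scale a seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]
    for _ in range(max_iter):
        # Column step: scale each income column to its marginal.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
        # Row step: scale each household-size row to its marginal.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [cell * target / row_sum for cell in table[i]]
        # Rows are exact after the row step, so only the columns can be off.
        worst = max(abs(sum(row[j] for row in table) - target)
                    for j, target in enumerate(col_targets))
        if worst < tol:
            break
    return table


# Seed frequencies and marginals from the example tables.
seed = [[3.0, 1.0], [2.0, 4.0], [2.0, 1.0]]
size_marginals = [30.0, 40.0, 30.0]
income_marginals = [60.0, 40.0]

fitted = ipf(seed, size_marginals, income_marginals)
for label, row in zip(["1", "2", "3 or more"], fitted):
    print(label, [round(cell, 1) for cell in row])
```

Rounded to one decimal, the fitted cells agree with the converged table of the example (23.6, 6.4; 15.2, 24.8; 21.3, 8.7).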
IPU: Example
Frequency Matrix (from PUMS or Household Travel Surveys)

HH ID  Init Wts  HH Type 1  HH Type 2  Person Type 1  Person Type 2  Person Type 3
1      1         1          0          1              1              1
2      1         1          0          1              0              1
3      1         1          0          2              1              0
4      1         0          1          1              0              2
5      1         0          1          0              2              1
6      1         0          1          1              1              0
7      1         0          1          2              1              2
8      1         0          1          1              1              0

                       HH Type 1  HH Type 2  Person Type 1  Person Type 2  Person Type 3
Constraints            35.00      65.00      91.00          65.00          104.00
Initial Weighted Sum   3.00       5.00       9.00           7.00           7.00
δ1                     0.89       0.94       0.89           0.88           0.92

• Household Constraints (HH Type columns): from IPF using Hhld Attributes
• Person Constraints (Person Type columns): from IPF using Person Attributes
IPU: Example
Weights after each adjustment of the first iteration:
• Wts 1: Adjustment for HH Type 1
• Wts 2: Adjustment for HH Type 2
• Wts 3: Adjustment for Person Type 1
• Wts 4: Adjustment for Person Type 2
• Wts 5: Adjustment for Person Type 3

HH ID  Init Wts  HH Type 1  HH Type 2  Person Type 1  Person Type 2  Person Type 3  Wts 1  Wts 2  Wts 3  Wts 4  Wts 5
1      1         1          0          1              1              1              11.67  11.67  9.51   8.05   12.37
2      1         1          0          1              0              1              11.67  11.67  9.51   9.51   14.61
3      1         1          0          2              1              0              11.67  11.67  9.51   8.05   8.05
4      1         0          1          1              0              2              1.00   13.00  10.59  10.59  16.28
5      1         0          1          0              2              1              1.00   13.00  13.00  11.00  16.91
6      1         0          1          1              1              0              1.00   13.00  10.59  8.97   8.97
7      1         0          1          2              1              2              1.00   13.00  10.59  8.97   13.78
8      1         0          1          1              1              0              1.00   13.00  10.59  8.97   8.97

                       HH Type 1  HH Type 2  Person Type 1  Person Type 2  Person Type 3
Constraints            35.00      65.00      91.00          65.00          104.00
Initial Weighted Sum   3.00       5.00       9.00           7.00           7.00
Weighted Sum 1         35.00      5.00       51.67          28.33          28.33
Weighted Sum 2         35.00      65.00      111.67         88.33          88.33
Weighted Sum 3         28.52      55.38      91.00          76.80          74.39
Weighted Sum 4         25.60      48.50      80.11          65.00          67.68
Weighted Sum 5         35.02      64.90      104.84         85.94          104.00
δ2                     0.00       0.00       0.15           0.32           0.00
IPU: Example
Final Estimated Weights

HH ID  Init Wts  HH Type 1  HH Type 2  Person Type 1  Person Type 2  Person Type 3  Final IPU Wts  IPF Wts
1      1         1          0          1              1              1              1.36           11.67
2      1         1          0          1              0              1              25.66          11.67
3      1         1          0          2              1              0              7.98           11.67
4      1         0          1          1              0              2              27.79          13.00
5      1         0          1          0              2              1              18.45          13.00
6      1         0          1          1              1              0              8.64           13.00
7      1         0          1          2              1              2              1.47           13.00
8      1         0          1          1              1              0              8.64           13.00

                       HH Type 1  HH Type 2  Person Type 1  Person Type 2  Person Type 3
Constraints            35.00      65.00      91.00          65.00          104.00
Initial Weighted Sum   3.00       5.00       9.00           7.00           7.00
Final Weighted Sum     35.00      65.00      91.00          65.00          104.00
δ401                   0.00       0.00       0.00           0.00           0.00
IPU Example
• Improvement in Measure of Fit with Iterations
[Figure: δ value (log scale, 0.000001 to 1) plotted against Number of Iterations (0 to 700)]
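
The IPU example can be traced in code. This is a minimal sketch, assuming the adjustment order shown in the example (HH Type 1, HH Type 2, then Person Types 1 to 3) and taking δ as the average absolute relative difference between weighted sums and constraints. The names `ipu_sweep`, `weighted_sums`, and `fit_delta`, and the number of sweeps, are illustrative and do not come from the presentation.

```python
# Frequency matrix from the example: one row per sample household,
# columns = HH Type 1, HH Type 2, Person Type 1, Person Type 2, Person Type 3.
D = [
    [1, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 2, 1, 0],
    [0, 1, 1, 0, 2],
    [0, 1, 0, 2, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 2, 1, 2],
    [0, 1, 1, 1, 0],
]
constraints = [35.0, 65.0, 91.0, 65.0, 104.0]


def weighted_sums(weights):
    """Weighted column sums of the frequency matrix."""
    return [sum(w * row[j] for w, row in zip(weights, D))
            for j in range(len(constraints))]


def fit_delta(weights):
    """Average absolute relative difference from the constraints."""
    sums = weighted_sums(weights)
    return sum(abs(s - c) / c for s, c in zip(sums, constraints)) / len(constraints)


def ipu_sweep(weights):
    """One pass over all columns; only contributing households are rescaled."""
    weights = list(weights)
    for j, target in enumerate(constraints):
        current = sum(w * row[j] for w, row in zip(weights, D))
        ratio = target / current
        weights = [w * ratio if row[j] else w for w, row in zip(weights, D)]
    return weights


weights = ipu_sweep([1.0] * len(D))
print([round(w, 2) for w in weights])   # reproduces the "Wts 5" column

for _ in range(5000):                   # assumed sweep count, not from the slides
    weights = ipu_sweep(weights)
print([round(s, 2) for s in weighted_sums(weights)], fit_delta(weights))
```

The first sweep reproduces the "Wts 5" column of the example. Repeating the sweep drives every weighted sum toward its constraint, which is the behavior the measure-of-fit plot describes.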
IPU: Geometric Interpretation
• Sample Household Structure and Population Constraints

HH ID        HH Type  Person Type  Weights
1            1        0            w1
2            1        1            w2
Constraints  4        3

• Weights can be estimated by solving the following system of linear equations
  w1 + w2 = 4
  w2 = 3
IPU: Geometric Interpretation
• When solution is within the feasible region
[Figure: (w1, w2) plane with the constraint lines w1 + w2 = 4 and w2 = 3; labeled points O, S, I, A, B, C, D, E]
IPU: Geometric Interpretation
• When solution is outside the feasible region
[Figure: (w1, w2) plane with the constraint lines w1 + w2 = 4 and w2 = 5; labeled points O, S, I, I1, I2, A, B, C, D, E]
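
The two geometric cases can be checked numerically. This is a minimal sketch of IPU on the two-household example, assuming both weights start at 1 and that the household-type adjustment precedes the person-type adjustment. The function name `ipu_two_households` and the number of sweeps are illustrative. With constraints (4, 3) the exact solution w1 = 1, w2 = 3 is feasible. With (4, 5) the linear system would need w1 = -1, so no non-negative solution exists.

```python
def ipu_two_households(hh_constraint, person_constraint, sweeps=200):
    """IPU on the two-household example.

    Both households belong to the household type; only household 2
    contains a person of the controlled person type.
    """
    w1, w2 = 1.0, 1.0
    for _ in range(sweeps):
        # Household-type step: both weights are scaled by the same ratio,
        # so the point (w1, w2) moves along a ray through the origin.
        ratio = hh_constraint / (w1 + w2)
        w1, w2 = w1 * ratio, w2 * ratio
        # Person-type step: only w2 changes.
        w2 *= person_constraint / w2
    return w1, w2


print(ipu_two_households(4.0, 3.0))   # solution inside the feasible region
print(ipu_two_households(4.0, 5.0))   # no feasible solution: w1 is driven toward 0
```

In the infeasible case the iterates settle at a corner (w1 close to 0): the person constraint is met and the household constraint is not, which is one way to read the corner behavior in the second figure.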
Population Synthesis for Small Geographies
Zero-cell Problem
• Problem
  - The disaggregate sample for the sub-region (PUMA) to which the small geography belongs does not capture infrequent household types
  - IPF for the geography fails to converge
• Earlier Solution
  - Add a small arbitrary number to the zero-cells (Beckman 1996)
  - This procedure introduces an arbitrary bias (Guo and Bhat, 2006)
• Proposed Solution
  - Borrow the prior information for the zero cells from the PUMS data for the entire region, subject to an upper limit on the probabilities
Population Synthesis for Small Geographies
[Diagram: PUMS for the Region is split into Subsamples for PUMA 1, PUMA 2, PUMA 3, and PUMA 4; the blockgroups BG 1, BG 2, BG 3, and BG 4 draw on a subsample]
• Subsample provides priors for the BGs during IPF
• Subsample may not contain all Household/Person Types, which leads to Zero-cells
Population Synthesis for Small Geographies

Priors from PUMA to which BG belongs        Priors from PUMS
Household Size   Household Income           Household Size   Household Income
Category         High    Low                Category         High    Low
1                3       0                  1                7       2
2                2       4                  2                8       10
3 or more        2       1                  3 or more        3       3
Total            12                         Total            33

Probabilities for PUMA                      Probabilities for PUMS
Household Size   Household Income           Household Size   Household Income
Category         High    Low                Category         High    Low
1                0.25    0.00               1                0.21    0.06
2                0.17    0.33               2                0.24    0.30
3 or more        0.17    0.08               3 or more        0.09    0.09

Threshold Probability = 1/12 = 0.083
Population Synthesis for Small Geographies

Zero-cell adjusted (PUMA)                   Probabilities from PUMS
Household Size   Household Income           Household Size   Household Income
Category         High    Low                Category         High    Low
1                0.25    0.06               1                0.21    0.06
2                0.17    0.33               2                0.24    0.30
3 or more        0.17    0.08               3 or more        0.09    0.09

Probability sum adds up to more than 1 (1.06), so the probabilities for the other cells are adjusted

Adjusted priors from PUMA
Household Size   Household Income
Category         High    Low
1                0.23    0.06
2                0.16    0.31
3 or more        0.16    0.08
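
The zero-cell correction can be sketched in a few lines. This assumes that the borrowed probability is capped at the threshold 1/N (N being the PUMA sample size, 12 here) and that the non-zero cells are scaled down proportionally so that all cells sum to 1. The proportional rescaling is an assumption that happens to reproduce the adjusted priors above; the presentation does not state the exact rule. The function name is illustrative.

```python
# Cell counts in row-major order: (size 1, High), (size 1, Low),
# (size 2, High), (size 2, Low), (3 or more, High), (3 or more, Low).
puma_counts = [3, 0, 2, 4, 2, 1]     # subsample for the PUMA of the blockgroup
pums_counts = [7, 2, 8, 10, 3, 3]    # PUMS for the entire region


def zero_cell_adjusted_priors(puma_counts, pums_counts):
    n_puma = sum(puma_counts)
    n_pums = sum(pums_counts)
    threshold = 1.0 / n_puma         # upper limit on a borrowed probability
    # Borrow the regional probability for zero cells, capped at the threshold.
    borrowed = [min(q / n_pums, threshold) if c == 0 else 0.0
                for c, q in zip(puma_counts, pums_counts)]
    # Assumed rule: shrink the non-zero cells proportionally so the total is 1.
    scale = 1.0 - sum(borrowed)
    return [b if c == 0 else c / n_puma * scale
            for c, b in zip(puma_counts, borrowed)]


adjusted = zero_cell_adjusted_priors(puma_counts, pums_counts)
print([round(p, 2) for p in adjusted])
```

Rounded to two decimals, the result matches the adjusted priors table (0.23, 0.06; 0.16, 0.31; 0.16, 0.08).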
Population Synthesis for Small Geographies
Zero-Marginal Problem
• Problem
  - The marginal values for certain categories of an attribute take a zero value
  - The IPF procedure will assign a zero to all household/person type constraints that are formed by that zero-marginal category
  - As a result, the IPU algorithm may fail to proceed
• Proposed Solution
  - Add a small value (0.001) to the Zero-marginal categories
  - IPU now proceeds as expected
  - Effect of this adjustment on results is negligible
Population Synthesis for Small Geographies

HH ID  Init Wts  HH Type 1  HH Type 2  Person Type 1  Person Type 2  Person Type 3  Iter 1 wrt Person Type 1
1      1         1          0          1              1              1              0.0
2      1         1          0          1              0              1              0.0
3      1         1          0          2              1              0              0.0
4      1         0          1          1              0              2              0.0
5      1         0          1          0              2              1              w5
6      1         0          1          1              1              0              0.0
7      1         0          1          2              1              2              0.0
8      1         0          1          1              1              0              0.0

                       HH Type 1  HH Type 2  Person Type 1  Person Type 2  Person Type 3
Constraints            35.00      65.00      91.00          65.00          104.00
Initial Weighted Sum   3.00       5.00       9.00           7.00           7.00
δ1                     0.89       0.94       0.89           0.88           0.92

- If the Person Type 1 constraint were zero, the weights of all households except HH ID 5 would be adjusted to 0
- The algorithm fails to proceed in the second iteration when we try to adjust weights wrt Household Type 1, because all Household Type 1 households then have zero weight
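
The failure and the fix can be demonstrated on the same eight households. This is a minimal sketch: the helper names `adjust` and `two_steps` are illustrative, and the small value is applied directly to the Person Type 1 constraint for brevity, whereas the presentation adds it to the zero-marginal categories before IPF.

```python
SMALL_VALUE = 0.001   # the small value quoted in the proposed solution

# Columns of the example frequency matrix for the eight households.
hh_type_1 = [1, 1, 1, 0, 0, 0, 0, 0]
person_type_1 = [1, 1, 2, 1, 0, 1, 2, 1]


def adjust(weights, column, target):
    """Rescale the weights of households that contribute to one column."""
    current = sum(w * d for w, d in zip(weights, column))
    ratio = target / current   # ZeroDivisionError if the weighted sum is 0
    return [w * ratio if d else w for w, d in zip(weights, column)]


def two_steps(person_type_1_constraint):
    """Adjust for Person Type 1, then for HH Type 1 (constraint 35)."""
    weights = adjust([1.0] * 8, person_type_1, person_type_1_constraint)
    return adjust(weights, hh_type_1, 35.0)


try:
    two_steps(0.0)
except ZeroDivisionError:
    print("zero constraint: the HH Type 1 adjustment cannot proceed")

print([round(w, 2) for w in two_steps(SMALL_VALUE)])
```

With a zero constraint, every household containing a Person Type 1 member gets weight 0, so the HH Type 1 weighted sum is 0 and the next ratio is undefined. With the small value the weights stay positive and the adjustment goes through.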
Case Study: Estimating Weights
• In year 2000, in the Maricopa County region, 3,071,219 individuals resided in 1,133,048 households across 2,088 blockgroups (25 other blockgroups had 0 households)
• The 5 percent 2000 PUMS was used as the household sample; it consists of 254,205 individuals residing in 95,066 households
• Marginal distributions of attributes were obtained from the 2000 Census Summary Files
• Two random blockgroups were chosen for the case study
Case Study: Estimating Weights
• Household attributes chosen
  - Household Type (5 cat.), Household Size (7 cat.), Household Income (8 cat.)
  - 280 different household types
• Person attributes chosen
  - Gender (2 cat.), Age (10 cat.), Ethnicity (7 cat.)
  - 140 different person types
• Household and Person type constraints were estimated using IPF
Case Study: Estimating Weights
Reduction in Average Absolute Relative Difference (δ) with the IPU algorithm
• Blockgroup A
  - δ reduced from 2.471 to 0.041 in 20 iterations
  - Corner Solution Reached
• Blockgroup B
  - δ reduced from 0.8151 to 0.00064 in 500 iterations
  - Near-perfect Solution Obtained
Case Study: Drawing Households
• Joint household distribution from IPF gives the frequencies of different household types to be drawn
• Proposed method of drawing households
  - IPF frequencies are rounded
  - The difference between the rounded frequency sum and the actual household total is adjusted
  - Households are drawn probabilistically based on IPU estimated weights for each Household Type
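
The drawing procedure can be sketched as follows. The presentation does not say how the rounding difference is adjusted, so the largest-remainder rule in `round_frequencies` is an assumption, and both function names are illustrative. The sample data reuse the household types, constraints, and final IPU weights of the IPU example.

```python
import math
import random


def round_frequencies(frequencies, total):
    """Round household-type frequencies so that they sum to `total`.

    Assumed rule: round down, then give the missing households to the
    types with the largest fractional remainders.
    """
    rounded = [math.floor(f) for f in frequencies]
    shortfall = max(total - sum(rounded), 0)
    by_remainder = sorted(range(len(frequencies)),
                          key=lambda i: frequencies[i] - rounded[i],
                          reverse=True)
    for i in by_remainder[:shortfall]:
        rounded[i] += 1
    return rounded


def draw_households(households_by_type, frequencies, rng):
    """Draw households per type with probability proportional to IPU weight.

    households_by_type[t] is a list of (household_id, ipu_weight) pairs.
    """
    drawn = []
    for members, count in zip(households_by_type, frequencies):
        ids = [hh_id for hh_id, _ in members]
        weights = [weight for _, weight in members]
        drawn.extend(rng.choices(ids, weights=weights, k=count))
    return drawn


households_by_type = [
    [(1, 1.36), (2, 25.66), (3, 7.98)],                           # HH Type 1
    [(4, 27.79), (5, 18.45), (6, 8.64), (7, 1.47), (8, 8.64)],    # HH Type 2
]
frequencies = round_frequencies([35.0, 65.0], total=100)
synthetic = draw_households(households_by_type, frequencies, random.Random(0))
print(frequencies, len(synthetic))
```

Households are drawn with replacement, so a sample household with a large IPU weight can appear many times in the synthetic population.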
Case Study: Algorithm Performance
• Average Absolute Relative Difference
  - Used for monitoring convergence of IPU
  - It masks the difference in magnitude between estimated and expected values
  - Cannot be used to measure the fit of the synthetic population
• Chi-squared Statistic (χ2)
  - Provides a statistical procedure for comparing distributions
  - The χ2 distribution with J-1 degrees of freedom gives the level of confidence
  - Confidence level very close to one is desired for the synthetic household draw
  - This was used to compare the joint distribution of the synthesized individuals with the IPF generated person joint distribution
Case Study: Algorithm Performance
• Blockgroup A: χ2 = 74.77, dof = 119, p-value = 0.999
• Blockgroup B: χ2 = 52.01, dof = 99, p-value = 1.000
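
The relation between the χ2 statistic, the degrees of freedom, and the reported p-values can be checked with a short script. This is a sketch using only the standard library: the upper-tail probability is computed from the power series of the regularized lower incomplete gamma function, which is adequate for moderate statistics such as those reported here. The function names are illustrative, and the statistic assumes cells with zero expected frequency are skipped.

```python
import math


def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over cells with expected > 0."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def chi_squared_upper_tail(statistic, dof):
    """P(X >= statistic) for X following a chi-squared distribution."""
    if statistic <= 0:
        return 1.0
    a, x = dof / 2.0, statistic / 2.0
    # Series: P(a, x) = x^a e^(-x) / Gamma(a + 1) * sum_n x^n / ((a+1)...(a+n))
    term = total = 1.0
    n = 0
    while term > 1e-16 * total and n < 100000:
        n += 1
        term *= x / (a + n)
        total += term
    lower = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


p_a = chi_squared_upper_tail(74.77, 119)   # Blockgroup A
p_b = chi_squared_upper_tail(52.01, 99)    # Blockgroup B
print(p_a, p_b)
```

Both probabilities come out very close to one, in line with the p-values reported for the two blockgroups.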
Computational Performance
• Synthetic Population was also generated for the entire Maricopa County
• Population synthesized for 2088 blockgroups
• A Dell Precision Workstation with a Quad Core Intel Xeon Processor was used
• Coded in Python, with a MySQL database
• Code was parallelized using the Parallel Python module
• Run time was ~4 hours, i.e. ~7 seconds per geography
• Please note that the actual processing time is ~28 seconds per geography, i.e. a single-core system would take approximately 28 seconds per geography
Population Synthesis: Flowchart
Step 1: Obtain Household and Person Level Constraints
• Marginals from Census Summary Files (SF) -> Marginals are corrected to account for the Zero-Marginal Problem
• Household and Person 5% PUMS Data -> Priors for a particular PUMA are corrected to account for the Zero-cell Problem
• Run IPF procedure to obtain Household and Person level joint distributions
• Go to Step 2
Population Synthesis: Flowchart
Step 2: Estimate Weights to satisfy the Household and Person level joint distributions from Step 1 using IPU
• Inputs: Household and Person 5% PUMS Data; column constraints for Household/Person types are obtained from Step 1
• Create Frequency Matrix D (N x m), where d(i,j) in the matrix gives the contribution of PUMS Household i to Household/Person type j
• Iteration: for all Household/Person Types, the weights of PUMS Households contributing to a particular Household/Person type are adjusted to match the corresponding constraint
• Compute Goodness of Fit δ
• If difference in δ for successive iterations < ε: Yes -> go to Step 3; No -> repeat the Iteration
Population Synthesis: Flowchart
Step 3: Drawing Households
• Round the Household level joint distributions from Step 1 and correct them for rounding errors; this gives the Frequency of Household types to be selected
• For each Household type, estimate the Household selection probability distribution using the IPU adjusted weights
• Create the synthetic population by randomly selecting Households based on the probability distributions computed for each Household type
• Compute a χ2 statistic, comparing the Person joint distribution of the synthetic population with the Person joint distribution from Step 1
• If the P-value corresponding to the χ2 statistic > 0.9999: Yes -> store the Synthetic population for the geography; No -> repeat the draw (Iteration)
In the Near Future
• Build a GUI
• Port the results to the geography’s polygon shape file
• Use PostgreSQL for databases
• Test the code on ASU’s High Performance Cluster
• Document the algorithm/program on a wiki
Thank You!
Questions & Comments…
Website: http://www.ined.fr