multivariate parametric analysis of interval data - dc.uba.ar · multivariate parametric analysis...
TRANSCRIPT
Outline
Multivariate Parametric Analysis of Interval Data
Paula Brito
Fac. Economia & LIAAD-INESC TEC, Universidade do Porto
ECI 2015 - Buenos AiresT3: Symbolic Data Analysis:
Taking Variability in Data into Account
Joint work with A. Pedro Duarte Silva and Jose G. Dias
P. Brito ECI - Buenos Aires - July 2015
Outline
Outline
1 Interval Data
2 Models for interval data
3 (M)ANOVA
4 Discriminant Analysis
5 Model-based ClusteringModel-Based Clustering Applications
6 R Package
7 Conclusions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Outline
1 Interval Data
2 Models for interval data
3 (M)ANOVA
4 Discriminant Analysis
5 Model-based ClusteringModel-Based Clustering Applications
6 R Package
7 Conclusions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Interval data
Y1 . . . Yj . . . Yp
s1 [l11, u11] . . . [l1j , u1j ] . . . [l1p, u1p]. . . . . . . . . . . .si [li1, ui1] . . . [lij , uij ] . . . [lip, uip]. . . . . . . . . . . .sn [ln1, un1] . . . [lnj , unj ] . . . [lnp, unp]
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Native Interval Data
Temperatures and pluviosity measured in 283 meteorologicalstations in the USA:temperature ranges in January and July, annual pluviosity range
Station State January July AnnualTemperature Temperature Pluviosity
HUNTSVILLE AL [32.3, 52.8] [69.7, 90.6] [3.23, 6.10]ANCHORAGE AK [9.3, 22.2] [51.5, 65.3] [0.52, 2.93]
NEW YORK (JFK) NY [24.7, 38.8] [66.7, 82.9] [2.70, 4.13]. . . . . . . . . . . . . . .
SAN JUAN PR [70.8, 82.4] [76.9, 87.4] [2.14, 6.17]
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Example: Price of different car models
0 100 200 300 400
Price [in 1000 Euro]
PassatSkodaOctaviaSkodaFabiaRover75Rover25PorscheVectraCorsaNissanMicraMercedesSMercedesEMercedesCMercedesSLLanciaKHondaNSKFocusPuntoFerrariBmw7Bmw5Bmw3AudiA8AudiA6AudiA3Alfa166Alfa156Alfa145
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Example: Price of different car models
0 100 200 300 400
Price [in 1000 Euro]
PassatSkodaOctaviaSkodaFabiaRover75Rover25PorscheVectraCorsaNissanMicraMercedesSMercedesEMercedesCMercedesSLLanciaKHondaNSKFocusPuntoFerrariBmw7Bmw5Bmw3AudiA8AudiA6AudiA3Alfa166Alfa156Alfa145
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Example: Price of different car models
0 100 200 300 400
Price [in 1000 Euro]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ][ ]
[ ]
PassatSkodaOctaviaSkodaFabiaRover75Rover25PorscheVectraCorsaNissanMicraMercedesSMercedesEMercedesCMercedesSLLanciaKHondaNSKFocusPuntoFerrariBmw7Bmw5Bmw3AudiA8AudiA6AudiA3Alfa166Alfa156Alfa145
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Outline
1 Interval Data
2 Models for interval data
3 (M)ANOVA
4 Discriminant Analysis
5 Model-based ClusteringModel-Based Clustering Applications
6 R Package
7 Conclusions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric models for interval data
Most existing methods: non-parametric descriptive approachesOur goal: parametric inference methodologies−→ probabilistic models for interval variables
For each si , Yj(si ) = Iij = [lij , uij ] is naturaly definedby the lower and upper bounds lij and uij
For modeling purposes → preferable equivalent parametrization:
Represent Yj(si ) by
the midpoint cij =lij + uij
2the range rij = uij − lij
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric Models for interval data
Gaussian model:
Assume that the joint distribution of the midpoints C and the logsof the ranges R is multivariate Normal:
R∗ = ln(R), (C ,R∗) ∼ N2p(µ,Σ)
µ = [µtC , µtR∗ ]t ; Σ =
(ΣCC ΣCR∗
ΣR∗C ΣR∗R∗
)µC and µR∗ - p-dimensional column vectors of the mean values
ΣCC ,ΣCR∗ ,ΣR∗C and ΣR∗R∗ - p × p matrices
Model advantage :Straightforward application of classical inference methods
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric Models for interval data
Intervals’ midpoints : location indicators → assuming a jointNormal distribution corresponds to the usual Gaussianassumption
Log transformation of the ranges → to cope with their limiteddomain
This model implies :
marginal distributions of the midpoints are Normals
marginal distributions of the ranges are Log-Normals
specific relation between mean, variance and skewness for theranges
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric Models for interval data
More general models that try to alleviate limitations of themultivariate Normal distribution
Skew-Normal model:
Assume that the joint distribution of the midpoints C and the logsof the ranges R is multivariate Skew-Normal :
(C ,R∗) ∼ SN2p(ξ,Ω, α)
Skew-Normal distribution (Azzalini, 1985):
Generalizes the Gaussian distributionIntroducing an additional shape parameterPreserves some of its mathematical propertiesAlternative parametrization (traditional moments):SN2p(µ,Σ, γ1) (Arellano-Valle & Azzalini, 2008)
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Density of a p-dimensional Skew-Normal distribution :f (y ;α, ξ,Ω) = 2φp(x − ξ; Ω)Φp(αtω−1(x − ξ)), x ∈ IRp
ξ and α are p-dimensional vectors, Ω is a symmetric p × ppositive-definite matrix,ω is a diagonal matrix formed by the square-roots of the diagonalelements of Ωφp,Φp are, respectively, the density and the distribution function ofa p-dimensional standard Gaussian vector
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric Models for interval data
However, for interval data:
Midpoint cij and Range rij of the value of an interval-valuedvariable are two quantities related to one only variable→ should not be considered separately
So : parameterizations of the global covariance matrix →take into account the link that may exist between midpoints andlog-ranges of the same or different variables
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Models for interval data
Most general formulation : allow for non-zero correlations amongall midpoints and log-ranges; other cases of interest:
The interval variables Yj are non-correlated, but for eachvariable, the midpoint may be correlated with its log-range;
Midpoints (respectively, log-ranges) of different variables maybe correlated, but no correlation between midpoints andlog-ranges is allowed;
Midpoints (respectively, log-ranges) of different variables maybe correlated, the midpoint of each variable may be correlatedwith its range, but no correlation between midpoints andlog-ranges of different variables is allowed.
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Models for interval data
Config. Characterization Σ
1 Non-restricted Non-restricted2 Cj not-correlated with R∗
` , ` 6= j ΣCR∗ = ΣR∗C diagonal3 Yj ’s non correlated ΣCC ,ΣCR∗ = ΣR∗C ,ΣR∗R∗ all diagonal4 C ’s non-correlated with R∗’s ΣCR∗ = ΣR∗C = 05 All C ’s and R∗’s are non-correlated Σ diagonal
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric Models for interval data
Σ =ΣCC ΣCR*
ΣR*C ΣR*R*
Configuration 1
0
0
0
00
0
0
00
0
0
00
00
0
0
00
0
Configuration 2 Configuration 3 Configuration 4 Configuration 5
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Models for interval data
Configuration 2 is a particular case of 1
Both 3 and 4 are particular cases of 2
Configuration 5 is a particular case of all the others
In cases 3, 4 and 5, Σ can be written as a diagonal by blocksmatrix
Configuration 4 : the matrix Σ is formed by two p × p blocks,
Configuration 3 : there are p 2× 2 blocks
Configuration 5 : the 2p blocks are single real elements
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric analysis of interval data: ML estimation
Gaussian model:
For all configurations,
ln L(µ,Σ) =
−np ln(2π)− n
2ln |Σ| − 1
2trEΣ−1 − n
2
(X − µ
)tΣ−1
(X − µ
)Σ−1 is symmetric positive definite ⇒maximum-likelihood estimate of the mean vector is always X
Maximization of the likelihood function with respect to Σ reducesto maximizing
ln L(µ,Σ) = constant − n
2ln |Σ| − 1
2trEΣ−1
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric analysis of interval data: ML estimation
Configurations 3, 4 and 5, Σ is subject to the constraints
In these cases Σ can be written as a diagonal by blocks matrix,after a possible rearrangement of rows and columns
The maximum can be obtained by separately maximizing withrespect to each block of Σ
This result is not valid for configuration 2:standard numerical procedures
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric analysis of interval data: ML estimation
Skew-Normal model:
Log-likelihood of a p-dimensional Skew-Normal distribution:
l = constant − 12nln |Ω| −
n2 tr(Ω−1V ) +
∑i ζ0(αtω−1(xi − ξ))
where V = n−1∑
i (xi − ξ)(xi − ξ)t and ζ0(x) = ln(2Φ(x))
Configuration 1 (Azzalini and Capitanio, 1999):
the log-likelihood can be re-parametrized asl = constant − 1
2nln |Ω| −n2 tr(Ω−1V ) +
∑i ζ0(η(xi − ξ))
Then, for each ξ and η the log-likelihood is maximized on Ω byΩ = V
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Parametric analysis of interval data: ML estimation
Other configurations:
Let θ = (ξ,Ω, η) = θ(ψ) with ψ = (µ,Σ, γ1)
We maximize numerically the log-likelihood of θ(ψ) using asarguments the free elements of µ,Σ, γ1, subject to admissibilityrestrictions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Outline
1 Interval Data
2 Models for interval data
3 (M)ANOVA
4 Discriminant Analysis
5 Model-based ClusteringModel-Based Clustering Applications
6 R Package
7 Conclusions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
(M)ANOVA : Objectives
Comparison of means of one or more numerical variablesin two or more populations,from which random samples were drawn
Example : compare the mean value of the sales of a given productin different shops.
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
(M)ANOVA
ANOVA :Compares the variance within samples (groups) − residual variancewith the variance between samples (groups) variance explained bythe factor.
If the residual variance is very low as compares to the explainedvariance − due to the factorwe may conclude that the mean values of the variable under studyare different between groups
MANOVA :Compares the determinant of the within-groups matrix Wwith the determinant of the global matrix T
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
(M)ANOVA
→ Likelihood ratio approach
Each interval-valued variable Yj is modelled by the pair (Cj ,R∗j ) ⇒
analysis of variance of Yj : two-dimensional MANOVA of (Cj ,R∗j )
Gaussian and Skew-Normal model:
Maximize the log-likelihood for the null (mean/location vectorsequal across groups) and the alternative hypothesis
In all cases, under the null hypothesis, 2lnλ follows asymptoticallya chi-square distribution with n − k degrees of freedom
Simultaneous analysis of all the Y ’s may be accomplished by a 2pdimensional MANOVA, following the same procedure
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
(M)ANOVA
For one interval-variable :
Likelihood ratio statistic
λ =
(|Ealt ||Enull |
) n2
Enull and Ealt are 2× 2 matricescorresponding to the null (means equal accross groups)and alternative (means different) hypothesis
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
(M)ANOVA
Simulation study:
When sample sizes are not too small:
Tests have good power
True significance level approaches nominal levels when theconstraints assumed for the model are respected
Method assuming data is Normal with configuration 1(non-restricted) never performs worse than any other methodwhen data is indeed Normal
Skew Normal model requires large samples
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
(M)ANOVA : Example
Temperatures measured in meteorological stations in northernChina.Data : intervals of observed temperatures (Celsius scale) in each ofthe four quarters, Q1 to Q4, of the years 1974 to 1988 in 22stations.
Station Region Q1 Q2 Q3 Q4
Beijing-1974 North [−9.5, 10.6] [6.5, 29.8] [12.6, 29.6] [−10.44, 9.06]Beijing-1975 North [−8.6, 12.9] [7.9, 30.2] [15.0, 31.6] [−7.0, 19.2]
. . . . . . . . . . . . . . . . . .ZhangYe-1988 Northwest [−15.4, 7.2] [2.3, 26.4] [8.6, 30.2] [−12.0, 15.1]
The full table comprises n = 22× 15 = 330 rows and 4 columns.
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
(M)ANOVA : Example
The 22 meteorological stations belong to 3 different regions inChina (North, Northwest, Northeast):MANOVA performed to assess whether the regions are differentMODEL 2 lnλ DF P-VALUE
NORM 1 480.2475 16 < 1E-10
NORM 2 527.4521 16 < 1E-10
NORM 3 989.9340 16 < 1E-10
NORM 4 529.2541 16 < 1E-10
NORM 5 1057.9210 16 < 1E-10
SkN 1 447.4244 16 < 1E-10SkN 2 518.1720 16 < 1E-10
SkN 3 974.0240 16 < 1E-10
SkN 4 530.3980 16 < 1E-10
SkN 5 1110.3840 16 < 1E-10
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Outline
1 Interval Data
2 Models for interval data
3 (M)ANOVA
4 Discriminant Analysis
5 Model-based ClusteringModel-Based Clustering Applications
6 R Package
7 Conclusions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Discriminant Analysis
Gaussian model:
For each configuration, an estimate of the optimum classificationrule can be obtained with the corresponding Σ
Direct generalisation of the classical linear and quadraticdiscriminant classification rules
Linear :
Y = argmaxg (µgtΣ−1X − 1
2 µgtΣ−1µg + log πg )
Quadratic :
Y = argmaxg (−12X
tΣg−1
X + µgtΣg
−1X + log πg −
12 (log detΣg + µg
tΣg−1µg ))
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Discriminant Analysis
Skew-Normal model:
Three different alternatives may be considered:
1 the groups differ only in terms of µ;
2 the groups differ in terms of both µ and Σ;
3 the groups differ in terms of µ, Σ and γ1.
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Discriminant Analysis
Skew-Normal model:Considering cases 1) and 3) :
Location :
Y = argmaxg (ξgtΩ−1X− 1
2 ξgtΩ−1ξg +log πg +ζ0(αt ω−1(X−ξg )))
General :
Y = argmaxg (−12X
tΩg−1
X + ξgtΩg−1
X + log πg −12 (log det Ωg + ξg
tΩg−1ξg ) + ζ0(αg
t ωg−1(X − ξg )))
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Discriminant Analysis
Experimental results
Parametric rules generally outperform distance-based onesHomocedastic problems: linear discriminant rules perform bestLarge training samples and heterocedastic conditionsquadratic methods are usually superiorSmall training samples in heterocedastic problems: restrictedquadratic rules are preferable
even in some cases where the model assumed is not true
Restricted configurations 2 - 5:
provide a natural way of imposing constraintsare effective in reducing expected error ratesfor heterocedastic problems with small or moderate trainingsamples
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Outline
1 Interval Data
2 Models for interval data
3 (M)ANOVA
4 Discriminant Analysis
5 Model-based ClusteringModel-Based Clustering Applications
6 R Package
7 Conclusions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Model-Based Clustering
f (xi ;ϕ) =k∑
`=1
π`f`(xi ; Θ`),
Maximum likelihood (ML) parameter estimation →maximization of the log-likelihood function:
`(ϕ; x) =n∑
i=1
ln f (xi ;ϕ)
Expectation-Maximization (EM) algorithm
Trying to avoid local optima → each search of the EM algorithm isreplicated from different starting points
Selection of the model and number of components (K ) →Bayesian Information Criterion : BIC= −2`(ϕ; x) + dϕln(n)
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Income-debt application
Four interval-valued variables:Household Income (HI), Debt to Income Ratio (X 100) (DIR),Credit Card Debt (CCD) and Other Debts (OD)
5000 individual observations aggregated on the basis of:Gender, Age Category, Education Level and Job Category
→ 297 groups described by the intervals of observed values:
Group HI DIR CCD ODMale, 8-24 [15, 61] [0.1, 23.4] [0.0, 6.57] [0.02, 7.71]
High school degree, ServiceMale, 35-49, College degree, [19, 190] [1.4, 20.4] [0.04, 16.6] [0.12, 15.39]
Sales and OfficeFemale, 25-34, Some college [17, 100] [0.8, 31.7] [0.05, 6.57] [0.09, 7.65]Managerial and Professional
......
......
...P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Income-debt application
Income-debt data - BIC values :
Nb. Homocedastic Heterocedasticcomp.
K Case 1 Case 2 Case 3 Case 4 Case 1 Case 2 Case 3 Case 42 9336.91 10312.07 10304.42 11557.28 8533.95 9222.99 9770.38 10641.343 9188.08 9985.10 10117.64 10733.92 7836.8 8057.74 9504.75 9890.294 9088.40 9546.50 9959.48 10273.03 7699.80 7772.59 9318.83 9534.245 9061.25 9406.52 9885.04 10124.77 7522.50 7281.54 9148.90 9341.716 8956.95 9366.33 9829.76 10000.48 7564.48 7055.61 9138.57 9174.657 8939.25 9171.05 9713.20 9897.91 — 7011.45 9160.22 9051.778 8902.65 9050.61 9654.98 9861.26 — 6859.76 9128.54 8977.989 8838.82 8992.64 9590.98 9780.71 — 6831.01 9278.78 8925.87
10 8852.52 9054.11 9595.48 9711.42 — 6870.38 9249.67 8924.33
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Income-debt application
Income-debt data - Component Proportions and Mean-Vectors
C1 C2 C3 C4 C5 C6 C7 C8 C9Proportions 0.112 0.193 0.149 0.152 0.083 0.028 0.218 0.053 0.013Hinc-MidP 35.72 34.25 124.31 78.86 138.88 221.78 86.88 141.69 495.38Hinc-LogR 3.08 3.50 5.33 4.75 5.40 5.95 4.74 4.90 6.87
DIncR-MidP 7.91 12.51 15.10 13.43 13.06 17.64 11.62 11.39 16.16DIncR-LogR 2.10 2.99 3.28 3.20 3.04 3.41 2.84 2.01 3.30CCDbt-MidP 0.75 1.480 6.87 3.34 9.68 16.92 3.26 3.617 40.23CCDbt-LogR -0.06 0.93 2.57 1.87 2.90 3.45 1.71 1.46 4.29ODbt-MidP 1.73 2.65 13.06 6.44 13.09 14.72 5.97 8.73 56.73ODbt-LogR 0.53 1.49 3.22 2.48 3.08 3.34 2.29 2.23 4.68
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Income-debt application
Component mean-values: to distinguish clusters bothMidpoints and Log-Ranges need to be considered
Estimated variance-covariance matrices clearly different acrossgroups
Noteworthy though different correlations between MidPointsand Log-Ranges of the same interval-valued variables
All correlations are positive: the higher the Midpoint thehigher the corresponding intrinsic variability
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Labour force survey
Data from Portuguese Labour Force Survey, 1st semester of 2008
1540 cases : people who were unemployed at the time of the survey
Two variables:Activity Time, in years (AT)Unemployment Time, in months (UT)
Micro data were gathered on the basis ofGender, Region, Age-Group and Education →58 sociological groups
Lowest BIC value:solution in 5 components, heterocedastic setup,Case 2 - independent interval-valued variables
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Labour force survey
Unemployement dataComponent Proportions, Mean-Vectors and Variances
C1 C2 C3 C4 C5Proportions 0.271 0.206 0.227 0.103 0.191
AT MidP 8.662 3.785 23.237 33.750 26.627Mean AT LogR 2.553 1.600 3.197 3.578 2.748Values UT MidP 31.985 7.495 66.990 150.500 18.869
UT LogR 4.042 2.110 4.849 5.690 3.060AT MidP 17.065 4.963 73.153 57.479 78.060
Variances AT LogR 0.340 0.544 0.119 0.040 0.182UT MidP 101.545 7.841 230.649 468.583 54.774UT LogR 0.113 0.685 0.057 0.021 0.225
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Labour force survey
Although the number of observations is relatively low→ a restricted though heterocedastic model has beenidentified as best fit
The method chose the best parameters for clustering,preferring a heterocedastic model to a “lighter” homocedasticone → picking up a restricted configuration for thevariance-covariance matrix
Choosing Case 2 as opposed to Case 3 → correlation betweenthe two parts of the interval-variables is considered moreimportant than correlation between different variables
Components can only be separated by consideringsimultaneously Midpoints and Log-Ranges
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
USA meteorological data application
This dataset records temperatures and pluviosity measured in 282meteorological stations in the USA.
Three interval-valued variables, measured in each station:
Temperature ranges in January
Temperature ranges in July
Annual pluviosity ranges
The lowest BIC value is observed for the unrestricted (Case 1)heterocedastic solution with 6 natural clusters:
1: Arid Inland West, 2: Alaska,3: Southeast, 4: Northeast and Midwest,5: Pacific Islands and Puerto Rico, 6: Pacific Coast
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
USA meteorological data application
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
USA meteorological data application
Clusters are differentiated not only by the MidPoint but alsoby the Log-Range variables
Moreover, clusters display highly different variances
Alaska cluster presents very high variance for the JanuaryMidPoint variable, while Arid Inland West and Pacific Coastclusters present high variances for the July MidPoint variablethe Alaska cluster has a high variance for the Log-Range of thepluviosity and the Pacific Coast cluster for the Log-Range ofthe temperature in July
This stark difference illustrates well the need of aheterocedastic setup for these data
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
USA meteorological data application
Ward hierarchical clustering, partition in 6 clusters:
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Model-Based Clustering Applications
Model-Based Clustering : Conclusions
Proposed modelling successfully applied to real data sets ofdifferent nature and size
Adopting configurations adapted to interval data proved to bethe adequate approach
Important to consider both the information about
position - conveyed by the MidPointsintrinsic variability - conveyed by the LogRanges
when analysing interval data
Flexibility of the model in identifing heterocedastic models
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Outline
1 Interval Data
2 Models for interval data
3 (M)ANOVA
4 Discriminant Analysis
5 Model-based ClusteringModel-Based Clustering Applications
6 R Package
7 Conclusions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
R Package
MAINT-Data - available at CRAN → Gaussian model
Specialized data classes for interval-data
Methods for Maximum Likelihood Estimation
ANOVA and MANOVA
Linear and Quadratic Discriminant Analysis
. . .
Extensions:
Multivariate Skew-Normal distributions
Robust estimation
Linear Regression
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Overview
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
The IData class
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
The IdtE classes: Single Distributions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
The IdtE classes: Homocedastic Mixtures
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
The IdtE classes: Heterocedastic Mixtures
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Example: Creating Idata Objects
library(MAINT.Data)
ChinaT < − IData(ChinaTemp[1:8])VarNames < − c(”Q1”,”Q2”, ”Q3”,”Q4”)
#Display the first three observations
head(ChinaT,n=3)
Q1 Q2 Q3 Q4AnQing 1974 [ 0.673, 14.827] [13.435, 28.465] [19.821, 31.179] [2.216, 9.984]AnQing 1975 [ 2.319, 14.381] [12.829, 28.471] [23.192, 32.308] [1.013, 10.987]AnQing 1976 [ 0.906, 12.494] [11.795, 28.405] [19.680, 34.120] [2.992, 10.308]
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Example: MANOVA tests
ManvChina <- MANOVA(ChinaT,ChinaTemp$GeoReg)print(ManvChina)
Null Model Log likelihoods:NC1 NC2 NC3 NC4 NC5
-7336.254 -8331.416 -11564.904 -8390.351 -12648.760Full Model Log likelihoods:
NC1 NC2 NC3 NC4 NC5-6209.280 -6820.555 -9049.276 -6857.536 -9450.228Full Model Akaike Information Criteria:
NC1 NC2 NC3 NC4 NC512586.56 13793.11 18234.55 13851.07 19012.46Selected Model:[1]”NC1”
Null Model log-likelihood: -7336.254Full Model log-likelihood: -6209.28Qui-squared statistic: 2253.949degrees of freedom: 40p-value ≈ 0
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Example: Linear Discriminant Analysis
Chinalda < − lda(ManvChina)
PredRes < − predict(Chinalda,ChinaT)
#Estimate error rates by ten-fold cross-validation
CVlda < − DACrossVal(ChinaT,ChinaTemp$GeoReg,TrainAlg=lda,Config=BestModel(ManvChina@H1res),CVrep=1)
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Outline
1 Interval Data
2 Models for interval data
3 (M)ANOVA
4 Discriminant Analysis
5 Model-based ClusteringModel-Based Clustering Applications
6 R Package
7 Conclusions
P. Brito ECI - Buenos Aires - July 2015
Interval DataModels for interval data
(M)ANOVADiscriminant Analysis
Model-based ClusteringR Package
Conclusions
Conclusions
Parametric models specific for interval-valued variables
Model-based multivariate analysis of interval data
(M)ANOVADiscriminant analysisClustering - finite mixture modelling
Models and methods partially implemented in the R packageMAINT.Data, available on CRAN
Experimental results show the pertinence and usefulness ofthe proposed approach
P. Brito ECI - Buenos Aires - July 2015