Page 1: AN MFA APPLICATION IN TUBERCULOSIS PREVALENCE ANALYSIS

AN APPLICATION IN TUBERCULOSIS PREVALENCE

MULTIPLE FACTOR ANALYSIS

WITH ESTIMATED DATA

Dec 4, 2014

ISyE 6405 Fall 2014 Project

Chaoyi Wu Farida Jariwala

OVERVIEW

Problem Statement

• Background and motivation

• Challenges

Methodology

• Extend PCA to MFA (multiple factor analysis)

• Consider interval data in MFA

Analysis

• Data

• MFA modeling in R: package “FactoMineR”

• Output analysis

Conclusion and next steps

Reference

Problem Statement

Background and motivation

Tuberculosis (TB) is an infectious bacterial disease that most commonly affects the lungs. It is second only to HIV/AIDS as the greatest killer worldwide due to a single infectious agent. In 2013, 9 million people fell ill with TB and 1.5 million died from the disease. People living with HIV are 26-31 times more likely to develop TB than people without HIV (WHO, 2014).

Identifying patterns among TB prevalence, HIV cases, healthcare resources and other factors can help curb TB prevalence.

Problem Statement

Challenges

• The number of potential factors that affect TB is large:
  o Healthcare input: TB immunization, TB testing, general health resource expenditure
  o HIV
  o Tobacco use
  o Other

  → Group the factors (variables) into categories and reduce dimensions with MFA.

• The data (e.g. population, cases) are estimates that come with variance.

  → Include the intervals in the pattern analysis with the vertices method symbolic PCA (V-SPCA).

Snapshot from the WHO dataset: http://apps.who.int/gho/data

Methodology

• Multiple factor analysis (MFA)

MFA analyzes observations described by several groups (sets) of variables in two steps:

(1) A PCA is performed on each group, which is then normalized. The same weight is assigned to every variable in a group: the inverse of the largest eigenvalue of that group's PCA.

(2) The normalized data sets are merged into a single matrix and a global PCA is performed on this matrix.

A variable can be continuous or categorical, but all variables within one group must be of the same type. (Abdi and Valentin, 2007)
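The two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the FactoMineR implementation: the group names "A" and "B" and their toy values are hypothetical, all variables are treated as continuous and standardized, and the largest eigenvalue is approximated by power iteration. Dividing every column of a group by the square root of the group's first eigenvalue is equivalent to giving each variable the weight 1/eigenvalue.

```python
import math

def standardize(columns):
    """Center each column and scale it to unit (population) variance."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        result.append([(x - mean) / sd for x in col])
    return result

def cross_products(columns):
    """Matrix of (1/n) * sum of products between centered columns."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]

def largest_eigenvalue(matrix, iterations=500):
    """Power iteration: dominant eigenvalue of a symmetric PSD matrix."""
    size = len(matrix)
    v = [1.0 + 0.1 * i for i in range(size)]  # asymmetric start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        if eigenvalue == 0.0:
            return 0.0
        v = [x / eigenvalue for x in w]
    return eigenvalue

def mfa_weighted_table(groups):
    """Step 1: per-group PCA normalization. Returns merged columns and eigenvalues."""
    merged, first_eigs = [], {}
    for name, cols in groups.items():
        z = standardize(cols)
        lam = largest_eigenvalue(cross_products(z))
        first_eigs[name] = lam
        scale = 1.0 / math.sqrt(lam)  # same as a variable weight of 1 / lam
        merged.extend([[x * scale for x in col] for col in z])
    return merged, first_eigs

# Hypothetical toy data: 5 observations, group "A" has 2 variables, "B" has 1.
groups = {
    "A": [[1, 2, 3, 4, 5], [2, 4, 5, 4, 5]],
    "B": [[5, 3, 4, 1, 2]],
}
merged, first_eigs = mfa_weighted_table(groups)

# Step 2: global PCA on the merged table (only its first eigenvalue here).
global_first = largest_eigenvalue(cross_products(merged))
print(first_eigs, global_first)
```

After the weighting, each group's own first eigenvalue equals 1, so no single group can dominate the first global dimension; the first global eigenvalue then lies between 1 and the number of groups.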

Methodology

• Use V-SPCA to deal with interval data in MFA

Vertices method symbolic PCA (V-SPCA) performs a classical PCA on interval data. Given a dataset 𝑿 that contains 𝑵 observations described by 𝒑 variables of interval type, derive a new dataset 𝑿𝑽 from it, in which each observation is replaced by the 2^𝒑 vertices of its interval hyperrectangle, and use the new dataset for PCA. (Zuccolotto, 2006)

The method is still valid for MFA.

Original interval data:

Country  Prevalent TB cases   Number of adults aged 15 and over living with HIV
Chile    3500 [1500-6400]     39 000 [25 000-61 000]

Derived vertex data:

Country  Prevalent TB cases   Number of adults aged 15 and over living with HIV
Chile1   1500                 25 000
Chile2   1500                 61 000
Chile3   6400                 25 000
Chile4   6400                 61 000
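The vertex expansion for the Chile row can be sketched as follows. This is a minimal illustration, assuming each interval is given as a (lower, upper) pair; the function name `vertices` and the keys `tb_cases` and `hiv_adults` are hypothetical labels, not names from the WHO dataset.

```python
from itertools import product

def vertices(name, intervals):
    """Expand one interval-valued observation into its 2**p vertex rows."""
    variables = list(intervals)
    rows = []
    corners = product(*(intervals[v] for v in variables))
    for k, corner in enumerate(corners, start=1):
        row = {"id": f"{name}{k}"}
        row.update(zip(variables, corner))
        rows.append(row)
    return rows

# Interval bounds for Chile as shown on the slide.
chile = {"tb_cases": (1500, 6400), "hiv_adults": (25000, 61000)}
rows = vertices("Chile", chile)
for row in rows:
    print(row)
```

With two interval variables each country yields four rows; variables without intervals would simply be repeated unchanged on every vertex row.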

Analysis

Data structure

• 7 countries: Sweden, Malaysia, Hungary, Sri Lanka, Chile, Mexico

• 19 variables in 8 groups; the data for TB and HIV are estimated with confidence intervals.

Group  Variable name       Note
TB     TB                  Prevalent TB cases in 2012 / total population (%)
HIV    HIV                 15+ living with HIV / 15+ population (%)
PM     PM10                2012 PM10 (annual mean, ug/m3)
TST    DST 2012            2012 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST    DST 2011            2011 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST    DST 2010            2010 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST    TB dgns clt 2012    2012 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST    TB dgns clt 2011    2011 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST    TB dgns clt 2010    2010 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST    TB dgns micro 2012  2012 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
TST    TB dgns micro 2011  2011 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
TST    TB dgns micro 2010  2010 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
GNI    GNI                 2012 Gross national income per capita (current USD)
HC     Health exps 2010    2010 Health expenditure (public) per capita
HC     Health exps 2011    2011 Health expenditure (public) per capita
HC     Health exps 2012    2012 Health expenditure (public) per capita
BCG    BCG 1 y Imz 1992    1992 Immunization, BCG (% of one-year-old children)
BCG    BCG 1 y Imz 2012    2012 Immunization, BCG (% of one-year-old children)
SMK    SMK 2011            2011 Smoking prevalence (% of adults)

Output analysis

• MFA with mean estimation

Eigenvalues          Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance             3.792   2.496   1.115   0.667   0.567   0.036
% of var             43.719  28.781  12.857  7.695   6.538   0.41
Cumulative % of var  43.719  72.5    85.357  93.052  99.59   100

Groups  Dim.1  cos2   Dim.2  cos2   Dim.3  cos2
TB      0.058  0.003  0.783  0.614  0.137  0.019
HIV     0.000  0.000  0.798  0.637  0.189  0.036
PM      0.560  0.314  0.002  0.000  0.007  0.000
TST     0.399  0.125  0.340  0.091  0.272  0.058
GNI     0.948  0.899  0.004  0.000  0.003  0.000
HC      0.969  0.939  0.000  0.000  0.007  0.000
BCG     0.829  0.688  0.105  0.011  0.007  0.000
SMK     0.027  0.001  0.463  0.215  0.495  0.245
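The "% of var" and "Cumulative % of var" rows follow directly from the eigenvalues: each eigenvalue is divided by their sum. A minimal check, using the rounded eigenvalues from the table above (so the results agree with the slide only to about two decimals):

```python
# Eigenvalues of the mean-estimate MFA as printed on the slide (rounded).
eigenvalues = [3.792, 2.496, 1.115, 0.667, 0.567, 0.036]

total = sum(eigenvalues)
percent = [100 * ev / total for ev in eigenvalues]

# Running sum of the percentages gives the cumulative row.
cumulative = []
running = 0.0
for share in percent:
    running += share
    cumulative.append(running)

for dim, (share, cum) in enumerate(zip(percent, cumulative), start=1):
    print(f"Dim.{dim}: {share:.2f}% of variance, {cum:.2f}% cumulative")
```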

Output analysis

• MFA with interval estimation

Eigenvalues          Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance             3.794   2.38    1.087   0.658   0.57    0.149
% of var             43.749  27.445  12.532  7.59    6.572   1.72
Cumulative % of var  43.749  71.194  83.726  91.316  97.888  99.608

Groups  Dim.1  cos2   Dim.2  cos2   Dim.3  cos2
TB      0.059  0.003  0.688  0.474  0.148  0.022
HIV     0.000  0.000  0.719  0.517  0.207  0.043
PM      0.564  0.318  0.005  0.000  0.002  0.000
TST     0.402  0.127  0.358  0.100  0.259  0.053
GNI     0.944  0.890  0.004  0.000  0.002  0.000
HC      0.966  0.934  0.000  0.000  0.006  0.000
BCG     0.834  0.695  0.102  0.010  0.005  0.000
SMK     0.026  0.001  0.505  0.255  0.457  0.208
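One possible reading of why the cumulative share here stops at 99.608% after Dim.6, while the mean-estimate run reaches 100%: a centered table with n rows has at most n - 1 non-zero dimensions, and the vertex expansion multiplies the number of rows. The sketch below assumes 7 countries, 19 variables, and that only the TB and HIV variables are expanded into vertices; the function name is hypothetical.

```python
def max_nonzero_dimensions(n_rows, n_columns):
    """Upper bound on the non-zero PCA dimensions of a centered table."""
    return min(n_rows - 1, n_columns)

N_COUNTRIES = 7
N_VARIABLES = 19
N_INTERVAL_VARIABLES = 2  # assumption: only TB and HIV carry intervals

rows_mean = N_COUNTRIES
rows_interval = N_COUNTRIES * 2 ** N_INTERVAL_VARIABLES  # 4 vertices per country

bound_mean = max_nonzero_dimensions(rows_mean, N_VARIABLES)
bound_interval = max_nonzero_dimensions(rows_interval, N_VARIABLES)

# Share of variance left beyond Dim.6 in the interval run, from the table.
residual_share = round(100 - 99.608, 3)

print(bound_mean, bound_interval, residual_share)
```

Under these assumptions the mean-estimate table cannot have more than six dimensions, whereas the vertex table can have more, which is consistent with a small remainder beyond Dim.6.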

Output analysis

• Comparison

Cumulative % of var  Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Mean                 43.719  72.5    85.357  93.052  99.59   100
Interval             43.749  71.194  83.726  91.316  97.888  99.608

Groups  Dim.1 Mean  Dim.1 Interval  Dim.2 Mean  Dim.2 Interval
TB      0.058       0.059           0.783       0.688
HIV     0.000       0.000           0.798       0.719
PM      0.560       0.564           0.002       0.005
TST     0.399       0.402           0.340       0.358
GNI     0.948       0.944           0.004       0.004
HC      0.969       0.966           0.000       0.000
BCG     0.829       0.834           0.105       0.102
SMK     0.027       0.026           0.463       0.505

The comparison shows that the two MFAs are almost the same (group coordinates differ by less than 0.01 on Dim.1 and by less than 0.1 on Dim.2). Why?
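To make "almost the same" concrete, the gap between the two runs can be computed per group from the comparison table. A small sketch using the slide's values; the helper name `largest_gap` is hypothetical.

```python
# (Dim.1, Dim.2) group coordinates copied from the comparison table.
mean_coords = {
    "TB": (0.058, 0.783), "HIV": (0.000, 0.798), "PM": (0.560, 0.002),
    "TST": (0.399, 0.340), "GNI": (0.948, 0.004), "HC": (0.969, 0.000),
    "BCG": (0.829, 0.105), "SMK": (0.027, 0.463),
}
interval_coords = {
    "TB": (0.059, 0.688), "HIV": (0.000, 0.719), "PM": (0.564, 0.005),
    "TST": (0.402, 0.358), "GNI": (0.944, 0.004), "HC": (0.966, 0.000),
    "BCG": (0.834, 0.102), "SMK": (0.026, 0.505),
}

def largest_gap(dim):
    """Return (group, absolute difference) with the largest gap on one dimension."""
    gaps = {
        group: round(abs(mean_coords[group][dim] - interval_coords[group][dim]), 3)
        for group in mean_coords
    }
    group = max(gaps, key=gaps.get)
    return group, gaps[group]

print("Dim.1:", largest_gap(0))
print("Dim.2:", largest_gap(1))
```

The largest shifts fall on TB and HIV along Dim.2, which are the two groups whose data carry intervals.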

Output analysis

• Pattern analysis with interval data

o The first two dimensions account for more than 70% of the variance

o {HIV, TB, SMK} and {PM, BCG, HC, GNI} are almost orthogonal

  1. TB is almost absent from the 1st dimension

  2. The variables in the latter set do not have a strong correlation with TB

o The individual contributions to the 2nd dimension call for further investigation: this dimension accounts for the difference between TB, HIV and tobacco use

  Guess: smoking prevalence in 2011 is not a reasonable factor

  It confirms that HIV and TB are highly correlated

o The first dimension can be considered a score for economics: the higher, the better (see the individual factor map in slide 9)

Variable  Contribution to Dim.2
TB        -0.948
HIV       -0.794
SMK.2011  0.71

Variable            Contribution to Dim.1
PM10                -0.751
DST.2012            0.656
DST.2011            0.656
DST.2010            0.67
TB.dgns.clt.2012    -0.1
TB.dgns.clt.2011    -0.093
TB.dgns.clt.2010    -0.117
TB.dgns.micro.2012  -0.532
TB.dgns.micro.2011  -0.499
TB.dgns.micro.2010  -0.556
GNI                 0.971
Health.exps.2010    0.985
Health.exps.2011    0.984
Health.exps.2012    0.98
BCG.1.y.Imz.1992    -0.87
BCG.1.y.Imz.2012    -0.94
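The near-orthogonality of the two sets of groups can be checked against the group coordinates reported for the interval-estimation MFA. The sketch below assigns a group to a dimension when its coordinate on that dimension is at least 0.5; that cut-off is an arbitrary assumption chosen for illustration, not a value from the analysis.

```python
# (Dim.1, Dim.2) group coordinates from the interval-estimation MFA.
group_coords = {
    "TB": (0.059, 0.688), "HIV": (0.000, 0.719), "PM": (0.564, 0.005),
    "TST": (0.402, 0.358), "GNI": (0.944, 0.004), "HC": (0.966, 0.000),
    "BCG": (0.834, 0.102), "SMK": (0.026, 0.505),
}

THRESHOLD = 0.5  # assumed cut-off for "strongly linked to a dimension"

dim1_groups = [g for g, (d1, _) in group_coords.items() if d1 >= THRESHOLD]
dim2_groups = [g for g, (_, d2) in group_coords.items() if d2 >= THRESHOLD]
unassigned = [g for g in group_coords
              if g not in dim1_groups and g not in dim2_groups]

print("Dim.1:", dim1_groups)
print("Dim.2:", dim2_groups)
print("Neither:", unassigned)
```

With this cut-off no group loads on both dimensions, and TST sits between the two sets.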

Conclusion and next steps

• HIV and TB are correlated

• V-SPCA doesn’t make a difference in this project

• Factor (variable) selection needs to be improved

• Model validation needs to be conducted

• The analysis of the dimensions can be more detailed

Reference

Abdi, H. and Valentin, D. (2007). Multiple Factor Analysis. In Neil Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks: Sage.

Zuccolotto, P. (2006). Principal Components of Sample Estimates: an Approach through Symbolic Data Analysis. Stat Meth & Appl. 16. 173-192. Springer (Verlag). DOI: 10.1007/s10260-006-0024-6.

Le, S., Josse, J. and Husson, F. (2008). FactoMineR: an R package for multivariate analysis. Journal of Statistical Software. 25(1). American Statistical Association.

WHO. (2014). World Health Organization/Media center/Tuberculosis: http://www.who.int/mediacentre/factsheets/fs104/en/

The World Bank. (2014). The World Bank/Data: http://www.worldbank.org/

Thank you

Q & A