WORKSHOP ZAGREB JUNE 2004

Correspondence analysis for data mining with applications in medicine

Annie Morin, IRISA, France

amorin@irisa.fr

Correspondence analysis

• Statistical visualization method for displaying the associations between the levels of a two-way contingency table and the distances between the categories of each variable => exploratory method

• Usually, a contingency table is analysed with the chi-square test of independence

CA

• Duality between the rows and the columns

• Use of the row profiles and of the column profiles

• Use of the chi-square distance (distributional equivalence)

• Factorial analysis method (eigenvalues of an ad hoc matrix) and reduction of dimensionality

Example: frequency table

        heart  forest  surgery  animal  total
D1         11      20        4       1     37
D2          3       9        2       2     16
D3          1       5        2       3     11
D4          3      14        3      16     36
total      18      48       12      21    100

Row profiles (%)

            heart  forest  surgery  animal  total
D1             31      54       12       5    100
D2             16      58       15      11    100
D3              8      45       22      25    100
D4              9      39        8      44    100
mean prof      18      48       12      22    100

Column profiles (%)

        heart  forest  surgery  animal  mean prof
D1         63      42       37       6         37
D2         14      19       20       8         16
D3          5      10       20      13         11
D4         19      29       24      74         36
total     100     100      100     100        100
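
The profiles can be recomputed from the frequency table. The sketch below assumes NumPy and uses the cell counts of the example; the function names are illustrative, not taken from the slides. The slide figures are rounded and their margins do not all agree with the cell sums, so the printed percentages may differ slightly from the tables shown here.

```python
import numpy as np

# Cell counts of the example (rows D1..D4; columns heart, forest, surgery, animal).
counts = np.array([
    [11, 20, 4, 1],
    [3, 9, 2, 2],
    [1, 5, 2, 3],
    [3, 14, 3, 16],
], dtype=float)

def row_profiles(table):
    """Divide each row by its own total, so that every row sums to 1."""
    return table / table.sum(axis=1, keepdims=True)

def column_profiles(table):
    """Divide each column by its own total, so that every column sums to 1."""
    return table / table.sum(axis=0, keepdims=True)

def mean_row_profile(table):
    """Profile of the column totals (the column masses)."""
    return table.sum(axis=0) / table.sum()

rows = row_profiles(counts)
cols = column_profiles(counts)
mean_prof = mean_row_profile(counts)

print(np.round(100 * rows))       # row profiles in percent
print(np.round(100 * cols))       # column profiles in percent
print(np.round(100 * mean_prof))  # mean row profile in percent
```

The mean row profile is the average of the row profiles weighted by the row masses, which is why it appears as the last line of the row-profile table.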

[Figure: display of the rows D1, D2, D3, D4 and the columns heart, forest, surgery, animal]

Distances

Between two columns

Between two rows
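
For reference, the chi-square distance named earlier is usually written as follows. The notation is the standard one and is assumed here, not taken from the slide: f_{ij} is the relative frequency of cell (i, j), and f_{i.} and f_{.j} are the row and column margins.

```latex
% Between two columns j and j'
d^2(j, j') = \sum_{i} \frac{1}{f_{i\cdot}}
  \left( \frac{f_{ij}}{f_{\cdot j}} - \frac{f_{ij'}}{f_{\cdot j'}} \right)^2

% Between two rows i and i'
d^2(i, i') = \sum_{j} \frac{1}{f_{\cdot j}}
  \left( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \right)^2
```

Each squared difference between profile coordinates is weighted by the inverse of the corresponding margin, which is what gives the distance its distributional equivalence property.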

• Diagonalization of a « covariance matrix » to find the eigenvalues and corresponding eigenvectors

• λ1 ≥ λ2 ≥ … ≥ λp

• Inertia of the cloud is ∑λi = χ² / n

• Distance to the independence model
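
As a minimal sketch of this step, the code below computes the eigenvalues of the example table and checks that their sum equals the chi-square statistic divided by n. It assumes NumPy and swaps the diagonalization mentioned on the slide for an equivalent singular value decomposition of the standardized residuals: the squared singular values are the eigenvalues. Function names are illustrative.

```python
import numpy as np

def ca_eigenvalues(table):
    """Principal inertias (eigenvalues) of a two-way table, largest first."""
    table = np.asarray(table, dtype=float)
    P = table / table.sum()            # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    singular_values = np.linalg.svd(S, compute_uv=False)
    return singular_values ** 2

def chi_square(table):
    """Pearson chi-square statistic of the independence test."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

counts = np.array([[11, 20, 4, 1], [3, 9, 2, 2], [1, 5, 2, 3], [3, 14, 3, 16]])
eigenvalues = ca_eigenvalues(counts)
total_inertia = eigenvalues.sum()

print(eigenvalues)
print(total_inertia, chi_square(counts) / counts.sum())  # the two values agree
```

The last eigenvalue is numerically zero because centering on the independence model removes one dimension, so a table with p columns and at least p rows has at most p - 1 non-trivial axes.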

Simultaneous representation

• Of the row profiles and of the column profiles on the same factorial plane

• Validity of the representation:

– Inertia: contributions that describe the proportion of an axis's inertia provided by each element (row or column profile) in building that axis

– Quality of representation of each element by the axes
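
The two validity indicators can be sketched as follows, assuming NumPy and the usual definitions: the contribution of a point to an axis is its mass times its squared coordinate divided by the eigenvalue, and the quality is the squared cosine between the point and the axis. Function names are illustrative, and the example reuses the small frequency table.

```python
import numpy as np

def ca_rows(table):
    """Row principal coordinates, row masses and eigenvalues of a two-way table."""
    table = np.asarray(table, dtype=float)
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    k = min(table.shape) - 1                       # number of non-trivial axes
    F = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]  # principal coordinates
    return F, r, sv[:k] ** 2

def contributions(F, masses, eigenvalues):
    """Share of each axis's inertia due to each point (each column sums to 1)."""
    return masses[:, None] * F ** 2 / eigenvalues

def quality(F):
    """Squared cosines: share of each point's inertia carried by each axis."""
    return F ** 2 / (F ** 2).sum(axis=1, keepdims=True)

counts = np.array([[11, 20, 4, 1], [3, 9, 2, 2], [1, 5, 2, 3], [3, 14, 3, 16]])
F, masses, eigenvalues = ca_rows(counts)
ctr = contributions(F, masses, eigenvalues)
cos2 = quality(F)

print(np.round(ctr, 3))   # rows D1..D4, one column per axis
print(np.round(cos2, 3))
```

Passing the transposed table gives the same indicators for the column profiles.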

Applications in medicine

• Pharmacology

• Therapeutic trials (to avoid double-blind procedures): CA allows the physician to follow the evolution of the illness and/or of the therapy

• Textual analysis: reports, business intelligence, bibliometrics

Application to mucoviscidosis

• Mucoviscidosis (cystic fibrosis): rare disease

– No specific keywords

– No specific journals

• Goal: to define a minimum common vocabulary for the researchers working on mucoviscidosis (clinicians, geneticists, etc.)

HYPOTHESIS: THE TYPICAL WORDS FOR A GIVEN TOPIC ARE INDEPENDENT OF THE TECHNIQUES

[Diagram labels: SURGEON WORDS, GENETICS WORDS, TOPIC WORDS]

Processing

• First step of the study: create a “kernel” database containing the references of the scientific documents used by people working on the disease => 612 publications

• 30 axes with a positive side and a negative one

• Each side of each axis is characterized by the words with a high relative contribution to the inertia (greater than a threshold).

DATA

• Two-way table crossing the 612 documents (abstracts) and 850 words

• CA on this two-way table

Dimension of a word

• The words of a topic are one-dimensional

• The words of a field are multidimensional

• The dimension of a word is the number of axes on which this word has a high relative contribution to inertia

• If we want to find the minimum common vocabulary, the dimension of a word must be high
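
The definition of the dimension of a word can be sketched as a simple count. The code below assumes NumPy and a matrix of contributions with one row per word and one column per axis. The words, the values and the threshold (the average contribution, 1 / number of words) are hypothetical and chosen only for illustration; the slides do not state which threshold was used.

```python
import numpy as np

def word_dimensions(contrib, threshold):
    """Number of axes on which each word's contribution exceeds the threshold."""
    contrib = np.asarray(contrib, dtype=float)
    return (contrib > threshold).sum(axis=1)

# Hypothetical contributions of 4 words (rows) to 3 axes (columns).
# Each column sums to 1, as contributions to an axis do.
words = ["w1", "w2", "w3", "w4"]
contrib = np.array([
    [0.70, 0.05, 0.10],
    [0.10, 0.40, 0.45],
    [0.15, 0.50, 0.40],
    [0.05, 0.05, 0.05],
])

threshold = 1.0 / len(words)   # assumed cut-off: the average contribution
dims = word_dimensions(contrib, threshold)

for word, dim in zip(words, dims):
    print(word, dim)
```

With 30 axes, as in the study, the same count ranges from 0 to 30, and the candidate common vocabulary is made of the words whose count is high.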

MUCOVISCIDOSIS BASE

[Figure: words plotted for the mucoviscidosis base, in extraction order: EXON ALLELES CBAVD MUTATIONS NOVEL DEFERENS FAMILIES IDENTIFICATION CONGENITAL ALLELE CODING SCREENING POPULATION ELECTROPHORESIS MUTATION PCR DETECTION DELTAF DIAGNOSIS DNA GENE ANALYSIS DELTA REGULATOR VENTRICULAR LEFT HYPERTENSION TRANSPLANTATIONS CF CFTR FAILURE DOUBLE HEART LIVER FOLLOW CASES COMPLICATIONS CHILDREN LUNG PULMONARY REJECTION MEAN TREATMENT CONDUCTANCE EXPRESSION PROTEIN HUMAN CELLS ACTIVITY CELL MEMBRANE ALPHA TRANSPORT APICAL ELASTASE INDUCED ATP CHANNEL MU SECRETION CHANNELS INHIBITOR CA BILE]

81 words have a dimension greater than 10

ACID ADENOSINE ADENOVIRUS ADHESION AERUGINOSA ALPHA ALVEOLAR AMILORIDE ANTIGENS ANTITRYPSIN ASPERGILLOSIS ATP AUREUS BRONCHIAL CAMP CASES CELL CELLS CFTR CHANNEL CHANNELS CHILDREN CHROMOSOME CIRRHOSIS CONCENTRATIONS CYSTIC DIAGNOSIS DOUBLE DRUG ELASTASE ELASTIN EMPHYSEMA EPITHELIUM EXPRESSION FETAL FIBROSIS FLOW FLUID HLA INHIBITOR LEFT LIVER LUNG MARKERS MUCIN MUCINS MUCUS MUTATIONS NASAL NEONATAL NEUTROPHILS PATCHES PEPTIDE PERFORMANCE PLASMA PNEUMONIA PRENATAL PROPERTIES PROTEASE PROTEIN PROTEINASE PSEUDOMONAS RAT RATS RECEPTOR RECEPTORS REJECTION RIGHT SCREENING SECRETION SECRETIONS SPUTUM STRAINS THERAPY TRANSFER TRANSPLANTATION TRANSPORT TRYPSIN VENTRICULAR VIVO WATER

Is a high dimension a sufficient condition to characterize the disease?

To check this, we use other thematic databases and, in each of them, we count the number of documents containing at least two of the previous 81 words.
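
The counting rule can be sketched as follows. This is a minimal illustration using only the Python standard library; it assumes that "at least two words" means two distinct vocabulary words and that a document is a plain-text abstract. The three abstracts are invented, and the vocabulary is a small excerpt of the 81 words.

```python
import re

def contains_at_least(text, vocabulary, minimum=2):
    """True if the text contains at least `minimum` distinct vocabulary words.

    `vocabulary` is expected to be a set of lowercase words.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & vocabulary) >= minimum

def retrieval_rate(documents, vocabulary, minimum=2):
    """Return (number of retrieved documents, share of the database)."""
    vocabulary = {word.lower() for word in vocabulary}
    hits = sum(contains_at_least(doc, vocabulary, minimum) for doc in documents)
    return hits, hits / len(documents)

# Toy abstracts, invented for illustration only.
vocabulary = ["CFTR", "SPUTUM", "MUCUS", "CHILDREN"]
documents = [
    "CFTR mutations and mucus secretion in the airway",
    "Sputum cultures in children",
    "A survey of breast imaging",
]

hits, rate = retrieval_rate(documents, vocabulary)
print(hits, f"{100 * rate:.0f}%")
```

A vocabulary that characterizes the disease should give a high rate on the mucoviscidosis database and a low rate on the others.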

5 thematic databases

• BREAST CANCER: 9871 documents

• POLYAMINES: 12726 documents

• TUMOR INFILTRATING LEUCOCYTE: 586 documents

• ACUTE LYMPHOBLASTIC LEUKEMIA: 2063 documents

• MUCOVISCIDOSIS: 612 documents

RETRIEVAL STATISTICS WITH THE 81 WORDS

Database subject               Retrieval rate   Database size
Mucoviscidosis                 612 (100%)       612
Acute lymphoblastic leukemia   1990 (96%)       2063
Polyamines                     11912 (94%)      12726
Breast cancer                  8728 (88%)       9871
Tumor infiltrating leucocyte   546 (93%)        586
Total                          23788 (92%)      25858

CA of the 5 databases and 81 words

[Figure: CA display; labels in extraction order: HLA antigens cases screening diagnosis therapy chromosome BASE LAL flow transplantation BASE CANCER SEIN adhesion BASE TIL receptor expression children right left lung mutations aspergillosis cell neutrophils mucins vivo epithelium drug protein rejection pneumonia plasma secretion alpha peptide alveolar BASE POLYAMINE acid transfer protease inhibitor ATP ventricular adenovirus adenosine CAMP prenatal proteinase strains transport channel cirrhosis antitrypsin neonatal aureus BASE MUCO bronchial pseudomonas secretions amiloride patches aeruginosa elastase sputum fibrosis nasal cystic elastin mucus emphysema CFTR]

20 remaining words

Adenovirus   Aeruginosa    Amiloride    Antitrypsin
Aureus       Bronchial     CFTR         Cirrhosis
Cystic       Elastase      Elastin      Emphysema
Fibrosis     Mucus         Nasal        Patches
Proteinase   Pseudomonas   Secretions   Sputum

Retrieval statistics with these 20 words

Subject              Retrieval rate   Db size
Mucoviscidosis       550 (89.9%)      612
Leukemia             38 (1.8%)        2063
Polyamines           341 (2.7%)       12726
Breast cancer        202 (2.1%)       9871
Tumor infilt. leu.   9 (1.5%)         586

Conclusion

• CA is a very powerful method to display the associations among variables

• It can be used with large datasets (one of the dimensions must be « tractable »)

• Thanks to Michel Kerbaol for allowing me to use his data on mucoviscidosis

[email protected]

• Software : Qnomis