WORKSHOP ZAGREB JUNE 2004
Correspondence analysis for data mining with applications in medicine
Annie Morin, IRISA, France
Correspondence analysis
• Statistical visualization method for displaying the associations between the levels of the two variables of a two-way contingency table, and the distances between the categories of each variable => exploratory method
• The usual tool for a contingency table is the chi-square test for independence
CA
• Duality between the rows and the columns
• Use of the row profiles and of the column profiles
• Use of the chi-square distance (distributional equivalence)
• Factorial analysis method (eigenvalues of an ad hoc matrix) and reduction of dimensionality
Example: frequency table
        heart  forest  surgery  animal  total
D1         11      20        4       1     37
D2          3       9        2       2     16
D3          1       5        2       3     11
D4          3      14        3      16     36
total      18      48       12      21    100
Row profiles (%)
               heart  forest  surgery  animal  total
D1                31      54       12       5    100
D2                16      58       15      11    100
D3                 8      45       22      25    100
D4                 9      39        8      44    100
mean profile      18      48       12      22    100
Column profiles (%)
        heart  forest  surgery  animal  mean profile
D1         63      42       37       6            37
D2         14      19       20       8            16
D3          5      10       20      13            11
D4         19      29       24      74            36
total     100     100      100     100           100
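The profile tables above can be recomputed from the frequency table. The following is a minimal sketch, assuming NumPy and using the cell counts as transcribed; the variable names are illustrative. The transcribed cells of row D1 sum to 36 while the printed total is 37, so the recomputed percentages may differ slightly from the ones on the slides.

```python
import numpy as np

# Cell counts as transcribed (rows D1..D4; columns heart, forest, surgery, animal).
# Margins are recomputed from the cells rather than copied from the slide.
N = np.array([
    [11, 20, 4, 1],
    [3, 9, 2, 2],
    [1, 5, 2, 3],
    [3, 14, 3, 16],
], dtype=float)

row_totals = N.sum(axis=1, keepdims=True)  # shape (4, 1)
col_totals = N.sum(axis=0, keepdims=True)  # shape (1, 4)
n = N.sum()

# Row profile: each row divided by its total (rows sum to 100).
row_profiles = 100 * N / row_totals
# Column profile: each column divided by its total (columns sum to 100).
col_profiles = 100 * N / col_totals

# Mean profiles are the margins expressed as percentages of the grand total.
mean_row_profile = 100 * col_totals.ravel() / n
mean_col_profile = 100 * row_totals.ravel() / n

print(np.round(row_profiles, 1))
print(np.round(col_profiles, 1))
```

Under independence, every row profile would equal the mean row profile, which is why the profiles are the natural objects to compare.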
[Figure: plot of the row points D1, D2, D3, D4 and the column points heart, forest, surgery, animal]
Distances
Between two columns
Between two rows
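The distance formulas of this slide did not survive in the transcript. As a hedged reconstruction, assuming the standard chi-square distance between profiles that the CA slide names, and writing $f_{ij} = n_{ij}/n$ for the relative frequencies with margins $f_{i\cdot}$ and $f_{\cdot j}$ (this notation is an assumption, not taken from the slides):

```latex
% Chi-square distance between two columns j and j' (column profiles)
d^2(j, j') = \sum_{i} \frac{1}{f_{i\cdot}}
  \left( \frac{f_{ij}}{f_{\cdot j}} - \frac{f_{ij'}}{f_{\cdot j'}} \right)^2

% Chi-square distance between two rows i and i' (row profiles)
d^2(i, i') = \sum_{j} \frac{1}{f_{\cdot j}}
  \left( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \right)^2
```

The weighting by the inverse margin is what gives distributional equivalence: merging two rows with identical profiles leaves the distances between columns unchanged.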
• Diagonalization of a « covariance matrix » to find the eigenvalues and corresponding eigenvectors
• λ1 ≥ λ2 ≥ … ≥ λp
• Inertia of the cloud is ∑λi = χ² / n
• Distance to the independence model
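The relation between the eigenvalues and the chi-square statistic can be checked numerically. The sketch below is one common way to carry out the diagonalization, assuming NumPy and a singular value decomposition of the standardized residuals; the slides do not say which matrix or algorithm was actually used, and the function name is illustrative.

```python
import numpy as np

def ca_eigenvalues(N):
    """Return (eigenvalues, chi2, n) for a two-way table of counts N."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    P = N / n                      # correspondence matrix
    r = P.sum(axis=1)              # row masses
    c = P.sum(axis=0)              # column masses
    expected = np.outer(r, c)      # independence model
    # Standardized residuals: deviation from independence.
    S = (P - expected) / np.sqrt(expected)
    # Squared singular values of S are the eigenvalues of S^T S.
    singular_values = np.linalg.svd(S, compute_uv=False)
    eigenvalues = singular_values ** 2
    # Classical chi-square statistic, computed independently from counts.
    E = n * expected
    chi2 = ((N - E) ** 2 / E).sum()
    return eigenvalues, chi2, n

# Cell counts as transcribed from the frequency-table slide.
table = [[11, 20, 4, 1], [3, 9, 2, 2], [1, 5, 2, 3], [3, 14, 3, 16]]
eigenvalues, chi2, n = ca_eigenvalues(table)
print("eigenvalues:", eigenvalues)
print("total inertia:", eigenvalues.sum(), "chi2 / n:", chi2 / n)
```

The last eigenvalue is numerically zero because centering on the independence model removes one dimension, so a 4 x 4 table has at most three non-trivial axes.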
Simultaneous representation
• Of the row profiles and the column profiles on the same factorial plane
• Validity of the representation:
– Inertia: contributions, which give the share of an axis's inertia supplied by each element (row or column profile) in building that axis
– Quality of representation of each element by the axes
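The two validity indicators above can be sketched for the row points as follows, assuming NumPy and the usual textbook definitions (contribution of row i to axis k as mass times squared coordinate over the eigenvalue, quality as squared cosine). The function name is illustrative, and the slides do not give the exact formulas used.

```python
import numpy as np

def ca_row_diagnostics(N):
    """Return row coordinates, contributions to axes, and squared cosines."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1           # number of non-trivial axes
    U, sv = U[:, :k], sv[:k]
    # Principal coordinates of the rows on each axis.
    F = (U * sv) / np.sqrt(r)[:, None]
    # Contribution: share of axis inertia (eigenvalue) due to each row.
    contrib = r[:, None] * F ** 2 / sv ** 2
    # Quality: share of each row's squared distance to the centroid
    # that is captured by each axis.
    cos2 = F ** 2 / (F ** 2).sum(axis=1, keepdims=True)
    return F, contrib, cos2

# Cell counts as transcribed from the frequency-table slide.
table = [[11, 20, 4, 1], [3, 9, 2, 2], [1, 5, 2, 3], [3, 14, 3, 16]]
F, contrib, cos2 = ca_row_diagnostics(table)
print(np.round(contrib, 3))   # each column sums to 1
print(np.round(cos2, 3))      # each row sums to 1
```

The same computation applies to the columns by swapping the roles of the left and right singular vectors and of the row and column masses.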
Applications in medicine
• Pharmacology
• Therapeutic trials (to avoid double-blind procedures): CA allows the physician to follow the evolution of the illness and/or of the therapy
• Textual analysis: reports, business intelligence, bibliometrics
Application on mucoviscidosis
• Mucoviscidosis: rare disease
– No specific keywords
– No specific journals
• Goal: to define a minimum common vocabulary for the researchers working on mucoviscidosis (clinicians, geneticists, etc.)
HYPOTHESIS: THE TYPICAL WORDS FOR A GIVEN TOPIC ARE INDEPENDENT OF THE TECHNIQUES
[Diagram labels: SURGEON WORDS, GENETICS WORDS, TOPIC WORDS]
Processing
• First step of the study: create a “kernel” base containing the references of the scientific documents used by people working on the disease => 612 publications
• 30 axes with a positive side and a negative one
• Each side of each axis is characterized by the words with a high relative contribution to the inertia (greater than a threshold).
DATA
• Two-way table crossing the 612 documents (summaries) and 850 words
• CA on this two-way table
Dimension of a word
• The words of a topic are one-dimensional
• The words of a field are multidimensional
• The dimension of a word is the number of axes on which this word has a high relative contribution to inertia
• If we want to find the minimum common vocabulary, the dimension of a word must be high
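The definition of the dimension of a word can be sketched as a simple count over axes. This is an illustration only, assuming NumPy: the matrix of relative contributions, its toy values, and the threshold are hypothetical, since the slides give neither the exact contribution formula nor the threshold that was used.

```python
import numpy as np

def word_dimensions(rel_contrib, threshold):
    """Count, for each word, the axes where its relative contribution
    exceeds the threshold.

    rel_contrib: array of shape (n_words, n_axes), hypothetical input.
    """
    rel_contrib = np.asarray(rel_contrib, dtype=float)
    return (rel_contrib > threshold).sum(axis=1)

# Toy values (hypothetical): 3 words on 3 axes.
rel = np.array([
    [0.30, 0.02, 0.01],   # high on one axis only: a topic word
    [0.12, 0.15, 0.20],   # high on several axes: a field word
    [0.01, 0.01, 0.02],   # never high
])
dims = word_dimensions(rel, threshold=0.10)
print(dims)  # [1 3 0]
```

In the study described here, the same count would run over the 850 words and the 30 axes, and the candidate common vocabulary would be the words whose count is high.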
MUCOVISCIDOSIS BASE EXON ALLELES CBAVD MUTATIONS NOVEL DEFERENS FAMILIES IDENTIFICATION CONGENITAL ALLELE CODING SCREENING POPULATION ELECTROPHORESIS MUTATION PCR DETECTION DELTAF DIAGNOSIS DNA GENE ANALYSIS DELTA REGULATOR VENTRICULAR LEFT HYPERTENSION TRANSPLANTATIONS CF CFTR FAILURE DOUBLE HEART LIVER FOLLOW CASES COMPLICATIONS CHILDREN LUNG PULMONARY REJECTION MEAN TREATMENT CONDUCTANCE + EXPRESSION PROTEIN HUMAN CELLS ACTIVITY CELL MEMBRANE ALPHA TRANSPORT APICAL ELASTASE INDUCED ATP CHANNEL MU SECRETION CHANNELS INHIBITOR CA BILE
81 words have a dimension greater than 10
ACID ADENOSINE ADENOVIRUS ADHESION AERUGINOSA ALPHA ALVEOLAR AMILORIDE ANTIGENS ANTITRYPSIN ASPERGILLOSIS ATP AUREUS BRONCHIAL CAMP CASES CELL CELLS CFTR CHANNEL CHANNELS CHILDREN CHROMOSOME CIRRHOSIS CONCENTRATIONS CYSTIC DIAGNOSIS DOUBLE DRUG ELASTASE ELASTIN EMPHYSEMA EPITHELIUM EXPRESSION FETAL FIBROSIS FLOW FLUID HLA INHIBITOR LEFT LIVER LUNG MARKERS MUCIN MUCINS MUCUS MUTATIONS NASAL NEONATAL NEUTROPHILS PATCHES PEPTIDE PERFORMANCE PLASMA PNEUMONIA PRENATAL PROPERTIES PROTEASE PROTEIN PROTEINASE PSEUDOMONAS RAT RATS RECEPTOR RECEPTORS REJECTION RIGHT SCREENING SECRETION SECRETIONS SPUTUM STRAINS THERAPY TRANSFER TRANSPLANTATION TRANSPORT TRYPSIN VENTRICULAR VIVO WATER
Is a high dimension a sufficient condition to characterize the disease?
To check it, we use other thematic databases and in each of them, we count the number of documents with at least two words among the previous 81 words.
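The retrieval check described above can be sketched as follows. This is a minimal illustration in plain Python: the tokenization rule, the function name, and the toy documents and vocabulary are assumptions, not the procedure or data of the study.

```python
import re

def retrieval_count(documents, vocabulary, min_words=2):
    """Count documents containing at least `min_words` distinct
    vocabulary words (case-insensitive, alphabetic tokens only)."""
    vocab = {w.lower() for w in vocabulary}
    hits = 0
    for text in documents:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if len(tokens & vocab) >= min_words:
            hits += 1
    return hits

# Toy data (hypothetical), not taken from the thematic databases.
docs = [
    "CFTR mutations in cystic fibrosis patients",
    "Polyamines and cell growth",
    "Weather report",
]
vocab = ["CFTR", "MUTATIONS", "CYSTIC", "FIBROSIS", "CELL"]

count = retrieval_count(docs, vocab)
rate = count / len(docs)
print(count, f"{100 * rate:.0f}%")
```

A vocabulary that characterizes the disease should give a high rate on the mucoviscidosis base and a low rate on the other bases; a high rate everywhere means the words are too general.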
5 thematic databases
• BREAST CANCER: 9871 documents
• POLYAMINES: 12726 documents
• LEUCOCYTE INFILTRATED TUMOR: 586 documents
• ACUTE LYMPHOBLASTIC LEUKEMIA: 2063 documents
• MUCOVISCIDOSIS: 612 documents
RETRIEVAL STATISTICS WITH THE 81 WORDS
SUBJECT                         Retrieval rate   Db size
Mucoviscidosis                  612 (100%)           612
Acute lymphoblastic leukemia    1990 (96%)          2063
Polyamines                      11912 (94%)        12726
Breast cancer                   8728 (88%)          9871
Tumor infiltrating leucocyte    546 (93%)            586
TOTAL                           23788 (92%)        25858
[Figure: CA of the 5 databases and 81 words. The plot shows the five databases, labelled BASE MUCO (mucoviscidosis), BASE LAL (acute lymphoblastic leukemia), BASE CANCER SEIN (breast cancer), BASE TIL (tumor infiltrating leucocyte) and BASE POLYAMINE (polyamines), together with the 81 words.]
The 20 remaining words
adenovirus   aeruginosa   amiloride    antitrypsin
aureus       bronchial    CFTR         cirrhosis
cystic       elastase     elastin      emphysema
fibrosis     mucus        nasal        patches
proteinase   pseudomonas  secretions   sputum
Retrieval statistics with these 20 words
SUBJECT               Retrieval rate   Db size
Mucoviscidosis        550 (89.9%)          612
Leukemia              38 (1.8%)           2063
Polyamines            341 (2.7%)         12726
Breast cancer         202 (2.1%)          9871
Tumor Infilt. Leu.    9 (1.5%)             586
Conclusion
• CA is a very powerful method for displaying the associations among variables
• It can be used with large datasets (one of the dimensions must be « tractable »)
• Thanks to Michel Kerbaol for allowing me to use his data on mucoviscidosis
• Software : Qnomis