Using Cluster Analysis on Biomedical Data for Knowledge … · 2019-02-07



Using Cluster Analysis on Biomedical Data for Knowledge Generation and Precision Health

Marcia A. Bowen¹, Rachel L. Richesson¹, James Moody², Robert McCarter³, W. Scott Campbell⁴, Jay G. Pederson⁴, Michael McKenzie⁵, Eric Monson⁶

¹Duke University School of Nursing and Duke Center for Health Informatics; ²Duke University Department of Sociology and Duke Social Sciences Research Institute; ³Children’s National Health System, The Children’s Research Institute; ⁴University of Nebraska Medical Center; ⁵Washington, D.C.; ⁶Duke University Libraries Data and Visualization Services

BACKGROUND

As large and complex datasets become increasingly available to the research community, more advanced big data analytical techniques are needed to exploit and manage these data.

Machine learning and data mining methods can be used to mine significant knowledge from a variety of large and heterogeneous textual and tabulated data sources, supporting biomedical research and healthcare delivery.

These machine learning and data mining methods can be facilitated with the use of standardized clinical coding systems, which can be applied at the time of data collection or by annotation of free text found in narrative clinical notes.

The ideal clinical coding system will be comprehensive across clinical and biomedical domains, and also have the capacity to represent normal and abnormal findings.

OBJECTIVE

To demonstrate the use of cluster analysis on an existing research dataset encoded with SNOMED CT, in order to visualize and identify clinical phenotypes associated with increased disease severity in a population of children diagnosed with one of eight Urea Cycle Disorders, and to replicate these methods in other non-research and clinical datasets.

METHODS

• Use an existing research data set encoded in SNOMED CT.

• Leverage open-source tools to manipulate codes to an optimum level of specificity based upon the number of occurrences in the record.

• Conduct cluster analysis.

• Review results with clinical experts for interpretation.

• Synthesize discovery with current clinical practice standards and biological knowledge.

• Replicate in a clinical data set.

SNOMED CT

Systematized Nomenclature of Medicine – Clinical Terms

• The most comprehensive and precise clinical health terminology in the world.

• Owned and distributed by SNOMED International; distributed in U.S. by the National Library of Medicine.

• Designated standard for use in U.S. Federal Government systems for the electronic exchange of clinical health information (problem lists).

• More than 300,000 concepts and approximately 800,000 synonyms.

• Has polyhierarchical (parent-child) and associative relationships. (Figs. 1a and 1b)

• Can support semantic analyses but there are few demonstrations.

• Tools for using SNOMED CT in data analyses are not intuitive or well integrated with current data analysis software or approaches.

APPROACH

A general problem in clinical medicine is identifying patients who are at risk for complications so that those complications can be prevented or treated. For many rare diseases, little is known about the condition and its treatments.

Cluster analysis can be used to identify clinical phenotypes associated with increased disease severity in a population of children. This requires manipulating the SNOMED CT codes to preserve the maximum level of semantics while making the data exploration feasible and useful.

Figure 1a – First Level Hierarchical (Parent-Child) Relationships in SNOMED CT for Seizure

Figure 1b – Multiple Level Hierarchical and Associative Relationships in SNOMED CT for Seizure

Figure 2 – Example Density Map of Patient History Data with Seizures (seizure codes in red, related codes in blue)

Data Source

The Urea Cycle Disorders Research Consortium generated a 10-year natural history study of children with one of eight different Urea Cycle Disorders.

The data records were encoded using SNOMED CT for physical exam and medical history and RxNorm for medications, and some records included free-text notes from physicians and research nurses.

Data Preparation

The data set was formatted and cleaned. Approximately 2,000 SNOMED CT codes were generated from free text findings and verified for accuracy, after eliminating duplicates.

The analysis dataset includes 26,386 records and 5,219 unique SNOMED CT codes.

By joining tables of SNOMED CT codes, relationships, and descriptions with our dataset (using SAS© JMP software), we traced the hierarchical relationships from the dataset codes up to the broadest concept, resulting in 47,222 relationships from 7 iterations.
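The iterative join described above can be sketched in plain Python rather than JMP. This is only an illustration under stated assumptions: the `ISA_EDGES` pairs reuse the seizure labels shown on this poster (the pairing of adjacent levels is assumed), and `climb_hierarchy` is a hypothetical helper name, not part of any SNOMED CT tooling.

```python
# Assumed is-a pairs (child, parent). Labels follow the seizure example on this
# poster; a real run would read SNOMED CT relationship rows instead.
ISA_EDGES = [
    ("Seizure disorder", "Disorder of brain"),
    ("Disorder of brain", "Disorder of head"),
    ("Disorder of head", "Disorder by body site"),
    ("Disorder by body site", "Disease"),
    ("Disease", "SNOMED CT Concept"),
]


def climb_hierarchy(dataset_codes, isa_edges):
    """Join dataset codes to their parents repeatedly until the top is reached.

    Returns the set of (child, parent) relationships collected and the number
    of join passes that produced at least one row.
    """
    parents_of = {}
    for child, parent in isa_edges:
        parents_of.setdefault(child, set()).add(parent)

    frontier = set(dataset_codes)
    seen = set(frontier)
    relationships = set()
    iterations = 0
    while frontier:
        # One "join": pair every frontier code with its direct parents.
        joined = {(c, p) for c in frontier for p in parents_of.get(c, ())}
        if not joined:
            break
        iterations += 1
        relationships |= joined
        # Only parents not visited before need another pass.
        frontier = {p for _, p in joined} - seen
        seen |= frontier
    return relationships, iterations


relationships, iterations = climb_hierarchy(["Seizure disorder"], ISA_EDGES)
print(iterations, len(relationships))
```

Tracking `seen` keeps the loop from revisiting a parent that is reachable along more than one route, which matters in a polyhierarchy.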

A clinician reviewed 29 out of 50 modules and determined that 85 of the 5,219 codes had no clinical relevance; these were eliminated from the analysis. This resulted in 38,440 relationships and 5,160 unique concept codes.

• 733 patients

• 26,386 records

  - anonymous patient identifier

  - visit type

  - age range

  - verified SNOMED CT concept codes

Reducing Semantic Relationships in the Dataset

A file of the transitive closure relationships for SNOMED CT was generated from the published Perl script distributed by SNOMED International.

To reduce the number of relationships to those relevant to our dataset, the complete set of transitive closure relationships was filtered to include only SNOMED CT codes that were subtypes of the SNOMED CT codes from the dataset, resulting in 79,945 semantic relationships.
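As a rough illustration of what a transitive closure file contains and how it might be filtered, here is a minimal Python sketch. It does not reproduce the Perl script from SNOMED International. The codes `A` to `D` are hypothetical, and the filter rule (keep rows whose subtype column is a dataset code) is an assumption based on the later statement that dataset codes were treated as the child concepts.

```python
def transitive_closure(isa_edges):
    """Return every (descendant, ancestor) pair implied by direct is-a edges."""
    parents_of = {}
    for child, parent in isa_edges:
        parents_of.setdefault(child, set()).add(parent)

    closure = set()
    for start in parents_of:
        stack = list(parents_of[start])
        visited = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in visited:
                continue
            visited.add(ancestor)
            closure.add((start, ancestor))
            stack.extend(parents_of.get(ancestor, ()))
    return closure


def filter_closure(closure, dataset_codes):
    """Keep closure rows whose subtype (first column) is a dataset code."""
    keep = set(dataset_codes)
    return {(sub, sup) for sub, sup in closure if sub in keep}


# Hypothetical polyhierarchy: A has two parents, both of which lead to D.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
closure = transitive_closure(edges)
relevant = filter_closure(closure, dataset_codes=["A"])
print(sorted(relevant))
```

Because the closure is a set of pairs, the ancestor `D` appears once for `A` even though two routes lead to it.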

• To enhance visualization of our dataset, we further reduced the number of relationships by including only certain SNOMED CT hierarchies. These were:

- assessment scale, body structure, cell, cell structure, clinical drug, disorder, ethnic group, finding, medicinal product, medicinal product form, morphologic abnormality, observable entity, organism, procedure, racial group, regime/therapy, situation, specimen, staging scale, and tumor staging.

• The resulting file of relationships filtered by urea cycle disorders contained 16,490 records. The resulting file for the transitive closure filtered by urea cycle disorders contained 73,314 records.
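The hierarchy filter in the list above could be approximated by reading the semantic tag that closes each SNOMED CT fully specified name, such as "(disorder)". The sketch below assumes that approach; the poster does not say how the filter was implemented. The codes `c1` to `c3` are hypothetical and the names are illustrative, not rows from the UCD dataset.

```python
import re

# Hierarchies kept for visualization, as listed in the text above.
KEPT_TAGS = {
    "assessment scale", "body structure", "cell", "cell structure",
    "clinical drug", "disorder", "ethnic group", "finding",
    "medicinal product", "medicinal product form", "morphologic abnormality",
    "observable entity", "organism", "procedure", "racial group",
    "regime/therapy", "situation", "specimen", "staging scale",
    "tumor staging",
}

# Matches the last parenthesized group at the end of a name.
TAG_PATTERN = re.compile(r"\(([^()]+)\)\s*$")


def semantic_tag(fsn):
    """Return the trailing parenthesized tag of a name, or None if absent."""
    match = TAG_PATTERN.search(fsn)
    return match.group(1).strip().lower() if match else None


def keep_concepts(fsn_by_code):
    """Keep only concepts whose semantic tag is in KEPT_TAGS."""
    return {
        code: fsn
        for code, fsn in fsn_by_code.items()
        if semantic_tag(fsn) in KEPT_TAGS
    }


# Hypothetical codes with illustrative names.
example = {
    "c1": "Seizure disorder (disorder)",
    "c2": "Disorder of brain (disorder)",
    "c3": "Left (qualifier value)",
}
kept = keep_concepts(example)
print(sorted(kept))
```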

RESULTS AND EXPERIENCE

5,111 of the SNOMED codes in our large UCD dataset had only one instance.

We leveraged the formal semantic relationships of SNOMED CT to intelligently aggregate our data into larger groups, allowing us to visualize patterns in our data that we might not otherwise see with very specific terms in a complex data set.
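One way to picture this aggregation is a roll-up in which a rarely used code is replaced by its nearest ancestor that has enough occurrences once its descendants are counted. The sketch below is an assumption about how such a rule might look, not the procedure used in this project. The threshold `min_count`, the helper `roll_up`, the alphabetical tie-break among parents, and the toy records are all hypothetical.

```python
from collections import Counter, deque


def roll_up(record_codes, parents_of, min_count=2):
    """Map each code to the nearest concept whose aggregated count >= min_count."""
    raw = Counter(record_codes)

    def ancestors(code):
        seen, queue = set(), deque(parents_of.get(code, ()))
        while queue:
            concept = queue.popleft()
            if concept not in seen:
                seen.add(concept)
                queue.extend(parents_of.get(concept, ()))
        return seen

    # Aggregated count = occurrences of a concept plus those of its descendants.
    aggregated = Counter(raw)
    for code, n in raw.items():
        for ancestor in ancestors(code):
            aggregated[ancestor] += n

    def target(code):
        # Breadth-first climb, so closer ancestors are tried before distant ones.
        queue, seen = deque([code]), {code}
        while queue:
            concept = queue.popleft()
            if aggregated[concept] >= min_count:
                return concept
            for parent in sorted(parents_of.get(concept, ())):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return code  # no ancestor is frequent enough; keep the original code

    return {code: target(code) for code in raw}


# Toy hierarchy and records (illustrative labels, not UCD data).
parents_of = {
    "Febrile seizure": {"Seizure disorder"},
    "Absence seizure": {"Seizure disorder"},
    "Seizure disorder": {"Disorder of brain"},
}
records = ["Febrile seizure", "Absence seizure",
           "Disorder of brain", "Disorder of brain"]
mapping = roll_up(records, parents_of)
print(mapping)
```

In this toy case the two single-instance seizure codes both map to their shared parent, while the code that already occurs twice is left unchanged.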

Leveraging the formal semantics embedded in SNOMED CT is a challenge, largely due to the polyhierarchical structure of SNOMED CT. There are multiple “pathways” to consider to reclassify detailed data into broader concepts.

We treated the UCD SNOMED CT codes as child concepts and traced their parent concepts up the various hierarchies. We found that the International version of SNOMED CT best supported this project.
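The "multiple pathways" issue can be made concrete by listing every route from a code to the top of the hierarchy. The following sketch uses a hypothetical four-concept polyhierarchy. The labels and the helper `paths_to_root` are illustrative assumptions and do not come from SNOMED CT content.

```python
def paths_to_root(code, parents_of):
    """Return every parent-by-parent path from code up to a concept with no parents."""
    parents = sorted(parents_of.get(code, ()))
    if not parents:
        return [[code]]
    return [
        [code] + path
        for parent in parents
        for path in paths_to_root(parent, parents_of)
    ]


# Hypothetical polyhierarchy: one detailed concept with two parents.
parents_of = {
    "Detailed finding": {"Parent by body site", "Parent by process"},
    "Parent by body site": {"Root"},
    "Parent by process": {"Root"},
}
paths = paths_to_root("Detailed finding", parents_of)
for path in paths:
    print(" > ".join(path))
```

Each additional parent multiplies the number of candidate routes, so any aggregation rule has to state which route, or which combination of routes, it follows.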

SNOMED CT transitive closure files can be integrated with clinical data sets to semantically aggregate and reduce the number of semantic relationships in a dataset to facilitate cluster analysis.

NEXT STEPS

• Using the UCDC dataset enhanced with SNOMED CT relationships, we will use cluster analysis to visualize each patient’s disease history and find common patterns of occurrence. We will use multiple data types (SNOMED CT, RxNorm medications, and disease severity) to produce a contour map (Fig. 2) to stimulate discussion and further clinical inquiry.

• The goal is to develop visual representations of the UCDC Data that clinicians can interpret to better understand and treat the condition.

• We plan the following steps for the analysis, using open-source software:

1. Cluster based upon published relationships from SNOMED International version (Neo4j)

   a. Load SNOMED International and RxNorm into graph database with the patient information (Neo4j)

   b. Run subsumption identification, PageRank for importance, and label propagation for community detection (Neo4j)

   c. Consolidate nodes and edges (Gephi)

   d. Create clusters based on their association with SNOMED clusters (VOSviewer)

   e. Create density maps of these clusters, incorporating disease severity scales computed from the UCDC data (VOSviewer)

2. Incorporate weighted frequency of occurrence

3. Incorporate changes over time
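Step 1b names PageRank as the importance measure, to be run inside Neo4j. As a stand-in that shows only the underlying idea, here is a small power-iteration PageRank in plain Python. It is not the Neo4j procedure and does not cover label propagation. The damping factor, the iteration count, and the toy child-to-parent edges are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Toy is-a edges pointing from child to parent (illustrative labels only).
edges = [
    ("Febrile seizure", "Seizure disorder"),
    ("Absence seizure", "Seizure disorder"),
    ("Seizure disorder", "Disorder of brain"),
]
ranks = pagerank(edges)
for concept, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{concept}: {score:.3f}")
```

With edges directed from child to parent, rank accumulates on broader concepts, so the direction chosen when loading the graph determines what "importance" means.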

• Working with undergraduate students in the Duke Data+ program, we will replicate this work in larger clinical data sets. Students will format the data, create the graph database, add additional data and semantic knowledge using Neo4j, and then transfer the results into VOSviewer for visual analysis.
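The planned density maps and the "weighted frequency of occurrence" step both need some measure of how often two codes appear in the same patient history. The poster does not specify the weighting, so the sketch below assumes the simplest option: counting the patients in whom both codes occur. The patient identifiers and codes are hypothetical toy records.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(records):
    """Count, for each pair of codes, the patients whose history has both."""
    codes_by_patient = {}
    for patient_id, code in records:
        codes_by_patient.setdefault(patient_id, set()).add(code)

    weights = Counter()
    for codes in codes_by_patient.values():
        # Sorting gives each unordered pair a single canonical key.
        for first, second in combinations(sorted(codes), 2):
            weights[(first, second)] += 1
    return weights


# Hypothetical (patient identifier, code) records; not UCD data.
records = [
    ("p1", "Seizure disorder"),
    ("p1", "Hyperammonemia"),
    ("p1", "Seizure disorder"),  # repeated visits count once per patient
    ("p2", "Seizure disorder"),
    ("p2", "Hyperammonemia"),
    ("p2", "Vomiting"),
]
weights = cooccurrence_weights(records)
for pair, weight in weights.most_common():
    print(pair, weight)
```

The resulting pair weights form a weighted edge list, which is the general shape of input that network layout and clustering tools work from.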

DISCUSSION

• The growing amount of biomedical data in electronic health records has tremendous potential to increase our understanding and treatment of disease.

• Integrating the semantics of SNOMED CT with data-driven aggregation creates a “bird’s eye” view that allows interpretation from multiple disciplines.

• Cluster analysis supports the analysis of large datasets that contain different types of data and codes.

• The combination of SNOMED CT and cluster analysis can be used to identify certain clinical phenotypes associated with worse patient outcomes. These patients can be studied to understand different biological or genetic mechanisms of disease, and to identify strategies for customized and early treatment.

• This approach has the potential to improve clinical management of patients and lead to discovery that can support precision medicine.

• We look forward to sharing results at future conferences.

This work is supported by the Urea Cycle Disorders Consortium (NIH project #U54HD061221), a part of the NIH Rare Diseases Clinical Research Network (RDCRN), an initiative of the Office of Rare Diseases Research, National Center for Advancing Translational Sciences (NCATS). The SNOMED CT coding was supported by the RDCRN Data Management and Coordinating Center (5U01TR001263-15). New analyses are being supported by a CTSA supplement (NCATS, project 3U54HD061221-15S1).


                        # of Unique SNOMED Codes    # of Asserted Relationships
UCD Research Dataset    5,219                       19,182
Transitive Closure      n/a                         47,222
Merged File             5,150                       38,440

SNOMED CT hierarchy for seizure disorder (each concept ISA the concept on the line above it):

SNOMED CT CONCEPT (SNOMED RT+CTV3)
DISEASE (DISORDER)
DISORDER BY BODY SITE (DISORDER)
DISORDER OF HEAD (DISORDER)
DISORDER OF BRAIN (DISORDER)
SEIZURE DISORDER (DISORDER)

Urea Cycle Disorders Consortium

https://www.rarediseasesnetwork.org/cms/ucdc/