Knowledge Discovery through Text Mining Biomedical Documents Relating Climate and Cancer Incidence

Katherine Gaster
Wofford College

Research Alliance in Math and Science
Oak Ridge National Laboratory

Dr. Georgia Tourassi
Biomedical Science and Engineering Division
Oak Ridge National Laboratory

August 2012


Abstract

A large body of literature examines the effects of climate on health. Some of the most cited examples include the effects of climatic temperature stress on cardiovascular diseases and the effects of solar UV radiation on skin cancer incidence. Research this summer investigated a possible link between climatic or environmental conditions and cancer in general. With longer life spans and more advanced screening technology, cancer incidence rates are expected to increase dramatically. Because of the devastating effects of cancer at the personal as well as the societal level, it is important to understand how climate affects cancer risk. An individual can, to an extent, control their environment, and knowledge of which climates and environments affect their cancer risk profile could be helpful.

Research was carried out through a targeted literature search for articles referring to cancer and either climatic or environmental factors. The literature was used to identify trends that linked certain climate factors and environmental conditions to different types of cancer. This was done using Piranha, a data mining tool that clusters documents based on the textual similarities and differences it identifies. Careful review of these clusters helps identify factors that are frequently associated with specific types of cancer. Ultimately, identification of these trends will help formulate scientific hypotheses that can be tested extensively with traditional epidemiological studies, which are too time-consuming and expensive to do without a well-researched hypothesis.


Introduction

The effects of climate on human health have been widely studied. Two of the most commonly researched examples of this relationship are the effect of climatic temperature stress on cardiovascular diseases [1] and the effect of solar UV radiation on skin cancer incidence [2]. Today, individuals have longer expected life spans, and more advanced cancer screening technology is available. Because of both factors, cancer incidence rates are expected to increase dramatically [3]. With longer life spans, individuals are more likely to develop cancer, a slow-developing disease, at some point during their lifetime. As screening technologies grow more sophisticated, more of these cancers will also be detected: before today’s wide variety of screening technology was available, an individual might develop cancer without ever being diagnosed. One study on the increase of breast cancer incidence, for example, concluded that the most reasonable of several explanations for the rising incidence was the increase in mammography use in asymptomatic women [4]. Because more individuals are being diagnosed with cancer, it is extremely important to research not only new and more effective ways to treat those who are diagnosed, but also the factors that might predispose an individual to developing cancer in the first place and ways to reduce or counteract these factors.

It is accepted that there are both genetic and environmental risk factors for developing cancer. While an individual cannot control or alter their genetic risk factors for any disease, they may be able to control environmental factors to an extent. Any individual wishing to lower their likelihood of developing cancer would benefit greatly from information on the risk factors that can be either altered or monitored. If a certain climatic condition is associated with higher incidence of a specific type of cancer, that condition cannot be changed. However, individuals living under those climatic conditions may wish to be aware of any controllable risk factors predisposing them further to cancer. Additionally, these individuals may choose to be screened more frequently for this type of cancer, as cancer is often more easily treated when diagnosed early.

The objective of this study is knowledge discovery through the text mining of biomedical documents that discuss the burden of climate on cancer incidence. Text mining is the process of attempting to glean meaningful and useful information by analyzing a large number of text sources [5]. The goal is to identify associations between specific climatic or environmental factors and specific types of cancer through a targeted literature search and the use of a sophisticated text mining tool, Piranha. Piranha is designed to search for textual similarities in uploaded documents and to cluster them based on these similarities. Identifying trends within the resulting clusters could help in the formulation of new scientific hypotheses that could be tested extensively in the future using traditional epidemiological studies. These studies are too time-consuming and expensive to perform without a well-researched hypothesis. Therefore, this study aims to provide an efficient mechanism to generate and support such hypotheses.


Method

Data Collection

Literature was obtained through both Google Scholar and ORNL’s Research Library. The search terms were “cancer” and either “climate” or “environment”. After the abstract or its equivalent was read to confirm relevance to the study, each relevant document was saved into a single folder. In total, 99 documents were collected and uploaded to Piranha. These documents included epidemiological studies and observational studies, and the smallest included study had 479 participants.

Data Analysis

In order to cluster documents, Piranha requires that the user choose threshold values that will be used to determine whether several articles are significantly similar, and to what degree. In this study, user-defined threshold values were used rather than the default values. When the articles are clustered, each cluster is assigned one of the defined threshold values to signify how similar the articles are, or how much they have in common. Higher values denote that the articles in the cluster have more in common; lower values show that the articles are less similar. Possible threshold values range from 0.0 to 1.0, and the original threshold values were within this range at intervals of 0.2. After setting these thresholds, the node cluster generator was run to create a visual representation of the clusters of the raw data – the 99 articles and the user-defined threshold values without any other information put into Piranha.
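The role of a similarity threshold can be illustrated with a small sketch. Piranha's internal similarity measure is not described here, so the example below assumes cosine similarity over raw word counts and groups documents that are connected by any chain of pairs at or above the threshold. The function names and the three sample texts are hypothetical, and the flat grouping shown is simpler than the branching clusters Piranha displays.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count word occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_at_threshold(docs, threshold):
    """Group documents linked by a chain of pairs with similarity >= threshold."""
    vectors = [term_vector(d) for d in docs]
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical sample texts, not taken from the 99 articles.
docs = [
    "solar uv radiation and skin cancer incidence",
    "uv radiation exposure raises skin cancer risk",
    "genetic risk factors for breast cancer",
]
for t in (0.2, 0.4, 0.6):
    print(t, cluster_at_threshold(docs, t))
```

In this sketch a low threshold merges all three texts into one group, while higher thresholds split them apart, which mirrors the statement that higher values denote articles with more in common.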

Figure 1 depicts a close-up of Piranha’s visual representation of the degree to which the documents are similar to one another under the described conditions. The central cluster contains all articles uploaded into Piranha, and every other cluster either branches off of this central cluster or off of another non-central cluster. The number displayed above the folder icon is the threshold value, which shows how similar the included articles are, and the word displayed on the folder icon shows the top word in that cluster – the word that appears most frequently in every article included in the cluster. Each cluster lists 5 top words total, with the second through fifth most frequently used words only being displayed in a window upon hovering over the respective cluster’s folder icon. These 5 top words show part of what it is that makes each of the articles included in a cluster similar to each other, and an example of this feature is shown in Figure 5. Each cluster was examined to determine whether any unusual or unexpected trends were being generated.

After reviewing the clusters, new threshold values were set at narrower intervals. The new values still ranged from 0.0 to 1.0, but were now chosen at intervals of 0.1, meaning that articles would now be clustered based on more specific similarities, because Piranha was looking at smaller differences in the similarities between each article. This process of halving the threshold intervals was repeated twice more. As before, each cluster was examined for unexpected trends, and also to make sure that each cluster made logical sense. Figure 2 depicts Piranha’s visual representation of the data clusters after the third interval halving.
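The successive threshold grids implied by this procedure can be worked out with a few lines of arithmetic. The sketch below assumes evenly spaced values that include both endpoints. The steps 0.05 and 0.025 are inferred from halving 0.1 twice and are not stated explicitly in the text, and `threshold_grid` is a hypothetical helper name.

```python
def threshold_grid(step):
    """Evenly spaced thresholds from 0.0 to 1.0 inclusive."""
    count = round(1.0 / step)
    # Round each value to suppress floating-point noise such as 0.6000000000000001.
    return [round(i * step, 6) for i in range(count + 1)]


step = 0.2
for _ in range(4):  # the initial grid plus three halvings
    grid = threshold_grid(step)
    print(f"step {step}: {len(grid)} thresholds, first values {grid[:3]}")
    step /= 2
```

Each halving roughly doubles the number of candidate threshold values, so clusters can be separated by progressively smaller differences in similarity.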


Piranha includes a feature that allows users to input words, called stop words, which Piranha ignores when clustering documents. Therefore, these stop words are not used in determining how similar articles are to each other. This is useful in ignoring words, like database names or “results”, which are likely to appear frequently in scientific literature but are not useful to the study. Uninformative words that appeared as one of a cluster’s top five words were added to the stop words list, and the cluster node generator was run again. This was done several times with several stop word combinations, and was also done by combining these stop word groups with the varied threshold values used earlier. Each cluster was examined for changes in trends and for any new unexpected trends. Figure 3 depicts Piranha’s representation of the data clusters when all identified stop words were used at once under the threshold value intervals shown in Figure 2.
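The effect of a stop word list on a cluster's top words can be sketched as follows. This is an illustration only: it assumes top words are ranked by total occurrences across the cluster's documents, which may differ from how Piranha ranks them, and the `top_words` function, the sample texts, and the stop word list are all hypothetical.

```python
import re
from collections import Counter


def top_words(cluster_docs, stop_words=frozenset(), n=5):
    """Return the n most frequent words in a cluster, skipping stop words."""
    counts = Counter()
    for text in cluster_docs:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n)]


# Hypothetical cluster in which the uninformative word "results" is frequent.
cluster = [
    "results show uv radiation raises skin cancer risk",
    "results link uv exposure to skin cancer",
    "results on skin cancer and sun exposure",
]
print(top_words(cluster))
print(top_words(cluster, stop_words={"results", "to", "on", "and"}))
```

Without the stop word list, "results" ranks among the top words of the sample cluster. With the list, it is ignored and more informative terms take its place.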

Figure 1. Visual representation of document clusters.

Figure 2. Visual representation of document clusters under smallest threshold value intervals.

Figure 3. Clustering when all identified stop words were used.

Finally, Piranha’s “category” option was used. Piranha has two types of categories: Piranha categories and document categories. The Piranha categories option allows users to create categories that are expected to appear frequently in the uploaded literature, and also to create word banks for words relating to the category. The user then runs Piranha with these new categories in place and each document is assigned to exactly one of the specified categories. If any document does not use words from any category’s word list frequently enough to be considered significant, Piranha categorizes it as “unknown”. When Piranha clusters the articles, the clusters themselves do not change, but the bars depicted below each cluster’s folder icon are now color-coded to represent the ratios of categories to which their articles belong. This allows for further pattern identification, as a trend between two categories is immediately more noticeable due to the color coding. Following the above procedure, categories for specific types of cancer, for climate, and for genetics were created. Clusters were generated for different threshold values, different stop word combinations, and combinations of both stop word groups and varied threshold values. In each case, previously unidentified clusters that appeared to reveal an unusual trend were examined. Figure 4 depicts the representation of the cluster from Figure 3 with the category option being used.
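The category assignment described above can be approximated with a short sketch. The criterion Piranha uses to decide that a word bank is used "frequently enough" is not specified, so the example assumes a simple minimum hit count, the hypothetical `min_hits` parameter. The word banks, function names, and sample texts are likewise illustrative assumptions.

```python
import re
from collections import Counter


def assign_category(text, word_banks, min_hits=2):
    """Assign a document to the category whose word bank it uses most often."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    hits = {name: sum(words[w] for w in bank) for name, bank in word_banks.items()}
    best = max(hits, key=hits.get)
    # Too few matches in every bank: fall back to "unknown".
    return best if hits[best] >= min_hits else "unknown"


def category_ratios(cluster_docs, word_banks, min_hits=2):
    """Fraction of a cluster's documents in each category (the colored bar)."""
    if not cluster_docs:
        return {}
    labels = Counter(assign_category(d, word_banks, min_hits) for d in cluster_docs)
    return {name: count / len(cluster_docs) for name, count in labels.items()}


# Hypothetical word banks for three of the category types mentioned in the text.
word_banks = {
    "skin cancer": {"melanoma", "skin", "uv"},
    "climate": {"temperature", "humidity", "climate"},
    "genetics": {"gene", "genetic", "mutation"},
}
cluster = [
    "uv exposure and melanoma of the skin",
    "skin damage from uv light",
    "a genetic mutation in one gene",
    "kidney transplant outcomes",
]
print(category_ratios(cluster, word_banks))
```

The resulting fractions correspond to the proportions that would be drawn as a color-coded bar below a cluster's folder icon.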

Piranha is also able to distinguish whether articles were all uploaded from the same place or from many different locations. This allows users to utilize the document categories option. When this option is used, the bars depicted below each folder icon are color-coded based on the origin of the articles included in that folder’s cluster. Like the color coding of the Piranha categories option, this gives an immediate visual representation of the relative makeup of each cluster. However, users may not simultaneously color-code the document clusters with the Piranha categories option and the document categories option.

Figure 4. Document clustering depicted in Figure 3 with the Piranha categories option in use.


Results

Results of the Study

Several clusters showing expected results appeared regularly. One such cluster contained articles discussing skin cancer and exposure to UV radiation. A close-up of this cluster and its top words is depicted in Figure 5. Other regularly appearing clusters included articles discussing genetic risk factors for breast cancer and prostate cancer, depicted in Figure 6. One large cluster that appeared regularly under all user-defined conditions grouped many of the articles under the top word “genetics”, showing that many articles tended to discuss the genetic component of the type of cancer being studied. This cluster, depicted in Figure 7, was generally the largest cluster each time Piranha was run with any variation on threshold value intervals or stop words.

Figure 5. Cluster depicting relationship between solar UV radiation and skin cancer.

Figure 6. Cluster depicting relationship between genetics and breast and prostate cancers.

Figure 7. Cluster depicting discussed genetic component of cancer.

One unexpected cluster that consistently appeared after every document clustering, regardless of threshold levels or the presence of stop words, was a cluster that contained articles discussing skin cancer and receipt of organ transplants. This cluster also contained a subcluster of articles discussing incidence of nonmelanoma skin cancer and kidney transplant recipients. Figures 8 and 9 depict these clusters under the smallest threshold level intervals and with the utilization of most identified stop words, but without using the category feature.

Eighteen additional publications containing the terms “kidney transplant” and “skin cancer” were obtained to investigate this association further. These publications were collected in a separate folder from the original articles used for the first portion of the research. The new documents were then analyzed separately and in combination with the original publications. The document categories feature was used in these cases to distinguish between the types of articles. The results of this subsequent analysis are depicted in Figures 10 and 11.

Discussion

Text mining revealed the expected links between climate and cancer incidence, such as the effect of solar UV radiation on skin cancer incidence. It also confirmed known links such as the role genetics plays in an individual’s risk of developing cancer. Breast cancer and prostate cancer specifically were linked to genetics several times. It is notable that, despite the makeup of the articles being analyzed, genetics was such a commonly used term that it consistently appeared as a top word in a cluster containing over half of the articles included. A requirement of the articles included was that they discussed the impact of climatic or environmental factors on cancer incidence, so the prominence of genetics as a term underscores that genetic and environmental risk factors for a disease are not completely independent, and may in fact exacerbate each other when an individual’s risk profile includes both.

An interesting, unexpected link was discovered that associated skin cancer incidence with individuals who have received a kidney transplant. Further analysis of the articles in the cluster that revealed this association showed its direction: individuals who receive a kidney transplant are later at a higher risk of developing skin cancer.

Figure 8. Relationship between organ transplants (kidney transplants in particular) and skin cancer incidence.

Figure 9. Relationship between kidney transplants and incidence of nonmelanoma skin cancer.


While this association does not directly involve climate, the consistency with which this cluster appeared piqued interest. The link between skin cancer and UV exposure is a significant factor in this association, which is important to consider in further research and is very relevant in terms of health policy. If the link discovered in this study is demonstrated to hold true, it is extremely likely that those who have received a kidney transplant and live in areas with high UV exposure will have a higher likelihood of developing skin cancer than either those who receive a kidney transplant and live in areas with low UV exposure or those who live in areas with high UV exposure but have never received a kidney transplant. This is because the risk factors of receipt of a kidney transplant and high amounts of sun exposure are likely to exacerbate each other.

It is important to note that many kidney transplant recipients (and many organ transplant recipients in general) are elderly, as these individuals are more likely to need an organ transplant in the first place [6]. Additionally, regions of the U.S. with warmer climates and higher UV exposure are commonly thought to have a higher proportion of elderly residents. For example, in 2011, 17.6% of Floridians were 65 years of age or older [7]. This conjunction of the two identified risk factors means that they are likely to exacerbate each other, leading to higher overall risk for skin cancer in elderly individuals who have both received kidney transplants and live in warmer climates. Awareness of this association among the elderly and those who care for them is very important, as is further research into this link.

When new and old publications were analyzed in Piranha simultaneously, the new publications were consistently grouped with the three articles, depicted in Figure 8, that related skin cancer and kidney transplants in the first portion of the research. When only the new publications were analyzed, there were very few clusters, and the articles were all determined to be very similar to each other. Both of these observations are expected, as all new articles were selected because they are similar both to each other and to the three original articles discussing skin cancer incidence rates and kidney transplants.

One limitation of the study is that the results are based on currently published and accessible literature. Expanding the number and sources of documents could potentially lead to richer findings, but might render those findings less reliable than these results, which are based only on peer-reviewed literature.


Figure 10. Document clustering when original articles and supplementary articles are analyzed together. Original articles are shown in purple; supplementary articles are shown in red. Document categories are used to show the respective origins of each article.

Figure 11. Document clustering of supplementary articles alone.


Summary and Conclusions

While expected associations, such as the association between solar UV radiation and skin cancer or the association between genetics and breast cancer, were discovered through the text mining of biomedical literature pertaining to climate or environment and cancer incidence, unexpected associations relating climatic or environmental factors to the incidence of specific cancers were not discovered during this study. Cluster analysis did, however, identify an unexpected association between kidney transplants and increased incidence of nonmelanoma skin cancer. This link was investigated further by obtaining additional publications on the topic. Cluster analysis of these publications alone and in conjunction with the originally analyzed literature confirmed the expected clustering patterns.

Possible future work stemming from this study includes linking actual kidney transplant records with skin cancer incidence data. Such an investigation would allow researchers to determine whether the link appearing in this study also holds at a larger scale. Additionally, climate data could be obtained to determine whether the association between skin cancer and kidney transplants is stronger in sunny climates.


Acknowledgements

Special thanks to Georgia Tourassi, Robert Patton, Kara Kruse, Debbie McCoy, and Rashida Askia.

The work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725. This work has been authored by a contractor of the U.S. Government; accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

The Research Alliance in Math and Science program is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy.


References

1. Crandall, C.G., & Gonzalez-Alonso, J. (2010). Cardiovascular Function in the Heat-Stressed Human. Acta Physiologica, 407-23.

2. van der Leun, J. C., & de Gruijl, F. R. (2002). Climate change and skin cancer [Abstract]. Photochemical and Photobiological Sciences, 324-26.

3. Jemal, A., Center, M. M., & DeSantis, C. (2010). Global Patterns of Cancer Incidence and Mortality Rates and Trends. Cancer Epidemiology Biomarkers & Prevention, 1893-907.

4. Garfinkel, L., Boring, C. C., & Heath, C. W., Jr. (2009). Changing trends: An overview of breast cancer incidence and mortality. Cancer, 74.

5. Witten, I. H. (2004). Text mining. In M. P. Singh (Ed.), Practical handbook of internet computing. Boca Raton, FL: Chapman & Hall/CRC Press.

6. Transplants in the U.S. by recipient age. (2012, July 27). Retrieved from OPTN: Organ Procurement and Transplantation Network database.

7. Florida QuickFacts from the U.S. Census Bureau [Table]. (2012, June 7). Retrieved from http://quickfacts.census.gov/qfd/states/12000.html