
Methods for clustering and analyzing NIH grants using word2vec

Travis A. Hoppe and George Santangelo
Office of Portfolio Analysis, Division of Program Coordination, Planning, and Strategic Initiatives, Office of the Director, National Institutes of Health, Bethesda, MD 20892

INTRODUCTION

The abstract and specific aims sections of a grant application are rich sources of semantic, contextual, and categorical information for portfolio analysis. To extract this information from a large number of grants, however, the natural language of each grant must be parsed and encoded in a form a computer can work with. The Office of Portfolio Analysis has constructed an open-source data processing pipeline* to analyze and cluster topically related grants using the "word2vec" embedding scheme. The pipeline handles the pre- and post-processing of the data as well as the clustering, and it improves on previous standard methodologies (such as keyword analysis, TF-IDF, or IN-SPIRE) in recall, model interpretability, and scalability. As a concrete example, we illustrate the process by clustering competing R01 grant applications over a five-year window.

*https://github.com/NIHOPA/word2vec_pipeline

1) Data pre-processing

Data pre-processing is a critical step in processing text. The NIH grant corpus needs particular care to handle data errors, historical idiosyncrasies, and extraneous tokens. In addition to stemming and part-of-speech reduction (only nouns are preserved), the pipeline performs the following pre-processing steps:

• Unidecode: α-helix and β-strand → a-helix and b-strand
• Title caps: THE STUDY OF ALZHEIMER’s → The study of Alzheimer's
• Abbreviation expansion: (HGH) → Human Growth Hormone
• MeSH tokenization: Psychological Theory → MESH_Psychological_Theory
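A minimal sketch of these normalization steps in Python, assuming the third-party unidecode package; the abbreviation table, MeSH phrase list, and demo string are illustrative stand-ins, not the pipeline's actual rules:

```python
# Illustrative sketch of the pre-processing steps listed above (not the pipeline's exact code).
from unidecode import unidecode  # transliterates non-ASCII, e.g. "α-helix" -> "a-helix"

# Hypothetical lookup tables; the real pipeline derives these from the corpus and from MeSH.
ABBREVIATIONS = {"HGH": "Human Growth Hormone"}
MESH_PHRASES = ["Psychological Theory"]

def preprocess(text):
    text = unidecode(text)                              # Unidecode: strip Greek letters, accents, etc.
    if text.isupper():                                  # Title caps: down-case an all-caps title
        text = text.capitalize()
    for abbrev, expansion in ABBREVIATIONS.items():     # Abbreviation expansion
        text = text.replace(f"({abbrev})", expansion)
    for phrase in MESH_PHRASES:                         # MeSH tokenization: join known phrases into one token
        text = text.replace(phrase, "MESH_" + phrase.replace(" ", "_"))
    return text

print(preprocess("the α-helix binds (HGH) via Psychological Theory"))
# -> "the a-helix binds Human Growth Hormone via MESH_Psychological_Theory"
```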

2) word2vec embedding

Word2vec relies on the “distributional assumption” for text: the semantic sense of a word is learned from its co-occurrence patterns over a large sample. Over the NIH corpus, we can mathematically find related words by comparing their vector representations.

Example word2vec similarities over the NIH corpus:

• cancer (malignancy, neoplasm, prostate_cancer)
• elegans (nematode, Caenorhabditis_elegans)
• channel (ion_channel, voltage_gated)
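A sketch of how such neighbors can be queried, assuming the gensim library; the toy corpus and training parameters are illustrative, not the pipeline's actual settings (with only a few sentences the neighbors will not be meaningful):

```python
# Illustrative sketch: train a word2vec model and query nearest neighbors (gensim assumed).
from gensim.models import Word2Vec

# Stand-in for the pre-processed NIH corpus: one list of tokens per document.
sentences = [
    ["cancer", "malignancy", "neoplasm", "prostate_cancer"],
    ["elegans", "nematode", "caenorhabditis_elegans"],
    ["channel", "ion_channel", "voltage_gated"],
]

# Words that co-occur in similar contexts end up with nearby vectors.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar("cancer", topn=3))
```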

3) Document representation

Document representations can be built up by taking the average vector of each document's tokens.

Shown on the left is a t-SNE projection for five years of R01s, consisting of 150,000 competing applications. Color represents topic area. The clumping indicates the presence of large-scale structure in the dataset.
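A minimal sketch of the averaging step and the t-SNE projection, assuming gensim, numpy, and scikit-learn; the toy documents and the tiny perplexity are stand-ins for the real 150,000-document corpus:

```python
# Sketch: average word vectors into a document vector, then project with t-SNE.
import numpy as np
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Toy stand-in for the pre-processed, tokenized abstracts and specific aims.
docs = [
    ["cancer", "malignancy", "neoplasm"],
    ["elegans", "nematode", "genetics"],
    ["channel", "ion_channel", "neuron"],
    ["cancer", "prostate_cancer", "screening"],
]
model = Word2Vec(docs, vector_size=50, min_count=1)

def document_vector(model, tokens):
    # Mean of the embedding vectors for the tokens the model knows about.
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0)

doc_vectors = np.vstack([document_vector(model, tokens) for tokens in docs])

# 2-D projection for visual inspection of large-scale structure
# (perplexity is tiny here only because the toy corpus is tiny).
xy = TSNE(n_components=2, perplexity=2).fit_transform(doc_vectors)
print(xy.shape)  # (4, 2)
```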

4) Clustering and network topology

Clustering the R01 applications over the document vectors and comparing the cluster centroids gives a quantitative similarity map of the content areas.

Shown on the left is a similarity map for 150 fixed clusters over all R01s from fiscal years 2011-2015.
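A sketch of the clustering step, assuming scikit-learn; the random document vectors and the reduced cluster count are stand-ins for the real document vectors and the 150 fixed clusters quoted above:

```python
# Sketch: cluster document vectors and compare cluster centroids (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 50))    # stand-in for the real document vectors

# 150 fixed clusters are used in the analysis above; 10 is enough for the toy data.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(doc_vectors)

# Pairwise cosine similarity between centroids gives the cluster-level similarity map.
similarity_map = cosine_similarity(kmeans.cluster_centers_)
print(similarity_map.shape)  # (10, 10)
```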

The document vectors can be collapsed over pre-existing categories, like study section or IC. The topology of the network can be built up by requiring nodes to share a minimal content similarity.

Shown on the left are the study sections from the R01 corpus, colored by their betweenness within the network.
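A sketch of building the category-level network, assuming networkx and scikit-learn; the category labels, the similarity threshold, and the random vectors are illustrative:

```python
# Sketch: collapse document vectors by category, link categories whose content
# similarity exceeds a minimal threshold, and compute betweenness (networkx assumed).
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 50))
labels = rng.integers(0, 20, size=1000)       # stand-in for study section (or IC) labels

# Collapse: mean document vector per category.
centroids = np.vstack([doc_vectors[labels == k].mean(axis=0) for k in range(20)])
sim = cosine_similarity(centroids)

# Nodes share an edge only if their content similarity exceeds the threshold.
threshold = 0.05                              # illustrative value
G = nx.Graph()
G.add_nodes_from(range(20))
for i in range(20):
    for j in range(i + 1, 20):
        if sim[i, j] > threshold:
            G.add_edge(i, j, weight=sim[i, j])

betweenness = nx.betweenness_centrality(G)    # used to color the nodes in the figure
```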

5) Application-level similarity comparisons

Study sections can be directly compared to each other at the application level by plotting each application along axes measuring its distance to each centroid. Shown above are two study sections that share an edge in the network, MIM (Myocardial Ischemia and Metabolism) and BINP (Brain Injury and Neurovascular Pathologies), and two that do not, MIM and CONC (Clinical Oncology).
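A sketch of the application-level comparison, assuming numpy and matplotlib; the vectors are random stand-ins, and Euclidean distance to each study-section centroid is one reasonable choice of axis:

```python
# Sketch: position each application by its distance to two study-section centroids.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
applications = rng.normal(size=(500, 50))    # stand-in application document vectors
centroid_mim = rng.normal(size=50)           # e.g. MIM centroid
centroid_binp = rng.normal(size=50)          # e.g. BINP centroid

# One axis per centroid: distance of every application to that centroid.
x = np.linalg.norm(applications - centroid_mim, axis=1)
y = np.linalg.norm(applications - centroid_binp, axis=1)

plt.scatter(x, y, s=5)
plt.xlabel("distance to MIM centroid")
plt.ylabel("distance to BINP centroid")
plt.show()
```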

6) Entropy and dispersion

Dispersion (measured by the extent to which a collection of documents spreads around its cluster centroid) and entropy (measured by the spread of the documents over a fixed set of labels) are correlated for the administrative IC distribution of content clusters.

Either measure captures the extent to which an IC distributes itself across content, highlighting the differences in funding strategies for each IC.
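A sketch of the two measures, assuming numpy and scipy; here dispersion is taken to be the mean distance of an IC's documents from their centroid, and entropy is the Shannon entropy of the IC's distribution over content clusters (both definitions are assumptions consistent with the description above):

```python
# Sketch: dispersion (spread of an IC's documents around their centroid) and
# entropy (spread of those documents over the fixed content-cluster labels).
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
ic_doc_vectors = rng.normal(size=(300, 50))        # documents administered by one IC (stand-in)
cluster_labels = rng.integers(0, 150, size=300)    # content cluster assigned to each document

# Dispersion: mean distance of the IC's documents from their own centroid.
centroid = ic_doc_vectors.mean(axis=0)
dispersion = np.linalg.norm(ic_doc_vectors - centroid, axis=1).mean()

# Entropy: Shannon entropy of the IC's distribution over the 150 content clusters.
counts = np.bincount(cluster_labels, minlength=150)
label_entropy = entropy(counts / counts.sum())

print(dispersion, label_entropy)
```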