Post on 12-May-2020
Mining PubMed Articles in Watson Explorer Analytics Components
Cognitive Medical Computing for Research
Contents
Introduction
  National Center for Biotechnology Information
  JATS XML Format
  Artifacts you will need
Installation
  Importing the Collection configuration archive
  Installing the Processing Engine Archive
  Installing the crawler plugin
    Set the default title field
  Confirm Index fields and Facets
  Confirm Facet Processing Configuration
  Build the Collection
Mining the NCBI PubMed Collection for Insights
  Basic Principles: the Facet
  Basic Principles: the Correlation Index
  Programmatic Mining using REST
Author: Kameron Cole (kameroncole@us.ibm.com)
Date: April 25, 2016
Version: 0.1
Introduction
National Center for Biotechnology Information
JATS XML Format
Artifacts you will need:
1) One of two Analytics Collection configuration archives:
   a. (optional) - a configuration to be used with IBM BigInsights integration, for Hadoop-enabled indexing
   b. Xxx is a configuration to be used without Hadoop processing.
2) The hci_annotator_01.pear Processing Engine Archive
3) The crawler plugin xml2field.jar archive (version 1 or version 2)
   a. The first version creates index fields for all the JATS elements in each document and also creates a searchfields.xml file (this file, when complete, can reach gigabytes in size).
   b. The second version only creates the index fields. This option is used with the Analytics Collection configuration.
4) collection.xml (optional)
5) Four zipped portions of PubMed documents
Installation
The installation consists of importing a preconfigured Analytics Collection, installing custom processing components, and “crawling”, “parsing”, and “indexing” the documents.
Importing the Collection configuration archive
All instructions assume that the WEX system is running.
___1. Open a command prompt as the *ESADMIN* user. Enter the following
command:
esadmin import -fname <configuration zip file> -cid <collection id> -name <new collection name>
___2. You may notice that two files are reported as not found during the import:
a. The crawler plugin
b. Although not shown, the HealthAnalyticsTS.pear file for annotations, which may
not be installed on your system
___3. Check that the collection was installed, and check components
a. Navigate in a browser to ESAdmin (http://<hostname>/ESAdmin). You
should see the collection:
b. Check the Text Analytics Archive installation. In ESAdmin, look in Parse
and Index->More->Text Processing Options
c. Check in System->Parser->Configure system text analysis engines->Text
Analytics Engines for gene01
d. If not installed, proceed to Installing the Processing Engine Archive
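The import command in step 1 can also be scripted. The following is a minimal sketch, assuming Python is available on the WEX host and that the script runs as the ESADMIN user; the function names are illustrative and not part of WEX, and only the flags shown in step 1 are used.

```python
import subprocess


def build_import_command(config_zip, collection_id, collection_name):
    """Return the argument list for the esadmin import command from step 1."""
    return [
        "esadmin", "import",
        "-fname", config_zip,
        "-cid", collection_id,
        "-name", collection_name,
    ]


def import_collection(config_zip, collection_id, collection_name):
    """Run the import. Raises CalledProcessError if esadmin exits non-zero."""
    cmd = build_import_command(config_zip, collection_id, collection_name)
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the collection name contains spaces.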
Installing the Processing Engine Archive
___1. In System->Parser->Configure system text analysis engines->Text Analytics
Engines, click the button: Add System Text Analysis Engine
___2. Configure the installation options
a. Give the TAE a name
b. Check Use processing engine archive
c. Browse to the path, either locally, or on another server
d. Verify installation
___3. Extract configuration files from the .pear archive
a. Using any archive tool, open the .pear file
b. Extract the entire /config directory
c. These files will be extracted
___4. Associate the TAE with the new Collection
a. Navigate to Collections->Parse and Index->More->Text Processing
Options->Select a system text analysis engine
b. Install the mapping descriptor for CAS to index. It is in the /config
folder you unzipped above.
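Step 3 (extracting the configuration files) can be done with any archive tool because a .pear file is a ZIP archive. As a sketch only, assuming Python is available, the hypothetical helper below pulls out everything under a given directory prefix; the prefix defaults to the /config directory named in step 3 and may need adjusting for your archive.

```python
import zipfile


def extract_pear_dir(pear_path, dest_dir, prefix="config/"):
    """Extract every file under `prefix` from a .pear (ZIP) archive.

    Returns the list of archive entries that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(pear_path) as pear:
        for name in pear.namelist():
            # Skip directory entries; extract() creates parent folders itself.
            if name.startswith(prefix) and not name.endswith("/"):
                pear.extract(name, dest_dir)
                extracted.append(name)
    return extracted
```

For example, `extract_pear_dir("hci_annotator_01.pear", "pear_config")` would place the files under `pear_config/config/`.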
Installing the crawler plugin
___1. Copy the JATS XML libraries into the <ES_NODE_ROOT>/logs
directory. In this installation kit, the libraries preserve the directory
structure exactly, so copy the kit's /logs directory into <ES_NODE_ROOT>/logs.
___2. Check the crawler plugin location in ESAdmin. Navigate to
Crawlers, and you should see:
___3. Hover over the lower right-hand corner, until you see the pencil
icon. You will edit the crawler properties (not crawlspace)
___4. !IMPORTANT. You will see that this is either a Unix file system
crawler or a Windows file system crawler. If the type of crawler is correct for
your machine, simply adjust the path to the raw NCBI PubMed data.
Otherwise, continue with step 5.
___5. You will have to delete this crawler and create one that matches
your system
a. Delete the crawler.
b. Click the plus (+) button.
c. Choose the correct crawler for your file system
d. In the crawler properties, give the crawler any name. Under
Advanced options, configure the crawler plugin. You should have
placed the plugin file in some directory – this location is used for
the Plug-in class path element. The Plug-in class name should be
the same.
e. Select the directory where you have the expanded data files to be
crawled:
f. Complete the rest of the wizard, accepting defaults.
Set the default title field
In the current scenario, the title field is populated from the <title> element in the PubMed
XML. This is the wrong element. You will see in the raw XML that the PubMed developers
have used <title> as an HTML-style markup field, rather than for the intuitive notion of a
document title:
This results in an unacceptable title display.
One way to fix this is to map the element <article-title> to the default title element in
WEX, since <article-title> is generally the more desirable title field:
In the WEX ESAdmin web application, find the crawler and edit the crawlspace
Choose Edit Metadata
From the dropdown menu next to the _$Title$_ field, select the articletitle field as a
mapping. Note that this field is derived from the <article-title> xml field, in the code of
the crawler plugin you installed. Otherwise, it would not be available.
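To see the difference between the two elements outside WEX, the following sketch reads the <article-title> element from a JATS document with Python's standard library. It is an illustration only and not the code of the crawler plugin; the sample document is abbreviated, and its <sec><title> content is a made-up example of <title> being used as a section heading.

```python
import xml.etree.ElementTree as ET

# Abbreviated, partly hypothetical JATS document for illustration.
SAMPLE = (
    "<article><front><article-meta><title-group>"
    "<article-title>NonClinical Dose Formulation Analysis Method "
    "Validation and Sample Analysis</article-title>"
    "</title-group></article-meta></front>"
    "<body><sec><title>Introduction</title></sec></body></article>"
)


def article_title(jats_xml):
    """Return the text of the first <article-title> element, or None."""
    root = ET.fromstring(jats_xml)
    node = root.find(".//article-title")
    if node is None:
        return None
    # itertext() also collects text nested in inline markup; split/join
    # normalizes whitespace left over from line wrapping.
    return " ".join("".join(node.itertext()).split())


print(article_title(SAMPLE))
```

A lookup for `.//title` in the same document would return "Introduction", which is the kind of unsuitable title display described above.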
Confirm Index fields and Facets.
The index field and facets should have been configured when you did the initial import.
Please check the following.
___1. Navigate to the index fields pane.
___2. You should see quite a few index fields. To be sure that everything
is correct, click the Import Index Fields button.
___3. Select the searchfields.xml file that came with this installation kit.
Do not use the searchfields.xml that you extracted in the /conf directory of
the .pear file.
___4. When you look in the Import column, all the boxes should be
grayed-out, indicating that all these fields are already in the configuration.
___5. Back in the Parse and Index pane, navigate to Analytics Resources->Facet
Tree. You should see Facets like those below
Confirm Facet Processing Configuration
A collection such as this one requires some exceptional processing. It has
a large taxonomy, that is, the ontology of cataloged facets. The taxonomy cache stores generated
facets that are used by the indexer component. Watson Explorer 10 provides three types of
taxonomy caches:
LRU
Partially in memory. Scalability is limited.
TrieL2O
Completely in memory. This cache type is 2–5 times faster than LRU.
DA
Completely in memory. This cache is 5 - 10% faster than TrieL2O.
Either the LRU cache or the DA cache should be used depending on the size of memory
assigned to the indexer. By default, the LRU cache is set. If you have enough memory,
use a complete cache, i.e. DA. If you want to process a large document set over a longer
duration with a smaller memory footprint, you can configure the system to use the LRU
cache that partially loads the taxonomy index in memory.
For this collection, ensure the following:
___1. In the
$ES_NODE_ROOT/master_config/collection_ID.indexservice/collection.xml file,
find the <index> element with a <type> element value of Facet. The XPath of this
element is: /config/collection/indexes/index[type='Facet']
___2. Add the CacheType property to the <index> element and set its value to
DA:
<index>
  <type>Facet</type>
  <path>facets</path>
  ...
  <property name="CacheType" value="DA"/>
</index>
___3. If you use the DA taxonomy cache, you should also set the number of
partitions. Note that this is a different setting than the collection partitions
mentioned previously. Because concurrent read/write operations can be performed
for each partition, setting the number of partitions enables the CPU resource to be
efficiently utilized when processing facets. The valid range of the number of
partitions is 1 – 36.
a. To set the number of partitions of the DA taxonomy cache, add
<property name="NumberOfCachePartitions" value="16"/> as a sibling element of
the <property name="CacheType" value="DA"/> element that you inserted
above.
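The two property edits above can be applied by hand in a text editor. As an alternative, here is a minimal sketch using Python's standard library; it assumes the document root is <config>, as the XPath in step 1 implies, and the function name is hypothetical. Back up collection.xml first, because ElementTree discards XML comments when it rewrites a file.

```python
import xml.etree.ElementTree as ET


def set_facet_cache(root, cache_type="DA", partitions=16):
    """Set CacheType and NumberOfCachePartitions on the Facet <index> element.

    `root` is the <config> root element of collection.xml.
    """
    if not 1 <= partitions <= 36:
        raise ValueError("the number of partitions must be between 1 and 36")
    wanted = {
        "CacheType": cache_type,
        "NumberOfCachePartitions": str(partitions),
    }
    for index in root.findall("collection/indexes/index"):
        if (index.findtext("type") or "").strip() != "Facet":
            continue
        for name, value in wanted.items():
            # Reuse an existing <property> so reruns do not add duplicates.
            prop = index.find(f"property[@name='{name}']")
            if prop is None:
                prop = ET.SubElement(index, "property", {"name": name})
            prop.set("value", value)
        return index
    raise ValueError("no <index> element with <type>Facet</type> found")
```

A typical use would be `tree = ET.parse(path)`, then `set_facet_cache(tree.getroot())`, then `tree.write(path)`.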
Build the Collection
Now turn on all the runtime processes and build the collection.
___1. Turn on the Parser Indexer
a. Configure memory – these are just suggestions. You will have to
adjust according to your available resources.
b. Turn on runtime
___2. Turn on Crawler
___3. Turn on Searcher
a. Configure memory
b. Turn on runtime
You can monitor the progress of the Crawler by clicking the “eyeball”
icon. The Parser Indexer should say waiting.
Navigate in a browser to the http://<hostname>/ui/analytics application.
Make sure you select the new collection:
You should see the collection complete with Facet tree. Click on the
Facet pane and navigate through the Facet tree. You should have
Facet values.
You should also check one of the JATS/PubMed Facets, which are in
lower-case in the Facet tree.
Mining the NCBI PubMed Collection for Insights
Basic Principles: the Facet
To understand Watson Explorer Analytics Components, start with its fundamental building
block, the Facet. The name Facet is particularly appropriate, as opposed to, say, “keyword”. A
Facet is similar to a category in a traditional ontological taxonomy. The Facets in this
collection are created in two ways.
The first way is through a custom plugin, which creates Facets directly from the
Journal Article Tag Suite (see note i). The Journal Article Tag Suite (JATS) is a standard
(NISO Z39.96-2012) that defines a set of XML elements and attributes for tagging journal
articles and describes three article models. JATS is a continuation of the NLM Archiving
and Interchange DTD work begun in 2002 by NCBI.
The JATS tags are converted directly into Facets, which can participate in the statistical
calculations of the Miner Application.
Here is a sample of the raw article text, with some of the tags highlighted:
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<?properties open_access?>
<front><journal-meta>
<journal-id journal-id-type="nlm-ta">AAPS J</journal-id>
<journal-id journal-id-type="iso-abbrev">AAPS J</journal-id>
<journal-title-group><journal-title>The AAPS Journal</journal-title></journal-title-group>
<issn pub-type="epub">1550-7416</issn>
<publisher><publisher-name>Springer US</publisher-name><publisher-loc>Boston</publisher-loc></publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">20711763</article-id>
<article-id pub-id-type="pmc">2976985</article-id>
<article-id pub-id-type="publisher-id">9226</article-id>
<article-id pub-id-type="doi">10.1208/s12248-010-9226-9</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>White Paper</subject></subj-group></article-categories>
<title-group><article-title>NonClinical Dose Formulation Analysis Method Validation and Sample Analysis</article-title></title-group>
<contrib-group><contrib contrib-type="author" corresp="yes">
<name><surname>Whitmire</surname><given-names>Monica Lee</given-names></name>
<address><email>monica.whitmire@mpiresearch.com</email></address
In the Miner, these will appear as lower-case facets.
The rest of the Facets are created using the IBM Health Care Accelerator Annotators, an
asset now in its seventh year of maturity. It uses advanced UIMA Text Analytics to turn
complex semantic segments into quantifiable entities. In Studio, the depth of information
of each annotator can be reviewed. For example, the “Problem” Annotator has multiple
features, many of which were taken directly from the Unified Medical Language System
(UMLS) Metathesaurus.
Each of the properties (called Features in UIMA) can be mapped to a Facet in the Miner.
For example, Concept ID is a standard defined in SNOMED CT.
It allows for a technically powerful “grouping” of related medical conditions. In search
engine terms, this maps directly onto the notion of a Facet. As a Facet, I can use the
concept ID, instead of a keyword, to make my search results more inclusive without
making them diffuse. Compare:
Figure 1: Keyword search, yields 2385
Figure 2: Faceted search, yields 24882
Note the search syntax for the Faceted search: /"Facet Name"/"Facet Value"/. One of the
powerful features of the CID annotator is that the actual CID number need not occur in the
text. The linguistic construction of the Annotator “derives” the appropriate CID.
Basic Principles: the Correlation Index
Of the three statistical indexes built into WEX’s Content Miner, the Correlation index is
the most valuable, albeit the least understood. This statistical formula, along with the rest
of what makes up the core analytics of this product, originally known by the Japanese
name TAKMI, was included among IBM’s 100 Icons of Progress, to celebrate its
Centennial (see note ii).
It is not, in fact, a statistical correlation at all; rather, it describes a relationship in the
density of occurrences within a data corpus (D) between two Facets, A and B:
The letter D represents the entire collection of documents and the # symbol represents the number of documents in the collection. The left and the right sides of the equation are equal to each other.
The right side of the equation is the ratio of the actual density of A∩B, which is
#(A∩B)/#D, to the product of the density of A and the density of B, (#A/#D)(#B/#D). It
represents the deviation from independence of A and B. The right side is more intuitive than
the left side as a 2-dimensional index, i.e. the Facet Pairs View in the Miner interface.
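The equation itself did not survive in this transcript. The following LaTeX is a reconstruction from the description above, not the original figure. The right side follows the text directly. The left side is an assumption: it is the algebraically equivalent form in which the density of B among the documents in A is divided by the density of B in the whole collection.

```latex
\[
\frac{\#(A \cap B) / \#A}{\#B / \#D}
\;=\;
\frac{\#(A \cap B) / \#D}{(\#A / \#D)\,(\#B / \#D)}
\]
```

Both sides reduce to \#(A∩B)·\#D / (\#A·\#B). The value is 1 when A and B are independent and rises above 1 as they co-occur more often than independence would predict.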
This can be illustrated as described in the following graphic. For example, although only 5% of all the
documents in a data corpus are about obtaining an instruction manual (A), this figure rises to 20% when
only personal computer-related documents (B) are examined.
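The arithmetic behind this example can be checked with a few lines of Python. This is a sketch of the ratio as described in this section, not the product's internal implementation, and the document counts are hypothetical values chosen to match the percentages in the example.

```python
def correlation_index(n_ab, n_a, n_b, n_d):
    """Observed density of A∩B divided by the density expected if A and B
    were independent: (n_ab / n_d) / ((n_a / n_d) * (n_b / n_d))."""
    if min(n_a, n_b, n_d) <= 0:
        raise ValueError("document counts must be positive")
    # Algebraically simplified form of the ratio above.
    return (n_ab * n_d) / (n_a * n_b)


# Hypothetical counts: 1000 documents in total, 50 about instruction manuals
# (A, 5%), 100 about personal computers (B), and 20 of those 100 also about
# manuals (20%).
print(correlation_index(n_ab=20, n_a=50, n_b=100, n_d=1000))  # 4.0
```

A value of 4.0 says that manuals are four times as dense among personal computer documents as in the corpus overall (20% versus 5%).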
Sample Scenario: Gene Research as influenced by health care insurance
The ICD9 Code 199.1 has the following description:
We can create a data mining investigation around the relationship between this ICD9 Code
and particular genes, as always with this data corpus, within the
context of medical research. Initially, we use the Facets View, ranked by Correlation
value, to see which genes are most densely related to this ICD9 code.
The FacetPairs View gives us a “heat map” of the relative densities of ICD9 Codes to
genes:
We should investigate the cells that show yellow to red; the color indicates the
strength of the Correlation Index value.
Finally, the Connections View gives us additional edges, in a directed graph, as opposed
to the 2-dimensional array provided in the FacetPairs View. We now have:
Edge 1: Gene -> ICD9
Edge 2: ICD9 -> Gene
Edge 3: Gene -> Gene
Edge 4: ICD9 -> ICD9
The strength of the Correlation value for each edge is indicated by the red-gradient.
Programmatic Mining using REST
The REST API provided with WEX is one of its most powerful features. Every aspect of
the Miner UI is created through REST queries. We will look at the “cube” query, which
has the following syntax:
http://<hostname>:8393/api/v10/search/facet/cube?collection=<col_id>&facets=[{"namespace":"keyword","id":"$<.facet_path>","count":50},{"namespace":"keyword","id":"$<.facet_path>","count":50}]&correlation=facetPairs&query=*:*&output=application/xml
For example:
http://x18n04.pbm.ihost.com/api/v10/search/facet/cube?collection=PubMed_NCBI&facets=[{%22namespace%22:%22keyword%22,%22id%22:%22$.icd9%22,%22count%22:50},{%22namespace%22:%22keyword%22,%22id%22:%22$.gene%22,%22count%22:50}]&correlation=facetPairs&query=*:*&output=application/xml
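Assembling this URL by hand is error-prone because the facets parameter is a JSON array that must be URL-encoded. The sketch below builds the request URL with Python's standard library. It uses only the parameters that appear in the syntax above; the function name and the `base_url` argument are illustrative assumptions.

```python
import json
from urllib.parse import urlencode


def build_cube_url(base_url, collection, facet_paths, query="*:*", count=50):
    """Build a cube query URL for the given facet paths (e.g. ".icd9")."""
    facets = [
        {"namespace": "keyword", "id": "$" + path, "count": count}
        for path in facet_paths
    ]
    params = {
        "collection": collection,
        # Compact JSON, matching the shape shown in the syntax above.
        "facets": json.dumps(facets, separators=(",", ":")),
        "correlation": "facetPairs",
        "query": query,
        "output": "application/xml",
    }
    # urlencode percent-encodes the JSON and the query string.
    return base_url.rstrip("/") + "/api/v10/search/facet/cube?" + urlencode(params)


print(build_cube_url("http://<hostname>:8393", "PubMed_NCBI", [".icd9", ".gene"]))
```

The `query` argument carries the third dimension of the cube, so a more specific query string can be passed in place of the default all-documents query.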
You will notice that the correlation value for each ICD9 code is 1.0. In Watson Explorer
Analytics Components, this is equivalent to no correlation. The same is true of the gene codes
returned. This is critical to understanding the “cube” query: it must have 3
dimensions (like a cube). The third dimension is the query parameter.
Note that the query issued was “*:*”. This is the reserved all-documents query. So the
third cube dimension here is vacuous; it’s like saying “whatever.”
Everything correlates with “whatever.”
If we change the value to something more specific, say, 199.1, we get
The subtle point here is that the query for 199.1 is not the same thing as a query for the
ICD9 code “199.1”. The current query is for any string containing 199.1. To get the
“real” answer for the correlation values of ICD9 and Gene with regard to ICD9 199.1,
the query must be “faceted”:
http://<hostname:port>/api/v10/search/facet/cube?collection=PubMed_NCBI&facets=[{"namespace":"keyword","id":"$.icd9","count":50},{"namespace":"keyword","id":"$.gene","count":50}]&correlation=facetPairs&query=keyword::/ICD9/199.1&output=application/xml
These programmatic returns are analogous to the FacetPairs View shown in the previous
section. The special consideration when processing REST returns is that they
are not ordered, as they are in the FacetPairs View. Thus, one must search through
the returned XML.
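One way to do that search is to collect every element that carries a correlation value and sort the results in code. The sketch below is an assumption-heavy illustration: this document does not show the response schema, so the attribute name `correlation` and the sample XML (including the gene names) are hypothetical and must be adapted to the XML your server actually returns.

```python
import xml.etree.ElementTree as ET


def rank_by_correlation(xml_text, attr="correlation"):
    """Return (value, element) pairs sorted from highest to lowest value.

    `attr` is a HYPOTHETICAL attribute name; check the real response.
    """
    root = ET.fromstring(xml_text)
    ranked = []
    for elem in root.iter():
        raw = elem.get(attr)
        if raw is None:
            continue
        try:
            ranked.append((float(raw), elem))
        except ValueError:
            continue  # skip non-numeric values rather than fail
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# Hypothetical response fragment, for illustration only.
SAMPLE_RESPONSE = """
<cube>
  <cell row="199.1" col="geneA" correlation="1.0"/>
  <cell row="199.1" col="geneB" correlation="3.2"/>
  <cell row="199.1" col="geneC" correlation="2.1"/>
</cube>
"""

for value, cell in rank_by_correlation(SAMPLE_RESPONSE):
    print(value, cell.get("row"), cell.get("col"))
```

Sorting in descending order reproduces the ranking that the FacetPairs View presents visually, with the strongest pairs first.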
i http://jats.nlm.nih.gov/archiving/1.0/
ii http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/takmi/