

A Knowledge-Anchored Integrative Image Search and Retrieval System

Selnur Erdal,1 Umit V. Catalyurek,2 Philip R. O. Payne,2 Joel Saltz,2 Jyoti Kamal,1 and Metin N. Gurcan2

Clinical data that may be used in a secondary capacity to support research activities are regularly stored in three significantly different formats: (1) structured, codified data elements; (2) semi-structured or unstructured narrative text; and (3) multi-modal images. In this manuscript, we will describe the design of a computational system that is intended to support the ontology-anchored query and integration of such data types from multiple source systems. Additional features of the described system include (1) the use of Grid services-based electronic data interchange models to enable the use of our system in multi-site settings and (2) the use of a software framework intended to address both potential security and patient confidentiality concerns that arise when transmitting or otherwise manipulating potentially privileged personal health information. We will frame our discussion within the specific experimental context of the concept-oriented query and integration of correlated structured data, narrative text, and images for cancer research.

KEY WORDS: Image retrieval, information retrieval, ontologies, text mining, Grid computing

INTRODUCTION

Within academic medical centers, large amounts of multi-dimensional, heterogeneous data are collected electronically on an ongoing basis. This data includes clinical parameters derived during the patient care process, as well as financial and operational information. Clinical data can take many forms, including but not limited to (1) structured, codified data elements; (2) semi-structured or unstructured narrative text; and (3) multi-modal images.1 While such data is readily available to clinical providers and administrators, accessing the same data for research or business intelligence purposes is often a challenge, usually because of regulatory compliance requirements and concerns over patient privacy and confidentiality. Furthermore, data stored in operational systems is not necessarily structured in a manner that lends itself to integrative longitudinal or class-based query and analysis, a requirement that often exists in the research context.2 Therefore, even when regulatory and patient privacy and confidentiality concerns are adequately addressed, it often remains difficult to query and access such data for research and business intelligence. One of the ways institutions have tried to address this problem is by extracting clinical, operational, and financial data from source systems and subsequently storing it in a centralized data warehouse that uses a data model optimized for longitudinal and/or class-based queries.3,4 However, imaging data sets are not usually physically stored in such warehouses because of concerns over data storage capacity and are instead commonly stored and managed using a separate Picture Archiving and Communication System (PACS). However, for purposes of clinical,5 translational, and educational research,6–8 the integration and retrieval of image data along with structured data and narrative text is highly desirable.8,9

1From the Information Warehouse, The Ohio State University Medical Center, 640 Ackerman Road, P.O. Box 183111, Columbus, OH 43218, USA.

2From the Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA.

Correspondence to: Selnur Erdal, Information Warehouse, The Ohio State University Medical Center, 640 Ackerman Road, P.O. Box 183111, Columbus, OH 43218, USA; tel: +1-614-2937623; fax: +1-614-2932210; e-mail: selnur.erdal@osumc.edu

Copyright © 2007 by Society for Imaging Informatics in Medicine

Online publication 27 November 2007
doi: 10.1007/s10278-007-9086-8

Journal of Digital Imaging, Vol 22, No 2 (April), 2009: pp 166–182

There are several examples in the current literature of integrative and ontology-anchored image search or query tools;2,10–12 however, to the best of our knowledge, none of these tools have been shown to also support the simultaneous query and subsequent integration with image data sets of structured data and narrative text. We believe that the ability to execute such truly integrative, ontology-anchored queries across multiple data types is critical to the ability to execute highly effective clinical and translational research. Therefore, to address the preceding gap in knowledge, we have formulated a model computational system that is intended to support the integrative query of structured data, narrative text, and image data sets in support of research activities. This system has also been designed to address the challenges posed by regulatory compliance, patient privacy/confidentiality concerns, and the need to facilitate multi-center research paradigms. The model we will describe is motivated by two types of research-oriented end users, specifically (1) clinical researchers who need to perform queries such as “find all patients with brain CT and MRI images and who have a prior medical history of congestive heart failure, high blood pressure, and diabetes;” and (2) imaging informatics researchers who need to obtain image sets defined by both a specific anatomic location and phenotypic parameters to evaluate techniques such as the use of computer-aided diagnosis algorithms.13 In either scenario, such a query requires the ability to locate patients with the specified phenotypic parameters, utilizing some combination of structured data and narrative text, and then to query a PACS to identify and obtain image sets for such patients that are potentially generated by one or more modalities (e.g., computed tomography, CT; magnetic resonance imaging, MRI; etc.) and that correspond to the desired anatomic location(s).

Given the preceding motivation and use cases, in the following sections of this manuscript, we will (1) provide pertinent and contributing background material concerning information needs in the clinical and translational research domains, Grid-computing electronic data interchange platforms, ontology-anchored information retrieval techniques, image retrieval tools, applicable regulatory compliance and patient privacy/confidentiality concerns that must be addressed when using clinical data for research purposes, and the specific experimental context for our model formulation, namely The Ohio State University Medical Center (OSUMC) Information Warehouse (IW); (2) describe the methods and system design approaches used for our model formulation process; (3) report upon initial feasibility evaluation results relating to the described model; and (4) describe the implications, limitations, and next steps for our work.

BACKGROUND AND SIGNIFICANCE

In this section, we will summarize several areas that contribute to or otherwise inform the model formulation work reported in later sections of the manuscript.

Information Needs in the Clinical and Translational Research Domains

The relationship between the information needs of clinical and translational researchers and currently available information technology (IT) can be understood by conceptualizing the translational research process (which subsumes clinical research) as a sequential information-flow model as illustrated in Figure 1.

At each stage in this model, a combination of dual-purpose and research-specific IT systems may be utilized. Examples of such systems that can support translational research include (1) literature search tools such as PubMed and OVID,14 (2) protocol authoring15 and data mining tools,16 (3) simulation and visualization tools,17 (4) research-specific web portals,18 (5) electronic data collection or capture tools,19 (6) participant screening tools,20 (7) electronic health records (EHRs),21 (8) computerized physician order entry (CPOE) systems,22 (9) decision support systems,23 and (10) picture archiving and communications systems (PACS).24,25

Numerous reports have described increased translational capacity, data quality, and decreased clinical trial protocol deviations resulting from the use of such IT systems.17,19 In addition, the use of IT26 in studies involving multiple, geographically distributed research sites has been shown to have demonstrable benefits in terms of increased efficiency and decreased resource requirements.22,27

However, despite the promise that the integration of research-specific and dual-purpose IT systems with the translational research enterprise holds, institutional informatics infrastructures providing such integration are generally absent, an outcome largely attributed to socio-technical factors.28 Given such problematic adoption of IT in the translational research domain, it is critical for new systems and technology frameworks to provide significant value to investigators and research staff to overcome potential barriers to acceptance.

A particular concern in the context of translational research and the utilization of information technology to support such activities is the need to maintain proper security and confidentiality of privileged or protected health information (PHI). In many cases, providing for such security, confidentiality, and ethical conduct of research requires the provision of partially or completely deidentified data sets (including imaging data) to researchers. In general, two overriding frameworks exist that dictate such needs, Institutional Review Boards (IRBs) and the Health Insurance Portability and Accountability Act (HIPAA), as summarized briefly below:

• Institutional Review Boards (IRBs) are federally mandated oversight bodies who have the responsibility to monitor, approve, and ensure the regulatory compliance of human-subjects research (of note, IRBs also exist to oversee animal research).

• HIPAA includes compulsory security standards pertaining to the protection and use of a well-defined collection of privileged data elements known as PHI.29,30 HIPAA guidelines mandate the removal of 18 specified types of identifiers from PHI before its research use. Significant challenges exist when attempting to transact in such PHI to support research operations, especially when employing technologies such as the Grid-based electronic data interchange models described earlier.31–35 These challenges include ensuring that appropriate access controls are maintained throughout a distributed architecture, certifying that consumers of such data have a valid and documented purpose for accessing PHI, and maintaining the confidentiality of such data while in transit or being stored throughout a potentially heterogeneous computing environment. An example of a HIPAA-compliant research data repository that contains image data and phenotypic data is the Reference Image Database to Evaluate Response (RIDER) archive of CT scans for lung cancer patients.7

Grid-Computing Electronic Data Interchange Platforms

We define a computational Grid per the convention used by Foster and Kesselman as “…a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities… concerned, above all, with large-scale pooling of resources, whether compute cycles, data, sensors, or people.”36 Grid computing has become the new means for creating distributed infrastructures and virtual organizations for multi-institutional research and enterprise applications. The use of Grid computing has evolved from being a platform targeted at large-scale computing applications to a new architecture paradigm for sharing information, data, and software, as well as computational and storage resources. A vast array of middleware systems and toolkits has been developed to support the implementation of Grid computing infrastructures. These include middleware and tools for service deployment and remote service invocation, security, resource monitoring and scheduling, high-speed data transfer, metadata and replica management, component-based application composition, and workflow management. Central to the ability to develop such Grid computing middleware and toolkits is the existence of widely accepted standards. The Grid services framework builds on and extends web services for scientific applications. It defines mechanisms for such additional features as stateful services, service notification, and management of service/resource lifetime. A primary example of a Grid computing initiative focusing upon the biomedical domain is the NCI’s Cancer Biomedical Informatics Grid (caBIG, https://www.cabig.nci.nih.gov/) program, which uses a Grid computing infrastructure, named caGrid, to provide a common, extensible, collaborative platform for data interchange and analysis between cancer researchers and institutions.

Fig 1. Illustration of translational research information flow model (adapted from Payne, et al.69).

Knowledge-Anchored Information Retrieval

Ontologies and terminologies (also described as vocabularies) are a type of conceptual knowledge collection comprised of definitions of both atomic units of knowledge (e.g., facts) and the network of hierarchical and/or semantic relationships between those atoms.37 Ontologies can be either formal or semi-formal, a distinction that corresponds to their level of ontological commitment, that is, the degree to which the output of computational agents that reason based upon the ontology is consistent with the ontology’s definition and structure. Controlled terminologies or vocabularies that do not satisfy the preceding definition of what constitutes an ontology but that do contain definitions of concepts and hierarchical relationships between such definitions are another example of a conceptual knowledge collection. Instances of ontologies in the biomedical domain include SNOMED-CT and the NCI Thesaurus, whereas commonly used controlled terminologies include ICD9-CM and Current Procedural Terminology (CPT). A frequently employed resource when reasoning over or upon the content of such ontologies and terminologies and their potential interrelationships is the Unified Medical Language System (UMLS), a meta-thesaurus of over 100 biomedical ontologies and vocabularies incorporating in excess of 1 million concepts and 4 million synonyms, which is maintained by the National Institutes of Health (NIH)–National Library of Medicine (NLM).38–40 The primary benefit of using the UMLS is the ability to identify a concept of interest and, subsequently, discover synonymous definitions for that concept in multiple source ontologies or terminologies. In addition, it is also possible to reason upon all possible hierarchical and semantic relationships between concepts in the UMLS based upon the relational structures subsumed from the included source ontologies and terminologies. Such conceptual knowledge collections are highly useful when performing information retrieval tasks, as they allow for object-oriented, semantic, or class-based definitions or expansions of queries. For example, if one were to pose a query of the following type (represented in pseudo-code): “SELECT ALL WHERE PROCEDURE = ‘MRI’ and ANATOMIC LOCATION = ‘lower extremity,’” it would be difficult to identify and return such patients, as it is highly unlikely that such imaging procedures would be codified and stored as having an anatomic location of “lower extremity.” Instead, it is likely that such procedures would be codified as having an anatomic location such as “foot,” “ankle,” “lower leg,” or “knee.” However, if the concept “lower extremity” is contained in an ontology or terminology and is related (hierarchically and/or semantically) to concepts including the preceding specific anatomic locations (e.g., foot, etc.), then, by reasoning upon the contents of that knowledge collection, the earlier query could be expanded to “SELECT ALL WHERE PROCEDURE = ‘MRI’ and ANATOMIC LOCATION = (‘foot’, ‘ankle’, ‘lower leg’, ‘knee’).”

In the context of our model system formulation, there is a particular focus on the use of controlled terminologies, specifically ICD9-CM (CM, clinical modification to the International Classification of Diseases, ninth revision). Within the USA, ICD9-CM is the most commonly used medical coding system for procedures and diagnoses. Such codes are manually assigned to patient records by medical information management and billing experts based on numerous information sources, including (1) clinician-provided progress and/or diagnostic reports (usually consisting of free text) and lab results, (2) quantitative findings such as laboratory data, and (3) prior medical history (in both coded and non-coded forms).41,42 A patient encounter (which could encompass either an outpatient visit or a multi-day inpatient admission) can be characterized by multiple diagnosis and procedure codes, usually summarized by what is known as a primary code that represents the motivation for that encounter (e.g., the diagnosis or procedure that incurred the encounter). ICD9-CM codes are hierarchically organized. For example, the codes between 160 and 165 correspond to malignant neoplasms of respiratory and intrathoracic organs, with associated subdiagnoses of such neoplasms being indicated using decimalized variants of the codes, such as code 162.2 that specifically defines “primary disease in the upper lung lobes.”

In addition to the use of ICD9-CM, our model system formulation also employs an advanced type of knowledge-anchored information retrieval known as text mining. At the most basic level, text mining involves the use of a computational agent, usually informed by one or more conceptual knowledge collections, to parse (i.e., decompose text into constituent components at one or more levels of granularity ranging from paragraphs to words) and tag (i.e., apply a codified concept identifier to a parsed component that is representative of its lexical and/or semantic meaning) narrative, free text. Such parsed and tagged text can then be queried as structured data. An example of the current state of the art in biomedical text mining tools is the open-source MetaMap Transfer (MMTx) application provided by the NLM, which uses the contents of the UMLS as a knowledge source to inform the parsing and tagging of biomedical narrative text (codifying such text in terms of UMLS concepts).43–46 In addition to open-source applications, numerous commercial database vendors such as IBM, Microsoft, and Oracle provide free-text indexing and mining capabilities within the scope of their database management systems.47–49 However, these capabilities are usually limited to keyword-based searches. One exception to the previous statement is the free-text search functions provided by Oracle, which do have limited thesaurus-based capabilities. Yet, in the case of the thesaurus-based text mining functionality provided by Oracle, there is a noticeable lack of available biomedical thesauri that can be utilized for such purposes. Additional examples of commercial text mining tools include (1) IBM’s UIMA,50 which employs an ontology-anchored approach to concept tagging, and (2) Vivisimo,51 which supports concept-oriented search of tagged narrative text where such tagging can be informed by the use of commonly available ontologies or terminologies.
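The ontology-driven query expansion described earlier in this section (the “lower extremity” example) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `NARROWER` mapping is a hypothetical toy hierarchy standing in for relationships that would come from a knowledge collection such as the UMLS, and the table and column names in the generated SQL are invented for the example.

```python
from collections import deque

# Hypothetical toy hierarchy: concept -> directly narrower concepts.
# In practice these links would be read from a knowledge collection.
NARROWER = {
    "lower extremity": ["foot", "ankle", "lower leg", "knee"],
}


def expand_concept(concept, narrower=NARROWER):
    """Collect every concept reachable through narrower-than links."""
    found, seen, queue = [], {concept}, deque([concept])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    # A concept with no narrower terms expands to itself.
    return found or [concept]


def build_query(procedure, location):
    """Return a parameterized SQL string and its bind values."""
    locations = expand_concept(location)
    placeholders = ", ".join("?" for _ in locations)
    sql = (
        "SELECT * FROM imaging_procedure "  # hypothetical table name
        f"WHERE procedure = ? AND anatomic_location IN ({placeholders})"
    )
    return sql, [procedure, *locations]


sql, params = build_query("MRI", "lower extremity")
```

Here `params` holds the procedure followed by the four specific anatomic locations, mirroring the expanded pseudo-code query in the text. Binding values through placeholders rather than string concatenation is a design choice of this sketch, not something the source prescribes.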

Image Retrieval Tools

In the case of medical images, the most commonly used storage repositories are Picture Archiving and Communication Systems (PACS). Most PACS support the Digital Imaging and Communications in Medicine (DICOM) standard52 (currently version 3). In PACS that are compliant with the DICOM standard, medical images are stored and retrieved using patient metadata (e.g., descriptors such as medical record number, MRN). The primary focus of image retrieval functionality within modern PACS is to support conventional clinical operations and not necessarily research requirements. An example use case in terms of clinical operations would be a radiologist querying the system for a patient’s latest MRI or CT scan using the patient’s MRN, and then reviewing the imagery for the related visit. Recently, there have been efforts to improve the image retrieval process to support image-related research efforts,34,35,53–55 as well as to enable better integration of imaging data with Electronic Health Records (EHRs).56–58 However, the preceding efforts are still relatively immature, and wide-scale adoption of such tools and processes is limited.

Research uses of most clinical data, including imaging data, present additional challenges beyond those associated with clinical uses, in particular because of the frequent need to deidentify such data sets. To address the deidentification of imaging data sets for research purposes, the commonly employed practice is to generate a deidentified duplicate of the desired DICOM image as found in the source, production PACS, and then store the duplicate in a research-specific PACS instance.7,55,59 While this approach is feasible for the retrieval and use of well-defined cases for a given experimental context, it does not readily support the integration with and subsequent retrieval and deidentification of related imaging and phenotypic data.
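The "deidentified duplicate" practice described above can be illustrated with a minimal sketch that operates on a plain dictionary standing in for a DICOM header. Several assumptions apply: the attribute list is a small illustrative subset and not a complete deidentification profile, the keyed-hash pseudonym scheme is a choice made for this example, and identifying information burned into pixel data is not addressed at all.

```python
import copy
import hashlib
import hmac

# Illustrative subset of identifying DICOM attribute keywords (not exhaustive).
IDENTIFYING_ATTRIBUTES = {
    "PatientName",
    "PatientBirthDate",
    "PatientAddress",
    "ReferringPhysicianName",
    "AccessionNumber",
}


def pseudonym(mrn, secret):
    """Derive a stable research identifier from an MRN with a keyed hash."""
    digest = hmac.new(secret.encode("utf-8"), mrn.encode("utf-8"), hashlib.sha256)
    return "RES-" + digest.hexdigest()[:12]


def deidentify(header, secret):
    """Return a deidentified copy; the source header is left untouched."""
    duplicate = copy.deepcopy(header)
    for name in IDENTIFYING_ATTRIBUTES:
        duplicate.pop(name, None)
    if "PatientID" in duplicate:
        duplicate["PatientID"] = pseudonym(duplicate["PatientID"], secret)
    return duplicate


# Hypothetical header values, invented for the example.
source = {
    "PatientName": "DOE^JANE",
    "PatientID": "123456",
    "PatientBirthDate": "19500101",
    "Modality": "CT",
    "BodyPartExamined": "CHEST",
}
research_copy = deidentify(source, secret="site-specific-key")
```

Because the pseudonym is deterministic for a given key, images deidentified at different times for the same patient remain linkable to each other within the research copy, which is one way to keep imaging and phenotypic data correlated after deidentification.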

OSUMC Information Warehouse

The Information Warehouse (IW) at the Ohio State University Medical Center (OSUMC) is a comprehensive repository integrating data from over 80 clinical, operational, and research systems throughout the institution. The IW serves a broad variety of customers in all mission areas at OSUMC, including (1) clinical operations, (2) administration, (3) education, and (4) research. Content in the IW includes structured medical and financial data, clinical free-text reports, tissue and genomic data, and limited numbers of medical images. Data stored in the IW can be queried from and presented to end users in identifiable, partially deidentified, or completely deidentified forms, depending on applicable IRB and institutional policies and requirements. As described earlier, the IW does include medical free text from sources such as radiology reports and pathology reports, as well as structured and codified data elements such as age, sex, and diagnosis. However, imaging (e.g., PET, CT, MRI) data are stored in a separate PACS, with only limited physically duplicated image data within the IW. This physical separation of most image data makes the integrative query of multiple data types throughout OSUMC, including image data, extremely challenging. It should be noted that this problem is not unique to this IW but is quite common in other IWs.

METHODS

As stated at the outset of this manuscript, we are reporting upon the development and implementation of a model system intended to enable the integrative, knowledge-anchored query of multiple information types, including structured, narrative text, and image data, in support of research requirements. Our model system is based upon a framework60–62 that is intended to provide a convenient interactive environment for such integrative data retrieval tasks (Fig. 2). The resulting system that was implemented based upon this framework allows end users to retrieve images based on characteristics defined in correlative text and structured data elements stored within the OSUMC IW. Such retrieval and presentation of image datasets using phenotypic context derived from narrative text and structured data involves the handling of all data types related to the image, including the image itself, as well as associated heterogeneous, multi-dimensional textual and structured data. To enable such a data handling process, our framework leverages knowledge-anchored information retrieval techniques, specifically combining the UMLS Meta-Thesaurus (UMLS MT)63,64 knowledge collection with the text mining capabilities native to the Oracle database management system employed by the IW. The use of the UMLS MT allows the expansion of simple keyword-based queries into concept-oriented queries that can then be applied to the contents of clinical narrative text and structured data elements such as diagnosis codes. Because of the research orientation of our framework, we have also incorporated mechanisms to partially or completely deidentify the PHI contained in any returned dataset. This provides compliance with the applicable regulatory requirements, including HIPAA. Finally, to allow for the elegant evolution of our framework in light of constantly emerging technology platforms and standards, as well as the need to enable research activities that span institutional and geographic boundaries, our framework incorporates a multi-tiered service-oriented architecture (SOA). A primary benefit of this multi-tiered SOA is the ability to utilize emergent, research-oriented electronic data interchange platforms, such as the previously introduced caGrid middleware.65

Fig 2. Conceptual model for our software framework and model system implementation.

Motivating Use Case

For the remainder of this manuscript, we will frame our discussion using the following motivating use case: An investigator is interested in lung cancer patients. In addition to images, the investigator would like to assess findings in both radiology and pathology reports, as well as diagnosis codes, for each patient from whom images are obtained. Specifically, the radiology reports of interest should mention lung nodules, whereas the pathology reports should mention adenocarcinoma.

The interaction between our system (Fig. 3) and the researcher (i.e., end user) would incorporate the following components:

Query Construction

The construction of a query intended to identify and retrieve patients that meet the given criteria could be decomposed into two parts: (1) a structured data search query and (2) a free-text search query:

• Structured data search query. In this case, the user is looking for ICD9-CM codes that correspond to different types of lung cancer. At this stage, if the user already knows the codes for lung cancer (e.g., 162, 162.2, etc.), he or she can provide those codes. For those users not familiar with the ICD9-CM coding system, a code lookup facility is provided, informed by the UMLS knowledge collection. This lookup system supports keyword-based searches. Once ICD9-CM codes are entered or selected, construction of the structured query is complete.

• Free-text search query. The free-text query in this instance requires the assessment of two different report types: radiology and pathology, using the initial keywords “lung nodules” and “carcinoma,” respectively. The system’s end user is provided with the ability to expand these keywords using the contents of the UMLS knowledge collection. Once the end user has identified the report types, entered the search keywords, and expanded the search scope per the contents of the UMLS, the construction of the free-text search query is complete.

During the construction of both query types (free text and structured), the presentation tier solicits end-user data entry as appropriate and passes that data to a Grid service in an application tier. This Grid service then interacts with a local UMLS instance that is available via a data sources tier. Asynchronous calls that take place in the presentation tier (e.g., the user interface) allow such transactions to occur in parallel, so that the user can construct both queries simultaneously.
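The two-part query construction described above can be sketched as follows. Everything named here is an assumption made for illustration: `CODE_TABLE` is a tiny stand-in for a code lookup facility, `RELATED_TERMS` is a hypothetical substitute for UMLS-driven keyword expansion, and the two dataclasses are invented containers rather than the system's actual data structures. Asynchronous user-interface calls are not modeled.

```python
from dataclasses import dataclass

# Tiny stand-in for a code lookup table (code -> description).
CODE_TABLE = {
    "162": "Malignant neoplasm of trachea, bronchus, and lung",
    "162.9": "Malignant neoplasm of bronchus and lung, unspecified",
    "428.0": "Congestive heart failure, unspecified",
}

# Hypothetical related-term table standing in for knowledge-based expansion.
RELATED_TERMS = {
    "lung nodules": ["lung nodule", "pulmonary nodule"],
}


def lookup_codes(keyword):
    """Keyword search over code descriptions (case-insensitive)."""
    needle = keyword.lower()
    return [code for code, text in CODE_TABLE.items() if needle in text.lower()]


def with_subcodes(code):
    """Return the code plus its decimalized variants present in the table."""
    return [c for c in CODE_TABLE if c == code or c.startswith(code + ".")]


def expand_terms(keyword):
    """Return the keyword followed by any related terms known for it."""
    return [keyword, *RELATED_TERMS.get(keyword.lower(), [])]


@dataclass
class StructuredQuery:
    icd9_codes: list


@dataclass
class FreeTextQuery:
    report_type: str
    terms: list


structured = StructuredQuery(icd9_codes=with_subcodes("162"))
radiology = FreeTextQuery("radiology", expand_terms("lung nodules"))
pathology = FreeTextQuery("pathology", expand_terms("carcinoma"))
```

A user who does not know the codes could call `lookup_codes("lung")` first and select from the result, while a user who already knows code 162 can pass it directly, matching the two entry paths described for the structured query.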

Query Execution

Once the user is satisfied with the queries as constructed previously, he or she can execute the query and identify matching patients or cases. At this point in the system workflow, the presentation tier passes the query to the application tier that, in turn, interacts with the data sources tier. All queries are executed in parallel; queries on diagnosis codes for lung cancer, free-text queries for lung nodules (radiology reports), and queries for adenocarcinoma (pathology reports) all return results simultaneously. Once all available results have been returned, the presentation tier joins them and presents them to the user.
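The parallel execution and join step can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration, not the system's Grid service implementation: the three query functions are stubs returning invented patient identifiers, and the join is assumed to be a set intersection (patients satisfying all criteria), which the text does not state explicitly.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(queries):
    """Run each zero-argument query callable on its own worker thread."""
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        futures = {name: pool.submit(fn) for name, fn in queries.items()}
        # result() blocks until the corresponding query has finished.
        return {name: future.result() for name, future in futures.items()}


def join_patients(results):
    """Assumed join: keep only patients returned by every query."""
    id_sets = [set(ids) for ids in results.values()]
    return set.intersection(*id_sets) if id_sets else set()


# Stub queries with invented patient identifiers.
def diagnosis_query():
    return {"p1", "p2", "p3"}


def radiology_query():
    return {"p1", "p3", "p4"}


def pathology_query():
    return {"p3", "p5"}


results = run_in_parallel(
    {
        "diagnosis": diagnosis_query,
        "radiology": radiology_query,
        "pathology": pathology_query,
    }
)
matching = join_patients(results)
```

With these stubs, only patient `p3` appears in all three result sets. A different join policy (for example, a union with per-source annotations) would change only `join_patients`.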

Data Browsing and Retrieval

Upon completion of the preceding phase of the system’s workflow, end users may browse the returned patients or cases. Once a patient or case is selected, the presentation tier calls the Grid services in the application tier to retrieve the patient- or case-specific data. Instead of presenting abstract results for such a query, the radiology report containing the concept of “lung nodule,” the pathology report containing the concept of “adenocarcinoma,” any structured codified data pertaining to the “diagnosed with lung cancer” characteristic, and finally, the corresponding images are presented collectively for further evaluation and review (Fig. 4).

Fig 3. By first querying for available meta-data in one or more relational databases, users are able to identify patients of interest; later, corresponding images can be retrieved from a PACS.

Three-Tiered Software Framework

There are several complementary goals within this framework. The realization of all these goals requires the use of components that satisfy both infrastructure constraints (e.g., available platforms and prevailing data interchange standards) and institutional or regulatory requirements as they pertain to the use of PHI for research purposes. The tiered architecture is one of the common techniques used today for the separation of presentation, application, and data [66,67] for web-based applications. We utilize this approach within our framework to achieve flexibility and efficiency in terms of development, deployment, and management. The framework used in our model system (Fig. 5) consists specifically of the following three tiers:

• Presentation tier provides end-user interaction capabilities based upon the features of the framework via web interfaces.

• Application tier utilizes standards-compliant Grid services to interact with and apply logic to multiple, heterogeneous data sources.

• Data sources tier supports access to data sources such as relational database management systems and PACS.

In the following subsections, detailed descriptions of the design approaches and functionality that pertain to each tier are provided.

Fig 4. The interface provides interactive assistance to users in order to map keywords used during query formulation to appropriate diagnosis codes (ICD9-CM). Once a query is executed, users may browse the result sets using a hierarchical “drill down” model.

Fig 5. The multi-tier implementation of our framework provides a means to implement a Grid-based or web-based electronic data interchange platform within a service-oriented architecture. End-user access to privileged data is managed in one or more ways: (1) a fully privileged user within the institutional firewall (left-hand side) can access the data in any way they prefer; (2) an external user (located outside the firewall, right-hand side) is subject to additional restrictions on data access; and (3) the multi-tier service-oriented approach allows for the easy deployment of custom services.

Presentation Tier. As stated previously, the primary goal of this framework and the resulting system is to provide simple yet flexible ways of identifying and retrieving images from a PACS system that are characterized by one or more heterogeneous sources of phenotypic data. Simplicity is delivered through a unified user interface that deals with all available data types, whereas flexibility is demonstrated by the expandability of queries. Our user interface gives the end user the flexibility to execute such queries by first allowing searches on existing sources of phenotypic data such as free-text reports and diagnosis codes (or any other patient-related data). Once the patients are identified according to all available meta-data, the PACS system is then queried, and associated images are retrieved (Fig. 4). Because the number of retrieved images depends on the number of identified patients, proper utilization of existing free text (radiology and pathology reports) and structured data (diagnosis codes) is crucial in the search criteria. This framework brings together knowledge-anchored free-text capabilities provided by a combination of the UMLS knowledge collection and the text indexing and keyword search capabilities of Oracle. The utilization of the UMLS and Oracle’s text indexing allows our framework to provide interactive query expansion capabilities on free-text documents, along with assistance on diagnosis code selection through an interactive web-based user interface. Selected images are retrieved using PixelMed, an open source Java-based toolkit [68]. The web interface component of the presentation tier enables users to construct and execute queries based upon both free-text (e.g., radiology and pathology) and structured data elements (e.g., diagnosis codes) in an interactive fashion. Once the patients are identified based upon specified query parameters, their images along with radiology reports, pathology reports, and diagnosis codes (ICD9-CM) are presented to the users. During the preceding query construction and result set presentation process, end users interact with and are assisted by the presentation layer in the following ways.

UMLS-Based Text Query Expansion. Within the IW, free-text reports are stored in relational databases (Oracle, version 10gR2), where they are indexed for fast text searches. However, those indexes only allow keyword-based searches. As introduced earlier, we have adopted a combination of the UMLS Meta-Thesaurus (MT) and MetaMap (MMTx) to convert users’ keyword queries into conceptual queries. UMLS MT allows us to present alternatives to or expansions of given query criteria. For example, the user might enter the search term “melanocarcinoma” when searching the contents of pathology reports but would receive few results because of the infrequent use of that specific term in such text (in the test dataset to be described later in the evaluation section of this manuscript, such a search returns no text reports). However, if this keyword were expanded to include the synonym “melanoma” per the contents of the UMLS MT, the user would likely retrieve many more reports (again, in our test dataset, such a search returns 2,706 text reports with the specified term). The MMTx API allows us to parse clinically relevant terms and phrases that can then be expanded and used for query operations when the initial user-entered search criteria consist of larger text constructs (e.g., a sentence or paragraph). For example, the concept of “chronic obstructive lung disease” can be extracted from the sentence “Patient has chronic obstructive lung disease” and can subsequently be mapped to a corresponding UMLS concept.
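The keyword expansion step can be illustrated with a short sketch. The `SYNONYMS` table below is a hypothetical, hard-coded excerpt standing in for a lookup against a local UMLS MT instance, and the OR-joined output is a generic Boolean string rather than the syntax of any particular text index.

```python
# Hypothetical excerpt of a synonym table. A real system would look these up
# in a local UMLS Metathesaurus instance rather than in a hard-coded dict.
SYNONYMS = {
    "melanocarcinoma": ["melanoma"],
    "carcinoma": ["neoplasm"],
}

def expand_keyword(keyword, accepted=None):
    """Return the keyword plus the suggested synonyms the user accepted.

    When `accepted` is None, every suggested synonym is kept.
    """
    candidates = SYNONYMS.get(keyword.lower(), [])
    if accepted is None:
        chosen = candidates
    else:
        chosen = [c for c in candidates if c in accepted]
    return [keyword] + chosen

def to_or_query(terms):
    """Combine terms into a simple OR expression for a keyword search."""
    return " OR ".join(f'"{t}"' for t in terms)

expanded = expand_keyword("melanocarcinoma")
query = to_or_query(expanded)
```

The `accepted` parameter reflects the point that the end user, not the system, makes the final choice among suggested expansions.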

UMLS-Based Diagnosis Code Selection. When users are unsure which ICD9-CM code to choose when querying structured data elements that are encoded using that terminology, context-specific assistance that enables those users to select appropriate code(s) is provided. Specifically, during our UMLS MT installation, controlled vocabularies and dictionaries related to ICD9-CM codes were included. This allows us to provide hierarchical searches on ICD9-CM codes initiated by user-supplied text. For example, during the construction of a query, a user may not know the proper codes for a targeted type of lung cancer. In this case, the user may use the assistance mechanism to navigate the ICD9-CM hierarchy, traversing through the concepts “neoplasm” and “lung cancer” to visualize and select the appropriate specific lung cancer codes.
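A keyword-initiated, hierarchical code lookup of this kind might look like the following sketch. The codes and descriptions are the category 162 entries quoted in the Results section; the dictionary layout and function names are assumptions made for illustration and do not reflect the UMLS table schema.

```python
# Small excerpt of the ICD9-CM hierarchy using codes quoted in this paper.
# Stored as code -> (description, parent code); the layout is an assumption.
ICD9 = {
    "162": ("malignant neoplasm of trachea, bronchus, and lung", None),
    "162.0": ("trachea", "162"),
    "162.2": ("main bronchus", "162"),
    "162.3": ("upper lobe", "162"),
    "162.4": ("middle lobe", "162"),
    "162.5": ("lower lobe", "162"),
    "162.8": ("other parts of bronchus or lung", "162"),
    "162.9": ("bronchus and lung, unspecified", "162"),
}

def search_codes(keyword):
    """Return codes whose description contains the keyword."""
    kw = keyword.lower()
    return sorted(code for code, (desc, _) in ICD9.items() if kw in desc)

def children(code):
    """Return the codes directly below `code` in the hierarchy."""
    return sorted(c for c, (_, parent) in ICD9.items() if parent == code)

hits = search_codes("lung")
specific = children("162")
```

A user who types "lung" would first see the category code 162 among the hits and could then drill down to its specific codes before selecting any for the query.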

Image Handling and Display. After constructing and executing queries, users may browse the result set using a hierarchical “drill-down” model. For example, the interface allows end users to drill down through multiple layers of granularity in a CT study, beginning with a series and ending at a single image. In addition, the presentation layer allows the user to compare images within a series. Corresponding radiology or pathology reports are also displayed when retrieving and presenting image data (Fig. 4).

The system’s presentation layer is implemented using Java Server Pages (JSP) running on Apache Tomcat. In addition, asynchronous calls to the application tier are used to provide a more dynamic human–computer interaction experience.

Application Tier

To support relational database connectivity sufficient to access both textual and structured data elements and to support the use of the UMLS knowledge collection as described earlier, an Ontology Tools Package (OTP) was created within the application tier. In addition, to enable the query, retrieval, and manipulation of images from our PACS system, we utilized the functionality provided by the PixelMed open source Java toolkit. These two packages form the core of our application tier and are designed and implemented to manage interaction between the presentation and data source tiers. Furthermore, these packages are wrapped as caGrid services as described in the following sections.

Ontology Tools Package. The OTP is a collection of functions from several existing APIs, including both the MMTx API from the National Library of Medicine and Oracle’s JDBC drivers. Additionally, OTP allows for queries on diagnostic data tables to be executed and linked with UMLS MT, and for expanded conceptual search queries on radiology and pathology reports. For performance considerations, local copies of controlled vocabularies and dictionaries from the UMLS MT are maintained and utilized by the OTP. When a user interacts with the relational databases targeted by our system, OTP’s functions (Fig. 6) are utilized during the query construction, execution, and retrieval of the resulting datasets, as described below:

• Query construction with OTP. During the construction of text queries, OTP takes free text entered by end users, expands it, and returns synonyms and other semantically relevant concepts through a process of reasoning upon the previously described local UMLS tables. However, the final selection of such search terms to define a conceptual text query is based upon end-user evaluation of the suggested expansions. Similarly, for the construction of diagnosis code-based queries, user-supplied free-text entries are expanded with the help of UMLS. Instead of keywords, ICD9-CM codes and their descriptions are returned to the user. Again, end users decide which of the returned diagnosis codes are to be included in their query.

• Query execution with OTP. During the query execution process, OTP executes the generated query against text and structured data tables, links the datasets, and returns identifiers for further retrieval of other correlated data such as radiology images contained in a PACS. This data linkage process is largely made possible through the use of some combination of the three common identifiers used to store and identify PHI at OSUMC (and which are generally analogous to those used in most common clinical information systems): MRN (used to identify an individual patient), encounter number (ENC, used to identify a patient-specific encounter from which an item of data is derived), and accession number (ASC, used to identify the specific component or activity associated with an encounter from which an item of data is derived). When a user-constructed query is executed, OTP returns only identifiers or deidentified pointers to the data, depending on the access privileges assigned to a user for a specific project. It is important to note that, at this stage, no data other than such identifiers is returned by OTP (e.g., no textual, structured, or image data is returned).

• Data retrieval with OTP. Once the end user selects a patient or group of patients from which they wish to obtain further data, OTP manages the retrieval of all non-imaging data (with image data retrieval being managed using PixelMed). All the data tables that need to be accessed with this framework are indexed using the three previously defined identifiers MRN, ENC, and ASC. Therefore, on-demand access to any of the text or structured data elements does not present significant performance implications.
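The identifier-only linkage described in the query execution step can be sketched with an in-memory SQLite database. The table names, columns, and rows below are hypothetical and are not the OSUMC IW schema; a plain `LIKE` match stands in for the indexed text search, and the accession column is named `asc_no` because `ASC` is a reserved SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and made-up rows, keyed by MRN, ENC, and ASC.
conn.executescript("""
CREATE TABLE diagnoses (mrn TEXT, enc TEXT, icd9 TEXT);
CREATE TABLE reports (mrn TEXT, enc TEXT, asc_no TEXT,
                      report_type TEXT, body TEXT);
INSERT INTO diagnoses VALUES ('M1', 'E1', '162.3'), ('M2', 'E2', '286.4');
INSERT INTO reports VALUES
  ('M1', 'E1', 'A100', 'radiology', 'Small lung nodule in the upper lobe.'),
  ('M2', 'E2', 'A200', 'radiology', 'No acute findings.');
""")

def find_identifiers(conn, icd9_codes, keyword):
    """Link diagnosis and report tables on MRN; return identifiers only."""
    placeholders = ",".join("?" for _ in icd9_codes)
    sql = f"""
        SELECT DISTINCT r.mrn, r.enc, r.asc_no
        FROM reports r JOIN diagnoses d ON d.mrn = r.mrn
        WHERE d.icd9 IN ({placeholders}) AND r.body LIKE ?
        ORDER BY r.mrn, r.enc, r.asc_no
    """
    return conn.execute(sql, [*icd9_codes, f"%{keyword}%"]).fetchall()

ids = find_identifiers(conn, ["162.3"], "lung nodule")
```

No report text or diagnosis detail leaves the function; the returned accession numbers are what a later step would use to request images.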

Fig 6. The Ontology Tools Package (OTP) allows users to interact with a local UMLS knowledge collection instance, diagnostic databases, and text report databases. Users can query OTP with a keyword or textual phrase, which is then mapped to one or more concept codes derived from text mining approaches informed by the UMLS and using MMTx. Once users finalize a set of targeted search concepts, text reports that contain those concepts are queried and returned. Users can also retrieve ICD9-CM codes that correspond to their query keywords and use those codes to query tables containing structured data.

Use of PixelMed. PixelMed provides open source Java libraries for reading, writing, manipulating, and communicating DICOM objects. In addition, PixelMed provides a simple PACS and WADO (web access to DICOM objects) server implementation. In our framework and model system, PixelMed is used for handling all operations related to DICOM objects as follows:

• DICOM query construction and execution. Once a user identifies which patient cases he or she wants to review, it is then necessary to query for and retrieve corresponding images from the targeted PACS. Our PACS system (AGFA, Impax 5.2) can be queried by ASC. Here, the ASCs retrieved from the OTP are formed into DICOM queries, and those queries are subsequently executed through the use of standard DICOM messaging. Each ASC is mapped to a DICOM study object within the PACS, supporting the retrieval of (1) the entire imaging study, (2) any series contained in the study, or (3) any single image from any series.

• DICOM object manipulations. When images are retrieved for display via the previously described web interface, they must be converted to a web-compatible format such as JPEG or Bitmap (BMP). In addition, images may need to be deidentified based upon a user’s and/or project’s privileges. PixelMed provides functionality such as the conversion and deidentification of DICOM objects (based on DICOM 3.0 standards).
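The mapping from an accession number to a study-level DICOM query can be sketched as below. This does not use the PixelMed API: the query identifier is represented as a plain dictionary keyed by standard DICOM attribute tags, and the function name and accession values are hypothetical. Sending the identifier to a PACS would be the job of a DICOM library.

```python
# Standard DICOM attribute tags, written as (group, element).
QUERY_RETRIEVE_LEVEL = (0x0008, 0x0052)
ACCESSION_NUMBER = (0x0008, 0x0050)
STUDY_INSTANCE_UID = (0x0020, 0x000D)

def build_study_query(accession_number):
    """Build a STUDY-level query identifier keyed by accession number.

    An empty value marks an attribute the PACS is asked to return.
    """
    return {
        QUERY_RETRIEVE_LEVEL: "STUDY",
        ACCESSION_NUMBER: accession_number,
        STUDY_INSTANCE_UID: "",
    }

# One query per accession number returned by the earlier search step.
queries = [build_study_query(a) for a in ["A100", "A200"]]
```

The Study Instance UID returned for each accession number is what would allow follow-up requests at the series or single-image level.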

Grid Enablement. Grid enabling our model system allows us to support research collaborations involving institutionally and geographically disparate participants, a scenario that has motivated many recent infrastructural research and development programs, such as caBIG. In this project, we have specifically Grid enabled our system using the caBIG-developed caGrid middleware. To implement this type of functionality, a wrapper application was created that supports Grid-compliant electronic data interchange with both OTP and PixelMed (thus making them available as caGrid analytical services). As Grid services operate in a manner similar to web services or applications, implementing our Grid-based system within the institutional firewall did not require any specialized network configurations other than the placement of the presentation and application tiers within a demilitarized zone (DMZ) of our network, which, for security purposes, incorporates an access control list to restrict connectivity to known hosts.

Data Sources Tier

Several key requirements influenced the design of the data sources tier, specifically, the needs to (1) ensure HIPAA-compliant deidentification of the data when necessary, (2) generate proxy data sources for efficient access, and (3) manage access privileges at multiple levels of granularity from projects to individuals. These requirements are described in detail in the following sections.

Deidentification. The need to deidentify data can be broadly separated into requirements that pertain to structured data, textual data, and image data:

• Structured data. As structured data resides in relational databases, its deidentification is straightforward compared to other data types. This type of deidentification was achieved by removing patient-unique identifiers and replacing them with surrogate identifiers (which we refer to as deidentifiers).

• Text data. Within the OSUMC IW, all text reports are pre-processed to generate distinct, deidentified versions of the original source text, which were used in this project.

• Image data. PixelMed allows DICOM objects to be separated into metadata and image components and provides total programmatic control over the metadata section through its Java-based API. Hence, by rewriting and manipulating the metadata of the DICOM objects, images are deidentified or anonymized according to the HIPAA guidelines.
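One way to keep surrogate identifiers consistent across structured rows and image metadata is sketched below. The keyed hash (HMAC-SHA256) is a technique chosen here for illustration; the paper does not state how its deidentifiers are generated. The key, field names, and sample values are hypothetical, and this sketch alone does not amount to HIPAA-compliant deidentification.

```python
import hashlib
import hmac

# Hypothetical project key; in practice it would be stored securely.
SECRET_KEY = b"project-specific-secret"

def surrogate(identifier, length=12):
    """Derive a stable surrogate ("deidentifier") from a real identifier."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return "DEID-" + digest[:length]

def deidentify_row(row):
    """Replace the MRN in a structured-data row with its surrogate."""
    out = dict(row)
    out["mrn"] = surrogate(row["mrn"])
    return out

def deidentify_dicom_metadata(meta):
    """Rewrite identifying metadata fields; pixel data is not touched."""
    out = dict(meta)
    out["PatientID"] = surrogate(meta["PatientID"])
    out["PatientName"] = ""  # blank out directly identifying text
    return out

row = deidentify_row({"mrn": "12345", "icd9": "162.3"})
img = deidentify_dicom_metadata(
    {"PatientID": "12345", "PatientName": "DOE^JANE", "Modality": "CT"}
)
```

Because the same input always yields the same surrogate, the deidentified row and the deidentified image metadata can still be linked to each other without exposing the original MRN.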

Proxy Data Sources. To prevent computational performance issues pertaining to the use of operational data repositories, we implemented proxy relational databases and a proxy PACS as part of our data sources tier. Research images and data are moved to these proxy resources in a time-sensitive, batch operation. To support regulatory compliance, research datasets are also deidentified during this batch operation. The two specific components of our proxy data source are:

• Proxy relational database. Within our framework, this is a physically and logically distinct Oracle instance. Oracle enables tables and views based on remote data sources; however, in our case, this proxy database only holds deidentified summary tables and materialized views that are based on our operational databases. User accounts created for accessing this proxy database have view-only privileges for that data.

• Proxy PACS. The proxy PACS is an instance of PixelMed’s PACS server. Images that have been approved for research use are moved to this PACS in deidentified form. The use of this proxy PACS allows more frequent and larger-scale queries to be executed than would be possible if searching the operational PACS used for clinical care purposes, where such queries raise performance degradation concerns.

RESULTS

To evaluate the feasibility and performance of our model system implementation, we performed an evaluation of the system using the previously introduced motivating use case applied to 30 months of patient data stored in the OSUMC IW, which comprised 1,373,194 radiology reports and 304,212 pathology reports that corresponded to 4,753,985 patient encounters. Our example searches focused on two main ICD9 code categories: lung cancer (162, malignant neoplasm of trachea, bronchus, and lung) and coagulation defects (286, coagulation defects). Whereas our example searches involve both radiology and pathology reports for the first category (lung cancer), searches for the second category (coagulation defects) were applied to radiology reports only.

Lung Cancer Searches

Under this category, seven different ICD9-CM codes were used for query construction: trachea (162.0), main bronchus (162.2), upper lobe (162.3), middle lobe (162.4), lower lobe (162.5), other parts of bronchus or lung (162.8), and bronchus and lung, unspecified (162.9). For our search on radiology reports, the concept “lung nodule” was expanded to include “thoracic” and “chest” for “lung,” and “lesion” for “nodule.” For our search on pathology reports, the keyword “carcinoma” was expanded by the keyword “neoplasm.” Table 1 depicts the effects of the thesaurus-based conceptual expansions for each lung cancer ICD9 code.
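The four radiology search variants compared in Table 1 can be written out as Boolean expressions. In this sketch, the expansion terms are those listed above, while the Boolean structure (synonyms joined with OR, the two terms joined with AND) is an assumption made for illustration, since the exact query form is not given here.

```python
# Expansion terms as listed in the text for the concept "lung nodule".
EXPANSIONS = {"lung": ["thoracic", "chest"], "nodule": ["lesion"]}

def group(term, expand):
    """Return one parenthesized OR-group for a term, expanded or not."""
    terms = [term] + (EXPANSIONS.get(term, []) if expand else [])
    return "(" + " OR ".join(terms) + ")"

def build_query(terms, expanded):
    """AND together one OR-group per term; `expanded` names terms to expand."""
    return " AND ".join(group(t, t in expanded) for t in terms)

concept = ["lung", "nodule"]
variants = {
    "Exp": build_query(concept, {"lung", "nodule"}),
    "(-) Lung": build_query(concept, {"nodule"}),
    "(-) Nodule": build_query(concept, {"lung"}),
    "No exp": build_query(concept, set()),
}
```

Each key corresponds to one column of Table 1, so the sketch makes explicit which terms are added or withheld in each comparison.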

Table 1. Combining ICD9 Codes for “Malignant Neoplasm of Trachea, Bronchus, and Lung” with Radiology Reports that Contain Concept Corresponding to “Lung Nodules” along with Pathology Reports that Contain Concept Corresponding to “Carcinoma”

In some cases, the use of knowledge-anchored query expansion captures more relevant data than would be possible in queries without such expansion. Column descriptions: Exp, both terms “lung” and “nodule” are expanded during the search on radiology reports; (-) Lung, the term “lung” is not expanded during the search on radiology reports; (-) Nodule, the term “nodule” is not expanded during the search on radiology reports; No exp, no UMLS MT expansions were applied during the search on radiology reports. Row descriptions: Exp, with the aid of UMLS MT, both keywords “carcinoma” and “neoplasm” are used during the search on pathology reports; No exp, no UMLS MT expansions were applied during the search on pathology reports.

Coagulation Defects Searches

Under this category, nine different ICD9-CM codes were used for query construction: “congenital factor VIII disorder” (286.0), “congenital factor IX disorder” (286.1), “congenital factor XI deficiency” (286.2), “congenital deficiency of other clotting factors” (286.3), “von Willebrand’s disease” (286.4), “hemorrhagic disorder due to intrinsic circulating anticoagulants” (286.5), “defibrination syndrome” (286.6), “acquired coagulation factor deficiency” (286.7), and “other and unspecified coagulation defects” (286.9). For our search on radiology reports, the keyword “pulmonary embolism” was expanded with the terms “lung” and “chest,” as well as the synonym “PE” for pulmonary embolism. Table 2 depicts the effects of the thesaurus-based conceptual expansions for each coagulation defects ICD9 code.

In our “lung cancer” cases, query expansion returned 43% more reports on average than queries without such expansion when applied to radiology reports (Table 1). However, in the context of pathology reports, this increase in returned reports was only 4%. In “coagulation defects” cases, similar query expansion yielded 8% more reports than a query without such expansion (Table 2). Considering our earlier example on the keyword expansion of “melanocarcinoma” with “melanoma,” where we demonstrated a much greater expansion (0 to 2,706), these results demonstrate the impact of the initial keywords used to pose a query on overall performance.

The execution times associated with our model system were evaluated in two stages: (1) the first stage focused on the average time required to identify and retrieve images, which was found to range from 2 to 10 s; (2) the second stage focused on the average time required to retrieve structured and free-text data corresponding to a given image or image series, which was found to range from 4 to 10 s. These measurements were derived from 30 repeated measurements of the elapsed time required to retrieve and view the first eight images of a CT or MRI series for a given patient, and then to retrieve, on average, three related text reports and structured data elements for that study via the previously introduced web interface.
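A repeated elapsed-time measurement of this kind can be sketched as follows. The `fake_retrieval` function is a hypothetical placeholder for the retrieval step being timed, and the sketch uses five repeats only to keep the example fast; it does not reproduce the reported measurements.

```python
import statistics
import time

def time_repeated(operation, repeats=30):
    """Time `operation` repeatedly; return (min, mean, max) in seconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        elapsed.append(time.perf_counter() - start)
    return min(elapsed), statistics.mean(elapsed), max(elapsed)

def fake_retrieval():
    """Hypothetical placeholder for retrieving and displaying images."""
    time.sleep(0.001)

fastest, average, slowest = time_repeated(fake_retrieval, repeats=5)
```

Reporting the minimum and maximum alongside the mean matches the way the execution times above are given as ranges.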

DISCUSSION

The retrieval of image data in support of research requirements is usually more meaningful when patient-derived phenotypic context data accompanies such images. Such phenotypic data can be derived from multiple sources, including free-text data such as radiology or pathology reports, and structured data such as diagnosis codes. As we have demonstrated, by applying integrative knowledge-anchored strategies, conceptual searches spanning all of the preceding data types are possible and, in some cases, can generate larger amounts of data meeting the criteria used to define a motivating use case and its associated data query requirements than is otherwise possible. We have also demonstrated that, in those instances where the deidentification of imaging and corresponding phenotypic data is needed to satisfy regulatory and patient confidentiality requirements, such deidentification can be performed in such a manner that overall context and the ability to recreate the linkage between data elements are maintained. Whereas such an approach by necessity introduces additional technical challenges to the proper deidentification of data, the programmatic utilization of the open-source PixelMed PACS API within our framework allows us to replace identifiers within image metadata with deidentifiers that are consistent with other deidentified phenotypic data, thus demonstrating one possible solution to such challenges.

Table 2. Combining ICD9 Codes for “Coagulation Defects” with Radiology Reports that Contain Concept Corresponding to “Pulmonary Embolism”

Column descriptions: Exp, both terms “lung” and “chest” are expanded during the search on radiology reports; (-) Lung, the term “lung” is not expanded during the search on radiology reports; (-) Chest, the term “chest” is not expanded during the search on radiology reports; No exp, no UMLS MT expansions were applied during the search on radiology reports. Row descriptions: Exp, with the aid of UMLS MT, the additional synonym “PE” for the concept of “pulmonary embolism” was used for expansion; No exp, no synonyms were added during the search on radiology reports.

Interoperability is a major requirement for sharing data and collaborative work in a research environment, especially when that environment spans institutional and geographically distributed investigators and research participants. By adopting the caGrid middleware in our framework and model system, we are able to facilitate such interoperability at both the syntactic and semantic levels, thus addressing the prior requirement. In particular, the multi-tier architecture we have adopted simplifies deployment of new technology such as caGrid within our framework. This multi-tiered framework has additional benefits, including more efficient control of and access to multiple underlying data sources and the ability to mitigate potential performance concerns in operational systems through the utilization of appropriate proxy data sources.

There are several limitations of our framework and model system that should be noted, including: (1) our relatively simple approach to text mining does not exploit more advanced semantic interpretation of clinical narrative text, nor does it allow for the detection of and reasoning upon negation within that text, which could affect the recall and precision of information retrieval tasks (however, an analysis of the recall and precision of the text mining process was infeasible during this study because of limited scope); (2) the query expansion techniques employed by virtue of the previously noted simple approaches to text mining do not lend themselves to fully assisting end users in identifying optimal descriptors or codes to be used during the query formulation process, and thus, the efficacy of our queries is, in part, reliant on the domain expertise and heuristic knowledge of our end users; (3) the text data deidentification scheme employed relies on preexisting deidentification processes external to our framework and model system; and (4) our evaluation of the described software framework and model system is limited to a single instance and basic scope, owing to the preliminary status of our work as a model formulation effort.

CONCLUSION

We have described a model and associated software framework with a promising and unique combination of components that are capable of providing translational research users with an integrative query and information retrieval tool that spans multiple, critical biomedical information sources including structured data, narrative text, and images. Furthermore, the inclusion of both deidentification mechanisms and standards-compliant electronic data interchange modalities within our system has significant potential to address inherent challenges to the conduct of multi-center or cross-disciplinary translational research in the modern regulatory environment. Our future plans for this project include the continued evaluation of the framework, with specific emphasis on the types of novel hypotheses that can be addressed using such a knowledge-anchored, integrative query platform, as well as its applicability to other usage scenarios. We fully anticipate that our system, with its focus on satisfying a critical translational research information need, will continue to develop into an operational platform for use by researchers at OSUMC that will also be extensible to the broader informatics and research communities.

ACKNOWLEDGMENTS

The authors would like to thank Jason Buskirk, Felix Liu, Scott Silvey, Tremayne Smith, Ty Tolley, and Herb Smaltz.

REFERENCES

1. Cimino JJ: From data to knowledge through concept-oriented terminologies: experience with the Medical Entities Dictionary. J Am Med Inform Assoc 7(3):288–297, 2000

2. Sujansky W: Heterogeneous database integration in biomedicine. J Biomed Inform 34(4):285–298, 2001

3. Kamal J, et al: Information warehouse as a tool to analyze computerized physician order entry order set utilization: opportunities for improvement. AMIA Annu Symp Proc 2003:336–340, 2003

4. Prather JC, et al: Medical data mining: knowledge discovery in a clinical data warehouse. Proc AMIA Annu Fall Symp 1997:101–105, 1997

5. Brown M, et al: CAD in clinical trials: current role and architectural requirements. Comput Med Imaging Graph 31(4–5):332–337, 2007

6. Kamauu AW, et al: Informatics in radiology (infoRAD): vendor-neutral case input into a server-based digital teaching file system. Radiographics 26(6):1877–1885, 2006

7. NCIA: Reference Image Database to Evaluate Response (RIDER). Available at http://ncia.nci.nih.gov/ncia/collections. Cited 2007

8. Sigal R: PACS as an e-academic tool. International Congress Series 2005, 1281, CARS 2005: Computer Assisted Radiology and Surgery, pp. 900–904

9. Boochever SS: HIS/RIS/PACS integration: getting to the gold standard. Radiol Manage 26:16–24, 2004

10. Gruber TR: Toward principles for the design of ontologies used for knowledge sharing. In: Guarino N, Poli R, Eds. Formal Ontology in Conceptual Analysis and Knowledge Representation. Norwell: Kluwer, 1993

11. Joseph P, Bruce GB: Ontology-guided knowledge discovery in databases. In: Proceedings of the international conference on knowledge capture. Victoria, British Columbia, Canada: ACM Press, 2001

12. Smith B, Kumar A: On controlled vocabularies in bioinformatics: a case study in the gene ontology. Biosilico: Drug Discovery Today 2(1):246–252, 2004

13. Gurcan MN, et al: Lung nodule detection on thoracic computed tomography images: preliminary evaluation of a computer-aided diagnosis system. Med Phys 29(11):2552–2558, 2002

14. Ebbert JO, Dupras DM, Erwin PJ: Searching the medical literature using PubMed: a tutorial. Mayo Clin Proc 78(1):87–91, 2003

15. Olson GM, et al: Collaboratories to support distributed science: the example of international HIV/AIDS research. In: Proceedings of SAICSIT, South Africa. Victoria, British Columbia, Canada: ACM Press, 2002

16. Butler D: Data, data, everywhere. Nature 414(6866):840–841, 2001

17. Marks RG, Conlon M, Ruberg SJ: Paradigm shifts in clinical trials enabled by information technology. Stat Med 20(17–18):2683–2696, 2001

18. Payne PR, Greaves AW, Kipps TJ: CRC clinical trials management system (CTMS): an integrated information management solution for collaborative clinical research. AMIA Annu Symp Proc 2003:967, 2003

19. Kuchenbecker J, et al: Use of internet technologies for data acquisition in large clinical trials. Telemed J E Health 7(1):73–76, 2001

20. Marks L, Power E: Using technology to address recruitment issues in the clinical trial process. Trends Biotechnol 20(3):105–109, 2002

21. Bates DW, et al: A proposal for electronic medical records in U.S. primary care. J Am Med Inform Assoc 10(1):1–10, 2003

22. Sung NS, et al: Central challenges facing the national clinical research enterprise. JAMA 289(10):1278–1287, 2003

23. Bates DW, et al: Effect of computerized physician order entry and a team intervention on prevention of serious medication errors. JAMA 280(15):1311–1316, 1998

24. Huang H, et al: Picture archiving and communication systems (PACS) in medicine, New York: Springer, 1991

25. Duerinckx AJ, Pisa EJ: Filmless picture archiving and communication system (PACS) in diagnostic radiology. Proc SPIE 318:9–18, 1982

26. Gurcan MN, et al: GridImage: a novel use of grid computing to support interactive human and computer-assisted detection decision support. J Digit Imaging 20:160–171, 2007

27. Craver JM, Gold RS: Research collaboratories: their potential for health behavior researchers. Am J Health Behav 26(6):504–509, 2002

28. Kukafka R, et al: Grounding a new information technology implementation framework in behavioral science: a systematic analysis of the literature on IT use. J Biomed Inform 36(3):218–227, 2003

29. Johnson MS, Gonzales MN, Bizila S: Responsible conduct of radiology research part V. The health insurance portability and accountability act and research. Radiology 237(3):757–764, 2005

30. Liu BJ, Zhou Z, Huang HK: A HIPAA-compliant architecture for securing clinical images. J Digit Imaging 19(2):172–180, 2006

31. Amendolia SR, et al: MammoGrid: a service oriented architecture based medical grid application. In: 3rd International Conference on Grid and Cooperative Computing, Wuhan, China, 2004

32. Blanquer I, et al: A Middleware grid for storing, retrieving and processing DICOM medical images. In: Workshop on Distributed Databases and Processing in Medical Image Computing (DIDAMIC), Rennes, France, 2004

33. Espert IB, García VH, Quilis JD: An OGSA middleware for managing medical images using ontologies. J Clin Monit Comput 19(4–5):295–305, 2005

34. Montagnat J, et al: Medical image content-based queries using the grid. In: HealthGrid’03, Lyon, France, 2003

35. Power D, et al: A relational approach to the capture of DICOM files for Grid-enabled medical imaging databases. In: ACM symposium on applied computing, Nicosia, Cyprus, 2004, pp 272–279

36. Foster I, Kesselman C: The Grid 2: blueprint for a new computing infrastructure, 2nd edition. New York: Morgan Kaufmann, 2003, p. 748

37. Payne PR, et al: Conceptual knowledge acquisition in biomedicine: a methodological review. J Biomed Inform 40:582–602, 2007

38. NLM: Unified Medical Language System. Available at http://www.nlm.nih.gov/research/umls/meta2.html. Cited 2007

39. Bodenreider O: Using UMLS semantics for classification purposes. Proc AMIA Symp 2000:86–90, 2000

40. Campbell KE, et al: Representing thoughts, words, and things in the UMLS. J Am Med Inform Assoc 5(5):421–431, 1998

41. Thomas BJ, et al: Automated computer-assisted categorization of radiology reports. Am J Roentgenol 184(2):687–690, 2005

42. Tsui F-C, et al: Value of ICD-9-coded chief complaints for detection of epidemics. J Am Med Inform Assoc 9:S41–S47, 2002

43. Friedman C, et al: Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 11(5):392–402, 2004

44. Srinivasan S, et al: Finding UMLS Metathesaurus concepts in MEDLINE. In: American Medical Informatics Association Annual Symposium, 2002, pp 727–731

45. Taira RK, Soderland SG, Jakobovits RM: Automatic structuring of radiology free-text reports. Radiographics 21:237–245, 2001

46. Zou Q, et al: IndexFinder: a method of extracting key concepts from clinical texts for indexing. In: American Medical Informatics Association Annual Symposium, 2003, pp 763–767

47. Alonso O, et al: Oracle text white paper. Available at http://www.oracle.com/technology/products/text/index.html. Cited 2006

48. International Business Machines Corporation: DB2 text extender. Available at ftp://ftp.software.ibm.com/software/data/db2/extenders/text/db2tewkspecsheet.pdf. Cited 2002

49. Microsoft Corporation: SQL Server 2000 full-text search deployment white paper. Available at http://www.support.microsoft.com/kb/323739. Cited 2004

50. Ferrucci D, Lally A: Building an example application with the unstructured information management architecture. IBM Syst J 43(3):455–475, 2004

51. Baecker R, Small I, Mander R: Bringing icons to life. In: Proceedings of the SIGCHI conference on human factors in computing systems: reaching through technology. New Orleans, Louisiana, USA: ACM Press, 1991, pp 1–6

52. NEMA: Digital imaging and communications in medicine. Available at http://www.medical.nema.org/. Cited 2007

53. Armato SG III, et al: Lung image database consortium: developing a resource for the medical imaging research community. Radiology 232:739–748, 2004

54. Sigal R: PACS as an e-academic tool. In: CARS 2005: computer assisted radiology and surgery, 2005

55. Toms AP, et al: Building an anonymized catalogued radiology museum in PACS: a feasibility study. Br J Radiol 79:661–671, 2006

56. Cohen S, Gilboa F, Uri S: PACS and electronic health records, San Diego, CA: SPIE, 2002

57. Lehmann T, Wein B, Greenspan H: Integration of content-based image retrieval to picture archiving and communication systems. In: Medical Informatics Europe Conference, 2003

58. Traina A, Rosa NA, Traina C: Integrating images to patient electronic medical records through content-based retrieval techniques. In: 16th IEEE Symposium on Computer-Based Medical Systems, 2003

59. Leoni L, et al: A virtual data grid architecture for medical data using SRB. In: EuroPACS-MIR 2004, Trieste, Italy, 2004

60. Erdal S, et al: Flexible patient information search and retrieval framework: pilot implementation. In: Proceedings of the SPIE Medical Imaging, San Diego, CA, 2007

61. Erdal S, et al: Information warehouse application of caGrid: a prototype implementation. In: caBIG 2007 Annual Meeting, Washington, DC, 2007

62. Erdal S, et al: Integrating a PACS system to grid: a de-identification and integration framework. In: Annual Meeting of the Society for Imaging Informatics in Medicine (SIIM) 2007, Providence, RI, 2007

63. Lindberg C: The unified medical language system (UMLS) of the national library of medicine. J Am Med Rec Assoc 61(5):40–42, 1990

64. Lindberg DA, Humphreys BL, McCray AT: The unified medical language system. Methods Inf Med 32(4):281–291, 1993

65. Cancer Biomedical Informatics Grid (caBIG™). Available at https://cabig.nci.nih.gov/workspaces/Architecture/caGrid/. Cited 2006

66. Eckerson WW: Three tier client/server architecture: achieving scalability, performance, and efficiency in client server applications. Open Inf Syst 10:1–12, 1995

67. Gallaugher J, Ramanathan S: Choosing a client/server architecture. A comparison of two-tier and three-tier systems. Inf Syst Manage Mag 13(2):7–13

68. Clunie DA: DICOM structured reporting, Bangor, Pennsylvania: PixelMed, 2000

69. Payne PR, et al: Breaking the translational barriers: the value of integrating biomedical informatics and translational research. J Investig Med 53(4):192–200, 2005
