RASMUS WINTER QSPACE VISUALISATION OF MEDLINE ARTICLES 1 QSPACE VISUALISATION OF MEDLINE ARTICLES A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES 2005 Rasmus Winter SCHOOL OF COMPUTER SCIENCE


QSPACE VISUALISATION OF MEDLINE ARTICLES

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER

FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY

OF ENGINEERING AND PHYSICAL SCIENCES

2005

Rasmus Winter

SCHOOL OF COMPUTER SCIENCE


TABLE OF CONTENTS

List of Figures
Abstract
Declaration
Copyright
Acknowledgements
The Author
1. Introduction
    1.1 Context of the Study
    1.2 Existing Software
    1.3 Structure of Dissertation
2. Analysis
    2.1 System Users
    2.2 Requirements Analysis
    2.3 Utilised Technology
        2.3.1 MEDLINE
        2.3.2 PubMed
        2.3.3 MAVERIK 6.2
        2.3.4 Q-SPACE
        2.3.5 Qt 3.3
    2.4 Programming Languages
    2.5 Developmental Approach
3. System Design
    3.1 PubMed Data Collecting and Processing
    3.2 Visualisation
        3.2.1 Q-SPACE Structure
        3.2.2 BioQSpace Visualiser Structure
    3.3 File Structures
    3.4 GUI Design
4. Implementation
    4.1 Abstract Comparison Attributes
    4.2 pubmed.pl
        4.2.1 Querying PubMed
        4.2.2 Processing the Results
        4.2.3 Saving the Results
    4.3 BioQSpace Visualiser
        4.3.1 Article Storage and Comparison Algorithms
        4.3.2 GUI
        4.3.3 MAVERIK Navigation
5. Testing and Evaluation
    5.1 Testing
    5.2 Evaluation
    5.3 Installation
6. Conclusions
    6.1 Summary
    6.2 Performance Issues
    6.3 Further Work
Glossary
Bibliography
Appendix A: E-Utility Results
Appendix B: Files used by pubmed.pl


LIST OF FIGURES

Figure 2.1: Minimal Spanning Trees
Figure 2.2: A set of tuples in its final configuration
Figure 3.1: A screenshot of the original Q-SPACE
Figure 3.2: Original Q-SPACE program structure
Figure 3.3: A MAV_qobj
Figure 3.4: A MAV_hull
Figure 3.5: A series of MAV_qobjs linked by a MAV_trail
Figure 3.6: The structure of BioQSpace
Figure 3.7: The basic graphical user interface design
Figure 4.1: The final graphical user interface design
Figure 4.2: The menu bar
Figure 4.3: Mark articles by attribute dialog
Figure 4.4: Help window
Figure 4.5: The word stems window
Figure 4.6: About BioQSpace window
Figure 4.7: The toolbar
Figure 4.8: Action of the ‘show labels’ and ‘use tooltips’ checkboxes
Figure 4.9: The advanced options dialog
Figure 4.10: The article information panel
Figure 4.11: The attribute weight sliders
Figure 6.1: Completion times of loading and reloading sets of articles
Figure 6.2: A method for parallelising the comparison algorithm


ABSTRACT

Upon querying the citation and biomedical article database PubMed, interpreting

and spotting relationships in the resulting list of articles can be difficult. Some sort of

visualisation to help with these processes is highly desirable, and to that end,

BioQSpace was designed and built. BioQSpace attempts to visualise the relationships

between the articles by rendering them as clustered sets of objects in a navigable 3D

environment.

The application will perform a PubMed search on a given query, parse the resulting

article list, calculate the relationships between each of the articles, and finally cluster

and colour them in 3D.

This thesis describes the design and development of BioQSpace, its usage, and a

critical analysis of the final product.


DECLARATION

No portion of the work referred to in this thesis has been submitted in support of an

application for another degree or qualification of this or any other university or other

institute of learning.


COPYRIGHT

1. Copyright in text of this thesis rests with the Author. Copies (by any process)

either in full, or of extracts, may be made only in accordance with instructions

given by the Author and lodged in the John Rylands University Library of

Manchester. Details may be obtained from the Librarian. This page must

form part of any such copies made. Further copies (by any process) of copies

made in accordance with such instructions may not be made without the

permission (in writing) of the Author.

2. The ownership of any intellectual property rights which may be described in

this thesis is vested in the University of Manchester, subject to any prior

agreement to the contrary, and may not be made available for use by third

parties without the written permission of the University, which will prescribe

the terms and conditions of any such agreement.

3. Further information on the conditions under which disclosures and

exploitation may take place is available from the Head of the Department of

Computer Science.


ACKNOWLEDGEMENTS

The author wishes to express his thanks to Steve Pettifer and Anna Divoli for their

help, guidance and their many contributions to the direction and content of this

project.


THE AUTHOR

The author graduated from Manchester University in 2004 with the degree of

Bachelor of Science in Computer Science and Maths, and stayed on to study for a

Master’s degree in Computer Science, for which the work described in this thesis is a

substantial part.


1. INTRODUCTION

1.1 Context of the Study

PubMed is a publicly accessible search and retrieval system for a number of medical

literature databases, the biggest being MEDLINE, the US National Library of

Medicine’s (NLM’s) database of biomedical citations and abstracts, which contains

over 12 million entries from 4,800 journals. The sheer number of articles resulting

from a PubMed search can often be overwhelming, and PubMed’s simple textual

presentation of them does not provide any clues as to the relationships between them.

This forces the user to read through the titles and abstracts of each one, or to check

the related articles links, when trying to find relevant articles – a task at which

humans are particularly inefficient.

A visual representation of the relationships between the articles would help the user

to focus their attention on groups of similar articles, instead of searching linearly

through a somewhat arbitrarily ordered list. This requires a metric for the similarity

of different articles, which can incorporate many factors such as any drugs, diseases

or biological terms their titles or abstracts have in common.

The purpose of this project is to explore how this similarity measure can be

calculated, and to implement it as an algorithm in an application that will display the

results of a query in a more structured way. The resulting application, BioQSpace,

presents the articles as points in 3D space, and allows the user to explore that space,

both in terms of 3D navigation and the raw comparison data from the articles, and to

dynamically tweak the comparison algorithm by placing more or less emphasis on

the particular attributes it compares.

In addition to simply assisting in locating relevant information, it is hoped that

BioQSpace can be used in a variety of research topics, such as: discovering previously

unnoticed relationships, by combining two separate search results into one


visualisation; understanding trends in treatment, by combining results for different

time spans; or exploring the relationships of syntax vs. semantics.

The emphasis of the project is on visualising data and determining relationships

within the data, not on natural language processing (NLP) in terms of the extraction

of important data from the articles, though some techniques will be explored. Much

research has been done in the area of text mining of biomedical literature [SGM05,

KSBG04, SJORB05, KBSP04], and the developed techniques can be fairly complex,

but discussions of them are kept brief.

1.2 Existing Software

Several pieces of software exist that try to explore and visualise relationships amongst

MEDLINE articles, or concepts discussed within them. They all use the PubMed

database query tools, and present results using either text or diagrams, in 2D or 3D.

The application with features most similar to those of BioQSpace is RefViz [Ref]. It

allows searching of ISI Web of Science and OCLC in addition to PubMed, and by

analysing keywords in titles, abstracts and notes is capable of producing 2D diagrams

of abstracts organised into clusters by theme, based on their content.

XplorMed [PBA01, Xpl] is a web-based online tool for exploring MEDLINE,

filtering the abstracts produced from a query to extract the ones that most fit the

user’s requirements. It uses a step-by-step interactive procedure, asking the user to

eliminate or elaborate the sets of articles resulting from the previous step, starting

with a standard PubMed query. Among the stages involved in narrowing down the

results are a categorisation using MeSH Terms (see chapter 2.3.2) and fuzzy binary

relation calculations for words in the same abstract. The user can perform the process

iteratively, to minimise the number of irrelevant results as much as possible.


Chilibot [CS04, Chi] is also web-based, and is used to generate graphical

representations of the relationships among user provided terms, using PubMed and

NLP techniques. Chilibot searches PubMed for the user provided terms (typically

gene or protein names), and analyses abstracts in which two or more of the terms

appear, to determine if they are related. If they are, the sentences are further analysed

to find out whether or not it is an interactive relationship, whether the relationship is

stimulatory or inhibitory, and to what extent the terms are expressed. From this

information, a 2D line graph is produced showing all the valid relationships between

the terms.

BiopathwayBuilder [LPP04, Bio] uses information extraction (IE) of MEDLINE

abstracts to build and display gene and protein interaction networks, and allows the

user to enhance the usefulness of the automatic IE results by manually removing or

amending relationships in a 3D environment.

None of these tools perform the same analysis as BioQSpace: that of calculating

relationships between all members of a data set. Visualisation tools often assume

some biased perspective of the data, trying to categorise the elements based on

arbitrarily imposed rules. BioQSpace is completely unbiased in the method it uses to

cluster the data, which is interactive and customisable by the user – weights and

thresholds can be used to change how much each of the attributes that comprise the

comparison algorithm (used for clustering) contribute.

1.3 Structure of Dissertation

This chapter has described the motivation for the project, outlined the key features of

BioQSpace, and suggested some of its possible uses.

The next chapter outlines the target users and the requirements, reviews the existing

technologies on which BioQSpace was built, and discusses the development process

and specification.


Chapter three covers the design process of BioQSpace, and is followed in chapter

four by a discussion of the implementation process and the issues involved therein.

Chapter five describes the testing and evaluation of BioQSpace, concluding with the

installation process.

Chapter six contains concluding remarks, performance issues and ideas for further

work.


2. ANALYSIS

This chapter begins with the identification of the likely users of the system, and a

requirements analysis that lists the capabilities and features that the system is

expected to provide. Then there is a description of the technology relevant to the

project, followed by a brief description of how these existing tools and resources can

be combined and adapted for the purposes specific to BioQSpace. This is followed by

a discussion of the programming languages and the developmental process to be

used, with justification for the chosen options. The chapter is concluded with a

systems analysis and specification of the major features.

2.1 System Users

It is vital to identify the users of this type of system, as well as their abilities and

experience with computer systems. The target users of BioQSpace are medical

researchers or bioinformaticians, and it is unlikely that they all have considerable

computer knowledge. To that end, much of this work has been designed and

constructed with input from Anna Divoli, on staff in the bioinformatics department

at Manchester University, who has experience writing applications for the target

users and encourages ease-of-use to be a top priority.

2.2 Requirements Analysis

The expectations from the user are listed below, where the bold items are MUSTs

that are essential to the system and have to be completed for a successful project, and

the rest are SHOULDs – non-essential items that would be nice to have, but may not

be possible due to time constraints.


1. The system should include tools to query PubMed, parse the results to

extract important attributes, and visualise the relationships between the

articles in 3D.

2. It should be possible to save queries in different directories, and load them

at a later date.

3. Navigation of the visualisation should be possible (and intuitive) using a

standard mouse and/or a graphical user interface.

4. The user should be able to change the way the comparison values are

calculated by changing the weights for the attributes.

5. The user should be able to select and highlight articles using the mouse.

6. The user should be able to highlight articles that have certain attributes.

7. Further information about selected articles should be displayed.

8. The user should be able to remove articles from the visualisation.

9. A comprehensive help system should be available.

10. The two components of the system – querying/parsing and visualisation –

should be separate entities, but linked by an application that can execute both.

11. The user should be able to tweak the comparison algorithm to be more/less

thorough in the data it considers.

2.3 Utilised Technology

The following software, resources and tools are all to be incorporated into the final

system, in varying degrees.


2.3.1 MEDLINE

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a

bibliographic database of citations to journal articles in life sciences, covering the

fields of medicine, nursing, dentistry, veterinary medicine, the health care system,

and the preclinical sciences, but with a particular focus on biomedicine [Meda]. The

referenced papers generally range from 1966 to the present, and total an estimated 12

to 15 million (depending on the reference source), collated from 4,800 journals, with

between 1,500 and 3,500 references added most days of the week, ten months per

year, since 2002.

The majority of entries in MEDLINE record the articles’ authors, title, abstract, date

of publication, and other pertinent information, though it is not required for all

possible fields to be filled for each citation. The list of fields can be seen at [Ovi].

Although MEDLINE does not contain the entire text of the articles it cites, the

titles, abstracts and other pertinent information are available, and the term ‘article’ is

used throughout this dissertation to refer to the collection of data associated with the

cited articles.

MEDLINE cannot be directly searched for free, but its contents can be accessed

through several portals including PubMed [Puba], Infotrieve [Inf] or Medportal

[Medb], some of which are freely accessible, others requiring subscription fees.

2.3.2 PubMed

This is how the PubMed website [Puba] describes its service:

PubMed, available via the NCBI Entrez retrieval system, was developed by the National Center

for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), located at

the National Institutes of Health (NIH). ... PubMed was designed to provide access to citations


from biomedical literature. LinkOut provides access to full-text articles at journal Web sites and

other related Web resources. PubMed also provides access and links to the other Entrez

molecular biology resources.

PubMed is a database of citations to articles that encompasses MEDLINE, and

references several other databases, including OLDMEDLINE, which predates

MEDLINE and lacks some of its fields. In addition to the MEDLINE fields,

PubMed provides lists of related articles, links to external resources (such as the

article in full), and MeSH Terms for most of the articles.

The citations are manually indexed using terms from NLM’s controlled vocabulary,

MeSH (Medical Subject Headings) [Mes], which describe the contents of the article,

primarily to assist in searching PubMed. MeSH consists of a set of terms naming

descriptors in an alphabetical and hierarchical structure, with broad headings such as

‘Anatomy’ or ‘Mental Disorders’ at the top, and more specific headings lower down,

in an eleven-level hierarchy. MeSH is updated annually during November and

December; at the time of writing it contains 22,997 descriptors.

The related articles are calculated using an algorithm that computes similarity scores

based on MeSH term frequencies and frequencies of words/phrases in the titles and

abstracts of each of the articles, recording those with the highest score [Com].
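To make the idea of a frequency-based similarity score concrete, the following is a minimal sketch that compares two pieces of text by the cosine of their word-frequency vectors. It is not PubMed's related-articles algorithm, whose details are given in [Com]; cosine similarity is substituted here purely for illustration, and the function names term_frequencies and cosine_similarity are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lower-cased alphanumeric words in a title or abstract."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors (0 to 1)."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[t] * freq_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


# Invented example titles, not real MEDLINE entries.
a = term_frequencies("Inhaled steroids in childhood asthma")
b = term_frequencies("Asthma treatment with inhaled steroids")
c = term_frequencies("Gene expression in yeast")
print(cosine_similarity(a, b) > cosine_similarity(a, c))
```

A score of this kind only reflects shared vocabulary; weighting by MeSH term frequency, as the PubMed description mentions, would need an additional data source.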

PubMed can be searched in a web browser using NLM’s Entrez tool [Enta]. A basic

search can be performed simply using key concepts, such as treatment or disease

terms, but it can be refined by using search tags to search only certain fields, such as

the title ([ti]), the authors list ([au]), or the journal ([ta]). For a complete listing of the

available search tags, see [Pubb].
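As an illustration of how search tags refine a query, the sketch below assembles a tagged query string from free-text terms and field-restricted terms. The helper name build_pubmed_query is hypothetical, and joining every part with the boolean operator AND is an assumption made for the example; the search terms themselves are invented.

```python
def build_pubmed_query(terms, fields=None):
    """Join free-text terms and field-tagged terms into one query string.

    terms  -- list of untagged search terms
    fields -- mapping of search tag (e.g. "ti", "au", "ta") to a value
    """
    parts = list(terms)
    for tag, value in (fields or {}).items():
        parts.append(f"{value}[{tag}]")
    return " AND ".join(parts)


query = build_pubmed_query(["asthma"], {"ti": "inhaler", "au": "Smith J"})
print(query)  # asthma AND inhaler[ti] AND Smith J[au]
```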

By default, a search produces a list of article summaries that contain the authors, the

article title, the publication journal and date, related articles, and a unique PubMed

identifier (PMID). A variety of other presentation formats are available that contain

varying amounts of information, from a single-line brief summary to a complete

description.


As well as the web browser interface, Entrez provides Entrez Programming Utilities

(E-Utilities) to retrieve raw PubMed data in several of the presentation formats, in

HTML or XML format, to parse and use in other applications [Entb]. These include

ESearch, for performing a search on a query term; EFetch, to get the information

about a particular article or set of articles; and ELink, to retrieve the list of related

articles for a particular article.
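The three utilities can be pictured as three request URLs. The sketch below only builds the URLs and makes no network calls. The base address and the parameter names (db, term, retmax, id, retmode, dbfrom) are those of the current public E-Utilities service as far as the editor is aware, and may differ from the 2005 interface used in this project; the helper function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base address of the current E-Utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """URL for ESearch: find the PMIDs matching a query term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids, retmode="xml"):
    """URL for EFetch: retrieve the records for one or more PMIDs."""
    params = {"db": "pubmed",
              "id": ",".join(str(p) for p in pmids),
              "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def elink_url(pmid):
    """URL for ELink: list the PubMed articles related to one PMID."""
    params = {"dbfrom": "pubmed", "db": "pubmed", "id": pmid}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


print(esearch_url("asthma[ti]", retmax=5))
print(efetch_url([123, 456]))  # placeholder PMIDs
print(elink_url(123))
```

A typical sequence would be one ESearch call to obtain the PMIDs, followed by EFetch for the records and, where related articles are wanted, ELink for each PMID.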

2.3.3 MAVERIK 6.2

MAVERIK, the MAnchester Virtual EnviRonment Interface Kernel, is a publicly

available virtual reality system, capable of producing complex virtual environments

and interacting with 3D peripherals [Mav]. It is written in C, and provides several

core services vital for producing an interactive 3D environment, including the

following features which play important roles in this project:

• A complete set of default primitive objects.

• A spatial management system.

• High performance algorithms for culling, navigation and collision detection.

The primitive objects include boxes, cylinders, spheres, cones and polygons, and can

be rendered using any colours or textures available. MAVERIK graphical objects are

not restricted to the primitives; new ones can be created by writing functions for

drawing, intersections, bounding boxes and so on, and associating them with a

MAV_class. Typically, all objects will contain information related to dimension,

location and orientation, which can be recorded using MAVERIK data types such as

MAV_vector or MAV_matrix, and associated functions like mav_vectorRotate

or mav_matrixMult.

To save the programmer having to keep track of all graphical objects in an

environment, they are placed in a Spatial Management Structure (SMS) that controls


how and when they are rendered, and plays a central role in culling, object selection

and collision detection [CH02].

2.3.4 Q-SPACE

Q-SPACE is a tool for visualising sets of comparable items, consisting of objects

positioned in a three dimensional environment, and is written in a combination of C

and C++, utilising the graphical capabilities of MAVERIK [PC01, PCM01]. Q-

SPACE consists of a single MAVERIK window containing graphical representations

of the objects and their relationships, and is navigated and controlled using a 3D-

mouse and keyboard. To implement Q-SPACE for a particular set of data, one has to

write a mechanism to create instances of a subclass of QSERV_tuple, which needs

to provide data structures to store appropriate attribute data and an algorithm to

compare elements of the data set, which returns a value between 0 (elements are

completely different) and 1 (elements are identical).
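The contract described above (store attribute data, and return a comparison value between 0 and 1) can be sketched as follows. This is written in Python rather than as a C++ subclass of QSERV_tuple, and it is not the comparison algorithm developed later in this thesis: the class name ArticleTuple, the attribute names, and the use of a weighted Jaccard overlap are all assumptions made for illustration.

```python
class ArticleTuple:
    """Hypothetical stand-in for a tuple subclass holding set-valued attributes."""

    def __init__(self, pmid, **attributes):
        self.pmid = pmid
        self.attributes = {name: set(vals) for name, vals in attributes.items()}

    def compare(self, other, weights):
        """Weighted mean of per-attribute Jaccard overlaps, in the range 0 to 1."""
        total = 0.0
        weight_sum = 0.0
        for name, weight in weights.items():
            mine = self.attributes.get(name, set())
            theirs = other.attributes.get(name, set())
            if weight <= 0 or not (mine or theirs):
                continue  # attribute switched off, or empty in both tuples
            total += weight * len(mine & theirs) / len(mine | theirs)
            weight_sum += weight
        return total / weight_sum if weight_sum else 0.0


# Invented attribute values for two articles.
a = ArticleTuple(1, mesh={"Asthma", "Child"}, authors={"Smith J"})
b = ArticleTuple(2, mesh={"Asthma", "Adult"}, authors={"Jones K"})
weights = {"mesh": 2.0, "authors": 1.0}
print(a.compare(a, weights))            # 1.0
print(round(a.compare(b, weights), 3))  # 0.222
```

Passing the weights as an argument mirrors the requirement that the user can change how much each attribute contributes without rebuilding the tuples.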

Q-SPACE uses the tuple class’s comparison algorithm to pairwise compare all

created instances, storing their comparison values in a triangular matrix. These

values are used to make an ordered list of comparisons, with the most similar pairing

at the head, and the least similar at the tail. This list is used to create a minimal

spanning tree (MST) of the tuples (see figure 2.1), using the similarity values as edge

weights, which is used to ‘colour’ them in groups, where a tuple and its parent are

determined to be in different groups if their similarity is below a given threshold. A

large threshold yields a small number of highly populated groups; a small threshold

splits the tree up into a large number of sparsely populated groups. A tuple that forms

a new group is said to be the dominant tuple in that group.
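The sequence of steps above (pairwise values, ordered list, spanning tree, colouring by threshold) can be sketched as follows. This is a minimal illustration and not Q-SPACE's code: the similarity values are invented, the triangular matrix is represented by a dictionary keyed on (i, j) with i < j, the tree is built with Kruskal's method, and a group is taken to be any set of tuples still connected once tree edges with a similarity below the threshold are cut.

```python
from itertools import combinations

# Upper-triangular similarity values for four hypothetical tuples.
SIM = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.1,
       (1, 2): 0.3, (1, 3): 0.15, (2, 3): 0.8}


def similarity(i, j):
    return SIM[(min(i, j), max(i, j))]


def spanning_tree(n):
    """Take pairs from most to least similar, keeping each pair that
    joins two components not yet connected (Kruskal's method)."""
    ordered = sorted(combinations(range(n), 2),
                     key=lambda pair: similarity(*pair), reverse=True)
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    edges = []
    for i, j in ordered:
        ri, rj = find(i), find(j)
        if ri != rj:
            root[ri] = rj
            edges.append((i, j))
    return edges


def colour_groups(n, edges, threshold):
    """Give each tuple a group number; tree edges whose similarity is
    below the threshold separate one group from the next."""
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:
        if similarity(i, j) >= threshold:
            neighbours[i].append(j)
            neighbours[j].append(i)
    group = [None] * n
    next_group = 0
    for start in range(n):
        if group[start] is not None:
            continue
        group[start] = next_group
        stack = [start]
        while stack:
            node = stack.pop()
            for other in neighbours[node]:
                if group[other] is None:
                    group[other] = next_group
                    stack.append(other)
        next_group += 1
    return group


tree = spanning_tree(4)
print(tree)                         # [(0, 1), (2, 3), (1, 2)]
print(colour_groups(4, tree, 0.5))  # [0, 0, 1, 1]
```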


Figure 2.1: Minimal Spanning Trees.

Given a graph of vertices with weighted edges (a), a spanning

tree is a subgraph that contains all of the vertices and is a tree

((b), (c) and (d)). For it to be an MST, the sum of the weights

of edges must be the minimum for all possible spanning trees

(d) [Gou03].

Once the colouring is complete, the tuples are displayed as cubes (MAV_boxes) in

their group colour, with all members of a group encapsulated by a semi-transparent

minimal convex hull (in the same colour), and with the dominant tuples of different

groups joined by lines. The tuples are iteratively positioned in 3D using a force

placement algorithm that exploits the MST structure by attracting the dominant

tuples from each colour group, then repelling the tuples in each group away from

their dominant tuple. The tuples all begin at the origin, then move outwards,

‘organising themselves’ into linked groups until a stable formation has been reached,

where the tuples are separated by a distance proportional to their similarity. Figure

2.2 shows Q-SPACE’s default data set in its final configuration.

[Figure 2.1 panels: (a) Graph; spanning trees (b) tree weight = 12, (c) tree weight = 24, (d) tree weight = 10.]

Figure 2.2: A set of tuples in its final configuration.

The semi-transparent clusters contain all the tuples from the

same group, with separate clusters joined by lines.

2.3.5 Qt 3.3

Qt is a cross-platform C++ Graphical User Interface (GUI) toolkit designed and

maintained by Trolltech [Tro]. Qt has formed the basis of thousands of applications

worldwide, and is the basis of the KDE Linux desktop environment [Qta]. Although

version 4 of Qt is now available, it is a fairly substantial redesign compared to version

3.3 and is incompatible with MAVERIK in its current form, which is unlikely to

change.

Qt offers a large collection of object-oriented graphical widgets such as buttons, labels

and dialogs, as well as tools for handling streams, databases and threads.

2.4 Programming Languages

In order to be able to adapt Q-SPACE, which is written in a combination of C and

C++, the code for features specific to BioQSpace must also be in C and C++. The

use of C++ allows Qt to be used to build the GUI, and enables the use of the

Standard Template Library (STL) [SGI], a collection of container classes, algorithms

and iterators, most of which are templates, so can be used for any data types.

The choice of language for the visualisation, however, does not restrict the choice of

language for the information extraction part of the system, and for this Perl was

chosen, due to its simple and efficient text and regular expression handling

capabilities.

2.5 Developmental Approach

To produce an application capable of visualising the relationships between

MEDLINE abstracts, all of the components described above have to be integrated.

PubMed will be queried to gather the MEDLINE abstracts and related fields, which

can then be parsed to extract all useful attributes and stored in a subclass of

QSERV_tuple, which in turn can be integrated into a version of Q-SPACE in a Qt

GUI for visualisation.

Q-SPACE forms the bulk of the system, but is a complete application in itself, and as

such, the number of possible developmental approaches is restricted. When building

an application from the ground up, the Waterfall process is a desirable and efficient

methodology [Kol05]. This requires having a complete specification of every element

of the system, implementing each of them separately, then combining and testing

them as a whole. As much of the software for this system has already been written,

and changes to the original code are undesirable, the Waterfall process is not suitable,

so the Incremental Prototyping model is used instead [Pro]. This consists of

producing a basic functioning application, then incorporating new features according

to the list of requirements, or as they are thought of. This approach allows plenty of

feedback as the project progresses, ensuring that each feature performs exactly as

intended, without conflicts between them. In addition, if at some point in the

production process it is decided that a new comparison attribute is needed, say, it

should not be difficult to incorporate it.

In summary, BioQSpace should provide a tool to query PubMed using E-Utilities,

parse and process the results, and save comparison attributes for the articles in local

files. It should also provide a tool to visualise the contents of those files in a navigable

and interactive 3D environment. The details of these tools are elaborated in the next

chapter.

3. SYSTEM DESIGN

This chapter covers the design of the two parts of the system: the PubMed data

collecting and processing, and the visualisation, which are to be implemented as two

standalone applications. The design discussion covers program and file structures, as

well as GUI design.

3.1 PubMed Data Collecting and Processing

To satisfy parts of user requirements 1 and 2, the PubMed interaction part of the

system needs to perform three jobs:

• Query PubMed with the user’s search term and a maximum number of results, and store all of the required data locally, in a directory specified by the user.

• Process the data, extracting key words and phrases to use as comparison attributes.

• Save the processed data in a format that can be efficiently used by the visualisation application.

These tasks indicate two appropriate structural decisions: that the script should take

three arguments (target directory, maximum number of results and search query);

and that the jobs should be separated into three subroutines, which will allow each

task to be changed, tested and evaluated independently of the others.

The errors anticipated in this script are related to disk I/O and the internet connection; all of them should be caught and dealt with appropriately, informing the user what went wrong.

All of this will be contained in one Perl script, pubmed.pl.

3.2 Visualisation

The general structure of the Q-SPACE and BioQSpace applications will now be

described. Details of the variables and functions associated with each of the

mentioned components have been deliberately omitted here, as they will be explained

in later sections if and when they are deemed necessary.

3.2.1 Q-SPACE Structure

Q-SPACE consists of a curious mix of C and C++ files (and associated headers) that

uses true object-oriented C++ design in some places and pseudo-object-oriented C

design in others, the latter especially when directly interacting with MAVERIK,

which is written purely in C.

A list of computer file information is used for the default data set: the tuples are created from the output of the Linux command ls -l, which has been saved to a text file, files.txt. The attributes used are the file name, file type, directory and size.

A screenshot of the original application is shown in figure 3.1. The structure of the

program with the interactions and hierarchies of the Q-SPACE C structs and C++

classes are shown in figure 3.2, and the important elements are summarised below it.

Figure 3.1: A screenshot of the original Q-SPACE.

Figure 3.2: Original Q-SPACE program structure,

interactions and hierarchies.

[Figure 3.2 diagram labels: qpit_main; qserver; linked list interface; DEVA_link; DEVA_list; logging tools; DEVA_traceString; DEVA_tracer; QSERV_tStore; QSERV_tuple; QSERV_tupleFile; list of files (files.txt); MAVERIK modules; MAV_hull; MAV_hiliteBox; MAV_spangly; MAV_tooltip; MAV_trail; twine; MAV_qobj; MAV_qpit; MAV_m2n; hull.]

DEVA_link, DEVA_list

These are templates for creating linked lists of objects, and are used throughout the application.

DEVA_traceString, DEVA_tracer

DEVA_tracer is a class with static methods for logging the progress of the application, either to stderr or to a file, primarily for debugging purposes. Its methods use DEVA_traceStrings to concatenate primitive data types to strings easily.

There are 13 different types of logging message, including warnings, fatal errors and sanity messages, and any or all of them can be output by setting the mode appropriately.

MAVERIK Modules

These are pseudo-object-oriented classes that define how they are created, are drawn,

deal with intersections and are deleted (amongst others) by registering function

callbacks with MAVERIK.

MAV_qobj

This is a visual representation of a single tuple, and is simply a coloured cube (figure 3.3(a)), although it has three attributes that can alter its appearance: if it is selected, it has a rotating white box (MAV_hiliteBox) drawn around it (figure 3.3(b)); if it is marked, it has flashing white lines (MAV_spangly) emitting from it (figure 3.3(c)); if it is deleted, it is not drawn.

MAV_hull

This is a semi-transparent minimal convex hull that encapsulates all MAV_qobjs in the same group. MAV_hull does not construct the hull itself, but uses code written by Joseph O'Rourke, John Kutcher, Catherine Schevon and Susan Weller to calculate the minimum set of vertices needed to define the faces of the hull, and in what order, and then draws a plane for each face. A MAV_hull is shown in figure 3.4.

Figure 3.3: A MAV_qobj (a) unselected and unmarked, (b)

selected using a MAV_hiliteBox, (c) marked using a

MAV_spangly.

Figure 3.4: A MAV_hull, surrounding all of the MAV_qobjs

in the same group.

MAV_tooltip

This is intended to be a rectangle that appears at the position of the cursor if it pauses long enough over a MAV_qobj or MAV_hull, and which contains information about that object. Because version 6.2 of MAVERIK does not implement some of the required text-related rendering functions, however, nothing actually appears.

MAV_trail

A trail (figure 3.5) is used to track the visited tuples. When a MAV_qobj is clicked on, it is added to the end of the trail’s list of MAV_qobjs. A MAV_trail is drawn using the twine library written by James Marsh, which calculates intermediate points between the 3D locations of consecutive MAV_qobjs so as to interpolate smoothly through all of the MAV_qobjs in the list. Those points are then joined with yellow lines to produce a continuous curve that passes through every MAV_qobj in the trail.

Figure 3.5: A series of MAV_qobjs linked by a MAV_trail:

a smooth 3D curve.

MAV_qpit

A MAV_qpit is a container for all of the MAV_qobjs in the visualisation. Each MAV_qobj has its own drawing function, so the MAV_qpit draw callback does no actual drawing; instead, it recalculates the forces between the components in order to reposition them on every pass of the rendering loop.

MAV_m2n

This is a tool for ‘flying’ from the current location to a clicked-on MAV_qobj, and remaining focused on it until another MAV_qobj is selected. While a flight is in progress, it prevents the user from selecting another MAV_qobj.

QSERV_tuple

This is an abstract class with virtual functions that have to be overridden for the specific uses of subclasses. The most important virtual functions are compare and compareAttribute, which define how two instances of the same type of tuple should be compared using their attributes.

QSERV_tupleFile

This is a particular subclass of QSERV_tuple that compares file information, used for Q-SPACE’s default data set.

QSERV_tStore

This contains all of the instances of QSERV_tuple subclasses in the data set, and provides all the necessary functions for clustering the data and changing the appearance attributes of the MAV_qobjs.

qpit_main

This is the main body of the program. It begins by initialising the progress monitoring tools, MAVERIK and all of the modules, then creates a MAV_qpit, a MAV_trail and a MAV_tooltip and adds them to the main MAVERIK SMS.

When the MAV_qpit is made, it creates a QSERV_tStore, and adds to it a

QSERV_tupleFile instance for each line of files.txt. The tuples are then

clustered and assigned a MAV_qobj to represent them visually.

It then enters the MAVERIK infinite rendering loop which acts upon any input

events (e.g. from the mouse or keyboard), updates the hulls for each of the groups in

the MAV_qpit, and draws everything in the SMS.

3.2.2 BioQSpace Visualiser Structure

The original structure of Q-SPACE should be retained as much as possible, but with

a new interface and additional features. The major programming changes required to

adapt Q-SPACE to satisfy the user requirements should include:

• Construction of the GUI as part of the existing qpit_main initialisation function.

• Addition of new functions to qpit_main to perform the actions associated with the interactive GUI widgets.

• Replacement of the existing 3D-mouse navigation system with a 2D-mouse and/or GUI navigation system.

• Replacement of the DEVA_link and DEVA_list classes, examples of obscure legacy code, with suitable classes from the STL.

• Writing of a new subclass of QSERV_tuple, QSERV_tupleArticle, to deal with all of the data associated with MEDLINE articles.

• Writing of a function in MAV_qpit to read and parse the results from pubmed.pl to create QSERV_tupleArticle instances.

• Rewriting of all user feedback code, so that messages are displayed in a dialog as part of the GUI, in addition to being written to the terminal.

The ways in which these changes are implemented are described in chapter 4.3. The

amended structure diagram is shown in figure 3.6.

Figure 3.6: The structure of BioQSpace.

The new or substituted elements are in black, and those that

remain from the Q-SPACE structure are in grey.

3.3 File Structures

Once pubmed.pl has been run, the output files can be parsed by the visualiser as

many times as the user wishes, so it is sensible to try to minimise the amount of work

required for parsing by organising the data intelligently. Minimising the size of the

output files is also a desirable feature.

[Figure 3.6 diagram labels: qpit_main; qserver; Qt Widgets; files output from pubmed.pl; logging tools; DEVA_traceString; DEVA_tracer; QSERV_tStore; QSERV_tuple; QSERV_tupleArticle; MAVERIK modules; MAV_hull; MAV_hiliteBox; MAV_spangly; MAV_tooltip; MAV_trail; twine; MAV_qobj; MAV_qpit; MAV_m2n; hull.]

The output files will consist of a main file, qspace_main.txt, that contains all of the data for all of the articles returned from PubMed, and a file for each attribute, with names of the form qspace_[attribute_name].txt, that lists all encountered examples of the attribute. Rather than listing all of an article’s attribute examples in full in the main file, the line number of each example in the corresponding attribute file can be listed instead. As an example, let one of the attributes be disease names, and the 10th disease name (alphabetically) be Alzheimer’s. Any article that contains the word Alzheimer’s can then save the number ‘10’ instead of the word ‘Alzheimer’s’ in qspace_main.txt. Not only will this drastically reduce the file size of qspace_main.txt, but MAV_qpit can also read in all of the attribute files and index their contents in arrays before parsing the main file, at which point the values can be quickly extracted from the arrays.
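The indexing scheme can be illustrated with a small sketch. The function names loadAttributeFile and resolveIndices are hypothetical, and the code assumes that indices are 1-based line numbers, that the items of a field are separated by | characters and that a lone # marks a non-applicable field (the latter two follow the file format given in section 4.2.3).

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads an attribute file into an array, one entry per line.
std::vector<std::string> loadAttributeFile(std::istream& in) {
    std::vector<std::string> entries;
    std::string line;
    while (std::getline(in, line)) entries.push_back(line);
    return entries;
}

// Resolves a '|'-separated list of 1-based line numbers, e.g. "1|3",
// into the attribute values they refer to. Out-of-range numbers are skipped.
std::vector<std::string> resolveIndices(const std::string& field,
                                        const std::vector<std::string>& entries) {
    std::vector<std::string> values;
    std::istringstream tokens(field);
    std::string token;
    while (std::getline(tokens, token, '|')) {
        if (token.empty() || token == "#") continue;  // '#' marks a non-applicable field
        const std::size_t lineNumber = std::stoul(token);  // throws on non-numeric input
        if (lineNumber >= 1 && lineNumber <= entries.size())
            values.push_back(entries[lineNumber - 1]);
    }
    return values;
}
```

Because each attribute file is read once into an array, looking up a value while parsing the main file is a constant-time array access rather than a string search.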

Attributes that are numbers or involve scores should be normalised to lie between 0

and 1. For a given set of articles, normalisation need only be performed once, so

should be done before the data is saved. This way, the visualiser does not need to do

any normalisation of the raw attribute values.
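The text does not say which normalisation is used, so the following is only one possible sketch: it assumes the raw scores are non-negative and divides each by the largest, so that the results lie between 0 and 1. The function name normalise is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Scales non-negative scores so that the largest becomes 1.
// If there are no positive scores, every result is 0.
std::vector<double> normalise(const std::vector<double>& scores) {
    std::vector<double> result(scores.size(), 0.0);
    if (scores.empty()) return result;
    const double maxScore = *std::max_element(scores.begin(), scores.end());
    if (maxScore <= 0.0) return result;  // nothing to scale
    for (std::size_t i = 0; i < scores.size(); ++i) result[i] = scores[i] / maxScore;
    return result;
}
```

Running this once when the data is saved means the stored values can be used directly by the comparison code.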

3.4 GUI Design

A good GUI design follows a number of sound principles. The list below is an

abridged version of the guidelines from [IBM]:

• Simplicity: Don’t compromise usability for function. Keep the interface

simple and straightforward, minimising clutter. Common functions should be

immediately apparent, keeping advanced options less obvious.

• Support: Place the user in control and provide proactive assistance. Do not

restrict the user in the number of ways they can complete tasks: provide

alternative routes that they may be more comfortable with. Provide assistance

with achieving tasks, but in an unobtrusive way.

• Familiarity: Build on users’ prior knowledge. If the GUI performs similarly

to software the user is familiar with, and the behaviour is consistent across the

GUI, the interface will be easier to learn and operate. The design should be

based around what the user would expect to find.

• Obviousness: Make objects and their controls visible and intuitive. Using

real-world representations for application functions, such as an icon of a trash

can for a discard function, can help familiarise the user with the associations

between the controls and their functionality.

• Encouragement: Make actions predictable and reversible. Allow the user to

explore the tools the application provides, without fear of being unable to

recover a previous state. Do not bundle actions together in a way the user may

not anticipate.

• Satisfaction: Create a feeling of progress and achievement. Reflect the

results of actions immediately, instead of forcing the user to wait. If this is not

practical, communicate the progress of the process, or offer a preview of a

likely outcome of the action.

• Availability: Make all objects available at all times. Users should be able to

use all of their objects in any sequence and at any time. Restrictions on the

availability of objects can frustrate the user and should be avoided.

• Safety: Keep the user out of trouble. Every attempt should be made to

prevent the user from being able to cause errors. In cases where errors are out

of the system’s control, two-way communication is necessary to clarify what

the user intends, or to remedy the problem.

• Versatility: Support alternative interaction techniques. Allow the user to choose a method of interaction that suits them best. This covers input methods, such as the mouse, keyboard, microphone or stylus, and output methods, such as spoken instruction.

• Personalisation: Allow users to customise. Customisation of colour schemes

and backgrounds can help make an interface comfortable and familiar.

Providing the ability to change default values can enable them to save time

and effort.

Figure 3.7: The basic graphical user interface design.

• Affinity: Bring objects to life through good visual design. The final result

should be an intuitive and familiar representation that is second nature to

users.

Not all of these concepts are applicable to the BioQSpace GUI, but considerable

efforts will be made to satisfy them where appropriate. To this end, the general layout

in figure 3.7 should be adhered to. The components are:

• Toolbar: Common functionality should be placed here, organised in an

intuitive manner. In many applications, this is positioned on the left of the

window, so this GUI should do the same.

• Menu Bar: More advanced options should be accessible through menu items.

The items should have intuitive shortcut keys associated with them.

• MAVERIK viewport: This is the focus of the GUI, where practically all of

the actions will take place, so should comprise the bulk of the window. When

the window is resized, this is the only component that should grow in both

directions.

• Abstract attribute data: When an object in the visualisation is clicked on, the

user will expect to find out more information about it. That information

should appear in this box in a clear, easy-to-read format.

• Weights of attributes: One of the main requirements of the system is that the

user can adjust the way the abstracts are compared, via changing the weights

of the attributes. This action should have a clear interface, physically

separated from the rest of the available interactive actions.

• Status bar: Any progress or status messages that do not require user

acknowledgement should appear here.

4. IMPLEMENTATION

This chapter describes the main issues involved in the building of BioQSpace, using the design concepts from the previous chapter, and how the important functions were implemented in the two parts of the system: pubmed.pl and the visualiser. As the incremental method was used to develop the system, the order in which elements are described in this chapter does not reflect the order in which they were implemented; the descriptions are of the final result.

4.1 Abstract Comparison Attributes

To compare two abstracts, a set of attributes is needed; each attribute can be compared individually in a suitable way, and the resulting comparison values can then be combined. The 15 considered attributes and their meanings are listed below, ordered by decreasing importance, as judged by the author.

• Title words. All of the words in the title with their associated importance scores.

• Abstract words. All of the words in the abstract with their associated importance scores.

• Title & abstract words combined. All of the words in both the title and the abstract with their associated importance scores.

• MeSH Terms. The MeSH terms that were used to classify the article in PubMed.

• Drugs. Any drugs mentioned in the title or abstract.

• Diseases. Any diseases mentioned in the title or abstract.

• Function Terms. Any words or phrases in the title or abstract that refer to the functionality of biological entities.

• Structure Terms. Any words or phrases in the title or abstract that refer to the structure of biological entities.

• Location Terms. Any words or phrases in the title or abstract that refer to the locations where biological entities act or are acted upon.

• User-defined Terms. Any words or phrases in the title or abstract that are in a custom list provided by the user.

• PubMed Related Articles. The list of related articles as calculated by PubMed (used as part of two different attributes).

• Publication Date. The year of publication.

• Authors. The list of contributing authors.

• Journal. The journal that the article was published in, the publishing house the journal belongs to, and any portals that the journal can be accessed through.

4.2 pubmed.pl

The three tasks pubmed.pl performs (querying PubMed, processing the results and saving the results) are now described in more depth. The descriptions assume that the three arguments (target directory, maximum number of results and search query) have all been provided; if any of them is missing, the script exits, informing the user of the missing arguments.

4.2.1 Querying PubMed

Using the ESearch E-Utility, a query is performed using the user’s search term and

maximum number of results. The URL takes the form

http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=[max number]&usehistory=n&term=[search term]

This produces an xml-formatted file containing the PMIDs of the abstracts that

match the search term, plus additional information relating to the number of

occurrences of the search term, which is unused.

The file is parsed to extract the PMIDs only, which are concatenated (separated by

commas) and used in the URL for the EFetch E-Utility:

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=[list of PMIDs]&retmode=html&rettype=medline

This results in an html file (containing only the bare minimum of html code) that lists

all utilised MEDLINE fields and the corresponding values for each PMID in the list.

This file is saved as medline_data.txt in the target directory.

The PMIDs are then individually fed into the ELink E-Utility, for which the URL is:

http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=[PMID]&db=pubmed

This also produces an XML-formatted file, which contains the PubMed-defined list of related articles with their similarity scores. These pairs of numbers are extracted and saved as [PMID]:[score] pairs in a file, rel_[PMID], in the related subdirectory of the target directory.
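As a sketch of how the three request URLs are assembled (pubmed.pl itself is a Perl script; C++ is used here only to match the rest of the system), the functions below build each URL from the patterns quoted above. The function names and the percent-encoding of the search term are assumptions of this example, and the target database parameter of ELink is written as db, the name the E-Utilities use.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Base address as quoted in the text; it reflects the service at the time of writing.
const std::string kBase = "http://www.ncbi.nlm.nih.gov/entrez/eutils/";

// Percent-encodes every character outside the URL "unreserved" set.
std::string urlEncode(const std::string& text) {
    std::string out;
    for (unsigned char c : text) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// ESearch: returns the PMIDs matching the search term.
std::string esearchUrl(const std::string& term, unsigned maxResults) {
    return kBase + "esearch.fcgi?db=pubmed&retmax=" + std::to_string(maxResults) +
           "&usehistory=n&term=" + urlEncode(term);
}

// EFetch: returns the MEDLINE fields for a comma-separated list of PMIDs.
std::string efetchUrl(const std::vector<std::string>& pmids) {
    std::string ids;
    for (std::size_t i = 0; i < pmids.size(); ++i) {
        if (i > 0) ids += ',';
        ids += pmids[i];
    }
    return kBase + "efetch.fcgi?db=pubmed&id=" + ids + "&retmode=html&rettype=medline";
}

// ELink: returns the related articles for a single PMID.
std::string elinkUrl(const std::string& pmid) {
    return kBase + "elink.fcgi?dbfrom=pubmed&id=" + pmid + "&db=pubmed";
}
```

Keeping one builder per E-Utility mirrors the three requests described above and makes each URL easy to check in isolation.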

Examples of the results from performing each of these E-Utilities can be seen in

Appendix A, and an example of a file put into related is in Appendix B.

4.2.2 Processing the Results

First, medline_data.txt and the contents of the related subdirectory are

parsed to extract the title, abstract, MeSH terms, authors, journal title, publication

date and related articles for each PMID. The related article scores are normalised so

they lie between 0 and 1.

Then all of the titles and abstracts are scanned for drugs, diseases, function terms,

structure terms and location terms, by using regular expressions listed in 5 files (see

Appendix B), and for user-defined terms, which are listed in a file that the user can

optionally create.

An attempt is then made to identify the importance of all of the words in the title and

abstract, by performing term frequency – inverse document frequency (tf-idf) analysis

on them [Tfi]. First, each word is changed to lower case and, if it is not listed in

common_words.txt, is stemmed using the Porter Stemmer Algorithm [Por80].

This way, related words such as disease, Diseases and DISEASED will all be

counted using the same stem. This approach has a drawback though: some related

biological entities have names that are spelt the same, but differ in the case of the

letters (e.g. Myc = protein, myc = gene). This algorithm will not differentiate between

the two.

For each encountered word stem for each PMID, two values are calculated: $n_i$, the number of times the stem $i$ appears in the title (or abstract, or both); and $d_i$, the number of titles (or abstracts, or both) the stem appears in. The tf-idf value for each word can then be calculated using the formula

$$\text{tf-idf}_i = \frac{n_i}{\sum_k n_k} \cdot \log\left(\frac{|D|}{d_i}\right)$$

where $\sum_k n_k$ evaluates as the number of word stems in the title (or abstract, or both) and $|D|$ is the total number of PubMed results.

This produces high values if a stem appears frequently within a title (or abstract, or

both), but appears in only a small fraction of titles (or abstracts, or both) in the whole

set. Those stems that score highly are considered to be important words.

The tf-idf values are then normalised.
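The tf-idf calculation can be sketched as below (pubmed.pl is written in Perl; C++ is used here only for consistency with the rest of the system). The sketch assumes each title or abstract has already been lower-cased, filtered and stemmed into a list of stems, and that the logarithm is the natural logarithm, since the text does not state a base. The names Document and tfIdf are hypothetical, and the final normalisation step is not included.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Document = std::vector<std::string>;  // the stems of one title or abstract

// Returns, for each document, a map from stem to its tf-idf value.
std::vector<std::map<std::string, double>> tfIdf(const std::vector<Document>& docs) {
    std::map<std::string, std::size_t> docCount;                         // d_i per stem
    std::vector<std::map<std::string, std::size_t>> counts(docs.size()); // n_i per document
    for (std::size_t d = 0; d < docs.size(); ++d) {
        for (const std::string& stem : docs[d]) ++counts[d][stem];
        for (const auto& entry : counts[d]) ++docCount[entry.first];
    }

    std::vector<std::map<std::string, double>> scores(docs.size());
    const double total = static_cast<double>(docs.size());  // |D|
    for (std::size_t d = 0; d < docs.size(); ++d) {
        const double length = static_cast<double>(docs[d].size());  // sum over k of n_k
        for (const auto& entry : counts[d]) {
            const double tf = static_cast<double>(entry.second) / length;
            const double idf = std::log(total / static_cast<double>(docCount[entry.first]));
            scores[d][entry.first] = tf * idf;
        }
    }
    return scores;
}
```

A stem that appears in every document has an idf of log(1) = 0, so it scores 0 however often it occurs, which is the behaviour the formula is designed to give for uninformative words.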

4.2.3 Saving the Results

Each encountered attribute value is saved into one of:

• qspace_authors.txt

• qspace_diseases.txt

• qspace_drugs.txt

• qspace_functions.txt

• qspace_journals.txt

• qspace_locations.txt

• qspace_mesh_terms.txt

• qspace_structures.txt

• qspace_user_terms.txt

• qspace_words.txt

Then each of the articles is saved to qspace_main.txt in the 16-line format below, where indices indicate the line numbers in the corresponding qspace file, with the items separated by | characters. Non-applicable fields are indicated with a # character.

PMID

Title

Author indices list

MeSH term indices list

Journal name : provider indices list

Publication date

Drug indices list

Disease indices list

Function indices list

Structure indices list

Location indices list

User term indices list

Title word indices list, with tf-idf scores

Abstract word indices list, with tf-idf scores

Entire document word indices list, with tf-idf scores

Related article PMIDs list, with similarity scores
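As an illustration of the record layout listed above, the sketch below reads one 16-line record into named fields. It is written in Python, not the C++ of the visualiser; the field names, the function `parse_record` and the sample record are hypothetical. The text does not say how scores are attached to indices, so list items are kept as raw strings and the journal line is left unsplit.

```python
# Hypothetical field names, in the order given by the 16-line format.
FIELD_NAMES = [
    "pmid", "title", "authors", "mesh_terms", "journal", "publication_date",
    "drugs", "diseases", "functions", "structures", "locations", "user_terms",
    "title_words", "abstract_words", "document_words", "related_articles",
]
# Fields that hold |-separated lists; the others are kept as plain strings.
LIST_FIELDS = set(FIELD_NAMES) - {"pmid", "title", "journal", "publication_date"}


def parse_record(lines):
    """Turn 16 lines of text into a dictionary of attribute values."""
    if len(lines) != len(FIELD_NAMES):
        raise ValueError("expected %d lines, got %d" % (len(FIELD_NAMES), len(lines)))
    record = {}
    for name, raw in zip(FIELD_NAMES, lines):
        raw = raw.strip()
        if raw == "#":
            record[name] = None  # non-applicable field
        elif name in LIST_FIELDS:
            record[name] = [item for item in raw.split("|") if item]
        else:
            record[name] = raw
    return record


# Made-up record, for illustration only.
sample = [
    "12345678", "An example title", "4|17", "2|9|31",
    "Example Journal : 3|7", "2004",
    "#", "5", "#", "#", "#", "#",
    "#", "#", "#", "#",
]
record = parse_record(sample)
```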

A reference file, journal_list, is used to find the publishing houses and portals of

each known journal. It is created from [Lin05], and should be occasionally updated

by the user with the short getJournals.pl script when new journals, publishing

houses or journal access portals are established.

Also output is qspace_word_stems.html, a collection of all of the word stems

and the words they can represent.

4.3 BioQSpace Visualiser

The implementations of the features added to the original Q-SPACE application,

which together create the BioQSpace visualisation application, are now described.

4.3.1 Article Storage and Comparison Algorithms

MAV_qpit’s list file parsing function has been replaced with one to parse the files

created by pubmed.pl, extract the attribute values and create

QSERV_tupleArticle instances with them. The majority of the attributes are lists

of words or phrases, but the title words, abstract words, title + abstract words and

related articles lists have associated scores. The comparison algorithm for the lists

without scores calculates the fraction of the items from the lists that both articles

share. For the lists with scores, the pseudocode for the comparison value is below:

set comparison value to 0
for each item in list 1
    if list 2 contains item
        multiply score from list 1 with score from list 2
        add result to comparison value
divide comparison value by total number of unique items
    from list 1 and list 2
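The two comparison rules can be sketched in Python as follows. The function names are hypothetical, the scored lists are represented as dictionaries from item to score, and the unscored rule is read here as shared items divided by all unique items, which is an assumption since the text only says "the fraction of the items ... that both articles share".

```python
def compare_unscored(items1, items2):
    """Fraction of the unique items that appear in both lists (assumed reading)."""
    set1, set2 = set(items1), set(items2)
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)


def compare_scored(scores1, scores2):
    """Sum of score products over shared items, divided by the unique item count."""
    unique = set(scores1) | set(scores2)
    if not unique:
        return 0.0
    total = 0.0
    for item, score1 in scores1.items():
        if item in scores2:
            total += score1 * scores2[item]
    return total / len(unique)


# Made-up scores, for illustration only.
a = {"factor": 0.5, "bind": 0.2, "gene": 0.1}
b = {"factor": 0.4, "gene": 0.3, "fold": 0.6}
scored_value = compare_scored(a, b)
unscored_value = compare_unscored(["a", "b", "c"], ["b", "c", "d"])
```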

The lists can be quite large, so to reduce the time spent calculating these values (at the

expense of accuracy), subsets of the lists are made in the constructor which contain

the items with the highest scores, so the pseudocode becomes:

set comparison value to 0
for each item in list subset 1
    if list 2 contains item
        multiply score from list 1 with score from list 2
        add result to comparison value
for each item in list subset 2
    if list 1 contains item
        multiply score from list 1 with score from list 2
        add result to comparison value
divide comparison value by total number of unique items
    from list subset 1 and list subset 2

Score thresholds are used to create the list subsets. The thresholds are chosen automatically
when a data set is loaded so that the subsets consist of approximately 25% of the complete
lists. They can be changed using the GUI, which alters the size of the list
subsets. A lower threshold means that more items are considered, so the
calculation takes longer; a higher threshold creates subsets with fewer items, which
can dramatically speed up comparison calculations.
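A minimal Python sketch of the subset variant and of one way a threshold could be picked is given below. All names are hypothetical. Keeping items whose score is at least the threshold, and choosing the threshold as the score at the 25% position of the sorted scores, are assumptions; the text states only the 25% target.

```python
def threshold_for_fraction(all_scores, fraction=0.25):
    """Pick a threshold so that roughly `fraction` of the scores reach it."""
    ordered = sorted(all_scores, reverse=True)
    if not ordered:
        return 0.0
    keep = max(1, round(len(ordered) * fraction))
    return ordered[keep - 1]


def top_subset(scores, threshold):
    """Items whose score is at least the threshold (assumed inclusive)."""
    return {item: s for item, s in scores.items() if s >= threshold}


def compare_scored_subsets(scores1, scores2, threshold):
    """Subset version of the scored comparison, following the pseudocode."""
    subset1 = top_subset(scores1, threshold)
    subset2 = top_subset(scores2, threshold)
    unique = set(subset1) | set(subset2)
    if not unique:
        return 0.0
    total = 0.0
    for item in subset1:
        if item in scores2:
            total += scores1[item] * scores2[item]
    # Read literally, an item present in both subsets contributes twice.
    for item in subset2:
        if item in scores1:
            total += scores1[item] * scores2[item]
    return total / len(unique)


# Made-up scores, for illustration only.
a = {"factor": 0.5, "bind": 0.2, "gene": 0.1}
b = {"factor": 0.4, "gene": 0.3, "fold": 0.6}
value = compare_scored_subsets(a, b, threshold=0.4)
```

Raising the threshold shrinks both subsets, so fewer membership tests and multiplications are performed for each pair of articles.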

The PubMed related articles list is used in two attributes: one uses the algorithm

described above (PubMed Related Articles); the other checks if the PMID of the first

article is included in the second article’s related articles list, and vice versa (Direct

PubMed relation).

The publication date comparison score is the difference between the years of

publication, as a fraction of the complete range of years for the data set.

The publication journal comparison score yields 1 if the journals are the same;
otherwise it is calculated with the same algorithm as for the lists without scores, using
each journal’s list of publishing houses and portals.
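The three special-case comparisons described above could look like the following Python sketch. The function names are hypothetical. Two readings are assumed and should be checked against the real code: the direct relation is treated as satisfied when either article lists the other, and the date score is computed literally as the year difference over the year range, since the text does not say whether it is later inverted into a similarity.

```python
def shared_fraction(items1, items2):
    """Shared items over all unique items (assumed form of the unscored rule)."""
    union = set(items1) | set(items2)
    return len(set(items1) & set(items2)) / len(union) if union else 0.0


def direct_relation(pmid1, related1, pmid2, related2):
    """1.0 if either article appears in the other's related list (assumption)."""
    return 1.0 if (pmid1 in related2 or pmid2 in related1) else 0.0


def date_score(year1, year2, min_year, max_year):
    """Year difference as a fraction of the data set's year range."""
    span = max_year - min_year
    if span == 0:
        return 0.0
    return abs(year1 - year2) / span


def journal_score(name1, providers1, name2, providers2):
    """1.0 for the same journal, otherwise overlap of publishers and portals."""
    if name1 == name2:
        return 1.0
    return shared_fraction(providers1, providers2)
```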

The maximum comparison values for the different attributes differ greatly: the
publication date and journal values will often evaluate to 1, but the related articles
value can peak at only about 10^-2. This greatly undermines the intuitiveness of the
weights, so normalisation values are precalculated when the data set is loaded, which
ensures that all comparison values are relatively spread out in the range (0, 1].
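The text does not state how the normalisation values are derived. One simple possibility, shown here purely as an assumption, is to scale each attribute by the largest comparison value observed for it, so that the peak maps to 1.

```python
def normalisation_factor(values):
    """Scale factor that maps the largest observed value to 1 (assumed method)."""
    peak = max(values, default=0.0)
    return 1.0 / peak if peak > 0 else 1.0


def normalise(values):
    """Apply the per-attribute factor to every comparison value."""
    factor = normalisation_factor(values)
    return [v * factor for v in values]


# Made-up related-articles values that peak near 10^-2.
related_values = [0.002, 0.01, 0.005]
normalised = normalise(related_values)
```

After scaling, a weight applied to the related articles attribute acts on values of the same order as those of the date and journal attributes.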

4.3.2 GUI

The final design of the GUI is shown in figure 4.1. The layout is mostly consistent
with the proposed design in figure 3.7, having a menu bar at the top, a toolbar of
common actions on the left-hand side, article information and weight sliders at the
bottom, and the MAVERIK viewport filling most of the window. The only
component missing is the status bar, which has been replaced with a status dialog.

The GUI is built from default Qt widgets, some customised widgets (made by

subclassing existing Qt widgets) and a MAVERIK window. Actions (slots) are

associated with signals that are produced when interactive widgets are activated (such

as the clicked() signal from a QPushButton, or the valueChanged(int)

signal from a QSlider) using the QObject::connect function, which works in a

similar way to how C callback functions are registered [Qtb].

Figure 4.1: The final graphical user interface design.

Figure 4.2: The menu bar.

Menus (figure 4.2)

The path on the right hand side of the menu is the directory whose contents are being

displayed. The File menu contains options to load a new set of articles, which

opens a directory selection dialog, and to exit. The

Select Articles menu contains actions to let the user select all the articles, no articles,

the inverse of the current selection, or any that are marked. The Mark Articles menu

contains actions to mark the selected articles, no articles, or those that fit attribute

criteria, which uses the dialog in figure 4.3 to let the user choose the attribute values

to mark. The Help menu consists of links to the help system (figure 4.4), the word

stem meanings file produced by pubmed.pl (figure 4.5) and the About BioQSpace

window (figure 4.6).

Figure 4.3: Mark articles by attribute dialog.

Figure 4.4: Help window.

Figure 4.5: The word stems window.

Figure 4.6: About BioQSpace window.

Figure 4.7: The toolbar.

Toolbar (figure 4.7)

This panel consists of two sections: navigation, and miscellaneous options and

operations. The navigation system is discussed in chapter 4.3.3, which describes the

functions of the eight direction buttons, the two zoom buttons and the focus on last

selected article checkbox. The traverse trail buttons navigate through any articles

that have been added to the trail, and the fly to article drop-down list allows the user

to navigate to an article with the selected PMID.

The show labels and use tooltips checkboxes toggle the labels and tooltips, as

illustrated in figure 4.8. The label is the PMID of the article, which appears below the

MAV_qobj. The faulty code for MAV_tooltip has been fixed so that it now appears

as expected, displaying the PMID and the keywords from the article.

Delete selected articles and clear trail act as one would expect. Advanced options

displays the dialog shown in figure 4.9. The top four sliders change the thresholds

described in chapter 4.3.1; the last one changes the threshold that the Q-SPACE

grouping algorithm uses to cluster the articles.

Figure 4.8: Action of the ‘show labels’ and ‘use tooltips’

checkboxes.

Figure 4.9: The advanced options dialog.

Figure 4.10: The article information panel.

Article Information (figure 4.10)

When an article is selected, this panel lists all of its attribute values, described in

chapter 4.1, though only the important subsets of the whole lists are included for

those attributes with scores.

Figure 4.11: The attribute weight sliders.

Weight Sliders (figure 4.11)

With these sliders, the user can change the way the comparison value is calculated.

Higher or lower emphasis can be given to each attribute by positioning the handles

appropriately. When the user is happy with the choice, they click the recalculate

with new weights button, and wait for the comparisons to be recalculated, after

which the new visualisation is displayed. If any of the similarity values evaluate to
zero, the user is warned that the visualisation may be unrepresentative of the data
and is told how many pairs have zero similarity.
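How the slider weights might enter the overall comparison value is sketched below in Python. The weighted average is an assumption, as are the function names; the fallback to equal weighting when every weight is zero follows the behaviour reported in section 5.1.

```python
def combined_similarity(attribute_values, weights):
    """Weighted average of per-attribute comparison values (assumed formula)."""
    if not weights:
        return 0.0
    total_weight = sum(weights.values())
    if total_weight == 0:
        # All sliders at zero: treat every attribute equally.
        weights = {name: 1.0 for name in weights}
        total_weight = float(len(weights))
    weighted = sum(w * attribute_values.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


def count_zero_pairs(similarities):
    """Number of article pairs with zero similarity, as reported in the warning."""
    return sum(1 for s in similarities if s == 0)


# Made-up values for one pair of articles.
values = {"mesh_terms": 0.4, "authors": 0.0, "journal": 1.0}
emphasised = combined_similarity(values, {"mesh_terms": 2, "authors": 1, "journal": 1})
equal = combined_similarity(values, {"mesh_terms": 0, "authors": 0, "journal": 0})
```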

Error, Warning and Progress Messages

QMessageBoxes are used to display all error and warning messages. Progress

messages are displayed line-by-line in a custom QDialog, which automatically

disappears when a process has finished.

4.3.3 MAVERIK Navigation

The 3D space is of little use without effective methods of navigation. In Q-SPACE,
this was performed using a 3D mouse, but in BioQSpace the user can use
either the right button of a standard 2D mouse or buttons on the GUI.

There are two modes of navigation: free and focused. In free navigation, the user can

look up, down, left and right, and by holding shift can move backwards and

forwards. Focused navigation is initiated when an article is clicked on or otherwise

navigated to (via the trail traversal buttons, for instance). In focused navigation, the

user orbits the selected article, which remains at the centre of the viewport. Zooming

in is restricted so that the article cannot be passed through. The navigation mode can

be toggled using the focus on last selected article checkbox on the GUI. Returning
to focused mode after being in free mode flies the view back to the last selected article.

In Q-SPACE, left-clicking on an article focuses on it, selects it and adds it to the trail,
while right-clicking does nothing. Holding shift while left-clicking allows articles to
be selected in a group. In BioQSpace, the right mouse button is given functionality:

right-clicking an article focuses on it only, so the user can examine an article without

being forced to add it to the trail.

5. TESTING AND EVALUATION

All software should be thoroughly tested to locate possible bugs and limitations. A

comparison of the final product with the initial specification and requirements is a

good method of evaluating the success of the system. This chapter describes both of

those processes when applied to BioQSpace, and concludes with instructions for

installation of the software.

5.1 Testing

Testing a large, complex software application can never be exhaustive, so a program

cannot be proved to be correct. However, through rigorous testing of the individual

components and the integrated system, one can be more confident that it is correct.

Successful tests should expose flaws in the system, and as such they should be

carefully designed to perform tasks that the programmer would not predict or expect

to be attempted.

Writing in C or C++, languages that do not offer garbage collection, means that in

addition to testing that functions perform as expected, memory allocation and

deallocation must be kept under control. Memory leaks are a common problem, and

can be difficult to pin down, but tools such as Valgrind can help locate where the

problems originate.

It is assumed that the Q-SPACE code is essentially correct, so the sections that have

survived unchanged in BioQSpace need not be tested as thoroughly as the new or

adapted code.

pubmed.pl

The only potential problems outside the programmer's control are in the data
collection stage, which is susceptible to network timeouts. Disconnecting the
computer from the network at various points while running the script meant that the
way the script copes with network issues could be analysed and, in two cases, fixed.

Other problems can arise in the way the files are written to and read. Searches were

performed that were known to contain results that lacked one or more of the fields,

e.g. relatively recent submissions that have not yet been allocated MeSH terms, to

check that they were saved properly.

Many tests of the resulting files were made to ensure the format was consistently

correct.

User Interface

Using various combinations of the actions of: navigating through the 3D space;

clicking on multiple articles; traversing along the trail; deleting articles; clearing the

trail; and changing between the two different navigation modes, the application’s

stability could be thoroughly tested.

Setting all the weights to zero simply gives all attributes equal weighting, and throws

up no errors.

When the user tries to change the source path, an error message is displayed if the

directory does not exist, or does not contain all the required files.

Cancelling all of the dialogs, even if changes were made, was checked to ensure any

changes were not retained.

5.2 Evaluation

The testing process proved fruitful in locating bugs and unexpected behaviour. The

majority of the discovered problems were fixed, though some known bugs (and, of

course, some unknown bugs) remain that, given more time, would also be remedied.

The remaining known bugs are all minor, cosmetic issues that do not impair the use

of BioQSpace, for example deleting the currently focused article or article(s) in the

trail, which can result in focused navigation around an empty point in space.

All eight of the MUST requirements in chapter 2.2 have been satisfied. The user can:

• Query PubMed and visualise relationships between the resulting articles, in a
  3D environment, saving results of queries.
• Navigate the environment, and interact with the (graphical representations of)
  articles via the mouse and GUI.
• Tweak the comparison algorithm to place emphasis on particular attributes.
• View the attribute information of the articles.
• Remove articles from the visualisation.

The SHOULD requirements have been partially satisfied: the two halves of the

system have been implemented in separate applications, and the user can tweak the

amount of data considered in the comparison algorithm. The help system is not

comprehensive, but does cover the basics of how to use the system, and no joining

application was written.

With a little practice, the navigation is simple to use, and the GUI buttons allow

Apple Mac users to navigate despite the lack of a second mouse button. Overall, the

GUI design is uncluttered and intuitive, yet still provides all of the required

functionality.

No attempt has been made to quantitatively evaluate the usefulness of BioQSpace, so

it is unknown to what extent it achieves the original intended purpose of assisting

users of PubMed to find what they want or to conduct research. This would likely

require a lot more time and research, and is beyond the scope of this project.

5.3 Installation

The requirements for BioQSpace are:

• Internet access
• MAVERIK version 6.2, compiled with the --QT option
• Qt version 3.3
• Perl interpreter

The twine library used for the trail comes zipped in the BioQSpace package, but can

be unzipped anywhere. The following environment variables need to be set to the

home directories of the corresponding application in order to compile and run

BioQSpace:

• MAV_HOME (MAVERIK)
• QTDIR (Qt)
• TWINE_HOME (twine)
• BIOQSPACE (BioQSpace)

LD_LIBRARY_PATH needs to be appended with $MAV_HOME/lib and

$TWINE_HOME/lib. Running make in $BIOQSPACE/bioqspace will then

compile the application into the same directory.

If $BIOQSPACE is added to the PATH environment variable, pubmed.pl and the

visualisation can be run from anywhere.

This is a demanding installation procedure that can prove difficult on non-Linux

machines, and time-consuming on machines lacking the required libraries, which

have to be downloaded and installed first. Attempts were made to install BioQSpace

on an Apple Mac and Microsoft Windows (using Cygwin [Cyg]), but were

abandoned due to excessive problems with dynamic libraries. Many hours were spent

trying to fix the problems, but after too little progress was made it was judged to be

futile.

6. CONCLUSIONS

This chapter concludes the dissertation by summarising the intentions and

achievements of the project, and providing some suggestions for improvements and

extensions to the system.

6.1 Summary

BioQSpace was developed as a visualisation tool to graphically represent the

relationships between articles gathered from PubMed, an online medical literature

database. The project involved adapting an existing application, one that visualises

relationships among abstract data sets, for a specific purpose, and developing tools to

interact with the visualisation.

This thesis began with some background information on the motivation for the

project, and a discussion of related software in existence. The rest of the document

described the stages involved in developing BioQSpace, from requirements analysis,

through design and implementation to testing and evaluation.

6.2 Performance Issues

There are two main bottlenecks to the performance speeds of the visualiser:

• Comparing the articles.
• Positioning and rendering the MAV_qobjs using the force placement
  algorithm.

The first is a one-off job, whereas the second is performed continuously, but for both,
the faster they can be completed, the higher the level of user satisfaction. The time
complexities of the algorithms are O(N^2) and O(N log N) [PCM01] respectively,
which can dramatically slow down the software when the number of articles gets too
large; being O(N^2), comparing the articles has the bigger impact. The force
placement algorithm has been kept unchanged from Q-SPACE, and it is unlikely to
be possible to improve on its complexity. The time taken for article comparisons, however,
depends on the attribute values of the individual results, on how many of the weights are
zero, and on the thresholds in the advanced options dialog, so it can vary greatly.

BioQSpace was tested under Linux Fedora Core 2, running on a computer with an

AMD Athlon XP 2200+ processor with 512MB RAM and an NVIDIA GeForce4

MX 440 with AGP8x. The query term ‘transcription factor[ti]’ was used, with

maximum results ranging from 50 to 500. As the number of articles approached the

500 mark, the frame rate (determined by the speed of the force placement algorithm)

became noticeably sluggish, though still usable. Figure 6.1 shows the completion

times for loading a set of articles (which comprises reading files, comparison of the

articles, normalisation of comparison values, and estimation of the list subset

thresholds) and reloading a set of articles (recalculating the comparisons without

changing the weights). Each task was performed 5 times, and averages taken.

Figure 6.1: Completion times of loading and reloading sets of articles.
[Chart "Completion Times". x-axis: Number of Articles (0 to 500); y-axis: Time to
complete (secs) (0 to 250). Series: First load time; Average load time (excluding
first); Reload time.]

The graph uses quadratic trendlines to approximately interpolate the points, because
the algorithms are O(N^2). As expected, the time taken to reload a set of articles is
approximately half of that to load it, but it takes longer to load a set of articles for the
first time, which is both unexpected and unexplained. Because of this, the time taken
for the first attempt is plotted as a curve separate from the average of the following four
runs.

6.3 Further Work

There are a number of aspects to the system that are unsatisfactory, or have potential

to be greatly improved.

The comparison algorithm, which is performed whenever new data sets are loaded,

or the attribute weights or thresholds are changed, is the most time-consuming part of

the system. The following pseudocode shows how the algorithm works:

make a list of all of the articles
for each article a1 in list
    for each article a2 after a1 in list
        if neither are deleted
            similarity = compare a1 with a2
        else
            similarity = 0
        record similarity in matrix
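The loop above can be written as the following Python sketch. The function name, the `compare` callback and the symmetric storage of results are assumptions made for illustration. For N articles the inner body runs N(N-1)/2 times, which is the source of the quadratic cost.

```python
def similarity_matrix(articles, compare, deleted=frozenset()):
    """Fill an N x N matrix by comparing every article with those after it."""
    n = len(articles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if i in deleted or j in deleted:
                similarity = 0.0
            else:
                similarity = compare(articles[i], articles[j])
            # Stored in both halves for convenience (assumed layout).
            matrix[i][j] = matrix[j][i] = similarity
    return matrix


def overlap(a, b):
    """Toy comparison used only for this example: shared over unique items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up articles represented as sets of term indices.
articles = [{1, 2}, {2, 3}, {4}]
matrix = similarity_matrix(articles, overlap)
```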

This is an easily parallelisable algorithm. By recruiting multiple processors to help

compare groups of articles, the time taken to populate the similarity matrix would be

greatly reduced. One possible way of doing it would be to send the data required to

complete each outer loop (an article and all articles following it in the list) to a

separate processor. As each iteration of the loop uses one less article than the

previous one, this would be a very uneven distribution and would be wasteful of the

resources, so pairs of outer loops could be sent to each processor – one from the front

and one from the rear, as illustrated in figure 6.2.
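The front-and-rear pairing can be checked with a short Python sketch. The function names are hypothetical, and outer loops are indexed from 0 to N-1 here, whereas figure 6.2 labels them 0 to N. Outer loop i compares article i with the N-1-i articles after it, so pairing loop i with loop N-1-i gives each processor N-1 comparisons.

```python
def comparisons_in_loop(i, n):
    """Number of comparisons performed by outer loop i for n articles."""
    return n - 1 - i


def paired_outer_loops(n):
    """Pair outer loops from the front and rear of the list of articles."""
    pairs = []
    front, rear = 0, n - 1
    while front < rear:
        pairs.append((front, rear))
        front += 1
        rear -= 1
    if front == rear:
        pairs.append((front,))  # odd n leaves one unpaired middle loop
    return pairs


# Work per processor for six articles.
pairs = paired_outer_loops(6)
work = [sum(comparisons_in_loop(i, 6) for i in pair) for pair in pairs]
```

With an odd number of articles the middle loop is left on its own and carries roughly half the work of a full pair.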

Figure 6.2: A method for parallelising the comparison algorithm.
N is the number of articles. ⌊ ⌋ and ⌈ ⌉ mean floor and ceiling respectively.

    Processor        Outer loops of the comparison algorithm
    Processor 1      loop 0, loop N
    Processor 2      loop 1, loop N-1
    Processor 3      loop 2, loop N-2
    Processor 4      loop 3, loop N-3
    ...
    Processor N/2    loop ⌊N/2⌋, loop ⌈N/2⌉

Another possibility is to output the comparison values for each of the attributes in a
data set into a file the first time it is visualised. Then, any further visualisations of the
same data set can read those values back in, adapting them to reflect the attribute
weights and the thresholds, rather than recalculating them from scratch. A potential
drawback to this approach could be some disk space wastage if many sets of abstracts
are visualised only a small number of times, so some method of clearing out
unnecessary files would be required.

As the primary focus of the system was on the visualisation side, rather than the text
mining side, the algorithms used to process the PubMed data in pubmed.pl, such as
tf-idf, are somewhat basic and naïve. Implementing some more advanced NLP
techniques in the analyses of the titles and abstracts would improve the quality and
relevance of the attribute data that BioQSpace uses in the comparison algorithm.

A particular example of where the comparison algorithm is lacking is with the MeSH
terms. Currently they are strictly compared as complete strings, which does not take
advantage of their hierarchical nature. The levels of the hierarchy are marked with a
/, so are easily identified, and important elements are marked with a *. Instead of
rejecting MeSH terms that are not identical, partial comparison values can be
assigned if MeSH terms share some of their hierarchy. A higher value can be given if
there is a *.
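One way the suggested partial MeSH comparison might work is sketched below in Python. This is not part of BioQSpace: the function name, the scoring rule (shared leading levels over the deeper term's depth) and the size of the bonus for a starred element are all hypothetical choices. It assumes that levels are separated by / and that a * prefixes the marked element.

```python
def mesh_similarity(term1, term2, major_bonus=0.25):
    """Partial similarity between two MeSH term strings (hypothetical scoring)."""
    levels1 = term1.split("/")
    levels2 = term2.split("/")
    shared = 0
    starred = False
    for level1, level2 in zip(levels1, levels2):
        if level1.lstrip("*") != level2.lstrip("*"):
            break  # the hierarchies diverge at this level
        shared += 1
        if level1.startswith("*") or level2.startswith("*"):
            starred = True
    if shared == 0:
        return 0.0
    score = shared / max(len(levels1), len(levels2))
    if starred:
        score = min(1.0, score + major_bonus)  # extra credit for a marked element
    return score


# Made-up term strings, for illustration only.
plain = mesh_similarity("Transcription Factors/metabolism",
                        "Transcription Factors/genetics")
marked = mesh_similarity("*Transcription Factors/metabolism",
                         "*Transcription Factors/genetics")
```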

The current progress status dialog is somewhat primitive and unreliable due to some

clumsy usage of QThreads. Improving the way it is displayed would reassure the

user that the application is working properly.

The GUI does not follow one of the major GUI design principles, that of allocating
representative icons to the buttons and menu items: the only elements with icons are
the navigation buttons. Some time could be spent designing some user-friendly icons
to improve BioQSpace’s usability.

The 10th user requirement from chapter 2.2, that the two components of the system

should be linked with a single application that can run both, was not implemented.

The system would be more cohesive if this linking application were written.

Finally, to simplify the installation process, it would be good if
BioQSpace could be packaged as a self-contained application that does not require the

separate installation of Qt and MAVERIK. Finding out how to perform an

installation on Mac and Windows machines would also make BioQSpace more

accessible, and hence more useful.

GLOSSARY

GUI. Graphical User Interface.

IE. Information Extraction.

MAVERIK. The MAnchester Virtual EnviRonment Interface Kernel, a Virtual Reality system.

MEDLINE. A bibliographic biomedicine database; the largest component of PubMed.

MeSH. Medical Subject Headings, the NLM’s controlled vocabulary thesaurus.

NLM. The National Library of Medicine.

NLP. Natural Language Processing.

PubMed. One of the services provided by the NLM’s Entrez retrieval system, providing tools to query MEDLINE and other databases.

Q-SPACE. An application to visualise similarity amongst a set of abstract data.

Qt. A C++ GUI widget library.

STL. The C++ Standard Template Library.

tf-idf. Term frequency – inverse document frequency.

BIBLIOGRAPHY

[Bio] BiopathwayBuilder [online, cited 15 September 2005]. Available from

World Wide Web: http://www.biopathway.org/BiopathwayBuilder.

[Bio05] Anna Divoli. BioIE – Extracting information from the biomedical

literature [online, cited 15 September 2005]. Available from World Wide

Web: http://umber.sbs.man.ac.uk/dbbrowser/bioie.

[CH02] Jon Cook and Toby Howard (Editors). MAVERIK Programmer’s Guide.

University of Manchester, 2002.

[Chi] Chilibot: finding gene and protein relationships from MEDLINE [online,

cited 15 September 2005]. Available from World Wide Web:

http://www.chilibot.net.

[Com] Computation of Related Articles [online, cited 15 September 2005].
Available from World Wide Web:
http://www.ncbi.nlm.nih.gov/entrez/query/static/computation.html.

[CS04] Hao Chen and Burt M. Sharp. Content-rich biological network

constructed by mining PubMed abstracts. BMC Bioinformatics, 5:147,

2004.

[Cyg] Cygwin Information and Installation. [online, cited 15 September 2005].

Available from World Wide Web: http://www.cygwin.com.

[Enta] Entrez PubMed [online, cited 15 September 2005]. Available from World

Wide Web: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=

pubmed.

[Enta] Entrez Utilities [online, cited 15 September 2005]. Available from World

Wide Web: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/

eutils_help.html.

[FL04] Tancred Frickey and Andrei Lupas. CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics Applications Note, pages 3702-3704, 2004.

[Gou03] Graham Gough. Algorithms and Data Structures (Lecture notes). The University of Manchester, 2003.

[HDCV04] Robert Hoffman, Joaquin Dopazo, Juan C. Cigudosa and Alfonso Valencia. HCAD, closing the gap between breakpoints and genes. Nucleic Acids Research, 2004.

[IBM] IBM Ease of Use – Design basics [online, cited 15 September 2005]. Available from World Wide Web: http://www.3.ibm.com/ibm/easy/eou_ext.nsf/Publish/6.

[Inf] Infotrieve Online [online, cited 15 September 2005]. Available from World Wide Web: http://www4.infotrieve.com/newmedline/search.asp.

[KBSP04] Ronald N. Kostoff, Joel A. Block, Jesse A. Stump and Kirstin M. Pfeil. Information content in Medline record fields. International Journal of Medical Informatics, 73:515-527, 2004.

[Kol05] Adam Kolawa. Which Development Method is Right for your Project? [online, cited 15 September 2005]. Available from World Wide Web: http://www.stickyminds.com/sitewide.asp?ObjectId=3152&Function=DETAILBROWSE&ObjectType=ART.

[KSBG03] Thomas Karopka, Thomas Scheel, Sven Bansemer and Änne Glass. Automatic construction of gene relation networks using text mining and gene expression data. Medical Informatics, 2:169-183, 2004.

[Lee04] James Lee (with Simon Cozens and Peter Wainwright). Beginning Perl second edition. Apress, 2004.

[Lin05] LinkOut Journals by Provider [online, cited 15 September 2005]. Available from World Wide Web: http://www.ncbi.nlm.nih.gov/entrez/linkout/journals/jourlists.cgi?typeid=1&type=journals&format=text&operation=Show.

[LPP04] Changsu Lee, Jinah Park and Jong C. Park. A graphic tool for curating molecular interaction networks from the literature. Computers in Biology and Medicine, 35:555-564, 2004.

[Mas03] Louis Massey. On the quality of ART1 text clustering. Neural Networks, 16:771-778, 2003.

[Mav] The Advanced Interfaces Group – MAVERIK [online, cited 15 September 2005]. Available from World Wide Web: http://aig.cs.man.ac.uk/maverik/maverik.php.

[Meda] MEDLINE fact sheet [online, cited 15 September 2005]. Available from World Wide Web: http://www.nlm.nih.gov/pubs/factsheets/medline.html.

[Medb] Medportal [online, cited 15 September 2005]. Available from World Wide Web: http://www.medportal.com.

[Mes] Medical Subject Headings [online, cited 15 September 2005]. Available from World Wide Web: http://www.nlm.nih.gov/mesh/meshhome.html.

[Oua97] Steve Oualline. Practical C Programming 3rd edition. O’Reilly, 1997.

[Ovi] Ovid MEDLINE Field Guide [online, cited 15 September 2005]. Available from World Wide Web: http://www2.umdnj.edu/rwjlbweb/ovidpuzz/startscope.htm.

[PBA01] Carolina Perez-Iratxeta, Peer Bork and Miguel A. Andrade. XplorMed: a tool for exploring MEDLINE abstracts. TRENDS in Biochemical Sciences, September 2001.

[Por80] Martin Porter. Porter Stemming Algorithm [online, cited 15 September 2005]. Available from World Wide Web: http://tartarus.org/~martin/PorterStemmer.

[PC01] Steve Pettifer and Jonathan Cook. Exploring Realtime Visualisation of Large Abstract Data Spaces with QSPACE. IEEE, 2001.

[PCM01] Steve Pettifer, Jon Cook and John Mariani. Towards Real-Time Interactive Visualisation in Virtual Environments. Virtual Reality International Conference, May 2001.

[Pro] Production Processes [online, cited 15 September 2005]. Available from World Wide Web: http://www.scism.sbu.ac.uk/law/Section5/chap6/s5c6p2.html.

[Puba] PubMed Overview [online, cited 15 September 2005]. Available from World Wide Web: http://www.ncbi.nlm.nih.gov/entrez/query/static/overview.html.

[Pubb] Searching PubMed: Table 1. Search Field Descriptions and tags [online, cited 15 September 2005]. Available from World Wide Web: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.table.pubmedhelp.T37.

[Qta] Qt 3.3: About Qt [online, cited 15 September 2005]. Available from World Wide Web: http://doc.trolltech.com/3.3/aboutqt.html.

[Qtb] Qt 3.3: Signals and Slots [online, cited 15 September 2005]. Available from World Wide Web: http://doc.trolltech.com/3.3/signalsandslots.html.

[Ref] RefViz [online, cited 15 September 2005]. Available from World Wide Web: http://www.refviz.com.

[RKKGKHKRF00] Andrey Rzhetsky, Tomohiro Koike, Sergey Kalachikov, Shawn M. Gomez, Michael Krauthammer, Sabina H. Kaplan, Pauline Kra, James J. Russo and Carol Friedman. A knowledge model for analysis and simulation of regulatory networks. Bioinformatics Ontology, 12:1120-1128, 2000.

[SFW03] Ellen Siever, Stephen Figgins and Aaron Weber. Linux in a Nutshell 4th edition. O’Reilly, 2003.

[SGI] SGI – Standard Template Library Programmer’s Guide [online, cited 15 September 2005]. Available from World Wide Web: http://www.sgi.com/tech/stl.

[SGM04] Jeremy R. Semeiks, L. R. Grate and I. S. Mian. Text-based analysis of genes, proteins, aging and cancer. Mechanisms of Ageing and Development, 126:193-208, 2004.

[SJORB05] Jasmin Šarić, Lars Juhl Jensen, Rossitza Ouzounova, Isabel Rojas and Peer Bork. Extraction of regulatory gene/protein networks from Medline. Bioinformatics, 2005.

[SK05] Nicholas A. Solter and Scott J. Kleper. Professional C++. Wiley Publishing Inc., 2005.

[Tfi] Tf-idf – Wikipedia, the free encyclopedia [online, cited 15 September 2005]. Available from World Wide Web: http://en.wikipedia.org/wiki/Tfidf.

[Tro] Trolltech – Qt Product Overview [online, cited 15 September 2005]. Available from World Wide Web: http://www.trolltech.com/products/qt.

[Xpl] XplorMed: eXploring Medline abstracts [online, cited 15 September 2005]. Available from World Wide Web: http://www.bork.embl-heidelberg.de/xplormed.

APPENDIX A: E-UTILITY RESULTS

Listed here are the Entrez Programming Utilities used in BioQSpace, with example URLs and their results.

ESearch

URL: http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=10&usehistory=n&term=lupus

Result:
<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
  <Count>47317</Count>
  <RetMax>10</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>16127435</Id>
    <Id>16127360</Id>
    <Id>16127015</Id>
    <Id>16127001</Id>
    <Id>16126989</Id>
    <Id>16126986</Id>
    <Id>16126985</Id>
    <Id>16126984</Id>
    <Id>16126981</Id>
    <Id>16126980</Id>
  </IdList>
  <TranslationSet>
    <Translation>
      <From>lupus</From>
      <To>(&quot;lupus&quot;[MeSH Terms] OR &quot;systemic lupus erythematosus&quot;[Text Word] OR &quot;lupus erythematosus, systemic&quot;[MeSH Terms] OR lupus[Text Word])</To>
    </Translation>
  </TranslationSet>
  <TranslationStack>
    <TermSet>
      <Term>&quot;lupus&quot;[MeSH Terms]</Term>
      <Field>MeSH Terms</Field>
      <Count>795</Count>
      <Explode>Y</Explode>
    </TermSet>
    <TermSet>
      <Term>&quot;systemic lupus erythematosus&quot;[Text Word]</Term>
      <Field>Text Word</Field>
      <Count>35520</Count>
      <Explode>Y</Explode>
    </TermSet>
    <OP>OR</OP>
    <TermSet>
      <Term>&quot;lupus erythematosus, systemic&quot;[MeSH Terms]</Term>
      <Field>MeSH Terms</Field>
      <Count>32040</Count>
      <Explode>Y</Explode>
    </TermSet>
    <OP>OR</OP>
    <TermSet>
      <Term>lupus[Text Word]</Term>
      <Field>Text Word</Field>
      <Count>47317</Count>
      <Explode>Y</Explode>
    </TermSet>
    <OP>OR</OP>
    <OP>GROUP</OP>
  </TranslationStack>
</eSearchResult>
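The ESearch request and result above can be illustrated with a minimal sketch. It is written in Python rather than the Perl of pubmed.pl and is not the BioQSpace code: the function names `build_esearch_url` and `parse_esearch` are hypothetical, and the sample XML is a shortened copy of the result listed above so that the example runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base address taken from the example URL above (its 2005 location).
ESEARCH_BASE = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=10):
    """Assemble an ESearch URL with the same parameters as the example."""
    params = [("db", "pubmed"), ("retmax", retmax),
              ("usehistory", "n"), ("term", term)]
    return ESEARCH_BASE + "?" + urlencode(params)


def parse_esearch(xml_text):
    """Return (total hit count, list of PMIDs) from an eSearchResult."""
    root = ET.fromstring(xml_text)
    # Only the top-level <Count> is read; the per-term counts inside
    # <TranslationStack> are nested deeper and are not matched here.
    count = int(root.findtext("Count"))
    pmids = [elem.text for elem in root.findall("IdList/Id")]
    return count, pmids


# Shortened copy of the result above (three of the ten IDs).
SAMPLE = """<?xml version="1.0"?>
<eSearchResult>
  <Count>47317</Count><RetMax>10</RetMax><RetStart>0</RetStart>
  <IdList><Id>16127435</Id><Id>16127360</Id><Id>16127015</Id></IdList>
</eSearchResult>"""

url = build_esearch_url("lupus")
count, pmids = parse_esearch(SAMPLE)
print(url)
print(count, pmids)
```

Note that `Count` is the total number of matching articles, while `IdList` holds at most `retmax` identifiers.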

EFetch

URL: http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=14527253&retmode=html&rettype=medline

Result:
<Html><Title>PmFetch response</Title><Body>
<Pre>
PMID- 14527253
OWN - NLM
STAT- MEDLINE
DA  - 20031006
DCOM- 20040323
LR  - 20041117
PUBM- Print
IS  - 0278-2715
VI  - Suppl Web Exclusives
DP  - 2003 Jan-Jun
TI  - Creating consensus on coverage choices.
PG  - W3-199-211
AB  - The framework for reaching near-universal coverage outlined in this paper combines tax credits for private insurance and public program expansions. It illustrates how a series of incremental steps could be phased in to achieve near-universal coverage. Hallmarks include creation of a Congressional Health Plan; use of the income tax system to provide tax credits and enroll uninsured people; creation of a state Family Health Insurance Program open to everyone below 150 percent of poverty; and creation of a Medicare Part E, open to the disabled and uninsured older adults. The paper provides coverage and cost estimates and identifies potential sources of revenue to finance coverage.
AD  - Commonwealth Fund, New York City, New York, USA.
FAU - Davis, Karen
AU  - Davis K
FAU - Schoen, Cathy
AU  - Schoen C
LA  - eng
PT  - Journal Article
PL  - United States
TA  - Health Aff (Millwood)
JID - 8303128
SB  - IM
CIN - Health Aff (Millwood). 2003 Jan-Jun;Suppl Web Exclusives:W3-212-5. PMID: 14527254
CIN - Health Aff (Millwood). 2003 Jan-Jun;Suppl Web Exclusives:W3-216-8. PMID: 14527255
MH  - Financing, Government/legislation &amp; jurisprudence
MH  - Health Care Reform/*legislation &amp; jurisprudence
MH  - Humans
MH  - Income Tax/legislation &amp; jurisprudence
MH  - Insurance, Health/*legislation &amp; jurisprudence
MH  - Medically Uninsured
MH  - Medicare/legislation &amp; jurisprudence
MH  - Politics
MH  - Privatization/legislation &amp; jurisprudence
MH  - Program Development
MH  - Tax Exemption/*legislation &amp; jurisprudence
MH  - United States
MH  - Universal Coverage/*legislation &amp; jurisprudence
EDAT- 2003/10/07 05:00
MHDA- 2004/03/24 05:00
PST - ppublish
SO  - Health Aff (Millwood) 2003 Jan-Jun;Suppl Web Exclusives:W3-199-211.
</Pre></Body></Html>
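A record in the MEDLINE display format returned by EFetch can be split into tagged fields along the following lines. This is a minimal Python sketch and not the parsing code of pubmed.pl: the function name `parse_medline` is hypothetical, and the sample is a shortened excerpt of the record above whose abstract has been re-wrapped by hand to show a continuation line. It assumes that each field starts with a tag of up to four capital letters followed by "- ", and that continuation lines start with white space.

```python
import re
from collections import defaultdict

# A field line: tag of two to four capitals, optional padding, then "- ".
FIELD_RE = re.compile(r"^([A-Z]{2,4}) *- (.*)$")


def parse_medline(text):
    """Map each MEDLINE tag to the list of values found for it."""
    record = defaultdict(list)
    last_tag = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            last_tag = match.group(1)
            record[last_tag].append(match.group(2).strip())
        elif line.startswith(" ") and last_tag is not None:
            # Continuation of the previous field (e.g. a long abstract).
            record[last_tag][-1] += " " + line.strip()
    return dict(record)


# Shortened, hand-wrapped excerpt of the record shown above.
SAMPLE = """PMID- 14527253
TI  - Creating consensus on coverage choices.
AB  - The framework for reaching near-universal coverage outlined in this
      paper combines tax credits for private insurance and public program
      expansions.
FAU - Davis, Karen
AU  - Davis K
FAU - Schoen, Cathy
AU  - Schoen C
MH  - Humans
MH  - Medically Uninsured
"""

rec = parse_medline(SAMPLE)
print(rec["PMID"][0], rec["AU"], rec["MH"])
```

Repeatable tags such as AU and MH are kept as lists, so the title (TI) and abstract (AB) can be read as the first element of their lists.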

ELink

URL: http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10000000&db=pubmed

Result (extract):
<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList>
      <Id>10000000</Id>
    </IdList>
    <LinkSetDb>
      <DbTo>pubmed</DbTo>
      <LinkName>pubmed_pubmed</LinkName>
      <Link> <Id>10000000</Id><Score>2147483647</Score> </Link>
      <Link> <Id>9979842</Id><Score>10559862</Score> </Link>
      <Link> <Id>9994279</Id><Score>9966528</Score> </Link>
      <Link> <Id>10009023</Id><Score>9874436</Score> </Link>
      <Link> <Id>12398705</Id><Score>9812176</Score> </Link>
      <Link> <Id>10009474</Id><Score>9781612</Score> </Link>
      <Link>
      . . .
      <Link> <Id>12747441</Id><Score>6657858</Score> </Link>
      <Link> <Id>10472728</Id><Score>6651632</Score> </Link>
      <Link> <Id>11689937</Id><Score>6642576</Score> </Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>
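The related-article identifiers and scores in an ELink result can be read out as in the sketch below. It is a Python illustration under stated assumptions, not the BioQSpace implementation: the function name `parse_elink` is hypothetical, the sample is a shortened, well-formed copy of the extract above, and dropping the link that points back to the query article (the entry with score 2147483647 in the extract) is a choice made for this example only.

```python
import xml.etree.ElementTree as ET


def parse_elink(xml_text):
    """Return (query PMID, list of (related PMID, score)) from an eLinkResult."""
    root = ET.fromstring(xml_text)
    query_id = root.findtext("LinkSet/IdList/Id")
    related = []
    for link in root.findall("LinkSet/LinkSetDb/Link"):
        pmid = link.findtext("Id")
        score = int(link.findtext("Score"))
        if pmid != query_id:  # assumption: skip the article's link to itself
            related.append((pmid, score))
    return query_id, related


# Shortened copy of the extract above (three of the links).
SAMPLE = """<eLinkResult><LinkSet>
<DbFrom>pubmed</DbFrom>
<IdList><Id>10000000</Id></IdList>
<LinkSetDb><DbTo>pubmed</DbTo><LinkName>pubmed_pubmed</LinkName>
<Link><Id>10000000</Id><Score>2147483647</Score></Link>
<Link><Id>9979842</Id><Score>10559862</Score></Link>
<Link><Id>9994279</Id><Score>9966528</Score></Link>
</LinkSetDb></LinkSet></eLinkResult>"""

query_id, related = parse_elink(SAMPLE)
print(query_id, related)
```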

APPENDIX B: FILES USED BY PUBMED.PL

B.1 Lists of regular expressions used to extract attribute values from titles and abstracts. They are printed in columns here, but the actual files consist of only one column.

drug_list [T|t]herapies [T|t]herapeutic compounds{0,1} [M|m]edical compounds{0,1} [M|m]edicinal compounds{0,1} [T|t]herapeutic anti[a-z]{1,16} agents{0,1} [a-z]{0,16}prazoles{0,1} [a-z]{0,16}pamines{0,1} [a-z]{0,16}lamines{0,1} [a-z]{0,16}imines{0,1} [a-z]{0,16}piroles{0,1} [a-z]{0,16}adines{0,1}

[a-z]{0,16}udines{0,1} [a-z]{0,16}acins{0,1} [a-z]{0,16}izines{0,1} [a-z]{0,16}cagons{0,1} [a-z]{0,16}hyllines{0,1} [a-z]{0,16}razones{0,1} [a-z]{0,16}nisteins{0,1} [a-z]{0,16}benecids{0,1} [a-z]{0,16}othymines{0,1} [a-z]{0,16}osporines{0,1} [a-z]{0,16}oprines{0,1} [a-z]{0,16}mycins{0,1}

disease_list [A|a]nemia [A|a]naemia Alzheimer [A|a]nxiety [A|a]nomal[a-z]{1,3} [A|a]ngina [A|a]bnormalit[a-z]{1,3} [A|a]llerg[a-z]{1,2} [A|a]ttacks{0,1} [A|a]troph[a-z]{1,2} [A|a]sthma [A|a]sthmatics{0,1} [A|a]utoimmune [A|a]utosomal recessive [A|a]utosomal-recessive [A|a]utosomal dominant [A|a]utosomal-dominant [B|b]ruxism blood coagulation blood clotting Crohn Creutzfeldt [C|c]oronary [C|c]ancers{0,1} [C|c]arcinomas{0,1} [C|c]arcinogenesis conditions{0,1} [C|c]ongenital Cushing [C|c]hemotherapy [C|c]linical [D|d]ementia [D|d]epression Down [D|d]iabet[a-z]{1,2} [D|d]egenarat[a-z]{1,3} [D|d]iagnosis [D|d]iseases{0,1} [D|d]isorders{0,1} [D|d]ysplasia [D|d]yspepsia

[D|d]ystrophy [D|d]ysfunctions{0,1} [D|d]efects{0,1} [D|d]eficit [D|d]osage [D|d]rugs{0,1} [E|e]pilepsy [E|e]epileptic fever fibrosis failure [H|h]emophilia [H|h]aemophilia [H|h]emorrag[a-z]{1,2} [H|h]aemorrag[a-z]{1,2} [H|h]emorag[a-z]{1,2} [H|h]aemorag[a-z]{1,2} [H|h]ereditary [H|h]allucinations [H|h]ealth [H|h]ealing Huntington [I|i]schemi[a|c] [I|i]schaemi[a|c] [I|i]nfections{0,1} [I|i]nfect[a-z]{0,3} [i|I]nflammations{0,1} [I|i]nflammatory [I|i]nflammat[a-z]{1,2} [I|i]nherited [I|i]nheritable [I|i]njury [I|i]njuries [I|i]nsomnia [L|l]ymphoma [L|l]eukemia [L|l]eukaemia [M|m]alignant [M|m]alignancy [M|m]elanomas{0,1} [M|m]etastasis

[M|m]edical [M|m]igraine [M|m]yocardial infarction [N|n]eoplasms{0,1} [N|n]eoplastic [O|o]steoporosis [P|p]athogens{0,1} [P|p]athogenesis of [P|p]athogenesis [I|i]n patients with [P|p]atients{0,1} Parkinson's Parkinson [P|p]aralysis [P|p]soriasis [P|p]neumonia [P|p]rophylaxis [P|p]rophylactics{0,1} [P|p]redisposition [P|p]rognosis [R|r]adiotherapy [S|s]troke [S|s]clerosis [S|s]ickness [S|s]ick [S|s]chizophrenia [S|s]ymptoms{0,1} [S|s]yndromes{0,1} [T|t]hrombosis [T|t]uberculosis [T|t]rauma [G|g]ene therapy [T|t]herapy preventative treatment antibiotic treatment [T|t]reatments{0,1} [T|t]umorigenesis [T|t]umors{0,1} [T|t]umours{0,1} [T|t]halassaemia [T|t]halassemia

ulcer wound virulence vomiting [a-z]{0,16}carcinomas{0,1}

[a-z]{1,16}itis [a-z]{1,16}pathy [a-z]{1,16}pathies [a-z]{1,16}penia [A-Z][a-z]{1,16}itis [A-Z][a-z]{1,16}pathy

[A-Z][a-z]{1,16}pathies [A-Z][a-z]{1,16}penia [a-z]{1,16}pathic [a-z]{1,16}ergic activity

function_list abolish[a-z]{0,3} accompan[a-z]{0,4} accumulat[a-z]{0,3} acts{0,1} acted acting deactivate[a-z]{0,1} deactivating deactivation activate[a-z]{0,1} activating activation activity affinity aim[a-z]{0,3} antagoni[s|z]e[a-z]{0,1} antagoni[s|z]ing associate[a-z]{0,1} associating attenuat[a-z]{0,3} alkylat[a-z]{1,3} activators{0,1} agonists{0,1} annealing antagonists{0,1} antigens{0,1} antiporters{0,1} antireceptors{0,1} antiterminators{0,1} bending branching bind[a-z]{0,3} block[a-z]{0,2} blocking bound carboxylat[a-z]{1,3} cataly[s|z][a-z]{0,3} characteri[s|z]ed by cleave[a-z]{0,1} compartmentali[s|z]at[a-z]{0,3} compet[a-z]{0,3} contribut[a-z]{0,3} control{0,4} coordinat[a-z]{0,3} couple[a-z]{0,1} coupling channels{0,1} capping carry carrying carrie[s|d] chaperones{0,1} cleaves{0,1} co-activators{0,1} coagulations{0,1} conductors{0,1}

constituents{0,1} dock docking donates{0,1} donating decreas[a-z]{0,3} depend[a-z]{0,3} dephosphorylat[a-z]{0,3} desensiti[s|z][a-z]{1,5} determine dimeris[a-z]{0,3} dissociate[a-z]{0,1} dissociating downregulat[a-z]{0,3} down-regulat[a-z]{0,3} encoded by encod[a-z]{0,3} enhance[a-z]{0,1} enhancing expressed expression of elongat[a-z]{0,3} endocytosis energi[z|s]ers{0,1} escort{0,3} export{0,3} function[a-z]{0,3} form[a-z]{0,3} generate[a-z]{0,1} generating hydroxylat[a-z]{1,3} hetedimer[a-z]{0,7} homodimer[a-z]{0,7} implicat[a-z]{0,3} increas[a-z]{0,3} induc[a-z]{0,3} inhibition of inhibit[a-z]{0,1} inhibiting interact[a-z]{0,3} involv[a-z]{0,3} inhibitors{0,1} inhibitory initiations{0,1} ligands{0,1} ligate[s|d]{0,1} locali[z|s]ers{0,1} lyases{0,1} lytic lead[a-z]{0,3} led link[a-z]{0,3} methylat[a-z]{1,3} mediate[a-z]{0,1} modulate[a-z]{0,1} modulating mediators{0,1}

motors{0,1} oxidat[a-z]{1,3} participate[a-z]{0,1} participating phosphorylat[a-z]{0,3} plays{0,1} protect[a-z]{0,3} prevent[a-z]{0,3} produce[a-z]{0,1} producing proliferat[a-z]{0,3} promote promote[s|d] promoting polymeri[s|z]ing recogni[s|z][a-z]{1,3} reduc[a-z]{0,3} regulate[a-z]{0,1} regulating regulation of relat[a-z]{0,3} relea[s|z][a-z]{0,3} requir[a-z]{0,3} respond[a-z]{0,3} response results in resulted in result[a-z]{0,3} roles{0,1} secret[a-z]{0,3} singal{0,4} splic[a-z]{1,3} stimulate[a-z]{0,1} stimulating stimulation stop[a-z]{0,4} suppress[a-z]{0,3} switch[a-z]{0,3} synthesi[s|z][a-z]{1,3} transcript[a-z]{0,3} transduc[a-z]{1,3} transfer[a-z]{0,3} transport[a-z]{0,3} target[a-z]{0,3} trafficking transactivat[a-z]{0,3} transfect[a-z]{0,2} transfecting transfer[a-z]{0,3} translate[a-z]{0,1} translating trigger[a-z]{0,3} up-regulat[a-z]{0,3} upregulat[a-z]{0,3} uncouple[a-z]{0,1} uncoupling

structure_list 3D [d+] amino acids [A|a]ngstrom angle

box boxes covalent bonds{0,1} hydrogen bonds{0,1}

phi bonds{0,1} psi bonds{0,1} van der waals bonds{0,1} bonds{0,1}

bridges{0,1} bundles{0,1} alpha-barrel beta-barrel barrel central region chains chiral cleft conformational change conformation cores{0,1} compris{0,3} composed components{0,1} consist[a-z]{0,2} consisting covalently bound covalently linked covalent-linked covalent crystallography highly-conserved highly conserved conserved contain[a-z]{0,2} containing coils{0,1} characteri[s|z]ed by degrees heterodimer[a-z]{0,7} homodimer[a-z]{0,7} homology hydrophobic hydrophilic detergent dimers{0,1} dimeri[s|z][a-z]{1,5} distributed disulphides domains{0,1} dipolar [D|d]altons{0,1} forms{0,1} folds{0,1} folded self-folding foldings{0,1} frames{0,1} framework zinc fingers{0,1} fingers{0,1} major grooves{0,1} minor grooves{0,1}

grooves{0,1} alpha-helix alpha-helices helix helical helices hydration interchains{0,1} identity identical to isomer isomeri[s|z]ed [d+]nm [d+] nm [d+].[d+] nm [d+]A [d+]-A [d+] A [d+].[d+] A [d+]kDa [d+]kilodalton [d+]kilobases{0,1} [d+]kbp [d+] kDa [d+] kilodalton [d+] kilobases{0,1} [d+] kbp [d+]-kDa [d+]-kilodalton [d+]-kilobase{0,1} [d+]-kbp kDa kilodalton lipid bilayers{0,1} bilayers{0,1} layers{0,1} loops{0,1} linked together linked to membrane molecular mass molecular weight motifs{0,1} monomers{0,1} monomeric multidomains{0,1} multi-domains{0,1} NMR organised in organized in occupied by antiparallel anti-parallel parallel

patterns{0,1} pockets{0,1} pores{0,1} protein complex primary sequences{0,1} primary structures{0,1} quaternary repeat-regions{0,1} repeat regions{0,1} regions{0,1} residues{0,1} resolution repeats{0,1} rod-like rings{0,1} beta sheets{0,1} beta-sheets{0,1} sheets{0,1} scattering sequences{0,1} similarity size subunits{0,1} surface alpha-strands0,1} beta-strands{0,1} strands{0,1} structural structures{0,1} segments{0,1} symmetry asymmetric symmetric scaffolds{0,1} active site binding site coordination site site tail terminus termini tetramers{0,1} tetrameric tertiary tetrahedral C-terminal C terminal N-terminal N terminal terminal transmembranes{0,1} topology

location_list is found in are found in is common in are common in allocat[a-z]{0,3} derived detected in distributed in distributed along discovered in encoded in expressed in exist[a-z]{0,3} in extracellular found within found only in

found throughout found within found in found at found on colocalise[a-z]{0,1} with colocalize[a-z]{0,1} with co-localise[a-z]{0,1} with co-localize[a-z]{0,1} with colocalise[a-z]{0,1} colocalize[a-z]{0,1} co-localise[a-z]{0,1} co-localize[a-z]{0,1} contained in

intracellular inside localise[a-z]{0,1} in localize[a-z]{0,1} in localise[a-z]{0,1} at localize[a-z]{0,1} at localise[a-z]{0,1} on localize[a-z]{0,1} on localise[a-z]{0,1} localize[a-z]{0,1} locat[a-z]{0,3} observed in obtained from outside originat[a-z]{0,3} occurs{0,1} in

present in position[a-z]{0,3} in position[a-z]{0,3} at position[a-z]{0,3} recognised in recognized in subcellullar topology in plants in mammals in animals in humans in algae in fungi in bacteria

in yeast in the brain in the liver in the bowel in the pancreas in the kidney in the heart in the cerebellum in arteries in the aorta in the ileum in the intestine in the small intestine in the large intestine in the duodenum

in the cytoplasm in the cytoskeleton in the endothelium in the endothelial in the endothelia in the nucleus in the endoplasmic reticulum in the mitochondria in the mitochondrium in the vacuole in the outer part in the periplasmic
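How a one-column file of such expressions might be applied to a title or abstract is sketched below. This is an illustration in Python, not the method of pubmed.pl, which is not reproduced in this appendix: the function names `load_patterns` and `count_matches` are hypothetical, counting all matches is an assumed way of deriving a value, and the four patterns are copied from disease_list above.

```python
import re

# Four patterns copied from disease_list, one per line as in the actual
# files. Treating every line as one regular expression is an assumption.
PATTERN_FILE_TEXT = """[A|a]sthma
[C|c]ancers{0,1}
[D|d]iseases{0,1}
[a-z]{1,16}itis
"""


def load_patterns(text):
    """Compile every non-empty line as a regular expression."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]


def count_matches(patterns, text):
    """Count the matches that all patterns together produce in the text."""
    return sum(len(pattern.findall(text)) for pattern in patterns)


patterns = load_patterns(PATTERN_FILE_TEXT)
title = "Asthma and chronic bronchitis in patients with lung cancer"
print(count_matches(patterns, title))
```

Inside a character class the bar is a literal character, so `[A|a]` matches "A", "a" or "|"; the shorter class `[Aa]` would accept only the two letters.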

B.2 Extract from related articles file rel_15167971. Lines consist of PMID:score pairs.

11705672:54128223
15140465:50038799
15843060:46688894
8632154:46445710
11411616:46267554
1708303:46101814
7890729:46012060
11005628:45825644
1988561:45243645
9302078:45242854
1377078:44776853
9884076:44762947
11494368:44442335
10653166:44122862
15105791:43335631
. . .
15341182:35590820
15682931:28640352
15952243:27408332
15928825:17280546
15968139:11776961
15335893:7420826
13263611:6926708
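A file of PMID:score pairs such as the one above can be read back as follows. This is a minimal Python sketch with a hypothetical function name, `parse_related`, and is not taken from BioQSpace; tokens without a colon, such as the elision marks in the extract, are skipped.

```python
def parse_related(text):
    """Parse whitespace-separated PMID:score pairs into (pmid, score) tuples."""
    pairs = []
    for token in text.split():
        if ":" not in token:
            continue  # skip elision marks or other stray tokens
        pmid, score = token.split(":", 1)
        pairs.append((pmid, int(score)))
    return pairs


# First three pairs of the extract above, one per line.
SAMPLE = """11705672:54128223
15140465:50038799
15843060:46688894
"""

pairs = parse_related(SAMPLE)
best_pmid, best_score = max(pairs, key=lambda pair: pair[1])
print(pairs)
print(best_pmid, best_score)
```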