
University of Manchester

Third Year Project

A web-based dashboard for visualisation and curation of data extracted from tables

Ruixin Su

supervised by

Dr. Goran Nenadic

May 1, 2016

Abstract

The amount of published biomedical literature has grown exponentially during the past decades, which creates a problem for researchers who have to analyse and digest information from a huge number of scientific papers to make any breakthrough. It can be very difficult for researchers to cope with this information manually. Text mining can help by extracting information from the textual body of articles automatically. However, the extraction of data from tables is mostly ignored during text mining. Although some current table mining tools provide accurate findings, manual curation is still required to evaluate the knowledge extracted.

The project aims to build a user interface that helps potential users visualise and manually curate tables extracted from biomedical literature, so that the accuracy of the extracted data can be further improved.

A web application, TCGUI (Table Curation Graphic User Interface), was designed, implemented and evaluated using the MVC-based framework ThinkPHP along with jQuery and the Bootstrap front-end framework. TCGUI allows the user to view the extracted table data and curate misleading information. According to the final results, TCGUI is a useful tool for researchers to ensure that the tables extracted from biomedical literature are highly accurate.

Acknowledgements

I would like to thank Goran Nenadic for being a supportive supervisor who provided me with instant feedback and kept me on the right track during the entire academic year.

I would also like to thank Nikola Milosevic for his valuable input, and for introducing me to the background information and domain-specific knowledge needed.

Special thanks to my second marker, David Lester, for insightful advice from a different perspective after the presentations.

Finally, I want to thank my friends and family for the encouragement and the support throughout the whole project.


Contents

1 Introduction
    1.1 Motivation
    1.2 Aim & Objectives
    1.3 Report Structure

2 Background & Context
    2.1 Table Mining
    2.2 Data Curation
    2.3 Methodologies & Technologies
        2.3.1 Back-end Technologies
        2.3.2 Front-end Technologies
    2.4 Summary

3 Requirements
    3.1 Data Curation System
    3.2 User Stories
    3.3 Functional Requirements
    3.4 Non-functional Requirements

4 Design
    4.1 System Workflow and Architecture
        4.1.1 Back-end Design
        4.1.2 Front-end Design
    4.2 Principles of Project Design
    4.3 Choices of Languages and Tools
    4.4 Summary

5 Implementation
    5.1 Interesting Features
        5.1.1 Data Manipulation
        5.1.2 Interactive User Actions
    5.2 Challenges
    5.3 Testing
        5.3.1 Unit Testing
        5.3.2 Automated Acceptance Testing
    5.4 Summary

6 Results & Evaluation
    6.1 The Final Product
        6.1.1 Homepage/Input Page
        6.1.2 Visualisation Page
        6.1.3 Cell Manipulation Menu
    6.2 Evaluation
    6.3 Summary

7 Reflection & Conclusion
    7.1 Achievements
    7.2 Self-reflection
    7.3 Further Improvements
    7.4 Concluding Remarks

Appendix A Glossary

Appendix B Questionnaire

Chapter 1

Introduction

In recent years, the volume of published scientific research has been growing rapidly. In 2008 the NSB reported that world S&E article output grew at an average annual rate of 2.3% between 1995 and 2005, reaching 710,000 articles in 2005 [8].

Within the scientific fields, biomedicine is one that has grown especially in importance. The exponential growth can be seen in the cumulative number of citations in MEDLINE, shown in Figure 1.1. MEDLINE (the National Library of Medicine journal citation database) contains over 22 million references from 5,600 biomedical journals worldwide in about 40 languages [15]. In 2015 alone, over 800 thousand new citations were added to MEDLINE [16].

Figure 1.1: Cumulative number of Medline citations [14]


1.1 Motivation

Traditionally, researchers identified and extracted information manually. Nowadays, faced with a large number of scientific articles, researchers are able to process the textual body of the articles automatically using mature and relatively accurate text mining tools, such as IBM SPSS Modeler, SAS and STATISTICA Text Miner [13].

Meanwhile, a huge amount of factual and statistical data in scientific publications is stored in tables. In the biomedical field, the results of clinical trials, interactions between substances, drug side effects, and information about arms and patients are usually stored in tables. For example, in the PMC (PubMed Central) database, over 72% of scientific articles contain tables, and it has been calculated that there are 2.72 tables in each article on average, which supports the argument that tables are frequently used to present a significant portion of information [12].

Thus, the table data in articles is just as important as the text but is typically ignored by current data mining systems. The accuracy of current table mining systems is also limited because table data is presented in a very flexible way, which makes it difficult for a machine to model. For example, complex table layouts, such as multidimensional data, and relationships between cells are the main challenges faced during automatic extraction [12]. The extracted table data has to be highly accurate for queries run against it to be commercially useful. For example, confusion between biomedical terms may cause serious medication errors during treatment, with the potential to harm patients [7]. Hence, it is necessary to find a way to improve the current accuracy of table mining tools.

1.2 Aim & Objectives

There is an ongoing table mining project at the University of Manchester. The accuracy of the table mining method used is already as high as 85% [12]. My project is part of this system and aims to improve the accuracy further by providing an interface that enables manual data curation.


A set of objectives was then identified and refined according to the project aim:

1. To allow the user to search for existing articles in the databases by entering their IDs

2. To visualise & highlight data presented in tables

3. To allow the user to compare information and curate data through an interactive interface

4. To allow the user to reverse actions and save the current curated table

5. To trace previous versions of curated data

1.3 Report Structure

This report is structured to present how the objectives have been achieved, from conceptual design to final product. The topics covered in each chapter are as follows:

Chapter 1 covers the analysis of the problem, the motivation for the project and the initial goals for completing it successfully.

Chapter 2 will start by introducing the basic concepts of table mining and data curation. Data curation interfaces currently in use will be discussed. After that, the chapter will describe the advantages and disadvantages of the existing interfaces in order to identify the innovations of this project. It will also present the research done to acquire the technical knowledge needed throughout development.

Chapter 3 will refine a set of detailed requirements based on discussions with users and the background research.

Chapter 4 will give a high-level overview of the system and its visual design. It will state the development principles and explain the technical decisions made for the implementation stage.


Chapter 5 will show implementation details for certain features and then explain the challenges met and possible solutions to them. It will conclude by presenting the testing coverage and results.

Chapter 6 will demonstrate the final product. It will then evaluate TCGUI according to user experience, together with a general discussion with professional users.

Chapter 7 will check whether the initial requirements have been satisfied and sum up the achievements. It will then provide a self-reflection on the entire process and outline possible future improvements.


Chapter 2

Background & Context

This chapter aims to give a better understanding of the requirements by clarifying the relevant concepts. It covers table mining and data curation, and introduces existing data curation interfaces and useful technologies.

2.1 Table Mining

Text mining uses information extraction and is defined as the process of discovering and extracting knowledge from unstructured data [11, 5]. With named entity recognition, relation recognition and co-reference resolution as its main parts, information extraction is considered one of the most complex tasks in text mining and natural language processing.

Similarly, the main work of the table mining project is to build the table information extraction engine. This part of the system should be able to extract information from the source documents automatically. The workflow of the proposed table information extraction engine can be seen in Figure 2.1. In the first step, tables are decomposed into cell objects containing a cell’s value and the data from its navigational paths (header, sub-headers and stub). In the second step, cell values are normalized and semantically analysed. In the final step, inference on the table level is performed: relationships between cells are recognized and information may be extracted [12].
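The first step of that workflow, decomposing a table into cell objects that carry their navigational paths, can be sketched as follows. This is only an illustration in Python and not the engine described in [12]: the `Cell` class, the `decompose` function and the sample table are hypothetical, and the sketch assumes the simplest layout, with a single header row and the first column acting as the stub.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """Hypothetical cell object: a value plus its navigational paths."""
    value: str
    header_path: List[str] = field(default_factory=list)  # header and sub-headers above the cell
    stub_path: List[str] = field(default_factory=list)    # stub entries to the left of the cell


def decompose(header: List[str], rows: List[List[str]]) -> List[Cell]:
    """Decompose a simple table (one header row, first column as stub) into cells."""
    cells = []
    for row in rows:
        stub, *values = row
        for column_name, value in zip(header[1:], values):
            cells.append(Cell(value.strip(), [column_name], [stub]))
    return cells


# Made-up sample data, for illustration only.
cells = decompose(
    ["Characteristic", "Placebo", "Drug"],
    [["Age (years)", "54.2", "55.1"], ["Patients", "120", "118"]],
)
```

Each resulting cell can be read without the rest of the table: the value `54.2` carries the header path `["Placebo"]` and the stub path `["Age (years)"]`. Tables with sub-headers or super-rows would need longer paths than this sketch produces.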


Figure 2.1: Workflow of the table information extraction system [12]


2.2 Data Curation

Data curation is a term used to indicate management activities related to annotation and integration of data collected from various sources, such that the data is fit for contemporary purpose and remains available for discovery and reuse [9].

Several curation interfaces already exist in this area. The following are some examples:

• ODIN (Ontogene Document INspector)

As an interactive text curation system, ODIN takes documents directly from PubMed Central and processes them with an NLP pipeline to recognise entities and extract relationships, aiming to serve the needs of the curation community. Its usage has been tested within the scope of collaborations with curation groups, including PharmGKB, CTD and RegulonDB [18].

Figure 2.2: An example of an annotated document imported in ODIN panels [19]

As shown in Figure 2.2, the annotated documents are displayed clearly in multiple forms. The interface has three panels: the Inspector panel on the left, the Document panel in the middle and the Annotations panel on the right. The entities are mainly application-specific concepts and are highlighted in the Document panel, while entities and relevant relationship values can be changed inside the Annotations panel. ODIN has plenty of features other than curating entities. For example, it supports multiple views sorted by different criteria. Additionally, it includes supporting features such as login and extensive logging functionality [19]. I think its best feature is that the GUI shows the interactions among entities very clearly.

Meanwhile, there is still room for improvement. For instance, it is relatively easy for a user to inspect the interactions and then confirm or reject them, but the system still lacks a way to add completely new interactions. Also, it does not have a solution for possible overlap between new and existing term annotations. Besides, highlighting entities is not the most convenient way for users to check through every term.

• LiverCancerMarkerRIF

LiverCancerMarkerRIF is a curation interface based on biomedical text mining. Similar to ODIN, it can be used with PubMed as the source database. It enables users to retrieve biomarker-related narrations and curate evidence on biomarkers directly while browsing PubMed. As displayed in Figure 2.3, beyond the basic function of recognising biomedical concepts, what makes LiverCancerMarkerRIF different from other text curation systems is that it can embed distinct user interfaces into PubMed instead of navigating the curator away to independent websites, so the curation process can be conducted directly inside PubMed [2]. However, the use of this interface is very limited because it can only be applied to the biomedical science area. From my perspective, a truly successful curation interface should have sufficient functionality and be usable in multiple fields.


Figure 2.3: An example of a user retrieving biomarker-related narrations and curating supporting evidence on liver-cancer biomarkers directly while browsing PubMed [2]

So far, text curation interfaces are quite mature, especially when it comes to annotating concepts, but table curation interfaces need further improvement in order to handle the tables in articles and the information they contain.

Unlike the body of text, tables are densely packed with multi-dimensional and detailed information. Tables in scientific articles are structured in different forms, which may include words, numbers, formulae and visual patterns such as graphics. Besides that, tables in the scientific literature often have various elements, such as caption, cells, header and stub. Thus, visualisation and curation of complex tables are relatively challenging. At present, there are not many tools in use specifically for table curation, but there are some data cleansing interfaces that include similar functions.


• TIBCO Clarity

TIBCO Clarity is a data preparation tool that improves the accuracy and quality of data for business use by visualising, cleansing, transforming and standardising raw data collected from disparate sources.

Figure 2.4: Main TIBCO Clarity interface in TIBCO Clarity user guide

Figure 2.5: Available actions for data curation in TIBCO Clarity user guide


Figure 2.6: Search and Undo/Redo panel in TIBCO Clarity user guide

As shown in Figures 2.4, 2.5 and 2.6, the project data page is divided into two main areas: the data table and the Search and Undo panel. A list of available actions on the project data can be opened by selecting a specific row or column according to the primary key, and users are able to search through the data using the simple search engine on the right. TIBCO Clarity also comes with other useful features, such as parsing different kinds of files, uploading data from more than 10 kinds of sources and validating data. However, since it is not designed specifically for table mining, TIBCO Clarity only includes the most basic data transformation functions for curation [6]. It is therefore necessary for us to implement some new functions. For instance, instead of displaying one table at a time, our interface should be able to display two tables on one page for more convenient comparison. Otherwise, users may have to keep switching back and forth between pages if they cannot see both tables clearly at the same time. Moreover, TIBCO Clarity does not visualise the links between different cells and only allows the user to edit a whole row or column instead of an arbitrary selection of cells. These become constraints for researchers, who often need to compare across datasets, which is not possible in TIBCO Clarity.

The interfaces introduced above have their own advantages and disadvantages. Two of them are text curation GUIs and the last one is a table curation GUI. From the text curation interfaces, we can learn how to visualise the relationships between different entities and which actions are possible for entity curation, e.g. interactive operations by users. The way they implement other supporting features is also valuable. Several of these features make data curation easier, e.g. sorting, logging functionality and linking documents back to the article for better understanding. TIBCO Clarity, although it has a curation function, is not designed for a table-mining-based system and mainly focuses on table data cleansing. Thus, the possible curation actions are very limited and the data visualisation design is not convenient for comparison. But TIBCO Clarity still has positive features, e.g. data validation, support for multiple data sources and formats, and de-duplication. These are basic elements of table curation in general, not specific to table mining, and can be made use of in the new interface.

To make sure the current project builds on the advantages of preceding methods, several challenges have to be dealt with. For the visualisation part, unlike text curation tools, which highlight concept words and go through an explanation for every term, it is necessary to discard the text body and extract the tables from the input files. Moreover, the entities the author is facing here take different forms and are mostly cells instead of single strings, so the manipulations become more complicated. Another issue is that the design has to be convenient and easy enough even for researchers who have no technical knowledge at all to curate data successfully, so learnability and simplicity of the interface are required. Also, for more complex tables such as super-row tables, displaying the interactions between cells clearly would be challenging. For the curation part, the project should not restrict users to editing data in a certain way, which means it should include functions such as merging and splitting cells.


2.3 Methodologies & Technologies

To implement TCGUI successfully, a software methodology has to be followed to structure, plan and control the development process.

As one of the most efficient development methods, agile development enables users to be involved throughout the project and uses user stories with business-focused acceptance criteria to define product features. The developers have the opportunity to constantly refine and reprioritise the overall product backlog during each iteration, which significantly reduces the overall risk associated with software development. Figure 2.7 displays the differences between the agile and waterfall development processes [21].

Figure 2.7: By delivering working, tested, deployable software on an incremental basis, agile development delivers increased value, visibility, and adaptability much earlier in the lifecycle, significantly reducing project risk [21]


Based on the comparison of the curation interface examples above, the following tools and technologies were considered the most suitable for further use.

2.3.1 Back-end Technologies

• Web server solution stack

XAMPP is a cross-platform web server package which includes everything needed to set up a web server, such as the Apache server application, the MySQL database and the PHP scripting language. It is very convenient for development because XAMPP comes configured with all features turned on [20]. It uses Apache for hosting the PHP website and has a built-in MySQL installer, so the environment can be built in one single step. It also provides batch files to control the server and database engine, which makes XAMPP highly portable as well.

• MVC-based framework

An MVC-based framework separates the web application into three logical components: model, view and controller. Each component is responsible for handling one specific aspect of development, which contributes to effective management of complex applications because only one aspect/layer needs to be focused on at a time. Loose coupling makes the layers independent of each other, so changing one of them will not affect the other two. Sharing one model among multiple views also improves reusability, and the controller combines models and views to fulfil users’ needs, increasing the flexibility and configurability of the application [10]. As an easy-to-learn, lightweight MVC framework, ThinkPHP is rich in libraries and can create the directory structure automatically [22].
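The separation into model, view and controller described above can be sketched in a few lines. The sketch is written in Python purely for illustration and is not ThinkPHP code; the class names `TableModel`, `TextView` and `CurationController` are hypothetical.

```python
class TableModel:
    """Model: owns the cell data and knows nothing about presentation."""

    def __init__(self):
        self._cells = {}

    def set_cell(self, row, col, value):
        self._cells[(row, col)] = value

    def get_cell(self, row, col):
        return self._cells.get((row, col), "")


class TextView:
    """View: turns model data into output without modifying the model."""

    def render(self, model, rows, cols):
        return "\n".join(
            " | ".join(model.get_cell(r, c) for c in range(cols))
            for r in range(rows)
        )


class CurationController:
    """Controller: receives user actions and connects the model to a view."""

    def __init__(self, model, view):
        self.model = model
        self.view = view

    def edit(self, row, col, value):
        self.model.set_cell(row, col, value.strip())

    def show(self, rows, cols):
        return self.view.render(self.model, rows, cols)


controller = CurationController(TableModel(), TextView())
controller.edit(0, 0, " Age ")
controller.edit(0, 1, "54")
rendered = controller.show(1, 2)
```

Because the view only reads from the model, a second view (for example one producing HTML) could share the same `TableModel` without any change to the model or the controller logic, which is the loose coupling the paragraph refers to.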

2.3.2 Front-end Technologies

• Bootstrap

Bootstrap is a front-end framework which contains HTML- and CSS-based templates for common interface components, as well as abundant JavaScript extensions that rely on the jQuery library [4].


• jQuery

jQuery is a JavaScript library that makes it easier to work with HTML documents, events, animation effects and AJAX interactions [3].

2.4 Summary

This chapter gave an overview of table mining and data curation and compared three curation interfaces. Most current data curation interfaces focus on the textual body of the article, and those that focus on tables are very rare. The first two examples in this chapter are text curation interfaces. Although the last one, TIBCO Clarity, allows the user to modify table data, it mainly aims at data cleansing rather than data curation. TCGUI is therefore developed specifically for the purpose of table data comparison and curation. Possible technologies and methodologies for the implementation phase were also discussed. With respect to the initial objectives, the next chapter will refine the user requirements in detail.


Chapter 3

Requirements

A set of user stories and requirements was first drafted according to the main features proposed by the users. Then, considering the responsibilities of the data curation interface within the table mining system and the background research conducted, the initial requirements were refined into a more detailed living document. To ensure that TCGUI has high usability, regular discussions were held with Nikola Milosevic, a PhD student at the University of Manchester and the primary user, to update the requirements during every iteration. PhD students use data curation tools to collect the data required for their research, so direct interaction with the user helped to focus the requirements. With the goal of achieving all necessary functions and an elegant interface, the requirements were separated into functional and non-functional requirements.

3.1 Data Curation System

A typical semi-automated data curation system usually includes an information extraction engine, a data store, a data query interface and a data curation interface [12]. The engine normally processes a table in three steps: table recognition (detecting tables in papers), table functional analysis (determining the function of each cell, e.g. whether the cell is a header, stub, content, etc.) and table understanding (semantic analysis to determine the meaning of the table and its data) [12]. The basic workflow of the entire table mining and data curation system is shown in Figure 3.1.


Figure 3.1: General workflow of curation system [12]

1. Retrieved documents are sent to the table mining engine.
2. The table mining engine uses knowledge sources to extract information from the table.
3. Extracted information is stored to a data store.
4. Data curators review and correct extracted information.
5. Users may submit queries to the query interface in natural language.
6. Queries are processed and normalized using knowledge sources.
7. Using normalized queries, data store is queried.
8. Relevant information is presented to the user.

3.2 User Stories

Use cases describe the possible interactions between the actors and the system. As shown in Figure 3.2, the use case diagram summarises the main actions the user can perform to accomplish a specific goal. TCGUI has only one type of user; there are no administrators or guests.


Figure 3.2: Use cases diagram

3.3 Functional Requirements

The functional requirements are divided into several subsets that map to different user stories: Data Import (Table 3.1), Data Visualisation (Table 3.2), Data Curation (Table 3.3) and Data Export (Table 3.4). Each user story has an estimated deadline and is marked as a milestone recorded in the project logs, which helped the author keep track of progress and manage time effectively. The priority and complexity of each requirement are rated from 1 (low priority and complexity) to 5 (high priority and complexity).


| Requirement description | Priority | Complexity | Duration |
|---|---|---|---|
| Users should be able to choose an article for viewing the table data in that article, and the data retrieved should be in its most updated version | 5 | 5 | 10 |
| The article may be chosen by providing an article ID or uploading files from localhost | 4 | 5 | 5 |
| Users can also view all previous versions of an article by providing its history version ID | 3 | 3 | 2 |

Table 3.1: Data importing functionality (1 means low priority and complexity; 5 means high priority and complexity; Duration indicates the estimated days needed for the task)

| Requirement description | Priority | Complexity | Duration |
|---|---|---|---|
| Both the tables from the original papers and the ones extracted by the table mining engine should be visible to users once the interface is loaded | 5 | 4 | 7 |
| Users should be able to view tables one by one in the same order as they appear in the original paper | 5 | 3 | 5 |
| The cell type of extracted data should be shown clearly | 5 | 3 | 4 |

Table 3.2: Data visualisation functionality (1 means low priority and complexity; 5 means high priority and complexity; Duration indicates the estimated days needed for the task)

| Requirement description | Priority | Complexity | Duration |
|---|---|---|---|
| Whenever users detect any error in the table, they should be able to select the cell(s) which include(s) errors | 5 | 3 | 2 |
| Users may change either the contents or the type of that cell | 5 | 4 | 4 |
| Users may change the table structure to match it with the original data | 5 | 5 | 8 |
| Users can also add new information to the tables | 4 | 5 | 2 |
| Users should be capable of reversing any action made | 4 | 4 | 5 |
| Users can change the same type of errors in one operation | 3 | 3 | 3 |

Table 3.3: Data curation functionality (1 means low priority and complexity; 5 means high priority and complexity; Duration indicates the estimated days needed for the task)

| Requirement description | Priority | Complexity | Duration |
|---|---|---|---|
| Once the user finishes the curation, all changes should be saved and the current table is marked as the most updated version for that article. If the user re-inputs the same article, the table data should include all the changes made last time | 5 | 5 | 15 |
| Users should be able to stop the exporting process at any time, even while it is in progress | 3 | 2 | 2 |
| Users should be capable of choosing another new article at any stage of curating and exporting data | 3 | 2 | 1 |

Table 3.4: Data exporting functionality (1 means low priority and complexity; 5 means high priority and complexity; Duration indicates the estimated days needed for the task)

3.4 Non-functional Requirements

The non-functional requirements (NFRs) were refined as the principles for evaluating the application and are listed below:

• The interface is built as a web application and can be accessed universally.

• The web application can be configured and deployed on any web server.

• The design should be convenient and clean enough for users to finish training within a time limit.

• The response time of every action should be acceptable.


Chapter 4

Design

Based on the refined requirements, this chapter will introduce the design of the table curation interface. It will also give a high-level overview of the system architecture and justify the decisions made about the corresponding technologies.

4.1 System Workflow and Architecture

Figure 4.1 shows a workflow diagram giving a high-level overview of the system from the user’s perspective. The user in the diagram is a researcher, but the user can be anybody who has a basic understanding of why curation is being done and who wants to use the interface for table data curation. At present the system accepts an article ID as input. After the user enters an article ID, TCGUI imports the article’s tables from the root database, which holds all the data extracted by the table mining engine, into the local database. A local copy is created at this point and, because the data is shared with other researchers, this copy is locked until the researcher has finished working on it. The interface then visualises the extracted tables. The user can curate the data, and at this stage actions are saved or undone only inside the local database. After that, if the user finalises all the changes and decides to export them, the most updated version of the current table is uploaded to the root SQL database and the previous versions in the root database are overwritten. The process then ends.
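The import, lock, curate and export cycle described above can be sketched with in-memory dictionaries standing in for the two databases. This is a minimal illustration in Python, not the TCGUI implementation: the names `CurationWorkflow`, `ArticleLockedError`, `import_article`, `export_article` and `discard_changes`, as well as the sample article ID, are all hypothetical.

```python
import copy


class ArticleLockedError(Exception):
    """Raised when an article is already checked out by a curator."""


class CurationWorkflow:
    def __init__(self, root_db):
        self.root_db = root_db   # article ID -> tables (stand-in for the root database)
        self.local_db = {}       # article ID -> working copy (stand-in for the local database)
        self.locked = set()      # article IDs currently checked out

    def import_article(self, article_id):
        """Copy the article's tables into the local store and lock them."""
        if article_id in self.locked:
            raise ArticleLockedError(article_id)
        self.local_db[article_id] = copy.deepcopy(self.root_db[article_id])
        self.locked.add(article_id)
        return self.local_db[article_id]

    def export_article(self, article_id):
        """Overwrite the root version with the curated copy and release the lock."""
        self.root_db[article_id] = self.local_db.pop(article_id)
        self.locked.discard(article_id)

    def discard_changes(self, article_id):
        """Drop the local copy without touching the root version."""
        self.local_db.pop(article_id, None)
        self.locked.discard(article_id)


root = {"article-1": [["Age", "54"]]}
workflow = CurationWorkflow(root)
table = workflow.import_article("article-1")
table[0][1] = "55"                          # curation touches the local copy only
value_before_export = root["article-1"][0][1]
workflow.export_article("article-1")        # root now holds the curated version
```

The deep copy is what keeps edits local until export; a second `import_article` call for the same ID while the lock is held raises `ArticleLockedError`, which mirrors the locking rule in the workflow.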


Figure 4.1: System workflow diagram

Figure 4.2 below shows the system architecture from a developer’s perspective. TCGUI works as a normal web application, consisting of both a server side and a client side. The databases hold the data structures for tables and provide the raw data for information retrieval; the front end uses HTML, CSS and JavaScript for viewing data. The following sections introduce the corresponding components in more detail.


Figure 4.2: System architecture

4.1.1 Back-end Design

The system needs to read from the databases and send the data as a ‘reasonable package’ to the front end. The back end should reduce redundancy in changes and must ensure that data integrity is maintained. Local copies need to be locked until a user has finished making changes. Enforcement rules must be written so that a user who retains a local copy for a long time is reminded to return the copy to the DB pool, either by updating and accepting their changes or by rejecting them. The back end is also responsible for receiving the changed data from POST requests. The data received then has to be analysed and verified before it is actually saved into the root database and the corresponding fields are modified.
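The verification step mentioned above, checking changed data before it is saved, might look like the following. This is a hedged sketch in Python rather than the project's PHP back end: the payload fields (`row`, `col`, `content`, `type`), the function `validate_change`, the list of allowed cell types and the zero-based indexing are all assumptions made for illustration.

```python
# Assumed set of cell types; the report mentions header, stub and content.
ALLOWED_CELL_TYPES = ("header", "stub", "content")


def validate_change(change, maxrow, maxcol):
    """Return a list of problems; an empty list means the change may be saved."""
    problems = []
    for key in ("row", "col", "content", "type"):
        if key not in change:
            problems.append("missing field: " + key)
    if problems:
        return problems

    row, col = change["row"], change["col"]
    if not isinstance(row, int) or not 0 <= row < maxrow:
        problems.append("row out of range")
    if not isinstance(col, int) or not 0 <= col < maxcol:
        problems.append("col out of range")
    if not isinstance(change["content"], str):
        problems.append("content must be text")
    if change["type"] not in ALLOWED_CELL_TYPES:
        problems.append("unknown cell type")
    return problems


ok = validate_change({"row": 0, "col": 1, "content": "5", "type": "content"}, 2, 3)
bad = validate_change({"row": 5, "col": 1, "content": "5", "type": "footnote"}, 2, 3)
```

Returning a list of problems instead of raising on the first one lets the interface report every issue with a submitted change at once.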

The root database (Figure 4.3) consists of a huge amount of data and complex relationships, which is why a second database is used to leave out unnecessary information. The local database (Figure 4.4) only includes the data structures for tables, because the tables are the only things that need to be curated in the system.


Figure 4.3: Root database model

Figure 4.4: Local database model, including 4 tables: tcgui_files, tcgui_cell, tcgui_range, tcgui_merge.


tcgui_files is used to keep all the table data of the articles retrieved from the root database, and the information in its fields is used to identify the article. The filename represents the title of the article.

tcgui_range is used to keep table information, including maxcol as breadth and maxrow as length.

tcgui_cells consists of all cell data. Specifically, there are three IDs for each cell to locate it precisely: the cell ID in the article, the original article ID in the root database and the file ID in the local database. The remaining fields hold the cell location information and the cell content.

tcgui_merge contains the data used for merging, splitting and any other action that changes the table structure. 'row' and 'col' are the start points; rowspan and colspan are the combined length and width.

Because of the differences between the two databases, another task for the back-end program is to make sure the two databases are compatible.
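To make the roles of tcgui_range and tcgui_merge concrete, the sketch below treats one record of each as a plain JavaScript object. The field names row, col, rowspan, colspan, maxrow and maxcol come from the description above; the helper functions, the zero-based coordinates and the reading of maxrow and maxcol as row and column counts are assumptions.

```javascript
// Lists every (row, col) position covered by one tcgui_merge-style record.
function coveredPositions(merge) {
  const positions = [];
  for (let r = merge.row; r < merge.row + merge.rowspan; r++) {
    for (let c = merge.col; c < merge.col + merge.colspan; c++) {
      positions.push([r, c]);
    }
  }
  return positions;
}

// A merge record is only plausible if it stays inside the table bounds
// given by a tcgui_range-style record (assumed here to hold counts).
function fitsInRange(merge, range) {
  return (
    merge.row >= 0 &&
    merge.col >= 0 &&
    merge.row + merge.rowspan <= range.maxrow &&
    merge.col + merge.colspan <= range.maxcol
  );
}
```

A check of this kind is one way the back end could verify incoming structural changes before saving them.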

4.1.2 Front-end Design

The web pages are primarily responsible for providing a visual interface that allows users to read data and operate the system. This includes general operations, such as importing and exporting data, as well as operations on a particular cell. The main front-end functionality lies in processing the tables dynamically, such as modifying cell attributes and merging cells.

Figures 4.5, 4.6 and 4.7 show the visual design for the input page, curation page and backlog page of TCGUI respectively.


Figure 4.5: Homepage for importing data by submitting article ID or uploading xml files from localhost

Figure 4.6: Curation interface for displaying table data and providing possible actions on cell(s)


Figure 4.7: Backlog page for keeping track of changes in case the user wants to reverse an action

4.2 Principles of Project Design

An interactive user interface is the connection between the researchers and the back-end data. A balance between appearance and simple usability has to be achieved to implement an elegant GUI. A set of principles should be considered before starting development.

• Reliability
TCGUI needs a certain degree of fault tolerance and a pre-arranged plan for the errors and exceptions thrown, so as to ensure that it can import, visualise and curate data and save changes efficiently. If an error occurs, it should be able to point out the cause in order to prevent a system crash.

• Usability
Simple usability requires TCGUI to be well designed and to provide a user-friendly experience. For example, it should need only a minimum number of mouse clicks to finish the corresponding operations efficiently. The operating procedures should be clear enough for potential users to be trained within a time limit. Also, most interfaces come with hints and explanations for specific functions to make the system comprehensible and improve operability.

• Maintainability
TCGUI is required to save the changes made and record the exceptions raised while running. The author will then be able to make the necessary adjustments to the system according to detailed backlogs.

The principles above are considered the most relevant ones for TCGUI and will be referred to during implementation.

4.3 Choices of Languages and Tools

Technical decisions were made based on an analysis of available technologies. The following paragraphs introduce how the author uses the related technologies in the implementation of TCGUI. The functionality is similar, to some extent, to that of a table editor (such as Office Excel). However, in order to achieve 'online editing' and database storage, HTML and JavaScript have to be used to interact with the web.

• TCGUI uses Bootstrap as its front-end interface framework, which eliminates the need for colour and shape design. The tables are written in HTML. TCGUI uses ThinkPHP to reduce the effort of database access, and ThinkPHP is responsible for the controller and model layers.

• Since the web pages can change dynamically, JavaScript is needed to control the pages. JavaScript is therefore used for the view layer, with the support of jQuery to manipulate the DOM more efficiently. At the same time, the system uses Ajax to communicate with the server.

• To make use of all these features, the programming language has to be chosen carefully. PHP is convenient for implementing web services. PHP code runs in its own memory space, which gives TCGUI a fast page loading speed. PHP can also interact with many different database systems, including MySQL [23].


• The database plays a storage role in this system: it is responsible for saving the model data through serialisation and retrieving it through deserialisation. Since the information does not contain much business logic, TCGUI does not need a very complex database or complicated table structures for saving the object data. The author therefore preferred an easy-to-operate and relatively common database system. After comparison and research, MySQL was chosen as the database platform. As a lightweight database, its operations are simple and reliable and it is easy to install. It is also open source. For an academic project, using a free open-source tool eliminates the need for research expenditure and makes it easier to integrate with other open-source projects.

4.4 Summary

This chapter illustrated the high-level architecture to be used for the implementation of the curation interface, and presented the basic workflow of the system and a justification of the technical choices. An appropriate design for satisfying the requirements outlined earlier was presented from two aspects: front-end and back-end design. Chapter 5 will describe the implementation phase, with details of interesting features and the main challenge.


Chapter 5

Implementation

The implementation of the web application went through several iterations. This chapter does not intend to be a comprehensive overview of them all; rather, it presents a set of implementation details that were challenging or interesting to the author. It then gives the details of testing. The system is implemented in three main parts: the back-end PHP controller, the visual pages and the JavaScript dynamic processing module at the front end.

5.1 Interesting Features

5.1.1 Data Manipulation

Most data manipulation functions are inside the controller layer. The 'Importsql' function acquires 'articleid' from the importsql.html interface and imports all the data associated with that 'articleid'. The 'Sql_update' function uses POST to obtain data from the sql_update.html interface and applies the changes triggered by different 'events'. These 'events' are generated by the front-end IO module (documanager.js), which converts each user action into a fixed data structure, making it more convenient to interact with the server.

For real-world usage, all the changes have to be saved into the root database. When users finalise the changes and click the export button, the modified data is exported from the local database to the root database. Each cell has its own 'cellID', so a cell can be matched with the corresponding one in the root database by its unique ID. Whenever new data is added, its 'cellID' in the local database is set to 0 so that the back-end program can easily recognise it as a new cell for the root database. Raw cell data in the root database does not include merging information. Therefore, the system checks whether the incoming data has a merging information table. If it does, the system chooses the most recent merging data (by default, the record with the largest ID for the same article in the local database) and builds new tables in the root database to save it.
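The export rules described above can be summarised in a short sketch. It is not the controller's actual code (which is PHP): the function name planExport and the record shapes are hypothetical, and only the two rules from the text are modelled, namely that a cellID of 0 marks a new cell and that the merge record with the largest ID is taken as the most recent.

```javascript
// Hypothetical planning step for an export from the local to the root database.
// localCells:   [{ cellID, row, col, value }, ...]
// mergeRecords: [{ id, ... }, ...] for the same article (may be empty)
function planExport(localCells, mergeRecords) {
  // cellID 0 means the cell was added during curation and must be inserted.
  const inserts = localCells.filter((cell) => cell.cellID === 0);
  // Any other cellID identifies an existing root cell to be updated.
  const updates = localCells.filter((cell) => cell.cellID !== 0);
  // The most recent merging data is the record with the largest ID.
  const latestMerge = mergeRecords.reduce(
    (latest, rec) => (latest === null || rec.id > latest.id ? rec : latest),
    null
  );
  return { inserts, updates, latestMerge };
}
```

When the article has no merging information, latestMerge is null and no merge table would need to be written.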

5.1.2 Interactive User Actions

The system uses JavaScript extensions to implement interactive table curation.

• (Multi-)Selecting cells
When a cell receives an on-click event, the cell is selected. Cells are made responsive to this event with 'cell.onclick = function ()', which can be added in the decoration function, and 'CellSelected ()' updates the view by changing the style of that cell, for example its border width. At the same time, selectedCell is used as a global variable to record the currently selected cell. If selectedCell is not empty, indicating that another cell has already been selected, that cell is changed back to its original style first.
Selecting multiple cells depends on three events, 'mousedown', 'mouseup' and 'mouseover', which fire when the user selects cells. The global variables 'startCell' and 'endCell' record the first cell clicked by the mouse, which is the starting point, and the last cell the mouse moves to, which is the end point. Once the start and end points are known, the 'CellMultiSelect' function performs an action with a similar effect to the 'CellSelected' function above.

• Merging cells
After multiple cells are selected, 'startCell' and 'endCell' are recorded and passed as parameters to the function 'pointToRectangle' to select the matrix. This matrix has a variable that records the position of the cell at the top-left corner and another variable, named rectangle, that records the row span and column span. The system then uses the 'merge_cell' function to perform the merge, which amounts to modifying the 'rowSpan' and 'colSpan' of the cell and deleting the other, redundant cells. These two functions are integrated into a no-argument 'mergeBySelected' function. This function directly uses the global variables 'startCell' and 'endCell' to merge cells and can be called by the controller layer.
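The merge logic described in the list above can be sketched without the DOM as follows. The function names echo those in the text for readability, but the bodies are an assumed reconstruction, not the project's source: cells are plain objects with row, col, rowSpan and colSpan, the selection is assumed to contain only unmerged cells, and, unlike the no-argument original that reads global variables, this version takes its inputs as parameters so that it can be run in isolation.

```javascript
// Turns two corner cells (in any order) into a top-left position plus spans.
function pointToRectangle(startCell, endCell) {
  return {
    row: Math.min(startCell.row, endCell.row),
    col: Math.min(startCell.col, endCell.col),
    rowSpan: Math.abs(startCell.row - endCell.row) + 1,
    colSpan: Math.abs(startCell.col - endCell.col) + 1,
  };
}

// Keeps the top-left cell, enlarges its spans and drops the other cells
// inside the rectangle. Cells outside the rectangle are left untouched.
function mergeCell(cells, rect) {
  const inside = (cell) =>
    cell.row >= rect.row &&
    cell.row < rect.row + rect.rowSpan &&
    cell.col >= rect.col &&
    cell.col < rect.col + rect.colSpan;
  const isAnchor = (cell) => cell.row === rect.row && cell.col === rect.col;
  return cells
    .filter((cell) => !inside(cell) || isAnchor(cell))
    .map((cell) =>
      isAnchor(cell)
        ? { ...cell, rowSpan: rect.rowSpan, colSpan: rect.colSpan }
        : cell
    );
}

// Combines both steps, as the text describes for 'mergeBySelected'.
function mergeBySelected(cells, startCell, endCell) {
  return mergeCell(cells, pointToRectangle(startCell, endCell));
}
```

What happens to the content of the removed cells is not specified in the description above, so the sketch simply discards them.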

5.2 Challenges

The author faced a couple of challenges but managed to solve them. The main challenge was to implement the splitting functions.

In the original root database, the first problem is that a 'cell' is saved as the smallest unit in a table, which means a cell cannot be split further. Row number, column number and value are the three main attributes in the data structure for cells, which is similar to Excel. At the same time, the add-row and add-column functions are very common in table manipulation and can be combined with the merge function to achieve the same effect as a split function. Since the table layout and the total number of rows and columns are fixed and the cell is the smallest unit, which can no longer be divided, the algorithm for achieving a splitting effect is to add rows and columns of cells before or after the target cell and then merge the unnecessary cells.

Implementing the split functions is complicated because different logic is needed to deal with staggered tables caused by multiple splits. For example, TCGUI may end up in the situation shown in Figure 5.1 after several merges and splits. This can be confusing because, for example, the column numbers within the same row can differ.
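The "add a column, then merge the unnecessary cells" idea can be illustrated for the simplest case: splitting one unmerged cell into two side-by-side cells. This is a simplified sketch under stated assumptions, not the algorithm actually implemented in TCGUI: the function name is hypothetical, the new column is added after the target, the target has a colSpan of 1, and the new half copies the target's content and gets a cellID of 0 to mark it as new.

```javascript
// Splits the cell at (targetRow, targetCol) into two columns.
// cells: [{ cellID, row, col, colSpan, value }, ...]
function splitCellIntoTwoColumns(cells, targetRow, targetCol) {
  const result = [];
  for (const cell of cells) {
    const isTarget = cell.row === targetRow && cell.col === targetCol;
    if (isTarget) {
      // Keep the target and add its new right-hand half next to it.
      result.push({ ...cell });
      result.push({ ...cell, cellID: 0, col: cell.col + 1 });
    } else if (cell.col > targetCol) {
      // Cells to the right move one column further right.
      result.push({ ...cell, col: cell.col + 1 });
    } else if (cell.col + cell.colSpan - 1 >= targetCol) {
      // Cells in other rows that cover the split column absorb the new
      // column by widening: this is the "merge unnecessary cells" step.
      result.push({ ...cell, colSpan: cell.colSpan + 1 });
    } else {
      result.push({ ...cell });
    }
  }
  return result;
}
```

After the call every row still has the same total width, which is what keeps the table rectangular; repeated splits in different rows are what lead to the staggered layouts discussed next.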

Figure 5.1: A staggered table example


Meanwhile, after splitting, the second problem appears when saving the changes back to the root database: the table structure has changed, so it is not only a matter of updating a specific piece of information. Take the 2×2 table in Table 5.1 as an example.

| 2 (0, 0) | 3 (0, 1) |
| 5 (1, 0) | 6 (1, 1) |

Table 5.1: An example table; the information inside the brackets gives the row and the column of the cell

After 2 splits of Table 5.1 (shown in Table 5.2), the cell data has to be updated to save the changes, which is the main concern.

| 1 (0, 0) | 2 (0, 1) | 3 (0, 2) |
| 4 (1, 0) | 5 (1, 1) | 6 (1, 2) |

Table 5.2: Table 5.1 after applying split function

Several approaches were compared to find the most efficient implementation. The most convenient way is to change the content of the cell at the same position, as shown in Table 5.3. For example, the original cell 3 at position (0,1) in Table 5.1 becomes the new cell 2 at the same position in Table 5.2. However, this approach has a drawback. Since the value at the same cell location has changed, the cell type and all the data related to that cell, such as annotations and relationships, have to be changed as well, because the cell becomes a completely different one from the user's point of view.

| ID | Row | Column | Value (Old->New) |
|---|---|---|---|
| 1 | 0 | 0 | 2->1 |
| 2 | 0 | 1 | 3->2 |
| 3 | 1 | 0 | 5->4 |
| 4 | 1 | 1 | 6->5 |
| 5 (new) | 0 | 2 | N/A->3 |
| 6 (new) | 1 | 2 | N/A->6 |

Table 5.3: First possible way of changing cell data from that of Table 5.1 to Table 5.2


Another way is to change the position information of the cells instead of their content. The cell data is then updated as in Table 5.4. With this approach, the only thing that needs to be updated in the cell data table is the position information, which is more efficient than the first approach. However, the original position information can no longer be used to locate cells, which means extra tables are needed to map the original coordinates to the new coordinates of the same cell. The mapping tables rely on the uniqueness of the cell ID.

| ID | Row | Column (Old->New) | Value |
|---|---|---|---|
| 1 | 0 | 0->1 | 2 |
| 2 | 0 | 1->2 | 3 |
| 3 | 1 | 0->1 | 5 |
| 4 | 1 | 1->2 | 6 |
| 5 (new) | 0 | 0 | 1 |
| 6 (new) | 1 | 0 | 4 |

Table 5.4: Second possible way of changing cell data from that of Table 5.1 to Table 5.2

After careful consideration and comparison, the second approach was chosen and the split function was implemented end to end, from algorithm design to data exporting.
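The second approach can be reproduced on the data of Table 5.1 with the sketch below. It is an illustration only: the function name insertColumnAtFront, the shape of the mapping and the way new IDs are numbered are assumptions chosen so that the output matches Table 5.4.

```javascript
// Shifts every existing cell one column to the right, records the
// old -> new coordinates per cell ID, and adds new cells in column 0.
// cells:     [{ id, row, col, value }, ...]
// newValues: one value per row for the new front column
function insertColumnAtFront(cells, newValues) {
  const mapping = new Map();
  const shifted = cells.map((cell) => {
    mapping.set(cell.id, {
      from: [cell.row, cell.col],
      to: [cell.row, cell.col + 1],
    });
    return { ...cell, col: cell.col + 1 }; // content and ID are untouched
  });
  // New cells continue the ID sequence (an assumption for this example).
  let nextId = Math.max(0, ...cells.map((cell) => cell.id)) + 1;
  const added = newValues.map((value, row) => ({
    id: nextId++,
    row,
    col: 0,
    value,
  }));
  return { cells: shifted.concat(added), mapping };
}
```

Because each existing cell keeps its ID and value, annotations and relationships attached to the cell stay valid; only the coordinates change, and the mapping keyed by cell ID records where each cell used to be.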

5.3 Testing

The project was conducted in an Agile working style, and manual testing was performed progressively throughout all iterations so that possible bugs were exposed early in development. Then, during the testing phase, unit tests were performed for the JavaScript and PHP classes and automated acceptance tests were performed for different user stories.

5.3.1 Unit Testing

Unit tests were the key part of testing individual classes or functions, and helped the author locate bugs quickly. A PHP unit testing framework named PHPUnit was used for data retrieval tests [1]. Every function inside the controller class was tested by passing in certain test dataSets/dataTables.


For example, the tests extended the DbUnit extension of PHPUnit to create a database connection with the MySQL server. Then, the local XML test files were passed as arguments to the createMySQLXMLDataSet() method to create tabular datasets. After the connection tests passed, the datasets were used to test the import_sql() function. The files are in the XML format generated by the mysqldump utility. Part of the results is shown in Figure 5.2.

Figure 5.2: Part of test results for Controller Layer

Meanwhile, a JavaScript unit testing framework named QUnit was used for the table data manipulations. It is very similar to JUnit [17]. First, a QUnit test suite was created, and then a set of assertions was run for each method.


5.3.2 Automated Acceptance Testing

A set of test cases was drafted for acceptance tests based on user requirements (Table 5.5). Selenium WebDriver together with the FitNesse framework was used for automated acceptance tests that exercise the user interface [24]. Java fixture code and FitNesse scripts were written for inputting data and performing basic user actions.

5.4 Summary

Chapter 5 described some interesting features and the main challenge met throughout the implementation phase; a more detailed explanation can be found in Appendix B. A detailed acceptance test plan and part of the unit testing results were provided. Chapter 6 will discuss the final result, and TCGUI will be demonstrated with screenshots.


| Test | Descriptions | Inputs | Expected Results | Outcome |
|---|---|---|---|---|
| Import data | The web driver gives several numbers as inputs respectively for importing data from the SQL root database. | Non-existent article No. 50000; existing article No. 7 | The loading time is no more than 5 seconds; for a non-existent article, it should return a blank page and remind the user; for an existing article, it should visualise the data correctly | Passed |
| Check buttons | Use the web driver to click the previous/next/choose files buttons respectively | For the previous and next buttons, take suitable table data as inputs; for the choose files button, take the homepage URL as input | For the previous/next buttons, check if the newly loaded data is visualised the same as the inputs; for the choose files button, check if the newly loaded page has the same URL as the input | Passed |
| Change function | Change cell type and content | Different combinations of cell types and random numbers | After saving changes, check if the current data is the same as the input | Passed |
| Split cell | Select a specific cell and split it to the input size. Check if the cells are filled with the same content as the target cell and check their location data | Size: 5*6; Input Cell: Article 7 Table 1 Clam cell | The number of Clam cells is 30 in total; the cells next to the original Clam cells are merged | Passed |
| Merge cell | Select a set of cells and merge it. Check if the newly merged cell includes all information and check its row span and column span | N/A | No information missing; after merging 4 cells, the row span and column span become 2 | Passed |
| Delete cell | Select a specific cell and delete the content. Check that the original cell is not empty and the cell after deletion is empty | N/A | Cell is filled with the same type but without content | Passed |
| Redo & Undo buttons | Perform a change/merge/split/add rows/delete action and undo/redo it. Check if the data is the same before/after the actions. | N/A | The data stays the same as before after applying the undo action. | Passed |
| Add rows or columns | Add new rows and check if the row added is next to the target cell | N/A | The number of rows is increased. The position information for cells below/above the newly added row is increased/decreased | Passed |
| Save to database | Save different kinds of changes to the databases | N/A | Reload the same article by importing data again and the data visualised is updated with the changes | Passed |

Table 5.5: Acceptance test cases


Chapter 6

Results & Evaluation

The aim of this chapter is to discuss whether the project satisfies the requirements through a formal evaluation. It will also demonstrate the final results visually.

6.1 The Final Product

As a result of the research and design undertaken, the data curation interface was implemented successfully. Since the curation interface is intended to be extensible to existing interfaces, in the future the ID could be substituted with other unique identifiers, such as the ISBN for a book, or a title. This will offer the user better flexibility.

6.1.1 Homepage/Input Page

As shown in Figure 6.1, the user can type an article ID in the 'Import from Origin SQL' text box to import data from the root database. For 'Read from Local SQL', the data is imported from the local database. There are two text boxes, and 'checking file' is the one the user wants to curate.


Figure 6.1: Input page of final product

6.1.2 Visualisation Page

As shown in Figure 6.2, the right table is the extracted data, visualised in different colours, and the plain table on the left is the one from the original paper. Cells in the right table are filled with different colours representing the type of each cell, and the exact meanings of the colours are listed on the left. For example, header cells and stub cells are filled with orange and pink respectively, and a different colour is used for cells with multiple role types. Since there can be more than one table in the same article, the user can browse all the tables in that article by clicking the previous and next buttons on the left. The interface tells the user when the first or final table is reached.


Figure 6.2: An example of visualisation page for displaying article No.11

6.1.3 Cell Manipulation Menu

Each cell in the right table is selectable, and several actions can be applied to it for curating data, including change, merge and split. The change and delete functions can be applied to a single cell or to multiple cells together. The change function can change the type and content of the cell(s), and the delete function empties the content. The change function menu is shown in Figure 6.3.

Figure 6.4 displays the effects of applying the merge, split and add-column functions. The two text boxes next to the split button give the number of rows and columns the user wants to split the target cell into.

Figure 6.3: An example of available actions for selected cell


(a) This figure displays the effects after applying the 'Merge' action to the two cells 'REM sleep' and '34 ± 32 min (11 ± 10%)'

(b) This figure displays the effects after applying the 'Split' and 'addCol_end' actions to the cells 'SWS' and 'REM sleep' respectively

Figure 6.4: The effects after applying particular functions. The original look of these two cells is shown in Figure 6.3

6.2 Evaluation

In order to evaluate whether the final product meets the target requirements, usability testing was performed with both general users and professional users. A deeper discussion was also held with Nikola Milosevic, the primary user of TCGUI. Questionnaires (see Appendix C) were distributed to 10 users, who performed certain tasks and rated the quality of TCGUI from 1 to 5. The task list is shown in Table 6.1 and the average rating results are displayed in Figures 6.5, 6.6 and 6.7 respectively.

Based on the feedback and the ratings provided, the tasks were straightforward and no major issue appeared during the process. For users without a professional background, TCGUI is still easy to use. The high reliability rating means that the effects of the actions performed were mostly what users expected and that the data was processed accurately. In general, users also felt that the design is clean and understandable and that the system has a certain degree of tolerance to errors caused by meaningless user actions. For example, users were satisfied that an error reminder appears if the user types a non-existent article ID. Simply put, users are willing to use TCGUI again for table curation.


| Task Descriptions | Feedback |
|---|---|
| Import data by inputting article ID | Performed successfully |
| Browse all tables in order for the same article | Performed successfully |
| Change cell(s) content | Performed successfully |
| Change cell(s) type | Performed successfully |
| Merge multiple cells | Performed successfully |
| Split cell into random numbers of smaller cells | Users were confused about the text boxes next to the split button and did not know which one gives the row number and which the column number for splitting the target cell |
| Delete content | Most performed successfully but, sometimes, the user was expecting to delete the whole row/column, not just the content inside |
| Add new row/col at the front/back | Performed successfully |
| Reverse/redo any action | Performed successfully |
| Save changes back to database | Most performed successfully but one user was confused about the steps needed for saving back |
| View curated article history and restore any previous version | Performed successfully |
| Overall experiences (Design, Efficiency, etc.) | GUI was easy to learn but still needs extra contextual hints. The design is clean and the interface loaded fast |

Table 6.1: Tasks completion results


Figure 6.5: Reaction bar chart for learnability

Figure 6.6: Reaction bar chart for reliability


Figure 6.7: Reaction bar chart for overall experience

However, the results also indicate some areas for improvement.

• Firstly, the interface needs contextual explanations for several buttons. For example, the meaning of the text boxes next to the split button is not displayed. The two text boxes could be moved below the split button and marked as row number and column number respectively. When users want to save changes back to the database, they need to click both the 'save' and 'export to sql' buttons, which is a bit confusing. The two buttons could be combined into one, labelled 'finish curation'.

• Another area to improve is that the current version of TCGUI cannot delete a whole row or column. Although this is not a hard function to implement, it was overlooked during development. When a cell is selected, the user would be able to perform delete_row or delete_column actions on the row or column of that cell. The original delete-content button can be merged into the change function (change content to empty).


• Moreover, it would be better if TCGUI included an about page covering the motivation and background of TCGUI and explanations of specific functions, which would serve users as a basic instruction guide. Accessibility and confidentiality issues might arise in the curation of sensitive data. Integrating a login function would enable TCGUI to validate the authority of users.

As the primary user of TCGUI, Nikola found that the web application fits the requirements completely and is suitable for real-world usage. The design is attractive and straightforward for people in the scientific research area.

6.3 Summary

This chapter highlighted the main functionality of the final product and evaluated the results according to user feedback. The last chapter will contain a reflection on what the author has learned during the process and the conclusion.


Chapter 7

Reflection & Conclusion

The final chapter summarises the milestones achieved against the original plan. It then concludes the whole development process and discusses possible further extensions.

7.1 Achievements

The aim of the project was to build a user interface for curating table data extracted from scientific biomedical literature, and a set of objectives for TCGUI is listed in Table 7.1.

TCGUI is implemented with an MVC-based lightweight framework and uses JavaScript to enable interactive web operation. It hides the technical details from users and is easily understandable for users with or without a professional background. TCGUI was developed iteration by iteration.

Table 7.2 lists all the requirements defined and whether the final result satisfies them. XML files were initially used as the input data but were discarded afterwards, so TCGUI now only imports data from the SQL database. The backlog was not as important as the other features and was assigned a low priority. Overall, the requirements have been met and the project is a success.


• To allow the user to search existing articles in databases
• To visualise & highlight data presented in tables
• To allow the user to compare information and curate data through an interactive interface
• To allow the user to reverse actions and save the current curated table
• To realise the project as a web application
• To utilise TCGUI as a state-of-the-art MVC-based and easy-to-use application

Table 7.1: Initial objectives


| Requirement descriptions | Outcome |
|---|---|
| Users should be able to choose an article for viewing the table data in that article, and the data retrieved should be in its most updated version | Satisfied |
| The article may be chosen by providing the article ID or uploading files from localhost | Partially satisfied |
| Users can also view all previous versions of an article by providing its history version ID | Partially satisfied |
| Both the tables from the original papers and the ones extracted by the table mining engine should be visible to users once the interface is loaded | Satisfied |
| Users should be able to view tables one by one in the same order as in the original paper | Satisfied |
| The cell type of extracted data should be shown clearly | Satisfied |
| Whenever users detect any error in the table, users should be able to select the cell(s) which include(s) errors | Satisfied |
| Users may change either the contents or the type of that cell | Satisfied |
| Users may change the table structure to match it with the original data | Satisfied |
| Users can also add new information to the tables | Satisfied |
| Users should be capable of reversing any action made | Satisfied |
| Users can change the same type of errors in one operation | Satisfied |
| Once the user finishes the curation, all kinds of changes should be saved and the current table is marked as the most updated version for that article. If the user re-inputs the same article, the table data should include all the changes made last time | Partially satisfied |
| Users should be able to stop the exporting process at any time, even while it is in progress | Not satisfied |
| Users should be capable of choosing another new article at any stage of curating and exporting data | Satisfied |
| The interface is built as a web application and can be accessed universally | Satisfied |
| The web application can be configured and deployed on any web server | Satisfied |
| The design should be convenient and clean enough for users to finish training within a time limit | Satisfied |
| The response time of every action should be acceptable | Satisfied |

Table 7.2: Initial requirements and the outcome


7.2 Self-reflection

Being involved in the full development cycle of a software project gave the author the opportunity to acquire both technical and transferable skills during the different stages, such as background research, requirement analysis, design, implementation and testing.

Through extensive background research, the author gained a deeper understanding of text mining and table mining as well as the principles of user interface design. A broad range of technologies was needed to develop TCGUI. Although the author already had 2 years' experience in HTML, CSS and MySQL, integrating those components into a coherent system required further skills and greatly emphasised the importance of design. New challenges were encountered with JavaScript, jQuery and the use of an MVC-based framework, because these were the main technical skills the author had to develop during the implementation phase. The author also took advantage of different technologies and tools for more efficient development. For instance, the Bootstrap front-end framework was used for CSS design and XAMPP was used to set up the environment.

Working on a large-scale project such as TCGUI, and being in an informative environment, encouraged innovative thinking, especially when it came to algorithm design. The author is quite proud of the algorithm for the split function because it has low complexity. Besides, the author improved her ability to reuse the merge function code already written, which reduced the space and time complexity as well.

In addition, the author developed her transferable skills, such as presentation skills during the seminar and demonstration and time management skills under agile software development. Meanwhile, the testing approach was test-driven development (TDD). Unit testing was used for single pieces of code and automated acceptance testing was used for the workflow. The author became familiar with PHPUnit and QUnit. Moreover, a log book was used throughout the whole process to keep track of the progress made.


7.3 Further Improvements

TCGUI can be considered a successfully managed and implemented application, but some further improvements are still necessary for continued development.

• Autofill
Instead of changing data manually on the TCGUI page, users would be able to use an autofill feature to fill cells with data that follows a pattern or that is based on data in other cells.

• Integration with function inputting tools
Similar to Excel, a formula could be applied to cell(s) with either constants or the values in other cells as inputs.

• Integration with visualisation tools (e.g., Google Charts, Graphviz)
This kind of integration would enable word-cloud, workflow diagram and chart generation, which would be helpful for data analysis and error detection.

7.4 Concluding Remarks

The number of published scientific papers is growing rapidly, which increases the need for comprehensive methods to extract table data automatically. The existing systems have differing capabilities, and it is necessary to create a more comprehensive system that combines the advantages and benefits of the existing interfaces. Furthermore, as this study highlights, the absence of a tool for curating table data efficiently has been its main motivation. A table curation tool is therefore one of the most essential parts of a table mining system for improving the accuracy of extracted data. This project has succeeded in supporting manual table curation and is valued for making a possible contribution to the biomedical area. TCGUI can be further developed with proper adjustments and extensions.


Bibliography

[1] Bergmann, S. (2005) PHPUnit manual. Available at: https://phpunit.de/manual/current/en/index.html (Accessed: 26 April 2016).

[2] Dai, H.-J., Wu, J.C.-Y., Lin, W.-S., Reyes, A.J.F., Syed-Abdul, S., Tsai, R.T.-H., Hsu, W.-L. and Mira Anne C. dela Rosa (2014) ‘LiverCancerMarkerRIF’, A liver cancer biomarker interactive curation system combining text mining and expert annotations, 2014.

[3] Duckett, J. (2014) JavaScript and JQuery: Interactive front-end web development. New York: John Wiley & Sons.

[4] Efron, B. and Tibshirani, R.J. (1994) An introduction to the Bootstrap, Vol. 57. Boca Raton, FL: Chapman & Hall/CRC.

[5] Hearst, M.A. (1999) ‘ACL ’99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics’, Untangling text data mining, pp. 3–10. doi: 10.3115/1034678.1034679.

[6] Inc, T.S. (2015) TIBCO clarity user’s guide. Available at: https://docs.tibco.com/pub/clarity/2.3.0/doc/html/GUID-4ECC7C91-3B2A-47A0-9BE9-FC8E8D34D970.html (Accessed: 26 April 2016).

[7] Jeetu, G. and Girish, T. (2010) ‘Prescription drug labeling medication errors: A big deal for pharmacists’, Journal of Young Pharmacists: JYP, 2(1), pp. 107–111.

[8] Larsen, P.O. and Von Ins, M. (2010) ‘The rate of growth in scientific publication and the decline in coverage provided by science citation index’, Scientometrics, 84(3), pp. 575–603. doi: 10.1007/s11192-010-0202-z.

[9] Lord, P., Macdonald, A., Lyon, L. and Giaretta, D. (2004) From data deluge to data Curation. Available at: http://www.allhands.org.uk/2004/proceedings/papers/150.pdf (Accessed: 26 April 2016).

[10] Lucassen, J.M., Maes, S.H., International and Corporation, M. (2001) Patent US6996800 - MVC (model-view-controller) based multi-modal authoring tool and development environment. Available at: http://www.google.co.uk/patents/US6996800 (Accessed: 26 April 2016).

[11] Meystre, S., Savova, G., Kipper-Schuler, K. and Hurdle, J. (2008) ‘Extracting information from textual documents in the electronic health record: A review of recent research’, Yearbook of medical informatics, 35, pp. 128–44.

[12] Milosevic, N. (2014) TABLE MINING AND DATA CURATION FROM BIOMEDICAL LITERATURE. Available at: https://www.escholar.manchester.ac.uk/api/datastream?publicationPid=uk-ac-man-scw:267619&datastreamId=FULL-TEXT.PDF (Accessed: 26 April 2016).

[13] Miner, G., Elder, J.I., Hill, T., Fast, A., Nisbet, R. and Delen, D. (2014) Practical text mining and statistical analysis for non-structured text data applications. United States: Academic Press.

[14] National Library of Medicine, U.S. (2003) MEDLINE: Number of citations to English language articles; Number of citations containing abstracts1 (as of mid - November 2015)*. Available at: http://www.nlm.nih.gov/bsd/medline_lang_distr.html (Accessed: 1 May 2016).

[15] National Library of Medicine, U.S. (2004) Fact Sheet MEDLINE. Available at: https://www.nlm.nih.gov/pubs/factsheets/medline.html (Accessed: 1 May 2016).

[16] National Library of Medicine, U.S. (2007) Citations added to MEDLINE by fiscal year. Available at: https://www.nlm.nih.gov/bsd/stats/cit_added.html (Accessed: 1 May 2016).

[17] jQuery, F. (2016) QUnit: A JavaScript unit testing framework. Available at: http://qunitjs.com/ (Accessed: 26 April 2016).

[18] Rinaldi, F. (2014) Semi-Automated semantic Annotation of the biomedical literature, 1272, pp. 473–476.

[19] Rinaldi, F., Clematide, S., Schneider, G., Romacker, M. and Vachon, T. (2010) ODIN: An advanced interface for the Curation of biomedical literature, 1, p. 1. doi: 10.1038/npre.2010.5169.1.

[20] Surhone, L.M., Timpledon, M.T. and Marseken, S.F. (2010) Xampp. VDM Publishing.

[21] VersionOne (2016) The benefits of agile software development. Available at: https://www.versionone.com/agile-101/agile-software-development-benefits/ (Accessed: 29 April 2016).

[22] Wang, J., Li, Y. and Wang, C. (2006) Research on ThinkPHP framework based on the mode of MVC – Electronic science and Technology. Available at: http://en.cnki.com.cn/Article_en/CJFDTotal-DZKK201404045.htm (Accessed: 26 April 2016).

[23] Williams, H.E. and Lane, D. (2002) Web database applications with PHP and MySQL. United States: O’Reilly Media, Inc, USA.

[24] Wowro, M. (2013) ‘Dreamteam: Selenium WebDriver and FitNesse’, WebTesting & Selenium, 18 January. Available at: http://it-kosmopolit.de/blog/2013/01/18/dreamteam-selenium-and-fitnesse/ (Accessed: 26 April 2016).


Appendix A

Glossary

Term: Definition

Cell: The basic grouping within a table. One cell usually contains only one value, word, phrase or concept. Often, cells are divided by horizontal and vertical lines.

Column: A set of vertically aligned table cells

Row: A set of horizontally aligned table cells

Header: Top-most row (or set of several top-most rows) of a table and defines what the column data are

Stub: Left-most column of the table, usually containing the list of subjects or instances to which the values in the table body apply (row header). The stub column is the only column that may not require a column head.

Sub-header: An additional dimension of the table. Sub-header row is usually placed between data rows, separating them by some dimension or concept.

Information extraction: The task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of NLP.

NLP: Natural Language Processing


Appendix B

Questionnaire
